In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

# Data 6: Histograms
* NumPy ranges
* Histograms
* Density as height

Source: [Data 8 Fall 2025 Lecture 08](https://github.com/data-8/materials-fa25/blob/main/lec/lec08/lec08.ipynb) and [Lecture 04](https://github.com/data-8/materials-fa25/blob/main/lec/lec04/lec04.ipynb)

## Ranges are sequences of consecutive numbers

In [None]:
make_array(0,1,2,3,4,5,6)

In [None]:
np.arange(7)

What are the arguments to `np.arange`? Use Notebook help (`?`):

In [None]:
np.arange?

In [None]:
np.arange(stop=7)

In [None]:
np.arange(start=0, stop=7, step=3)

In [None]:
np.arange(5, 11)     # step is optional (default 1)

In [None]:
np.arange(0, 1, 0.1)

In [None]:
np.arange(20, 0, -2)   # what do negative steps do?
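
The calls above can be collected into one sketch that uses only NumPy. It illustrates that `stop` is always excluded and that `step` can be negative or fractional. The variable names are made up for illustration.

```python
import numpy as np

# stop is excluded, so this holds 0 through 6
first_seven = np.arange(7)

# start, stop, and step can also be passed by name
by_threes = np.arange(start=0, stop=7, step=3)   # 0, 3, 6

# a negative step counts down; the stop value (0) is still excluded
countdown = np.arange(20, 0, -2)                 # 20, 18, ..., 2

# a fractional step works, but the values are floating-point approximations
tenths = np.arange(0, 1, 0.1)

print(first_seven, by_threes, countdown, tenths)
```

Because of floating-point rounding, it is usually safer to compare fractional ranges with `np.allclose` than with `==`.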

## Distributions

### Every variable has a distribution

* `Title`: title of the movie
* `Studio`: name of the studio that produced the movie
* `Gross`: domestic box office gross in dollars
* `Gross (Adjusted)`: the gross amount that would have been earned from ticket sales at 2016 prices
* `Year`: release year of the movie

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv')
top_movies.show(6)

In [None]:
studio_distribution = top_movies.group('Studio')

In [None]:
studio_distribution.show(6)

#### **Task (Review):** Visualize the distribution of studios responsible for the highest-grossing movies as of 2017.

In [None]:
studio_distribution.barh('Studio')

In [None]:
# note how `.take` is used here with `np.arange`
studio_distribution.sort('count', descending=True).take(np.arange(5)).barh('Studio')
print("Five studios are largely responsible for the highest grossing movies")
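
The group, sort, and take steps above can be mimicked in plain NumPy. This is only a rough sketch on a made-up array of studio labels (`studios` is hypothetical and is not the real data set); it shows how `np.arange(k)` selects the first `k` positions after sorting.

```python
import numpy as np

# Hypothetical studio labels, one per movie (not the real data set)
studios = np.array(['A', 'B', 'A', 'C', 'A', 'B', 'D'])

# Count how often each label appears (similar in spirit to group)
names, counts = np.unique(studios, return_counts=True)

# Positions ordered from most to least frequent (similar to a descending sort)
order = np.argsort(-counts, kind='stable')

# Keep the first two positions (similar to take(np.arange(2)))
top_two = names[order][np.arange(2)]
top_two_counts = counts[order][np.arange(2)]

print(top_two, top_two_counts)
```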

**STOP**

### Use binning for numerical distributions

In slides.

## Histograms: The Area Principle

#### **Task**: Visualize the distribution of how long the highest-grossing movies as of 2017 have been out (in years).

In [None]:
ages = 2025 - top_movies.column('Year')
ages

In [None]:
top_movies = top_movies.with_column('Age', ages)
top_movies

In [None]:
top_movies.select('Title', 'Age').show(6)

In [None]:
min(ages), max(ages)

If you want to make equally sized bins, `np.arange()` is a great tool to help you.

In [None]:
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year', density=False)
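
To see what the bars in a count histogram represent, here is a small sketch that counts values per bin with `np.histogram`. The ages are hypothetical (`sample_ages` is not taken from `top_movies`), and using `np.histogram` as a stand-in for the plot's counting is an assumption made for illustration.

```python
import numpy as np

# Hypothetical ages in years (not taken from top_movies)
sample_ages = np.array([8, 9, 12, 15, 27, 33, 48, 86])

# Eleven edges (0, 10, ..., 100) define ten bins
edges = np.arange(0, 110, 10)

# A bin such as [10, 20) includes 10 but not 20;
# np.histogram also includes the right edge in the final bin only
counts, _ = np.histogram(sample_ages, bins=edges)

num_bins = len(edges) - 1
print(num_bins, counts)
```

Note that a value larger than the last edge would not be counted at all, so it is worth checking `min(ages)` and `max(ages)` against the bin edges.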

What if we want to plot percentages?

**STOP**

## Histograms: Density

In [None]:
# default is density=True
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')
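
With density on the vertical axis, the height of a bar is the percent of values in the bin divided by the bin's width, so the areas of all bars add up to 100. A minimal sketch with hypothetical ages (`sample_ages` is made up) and the same equal-width bins:

```python
import numpy as np

# Hypothetical ages in years (not taken from top_movies)
sample_ages = np.array([8, 9, 12, 15, 27, 33, 48, 86])
edges = np.arange(0, 110, 10)

counts, _ = np.histogram(sample_ages, bins=edges)

# Area of each bar: percent of all values that fall in the bin
percents = 100 * counts / len(sample_ages)

# Height of each bar: percent per year
widths = np.diff(edges)
heights = percents / widths

# The areas (height times width) add up to 100 percent
total_area = np.sum(heights * widths)
print(heights, total_area)
```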

In [None]:
# top_movies.bin('Age', bins=np.arange(0, 110, 10)).show()

Bins do not have to be equally wide; you can also pick your own. The bins below are simply ones that we chose.


In [None]:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 101)

You can then use the `bin` table method to make a table that lists the left endpoint of each bin along with the number of observations in that bin.

In [None]:
binned_data = top_movies.bin('Age', bins = my_bins)
binned_data

**Note:** The last row of this table is not a bin! Its `bin` value (101) is the right endpoint of the final bin, which starts at 65, so its count is 0.
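
The layout described in the note can be imitated with NumPy: one row per edge, where every row but the last holds a left endpoint and its count, and the last row holds the final right endpoint with a count of 0. This is a sketch on hypothetical ages and assumes the `bin` table is laid out the way the note describes.

```python
import numpy as np

# Hypothetical ages in years (not taken from top_movies)
sample_ages = np.array([8, 9, 12, 15, 27, 33, 48, 86])
my_edges = np.array([0, 5, 10, 15, 25, 40, 65, 101])

# Eight edges give seven bins, so seven counts
counts, _ = np.histogram(sample_ages, bins=my_edges)

# Pair each edge with a count; the final edge (101) gets a placeholder 0
rows = list(zip(my_edges.tolist(), counts.tolist() + [0]))

for edge, count in rows:
    print(edge, count)
```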

##### Now, plot the histogram!

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')

**STOP**

#### **Discussion Question (1 min)**: Compare the bins $[25, 40)$ and $[40, 65)$. 

- Which one has more movies?
- Which one is more crowded?

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')

## [Practice] Bar Charts vs. Histograms

I would not worry about the technical details of the code until next week! Right now, I just want you to see the different styles of bar chart that we might use.

### Challenge tasks

#### **Task**: Find the height of the $[40,65)$ bin in the histogram above.

$$\text{height} = \frac{\text{percent}}{\text{width}}$$

Add a column containing the percent of movies in each bin (the **area** of each bar).

In [None]:
binned_data = binned_data.with_column('Percent', 100*binned_data.column('Age count')/top_movies.num_rows)

In [None]:
binned_data.show()

In [None]:
percent = binned_data.where('bin', 40).column('Percent').item(0)

In [None]:
width = 65-40
height = percent / width

In [None]:
height

#### **Task**: Find the heights of the (rest of the) bins.

$$\text{height} = \frac{\text{percent}}{\text{width}}$$

Remember that the last row in the table does not represent a bin!

In [None]:
height_table = binned_data.take(np.arange(binned_data.num_rows - 1))
height_table 

Remember `np.diff`?

In [None]:
bin_widths = np.diff(binned_data.column('bin'))

In [None]:
bin_widths
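
As a reminder, `np.diff` returns the differences between consecutive elements, so an array of 8 bin edges produces 7 widths. The sketch below uses the same edge values as `my_bins`; the names `bin_edges` and `widths` are illustrative.

```python
import numpy as np

# Same values as my_bins above
bin_edges = np.array([0, 5, 10, 15, 25, 40, 65, 101])

# Each width is the next edge minus the current edge
widths = np.diff(bin_edges)

print(widths)   # 5, 5, 5, 10, 15, 25, 36
```

The result has one fewer element than the input, which matches the number of rows left after dropping the last row of the bin table.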

In [None]:
height_table = height_table.with_column('Width', bin_widths)
height_table

In [None]:
height_table = height_table.with_column('Height',
                                        height_table.column('Percent')/height_table.column('Width'))

In [None]:
height_table

To check our work one last time, let's see if the numbers in the last column match the heights of the histogram:

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')
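
Another way to check heights computed by hand is to compare them with a library's own density calculation. The sketch below uses hypothetical ages and `np.histogram(..., density=True)`, which returns a proportion per unit; multiplying by 100 gives percent per year. Treating this as equivalent to what the plot draws is an assumption.

```python
import numpy as np

# Hypothetical ages in years (not taken from top_movies)
sample_ages = np.array([8, 9, 12, 15, 27, 33, 48, 86])
bin_edges = np.array([0, 5, 10, 15, 25, 40, 65, 101])
widths = np.diff(bin_edges)

# Heights by hand: percent in each bin divided by the bin's width
counts, _ = np.histogram(sample_ages, bins=bin_edges)
manual_heights = (100 * counts / len(sample_ages)) / widths

# Heights from NumPy: proportion per year, rescaled to percent per year
density, _ = np.histogram(sample_ages, bins=bin_edges, density=True)
library_heights = 100 * density

print(np.allclose(manual_heights, library_heights))
```

The two agree here because every hypothetical age falls inside the bins; values outside the bin range would be ignored by `np.histogram` and change the comparison.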

### Bar chart variants

#### One categorical attribute

In [None]:
cones = Table.read_table('data/cones.csv')
cones

In [None]:
flavor_table = cones.group('Flavor')
flavor_table

In [None]:
flavor_table.barh('Flavor')

#### One categorical attribute, one numerical attribute

In [None]:
cone_average_price_table = cones.drop('Color').group('Flavor', np.average)
cone_average_price_table

In [None]:
cone_average_price_table.barh('Flavor')

#### Two categorical attributes

(We will cover `pivot` in more detail next week)

In [None]:
cones_pivot_table = cones.pivot('Flavor','Color')
cones_pivot_table

In [None]:
cones_pivot_table.barh('Color')
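
For intuition about what a pivot of two categorical attributes contains, here is a rough pure-Python sketch that counts each (flavor, color) pair. The rows are hypothetical and are not the contents of `cones.csv`; the sketch assumes the pivot result has one row per color and one column per flavor.

```python
from collections import Counter

# Hypothetical (flavor, color) pairs, not the contents of cones.csv
cone_rows = [
    ('strawberry', 'pink'),
    ('chocolate', 'light brown'),
    ('chocolate', 'dark brown'),
    ('strawberry', 'pink'),
    ('chocolate', 'dark brown'),
]

# Count how many cones have each combination of flavor and color
pair_counts = Counter(cone_rows)

flavors = sorted({flavor for flavor, _ in cone_rows})
colors = sorted({color for _, color in cone_rows})

# One row per color, one column per flavor; missing pairs count as 0
cross_tab = {
    color: {flavor: pair_counts[(flavor, color)] for flavor in flavors}
    for color in colors
}

for color in colors:
    print(color, cross_tab[color])
```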