## Lecture Notes - Charts and Histograms ##

**Helpful Resource:**
- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Readings:**
- [Visualizing Numerical Distributions](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)
- [Visualizing Categorical Distributions](https://inferentialthinking.com/chapters/07/1/Visualizing_Categorical_Distributions.html)
- [Visualization](https://inferentialthinking.com/chapters/07/Visualization.html)
- [Tables](https://inferentialthinking.com/chapters/06/Tables.html)
- [Arrays](https://inferentialthinking.com/chapters/05/1/Arrays.html)
- [Data Types](https://inferentialthinking.com/chapters/04/Data_Types.html)
- [Programming in Python](http://www.inferentialthinking.com/chapters/03/programming-in-python.html)

## Table Methods ##

- Creating tables: `Table.read_table` 
- Extending tables: `Table().with_columns`
- Finding the number of rows in a table: `num_rows`
- Finding the number of columns in a table: `num_columns`
- Referring to columns: by label or by index; column indices start at 0
- Accessing data in a column: `column` takes a label or index and returns an array
- Using array methods to work with data in columns: `item`, `sum`, `min`, `max`, and so on
- Creating new tables containing some of the original columns: `select`, `drop`

## Manipulating Rows ##

- `tbl.sort(column)` sorts the rows in increasing order
- `tbl.sort(column, descending=True)` sorts the rows in decreasing order
- `tbl.take(row_numbers)` keeps the rows at the given row numbers; each row has an index, starting at 0
- `tbl.where(column, condition)` keeps all rows for which a column's value satisfies a condition, where *condition* can be a value or a predicate.  

    For example:
    
    `tbl.where(column, are.equal_to(value))` keeps all rows for which a column's value equals a particular value. A shorter form is `tbl.where(column, value)`.
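
To make the row operations above concrete, here is a minimal plain-Python sketch that mimics what `sort`, `take`, and `where` do. It is not the `datascience` implementation: the helper names (`sort_rows`, `take_rows`, `where_rows`) and the sample rows are hypothetical and exist only for illustration.

```python
# Plain-Python sketch of sort / take / where; NOT the datascience implementation.
# The rows below are made-up illustration data.
rows = [
    {"Name": "A", "Score": 3},
    {"Name": "B", "Score": 1},
    {"Name": "C", "Score": 2},
]

def sort_rows(rows, column, descending=False):
    """Return a new list of rows ordered by the values in `column`."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)

def take_rows(rows, row_numbers):
    """Keep only the rows at the given 0-based positions."""
    return [rows[i] for i in row_numbers]

def where_rows(rows, column, condition):
    """Keep rows whose `column` value satisfies `condition`.

    `condition` may be a predicate (a function returning True/False)
    or a plain value, which is treated as "equal to that value".
    """
    predicate = condition if callable(condition) else (lambda v: v == condition)
    return [row for row in rows if predicate(row[column])]

ascending = sort_rows(rows, "Score")
descending = sort_rows(rows, "Score", descending=True)
first_and_last = take_rows(rows, [0, 2])
equal_to_two = where_rows(rows, "Score", 2)
above_one = where_rows(rows, "Score", lambda v: v > 1)
```

Passing a plain value to `where_rows` behaves like the shorter `tbl.where(column, value)` form, while passing a function plays the role of a predicate such as `are.equal_to(value)`.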

In [None]:
# Just run the cell to import the required modules

from datascience import *
import numpy as np
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
%%html
<style>
table {float: left}
</style>

## Visualizing Categorical Distributions ##

Data come in different forms; some are numerical and many are not.

For example, data can be pieces of music or places on a map. They can also be categories into which you can place individuals. We call variables of this kind categorical variables.

Examples:

- Ice cream flavors
- Movie genres
- Job titles
- Types of nutrients
- College majors


## Bar Chart ##

The bar chart is a way of visualizing categorical distributions. It displays a bar for each category. The bars are equally spaced and equally wide. The length of each bar is proportional to the frequency of the corresponding category.

Table method to draw a horizontal bar chart:

`tbl.barh(categories, values)`

It takes two arguments: the first is the column label of the categories, and the second is the column label of the frequencies.

`tbl.barh(categories)`

If the table contains only 2 columns, a category and the frequency of that category, we can omit the second argument of the `tbl.barh()` method.

In [None]:
cars = Table.read_table("Cars2015_v1.csv")
cars.show(5)

### Distribution Table ###

A distribution table shows all the values of the variable along with the frequency of each one.

The `tbl.group()` method lets us count how frequently each car type appears in the table: it treats each car type as a category and collects all the rows that fall into each category.

`tbl.group()` takes as its argument the label of the column that contains the categories. It returns a table of counts of rows in each category.

Thus `tbl.group()` creates a distribution table that shows how the cars are distributed among the type categories.

`tbl.group()` lists the categories in ascending order. Since our categories are type names and therefore represented as strings, ascending order means alphabetical order.

The column of counts is always called `count`, but you can change that if you like by using `relabeled`.
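
As a rough illustration of what a distribution table contains, the sketch below counts categories in plain Python with `collections.Counter`. It only mirrors the idea behind grouping; the list of car types is made up and does not come from `Cars2015_v1.csv`.

```python
from collections import Counter

# Hypothetical car types, standing in for a 'Type' column.
types = ["Wagon", "Sedan", "Wagon", "Hatchback", "Sedan", "Wagon"]

# Count how many times each category appears.
counts = Counter(types)

# List the categories in ascending (alphabetical) order with their counts.
distribution = sorted(counts.items())

for category, count in distribution:
    print(category, count)
```

Each pair in `distribution` corresponds to one row of a distribution table: a category and the number of rows that belong to it.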

In [None]:
car_group_by_type = cars.group('Type')
car_group_by_type

In [None]:
# we can relabel `count` to be more descriptive

car_types_distribution = car_group_by_type.relabeled('count', 'Number of Cars')
car_types_distribution

In [None]:
car_types_distribution.barh('Type', 'Number of Cars')

# or, if the table contains only 2 columns (a category and
#    the frequency of the corresponding category),
#    we can omit the second argument of the tbl.barh() method

#car_types_distribution.barh('Type')

### Bar Chart vs Plot ###

The bar chart has categories on one axis and numerical quantities on the other.

The scatter plot and the line plot display two quantitative variables – the variables on both axes are quantitative.

### Order of the Bars ###

The bars can be drawn in different orders:

- ascending
- descending
- the original order of the rows in the table

In [None]:
# ascending order

car_types_distribution.sort('Number of Cars').barh('Type')

In [None]:
# descending order

car_types_distribution.sort('Number of Cars', descending=True).barh('Type')

In [None]:
cars.select('Make', 'Model')

In [None]:
#cars_group_by_make = cars.select('Make', 'Model').group('Make').relabeled('count', 'Number of Models')

# or

cars_make_model = cars.select('Make', 'Model')
cars_make_model_distribution = cars_make_model.group('Make').relabeled('count', 'Number of Models')
cars_make_model_distribution

In [None]:
cars_make_model_distribution.sort('Number of Models').barh('Make')

## Visualizing Numerical Distributions ##


### Histogram ###

A histogram is a visualization of the distribution of a quantitative variable, that is, a variable with numerical values.

`tbl.hist(column, unit, bins, group)`

It generates a histogram of the numerical values in a column. `unit` and `bins` are optional arguments, used to label the axes and group the values into intervals (bins), respectively. Bins have the form `[a, b)`, where **a** is included in the bin and **b** is not.

In [None]:
cars.show(3)

In [None]:
pound = cars.select('Model').with_columns('Adjusted Weight', cars.column('Weight')/100)

# or break it into multiple lines
#model = cars.select('Model')  # create a new table with 1 column - 'Model'
#hundred_pound = cars.column('Weight') / 100  # create an array for 'Weight' column, then divide each item in the array by 100
#pound = model.with_columns('Adjusted Weight', hundred_pound)  # append a column to the table with the array

pound

### Binning the Data ###

Binning groups values into intervals.

`tbl.bin(column_name_or_index, bins)`

It results in a two-column table that contains the number of rows in each bin. The first column lists the left endpoints of the bins, except in the last row.

Bins have the form `[a, b)`, where **a** is included in the bin and **b** is not.
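
The half-open `[a, b)` convention can be sketched in plain Python as follows. This is an illustrative assumption of how such counting could work, not the library's code: the function name `bin_counts` and the sample weights are hypothetical, and the real `tbl.bin` may treat the right endpoint of the final bin differently.

```python
from bisect import bisect_right

def bin_counts(values, edges):
    """Count values in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Index of the bin whose left endpoint is the largest edge <= v.
        i = bisect_right(edges, v) - 1
        # Values below the first edge or at/after the last edge are skipped.
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

edges = [20, 25, 30, 35]                                 # bins [20,25), [25,30), [30,35)
weights = [20.0, 24.9, 25.0, 29.5, 34.99, 35.0, 19.0]    # made-up values
counts = bin_counts(weights, edges)
print(counts)
```

Note how `25.0` is counted in `[25, 30)` rather than `[20, 25)`, because the left endpoint is included and the right endpoint is not.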

In [None]:
# check the range of the 'Adjusted Weight'

# Step 1:
# convert a table column to an array
weight = pound.column('Adjusted Weight')

# Step 2:
# check the min and max value in the array
min(weight), max(weight)

In [None]:
# specify the bin edges: 20, 25, ..., 60 (each bin is 5 wide)

bin_counts = pound.bin('Adjusted Weight', bins=np.arange(20,65,5))
bin_counts.show()

Bin 1 interval -- \[20, 25)  include 20, exclude 25

Bin 2 interval -- \[25, 30)  include 25, exclude 30

Bin 3 interval -- \[30, 35)  include 30, exclude 35

Bin 4 interval -- \[35, 40)  include 35, exclude 40

... etc


In [None]:
# specify a number of equally wide bins

pound.bin('Adjusted Weight', bins=3)

In [None]:
# no specification: by default,
#    `tbl.bin` produces 10 equally wide bins between the minimum and maximum values of the data

pound.bin('Adjusted Weight').show()

### Visualization - Histogram ###

In [None]:
pound.hist('Adjusted Weight', bins=np.arange(20,65,10), unit='Hundred Pounds')

In [None]:
pound.bin('Adjusted Weight', bins=np.arange(20,65,10))

In [None]:
uneven = make_array(20, 22, 24, 26, 28, 30, 40, 60)
pound.hist('Adjusted Weight', bins=uneven, unit='Hundred Pounds')

In [None]:
pound.bin('Adjusted Weight', bins=uneven)

### Density Scale on The Vertical Axis ###

The height of each bar is not the percent of entries in the bin. It is the percent of entries in the bin relative to the amount of space in the bin, that is, the width of the bin. That is why the height measures crowdedness or density. The vertical axis is said to be on the density scale.

In [None]:
pound.num_rows

In [None]:
percent = 41 / 110 * 100
percent

#### How to Compute the Height of each Bar in Density Scale? ####

[Density Scale](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#the-vertical-axis-density-scale) in the textbook has a great explanation.

area = width * height

height = area / width
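
The relationship above can be wrapped in a small helper. This is a minimal sketch; the function name `bar_height` is hypothetical, and the counts (41 and 52 out of 110 cars) are the ones used in the examples in these notes.

```python
def bar_height(count, total, left, right):
    """Height of a histogram bar on the density scale.

    area   = percent of entries that fall in the bin
    width  = right - left
    height = area / width
    """
    percent = count / total * 100   # area of the bar
    width = right - left            # width of the bin
    return percent / width

# Bin [40, 60) with 41 of 110 cars
h1 = bar_height(41, 110, 40, 60)
# Bin [30, 40) with 52 of 110 cars
h2 = bar_height(52, 110, 30, 40)
print(round(h1, 2), round(h2, 2))
```

Multiplying a height by its bin width gives back the percent of entries in that bin, which is the area of the bar.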

##### Example 1: #####

area: the bin \[40, 60) contains the weights of 41 cars

the 41 cars represent 37.27\% of all 110 cars

width: the bin interval, which is 60 - 40 = 20

In [None]:
height = percent / (60 - 40)
height

##### Example 2: #####

area: the bin \[30, 40) contains the weights of 52 cars

Steps:

1. find the percentage of the 52 cars out of 110 cars
1. find the width of the bin that contains the 52 cars
1. divide the percentage by the width to get the height


In [None]:
percent = 52 / 110 * 100
percent

In [None]:
height = percent / (40 - 30)
height

#### Benefits of Using Density Scale ####

The main reason for plotting density on the vertical axis is to be able to compare histograms and approximate them with smooth curves where proportions are represented by areas under the curve.

Drawing histograms on the density scale also allows us to compare histograms that are based on data sets of different sizes or have different choices of bins. For example, if 2 histograms are drawn to the density scale then areas and densities are comparable.
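
The following sketch illustrates why the density scale makes data sets of different sizes comparable. It is plain Python under stated assumptions: the function name `densities` and both data sets are made up, with the larger set being the smaller one repeated 25 times so that the two share the same shape.

```python
def densities(values, edges):
    """Density-scale heights (percent per unit) for half-open bins [a, b)."""
    total = len(values)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if left <= v < right)
        percent = count / total * 100
        heights.append(percent / (right - left))
    return heights

edges = [20, 30, 40, 60]        # unequal bin widths: 10, 10, 20
small = [21, 22, 31, 45]        # 4 made-up values
large = small * 25              # 100 values with the same distribution

small_heights = densities(small, edges)
large_heights = densities(large, edges)
print(small_heights)
print(large_heights)
```

The raw counts differ by a factor of 25, yet the heights agree, and in both cases the bar areas (height times width) add up to 100 percent.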

### Overlaid Graphs ###

In [None]:
# scatter plot - association of FuelCap with CityMPG and HwyMPG

fuel_mpg = cars.select('FuelCap', 'CityMPG', 'HwyMPG')
fuel_mpg.scatter('FuelCap')

In [None]:
rate = Table.read_table('ratings.csv')
rate

In [None]:
# bar chart - Yelp and Google rating distribution of burrito restaurants

rate.barh('Name')

In [None]:
# pick 1 out of every 10 restaurants

rate.take(np.arange(0,rate.num_rows,10)).barh('Name')

# or use multiple lines of code
#every_10th = rate.take(np.arange(0,rate.num_rows,10))
#every_10th.barh('Name')

In [None]:
# show only Yelp and Google ratings

rate.select('Name', 'Yelp', 'Google').take(np.arange(0,rate.num_rows,10)).barh('Name')

# or use multiple lines of code
#yelp_google = rate.select('Name', 'Yelp', 'Google')
#every_10_rows = yelp_google.take(np.arange(0,rate.num_rows,10))
#every_10_rows.barh('Name')

In [None]:
np.arange(0,rate.num_rows,10)