# Discussion 3

### Due Wednesday April 17, 11:59:59PM


---

## Lecture Review



### Hypothesis Testing

Hypothesis testing is a statistical method that assesses the likelihood that observed data is adequately explained by a hypothetical process.

#### Choosing a null and alternative hypothesis

1. Hypothesis tests are based on an assumption called the **null hypothesis** that is a potential explanation for how your data was generated.
    * The exact form of the null hypothesis varies from one type of test to another; it usually asserts the 'non-interesting' point-of-view.
    * If you are testing whether a group doesn't 'look the same' as the whole population, the null hypothesis states that the group looks the same as the population.
    * For instance, if you wanted to test whether the average age of voters in your home state is greater than the national average, the null hypothesis would be that there is no difference between the average ages.
  
2. The purpose of a hypothesis test is to determine whether a given observation is likely to be explained by your null hypothesis (or whether 'something more interesting' is going on).

3. If there is little evidence contradicting the null hypothesis, then the observed data could plausibly have been produced under that assumption -- that is, you don't have enough evidence to *reject the explanation given by the null hypothesis*.

4. If there *is* evidence contradicting the null hypothesis, then you might reject the null hypothesis in favor of the **alternative hypothesis**.
    * Alternative hypotheses are usually 'interesting explanations' for an observation.
    * The exact form of the alternative hypothesis will depend on the specific test you are carrying out.
    * Continuing with the voting example above, the alternative hypothesis would be that the average age of voters in your state is greater than the national average.

#### Carrying out a hypothesis test

1. Once you have the null and alternative hypotheses in hand, you choose a significance level (often denoted by the Greek letter $\alpha$). This threshold will determine when you reject the null hypothesis.
    * The significance level reflects how certain you want to be that you are not making an incorrect conclusion by rejecting the null hypothesis.
    * A choice of $\alpha = 0.05$ means you are comfortable incorrectly rejecting the null hypothesis (that is, rejecting it when it is actually true) about 1 in 20 times.
2. Generate data under the null hypothesis using simulation; doing this is "seeing what the world looks like" when created by the null hypothesis.
    * This simulation results in a distribution of your test-statistic as generated by the null hypothesis.
3. Look at the likelihood that the observed test-statistic was generated under the null hypothesis.
    * Sometimes this likelihood is obviously small; sometimes you need the help of a p-value.
    * The **p-value** is the probability that a result generated under the null hypothesis is *at least as extreme* as the observed test-statistic. 
    * If the p-value is less than your chosen significance level, then you should *reject the null hypothesis in favor of the alternative*.
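
The steps above can be sketched with a small simulation. This is only an illustration under made-up assumptions, not part of the original notebook: the observation is 60 heads in 100 coin flips, the null hypothesis is that the coin is fair, the alternative is that it is biased towards heads, the test statistic is the number of heads, and $\alpha = 0.05$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Assumed (made-up) observation: 60 heads in 100 flips.
n_flips = 100
observed_heads = 60
alpha = 0.05

# Step 2: simulate the test statistic (number of heads) many times
# under the null hypothesis that the coin is fair.
n_simulations = 10_000
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=n_simulations)

# Step 3: the p-value is the proportion of simulated statistics that are
# at least as extreme as the observed one (here: at least as many heads).
p_value = np.mean(simulated_heads >= observed_heads)

print(f'p-value: {p_value:.4f}')
print('reject the null' if p_value < alpha else 'fail to reject the null')
```

The direction of "at least as extreme" follows the alternative hypothesis; a two-sided alternative would instead compare distances from the expected 50 heads.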

### Groupby

* The dataframe method `df.groupby(key)` enables you to split a dataframe `df` into *groups of dataframes* divided by the values in a `key` column.
* `groupby` objects have methods that compute functions on these groups and combine the result into a tabular object indexed by the values of `key`.

#### `groupby` method

* Given a dataframe `df`, the groupby object `df.groupby(key)` is built on a dictionary:
    - keyed by the values of the `key` column.
    - values are indices of `df` corresponding to values of `key`.
* You can inspect a groupby object by getting the dataframe corresponding to a given key 
    - e.g. `df.groupby(key).get_group(k)`
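
As a rough illustration of this dictionary, the sketch below builds a tiny made-up dataframe (the name `toy` and its columns are hypothetical) and inspects the groupby object through its `groups` attribute and `get_group` method.

```python
import pandas as pd

# Hypothetical toy data, not one of the notebook's datasets.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'c'],
                    'val': [1, 2, 3, 4, 5]})
G = toy.groupby('key')

# Mapping from each value of `key` to the row labels belonging to that group.
groups = G.groups
print(groups)

# The sub-dataframe of rows where key == 'a'.
group_a = G.get_group('a')
print(group_a)
```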

#### `groupby` object methods

Suppose we have a groupby object `G = df.groupby(key)`.

* You can select a column of `G` with the usual notation `G[col]`
    - This is really telling Pandas to select that column of the corresponding group dataframes.
* You can use many of the usual dataframe methods on `G`:
    - `min, max, mean, median, count, size, describe`
    - see: http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
* `aggregate` lets you apply (collections of) functions to **single columns** of the groups of a groupby object.
    - `G.aggregate({'col1': f})` applies the function `f` to `col1` of the groups in `G`.
    - Note: using aggregate, `f` has to be a function that takes in a *series*.
* `apply` lets you apply a function to the group-dataframes.
    - `G.apply(g)` applies `g` to each dataframe `G.get_group(k)`
    - `g` must be a function that takes a *dataframe* as input.
* `transform` lets you apply a function to each group, and outputs a dataframe with the same number of rows as the input dataframe `df`.
    - To apply `G.transform(h)`, the function `h` must take in a series/dataframe and return a series/dataframe of the same length.
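
To make the difference between the three methods concrete, here is a minimal sketch on made-up data (the dataframe `toy` and its columns are hypothetical). The columns are selected explicitly before `apply` so that the function only sees `col1` and `col2`.

```python
import pandas as pd

# Hypothetical toy data: two groups, 'a' and 'b'.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                    'col1': [1, 3, 2, 4, 6],
                    'col2': [10, 20, 30, 40, 50]})
G = toy.groupby('key')

# aggregate: each function receives a single column (a Series) of a group
# and returns one value, giving one row per group.
agg_out = G.aggregate({'col1': 'mean',
                       'col2': lambda s: s.max() - s.min()})

# apply: the function receives a whole group dataframe, so it can
# combine several columns at once.
apply_out = G[['col1', 'col2']].apply(
    lambda g: g['col2'].sum() / g['col1'].sum())

# transform: the function returns a Series as long as its input, so the
# result has one row per row of the original dataframe.
transform_out = G['col1'].transform(lambda s: s - s.mean())

print(agg_out)
print(apply_out)
print(transform_out)
```

Note the shapes: `agg_out` and `apply_out` are indexed by the group keys, while `transform_out` keeps the index of `toy`.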

#### `groupby` examples

Let us now see how the grouping objects can be applied to the DataFrame object:

In [None]:
# import libraries
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
            'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

df.head()

In [None]:
grouped = df.groupby('Year')

for name,group in grouped:
    print(name)
    print(group)
    
# by default, the groupby object has the same label name as the group name

In [None]:
# inspect a groupby object by getting the dataframe corresponding to a given key (here, 2014)
grouped = df.groupby('Year')

print(grouped.get_group(2014))

In [None]:
# an aggregation function returns a single aggregated value for each group
print(grouped['Points'].agg('mean'))

In [None]:
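# a transformation on a group or a column returns an object with the same index (and size) as the object being grouped
score = lambda x: (x - x.mean()) / x.std()

# select the numeric columns; the string column 'Team' cannot be standardized
print(grouped[['Rank', 'Points']].transform(score))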

In [None]:
# filtration keeps the groups that satisfy a given criterion and returns the corresponding subset of the data
print(df.groupby('Team').filter(lambda x: len(x) >= 3))

The filter condition above returns the rows for the teams that have participated in the IPL three or more times.

---

# Plotting: Pandas, Seaborn, Matplotlib

In [None]:
# magic command for displaying plots in notebook
%matplotlib inline

In [None]:
import os

## Plotting in `pandas` is as easy as `.plot()`

* `Series.plot()` plots a column.

In [None]:
data = pd.read_csv('data.csv')

In [None]:
data.head()

In [None]:
# select a column from data
z0 = data['z0']
z0.head()

* Use a line plot to plot numeric data.
* `data.plot()` plots a line plot by default.
    - The x-axis is the index by default
    - For a dataframe, it can be specified using the keyword argument `x`.

In [None]:
# index is [0...1000]
z0.plot()

In [None]:
# set index to plot correct x-axis
z0 = data.set_index('x').loc[:, 'z0']
z0.head()

In [None]:
z0.plot()

In [None]:
# set x-axis using a keyword argument
data.plot(x='x', y='z0')

### Plotting (quantitative) empirical distributions in Pandas

* Use the key-word argument `kind`
```
kind : str
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    ...
```
* With `kind='hist'`, the plot uses 10 bins by default and shows the *count* of observations within those bins.
    - use `density=True` to return a histogram whose area is normalized to 1.
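
The arithmetic behind "area normalized to 1" can be checked without drawing anything. The sketch below uses `np.histogram` on made-up data as a stand-in for the binning step of a histogram plot; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # made-up data standing in for a column

# Counts per bin (10 bins) and the bin edges.
counts, edges = np.histogram(values, bins=10)

# Same bins, but heights rescaled so that the bars' total area is 1.
density, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

print(counts.sum())              # total number of observations
print((density * widths).sum())  # total area of the normalized histogram
```

Each density height is the bin's count divided by (number of observations × bin width), which is why the areas add up to 1.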

In [None]:
# histogram of z0 values; 
# 25 bins.
# density = normalized histogram

z0.plot(kind='hist', bins=25, density=True)

In [None]:
# kernel density estimate of the distribution
# smooth approximation of the empirical distribution

z0.plot(kind='kde')

In [None]:
z0.plot(kind='box')

### Plotting (categorical) empirical distributions in Pandas

* Create a distribution from categorical columns using `value_counts`.
* Categorical columns should use *bar charts*.
* Use the key-word argument `kind`
```
kind : str
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    ...
```


In [None]:
empdistr = data['id'].value_counts(normalize=True)
empdistr

In [None]:
# nominal column
empdistr.plot(kind='bar')

In [None]:
# ordinal column: the x-axis has a meaningful order
empdistr.sort_index().plot(kind='bar')

In [None]:
# horizontal bar chart
empdistr.sort_index().plot(kind='barh')

### Plotting `pandas` DataFrames
* `DataFrame.plot()` plots the columns of a dataframe.
* Want multiple plots on the same axes? Get the data into the columns of a dataframe!

In [None]:
data.set_index('x').head()

In [None]:
# plot columns 'z0' and 'z1' with 'x' used as the x-axis
data.set_index('x')[['z0', 'z1']].plot()

In [None]:
# plot columns 'z0' and 'z1' with 'x' used as the x-axis on separate plots
data.set_index('x')[['z0', 'z1']].plot(subplots=True);

In [None]:
# plot all columns using 'x' as x-axis, elongate plots
data.set_index('x').plot(subplots=True, figsize=(12,8));

### Scatter-plots with Pandas
* You can create scatter plots with `DataFrame.plot` by passing `kind='scatter'`. A scatter plot requires numeric columns for the `x` and `y` axes.
    * These can be specified with the `x` and `y` keywords.
* To plot multiple column groups on a single axes, repeat the `plot` call, specifying the target `ax`. It is recommended to specify the `color` and `label` keywords to distinguish the groups.
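
A minimal sketch of overlaying two column groups on one axes, using random made-up data (the dataframe `pts`, the colors, and the labels are all hypothetical choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pts = pd.DataFrame(rng.random((50, 4)), columns=['a', 'b', 'c', 'd'])

# The first call creates the axes; `label` is the legend entry for this group.
ax = pts.plot(kind='scatter', x='a', y='b', color='DarkBlue', label='a vs. b')

# The second call draws onto the same axes by passing `ax=ax`.
pts.plot(kind='scatter', x='c', y='d', color='DarkGreen', label='c vs. d', ax=ax)
```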

In [None]:
data.plot(kind='scatter', x='z0', y='z1')

In [None]:
# plot all the histograms and scatterplots in one plot!
# univariate + bivariate analysis
pd.plotting.scatter_matrix(data.drop(['id', 'x'], axis=1));

In [None]:
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

df.plot(kind='scatter', x='a', y='b');

There are other keywords that can be used with scatter. The keyword `c` may be given as the name of a column to provide colors for each point:

In [None]:
samp = data.sample(100)

In [None]:
samp.plot(kind='scatter', x='z0', y='z1', c='z3', s=50);

You can pass other keywords supported by matplotlib `scatter`. The example below shows a bubble chart using a dataframe column's values as the bubble size.

In [None]:
samp.plot(kind='scatter', x='z0', y='z1', s=samp['x']);

### Seaborn: pretty plotting made easy

Installing `seaborn`: 
* `pip install seaborn==0.9`

or

* `pip install --user seaborn==0.9` (e.g. on `datahub.ucsd.edu`)

The seaborn documentation has a *great* series of tutorials: https://seaborn.pydata.org/tutorial.html


In [None]:
import seaborn as sns
sns.__version__

#### `sns.scatterplot`
* The relationship between `x` and `y` can be shown for different subsets of the data using the `hue`, `size`, and `style` parameters. 
* These parameters control what visual semantics are used to identify the different subsets. 
* It is possible to show up to three dimensions independently by using all three semantic types, but this style of plot can be hard to interpret and is often ineffective. 
    * Using redundant semantics (i.e. both `hue` and `style` for the same variable) can be helpful for making graphics more accessible.

Show a quantitative variable by using continuous colors:

In [None]:
sns.scatterplot(data=data, x='z0', y='z1', hue='id')

Also show a quantitative variable by varying the size of the points:

In [None]:
sns.scatterplot(data=data, x='z0', y='z1', size='id')

#### `sns.lmplot`

Plot a simple linear relationship between two variables:

In [None]:
# plot a line of best fit
sns.lmplot(data=data, x='z0', y='z2');

#### `sns.distplot`

Plot the distribution with a histogram, kernel density estimate, and rug plot:

In [None]:
z3 = data.sample(50)['z3']
sns.distplot(z3, hist=True, kde=True, rug=True)

#### `sns.boxplot`

Draw a vertical boxplot grouped by a categorical variable:

In [None]:
sns.boxplot(data=data, x='id', y='z2')

## Custom plots with `matplotlib`

In [None]:
import matplotlib.pyplot as plt

### Matplotlib `axes` objects and Pandas plots

* An 'Axes' object contains the elements of a single plot.
    - contains a coordinate system (axis elements), 
    - the plot elements (e.g. line, bar), 
    - labels, 
    - tick-marks, etc.
    
* A `DataFrame.plot()` method call returns an `axes` object

In [None]:
# notice the <matplotlib.axes._subplots.AxesSubplot at 0x1a21f7bcf8>
data.set_index('x')['z0'].plot()

In [None]:
# save the plot as a variable
ax = data.set_index('x')['z0'].plot()

In [None]:
# get name of x-axis
ax.get_xlabel()

In [None]:
# get y-axis tick-labels
list(ax.get_yaxis().get_majorticklabels())

In [None]:
ax = data.set_index('x')['z0'].plot()
ax.set_xlabel('hi, this is my new axis label!')
ax.set_title('hi this is my new title!');

#### You can add elements to an Axes object

* The Pandas `.plot` method can add a plot to an existing Axes object using the `ax` keyword

In [None]:
ax = data['z0'].plot()

# add z1 to Axes
data['z1'].plot(ax=ax)

# add a vertical line using matplotlib
plt.plot([40,40],[-400, 300])

# add a point using matplotlib
plt.plot(15,-200, marker='x', markersize=10, color='red')

#### You can add a scatterplot to an existing scatterplot

In [None]:
ax = data.plot(kind='scatter', x='z0', y='z1', alpha=0.3)

data.plot(kind='scatter', x='z0', y='z3', ax=ax, c='g', alpha=0.3)

### Matplotlib `figure` and adding to empty subplots

* A 'Figure' object is a top-level container for all plotting objects.
    - controls overall size, title, fonts, coordination between different elements of subplots.

<img src="https://i.stack.imgur.com/HZWkV.png" width="25%">  

* Instantiate an empty figure containing multiple plots with `plt.subplots`
    - `fig, axes = plt.subplots(R, C)` returns a figure `fig` and an array `axes` of Axes objects.
    - `axes` has `R` rows and `C` columns corresponding to the subplots laid out on a grid.
    - The `axes` are initially empty; they need to be given data to plot.
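
As a small sketch of how the returned array is laid out (the grid size and the plotted lines are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt

# A 2-by-3 grid: `axes` is a 2-D NumPy array of Axes objects.
fig, axes = plt.subplots(2, 3, figsize=(9, 4))
print(axes.shape)

# Index with [row, column] to pick one subplot.
axes[0, 1].set_title('row 0, column 1')

# `axes.flat` iterates over every subplot, row by row.
for i, ax in enumerate(axes.flat):
    ax.plot([0, 1], [0, i])

# With a single row or a single column, `axes` is 1-D instead.
fig_row, axes_row = plt.subplots(1, 2)
print(axes_row.shape)
```

This is why a one-row figure is indexed as `axes[0]`, while a grid with several rows and columns is indexed as `axes[row, col]`.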
   

In [None]:
fig, axes = plt.subplots(1, 2)

In [None]:
len(axes), type(axes), type(axes[0])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,4))

df = data.set_index('x')
df['z0'].plot(ax=axes[0], title='z0')
df['z1'].plot(ax=axes[1], title='z1')

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True)

df = data.set_index('x')
df['z0'].plot(ax=axes[0], title='z0')
df['z1'].plot(ax=axes[1], title='z1')

### Practice: plots and groupby

* Can we plot histograms of `z2` for each value of `id`?

In [None]:
data.drop('x', axis=1).groupby('id')['z2'].plot(kind='hist', alpha=0.3);

In [None]:
data.drop('x', axis=1).groupby('id')['z2'].plot(kind='hist', subplots=True);

In [None]:
data['id'].nunique()

In [None]:
grps = data.groupby('id')
for k, gp in grps:
    print('**** ' + str(k) + ' ****', grps.get_group(k).head().to_string(), sep='\n', end='\n\n')

In [None]:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)

for k, gp in data.groupby('id'):
    x_idx = k // 2
    y_idx = k % 2
    ax = axes[x_idx, y_idx]
    title = 'id = %d' % k
    gp['z2'].plot(kind='hist', density=True, ax=ax, title=title)
    
fig.suptitle('Distribution of z2 by id-number');


**Question (Optional)**: Can you plot the histograms of each column by `id`? Each row should contain the histograms by `id` of a single variable (there should be 3 rows and 4 columns). Write this generally enough to handle an arbitrary number of variables and values of `id`.

### Practice problems

* Below is a dataset in the seaborn package that contains data on restaurant bills and (service) tips.
* Try to understand the dataset via plotting using the examples in the notebook.
    - Plot histograms and boxplots for quantitative columns
    - Plot counts of categorical values using bar plots
    - Plot a scatter plot of `tip` vs `total_bill` -- is the relationship linear?

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

**Question 1**

Plot the counts of meals in `tips` by day. Your plotting function, `plot_meal_by_day`, should return a `matplotlib.axes._subplots.AxesSubplot` object; your plot should look like the plot below.

<img src="imgs/barh.png" width="50%"/>

**Question 2**

Plot a seaborn scatterplot using the `tips` data by day. Your plotting function, `plot_bill_by_tip`, should return a `matplotlib.axes._subplots.AxesSubplot` object; your plot should look like the plot below.
* `tip` is on the x-axis.
* `total_bill` is on the y-axis.
* The color of the dots is given by `day`.
* The size of the dots is given by the `size` of the table.

<img src="imgs/scatter.png" width="50%"/>

**Question 3**

Plot a figure with two subplots side-by-side. The left plot should contain the **counts** of tips *as a percentage of the total bill*. The right plot should contain the **density plot** of tips as a percentage of the total bill. Your plotting function, `plot_tip_percentages`, should return a `matplotlib.figure.Figure` object; your plot should look like the plot below.

<img src="imgs/hist.png" width="50%"/>