###### Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2017 L.A. Barba, N.C. Clementi

# Seeing stats in a new light

Welcome to the second lesson in "Take off with stats," Module 2 of our course in _Engineering Computations_. In the previous lesson, [Cheers! Stats with Beers](http://go.gwu.edu/engcomp2lesson1), we did some exploratory data analysis with a data set of canned craft beers in the US. We'll continue using that same data set here, but with a new focus on _visualizing statistics_.

In her lecture ["Looking at Data"](https://youtu.be/QYDuAo9r1xE), Prof. Kristin Sainani says that you should always plot your data. Immediately, several things can come to light: are there outliers in your data? (Outliers are data points that look abnormally far from other values in the sample.) Are there data points that don't make sense? (Errors in data entry can be spotted this way.) And especially, you want to get a _visual_ representation of how data are distributed in your sample.

In [None]:
import numpy
import pandas
from matplotlib import pyplot
%matplotlib inline

#Import rcParams to set font styles
from matplotlib import rcParams

#Set font style and size 
rcParams['font.family'] = 'serif'
rcParams['font.size'] = 16

In [None]:
beers = pandas.read_csv("../../data/beers.csv")

In [None]:
beers[0:10]

## Categorical vs. quantitative data

## Visualizing quantitative data

In [None]:
#Repeat cleaning values abv
abv_series = beers['abv']
abv_clean = abv_series.dropna()
abv = abv_clean.values

In [None]:
#Repeat cleaning values ibu
ibu_series = beers['ibu']
ibu_clean = ibu_series.dropna()
ibu = ibu_clean.values

In the previous lesson, we learned how to compute some of the quantities that give us information about our data. As you might expect, NumPy has built-in versions of them, and we will learn about some others, too.

We already know about the mean, and that we can compute it with `numpy.mean()`. But what about the variance or standard deviation?

There is a [`numpy.var()`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.var.html) that we can use, but we need to read the documentation to be certain that is exactly what we need. 


##### Exercise:

Go to the documentation of [`numpy.var()`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.var.html) and work out whether the variance it computes by default corresponds to the sample variance.

*Hint*: Check what it says about the degrees of freedom.

If you did the reading, you might have noticed that, by default, the argument `ddof` in `numpy.var()` is set to zero. With that default, NumPy divides by $N$, so we are not really calculating the sample variance. From the previous lesson we can recall that the **sample variance** is determined by:

$$
\begin{equation*}     
     var_{sample} = \frac{1}{N-1}\sum_{i} (x_i - \bar{x})^2
\end{equation*}
$$

Therefore, we need to be explicit about the division by $N-1$ when calling `numpy.var()`. How do we do that? We explicitly set `ddof` to `1`.  

For example, to compute the sample variance for our `abv` variable we do:

In [None]:
var_abv = numpy.var(abv, ddof=1)
print(var_abv)
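To see what `ddof` changes, here is a minimal sketch on a small made-up array (the values in `sample` are illustrative only and do not come from the beers data set). It compares the formula above, written out by hand, with `numpy.var()` called with `ddof=1` and with the default `ddof=0`.

```python
import numpy

# Toy values for illustration only (not from the beers data set)
sample = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(sample)

# Sample variance written out from the formula: divide by N-1
squared_dev = (sample - sample.mean())**2
var_by_hand = squared_dev.sum() / (N - 1)

# NumPy divides by N - ddof, so the default (ddof=0) divides by N
var_default = numpy.var(sample)
var_sample = numpy.var(sample, ddof=1)

print(var_by_hand, var_sample, var_default)
```

The first two numbers should agree, while the default result is smaller because it divides by $N$ instead of $N-1$.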

Now we can compute the standard deviation by taking the square root of `var_abv`:

In [None]:
std_abv = numpy.sqrt(var_abv)
print(std_abv)

You might be wondering if there is a built-in function for the standard deviation in NumPy, aren't you? We encourage you to search online and try to find one.

**Spoiler alert!!!**
You will. 

##### Exercise:

1. Read the documentation about the NumPy standard deviation function, compute the standard deviation for `abv` using this function, and check that you obtain the same value as when you take the square root of the variance computed with NumPy.

2. Compute the variance and standard deviation for the variable `ibu`.

### More stats (median and percentiles)

So far, we've learned what the mean, variance and standard deviation tell us about our data. However, these are not the only quantities that give us information. 

We still haven't explored the concept of percentiles. The best-known percentile is the 50th, also known as the **median**. The **median** is the value that separates the data in half. If the number of (sorted) data values is odd, the median is the middle value; otherwise, the median is the average of the two values in the middle.

If you went ahead, you probably already searched for a NumPy function that computes the **median**, and you might have run into [`numpy.median()`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.median.html).
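As a quick check of the definition, the sketch below applies `numpy.median()` to two small made-up arrays, one with an odd and one with an even number of values. The arrays are illustrative only and are not taken from the beers data set.

```python
import numpy

# Toy arrays for illustration only; numpy.median() does not need sorted input
odd_values = numpy.array([7, 1, 5, 3, 9])        # sorted: 1 3 5 7 9
even_values = numpy.array([7, 1, 5, 3, 9, 11])   # sorted: 1 3 5 7 9 11

median_odd = numpy.median(odd_values)     # the middle value
median_even = numpy.median(even_values)   # average of the two middle values

print(median_odd, median_even)
```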

##### Exercise:

Compute the median using NumPy, for our variables `abv` and `ibu`.

**Percentiles**

A percentile is a measure that we use in statistics to indicate the value below which a certain percentage of the observations fall.

The Nth percentile is the value for which N% of the observations are lower. 

For example, imagine that you are the second tallest person in a group of 10 people. This means that you are at the 80th percentile. Let's say your height is 1.70 meters (~ 5' 7"); then "1.70 m" is the 80th percentile. In other words, 80% of the group (8 of the 10 people) are shorter than 1.70 m.

The 25th, 50th, and 75th percentiles are called quartiles, since they divide the data into four groups. They are named the first, second (median), and third quartile, respectively.
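Here is a minimal sketch of these ideas. The ten heights are made-up values, chosen so that 1.70 m is the second tallest, and `toy` is just the integers 1 to 9. The code first counts the fraction of people shorter than 1.70 m, then computes the quartiles of `toy` with `numpy.percentile()`. Keep in mind that `numpy.percentile()` interpolates linearly between data points by default, so on small samples its result may differ slightly from the simple definition given above.

```python
import numpy

# Hypothetical heights (in meters) of a group of 10 people
heights = numpy.array([1.52, 1.55, 1.58, 1.60, 1.62,
                       1.64, 1.66, 1.68, 1.70, 1.82])

# Fraction of the group that is strictly shorter than 1.70 m
fraction_below = numpy.mean(heights < 1.70)
print(fraction_below)

# Quartiles (25th, 50th, and 75th percentiles) of a toy array
toy = numpy.arange(1, 10)    # the integers 1, 2, ..., 9
quartiles = numpy.percentile(toy, [25, 50, 75])
print(quartiles)
```

The second quartile computed this way should match `numpy.median(toy)`.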




## Visualizing categorical data

In [None]:
style_series = beers['style']

In [None]:
style_series[0:10]

In [None]:
type(style_series)

In [None]:
style_counts = style_series.value_counts()
style_counts[0:5]

In [None]:
len(style_counts)

In [None]:
style_counts[0:21].plot.barh(figsize=(10,8), color='#008367', edgecolor='gray');

In [None]:
pyplot.boxplot(abv, labels=['abv']);

In [None]:
pyplot.boxplot(ibu, labels=['ibu']);

In [None]:
beers_clean = beers.dropna()

In [None]:
#Average only the numeric columns; text columns like 'name' cannot be averaged
beers_bystyle = beers_clean.groupby('style').mean(numeric_only=True)

In [None]:
beers_clean.index[beers_clean['style'] == 'Wheat Ale'].tolist()

In [None]:
beers_clean['ibu'][1337]

In [None]:
beers_clean['style'].value_counts()

In [None]:
type(beers_clean['style'].value_counts())

In [None]:
beers_bystyle

In [None]:
beers_bystyle.plot.scatter(figsize=(10,10), 
                           x='abv', y='ibu', s=20, 
                           alpha=0.5);

## References

1. [Craft beer dataset](https://github.com/nickhould/craft-beers-dataset) by Jean-Nicholas Hould.

### Recommended viewing

From ["Statistics in Medicine,"](https://lagunita.stanford.edu/courses/Medicine/MedStats-SP/SelfPaced/about), a free course in Stanford Online by Prof. Kristin Sainani, we highly recommend that you watch this lecture:
* [Looking at data](https://youtu.be/QYDuAo9r1xE)

In [None]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.display import HTML
css_file = '../../style/custom.css'
HTML(open(css_file, "r").read())