# Soft Introduction to Descriptive Statistics with `NumPy` 

<img src="../images/numpy.jpg" alt="numpy_logo" style="width: 200px;"/>

Now that we have covered the basics of `python` data carpentry, we are going to move on. That's not to say that we will not be using it again; in fact, quite the opposite. All of the topics in these modules are meant to be built upon and perfected throughout the course. Our aim is for these notebooks to build your skills as data scientists.

Data exploration is typically the third step in most data science projects, after data acquisition and carpentry. Often we start with a question that we want to answer (or an outcome that we would like to predict). The next step is to find data that has the potential to answer that question, then load and format the data for analysis, since raw data is rarely in exactly the format and structure that the analysis needs. Data exploration is the natural next step in this pipeline.

Once your data is reshaped into a usable form you can and should do descriptive analysis. Descriptive statistics along with some simple visualizations are essential for understanding the underlying characteristics of your data.

If you start trying to answer questions before getting a clear picture of your data's characteristics you are essentially walking on a frozen lake without checking the thickness of the ice. Chances are you will fall in and get wet. Starting with data exploration is often the difference between a poor analysis and a good analysis, or a good analysis and a great analysis.

We are now going to begin diving into descriptive statistics. Don't worry; we aren't leaving data carpentry behind. It will play a big role throughout the rest of the course and our data science careers.


\* *[`intro_numpy.ipynb`](intro_numpy.ipynb) gives an overview of the `NumPy` package. This notebook is meant to give you an overview of `NumPy` within the scope of descriptive statistics.*

### The Data

In this lab we are going to be working with some data on the book saga *A Song of Ice and Fire*. This may be more familiar to you in its TV adaptation, *Game of Thrones*, or it may not be familiar to you at all. If you fall into this last category, the important thing to know about the series is that none of the characters are safe from death, even the most beloved ones (*see gif below*). This is a dataset of the characters, their ages, and whether or not they have died. Imagine we were interested in who George R. R. Martin (the author) was most likely to kill off next in his books\*. Descriptive statistics allow us to investigate some characteristics of who dies and when, which is a valuable bit of information considering our hypothetical prediction problem. Let's begin exploring our data...

To avoid **spoilers** and to keep things simple, this data set is from **2014** and was built using Wiki data. See [here](https://jordanschermer.wordpress.com/2014/08/06/valar-morghulis/) for a description. The data is entirely outdated, and the author offers this disclaimer: 'Everything I know about the Game of Thrones universe came from writing this post and the ~5 episodes I drunk-watched in college'. Transport yourself back to 2014 and explore this data with all of this in mind.

\* This is something that has actually been done before by a group of data scientists. Their website is called *A Song of Ice and Data* and their probabilistic death predictions can be found [here](https://got.show/ranking).

![GoT](../images/GoT_death.gif)

### Read in Data Set

In [1]:
import pandas as pd
import numpy as np

with open('/dsa/data/all_datasets/game-of-thrones/GoT_age_at_death.csv') as file:
    df = pd.read_csv(file)
    df.columns  = ['character', 'age', 'dead', 'gender', 'affiliation'] # change file header names
    
    # change column types
    df['dead'] = df['dead'].astype('category')
    df['gender'] = df['gender'].astype('category')
    df['affiliation'] = df['affiliation'].astype('category')

In [2]:
df.head() 

| | character | age | dead | gender | affiliation |
|---|---|---|---|---|---|
| 0 | Sandor Clegan | 29 | 1 | 1 | 4 |
| 1 | Benjen Stark | 35 | 1 | 1 | 10 |
| 2 | Syrio Forel | 41 | 1 | 1 | 1 |
| 3 | Tysha | 29 | 0 | 0 | 4 |
| 4 | Jeyne Pool | 12 | 1 | 0 | 1 |


## Descriptive Statistics

Descriptive statistics reveal the underlying characteristics of the data. Together with simple visualizations, descriptive statistics can tell us a lot about what the data looks like. But exactly how do they help us?

### Visualizing the Distribution

In much of this lab, we are going to be looking at the *distribution* of the `age` variable under certain conditions. The distribution of a variable shows us how the data are spread. One way of visualizing this is through the use of a histogram. A histogram is a univariate (meaning "single variable") plot that displays the range of values of a variable on the x-axis (horizontal axis). The range is divided into bins, and each bin is drawn as a bar whose height represents the count of values that fall within that bin's value range. Take the histogram of `df.age` below. The blue section represents a single bin between the values of 52 and 54 years of age. There are ten rows of data that fall in the bin, which you can see by going up the bin and reading the value on the y-axis.

There are other ways to visualize the distribution, but the histogram is one of the most common and convenient ways to understand where data is concentrated and where it is more sparse. But don't worry, we will touch on these other visualizations as well.

<img src="../images/hist_bin_r.jpg" alt="Hist" style="width: 450px;"/>
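To make the idea of bins concrete, here is a minimal sketch that counts values per bin with `np.histogram` and prints a rough text version of a histogram. The `ages` array and the 10-year bin width are made up for illustration; they are not taken from the *Game of Thrones* data.

```python
import numpy as np

# Hypothetical ages, invented for this example only
ages = np.array([3, 12, 15, 21, 24, 29, 29, 35, 41, 47, 52, 53, 68, 102])

# Bin edges every 10 years from 0 to 110 (assumed width, chosen for readability)
edges = np.arange(0, 111, 10)

# counts[i] is the number of ages in [edges[i], edges[i+1]);
# the last bin also includes its right edge
counts, edges = np.histogram(ages, bins=edges)

# Print one row per bin, with one '#' per value in that bin
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(int(left), "to", int(right), ":", "#" * int(count))
```

Every value lands in exactly one bin, so the counts always add up to the number of data points.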


Often we will be dealing with a lot of values within a variable, and while a distribution is nice to visualize how the data is spread out, descriptive statistics provide a way for us to simplify the variable down into one number.

### The Mean
When people talk about the average of a dataset, they are most often referring to the arithmetic mean. 

In [3]:
np.mean(df.age)

35.59891598915989

We can see that the mean age of the entire dataset is about 35.6 years old. However, sometimes knowing just the mean isn't good enough for analysis. We also want to know how spread out the data points are around the mean. For that we would use another statistic.
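As a quick check of what the arithmetic mean does, the sketch below computes it by hand (sum divided by count) and compares it with `np.mean`. It uses only the five ages shown in the `df.head()` output above, so the result differs from the mean of the full dataset.

```python
import numpy as np

# Ages of the first five characters shown by df.head() above
ages = [29, 35, 41, 29, 12]

# Arithmetic mean: add the values up, then divide by how many there are
manual_mean = sum(ages) / len(ages)

print(manual_mean)      # 29.2
print(np.mean(ages))    # same value, computed by NumPy
```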

#### A brief note about `pandas` and `NumPy`

It is important to know that `pandas` is built on top of `NumPy` and therefore much of the functionality that we will be introducing as `NumPy` methods is also available with `pandas`. Let's see for ourselves...

In [4]:
pandas_mean = df.age.mean()
numpy_mean = np.mean(df.age)

print("I am the mean constructed with pandas: {}".format(pandas_mean))
print("And I am the mean constructed with NumPy: {}".format(numpy_mean))

I am the mean constructed with pandas: 35.59891598915989
And I am the mean constructed with NumPy: 35.59891598915989


See... the two produce the same result, although the methods look a bit different. Either way is fine for finding the mean from a numeric column of a data frame. But the `pandas` way will only work on a `pandas` object. Imagine if we wanted to find the mean of a numeric list we created without `pandas`. 

In [5]:
x = [1,2,3,4,5] # create a list of numbers

The `pandas` way won't work...
 * **ERROR EXPECTED**

In [6]:
x.mean()

AttributeError: 'list' object has no attribute 'mean'

But the `NumPy` way will...

In [7]:
np.mean(x)

3.0
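If you prefer the method-style call on data that started life as a plain list, one option is to convert the list to a `NumPy` array first, since arrays carry their own `.mean()` method. A small sketch, reusing the same list `x`:

```python
import numpy as np

x = [1, 2, 3, 4, 5]        # same list as above

arr = np.asarray(x)        # convert the list to a NumPy array
print(arr.mean())          # the method call now works: 3.0
print(np.mean(x))          # the function call works on the list directly: 3.0
```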

For the majority of the following descriptive statistics we will be using `NumPy` so we can familiarize ourselves with the `NumPy` package, which will be used heavily in other modules and courses.

### The Standard Deviation

The standard deviation is an expression of the amount of variance, or how spread out the data points are in a dataset. The higher the standard deviation, the more spread out the points are from the mean. 

Below is how to find the standard deviation with `NumPy`.

In [8]:
# This gives us the standard deviation for age. Standard deviation gives us a measure of how diverse the ages are. 
# If everyone is 30 years old, the standard deviation will be zero, for example. 
np.std(df.age)

18.99184246263994

**Spoiler for next week:** when you compute the standard deviation using `R` next week, it will be 19.017. Although 18.992 is rather close, the two results differ by design.

When looking at statistics we have populations and then we have samples. Simply put, a population is everyone while a sample is a subset of the population. Imagine we wanted to measure the heights of Oregonians. The population would be a measurement for every single person from Oregon. This would be impossible to do, so instead we would take a sample of individuals from Oregon. 

Now, standard deviation is found by taking the square root of a statistic called the "variance", and the way variance is computed for the population is just a tad bit different than how it is computed for the sample (we will get more into the math behind this in the Statistical and Mathematical Foundations course). It is this small difference that produces the different results. `NumPy` defaults to the population standard deviation while `R` defaults to the sample standard deviation.

Here is how we specify the sample standard deviation...

In [9]:
np.std(df.age,ddof = 1)

19.01762909021605

There we have it! Without getting too deep into the math behind the `ddof` parameter, just know that passing `ddof = 1` makes `NumPy` divide by the number of data points minus 1 instead of by the number of data points, which is what it takes to produce the sample standard deviation. We will get into this parameter a bit more during the Mathematical and Statistical Foundations course.
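For the curious, here is a minimal sketch of the two calculations done by hand, so you can see where the divisor changes. It uses the five ages from the `df.head()` output as a tiny example; the variable names are our own.

```python
import numpy as np

# Ages of the first five characters from df.head(), used as a tiny example
ages = np.array([29, 35, 41, 29, 12])
n = len(ages)

# Sum of squared deviations from the mean
squared_devs = np.sum((ages - ages.mean()) ** 2)

population_std = np.sqrt(squared_devs / n)        # divide by N      (ddof=0)
sample_std = np.sqrt(squared_devs / (n - 1))      # divide by N - 1  (ddof=1)

print(population_std, np.std(ages))          # the two values agree
print(sample_std, np.std(ages, ddof=1))      # the two values agree
```

Dividing by a smaller number always makes the sample standard deviation slightly larger than the population version, and the gap shrinks as the number of data points grows.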

Now back to what the result means. 

So one standard deviation away from the mean in the sample's age is about plus or minus 19 years. Two standard deviations would be about plus or minus 38 years, and so on. You will notice that once we get to 2 standard deviations below the mean, we start to talk about negative age, which we know to be impossible. This tells us something about our data: there are data points (or people) who are quite a bit older than the mean age of this sample. Let's take a look at our distribution again.

<img src="../images/hist_bin_r.jpg" alt="Hist" style="width: 450px;"/>

We can see that there are some old people in this dataset, which increases the spread of the data.
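The ranges described in this section can be worked out with a few lines of arithmetic. The sketch below copies the mean and sample standard deviation from the outputs above and prints the age range covered by one and two standard deviations.

```python
# Mean and sample standard deviation of age, copied from the outputs above
mean_age = 35.59891598915989
std_age = 19.01762909021605

for k in (1, 2):
    lower = mean_age - k * std_age
    upper = mean_age + k * std_age
    print(f"{k} standard deviation(s): {lower:.1f} to {upper:.1f} years")
```

The lower bound at two standard deviations comes out below zero, which is the impossible negative age mentioned in the text.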

### The Median

The mean and standard deviation are highly influenced by extreme values known as outliers. Outliers have the ability to pull the distribution in one direction or the other. For example, imagine we had the ages of 10, 11, 13, 13, 11, and 153. It isn't hard to tell that 153 is an outlier in this sample, but since the mean takes all of the points into consideration, it will be highly influenced by it. The mean of this sample would be about 35.17 with a sample standard deviation of 57.739 years!

Sometimes you want an average value that isn't highly influenced by these outlying values. In this case, you would use the median. The median is the middle positioned value in an ordered set of numbers. Following our example above, we would first order our numbers from least to greatest (or vice versa), 10, 11, 11, 13, 13, 153, and then select the number in the middle position. Since this list contains an even number of values and there is no single middle value, we would take the two most central values and find their mean. These two values would be 11 and 13, whose mean is 12. So the median is 12.
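The worked example can be checked in a few lines. This sketch uses the six example ages from the text, then drops the outlier to show how differently the mean and the median react.

```python
import numpy as np

ages = [10, 11, 13, 13, 11, 153]   # the example ages from the text

print(np.mean(ages))               # about 35.17, pulled upward by 153
print(np.std(ages, ddof=1))        # about 57.74 (sample standard deviation)
print(np.median(ages))             # 12.0, unaffected by how large 153 is

# Drop the outlier and compare: the mean moves a lot, the median barely moves
without_outlier = [10, 11, 13, 13, 11]
print(np.mean(without_outlier))    # 11.6
print(np.median(without_outlier))  # 11.0
```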

Much like finding the mean and the standard deviation, finding the median in `NumPy` simply requires you to call the function `np.median()` on a numerical object. Remember, the median is not sensitive to outliers and is therefore sometimes preferable to the mean as a measure of the average. Let's try it on our dataset's `age` column.

Here's how we do it...

In [10]:
# Median is the midpoint in the data. There are an equal number of records above and below this value.
np.median(df.age)

35.0

### Quartiles / Summary
Quartiles split the ordered data into 4 equal groups. They tell us the values below which 25%, 50%, and 75% of the data points fall, and together with the minimum and maximum they summarize how the data is spread around the median. `pandas` provides a convenient method for seeing the quartiles and mean of a variable: simply call `describe()` on a numeric object.

Run the code below and then we will talk about its output...

In [11]:
df.age.describe()

count    369.000000
mean      35.598916
std       19.017629
min        0.000000
25%       20.000000
50%       35.000000
75%       47.000000
max      102.000000
Name: age, dtype: float64

There are multiple values here to discuss. We will start with the simplest. We see the minimum age (0) of the dataset and its maximum age (102). The `std` row matches the sample standard deviation we found earlier with `ddof = 1`. The 50% row is the median, which matches the result of the `median` function called above. In other words, half of the data points fall at or below the age of 35 and half fall at or above it. 100% of the data points are below or equal to the age of 102. This makes it simpler to interpret the 25% and the 75% rows: a quarter of the ages fall at or below 20, and three quarters fall at or below 47.
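The same five summary points can be sketched with `NumPy` alone by asking `np.percentile` for several percentages at once. The `ages` array here is invented for the example and is not the *Game of Thrones* data.

```python
import numpy as np

# Hypothetical ages, invented for this example only
ages = np.array([2, 8, 15, 20, 27, 35, 36, 44, 47, 60, 102])

# Ask for all five summary points in one call
minimum, q1, q2, q3, maximum = np.percentile(ages, [0, 25, 50, 75, 100])

print(minimum, q1, q2, q3, maximum)
print(q2 == np.median(ages))   # the 50th percentile is the median
```

The 0th and 100th percentiles are simply the minimum and maximum, which is why they appear alongside the quartiles in a summary.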

That's nice and all, but what if we wanted to see the max value of the points at a different percentage of the dataset? `NumPy` provides us with a very convenient function to do so.

In [12]:
np.percentile(df.age, 65)

41.0

So 65% of the values in the age variable are 41 and below. This method is simple. We inserted two arguments: the first is a numeric object, in this case the `age` variable of our *Game of Thrones* dataset, and the second is a number between 0 and 100 signifying the percentage.
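One way to convince yourself of what a percentile means is to count how many values sit at or below it. A minimal sketch on toy data (the integers 1 through 100, chosen only because the answer is easy to verify):

```python
import numpy as np

values = np.arange(1, 101)            # the integers 1 through 100 (toy data)

p65 = np.percentile(values, 65)       # value at the 65th percentile
share_at_or_below = np.mean(values <= p65)

print(p65)
print(share_at_or_below)              # fraction of values at or below p65
```

Taking the mean of a True/False array gives the fraction of True entries, which is a handy trick for this kind of check.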

### Maximum and Minimum Values

Once again, we see that there are multiple ways to do the same thing. We can also find the maximum and minimum value of a variable by calling the `amax` and `amin` functions.

In [13]:
np.amax(df.age)

102

In [14]:
np.amin(df.age)

0
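`NumPy` also offers these under the shorter names `np.max` and `np.min`, and arrays have matching `.max()` and `.min()` methods. A brief sketch using the five ages from the `df.head()` output:

```python
import numpy as np

ages = np.array([29, 35, 41, 29, 12])   # first five ages from df.head()

print(np.amax(ages), np.amin(ages))     # 41 12
print(np.max(ages), np.min(ages))       # 41 12, same results under shorter names
print(ages.max(), ages.min())           # 41 12, method form on the array
```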

## Bivariate Analysis

Sometimes we want to know how variables change together, or whether or not there is a correlation between variables in our dataset. We can do that by looking at the covariance and correlation of two variables.

For this part, we are going to change gears and look at a different dataset to better aid us in our exploration.  This is the Stature Hand and Foot dataset, which simply gathers the height, hand length, and foot length of individuals.

Let's take a look below. For these two statistics, we will be using `pandas` again.

### Read in the Data

In [15]:
with open('/dsa/data/all_datasets/stature-hand-foot/stature-hand-foot.csv') as file2:
    df2 = pd.read_csv(file2)
    df2.columns = ['gender', 'height', 'hand_length', 'foot_length'] # change file header names
    df2['gender'] = df2['gender'].astype('category') # change column type after renaming

In [16]:
df2.head()

| | gender | height | hand_length | foot_length |
|---|---|---|---|---|
| 0 | 1 | 1760.2 | 208.6 | 269.6 |
| 1 | 1 | 1730.1 | 207.6 | 251.3 |
| 2 | 1 | 1659.6 | 173.2 | 193.6 |
| 3 | 1 | 1751.3 | 258.0 | 223.8 |
| 4 | 1 | 1780.6 | 212.3 | 282.1 |


### Covariance

The covariance of two variables measures how the two variables of the sample change linearly together. In other words, if one variable increases, what does the other one do? If the covariance is positive, then as one variable increases so does the other. If it is negative, then as one variable increases, the other decreases. If the value is 0, then the two variables have no linear association.

When finding the covariance of two variables, you are going to be calling a method on a `pandas` Series object. In other words, we are going to be calling a method on one numeric variable and passing another variable as an argument. Take a look at the example below.

In [17]:
df2.hand_length.cov(df2.foot_length)

195.07014411395065

As you can see, we run the method on the `hand_length` variable and pass the `foot_length` variable as the argument.
The above result is rather intuitive given what covariance tells us. An increase in hand length is generally associated with an increase in foot length, thus the positive covariance.
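To see what is happening under the hood, here is a hedged sketch of the sample covariance computed by hand and checked against `np.cov`. It uses only the five rows shown in the `df2.head()` output, so the number will not match the full-dataset result above; the variable names are our own.

```python
import numpy as np

# First five rows of hand_length and foot_length from df2.head()
hand = np.array([208.6, 207.6, 173.2, 258.0, 212.3])
foot = np.array([269.6, 251.3, 193.6, 223.8, 282.1])

# Sample covariance: sum the products of paired deviations, then divide by n - 1
manual_cov = np.sum((hand - hand.mean()) * (foot - foot.mean())) / (len(hand) - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(hand, foot)
numpy_cov = np.cov(hand, foot)[0, 1]

print(manual_cov, numpy_cov)   # the two values agree
```

Note that the size of a covariance depends on the units of both variables, which makes it hard to judge on its own whether a value is large or small.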

### Correlation

Now, what if we want to assess the strength of the linear relationship between these two variables? In other words, how closely does the movement of one variable track the movement of the other? Correlation is our statistic to assess that relationship.

As you can imagine, the syntax is the same as it was for the covariance.

In [18]:
df2.hand_length.corr(df2.foot_length)

0.7882243081238717

This is a rather strong relationship between the hand length and foot length. A value of 1 indicates a perfect positive linear relationship (the points fall exactly on an upward-sloping line, whatever its slope), a value of -1 indicates a perfect negative linear relationship (the points fall exactly on a downward-sloping line), and a value of 0 means that there is no linear relationship between the two variables.
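Correlation is the covariance rescaled by both standard deviations, which is why it always lands between -1 and 1 regardless of units. The sketch below shows that calculation on the five rows from `df2.head()` (so the value differs from the full-dataset result) and then checks the two extreme cases on made-up data.

```python
import numpy as np

hand = np.array([208.6, 207.6, 173.2, 258.0, 212.3])   # from df2.head()
foot = np.array([269.6, 251.3, 193.6, 223.8, 282.1])   # from df2.head()

# Correlation = covariance divided by the product of the standard deviations
manual_corr = np.cov(hand, foot)[0, 1] / (np.std(hand, ddof=1) * np.std(foot, ddof=1))
numpy_corr = np.corrcoef(hand, foot)[0, 1]
print(manual_corr, numpy_corr)             # the two values agree

# A perfect linear relationship gives +1 or -1 whatever the slope is
x = np.array([1.0, 2.0, 3.0, 4.0])         # made-up values
perfect_positive = np.corrcoef(x, 3 * x + 2)[0, 1]
perfect_negative = np.corrcoef(x, -0.5 * x)[0, 1]
print(perfect_positive, perfect_negative)  # approximately 1.0 and -1.0
```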


This wraps up our lab notebooks for this module. 

# Save your notebook, then `File > Close and Halt`