# Analyzing Distributions of Numbers

### Types of Distributions
Below are just a few of the types of distributions that you will encounter as you begin to analyze sets of data. Pay close attention to the data type that each distribution describes, as different analyses are available for each type. Most examples come from the Titanic dataset!

- **Continuous**: The numbers in the distribution can take on any value across a certain range.
    - Examples
    - `Fare`:`[8.50, 9.33, 7.15,...]`
    - `Age`:`[1,4,43,21,22,...]`
    - ...


- **Categorical**: A *countable* set of categories
    - Examples
    - `Embarked`:`['S', 'S', 'Q', ...]`
    - `Cabin`:`['C85', 'C123', 'E46', ...]`
    - `building material`:`['wood', 'brick', 'hay'....]`
    - ...
    
    
- **Binary**: A choice between one value or another
    - Examples
    - `Survived`:`[1, 0, 0, 1, 0, 1, 0, ...]`
    - `Gender` : `['M', 'F', 'F', 'M', ...]`
    - ...
   
   
- **Ordinal**: A mix of continuous and categorical. Numbers that can be interpreted as ordered categories.
    - Examples
    - `Pclass`:`[3,2,1]`
    - ...
    
   
- **Unique**: Usually a unique identifier for each row; it could be an index of some sort.
    - Examples
    - `PassengerId`:`[1,2,3,4,5,...]`
    - `Ticket` : `['PC 17599' , '347082', ...]`

##### Import Libraries

In [1]:
import numpy as np
import pandas as pd

##### Load Data

We will be using a partially cleaned version of the Titanic dataset. This is actually a file that you will produce later in the course, in week 3, when we cover data cleaning.

In [2]:
df = pd.read_csv('assets/titanic_cleaned.csv')

##### Look at the different columns at our disposal and their dtype

In [3]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Gender          object
Age            float64
SibSp            int64
ParCh            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Title           object
dtype: object
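The dtypes alone do not say which kind of distribution a column holds: `Survived` and `Pclass` are both `int64`, yet one is binary and the other ordinal. As a rough sketch, the dtype can be combined with the number of unique values to make a first guess. The helper name `guess_kind` and its `max_categories` threshold are assumptions made up for this illustration, not a pandas feature, and the toy frame only mimics a few Titanic columns.

```python
import pandas as pd

def guess_kind(col: pd.Series, max_categories: int = 10) -> str:
    """Very rough first guess at a column's distribution type (hypothetical helper)."""
    n_unique = col.nunique()
    if n_unique == len(col):
        return "unique"                  # every row has its own value
    if n_unique == 2:
        return "binary"
    if n_unique <= max_categories:
        return "categorical or ordinal"  # a human has to decide which
    if pd.api.types.is_numeric_dtype(col):
        return "continuous"
    return "categorical"

# Tiny made-up frame that mimics a few Titanic columns
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

kinds = {name: guess_kind(toy[name]) for name in toy.columns}
print(kinds)
```

A guess like this is only a starting point. On very small samples a continuous column can look "unique", so the result should always be checked against what the column actually means.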

##### Examine some distributions here with describe method

In [4]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | ParCh | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.841942 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.281525 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 30.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 36.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


# Detour: Central Tendency - Mean vs Median

Before we continue with the Titanic dataset, let's first discuss a fundamental concept in statistics called `Central Tendency`. The central tendency of a distribution is its typical or "average" value; however, there are several ways to define it. The two most common are the `mean` and the `median`. Another is the `mode`, the most frequent value, which is used much less often.

- The **mean** is calculated by finding the sum of all numbers in a distribution and dividing by the total number of items in said distribution. 

- The **median** is calculated by finding the middle value of the distribution when sorted from least to greatest. With an odd number of values it is the single middle value; with an even number of values it is the average of the two middle values.

For our purposes, calculating the `mean` or `median` will be as easy as calling the `numpy` `mean`/`median` functions or their `pandas` equivalents on the appropriate set of data, of course.
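To make the two definitions concrete, here is a minimal sketch that computes both by hand and compares the results with `numpy`. The helper names `manual_mean` and `manual_median` and the two small lists are made up for this illustration.

```python
import numpy as np

def manual_mean(values):
    # Sum of all values divided by how many there are
    return sum(values) / len(values)

def manual_median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd = [7, 1, 5]
even = [7, 1, 5, 3]

print("odd: ", manual_mean(odd), manual_median(odd))
print("even:", manual_mean(even), manual_median(even))

# The numpy functions should agree with the hand-written versions
print(np.isclose(manual_mean(odd), np.mean(odd)))
print(manual_median(even) == np.median(even))
```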

##### Calculating Central Tendency

Calculating a GPA is a really common example of calculating the central tendency of a distribution. In this case the distribution is the set of number values for each grade, where an A is a 4, a B is a 3, and so on.

In [5]:
my_grades = [3,4,4,4,3,3]

gpa = np.mean(my_grades)

print("My GPA was great this year! I got a", str(gpa) + "!")

My GPA was great this year! I got a 3.5!


##### Outliers affect Central Tendency

An `outlier` is a value that is unlike the rest of the values found in a distribution. Some examples of outliers are a `16lb` baby in a distribution of `baby_birth_weights`, a `10,000,000` home in a distribution of `house_prices`, and a `110` year old person in a distribution of `age` for the `san_diego_census`.

The main reason we have to contend with `outliers` is that they tend to drag the central tendency of a distribution in their direction, distorting the average and leading to false insight. 

##### Below we'll see that the `median` is better at handling central tendency in the presence of outliers.

In [6]:
some_dist = np.array([0,1,2,3,4,5,6,7,8,9,10,34,56,100])

print("Central Tendency For:", some_dist)
print("----------------")
print("mean:", np.mean(some_dist))
print("median:", np.median(some_dist))

Central Tendency For: [  0   1   2   3   4   5   6   7   8   9  10  34  56 100]
----------------
mean: 17.5
median: 6.5


We can see that the `mean` is severely distorted by the presence of large values such as `100`, while the `median` seems much more reasonable.
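One way to see how much each statistic depends on the large values is to recompute both after leaving them out. The cutoff of `10` used here is an arbitrary, hand-picked choice for this illustration, not a general rule for detecting outliers.

```python
import numpy as np

some_dist = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 34, 56, 100])

# Keep only the values at or below the hand-picked cutoff
trimmed = some_dist[some_dist <= 10]

mean_shift = np.mean(some_dist) - np.mean(trimmed)
median_shift = np.median(some_dist) - np.median(trimmed)

print("mean shift:", mean_shift)      # 17.5 - 5.0 = 12.5
print("median shift:", median_shift)  # 6.5 - 5.0 = 1.5
```

Dropping three values moves the mean by 12.5 but the median by only 1.5, which is the sense in which the median is more robust.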

# Unique Values in a distribution
Knowing the unique values of a distribution can simplify things greatly! Let's explore the three most relevant pandas methods below.

#### First let's look at the first 20 values of `Embarked`
This doesn't tell us much about the distribution at large, but it does give us a quick glance and can definitely be useful. To do this we use the `.head()` method, which normally returns the first 5 rows of a `dataframe` or `series`, but in this case it will give us `20` because we pass in that value as the sole argument.

In [7]:
print("First 20 values of Embarked:")

df['Embarked'].head(20)

First 20 values of Embarked:


0     S
1     C
2     S
3     S
4     S
5     Q
6     S
7     S
8     S
9     C
10    S
11    S
12    S
13    S
14    S
15    S
16    Q
17    S
18    S
19    C
Name: Embarked, dtype: object

##### df[column].unique()
This method will return all unique values in a distribution

In [8]:
print("unique values in Embarked:")
df['Embarked'].unique()

unique values in Embarked:


array(['S', 'C', 'Q'], dtype=object)

Notice how seeing the unique values simplifies the distribution greatly, as now we know that every value in this distribution can only take on one of 3 values!

##### df[column].nunique()

When you have many unique values in a categorical distribution, it is sometimes useful to know how many different unique values there are. I've seen datasets where there were over 20,000 unique values for a degree column. Some people describe their MBA as `M.B.A`, `mba`, `m.b.a`, `Mba`, etc.

In [9]:
print("Num Unique Values in Embarked:")
df['Embarked'].nunique()

Num Unique Values in Embarked:


3
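When the same category is spelled many different ways, as in the MBA example, normalizing the strings before counting can collapse the variants. The sketch below uses a made-up `degree` series. The cleaning steps (lowercase, drop periods, strip whitespace) are one reasonable choice among many, not a prescribed recipe.

```python
import pandas as pd

# Made-up data: several spellings of the same two degrees
degree = pd.Series(["M.B.A", "mba", "m.b.a", "Mba", "PhD", "phd "])

cleaned = (
    degree.str.lower()                          # ignore capitalization
          .str.replace(".", "", regex=False)    # drop literal periods
          .str.strip()                          # drop stray whitespace
)

print("before cleaning:", degree.nunique())
print("after cleaning: ", cleaned.nunique())
print(cleaned.unique())
```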

##### df[column].value_counts()

Probably the most useful of the three, this function will return all unique values in a distribution as well as the frequency of each. This one is definitely one to remember!

In [10]:
print('Counts for Unique Values in Embarked:')
df['Embarked'].value_counts()

Counts for Unique Values in Embarked:


S    646
C    168
Q     77
Name: Embarked, dtype: int64
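Raw counts are often easier to interpret as proportions, and `value_counts` accepts `normalize=True` for this. The sketch below rebuilds a series with the same counts as the output above, so that it runs without the CSV file.

```python
import pandas as pd

# Rebuild a series with the same counts as the Embarked column
embarked = pd.Series(["S"] * 646 + ["C"] * 168 + ["Q"] * 77, name="Embarked")

counts = embarked.value_counts()                # absolute frequencies
shares = embarked.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(3))
```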

# 5 Number Summary with `.describe()`
`Pandas` has some really useful methods, but few are as useful as `.describe()`. It is the very first method that I run when examining a distribution or even an entire dataframe.

Earlier, we ran `.describe()` without much explanation. One thing worth pointing out is that the method works on an entire dataframe, saving us the trouble of running it on each column one at a time.

However, because there are many types of distributions, the output of the describe method looks different for numerical and categorical distributions. Each "flavor" of `.describe()` is described below.

##### Numerical - Default
Notice how the default behavior of the method only picks out the numerical columns and gives us a breakdown of EACH distribution as before, except we only do 1 method call, instead of 7.

In [11]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | ParCh | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.841942 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.281525 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 30.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 36.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The numerical flavor gives us a lot of information. In particular:

- `count` - The number of non-empty (non-`NaN`) values in the column
- `mean` - The mean, aka the basic average
- `std` - The standard deviation, a measure of spread: roughly, the typical distance of each value from the mean
- `min` - The smallest value of the distribution
- `25%` - The value at the 25th percentile: a quarter of the values fall at or below it
- `50%` - The median, aka the middle value
- `75%` - The value at the 75th percentile: three quarters of the values fall at or below it
- `max` - The largest value of the distribution

The last five values (`min`, `25%`, `50%`, `75%`, `max`) are what is known as a five-number summary, which is as simple as sorting the data points from smallest to largest and looking at which values appear at each "checkpoint".
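As a cross-check, each of these statistics can be computed directly. This sketch uses a small made-up series rather than the Titanic data. One detail to be aware of: pandas computes the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two differ unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Small made-up series, not taken from the Titanic data
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Five-number summary: min, 25th, 50th, 75th percentile, max
five_number = values.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_number)

pandas_std = values.std()                         # sample std, ddof=1
numpy_std = np.std(values.to_numpy())             # population std, ddof=0
numpy_std_sample = np.std(values.to_numpy(), ddof=1)

print("pandas std:        ", pandas_std)
print("numpy std (ddof=0):", numpy_std)
print("numpy std (ddof=1):", numpy_std_sample)

# describe() reports the same quantiles and the pandas-style std
summary = values.describe()
print(summary)
```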

I recommend going column by column to get a better idea of each distribution. This may involve stepping away from the full dataframe for a moment to examine a particular column more closely, then moving on to the next.

##### Categorical - Custom Argument
In pandas, text-based categorical data is usually stored with the `object` dtype, so we pass `'object'` as the argument for the `include` parameter.

In [12]:
df.describe(include='object')

| | Name | Gender | Ticket | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 891 | 891 |
| unique | 891 | 2 | 681 | 147 | 3 | 5 |
| top | Stoytcheff, Mr. Ilia | male | CA. 2343 | C23 C25 C27 | S | Mr |
| freq | 1 | 577 | 7 | 4 | 646 | 528 |


The "object" flavor gives us a good amount of information as well:

- `count` - The number of non-empty (non-`NaN`) values in each column
- `unique` - The number of unique categories per column
- `top` - The most frequent unique value
- `freq` - The frequency of that top value

This is much more straightforward than the numerical variety.
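The four rows of the object summary map onto methods introduced earlier. A minimal sketch with a made-up `title` series, including one missing value to show that it is left out of every row:

```python
import pandas as pd

# Made-up data with one missing value
title = pd.Series(["Mr", "Mrs", "Mr", "Miss", None, "Mr"], name="Title")

# value_counts sorts by frequency and skips missing values by default
counts = title.value_counts()

summary = {
    "count": int(title.count()),     # non-missing values
    "unique": int(title.nunique()),  # distinct categories
    "top": counts.index[0],          # most frequent category
    "freq": int(counts.iloc[0]),     # how often it occurs
}
print(summary)

# describe() reports the same four numbers
described = title.describe()
print(described)
```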