# Pandas Review

### Before getting into Stats, let's go over a few Pandas examples:
- ##### Working with CSVs
    - Assigning path names
    - [Reading CSVs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
    - [Writing CSVs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
- ##### How to get statistics of a dataframe
    - [Describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)
    - [Info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)
    - [Shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html)
- ##### Visual inspection of the dataframe
    - [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)
    - [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html)
- ##### Introduce Pandas methods
    - [isin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
    - [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)

In [None]:
# Import Libraries
import pandas as pd

# Create variable to assign filepath
filepath = '../data/iris.csv'

# Read CSV file to dataframe
df = pd.read_csv(filepath)

#### Get dataframe statistics

In [None]:
# Look at shape
print('DF Shape')
print(df.shape)

In [None]:
# Look at info
df.info()

In [None]:
# Look at describe
df.describe()

#### Visual inspection of head & tail of dataframe

In [None]:
df.head()

In [None]:
df.tail()

#### Pandas isin()

The Pandas isin() method is used to filter dataframes. It helps select rows that have a particular value (or one of several values) in a particular column.

[isin Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
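
As a minimal sketch of the filtering pattern described above, the cell below builds a small made-up dataframe (the `flowers` name and its values are assumptions for illustration, not the real iris data) and uses `isin()` with more than one value:

```python
import pandas as pd

# Hypothetical stand-in for the iris data; values are made up for illustration
flowers = pd.DataFrame({
    "SepalLengthCm": [5.1, 6.5, 4.9, 6.5, 7.0],
    "Species": ["setosa", "virginica", "setosa", "versicolor", "versicolor"],
})

# isin() returns a boolean Series: True where the value is in the list
mask = flowers["Species"].isin(["setosa", "virginica"])

# Index the dataframe with the mask to keep only the matching rows
selected = flowers[mask]
print(selected)

# Prefix the mask with ~ to keep the rows that do NOT match
others = flowers[~mask]
print(others)
```

The boolean mask is an ordinary Series, so it can be stored, inverted, or combined with other conditions before it is used to index the dataframe.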

In [None]:
# Build a boolean Series from the SepalLengthCm column:
# True where the value is 6.5, False otherwise
new = df["SepalLengthCm"].isin([6.5])

# Keep the first five rows where SepalLengthCm is 6.5, then display them
new_df = df[new].head()
new_df

#### Pandas value_counts()

The Pandas Series.value_counts() function returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

[value_counts Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)
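
The behavior described above (descending order, NA values skipped) can be checked on a tiny made-up Series. The `species` data below is hypothetical, and `normalize` and `dropna` are optional `value_counts` parameters shown here only as a sketch of what is available:

```python
import pandas as pd

# Hypothetical species labels, including one missing value
species = pd.Series(
    ["setosa", "virginica", "setosa", None, "versicolor", "setosa", "virginica"]
)

# Counts per unique value, most frequent first; the missing value is skipped
counts = species.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts
shares = species.value_counts(normalize=True)
print(shares)

# dropna=False adds an entry for the missing values
counts_with_na = species.value_counts(dropna=False)
print(counts_with_na)
```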

In [None]:
# This will return the count of each occurrence of a value within the selected column
df['Species'].value_counts()

#### Save the data to a new CSV


In [None]:
# Save the filtered dataframe to a CSV named new_csv.csv, without the index column
new_df.to_csv('new_csv.csv', index=False)
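
To see what the index argument changes without touching the file system, here is a small round-trip sketch. It assumes a made-up two-row dataframe named `small` and uses an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Hypothetical two-row dataframe, standing in for the filtered iris rows
small = pd.DataFrame({
    "SepalLengthCm": [6.5, 6.5],
    "Species": ["virginica", "versicolor"],
})

# With no path, to_csv() returns the CSV text instead of writing a file
csv_text = small.to_csv(index=False)
print(csv_text)

# Read the text back from an in-memory buffer
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))

# Keeping the index writes it as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(small.to_csv()))
print(list(with_index.columns))
```

Dropping the index on write avoids the extra unnamed column when the file is read back later.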

# Time to pair up for a Pandas Exercise!

# Basic Statistical Functions

Statistics is a field of study concerned with collecting and analyzing data. 

A well-trained statistician can use the conclusions drawn from these analyses to help a business make better decisions.

There are two main types of *activities* in Statistics: 
- Descriptive statistics: 
> Encompasses the many tools used to *DESCRIBE* data 
- Inferential statistics: 
> Encompasses the many tools used to *INFER* from data.   


### Mean

- The Central Tendency of a set refers to the general behavior of the set at its *middle*

- Mean and Median are just 2 different definitions for this "middle" of the set

- Mean - The *AVERAGE* value of a set

- The mean can be too sensitive to outliers, which is one reason why the median is sometimes used instead of the mean.
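
The sensitivity to outliers mentioned above is easy to see with a small experiment. This is a sketch with made-up numbers: the second list swaps one value for an extreme one and everything else stays the same.

```python
import numpy as np

# The example set used in this section, in sorted order
typical = [1, 3, 4, 7, 10]

# Same set with the largest value replaced by an extreme outlier (made up)
with_outlier = [1, 3, 4, 7, 1000]

print("typical:      mean =", np.mean(typical), " median =", np.median(typical))
print("with outlier: mean =", np.mean(with_outlier), " median =", np.median(with_outlier))
```

The mean moves from 5 to 203, while the median stays at 4 because only the position of the middle item matters.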

Given a set X with n items: 
> X = (X1, X2, X3, ... Xn)

The MEAN can be generalized as: 
> MEAN = (1/n) * (X1 + X2 + X3 + ... + Xn)

#### Example:  Find MEAN for X = [1,10,3,4,7]

    MEAN = (1/5) * (1 + 3 + 4 + 7 + 10) = (1/5) * (25) = 25/5 = 5
    MEAN  = 5
    
Now let's use NumPy to verify our example...

In [None]:
# Import Libraries
import numpy as np

# Mock data
a_set = 1,10,3,4,7

mean = np.mean(a_set)

print("MEAN = {}".format(mean))

### Median
- The MIDDLE value of an ORDERED set

Given an ordered set X with n items: 
> X = (X1, X2, X3, ... Xn)

The MEDIAN can be generalized as:
- If n is ODD
> MEDIAN = X((n + 1)/2) 

- If n is EVEN
> MEDIAN = 1/2 * [X(n/2) + X((n/2) + 1)] 

#### Example:  Find MEDIAN for X = [1, 3, 4, 7, 10]

    MEDIAN = X((5 + 1)/2) = X(6/2) = X(3) = 4
    MEDIAN = 4
    
Now let's use NumPy to verify our example...

In [None]:
# Import Libraries
import numpy as np

# Mock data
a_set = 1,10,3,4,7

median = np.median(a_set)

print("MEDIAN = {}".format(median))

In this case we used the very same set, but calculated a different value for our Central Tendency. 

MEAN = 5 || MEDIAN = 4
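
The example above only exercises the odd case. The sketch below implements both position rules in plain Python so the even case can be checked too. The function name `median_by_position` and the extra value in `even_set` are assumptions made for illustration.

```python
import numpy as np

def median_by_position(values):
    """Median via the odd/even position rules (hypothetical helper name)."""
    ordered = sorted(values)          # the rules assume an ordered set
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle item, index n // 2 in 0-based Python
        return float(ordered[n // 2])
    # Even n: average of the two middle items
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_set = [1, 10, 3, 4, 7]
even_set = [1, 10, 3, 4, 7, 8]   # same set with one extra, made-up value

print(median_by_position(odd_set), np.median(odd_set))
print(median_by_position(even_set), np.median(even_set))
```

Python indexes from 0 while the formulas count from 1, which is why the code uses `n // 2` and `n // 2 - 1`.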

### Variance and Standard Deviation

Variance and Standard Deviation are values that describe the "spread" of a dataset, i.e. how far the individual points of a given set lie from its mean.

#### Variance

The VARIANCE uses the MEAN to calculate spread

Given an ordered set X with n items and MEAN M: 
> X = (X1, X2, X3, ... Xn) || mean = M

The VARIANCE can be generalized below where X stands for each element in set:
> VARIANCE = [ Sum[(X - MEAN)^2]] / n 


Formula:
> $$\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - M)^2} {n}$$

#### Example:  Find VARIANCE for X = [1, 10, 3, 4, 7]
Take each difference (Xi - MEAN). Square it. Then average the result:

    σ^2 = [ (1-5)^2 + (10-5)^2 + (3-5)^2 + (4-5)^2 + (7-5)^2 ] / 5
    σ^2 = [ (-4)^2 + (5)^2 + (-2)^2 + (-1)^2 + (2)^2 ] / 5
    σ^2 = [ 16 + 25 + 4 + 1 + 4 ] / 5
    σ^2 = [ 50 ] / 5
    σ^2 = 10.0
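
The hand calculation above can be reproduced step by step in plain Python. The sketch also shows the sample variance, which divides by n - 1 instead of n. That variant is not part of the formula in this section and is included only because NumPy exposes it through the `ddof` argument.

```python
import numpy as np

data = [1, 10, 3, 4, 7]
n = len(data)
m = sum(data) / n                                # mean M = 5.0

squared_diffs = [(x - m) ** 2 for x in data]     # (x_i - M)^2 for each item
population_variance = sum(squared_diffs) / n     # divide by n, as in the formula

# np.var divides by n by default (ddof=0)
print(population_variance, np.var(data))

# Sample variance divides by n - 1; in NumPy this is ddof=1
sample_variance = sum(squared_diffs) / (n - 1)
print(sample_variance, np.var(data, ddof=1))
```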
    
Now let's use NumPy to verify our example...

In [None]:
# Import Libraries
import numpy as np

# Mock data
a_set = 1,10,3,4,7

variance = np.var(a_set)

print("VARIANCE = {}".format(variance))

#### Standard Deviation
The STANDARD DEVIATION is the square root of the VARIANCE, so it measures spread in the same units as the data.

Given an ordered set X with n items and MEAN M: 
> X = (X1, X2, X3, ... Xn) || mean = M

Formula:

$$SD = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - M)^2} {n}}$$

#### Example:  Find STANDARD DEVIATION for X = [1, 10, 3, 4, 7]
Notice it is simply the square root of the variance equation

    SD = Square Root(Variance)
    SD = Square Root(10.0)
    SD = 3.162277...
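
As a quick check of the relationship above, the sketch below takes the square root of `np.var` and compares it with `np.std`. It then uses the result as a yardstick by listing the points that fall within one standard deviation of the mean (that last step is an illustration, not something the section defines).

```python
import math
import numpy as np

data = [1, 10, 3, 4, 7]

variance = np.var(data)                  # population variance (divides by n)
sd_from_variance = math.sqrt(variance)   # square root of the variance
sd_numpy = np.std(data)                  # population standard deviation

print(sd_from_variance, sd_numpy)

# Points that lie within one standard deviation of the mean
mean = np.mean(data)
within_one_sd = [x for x in data if abs(x - mean) <= sd_numpy]
print(within_one_sd)
```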

In [None]:
# Import Libraries
import numpy as np

# Mock data
a_set = 1,10,3,4,7

std_dev = np.std(a_set)

print("STANDARD DEVIATION = {}".format(std_dev))

### Pulling It All Together!

In [None]:
# X is a Python List
X = [37.89, 53.18, 27.31, 39.33, 44.64, 53.79, 11.11, 22.12, 19.55]

# Sorting the data and printing it.
X.sort()
print(X)

In [None]:
# Using NumPy's built-in functions to Find Mean, Median, Variance 
# and Standard Deviation
mean = np.mean(X)
median = np.median(X)
variance = np.var(X)
sd = np.std(X)

In [None]:
# Printing the values
print("Mean:", mean)
print("Median:", median)
print("Variance:", variance)
print("Standard Deviation:", sd) 
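
One thing to watch when combining this with the Pandas review: as far as the defaults go, `np.std` and `np.var` divide by n (`ddof=0`), while the Pandas `std`, `var`, and the `std` row of `describe()` divide by n - 1 (`ddof=1`). The sketch below reuses the same `X` values to show the difference and how to make the two agree.

```python
import numpy as np
import pandas as pd

X = [37.89, 53.18, 27.31, 39.33, 44.64, 53.79, 11.11, 22.12, 19.55]
s = pd.Series(X)

print("NumPy std  (ddof=0):", np.std(X))
print("Pandas std (ddof=1):", s.std())

# Passing the same ddof to both makes them agree
print(np.isclose(np.std(X, ddof=1), s.std()))
print(np.isclose(np.std(X), s.std(ddof=0)))

# describe() reports the ddof=1 standard deviation
print(s.describe()["std"])
```

Mean and median are unaffected by this setting, so only the spread measures need the extra care.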

# Time to pair up for an Exercise on NumPy Stats!

# Descriptive Statistical Plots


[Matplotlib](http://matplotlib.org/) is the most common Python charting package. See its [documentation](http://matplotlib.org/api/pyplot_api.html) for details and its [examples](http://matplotlib.org/gallery.html#statistics) for inspiration.

### Line Chart

In [None]:
import matplotlib.pyplot as plt

# Params set to change the axis colors, as the Colab theme is black and
# the default text is unreadable; "b" sets them to blue
params = {"ytick.color" : "b",
          "xtick.color" : "b",
          "axes.labelcolor" : "b",
          "axes.edgecolor" : "b"}

plt.rcParams.update(params)

# sample x and y values
x  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y1 = [1, 3, 5, 3, 1, 3, 5, 3, 1]
y2 = [2, 4, 6, 4, 2, 4, 6, 4, 2]

# plot both lines, with labels for the legend
plt.plot(x, y1, label="line L")
plt.plot(x, y2, label="line H")

# add label for axis
plt.xlabel("x axis")
plt.ylabel("y axis")

# add label for title
title_obj = plt.title("Line Graph Example")
plt.setp(title_obj, color='b') 
plt.legend()

# display chart
plt.show()

### Pie Chart

In [None]:
import matplotlib.pyplot as plt

labels = 'S1', 'S2', 'S3'
sections = [56, 66, 24]
colors = ['c', 'g', 'y']

plt.pie(sections, labels=labels, colors=colors,
        startangle=90,
        explode = (0.02, 0.02, 0.02), # specifies the fraction of the radius with which to offset each wedge
        autopct = '%1.1f%%') # used to label the wedges with their numeric value

plt.axis('equal') # Try commenting this out.
title_obj = plt.title("Pie Chart Example")
plt.setp(title_obj, color='b') 
plt.show()

### Bar Chart

In [None]:
import matplotlib.pyplot as plt

# Look at x = 4 and x = 6, which appear in both series and demonstrate overlapping bars.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]

x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]

plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')

plt.xlabel("bar number")
plt.ylabel("bar height")
title_obj = plt.title("Bar Chart Example")
plt.setp(title_obj, color='b') 
plt.legend()
plt.show()

### Histogram

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Use numpy to generate a bunch of random data in a bell curve around 5.
n = 5 + np.random.randn(1000)

# x positions 0..999, one bar per raw data point
m = list(range(len(n)))
plt.bar(m, n)
title_obj = plt.title("Raw Data")
plt.setp(title_obj, color='b') 
plt.show()

plt.hist(n, bins=20)
title_obj = plt.title("Histogram")
plt.setp(title_obj, color='b') 
plt.show()

plt.hist(n, cumulative=True, bins=20)
title_obj = plt.title("Cumulative Histogram")
plt.setp(title_obj, color='b') 
plt.show()
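
To see the numbers behind these plots, `np.histogram` computes the same kind of bin counts without drawing anything. This is a sketch under a few assumptions: it uses a seeded `default_rng` generator (not the unseeded `np.random.randn` call above), so the values differ from the plotted ones.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded so the run is repeatable
samples = 5 + rng.standard_normal(1000)   # bell curve centered on 5

# counts[i] is the number of samples between edges[i] and edges[i + 1]
counts, edges = np.histogram(samples, bins=20)
print(counts)

# A running total of the bin counts gives the cumulative histogram
cumulative = np.cumsum(counts)
print(cumulative)
```

With 20 bins there are 21 edges, and the last cumulative value equals the number of samples.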

### Boxplot

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create data (seeded so the plot is reproducible)
np.random.seed(10)
collectn_1 = np.random.normal(100, 10, 200)
collectn_2 = np.random.normal(80, 30, 200)
collectn_3 = np.random.normal(90, 20, 200)
collectn_4 = np.random.normal(70, 25, 200)

# combine these different collections into a list    
data_to_plot = [collectn_1, collectn_2, collectn_3, collectn_4]

# Create a figure instance
fig = plt.figure(1, figsize=(9, 6))

# Create an axes instance
ax = fig.add_subplot(111)

# Create the boxplot
bp = ax.boxplot(data_to_plot)

# display chart
plt.show()
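
A boxplot summarizes each collection with its quartiles. The sketch below computes those values by hand for a small made-up array, assuming Matplotlib's default whisker rule of 1.5 times the interquartile range (the `whis` parameter) and NumPy's default percentile interpolation.

```python
import numpy as np

values = np.array([1, 3, 4, 7, 10, 30])   # made-up data with one large value

# Box edges (Q1, Q3) and the line inside the box (median)
q1, med, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits would be drawn as individual outlier markers
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1:", q1, "median:", med, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```

Here the value 30 lies above the upper limit of 18.25, so it would be plotted as an outlier rather than covered by the whisker.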

### Scatter Plot

In [None]:
import matplotlib.pyplot as plt

# sample lists of coordinates
x1 = [2, 3, 4]
y1 = [5, 5, 5]
x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]

# plot points
plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')

# Add title
title_obj = plt.title("Scatter Plot Example")
plt.setp(title_obj, color='b') 

# display chart
plt.show()

# Time to pair up for an Exercise on Descriptive Statistical Plots