# Statistics Fundamentals 

_October 27, 2020_

Today's agenda:
- Measures of central tendency: mean, median, mode
- Measures of dispersion: variance, standard deviation
- Measures of relationship: covariance and correlation

In [19]:
import numpy as np
import matplotlib.pyplot as plt

## Part I. Mean, Median, and Mode
What are the definitions of these three measures?
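As a quick reference for the question above, here is a minimal sketch using Python's standard `statistics` module. The `sample` list is made-up illustration data, not data from this notebook.

```python
import statistics

# hypothetical example data with one large value (9)
sample = [1, 2, 2, 3, 9]

mean = statistics.mean(sample)      # arithmetic average: sum of values / number of values
median = statistics.median(sample)  # middle value once the data is sorted
mode = statistics.mode(sample)      # most frequently occurring value

print(mean, median, mode)
```

Note how the single large value pulls the mean above the median, which is one reason to look at more than one measure.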

In [20]:
array = [10,11,11,12,11,13,14,16,17,18,19,20,22,24,26,22,24]
# plot it out and examine it 
plt.style.use('fivethirtyeight')

What is the above plot called? What kind of values can it be used to represent?

## Part II. Measure of Dispersion
The two measures of dispersion we will be concerned with are **variance** and **standard deviation**. Both quantify how spread out a dataset is around its mean. Why might we need a measure of variability in addition to central tendency?
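One way to approach that question is to compare two datasets that share a mean but differ in spread. The sketch below uses made-up lists (`tight` and `wide` are hypothetical names) and the standard library's `statistics.pstdev`, which computes the population standard deviation.

```python
import statistics

# two hypothetical datasets with the same mean but different spread
tight = [9, 10, 11]
wide = [0, 10, 20]

tight_mean, wide_mean = statistics.mean(tight), statistics.mean(wide)
tight_std, wide_std = statistics.pstdev(tight), statistics.pstdev(wide)

print(tight_mean, wide_mean)  # the means are identical
print(tight_std, wide_std)    # the standard deviations are not
```

The mean alone cannot tell these two datasets apart; a measure of dispersion can.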

#### Variance calculation:
$$ \large \sigma^2 = \dfrac{1}{n}\displaystyle\sum^n_{i=1}(x_i-\mu)^2 $$

#### Standard deviation calculation:
$$ \large \sigma = \sqrt{\dfrac{1}{n}\displaystyle\sum^n_{i=1}(x_i-\mu)^2} $$

In [52]:
# exercises

# can you write functions that take in an array and calculate the variance and standard deviation?
def calculate_variance(array):
    '''
    calculate the population variance of an array (divides by n, as in the formula above)
    '''
    # the mean divides by n; dividing by n - 1 here would shift every deviation
    average = sum(array) / len(array)
    squared_deviations = [(num - average)**2 for num in array]
    return sum(squared_deviations) / len(array)
round(calculate_variance(array), 4)

26.526

In [22]:
import math
def calculate_std(array):
    '''
    calculate the standard deviation of an array
    '''
    return math.sqrt(calculate_variance(array))
calculate_std(array)

5.150335091728831
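One way to sanity-check hand-written functions like these is to compare them against the standard library. The sketch below assumes the same `array` values as above (repeated so the block runs on its own) and uses `statistics.pvariance` and `statistics.pstdev`, which follow the same divide-by-n population formulas; `statistics.variance` is included to show the divide-by-(n - 1) sample version for contrast.

```python
import statistics

array = [10,11,11,12,11,13,14,16,17,18,19,20,22,24,26,22,24]

pop_var = statistics.pvariance(array)    # population variance: divides by n
pop_std = statistics.pstdev(array)       # square root of the population variance
sample_var = statistics.variance(array)  # sample variance: divides by n - 1

print(round(pop_var, 4), round(pop_std, 4), round(sample_var, 4))
```

The population standard deviation should agree with the 5.1503... result above, and the sample variance is always slightly larger than the population variance for the same data.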

## Part III. Covariance and Correlation
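Covariance and correlation both measure how two variables vary together. Covariance gives the direction of the linear relationship in the variables' own units, while correlation rescales it to a unitless value between -1 and 1. The covariance formula below is the population version (dividing by $n$); the sample version divides by $n-1$, which is what the exercise code and pandas' `df.cov()` use.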

#### Covariance calculation:
$$Cov_{X,Y} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$

#### Correlation calculation:
$$ r = \frac{cov(X,Y)} {\sigma_x  \sigma_y}$$

<img src= 'https://raw.githubusercontent.com/learn-co-curriculum/dsc-correlation-covariance/master/images/correx.svg'>
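The two formulas can be sketched in plain Python as follows. This is an illustrative sketch, not the notebook's solution: the function names `covariance` and `correlation`, the `ddof` parameter (a name borrowed from the NumPy/pandas convention), and the `xs`/`ys` data are all assumptions made for the example.

```python
import math

def covariance(xs, ys, ddof=0):
    """Covariance of two equal-length sequences.

    ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # product of paired deviations about each mean
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - ddof)

def correlation(xs, ys, ddof=0):
    """Pearson r: covariance scaled by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs, ddof))  # cov(X, X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys, ddof))
    return covariance(xs, ys, ddof) / (std_x * std_y)

# hypothetical data: ys is an exact linear function of xs
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

print(covariance(xs, ys, ddof=0), covariance(xs, ys, ddof=1))
print(correlation(xs, ys, ddof=0), correlation(xs, ys, ddof=1))
```

The covariance changes with the choice of divisor, but the correlation does not, because the same divisor appears in the numerator and the denominator and cancels.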

In [47]:
## exercises

# write functions that calculate the covariance and correlation of two arrays

def calculate_covariance(array1, array2):
    '''
    calculate the sample covariance of two arrays (divides by n - 1, like pandas' df.cov())
    '''
    if len(array1) != len(array2):
        raise ValueError("Both arrays must contain the same number of data points.")

    x_average = sum(array1)/len(array1)
    y_average = sum(array2)/len(array2)
    # multiply each pair of deviations about the means
    x_y_dev_products = [(x_i - x_average)*(y_i - y_average) for x_i, y_i in zip(array1, array2)]
    return sum(x_y_dev_products)/(len(x_y_dev_products)-1)

In [24]:
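def calculate_correlation(array1, array2):
    '''
    calculate the Pearson correlation of two arrays
    '''
    # cov(X, X) is the variance of X under the same normalization as cov(X, Y),
    # so the divisor cancels in the ratio
    std1 = math.sqrt(calculate_covariance(array1, array1))
    std2 = math.sqrt(calculate_covariance(array2, array2))
    return calculate_covariance(array1, array2)/(std1*std2)

In [25]:
list1 = [2,3,4,12,12,43,13]
list2 = [5,7,8,3,5,7,13]

In [48]:
calculate_covariance(list1, list2)

3.119047619047619

In [53]:
round(calculate_correlation(list1, list2), 6)

0.069152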

In [42]:
import pandas as pd
data = {'x': list1, 'y': list2}
df = pd.DataFrame(data)
df.head()

|   | x | y |
|---|---|---|
| 0 | 2 | 5 |
| 1 | 3 | 7 |
| 2 | 4 | 8 |
| 3 | 12 | 3 |
| 4 | 12 | 5 |


In [43]:
df.cov()

|   | x | y |
|---|---|---|
| x | 200.571429 | 3.119048 |
| y | 3.119048 | 10.142857 |


In [50]:
df.corr()

|   | x | y |
|---|---|---|
| x | 1.0 | 0.069152 |
| y | 0.069152 | 1.0 |
