
## Principle 3 - Distributions: the Normal Distribution


As it turns out, when we make observations about our data (counting M&M colors, measuring student heights, taking blood pressure readings) we often come across something called the [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution). There are several special values and terms, but put very simply, the normal distribution tells us that most of our samples should be close to the average. The idea is that most things of the same type (students in the same school/grade, cars being made at a factory, etc.) have lots of similarities. Sometimes (in the case of cars) the individual members of a population should be nearly identical; it would not be good business if every 5th car assembled was a junker. Although (adult) rabbits might look largely the same to you, as a population they do vary (they have more [variance](https://en.wikipedia.org/wiki/Variance)), but still they are largely average.
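
As a rough illustration of "most samples are close to the average", the sketch below draws simulated values from a normal distribution with Python's built-in `random.gauss` and counts how many land within one standard deviation of the mean. The mean of 70 and standard deviation of 10 are made-up numbers chosen only for this example; they are not taken from the lesson's data.

```python
import random

random.seed(0)  # fix the seed so the simulation is repeatable

# Hypothetical population: mean 70, standard deviation 10 (made-up values)
simulated_mean = 70
simulated_sd = 10
samples = [random.gauss(simulated_mean, simulated_sd) for _ in range(10000)]

# Count the samples that fall within one standard deviation of the mean
close_to_average = 0
for sample in samples:
    if simulated_mean - simulated_sd <= sample <= simulated_mean + simulated_sd:
        close_to_average = close_to_average + 1

fraction_close = close_to_average / len(samples)
print('Fraction within one standard deviation of the mean:', round(fraction_close, 2))
```

For a normal distribution this fraction comes out to roughly 0.68, which is one way of making "most samples are about average" precise.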

In general, believing that you are not average may make you susceptible to a type of cognitive bias: see the [Lake Wobegon effect](https://en.wikipedia.org/wiki/Illusory_superiority).

### Python Challenge - calculating the standard deviation

In order to understand the graph of the normal distribution we need to define and calculate terms (some of which you are already familiar with). First let's consider some grades for 1st period AP Bio on a test:

In [None]:
first_period = [74,76,86,88,86,96,83,53,99,71,75,73,66,84,62,97,71,69,73,74,84,68,77,37,79,61,81,86,74,62,57]

The statistical terms we need to know in order to calculate the distribution are as follows:

1. **Mean** (µ) = the sum of all of the numbers, divided by the number of observations, or **(∑(x_1…x_n))/n**
2. **Mean Deviation** = for each of the numbers, take the absolute value of its difference from the mean; sum these results and divide by the number of observations, or **∑|x-µ|/n** (without the absolute value, the differences would always sum to zero)
3. **Variance** = similar to the mean deviation, except you square each difference (x-µ) before summing, and divide by one less than the total number of observations, or **∑(x-µ)^2/(n-1)**
4. **Standard Deviation** = the square root of the variance, or **sqrt(∑(x-µ)^2/(n-1))**
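
To see the four definitions side by side, here is a minimal sketch on a tiny list. The values in `tiny` are made up so the arithmetic is easy to check by hand, and the sketch assumes "mean deviation" means the average absolute difference from the mean. It also prints the sum of the signed differences, which shows why we need absolute values or squares in the first place.

```python
import math

tiny = [2, 4, 4, 6]  # made-up numbers, chosen so the arithmetic is easy to follow
n = len(tiny)

mean = sum(tiny) / n  # (2 + 4 + 4 + 6) / 4

signed_total = 0      # sum of (x - mean)
absolute_total = 0    # sum of |x - mean|
squared_total = 0     # sum of (x - mean)^2
for x in tiny:
    difference = x - mean
    signed_total = signed_total + difference
    absolute_total = absolute_total + abs(difference)
    squared_total = squared_total + difference**2

mean_deviation = absolute_total / n
variance = squared_total / (n - 1)
standard_deviation = math.sqrt(variance)

print('mean:', mean)
print('sum of signed differences:', signed_total)
print('mean deviation:', mean_deviation)
print('variance:', variance)
print('standard deviation:', standard_deviation)
```

The signed differences cancel out to zero, so they tell us nothing about spread; squaring each difference (or taking its absolute value) keeps every observation's distance from the mean positive.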

In [None]:
#Let's calculate these in Python using a small set of data first

subset = [74,76,86,88,86,96]


In [None]:
# Calculate mean, first let's get the number of observations:

number_of_subset_observations = len(subset)
print(number_of_subset_observations)

In [None]:
#Now we want to take all of the values of the list, sum them, and divide by the number of observations.
# We could do it this way

sum_of_subset = subset[0] + subset[1] + subset[2] + subset[3] + subset[4] + subset[5] 
mean_subset = sum_of_subset/number_of_subset_observations 
print(mean_subset)

In [None]:

#There is a better way of doing this. We will use a 'for' loop to do the same
# operation to each item in our list. First we will create a variable to hold our
# final answer. Remember our data is in the variable 'subset'

final_sum = 0

for observation in subset:
    final_sum = final_sum + observation
    
print(final_sum)
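
Adding up a list is common enough that Python also has a built-in `sum()` function for it. The short sketch below repeats the loop and compares its result with `sum()`; the name `built_in_sum` is introduced here only for the comparison.

```python
subset = [74, 76, 86, 88, 86, 96]

# Add the values up one at a time with a for loop
final_sum = 0
for observation in subset:
    final_sum = final_sum + observation

# Let Python's built-in sum() do the same job
built_in_sum = sum(subset)

print(final_sum, built_in_sum)
```

Both approaches give the same total. Writing the loop by hand is still worth practicing, because the variance calculation later needs a step (squaring) that `sum()` alone does not do.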


In [None]:
#after the for loop, we can easily calculate the mean

for_loop_mean_subset = final_sum/number_of_subset_observations 
print(for_loop_mean_subset)

# Since we don't need the mean deviation directly, we will just calculate the variance.
# To do so we need to take each number, subtract the mean from that number, square the result,
# and sum the squared results. We will use a for loop to do this.

A for loop has the following structure:

### for temporary_variable in iterable:
### (indent)instruction[temporary_variable]

Let's break this down a bit...

* ``for`` - a for loop must start with a ``for`` statement
* ``temporary_variable`` - the next word right after the ``for`` is the name of a special variable. This variable is a placeholder that holds each object from the iterable in turn, one per pass through the loop.
* ``in`` - this ``in`` must be included and tells Python which iterable it should execute the for loop on
* ``iterable:`` - the iterable is any collection Python can step through one item at a time (such as a string or a list). A ``:`` must come after the iterable.
* (indent) - the next line of a for loop must always be indented. The best practice is to use 4 spaces (not the tab key)
* ``instruction`` - these are the instructions you want Python to execute on each pass. If your instructions make use of the variable (they don't have to), you will use ``temporary_variable`` (whatever you have named it)
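
To make the structure above concrete, here is a small sketch using made-up data. The names `colors`, `color`, `letter`, and `letter_count` are invented for this example; any valid variable names would work.

```python
# Loop over a list: the temporary variable 'color' holds one item per pass
colors = ['red', 'green', 'blue']
for color in colors:
    print(color)

# Loop over a string: here the instruction does not use the temporary variable,
# it simply runs once for every letter
letter_count = 0
for letter in 'rabbit':
    letter_count = letter_count + 1

print(letter_count)
```

The first loop uses its temporary variable in the instruction; the second shows that the instruction does not have to, since it only counts how many times the loop runs.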
    

### Create a for loop to calculate the variance

In [None]:
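# You will need the following pieces for your for loop.
# Try your hand, and then scroll down to a solved cell.


varience_of_subset = 0  # running sum of the squared differences
# for_loop_mean_subset: the mean we calculated above
# squared_result: (observation - mean) squared, calculated inside the loop
number_of_subset_observations_minus1 = number_of_subset_observations - 1

for observation in subset:
    pass  # replace this line with your own instructions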
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

In [None]:

# Solved

varience_of_subset = 0  # running sum of the squared differences
# for_loop_mean_subset holds the mean we calculated earlier
number_of_subset_observations_minus1 = number_of_subset_observations - 1

for observation in subset:
    # square the difference between this observation and the mean
    squared_result = (observation - for_loop_mean_subset)**2
    varience_of_subset = varience_of_subset + squared_result

# divide the sum of squared differences by (n - 1)
final_varience_of_subset = varience_of_subset/number_of_subset_observations_minus1
print('Our calculated value of variance',final_varience_of_subset)


#We can check this using a python library

import statistics

print('Python statistics value for variance',statistics.variance(subset))

In [None]:
# To calculate the standard deviation we just need the square root of the variance

# We will use the square root function from the math library.
import math

print(math.sqrt(final_varience_of_subset))
print(statistics.stdev(subset))
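
One way to connect the standard deviation back to the normal distribution is to count how many grades fall within one standard deviation of the mean. The sketch below does this for the whole class using the `statistics` library; the names `class_mean`, `class_sd`, `lower`, `upper`, and `within_one_sd` are introduced here just for this example.

```python
import statistics

first_period = [74,76,86,88,86,96,83,53,99,71,75,73,66,84,62,97,71,69,73,74,
                84,68,77,37,79,61,81,86,74,62,57]

class_mean = statistics.mean(first_period)
class_sd = statistics.stdev(first_period)  # sample standard deviation (divides by n-1)

# The range that is one standard deviation either side of the mean
lower = class_mean - class_sd
upper = class_mean + class_sd

within_one_sd = 0
for grade in first_period:
    if lower <= grade <= upper:
        within_one_sd = within_one_sd + 1

fraction_within = within_one_sd / len(first_period)
print(within_one_sd, 'of', len(first_period), 'grades are within one standard deviation of the mean')
```

For a perfectly normal distribution we would expect roughly 68% of values to fall in this range. A single class of 31 students will not match that exactly, but most grades should still land inside it.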

In [None]:
#We can also graph our data (going back to our whole class)


import matplotlib.pyplot as plot
import numpy as np
import scipy.stats as stats
%matplotlib inline

In [None]:
#generate the plot

sorted_data = sorted(first_period)
# ddof=1 gives the sample (n-1) standard deviation, matching the definition used earlier
fit = stats.norm.pdf(sorted_data, np.mean(sorted_data), np.std(sorted_data, ddof=1))
plot.plot(sorted_data,fit,'-o')
# density=True scales the bars so the histogram can be compared with the fitted curve
plot.hist(sorted_data,density=True)
plot.show()