![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# Spring 2016 ADSA Workshop - Data Science Fundamentals Series: Numpy, Statistics and Probability

Workshop content adapted from:
* https://github.com/ADSA-UIUC/PythonWorkshop_2

In [61]:
import numpy as np

This workshop covers two data science fundamentals: statistics and probability. We will talk about the following topics:
* How to use the NumPy library
* Linear Algebra
* Statistics
* Probability with Python

***

Along with managing sets of data, Python and NumPy give you the tools to describe a data set.

Ways to describe data sets
* Length
* Max/Min
* Mean, Median, Mode
* Dispersion (Spread) of values
* Standard Deviation

In [18]:
basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

In [12]:
print "length: ", len(basic_list)

length:  9


In [13]:
print "min: ", min(basic_list)
print "max: ", max(basic_list)

min:  3
max:  15


In [14]:
def mean(x):
    # float() avoids Python 2 integer division, which would truncate the result
    return sum(x) / float(len(x))

print "average: ", mean(basic_list)

average:  8.33333333333


***
You can also sort lists easily with sorted(), which helps when computing measures of central tendency such as the median.

In [51]:
print "original: ", basic_list
sorted_list = sorted(basic_list)
print "sorted:   ", sorted_list

original:  [14, 7, 15, 7, 3, 5, 6, 8, 10]
sorted:    [3, 5, 6, 7, 7, 8, 10, 14, 15]


If you already have a sorted list, you can use indexes to get the min/max values: 

In [32]:
print "Min: ", sorted_list[0]
print "Max: ", sorted_list[-1] # index -1 is the last element

Min:  3
Max:  15


Finding the median is a little less straightforward, because it depends on whether the length of the list is even or odd.

In [46]:
def median(v):
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2 # the '//' makes sure result is an int
    if n % 2 == 1: # if odd, return the middle value
        return sorted_v[midpoint]
    else: # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2.0 # 2.0 avoids Python 2 integer division

In [47]:
print "Median: ", median(basic_list)

Median:  7
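
The list of descriptors at the top also mentions the mode and the spread of values, which are not demonstrated above. A minimal sketch of both follows. The helper names `mode` and `data_range` are illustrative choices, not part of the original workshop code, and the single-argument `print(...)` calls are used so the snippet runs under both Python 2 and Python 3.

```python
from collections import Counter

def mode(values):
    """Return every value that occurs most often (a data set can have ties)."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

def data_range(values):
    """A simple measure of spread: largest value minus smallest value."""
    return max(values) - min(values)

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]
print("mode: {}".format(mode(basic_list)))         # mode: [7]
print("range: {}".format(data_range(basic_list)))  # range: 12
```

Returning a list keeps the function honest when several values tie for most frequent.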


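The quantile function returns the value at position p in the sorted data, where p is a fraction between 0 and 1. For example, p = 0.25 gives the 25th percentile.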

In [52]:
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

In [55]:
print "1st Quartile (25th Percentile): ", quantile(basic_list, .25)
print "3rd Quartile (75th Percentile): ", quantile(basic_list, .75)

1st Quartile (25th Percentile):  6
3rd Quartile (75th Percentile):  10
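
One edge case is worth noting: with the quantile() function above, p = 1 produces the index len(x), which is one past the end of the list and raises an IndexError. The sketch below shows one possible way to guard against that by clamping the index. The name `quantile_clamped` and the range check are assumptions added for illustration, not part of the original workshop code.

```python
def quantile_clamped(x, p):
    """Like quantile() above, but keeps the index inside the list when p == 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    # int(p * len(x)) equals len(x) when p == 1, so cap it at the last valid index
    p_index = min(int(p * len(x)), len(x) - 1)
    return sorted(x)[p_index]

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]
print("25th percentile: {}".format(quantile_clamped(basic_list, 0.25)))  # 6
print("100th percentile: {}".format(quantile_clamped(basic_list, 1.0)))  # 15
```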


In [56]:
def IQR(x): # IQR - interquartile range
    return quantile(x, 0.75) - quantile(x, 0.25)

In [58]:
print "Interquartile Range: ", IQR(basic_list)

Interquartile Range:  4


NumPy provides many more statistical functions than the simple ones you can write yourself, including ready-made versions of the common ones above (for example np.mean, np.median, and np.percentile).

In [63]:
print "Standard Deviation: ", np.std(basic_list)

Standard Deviation:  3.77123616633
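
To see what np.std computes, here is a from-scratch sketch of variance and standard deviation. By default np.std divides by n (the population formula), which the `ddof=0` case below reproduces. The function names are illustrative, and the `ddof` argument simply mirrors the name of NumPy's parameter.

```python
import math

def variance(values, ddof=0):
    """Average squared deviation from the mean; ddof=1 gives the sample variance."""
    n = len(values)
    mean = sum(values) / float(n)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / float(n - ddof)

def standard_deviation(values, ddof=0):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, ddof))

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]
print("population std: {:.4f}".format(standard_deviation(basic_list)))      # 3.7712
print("sample std: {:.4f}".format(standard_deviation(basic_list, ddof=1)))  # 4.0000
```

The population value matches the np.std output above. The sample version divides by n - 1 instead and is the usual choice when the data is a sample from a larger population.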


In [67]:
x = [14, 7, 15, 7, 3, 5, 6, 8, 10]
y = [44, 3, 7, 2, 17, 5, 3, 11, 14]
print "Correlation between x and y: \n", np.corrcoef(x, y)

Correlation between x and y: 
[[ 1.          0.46158177]
 [ 0.46158177  1.        ]]
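
The off-diagonal entries of that matrix are the Pearson correlation between x and y. The diagonal is 1 because each variable is perfectly correlated with itself. As a rough sketch of what is being computed, the coefficient can be written out by hand. The function names here are illustrative and not part of the original workshop code.

```python
import math

def mean(values):
    return sum(values) / float(len(values))

def correlation(xs, ys):
    """Pearson correlation: co-variation of xs and ys, scaled to lie in [-1, 1]."""
    mean_x, mean_y = mean(xs), mean(ys)
    dx = [x - mean_x for x in xs]  # deviations of x from its mean
    dy = [y - mean_y for y in ys]  # deviations of y from its mean
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

x = [14, 7, 15, 7, 3, 5, 6, 8, 10]
y = [44, 3, 7, 2, 17, 5, 3, 11, 14]
print("correlation: {:.4f}".format(correlation(x, y)))  # 0.4616
```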


***
Now it's time for some basic probability and distributions to help with the upcoming workshops.
* Probability: Quantifying the uncertainty associated with a certain set of events
* Used heavily to build and evaluate models

Dependent vs Independent Events
* E, F independent if P(E, F) = P(E)*P(F)
* (The probability of both E and F happening is P(E)*P(F))

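* In general (whether or not E, F are independent), the conditional probability is P(E|F) = P(E, F)/P(F), defined when P(F) > 0, which rearranges to P(E, F) = P(E|F)*P(F)
* E, F are independent exactly when P(E|F) = P(E); otherwise they are dependent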
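
The independence condition can be checked exactly on a small sample space. The sketch below uses two fair dice, an example chosen purely for illustration (it does not appear in the original workshop), and exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate on one outcome."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def is_independent(e, f):
    """True when P(E, F) == P(E) * P(F)."""
    joint = prob(lambda o: e(o) and f(o))
    return joint == prob(e) * prob(f)

def first_even(o):
    return o[0] % 2 == 0

def sum_is_7(o):
    return o[0] + o[1] == 7

def sum_is_8(o):
    return o[0] + o[1] == 8

print("first even vs sum 7: {}".format(is_independent(first_even, sum_is_7)))  # True
print("first even vs sum 8: {}".format(is_independent(first_even, sum_is_8)))  # False
```

Knowing the first die is even tells you nothing about whether the sum is 7, but it does change the chance that the sum is 8.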

Tricky Example: Family with two children
1. Each child is equally likely to be a boy or a girl
2. The gender of the second child is independent of the gender of the first child

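* B = "both children are girls", G = "the older child is a girl"
* P(B|G) = P(B, G)/P(G) = P(B)/P(G) = (1/4)/(1/2) = 1/2
* (P(B, G) = P(B) because B can only happen when G does)

* B = "both children are girls", L = "at least one of the children is a girl"
* P(B|L) = P(B, L)/P(L) = P(B)/P(L) = (1/4)/(3/4) = 1/3
* (Three of the four equally likely families have at least one girl, and only one of those three has two girls)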
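
Both conditional probabilities can be confirmed exactly by listing the four equally likely families. This is a small sketch under the two assumptions stated above. The helper names are illustrative, and it is written to run under both Python 2 and Python 3.

```python
from fractions import Fraction
from itertools import product

# Each family is a pair (older, younger); all four pairs are equally likely.
families = list(product(["boy", "girl"], repeat=2))

def conditional(event, given):
    """P(event | given), computed by counting equally likely families."""
    given_cases = [f for f in families if given(f)]
    both_cases = [f for f in given_cases if event(f)]
    return Fraction(len(both_cases), len(given_cases))

def both_girls(f):
    return f == ("girl", "girl")

def older_is_girl(f):
    return f[0] == "girl"

def at_least_one_girl(f):
    return "girl" in f

print("P(B|G) = {}".format(conditional(both_girls, older_is_girl)))      # P(B|G) = 1/2
print("P(B|L) = {}".format(conditional(both_girls, at_least_one_girl)))  # P(B|L) = 1/3
```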

In [90]:
import random

def random_kid():
    # random.choice (not np.random.choice) so that random.seed(0) below controls the draws
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0
random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
# float() forces true division; Python 2 integer division would print 0 for both ratios
print "P(both | older):", float(both_girls) / older_girl # close to 1/2
print "P(both | either): ", float(both_girls) / either_girl # close to 1/3

With true division, the two ratios come out close to 1/2 and 1/3. The exact digits depend on the seed and the Python version.
