![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# Spring 2016 ADSA Workshop - Data Science Fundamentals Series: NumPy, Statistics and Probability

Workshop content adapted from:
* https://github.com/ADSA-UIUC/PythonWorkshop_2
* [Data Science from Scratch - First Principles with Python](http://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X/)

This workshop dives into data science fundamentals - statistics and probability. We will talk about the following topics:
* How to use NumPy
* Linear Algebra
* Statistics
* Probability with Python

***

## An Introduction to NumPy

NumPy (short for Numerical Python) is part of the SciPy ecosystem, a set of free scientific computing libraries that provide fast mathematical and numerical functions. Like MATLAB, NumPy lets you create powerful arrays and matrices, and it includes linear algebra functions that are very useful for data science and analytics (optimization algorithms live in the companion SciPy library).

In [2]:
# Let's import numpy to use some of its functions
import numpy as np

The central feature of NumPy is the array object class. Arrays are similar to lists in Python, except that every element of an array must be of the same type, typically a numeric type like `float` or `int`. Arrays make operations with large amounts of numeric data very fast and are generally much more efficient than lists.

In [3]:
my_list = [1, 4, 5, 8]
a = np.array(my_list)

print a

[1 4 5 8]
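
The "same type" rule described above can be seen by inspecting an array's `dtype`. This is a small sketch; the names `mixed` and `as_float` are illustrative and not part of the workshop code, and it uses `print(...)` calls so it behaves the same under Python 2 and Python 3.

```python
import numpy as np

# Mixing ints and floats: NumPy converts every element to one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)      # float64

# A type can also be requested explicitly when the array is built
as_float = np.array([1, 4, 5, 8], dtype=float)
print(as_float.dtype)   # float64
```

Here the integer `1` in `mixed` is stored as `1.0`, because a single array cannot hold both ints and floats.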


Array elements are accessed, sliced, and manipulated just like lists.

In [13]:
# accessing elements of the array using an index
# return the 4th element in the array (0-indexed!)
print a[3]

# accessing multiple contiguous elements of the array, also called slicing
print a[:2]

# modifying elements of the array
a[0] = 5
print a

8
[5 4]
[5 4 5 8]


Note that the type of a is **`ndarray`**

In [6]:
print type(a)

<type 'numpy.ndarray'>


The name `ndarray` is short for "N-dimensional array": NumPy can handle multi-dimensional arrays. Let's create a 2-dimensional array.

In [7]:
b = np.array( [[1, 2, 3], [4, 5, 6]] )
print b

[[1 2 3]
 [4 5 6]]


In [8]:
# access the element in the first row, second column
print b[0, 1]

2


In [9]:
# slice the array and access only the 3rd column
print b[:, 2]

[3 6]


The **`shape`** property returns the size of each dimension of the array

In [10]:
print a.shape
print b.shape

(4,)
(2, 3)


The **`in`** operator can be used to check if values are present in the array

In [11]:
print 3 in b

True


In [12]:
print 7 in a

False


Arrays can be reshaped to different dimension sizes.

In [14]:
a = np.array(range(10), float)
print a
print a.shape

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
(10,)


In [16]:
# reshape (10,) array to (5,2)
a = a.reshape((5, 2))
print a
print a.shape

[[ 0.  1.]
 [ 2.  3.]
 [ 4.  5.]
 [ 6.  7.]
 [ 8.  9.]]
(5, 2)
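
Reshaping only works when the new shape holds exactly the same number of elements. The sketch below, with illustrative variable names, shows two related points: passing `-1` lets NumPy infer one dimension, and an incompatible shape raises a `ValueError`.

```python
import numpy as np

a = np.arange(10, dtype=float)   # 10 elements: 0.0 through 9.0

# -1 asks NumPy to infer that dimension from the total size (10 / 2 = 5)
b = a.reshape((-1, 2))
print(b.shape)   # (5, 2)

# 3 * 4 = 12 does not match the 10 available elements, so this fails
try:
    c = a.reshape((3, 4))
except ValueError as e:
    c = None
    print("Error: {}".format(e))
```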


We can create special matrices in NumPy too! Remember that they are still referred to as arrays in NumPy.

In [17]:
# create the identity 2-dimensional array of shape (4,4)
i = np.identity(4)
print i

[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]


In [18]:
# create a (3,3) array with all ones
o = np.ones((3,3))
print o

[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]


We can even do math operations on these arrays. All of the operations below happen element-wise. To do matrix multiplication and other matrix-specific math, we will have to use NumPy's linear algebra functions.

In [19]:
a = np.array([1,2,3], float)
b = np.array([5,2,6], float)
print a
print b

[ 1.  2.  3.]
[ 5.  2.  6.]


In [20]:
print a + b

[ 6.  4.  9.]


In [21]:
print a - b

[-4.  0. -3.]


In [22]:
print a * b

[  5.   4.  18.]


In [23]:
print b / a

[ 5.  1.  2.]


In [24]:
print b ** a

[   5.    4.  216.]


***

## Linear Algebra: Vectors

Linear Algebra is very important in the context of data science. It provides concepts and structures that allow data scientists to efficiently represent data, and do various computations with them. There are two structures that we will talk about today, **vectors** and **matrices**.

Abstractly, vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.

For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors `(height, weight, age)`. If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors `(exam1, exam2, exam3, exam4)`.

In [27]:
mike_data = np.array([70,  # inches,
                      170, # pounds,
                      40   # years
                     ])
print mike_data

[ 70 170  40]


To begin with, we’ll frequently need to **add** two vectors. Vectors add componentwise. This means that if two vectors v and w are the same length, their sum is just the vector whose first element is `v[0] + w[0]`, whose second element is `v[1] + w[1]`, and so on. (If they’re not the same length, then we’re not allowed to add them.)

In [28]:
# create a second vector with Adam's data
adam_data = np.array([72, 192, 31])

print mike_data + adam_data

[142 362  71]
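
To see the "same length" rule in action, here is a minimal sketch using two hypothetical vectors `v` and `w` of different lengths. One caveat worth knowing: NumPy does allow adding a length-1 array or a plain number to any vector (this is called broadcasting), so the error only appears for genuinely mismatched lengths.

```python
import numpy as np

v = np.array([1, 2, 3])
w = np.array([10, 20])   # hypothetical vector with a different length

try:
    total = v + w
except ValueError as e:
    # NumPy refuses to add vectors of length 3 and length 2
    total = None
    print("Error: {}".format(e))
```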


Similarly we can **subtract** vectors too.

In [29]:
print mike_data - adam_data

[ -2 -22   9]


We’ll also need to be able to multiply a vector by a **scalar**, which we do simply by multiplying each element of the vector by that number.

In [30]:
print 2.3 * adam_data

[ 165.6  441.6   71.3]


A less obvious tool is the **dot** product. The dot product of two vectors is the sum of their componentwise products. The mathematical computation for $v \cdot w$ looks like this: $v_1 w_1 + v_2 w_2 + \dots + v_n w_n$

In [31]:
print mike_data.dot(adam_data)

38920


The dot product measures how far the vector `v` extends in the `w` direction. For example, if `w = [1, 0]` then `dot(v, w)` is just the first component of `v`. Another way of saying this is that, when `w` has length 1, it's the length of the vector you would get if you projected `v` onto `w`.
![Dot Product Graph](http://i.imgur.com/jPBLBEK.png?1)
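
The projection idea can be checked numerically. This is a sketch with made-up vectors (`v`, `w`, and `w2` are illustrative): with a unit-length `w` the dot product is the projection length directly, and for a longer `w` we divide by the magnitude of `w` first.

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])   # unit vector along the first axis

# With w = [1, 0], the dot product picks out the first component of v
print(v.dot(w))            # 3.0

# If w is not unit length, divide by its magnitude to get the projection length
w2 = np.array([2.0, 0.0])
projection_length = v.dot(w2) / np.sqrt(w2.dot(w2))
print(projection_length)   # 3.0
```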

Finally, we need to be able to compute the **magnitude** of a vector. In graphical terms, it is just the length of the vector. The mathematical computation for the magnitude of a vector `v` is:
$\sqrt{v_1 ^ 2 + v_2 ^ 2 + \dots + v_n ^ 2}$

In [32]:
# print magnitude of mike_data
print np.sqrt(mike_data[0]**2 + mike_data[1]**2 + mike_data[2]**2)

188.148877222


In [33]:
# another way to compute the magnitude of a vector
print np.sqrt(mike_data.dot(mike_data))

188.148877222
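
NumPy also ships a ready-made function for this, `np.linalg.norm`, which by default computes the same square root of the sum of squares for a 1-dimensional array. A short sketch comparing it with the manual formula:

```python
import numpy as np

mike_data = np.array([70, 170, 40])

# Built-in magnitude (Euclidean norm) of the vector
magnitude = np.linalg.norm(mike_data)

# The same value computed from the dot product, as in the cell above
manual = np.sqrt(mike_data.dot(mike_data))

print(magnitude)                          # about 188.15
print(abs(magnitude - manual) < 1e-9)     # True
```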


## Linear Algebra: Matrices

A matrix is a two-dimensional collection of numbers. We will represent matrices as two-dimensional NumPy arrays, with each row of the array representing a row of the matrix. If `A` is a matrix, then `A[i, j]` (or `A[i][j]`) is the element in the `i`th row and the `j`th column, counting from 0. Per mathematical convention, we will typically use capital letters to represent matrices.

In [37]:
mat_a = np.array( [[1, 2], [5, 6], [14, 15]] )
mat_b = np.array( [[4, 5, 6], [8, 9, 10]] )

print "Matrix A:\n", mat_a
print "\nMatrix B:\n", mat_b

Matrix A:
[[ 1  2]
 [ 5  6]
 [14 15]]

Matrix B:
[[ 4  5  6]
 [ 8  9 10]]


To find out the dimensions of a matrix (number of rows, number of columns), we can use the NumPy `shape` property.

In [38]:
print "Matrix A shape: ", mat_a.shape
print "Matrix B shape: ", mat_b.shape

Matrix A shape:  (3, 2)
Matrix B shape:  (2, 3)


Matrices will be important to us for several reasons.

First, we can use a matrix to represent a data set consisting of multiple vectors, simply by considering each vector as a row of the matrix. For example, if you had the heights, weights, and ages of 1,000 people you could put them in a 1,000 × 3 matrix:

    data = [[70, 170, 40],
            [65, 120, 26],
            [77, 250, 19],
            # ....
            ]
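
As a sketch of why this layout is convenient, the three rows listed above can be stored in a NumPy array (the 1,000-row version would work the same way). Each row is then one person and each column is one measurement; the variable names below are illustrative.

```python
import numpy as np

# Only the three rows shown above; columns are height, weight, age
data = np.array([[70, 170, 40],
                 [65, 120, 26],
                 [77, 250, 19]])

print(data.shape)          # (3, 3)

first_person = data[0]     # one row: a single person's vector
heights = data[:, 0]       # one column: every person's height

# axis=0 averages down the rows, giving one mean per column
mean_per_column = data.mean(axis=0)
print(mean_per_column[1])  # mean weight: 180.0
```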

A distinctive matrix operation is the transpose, which flips the matrix over its leading diagonal: every row becomes a column and every column becomes a row. Let's see an example.

In [39]:
print "Matrix A:\n", mat_a
print "\nMatrix A Transposed:\n", mat_a.T

Matrix A:
[[ 1  2]
 [ 5  6]
 [14 15]]

Matrix A Transposed:
[[ 1  5 14]
 [ 2  6 15]]


We can run similar mathematical operations with matrices, like we did with vectors.

**Adding** and **subtracting** matrices (the matrices need to have the same shape):

In [47]:
try:
    print mat_a + mat_b # will throw an error
except ValueError as e:
    print "Error:", e

Error: operands could not be broadcast together with shapes (3,2) (2,3) 


In [41]:
print mat_a + mat_b.T

[[ 5 10]
 [10 15]
 [20 25]]


**Element-wise multiplication**, which returns a matrix of the same dimensions.

In [48]:
print mat_a * mat_b.T

[[  4  16]
 [ 25  54]
 [ 84 150]]


**Matrix multiplication** is a more complex calculation, which computes a dot product of the rows of the first matrix and the columns of the second matrix, to return a new matrix. You can learn more about matrix multiplication at [Khan Academy – Basic Matrix operations](http://www.khanacademy.org/math/algebra/algebra-matrices) and [Khan Academy – Linear Algebra](http://www.khanacademy.org/math/linear-algebra).
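
The row-by-column rule described above can be sketched with `np.dot`, reusing the two matrices from earlier. The number of columns in the first matrix must equal the number of rows in the second, so a (3, 2) matrix times a (2, 3) matrix gives a (3, 3) result.

```python
import numpy as np

mat_a = np.array([[1, 2], [5, 6], [14, 15]])   # shape (3, 2)
mat_b = np.array([[4, 5, 6], [8, 9, 10]])      # shape (2, 3)

# Matrix product: rows of mat_a dotted with columns of mat_b
product = np.dot(mat_a, mat_b)
print(product.shape)    # (3, 3)

# Entry [0, 0] is row 0 of mat_a dotted with column 0 of mat_b: 1*4 + 2*8
print(product[0, 0])    # 20
```

Note how this differs from `mat_a * mat_b.T`, which multiplies matching entries and keeps the (3, 2) shape.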

***

## Statistics

Along with managing sets of data, Python and NumPy give you the tools to describe your data set.

Ways to describe data sets:
* Length
* Max/Min
* Mean, Median, Mode
* Dispersion (Spread) of values
* Standard Deviation
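
The mode (the most common value) appears in the list above but is not computed in the cells that follow. A minimal sketch using `collections.Counter` from the standard library is shown here; the function name `mode` is our own choice, and it returns a list because a data set can have more than one most-common value.

```python
from collections import Counter

def mode(x):
    """Return a sorted list of the most common value(s) in x."""
    counts = Counter(x)
    max_count = max(counts.values())
    return sorted(value for value, count in counts.items()
                  if count == max_count)

# Same values as the basic_list used in this section
print(mode([14, 7, 15, 7, 3, 5, 6, 8, 10]))   # [7]
```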


In [50]:
basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]




In [51]:
print "length: ", len(basic_list)

length:  9


In [52]:
print "min: ", min(basic_list)
print "max: ", max(basic_list)

min:  3
max:  15


In [53]:
def mean(x):
    # float() avoids Python 2 integer division, which would truncate 75 / 9 to 8
    return float(sum(x)) / len(x)

print "average: ", mean(basic_list)

average:  8.33333333333


You can also easily sort lists with `sorted()`, which helps when finding measures of central tendency such as the median.

In [54]:
print "original: ", basic_list
sorted_list = sorted(basic_list)
print "sorted:   ", sorted_list

original:  [14, 7, 15, 7, 3, 5, 6, 8, 10]
sorted:    [3, 5, 6, 7, 7, 8, 10, 14, 15]


If you already have a sorted list, you can use indexes to get the min/max values: 

In [55]:
print "Min: ", sorted_list[0]
print "Max: ", sorted_list[-1] # last element

Min:  3
Max:  15


Otherwise you can also use the `min` and `max` functions:

In [64]:
print "Min: ", min(basic_list)
print "Max: ", max(basic_list)

Min:  3
Max:  15


Finding the median is a little less straightforward: the calculation depends on whether the length of the list is even or odd.

In [56]:
def median(v):
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2 # the '//' makes sure result is an int
    if n % 2 == 1: # if odd, return the middle value
        return sorted_v[midpoint]
    else: # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2.0 # 2.0 avoids integer division

In [57]:
print "Median: ", median(basic_list)

Median:  7
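
The example above only exercises the odd-length branch. Here is a self-contained sketch of the same function applied to a hypothetical even-length list, where the two middle values are averaged. It divides by `2.0` so that the result is not truncated under Python 2.

```python
def median(v):
    """Middle value of v, or the average of the two middle values."""
    sorted_v = sorted(v)
    n = len(sorted_v)
    midpoint = n // 2
    if n % 2 == 1:
        # odd length: a single middle value exists
        return sorted_v[midpoint]
    # even length: average the two values around the middle
    return (sorted_v[midpoint - 1] + sorted_v[midpoint]) / 2.0

even_list = [8, 3, 6, 5]    # hypothetical data; sorted it is [3, 5, 6, 8]
print(median(even_list))    # 5.5
```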


The quantile function returns the value below which a fraction `p` of the data lies; for example, `p = 0.25` gives the 25th percentile.

In [58]:
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

In [59]:
print "1st Quartile (25th Percentile): ", quantile(basic_list, .25)
print "3rd Quartile (75th Percentile): ", quantile(basic_list, .75)

1st Quartile (25th Percentile):  6
3rd Quartile (75th Percentile):  10


In [60]:
def IQR(x): #IQR - interquartile range
    return quantile(x, 0.75) - quantile(x, 0.25)

In [61]:
print "Interquartile Range: ", IQR(basic_list)

Interquartile Range:  4
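
For comparison, NumPy offers `np.percentile`, which takes a percentage between 0 and 100 rather than a fraction. It interpolates between neighboring values by default, so in general it can differ slightly from the simple `quantile` function above; for this particular list the two agree.

```python
import numpy as np

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

# Note the argument is 25, not 0.25
q1 = np.percentile(basic_list, 25)
q3 = np.percentile(basic_list, 75)

print(q1)        # 6.0
print(q3)        # 10.0
print(q3 - q1)   # 4.0, the interquartile range
```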


In addition to the many simple statistical functions you can write yourself, NumPy gives you access to many more, including the common ones from above.

In [63]:
print "Standard Deviation: ", np.std(basic_list)

Standard Deviation:  3.77123616633
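
To show what `np.std` computes, here is a from-scratch sketch: the square root of the average squared distance from the mean. By default `np.std` divides by `n` (the population standard deviation); passing `ddof=1` divides by `n - 1` instead, which gives the sample standard deviation. The helper name `std` is our own.

```python
import math
import numpy as np

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

def std(x):
    """Population standard deviation: sqrt of the mean squared deviation."""
    n = len(x)
    mean = float(sum(x)) / n
    return math.sqrt(sum((x_i - mean) ** 2 for x_i in x) / n)

# The hand-written version agrees with NumPy's default
print(abs(std(basic_list) - np.std(basic_list)) < 1e-9)   # True

# ddof=1 divides by n - 1 rather than n
sample_std = np.std(basic_list, ddof=1)
print(sample_std)   # about 4.0
```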


In [67]:
x = [14, 7, 15, 7, 3, 5, 6, 8, 10]
y = [44, 3, 7, 2, 17, 5, 3, 11, 14]
print "Correlation between x and y: \n", np.corrcoef(x, y)

Correlation between x and y: 
[[ 1.          0.46158177]
 [ 0.46158177  1.        ]]
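
`np.corrcoef` returns a matrix: the diagonal holds each variable's correlation with itself (always 1), and the off-diagonal entries hold the correlation between `x` and `y`. The sketch below recomputes that off-diagonal value by hand as the dot product of the centered vectors divided by the product of their magnitudes. The names `dx`, `dy`, and `r` are illustrative.

```python
import numpy as np

x = np.array([14, 7, 15, 7, 3, 5, 6, 8, 10], dtype=float)
y = np.array([44, 3, 7, 2, 17, 5, 3, 11, 14], dtype=float)

# Center each vector by subtracting its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson correlation: dot product scaled by both magnitudes
r = dx.dot(dy) / np.sqrt(dx.dot(dx) * dy.dot(dy))

# Matches the off-diagonal entry of the correlation matrix
print(abs(r - np.corrcoef(x, y)[0, 1]) < 1e-9)   # True
```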


***

## Probability

Now it's time for some basic probability and distributions to help with the upcoming workshops.
* Probability: Quantifying the uncertainty associated with a certain set of events
* Used heavily to build and evaluate models

Dependent vs Independent Events
* E, F independent if P(E, F) = P(E)*P(F)
* (The probability of both E and F happening is P(E)*P(F))
* If E, F are not necessarily independent, the conditional probability is P(E|F) = P(E, F)/P(F), so P(E, F) = P(E|F)*P(F)

Tricky Example: Family with two children
1. Each child is equally likely to be a boy or a girl
2. The gender of the second child is independent of the gender of the first child


* B = "both children are girls", G = "the older child is a girl"
* P(B|G) = P(B, G)/P(G) = P(B)/P(G) = (1/4) / (1/2) = 1/2
* B = "both children are girls", L = "at least one of the children is a girl"
* P(B|L) = P(B, L)/P(L) = P(B)/P(L) = (1/4) / (3/4) = 1/3 (not 1/2, which is the surprising part)
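
These values can also be checked exactly, without simulation, by listing the four equally likely families and counting. This is a sketch using the standard library's `fractions` and `itertools`; the helper `prob` and the event names are our own, and each event is written as a function that takes an `(older, younger)` pair.

```python
from fractions import Fraction
from itertools import product

# Sample space: (older, younger), four equally likely outcomes
outcomes = list(product(["boy", "girl"], repeat=2))

def prob(event):
    """Exact probability of an event (a predicate on one outcome)."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def both(o):
    return o == ("girl", "girl")

def older(o):
    return o[0] == "girl"

def younger(o):
    return o[1] == "girl"

def either(o):
    return "girl" in o

# Independence: P(older girl, younger girl) = P(older girl) * P(younger girl)
print(prob(both) == prob(older) * prob(younger))   # True

# Conditional probability: P(B|G) = P(B, G) / P(G)
p_both_given_older = prob(lambda o: both(o) and older(o)) / prob(older)
p_both_given_either = prob(lambda o: both(o) and either(o)) / prob(either)
print(p_both_given_older)    # 1/2
print(p_both_given_either)   # 1/3
```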

In [67]:
# random_kid() draws from np.random, which is left unseeded here,
# so the exact numbers vary slightly from run to run

def random_kid():
    return np.random.choice(["boy", "girl"])

both_girls = 0.0
older_girl = 0.0
either_girl = 0.0
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
print "P(both | older):", both_girls / older_girl # close to 1/2
print "P(both | either): ", both_girls / either_girl # close to 1/3

P(both | older): 0.506722857716
P(both | either):  0.339336110738
