![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# Spring 2017 ADSA Workshop - Data Science Fundamentals Series: NumPy, Statistics and Probability

Workshop content adapted from:
* https://github.com/ADSA-UIUC/PythonWorkshop_2
* [Data Science from Scratch - First Principles with Python](http://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X/)

This workshop dives into data science fundamentals - statistics and probability. We will talk about the following topics:
* How to use NumPy
* Linear Algebra
* Statistics
* Histograms
* Probability with Python

***

## An Introduction to NumPy

NumPy (Numerical Python) is part of a set of free scientific computing libraries known as the SciPy ecosystem, which provide fast mathematical and numerical functions. Much like MATLAB, NumPy lets you create powerful arrays and matrices, and it includes linear algebra functions that are very useful for data science and analytics. Optimization algorithms are provided by the companion SciPy library.

In [None]:
# Let's import numpy to use some of its functions
import numpy as np

The central feature of NumPy is the array object class. Arrays are similar to lists in Python, except that every element of an array must be of the same type, typically a numeric type like `float` or `int`. Arrays make operations with large amounts of numeric data very fast and are generally much more efficient than lists.
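
As a small sketch of the two claims above (the names `mixed`, `values`, and `doubled` are illustrative and not part of the workshop), the following shows NumPy choosing a single type for a whole array and applying arithmetic to every element at once:

```python
import numpy as np

# Mixing ints and a float upcasts every element to one shared dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# Arithmetic is applied to every element without an explicit Python loop.
values = np.array([1, 4, 5, 8])
doubled = values * 2
print(doubled)       # elements: 2, 8, 10, 16
```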

In [None]:
my_list = [1, 4, 5, 8]
a = np.array(my_list)

print(a)

Array elements are accessed, sliced, and manipulated just like lists.


In [None]:
# accessing elements of the array using an index
# return the 4th element in the array (0-indexed!)
print(a[3])

# accessing multiple continuous elements of the array, also called slicing
print(a[:2])

# modifying elements of the array
a[0] = 5
print(a)

Note that the type of a is **`numpy.ndarray`**

In [None]:
print(type(a))

The name `ndarray` stands for N-dimensional array, which means that NumPy can handle multi-dimensional arrays. Let's create a 2-dimensional array.

In [None]:
b = np.array( [[1, 2, 3], [4, 5, 6]] )
print(b)

In [None]:
# access the element in the first row, second column
print(b[0, 1])

In [None]:
# slice the array and access only the 3rd column
print(b[:, 2])

The **`shape`** property returns the size of each dimension of the array

In [None]:
print(a.shape)
print(b.shape)

The **`in`** operator can be used to check if values are present in the array

In [None]:
print(3 in b)

Check if 7 is in the array a. 

In [None]:
#insert code here

Arrays can be reshaped to different dimension sizes.

In [None]:
a = np.array(range(10), float)
print(a)
print(a.shape)

In [None]:
# reshape (10,) array to (5,2)
a = a.reshape((5, 2))  # the product of the two arguments
                       # must equal the total number of elements in the array
print(a)
print(a.shape)

We can create special matrices in NumPy too! Remember that they are still referred to as arrays in NumPy.

In [None]:
# create the identity 2-dimensional array of shape (4,4)
i = np.identity(4)
print(i)

In [None]:
# create a (3,3) array with all ones
o = np.ones((3,3))
print(o)

We can even do math operations on these arrays. All of the operations below happen element-wise. To do matrix multiplication and other matrix-specific math, we will have to use NumPy's linear algebra functions.

In [None]:
a = np.array([1,2,3], float)
b = np.array([5,2,6], float)
print(a)
print(b)

In [None]:
#sum 

In [None]:
#subtract

In [None]:
#multiplication

In [None]:
#division

In [None]:
#square each element
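
One possible way to fill in the cells above is sketched here. It redefines `a` and `b` so that it runs on its own, and the result names (`total`, `difference`, and so on) are illustrative choices rather than part of the workshop.

```python
import numpy as np

a = np.array([1, 2, 3], float)
b = np.array([5, 2, 6], float)

# Every operator below works element by element.
total = a + b        # 6, 4, 9
difference = a - b   # -4, 0, -3
product = a * b      # 5, 4, 18
quotient = a / b     # 0.2, 1, 0.5
squares = a ** 2     # 1, 4, 9

for result in (total, difference, product, quotient, squares):
    print(result)
```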

***

## Linear Algebra: Vectors

Linear Algebra is very important in the context of data science. It provides concepts and structures that allow data scientists to efficiently represent data, and do various computations with them. There are two structures that we will talk about today, **vectors** and **matrices**.

Abstractly, vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.

For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors `(height, weight, age)`. If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors `(exam1, exam2, exam3, exam4)`.

In [None]:
mike_data = np.array([70,  # inches,
                      170, # pounds,
                      40   # years
                     ])
print(mike_data)

To begin with, we’ll frequently need to **add** two vectors. Vectors add componentwise. This means that if two vectors v and w are the same length, their sum is just the vector whose first element is `v[0] + w[0]`, whose second element is `v[1] + w[1]`, and so on. (If they’re not the same length, then we’re not allowed to add them.)

In [None]:
# create a second vector with Adam's data
adam_data = np.array([72, 192, 31])

print(mike_data + adam_data)

Similarly we can **subtract** vectors too.

In [None]:
print(mike_data - adam_data)

We’ll also need to be able to multiply a vector by a **scalar**, which we do simply by multiplying each element of the vector by that number.

In [None]:
print(2.3 * adam_data)

A less obvious tool is the **dot** product. The dot product of two vectors is the sum of their componentwise products. The mathematical computation for $v \cdot w$ looks like this: $v_1 w_1 + v_2 w_2 + \dots + v_n w_n$

In [None]:
print(mike_data.dot(adam_data))
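
To connect the formula to the code, here is a minimal sketch that computes the same dot product by multiplying componentwise and summing, then compares it with `np.dot`. The names `by_hand` and `built_in` are illustrative.

```python
import numpy as np

mike_data = np.array([70, 170, 40])
adam_data = np.array([72, 192, 31])

# Sum of componentwise products: 70*72 + 170*192 + 40*31
by_hand = np.sum(mike_data * adam_data)

# The built-in dot product gives the same number.
built_in = np.dot(mike_data, adam_data)

print(by_hand, built_in)   # 38920 38920
```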

You can also compute the cross product, which is defined for three-dimensional vectors and returns a vector perpendicular to both inputs.

In [None]:
print(np.cross(mike_data, adam_data))

The dot product measures how far the vector `v` extends in the `w` direction. For example, if `w = [1, 0]` then `dot(v, w)` is just the first component of `v`. When `w` has length 1, another way of saying this is that it’s the length of the vector you would get if you projected `v` onto `w`.
![Dot Product Graph](http://i.imgur.com/jPBLBEK.png?1)
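
The projection reading of the dot product only holds directly when `w` has length 1. The sketch below illustrates this with made-up vectors (`v`, `w`, `long_w`, and `unit_w` are illustrative names, not workshop data): a longer `w` scales the dot product, and dividing `w` by its length recovers the projection length.

```python
import numpy as np

v = np.array([3.0, 4.0])

# w already has length 1, so the dot product is the projection length.
w = np.array([1.0, 0.0])
print(np.dot(v, w))            # 3.0

# long_w points the same way but has length 2, so the raw dot product doubles.
long_w = np.array([2.0, 0.0])
print(np.dot(v, long_w))       # 6.0

# Dividing long_w by its length recovers the projection length.
unit_w = long_w / np.linalg.norm(long_w)
projection_length = np.dot(v, unit_w)
print(projection_length)       # 3.0
```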

Finally, we need to be able to compute the **magnitude** of a vector. In graphical terms, it is just the length of the vector. The mathematical computation for the magnitude of a vector `v` is:
$\sqrt{v_1 ^ 2 + v_2 ^ 2 + \dots + v_n ^ 2}$

In [None]:
# print magnitude of mike_data
print(np.sqrt(mike_data[0]**2 + mike_data[1]**2 + mike_data[2]**2))

In [None]:
# another way to compute the magnitude of a vector 
# insert code here
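
One possible answer for the cell above, offered as a sketch rather than the only approach: NumPy's `np.linalg.norm` computes the magnitude directly, and the square root of a vector's dot product with itself gives the same value. The names `norm_magnitude` and `dot_magnitude` are illustrative.

```python
import numpy as np

mike_data = np.array([70, 170, 40])

# Built-in Euclidean norm.
norm_magnitude = np.linalg.norm(mike_data)

# Same value: v . v is the sum of squared components.
dot_magnitude = np.sqrt(mike_data.dot(mike_data))

print(norm_magnitude, dot_magnitude)
```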

***

## Linear Algebra: Matrices

A matrix is a two-dimensional collection of numbers. Conceptually it is a list of lists, with each inner list having the same size and representing a row of the matrix. In this workshop we store matrices as 2-dimensional NumPy arrays. If `A` is a matrix, then `A[i][j]` (or `A[i, j]` in NumPy) is the element in the `i`th row and the `j`th column. Per mathematical convention, we will typically use capital letters to represent matrices.

In [None]:
mat_a = np.array( [[1, 2], [5, 6], [14, 15]] )
mat_b = np.array( [[4, 5, 6], [8, 9, 10]] )

print("Matrix A:\n", mat_a)
print("\nMatrix B:\n", mat_b)

To find out what the dimensions (number of rows vs. number of columns) of a matrix are, we can use the NumPy `shape` property.

In [None]:
print ("Matrix A shape: ", mat_a.shape)
print ("Matrix B shape: ", mat_b.shape)

Matrices will be important to us for several reasons.

First, we can use a matrix to represent a data set consisting of multiple vectors, simply by considering each vector as a row of the matrix. For example, if you had the heights, weights, and ages of 1,000 people you could put them in a 1,000 × 3 matrix:

    data = [[70, 170, 40],
            [65, 120, 26],
            [77, 250, 19],
            # ....
            ]

A distinctive matrix operation is the transpose, which flips the matrix across its leading diagonal: every row becomes a column and every column becomes a row, so the element at position (i, j) moves to position (j, i). Let's see an example.

In [None]:
print ("Matrix A:\n", mat_a)
print ("\nMatrix A Transposed:\n", mat_a.T)

We can run similar mathematical operations with matrices, like we did with vectors.

**Adding** and **subtracting** matrices (matrices need to have the same shape):

In [None]:
try:
    print(mat_a + mat_b) # will throw an error
except ValueError as e:
    print("Error:", e)

In [None]:
print(mat_a + mat_b.T)

**Element-wise multiplication**, which returns a matrix of the same dimensions.

In [None]:
print(mat_a * mat_b.T)

**Matrix multiplication** is a more complex calculation, which computes a dot product of the rows of the first matrix and the columns of the second matrix, to return a new matrix. You can learn more about matrix multiplication at [Khan Academy – Basic Matrix operations](http://www.khanacademy.org/math/algebra/algebra-matrices) and [Khan Academy – Linear Algebra](http://www.khanacademy.org/math/linear-algebra).
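
As a minimal sketch of the description above, the `@` operator (equivalent to `np.matmul` for 2-dimensional arrays) multiplies the workshop's `mat_a` and `mat_b`. The name `product` is illustrative. A (3, 2) matrix times a (2, 3) matrix gives a (3, 3) matrix, and each entry is the dot product of one row of the first matrix with one column of the second.

```python
import numpy as np

mat_a = np.array([[1, 2], [5, 6], [14, 15]])   # shape (3, 2)
mat_b = np.array([[4, 5, 6], [8, 9, 10]])      # shape (2, 3)

# Inner dimensions (2 and 2) must match; the result has shape (3, 3).
product = mat_a @ mat_b
print(product)
print(product.shape)

# The top-left entry is row 0 of mat_a dotted with column 0 of mat_b.
print(mat_a[0].dot(mat_b[:, 0]))   # 1*4 + 2*8 = 20
```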

***

## Statistics

Along with managing sets of data, Python and NumPy give you the tools to describe your data.

Ways to describe data sets:
* Length
* Max/Min
* Mean, Median, Mode
* Dispersion (Spread) of values
* Standard Deviation


In [None]:
basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

In [None]:
print("length: ", len(basic_list))

In [None]:
print("min: ", min(basic_list))
print("max: ", max(basic_list))

In [None]:
def mean(x):
    return sum(x) / len(x)

print("average: ", mean(basic_list))

#numpy has a mean function as well
#insert code here
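
One possible way to complete the cell above is `np.mean`. The sketch also shows a mode computed with `collections.Counter` from the standard library, since the mode is listed among the descriptors but not computed elsewhere. The names `numpy_average` and `mode_value` are illustrative.

```python
from collections import Counter

import numpy as np

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

# NumPy's mean matches sum(x) / len(x).
numpy_average = np.mean(basic_list)
print("numpy average: ", numpy_average)

# The mode is the most frequent value; 7 is the only value that repeats here.
mode_value = Counter(basic_list).most_common(1)[0][0]
print("mode: ", mode_value)
```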

You can also easily sort lists with `sorted()`, which helps when computing measures of central tendency.

In [None]:
print("original: ", basic_list)
sorted_list = sorted(basic_list)
print("sorted:   ", sorted_list)

If you already have a sorted list, you can use indexes to get the min/max values: 

In [None]:
#insert code here

Otherwise you can also use the `min` and `max` functions:

In [None]:
#insert code here

Finding the median is a little less straightforward: it is the middle value of the sorted list when the length is odd, and the average of the two middle values when the length is even.

In [None]:
#median function
#def median(v):

In [None]:
print("Median: ", median(basic_list))
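
The cell above needs a `median` function to be defined first. One possible implementation is sketched below as a suggestion, not the workshop's official answer, and it is checked against `np.median`.

```python
import numpy as np

def median(v):
    """Return the middle value of v, averaging the two middle values if len(v) is even."""
    sorted_v = sorted(v)
    n = len(sorted_v)
    midpoint = n // 2
    if n % 2 == 1:
        # Odd length: a single middle element exists.
        return sorted_v[midpoint]
    # Even length: average the two elements around the middle.
    return (sorted_v[midpoint - 1] + sorted_v[midpoint]) / 2

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]
print("Median: ", median(basic_list))        # 7
print("Median: ", median([1, 2, 3, 4]))      # 2.5
print("NumPy median: ", np.median(basic_list))
```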

The quantile generalizes the median: `quantile(x, p)` returns the value below which a fraction `p` of the data lies, i.e. the `100 * p`-th percentile.

In [None]:
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

In [None]:
print("1st Quartile (25th Percentile): ", quantile(basic_list, .25))
print("3rd Quartile (75th Percentile): ", quantile(basic_list, .75))
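
For comparison, NumPy offers `np.percentile`, which takes a percentage from 0 to 100 and by default interpolates linearly between neighbouring values. The sketch below repeats the workshop's `quantile` so it runs on its own; the list `small` is a made-up example. The two approaches can disagree because `quantile` always returns an actual element of the data.

```python
import numpy as np

def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

# On this list both approaches pick the same quartiles.
print(quantile(basic_list, .25), np.percentile(basic_list, 25))
print(quantile(basic_list, .75), np.percentile(basic_list, 75))

# On an even-length list they differ: index-based vs. interpolated.
small = [1, 2, 3, 4]
print(quantile(small, .5))       # 3
print(np.percentile(small, 50))  # 2.5
```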

In [None]:
def IQR(x): #IQR - interquartile range
    return quantile(x, 0.75) - quantile(x, 0.25)

In [None]:
print("Interquartile Range: ", IQR(basic_list))

In addition to the many simple statistical functions you can write yourself, numpy gives you access to a lot more, including the common ones from above.

In [None]:
print("Standard Deviation: ", np.std(basic_list))
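
One detail worth knowing: by default `np.std` computes the population standard deviation (dividing by `n`). Passing `ddof=1` gives the sample standard deviation (dividing by `n - 1`). A brief sketch, with illustrative variable names:

```python
import numpy as np

basic_list = [14, 7, 15, 7, 3, 5, 6, 8, 10]

# Default: divide the sum of squared deviations by n.
population_std = np.std(basic_list)

# ddof=1: divide by n - 1 instead, the usual choice for a sample.
sample_std = np.std(basic_list, ddof=1)

print("Population standard deviation: ", population_std)
print("Sample standard deviation:     ", sample_std)
```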

In [None]:
x = [14, 7, 15, 7, 3, 5, 6, 8, 10]
y = [44, 3, 7, 2, 17, 5, 3, 11, 14]
print("Correlation between x and y: \n", np.corrcoef(x, y))
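
`np.corrcoef(x, y)` returns a 2 × 2 matrix rather than a single number: the diagonal holds each variable's correlation with itself (1), and the off-diagonal entries hold the correlation between `x` and `y`. The sketch below extracts that entry and checks it against the Pearson formula written out by hand. The names `corr_matrix`, `r_xy`, and `r_manual` are illustrative.

```python
import numpy as np

x = [14, 7, 15, 7, 3, 5, 6, 8, 10]
y = [44, 3, 7, 2, 17, 5, 3, 11, 14]

corr_matrix = np.corrcoef(x, y)
r_xy = corr_matrix[0, 1]   # correlation between x and y

# Pearson correlation by hand: covariance over the product of spreads.
x_dev = np.array(x, float) - np.mean(x)
y_dev = np.array(y, float) - np.mean(y)
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

print(r_xy, r_manual)
```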

***
## Histograms
A histogram is a diagram consisting of rectangles whose areas are proportional to the frequency of a variable and whose widths are equal to the class interval. Histograms are a very useful tool for visualizing data.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

data = [0,1,2,1]
bins = [0,1,2,3]
x = np.histogram(data, bins)
print(x)  # output is a tuple: (number of occurrences in each bin, bin edges)
          # x[0] holds the frequency of each bin, x[1] holds the edges of the bins

What are bins? They are the intervals into which the data is grouped. In this example the four bin edges define 3 bins, ranging from 0 to 1, 1 to 2, and 2 to 3, respectively.
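
A value that sits exactly on a bin edge has to land somewhere. In `np.histogram`, every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The sketch below adds a 3 to the example data to show this; `data_with_edge` is a made-up list for illustration.

```python
import numpy as np

bins = [0, 1, 2, 3]

# The 1s fall in the bin from 1 to 2, not in the bin from 0 to 1.
counts, edges = np.histogram([0, 1, 2, 1], bins)
print(counts)   # counts per bin: 1, 2, 1

# The last bin is closed on the right, so 3 is counted together with 2.
data_with_edge = [0, 1, 2, 1, 3]
edge_counts, _ = np.histogram(data_with_edge, bins)
print(edge_counts)   # counts per bin: 1, 2, 2
```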

In [None]:
plt.bar(range(0,3),x[0])

plt.title('Our Histogram')
plt.xlabel('Data')
plt.ylabel('Frequency')
plt.axis([0, 5, 0, 5])#[start of x-axis, end of x-axis, start of y-axis, end of y-axis]
plt.grid(True)

plt.show()

***

## Probability

Now it's time for some basic probability to help with the upcoming workshops.

What is probability? It is a way of quantifying the uncertainty associated with a certain set of events. Many events in nature are random, which makes probability very useful for building and evaluating models that analyze data and make predictions from it.

### Dependent vs Independent Events

Two events E and F are independent if $P(E, F) = P(E) \cdot P(F)$.
That is, the probability of both E and F happening is $P(E) \cdot P(F)$.

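If E and F are not necessarily independent, we describe their relationship with the conditional probability of E given F (defined when $P(F) > 0$): $P(E | F) = \frac{P(E,F)}{P(F)}$, which is often rewritten as $P(E, F) = P(E | F) \cdot P(F)$. E and F are dependent when $P(E | F) \neq P(E)$.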
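
To make the definitions concrete, here is a sketch that checks independence exactly by enumerating outcomes. The two-dice setup, the helper `prob`, and the event functions are illustrative assumptions, not part of the workshop material. With two fair dice, "first die is even" turns out to be independent of "the sum is 7" but dependent on "the sum is at least 10".

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event (a function of a roll) over all rolls."""
    return Fraction(sum(1 for roll in rolls if event(roll)), len(rolls))

def first_even(roll):
    return roll[0] % 2 == 0

def sum_is_7(roll):
    return sum(roll) == 7

def sum_at_least_10(roll):
    return sum(roll) >= 10

p_e = prob(first_even)
p_f = prob(sum_is_7)
p_ef = prob(lambda roll: first_even(roll) and sum_is_7(roll))
print(p_ef == p_e * p_f)    # True: independent

p_g = prob(sum_at_least_10)
p_eg = prob(lambda roll: first_even(roll) and sum_at_least_10(roll))
print(p_eg == p_e * p_g)    # False: dependent

# Conditional probability P(G | E) = P(E, G) / P(E) differs from P(G).
p_g_given_e = p_eg / p_e
print(p_g_given_e, p_g)     # 2/9 1/6
```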

Let's program some examples in Python.

### Tricky Example 1: Family with two children

Let's assume the following statements:
1. Each child is equally likely to be a boy or a girl
2. The gender of the second child is independent of the gender of the first child

Based on these assumptions, we know that the event “no girls” has probability $\frac{1}{4}$, the event “one girl, one boy” has probability $\frac{1}{2}$, and the event “two girls” has probability $\frac{1}{4}$.

Now, let's think about the events:
* B = "both children are girls"
* G = "the older child is a girl"
* L = "at least one of the children is a girl"

Using the concept of conditional probability, we can ask how likely these events are when conditioned on each other. For example, what is the probability that both children are girls given that the older child is a girl?

By mathematical calculation, we know $P(B|G) = \frac{P(B, G)}{P(G)} = \frac{P(B)}{P(G)} = \frac{\frac{1}{4}}{\frac{1}{2}} = \frac{1}{2}$

And this makes sense because the event B and G (“both children are girls and the older child is a girl”) is just the event B. (Once you know that both children are girls, it’s necessarily true that the older child is a girl.)

We could also ask about the probability of the event “both children are girls” conditional on the event “at least one of the children is a girl” (L). Surprisingly, the answer is different from before! We know that the event B and L (“both children are girls and at least one of the children is a girl”) is just the event B.

This means we have $P(B|L) = \frac{P(B,L)}{P(L)} = \frac{P(B)}{P(L)} = \frac{\frac{1}{4}}{\frac{3}{4}} = \frac{1}{3}$
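
Both results can be confirmed exactly by listing the four equally likely families, as in the sketch below. The helper `prob` and the variable names are illustrative. Because the event "B and G" is just B, and "B and L" is just B, each conditional probability is a ratio of two simple probabilities.

```python
from fractions import Fraction
from itertools import product

# Each family is (older, younger); all four combinations are equally likely.
families = list(product(["boy", "girl"], repeat=2))

def prob(event):
    """Exact probability of an event (a function of a family) over all families."""
    return Fraction(sum(1 for family in families if event(family)), len(families))

p_both = prob(lambda family: family == ("girl", "girl"))   # P(B)
p_older = prob(lambda family: family[0] == "girl")         # P(G)
p_either = prob(lambda family: "girl" in family)           # P(L)

print("P(both | older): ", p_both / p_older)     # 1/2
print("P(both | either): ", p_both / p_either)   # 1/3
```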

In [None]:
# a function that randomly returns a choice between "boy" and "girl"
def random_kid():
    return np.random.choice(["boy", "girl"])

# counter variables
both_girls = 0.0
older_girl = 0.0
either_girl = 0.0

for i in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
        

print("P(both | older): ", both_girls / older_girl)    # approximately 1/2
print("P(both | either): ", both_girls / either_girl)  # approximately 1/3