# NumPy and Pandas Tutorial
## HODP Bootcamp Week 3
### February 27, 2019

## Some Python refreshers . . . 
- datatypes (strings, integers)
- functions
- data structures like lists and dictionaries

In [1]:
lst = [1, "Emma", 5.0, {"name": "Emma", "age": 20}]

In [2]:
lst[0]

1

In [3]:
lst[-1]

{'name': 'Emma', 'age': 20}

In [4]:
lst[-1]['age']

20

In [5]:
for key in lst[-1].keys():
    print(key)
    print(lst[-1][key])

name
Emma
age
20


## This week:
* Learn how to use the Python libraries NumPy and Pandas to make data analysis easy and efficient
* Understand the key differences between plain Python, NumPy, and Pandas
* Practice your new data science skills!

## Getting Started

In [6]:
import numpy as np
import pandas as pd

## Python vs. NumPy
* Python lists are flexible, but type-related bugs can be tough to find, and `for` loops that manipulate large lists can be slow
* NumPy arrays have a single fixed type, so functions can be __vectorized__ and operations can be __broadcast__ across arrays
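To make the two bolded terms concrete, here is a minimal sketch. The arrays `prices` and `grid` are made-up example data, not part of the bootcamp dataset.

```python
import numpy as np

prices = np.array([1.0, 2.5, 4.0])  # hypothetical example data

# Vectorized: the multiplication is applied to every element without a Python loop
doubled = prices * 2

# Broadcasting: the length-3 array is applied to each row of the 2x3 array
grid = np.array([[0.0, 0.0, 0.0],
                 [10.0, 10.0, 10.0]])
shifted = grid + prices

print(doubled)
print(shifted)
```

Both operations run in compiled NumPy code, which is usually why they beat an equivalent `for` loop on large arrays.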

In [7]:
lst = ["Emma", "Jeffrey", 1, 2] # This is a valid Python list
lst

['Emma', 'Jeffrey', 1, 2]

In [8]:
np_lst = np.array(lst) # NumPy converts every element to a string
np_lst

array(['Emma', 'Jeffrey', '1', '2'],
      dtype='<U7')

In [9]:
for elt in lst:
    print(elt + " 4")

Emma 4
Jeffrey 4


TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [13]:
for elt in np_lst:
    print(elt + " is studying abroad")

Emma is studying abroad
Jeffrey is studying abroad
1 is studying abroad
2 is studying abroad


## Creating NumPy arrays

First, we can use ``np.array`` to create arrays from Python lists:

In [14]:
# integer array:
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

Remember that, unlike a Python list, a NumPy array holds elements of a single type.
If the types do not match, NumPy will upcast if possible (here, integers are upcast to floating point):

In [15]:
np.array([3.14, 4, 2, 3]) # Notice how the elements in the resulting array are all floats

array([ 3.14,  4.  ,  2.  ,  3.  ])

In [16]:
np.array([1, 2, 3, 4], dtype='float32') # You can explicitly set the type with the dtype keyword

array([ 1.,  2.,  3.,  4.], dtype=float32)

NumPy has a bunch of handy built-in functions for generating arrays:

In [17]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [18]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [19]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])

In [20]:
array = np.arange(9).reshape(3,3)
array

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

We can slice NumPy arrays and index into them using bracket notation:

In [21]:
array

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [22]:
array[0, 1]

1

In [23]:
array[:, 2]

array([2, 5, 8])

In [24]:
array[1, :]

array([3, 4, 5])
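Slices can also select a rectangular block. One detail worth knowing is that a basic NumPy slice is a view onto the original array rather than a copy. The sketch below uses a hypothetical array named `grid` to show both points.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)

# Rows 0-1 and columns 1-2: the 2x2 block in the top-right corner
block = grid[:2, 1:]

# A basic slice is a view, so writing to it also changes `grid`
block[0, 0] = 100

# Use .copy() when you want an independent array
safe = grid[:2, 1:].copy()
safe[0, 0] = -1

print(grid)
```

After running this, `grid[0, 1]` is 100 because of the write through `block`, while the write to `safe` leaves `grid` untouched.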

## Rule of Thumb: Don't reinvent the wheel
Before writing your own function, Google whether one already exists that does what you want.

## So, how is this useful for data analysis?

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
Perhaps the most common summary statistics are the __mean__ and __standard deviation__, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

In [29]:
big_array = np.random.rand(1000000)

# -n 10 means each statement is executed 10 times per timing run
%timeit -n 10 sum(big_array)
%timeit -n 10 np.sum(big_array)

2.13 µs ± 259 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.97 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
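`%timeit` only works inside IPython or Jupyter. In a plain Python script, a similar comparison can be sketched with the standard library's `timeit` module. The array size and the variable names below are choices made for this example, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

values = np.random.rand(100_000)  # smaller than big_array to keep the run short

# timeit.timeit returns the total time in seconds for `number` calls
builtin_seconds = timeit.timeit(lambda: sum(values), number=10)
numpy_seconds = timeit.timeit(lambda: np.sum(values), number=10)

print(f"built-in sum: {builtin_seconds:.4f} s for 10 calls")
print(f"np.sum:       {numpy_seconds:.4f} s for 10 calls")
```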


### Some more handy features of NumPy:

One common type of aggregation operation is an aggregate along a row or column.

Say you have some data stored in a two-dimensional array:

In [28]:
M = np.random.random((3, 4))
print(M)

[[ 0.2581423   0.54036969  0.01132417  0.73857904]
 [ 0.67673625  0.10617384  0.38333626  0.27012546]
 [ 0.90012291  0.23315813  0.91114747  0.35997978]]


By default, each NumPy aggregation function will return the aggregate over the entire array:

In [30]:
M.min()

0.011324169108072768

But what if you want the min for each row or each column?

In [33]:
# min of each column
M.min(axis=0)

array([ 0.2581423 ,  0.10617384,  0.01132417,  0.27012546])

In [4]:
# min of each row
M.min(axis=1)

array([ 0.01132417,  0.10617384,  0.23315813])
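The `axis` argument is easier to follow on a small array with fixed values. The sketch below uses a hypothetical 2x3 array named `scores`.

```python
import numpy as np

scores = np.array([[3, 7, 1],
                   [9, 2, 8]])  # hypothetical 2x3 data

# axis=0 collapses the rows, leaving one value per column
col_min = scores.min(axis=0)

# axis=1 collapses the columns, leaving one value per row
row_min = scores.min(axis=1)

print(col_min)
print(row_min)
```

A useful way to remember it: `axis` names the dimension that disappears from the result.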

### Other aggregation functions

Most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special floating-point ``NaN`` value.

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
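As a small illustration of the table, the sketch below compares a regular aggregate with its `NaN`-safe counterpart. The `readings` array is made-up data.

```python
import numpy as np

readings = np.array([2.0, np.nan, 4.0])  # hypothetical data with one missing value

plain_mean = np.mean(readings)    # NaN propagates through the regular aggregate
safe_mean = np.nanmean(readings)  # the missing value is ignored
safe_sum = np.nansum(readings)    # NaN is treated as zero in the sum

print(plain_mean, safe_mean, safe_sum)
```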

## Pandas

* Pandas is another useful library for data analysis.
* While NumPy is really useful for math, it relies on __arrays__ of specific datatypes (ints, floats, etc).
* Pandas provides two data structures, `Series` and `DataFrame`, that are designed to package lots of different types of data, much like a spreadsheet.
* It combines the functionality of Python and NumPy with the ease of use of Google Sheets.
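Here is a minimal sketch of the two structures. The names and values are illustrative only (Jeffrey's age and the `abroad` column are invented for the example).

```python
import pandas as pd

# A Series is a single labeled column of values
ages = pd.Series([20, 21], index=["Emma", "Jeffrey"], name="age")

# A DataFrame is a table whose columns can each have a different type
students = pd.DataFrame({
    "name": ["Emma", "Jeffrey"],
    "age": [20, 21],
    "abroad": [True, False],
})

print(ages["Emma"])
print(students.dtypes)
```

Unlike a NumPy array, the `DataFrame` keeps strings, integers, and booleans side by side, one type per column.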

## Example: House Rankings

We will:
1. Read in the data
2. Manipulate the data into a more usable form
3. Analyze the data
4. Plot our results

### Reading in the data

It's super easy to use Pandas to read in data from CSV files:

In [38]:
rankings = pd.read_csv("house_rankings_2018.csv")
rankings.head()

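|   | House    | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9   | 10  | 11  | 12 |
|---|----------|----|----|----|----|----|----|----|----|-----|-----|-----|----|
| 0 | Adams    | 20 | 15 | 24 | 38 | 37 | 44 | 67 | 75 | 74  | 28  | 32  | 80 |
| 1 | Cabot    | 5  | 13 | 16 | 17 | 7  | 20 | 16 | 31 | 49  | 118 | 148 | 94 |
| 2 | Kirkland | 19 | 19 | 35 | 50 | 71 | 63 | 72 | 70 | 56  | 24  | 24  | 31 |
| 3 | Mather   | 17 | 15 | 19 | 25 | 27 | 40 | 44 | 67 | 112 | 37  | 55  | 76 |
| 4 | Quincy   | 28 | 43 | 55 | 90 | 71 | 82 | 65 | 44 | 21  | 17  | 14  | 4  |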


And it looks beautiful:

In [3]:
rankings.set_index("House", inplace=True)
rankings

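| House       | 1   | 2   | 3   | 4  | 5  | 6  | 7  | 8  | 9   | 10  | 11  | 12  |
|-------------|-----|-----|-----|----|----|----|----|----|-----|-----|-----|-----|
| Adams       | 20  | 15  | 24  | 38 | 37 | 44 | 67 | 75 | 74  | 28  | 32  | 80  |
| Cabot       | 5   | 13  | 16  | 17 | 7  | 20 | 16 | 31 | 49  | 118 | 148 | 94  |
| Kirkland    | 19  | 19  | 35  | 50 | 71 | 63 | 72 | 70 | 56  | 24  | 24  | 31  |
| Mather      | 17  | 15  | 19  | 25 | 27 | 40 | 44 | 67 | 112 | 37  | 55  | 76  |
| Quincy      | 28  | 43  | 55  | 90 | 71 | 82 | 65 | 44 | 21  | 17  | 14  | 4   |
| Leverett    | 11  | 22  | 40  | 73 | 76 | 81 | 94 | 66 | 36  | 18  | 11  | 6   |
| Dunster     | 45  | 67  | 113 | 56 | 70 | 42 | 44 | 52 | 19  | 10  | 11  | 5   |
| Currier     | 14  | 10  | 16  | 15 | 18 | 19 | 20 | 23 | 43  | 92  | 114 | 150 |
| Eliot       | 37  | 57  | 60  | 67 | 57 | 76 | 49 | 40 | 38  | 23  | 16  | 14  |
| Lowell      | 152 | 106 | 63  | 51 | 45 | 35 | 22 | 24 | 14  | 5   | 7   | 10  |
| Pforzheimer | 10  | 21  | 15  | 6  | 16 | 19 | 29 | 33 | 66  | 158 | 98  | 63  |
| Winthrop    | 176 | 146 | 78  | 46 | 39 | 13 | 12 | 9  | 6   | 4   | 4   | 1   |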
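Reading the file and setting the index can also be done in one call with the `index_col` argument of `pd.read_csv`. The sketch below reads from an in-memory string instead of `house_rankings_2018.csv` so that it runs on its own; the two rows are copied from the table above, and the exact layout of the real file is an assumption.

```python
import io
import pandas as pd

# Two rows copied from the rankings table; the real file's layout is assumed
csv_text = """House,1,2,3,4,5,6,7,8,9,10,11,12
Adams,20,15,24,38,37,44,67,75,74,28,32,80
Cabot,5,13,16,17,7,20,16,31,49,118,148,94
"""

# index_col="House" uses the House column as the row labels while reading
sample = pd.read_csv(io.StringIO(csv_text), index_col="House")

# Column labels read from a CSV header are strings, so use "1" rather than 1
print(sample.loc["Cabot", "1"])
```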

### Manipulating the data

It may be useful to also have this data in a NumPy array so we can use some of the NumPy aggregate functions to analyze our data (although Pandas also has its own version of these functions).  It's easy to convert between types:

In [41]:
rankings.values

array([[ 20,  15,  24,  38,  37,  44,  67,  75,  74,  28,  32,  80],
       [  5,  13,  16,  17,   7,  20,  16,  31,  49, 118, 148,  94],
       [ 19,  19,  35,  50,  71,  63,  72,  70,  56,  24,  24,  31],
       [ 17,  15,  19,  25,  27,  40,  44,  67, 112,  37,  55,  76],
       [ 28,  43,  55,  90,  71,  82,  65,  44,  21,  17,  14,   4],
       [ 11,  22,  40,  73,  76,  81,  94,  66,  36,  18,  11,   6],
       [ 45,  67, 113,  56,  70,  42,  44,  52,  19,  10,  11,   5],
       [ 14,  10,  16,  15,  18,  19,  20,  23,  43,  92, 114, 150],
       [ 37,  57,  60,  67,  57,  76,  49,  40,  38,  23,  16,  14],
       [152, 106,  63,  51,  45,  35,  22,  24,  14,   5,   7,  10],
       [ 10,  21,  15,   6,  16,  19,  29,  33,  66, 158,  98,  63],
       [176, 146,  78,  46,  39,  13,  12,   9,   6,   4,   4,   1]])

We can also slice this array to get just the values in the first column or the first row:

In [42]:
rankings.values[:, 0]

array([ 20,   5,  19,  17,  28,  11,  45,  14,  37, 152,  10, 176])

In [43]:
rankings.values[0, :]

array([20, 15, 24, 38, 37, 44, 67, 75, 74, 28, 32, 80])
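Recent pandas versions also offer `DataFrame.to_numpy()`, which the pandas documentation recommends over `.values`. A minimal sketch of the round trip, using a hypothetical two-house, two-column frame named `sample`:

```python
import pandas as pd

sample = pd.DataFrame(
    {"1": [20, 5], "2": [15, 13]},
    index=pd.Index(["Adams", "Cabot"], name="House"),
)

# DataFrame -> NumPy array (the row and column labels are dropped)
as_array = sample.to_numpy()
first_column = as_array[:, 0]

# NumPy array -> DataFrame (the labels have to be supplied again)
back = pd.DataFrame(as_array, index=sample.index, columns=sample.columns)

print(first_column)
```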

### Analyzing the data

#### First, how many students filled out the survey?

In [46]:
n = rankings.sum(axis=1).iloc[0]  # total number of responses in the first row (Adams)
print(n)

534


#### Which house was the most popular? The least popular?

In [47]:
rankings.iloc[:, 0].idxmax()  # idxmax returns the index label of the largest value

'Winthrop'

In [2]:
rankings.iloc[:, 11].idxmax()  # house with the most last-place rankings

'Currier'
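In current pandas releases, `Series.idxmax()` returns the index label of the largest value, while `Series.argmax()` returns its integer position. The sketch below shows the difference on three first-place counts copied from the rankings table; the variable names are chosen for this example.

```python
import pandas as pd

# First-place counts for three houses, copied from the rankings table
first_place = pd.Series([20, 152, 176], index=["Adams", "Lowell", "Winthrop"])

label = first_place.idxmax()     # index label of the largest value
position = first_place.argmax()  # integer position of the largest value

print(label, position)
```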

#### Make a `Series` with the percentage of first place rankings for each house.

In [49]:
rankings.iloc[:,0] / n * 100

House
Adams           3.745318
Cabot           0.936330
Kirkland        3.558052
Mather          3.183521
Quincy          5.243446
Leverett        2.059925
Dunster         8.426966
Currier         2.621723
Eliot           6.928839
Lowell         28.464419
Pforzheimer     1.872659
Winthrop       32.958801
Name: 1, dtype: float64

#### Make a `Series` with the average ranking for each house.

You could use a `for` loop like this:

In [57]:
w_rankings = rankings.copy()
for i in range(12):
    w_rankings.iloc[:, i] = w_rankings.iloc[:, i] * (i + 1)

In [58]:
w_rankings

| House       | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9    | 10   | 11   | 12   |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|
| Adams       | 20  | 30  | 72  | 152 | 185 | 264 | 469 | 600 | 666  | 280  | 352  | 960  |
| Cabot       | 5   | 26  | 48  | 68  | 35  | 120 | 112 | 248 | 441  | 1180 | 1628 | 1128 |
| Kirkland    | 19  | 38  | 105 | 200 | 355 | 378 | 504 | 560 | 504  | 240  | 264  | 372  |
| Mather      | 17  | 30  | 57  | 100 | 135 | 240 | 308 | 536 | 1008 | 370  | 605  | 912  |
| Quincy      | 28  | 86  | 165 | 360 | 355 | 492 | 455 | 352 | 189  | 170  | 154  | 48   |
| Leverett    | 11  | 44  | 120 | 292 | 380 | 486 | 658 | 528 | 324  | 180  | 121  | 72   |
| Dunster     | 45  | 134 | 339 | 224 | 350 | 252 | 308 | 416 | 171  | 100  | 121  | 60   |
| Currier     | 14  | 20  | 48  | 60  | 90  | 114 | 140 | 184 | 387  | 920  | 1254 | 1800 |
| Eliot       | 37  | 114 | 180 | 268 | 285 | 456 | 343 | 320 | 342  | 230  | 176  | 168  |
| Lowell      | 152 | 212 | 189 | 204 | 225 | 210 | 154 | 192 | 126  | 50   | 77   | 120  |
| Pforzheimer | 10  | 42  | 45  | 24  | 80  | 114 | 203 | 264 | 594  | 1580 | 1078 | 756  |
| Winthrop    | 176 | 292 | 234 | 184 | 195 | 78  | 84  | 72  | 54   | 40   | 44   | 12   |


Or you could use Pandas `pd.DataFrame.apply()` to apply a function to your `DataFrame`.

In [59]:
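def f(row):
    # Weight the count at each rank by the rank itself (1 through 12)
    weighted = row.copy()  # avoid modifying the row that apply() passes in
    for i in range(12):
        weighted.iloc[i] *= i + 1  # .iloc selects by position, not by label
    return weighted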

In [60]:
weighted_rankings = rankings.apply(f, axis=1)

In [61]:
weighted_rankings

| House       | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9    | 10   | 11   | 12   |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|
| Adams       | 20  | 30  | 72  | 152 | 185 | 264 | 469 | 600 | 666  | 280  | 352  | 960  |
| Cabot       | 5   | 26  | 48  | 68  | 35  | 120 | 112 | 248 | 441  | 1180 | 1628 | 1128 |
| Kirkland    | 19  | 38  | 105 | 200 | 355 | 378 | 504 | 560 | 504  | 240  | 264  | 372  |
| Mather      | 17  | 30  | 57  | 100 | 135 | 240 | 308 | 536 | 1008 | 370  | 605  | 912  |
| Quincy      | 28  | 86  | 165 | 360 | 355 | 492 | 455 | 352 | 189  | 170  | 154  | 48   |
| Leverett    | 11  | 44  | 120 | 292 | 380 | 486 | 658 | 528 | 324  | 180  | 121  | 72   |
| Dunster     | 45  | 134 | 339 | 224 | 350 | 252 | 308 | 416 | 171  | 100  | 121  | 60   |
| Currier     | 14  | 20  | 48  | 60  | 90  | 114 | 140 | 184 | 387  | 920  | 1254 | 1800 |
| Eliot       | 37  | 114 | 180 | 268 | 285 | 456 | 343 | 320 | 342  | 230  | 176  | 168  |
| Lowell      | 152 | 212 | 189 | 204 | 225 | 210 | 154 | 192 | 126  | 50   | 77   | 120  |
| Pforzheimer | 10  | 42  | 45  | 24  | 80  | 114 | 203 | 264 | 594  | 1580 | 1078 | 756  |
| Winthrop    | 176 | 292 | 234 | 184 | 195 | 78  | 84  | 72  | 54   | 40   | 44   | 12   |


In [64]:
mean_rankings = weighted_rankings.sum(axis=1) / n
print(mean_rankings)

House
Adams          7.584270
Cabot          9.436330
Kirkland       6.627341
Mather         8.086142
Quincy         5.344569
Leverett       6.022472
Dunster        4.719101
Currier        9.421348
Eliot          5.466292
Lowell         3.578652
Pforzheimer    8.970037
Winthrop       2.743446
dtype: float64


In [34]:
mean_rankings.sort_values()

House
Winthrop       2.743446
Lowell         3.578652
Dunster        4.719101
Quincy         5.344569
Eliot          5.466292
Leverett       6.022472
Kirkland       6.627341
Adams          7.584270
Mather         8.086142
Pforzheimer    8.970037
Currier        9.421348
Cabot          9.436330
dtype: float64
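The weighting step can also be written without a loop or `apply()`, by broadcasting an array of ranks across the columns. This is a sketch of one alternative, not the approach used above. It rebuilds three rows of the rankings table by hand so that it runs on its own, and the names `counts`, `ranks`, and `mean_rank` are chosen for this example.

```python
import numpy as np
import pandas as pd

# Three rows copied from the rankings table above
counts = pd.DataFrame(
    [[20, 15, 24, 38, 37, 44, 67, 75, 74, 28, 32, 80],
     [152, 106, 63, 51, 45, 35, 22, 24, 14, 5, 7, 10],
     [176, 146, 78, 46, 39, 13, 12, 9, 6, 4, 4, 1]],
    index=pd.Index(["Adams", "Lowell", "Winthrop"], name="House"),
    columns=[str(rank) for rank in range(1, 13)],
)

ranks = np.arange(1, 13)  # 1, 2, ..., 12
n = counts.sum(axis=1)    # responses per house (534 for each row here)

# Broadcasting multiplies every row by `ranks` element-wise, with no explicit loop
mean_rank = (counts * ranks).sum(axis=1) / n

print(mean_rank.sort_values())
```

For these three houses the result matches the values printed above: roughly 2.74 for Winthrop, 3.58 for Lowell, and 7.58 for Adams.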

## Congrats! You're on your way to becoming a data science expert!
### Next week we'll tackle making visualizations of our findings using Matplotlib and D3