# NumPy and Pandas Tutorial
## HODP Bootcamp Week 4
### October 10, 2018

## Some Python refreshers...
- datatypes (strings, integers)
- functions
- data structures like lists and dictionaries

In [None]:
lst = [1, "Emma", 5.0, {"name": "Emma", "age": 20}]

In [None]:
# Get the first element of the list

In [None]:
# Get the last element of the list

In [None]:
# Get all of the keys of the dictionary

In [None]:
# Get all of the values of the dictionary

## This week:
* Learn how to use the Python libraries NumPy and Pandas to make data analysis easy and efficient
* Understand key differences between Python, NumPy, Pandas, and more traditional tools like Google Sheets
* Practice your new data science skills!

## Getting Started

In [None]:
import numpy as np
import pandas as pd

## Python vs. NumPy
* Python lists are flexible, but that flexibility makes bugs tough to find, and manipulating data with `for` loops can be slow
* NumPy arrays have a single fixed type, so functions can be __vectorized__ (applied to every element at once) and operations can be __broadcast__ across arrays of compatible shapes
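
To make the two bolded terms concrete, here is a minimal sketch. The array names and values are made up for illustration and are not part of the bootcamp data.

```python
import numpy as np

prices = np.array([1.0, 2.0, 3.0])

# Vectorized: the multiplication is applied to every element, no explicit loop
doubled = prices * 2

# Broadcast: the length-3 row is stretched across both rows of the 2x3 array
grid = np.array([[0, 0, 0],
                 [10, 10, 10]])
shifted = grid + np.array([1, 2, 3])

print(doubled)   # [2. 4. 6.]
print(shifted)   # rows become [1 2 3] and [11 12 13]
```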

In [None]:
lst = ["Emma", "Jeffrey", 1, 2] # This is a valid Python list
lst

In [None]:
np_lst = np.array(lst) # NumPy forces all of the elements to be strings
np_lst

In [None]:
# Prints the two names, then raises a TypeError at the integer 1 (int + str is not allowed)
for elt in lst:
    print(elt + " 4")

In [None]:
for elt in np_lst:
    print(elt + " is fun")

## Creating NumPy arrays

First, we can use ``np.array`` to create arrays from Python lists:

In [None]:
# integer array:
np.array([1, 4, 2, 5, 3])

Remember that, unlike a Python list, a NumPy array holds elements of a single type.
If the types do not match, NumPy will upcast where possible (here, integers are upcast to floating point):

In [None]:
np.array([3.14, 4, 2, 3]) # Notice how the elements in the resulting array are all floats

In [None]:
np.array([1, 2, 3, 4], dtype='float32') # You can explicitly set the type with the dtype keyword
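
One way to check which type NumPy chose is the array's `dtype` attribute. This is a small sketch with made-up arrays; the exact integer width (for example `int64`) can vary by platform, so the example only looks at the general kind.

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])
strings = np.array(["Emma", 1, 2])

print(ints.dtype.kind)     # 'i' means a signed integer type
print(mixed.dtype)         # float64: the integers were upcast to floats
print(strings.dtype.kind)  # 'U' means a unicode string type
```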

NumPy has a bunch of handy built-in functions for generating arrays:

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array containing the integers 0 through 8
array = np.arange(9).reshape(3, 3)
array

We can slice NumPy arrays and index into them using bracket notation:

In [None]:
array

In [None]:
array[0, 1]

In [None]:
array[:, 2]

In [None]:
array[1, :]
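
A few more slicing patterns may be worth trying on the same 3x3 array. This is a sketch; the variable names are chosen here for illustration.

```python
import numpy as np

array = np.arange(9).reshape(3, 3)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

top_left = array[:2, :2]     # first two rows and first two columns
last_row = array[-1]         # negative indices count from the end
every_other = array[::2, 0]  # rows 0 and 2 of the first column

print(top_left)
print(last_row)
print(every_other)
```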

## Rule of Thumb: Don't reinvent the wheel
Before writing your own code, Google whether a function already exists that does what you want.

## So, how is this useful for data analysis?

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
Perhaps the most common summary statistics are the __mean__ and __standard deviation__, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

In [None]:
big_array = np.random.rand(1000000)
%timeit -n 10 sum(big_array)
%timeit -n 10 np.sum(big_array)
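
The `%timeit` lines above are IPython magics, so they only run inside Jupyter or IPython. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` module. The smaller array size and the run count below are arbitrary choices made to keep the example quick.

```python
import timeit
import numpy as np

big_array = np.random.rand(100_000)

# Time five calls of each function; the lambdas defer the call until timeit runs it
python_time = timeit.timeit(lambda: sum(big_array), number=5)
numpy_time = timeit.timeit(lambda: np.sum(big_array), number=5)

print(f"built-in sum: {python_time:.4f} s for 5 runs")
print(f"np.sum:       {numpy_time:.4f} s for 5 runs")

# Both compute the same total (up to floating-point rounding)
python_total = sum(big_array)
numpy_total = np.sum(big_array)
```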

### Some more handy features of NumPy:

One common type of aggregation operation is an aggregate along a row or column.

Say you have some data stored in a two-dimensional array:

In [None]:
M = np.random.random((3, 4))
print(M)

By default, each NumPy aggregation function will return the aggregate over the entire array:

In [None]:
M.min()

But what if you want the min for each row or each column?

In [None]:
# Find the min of each row

In [None]:
# Find the min of each column
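
As a hint, aggregation functions accept an `axis` argument that names the dimension to collapse. The sketch below uses `sum` on a small made-up array so the result is easy to check by hand; the same argument works with `min`.

```python
import numpy as np

scores = np.array([[3, 1, 4],
                   [1, 5, 9]])

col_sums = scores.sum(axis=0)  # collapse the rows: one value per column
row_sums = scores.sum(axis=1)  # collapse the columns: one value per row

print(col_sums)  # [ 4  6 13]
print(row_sums)  # [ 8 15]
```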

### Other aggregation functions

Most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special floating-point ``NaN`` value.

The following table provides a list of useful aggregation functions available in NumPy:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
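
To see why the NaN-safe versions matter, here is a minimal sketch with one missing value. The `readings` array is made up for illustration.

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0])

plain_mean = np.mean(readings)         # a single NaN makes the whole result NaN
safe_mean = np.nanmean(readings)       # ignores the NaN: (1.0 + 3.0) / 2
safe_argmax = np.nanargmax(readings)   # index of the largest non-NaN value

print(plain_mean)   # nan
print(safe_mean)    # 2.0
print(safe_argmax)  # 2
```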

## Pandas

* Pandas is another useful library for data analysis.
* While NumPy is really useful for math, it relies on __arrays__ of specific datatypes (ints, floats, etc.).
* Pandas uses two data structures, `Series` and `DataFrame`, that are designed to package lots of different types of data, similar to a spreadsheet.
* It combines the functionality of Python and NumPy with the ease of use of Google Sheets.
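
Here is a small sketch of the two structures. The names, ages, and GPAs are invented for illustration only.

```python
import pandas as pd

# A Series is a single labeled column of values
ages = pd.Series([20, 21, 19], index=["Emma", "Jeffrey", "Alex"])

# A DataFrame is a table whose columns can each have a different type
students = pd.DataFrame({
    "name": ["Emma", "Jeffrey", "Alex"],
    "age": [20, 21, 19],
    "gpa": [3.9, 3.7, 3.8],
})

print(ages["Emma"])             # look up a value by its label
print(students.dtypes)          # one dtype per column
print(students["age"].mean())   # each column is itself a Series
```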

## Example: House Rankings

We will:
1. Read in the data
2. Manipulate the data into a more usable form
3. Analyze the data
4. Plot our results

### Reading in the data

It's super easy to use Pandas to read in data from CSV files:

In [None]:
rankings = pd.read_csv("house_rankings_2018.csv")
rankings

And it looks even better once we use the house names as the index:

In [None]:
rankings.set_index("House", inplace=True)
rankings
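
If you don't have `house_rankings_2018.csv` handy, the same two steps can be tried on a tiny in-memory CSV. The column layout and the counts below are assumptions made up for this sketch; they are not the contents of the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text: a "House" column plus two made-up count columns
csv_text = """House,1,2
Adams,10,20
Cabot,30,5
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

# set_index returns a new DataFrame unless inplace=True is passed
demo = demo.set_index("House")

print(demo)
print(demo.loc["Adams", "1"])  # look up a cell by row label and column label
```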

### Manipulating the data

It may also be useful to have this data in a NumPy array so we can analyze it with NumPy's aggregate functions (although Pandas has its own versions of these functions). It's easy to convert between the two:

In [None]:
rankings.values
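
The conversion works in both directions. The sketch below uses a small made-up `DataFrame` (not the real rankings) and also shows `to_numpy()`, which newer Pandas versions offer alongside `.values`.

```python
import pandas as pd

# Hypothetical stand-in for the rankings table
demo = pd.DataFrame({"1": [10, 30], "2": [20, 5]}, index=["Adams", "Cabot"])

as_array = demo.to_numpy()          # plain NumPy array; row and column labels are dropped
row_totals = as_array.sum(axis=1)   # NumPy aggregates now work directly

# Rebuild a DataFrame by supplying the labels again
back = pd.DataFrame(as_array, index=demo.index, columns=demo.columns)

print(as_array)
print(row_totals)
print(back.equals(demo))
```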

We can also slice this array to get just the values for the first column or row:

In [None]:
# The first column

In [None]:
# The first row

### Analyzing the data

#### First, how many students filled out the survey?

In [None]:
# TODO

#### Which house was the most popular? The least popular?

In [None]:
# Most popular -- TODO

In [None]:
# Least popular -- TODO

#### Make a `DataFrame` with the percentage of first place rankings for each house.

In [None]:
# TODO

#### Make a `DataFrame` with the average ranking for each house.

Hint: You could use a `for` loop

In [None]:
# For loop approach -- TODO
w_rankings = rankings.copy()

In [None]:
w_rankings

Or you could use Pandas' `DataFrame.apply()` to apply a function to each row of your `DataFrame`.

In [None]:
def f(row):
    # TODO
    return row

In [None]:
weighted_rankings = rankings.apply(f, axis=1)

In [None]:
weighted_rankings

In [None]:
# n must be defined first; it is assumed here to be the total number of survey respondents
mean_rankings = weighted_rankings.sum(axis=1) / n
print(mean_rankings)

In [None]:
mean_rankings.sort_values()
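
For reference, here is one possible sketch of the whole average-ranking calculation on made-up data. It assumes a layout with one row per house and one column per rank position, where each cell counts the students who gave that house that rank; the real `house_rankings_2018.csv` may be organized differently. The names `counts` and `weight_by_rank` are hypothetical.

```python
import pandas as pd

# Made-up counts for three houses and four respondents
counts = pd.DataFrame(
    {1: [2, 1, 1], 2: [1, 2, 1], 3: [1, 1, 2]},
    index=["Adams", "Cabot", "Dunster"],
)
counts.index.name = "House"

# Under the assumed layout, every respondent casts exactly one first-place vote
n = counts[1].sum()

def weight_by_rank(row):
    # row.index holds the rank positions (1, 2, 3); multiply each count by its rank
    return row * row.index.values

weighted = counts.apply(weight_by_rank, axis=1)
mean_rank = weighted.sum(axis=1) / n

print(mean_rank.sort_values())  # a lower average rank means a more popular house
```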

## Congrats! You're on your way to becoming a data science expert!
### Next week we'll tackle making visualizations of our findings using Matplotlib and D3