## Numpy and Pandas Bootcamp

Welcome back to another HODP bootcamp! This week, we'll be introducing **Numpy** and **Pandas**, two of the most popular Python libraries used in data science.

## Python Review
Below are some Python practice problems. The problems focus on lists, dictionaries, and functions because these are some of the most important concepts when using Python for data science!

#### We are going to start the bootcamp with Numpy and Pandas. For now, scroll down to those sections. 
If you want to do these exercises later, for example after the 10-minute slide deck, scroll back up!


1. Define a function ```reverse()``` which takes in a list and returns the reverse of the list - so the last element is
now the first, the second to last element is now the second, etc.

In [None]:
def reverse(lst):
    # TODO
    pass  # placeholder so the cell runs before you fill it in

2. Define a function ```remove_dupes()``` that takes in a list and returns a new list with all duplicate elements removed. You may
choose to preserve the list's order if you like.

In [None]:
def remove_dupes(lst):
    # TODO
    pass  # placeholder so the cell runs before you fill it in

3. Define a function ```count()``` that takes in a list and returns a dictionary with keys being the unique values in the
list and respective values being the number of times they appear in the list. For example, given the list
```["a", "a", "b", "c", "c"]```, your function should return the dictionary ```{"a": 2, "b": 1, "c": 2}```.

In [None]:
def count(lst):
    # TODO
    pass  # placeholder so the cell runs before you fill it in

Now, let's jump right in with numpy!

## Getting Started

Before we can use numpy and pandas, we must first import the libraries into our notebook. You should already have numpy
and pandas installed, but if you're having any trouble, call one of the bootcamp leaders and we'll help sort it out
for you.

In [None]:
import numpy as np
import pandas as pd

## Python vs. NumPy
Python lists are flexible and powerful in many situations. However, a list is effectively an array of pointers to objects that may be scattered throughout memory, which is what lets it hold items of different types. NumPy arrays, on the other hand, have a single fixed element type, so their values can be stored in one contiguous block of memory. As a result, operations on NumPy arrays can be much faster than the equivalent operations on Python lists.

What do we mean by numpy arrays having fixed types? Consider the following Python list:

In [None]:
lst = ["Asher", "Sahana", 1.0, 2]
print(lst)

```lst``` contains strings, floats, and integers. But once we convert it to a NumPy array, we notice a subtle change:

In [None]:
np_lst = np.array(lst)
print(np_lst)

Every element is now a string! So numpy arrays can only consist of a single data type. While this may seem like a hindrance,
forcing fixed types is why numpy operations execute much faster than equivalent code in Python. You probably won't notice the
difference in speed when working with small lists and arrays, but with large datasets the difference is very significant.
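
To see the single shared type directly, here is a minimal sketch that inspects the array's ```dtype``` attribute. The variable names ```mixed``` and ```arr``` are just for illustration.

```python
import numpy as np

# Mixed Python list: two strings, a float, and an integer
mixed = ["Asher", "Sahana", 1.0, 2]
arr = np.array(mixed)

print([type(x).__name__ for x in mixed])  # each list element keeps its own Python type
print(arr.dtype)       # one string dtype shared by the whole array
print(arr.dtype.kind)  # 'U' means Unicode string
```

Checking ```dtype``` is a quick way to catch accidental conversions like this one before doing any math on an array.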

## Creating NumPy arrays
First, we can use ``np.array`` to create arrays from Python lists:

In [None]:
# integer array:
np.array([1, 4, 2, 5, 3])

If types do not match, NumPy will up-cast if possible. Here, integers are up-cast to floats:

In [None]:
np.array([3.14, 4, 2, 3])

You can also explicitly set the type by providing a ```dtype``` argument:

In [None]:
np.array([1, 2, 3, 4], dtype='float32')

Numpy has a bunch of handy built-in functions to generate arrays. For example, we can create an array of length 10
filled with zeros.

In [None]:
np.zeros(10, dtype=int)

The ```np.arange()``` function works in a very similar way to Python's ```range()``` function. How would you create an
array of the first 10 even integers, starting from 0?
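
As a warm-up with different numbers, this small sketch passes the same (start, stop, step) arguments to both functions. The variable names are illustrative only.

```python
import numpy as np

# Same (start, stop, step) arguments for both; stop is exclusive in each case
py_values = list(range(1, 10, 3))
np_values = np.arange(1, 10, 3)

print(py_values)  # a Python list
print(np_values)  # a NumPy integer array with the same values
```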

In [None]:
# TODO

We can also create matrices by specifying the number of rows and columns as a tuple. The ```np.ones()``` function
simply initializes every value in the matrix to 1.

In [None]:
np.ones((2, 3), dtype=int)

You can access elements of an array by using slice notation:

In [None]:
array = np.arange(9)
array[1:4] # returns the elements at indices 1, 2, and 3 (start is inclusive, stop is exclusive)

You can access elements of a matrix in a similar fashion by using two slice expressions - the first determines which row(s)
to extract, and the second determines which column(s).

In [None]:
matrix = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
# We can get the value 2 by specifying the first row (0th index) and second column (1st index)
matrix[0, 1]

How could you extract the second row? Hint: the slice operator ```:``` without any arguments returns the entire array.
So a call like ```matrix[:, :]``` would return all rows and all columns - the entire matrix itself!
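
Here is a rough sketch of how the two slice expressions combine, using columns and a corner block rather than the rows asked about in the exercises. The names ```first_column``` and ```top_left``` are made up for this example.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

first_column = matrix[:, 0]   # all rows, column index 0
top_left = matrix[0:2, 0:2]   # rows 0-1 and columns 0-1
print(first_column)
print(top_left)
```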

In [None]:
# TODO

How could you extract the numbers 7 and 8?

In [None]:
# TODO

As a rule of thumb, don't reinvent the wheel. Google if a function already exists that does what you want!

## NumPy and Data Analysis
Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
Perhaps the most common summary statistics are the __mean__ and __standard deviation__, which allow you to summarize the
"typical" values in a dataset, but other aggregates are useful as well (e.g. the sum, product, median, minimum and maximum,
quantiles, etc.).

Numpy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.
Let's create a large array of random numbers from 0 to 1 and take the sum using both Python's built-in ```sum()``` and NumPy's ```np.sum()```.

In [None]:
big_array = np.random.rand(1000000)

# -n 10 means execute the statement 10 times per timing run
%timeit -n 10 sum(big_array)
%timeit -n 10 np.sum(big_array)

We can see that numpy's ```sum()``` function is much faster.
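
The ```%timeit``` magic only works inside IPython or Jupyter. If you are running a plain Python script, a comparable measurement can be sketched with the standard library's ```timeit``` module. The array here is smaller than the one above so the example finishes quickly, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

# Smaller array than the cell above so the sketch runs quickly
values = np.random.rand(100_000)

# Both approaches should agree up to floating-point rounding
py_total = sum(values)
np_total = np.sum(values)
print(np.isclose(py_total, np_total))

# timeit.timeit returns the total seconds taken for `number` executions
py_seconds = timeit.timeit(lambda: sum(values), number=10)
np_seconds = timeit.timeit(lambda: np.sum(values), number=10)
print(f"built-in sum: {py_seconds:.4f} s, np.sum: {np_seconds:.4f} s")
```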

One common type of aggregation operation is an aggregate along a row or column.
Say you have some data stored in a two-dimensional array:

In [None]:
M = np.random.rand(3, 4)
print(M)

By default, each NumPy aggregation function will return the aggregate over the entire array:

In [None]:
M.min()

But if you wanted to get the minimum of each column, you can specify which **axis** you want your function to act upon:
- ```axis=0``` means your function collapses the rows, returning one result per column
- ```axis=1``` means your function collapses the columns, returning one result per row

So we want to set ```axis=0``` because for each column, we want to take the minimum over all rows.

In [None]:
# minimum of each column
M.min(axis=0)
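
Because ```M``` is random, it can be hard to check the axis rule by eye. Here is a small sketch with fixed values and ```sum()``` instead of ```min()```; the array ```small``` is made up for this example.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

column_sums = small.sum(axis=0)  # collapse the rows: one sum per column (5, 7, 9)
row_sums = small.sum(axis=1)     # collapse the columns: one sum per row (6, 15)
print(column_sums)
print(row_sums)
```

Notice that the result's length matches the axis that was kept: three columns give three column sums, two rows give two row sums.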

How could you get the maximum of each row? If you do not know, GOOGLE IT! ;D

In [None]:
# TODO

What is the dot product of [1,3,4,5, 6,6,1,1,1,3,3,5,3,31,1,3,5,54,4,313,1,1] and [5,3,9,5, 6,6,1,1,1,-30,3,16,3,31,1,-134,5,54,4,31,15,51]?

In [None]:
# TODO

Broadcasting example: Create a (3,2) array, and add it to a (2,) array. 
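
If broadcasting is new to you, here is a minimal sketch with different shapes from the ones in the exercise. NumPy compares shapes starting from the last dimension, and a smaller array is stretched across the larger one when those dimensions match (or one of them is 1). The arrays ```grid``` and ```row``` are made up for illustration.

```python
import numpy as np

grid = np.array([[0, 0, 0],
                 [10, 10, 10]])  # shape (2, 3)
row = np.array([1, 2, 3])        # shape (3,)

# The trailing dimensions match (3 and 3), so `row` is added to every row of `grid`
result = grid + row
print(result.shape)
print(result)
```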

In [None]:
# TODO 

### Other functions
Most aggregates have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked
by the special floating-point ``NaN`` value.

The following table lists useful aggregation functions available in NumPy, along with ``np.concatenate``, which joins arrays rather than aggregating them:

|Function Name      |   NaN-safe Version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
| ``np.concatenate``| N/A                 | Concatenate arrays (without manual copying!)  |
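
To see why the NaN-safe versions in the table matter, here is a short sketch comparing the regular and NaN-safe functions on an array with one missing value. The array ``readings`` is made up for this example.

```python
import numpy as np

readings = np.array([1.0, np.nan, 3.0])

print(np.mean(readings))     # nan: a single missing value propagates to the result
print(np.nanmean(readings))  # 2.0: the mean of the two remaining values
print(np.nansum(readings))   # 4.0: the missing value is ignored
```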

## Pandas
Pandas is another useful library for data analysis. While NumPy is powerful for mathematically heavy analysis, it relies
on __arrays__ with a single data type. Pandas mainly uses two data structures - `Series` and `DataFrame` - that are organized
similarly to a spreadsheet. It combines the functionality of Python and NumPy with the ease of use of Google Sheets or Excel.
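
As a quick sketch of how the two structures relate: a `Series` is a single labeled column, and selecting one column from a `DataFrame` gives you a `Series` back. The names `ranks` and `table` are illustrative only.

```python
import pandas as pd

# A Series is a single labeled column of values
ranks = pd.Series([15, 7, 48], index=["Massachusetts", "Ohio", "Alaska"],
                  name="Population Rank")
print(ranks["Ohio"])  # look up a value by its label

# A DataFrame is a table whose columns are Series sharing one index
table = pd.DataFrame({"Population Rank": ranks})
print(type(table["Population Rank"]).__name__)
```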

We can create DataFrames from Python lists and dictionaries:

In [None]:
states = pd.DataFrame({'State': ['Massachusetts','Ohio','Alaska','California','Arkansas'],
                       'Population Rank': [15, 7, 48, 1, 33]})
print(states)

However, typing data in by hand is rarely practical. Instead, we will usually read in data from another file.
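
Here is a minimal sketch of reading CSV data with ```pd.read_csv()```. To keep it self-contained, the CSV text lives in a string wrapped in ```io.StringIO``` rather than in a real file; with a file on disk you would pass its path instead. The contents reuse a few of the states from the cell above.

```python
import io

import pandas as pd

# CSV text standing in for a file; io.StringIO lets pandas read it like one
csv_text = """State,Population Rank
Massachusetts,15
Ohio,7
Alaska,48
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # column names come from the header row
print(df)
```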

## Activity: House Rankings
It's your turn to work with an example dataset! We'll be using pandas to analyze freshman house rankings prior to 2019
Housing Day. We will:
1. Read in the data from an external file
2. Manipulate the data and make it more convenient to use
3. Analyze and gather statistics about the data
4. Plot our results

If you don't remember everything that we discussed about pandas, don't worry! You can find a reference on
[HODP Docs](https://hodp-docs.netlify.app/docs/numpy-pandas), and also feel free to ask any questions you may have!

### Reading in the data
First, let's read in the data from the file ```house_rankings_2019.csv``` into a DataFrame called ```rankings``` and print
out the first five rows.

In [None]:
# TODO

Since each row consists of all the votes for a single house, let's make the index be the name of each house instead of the
default numerical index. Try to modify the existing DataFrame instead of creating a new one. Hint: set ```inplace``` as
```True``` when you're changing the index.

In [None]:
# TODO

### Manipulating the data
We can easily extract certain data from our DataFrame. How could you get the number of first place votes for each House?

In [None]:
# TODO

Similarly, how could you get the distribution of votes for Lowell?

In [None]:
# TODO

We can also rename data labels pretty easily:

In [None]:
# rename() returns a new DataFrame; rankings itself is unchanged unless you reassign it
rankings.rename(index={'Pforzheimer': 'Pfoho'})

### Analyzing the data
We can use Pandas to answer some questions we might have about the data. For example, how can we get the total number of
students that filled out the survey?

In [None]:
# TODO

Which house was the most popular? The least popular? Hint: the ```idxmax()``` function returns the index of the (first)
maximum value. Likewise for ```idxmin()```. You may need to break ties.
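
Here is a small sketch of how ```idxmax()``` and ```idxmin()``` behave when there is a tie, using a made-up Series of vote counts (not the real house data).

```python
import pandas as pd

# Hypothetical vote counts with a tie for the highest value
votes = pd.Series({"A": 3, "B": 7, "C": 7})

print(votes.idxmax())  # label of the first maximum
print(votes.idxmin())  # label of the minimum
print(votes[votes == votes.max()].index.tolist())  # every label tied for the maximum
```

Since only the first maximum is reported, comparing against ```max()``` as in the last line is one way to check whether a tie exists.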

In [None]:
# TODO

We can sort by popularity too!

In [None]:
rankings.sort_values(by='1', ascending=False)

Let's extract the column of first place votes for each house again, but this time, let's change the data from the
number of students to the percentage of total students. Hint: given a DataFrame ```df```, numerical expressions involving ```df```
will be applied to every value in the DataFrame. For example, ```df * 2``` returns a DataFrame with all its values doubled.
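
The hint can be tried out on a toy DataFrame first. This is only a sketch with made-up values; the column names ```x``` and ```y``` are not part of the rankings data.

```python
import pandas as pd

# Hypothetical toy DataFrame
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

doubled = df * 2    # every value multiplied by 2
shifted = df + 0.5  # every value increased by 0.5 (integer columns become floats)
print(doubled)
print(shifted)
```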

In [None]:
# TODO

Finally, let's find the average ranking for each house. This will require a weighted average. For example, if a house had
25 first place votes, 25 second place votes, and 50 third place votes, the average ranking would be
$\frac{(25 \times 1) + (25 \times 2) + (50 \times 3)}{100} = 2.25$.  
Save your results into a Series/DataFrame called ```avg_rankings```.
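
To check the arithmetic in the example, here is a short sketch that computes the same weighted average for that single hypothetical house in two ways: by hand, and with NumPy's ```np.average()```. Applying the idea to every row of ```rankings``` is still up to you.

```python
import numpy as np

places = np.array([1, 2, 3])    # ranking positions
votes = np.array([25, 25, 50])  # votes received at each position

# Multiply each position by its vote count, then divide by the total votes
manual = (votes * places).sum() / votes.sum()
built_in = np.average(places, weights=votes)
print(manual, built_in)
```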

In [None]:
# TODO


Let's sort the average rankings to see what we have:

In [None]:
avg_rankings.sort_values()

Would you look at that! The three most recently renovated houses are at the top, while the Quad houses are at the bottom
of the list. It certainly doesn't look like a coincidence, but we can't be certain until we perform some statistical tests.
You'll learn how to do that in an upcoming bootcamp.

The data we have right now is nice and all, but it would be even better if we could summarize our data visually. Pandas draws
its plots with another library, **matplotlib**, which we can use to create some rudimentary figures:

In [None]:
import matplotlib

In [None]:
rankings.plot()

In [None]:
avg_rankings.plot.bar()

In [None]:
matplotlib.style.use('ggplot')
rankings[0:4].plot.bar(stacked=True);

There will also be bootcamps on data visualization, so stay tuned for those as well!