In [6]:
import numpy as np
import pandas as pd

# Pandas Review

This is a midterm review sheet for *Pandas topics only*. This does not touch the other aspects of the course that you are still responsible for knowing!

Pandas [has *too many* methods](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428) and many of them are not *core* to using the library (I mostly agree with the content of that Medium post; I agree with everything in spirit). In the HW, you had to look up pandas functions from documentation that might *do the trick* for a given problem, which is a valuable skill. In lecture, we restricted ourselves to the core features that are necessary for *working with tables in general*. Each of these is an analogue of a DSC 10 `datascience` table operation (except for the operators for dealing with null values).

Below is a collection of core Pandas operations (taken from lecture) that you should be comfortable with using.

## Pandas topics

Here are things that you should have "down" and therefore are good midterm material.

* Different ways of instantiating Series / DataFrames.
* Pandas data-types (floats, object, NaN) and `.dtypes`
* Understanding Indexes and how they relate to slicing/joining
* Reading in data (don't worry about kwargs)
* Selecting Rows and columns with `[]`, `loc[]`, `iloc[]`
    - Passing lists/slices
    - Boolean array selection (and operators)
* Adding and modifying rows and columns
* Useful methods/functions
    - pandas: count, unique, nunique, value_counts, describe, sort_values, drop_duplicates, replace
    - numpy: mean, std, min, max, percentile
* Null Values:
    - dropna and *all kwargs*
    - fillna and *all kwargs* used in lecture
    - Understand what type `np.nan` is, and how comparisons work (`pd.isnull`)
* `apply` and user-defined functions (applied to series, dataframe)
* groupby: apply, (agg, transform, less so -- all can be derived from apply!)
* Know `pivot_table` or `pivot`. If you prefer `pivot`, know how to combine it with `groupby` to get `pivot_table` functionality!
* `concat` and `merge`.

Then there are the additional 'applications' we've learned:
* datetime 
* rolling windows
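
As a quick reminder of what these two applications look like, here is a minimal sketch on made-up data. The names `times` and `values` are hypothetical and are not used elsewhere in this sheet.

```python
import pandas as pd

# Hypothetical timestamps, already sorted in time order.
times = pd.Series(pd.to_datetime([
    '2017-01-01 00:00', '2017-01-01 06:00', '2017-01-02 00:00',
]))

# diff() gives the Timedelta since the previous entry; the first entry is NaT.
gaps = times.diff()
# The .dt accessor exposes datetime/timedelta methods; convert the gaps to hours.
gap_hours = gaps.dt.total_seconds() / 3600

# A rolling window of size 2: each entry is the mean of itself and the
# previous entry. The first entry is NaN because its window is not full.
values = pd.Series([1.0, 2.0, 3.0, 4.0])
rolling_mean = values.rolling(window=2).mean()
```

Here `gap_hours` holds a null followed by the 6-hour and 18-hour gaps, and `rolling_mean` starts with a null because the first window has only one observation.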


# Practice Problems

* You should be able to do these problems with confidence. The dataframes given are small (and changeable), so you can easily assess whether your answer is correct. Come to lab hours if you're not 100% sure about your answer.  **There will be no solutions for these problems**.
* There are a lot of other practice problems with Pandas available on the internet -- search them out, they are probably relevant.

###  A note on loops in Pandas:

You should avoid writing loops in Pandas if they *could* be looping over a **large** number of observations. In general this means:
- looping over columns is ok if necessary, as long as the number of columns is small!
- looping over rows is **not** ok, unless you know the number of rows is small (e.g. the dataframe is the result of a groupby with few distinct keys).

You should be able to do *everything* below without looping through rows!
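
To make the distinction concrete, here is a small sketch contrasting a row loop with its vectorized equivalent. The frame `toy` and its columns are hypothetical and unrelated to the practice data below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; imagine it had millions of rows.
toy = pd.DataFrame({'x': np.arange(6), 'y': np.arange(6) * 2.0})

# Row loop (avoid on large frames): Python-level work for every single row.
looped = []
for _, row in toy.iterrows():
    looped.append(row['x'] + row['y'])
looped = pd.Series(looped, index=toy.index)

# Vectorized equivalent: one operation over whole columns.
vectorized = toy['x'] + toy['y']

# Looping over the (few) columns is fine.
col_means = {col: toy[col].mean() for col in toy.columns}
```

Both approaches give the same values; the difference is that the loop does its work row by row in Python, while the column arithmetic is a single array operation.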

### Pandas Series and DataFrame basics
Instantiate a dataframe from: 
* a list-of-lists, 
* a dictionary of column arrays, 
* reading from file.

In [12]:
byhand = pd.DataFrame([[1,2,4],[5,8,7]], index=['m', 'n'], columns='a b c'.split())
byhand

Unnamed: 0,a,b,c
m,1,2,4
n,5,8,7


In [111]:
num_rows = 11
data = pd.DataFrame({
    'c1': np.random.choice('a b c'.split(), size=num_rows),
    'c2': np.random.uniform(size=num_rows),
    'c3': np.random.randint(0,10, size=num_rows),
    'c4': np.random.choice([True, False], size=num_rows)
}, index=np.random.randint(0, 10, size=num_rows))
data.head()

Unnamed: 0,c1,c2,c3,c4
3,c,0.497027,8,True
1,a,0.889928,5,False
9,b,0.151157,3,False
1,b,0.890076,5,True
5,c,0.36168,4,True


### Selecting Rows and columns with `[]`, `loc[]`, `iloc[]`

How to select:
* The first value of `c3` where index is `3`.
* The last row (series) that had index equal to `5`.
* The first half of the dataframe
* All rows with index either 7 or 3
* All rows indexed by an even number
* All rows where either `c1` is equal to `a` or `c4` is `False`, but not both.
* All rows where `c3` is above average
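
As a reminder of the mechanics (on a separate, hypothetical frame `toy`, not on `data`), the sketch below shows how `[]`, `loc[]`, and `iloc[]` behave when an index label is repeated, and how boolean masks combine.

```python
import pandas as pd

# Hypothetical toy frame with a repeated index label (2 appears twice).
toy = pd.DataFrame(
    {'k': ['a', 'b', 'a', 'c'],
     'v': [10, 20, 30, 40],
     'flag': [True, True, False, False]},
    index=[2, 5, 2, 7],
)

by_label = toy.loc[2]          # label-based: a DataFrame with both rows labeled 2
by_position = toy.iloc[2]      # position-based: the third row, as a Series
some_labels = toy.loc[[5, 7]]  # a list of labels
first_two = toy.iloc[:2]       # a positional slice
cols = toy[['k', 'v']]         # [] with a list selects columns

# Boolean masks combine with & (and), | (or), ^ (xor), ~ (not).
# Parenthesize each comparison, since these operators bind tightly.
masked = toy[(toy['v'] > 15) & ~toy['flag']]
```

Note that `toy.loc[2]` returns two rows while `toy.iloc[2]` returns exactly one: labels can repeat, positions cannot.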

### Add, modify

* return a dataframe like `data` so that `c3` is of float type (it should not change `data`).
* add a column `z3` to a (deep) copy of `data` whose values are the values of `c3` in standard units.
* add a column `nonsense` to a (deep) copy of `data` that contains the concatenation of `c1` and `c3` (e.g. values like `c4`)
* add a row with index `-1` with values equal to the last row of `data`.
* add a column that contains the difference between the value of the elements of `c3` and its row number.



### Methods and apply (for both series and dataframes)

* Filtering: A dataframe with the same columns as `data` and for each value of `c3`, it contains only the row of `data` with the greatest value of `c2`.
* Filtering: return all rows where the entry of `c3` appears in the column more than three times.
* Return a series that contains the strings
    1. `'Truthy'`/`'Falsey'` if `c3` is even and `c4` is `True`/`False`
    2. Otherwise, the column should contain the values of `c4`.


### Null Data

* Explain the difference in data types between the original `nulldata` and the one with `NaN` values.
* What is the proportion of null values for each column? Return this information as a Series indexed by column.
* Drop any row/columns for which at least half the values are null (*don't* use dropna)
* Fill in the nulls for each column with:
    1. the mean of the non-null values if the column is numeric
    2. the mode of the non-null values if the column is categorical
* Fill in the nulls of `c2,c3,c4` with the maximum value of the three columns. If all are null, drop the row.
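
Before working on these, it may help to recall how `np.nan` behaves and what the `dropna` keyword arguments do. The following is a minimal sketch on a hypothetical frame `toy` (not `nulldata`).

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float, and it is not equal to itself.
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = (np.nan == np.nan)  # False, so test with pd.isnull instead
detected = pd.isnull(np.nan)

# Because NaN is a float, an integer column that gains a null becomes float.
complete = pd.Series([1, 2, 3])
with_null = pd.Series([1, np.nan, 3])

# Hypothetical toy frame: row 0 has two values, row 1 none, row 2 one.
toy = pd.DataFrame({
    'p': [1.0, np.nan, np.nan],
    'q': [np.nan, np.nan, np.nan],
    'r': [7.0, np.nan, 9.0],
})

any_dropped = toy.dropna()                       # how='any' (default): drop rows with any null
all_dropped = toy.dropna(how='all')              # drop rows only if every value is null
thresh_kept = toy.dropna(thresh=2)               # keep rows with at least 2 non-null values
cols_dropped = toy.dropna(axis=1, how='all')     # same idea, applied to columns
```

In this toy frame every row has at least one null, so the default `dropna()` removes everything, while `how='all'` and `thresh` are progressively more selective.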

In [91]:
# first, create a complete dataset
nulldata = pd.DataFrame({
    'c1': np.random.choice('a b c'.split(), size=num_rows),
    'c2': np.random.randint(0,15, size=num_rows),
    'c3': np.random.randint(0,10, size=num_rows),
    'c4': np.random.choice([0, 1], size=num_rows)
})

nulldata.head()

Unnamed: 0,c1,c2,c3,c4
0,b,2,1,1
1,c,8,4,1
2,c,9,8,0
3,b,5,1,0
4,b,1,1,1


In [92]:
# now, null out entries of the complete data

for c in nulldata.columns:
    idx = np.random.choice(nulldata.index, size=(num_rows//2))
    nulldata.loc[idx, c] = np.nan  # np.nan works in all NumPy versions; the np.NaN alias was removed in NumPy 2.0
    
nulldata.head()

Unnamed: 0,c1,c2,c3,c4
0,b,,,
1,c,8.0,,1.0
2,c,9.0,8.0,
3,b,5.0,1.0,0.0
4,,,1.0,1.0


### Groupby: methods and apply
* Compute the mean of each column within each group defined by different values of `c1`
* keep only the largest and smallest value from each group
* The number of unique values of `c3` for each value of `c1`
* For each value of `c1`, 
    1. if the average of `c2` is < 0.5 and the majority of `c4` is True return 'small',
    2. if the average of `c2` is < 0.5 and the majority of `c4` is False return 'large',
    3. if the average of `c2` is greater than 0.5 and the majority of `c4` is True return 'large'
    4. if the average of `c2` is greater than 0.5 and the majority of `c4` is False return 'small'
* Suppose for each row, column `c1` represents a type of fruit and `c3` represents the number of pieces of fruit observed. 
    - What is the total number of fruits found in the table? 
    - How many different types are there? 
    - What is the empirical distribution of fruit types found in the table as a whole?
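
As a refresher on how `agg` and `transform` relate to `apply`, here is a small sketch on a hypothetical frame `toy` with a grouping column `g` and a value column `v`. It assumes `group_keys=False` is passed so that the transform-like `apply` keeps the original index.

```python
import pandas as pd

# Hypothetical toy frame: two groups, 'x' and 'y'.
toy = pd.DataFrame({
    'g': ['x', 'x', 'y', 'y', 'y'],
    'v': [1.0, 3.0, 2.0, 4.0, 6.0],
})
grouped = toy.groupby('g')['v']

# agg: one value per group.
means_agg = grouped.agg('mean')
# The same result from apply, with a function that returns a scalar.
means_apply = grouped.apply(lambda s: s.mean())

# transform: one value per original row, aligned to the original index.
centered_transform = grouped.transform(lambda s: s - s.mean())
# The same result from apply, with a function that returns a Series of the
# group's own length; group_keys=False keeps the original index labels.
centered_apply = (
    toy.groupby('g', group_keys=False)['v']
    .apply(lambda s: s - s.mean())
    .sort_index()
)
```

The shape of the function's return value decides the shape of the output: a scalar per group gives an aggregation, a same-length Series per group gives a transformation.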

Consider the table `login` below, where the column `user` specifies the user logging into a server, and `time` is the date and time of that log-in event.

* How many log-ins did each user have? Return a series indexed by user.
* What are the maximum, minimum, and average times between log-ins (time since the last log-in) for each user? Return a dataframe keyed by user.
* Convert the dataframe `login` so that instead of a `time` column, you have a column `time_diff_z` that contains the time since the last log-in for that user, converted to standard units (where standard units are for that user only).
* Plot the distributions of these time_diff_z per user. (Not for the exam, but for a 40B connection: given these logins are chosen uniformly, why do these plots have this shape?)

In [36]:
# create logins using a uniform distribution
n_users = 1000
login = pd.DataFrame({
    'user': np.random.choice('a b c d e'.split(), size=n_users),
    'time': pd.to_datetime(np.random.randint(1483228800, 1514764800, size=n_users), unit='s'),
}).sort_values(by='time')

login.head()

Unnamed: 0,user,time
335,a,2017-01-01 03:19:28
370,c,2017-01-03 12:18:38
142,c,2017-01-03 21:35:55
945,e,2017-01-04 18:45:26
98,a,2017-01-05 05:23:04


### Pivot Tables

* Be able to verify Simpson's paradox from an example! Be able to create the empirical conditional distributions for the verification (e.g. the southwest vs jetblue HW).
* For each pair of values in `c1` and `c4` in `data`, what is the average value of `c2`? Return this in a dataframe with index given by values of `c1` and column given by values of `c4`. 
* In `data`, compute the empirical distribution of `c3` conditional on `c1`. That is, each row is an empirical distribution of `c3` indexed by the value of `c1`.
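
For those who prefer `pivot`, the sketch below shows one way to combine it with `groupby` to reproduce `pivot_table`. The frame `toy` and its columns `r`, `c`, `val` are hypothetical, and the aggregation used here is a sum.

```python
import pandas as pd

# Hypothetical toy frame; the pair ('b', 'u') appears twice.
toy = pd.DataFrame({
    'r': ['a', 'a', 'b', 'b', 'b'],
    'c': ['u', 'v', 'u', 'u', 'v'],
    'val': [1.0, 2.0, 3.0, 5.0, 7.0],
})

# pivot_table aggregates duplicate (r, c) pairs with the given aggfunc.
direct = toy.pivot_table(index='r', columns='c', values='val', aggfunc='sum')

# pivot alone raises an error on duplicate (r, c) pairs, so aggregate with
# groupby first, then reshape the (now unique) pairs.
aggregated = toy.groupby(['r', 'c'])['val'].sum().reset_index()
two_step = aggregated.pivot(index='r', columns='c', values='val')
```

Both `direct` and `two_step` have rows indexed by `r` and columns given by `c`; the only difference is whether the aggregation happens inside `pivot_table` or beforehand in `groupby`.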

### Concat and Merge

* For `concat`: Understand the function and the examples in lecture. Understand the axis argument and its joining behavior for both `axis=0` and `axis=1`. Understanding the other keyword arguments isn't necessary for the exam.
* For `merge`: Understand the function and the examples in lecture. The important keyword arguments to know are `how` and `on` (and the related `left_on`, `right_on`, `left_index`, `right_index`).
* For each of these:
    - know what the number of rows and columns of the output will be
    - understand how null values affect your joins
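
To practice predicting output shapes, here is a small sketch on hypothetical frames (`top`, `bottom`, `left`, `right` are made-up names, not the dataframes defined below). Try to guess each shape before checking it.

```python
import pandas as pd

# Two frames that share one column ('b') and one index label (1).
top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[0, 1])
bottom = pd.DataFrame({'b': [5, 6], 'c': [7, 8]}, index=[1, 2])

# axis=0: rows are appended; columns are unioned, with NaN where missing.
stacked = pd.concat([top, bottom], axis=0)
# axis=1: columns are appended; index labels are unioned, with NaN where missing.
side_by_side = pd.concat([top, bottom], axis=1)

# The key 'x' is repeated on the left; 'y' and 'z' each appear on one side only.
left = pd.DataFrame({'key': ['x', 'x', 'y'], 'lv': [1, 2, 3]})
right = pd.DataFrame({'key': ['x', 'z'], 'rv': [10, 20]})

inner = left.merge(right, on='key', how='inner')     # only keys found in both
left_join = left.merge(right, on='key', how='left')  # every left row is kept
outer = left.merge(right, on='key', how='outer')     # keys from either side
```

Here `stacked` has 4 rows and 3 columns, `side_by_side` has 3 rows and 4 columns, and the three merges have 2, 3, and 4 rows respectively; unmatched keys show up as nulls in the columns coming from the other table.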

In [49]:
num_rows = 11
data1 = pd.DataFrame({
    'c1': np.random.choice('a b c'.split(), size=num_rows),
    'c2': np.random.uniform(size=num_rows),
    'c3': np.random.randint(0,10, size=num_rows),
    'c4': np.random.choice([True, False], size=num_rows)
})

data2 = pd.DataFrame({
    'c1': np.random.choice('a b c'.split(), size=num_rows),
    'c2': np.random.uniform(size=num_rows),
    'c3': np.random.randint(0,10, size=num_rows),
    'c4': np.random.choice([True, False], size=num_rows)
})

data3 = pd.DataFrame({
    'd1': 'a b d'.split(),
    'd2': 'apple banana dragonfruit'.split()
})

In [None]:
# How many rows? columns? null values in each row and column?
# data1.merge(data3, left_on=..., right_on=..., how=...)
# data2.merge(data3, left_on=..., right_on=..., how=...)

# data1.merge(data2)
# data1.merge(data2, on='c3', how=...)

The dataframes below represent the number of items sold from the specified store on (perhaps multiple) days, as well as a spreadsheet of the prices of each item at each store.

* add the price per unit to the inventory dataframe.
* add the price per unit to the inventory dataframe only using `price_per_unit` and without using the function `unstack` (which you don't have to know; it is roughly the 'un-pivot' function). Hint: `price_per_unit` will always be small; use a broadcast join.
* Find the revenue earned for each store in the table.
* Add the item descriptions to each of these tables!

In [95]:
num_rows = 11
inventory = pd.DataFrame({
    'store': np.random.choice(['store %d' %d for d in range(5)], size=num_rows),
    'item': np.random.choice('a b c d'.split(), size=num_rows),
    'number sold': np.random.randint(0, 100, size=num_rows),
    'number left': np.random.randint(0, 100, size=num_rows)
})

inventory

Unnamed: 0,store,item,number sold,number left
0,store 3,c,55,65
1,store 3,d,3,48
2,store 4,c,6,7
3,store 3,c,90,64
4,store 0,b,47,10
5,store 4,b,62,96
6,store 2,d,43,20
7,store 1,b,45,97
8,store 4,b,14,84
9,store 2,d,32,83


In [96]:
price_per_unit = pd.DataFrame(
    np.random.uniform(size=(5,5)),
    index=pd.Series('a b c d e'.split(), name='item'),
    columns=pd.Series(['store %d' %d for d in range(5)], name='store'),
)

price_per_unit_unstacked = price_per_unit.unstack().rename('price').reset_index()


In [103]:
item_description = pd.DataFrame({
    'item': 'a b c d e'.split(),
    'description': 'apple banana cherry dragonfruit elderberry'.split()
})

Unnamed: 0,c1,c2,c3,c4
0,a,0.606245,1,False
1,b,0.065381,1,True
2,b,0.084807,8,True
3,b,0.543094,1,False
4,c,0.553677,6,False
