# Advanced Data Manipulation II: "I Love Group-By"

In this lecture, we'll work on some additional skills for manipulating and analyzing tabular data. Our focus will be on: 

- **Filtering** data, identifying specific rows according to complex criteria. 
- **Aggregating** data, computing complicated summaries of groups. 

In [None]:
import sqlite3
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Let's start by reading in some data. We'll use the SQLite database that we created in a [recent lecture](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16B/blob/master/lectures/sql/sql-1.ipynb) for this purpose. You may need to run the code in that lecture in order for the following block to correctly read in the data. Make sure that the string supplied to `sqlite3.connect()` points to the location of the database. 

We are going to extract measurements for stations south of -80 degrees latitude. 
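The exact query depends on how the database was built in the earlier lecture. As a minimal sketch, assuming hypothetical tables named `temperatures` and `stations` with columns `ID`, `NAME`, `LATITUDE`, `Year`, `Month`, and `Temp`, the extraction might look like the following. The in-memory database and its rows are made up so that the example runs on its own; in practice you would connect to your own database file instead.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the course database: an in-memory SQLite
# database with two tiny tables. Table and column names are assumptions.
conn = sqlite3.connect(":memory:")
stations = pd.DataFrame({
    "ID": ["S1", "S2", "S3"],
    "NAME": ["STATION_A", "STATION_B", "STATION_C"],
    "LATITUDE": [-90.0, -82.5, 34.1],
})
temperatures = pd.DataFrame({
    "ID": ["S1", "S1", "S2", "S3"],
    "Year": [2000, 2001, 2000, 2000],
    "Month": [1, 1, 1, 1],
    "Temp": [-28.1, -27.4, -25.0, 13.2],
})
stations.to_sql("stations", conn, index=False)
temperatures.to_sql("temperatures", conn, index=False)

# Keep only measurements from stations south of -80 degrees latitude.
query = """
SELECT S.NAME, T.Year, T.Month, T.Temp
FROM temperatures T
LEFT JOIN stations S ON T.ID = S.ID
WHERE S.LATITUDE < -80
"""
df = pd.read_sql_query(query, conn)
conn.close()
```

Filtering inside the SQL query means that only the rows we need are ever loaded into memory.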

### Takeaways for Today

1. The `transform()` and `apply()` methods can enable advanced computation on Pandas data frames using quite simple code. 
2. Pandas also offers specialized functions for common tasks, but getting the hang of `transform()` and `apply()` will take you far. 
3. Global warming is scary. 

## Grouped Summaries

We already know how to compute grouped summaries of the data using `df.groupby().aggregate()`. For example, let's compute the mean temperature for each station within each month, averaged across years. Let's also compute the standard deviation and the number of observations. 
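Here is a small sketch of such a summary. The toy data frame below is hypothetical, and the column names `NAME`, `Year`, `Month`, and `Temp` are assumptions about how the real `df` is laid out.

```python
import pandas as pd

# Hypothetical toy data standing in for the lecture's `df`:
# two stations, two months, three years each.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 6 + ["STATION_B"] * 6,
    "Year":  [2000, 2000, 2001, 2001, 2002, 2002] * 2,
    "Month": [1, 2] * 6,
    "Temp":  [-28.0, -40.0, -27.0, -41.0, -26.0, -39.0,
              -20.0, -30.0, -22.0, -31.0, -18.0, -29.0],
})

# One row per (station, month) pair, with three summary columns.
summary = df.groupby(["NAME", "Month"])["Temp"].aggregate(["mean", "std", "count"])
```

The result has one row per group, so 12 readings collapse to 4 rows here.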

This is handy information, and it's convenient to be able to easily collect it in a summary table. However, there are some cases in which we may wish to compute new columns without creating a smaller summary table. Here's an example: 

## Temperature Anomaly Detection

Suppose we'd like to construct a list of unusually hot or cold months in our data set. For example, if February in 1995 is much warmer than average, we'd like to detect this. What makes a month "unusually hot or cold"? There are lots of valid ways to define this. How would you approach this? 

<br> <br> <br> <br> <br> 

For our first attempt, let's ask the following question: 

> For each temperature reading, how does that reading compare to the average reading *in that month* and *at that measurement station*?

For example, if July in 2017 at Amundsen-Scott station was much warmer than the average July reading at that station, then we might say that July 2017 was anomalous. 

### Z-Scores

To make this concrete, let's say that a given month in a year is anomalous if it is more than two standard deviations away from the mean for that month. If you've taken a statistics class, this is the same as requiring that the *z-score* for that month is larger than 2 in absolute value. That is, we should compute:

$$z = \frac{\text{reading} - \text{average reading at station in month}}{\text{standard deviation at station in month}}$$

and ask whether $|z| > 2$. 

How do we compute this? Well, we already know how to compute group means and standard deviations in a summary table like the one above, but a summary table has one row per group, which makes it hard to compare against individual readings. Can you think of how you would perform such a computation in Python? 

<br> <br> <br> <br> <br> 

If you suggested that we `merge` the summary table from above with our original `df`, that would eventually work! But `merge` can be slow on large tables, and we can avoid it entirely by using what are sometimes called *window functions.* A window function operates on grouped data, **without reducing the length of the data frame.** In `pandas`, the most general way to create window functions is the `transform()` method of grouped data frames and series. For example: 

In [None]:
# compute the average temperature in each month for each station
# notice the length of the result! 


Compare this to:

In [None]:
# note the length! 
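
The two cells above might be filled in along the following lines. This is a sketch on hypothetical toy data, with assumed column names `NAME`, `Month`, and `Temp`.

```python
import pandas as pd

# Hypothetical toy data: 12 readings in 4 (station, month) groups.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 6 + ["STATION_B"] * 6,
    "Month": [1, 2] * 6,
    "Temp":  [-28.0, -40.0, -27.0, -41.0, -26.0, -39.0,
              -20.0, -30.0, -22.0, -31.0, -18.0, -29.0],
})
grouped = df.groupby(["NAME", "Month"])["Temp"]

# transform() broadcasts each group's mean back to every row of that group.
per_row = grouped.transform("mean")

# aggregate() returns a single value per group.
per_group = grouped.aggregate("mean")

lengths = (len(per_row), len(per_group))
```

Here `per_row` has as many entries as `df` and shares its index, while `per_group` has one entry per group.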


Because the length of the output of `transform` is the same as that of the original data, we can use `transform` to create new columns. Here's a simple function to compute z-scores of an array: 
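One possible version of such a function is sketched below. Using the population standard deviation (`ddof=0`) is an assumption made here; `ddof=1` would be an equally reasonable choice.

```python
import numpy as np

def z_score(x):
    """Center x at its mean and scale by its standard deviation.

    Works on NumPy arrays and pandas Series alike. ddof=0 (population
    standard deviation) is an assumption made for this sketch.
    """
    return (x - x.mean()) / x.std(ddof=0)

z = z_score(np.array([1.0, 2.0, 3.0]))
```

Passing `ddof` explicitly avoids a subtle trap: NumPy arrays default to `ddof=0`, while pandas Series default to `ddof=1`.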

Now we can compute the z-scores in one shot: 
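As a sketch, assuming columns named `NAME`, `Month`, and `Temp` and a `z_score` helper like the one described above, the computation could look like this. The data are hypothetical.

```python
import pandas as pd

def z_score(x):
    # population standard deviation (ddof=0) is an assumption of this sketch
    return (x - x.mean()) / x.std(ddof=0)

# Hypothetical toy data: one station, two months, eight years each.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 16,
    "Year":  list(range(2000, 2008)) * 2,
    "Month": [1] * 8 + [7] * 8,
    "Temp":  [-28.0, -27.5, -28.2, -27.8, -28.1, -27.9, -28.0, -20.0,
              -60.0, -59.7, -60.3, -59.8, -60.1, -59.9, -60.2, -60.0],
})

# Each reading is standardized against its own (station, month) group.
df["z"] = df.groupby(["NAME", "Month"])["Temp"].transform(z_score)
```

Because `transform()` returns a result aligned with the original rows, it can be assigned directly as a new column.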

Using `transform`, we can skip both computing the summary table and merging it later. We're now ready to find anomalous months in our data. Before we do, we're going to add a handy column to the original data for plotting purposes. We already saw this code in a [prior lecture](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16B/blob/master/lectures/EDA/pd-1.ipynb). 
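One way to build such a column is sketched below on hypothetical data. The column name `Date` and the choice to zero-pad the month before parsing are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical toy data with integer Year and Month columns.
df = pd.DataFrame({
    "Year":  [2000, 2000, 2001],
    "Month": [1, 12, 7],
    "Temp":  [-28.0, -27.5, -60.0],
})

# Build strings like "2000-01" and parse them as dates
# (each date falls on the first day of its month).
df["Date"] = pd.to_datetime(
    df["Year"].astype(str) + "-" + df["Month"].astype(str).str.zfill(2),
    format="%Y-%m",
)
```

A real datetime column lets matplotlib place the readings correctly along a time axis.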

Ok, now let's get a subset data frame with the temperature anomalies: 
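A sketch of this filtering step, on hypothetical data in which the final January is deliberately much warmer than the rest:

```python
import pandas as pd

# Hypothetical toy data: one station, January readings over eight years.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 8,
    "Year":  list(range(2000, 2008)),
    "Month": [1] * 8,
    "Temp":  [-28.0, -27.5, -28.2, -27.8, -28.1, -27.9, -28.0, -20.0],
})

# z-scores within each (station, month) group; ddof=0 is an assumption.
df["z"] = df.groupby(["NAME", "Month"])["Temp"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

# Boolean indexing keeps only readings more than two standard deviations out.
anomalies = df[df["z"].abs() > 2]
```

In this toy example only the 2007 reading survives the filter.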

We can now, for example, plot these anomalies for a given station: 
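A plotting sketch along these lines is shown below. The station name, the data, and the styling choices are all hypothetical; the pattern is simply to draw all readings for one station and then overlay the anomalous ones.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical toy data: one station, January readings over eight years.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 8,
    "Year":  list(range(2000, 2008)),
    "Month": [1] * 8,
    "Temp":  [-28.0, -27.5, -28.2, -27.8, -28.1, -27.9, -28.0, -20.0],
})
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-01", format="%Y-%m")
df["z"] = df.groupby(["NAME", "Month"])["Temp"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)  # ddof=0 is an assumption
)

station = df[df["NAME"] == "STATION_A"]
anomalies = station[station["z"].abs() > 2]

fig, ax = plt.subplots(figsize=(6, 3))
# all readings as a faint line, anomalies highlighted on top
ax.plot(station["Date"], station["Temp"], color="lightgray", label="all readings")
ax.scatter(anomalies["Date"], anomalies["Temp"], color="firebrick", label="|z| > 2")
ax.set(xlabel="Date", ylabel="Temp", title="Anomalies at STATION_A")
ax.legend()
```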

It looks like the rate of anomalies is increasing with time -- yikes. Predictably, most of the anomalies are anomalously *warm*. 

### Max and Min (Optional)

That approach works fine, but suppose now that we'd like to try a different approach: we want to compute the warmest and coldest instances of each month on record. For example, we'd like to answer questions like: 

> *In what year did Amundsen-Scott Station record the warmest February, on average?* 

One approach to this is to define a function to compute rankings on individual subsets, and invoke it using `transform()`. For example: 

In [None]:
# feel free to look up np.argsort() and think about why this works
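
One possible ranking function is sketched here. It relies on the double-`argsort` idiom, and the name `rank` is just a label chosen for this sketch.

```python
import numpy as np

def rank(x):
    """Return the 0-based rank of each entry of x (0 = smallest).

    The inner argsort lists the positions of the entries in sorted order;
    the outer argsort inverts that permutation, which gives each entry's
    rank. Ties are broken arbitrarily.
    """
    return np.argsort(np.argsort(np.asarray(x)))

ranks = rank(np.array([30.0, 10.0, 20.0]))  # array([2, 0, 1])
```

The smallest value gets rank 0, so in a group of temperatures the coldest reading always has rank 0.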


In [None]:
# use the new function to create rankings


In [None]:
# coldest months on record
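
Putting the last few cells together, here is a sketch on hypothetical data. The two stations deliberately have different numbers of years on record, and the column name `rank` is an assumption.

```python
import numpy as np
import pandas as pd

def rank(x):
    # 0-based rank within the group (0 = coldest); ties broken arbitrarily
    return np.argsort(np.argsort(np.asarray(x)))

# Hypothetical toy data: February readings, 4 years at A and 2 years at B.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 4 + ["STATION_B"] * 2,
    "Year":  [2000, 2001, 2002, 2003, 2000, 2001],
    "Month": [2] * 6,
    "Temp":  [-40.0, -42.5, -39.0, -41.0, -30.0, -28.5],
})

# rank each reading within its (station, month) group
df["rank"] = df.groupby(["NAME", "Month"])["Temp"].transform(rank)

# the coldest instance of each month at each station has rank 0
coldest = df[df["rank"] == 0]
```

In this toy example the coldest Februaries are 2001 at `STATION_A` and 2000 at `STATION_B`.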


There's a bit of a sticky point here: it's easy to get the *coldest* months on record this way, because they all have the same rank (0). However, getting the *warmest* months is a little trickier, because they have different ranks (due to some stations not having data in all years or months): 

How can we extract the warmest months? We can use `transform()` again! 
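As a sketch of this idea: compute the largest rank within each group using `transform("max")`, then keep the rows whose rank equals it. The data and the column names `rank` and `max_rank` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with 0-based ranks already computed within each
# (station, month) group; the two groups have different sizes.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 4 + ["STATION_B"] * 2,
    "Year":  [2000, 2001, 2002, 2003, 2000, 2001],
    "Month": [2] * 6,
    "Temp":  [-40.0, -42.5, -39.0, -41.0, -30.0, -28.5],
    "rank":  [2, 0, 3, 1, 0, 1],
})

# broadcast each group's largest rank back to every row in the group
df["max_rank"] = df.groupby(["NAME", "Month"])["rank"].transform("max")

# the warmest instance is the row whose rank equals its group's maximum
warmest = df[df["rank"] == df["max_rank"]]
```

The maximum rank is 3 at `STATION_A` and 1 at `STATION_B`, which is why a single fixed rank cannot pick out the warmest months.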

## Other Approaches

Many tasks in `pandas` can be performed in more than one way. In this lecture so far, I've focused on the `transform` method of grouped data frames, which is extraordinarily flexible and can be used for many purposes. However, there are also specialized methods that you may wish to research on your own time. For example, Pandas offers a [dedicated ranking method](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.rank.html), as well as a [dedicated filtering](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html) method. If you find yourself needing to perform many ranking or filtering operations, learning these methods may be a good use of your time. 
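For a quick taste of these two methods, here is a sketch on hypothetical data. The threshold of three readings and the column name `rank_builtin` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy data: February readings, 4 years at A and 2 years at B.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 4 + ["STATION_B"] * 2,
    "Year":  [2000, 2001, 2002, 2003, 2000, 2001],
    "Month": [2] * 6,
    "Temp":  [-40.0, -42.5, -39.0, -41.0, -30.0, -28.5],
})

# Built-in grouped ranking: ranks start at 1, and method="first" breaks
# ties by order of appearance.
df["rank_builtin"] = df.groupby(["NAME", "Month"])["Temp"].rank(method="first")

# Built-in grouped filtering: keep only stations with at least 3 readings.
long_records = df.groupby("NAME").filter(lambda g: len(g) >= 3)
```

Note that the built-in ranks start at 1 rather than 0, so the coldest reading in each group has rank 1 here.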

## Custom Aggregation

Earlier in this lecture, we reviewed an example of how to compute aggregates like means and standard deviations using `aggregate`. Here it is again: 

That's fun, but we can also compute *custom aggregates* using any function we want that takes in a series of numbers and spits out a new number. This is a very powerful ability, especially if you get a little creative with it! The `apply` method is usually the way to go. For example, let's compute a simple estimate of the **year-over-year average change in temperature** in each month at each station. For this, we'll use our old friend, linear regression. We'll use the statistical fact that, when regressing `Temp` against `Year`, the coefficient of `Year` will be an estimate of the yearly change in `Temp`. 
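Here is one way this could look. To keep the sketch dependency-free, it fits the line with `np.polyfit` rather than a machine learning library; that substitution, the function name `coef`, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def coef(data_group):
    """Slope of the least-squares line of Temp against Year for one group."""
    slope, intercept = np.polyfit(data_group["Year"], data_group["Temp"], deg=1)
    return slope

# Hypothetical toy data: February warms by 0.5 per year at STATION_A
# and cools by 0.25 per year at STATION_B.
df = pd.DataFrame({
    "NAME":  ["STATION_A"] * 4 + ["STATION_B"] * 4,
    "Year":  [2000, 2001, 2002, 2003] * 2,
    "Month": [2] * 8,
    "Temp":  [-40.0, -39.5, -39.0, -38.5, -30.0, -30.25, -30.5, -30.75],
})

# apply() hands each group's Year and Temp columns to coef() and collects
# one number per (station, month) pair.
coefs = df.groupby(["NAME", "Month"])[["Year", "Temp"]].apply(coef)
```

Selecting `[["Year", "Temp"]]` before calling `apply()` passes only the columns that the function needs.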

Although this might look a bit strange as a function for using `apply` (wasn't this from the machine learning part of the class?), it's a perfectly good way to compute data summaries, as it takes in two data columns and spits out a number. 

At what proportion of station/months is the temperature rising? 
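Given a series of slopes with one entry per station/month pair, the proportion is just the mean of a boolean mask. The slopes below are made-up numbers for illustration, not results from the real data.

```python
import pandas as pd

# Hypothetical slopes, one per (station, month) pair; real values would
# come from the grouped regression described above.
coefs = pd.Series(
    [0.5, -0.25, 0.1, 0.02],
    index=pd.MultiIndex.from_tuples(
        [("STATION_A", 1), ("STATION_A", 2), ("STATION_B", 1), ("STATION_B", 2)],
        names=["NAME", "Month"],
    ),
)

# True counts as 1 and False as 0, so the mean is the proportion of
# station/month pairs with a positive estimated yearly change.
prop_rising = (coefs > 0).mean()
```

With these made-up slopes, three of the four pairs are positive, so `prop_rising` is 0.75.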

### Takeaways for Today

1. The `transform()` and `apply()` methods can enable advanced computation using simple code. 
2. Pandas also offers specialized functions for common tasks, but getting the hang of `transform()` and `apply()` will take you far. 
3. Global warming is scary. 