<img src='img/logo.png' />

<img src='img/title.png'>

<img src='img/py3k.png'>

# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas: Tidy Data](#Pandas:-Tidy-Data)
	* [Set-Up](#Set-Up)
* [Overview](#Overview)
* [Demonstration](#Demonstration)
	* [Data Load](#Data-Load)
	* [Data Read](#Data-Read)
	* [Data Cleanup](#Data-Cleanup)
* [Question: Days of Rest](#Question:-Days-of-Rest)
	* [Data Organization](#Data-Organization)
	* [Translate Question to Operation](#Translate-Question-to-Operation)
* [Question: Home Team Advantage](#Question:-Home-Team-Advantage)
	* [Question: Team Strength](#Question:-Team-Strength)
		* [Mini Project: Home Court Advantage?](#Mini-Project:-Home-Court-Advantage?)
		* [Step 1. Calculate Win %](#Step-1.-Calculate-Win-%)
		* [Step 2: Find the win percent for each team](#Step-2:-Find-the-win-percent-for-each-team)
* [Merging](#Merging)
* [Pivoting](#Pivoting)
	* [Summarizing Pivot](#Summarizing-Pivot)
	* [Transform Pivot](#Transform-Pivot)
* [Concat](#Concat)

# Learning Objectives

After this notebook, the learner will be able to:
* Use pandas to tidy up data
* Limit file reads to just the columns of data needed
* Reorganize data to suit the question at hand
* Translate a data question into a data operation
* Perform SQL-like queries on a pandas DataFrame

# Pandas: Tidy Data

## Set-Up

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_rows = 6
pd.options.display.max_columns = 6

***

# Overview

Structuring datasets to facilitate analysis [(Wickham 2014)](http://www.jstatsoft.org/v59/i10/paper)

If there's one maxim I can impart, it's that your tools shouldn't get in the way of your analysis. Your problem is already difficult enough; don't let the data or your tools make it any harder.

In a tidy dataset...

1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

We'll cover a few methods that help you get there.

***

# Demonstration

## Data Load

Load some data from the web and save it locally

In [None]:
url    = "http://www.basketball-reference.com/leagues/NBA_2015_games.html"
tables = pd.read_html(url)
games  = tables[0]
games.to_csv('tmp/games.csv', index=False)

## Data Read

This is the raw data coming in. We need to clean it up a bit before reshaping.

Inspect the data file before trying to read it

In [None]:
!head -n 5 tmp/games.csv

Build a list of the column names we want to process

In [None]:
column_names = ['date1', 'time1', '_', 'away_team', 'away_points', 
                'home_team', 'home_points', 'n_ot', 'notes']

Read in all of the columns, but rename them according to the passed *names*

In [None]:
games = pd.read_csv('tmp/games.csv', names=column_names, header=None, skiprows=2)
games

## Data Cleanup

So the ``date1`` and the ``time1`` columns in concert form the date with which we want to work.

Let's convert these to a new column ``date`` with a dtype of ``datetime64[ns]``. This is the standard for datetime storage in pandas. We combine the two strings with the ``+`` operator, then use ``pd.to_datetime`` to do the conversion. We mark any non-convertible strings with ``NaT``, the standard missing value indicator for ``datetimelikes``. (This is what ``errors='coerce'`` does.)

In [None]:
games = games.assign(date=lambda x: 
                     pd.to_datetime(x['date1'] + ' ' + x['time1'], errors='coerce'))
games
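
To see what ``errors='coerce'`` does in isolation, here is a minimal sketch on a made-up two-row frame. The sample strings are illustrative assumptions, not rows from the scraped file; the second row mimics a stray header line.

```python
import pandas as pd

# Hypothetical sample data (not taken from the real file)
raw = pd.DataFrame({
    'date1': ['2014-10-28', 'Date'],
    'time1': ['22:30', 'Start (ET)'],
})

# ``+`` concatenates the two string columns element-wise
combined = raw['date1'] + ' ' + raw['time1']

# Strings that cannot be parsed become NaT instead of raising
parsed = pd.to_datetime(combined, errors='coerce')
print(parsed)
```

Without ``errors='coerce'`` the second row would raise an error rather than turn into ``NaT``.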

In [None]:
games.dtypes

Drop the old columns we now no longer need

In [None]:
games = games.drop(['_', 'date1', 'time1', 'notes', 'n_ot'], axis='columns')
games

We can do a ``.set_index`` on a frame to take a column and make it the index

In [None]:
games.set_index('date')

In this case we want to set the Index to be the new ``date`` column values AND the current index.

By passing ``append=True``, we will form a ``MultiIndex`` with the existing ``Index`` as the first ``level`` and the ``date`` column as the second ``level``.



In [None]:
games = games.set_index('date', append=True)
games
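
The difference between replacing and appending can be sketched on a small made-up frame (the rows below are placeholders, not real schedule data):

```python
import pandas as pd

# Hypothetical two-game frame
df = pd.DataFrame({
    'date': pd.to_datetime(['2014-10-28', '2014-10-29']),
    'home_team': ['Los Angeles Lakers', 'Boston Celtics'],
})

replaced = df.set_index('date')               # the old 0..n-1 index is discarded
appended = df.set_index('date', append=True)  # the old index is kept as the first level

print(replaced.index.nlevels)  # 1
print(appended.index.nlevels)  # 2
```

The first level of ``appended`` has no name yet, which is why naming the levels afterwards is handy.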

We find names on ``Index`` levels convenient, so let's set them

In [None]:
games.index.names = ['game_id', 'date']
games       

# Question: Days of Rest

Whether or not your dataset is tidy depends on your question. 

> **How many days of rest did each team get between each game?**

Given our question, what is an observation?

## Data Organization

Is `games` a tidy dataset, given our question? No, we have multiple observations (teams) per row. We'll use `pd.melt` to fix that.

This is an operation that takes ``wide`` data and makes it ``long``

In [None]:
tidy = pd.melt(games.reset_index(),
               id_vars=['game_id', 'date'], 
               var_name='which',
               value_vars=['away_team', 'home_team'],
               value_name='team')
tidy

So we took our data and ``un-pivoted`` it, duplicating the ``game_id`` and ``date`` columns for each team. So we have 2472 rows now, from 1236 before.

In [None]:
tidy[tidy.game_id==0]
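
The row doubling is easier to see on a tiny example. This is a minimal sketch with two made-up games; the pairings are placeholders.

```python
import pandas as pd

# Hypothetical wide frame: one row per game, two team columns
wide = pd.DataFrame({
    'game_id': [0, 1],
    'away_team': ['Houston Rockets', 'Orlando Magic'],
    'home_team': ['Los Angeles Lakers', 'New Orleans Pelicans'],
})

# Each value column becomes a block of rows; id_vars are repeated per block
long = pd.melt(wide,
               id_vars=['game_id'],
               value_vars=['away_team', 'home_team'],
               var_name='which',
               value_name='team')
print(long)
```

Two games with two teams each yield four rows, one per (game, team) pair.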

To reverse the ``wide`` to ``long`` operation above, we can ``pivot`` to go from ``long`` to ``wide``

In [None]:
(tidy
     .pivot(index='game_id',columns='which')
     .reset_index()
 )

We now have the original 1236 rows. It's not *exactly* the same, in that we created a ``MultiIndex`` on the columns. But it should be clear that this is the same structure.
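
If flat column names are preferred, one option is to name the value column explicitly. This is a minimal sketch on a made-up long frame, not a change to the cell above:

```python
import pandas as pd

# Hypothetical long frame: one row per (game, team)
long = pd.DataFrame({
    'game_id': [0, 1, 0, 1],
    'which': ['away_team', 'away_team', 'home_team', 'home_team'],
    'team': ['Houston Rockets', 'Orlando Magic',
             'Los Angeles Lakers', 'New Orleans Pelicans'],
})

# Passing ``values`` keeps a single level of columns
wide = (long
        .pivot(index='game_id', columns='which', values='team')
        .reset_index())
wide.columns.name = None  # drop the leftover 'which' label on the columns
print(wide)
```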

## Translate Question to Operation

Now we have tidy data!

Each row provides a single observation. These are unique observations if you consider the tuple:

```python
(game_id, date, which)
```

And our ``variable`` is the ``team`` column

In [None]:
tidy

Now that translation from question to operation is direct:

For each team... get number of days between games


In [None]:
tidy.groupby('team')['date'].diff().dt.days

This is grouped for all teams so the calculation is done *per-team*

In [None]:
# here are the results for a single team

tidy.groupby('team').get_group('Los Angeles Lakers')

Note that ``.dt.days`` keeps only the whole-day part of each difference, which effectively rounds the number of days down

In [None]:
tidy.groupby('team').get_group('Los Angeles Lakers')['date'].diff().dt.days
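
A minimal sketch of the per-team ``diff`` and the rounding-down effect, using two placeholder teams and made-up tip-off times:

```python
import pandas as pd

# Hypothetical schedule, already in date order
tiny = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B'],
    'date': pd.to_datetime(['2014-10-28 19:00', '2014-10-28 22:30',
                            '2014-10-30 22:30', '2014-10-31 19:00']),
})

# diff() runs inside each team group; a team's first game has no predecessor
gaps = tiny.groupby('team')['date'].diff()
print(gaps)

# .dt.days keeps only the whole-day component of each timedelta
print(gaps.dt.days)
```

The two gaps are 2 days 3.5 hours and 2 days 20.5 hours, yet ``.dt.days`` reports 2 for both.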

Let's add on the ``rest`` column to indicate how many days of rest we get

In [None]:
tidy['rest'] = (tidy
                    .sort_values('date')
                    .groupby('team')
                    .date.diff()
                    .dt
                    .days
)
tidy.dropna()

Let's create a fancy plot, using the ``seaborn`` library. This is a nice way of taking a set of data (our teams), and displaying data about it (in the categories of ``which``)

In [None]:
(tidy.dropna()
     .pipe(sns.FacetGrid, col='team', col_wrap=9, hue='team')
     .map(sns.barplot, "which", "rest")
 )

Whoosh, that is an interesting plot. But what are we doing?

Let's select out a single team and examine

In [None]:
(tidy
     .dropna()
     .query('team == "Los Angeles Lakers"')
     .pipe(sns.FacetGrid, col='team', hue='team')
     .map(sns.barplot, "which", "rest")
 )

So we are effectively doing a ``mean`` on the key variable ``rest``

In addition we are illustrating ``.query``, which performs an operation similar to ``.loc[..]``: it selects data based on the passed criteria. ``.query`` accepts a string expression in which you can refer to columns directly (``team`` in this case). This is analogous to a ``SELECT ... WHERE`` in SQL-speak.

In [None]:
g = (tidy
        .dropna()
        .query('team == "Los Angeles Lakers"')
        .groupby('which')
     )
g.rest.mean()
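
The correspondence between ``.query`` and a boolean mask in ``.loc`` can be checked on a small made-up frame (values are placeholders):

```python
import pandas as pd

# Hypothetical tidy rows
df = pd.DataFrame({
    'team': ['Los Angeles Lakers', 'Boston Celtics', 'Los Angeles Lakers'],
    'rest': [1.0, 2.0, 3.0],
})

via_query = df.query('team == "Los Angeles Lakers"')   # string expression
via_loc = df.loc[df['team'] == 'Los Angeles Lakers']   # boolean mask

print(via_query.equals(via_loc))
```

Both forms keep the same rows; ``.query`` simply reads better in a method chain.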

***

# Question: Home Team Advantage

Let's now discuss some more reshaping operations. We have already seen: 

- ``.set_index()`` and ``.reset_index()`` to take a column and make it the index (and vice-versa).
- ``pd.melt()`` and ``.pivot()`` to go from ``wide`` to ``long`` (and vice-versa) using the uniques of a column.

Now let's meet a related pair of operations, ``.stack`` and ``.unstack``.
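
As a quick preview, here is a minimal sketch of the pair on a made-up two-game frame (team names are placeholders):

```python
import pandas as pd

# Hypothetical wide frame indexed by game
wide = pd.DataFrame(
    {'away_team': ['A', 'B'], 'home_team': ['C', 'D']},
    index=pd.Index([0, 1], name='game_id'),
)

stacked = wide.stack()        # column labels move into an inner index level
restored = stacked.unstack()  # the inner index level moves back to the columns

print(stacked)
print(restored)
```

``.stack`` produces a Series with a two-level index here, and ``.unstack`` undoes it.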

An "observation" depends on the question. Is there a Home team advantage?

In [None]:
home_adv = games.home_points - games.away_points
ax = home_adv.plot(kind='hist', bins=80, figsize=(10, 5))
ax.set_xlim(-40, 40)
ax.vlines(home_adv.mean(), *ax.get_ylim(), color='red', linewidth=3)
print('Home win percent:', (home_adv > 0).mean())

## Question: Team Strength

### Mini Project: Home Court Advantage?

What's the effect (in terms of probability to win) of being
the home team?

### Step 1. Calculate Win %

We need to create an indicator for whether the home team won.
Add it as a column called `home_win` in `games`.

In [None]:
games['home_win'] = games['home_points'] > games['away_points']
games

### Step 2: Find the win percent for each team

Teams are split across two columns. It's easiest to calculate the number of wins and number of games as away, and the number of wins and number of games as home. Then combine those two results to get the win percent.

This uses the ``.agg()`` method of groupby. With keyword arguments you can easily specify different aggregations (the values) AND name the resulting columns (the keywords) at the same time.

In [None]:
games.home_win

We are using ``(~x)`` to invert a boolean Series, in other words ``False`` -> ``True`` and ``True`` -> ``False``. The away team wins exactly when the home team does not.

In [None]:
# keyword form (named aggregation); passing a renaming dict here
# raises SpecificationError in pandas >= 1.0
wins_as_away = games.groupby('away_team').home_win.agg(
    n_games='count', n_wins=lambda x: (~x).sum()
)
wins_as_home = games.groupby('home_team').home_win.agg(
    n_games='count', n_wins='sum'
)
wins = (wins_as_away + wins_as_home)
wins
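
The count-then-add pattern can be verified by hand on a tiny made-up schedule. This is a sketch with three placeholder teams, using keyword-style named aggregation as available in recent pandas versions:

```python
import pandas as pd

# Hypothetical four-game schedule
games = pd.DataFrame({
    'away_team': ['A', 'B', 'C', 'A'],
    'home_team': ['B', 'A', 'A', 'C'],
    'home_win':  [True, False, True, True],
})

# An away win is a game where home_win is False, hence the inversion
as_away = games.groupby('away_team')['home_win'].agg(
    n_games='count', n_wins=lambda x: (~x).sum())
as_home = games.groupby('home_team')['home_win'].agg(
    n_games='count', n_wins='sum')

# Addition aligns on the team labels in the index
wins = as_away + as_home
print(wins)
```

Team ``A`` plays four games and wins only its home game against ``C``, so its row shows 4 games and 1 win.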

Finally, calculate the win percent.

In [None]:
strength = wins.n_wins / wins.n_games
strength.index.name = 'team'
strength.name = 'strength'
strength

This is a plot of the strength, a visualization of the values above

In [None]:
strength.sort_values().plot.barh(figsize=(4,8))

# Merging

Merging is one way of combining data from two different DataFrames into one. They don't have to be the same shape. This is very similar to a ``join`` operation in ``SQL``.

Bring the `strength` values in for each team, for each game.

For SQL people

```sql
SELECT *
FROM games NATURAL JOIN strength
```

We just need to get the names worked out.

In [None]:
(strength
         .head()
         .reset_index()
         .rename(columns=lambda x: 'away_' + x)
 )

We need to do a sequence of merges; here we are ``.pipe``-ing to ourselves to make the expression a tiny bit more readable.

In [None]:
(pd.merge(games.reset_index(), 
          strength.reset_index().add_prefix('away_'))
   .pipe(pd.merge, 
         strength.reset_index().add_prefix('home_'))
   .set_index(['game_id', 'date'])
)

That seemed a bit complicated, so here is a version for Python people.

We can use the ``.map()`` method, which takes a dictionary-like (a dict or a Series). It looks up each value of the target Series among the keys (or index) of the mapper and replaces it with the corresponding value from the mapper.

This is conceptually what a ``merge`` does!

In [None]:
games = games.assign(away_strength=games.away_team.map(strength),
                     home_strength=games.home_team.map(strength))
games
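
To make the link between ``.map`` and ``merge`` concrete, here is a minimal sketch with made-up strengths. Team ``Z`` is deliberately missing from the lookup, and ``how='left'`` is an assumption chosen to mirror how ``.map`` treats unknown keys.

```python
import pandas as pd

# Hypothetical lookup Series and games frame
strength = pd.Series({'A': 0.25, 'B': 1.0}, name='strength')
games = pd.DataFrame({'away_team': ['A', 'B', 'Z']})

# .map looks up each away_team value in the index of ``strength``
mapped = games['away_team'].map(strength)

# The same lookup as a left merge on a two-column lookup table
lookup = (strength
          .rename_axis('away_team')
          .rename('away_strength')
          .reset_index())
merged = games.merge(lookup, on='away_team', how='left')

print(mapped)
print(merged)
```

In both results ``Z`` gets ``NaN``, since it has no entry in the lookup.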

***

# Pivoting

Let's revisit pivot-ing.

Pivot takes the uniques in a column and forms columns out of them. There is no data summarization.

In [None]:
tidy

In [None]:
(tidy
     .pivot(index='game_id',columns='which')
     .reset_index()
 )

## Summarizing Pivot

However, sometimes you DO want to summarize data; ``pd.pivot_table`` will by default aggregate with ``.mean()``. This is a ``summarizing`` pivot.

In [None]:
pd.pivot_table(tidy,
                     values='rest',
                     index='which',
                     columns='team',
                     aggfunc='mean'
              )


This is equivalent to a groupby with the aggregation function

In [None]:
(tidy.groupby(['team','which'])
     .rest
     .mean()
)

followed by an unstack

In [None]:
(tidy.groupby(['team','which'])
     .rest
     .mean()
     .unstack('team')
)
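
The equivalence can be checked on a small made-up frame (the teams and rest values are placeholders):

```python
import pandas as pd

# Hypothetical tidy rows with a rest value per (team, which)
tiny = pd.DataFrame({
    'team':  ['A', 'A', 'A', 'B', 'B'],
    'which': ['away_team', 'away_team', 'home_team', 'away_team', 'home_team'],
    'rest':  [1.0, 3.0, 2.0, 4.0, 1.0],
})

via_pivot = pd.pivot_table(tiny, values='rest', index='which',
                           columns='team', aggfunc='mean')
via_groupby = (tiny.groupby(['team', 'which'])['rest']
                   .mean()
                   .unstack('team'))

print(via_pivot)
print(via_groupby)
```

Both tables hold the mean rest per team and venue, with ``which`` on the rows and ``team`` on the columns.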

## Transform Pivot

In [None]:
(pd.pivot_table(tidy,
                     values='rest',
                     index=['game_id','date'],
                     columns='which',
                     aggfunc='mean')
)

In [None]:
un = (pd.pivot_table(tidy,
                     values='rest',
                     index=['game_id','date'],
                     columns='which',
                     aggfunc='mean')
        .rename(columns={'away_team': 'away_rest', 'home_team': 'home_rest'})
)
un.columns.name = None

In [None]:
un.dropna()

# Concat

Sometimes we would like to ``glue`` pandas objects together, without the need for a merge. These objects will be aligned, so they don't have to be the same shape.

In [None]:
res = pd.concat([games, un], axis=1).reset_index('date')
res
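
A minimal sketch of the alignment, using two made-up frames of different lengths whose index labels are in a different order:

```python
import pandas as pd

# Hypothetical frames sharing only some index labels
left = pd.DataFrame({'home_points': [100, 95, 110]}, index=[0, 1, 2])
right = pd.DataFrame({'home_rest': [1.0, 2.0]}, index=[2, 1])

# axis=1 places the frames side by side, matching rows by index label
glued = pd.concat([left, right], axis=1)
print(glued)
```

Row ``0`` has no match in ``right``, so its ``home_rest`` is ``NaN``; rows ``1`` and ``2`` pick up their values by label, not by position.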

<img src='img/copyright.png'>