# Pandorable pandas

pandas is a powerful library for data analysis in Python. In this tutorial, we'll emphasize writing code that is both clean and performant.

## Contents

- [Introduction](Introduction.ipynb)
- [Alignment](Alignment.ipynb)
- [Tidy Data](Tidy.ipynb)
- [Performance](Performance.ipynb)

## Preview

As a taste of what we'll learn today, let's read in some data on flights in the New York region. Each row is a single flight. We have columns for the date, airline, plane, origin and destination, delay, etc.

In [None]:
import pandas as pd
import seaborn as sns

flights = pd.read_csv("data/ny-flights.csv.gz", parse_dates=["fl_date", "dep", "arr"])
flights

With pandas we can quickly load data, select subsets, and transform it for downstream tasks like modelling or visualization. For example, we can answer the question "How many planes are usually taking off?" with a small chain of methods that goes from raw data to a visualization.

In [None]:
(flights['dep']
    .value_counts()
    .resample('H')
    .sum()
    .rolling(24).mean()
    .plot(figsize=(12, 6),
          title="Number of Flights (24H Rolling Mean)"))
sns.despine()
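To see what each step of that chain contributes, here is a minimal sketch on a handful of made-up departure times. The toy timestamps, the 2-hour window, and the `"60min"` frequency string (used here in place of the hourly alias) are assumptions for illustration, not part of the flights dataset.

```python
import pandas as pd

# Hypothetical toy data: five departure timestamps (not from ny-flights.csv.gz).
dep = pd.Series(pd.to_datetime([
    "2014-01-01 00:05",
    "2014-01-01 00:05",
    "2014-01-01 00:40",
    "2014-01-01 02:15",
    "2014-01-01 03:30",
]), name="dep")

# value_counts: one row per distinct timestamp, with the timestamps as the index.
counts = dep.value_counts()

# resample + sum: bucket the timestamps into hourly bins and total each bin.
# sort_index is added here only so the bins are built from a time-ordered index.
# Hours with no departures get a total of 0.
hourly = counts.sort_index().resample("60min").sum()

# rolling + mean: smooth over a window (2 hours here; the chain above uses 24).
# The first window is incomplete, so its value is NaN.
smoothed = hourly.rolling(2).mean()

print(hourly)    # hourly totals: 3, 0, 1, 1
print(smoothed)  # NaN, 1.5, 0.5, 1.0
```

Each intermediate result is an ordinary `Series`, so you can break a chain at any step and inspect it while debugging.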

Combined with libraries like [seaborn](http://seaborn.pydata.org), we can quickly visualize the number of flights per carrier.

In [None]:
sns.countplot(
    x='unique_carrier',
    data=flights,
    order=flights['unique_carrier'].value_counts().index,
    palette='Blues_r'
)
sns.despine()

We can select subsets of the data (flights with a departure delay strictly between 1 and 500 minutes) and visualize the joint distribution of arrival and departure delays.

In [None]:
mask = (flights["dep_delay"] > 1) & (flights["dep_delay"] < 500)

sns.jointplot(x='dep_delay',
              y='arr_delay',
              data=flights[mask],
              alpha=.25, marker='.', height=8);
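The boolean mask used for that selection can be sketched on a tiny frame. The values below are made up; only the column names are borrowed from the flights data.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the tutorial's flights data.
toy = pd.DataFrame({
    "dep_delay": [-3.0, 1.0, 12.0, 45.0, 720.0],
    "arr_delay": [-8.0, 4.0, 9.0, 50.0, 700.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (toy["dep_delay"] > 1) & (toy["dep_delay"] < 500)

# Indexing with the mask keeps only the rows where it is True.
subset = toy[mask]
print(subset)  # rows 2 and 3: delays of 12 and 45 minutes
```

Both bounds are strict, so a delay of exactly 1 minute is excluded along with the early departure and the 720-minute outlier.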

## A familiar story

Part of writing pandorable code is readability. We want your data analysis pipelines to flow clearly from step to step. To illustrate this, let's retell a story from [Jeff Allen's](http://trestletech.com/wp-content/uploads/2015/07/dplyr.pdf) presentation on dplyr:

```python
tumble_after(
    broke(
        fell_down(
            fetch(
                went_up(jack_jill, "hill"),
                "water"),
            "jack"),
        "crown"),
    "jill"
)
```

You probably recognized this as the story of *Jack and Jill*. But it may not have been immediately obvious when told "inside-out" like that. Data analysis pipelines have the same shape: they take raw data and transform it into a useful result by applying a series of functions.

In English, we read left-to-right, top-to-bottom; not inside out. Let's rewrite the story, using *method chaining*.

```python
story = (
    jack_jill
        .went_up("hill")
        .fetch("water")
        .fell_down("jack")
        .broke("crown")
        .tumble_after("jill")
)
```

This story, told two ways, illustrates a few interesting points:

1. Functions / methods are typically *verbs* (`went_up`, `fetch`, etc.).
2. Function arguments are typically *nouns* (`jack`, `"crown"`, etc.).
3. For readability, it's helpful to have the *structure of your code* reflect the data flow through the pipeline.
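The same contrast can be sketched with real pandas objects. In the example below, the toy data and the three helper functions (`keep_delayed`, `mean_delay_by_carrier`, `worst_first`) are hypothetical names made up for illustration; `.pipe` is the pandas method that passes the object on to an ordinary function, which lets your own functions sit in a chain.

```python
import pandas as pd

# Hypothetical toy data; the column names are borrowed from the flights example.
toy = pd.DataFrame({
    "unique_carrier": ["AA", "AA", "B6", "B6", "DL"],
    "dep_delay": [10.0, -2.0, 30.0, 5.0, 0.0],
})

# Hypothetical helpers, one verb per step of the pipeline.
def keep_delayed(df):
    """Keep only the flights that left late."""
    return df[df["dep_delay"] > 0]

def mean_delay_by_carrier(df):
    """Average departure delay for each carrier."""
    return df.groupby("unique_carrier")["dep_delay"].mean()

def worst_first(s):
    """Order carriers from the largest mean delay to the smallest."""
    return s.sort_values(ascending=False)

# Inside-out: the first step is buried in the middle of the expression.
nested = worst_first(mean_delay_by_carrier(keep_delayed(toy)))

# Chained: one step per line, read top to bottom.
chained = (
    toy
    .pipe(keep_delayed)
    .pipe(mean_delay_by_carrier)
    .pipe(worst_first)
)

print(chained)
print(nested.equals(chained))  # True: same result, different reading order
```

Both forms compute the same thing; the chained version just puts the steps in the order they happen.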


We'll see a lot of method chaining today. It's best used in moderation.

Next, we move on to [Alignment](Alignment.ipynb).