![pandas](figures/pandas_logo.png)

*pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.*

## Data Structures

Python provides some high-performance data structures:

- arrays of arbitrary objects (`list`)
- key-value pairs (`dict`)
- records (`tuple`)
- unique items (`set`)

The scientific Python community does use these, but relies on third-party packages to provide:

1. N-Dimensional array (`numpy.ndarray`)
2. Labeled, heterogeneous, tabular data (`pandas.DataFrame`)
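
As a rough illustration of the two lists above, the sketch below builds one instance of each structure. All values are made up for the example and are not taken from the flights data.

```python
import numpy as np
import pandas as pd

# Built-in containers (illustrative values only)
values = [1, "two", 3.0]                        # list: ordered, arbitrary objects
lookup = {"EWR": "Newark", "JFK": "New York"}   # dict: key-value pairs
record = ("UA", 1545)                           # tuple: a fixed-size record
carriers = {"UA", "AA", "UA"}                   # set: duplicates collapse

# Third-party structures
grid = np.zeros((2, 3))                         # N-dimensional array, one dtype
table = pd.DataFrame(
    {"carrier": ["UA", "AA"], "distance": [1400.0, 1089.0]},
    index=["first", "second"],                  # row labels
)
print(table.dtypes)                             # each column keeps its own dtype
```
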

## Comparing pandas and dplyr

Pandas is focused on the data wrangling side of analysis. It leaves statistics to other packages like statsmodels, scikit-learn, PyMC3, and others.

The rest of this notebook runs through the [Introduction to dplyr](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html), showing the R and equivalent pandas code.
To preview the differences you'll see:

- Pandas uses methods, where dplyr uses functions
- Pandas uses `lambda`s or string literals, where dplyr / R use [Non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html)
- The biggest difference is pandas' use of row / column labels for alignment
- Both are great
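
To make the point about label alignment concrete, here is a minimal sketch with two hypothetical Series that share some labels but not their order. Arithmetic matches values by label rather than by position, and labels present on only one side come out as `NaN`.

```python
import pandas as pd

# Hypothetical values; the labels overlap only partially
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Values are paired by label: "y" with "y", "z" with "z".
# "x" and "w" have no partner, so their results are NaN.
total = a + b
print(total)
```
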

In [2]:
%load_ext rpy2.ipython

In [3]:
%%R
library(dplyr)
library(nycflights13)
library(feather)

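Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
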

In [4]:
%R head(flights)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,517,515,2.0,830,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,1357016000.0
2,2013,1,1,533,529,4.0,850,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,1357016000.0
3,2013,1,1,542,540,2.0,923,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,1357016000.0
4,2013,1,1,544,545,-1.0,1004,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,1357016000.0
5,2013,1,1,554,600,-6.0,812,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,1357020000.0
6,2013,1,1,554,558,-4.0,740,728,12.0,UA,1696,N39463,EWR,ORD,150.0,719.0,5.0,58.0,1357016000.0


In [5]:
%%R
write_feather(flights, "flights.feather")

In [6]:
%matplotlib inline

In [11]:
import pandas as pd
import feather

pd.options.display.max_rows = 10

In [8]:
flights = feather.read_dataframe("flights.feather")
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00


## Filter Rows

In [15]:
%R filter(flights, month == 1, day == 1)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,517,515,2.0,830,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,1.357016e+09
2,2013,1,1,533,529,4.0,850,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,1.357016e+09
3,2013,1,1,542,540,2.0,923,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,1.357016e+09
4,2013,1,1,544,545,-1.0,1004,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,1.357016e+09
5,2013,1,1,554,600,-6.0,812,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,1.357020e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838,2013,1,1,2356,2359,-3.0,425,437,-12.0,B6,727,N588JB,JFK,BQN,186.0,1576.0,23.0,59.0,1.357081e+09
839,2013,1,1,-2147483648,1630,,-2147483648,1815,,EV,4308,N18120,EWR,RDU,,416.0,16.0,30.0,1.357056e+09
840,2013,1,1,-2147483648,1935,,-2147483648,2240,,AA,791,N3EHAA,LGA,DFW,,1389.0,19.0,35.0,1.357067e+09
841,2013,1,1,-2147483648,1500,,-2147483648,1825,,AA,1925,N3EVAA,LGA,MIA,,1096.0,15.0,0.0,1.357052e+09


In [12]:
flights[(flights.month == 1) & (flights.day == 1)]

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,2013,1,1,2356.0,2359,-3.0,425.0,437,-12.0,B6,727,N588JB,JFK,BQN,186.0,1576.0,23.0,59.0,2013-01-01 23:00:00+00:00
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416.0,16.0,30.0,2013-01-01 16:00:00+00:00
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389.0,19.0,35.0,2013-01-01 19:00:00+00:00
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096.0,15.0,0.0,2013-01-01 15:00:00+00:00


Alternatively, you can use `.query`, which takes the filter as a string expression:

In [78]:
flights.query("month == 1 & day == 1")

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,2013,1,1,2356.0,2359,-3.0,425.0,437,-12.0,B6,727,N588JB,JFK,BQN,186.0,1576.0,23.0,59.0,2013-01-01 23:00:00+00:00
838,2013,1,1,,1630,,,1815,,EV,4308,N18120,EWR,RDU,,416.0,16.0,30.0,2013-01-01 16:00:00+00:00
839,2013,1,1,,1935,,,2240,,AA,791,N3EHAA,LGA,DFW,,1389.0,19.0,35.0,2013-01-01 19:00:00+00:00
840,2013,1,1,,1500,,,1825,,AA,1925,N3EVAA,LGA,MIA,,1096.0,15.0,0.0,2013-01-01 15:00:00+00:00
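
The two filtering styles can be compared on a small frame. This is a sketch on made-up data (the `toy` frame and `target_month` variable are hypothetical); it also shows that `.query` can refer to a local Python variable by prefixing it with `@`.

```python
import pandas as pd

# Toy stand-in for the flights table
toy = pd.DataFrame({
    "month": [1, 1, 2],
    "day": [1, 2, 1],
    "dep_delay": [2.0, None, -6.0],
})

# Boolean mask: each comparison needs its own parentheses
masked = toy[(toy.month == 1) & (toy.day == 1)]

# String expression: column names are looked up inside the frame
queried = toy.query("month == 1 & day == 1")

# Local variables are referenced with an @ prefix
target_month = 2
by_variable = toy.query("month == @target_month")
```
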


## Arrange rows

In [17]:
%R arrange(flights, year, month, day)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,517,515,2.0,830,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,1.357016e+09
2,2013,1,1,533,529,4.0,850,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,1.357016e+09
3,2013,1,1,542,540,2.0,923,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,1.357016e+09
4,2013,1,1,544,545,-1.0,1004,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,1.357016e+09
5,2013,1,1,554,600,-6.0,812,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,1.357020e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336772,2013,12,31,-2147483648,705,,-2147483648,931,,UA,1729,,EWR,DEN,,1605.0,7.0,5.0,1.388473e+09
336773,2013,12,31,-2147483648,825,,-2147483648,1029,,US,1831,,JFK,CLT,,541.0,8.0,25.0,1.388477e+09
336774,2013,12,31,-2147483648,1615,,-2147483648,1800,,MQ,3301,N844MQ,LGA,RDU,,431.0,16.0,15.0,1.388506e+09
336775,2013,12,31,-2147483648,600,,-2147483648,735,,UA,219,,EWR,ORD,,719.0,6.0,0.0,1.388470e+09


In [18]:
flights.sort_values(["year", "month", "day"])

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111291,2013,12,31,,705,,,931,,UA,1729,,EWR,DEN,,1605.0,7.0,5.0,2013-12-31 07:00:00+00:00
111292,2013,12,31,,825,,,1029,,US,1831,,JFK,CLT,,541.0,8.0,25.0,2013-12-31 08:00:00+00:00
111293,2013,12,31,,1615,,,1800,,MQ,3301,N844MQ,LGA,RDU,,431.0,16.0,15.0,2013-12-31 16:00:00+00:00
111294,2013,12,31,,600,,,735,,UA,219,,EWR,ORD,,719.0,6.0,0.0,2013-12-31 06:00:00+00:00
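
dplyr's `arrange` takes `desc()` for descending order; the closest pandas counterpart is the `ascending` argument of `sort_values`. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "arr_delay": [11.0, 20.0, None, 33.0],
})

# Largest delay first; missing values are placed last by default
worst_first = toy.sort_values("arr_delay", ascending=False)

# One flag per key: month ascending, delay descending within each month
mixed = toy.sort_values(["month", "arr_delay"], ascending=[True, False])
```
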


## Select columns

In [20]:
%R select(flights, year, month, day)

Unnamed: 0,year,month,day
1,2013,1,1
2,2013,1,1
3,2013,1,1
4,2013,1,1
5,2013,1,1
...,...,...,...
336772,2013,9,30
336773,2013,9,30
336774,2013,9,30
336775,2013,9,30


In [21]:
flights[['year', 'month', 'day']]

Unnamed: 0,year,month,day
0,2013,1,1
1,2013,1,1
2,2013,1,1
3,2013,1,1
4,2013,1,1
...,...,...,...
336771,2013,9,30
336772,2013,9,30
336773,2013,9,30
336774,2013,9,30
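
dplyr's `select` also accepts column ranges and helpers. The sketch below shows pandas idioms that play a similar role, using a hypothetical `toy` frame with a few of the flights columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 9],
    "day": [1, 30],
    "dep_delay": [2.0, None],
    "arr_delay": [11.0, None],
})

by_name = toy[["year", "month", "day"]]   # explicit list of columns
by_range = toy.loc[:, "year":"day"]       # label slices include both ends
delays = toy.filter(like="delay")         # names containing "delay"
dropped = toy.drop(columns=["year"])      # everything except year
```
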


## Extract distinct (unique) rows

In [22]:
%R distinct(flights, tailnum)

Unnamed: 0,tailnum
1,N14228
2,N24211
3,N619AA
4,N804JB
5,N668DN
...,...
4040,N766SK
4041,N772SK
4042,N776SK
4043,N785SK


In [24]:
flights.tailnum.drop_duplicates()

0         N14228
1         N24211
2         N619AA
3         N804JB
4         N668DN
           ...  
327436    N766SK
329041    N772SK
330033    N776SK
331007    N785SK
334259    N557AS
Name: tailnum, Length: 4044, dtype: object
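
Note that `flights.tailnum.drop_duplicates()` returns a Series, while dplyr's `distinct` returns a table. A sketch on a hypothetical `toy` frame shows how to keep a DataFrame instead, and how to take distinct combinations of several columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "tailnum": ["N1", "N2", "N1"],
    "origin": ["EWR", "JFK", "EWR"],
    "dep_delay": [2.0, 4.0, -1.0],
})

as_series = toy.tailnum.drop_duplicates()       # Series of unique values
as_frame = toy[["tailnum"]].drop_duplicates()   # one-column DataFrame

# Distinct (tailnum, origin) pairs, keeping only those two columns
pairs = toy[["tailnum", "origin"]].drop_duplicates()
```
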

## Add new columns

In [26]:
%%R
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60)

# A tibble: 336,776 x 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more row

In [28]:
flights.assign(
    gain=flights.arr_delay - flights.dep_delay,
    speed=flights.distance / flights.air_time * 60
)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,...,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00,9.0,370.044053
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,...,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00,16.0,374.273128
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,...,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00,31.0,408.375000
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00,-17.0,516.721311
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00,-19.0,394.137931
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,...,,JFK,DCA,,213.0,14.0,55.0,2013-09-30 14:00:00+00:00,,
336772,2013,9,30,,2200,,,2312,,9E,...,,LGA,SYR,,198.0,22.0,0.0,2013-09-30 22:00:00+00:00,,
336773,2013,9,30,,1210,,,1330,,MQ,...,N535MQ,LGA,BNA,,764.0,12.0,10.0,2013-09-30 12:00:00+00:00,,
336774,2013,9,30,,1159,,,1344,,MQ,...,N511MQ,LGA,CLE,,419.0,11.0,59.0,2013-09-30 11:00:00+00:00,,


R's non-standard evaluation (NSE) really shines here. Because the arguments aren't evaluated until inside `mutate`, you can create
a new column and use it in the same `mutate`.

In [29]:
%%R
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

# A tibble: 336,776 x 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more row

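With pandas, this takes either two separate `.assign` calls or, on pandas 0.23+ with Python 3.6+, a single call in which the second column is a callable that refers to the first.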
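
A sketch of the pandas side on a hypothetical `toy` frame. The first version chains two `.assign` calls. The second assumes pandas 0.23 or later on Python 3.6 or later, where keyword arguments are applied in order and a later callable can see a column created earlier in the same call.

```python
import pandas as pd

toy = pd.DataFrame({
    "arr_delay": [11.0, 20.0],
    "dep_delay": [2.0, 4.0],
    "air_time": [120.0, 60.0],
})

# Two chained calls: the second sees the column added by the first
two_calls = (
    toy.assign(gain=lambda df: df.arr_delay - df.dep_delay)
       .assign(gain_per_hour=lambda df: df.gain / (df.air_time / 60))
)

# One call (assumes pandas >= 0.23, Python >= 3.6): keywords are
# evaluated in order, so the second lambda can use `gain`
one_call = toy.assign(
    gain=lambda df: df.arr_delay - df.dep_delay,
    gain_per_hour=lambda df: df.gain / (df.air_time / 60),
)
```
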

It also makes using the pipe operator more elegant than in Python.

In [35]:
%%R
filter(flights, day==1) %>% mutate(
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

# A tibble: 11,036 x 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 11,026 more rows,

The typical pattern in pandas is to "delay" an argument's evaluation by wrapping it in a `lambda` (anonymous function). Notice the `lambda df:`, which defines a function that takes a single argument, `df`.

In [43]:
flights[flights.day == 1].assign(
    gain = lambda df: df.arr_delay - df.dep_delay,
    speed = lambda df: df.distance / df.air_time * 60
)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,...,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00,9.0,370.044053
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,...,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00,16.0,374.273128
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,...,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00,31.0,408.375000
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00,-17.0,516.721311
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00,-19.0,394.137931
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
309915,2013,9,1,2302.0,2305,-3.0,8.0,13,-5.0,B6,...,N629JB,JFK,BOS,42.0,187.0,23.0,5.0,2013-09-01 23:00:00+00:00,-2.0,267.142857
309916,2013,9,1,2329.0,2245,44.0,40.0,1,39.0,B6,...,N373JB,JFK,BTV,48.0,266.0,22.0,45.0,2013-09-01 22:00:00+00:00,-5.0,332.500000
309917,2013,9,1,2351.0,2359,-8.0,335.0,350,-15.0,B6,...,N588JB,JFK,PSE,204.0,1617.0,23.0,59.0,2013-09-01 23:00:00+00:00,-7.0,475.588235
309918,2013,9,1,2352.0,2359,-7.0,323.0,344,-21.0,B6,...,N768JB,JFK,SJU,196.0,1598.0,23.0,59.0,2013-09-01 23:00:00+00:00,-14.0,489.183673


## Summarise values

In [44]:
%%R
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 1 x 1
     delay
     <dbl>
1 12.63907


In [47]:
flights.aggregate({"dep_delay": "mean"}).rename({"dep_delay": "delay"})

delay    12.63907
dtype: float64
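
For a single summary value, calling the reduction on the column directly is a common shorter form. The sketch below uses a hypothetical `toy` frame; pandas reductions skip missing values by default, which plays the role of `na.rm = TRUE` in the R code.

```python
import pandas as pd

toy = pd.DataFrame({"dep_delay": [2.0, 4.0, None, -6.0]})

# Missing values are skipped by default (skipna=True)
delay = toy["dep_delay"].mean()

# Opting out of the default propagates the missing value
strict = toy["dep_delay"].mean(skipna=False)

# Same pattern as the cell above: aggregate, then rename the label
summary = toy.aggregate({"dep_delay": "mean"}).rename({"dep_delay": "delay"})
```
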

## Randomly sample rows

In [49]:
%R sample_n(flights, 10)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,6,1,1948,1955,-7.0,2236,2253,-17.0,9E,3450,N912XJ,JFK,JAX,110.0,828.0,19.0,55.0,1370113000.0
2,2013,8,22,2000,1835,85.0,2203,2010,113.0,MQ,3674,N530MQ,LGA,CLE,62.0,419.0,18.0,35.0,1377194000.0
3,2013,7,21,45,1900,345.0,249,2132,317.0,DL,947,N991DL,LGA,ATL,104.0,762.0,19.0,0.0,1374433000.0
4,2013,3,2,1919,1925,-6.0,2112,2126,-14.0,9E,3899,N819AY,JFK,CLE,75.0,425.0,19.0,25.0,1362251000.0
5,2013,12,26,1532,1530,2.0,1854,1903,-9.0,DL,417,N156DL,JFK,LAX,333.0,2475.0,15.0,30.0,1388070000.0
6,2013,8,25,818,825,-7.0,1036,1104,-28.0,DL,857,N3750D,JFK,SAN,298.0,2446.0,8.0,25.0,1377418000.0
7,2013,5,7,1834,1845,-11.0,1957,2030,-33.0,MQ,4517,N713MQ,LGA,CRW,63.0,444.0,18.0,45.0,1367950000.0
8,2013,10,3,2246,2255,-9.0,2352,9,-17.0,B6,486,N284JB,JFK,ROC,47.0,264.0,22.0,55.0,1380838000.0
9,2013,9,17,2110,1919,111.0,6,2234,92.0,B6,71,N644JB,JFK,SLC,261.0,1990.0,19.0,19.0,1379444000.0
10,2013,7,7,1755,1735,20.0,2207,2108,59.0,DL,1543,N710TW,JFK,SEA,333.0,2422.0,17.0,35.0,1373216000.0


In [50]:
flights.sample(n=10)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
113354,2013,2,3,1543.0,1548,-5.0,1709.0,1722,-13.0,UA,279,N843UA,EWR,ORD,124.0,719.0,15.0,48.0,2013-02-03 15:00:00+00:00
191820,2013,4,29,1111.0,1125,-14.0,1238.0,1305,-27.0,AA,327,N583AA,LGA,ORD,111.0,733.0,11.0,25.0,2013-04-29 11:00:00+00:00
275740,2013,7,27,1720.0,1721,-1.0,1921.0,1920,1.0,EV,4581,N11548,EWR,CMH,73.0,463.0,17.0,21.0,2013-07-27 17:00:00+00:00
125081,2013,2,16,1824.0,1829,-5.0,1954.0,2018,-24.0,EV,4202,N14920,EWR,STL,128.0,872.0,18.0,29.0,2013-02-16 18:00:00+00:00
33690,2013,10,8,803.0,810,-7.0,1024.0,1037,-13.0,FL,346,N910AT,LGA,ATL,110.0,762.0,8.0,10.0,2013-10-08 08:00:00+00:00
40511,2013,10,15,1415.0,1415,0.0,1652.0,1636,16.0,DL,673,N338NB,EWR,ATL,107.0,746.0,14.0,15.0,2013-10-15 14:00:00+00:00
224930,2013,6,4,643.0,639,4.0,858.0,906,-8.0,UA,599,N487UA,EWR,DEN,226.0,1605.0,6.0,39.0,2013-06-04 06:00:00+00:00
217058,2013,5,26,1550.0,1555,-5.0,1754.0,1808,-14.0,DL,95,N315NB,JFK,DTW,83.0,509.0,15.0,55.0,2013-05-26 15:00:00+00:00
256490,2013,7,7,2043.0,1920,83.0,2246.0,2045,121.0,AA,1762,N3HTAA,JFK,BOS,42.0,187.0,19.0,20.0,2013-07-07 19:00:00+00:00
132974,2013,2,25,1509.0,1515,-6.0,1646.0,1700,-14.0,MQ,4333,N673MQ,JFK,PIT,71.0,340.0,15.0,15.0,2013-02-25 15:00:00+00:00
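
Two options of `DataFrame.sample` are often useful alongside `n`: `random_state` makes the draw reproducible, and `frac` samples a fraction of the rows, similar in spirit to dplyr's `sample_frac`. A sketch on a hypothetical `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({"flight": range(10)})

# Fixing the seed gives the same rows on every run
first = toy.sample(n=3, random_state=0)
again = toy.sample(n=3, random_state=0)

# Sample 20% of the rows instead of a fixed count
fraction = toy.sample(frac=0.2, random_state=0)
```
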


In [61]:
%%R
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
delay

# A tibble: 2,962 x 4
   tailnum count     dist      delay
     <chr> <int>    <dbl>      <dbl>
 1  N0EGMQ   371 676.1887  9.9829545
 2  N10156   153 757.9477 12.7172414
 3  N102UW    48 535.8750  2.9375000
 4  N103US    46 535.1957 -6.9347826
 5  N104UW    47 535.2553  1.8043478
 6  N10575   289 519.7024 20.6914498
 7  N105UW    45 524.8444 -0.2666667
 8  N107US    41 528.7073 -5.7317073
 9  N108UW    60 534.5000 -1.2500000
10  N109UW    48 535.8750 -2.5208333
# ... with 2,952 more rows


In [63]:
flights

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400.0,5.0,15.0,2013-01-01 05:00:00+00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416.0,5.0,29.0,2013-01-01 05:00:00+00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089.0,5.0,40.0,2013-01-01 05:00:00+00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576.0,5.0,45.0,2013-01-01 05:00:00+00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762.0,6.0,0.0,2013-01-01 06:00:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,3393,,JFK,DCA,,213.0,14.0,55.0,2013-09-30 14:00:00+00:00
336772,2013,9,30,,2200,,,2312,,9E,3525,,LGA,SYR,,198.0,22.0,0.0,2013-09-30 22:00:00+00:00
336773,2013,9,30,,1210,,,1330,,MQ,3461,N535MQ,LGA,BNA,,764.0,12.0,10.0,2013-09-30 12:00:00+00:00
336774,2013,9,30,,1159,,,1344,,MQ,3572,N511MQ,LGA,CLE,,419.0,11.0,59.0,2013-09-30 11:00:00+00:00


In [69]:
by_tailnum = flights.groupby("tailnum")
delay = by_tailnum.aggregate({
    "year": "count",
    "distance": "mean",
    "arr_delay": "mean"
}).rename(columns={"year": "count", "distance": "dist",
                   "arr_delay": "delay"})
# Parenthesize each comparison: & binds more tightly than < and >
delay = delay[(delay['count'] > 20) & (delay['dist'] < 2000)]
delay

Unnamed: 0_level_0,count,dist,delay
tailnum,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
N0EGMQ,371,676.188679,9.982955
N10156,153,757.947712,12.717241
N102UW,48,535.875000,2.937500
N103US,46,535.195652,-6.934783
...,...,...,...
N997DL,63,867.761905,4.903226
N998AT,26,593.538462,29.960000
N998DL,77,857.818182,16.394737
N999DN,61,895.459016,14.311475
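
The aggregate-then-rename step can also be written with named aggregation, assuming pandas 0.25 or later. The sketch below uses a hypothetical `toy` frame and smaller thresholds than the flights example. Each comparison in the filter is parenthesized because `&` binds more tightly than `<` and `>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "tailnum": ["N1", "N1", "N1", "N2", "N2", None],
    "distance": [100.0, 300.0, 200.0, 2500.0, 2500.0, 50.0],
    "arr_delay": [10.0, None, 20.0, 5.0, 5.0, 1.0],
})

# Named aggregation: output_name=(input_column, function).
# Rows with a missing tailnum are dropped by groupby by default.
delay = toy.groupby("tailnum").agg(
    count=("distance", "size"),     # rows per group, missing values included
    dist=("distance", "mean"),
    delay=("arr_delay", "mean"),    # mean skips missing values
)

# Toy thresholds; the flights example uses count > 20 and dist < 2000
filtered = delay[(delay["count"] > 2) & (delay["dist"] < 2000)]
```
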


## Chaining

In [73]:
%%R
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)


Source: local data frame [49 x 5]
Groups: year, month [11]

# A tibble: 49 x 5
    year month   day      arr      dep
   <int> <int> <int>    <dbl>    <dbl>
 1  2013     1    16 34.24736 24.61287
 2  2013     1    31 32.60285 28.65836
 3  2013     2    11 36.29009 39.07360
 4  2013     2    27 31.25249 37.76327
 5  2013     3     8 85.86216 83.53692
 6  2013     3    18 41.29189 30.11796
 7  2013     4    10 38.41231 33.02368
 8  2013     4    12 36.04814 34.83843
 9  2013     4    18 36.02848 34.91536
10  2013     4    19 47.91170 46.12783
# ... with 39 more rows


In [77]:
(flights.groupby(['year', 'month', 'day'])
 [['arr_delay', 'dep_delay']]
 .mean()
 .rename(columns=lambda x: x.split('_')[0])
 .loc[lambda x: (x.arr > 30) | (x.dep > 30)]
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,arr,dep
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1
2013,1,16,34.247362,24.612865
2013,1,31,32.602854,28.658363
2013,2,11,36.290094,39.073598
2013,2,27,31.252492,37.763274
2013,3,8,85.862155,83.536921
2013,...,...,...,...
2013,12,9,42.575556,34.800221
2013,12,10,44.508796,26.465494
2013,12,14,46.397504,28.361552
2013,12,17,55.871856,40.705602


## Automatic Alignment

In [12]:
from pandas_datareader.data import DataReader
import json

In [28]:
with open("data/states.json") as f:
    states = json.load(f)
states

{'AK': 'Alaska',
 'AL': 'Alabama',
 'AR': 'Arkansas',
 'AZ': 'Arizona',
 'CA': 'California',
 'CO': 'Colorado',
 'CT': 'Connecticut',
 'DC': 'District of Columbia',
 'DE': 'Delaware',
 'FL': 'Florida',
 'GA': 'Georgia',
 'HI': 'Hawaii',
 'IA': 'Iowa',
 'ID': 'Idaho',
 'IL': 'Illinois',
 'IN': 'Indiana',
 'KS': 'Kansas',
 'KY': 'Kentucky',
 'LA': 'Louisiana',
 'MA': 'Massachusetts',
 'MD': 'Maryland',
 'ME': 'Maine',
 'MI': 'Michigan',
 'MN': 'Minnesota',
 'MO': 'Missouri',
 'MS': 'Mississippi',
 'MT': 'Montana',
 'NC': 'North Carolina',
 'ND': 'North Dakota',
 'NE': 'Nebraska',
 'NH': 'New Hampshire',
 'NJ': 'New Jersey',
 'NM': 'New Mexico',
 'NV': 'Nevada',
 'NY': 'New York',
 'OH': 'Ohio',
 'OK': 'Oklahoma',
 'OR': 'Oregon',
 'PA': 'Pennsylvania',
 'RI': 'Rhode Island',
 'SC': 'South Carolina',
 'SD': 'South Dakota',
 'TN': 'Tennessee',
 'TX': 'Texas',
 'UT': 'Utah',
 'VA': 'Virginia',
 'VT': 'Vermont',
 'WA': 'Washington',
 'WI': 'Wisconsin',
 'WV': 'West Virginia',
 'WY': 'Wyoming'}

In [29]:
gdp_series = [f'{state}RGSP' for state in states]
pop_series = [f'{state}POP' for state in states]

In [32]:
gdp = DataReader(gdp_series, data_source="fred", start="1997-01-01")
gdp.to_csv("data/gdp.csv")

In [38]:
pop = DataReader(pop_series, data_source="fred", start="1900-01-01")
pop.to_csv("data/pop.csv")