# Indexing Time Series

## Using pandas to read DateTime objects

* read_csv() function
    * Can read strings into datetime objects
    * Need to specify **'parse_dates=True'**
* ISO 8601 format
    * yyyy-mm-dd hh:mm:ss
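As a small self-contained illustration of the points above, the sketch below parses an in-memory CSV rather than a real file. The CSV text and the `flights` and `Delay` names are made up for the example; only `read_csv`, `index_col` and `parse_dates` come from pandas.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file (ISO 8601 timestamps)
csv_text = """Date,Delay
2015-07-01 06:30:00,22
2015-07-01 20:55:00,-5
2015-07-02 12:30:00,4
"""

# parse_dates=True asks pandas to try parsing the index column as datetimes
flights = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# The index should now be a DatetimeIndex rather than plain strings
print(type(flights.index).__name__)
```

Without `parse_dates=True` the same call would leave the index as strings, and the date-based selection shown later would not work.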

**Parse Dates**

In [1]:
import pandas as pd
df = pd.read_csv(
    'https://assets.datacamp.com/production/repositories/497/datasets/5b808399816c8dcb8eef08336595ef9b4eb22902/austin_airport_departure_data_2015_july.csv',
    header=10,
    index_col='Date (MM/DD/YYYY)',
    parse_dates=True,
)
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1741 entries, 2015-07-01 to NaT
Data columns (total 17 columns):
  Carrier Code                            1741 non-null object
Flight Number                             1740 non-null float64
Tail Number                               1740 non-null object
Destination Airport                       1740 non-null object
Scheduled Departure Time                  1740 non-null object
Actual Departure Time                     1740 non-null object
Scheduled Elapsed Time(Minutes)           1740 non-null float64
Actual Elapsed Time(Minutes)              1740 non-null float64
Departure Delay(Minutes)                  1740 non-null float64
Wheels-off Time                           1740 non-null object
Taxi-out Time(Minutes)                    1740 non-null float64
DelayCarrier(Minutes)                     1740 non-null float64
DelayWeather(Minutes)                     1740 non-null float64
DelayNational Aviation System(Minutes)    1740 non-null 

## Partial datetime string selection

* Alternative formats:
    * df.loc['July 5, 2015']
    * df.loc['2015-July-5']
* Whole month: df.loc['2015-7']
* Whole year: df.loc['2015']
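A minimal sketch of partial string selection on a toy series follows. The 12-hour spacing and the `ts` name are assumptions chosen so that each day holds more than one row.

```python
import pandas as pd

# Hypothetical series with two observations per day, crossing a year boundary
idx = pd.date_range("2014-12-31", "2015-08-02", freq=pd.Timedelta(hours=12))
ts = pd.Series(range(len(idx)), index=idx)

# A single day, written in two different string formats
one_day = ts.loc["July 5, 2015"]
same_day = ts.loc["2015-07-05"]

# A whole month and a whole year
july = ts.loc["2015-7"]
year_2015 = ts.loc["2015"]

print(len(one_day), len(july))
```

Because the strings are less precise than the index, each selection returns every row that falls inside the named day, month or year.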

### Selecting whole month

In [2]:
july2015 = df.loc['2015-7']
july2015.head()

Date (MM/DD/YYYY),Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,Actual Departure Time,Scheduled Elapsed Time(Minutes),Actual Elapsed Time(Minutes),Departure Delay(Minutes),Wheels-off Time,Taxi-out Time(Minutes),DelayCarrier(Minutes),DelayWeather(Minutes),DelayNational Aviation System(Minutes),DelaySecurity(Minutes),DelayLate Aircraft Arrival(Minutes),Unnamed: 17
2015-07-01,WN,103.0,N8607M,MDW,06:30,06:52,165.0,147.0,22.0,07:01,9.0,0.0,0.0,0.0,0.0,0.0,
2015-07-01,WN,144.0,N8609A,SAN,20:55,20:50,170.0,158.0,-5.0,21:03,13.0,0.0,0.0,0.0,0.0,0.0,
2015-07-01,WN,178.0,N646SW,ELP,20:30,20:45,90.0,80.0,15.0,20:55,10.0,0.0,0.0,0.0,0.0,0.0,
2015-07-01,WN,232.0,N204WN,ATL,05:45,05:49,135.0,137.0,4.0,06:01,12.0,0.0,0.0,0.0,0.0,0.0,
2015-07-01,WN,238.0,N233LV,DAL,12:30,12:34,55.0,48.0,4.0,12:41,7.0,0.0,0.0,0.0,0.0,0.0,


### Slicing using dates / times

In [3]:
df.loc['2015-07-05' : '2015-07-10'].head()

Date (MM/DD/YYYY),Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,Actual Departure Time,Scheduled Elapsed Time(Minutes),Actual Elapsed Time(Minutes),Departure Delay(Minutes),Wheels-off Time,Taxi-out Time(Minutes),DelayCarrier(Minutes),DelayWeather(Minutes),DelayNational Aviation System(Minutes),DelaySecurity(Minutes),DelayLate Aircraft Arrival(Minutes),Unnamed: 17
2015-07-05,WN,144.0,N8651A,SAN,20:55,21:10,170.0,160.0,15.0,21:20,10.0,0.0,0.0,0.0,0.0,0.0,
2015-07-05,WN,178.0,N7703A,ELP,20:30,20:39,90.0,84.0,9.0,20:48,9.0,0.0,0.0,0.0,0.0,0.0,
2015-07-05,WN,238.0,N751SW,DAL,12:30,12:33,55.0,51.0,3.0,12:41,8.0,0.0,0.0,0.0,0.0,0.0,
2015-07-05,WN,285.0,N528SW,DAL,09:25,09:24,55.0,50.0,-1.0,09:32,8.0,0.0,0.0,0.0,0.0,0.0,
2015-07-05,WN,311.0,N489WN,MDW,15:05,15:34,150.0,143.0,29.0,15:42,8.0,22.0,0.0,0.0,0.0,0.0,
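One detail worth noting about date slices with `.loc` is that, unlike ordinary Python slices, both endpoints are included. The toy `daily` series below is an assumption used only to show this.

```python
import pandas as pd

# Hypothetical daily series covering the first half of July 2015
idx = pd.date_range("2015-07-01", "2015-07-15", freq="D")
daily = pd.Series(range(len(idx)), index=idx)

# Label-based date slicing keeps both the start and the end date
window = daily.loc["2015-07-05":"2015-07-10"]

print(len(window))       # 6 days: July 5 through July 10
print(window.index[-1])  # last label is 2015-07-10
```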


### Converting strings to datetime

The `pd.to_datetime()` function converts strings, such as those in ISO 8601 format, to pandas datetime objects.

In [4]:
july_11_2015 = pd.to_datetime(['2015-7-11'])
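The sketch below shows a few common variations on the conversion. The sample strings are invented; `format` and `errors` are existing `pd.to_datetime` parameters.

```python
import pandas as pd

# ISO 8601 strings are parsed without extra arguments
iso_dates = pd.to_datetime(["2015-07-11", "2015-07-12"])

# Non-ISO strings are safer with an explicit format
us_dates = pd.to_datetime(["07/11/2015", "07/12/2015"], format="%m/%d/%Y")

# errors="coerce" turns unparseable entries into NaT instead of raising
with_bad = pd.to_datetime(["2015-07-11", "not a date"], errors="coerce")

print(iso_dates)
print(with_bad.isna().tolist())
```

Passing a list returns a `DatetimeIndex`, which is why the notebook cell above wraps the single string in brackets.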

## Resampling time series data

* Statistical methods over different time intervals
    * mean(), sum(), count(), etc.
* Downsampling
    * Reduce datetime rows to a slower frequency
    * e.g. Daily -> Weekly
* Upsampling
    * Increase datetime rows to a faster frequency
    * e.g. Daily -> Hourly
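The two directions can be sketched on a toy series as follows. The `daily` data is made up, and forward-filling is only one possible way to fill the new hourly rows.

```python
import pandas as pd

# Hypothetical daily series: 14 days starting Monday 2015-07-06
idx = pd.date_range("2015-07-06", periods=14, freq="D")
daily = pd.Series(range(14), index=idx, dtype="float64")

# Downsampling: daily -> weekly mean (weeks end on Sunday by default)
weekly = daily.resample("W").mean()

# Upsampling: daily -> hourly, repeating the last known daily value
hourly = daily.resample(pd.Timedelta(hours=1)).ffill()

print(weekly)
print(len(hourly))
```

Downsampling needs an aggregation such as `mean()` because many rows collapse into one, while upsampling needs a fill rule because new rows have no data of their own.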

In [5]:
# Downsampling
weekly_mean = df.resample('W').mean()
weekly_mean

Date (MM/DD/YYYY),Flight Number,Scheduled Elapsed Time(Minutes),Actual Elapsed Time(Minutes),Departure Delay(Minutes),Taxi-out Time(Minutes),DelayCarrier(Minutes),DelayWeather(Minutes),DelayNational Aviation System(Minutes),DelaySecurity(Minutes),DelayLate Aircraft Arrival(Minutes),Unnamed: 17
2015-07-05,1908.610487,130.337079,124.003745,10.164794,10.041199,2.370787,0.198502,0.902622,0.033708,5.243446,
2015-07-12,1930.015228,129.809645,126.30203,7.370558,10.098985,1.680203,0.101523,1.192893,0.015228,3.93401,
2015-07-19,1930.015228,129.809645,125.236041,14.441624,9.807107,3.060914,0.979695,1.167513,0.0,6.913706,
2015-07-26,1930.015228,129.809645,123.413706,10.418782,9.804569,5.395939,0.0,1.124365,0.0,2.621827,
2015-08-02,1779.835052,128.453608,122.257732,8.206186,9.979381,2.154639,0.0,1.04811,0.0,3.934708,


## Manipulating data

* String methods
    * Substring matching

In [6]:
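# Note: this reassigns df to the world population dataset; the flight data must be reloaded before the later resampling cells
df = pd.read_csv('https://assets.datacamp.com/production/repositories/497/datasets/2175fef4b3691db03449bbc7ddffb740319c1131/world_ind_pop_data.csv')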
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName                      13374 non-null object
CountryCode                      13374 non-null object
Year                             13374 non-null int64
Total Population                 13374 non-null float64
Urban population (% of total)    13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB


Unnamed: 0,CountryName,CountryCode,Year,Total Population,Urban population (% of total)
0,Arab World,ARB,1960,92495900.0,31.285384
1,Caribbean small states,CSS,1960,4190810.0,31.59749
2,Central Europe and the Baltics,CEB,1960,91401580.0,44.507921
3,East Asia & Pacific (all income levels),EAS,1960,1042475000.0,22.471132
4,East Asia & Pacific (developing only),EAP,1960,896493000.0,16.917679


In [7]:
# Using .info() to find that there are 220 matching entries
df[df['CountryName'].str.contains('North')].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 220 entries, 20 to 13303
Data columns (total 5 columns):
CountryName                      220 non-null object
CountryCode                      220 non-null object
Year                             220 non-null int64
Total Population                 220 non-null float64
Urban population (% of total)    220 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 10.3+ KB


In [8]:
# Using .sum() to infer there are 220 matching entries
df['CountryName'].str.contains('North').sum()

220
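The counting trick works because `True` is treated as 1 when summed. A small sketch on invented names, which also shows the `case` and `na` options of `str.contains`:

```python
import pandas as pd

# Hypothetical names, including a lowercase match and a missing value
names = pd.Series(["North America", "Northern Ireland", "South Asia", "north pole", None])

# na=False treats missing values as non-matches, giving a clean Boolean mask
mask = names.str.contains("North", na=False)
count = mask.sum()

# case=False makes the match case-insensitive
count_any_case = names.str.contains("north", case=False, na=False).sum()

print(count, count_any_case)  # 2 3
```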

## Resampling

Resampling lets us apply statistical methods (e.g. mean, sum, count) computed over different time intervals.

* Statistical methods over different time intervals
    * mean(), sum(), count(), etc.
* Downsampling
    * Reduce datetime rows to slower frequency
* Upsampling
    * Increase datetime rows to faster frequency

### Aggregating Means

Note: Non-numeric columns are left out of the result. Flight Number, however, was read as a float column, so the flight numbers were averaged too, which is not meaningful for an identifier.

In [7]:
daily_mean = df.resample('D').mean()
daily_mean

Date (MM/DD/YYYY),Flight Number,Scheduled Elapsed Time(Minutes),Actual Elapsed Time(Minutes),Departure Delay(Minutes),Taxi-out Time(Minutes),DelayCarrier(Minutes),DelayWeather(Minutes),DelayNational Aviation System(Minutes),DelaySecurity(Minutes),DelayLate Aircraft Arrival(Minutes),Unnamed: 17
2015-07-01,1780.896552,128.706897,120.793103,21.534483,10.431034,4.293103,0.0,1.87931,0.0,12.206897,
2015-07-02,1780.896552,128.706897,123.344828,15.655172,10.413793,3.5,0.0,1.362069,0.0,9.758621,
2015-07-03,2071.714286,127.142857,120.535714,5.625,9.732143,1.089286,0.0,0.125,0.160714,2.107143,
2015-07-04,1500.026316,137.763158,130.894737,-0.815789,9.631579,0.342105,0.0,0.078947,0.0,0.0,
2015-07-05,2280.666667,131.842105,126.754386,4.789474,9.842105,1.877193,0.929825,0.754386,0.0,0.140351,
2015-07-06,1780.896552,128.706897,126.913793,2.758621,10.913793,0.482759,0.0,1.655172,0.0,1.344828,
2015-07-07,1780.896552,128.706897,124.137931,11.706897,9.517241,1.724138,0.0,1.775862,0.0,7.706897,
2015-07-08,1780.896552,128.706897,126.827586,11.827586,10.810345,3.448276,0.0,2.534483,0.0,5.413793,
2015-07-09,1780.896552,128.706897,126.775862,4.482759,10.327586,1.258621,0.689655,0.724138,0.103448,1.982759,
2015-07-10,1775.661017,127.457627,123.423729,9.779661,9.711864,1.559322,0.0,0.322034,0.0,6.372881,
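One possible way to keep identifier columns such as Flight Number out of the averages is to select or drop columns before resampling. The tiny `flights` frame below is invented for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical flights: two on July 1 and one on July 2
idx = pd.to_datetime(["2015-07-01", "2015-07-01", "2015-07-02"])
flights = pd.DataFrame(
    {
        "Flight Number": [103.0, 144.0, 178.0],
        "Departure Delay(Minutes)": [22.0, -5.0, 15.0],
    },
    index=idx,
)

# Option 1: keep only the measurement columns worth averaging
daily_delay = flights[["Departure Delay(Minutes)"]].resample("D").mean()

# Option 2: drop the identifier column and average the rest
daily_mean = flights.drop(columns=["Flight Number"]).resample("D").mean()

print(daily_delay)
```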


## Verifying

In [8]:
daily_mean.loc['2015-7-10']

Flight Number                             1775.661017
Scheduled Elapsed Time(Minutes)            127.457627
Actual Elapsed Time(Minutes)               123.423729
Departure Delay(Minutes)                     9.779661
Taxi-out Time(Minutes)                       9.711864
DelayCarrier(Minutes)                        1.559322
DelayWeather(Minutes)                        0.000000
DelayNational Aviation System(Minutes)       0.322034
DelaySecurity(Minutes)                       0.000000
DelayLate Aircraft Arrival(Minutes)          6.372881
Unnamed: 17                                       NaN
Name: 2015-07-10 00:00:00, dtype: float64

In [9]:
df.loc['2015-7-10'].head()

Date (MM/DD/YYYY),Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,Actual Departure Time,Scheduled Elapsed Time(Minutes),Actual Elapsed Time(Minutes),Departure Delay(Minutes),Wheels-off Time,Taxi-out Time(Minutes),DelayCarrier(Minutes),DelayWeather(Minutes),DelayNational Aviation System(Minutes),DelaySecurity(Minutes),DelayLate Aircraft Arrival(Minutes),Unnamed: 17
2015-07-10,WN,103.0,N8656B,MDW,06:30,06:36,165.0,150.0,6.0,06:45,9.0,0.0,0.0,0.0,0.0,0.0,
2015-07-10,WN,144.0,N8660A,SAN,20:55,20:50,170.0,181.0,-5.0,21:00,10.0,0.0,0.0,0.0,0.0,0.0,
2015-07-10,WN,178.0,N395SW,ELP,20:30,21:08,90.0,101.0,38.0,21:19,11.0,5.0,0.0,11.0,0.0,33.0,
2015-07-10,WN,232.0,N7818L,ATL,05:45,05:44,135.0,131.0,-1.0,05:54,10.0,0.0,0.0,0.0,0.0,0.0,
2015-07-10,WN,238.0,N614SW,DAL,12:30,12:37,55.0,55.0,7.0,12:47,10.0,0.0,0.0,0.0,0.0,0.0,


In [10]:
df.loc['2015-7-10'].mean()

Flight Number                             1775.661017
Scheduled Elapsed Time(Minutes)            127.457627
Actual Elapsed Time(Minutes)               123.423729
Departure Delay(Minutes)                     9.779661
Taxi-out Time(Minutes)                       9.711864
DelayCarrier(Minutes)                        1.559322
DelayWeather(Minutes)                        0.000000
DelayNational Aviation System(Minutes)       0.322034
DelaySecurity(Minutes)                       0.000000
DelayLate Aircraft Arrival(Minutes)          6.372881
Unnamed: 17                                       NaN
dtype: float64
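The visual comparison above can also be done programmatically. The sketch below uses an invented `delays` frame and pandas' own testing helper to check that the resampled row matches the mean computed directly for that day.

```python
import pandas as pd

# Hypothetical delays: one flight on July 9 and two on July 10
idx = pd.to_datetime(["2015-07-09 08:00", "2015-07-10 06:30", "2015-07-10 20:55"])
delays = pd.DataFrame({"Departure Delay(Minutes)": [4.0, 6.0, -5.0]}, index=idx)

# Daily means via resampling, then the row for July 10
daily_mean = delays.resample("D").mean()
from_resample = daily_mean.loc["2015-07-10"]

# The same day selected directly from the raw rows and averaged
from_direct = delays.loc["2015-07-10"].mean()

# Raises AssertionError if the two results differ (names are ignored)
pd.testing.assert_series_equal(from_resample, from_direct, check_names=False)
print(from_direct)
```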

## Method Chaining and Filtering

In [11]:
# Strip extra whitespace from column names
df.columns = df.columns.str.strip()
df.columns

Index(['Carrier Code', 'Flight Number', 'Tail Number', 'Destination Airport',
       'Scheduled Departure Time', 'Actual Departure Time',
       'Scheduled Elapsed Time(Minutes)', 'Actual Elapsed Time(Minutes)',
       'Departure Delay(Minutes)', 'Wheels-off Time', 'Taxi-out Time(Minutes)',
       'DelayCarrier(Minutes)', 'DelayWeather(Minutes)',
       'DelayNational Aviation System(Minutes)', 'DelaySecurity(Minutes)',
       'DelayLate Aircraft Arrival(Minutes)', 'Unnamed: 17'],
      dtype='object')

In [12]:
# Build a Boolean mask marking flights whose destination airport is DAL (Dallas)
dallas = df['Destination Airport'].str.contains('DAL')
dallas.head()

Date (MM/DD/YYYY)
2015-07-01    False
2015-07-01    False
2015-07-01    False
2015-07-01    False
2015-07-01     True
Name: Destination Airport, dtype: object

In [13]:
# Compute the total number of Dallas departures each day
daily_departures = dallas.resample('D').sum()
daily_departures.head()

Date (MM/DD/YYYY)
2015-07-01    10
2015-07-02    10
2015-07-03    11
2015-07-04     3
2015-07-05     9
dtype: int64

In [14]:
# Generate the summary statistics for daily Dallas departures
daily_departures.describe()

count    31.000000
mean      9.322581
std       1.989759
min       3.000000
25%       9.500000
50%      10.000000
75%      10.000000
max      11.000000
dtype: float64
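The separate steps above can also be written as a single method chain. This is a sketch on an invented `flights` frame; `na=False` is an added assumption that treats missing airports as non-matches so the mask stays Boolean.

```python
import pandas as pd

# Hypothetical departures over three days
idx = pd.to_datetime(
    ["2015-07-01", "2015-07-01", "2015-07-01", "2015-07-02", "2015-07-03"]
)
flights = pd.DataFrame(
    {"Destination Airport": ["DAL", "MDW", "DAL", "DAL", "SAN"]}, index=idx
)

# Filter -> resample -> aggregate in one chained expression
daily_dallas = (
    flights["Destination Airport"]
    .str.contains("DAL", na=False)  # Boolean mask of Dallas flights
    .resample("D")                  # group the mask by calendar day
    .sum()                          # True counts as 1, so this counts flights
)

# Summary statistics of the daily counts
stats = daily_dallas.describe()
print(daily_dallas.tolist())  # [2, 1, 0]
```

Chaining avoids intermediate variables, at the cost of making it slightly harder to inspect each step while debugging.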