#### Take home project

If we wanted to do some machine learning, we would need to create as many informative features as we thought could be useful. This is called Feature Engineering.

Discrete data would most often be transformed by [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f), which is [very easy to do in pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

![One Hot Encoding](https://hackernoon.com/photos/4HK5qyMbWfetPhAavzyTZrEb90N2-3o23tie)
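
As a minimal sketch of one-hot encoding with `pandas.get_dummies`, the toy frame below stands in for the flights data. The carrier codes and the `airline` prefix are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy data: the carrier codes are made up for illustration.
flights = pd.DataFrame({"Reporting_Airline": ["AA", "DL", "AA", "UA"]})

# One indicator column per distinct value of the discrete variable.
one_hot = pd.get_dummies(flights["Reporting_Airline"], prefix="airline")
print(one_hot)
```

Each row has exactly one active indicator, so a variable with k distinct values becomes k columns.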

With high-cardinality discrete data such as airports or, especially, tail numbers, one-hot encoding would leave us with many variables, most of which would not be very informative. There are [several options](https://www.datacamp.com/community/tutorials/encoding-methodologies) to deal with this. The most sophisticated is probably vector encoding, but we can make do with a very simple approach: [target encoding](https://maxhalford.github.io/blog/target-encoding-done-the-right-way/).

This means that we replace each value of the discrete variable with the mean or median of the target variable over the records that share that value. However, if we want to use this encoding for prediction, each record may only use values that were already known at that point in time. Otherwise the feature leaks the target it is supposed to predict.
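
To make the substitution concrete, here is a minimal sketch of the plain (not time-aware) version on toy data. The delays and the `origin_encoded` column name are made up for illustration. Because it uses every record, including the current and later ones, it is only suitable for exploration, not for prediction.

```python
import pandas as pd

# Toy data: the delay values are invented for illustration.
flights = pd.DataFrame({
    "Origin": ["JFK", "JFK", "LAX", "LAX", "LAX"],
    "DepDelay": [10.0, 30.0, 0.0, 5.0, 100.0],
})

# Replace each airport with the median delay over ALL of its records.
flights["origin_encoded"] = (
    flights.groupby("Origin")["DepDelay"].transform("median")
)
print(flights)
```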

Target encode the `Origin` and `Tail_Number` variables, using for each cell only the values that were available the previous day. In other words: create a `median_delay_origin` variable that contains, for each record, the median delay at that airport _up to the previous day_. Create another one, `median_delay_plane`, that does the same for `Tail_Number`.
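
One possible way to approach this, sketched under assumptions: the helper name `median_up_to_previous_day` is hypothetical, and the sketch assumes a date column (called `FlightDate` here) has been loaded alongside the delay. Here "previous day" means every record dated strictly before the current record's day.

```python
import pandas as pd

def median_up_to_previous_day(df, by, target, date):
    """Median of `target` per `by` group over strictly earlier days (hypothetical helper)."""
    ordered = df.sort_values(date, kind="stable")
    # Running median of `target` inside each group, in date order (NaN skipped).
    running = ordered.groupby(by)[target].transform(
        lambda s: s.expanding().median()
    )
    # Running median as it stands at the end of each (group, day).
    end_of_day = running.groupby([ordered[by], ordered[date]]).last()
    # Shift by one day inside each group so a day only sees earlier days.
    previous = end_of_day.groupby(level=0).shift(1).rename("prev")
    # Map the per-(group, day) value back onto the original rows.
    return df[[by, date]].join(previous, on=[by, date])["prev"]

# Toy data: dates and delays are invented for illustration.
flights = pd.DataFrame({
    "Origin": ["JFK", "JFK", "JFK", "JFK", "LAX", "LAX"],
    "FlightDate": pd.to_datetime([
        "2019-09-01", "2019-09-01", "2019-09-02",
        "2019-09-03", "2019-09-01", "2019-09-02",
    ]),
    "DepDelay": [10.0, 30.0, 50.0, 0.0, 5.0, 7.0],
})
flights["median_delay_origin"] = median_up_to_previous_day(
    flights, by="Origin", target="DepDelay", date="FlightDate"
)
print(flights)
```

Records from the first day of each group stay `NaN`, since no earlier day exists for them. The same call with `by="Tail_Number"` would give the per-plane variable.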

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
from zipfile import ZipFile

zip_path='On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2019_9.zip'
zf=ZipFile(zip_path)    # Handle to the zip archive
csv=zf.open(zf.filelist[0])   # Open the first file inside the archive

In [3]:
interesting_columns= ['DayOfWeek', 'Reporting_Airline', 'Tail_Number', 'Flight_Number_Reporting_Airline', 
                      'Origin', 'OriginCityName', 'OriginStateName', 'OriginCityMarketID',
                      'Dest', 'DestCityName', 'DestStateName', 'DestCityMarketID',
                      'DepTime', 'DepDelay', 'AirTime', 'Distance']

In [4]:
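def cum_median_int(df,series,objective,by):
    # df is an iterator of DataFrame chunks (read_csv with chunksize).
    # For each row, store in `objective` the median of `series` over the
    # earlier rows of the same chunk that share the row's `by` value.
    chunks=[]
    for dataframe in df:
        dataframe[objective]=np.nan
        target_col=dataframe.columns.get_loc(objective)
        for i in range(dataframe.shape[0]):
            # Per-group medians over the rows before row i (median skips NaN)
            median=dataframe[[by,series]].iloc[:i].groupby(by)[series].median().to_dict()
            key=dataframe[by].iloc[i]
            if key in median:
                # Positional assignment avoids chained indexing on a copy
                dataframe.iloc[i,target_col]=median[key]
        chunks.append(dataframe)
    # DataFrame.append was removed in pandas 2.0; concatenate the chunks once
    return pd.concat(chunks)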

In [5]:
csv.seek(0)
df = pd.read_csv(csv,usecols=interesting_columns,chunksize=10000)

In [6]:
%%time
df1=cum_median_int(df,'DepDelay','median_delay_origin',by='Origin')


CPU times: user 26min 25s, sys: 11.4 s, total: 26min 36s
Wall time: 26min 41s


In [7]:
csv.seek(0)
df = pd.read_csv(csv,usecols=interesting_columns,chunksize=10000)

In [8]:
%%time
df2=cum_median_int(df,'DepDelay','median_delay_plane',by='Tail_Number')

CPU times: user 34min 7s, sys: 11.1 s, total: 34min 19s
Wall time: 34min 56s


In [9]:
df_delay=pd.concat([df1,df2['median_delay_plane']],axis=1)

In [11]:
df_delay.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605979 entries, 0 to 605978
Data columns (total 18 columns):
DayOfWeek                          605979 non-null int64
Reporting_Airline                  605979 non-null object
Tail_Number                        604122 non-null object
Flight_Number_Reporting_Airline    605979 non-null int64
OriginCityMarketID                 605979 non-null int64
Origin                             605979 non-null object
OriginCityName                     605979 non-null object
OriginStateName                    605979 non-null object
DestCityMarketID                   605979 non-null int64
Dest                               605979 non-null object
DestCityName                       605979 non-null object
DestStateName                      605979 non-null object
DepTime                            596198 non-null float64
DepDelay                           596198 non-null float64
AirTime                            594716 non-null float64
Distance            