# Statement

S09 T01: Practice with training and test sets

Description
Get familiar with scientific programming using the SKLearn / Scikitlearn library.

Level 1 - Exercise 1
Split the data set DelayedFlights.csv into train and test. Study the two sets separately, at a descriptive level.

Level 2 - Exercise 2
Apply some transformation process (standardize numerical data, create dummy columns, polynomials...).

Level 3 - Exercise 3
Summarize the new columns generated in a statistical and graphical way.  

# Dataset information: ✈
![](2022-02-28-17-46-54.png)

# Level 1 - Exercise 1.
Split the data set DelayedFlights.csv into train and test. Study the two sets separately, at a descriptive level.

In [1]:
# Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Personal functions
# Dataframe's Information
def Pers_df_info(par_df):
    print("[-------------------------SHAPE------------------------]")
    display(par_df.shape)
    print("[-------------------------INFO-------------------------]")
    display(par_df.info())
    print("[-----------------------DESCRIBE-----------------------]")
    display(par_df.describe(include='all').round(2))
    print("[------------------------NaN's-------------------------]")
    list_cols = par_df.columns
    display(par_df[list_cols].isnull().sum())
    print("[--------------Values in categorical variables---------]")
    list_num_cols = par_df._get_numeric_data().columns
    list_cat_cols = list(set(list_cols) - set(list_num_cols))
    for i in list_cat_cols:
        print("------------------%s-------------------" %i)
        print("------------Unique Values--------------")
        print("Number of unique values is: %.0f" %df[i].unique().size)
        print(df[i].unique())
        print("------------Value Counts--------------")
        display(df[i].value_counts())


In [3]:
# Test
# print("Number of unique values is: %.0f" %df['Origin'].unique().size)
# df['Origin'].unique().size
# print(df['Origin'].unique())

In [4]:
# Read csv to dataframe
df = pd.read_csv('..\Data\DelayedFlights.csv')

## Split the dataset into train and test.

In [5]:
# split into train test sets (33% for test)
df_train, df_test = train_test_split(df, test_size=0.33)

## Describe both sets 

In [6]:
# Describe set train (using my personal functions)
Pers_df_info(df_train)

[-------------------------SHAPE------------------------]


(1297627, 30)

[-------------------------INFO-------------------------]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1297627 entries, 848675 to 1319217
Data columns (total 30 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Unnamed: 0         1297627 non-null  int64  
 1   Year               1297627 non-null  int64  
 2   Month              1297627 non-null  int64  
 3   DayofMonth         1297627 non-null  int64  
 4   DayOfWeek          1297627 non-null  int64  
 5   DepTime            1297627 non-null  float64
 6   CRSDepTime         1297627 non-null  int64  
 7   ArrTime            1292838 non-null  float64
 8   CRSArrTime         1297627 non-null  int64  
 9   UniqueCarrier      1297627 non-null  object 
 10  FlightNum          1297627 non-null  int64  
 11  TailNum            1297624 non-null  object 
 12  ActualElapsedTime  1291983 non-null  float64
 13  CRSElapsedTime     1297487 non-null  float64
 14  AirTime            1

None

[-----------------------DESCRIBE-----------------------]


Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1292838.0,1297627.0,1297627,...,1292838.0,1297322.0,1297627.0,1297627,1297627.0,836040.0,836040.0,836040.0,836040.0,836040.0
unique,,,,,,,,,,20,...,,,,4,,,,,,
top,,,,,,,,,,WN,...,,,,N,,,,,,
freq,,,,,,,,,,253041,...,,,,1297198,,,,,,
mean,3340784.23,2008.0,6.11,15.76,3.99,1518.08,1467.05,1609.81,1633.74,,...,6.81,18.24,0.0,,0.0,19.22,3.7,15.06,0.09,25.3
std,2066221.22,0.0,3.48,8.78,2.0,450.69,424.9,548.07,464.69,,...,5.28,14.36,0.02,,0.06,43.8,21.48,33.89,2.03,42.14
min,0.0,2008.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,,...,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
25%,1516240.0,2008.0,3.0,8.0,2.0,1203.0,1135.0,1315.0,1325.0,,...,4.0,10.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
50%,3242163.0,2008.0,6.0,16.0,4.0,1545.0,1510.0,1715.0,1705.0,,...,6.0,14.0,0.0,,0.0,2.0,0.0,2.0,0.0,8.0
75%,4972266.5,2008.0,9.0,23.0,6.0,1900.0,1815.0,2030.0,2014.0,,...,8.0,21.0,0.0,,0.0,21.0,0.0,15.0,0.0,33.0


[------------------------NaN's-------------------------]


Unnamed: 0                0
Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                4789
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                   3
ActualElapsedTime      5644
CRSElapsedTime          140
AirTime                5644
ArrDelay               5644
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 4789
TaxiOut                 305
Cancelled                 0
CancellationCode          0
Diverted                  0
CarrierDelay         461587
WeatherDelay         461587
NASDelay             461587
SecurityDelay        461587
LateAircraftDelay    461587
dtype: int64

[--------------Values in categorical variables---------]
------------------Origin-------------------
------------Unique Values--------------
Number of unique values is: 303
['IAD' 'IND' 'ISP' 'JAN' 'JAX' 'LAS' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MCO'
 'MDW' 'MHT' 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PBI' 'PDX' 'PHL' 'PHX'
 'PIT' 'PVD' 'RDU' 'RNO' 'RSW' 'SAN' 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC'
 'SMF' 'SNA' 'STL' 'TPA' 'TUL' 'TUS' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BHM'
 'BNA' 'BOI' 'BUF' 'BUR' 'BWI' 'CLE' 'CMH' 'CRP' 'DAL' 'DEN' 'DTW' 'ELP'
 'FLL' 'GEG' 'HOU' 'HRL' 'ROC' 'ORD' 'EWR' 'SYR' 'IAH' 'CRW' 'FAT' 'COS'
 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'MEM' 'BTV' 'MKE' 'LFT' 'BRO' 'PWM' 'MSP'
 'SRQ' 'CLT' 'CVG' 'GSO' 'SHV' 'DCA' 'TYS' 'GSP' 'RIC' 'DFW' 'BGR' 'DAY'
 'GRR' 'CHS' 'CAE' 'TLH' 'XNA' 'GPT' 'VPS' 'LGA' 'ATL' 'MSN' 'SAV' 'BTR'
 'LEX' 'LRD' 'MOB' 'MTJ' 'GRK' 'AEX' 'PNS' 'ABE' 'HSV' 'CHA' 'MFE' 'MLU'
 'DSM' 'MGM' 'AVL' 'LCH' 'BOS' 'MYR' 'CLL' 'DAB' 'ASE' 'ATW' 'BMI' 'CAK'
 'CID' 'CPR' 'EGE' 'FLG'

ATL    131613
ORD    125979
DFW     95414
DEN     74323
LAX     58772
        ...  
BJI         4
PIR         3
PUB         2
INL         1
TUP         1
Name: Origin, Length: 303, dtype: int64

------------------TailNum-------------------
------------Unique Values--------------
Number of unique values is: 5367
['N712SW' 'N772SW' 'N428WN' ... 'N75428' 'N75429' 'N78008']
------------Value Counts--------------


N325SW    965
N676SW    951
N658SW    945
N313SW    937
N308SA    936
         ... 
9189E       1
N853NW      1
N856NW      1
N76010      1
N78008      1
Name: TailNum, Length: 5366, dtype: int64

------------------UniqueCarrier-------------------
------------Unique Values--------------
Number of unique values is: 20
['WN' 'XE' 'YV' 'OH' 'OO' 'UA' 'US' 'DL' 'EV' 'F9' 'FL' 'HA' 'MQ' 'NW'
 '9E' 'AA' 'AQ' 'AS' 'B6' 'CO']
------------Value Counts--------------


WN    377602
AA    191865
MQ    141920
UA    141426
OO    132433
DL    114238
XE    103663
CO    100195
US     98425
EV     81877
NW     79108
FL     71284
YV     67063
B6     55315
OH     52657
9E     51885
AS     39293
F9     28269
HA      7490
AQ       750
Name: UniqueCarrier, dtype: int64

------------------CancellationCode-------------------
------------Unique Values--------------
Number of unique values is: 4
['N' 'A' 'B' 'C']
------------Value Counts--------------


N    1936125
B        307
A        246
C         80
Name: CancellationCode, dtype: int64

------------------Dest-------------------
------------Unique Values--------------
Number of unique values is: 304
['TPA' 'BWI' 'JAX' 'LAS' 'MCO' 'MDW' 'PHX' 'FLL' 'PBI' 'RSW' 'HOU' 'BHM'
 'BNA' 'IND' 'PHL' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BOI' 'BUF' 'BUR' 'CLE'
 'CMH' 'DEN' 'ELP' 'GEG' 'IAD' 'ISP' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MHT'
 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PDX' 'PIT' 'PVD' 'RDU' 'RNO' 'SAN'
 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'SNA' 'STL' 'TUL' 'TUS' 'DAL'
 'DTW' 'JAN' 'HRL' 'CRP' 'EWR' 'IAH' 'ROC' 'MYR' 'GSO' 'SAV' 'RIC' 'COS'
 'FAT' 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'CAE' 'DFW' 'DAY' 'MSP' 'GSP' 'MEM'
 'TYS' 'SHV' 'BTV' 'MFE' 'PWM' 'ATL' 'SYR' 'MKE' 'HSV' 'BTR' 'CHS' 'MSN'
 'LFT' 'LRD' 'SRQ' 'CLT' 'VPS' 'AVL' 'GPT' 'LGA' 'ABE' 'BGR' 'DCA' 'ORD'
 'GRR' 'MOB' 'PNS' 'CHA' 'MGM' 'CVG' 'GRK' 'PSP' 'TLH' 'LCH' 'BOS' 'BRO'
 'XNA' 'BPT' 'LEX' 'MTJ' 'AEX' 'MLU' 'DSM' 'CRW' 'CLL' 'ILM' 'JFK' 'ASE'
 'CPR' 'DRO' 'RAP' 'KOA' 'LIH' 'OGG' 'MDT' 'ROA' 'SPI' 'HNL' 'MFR' 'ATW'
 'BMI' 'CA

ORD    108984
ATL    106898
DFW     70657
DEN     63003
LAX     59969
        ...  
INL         9
PIR         3
CYS         1
TUP         1
OGD         1
Name: Dest, Length: 304, dtype: int64

In [7]:
# Describe set test (using my personal functions)
Pers_df_info(df_test)

[-------------------------SHAPE------------------------]


(639131, 30)

[-------------------------INFO-------------------------]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 639131 entries, 273938 to 1749694
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         639131 non-null  int64  
 1   Year               639131 non-null  int64  
 2   Month              639131 non-null  int64  
 3   DayofMonth         639131 non-null  int64  
 4   DayOfWeek          639131 non-null  int64  
 5   DepTime            639131 non-null  float64
 6   CRSDepTime         639131 non-null  int64  
 7   ArrTime            636810 non-null  float64
 8   CRSArrTime         639131 non-null  int64  
 9   UniqueCarrier      639131 non-null  object 
 10  FlightNum          639131 non-null  int64  
 11  TailNum            639129 non-null  object 
 12  ActualElapsedTime  636388 non-null  float64
 13  CRSElapsedTime     639073 non-null  float64
 14  AirTime            636388 non-null  f

None

[-----------------------DESCRIBE-----------------------]


Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,639131.0,639131.0,639131.0,639131.0,639131.0,639131.0,639131.0,636810.0,639131.0,639131,...,636810.0,638981.0,639131.0,639131,639131.0,411448.0,411448.0,411448.0,411448.0,411448.0
unique,,,,,,,,,,20,...,,,,4,,,,,,
top,,,,,,,,,,WN,...,,,,N,,,,,,
freq,,,,,,,,,,124561,...,,,,638927,,,,,,
mean,3343411.27,2008.0,6.11,15.74,3.98,1519.45,1468.32,1610.81,1635.21,,...,6.81,18.21,0.0,,0.0,19.09,3.71,14.95,0.09,25.3
std,2065748.16,0.0,3.48,8.78,2.0,450.07,424.5,548.41,464.52,,...,5.27,14.3,0.02,,0.06,43.03,21.52,33.71,2.02,41.88
min,2.0,2008.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,,...,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
25%,1519768.5,2008.0,3.0,8.0,2.0,1204.0,1135.0,1317.0,1325.0,,...,4.0,10.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
50%,3243298.0,2008.0,6.0,16.0,4.0,1546.0,1510.0,1716.0,1707.0,,...,6.0,14.0,0.0,,0.0,2.0,0.0,2.0,0.0,8.0
75%,4972930.5,2008.0,9.0,23.0,6.0,1900.0,1816.0,2031.0,2015.0,,...,8.0,21.0,0.0,,0.0,21.0,0.0,14.0,0.0,33.0


[------------------------NaN's-------------------------]


Unnamed: 0                0
Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                2321
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                   2
ActualElapsedTime      2743
CRSElapsedTime           58
AirTime                2743
ArrDelay               2743
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 2321
TaxiOut                 150
Cancelled                 0
CancellationCode          0
Diverted                  0
CarrierDelay         227683
WeatherDelay         227683
NASDelay             227683
SecurityDelay        227683
LateAircraftDelay    227683
dtype: int64

[--------------Values in categorical variables---------]
------------------Origin-------------------
------------Unique Values--------------
Number of unique values is: 303
['IAD' 'IND' 'ISP' 'JAN' 'JAX' 'LAS' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MCO'
 'MDW' 'MHT' 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PBI' 'PDX' 'PHL' 'PHX'
 'PIT' 'PVD' 'RDU' 'RNO' 'RSW' 'SAN' 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC'
 'SMF' 'SNA' 'STL' 'TPA' 'TUL' 'TUS' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BHM'
 'BNA' 'BOI' 'BUF' 'BUR' 'BWI' 'CLE' 'CMH' 'CRP' 'DAL' 'DEN' 'DTW' 'ELP'
 'FLL' 'GEG' 'HOU' 'HRL' 'ROC' 'ORD' 'EWR' 'SYR' 'IAH' 'CRW' 'FAT' 'COS'
 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'MEM' 'BTV' 'MKE' 'LFT' 'BRO' 'PWM' 'MSP'
 'SRQ' 'CLT' 'CVG' 'GSO' 'SHV' 'DCA' 'TYS' 'GSP' 'RIC' 'DFW' 'BGR' 'DAY'
 'GRR' 'CHS' 'CAE' 'TLH' 'XNA' 'GPT' 'VPS' 'LGA' 'ATL' 'MSN' 'SAV' 'BTR'
 'LEX' 'LRD' 'MOB' 'MTJ' 'GRK' 'AEX' 'PNS' 'ABE' 'HSV' 'CHA' 'MFE' 'MLU'
 'DSM' 'MGM' 'AVL' 'LCH' 'BOS' 'MYR' 'CLL' 'DAB' 'ASE' 'ATW' 'BMI' 'CAK'
 'CID' 'CPR' 'EGE' 'FLG'

ATL    131613
ORD    125979
DFW     95414
DEN     74323
LAX     58772
        ...  
BJI         4
PIR         3
PUB         2
INL         1
TUP         1
Name: Origin, Length: 303, dtype: int64

------------------TailNum-------------------
------------Unique Values--------------
Number of unique values is: 5367
['N712SW' 'N772SW' 'N428WN' ... 'N75428' 'N75429' 'N78008']
------------Value Counts--------------


N325SW    965
N676SW    951
N658SW    945
N313SW    937
N308SA    936
         ... 
9189E       1
N853NW      1
N856NW      1
N76010      1
N78008      1
Name: TailNum, Length: 5366, dtype: int64

------------------UniqueCarrier-------------------
------------Unique Values--------------
Number of unique values is: 20
['WN' 'XE' 'YV' 'OH' 'OO' 'UA' 'US' 'DL' 'EV' 'F9' 'FL' 'HA' 'MQ' 'NW'
 '9E' 'AA' 'AQ' 'AS' 'B6' 'CO']
------------Value Counts--------------


WN    377602
AA    191865
MQ    141920
UA    141426
OO    132433
DL    114238
XE    103663
CO    100195
US     98425
EV     81877
NW     79108
FL     71284
YV     67063
B6     55315
OH     52657
9E     51885
AS     39293
F9     28269
HA      7490
AQ       750
Name: UniqueCarrier, dtype: int64

------------------CancellationCode-------------------
------------Unique Values--------------
Number of unique values is: 4
['N' 'A' 'B' 'C']
------------Value Counts--------------


N    1936125
B        307
A        246
C         80
Name: CancellationCode, dtype: int64

------------------Dest-------------------
------------Unique Values--------------
Number of unique values is: 304
['TPA' 'BWI' 'JAX' 'LAS' 'MCO' 'MDW' 'PHX' 'FLL' 'PBI' 'RSW' 'HOU' 'BHM'
 'BNA' 'IND' 'PHL' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BOI' 'BUF' 'BUR' 'CLE'
 'CMH' 'DEN' 'ELP' 'GEG' 'IAD' 'ISP' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MHT'
 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PDX' 'PIT' 'PVD' 'RDU' 'RNO' 'SAN'
 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'SNA' 'STL' 'TUL' 'TUS' 'DAL'
 'DTW' 'JAN' 'HRL' 'CRP' 'EWR' 'IAH' 'ROC' 'MYR' 'GSO' 'SAV' 'RIC' 'COS'
 'FAT' 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'CAE' 'DFW' 'DAY' 'MSP' 'GSP' 'MEM'
 'TYS' 'SHV' 'BTV' 'MFE' 'PWM' 'ATL' 'SYR' 'MKE' 'HSV' 'BTR' 'CHS' 'MSN'
 'LFT' 'LRD' 'SRQ' 'CLT' 'VPS' 'AVL' 'GPT' 'LGA' 'ABE' 'BGR' 'DCA' 'ORD'
 'GRR' 'MOB' 'PNS' 'CHA' 'MGM' 'CVG' 'GRK' 'PSP' 'TLH' 'LCH' 'BOS' 'BRO'
 'XNA' 'BPT' 'LEX' 'MTJ' 'AEX' 'MLU' 'DSM' 'CRW' 'CLL' 'ILM' 'JFK' 'ASE'
 'CPR' 'DRO' 'RAP' 'KOA' 'LIH' 'OGG' 'MDT' 'ROA' 'SPI' 'HNL' 'MFR' 'ATW'
 'BMI' 'CA

ORD    108984
ATL    106898
DFW     70657
DEN     63003
LAX     59969
        ...  
INL         9
PIR         3
CYS         1
TUP         1
OGD         1
Name: Dest, Length: 304, dtype: int64

# Level 2 - Exercise 2.
### Apply some transformation process (standardize numerical data, create dummy columns, polynomials...).



#### Let's standardize some numerical columns:

In [8]:
# Let's standardize some numerical columns:
list_num_cols = df_train._get_numeric_data().columns
print(list_num_cols)

Index(['Unnamed: 0', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
       'CRSDepTime', 'ArrTime', 'CRSArrTime', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance',
       'TaxiIn', 'TaxiOut', 'Cancelled', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')


In [9]:
# #list for cols to scale and scale them.
cols_to_scale = ['DepTime',
       'CRSDepTime', 'ArrTime', 'CRSArrTime', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance',
       'TaxiIn', 'TaxiOut', 'Cancelled', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
#create and fit scaler
scaler = StandardScaler()
scaler.fit(df_train[cols_to_scale])
#scale selected data
df_train[cols_to_scale] = scaler.transform(df_train[cols_to_scale])

df_train.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
848675,2877566,2008,5,14,3,-1.071869,-1.04037,-0.660164,-0.879589,AA,...,0.224792,0.749269,-0.018186,N,-0.063522,-0.210616,-0.17233,-0.060703,-0.044224,-0.600254
482199,1511775,2008,3,16,7,1.519712,1.614373,1.085979,1.130344,EV,...,-0.154292,-0.086457,-0.018186,N,-0.063522,,,,,
1115318,3715051,2008,7,24,4,1.737157,1.560242,-2.929967,1.31326,XE,...,0.03525,-0.434676,-0.018186,N,-0.063522,1.3877,-0.17233,-0.444247,-0.044224,-0.600254
445612,1390500,2008,3,27,4,0.661029,0.618844,0.763025,0.600961,OO,...,0.224792,1.027844,-0.018186,N,-0.063522,-0.438946,-0.17233,1.768505,-0.044224,-0.600254
421164,1298096,2008,3,25,2,-1.766359,-1.781721,-0.915608,-1.251878,XE,...,-0.154292,-0.573963,-0.018186,N,-0.063522,-0.438946,-0.17233,-0.326233,5.877798,-0.600254


#### Let's create dummy columns of Cancellation Code attribute.  
(it has only 4 unique values, for the others, we would have to choose the first 5/10 more important and then use a last one with all the others.

In [10]:
# Let's create dummy columns of Cancellation Code attribute
df_train_norm = pd.get_dummies(df_train,columns=['CancellationCode'])
df_train_norm

Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,CancellationCode_A,CancellationCode_B,CancellationCode_C,CancellationCode_N
848675,2877566,2008,5,14,3,-1.071869,-1.040370,-0.660164,-0.879589,AA,...,-0.063522,-0.210616,-0.17233,-0.060703,-0.044224,-0.600254,0,0,0,1
482199,1511775,2008,3,16,7,1.519712,1.614373,1.085979,1.130344,EV,...,-0.063522,,,,,,0,0,0,1
1115318,3715051,2008,7,24,4,1.737157,1.560242,-2.929967,1.313260,XE,...,-0.063522,1.387700,-0.17233,-0.444247,-0.044224,-0.600254,0,0,0,1
445612,1390500,2008,3,27,4,0.661029,0.618844,0.763025,0.600961,OO,...,-0.063522,-0.438946,-0.17233,1.768505,-0.044224,-0.600254,0,0,0,1
421164,1298096,2008,3,25,2,-1.766359,-1.781721,-0.915608,-1.251878,XE,...,-0.063522,-0.438946,-0.17233,-0.326233,5.877798,-0.600254,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1355840,4573728,2008,8,1,5,1.393240,1.532000,1.310405,1.341236,EV,...,-0.063522,-0.073617,-0.17233,-0.001697,-0.044224,-0.600254,0,0,0,1
1165286,3888361,2008,7,14,1,-1.065212,-1.216883,-0.651041,-0.896805,US,...,-0.063522,0.383045,-0.17233,-0.444247,-0.044224,-0.600254,0,0,0,1
633513,2022495,2008,4,1,2,-0.898801,-1.005068,-0.348157,-0.483628,UA,...,-0.063522,,,,,,0,0,0,1
1606497,5838626,2008,10,9,4,0.902880,0.677681,1.330476,1.261613,AA,...,-0.063522,-0.438946,-0.17233,1.886518,-0.044224,-0.600254,0,0,0,1
