# Statement

S09 T01: Practice with training and test sets

Description
Get familiar with scientific programming using the SKLearn / Scikitlearn library.

Level 1 - Exercise 1
Split the data set DelayedFlights.csv into train and test. Study the two sets separately, at a descriptive level.

Level 2 - Exercise 2
Apply some transformation process (standardize numerical data, create dummy columns, polynomials...).

Level 3 - Exercise 3
Summarize the new columns generated in a statistical and graphical way.

# Dataset information: ✈
![](2022-02-28-17-46-54.png)

# Level 1 - Exercise 1 - Split between train and test and study the two sets.
Split the data set DelayedFlights.csv into train and test. Study the two sets separately, at a descriptive level.

In [1]:
# Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Personal functions
# Dataframe's Information
def Pers_df_info(par_df):
    print("[-------------------------SHAPE------------------------]")
    display(par_df.shape)
    print("[-------------------------INFO-------------------------]")
    display(par_df.info())
    print("[-----------------------DESCRIBE-----------------------]")
    display(par_df.describe(include='all').round(2))
    print("[------------------------NaN's-------------------------]")
    list_cols = par_df.columns
    display(par_df[list_cols].isnull().sum())
    print("[--------------Values in categorical variables---------]")
    list_num_cols = par_df._get_numeric_data().columns
    list_cat_cols = list(set(list_cols) - set(list_num_cols))
    for i in list_cat_cols:
        print("------------------%s-------------------" %i)
        print("------------Unique Values--------------")
        print("Number of unique values is: %.0f" %df[i].unique().size)
        print(df[i].unique())
        print("------------Value Counts--------------")
        display(df[i].value_counts())


In [3]:
# Test
# print("Number of unique values is: %.0f" %df['Origin'].unique().size)
# df['Origin'].unique().size
# print(df['Origin'].unique())

In [4]:
# Read csv to dataframe
df = pd.read_csv('..\Data\DelayedFlights.csv')

## Split the dataset into train and test.

In [5]:
# split into train test sets (33% for test)
df_train, df_test = train_test_split(df, test_size=0.33)

## Describe both sets 

In [6]:
# Describe set train (using my personal functions)
Pers_df_info(df_train)

[-------------------------SHAPE------------------------]


(1297627, 30)

[-------------------------INFO-------------------------]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1297627 entries, 1725493 to 1554307
Data columns (total 30 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Unnamed: 0         1297627 non-null  int64  
 1   Year               1297627 non-null  int64  
 2   Month              1297627 non-null  int64  
 3   DayofMonth         1297627 non-null  int64  
 4   DayOfWeek          1297627 non-null  int64  
 5   DepTime            1297627 non-null  float64
 6   CRSDepTime         1297627 non-null  int64  
 7   ArrTime            1292873 non-null  float64
 8   CRSArrTime         1297627 non-null  int64  
 9   UniqueCarrier      1297627 non-null  object 
 10  FlightNum          1297627 non-null  int64  
 11  TailNum            1297623 non-null  object 
 12  ActualElapsedTime  1292017 non-null  float64
 13  CRSElapsedTime     1297498 non-null  float64
 14  AirTime            

None

[-----------------------DESCRIBE-----------------------]


Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1297627.0,1292873.0,1297627.0,1297627,...,1292873.0,1297319.0,1297627.0,1297627,1297627.0,835640.0,835640.0,835640.0,835640.0,835640.0
unique,,,,,,,,,,20,...,,,,4,,,,,,
top,,,,,,,,,,WN,...,,,,N,,,,,,
freq,,,,,,,,,,253214,...,,,,1297195,,,,,,
mean,3339996.04,2008.0,6.11,15.75,3.98,1518.62,1467.46,1610.16,1634.0,,...,6.81,18.23,0.0,,0.0,19.23,3.71,15.02,0.09,25.25
std,2066184.58,0.0,3.48,8.78,2.0,450.45,424.85,548.24,464.91,,...,5.27,14.31,0.02,,0.06,43.62,21.55,33.78,2.03,42.0
min,4.0,2008.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,,...,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
25%,1516972.0,2008.0,3.0,8.0,2.0,1203.0,1135.0,1316.0,1325.0,,...,4.0,10.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
50%,3240032.0,2008.0,6.0,16.0,4.0,1545.0,1510.0,1715.0,1705.0,,...,6.0,14.0,0.0,,0.0,2.0,0.0,2.0,0.0,8.0
75%,4971048.5,2008.0,9.0,23.0,6.0,1900.0,1815.0,2031.0,2014.0,,...,8.0,21.0,0.0,,0.0,21.0,0.0,15.0,0.0,33.0


[------------------------NaN's-------------------------]


Unnamed: 0                0
Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                4754
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                   4
ActualElapsedTime      5610
CRSElapsedTime          129
AirTime                5610
ArrDelay               5610
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 4754
TaxiOut                 308
Cancelled                 0
CancellationCode          0
Diverted                  0
CarrierDelay         461987
WeatherDelay         461987
NASDelay             461987
SecurityDelay        461987
LateAircraftDelay    461987
dtype: int64

[--------------Values in categorical variables---------]
------------------UniqueCarrier-------------------
------------Unique Values--------------
Number of unique values is: 20
['WN' 'XE' 'YV' 'OH' 'OO' 'UA' 'US' 'DL' 'EV' 'F9' 'FL' 'HA' 'MQ' 'NW'
 '9E' 'AA' 'AQ' 'AS' 'B6' 'CO']
------------Value Counts--------------


WN    377602
AA    191865
MQ    141920
UA    141426
OO    132433
DL    114238
XE    103663
CO    100195
US     98425
EV     81877
NW     79108
FL     71284
YV     67063
B6     55315
OH     52657
9E     51885
AS     39293
F9     28269
HA      7490
AQ       750
Name: UniqueCarrier, dtype: int64

------------------Dest-------------------
------------Unique Values--------------
Number of unique values is: 304
['TPA' 'BWI' 'JAX' 'LAS' 'MCO' 'MDW' 'PHX' 'FLL' 'PBI' 'RSW' 'HOU' 'BHM'
 'BNA' 'IND' 'PHL' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BOI' 'BUF' 'BUR' 'CLE'
 'CMH' 'DEN' 'ELP' 'GEG' 'IAD' 'ISP' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MHT'
 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PDX' 'PIT' 'PVD' 'RDU' 'RNO' 'SAN'
 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'SNA' 'STL' 'TUL' 'TUS' 'DAL'
 'DTW' 'JAN' 'HRL' 'CRP' 'EWR' 'IAH' 'ROC' 'MYR' 'GSO' 'SAV' 'RIC' 'COS'
 'FAT' 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'CAE' 'DFW' 'DAY' 'MSP' 'GSP' 'MEM'
 'TYS' 'SHV' 'BTV' 'MFE' 'PWM' 'ATL' 'SYR' 'MKE' 'HSV' 'BTR' 'CHS' 'MSN'
 'LFT' 'LRD' 'SRQ' 'CLT' 'VPS' 'AVL' 'GPT' 'LGA' 'ABE' 'BGR' 'DCA' 'ORD'
 'GRR' 'MOB' 'PNS' 'CHA' 'MGM' 'CVG' 'GRK' 'PSP' 'TLH' 'LCH' 'BOS' 'BRO'
 'XNA' 'BPT' 'LEX' 'MTJ' 'AEX' 'MLU' 'DSM' 'CRW' 'CLL' 'ILM' 'JFK' 'ASE'
 'CPR' 'DRO' 'RAP' 'KOA' 'LIH' 'OGG' 'MDT' 'ROA' 'SPI' 'HNL' 'MFR' 'ATW'
 'BMI' 'CA

ORD    108984
ATL    106898
DFW     70657
DEN     63003
LAX     59969
        ...  
INL         9
PIR         3
CYS         1
TUP         1
OGD         1
Name: Dest, Length: 304, dtype: int64

------------------CancellationCode-------------------
------------Unique Values--------------
Number of unique values is: 4
['N' 'A' 'B' 'C']
------------Value Counts--------------


N    1936125
B        307
A        246
C         80
Name: CancellationCode, dtype: int64

------------------Origin-------------------
------------Unique Values--------------
Number of unique values is: 303
['IAD' 'IND' 'ISP' 'JAN' 'JAX' 'LAS' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MCO'
 'MDW' 'MHT' 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PBI' 'PDX' 'PHL' 'PHX'
 'PIT' 'PVD' 'RDU' 'RNO' 'RSW' 'SAN' 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC'
 'SMF' 'SNA' 'STL' 'TPA' 'TUL' 'TUS' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BHM'
 'BNA' 'BOI' 'BUF' 'BUR' 'BWI' 'CLE' 'CMH' 'CRP' 'DAL' 'DEN' 'DTW' 'ELP'
 'FLL' 'GEG' 'HOU' 'HRL' 'ROC' 'ORD' 'EWR' 'SYR' 'IAH' 'CRW' 'FAT' 'COS'
 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'MEM' 'BTV' 'MKE' 'LFT' 'BRO' 'PWM' 'MSP'
 'SRQ' 'CLT' 'CVG' 'GSO' 'SHV' 'DCA' 'TYS' 'GSP' 'RIC' 'DFW' 'BGR' 'DAY'
 'GRR' 'CHS' 'CAE' 'TLH' 'XNA' 'GPT' 'VPS' 'LGA' 'ATL' 'MSN' 'SAV' 'BTR'
 'LEX' 'LRD' 'MOB' 'MTJ' 'GRK' 'AEX' 'PNS' 'ABE' 'HSV' 'CHA' 'MFE' 'MLU'
 'DSM' 'MGM' 'AVL' 'LCH' 'BOS' 'MYR' 'CLL' 'DAB' 'ASE' 'ATW' 'BMI' 'CAK'
 'CID' 'CPR' 'EGE' 'FLG' 'FSD' 'FWA' 'GJT' 'GRB' 'HNL' 'KOA' 'LAN' 'LIH'
 'MBS' '

ATL    131613
ORD    125979
DFW     95414
DEN     74323
LAX     58772
        ...  
BJI         4
PIR         3
PUB         2
INL         1
TUP         1
Name: Origin, Length: 303, dtype: int64

------------------TailNum-------------------
------------Unique Values--------------
Number of unique values is: 5367
['N712SW' 'N772SW' 'N428WN' ... 'N75428' 'N75429' 'N78008']
------------Value Counts--------------


N325SW    965
N676SW    951
N658SW    945
N313SW    937
N308SA    936
         ... 
9189E       1
N853NW      1
N856NW      1
N76010      1
N78008      1
Name: TailNum, Length: 5366, dtype: int64

In [7]:
# Describe set test (using my personal functions)
Pers_df_info(df_test)

[-------------------------SHAPE------------------------]


(639131, 30)

[-------------------------INFO-------------------------]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 639131 entries, 1794076 to 1377433
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         639131 non-null  int64  
 1   Year               639131 non-null  int64  
 2   Month              639131 non-null  int64  
 3   DayofMonth         639131 non-null  int64  
 4   DayOfWeek          639131 non-null  int64  
 5   DepTime            639131 non-null  float64
 6   CRSDepTime         639131 non-null  int64  
 7   ArrTime            636775 non-null  float64
 8   CRSArrTime         639131 non-null  int64  
 9   UniqueCarrier      639131 non-null  object 
 10  FlightNum          639131 non-null  int64  
 11  TailNum            639130 non-null  object 
 12  ActualElapsedTime  636354 non-null  float64
 13  CRSElapsedTime     639062 non-null  float64
 14  AirTime            636354 non-null  

None

[-----------------------DESCRIBE-----------------------]


Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
count,639131.0,639131.0,639131.0,639131.0,639131.0,639131.0,639131.0,636775.0,639131.0,639131,...,636775.0,638984.0,639131.0,639131,639131.0,411848.0,411848.0,411848.0,411848.0,411848.0
unique,,,,,,,,,,20,...,,,,4,,,,,,
top,,,,,,,,,,WN,...,,,,N,,,,,,
freq,,,,,,,,,,124388,...,,,,638930,,,,,,
mean,3345011.52,2008.0,6.12,15.76,3.98,1518.37,1467.5,1610.1,1634.67,,...,6.81,18.24,0.0,,0.0,19.09,3.69,15.03,0.09,25.39
std,2065819.61,0.0,3.48,8.78,2.0,450.55,424.6,548.05,464.07,,...,5.28,14.4,0.02,,0.06,43.39,21.38,33.93,2.01,42.17
min,0.0,2008.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,,...,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
25%,1518570.5,2008.0,3.0,8.0,2.0,1204.0,1135.0,1316.0,1325.0,,...,4.0,10.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0
50%,3247308.0,2008.0,6.0,16.0,4.0,1545.0,1510.0,1715.0,1706.0,,...,6.0,14.0,0.0,,0.0,2.0,0.0,1.0,0.0,8.0
75%,4974655.5,2008.0,9.0,23.0,6.0,1900.0,1815.0,2030.0,2015.0,,...,8.0,21.0,0.0,,0.0,21.0,0.0,15.0,0.0,34.0


[------------------------NaN's-------------------------]


Unnamed: 0                0
Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                2356
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                   1
ActualElapsedTime      2777
CRSElapsedTime           69
AirTime                2777
ArrDelay               2777
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 2356
TaxiOut                 147
Cancelled                 0
CancellationCode          0
Diverted                  0
CarrierDelay         227283
WeatherDelay         227283
NASDelay             227283
SecurityDelay        227283
LateAircraftDelay    227283
dtype: int64

[--------------Values in categorical variables---------]
------------------UniqueCarrier-------------------
------------Unique Values--------------
Number of unique values is: 20
['WN' 'XE' 'YV' 'OH' 'OO' 'UA' 'US' 'DL' 'EV' 'F9' 'FL' 'HA' 'MQ' 'NW'
 '9E' 'AA' 'AQ' 'AS' 'B6' 'CO']
------------Value Counts--------------


WN    377602
AA    191865
MQ    141920
UA    141426
OO    132433
DL    114238
XE    103663
CO    100195
US     98425
EV     81877
NW     79108
FL     71284
YV     67063
B6     55315
OH     52657
9E     51885
AS     39293
F9     28269
HA      7490
AQ       750
Name: UniqueCarrier, dtype: int64

------------------Dest-------------------
------------Unique Values--------------
Number of unique values is: 304
['TPA' 'BWI' 'JAX' 'LAS' 'MCO' 'MDW' 'PHX' 'FLL' 'PBI' 'RSW' 'HOU' 'BHM'
 'BNA' 'IND' 'PHL' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BOI' 'BUF' 'BUR' 'CLE'
 'CMH' 'DEN' 'ELP' 'GEG' 'IAD' 'ISP' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MHT'
 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PDX' 'PIT' 'PVD' 'RDU' 'RNO' 'SAN'
 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'SNA' 'STL' 'TUL' 'TUS' 'DAL'
 'DTW' 'JAN' 'HRL' 'CRP' 'EWR' 'IAH' 'ROC' 'MYR' 'GSO' 'SAV' 'RIC' 'COS'
 'FAT' 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'CAE' 'DFW' 'DAY' 'MSP' 'GSP' 'MEM'
 'TYS' 'SHV' 'BTV' 'MFE' 'PWM' 'ATL' 'SYR' 'MKE' 'HSV' 'BTR' 'CHS' 'MSN'
 'LFT' 'LRD' 'SRQ' 'CLT' 'VPS' 'AVL' 'GPT' 'LGA' 'ABE' 'BGR' 'DCA' 'ORD'
 'GRR' 'MOB' 'PNS' 'CHA' 'MGM' 'CVG' 'GRK' 'PSP' 'TLH' 'LCH' 'BOS' 'BRO'
 'XNA' 'BPT' 'LEX' 'MTJ' 'AEX' 'MLU' 'DSM' 'CRW' 'CLL' 'ILM' 'JFK' 'ASE'
 'CPR' 'DRO' 'RAP' 'KOA' 'LIH' 'OGG' 'MDT' 'ROA' 'SPI' 'HNL' 'MFR' 'ATW'
 'BMI' 'CA

ORD    108984
ATL    106898
DFW     70657
DEN     63003
LAX     59969
        ...  
INL         9
PIR         3
CYS         1
TUP         1
OGD         1
Name: Dest, Length: 304, dtype: int64

------------------CancellationCode-------------------
------------Unique Values--------------
Number of unique values is: 4
['N' 'A' 'B' 'C']
------------Value Counts--------------


N    1936125
B        307
A        246
C         80
Name: CancellationCode, dtype: int64

------------------Origin-------------------
------------Unique Values--------------
Number of unique values is: 303
['IAD' 'IND' 'ISP' 'JAN' 'JAX' 'LAS' 'LAX' 'LBB' 'LIT' 'MAF' 'MCI' 'MCO'
 'MDW' 'MHT' 'MSY' 'OAK' 'OKC' 'OMA' 'ONT' 'ORF' 'PBI' 'PDX' 'PHL' 'PHX'
 'PIT' 'PVD' 'RDU' 'RNO' 'RSW' 'SAN' 'SAT' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC'
 'SMF' 'SNA' 'STL' 'TPA' 'TUL' 'TUS' 'ABQ' 'ALB' 'AMA' 'AUS' 'BDL' 'BHM'
 'BNA' 'BOI' 'BUF' 'BUR' 'BWI' 'CLE' 'CMH' 'CRP' 'DAL' 'DEN' 'DTW' 'ELP'
 'FLL' 'GEG' 'HOU' 'HRL' 'ROC' 'ORD' 'EWR' 'SYR' 'IAH' 'CRW' 'FAT' 'COS'
 'MRY' 'LGB' 'BFL' 'EUG' 'ICT' 'MEM' 'BTV' 'MKE' 'LFT' 'BRO' 'PWM' 'MSP'
 'SRQ' 'CLT' 'CVG' 'GSO' 'SHV' 'DCA' 'TYS' 'GSP' 'RIC' 'DFW' 'BGR' 'DAY'
 'GRR' 'CHS' 'CAE' 'TLH' 'XNA' 'GPT' 'VPS' 'LGA' 'ATL' 'MSN' 'SAV' 'BTR'
 'LEX' 'LRD' 'MOB' 'MTJ' 'GRK' 'AEX' 'PNS' 'ABE' 'HSV' 'CHA' 'MFE' 'MLU'
 'DSM' 'MGM' 'AVL' 'LCH' 'BOS' 'MYR' 'CLL' 'DAB' 'ASE' 'ATW' 'BMI' 'CAK'
 'CID' 'CPR' 'EGE' 'FLG' 'FSD' 'FWA' 'GJT' 'GRB' 'HNL' 'KOA' 'LAN' 'LIH'
 'MBS' '

ATL    131613
ORD    125979
DFW     95414
DEN     74323
LAX     58772
        ...  
BJI         4
PIR         3
PUB         2
INL         1
TUP         1
Name: Origin, Length: 303, dtype: int64

------------------TailNum-------------------
------------Unique Values--------------
Number of unique values is: 5367
['N712SW' 'N772SW' 'N428WN' ... 'N75428' 'N75429' 'N78008']
------------Value Counts--------------


N325SW    965
N676SW    951
N658SW    945
N313SW    937
N308SA    936
         ... 
9189E       1
N853NW      1
N856NW      1
N76010      1
N78008      1
Name: TailNum, Length: 5366, dtype: int64

# Level 2 - Exercise 2
## Apply some transformation process (standardize numerical data, create dummy columns, polynomials...).

Let's create dummy columns of Cancellation Code attribute.  
(it has only 4 unique values, for the others, we would have to choose the first 5/10 more important and then use a last one with all the others.

In [8]:
# Let's standardize some numerical columns:
list_num_cols = df_train._get_numeric_data().columns
print(list_num_cols)

Index(['Unnamed: 0', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
       'CRSDepTime', 'ArrTime', 'CRSArrTime', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance',
       'TaxiIn', 'TaxiOut', 'Cancelled', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')


In [9]:
# #list for cols to scale
cols_to_scale = ['DepTime',
       'CRSDepTime', 'ArrTime', 'CRSArrTime', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance',
       'TaxiIn', 'TaxiOut', 'Cancelled', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
#create and fit scaler
scaler = StandardScaler()
scaler.fit(df_train[cols_to_scale])
#scale selected data
df_train[cols_to_scale] = scaler.transform(df_train[cols_to_scale])

df_train.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1725493,6430396,2008,11,25,2,-0.591886,-0.535388,0.41741,0.428029,CO,...,0.225159,0.403279,-0.018249,N,-0.063296,,,,,
433818,1341081,2008,3,24,1,-0.263328,-0.335317,-0.284839,-0.449552,OH,...,0.035449,-0.645066,-0.018249,N,-0.063296,-0.440717,-0.172293,-0.444452,-0.04456,0.089276
1297729,4345682,2008,8,11,1,1.357262,1.265255,1.161611,1.092669,XE,...,0.035449,0.682837,-0.018249,N,-0.063296,1.049315,-0.172293,-0.444452,-0.04456,-0.601241
1914658,6952847,2008,12,27,6,0.198429,0.147209,0.39005,0.378558,AS,...,-0.343971,-0.365507,-0.018249,N,-0.063296,,,,,
623257,1978776,2008,4,28,1,-1.304501,-1.288598,-1.0108,-1.245399,OO,...,-0.343971,-0.575176,-0.018249,N,-0.063296,,,,,


In [10]:
# Let's create dummy columns of Cancellation Code attribute
df_train_norm = pd.get_dummies(df_train,columns=['CancellationCode'])
df_train_norm

Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,...,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,CancellationCode_A,CancellationCode_B,CancellationCode_C,CancellationCode_N
1725493,6430396,2008,11,25,2,-0.591886,-0.535388,0.417410,0.428029,CO,...,-0.063296,,,,,,0,0,0,1
433818,1341081,2008,3,24,1,-0.263328,-0.335317,-0.284839,-0.449552,OH,...,-0.063296,-0.440717,-0.172293,-0.444452,-0.04456,0.089276,0,0,0,1
1297729,4345682,2008,8,11,1,1.357262,1.265255,1.161611,1.092669,XE,...,-0.063296,1.049315,-0.172293,-0.444452,-0.04456,-0.601241,0,0,0,1
1914658,6952847,2008,12,27,6,0.198429,0.147209,0.390050,0.378558,AS,...,-0.063296,,,,,,0,0,0,1
623257,1978776,2008,4,28,1,-1.304501,-1.288598,-1.010800,-1.245399,OO,...,-0.063296,,,,,,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153968,501948,2008,1,2,3,-3.318029,2.053772,-1.933755,-2.417659,AA,...,-0.063296,-0.165635,-0.172293,-0.444452,-0.04456,0.065465,0,0,0,1
1582,2239,2008,1,3,4,1.179663,1.300562,1.075882,1.077612,WN,...,-0.063296,-0.440717,-0.172293,-0.444452,-0.04456,-0.005968,0,0,0,1
977013,3271006,2008,6,13,5,1.328402,1.124028,1.362254,1.264743,US,...,-0.063296,1.668251,-0.172293,-0.355651,-0.04456,-0.601241,0,0,0,1
22457,70526,2008,1,24,4,0.045250,0.123671,0.169343,0.012899,WN,...,-0.063296,-0.440717,-0.172293,-0.326050,-0.04456,-0.148833,0,0,0,1
