# Importing the required modules

In [1]:
# modules used for data handling and
# manipulation
import numpy as np
import pandas as pd

# modules used for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Reading the data

In [2]:
flight_df = pd.read_csv("DelayData.csv")

In [3]:
flight_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201664 entries, 0 to 1201663
Data columns (total 61 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   depdelay                 1201664 non-null  int64  
 1   arrdelay                 1198458 non-null  float64
 2   scheduleddepartdatetime  1201664 non-null  object 
 3   origin                   1201664 non-null  object 
 4   dest                     1201664 non-null  object 
 5   uniquecarrier            1201664 non-null  object 
 6   marketshareorigin        1201664 non-null  float64
 7   marketsharedest          1201664 non-null  float64
 8   hhiorigin                1201664 non-null  float64
 9   hhidest                  1201664 non-null  float64
 10  nonhubairportorigin      1201664 non-null  int64  
 11  smallhubairportorigin    1201664 non-null  int64  
 12  mediumhubairportorigin   1201664 non-null  int64  
 13  largehubairportorigin    1201664 non-null 

# Converting dummies to ordinal values

Based on the column names, a few of the columns are already one-hot encoded. These include the following:
1. Temperature ranges: each column represents one temperature range and has a value of 1 when the `temperature` falls in that range, and 0 otherwise. (`temp_ninfty_n10`, `temp_n10_0`, `temp_0_10`, `temp_10_20`, `temp_20_30`, `temp_30_40`, `temp_40_infty`)
2. Airport Connectivity Variables: these columns denote whether the origin and destination airports are hubs for some airline. They also categorise such hubs as small, medium, or large. (`nonhubairportorigin`, `smallhubairportorigin`, `mediumhubairportorigin`, `largehubairportorigin`, `nonhubairportdest`, `smallhubairportdest`, `mediumhubairportdest`, `largehubairportdest`)
3. Airline Connectivity Variables: these columns denote whether the origin and destination airports are hubs for the `uniquecarrier`. They also categorise such hubs as small, medium, or large. (`nonhubairlineorigin`, `smallhubairlineorigin`, `mediumhubairlineorigin`, `largehubairlineorigin`, `nonhubairlinedest`, `smallhubairlinedest`, `mediumhubairlinedest`, `largehubairlinedest`)
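
Since the one-hot reading above is inferred from column names only, it may be worth checking that each row sets at most one dummy per group. The sketch below runs such a check on a small made-up frame. The helper `onehot_row_counts` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# toy stand-in for flight_df; the values are made up for illustration
toy = pd.DataFrame({
    "nonhubairportorigin":    [1, 0, 0, 0],
    "smallhubairportorigin":  [0, 1, 0, 0],
    "mediumhubairportorigin": [0, 0, 0, 0],
    "largehubairportorigin":  [0, 0, 1, 0],
})

def onehot_row_counts(df, dummies):
    """Count rows by how many dummies of the group are set to 1 (hypothetical helper)."""
    return df[dummies].sum(axis=1).value_counts().sort_index()

counts = onehot_row_counts(toy, list(toy.columns))
print(counts)
```

A group is a clean one-hot encoding when every row sums to exactly 1. Rows that sum to 0 have no category set, and rows that sum to more than 1 would make the group ambiguous.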


However, the dummies within each of the above groups encode ordered information (colder to warmer, smaller to larger hub), so each group can be collapsed into a single ordinal feature. The following changes will therefore be made to the respective groups:

1. For temperature range, a new column `temp_range` will be created, having seven categories: ninfty_n10, n10_0, 0_10, 10_20, 20_30, 30_40, and 40_infty.
2. For airport connectivity variables, two new columns will be created: `hubairportorigin` and `hubairportdest`. Each of these columns will contain four categories: nonhub, smallhub, mediumhub, and largehub.
3. For airline connectivity variables, two new columns will be created: `hubairlineorigin` and `hubairlinedest`. Each of these columns will contain four categories: nonhub, smallhub, mediumhub, and largehub.

The original dummy columns will then be dropped.
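
As a minimal sketch of the conversion described above, the example below decodes two small made-up groups of dummies with `pd.from_dummies` (available from pandas 1.5). The toy frames and their values are assumptions for illustration. Only the column names come from the dataset.

```python
import pandas as pd

# temperature dummies share the prefix "temp", separated by "_"
temp = pd.DataFrame({
    "temp_ninfty_n10": [1, 0, 0],
    "temp_n10_0":      [0, 1, 0],
    "temp_0_10":       [0, 0, 0],   # third row has no dummy set
})
# sep="_" splits each name at the first "_": prefix "temp", category = the rest
temp_decoded = pd.from_dummies(temp, sep="_", default_category="unknown")
print(temp_decoded)

# hub dummies have no separator, so the category is the full column name
hub = pd.DataFrame({
    "nonhubairportorigin":   [1, 0],
    "smallhubairportorigin": [0, 1],
})
hub_decoded = (
    pd.from_dummies(hub, default_category="unknown")
      .iloc[:, 0]                                    # single resulting column
      .str.replace("airportorigin", "", regex=False) # keep only the hub size
)
print(hub_decoded)
```

With a separator, the decoded column is named after the shared prefix (`temp` here). Rows in which no dummy is set receive the `default_category` label.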

In [4]:
# extracting all dummy features from the respective categories.
cols = list(flight_df.columns)
temperature_range = cols[42:49]
airport_connectivity_origin = cols[10:14]
airport_connectivity_dest = cols[14:18]
airline_connectivity_origin = cols[18:22]
airline_connectivity_dest = cols[22:26]
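
The positional slices above depend on the exact column order of the CSV. A name-based selection is one possible alternative, sketched below on an empty toy frame. The helper `hub_columns` is hypothetical, and the `temp_` prefix filter assumes that no unrelated column starts with `temp_`.

```python
import pandas as pd

# empty toy frame carrying a few of the column names listed earlier
toy = pd.DataFrame(columns=[
    "depdelay", "nonhubairportorigin", "smallhubairportorigin",
    "mediumhubairportorigin", "largehubairportorigin",
    "nonhubairportdest", "temp_ninfty_n10", "temp_n10_0",
])

hub_sizes = ["nonhub", "smallhub", "mediumhub", "largehub"]

def hub_columns(df, suffix):
    """Build the four hub dummy names for a suffix and verify that they exist."""
    names = [size + suffix for size in hub_sizes]
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise KeyError(f"missing dummy columns: {missing}")
    return names

airport_origin = hub_columns(toy, "airportorigin")
# assumption: only the temperature dummies start with "temp_"
temperature = [col for col in toy.columns if col.startswith("temp_")]
print(airport_origin, temperature)
```

Selecting by name fails loudly if a column is missing, whereas a positional slice would silently pick up whichever columns happen to sit at those indices.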

In [5]:
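# converts a group of dummy columns into a single categorical feature and
# drops the original dummy columns from flight_df (modified in place).
def onehot2ordinal(new_colname: str, dummies: list, str2replace: str = None, sep: str = None):

    # from_dummies (pandas >= 1.5) returns a one-column DataFrame; take that
    # column as a Series. Rows in which every dummy is 0 become "unknown".
    flight_df[new_colname] = pd.from_dummies(flight_df[dummies],
                                             default_category = "unknown",
                                             sep = sep).iloc[:, 0]
    # without a separator the category is the full column name, so strip the
    # shared suffix (e.g. "airportorigin") to leave only the hub size.
    if sep is None:
        flight_df[new_colname] = flight_df[new_colname].str.replace(str2replace, "", regex = False)

    flight_df.drop(dummies, axis = 1, inplace = True)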

In [6]:
# applies the function to all the respective kinds of dummy features.
arguments = [["temp_range", temperature_range, None, "_"],
             ["hubairportorigin", airport_connectivity_origin, "airportorigin", None],
             ["hubairportdest", airport_connectivity_dest, "airportdest", None],
             ["hubairlineorigin", airline_connectivity_origin, "airlineorigin", None],
             ["hubairlinedest", airline_connectivity_dest, "airlinedest", None]]

for new_colname, dummies, str2replace, sep in arguments:
    
    onehot2ordinal(new_colname = new_colname, dummies = dummies,
                   str2replace = str2replace, sep = sep)
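
The columns produced above hold plain string labels, so pandas does not yet know their order. If a later step needs them to behave as ordinal values, one option is an ordered categorical dtype, sketched below on a toy column. The category order is an assumption based on the hub size names, and mapping the "unknown" label to a missing value is one possible choice among several.

```python
import pandas as pd

# toy stand-in for one of the new hub columns
toy = pd.DataFrame({
    "hubairportorigin": ["smallhub", "nonhub", "largehub", "unknown", "mediumhub"]
})

# assumed order, smallest to largest hub
hub_order = ["nonhub", "smallhub", "mediumhub", "largehub"]
hub_dtype = pd.CategoricalDtype(categories=hub_order, ordered=True)

# labels outside the assumed order (here "unknown") are set to missing first
labels = toy["hubairportorigin"]
toy["hubairportorigin"] = labels.where(labels.isin(hub_order)).astype(hub_dtype)

# integer codes follow the category order; missing values get the code -1
toy["hub_code"] = toy["hubairportorigin"].cat.codes
print(toy)
```

An ordered categorical supports comparisons such as `toy["hubairportorigin"] >= "mediumhub"`, and the integer codes can be passed to models that expect numeric input.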