### Data Preprocessing (Stage 1)

This notebook is used to experiment and identify the raw data preprocessing.

#### Data Reading

In [1]:
import pandas as pd

df = pd.read_csv("../data/combine_files.csv")

  df = pd.read_csv("../data/combine_files.csv")


#### Data Analysis

In [4]:
df.Cancelled.value_counts()

Cancelled
0.0    18240587
1.0      265138
Name: count, dtype: int64

We will work on Classification task whether the flight is cancelled or not, but as you can see the "Cancelled" column is too imbalanced, therefore we decided to take only 600k rows of data: 265138 rows where Cancelled = 1, and rest is 0.

Deal with data imbalance

In [2]:
# Separate the two classes
cancelled_1 = df[df['Cancelled'] == 1.0]
cancelled_0 = df[df['Cancelled'] == 0.0].sample(
    n=600000 - len(cancelled_1), random_state=42)

# Combine them
df_balanced = pd.concat([cancelled_1, cancelled_0])

# Shuffle the resulting DataFrame
df_balanced = df_balanced.sample(
    frac=1, random_state=42).reset_index(drop=True)

In [6]:
df_balanced.Cancelled.value_counts()

Cancelled
0.0    334862
1.0    265138
Name: count, dtype: int64

Now, the data is good!

#### Let's look at other columns

In [3]:
df_balanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 26 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Year               600000 non-null  int64  
 1   Month              600000 non-null  int64  
 2   DayofMonth         600000 non-null  int64  
 3   DayOfWeek          600000 non-null  int64  
 4   DepTime            343919 non-null  float64
 5   CRSDepTime         600000 non-null  int64  
 6   ArrTime            334740 non-null  float64
 7   CRSArrTime         600000 non-null  int64  
 8   ActualElapsedTime  334032 non-null  float64
 9   CRSElapsedTime     599984 non-null  float64
 10  AirTime            334032 non-null  float64
 11  ArrDelay           333989 non-null  float64
 12  DepDelay           343631 non-null  float64
 13  Origin             600000 non-null  object 
 14  Dest               600000 non-null  object 
 15  Distance           600000 non-null  float64
 16  Ta

#### Lets convert time into datetime

In [2]:
import pandas as pd
import numpy as np
from datetime import time

df_final = df.copy()

# Convert hhmm time columns to datetime.time format for SQL


def hhmm_to_time(x):
    if pd.isnull(x):
        return None
    x = int(x)
    h = x // 100
    m = x % 100
    if h < 24 and m < 60:
        return time(hour=h, minute=m)
    return None


for col in ['DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime']:
    df_final[col] = df_final[col].apply(hhmm_to_time)

# df_final = df_final.astype(object).where(pd.notnull(df_final), None)

# Drop leakage or post-cancellation columns for Cancelled classification (will be used further)
# leakage_cols = [
#     'ActualElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay',
#     'TaxiIn', 'TaxiOut', 'CarrierDelay', 'WeatherDelay',
#     'NASDelay', 'SecurityDelay', 'LateAircraftDelay',
#     'CancellationCode'
# ]
# df_final.drop(columns=leakage_cols, inplace=True)

# Impute missing numeric values with median
# for col in df_final.select_dtypes(include=[np.number]).columns:
#     if df_final[col].isnull().any():
#         df_final[col].fillna(df_final[col].median(), inplace=True)

# Fill missing categorical values with None (to become NULL in SQL)
# for col in df_final.select_dtypes(include=['object']).columns:
#     df_final[col] = df_final[col].where(df_final[col].notna(), None)

In [5]:
df_final.to_csv("preprocessed_combine_files2.csv", index=False)

In [5]:
df_final.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,ActualElapsedTime,CRSElapsedTime,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2016,1,19,2,,14:05:00,,15:35:00,,90.0,...,,,1.0,C,0.0,,,,,
1,2016,8,22,1,,18:50:00,,19:59:00,,69.0,...,,,1.0,B,0.0,,,,,
2,2018,11,5,1,09:49:00,09:45:00,11:01:00,11:01:00,72.0,76.0,...,7.0,15.0,0.0,,0.0,,,,,
3,2016,5,20,5,,12:31:00,,13:45:00,,74.0,...,,,1.0,B,0.0,,,,,
4,2016,9,7,3,16:00:00,16:11:00,18:06:00,18:02:00,126.0,111.0,...,35.0,18.0,0.0,,0.0,,,,,


In [6]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18505725 entries, 0 to 18505724
Data columns (total 26 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Year               int64  
 1   Month              int64  
 2   DayofMonth         int64  
 3   DayOfWeek          int64  
 4   DepTime            object 
 5   CRSDepTime         object 
 6   ArrTime            object 
 7   CRSArrTime         object 
 8   ActualElapsedTime  float64
 9   CRSElapsedTime     float64
 10  AirTime            float64
 11  ArrDelay           float64
 12  DepDelay           float64
 13  Origin             object 
 14  Dest               object 
 15  Distance           float64
 16  TaxiIn             float64
 17  TaxiOut            float64
 18  Cancelled          float64
 19  CancellationCode   object 
 20  Diverted           float64
 21  CarrierDelay       float64
 22  WeatherDelay       float64
 23  NASDelay           float64
 24  SecurityDelay      float64
 25  LateAircraftDela

In [6]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 26 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Year               600000 non-null  int64  
 1   Month              600000 non-null  int64  
 2   DayofMonth         600000 non-null  int64  
 3   DayOfWeek          600000 non-null  int64  
 4   DepTime            343879 non-null  object 
 5   CRSDepTime         600000 non-null  object 
 6   ArrTime            334586 non-null  object 
 7   CRSArrTime         599995 non-null  object 
 8   ActualElapsedTime  334032 non-null  float64
 9   CRSElapsedTime     599984 non-null  float64
 10  AirTime            334032 non-null  float64
 11  ArrDelay           333989 non-null  float64
 12  DepDelay           343631 non-null  float64
 13  Origin             600000 non-null  object 
 14  Dest               600000 non-null  object 
 15  Distance           600000 non-null  float64
 16  Ta