## **Data Preparation**


### Essential Libraries

- Pandas: Library for Data Acquisition and Preparation

In [1]:
import pandas as pd

## **Departure Delay Dataset**
Dataset from Kaggle: **"Flight Status Prediction"** by *Rob Mulla*  
Source: https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022/data

## Import CSV file

In [2]:
df = pd.read_csv('Flights_2022_7.csv', low_memory=False)  # Importing the July 2022 dataset

In [3]:
df.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,...,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum,Duplicate,Unnamed: 119
0,2022,3,7,19,2,2022-07-19,AA,AA_CODESHARE,19805,AA,...,,,,,,,,,N,
1,2022,3,7,20,3,2022-07-20,AA,AA_CODESHARE,19805,AA,...,,,,,,,,,N,
2,2022,3,7,21,4,2022-07-21,AA,AA_CODESHARE,19805,AA,...,,,,,,,,,N,
3,2022,3,7,24,7,2022-07-24,AA,AA_CODESHARE,19805,AA,...,,,,,,,,,N,
4,2022,3,7,25,1,2022-07-25,AA,AA_CODESHARE,19805,AA,...,,,,,,,,,N,


In [4]:
print("Data dims : ", df.shape)

Data dims :  (618790, 120)
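
Only a small subset of the 120 columns is used later, so one option is to restrict the columns at load time. The sketch below is not part of the original notebook: it uses a tiny in-memory CSV (the hypothetical `sample_csv`) in place of `Flights_2022_7.csv`, and passes a callable to `usecols` so that header names padded with whitespace still match.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for Flights_2022_7.csv.
# Note the trailing space in the "Operating_Airline " header.
sample_csv = io.StringIO(
    "Year,Operating_Airline ,Origin,Dest,DepDelay,Div5TailNum\n"
    "2022,AA,JFK,LAX,-3.0,\n"
    "2022,DL,ATL,SEA,21.0,\n"
)

wanted = {"Operating_Airline", "Origin", "Dest", "DepDelay"}

# usecols accepts a callable that is evaluated against each raw header name.
small = pd.read_csv(sample_csv, usecols=lambda name: name.strip() in wanted)

# The headers still carry their original whitespace, so strip them afterwards.
small.columns = small.columns.str.strip()

print(small.shape)
print(list(small.columns))
```

Loading fewer columns can reduce memory use, but the full load used in this notebook keeps every column available for inspection.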


## Information on Columns

In [5]:
for column in df.columns:
    print(f"{column}: {df[column].dtype}")

Year: int64
Quarter: int64
Month: int64
DayofMonth: int64
DayOfWeek: int64
FlightDate: object
Marketing_Airline_Network: object
Operated_or_Branded_Code_Share_Partners: object
DOT_ID_Marketing_Airline: int64
IATA_Code_Marketing_Airline: object
Flight_Number_Marketing_Airline: int64
Originally_Scheduled_Code_Share_Airline: object
DOT_ID_Originally_Scheduled_Code_Share_Airline: float64
IATA_Code_Originally_Scheduled_Code_Share_Airline: object
Flight_Num_Originally_Scheduled_Code_Share_Airline: float64
Operating_Airline : object
DOT_ID_Operating_Airline: int64
IATA_Code_Operating_Airline: object
Tail_Number: object
Flight_Number_Operating_Airline: int64
OriginAirportID: int64
OriginAirportSeqID: int64
OriginCityMarketID: int64
Origin: object
OriginCityName: object
OriginState: object
OriginStateFips: int64
OriginStateName: object
OriginWac: int64
DestAirportID: int64
DestAirportSeqID: int64
DestCityMarketID: int64
Dest: object
DestCityName: object
DestState: object
DestStateFips: int64
De

### Cleaning Column Names

In [6]:
# Removing Whitespaces
df.columns = df.columns.str.strip()

# Renaming DayofMonth for consistency
df = df.rename(columns={"DayofMonth" : "DayOfMonth"})
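
The effect of the two steps above can be seen on a toy frame. This is a minimal sketch with made-up values; the padded `"Operating_Airline "` name mirrors the trailing space visible in the column listing.

```python
import pandas as pd

# Toy frame with one padded column name and the inconsistently cased "DayofMonth".
toy = pd.DataFrame(
    {"Operating_Airline ": ["AA"], "DayofMonth": [19], "DayOfWeek": [2]}
)

# Strip leading/trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()

# Rename so the casing matches "DayOfWeek".
toy = toy.rename(columns={"DayofMonth": "DayOfMonth"})

print(list(toy.columns))
```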

## **Selecting Essential Columns**

> **DayOfMonth** : Day of Month  
> **DayOfWeek** : Day of Week  
> **Operating_Airline** : Unique Carrier Code  
> **Origin** : Origin Airport  
> **Dest** : Destination Airport  
> **CRSDepTime** : CRS Departure Time (local time: hhmm)  
> **DepDelay** : Difference in minutes between scheduled and actual departure time. Early departures show negative numbers  
> **DepDelayMinutes** : Difference in minutes between scheduled and actual departure time. Early departures set to 0  
> **DepDel15** : Departure Delay Indicator, 15 Minutes or More (1=Yes)  
> **TaxiOut** : Taxi Out Time, in Minutes  
> **Distance** : Distance between airports (miles)  
> **DistanceGroup** : Distance Intervals, every 250 Miles, for Flight Segment  
> **Cancelled** : Cancelled Flight Indicator (1=Yes)  
> **Duplicate** : Duplicate flag marked Y if the flight is swapped based on Form-3A data
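
The three delay columns are tied together by their definitions: `DepDelayMinutes` is `DepDelay` with early departures set to 0, and `DepDel15` flags delays of 15 minutes or more. A minimal sketch on made-up values (not taken from the dataset) that derives the latter two from `DepDelay`:

```python
import pandas as pd

# Made-up delays in minutes; negative values are early departures.
delays = pd.DataFrame({"DepDelay": [-5.0, 0.0, 14.0, 15.0, 42.0]})

# Early departures are floored at 0.
delays["DepDelayMinutes"] = delays["DepDelay"].clip(lower=0)

# 1.0 if the departure was delayed by 15 minutes or more, else 0.0.
delays["DepDel15"] = (delays["DepDelay"] >= 15).astype(float)

print(delays)
```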

In [7]:
# Keep only the essential columns; .copy() gives an independent DataFrame
df = df[['DayOfMonth', 'DayOfWeek', 'Operating_Airline', 'Origin', 'Dest',
         'CRSDepTime', 'DepDelay', 'DepDelayMinutes', 'DepDel15', 'TaxiOut',
         'Distance', 'DistanceGroup', 'Cancelled', 'Duplicate']].copy()

## Removing Cancelled Flights and Cleaning NaN Values

In [8]:
# Removing Cancelled Flights
df = df.drop(df[(df.Cancelled == 1)].index)

# Dropping Cancelled Column
df = df.drop('Cancelled', axis=1)
print("Data dims : ", df.shape)

Data dims :  (607657, 13)
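
The same filter can also be written as a boolean mask, which gives the same result when index labels are unique. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Origin": ["JFK", "ATL", "ORD"], "Cancelled": [0.0, 1.0, 0.0]}
)

# Keep rows that are not cancelled, then drop the now-constant flag column.
kept = toy[toy["Cancelled"] != 1].drop(columns="Cancelled")

print(kept.shape)
```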


In [9]:
# Removing rows with NaN values, if any
df.dropna(inplace=True)
print("Data dims : ", df.shape)

Data dims :  (607657, 13)
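
Before calling `dropna`, it can help to see which columns actually contain missing values. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame(
    {
        "Origin": ["JFK", "ATL", "ORD"],
        "DepDelay": [3.0, nan, -2.0],
        "TaxiOut": [12.0, 15.0, nan],
    }
)

# Count of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(cleaned.shape)
```

An unchanged shape after `dropna`, as in the output above, indicates that no remaining row had a missing value in the selected columns.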


## Check for Duplicate Rows

In [10]:
# Removing Duplicate Flights
df = df.drop(df[(df.Duplicate == 'Y')].index)

# Dropping Duplicate Column
df = df.drop('Duplicate', axis=1)
print("Data dims : ", df.shape)

Data dims :  (607657, 12)


In [11]:
# Manually check for fully identical rows
duplicate_rows = df[df.duplicated(keep=False)]
if not duplicate_rows.empty:
    print("Duplicate rows found:")
    print(duplicate_rows)
else:
    print("No duplicate rows found.")

No duplicate rows found.


<div class="alert alert-block alert-danger">  
No duplicate rows were found, so there are no rows to drop.
</div>
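
For reference, `duplicated(keep=False)` marks every member of a duplicate group, whereas the default `keep='first'` leaves the first occurrence unmarked. A toy illustration with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Origin": ["JFK", "JFK", "ATL"],
        "Dest": ["LAX", "LAX", "SEA"],
        "CRSDepTime": [900, 900, 1230],
    }
)

# keep=False flags all copies, which is convenient for inspecting them.
all_copies = toy.duplicated(keep=False)

# The default keep='first' flags only the later copies.
later_copies = toy.duplicated()

# drop_duplicates() keeps the first occurrence of each group.
deduplicated = toy.drop_duplicates()

print(all_copies.tolist())
print(later_copies.tolist())
print(len(deduplicated))
```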

## **Finalized Columns**

> **DayOfMonth** : Day of Month  
> **DayOfWeek** : Day of Week  
> **Operating_Airline** : Unique Carrier Code  
> **Origin** : Origin Airport  
> **Dest** : Destination Airport  
> **CRSDepTime** : CRS Departure Time (local time: hhmm)  
> **DepDelay** : Difference in minutes between scheduled and actual departure time. Early departures show negative numbers  
> **DepDelayMinutes** : Difference in minutes between scheduled and actual departure time. Early departures set to 0  
> **DepDel15** : Departure Delay Indicator, 15 Minutes or More (1=Yes)  
> **TaxiOut** : Taxi Out Time, in Minutes  
> **Distance** : Distance between airports (miles)  
> **DistanceGroup** : Distance Intervals, every 250 Miles, for Flight Segment  
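
`CRSDepTime` is stored as an integer in local `hhmm` form, so 905 means 09:05. If hour-level features are needed later, one possible way to split it is integer division. This is an illustrative sketch on made-up values, not a step from the original notebook, and the `DepHour` and `DepMinute` names are hypothetical.

```python
import pandas as pd

# Made-up scheduled departure times in hhmm form.
times = pd.DataFrame({"CRSDepTime": [5, 905, 1230, 2359]})

# Integer division by 100 yields the hour; the remainder yields the minute.
times["DepHour"] = times["CRSDepTime"] // 100
times["DepMinute"] = times["CRSDepTime"] % 100

print(times)
```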

### Summary

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 607657 entries, 0 to 618789
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   DayOfMonth         607657 non-null  int64  
 1   DayOfWeek          607657 non-null  int64  
 2   Operating_Airline  607657 non-null  object 
 3   Origin             607657 non-null  object 
 4   Dest               607657 non-null  object 
 5   CRSDepTime         607657 non-null  int64  
 6   DepDelay           607657 non-null  float64
 7   DepDelayMinutes    607657 non-null  float64
 8   DepDel15           607657 non-null  float64
 9   TaxiOut            607657 non-null  float64
 10  Distance           607657 non-null  float64
 11  DistanceGroup      607657 non-null  int64  
dtypes: float64(5), int64(4), object(3)
memory usage: 60.3+ MB


In [13]:
df.to_csv(r"C:\Users\yipip\Desktop\SC1015\Project\Flights_2022_7_cleaned.csv", index=False)
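
The absolute path above exists only on the author's machine. A hedged sketch of a more portable variant builds the path with `pathlib` and reads the file back to confirm the round trip. The `out_dir` name is hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned DataFrame.
toy = pd.DataFrame({"Origin": ["JFK", "ATL"], "DepDelay": [-3.0, 21.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # in a project this could be a relative folder instead
    out_path = out_dir / "Flights_2022_7_cleaned.csv"

    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    restored = pd.read_csv(out_path)

print(restored.shape)
print(list(restored.columns))
```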