### Processing the Data
Load each data file with Python, keeping only the columns of interest, and clean the rows. <br>
The cleaned data is saved to a CSV file for use in SQL/Bash and PySpark. Those tasks are located in separate .ipynb files.

In [1]:
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')

Mounted at /content/drive


Reading each file into a dataframe with only the columns we want to keep.

In [2]:
import pandas as pd
import numpy as np

# Columns to keep from each yearly file
columns_keep = ["FlightDate", "Airline", "Origin", "Dest", "Cancelled", "Diverted", "DepTime", "DepDelay", "ArrTime", "AirTime", "Distance", "Year", "Quarter",
                "Month", "OriginCityName", "OriginStateName", "DestCityName", "DestStateName", "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn", "ArrDelay",
                "DivAirportLandings"]

# Change according to where you have saved the file
df2018 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/BigData/Combined_Flights_2018.csv", usecols=columns_keep)

In [3]:
df2019 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/BigData/Combined_Flights_2019.csv", usecols=columns_keep)

In [4]:
df2022 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/BigData/Combined_Flights_2022.csv", usecols=columns_keep)
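
The three reads above differ only in the year, so they could be written as one loop. This is a minimal sketch, assuming a hypothetical `load_years` helper (not part of the original notebook) and the same `Combined_Flights_<year>.csv` naming pattern. The tiny CSV files written here are stand-ins for the real data so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd


def load_years(data_dir, years, columns):
    """Read Combined_Flights_<year>.csv for each year, keeping only `columns`.

    Hypothetical helper for illustration; returns a dict of year -> DataFrame.
    """
    frames = {}
    for year in years:
        path = os.path.join(data_dir, f"Combined_Flights_{year}.csv")
        frames[year] = pd.read_csv(path, usecols=columns)
    return frames


# Write two small stand-in files (made-up rows, not the flight data)
demo_dir = tempfile.mkdtemp()
for year in (2018, 2019):
    pd.DataFrame(
        {"FlightDate": [f"{year}-01-01"], "Airline": ["Demo Air"], "Unused": [0]}
    ).to_csv(os.path.join(demo_dir, f"Combined_Flights_{year}.csv"), index=False)

frames = load_years(demo_dir, [2018, 2019], ["FlightDate", "Airline"])
print({year: list(frame.columns) for year, frame in frames.items()})
```

Keeping the frames in a dict keyed by year makes it easy to clean them in a loop as well, though separate variables as used in this notebook work just as well.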

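Clean each dataframe: parse `FlightDate` as a datetime, drop every row with a missing value, and convert the boolean columns to 1/0. Row counts are printed before and after to verify that the cleaning works.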

In [5]:
print("Before 2018df length %d" % len(df2018))

df2018["FlightDate"] = pd.to_datetime(df2018["FlightDate"], errors='coerce')
df2018 = df2018.dropna()
print("After 2018df length %d" % len(df2018))

# Change true/false to 1/0 to be able to store in db
df2018['Cancelled'] = df2018['Cancelled'].map({True: 1, False: 0})
df2018['Diverted'] = df2018['Diverted'].map({True: 1, False: 0})

print("2018df columns %d" % len(df2018.columns))

Before 2018df length 5689512
After 2018df length 5578618


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2018['Cancelled'] = df2018['Cancelled'].map({True: 1, False: 0})


2018df columns 24
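
The warning in the output above is pandas' `SettingWithCopyWarning`. A likely cause is that the frame returned by `dropna()` is tracked as a possible copy of the original, so the later column assignment looks ambiguous. One common way to avoid it is to take an explicit copy before assigning. This is a minimal sketch on made-up rows, not the flight data. `astype(int)` is swapped in for the `map` call and assumes the columns are already boolean.

```python
import pandas as pd

# Made-up rows: one bad date and one missing delay
raw = pd.DataFrame(
    {
        "FlightDate": ["2018-01-01", "not a date", "2018-01-03", "2018-01-04"],
        "Cancelled": [False, True, False, True],
        "Diverted": [False, False, True, False],
        "DepDelay": [5.0, 1.0, None, 0.0],
    }
)

# Unparseable dates become NaT, so dropna() removes them too
raw["FlightDate"] = pd.to_datetime(raw["FlightDate"], errors="coerce")

# .copy() makes the result an independent frame before columns are assigned
clean = raw.dropna().copy()
clean["Cancelled"] = clean["Cancelled"].astype(int)
clean["Diverted"] = clean["Diverted"].astype(int)
print(clean)
```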


In [6]:
print("Before 2019df length %d" % len(df2019))

df2019["FlightDate"] = pd.to_datetime(df2019["FlightDate"], errors='coerce')
df2019 = df2019.dropna()
print("After 2019df length %d" % len(df2019))

# Change true/false to 1/0 to be able to store in db
df2019['Cancelled'] = df2019['Cancelled'].map({True: 1, False: 0})
df2019['Diverted'] = df2019['Diverted'].map({True: 1, False: 0})

print("2019df columns %d" % len(df2019.columns))

Before 2019df length 8091684
After 2019df length 7917264


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2019['Cancelled'] = df2019['Cancelled'].map({True: 1, False: 0})


2019df columns 24


In [7]:
print("Before 2022df length %d" % len(df2022))

df2022["FlightDate"] = pd.to_datetime(df2022["FlightDate"], errors='coerce')
df2022 = df2022.dropna()
print("After 2022df length %d" % len(df2022))

df2022['Cancelled'] = df2022['Cancelled'].map({True: 1, False: 0})
df2022['Diverted'] = df2022['Diverted'].map({True: 1, False: 0})

print("2022df columns %d" % len(df2022.columns))

Before 2022df length 4078318
After 2022df length 3944916
2022df columns 24
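
Each cell above drops every row that has a missing value in any kept column, which accounts for the gap between the before and after counts. To see which columns drive that loss, the missing values can be counted per column before calling `dropna()`. This is a small sketch on made-up rows, not the flight data:

```python
import pandas as pd

# Made-up rows standing in for one yearly dataframe
sample = pd.DataFrame(
    {
        "FlightDate": ["2022-01-01", "2022-01-02", "2022-01-03"],
        "Cancelled": [False, True, False],
        "DepTime": [930.0, None, 1015.0],
        "ArrDelay": [4.0, None, None],
    }
)

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that dropna() would remove
rows_dropped = len(sample) - len(sample.dropna())

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

If cancelled flights in the real files turn out to lack departure or arrival fields, `dropna()` would remove them, so it may be worth checking which `Cancelled` values remain after cleaning.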


Concatenate the three yearly dataframes into a single dataframe.

In [8]:
df = pd.concat([df2018, df2019])
print(len(df))

13495882


In [9]:
df = pd.concat([df, df2022])
print(len(df))

17440798
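
The two-step concatenation above can also be done in a single call, since `pd.concat` accepts a list of any length. This is a sketch with small stand-in frames. `ignore_index=True` is an optional addition that renumbers the rows instead of keeping each year's original index.

```python
import pandas as pd

# Stand-in frames for the three cleaned yearly dataframes
df_a = pd.DataFrame({"Year": [2018, 2018], "DepDelay": [3.0, -2.0]})
df_b = pd.DataFrame({"Year": [2019], "DepDelay": [10.0]})
df_c = pd.DataFrame({"Year": [2022, 2022], "DepDelay": [0.0, 7.0]})

# One call instead of two; the index runs from 0 to len - 1
combined = pd.concat([df_a, df_b, df_c], ignore_index=True)
print(len(combined))
print(combined.index.tolist())
```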


In [10]:
# Save df to csv file
df.to_csv("filtered_flights.csv", index=False)

print("Saved")

Saved


Copy the file into the Colab Notebooks folder on Google Drive so that the other notebooks can use it.

In [14]:
!cp filtered_flights.csv /content/drive/MyDrive/'Colab Notebooks'
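
The same copy can be done from Python with `shutil.copy`, which makes it easy to confirm that the file arrived. This is a sketch that uses temporary directories in place of the Colab working directory and the Drive folder:

```python
import os
import shutil
import tempfile

# Temporary directories stand in for the working directory and the Drive folder
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()

# Write a small stand-in for filtered_flights.csv
src_path = os.path.join(src_dir, "filtered_flights.csv")
with open(src_path, "w") as f:
    f.write("FlightDate,Airline\n2018-01-01,Demo Air\n")

# shutil.copy returns the path of the newly created file
dst_path = shutil.copy(src_path, dst_dir)

print(os.path.exists(dst_path))
print(os.path.getsize(dst_path) == os.path.getsize(src_path))
```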