# <center>Flight Delay Prediction</center>

<div style="text-align: justify">
In the last decades, with the vast amount of digital data generated from various sources such 
as social media websites, social networks, audio and video content, and commercial and 
financial data, there has been a need for effective solutions to understand and extract 
information from this vast amount of data. Traditional data analysis approaches cannot handle 
large, complex datasets or cope with big data in general. To address the challenges of big data 
analysis, machine learning techniques coupled with scalable parallel computing systems have 
been combined as a promising solution. By leveraging parallel machine learning algorithms, 
scalable computing, and storage infrastructures, it becomes possible to analyze massive and 
intricate datasets, yielding valuable insights within reasonable timeframes (Talia and Trunfio, 
2012). This proposal aims to explore the use of parallel computing 
techniques to tackle a major economic challenge in big data analytics: flight delay prediction. 
Each year, around 20% of airline flights experience delays or cancellations, primarily 
attributed to factors such as adverse weather conditions, carrier equipment issues, and 
technical problems at airports. These delays incur substantial costs for both airlines and 
passengers. For example, in 2007, flight delays were estimated to have cost the US economy 
$32.9 billion, with more than half of the financial burden borne by passengers (Ball et al., 
2010). Accurate and timely weather forecasts are essential to making informed 
decisions and minimizing potential risks. Flight delays due to adverse weather conditions can have significant economic and 
operational consequences for airlines, passengers, and the entire air transport system. Big data 
techniques have been proposed to load, store, manage, and analyze large volumes of weather data, and to apply data mining algorithms that predict flight delays from weather observations. Applying big data analytics to the weather dataset is intended to overcome the limitations of traditional data management techniques and technologies. 
The core of this research proposal is to evaluate the influence of weather observation factors 
on flight delays and to build a model that predicts departure delays from weather observations. 
The approach analyzes large amounts of weather and flight data to detect correlations and 
insights, enabling better decision-making and potentially reducing the impact of weather-related delays.

</div>


## Loading the necessary libraries and packages

In [2]:
# pandas is used for data manipulation and analysis; it provides data structures such as DataFrames for tabular data.
import pandas as pd

# NumPy (Numerical Python) is the fundamental package for scientific computing in Python.
import numpy as np

# seaborn is a data visualization library built on Matplotlib, designed for informative and attractive statistical graphics.
import seaborn as sns

# geopandas extends pandas to support geospatial data.
import geopandas as gpd

# plotly is used for creating interactive graphs.
import plotly.express as px
import plotly.graph_objs as go

# matplotlib is a common library for creating static, animated, and interactive visualizations in Python.
import matplotlib

# pprint pretty-prints lists, tuples, and dictionaries recursively in a human-readable format.
import pprint

# matplotlib.pyplot provides a high-level interface for creating various types of plots and charts.
import matplotlib.pyplot as plt


In [3]:
# Taking care of jupyter environment 
# show graphs in-line, and turn on/off pretty_printing of lists
%matplotlib inline 
%pprint       

Pretty printing has been turned OFF


In [4]:
# Suppress all warnings to keep the notebook output clean
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Retina quality: render figures at a higher resolution so plots look sharper on retina displays.
# If your monitor's resolution is sub-retina, then the improvement will be less noticeable [2].
%config InlineBackend.figure_format = 'retina'
sns.set_context('talk')

## Exploratory Data Analysis (EDA)

#### Reading Data from Files

https://www.kaggle.com/code/dansteveadekanbi/predict-the-delay-of-a-flight-using-minutes/input?select=full_data_flightdelay.csv


In [7]:
# Read the dataset from the CSV file
df = pd.read_csv('weather_flightdelay.csv') 
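
For a CSV with millions of rows, it can help to load only the columns needed and to give them compact dtypes. The following is a minimal sketch, not the loading code used in this notebook: the `load_flights` helper, the `COMPACT_DTYPES` mapping, and the two-row in-memory sample are illustrative assumptions; only the column names come from the dataset.

```python
import io

import pandas as pd

# Assumed compact dtypes for a few columns of the dataset (illustrative choice).
COMPACT_DTYPES = {
    "MONTH": "int8",
    "DAY_OF_WEEK": "int8",
    "DEP_DEL15": "int8",
    "DEP_TIME_BLK": "category",
    "CARRIER_NAME": "category",
    "PRCP": "float32",
}


def load_flights(source, columns=None):
    """Hypothetical helper: read only the selected columns with compact dtypes."""
    cols = list(COMPACT_DTYPES) if columns is None else list(columns)
    dtypes = {c: COMPACT_DTYPES[c] for c in cols if c in COMPACT_DTYPES}
    return pd.read_csv(source, usecols=cols, dtype=dtypes)


# Tiny in-memory stand-in for the real file, so the sketch runs on its own.
sample_csv = io.StringIO(
    "MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,CARRIER_NAME,PRCP,TMAX\n"
    "1,7,0,0800-0859,Southwest Airlines Co.,0.0,65.0\n"
    "12,7,1,2100-2159,Hawaiian Airlines Inc.,0.06,84.0\n"
)
flights = load_flights(sample_csv)
print(flights.dtypes)
```

In practice `source` would be the path to the CSV file; columns left out of `COMPACT_DTYPES` (here `TMAX`) are simply not loaded unless requested through `columns`.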

#### Observing and describing data

In [9]:
# Display the first five observations of the DataFrame
df.head() 

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
0,1,7,0,0800-0859,2,1,25,143,Southwest Airlines Co.,13056,...,8,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
1,1,7,0,0700-0759,7,1,29,191,Delta Air Lines Inc.,13056,...,3,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
2,1,7,0,0600-0659,7,1,27,199,Delta Air Lines Inc.,13056,...,18,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
3,1,7,0,0600-0659,9,1,27,180,Delta Air Lines Inc.,13056,...,2,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
4,1,7,0,0001-0559,7,1,10,182,Spirit Air Lines,13056,...,1,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91


In [10]:
# Display the last five observations of the DataFrame
df.tail()

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
6489057,12,7,0,2300-2359,1,11,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489058,12,7,0,1800-1859,1,11,2,123,Hawaiian Airlines Inc.,1318,...,16,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489059,12,7,0,2000-2059,1,11,2,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489060,12,7,0,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489061,12,7,1,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,15,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21


In [11]:
# The .shape attribute returns a tuple representing the dimensionality of the DataFrame,
# that is, the number of rows and columns in our DataFrame [5].
df.shape

(6489062, 26)

The dataset contains 6489062 rows and 26 columns.

In [13]:
# Display the column names
df.columns

Index(['MONTH', 'DAY_OF_WEEK', 'DEP_DEL15', 'DEP_TIME_BLK', 'DISTANCE_GROUP',
       'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS', 'NUMBER_OF_SEATS',
       'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH', 'AIRLINE_FLIGHTS_MONTH',
       'AIRLINE_AIRPORT_FLIGHTS_MONTH', 'AVG_MONTHLY_PASS_AIRPORT',
       'AVG_MONTHLY_PASS_AIRLINE', 'FLT_ATTENDANTS_PER_PASS',
       'GROUND_SERV_PER_PASS', 'PLANE_AGE', 'DEPARTING_AIRPORT', 'LATITUDE',
       'LONGITUDE', 'PREVIOUS_AIRPORT', 'PRCP', 'SNOW', 'SNWD', 'TMAX',
       'AWND'],
      dtype='object')
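
A common next step when describing the data is to check each column's type, missing values, and number of distinct values. The sketch below is one possible way to do this; the `summarize` helper and the three-row sample frame are hypothetical and only reuse column names from the dataset.

```python
import numpy as np
import pandas as pd


def summarize(frame):
    """Hypothetical helper: one row per column with dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # nunique() ignores missing values by default
    })


# Made-up sample with two deliberately missing entries.
sample = pd.DataFrame({
    "MONTH": [1, 1, 12],
    "CARRIER_NAME": ["Delta Air Lines Inc.", None, "Hawaiian Airlines Inc."],
    "PRCP": [0.0, np.nan, 0.06],
})
summary = summarize(sample)
print(summary)
```

Calling `summarize(df)` on the full DataFrame would give the same overview for all 26 columns.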

MONTH --> Month of the year (1-12)
DAY_OF_WEEK --> Day of the week (1-7)
DEP_DEL15 --> Target variable: binary flag, 1 if the departure was delayed over 15 minutes
DEP_TIME_BLK --> Departure time block
DISTANCE_GROUP --> Flight distance group
SEGMENT_NUMBER -->
CONCURRENT_FLIGHTS -->
NUMBER_OF_SEATS --> Number of seats on the aircraft
CARRIER_NAME --> Carrier name
AIRPORT_FLIGHTS_MONTH -->
AIRLINE_FLIGHTS_MONTH -->
AIRLINE_AIRPORT_FLIGHTS_MONTH -->
AVG_MONTHLY_PASS_AIRPORT -->
AVG_MONTHLY_PASS_AIRLINE -->
FLT_ATTENDANTS_PER_PASS -->
GROUND_SERV_PER_PASS -->
PLANE_AGE -->
DEPARTING_AIRPORT --> Departure airport
LATITUDE --> Latitude for airport
LONGITUDE --> Longitude for airport
PREVIOUS_AIRPORT -->
PRCP --> Inches of precipitation for day
SNOW --> Inches of snowfall for day
SNWD --> Inches of snow on ground for day
TMAX --> Max temperature for day
AWND --> Average daily wind speed
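
Since the proposal is about how weather relates to delays, a first look could compare the share of delayed flights on days with and without precipitation. This is a minimal sketch on made-up numbers, assuming `DEP_DEL15` is a 0/1 flag and `PRCP` is the daily precipitation; the `delay_rate_by_rain` helper and the "wet"/"dry" labels are hypothetical.

```python
import pandas as pd


def delay_rate_by_rain(frame):
    """Hypothetical helper: mean of DEP_DEL15 for days with and without precipitation."""
    day_type = (frame["PRCP"] > 0).map({True: "wet", False: "dry"}).rename("day_type")
    # The mean of a 0/1 flag is the fraction of delayed flights in each group.
    return frame.groupby(day_type)["DEP_DEL15"].mean()


# Made-up sample: six flights, two of them on days with precipitation.
sample = pd.DataFrame({
    "DEP_DEL15": [0, 1, 0, 1, 1, 0],
    "PRCP": [0.0, 0.3, 0.0, 0.0, 1.2, 0.0],
})
rates = delay_rate_by_rain(sample)
print(rates)
```

The same grouping idea extends to `SNOW` or to binned `TMAX` values; a difference in rates suggests an association, not a causal effect.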


AIRPORT_COORDINATES
	ORIGIN_AIRPORT_ID: 	Airport ID, matches to ORIGIN_AIRPORT_ID in other files
	DISPLAY_AIRPORT_NAME:  	Display Airport, matches to DISPLAY_AIRPORT_NAME in other files
	LATITUDE: 		Latitude for airport
	LONGITUDE: 		Longitude for airport

B43_AIRCRAFT_INVENTORY
	MANUFACTURE_YEAR: 	Manufacture year
	TAIL_NUM: 		Unique tail number, matches to TAIL_NUM in other files
	NUMBER_OF_SEATS: 	Number of seats on aircraft
	
CARRIER_DECODE
	AIRLINE_ID: 		Airline ID, matches to AIRLINE_ID in other files
	OP_UNIQUE_CARRIER: 	Carrier code, matches to OP_UNIQUE_CARRIER in other files
	CARRIER_NAME: 		Carrier name, matches to UNIQUE_CARRIER_NAME or CARRIER_NAME in other files

ONTIME_REPORTING_XX
	MONTH: 			Month
	DAY_OF_MONTH: 		Day of the month (1-31)
	DAY_OF_WEEK: 		Day of the week
	OP_UNIQUE_CARRIER: 	Carrier code, matches to OP_UNIQUE_CARRIER in other files
	TAIL_NUM: 		Unique tail number, matches to TAIL_NUM in other files
	OP_CARRIER_FL_NUM: 	Flight number
	ORIGIN_AIRPORT_ID: 	Airport ID, matches to ORIGIN_AIRPORT_ID in other files
	ORIGIN: 		Origin airport abbreviation
	ORIGIN_CITY_NAME: 	Origin city name
	DEST_AIRPORT_ID: 	Destination airport ID, matches Airport ID in other files
	DEST: 			Destination airport abbreviation
	DEST_CITY_NAME: 	Destination city name
	CRS_DEP_TIME: 		Planned departure time
	DEP_TIME: 		Actual departure time
	DEP_DELAY_NEW: 		Departure delay in minutes
	DEP_DEL15:		Target variable: binary flag, 1 if delayed over 15 min
	DEP_TIME_BLK:		Departure time block
	CRS_ARR_TIME:		Planned arrival time
	ARR_TIME:		Actual arrival time
	ARR_DELAY_NEW:		Arrival delay in minutes
	ARR_TIME_BLK:		Arrival time block
	CANCELLED:		Flag if flight was cancelled
	CANCELLATION_CODE:	Cancellation Code
	CRS_ELAPSED_TIME:	Flight planned elapsed time
	ACTUAL_ELAPSED_TIME:	Flight actual elapsed time
	DISTANCE:		Flight Distance in miles
	DISTANCE_GROUP:		Flight distance group
	CARRIER_DELAY:		Flag for a carrier delay
	WEATHER_DELAY:		Flag for a weather delay
	NAS_DELAY:		Flag for a NAS delay
	SECURITY_DELAY:		Flag for a security delay
	LATE_AIRCRAFT_DELAY:	Flag for a late aircraft delay
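
The target `DEP_DEL15` can be read as a thresholded version of `DEP_DELAY_NEW`. The sketch below shows one way to derive such a flag. It assumes a delay of 15 minutes or more counts as delayed (if "over 15 min" is meant strictly, use `>` instead of `>=`); the `add_delay_flag` helper and the sample values are hypothetical.

```python
import pandas as pd

# Assumption: a flight counts as delayed at 15 minutes or more.
DELAY_THRESHOLD_MIN = 15


def add_delay_flag(frame, threshold=DELAY_THRESHOLD_MIN):
    """Hypothetical helper: add a 0/1 DEP_DEL15 column derived from DEP_DELAY_NEW."""
    out = frame.copy()
    # Missing delays (for example cancelled flights) compare as False and become 0 here;
    # a real pipeline may prefer to drop or flag those rows separately.
    out["DEP_DEL15"] = (out["DEP_DELAY_NEW"] >= threshold).astype("int8")
    return out


# Made-up departure delays in minutes.
sample = pd.DataFrame({"DEP_DELAY_NEW": [0.0, 14.0, 15.0, 40.0]})
flagged = add_delay_flag(sample)
print(flagged)
```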

P10_EMPLOYEES
	YEAR: 			Year
	AIRLINE_ID: 		Airline ID, matches to AIRLINE_ID in other files
	OP_UNIQUE_CARRIER: 	Carrier code, matches to OP_UNIQUE_CARRIER in other files
	UNIQUE_CARRIER_NAME: 	Carrier name, matches to UNIQUE_CARRIER_NAME in other files
	CARRIER: 		Carrier abbreviation
	CARRIER_NAME: 		Carrier name, matches to UNIQUE_CARRIER_NAME or CARRIER_NAME in other files
	ENTITY:			
	GENERAL_MANAGE:		General managers
	PILOTS_COPILOTS:	Pilots/Copilots
	OTHER_FLT_PERS:		Other flight personnel
	PASS_GEN_SVC_ADMIN:	Passenger/General Services & Administration
	MAINTENANCE:		Maintenance Employees
	ARCFT_TRAF_HANDLING_GRP1: Aircraft Traffic Handling Group1 Employees
	GEN_ARCFT_TRAF_HANDLING: General Aircraft Traffic Handling Employees
	AIRCRAFT_CONTROL:	Aircraft Control Employees
	PASSENGER_HANDLING:	Passenger Handling Employees
	CARGO_HANDLING:		Cargo Handling Employees
	TRAINEES_INTRUCTOR:	Trainees and Instructor
	STATISTICAL:		Statistical Employees
	TRAFFIC_SOLICITERS:	Traffic Solicitors
	OTHER:			Other Employees
	TRANSPORT_RELATED:	Transport Related Employees
	TOTAL:			Total employees

T3_AIR_CARRIER_SUMMARY_AIRPORT_ACTIVITY_XXXX
	OP_UNIQUE_CARRIER: 	Carrier code, matches to OP_UNIQUE_CARRIER in other files
	CARRIER_NAME: 		Carrier name, matches to UNIQUE_CARRIER_NAME or CARRIER_NAME in other files
	ORIGIN_AIRPORT_ID: 	Airport ID, matches to ORIGIN_AIRPORT_ID in other files
	SERVICE_CLASS: 		Service class of flight (required in download)
	REV_ACRFT_DEP_PERF_510: Departures performed for year
	REV_PAX_ENP_110: 	Passengers enplaned for year

airports_list
	ORIGIN_AIRPORT_ID: 	Airport ID, matches to ORIGIN_AIRPORT_ID in other files
	DISPLAY_AIRPORT_NAME: 	Display Airport, matches to DISPLAY_AIRPORT_NAME in other files
	ORIGIN_CITY_NAME: 	City
	NAME: 			Matches to NAME in airport_weather


airport_weather_xxxx
	See GHCND_documentation.pdf for full list
	Important features:
	NAME: 			Location of reading
	PRCP: 			Inches of precipitation for day
	SNOW: 			Inches of snowfall for day
	SNWD: 			Inches of snow on ground for day
	TMAX: 			Max temperature for day
	AWND: 			Average daily wind speed
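
The key descriptions above suggest how the files fit together: `airports_list` links `ORIGIN_AIRPORT_ID` to the weather station `NAME`, and the weather file carries the daily readings. The following sketch joins them on made-up data. It is not the pipeline used to build the dataset; the `attach_weather` helper, the station names, the airport IDs, the dates, and the presence of a `DATE` column in the weather file are all assumptions.

```python
import pandas as pd

# Made-up rows; only the column names follow the data dictionary.
airports_list = pd.DataFrame({
    "ORIGIN_AIRPORT_ID": [1, 2],
    "NAME": ["STATION A", "STATION B"],
})
airport_weather = pd.DataFrame({
    "NAME": ["STATION A", "STATION B", "STATION B"],
    "DATE": ["2019-01-06", "2019-01-06", "2019-01-07"],  # assumed column
    "PRCP": [0.0, 0.06, 0.5],
    "TMAX": [65.0, 84.0, 80.0],
})
flights = pd.DataFrame({
    "MONTH": [1, 1, 1],
    "DAY_OF_MONTH": [6, 6, 7],
    "ORIGIN_AIRPORT_ID": [1, 2, 2],
    "DEP_DEL15": [0, 0, 1],
})


def attach_weather(flights, airports_list, airport_weather):
    """Hypothetical helper: add the departure airport's daily weather to each flight."""
    weather = airport_weather.copy()
    dates = pd.to_datetime(weather["DATE"])
    weather["MONTH"] = dates.dt.month
    weather["DAY_OF_MONTH"] = dates.dt.day
    # Map each station NAME to its airport ID, then drop the helper columns.
    weather = weather.merge(
        airports_list[["ORIGIN_AIRPORT_ID", "NAME"]], on="NAME", how="inner"
    ).drop(columns=["NAME", "DATE"])
    # Left join keeps every flight; validate guards against duplicate weather rows.
    return flights.merge(
        weather,
        on=["ORIGIN_AIRPORT_ID", "MONTH", "DAY_OF_MONTH"],
        how="left",
        validate="many_to_one",
    )


flights_weather = attach_weather(flights, airports_list, airport_weather)
print(flights_weather)
```

A left join is used so that flights without a matching weather reading are kept with missing values rather than silently dropped.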



AIRPORT_COORDINATES

This dataset contains airport coordinates: Airport ID, Display Airport Name, Latitude, and Longitude.

B43_AIRCRAFT_INVENTORY

This dataset describes the aircraft: the year of manufacture, a unique tail number, and the number of seats on each aircraft.

CARRIER_DECODE

This dataset contains airline identifiers: Airline ID, Carrier Code, and Carrier Name.

ONTIME_REPORTING_XX

This dataset holds the flight records, including the month, day, Carrier Code, Flight Number, Airport IDs, Scheduled and Actual Departure and Arrival Times, Departure Delays, Arrival Delays, and more.

P10_EMPLOYEES

This dataset breaks down airline employees by role, including General Managers, Pilots/Copilots, Maintenance Personnel, and more.

T3_AIR_CARRIER_SUMMARY_AIRPORT_ACTIVITY_XXXX

This dataset describes airport activity per airline: Carrier Code, Carrier Name, Airport ID, Service Class of Flights, Departures Performed, and Passengers Enplaned.

airports_list

This dataset contains airport information: Airport ID, Display Airport Name, City Name, and a reference to the corresponding airport weather data.

airport_weather_xxxx

This dataset provides detailed weather observations recorded at various locations. It includes significant features like Location (NAME), Precipitation (PRCP), Snowfall (SNOW), Snow Depth (SNWD), Maximum Temperature (TMAX), and M