# <center>Flight Delay Prediction</center>

<div style="text-align: justify">
In recent decades, digital data from social media, social networks, audio and video content, and commercial and financial systems has grown faster than traditional analysis tools can handle. Conventional approaches do not scale to large, complex datasets. Combining machine learning techniques with scalable parallel computing systems is a promising answer: parallel machine learning algorithms running on scalable computing and storage infrastructures make it possible to analyze massive, intricate datasets and obtain valuable insights within reasonable timeframes (Talia and Trunfio, 2012). This proposal explores the use of parallel computing techniques to tackle a major economic challenge in big data analytics: flight delay prediction.

Each year, around 20% of airline flights are delayed or cancelled, mainly because of adverse weather, carrier equipment issues, and technical problems at airports. These delays are costly for both airlines and passengers. In 2007, for example, flight delays were estimated to have cost the US economy $32.9 billion, with more than half of that burden borne by passengers (Ball et al., 2010). Accurate and timely weather forecasts are therefore essential for making informed decisions and minimizing risk, because weather-related delays have significant economic and operational consequences for airlines, passengers, and the entire air transport system. Big data techniques have been proposed to load, store, manage, and analyze large volumes of weather data, and to apply data mining algorithms that predict flight delays from weather observations. Applying big data analytics to weather data addresses the limitations of traditional data management techniques and technologies.

The core of this research proposal is twofold: to evaluate the influence of weather observations on flight delays, and to build a model that accurately predicts departure delays from those observations. Both goals rely on a big data analytics approach that analyzes large amounts of weather and flight data to detect correlations and insights, enabling better decision-making and potentially reducing the impact of weather-related delays.

</div>


## Loading the necessary libraries and packages

In [2]:
# pandas is used for data manipulation and analysis; its DataFrame structure holds tabular data.
import pandas as pd

# numpy (Numerical Python) is the fundamental package for scientific computing in Python.
import numpy as np

# seaborn is a data visualization library built on Matplotlib, designed for informative and attractive statistical graphics.
import seaborn as sns

# geopandas extends pandas to work with geospatial data.
import geopandas as gpd

# plotly is used for creating interactive graphs.
import plotly.express as px
import plotly.graph_objs as go

# matplotlib is a common library for creating static, animated, and interactive visualizations in Python.
import matplotlib                    # base package; pyplot is imported separately below

# pprint pretty-prints lists, tuples, and dictionaries recursively in a human-readable format.
import pprint

# pyplot provides a high-level interface for creating various types of plots and charts.
import matplotlib.pyplot as plt


In [3]:
# Configure the Jupyter environment:
# show graphs inline, and toggle pretty-printing of lists on/off
%matplotlib inline 
%pprint       

Pretty printing has been turned OFF


In [4]:
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Retina quality: render figures at higher resolution so they look sharper on retina displays.
# If your monitor's resolution is sub-retina, then the improvement will be less noticeable [2].
%config InlineBackend.figure_format = 'retina'
sns.set_context('talk')

## Exploratory Data Analysis (EDA)

#### Reading Data from Files

Dataset source: https://www.kaggle.com/code/dansteveadekanbi/predict-the-delay-of-a-flight-using-minutes/input?select=full_data_flightdelay.csv


In [7]:
# Read the dataset from the CSV file
df = pd.read_csv('weather_flightdelay.csv') 
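
With several million rows, the default dtypes chosen by `pd.read_csv` can use more memory than needed. The sketch below shows one possible way to pass narrower dtypes at load time. The `COMPACT_DTYPES` mapping and the `load_flights` helper are hypothetical and not part of the original notebook, and the tiny in-memory CSV only stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical dtype choices for a few columns (an assumption, not from the notebook).
COMPACT_DTYPES = {
    "MONTH": "int8",
    "DAY_OF_WEEK": "int8",
    "DEP_DEL15": "int8",
    "CARRIER_NAME": "category",
    "PRCP": "float32",
}


def load_flights(source):
    """Read the flight CSV, using narrower dtypes for the columns listed above."""
    return pd.read_csv(source, dtype=COMPACT_DTYPES)


# Tiny in-memory stand-in for 'weather_flightdelay.csv'.
sample_csv = io.StringIO(
    "MONTH,DAY_OF_WEEK,DEP_DEL15,CARRIER_NAME,PRCP\n"
    "1,7,0,Southwest Airlines Co.,0.0\n"
    "12,7,1,Hawaiian Airlines Inc.,0.06\n"
)
sample = load_flights(sample_csv)

print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```

On the real file, the same helper could be called as `load_flights('weather_flightdelay.csv')`, provided the chosen dtypes fit the actual value ranges.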

#### Observing and describing data

In [9]:
# Display the first five observations of the DataFrame
df.head() 

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
0,1,7,0,0800-0859,2,1,25,143,Southwest Airlines Co.,13056,...,8,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
1,1,7,0,0700-0759,7,1,29,191,Delta Air Lines Inc.,13056,...,3,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
2,1,7,0,0600-0659,7,1,27,199,Delta Air Lines Inc.,13056,...,18,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
3,1,7,0,0600-0659,9,1,27,180,Delta Air Lines Inc.,13056,...,2,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
4,1,7,0,0001-0559,7,1,10,182,Spirit Air Lines,13056,...,1,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91


In [10]:
# Display the last five observations of the DataFrame
df.tail()

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
6489057,12,7,0,2300-2359,1,11,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489058,12,7,0,1800-1859,1,11,2,123,Hawaiian Airlines Inc.,1318,...,16,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489059,12,7,0,2000-2059,1,11,2,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489060,12,7,0,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489061,12,7,1,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,15,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21


In [11]:
# The .shape attribute (not a method, so no parentheses) returns a tuple with the dimensions
# of the DataFrame, i.e. the number of rows and columns [5].
df.shape

(6489062, 26)

The dataset contains 6,489,062 rows and 26 columns.
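
Because `shape` is a plain tuple, it can be unpacked directly and used to build a summary sentence. The following is a minimal sketch on a toy DataFrame; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the flight DataFrame (illustrative values only).
toy = pd.DataFrame({"MONTH": [1, 1, 12], "DEP_DEL15": [0, 0, 1]})

# shape is an attribute holding (rows, columns), so it unpacks like any tuple.
n_rows, n_cols = toy.shape

# The ',' format specifier adds thousands separators for large row counts.
summary = f"The dataset contains {n_rows:,} rows and {n_cols} columns."
print(summary)
```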

In [13]:
# Display the header 
df.columns

Index(['MONTH', 'DAY_OF_WEEK', 'DEP_DEL15', 'DEP_TIME_BLK', 'DISTANCE_GROUP',
       'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS', 'NUMBER_OF_SEATS',
       'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH', 'AIRLINE_FLIGHTS_MONTH',
       'AIRLINE_AIRPORT_FLIGHTS_MONTH', 'AVG_MONTHLY_PASS_AIRPORT',
       'AVG_MONTHLY_PASS_AIRLINE', 'FLT_ATTENDANTS_PER_PASS',
       'GROUND_SERV_PER_PASS', 'PLANE_AGE', 'DEPARTING_AIRPORT', 'LATITUDE',
       'LONGITUDE', 'PREVIOUS_AIRPORT', 'PRCP', 'SNOW', 'SNWD', 'TMAX',
       'AWND'],
      dtype='object')

- MONTH --> Month of the year (1-12)
- DAY_OF_WEEK --> Day of the week (1-7)
- DEP_DEL15 --> TARGET VARIABLE: binary flag, 1 if the departure was delayed by more than 15 minutes, 0 otherwise
- DEP_TIME_BLK --> Departure time block
- DISTANCE_GROUP --> Flight distance group
- SEGMENT_NUMBER --> The segment that this tail number is on for the day
- CONCURRENT_FLIGHTS --> Concurrent flights leaving from the airport in the same departure block
- NUMBER_OF_SEATS --> Number of seats on the aircraft
- CARRIER_NAME --> Air carrier
- AIRPORT_FLIGHTS_MONTH --> Average monthly airport flights
- AIRLINE_FLIGHTS_MONTH --> Average monthly airline flights 
- AIRLINE_AIRPORT_FLIGHTS_MONTH --> Average monthly flights for the airline at the departing airport
- AVG_MONTHLY_PASS_AIRPORT --> Average monthly departing airport passenger count.
- AVG_MONTHLY_PASS_AIRLINE --> Average monthly passenger count for the airline.
- FLT_ATTENDANTS_PER_PASS --> Flight attendants per passenger for airline
- GROUND_SERV_PER_PASS --> Ratio of ground service employees (service desk) per passenger for the airline.
- PLANE_AGE --> Age of departing aircraft
- DEPARTING_AIRPORT --> Airport of departure.
- LATITUDE --> Latitude of the departure airport.
- LONGITUDE --> Longitude of the departure airport.
- PREVIOUS_AIRPORT --> The airport from which the aircraft previously departed.
- PRCP -->  Precipitation
- SNOW --> Snowfall 
- SNWD --> Snow Depth
- TMAX --> Max temperature for day
- AWND --> Max wind speed for day 
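
The column descriptions above can be checked against the data with a few quick tests. The sketch below verifies the expected value ranges and compares the delay rate on days with and without precipitation. It runs on a small made-up DataFrame; `check_ranges` and the `HAS_PRCP` column are hypothetical helpers introduced only for this illustration, and the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy rows using a subset of the columns described above (values are made up).
toy = pd.DataFrame(
    {
        "MONTH": [1, 1, 6, 12, 12, 12],
        "DAY_OF_WEEK": [7, 7, 3, 7, 1, 5],
        "DEP_DEL15": [0, 0, 1, 0, 1, 1],
        "PRCP": [0.0, 0.0, 0.4, 0.06, 0.0, 1.2],
    }
)


def check_ranges(frame):
    """Return True/False flags for the value ranges stated in the column list."""
    return {
        "month_ok": bool(frame["MONTH"].between(1, 12).all()),
        "day_of_week_ok": bool(frame["DAY_OF_WEEK"].between(1, 7).all()),
        "target_binary": bool(frame["DEP_DEL15"].isin([0, 1]).all()),
    }


checks = check_ranges(toy)

# Because DEP_DEL15 is 0/1, its mean is the share of delayed flights.
delay_share = toy["DEP_DEL15"].mean()

# Delay rate split by whether any precipitation was recorded that day.
with_flag = toy.assign(HAS_PRCP=toy["PRCP"] > 0)
rate_by_prcp = with_flag.groupby("HAS_PRCP")["DEP_DEL15"].mean()

print(checks)
print(f"Share of delayed flights: {delay_share:.2%}")
print(rate_by_prcp)
```

Applied to the full `df`, the same checks would flag rows that fall outside the documented ranges before any modelling starts.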