# <center>Flight Delay Prediction</center>

<div style="text-align: justify">
Over the last few decades, the volume of digital data generated by sources such as social 
media, social networks, audio and video content, and commercial and financial systems has 
grown enormously, creating a need for effective ways to understand and extract information 
from it. Traditional data analysis approaches struggle with large, complex datasets. To 
address the challenges of big data analysis, machine learning techniques have been combined 
with scalable parallel computing systems as a promising solution. By leveraging parallel 
machine learning algorithms together with scalable computing and storage infrastructures, it 
becomes possible to analyze massive and intricate datasets and obtain valuable insights within 
reasonable timeframes (Talia and Trunfio, 2012). The aim of this proposal is to explore the 
use of parallel computing techniques to tackle a major economic challenge in big data 
analytics: flight delay prediction. 
Each year, around 20% of airline flights experience delays or cancellations, primarily 
attributed to factors such as adverse weather conditions, carrier equipment issues, and 
technical problems at airports. These delays incur substantial costs for both airlines and 
passengers. For example, in 2007, flight delays were estimated to have cost the US economy 
$32.9 billion, with more than half of the financial burden borne by passengers (Ball et al., 
2010). Accurate and timely weather forecasts are essential to making informed decisions and 
minimizing potential risks, because flight delays caused by adverse weather can have 
significant economic and operational consequences for airlines, passengers, and the entire air 
transport system. Big data techniques have been proposed to load, store, manage, and analyze 
the large volume of weather data, together with data mining algorithms that predict flight 
delays from weather observations. Applying big data analytics to the weather dataset can 
overcome the limitations of traditional data management techniques and technologies. 
The core of this research proposal is to evaluate the influence of weather observation factors 
on flight delays and to build a model that accurately predicts departure delays from weather 
observations. A big data analytics approach is used to analyze large amounts of weather and 
flight data and to detect correlations and insights, which enables better decision-making and 
can potentially reduce the impact of weather-related delays.

</div>


## Loading the necessary libraries and packages

In [5]:
# pandas is used for data manipulation and analysis; it provides data structures such as DataFrames for working with tabular data.
import pandas as pd  

# numpy is numerical Python, the fundamental package for scientific computing in Python.  
import numpy as np    

# seaborn is a data visualization library based on Matplotlib, designed to create informative and attractive statistical graphics.
import seaborn as sns

# Extends the capabilities of pandas to allow for working with geospatial data.
import geopandas as gpd

# Create interactive graphs.
import plotly.express as px
import plotly.graph_objs as go

# A common library for creating static, animated, and interactive visualizations in Python.
import matplotlib                    # top-level package; the pyplot module is imported below

# Pretty-print lists, tuples, & dictionaries recursively in a human-readable format.
import pprint                        

# Providing a high level interface for creating various types of plots and charts.
import matplotlib.pyplot as plt


In [6]:
# Taking care of jupyter environment 
# show graphs in-line, and turn on/off pretty_printing of lists
%matplotlib inline 
%pprint       

Pretty printing has been turned OFF


In [7]:
# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Retina quality: render figures at a higher resolution so plots look sharper on retina displays.
# If your monitor's resolution is sub-retina, then the improvement will be less noticeable [2].
%config InlineBackend.figure_format = 'retina'
sns.set_context('talk')

## Exploratory Data Analysis (EDA)

#### Reading Data from Files

Dataset source (Kaggle): https://www.kaggle.com/code/dansteveadekanbi/predict-the-delay-of-a-flight-using-minutes/input?select=full_data_flightdelay.csv


In [9]:
# Read the dataset from the CSV file
df = pd.read_csv('weather_flightdelay.csv') 
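
With roughly 6.5 million rows, loading the whole file at once can be memory-hungry. As a minimal sketch (not the approach used in this notebook), the file could instead be streamed in chunks with the `chunksize` parameter of `pd.read_csv`, aggregating as it goes. The in-memory CSV below is a hypothetical stand-in for the real file, and the chunk size of 2 is chosen only to keep the example small.

```python
import io
import pandas as pd

# Hypothetical five-row stand-in for the real CSV file.
sample_csv = io.StringIO(
    "MONTH,DAY_OF_WEEK,DEP_DEL15,PRCP\n"
    "1,7,0,0.0\n"
    "1,7,0,0.0\n"
    "6,3,0,0.12\n"
    "12,7,0,0.06\n"
    "12,7,1,0.06\n"
)

total_rows = 0
delayed = 0

# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
for chunk in pd.read_csv(sample_csv, chunksize=2):
    total_rows += len(chunk)
    delayed += int(chunk["DEP_DEL15"].sum())

print(total_rows, delayed)
```

On the real data, a file path would replace `sample_csv` and a much larger chunk size would be appropriate.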

#### Observing and describing data

In [10]:
# Display the first five observations in the dataframe
df.head() 

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
0,1,7,0,0800-0859,2,1,25,143,Southwest Airlines Co.,13056,...,8,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
1,1,7,0,0700-0759,7,1,29,191,Delta Air Lines Inc.,13056,...,3,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
2,1,7,0,0600-0659,7,1,27,199,Delta Air Lines Inc.,13056,...,18,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
3,1,7,0,0600-0659,9,1,27,180,Delta Air Lines Inc.,13056,...,2,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91
4,1,7,0,0001-0559,7,1,10,182,Spirit Air Lines,13056,...,1,McCarran International,36.08,-115.152,NONE,0.0,0.0,0.0,65.0,2.91


In [11]:
# Display the last five observations in the dataframe
df.tail()

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
6489057,12,7,0,2300-2359,1,11,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489058,12,7,0,1800-1859,1,11,2,123,Hawaiian Airlines Inc.,1318,...,16,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489059,12,7,0,2000-2059,1,11,2,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489060,12,7,0,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,18,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21
6489061,12,7,1,2100-2159,1,12,3,123,Hawaiian Airlines Inc.,1318,...,15,Lihue Airport,21.979,-159.346,Honolulu International,0.06,0.0,0.0,84.0,15.21


In [12]:
# The .shape attribute returns a tuple representing the dimensionality of the DataFrame, 
# i.e. the number of rows and columns in our dataframe [5].
df.shape

(6489062, 26)

The dataset contains 6489062 rows and 26 columns.

In [13]:
# Display the column names
df.columns

Index(['MONTH', 'DAY_OF_WEEK', 'DEP_DEL15', 'DEP_TIME_BLK', 'DISTANCE_GROUP',
       'SEGMENT_NUMBER', 'CONCURRENT_FLIGHTS', 'NUMBER_OF_SEATS',
       'CARRIER_NAME', 'AIRPORT_FLIGHTS_MONTH', 'AIRLINE_FLIGHTS_MONTH',
       'AIRLINE_AIRPORT_FLIGHTS_MONTH', 'AVG_MONTHLY_PASS_AIRPORT',
       'AVG_MONTHLY_PASS_AIRLINE', 'FLT_ATTENDANTS_PER_PASS',
       'GROUND_SERV_PER_PASS', 'PLANE_AGE', 'DEPARTING_AIRPORT', 'LATITUDE',
       'LONGITUDE', 'PREVIOUS_AIRPORT', 'PRCP', 'SNOW', 'SNWD', 'TMAX',
       'AWND'],
      dtype='object')

Let's define the dataframe columns:


- MONTH --> Month of the year (1-12)
- DAY_OF_WEEK --> Day of the week (1-7)
- DEP_DEL15 --> TARGET VARIABLE: binary flag, 1 if the departure was delayed by more than 15 minutes, 0 otherwise
- DEP_TIME_BLK --> Departure time block
- DISTANCE_GROUP --> Flight distance group
- SEGMENT_NUMBER --> The segment that this tail number is on for the day
- CONCURRENT_FLIGHTS --> Concurrent flights leaving from the airport in the same departure block
- NUMBER_OF_SEATS --> Number of seats on the aircraft
- CARRIER_NAME --> Air carrier
- AIRPORT_FLIGHTS_MONTH --> Average monthly airport flights
- AIRLINE_FLIGHTS_MONTH --> Average monthly airline flights 
- AIRLINE_AIRPORT_FLIGHTS_MONTH --> Average monthly flight count for the airline at the departing airport.
- AVG_MONTHLY_PASS_AIRPORT --> Average monthly departing airport passenger count.
- AVG_MONTHLY_PASS_AIRLINE --> Average monthly passenger count for the airline.
- FLT_ATTENDANTS_PER_PASS --> Flight attendants per passenger for airline
- GROUND_SERV_PER_PASS --> Ratio of ground service employees (service desk) per passenger for the airline.
- PLANE_AGE --> Age of departing aircraft
- DEPARTING_AIRPORT --> Airport of departure.
- LATITUDE --> Latitude of the departure airport.
- LONGITUDE --> Longitude of the departure airport.
- PREVIOUS_AIRPORT --> The airport from which the aircraft previously departed.
- PRCP -->  Precipitation
- SNOW --> Snowfall 
- SNWD --> Snow Depth
- TMAX --> Max temperature for day
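- AWND --> Max wind speed for day (as described with the dataset; in NOAA daily summaries, AWND denotes the average daily wind speed)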
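
`DEP_TIME_BLK` is stored as text such as `'0800-0859'`, so a numeric feature has to be derived before it can be used as an ordered quantity. A minimal sketch, assuming a hypothetical helper named `block_start_hour` that is not part of the original notebook:

```python
def block_start_hour(block: str) -> int:
    """Return the starting hour of a departure block such as '0800-0859'."""
    start, _end = block.split("-")
    return int(start[:2])


# Block labels taken from the dataset's DEP_TIME_BLK column.
blocks = ["0800-0859", "0001-0559", "2300-2359"]
start_hours = [block_start_hour(b) for b in blocks]
print(start_hours)  # [8, 0, 23]
```

On the dataframe, the same helper could be applied with `df['DEP_TIME_BLK'].map(block_start_hour)`. Note that the `'0001-0559'` block covers almost six hours, so its start hour is only a coarse summary.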

In [14]:
def displayColumnsValues(data):
    '''
       Print the number of unique values, and the unique values themselves,
       for each feature (column) of the given dataframe.
    '''
    for i in data.columns:
        print(80*'=','\n',i,'has',
          data[i].nunique(),'value/s:\n',      # .nunique(): count the distinct elements in the column.
          data[i].unique())                    # .unique(): return an array with the column's unique values.

In [15]:
# calling the displayColumnsValues() function
displayColumnsValues(df) 

 MONTH has 12 value/s:
 [ 1  2  3  4  5  6  7  8  9 10 11 12]
 DAY_OF_WEEK has 7 value/s:
 [7 5 3 4 2 1 6]
 DEP_DEL15 has 2 value/s:
 [0 1]
 DEP_TIME_BLK has 19 value/s:
 ['0800-0859' '0700-0759' '0600-0659' '0001-0559' '2300-2359' '1200-1259'
 '0900-0959' '1000-1059' '2200-2259' '1500-1559' '1100-1159' '2000-2059'
 '1400-1459' '1300-1359' '1800-1859' '1900-1959' '1600-1659' '1700-1759'
 '2100-2159']
 DISTANCE_GROUP has 11 value/s:
 [ 2  7  9  3  6  8  1  4 11  5 10]
 SEGMENT_NUMBER has 15 value/s:
 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
 CONCURRENT_FLIGHTS has 107 value/s:
 [ 25  29  27  10  17  26  28   9  18  24  32  21  12  30   5  22  23  15
   2   3   6   7   4   1  16  11  13  14   8  52  55  49  48  40  37  36
  46  57  75  44  47  39  51  19  20  35  33  59  31  42  79  63  41  65
  56  88  80  84  93  69  38  34  66  74  43  58  60  54  61  50  67  72
  83  82  89  78  53  91  64  70  62  45  77  87  68  73  92  71  76  81
  86  85  90  94  98  97  95 108 106  96 107 

 AIRLINE_FLIGHTS_MONTH has 204 value/s:
 [107363  73508  15023   9496  20315   6791  75506  46218  23463   6713
  62105  17869  23760  24623  22418  19857  12231  70199  21895  17673
  22053  67273   6020  18154  22834  56790  20452   7180  94922   8643
  13447  43512  15953  10920  84142  21427  78308  18407  25502  53007
  68810 114119   6850   9663  17034  26361  22191  26929  24032  10218
  12510  74131  51763  81803  24966  21136 110752  16316   6893  67082
  25138  26476  17692   9637  20645   9219  23434  10452  22451  53980
  85579  78894  20860  17814  24886  27470  18618 113709  11254  70878
  27761   7217   8739  24260   9008  87183  71188  76419  21319  24204
  23065   7173  27159  26990  18858  11037  53737 112879   9184  17553
  11337  24179  18428  11745  79247  90457 117728  55374  12247  24454
  25142   7329  72721  19483  10613  21554  26909  24403  28267  74087
  91062 114987  24460  25270  80820  28070  12252  55706  10677  19375
  24496  28893   8693  22671   7348 

 PREVIOUS_AIRPORT has 356 value/s:
 ['NONE' 'Phoenix Sky Harbor International' 'San Francisco International'
 'Salt Lake City International' 'Orange County' 'Portland International'
 'Spokane International' 'Metropolitan Oakland International'
 'Seattle International' 'Port Columbus International'
 'Cleveland-Hopkins International' 'Austin - Bergstrom International'
 'Los Angeles International' 'General Mitchell Field'
 'Sacramento International' 'Atlanta Municipal' 'Reno/Tahoe International'
 'Stapleton International' 'Detroit Metro Wayne County'
 'San Diego International Lindbergh Fl' 'Friendship International'
 'Chicago Midway International' 'Ontario International' 'Eppley Airfield'
 'Tucson International' 'Boise Air Terminal'
 'Lambert-St. Louis International' 'Nashville International'
 'San Jose International' 'Dallas Love Field'
 'Indianapolis Muni/Weir Cook' 'Orlando International'
 'Hollywood-Burbank Midpoint' 'Kansas City International'
 'William P Hobby' 'San Antonio Internat

 PRCP has 305 value/s:
 [0.000e+00 1.000e-02 6.200e-01 2.200e-01 3.200e-01 4.700e-01 1.600e-01
 3.400e-01 7.000e-02 1.000e-01 1.700e-01 2.160e+00 6.100e-01 4.800e-01
 7.300e-01 1.100e-01 1.480e+00 1.000e+00 1.240e+00 6.700e-01 8.000e-01
 5.400e-01 1.200e-01 3.300e-01 1.540e+00 7.600e-01 8.600e-01 2.450e+00
 6.500e-01 5.700e-01 3.800e-01 4.500e-01 6.900e-01 4.000e-02 1.800e-01
 8.200e-01 3.000e-02 1.170e+00 4.100e-01 8.000e-02 9.500e-01 8.800e-01
 2.800e-01 2.400e-01 1.080e+00 5.600e-01 1.330e+00 3.700e-01 1.710e+00
 7.000e-01 9.000e-02 9.800e-01 8.100e-01 2.320e+00 1.030e+00 4.600e-01
 1.900e-01 2.000e-01 5.000e-02 1.070e+00 6.000e-02 6.600e-01 8.400e-01
 9.100e-01 8.700e-01 1.500e-01 2.100e-01 2.500e-01 4.400e-01 1.100e+00
 3.600e-01 1.150e+00 2.000e-02 5.300e-01 1.400e-01 1.040e+00 2.700e-01
 6.400e-01 1.770e+00 9.600e-01 4.600e+00 2.010e+00 3.900e-01 2.900e-01
 9.400e-01 9.300e-01 1.580e+00 4.900e-01 5.000e-01 1.310e+00 1.500e+00
 7.800e-01 1.290e+00 4.200e-01 3.000e-01 2.300e-01 4.
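
Printing every unique value becomes unwieldy for high-cardinality columns. As an alternative sketch, the dtype and distinct-value count of each column can be collected into one compact table. The `toy` frame below is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical miniature of the flight dataframe.
toy = pd.DataFrame({
    "MONTH": [1, 1, 12, 12],
    "DEP_DEL15": [0, 0, 0, 1],
    "CARRIER_NAME": [
        "Southwest Airlines Co.",
        "Delta Air Lines Inc.",
        "Hawaiian Airlines Inc.",
        "Hawaiian Airlines Inc.",
    ],
})

# One row per column: its dtype and its number of distinct values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
}).sort_values("n_unique", ascending=False)

print(summary)
```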

In [16]:
# .info(): print a concise summary of the dataframe (index range, column names, and dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6489062 entries, 0 to 6489061
Data columns (total 26 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   MONTH                          int64  
 1   DAY_OF_WEEK                    int64  
 2   DEP_DEL15                      int64  
 3   DEP_TIME_BLK                   object 
 4   DISTANCE_GROUP                 int64  
 5   SEGMENT_NUMBER                 int64  
 6   CONCURRENT_FLIGHTS             int64  
 7   NUMBER_OF_SEATS                int64  
 8   CARRIER_NAME                   object 
 9   AIRPORT_FLIGHTS_MONTH          int64  
 10  AIRLINE_FLIGHTS_MONTH          int64  
 11  AIRLINE_AIRPORT_FLIGHTS_MONTH  int64  
 12  AVG_MONTHLY_PASS_AIRPORT       int64  
 13  AVG_MONTHLY_PASS_AIRLINE       int64  
 14  FLT_ATTENDANTS_PER_PASS        float64
 15  GROUND_SERV_PER_PASS           float64
 16  PLANE_AGE                      int64  
 17  DEPARTING_AIRPORT              object 
 18  LA

The dataset contains a mix of data types (int64, object, and float64). 
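
The dtypes matter for memory at this scale: default 64-bit integers and text columns with few distinct values (such as carrier names) are larger than they need to be. The following is a minimal sketch of one possible optimization, not a step taken in this notebook, using a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical frame with a repetitive text column and a small-range integer column.
toy = pd.DataFrame({
    "CARRIER_NAME": ["Southwest Airlines Co.", "Delta Air Lines Inc."] * 500,
    "MONTH": [1, 12] * 500,
})

before = toy.memory_usage(deep=True).sum()

# 'category' stores each distinct string once plus small integer codes.
toy["CARRIER_NAME"] = toy["CARRIER_NAME"].astype("category")
# downcast='integer' picks the smallest integer type that holds the values.
toy["MONTH"] = pd.to_numeric(toy["MONTH"], downcast="integer")

after = toy.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```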



Generate descriptive statistics for the numeric columns. They summarize the central tendency, 
dispersion, and shape of the dataset's distribution, excluding NaN values.

In [17]:
# Summarize the central tendency, dispersion, and shape of the numeric columns
df.describe()

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,AIRPORT_FLIGHTS_MONTH,AIRLINE_FLIGHTS_MONTH,AIRLINE_AIRPORT_FLIGHTS_MONTH,...,FLT_ATTENDANTS_PER_PASS,GROUND_SERV_PER_PASS,PLANE_AGE,LATITUDE,LONGITUDE,PRCP,SNOW,SNWD,TMAX,AWND
count,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,...,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0,6489062.0
mean,6.607062,3.935598,0.1891441,3.821102,3.04689,27.83675,133.7397,12684.58,62960.58,3459.251,...,9.753707e-05,0.0001355612,11.53211,36.70581,-94.25515,0.1037063,0.0315931,0.09152397,71.46846,8.341329
std,3.396853,1.9952,0.3916231,2.382233,1.757864,21.5106,46.45213,8839.796,34382.23,4251.139,...,8.644459e-05,4.64997e-05,6.935706,5.500804,17.90952,0.3432134,0.3170163,0.7281285,18.35333,3.607604
min,1.0,1.0,0.0,1.0,1.0,1.0,44.0,1100.0,5582.0,1.0,...,0.0,7.134695e-06,0.0,18.44,-159.346,0.0,0.0,0.0,-10.0,0.0
25%,4.0,2.0,0.0,2.0,2.0,11.0,90.0,5345.0,25034.0,654.0,...,3.419267e-05,9.889412e-05,5.0,33.436,-106.377,0.0,0.0,0.0,59.0,5.82
50%,7.0,4.0,0.0,3.0,3.0,23.0,143.0,11562.0,70878.0,2251.0,...,6.178236e-05,0.0001246511,12.0,37.505,-87.906,0.0,0.0,0.0,74.0,7.83
75%,10.0,6.0,0.0,5.0,4.0,39.0,172.0,17615.0,86312.0,4806.0,...,0.0001441659,0.0001772872,17.0,40.779,-80.936,0.02,0.0,0.0,86.0,10.29
max,12.0,7.0,1.0,11.0,15.0,109.0,337.0,35256.0,117728.0,21837.0,...,0.0003484077,0.0002289855,32.0,61.169,-66.002,11.63,17.2,25.2,115.0,33.78
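
Because `DEP_DEL15` is coded as 0/1, its mean in the table above (about 0.189) is the share of delayed flights, so roughly 19% of the flights are delayed and the target classes are imbalanced. A small sketch of the same calculation on a hypothetical ten-flight sample:

```python
import pandas as pd

# Hypothetical sample of the binary target (1 = delayed by more than 15 minutes).
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="DEP_DEL15")

# For a 0/1 column, the mean equals the fraction of ones.
delay_rate = target.mean()

# value_counts(normalize=True) gives the share of each class.
shares = target.value_counts(normalize=True)

print(delay_rate)
print(shares)
```

On the real data this would be `df['DEP_DEL15'].value_counts(normalize=True)`.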


Describe the categorical columns

In [18]:
# Describe the categorical (object-dtype) columns
df.describe(include='O')

| | DEP_TIME_BLK | CARRIER_NAME | DEPARTING_AIRPORT | PREVIOUS_AIRPORT |
|---|---|---|---|---|
| count | 6489062 | 6489062 | 6489062 | 6489062 |
| unique | 19 | 17 | 96 | 356 |
| top | 0800-0859 | Southwest Airlines Co. | Atlanta Municipal | NONE |
| freq | 452391 | 1296329 | 392603 | 1449009 |


For object-dtype columns, the result includes 'count', 'unique', 'top', and 'freq'. 
'top' is the most common value, and 'freq' is the number of times that value occurs. 
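
To make the relationship concrete, 'top' and 'freq' correspond to the first entry of `value_counts()`, which sorts values by descending frequency. A minimal sketch on a hypothetical series of carrier codes:

```python
import pandas as pd

# Hypothetical carrier codes; "A" is the most frequent value.
carriers = pd.Series(["A", "B", "A", "C", "A"])

counts = carriers.value_counts()      # sorted by descending frequency
top = counts.index[0]                 # most common value
freq = int(counts.iloc[0])            # how often it occurs

desc = carriers.describe()            # count, unique, top, freq
print(top, freq)
print(desc)
```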

#### Cleaning Data

In [22]:
# Null values: the total number of missing values in each column
df.isnull().sum()

MONTH                            0
DAY_OF_WEEK                      0
DEP_DEL15                        0
DEP_TIME_BLK                     0
DISTANCE_GROUP                   0
SEGMENT_NUMBER                   0
CONCURRENT_FLIGHTS               0
NUMBER_OF_SEATS                  0
CARRIER_NAME                     0
AIRPORT_FLIGHTS_MONTH            0
AIRLINE_FLIGHTS_MONTH            0
AIRLINE_AIRPORT_FLIGHTS_MONTH    0
AVG_MONTHLY_PASS_AIRPORT         0
AVG_MONTHLY_PASS_AIRLINE         0
FLT_ATTENDANTS_PER_PASS          0
GROUND_SERV_PER_PASS             0
PLANE_AGE                        0
DEPARTING_AIRPORT                0
LATITUDE                         0
LONGITUDE                        0
PREVIOUS_AIRPORT                 0
PRCP                             0
SNOW                             0
SNWD                             0
TMAX                             0
AWND                             0
dtype: int64

No missing data was observed. 
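
`isnull()` only detects true missing values (NaN/None). Placeholder strings are not counted: for example, `PREVIOUS_AIRPORT` contains the literal value `'NONE'` in 1,449,009 rows according to the categorical summary. The sketch below shows how such placeholders and duplicated rows could be counted as additional checks. The `toy` frame is hypothetical, and reading `'NONE'` as "no recorded previous airport" is an assumption.

```python
import pandas as pd

# Hypothetical miniature of the dataframe.
toy = pd.DataFrame({
    "PREVIOUS_AIRPORT": ["NONE", "Honolulu International", "NONE"],
    "PRCP": [0.0, 0.06, 0.0],
})

# True missing values per column (the string 'NONE' is not counted here).
null_counts = toy.isnull().sum()

# Rows that use the 'NONE' placeholder instead of an airport name.
none_count = int((toy["PREVIOUS_AIRPORT"] == "NONE").sum())

# Rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

print(null_counts)
print(none_count, duplicate_rows)
```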