# Flight Delay Prediction
Predict flight delays for the Tunisian airline Tunisair.

## Description

Flight delays not only irritate air passengers and disrupt their schedules but also cause:

* a decrease in efficiency
* an increase in capital costs and the reallocation of flight crews and aircraft
* additional crew expenses

As a result, on an aggregate basis, an airline's record of flight delays may have a negative impact on passenger demand.

This solution proposes to build a flight delay predictive model using Machine Learning techniques. The accurate prediction of flight delays will help all players in the air travel ecosystem to set up effective action plans to reduce the impact of the delays and avoid loss of time, capital and resources.



## Business Understanding

#### About the stakeholder: Tunisair

Tunisair is the flag carrier airline of Tunisia. Formed in 1948, it operates scheduled international services to four continents. Its main base is Tunis–Carthage International Airport. The airline's head office is in Tunis, near Tunis Airport. Tunisair is a member of the Arab Air Carriers Organization.

<img src="images/Tunisair_plane.jpeg" height=350 />

The dataset provides the following columns:

| **Column Names**         | **Description**                                          |
| ------------------------ | -------------------------------------------------------- |
| id                       | Id of the record                                         |
| flight_date              | Date of the flight                                       |
| flight_number            | Flight number (identifies the route, like a bus line id) |
| departure_point          | Departure point                                          |
| arrival_point            | Arrival point                                            |
| scheduled_time_departure | Scheduled time of departure                              |
| scheduled_time_arrival   | Scheduled time of arrival                                |
| flight_status            | Flight status                                            |
| aircraft_code            | Aircraft code                                            |
| target                   | Flight delay (in minutes)                                |

## Setting up Problem

Let's set some background first.

A customer is browsing a flight booking website and wants to book a flight for a specific date, time, origin and destination.

__IDEA:__ During booking, we could show the customer whether the flight he/she is considering is likely to arrive on time. Additionally, if the flight is expected to be delayed, we could show the expected delay.

If the customer knows that the flight is likely to be late, he/she might choose to book another flight.

From a modelling perspective, we need to set two goals:

__GOAL I:__ Predict whether a flight will be delayed or not.
<br>

__GOAL II:__ If a flight is delayed, predict the length of the delay.
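
The two goals translate into two modelling targets. The sketch below is only an illustration of that split: the helper `make_targets`, the column name `is_delayed` and the constant `DELAY_THRESHOLD_MIN` are hypothetical names, and the 10-minute cut-off is an assumption that mirrors the `target >= 10.0` filter used later in this notebook.

```python
import pandas as pd

# Assumption: a flight counts as "delayed" from 10 minutes of delay onwards.
DELAY_THRESHOLD_MIN = 10.0


def make_targets(df: pd.DataFrame, threshold: float = DELAY_THRESHOLD_MIN) -> pd.DataFrame:
    """Add a binary label for GOAL I; GOAL II reuses the raw `target` minutes."""
    out = df.copy()
    out["is_delayed"] = (out["target"] >= threshold).astype(int)
    return out


# Toy rows: delay values of the first five training rows shown in this notebook
sample = pd.DataFrame({"target": [260.0, 20.0, 0.0, 0.0, 22.0]})
labelled = make_targets(sample)

# GOAL I: classification target (1 = delayed, 0 = on time)
y_class = labelled["is_delayed"]
# GOAL II: regression target, restricted to the flights labelled as delayed
y_reg = labelled.loc[labelled["is_delayed"] == 1, "target"]
```

Training the regression model only on delayed flights is one possible design; another option would be a single regression model on all flights.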

## Imports

In [3]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Import airport data
import airportsdata
airports = airportsdata.load()

from script_files.feature_engineering import *
from script_files.prepare_flight_data import *


## Understand the Data

1. How many flights are in our data set, and which date period does the data cover?
2. Which data types are in the data?
3. Are there any missing values?
4. Which features are continuous and which are categorical?

In [4]:
# Load the data and reshape the airport data to one row per airport
airports = pd.DataFrame(airports).T.reset_index(drop=True)
# train data
train_df = pd.read_csv("data/Train.csv")
# rename the columns
train_df.columns=['id','flight_date','flight_number','departure_point','arrival_point','scheduled_time_departure','scheduled_time_arrival','flight_status','aircraft_code',"target"]
# test data
test_df = pd.read_csv("data/Test.csv")


print("==="*40)
print('\033[1m'+"Airport data"+ '\033[0m')
display(airports.head())
print()
print("==="*40)
print('\033[1m'+"Training data"+ '\033[0m')
display(train_df.head())
print()
print("==="*40)
print('\033[1m'+"Test data"+ '\033[0m')
display(test_df.head())

Airport data


Unnamed: 0,icao,iata,name,city,subd,country,elevation,lat,lon,tz
0,00AK,,Lowell Field,Anchor Point,Alaska,US,450.0,59.9492,-151.695999,America/Anchorage
1,00AL,,Epps Airpark,Harvest,Alabama,US,820.0,34.864799,-86.770302,America/Chicago
2,00AZ,,Cordes Airport,Cordes,Arizona,US,3810.0,34.305599,-112.165001,America/Phoenix
3,00CA,,Goldstone /Gts/ Airport,Barstow,California,US,3038.0,35.350498,-116.888,America/Los_Angeles
4,00CO,,Cass Field,Briggsdale,Colorado,US,4830.0,40.6222,-104.344002,America/Denver



Training data


Unnamed: 0,id,flight_date,flight_number,departure_point,arrival_point,scheduled_time_departure,scheduled_time_arrival,flight_status,aircraft_code,target
0,train_id_0,2016-01-03,TU 0712,CMN,TUN,2016-01-03 10:30:00,2016-01-03 12.55.00,ATA,TU 32AIMN,260.0
1,train_id_1,2016-01-13,TU 0757,MXP,TUN,2016-01-13 15:05:00,2016-01-13 16.55.00,ATA,TU 31BIMO,20.0
2,train_id_2,2016-01-16,TU 0214,TUN,IST,2016-01-16 04:10:00,2016-01-16 06.45.00,ATA,TU 32AIMN,0.0
3,train_id_3,2016-01-17,TU 0480,DJE,NTE,2016-01-17 14:10:00,2016-01-17 17.00.00,ATA,TU 736IOK,0.0
4,train_id_4,2016-01-17,TU 0338,TUN,ALG,2016-01-17 14:30:00,2016-01-17 15.50.00,ATA,TU 320IMU,22.0



Test data


Unnamed: 0,ID,DATOP,FLTID,DEPSTN,ARRSTN,STD,STA,STATUS,AC
0,test_id_0,2016-05-04,TU 0700,DJE,TUN,2016-05-04 06:40:00,2016-05-04 07.30.00,ATA,TU 32AIMF
1,test_id_1,2016-05-05,TU 0395,TUN,BKO,2016-05-05 15:20:00,2016-05-05 20.05.00,ATA,TU 320IMW
2,test_id_2,2016-05-06,TU 0745,FRA,TUN,2016-05-06 10:00:00,2016-05-06 12.25.00,ATA,TU 32AIMC
3,test_id_3,2016-05-11,TU 0848,BEY,TUN,2016-05-11 09:40:00,2016-05-11 13.10.00,ATA,TU 31BIMO
4,test_id_4,2016-05-11,TU 0635,ORY,MIR,2016-05-11 09:50:00,2016-05-11 12.35.00,ATA,TU 736IOQ


__1. How many flights are in our data set, and which date period does the data cover?__

In [5]:
print(train_df["flight_date"].min())
print(train_df["flight_date"].max())

2016-01-01
2018-12-31


The flights in the training data range from 2016-01-01 to 2018-12-31.

In [6]:
# Getting an idea of the dimension
print('Number of rows in the flightdata dataset : ',train_df.shape[0])
print('Number of columns in the flightdata dataset : ',train_df.shape[1])

Number of rows in the flightdata dataset :  107833
Number of columns in the flightdata dataset :  10


We can observe that the data set contains 107833 rows and 10 columns.

__2. Which data types are in the data?__

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107833 entries, 0 to 107832
Data columns (total 10 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   id                        107833 non-null  object 
 1   flight_date               107833 non-null  object 
 2   flight_number             107833 non-null  object 
 3   departure_point           107833 non-null  object 
 4   arrival_point             107833 non-null  object 
 5   scheduled_time_departure  107833 non-null  object 
 6   scheduled_time_arrival    107833 non-null  object 
 7   flight_status             107833 non-null  object 
 8   aircraft_code             107833 non-null  object 
 9   target                    107833 non-null  float64
dtypes: float64(1), object(9)
memory usage: 8.2+ MB


**Data-types**  
- The dataset contains 9 columns of type object (text); the target variable is a float. <br>

- We would expect all date and time information to be stored as a datetime type, so these columns still need to be converted.
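
A minimal sketch of such a conversion, run on a toy frame that copies two rows from the training preview above. The format strings are assumptions read off that preview: the departure column separates the time parts with colons, while the arrival column uses dots.

```python
import pandas as pd

# Toy rows copied from the training preview above
sample = pd.DataFrame({
    "flight_date": ["2016-01-03", "2016-01-13"],
    "scheduled_time_departure": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "scheduled_time_arrival": ["2016-01-03 12.55.00", "2016-01-13 16.55.00"],
})

sample["flight_date"] = pd.to_datetime(sample["flight_date"], format="%Y-%m-%d")
sample["scheduled_time_departure"] = pd.to_datetime(
    sample["scheduled_time_departure"], format="%Y-%m-%d %H:%M:%S"
)
# The arrival column separates hours, minutes and seconds with dots
sample["scheduled_time_arrival"] = pd.to_datetime(
    sample["scheduled_time_arrival"], format="%Y-%m-%d %H.%M.%S"
)

print(sample.dtypes)
```

Passing an explicit `format` makes the parser fail loudly on rows that do not match, which helps to spot inconsistent timestamps early.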

__3. Are there any missing values?__

In [8]:
# missing_values_table is imported from the feature engineering script in script_files
missing_values_table(train_df)

Your selected dataframe has 10 columns and 107833 Rows.
There are 0 columns that have missing values.


Unnamed: 0,Zero Values,Missing Values,% of Total Values,Total Zero and Missing Values,Data Type


As we can see, there are no missing values in the data.
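
The source of `missing_values_table` is not shown in this notebook. The function below, `missing_values_summary`, is a hypothetical stand-in that produces the same column headers as the output above; it is a sketch of how such a summary could be computed, not the project's actual helper.

```python
import numpy as np
import pandas as pd


def missing_values_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per column that contains missing values."""
    # Count zeros in numeric columns only; other columns get a zero count of 0
    zeros = (df.select_dtypes("number") == 0).sum().reindex(df.columns, fill_value=0)
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "Zero Values": zeros,
        "Missing Values": missing,
        "% of Total Values": 100 * missing / len(df),
        "Total Zero and Missing Values": zeros + missing,
        "Data Type": df.dtypes.astype(str),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary["Missing Values"] > 0]


# Toy frame with one missing delay value
toy = pd.DataFrame({
    "target": [0.0, 5.0, np.nan],
    "departure_point": ["TUN", "DJE", "TUN"],
})
result = missing_values_summary(toy)
print(result)
```

For a frame without missing values, such as the training data here, the returned table is empty, which matches the header-only output above.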

__4. Which features are continuous and which are categorical?__

In [9]:
# All categorical (object) variables
train_df.select_dtypes("object").columns

Index(['id', 'flight_date', 'flight_number', 'departure_point',
       'arrival_point', 'scheduled_time_departure', 'scheduled_time_arrival',
       'flight_status', 'aircraft_code'],
      dtype='object')

In [10]:
train_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
target,107833.0,48.733013,117.135562,0.0,0.0,14.0,43.0,3451.0


The maximum and minimum flight delays are 3451 minutes and 0 minutes respectively.

In [11]:
train_df

Unnamed: 0,id,flight_date,flight_number,departure_point,arrival_point,scheduled_time_departure,scheduled_time_arrival,flight_status,aircraft_code,target
0,train_id_0,2016-01-03,TU 0712,CMN,TUN,2016-01-03 10:30:00,2016-01-03 12.55.00,ATA,TU 32AIMN,260.0
1,train_id_1,2016-01-13,TU 0757,MXP,TUN,2016-01-13 15:05:00,2016-01-13 16.55.00,ATA,TU 31BIMO,20.0
2,train_id_2,2016-01-16,TU 0214,TUN,IST,2016-01-16 04:10:00,2016-01-16 06.45.00,ATA,TU 32AIMN,0.0
3,train_id_3,2016-01-17,TU 0480,DJE,NTE,2016-01-17 14:10:00,2016-01-17 17.00.00,ATA,TU 736IOK,0.0
4,train_id_4,2016-01-17,TU 0338,TUN,ALG,2016-01-17 14:30:00,2016-01-17 15.50.00,ATA,TU 320IMU,22.0
...,...,...,...,...,...,...,...,...,...,...
107828,train_id_107828,2018-07-05,WKL 0000,TUN,TUN,2018-07-05 23:00:00,2018-07-06 02.00.00,SCH,TU 32AIML,0.0
107829,train_id_107829,2018-01-13,UG 0003,DJE,TUN,2018-01-13 08:00:00,2018-01-13 09.00.00,SCH,UG AT7AT7,0.0
107830,train_id_107830,2018-11-07,SGT 0000,TUN,TUN,2018-11-07 05:00:00,2018-11-07 12.50.00,SCH,TU 736IOK,0.0
107831,train_id_107831,2018-01-23,UG 0010,TUN,DJE,2018-01-23 18:00:00,2018-01-23 18.45.00,ATA,TU CR9ISA,0.0


In [18]:
delays = train_df.query("target >= 10.0")

In [None]:
on_time = train_df.query("target < 10.0")

**Data Insights:**

1. The training data contains **107833** flights between **2016-01-01** and **2018-12-31**, with no missing values.
2. On **average**, a flight is delayed by about **48.7** minutes, but the **median** delay is only **14** minutes, so the distribution is strongly right-skewed.
3. At least **25%** of the flights have no delay at all (the 25th percentile is **0** minutes).
4. **75%** of the flights are delayed by **43** minutes or less.
5. The **longest** delay is **3451** minutes (about 57.5 hours).

## Exploratory Data Analysis

In [12]:
depart10 = train_df['departure_point'].value_counts().head(10).to_frame().reset_index()
depart10.columns = ['departure_point', 'Count']

In [13]:
arrival10 = train_df['arrival_point'].value_counts().head(10).to_frame().reset_index()
arrival10.columns = ['arrival_point', 'Count']

## Feature Engineering

In [14]:
# import from script files folder feature engineering script
time_data(train_df)
# create a column for flight length
train_df["flight_length"] = train_df["flight_length"].apply(lambda x: x.total_seconds()/60)

KeyError: 'flight_length'
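
The `KeyError` shows that `train_df` has no `flight_length` column at this point; the source of `time_data` is not shown here, so the cause cannot be confirmed. As an alternative, the scheduled flight length in minutes can be derived directly from the two scheduled timestamps. The sketch below uses a toy frame with two rows from the training data above and assumes the timestamp formats seen there.

```python
import pandas as pd

# Toy rows copied from the training data shown above
sample = pd.DataFrame({
    "scheduled_time_departure": ["2016-01-03 10:30:00", "2018-07-05 23:00:00"],
    "scheduled_time_arrival": ["2016-01-03 12.55.00", "2018-07-06 02.00.00"],
})

departure = pd.to_datetime(sample["scheduled_time_departure"], format="%Y-%m-%d %H:%M:%S")
# The arrival column separates hours, minutes and seconds with dots
arrival = pd.to_datetime(sample["scheduled_time_arrival"], format="%Y-%m-%d %H.%M.%S")

# Scheduled flight length in minutes; flights crossing midnight work as well
sample["flight_length"] = (arrival - departure).dt.total_seconds() / 60
print(sample["flight_length"].tolist())
```

Using the vectorised `.dt.total_seconds()` accessor avoids a row-wise `apply` on the timedelta column.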

In [None]:
airports

## Exploratory Data Analysis