**DESCRIPTION**

**Problem Statement**

This notebook predicts flight delays for the month of January.

The data is collected from the Bureau of Transportation Statistics, Govt. of the USA, and is open-sourced under U.S. Govt. Works. The dataset contains all flights in January 2019 and January 2020. There are more than 400,000 flights in January alone throughout the United States.

Because the data covers January only, it is best suited to predicting flight delays at the destination airport for January in upcoming years.

The 2019 file contains all flights from 1st January 2019 to 31st January 2019. It has around 400,000 rows and 21 feature columns describing each flight, including the origin airport, destination airport, airplane information, departure time and arrival time.

Download the **data sets** from _**[here](https://www.kaggle.com/divyansh22/flight-delay-prediction)**_.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score,classification_report,confusion_matrix
        

from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBRFClassifier,XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
traindata0119=pd.read_csv('/kaggle/input/flight-delay-prediction/Jan_2019_ontime.csv')
traindata0120=pd.read_csv('/kaggle/input/flight-delay-prediction/Jan_2020_ontime.csv')

In [None]:
traindata0119.head()

In [None]:
traindata0120.head()

In [None]:
# Concatenate both datasets (January 2019 and January 2020)
traindata0=pd.concat([traindata0119,traindata0120])
#traindata0=traindata0.iloc[:,np.arange(21)]

In [None]:
traindata0.info()

In [None]:
#let us figure out the redundant attributes
traindata0.OP_UNIQUE_CARRIER.value_counts()

In [None]:
traindata0.OP_CARRIER_AIRLINE_ID.value_counts()

In [None]:
traindata0.OP_CARRIER.value_counts()

In [None]:
# From the results above we can conclude that OP_UNIQUE_CARRIER, OP_CARRIER_AIRLINE_ID and OP_CARRIER are redundant features.
# Hence we can remove OP_UNIQUE_CARRIER and OP_CARRIER and retain only the numerical feature OP_CARRIER_AIRLINE_ID.
# Similarly, ORIGIN_AIRPORT_ID, ORIGIN_AIRPORT_SEQ_ID and ORIGIN are redundant, so we retain only ORIGIN_AIRPORT_ID.
# Similarly, DEST_AIRPORT_ID, DEST_AIRPORT_SEQ_ID and DEST are redundant, so we retain only DEST_AIRPORT_ID.
# We also do not want the CANCELLED and DIVERTED cases, so those columns are dropped.
# ARR_TIME and ARR_DEL15 could both serve as target variables; we drop ARR_TIME because ARR_DEL15 is the chosen target.
# 'Unnamed: 21' and TAIL_NUM are dropped as well.

traindata1=traindata0.drop(columns=['OP_UNIQUE_CARRIER','OP_CARRIER','ORIGIN_AIRPORT_SEQ_ID',
                                    'ORIGIN','DEST_AIRPORT_SEQ_ID','DEST',
                                   'CANCELLED','DIVERTED','ARR_TIME','Unnamed: 21','TAIL_NUM'])
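
Comparing `value_counts()` outputs by eye suggests the carrier columns are redundant. A more direct check is whether two columns map one-to-one. The sketch below is an illustration on a made-up toy frame; the helper name `is_one_to_one` and the sample values are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for traindata0; the values are made up for illustration.
toy = pd.DataFrame({
    "OP_CARRIER": ["AA", "AA", "DL", "WN", "DL"],
    "OP_CARRIER_AIRLINE_ID": [19805, 19805, 19790, 19393, 19790],
    "OP_CARRIER_FL_NUM": [1, 2, 1, 7, 9],
})

def is_one_to_one(df, col_a, col_b):
    """Return True when each value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

# Carrier code and airline ID carry the same information in this toy data.
redundant = is_one_to_one(toy, "OP_CARRIER", "OP_CARRIER_AIRLINE_ID")
# Carrier code and flight number do not.
not_redundant = is_one_to_one(toy, "OP_CARRIER", "OP_CARRIER_FL_NUM")
print(redundant, not_redundant)
```

If the check returns `True` for a pair of columns, keeping only one of them loses no information.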

In [None]:
traindata1.head()

In [None]:
traindata1.info()

In [None]:
# Get the percentage of null values for each attribute
traindata1.isnull().sum()/len(traindata1)*100

In [None]:
# As the percentage of null values is very small, we can drop the records with nulls instead of imputing them
traindata1.dropna(inplace=True)
traindata1.reset_index(drop=True,inplace=True)
traindata1.info()

In [None]:
# Let's analyze the categorical variable DEP_TIME_BLK
traindata1.DEP_TIME_BLK.value_counts()


In [None]:
# Quick check of whether the categorical variable has any influence on the target variable.
# This can be tested with the chi-square test of independence:
#   H0: the feature and the target are independent; Ha: they are dependent
from scipy.stats import chi2_contingency
categorical_columns=['DEP_TIME_BLK']
chi2_check = []
for i in categorical_columns:
    ch2 , p_value , df, exp_freq=chi2_contingency(pd.crosstab(traindata1[i],traindata1['ARR_DEL15']))
    if p_value < 0.05:
        chi2_check.append('Reject Null Hypothesis: Retain the Feature:'+i)
    else:
        chi2_check.append('Fail to Reject Null Hypothesis: Drop the Feature:'+i)
chi2_check
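
With hundreds of thousands of rows, a chi-square test tends to reject the null hypothesis even for weak associations, so it can help to look at an effect size as well. The sketch below runs `chi2_contingency` on a small made-up contingency table and adds Cramér's V, which is not used in the original notebook; the counts are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: rows are departure blocks, columns are ARR_DEL15 = 0 / 1.
table = pd.DataFrame(
    [[900, 100], [850, 150], [700, 300]],
    index=["0600-0659", "1200-1259", "1800-1859"],
    columns=[0, 1],
)

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi2 to the range [0, 1] so it can be compared across sample sizes.
n = table.to_numpy().sum()
min_dim = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print(f"chi2={chi2:.1f}, p={p_value:.3g}, dof={dof}, V={cramers_v:.3f}")
```

A small p-value together with a small V would indicate a statistically detectable but weak dependency.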

In [None]:
# The chi-square test suggests that we should retain the above categorical feature.
# DEP_TIME_BLK can be encoded as a numeric value (roughly the starting hour of each block).
# Assign the result back instead of using inplace=True on a column, and cast to int.
traindata1['DEP_TIME_BLK']=traindata1['DEP_TIME_BLK'].replace(
                                ['0600-0659','0700-0759','0800-0859','1700-1759','1200-1259','1100-1159','1500-1559',
                                 '1000-1059','1400-1459','0900-0959','1600-1659','1800-1859','1300-1359','1900-1959',
                                 '2000-2059','2100-2159','0001-0559','2200-2259','2300-2359'],
                               [6,7,8,17,12,11,15,10,14,9,16,18,13,19,20,21,1,22,23]).astype(int)
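
The mapping above lists every block label by hand. As a possible shortcut, the starting hour can be parsed from the label itself. This is a sketch on a hypothetical sample of labels; note that it encodes `'0001-0559'` as 0, whereas the hand-written mapping uses 1.

```python
import pandas as pd

# Hypothetical sample of DEP_TIME_BLK labels in the "HHMM-HHMM" format.
blocks = pd.Series(["0600-0659", "1700-1759", "0001-0559", "2300-2359"])

# Take the first two characters (the starting hour) and convert them to int.
start_hour = blocks.str[:2].astype(int)

print(start_hour.tolist())  # [6, 17, 0, 23]
```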


In [None]:
traindata1.info()

In [None]:
#Lets check the correlation among the variables
traindata1.corr()

In [None]:
# From the correlation matrix above it is evident that some variables are multicollinear.
# Examples:
# 1. There is a very strong correlation between DEP_TIME_BLK and DEP_TIME (96%)
# 2. There is also a noticeable correlation between OP_CARRIER_AIRLINE_ID and OP_CARRIER_FL_NUM
# Hence we can try dropping DEP_TIME_BLK and OP_CARRIER_AIRLINE_ID

traindata2=traindata1.drop(columns=['DEP_TIME_BLK','OP_CARRIER_AIRLINE_ID'])
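
Reading a full correlation matrix by eye gets harder as columns are added. The sketch below lists only the column pairs whose absolute correlation exceeds a threshold. The toy data and the 0.9 threshold are assumptions for illustration; the column names merely mirror those in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dep_time = rng.integers(0, 2400, size=500)
toy = pd.DataFrame({
    "DEP_TIME": dep_time,
    "DEP_TIME_BLK": dep_time // 100,                 # derived from DEP_TIME, so strongly correlated
    "DISTANCE": rng.integers(100, 3000, size=500),   # generated independently of DEP_TIME
})

corr = toy.corr().abs()
cols = corr.columns

# Walk the upper triangle so each pair is reported once and the diagonal is skipped.
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > 0.9
]
print(high_pairs)
```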

In [None]:
features=['DAY_OF_MONTH','DAY_OF_WEEK','OP_CARRIER_FL_NUM','ORIGIN_AIRPORT_ID']

#Lets do pair plot and visualize the class separability


#plt.subplot(2,4,1)
sns.pairplot(traindata2,x_vars=features,y_vars=features,kind='scatter',hue='DEP_DEL15')
plt.show()

In [None]:
features=['DEST_AIRPORT_ID','DEP_TIME','DISTANCE']

# Let's draw a pair plot to visualize class separability
# (seaborn and matplotlib are already imported at the top of the notebook)

#plt.subplot(2,4,1)
sns.pairplot(traindata2,x_vars=features,y_vars=features,kind='scatter',hue='DEP_DEL15')
plt.show()

In [None]:
# From the pair plots above it seems that the target classes are not linearly separable on most of the features, except DEP_TIME
# Extract the features and label
traindata2.head(1)

In [None]:
features=traindata2.drop(columns=['ARR_DEL15']).values
label=traindata2['ARR_DEL15'].values

In [None]:
#Check for data balance
sns.countplot(x=label)
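
A count plot shows the imbalance visually; the exact class proportions can also be printed. A minimal sketch on made-up labels (the array `toy_label` is an assumption standing in for `label`):

```python
import numpy as np

# Made-up labels: 0 = on time, 1 = arrival delayed by 15 minutes or more.
toy_label = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

classes, counts = np.unique(toy_label, return_counts=True)
proportions = counts / counts.sum()
for c, p in zip(classes, proportions):
    print(f"class {c}: {p:.0%}")
```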

In [None]:
#Split the data into Train and Test
from sklearn.model_selection import train_test_split
X_train,X_test_final,y_train,y_test_final=train_test_split(features,label,test_size=0.2,random_state=12)
print('the shape of X_train and  y_train: ', X_train.shape, y_train.shape)
print('the shape of X_test and  y_test: ', X_test_final.shape,y_test_final.shape)
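
Because the classes are imbalanced, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses synthetic data; the 85/15 class ratio is an assumption, not a figure from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
toy_features = rng.normal(size=(1000, 3))
toy_label = np.array([0] * 850 + [1] * 150)   # 15% positive class, made up

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_label, test_size=0.2, random_state=12, stratify=toy_label
)

# Both parts keep (almost exactly) the original share of the positive class.
print(y_tr.mean(), y_te.mean())
```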


## Verify the performance of different models using Stratified-KFold Cross Validation


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score,classification_report,confusion_matrix
#from sklearn import metrics

def stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y):
    global df_model_selection
    
    skf = StratifiedKFold(n_splits, random_state=12,shuffle=True)
    
    weighted_f1_score = []
    #print(skf.split(X,y))
    for train_index, test_index in skf.split(X,y):
        X_train, X_test = X[train_index], X[test_index] 
        y_train, y_test = y[train_index], y[test_index]
        
        
        model_obj.fit(X_train, y_train)
        test_ds_predicted = model_obj.predict(X_test)
        #print( metrics.classification_report( y_test, test_ds_predicted ) )
        weighted_f1_score.append(round(f1_score(y_true=y_test, y_pred=test_ds_predicted , average='weighted'),2))
        
    sd_weighted_f1_score = np.std(weighted_f1_score, ddof=1)
    range_of_f1_scores = "{}-{}".format(min(weighted_f1_score),max(weighted_f1_score))    
    df_model_selection = pd.concat([df_model_selection,pd.DataFrame([[process,model_name,sorted(weighted_f1_score),range_of_f1_scores,sd_weighted_f1_score]], columns =COLUMN_NAMES) ])
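
A possible variation on the helper above is scikit-learn's `cross_val_score` combined with a pipeline, so that the scaler is fitted inside each fold rather than once on all training data. This is a sketch on synthetic data from `make_classification`, not the notebook's method; the sample size and class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the flight data (8 features, roughly 85% / 15% classes).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, weights=[0.85, 0.15], random_state=12
)

# The scaler lives inside the pipeline, so it is re-fitted on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=12)

scores = cross_val_score(pipeline, X_toy, y_toy, cv=skf, scoring="f1_weighted")
print(scores.round(3), scores.std(ddof=1))
```

Keeping the unrounded scores until the final report also preserves small differences between models.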
    

In [None]:
%%time

COLUMN_NAMES = ["Process","Model Name", "F1 Scores","Range of F1 Scores","Std Deviation of F1 Scores"]
df_model_selection = pd.DataFrame(columns=COLUMN_NAMES)

process='Stratified-KFold'
n_splits = 10
X=sc.fit_transform(X_train)
y=y_train

# Logistic Regression
model_LR=LogisticRegression()
model_obj=model_LR
model_name='Logistic Regression'
stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# Decision Tree Classifier
model_DTC=DecisionTreeClassifier()
model_obj=model_DTC
model_name='Decesion Tree Classifier'
stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# Random Forest Classifier
model_RFC=RandomForestClassifier()
model_obj=model_RFC
model_name='Random Forest Classifier'
stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# XGBoost Classifier
model_XGBC=XGBClassifier()
model_obj=model_XGBC
model_name='XGBoost Classifier'
stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# Gradient Boosting Classifier
model_GBC=GradientBoostingClassifier()
model_obj=model_GBC
model_name='Gradient Boosting Classifier'
stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# XGBoost Random Forest Classifier
#model_XGBRFC=XGBRFClassifier()
#model_obj=model_XGBRFC
#model_name='XGBoost Random Forest Classifier'
#stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# Support Vector Machine Classifier
#model_SVC=SVC()
#model_obj=model_SVC
#model_name='Support Vector Machine Classifier'
#stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)


# SGD Classifier
#model_sgd = OneVsRestClassifier(SGDClassifier())
#model_obj=model_sgd
#model_name='Stochastic Gradient Descent Classifier'
#stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)


# K Nearest Neighbors Classifier
#model_KNNC=KNeighborsClassifier()
#model_obj=model_KNNC
#model_name='K Nearest Neighbour Classifier'
#stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

# Linear Discriminant Analysis
#model_LDA=LinearDiscriminantAnalysis()
#model_obj=model_LDA
#model_name='Linear Discriminant Analysis'
#stratified_K_fold_validation(model_obj, model_name, process, n_splits, X, y)

#Exporting the results to csv
#df_model_selection.to_csv("Model_statistics.csv",index = False)
df_model_selection

In [None]:

| | Process | Model Name | F1 Scores | Range of F1 Scores | Std Deviation of F1 Scores |
|---|---|---|---|---|---|
| 0 | Stratified-KFold | Logistic Regression | [0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.9... | 0.92-0.93 | 3.162278e-03 |
| 0 | Stratified-KFold | Decesion Tree Classifier | [0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.8... | 0.88-0.88 | 1.170278e-16 |
| 0 | Stratified-KFold | Random Forest Classifier | [0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.9... | 0.92-0.93 | 3.162278e-03 |
| 0 | Stratified-KFold | XGBoost Classifier | [0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.9... | 0.92-0.93 | 3.162278e-03 |
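
The standard deviation of 3.162278e-03 reported for three of the models is what the sample standard deviation (`ddof=1`) gives for ten rounded scores where nine equal 0.92 and one equals 0.93, which matches the reported range of 0.92-0.93. The short check below reproduces that arithmetic; the exact per-fold split is inferred from the table, not read from the truncated score lists.

```python
import numpy as np

# Ten fold scores rounded to two decimals: nine folds at 0.92 and one at 0.93.
rounded_scores = [0.92] * 9 + [0.93]

sd = np.std(rounded_scores, ddof=1)
print(sd)
```

Because the scores are rounded before the deviation is computed, models that differ only in the third decimal place look identical in this table.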

In [None]:
# Let's find the best split out of the 10 folds
# by computing train and test F1 scores with StratifiedKFold cross-validation

# Initialize the algorithm
model=LogisticRegression()

#Initialize StratifiedKFold Method
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, 
              random_state=1,
              shuffle=True)

#Initialize For Loop 

i=0
for train,test in kfold.split(X,y):
    i = i+1
    X_train,X_test = X[train],X[test]
    y_train,y_test = y[train],y[test]
    
    model.fit(X_train,y_train)
    test_ds_predicted=model.predict(X_test)
    train_ds_predicted=model.predict(X_train)
    
    test_f1_score=round(f1_score(y_true=y_test, y_pred=test_ds_predicted , average='weighted'),2)
    train_f1_score=round(f1_score(y_true=y_train, y_pred=train_ds_predicted , average='weighted'),2)
    
    #print("Train Score: {}, Test score: {}, for Sample Split: {}".format(model.score(X_train,y_train),model.score(X_test,y_test),i))
    print("Train f1-Score: {}, Test f1-score: {}, for Sample Split: {}".format(train_f1_score,test_f1_score,i))

In [None]:
#Lets extract the Train and Test sample for split 9
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, #n_splits should be equal to no of cv value in cross_val_score
              random_state=1,
              shuffle=True)
i=0
for train,test in kfold.split(X,y):
    i = i+1
    if i == 9:
        X_train,X_test,y_train,y_test = X[train],X[test],y[train],y[test]

#Final Model
finalModel=LogisticRegression()
finalModel.fit(X_train,y_train)

# Predict with the final model (not the `model` object left over from the previous cell)
test_ds_predicted=finalModel.predict(X_test)
train_ds_predicted=finalModel.predict(X_train)

test_f1_score=round(f1_score(y_true=y_test, y_pred=test_ds_predicted , average='weighted'),2)
train_f1_score=round(f1_score(y_true=y_train, y_pred=train_ds_predicted , average='weighted'),2)
print("Train f1-Score: {}, Test f1-score: {}".format(train_f1_score,test_f1_score))


train_score=np.round(finalModel.score(X_train,y_train),2)
test_score=np.round(finalModel.score(X_test,y_test),2)
print('Train Accuracy Score is:{} and  Test Accuracy Score:{}'.format(train_score,test_score))

#Classification Report
cr=classification_report(y_true=y_test,y_pred=finalModel.predict(X_test))
print(cr)

In [None]:
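# Let's test the model on the unseen dataset: X_test_final and y_test_final
# Reuse the scaler fitted on the training data; re-fitting it on the test set would scale the test data differently
X_test_final=sc.transform(X_test_final)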
cr=classification_report(y_true=y_test_final,y_pred=finalModel.predict(X_test_final))
print(cr)
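
When evaluating on a held-out set, the scaler fitted on the training data is normally reused via `transform` rather than fitted again. The sketch below shows, on synthetic one-column data with a deliberately shifted test set, how the two choices give different inputs to the model; all values here are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(500, 1))
test = rng.normal(loc=12.0, scale=2.0, size=(100, 1))   # shifted on purpose

scaler = StandardScaler().fit(train)                 # statistics come from the training data only
test_reused = scaler.transform(test)                 # same mean/scale the model saw in training
test_refit = StandardScaler().fit_transform(test)    # re-centres the test set on its own mean

# The shift between train and test survives only when the training scaler is reused.
print(test_reused.mean(), test_refit.mean())
```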

In [None]:
## Let's try out a neural network
import tensorflow as tf
# Import from the public tensorflow.keras API rather than the private tensorflow.python package
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense

In [None]:
%%time
#Build a sequential model
tf.keras.backend.clear_session()

#Initialize Sequential model
model_NN = tf.keras.models.Sequential()
#Input Layer
model_NN.add(tf.keras.layers.Reshape((8,),input_shape=(8,)))
#Normalize the data
model_NN.add(tf.keras.layers.BatchNormalization())

#Add 1st hidden layer
model_NN.add(tf.keras.layers.Dense(100, activation='relu'))
#Dropout layer
#model_NN.add(tf.keras.layers.Dropout(0.5))
#Normalize the data
model_NN.add(tf.keras.layers.BatchNormalization())

#Add 2nd hidden layer
model_NN.add(tf.keras.layers.Dense(50, activation='relu'))
#Dropout layer
#model_NN.add(tf.keras.layers.Dropout(0.3))
#Normalize the data
model_NN.add(tf.keras.layers.BatchNormalization())

#Add OUTPUT layer
model_NN.add(tf.keras.layers.Dense(1, activation='sigmoid'))

#Create optimizer with non-default learning rate
#sgd_optimizer = tf.keras.optimizers.SGD(lr=1.0)
#model_NN.compile(optimizer=sgd_optimizer, loss='binary_crossentropy', metrics=['accuracy'])

#Compile the model
model_NN.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
#model_NN.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
%%time
model_NN.fit(X_train,y_train,          
          validation_data=(X_test,y_test),
          epochs=10,
          batch_size=32)

In [None]:
## Observation from the neural net: there is not much improvement in accuracy compared to logistic regression

In [None]:
# As the data is imbalanced, we can try an oversampling technique (SMOTE) and see if it improves the model performance
# Data balancing techniques should be applied only to the training dataset

# Split the data into train and test sets (validation folds are created later by cross-validation)
from sklearn.model_selection import train_test_split
X_train,X_test_final,y_train,y_test_final=train_test_split(features,label,test_size=0.2,random_state=12)
print('the shape of X_train and  y_train: ', X_train.shape, y_train.shape)
print('the shape of X_test and  y_test: ', X_test_final.shape,y_test_final.shape)


from imblearn.over_sampling import SMOTE
print('length of X_train and y_train before Oversampling',len(X_train),len(y_train))

OS=SMOTE(random_state=42)
X_train_OS,y_train_OS=OS.fit_resample(X_train,y_train)

print('length of X_train and y_train after Oversampling',len(X_train_OS),len(y_train_OS))

#Check for data balance
sns.countplot(x=y_train_OS)
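
Oversampling is one way to handle class imbalance. A different technique that needs no extra rows is class weighting, available in scikit-learn through `class_weight="balanced"`. The sketch below compares recall on the minority class with and without weighting on synthetic data. It is not the notebook's approach, and the data and class ratio are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the flight data.
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# "balanced" weights each class inversely to its frequency in the training labels.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(recall_plain, recall_weighted)
```

Weighting typically raises recall on the minority class at the cost of some precision, so both metrics are worth reporting.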


In [None]:
%%time
# We can now apply cross validation 
#Initialize the algo
model=LogisticRegression()

X_train_OS=sc.fit_transform(X_train_OS) # scale the data

#Initialize StratifiedKFold Method
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, 
              random_state=1,
              shuffle=True)

#Initialize For Loop 

i=0
for train,test in kfold.split(X_train_OS,y_train_OS):
    i = i+1
    X_train,X_test = X_train_OS[train],X_train_OS[test]
    y_train,y_test = y_train_OS[train],y_train_OS[test]
    
    model.fit(X_train,y_train)
    test_ds_predicted=model.predict(X_test)
    train_ds_predicted=model.predict(X_train)
    
    test_f1_score=round(f1_score(y_true=y_test, y_pred=test_ds_predicted , average='weighted'),2)
    train_f1_score=round(f1_score(y_true=y_train, y_pred=train_ds_predicted , average='weighted'),2)
    
    #print("Train Score: {}, Test score: {}, for Sample Split: {}".format(model.score(X_train,y_train),model.score(X_test,y_test),i))
    print("Train f1-Score: {}, Test f1-score: {}, for Sample Split: {}".format(train_f1_score,test_f1_score,i))

In [None]:
#Lets extract the Train and Test sample for split 2
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, #n_splits should be equal to no of cv value in cross_val_score
              random_state=1,
              shuffle=True)
i=0
for train,test in kfold.split(X_train_OS,y_train_OS):
    i = i+1
    if i == 2:
        X_train,X_test,y_train,y_test = X_train_OS[train],X_train_OS[test],y_train_OS[train],y_train_OS[test]

#Final Model
finalModel=LogisticRegression()
finalModel.fit(X_train,y_train)

# Predict with the final model (not the `model` object left over from the previous cell)
test_ds_predicted=finalModel.predict(X_test)
train_ds_predicted=finalModel.predict(X_train)

test_f1_score=round(f1_score(y_true=y_test, y_pred=test_ds_predicted , average='weighted'),2)
train_f1_score=round(f1_score(y_true=y_train, y_pred=train_ds_predicted , average='weighted'),2)
print("Train f1-Score: {}, Test f1-score: {}".format(train_f1_score,test_f1_score))


train_score=np.round(finalModel.score(X_train,y_train),2)
test_score=np.round(finalModel.score(X_test,y_test),2)
print('Train Accuracy Score is:{} and  Test Accuracy Score:{}'.format(train_score,test_score))

#Classification Report
cr=classification_report(y_true=y_test,y_pred=finalModel.predict(X_test))
print(cr)

In [None]:
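# Let's test the model on the unseen dataset: X_test_final and y_test_final
# Reuse the scaler fitted on the training data; re-fitting it on the test set would scale the test data differently
X_test_final=sc.transform(X_test_final)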
cr=classification_report(y_true=y_test_final,y_pred=finalModel.predict(X_test_final))
print(cr)

In [None]:
# Conclusion:
# As the test results on the unseen dataset show,
# the model performance on the balanced and imbalanced datasets is very similar.
# Precision is around 77%, recall is around 74% and F1 score is 86%.
# The next action would be to improve the model performance with hyperparameter tuning.
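
As a starting point for the tuning step mentioned above, here is a minimal sketch of a grid search over the regularization strength `C` of logistic regression. It runs on synthetic data; the grid values, fold count and scoring choice are illustrative assumptions rather than recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the flight data.
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=8, weights=[0.85, 0.15], random_state=12
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}   # illustrative grid, not tuned values

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=12),
    scoring="f1_weighted",
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```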