#**Aircraft Arrival Delay Prediction Analysis**

**In this ipynb notebook, a rough aircraft arrival delay will be estimated with the help of aviation data from the year 2008.**

###**Importing data and several libraries**

This code section imports the aviation data and the required libraries, such as pandas.

In [2]:
import numpy as np
import pandas as pd

#importing data; pd.read_csv returns a DataFrame directly,
#so no extra conversion is needed
df = pd.read_csv("/content/drive/My Drive/Datasets_csv/DelayedFlights.csv")

#printing the first five rows of the DataFrame for verification
print(df.head())

   Unnamed: 0  Year  Month  ...  NASDelay  SecurityDelay  LateAircraftDelay
0           0  2008      1  ...       NaN            NaN                NaN
1           1  2008      1  ...       NaN            NaN                NaN
2           2  2008      1  ...       NaN            NaN                NaN
3           4  2008      1  ...       0.0            0.0               32.0
4           5  2008      1  ...       NaN            NaN                NaN

[5 rows x 30 columns]
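
The 'Unnamed: 0' column in the output above typically appears when a CSV file was saved together with its row index. I am assuming that is the case here. The sketch below uses a tiny in-memory CSV with made-up rows (not the real file) to show what `pd.read_csv` returns and how `index_col=0` treats such a column.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (illustrative rows only).
csv_text = ",Year,Month,ArrDelay\n0,2008,1,-14.0\n1,2008,1,2.0\n"

# Default read: the header-less first column is loaded as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw).__name__, list(raw.columns))

# index_col=0 uses that first column as the row index instead of a feature.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

Either way the column carries no flight information, which is why it is dropped later in this notebook.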


##**Exploratory Data Analysis**

The code section below takes a brief look at the data and its attributes.

In [7]:
#Investigating number of columns and column names
print("DataFrame columns: \n",df.columns)

#total number of columns in the data
print("\nthere are a total of", len(df.columns), "features in the DataFrame")

DataFrame columns: 
 Index(['Unnamed: 0', 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime',
       'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum',
       'TailNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay',
       'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
      dtype='object')

there are a total of 30 features in the DataFrame


###**Feature Selection and Engineering**

*Feature selection is an important part of data cleaning. Unnecessary features add noise to the data, which keeps the model from reaching good accuracy. For any kind of predictive analysis, the model can only work with input features that can actually be supplied at prediction time. The data has 30 columns, and the project objective is to predict the arrival delay of the aircraft, so 'ArrDelay' is the target feature. From the remaining 29 features I have opted to keep 'Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 'Origin' and 'Dest', since these are the only inputs a basic user can give. All other features depend on factors such as weather, aircraft and traffic.*


In [9]:
#removing unnecessary features from the DataFrame and assigning to a new variable
new_df = df.drop(['Unnamed: 0', 'Year', 'DepTime', 'ArrTime','FlightNum',
       'TailNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
       'DepDelay', 'Distance', 'TaxiIn', 'TaxiOut',
       'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'], axis=1)

#verifying the feature removal operation.
print(new_df.columns)




Index(['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime',
       'UniqueCarrier', 'ArrDelay', 'Origin', 'Dest'],
      dtype='object')
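
Dropping 21 columns by name works, but the same result can also be expressed as a list of columns to keep, which is shorter and mirrors the feature list given above. The sketch below compares both approaches on a toy frame. `df_demo` and its values are made up for illustration, and only a few of the real column names are used.

```python
import pandas as pd

# Toy frame with a handful of the original column names (illustrative values).
df_demo = pd.DataFrame({
    "Year": [2008, 2008],
    "Month": [1, 1],
    "DepDelay": [8.0, 19.0],
    "ArrDelay": [-14.0, 2.0],
    "Origin": ["IAD", "IND"],
})

# Selecting the columns to keep ...
keep = ["Month", "ArrDelay", "Origin"]
by_keep = df_demo[keep]

# ... gives the same frame as dropping everything else.
by_drop = df_demo.drop(["Year", "DepDelay"], axis=1)
print(by_keep.equals(by_drop))
```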


###**Exploring the partially cleaned Data**

This section looks at the statistical summary and the column information of new_df.

In [10]:
#statistical summary of the data
print("Statistical summary of new_df: \n", new_df.describe())

#column-wise data types information; info() prints its report itself
#and returns None, so it is called directly instead of inside print()
new_df.info()

Statistical summary of new_df: 
               Month    DayofMonth  ...    CRSArrTime      ArrDelay
count  1.048575e+06  1.048575e+06  ...  1.048575e+06  1.044679e+06
mean   3.385121e+00  1.542589e+01  ...  1.632411e+03  4.218257e+01
std    1.700650e+00  8.852621e+00  ...  4.653917e+02  5.577485e+01
min    1.000000e+00  1.000000e+00  ...  0.000000e+00 -6.900000e+01
25%    2.000000e+00  8.000000e+00  ...  1.324000e+03  9.000000e+00
50%    3.000000e+00  1.500000e+01  ...  1.705000e+03  2.500000e+01
75%    5.000000e+00  2.300000e+01  ...  2.014000e+03  5.600000e+01
max    6.000000e+00  3.100000e+01  ...  2.400000e+03  2.461000e+03

[8 rows x 6 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 9 columns):
Month            1048575 non-null int64
DayofMonth       1048575 non-null int64
DayOfWeek        1048575 non-null int64
CRSDepTime       1048575 non-null int64
CRSArrTime       1048575 non-null int64
UniqueCarrier    1048575 non-n

###**Searching for missing values in new_df**

Missing values can introduce noise. They need to be removed or filled to achieve good accuracy.

In [13]:
#columnwise missing values information.

print("Columnwise missing values information\n")
for col in new_df.columns:
  print(col, new_df[col].isnull().values.sum())

Columnwise missing values information

Month 0
DayofMonth 0
DayOfWeek 0
CRSDepTime 0
CRSArrTime 0
UniqueCarrier 0
ArrDelay 3896
Origin 0
Dest 0


#####As observed above, 'ArrDelay' contains 3896 missing records. That is only about 0.37% of the 1048575 records, so these rows can be removed without hesitation.
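
The counts and the share of missing records can also be computed without a loop. The sketch below is a minimal example on a made-up four-row frame (`demo` is not the real data). The last lines repeat the arithmetic for the figures reported in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing 'ArrDelay' value.
demo = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "ArrDelay": [-14.0, np.nan, 34.0, 11.0],
})

# Number of missing values per column.
missing_counts = demo.isnull().sum()
# Fraction of missing values per column (mean of a boolean mask).
missing_share = demo.isnull().mean()
print(missing_counts)
print(missing_share)

# Share of missing 'ArrDelay' records using the counts reported in this notebook.
share_in_notebook = 3896 / 1048575
print(f"{share_in_notebook:.2%}")
```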

In [15]:
#removing the rows which contain null 'ArrDelay' records.
new_df = new_df.dropna()

#verifying the last operation
for col in new_df.columns:
  print(col, new_df[col].isnull().values.sum())
print("\nthere are", str(new_df.isnull().values.sum()), " missing values in the data")

Month 0
DayofMonth 0
DayOfWeek 0
CRSDepTime 0
CRSArrTime 0
UniqueCarrier 0
ArrDelay 0
Origin 0
Dest 0

there are 0  missing values in the data


###**Identifying the total number of airports in the data**

The code section below identifies the unique airports in the data, so that the number of airports covered by the analysis is known.

In [29]:
#airports list
print("Available airports list: \n")

#combining the 'Origin' and 'Dest' columns
airports = pd.concat([new_df['Origin'], new_df['Dest']])

#extracting the unique values across both columns
airports = list(airports.unique())
print(airports)

print("\nThere are a total of", len(airports), "unique airports in the data")


Available airports list: 

['IAD', 'IND', 'ISP', 'JAN', 'JAX', 'LAS', 'LAX', 'LBB', 'LIT', 'MAF', 'MCI', 'MCO', 'MDW', 'MHT', 'MSY', 'OAK', 'OKC', 'OMA', 'ONT', 'ORF', 'PBI', 'PDX', 'PHL', 'PHX', 'PIT', 'PVD', 'RDU', 'RNO', 'RSW', 'SAN', 'SAT', 'SDF', 'SEA', 'SFO', 'SJC', 'SLC', 'SMF', 'SNA', 'STL', 'TPA', 'TUL', 'TUS', 'ABQ', 'ALB', 'AMA', 'AUS', 'BDL', 'BHM', 'BNA', 'BOI', 'BUF', 'BUR', 'BWI', 'CLE', 'CMH', 'CRP', 'DAL', 'DEN', 'DTW', 'ELP', 'FLL', 'GEG', 'HOU', 'HRL', 'ROC', 'ORD', 'EWR', 'SYR', 'IAH', 'CRW', 'FAT', 'COS', 'MRY', 'LGB', 'BFL', 'EUG', 'ICT', 'MEM', 'BTV', 'MKE', 'LFT', 'BRO', 'PWM', 'MSP', 'SRQ', 'CLT', 'CVG', 'GSO', 'SHV', 'DCA', 'TYS', 'GSP', 'RIC', 'DFW', 'BGR', 'DAY', 'GRR', 'CHS', 'CAE', 'TLH', 'XNA', 'GPT', 'VPS', 'LGA', 'ATL', 'MSN', 'SAV', 'BTR', 'LEX', 'LRD', 'MOB', 'MTJ', 'GRK', 'AEX', 'PNS', 'ABE', 'HSV', 'CHA', 'MFE', 'MLU', 'DSM', 'MGM', 'AVL', 'LCH', 'BOS', 'MYR', 'CLL', 'DAB', 'ASE', 'ATW', 'BMI', 'CAK', 'CID', 'CPR', 'EGE', 'FLG', 'FSD', 'FWA', 'GJT',

Airports have unique codes for identification. I have attached an airport codes and names CSV file for reference.
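
A reference table like that can be used to translate codes into readable names. The sketch below is only an illustration: the `airport_names` table, its column names `code` and `name`, and the code 'XXX' are hypothetical stand-ins, since the layout of the attached file is not shown here.

```python
import pandas as pd

# Hypothetical lookup table; in practice it would be loaded from the attached CSV.
airport_names = pd.DataFrame({
    "code": ["IAD", "IND", "TPA"],
    "name": [
        "Washington Dulles International",
        "Indianapolis International",
        "Tampa International",
    ],
})

# Codes to translate; 'XXX' is a made-up code that is not in the table.
codes = pd.Series(["IAD", "TPA", "XXX"])

# Series.map looks each code up in the table and yields NaN for unknown codes.
lookup = airport_names.set_index("code")["name"]
names = codes.map(lookup)
print(names.tolist())
```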

###**Identifying independent and dependent variables, i.e. stating the target variable**

In [31]:
y = new_df['ArrDelay']
X = new_df.drop(['ArrDelay'], axis=1)

#verifying operation
print(y.head())
print("\n",X.head())

0   -14.0
1     2.0
2    14.0
3    34.0
4    11.0
Name: ArrDelay, dtype: float64

    Month  DayofMonth  DayOfWeek  ...  UniqueCarrier  Origin Dest
0      1           3          4  ...             WN     IAD  TPA
1      1           3          4  ...             WN     IAD  TPA
2      1           3          4  ...             WN     IND  BWI
3      1           3          4  ...             WN     IND  BWI
4      1           3          4  ...             WN     IND  JAX

[5 rows x 8 columns]


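###**Ordinal Label Encoding**

The data contains categorical variables, such as 'UniqueCarrier', 'Origin' and 'Dest', that are stored as text. They have to be encoded as numbers before the model can use them. The code below applies ordinal label encoding, which assigns one integer code per category. It is not one-hot encoding.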

In [33]:
#ordinal label encoding: each distinct value is replaced by an integer code

cats = ['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier' ,'Origin','Dest']

for feature in cats:
  X[feature] = pd.Categorical(X[feature]).codes

#to verify the operation, having a look at the 'UniqueCarrier'
#variable to see whether the data has been label encoded.

print(X['UniqueCarrier'].head())

0    17
1    17
2    17
3    17
4    17
Name: UniqueCarrier, dtype: int8
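
To make the difference between the two encodings concrete, the sketch below encodes a made-up carrier column both ways. `pd.Categorical(...).codes` is the call used in this notebook, while `pd.get_dummies` is shown only for comparison and is not used in this project.

```python
import pandas as pd

# Made-up carrier column for illustration.
carriers = pd.Series(["WN", "AA", "WN", "DL"], name="UniqueCarrier")

# Ordinal encoding: one integer per category, assigned in sorted category order.
codes = pd.Categorical(carriers).codes
print(list(codes))

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(carriers)
print(list(one_hot.columns), one_hot.shape)
```

Ordinal codes keep a single column per feature, which matters with several hundred airports, but they also impose an arbitrary order on the categories. Tree-based models such as XGBoost tend to cope with that better than linear models do.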


###**Splitting the Data into Training set and Testing set**

In this code section the data will be split into a training set and a testing set, with 33% of the rows held out for testing.

In [34]:
#importing train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33, random_state = 42)

print(X_train.head())

        Month  DayofMonth  DayOfWeek  ...  UniqueCarrier  Origin  Dest
301730      1          25          1  ...              9      18   218
135031      0          10          4  ...             12     289   177
376046      2           5          3  ...             17     129   131
345040      1           0          4  ...              1     278   182
114179      0          19          6  ...              9     173   175

[5 rows x 8 columns]
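
The effect of `test_size` and `random_state` is easier to see on a small example. The sketch below splits 100 made-up rows with the same settings as above. `X_demo` and `y_demo` are illustrative arrays, not the flight data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with two features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same settings as in the notebook: 33% test rows, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

The shuffled row labels in the `X_train.head()` output above come from the same mechanism: rows are shuffled before being assigned to either set.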


###**Importing Xgboost model to train and test the data**

This step trains the model and tests its performance. An XGBoost model has been employed, as I found that it performs slightly better than LinearRegression.

In [0]:
#importing model
from xgboost import XGBRegressor
xgb = XGBRegressor()

#importing scipy stats
import scipy.stats as st

#to avoid warning while training the data.
import warnings
warnings.filterwarnings('ignore')

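#beta(10, 1) samples values between 0 and 1 that are concentrated near 1
one_to_left = st.beta(10, 1)  
#expon(0, 50) samples non-negative values with a mean of 50
from_zero_positive = st.expon(0, 50)

#defining the parameter distributions that the randomized search samples from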
params = {  
    "n_estimators": st.randint(3, 40),
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "colsample_bytree": one_to_left,
    "subsample": one_to_left,
    "gamma": st.uniform(0, 10),
    'reg_alpha': from_zero_positive,
    "min_child_weight": from_zero_positive
}
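
The `scipy.stats` arguments used above are easy to misread, because they are location and scale values rather than lower and upper bounds. The sketch below inspects the same frozen distributions to show which ranges they actually cover. The variable names are only for illustration.

```python
import scipy.stats as st

# randint(low, high) draws integers from low up to high - 1.
int_range = st.randint(3, 40).support()
print(int_range)

# uniform(loc, scale) covers [loc, loc + scale], here roughly 0.05 to 0.45.
lr = st.uniform(0.05, 0.4)
lr_low, lr_high = lr.ppf(0.0), lr.ppf(1.0)
print(lr_low, lr_high)

# beta(10, 1) lives on (0, 1) with mean 10 / 11, so most draws are close to 1.
beta_mean = st.beta(10, 1).mean()
print(beta_mean)

# expon(loc, scale) starts at loc and has mean loc + scale.
expon_mean = st.expon(0, 50).mean()
print(expon_mean)
```

So "n_estimators" and "max_depth" are searched between 3 and 39, and "learning_rate" between about 0.05 and 0.45.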


In [37]:
#importing RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

#searching the parameter distributions on the training data
#(n_jobs=-1 lets XGBoost use all available threads)
xgbreg = XGBRegressor(n_jobs=-1)
rsCV = RandomizedSearchCV(xgbreg, params, n_jobs=1)  
rsCV.fit(X_train, y_train)

#refitting a model on the training data with the best parameters found
clf = XGBRegressor(**rsCV.best_params_)
clf.fit(X_train, y_train)

#printing the mean absolute error on the testing set
from sklearn.metrics import mean_absolute_error
print("MAE: %.4f" % mean_absolute_error(y_test, clf.predict(X_test)))


MAE: 31.0649


##**Reports**

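A mean absolute error of about 31 minutes is not a great result. For reference, the statistical summary above shows a standard deviation of about 56 minutes for 'ArrDelay'. Given the limitations of the data, the result is a reasonable one.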
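
One way to put this number into perspective would be to compare it with a naive baseline that always predicts the median delay of the training set. This comparison was not run in this notebook. The sketch below only shows the calculation on made-up numbers, and `y_train_demo` and `y_test_demo` are hypothetical.

```python
import numpy as np

# Made-up delays in minutes (not the real data).
y_train_demo = np.array([0.0, 10.0, 20.0, 30.0, 100.0])
y_test_demo = np.array([5.0, 25.0, 60.0])

# Naive baseline: always predict the median delay seen in training.
baseline_value = np.median(y_train_demo)
baseline_pred = np.full_like(y_test_demo, baseline_value)

# Mean absolute error of that constant prediction on the test values.
baseline_mae = np.mean(np.abs(y_test_demo - baseline_pred))
print(baseline_mae)
```

If the model's MAE on the real test set is clearly below such a baseline, the selected features carry useful signal. If it is about the same, the model has learned little beyond the typical delay.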