# SDS Challenge #1 - Flight Cancellations

## Problem Statement
Welcome, Data Scientist! You have recently been hired by the US Department of Transportation (DOT) to analyze data from multiple airline carriers in the United States. The DOT wants to help airline carriers reduce the number of flight cancellations and improve travelers' experiences. Your job is to help the DOT predict whether or not a flight will be canceled based on the data provided.

## Evaluation

\begin{equation*}
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}
<br>
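
As a quick illustration of the formula above, the following sketch counts the four outcomes for a pair of label lists and computes accuracy from them. The helper names `confusion_counts` and `accuracy` and the toy labels are hypothetical, chosen only for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means cancelled."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            # Remaining case for binary labels: truth == 1 and pred == 0.
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of predictions that match the true label."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Made-up labels, only to exercise the two helpers.
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(*counts))
```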

## Understanding the Dataset 

Each column in the dataset is labeled and explained in more detail below.

<b>YEAR:</b> Year in which the flight was scheduled to take place <br>
<b>MONTH:</b> Month in which the flight was scheduled to take place<br>
<b>DAY:</b> Day of the month the flight was scheduled to take place <br>
<b>DAY_OF_WEEK:</b> Day of the week the flight was scheduled to take place<br> 
<b>AIRLINE:</b> Initials of the airline that was scheduled to carry out the flight <br> 
<b>FLIGHT_NUMBER:</b> Flight number identifying the scheduled flight <br> 
<b>TAIL_NUMBER:</b> Tail Number of the plane that was scheduled to carry out the flight <br>
<b>ORIGIN_AIRPORT:</b> Location of the airport that the flight was scheduled to depart from <br>
<b>DESTINATION_AIRPORT:</b> Location of the airport that the flight was scheduled to arrive at<br>
<b>SCHEDULED_DEPARTURE:</b> Scheduled Departure time of flight<br>
<b>SCHEDULED_TIME:</b> Amount of time flight was scheduled to take<br>
<b>DISTANCE:</b> Distance between ORIGIN_AIRPORT and DESTINATION_AIRPORT<br>
<b>SCHEDULED_ARRIVAL:</b> Flight's scheduled time of arrival <br>
<b>CANCELLED:</b> Flight's cancellation status <br>

## Dataset Files

<b>public_flights.csv</b> - Dataset to train and analyze <br>
<b>pred_flights.csv</b>  - Dataset to predict flights' cancellation status

## Submission

The file should contain predictions made on the pred_flights.csv file, and it should have the following format:

<pre>
0
1
0
0
1
0
</pre>
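
One way to produce a file in this shape is sketched below. It assumes the submission is plain text with one 0/1 label per line and no header or index column, which is how the sample reads; the function name `write_submission` and the file name `submission.csv` are hypothetical.

```python
import os
import tempfile


def write_submission(predictions, path):
    """Write one 0/1 label per line, with no header and no index column."""
    with open(path, "w", encoding="utf-8") as handle:
        for label in predictions:
            handle.write(f"{int(label)}\n")


# Toy predictions standing in for the output of classifier.predict(X_test).
example_predictions = [0, 1, 0, 0, 1, 0]
submission_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(example_predictions, submission_path)

# Read the file back to confirm its layout.
with open(submission_path, encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()
print(written_lines)
```

In the notebook itself the same helper could be called with the `y_pred` array from whichever model is selected.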

## Acknowledgments
The flight cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.

## Importing the Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the datasets

In [None]:
dataset = pd.read_csv("public_flights.csv")
dataset_test = pd.read_csv("pred_flights.csv",header= None)

In [None]:
dataset

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,SCHEDULED_TIME,DISTANCE,SCHEDULED_ARRIVAL,CANCELLED
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,205.0,1448,430,0
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,280.0,2330,750,0
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,286.0,2296,806,0
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,285.0,2342,805,0
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,235.0,1448,320,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838853,2015,2,25,3,WN,249,N657SW,GSP,HOU,700,160.0,845,840,0
838854,2015,2,25,3,WN,4,N360SW,HOU,DAL,700,65.0,239,805,0
838855,2015,2,25,3,WN,9,N362SW,ICT,DAL,700,85.0,333,825,0
838856,2015,2,25,3,WN,584,N8613K,ISP,PBI,700,180.0,1052,1000,0
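
Because `pred_flights.csv` is read with `header=None`, its columns are labelled with integers. The sketch below shows how named columns could be attached instead, assuming the file lists the same columns in the same order as `public_flights.csv` minus `CANCELLED` (the source does not confirm this). The tail numbers and airport codes in the toy rows are made up.

```python
import io

import pandas as pd

# Column names taken from the dataset description, without the CANCELLED target.
feature_columns = [
    "YEAR", "MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER",
    "TAIL_NUMBER", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
    "SCHEDULED_DEPARTURE", "SCHEDULED_TIME", "DISTANCE", "SCHEDULED_ARRIVAL",
]

# Two header-less rows standing in for pred_flights.csv (identifiers are invented).
toy_csv = io.StringIO(
    "2015,2,25,3,WN,1046,N000XX,AAA,BBB,700,65,255,905\n"
    "2015,2,25,3,WN,2251,N111XX,CCC,DDD,700,80,345,820\n"
)
toy_test = pd.read_csv(toy_csv, header=None, names=feature_columns)

# With names in place, columns can be dropped by label rather than by position.
toy_test = toy_test.drop(columns=["TAIL_NUMBER", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"])
print(toy_test.columns.tolist())
```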


## Taking care of missing data

In [None]:
dataset.drop(['TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'],
  axis='columns', inplace=True)
dataset_test.drop([6, 7, 8],
  axis='columns', inplace=True)

dataset.dropna(inplace=True)
dataset_test.dropna(inplace=True)

# Feature matrix for the prediction set (pred_flights.csv has no CANCELLED column)
X_test = dataset_test.values

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [None]:
dataset

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,SCHEDULED_DEPARTURE,SCHEDULED_TIME,DISTANCE,SCHEDULED_ARRIVAL,CANCELLED
0,2015,1,1,4,AS,98,5,205.0,1448,430,0
1,2015,1,1,4,AA,2336,10,280.0,2330,750,0
2,2015,1,1,4,US,840,20,286.0,2296,806,0
3,2015,1,1,4,AA,258,20,285.0,2342,805,0
4,2015,1,1,4,AS,135,25,235.0,1448,320,0
...,...,...,...,...,...,...,...,...,...,...,...
838853,2015,2,25,3,WN,249,700,160.0,845,840,0
838854,2015,2,25,3,WN,4,700,65.0,239,805,0
838855,2015,2,25,3,WN,9,700,85.0,333,825,0
838856,2015,2,25,3,WN,584,700,180.0,1052,1000,0


In [None]:
dataset_test

Unnamed: 0,0,1,2,3,4,5,9,10,11,12
0,2015,2,25,3,WN,1046,700,65,255,905
1,2015,2,25,3,WN,2251,700,80,345,820
2,2015,2,25,3,WN,857,700,90,397,830
3,2015,2,25,3,WN,2864,700,295,2329,1455
4,2015,2,25,3,WN,3220,700,80,370,920
...,...,...,...,...,...,...,...,...,...,...
209712,2015,3,10,2,EV,4122,1013,96,416,1149
209713,2015,3,10,2,UA,1018,1013,264,1416,1337
209714,2015,3,10,2,UA,1260,1013,251,1723,1624
209715,2015,3,10,2,EV,4349,1013,149,837,1242


In [None]:
print(X)

[[2015 1 1 ... 205.0 1448 430]
 [2015 1 1 ... 280.0 2330 750]
 [2015 1 1 ... 286.0 2296 806]
 ...
 [2015 2 25 ... 85.0 333 825]
 [2015 2 25 ... 180.0 1052 1000]
 [2015 2 25 ... 65.0 236 805]]


## Encoding categorical data

### Encoding the Independent Variable

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [4])], remainder = 'passthrough' )
X = np.array(ct.fit_transform(X))
X_test = np.array(ct.transform(X_test))



In [None]:
print(X_test)
X_test.shape

[[0.0 1.0 0.0 ... 205.0 1448 430]
 [1.0 0.0 0.0 ... 280.0 2330 750]
 [0.0 0.0 0.0 ... 286.0 2296 806]
 ...
 [0.0 0.0 0.0 ... 85.0 333 825]
 [0.0 0.0 0.0 ... 180.0 1052 1000]
 [0.0 0.0 0.0 ... 65.0 236 805]]


(838856, 23)
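
To make the encoding step concrete, here is a minimal sketch on toy rows. It fits the encoder on the training rows only and reuses it for new rows. `handle_unknown="ignore"` and `sparse_threshold=0` are choices made for this example, not settings used in the notebook, and the airline code `ZZ` is invented to show what happens with a category unseen during fitting.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: column 0 is the airline code, column 1 the scheduled time.
X_train_toy = np.array([["AS", 205.0], ["AA", 280.0], ["WN", 65.0]], dtype=object)
X_new_toy = np.array([["AA", 90.0], ["ZZ", 120.0]], dtype=object)  # "ZZ" is unseen

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows, then apply the same mapping to the new rows.
encoded_train = np.asarray(ct_toy.fit_transform(X_train_toy), dtype=float)
encoded_new = np.asarray(ct_toy.transform(X_new_toy), dtype=float)

print(encoded_train.shape)
print(encoded_new)
```

The one-hot columns come first (one per airline seen during fitting, in sorted order), followed by the passthrough columns; an unseen airline becomes a row of zeros in the one-hot part.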

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = sc.fit_transform(X)
# Apply the training-set mean and standard deviation to the prediction set
X_test = sc.transform(X_test)

In [None]:
print(X)

[[-0.32093887  5.8584759  -0.2196386  ...  0.86939075  1.0878604
  -2.22242748]
 [ 3.11585817 -0.17069286 -0.2196386  ...  1.87443536  2.57262311
  -1.56166446]
 [-0.32093887 -0.17069286 -0.2196386  ...  1.95483893  2.51538736
  -1.44603093]
 ...
 [-0.32093887 -0.17069286 -0.2196386  ... -0.73868061 -0.78913554
  -1.40679813]
 [-0.32093887 -0.17069286 -0.2196386  ...  0.53437589  0.42123224
  -1.04544335]
 [-0.32093887 -0.17069286 -0.2196386  ... -1.0066925  -0.95242577
  -1.44809581]]
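
A scaler is normally fitted on the training data and then applied unchanged to any new data, so that both sets share the same mean and standard deviation. The toy values below are invented to show how the result differs when a second scaler is fitted on the new data instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_values = np.array([[0.0], [2.0], [4.0]])  # mean 2, variance 8/3
new_values = np.array([[4.0], [6.0]])

# Fit once on the training values, then reuse the fitted statistics.
scaler = StandardScaler().fit(train_values)
reused = scaler.transform(new_values)

# Fitting again on the new values centres them on their own mean instead.
refit = StandardScaler().fit_transform(new_values)

print(reused.ravel())
print(refit.ravel())
```

With the reused scaler both new values land above zero, reflecting that they are larger than the training mean; the refitted scaler hides that shift.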


## Applying LDA

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# A binary target allows at most n_classes - 1 = 1 discriminant component
lda = LDA(n_components = 1)
X = lda.fit_transform(X, y)



In [None]:
X_test = lda.transform(X_test)
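
LDA can produce at most min(n_features, n_classes - 1) discriminant axes, so a binary target such as CANCELLED yields a single component. The sketch below checks this on synthetic data from `make_classification`; the sample and feature counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class problem with five features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_classes=2, random_state=0
)

# Leaving n_components unset keeps min(n_classes - 1, n_features) components.
lda_toy = LinearDiscriminantAnalysis()
X_projected = lda_toy.fit_transform(X_toy, y_toy)
print(X_projected.shape)
```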

## Training CatBoost on the Training set

In [None]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/20/37/bc4e0ddc30c07a96482abf1de7ed1ca54e59bba2026a33bca6d2ef286e5b/catboost-0.24.4-cp36-none-manylinux1_x86_64.whl (65.7MB)
Installing collected packages: catboost
Successfully installed catboost-0.24.4


In [None]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X, y)

Learning rate set to 0.182536
0:	learn: 0.4412931	total: 313ms	remaining: 5m 13s
1:	learn: 0.3078443	total: 580ms	remaining: 4m 49s
2:	learn: 0.2359920	total: 841ms	remaining: 4m 39s
3:	learn: 0.1965494	total: 1.1s	remaining: 4m 34s
4:	learn: 0.1744341	total: 1.35s	remaining: 4m 28s
5:	learn: 0.1614282	total: 1.61s	remaining: 4m 26s
6:	learn: 0.1535948	total: 1.85s	remaining: 4m 23s
7:	learn: 0.1486579	total: 2.11s	remaining: 4m 21s
8:	learn: 0.1456868	total: 2.36s	remaining: 4m 19s
9:	learn: 0.1437411	total: 2.61s	remaining: 4m 18s
10:	learn: 0.1425568	total: 2.84s	remaining: 4m 15s
11:	learn: 0.1417306	total: 3.08s	remaining: 4m 13s
12:	learn: 0.1411706	total: 3.32s	remaining: 4m 11s
13:	learn: 0.1408061	total: 3.56s	remaining: 4m 11s
14:	learn: 0.1405779	total: 3.81s	remaining: 4m 9s
15:	learn: 0.1404075	total: 4.04s	remaining: 4m 8s
16:	learn: 0.1402996	total: 4.29s	remaining: 4m 7s
17:	learn: 0.1402258	total: 4.53s	remaining: 4m 7s
18:	learn: 0.1401798	total: 4.76s	remaining: 4m 5

<catboost.core.CatBoostClassifier at 0x7f0df2069c88>

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Streaming output truncated to the last 5000 lines.
6:	learn: 0.1579937	total: 1.87s	remaining: 4m 25s
7:	learn: 0.1524014	total: 2.11s	remaining: 4m 21s
8:	learn: 0.1488588	total: 2.36s	remaining: 4m 19s
9:	learn: 0.1465919	total: 2.58s	remaining: 4m 15s
10:	learn: 0.1451159	total: 2.82s	remaining: 4m 13s
11:	learn: 0.1441320	total: 3.08s	remaining: 4m 13s
12:	learn: 0.1434827	total: 3.33s	remaining: 4m 12s
13:	learn: 0.1430414	total: 3.57s	remaining: 4m 11s
14:	learn: 0.1427647	total: 3.8s	remaining: 4m 9s
15:	learn: 0.1425665	total: 4.04s	remaining: 4m 8s
16:	learn: 0.1424335	total: 4.31s	remaining: 4m 9s
17:	learn: 0.1423348	total: 4.54s	remaining: 4m 7s
18:	learn: 0.1422709	total: 4.75s	remaining: 4m 5s
19:	learn: 0.1422307	total: 4.99s	remaining: 4m 4s
20:	learn: 0.1421926	total: 5.24s	remaining: 4m 4s
21:	learn: 0.1421717	total: 5.46s	remaining: 4m 2s
22:	learn: 0.1421486	total: 5.66s	remaining: 4m
23:	learn: 0.1421336	total: 5.87s	remaining: 3m 58s
24:	learn: 0.142

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

## Training the Random Forest Classification model on the Training set

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 100, criterion = "entropy", max_depth = 5, random_state = 0)
classifier.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.58 %
Standard deviation: 0.00 %
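
A cross-validated accuracy whose standard deviation rounds to 0.00 % is worth comparing against a majority-class baseline: if cancellations are rare, a model that always predicts 0 can reach a similar score. The class balance is not stated here, so the sketch below uses synthetic labels with an assumed 5 % positive rate purely to show the comparison.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))  # random features, unrelated to the labels
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1  # assumed imbalance: 5 % positives

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=10)

print("Baseline accuracy: {:.2f} %".format(baseline_scores.mean() * 100))
print("Baseline standard deviation: {:.2f} %".format(baseline_scores.std() * 100))
```

If a trained model scores no higher than this kind of baseline on the real data, accuracy alone may not show whether it has learned anything about cancellations.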


## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

## Training the Naive Bayes model on the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(X, y)

GaussianNB(priors=None, var_smoothing=1e-09)

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.09 %
Standard deviation: 0.83 %


## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

## Training the Logistic Regression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)

classifier.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.58 %
Standard deviation: 0.00 %


## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

## Training the Kernel SVM model on the Training set

In [None]:
from sklearn.svm import SVC

classifier = SVC(kernel = "rbf")
classifier.fit(X, y)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

## Applying Grid Search to find the best model and the best parameters

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = [{'C': np.arange(0.1, 1.1, 0.1), 'kernel': ['linear']}, 
              {'C': np.arange(0.1, 1.1, 0.1), 'kernel': ['rbf'], 'gamma': np.arange(0.1, 1.1, 0.1)}]

grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X, y)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Accuracy = {:.2f} %".format(best_accuracy*100))
print("Best Parameters: ")
# 'gamma' is absent when a linear-kernel candidate wins, so print whatever keys exist
for name, value in best_parameters.items():
    print(name, ':', value)
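
Before launching a search like the one above, it can help to count how many candidates the grid contains, since each one is fitted once per fold. The sketch below does that with `ParameterGrid` for the same grid, then runs a much smaller search on synthetic data to show a reporting loop that works whichever kernel wins. The small grid, the synthetic data, and `cv=3` are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Same grid as in the cell above, used here only for counting.
c_values = np.arange(0.1, 1.1, 0.1)
gamma_values = np.arange(0.1, 1.1, 0.1)
full_grid = [
    {"C": c_values, "kernel": ["linear"]},
    {"C": c_values, "kernel": ["rbf"], "gamma": gamma_values},
]
n_candidates = len(ParameterGrid(full_grid))
print("candidates:", n_candidates, "fits with cv=10:", n_candidates * 10)

# A much smaller grid on synthetic data, only to show the reporting pattern.
X_toy, y_toy = make_classification(n_samples=120, n_features=4, random_state=0)
small_grid = [
    {"C": [0.5, 1.0], "kernel": ["linear"]},
    {"C": [0.5, 1.0], "kernel": ["rbf"], "gamma": [0.1, 1.0]},
]
search = GridSearchCV(SVC(), param_grid=small_grid, scoring="accuracy", cv=3)
search.fit(X_toy, y_toy)

# best_params_ only holds the keys of the winning sub-grid.
for name, value in search.best_params_.items():
    print(name, ":", value)
```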

## Training the K-NN model on the Training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 5, p = 2, metric = 'minkowski').fit(X, y)


## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

## Training the Decision Tree Classification model on the Training set

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = "entropy", max_depth = 5)
classifier.fit(X, y)

## Computing the accuracy with k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

## Training XGBoost on the Training set

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X, y)

In [None]:
from xgboost import XGBRFClassifier

classifier1 = XGBRFClassifier()
classifier1.fit(X, y)

## Applying k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier1, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

## Predicting values

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier1.predict(X_test)
y_pred