# Description   

Let's practice and become familiar with regression.

Level 1

- Exercise 1

Create at least three different regression models to try to predict as well as possible the flight delay (ArrDelay) of DelayedFlights.csv.

- Exercise 2

Compare them on the basis of MSE and $R^2$.

- Exercise 3

Train them using the different parameters they allow.

- Exercise 4

Compare their performance using the test/train approach or using all data (internal validation).

Level 2

- Exercise 5

Perform some feature engineering to improve your prediction.

Level 3

- Exercise 6

Do not use the DepDelay variable when making predictions.

# Level 1  

## - Exercise 1

Create at least three different regression models to try to predict as well as possible the flight delay (ArrDelay) of DelayedFlights.csv.

In [8]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [9]:
# Display all columns (the default limit is 20; None removes the limit)
pd.set_option("display.max_columns", None)

In [10]:
# Import the cleaned and sampled train and test datasets from the previous task.
# Forward slashes avoid invalid escape sequences such as '\d' and work on every OS.
df_train = pd.read_csv('../data/DelayedFlights_train.csv')
df_test  = pd.read_csv('../data/DelayedFlights_test.csv')

In [11]:
# Let's explore the dataset
df_train.head()

Unnamed: 0.1,Unnamed: 0,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut,Date,UniqueCarrier_9E,UniqueCarrier_AA,UniqueCarrier_AQ,UniqueCarrier_AS,UniqueCarrier_B6,UniqueCarrier_CO,UniqueCarrier_DL,UniqueCarrier_EV,UniqueCarrier_F9,UniqueCarrier_FL,UniqueCarrier_HA,UniqueCarrier_MQ,UniqueCarrier_NW,UniqueCarrier_OH,UniqueCarrier_OO,UniqueCarrier_UA,UniqueCarrier_US,UniqueCarrier_WN,UniqueCarrier_XE,UniqueCarrier_YV
0,1212168,2,1.396108,1.553827,1.265258,1.312513,-0.781188,-0.790736,-0.818653,-0.423085,-0.453348,-0.68307,-0.341142,0.130433,2008-07-29,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,1623304,2,-0.699602,-0.997079,-0.540408,-0.920752,-0.90502,-0.832512,-0.804233,0.577369,0.739595,-0.783172,-0.154948,-0.654312,2008-10-07,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,914605,4,0.049503,0.125887,0.575125,0.600955,0.402102,0.490397,0.551251,-0.536344,-0.453348,0.708,-0.713531,-0.36895,2008-06-26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,642853,4,-0.445441,-0.394224,0.222755,0.198024,0.801118,0.936008,0.868492,-0.630727,-0.493787,0.918559,-0.341142,-0.012248,2008-04-10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,112693,5,0.004914,-0.082158,0.184414,-0.012015,-0.340894,-0.317274,-0.227431,0.067703,0.112794,-0.334439,0.21744,-0.725652,2008-01-11,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [12]:
# Let's drop the leftover index column 'Unnamed: 0'
# (note that 'Unnamed: 0.1', shown in the output above, is still present)
df_train = df_train.drop(columns='Unnamed: 0')

## 1st model: Linear regression between DepDelay and ArrDelay.

In [13]:
# Our Y or Target is ArrDelay:
y = df_train.ArrDelay
type(y)

pandas.core.series.Series

In [14]:
# The most natural single predictor X is DepDelay
x = df_train.DepDelay

In [15]:
# Fit the model ( calculate b0 and b1 for the model y = b0 + b1x )
model = LinearRegression().fit(x.array.reshape(-1, 1),y.array)

In [16]:
# Calculate R2 on the training data to see how well the model fits.
r_sq = model.score(x.array.reshape(-1, 1),y.array)
print('coefficient of determination: %.3f' %r_sq)

coefficient of determination: 0.895


### Observation 1  
As we expected, the fit is very good ($R^2 = 0.895$ on the training set).  
We don't think this points to an overfitting problem: the model has a single feature and only two parameters.  
It simply reflects the strong and fairly obvious dependency between Arrival Delay and Departure Delay.
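
Exercise 2 asks for MSE as well as $R^2$, and the score above is computed on the training data only. The sketch below shows one way to report both metrics on a train and a held-out split. It uses synthetic data as a stand-in for DepDelay and ArrDelay (the slope, the noise level and the `report` helper are assumptions for illustration, not values from DelayedFlights.csv).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for (DepDelay, ArrDelay): a linear relation plus noise.
rng = np.random.default_rng(0)
dep_delay = rng.normal(size=1000)
arr_delay = 0.95 * dep_delay + rng.normal(scale=0.3, size=1000)

# Hold out the last 200 rows as a test set.
X_train, X_test = dep_delay[:800].reshape(-1, 1), dep_delay[800:].reshape(-1, 1)
y_train, y_test = arr_delay[:800], arr_delay[800:]

model = LinearRegression().fit(X_train, y_train)

def report(name, X, y):
    """Print and return MSE and R2 of the fitted model on (X, y)."""
    pred = model.predict(X)
    mse = mean_squared_error(y, pred)
    r2 = r2_score(y, pred)
    print(f"{name}: MSE={mse:.3f}  R2={r2:.3f}")
    return mse, r2

train_mse, train_r2 = report("train", X_train, y_train)
test_mse, test_r2 = report("test", X_test, y_test)
```

With the real data, `X_train`/`y_train` would come from `df_train` and `X_test`/`y_test` from `df_test`.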

## 2nd model: Multiple Linear Regression

In [17]:
# Our Y or Target is ArrDelay:
y = df_train.ArrDelay
y = y.array # Extract the underlying pandas array (a PandasArray, not a NumPy ndarray)
type(y)

pandas.core.arrays.numpy_.PandasArray

In [18]:
# Our X is now every column in df_train except ArrDelay and Date (this includes the one-hot-encoded UniqueCarrier columns)
x = df_train.drop(columns=["ArrDelay","Date"])
feature_list = list(x.columns) # Saving feature names for later use
x = x.to_numpy() # Convert dataframe to array
type(x)

numpy.ndarray

In [19]:
# Fit the model ( estimate the intercept b0 and one coefficient per feature: y = b0 + b1*x1 + ... + bn*xn )
model = LinearRegression().fit(x,y)

In [20]:
# Calculate R2 on the training data to see how well the model fits.
r_sq = model.score(x,y)
print('coefficient of determination: %.3f' %r_sq)

coefficient of determination: 1.000


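### Observation 2  
The score of the model is the maximum ($R^2 = 1.000$ on the training set).  
That is a perfect fit, meaning that the model explains all the variance of the target, as if the target were an exact linear function of the attributes.  
A plausible reason is that the features include DepDelay, ActualElapsedTime and CRSElapsedTime, and the arrival delay should satisfy ArrDelay = DepDelay + ActualElapsedTime - CRSElapsedTime, a relation that stays linear after standardization. This is our assumption about the dataset; it has not been verified here.  
In any case, this is our $R^2$ with the train set only. Later, we will test it against the test set to see whether the fit remains perfect there.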
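
To illustrate why a linear model can score $R^2 = 1$ on both train and test data, here is a minimal sketch on synthetic data. It assumes the identity ArrDelay = DepDelay + ActualElapsedTime - CRSElapsedTime holds; the distributions and sample sizes are invented for illustration and are not taken from DelayedFlights.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000

# Hypothetical raw variables, in minutes.
dep_delay = rng.exponential(scale=30, size=n)
crs_elapsed = rng.uniform(60, 300, size=n)
actual_elapsed = crs_elapsed + rng.normal(scale=10, size=n)

# Assumed identity: the target is an exact linear combination of the features.
arr_delay = dep_delay + actual_elapsed - crs_elapsed

# Standardize features and target, as in the cleaned dataset.
X = StandardScaler().fit_transform(
    np.column_stack([dep_delay, actual_elapsed, crs_elapsed])
)
y = (arr_delay - arr_delay.mean()) / arr_delay.std()

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(f"train R2={r2_train:.6f}  test R2={r2_test:.6f}")
```

If the identity really holds in the dataset, the perfect score says little about model quality, which is one motivation for dropping DepDelay in Exercise 6.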

## 3rd model: Random Forest

In [21]:
# Let's create the model of Random Forest
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(x, y)

RandomForestRegressor(n_estimators=1000, random_state=42)

In [22]:
# Let's plot a single tree from a smaller forest.
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Build a small forest: 10 trees, each limited to a depth of 3 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth = 3)
rf_small.fit(x, y)
# Extract one tree (index 5) from the small forest
tree_small = rf_small.estimators_[5]
# Export the tree to a .dot file, then render it as a png image
export_graphviz(tree_small, out_file = 'small_tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png');

### Graph created "small_tree.png" 
![](small_tree.png)
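
The forest above is fitted but not yet scored. As a rough sketch of how it could be evaluated, the block below compares train and test $R^2$ and reads the feature importances. The data is synthetic, the feature names `f0` to `f3` are hypothetical, and `n_estimators=100` is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Only the first two columns carry signal; the other two are pure noise.
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.5, size=600)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
r2_train = rf.score(X_train, y_train)
r2_test = rf.score(X_test, y_test)
print(f"train R2={r2_train:.3f}  test R2={r2_test:.3f}")

# Importances sum to 1; larger values mean the feature was used more in splits.
importances = dict(zip(feature_names, rf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

With the notebook's data, `feature_list` would play the role of `feature_names`.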

## 4th model: Neural Network

In [6]:
# Import modules
from sklearn.neural_network import MLPRegressor

In [23]:
# Create model
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
mlp.fit(x,y)

predict_train = mlp.predict(x)
# predict_test = mlp.predict(X_test)
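
The predictions above are not scored yet, and `X_test` is not defined at this point. A minimal sketch of how the network could be evaluated on held-out data follows. It uses synthetic data, and `max_iter=1000` and `random_state=42` are assumptions added so the example converges and is reproducible.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
# Mildly non-linear synthetic target; the third column is unused noise.
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Same architecture as above: three hidden layers of 30 units each.
mlp = MLPRegressor(hidden_layer_sizes=(30, 30, 30), max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)

predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)

mse_train = mean_squared_error(y_train, predict_train)
mse_test = mean_squared_error(y_test, predict_test)
r2_train = r2_score(y_train, predict_train)
r2_test = r2_score(y_test, predict_test)
print(f"train: MSE={mse_train:.3f}  R2={r2_train:.3f}")
print(f"test:  MSE={mse_test:.3f}  R2={r2_test:.3f}")
```

Neural networks are sensitive to feature scale; the notebook's features already look standardized, so no extra scaling step is shown.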


In [None]:
# Cross Validation
# https://www.analyticsvidhya.com/blog/2021/11/top-7-cross-validation-techniques-with-python-code/

# from sklearn.datasets import load_iris
# from sklearn.model_selection import cross_val_score,KFold
# from sklearn.linear_model import LogisticRegression
# iris=load_iris()
# X=iris.data
# Y=iris.target
# logreg=LogisticRegression()
# kf=KFold(n_splits=5)
# score=cross_val_score(logreg,X,Y,cv=kf)
# print("Cross Validation Scores are {}".format(score))
# print("Average Cross Validation score :{}".format(score.mean()))
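
The commented-out snippet above scores a classifier on the iris data. For the regression models in this notebook, a comparable sketch could use `cross_validate` with regression metrics. The data below is synthetic and the coefficients are invented; with the real data, `x` and `y` from the cells above would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Shuffle before splitting so each fold is a random sample of the rows.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv = cross_validate(
    LinearRegression(), X, y, cv=kf,
    scoring=("r2", "neg_mean_squared_error"),
)

r2_scores = cv["test_r2"]
# scikit-learn reports MSE negated so that higher is always better; flip the sign.
mse_scores = -cv["test_neg_mean_squared_error"]

print("R2 per fold: ", np.round(r2_scores, 3))
print("MSE per fold:", np.round(mse_scores, 3))
print(f"Mean R2={r2_scores.mean():.3f}  Mean MSE={mse_scores.mean():.3f}")
```

Any of the other estimators (`RandomForestRegressor`, `MLPRegressor`) can be passed in place of `LinearRegression()` to compare the models under the same folds.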