
# [Level 3](#)

## Exercise 6 - Do not use the DepDelay variable when making predictions.

This is the second execution of the whole notebook, this time without the DepDelay variable.  
This shows exactly what effect removing this highly dependent variable has.

# [Level 1](#)

## Exercise 1

Create at least three different regression models to try to predict as well as possible the flight delay (ArrDelay) of DelayedFlights.csv.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [2]:
# settings to display all columns (default is 20, now is None (all))
pd.set_option("display.max_columns", None)

In [3]:
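# Import the cleaned and sampled train and test datasets from the previous Task.
# Raw strings keep the backslashes in the Windows paths from being read as escape sequences.
df_train = pd.read_csv(r'..\data\DelayedFlights_train.csv')
df_test  = pd.read_csv(r'..\data\DelayedFlights_test.csv')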

### Explanation of the Train / Test Sample 
* It is imported from the previous Task (S09T01).  
* It is 1% of the original dataset, randomly sampled and stratified by Airline.
* It is split into 33% test and 67% train.
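
A sample like this could be produced roughly as follows. This is a minimal sketch on synthetic data, not the code from S09T01: the `flights` frame, its size and values, and the `Airline` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the full dataset (hypothetical values and column names)
rng = np.random.default_rng(0)
flights = pd.DataFrame({
    "Airline": rng.choice(["AA", "UA", "WN"], size=30_000, p=[0.5, 0.3, 0.2]),
    "ArrDelay": rng.normal(40, 30, size=30_000),
})

# Step 1: keep 1% of the rows, preserving the share of each airline
sample, _ = train_test_split(
    flights, train_size=0.01, stratify=flights["Airline"], random_state=0
)

# Step 2: split the sample into roughly 67% train and 33% test, again stratified
train, test = train_test_split(
    sample, test_size=0.33, stratify=sample["Airline"], random_state=0
)

print(len(sample), len(train), len(test))
print(train["Airline"].value_counts(normalize=True).round(2))
```

Stratifying both steps keeps the airline proportions of the train and test sets close to those of the full data, which matters when some airlines are rare.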

In [4]:
# Let's delete the first column
df_train = df_train.drop(columns='Unnamed: 0')
df_test  = df_test.drop(columns='Unnamed: 0')

### Deletion of DepDelay attribute 


In [5]:
# Let's delete the DepDelay column
df_train = df_train.drop(columns='DepDelay')
df_test  = df_test.drop(columns='DepDelay')

## 1st model: Linear regression between DepDelay and ArrDelay.

### Observation 1:  
Not applicable (N/A) now that the DepDelay attribute has been removed.  
Let's recall the conclusions of the Linear regression with DepDelay from the previous execution:  
* The $R^2$ of the model was very high (0.895 with train data and 0.902 with test data).  
* The dependency between Arrival Delay and Departure Delay is quite obvious.

## 2nd model: Multiple Linear Regression

In [6]:
# Our Y or Target is ArrDelay:
y_train = df_train.ArrDelay
y_train = y_train.array # Convert pandas series to numpy array
type(y_train)
y_test  = df_test.ArrDelay.array


In [7]:
# Our X now is going to be all the columns in df_train except ArrDelay and Date (also the OHE of Airline)
X_train = df_train.drop(columns=["ArrDelay","Date"])
feature_list = list(X_train.columns) # Saving feature names for later use
X_train = X_train.to_numpy() # Convert dataframe to array
type(X_train)
X_test = df_test.drop(columns=["ArrDelay","Date"]).to_numpy()

In [8]:
# Fit the model (calculate b0...bn for the multiple linear model y = b0 + b1x1 +... + bnxn)
model2 = LinearRegression().fit(X_train,y_train)

In [9]:
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train2 = model2.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train2)

coefficient of determination with train data: 0.206


In [10]:
# Calculate R2 to see the accuracy of the model with test data.
r_sq_test2 = model2.score(X_test,y_test)
print('coefficient of determination with test data: %.3f' %r_sq_test2)

coefficient of determination with test data: -97068975798565535744.000


In [11]:
# Applying k-Fold Cross Validation (CV) with train data
accuracies2a = cross_val_score(estimator = model2, X=X_train, y=y_train , cv = 10)
print("Multiple Linear Regression:\n Accuracy with train data:", accuracies2a.mean(), "+/-", accuracies2a.std(),"\n")

Multiple Linear Regression:
 Accuracy with train data: 0.19984159117574923 +/- 0.03683768952188743 



In [12]:
# Applying k-Fold Cross Validation (CV) with test data
accuracies2b = cross_val_score(estimator = model2, X=X_test, y=y_test , cv = 10)
print("Multiple Linear Regression:\n Accuracy with test data:", accuracies2b.mean(), "+/-", accuracies2b.std(),"\n")

Multiple Linear Regression:
 Accuracy with test data: 0.20230081003217393 +/- 0.08631821596875758 



In [13]:
# Let's calculate the prediction of the model
y_pred2 = model2.predict(X_train)
y_pred2.shape

(12919,)

### Observation 2:  
The $R^2$ of the model with train data is very low (0.206).   
The $R^2$ with test data is extremely bad (-97068975798565535744). $R^2$ becomes negative when the model predicts worse than simply using the mean of the observed values, and it can be arbitrarily negative.   
That is a very bad model, which suggests that most of the explanatory power of the previous linear model came from the DepDelay variable. 

We also apply k-Fold Cross Validation with train data and test data (for a regressor, the score returned by `cross_val_score` is $R^2$ by default). The results are low, but not as bad (train: 0.200, test: 0.202).
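
For reference, $R^2 = 1 - SS_{res}/SS_{tot}$ compares the squared error of the model with that of a constant prediction equal to the mean of the observed values, so it drops below zero whenever the model does worse than that baseline. The sketch below uses made-up numbers (not the flight data) to show the three cases; `r_squared` is a helper written for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y, y)            # predictions equal to the targets
r2_mean = r_squared(y, [2.5] * 4)       # always predicting the mean
r2_bad = r_squared(y, [10.0] * 4)       # predictions far from every target

print(r2_perfect, r2_mean, r2_bad)  # 1.0 0.0 -45.0
```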

## 3rd model: Random Forest Regression

In [14]:
# Let's create the model of Random Forest
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 10 decision trees
model3 = RandomForestRegressor(n_estimators = 10, random_state = 42)
# Train the model on training data
model3.fit(X_train, y_train)

RandomForestRegressor(n_estimators=10, random_state=42)

In [29]:
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train3 = model3.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train3)

coefficient of determination with train data: 0.986


In [30]:
# Calculate R2 to see the accuracy of the model with test data.
r_sq_test3 = model3.score(X_test,y_test)
print('coefficient of determination with test data: %.3f' %r_sq_test3)

coefficient of determination with test data: 0.927


In [35]:
# Applying k-Fold Cross Validation (CV) with train data (we could/should do it with all the data - this will be done on Exercise 4)
accuracies3a = cross_val_score(estimator = model3, X=X_train, y=y_train , cv = 10)
print("RandomForest Regression:\n Accuracy: %.3f"%accuracies3a.mean(),"+/- %.3f"%accuracies3a.std(),"\n")

RandomForest Regression:
 Accuracy: 0.927 +/- 0.018 



In [37]:
# Applying k-Fold Cross Validation (CV) with test data
accuracies3b = cross_val_score(estimator = model3, X=X_test, y=y_test , cv = 10)
print("RandomForest Regression:\n Accuracy: %.3f"%accuracies3b.mean(),"+/- %.3f"%accuracies3b.std(),"\n")

RandomForest Regression:
 Accuracy: 0.865 +/- 0.059 



In [20]:
# Let's calculate the prediction of the model
y_pred3 = model3.predict(X_train)
y_pred3.shape

(12919,)

### Observation 3.
* Now we can really know if the random forest regression model is good or not!
* And the answer is that it is very good! With test data, the $R^2$ is 0.927 and the Cross Validation score is quite good (0.865 +/- 0.059).
* So, even without the DepDelay variable, the random forest model can predict test data with good accuracy.
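
One way to investigate which features carry this predictive power is to rank the forest's impurity-based feature importances. The sketch below runs on synthetic data with hypothetical feature names; in the notebook the corresponding inputs would presumably be `model3` and `feature_list`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): only the first feature drives the target
rng = np.random.default_rng(42)
names = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Pair each feature name with its importance and sort from most to least important
ranked = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

If a single remaining feature dominates the ranking, it may be carrying nearly the same information as the removed variable, which would be worth checking before trusting the score.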

## 4th model: Neural Network Regression

In [21]:
# Import modules
from sklearn.neural_network import MLPRegressor

In [22]:
# Create model
model4 = MLPRegressor(activation='relu',solver='adam',random_state=1, max_iter=500)
model4.fit(X_train,y_train)

MLPRegressor(max_iter=500, random_state=1)

In [23]:
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train4 = model4.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train4)

coefficient of determination with train data: 0.974


In [24]:
# Calculate R2 to see the accuracy of the model with test data. 
r_sq_test4 = model4.score(X_test,y_test)
print('coefficient of determination with test data: %.3f' %r_sq_test4)

coefficient of determination with test data: 0.951


In [38]:
# Applying k-Fold Cross Validation (CV) with train data (we could/should do it with all the data - this will be done on Exercise 4)
accuracies4a = cross_val_score(estimator = model4, X=X_train, y=y_train , cv = 10)
print("Neural Network Regression:\n Accuracy: %.3f"%accuracies4a.mean(),"+/- %.3f"%accuracies4a.std(),"\n")

Neural Network Regression:
 Accuracy: 0.951 +/- 0.013 



In [39]:
# Applying k-Fold Cross Validation (CV) with test data (we could/should do it with all the data - this will be done on Exercise 4)
accuracies4b = cross_val_score(estimator = model4, X=X_test, y=y_test , cv = 10)
print("Neural Network Regression:\n Accuracy: %.3f"%accuracies4b.mean(),"+/- %.3f"%accuracies4b.std(),"\n")

Neural Network Regression:
 Accuracy: 0.938 +/- 0.037 



In [27]:
# Let's calculate the prediction of the model
y_pred4 = model4.predict(X_train)
y_pred4.shape

(12919,)

### Observation 4.
* Again, now we can really know if the neural network regression model is good or not!
* And the answer is that it is very good! With test data, the $R^2$ is 0.951 and the Cross Validation score is quite good (0.938 +/- 0.037).
* So, even without the DepDelay variable, the neural network model can predict test data with good accuracy.

## Exercise 2 - Compare them on the basis of MSE and R2.
 

$R^2$ has been calculated before; we copy the results here:


In [41]:
print("-------Multiple Linear Regression Model------------")
print('R^2 - coefficient of determination with train data: %.3f' %r_sq_train2)
print('R^2 - coefficient of determination with test data: %.3f' %r_sq_test2)
print("--------------Random Forest Regression Model------------")
print('R^2 - coefficient of determination with train data: %.3f' %r_sq_train3)
print('R^2 - coefficient of determination with test data: %.3f' %r_sq_test3)
print("-------Neural Network Regression Model------------")
print('R^2 - coefficient of determination with train data: %.3f' %r_sq_train4)
print('R^2 - coefficient of determination with test data: %.3f' %r_sq_test4)

-------Multiple Linear Regression Model------------
R^2 - coefficient of determination with train data: 0.206
R^2 - coefficient of determination with test data: -97068975798565535744.000
--------------Random Forest Regression Model------------
R^2 - coefficient of determination with train data: 0.986
R^2 - coefficient of determination with test data: 0.927
-------Neural Network Regression Model------------
R^2 - coefficient of determination with train data: 0.974
R^2 - coefficient of determination with test data: 0.951


### Observation 5.
* Now we see that the Multiple Linear Regression Model is not a good model for this dataset.
* But Random Forest is a good one (test $R^2$ of 0.927) and Neural Network is even better (0.951).


MSE has not been calculated yet; we calculate it now with train data and print the results here:

In [43]:
from sklearn.metrics import mean_squared_error
print("-------Multiple Linear Regression Model------------")
print('MSE - Mean Square Error with train data: %.3f' %mean_squared_error(y_train,y_pred2))
print("--------Random Forest Regression Model------------")
print('MSE - Mean Square Error with train data: %.3f' %mean_squared_error(y_train,y_pred3))
print("-------Neural Network Regression Model------------")
print('MSE - Mean Square Error with train data: %.3f' %mean_squared_error(y_train,y_pred4))


-------Multiple Linear Regression Model------------
MSE - Mean Square Error with train data: 0.794
--------Random Forest Regression Model------------
MSE - Mean Square Error with train data: 0.014
-------Neural Network Regression Model------------
MSE - Mean Square Error with train data: 0.026


### Observation 6.
* Again, the Multiple Linear Regression model is the worst: its Mean Square Error with train data is the highest (0.794).
* Random Forest is the best one (0.014).
* Neural Network is second, close to the first (0.026).
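
The two metrics are linked: on the same data, $R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$. For each model the train values reported above add up to about 1 (0.206 + 0.794, 0.986 + 0.014, 0.974 + 0.026), which is what this identity gives when the target has unit variance, as a standardized target would. The sketch below checks the identity on synthetic data; the arrays and coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem (assumption, not the flight data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=200)

# Standardize the target so that its (population) variance is 1
y = (y - y.mean()) / y.std()

y_hat = LinearRegression().fit(X, y).predict(X)
mse = mean_squared_error(y, y_hat)
r2 = r2_score(y, y_hat)

# With Var(y) = 1 the identity reduces to R^2 + MSE = 1
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}, sum = {mse + r2:.3f}")
```

Because of this link, ranking models by train MSE and by train $R^2$ on the same data gives the same order; computing the MSE on the test set as well would make the comparison with the test $R^2$ values complete.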

## Exercise 3 - Train them using the different parameters they allow.


### 3.1.- Multiple Linear Regression Model


In [45]:
# This is our original model:
# Fit the model (calculate b0...bn for the multiple linear model y = b0 + b1x1 +... + bnxn)
model2 = LinearRegression().fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train2 = model2.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train2)


coefficient of determination with train data: 0.206


In [46]:
# Let's now use different parameters: 
model2 = LinearRegression(fit_intercept=False).fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train2 = model2.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train2)

# We don't change normalize= because it is going to be deprecated and we have normalized before.
# copy_X only controls whether X is copied. By default it is copied (not changed); if we use False, X may be overwritten.

coefficient of determination with train data: 0.206


In [48]:
# n_jobs: number of jobs to use for the computation. By default is None, interpreted usually as 1 CPU.
# if we use -1 all CPUs will be used in Parallel. Only effective with large datasets. Right now our 
# computational time for this model is 0.1, no improvement expected.
model2 = LinearRegression(n_jobs=-1).fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train2 = model2.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train2)

coefficient of determination with train data: 0.206


#### Observation 7.  
* No improvement / change with the parameters used, speed is the same (0.1s) and $R^2$ is the same (0.206)

### 3.2.- Random Forest Regression Model


In [49]:
# Instantiate model with 100 decision trees
model3 = RandomForestRegressor(n_estimators = 100, random_state = 42)
# Train the model on training data
model3.fit(X_train, y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train3 = model3.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train3)

coefficient of determination with train data: 0.991


In [51]:
# Let's change parameters, but with R2 of 0.991 and speed 12.3s is going to be difficult to see differences.

# The most important parameter is n_estimators, how many "trees" are in the forest. 
# By default is 100, let's change it to 10
model3 = RandomForestRegressor(n_estimators = 10, random_state = 42)
# Train the model on training data
model3.fit(X_train, y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train3 = model3.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train3)


coefficient of determination with train data: 0.986


In [52]:
# The speed has improved to 1.5s and R2 is still very high 0.986. 
# If we change it to 1000 trees, R2 is the same, but time is 15h.

# criterion parameter also changes the function to measure the quality of a split.
# By default is squared_error, let's change it to absolute_error (poisson is not possible because there are negative values on y array)
model3 = RandomForestRegressor(n_estimators = 10, criterion='absolute_error', random_state = 42)
# Train the model on training data
model3.fit(X_train, y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train3 = model3.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train3)

coefficient of determination with train data: 0.969


In [55]:
# There are many more parameters to change; let's make one last change. To try to improve the speed, let's fit
# the previous model again, but using all CPUs in parallel.
model3 = RandomForestRegressor(n_estimators = 10, criterion='absolute_error', n_jobs= -1, random_state = 42)
# Train the model on training data
model3.fit(X_train, y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train3 = model3.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train3)

coefficient of determination with train data: 0.969


#### Observation 8.  
* The speed has indeed improved with n_jobs = -1, from 70s to 20s. Note that CPU usage goes to 100%!
* $R^2$ with default parameters was 0.991; changing the parameters makes it go down a little (0.986 and 0.969).
* See the comments in the code: we have changed the number of trees, which affects the calculation speed of the model. 
* The criterion was also changed, without very different results.
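
Trying parameter combinations by hand can also be automated. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data; the grid values, the fold count and the data itself are arbitrary choices for illustration, not tuned recommendations for the flight dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): a nonlinear target with a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

# Every combination of these values is fitted and cross-validated
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,           # 3-fold cross-validation per combination
    scoring="r2",   # same metric as the .score() calls above
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated R^2: {search.best_score_:.3f}")
```

Unlike the train $R^2$ used in this exercise, the score here is cross-validated, so a larger forest is only preferred if it generalizes better. `GridSearchCV` also accepts `n_jobs=-1` to evaluate combinations in parallel.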

### 3.3.- Neural Network Regression Model

In [56]:
# Create model
model4 = MLPRegressor(activation='relu',solver='adam',random_state=1, max_iter=500)
model4.fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train4 = model4.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train4)

coefficient of determination with train data: 0.974


In [57]:
# Let's increase the number of neurons in the hidden layer to 5 times the default (from 100 to 500)
model4 = MLPRegressor(hidden_layer_sizes=(500,), activation='relu',solver='adam',random_state=1, max_iter=500)
model4.fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train4 = model4.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train4)

coefficient of determination with train data: 0.981


In [58]:
# Let's change the activation to logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)). (old school)
model4 = MLPRegressor(activation='logistic',solver='adam',random_state=1, max_iter=500)
model4.fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train4 = model4.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train4)

coefficient of determination with train data: 0.972


In [63]:
# Let's change the solver to lbfgs (optimizer in the family of quasi-Newton methods)
model4 = MLPRegressor(activation='relu',solver='lbfgs',random_state=1, max_iter=1000)
model4.fit(X_train,y_train)
# Calculate R2 to see the accuracy of the model with train data.
r_sq_train4 = model4.score(X_train,y_train)
print('coefficient of determination with train data: %.3f' %r_sq_train4)

        # Note: The default solver 'adam' works pretty well on relatively
        # large datasets (with thousands of training samples or more) in terms of
        # both training time and validation score.
        # For small datasets, however, 'lbfgs' can converge faster and perform
        # better.

coefficient of determination with train data: 0.984


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


#### Observation 9.
* The initial $R^2$ is 0.974. 
* When we increase the number of neurons in the hidden layer from 100 to 500, we get a better $R^2$ (0.981).  
* If we change the relu function to the logistic sigmoid function, it decreases slightly to 0.972, and training is significantly slower (about double, 10s to 20s).
* If we change the solver from adam to lbfgs, we have to increase max_iter. The $R^2$ goes up (0.992; the cell above, limited to max_iter=1000, reports 0.984), but it doesn't converge even after ~6min and 10000 iters.  

Error message: 
![](2022-03-14-17-15-14.png)
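
The warning suggests scaling the data. A common way to do that without leaking statistics from held-out folds is to put the scaler and the network in one pipeline. This is a minimal sketch on synthetic data with deliberately mismatched feature scales; the layer size and iteration limit are arbitrary, and the notebook's data may already be normalized, in which case scaling alone would not remove the warning.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three features on very different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 1000.0])
y = 2.0 * X[:, 0] + 0.03 * X[:, 1] + 0.001 * X[:, 2] + 0.1 * rng.normal(size=300)

# The scaler is fitted only on the data passed to fit(), then applied before the network
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(20,), solver="lbfgs",
                 max_iter=2000, random_state=1),
)
scaled_mlp.fit(X, y)

r2_scaled = scaled_mlp.score(X, y)
print(f"R^2 with scaled inputs: {r2_scaled:.3f}")
```

The pipeline can be passed to `cross_val_score` like any other estimator, so the scaler is refitted on each training fold.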


## Exercise 4 - Compare their performance using the test/train approach or using all data (internal validation).

Previously, we have used the test/train approach. Let's now use all data (internal validation) and check if R^2 changes.

### 4.1.- Multiple Linear Regression Model

In [64]:
#Let's concatenate df_train & df_test
df_complete = pd.concat([df_train,df_test])
df_complete.shape

(19283, 33)

In [65]:
# All samples have been combined.
# Now let's create the models with all data now and test R^2


In [66]:
# Our Y or Target is ArrDelay:
y_complete = df_complete.ArrDelay
y_complete = y_complete.array # Convert pandas series to numpy array
type(y_complete)

pandas.core.arrays.numpy_.PandasArray

In [67]:
# Our X now is going to be all the columns in df_complete except ArrDelay and Date (also the OHE of Airline)
X_complete = df_complete.drop(columns=["ArrDelay","Date"])
feature_list = list(X_complete.columns) # Saving feature names for later use
X_complete = X_complete.to_numpy() # Convert dataframe to array
type(X_complete)

numpy.ndarray

In [68]:
## Let's test our last 3 models
# Fit the model (calculate b0...bn for the multiple linear model y = b0 + b1x1 +... + bnxn)
model2 = LinearRegression().fit(X_complete,y_complete)

In [69]:
# Calculate R2 to see the accuracy of the model with complete data.
r_sq_complete2 = model2.score(X_complete,y_complete)
print('coefficient of determination with all data: %.3f' %r_sq_complete2)

coefficient of determination with all data: 0.209


In [77]:
# Applying k-Fold Cross Validation (CV) with all data
accuracies2 = cross_val_score(estimator = model2, X=X_complete, y=y_complete, cv = 10) # default cv = 5
# Note: "%3.f" prints the standard deviation with zero decimals (hence the "0" below); "%.3f" would show three decimals.
print("Multiple Linear Regression:\n Accuracy with all data: %.3f"%accuracies2.mean(), "+/- %3.f"%accuracies2.std(),"\n")

Multiple Linear Regression:
 Accuracy with all data: 0.204 +/-   0 



### 4.2.- Random Forest Regression Model

In [70]:
# Instantiate model with 100 decision trees
model3 = RandomForestRegressor(n_estimators = 100, random_state = 42)
# Train the model on all data
model3.fit(X_complete, y_complete)
# Calculate R2 to see the accuracy of the model with all data.
r_sq_train3 = model3.score(X_complete,y_complete)
print('coefficient of determination with all data: %.3f' %r_sq_train3)

coefficient of determination with all data: 0.993


In [78]:
# Applying k-Fold Cross Validation (CV) with all data
accuracies3 = cross_val_score(estimator = model3, X=X_complete, y=y_complete , cv = 10) # 10 folds (default cv = 5)
# Note: "%3.f" prints the standard deviation with zero decimals (hence the "0" below); "%.3f" would show three decimals.
print("Random Forest Regression:\n Accuracy with all data: %.3f"%accuracies3.mean(), "+/- %3.f"%accuracies3.std(),"\n")

Random Forest Regression:
 Accuracy with all data: 0.948 +/-   0 



### 4.3.- Neural Network Regression Model

In [71]:
model4 = MLPRegressor(activation='relu',solver='adam',random_state=1, max_iter=500)
model4.fit(X_complete,y_complete)
# Calculate R2 to see the accuracy of the model with all data.
r_sq_train4 = model4.score(X_complete,y_complete)
print('coefficient of determination with all data: %.3f' %r_sq_train4)

coefficient of determination with all data: 0.976


In [79]:
# Applying k-Fold Cross Validation (CV) with all data
accuracies4 = cross_val_score(estimator = model4, X=X_complete, y=y_complete , cv = 10) # 10 folds (default cv = 5)
# Note: "%3.f" prints the standard deviation with zero decimals (hence the "0" below); "%.3f" would show three decimals.
print("Neural Network Regression:\n Accuracy with all data: %.3f"%accuracies4.mean(), "+/- %3.f"%accuracies4.std(),"\n")

Neural Network Regression:
 Accuracy with all data: 0.957 +/-   0 



These are the results with all data:
* Multiple Linear Regression $R^2$ = 0.209
* Random Forest Regression $R^2$ = 0.993
* Neural Network Regression $R^2$ = 0.976

These are the results of the train/test approach ($R^2$ with test data):
* Multiple Linear Regression $R^2$ = -97068975798565535744
* Random Forest Regression $R^2$ = 0.927
* Neural Network Regression $R^2$ = 0.951

Now the cross validation, with all the data, is more meaningful. The mean scores of the different models are as follows (the standard deviation was printed with zero decimals, so it showed as 0 and is omitted here):
* Multiple Linear Regression = 0.204
* Random Forest Regression = 0.948
* Neural Network Regression = 0.957

The results are the following:
* All data, $R^2$ ➡️ Best model is Random Forest
* Train/test approach, $R^2$ ➡️ Best model is Neural Network
* All data, $CV$ ➡️ Best model is Neural Network
* Train/test approach, $CV$ ➡️ Best model is Neural Network too (0.938 +/- 0.037)

In conclusion, it seems that **the best model is Neural Network and the best way to assess accuracy is $CV$ on all data**.
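
A cross-validated comparison of this kind can be written compactly with `cross_validate`, which reports several metrics per fold. The sketch below uses synthetic data and small models so that it runs quickly; the data, the model settings and the fold count are placeholders, not the notebook's values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic nonlinear problem (assumption, not the flight data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

# Shuffled folds, so the row order of the input does not decide fold membership
folds = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=folds,
                        scoring=("r2", "neg_mean_squared_error"))
    results[name] = {
        "r2_mean": cv["test_r2"].mean(),
        "r2_std": cv["test_r2"].std(),
        # scikit-learn reports the negated MSE, so flip the sign back
        "mse_mean": -cv["test_neg_mean_squared_error"].mean(),
    }
    print(f"{name}: R^2 = {results[name]['r2_mean']:.3f} "
          f"+/- {results[name]['r2_std']:.3f}, "
          f"MSE = {results[name]['mse_mean']:.3f}")
```

Shuffling may matter when the rows are ordered, for example after concatenating a train and a test frame, because unshuffled folds would then follow that order.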