## Exercise 1

Create at least three different regression models to predict flight arrival delay (ArrDelay) from DelayedFlights.csv as accurately as possible.

## Exercise 2

Compare them based on MSE and R2.

The file is read and preprocessed, and the train and test sets are generated as in the previous module's assignment.

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)
# Display floats with two decimals instead of scientific notation
pd.options.display.float_format = '{:,.2f}'.format
# Read the file
df = pd.read_csv('DelayedFlights.csv')

# Normalize the column names
df.columns = [col.lower() for col in df]

df.rename(columns={
    'dayofmonth': 'day_of_month', 
    'dayofweek': 'day_of_week',
    'crsdeptime': 'crs_dep_time',
    'crsarrtime': 'crs_arr_time',
    'uniquecarrier': 'unique_carrier',
    'actualelapsedtime': 'actual_elapsed_time',
    'crselapsedtime': 'crs_elapsed_time',
    'airtime': 'air_time',
    'arrdelay': 'arr_delay',
    'depdelay': 'dep_delay',
    'taxiout': 'taxi_out',
    'taxiin': 'taxi_in',
    'cancellationcode': 'cancellation_code',
    'carrierdelay': 'carrier_delay',
    'weatherdelay': 'weather_delay',
    'nasdelay': 'nas_delay',
    'securitydelay': 'security_delay',
    'lateaircraftdelay': 'late_air_craft_delay',
    'deptime': 'dep_time',
    'arrtime': 'arr_time',
    'tailnum': 'tail_num',
    'flightnum': 'flight_num'
    }, inplace=True)


# Select the most relevant columns and fill null values
# .copy() gives an independent DataFrame and avoids the SettingWithCopyWarning
df_arr_delay = df[['unique_carrier','carrier_delay', 'dep_delay', 'late_air_craft_delay', 'arr_delay']].copy()
df_arr_delay = df_arr_delay.fillna({'arr_delay': 0, 'carrier_delay': 0, 'late_air_craft_delay': 0})


In [2]:
# Preprocessing: min-max scale the numeric columns and one-hot encode the carrier

min_max_scaler = preprocessing.MinMaxScaler()
X_train_transformed = df_arr_delay.copy()
X_train_minmax = min_max_scaler.fit_transform(X_train_transformed[["carrier_delay",'dep_delay','late_air_craft_delay','arr_delay']])
scaled_features_df = pd.DataFrame(X_train_minmax, index=X_train_transformed.index, columns=["carrier_delay",'dep_delay','late_air_craft_delay','arr_delay'])
X_train_transformed["carrier_delay"] = scaled_features_df["carrier_delay"]
X_train_transformed['dep_delay'] = scaled_features_df['dep_delay']
X_train_transformed['late_air_craft_delay'] = scaled_features_df['late_air_craft_delay']
X_train_transformed['arr_delay'] = scaled_features_df['arr_delay']
X_train_transformed  = pd.get_dummies(X_train_transformed, prefix=['carr'], columns=['unique_carrier'])

In [3]:
# Remove the target column from the feature set
X_data = X_train_transformed.copy()
del X_data['arr_delay']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X_data, X_train_transformed['arr_delay'], test_size=0.30)
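The cells above fit `MinMaxScaler` on the full dataset before splitting, and the split itself is not seeded. A common alternative, sketched here on synthetic data (the arrays and `random_state=42` are assumptions, not part of the original notebook), is to split first and fit the scaler on the training rows only, so that nothing from the test set reaches the preprocessing step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the delay features (hypothetical data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(1000, 3))
y = X.sum(axis=1) + rng.normal(0, 5, size=1000)

# Fixing random_state makes the split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
```

With this approach the test values can fall slightly outside the [0, 1] interval, which is expected.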

### First model: Linear Regression.

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)
y_test_lr_predicted = model.predict(X_test)

In [7]:
r2_lr_score = metrics.r2_score(y_test, y_test_lr_predicted)
r2_lr_score

0.9009871355438588

In [8]:
mse_lr = metrics.mean_squared_error(y_test, y_test_lr_predicted)
mse_lr

4.8822022111283685e-05

With Linear Regression we obtained an R2 of 0.901 and an MSE of 4.88e-05. The MSE is this small because the target was min-max scaled.
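Since `arr_delay` was min-max scaled, the MSE values in this notebook are in scaled units rather than squared minutes. The following sketch, on made-up delays that are not taken from DelayedFlights.csv, illustrates how one could convert such an MSE back: min-max scaling divides by the data range, so the MSE is divided by the range squared.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Hypothetical delays in minutes (illustrative values only)
y_true = np.array([[10.0], [45.0], [120.0], [300.0], [0.0]])
y_pred = np.array([[12.0], [40.0], [130.0], [290.0], [5.0]])

scaler = MinMaxScaler().fit(y_true)
mse_scaled = mean_squared_error(scaler.transform(y_true), scaler.transform(y_pred))

# Min-max scaling divides by (max - min), so the MSE is divided by that range squared
data_range = scaler.data_range_[0]
mse_minutes = mse_scaled * data_range ** 2

print(mse_minutes, mean_squared_error(y_true, y_pred))
```

R2 is not affected by this kind of rescaling, so the R2 values can be read directly.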

### Second model: Decision Tree.

In [9]:
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(random_state=42)

In [10]:
y_test_dt_predicted = regressor.predict(X_test)

In [11]:
r2_dt_score = metrics.r2_score(y_test, y_test_dt_predicted)
r2_dt_score

0.9049097823595273

In [12]:
mse_dt = metrics.mean_squared_error(y_test, y_test_dt_predicted)
mse_dt

4.6887813353449364e-05

With Decision Tree we obtained an R2 of 0.905 and an MSE of 4.69e-05.

### Third model: Neural Network.

In [16]:
mlp = MLPRegressor()
mlp.fit(X_train, y_train)

MLPRegressor()

In [17]:
y_test_nn_predicted = mlp.predict(X_test)

In [18]:
r2_nn_score = metrics.r2_score(y_test, y_test_nn_predicted)
r2_nn_score

0.9059055291019323

In [19]:
mse_nn = metrics.mean_squared_error(y_test, y_test_nn_predicted)
mse_nn

4.639682291759074e-05

With Neural Network we obtained an R2 of 0.906 and an MSE of 4.64e-05.

Based on these results, the model with the best metrics is the Neural Network (R2 of 0.906, MSE of 4.64e-05), followed closely by Decision Tree (0.905, 4.69e-05) and Linear Regression (0.901, 4.88e-05). The differences between the three are small.
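To compare several models side by side, the metrics can be collected into a single table. This is a minimal sketch on synthetic data from `make_regression` (an assumption; the flight data is not used here), and an `MLPRegressor` could be added to the dictionary in the same way.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the flight data (assumption)
X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

rows = []
for name, estimator in models.items():
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "r2": r2_score(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
    })

# Sort so the model with the lowest test MSE comes first
results = pd.DataFrame(rows).sort_values("mse").reset_index(drop=True)
print(results)
```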

## Exercise 3

Train them using the different parameters they accept.

## Exercise 4

Compare their performance using the train/test approach or using all the data (internal validation).

We start with some regularized variants of Linear Regression: Ridge, Lasso and ElasticNet.

In [21]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

In [46]:
ridge = Ridge(alpha=0.01)
ridge.fit(X_train, y_train) 
y_test_ridge_predicted= ridge.predict(X_test)
print(metrics.r2_score(y_test, y_test_ridge_predicted))

0.9009870839332966


In [47]:
mse_ridge = metrics.mean_squared_error(y_test, y_test_ridge_predicted)
mse_ridge

4.882204755981524e-05

In [48]:
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train) 
y_test_lasso_predicted = lasso.predict(X_test)
print(metrics.r2_score(y_test, y_test_lasso_predicted))

-1.198792443091179e-07


In [49]:
mse_lasso = metrics.mean_squared_error(y_test, y_test_lasso_predicted)
mse_lasso

0.0004930877238245846

In [50]:
model_enet = ElasticNet(alpha = 0.9)
model_enet.fit(X_train, y_train) 
y_test_enet_predicted= model_enet.predict(X_test)
print(metrics.r2_score(y_test, y_test_enet_predicted))

-1.198792443091179e-07


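With Ridge we obtained essentially the same values as with Linear Regression. With Lasso and ElasticNet, however, R2 is approximately zero (slightly negative), which means the model predicts little more than a constant. Most likely the chosen `alpha` values are too large for a target scaled to [0, 1] and shrink every coefficient to zero, so with these settings the two models are not useful for this case.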
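The effect of `alpha` on Lasso can be checked on a small synthetic problem. This sketch assumes a target with a small scale, similar in spirit to a min-max scaled column; the data and the `alpha` values are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data with a small-scale target (assumption, not the flight data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = 0.05 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.001, size=500)

results = {}
for alpha in (1e-2, 1e-4, 1e-6):
    lasso = Lasso(alpha=alpha).fit(X, y)
    results[alpha] = {
        "n_nonzero": int(np.sum(lasso.coef_ != 0)),  # coefficients that survive the L1 penalty
        "r2": lasso.score(X, y),                      # R2 on the training data
    }
    print(f"alpha={alpha:g}  non-zero={results[alpha]['n_nonzero']}  R2={results[alpha]['r2']:.4f}")
```

When every coefficient is shrunk to zero, the model only predicts the mean of the training target, which gives an R2 of about zero.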

Next we set some parameters on DecisionTreeRegressor: a maximum depth, a minimum number of samples required to split a node, and a minimum number of samples required for a node to be a leaf.

In [52]:
tree_regressor_2 = DecisionTreeRegressor(random_state=42, max_depth=15, min_samples_split=10, min_samples_leaf=4)
tree_regressor_2.fit(X_train, y_train)
y_test_dt2_predicted = tree_regressor_2.predict(X_test)
r2_dt2_score = metrics.r2_score(y_test, y_test_dt2_predicted)
r2_dt2_score

0.9205591381457506

In [53]:
mse_dt2 = metrics.mean_squared_error(y_test, y_test_dt2_predicted)
mse_dt2

3.917130905454817e-05

With these changes we obtained better results: R2 rises from 0.905 to 0.92 and MSE drops from 4.69e-05 to 3.92e-05.
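Instead of choosing the tree parameters by hand, they could be searched systematically. The following is a rough sketch with `GridSearchCV` on synthetic data; the grid values are hypothetical and not tuned for the flight dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the flight data (assumption)
X, y = make_regression(n_samples=400, n_features=3, noise=10.0, random_state=0)

# Hypothetical grid; the values are illustrative only
param_grid = {
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negated MSE
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```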

For the Neural Network we set the hyperparameters to 3 hidden layers of 24 neurons each.
We use relu as the activation and adam as the solver.

In [73]:
mlp2 = MLPRegressor(hidden_layer_sizes=(24,24,24), activation='relu', solver='adam', max_iter=500, alpha=0.0001)
mlp2.fit(X_train, y_train)
y_test_nn2_predicted = mlp2.predict(X_test)
r2_nn2_score = metrics.r2_score(y_test, y_test_nn2_predicted)
r2_nn2_score

0.9125122500796836

In [74]:
mse_nn2 = metrics.mean_squared_error(y_test, y_test_nn2_predicted)
mse_nn2

4.3139130299256954e-05

With these parameters we obtained an R2 of 0.91 and an MSE of 4.31e-05. This improves on the default MLPRegressor, although the tuned Decision Tree (R2 of 0.92, MSE of 3.92e-05) remains the best result so far.

In [76]:
mlp2 = MLPRegressor(hidden_layer_sizes=(24,24,24), activation='relu', solver='adam', max_iter=500, alpha=0.0001)
mlp2.fit(X_train, y_train)
y_test_nn2_predicted = mlp2.predict(X_test)
r2_nn2_score = metrics.r2_score(y_test, y_test_nn2_predicted)
r2_nn2_score

0.9149558131498243
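Exercise 4 also mentions internal validation on all the data. One possible way to do that is k-fold cross-validation, sketched here with `cross_validate` on synthetic data (the dataset, the 5 folds and the seed are assumptions, not taken from the notebook).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem standing in for the flight data (assumption)
X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)

# Every sample is used for testing exactly once across the 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv, scoring=("r2", "neg_mean_squared_error")
)

r2_mean = scores["test_r2"].mean()
mse_mean = -scores["test_neg_mean_squared_error"].mean()
print(f"mean R2={r2_mean:.3f}  mean MSE={mse_mean:.3f}")
```

Averaging over folds gives a more stable estimate than a single train/test split, which helps when two runs of the same model differ slightly, as in the two MLPRegressor cells above.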

## Exercise 5

Perform some feature engineering to improve the prediction.

In [69]:
X_data2 = X_train_transformed.copy()

In [78]:
X_data2.corr()

| | carrier_delay | dep_delay | late_air_craft_delay | arr_delay | carr_9E | carr_AA | carr_AQ | carr_AS | carr_B6 | carr_CO | carr_DL | carr_EV | carr_F9 | carr_FL | carr_HA | carr_MQ | carr_NW | carr_OH | carr_OO | carr_UA | carr_US | carr_WN | carr_XE | carr_YV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| carrier_delay | 1.0 | 0.57 | -0.08 | 0.55 | 0.02 | 0.02 | -0.0 | -0.0 | -0.0 | -0.02 | -0.0 | 0.05 | -0.01 | -0.03 | 0.01 | -0.0 | 0.04 | 0.04 | 0.0 | -0.0 | -0.01 | -0.09 | -0.0 | 0.08 |
| dep_delay | 0.57 | 1.0 | 0.57 | 0.95 | 0.01 | 0.02 | -0.01 | -0.01 | 0.04 | -0.0 | -0.02 | 0.02 | -0.04 | -0.0 | -0.01 | 0.0 | -0.01 | 0.02 | 0.01 | 0.04 | -0.02 | -0.08 | 0.03 | 0.04 |
| late_air_craft_delay | -0.08 | 0.57 | 1.0 | 0.56 | 0.01 | 0.01 | -0.01 | -0.0 | 0.03 | -0.02 | -0.02 | -0.05 | -0.03 | 0.05 | -0.01 | 0.02 | -0.03 | -0.06 | 0.01 | 0.05 | -0.02 | 0.02 | 0.02 | -0.02 |
| arr_delay | 0.55 | 0.95 | 0.56 | 1.0 | 0.01 | 0.03 | -0.01 | -0.02 | 0.04 | -0.01 | -0.01 | 0.02 | -0.03 | 0.01 | -0.01 | 0.02 | 0.01 | 0.03 | 0.01 | 0.03 | -0.02 | -0.1 | 0.03 | 0.04 |
| carr_9E | 0.02 | 0.01 | 0.01 | 0.01 | 1.0 | -0.06 | -0.0 | -0.02 | -0.03 | -0.04 | -0.04 | -0.03 | -0.02 | -0.03 | -0.01 | -0.05 | -0.03 | -0.03 | -0.04 | -0.05 | -0.04 | -0.08 | -0.04 | -0.03 |
| carr_AA | 0.02 | 0.02 | 0.01 | 0.03 | -0.06 | 1.0 | -0.01 | -0.05 | -0.06 | -0.08 | -0.08 | -0.07 | -0.04 | -0.06 | -0.02 | -0.09 | -0.07 | -0.06 | -0.09 | -0.09 | -0.08 | -0.16 | -0.08 | -0.06 |
| carr_AQ | -0.0 | -0.01 | -0.01 | -0.01 | -0.0 | -0.01 | 1.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.01 | -0.0 | -0.0 | -0.01 | -0.01 | -0.0 | -0.01 | -0.0 | -0.0 |
| carr_AS | -0.0 | -0.01 | -0.0 | -0.02 | -0.02 | -0.05 | -0.0 | 1.0 | -0.02 | -0.03 | -0.04 | -0.03 | -0.02 | -0.03 | -0.01 | -0.04 | -0.03 | -0.02 | -0.04 | -0.04 | -0.03 | -0.07 | -0.03 | -0.03 |
| carr_B6 | -0.0 | 0.04 | 0.03 | 0.04 | -0.03 | -0.06 | -0.0 | -0.02 | 1.0 | -0.04 | -0.04 | -0.04 | -0.02 | -0.03 | -0.01 | -0.05 | -0.04 | -0.03 | -0.05 | -0.05 | -0.04 | -0.08 | -0.04 | -0.03 |
| carr_CO | -0.02 | -0.0 | -0.02 | -0.01 | -0.04 | -0.08 | -0.0 | -0.03 | -0.04 | 1.0 | -0.06 | -0.05 | -0.03 | -0.05 | -0.01 | -0.07 | -0.05 | -0.04 | -0.06 | -0.07 | -0.05 | -0.11 | -0.06 | -0.04 |


Since the **unique carrier** dummy columns have very little influence (their correlation with **arr_delay** never exceeds 0.10 in absolute value), we chose to drop them so the model does not take them into account.
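Dropping weakly correlated columns can also be done programmatically. This sketch uses a synthetic frame whose column names mirror the notebook's, but the values and the 0.1 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame; values are made up, only the column names echo the notebook
rng = np.random.default_rng(0)
n = 5000
dep_delay = rng.exponential(40, size=n)
frame = pd.DataFrame({
    "dep_delay": dep_delay,
    "carr_AA": rng.integers(0, 2, size=n),  # dummy column unrelated to the target
    "arr_delay": dep_delay + rng.normal(0, 5, size=n),
})

# Absolute Pearson correlation of every feature with the target
target_corr = frame.corr()["arr_delay"].drop("arr_delay").abs()

threshold = 0.1  # hypothetical cut-off
selected = target_corr[target_corr > threshold].index.tolist()

print(target_corr.sort_values(ascending=False))
print(selected)
```

Pearson correlation only captures linear relationships, so a low value does not by itself prove that a feature is useless for a non-linear model.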

In [132]:
X_data22 = X_data2[['carrier_delay', 'dep_delay', 'late_air_craft_delay', 'arr_delay']]
X_data22.corr()

| | carrier_delay | dep_delay | late_air_craft_delay | arr_delay |
|---|---|---|---|---|
| carrier_delay | 1.0 | 0.57 | -0.08 | 0.55 |
| dep_delay | 0.57 | 1.0 | 0.57 | 0.95 |
| late_air_craft_delay | -0.08 | 0.57 | 1.0 | 0.56 |
| arr_delay | 0.55 | 0.95 | 0.56 | 1.0 |


In [134]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_data22[['carrier_delay','dep_delay','late_air_craft_delay']], X_data22['arr_delay'], test_size=0.30)

In [135]:
model2 = LinearRegression()
model2.fit(X_train2, y_train2)
y_test_lr_predicted2 = model2.predict(X_test2)
r2_lr_score2 = metrics.r2_score(y_test2,y_test_lr_predicted2)
r2_lr_score2

0.896461934669293

After these modifications the result barely changes: R2 is 0.896, compared with 0.901 for the Linear Regression that included the carrier columns.

## Exercise 6

Do not use the DepDelay variable when making predictions.

In [124]:
# Read the file
df = pd.read_csv('DelayedFlights.csv')

# Normalize the column names
df.columns = [col.lower() for col in df]

df.rename(columns={
    'dayofmonth': 'day_of_month', 
    'dayofweek': 'day_of_week',
    'crsdeptime': 'crs_dep_time',
    'crsarrtime': 'crs_arr_time',
    'uniquecarrier': 'unique_carrier',
    'actualelapsedtime': 'actual_elapsed_time',
    'crselapsedtime': 'crs_elapsed_time',
    'airtime': 'air_time',
    'arrdelay': 'arr_delay',
    'depdelay': 'dep_delay',
    'taxiout': 'taxi_out',
    'taxiin': 'taxi_in',
    'cancellationcode': 'cancellation_code',
    'carrierdelay': 'carrier_delay',
    'weatherdelay': 'weather_delay',
    'nasdelay': 'nas_delay',
    'securitydelay': 'security_delay',
    'lateaircraftdelay': 'late_air_craft_delay',
    'deptime': 'dep_time',
    'arrtime': 'arr_time',
    'tailnum': 'tail_num',
    'flightnum': 'flight_num'
    }, inplace=True)


# Select the most relevant columns and fill null values
# .copy() gives an independent DataFrame and avoids the SettingWithCopyWarning
df_arr_delay = df[['carrier_delay', 'security_delay','nas_delay','weather_delay', 'late_air_craft_delay', 'arr_delay']].copy()
df_arr_delay = df_arr_delay.fillna({'arr_delay': 0, 'carrier_delay': 0, 'nas_delay': 0, 'security_delay': 0, 'weather_delay': 0, 'late_air_craft_delay': 0})


Since we cannot use **dep_delay**, we will use the sum of **nas_delay**, **security_delay**, **weather_delay**, **late_air_craft_delay** and **carrier_delay**.

In [125]:
# Add a column with the total delay: the sum of the individual delay causes
delay_columns = ['carrier_delay', 'nas_delay', 'security_delay', 'weather_delay', 'late_air_craft_delay']
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
df_arr_delay = df_arr_delay.assign(total_late=df_arr_delay[delay_columns].sum(axis=1))


In [126]:
# Keep only the columns we need
df_arr_delay = df_arr_delay[['total_late', 'arr_delay']]

In [127]:
# Preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_transformed = df_arr_delay.copy()
X_train_minmax = min_max_scaler.fit_transform(X_train_transformed[["total_late",'arr_delay']])
scaled_features_df = pd.DataFrame(X_train_minmax, index=X_train_transformed.index, columns=["total_late",'arr_delay'])
X_train_transformed["total_late"] = scaled_features_df["total_late"]
X_train_transformed['arr_delay'] = scaled_features_df['arr_delay']
X_train_transformed

| | total_late | arr_delay |
|---|---|---|
| 0 | 0.00 | 0.04 |
| 1 | 0.00 | 0.04 |
| 2 | 0.00 | 0.05 |
| 3 | 0.01 | 0.06 |
| 4 | 0.00 | 0.05 |
| ... | ... | ... |
| 1936753 | 0.01 | 0.05 |
| 1936754 | 0.03 | 0.07 |
| 1936755 | 0.04 | 0.08 |
| 1936756 | 0.00 | 0.05 |


In [128]:
X_data = X_train_transformed.copy()
del X_data['arr_delay']

In [129]:
X_train, X_test, y_train, y_test = train_test_split(X_data, X_train_transformed['arr_delay'], test_size=0.30)

In [130]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
y_test_lr_predicted2 = regression_model.predict(X_test)
r2_lr_score2 = metrics.r2_score(y_test, y_test_lr_predicted2)
r2_lr_score2

0.9926760103699903

In [131]:
# MSE of the linear regression trained on total_late
mse_lr2 = metrics.mean_squared_error(y_test, y_test_lr_predicted2)
mse_lr2

3.591546713810081e-06

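With these changes we obtained a high R2 of 0.99 and the best MSE so far, 3.59e-06. This is probably because the five delay-cause columns are themselves components of the arrival delay, so **total_late** nearly reproduces the target.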