## Model 03 Forest Regression

As part of Henry's Data Science bootcamp, this notebook builds a model to predict the number of bicycles (the `cnt` column).

### Data exploration

The goal of the first part of the notebook is to explore the data in the `bike_train.xlsx` dataset.

1. Libraries used:

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
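from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import xgboost as xgb  # note: this alias is reassigned to a model instance in a later cell

sns.set()  # apply seaborn's default plot theme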

2. Load the files as DataFrames

In [4]:
bike_train = pd.read_excel('bike_train.xlsx')
bike_test = pd.read_excel('bike_test.xlsx')
bike_train.head(5)

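| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |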


In [5]:
bike_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11999 entries, 0 to 11998
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     11999 non-null  int64         
 1   dteday      11999 non-null  datetime64[ns]
 2   season      11999 non-null  int64         
 3   yr          11999 non-null  int64         
 4   mnth        11999 non-null  int64         
 5   hr          11999 non-null  int64         
 6   holiday     11999 non-null  int64         
 7   weekday     11999 non-null  int64         
 8   workingday  11999 non-null  int64         
 9   weathersit  11999 non-null  int64         
 10  temp        11999 non-null  float64       
 11  atemp       11999 non-null  float64       
 12  hum         11999 non-null  float64       
 13  windspeed   11999 non-null  float64       
 14  casual      11999 non-null  int64         
 15  registered  11999 non-null  int64         
 16  cnt         11999 non-

In [6]:
bike_train['dia'] = bike_train['dteday'].strptime("%d")

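AttributeError: 'Series' object has no attribute 'strptime'

The call fails because `strptime` is not a pandas `Series` method, so the `dia` column is never created. The later cells do not depend on it.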
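
If the intent was to extract the day of the month, the pandas `.dt` accessor does this for datetime columns. A minimal sketch on a small hypothetical frame (the dates below are made up for illustration; only the column name `dteday` comes from the notebook):

```python
import pandas as pd

# Hypothetical sample with a datetime64 column named like the notebook's `dteday`
sample = pd.DataFrame(
    {"dteday": pd.to_datetime(["2011-01-01", "2011-01-15", "2011-02-28"])}
)

# `.dt.day` returns the day of the month as integers
sample["dia"] = sample["dteday"].dt.day

# `.dt.strftime` is the string-formatting counterpart, giving zero-padded text
sample["dia_str"] = sample["dteday"].dt.strftime("%d")

print(sample)
```

If such a column were added to `bike_train`, the same column would also have to be added to `bike_test` so that both frames keep the same features.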

3. Drop the columns that will not be used as inputs. `instant` is a leftover index, and `dteday` is not needed because we are not treating the data as a time series. `casual` and `registered` are also dropped from the training set.

In [5]:
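# Columns removed from the training set
drop_columns = ['instant', 'dteday', 'casual', 'registered']
train = bike_train.drop(drop_columns, axis=1)
# Only 'instant' and 'dteday' are removed from the test set
test = bike_test.drop(['instant', 'dteday'], axis=1)
train.head()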

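| | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 16 |
| 1 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 40 |
| 2 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 32 |
| 3 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 13 |
| 4 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 1 |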
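
In the rows displayed by `bike_train.head(5)`, `casual` and `registered` add up to `cnt`, which is presumably why `drop_columns` removes them: keeping them as inputs would leak the target. A quick check using only those five rows (whether the identity holds for the full file is an assumption that would need the same check on `bike_train`):

```python
import pandas as pd

# The five rows shown by bike_train.head(5), restricted to the three count columns
head = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# True when the target equals the sum of the two columns in every row
is_sum = (head["casual"] + head["registered"] == head["cnt"]).all()
print(is_sum)
```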


In [24]:
y = train['cnt']
X = train.drop(['cnt'], axis=1)

3.1 Evaluate the linear correlations: we keep the variables most strongly correlated with the target and identify those that are correlated with each other
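
The notebook does not show the correlation step itself. One way it could be done is sketched below on synthetic data: the values are random, only the column names mirror the notebook, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train`; values are random, not taken from the dataset
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, 200),  # nearly a copy of temp
    "hum": rng.uniform(0, 1, 200),
})
demo["cnt"] = 300 * demo["temp"] + rng.normal(0, 20, 200)

corr = demo.corr()  # Pearson correlation matrix

# Absolute correlation of each feature with the target, strongest first
target_corr = corr["cnt"].drop("cnt").abs().sort_values(ascending=False)
print(target_corr)

# Feature pairs that are strongly correlated with each other
features = corr.drop(index="cnt", columns="cnt")
pairs = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.9
]
print(pairs)
```

On the notebook's data the same calls would be applied to `train` instead of `demo`.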

In [37]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.20, random_state=0)

5. Select the input variables and the output variables

In [38]:
model = XGBRegressor(gpu_id=0)

7.1 Hyperparameter tuning

In [39]:
from sklearn.model_selection import GridSearchCV
xgb = XGBRegressor(gpu_id = 0)

In [40]:
xgb.fit(X_train, y_train)

In [41]:
predicciones= xgb.predict(X_validation)

In [42]:
predicciones

array([125.56248 ,  35.76439 ,   7.528962, ..., 338.4769  , 146.12898 ,
        44.070957], dtype=float32)
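
The validation predictions are printed but not scored. Since the grid search later in the notebook uses root mean squared error, a comparable validation metric could be computed as in this sketch. The `rmse` helper and the two small arrays are hypothetical stand-ins for `y_validation` and `predicciones`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values, not taken from the dataset
y_val_demo = np.array([120.0, 40.0, 10.0, 330.0])
pred_demo = np.array([125.0, 36.0, 7.0, 338.0])

print(rmse(y_val_demo, pred_demo))
```

In the notebook this would be called as `rmse(y_validation, predicciones)`.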

In [35]:
pred_xgb_v3 = xgb.predict(test)

In [36]:
pred_xgb_v3 = pd.DataFrame(pred_xgb_v3, columns = ['pred'])
pred_xgb_v3.to_csv('TeffaHM_v11_xgb.csv', header=True, index=False)

8. Train the model

Random Forest Regression

In [39]:
from sklearn.ensemble import RandomForestRegressor

In [41]:
regressor = RandomForestRegressor(n_estimators = 100, random_state = 2208)

In [42]:
parameters = {'n_estimators':[100, 150, 250, 300], 'max_features':['sqrt', 'log2'], 'max_depth':[6, 10, 20, 30, 40]}

In [43]:
rfr_04 = GridSearchCV(estimator = regressor, param_grid = parameters, refit = True, verbose = 2, cv = 5, scoring = 'neg_root_mean_squared_error', n_jobs = -1)

In [44]:
rfr_04.fit(X_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


In [45]:
rfr_04.best_estimator_

In [46]:
rfr_04.best_score_

-47.903877410072596
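
`best_score_` is negative because scikit-learn maximizes scores, so `'neg_root_mean_squared_error'` reports the RMSE with its sign flipped; the cross-validated RMSE here is about 47.9. A minimal sketch of that convention follows, where the synthetic data and the tiny grid are assumptions chosen only to keep it fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, unrelated to the bike dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(120, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 5, 120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign to read the score as an RMSE in the target's units
cv_rmse = -search.best_score_
print(search.best_params_, round(cv_rmse, 2))
```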

In [47]:
predicciones_rfr_02 = rfr_04.predict(X_validation)

In [48]:
predicciones_rfr_02

array([242.53666667, 287.6       , 155.34666667, ...,  12.18666667,
       156.33666667, 339.56      ])

In [49]:
# Predict on the same feature columns the model was fitted on, in the same order
pred_rfr_04 = rfr_04.predict(test[X_train.columns])
pred_rfr_04= pd.DataFrame(pred_rfr_04 , columns = ['pred'])
pred_rfr_04 = np.round(pred_rfr_04,2)
pred_rfr_04.to_csv('TeffaHM_v10_rfr.csv', header=True, index=False)

In [189]:
parameters_01 = {'n_estimators':[100, 150, 250, 300, 500], 'max_features':['sqrt', 'log2'], 'max_depth':[6,10,20, 30, 40]}

In [191]:
rfr_03 = GridSearchCV(estimator = RandomForestRegressor(), param_grid = parameters_01, refit = True, verbose = 2, cv = 5, scoring = 'neg_root_mean_squared_error', n_jobs = -1)

In [196]:
rfr_03.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [197]:
rfr_03.best_estimator_

In [198]:
rfr_03.best_score_

-61.41086501227826

In [None]:
# Predict on the same feature columns the model was fitted on, in the same order
pred_rfr_03 = rfr_03.predict(test[X_train.columns])
pred_rfr_03= pd.DataFrame(pred_rfr_03 , columns = ['pred'])
pred_rfr_03 = np.round(pred_rfr_03,2)
pred_rfr_03.to_csv('TeffaHM.csv', header=True, index=False)