## Modelo 03 Forest Regression

Como parte del bootcamp de Henry de Data Science se realiza un modelo para predecir el número de bicicletas.

### Exploración de datos

El objetivo de la primera parte del notebook es realizar una exploración de los datos del Dataset bike_train.xlsx

1. Librerías a usarse:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
import xgboost as xgb

2. Importamos los archivos como dataframes

In [2]:
bike_train = pd.read_excel('bike_train.xlsx')
bike_test = pd.read_excel('bike_test.xlsx')
bike_train.head(5)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [3]:
bike_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11999 entries, 0 to 11998
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     11999 non-null  int64         
 1   dteday      11999 non-null  datetime64[ns]
 2   season      11999 non-null  int64         
 3   yr          11999 non-null  int64         
 4   mnth        11999 non-null  int64         
 5   hr          11999 non-null  int64         
 6   holiday     11999 non-null  int64         
 7   weekday     11999 non-null  int64         
 8   workingday  11999 non-null  int64         
 9   weathersit  11999 non-null  int64         
 10  temp        11999 non-null  float64       
 11  atemp       11999 non-null  float64       
 12  hum         11999 non-null  float64       
 13  windspeed   11999 non-null  float64       
 14  casual      11999 non-null  int64         
 15  registered  11999 non-null  int64         
 16  cnt         11999 non-

3. Eliminamos las columnas que no servirían como input. Instant es un anterior index y dteday no lo vamos a requerir ya que no lo estamos trabajando como serie de tiempo

In [4]:
drop_columns = ['instant','dteday', 'casual', 'registered', 'windspeed', 'mnth']
train = bike_train.drop(drop_columns, axis =1)
test = bike_test.drop(['instant','dteday', 'windspeed', 'mnth'], axis =1)
train.head()

Unnamed: 0,season,yr,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,cnt
0,1,0,0,0,6,0,1,0.24,0.2879,0.81,16
1,1,0,1,0,6,0,1,0.22,0.2727,0.8,40
2,1,0,2,0,6,0,1,0.22,0.2727,0.8,32
3,1,0,3,0,6,0,1,0.24,0.2879,0.75,13
4,1,0,4,0,6,0,1,0.24,0.2879,0.75,1


In [5]:
y = train['cnt']
X = train.drop(['cnt'], axis=1)

In [14]:
def rmsle(y_true, y_pred, convertExp=True):
    # 지수변환
    if convertExp:
        y_true = np.exp(y_true)
        y_pred = np.exp(y_pred)
        
    # 로그변환 후 결측값을 0으로 변환
    log_true = np.nan_to_num(np.log(y_true+1))
    log_pred = np.nan_to_num(np.log(y_pred+1))
    
    # RMSLE 계산
    output = np.sqrt(np.mean((log_true - log_pred)**2))

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)

In [17]:
from sklearn.ensemble import RandomForestRegressor

randomforest_model = RandomForestRegressor()
# 그리드서치 객체 생성
rf_params = {'random_state':[42], 'n_estimators':[100, 120, 140]}
gridsearch_random_forest_model = GridSearchCV(estimator=randomforest_model,
                                              param_grid=rf_params, n_jobs=-1,
                                              scoring=rmsle_scorer, cv=5)

log_y = np.log(y)
gridsearch_random_forest_model.fit(X, log_y)
print('최적 하이퍼파라미터 :', gridsearch_random_forest_model.best_estimator_)



최적 하이퍼파라미터 : RandomForestRegressor(random_state=42)


In [21]:
preds = gridsearch_random_forest_model.best_estimator_.predict(X)
error = rmsle(log_y, preds)
error

In [24]:
preds

array([3.01158927, 3.4158761 , 2.69060268, ..., 4.80749789, 4.44689203,
       3.91614748])

In [25]:
randomforest_preds = gridsearch_random_forest_model.best_estimator_.predict(test)

In [26]:
randomforest_preds_cnt= np.exp(randomforest_preds)

In [27]:
randomforest_preds_cnt

array([10.17384314,  8.98847062, 18.12132399, ..., 95.59461313,
       97.52944862, 43.33338701])

In [28]:

randomforest_preds_cnt = pd.DataFrame(randomforest_preds_cnt, columns = ['pred'])
randomforest_preds_cnt = np.round(randomforest_preds_cnt,2)
randomforest_preds_cnt.to_csv('TeffaHM_v15_randomforest_GridSearch_sin_viento_sin_mes.csv', header=True, index=False)

In [23]:
print(error)

None


3.1 Evaluamos las correlaciones lineales, nos quedamos con las variables con las que tiene mayor correlación y las que tienen correlación entre ellas

In [6]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.20, random_state=0)

5. Seleccionamos la variables de entrada y las variables de salida

In [7]:
model = XGBRegressor(gpu_id=0)

7.1 Hyperparameter tunning

In [8]:
from sklearn.model_selection import GridSearchCV
xgb = XGBRegressor(gpu_id = 0)

In [9]:
xgb.fit(X_train, y_train)

In [10]:
predicciones= xgb.predict(X_validation)

In [11]:
predicciones

array([135.24422  ,  32.989067 ,   2.7947333, ..., 351.0246   ,
       175.6465   ,  27.288723 ], dtype=float32)

In [13]:
pred_xgb_v5 = xgb.predict(test)
pred_xgb_v5 = pd.DataFrame(pred_xgb_v5, columns = ['pred'])
pred_xgb_v5 = np.round(pred_xgb_v5,2)
pred_xgb_v5.to_csv('TeffaHM_v14_xgb_day.csv', header=True, index=False)

In [34]:
from sklearn.metrics import mean_squared_error, r2_score

In [35]:
mean_squared_error(y_validation,predicciones, squared = False)

34.159297718989066

In [36]:
tuned_parameters = {
    'max_depth' : [3,4,5,6],
    'min_child_weight' : [1,2,3,4,5,6],
    'learning_rate' : [0.01,0.05,0.1,0.2, 0.3],
}

In [37]:
clf = GridSearchCV(xgb, param_grid=tuned_parameters, scoring = 'neg_root_mean_squared_error', cv=5, verbose = 2)

In [38]:
clf.fit(X_train, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=1; total time=   0.2s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=1; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=1; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=1; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=1; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=2; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=2; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=2; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=2; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=2; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_child_weight=3; total time=   0.1s
[CV] END learning_rate=0.01, max_depth=3, min_

In [39]:
predicciones_2 = clf.predict(X_validation)
predicciones_2


array([172.51233 ,  45.763134,   9.892303, ..., 331.39844 , 144.10652 ,
        56.15449 ], dtype=float32)

In [41]:
print(clf.best_score_)

-35.06993906336568


In [40]:
mean_squared_error(y_validation,predicciones_2, squared = False)

34.33967599388628

In [43]:
pred_xgb_v5 = clf.predict(test)
pred_xgb_v5 = pd.DataFrame(pred_xgb_v5, columns = ['pred'])
pred_xgb_v5 = np.round(pred_xgb_v5,2)
pred_xgb_v5.to_csv('TeffaHM_v13_xgb_GridSearch_day.csv', header=True, index=False)

8. Entrenamos el modelo

Random Forest Regression