## Model Creation & Evaluation

In [1]:
%%capture
#Run the feature selection/engineering notebook; it is expected to define traffic_data, new_X, new_y and the imports used below
%run feature_selection-engineering.ipynb

### Train Test Split

In [2]:
#Get dependent and independent variables
X = traffic_data[["Year", "Month", "DayOfWeek", "HourOfDay", "Junction"]]
y = traffic_data[["Vehicles"]]

#Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [3]:
#Model evaluation function
def model_eval(model):
    #model is an unfitted estimator instance, e.g. XGBRegressor()
    model.fit(X_train, y_train) #Fit the model
    
    #Make predictions on the test set
    y_pred = model.predict(X_test)

    #Different metrics
    rmse = np.sqrt(mean_squared_error(y_test, y_pred)) #Calculate the RMSE
    r2 = r2_score(y_test, y_pred) #Calculate the R-squared
    mae = mean_absolute_error(y_test, y_pred) #Calculate MAE

    #Print the results
    print("RMSE: ", rmse)
    print("R-squared: ", r2)
    print("MAE: ", mae)
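
For reference, the three metrics printed above can be written out by hand. This is a minimal NumPy sketch, not the notebook's own code: `regression_metrics` and the toy numbers are hypothetical and only illustrate the formulas. The `ravel()` calls matter here because `y` is a one-column DataFrame while `predict` returns a 1-D array; scikit-learn's metric functions handle that mismatch, but a naive NumPy subtraction would broadcast to an n-by-n matrix.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, R-squared and MAE for two equally long sequences."""
    # Flatten both inputs so a (n, 1) target and a (n,) prediction line up
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {"rmse": rmse, "r2": r2, "mae": mae}


# Toy values (hypothetical, not taken from the traffic data)
toy = regression_metrics([[10], [20], [30], [40]], [12, 18, 33, 39])
print(toy)
```

RMSE is never smaller than MAE on the same predictions, which is consistent with every result reported in this notebook.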

#### XGBoost Regressor

In [4]:
#XGBoost
model_eval(XGBRegressor())

RMSE:  5.046367826810308
R-squared:  0.9316728843005203
MAE:  3.36985913373724


#### Random Forest Regressor

In [5]:
#RandomForest
model_eval(RandomForestRegressor())

RMSE:  4.543637380216312
R-squared:  0.9446085657888055
MAE:  2.9751357846113913


#### LGBM Regressor

In [6]:
#LGBMRegressor
model_eval(lgb.LGBMRegressor())

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 38496, number of used features: 5
[LightGBM] [Info] Start training from score 21.863407
RMSE:  5.897020920200154
R-squared:  0.9066959233210862
MAE:  3.9833544724937546


### With dummy variables

In [7]:
#Split dataset into train and test
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(new_X, new_y, test_size = 0.2, random_state = 42)

In [8]:
#Model evaluation function
def new_model_eval(model):
    #model is an unfitted estimator instance, e.g. XGBRegressor()
    model.fit(new_X_train, new_y_train) #Fit the model
    
    #Make predictions on the test set
    new_y_pred = model.predict(new_X_test)

    #Different metrics
    rmse = np.sqrt(mean_squared_error(new_y_test, new_y_pred)) #Calculate the RMSE
    r2 = r2_score(new_y_test, new_y_pred) #Calculate the R-squared
    mae = mean_absolute_error(new_y_test, new_y_pred) #Calculate MAE

    #Print the results
    print("RMSE: ", rmse)
    print("R-squared: ", r2)
    print("MAE: ", mae)
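
`new_model_eval` differs from `model_eval` only in which split it reads from the global scope. One possible refactor, sketched here under assumptions, passes the split in as arguments and returns the metrics instead of printing them. The name `evaluate`, the synthetic data from `make_regression`, and the fixed `random_state` are all illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model` on the training split and score it on the test split."""
    # np.ravel turns a one-column DataFrame target into the 1-D shape estimators expect
    model.fit(X_train, np.ravel(y_train))
    y_pred = model.predict(X_test)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
        "r2": float(r2_score(y_test, y_pred)),
        "mae": float(mean_absolute_error(y_test, y_pred)),
    }


# Synthetic stand-in for the traffic data (assumption, for a runnable demo only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# train_test_split returns X_train, X_test, y_train, y_test in the order evaluate expects
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_metrics = evaluate(RandomForestRegressor(n_estimators=50, random_state=42), *split)
print(demo_metrics)
```

With a helper like this, the two feature sets could share one function, for example `evaluate(XGBRegressor(), new_X_train, new_X_test, new_y_train, new_y_test)`.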

#### XGBoost Regressor

In [9]:
#XGBoost
new_model_eval(XGBRegressor())

RMSE:  5.5934440588926435
R-squared:  0.916055184190329
MAE:  3.764442518738149


#### Random Forest Regressor

In [10]:
#RandomForest
new_model_eval(RandomForestRegressor())

RMSE:  4.913429415726535
R-squared:  0.9352254021079539
MAE:  3.222777694006845


#### LGBM Regressor

In [11]:
#LGBMRegressor
new_model_eval(lgb.LGBMRegressor())

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 148
[LightGBM] [Info] Number of data points in the train set: 38496, number of used features: 74
[LightGBM] [Info] Start training from score 21.863407
RMSE:  6.071745192771964
R-squared:  0.9010849538732142
MAE:  4.135102803829889
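
The hold-out results for the two feature sets are easier to compare side by side. The sketch below copies the RMSE values printed above, rounded to three decimals, into a plain dictionary; the key names `five_columns` and `dummy_variables` are labels chosen here for illustration.

```python
# Hold-out RMSE values copied from the cell outputs above, rounded to 3 decimals
holdout_rmse = {
    "XGBRegressor": {"five_columns": 5.046, "dummy_variables": 5.593},
    "RandomForestRegressor": {"five_columns": 4.544, "dummy_variables": 4.913},
    "LGBMRegressor": {"five_columns": 5.897, "dummy_variables": 6.072},
}

# Change in RMSE when switching from the five raw columns to dummy variables
rmse_increase = {
    name: round(scores["dummy_variables"] - scores["five_columns"], 3)
    for name, scores in holdout_rmse.items()
}

# Model with the lowest RMSE on the five-column feature set
best_model = min(holdout_rmse, key=lambda name: holdout_rmse[name]["five_columns"])

for name, delta in rmse_increase.items():
    print(f"{name}: RMSE changes by {delta:+.3f} with dummy variables")
print("Lowest hold-out RMSE:", best_model)
```

On this single split, all three models score a higher RMSE with dummy variables, and the random forest has the lowest RMSE in both settings.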


### Cross Validation 

In [12]:
#Define the cross-validation method
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
def cv_model_eval(model):
    #Use cross-validation to evaluate the model
    scores = cross_val_score(model, X, y, cv = kfold, scoring = 'neg_mean_squared_error')

    #Calculate the RMSE (square root of the mean fold MSE), R-squared, and MAE
    rmse = np.sqrt(-scores.mean())
    r2 = cross_val_score(model, X, y, cv=kfold, scoring='r2').mean()
    mae = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_absolute_error').mean()

    #Print the results
    print("RMSE: ", rmse)
    print("R-squared: ", r2)
    print("MAE: ", -mae)
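
`cv_model_eval` calls `cross_val_score` three times, so every model is refitted on all ten folds once per metric, and an unseeded estimator such as `RandomForestRegressor()` is scored on a different set of fitted trees each time. A possible alternative, sketched here with scikit-learn's `cross_validate`, collects all three metrics from a single pass. The helper name `cv_metrics`, the synthetic data, and the use of `LinearRegression` are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def cv_metrics(model, X, y, cv):
    """Score `model` with several metrics using one fit per fold."""
    scoring = {
        "mse": "neg_mean_squared_error",
        "r2": "r2",
        "mae": "neg_mean_absolute_error",
    }
    results = cross_validate(model, X, y, cv=cv, scoring=scoring)
    fold_mse = -results["test_mse"]  # scorers are negated, so flip the sign
    return {
        # Same definition as cv_model_eval: square root of the mean fold MSE
        "rmse_of_mean_mse": float(np.sqrt(fold_mse.mean())),
        # Alternative definition: average of the per-fold RMSE values
        "mean_fold_rmse": float(np.sqrt(fold_mse).mean()),
        "r2": float(results["test_r2"].mean()),
        "mae": float(-results["test_mae"].mean()),
    }


# Synthetic stand-in for X and y (assumption, for a runnable demo only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=42)
cv_demo = cv_metrics(LinearRegression(), X_demo, y_demo, kfold_demo)
print(cv_demo)
```

The two RMSE definitions generally differ slightly: the mean of per-fold RMSE values can never exceed the square root of the mean MSE, so it is worth stating which one a reported number uses.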

#### XGBRegressor

In [13]:
#XGBoost
cv_model_eval(XGBRegressor())

RMSE:  5.113697453557049
R-squared:  0.9331755953442382
MAE:  3.386168236494238


#### RandomForestRegressor

In [14]:
#RandomForest
cv_model_eval(RandomForestRegressor())

RMSE:  4.5249271200208625
R-squared:  0.9472842058972567
MAE:  2.927437008648471


#### LGBMRegressor

In [15]:
import warnings
warnings.filterwarnings("ignore")

In [16]:
#LGBMRegressor
cv_model_eval(lgb.LGBMRegressor())

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 5
[LightGBM] [Info] Start training from score 21.844718
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 5
[LightGBM] [Info] Start training from score 21.797708
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 5
[LightGBM] [Info] Start training from score 21.749151
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 5
[LightGBM] [Info] Start training from score 21.794822
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 78
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 5
[LightGBM] [Info] Start training from score 21.777771

### Cross Validation with dummy variables

#### XGBRegressor

In [17]:
#Define the cross-validation method
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
def cv_model_eval_d(model):
    #Use cross-validation to evaluate the model
    scores = cross_val_score(model, new_X, new_y, cv = kfold, scoring = 'neg_mean_squared_error')

    #Calculate the RMSE, R-squared, and MAE
    rmse = np.sqrt(-scores.mean())
    r2 = cross_val_score(model, new_X, new_y, cv=kfold, scoring='r2').mean()
    mae = cross_val_score(model, new_X, new_y, cv=kfold, scoring='neg_mean_absolute_error').mean()

    #Print the results
    print("RMSE: ", rmse)
    print("R-squared: ", r2)
    print("MAE: ", -mae)

In [18]:
#XGBoost
cv_model_eval_d(XGBRegressor())

RMSE:  5.72827796050781
R-squared:  0.9161441948318643
MAE:  3.8506377245752335


#### RandomForestRegressor

In [19]:
#RandomForest
cv_model_eval_d(RandomForestRegressor())

RMSE:  4.968155442034164
R-squared:  0.9368958126762756
MAE:  3.210981798930902


#### LGBMRegressor

In [20]:
#LGBMRegressor
cv_model_eval_d(lgb.LGBMRegressor())

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 148
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 74
[LightGBM] [Info] Start training from score 21.844718
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 148
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 74
[LightGBM] [Info] Start training from score 21.797708
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 148
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 74
[LightGBM] [Info] Start training from score 21.749151
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 148
[LightGBM] [Info] Number of data points in the train set: 43308, number of used features: 74
[LightGBM] [Info] Start training from score 21.794822