# **Model Training for the Forest Covertypes Data**

![Image](https://cdn.shopify.com/s/files/1/0326/7189/t/65/assets/pf-85b5b49e--Website-Header-2000px-x-600px.jpg?v=1625226604)

[Image Credit](https://onetreeplanted.org/pages/million-tree-challenge) <br>

This notebook uses the data from the **Forest Cover Type Prediction** competition to develop prediction models. <br>
The data can be downloaded from: <br>
https://www.kaggle.com/competitions/forest-cover-type-prediction/data 

In [45]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps
import os

# to bypass warnings in various dataframe assignments
pd.options.mode.chained_assignment = None

pd.set_option('display.max_columns', 60)
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.metrics import accuracy_score


In [94]:
# Loading data
train = pd.read_csv("data/processed/train.csv")
test = pd.read_csv("data/processed/test.csv")

sample_submission = pd.read_csv("data/raw/sampleSubmission.csv")

## **1. Feature-Label:**
The first step is to separate the label from the features.

In [86]:
# Splitting data into features and labels
x_train = train.drop(['Cover_Type','Id'], axis=1)
y_train = train['Cover_Type']

x_test = test.drop(['Id'], axis=1)

Because we are going to use cross-validation and already have a blind test set, we do not split the training set into separate training and validation subsets.

In [88]:
# checking the shape of the data
print("Shape of training input: ", x_train.shape)
print("Shape of training labels: ", y_train.shape)
print("Shape of test input: ", x_test.shape)

Shape of training input:  (14988, 54)
Shape of training labels:  (14988,)
Shape of test input:  (565892, 54)


## **2. Creating Baseline Model:**
In this section, we create a baseline model which is used as a benchmark to compare the performance of the other models. <br>
Such a baseline is usually a simple rule for predicting labels from the relationships between the features and the target. If there are obvious relationships, we can base the predictions on them; if a domain expert knows such relationships from the underlying technical principles, we can encode that knowledge as rules. <br>
Lacking such knowledge, we make the baseline predictions by drawing labels at random in proportion to the class frequencies in the training data, using `DummyClassifier` with `strategy='stratified'` as follows.

In [89]:
# Create dummy classifier
dummy = DummyClassifier(strategy='stratified', random_state=1)

# train the model
dummy.fit(x_train, y_train)

# accuracy score of the model on the training set
accuracy_dm = dummy.score(x_train, y_train)
print(f'The accuracy of the dummy algorithm is {accuracy_dm.round(3)} over the training set.')

The accuracy of the dummy algorithm is 0.141 over the training set.


As can be seen, the dummy classifier predicts only 14.1 % of the labels correctly on the training set. <br>
It's time to try other classifiers.
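
As a rough sanity check on this baseline: a classifier that guesses labels at random in proportion to the class frequencies is expected to be correct with probability equal to the sum of the squared class proportions. The sketch below computes that quantity. The function name `expected_stratified_accuracy` is hypothetical, and the seven equally frequent classes are an assumption made for illustration, not the actual class counts of the processed training set.

```python
from collections import Counter


def expected_stratified_accuracy(labels):
    """Expected accuracy of guessing labels at random in proportion to class frequencies."""
    counts = Counter(labels)
    n = len(labels)
    # A guess is correct when the drawn class matches the true class: sum of p_k ** 2.
    return sum((c / n) ** 2 for c in counts.values())


# Illustrative assumption: seven equally frequent classes (not the real class counts).
balanced_labels = [k for k in range(1, 8) for _ in range(100)]
print(round(expected_stratified_accuracy(balanced_labels), 3))  # 0.143
```

If the classes in the training data are roughly balanced, a value near 1/7 is in line with the score reported by the dummy classifier.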

## **3. Developing Prediction Models Using All Features:**
In this section, we develop several machine learning models to predict the cover types, evaluating each one with cross-validation and on the blind test set. <br>
These models are:
* Logistic Regression
* K-Nearest Neighbors
* Decision Tree
* Gradient Boosting Tree
* Random Forest
* Extra Trees (Extreme Random Forest)

To get the most out of each model, we apply 5-fold cross-validation together with hyperparameter tuning based on Bayesian optimization to all of them. <br>
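
To make the 5-fold scheme concrete, the following minimal sketch splits a toy array with the same `RepeatedKFold` settings used in this notebook and prints the size of each fold. The toy array `X_toy` is an assumption for illustration and stands in for the real training data.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Toy stand-in for the training data: 20 samples, 3 features (illustrative only).
X_toy = np.arange(60).reshape(20, 3)

cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=1)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_toy), start=1):
    # With n_repeats=1, each sample lands in exactly one validation fold.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each fold trains on 80 % of the samples and validates on the remaining 20 %. Note that `RepeatedKFold` does not stratify by class; `StratifiedKFold` would be an option if preserving class proportions in every fold matters.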

**Standardizing data before training?** <br>

For some algorithms, it is very important to scale data before training the models. This can significantly affect the performance of the models. Some comparisons between the performance of models developed using scaled data and unscaled data can be found [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html). It states that:

> Even if tree based models are (almost) not affected by scaling, many other algorithms require features to be normalized, often for different reasons: to ease the convergence (such as a non-penalized logistic regression), to create a completely different model fit compared to the fit with unscaled data (such as KNeighbors models). The latter is demoed on the first part of the present example.

So, we first standardize the data based on Z-score normalization as follows:

In [90]:
# standardize the train and test data using statistics from the training data; binary features are excluded
binary_features = ['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 
'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13',
 'Soil_Type14', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 
 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 
 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']

# standardize the data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.drop(binary_features, axis=1))
x_test_scaled = scaler.transform(x_test.drop(binary_features, axis=1))

# convert the numpy arrays to dataframes
x_train_scaled = pd.DataFrame(x_train_scaled, columns=x_train.drop(binary_features, axis=1).columns)
x_test_scaled = pd.DataFrame(x_test_scaled, columns=x_test.drop(binary_features, axis=1).columns)

# add the binary features back to the dataframes
x_train_scaled = pd.concat((x_train_scaled, x_train[binary_features].reset_index(drop=True)), axis=1)
x_test_scaled = pd.concat((x_test_scaled, x_test[binary_features].reset_index(drop=True)), axis=1)

# check the shape of the data
print("Shape of training input: ", x_train_scaled.shape)
print("Shape of test input: ", x_test_scaled.shape)

Shape of training input:  (14988, 54)
Shape of test input:  (565892, 54)
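
The same selective scaling can also be expressed with scikit-learn's `ColumnTransformer`, which scales a chosen subset of columns and passes the rest through unchanged. The sketch below is only an illustration on a toy frame: the values are made up, and `Soil_TypeX` is a hypothetical binary column, not one of the real features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values standing in for the real features.
toy = pd.DataFrame({
    "Elevation": [2500.0, 2800.0, 3100.0, 3400.0],
    "Slope": [5.0, 10.0, 15.0, 20.0],
    "Soil_TypeX": [0, 1, 0, 1],  # hypothetical binary indicator, left untouched
})
continuous = ["Elevation", "Slope"]

# Scale only the continuous columns; pass every other column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous)],
    remainder="passthrough",
)

# The result is a NumPy array: scaled columns first, passthrough columns after.
toy_scaled = preprocess.fit_transform(toy)
print(toy_scaled.shape)
```

One possible benefit of this form is that the transformer can be placed in a `Pipeline` together with the estimator, so that the scaler is refitted on the training part of every cross-validation fold instead of on the full training set.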


### **3.1 Model 1: Logistic Regression**
The first model we are going to develop is logistic regression using `LogisticRegression` from scikit-learn.

In [112]:
model = LogisticRegression(solver='newton-cg', multi_class='multinomial')
param_grid = {'fit_intercept' : [False, True],
              'C'             : Real(1e-3, 100, prior='log-uniform')}
cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=1)
search1 = BayesSearchCV(estimator=model, search_spaces=param_grid, cv=cv, n_jobs=-1, return_train_score=True, random_state=1, verbose=0)
search1.fit(x_train_scaled, y_train)

BayesSearchCV(cv=RepeatedKFold(n_repeats=1, n_splits=5, random_state=1),
              estimator=LogisticRegression(multi_class='multinomial',
                                           solver='newton-cg'),
              n_jobs=-1, random_state=1, return_train_score=True,
              search_spaces={'C': Real(low=0.001, high=100, prior='log-uniform', transform='normalize'),
                             'fit_intercept': [False, True]})

Let's check the average CV scores and the best hyperparameters:

In [119]:
accuracy_lr_cv_train = search1.cv_results_['mean_train_score'][search1.best_index_].round(4)
accuracy_lr_cv_val = search1.cv_results_['mean_test_score'][search1.best_index_].round(4)

print(f"Optimal hyperparameters of the model = {search1.best_params_}")
print(f"CV AVG train score of the model = {accuracy_lr_cv_train}")
print(f"CV AVG validation score of the model = {accuracy_lr_cv_val}")

Optimal hyperparameters of the model = OrderedDict([('C', 2.7502092219694063), ('fit_intercept', True)])
CV AVG train score of the model = 0.7137
CV AVG validation score of the model = 0.7081
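
The search space for `C` uses `prior='log-uniform'`, which spreads candidate values evenly across orders of magnitude rather than evenly across the raw range. The sketch below illustrates what a log-uniform distribution over the same bounds looks like, using plain NumPy. It is an illustration of the distribution only, not of how scikit-optimize chooses candidates internally.

```python
import numpy as np

rng = np.random.default_rng(1)
low, high = 1e-3, 100.0

# Draw uniformly in log10 space, then map back to the original scale.
samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)

# Each of the five decades between 1e-3 and 1e2 should get about one fifth of the draws.
decade_edges = 10.0 ** np.arange(-3, 3)
shares = np.histogram(samples, bins=decade_edges)[0] / samples.size
print(shares.round(2))
```

With a plain uniform prior, by contrast, about 90 % of the draws would fall between 10 and 100, leaving small values of `C` almost unexplored.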


Now we build the model with the optimal hyperparameters on the whole training data (without cross-validation) and then make the final predictions on the blind test data.

In [114]:
# train model with optimal hyperparameters on the entire training data
model1 = LogisticRegression(solver='newton-cg', multi_class='multinomial', **search1.best_params_)
model1.fit(x_train_scaled, y_train)

# check the accuracy of the model on the training data itself
y_pred = model1.predict(x_train_scaled)
accuracy_lr_train = accuracy_score(y_train, y_pred)
print(f'The accuracy of the model over the training data is {accuracy_lr_train.round(3)}.')

# make predictions on the test data
y_pred = model1.predict(x_test_scaled)
y_pred.shape

The accuracy of the model over the training data is 0.713.


(565892,)

We write the predictions to a CSV file for submission.

In [117]:
submission1 = sample_submission.copy()
submission1['Cover_Type'] = y_pred

# Save the submission file
submission1.to_csv('data/prediction/submission1.csv', index=False)

# Check the submission file
submission1['Cover_Type'].value_counts()

1    201932
2    187825
5     67134
7     43203
6     32439
3     27185
4      6174
Name: Cover_Type, dtype: int64

**Final Scores: CV train = 0.7137 | CV validation = 0.7081 | test (blind) = 0.597**
__________
__________

### **3.2 Model 2: K-Nearest Neighbors**
The second model we are going to develop is k-nearest neighbors using `KNeighborsClassifier` from scikit-learn.

In [121]:
model = KNeighborsClassifier()
param_grid = {'n_neighbors' : Integer(1,40), 
              'weights'     : Categorical(['uniform', 'distance'])}
cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=1)
search2 = BayesSearchCV(estimator=model, search_spaces=param_grid, n_iter=10, cv=cv, n_jobs=-1, return_train_score=True, random_state=1, verbose=0)
search2.fit(x_train_scaled, y_train)

BayesSearchCV(cv=RepeatedKFold(n_repeats=1, n_splits=5, random_state=1),
              estimator=KNeighborsClassifier(), n_iter=10, n_jobs=-1,
              random_state=1, return_train_score=True,
              search_spaces={'n_neighbors': Integer(low=1, high=40, prior='uniform', transform='normalize'),
                             'weights': Categorical(categories=('uniform', 'distance'), prior=None)})

Let's check the average CV scores and the best hyperparameters:

In [122]:
accuracy_knn_cv_train = search2.cv_results_['mean_train_score'][search2.best_index_].round(4)
accuracy_knn_cv_val = search2.cv_results_['mean_test_score'][search2.best_index_].round(4)

print(f"Optimal hyperparameters of the model = {search2.best_params_}")
print(f"CV AVG train score of the model = {accuracy_knn_cv_train}")
print(f"CV AVG validation score of the model = {accuracy_knn_cv_val}")

Optimal hyperparameters of the model = OrderedDict([('n_neighbors', 6), ('weights', 'uniform')])
CV AVG train score of the model = 0.8493
CV AVG validation score of the model = 0.7846


Now we build the model with the optimal hyperparameters on the whole training data (without cross-validation) and then make the final predictions on the blind test data.

In [123]:
# train model with optimal hyperparameters on the entire training data
model2 = KNeighborsClassifier(**search2.best_params_)
model2.fit(x_train_scaled, y_train)

# check the accuracy of the model on the training data itself
y_pred = model2.predict(x_train_scaled)
accuracy_knn_train = accuracy_score(y_train, y_pred)
print(f'The accuracy of the model over the training data is {accuracy_knn_train.round(3)}.')

# make predictions on the test data
y_pred = model2.predict(x_test_scaled)
y_pred.shape

The accuracy of the model over the training data is 0.858.


(565892,)

We write the predictions to a CSV file for submission.

In [124]:
submission2 = sample_submission.copy()
submission2['Cover_Type'] = y_pred

# Save the submission file
submission2.to_csv('data/prediction/submission2.csv', index=False)

# Check the submission file
submission2['Cover_Type'].value_counts()

1    217106
2    187152
5     55709
7     37077
3     35520
6     28654
4      4674
Name: Cover_Type, dtype: int64

**Final Scores: CV train = 0.8493 | CV validation = 0.7846 | test (blind) = 0.643**
__________
__________

### **3.3 Model 3: Decision Tree**
The third model we are going to develop is a decision tree using `DecisionTreeClassifier` from scikit-learn.

In [125]:
model = DecisionTreeClassifier()
param_grid = {'max_depth':Integer(1,30)}
cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=1)
search3 = BayesSearchCV(estimator=model, search_spaces=param_grid, n_iter=10, cv=cv, n_jobs=-1, return_train_score=True, random_state=1, verbose=0)
search3.fit(x_train_scaled, y_train)

BayesSearchCV(cv=RepeatedKFold(n_repeats=1, n_splits=5, random_state=1),
              estimator=DecisionTreeClassifier(), n_iter=10, n_jobs=-1,
              random_state=1, return_train_score=True,
              search_spaces={'max_depth': Integer(low=1, high=30, prior='uniform', transform='normalize')})

Let's check the average CV scores and the best hyperparameters:

In [126]:
accuracy_dt_cv_train = search3.cv_results_['mean_train_score'][search3.best_index_].round(4)
accuracy_dt_cv_val = search3.cv_results_['mean_test_score'][search3.best_index_].round(4)

print(f"Optimal hyperparameters of the model = {search3.best_params_}")
print(f"CV AVG train score of the model = {accuracy_dt_cv_train}")
print(f"CV AVG validation score of the model = {accuracy_dt_cv_val}")

Optimal hyperparameters of the model = OrderedDict([('max_depth', 21)])
CV AVG train score of the model = 0.9935
CV AVG validation score of the model = 0.7971
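
The score-reporting code is the same for every model, so it could be collected in a small helper that also reports the gap between the train and validation scores, a quick indicator of overfitting. The sketch below is one way to do that. The name `summarize_search` is hypothetical, and the stand-in object only mimics the attributes the helper reads, filled with the decision tree scores reported above.

```python
from types import SimpleNamespace


def summarize_search(search, ndigits=4):
    """Collect best hyperparameters and mean CV scores from a fitted search object.

    Assumes the search was run with return_train_score=True.
    """
    i = search.best_index_
    train = float(search.cv_results_["mean_train_score"][i])
    val = float(search.cv_results_["mean_test_score"][i])
    return {
        "best_params": dict(search.best_params_),
        "cv_train": round(train, ndigits),
        "cv_val": round(val, ndigits),
        "gap": round(train - val, ndigits),
    }


# Stand-in object with the attributes used above (not a real BayesSearchCV instance).
fake_search = SimpleNamespace(
    best_index_=0,
    best_params_={"max_depth": 21},
    cv_results_={"mean_train_score": [0.9935], "mean_test_score": [0.7971]},
)
summary = summarize_search(fake_search)
print(summary)
```

For the decision tree the gap is close to 0.2, much larger than for logistic regression, which suggests that the tree fits the training folds far more closely than it generalizes.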


Now we build the model with the optimal hyperparameters on the whole training data (without cross-validation) and then make the final predictions on the blind test data.

In [127]:
# train model with optimal hyperparameters on the entire training data
model3 = DecisionTreeClassifier(**search3.best_params_)
model3.fit(x_train_scaled, y_train)

# check the accuracy of the model on the training data itself
y_pred = model3.predict(x_train_scaled)
accuracy_dt_train = accuracy_score(y_train, y_pred)
print(f'The accuracy of the model over the training data is {accuracy_dt_train.round(3)}.')

# make predictions on the test data
y_pred = model3.predict(x_test_scaled)
y_pred.shape

The accuracy of the model over the training data is 0.994.


(565892,)

We write the predictions to a CSV file for submission.

In [128]:
submission3 = sample_submission.copy()
submission3['Cover_Type'] = y_pred

# Save the submission file
submission3.to_csv('data/prediction/submission3.csv', index=False)

# Check the submission file
submission3['Cover_Type'].value_counts()

2    223022
1    206127
5     38723
3     35546
7     34382
6     25428
4      2664
Name: Cover_Type, dtype: int64

**Final Scores: CV train = 0.9935 | CV validation = 0.7971 | test (blind) = 0.672**
__________
__________

### **3.4 Model 4: Gradient Boosting Tree**
The fourth model we are going to develop is a gradient boosting tree model using `GradientBoostingClassifier` from scikit-learn.

In [None]:
x_train_scaled.iloc[:,:12].info()