        COMP-2704: Supervised Machine Learning

        Project – fine tuning and evaluation

I went back to my data analysis and use case, restoring all the features that I removed earlier. Instead of removing the features, I used LabelEncoder() to encode them to integer values that would help train Machine Learning Models.

Also, I made sure that the data was split ony into train/ test

In [1]:
# importing the packages
import pandas as pd
from sklearn.model_selection import GridSearchCV
import numpy as np

training = pd.read_csv('birds_training_data.csv')
testing = pd.read_csv('birds_testing_data.csv')


In [2]:
# Storing the features and labels in four different pandas objects
features_tr = training.drop(columns=['label'])
label_tr = training['label']

features_test = testing.drop(columns=['label'])
label_test = testing['label']


In [3]:
# importing the regressor and gridsearch
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV


In [4]:
# defining the param grid
param_grid = {
    'criterion' : ['squared_error', 'friedman_mse', 'absolute_error'],
    'max_depth': [1, 2, 4, 6],
    'min_samples_leaf': [2,3,4,5],
    'min_impurity_decrease': [0.01, 0.1, 0, 1]
}

In [5]:
# Initializing the decision tree

dec_tree = DecisionTreeRegressor()


In [6]:
# defining a StratifiedKFold object to setup cross-validation

In [7]:
from sklearn.model_selection import StratifiedKFold
str_kfold = StratifiedKFold(n_splits=5, shuffle=True)

In [8]:
# I choose 'mse' for determining the best model

from sklearn.metrics import mean_squared_error, make_scorer
mse_score = make_scorer(mean_squared_error)

In [9]:
# fitting the grid search
from sklearn.pipeline import Pipeline

grid_search = GridSearchCV(estimator=dec_tree, param_grid=param_grid, cv=str_kfold, scoring=mse_score, return_train_score=True, error_score='raise')
try:
    grid_search.fit(features_tr, label_tr)
except ValueError as e:
    print(e)



In [10]:
# displaying best parameters
print(f"The model's best parameters are: { grid_search.best_params_}")

The model's best parameters are: {'criterion': 'absolute_error', 'max_depth': 6, 'min_impurity_decrease': 0, 'min_samples_leaf': 2}


In [11]:
# displaying the weighted_mean_training_score
mean_train_score = grid_search.cv_results_['mean_train_score'][grid_search.best_index_]
print(f"The weighted mean training score: {mean_train_score}")

The weighted mean training score: 2.2814662838955284


In [12]:
# Displaying the weighted mean cross validation score
print(f"Weighted mean cross-validation score: {grid_search.best_score_}")

Weighted mean cross-validation score: 3.840600500417014


In [13]:
pred = grid_search.predict(features_test)

# Since, my model is a regression model, the confusion_matrix and classification_report are not good 
# metrices to validate my model, so I use mean_squared_error
mse = mean_squared_error(label_test, pred)
print(f"The mse in making predictions by the model: {mse}")

The mse in making predictions by the model: 3.6319148936170214


In [14]:
# Training and evaluating SVM  using GridSearchCV from sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


pipeline = Pipeline([
      ('scaler', StandardScaler()),
      ('SVR', SVR())
])
# the regression model SVR() does not have the 'class_weights' parameter

In [4]:
# defining a parameter grid for svr regressor
param_grid_svr = {
     
     'SVR__C': [0.01, 0.1, 0.1, 10, 100],
     'SVR__gamma': [ 0.01, 0.2, 0.3, 1, 10],
     'SVR__kernel': ['linear', 'rbf']
}

In [16]:
# defining str_kfold
from sklearn.model_selection import StratifiedKFold

str_kfold = StratifiedKFold(n_splits=5, shuffle=True)


In [17]:
# I choose mean squared error as a metric 
mse_scorer = make_scorer(mean_squared_error)

In [18]:
# fitting the grid search model for SVM
grid_search_svr = GridSearchCV(pipeline, param_grid_svr, cv=str_kfold, scoring=mse_scorer, return_train_score=True)

In [19]:
grid_search_svr.fit(features_tr, label_tr)



In [22]:
# Displaying the best model parameters
print(f"The best model hyperparameters of a  SVR model are: {grid_search_svr.best_params_}")

The best model parameters of a  SVR model are: {'SVR__C': 100, 'SVR__gamma': 0.2, 'SVR__kernel': 'rbf'}


In [24]:
# Displaying weighted mean training score
mse_svr = grid_search.cv_results_['mean_train_score'][grid_search_svr.best_index_]
print(f"The weighted mean training score: {mse_svr}")

The weighted mean training score: 2.349131599939523


In [25]:
# displaying the weighted mean cross-validation score
print(f"The weighted mean cross-validation score: {grid_search_svr.best_score_}")

The weighted mean cross-validation score: 5.9628116813516785


In [41]:
# finding predictions
pred_svr = grid_search_svr.predict(features_test)
mse_svr = mean_squared_error(label_test, pred_svr)

print(f"The mean squared error in finding the predictions on the test data is: {mse_svr}")
#the confusion_matrix and classification_report are for classification and my model is regression 

The mean squared error in finding the predictions on the test data is: 5.750639294236479


# Discussing the best model

Here, we notice that the mean_training score is almost same for both the decision tree and SVM models, however, the cross-validation and test scores for SVM model are comparitively higher suggesting overfitting, thus it cannot be good model as it tries to train itself learning the training data rather than understnading the underlying insights in the data. 
Meanwhile, the cross-validation and test scores for decision tree model are lower than SVM and also are closer to each other. 

Thus, it seems that the decision tree regressor is a good model, which is not overfitting and also has lower training, cross-validation and test scores.

For the decision tree model: the best params are, {'criterion': 'absolute_error', 'max_depth': 6, 'min_impurity_decrease': 0, 'min_samples_leaf': 2}
For SVM model: the best params are, {'SVR__C': 100, 'SVR__gamma': 0.2, 'SVR__kernel': 'rbf'}

But still decision tree model performs better as it decides which feature to give most importance while predicting the labels. 

In [30]:
# dumping the model to a pickle file 
import pickle
with open('model.pkl', 'wb') as file:
    pickle.dump(grid_search.best_estimator_, file)
    

In [33]:
# loading the model from the pickle file
with open('model.pkl', 'rb') as f:

    model = pickle.load(f)

In [34]:
# making predictions using the model
pred_test = model.predict(features_test)

In [37]:
mse_best = mean_squared_error(label_test, pred_test)
print(f"The mse by the best model on the test data is: {mse_best}")
# The confusion matrix and classification report cannot be displayed for a regression model

The mse by the best model on the test data is: 3.6319148936170214


My use case for my model was to predict number of birds in the Apline regions in Canada for researchers and bioactivists to know which bird populations are decreasing or will decrease by changing some of the climatic conditions and they need immediate action to prereserve those.

With the testing mse of 3.6 in the decision tree regressor model seems good in preedicting the number of birds in Alpine regions in Canada. The predictions could be reliable as the goal is not to predict the exact number of birds in those, however, it is know which bird species' popultions are comapritively lower than others, so it to know which bird speci numbers are extremely less to bring solutions to this problem.

Limitaions of the model:
- Changing enviromental factors: As sometimes the birds may hibernate in extreme winters resulting in the lower number of birds being detected, so the training data itself being flawed may not preedict the exact number of birds.
- Overfitting: The model might have trained by learning the training dataset while not realising the underlying patterns in it.
  