<a href="https://colab.research.google.com/github/PKpacheco/superv_ml_assignment/blob/main/DC_SVM_Comparison_Paola_Katherine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Machine Learning - Comparison between SVM and Decision Tree
#### Paola Katherine Pacheco
#### Hockey Analysis using NHL data


- [Decision Tree notebook](https://github.com/PKpacheco/superv_ml_assignment/blob/main/decision_trees__Paola_Katherine.ipynb)
- [SVM notebook](https://github.com/PKpacheco/superv_ml_assignment/blob/main/svm_Paola_Katherine.ipynb)


This notebook compares the saved Decision Tree (DT) and SVM models on the same held-out test set to determine which one achieves better accuracy, precision, recall and F1 score.

In [1]:
import joblib
import requests

# Folder on GitHub that holds the saved models and the test sets
base_url = 'https://raw.githubusercontent.com/PKpacheco/superv_ml_assignment/main/saved_joblib/'
filenames = ['best_svm_model.joblib', 'best_dt_model.joblib', 'X_test.joblib', 'y_test.joblib']

# Download each file; raise_for_status() stops on a failed request
# instead of silently writing an error page to disk
for filename in filenames:
    response = requests.get(base_url + filename, timeout=30)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)

# Load the saved DT model
dt_model = joblib.load('best_dt_model.joblib')

# Load the saved SVM model
svm_model = joblib.load('best_svm_model.joblib')

# Load the saved X_test, test set
X_test = joblib.load('X_test.joblib')

# Load the saved y_test, test set
y_test = joblib.load('y_test.joblib')


In [2]:
# check DT model
dt_model

In [3]:
# check SVM model
svm_model

In [4]:
# check X_test
X_test

| | position | height | weight | hall_fame | points_per_game | bmi |
|---|---|---|---|---|---|---|
| 438 | 1 | 73 | 190 | 0 | 0.245902 | 25.064740 |
| 198 | 1 | 73 | 185 | 0 | 0.262357 | 24.405142 |
| 586 | 2 | 68 | 165 | 0 | 0.500000 | 25.085424 |
| 34 | 0 | 71 | 196 | 1 | 1.186027 | 27.333466 |
| 900 | 3 | 72 | 180 | 0 | 0.480300 | 24.409722 |
| ... | ... | ... | ... | ... | ... | ... |
| 191 | 3 | 72 | 192 | 0 | 0.272727 | 26.037037 |
| 408 | 1 | 73 | 180 | 0 | 0.000000 | 23.745543 |
| 14 | 3 | 70 | 185 | 1 | 0.974026 | 26.541837 |
| 837 | 0 | 72 | 178 | 0 | 0.449275 | 24.138503 |


In [5]:
# check test shape
X_test.shape

(300, 6)

In [6]:
# check y test shape
y_test.shape

(300,)

In [7]:
# check y_test
y_test

438    4
198    4
586    3
34     2
900    3
      ..
191    4
408    4
14     2
837    3
399    4
Name: range_points, Length: 300, dtype: int64

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tabulate import tabulate

# Use the imported SVM model to make predictions on X_test
svm_predictions = svm_model.predict(X_test)

# Use the imported DT model to make predictions on X_test
dt_predictions = dt_model.predict(X_test)

# average='weighted' averages the per-class scores, weighted by class support.
# zero_division=0 scores a class that is never predicted as 0 without a warning.

# Calculate metrics for SVM
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_precision = precision_score(y_test, svm_predictions, average='weighted', zero_division=0)
svm_recall = recall_score(y_test, svm_predictions, average='weighted', zero_division=0)
svm_f1 = f1_score(y_test, svm_predictions, average='weighted', zero_division=0)

# Calculate metrics for DT
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_precision = precision_score(y_test, dt_predictions, average='weighted', zero_division=0)
dt_recall = recall_score(y_test, dt_predictions, average='weighted', zero_division=0)
dt_f1 = f1_score(y_test, dt_predictions, average='weighted', zero_division=0)


# Define the data to be displayed in the table
data = [
    ['SVM', svm_accuracy, svm_precision, svm_recall, svm_f1],
    ['DT', dt_accuracy, dt_precision, dt_recall, dt_f1]
]

# Define the headers for the table
title = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']

# Print the table
print(tabulate(data, headers=title))


Model      Accuracy    Precision    Recall    F1 Score
-------  ----------  -----------  --------  ----------
SVM        0.953333     0.946448  0.953333    0.949719
DT         0.986667     0.993351  0.986667    0.989412
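
The weighted averages in the table summarise all classes in a single number, so they do not show which `range_points` classes each model gets wrong. The sketch below is one possible way to look at the per-class breakdown with scikit-learn's `confusion_matrix` and `recall_score`. The labels are hypothetical toy values, not the NHL data; in this notebook they would be replaced by `y_test` and `dt_predictions` (or `svm_predictions`).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels standing in for y_test and a model's predictions
y_true = [2, 2, 3, 3, 3, 4, 4, 4, 4, 4]
y_pred = [2, 3, 3, 3, 3, 4, 4, 4, 4, 3]
labels = [2, 3, 4]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class recall: each diagonal entry divided by its row total
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
for label, value in zip(labels, per_class_recall):
    print(f"class {label}: recall = {value:.2f}")
```

Off-diagonal entries in the matrix show which classes are confused with each other, which the aggregate scores cannot reveal.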




# FINAL CONCLUSION

Based on this analysis, both Decision Trees and SVMs are suitable algorithms for this classification task.
However, comparing the two on the same test set, the **Decision Tree** model shows the best performance with this dataset.

The Decision Tree has the higher accuracy (0.987 vs. 0.953), precision (0.993 vs. 0.946), recall (0.987 vs. 0.953) and F1 score (0.989 vs. 0.950).

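The metrics are defined below, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. Because the target has more than two classes, precision, recall and F1 are computed per class and then averaged, weighted by the number of true samples in each class (`average='weighted'`). With this weighting, recall equals accuracy, which is why those two columns match in the table.

- Accuracy: the fraction of all predictions that are correct.
  `Accuracy = (TP + TN) / (TP + TN + FP + FN)`
- Precision: the fraction of samples predicted as positive that are actually positive.
  `Precision = TP / (TP + FP)`
- Recall: the fraction of actual positives that the model identified.
  `Recall = TP / (TP + FN)`
- F1 score: the harmonic mean of precision and recall.
  `F1 score = 2 * (precision * recall) / (precision + recall)`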
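
To make the formulas concrete, here is a minimal sketch in plain Python that computes the same four metrics from per-class counts, using support-weighted averaging. The function name `weighted_metrics` and the toy labels are assumptions made for illustration; this is not the scikit-learn implementation, and it scores a class with no predictions as 0.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # A class that is never predicted has TP + FP = 0; score it as 0
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[c] / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in pairs if t == p) / n
    return accuracy, precision, recall, f1

# Hypothetical toy labels; class 2 is never predicted
y_true = [2, 3, 3, 3, 4, 4, 4, 4, 4, 4]
y_pred = [3, 3, 3, 3, 4, 4, 4, 4, 4, 3]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy example class 2 is never predicted, so its precision is undefined and is counted as 0, and the weighted recall comes out equal to the accuracy.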



For future work, an important goal is to collect more data so that the models can be trained and tuned more thoroughly, which may improve accuracy.

Experimenting with other models is another interesting option, for example Random Forest, an ensemble that combines multiple decision trees.
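
As a starting point for that experiment, the sketch below trains scikit-learn's `RandomForestClassifier`. It runs on synthetic data from `make_classification`, because the training set is not loaded in this notebook. The sample count, number of classes and hyperparameters are illustrative assumptions, not tuned values, and the resulting score says nothing about the NHL data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 6 features, 3 classes (not the NHL data)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_train, X_test_rf, y_train, y_test_rf = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# An ensemble of 100 decision trees, each fitted on a bootstrap sample
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_accuracy = accuracy_score(y_test_rf, rf_model.predict(X_test_rf))
print(f"Random Forest accuracy on synthetic data: {rf_accuracy:.3f}")
```

To compare it fairly with the two models above, the forest would need to be trained on the same training split and evaluated on the same `X_test` and `y_test`.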