# Part A: Introduction

This segment of the project aims to develop a machine learning model that predicts clusters. This is a classification problem, and the target variable is 'Cluster'.

During our lectures, we have identified several machine learning algorithms that are particularly effective for classification problems:

1. **K-Nearest Neighbors (KNN)**
2. **Decision Trees**
3. **Support Vector Machines (SVM)**

We will begin by implementing these three algorithms with their default settings to predict the clusters. This initial step will allow us to evaluate their performance and determine the next course of action based on the outcomes.


# Import Libraries and Dataset

First, we'll import the necessary libraries and load the dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# suppress warnings
import warnings
warnings.filterwarnings("ignore")



In [2]:
# Forward slash avoids the invalid '\d' escape sequence and works on all platforms
df = pd.read_csv('5dots/dataset_with_clusters.csv')

In [3]:
df.head()

Unnamed: 0,pauses,unique_patterns_count,total_values_count,duplicates,empty_submissions,Box_1_Submission,Box_2_Submission,Box_3_Submission,Box_4_Submission,Box_5_Submission,...,Box_10_Timegap,Box_11_Timegap,Box_12_Timegap,Box_13_Timegap,Box_14_Timegap,Box_15_Timegap,Box_16_Timegap,Box_17_Timegap,Box_18_Timegap,Cluster
0,1,4.0,5.0,1.0,0.0,3.0,2.0,0.0,0.0,0.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,1
1,6,25.0,26.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,...,2593.0,10000.0,3092.0,4389.0,10000.0,5426.0,10000.0,10000.0,10000.0,3
2,1,35.0,45.0,9.0,1.0,3.0,3.0,3.0,3.0,2.0,...,4760.0,4392.0,4784.0,2121.0,4000.0,3224.0,3961.0,5433.0,2176.0,0
3,2,51.0,58.0,5.0,2.0,5.0,3.0,3.0,3.0,3.0,...,3344.0,2352.0,2464.0,3552.0,6360.0,3064.0,2816.0,3664.0,2504.0,0
4,1,45.0,58.0,12.0,1.0,3.0,6.0,4.0,2.0,3.0,...,4282.0,3193.0,3551.0,3664.0,3056.0,4399.0,3808.0,10000.0,10000.0,0


# 🛠️ Preprocessing

**Objective**: To establish a baseline performance and understand the potential of various models using all available features initially. This will help in assessing the initial predictive capability of the models on the dataset created from students' submission time boxes.


### 🎯 Target variable

The target variable is the variable we are trying to predict. There are four clusters, labeled 0 to 3.

In [4]:
y = df['Cluster']


### 🛠️ Features

Because the clusters were created using all time boxes from the students' submissions, we will initially train the models on all features. Therefore, every column except 'Cluster' is used as a feature.

In [5]:
X = df.drop('Cluster', axis=1)
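
Note that `df.head()` shows an `Unnamed: 0` column, which pandas typically creates when a CSV was written with its row index. With the cell above, that column becomes a feature. The following is a minimal sketch, assuming the column is indeed just a saved row index, of how it could be kept out of the features. The tiny CSV text is hypothetical, and the results reported in this notebook were produced with the column included.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV: the first, unnamed column is a saved row index.
csv_text = """,pauses,duplicates,Cluster
0,1,1.0,1
1,6,1.0,3
2,1,9.0,0
"""

# Default read: the index column shows up as a feature named 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that column into the DataFrame index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

X_demo = df_indexed.drop('Cluster', axis=1)
print(list(df_default.columns))
print(list(X_demo.columns))
```

Whether to drop the column is a judgment call. If it is dropped for training, it has to be dropped from any data passed to `predict` as well.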

### 🪓 Splitting into Train/Test

To ensure a fair comparison between different models, we will use the same train and test sets for all three models. This process will be done only once to maintain consistency across model evaluations.


In [6]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
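
The split above is purely random, so the share of each cluster in the test set can drift slightly from its share in the full data. As a hedged alternative, not what this notebook ran, `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses synthetic labels with made-up counts to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for 'Cluster' (hypothetical counts).
y_syn = np.repeat([0, 1, 2, 3], [150, 160, 340, 250])
X_syn = np.arange(len(y_syn)).reshape(-1, 1)

# stratify=y_syn keeps each class's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

print("train counts:", np.bincount(y_tr))
print("test counts: ", np.bincount(y_te))
```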

### ⚖️ Scaling
The K-Nearest Neighbors (KNN) algorithm and Support Vector Machines (SVM) both rely on distance calculations to make predictions. For instance, KNN uses the concept of "being near" to decide cluster membership for new data points. This "being near" is calculated using [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), which compares values within each feature in absolute terms and does not account for the features having different units or ranges.

Therefore, it is necessary to scale all features to ensure they use the same unit of measurement. Without scaling, features with larger ranges can disproportionately influence the distance calculations, leading to biased results. A common approach is to use [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) as the unit, transforming all features to have a mean of 0 and a standard deviation of 1. This transformation is achieved using the `StandardScaler` from sklearn.

Example:
> Given the numbers 6 and 8, the Euclidean distance is 2. Given the numbers 95 and 100, the Euclidean distance is 5. However, 95 and 100 are closer to each other (95%) than 6 and 8 are (75%). Scaling ensures that all features contribute equally to the distance calculations.

Decision Trees, however, do not require feature scaling because they are not distance-based algorithms. Instead, they split the data on feature values and make decisions based on thresholds. As a result, the relative scales of the features do not affect their performance ([source](https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6)).

**Therefore, we will scale the features and use them only for SVM and kNN models.**
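
To make the effect of scaling concrete, here is a small sketch with made-up numbers: one feature is a small count, the other a time gap in milliseconds. It standardizes by hand with NumPy, using the population standard deviation as `StandardScaler` does. The data and the helper `euclidean` are illustrative assumptions, not part of the dataset.

```python
import numpy as np

# Two hypothetical features on very different scales:
# column 0 is a small count, column 1 is a time gap in milliseconds.
X_demo = np.array([
    [1.0, 2000.0],
    [9.0, 2100.0],
    [2.0, 9000.0],
])


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Raw distances: the millisecond column dominates the result.
d_raw_01 = euclidean(X_demo[0], X_demo[1])
d_raw_02 = euclidean(X_demo[0], X_demo[2])

# Standardize each column to mean 0 and standard deviation 1.
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print(f"raw:    d(0,1)={d_raw_01:.1f}  d(0,2)={d_raw_02:.1f}")
print(f"scaled: d(0,1)={d_std_01:.2f}  d(0,2)={d_std_02:.2f}")
```

Before scaling, the large difference in the count feature between rows 0 and 1 is almost invisible next to the millisecond differences. After scaling, both features contribute on a comparable footing.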


In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 🧬 Modelling: k-Nearest Neighbor

In [9]:
knn = KNeighborsClassifier() # kNN model parameters are set to default: n_neighbors=5, weights='uniform'
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

# 🧬 Modelling: Support Vector Machine

In [10]:
svm = SVC() # Support Vector Machine parameters are set to default: with kernel='rbf', C=1.0, gamma='scale'
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)
accuracy_svm = accuracy_score(y_test, y_pred_svm)

# 🧬 Modelling: Decision Trees

In [11]:
dt = DecisionTreeClassifier() # Decision Tree parameters are set to default: with criterion='gini', splitter='best'
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# 🔬 Evaluation

In [12]:
print(f"KNN Accuracy: {accuracy_knn}")
print(f"Decision Tree Accuracy: {accuracy_dt}")
print(f"SVM Accuracy: {accuracy_svm}")

KNN Accuracy: 0.847682119205298
Decision Tree Accuracy: 0.7671081677704195
SVM Accuracy: 0.9635761589403974


### _**Conclusion:**_

SVM achieved the highest accuracy (96.36%), outperforming KNN (84.77%) and Decision Tree (76.71%). The next steps are a detailed classification report, a confusion matrix, and cross-validation for each model.

In [13]:
print("\nKNN Classification Report:")
print(classification_report(y_test, y_pred_knn))


KNN Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.79      0.84       153
           1       0.85      0.86      0.85       162
           2       0.82      0.92      0.87       336
           3       0.86      0.78      0.81       255

    accuracy                           0.85       906
   macro avg       0.86      0.84      0.85       906
weighted avg       0.85      0.85      0.85       906



In [14]:
print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt))

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.81      0.78       153
           1       0.85      0.70      0.77       162
           2       0.81      0.73      0.77       336
           3       0.70      0.83      0.76       255

    accuracy                           0.77       906
   macro avg       0.78      0.77      0.77       906
weighted avg       0.77      0.77      0.77       906



In [15]:
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))

SVM Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.97      0.98       153
           1       0.97      0.89      0.93       162
           2       0.98      0.98      0.98       336
           3       0.92      0.98      0.95       255

    accuracy                           0.96       906
   macro avg       0.97      0.96      0.96       906
weighted avg       0.96      0.96      0.96       906



### _**Conclusion:**_

- The recall for cluster 1 in SVM (89%) is lower than for the other clusters, but it is still slightly higher than kNN's (86%) and significantly higher than the Decision Tree's (70%).

- From the classification reports alone, SVM outperforms the other models in both precision (92% to 99%) and recall (89% to 98%). The second-best model is kNN, with precision from 82% to 90% and recall from 78% to 92%. The worst-performing model is the Decision Tree, with precision from 70% to 85% and recall from 70% to 83%.

In [16]:
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

KNN Confusion Matrix:
[[121   0  32   0]
 [  0 139   2  21]
 [ 13   1 310  12]
 [  0  24  33 198]]


### _**Summary:**_

- For Cluster '0', the KNN model correctly identified 121 instances, but misclassified 32 instances as Cluster '2'.

- In Cluster '1', the KNN model accurately identified 139 instances as '1', but erroneously classified 2 instances as '2' and 21 instances as '3'.

- Regarding Cluster '2', the KNN model correctly identified 310 instances as '2', but mistakenly classified 13 instances as '0', 1 instance as '1', and 12 instances as '3'.

- In Cluster '3', the KNN model accurately identified 198 instances as '3', but misclassified 24 instances as '1' and 33 instances as '2'.
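
The confusion matrix and the classification report describe the same predictions. As a cross-check, the sketch below recomputes per-cluster recall, precision, and overall accuracy directly from the KNN matrix printed above. The variable names are chosen for illustration.

```python
import numpy as np

# KNN confusion matrix from the output above
# (rows: true cluster, columns: predicted cluster).
cm_knn = np.array([
    [121,   0,  32,   0],
    [  0, 139,   2,  21],
    [ 13,   1, 310,  12],
    [  0,  24,  33, 198],
])

correct = np.diag(cm_knn)
recall = correct / cm_knn.sum(axis=1)     # share of each true cluster that was found
precision = correct / cm_knn.sum(axis=0)  # share of each predicted cluster that is right
accuracy = correct.sum() / cm_knn.sum()

for k in range(cm_knn.shape[0]):
    print(f"cluster {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
print(f"accuracy={accuracy:.6f}")
```

Rows sum to the support of each true cluster, which is why recall divides by row sums and precision by column sums.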



In [17]:
print("Decision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

Decision Tree Confusion Matrix:
[[124   0  28   1]
 [  0 113   5  44]
 [ 42   3 246  45]
 [  1  17  25 212]]


### _**Summary:**_

- For Cluster '0', the Decision Tree model correctly identified 124 instances, but misclassified 28 instances as Cluster '2' and 1 instance as Cluster '3'.

- In Cluster '1', the Decision Tree model accurately identified 113 instances as '1', but erroneously classified 5 instances as '2' and 44 instances as '3'.

- Regarding Cluster '2', the Decision Tree model correctly identified 246 instances as '2', but mistakenly classified 42 instances as '0', 3 instances as '1', and 45 instances as '3'.

- In Cluster '3', the Decision Tree model accurately identified 212 instances as '3', but misclassified 1 instance as '0', 17 instances as '1', and 25 instances as '2'.


In [18]:
print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))

SVM Confusion Matrix:
[[149   0   4   0]
 [  0 144   1  17]
 [  1   1 330   4]
 [  0   3   2 250]]


### _**Summary:**_

- For Cluster '0', the SVM model correctly identified 149 instances and misclassified 4 instances as '2'.

- In Cluster '1', the SVM model accurately identified 144 instances as '1', but misclassified 1 instance as '2' and 17 instances as '3'.

- Regarding Cluster '2', the SVM model correctly identified 330 instances as '2', but mistakenly classified 1 instance as '0', 1 instance as '1', and 4 instances as '3'.

- In Cluster '3', the SVM model accurately identified 250 instances as '3', but misclassified 3 instances as '1' and 2 instances as '2'.


### _**Conclusion from Confusion Matrices:**_

- SVM demonstrates the highest accuracy and maintains consistency across all clusters in comparison with other models.
- Cluster '0' tends to be mistaken as '2' across all models.
- Cluster '2' appears more challenging, as it was mistaken with all other clusters in all three of the models.
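
These observations can be checked mechanically. The sketch below takes the three confusion matrices printed above and reports, for each true cluster, which wrong cluster it was most often assigned to. The helper `top_confusion` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true cluster, columns: predicted cluster).
matrices = {
    "KNN": np.array([[121, 0, 32, 0], [0, 139, 2, 21],
                     [13, 1, 310, 12], [0, 24, 33, 198]]),
    "Decision Tree": np.array([[124, 0, 28, 1], [0, 113, 5, 44],
                               [42, 3, 246, 45], [1, 17, 25, 212]]),
    "SVM": np.array([[149, 0, 4, 0], [0, 144, 1, 17],
                     [1, 1, 330, 4], [0, 3, 2, 250]]),
}


def top_confusion(cm, true_label):
    """Return (predicted_label, count) of the most frequent error for one true cluster."""
    row = cm[true_label].copy()
    row[true_label] = -1  # mask the correct predictions
    predicted = int(np.argmax(row))
    return predicted, int(row[predicted])


for name, cm in matrices.items():
    for true_label in range(cm.shape[0]):
        predicted, count = top_confusion(cm, true_label)
        print(f"{name}: true {true_label} most often mistaken for {predicted} ({count}x)")
```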

## 🔎 Evaluating Overfitting, Underfitting, and Generalization

To assess our models for overfitting, underfitting, and generalization, we'll employ two steps:

- **Initial Evaluation**: 
We'll begin by comparing the classification reports obtained from both the test and train sets. This initial examination will provide insights into any inconsistencies between the model's performance on seen versus unseen data.

- **Cross-Validation Analysis**:
Utilizing the k-fold cross-validation technique, we'll conduct a series of evaluations, training the models on various subsets of the data and assessing their performance on unseen partitions. This method ensures a robust estimation of model performance and helps reveal potential overfitting or underfitting tendencies.

### 1. Initial Evaluation:

In [19]:
a = svm.predict(X_train_scaled)
a_accuracy = accuracy_score(y_train, a)
print(f"SVM Accuracy on training data: {a_accuracy:.6f}")
b = svm.predict(X_test_scaled)
b_accuracy = accuracy_score(y_test, b)
print(f"SVM Accuracy on test data: {b_accuracy:.6f}")

print("Training classification report:\n", classification_report(y_train, a))
print("Test classification report:\n", classification_report(y_test, b))

SVM Accuracy on training data: 0.998068
SVM Accuracy on test data: 0.963576
Training classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       723
           1       1.00      1.00      1.00       513
           2       1.00      1.00      1.00      1329
           3       1.00      1.00      1.00      1058

    accuracy                           1.00      3623
   macro avg       1.00      1.00      1.00      3623
weighted avg       1.00      1.00      1.00      3623

Test classification report:
               precision    recall  f1-score   support

           0       0.99      0.97      0.98       153
           1       0.97      0.89      0.93       162
           2       0.98      0.98      0.98       336
           3       0.92      0.98      0.95       255

    accuracy                           0.96       906
   macro avg       0.97      0.96      0.96       906
weighted avg       0.96      0.96      0.96       906

In [20]:
a = knn.predict(X_train_scaled)
a_accuracy = accuracy_score(y_train, a)
print(f"KNN Accuracy on training data: {a_accuracy:.6f}")
b = knn.predict(X_test_scaled)
b_accuracy = accuracy_score(y_test, b)
print(f"KNN Accuracy on test data: {b_accuracy:.6f}")

print("Training classification report:\n", classification_report(y_train, a))
print("Test classification report:\n", classification_report(y_test, b))

KNN Accuracy on training data: 0.911123
KNN Accuracy on test data: 0.847682
Training classification report:
               precision    recall  f1-score   support

           0       0.98      0.88      0.92       723
           1       0.91      0.96      0.94       513
           2       0.86      0.96      0.91      1329
           3       0.94      0.85      0.89      1058

    accuracy                           0.91      3623
   macro avg       0.92      0.91      0.92      3623
weighted avg       0.92      0.91      0.91      3623

Test classification report:
               precision    recall  f1-score   support

           0       0.90      0.79      0.84       153
           1       0.85      0.86      0.85       162
           2       0.82      0.92      0.87       336
           3       0.86      0.78      0.81       255

    accuracy                           0.85       906
   macro avg       0.86      0.84      0.85       906
weighted avg       0.85      0.85      0.85       906

In [21]:
a = dt.predict(X_train)
a_accuracy = accuracy_score(y_train, a)
print(f"Decision Trees Accuracy on training data: {a_accuracy:.6f}")
b = dt.predict(X_test)
b_accuracy = accuracy_score(y_test, b)
print(f"Decision Trees Accuracy on test data: {b_accuracy:.6f}")

print("Training classification report:\n", classification_report(y_train, a))
print("Test classification report:\n", classification_report(y_test, b))

Decision Trees Accuracy on training data: 1.000000
Decision Trees Accuracy on test data: 0.767108
Training classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       723
           1       1.00      1.00      1.00       513
           2       1.00      1.00      1.00      1329
           3       1.00      1.00      1.00      1058

    accuracy                           1.00      3623
   macro avg       1.00      1.00      1.00      3623
weighted avg       1.00      1.00      1.00      3623

Test classification report:
               precision    recall  f1-score   support

           0       0.74      0.81      0.78       153
           1       0.85      0.70      0.77       162
           2       0.81      0.73      0.77       336
           3       0.70      0.83      0.76       255

    accuracy                           0.77       906
   macro avg       0.78      0.77      0.77       906
weighted avg       0.77      0.77      0.77       906

### _**Conclusion**_
KNN performs reasonably well but shows signs of overfitting (91.1% training accuracy versus 84.8% on the test set). Decision Trees overfit heavily: accuracy drops from 100% on the training data to 76.7% on the test set, which indicates poor generalization. SVM reaches near-perfect training accuracy (99.8%) and loses only about 3.4 percentage points on the test set (96.4%). Therefore, we conclude that the SVM model outperforms both KNN and Decision Trees in accuracy and generalization ability.
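
The comparison boils down to the gap between training and test accuracy. The short sketch below computes that gap from the accuracies printed above; the dictionary layout is just one convenient way to hold the numbers.

```python
# Accuracies copied from the outputs above.
reported = {
    "SVM": {"train": 0.998068, "test": 0.963576},
    "KNN": {"train": 0.911123, "test": 0.847682},
    "Decision Trees": {"train": 1.000000, "test": 0.767108},
}

# A larger train-minus-test gap suggests stronger overfitting.
gaps = {name: acc["train"] - acc["test"] for name, acc in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test gap = {gap:.3f}")
```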

To speed up training, the Intel Extension for Scikit-learn can optionally be enabled (it is left commented out here):

In [22]:
#from sklearnex import patch_sklearn
#patch_sklearn()

In [23]:
from sklearn.model_selection import cross_val_score

Since kNN and SVM require scaling, we standardize the full feature matrix before cross-validation:

In [24]:
X_Scaled = StandardScaler().fit_transform(X)
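
One caveat: fitting the scaler on the full dataset lets every validation fold influence the mean and standard deviation used for training. A common way around this, sketched here on synthetic data rather than the notebook's `X` and `y`, is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit inside each fold. This is an alternative to the cell above, not what produced the scores reported below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical); the notebook's X and y would go here.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=42
)

# The scaler is refit on the training part of every fold,
# so the validation part never influences the mean or standard deviation.
svm_pipeline = make_pipeline(StandardScaler(), SVC())
pipeline_scores = cross_val_score(svm_pipeline, X_syn, y_syn, cv=5)

print(f"fold scores: {pipeline_scores}")
print(f"mean: {np.mean(pipeline_scores):.4f}, std: {np.std(pipeline_scores):.4f}")
```

With a dataset of several thousand rows the difference is usually small, but the pipeline form removes the question entirely.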

### _**Cross Validation for kNN:**_

In [25]:
scores = cross_val_score(knn, X_Scaled, y, cv=5)
print(f"kNN Cross Validation Scores: {scores}")
print(f"kNN Cross Validation Mean Score: {np.mean(scores)}")
print(f"kNN Cross Validation Standard Deviation: {np.std(scores)}")

kNN Cross Validation Scores: [0.84216336 0.84657837 0.85209713 0.86423841 0.86298343]
kNN Cross Validation Mean Score: 0.8536121376215042
kNN Cross Validation Standard Deviation: 0.008758795701430844


### _**Cross Validation for SVM:**_

In [26]:
scores = cross_val_score(svm, X_Scaled, y, cv=5)
print(f"SVM Cross Validation Scores: {scores}")
print(f"SVM Cross Validation Mean Score: {np.mean(scores)}")
print(f"SVM Cross Validation Standard Deviation: {np.std(scores)}")

SVM Cross Validation Scores: [0.94591611 0.96357616 0.95916115 0.97461369 0.96022099]
SVM Cross Validation Mean Score: 0.9606976205285817
SVM Cross Validation Standard Deviation: 0.00919808359254364


### _**Cross Validation for Decision Trees:**_

In [27]:
scores = cross_val_score(dt, X, y, cv=5)
print(f"Decision Trees Cross Validation Scores: {scores}")
print(f"Decision Trees Cross Validation Mean Score: {np.mean(scores)}")
print(f"Decision Trees Cross Validation Standard Deviation: {np.std(scores)}")

Decision Trees Cross Validation Scores: [0.74282561 0.76931567 0.76490066 0.79028698 0.78121547]
Decision Trees Cross Validation Mean Score: 0.7697088775871112
Decision Trees Cross Validation Standard Deviation: 0.016135944726401903


### _**Conclusion**_
> Based on the cross-validation results, SVM still outperforms the other models, with the highest mean accuracy (0.961). Decision Trees show the highest standard deviation across folds (0.016), while SVM (0.009) and kNN (0.009) vary about equally. All three standard deviations are small, so the ranking is stable across splits.

In [28]:
# hyperparameter tuning for svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train_scaled, y_train)

print(grid.best_params_)
print(grid.best_estimator_)
grid_predictions = grid.predict(X_test_scaled)
print(accuracy_score(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))
print(confusion_matrix(y_test, grid_predictions))

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   2.5s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   2.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   2.3s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   2.4s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   2.1s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.7s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.4s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.4s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.6s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.9s
[CV] END .....................C=0.1, gamma=1, kernel=sigmoid; total time=   1.0s
[CV] END .....................C=0.1, gamma=1, k
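
The first line of the log ("5 folds for each of 48 candidates, totalling 240 fits") follows directly from the size of the grid. The sketch below reproduces that count with `ParameterGrid`, which enumerates every combination of a parameter grid.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'poly', 'sigmoid'],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 4 * 3 combinations
n_folds = 5  # GridSearchCV uses 5-fold cross-validation when cv is not set
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Because the number of fits grows multiplicatively with every added value, it is worth estimating the count before launching a larger search.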

### _**Next Course of Action**_

Based on the results above, the Support Vector Machine is the most suitable model for this dataset. In the next part, we will reduce the number of features and aim to improve prediction on cluster '0' and cluster '2'.

In [29]:
# train svm using best parameters
svm = SVC(C=10, gamma=0.01, kernel='rbf')
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)
accuracy_svm = accuracy_score(y_test, y_pred_svm)

print(f"SVM Accuracy: {accuracy_svm}")
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.9591611479028698
SVM Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       153
           1       0.97      0.90      0.94       162
           2       0.98      0.97      0.97       336
           3       0.92      0.97      0.95       255

    accuracy                           0.96       906
   macro avg       0.96      0.96      0.96       906
weighted avg       0.96      0.96      0.96       906



In [30]:
# export the svm model
import joblib
joblib.dump(svm, 'svm_model.pkl')

['svm_model.pkl']
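
The exported file contains only the SVM, not the `StandardScaler` that was fitted on the training data, so anyone loading the model has to reproduce the scaling separately. One possible approach, sketched here on synthetic data with a hypothetical file name, is to persist scaler and model together as a single pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical); the notebook's X_train and y_train would go here.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# Scaler and classifier travel together, so raw features can be passed to predict().
model = make_pipeline(StandardScaler(), SVC(C=10, gamma=0.01, kernel='rbf'))
model.fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'svm_pipeline.pkl')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions on unscaled input.
same_predictions = np.array_equal(model.predict(X_syn), restored.predict(X_syn))
print(same_predictions)
```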

In [36]:
# Read in csv
test = pd.read_csv('./Demo/5 dot test/bin/Debug/test.csv')

test.head()

Unnamed: 0,pauses,unique_patterns_count,total_values_count,duplicates,empty_submissions,Box_1_Submission,Box_2_Submission,Box_3_Submission,Box_4_Submission,Box_5_Submission,...,Box_9_Timegap,Box_10_Timegap,Box_11_Timegap,Box_12_Timegap,Box_13_Timegap,Box_14_Timegap,Box_15_Timegap,Box_16_Timegap,Box_17_Timegap,Box_18_Timegap
0,7.0,16.0,16.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0


In [37]:
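# Predict for test
# Caveat: the SVM was fitted on standardized features, so `test` should be
# transformed with the same fitted scaler first, e.g.
# svm.predict(scaler.transform(test)). The output below comes from the
# unscaled call as originally run.
predictions = svm.predict(test)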
print(predictions)

[3]
