# Introduction

This segment of the project develops a machine learning model that predicts which cluster each observation belongs to. This is a multiclass classification problem, and the target variable is 'Cluster'.

In our lectures, we covered several machine learning algorithms that are well suited to classification problems:

1. **K-Nearest Neighbors (KNN)**
2. **Decision Trees**
3. **Support Vector Machines (SVM)**

We will begin by implementing these three algorithms with their default settings (except for KNN, where we set `n_neighbors=3` instead of the scikit-learn default of 5) to predict the clusters. This baseline lets us compare their performance and decide on the next steps based on the results.


# Import Libraries and Dataset

First, we'll import the necessary libraries and load the dataset.

In [85]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [86]:
# raw string so the backslash in the Windows path is not treated as an escape sequence
df = pd.read_csv(r'5dots\dataset_with_clusters.csv')

In [87]:
df.head()

| Unnamed: 0 | pauses | unique_patterns_count | total_values_count | duplicates | empty_submissions | Box_1_Submission | Box_2_Submission | Box_3_Submission | Box_4_Submission | Box_5_Submission | ... | Box_10_Timegap | Box_11_Timegap | Box_12_Timegap | Box_13_Timegap | Box_14_Timegap | Box_15_Timegap | Box_16_Timegap | Box_17_Timegap | Box_18_Timegap | Cluster |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 4.0 | 5.0 | 1.0 | 0.0 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 1 |
| 1 | 6 | 25.0 | 26.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | 2593.0 | 10000.0 | 3092.0 | 4389.0 | 10000.0 | 5426.0 | 10000.0 | 10000.0 | 10000.0 | 3 |
| 2 | 1 | 35.0 | 45.0 | 9.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 2.0 | ... | 4760.0 | 4392.0 | 4784.0 | 2121.0 | 4000.0 | 3224.0 | 3961.0 | 5433.0 | 2176.0 | 0 |
| 3 | 2 | 51.0 | 58.0 | 5.0 | 2.0 | 5.0 | 3.0 | 3.0 | 3.0 | 3.0 | ... | 3344.0 | 2352.0 | 2464.0 | 3552.0 | 6360.0 | 3064.0 | 2816.0 | 3664.0 | 2504.0 | 0 |
| 4 | 1 | 45.0 | 58.0 | 12.0 | 1.0 | 3.0 | 6.0 | 4.0 | 2.0 | 3.0 | ... | 4282.0 | 3193.0 | 3551.0 | 3664.0 | 3056.0 | 4399.0 | 3808.0 | 10000.0 | 10000.0 | 0 |
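The `Unnamed: 0` column in the preview above usually appears when a DataFrame is saved together with its index and then read back without telling pandas about it. The following is a minimal sketch on a small in-memory CSV (the toy values are made up for illustration and are not the project data) showing how `index_col=0` avoids the extra column. Whether this applies to `dataset_with_clusters.csv` is an assumption based on the preview.

```python
import io

import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({"pauses": [1, 6, 1], "Cluster": [1, 3, 0]})

# to_csv writes the index as an unnamed first column by default.
csv_text = toy.to_csv()

# Reading it back without index_col yields an extra 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 restores that column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

If the column is kept, it ends up among the features once only 'Cluster' is dropped, so it is worth deciding deliberately whether the models should see it.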


# Preprocessing

Because we want a fair comparison of the three models, all of them are trained and evaluated on the same train and test sets.


In [88]:
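# Split the data into features and target
# Note: X still contains 'Unnamed: 0' (apparently the row index saved with the CSV), so the models see it as a feature
X = df.drop('Cluster', axis=1)
y = df['Cluster']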

In [89]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
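The split above is random but not stratified, so the class proportions in the train and test sets can drift apart when the clusters are unevenly sized. As an optional variation, here is a minimal sketch of a stratified split on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, not the project dataset). It uses the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for 'Cluster' (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.repeat([0, 1, 2, 3], [100, 200, 300, 400])

# stratify=y_demo keeps each class's share (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of each class in the training and test sets.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share)
print("test class shares: ", test_share)
```

Switching the notebook to a stratified split would change the reported numbers, so this is a suggestion to consider rather than a description of what was run.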

In [90]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply the same transform to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
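The manual fit/transform steps above can also be bundled with a model so that scaling is always learned from the training data only. This is a minimal sketch using scikit-learn's `make_pipeline` on synthetic data; `X_demo`, `y_demo`, and `svm_pipe` are illustrative names and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the project dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then trains the SVM on the scaled data.
svm_pipe = make_pipeline(StandardScaler(), SVC())
svm_pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting.
test_accuracy = svm_pipe.score(X_te, y_te)
scaler_mean = svm_pipe.named_steps["standardscaler"].mean_
print(f"Pipeline test accuracy: {test_accuracy:.3f}")
```

A pipeline gives the same result as the manual steps for a single split, and it becomes more useful once cross-validation or hyperparameter search is introduced.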

# Modeling

We will first initialize the models:

In [92]:
# KNN uses k=3; the decision tree and SVM use scikit-learn defaults
knn = KNeighborsClassifier(n_neighbors=3)
dt = DecisionTreeClassifier()
svm = SVC()

Then we train the models:

In [93]:
knn.fit(X_train, y_train)
dt.fit(X_train, y_train)
svm.fit(X_train, y_train)

Then we predict using the models:

In [94]:
y_pred_knn = knn.predict(X_test)
y_pred_dt = dt.predict(X_test)
y_pred_svm = svm.predict(X_test)

Lastly, we evaluate the models:

In [95]:
accuracy_knn = accuracy_score(y_test, y_pred_knn)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_svm = accuracy_score(y_test, y_pred_svm)

In [96]:
print(f"KNN Accuracy: {accuracy_knn}")
print(f"Decision Tree Accuracy: {accuracy_dt}")
print(f"SVM Accuracy: {accuracy_svm}")

KNN Accuracy: 0.8432671081677704
Decision Tree Accuracy: 0.7549668874172185
SVM Accuracy: 0.9635761589403974
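The accuracies above come from a single 80/20 split, so part of the gap between models may be due to which rows happened to land in the test set. One way to check whether the ranking is stable is k-fold cross-validation. This is a minimal sketch on synthetic data (`X_demo`, `y_demo`, and `cv_results` are hypothetical names, and the scores it prints say nothing about the project dataset).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the project dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}

cv_results = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so each fold is scaled with its own training data.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the spread across folds makes it easier to judge whether a difference between two models is larger than the split-to-split noise.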


Detailed classification reports and confusion matrices:

In [97]:
# Detailed classification reports
print("\nKNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt))

print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))

# Confusion matrices
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

print("Decision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))

print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))



KNN Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.76      0.83       153
           1       0.83      0.86      0.85       162
           2       0.83      0.91      0.87       336
           3       0.84      0.80      0.82       255

    accuracy                           0.84       906
   macro avg       0.85      0.83      0.84       906
weighted avg       0.85      0.84      0.84       906

Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.82      0.76       153
           1       0.87      0.69      0.77       162
           2       0.80      0.71      0.75       336
           3       0.69      0.83      0.75       255

    accuracy                           0.75       906
   macro avg       0.77      0.76      0.76       906
weighted avg       0.77      0.75      0.76       906

SVM Classification Report:
              precision    recall  f1-

In [98]:
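# Compare SVM accuracy on the training and test sets to check for overfitting
y_pred_train_svm = svm.predict(X_train)
train_accuracy_svm = accuracy_score(y_train, y_pred_train_svm)
print(f"SVM Accuracy on training data: {train_accuracy_svm}")

y_pred_test_svm = svm.predict(X_test)
test_accuracy_svm = accuracy_score(y_test, y_pred_test_svm)
print(f"SVM Accuracy on test data: {test_accuracy_svm}")

print("Training classification report:\n", classification_report(y_train, y_pred_train_svm))
print("Test classification report:\n", classification_report(y_test, y_pred_test_svm))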

SVM Accuracy on training data: 0.9980678995307756
SVM Accuracy on test data: 0.9635761589403974
Training classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       723
           1       1.00      1.00      1.00       513
           2       1.00      1.00      1.00      1329
           3       1.00      1.00      1.00      1058

    accuracy                           1.00      3623
   macro avg       1.00      1.00      1.00      3623
weighted avg       1.00      1.00      1.00      3623

Test classification report:
               precision    recall  f1-score   support

           0       0.99      0.97      0.98       153
           1       0.97      0.89      0.93       162
           2       0.98      0.98      0.98       336
           3       0.92      0.98      0.95       255

    accuracy                           0.96       906
   macro avg       0.97      0.96      0.96       906
weighted avg       0.96    