Φωτιάδης Κωνσταντίνος (ΑΕΜ: 10726) - Μπακούλας Επαμεινώνδας (ΑΕΜ: 10683)

# PART D

We first install the required dependencies and import them.

In [None]:
%pip install pandas numpy scikit-learn

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import warnings

warnings.filterwarnings('ignore') # When training the models, we get some convergence warnings. We can ignore them.

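We then load the training and test datasets. After loading, we standardize the features with StandardScaler (zero mean and unit variance per feature) so that features with larger numeric ranges do not dominate scale-sensitive models such as k-NN and SVM. The scaler is fitted on the training data only and then applied unchanged to the test data.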

In [5]:
train_data = pd.read_csv('datasetTV.csv', header=None)
test_data = pd.read_csv('datasetTest.csv', header=None)

X_train = train_data.iloc[:, :-1].values # All columns except the last one
y_train = train_data.iloc[:, -1].values # Last column is the target
X_test = test_data.values

# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
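
As a minimal sketch of what the standardization step computes, the same transformation can be written in plain NumPy. The arrays `train_demo` and `test_demo` are made-up illustrative values, not rows from the datasets above. The point to notice is that the mean and standard deviation come from the training data only and are reused for the test data.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
train_demo = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])
test_demo = np.array([[2.0, 400.0]])

# Statistics are estimated on the training data only
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# The same statistics are applied to both sets
train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero for each feature
print(train_scaled.std(axis=0))   # approximately one for each feature
print(test_scaled)
```

Because the test row is scaled with the training statistics, its values are not guaranteed to fall within the range seen during training.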

Before training the final model, we tune the hyperparameters of each candidate model with GridSearchCV, which performs an exhaustive search over the specified parameter values for an estimator. Each parameter combination is scored by its mean accuracy under 5-fold cross-validation.

In [None]:
# Sample hyperparameters for each model
param_grids = {
    'k-NN': {
        'n_neighbors': [3, 5, 7, 9, 13, 17],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]
    },
    'Perceptron': {
        'penalty': ['l1', 'l2', 'elasticnet'],
        'alpha': [1e-4, 1e-3, 1e-2],
        'max_iter': [1000, 2000]
    },
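    # Note: not every penalty/solver pair below is valid (e.g. 'liblinear' does not
    # support 'elasticnet'). With GridSearchCV's default error_score=np.nan, fits
    # that raise an error are scored as NaN and are therefore never selected.
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10],
        'penalty': ['l1', 'l2', 'elasticnet'],
        'solver': ['liblinear', 'saga']
    },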
    'SVM': {
        'C': [0.1, 1, 10, 100, 1000],
        'kernel': ['poly', 'rbf', 'sigmoid'],
        'gamma': ['scale', 'auto']
    },
    'SVM Linear': {
        'C': [0.1, 1, 10, 100]
    },
    'Random Forests': {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20],
        'max_features': ['sqrt', 'log2']
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 1.0]
    },
    'Decision Tree': {
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    'Neural Networks': {
        'hidden_layer_sizes': [(50,), (100,), (100, 50)],
        'activation': ['relu', 'tanh', 'logistic'],
        'alpha': [1e-5, 1e-4],
        'learning_rate_init': [1e-3, 1e-4]
    }
}

# Perform Grid Search for each model
results = {}

# Naive Bayes model (evaluated with its default settings, without a grid search)
naive_bayes_model = GaussianNB()
nb_scores = cross_val_score(naive_bayes_model, X_train, y_train, cv=5, scoring='accuracy')
results['Naive Bayes'] = {
    'Best Parameters': None,
    'Best Accuracy': np.mean(nb_scores)
}

# Base estimators, keyed by the same names as param_grids
models = {
    'k-NN': KNeighborsClassifier(),
    'Perceptron': Perceptron(),
    'Logistic Regression': LogisticRegression(max_iter=500),
    'SVM': SVC(),
    'SVM Linear': LinearSVC(),
    'Random Forests': RandomForestClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Neural Networks': MLPClassifier(max_iter=500, random_state=42)
}

for model_name, param_grid in param_grids.items():
    print(f"Tuning {model_name}...")
    grid_search = GridSearchCV(models[model_name], param_grid, cv=5, scoring='accuracy', verbose=2)
    grid_search.fit(X_train, y_train)
    results[model_name] = {
        'Best Parameters': grid_search.best_params_,
        'Best Accuracy': grid_search.best_score_
    }
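
One optional variation, shown here only as a sketch, is to place the scaler inside a scikit-learn `Pipeline` so that GridSearchCV refits it on the training part of every fold instead of on the full training set. The synthetic data from `make_classification`, the names `X_demo`, `y_demo`, `pipe`, `demo_grid` and `demo_search`, and the small grid are all assumptions made for illustration. This is not the procedure used to produce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the datasetTV.csv used above)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

# Scaling and classification are bundled, so each CV fold fits its own scaler
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
demo_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "poly"]}

demo_search = GridSearchCV(pipe, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"{demo_search.best_score_:.4f}")
```

The grid here has 3 x 2 = 6 combinations, each fitted 5 times, which illustrates why the full search above grows expensive as more values are added per parameter.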


The above step took a total of 1.5 hours, which is worth noting. We can now print the tuning results.

In [7]:
for model_name, result in results.items():
    print(f"{model_name}: Best Parameters = {result['Best Parameters']}")
    print(f"{model_name}: Best Accuracy = {result['Best Accuracy']:.4f}")

Naive Bayes: Best Parameters = None
Naive Bayes: Best Accuracy = 0.7026
k-NN: Best Parameters = {'n_neighbors': 9, 'p': 2, 'weights': 'distance'}
k-NN: Best Accuracy = 0.8220
Perceptron: Best Parameters = {'alpha': 0.0001, 'max_iter': 1000, 'penalty': 'l1'}
Perceptron: Best Accuracy = 0.7057
Logistic Regression: Best Parameters = {'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}
Logistic Regression: Best Accuracy = 0.7811
SVM: Best Parameters = {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
SVM: Best Accuracy = 0.8534
SVM Linear: Best Parameters = {'C': 0.1}
SVM Linear: Best Accuracy = 0.7634
Random Forests: Best Parameters = {'max_depth': 20, 'max_features': 'sqrt', 'n_estimators': 200}
Random Forests: Best Accuracy = 0.8140
AdaBoost: Best Parameters = {'learning_rate': 1.0, 'n_estimators': 200}
AdaBoost: Best Accuracy = 0.6609
Decision Tree: Best Parameters = {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10}
Decision Tree: Best Accuracy = 0.6390
Neural Networks: Best Parame

We select the model with the highest cross-validated accuracy, together with its optimal hyperparameters.

In [11]:
# Print the best model
best_model = max(results, key=lambda key: results[key]['Best Accuracy'])
best_accuracy = results[best_model]['Best Accuracy']
print(f"Best Model: {best_model} with accuracy {best_accuracy:.4f}")
print(f"Hyperparameters of the best model: {results[best_model]['Best Parameters']}")

Best Model: SVM with accuracy 0.8534
Hyperparameters of the best model: {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
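
To see how the candidates compare rather than only the winner, the same dictionary structure can be sorted by accuracy. The sketch below uses a hypothetical `demo_results` dictionary that copies four of the accuracies printed above. In the notebook itself the full `results` dictionary would take its place.

```python
# Hypothetical subset of the results, with accuracies copied from the output above
demo_results = {
    "Naive Bayes": {"Best Parameters": None, "Best Accuracy": 0.7026},
    "k-NN": {"Best Parameters": {"n_neighbors": 9, "p": 2, "weights": "distance"},
             "Best Accuracy": 0.8220},
    "SVM": {"Best Parameters": {"C": 10, "gamma": "auto", "kernel": "rbf"},
            "Best Accuracy": 0.8534},
    "Random Forests": {"Best Parameters": {"max_depth": 20, "max_features": "sqrt",
                                           "n_estimators": 200},
                       "Best Accuracy": 0.8140},
}

# Sort (name, result) pairs from highest to lowest cross-validated accuracy
ranked = sorted(demo_results.items(),
                key=lambda item: item[1]["Best Accuracy"],
                reverse=True)

for name, res in ranked:
    print(f"{name:<15}{res['Best Accuracy']:.4f}")
```

The first entry of `ranked` is the same model that `max(...)` selects. The rest of the list shows how close the runners-up are.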


The model with the highest accuracy is **SVM (Support Vector Machine)**, with the hyperparameters shown above. We will use this model to make predictions on the test dataset. First, we define the model:

In [9]:
clf_best_model = SVC(kernel='rbf', C=10, gamma='auto', random_state=42)

Now we train the model on the entire training set and make predictions on the test dataset. We then save the predictions to a NumPy file.

In [10]:
clf_best_model.fit(X_train, y_train)
y_test = clf_best_model.predict(X_test)

np.save('labels40.npy', y_test)
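
A quick sanity check after saving is to load the file back and confirm that the array survived the round trip. The sketch below uses a hypothetical `demo_labels` array and a temporary file named `labels_demo.npy`, so it does not touch the real `labels40.npy`. The label values are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical predictions standing in for the real y_test array
demo_labels = np.array([1, 3, 2, 5, 4, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels_demo.npy")
    np.save(path, demo_labels)   # write the array in NumPy's binary .npy format
    loaded = np.load(path)       # read it back while the file still exists

# The loaded array should match the original in shape, dtype and content
print(loaded.shape, loaded.dtype)
print(np.array_equal(loaded, demo_labels))
```

For the real file, the same check would also compare the number of saved labels against the number of rows in `X_test`.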