# PART 3: Prediction of the winner of an NBA game

The goal of this exercise is to develop and fine-tune predictive models that can accurately determine the outcome of basketball games based on data available at halftime. We will do so by training two different model and opposing them on a provided test set to estimate which one has an accuracy superior to 0.84.

----

## Studying the data

In [35]:
"""
Import section
"""

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from typing import Tuple, Dict, Any

In [4]:
"""
Let's now loads the different datasets
"""

# Load the datasets
X_train = np.load('X_train.npy') #features
X_test = np.load('X_test.npy')
y_train = np.load('y_train.npy') #labels
y_test = np.load('y_test.npy')

In [5]:
# Check the shapes of the loaded arrays to understand the dimensions of the data
print("Training set features shape:", X_train.shape)
print("Test set features shape:", X_test.shape)
print("Training set labels shape:", y_train.shape)
print("Test set labels shape:", y_test.shape)

Training set features shape: (500, 50)
Test set features shape: (500, 50)
Training set labels shape: (500,)
Test set labels shape: (500,)


---
We now know that there are 500 samples under 50 different features on both of the features sets

And we have 500 samples on the labels sets

If we study some features on the datasets, mostly in the labels distributions, we can notice that If we study some features in the datasets, particularly focusing on the labels distributions, we can notice that the dataset is almost perfectly balanced with a slight advantage towards away wins (-1.0) over home wins (1.0), as indicated by 256 away wins compared to 244 home wins.

---

In [23]:
# Check for class distributions in the labels
unique_elements, counts_elements = np.unique(y_train, return_counts=True)
print("Label Distribution: ", dict(zip(unique_elements, counts_elements)))

# Label statistics
print("Labels Mean: ", np.mean(y_train))
print("Labels Std Dev: ", np.std(y_train))
print("Labels Min: ", np.min(y_train))
print("Labels Max: ", np.max(y_train))

Label Distribution:  {-1.0: 256, 1.0: 244}
Labels Mean:  -0.024
Labels Std Dev:  0.9997119585160518
Labels Min:  -1.0
Labels Max:  1.0


## Scaling the dataset

We need to preprocess the data because it ensures that the model receives data in a format that optimizes its ability to learn patterns and make accurate predictions. This includes scaling features to the same range.

After the preprocessing, we can see that the features are normalized, making them directly comparable and preventing any single feature from disproportionately influencing the model due to its scale

In [9]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Mean of scaled training features:", np.mean(X_train_scaled, axis=0)[:5])  # First 5 features
print("Std of scaled training features:", np.std(X_train_scaled, axis=0)[:5])  # First 5 features

Mean of scaled training features: [ 2.48689958e-17 -2.88657986e-17  1.03250741e-17  1.77635684e-17
 -3.99680289e-18]
Std of scaled training features: [1. 1. 1. 1. 1.]


## Training the models

We will compare two methods to train our models: The Logistic Regression and Support Vector Classifier (SVC), to identify which offers superior performance in predicting basketball game outcomes based on halftime data.

### Method 1: Logistic Regression Model

In [12]:
# Initialize and train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

y_train_pred = log_reg.predict(X_train_scaled) # training set
y_test_pred = log_reg.predict(X_test_scaled) # test set

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Logistic Regression Training Accuracy:", train_accuracy)
print("Logistic Regression Test Accuracy:", test_accuracy)

Logistic Regression Training Accuracy: 0.894
Logistic Regression Test Accuracy: 0.84


### Method 2: Train SVC - Support Vector Classifier

In [24]:
# Initialize and train the support vector classifier
svc = SVC()
svc.fit(X_train_scaled, y_train)

y_train_pred_svc = svc.predict(X_train_scaled) # training set
y_test_pred_svc = svc.predict(X_test_scaled) # test set

# Evaluate the model
train_accuracy_svc = accuracy_score(y_train, y_train_pred_svc)
test_accuracy_svc = accuracy_score(y_test, y_test_pred_svc)

print("SVC Training Accuracy:", train_accuracy_svc)
print("SVC Test Accuracy:", test_accuracy_svc)

SVC Training Accuracy: 0.962
SVC Test Accuracy: 0.876


---
The SVC definitely seems a better candidate, because it has shown superior accuracy in the initial tests.

So let's try to optimize it by fine tuning the hyper-parameters using the Grid Search, which is a systematic approach to parameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.


## Fine tuning the SVC hyperparameters

In [None]:
def tune_svc_hyperparameters(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[SVC, float, float]:
    """
    Tunes hyperparameters for an SVC model using GridSearchCV and evaluates the optimized model.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_svc: The SVC model with the best found hyperparameters.
    - best_cv_accuracy: The best cross-validation accuracy achieved during tuning.
    - test_accuracy: Accuracy of the best SVC model on the test data.
    """
    
    # Define the parameter grid
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.01, 0.1, 1, 10],
        'kernel': ['rbf', 'linear']
    }
    param_grid2 = {
        'C': [0.1, 1, 10, 50, 100, 500],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1, 5, 10],
        'kernel': ['rbf', 'linear', 'poly'],
        'degree': [2, 3, 4]  # Only relevant for 'poly' kernel
    }
    
    # Initialize the GridSearchCV object
    grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=3)
    grid_search.fit(X_train, y_train)
    
    best_svc = grid_search.best_estimator_ # Extract the best SVC model
    best_cv_accuracy = grid_search.best_score_ #cross-validation
    
    # Predict on the test set with the optimized model
    y_test_pred_optimized = best_svc.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred_optimized)
    
    return best_svc, best_cv_accuracy, test_accuracy, y_test_pred_optimized

In [None]:
# Call the tuned svc function
best_svc, best_cv_accuracy, test_accuracy, y_test_pred_optimized = tune_svc_hyperparameters(X_train_scaled, y_train, X_test_scaled, y_test)

print("Best Cross-Validation Accuracy:", best_cv_accuracy)
print("Test Accuracy of the Optimized SVC Model:", test_accuracy)
#print("Best Model Parameters:", best_svc.get_params())

In [None]:
# Generate a classification report for the optimized SVC model on the test data
report = classification_report(y_test, y_test_pred_optimized, target_names=['Away Win', 'Home Win'])

print("Classification Report for the Optimized SVC Model:\n", report)

## Observation

The SVC demonstrated a suitable performance at predicting the outcomes of basketball games based on halftime data, achieving a test accuracy of 0.876 and a cross-validation accuracy of 0.846. Notably, the model showed a balanced ability to predict both 'Home Win' and 'Away Win' scenarios, as proved by F1-scores of 0.87 and 0.83, respectively. These metrics suggest that the SVC effectively captures the underlying patterns within the halftime data.

## Fine-tuning the Logistic regression

As for comparison purpose, we should also push forward the optimization of the hyperparameters of the logistic regression model, ensuring a level playing field in evaluating its performance against the SVC model.

In [None]:
def tune_logistic_regression_hyperparameters(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[LogisticRegression, Dict[str, Any], float, np.ndarray]:
    """
    Tunes hyperparameters for a Logistic Regression model using GridSearchCV and evaluates the optimized model.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_lr: The Logistic Regression model with the best found hyperparameters.
    - best_params: The best hyperparameters found by GridSearchCV.
    - test_accuracy: Accuracy of the best Logistic Regression model on the test data.
    - y_test_pred_lr_optimized: Predictions made by the optimized model on the test set.
    """
    
    # Define the parameter grid
    param_grid_lr = {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
    param_grid_lr2 = {
        'C': [0.01, 0.1, 1, 10, 50, 100, 500],  # Expanded range of C values
        'penalty': ['l1', 'l2'],  # Including both l1 and l2 penalties
        'solver': ['liblinear', 'saga'],  # Including solvers that support both penalties. 'saga' supports elastic-net as well.
        'max_iter': [100, 200, 500]  # To ensure convergence for larger C values or more complex data
    }

    
    # Initialize the GridSearchCV object
    grid_search_lr = GridSearchCV(LogisticRegression(), param_grid_lr, cv=5, scoring='accuracy', verbose=4)
    grid_search_lr.fit(X_train, y_train)
    
    # Extract the best model and parameters
    best_lr = grid_search_lr.best_estimator_
    best_params = grid_search_lr.best_params_
    
    # Predict on the test set with the optimized model
    y_test_pred_lr_optimized = best_lr.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred_lr_optimized)
    
    return best_lr, best_params, test_accuracy, y_test_pred_lr_optimized

In [None]:
best_lr, best_params_lr, test_accuracy_lr, y_test_pred_lr_optimized = tune_logistic_regression_hyperparameters(X_train_scaled, y_train, X_test_scaled, y_test)

# Print the best parameters and test accuracy
print("Best Parameters for Logistic Regression:", best_params_lr)
print("Optimized Logistic Regression Test Accuracy:", test_accuracy_lr)

# Generate and print the classification report, including the F1-score
classification_report_lr = classification_report(y_test, y_test_pred_lr_optimized, target_names=['Away Win', 'Home Win'])
print("Classification Report for the Optimized Logistic Regression Model:\n", classification_report_lr)

# Final overview

The fine-tuning of the logistic regression showed us a test accuracy of 0.836, which is close but doesnt achieve the required performance of 0.84. Despite the promising training accuracy of 0.894, the weak point of the logistic regression is in the prediction of the result for the home team.

In a concise comparison, the SVC model outperforms Logistic Regression in terms of overall test accuracy and demonstrates balanced precision and recall across classes, with an accuracy of 0.856 compared to 0.836 for the logistic regression.