# PART 3: Prediction of the winner of an NBA game

The goal of this exercise is to build and tune predictive models that forecast the outcome of basketball games from information available at halftime. To do this, we train two distinct models and compare them on a test set to see which of them clears the target accuracy of 0.84.

----

## Studying the data

In [1]:
"""
Import section
"""

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from typing import Tuple, Dict, Any

In [2]:
"""
Let's now load the different datasets
"""

# Load the datasets
X_train = np.load('X_train.npy') #features
X_test = np.load('X_test.npy')
y_train = np.load('y_train.npy') #labels
y_test = np.load('y_test.npy')

In [3]:
# Check the shapes of the loaded arrays to understand the dimensions of the data
print("Training set features shape:", X_train.shape)
print("Test set features shape:", X_test.shape)
print("Training set labels shape:", y_train.shape)
print("Test set labels shape:", y_test.shape)

Training set features shape: (500, 50)
Test set features shape: (500, 50)
Training set labels shape: (500,)
Test set labels shape: (500,)


---
We now know that each feature set contains 500 samples described by 50 features.

Each label set contains 500 labels, one per sample.

Looking at the distribution of the training labels shows that the classes are nearly perfectly balanced, with slightly more away wins (-1.0) than home wins (1.0): 256 away wins against 244 home wins.

---

In [4]:
# Check for class distributions in the labels
unique_elements, counts_elements = np.unique(y_train, return_counts=True)
print("Label Distribution: ", dict(zip(unique_elements, counts_elements)))

# Label statistics
print("Labels Mean: ", np.mean(y_train))
print("Labels Std Dev: ", np.std(y_train))
print("Labels Min: ", np.min(y_train))
print("Labels Max: ", np.max(y_train))

Label Distribution:  {-1.0: 256, 1.0: 244}
Labels Mean:  -0.024
Labels Std Dev:  0.9997119585160518
Labels Min:  -1.0
Labels Max:  1.0
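
As a quick sanity check on these numbers, the sketch below recomputes the mean and standard deviation directly from the two label counts, and adds the accuracy of a trivial model that always predicts the majority class. The majority-class baseline is not part of the original notebook; it is included here only as a reference point for the accuracies reported later.

```python
import math

# Label counts copied from the training-set output above.
counts = {-1.0: 256, 1.0: 244}
n = sum(counts.values())

# The mean of +/-1 labels is (home wins - away wins) / n.
mean = sum(label * c for label, c in counts.items()) / n

# Every squared label equals 1, so the population variance is 1 - mean**2.
std = math.sqrt(1 - mean ** 2)

# Accuracy of a trivial model that always predicts the majority class.
baseline = max(counts.values()) / n

print(mean)           # -0.024
print(round(std, 6))  # 0.999712
print(baseline)       # 0.512
```

Any useful model should therefore score well above 0.512 on data with a similar class balance.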


## Scaling the dataset

Preprocessing matters because it presents the data to the model in a form that makes patterns easier to learn and predictions more precise. Here it consists of standardizing each feature to zero mean and unit variance, using statistics computed on the training set only.

Once standardized, the features are directly comparable, which prevents any single feature from having an excessively large impact on the model.

In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Mean of scaled training features:", np.mean(X_train_scaled, axis=0)[:5])  # First 5 features
print("Std of scaled training features:", np.std(X_train_scaled, axis=0)[:5])  # First 5 features

Mean of scaled training features: [ 2.48689958e-17 -2.88657986e-17  1.03250741e-17  1.77635684e-17
 -3.99680289e-18]
Std of scaled training features: [1. 1. 1. 1. 1.]
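
The standardization applied above can be sketched by hand for a single feature column. The values below are hypothetical and chosen only for illustration. The point is that the mean and standard deviation come from the training column and are then reused unchanged on the test column.

```python
from statistics import fmean, pstdev

# Hypothetical values for one feature (illustrative only, not from the dataset).
train_col = [48.0, 52.0, 55.0, 61.0, 44.0]
test_col = [50.0, 58.0]

# Statistics are computed on the training column only.
mu = fmean(train_col)
sigma = pstdev(train_col)  # population std (ddof=0), which StandardScaler also uses

# z-score: subtract the training mean, divide by the training std.
train_scaled = [(x - mu) / sigma for x in train_col]

# The test column reuses the training statistics, so it is not
# guaranteed to end up with mean 0 and std 1.
test_scaled = [(x - mu) / sigma for x in test_col]

print(mu, round(sigma, 3))
print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

This mirrors the `fit_transform` on the training set followed by `transform` on the test set: no information from the test set influences the scaling.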


## Training the models

We will compare two models, Logistic Regression and a Support Vector Classifier (SVC), to identify which performs better at predicting basketball game outcomes from halftime data.

### Method 1: Logistic Regression Model

In [6]:
# Initialize and train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

y_train_pred = log_reg.predict(X_train_scaled) # training set
y_test_pred = log_reg.predict(X_test_scaled) # test set

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Logistic Regression Training Accuracy:", train_accuracy)
print("Logistic Regression Test Accuracy:", test_accuracy)

Logistic Regression Training Accuracy: 0.894
Logistic Regression Test Accuracy: 0.84


### Method 2: Train SVC - Support Vector Classifier

In [7]:
# Initialize and train the support vector classifier
svc = SVC()
svc.fit(X_train_scaled, y_train)

y_train_pred_svc = svc.predict(X_train_scaled) # training set
y_test_pred_svc = svc.predict(X_test_scaled) # test set

# Evaluate the model
train_accuracy_svc = accuracy_score(y_train, y_train_pred_svc)
test_accuracy_svc = accuracy_score(y_test, y_test_pred_svc)

print("SVC Training Accuracy:", train_accuracy_svc)
print("SVC Test Accuracy:", test_accuracy_svc)

SVC Training Accuracy: 0.962
SVC Test Accuracy: 0.876


---
The SVC looks like the better candidate, since it reaches a higher test accuracy with default settings (0.876 against 0.84 for the logistic regression).

So let's try to optimize it by fine-tuning its hyperparameters with grid search, a systematic approach that builds and evaluates a model for each combination of parameter values specified in a grid.


## Fine-tuning the SVC hyperparameters

In [8]:
def tune_svc_hyperparameters(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[SVC, float, float, np.ndarray]:
    """
    Tunes hyperparameters for an SVC model using GridSearchCV and evaluates the optimized model.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_svc: The SVC model with the best found hyperparameters.
    - best_cv_accuracy: The best cross-validation accuracy achieved during tuning.
    - test_accuracy: Accuracy of the best SVC model on the test data.
    - y_test_pred_optimized: Predictions made by the optimized model on the test set.
    """
    
    # Define the parameter grid
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.01, 0.1, 1, 10],
        'kernel': ['rbf', 'linear']
    }
    
    # Initialize the GridSearchCV object
    grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', verbose=1)
    grid_search.fit(X_train, y_train)
    
    best_svc = grid_search.best_estimator_ # Extract the best SVC model
    best_cv_accuracy = grid_search.best_score_ #cross-validation
    
    # Predict on the test set with the optimized model
    y_test_pred_optimized = best_svc.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred_optimized)
    
    return best_svc, best_cv_accuracy, test_accuracy, y_test_pred_optimized

In [9]:
# Call the tuned svc function
best_svc, best_cv_accuracy, test_accuracy, y_test_pred_optimized = tune_svc_hyperparameters(X_train_scaled, y_train, X_test_scaled, y_test)

print("Best Cross-Validation Accuracy:", best_cv_accuracy)
print("Test Accuracy of the Optimized SVC Model:", test_accuracy)
#print("Best Model Parameters:", best_svc.get_params())

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best Cross-Validation Accuracy: 0.8460000000000001
Test Accuracy of the Optimized SVC Model: 0.856
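
The "48 candidates, 240 fits" line in the output follows directly from the parameter grid. The sketch below enumerates the grid by hand to show where those numbers come from. The `effective` helper is introduced here for illustration only: it reflects the fact that the linear kernel does not use `gamma`, so several candidates describe the same model.

```python
from itertools import product

# Same grid as in tune_svc_hyperparameters above.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
}
cv_folds = 5

# Every combination of values is one candidate.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

def effective(params):
    """Key identifying the model actually fitted (gamma is unused by 'linear')."""
    if params['kernel'] == 'linear':
        return (params['C'], params['kernel'])
    return (params['C'], params['gamma'], params['kernel'])

distinct = {effective(p) for p in candidates}

print(len(candidates))             # 48
print(len(candidates) * cv_folds)  # 240
print(len(distinct))               # 28
```

With 500 training samples and `cv=5`, each of the 240 fits trains on 400 samples and validates on the remaining 100. Only 28 of the 48 candidates are distinct models, so splitting the grid per kernel would save fits without changing the search.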


In [10]:
# Generate a classification report for the optimized SVC model on the test data
report = classification_report(y_test, y_test_pred_optimized, target_names=['Away Win', 'Home Win'])

print("Classification Report for the Optimized SVC Model:\n", report)

Classification Report for the Optimized SVC Model:
               precision    recall  f1-score   support

    Away Win       0.92      0.83      0.87       298
    Home Win       0.78      0.89      0.83       202

    accuracy                           0.86       500
   macro avg       0.85      0.86      0.85       500
weighted avg       0.86      0.86      0.86       500



## Observation

Based on halftime data, the tuned SVC predicts basketball game outcomes satisfactorily, with a test accuracy of 0.856 and a cross-validation accuracy of 0.846. This is slightly below the 0.876 test accuracy of the default SVC, so the grid search did not improve on the default settings for this test set. F1-scores of 0.87 for "Away Win" and 0.83 for "Home Win" show that the model predicts both outcomes comparably well. According to these measurements, the SVC adequately captures the underlying patterns in the halftime data.

## Fine-tuning the logistic regression

For a fair comparison, we also tune the hyperparameters of the logistic regression model, so that both models are evaluated on a level playing field.

In [11]:
def tune_logistic_regression_hyperparameters(X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> Tuple[LogisticRegression, Dict[str, Any], float, np.ndarray]:
    """
    Tunes hyperparameters for a Logistic Regression model using GridSearchCV and evaluates the optimized model.
    
    Parameters:
    - X_train: ndarray, feature matrix for training data.
    - y_train: ndarray, labels for training data.
    - X_test: ndarray, feature matrix for test data.
    - y_test: ndarray, labels for test data.
    
    Returns:
    - best_lr: The Logistic Regression model with the best found hyperparameters.
    - best_params: The best hyperparameters found by GridSearchCV.
    - test_accuracy: Accuracy of the best Logistic Regression model on the test data.
    - y_test_pred_lr_optimized: Predictions made by the optimized model on the test set.
    """
    
    # Define the parameter grid
    param_grid_lr = {
        'C': [0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
    
    # Initialize the GridSearchCV object
    grid_search_lr = GridSearchCV(LogisticRegression(), param_grid_lr, cv=5, scoring='accuracy', verbose=1)
    grid_search_lr.fit(X_train, y_train)
    
    # Extract the best model and parameters
    best_lr = grid_search_lr.best_estimator_
    best_params = grid_search_lr.best_params_
    
    # Predict on the test set with the optimized model
    y_test_pred_lr_optimized = best_lr.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred_lr_optimized)
    
    return best_lr, best_params, test_accuracy, y_test_pred_lr_optimized

In [12]:
best_lr, best_params_lr, test_accuracy_lr, y_test_pred_lr_optimized = tune_logistic_regression_hyperparameters(X_train_scaled, y_train, X_test_scaled, y_test)

# Print the best parameters and test accuracy
print("Best Parameters for Logistic Regression:", best_params_lr)
print("Optimized Logistic Regression Test Accuracy:", test_accuracy_lr)

# Generate and print the classification report, including the F1-score
classification_report_lr = classification_report(y_test, y_test_pred_lr_optimized, target_names=['Away Win', 'Home Win'])
print("Classification Report for the Optimized Logistic Regression Model:\n", classification_report_lr)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Parameters for Logistic Regression: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Optimized Logistic Regression Test Accuracy: 0.836
Classification Report for the Optimized Logistic Regression Model:
               precision    recall  f1-score   support

    Away Win       0.92      0.80      0.85       298
    Home Win       0.75      0.89      0.81       202

    accuracy                           0.84       500
   macro avg       0.83      0.84      0.83       500
weighted avg       0.85      0.84      0.84       500



## Final overview

Fine-tuning the logistic regression gave a test accuracy of 0.836, which is close to the required 0.84 but does not reach it. The default logistic regression, with a promising training accuracy of 0.894, scored exactly 0.84 on the test set. The weak point of the logistic regression is its prediction of home wins, where precision is only 0.75.

In short, the tuned SVC outperforms the tuned logistic regression in overall test accuracy (0.856 against 0.836) and has higher F1-scores for both classes (0.87 and 0.83 against 0.85 and 0.81).
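
To put the gap between 0.856 and 0.836 in context, here is a rough sketch of the sampling uncertainty of each accuracy on a 500-game test set. It uses the binomial standard error and treats the two accuracies as independent, which is a simplifying assumption: both models are scored on the same games, so a paired comparison would be more precise.

```python
import math

n_test = 500
acc_svc = 0.856  # tuned SVC test accuracy
acc_lr = 0.836   # tuned logistic regression test accuracy

def standard_error(acc, n):
    """Binomial standard error of an accuracy estimated from n samples."""
    return math.sqrt(acc * (1 - acc) / n)

se_svc = standard_error(acc_svc, n_test)
se_lr = standard_error(acc_lr, n_test)

# Size of the gap, expressed in games out of the 500 in the test set.
gap_games = round((acc_svc - acc_lr) * n_test)

print(round(se_svc, 4))  # 0.0157
print(round(se_lr, 4))   # 0.0166
print(gap_games)         # 10
```

The gap amounts to 10 games out of 500 and is of the same order as one standard error, so the ranking of the two models is best read as indicative rather than conclusive.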

Pol-Antoine Loiseau - Florent Rossignol