# Classification

This notebook applies several classification models to predict whether a team will make the playoffs in the following year.

The models are trained on data from years 1 to 9 and tested on data from year 10. Here is an overview of how the notebook works:

First, the dataset is loaded and split by year into a training set (years 1 through 9) and a testing set (year 10).

Then we selected the features (i.e., the input variables) by dropping the target variable (`PlayOffNextYear`) and the identifier columns `tmID` and `year`. The features were standardized with `StandardScaler`, which is fitted on the training set only and then applied to both the training and the testing set.
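
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization does. It assumes a single hypothetical feature with made-up values (not taken from the dataset). The mean and standard deviation are estimated from the training values only and then reused for the test values, which mirrors fitting the scaler on the training set and applying it to the test set.

```python
from statistics import fmean, pstdev

# Hypothetical values of one feature (illustration only).
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# The statistics are estimated on the training data only.
mean = fmean(train_values)
std = pstdev(train_values)  # population standard deviation, as StandardScaler uses


def standardize(values, mean, std):
    """Shift by the training mean and divide by the training standard deviation."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, because they are expressed relative to the training statistics.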

Then we defined a range of classification models to evaluate: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, Naive Bayes, Decision Tree, XGBoost, Gradient Boosting, and Support Vector Machine (SVM). For the XGBoost and Gradient Boosting models, we used SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes in the training data.
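
The idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_like_oversample`, the parameter `k`, and the data points are all hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, sorted by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D data: 6 majority samples and 3 minority samples.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))
```

Because the new points are interpolations, they always lie between existing minority samples. The class counts become equal without simply duplicating rows.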

Once the models were defined and the data was prepared, the next step was to train each model on the scaled training data. The models that use SMOTE were trained on the balanced dataset instead.

After training, we used each model to make predictions and calculated several evaluation metrics: accuracy, precision, recall, and F1 score. In this cell the predictions are made on the same data each model was trained on, so the metrics show how well the models fit the training set rather than how well they generalize.
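
As a worked example of how these four metrics relate to each other, the sketch below computes them from the counts of true and false positives and negatives. The labels and predictions are hypothetical, and the helper `classification_metrics` is written here for illustration only; it is not part of the notebook.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}


# Hypothetical example: 4 playoff teams (1) and 6 non-playoff teams (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false negative, 1 false positive and 5 true negatives, so accuracy is 8/10 while precision, recall and F1 are each 3/4.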

Additionally, we used `ClassificationUtils` to generate visualizations such as confusion matrices and learning curves, which help in understanding how each model behaves and where it makes mistakes. These plots gave us a deeper insight into the models' performance and helped us identify areas for improvement.

Finally, we collected all the results in a summary table showing the performance of each model on every metric, which makes it easy to compare the models side by side.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from classification_utils import ClassificationUtils
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

file_path = '../newData/Shifted_playoff.csv' 
df = pd.read_csv(file_path)

In [None]:

train_df = df[df['year'] < 10]
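# The text describes year 10 as the held-out test set.
# .copy() lets us add prediction columns later without a SettingWithCopyWarning.
test_df = df[df['year'] == 10].copy()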

X_train = train_df.drop(columns=["PlayOffNextYear", "tmID", "year"])
y_train = train_df["PlayOffNextYear"]

X_test = test_df.drop(columns=["PlayOffNextYear", "tmID", "year"])
y_test = test_df["PlayOffNextYear"] 

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(max_depth=5, min_samples_leaf=10, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42),
    "XGBoost": XGBClassifier(max_depth=2, n_estimators=50, learning_rate=0.05, reg_lambda=15, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(max_depth=2, n_estimators=50, learning_rate=0.05, random_state=42),
    "SVM": SVC(kernel="linear", probability=True)
}

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

# These models are trained on the SMOTE-balanced data; all others use the original training set.
smote_models = {"XGBoost", "Gradient Boosting"}

results = {}
for model_name, model in models.items():
    if model_name in smote_models:
        label = f"{model_name} (SMOTE)"
        X_fit, y_fit = X_train_balanced, y_train_balanced
        print(f"Training {model_name} with SMOTE-balanced dataset...")
    else:
        label = model_name
        X_fit, y_fit = X_train_scaled, y_train
        print(f"Training {model_name}...")

    model.fit(X_fit, y_fit)

    # Predictions are made on the data the model was fitted on,
    # so the metrics below are training-set metrics.
    predictions = model.predict(X_fit)

    results[label] = {
        "Accuracy": accuracy_score(y_fit, predictions),
        "Precision": precision_score(y_fit, predictions),
        "Recall": recall_score(y_fit, predictions),
        "F1": f1_score(y_fit, predictions)
    }

    ClassificationUtils.plot_confusion_matrix(y_fit, predictions, model_name=label)
    ClassificationUtils.plot_learning_curve(
        estimator=model,
        X=X_fit,
        y=y_fit,
        title=f"Learning Curve: {label}",
        cv=5
    )

results_df = pd.DataFrame(results).T
print("Model evaluation results:")
display(results_df)

### Cross-Validation Results
In this section we applied cross-validation to the XGBoost and Gradient Boosting models to evaluate their performance. First, we defined a `StratifiedKFold` object with 5 shuffled splits and set up a loop that runs `cross_val_score` with it on the scaled training data for both models.
For each model, we calculated the mean and the standard deviation of the accuracy scores across the 5 folds. These results are stored in a dictionary and displayed in a summary table.

Afterward, we ran a second 5-fold cross-validation for both boosting models by passing `cv=5` and calculated the same metrics: the mean accuracy and its standard deviation. Note that for a classifier with a binary or multiclass target, scikit-learn also uses stratified folds when `cv` is an integer, so the two runs differ only in that the first one shuffles the data before splitting. The results are stored in another dictionary and displayed in a separate table.
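
To illustrate what "stratified" means here, the following sketch splits a list of labels into folds so that each fold keeps the class proportions of the whole set. It is a simplified illustration under made-up labels, not scikit-learn's actual algorithm, and the helper name `stratified_folds` is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class out to the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical labels: 10 playoff teams (1) and 15 non-playoff teams (0).
labels = [1] * 10 + [0] * 15
folds = stratified_folds(labels, n_splits=5)

for i, fold in enumerate(folds):
    positives = sum(labels[idx] for idx in fold)
    print(f"fold {i}: {len(fold)} samples, {positives} positives")
```

With these labels every fold receives 2 positive and 3 negative samples, so each validation fold has the same 40% playoff rate as the full set.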


In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv_results = {}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for model_name in ["XGBoost", "Gradient Boosting"]:
    model = models[model_name]
    print(f"Performing stratified cross-validation for {model_name}...")
    
    scores = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring='accuracy')
    
    cv_results[model_name] = {
        "Mean Accuracy": scores.mean(),
        "Std Dev": scores.std()
    }

cv_results_df = pd.DataFrame(cv_results).T
print("Stratified Cross-Validation results for boosting models:")
display(cv_results_df)

cv_results_boosting = {}

for model_name in ["XGBoost", "Gradient Boosting"]:
    model = models[model_name]
    print(f"Performing cross-validation for {model_name}...")
    
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    
    cv_results_boosting[model_name] = {
        "Mean Accuracy": scores.mean(),
        "Std Dev": scores.std()
    }

cv_results_boosting_df = pd.DataFrame(cv_results_boosting).T
print("Cross-validation results for boosting models:")
display(cv_results_boosting_df)


### Test Set Evaluation
Finally, we applied each model to the test set (Year 10) to assess how well it generalizes.
Each model is first refit on the scaled training data. It then predicts the playoff probability for every team, and the probabilities are normalized so that they sum to 8.
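
The normalization step can be sketched as follows, assuming hypothetical predicted probabilities for 12 teams. The helper name `normalize_to_total` is introduced for illustration only.

```python
def normalize_to_total(probabilities, total=8):
    """Rescale the values so that they sum to `total`, keeping their ratios."""
    prob_sum = sum(probabilities)
    if prob_sum == 0:
        raise ValueError("Cannot normalize: all probabilities are zero.")
    return [total * p / prob_sum for p in probabilities]


# Hypothetical predicted probabilities for 12 teams (illustration only).
probabilities = [0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1]
normalized = normalize_to_total(probabilities)

print(round(sum(normalized), 6))
print(max(normalized))
```

The rescaling preserves the ranking of the teams. The normalized values are no longer probabilities in the strict sense: whenever the raw probabilities sum to less than 8, as in this example, the largest normalized values can exceed 1.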


In [None]:
test_results = {}
for model_name, model in models.items():
    print(f"Making predictions for {model_name} on Year 10...")

    model.fit(X_train_scaled, y_train)
    probabilities = model.predict_proba(X_test_scaled)[:, 1]

    normalized_probabilities = 8 * probabilities / probabilities.sum()
    test_df[f"{model_name}_Predicted_Probability"] = probabilities
    test_df[f"{model_name}_Normalized_Probability"] = normalized_probabilities

display(test_df)


### Prediction for Year 10 with Playoff Constraints

In this phase, we used the best model to:

1. Predict the probabilities of playoff qualification for the teams in Year 10.
2. Normalize the probabilities to ensure the sum equals 8.
3. Select the top 4 teams from each conference.
4. Add a label identifying the teams selected for the playoffs.
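
Steps 3 and 4 can be sketched without pandas as shown below. The team IDs, conference codes and scores are made up, and the helper `select_playoff_teams` is a hypothetical name used only for this illustration.

```python
def select_playoff_teams(teams, per_conference=4):
    """Return the IDs of the top `per_conference` teams by score in each conference."""
    selected = set()
    conferences = {team["confID"] for team in teams}
    for conf in conferences:
        members = [team for team in teams if team["confID"] == conf]
        members.sort(key=lambda team: team["score"], reverse=True)
        selected.update(team["tmID"] for team in members[:per_conference])
    return selected


# Hypothetical normalized scores for 6 teams in each of two conferences.
east_scores = [1.2, 0.9, 0.4, 1.0, 0.7, 0.3]
west_scores = [0.5, 1.1, 0.8, 0.2, 0.6, 0.3]
teams = (
    [{"tmID": f"E{i}", "confID": 1, "score": s} for i, s in enumerate(east_scores)]
    + [{"tmID": f"W{i}", "confID": 0, "score": s} for i, s in enumerate(west_scores)]
)

selected = select_playoff_teams(teams)

# Step 4: add a 0/1 label marking the selected teams.
for team in teams:
    team["Selected_Playoff"] = int(team["tmID"] in selected)

print(sorted(selected))
```

Ranking within each conference guarantees exactly 4 selections per conference, which a single global top 8 would not.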

In [None]:
best_model = RandomForestClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
best_model.fit(X_train_scaled, y_train)

probabilities = best_model.predict_proba(X_test_scaled)[:, 1]  # Probability for class 1 (playoff)

test_df = test_df.copy()  # Explicit copy
test_df.loc[:, "Predicted_Probability"] = probabilities

normalized_probabilities = 8 * probabilities / probabilities.sum()
test_df.loc[:, "Normalized_Probability"] = normalized_probabilities

east_teams = test_df[test_df["confID"] == 1].nlargest(4, "Normalized_Probability")
west_teams = test_df[test_df["confID"] == 0].nlargest(4, "Normalized_Probability")

playoff_teams = pd.concat([east_teams, west_teams])

test_df.loc[:, "Selected_Playoff"] = test_df["tmID"].isin(playoff_teams["tmID"]).astype(int)

print("Selected playoff teams:")
display(playoff_teams[["tmID", "confID", "Normalized_Probability"]])

print("Complete table with probabilities and selections:")
display(test_df[["tmID", "confID", "Predicted_Probability", "Normalized_Probability", "Selected_Playoff"]])
