## **Machine Learning - WNBA Playoffs Prediction**
This notebook focuses on understanding the data. We use SQLite to store the data because of its scalability and because the data fits a relational schema.

https://docs.python.org/3/library/sqlite3.html

Import sqlite3 and connect to database file

### **Imports**

In [1]:
# Standard library
import sqlite3

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pointbiserialr
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Project modules
import feature_selection as fs
import prep_utils as pu

PAST_YEARS = 1
TEST_YEAR = 10
KAGGLE_TEST_YEAR = 11

### **Database Connection Setup**
In this phase we create the database and the tables used to store the data. The competition data is also added to the database.

In [2]:
db = sqlite3.connect("db/ac.db")
db_cur = db.cursor()

[df_awards, df_coaches, df_players_teams, df_players, df_series_post, df_teams_post, df_teams] = pu.db_to_pandas(db)
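
The internals of `pu.db_to_pandas` are not shown in this notebook. As a rough sketch of what loading every SQLite table into a dataframe can look like, the hypothetical helper `load_tables` below reads the table names from `sqlite_master` and loads each one with `pd.read_sql_query`. The in-memory database and the `teams` rows are toy stand-ins for `db/ac.db`, not the project's data.

```python
import sqlite3

import pandas as pd


def load_tables(conn):
    """Return a dict mapping each table name to a DataFrame (hypothetical helper)."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
    )["name"]
    # Table names come from the database itself, so quoting them is enough here
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn) for name in names}


# Toy in-memory database standing in for db/ac.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (year INTEGER, tmID TEXT, playoff TEXT)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [(1, "AAA", "Y"), (1, "BBB", "N")],
)
conn.commit()

tables = load_tables(conn)
print(tables["teams"])
conn.close()
```

Returning a dict keyed by table name avoids depending on the order in which the dataframes are unpacked.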

### **Data Preparation**

***Preparing Teams Dataframe***

+ Removed irrelevant attributes (arena, etc.)
+ Replaced some features with their success rate (e.g. made/attempted)
+ Created the playoff rank
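
The made/attempted replacement listed above can be sketched as follows. This is a minimal illustration, not the code inside `pu.prepare_teams`: the helper `success_rate` and the column names `fgm` and `fga` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def success_rate(made, attempted):
    """Made/attempted ratio; rows with zero attempts become NaN instead of inf."""
    return made / attempted.replace(0, np.nan)


# Hypothetical column names: field goals made (fgm) and attempted (fga)
teams = pd.DataFrame({"fgm": [30, 0, 45], "fga": [60, 0, 90]})
teams["fg_pct"] = success_rate(teams["fgm"], teams["fga"])

# The raw counts are dropped once the rate has replaced them
teams = teams.drop(columns=["fgm", "fga"])
print(teams)
```

Mapping zero attempts to NaN keeps the missing rate explicit, so it can be imputed or dropped deliberately later.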




In [3]:
# Transform all possible attributes into percentages. (Made / Attempted) & (Offensive & Defensive Rebound %)
df_new_teams = pu.prepare_teams(df_teams,df_teams_post,PAST_YEARS)

df_new_teams = fs.fs_teams(df_new_teams)

df_new_teams = pu.playoff_rank(df_new_teams,df_teams,PAST_YEARS)
df_team_results = df_new_teams[["year","tmID","confID","playoff","rank","team_playoffs_count","playoff_rank"]]


Dropping divID in Teams...


***Preparing Coaches Dataframe***

+ Regular-season and playoff win rates
+ Coach awards count
+ Number of playoff appearances



In [4]:
df_new_coaches = pu.prepare_coaches(df_coaches, df_awards,PAST_YEARS)
df_new_coaches = pu.group_coaches(df_new_coaches)

df_new_coaches.drop("coachID", axis = 1, inplace = True)

df_final_coaches = df_new_coaches.copy()
df_final_coaches.columns = df_final_coaches.columns.str.lower()


Dropping Attribute lgID in Coaches...
Creating attribute coach previous regular season win ratio...
Creating attribute coach playoffs win ratio...
Creating attribute coach playoffs count...
Creating attribute coach awards count...
Dropping attribute post_wins..
Dropping attribute post_losses..
Dropping attribute won..
Dropping attribute lost..

Coaches Null Verification:
year                    0
tmID                    0
coachID                 0
coach_reg_wr            0
coach_po_wr             0
coach_playoffs_count    0
coach_awards            0
dtype: int64


***Preparing Players Dataframe***
+ Created Player_Awards
+ Replaced some features with their success rate (e.g. made/attempted)


In [5]:
df_new_players_teams = pu.prepare_player_teams(df_players_teams,df_awards,PAST_YEARS)

Dropping Attribute lgID in Players_Teams...


***Developed 2 Team Ratings:***
+ A rating of the players' performance in previous seasons. The goal is to assess the roster's capacity for consistent performance across seasons, player growth, and the team's general quality and stability.

+ A rating of whether the team has had a roster of talented players in previous years. The goal is to assess the team's historical trend of acquiring or keeping high-caliber players and to provide a quantitative measure of the team's general quality and stability.

In [6]:

# How the team performed in the previous years
previous_team_ratings, df_new_players_team = pu.final_team_ratings(df_players_teams,df_awards, df_players, df_teams, PAST_YEARS)


# How the players performed in the previous years
previous_team_player_ratings = pu.final_player_team_ratings(df_teams, df_new_players_team, df_players, PAST_YEARS,df_new_players_teams[df_new_players_teams['year'] == KAGGLE_TEST_YEAR])





A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  player_data.sort_values(by='year', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  player_data['NextYear_tmID'] = player_data['tmID'].shift(-1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  player_data['NextYear'] = player_data['year'].shift(-1)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the docume
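
The warnings above are pandas' `SettingWithCopyWarning`, raised when a column is assigned on a dataframe that was itself obtained by slicing another one. A common way to avoid it is to take an explicit `.copy()` of the slice and to reassign instead of using `inplace=True`. The sketch below applies that to the same three statements the warnings point at. The `players` frame and its values are toy data, and the filtering step is an assumption about how `player_data` is built.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "playerID": ["p1", "p1", "p2"],
        "year": [2, 1, 1],
        "tmID": ["BBB", "AAA", "AAA"],
    }
)

# Take an explicit copy of the filtered rows so later assignments
# modify an independent frame rather than a slice of `players`
player_data = players[players["playerID"] == "p1"].copy()

# Reassign instead of sorting in place
player_data = player_data.sort_values(by="year")

# Next season's team and year for the same player
player_data["NextYear_tmID"] = player_data["tmID"].shift(-1)
player_data["NextYear"] = player_data["year"].shift(-1)
print(player_data)
```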

#### **Merging all dataframes into one final dataframe**

In [7]:

df_players = df_new_players_teams.copy()
df_players = fs.fs_players(df_players,0.2)
df_players = df_players[df_players['year'] != 1]


df_team_results.columns = df_team_results.columns.str.lower()
previous_team_player_ratings.columns = previous_team_player_ratings.columns.str.lower()


merged_data = pd.merge(df_players, df_team_results, on=['tmid', 'year'], how='left')
merged_data = pd.merge(merged_data, df_final_coaches, on=['tmid', 'year'], how='left')
merged_data = pd.merge(merged_data, previous_team_ratings, on=['tmid', 'year'], how='left')
merged_data = pd.merge(merged_data, previous_team_player_ratings, on=['tmid', 'year'], how='left')


#### Point-Biserial Correlation
We use the point-biserial correlation to measure how strongly each continuous attribute correlates with the binary target.


In [8]:
data1_10 = merged_data[merged_data['year'] != 11]
fs.bisserial_corr(data1_10)

team_players_rating: 39.21% correlation
total_assists: 36.39% correlation
playoff_rank: 33.90% correlation
total_gs: 32.71% correlation
total_points: 31.77% correlation
coach_po_wr: 31.26% correlation
total_minutes: 31.16% correlation
coach_reg_wr: 30.47% correlation
total_turnovers: 28.44% correlation
player_awards: 27.79% correlation
total_blocks: 26.30% correlation
team_rating: 25.45% correlation
total_steals: 24.90% correlation
coach_playoffs_count: 24.50% correlation
total_pf: 23.19% correlation
team_playoffs_count: 19.10% correlation
rank: 18.01% correlation
total_drebounds_pct: 12.93% correlation
total_orebounds_pct: 12.93% correlation
coach_awards: 12.93% correlation
total_dq: 12.35% correlation
total_fg_pct: 10.58% correlation
total_gp: 10.20% correlation
total_ft_pct: 4.27% correlation
total_three_pct: 3.71% correlation
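
A listing like the one above can be produced with `scipy.stats.pointbiserialr`. The sketch below is not the body of `fs.bisserial_corr`, which is not shown here. It runs on synthetic data, and both the use of the absolute value and the helper name `biserial_ranking` are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr


def biserial_ranking(df, target):
    """Absolute point-biserial correlation of each column with a binary target."""
    scores = {}
    for col in df.columns:
        if col == target:
            continue
        r, _p_value = pointbiserialr(df[target], df[col])
        scores[col] = abs(r)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Synthetic data: one column tied to the target, one pure-noise column
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
toy = pd.DataFrame(
    {
        "playoff": target,
        "informative": target + rng.normal(0, 0.5, size=200),
        "noise": rng.normal(0, 1, size=200),
    }
)

corrs = biserial_ranking(toy, "playoff")
for col, r in corrs.items():
    print(f"{col}: {r:.2%} correlation")
```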


#### **Splitting the dataset into train and test sets**


In [9]:
# Encode the categorical identifiers; one encoder per column keeps each mapping invertible
tmid_encoder = LabelEncoder()
confid_encoder = LabelEncoder()

merged_data['tmid'] = tmid_encoder.fit_transform(merged_data['tmid'])
merged_data['confid'] = confid_encoder.fit_transform(merged_data['confid'])

x = merged_data.drop('playoff', axis=1)
y = merged_data['playoff']

# Temporal split: train on every season before TEST_YEAR, test on TEST_YEAR itself
x_train = merged_data[merged_data['year'].between(0, TEST_YEAR - 1)].drop('playoff', axis=1)
y_train = merged_data[merged_data['year'].between(0, TEST_YEAR - 1)]['playoff']

x_test = merged_data[merged_data['year'] == TEST_YEAR].drop('playoff', axis=1)
y_test = merged_data[merged_data['year'] == TEST_YEAR]['playoff']

### RFE
We will run Recursive Feature Elimination (RFE) on the different models to find out which features produce the best results.

In [10]:
print("Final Features:")
print(x_train.columns)
min_features = 20

"""rfe_classifiers = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42,max_iter=10000),
    'Support Vector Machine': SVC(random_state=42, kernel='linear'),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}"""
rfe_classifiers = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42,max_iter=10000),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}


classifiers_features = {}

total_features = len(x_train.columns)

best_model_info = {}


for model_name, model in rfe_classifiers.items():
    print(f"\033[1mModel: {model_name}\033[0m")
    
    results = []

    for i in range(min_features, total_features):
        rfe = RFE(model, n_features_to_select=i)
        rfe.fit(x_train, y_train)
        
        selected_features = set(x_train.columns[rfe.support_])
        selected_features.add("tmid")
        selected_features.add("year")
        selected_features = list(selected_features)

        model.fit(x_train[selected_features], y_train)

        accuracy = model.score(x_test[selected_features], y_test)

        results.append((selected_features, accuracy))
    

    # Sort the results based on accuracy in descending order
    results = sorted(results, key=lambda x: x[1], reverse=True)

    classifiers_features[model_name] = results[0][0]
    best_model_info[model_name] = {'features': results[0][0], 'accuracy': results[0][1], 'model': model}
    
    # Print the results
    for features, accuracy in results[:3]:
        print("Selected Features:", features)
        print("Accuracy:" + str(accuracy) + '\n')

best_model_name = max(best_model_info, key=lambda k: best_model_info[k]['accuracy'])
best_model = best_model_info[best_model_name]['model']
best_features = best_model_info[best_model_name]['features']

print(f"\033[1mBest Model: {best_model_name}\033[0m")
print(f"Best Features: {best_features}")
print(f"Accuracy: {best_model_info[best_model_name]['accuracy']}\n")

Final Features:
Index(['year', 'tmid', 'player_awards', 'total_minutes', 'total_points',
       'total_assists', 'total_steals', 'total_blocks', 'total_turnovers',
       'total_pf', 'total_dq', 'total_gs', 'total_gp', 'total_fg_pct',
       'total_ft_pct', 'total_three_pct', 'total_orebounds_pct',
       'total_drebounds_pct', 'confid', 'rank', 'team_playoffs_count',
       'playoff_rank', 'coach_reg_wr', 'coach_po_wr', 'coach_playoffs_count',
       'coach_awards', 'team_rating', 'team_players_rating'],
      dtype='object')
Model: Random Forest
Selected Features: ['total_fg_pct', 'total_points', 'total_assists', 'total_ft_pct', 'team_players_rating', 'team_rating', 'total_orebounds_pct', 'coach_reg_wr', 'total_three_pct', 'tmid', 'total_pf', 'total_blocks', 'coach_playoffs_count', 'year', 'total_steals', 'total_gp', 'total_minutes', 'total_drebounds_pct', 'rank', 'total_gs', 'total_turnovers']
Accuracy:0.7692307692307693

Selected Features: ['team_playoffs_count', 'total_fg_
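
For reference, a single RFE step from the loop above looks like this in isolation. This is a sketch on synthetic data from `make_classification`, not on the notebook's dataframe: `support_` is the boolean mask used above to pick the selected columns, and `ranking_` gives the order in which the remaining features were eliminated.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for x_train / y_train
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=42
)

# Repeatedly fit the model and drop the weakest feature until 4 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)   # True for the features that were kept
print(rfe.ranking_)   # 1 for kept features, higher values were dropped earlier
```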

Since RFE needs an estimator that exposes feature weights (coef_ or feature_importances_), which KNN does not, we will use SelectKBest to run an equivalent feature-selection loop for KNN.
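
The difference can be checked directly. The sketch below, on synthetic data, shows that a fitted `KNeighborsClassifier` has neither `coef_` nor `feature_importances_`, while a fitted `LogisticRegression` has `coef_`. `SelectKBest` scores features without consulting the classifier, so its output can be passed to KNN.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

knn = KNeighborsClassifier().fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# RFE ranks features through one of these two attributes
knn_rankable = hasattr(knn, "coef_") or hasattr(knn, "feature_importances_")
logreg_rankable = hasattr(logreg, "coef_") or hasattr(logreg, "feature_importances_")
print(knn_rankable, logreg_rankable)

# SelectKBest is a filter: it scores each feature against the target on its own
selector = SelectKBest(score_func=mutual_info_classif, k=4).fit(X, y)
X_selected = selector.transform(X)
knn_selected = KNeighborsClassifier().fit(X_selected, y)
print(X_selected.shape)
```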

In [11]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
results = []

for i in range(min_features, total_features):
    knn = KNeighborsClassifier()

    selector = SelectKBest(score_func=mutual_info_classif, k=i)
    selector.fit(x_train, y_train)


    selected_features = set(x_train.columns[selector.get_support()])
    selected_features.add("tmid")
    selected_features.add("year")
    selected_features = list(selected_features)

    knn.fit(x_train[selected_features], y_train)

    accuracy = knn.score(x_test[selected_features], y_test)

    results.append((selected_features, accuracy))

results = sorted(results, key=lambda x: x[1], reverse=True)

classifiers_features["K-Nearest Neighbors"] = results[0][0]

# Print the best 3 results
for features, accuracy in results[:3]:
    print("Selected Features:", features)
    print("Accuracy:" + str(accuracy) + '\n')

Selected Features: ['team_playoffs_count', 'confid', 'total_points', 'total_assists', 'total_ft_pct', 'team_players_rating', 'team_rating', 'coach_reg_wr', 'total_three_pct', 'tmid', 'total_pf', 'total_blocks', 'coach_po_wr', 'coach_playoffs_count', 'year', 'total_steals', 'total_gp', 'total_minutes', 'player_awards', 'total_gs', 'playoff_rank']
Accuracy:0.6923076923076923

Selected Features: ['team_playoffs_count', 'total_points', 'total_assists', 'total_ft_pct', 'team_players_rating', 'team_rating', 'coach_reg_wr', 'total_three_pct', 'coach_awards', 'tmid', 'total_pf', 'total_blocks', 'coach_po_wr', 'coach_playoffs_count', 'year', 'total_steals', 'total_gp', 'total_minutes', 'rank', 'total_gs', 'playoff_rank', 'total_dq']
Accuracy:0.6923076923076923

Selected Features: ['team_playoffs_count', 'confid', 'total_points', 'total_assists', 'total_ft_pct', 'team_players_rating', 'team_rating', 'coach_reg_wr', 'total_three_pct', 'coach_awards', 'tmid', 'total_pf', 'total_blocks', 'coach_po_

In [12]:
accuracy_scores = []

best_model = best_model_info[best_model_name]['model']
probs = False
for test_year in range(3, 11):  # Test on years 3 to 10, training on all earlier years
    x_train = merged_data[merged_data['year'].between(1, test_year - 1)].drop('playoff', axis=1)
    y_train = merged_data[merged_data['year'].between(1, test_year - 1)]['playoff']
    
    x_test = merged_data[merged_data['year'] == test_year].drop('playoff', axis=1)
    y_test = merged_data[merged_data['year'] == test_year]['playoff']
    
   
    

    # Training the model
    best_model.fit(x_train[best_features], y_train)
    if(probs):
        # Testing the model
        probabilities = best_model.predict_proba(x_test[best_features])
        x_test['probabilities'] = probabilities.tolist()

        
        top_4_teams = (
            x_test.groupby('confid')
            .apply(lambda x: x.iloc[np.argsort(-np.array(x['probabilities'].tolist())[:, 0])][:4])
            .reset_index(drop=True)['tmid']
            .tolist()
        )

        # Label the top 4 teams per conference (ranked by class-0 probability) as 0 and all others as 1
        final_list = [0 if tmid in top_4_teams else 1 for tmid in x_test['tmid']]


        # Calculating accuracy
        accuracy = accuracy_score(y_test, final_list)
        accuracy_scores.append(accuracy)
    else:
        predictions = best_model.predict(x_test[best_features])
    
        # Calculating accuracy
        accuracy = accuracy_score(y_test, predictions)
        accuracy_scores.append(accuracy)
    print(f"Accuracy for testing year {test_year}: {accuracy}")
    
   

# Calculating average accuracy
average_accuracy = sum(accuracy_scores) / len(accuracy_scores)
print(f"\nAverage Accuracy across all years: {average_accuracy}")

Accuracy for testing year 3: 0.6875
Accuracy for testing year 4: 0.5714285714285714
Accuracy for testing year 5: 0.46153846153846156
Accuracy for testing year 6: 0.6923076923076923
Accuracy for testing year 7: 0.8571428571428571
Accuracy for testing year 8: 0.6923076923076923
Accuracy for testing year 9: 0.5714285714285714
Accuracy for testing year 10: 0.7692307692307693

Average Accuracy across all years: 0.6628605769230769


In [13]:

x_train = merged_data[merged_data['year'].between(1, 10)].drop('playoff', axis=1)
y_train = merged_data[merged_data['year'].between(1, 10)]['playoff']

x_test = merged_data[merged_data['year'] == 11].drop('playoff', axis=1)
y_test = merged_data[merged_data['year'] == 11]['playoff']


best_model.fit(x_train, y_train)
predictions = best_model.predict(x_test)


y_t = [1,0,0,1,1,0,1,1,1,1,0,1]

print("Year 11 accuracy:", accuracy_score(y_t, predictions))


Year 11 accuracy: 0.75


#### GridSearch
Now that we know the best features for each model, we will use grid search to fine-tune each model's hyperparameters.
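
The `fs.grid_search` helper is not shown in this notebook. A grid search of this kind can be sketched with scikit-learn's `GridSearchCV` as below. The data is synthetic and the parameter grid is illustrative, although it uses the same KNN parameter names that appear in the results recorded in the next cell.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected training features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in the grid is scored with 5-fold cross-validation
param_grid = {
    "n_neighbors": [3, 5, 10],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The resulting `best_params_` dict can be unpacked straight into the classifier constructor, e.g. `KNeighborsClassifier(**search.best_params_)`.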

In [14]:
#best_params = fs.grid_search(classifiers_features,x_train,x_test,y_train,y_test)
#print(best_params)

# Random Forest -> {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50}

# Logistic -> {'C': 100, 'fit_intercept': False, 'penalty': 'l1', 'solver': 'liblinear'}

# SVM -> {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}

# Gradient -> {'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 50}

# KNN ->  {'n_neighbors': 10, 'p': 2, 'weights': 'uniform'}

import time

# Tuned hyperparameters reported by the grid search (see the results recorded above)
model_params = {
    'Random Forest': {'random_state': 42, 'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50},
    'Logistic Regression': {'random_state': 42, 'C': 100, 'fit_intercept': False, 'penalty': 'l1', 'solver': 'liblinear', 'max_iter': 10000},
    'Support Vector Machine': {'random_state': 42, 'C': 10, 'gamma': 'scale', 'kernel': 'linear'},
    'Gradient Boosting': {'random_state': 42, 'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 50},
    'K-Nearest Neighbors': {'n_neighbors': 10, 'p': 2, 'weights': 'uniform'},
}

# Build each classifier from its tuned hyperparameters so the grid-search results are actually used
final_classifiers = {
    'Random Forest': RandomForestClassifier(**model_params['Random Forest']),
    'Logistic Regression': LogisticRegression(**model_params['Logistic Regression']),
    'Support Vector Machine': SVC(**model_params['Support Vector Machine']),
    'Gradient Boosting': GradientBoostingClassifier(**model_params['Gradient Boosting']),
    'K-Nearest Neighbors': KNeighborsClassifier(**model_params['K-Nearest Neighbors']),
}

for model_name, model in final_classifiers.items():
    start = time.time()
    model.fit(x_train[classifiers_features[model_name]], y_train)
    y_pred = model.predict(x_test[classifiers_features[model_name]])
    
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred)
    end = time.time()
    print(f'Accuracy for {model_name}: {accuracy}')
    print(f'AUC for {model_name}: {auc}')
    print(end-start)


  if xp.any(data != data.astype(int)):


ValueError: Input y_true contains NaN.
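
scikit-learn's metric functions raise this `ValueError` when the label vector contains missing values. One plausible cause, stated here as an assumption, is that `y_test` still holds the year-11 rows from the previous cell and that those rows have no `playoff` label in the database. A minimal guard is to score only the rows whose label is present, as in the sketch below, which uses toy values rather than the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with missing entries, standing in for an unlabeled test season
y_test = pd.Series([1.0, 0.0, np.nan, 1.0, np.nan, 0.0])
y_pred = np.array([1, 0, 1, 0, 0, 0])

# Score only the rows that actually have a label
labelled = y_test.notna().to_numpy()
if labelled.any():
    y_true = y_test[labelled].astype(int)
    accuracy = accuracy_score(y_true, y_pred[labelled])
    auc = roc_auc_score(y_true, y_pred[labelled])
    print(f"Accuracy: {accuracy}")
    print(f"AUC: {auc}")
else:
    print("No labelled rows to evaluate")
```

If every label is missing, as would be the case for a fully unlabeled season, the known outcomes have to be supplied separately, the way `y_t` is in the earlier cell.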