<h3> NBA 5 year careeer prediction using Random Forest Classifier

In [1]:
import pandas as pd
import numpy as np

<h4> 1. Load processed sets

<h5> Load the balanced train, validation and test sets using a custom built function

In [2]:
#Load Test and validation sets from custom function
from src.data.sets import load_sets

In [3]:
X_train, X_val, y_train, y_val, X_test, X_test_ID = load_sets( )

<h4> 2. Import Random Forest Classifier

In [4]:
from sklearn.ensemble import RandomForestRegressor

<h4> 2.1 Analyse features importance in random forest model with current training sets

In [5]:
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(RandomForestRegressor(n_estimators = 300, random_state=44))
sel.fit(X_train, y_train)

<h4> 2.2 Finding the most relevant features in making predictions

In [6]:
columns = sel.get_support()
columns = columns.tolist()
columns

[True,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False]

<h5> As column importance is given in order of appearance on the scaled set, importing column names saved in data_prep for easy visualization will be performed

<h5> Importing predictor column names and combining with relevant columns to find out the best column predictors

In [7]:
predictors  = pd.read_csv('../data/interim/predictor_names.csv')

In [8]:
predictors['Relevant_Feature'] = columns

In [9]:
predictors

Unnamed: 0,names,Relevant_Feature
0,GP,True
1,MIN,True
2,PTS,False
3,FGM,False
4,FGA,False
5,FG%,True
6,3P Made,False
7,3PA,False
8,3P%,True
9,FTM,False


In [10]:
predictors.loc[predictors['Relevant_Feature'] == 1]

Unnamed: 0,names,Relevant_Feature
0,GP,True
1,MIN,True
5,FG%,True
8,3P%,True
11,FT%,True
15,AST,True


<h5> <b>CONCLUSION ON PREDICTORS: </b> Based on the rebalanced training set from data_prep, 7 columns are the most relevant predictors. Therefore new data prep will be generated for usage of these columns within the model

<h4> 2.3 New data prep to be saved in the interim folder

<h5> 2.3.1 Get a list of column names to be removed from data set

In [11]:
columns_to_remove_df = predictors.loc[predictors['Relevant_Feature'] == 0]

In [12]:
columns_to_remove = columns_to_remove_df.pop("names")

In [13]:
columns_to_remove_list = columns_to_remove.values.tolist()

<h5> 2.3.2 Read csvs and remove unwanted columns

In [14]:
df = pd.read_csv('../data/raw/2022_train.csv')

In [15]:
df_cleaned = df.copy()
df_cleaned.drop(columns=columns_to_remove_list, axis=1, inplace=True)
df_cleaned.drop("Id", axis=1, inplace=True)

In [16]:
df_cleaned.head()

Unnamed: 0,GP,MIN,FG%,3P%,FT%,AST,TARGET_5Yrs
0,80,24.3,45.7,22.6,72.1,3.2,1
1,75,21.8,55.1,34.9,67.8,0.7,1
2,85,19.1,42.8,34.3,75.7,0.8,1
3,63,19.1,52.5,23.7,66.9,1.8,1
4,63,17.8,50.8,13.7,54.0,0.4,1


<h5> 2.3.3 Rebalance imbalance set with custom function

In [17]:
from src.data.sets import rebalance_mayority_class

In [18]:
df_rebalanced = df_cleaned.copy()

In [19]:
df_rebalanced = rebalance_mayority_class(df_cleaned, "TARGET_5Yrs")

<h5> 2.3.4 Scaling and generating training and validation set

In [20]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [21]:
target = df_rebalanced.pop("TARGET_5Yrs")

In [22]:
df_rebalanced_clean = scaler.fit_transform(df_rebalanced)

In [23]:
#Import subset function for getting training and valudation sets
from src.data.sets import subset_x_y

In [24]:
X_train, X_val, y_train, y_val = subset_x_y(df_rebalanced_clean, target)

<h5> 2.3.5 Clean, transform and scale Test set

In [25]:
df_test = pd.read_csv('../data/raw/2022_test.csv')

In [26]:
X_test_ID = df_test.pop('Id')

In [27]:
X_test = df_test.copy()

In [28]:
X_test.drop(columns=columns_to_remove_list, axis=1, inplace=True)

In [29]:
X_test = scaler.fit_transform(X_test)

<h5> 2.3.6 Save train, evaluate and test sets in ../data/interim folder

In [30]:
from src.data.sets import save_sets_interim

In [31]:
save_sets_interim(X_train, X_val, y_train, y_val, X_test, X_test_ID)

<h4> 3 Train Random Forest Classifier with new calculated training sets

In [32]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=44)
rf_model.fit(X_train, y_train)

In [33]:
#Save model in the models folder
from joblib import dump
dump(rf_model, '../models/random_forest_default.joblib')

['../models/random_forest_default.joblib']

<h4> 2.1 Generate predictions for training and validation sets in order to compare accuracy vs Baseline

In [34]:
y_trainpreds = rf_model.predict(X_train)
y_val_preds = rf_model.predict(X_val)

<h4> 2.2 Calculate mse and mae to assess fiting accuracy for the training and validation sets

In [35]:
#Get error/score metrics
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae

In [36]:
print(mse(y_train, y_trainpreds, squared=False))
print(mae(y_train, y_trainpreds))

0.17625001457415101
0.1560732738374824


In [37]:
print(mse(y_val, y_val_preds, squared=False))
print(mae(y_val, y_val_preds))

0.4627801620899658
0.41906191369606005


In [38]:
rf_model.score(X_train, y_train)

0.8757404123092672

In [39]:
rf_model.score(X_val, y_val)

0.14297305893621826

<h5> Results for training set show a perfect fit to the balanced data set, however, the results drop when evaluating the validation set 

<h4> 3 Analysis of the validation set

<h5> Confusion matrix is analysed to find out where the innacuracy exists in order to improve the model

In [40]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

<h5> Similar number of false positives and false negatives are predicted (49% for zeroes and 44% for ones) in the validation set

<h4> 4 Generate predictions on the test set for Kaggle submission

In [41]:
y_test_preds = rf_model.predict(X_test)

In [42]:
y_test_preds_list = y_test_preds.tolist()
target_prob = y_test_preds
#target_prob = [item[1] for item in y_test_preds_list]

In [43]:
#Create Data Frame for Doc printing
df =pd.DataFrame()

In [44]:
df['Id'] = X_test_ID
df['TARGET_5Yrs'] = target_prob

In [45]:
df.head()

Unnamed: 0,Id,TARGET_5Yrs
0,0,0.32
1,1,0.36
2,2,0.97
3,3,0.57
4,4,0.33


In [46]:
#Saving predictions into csv within ../data/external folder
df.to_csv('../data/external/Kaggle_submission_random_forest.csv', index=False)