## 1.0 Setup

In [1]:
# Import NumPy, pandas, and the scikit-learn components used in this notebook
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

np.random.seed(1)

## 2.0 Load the data

In [2]:
X_train = pd.read_csv("./train_X.csv")
X_test = pd.read_csv("./test_X.csv")
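# Flatten the single-column target frames to 1-D arrays, the shape scikit-learn expects for y
y_train = pd.read_csv("./train_y.csv").values.ravel()
y_test = pd.read_csv("./test_y.csv").values.ravel()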

### For predicting whether an NBA player will play for at least 5 years, precision is the more relevant metric.

**Precision** is the proportion of true positives among all predicted positives.
In our case, a true positive is a player who is predicted to play for at least five years and actually does, whereas a false positive is a player who is predicted to play for at least five years but leaves the league earlier, for example by retiring or sustaining a career-ending injury.

I chose precision because, in this situation, it is more important to avoid false positives (predicting that a player will play for 5 years when they won't) than to avoid false negatives (predicting that a player won't play for 5 years when they will).

False positives can lead to significant financial losses for NBA teams that invest in players who then retire early or suffer career-ending injuries. A false negative, by contrast, means that a player predicted to play for fewer than 5 years ends up playing longer, which is not a direct financial loss for the team.

Therefore, precision is the more relevant metric for evaluating a model that predicts whether NBA players will play for 5 years.
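To make the trade-off concrete, here is a minimal sketch that computes precision and recall for two imaginary models. The counts are made up for illustration and are not taken from this dataset, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from raw prediction counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only (not from the NBA data).
# A cautious model flags few players, so it makes few false positives.
cautious = precision_recall(tp=60, fp=10, fn=40)   # precision 60/70, recall 60/100
# An eager model flags many players, so it misses few but is often wrong.
eager = precision_recall(tp=90, fp=60, fn=10)      # precision 90/150, recall 90/100

print("cautious:", cautious)
print("eager:   ", eager)
```

Under the cost argument above, the cautious model would be preferred even though it finds fewer of the long-career players.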

## 3.0 Model the data
First, we will create a dataframe to hold all the results of our models.

In [3]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

## 3.1.1 Logistic regression using random search

In [4]:
score_measure = "precision"
kfolds = 5

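param_grid = {
    'max_iter':np.arange(500,1000),
    # Caution: 'None' (capitalised string) is not a valid penalty, and liblinear
    # does not support 'elasticnet', so many sampled candidates fail to fit
    # (see the warnings in the output below).
    'penalty': ['None','l1','l2','elasticnet'],
    'solver':['saga','liblinear']
}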

log_reg_model = LogisticRegression()
rand_search = RandomizedSearchCV(estimator = log_reg_model, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best precision score is 0.7033599200498099
... with parameters: {'solver': 'liblinear', 'penalty': 'l2', 'max_iter': 601}


1255 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
645 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 441, in _check_solver
    raise ValueError(
ValueError: Logistic Regression supports only penalties in ['l1', 'l2', 'elasticnet', 'none'], got None.
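
The traceback above shows that every candidate with the penalty `'None'` was rejected. One possible way to avoid wasted fits is to describe the search space as a list of dictionaries, one per solver, so that only compatible solver/penalty pairs can be sampled. The sketch below is an assumption about how this could be set up, not the code that produced the results in this notebook. The `COMPATIBLE` table and the `build_param_distributions` helper are hypothetical names, and the table is deliberately restricted to `'l1'` and `'l2'`.

```python
# Penalties each solver accepts, limited to options that need no extra
# arguments. Assumptions: 'elasticnet' is left out because it also requires
# l1_ratio, and the no-penalty option is left out because its spelling
# differs between scikit-learn versions.
COMPATIBLE = {
    "liblinear": ["l1", "l2"],
    "saga": ["l1", "l2"],
}


def build_param_distributions(max_iters):
    """Return one search space per solver so only valid pairs are sampled."""
    return [
        {"solver": [solver], "penalty": penalties, "max_iter": list(max_iters)}
        for solver, penalties in COMPATIBLE.items()
    ]


param_distributions = build_param_distributions(range(500, 1000))
for space in param_distributions:
    print(space["solver"], space["penalty"], len(space["max_iter"]))
```

A list like this can be passed as `param_distributions=` to `RandomizedSearchCV`, which accepts either a single dictionary or a list of dictionaries.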


## 3.1.2 Logistic regression using grid search 

In [5]:
score_measure = "precision"
kfolds = 5
max_iter = rand_search.best_params_['max_iter']
penalty = rand_search.best_params_['penalty']
solver = rand_search.best_params_['solver']

param_grid = {
    'max_iter': np.arange(max_iter-5,max_iter+5),  
    'penalty': [penalty],
    'solver': [solver]
}

logistic_reg_model = LogisticRegression()
grid_search = GridSearchCV(estimator = logistic_reg_model, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionLogistic = grid_search.best_estimator_



Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best precision score is 0.7033599200498099
... with parameters: {'max_iter': 596, 'penalty': 'l2', 'solver': 'liblinear'}


In [6]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
log_reg_bm=grid_search
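
The same four formulas are repeated for each model in this notebook. As a sketch, they could be collected in one helper. The name `metrics_from_confusion` is hypothetical, the example counts are illustrative, and the layout `[[TN, FP], [FN, TP]]` is the one `confusion_matrix` returns for binary labels 0 and 1.

```python
def metrics_from_confusion(c_matrix):
    """Compute metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = c_matrix
    total = tn + fp + fn + tp
    return {
        "Accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted.
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Illustrative counts only (not the notebook's confusion matrix).
example = [[50, 10], [20, 120]]
scores = metrics_from_confusion(example)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The function unpacks rows directly, so it should work with a nested list or with the NumPy array returned by `confusion_matrix`.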

## 3.2.1 Decision Tree using random search

In [7]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(1,100),  # caution: 1 is invalid (must be >= 2), so candidates that sample it fail
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 100), 
    'max_depth': np.arange(1,20), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=70,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestPrecisionTree = rand_search.best_estimator_

Fitting 5 folds for each of 70 candidates, totalling 350 fits
The best precision score is 0.7422923937663772
... with parameters: {'min_samples_split': 4, 'min_samples_leaf': 5, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 58, 'max_depth': 16, 'criterion': 'gini'}


5 fits failed out of a total of 350.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 937, in fit
    super().fit(
  File "C:\Users\risha\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 250, in fit
    raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1


## 3.2.2 Decision Tree using grid search

In [8]:
score_measure = "precision"
kfolds = 5
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']

param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionTree = grid_search.best_estimator_

Fitting 5 folds for each of 1024 candidates, totalling 5120 fits
The best precision score is 0.7586689011760931
... with parameters: {'criterion': 'gini', 'max_depth': 17, 'max_leaf_nodes': 57, 'min_impurity_decrease': 5e-05, 'min_samples_leaf': 3, 'min_samples_split': 3}
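
The refined grid is built by stepping two below and one above each value found by the random search. If the random search had returned a value near a parameter's lower limit, the same arithmetic could produce invalid candidates (for example `min_samples_split` below 2). The sketch below shows one way to clamp the range. The `neighborhood` helper is hypothetical, and the lower bounds are the documented minimums for these `DecisionTreeClassifier` parameters.

```python
from math import prod


def neighborhood(center, radius, lower=None):
    """Integers in [center - radius, center + radius), clamped below at `lower`."""
    start = center - radius
    if lower is not None:
        start = max(start, lower)
    return list(range(start, center + radius))


# Centres are the values reported by the random search above.
grid = {
    "min_samples_split": neighborhood(4, 2, lower=2),
    "min_samples_leaf": neighborhood(5, 2, lower=1),
    "max_leaf_nodes": neighborhood(58, 2, lower=2),
    "max_depth": neighborhood(16, 2, lower=1),
}

# Sanity check against the log: four integer parameters with 4 values each,
# times 4 values of min_impurity_decrease and 1 criterion.
n_int_candidates = prod(len(values) for values in grid.values())
total_candidates = n_int_candidates * 4 * 1
print(total_candidates)
```

With these centres no clamping is triggered, so the count agrees with the 1024 candidates reported in the output.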


In [9]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## 3.3.1 SVM using random search

In [10]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'C': np.arange(1,25),   
    'gamma': ['scale','auto'],
    'kernel':['linear','rbf','poly']
}

svm_model = SVC()
rand_search = RandomizedSearchCV(estimator = svm_model, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 144 candidates, totalling 720 fits




The best precision score is 0.7779774253075223
... with parameters: {'kernel': 'rbf', 'gamma': 'auto', 'C': 15}
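
The log reports 144 candidates even though `n_iter=500` was requested. When every parameter is given as a finite list, `RandomizedSearchCV` samples without replacement, so it cannot try more combinations than the grid contains. The arithmetic can be checked with a short sketch; the variable names here are illustrative.

```python
from math import prod

# Same search space as the cell above, written with plain Python lists.
search_space = {
    "C": list(range(1, 25)),              # mirrors np.arange(1, 25): 24 values
    "gamma": ["scale", "auto"],           # 2 values
    "kernel": ["linear", "rbf", "poly"],  # 3 values
}
n_iter_requested = 500

grid_size = prod(len(values) for values in search_space.values())
n_candidates = min(n_iter_requested, grid_size)
print(grid_size, n_candidates)
```

In other words, this random search effectively behaved like an exhaustive grid search over 24 x 2 x 3 combinations.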




## 3.3.2 SVM using grid search

In [11]:
score_measure = "precision"
kfolds = 5

C = rand_search.best_params_['C']
gamma = rand_search.best_params_['gamma']
kernel = rand_search.best_params_['kernel']

param_grid = {
    'C': np.arange(C-2,C+2),  
    'gamma': [gamma],
    'kernel': [kernel]
    
}

svm_model = SVC()
grid_search = GridSearchCV(estimator = svm_model, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionSVM = grid_search.best_estimator_

Fitting 5 folds for each of 4 candidates, totalling 20 fits
The best precision score is 0.7779774253075223
... with parameters: {'C': 15, 'gamma': 'auto', 'kernel': 'rbf'}




In [12]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"SVM", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## 3.4 Evaluating the performance matrix

In [13]:
performance

|   | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.669154 | 0.778325 | 0.642276 | 0.703786 |
| 0 | Decision Tree | 0.651741 | 0.726496 | 0.691057 | 0.708333 |
| 0 | SVM | 0.619403 | 0.71831 | 0.621951 | 0.666667 |
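
As a quick cross-check of the table, the sketch below picks the top model for each metric. The values are copied from the performance table above; the `best_by` helper is a hypothetical name.

```python
# Values copied from the performance table above.
rows = [
    {"model": "Logistic Regression", "Accuracy": 0.669154,
     "Precision": 0.778325, "Recall": 0.642276, "F1": 0.703786},
    {"model": "Decision Tree", "Accuracy": 0.651741,
     "Precision": 0.726496, "Recall": 0.691057, "F1": 0.708333},
    {"model": "SVM", "Accuracy": 0.619403,
     "Precision": 0.71831, "Recall": 0.621951, "F1": 0.666667},
]


def best_by(metric):
    """Return the name of the model with the highest value for `metric`."""
    return max(rows, key=lambda row: row[metric])["model"]


for metric in ("Accuracy", "Precision", "Recall", "F1"):
    print(f"{metric}: {best_by(metric)}")
```

Logistic regression leads on precision and accuracy, while the decision tree leads on recall and F1.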


In [None]:
# Save the best model for future use.
# Here the best model by precision is logistic regression.
import pickle

best_model = log_reg_bm
with open('LogisticRegression.pkl', 'wb') as f:
    pickle.dump(best_model, f)
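
For later use, the saved file has to be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted model, so that it runs without the training data. The stand-in object and the temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted search object (assumption: any picklable object
# follows the same save/load round trip).
best_model = {"name": "Logistic Regression", "max_iter": 596}

# Write to a temporary directory so the example leaves the working folder alone.
path = os.path.join(tempfile.mkdtemp(), "LogisticRegression.pkl")
with open(path, "wb") as f:
    pickle.dump(best_model, f)

# Reload the object exactly as it was saved.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == best_model)
```

When the real model is reloaded, it is safest to use the same scikit-learn version that created the file, and to unpickle only files from a trusted source.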

## 4.0 Conclusion

Predicting the career length of an NBA player is very important for NBA teams, because an effective predictive model helps them make more informed decisions when signing players. Teams can then spend their limited resources on players who are likely to have longer careers and to contribute for a longer period of time, which ultimately makes for a more profitable deal.

Of all the models, the best performer on the metric *Precision* is **Logistic Regression**, with a precision of **0.778** on the test set. It also has the highest accuracy **(0.669)**, while its recall **(0.642)** and F1 score **(0.704)** are slightly below those of the decision tree (0.691 and 0.708). A precision of 0.778 means that 77.8% of the players the model predicts will play for at least 5 years actually do so.
 


