# Pulling in data

Before we start, we need a data frame to work on, so I will load the data and create some features as I have in previous explorations. I will also create categories for my y value, since none is readily available and salary is a continuous target.

In [2]:
import pandas as pd
import numpy as np

data_frame = pd.read_csv("nba_contracts_history.csv")
data_frame["PPG"] = data_frame["PTS"]/(data_frame["GP"])
data_frame["APG"] = data_frame["AST"]/(data_frame["GP"])
data_frame["RPG"] = data_frame["REB"]/(data_frame["GP"])
data_frame.describe()

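# Bin the continuous salary into three classes (upper edges are inclusive):
# up to $11.0M is low, up to $16.4M is medium, anything above is high
data_frame['SALARY_CAT'] = pd.cut(data_frame.AVG_SALARY, [-np.inf, 1.10e+07, 1.64e+07, np.inf],
                              labels=['low', 'medium', 'high'])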
print(data_frame)

                NAME  CONTRACT_START  CONTRACT_END    AVG_SALARY   AGE    GP  \
0    Wesley Matthews            2019          2020  2.564753e+06  32.0  69.0   
1        Brook Lopez            2015          2017  2.116568e+07  27.0  72.0   
2     DeAndre Jordan            2011          2014  1.075976e+07  22.0  80.0   
3    Markieff Morris            2015          2018  8.143324e+06  25.0  82.0   
4      Dwight Howard            2018          2019  1.341074e+07  32.0  81.0   
..               ...             ...           ...           ...   ...   ...   
194      Brook Lopez            2012          2014  1.469367e+07  24.0   5.0   
195   Nikola Vucevic            2015          2018  1.200000e+07  24.0  74.0   
196      Aron Baynes            2015          2017  5.766667e+06  28.0  70.0   
197   Andre Iguodala            2013          2016  1.200000e+07  29.0  80.0   
198   Draymond Green            2015          2019  1.640000e+07  25.0  79.0   

        W     L     MIN     PTS  ...   
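
To make the binning explicit, here is a minimal sketch of how `pd.cut` assigns the three labels. The salary values are made up for illustration; only the bin edges and labels come from the cell above. Bins are right-inclusive by default, so a salary sitting exactly on an edge falls into the lower category.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries chosen to sit on and around the bin edges
toy_salaries = pd.Series([5.0e06, 1.10e07, 1.20e07, 1.64e07, 2.0e07])

# Same edges and labels as in the cell above
toy_categories = pd.cut(
    toy_salaries,
    [-np.inf, 1.10e07, 1.64e07, np.inf],
    labels=["low", "medium", "high"],
)
print(toy_categories.tolist())
```

This is why a contract worth exactly 1.64e+07 ends up in the medium class rather than the high one.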



## Features

For my X features I will use the ones I created above: points per game, assists per game, and rebounds per game. These are better than raw totals because they average over games played instead of reporting a lump sum. They are also the stats the NBA uses to determine its awards, which makes them a viable choice. Instead of predicting awards, however, I will be predicting players' salaries.

In [3]:
from sklearn.tree import DecisionTreeClassifier
X = data_frame[['PPG', 'APG', 'RPG']]
y = data_frame['SALARY_CAT']

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

In [4]:
from sklearn.metrics import confusion_matrix
y_predicted = tree_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[ 48   0   0]
 [  0 112   0]
 [  0   0  39]]
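
One thing to keep in mind when reading this matrix: as far as I know, when no label order is given `confusion_matrix` sorts the labels, so with string labels the rows and columns come out alphabetically (high, low, medium) rather than in the low, medium, high order used for the bins. Here is a minimal sketch with made-up labels that passes `labels=` to fix the order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted classes, only to illustrate the row/column order
y_true_demo = ["low", "low", "medium", "high", "high", "high"]
y_pred_demo = ["low", "medium", "medium", "high", "high", "low"]

# Default: labels are sorted, so the order is high, low, medium
default_matrix = confusion_matrix(y_true_demo, y_pred_demo)

# Explicit: rows and columns follow the order we pass in
ordered_matrix = confusion_matrix(
    y_true_demo, y_pred_demo, labels=["low", "medium", "high"]
)

print(default_matrix)
print(ordered_matrix)
```

Rows are the true classes and columns are the predicted classes in both versions; only the ordering changes.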


#### As much as we want to feel good about this, we can't
A perfect confusion matrix on the same data the tree was trained on most likely means the model has overfit, so I am going to assume that is what we are seeing here.

In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))

# For a multiclass target we have to specify how the per-class scores are averaged
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  1.0
Precision is  1.0
Sensitivity is  1.0
F1 is  1.0


In [7]:
from sklearn.model_selection import KFold
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    
    cv_decision_tree = DecisionTreeClassifier()
    cv_decision_tree.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_decision_tree.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)

Cross validation accuracies are:  [0.65, 0.625, 0.7, 0.675, 0.7692307692307693]
Cross validation f1 scores  are:  [0.6444017094017094, 0.6141491841491842, 0.7099378881987578, 0.6248459383753502, 0.7782957782957783]
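
The manual loop above can also be written more compactly. This is a sketch using scikit-learn's `cross_validate` with the same `KFold` settings; since the CSV is not available here, `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification`, so the scores will not match the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three per-game features and three salary classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Same fold settings as the manual loop
folds = KFold(n_splits=5, shuffle=True, random_state=145)

# One call fits a fresh tree per fold and scores it on the held-out part
cv_results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    cv=folds,
    scoring=["accuracy", "f1_weighted"],
)
print("Cross validation accuracies are: ", cv_results["test_accuracy"])
print("Cross validation f1 scores  are: ", cv_results["test_f1_weighted"])
```

With the real data, passing `X` and `y` in place of the synthetic arrays should give one accuracy and one weighted F1 per fold, just like the loop.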


### My assumption was correct
Cross-validated accuracy ranges from about 0.63 to 0.77, far below the 100% accuracy we saw on the training data, so we have a serious overfitting issue.
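
One common way to look at this gap is to compare training accuracy with cross-validated accuracy while limiting how deep the tree can grow. The sketch below is only an illustration under assumptions: the data is synthetic (`make_classification`), and `max_depth=3` is an arbitrary value I picked, not a tuned setting for the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y) to mimic a hard target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)

depth_report = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Accuracy on the same rows the tree was fitted on
    train_accuracy = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Mean accuracy over 5 held-out folds (the estimator is cloned per fold)
    cv_accuracy = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    depth_report[depth] = (train_accuracy, cv_accuracy)
    print("max_depth =", depth, "train:", train_accuracy, "cv:", cv_accuracy)
```

A large difference between the two numbers for a given depth is the signature of overfitting; whether a shallower tree actually helps on the real data would have to be checked there.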

## Let's try SVM

In [8]:
from sklearn.svm import SVC
X = data_frame[['PPG', 'APG', 'RPG']]
y = data_frame['SALARY_CAT']

svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X,y)

In [9]:
from sklearn.metrics import confusion_matrix
y_predicted = svm_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[ 33   6   9]
 [  3 103   6]
 [ 11   7  21]]


In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))

# For a multiclass target we have to specify how the per-class scores are averaged
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  0.7889447236180904
Precision is  0.7834191131740877
Sensitivity is  0.7889447236180904
F1 is  0.785830908930618
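
Notice that sensitivity and accuracy are identical in the output above. That is expected with `average="weighted"`: each class's recall is weighted by its share of the true labels, so the weighted sum collapses to correct predictions divided by total predictions, which is accuracy. A small sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 3 low, 2 medium, 1 high
y_true_demo = ["low", "low", "low", "medium", "medium", "high"]
y_pred_demo = ["low", "low", "medium", "medium", "high", "high"]

accuracy = accuracy_score(y_true_demo, y_pred_demo)
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")

# Per-class recalls are 2/3, 1/2 and 1/1 with weights 3/6, 2/6 and 1/6,
# which sums to 4/6, the same as 4 correct predictions out of 6
print("Accuracy is ", accuracy)
print("Weighted sensitivity is ", weighted_recall)
```

So the weighted sensitivity adds no information beyond accuracy here; `average="macro"` would treat the three classes equally instead.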


This is well below the perfect training scores we got from the decision tree: the linear SVM reaches only about 79% accuracy on the data it was trained on, which does not give me much confidence moving forward.

In [13]:
from sklearn.model_selection import KFold
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]

    # Note: SVC() defaults to the RBF kernel, unlike the linear kernel fitted above
    cv_svc = SVC()
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)

Cross validation accuracies are:  [0.675, 0.75, 0.725, 0.725, 0.8461538461538461]
Cross validation f1 scores  are:  [0.6600081833060556, 0.7476491405460062, 0.7474780701754385, 0.675715157858015, 0.8461538461538461]


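All hope is not lost: the cross-validated accuracies (about 0.68 to 0.85) sit close to the 0.79 training accuracy, and one fold even exceeds it, so the SVM generalizes much better than the decision tree did. Keep in mind that the cross-validation loop used SVC() with its default RBF kernel rather than the linear kernel fitted earlier, so the two sets of numbers are not strictly comparable.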
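
For a like-for-like comparison, each kernel can be cross-validated with the same folds. The sketch below assumes synthetic stand-in data from `make_classification` (the names `X_demo`, `y_demo` and `kernel_scores` are mine), so it only shows the mechanics and says nothing about which kernel suits the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the three per-game features and three salary classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Same fold settings as the loops above, reused for both kernels
folds = KFold(n_splits=5, shuffle=True, random_state=145)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=folds)
    kernel_scores[kernel] = scores.mean()
    print(kernel, "mean cross validation accuracy:", kernel_scores[kernel])
```

Running the same loop on `X` and `y` would show whether the linear kernel holds up under cross validation as well as the default one did.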