# Pulling in data

Before we start, we need a data frame to work on, so I will load the contract data and create the per-game features I used in previous explorations. I will also bin the salary into categories to serve as my y value, since the data set has no ready-made class label and salary is a continuous target.

In [20]:
import pandas as pd
import numpy as np

data_frame = pd.read_csv("nba_contracts_history.csv")
data_frame["PPG"] = data_frame["PTS"]/(data_frame["GP"])
data_frame["APG"] = data_frame["AST"]/(data_frame["GP"])
data_frame["RPG"] = data_frame["REB"]/(data_frame["GP"])
data_frame.describe()

data_frame['AVG_SALARY'] = data_frame['AVG_SALARY']/1000000

data_frame['SALARY_CAT'] = pd.cut(data_frame.AVG_SALARY, [-np.inf, 4.999, 9.999, 14.999, 19.999, np.inf],
                              labels=['<5mil', '5-10mil', '10-15mil', '15-20mil', '20mil+'])
print(data_frame)

                NAME  CONTRACT_START  CONTRACT_END  AVG_SALARY   AGE    GP  \
0    Wesley Matthews            2019          2020    2.564753  32.0  69.0   
1        Brook Lopez            2015          2017   21.165675  27.0  72.0   
2     DeAndre Jordan            2011          2014   10.759764  22.0  80.0   
3    Markieff Morris            2015          2018    8.143323  25.0  82.0   
4      Dwight Howard            2018          2019   13.410739  32.0  81.0   
..               ...             ...           ...         ...   ...   ...   
194      Brook Lopez            2012          2014   14.693667  24.0   5.0   
195   Nikola Vucevic            2015          2018   12.000000  24.0  74.0   
196      Aron Baynes            2015          2017    5.766667  28.0  70.0   
197   Andre Iguodala            2013          2016   12.000000  29.0  80.0   
198   Draymond Green            2015          2019   16.400000  25.0  79.0   

        W     L     MIN     PTS  ...    AST    TOV    STL    BL
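
One thing to watch in the binning above: edges such as 4.999 leave a small gap, so a salary like 4.9995 lands in the next category up. A minimal sketch of one alternative, assuming half-open bins are acceptable, is to use whole-number edges with `right=False`. The salaries below are made up for illustration and are not rows from the data set.

```python
import numpy as np
import pandas as pd

labels = ['<5mil', '5-10mil', '10-15mil', '15-20mil', '20mil+']

# Made-up salaries in millions, chosen to sit on or near the bin edges
salaries = pd.Series([2.56, 4.9995, 5.0, 21.17])

# Edges as in the cell above: intervals are closed on the right, e.g. (4.999, 9.999]
original = pd.cut(salaries, [-np.inf, 4.999, 9.999, 14.999, 19.999, np.inf], labels=labels)

# Whole-number edges with right=False: intervals become [5, 10), [10, 15), ...
half_open = pd.cut(salaries, [-np.inf, 5, 10, 15, 20, np.inf], labels=labels, right=False)

print(pd.DataFrame({'salary': salaries, 'original': original, 'half_open': half_open}))
```

With the original edges the 4.9995 row is labelled `5-10mil`; with the half-open edges it is labelled `<5mil`.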

## Features

For my X features I will use the ones I created above: points per game, assists per game, and rebounds per game. These are better than raw totals because they average over games played instead of reporting a lump sum, so players with different numbers of games can be compared. They are also the stats commonly used when NBA awards are decided, which makes them a viable choice. However, instead of predicting awards I will be predicting players' salaries.

In [21]:
from sklearn.tree import DecisionTreeClassifier
X = data_frame[['PPG', 'APG', 'RPG']]
y = data_frame['SALARY_CAT']

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(X,y)

In [22]:
from sklearn.metrics import confusion_matrix
y_predicted = tree_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[41  0  0  0  0]
 [ 0 24  0  0  0]
 [ 0  0 32  0  0]
 [ 0  0  0 49  0]
 [ 0  0  0  0 53]]


#### As much as we want to feel good about this, we can't
The predictions above were made on the same rows the tree was trained on, so I am going to assume that what we are seeing here is overfitting.

In [23]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))

# For a multiclass target we have to specify how the per-class scores are averaged
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  1.0
Precision is  1.0
Sensitivity is  1.0
F1 is  1.0
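
Perfect scores on the training rows are what an unpruned decision tree produces almost by construction, so they say little about how well the model generalizes. As a rough illustration (random stand-in data, not the NBA contracts), the sketch below fits a tree to labels that have no relation to the features at all and still reaches a training accuracy of 1.0. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))      # three random features (stand-ins for PPG, APG, RPG)
y_demo = rng.integers(0, 5, size=200)   # five random classes, unrelated to the features

# Hold out a quarter of the rows that the tree never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))

print("Training accuracy:", train_acc)
print("Held-out accuracy:", test_acc)
```

The tree memorizes the training rows, while the held-out accuracy should sit near chance level for five classes.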


In [24]:
from sklearn.model_selection import KFold
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    
    cv_decision_tree = DecisionTreeClassifier()
    cv_decision_tree.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_decision_tree.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)

Cross validation accuracies are:  [0.525, 0.425, 0.45, 0.45, 0.41025641025641024]
Cross validation f1 scores  are:  [0.5226190476190476, 0.397948717948718, 0.4597759728194511, 0.4390801722071072, 0.3976856476856477]


### My assumption was correct
Cross-validated accuracy is only about 41% to 53%, compared with 100% on the training data, so the tree has a serious overfitting problem.
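
One common way to rein in an overfit tree is to limit its depth. The sketch below is only an illustration under assumptions: it uses synthetic blobs from `make_blobs` rather than the contract data, and `max_depth=3` is an arbitrary value, so it shows the mechanics and not what would happen on the NBA features. It also uses `cross_val_score`, which runs the same fold-by-fold loop as the manual `KFold` code above in a single call.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 200 rows, 3 features, 5 overlapping classes
X_demo, y_demo = make_blobs(
    n_samples=200, centers=5, n_features=3, cluster_std=3.0, random_state=0
)

# Same fold settings as the manual loop above
folds = KFold(n_splits=5, shuffle=True, random_state=145)

depth_scores = {}
for depth in (None, 3):  # None = grow the tree fully, 3 = shallow tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
    depth_scores[depth] = scores
    print("max_depth =", depth, "mean CV accuracy =", scores.mean())
```

Comparing the two means on the real `X` and `y` would show whether a shallower tree actually helps here.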

## Let's try SVM

In [25]:
from sklearn.svm import SVC
X = data_frame[['PPG', 'APG', 'RPG']]
y = data_frame['SALARY_CAT']

svm_classifier = SVC(kernel="linear")
svm_classifier.fit(X,y)

In [26]:
from sklearn.metrics import confusion_matrix
y_predicted = svm_classifier.predict(X)
matrix = confusion_matrix(y, y_predicted)
print(matrix)

[[25  0  7  7  2]
 [14  2  6  2  0]
 [ 6  1 25  0  0]
 [ 9  0  1 24 15]
 [ 2  0  0 11 40]]


In [27]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print ("Accuracy is ", accuracy_score(y, y_predicted))

# For a multiclass target we have to specify how the per-class scores are averaged
print ("Precision is ", precision_score(y, y_predicted, average="weighted"))
print ("Sensitivity is ", recall_score(y, y_predicted, average="weighted"))
print ("F1 is ", f1_score(y, y_predicted, average="weighted"))

Accuracy is  0.5829145728643216
Precision is  0.5966665684663569
Sensitivity is  0.5829145728643216
F1 is  0.5580932892855155
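
To see where these numbers come from, the scores can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes. It also shows why the weighted sensitivity is identical to the accuracy: weighting each class's recall by its share of the rows just adds up the diagonal.

```python
import numpy as np

# Confusion matrix copied from the SVM output above (rows = true, columns = predicted)
cm = np.array([
    [25, 0,  7,  7,  2],
    [14, 2,  6,  2,  0],
    [ 6, 1, 25,  0,  0],
    [ 9, 0,  1, 24, 15],
    [ 2, 0,  0, 11, 40],
])

correct = np.diag(cm)              # correctly classified rows per class
support = cm.sum(axis=1)           # number of true rows per class
predicted_totals = cm.sum(axis=0)  # number of predictions per class

accuracy = correct.sum() / cm.sum()
recall_per_class = correct / support
precision_per_class = correct / predicted_totals

# "weighted" averaging: each class counts in proportion to its support
weights = support / support.sum()
weighted_recall = (weights * recall_per_class).sum()
weighted_precision = (weights * precision_per_class).sum()

print("Accuracy:", accuracy)
print("Weighted recall:", weighted_recall)
print("Weighted precision:", weighted_precision)
print("Recall per class:", recall_per_class.round(3))
```

The per-class recalls make the weak spot visible: the class in the second row is recovered only 2 times out of 24.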


This is nowhere near the perfect training scores the decision tree produced: the linear SVM gets only about 58% of its own training data right, which does not give me much confidence moving forward.

In [28]:
from sklearn.model_selection import KFold
validation_accuracy = []
validation_f1 =[]
fold_and_validate = KFold(n_splits=5, shuffle=True, random_state=145)
for train_set_indices, validation_set_indices in fold_and_validate.split(X):
    cv_train_set = X.iloc[train_set_indices]
    cv_train_target = y.iloc[train_set_indices]
    
    # Note: SVC() uses the default RBF kernel, not the linear kernel fitted above
    cv_svc = SVC()
    cv_svc.fit(cv_train_set, cv_train_target)
    
    cv_xvalidation = X.iloc[validation_set_indices]
    cv_y_true = y.iloc[validation_set_indices]
    cv_y_predicted = cv_svc.predict(cv_xvalidation)
    
    cv_accuracy_score = accuracy_score(cv_y_true, cv_y_predicted)
    cv_f1_score = f1_score(cv_y_true, cv_y_predicted,  average="weighted")
    validation_accuracy.append(cv_accuracy_score)
    validation_f1.append(cv_f1_score)
    
print("Cross validation accuracies are: ", validation_accuracy)
print("Cross validation f1 scores  are: ", validation_f1)

Cross validation accuracies are:  [0.525, 0.425, 0.425, 0.55, 0.6153846153846154]
Cross validation f1 scores  are:  [0.5251633986928106, 0.37723214285714285, 0.3961927546138072, 0.49602813852813854, 0.5911780527165142]


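All hope is not lost: the cross-validated accuracies average about 0.51, not far below the 0.58 training accuracy, so the SVM overfits far less than the decision tree did, and two of the five folds clearly beat the tree's cross-validated scores.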
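
One caveat on that comparison: the model scored on the training data used `kernel="linear"`, while the cross-validation loop builds `SVC()` with the default RBF kernel, so the two sets of scores are not like for like. A minimal sketch of a fairer comparison is below. It assumes synthetic data from `make_blobs` in place of the contract features, and it adds a `StandardScaler` step, which is my own addition and not part of the original analysis.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 200 rows, 3 features, 5 overlapping classes
X_demo, y_demo = make_blobs(
    n_samples=200, centers=5, n_features=3, cluster_std=3.0, random_state=0
)

# Same fold settings as the loops above, reused for both kernels
folds = KFold(n_splits=5, shuffle=True, random_state=145)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    # Scaling happens inside the pipeline, so it is refit on each training fold
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(
        model, X_demo, y_demo, cv=folds, scoring="accuracy"
    )
    print(kernel, "mean CV accuracy:", kernel_scores[kernel].mean())
```

Swapping in the real `X` and `y` would show whether the kernel choice explains any of the difference between the training and cross-validation scores.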

In [30]:
from sklearn.cluster import KMeans
kmeans_classifier = KMeans(n_clusters=4, n_init=20)
svm_classifier.fit(X, y)
print('Intercepts are:', svm_classifier.intercept_)
print('Hyperplane coefficients are:', svm_classifier.coef_)


Intercepts are: [ 2.96883264  6.24364851 -4.30117981 -3.89361506  4.34623826 -4.70769352
 -7.3602136  -8.66200943 -7.07668221 -3.37892276]
Hyperplane coefficients are: [[-1.38137837e-01 -2.11803087e-02 -1.22269294e-02]
 [-2.47113326e-01 -4.58483104e-02 -2.91518868e-01]
 [ 2.04828829e-01  4.91714829e-02  3.78081212e-01]
 [ 2.66819617e-01 -4.57571040e-04  2.83993439e-01]
 [-1.34984680e-01  3.53202151e-03 -3.25460181e-01]
 [ 2.06077991e-01  6.21041077e-02  3.31934625e-01]
 [ 4.46152477e-01 -1.01634884e-01  5.78931151e-01]
 [ 3.59568548e-01  1.94170473e-01  4.51417971e-01]
 [ 4.58747808e-01  1.51472280e-02  1.85484560e-01]
 [ 3.04446535e-01 -5.66564014e-02  3.50995165e-01]]
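
There are ten intercepts and ten coefficient rows, not five, because `SVC` handles multiclass problems one-vs-one: it fits one hyperplane for every pair of classes, and five salary categories give 5 * 4 / 2 = 10 pairs. Each row has three coefficients, one per feature. The sketch below checks those shapes on synthetic five-class data; `X_demo`, `y_demo`, and `svm_demo` are illustrative names.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic stand-in with 5 classes and 3 features, like the salary problem
X_demo, y_demo = make_blobs(n_samples=100, centers=5, n_features=3, random_state=0)

svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

n_classes = len(svm_demo.classes_)
n_pairs = n_classes * (n_classes - 1) // 2   # one hyperplane per pair of classes

print("Classes:", n_classes, "Pairs:", n_pairs)
print("coef_ shape:", svm_demo.coef_.shape)
print("intercept_ shape:", svm_demo.intercept_.shape)
```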


In [31]:
new_instance = [[]]
class_out = kmeans_classifier.predict(new_instance)
print('The predicted class:', class_out)

NotFittedError: This KMeans instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
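
The error has two causes. First, `kmeans_classifier` was created in the previous cell but never fitted; the model that was actually fitted there is `svm_classifier`. K-means is also an unsupervised clustering method, so even a fitted instance would return cluster indices and not salary categories. Second, `new_instance = [[]]` contains no feature values, whereas the model expects one value each for PPG, APG, and RPG. A minimal sketch of the prediction step that was probably intended is below. The training rows and the new player's stats are made up purely so the example runs on its own.

```python
import pandas as pd
from sklearn.svm import SVC

# Made-up per-game stats and salary categories, only to make the sketch runnable
train = pd.DataFrame({
    "PPG": [4.0, 6.5, 12.0, 14.5, 22.0, 27.5],
    "APG": [1.0, 1.5, 3.0, 4.0, 6.0, 8.0],
    "RPG": [2.0, 3.0, 5.0, 6.0, 8.0, 9.5],
    "SALARY_CAT": ["<5mil", "<5mil", "10-15mil", "10-15mil", "20mil+", "20mil+"],
})
features = ["PPG", "APG", "RPG"]

svm_demo = SVC(kernel="linear").fit(train[features], train["SALARY_CAT"])

# One hypothetical player: a single row with a value for each of the three features
new_instance = pd.DataFrame([[18.0, 5.0, 7.0]], columns=features)
class_out = svm_demo.predict(new_instance)
print("The predicted class:", class_out)
```

In the notebook itself, the equivalent change would be to call `svm_classifier.predict` on a row with three feature values.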