## Narrowing down the features
We need to determine which features seem most relevant for predicting whether a player will be waived or traded.


Summary as of 11/12 at 1:30 pm: Based on correlation info, basic knowledge of basketball, and the preliminary trials with the original data, I came up with three different classes of features to train a model on. These are:

- The correlation features (the features that have the highest correlation with being traded or waived by the end of next season) (abbreviated CORR)
- The intuitive features (the features that seem to me, subjectively, to be the ones most likely to predict player movement) (INTUIT)
- The basic features (the in-game statistics that we ran some preliminary models on) (BASE)

Since being waived (and perhaps being traded as well) is correlated with the number of minutes played, all in-game stats that are not percentages are rescaled to be per minute (e.g., points per minute).
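As a small illustration of the per-minute rescaling, here is a sketch on a made-up two-player table. The names `toy` and `counting_stats` and all the numbers are invented for the example; only the column names `MIN`, `PTS`, and `FG_PCT` come from the real data.

```python
import pandas as pd

# Two hypothetical players: a starter and a bench player with the same scoring rate
toy = pd.DataFrame({
    "NAME": ["Starter", "Bench"],
    "MIN": [2000.0, 400.0],
    "PTS": [1000.0, 200.0],
    "FG_PCT": [0.48, 0.41],
})

# Only counting stats are rescaled; percentages such as FG_PCT are left alone
counting_stats = ["PTS"]

# Drop zero-minute rows first so the division is always defined
toy = toy[toy["MIN"] > 0].copy()
toy[counting_stats] = toy[counting_stats].div(toy["MIN"], axis=0)

print(toy)
```

The raw point totals differ by a factor of five, but both players end up at 0.5 points per minute, which is the kind of difference in playing time that the rescaling is meant to remove.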

I've tried two prediction targets: whether the player is waived or traded by the end of next season (END_NEXT), and whether the player is waived (but not necessarily traded) by the start of next season (START_NEXT). With three feature sets and two targets there are six attempts, and in none of them is the F1 score particularly high, as the table of F1 scores below shows, though the accuracy increases quite a bit with the shorter time horizon of START_NEXT. These scores are all for KNN classifiers; I tried logistic regression too, but in all cases the F1 score was worse.
<table style="width:50%">
  <tr>
    <th>Feature set</th>
    <th>END_NEXT</th>
    <th>START_NEXT</th>
  </tr>
  <tr>
    <td><b>CORR</b></td>
    <td>0.46</td>
    <td>0.34</td>
  </tr>
  <tr>
    <td><b>INTUIT</b></td>
    <td>0.46</td>
    <td>0.30</td>
  </tr>
  <tr>
    <td><b>BASE</b></td>
    <td>0.46</td>
    <td>0.17</td>
  </tr>
</table>

The shorter START_NEXT time horizon gives a higher accuracy but a lower F1 score, because the START_NEXT models return lots of false negatives. In all cases, however, the F1 score is better than just guessing randomly in proportion to the sizes of the two classes.
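To make the random-guessing comparison concrete, here is a sketch of what F1 score to expect from a guesser that labels each player positive independently with probability `p`, where `p` is also the positive rate. Under that assumption precision and recall both equal `p`, so F1 equals `p` as well. The function names and the simulation are illustrative only; the two rates are the rounded class proportions computed at the end of this notebook.

```python
import numpy as np

def random_guess_f1(p):
    """Expected F1 of a guesser that predicts positive with probability p
    when the true positive rate is also p."""
    precision = p  # a fraction p of the predicted positives are truly positive
    recall = p     # a fraction p of the true positives are predicted positive
    return 2 * precision * recall / (precision + recall)

def simulated_f1(p, n=200_000, seed=0):
    """Monte Carlo check of the same quantity on n simulated players."""
    rng = np.random.default_rng(seed)
    y_true = rng.random(n) < p
    y_pred = rng.random(n) < p
    tp = np.sum(y_true & y_pred)
    # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (#predicted positive + #actual positive)
    return 2 * tp / (y_pred.sum() + y_true.sum())

for name, p in [("END_NEXT", 0.417), ("START_NEXT", 0.133)]:
    print(name, "analytic:", round(random_guess_f1(p), 3),
          "simulated:", round(float(simulated_f1(p)), 3))
```

Read this way, the baselines to beat are roughly 0.42 for END_NEXT and 0.13 for START_NEXT.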


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [2]:
import os
print(os.path.exists("Data/merged_data/merged_data_collapsed_teams.csv"))

True


In [3]:
print(os.getcwd())

C:\Users\jandr\OneDrive\Documents\GitHub\predicting_nba_transactions


In [4]:
player_data = pd.read_csv("Data/merged_data/merged_data_collapsed_teams.csv")

In [5]:
player_data.sample()

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,WAIVED_NEXT_NEXT_OFF,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT
1241,Clar. Weatherspoon,221,1997,"['PHI', 'GSW']",27.0,6,PF,79,49.0,2325.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


We ultimately want to decide which features best predict whether a player will be on a different team by the end of next season, based on his performance and salary in the current season. Since some salary data is missing, let's first drop all rows for which the salary entry is blank.

In [6]:
player_data = player_data.dropna(subset=['Salary'])

In [7]:
player_data.sample()

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,WAIVED_NEXT_NEXT_OFF,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT
14642,Jaylen Adams,1629121,2018,['ATL'],23.0,1,PG,34,1.0,428.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


To simplify things for the moment, let's look at a combined feature that indicates if a player is traded, waived, or released by the end of next season.

In [8]:
player_data['MOVED_BY_END_OF_NEXT_SEASON'] = player_data[['WAIVED_NBA_YEAR', 'WAIVED_NEXT_NBA_YEAR', 'RELEASED_NBA_YEAR', 'RELEASED_NEXT_NBA_YEAR', 'TRADED_NBA_YEAR', 'TRADED_NEXT_NBA_YEAR']].any(axis=1).astype(int)

In [9]:
player_data.sample(6)

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT,MOVED_BY_END_OF_NEXT_SEASON
7991,Beno Udrih,2757,2010,['SAC'],28.0,7,PG,79,64.0,2734.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1
4602,Kelvin Cato,1509,1999,['HOU'],25.0,3,C,65,32.0,1581.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
10303,Alexis Ajinca,201582,2016,['NOP'],29.0,7,C,39,15.0,584.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
650,Steven Smith,120,1998,['ATL'],30.0,8,SG,36,36.0,1314.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1
8192,Rolando Blackman,76176,1992,['NYK'],34.0,12,SG,60,33.0,1434.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
14187,Luke Kornet,1628436,2017,['NYK'],22.0,1,PF,20,1.0,326.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


In [10]:
numeric_data = player_data.select_dtypes(include=['number'])

In [11]:
correlations = numeric_data.corr()['MOVED_BY_END_OF_NEXT_SEASON']

In [12]:
print(correlations)

PLAYER_ID                     -0.007206
SEASON_START                  -0.016470
PLAYER_AGE                     0.040112
EXPERIENCE                    -0.012069
GP                            -0.227963
                                 ...   
RELEASED_NEXT_NBA_YEAR         0.131092
TRADED_NBA_YEAR                0.404463
TRADED_NEXT_NBA_YEAR           0.518717
IN_LEAGUE_NEXT                -0.068808
MOVED_BY_END_OF_NEXT_SEASON    1.000000
Name: MOVED_BY_END_OF_NEXT_SEASON, Length: 76, dtype: float64


In [13]:
sorted_correlations = correlations.sort_values(ascending=False)
print(sorted_correlations)

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518717
WAIVED_NBA_YEAR                0.438904
WAIVED_NEXT_NBA_YEAR           0.432247
TRADED_NBA_YEAR                0.404463
                                 ...   
MIN                           -0.226591
GP                            -0.227963
FGM                           -0.229878
DWS                           -0.232815
WS                            -0.236431
Name: MOVED_BY_END_OF_NEXT_SEASON, Length: 76, dtype: float64


In [14]:
pd.set_option('display.max_rows', None)

# Print the sorted correlations
print(sorted_correlations)

# Reset display options if needed
pd.reset_option('display.max_rows')

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518717
WAIVED_NBA_YEAR                0.438904
WAIVED_NEXT_NBA_YEAR           0.432247
TRADED_NBA_YEAR                0.404463
TRADED_NEXT_OFF                0.388480
TRADED_NEXT_REG                0.360414
TRADED_NEXT_NEXT_OFF           0.351442
WAIVED_NEXT_OFF                0.345090
WAIVED_NEXT_NEXT_OFF           0.313604
WAIVED_NEXT_REG                0.300161
WAIVED_REG                     0.268713
RELEASED_NBA_YEAR              0.133510
RELEASED_NEXT_NBA_YEAR         0.131092
RELEASED_NEXT_OFF              0.128133
RELEASED_NEXT_NEXT_OFF         0.119370
TRADED_REG                     0.085308
WAIVED_OFF                     0.075564
TOV_PERCENT                    0.060260
TRADED_POST                    0.058124
TRADED_NEXT_POST               0.057035
RELEASED_NEXT_REG              0.051247
WAIVED_POST                    0.040306
PLAYER_AGE                     0.040112
X3P_AR                         0.038584


From this, the stats with the highest positive correlation with being moved are:

- TOV_PERCENT: 0.060260
- PLAYER_AGE: 0.040112
- X3P_AR: 0.038584 (I think this is 3-point attempt rate: the percentage of field-goal attempts taken from 3-point range)

And the stats with the strongest negative correlation with being moved are:

- WS: -0.236431 (win shares)
- DWS: -0.232815 (defensive win shares)
- FGM: -0.229878
- GP: -0.227963
- MIN: -0.226591

However, each of these stats, like being waived, is correlated with minutes played. Let's look at these stats per minute to account for the fact that players who are waived play less than players who are not waived.


In [15]:
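# Drop zero-minute rows so the per-minute division below is defined;
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
player_data = player_data[player_data['MIN'] != 0].copy()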

In [16]:
len(player_data)

11172

In [17]:
columns_to_normalize = ['FGM', 'FGA', 'PTS', 'PF', 'DREB', 'OREB', 'REB', 'FTA', 'FTM', 'STL', 'TOV', 'BLK', 'AST', 'FG3A', 'FG3M']

# Normalize the selected columns by dividing by 'MIN'
player_data[columns_to_normalize] = player_data[columns_to_normalize].div(player_data['MIN'], axis=0)

In [18]:
numeric_data = player_data.select_dtypes(include=['number'])

In [19]:
correlations = numeric_data.corr()['MOVED_BY_END_OF_NEXT_SEASON']

In [20]:
sorted_correlations = correlations.sort_values(ascending=False)
pd.set_option('display.max_rows', None)

# Print the sorted correlations
print(sorted_correlations)

# Reset display options if needed
pd.reset_option('display.max_rows')

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518883
WAIVED_NBA_YEAR                0.438674
WAIVED_NEXT_NBA_YEAR           0.432198
TRADED_NBA_YEAR                0.404590
TRADED_NEXT_OFF                0.388602
TRADED_NEXT_REG                0.360527
TRADED_NEXT_NEXT_OFF           0.351552
WAIVED_NEXT_OFF                0.345198
WAIVED_NEXT_NEXT_OFF           0.313473
WAIVED_NEXT_REG                0.300254
WAIVED_REG                     0.268282
RELEASED_NBA_YEAR              0.133551
RELEASED_NEXT_NBA_YEAR         0.131132
RELEASED_NEXT_OFF              0.128172
RELEASED_NEXT_NEXT_OFF         0.119406
TRADED_REG                     0.085334
WAIVED_OFF                     0.075623
PF                             0.061292
TOV_PERCENT                    0.060260
TRADED_POST                    0.058142
TRADED_NEXT_POST               0.057052
RELEASED_NEXT_REG              0.051262
WAIVED_POST                    0.040318
PLAYER_AGE                     0.040190


Now the strongest negative correlations with being moved are still win shares (total, defensive, and offensive), followed by VORP (value over replacement player), PER (player efficiency rating), BPM (box plus/minus), points per minute, FGM/min, TS percent (true shooting percent), FG_PCT, FTM/min, FT_PCT, and percent of team salary.

First, let's try the KNN classifier and logistic regression models on those stats with abs(corr) > 0.15.
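The feature list in the next cell is written out by hand. As a sketch of how the same kind of threshold could be applied programmatically, the hypothetical helper below (`select_by_correlation` is not part of the notebook) keeps numeric columns whose absolute correlation with the target exceeds a threshold. The `exclude` argument is there because the transaction columns (the `TRADED_*`, `WAIVED_*`, and `RELEASED_*` flags) define the target and should not be used as features. The toy table is invented for the example.

```python
import pandas as pd

def select_by_correlation(df, target, threshold=0.15, exclude=()):
    """Return correlations with `target` whose absolute value exceeds `threshold`,
    strongest first, skipping the target itself and any excluded columns."""
    corr = df.select_dtypes(include="number").corr()[target]
    corr = corr.drop(labels=[target, *exclude], errors="ignore")
    selected = corr[corr.abs() > threshold]
    return selected.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up data: WS separates the classes, PLAYER_AGE does not
toy = pd.DataFrame({
    "WS": [8.0, 6.5, 1.0, 0.5, 7.0, 0.2],
    "PLAYER_AGE": [25, 31, 27, 24, 28, 33],
    "TRADED_NBA_YEAR": [0, 0, 1, 0, 0, 1],
    "MOVED": [0, 0, 1, 1, 0, 1],
})

selected = select_by_correlation(toy, "MOVED", exclude=["TRADED_NBA_YEAR"])
print(selected)
```

On the real data the call would be along the lines of `select_by_correlation(player_data, 'MOVED_BY_END_OF_NEXT_SEASON', exclude=transaction_columns)`, where `transaction_columns` is a list you would have to assemble yourself.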

In [21]:
Xfeatures = ['WS', 'DWS', 'GP', 'MIN', 'OWS', 'VORP', 'PER', 'BPM', 'GS', 'OBPM', 'WS_48', 'PTS', 'FGM']

In [22]:
from sklearn.model_selection import train_test_split
train_player_data, final_test_player_data = train_test_split(player_data, test_size=0.2, stratify=player_data['MOVED_BY_END_OF_NEXT_SEASON'], random_state=812)
print(len(train_player_data))
print(len(final_test_player_data))

8937
2235


This final_test_player_data is set aside for the very end to test our final model(s). In this notebook from here on out we'll use train_player_data as our data set

In [23]:
train_player_data = train_player_data.dropna(subset = Xfeatures)

In [24]:
len(train_player_data)

8937

In [25]:
X = train_player_data[Xfeatures]
y = train_player_data['MOVED_BY_END_OF_NEXT_SEASON']

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [26]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])
    

In [27]:
knn_pipeline.fit(X_train, y_train)

In [28]:
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.5928411633109619
Confusion Matrix:  [[752 290]
 [438 308]]
F1 score:  0.4583333333333333


So this model is not very good, but let's see if logistic regression does better on these features before moving on to other features.

In [29]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.6236017897091722
Confusion Matrix:  [[847 195]
 [478 268]]
F1 score:  0.44334160463192723


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


So this is about the same.
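The solver warning printed above suggests scaling the data or raising `max_iter`. The sketch below shows one way to do both, by putting `StandardScaler` and `LogisticRegression` in a `Pipeline`, mirroring the KNN pipeline. It runs on synthetic data with deliberately mismatched feature scales, not on the player data, and it uses scikit-learn's default regularization rather than `penalty = None`, so it is not a drop-in replacement for the cell above and its scores say nothing about the NBA models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column scales are stretched on purpose
X, y = make_classification(n_samples=500, n_features=5, random_state=812)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=812, stratify=y
)

# Scale inside the pipeline so the scaler is fit on the training split only
logreg_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
logreg_pipeline.fit(X_train, y_train)

print("iterations used:", logreg_pipeline.named_steps["logreg"].n_iter_)
print("test accuracy:", logreg_pipeline.score(X_test, y_test))
```

If the solver converges, the reported iteration count stays below `max_iter` and no warning is printed.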

Next, let's ignore the advanced stats and just look at the most relevant basic stats and the salary information.

In [30]:
Xfeatures = ['MIN', 'PTS', 'FGM', 'FG_PCT', 'FTM', 'FT_PCT', 'Percent_team_salary', 'PF']

In [31]:
train_player_data = train_player_data.dropna(subset = Xfeatures)

In [32]:
len(train_player_data)

8937

In [33]:
X = train_player_data[Xfeatures]
y = train_player_data['MOVED_BY_END_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [34]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)



In [35]:
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.5833333333333334
Confusion Matrix:  [[722 320]
 [425 321]]
F1 score:  0.46286950252343184


In [36]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.62248322147651
Confusion Matrix:  [[857 185]
 [490 256]]
F1 score:  0.43133951137320975


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


This version is about the same. Maybe something simpler should be checked, so let's try again with only points per minute as the feature.

In [37]:
Xfeatures = ['PTS']

In [38]:
X = train_player_data[Xfeatures]
y = train_player_data['MOVED_BY_END_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [39]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)

In [40]:
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.5430648769574944
Confusion Matrix:  [[697 345]
 [472 274]]
F1 score:  0.4014652014652015


In [41]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.5894854586129754
Confusion Matrix:  [[944  98]
 [636 110]]
F1 score:  0.23060796645702306


As another attempt before changing the variable we are predicting, let's look at the same stats as the earlier models, just now with the stats scaled per minute rather than per game.

In [42]:
Xfeatures = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF']

In [43]:
X = train_player_data[Xfeatures]
y = train_player_data['MOVED_BY_END_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [44]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)

In [45]:
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.5894854586129754
Confusion Matrix:  [[746 296]
 [438 308]]
F1 score:  0.4562962962962963


In [46]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.6006711409395973
Confusion Matrix:  [[918 124]
 [590 156]]
F1 score:  0.30409356725146197


This again is not very good. The next thing to try is to shrink the time horizon and restrict to players waived or released in the current season alone. So the new y should be whether the player is waived or released before the start of next season.

In [47]:
player_data['WAIVED_BY_START_OF_NEXT_SEASON'] = player_data[['WAIVED_NBA_YEAR', 'RELEASED_NBA_YEAR']].any(axis=1).astype(int)

Now try with the basic (per minute) statistics

In [48]:
train_player_data, final_test_player_data = train_test_split(player_data, test_size=0.2, stratify=player_data['WAIVED_BY_START_OF_NEXT_SEASON'], random_state=812)
print(len(train_player_data))
print(len(final_test_player_data))

8937
2235


In [49]:
Xfeatures = ['PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF']

In [50]:
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [51]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)

In [52]:
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.8585011185682326
Confusion Matrix:  [[1510   40]
 [ 213   25]]
F1 score:  0.16501650165016502


So this is much more accurate, but the F1 score is also much worse: the false negative rate is too high.
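To see where the low F1 score comes from, the arithmetic can be redone directly from the KNN confusion matrix printed above. This sketch assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the flattened order is TN, FP, FN, TP.

```python
import numpy as np

# KNN confusion matrix for WAIVED_BY_START_OF_NEXT_SEASON, copied from the output above
cm = np.array([[1510, 40],
               [213, 25]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # 25 / 65
recall = tp / (tp + fn)               # 25 / 238
false_negative_rate = fn / (fn + tp)  # 213 / 238
f1 = 2 * precision * recall / (precision + recall)

print("precision:", round(precision, 3))
print("recall:", round(recall, 3))
print("false negative rate:", round(false_negative_rate, 3))
print("F1:", round(f1, 3))
```

Only 25 of the 238 waived players in this split are caught, so recall is about 0.105 and the false negative rate about 0.895, which pulls F1 down to about 0.165 even though accuracy is high.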

In [53]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.8646532438478747
Confusion Matrix:  [[1538   12]
 [ 230    8]]
F1 score:  0.06201550387596899


Next, let's look at the set of features that seems more intuitive, that is, the ones we might expect to be the most important.

In [54]:
Xfeatures = ['MIN', 'PTS', 'FGM', 'FG_PCT', 'FTM', 'FT_PCT', 'Percent_team_salary', 'PF']

In [55]:
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [56]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)


accuracy:  0.8529082774049217
Confusion Matrix:  [[1469   81]
 [ 182   56]]
F1 score:  0.2986666666666667


In [57]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.8691275167785235
Confusion Matrix:  [[1523   27]
 [ 207   31]]
F1 score:  0.20945945945945946


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


This is better, but still not great. Finally, let's try again with the most strongly correlated stats.

In [58]:
Xfeatures = ['WS', 'DWS', 'GP', 'MIN', 'OWS', 'VORP', 'PER', 'BPM', 'GS', 'OBPM', 'WS_48', 'PTS', 'FGM']

In [59]:
X = train_player_data[Xfeatures]
y = train_player_data['WAIVED_BY_START_OF_NEXT_SEASON']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=812, stratify = y)

In [60]:
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors = 5))
])

knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.8585011185682326
Confusion Matrix:  [[1469   81]
 [ 172   66]]
F1 score:  0.34285714285714286


In [61]:
log_reg = LogisticRegression(penalty = None)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: ", accuracy)
CM = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ", CM)
f1score = f1_score(y_test, y_pred)
print("F1 score: ", f1score)

accuracy:  0.8696868008948546
Confusion Matrix:  [[1529   21]
 [ 212   26]]
F1 score:  0.1824561403508772


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [62]:
moved_by_end = player_data.loc[player_data['MOVED_BY_END_OF_NEXT_SEASON'] == 1]

In [66]:
len(moved_by_end)/(len(player_data))

0.41747225205871824

In [68]:
moved_by_start = player_data.loc[player_data['WAIVED_BY_START_OF_NEXT_SEASON'] == 1]

In [69]:
len(moved_by_start)/len(player_data)

0.13310060866451845
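These two proportions also give a simple accuracy baseline: a model that always predicts the majority class is right for every player in that class. The sketch below computes that baseline from the proportions printed above. The helper name is made up, and the proportions are over all of `player_data`, so the figure for any particular train/test split will differ slightly.

```python
def majority_baseline_accuracy(positive_rate):
    """Accuracy of always predicting whichever class is more common."""
    return max(positive_rate, 1 - positive_rate)

# Positive rates copied from the two outputs above
positive_rates = {
    "MOVED_BY_END_OF_NEXT_SEASON": 0.41747225205871824,
    "WAIVED_BY_START_OF_NEXT_SEASON": 0.13310060866451845,
}

for target, rate in positive_rates.items():
    print(target, round(majority_baseline_accuracy(rate), 4))
```

This gives roughly 0.58 for the end-of-next-season target and 0.87 for the start-of-next-season target, so the START_NEXT accuracies of about 0.85 to 0.87 reported earlier are close to what always predicting "not waived" would score. That is consistent with those models returning mostly negatives.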