# Basketball Player Career Prediction

In [13]:
import pandas as pd

data = pd.read_csv('data/player_performances.csv')

data.head()

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


ℹ️ Each observation represents a player and each column a characteristic of performance. The target `target_5y` defines whether the player has had a professional career of less than 5 years [0] or 5 years or more [1].

# Preprocessing

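👇 To avoid spending too much time on the preprocessing, we will apply a `RobustScaler` to the entire feature set. This is not optimal (the scaler is fitted on all rows before any cross-validation split, and every feature gets the same treatment), but it allows us to get models up and running quickly.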

In [14]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_scaled = scaler.fit_transform(data.drop(columns = 'target_5y'))
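
For reference, with its default settings `RobustScaler` centers each feature on its median and divides by its interquartile range (75th minus 25th percentile). The sketch below reproduces that arithmetic with the standard library on made-up values. `robust_scale` is a hypothetical helper for illustration, not part of scikit-learn, and it assumes a non-zero interquartile range.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Center on the median and divide by the IQR (assumes IQR != 0)."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    return [(v - med) / iqr for v in values]


# Made-up "games played" values, including one extreme outlier
games_played = [36, 35, 74, 58, 48, 300]
scaled = robust_scale(games_played)

# Making the outlier even larger would not change the median or the IQR here,
# so the other values would keep exactly the same scaled positions
print(scaled)
```

This is why robust scaling is a reasonable quick default when some columns may contain extreme values.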

# Base modeling

🎯 Our task is to detect players who will last 5 years minimum as professionals, with 90% precision: at least 90% of the players we recommend should actually last that long.

Let's see if a default Logistic Regression model satisfies the 90% precision requirement.
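
The 90% guarantee is naturally read as precision: among the players the model flags as lasting 5 years or more, the share who actually do. A tiny worked example on made-up labels (the `precision` helper here is a hypothetical stand-in for scikit-learn's metric):

```python
def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        return 0.0  # no player recommended at all
    return tp / (tp + fp)


# Made-up labels: 1 = career of 5 years or more, 0 = shorter career
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1]  # 8 players recommended, 7 correctly

print(precision(y_true, y_pred))  # 7 / 8 = 0.875, just short of the 0.9 target
```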

In [15]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

# 10-fold cross-validated precision of a logistic regression
log_cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_scaled,
    data['target_5y'],
    cv=10,
    scoring=['precision'],
)

base_score = log_cv_results['test_precision'].mean()

base_score

0.737761327524343
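
The score above is computed on features that were scaled once using every row, so each validation fold has already influenced the scaler. One common way to avoid this is to put the scaler and the model in a single pipeline, so that the scaler is re-fitted on the training part of each fold. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for the player data; its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the player data (assumption: 19 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=19, random_state=0)

# Scaling and modeling are chained, so cross_validate fits both per fold
pipeline = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))

cv_results = cross_validate(pipeline, X_demo, y_demo, cv=10, scoring=["precision"])
print(cv_results["test_precision"].mean())
```

On the real data, `X_demo` and `y_demo` would be replaced by the unscaled features and `data['target_5y']`.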

# Threshold adjustment

Let's find the lowest decision threshold that reaches 90% precision, estimated on cross-validated predictions, for the class of players who last 5 years or more as professionals.

In [16]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

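# Out-of-fold predicted probabilities for each class (default 5-fold CV)
y_pred_probas_0, y_pred_probas_1 = cross_val_predict(LogisticRegression(),
                                                     X_scaled, data['target_5y'],
                                                     method="predict_proba").T

# Precision and recall for every candidate threshold on the positive class
precision, recall, thresholds = precision_recall_curve(data['target_5y'], y_pred_probas_1)

# precision has one more entry than thresholds, so drop the last value to align them
df_precision = pd.DataFrame({"precision" : precision[:-1], "threshold" : thresholds})

# Lowest threshold (hence highest recall) that still reaches 90% precision
new_threshold = df_precision[df_precision['precision'] >= 0.9]['threshold'].min()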

new_threshold

0.8666405182816753
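
Two details of `precision_recall_curve` are worth spelling out. The returned `precision` and `recall` arrays have one more entry than `thresholds`, which is why the last precision value is dropped above. Each precision value also corresponds to predicting the positive class when the score is greater than or equal to the matching threshold. The toy example below, on made-up scores, prints recall as well, since a stricter threshold typically recommends fewer of the players who would in fact last.

```python
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# The extra trailing entry (precision 1, recall 0) has no matching threshold,
# so it is dropped before pairing the three arrays
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```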

# Using the new threshold

🎯 Let's say a coach has spotted a potentially interesting player, but wants our 90% guarantee that he would last 5 years minimum as a pro.

In [17]:
new_player = pd.read_csv("data/ML_New_player.csv")

new_player

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers
0,80,31.4,14.3,5.9,11.1,52.5,0.0,0.1,11.1,2.6,3.9,65.4,3.0,5.0,8.0,2.4,1.1,0.8,2.2


In [18]:
new_player_scaled = scaler.transform(new_player)

model = LogisticRegression()
model.fit(X_scaled, data['target_5y'])

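def custom_predict(X, custom_threshold):
    """Return True where P(career of 5 years or more) reaches the threshold."""
    probs = model.predict_proba(X)
    positive_probs = probs[:, 1]  # probability of class 1 (5 years or more)
    # >= matches the convention used by precision_recall_curve
    return positive_probs >= custom_threshold


custom_prediction = custom_predict(X=new_player_scaled, custom_threshold=new_threshold)[0]
print(custom_prediction) # "True" = recommended, "False" = not recommended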

True


# We can tell the coach that the model recommends this player: at our 90% precision threshold, he is predicted to last at least 5 years as a pro!