# Threshold Adjustment

👇 Load the `player_performances.csv` dataset to see what you will be working with.

In [56]:
import pandas as pd

!curl -s https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Player_performance.csv > data/player_performances.csv

data = pd.read_csv('data/player_performances.csv')

data

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers,target_5y
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323,80,15.8,4.3,1.6,3.6,43.3,0.0,0.2,14.3,1.2,1.5,79.2,0.4,0.8,1.2,2.5,0.6,0.2,0.8,0
1324,68,12.6,3.9,1.5,4.1,35.8,0.1,0.7,16.7,0.8,1.0,79.4,0.4,1.1,1.5,2.3,0.8,0.0,1.3,1
1325,43,12.1,5.4,2.2,3.9,55.0,0.0,0.0,0.0,1.0,1.6,64.3,1.5,2.3,3.8,0.3,0.3,0.4,0.9,0
1326,52,12.0,4.5,1.7,3.8,43.9,0.0,0.2,10.0,1.2,1.8,62.5,0.2,0.4,0.7,2.2,0.4,0.1,0.8,1


ℹ️ Each observation represents a player and each column a characteristic of performance. The target `target_5y` defines whether the player has had a professional career of less than 5 years [0] or 5 years or more [1].

# Preprocessing

👇 To avoid spending too much time on preprocessing, scale the entire feature set with a `RobustScaler`. This practice is not optimal, but it can be used for preliminary preprocessing and/or to get models up and running quickly.

Save the scaled feature set as `X_scaled`.
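
As a minimal sketch of what robust scaling does, the example below compares `RobustScaler` with a manual computation: subtract the median, then divide by the interquartile range. The toy values are made up for illustration and are not taken from the player dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme outlier (hypothetical values, not the player data)
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Default RobustScaler: center on the median, scale by the 25th-75th percentile range
scaled = RobustScaler().fit_transform(toy)

# Manual equivalent of the same transformation
median = np.median(toy, axis=0)
q1, q3 = np.percentile(toy, [25, 75], axis=0)
manual = (toy - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the median and the interquartile range barely move when a single extreme value is present, the outlier does not compress the other observations the way min-max scaling would.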

In [69]:
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
y = data.target_5y
X = data.drop(columns='target_5y')

robust_scaler = RobustScaler().set_output(transform='pandas')
X_scaled = robust_scaler.fit_transform(X)
X_scaled

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers
0,-0.900000,0.933884,0.352941,0.25,0.666667,-1.206557,1.00,1.500000,0.083077,0.585366,0.571429,-0.109375,-0.1,1.0625,0.659794,0.571429,-0.2,0.50,0.375
1,-0.933333,0.892562,0.313725,-0.05,0.452381,-1.875410,1.50,2.083333,0.036923,1.560976,1.357143,0.406250,-0.3,0.1875,-0.041237,1.857143,1.2,0.75,0.750
2,0.366667,-0.066116,-0.078431,-0.05,-0.023810,-0.222951,0.75,1.166667,0.064615,-0.097561,-0.142857,-0.335937,-0.3,0.0000,-0.123711,-0.071429,0.0,0.25,0.000
3,-0.166667,-0.371901,0.019608,0.10,0.166667,-0.170492,0.00,0.166667,0.009231,-0.097561,-0.142857,-0.187500,0.2,-0.5000,-0.247423,-0.214286,0.2,-0.25,0.000
4,-0.500000,-0.380165,-0.215686,-0.25,-0.428571,1.114754,-0.25,-0.166667,-0.686154,0.292683,0.285714,-0.304687,0.2,-0.1250,0.000000,-0.571429,-0.4,0.50,-0.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323,0.566667,-0.024793,-0.254902,-0.25,-0.285714,-0.078689,-0.25,-0.083333,-0.246154,0.195122,0.000000,0.617188,-0.4,-0.5625,-0.536082,1.000000,0.2,0.00,-0.250
1324,0.166667,-0.289256,-0.333333,-0.30,-0.166667,-1.062295,0.00,0.333333,-0.172308,-0.195122,-0.357143,0.632813,-0.4,-0.3750,-0.412371,0.857143,0.6,-0.50,0.375
1325,-0.666667,-0.330579,-0.039216,0.05,-0.214286,1.455738,-0.25,-0.250000,-0.686154,0.000000,0.071429,-0.546875,0.7,0.3750,0.536082,-0.571429,-0.4,0.50,-0.125
1326,-0.366667,-0.338843,-0.215686,-0.20,-0.238095,0.000000,-0.25,-0.083333,-0.378462,0.195122,0.214286,-0.687500,-0.6,-0.8125,-0.742268,0.785714,-0.2,-0.25,-0.250


### ☑️ Check your code

In [70]:
from nbresult import ChallengeResult

result = ChallengeResult('scaled_features',
                         scaled_features = X_scaled
)

result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/bat/.pyenv/versions/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/bat/code/syanrys/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: dash-3.0.4, anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_scaled_features.py::TestScaled_features::test_scaled_features PASSED [100%]



💯 You can commit your code:

git add tests/scaled_features.pickle

git commit -m 'Completed scaled_features step'

git push origin master



# Base modeling

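🎯 The task is to detect players who will last at least 5 years as professionals, with a 90% guarantee: at least 90% of the players the model flags as lasting should actually do so, which corresponds to a precision of 0.9.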

👇 Is a default Logistic Regression model going to satisfy the coach's requirements? Use cross-validation and save the score that supports your answer under variable name `base_score`.
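
Precision is the metric that matches a "guarantee" on positive predictions: it is the share of predicted positives that are truly positive. The sketch below computes it by hand and with `precision_score` on made-up labels; the arrays `y_true` and `y_hat` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels: 1 = career of 5 years or more, 0 = shorter career
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# True positives: predicted 1 and actually 1; false positives: predicted 1 but actually 0
tp = int(np.sum((y_hat == 1) & (y_true == 1)))
fp = int(np.sum((y_hat == 1) & (y_true == 0)))

manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_hat)

print(manual_precision, sklearn_precision)
```

Recall is deliberately ignored here: missing some long-career players is acceptable as long as the players that are flagged are reliable picks.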

In [75]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Precision is the metric of interest: of the players predicted to last, how many actually do
cv_results = cross_validate(estimator=LogisticRegression(), X=X_scaled, y=y, scoring=['precision'])
base_score = cv_results['test_precision'].mean()
base_score

0.7379036747632812
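
The preprocessing step above scales the full feature set before cross-validation, which the notebook itself describes as not optimal. One common alternative, sketched here as an assumption rather than as part of the challenge solution, is to put the scaler inside a pipeline so that it is refit on each training fold. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `pipe` are hypothetical stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the player data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# The scaler is fitted on each training fold only, then applied to the held-out fold
pipe = make_pipeline(RobustScaler(), LogisticRegression())

cv_results = cross_validate(pipe, X_demo, y_demo, cv=5, scoring=["precision"])
mean_precision = cv_results["test_precision"].mean()

print(mean_precision)
```

With this layout, statistics from the validation fold never influence the scaling of the training fold.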

### ☑️ Check your code

In [76]:
from nbresult import ChallengeResult

result = ChallengeResult('base_precision',
                         score = base_score
)

result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/bat/.pyenv/versions/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/bat/code/syanrys/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: dash-3.0.4, anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_base_precision.py::TestBase_precision::test_precision_score PASSED [100%]



💯 You can commit your code:

git add tests/base_precision.pickle

git commit -m 'Completed base_precision step'

git push origin master



# Threshold adjustment

👇 Find the decision threshold that guarantees a 90% precision for a player to last 5 years or more as a professional. Save the threshold under variable name `new_threshold`.

<details>
<summary>💡 Hint</summary>

- Make cross validated probability predictions with [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)
    
- Plug the probabilities into [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) to generate precision scores at different thresholds

- Find out which threshold guarantees a precision of 0.9
      
</details>
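
The three steps in the hint can be sketched on synthetic data as follows. This is a minimal illustration, not the challenge solution: `make_classification` stands in for the player dataset, and the names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (hypothetical stand-in for the player data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 1: out-of-fold probabilities for the positive class
proba_demo = cross_val_predict(
    LogisticRegression(), X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

# Step 2: precision and recall at every candidate threshold
precision_demo, recall_demo, thresholds_demo = precision_recall_curve(y_demo, proba_demo)

# Step 3: lowest threshold whose precision reaches 0.9.
# precision_demo has one more entry than thresholds_demo, so drop its last element.
reaches_target = precision_demo[:-1] >= 0.9
threshold_demo = thresholds_demo[reaches_target].min() if reaches_target.any() else None

print(threshold_demo)
```

Taking the lowest qualifying threshold keeps recall as high as possible. Precision does not always increase monotonically with the threshold, so it can be worth inspecting the curve around the chosen value.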



In [91]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Out-of-fold class probabilities: column 0 is P(class 0), column 1 is P(class 1)
data[['proba_0', 'proba_1']] = cross_val_predict(estimator=LogisticRegression(), X=X_scaled, y=y, method='predict_proba')
precision, recall, threshold = precision_recall_curve(y, data.proba_1)

# precision and recall have one more entry than threshold, so drop their last element
scores = pd.DataFrame({
    'threshold' : threshold,
    'precision' : precision[:-1],
    'recall' : recall[:-1]
})
scores.query('precision>0.9')

Unnamed: 0,threshold,precision,recall
1103,0.866692,0.900990,0.220339
1104,0.866740,0.900498,0.219128
1106,0.867006,0.908629,0.216707
1107,0.867635,0.913265,0.216707
1108,0.868156,0.912821,0.215496
...,...,...,...
1294,0.987141,1.000000,0.006053
1295,0.987451,1.000000,0.004843
1296,0.987759,1.000000,0.003632
1297,0.993227,1.000000,0.002421


In [100]:
new_threshold = scores[scores['precision'] >= 0.9].threshold.min()
new_threshold

0.8666918410449227

### ☑️ Check your code

In [101]:
from nbresult import ChallengeResult

result = ChallengeResult('decision_threshold',
                         threshold = new_threshold
)

result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/bat/.pyenv/versions/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/bat/code/syanrys/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: dash-3.0.4, anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_decision_threshold.py::TestDecision_threshold::test_new_threshold PASSED [100%]



💯 You can commit your code:

git add tests/decision_threshold.pickle

git commit -m 'Completed decision_threshold step'

git push origin master



# Using the new threshold

🎯 The coach has spotted a potentially interesting player, but wants your 90% guarantee that he would last 5 years minimum as a pro. Download the player's data [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_New_player.csv).

In [99]:
new_player = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_New_player.csv")

new_player

Unnamed: 0,games played,minutes played,points per game,field goals made,field goal attempts,field goal percent,3 point made,3 point attempt,3 point %,free throw made,free throw attempts,free throw %,offensive rebounds,defensive rebounds,rebounds,assists,steals,blocks,turnovers
0,80,31.4,14.3,5.9,11.1,52.5,0.0,0.1,11.1,2.6,3.9,65.4,3.0,5.0,8.0,2.4,1.1,0.8,2.2


❓ Would you risk recommending the player to the coach? Save your answer as a string under variable name `recommendation`, either "recommend" or "not recommend".

In [110]:
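model = LogisticRegression()
model.fit(X_scaled, y)

# Apply the scaler fitted on the training features; the model was trained on scaled data
new_player_scaled = robust_scaler.transform(new_player)

# Compare the positive-class probability with the tuned threshold, not a hard-coded 0.9
proba = model.predict_proba(new_player_scaled)[:, 1]
y_pred = (proba >= new_threshold).astype(int)[0]

recommendation = ['not recommend', 'recommend'][y_pred]
recommendation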

'recommend'
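
Applying a custom decision threshold amounts to comparing the positive-class probability with that threshold instead of relying on `predict`, which uses 0.5 for a binary logistic regression. The helper below is a hypothetical sketch on synthetic data; `predict_with_threshold`, `clf`, and the `_demo` arrays are illustrative names, not part of the challenge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def predict_with_threshold(model, X, threshold):
    """Return 1 where the positive-class probability reaches the threshold, else 0."""
    return (model.predict_proba(X)[:, 1] >= threshold).astype(int)


# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

default_preds = predict_with_threshold(clf, X_demo, 0.5)
strict_preds = predict_with_threshold(clf, X_demo, 0.9)

# A stricter threshold can only reduce the number of positive predictions
print(default_preds.sum(), strict_preds.sum())
```

Raising the threshold trades recall for precision: fewer players are flagged, but each flag carries more confidence.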

### ☑️ Check your code

In [111]:
from nbresult import ChallengeResult

result = ChallengeResult('recommendation',
                         recommendation = recommendation
)

result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/bat/.pyenv/versions/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/bat/code/syanrys/05-ML/03-Performance-metrics/data-threshold-adjustments/tests
plugins: dash-3.0.4, anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_recommendation.py::TestRecommendation::test_recommendation PASSED [100%]



💯 You can commit your code:

git add tests/recommendation.pickle

git commit -m 'Completed recommendation step'

git push origin master



# 🏁