I mostly followed this guide written by Tirendiaz AI, but used my own data: https://medium.com/mlearning-ai/machine-learning-project-with-linear-regression-algorithm-b433d770fefd. It was very helpful!

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math

### We have a file which contains every NBA player as of 2021 along with their height and weight. We need to join this file with other files containing:

* Wingspan
* All-Defensive selections

All NBA players as of 2021, uploaded to Kaggle by Justinas Cirtautas:
https://www.kaggle.com/datasets/justinas/nba-players-data

In [2]:
en_df = pd.read_csv('every_nba_player.csv')
en_df = en_df.drop('Unnamed: 0', axis=1)
en_df.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
0,Dennis Rodman,CHI,36,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,...,5.7,16.1,3.1,16.1,0.186,0.323,0.1,0.479,0.113,1996-97
1,Dwayne Schintzius,LAC,28,215.9,117.93392,Florida,USA,1990,1,24,...,2.3,1.5,0.3,12.3,0.078,0.151,0.175,0.43,0.048,1996-97
2,Earl Cureton,TOR,39,205.74,95.25432,Detroit Mercy,USA,1979,3,58,...,0.8,1.0,0.4,-2.1,0.105,0.102,0.103,0.376,0.148,1996-97
3,Ed O'Bannon,DAL,24,203.2,100.697424,UCLA,USA,1995,1,9,...,3.7,2.3,0.6,-8.7,0.06,0.149,0.167,0.399,0.077,1996-97
4,Ed Pinckney,MIA,34,205.74,108.86208,Villanova,USA,1985,1,10,...,2.4,2.4,0.2,-11.2,0.109,0.179,0.127,0.611,0.04,1996-97


All NBA combine measurements from 2001 to 2022, uploaded to Kaggle by Marcus Fern (I am keeping only the wingspan column): https://www.kaggle.com/datasets/marcusfern/nba-draft-combine

In [3]:
ew_df = pd.read_csv('every_wingspan.csv')

# standardize 'player_name' column for later joining
comma_names = ew_df.player_name.astype('str').tolist()
comma_to_without = {}
for x in comma_names:
    if ' ' in x and ',' in x:
        # the first name is the substring starting two indices after the comma
        # and ending at the length of the string
        first = x[x.index(',')+2 : len(x)]
        
        # the last name is the substring starting at index 0 and ending
        # at the index of the comma
        last = x[0 : x.index(',')]
        
        # we can combine these to get the full name
        comma_to_without[x] = first + ' ' + last
        
ew_df.player_name = ew_df.player_name.map(comma_to_without)
ew_df.head()

Unnamed: 0,player_name,wingspan
0,Ochai Agbaji,82.25
1,Patrick Baldwin Jr.,85.75
2,Dominick Barlow,87.0
3,MarJon Beauchamp,84.75
4,Hugo Besson,77.5
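The same reordering can also be written as a small function. One thing to keep in mind is that `Series.map` with a dictionary returns NaN for any key that is missing from the dictionary, so a name without a comma would be lost. The sketch below leaves such names unchanged instead; the sample names and the helper `to_first_last` are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample in the "Last, First" style, plus one name with no comma
names = pd.Series(["Agbaji, Ochai", "Baldwin Jr., Patrick", "Nene"])


def to_first_last(name):
    """Turn 'Last, First' into 'First Last'; return anything else unchanged."""
    if isinstance(name, str) and ", " in name:
        last, first = name.split(", ", 1)
        return f"{first} {last}"
    return name


standardized = names.map(to_first_last)
print(standardized.tolist())
```

Passing a function to `map` avoids building the intermediate dictionary, and the fallback `return name` keeps single-word names available for the later join.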


All NBA players who have made an All-Defensive first or second team, sourced from Basketball Reference:
https://www.basketball-reference.com/awards/all_defense_by_player.html

In [4]:
ad_df = pd.read_csv('every_all_defense.csv')
ad_df.head()

Unnamed: 0,player_name,NBA_ABA_total,NBA_1st,NBA_2nd,NBA_total,ABA_1st,ABA_2nd,ABA_total
0,Tim Duncan,15,8,7,15,0,0,0
1,Kobe Bryant,12,9,3,12,0,0,0
2,Kevin Garnett,12,9,3,12,0,0,0
3,Kareem Abdul-Jabbar,11,5,6,11,0,0,0
4,Bobby Jones,11,8,1,9,2,0,2


Now we can join the player DataFrame (en_df) with the All-Defensive DataFrame (ad_df) on the column 'player_name'.

Intermediate DataFrame: 'every defense DataFrame' (ed_df)

In [5]:
ed_df = pd.merge(en_df, ad_df, how="outer", on=["player_name"])
ed_df.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,ts_pct,ast_pct,season,NBA_ABA_total,NBA_1st,NBA_2nd,NBA_total,ABA_1st,ABA_2nd,ABA_total
0,Dennis Rodman,CHI,36.0,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,...,0.479,0.113,1996-97,8.0,7.0,1.0,8.0,0.0,0.0,0.0
1,Dennis Rodman,CHI,37.0,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,...,0.459,0.112,1997-98,8.0,7.0,1.0,8.0,0.0,0.0,0.0
2,Dennis Rodman,LAL,38.0,200.66,95.25432,Southeastern Oklahoma State,USA,1986,2,27,...,0.388,0.063,1998-99,8.0,7.0,1.0,8.0,0.0,0.0,0.0
3,Dennis Rodman,DAL,39.0,200.66,95.25432,Southeastern Oklahoma State,USA,1986,2,27,...,0.457,0.046,1999-00,8.0,7.0,1.0,8.0,0.0,0.0,0.0
4,Dwayne Schintzius,LAC,28.0,215.9,117.93392,Florida,USA,1990,1,24,...,0.43,0.048,1996-97,,,,,,,
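To see where the NaNs in an outer join come from, it can help to run the merge on a tiny example with `indicator=True`, which adds a `_merge` column saying whether each row matched. This is only a sketch: the two small frames below are hypothetical stand-ins for en_df (one row per player-season) and ad_df (one row per player).

```python
import pandas as pd

# Hypothetical stand-in for en_df: one row per player-season
seasons = pd.DataFrame({
    "player_name": ["Dennis Rodman", "Dennis Rodman", "Dwayne Schintzius"],
    "season": ["1996-97", "1997-98", "1996-97"],
})

# Hypothetical stand-in for ad_df: one row per player
awards = pd.DataFrame({
    "player_name": ["Dennis Rodman", "Bobby Jones"],
    "NBA_ABA_total": [8, 11],
})

merged = pd.merge(
    seasons, awards,
    how="outer", on="player_name",
    indicator=True,           # adds the _merge column
    validate="many_to_one",   # raises if a name repeats in the awards frame
)
print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are players with no All-Defensive selection, and rows marked `right_only` are award winners who never appear in the season data. The integer totals also become floats after the merge because NaN needs a float column, which matches the `8.0` values in the output above.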


Finally, we can join this DataFrame with the wingspan DataFrame (ew_df).

Final DataFrame: all accolades and measurables DataFrame (aam_df)

In [25]:
aam_df = pd.merge(ed_df, ew_df, how="outer", on=["player_name"])
aam_df.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,ast_pct,season,NBA_ABA_total,NBA_1st,NBA_2nd,NBA_total,ABA_1st,ABA_2nd,ABA_total,wingspan
0,Dennis Rodman,CHI,36.0,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,...,0.113,1996-97,8.0,7.0,1.0,8.0,0.0,0.0,0.0,
1,Dennis Rodman,CHI,37.0,198.12,99.79024,Southeastern Oklahoma State,USA,1986,2,27,...,0.112,1997-98,8.0,7.0,1.0,8.0,0.0,0.0,0.0,
2,Dennis Rodman,LAL,38.0,200.66,95.25432,Southeastern Oklahoma State,USA,1986,2,27,...,0.063,1998-99,8.0,7.0,1.0,8.0,0.0,0.0,0.0,
3,Dennis Rodman,DAL,39.0,200.66,95.25432,Southeastern Oklahoma State,USA,1986,2,27,...,0.046,1999-00,8.0,7.0,1.0,8.0,0.0,0.0,0.0,
4,Dwayne Schintzius,LAC,28.0,215.9,117.93392,Florida,USA,1990,1,24,...,0.048,1996-97,,,,,,,,


### We have a lot of data now, but it's not in our best interest to drop the rows that are mostly NaN, because some of those players were just drafted and have not begun their careers yet. I will probably need to update this as the season progresses. For now, let's keep only player_name, age, player_height, player_weight, wingspan, NBA_ABA_total, and season for our model.

In [26]:
# .copy() gives an independent DataFrame, so later column assignments don't warn
aam_df = aam_df[['player_name', 'age', 'player_height', 'player_weight', 'wingspan', 'NBA_ABA_total', 'season']].copy()
aam_df.head()

Unnamed: 0,player_name,age,player_height,player_weight,wingspan,NBA_ABA_total,season
0,Dennis Rodman,36.0,198.12,99.79024,,8.0,1996-97
1,Dennis Rodman,37.0,198.12,99.79024,,8.0,1997-98
2,Dennis Rodman,38.0,200.66,95.25432,,8.0,1998-99
3,Dennis Rodman,39.0,200.66,95.25432,,8.0,1999-00
4,Dwayne Schintzius,28.0,215.9,117.93392,,,1996-97


#### Check the column dtypes in preparation for our model:

In [27]:
aam_df.dtypes

player_name       object
age              float64
player_height    float64
player_weight    float64
wingspan         float64
NBA_ABA_total    float64
season            object
dtype: object

#### We'll convert player_name and season to the category dtype so they can be one-hot encoded later and the model only sees numerical values:

In [28]:
aam_df.player_name = aam_df.player_name.astype('category')
aam_df.season = aam_df.season.astype('category')

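#### I want to view the transposed summary statistics of the dataset so I can see some key indicators of acclaimed defenders. Note that every player appears once per season here, so the All-Defensive totals of players with long careers are counted several times; dropping the rows with duplicate names first would keep those totals from being skewed.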

In [29]:
aam_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,12305.0,27.084518,4.335868,18.0,24.0,26.0,30.0,44.0
player_height,12305.0,200.611602,9.146321,160.02,193.04,200.66,208.28,231.14
player_weight,12305.0,100.369926,12.47715,60.327736,90.7184,99.79024,108.86208,163.29312
wingspan,5641.0,82.578328,4.035971,70.0,80.0,82.75,85.5,98.25
NBA_ABA_total,1142.0,3.75394,3.204381,1.0,1.0,3.0,5.0,15.0


#### Among the rows with at least one selection, the 75th percentile of NBA_ABA_total is 5, so it is fair to say the more standout defenders make about 5 All-Defensive teams. I expect that, when I input the stats of my 2K Player 1, the regression model will put him at or above that number.

#### Player 1: height 203.2 cm, weight 117.48 kg, wingspan 88 in
#### Player 2: height 205.74 cm, weight 106.594 kg, wingspan 91 in

In [30]:
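# average age and All-Defensive total for each weight/height/wingspan combination
awards_data = aam_df.groupby(["player_weight", "player_height", "wingspan"])[["age", "NBA_ABA_total"]].mean().round(2)
awards_data[awards_data.NBA_ABA_total.notna()]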

player_weight,player_height,wingspan,age,NBA_ABA_total
79.378600,182.88,76.25,28.62,9.0
79.378600,185.42,76.25,35.00,9.0
79.378600,185.42,77.75,30.50,1.0
81.646560,185.42,77.75,20.50,1.0
81.646560,187.96,79.25,24.00,2.0
...,...,...,...,...
121.562656,213.36,89.50,29.50,2.0
122.469840,213.36,89.50,31.00,2.0
124.737800,210.82,88.50,29.00,5.0
124.737800,213.36,89.50,27.00,2.0


#### We will proceed with one-hot encoding to finally get our linear regression model going. We also need to make sure scikit-learn doesn't see any NaNs or infinite values in the dataset. Credit to Boern on StackOverflow: https://stackoverflow.com/a/46581125/17186022

In [31]:
# one-hot encode the category columns (player_name and season)
aam_df = pd.get_dummies(aam_df)

def clean_dataset(df):
    """Drop rows containing NaN or infinite values and cast everything to float64."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df = df.dropna()
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

aam_df = clean_dataset(aam_df)
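To make the two steps concrete, here is a small sketch on a hypothetical four-row frame: `pd.get_dummies` expands the category column into one indicator column per value, and a finiteness mask then drops the rows that contain NaN or infinity. The `np.isfinite` mask is a swapped-in alternative to the `isin` check, not the code from the StackOverflow answer.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of aam_df with one NaN and one infinite wingspan
toy = pd.DataFrame({
    "wingspan": [82.25, np.nan, 87.0, np.inf],
    "NBA_ABA_total": [1.0, 2.0, 3.0, 4.0],
    "season": pd.Categorical(["2019-20", "2020-21", "2020-21", "2019-20"]),
})

# season becomes two indicator columns: season_2019-20 and season_2020-21
encoded = pd.get_dummies(toy, dtype=float)

# True only for rows where every value is a finite number
finite_rows = np.isfinite(encoded).all(axis=1)
cleaned = encoded[finite_rows]

print(cleaned.columns.tolist())
print(cleaned)
```

Only rows 0 and 2 survive. On the real data this step is what shrinks the frame to the few hundred rows that have both a wingspan and an All-Defensive total.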

In [32]:
y = aam_df.NBA_ABA_total
X = aam_df.drop("NBA_ABA_total", axis = 1)
aam_df

Unnamed: 0,age,player_height,player_weight,wingspan,NBA_ABA_total,player_name_A.C. Green,player_name_A.J. Bramlett,player_name_A.J. Granger,player_name_A.J. Guyton,player_name_A.J. Price,...,season_2012-13,season_2013-14,season_2014-15,season_2015-16,season_2016-17,season_2017-18,season_2018-19,season_2019-20,season_2020-21,season_2021-22
4538,19.0,215.90,106.594120,87.00,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4539,20.0,215.90,106.594120,87.00,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4540,21.0,215.90,106.594120,87.00,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4541,22.0,215.90,106.594120,87.00,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4542,23.0,215.90,106.594120,87.00,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11408,20.0,203.20,106.594120,86.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
11409,21.0,203.20,105.233344,86.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
11410,22.0,200.66,105.233344,86.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
11411,23.0,200.66,105.233344,86.25,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


#### This is where we will do the bulk of the linear regression. I only want 10% of the data to go into the test set... I'm curious to see how this will work with the multiple award-winners.

In [53]:
my_player = pd.DataFrame({'player_height': 203.2, 'player_weight': 117.48, 'wingspan': 88.0}, index=[0])
aam_df = pd.concat([my_player, aam_df]).reset_index(drop = True)
aam_df = aam_df.fillna(0)
aam_df

Unnamed: 0,player_height,player_weight,wingspan,age,NBA_ABA_total,player_name_A.C. Green,player_name_A.J. Bramlett,player_name_A.J. Granger,player_name_A.J. Guyton,player_name_A.J. Price,...,season_2012-13,season_2013-14,season_2014-15,season_2015-16,season_2016-17,season_2017-18,season_2018-19,season_2019-20,season_2020-21,season_2021-22
0,203.20,117.480000,88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,203.20,117.480000,88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,203.20,117.480000,88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,203.20,117.480000,88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,215.90,106.594120,87.0,19.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
466,203.20,106.594120,86.25,20.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
467,203.20,105.233344,86.25,21.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
468,200.66,105.233344,86.25,22.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
469,200.66,105.233344,86.25,23.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [45]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.90,
    random_state=1)

lr = LinearRegression()
lr.fit(X_train, y_train)

# R^2 on the held-out 10%
round(lr.score(X_test, y_test), 3)

1.0

In [46]:
# R^2 on the training data
round(lr.score(X_train, y_train), 3)

1.0

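#### Wow, this is definitely unexpected, and a suspiciously optimal outcome. The most likely cause is leakage rather than the cleanup: NBA_ABA_total is a career total, so it is identical on every row for a given player, and the one-hot player_name columns tell the model exactly which player each row belongs to. Since most players in the test set also have seasons in the training set, the model can effectively look up the answer.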
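One way to see how a perfect score can arise is a small synthetic experiment. The sketch below assumes a made-up dataset of 40 players with 10 seasons each, a constant career total per player, and two features that are pure noise. It compares a random row split with one-hot player names against a split by player (using scikit-learn's `GroupShuffleSplit`) without the names. None of these numbers come from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 players x 10 seasons, one constant total per player
n_players, n_seasons = 40, 10
players = np.repeat([f"player_{i}" for i in range(n_players)], n_seasons)
totals = np.repeat(rng.integers(0, 16, size=n_players), n_seasons).astype(float)
toy = pd.DataFrame({
    "player_name": players,
    "age": rng.integers(19, 40, size=players.size).astype(float),
    "wingspan": rng.normal(82.5, 4.0, size=players.size),
    "NBA_ABA_total": totals,
})
y = toy["NBA_ABA_total"]

# 1) Random row split with one-hot names: the dummies identify the target
X_leaky = pd.get_dummies(toy.drop(columns="NBA_ABA_total"), dtype=float)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, train_size=0.90, random_state=1)
leaky_r2 = LinearRegression().fit(Xtr, ytr).score(Xte, yte)

# 2) Split by player and drop the name: no player is on both sides
X_plain = toy[["age", "wingspan"]]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=1)
train_idx, test_idx = next(splitter.split(X_plain, y, groups=toy["player_name"]))
model = LinearRegression().fit(X_plain.iloc[train_idx], y.iloc[train_idx])
honest_r2 = model.score(X_plain.iloc[test_idx], y.iloc[test_idx])

print(f"R^2 with name dummies, random split: {leaky_r2:.3f}")
print(f"R^2 without names, split by player:  {honest_r2:.3f}")
```

The first score comes out at essentially 1.0 even though age and wingspan carry no information here, while the second sits near zero or below. For the real model, that suggests dropping the player_name dummies and splitting by player before reading anything into the score.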

In [47]:
y_pred = lr.predict(X_test)

In [48]:
math.sqrt(mean_squared_error(y_test, y_pred))

9.408395391453971e-15
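An RMSE is easier to judge next to a baseline. The sketch below compares a model's RMSE against a baseline that always predicts the training mean, using scikit-learn's `DummyRegressor`. The target and prediction arrays are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical held-out targets and model predictions
y_test = np.array([3.0, 5.0, 2.0, 9.0, 1.0])
y_pred = np.array([2.5, 4.0, 3.0, 7.0, 2.0])
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Baseline: ignore the features and always predict the training mean
y_train = np.array([1.0, 1.0, 3.0, 5.0, 15.0])
placeholder_train = np.zeros((len(y_train), 1))  # features are unused
placeholder_test = np.zeros((len(y_test), 1))
baseline = DummyRegressor(strategy="mean").fit(placeholder_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(placeholder_test)))

print(f"model RMSE:    {rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

A model RMSE well below the baseline means the features are helping. An RMSE on the order of 1e-15, by contrast, is zero up to floating-point rounding.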

#### An RMSE of about 9.4e-15 is zero up to floating-point error, which matches the perfect score above: the model is reproducing the targets exactly, so this number says little about outliers like Tim Duncan or about how small the group of All-Defensive players is. I'm going to add my player's stats to the data and see what the model predicts as his All-Defensive selections! As a first check, the next cell predicts the first row of the training data.

In [49]:
aam_df_new = X_train[:1]
lr.predict(aam_df_new)

array([3.])

In [50]:
y_train[:1]

5759    3.0
Name: NBA_ABA_total, dtype: float64
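The check above scores a row the model has already seen. To score a new player, the row passed to `predict` needs the same columns, in the same order, as the training features. Here is a minimal sketch under the assumption that the model is fit on the three measurables only (no name or season dummies). The training rows are randomly generated placeholders, so the printed prediction is meaningless as a forecast.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["player_height", "player_weight", "wingspan"]

# Randomly generated placeholder training data (cm, kg, in)
rng = np.random.default_rng(0)
n_rows = 50
train = pd.DataFrame({
    "player_height": rng.normal(200.0, 9.0, n_rows),
    "player_weight": rng.normal(100.0, 12.0, n_rows),
    "wingspan": rng.normal(82.5, 4.0, n_rows),
})
train["NBA_ABA_total"] = rng.integers(1, 16, n_rows).astype(float)

model = LinearRegression().fit(train[features], train["NBA_ABA_total"])

# Player 1 from the text, with every measurement stored as a number
my_player = pd.DataFrame(
    {"player_height": [203.2], "player_weight": [117.48], "wingspan": [88.0]}
)
prediction = model.predict(my_player[features])
print(prediction)
```

Keeping the new player in a separate one-row frame, rather than concatenating him into the training data, also avoids having to fill in a made-up age and target for him.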

In [54]:
y_train

5759    3.0
5837    5.0
8573    2.0
6511    9.0
7283    1.0
       ... 
7953    2.0
5765    3.0
9481    8.0
7394    1.0
5312    6.0
Name: NBA_ABA_total, Length: 420, dtype: float64

In [55]:
X_train

Unnamed: 0,age,player_height,player_weight,wingspan,player_name_A.C. Green,player_name_A.J. Bramlett,player_name_A.J. Granger,player_name_A.J. Guyton,player_name_A.J. Price,player_name_AJ Hammons,...,season_2012-13,season_2013-14,season_2014-15,season_2015-16,season_2016-17,season_2017-18,season_2018-19,season_2019-20,season_2020-21,season_2021-22
5759,24.0,193.04,96.161504,82.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5837,31.0,210.82,120.201880,88.50,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8573,31.0,190.50,81.646560,79.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6511,29.0,182.88,79.378600,76.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7283,23.0,208.28,111.130040,84.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7953,25.0,210.82,120.201880,90.00,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5765,30.0,193.04,99.790240,82.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9481,29.0,200.66,104.326160,85.25,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7394,33.0,185.42,79.378600,77.75,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


#### Whoa, the player at index 9481 has 8 All-Defensive selections. His measurements are close to both of my players':

* Player 9481: height 200.66 cm, weight 104.326 kg, wingspan 85.25 in
* Player 1: height 203.2 cm, weight 117.48 kg, wingspan 88 in
* Player 2: height 205.74 cm, weight 106.594 kg, wingspan 91 in

#### It looks like the measurements of a high-performing NBA defender are roughly around the ones I picked, which is consistent with some of the all-time greats: Dennis Rodman, Scottie Pippen, and Ben Wallace, for example.

### Future improvements:

* Scrape websites to fill in gaps in wingspan and other measurables, then convert them to cm (height), kg (weight), and in (wingspan)
* Add stats for this year's combine players who haven't played in regular season games yet
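For the unit conversion step, a few small helpers would be enough. This is a sketch with hypothetical function names. The two constants are the exact definitions of the international inch and the avoirdupois pound.

```python
CM_PER_INCH = 2.54          # exact by definition
KG_PER_POUND = 0.45359237   # exact by definition


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


def cm_to_inches(cm):
    """Convert a length in centimetres to inches."""
    return cm / CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


# Player 1's 88 in wingspan expressed in centimetres
print(round(inches_to_cm(88), 2))   # 223.52
# Player 1's 203.2 cm height expressed in inches
print(round(cm_to_inches(203.2), 2))  # 80.0
```

Scraped heights listed in feet and inches would first need to be turned into total inches (feet times 12 plus inches) before calling `inches_to_cm`.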