In [1]:
__author__ = "Victor Xu"
__email__ = "victor.c.xu@gmail.com"
__website__ = "victorxu.me"

__copyright__ = "Copyright 2019, Victor Xu"

# Problem Definition

The goal of this analysis is to predict NBA player positions from performance data such as shot counts, shot locations, and defensive stats. Each player in the league is assigned a position label such as point guard, shooting guard, or center.

### Why is predicting player position important?
If you are a basketball fan, you have probably noticed that player positions are inconsistent across data sources. For instance, ESPN and the league even use different position categories altogether.


| ESPN | NBA   |
|------|------|
|Center|Center|
|Point Guard|Guard|
|Shooting Guard|Guard|
|Small Forward|Forward|
|Power Forward|Forward|
|No Direct Translation|Guard-Forward|
|No Direct Translation|Center-Forward|
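
The correspondence in the table can also be written as a lookup. This is a minimal sketch: the names `ESPN_TO_NBA` and `to_nba_position` are illustrative and not part of this notebook's code, and the keys assume ESPN's usual abbreviations (C, PG, SG, SF, PF).

```python
# Illustrative lookup built from the table above (names are hypothetical).
ESPN_TO_NBA = {
    "C": "Center",
    "PG": "Guard",
    "SG": "Guard",
    "SF": "Forward",
    "PF": "Forward",
}

def to_nba_position(espn_position):
    """Return the coarser NBA label, or None for codes not listed in the table."""
    return ESPN_TO_NBA.get(espn_position)

print(to_nba_position("PG"))
print(to_nba_position("PF"))
```

Note that the mapping only goes one way: the NBA hybrid labels (Guard-Forward, Center-Forward) have no ESPN counterpart, so they cannot be produced by this lookup.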

Although we will not examine the differences between the NBA and ESPN labeling methodologies, looking at how ML algorithms classify players into ESPN categories can give us insight into the structure of the player labels. Specifically, it lets us examine whether the labels are well separated.

In [2]:
import re
import pandas as pd
import numpy as np

from src.espn_scraper import espn_player_scraper
from src.player_stat import get_player_stat, NoDataError

from sqlalchemy import create_engine, Integer
from tqdm import tqdm_notebook, tqdm
from nba_py import player
from time import sleep

## Data Acquisition
#### Scrape ESPN for player names and position labels with our scraper

In [3]:
# Scrape ESPN pages
teams_overview_url = "http://www.espn.com/nba/players"
scraper = espn_player_scraper()
espn_player_list = scraper.scrape_all_players(teams_overview_url)

100%|██████████| 30/30 [01:58<00:00,  3.67s/it]


In [4]:
espn_player_list.head()

Unnamed: 0,name,position,espn_player_id,url
0,Jaylen Brown,SG,3917376,http://www.espn.com/nba/player/_/id/3917376/ja...
1,Carsen Edwards,PG,4066407,http://www.espn.com/nba/player/_/id/4066407/ca...
2,Tacko Fall,C,3904625,http://www.espn.com/nba/player/_/id/3904625/ta...
3,Jonathan Gibson,PG,2234666,http://www.espn.com/nba/player/_/id/2234666/jo...
4,Javonte Green,SG,2596112,http://www.espn.com/nba/player/_/id/2596112/ja...


### Getting player name and NBA ID data from the Official NBA API
Here we will use nba_py by seemethere, a Python wrapper for the official but unpublished NBA API.

We will use the player performance data from the NBA API to predict each player's ESPN position label.

The official NBA API uses a different set of player IDs, so we have to join the ESPN and NBA data by cross-referencing player names. The NBA API also returns some players in the NBA Development League, which we are not interested in, so we will use a left join on the ESPN table.
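
To make the left-join idea concrete, here is a small sketch on toy data. The rows are made up for illustration, and `DataFrame.merge` with `indicator=True` is used here only as one convenient way to flag unmatched rows; the notebook itself joins with `DataFrame.join` further down.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are made up for illustration).
espn = pd.DataFrame({"name": ["player a", "player b", "player c"],
                     "position": ["PG", "C", "SF"]})
nba = pd.DataFrame({"display_first_last": ["player a", "player b", "g league player"],
                    "person_id": [1, 2, 3]})

# Left join keeps every ESPN row; NBA-only rows are dropped.
merged = espn.merge(nba, how="left", left_on="name",
                    right_on="display_first_last", indicator=True)

# Rows marked "left_only" found no NBA match and need manual review.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["name", "position"]])
```

The `_merge` column makes it easy to see which ESPN players still need their names reconciled before the join is complete.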

In [5]:
# Get players who were active in the last 3 seasons from the NBA API
nba_player_list_17 = player.PlayerList(season='2017-18').info()
nba_player_list_18 = player.PlayerList(season='2018-19').info()
nba_player_list_19 = player.PlayerList(season='2019-20').info()

# Concat the 3 dfs together
nba_player_list = pd.concat([nba_player_list_17,nba_player_list_18,nba_player_list_19])
nba_player_list = nba_player_list.drop_duplicates()

# Convert column names to lower case
nba_player_list.columns = [col.lower() for col in nba_player_list.columns]

# Compare the number of players in each source
print("NBA roster has {} players over past 3 seasons".format(nba_player_list.shape[0]))
print("ESPN roster has {} players in the current season.".format(espn_player_list.shape[0]))

NBA roster has 1016 players over past 3 seasons
ESPN roster has 564 players in the current season.


#### Examine the data

In [6]:
# Selecting columns of interest
nba_player_list = nba_player_list[['person_id','display_first_last']]
nba_player_list.display_first_last = nba_player_list.display_first_last.str.lower()
nba_player_list.head()

Unnamed: 0,person_id,display_first_last
0,203518,alex abrines
1,203112,quincy acy
2,201167,arron afflalo
3,201582,alexis ajinca
4,202332,cole aldrich


In [7]:
nba_player_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1016 entries, 0 to 514
Data columns (total 2 columns):
person_id             1016 non-null int64
display_first_last    1016 non-null object
dtypes: int64(1), object(1)
memory usage: 23.8+ KB


In [8]:
espn_player_list.head()

Unnamed: 0,name,position,espn_player_id,url
0,Jaylen Brown,SG,3917376,http://www.espn.com/nba/player/_/id/3917376/ja...
1,Carsen Edwards,PG,4066407,http://www.espn.com/nba/player/_/id/4066407/ca...
2,Tacko Fall,C,3904625,http://www.espn.com/nba/player/_/id/3904625/ta...
3,Jonathan Gibson,PG,2234666,http://www.espn.com/nba/player/_/id/2234666/jo...
4,Javonte Green,SG,2596112,http://www.espn.com/nba/player/_/id/2596112/ja...


In [9]:
espn_player_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 564 entries, 0 to 563
Data columns (total 4 columns):
name              564 non-null object
position          561 non-null object
espn_player_id    564 non-null object
url               564 non-null object
dtypes: object(4)
memory usage: 17.8+ KB


#### Check duplicates

In [10]:
nba_player_list.duplicated().sum()

236

In [11]:
espn_player_list.duplicated().sum()

3

In [12]:
espn_player_list[espn_player_list.duplicated(keep=False)]

Unnamed: 0,name,position,espn_player_id,url
124,Rodney McGruder,SF,2488826,http://www.espn.com/nba/player/_/id/2488826/ro...
157,Jawun Evans,PG,3912854,http://www.espn.com/nba/player/_/id/3912854/ja...
386,Brandon Goodwin,PG,3057198,http://www.espn.com/nba/player/_/id/3057198/br...
429,Rodney McGruder,SF,2488826,http://www.espn.com/nba/player/_/id/2488826/ro...
483,Brandon Goodwin,PG,3057198,http://www.espn.com/nba/player/_/id/3057198/br...
518,Jawun Evans,PG,3912854,http://www.espn.com/nba/player/_/id/3912854/ja...


In [13]:
espn_player_list = espn_player_list.drop_duplicates()

In [14]:
espn_player_list.duplicated().sum()

0

#### Check N/A

A small number of newly drafted rookies do not have a position assigned yet, so their positions are N/A.

In [15]:
nba_player_list.isna().sum()

person_id             0
display_first_last    0
dtype: int64

In [16]:
espn_player_list.isna().sum()

name              0
position          3
espn_player_id    0
url               0
dtype: int64

In [17]:
espn_player_list = espn_player_list.dropna()

### Joining 2 datasets together

We need the ESPN player position label, which lives in the ESPN table, and the NBA player ID, which lives in the NBA table.

We will later use the NBA player ID to call the official NBA API and retrieve player performance data, which is used to train our models.

| name | position | 
|------|------|
|Center|Center|
|Point Guard|Guard|

#### Cleaning before joining the 2 data sources

We will be joining on player names, which are formatted differently across ESPN and NBA records, so cleaning is required.



In [18]:
def sanitize_name(name_str):
    """Lowercase a player name and remove punctuation and name suffixes"""
    sanitized = name_str.lower()
    sanitized = sanitized.replace('-', ' ')
    
    # Drop periods and apostrophes
    for char in [".", "'"]:
        sanitized = sanitized.replace(char, '')
        
    # Remove suffixes only when they are whole words, so names
    # that merely contain "jr" or "sr" are left intact
    sanitized = re.sub(r"\b(jr|sr|iii)\b", '', sanitized)
        
    # Remove trailing spaces left behind by the suffix removal
    return sanitized.rstrip()
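
One thing to watch when removing suffixes is that a plain `str.replace` also matches inside names. The sketch below uses a hypothetical helper, `strip_suffix`, with a word-boundary regular expression to show the difference. It is an illustration only, not the function used in this notebook.

```python
import re

# Hypothetical helper, shown only to illustrate word-boundary matching.
SUFFIX_RE = re.compile(r"\b(?:jr|sr|iii)\b")

def strip_suffix(name):
    """Drop standalone 'jr', 'sr', 'iii' tokens and surrounding whitespace."""
    return SUFFIX_RE.sub("", name).strip()

print("jrue holiday".replace("jr", ""))   # substring match eats part of the first name
print(strip_suffix("jrue holiday"))       # word-boundary match leaves it intact
print(strip_suffix("tim hardaway jr"))
```

Because the same cleaning is applied to both tables, a mangled name would still join correctly, but keeping names intact makes the merged table easier to read and debug.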

In [19]:
# Getting a list of ESPN player names not in NBA player name list
espn_player_list[~espn_player_list.name.isin(nba_player_list.display_first_last)].head()

Unnamed: 0,name,position,espn_player_id,url
0,Jaylen Brown,SG,3917376,http://www.espn.com/nba/player/_/id/3917376/ja...
1,Carsen Edwards,PG,4066407,http://www.espn.com/nba/player/_/id/4066407/ca...
2,Tacko Fall,C,3904625,http://www.espn.com/nba/player/_/id/3904625/ta...
3,Jonathan Gibson,PG,2234666,http://www.espn.com/nba/player/_/id/2234666/jo...
4,Javonte Green,SG,2596112,http://www.espn.com/nba/player/_/id/2596112/ja...


In [20]:
# Sanitize player name for both dfs so they can later be used to join the tables
nba_player_list.display_first_last = nba_player_list.display_first_last.astype('str')
espn_player_list.name = espn_player_list.name.astype('str')

espn_player_list.name = espn_player_list.name.apply(sanitize_name)
nba_player_list.display_first_last = nba_player_list.display_first_last.apply(sanitize_name)

In [21]:
# Handle a few special cases where names are different across two data sources
espn_player_list.loc[espn_player_list.espn_player_id == '1713', 'name'] = 'nene'
espn_player_list.loc[espn_player_list.espn_player_id == '4017839', 'name'] = 'juancho hernangomez'
espn_player_list.loc[espn_player_list.espn_player_id == '3056247', 'name'] = 'kendrick nunn'
espn_player_list.loc[espn_player_list.espn_player_id == '2528586', 'name'] = 'walter lemon'
espn_player_list.loc[espn_player_list.espn_player_id == '3133602', 'name'] = 'svi mykhailiuk'
espn_player_list.loc[espn_player_list.espn_player_id == '4066508', 'name'] = 'charles brown'
espn_player_list.loc[espn_player_list.espn_player_id == '4395627', 'name'] = 'cameron reddish'

In [76]:
merged_player_list = espn_player_list.join(nba_player_list.set_index("display_first_last", drop=True),
                                    on='name')

In [77]:
# Check for rows that didn't join correctly
merged_player_list[merged_player_list.isnull().any(axis=1)]

Unnamed: 0,name,position,espn_player_id,url,person_id
175,cody demps,F,4028211,http://www.espn.com/nba/player/_/id/4028211/co...,
481,tyler cook,F,4066367,http://www.espn.com/nba/player/_/id/4066367/ty...,


In [78]:
merged_player_list.position.isnull().sum()

0

In [79]:
# Tyler Cook and Cody Demps were just traded and 
# have not played any games in their careers, so we're dropping them
merged_player_list = merged_player_list.dropna()
merged_player_list = merged_player_list.rename(columns={"person_id":"nba_id", 'position':'espn_position'})
merged_player_list.nba_id = merged_player_list.nba_id.astype('int')
merged_player_list = merged_player_list[['name','espn_position','nba_id']]

In [80]:
merged_player_list.head()

Unnamed: 0,name,espn_position,nba_id
0,jaylen brown,SG,1627759
1,carsen edwards,PG,1629035
2,tacko fall,C,1629605
3,jonathan gibson,PG,1626780
4,javonte green,SG,1629750


#### Check NA for final player information df

In [81]:
merged_player_list.isna().sum()

name             0
espn_position    0
nba_id           0
dtype: int64

### Getting Player Performance Data

Data source: NBA official API

The data includes the following:

- Position on the court where each shot was taken
- Shooting accuracy and frequency stats
- Blocking accuracy and frequency stats

In [28]:
# Getting unique set of player_ids to iterate over
nba_ids = merged_player_list.nba_id.unique()

print("Total of {} unique IDs".format(nba_ids.shape[0]))

Total of 557 unique IDs


In [29]:
dfs_to_concat = []
for player_id in tqdm(nba_ids):
    player_id = str(int(player_id))
    sleep(1) # Throttle requests to avoid being banned by the API
    try: 
        player_stat = get_player_stat(player_id)
        dfs_to_concat.append(player_stat)
    except NoDataError:
        # The player has no data and the API returned an empty df
        # print("No data for player id", player_id)
        continue

100%|██████████| 557/557 [37:15<00:00,  3.42s/it]
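
The loop above follows a common pattern: pause between requests and skip IDs that return nothing. Here is a self-contained sketch of that pattern. `fetch_all` and `fake_fetch` are hypothetical names, and the local `NoDataError` class is only a stand-in for the exception imported from `src.player_stat`.

```python
from time import sleep

class NoDataError(Exception):
    """Stand-in for the exception of the same name imported from src.player_stat."""

def fetch_all(ids, fetch, delay=1.0):
    """Call fetch(id) for each id, pausing between calls and skipping ids with no data."""
    results, skipped = [], []
    for player_id in ids:
        sleep(delay)  # pause between requests
        try:
            results.append(fetch(player_id))
        except NoDataError:
            skipped.append(player_id)  # remember which ids returned nothing
    return results, skipped

# Fake fetcher for demonstration; a real one would call the API.
def fake_fetch(player_id):
    if player_id == "2":
        raise NoDataError(player_id)
    return {"nba_id": player_id}

results, skipped = fetch_all(["1", "2", "3"], fake_fetch, delay=0)
print(len(results), skipped)
```

Returning the skipped IDs alongside the results makes it possible to report afterwards which players had no data, rather than silently dropping them.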


In [82]:
player_performance = pd.concat(dfs_to_concat, sort=False)

player_performance.index.name = "nba_id"
player_performance.index = player_performance.index.astype('str') 

merged_player_list = merged_player_list.set_index('nba_id')
merged_player_list.index = merged_player_list.index.astype('str')

player_performance = player_performance.join(merged_player_list[['espn_position']])

In [83]:
player_performance.head()

nba_id,shot_res,shot_in_paint,shot_mid_range,shot_lcorner_3,shot_rcorner_3,shot_above_3,fga,block_res,block_in_paint,block_mid_range,block_lcorner_3,block_rcorner_3,block_above_3,blka,oreb,dreb,ast,stl,min,espn_position
101106,0.666666,0.277778,0.0555555,0.0,0.0,0.0,36,0.0,0.99999,0.0,0,0.0,0.0,1,10.1,26.9,7.2,2.0,350.758333,C
101107,0.18894,0.173579,0.0522273,0.132104,0.06298,0.390169,651,0.619047,0.238095,0.0,0,0.047619,0.095238,21,3.5,14.5,4.4,2.9,4139.115,PF
101108,0.100418,0.211994,0.199442,0.0306834,0.0516039,0.405858,717,0.533333,0.333333,0.0666666,0,0.0,0.0666666,15,1.9,13.1,24.1,5.4,3703.331667,PG
101108,0.100418,0.211994,0.199442,0.0306834,0.0516039,0.405858,717,0.533333,0.333333,0.0666666,0,0.0,0.0666666,15,1.9,13.1,24.1,5.4,3703.331667,PG
101109,0.134328,0.179104,0.238806,0.0373134,0.0447761,0.365672,134,0.333332,0.333332,0.333332,0,0.0,0.0,3,1.4,8.4,13.7,3.0,1744.088333,PG


#### Checking dtypes, length with .info()

In [84]:
player_performance.info()

<class 'pandas.core.frame.DataFrame'>
Index: 647 entries, 101106 to 2772
Data columns (total 20 columns):
shot_res           647 non-null object
shot_in_paint      646 non-null object
shot_mid_range     647 non-null object
shot_lcorner_3     640 non-null object
shot_rcorner_3     643 non-null object
shot_above_3       647 non-null object
fga                647 non-null int64
block_res          647 non-null object
block_in_paint     646 non-null object
block_mid_range    647 non-null object
block_lcorner_3    640 non-null object
block_rcorner_3    643 non-null object
block_above_3      647 non-null object
blka               647 non-null int64
oreb               647 non-null float64
dreb               647 non-null float64
ast                647 non-null float64
stl                647 non-null float64
min                647 non-null float64
espn_position      647 non-null object
dtypes: float64(5), int64(2), object(13)
memory usage: 106.1+ KB


In [85]:
#player_performance.index = player_performance.index.astype('int')

#### Check Missing Values

In [86]:
player_performance.isna().any()

shot_res           False
shot_in_paint       True
shot_mid_range     False
shot_lcorner_3      True
shot_rcorner_3      True
shot_above_3       False
fga                False
block_res          False
block_in_paint      True
block_mid_range    False
block_lcorner_3     True
block_rcorner_3     True
block_above_3      False
blka               False
oreb               False
dreb               False
ast                False
stl                False
min                False
espn_position      False
dtype: bool

For players that have NA values, let's look at how many field goal attempts (shots) they have taken so far this season.

In [88]:
player_performance.loc[player_performance.isnull().any(axis=1),['fga']].join(merged_player_list, how='left')

nba_id,fga,name,espn_position
1628505,8,troy caupain,PG
1628961,3,kostas antetokounmpo,PF
1628961,3,kostas antetokounmpo,PF
1628961,3,kostas antetokounmpo,PF
1628961,3,kostas antetokounmpo,PF
1628994,0,george king,SF
1629055,2,donte grantham,SF
1629116,5,angel delgado,C
1629147,3,joe chealey,PG
1629147,3,joe chealey,PG


Evidently, these are rookies who have all taken fewer than 10 shots so far this season. That is not enough data to make predictions, so we will drop these players.
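
An explicit attempt threshold is another way to express the same idea as dropping rows with missing values. The sketch below uses toy data; the constant `MIN_FGA` and its value of 10 are assumptions chosen for illustration, not a cutoff used elsewhere in this notebook.

```python
import pandas as pd

MIN_FGA = 10  # assumed threshold, chosen here only for illustration

stats = pd.DataFrame(
    {"fga": [651, 3, 0, 134], "shot_in_paint": [0.17, None, None, 0.18]},
    index=pd.Index(["a", "b", "c", "d"], name="nba_id"),
)

# Keep only players with enough attempts for the shot-location ratios to be meaningful.
enough_shots = stats[stats["fga"] >= MIN_FGA]
print(enough_shots)
```

A threshold also catches low-volume players whose ratios happen to be complete but are still based on only a handful of shots.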

In [89]:
player_performance = player_performance.dropna()

#### Check Duplicates

In [90]:
player_performance.duplicated().sum()

193
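
A count this high suggests that some players appear more than once, as with `101108` in the preview above. A likely cause is the join on player names against an NBA table that still contained repeated entries. A minimal sketch of removing repeats by index on toy data:

```python
import pandas as pd

stats = pd.DataFrame(
    {"fga": [717, 717, 134], "espn_position": ["PG", "PG", "PG"]},
    index=pd.Index(["101108", "101108", "101109"], name="nba_id"),
)

# DataFrame.duplicated() compares column values only, so check the index explicitly.
print(stats.index.duplicated().sum())

# Keep the first row for each nba_id.
deduped = stats[~stats.index.duplicated(keep="first")]
print(deduped)
```

Checking the index separately matters here because two different players could in principle share identical stat rows, while a repeated `nba_id` always indicates the same player.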

## Loading all data into SQLite

In [93]:
def load_data_to_db(merged_player_list, player_performance):
    """Loads player name, position, and performance data into the db"""
    
    
    engine = create_engine("sqlite:///db/nba.db", echo=False)
    
    with engine.connect() as conn:
        merged_player_list.to_sql('players', 
                                conn, 
                                dtype={"player_id":Integer}, 
                                if_exists="replace")  
        
        player_performance.to_sql('player_stats', 
                                  conn,
                                 if_exists='replace')
        
        print("Successfully loaded.")
        
        
def read_data_from_db():
    """Retrieves player info and performance data from db
    
    Returns:
        merged_player_list: pd.DataFrame
            df containing player info including 
                -  player name, 
                -  espn player position, 
                -  nba player id
                
        player_performance: pd.DataFrame
            df containing player shooting and blocking stats
    """
    
    engine = create_engine("sqlite:///db/nba.db", echo=False)
    with engine.connect() as conn:
        merged_player_list = pd.read_sql('players', conn)
        player_performance = pd.read_sql('player_stats', conn)
        
        print("Successfully read in players data into dataframes")
        
        return merged_player_list, player_performance

In [94]:
load_data_to_db(merged_player_list, player_performance)

Successfully loaded.
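
To check that a table survives the round trip, a quick sketch with an in-memory SQLite database can help. It uses the standard-library `sqlite3` connection instead of the SQLAlchemy engine above, purely to keep the example self-contained, and the rows are made up for illustration.

```python
import sqlite3
import pandas as pd

players = pd.DataFrame(
    {"name": ["player a", "player b"], "espn_position": ["PG", "C"]},
    index=pd.Index([1, 2], name="nba_id"),
)

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
players.to_sql("players", conn, if_exists="replace")

# The index is written as an ordinary column; restore it when reading back.
restored = pd.read_sql("SELECT * FROM players", conn, index_col="nba_id")
conn.close()

print(restored)
```

Passing `index_col` on read matters because `to_sql` stores the DataFrame index as a regular column, so without it `nba_id` comes back as data rather than as the index.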
