# Project Setup

This notebook requires the sportsreference package (pip install sportsreference), which pulls data from the popular Sports Reference website. The documentation is available here: https://sportsreference.readthedocs.io/en/stable/sportsreference.html

The PyPI page for this package is here: https://pypi.org/project/sportsreference/


In [2]:
# pip install sportsreference

In [3]:
# Dependencies
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sportsreference.ncaab.roster import Player
from sportsreference.nba.roster import Roster

Creating the list of player objects.
This is done by going through each team, getting the player objects on its roster, and storing them in a single list. This project collects data only from players on an active roster, both to preserve accuracy and because the modern NBA changes quickly: attributes such as the mid-range shot are not as valuable today as they were years ago.

In [4]:
teams = ['ATL','BRK','BOS','CHO','CHI','CLE','DAL','DEN','DET','GSW',
         'HOU','IND','LAC','LAL','MEM','MIA','MIL','MIN','NOP','NYK',
         'OKC','ORL','PHI','PHO','POR','SAC','SAS','TOR','UTA','WAS']
player_list = []

for team in teams:
    # Roster(team) fetches the team's roster; .players holds its player objects.
    roster = Roster(team)
    for player in roster.players:
        player_list.append(player)
print("Successfully extracted player names")

Successfully extracted player names


Storing all stats in two dictionaries now. Each dictionary uses the player name as the key, and the value is the pandas dataframe that Player.dataframe gives us. This dataframe compiles many different stats, including some advanced stats. Players for whom the lookup raises a TypeError are skipped, so they appear in neither dictionary.

In [6]:
nba_player_info = {}
ncaa_player_info = {}
for nbaplayer in player_list:
    try:
        # Build the college player ID, e.g. "Kevin Durant" -> "kevin-durant-1".
        name = nbaplayer.name
        name = name.replace("'", "")
        name = name.replace(".", "")
        split_name = name.split()
        firstname = split_name[0].lower()
        lastname = split_name[1].lower()
        nameid = firstname + "-" + lastname + "-1"
        ncaa_player = Player(nameid)
        # Store both dataframes under the player's original display name.
        nba_player_info[nbaplayer.name] = nbaplayer.dataframe
        ncaa_player_info[nbaplayer.name] = ncaa_player.dataframe
    except TypeError:
        # Skip any player for whom these steps raise a TypeError.
        pass
print("Stored Everything in two dictionaries")

Stored Everything in two dictionaries
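
The ID built inside the loop above (lowercased first word, lowercased second word, and a -1 suffix) can be pulled out into a small helper so the rule is easier to check on its own. The sketch below is only an illustration and is not part of the sportsreference package: `college_player_id` and its `index` parameter are hypothetical names, and it assumes, as the loop does, that the player we want is the first one listed under that name.

```python
def college_player_id(name, index=1):
    """Build a college player ID such as 'kevin-durant-1' from a display name.

    Hypothetical helper mirroring the loop above. Returns None when the name
    has fewer than two words, since no ID can be formed in that case.
    """
    # Strip apostrophes and periods, as the original loop does.
    cleaned = name.replace("'", "").replace(".", "")
    parts = cleaned.split()
    if len(parts) < 2:
        return None
    # Only the first two words are used; suffixes such as "Jr" are ignored.
    first, last = parts[0].lower(), parts[1].lower()
    return f"{first}-{last}-{index}"


print(college_player_id("Kevin Durant"))
print(college_player_id("Jae'Sean Tate"))
```

Because the ID is guessed from the name alone, a player who shares a name with someone else, or who never played college basketball, may resolve to the wrong person or to nobody, so spot-checking a few results is worthwhile.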


Here is an example of how to get a stat from the information dictionary made earlier. It gives Kevin Durant's career true shooting percentage.

In [7]:
float(nba_player_info['Kevin Durant']['true_shooting_percentage']['Career'])

0.613

The code below builds two pandas dataframes, one for NBA statistics and one for NCAA statistics. The statistics come from the dataframe objects stored earlier, and the purpose is to make stats much easier to access than in the cell above. Each stat is stored as a career total, then averaged out later.

First we store the information in an NBA dataframe and print the names of the players we have to omit because there is not enough data on them. Often, these players have empty stats lists because they are technically on the roster but haven't played much yet.

In [8]:
nba_required_stats_list = []
nba_names_to_drop = []
for key, value in nba_player_info.items():
    try:
        raw_height = nba_player_info[key]['height']['Career'][-1].split('-')
        career_height = (float(raw_height[0]) * 12 ) + float(raw_height[1])
        career_weight = float(nba_player_info[key]['weight']['Career'])
        career_points = float(nba_player_info[key]['points']['Career'])
        career_games = float(nba_player_info[key]['games_played']['Career'])
        career_assists = float(nba_player_info[key]['assists']['Career'])
        defensive_rebounds = float(nba_player_info[key]['defensive_rebounds']['Career'])
        offensive_rebounds = float(nba_player_info[key]['offensive_rebounds']['Career'])
        career_turnovers = float(nba_player_info[key]['turnovers']['Career'])
        career_blocks = float(nba_player_info[key]['blocks']['Career'])
        career_steals = float(nba_player_info[key]['steals']['Career'])
        career_free_throw_percentage = float(nba_player_info[key]['free_throw_percentage']['Career'])
        career_three_point_percentage = float(nba_player_info[key]['three_point_percentage']['Career'])
        career_PER = float(nba_player_info[key]['player_efficiency_rating']['Career'])
        career_win_shares = float(nba_player_info[key]['win_shares']['Career'])
        off_win_shares = float(nba_player_info[key]['offensive_win_shares']['Career'])
        def_win_shares = float(nba_player_info[key]['defensive_win_shares']['Career'])
        career_field_goal_percentage = float(nba_player_info[key]['field_goal_percentage']['Career'])
        career_usage_percentage = float(nba_player_info[key]['usage_percentage']['Career'])
        vorp = float(nba_player_info[key]['value_over_replacement_player']['Career'][-1])
        boxplusminus = float(nba_player_info[key]['box_plus_minus']['Career'])
        true_shooting_percentage = float(nba_player_info[key]['true_shooting_percentage']['Career'])
        player_dict =  {'Name': key,
                        'Career Height': career_height,
                        'Career Weight': career_weight,
                        'Career Points': career_points,
                        'Career Games': career_games,
                        'Career Assists': career_assists,
                        'Career Def Rebounds':defensive_rebounds,
                        'Career Off Rebounds':offensive_rebounds,
                        'Career Turnovers': career_turnovers,
                        'Career Blocks': career_blocks,
                        'Career Steals': career_steals,
                        'Career Free Throw Percentage': career_free_throw_percentage,
                        'Career Three Point Percentage': career_three_point_percentage,
                        'Career Field Goal Percentage': career_field_goal_percentage,
                        'Career PER': career_PER,
                        'Career Win Shares': career_win_shares,
                        'Offensive Win Shares': off_win_shares,
                        'Defensive Win Shares': def_win_shares,
                        'Career Usage Percentage': career_usage_percentage,
                        'VORP': vorp,
                        'Box Plus Minus': boxplusminus,
                        'True Shooting Per': true_shooting_percentage
                        }
        nba_required_stats_list.append(player_dict)
    except (KeyError, TypeError):
        nba_names_to_drop.append(key)
        print(key)
nba_stats_df = pd.DataFrame(nba_required_stats_list)

Onyeka Okongwu
Skylar Mays
Nathan Knight
Reggie Perry
Kaiser Gates
Tacko Fall
Payton Pritchard
Aaron Nesmith
Robert Williams
Amile Jefferson
Keandre Cook
Nate Darling
Kahlil Whitney
Xavier Sneed
Grant Riller
Javin DeLaurier
Devon Dotson
Patrick Williams
Daniel Gafford
Simisola Shittu
Isaac Okoro
Dylan Windler
Charles Matthews
Marques Bolden
Lamar Stevens
Levi Randolph
Freddie Gillespie
Nate Hinton
Devonte Patterson
Tyler Bey
Tyrell Terry
Markus Howard
Zeke Nnaji
Greg Whittington
Isaiah Stewart
Saben Lee
Saddiq Bey
Kaleb Wesson
Nico Mannion
James Wiseman
Dwayne Sutton
Jae'Sean Tate
Mason Jones
Kenyon Martin Jr.
Rayshaun Hammonds
Amida Brimah
Cassius Stanley
Daniel Oturu
Kostas Antetokounmpo
Jontay Porter
Desmond Bane
Xavier Tillman Sr.
Sean McDermott
Ahmad Caver
Killian Tillie
Bennie Boatwright
Precious Achiuwa
Gabe Vincent
Jordan Nwora
Sam Merrill
Mamadi Diakite
E.J. Montgomery
Jaden McDaniels
Anthony Edwards
Tyler Cook
Ade Murkey
Ashton Hagans
Will Magnay
Ike Anigbogu
Naji Marshall
... (output truncated)
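
The cell above repeats the same lookup for every stat. One way to shorten it, shown here only as a sketch, is to drive the lookups from a mapping of column label to stat key. `STAT_COLUMNS` and `career_row` are hypothetical names, only a few of the stats are listed, and the stand-in `sample` is a plain nested dict rather than a real player dataframe, so the `[-1]` handling used for height and VORP is left out.

```python
# Column label -> stat key (a subset of the stats, for illustration only).
STAT_COLUMNS = {
    "Career Points": "points",
    "Career Games": "games_played",
    "Career Assists": "assists",
    "True Shooting Per": "true_shooting_percentage",
}


def career_row(name, frame, columns=STAT_COLUMNS):
    """Return one row of career stats, or None if any stat is missing."""
    row = {"Name": name}
    for label, key in columns.items():
        try:
            row[label] = float(frame[key]["Career"])
        except (KeyError, TypeError):
            # Same policy as the cell above: skip players with incomplete data.
            return None
    return row


# Stand-in shaped like frame[stat]["Career"], using Bruno Fernando's
# career values from this notebook's NBA table.
sample = {
    "points": {"Career": 240.0},
    "games_played": {"Career": 56.0},
    "assists": {"Career": 49.0},
    "true_shooting_percentage": {"Career": 0.542},
}
row = career_row("Bruno Fernando", sample)
incomplete = career_row("Incomplete Player", {"points": {"Career": None}})
print(row)
print(incomplete)
```

Adding or removing a stat then means editing one line of the mapping instead of three separate places in the loop.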

In [9]:
nba_stats_df

Unnamed: 0,Name,Career Height,Career Weight,Career Points,Career Games,Career Assists,Career Def Rebounds,Career Off Rebounds,Career Turnovers,Career Blocks,...,Career Three Point Percentage,Career Field Goal Percentage,Career PER,Career Win Shares,Offensive Win Shares,Defensive Win Shares,Career Usage Percentage,VORP,Box Plus Minus,True Shooting Per
0,Bruno Fernando,81.0,240.0,240.0,56.0,49.0,131.0,67.0,42.0,17.0,...,0.135,0.518,11.9,0.8,0.4,0.4,15.3,-0.4,-4.1,0.542
1,John Collins,81.0,235.0,2850.0,176.0,279.0,1041.0,511.0,300.0,185.0,...,0.369,0.571,21.0,16.4,12.2,4.2,21.4,3.9,1.1,0.634
2,Brandon Goodwin,72.0,180.0,229.0,50.0,65.0,63.0,12.0,32.0,4.0,...,0.301,0.385,11.3,0.1,-0.1,0.2,22.5,-0.2,-3.7,0.496
3,Trae Young,73.0,180.0,3327.0,141.0,1213.0,460.0,96.0,597.0,23.0,...,0.344,0.428,20.2,9.1,7.8,1.3,31.4,4.0,1.5,0.567
4,Solomon Hill,78.0,226.0,2128.0,364.0,575.0,921.0,241.0,334.0,86.0,...,0.336,0.395,9.1,10.7,2.4,8.3,13.2,-0.1,-2.0,0.516
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336,Russell Westbrook,75.0,200.0,20412.0,878.0,7298.0,4740.0,1471.0,3581.0,270.0,...,0.305,0.437,23.5,101.1,60.2,40.9,32.7,52.1,4.8,0.530
337,Moritz Wagner,83.0,245.0,600.0,88.0,80.0,231.0,73.0,105.0,32.0,...,0.299,0.494,13.4,1.9,0.8,1.1,19.7,-0.3,-3.0,0.597
338,Robin Lopez,84.0,281.0,7300.0,832.0,687.0,2294.0,1912.0,917.0,972.0,...,0.295,0.529,15.9,44.5,28.3,16.2,17.6,4.6,-1.0,0.566
339,Jerome Robinson,76.0,190.0,431.0,96.0,106.0,154.0,16.0,62.0,20.0,...,0.319,0.379,7.3,0.3,-0.7,1.0,16.5,-0.7,-4.1,0.480
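
Because the table stores career totals, per-game figures come from dividing by games played. Here is a minimal sketch of that averaging step, using two rows from the output above. `per_game` is a hypothetical helper, and rounding to one decimal place is an assumption made for display.

```python
def per_game(total, games):
    """Average a career total over games played; None when no games were played."""
    if not games:
        return None
    return total / games


# (career points, career games) copied from the table above.
totals = {"John Collins": (2850.0, 176.0), "Trae Young": (3327.0, 141.0)}
points_per_game = {
    name: round(per_game(points, games), 1)
    for name, (points, games) in totals.items()
}
print(points_per_game)  # {'John Collins': 16.2, 'Trae Young': 23.6}
```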


Doing the same for NCAA requires jumping through a couple more hoops, because not all players have complete data. For example, Tyler Johnson, who is in the NBA, has no NCAA data on height. To handle this, we create a function that returns -999 if no information is available and otherwise returns the corresponding stat as a float. This keeps the code from being cluttered with a ton of if statements. Height is handled separately in the stats cell, since it has to be parsed from a feet-inches string; a missing height is recorded as 0.

In [10]:
def stat_checker(stat):
    """Return stat as a float, or -999 if no value is available."""
    if stat is None:
        return -999
    return float(stat)
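
A sentinel such as -999 stays in the data as an ordinary number, so it can distort means or plots unless it is filtered out first. As an alternative sketch (not what this notebook does), the same check can return NaN, which pandas skips by default in aggregations such as mean(). `stat_or_nan` is a hypothetical name.

```python
import math


def stat_or_nan(stat):
    """Return the stat as a float, or NaN when no value is available."""
    if stat is None:
        return math.nan
    return float(stat)


values = [stat_or_nan(v) for v in ("0.613", None, 21)]
# NaN entries are easy to detect and drop before computing anything.
present = [v for v in values if not math.isnan(v)]
print(present)  # [0.613, 21.0]
```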

In [11]:
# VORP and PER are not collected for NCAA players.
ncaa_required_stats_list = []
ncaa_names_to_drop = []
for key, value in ncaa_player_info.items():
    try:
        raw_height = ncaa_player_info[key]['height']['Career'][-1]
        if raw_height is None:
            career_height = 0
        else:
            raw_height = raw_height.split('-')
            career_height = (float(raw_height[0]) * 12) + float(raw_height[1])
        career_weight = stat_checker(ncaa_player_info[key]['weight']['Career'][-1])
        career_points = stat_checker(ncaa_player_info[key]['points']['Career'][-1])
        career_games = stat_checker(ncaa_player_info[key]['games_played']['Career'][-1])
        career_assists = stat_checker(ncaa_player_info[key]['assists']['Career'][-1])
        defensive_rebounds = stat_checker(ncaa_player_info[key]['defensive_rebounds']['Career'][-1])
        offensive_rebounds = stat_checker(ncaa_player_info[key]['offensive_rebounds']['Career'][-1])
        career_turnovers = stat_checker(ncaa_player_info[key]['turnovers']['Career'][-1])
        career_blocks = stat_checker(ncaa_player_info[key]['blocks']['Career'][-1])
        career_steals = stat_checker(ncaa_player_info[key]['steals']['Career'][-1])
        career_free_throw_percentage = stat_checker(ncaa_player_info[key]['free_throw_percentage']['Career'][-1])
        career_three_point_percentage = stat_checker(ncaa_player_info[key]['three_point_percentage']['Career'][-1])
        career_win_shares = stat_checker(ncaa_player_info[key]['win_shares']['Career'][-1])
        off_win_shares = stat_checker(ncaa_player_info[key]['offensive_win_shares']['Career'][-1])
        def_win_shares = stat_checker(ncaa_player_info[key]['defensive_win_shares']['Career'][-1])
        career_field_goal_percentage = stat_checker(ncaa_player_info[key]['field_goal_percentage']['Career'][-1])
        career_usage_percentage = stat_checker(ncaa_player_info[key]['usage_percentage']['Career'][-1])
        boxplusminus = stat_checker(ncaa_player_info[key]['box_plus_minus']['Career'][-1])
        true_shooting_percentage = stat_checker(ncaa_player_info[key]['true_shooting_percentage']['Career'][-1])
        player_dict =  {'Name': key,
                        'Career Height': career_height,
                        'Career Weight': career_weight,
                        'Career Points': career_points,
                        'Career Games': career_games,
                        'Career Assists': career_assists,
                        'Career Def Rebounds':defensive_rebounds,
                        'Career Off Rebounds':offensive_rebounds,
                        'Career Turnovers': career_turnovers,
                        'Career Blocks': career_blocks,
                        'Career Steals': career_steals,
                        'Career Free Throw Percentage': career_free_throw_percentage,
                        'Career Three Point Percentage': career_three_point_percentage,
                        'Career Field Goal Percentage': career_field_goal_percentage,
                        'Career Win Shares': career_win_shares,
                        'Offensive Win Shares': off_win_shares,
                        'Defensive Win Shares': def_win_shares,
                        'Career Usage Percentage': career_usage_percentage,
                        'Box Plus Minus': boxplusminus,
                        'True Shooting Per': true_shooting_percentage
                        }
        ncaa_required_stats_list.append(player_dict)
    except KeyError as err:
        print(key, "Key Error: ", err)
    except TypeError as err:
        print(key, "Type Error: ", err)
    except AttributeError as err:
        print(key, "Attribute Error: ", err)
ncaa_stats_df = pd.DataFrame(ncaa_required_stats_list)

Reggie Perry Key Error:  'height'
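
Both stats cells convert a height string such as 6-9 into inches by hand: feet times 12, plus inches. A small helper, sketched below, would keep that arithmetic in one place. `height_to_inches` and its `missing` parameter are hypothetical; the default of 0 follows the NCAA cell above, which records 0 for a missing height.

```python
def height_to_inches(raw_height, missing=0):
    """Convert a 'feet-inches' string such as '6-9' to inches as a float."""
    if raw_height is None:
        return missing
    feet, inches = raw_height.split("-")
    return float(feet) * 12 + float(inches)


print(height_to_inches("6-9"))  # 81.0
print(height_to_inches(None))   # 0
```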


Now we have to trim the NCAA dataframe so that it includes only players who are also in the NBA dataframe. Recall that some players were left out of the NBA dataframe because they had not played enough minutes in the NBA to record data. So we need to make sure the names in the NCAA dataframe line up with the names in the NBA dataframe. This also ensures that every point on a plot of a college stat against an NBA stat represents an actual player, making it easier to see how stats carry over from college to the league.

In [13]:
nbanames = list(nba_stats_df["Name"])
ncaa_stats_df = ncaa_stats_df[ncaa_stats_df['Name'].isin(nbanames)]
ncaa_stats_df = ncaa_stats_df.reset_index(drop = True)
ncaa_stats_df

Unnamed: 0,Name,Career Height,Career Weight,Career Points,Career Games,Career Assists,Career Def Rebounds,Career Off Rebounds,Career Turnovers,Career Blocks,Career Steals,Career Free Throw Percentage,Career Three Point Percentage,Career Field Goal Percentage,Career Win Shares,Offensive Win Shares,Defensive Win Shares,Career Usage Percentage,Box Plus Minus,True Shooting Per
0,Bruno Fernando,82.0,240.0,770.0,64.0,89.0,413.0,145.0,148.0,101.0,33.0,0.763,0.308,0.595,8.3,4.5,3.8,23.1,8.1,0.638
1,John Collins,82.0,218.0,859.0,64.0,24.0,270.0,176.0,91.0,75.0,30.0,0.729,0.000,0.601,7.1,5.7,1.4,27.8,7.4,0.638
2,Brandon Goodwin,72.0,180.0,1651.0,126.0,477.0,396.0,116.0,290.0,11.0,137.0,0.716,0.301,0.474,13.3,8.9,4.3,24.1,2.2,0.563
3,Trae Young,74.0,180.0,876.0,32.0,279.0,111.0,14.0,167.0,8.0,54.0,0.861,0.360,0.422,5.7,4.6,1.1,37.1,11.1,0.585
4,Solomon Hill,79.0,220.0,1430.0,139.0,304.0,523.0,250.0,290.0,49.0,127.0,0.745,0.375,0.481,15.0,9.0,6.0,19.7,6.9,0.582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336,Russell Westbrook,75.0,187.0,619.0,75.0,191.0,112.0,69.0,121.0,8.0,77.0,0.685,0.354,0.464,6.2,2.3,3.9,21.8,-999.0,0.532
337,Moritz Wagner,82.0,210.0,1114.0,107.0,57.0,356.0,128.0,126.0,40.0,83.0,0.698,0.385,0.547,10.7,6.4,4.3,24.7,7.2,0.633
338,Robin Lopez,84.0,255.0,601.0,67.0,51.0,205.0,171.0,109.0,156.0,23.0,0.612,0.500,0.511,8.2,2.2,6.0,21.1,-999.0,0.535
339,Jerome Robinson,76.0,195.0,947.0,113.0,206.0,109.0,17.0,175.0,68.0,95.0,0.741,0.375,0.440,10.0,5.1,4.8,-999.0,-999.0,0.557
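
The isin filter above keeps only NCAA rows whose name appears in the NBA dataframe. It does not check the opposite direction, so a quick sanity check before plotting is to confirm that both sides hold the same names. The sketch below does this on plain lists of row dicts shaped like `nba_required_stats_list` and `ncaa_required_stats_list`. `align_by_name` is a hypothetical helper, and the sample rows are abbreviated to the Name field, with one made-up entry.

```python
def align_by_name(nba_rows, ncaa_rows):
    """Keep only rows whose 'Name' appears in both lists, in their original order."""
    shared = {r["Name"] for r in nba_rows} & {r["Name"] for r in ncaa_rows}
    nba_kept = [r for r in nba_rows if r["Name"] in shared]
    ncaa_kept = [r for r in ncaa_rows if r["Name"] in shared]
    return nba_kept, ncaa_kept


nba_rows = [{"Name": "John Collins"}, {"Name": "Trae Young"}]
ncaa_rows = [
    {"Name": "John Collins"},
    {"Name": "Hypothetical Player"},  # made-up row with no NBA counterpart
    {"Name": "Trae Young"},
]
nba_kept, ncaa_kept = align_by_name(nba_rows, ncaa_rows)
print([r["Name"] for r in nba_kept])
print([r["Name"] for r in ncaa_kept])
```

If the two lists could ever be in different orders, sorting both by Name (or merging the dataframes on Name) keeps each college value paired with the right NBA value.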
