## Collecting Data

The data for this project was collected from basketball-reference.com. Every player who appeared in at least one game of a season has both a game log page and an advanced game log page listing their stats from each game they played. Each team has a game log page and an advanced game log page as well. The code below scrapes the data from each player and team page, then cleans and processes it.

## Scraping Team Data

I'll use the 2017-2018 season as the example in this notebook, but the same process was applied to every season from 2014-2015 through the most recent games of the current season.
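
As a rough sketch of how the same scrape can be pointed at other seasons, the helper below builds a team's advanced game log URL from a team abbreviation and a season. The function name `team_adv_url` and its parameters are hypothetical; the URL pattern is the one used later in this notebook, where the 2017-2018 season is identified by its ending year, 2018.

```python
BASE_URL = 'https://www.basketball-reference.com'

def team_adv_url(team, season_end_year):
    """Build the advanced game log url for one team and one season.

    team_adv_url is a hypothetical helper, not part of the original
    notebook.  season_end_year is the year the season ends in, so the
    2017-2018 season is 2018.
    """
    return BASE_URL + '/teams/' + team + '/' + str(season_end_year) + '/gamelog-advanced/'

# the 2014-2015 season through the 2017-2018 season for one team
urls = [team_adv_url('BOS', year) for year in range(2015, 2019)]
print(urls[-1])
```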

In [3]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup  #the webscraping module
import requests                #for accessing the sites to be scraped

In [None]:
""" 

I am just focusing on the advanced stats
for each team, and this will grab those stats
from any team page that is input to the function

"""

def team_adv(url):
    res = requests.get(url)
    #make sure there isn't an error accessing the site
    #200 means the request was successful
    if res.status_code == 200:
        soup = BeautifulSoup(res.content, 'lxml')
        
        #set up an empty list for the column names
        col_names = []
        #this returns headers for all relevant columns
        for i in range(9, 36):   
            col_names.append(soup.find_all('table')[0]('thead')[0]('th')[i].text)
    
    #if there is a status error this will let us know
    #a 4xx code usually means a problem with the url the user entered
    #a 5xx code means the site is experiencing problems
    else:
        #stop here, since there is no page to parse
        raise RuntimeError('There was a ' + str(res.status_code) + ' error')
    row = []
    num_rows = len(soup.find_all('table')[0]('tbody')[0]('tr'))
    num_cols = len(soup.find_all('table')[0]('tbody')[0]('tr')[0]('td'))
    
    #this code returns all of the statistics that belong in each column
    for x in range(num_rows):
        for y in range(num_cols):
            
            #this if statement makes sure the code doesn't break when it 
            #gets to a line that doesn't have any stats in it
            if len(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')) == num_cols:
                row.append(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')[y].text)
            else:
                pass
    #right now the data is in one long list and needs
    #to be broken up into rows
    
    rows = []
    data = len(row)
    b = 0  #this is a counter for when we get to the end of a row
    for a in range(int(data / num_cols)):
        rows.append(row[b:(b + num_cols)])
        b += num_cols

    df = pd.DataFrame(rows, columns=col_names)
    return df
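
The step that breaks the one long list into rows can also be written as a single list comprehension. This is a minimal sketch on made-up values; `chunk_rows` is a hypothetical helper name, and the cell values are invented for illustration.

```python
def chunk_rows(flat, num_cols):
    """Split a flat list of cell values into rows of num_cols values.

    Any leftover values that do not fill a whole row are dropped,
    which matches the int(data / num_cols) loop in team_adv.
    """
    num_rows = len(flat) // num_cols
    return [flat[i * num_cols:(i + 1) * num_cols] for i in range(num_rows)]

# invented cell values: two games with three columns each
flat = ['1', '2017-10-17', 'W', '2', '2017-10-18', 'L']
print(chunk_rows(flat, 3))
```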

In [None]:
""" 
This function cleans the data set that the previous function
returns.  The first time it is run there is no dataframe to add
to yet, so pass None for df along with the team abbreviation
of the first team.  After that, each team can be passed into this
and be concatenated on the 0 axis until we have a dataframe that
includes all 30 teams

"""
def clean_team_adv(df, team):
    #run the above function on the selected team's game log page
    team = team_adv('https://www.basketball-reference.com/teams/' + team + '/2018/gamelog-advanced/')
    
    #rename the third column (index 2) to reflect whether this team was
    #the home team
    team.rename(columns = {list(team)[2]: 'Home'}, inplace = True)
    
    #this line will take care of the problem that this dataframe has 
    #multiple columns with the same name
    cols = []
    count = 1
    for column in team.columns:
        if column == 'Opp':
            cols.append('Opp_'+str(count))
            count+=1
            continue
        elif column == 'TOV%':
            cols.append('TOV%_'+str(count))
            count+=1
            continue
        elif column == 'eFG%':
            cols.append('eFG%_'+str(count))
            count+=1
            continue
        elif column == 'Home':
            cols.append('Home_'+str(count))
            count+=1
            continue
        elif column == 'FT/FGA':
            cols.append('FT/FGA_'+str(count))
            count+=1
            continue
        cols.append(column)
        
    team.columns = cols
    #renaming the rest of the columns to be more appropriately descriptive
    team.rename(columns={'Opp_2': 'opp',
                        'Opp_3': 'opp_score',
                        'Tm': 'score',
                        'W/L': 'win',
                        'eFG%_5': 'o_efg_pct',
                        'eFG%_9': 'd_efg_pct',
                        'TOV%_6': 'o_tov_pct',
                        'TOV%_10': 'd_tov_pct',
                        'Home_1': 'Home',
                        'FT/FGA_7': 'o_ft_fga',
                        'FT/FGA_11': 'd_ft_fga'}, inplace=True)
    
    #lowercase every column and connect multiple words
    #with an underscore. also replace certain characters
    team.columns = [x.lower().replace('%', '_pct') for x in team.columns]
    team.columns = [x.lower().replace('/', '_') for x in team.columns]
    
    #make the index of the dataframe the date of the game
    #and drop unnecessary columns
    team.index = team.date
    team.drop(columns=['g', 'date', 'home_4', 'home_8'], inplace=True)
    
    #changing the home and win columns to binary
    team.home = team.home.map(lambda x: 0 if x in ['@'] else 1)
    team.win = team.win.map(lambda x: 1 if x == 'W' else 0)
    
    #changing the datatypes from object to integer
    #or float where necessary
    ints = ['score', 'opp_score']
    floats = ['ortg', 'drtg', 'pace', 'ftr', '3par', 'ts_pct', 'trb_pct', 'ast_pct',
              'stl_pct', 'blk_pct', 'o_efg_pct', 'o_tov_pct', 'orb_pct', 'o_ft_fga', 
              'd_efg_pct','d_tov_pct', 'drb_pct', 'd_ft_fga']
    team[ints] = team[ints].astype(int)
    team[floats] = team[floats].astype(float)
    
    #this adds running average columns for all the relevant 
    #stats. each value averages only the games played before that
    #row, and the first game gets a 0. these will end up being
    #possible features for the model
    more_cols = ['score', 'ortg', 'drtg', 'pace',
             'ftr', '3par', 'ts_pct', 'trb_pct',
             'ast_pct', 'stl_pct', 'blk_pct', 'o_efg_pct',
             'o_tov_pct', 'orb_pct', 'o_ft_fga', 'd_efg_pct',
             'd_tov_pct', 'drb_pct', 'd_ft_fga']
    for col in more_cols:
        stat_list = list(team[col])
        avg = [0]
        for i in range(1, len(stat_list)):
            avg.append((np.sum(stat_list[:i]) / len(stat_list[:i])).round(2))
        team['avg_' + col] = avg
    
    #add this team data set to all the previous ones
    adv_df = pd.concat([df, team], axis=0)
    #return the completely cleaned and processed data
    return adv_df
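
The running averages built with the nested loop above can also be expressed with pandas' own `expanding` and `shift` methods. This is a sketch on invented scores rather than the code that was actually run; it assumes the goal is the same as above, where each game's average uses only the games before it and the first game gets a 0.

```python
import pandas as pd

# invented scores for three games, in date order
team = pd.DataFrame({'score': [100, 110, 90]})

# expanding().mean() averages every game up to and including the
# current one, and shift(1) moves each value down a row so that a
# game only sees the games played before it
team['avg_score'] = (team['score']
                     .expanding()
                     .mean()
                     .shift(1)
                     .fillna(0)
                     .round(2))
print(team)
```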

## Scraping Player Data

This requires many of the same steps used to scrape the team data. There are three functions here: one for the player's regular stats, one for the advanced stats, and a third that concatenates the two on the 1 axis and merges each player's games with the team they played against in that game. The third function also adds the fantasy points column as well as double-double and triple-double columns.
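
A double-double is a game with 10 or more in at least two of the five counting stats used below (points, rebounds, assists, blocks, steals), and a triple-double needs at least three. A plain Python sketch of that rule, with invented stat lines and a hypothetical helper name `double_flags`:

```python
STATS = ['pts', 'trb', 'ast', 'blk', 'stl']

def double_flags(game):
    """Return (dub_dub, trip_dub) as 0/1 flags for one game's stats."""
    # count how many of the five stats reached double digits
    doubles = sum(1 for stat in STATS if game[stat] >= 10)
    return int(doubles >= 2), int(doubles >= 3)

# invented stat lines for illustration
game_a = {'pts': 25, 'trb': 11, 'ast': 4, 'blk': 1, 'stl': 2}
game_b = {'pts': 14, 'trb': 10, 'ast': 12, 'blk': 0, 'stl': 1}
print(double_flags(game_a))
print(double_flags(game_b))
```

Note that a triple-double also sets the double-double flag, which is how the two columns are computed in the notebook.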

In [None]:
def player(url):
    res = requests.get(url)
    if res.status_code == 200:
        
        soup = BeautifulSoup(res.content, 'lxml')
        col_names = []
        header = len(soup.find_all('table')[7]('thead')[0]('tr')[0]('th'))
        for i in range(1, header):
            col_names.append(soup.find_all('table')[7]('thead')[0]('tr')[0]('th')[i].text)

        num_cols = len(col_names)
        num_rows = len(soup.find_all('table')[7]('tbody')[0]('tr'))
        
    else:
        #stop here, since there is no page to parse
        raise RuntimeError('There was a ' + str(res.status_code) + ' status error')
    row = []
        
    for x in range(num_rows):
        for y in range(num_cols):
            if len(soup.find_all('table')[7]('tbody')[0]('tr')[x]('td')) == num_cols:
                row.append(soup.find_all('table')[7]('tbody')[0]('tr')[x]('td')[y].text)
            else:
                pass
            
    rows = []
    data = len(row)
    b = 0
    for a in range(int(data / num_cols)):
        rows.append(row[b:(b + num_cols)])
        b += num_cols

    player = pd.DataFrame(rows, columns=col_names)
    return player

In [None]:
def player_adv(url):
    res = requests.get(url)
    if res.status_code == 200:
        
        soup = BeautifulSoup(res.content, 'lxml')
        col_names = []
        header = len(soup.find_all('table')[0]('thead')[0]('tr')[0]('th'))
        for i in range(10, header):
            col_names.append(soup.find_all('table')[0]('thead')[0]('tr')[0]('th')[i].text)

        num_cols = len(col_names)
        num_rows = len(soup.find_all('table')[0]('tbody')[0]('tr'))
        
    else:
        #stop here, since there is no page to parse
        raise RuntimeError('There was a ' + str(res.status_code) + ' status error')
    row = []
        
    for x in range(num_rows):    
        for y in range(9, num_cols + 9):
            if len(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')) == 22:
                row.append(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')[y].text)
            else:
                pass
            
    rows = []
    data = len(row)
    b = 0
    for a in range(int(data / num_cols)):
        rows.append(row[b:(b + num_cols)])
        b += num_cols

    player = pd.DataFrame(rows, columns=col_names)
    return player

In [None]:
"""
This function takes a player's game log page and advanced game log
page, runs the above two functions on them and concatenates them
together.  The df input is not used for the first player, but for
every player after so that we continue to concatenate to the bottom
until every player's game stats are included in the dataframe

"""

def complete(url, url_2, name, df=None):
    #run functions on the two urls and then concatenate them 
    complete_player = pd.concat([player(url), player_adv(url_2)], axis=1)
    #make the index the date of the game 
    complete_player.index=complete_player['Date']
    #renaming of columns just like with the team data
    complete_player.rename(columns = {list(complete_player)[4]: 'Home'}, inplace = True)
    cols = []
    count = 1
    for column in complete_player.columns:
        if column == 'Home':
            cols.append('Home_'+str(count))
            count+=1
            continue
        elif column == 'GmSc':
            cols.append('GmSc'+str(count))
            count+=1
            continue
        cols.append(column)
    complete_player.columns = cols

    complete_player.rename(columns={'Home_1': 'Home',
                                   'Home_2': 'Win_Loss',
                                   '+/-': 'plus_minus',
                                   'GmSc4': 'GmSc'}, inplace=True)
    complete_player.drop(columns=['G', 'GmSc3', 'Date', 'Age'], inplace=True)

    complete_player.plus_minus = [x.replace('+', '') for x in complete_player.plus_minus]
    
    for i in complete_player.columns:
        complete_player[i] = complete_player[i].map(lambda x: 0.0 if len(x) == 0 else x)

    
    complete_player['MP'] = complete_player['MP'].map(lambda x: float(x.replace(':', '.')))
    complete_player['MP'] = complete_player['MP'].map(lambda x: int(x))

    complete_player.columns = [x.lower().replace('%', '_pct') for x in complete_player.columns]

    complete_player.home = complete_player.home.map(lambda x: 0 if x in ['@'] else 1)
    
    
    ints = ['gs', 'fg', 'fga', '3p', '3pa', 'ft', 'fta', 'orb', 'drb', 'trb', 'ast',
           'stl', 'blk', 'tov', 'pf', 'pts', 'plus_minus', 'ortg', 'drtg']

    floats = ['3p_pct', 'fg_pct', 'ft_pct', 'ts_pct', 'efg_pct', 'orb_pct', 'drb_pct', 'trb_pct',
             'ast_pct', 'stl_pct', 'blk_pct', 'tov_pct', 'usg_pct', 'gmsc']

    complete_player[ints] = complete_player[ints].astype(int)
    complete_player[floats] = complete_player[floats].astype(float)

    #Here is where we create the columns for the triple double and double
    #double.  We also bring in the team data and add 'opp_' to the beginning
    #of each column name to show that these are the stats for the opposing
    #team that each player faced on a given day.

    stats = ['pts', 'trb', 'ast', 'blk', 'stl']
    complete_player['trip_dub'] = ((complete_player[stats] >= 10).sum(axis=1) >= 3).astype(int)
    complete_player['dub_dub'] = ((complete_player[stats] >= 10).sum(axis=1) >= 2).astype(int)
    
    #data is assumed to be the combined team dataframe built
    #with clean_team_adv above
    opponent = data[data['opp'] == complete_player['tm'].iloc[0]]
    opponent = opponent.sort_index()
    opponent.columns = [('opp_' + x) for x in opponent.columns]
    
    #since we are about to merge together player data and opponent data
    #we don't need to include these redundant columns that are in both
    #the player and team dataframes.  the names carry the 'opp_' prefix
    #that was just added
    skip_columns = ['opp_' + x for x in ['home', 'opp', 'win', 'score', 'opp_score']]
    
    cols = [col for col in opponent.columns if col not in skip_columns]
    
    #we are able to add the player's opponent for a particular game by
    #merging on the date index
    complete_player = pd.merge(complete_player,
                               opponent[cols],
                               left_index=True,
                               right_index=True)

    #adding the fantasy points column
    complete_player['fantasy_points'] = (complete_player.pts) \
                                        + (complete_player['3p'] * .5) \
                                        + (complete_player.trb * 1.25) \
                                        + (complete_player.ast * 1.5) \
                                        + (complete_player.stl * 2) \
                                        + (complete_player.blk * 2) \
                                        - (complete_player.tov * .5) \
                                        + (complete_player.dub_dub * 1.5) \
                                        + (complete_player.trip_dub * 3)
    
    #the 'tm' column is no longer needed, so reuse it to hold
    #the name of the player
    complete_player.rename(columns={'tm': 'player'}, inplace=True)
    complete_player['player'] = name
    #this saves an individual player's data for a season
    complete_player.to_csv('./players_17-18/' + name + '.csv')
    #this adds the player to the combined data for all players
    full_df = pd.concat([df, complete_player], axis = 0)
    
    return full_df
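
To make the scoring rule concrete, here is the same fantasy points formula applied to a single invented stat line in plain Python. The function name `fantasy_points` and the numbers are for illustration only; the weights are the ones used in `complete` above.

```python
def fantasy_points(pts, threes, trb, ast, stl, blk, tov, dub_dub, trip_dub):
    """Apply the weights from the complete function to one game."""
    return (pts
            + threes * 0.5
            + trb * 1.25
            + ast * 1.5
            + stl * 2
            + blk * 2
            - tov * 0.5
            + dub_dub * 1.5
            + trip_dub * 3)

# invented game: 20 points, 2 threes, 10 rebounds, 4 assists,
# 1 steal, 1 block, 2 turnovers, which is a double-double
example_points = fantasy_points(20, 2, 10, 4, 1, 1, 2, dub_dub=1, trip_dub=0)
print(example_points)
```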

## Automating the Process

There's a page on basketball-reference.com that lists every player who played during a season. We can pull those player names into a list, then iterate through the list and apply the above functions to each player. This process usually took around two and a half hours per season. By the end, however, I had over 117,000 rows and a choice of over 60 columns to build a model with.

In [None]:
#get each player's name 
stats = 'https://www.basketball-reference.com/leagues/NBA_2018_per_game.html'
res = requests.get(stats)
soup = BeautifulSoup(res.content, 'lxml')

In [None]:
#this code will find each player's name from the above page
col_names = []
header = len(soup.find_all('table')[0]('thead')[0]('tr')[0]('th'))
for i in range(1, header):
    col_names.append(soup.find_all('table')[0]('thead')[0]('tr')[0]('th')[i].text)

num_cols = len(col_names)
num_rows = len(soup.find_all('table')[0]('tbody')[0]('tr'))

row = []
        
for x in range(num_rows):
    #skip the repeated header rows, which have no stat cells
    if len(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')) == num_cols:
        row.append(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')[0].text)

In [None]:
#this code makes each player's name lowercase and connects
#the first and last names with an underscore so that it can 
#be inserted into the name of a csv file
names = [x.lower().replace(' ', '_') for x in row]

#when players switch teams, they are represented in multiple
#lines of the total players table.  the below code makes sure
#each player goes into the names list only once
from collections import OrderedDict
names = list(OrderedDict.fromkeys(names))

In [None]:
#getting the player ids from their unique url
col_names = []
header = len(soup.find_all('table')[0]('thead')[0]('tr')[0]('th'))
for i in range(1, header):
    col_names.append(soup.find_all('table')[0]('thead')[0]('tr')[0]('th')[i].text)

num_cols = len(col_names)
num_new_rows = len(soup.find_all('table')[0]('tbody')[0]('tr'))

new_row = []
        
for x in range(num_new_rows):
    #skip the repeated header rows, which have no stat cells
    if len(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')) == num_cols:
        new_row.append(soup.find_all('table')[0]('tbody')[0]('tr')[x]('td')[0]('a')[0]['href'])
        

In [None]:
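#this code pulls the player id off of the page so that
#it can be added to the url template as we are looping
#through each player to apply the above functions.  players
#who switched teams appear more than once, so duplicates are
#dropped here as well to keep this list lined up with names
players = list(dict.fromkeys(x.replace('.html', '') for x in new_row))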
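
Because the names and the player ids are read from the same table rows, one way to keep them lined up is to pair them first and drop repeats by id. This is a sketch with invented rows; `unique_players` is a hypothetical helper and the hrefs only mimic the shape of the links on the page.

```python
def unique_players(names, hrefs):
    """Keep the first (name, href) pair for each href, in page order."""
    seen = {}
    for name, href in zip(names, hrefs):
        # setdefault only stores the pair the first time an href appears
        seen.setdefault(href, (name, href))
    return list(seen.values())

# invented rows: the second player appears twice after switching teams
names_raw = ['Player One', 'Player Two', 'Player Two', 'Player Three']
hrefs_raw = ['/players/o/onepl01.html', '/players/t/twopl01.html',
             '/players/t/twopl01.html', '/players/t/threpl01.html']
pairs = unique_players(names_raw, hrefs_raw)
print(pairs)
```

Pairing by id rather than by name also keeps two different players who happen to share a name from being collapsed into one entry.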

In [None]:
#creating the list of unique player urls to loop through
#and apply the above functions.  first from the regular stats
#game log page
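player_list = []
#the loop variable is not called player so that it does not
#overwrite the player function defined above
for player_id in players:
    player_list.append('https://www.basketball-reference.com' + player_id + '/gamelog/2018')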

In [None]:
#next we grab the advanced stats
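adv_player_list = []
#the loop variable is not called player so that it does not
#overwrite the player function defined above
for player_id in players:
    adv_player_list.append('https://www.basketball-reference.com' + player_id + '/gamelog-advanced/2018')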

## Finally Iterate Through All the Players

Here is where we run the big functions from above on each player and add the results to the full season dataframe, which will include everyone who played in at least one game.

In [None]:
#run the functions on the first player to get an initial dataframe
#there is no dataframe to add to yet, so None is passed for df
total = complete(player_list[0], adv_player_list[0], names[0], None)

#now that we have a player dataframe that we can concatenate all
#the other player dataframes to, we will add the dataframe as an
#input to the function
for i in range(1, len(names)):
    total = complete(player_list[i], adv_player_list[i], names[i], total)
    #this will let us know that the cell is continuing to run by 
    #printing every time a new player dataframe has been completed
    print(names[i])
    

## Worth Noting

This will usually run to completion on its own. However, if a status error occurs while the above cell is running, change the start of the range in the for loop to the index of the first player that was not added before the error. We can tell where that is because each player's name is printed once their data has been added to the dataframe.
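
One way to avoid editing the range by hand is to catch the failure and remember the index that was being processed. The sketch below uses a stand-in `fake_complete` function and invented names so that it runs without touching the site; in the notebook the call inside the loop would be the real `complete` function, and the choice to stop on the first error is an assumption.

```python
def fake_complete(name, df):
    """Stand-in for complete(): fails on one name to imitate a bad request."""
    if name == 'player_c':
        raise RuntimeError('status error for ' + name)
    return (df or []) + [name]

def run_from(start, names, total=None):
    """Process names[start:] and return (total, next_index).

    next_index is len(names) when everything finished, otherwise the
    index of the player that failed, which is where to resume.
    """
    for i in range(start, len(names)):
        try:
            total = fake_complete(names[i], total)
        except RuntimeError as err:
            print('stopped at index', i, 'because of:', err)
            return total, i
    return total, len(names)

demo_names = ['player_a', 'player_b', 'player_c', 'player_d']
demo_total, next_index = run_from(0, demo_names)
print(demo_total, next_index)
```

Calling `run_from(next_index, demo_names, demo_total)` again would pick up at the player that failed, keeping everything collected so far.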