In this notebook I export all of the player FanDuel data for the 2020 NBA season. The data was pulled from http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=10&day=22&year=2019 using Beautiful Soup and by iterating over the dates in the site's dropdown menu.

I eventually want to clean the table up further to include stats, but this first pass was to prove that I could pull the records for each night of games and then merge them by date and by player. That makes it possible to look at FanDuel points scored vs. FanDuel price in aggregate, along with other details such as team, opponent, and whether the player was at home or on the road.

In [1]:
#pulling all relevant libraries 
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time

The code below was used to scrape the site. I ran it once and then commented it out after saving the tables as a CSV, to avoid repeatedly pulling the data from the web.

In [2]:
##realized the url just updated based on dates, so wrote a loop to accumulate all of the urls
#years = [2019,2020]
#months = list(range(1,13,1))
#days = list(range(1,32,1))

#urls = []
#for year in years:
#    for month in months:
#        for day in days:
#            urls.append(str('http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon='+str(month)+'&day='+str(day)+'&year='+str(year)))

##The 2020 season kicked off October 22, 2019 so I want to have a list with just the information for those dates
##finding index number of url that matches that day
#urls.index('http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=10&day=22&year=2019')
##ended up being 300
##finding index that matches the last day I am including - March 10, the day before games were halted. I am not counting playoffs or post-pandemic games since not all teams are playing
#urls.index('http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=3&day=10&year=2020')
##ended up being 443
##saving list just for dates that would be in 2020 season
#seasonurls = urls[300:444]

##we should now have the page we need for each url

This is the script that scrapes all of the player data. I only ran it once and saved the result as a CSV (in the cell after this one), because running it is very time-consuming.

In [3]:
##creating an empty list that will have the player row for each table

#rows = []

##Running a loop through all the urls to get every piece of data in one list
#scraping each url
#for url in urls:
#    webpage = requests.get(url)
#    webpage_content = webpage.content
#    soup = BeautifulSoup(webpage_content,'html.parser')
#    table_rows = soup.find_all('td')
    #pulling just the player data, which starts at index 24 of the td cells.
    #We also want to add some kind of date identifier, so for now I'm using the url
#    for row in table_rows[24:]:
#        rows.append([row.get_text(),url])
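
Since the scrape is slow, one option is to cache each page on disk so a re-run only downloads pages it has not seen. The following is a rough sketch, not what the notebook does: `cached_fetch`, `fake_fetch`, and the file-naming scheme are hypothetical, and the downloader is passed in as a function so the example runs without network access. In real use it could be something like `lambda u: requests.get(u).text`.

```python
import tempfile
from pathlib import Path

def cached_fetch(url, cache_dir, fetch):
    """Return the page text for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # hypothetical naming scheme: a filesystem-safe version of the query string
    name = url.split('?', 1)[-1].replace('&', '_').replace('=', '-') + '.html'
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding='utf-8')
    text = fetch(url)
    path.write_text(text, encoding='utf-8')
    return text

# demo with a stand-in downloader that records how often it is called
calls = []
def fake_fetch(url):
    calls.append(url)
    return '<td>SG</td>'

demo_dir = tempfile.mkdtemp()
demo_url = 'http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=10&day=22&year=2019'
first = cached_fetch(demo_url, demo_dir, fake_fetch)
second = cached_fetch(demo_url, demo_dir, fake_fetch)
```

The second call is served from disk, so the stand-in downloader is only invoked once.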

In [4]:
##create data frame
#sample_frame = pd.DataFrame.from_records(rows).reset_index()
##save to csv
#sample_frame.to_csv('All_rows.csv')

The above code can be uncommented if you ever need to pull the information again, for example in 2021. From here on, the notebook reads from the "All_rows" CSV I saved, which has all data from the beginning of the 2020 season through March 10, the day before games were halted due to COVID.

From here on, I'm going to approach the problem using that static CSV, starting with creating a data frame.

In [5]:
#create frame from saved csv
new_frame = pd.read_csv('All_rows.csv')
#rename columns
new_frame.columns = ['Row','ID','Data','URL']

In [6]:
#clean up column with urls so it just has date
dates = new_frame['URL'].map(lambda x: x.replace('http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=','')\
                             .replace('&year=','/')\
                             .replace('&day=','/'))
new_frame['Date'] = dates

#create separate frame that removes all the url columns
date_frame = new_frame[['Data','Date']].reset_index(drop=True)
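
The chained `replace` calls above depend on the exact URL prefix and on the order of the parameters. A small sketch of an alternative, assuming the URLs always carry `mon`, `day`, and `year` query parameters, is to parse the query string with the standard library. `url_to_date` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs

def url_to_date(url):
    """Build a 'month/day/year' string from the mon, day and year query parameters."""
    query = parse_qs(urlparse(url).query)
    return '/'.join(query[key][0] for key in ('mon', 'day', 'year'))

example = url_to_date('http://rotoguru1.com/cgi-bin/hyday.pl?game=fd&mon=10&day=22&year=2019')
```

It could be applied to the frame with `new_frame['URL'].map(url_to_date)`.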

Everything from here to the next text cell is me clearing out rows that were in the table but did not have player information.

In [7]:
#We need to remove some ads and nav stuff here by
#converting to a series, finding the ones that match, and adding back to the table
find_Ads = date_frame['Data']
#create series with the position where 'RotoGuru' starts in each cell (-1 when it is absent)
ads_found = find_Ads.str.find('RotoGuru')
#add that position to the table as a column
date_frame['Ad'] = ads_found
#create new table without the rows where 'RotoGuru' starts at position 1
no_ads = date_frame[date_frame['Ad'] != 1].reset_index(drop=True)

In [8]:
#Repeating to remove Jump to:
#converting to a series, finding the ones that match, and adding back to the table
find_jump = no_ads['Data']
#create series with the position where 'Jump to:' starts in each cell (-1 when it is absent)
jump_found = find_jump.str.find('Jump to:')
#add that position to the table as a column
no_ads['Remove'] = jump_found
#create new table without the rows that start with 'Jump to:' (position 0)
jump_gone = no_ads[no_ads['Remove'] !=0].reset_index(drop=True)

In [9]:
#There's a term 'Unlisted' that pops up occasionally and breaks everything. I'm clearing that here
find_unlisted = jump_gone['Data']
#create series with the position where 'Unlisted' starts in each cell (-1 when it is absent)
unlisted_found = find_unlisted.str.find('Unlisted')
#add that position to the table as a column
jump_gone['Z'] = unlisted_found
#create new table without the rows that start with 'Unlisted' (position 0)
unlisted_gone = jump_gone[jump_gone['Z'] != 0].reset_index(drop=True)

In [10]:
##There are subtable headers that aren't player data. We are getting rid of most of those here. Min can't go
#in this list because, with case=False, it would also match Al-Farouq Aminu
##creates list of all the words I want to find and get rid of
sub = ['Forward','Center','FD Points','Salary','Team','Opp.','Score','Stats','Unlisted']
pattern = '|'.join(sub)

unlisted_gone['gone'] = unlisted_gone['Data'].str.contains(pattern, case=False)


In [11]:
#remove any rows where we found those subtable headers
cleaner_table = unlisted_gone[unlisted_gone['gone'] != True].reset_index(drop=True)
#specifically remove Min
#converting to a series, finding the ones that match, and adding back to the table
find_min = cleaner_table['Data']
#create series with the position where 'Min' starts in each cell (-1 when it is absent)
min_found = find_min.str.find('Min')
#add that position to the table as a column
cleaner_table['Remove'] = min_found
#create new table without the rows that start with 'Min' (position 0)
clean_table = cleaner_table[cleaner_table['Remove'] !=0].reset_index(drop=True)
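
The four filtering passes above could also be folded into one boolean mask. This is only a sketch on made-up cells: it assumes the header labels appear as whole cells and that the nav markers appear at the start of a cell, which may not hold for every page. The names `cells`, `headers`, `prefixes`, and `players_only` are illustrative.

```python
import re
import pandas as pd

# toy cells standing in for date_frame['Data']; the ad and nav strings are made up
cells = pd.DataFrame({
    'Data': ['RotoGuru ad text', 'Jump to: Guards', 'SG', 'Beasley, Malik',
             'Unlisted', 'Min', 'Aminu, Al-Farouq', 'FD Points'],
    'Date': ['1/1/2019'] * 8,
})

# assumption: these labels appear as whole cells, so exact matching is enough
headers = ['Forward', 'Center', 'FD Points', 'Salary', 'Team', 'Opp.', 'Score', 'Stats']
# assumption: these markers appear at the start of a cell
prefixes = ['Jump to:', 'Unlisted', 'Min']
prefix_pattern = '|'.join(re.escape(p) for p in prefixes)

is_ad = cells['Data'].str.contains('RotoGuru', regex=False)
is_prefix = cells['Data'].str.match(prefix_pattern)  # match is anchored at the start
is_header = cells['Data'].isin(headers)

players_only = cells[~(is_ad | is_prefix | is_header)].reset_index(drop=True)
```

Matching 'Min' only at the start of a cell, and case-sensitively, keeps "Aminu, Al-Farouq" in the table.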

Now that we've cleared all the random stuff out, we're going to try to create a row per player per game and include the date played.

In [12]:
##create series with the data
just_data = clean_table[['Data','Date']].reset_index(drop=True)

In [13]:
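 #clearing out a row with an "unlisted" entry
 #note: drop returns a new frame and the result is not assigned, so this only displays it; just_data itself is unchanged
just_data.drop([4158])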

Unnamed: 0,Data,Date
0,SG,1/1/2019
1,"Beasley, Malik",1/1/2019
2,40.7,1/1/2019
3,"$4,500",1/1/2019
4,den,1/1/2019
...,...,...
572710,den,9/22/2020
572711,v lal,9/22/2020
572712,114-106,9/22/2020
572713,4:35,9/22/2020


Now we have a frame with every piece of information and a date, but we need to make it one row per player per game

In [14]:
#merging data and date into one column, separated by '|', so that
#each player row can later carry its date once instead of on every value
just_data['merge_date'] = just_data['Data'].astype(str)+'|'+just_data['Date']

In [15]:
#turning my merged column into a list so I can run a comprehension and then add the date to the end of a player row
just_datas = just_data['merge_date']


In [16]:
data_list = list(just_datas)

In [17]:
#Each player row is 9 entries, so this slices the list into chunks of 9, one chunk per player
##This gets thrown off very easily though: one extra or missing cell shifts every chunk after it
players = [data_list[x:x+9] for x in range(0, len(data_list), 9)]

In [18]:
#now we have a row per player, but every piece of data has the date attached. So we pull the date out once and put it at the front of the row
for player in players:
    player.insert(0,player[0].split('|')[1])

In [19]:
#creating list that has each player entry as its own record: the date first, then the nine values with the '|date' suffix removed
player_rows = []
for player in players:
    #skip a short final chunk that does not hold a full player row
    if len(player) == 10:
        player_rows.append([player[0]] + [cell.split('|')[0] for cell in player[1:]])

In [20]:
sample_frame = pd.DataFrame.from_records(player_rows).reset_index(drop=True)
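
Because fixed-width chunking is easy to throw off, it can help to flag chunks that look misaligned. The sketch below is one possible check, not the notebook's method: it keeps each value paired with its date as a tuple (instead of a merged `value|date` string) and marks any chunk whose cells span more than one date. `chunk_rows` and the sample cells are hypothetical.

```python
def chunk_rows(cells, width=9):
    """Group a flat list of (value, date) pairs into rows of `width` values."""
    rows = []
    # stop before a short final chunk that cannot be a full player row
    for start in range(0, len(cells) - width + 1, width):
        block = cells[start:start + width]
        dates = {cell_date for _, cell_date in block}
        rows.append({
            'date': block[0][1],
            'values': [value for value, _ in block],
            # a block that straddles two dates means the 9-cell alignment slipped
            'aligned': len(dates) == 1,
        })
    return rows

sample_values = ['SG', 'Beasley, Malik', '40.7', '$4,500', 'den', 'v nyk',
                 '115-108', '29:56', '23pt 6rb 5as 1st 5trey 8-15fg 2-2ft']
cells = [(value, '1/1/2019') for value in sample_values]
# a second block that is deliberately misaligned: it straddles two dates
cells += [(value, '1/1/2019') for value in sample_values[:5]]
cells += [(value, '1/2/2019') for value in sample_values[:4]]

rows = chunk_rows(cells)
```

A date check only catches slips that cross a day boundary, so it is a partial safeguard rather than a full validation.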

We now have a frame with records for every player for every game. It's time to clean things up!

In [21]:
sample_frame.columns = ['Date,','Position','Name','Fanduel_Points','Fanduel_Price','Team','Opponent','Score','Minutes','Stats']

Unnamed: 0,"Date,",Position,Name,Fanduel_Points,Fanduel_Price,Team,Opponent,Score,Minutes,Stats
0,1/1/2019,SG,"Beasley, Malik",40.7,"$4,500",den,v nyk,115-108,29:56,23pt 6rb 5as 1st 5trey 8-15fg 2-2ft


I'm now going to clean up the price column, and then use it to delete non-player rows by making a new frame that drops any rows that do not have a price.

In [22]:
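#Remove the thousands separator and the dollar sign so the column can be converted to a number
#regex=False treats '$' as a literal character rather than the regex end-of-string anchor
sample_frame['Fanduel_Price'] = sample_frame['Fanduel_Price'].str.replace(',', '', regex=False)
sample_frame['Fanduel_Price'] = sample_frame['Fanduel_Price'].str.replace('$', '', regex=False)
#Turn the column to numbers; anything that is not a price becomes NaN
sample_frame['Fanduel_Price'] = pd.to_numeric(sample_frame['Fanduel_Price'], errors='coerce')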

In [23]:
#There is no price listed for some players. I am removing them since the whole goal is to see who exceeds their price
df1 = sample_frame[~sample_frame['Fanduel_Price'].isna()].reset_index(drop=True)
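
As a compact illustration of the price clean-up and the drop that follows, here is a sketch on a made-up series. It strips both symbols in one pass with a character class and lets `errors='coerce'` turn non-prices into NaN. The sample values other than `$4,500` are invented.

```python
import pandas as pd

# made-up price cells: two real-looking prices, one junk value, one missing
prices = pd.Series(['$4,500', '$10,200', 'N/A', None])

# [$,] is a character class, so '$' is literal here
cleaned = pd.to_numeric(prices.str.replace(r'[$,]', '', regex=True), errors='coerce')

# keep only the rows that ended up with a usable price
with_price = cleaned[~cleaned.isna()].reset_index(drop=True)
```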

In [24]:
#convert DNP in Minutes column to 0's. Had to make 0:0 to allow split function to work for players with actual minutes
df1['NoDNP'] = df1['Minutes'].apply(lambda x: '0:0' if str(x) == 'DNP' else x)
#convert NA in Minutes column to 0's
df1['NoNaN'] = df1['NoDNP'].apply(lambda x: '0:0' if str(x) == 'nan' else x)
#convert mm:ss to total minutes as a float by splitting and then adding
df1['Minutes_Played']= df1['NoNaN'].str.split(':').apply(lambda x: int(x[0]) + ((int(x[1])/60)))
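
The three steps above can also be expressed as a single function, which avoids the temporary `NoDNP` and `NoNaN` columns. This is a sketch under the same assumptions as the cell above (values are `mm:ss`, `DNP`, or missing); `to_minutes` is a hypothetical name.

```python
def to_minutes(value):
    """Convert an 'mm:ss' cell to minutes as a float; DNP and missing values count as 0."""
    text = str(value)
    if text in ('DNP', 'nan'):
        return 0.0
    minutes, seconds = text.split(':')
    return int(minutes) + int(seconds) / 60

played = to_minutes('29:56')
```

It could be applied with `df1['Minutes'].apply(to_minutes)`.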

In [25]:
#Convert Fanduel_Points to float
df1['Fanduel_Points'] = pd.to_numeric(df1['Fanduel_Points'])

In [26]:
#create column for home vs away and updated column for opponent
df1['Split'] = df1['Opponent'].str[0]
df1['Home'] = df1.Split.apply(lambda x: 'Home' if str(x) == 'v' else 'Away')
df1['Foe'] = df1['Opponent'].str[1:]
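
For a cell such as `v nyk`, slicing with `str[1:]` keeps the space in front of the team code. A small sketch that separates the marker from the team and trims the whitespace is shown below. It assumes the marker and team are separated by a space, as in the `v nyk` sample; the `@` marker in the second call is a guess at how away games are written, and `split_opponent` is a hypothetical name.

```python
def split_opponent(opponent):
    """Return (venue, team) from a cell such as 'v nyk'."""
    marker, _, team = opponent.strip().partition(' ')
    # same rule as the cell above: 'v' means home, anything else means away
    venue = 'Home' if marker == 'v' else 'Away'
    return venue, team.strip()

home_game = split_opponent('v nyk')
away_game = split_opponent('@ lal')  # '@' is an assumed away marker
```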

In [27]:
#clearing out some columns
del df1['NoDNP']
del df1['NoNaN']
del df1['Split']
del df1['Opponent']

In [28]:
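#get rid of some carets that are appearing
#regex=False so '^' is treated as a literal caret, not the regex start-of-string anchor
df1['Name'] = df1['Name'].str.replace('^','', regex=False)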

I am now going to work on breaking the statistics out into each of their own categories

In [30]:
#turning stats into its own data frame so I can play around without dealing with the other stuff
stats = df1['Stats'].reset_index()
stats.head()

Unnamed: 0,index,Stats
0,0,23pt 6rb 5as 1st 5trey 8-15fg 2-2ft
1,1,27pt 5rb 4as 1bl 2to 5trey 11-23fg
2,2,25pt 6rb 6as 1st 5to 1trey 8-21fg 8-8ft
3,3,14pt 9rb 8as 1st 1bl 4to 6-13fg 2-6ft
4,4,19pt 5rb 5as 3st 4to 3trey 7-19fg 2-2ft


Our big problem is that the stats go in this order: (pt, rb, as, st, bl, to, trey, fg, ft). But if someone doesn't record one of those stats, the line doesn't say 0; it just skips to the next one. So we'd see 19pt 8as instead of 19pt 0rb 8as. My idea is to break these into columns by splitting on the space, and then look for the corresponding statistic identifier in each column. If the identifier isn't in the column where it would normally be, I look for it in the columns to its left, since a skipped stat shifts everything after it over by one.
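
To make the skipped-stat problem concrete, here is a minimal sketch that reads a stat line token by token and starts every stat at 0, so a missing stat stays 0 regardless of position. It uses Python's `re` module rather than column positions, and is an illustration only; `parse_stats`, `TOKEN`, and the output key names are hypothetical.

```python
import re

COUNT_STATS = ['pt', 'rb', 'as', 'st', 'bl', 'to', 'trey']
# a token is a number, an optional '-number', then a lowercase suffix, e.g. '6rb' or '8-15fg'
TOKEN = re.compile(r'^(\d+)(?:-(\d+))?([a-z]+)$')

def parse_stats(line):
    """Parse a line such as '19pt 8as 7-19fg' into a dict with 0 for missing stats."""
    parsed = {name: 0 for name in COUNT_STATS}
    parsed.update({'fg_made': 0, 'fg_att': 0, 'ft_made': 0, 'ft_att': 0})
    for token in str(line).split():
        match = TOKEN.match(token)
        if not match:
            continue
        first, second, suffix = match.groups()
        if suffix in ('fg', 'ft') and second is not None:
            parsed[suffix + '_made'] = int(first)
            parsed[suffix + '_att'] = int(second)
        elif suffix in parsed:
            parsed[suffix] = int(first)
    return parsed

full_line = parse_stats('23pt 6rb 5as 1st 5trey 8-15fg 2-2ft')
short_line = parse_stats('19pt 8as')
```

Because each token carries its own identifier, the parser never needs to know which column a stat landed in.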

In [31]:
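#starting by splitting our stats column
stats_split = stats['Stats'].str.split(' ',expand=True).reset_index(drop=True)
#preview with missing cells shown as '0' (not assigned, so stats_split itself keeps its missing values)
stats_split.fillna('0')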

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,,,23pt,6rb,5as,1st,5trey,8-15fg,2-2ft,0,0
1,,,27pt,5rb,4as,1bl,2to,5trey,11-23fg,0,0
2,,,25pt,6rb,6as,1st,5to,1trey,8-21fg,8-8ft,0
3,,,14pt,9rb,8as,1st,1bl,4to,6-13fg,2-6ft,0
4,,,19pt,5rb,5as,3st,4to,3trey,7-19fg,2-2ft,0
...,...,...,...,...,...,...,...,...,...,...,...
61043,,,,0,0,0,0,0,0,0,0
61044,,,,0,0,0,0,0,0,0,0
61045,,,,0,0,0,0,0,0,0,0
61046,,,,0,0,0,0,0,0,0,0


I'm using the designated identifiers for stats along with regex to break each stat out into its own column.

In [32]:
#filtering just for values that contain points; everything else becomes NaN
#(str.find returns -1 when the text is missing, which counts as True, so it can't be used as the test)
stats_split['Points'] = stats_split[2].where(stats_split[2].str.contains('pt', na=False))
#removing signifier
stats_split['Points'] = stats_split['Points'].str.replace('pt','')
#converting points to numeric
stats_split['Points'] = pd.to_numeric(stats_split['Points'])
#As soon as I hit rebounds, I ran into a none type error. I am trying this way instead.
stats_split['Rebounds'] = stats_split[3].str.extract(pat = '(.?.rb)').fillna(stats_split[2].str.extract(pat = '(.?.rb)'))
#removing signifier
stats_split['Rebounds'] = stats_split['Rebounds'].str.replace('rb','')
#converting to numeric
stats_split['Rebounds'] = pd.to_numeric(stats_split['Rebounds'])
#trying now for assists
stats_split['Assists'] = stats_split[4].str.extract(pat = '(.?.as)').fillna(stats_split[3].str.extract(pat = '(.?.as)')).fillna(stats_split[2].str.extract(pat = '(.?.as)'))
#removing signifier
stats_split['Assists'] = stats_split['Assists'].str.replace('as','')
#converting to numeric
stats_split['Assists'] = pd.to_numeric(stats_split['Assists'])
#trying now for steals
stats_split['Steals'] = stats_split[5].str.extract(pat = '(.?.st)').fillna(stats_split[4].str.extract(pat = '(.?.st)')).fillna(stats_split[3].str.extract(pat = '(.?.st)')).fillna(stats_split[2].str.extract(pat = '(.?.st)'))
#removing signifier
stats_split['Steals'] = stats_split['Steals'].str.replace('st','')
#converting to numeric
stats_split['Steals'] = pd.to_numeric(stats_split['Steals'])
#trying now for Blocks
stats_split['Blocks'] = stats_split[6].str.extract(pat = '(.?.bl)').fillna(stats_split[5].str.extract(pat = '(.?.bl)'))\
                        .fillna(stats_split[4].str.extract(pat = '(.?.bl)')).fillna(stats_split[3].str.extract(pat = '(.?.bl)')).fillna(stats_split[2].str.extract(pat = '(.?.bl)'))
#removing signifier
stats_split['Blocks'] = stats_split['Blocks'].str.replace('bl','')
#converting to numeric
stats_split['Blocks'] = pd.to_numeric(stats_split['Blocks'])
#trying now for Turnovers
stats_split['Turnovers'] = stats_split[7].str.extract(pat = '(.?.to)').fillna(stats_split[6].str.extract(pat = '(.?.to)')).fillna(stats_split[5].str.extract(pat = '(.?.to)'))\
                            .fillna(stats_split[4].str.extract(pat = '(.?.to)')).fillna(stats_split[3].str.extract(pat = '(.?.to)')).fillna(stats_split[2].str.extract(pat = '(.?.to)'))
#removing signifier
stats_split['Turnovers'] = stats_split['Turnovers'].str.replace('to','')
#converting to numeric
stats_split['Turnovers'] = pd.to_numeric(stats_split['Turnovers'])
#trying now for Three Pointers
stats_split['3pts'] = stats_split[8].str.extract(pat = '(.?.trey)').fillna(stats_split[7].str.extract(pat = '(.?.trey)'))\
                        .fillna(stats_split[6].str.extract(pat = '(.?.trey)')).fillna(stats_split[5].str.extract(pat = '(.?.trey)'))\
                            .fillna(stats_split[4].str.extract(pat = '(.?.trey)')).fillna(stats_split[3].str.extract(pat = '(.?.trey)')).fillna(stats_split[2].str.extract(pat = '(.?.trey)'))
#removing signifier
stats_split['3pts'] = stats_split['3pts'].str.replace('trey','')
#converting to numeric
stats_split['3pts'] = pd.to_numeric(stats_split['3pts'])
#trying now for Field Goal
stats_split['Field_Goals'] = stats_split[9].str.extract(pat = '(.?.?.?.fg)').fillna(stats_split[8].str.extract(pat = '(.?.?.?.fg)')).fillna(stats_split[7].str.extract(pat = '(.?.?.?.fg)'))\
                            .fillna(stats_split[6].str.extract(pat = '(.?.?.?.fg)')).fillna(stats_split[5].str.extract(pat = '(.?.?.?.fg)'))\
                            .fillna(stats_split[4].str.extract(pat = '(.?.?.?.fg)')).fillna(stats_split[3].str.extract(pat = '(.?.?.?.fg)')).fillna(stats_split[2].str.extract(pat = '(.?.?.?.fg)'))
#removing signifier
stats_split['Field_Goals'] = stats_split['Field_Goals'].str.replace('fg','')
#trying now for Free throws
stats_split['Free_Throws'] = stats_split[10].str.extract(pat = '(.?.?.?.ft)').fillna(stats_split[9].str.extract(pat = '(.?.?.?.ft)')).fillna(stats_split[8].str.extract(pat = '(.?.?.?.ft)'))\
                            .fillna(stats_split[7].str.extract(pat = '(.?.?.?.ft)')).fillna(stats_split[6].str.extract(pat = '(.?.?.?.ft)')).fillna(stats_split[5].str.extract(pat = '(.?.?.?.ft)'))\
                            .fillna(stats_split[4].str.extract(pat = '(.?.?.?.ft)')).fillna(stats_split[3].str.extract(pat = '(.?.?.?.ft)')).fillna(stats_split[2].str.extract(pat = '(.?.?.?.ft)'))
#removing signifier
stats_split['Free_Throws'] = stats_split['Free_Throws'].str.replace('ft','')
#now splitting free throws and field goals into their own specific stats
#creating frame of just field goals made and attempted
field_goals_made = stats_split['Field_Goals'].str.split('-',expand=True).fillna(0)
#naming those columns
field_goals_made.columns = ['Field_Goals_Made','Field_Goals_Attempted']
#adding back to stat_split
stats_split['Field_Goals_Made'] = pd.to_numeric(field_goals_made['Field_Goals_Made'])
stats_split['Field_Goals_Attempted'] = pd.to_numeric(field_goals_made['Field_Goals_Attempted'])
#adding field goal percentage
stats_split['Shooting%'] = stats_split['Field_Goals_Made']/stats_split['Field_Goals_Attempted']
#repeating for free throws
#creating frame of just free throws made and attempted
frees_made = stats_split['Free_Throws'].str.split('-',expand=True).fillna(0)
##naming those columns
frees_made.columns = ['Frees_Made','Frees_Attempted']
##adding back to stat_split
stats_split['Free_Throws_Made'] = pd.to_numeric(frees_made['Frees_Made'])
stats_split['Free_Throws_Attempted'] = pd.to_numeric(frees_made['Frees_Attempted'])
##adding free throw percentage
stats_split['Free_Throw%'] = stats_split['Free_Throws_Made']/stats_split['Free_Throws_Attempted']
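
For the made/attempted stats, a possible shortcut is to run `str.extract` with two capture groups against the whole stat line, which removes the need to search column by column. The sketch below uses made-up stat lines and leaves the percentage as NaN when there were no attempts. The frame name `fg` and the sample lines are illustrative.

```python
import pandas as pd

# made-up stat lines: a full line, one with only points and field goals, one with no shots
stats_demo = pd.Series(['23pt 6rb 5as 1st 5trey 8-15fg 2-2ft', '2pt 1-4fg', '3rb'])

# two capture groups give two columns: made and attempted
fg = stats_demo.str.extract(r'(\d+)-(\d+)fg\b').fillna('0').astype(int)
fg.columns = ['Field_Goals_Made', 'Field_Goals_Attempted']

# only keep the ratio where there was at least one attempt
ratio = fg['Field_Goals_Made'] / fg['Field_Goals_Attempted']
fg['Shooting%'] = ratio.where(fg['Field_Goals_Attempted'] > 0)
```

The same pattern with `ft` in place of `fg` would cover free throws.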

In [34]:
df1[['Points', 'Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers',
       '3pts', 'Field_Goals', 'Free_Throws', 'Field_Goals_Made',
       'Field_Goals_Attempted', 'Shooting%', 'Free_Throws_Made',
       'Free_Throws_Attempted', 'Free_Throw%']] = stats_split[['Points', 'Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers',
       '3pts', 'Field_Goals', 'Free_Throws', 'Field_Goals_Made',
       'Field_Goals_Attempted', 'Shooting%', 'Free_Throws_Made',
       'Free_Throws_Attempted', 'Free_Throw%']]

In [36]:
df1.columns

Index(['Date,', 'Position', 'Name', 'Fanduel_Points', 'Fanduel_Price', 'Team',
       'Score', 'Minutes', 'Stats', 'Minutes_Played', 'Home', 'Foe', 'Points',
       'Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers', '3pts',
       'Field_Goals', 'Free_Throws', 'Field_Goals_Made',
       'Field_Goals_Attempted', 'Shooting%', 'Free_Throws_Made',
       'Free_Throws_Attempted', 'Free_Throw%'],
      dtype='object')

In [37]:
df2 = df1[['Date,', 'Position', 'Name', 'Fanduel_Points', 'Fanduel_Price', 'Team',
       'Score','Minutes_Played', 'Home', 'Foe', 'Points',
       'Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers', '3pts',
       'Field_Goals_Made','Field_Goals_Attempted', 'Shooting%', 'Free_Throws_Made',
       'Free_Throws_Attempted', 'Free_Throw%']]

In [39]:
df2.to_csv('tight_frame.csv')