# Data wrangling

Since the data tables on the [Footballguys](https://www.footballguys.com/) website are fully rendered in HTML, we might be able to scrape the data without too much trouble. This gives us good control over exactly what data we download and an easy mechanism by which to update it throughout the season. Let's give it a try using [urllib](https://docs.python.org/3/howto/urllib2.html) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [1]:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

## 1. Download test page

In [2]:
headers={
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    #"Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    # "Host" is left out on purpose: urllib fills it in from the target URL
    "Sec-Ch-Ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Linux"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "cross-site",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
}

In [3]:
# The request uses the reasonable-ish browser headers defined above. Visit https://httpbin.io/headers
# to see what your web browser sends to web servers, then copy that output for use with urllib

# Target url
test_url='https://www.footballguys.com/playerhistoricalstats?pos=flex&yr=2024&startwk=1&stopwk=18&profile=p'

# Create the request
request_params = urllib.request.Request(
    url=test_url,
    headers=headers
)

# Get the html
with urllib.request.urlopen(request_params) as response:
    html=response.read()

# Take a look
print(html.decode())

<!DOCTYPE html><html lang="en"><head><title>Player Historical Stats - Footballguys</title><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><meta name="description" content="Statistics for previous years of the NFL"><meta property="og:title" content="Player Historical Stats"><meta property="og:type" content="website"><meta property="og:description" content="Statistics for previous years of the NFL"><meta property="og:locale" content="en_US"><meta property="og:site_name" content="Footballguys.com"><meta name="twitter:card" content="summary_large_image"><meta name="twitter:site" content="@Football_Guys"><meta name="twitter:creator" content="@Football_Guys"><meta name="twitter:title" content="Player Historical Stats"><meta name="twitter:description" content="Statistics for previous years of the NFL"><link rel="icon" type="image/png" href="https://www.footballguys.com/fbgstatic/img/favicon/16x16.pn

Looking through the HTML, the data table sits inside a `table` tag with `class="datasmall table"`.
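The query string in `test_url` carries the position, season, and week range, so the same request can be rebuilt for other weeks as the season goes on. Here is a minimal sketch, assuming the parameter names seen in the URL above (`pos`, `yr`, `startwk`, `stopwk`, `profile`) keep their meaning for other values. `build_stats_url` and `BASE_URL` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Base address taken from test_url above (everything before the '?')
BASE_URL = 'https://www.footballguys.com/playerhistoricalstats'

def build_stats_url(position='flex', year=2024, start_week=1, stop_week=18, profile='p'):
    '''Hypothetical helper: rebuilds the stats URL for a given season and week range.'''

    # urlencode joins the key/value pairs with '&' and escapes them if needed
    query = urlencode({
        'pos': position,
        'yr': year,
        'startwk': start_week,
        'stopwk': stop_week,
        'profile': profile,
    })

    return f'{BASE_URL}?{query}'

# The defaults reproduce the test URL used above
print(build_stats_url())

# Only the first nine weeks (assumes the site accepts this range)
print(build_stats_url(stop_week=9))
```

The result can be passed as `url=` to `urllib.request.Request` in place of `test_url`.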

## 2. Parse HTML data

In [4]:
# Extract the table rows
soup=BeautifulSoup(html, 'html.parser')
table=soup.find('table',{'class':'datasmall table'})
table_rows=table.find_all('tr')

# Get the column names from the first row
columns=table_rows[0].find_all('th')
column_names=[column.getText() for column in columns]

# Get the values for each row
data=[]

for row in table_rows[1:]:
    columns=row.find_all('td')
    values=[column.getText() for column in columns]
    data.append(values)

# Convert to pandas dataframe
data_df=pd.DataFrame(columns=column_names, data=data)
data_df.head()

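| | Rank | Name | Age | Exp | G | Rsh | RshYd | Y/Rsh | RshTD | Rec | RecYd | RecTD | FP/G | FantPt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Ja'Marr Chase CIN | 24.0 | 4.0 | 17 | 3 | 32 | 10.7 | 0 | 127 | 1708 | 17 | 23.7 | 403.0 |
| 1 | 2 | Jahmyr Gibbs DET | 22.0 | 2.0 | 17 | 250 | 1412 | 5.6 | 16 | 52 | 517 | 4 | 21.5 | 364.9 |
| 2 | 3 | Saquon Barkley PHI | 27.0 | 7.0 | 16 | 345 | 2005 | 5.8 | 13 | 33 | 278 | 2 | 22.0 | 351.3 |
| 3 | 4 | Bijan Robinson ATL | 22.0 | 2.0 | 17 | 304 | 1456 | 4.8 | 14 | 61 | 431 | 1 | 20.0 | 339.7 |
| 4 | 5 | Derrick Henry BAL | 30.0 | 9.0 | 17 | 325 | 1921 | 5.9 | 16 | 19 | 193 | 2 | 19.9 | 338.4 |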
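Every value pulled out with `getText()` is a string, so columns like `RshYd` or `FantPt` need converting before we can do arithmetic on them. Here is a minimal sketch of one way to do that with `pd.to_numeric`, run on a small hand-made sample rather than the scraped table. `convert_numeric_columns` and its `skip` parameter are hypothetical names, and coercing unparseable values to `NaN` is an assumption about how we would want bad cells handled.

```python
import pandas as pd

def convert_numeric_columns(df, skip=('Name',)):
    '''Hypothetical helper: converts every column not listed in skip to a numeric dtype.'''

    converted = df.copy()

    for column in converted.columns:
        if column in skip:
            continue

        # errors='coerce' turns anything that is not a number into NaN
        converted[column] = pd.to_numeric(converted[column], errors='coerce')

    return converted

# Small sample with string values, like the ones getText() returns
sample_df = pd.DataFrame(
    columns=['Rank', 'Name', 'G', 'FantPt'],
    data=[
        ['1', "Ja'Marr Chase CIN", '17', '403.0'],
        ['2', 'Jahmyr Gibbs DET', '17', '364.9'],
    ]
)

numeric_df = convert_numeric_columns(sample_df)
print(numeric_df.dtypes)
```

On the real table, the same call would be `convert_numeric_columns(data_df)`, with any other text columns added to `skip`.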


## 3. Fix the player name/team column

In [5]:
data_df['Team']=data_df['Name'].apply(lambda x: x.split()[-1])
data_df['Name']=data_df['Name'].apply(lambda x: ' '.join(x.split()[:-1]))
data_df.head()

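| | Rank | Name | Age | Exp | G | Rsh | RshYd | Y/Rsh | RshTD | Rec | RecYd | RecTD | FP/G | FantPt | Team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Ja'Marr Chase | 24.0 | 4.0 | 17 | 3 | 32 | 10.7 | 0 | 127 | 1708 | 17 | 23.7 | 403.0 | CIN |
| 1 | 2 | Jahmyr Gibbs | 22.0 | 2.0 | 17 | 250 | 1412 | 5.6 | 16 | 52 | 517 | 4 | 21.5 | 364.9 | DET |
| 2 | 3 | Saquon Barkley | 27.0 | 7.0 | 16 | 345 | 2005 | 5.8 | 13 | 33 | 278 | 2 | 22.0 | 351.3 | PHI |
| 3 | 4 | Bijan Robinson | 22.0 | 2.0 | 17 | 304 | 1456 | 4.8 | 14 | 61 | 431 | 1 | 20.0 | 339.7 | ATL |
| 4 | 5 | Derrick Henry | 30.0 | 9.0 | 17 | 325 | 1921 | 5.9 | 16 | 19 | 193 | 2 | 19.9 | 338.4 | BAL |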
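The same split can also be written with pandas string methods instead of `apply`. Here is a minimal sketch on a two-row sample, assuming every entry ends with a team abbreviation separated from the name by whitespace, as in the rows shown above. `sample_df` and `name_parts` are names made up for this example.

```python
import pandas as pd

# Two sample entries in the same 'Name TEAM' format as the scraped column
sample_df = pd.DataFrame({'Name': ["Ja'Marr Chase CIN", 'Jahmyr Gibbs DET']})

# Split once from the right: column 0 holds everything before the last
# whitespace (the player name), column 1 holds the last token (the team)
name_parts = sample_df['Name'].str.rsplit(n=1, expand=True)

sample_df['Name'] = name_parts[0]
sample_df['Team'] = name_parts[1]

print(sample_df)
```

Splitting from the right keeps multi-word names intact, since only the final token is treated as the team.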


## 4. Save the data

In [6]:
data_df.to_csv('../data/sample_data.csv', index=False)
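`to_csv` does not create missing folders, so the save fails if `../data` does not exist yet. Here is a minimal sketch that creates the parent directory first, shown on a throwaway temporary directory so it can be run anywhere. `save_csv` is a hypothetical helper, not something pandas provides.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_csv(df, output_path):
    '''Hypothetical helper: creates the parent directory if needed, then writes the CSV.'''

    output_path = Path(output_path)

    # exist_ok=True makes this a no-op when the directory is already there
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)

    return output_path

# Try it on a temporary directory with a one-row sample
sample_df = pd.DataFrame({'Rank': [1], 'Name': ["Ja'Marr Chase"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_csv(sample_df, Path(tmp_dir) / 'data' / 'sample_data.csv')

    # Read the file back to check that it was written correctly
    round_trip = pd.read_csv(saved_path)

print(round_trip)
```

For the notebook itself, the call would be `save_csv(data_df, '../data/sample_data.csv')`.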