# Data wrangling

Since the data tables on the [Footballguys](https://www.footballguys.com/) website are fully rendered in HTML, we might be able to scrape the data without too much trouble. This gives us good control over exactly what data we download and an easy mechanism by which to update it throughout the season. Let's give it a try using [urllib](https://docs.python.org/3/howto/urllib2.html) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/).

In [1]:
import time
import urllib.request
from itertools import product
from random import randrange

import pandas as pd
from bs4 import BeautifulSoup

# Decide if we want to re-download the data or not
download_data=True

# Set the data file paths
raw_data_path='../data/raw_qb_data.parquet'

## 1. Download and parse HTML data

The available data spans 1996 to 2024, and we request weeks 1 through 18 for every year. We will eventually want data for multiple positions, but let's start with just one: quarterbacks. We also need to pick a scoring scheme; let's go with PPR (points per reception), which we can easily change later. A loop constructs the URL for each year and week, downloads the page, then parses and collects the data as we get it.

**Note**: Downloading all of the data for one position takes just over 45 minutes.
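
As a rough check on that estimate, the sketch below builds the full list of URLs and adds up the expected time spent pausing between requests. It assumes the same year range, week range, URL pattern and 1 to 9 second random pause used in the main download loop further down; the time taken by the requests themselves is not modeled.

```python
from itertools import product

# Values mirroring the parameters of the main download loop
years = range(1996, 2025)   # 1996 through 2024 inclusive
weeks = range(1, 19)        # weeks 1 through 18
position, profile = 'qb', 'p'

# One URL per (year, week) pair
urls = [
    'https://www.footballguys.com/playerhistoricalstats'
    f'?pos={position}&yr={year}&startwk={week}&stopwk={week}&profile={profile}'
    for year, week in product(years, weeks)
]

# randrange(1, 10) draws an integer from 1 to 9, so the mean pause is 5 seconds
pauses = range(1, 10)
mean_pause_seconds = sum(pauses) / len(pauses)
estimated_minutes = len(urls) * mean_pause_seconds / 60

print(len(urls))                    # 522 pages to request
print(round(estimated_minutes, 1))  # 43.5 minutes spent sleeping alone
```

The pauses alone account for most of the run time, so shortening the wait is the main lever if the download needs to be faster, at the cost of hitting the site harder.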

### 1.1. Download function

In [2]:
def download_url(url: str) -> bytes:
    '''Takes string url, downloads URL and returns HTML bytes object'''

    headers={
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        # "Host" is omitted: urllib fills it in from the request URL
        "Sec-Ch-Ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Linux"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    }

    # Create the request
    request_params = urllib.request.Request(
        url=url,
        headers=headers
    )

    # Get the html
    with urllib.request.urlopen(request_params) as response:
        html=response.read()

    return html
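
A request can be inspected before anything is sent, which is a cheap way to confirm what `urllib` will do with the URL and headers. The sketch below uses a trimmed-down header set (an assumption made for brevity, not the full set above) and touches no network.

```python
import urllib.request

url = ('https://www.footballguys.com/playerhistoricalstats'
       '?pos=qb&yr=1996&startwk=1&stopwk=1&profile=p')

# Trimmed-down header set, for illustration only
headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
}

request = urllib.request.Request(url=url, headers=headers)

# The target host is parsed from the URL when the Request is created
print(request.host)          # www.footballguys.com
print(request.get_method())  # GET, since no data payload is attached

# Header names are stored via str.capitalize(), so 'User-Agent' becomes 'User-agent'
print(request.has_header('User-agent'))  # True

# No Host header has been set by hand; urllib adds one when the request is opened
print(request.has_header('Host'))        # False
```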

### 1.2. HTML parsing function

In [3]:
def parse_html_table(html: bytes, position: str, year: int, week: int, profile: str) -> pd.DataFrame:
    '''Takes an HTML bytes object downloaded from a URL, parses the data table,
    adds year, week, position and scoring profile and returns a pandas dataframe'''

    # Extract the table rows
    soup=BeautifulSoup(html, 'html.parser')
    table=soup.find('table',{'class':'datasmall table'})
    table_rows=table.find_all('tr')

    # Get the column names from the first row
    columns=table_rows[0].find_all('th')
    column_names=[column.getText() for column in columns]
    column_names.extend(['Position', 'Year', 'Week', 'Scoring profile'])

    # Get the values for each row
    data=[]

    for row in table_rows[1:]:
        columns=row.find_all('td')
        values=[column.getText() for column in columns]
        values.extend([position, year, week, profile])
        data.append(values)

    # Convert to pandas dataframe and return
    return pd.DataFrame(columns=column_names, data=data)
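
To see the parsing steps in isolation, the sketch below runs them on a small hand-written table. The markup is an assumption that only imitates the `datasmall table` class the function looks for; it is not actual site HTML, and the three columns are a made-up subset.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (not real site markup)
sample_html = b'''
<table class="datasmall table">
  <tr><th>Rank</th><th>Name</th><th>FantPt</th></tr>
  <tr><td>1</td><td>Brett Favre GB</td><td>26.0</td></tr>
  <tr><td>2</td><td>Mark Brunell JAX</td><td>20.6</td></tr>
</table>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'class': 'datasmall table'})
rows = table.find_all('tr')

# Header cells come from the first row, data cells from the remaining rows
column_names = [cell.get_text() for cell in rows[0].find_all('th')]
data = [[cell.get_text() for cell in row.find_all('td')] for row in rows[1:]]

sample_df = pd.DataFrame(columns=column_names, data=data)

# Every scraped value is a string at this point, numbers included
print(sample_df.dtypes)
print(sample_df)
```

Because every cell arrives as text, numeric columns need an explicit conversion before any arithmetic is done on them.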

### 1.3. Main download loop

In [4]:
# URL parameter arguments
position='qb'
profile='p'
years=list(range(1996,2025))
weeks=list(range(1,19))

# Download the data if asked
if download_data:

    # Empty list to accumulate results
    results=[]

    for year, week in product(years, weeks):

        print(f'Downloading data for {year}', end='\r')

        # Construct the URL for this year and week
        url=f'https://www.footballguys.com/playerhistoricalstats?pos={position}&yr={year}&startwk={week}&stopwk={week}&profile={profile}'

        # Get the HTML
        html=download_url(url)

        # Parse the HTML
        result=parse_html_table(html, position, year, week, profile)

        # Collect the result
        results.append(result)

        # Wait before downloading the next page
        time.sleep(randrange(1, 10))

    # Combine the week by week dataframes
    data_df=pd.concat(results)

    # Clean up the index
    data_df.reset_index(inplace=True, drop=True)

    # Save as parquet
    data_df.to_parquet(raw_data_path)

# Otherwise, load it from disk if we already have it
else:
    data_df=pd.read_parquet(raw_data_path)
    print('Loaded data from disk')

Downloading data for 2024

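## 2. Fix the player name/team column

The scraped `Name` column holds the player's name followed by the team abbreviation, so we split the last token off into its own `Team` column.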

In [5]:
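# The last whitespace-separated token of 'Name' is the team abbreviation
data_df['Team']=data_df['Name'].apply(lambda x: x.split()[-1])

# Everything before that token is the player's name
data_df['Name']=data_df['Name'].apply(lambda x: ' '.join(x.split()[:-1]))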
data_df.head()

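| | Rank | Name | Age | Exp | G | Cmp | Att | Cm% | PYd | Y/Att | ... | Rsh | RshYd | RshTD | FP/G | FantPt | Position | Year | Week | Scoring profile | Team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Brett Favre | 27.0 | 6.0 | 1 | 20 | 27 | 74.1 | 247 | 9.15 | ... | 1 | 1 | 0 | 26.0 | 26.0 | qb | 1996 | 1 | p | GB |
| 1 | 2 | Mark Brunell | 26.0 | 3.0 | 1 | 20 | 31 | 64.5 | 212 | 6.84 | ... | 10 | 41 | 0 | 20.6 | 20.6 | qb | 1996 | 1 | p | JAX |
| 2 | 3 | Vinny Testaverde | 33.0 | 10.0 | 1 | 19 | 33 | 57.6 | 254 | 7.7 | ... | 8 | 42 | 1 | 20.4 | 20.4 | qb | 1996 | 1 | p | BAL |
| 3 | 4 | Rodney Peete | 30.0 | 8.0 | 1 | 20 | 34 | 58.8 | 269 | 7.91 | ... | 6 | 10 | 0 | 17.8 | 17.8 | qb | 1996 | 1 | p | PHI |
| 4 | 5 | Kerry Collins | 24.0 | 2.0 | 1 | 17 | 31 | 54.8 | 198 | 6.39 | ... | 1 | 14 | 0 | 17.3 | 17.3 | qb | 1996 | 1 | p | CAR |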
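
The same split can also be written with pandas' vectorized string methods instead of `apply`. This is a minimal sketch on a hand-written sample in the assumed raw format (player name, then team abbreviation); `sample_df` is a hypothetical stand-in for `data_df`.

```python
import pandas as pd

# Hand-written sample in the assumed raw format: player name, then team
sample_df = pd.DataFrame({'Name': ['Brett Favre GB', 'Vinny Testaverde BAL']})

# Split once from the right on whitespace:
# column 0 holds the player name, column 1 the team abbreviation
parts = sample_df['Name'].str.rsplit(n=1, expand=True)
sample_df['Name'] = parts[0]
sample_df['Team'] = parts[1]

print(sample_df)
```

One difference from the `apply` version: `rsplit` leaves any repeated spaces inside the name untouched, whereas `' '.join(x.split()[:-1])` collapses them to single spaces.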
