##### Football data collection tutorial

This tutorial shows how to collect data from several websites using web scraping techniques. In our case, the data is football data drawn from multiple sources, each covering a different aspect of our project.

A key note about the following code: every scrape uses requests and BeautifulSoup, and only the first loop is discussed in detail. Each loop depends on the website, so you will need to determine which HTML elements your website uses and write your own loop. I will also discuss the issues we hit when retrieving data from Wikipedia and the code used to overcome them.

Finally, all print statements are commented out because they were used for debugging. If you use the code, you are welcome to uncomment them to see what was produced.

Our first step is importing the Python packages we need to accomplish our task. In this instance we need requests, bs4 (BeautifulSoup), pandas, re, and os. Once the packages are imported, we use the `os` package to confirm we are in the correct directory.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import os
os.getcwd()

'/Users/andrewtamez/Desktop/NPS/OA3802_Comp3/Final_Project'

For each of the following portions of code, we first inspect the website we wish to pull data from. This is a key step, as each website is structured differently and uses different HTML.

Because our website has a base URL, we assign that URL to a variable so we can concatenate the year to it in a loop later in the code.

The line below shows how we defined our `base_url` variable. Replace the website used by my team with the one you wish to explore.

`base_url = "https://www.spotrac.com/nfl/position/_/year/"`
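
As a rough illustration of how the base URL is meant to be used, the sketch below builds one URL per year by appending the year as a string. The helper name `build_year_urls` is hypothetical and does not appear in our project code.

```python
# Base URL from the tutorial; swap in the site you want to explore.
base_url = "https://www.spotrac.com/nfl/position/_/year/"

def build_year_urls(base, first_year, last_year):
    """Return {year: url} for every year from first_year through last_year."""
    # range() excludes its upper bound, so add 1 to include last_year
    return {year: base + str(year) for year in range(first_year, last_year + 1)}

year_urls = build_year_urls(base_url, 2011, 2024)
print(year_urls[2011])
```

Keeping the URLs in a dictionary keyed by year makes it easy to check which years will be requested before any network call is made.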

We then assign a variable for the years we are interested in. We also add the salary cap for each year so that spending by position can be expressed as a share of that year's cap. We then run our loop:

1) The loop first appends the year as a string to the base URL.
2) We make a GET request with `requests.get(url)`.
3) We use `soup = BeautifulSoup(response.content, 'html.parser')` to parse the HTML content of the response created in step 2.
4) We search for the `<div>` element using `table_wrapper = soup.find('div', id='table-wrapper')` and then for the `<table>` element inside it using `table_wrapper.find('table')`.
5) The next portion, `if table`, retrieves the table header (`<th>`) elements. We then extract the table rows, the `<tr>` elements.
6) In `for row in rows`, we extract all the cells (`<td>` elements) from each row while stripping the whitespace, then append the cells to our data list.
7) This list is turned into a pandas DataFrame, and the year is added to each row. We also include a step that uses our `team_abbreviations` dictionary to convert team names to abbreviations. Finally, each spending column is stripped of `$` and `M`, converted to a float, and divided by that year's salary cap.
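
Steps 3 through 6 can be tried without any network access. The sketch below runs the same `find` and `find_all` calls against a small, made-up HTML string that only mimics the structure the loop expects; the real page will differ, and `parse_table` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like the page the loop expects (assumption, not real data).
sample_html = """
<div id="table-wrapper">
  <table>
    <thead><tr><th>Team</th><th>QB</th><th>RB</th></tr></thead>
    <tbody>
      <tr><td> Example Team A </td><td>$25.5M</td><td>$10.0M</td></tr>
      <tr><td> Example Team B </td><td>$30.0M</td><td>-</td></tr>
    </tbody>
  </table>
</div>
"""

def parse_table(html):
    """Return (headers, rows) from the first table inside div#table-wrapper."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id="table-wrapper")
    if wrapper is None:
        return [], []
    table = wrapper.find("table")
    if table is None:
        return [], []
    # Header cells live in <thead>, data rows in <tbody>
    headers = [th.text.strip() for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> cells
            rows.append(cells)
    return headers, rows

headers, rows = parse_table(sample_html)
print(headers)
print(rows)
```

Testing the parsing logic on a saved or hand-written snippet like this is a cheap way to confirm the element names before looping over many years of live pages.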

In [3]:
team_abbreviations = {"San Francisco 49ers": "SF", "Kansas City Chiefs": "KC", "Green Bay Packers": "GB", "Baltimore Ravens": "BAL",
    "Dallas Cowboys": "DAL", "Miami Dolphins": "MIA", "Pittsburgh Steelers": "PIT", "Oakland Raiders": "OAK", "Las Vegas Raiders": "LV",
    "Minnesota Vikings": "MIN", "New York Giants": "NYG", "New England Patriots": "NE", "Los Angeles Rams": "LAR", "Chicago Bears": "CHI",
    "Philadelphia Eagles": "PHI", "Denver Broncos": "DEN", "Tampa Bay Buccaneers": "TB", "Seattle Seahawks": "SEA", "Indianapolis Colts": "IND",
    "Washington Commanders": "WAS", "Washington Redskins": "WAS", "Buffalo Bills": "BUF", "Tennessee Titans": "TEN", "Atlanta Falcons": "ATL",
    "Carolina Panthers": "CAR", "Arizona Cardinals": "ARI", "Cincinnati Bengals": "CIN", "Detroit Lions": "DET", "Jacksonville Jaguars": "JAX",
    "New York Jets": "NYJ", "Houston Texans": "HOU", "Los Angeles Chargers": "LAC", "St. Louis Rams": "STL", "San Diego Chargers": "SD",
    "New Orleans Saints": "NO",}

def scrape_spending_data():
    base_url_spending = "https://www.spotrac.com/nfl/position/_/year/"
    years = range(2011, 2025)
    data_frames = {}

    # Salary cap data in millions
    salary_cap_data = {2024: 255.4, 2023: 224.8, 2022: 208.2,
        2021: 182.5, 2020: 198.2, 2019: 188.2, 2018: 177.2,
        2017: 167.0, 2016: 155.27, 2015: 143.28, 2014: 133.0,
        2013: 123.0, 2012: 120.6, 2011: 120.0}

    for year in years:
        url = base_url_spending + str(year)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        table_wrapper = soup.find('div', id='table-wrapper')
        if not table_wrapper:
            print(f"Missing data for year: {year}")
            continue
        
        table = table_wrapper.find('table')
        if table:
            headers = [header.text.strip() for header in table.find('thead').find_all('th')]
            rows = table.find('tbody').find_all('tr')
            
            #print(f"Year {year}: Found {len(rows)} rows of data.")  

            data = []
            for row in rows:
                cells = [cell.text.strip() for cell in row.find_all('td')]
                if cells: 
                    data.append(cells)
            
            df = pd.DataFrame(data, columns=headers)
            df['Year'] = year
            df['Team'] = df['Team'].apply(lambda x: team_abbreviations.get(x, x))

            
            for col in df.columns[1:-1]: 
                df[col] = df[col].replace('-', '0') 
                df[col] = df[col].str.replace('[$M]', '', regex=True).astype(float)
                df[col] = df[col] / salary_cap_data[year]

            data_frames[year] = df
        #else:
            #print(f"Year {year}: No table found.")

    return data_frames

spending_data_frames = scrape_spending_data()  # required: the next cell loops over spending_data_frames
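
The cap normalization at the end of the loop can be seen in isolation with plain Python. This is a minimal sketch that assumes cell values look like `$25.5M` or `-`; the function name `cap_share` is hypothetical, and the two cap figures are copied from `salary_cap_data` above.

```python
import re

# Two cap values (in millions) copied from salary_cap_data in the tutorial.
salary_cap_millions = {2023: 224.8, 2024: 255.4}

def cap_share(raw_value, year):
    """Convert a string such as '$25.5M' into a fraction of that year's cap."""
    if raw_value == "-":  # the tutorial treats '-' as zero spending
        return 0.0
    # Remove the '$' and 'M' characters, leaving just the number
    millions = float(re.sub(r"[$M]", "", raw_value))
    return millions / salary_cap_millions[year]

share = cap_share("$22.48M", 2023)
print(round(share, 4))
```

Note that the result is a fraction of the cap rather than a percentage; multiply by 100 if you prefer to report percentages.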


As mentioned previously, we will not discuss the loops in the remaining scrapers, as they follow steps similar to the initial web scraping code above.

In [4]:
def scrape_super_bowl_winners():
    url = 'https://www.topendsports.com/events/super-bowl/winners-list.htm'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    table = soup.find('table')
    rows = table.find_all('tr')
    
    super_bowl_winners = {}
    for row in rows[1:]:
        cols = row.find_all('td')
        if len(cols) >= 3:
            year = int(cols[0].text.strip()) - 1  # Adjust for season year
            winner = cols[2].text.strip()
            short_name = team_abbreviations.get(winner, winner)
            super_bowl_winners[year] = short_name
    
    return super_bowl_winners


super_bowl_winners = scrape_super_bowl_winners()

for year, df in spending_data_frames.items():
    winner_team = super_bowl_winners.get(year)
    
    if winner_team:
        df['SuperBowl_Win'] = df['Team'].apply(
            lambda team: 1 if winner_team in team else 0)
    else:
        df['SuperBowl_Win'] = 0
    
    #print(f"Year: {year}, Super Bowl Winner: {winner_team}")
    #print(df[['Team', 'SuperBowl_Win']].head())  # Display first few rows to check
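
The `- 1` adjustment in `scrape_super_bowl_winners` exists because a Super Bowl is played early in the calendar year after the season it decides. The sketch below shows that shift on a small sample dictionary; `to_season_year` is a hypothetical helper, and `abbreviations` is a one-entry subset of `team_abbreviations`.

```python
def to_season_year(game_year):
    """A Super Bowl played in calendar year N closes out season N - 1."""
    return game_year - 1

# Small sample in the shape the scraper builds: calendar year -> winner name.
winners_by_game_year = {2024: "Kansas City Chiefs", 2023: "Kansas City Chiefs"}
abbreviations = {"Kansas City Chiefs": "KC"}  # subset of team_abbreviations

# Re-key by season and map full names to abbreviations, falling back to the name
winners_by_season = {
    to_season_year(year): abbreviations.get(name, name)
    for year, name in winners_by_game_year.items()
}
print(winners_by_season)
```

Keying the winners by season year is what lets them line up with the spending tables, which are keyed by season.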


In [5]:
def scrape_conference_champions(url):
    # Shared scraper: the NFC and AFC pages are parsed with identical logic
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    champions = {}
    for li in soup.find_all('li', class_='ff-h'):
        span = li.find('span')
        if span:
            # Each entry is split on the first colon into year and team
            year_and_team = span.get_text(strip=True)
            year, team = year_and_team.split(':', 1)
            team = re.sub(r'\s*\(\d+-\d+\)', '', team).strip()  # drop a trailing (W-L) record
            champions[int(year.strip())] = team_abbreviations.get(team, team)

    return champions

def scrape_nfc_champions():
    return scrape_conference_champions('https://www.foxsports.com/stories/nfl/nfc-champions-complete-list-winners-year')

def scrape_afc_champions():
    return scrape_conference_champions('https://www.foxsports.com/stories/nfl/afc-champions-complete-list-winners-year')

nfc_champions = scrape_nfc_champions()
afc_champions = scrape_afc_champions()

#print("NFC Champions:")
#for year, team in nfc_champions.items():
#    print(f"Year: {year}, Team: {team}")

#print("\nAFC Champions:")
#for year, team in afc_champions.items():
#    print(f"Year: {year}, Team: {team}")
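
The string handling inside the champion scrapers can be checked on its own. This sketch assumes each list entry has the shape `YEAR: Team (W-L)`, which is what the split and regular expression in the code imply; the sample entries are made up and `parse_champion_entry` is a hypothetical helper.

```python
import re

def parse_champion_entry(entry):
    """Split 'YEAR: Team Name (W-L)' into (year, team) with the record removed."""
    # Split only on the first colon so a colon in the team text would survive
    year, team = entry.split(":", 1)
    # Remove a trailing win-loss record such as ' (12-5)' if one is present
    team = re.sub(r"\s*\(\d+-\d+\)", "", team).strip()
    return int(year.strip()), team

# Made-up entries in the assumed 'YEAR: Team (W-L)' shape.
samples = ["2023: Example Team A (12-5)", "2022: Example Team B"]
parsed = [parse_champion_entry(s) for s in samples]
print(parsed)
```

An entry without a record passes through unchanged, since the regular expression simply finds nothing to remove.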


In [7]:
base_url_wins = 'https://www.nfl.com/standings/league/{}/REG'

def scrape_wins(year):
    url = base_url_wins.format(year)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    table = soup.find('table')
    rows = table.find('tbody').find_all('tr')

    wins_data = {}
    for row in rows:
        cols = row.find_all('td')
        if len(cols) > 0:
            team_name = cols[0].text.strip()
            team_name = re.sub(r'\n.*', '', team_name).strip()  # Remove any text after newline
            team_name = team_abbreviations.get(team_name, team_name)  # Map to abbreviation
            wins = int(cols[1].text.strip())
            wins_data[team_name] = wins

    return wins_data

all_wins_data = {}
for year in range(2011, 2025):
    #print(f"Scraping data for {year}...")
    all_wins_data[year] = scrape_wins(year)

#for year, data in all_wins_data.items():
    #print(f"Wins data for {year}:")
    #for team, wins in data.items():
        #print(f"{team}: {wins} wins")
    #print("-" * 40)
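
In `scrape_wins`, the team cell's text can span several lines, and the `re.sub(r'\n.*', '', ...)` call keeps only the first one. The sketch below shows the effect on a made-up cell string; the exact text on the live page is an assumption, and `clean_team_cell` is a hypothetical helper.

```python
import re

def clean_team_cell(cell_text):
    """Keep only the first line of a table cell's text."""
    # '.' does not match a newline, so each '\n' plus the rest of its line is
    # removed; re.sub repeats this for every later line in the string
    first_line = re.sub(r"\n.*", "", cell_text.strip())
    return first_line.strip()

# Made-up cell text with extra lines after the team name.
raw = "\n  Example Team A\n      xz  \n"
print(clean_team_cell(raw))
```

After this cleanup the name can be looked up in `team_abbreviations` in the same way as in the other scrapers.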




While the code above has been similar throughout, we add a final portion that grabs the teams that made the playoffs in the years we are interested in. Most of the code is similar, but we ran into issues. First, we must search for the `<table>` elements with the class `wikitable` within the Wikipedia page to find our data.

We then find the `<tr>` and `<td>` elements and pull our data. The code includes a few conditionals that depend on the makeup of the HTML; they will not be covered here, as each page's HTML is unique.

I will discuss two new functions made specifically for the playoff data. Pulling the data from Wikipedia required an additional step to ensure we only pulled the data we wanted.

The first function, `is_valid_team_name`, checks each value we pulled. It rejects values that contain `OT`, values that contain an en dash (`–`), values in which any hyphen-separated part consists entirely of digits (such as a score or a year), and values of two characters or fewer, since every team name is longer than that. It returns `False` if any of these conditions is met and `True` otherwise.

Finally, our last function, `get_filtered_playoff_teams`, keeps only the years we are interested in and uses the function above to keep only the values for which it returns `True`. The results go into a dictionary with the year as the key and the list of teams as the value.
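
The filter described above can be tried on a few made-up strings before reading the full cell. The function body below mirrors the logic of `is_valid_team_name` in the next cell; the candidate strings are illustrative and are not taken from the Wikipedia page.

```python
def is_valid_team_name(name):
    """Mirror of the tutorial's filter for link text scraped from the page."""
    # True when any hyphen-separated part is made up of digits only
    has_numeric_part = any(part.isdigit() for part in name.split('-'))
    if 'OT' in name or has_numeric_part or '–' in name or len(name) <= 2:
        return False
    return True

# Made-up link texts: team names mixed with scores, a year, and an OT marker.
candidates = ["Example Team A", "24-10", "2011", "OT", "27–24", "San Francisco 49ers"]
kept = [c for c in candidates if is_valid_team_name(c)]
print(kept)
```

Because the digit check looks at whole hyphen-separated parts rather than single characters, a name such as "San Francisco 49ers" is kept even though it contains digits.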

In [8]:
def get_playoff_teams():
    url = "https://en.wikipedia.org/wiki/NFL_playoff_results#All-time_playoff_records_(NFL/AFL)"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    playoff_data = {}
    
    tables = soup.find_all('table', class_='wikitable')
    
    for table in tables:
        rows = table.find_all('tr')
        current_year = None
        
        for row in rows:
            cells = row.find_all('td')
            if cells:
                first_cell_text = cells[0].get_text(strip=True)
                if first_cell_text[:4].isdigit() and 2011 <= int(first_cell_text[:4]) <= 2024:
                    current_year = int(first_cell_text[:4])
                    if current_year not in playoff_data:
                        playoff_data[current_year] = []
                
                if current_year:
                    for cell in cells:
                        team_links = cell.find_all('a', href=True)
                        for link in team_links:
                            if 'title' in link.attrs and 'cite_note' not in link['href']:
                                team_name = link.get_text(strip=True)
                                if is_valid_team_name(team_name):
                                    if team_name not in playoff_data[current_year]:
                                        playoff_data[current_year].append(team_name)
    
    return playoff_data

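def is_valid_team_name(name):
    # Reject overtime markers, scores or years (a hyphen-separated part that is
    # all digits), anything containing an en dash, and strings of 2 characters or fewer
    if ('OT' in name or any(part.isdigit() for part in name.split('-')) or 
        '–' in name or len(name) <= 2):
        return False
    return True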


def get_filtered_playoff_teams(playoff_teams, team_abbreviations):
    filtered_playoff_teams = {}
    for year, teams in playoff_teams.items():
        if 2011 <= year <= 2024:
            valid_teams = [team_abbreviations.get(team, team) for team in teams if is_valid_team_name(team)]
            filtered_playoff_teams[year] = valid_teams
    return filtered_playoff_teams

def main():
    super_bowl_winners = scrape_super_bowl_winners()
    nfc_champions = scrape_nfc_champions()
    afc_champions = scrape_afc_champions()
    spending_data_frames = scrape_spending_data()
    playoff_teams = get_playoff_teams()
    filtered_playoff_teams = get_filtered_playoff_teams(playoff_teams, team_abbreviations)
    
    for year, df in spending_data_frames.items():
        wins_data = scrape_wins(year)
        
        df['Wins'] = df['Team'].map(wins_data).fillna(0).astype(int)
        df['SuperBowl_Win'] = df['Team'].apply(lambda team: 1 if super_bowl_winners.get(year) == team else 0)
        df['CC_Win'] = df['Team'].apply(lambda team: 1 if nfc_champions.get(year) == team or afc_champions.get(year) == team else 0)
        df['Playoffs'] = df['Team'].apply(lambda team: 1 if team in filtered_playoff_teams.get(year, []) else 0)

        #print(f"Year: {year}")
        #print(df[['Team', 'Wins', 'SuperBowl_Win', 'CC_Win', 'Playoffs']].head())
        #print("-" * 40)

    all_years_df = pd.concat(spending_data_frames.values())
    all_years_df.to_csv('NFL_Positional_Spending_with_Championships_and_Wins.csv', index=False)

if __name__ == "__main__":
    main()
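
To see what the flag columns built in `main()` look like, the sketch below applies the same idea to a tiny made-up table. It uses vectorized comparisons (`==` and `isin`) in place of the `apply` calls above, which is a swapped-in technique that should give the same 0/1 values; all team codes and numbers here are invented.

```python
import pandas as pd

# Made-up data for one season; real values come from the scrapers above.
df = pd.DataFrame({"Team": ["AAA", "BBB", "CCC"]})
wins_data = {"AAA": 12, "BBB": 7}  # "CCC" is missing on purpose
super_bowl_winner = "AAA"
playoff_teams = ["AAA", "BBB"]

# Teams missing from wins_data become NaN after map(), then 0 after fillna()
df["Wins"] = df["Team"].map(wins_data).fillna(0).astype(int)
# Boolean comparisons cast to int give the same 1/0 flags as the lambdas
df["SuperBowl_Win"] = (df["Team"] == super_bowl_winner).astype(int)
df["Playoffs"] = df["Team"].isin(playoff_teams).astype(int)
print(df)
```

One thing this makes visible is that a team missing from the wins data silently receives 0 wins, so it is worth checking for unmapped team names before trusting the merged file.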
