## NFL Playoffs Data Pull
Welcome to this Jupyter notebook on pulling NFL playoff results from Pro Football Reference (pro-football-reference.com).

### Importing Packages
I use the BeautifulSoup package to parse HTML and Playwright for the asynchronous scraping.

In [1]:
import os
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
import time
import re
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
# !pip install playwright
# !playwright install --with-deps firefox

### Defining Globals
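These globals keep track of the file paths, as well as the number of playoff teams in each NFL season since 1966. <br>
The array concatenation is admittedly a little suspicious (it yields 60 entries, while only the 58 seasons from 1966 through 2023 are zipped against it, so `zip` silently drops the last two), so I print the year-by-year pairs to see which changes I need to make, if any.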

In [3]:
DATA_DIR = "sbpgs"
YR_DIR = "leagueYrs"
PLAYOFF_SIZES = [4] * 4 + [8] * 8 + [10] * 4 + [16] + [10] * 8 + [12] * 30 + [14] * 5
year_sizes = list(zip(PLAYOFF_SIZES, range(1966, 2024)))
print(year_sizes)

[(4, 1966), (4, 1967), (4, 1968), (4, 1969), (8, 1970), (8, 1971), (8, 1972), (8, 1973), (8, 1974), (8, 1975), (8, 1976), (8, 1977), (10, 1978), (10, 1979), (10, 1980), (10, 1981), (16, 1982), (10, 1983), (10, 1984), (10, 1985), (10, 1986), (10, 1987), (10, 1988), (10, 1989), (10, 1990), (12, 1991), (12, 1992), (12, 1993), (12, 1994), (12, 1995), (12, 1996), (12, 1997), (12, 1998), (12, 1999), (12, 2000), (12, 2001), (12, 2002), (12, 2003), (12, 2004), (12, 2005), (12, 2006), (12, 2007), (12, 2008), (12, 2009), (12, 2010), (12, 2011), (12, 2012), (12, 2013), (12, 2014), (12, 2015), (12, 2016), (12, 2017), (12, 2018), (12, 2019), (12, 2020), (14, 2021), (14, 2022), (14, 2023)]


In [4]:
def playoff_size(year):
    return PLAYOFF_SIZES[year - 1966]
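
As a cross-check on the concatenated list, here is a minimal sketch that builds the same sizes from explicit (first year, last year, size) ranges, which makes each boundary visible. The `ERAS` table, `sizes_by_year`, and `playoff_size_by_year` are hypothetical names, and the boundaries are copied from the printed pairs above rather than verified independently.

```python
# Hypothetical era table: (first_year, last_year, playoff_teams),
# copied from the printed (size, year) pairs in this notebook
ERAS = [
    (1966, 1969, 4),
    (1970, 1977, 8),
    (1978, 1981, 10),
    (1982, 1982, 16),
    (1983, 1990, 10),
    (1991, 2020, 12),
    (2021, 2023, 14),
]

# Expand each era into one {year: size} entry
sizes_by_year = {
    year: size
    for first, last, size in ERAS
    for year in range(first, last + 1)
}

def playoff_size_by_year(year):
    """Look up the playoff field size; raises KeyError for years outside the table."""
    return sizes_by_year[year]

print(playoff_size_by_year(1982))  # 16
print(len(sizes_by_year))          # 58 seasons, 1966 through 2023
```

A dictionary keyed by year fails loudly on an unsupported season instead of indexing past the intended range.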

### Function Declarations
The `get_html` function is heavily inspired by a video from Dataquest, a YouTube creator. <br>
Link to video here: https://www.youtube.com/watch?v=o6Ih934hADU

In [5]:
import asyncio

async def get_html(url, selector, sleep=5, retries=3):
    html = None
    for i in range(1, retries + 1):
        # Wait a little longer before each attempt, without blocking the event loop
        await asyncio.sleep(sleep * i)

        try:
            async with async_playwright() as p:
                # Firefox is used here; Playwright can also drive Chromium and WebKit
                browser = await p.firefox.launch()
                page = await browser.new_page()
                await page.goto(url)
                print(await page.title())
                html = await page.inner_html(selector)
        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue
        else:
            break
    return html
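
The retry loop can be tried out without a browser. The following is a minimal sketch of the same pattern, assuming a hypothetical coroutine `flaky_fetch` that stands in for the page load; `fetch_with_retries` is also an illustrative name, not part of Playwright. It uses `asyncio.sleep` so the wait does not block the event loop.

```python
import asyncio

async def fetch_with_retries(fetch, url, sleep=0.1, retries=3):
    """Call the coroutine function `fetch(url)` until it succeeds or retries run out."""
    for attempt in range(1, retries + 1):
        # Linear backoff: wait longer before each successive attempt
        await asyncio.sleep(sleep * attempt)
        try:
            return await fetch(url)
        except TimeoutError:
            print(f"Timeout on attempt {attempt} for {url}")
    return None  # every attempt timed out

# Hypothetical stand-in for a page load: fails twice, then succeeds
calls = {"count": 0}

async def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError
    return f"<html>{url}</html>"
```

In a notebook cell this would be called with `await fetch_with_retries(flaky_fetch, "https://example.com")`; in a plain script, wrap the call in `asyncio.run(...)`.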

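`savePath` saves each page to disk the first time it is requested and reads the saved copy afterwards. The `name` argument is a function that turns the link into a file name.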

In [6]:
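async def savePath(link, directory, name, tag):
    save_path = os.path.join(directory, name(link))
    if not os.path.exists(save_path):
        html = await get_html(link, tag)
        if html is None:
            # Every retry timed out, so there is nothing to cache
            raise RuntimeError(f"Could not download {link}")
        os.makedirs(directory, exist_ok=True)
        with open(save_path, "w+") as f:
            f.write(html)
    else:
        with open(save_path, "r") as f:
            html = f.read()
    return html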
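
The caching idea is independent of the scraping itself. Here is a minimal synchronous sketch of the same read-or-fetch pattern, where `cached_read` and `fake_download` are hypothetical names and a temporary directory stands in for the real data folders.

```python
import os
import tempfile

def cached_read(path, produce):
    """Return the file's contents, calling `produce()` and saving its result on a cache miss."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = produce()
    if text is None:
        # Nothing to cache; avoid leaving an empty file behind
        return None
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

# A counter stands in for the network call so cache hits are visible
fetches = []

def fake_download():
    fetches.append(1)
    return "<div>playoffs</div>"

cache_file = os.path.join(tempfile.mkdtemp(), "leagueYrs", "2021league.htm")
first = cached_read(cache_file, fake_download)   # miss: downloads and writes the file
second = cached_read(cache_file, fake_download)  # hit: reads the file, no second download
print(first == second, len(fetches))
```

Skipping the write when nothing was downloaded matters because an empty cached file would otherwise be read back on every later call.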

In [7]:
def getYear(url):
    # The four characters before ".htm" are the season year
    end = url.find(".htm")
    return int(url[end - 4:end])

The two-digit year makes it easier to nickname teams in visualizations.

In [8]:
def twoDigitYear(year):
    return str(year)[-2:]

In [9]:
twoDigitYear(2004)

'04'

These functions parse the URL and build the file name each page is saved under.

In [10]:
def getAbbrv(url):
    # The abbreviation sits between "teams/" and the "/YYYY.htm" suffix
    return url[url.index("teams/") + 6:url.index(".htm") - 5]

In [11]:
def nameYear(link):
    i = link.find("years") + 6
    return link[i:i + 4] + "league.htm"

In [12]:
def nameTeam(link):
    return getAbbrv(link) + str(getYear(link)) + ".htm"
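
The index arithmetic above depends on the exact URL layout. As an alternative, here is a minimal sketch that pulls both pieces out with one regular expression; `TEAM_URL` and `parse_team_url` are hypothetical names, and the pattern assumes team abbreviations are lowercase letters, as in the URLs used in this notebook.

```python
import re

# Matches .../teams/<abbreviation>/<four-digit year>.htm
TEAM_URL = re.compile(r"/teams/([a-z]+)/(\d{4})\.htm$")

def parse_team_url(url):
    """Return (abbreviation, year) for a team-season URL, or None if it does not match."""
    match = TEAM_URL.search(url)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_team_url("https://www.pro-football-reference.com/teams/nwe/2021.htm"))
# ('nwe', 2021)
```

Returning `None` for a non-matching URL gives the caller something explicit to check, rather than a silently wrong slice.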

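`getResult` takes a team's index in the array of playoff teams, plus the number of playoff teams that season, and returns a numeric code for the round in which the team exited (0 = missed the playoffs, 5 = won the Super Bowl). `convertResult` turns that code into a label.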

In [13]:
def getResult(index, length):
    if index < 0:
        return 0
    elif index == length - 1:
        return 5
    elif index == length - 2:
        return 4
    elif index >= length - 4:
        return 3
    elif index >= length - 8:
        return 2
    else:
        return 1

In [14]:
def convertResult(index):
    if index <= 0:
        return "Missed"
    elif index == 5:
        return "Won SB"
    elif index == 4:
        return "Lost SB"
    elif index == 3:
        return "Lost Title"
    elif index == 2:
        return "Lost Div"
    else:
        return "Lost WC"
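
To see how position maps to round, here is a small worked example for a 14-team field. It is a sketch only: `round_code` restates the same thresholds by counting from the end of the array, and `LABELS` is a hypothetical lookup list matching the labels above.

```python
LABELS = ["Missed", "Lost WC", "Lost Div", "Lost Title", "Lost SB", "Won SB"]

def round_code(index, length):
    """Same position-to-round thresholds as getResult, counted from the end of the array."""
    if index < 0:
        return 0
    from_end = length - 1 - index  # 0 = champion, 1 = Super Bowl loser, ...
    if from_end == 0:
        return 5
    if from_end == 1:
        return 4
    if from_end <= 3:
        return 3
    if from_end <= 7:
        return 2
    return 1

# A 14-team field: 6 wild-card losers, 4 divisional, 2 title-game, then the Super Bowl pair
codes = [round_code(i, 14) for i in range(14)]
print(codes)
# [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5]
print([LABELS[c] for c in codes[-4:]])
# ['Lost Title', 'Lost Title', 'Lost SB', 'Won SB']
```

The mapping only works if the array lists losers in round order with the champion last, and if `length` matches the actual number of teams in the array.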

Getting the playoff teams

In [15]:
def getYearURL(year):
    base = "https://www.pro-football-reference.com"
    url = f"{base}/years/{year}/index.htm"
    return url

In [16]:
async def getPlayoffTeamsArr(url):
    base = "https://www.pro-football-reference.com"
    # Losers first, then the Super Bowl winner as the last element
    a_tags = (await findLosers(url)) + (await findWinner(url))
    hrefs = [a["href"] for a in a_tags]
    teams = [l for l in hrefs if "/teams/" in l]
    finalTeams = [base + t for t in teams]
    return finalTeams

In [17]:
async def findWinner(url):
    html = BeautifulSoup(await savePath(url, YR_DIR, nameYear, "#div_playoff_results"), "html.parser")
    # The last "winner" cell in the playoff table belongs to the Super Bowl
    winner = html.find_all("td", {'data-stat': 'winner'})[-1]
    return winner.find_all("a")

In [18]:
async def findLosers(url):
    html = BeautifulSoup(await savePath(url, YR_DIR, nameYear, "#div_playoff_results"), "html.parser")
    losers = html.find_all("td", {'data-stat': 'loser'})
    # Collect the team links from every "loser" cell, in table order
    a_tags = [td.find_all("a") for td in losers]
    return sum(a_tags, [])

Example of calling code below:

In [19]:
teams = await getPlayoffTeamsArr("https://www.pro-football-reference.com/years/2021/index.htm")
# Losers come first, in table order; the Super Bowl winner is the last entry
teams

['https://www.pro-football-reference.com/teams/nwe/2021.htm',
 'https://www.pro-football-reference.com/teams/rai/2021.htm',
 'https://www.pro-football-reference.com/teams/pit/2021.htm',
 'https://www.pro-football-reference.com/teams/dal/2021.htm',
 'https://www.pro-football-reference.com/teams/phi/2021.htm',
 'https://www.pro-football-reference.com/teams/crd/2021.htm',
 'https://www.pro-football-reference.com/teams/oti/2021.htm',
 'https://www.pro-football-reference.com/teams/gnb/2021.htm',
 'https://www.pro-football-reference.com/teams/buf/2021.htm',
 'https://www.pro-football-reference.com/teams/tam/2021.htm',
 'https://www.pro-football-reference.com/teams/kan/2021.htm',
 'https://www.pro-football-reference.com/teams/sfo/2021.htm',
 'https://www.pro-football-reference.com/teams/cin/2021.htm',
 'https://www.pro-football-reference.com/teams/ram/2021.htm']

`findResult` looks up a team's URL in that season's array of playoff teams and returns its result code, or "undefined" for seasons outside 1970-2022.

In [20]:
async def findResult(teamURL, year):
    # Seasons outside 1970-2022 are not handled; check before any lookup
    if year > 2022 or year < 1970:
        return "undefined"
    yearURL = getYearURL(year)
    size = playoff_size(year)
    arr = await getPlayoffTeamsArr(yearURL)
    index = arr.index(teamURL) if teamURL in arr else -1
    return getResult(index, size)

In [21]:
await findResult("https://www.pro-football-reference.com/teams/det/1970.htm",2024)

'undefined'

## Pulling Data from HTML
Each function below pulls one field out of a saved page. <br>
In practice, they all follow the same pattern.

In [22]:
def getWins(string):
    # Wins are everything before the first hyphen in a "W-L-T" record
    return int(string[:string.find("-")])

In [23]:
def getGames(string):
    w = getWins(string)
    rest = string[string.find("-") + 1:]
    l = getWins(rest)
    t = int(rest[rest.find("-") + 1:])
    return w + l + t
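
The two helpers above assume a three-part "W-L-T" record. A minimal sketch of a split-based parser that also accepts a two-part "W-L" string is shown here; `Record` and `parse_record` are hypothetical names, and whether the site ever omits the ties field is an assumption I have not checked.

```python
from collections import namedtuple

Record = namedtuple("Record", ["wins", "losses", "ties"])

def parse_record(text):
    """Parse 'W-L' or 'W-L-T' into a Record; ties default to 0 when omitted."""
    parts = [int(p) for p in text.strip().split("-")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError(f"unexpected record format: {text!r}")
    return Record(*parts)

rec = parse_record("10-4-0")
print(rec.wins, sum(rec))
# 10 14
```

Parsing once into a named tuple gives wins and total games from the same object, so the record string is not sliced twice.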

In [24]:
# Swap the year in a team URL to get the same team `diff` seasons later (or earlier)
def getNext(url, diff):
    base = url[:url.find(".htm") - 5]
    yr = getYear(url) + diff
    if yr > 2023 or yr < 1950:
        return ""
    return f"{base}/{yr}.htm"

In [25]:
def searchHTML(string, html, div):
    # Return the index of the first `div` tag whose markup contains `string`, or -1
    arr = html.find_all(div)
    for i, p in enumerate(arr):
        if string in str(p):
            return i
    return -1

Meta is the HTML section of the pages that I'm pulling from. <br>
I realized I had a lot of code duplication, so I made this nice helper to execute most pulls.

In [26]:
async def searchMeta(url, div, string):
    html = BeautifulSoup(await savePath(url, DATA_DIR, nameTeam, "#meta"), "html.parser")
    index = searchHTML(string, html, div)
    if index < 0:
        return "undefined"
    return html.find_all(div)[index].getText()

In [27]:
# get record from html
async def getRec(link):
    ret = await searchMeta(link, "p", "Record")
    if ret == "undefined":
        return ret
    # The record sits between "Record: " and the first comma
    return ret[ret.find(":") + 2:ret.find(",")]

In [28]:
async def getFullName(url):
    html = BeautifulSoup(await savePath(url, DATA_DIR, nameTeam, "#meta"), "html.parser")
    # The second <span> in the meta block holds the full team name
    return html.find_all("span")[1].getText()

In [29]:
def getNickname(year, name):
    return "'" + twoDigitYear(year) + " " + name.split()[-1]

In [30]:
async def getDivision(url):
    ret = await searchMeta(url, "p", "Record")
    if ret == "undefined":
        return ret
    return ret[ret.find("\t") + 1:ret.find("Div") - 1]

In [31]:
async def getConf(url):
    ret = await getDivision(url)
    if ret == "undefined":
        return ret
    return ret[0:3]

In [32]:
async def getCoach(url):
    ret = await searchMeta(url, "p", "Coach")
    if ret == "undefined":
        return ret
    return ret[ret.find("\n") + 1:ret.find("(") - 1]

In [33]:
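# can be buggy for older seasons
async def getSBOdds(url):
    ret = await searchMeta(url, "p", "Preseason Odds")
    if ret == "undefined":
        return ret
    # Stop at the ";" when one follows the odds, otherwise read to the end of the text
    semi = ret.find(";")
    endI = semi if semi >= 0 else len(ret)
    return ret[ret.find("Bowl") + 5:endI].strip()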

In [34]:
await getSBOdds("https://www.pro-football-reference.com/teams/rai/1980.htm")

'+350'

In [35]:
# can be buggy for older seasons
async def getOverUnder(url):
    ret = await searchMeta(url, "p", "O/U:")
    if ret == "undefined":
        return ret
    return float(ret[ret.find("O/U:") + 5:])

In [36]:
await searchMeta("https://www.pro-football-reference.com/teams/clt/1970.htm","p","Odds")

'undefined'

In [37]:
await getOverUnder("https://www.pro-football-reference.com/teams/kan/2021.htm")

12.5
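
Both odds helpers slice the same line of text by position. Here is a minimal sketch of a regex-based alternative, assuming the line looks like `Preseason Odds: Super Bowl +350; O/U: 12.5`. That format is inferred from the slicing code above, not confirmed against the site, and `parse_preseason_line` is a hypothetical name.

```python
import re

ODDS = re.compile(r"Super Bowl\s+([+-]\d+)")
OVER_UNDER = re.compile(r"O/U:\s*(\d+(?:\.\d+)?)")

def parse_preseason_line(text):
    """Return (odds_string, over_under); either is None if absent from the line."""
    odds = ODDS.search(text)
    ou = OVER_UNDER.search(text)
    return (
        odds.group(1) if odds else None,
        float(ou.group(1)) if ou else None,
    )

print(parse_preseason_line("Preseason Odds: Super Bowl +350; O/U: 12.5"))
# ('+350', 12.5)
print(parse_preseason_line("Preseason Odds: Super Bowl +350"))
# ('+350', None)
```

Matching each field separately means a missing over/under does not disturb the odds, which may help with the older seasons flagged above.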

In [38]:
async def getPFRank(url):
    ret = await searchMeta(url, "p", "Points For")
    if ret == "undefined":
        return ret
    return ret[ret.find(")") + 2:ret.find("of") - 3]

In [39]:
async def getPARank(url):
    ret = await searchMeta(url, "p", "Points Against")
    if ret == "undefined":
        return ret
    return ret[ret.find(")") + 2:ret.find("of") - 3]

In [40]:
async def getExpRec(url):
    ret = await searchMeta(url, "p", "Expected W-L")
    if ret == "undefined":
        return ret
    return ret[ret.find(":") + 2:]

In [41]:
async def getSRS(url):
    ret = await searchMeta(url, "p", "#srs")
    if ret == "undefined":
        return ret
    return ret[ret.find(":") + 2:ret.find("(") - 1]

In [42]:
async def getSOS(url):
    ret = await searchMeta(url, "p", "#sos")
    if ret == "undefined":
        return ret
    return ret[ret.find("SOS: ") + 5:-1]

## Scrape_season function
Executes the calling of the playoff URL

In [43]:
async def scrape_season(season):
    url = getYearURL(season)
    finalTeams = await getPlayoffTeamsArr(url)
    return finalTeams

## Manipulating the Data

### Project 1: NFL Playoff Results vs Next Season Playoff Results

Here, I'm trying to visualize the odds that a team lands in a given playoff round next season, based on how far it advanced in the current season's playoffs. Essentially, this is a Markov transition matrix with each current-season playoff result as a row.
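
To make the idea concrete, here is a minimal pure-Python sketch of a row-normalized transition matrix built from (current result, next result) pairs. The function name `transition_matrix` and the toy data are made up for illustration and are not taken from the scraped results.

```python
from collections import Counter, defaultdict

def transition_matrix(pairs, n_states=6):
    """Row-normalize (current, next) counts; rows with no observations are omitted."""
    counts = defaultdict(Counter)
    for current, nxt in pairs:
        counts[current][nxt] += 1
    matrix = {}
    for current, row in sorted(counts.items()):
        total = sum(row.values())
        # Include every possible next state so each row has n_states entries
        matrix[current] = [round(row[s] / total, 3) for s in range(n_states)]
    return matrix

# Toy data (made up): four Super Bowl losers (code 4) and their next-season codes
toy = [(4, 0), (4, 0), (4, 2), (4, 5)]
print(transition_matrix(toy))
# {4: [0.5, 0.0, 0.25, 0.0, 0.0, 0.25]}
```

Each row sums to 1, and outcomes that never occurred show up as 0.0 rather than being dropped. With a DataFrame, `pd.crosstab(..., normalize="index")` produces the same kind of table.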

In [44]:
SEASONS = list(range(1970,2022));
SEASONS = [await scrape_season(yr) for yr in SEASONS]

In [45]:
simple = [];
for arrays in SEASONS:
    for team in arrays:
        year = getYear(team)
        fullName = await getFullName(team)
        simple.append([year,
                         getNickname(year,fullName),
                         await findResult(team, year),
                         await findResult(getNext(team,+1), year+1)])

In [46]:
sf = pd.DataFrame(simple)
sf.columns = ["Year", "Team","Round","Next_Round"]
# sf.columns = ["Year", "Team","Prev_Round","Round","Next_Round","Prev_Wins","W","Next_Wins"]
sf_losers = sf[sf['Round'] == 4]
sf_losers

    Year          Team  Round  Next_Round
6   1970   '70 Cowboys      4           5
14  1971  '71 Dolphins      4           5
22  1972  '72 Redskins      4           2
30  1973   '73 Vikings      4           4
38  1974   '74 Vikings      4           2
46  1975   '75 Cowboys      4           2
54  1976   '76 Vikings      4           3
62  1977   '77 Broncos      4           2
72  1978   '78 Cowboys      4           2
82  1979      '79 Rams      4           1


In [47]:
rows = ['Lost WC', 'Lost Div', 'Lost Title', 'Lost SB', 'Won SB']
cols = ['Missed', 'Lost WC', 'Lost Div', 'Lost Title', 'Lost SB', 'Won SB']
start = []

# One row per current-season result (codes 1-5); every team in sf made the playoffs
for val in range(1, 6):
    subset = sf[sf['Round'] == val]
    # Share of each next-season result; reindex so outcomes that never occurred count as 0
    percents = subset['Next_Round'].value_counts(normalize=True).reindex(range(6), fill_value=0.0)
    start.append([round(num, 3) for num in percents.tolist()])

df = pd.DataFrame(np.array(start), index=rows, columns=cols)
print(df)
# df.to_csv('playoffhangover.csv', index=True, header=True)

            Missed  Lost WC  Lost Div  Lost Title  Lost SB  Won SB
Lost WC      0.544    0.114     0.190       0.089    0.038   0.025
Lost Div     0.505    0.115     0.149       0.115    0.053   0.062
Lost Title   0.352    0.102     0.194       0.167    0.056   0.130
Lost SB      0.288    0.154     0.327       0.096    0.077   0.058
Won SB       0.288    0.096     0.231       0.135    0.096   0.154


The following chart was created in Tableau:
![NFL Playoff Results](sbhangover/nflplayoffresults.png)

### Project 2: NFL Super Bowl Loser Wins Drop-Off

In [48]:
SEASONS = list(range(1970,2022));
SEASONS = [await scrape_season(yr) for yr in SEASONS]

In [49]:
simple = []
for arrays in SEASONS:
    for team in arrays:
        year = getYear(team)
        nextTeam = getNext(team, +1)
        # Read each record once and reuse it for both games and wins
        rec = await getRec(team)
        nextRec = await getRec(nextTeam)
        wins = getWins(rec)
        nextWins = getWins(nextRec)
        fullName = await getFullName(team)
        simple.append([year,
                       getNickname(year, fullName),
                       await findResult(team, year),
                       await findResult(nextTeam, year + 1),
                       getGames(rec),
                       getGames(nextRec),
                       wins,
                       nextWins,
                       nextWins - wins])

In [50]:
sf = pd.DataFrame(simple)
sf.columns = ["Year", "Team","Round","Next_Round","Games","NextGames","W","Next_Wins","Diff"]
# sf.columns = ["Year", "Team","Prev_Round","Round","Next_Round","Prev_Wins","W","Next_Wins"]
sf_losers = sf[sf['Round'] == 4]

In [51]:
output_df = sf_losers[["Year","Team","Next_Wins","Diff"]]
output_df
# output_df.to_csv('winsDropOff.csv', index = False, header=True)

    Year          Team  Next_Wins  Diff
6   1970   '70 Cowboys         11     1
14  1971  '71 Dolphins         14     4
22  1972  '72 Redskins         10    -1
30  1973   '73 Vikings         10    -2
38  1974   '74 Vikings         12     2
46  1975   '75 Cowboys         11     1
54  1976   '76 Vikings          9    -2
62  1977   '77 Broncos         10    -2
72  1978   '78 Cowboys         11    -1
82  1979      '79 Rams         11     2


The following chart was created in Tableau:
![NFL SB Loser drop-off](winsdropOff/output.png)

### Project 3: NFL Respect for Super Bowl Winners - Visualized with Next Season Odds

In [52]:
SEASONS = list(range(1989,2022));
SEASONS = [await scrape_season(yr) for yr in SEASONS]

In [53]:
winners = [arr[-1] for arr in SEASONS]

In [54]:
simple = []
for winner in winners:
    full = await getFullName(winner)
    year = getYear(winner)
    nextTeam = getNext(winner, 1)
    odds = await getSBOdds(nextTeam)
    overUnder = await getOverUnder(nextTeam)
    simple.append([year,
                   getNickname(year, full),
                   getWins(await getRec(nextTeam)),
                   await findResult(nextTeam, year + 1),
                   odds,
                   int(odds[1:]),  # numeric odds without the leading sign
                   overUnder])

In [52]:
sf = pd.DataFrame(simple)
sf.columns = ["Year", "Team","N_Wins","N_Round","Odds_Str","Odds","Over-Under"]
# sf.to_csv('sbOdds.csv', index = False, header=True)
sf

   Year          Team  N_Wins  N_Round Odds_Str  Odds  Over-Under
0  1989     '89 49ers      14        5      350   350        11.5
1  1990    '90 Giants       8        0      400   400        11.0
2  1991  '91 Redskins       9        2      600   600        11.5
3  1992   '92 Cowboys      12        5      350   350        11.5
4  1993   '93 Cowboys      12        3      300   300        11.0
5  1994     '94 49ers      11        2      200   200        12.5
6  1995   '95 Cowboys      10        2      600   600        10.5
7  1996   '96 Packers      13        4      250   250        12.0
8  1997   '97 Broncos      14        5      600   600        11.0
9  1998   '98 Broncos       6        0      500   500        10.5


The following charts were created in Tableau:
![NFL SB Winner Odds](sbodds/sbOddsYears.png)
![NFL SB Winner Odds](sbodds/sbOddsSorted.png)

And there it is! We made these beautiful Tableau charts!