### Home Field Advantage

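We reuse the scraping code from the Park Factor notebook to pull each team's schedule for the five seasons from 2015 through 2019 (2020 is excluded) and calculate how often the home team won. Only games whose opponent is marked with an asterisk on the schedule page are counted; the code treats these as conference games.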

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pickle

In [40]:
teams = ["Bethel", "Goshen", "Grace", "HU", "IWU", "Marian", "MVNU", "SAU", "SFU", "Taylor"]
t_nums = [1629, 1678, 1679, 1688, 1694, 1717, 1736, 1780, 1805, 1784]
years = [2015, 2016, 2017, 2018, 2019]
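
The two lists above are parallel: each entry in `t_nums` is the site ID for the team at the same position in `teams`. As a minimal sketch of how one schedule URL is assembled from them, the snippet below pairs the lists in a dictionary. The names `team_ids`, `BASE_URL`, and `schedule_url` are illustrative and do not appear in the Park Factor notebook.

```python
teams = ["Bethel", "Goshen", "Grace", "HU", "IWU", "Marian", "MVNU", "SAU", "SFU", "Taylor"]
t_nums = [1629, 1678, 1679, 1688, 1694, 1717, 1736, 1780, 1805, 1784]

# Pair each team abbreviation with its site ID (illustrative helper structure).
team_ids = dict(zip(teams, t_nums))

BASE_URL = "http://www.dakstats.com/WebSync/Pages/Team/TeamSchedule.aspx"


def schedule_url(team, year):
    """Build the schedule URL for one team and season, using the same query string as the loop below."""
    return f"{BASE_URL}?association=10&sg=MBA&sea=NAIMBA_{year}&team={team_ids[team]}"


print(schedule_url("Taylor", 2019))
```

Keeping the pairing in a dictionary makes it harder for the two lists to drift out of order if a team is added or removed later.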

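The code below is the scraping code from the Park Factor notebook, wrapped in a loop that runs once per year and appends that year's home win percentage to `homeField`: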

In [42]:
homeField = []
for year in years:
    urls = ['http://www.dakstats.com/WebSync/Pages/Team/TeamSchedule.aspx?association=10&sg=MBA&sea=NAIMBA_' + str(year) + '&team=' +
            str(num) for num in t_nums]
    # Fetch each team's schedule page
    pages = [requests.get(url) for url in urls]
    # Parse each page's HTML with BeautifulSoup4
    soups = [BeautifulSoup(page.content) for page in pages]
    team_tables = [
      [
        [
          [td.get_text(strip=True) for td in tr.find_all('td')] 
          for tr in table.find_all('tr') 
        ]#for each row in each table
        for table in soup.find_all('table') 
      ]#for each table on each webpage
      for soup in soups 
    ]#for each team's webpage
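    headers = [['Date', 'Opponent', 'Location', 'Score', 'Outcome'] for tables in team_tables]
    # The code takes the table at index 35 on each page as the schedule and keeps
    # every other row starting from the second, trimmed to the first five columns
    team_rows = [[r[:5] for r in tables[35][1::2]] for tables in team_tables]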
    dfc = [pd.DataFrame(columns = headers[i], data = team_rows[i]) for i in range(len(headers))]
    conf_df = [df[df.Opponent.str.contains("*", regex = False)] for df in dfc]
    tidy_conf = conf_df.copy()
    for i, df in enumerate(conf_df):
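      # Strip any parenthetical suffix from the score before splitting on '-'.
      # regex=True is passed explicitly because pandas 2.x defaults str.replace to literal matching.
      split_scores = df['Score'].str.replace(r"\(.*\)", "", regex = True).str.split('-', expand = True)
      tidy_conf[i] = df.assign(Score = pd.to_numeric(split_scores[0]),
                               Opp_score = pd.to_numeric(split_scores[1]),
                               Opponent = df.Opponent.str.replace(r' \*', '', regex = True),
                               Date = pd.to_datetime(df.Date)
                               )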
    conf_h = [df[df.Location.str.contains("H", regex = False)] for df in tidy_conf]
    conf_a = [df[df.Location.str.contains("A", regex = False)] for df in tidy_conf]
    conf_h_w = [df[df.Outcome.str.contains("W", regex = False)] for df in conf_h]
    all_h_wins = pd.concat(conf_h_w)
    all_h_games = pd.concat(conf_h)
    h_win_num = len(all_h_wins.index)
    h_game_num = len(all_h_games.index)
    hf = h_win_num / h_game_num
    homeField.append(hf)
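
To make the filtering steps easier to follow without hitting the website, here is a minimal sketch that applies the same logic to a tiny hand-made schedule. The rows in `games` are invented for illustration and are not real results; the `.str.strip()` call is an extra safeguard added in this sketch, not part of the loop above.

```python
import pandas as pd

# Hypothetical schedule rows; '*' marks the opponents the code treats as conference games.
games = pd.DataFrame({
    "Opponent": ["Taylor *", "Grace *", "Aquinas", "Marian *"],
    "Location": ["H", "A", "H", "H"],
    "Score": ["5-3", "2-7", "4-1", "6-7 (10)"],
    "Outcome": ["W", "L", "W", "L"],
})

# Keep only games against asterisk-marked opponents.
conf = games[games.Opponent.str.contains("*", regex=False)]

# Remove any parenthetical suffix, then split the score into the two teams' totals.
scores = (conf["Score"]
          .str.replace(r"\(.*\)", "", regex=True)
          .str.strip()
          .str.split("-", expand=True))
conf = conf.assign(Score=pd.to_numeric(scores[0]),
                   Opp_score=pd.to_numeric(scores[1]))

# Home win percentage: share of home games that were wins.
home = conf[conf.Location.str.contains("H", regex=False)]
home_wins = home[home.Outcome.str.contains("W", regex=False)]
home_win_pct = len(home_wins.index) / len(home.index)
print(home_win_pct)
```

In the toy data, two of the three asterisk-marked games are at home and one of those is a win, so the home win percentage is one half.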

The home-team win percentage for each year, 2015 through 2019:

In [43]:
homeField

[0.5507246376811594,
 0.559322033898305,
 0.5378787878787878,
 0.553030303030303,
 0.5703125]

The five-year average home-field advantage, taken as the unweighted mean of the yearly values:

In [44]:
avgHomeField = sum(homeField) / len(homeField)
avgHomeField

0.5542536524977111
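
The average above gives every season equal weight regardless of how many games it contained. As a rough sketch of the difference, the snippet below reproduces the unweighted mean from the yearly values and contrasts it with a pooled rate (total home wins over total home games). The per-season counts in `wins` and `games` are made up purely to illustrate the calculation; the real counts are not shown in this notebook's output.

```python
from statistics import fmean

# Yearly home win percentages, copied from the output above.
home_field = [0.5507246376811594, 0.559322033898305, 0.5378787878787878,
              0.553030303030303, 0.5703125]

# Unweighted mean: every season counts equally.
unweighted = fmean(home_field)

# Hypothetical counts for two seasons of different length (illustration only).
wins = [6, 7]
games = [10, 14]

mean_of_rates = fmean(w / g for w, g in zip(wins, games))  # each season weighted equally
pooled_rate = sum(wins) / sum(games)                       # each game weighted equally

print(unweighted, mean_of_rates, pooled_rate)
```

When the seasons have similar numbers of games the two approaches give nearly the same answer, so the choice matters mainly if some seasons are much shorter than others.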