### Individual Conference Statistics

Before any analysis, we scrape the DakStats web pages for each team's individual statistics.

We ran this line once to install the necessary package.

In [17]:
#pip install selenium

Additionally, we had to complete the following steps:
1. Ensure Mozilla Firefox was installed on our devices.
2. Download a program named "Geckodriver", which can be found at https://github.com/mozilla/geckodriver/releases.
3. Create a new folder named `Geckodriver` in our `C:\Program Files` folder and move the `geckodriver.exe` file there.
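
Before launching the browser, it can help to confirm that the driver file is where the later code expects it. The following is a minimal sketch and was not part of our original workflow; the helper name `find_geckodriver` and its list of candidate paths are assumptions made for illustration.

```python
import os
import shutil

def find_geckodriver(candidates):
    """Return the first existing geckodriver path, or None if none is found.

    `candidates` is a hypothetical list of locations to try, in order.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Fall back to searching the directories listed in PATH.
    return shutil.which("geckodriver")

# The only candidate here is the folder described in step 3 above.
driver_path = find_geckodriver([r"C:\Program Files\Geckodriver\geckodriver.exe"])
if driver_path is None:
    print("geckodriver not found; check steps 2 and 3")
else:
    print("Using geckodriver at", driver_path)
```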

We import the necessary libraries, assign the team names and the team numbers given by DakStats, and build the URL for each team's statistics page.

In [18]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
import pickle

In [19]:
teams = ["Bethel", "Goshen", "Grace", "HU", "IWU", "Marian", "MVNU", "SAU", "SFU", "Taylor"]
t_nums = [1629, 1678, 1679, 1688, 1694, 1717, 1736, 1780, 1805, 1784]  # DakStats team numbers, in the same order as teams

In [20]:
base_url = 'http://www.dakstats.com/WebSync/Pages/Team/IndividualStats.aspx'
urls = [f'{base_url}?association=10&sg=MBA&conference=NAIMBA1_CROSS&team={num}&sea=NAIMBA_2019'
        for num in t_nums]
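
The URL pattern above can also be assembled from named query parameters, which makes it easier to see what changes from team to team. This is a sketch under the assumption that DakStats accepts the parameters exactly as they appear in our URLs; `team_stats_url` and `team_numbers` are names introduced here for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.dakstats.com/WebSync/Pages/Team/IndividualStats.aspx"

def team_stats_url(team_number):
    """Build the individual-statistics URL for one DakStats team number."""
    query = urlencode({
        "association": 10,
        "sg": "MBA",
        "conference": "NAIMBA1_CROSS",
        "team": team_number,
        "sea": "NAIMBA_2019",
    })
    return f"{BASE_URL}?{query}"

# Pair each team name with its number so the two lists cannot drift apart.
team_numbers = dict(zip(
    ["Bethel", "Goshen", "Grace", "HU", "IWU", "Marian", "MVNU", "SAU", "SFU", "Taylor"],
    [1629, 1678, 1679, 1688, 1694, 1717, 1736, 1780, 1805, 1784],
))
team_urls = {name: team_stats_url(num) for name, num in team_numbers.items()}
print(team_urls["Grace"])
```

Only the `team` parameter varies; the association, sport, conference, and season values stay fixed for every request.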

We use Geckodriver to open each URL in Firefox. For each team, the loop loads the page and calls `find_elements_by_css_selector("input[type='radio']")[1].click()` to find the radio buttons and click the second one, which restricts the page to conference statistics. We then store the HTML of each team's page.

In [21]:
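# Raw string, so single backslashes are kept literally in the Windows path
driver = webdriver.Firefox(executable_path=r'C:\Program Files\Geckodriver\geckodriver.exe')
page_sources = []
for url in urls:
    driver.get(url)
    # Click the second radio button to show conference-only statistics
    driver.find_elements_by_css_selector("input[type='radio']")[1].click()
    page_sources.append(driver.page_source)
driver.quit()  # close the browser once every page has been saved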
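
The `find_elements_by_css_selector` method comes from Selenium 3, which we used; later Selenium 4 releases removed it in favor of `find_elements(By.CSS_SELECTOR, ...)`. The sketch below shows how the same loop might look with the newer call. It assumes the page layout is unchanged, `collect_page_sources` is a hypothetical helper name, and we have not run it against DakStats.

```python
def collect_page_sources(driver, urls, radio_index=1):
    """Open each URL, click one radio button, and return the resulting HTML.

    `driver` is assumed to be a Selenium 4 WebDriver. The locator string
    "css selector" is the value of selenium.webdriver.common.by.By.CSS_SELECTOR.
    """
    page_sources = []
    for url in urls:
        driver.get(url)
        radios = driver.find_elements("css selector", "input[type='radio']")
        if len(radios) <= radio_index:
            raise RuntimeError(f"no radio button at index {radio_index} on {url}")
        radios[radio_index].click()  # index 1: conference-only statistics
        page_sources.append(driver.page_source)
    return page_sources

# Possible usage (requires Selenium 4 and Firefox, so it is left as a comment):
#   from selenium import webdriver
#   from selenium.webdriver.firefox.service import Service
#   service = Service(executable_path=r'C:\Program Files\Geckodriver\geckodriver.exe')
#   driver = webdriver.Firefox(service=service)
#   try:
#       page_sources = collect_page_sources(driver, urls)
#   finally:
#       driver.quit()
```

Raising an error when the radio button is missing makes a changed page layout fail loudly instead of silently saving the wrong table.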

We parse the HTML for each team with BeautifulSoup. We used `soups[2].prettify()` to inspect the HTML and found that the tables we need all share the attribute `class="gridViewReportBuilderWide"`. We use this class to extract only the data we need and store it in nested lists.

In [22]:
# Parse each page into a BeautifulSoup tree; naming the parser avoids a warning
soups = [BeautifulSoup(page, 'html.parser') for page in page_sources]
#print(soups[2].prettify()) <-- This line commented out because of giant output

In [23]:
stat_tables = [
    [
        [
            [td.get_text(strip=True) for td in tr.find_all('td')]  # text of each cell in a row
            for tr in table.find_all('tr')  # each row in a table
        ]
        for table in soup.find_all('table', {"class": "gridViewReportBuilderWide"})  # each stats table on a page
    ]
    for soup in soups  # each team's page
]
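
To see what this nested comprehension produces, here is a small self-contained sketch. The HTML snippet is invented for illustration and only mimics the structure we assume the DakStats tables have, including a header row built from `<th>` cells.

```python
from bs4 import BeautifulSoup

# Invented sample: one matching table with a <th> header row and two data rows,
# plus a second table that is skipped because it lacks the class.
sample_html = """
<table class="gridViewReportBuilderWide">
  <tr><th>Batting</th><th>GP</th><th>AVG</th></tr>
  <tr><td>Player A</td><td>26</td><td>.390</td></tr>
  <tr><td>Player B</td><td>23</td><td>.352</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tables = [
    [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in table.find_all("tr")]
    for table in soup.find_all("table", {"class": "gridViewReportBuilderWide"})
]
print(tables)
```

In this sample the header row contains no `td` cells, so it comes back as an empty list. If the real pages are built the same way, that would account for an empty first row in any dataframe built from these lists.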

Now that we have the data, we define the headers for our batting dataframe and gather the rows. We subset only the first table from each team's web page (denoted by the index `[0]`) because the batting table shows up first on each page.

In [24]:
bat_headers = ["Batting", "GP", "GS", "AVG", "AB", "R", "H", "2B", "3B", "HR", "RBI", "TB", "SLG",
               "BB", "HBP", "SO", "GDP", "OBP", "SF", "SH", "SB", "SBA"]
bat_rows = [stat_tables[i][0] for i in range(len(teams))]

Then, we organize the data into a list of dataframes, one per team. We noticed that the first row of each dataframe was empty, so we keep only the rows from the second onward using `df.iloc[1:]`. Here we show the first five rows of Grace's conference batting statistics.

In [25]:
dfb = [pd.DataFrame(columns=bat_headers, data=bat_rows[i]) for i in range(len(teams))]
dfb = [df.iloc[1:].copy() for df in dfb]  # remove first empty row; copy so later column assignment is safe
dfb[2][:5]  # Grace is at index 2 in teams

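|   | Batting | GP | GS | AVG | AB | R | H | 2B | 3B | HR | ... | SLG | BB | HBP | SO | GDP | OBP | SF | SH | SB | SBA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Griffin, Chris | 26 | 25 | 0.39 | 82 | 18 | 32 | 4 | 4 | 4 | ... | 0.683 | 15 | 4 | 21 | 0 | 0.505 | 0 | 0 | 0 | 2 |
| 2 | Enyart, Mitchell | 23 | 22 | 0.352 | 71 | 11 | 25 | 4 | 0 | 1 | ... | 0.451 | 11 | 0 | 11 | 1 | 0.434 | 1 | 3 | 1 | 1 |
| 3 | Harris, Xavier | 27 | 27 | 0.337 | 86 | 19 | 29 | 3 | 0 | 0 | ... | 0.372 | 11 | 7 | 16 | 2 | 0.448 | 1 | 1 | 3 | 4 |
| 4 | Elford, Sid | 18 | 8 | 0.333 | 36 | 5 | 12 | 2 | 0 | 1 | ... | 0.472 | 4 | 1 | 10 | 2 | 0.415 | 0 | 0 | 1 | 1 |
| 5 | Haney, Houston | 24 | 23 | 0.321 | 78 | 8 | 25 | 7 | 0 | 2 | ... | 0.487 | 5 | 2 | 20 | 0 | 0.364 | 3 | 0 | 1 | 1 |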


Because every value in each dataframe is stored as a string, we use `df["column_name"].apply(pd.to_numeric)` to convert the data to integers and floats as appropriate. We do this for every column except the column of player names.

In [26]:
for df in dfb:
    df[bat_headers[1:]] = df[bat_headers[1:]].apply(pd.to_numeric)
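
Scraped columns sometimes contain placeholders such as `-` that `pd.to_numeric` cannot parse, in which case it raises an error by default. The sketch below shows a more tolerant conversion using `errors="coerce"`, which turns unparseable entries into `NaN`. The sample data is invented, and the code above does not use this option.

```python
import pandas as pd

# Invented sample in the same shape as a scraped table: every value is a string.
df = pd.DataFrame({
    "Batting": ["Player A", "Player B", "Player C"],
    "GP": ["26", "23", "18"],
    "AVG": [".390", ".352", "-"],  # "-" stands in for a missing value
})

numeric_cols = ["GP", "AVG"]
# errors="coerce" replaces anything that is not a number with NaN instead of raising.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Coercion hides bad values rather than reporting them, so it is worth counting the resulting `NaN` entries before relying on the converted columns.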

We follow a similar process to create the dataframes for pitching statistics. Instead of collecting data from the first table on each DakStats web page, we collect it from the second table, which displays the pitching statistics.

In [27]:
pitch_headers = ["Pitching", "ERA", "W", "L", "GP", "GS", "CG", "SHO", "CBO", "SV", "IP", "H",
                 "R", "ER", "BB", "SO", "2B", "3B", "HR", "TBF", "B_AVG", "WP", "HBP", "BK", "SFA", "SHA"]
pitch_rows = [stat_tables[i][1] for i in range(len(teams))]   #[1] for second table

In [28]:
dfp = [pd.DataFrame(columns=pitch_headers, data=pitch_rows[i]) for i in range(len(teams))]
dfp = [df.iloc[1:].copy() for df in dfp]  # remove first empty row; copy so later column assignment is safe
dfp[2][:5]

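|   | Pitching | ERA | W | L | GP | GS | CG | SHO | CBO | SV | ... | 2B | 3B | HR | TBF | B_AVG | WP | HBP | BK | SFA | SHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Haney, Houston | 3.0 | 2 | 1 | 7 | 4 | 2 | 0 | 0 | 0 | ... | 10 | 0 | 3 | 140 | 0.348 | 0 | 1 | 0 | 2 | 1 |
| 2 | Noska, Jordan | 4.71 | 1 | 0 | 10 | 0 | 0 | 0 | 0 | 1 | ... | 6 | 1 | 2 | 100 | 0.367 | 2 | 4 | 1 | 2 | 2 |
| 3 | Hammel, Jacob | 6.0 | 1 | 0 | 5 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 0 | 0 | 28 | 0.217 | 1 | 1 | 0 | 0 | 0 |
| 4 | Peterson, Ike | 7.59 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 0 | 0 | 54 | 0.306 | 2 | 3 | 1 | 1 | 0 |
| 5 | Anderson, David | 8.25 | 2 | 3 | 6 | 5 | 1 | 0 | 0 | 0 | ... | 9 | 4 | 2 | 131 | 0.387 | 3 | 3 | 0 | 1 | 2 |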


In [29]:
for df in dfp:
    df[pitch_headers[1:]] = df[pitch_headers[1:]].apply(pd.to_numeric)

Finally, we do the same with each page's third table, which displays fielding statistics.

In [30]:
field_headers = ["Fielding", "GP", "GS", "C", "PO", "A", "E", "FLD_pct", "DP", "TP",
                 "SBA", "RCS", "SB_pct", "PB", "CI", "OBS"]
field_rows = [stat_tables[i][2] for i in range(len(teams))]    #[2] for third table

In [31]:
dff = [pd.DataFrame(columns=field_headers, data=field_rows[i]) for i in range(len(teams))]
dff = [df.iloc[1:].copy() for df in dff]  # remove first empty row; copy so later column assignment is safe
dff[2][:5]

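|   | Fielding | GP | GS | C | PO | A | E | FLD_pct | DP | TP | SBA | RCS | SB_pct | PB | CI | OBS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Swartzentruber, Logan | 6 | 5 | 3 | 0 | 3 | 0 | 1.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 2 | Clark, Scottie | 6 | 2 | 1 | 0 | 1 | 0 | 1.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 3 | Noska, Jordan | 10 | 0 | 3 | 0 | 3 | 0 | 1.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 4 | Buzbee, Wyatt | 14 | 0 | 1 | 1 | 0 | 0 | 1.0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |
| 5 | Haney, Houston | 24 | 23 | 67 | 30 | 36 | 1 | 0.985 | 11 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 |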
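
The batting, pitching, and fielding steps differ only in the table index and the header list, so they could be folded into one helper. This is a sketch rather than the code we ran: `build_stat_frames` is a hypothetical name, and it filters out empty rows instead of slicing with `iloc[1:]`. The demo data is invented.

```python
import pandas as pd

def build_stat_frames(stat_tables, table_index, headers):
    """Return one numeric dataframe per team for the table at `table_index`.

    `stat_tables[team][table]` is a list of rows, each row a list of strings.
    """
    frames = []
    for team_tables in stat_tables:
        rows = [row for row in team_tables[table_index] if row]  # drop empty rows
        df = pd.DataFrame(rows, columns=headers)
        # Convert every column except the first (player names) to numbers.
        df[headers[1:]] = df[headers[1:]].apply(pd.to_numeric)
        frames.append(df)
    return frames

# Invented two-team example with a single three-column table per team.
demo_tables = [
    [[[], ["Player A", "26", ".390"]]],
    [[[], ["Player B", "23", ".352"], ["Player C", "18", ".333"]]],
]
demo_frames = build_stat_frames(demo_tables, 0, ["Batting", "GP", "AVG"])
print(demo_frames[1])
```

With the real data, `build_stat_frames(stat_tables, 1, pitch_headers)` would play the role of `dfp`, except that each dataframe's row index would start at 0 rather than 1.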


To load this data in other worksheets, we save our dataframes to a pickle file. We name this file `Stats.pkl`, and it is stored in the same folder as this workbook.

In [32]:
with open('Stats.pkl', 'wb') as f:
    pickle.dump(dfb, f)
    pickle.dump(dfp, f)
    pickle.dump(dff, f)
    pickle.dump(teams, f)
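
Because several objects are dumped into one file, they have to be read back with one `pickle.load` call each, in the same order. The sketch below demonstrates the round trip with small stand-in lists and a temporary file, so it does not touch `Stats.pkl`; the `_demo` names are invented for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins for dfb, dfp, dff, and teams; any picklable objects behave the same way.
objects = [["batting"], ["pitching"], ["fielding"], ["Bethel", "Goshen"]]

path = os.path.join(tempfile.mkdtemp(), "Stats_demo.pkl")
with open(path, "wb") as f:
    for obj in objects:
        pickle.dump(obj, f)

# Each pickle.load call reads the next object, in the order it was dumped.
with open(path, "rb") as f:
    dfb_demo = pickle.load(f)
    dfp_demo = pickle.load(f)
    dff_demo = pickle.load(f)
    teams_demo = pickle.load(f)

print(teams_demo)
```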