## Data Scrape - Team Stats for Bayes Analysis

This notebook collects season statistics for each Division I college team to support a Bayesian analysis. The pooled shooting percentages computed here will be used to calculate a prior distribution.

### Contents

- [Imports](#Imports)
- [Test Scrape](#Test-Scrape)
- [Complete Scrape](#Complete-Scrape)
- [Average Free Throw %](#Average-Free-Throw-%)
- [Average 2 Point %](#Average-2-Point-%)
- [Average 3 Point %](#Average-3-Point-%)

### Imports

In [33]:
# Import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time

### Test Scrape

In [2]:
# creating request from the website
url = 'https://www.sports-reference.com/cbb/seasons/2019-school-stats.html'
rec = requests.get(url)


In [3]:
#looking at status code
rec.status_code

200

In [4]:
soup = BeautifulSoup(rec.content, 'lxml')

In [8]:
#table contains data we're looking for
table = soup.find('table', {'class': 'sortable stats_table'})

In [14]:
#creating list of columns headers for the data frame
columns = [th.text for th in table.find_all('tr')[1].find_all('th')]
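To make the indexing concrete, here is a minimal sketch on a hypothetical miniature table. It assumes the real page has the same shape: a grouping header row followed by the row of column names, and a leading `<th>` rank cell in each data row. The HTML, the team names, and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the stats table (not real data).
html = """
<table class="sortable stats_table">
  <thead>
    <tr><th></th><th colspan="2">Overall</th></tr>
    <tr><th>Rk</th><th>School</th><th>G</th></tr>
  </thead>
  <tbody>
    <tr><th>1</th><td>Team A</td><td>30</td></tr>
    <tr class="thead"><th>Rk</th><th>School</th><th>G</th></tr>
    <tr><th>2</th><td>Team B</td><td>31</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "sortable stats_table"})

# The second <tr> holds the column names; the first is a grouping row.
columns = [th.text for th in table.find_all("tr")[1].find_all("th")]

# Only <td> cells are collected, so the leading <th> rank cell is skipped
# and a header row repeated inside <tbody> yields an empty list.
rows = [[td.text for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")]

print(columns)
print(rows)
```

In this toy table the first header (`Rk`) has no matching data cell, and the repeated header row produces an empty row.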

In [18]:
#example of scrape for one row in the table
[td.text for td in table.find('tbody').find_all('tr')[0].find_all('td')]

['Abilene Christian\xa0NCAA',
 '34',
 '27',
 '7',
 '.794',
 '-1.91',
 '-7.34',
 '14',
 '4',
 '13',
 '2',
 '10',
 '4',
 '2502',
 '2161',
 '',
 '1370',
 '897',
 '1911',
 '.469',
 '251',
 '660',
 '.380',
 '457',
 '642',
 '.712',
 '325',
 '1110',
 '525',
 '297',
 '93',
 '407',
 '635']
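The school name in this row carries a non-breaking space (`\xa0`) followed by an `NCAA` tag, presumably marking a tournament team. If clean names are needed later, one option is to split on that character. The helper name `clean_school` is hypothetical and not part of the notebook.

```python
def clean_school(raw: str) -> str:
    """Return the school name without a trailing tag such as NCAA."""
    # Keep everything before the first non-breaking space, if any.
    return raw.split("\xa0")[0].strip()

print(clean_school("Abilene Christian\xa0NCAA"))
print(clean_school("Air Force"))
```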

### Complete Scrape

In [19]:
#creating list of stats for each college team in Division I since 2008
teams = []
#loop through each season from 2008 through 2019
for i in range(2008, 2020):
    next_url = f'https://www.sports-reference.com/cbb/seasons/{i}-school-stats.html'
    rec = requests.get(next_url)
    soup = BeautifulSoup(rec.content, 'lxml')
    new_table = soup.find('table', {'class': 'sortable stats_table'})
    
    #collecting all rows for each year
    for team in new_table.find('tbody').find_all('tr'):
        team_stat = [td.text for td in team.find_all('td')]
        teams.append(team_stat)
    #report progress only after the season has actually been collected
    print(f'Collected {i} stats')
    #pause between requests to avoid overloading the site
    time.sleep(2)

Collected 2008 stats
Collected 2009 stats
Collected 2010 stats
Collected 2011 stats
Collected 2012 stats
Collected 2013 stats
Collected 2014 stats
Collected 2015 stats
Collected 2016 stats
Collected 2017 stats
Collected 2018 stats
Collected 2019 stats
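The loop above pools every season into one list, so a row no longer records which season it came from. If per-season analysis is ever needed, a small helper could tag each row and skip empty rows at the same time. This is a sketch only: `tag_rows` is a hypothetical name and the sample rows are made up.

```python
def tag_rows(season, rows):
    """Prepend the season to each non-empty row of scraped cell text."""
    return [[season] + row for row in rows if row]

# Made-up rows; the empty list mimics a table row with no <td> cells.
sample = [["Team A", "30"], [], ["Team B", "31"]]
tagged = tag_rows(2019, sample)
print(tagged)
```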


In [21]:
#creating dataframe
df = pd.DataFrame(teams, columns = columns[1:])

In [22]:
df.head()

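| | School | G | W | L | W-L% | SRS | SOS | W.1 | L.1 | W.2 | ... | FT | FTA | FT% | ORB | TRB | AST | STL | BLK | TOV | PF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Air Force | 30 | 16 | 14 | 0.533 | 0.44 | 0.19 | 8 | 8 | 12 | ... | 406 | 583 | 0.696 | 157 | 796 | 365 | 193 | 54 | 380 | 510 |
| 1 | Akron | 35 | 24 | 11 | 0.686 | 6.24 | 0.33 | 11 | 5 | 13 | ... | 521 | 741 | 0.703 | 391 | 1096 | 466 | 251 | 96 | 448 | 681 |
| 2 | Alabama A&M | 29 | 14 | 15 | 0.483 | -16.14 | -12.97 | 11 | 7 | 9 | ... | 475 | 694 | 0.684 | 377 | 1061 | 287 | 214 | 170 | 481 | 550 |
| 3 | Alabama-Birmingham | 34 | 23 | 11 | 0.676 | 8.35 | 2.09 | 12 | 4 | 14 | ... | 523 | 740 | 0.707 | 370 | 1252 | 475 | 195 | 115 | 482 | 630 |
| 4 | Alabama State | 31 | 20 | 11 | 0.645 | -7.73 | -12.7 | 15 | 3 | 11 | ... | 380 | 594 | 0.64 | 385 | 1192 | 457 | 176 | 126 | 412 | 601 |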


In [25]:
df.shape

(4538, 33)

In [27]:
df.isnull().sum().sum()

13266

In [28]:
#removing nulls (rows scraped with no <td> cells, likely repeated header rows, are entirely null)
df.dropna(inplace = True)

In [30]:
df.isnull().sum().sum()

0
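The null count is consistent with whole rows being empty: 13266 nulls across 33 columns is 402 rows, and 4538 - 402 = 4136, which matches the entry count reported by `df.info()`. A minimal sketch with made-up rows shows how an empty list from the scrape becomes an all-null row, and how `dropna(how='all')` could remove only those rows. Using `how='all'` is a suggestion here, not what the notebook does.

```python
import pandas as pd

# Made-up rows; the empty list mimics a table row with no <td> cells.
teams = [["Team A", "30"], [], ["Team B", "31"]]
df = pd.DataFrame(teams, columns=["School", "G"])

# Every cell of the empty row is null.
n_null = int(df.isnull().sum().sum())

# Drop only the rows in which all values are null.
clean = df.dropna(how="all")

print(n_null, len(clean))
```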

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4136 entries, 0 to 4537
Data columns (total 33 columns):
School    4136 non-null object
G         4136 non-null object
W         4136 non-null object
L         4136 non-null object
W-L%      4136 non-null object
SRS       4136 non-null object
SOS       4136 non-null object
W         4136 non-null object
L         4136 non-null object
W         4136 non-null object
L         4136 non-null object
W         4136 non-null object
L         4136 non-null object
Tm.       4136 non-null object
Opp.      4136 non-null object
          4136 non-null object
MP        4136 non-null object
FG        4136 non-null object
FGA       4136 non-null object
FG%       4136 non-null object
3P        4136 non-null object
3PA       4136 non-null object
3P%       4136 non-null object
FT        4136 non-null object
FTA       4136 non-null object
FT%       4136 non-null object
ORB       4136 non-null object
TRB       4136 non-null object
AST       4136 non-null o

In [38]:
#keep only the 6 columns needed for the analysis and convert them to integers
df = df[['FT', 'FTA', '3P', '3PA', 'FG', 'FGA']].astype(int)

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4136 entries, 0 to 4537
Data columns (total 6 columns):
FT     4136 non-null int64
FTA    4136 non-null int64
3P     4136 non-null int64
3PA    4136 non-null int64
FG     4136 non-null int64
FGA    4136 non-null int64
dtypes: int64(6)
memory usage: 226.2 KB
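`astype(int)` raises an error if any cell is not a clean integer string (for example, an empty string). A more defensive variant, sketched below on made-up values, uses `pd.to_numeric` with `errors='coerce'` so that bad cells become `NaN` and can be inspected before they are dropped. This is an alternative, not what the notebook ran.

```python
import pandas as pd

# Made-up values; the empty string stands in for a missing stat.
raw = pd.DataFrame({"FT": ["406", "521", ""], "FTA": ["583", "741", "694"]})

# Convert each column; values that cannot be parsed become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Drop incomplete rows, then cast to integers.
clean = numeric.dropna().astype(int)
print(clean)
```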


### Average Free Throw %

In [45]:
ft_avg = round(sum(df['FT']) / sum(df['FTA']), 3)
ft_avg

0.697

The average free throw percentage across the 2008 through 2019 college seasons is approximately 70%.
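Note that dividing total makes by total attempts gives a pooled percentage, in which teams with more attempts carry more weight. It is not the same as averaging each team's own percentage. The toy numbers below are made up to show the difference.

```python
import pandas as pd

# Two made-up teams with very different numbers of attempts.
toy = pd.DataFrame({"FT": [80, 300], "FTA": [100, 500]})

# Pooled: total makes over total attempts (the approach used in this notebook).
pooled = round(toy["FT"].sum() / toy["FTA"].sum(), 3)

# Unweighted: average of each team's own percentage.
per_team_mean = round((toy["FT"] / toy["FTA"]).mean(), 3)

print(pooled, per_team_mean)
```

Which of the two is the better center for a prior depends on whether the prior should describe a typical shot or a typical team.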

### Average 2 Point %

Two additional columns are needed for two-point field goals. Two-point makes and attempts are obtained by subtracting three-point makes and attempts from the total field goal columns.
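As a worked check, the sample row scraped earlier (Abilene Christian, 2019) can be split by hand. Assuming the values line up with the column headers, that row has FG = 897, FGA = 1911, 3P = 251, and 3PA = 660.

```python
# Values taken from the sample row in the Test Scrape section.
fg, fga = 897, 1911
three_p, three_pa = 251, 660

# Two-point makes and attempts are what is left after removing threes.
two_p = fg - three_p        # 646
two_pa = fga - three_pa     # 1251

two_pct = round(two_p / two_pa, 3)
print(two_p, two_pa, two_pct)   # 646 1251 0.516
```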

In [49]:
df['2P'] = df['FG'] - df['3P']

In [51]:
df['2PA'] = df['FGA'] - df['3PA']

In [53]:
two_avg = round(sum(df['2P']) / sum(df['2PA']), 3)
two_avg

0.489

The average two-point percentage across the 2008 through 2019 college seasons is approximately 49%.

### Average 3 Point %

In [47]:
three_avg = round(sum(df['3P']) / sum(df['3PA']), 3)
three_avg

0.347

The average three-point percentage across the 2008 through 2019 college seasons is approximately 35%.
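The notebook does not say which prior family will be built from these averages. One common choice for a shooting percentage is a Beta distribution, so the sketch below shows how a pooled average could be turned into Beta parameters. Both the Beta choice and the `strength` value (a pseudo-count of prior attempts) are assumptions for illustration, and `beta_prior` is a hypothetical helper.

```python
def beta_prior(mean, strength):
    """Return (alpha, beta) of a Beta prior with the given mean.

    strength is a hypothetical pseudo-count equal to alpha + beta.
    """
    alpha = mean * strength
    beta = (1 - mean) * strength
    return alpha, beta

# Averages computed in this notebook; strength=100 is an arbitrary illustration.
for name, avg in [("FT", 0.697), ("2P", 0.489), ("3P", 0.347)]:
    a, b = beta_prior(avg, 100)
    print(name, round(a, 1), round(b, 1))
```

A larger `strength` would make the prior more concentrated around the average, so the value should reflect how much weight the historical data deserves.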