# 1.1 Player Stats Data Collection
---
This notebook builds a dataframe of the names and IDs of active MLB pitchers, then uses it to collect each pitcher's game-by-game pitching statistics from baseballsavant.mlb.com. Statistics are collected for the 2018 - 2021 seasons.

## Import Libraries
---

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

## Import Data for Player Names and IDs
---

In [2]:
player_id = pd.read_csv('../data/og_datasets/SFBB Player ID Map - PLAYERIDMAP.csv')

In [3]:
player_id.head()

Unnamed: 0,IDPLAYER,PLAYERNAME,BIRTHDATE,FIRSTNAME,LASTNAME,TEAM,LG,POS,IDFANGRAPHS,FANGRAPHSNAME,...,DRAFTKINGSNAME,OTTONEUID,HQID,RAZZBALLNAME,FANTRAXID,FANTRAXNAME,ROTOWIRENAME,ALLPOS,NFBCLASTFIRST,ACTIVE
0,aardsda01,David Aardsma,12/27/1981,David,Aardsma,,,P,1902,David Aardsma,...,David Aardsma,4362.0,,David Aardsma,,,David Aardsma,P,"Aardsma, David",N
1,abadfe01,Fernando Abad,12/17/1985,Fernando,Abad,BAL,AL,P,4994,Fernando Abad,...,Fernando Abad,7372.0,3556.0,Fernando Abad,*01viz*,Fernando Abad,Fernando Abad,P,"Abad, Fernando",Y
2,abbotco01,Cory Abbott,9/20/1995,Cory,Abbott,CHC,NL,P,sa3005305,Cory Abbott,...,,,6286.0,Cory Abbott,*04ef6*,Cory Abbott,Cory Abbott,P,"Abbott, Cory",Y
3,abramcj01,CJ Abrams,10/3/2000,CJ,Abrams,SD,NL,SS,sa3010152,CJ Abrams,...,,,,CJ Abrams,*04qk8*,CJ Abrams,CJ Abrams,SS,"Abrams, CJ",Y
4,abreual01,Albert Abreu,9/26/1995,Albert,Abreu,NYY,AL,P,17485,Albert Abreu,...,,,5762.0,Albert Abreu,*03xy4*,Albert Abreu,Albert Abreu,P,"Abreu, Albert",Y


## Import List of all Active Pitching Players From Rotowire.com
---
*The data above has an ACTIVE column, but it is not up to date.*

In [4]:
pitchers = pd.read_csv('../data/og_datasets/mlb-player-stats-P.csv')
pitchers.head()

Unnamed: 0,Player,Team,Age,G,GS,CG,SHO,IP,H,ER,K,BB,HR,W,L,SV,BS,HLD,ERA,WHIP
0,Zack Wheeler,PHI,31,32,32,3,2,213.1,169,66,247,46,16,14,10,0,0,0,2.78,1.01
1,Walker Buehler,LAD,27,33,33,0,0,207.2,149,57,212,52,19,16,4,0,0,0,2.47,0.97
2,Adam Wainwright,STL,40,32,32,3,1,206.1,168,70,174,50,21,17,7,0,0,0,3.05,1.06
3,Sandy Alcantara,MIA,26,33,33,1,0,205.2,171,73,201,50,21,9,15,0,0,0,3.19,1.07
4,Robbie Ray,TOR,30,32,32,0,0,193.1,150,61,248,52,33,13,7,0,0,0,2.84,1.04


In [5]:
#Keep only pitchers with more than 10 innings pitched
pitchers = pitchers[pitchers['IP'] > 10]

### Specify which Columns and Rows Needed

In [6]:
#Only keep players whose position is pitcher
player_id = player_id[player_id['POS'] == 'P']

In [7]:
mlb_id = player_id[['MLBID', 'FIRSTNAME', 'LASTNAME', 'ACTIVE', 'PLAYERNAME']]

In [8]:
mlb_id.head(3)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME
0,430911.0,David,Aardsma,N,David Aardsma
1,472551.0,Fernando,Abad,Y,Fernando Abad
2,676265.0,Cory,Abbott,Y,Cory Abbott


#### Only Want Active Players. Merge Both Datasets to Get the MLBID of All Active Pitchers

In [9]:
active = mlb_id.merge(pitchers, how = 'inner', left_on = 'PLAYERNAME', right_on = 'Player')
active.head(5)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME,Player,Team,Age,G,GS,...,K,BB,HR,W,L,SV,BS,HLD,ERA,WHIP
0,472551.0,Fernando,Abad,Y,Fernando Abad,Fernando Abad,BAL,35,16,0,...,10,7,1,0,0,0,0,2,5.6,1.7
1,676265.0,Cory,Abbott,Y,Cory Abbott,Cory Abbott,CHC,26,7,1,...,12,11,7,0,0,0,0,0,6.75,1.79
2,656061.0,Albert,Abreu,Y,Albert Abreu,Albert Abreu,NYY,26,28,0,...,35,19,8,2,0,1,0,3,5.15,1.25
3,650556.0,Bryan,Abreu,Y,Bryan Abreu,Bryan Abreu,HOU,24,31,0,...,36,18,4,3,3,1,4,7,5.75,1.47
4,642758.0,Domingo,Acevedo,Y,Domingo Acevedo,Domingo Acevedo,OAK,27,10,0,...,9,4,3,0,0,0,0,0,3.27,1.18
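
Because this merge joins on player names rather than on an ID, any pitcher whose name is spelled differently in the two sources is silently dropped by the inner join. A minimal sketch of how one could list those unmatched names with the `indicator` option of `merge`, on small made-up frames (the sample rows and the `mlb_id_demo`, `pitchers_demo`, `check` and `unmatched` names are illustrative assumptions, not data from this project):

```python
import pandas as pd

# Toy stand-ins for mlb_id and pitchers (illustrative rows only)
mlb_id_demo = pd.DataFrame({
    'MLBID': [1.0, 2.0],
    'PLAYERNAME': ['Alpha One', 'Beta Two'],
})
pitchers_demo = pd.DataFrame({
    'Player': ['Alpha One', 'Gamma Three'],
    'IP': [50.0, 20.1],
})

# A left merge from the Rotowire side keeps every pitcher;
# indicator=True adds a _merge column saying where each row was found
check = pitchers_demo.merge(mlb_id_demo, how='left',
                            left_on='Player', right_on='PLAYERNAME',
                            indicator=True)

# Rows marked 'left_only' have no name match in the ID map
unmatched = check.loc[check['_merge'] == 'left_only', 'Player'].tolist()
print(unmatched)
```

Names that show up in such a list could then be corrected by hand before the inner merge.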


#### Change ID to Integer

In [10]:
active['MLBID'] = active['MLBID'].astype(int)

In [11]:
active.shape

(551, 25)

#### Drop any Duplicate IDs

In [12]:
active.drop_duplicates(subset=['MLBID'], keep='last', inplace = True)

In [13]:
# Make sure no duplicates
active[active.duplicated(['MLBID'])]

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME,Player,Team,Age,G,GS,...,K,BB,HR,W,L,SV,BS,HLD,ERA,WHIP
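
Dropping with `keep='last'` retains only the final row for each ID and discards the earlier ones without reporting them. As a sketch on made-up rows (the `demo` frame and its values are assumptions for illustration), `duplicated(..., keep=False)` can show every row that shares an ID, which makes it easier to judge whether keeping the last one is the right choice:

```python
import pandas as pd

demo = pd.DataFrame({
    'MLBID': [10, 10, 20],
    'Player': ['Sam Doe', 'Sam Doe', 'Lee Roe'],
    'Team': ['AAA', 'BBB', 'CCC'],
})

# keep=False flags every member of a duplicated group, not just the extras
shared = demo[demo.duplicated(subset=['MLBID'], keep=False)]

# Same arguments as the notebook's call, but returning a new frame
deduped = demo.drop_duplicates(subset=['MLBID'], keep='last')

print(shared)
print(deduped)
```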


#### Drop the PLAYERNAME Column

In [14]:
active.drop(columns = 'PLAYERNAME', inplace = True)
active.head(3)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,Player,Team,Age,G,GS,CG,...,K,BB,HR,W,L,SV,BS,HLD,ERA,WHIP
0,472551,Fernando,Abad,Y,Fernando Abad,BAL,35,16,0,0,...,10,7,1,0,0,0,0,2,5.6,1.7
1,676265,Cory,Abbott,Y,Cory Abbott,CHC,26,7,1,0,...,12,11,7,0,0,0,0,0,6.75,1.79
2,656061,Albert,Abreu,Y,Albert Abreu,NYY,26,28,0,0,...,35,19,8,2,0,1,0,3,5.15,1.25


#### Save Dataframe with IDs and Names

In [15]:
active.to_csv('../data/og_datasets/mlb_players_pitch.csv')

## Function for Collecting Stats from Baseballsavant.mlb.com
---

In [16]:
def get_stats(mlbid, first_name, last_name):
    """
    Collects the game log pitching stats of every game for the
    specified player, for each season from 2018 through 2021.
    
    Writes one csv file per season, named after the player, to
    ../data/og_players_pitch/. Returns nothing.
    """
    years = ['2018', '2019', '2020', '2021']
    
    for year in years:
        #This is the url to the website
        base_url = 'https://baseballsavant.mlb.com/savant-player/'

        #This string will be used to specify the player
        player_name = first_name.lower() +'-'+last_name.lower()+'-'+str(mlbid)

        #Url for the page with the stats
        url = base_url + player_name + '?stats=gamelogs-r-pitching-mlb&season=' + year

        #Requests for the page
        res = requests.get(url)

        if res.status_code != 200:
            #Modified from https://pypi.org/project/ratelimit/
            raise Exception('API response: {}'.format(res.status_code))

        else:

            soup = BeautifulSoup(res.content, 'lxml')

            player_stats = []

            try:

                #Find the table with desired stats
                table = soup.find('div', {'id':['gamelogs-mlb']})

                #Finds all the columns needed
                for row in table.find('tbody').find_all('tr'):

                    td_tags = row.find_all('td')

                    #Skip rows that contain no data cells
                    if not td_tags:
                        continue

                    #Read each stat once per row by its column position
                    stats = {}
                    stats['date'] = td_tags[0].text.strip()
                    stats['W'] = td_tags[3].text.strip()
                    stats['L'] = td_tags[4].text.strip()
                    stats['ERA'] = td_tags[5].text.strip()
                    stats['IP'] = td_tags[9].text.strip()
                    stats['H'] = td_tags[10].text.strip()
                    stats['ER'] = td_tags[12].text.strip()
                    stats['HR'] = td_tags[13].text.strip()
                    stats['BB'] = td_tags[14].text.strip()
                    stats['SO'] = td_tags[15].text.strip()
                    stats['WHIP'] = td_tags[16].text.strip()

                    #Appends the row of stats to the list
                    player_stats.append(stats)

                #Creates data frame of all stats
                df = pd.DataFrame(player_stats)

                #Saves Dataframe to a file with player name
                df.to_csv(f'../data/og_players_pitch/{first_name}-{last_name}-{mlbid}-{year}.csv')

                time.sleep(1) #suspends execution for 1 second to prevent too many requests
                #inspired from https://realpython.com/python-sleep/

            except (AttributeError, IndexError):
                pass
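
The URL assembled inside `get_stats` follows the pattern `first-last-mlbid` followed by a query string naming the game log view and the season. As a sketch, the same construction can be pulled into a small helper so it can be checked without sending a request. `build_gamelog_url` and `BASE_URL` are hypothetical names, and lowercasing is the only name normalization applied, as in the function above:

```python
BASE_URL = 'https://baseballsavant.mlb.com/savant-player/'

def build_gamelog_url(mlbid, first_name, last_name, year):
    """Hypothetical helper: mirrors the URL construction in get_stats."""
    # Player slug: lowercase first and last name joined to the MLB ID
    player_name = f'{first_name.lower()}-{last_name.lower()}-{mlbid}'
    # Query string selects the regular season pitching game logs for one year
    return f'{BASE_URL}{player_name}?stats=gamelogs-r-pitching-mlb&season={year}'

# Example with a player from the merged dataframe shown earlier
url = build_gamelog_url(472551, 'Fernando', 'Abad', '2021')
print(url)
```

Names containing spaces, accents or suffixes may need extra handling that this sketch does not attempt.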


## Get Stats for All Active Pitchers
---

In [17]:
%%time
for index, row in active.iterrows():
    
    mlbid = row['MLBID']
    first = row['FIRSTNAME']
    last = row['LASTNAME']

    get_stats(mlbid, first, last)
    
# Copied from https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

CPU times: user 24min 45s, sys: 43.9 s, total: 25min 29s
Wall time: 3h 7min 3s
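
Since `get_stats` raises on any response other than 200, a single failed request would stop a run that takes about three hours. A hedged sketch of a wrapper that records failures and carries on with the remaining players. `run_all` and `fake_get_stats` are hypothetical names, and the stand-in function replaces the real network call so the example runs offline:

```python
def run_all(players, fetch):
    """Call fetch for each (mlbid, first, last); collect failures instead of stopping."""
    failed = []
    for mlbid, first, last in players:
        try:
            fetch(mlbid, first, last)
        except Exception as err:
            # Remember who failed and why so they can be retried later
            failed.append((mlbid, str(err)))
    return failed

def fake_get_stats(mlbid, first, last):
    # Stand-in for get_stats: pretend one player's page returns a 404
    if mlbid == 2:
        raise Exception('API response: 404')

players = [(1, 'Ann', 'Ace'), (2, 'Bob', 'Bee'), (3, 'Cy', 'Cee')]
failed = run_all(players, fake_get_stats)
print(failed)
```

In this notebook, `players` could be taken from `active[['MLBID', 'FIRSTNAME', 'LASTNAME']].itertuples(index=False, name=None)` and `fetch` would be `get_stats`.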


## Recap
---
Saved a csv file of active pitchers' names and IDs together with their 2021 regular season stats. This file is used to look up player names. The data collected from baseballsavant.mlb.com are the game-by-game stats of active pitchers from 2018 - 2021, which will be used in the VAR forecasting model.