# 1.1 Player Stats Data Collection
---
*By Ihza Gonzales*

This notebook builds a dataframe of the names and IDs of active non-pitching baseball players and uses it to collect each player's batting statistics from baseballsavant.mlb.com. Statistics are collected for the years 2018 - 2021.

## Import Libraries
---

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

## Import Data for Player Names and IDs
---

In [2]:
player_id = pd.read_csv('../data/og_datasets/SFBB Player ID Map - PLAYERIDMAP.csv')

In [3]:
player_id.head()

Unnamed: 0,IDPLAYER,PLAYERNAME,BIRTHDATE,FIRSTNAME,LASTNAME,TEAM,LG,POS,IDFANGRAPHS,FANGRAPHSNAME,...,DRAFTKINGSNAME,OTTONEUID,HQID,RAZZBALLNAME,FANTRAXID,FANTRAXNAME,ROTOWIRENAME,ALLPOS,NFBCLASTFIRST,ACTIVE
0,aardsda01,David Aardsma,12/27/1981,David,Aardsma,,,P,1902,David Aardsma,...,David Aardsma,4362.0,,David Aardsma,,,David Aardsma,P,"Aardsma, David",N
1,abadfe01,Fernando Abad,12/17/1985,Fernando,Abad,BAL,AL,P,4994,Fernando Abad,...,Fernando Abad,7372.0,3556.0,Fernando Abad,*01viz*,Fernando Abad,Fernando Abad,P,"Abad, Fernando",Y
2,abbotco01,Cory Abbott,9/20/1995,Cory,Abbott,CHC,NL,P,sa3005305,Cory Abbott,...,,,6286.0,Cory Abbott,*04ef6*,Cory Abbott,Cory Abbott,P,"Abbott, Cory",Y
3,abramcj01,CJ Abrams,10/3/2000,CJ,Abrams,SD,NL,SS,sa3010152,CJ Abrams,...,,,,CJ Abrams,*04qk8*,CJ Abrams,CJ Abrams,SS,"Abrams, CJ",Y
4,abreual01,Albert Abreu,9/26/1995,Albert,Abreu,NYY,AL,P,17485,Albert Abreu,...,,,5762.0,Albert Abreu,*03xy4*,Albert Abreu,Albert Abreu,P,"Abreu, Albert",Y


## Import List of all Active non-Pitching Players From Rotowire.com
---
*The player ID map has an ACTIVE column, but it is not up to date, so the Rotowire list is used to identify active players.*

In [4]:
batters = pd.read_csv('../data/og_datasets/mlb-player-stats-Batters.csv')
batters.head()

Unnamed: 0,Player,Team,Pos,Age,G,AB,R,H,2B,3B,...,CS,BB,SO,SH,SF,HBP,AVG,OBP,SLG,OPS
0,Whit Merrifield,KC,2B,32,162,664,97,184,42,3,...,4,40,103,0,12,4,0.277,0.317,0.395,0.712
1,Marcus Semien,TOR,SS,31,162,652,115,173,39,2,...,1,66,146,0,3,3,0.265,0.334,0.538,0.872
2,Tommy Edman,STL,2B,26,159,641,91,168,41,3,...,5,38,95,2,4,6,0.262,0.308,0.387,0.695
3,Bo Bichette,TOR,SS,23,159,640,121,191,30,1,...,1,40,137,0,4,6,0.298,0.343,0.484,0.827
4,Isiah Kiner-Falefa,TEX,SS,26,158,635,74,172,25,3,...,5,28,90,1,2,11,0.271,0.312,0.357,0.669


In [5]:
# Keep only players with more than 50 at-bats
batters = batters[batters['AB']> 50]

### Specify which Columns and Rows Needed

In [6]:
#Only get players that are not pitchers
player_id = player_id[player_id['POS'] != 'P']

In [7]:
mlb_id = player_id[['MLBID', 'FIRSTNAME', 'LASTNAME', 'ACTIVE', 'PLAYERNAME']]

In [8]:
mlb_id.head(3)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME
3,682928.0,CJ,Abrams,Y,CJ Abrams
5,110029.0,Bobby,Abreu,N,Bobby Abreu
7,547989.0,Jose,Abreu,Y,Jose Abreu


#### Only Want Active Players. Merge Both Datasets to Get the MLBID of All Active Non-Pitching Players

In [9]:
active = mlb_id.merge(batters, how = 'inner', left_on = 'PLAYERNAME', right_on = 'Player')
active.head(5)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME,Player,Team,Pos,Age,G,...,CS,BB,SO,SH,SF,HBP,AVG,OBP,SLG,OPS
0,547989.0,Jose,Abreu,Y,Jose Abreu,Jose Abreu,CWS,1B,34,152,...,0,61,143,0,10,22,0.261,0.351,0.481,0.832
1,660670.0,Ronald,Acuna,Y,Ronald Acuna,Ronald Acuna,ATL,OF,23,82,...,6,49,85,0,5,9,0.283,0.394,0.596,0.99
2,642715.0,Willy,Adames,Y,Willy Adames,Willy Adames,MIL,SS,26,99,...,2,47,105,0,1,0,0.285,0.366,0.521,0.887
3,642715.0,Willy,Adames,Y,Willy Adames,Willy Adames,TB,SS,26,41,...,2,10,51,0,0,0,0.197,0.254,0.371,0.625
4,666176.0,Jo,Adell,Y,Jo Adell,Jo Adell,LAA,OF,22,35,...,1,8,32,1,0,1,0.246,0.295,0.408,0.703
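
Because the merge above joins on player names, any spelling difference between the two sources silently drops a player from the inner join. The sketch below is one way to check for that, using the `indicator` option of `merge`. The two small dataframes and the name mismatch in them are hypothetical and only illustrate the idea; they are not taken from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables, for illustration only
ids = pd.DataFrame({
    'MLBID': [547989.0, 660670.0],
    'PLAYERNAME': ['Jose Abreu', 'Ronald Acuna'],
})
stats = pd.DataFrame({
    'Player': ['Jose Abreu', 'Ronald Acuna Jr.'],  # assumed spelling mismatch
    'AB': [566, 297],
})

# An outer merge keeps every row; the _merge column records where each row came from
check = ids.merge(stats, how='outer', left_on='PLAYERNAME',
                  right_on='Player', indicator=True)

# Rows that an inner merge would have dropped
unmatched = check[check['_merge'] != 'both']
print(unmatched[['PLAYERNAME', 'Player', '_merge']])
```

Reviewing the `right_only` rows shows which batters from the stats file would be left out because their name has no exact match in the ID map.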


#### Change ID to Integer

In [10]:
active['MLBID'] = active['MLBID'].astype(int)
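
The conversion above only works when every merged row has an MLBID. A minimal sketch of what happens otherwise, and two possible ways to handle it, is shown here; the series is made up, and whether the real data contains missing IDs is not known from this notebook.

```python
import pandas as pd

# Made-up ID column with one missing value, for illustration only
ids = pd.Series([547989.0, None, 660670.0], name='MLBID')

# A plain integer cast cannot represent a missing value and raises an error
try:
    ids.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: drop rows without an ID before casting
clean = ids.dropna().astype(int)

# Option 2: keep the rows and use the nullable integer dtype
nullable = ids.astype('Int64')

print(cast_failed)
print(clean.tolist())
print(nullable.dtype)
```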

In [11]:
active.shape

(512, 28)

#### Drop any Duplicate IDs

In [12]:
active.drop_duplicates(subset=['MLBID'], keep='last', inplace = True)

In [13]:
# Make sure no duplicates
active[active.duplicated(['MLBID'])]

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,PLAYERNAME,Player,Team,Pos,Age,G,...,CS,BB,SO,SH,SF,HBP,AVG,OBP,SLG,OPS
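
Players who changed teams during the season appear once per team, so `keep='last'` retains whichever team row happens to come last (the TB row for Willy Adames above). This does not affect the scraping step, which only needs the ID and names, but it does affect the season stats saved alongside them. The sketch below contrasts that behavior with one possible alternative, keeping the row with the most games. The rows reuse the `MLBID`, `Team`, and `G` values shown in the merge output; the alternative rule is an assumption, not the notebook's method.

```python
import pandas as pd

# Rows taken from the merge output above (only a few columns kept)
rows = pd.DataFrame({
    'MLBID': [642715, 642715, 666176],
    'Player': ['Willy Adames', 'Willy Adames', 'Jo Adell'],
    'Team': ['MIL', 'TB', 'LAA'],
    'G': [99, 41, 35],
})

# What the notebook does: keep the last row for each ID
last_kept = rows.drop_duplicates(subset=['MLBID'], keep='last')

# Alternative sketch: keep the row with the most games for each ID
most_games = (
    rows.sort_values('G', ascending=False)
        .drop_duplicates(subset=['MLBID'], keep='first')
        .sort_index()
)

print(last_kept['Team'].tolist())
print(most_games['Team'].tolist())
```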


#### Drop Column PLAYERNAME

In [14]:
active.drop(columns = 'PLAYERNAME', inplace = True)
active.head(3)

Unnamed: 0,MLBID,FIRSTNAME,LASTNAME,ACTIVE,Player,Team,Pos,Age,G,AB,...,CS,BB,SO,SH,SF,HBP,AVG,OBP,SLG,OPS
0,547989,Jose,Abreu,Y,Jose Abreu,CWS,1B,34,152,566,...,0,61,143,0,10,22,0.261,0.351,0.481,0.832
1,660670,Ronald,Acuna,Y,Ronald Acuna,ATL,OF,23,82,297,...,6,49,85,0,5,9,0.283,0.394,0.596,0.99
3,642715,Willy,Adames,Y,Willy Adames,TB,SS,26,41,132,...,2,10,51,0,0,0,0.197,0.254,0.371,0.625


#### Save Dataframe with IDs and Names

In [15]:
active.to_csv('../data/og_datasets/mlb_players_bat.csv')

## Function for Collecting Stats from Baseballsavant.mlb.com
----

In [16]:
def get_stats(mlbid, first_name, last_name):
    """
    Collects the game-log stats of every game in each season
    from 2018 - 2021 for the player specified by the arguments.

    Saves one csv file per season, named after the player, to
    ../data/og_players_bat/. Nothing is returned.
    """
    years = ['2018', '2019', '2020', '2021']
    
    for year in years:
        #This is the url to the website
        base_url = 'https://baseballsavant.mlb.com/savant-player/'

        #This string will be used to specify the player
        player_name = first_name.lower() +'-'+last_name.lower()+'-'+str(mlbid)

        #Url for the page with the stats
        url = base_url + player_name + '?stats=gamelogs-r-hitting-mlb&season=' + year

        #Requests for the page
        res = requests.get(url)

        if res.status_code != 200:
            raise Exception('API response: {}'.format(res.status_code))   
            #Modified from https://pypi.org/project/ratelimit/

        else:

            soup = BeautifulSoup(res.content, 'lxml')

            player_stats = []

            try:

                #Find the table with desired stats
                table = soup.find('div', {'id':['gamelogs-mlb']})

                #Builds one dictionary of stats per game (one table row)
                for row in table.find('tbody').find_all('tr'):

                    td_tags = row.find_all('td')

                    #Skips rows without data cells so no stale stats are appended
                    if not td_tags:
                        continue

                    stats = {}
                    stats['date'] = td_tags[0].text.strip()
                    stats['PA'] = td_tags[3].text.strip()
                    stats['AB'] = td_tags[4].text.strip()
                    stats['R'] = td_tags[5].text.strip()
                    stats['H'] = td_tags[6].text.strip()
                    stats['2B'] = td_tags[7].text.strip()
                    stats['3B'] = td_tags[8].text.strip()
                    stats['HR'] = td_tags[9].text.strip()
                    stats['RBI'] = td_tags[10].text.strip()
                    stats['BB'] = td_tags[11].text.strip()
                    stats['SO'] = td_tags[12].text.strip()
                    stats['AVG'] = td_tags[16].text.strip()
                    stats['OBP'] = td_tags[17].text.strip()
                    stats['SLG'] = td_tags[18].text.strip()
                    stats['OPS'] = td_tags[19].text.strip()

                    #Appends the row of stats to the list
                    player_stats.append(stats)

                #Creates data frame of all stats
                df = pd.DataFrame(player_stats)

                #Saves Dataframe to a file with player name
                df.to_csv(f'../data/og_players_bat/{first_name}-{last_name}-{mlbid}-{year}.csv')

                time.sleep(1) #suspends execution for 1 second to prevent too many requests
                #inspired from https://realpython.com/python-sleep/

            except (AttributeError, IndexError):
                pass
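
The URL construction and the column positions used in `get_stats` can be checked without making any requests. The sketch below pulls both into small helpers; the helper names `build_gamelog_url`, `row_to_stats`, and `COLUMN_INDEX` are hypothetical, and the row of cells is made up. The URL pattern and the positions are copied from the function above.

```python
# Column positions copied from get_stats; the dictionary name is hypothetical
COLUMN_INDEX = {
    'date': 0, 'PA': 3, 'AB': 4, 'R': 5, 'H': 6, '2B': 7, '3B': 8,
    'HR': 9, 'RBI': 10, 'BB': 11, 'SO': 12,
    'AVG': 16, 'OBP': 17, 'SLG': 18, 'OPS': 19,
}


def build_gamelog_url(mlbid, first_name, last_name, year):
    """Assemble the game-log URL the same way get_stats does."""
    base_url = 'https://baseballsavant.mlb.com/savant-player/'
    player_name = f'{first_name.lower()}-{last_name.lower()}-{mlbid}'
    return f'{base_url}{player_name}?stats=gamelogs-r-hitting-mlb&season={year}'


def row_to_stats(cells):
    """Map the cell strings of one table row to named stats."""
    return {name: cells[pos].strip() for name, pos in COLUMN_INDEX.items()}


url = build_gamelog_url(547989, 'Jose', 'Abreu', '2021')
print(url)

# A made-up row of 20 cells, only to show how positions map to names
fake_cells = [f' c{i} ' for i in range(20)]
print(row_to_stats(fake_cells))
```

Keeping the positions in one dictionary makes it easier to adjust them in a single place if the layout of the table on the site changes.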


## Get Stats for All Active Batters
---

In [17]:
%%time
for index, row in active.iterrows():
    
    mlbid = row['MLBID']
    first = row['FIRSTNAME']
    last = row['LASTNAME']

    get_stats(mlbid, first, last)
    
# Copied from https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

CPU times: user 33min 23s, sys: 1min, total: 34min 24s
Wall time: 3h 27min 12s
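
Since the full run takes several hours, it can help to make the loop resumable after an interruption. The sketch below lists the seasons for which no csv file exists yet, following the file naming used in `get_stats`. The helper name `missing_years` is hypothetical, and the temporary directory only stands in for `../data/og_players_bat/`.

```python
import tempfile
from pathlib import Path


def missing_years(out_dir, first_name, last_name, mlbid,
                  years=('2018', '2019', '2020', '2021')):
    """Return the seasons that do not yet have a saved csv for this player."""
    out_dir = Path(out_dir)
    return [
        year for year in years
        if not (out_dir / f'{first_name}-{last_name}-{mlbid}-{year}.csv').exists()
    ]


# Demonstration in a temporary directory with one season already saved
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'Jose-Abreu-547989-2018.csv').touch()
    remaining = missing_years(tmp, 'Jose', 'Abreu', 547989)

print(remaining)  # ['2019', '2020', '2021']
```

Note that `get_stats` writes no file when the stats table is not found, so those seasons would be requested again on every rerun under this scheme.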


## Recap
---
A csv file of active batters' names, IDs, and 2021 regular season stats was saved; it will be used to look up player names. The data collected from baseballsavant.mlb.com are the game-by-game stats of active non-pitching players from 2018 - 2021. These will be used for the VAR model for forecasting.