# Background: Extracting Data from baseballsavant.mlb.com

<strong>Websites</strong>
 - baseballsavant.mlb.com: https://baseballsavant.mlb.com/statcast_search 
 - baseballsavant.mlb.com documentation: https://baseballsavant.mlb.com/csv-docs
 - baseballsavant.mlb.com web-scraping tool: savantscraper by Alan R. Kessler (GitHub): https://github.com/alanrkessler/savantscraper
<hr  style="height:1px;border:none;color:#333;background-color:#333;">
<strong>BaseballSavant information taken from the website</strong><br>
<strong>What is BaseballSavant?</strong>
 - BaseballSavant is a site dedicated to providing player matchups, Statcast metrics, and advanced statistics in a simple and easy-to-view way.<br>
<strong>What is Statcast?</strong>
 - Statcast is a state-of-the-art tracking technology, capable of measuring previously unquantifiable aspects of the game. Set up in all 30 Major League ballparks, Statcast collects data using a series of high-resolution optical cameras along with radar equipment. The technology precisely tracks the location and movements of the ball and every player on the field, resulting in an unparalleled amount of information covering everything from the pitcher to the batter to baserunners and defensive players. Visit MLB.com's glossary for more information.
<hr  style="height:1px;border:none;color:#333;background-color:#333;">

### How to extract data from baseballsavant.mlb.com

 - Several options are available for gathering data, although downloading search results directly into a CSV is currently unavailable.
 - To work around this, others have developed tools to gather data from the website. One such tool for Python is <a href="https://github.com/alanrkessler/savantscraper">savantscraper by Alan R. Kessler (GitHub)</a>.
 - Step 1: Download or clone the repository.
 - Step 2: Save savantscraper.py in the directory where you intend to save the CSV file.
 - Step 3: Activate a Python environment that has the following libraries available:
 <ul>
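    <li>os (standard library)</li>
    <li>time (standard library, for <code>sleep</code>)</li>
    <li>urllib (standard library, for <code>urllib.error.HTTPError</code>)</li>
    <li>sqlite3 (standard library)</li>
    <li>pandas</li>
    <li>tqdm</li>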
 </ul>
 - Step 4: Run savantscraper.py. In a Mac terminal, use: <code>python savantscraper.py</code>
 - Step 5: Open a Jupyter notebook and use the following as an example.
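
Before running the scraper, it can help to confirm that the active environment actually has the libraries from Step 3. The snippet below is a minimal sketch of such a check; the helper name `missing_modules` is hypothetical and is not part of savantscraper.

```python
import importlib.util

# Modules listed in Step 3 (os, time, urllib and sqlite3 ship with Python).
REQUIRED = ["os", "time", "urllib.error", "sqlite3", "pandas", "tqdm"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED) or "none")
```

Anything reported as missing (typically only pandas or tqdm) needs to be installed in the environment before Step 4.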
 <hr  style="height:1px;border:none;color:#333;background-color:#333;">

# Extracting Data via Webscraping

### Jupyter Notebook Step 1: Import Basic Libraries

In [1]:
import sqlite3  # for connecting to and querying the database
import pandas as pd  # for manipulating the data and exporting it as a CSV
import savantscraper  # provides the database_import function used to gather the data

### Jupyter Notebook Step 2: Invoke the Web-Scraping Function


<code>database_import('baseball_savant', (2017, 2018), teams=['STL', 'COL'])</code>
- First parameter: <code>'baseball_savant'</code>, the name of the database. Do not change it.
- Second parameter: a tuple of season years. The first element is the starting season year and the second is the ending season year. For example, for the 2019 MLB season the tuple would be (2018, 2019).
- Third parameter: optional. Pass a list of team abbreviations to limit the scrape to those teams, or omit it altogether to gather all teams.

Running this code may take some time.
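
Because a scrape can run for a long time, it may be worth checking the arguments before starting it. The helper below is a hypothetical pre-flight check, not part of savantscraper; it only verifies the shapes described in the parameter list above (a two-element tuple of years and an optional list of team abbreviations).

```python
def check_import_args(seasons, teams=None):
    """Hypothetical pre-flight check for the database_import arguments."""
    if not (isinstance(seasons, tuple) and len(seasons) == 2):
        raise ValueError("seasons must be a (start, end) tuple")
    start, end = seasons
    if not (isinstance(start, int) and isinstance(end, int)) or start > end:
        raise ValueError("seasons must be two integer years with start <= end")
    if teams is not None and not all(isinstance(t, str) for t in teams):
        raise ValueError("teams must be a list of team abbreviations")
    return seasons, teams


# Same arguments as the example call above.
checked = check_import_args((2017, 2018), teams=["STL", "COL"])
print(checked)
```

If the check passes, the same values can be handed to `savantscraper.database_import`.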

### Scrape all data for the 2019 MLB season

In [None]:
# Invoke function
savantscraper.database_import('baseball_savant',
                              (2019, 2020))

### Jupyter Notebook Step 3: Connect to database using .connect() method

In [2]:
# Connect to the database using the .connect() method
# This connects to the database created by the scraper. If you later save the data
# as a csv or db elsewhere, you will need to reconnect using the correct filepath
conn = sqlite3.connect('static/documents/baseball_savant.db')

### Jupyter Notebook Step 4: Save data to a pandas dataframe
Use the pandas <code>pd.read_sql_query()</code> function to run an SQL command that gathers the rows of the statcast table scraped in Step 2.

<strong>How the SQL command works</strong><br>
<code>SELECT * </code> selects every column; with no <code>WHERE</code> clause, every row is returned.<br>
<code>FROM statcast;</code> names the table the rows are selected from.<br>
<code>database_import()</code> creates only one table, 'statcast'.<br>

In [None]:
df = pd.read_sql_query("SELECT * FROM statcast;", conn)
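
A full season of pitch data is large, so it is sometimes preferable to select only the columns needed or to read the table in chunks. The sketch below shows both on a small in-memory table with made-up rows; the column names `batter`, `events` and `launch_speed` come from the Statcast CSV documentation and are assumed here to match the scraped table.

```python
import sqlite3

import pandas as pd

# Stand-in for the scraped database: a tiny in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE statcast (batter INTEGER, events TEXT, launch_speed REAL)"
)
demo_conn.executemany(
    "INSERT INTO statcast VALUES (?, ?, ?)",
    [
        (1, "single", 95.1),
        (1, "home_run", 104.3),
        (2, "strikeout", None),
        (2, "home_run", 101.0),
        (1, "field_out", 88.7),
    ],
)
demo_conn.commit()

# Option 1: name the columns instead of using SELECT *.
two_cols = pd.read_sql_query("SELECT batter, events FROM statcast;", demo_conn)

# Option 2: stream the table in chunks; each chunk is a DataFrame.
total_rows = 0
for chunk in pd.read_sql_query("SELECT * FROM statcast;", demo_conn, chunksize=2):
    total_rows += len(chunk)

demo_conn.close()
print(list(two_cols.columns), total_rows)
```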

### Jupyter Notebook Step 5: Use pandas .to_csv() to save sql query output into a csv file

<strong>How the code works</strong><br>
<code>first parameter: filepath</code> is the location where the file will be saved.<br>
<code>second parameter: encoding</code> <code>encoding='utf-8'</code> is the encoding used in the output file. It can be omitted, since 'utf-8' is the default.<br>
<code>third parameter: index</code> controls whether row names (the index) are written. The default is <code>True</code>; it is set to <code>False</code> here because the index is not wanted in the file.

In [None]:
df.to_csv("/Volumes/Elements/Downloads/statcast_2019_file", encoding='utf-8', index=False)
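
To see what <code>index=False</code> changes, the sketch below writes a two-row dataframe to a string instead of a file (calling `to_csv()` without a path returns the CSV text). The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows, only to show the effect of the index argument.
df_demo = pd.DataFrame(
    {"batter": [608324, 543685], "events": ["home_run", "single"]}
)

with_index = df_demo.to_csv()                # index written as an unnamed first column
without_index = df_demo.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the default setting the header row starts with an empty field for the index column, and reading the file back with `pd.read_csv` would produce an extra `Unnamed: 0` column.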

### Jupyter Notebook Step 6: Close connection with .close()

The pandas dataframe is available in this notebook as df, and a csv file has been saved in the indicated filepath.

In [3]:
# Close connection when finished
conn.close()
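
An alternative to calling `.close()` by hand is `contextlib.closing`, which closes the connection even if a query raises an error. This is a minimal sketch on an in-memory database. Using the connection directly in a `with` statement would not do the same thing: that form only commits or rolls back a transaction and leaves the connection open.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE statcast (batter INTEGER)")
    demo_conn.execute("INSERT INTO statcast VALUES (608324)")
    row_count = demo_conn.execute("SELECT COUNT(*) FROM statcast").fetchone()[0]

# The connection is closed once the with-block ends; using it again fails.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(row_count, still_open)
```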

### The CSV was loaded into an SQLite db and queried using the same connection method as above
 - You can use the standard pandas import method instead if you'd like:
 <code>df = pd.read_csv(r'Path where the CSV file is stored\File name.csv')</code>
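
One way to load a CSV into an SQLite database is `DataFrame.to_sql`, which accepts a `sqlite3` connection. The sketch below is an assumption about how that step could be done, not a record of how the author's `statcast.db` was built; it uses temporary files and made-up rows so that it runs anywhere.

```python
import os
import sqlite3
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "statcast_sample.csv")
db_path = os.path.join(workdir, "statcast.db")

# Write a small stand-in CSV (made-up rows) to load from.
pd.DataFrame(
    {"batter": [608324, 543685, 665742], "events": ["home_run", "single", "walk"]}
).to_csv(csv_path, index=False)

# Read the CSV and store it as a table named 'statcast'.
df_csv = pd.read_csv(csv_path)
db_conn = sqlite3.connect(db_path)
df_csv.to_sql("statcast", db_conn, if_exists="replace", index=False)

n_rows = db_conn.execute("SELECT COUNT(*) FROM statcast").fetchone()[0]
db_conn.close()
print(n_rows)
```

`if_exists="replace"` overwrites an existing table of the same name; `"append"` would add the rows to it instead.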

### Filter data for the top home run batters on the Houston Astros and Washington Nationals

<img src="static/images/alex_bregman_astros.png" >
<div align="middle">
 - Third baseman / Shortstop<br>
 - Houston Astros since 2016<br>
 - Home runs: 41<br>
 - BatterID: 608324
</div>
<h3>Note: I had to inspect the player page on baseballsavant.mlb.com to get their batter ID!</h3>

In [4]:
# Connection to my local db copy
conn = sqlite3.connect('static/documents/statcast.db')
# The batter id for this player was obtained by inspecting the player page on baseballsavant (looking through the HTML code)
bregman = pd.read_sql_query("SELECT * FROM statcast WHERE batter = '608324';", conn)
conn.close()

<table align="middle">
<tr>
    <td> <img src="static/images/anthony_rendon_nationals.png" alt="Drawing" style="width: 250px;"/> 
    </td>
    <td> <img src="static/images/juan_soto_nationals.png" alt="Drawing" style="width: 250px;"/> </td>
   </tr>
     <tr>
    <td> Third baseman<br>Nationals since 2013<br>Home runs: 34<br>Player ID: 543685
    </td>
    <td> Outfielder<br>Nationals since 2018<br>Home runs: 34<br>Player ID: 665742
    </td>
   </tr>
</table>

In [5]:
# Connection to my local db copy
conn = sqlite3.connect('static/documents/statcast.db')
# The batter id for this player was obtained by inspecting the player page on baseballsavant (looking through the HTML code)
rendon = pd.read_sql_query("SELECT * FROM statcast WHERE batter = '543685';", conn)
soto = pd.read_sql_query("SELECT * FROM statcast WHERE batter = '665742';", conn)
conn.close()
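
The three per-player queries can also be written as one parameterized query, which avoids pasting IDs into the SQL string. The sketch below uses the batter IDs from this notebook against a small in-memory table with made-up rows; the dictionary `batter_ids` and the fourth ID (111111) are illustrative only.

```python
import sqlite3

import pandas as pd

# Stand-in table with made-up rows; 111111 is a dummy ID that should be filtered out.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE statcast (batter INTEGER, events TEXT)")
demo_conn.executemany(
    "INSERT INTO statcast VALUES (?, ?)",
    [
        (608324, "home_run"),
        (543685, "single"),
        (665742, "walk"),
        (111111, "strikeout"),
    ],
)
demo_conn.commit()

batter_ids = {"bregman": 608324, "rendon": 543685, "soto": 665742}

# One "?" placeholder per ID; the values are passed separately via params.
placeholders = ", ".join("?" for _ in batter_ids)
query = f"SELECT * FROM statcast WHERE batter IN ({placeholders});"
selected = pd.read_sql_query(query, demo_conn, params=list(batter_ids.values()))
demo_conn.close()

# Split the combined result into one dataframe per player.
frames = {
    name: selected[selected["batter"] == pid] for name, pid in batter_ids.items()
}
print({name: len(frame) for name, frame in frames.items()})
```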

### Optional: Save filtered data into csv files

In [6]:
# Save Alex Bregman's data
bregman.to_csv('static/documents/bregman')

# Save Anthony Rendon's data
rendon.to_csv('static/documents/rendon')

# Save Juan Soto's data
soto.to_csv('static/documents/soto')