# Project Overview

### Step 1: Identify the top 2019 home run batter for the Washington Nationals, and the Houston Astros
 - Home run data for the 2019 season

#### Methodology
 - Extract 2019 team statistics from baseball-reference.com (Web Scraping > pandas > csv)
 - Clean data in pandas
 - analyze data using pandas, visualize data with matplotlib

### Step 2: Characterzie the players performance for the 2019 season and determine which of these two outperformed the other.

#### Compare players 2019 Statcast data for:
 - Ball Events
 - launch speed and launch angle
 - home runs by pitch velocity
 - home runs by pitch location
 - player home run zones

#### Methodology
 - Extract 2019 team Statcast data as a csv from https://baseballsavant.mlb.com/statcast_search
 - Clean data in pandas
 - analyze data using pandas, visualize data with matplotlib and seaborn

# STEP 1 - Data Extracting Phase

### Step 1 Study Design
 - Extract 2019 team statistics from baseball-reference.com (Web Scraping > pandas > csv)
 - Clean data in pandas
 - analyze data using pandas, visualize data with matplotlib

To compare the top home run scoring players for the 2019 World Series teams, we must first identify these players. <br>
<strong>Data source:</strong> Baseball_reference.com<br>
<strong>Data extraction method:</strong> Web Scraping using the python libraries: requests, lxml.html, pandas.<br>
<strong>Rationale for Methodology and Study Assumptions</strong><br>
 - All batter statistics will be gathered in case we want to carry out additional analysis. Just looking at who are the top home run scorers is not enough.
 - We will assume that the data is not available in a downloadable format, thereby necessitating the use of this web scraping method.

## Gathering HTML Table Data

<img src="static/images/baseball_ref_landing.png">

### Import Python Libraries

In [1]:
import requests
import lxml.html as lh
import pandas as pd

In [2]:
# Store website in url variables
hou_url = "https://www.baseball-reference.com/teams/HOU/2019.shtml"
wsn_url= "https://www.baseball-reference.com/teams/WSN/2019.shtml#all_team_batting"

### Request HTML

In [3]:
# Use the request method on the url
req_hou = requests.get(hou_url)
req_wsn = requests.get(wsn_url)

### Store DOM

In [4]:
# Store the contents of the website using the html lxml.html parser lh
doc_hou = lh.fromstring(req_hou.content)
doc_wsn = lh.fromstring(req_wsn.content)

### Parse HTML

In [5]:
#Parse data that is stored between HTML table row tags <tr>..</tr>
tr_hou_elements = doc_hou.xpath('//tr')
tr_wsn_elements = doc_wsn.xpath('//tr')

##### Quality control
 - Confirm that you are gathering tabular data
 - You can do that by inspecting the length of 
 - the rows by using a list comprehension
 - make sure that each row has the same number of columns

In [6]:
# Inspect the astros data
[len(T) for T in tr_hou_elements[:10]]

[28, 28, 28, 28, 28, 28, 28, 28, 28, 28]

In [7]:
# Inspect the nationals data
[len(T) for T in tr_wsn_elements[:10]]

[28, 28, 28, 28, 28, 28, 28, 28, 28, 28]

### Extract Header Data from HTML tr elements

##### Save the header row
###### Variales:
 - variables storing the table row html data: tr_hou_elements and tr_wsn_elements
 - Empty lists that will be used to store tr extracted data: hou_table_data, wsn_table_data

In [8]:
# Extract Houston Astros html tr data and store in a list
#Create empty list
hou_table_data=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_hou_elements[0]:
    i+=1

    #lxlm text_content() method extracts text values
    name=t.text_content()
    #Check you are gathering the header information with print statement
    #This is optional   
#     print (i,name) 
    
    #use the List .append() method to add the text from each row
    #into the empty list col you created
    hou_table_data.append((name,[]))

In [9]:
#Create empty list
wsn_table_data=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_wsn_elements[0]:
    i+=1

    #lxlm text_content() method extracts text values
    name=t.text_content()
    #Check you are gathering the header information with print statement
    #This is optional   
#     print (i,name) 
    
    #use the List .append() method to add the text from each row
    #into the empty list col you created
    wsn_table_data.append((name,[]))

### Extract Table Body Data from all other HTML tr elements

##### Save the data rows

###### Variales:
 - variables storing the table row html data: tr_hou_elements and tr_wsn_elements
 - The new lists created that have the header row data: hou_table_data, wsn_table_data
 - The header data was stored in the first row 
 - From above code(index 0 = tr_elements[0])
 - data is stored on the second row onwards 
 - Use a for loop to eterate through the reamining tr elements
 - Make sure that each row is tabular, if not break out
 - For tabular data (row with equal columms) 
 - Store the data into your table_data list.

In [10]:
# Extract the Houston Astros Data
for j in range(1,len(tr_hou_elements)):
    
    #T is our j'th row
    T=tr_hou_elements[j]
    
    #Count the number of columns in the table and assign to variable
    #In this example, we have 28 columns with data     
    col_nbr = 28
    
    #If row is not of size 28 (# columns), 
    #the //tr data is not from our table 
    if len(T)!=col_nbr:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    #This for loop uses the lxlm .iterchildren() method
    for t in T.iterchildren():
        #This code uses the lxlm .text_content() method         
        data=t.text_content() 
        
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        hou_table_data[i][1].append(data)
        #Increment i for the next column
        i+=1

In [11]:
# Extract the Washington Nationals Data
for j in range(1,len(tr_wsn_elements)):
    
    #T is our j'th row
    T=tr_hou_elements[j]
    
    #Count the number of columns in the table and assign to variable
    #In this example, we have 28 columns with data     
    col_nbr = 28
    
    #If row is not of size 28 (# columns), 
    #the //tr data is not from our table 
    if len(T)!=col_nbr:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    #This for loop uses the lxlm .iterchildren() method
    for t in T.iterchildren():
        #This code uses the lxlm .text_content() method         
        data=t.text_content() 
        
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        wsn_table_data[i][1].append(data)
        #Increment i for the next column
        i+=1

In [12]:
# Quality control - confirm that you gathered columns with equal rows
[len(C) for (title,C) in hou_table_data]

[52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52]

In [13]:
# Quality control - confirm that you gathered columns with equal rows
[len(C) for (title,C) in wsn_table_data]

[52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52,
 52]

### Save data into a pandas dataframe

In [14]:
# Create the dataframe for the houston astros
Dict={title:column for (title,column) in hou_table_data}
hou_df=pd.DataFrame(Dict)

In [15]:
# Create the dataframe for the washington nationals
Dict={title:column for (title,column) in wsn_table_data}
wsn_df=pd.DataFrame(Dict)

In [16]:
# Inspect houston astros dataframe
hou_df.head()

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Robinson Chirinos,35,114,437,366,57,87,22,...,0.347,0.443,0.79,105,162,11,13,2,5,1
1,2,1B,Yuli Gurriel,35,144,612,564,85,168,40,...,0.343,0.541,0.884,126,305,12,5,0,6,2
2,3,2B,Jose Altuve,29,124,548,500,89,149,27,...,0.353,0.55,0.903,131,275,19,3,1,3,0
3,4,SS,Carlos Correa,24,75,321,280,42,78,16,...,0.358,0.568,0.926,137,159,8,2,0,4,0
4,5,3B,Alex Bregman,25,156,690,554,122,164,37,...,0.423,0.592,1.015,162,328,9,9,0,8,2


In [17]:
# Inspect washtingon nationals dataframe
wsn_df.head()

Unnamed: 0,Rk,Pos,Name,Age,G,PA,AB,R,H,2B,...,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB
0,1,C,Robinson Chirinos,35,114,437,366,57,87,22,...,0.347,0.443,0.79,105,162,11,13,2,5,1
1,2,1B,Yuli Gurriel,35,144,612,564,85,168,40,...,0.343,0.541,0.884,126,305,12,5,0,6,2
2,3,2B,Jose Altuve,29,124,548,500,89,149,27,...,0.353,0.55,0.903,131,275,19,3,1,3,0
3,4,SS,Carlos Correa,24,75,321,280,42,78,16,...,0.358,0.568,0.926,137,159,8,2,0,4,0
4,5,3B,Alex Bregman,25,156,690,554,122,164,37,...,0.423,0.592,1.015,162,328,9,9,0,8,2


### Save dataframe to csv file

In [18]:
hou_df.to_csv("2019_houston_astros", encoding='utf-8', index=False)

In [19]:
wsn_df.to_csv("2019_houston_astros", encoding='utf-8', index=False)