# Tutorial: Web Scraping Form Data off Fangraphs.com

This tutorial demonstrates how to scrape data from a web page that offers a button to click and download a dataset, as opposed to fetching the HTML and trying to parse the tag soup. Sometimes the latter is the only option, but data-friendly websites often provide links for downloading data. In this case, doing so programmatically requires making a POST request to the web server, one of the request methods defined by [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol).

The goal will be to scrape information from the daily fantasy baseball projections posted [here](https://www.fangraphs.com/dailyprojections.aspx?pos=all&stats=bat&type=sabersim).

## Chrome DevTools: Identify Form Data

To have our program click and download the data associated with the **export data** button, we'll be making a "POST" request. A [POST](https://en.wikipedia.org/wiki/POST_(HTTP)) request is what you use to fill out a web form, usually to submit data or communicate information to a script on a web server (example: logging into a website) so that it does something. Unfortunately, the `href` attribute of the **export data** link invokes JavaScript rather than pointing to a different URL, so there is no download address we can simply request.

So, using **Chrome Developer Tools**, I inspected the network traffic generated by clicking the **export data** button to learn the **request headers** and **form data** sent with that click. The headers tell us how the data are encoded in the HTTP request. In this case, [application/x-www-form-urlencoded](https://en.wikipedia.org/wiki/Percent-encoding#The_application/x-www-form-urlencoded_type) is used.
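
To make the encoding concrete, here is a minimal sketch of what a form-encoded request body looks like, using Python's standard `urllib.parse`. The `__EVENTTARGET` value is the one this tutorial submits; the `__VIEWSTATE` value is a short made-up placeholder, not a real value from the page.

```python
from urllib.parse import urlencode, parse_qs

# Illustrative form fields. The __VIEWSTATE value is a made-up placeholder,
# not a real value from Fangraphs.
form_data = {
    "__EVENTTARGET": "DFSBoard1$cmdCSV",
    "__VIEWSTATE": "abc/123+xyz=",
}

# urlencode joins key=value pairs with "&" and percent-encodes reserved characters
encoded_body = urlencode(form_data)
print(encoded_body)

# parse_qs reverses the encoding; each key maps to a list of values
decoded = parse_qs(encoded_body)
print(decoded["__EVENTTARGET"][0])
```

Characters such as `$`, `/`, `+`, and `=` are percent-encoded in the body. The `requests` library applies this encoding itself when a dict is passed as `data`, so the script never has to call `urlencode` directly.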

![chrome_headers](img/request and headers.png)

While still in the *Network tab* of Chrome Developer Tools, scroll down and view the Form Data. These are the parsed key:value pairs that the Fangraphs web server receives with this POST request and uses to find and return the player projections data.

![chrome_form_data](img/form data.png)

The form data shown in the Chrome DevTools screenshot above can also be found in the underlying HTML source, since we now know which form data parameters to look for. **This will be important when we want to convert this testing workflow into a Python script later on.**

![form_data_html](img/html_form_data.png)

## Postman: Recreate the server request

Next, I use an API testing tool called [Postman](https://www.getpostman.com) to figure out which pieces of the information explored so far are needed to make a successful POST request to the Fangraphs.com web server and retrieve the player projections data.

Below, you can see I took the information from Chrome Developer Tools, then tinkered with the encoding headers and form data key:value pairs that needed to be submitted for a successful server request. I know the POST request is successful when I get a *200* status code and the data of interest is returned in the bottom window. Sweet!

![postman_test](img/POSTman.png)

## Program the solution with Python

Having deconstructed the POST request to understand how and what data elements will make it work, we can convert this into a programming workflow using Python.

My workflow:

* Make a GET request with the `requests` library to return the underlying HTML of the Fangraphs.com URL. This is important in the event that the website owner changes the form data in the future. We want to capture the changes programmatically.
* Parse the HTML with `BeautifulSoup` to capture the POST request parameter data, and store the key:value pairs in a dict.
* Pass the dict of POST request parameters and the encoding headers to a POST request made with the `requests` library.

### Make GET request

In [61]:
import requests
from bs4 import BeautifulSoup

url = "https://www.fangraphs.com/dailyprojections.aspx?pos=all&stats=bat&type=sabersim"

r = requests.get(url)
# check if we accurately get the HTML
print(r.status_code, r.ok)

html_doc = r.text
tag_soup = BeautifulSoup(html_doc, 'html.parser')

200 True


### Parse HTML to capture parameters needed

As noted in testing with the Postman app, we need three form data parameters to generate a working POST request to the Fangraphs web server: `__EVENTTARGET`, `__VIEWSTATE`, and `__EVENTVALIDATION`.

In [62]:
# find the <input> tags whose id matches either parameter name
form_info = tag_soup.find_all('input', {"id": ["__VIEWSTATE", "__EVENTVALIDATION"]})

# initialize dict with the one key:value pair that can't easily be scraped from the HTML
param_dict = {"__EVENTTARGET": "DFSBoard1$cmdCSV"}

# add the scraped parameters to the dict, keyed by each tag's id
for tag in form_info:
    param_dict[tag['id']] = tag['value']

# store headers. These shouldn't change.
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
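
If the site later adds or renames form fields, a lookup by fixed `id` values would miss them. As a hedged alternative, the sketch below collects every hidden `<input>` by its `name` attribute. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The `HiddenInputCollector` class, the `collect_hidden_fields` helper, and the sample HTML with its placeholder values are illustrative assumptions, not Fangraphs' actual markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        name = attributes.get("name")
        if is_hidden and name:
            # a missing value attribute is stored as an empty string
            self.fields[name] = attributes.get("value") or ""


def collect_hidden_fields(html):
    """Return a dict of every hidden form field found in the HTML string."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up sample markup with placeholder values
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

params = collect_hidden_fields(sample_html)
# the event target still has to be supplied by hand, as in the cell above
params["__EVENTTARGET"] = "DFSBoard1$cmdCSV"
print(params)
```

Keying on `name` mirrors what a browser submits, since form fields are sent under their `name` rather than their `id`.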


### Make POST request to website server

Finally, we can package the POST request with the parameters and encoding headers, then send & catch a response.

In [63]:
r = requests.post(url, 
                  data=param_dict, 
                  headers=headers)

print(r.status_code, r.ok)

200 True
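
A *200* status code on its own does not prove that the export came back: a web form can also answer a POST it does not act on by returning its ordinary HTML page with status 200. A small heuristic check on the body can catch that case. The `looks_like_csv` helper below is a hypothetical sketch using only the standard library, and the expected first column name `Name` is taken from the export's header row.

```python
import csv
import io


def looks_like_csv(text, first_column="Name"):
    """Heuristic: True if the body starts with the expected CSV header, not an HTML page."""
    # drop a leading byte order mark and surrounding whitespace before inspecting
    body = text.lstrip('\ufeff').lstrip()
    if body.lower().startswith(("<!doctype", "<html")):
        return False
    # read just the first row; an empty body yields an empty list
    first_row = next(csv.reader(io.StringIO(body)), [])
    return bool(first_row) and first_row[0] == first_column


print(looks_like_csv('\ufeff"Name","Team"\r\n"Mike Trout","Angels"\r\n'))
print(looks_like_csv('<!DOCTYPE html><html><body>form page</body></html>'))
```

In the script this could be called as `looks_like_csv(r.text)` right after the POST, before any parsing is attempted.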


The response is not a .csv file but a string containing the raw text that would be in that file. We can access it through `r.text`.

In [64]:
batter_text = r.text

print(batter_text)

﻿"Name","Team","Game","Pos","PA","H","1B","2B","3B","HR","R","RBI","SB","CS","BB","SO","Yahoo","FanDuel","DraftKings","playerid"
"Mike Trout","Angels","LAA @ BAL","CF","4.67","1.23",".65",".21",".02",".34",".89",".73",".14",".07",".83",".92","10.46","16.24","12.20","10155"
"Anthony Rizzo","Cubs","MIN @ CHC","1B","4.44","1.13",".54",".23",".02",".35",".88",".92",".04",".02",".75",".59","10.13","15.83","11.67","3473"
"Eric Thames","Brewers","MIL @ CIN","1B","4.76","1.11",".51",".20",".03",".38",".89",".79",".07",".03",".71","1.18","10.06","15.67","11.65","3711"
"Ian Happ","Cubs","MIN @ CHC","CF","4.78","1.06",".44",".18",".04",".40",".86",".77",".08",".04",".68","1.67","10.00","15.55","11.59","17919"
"Kyle Schwarber","Cubs","MIN @ CHC","LF","4.22",".97",".40",".14",".03",".40",".80",".89",".03",".02",".73","1.14","9.69","15.14","11.12","16478"
"Travis Shaw","Brewers","MIL @ CIN","3B","4.55","1.16",".56",".25",".01",".33",".73",".86",".03",".02",".58",".75","9.28","14.50","10.81","1

## Parse the output into a DataFrame

Now we can start parsing the data. Here, we strip the quote characters and convert the long string of text into a list of strings. Each string holds one player's name and projected values.

In [65]:
# remove the quote characters, then split into one string per line (split already returns a list)
batter_data = batter_text.strip().replace('"', '').split('\r\n')

# inspect first 5 elements
batter_data[0:5]

['\ufeffName,Team,Game,Pos,PA,H,1B,2B,3B,HR,R,RBI,SB,CS,BB,SO,Yahoo,FanDuel,DraftKings,playerid',
 'Mike Trout,Angels,LAA @ BAL,CF,4.67,1.23,.65,.21,.02,.34,.89,.73,.14,.07,.83,.92,10.46,16.24,12.20,10155',
 'Anthony Rizzo,Cubs,MIN @ CHC,1B,4.44,1.13,.54,.23,.02,.35,.88,.92,.04,.02,.75,.59,10.13,15.83,11.67,3473',
 'Eric Thames,Brewers,MIL @ CIN,1B,4.76,1.11,.51,.20,.03,.38,.89,.79,.07,.03,.71,1.18,10.06,15.67,11.65,3711',
 'Ian Happ,Cubs,MIN @ CHC,CF,4.78,1.06,.44,.18,.04,.40,.86,.77,.08,.04,.68,1.67,10.00,15.55,11.59,17919']
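
Stripping every quote and splitting on commas works for this dataset, but it would break on a field that itself contains a comma, and it leaves the byte order mark (`\ufeff`) on the first header. As a minimal sketch of a more defensive route, the standard `csv` module handles the quoting. The `parse_csv_text` helper is hypothetical, and the "Doe, John" row is made up to show a quoted comma.

```python
import csv
import io


def parse_csv_text(text):
    """Return (header, rows) from CSV text, honouring quoted fields."""
    # drop a leading byte order mark (U+FEFF) if the server sent one
    cleaned = text.lstrip('\ufeff')
    # newline='' leaves the \r\n line endings for the csv reader to handle
    reader = csv.reader(io.StringIO(cleaned, newline=''))
    rows = [row for row in reader if row]  # skip blank lines
    return rows[0], rows[1:]


sample = (
    '\ufeff"Name","Team","PA"\r\n'
    '"Mike Trout","Angels","4.67"\r\n'
    '"Doe, John","Cubs","4.10"\r\n'  # made-up row with a comma inside a field
)
header, records = parse_csv_text(sample)
print(header)
print(records)
```

The result has the same list-of-lists shape as the manual approach, with the header already separated and free of the byte order mark.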

We'll use the `pandas` package to take these data and convert them into a dataframe.

Workflow:

* Convert the above list of strings to a list of lists.
* Pop off the first element of the list and store it as the dataframe's column headers.
* Convert the list of lists to a list of tuples, the record format that `DataFrame.from_records()` is documented to accept.
* Apply the **from_records()** method of `pd.DataFrame` to produce a clean dataframe of projections data for analysis.


In [66]:
# convert list of strings to list of lists
batter_list = [player.split(',') for player in batter_data]
print(batter_list[0:5])

[['\ufeffName', 'Team', 'Game', 'Pos', 'PA', 'H', '1B', '2B', '3B', 'HR', 'R', 'RBI', 'SB', 'CS', 'BB', 'SO', 'Yahoo', 'FanDuel', 'DraftKings', 'playerid'], ['Mike Trout', 'Angels', 'LAA @ BAL', 'CF', '4.67', '1.23', '.65', '.21', '.02', '.34', '.89', '.73', '.14', '.07', '.83', '.92', '10.46', '16.24', '12.20', '10155'], ['Anthony Rizzo', 'Cubs', 'MIN @ CHC', '1B', '4.44', '1.13', '.54', '.23', '.02', '.35', '.88', '.92', '.04', '.02', '.75', '.59', '10.13', '15.83', '11.67', '3473'], ['Eric Thames', 'Brewers', 'MIL @ CIN', '1B', '4.76', '1.11', '.51', '.20', '.03', '.38', '.89', '.79', '.07', '.03', '.71', '1.18', '10.06', '15.67', '11.65', '3711'], ['Ian Happ', 'Cubs', 'MIN @ CHC', 'CF', '4.78', '1.06', '.44', '.18', '.04', '.40', '.86', '.77', '.08', '.04', '.68', '1.67', '10.00', '15.55', '11.59', '17919']]


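Pop off the first sublist, which will become the dataframe's column names. The first name carries a leading byte order mark (`\ufeff`), so we overwrite it with a clean `'Name'`.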

In [67]:
col_names = batter_list.pop(0)
col_names[0] = 'Name'
print(col_names)

['Name', 'Team', 'Game', 'Pos', 'PA', 'H', '1B', '2B', '3B', 'HR', 'R', 'RBI', 'SB', 'CS', 'BB', 'SO', 'Yahoo', 'FanDuel', 'DraftKings', 'playerid']


Now convert the remaining elements of the list of lists (`batter_list`) into a list of tuples.

In [68]:
batter_list_tup = [tuple(l) for l in batter_list] 
print(batter_list_tup[0:4])

[('Mike Trout', 'Angels', 'LAA @ BAL', 'CF', '4.67', '1.23', '.65', '.21', '.02', '.34', '.89', '.73', '.14', '.07', '.83', '.92', '10.46', '16.24', '12.20', '10155'), ('Anthony Rizzo', 'Cubs', 'MIN @ CHC', '1B', '4.44', '1.13', '.54', '.23', '.02', '.35', '.88', '.92', '.04', '.02', '.75', '.59', '10.13', '15.83', '11.67', '3473'), ('Eric Thames', 'Brewers', 'MIL @ CIN', '1B', '4.76', '1.11', '.51', '.20', '.03', '.38', '.89', '.79', '.07', '.03', '.71', '1.18', '10.06', '15.67', '11.65', '3711'), ('Ian Happ', 'Cubs', 'MIN @ CHC', 'CF', '4.78', '1.06', '.44', '.18', '.04', '.40', '.86', '.77', '.08', '.04', '.68', '1.67', '10.00', '15.55', '11.59', '17919')]


The list of tuples is now ready to be read in as a pandas dataframe. Calling `df.head(25)` then renders the top 25 batters by projected daily fantasy points.

In [69]:
import pandas as pd
pd.set_option('display.max_columns', 25)

df = pd.DataFrame.from_records(batter_list_tup,
                               columns=col_names,
                               coerce_float=True)

df.head(25)


|   | Name | Team | Game | Pos | PA | H | 1B | 2B | 3B | HR | R | RBI | SB | CS | BB | SO | Yahoo | FanDuel | DraftKings | playerid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mike Trout | Angels | LAA @ BAL | CF | 4.67 | 1.23 | 0.65 | 0.21 | 0.02 | 0.34 | 0.89 | 0.73 | 0.14 | 0.07 | 0.83 | 0.92 | 10.46 | 16.24 | 12.2 | 10155 |
| 1 | Anthony Rizzo | Cubs | MIN @ CHC | 1B | 4.44 | 1.13 | 0.54 | 0.23 | 0.02 | 0.35 | 0.88 | 0.92 | 0.04 | 0.02 | 0.75 | 0.59 | 10.13 | 15.83 | 11.67 | 3473 |
| 2 | Eric Thames | Brewers | MIL @ CIN | 1B | 4.76 | 1.11 | 0.51 | 0.2 | 0.03 | 0.38 | 0.89 | 0.79 | 0.07 | 0.03 | 0.71 | 1.18 | 10.06 | 15.67 | 11.65 | 3711 |
| 3 | Ian Happ | Cubs | MIN @ CHC | CF | 4.78 | 1.06 | 0.44 | 0.18 | 0.04 | 0.4 | 0.86 | 0.77 | 0.08 | 0.04 | 0.68 | 1.67 | 10.0 | 15.55 | 11.59 | 17919 |
| 4 | Kyle Schwarber | Cubs | MIN @ CHC | LF | 4.22 | 0.97 | 0.4 | 0.14 | 0.03 | 0.4 | 0.8 | 0.89 | 0.03 | 0.02 | 0.73 | 1.14 | 9.69 | 15.14 | 11.12 | 16478 |
| 5 | Travis Shaw | Brewers | MIL @ CIN | 3B | 4.55 | 1.16 | 0.56 | 0.25 | 0.01 | 0.33 | 0.73 | 0.86 | 0.03 | 0.02 | 0.58 | 0.75 | 9.28 | 14.5 | 10.81 | 11982 |
| 6 | Javier Baez | Cubs | MIN @ CHC | 2B | 4.34 | 1.16 | 0.6 | 0.2 | 0.04 | 0.31 | 0.74 | 0.93 | 0.07 | 0.04 | 0.27 | 1.14 | 8.92 | 13.99 | 10.5 | 12979 |
| 7 | Paul Goldschmidt | Diamondbacks | SF @ ARI | 1B | 4.38 | 1.09 | 0.56 | 0.24 | 0.03 | 0.25 | 0.72 | 0.68 | 0.08 | 0.03 | 0.78 | 0.99 | 9.01 | 14.01 | 10.47 | 9218 |
| 8 | Brian Dozier | Twins | MIN @ CHC | 2B | 4.43 | 1.06 | 0.49 | 0.21 | 0.03 | 0.33 | 0.73 | 0.79 | 0.06 | 0.03 | 0.53 | 0.85 | 8.97 | 13.99 | 10.45 | 9810 |
| 9 | Freddie Freeman | Braves | ATL @ STL | 1B | 4.55 | 1.29 | 0.73 | 0.28 | 0.02 | 0.25 | 0.71 | 0.69 | 0.04 | 0.03 | 0.62 | 0.75 | 8.9 | 13.83 | 10.49 | 5361 |
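
Whether `coerce_float=True` turns numeric strings such as `'.65'` into floats appears to depend on the pandas version, so it may be safer not to rely on it. The sketch below shows two hedged alternatives on a trimmed-down sample of the export: letting `pd.read_csv` parse the raw text directly, or converting chosen columns with `pd.to_numeric`. The names `sample_text`, `df_csv`, `df_str`, and `numeric_cols` are illustrative.

```python
import io
import pandas as pd

# Small sample in the same shape as the export (columns trimmed for brevity)
sample_text = (
    '\ufeff"Name","Team","PA","HR","playerid"\r\n'
    '"Mike Trout","Angels","4.67",".34","10155"\r\n'
    '"Anthony Rizzo","Cubs","4.44",".35","3473"\r\n'
)

# Option 1: let read_csv handle quoting, line endings, and type inference.
# The byte order mark is stripped first so the first column is named 'Name'.
df_csv = pd.read_csv(io.StringIO(sample_text.lstrip('\ufeff')))

# Option 2: start from a dataframe of strings and convert chosen columns explicitly
df_str = pd.DataFrame.from_records(
    [('Mike Trout', '4.67', '.34'), ('Anthony Rizzo', '4.44', '.35')],
    columns=['Name', 'PA', 'HR'],
)
numeric_cols = ['PA', 'HR']
df_str[numeric_cols] = df_str[numeric_cols].apply(pd.to_numeric)

print(df_csv.dtypes)
print(df_str.dtypes)
```

With the real response, the first option would take `batter_text` in place of `sample_text`, which also removes the need for the manual split, pop, and tuple steps.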
