# Web Scraping Coding Sample
#### By Daryl Adopo

Any web scraping requires a basic knowledge of HTML.
The basic HTML syntax of a webpage looks like this:

In [1]:
%%html
<!DOCTYPE html>
<html>
    <head>
    <meta charset="utf-8">
    </head>
    <body>
        <h1 class="heading"> My Website </h1>
        <p>Hello World! </p>
    </body>
</html>

Every HTML tag can carry attributes such as **class**, **id**, and **href**, which help to uniquely identify the element.

For more information about basic HTML tags, check out [w3schools](https://www.w3schools.com/html/).
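To make the role of attributes concrete, here is a minimal sketch of reading them with BeautifulSoup. The snippet, its `id`, and its `href` value are made up for illustration, and the built-in `html.parser` backend is used so nothing beyond `bs4` is needed.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the id and href values are invented for this example
snippet = '<a class="heading" id="home-link" href="https://example.com">My Website</a>'

soup_demo = BeautifulSoup(snippet, "html.parser")
link = soup_demo.find("a")

print(link["href"])     # look up an attribute by name
print(link.get("id"))   # .get() returns None when the attribute is missing
print(link["class"])    # class can hold several values, so it comes back as a list
```

Because **class** is returned as a list, comparisons against it usually check membership rather than equality.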

For this sample, I used:
- **requests** to get the raw HTML
- **BeautifulSoup** to parse HTML in Python
- **csv** to export the data

In [2]:
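# Required Packages
try:
    from bs4 import BeautifulSoup
    import requests
    import lxml  # parser backend passed to BeautifulSoup below
    import csv
except ImportError:
    # Install the missing packages quietly, then retry the imports
    !pip install -q bs4 requests lxml
    from bs4 import BeautifulSoup
    import requests
    import csv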

I will be using the Major League Baseball salary data publicly available on [spotrac.com](https://www.spotrac.com/mlb/rankings/2021/salary/). \
I am not really a baseball fan, but I am currently working on a project that requires this data.

In [3]:
%%html
<iframe src="https://www.spotrac.com/mlb/rankings/2021/salary/" width="800" height="500"></iframe>

In [4]:
url="https://www.spotrac.com/mlb/rankings/2021/salary/"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")

I take a look at the source code of the page to know where I can find the data that I need.

In [5]:
# print(soup.prettify()) # print the parsed data of html

After looking at the source code, I notice that the player salaries are in a **table** tag with the attribute **class = "datatable noborder"**.

In [6]:
# Find Table of interest on the webpage
mlb_table = soup.find("table", attrs={"class": "datatable noborder"})
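One caveat worth sketching: passing the full string `"datatable noborder"` matches the class attribute as an exact string, so it would miss a table whose classes appear in a different order. The example below uses a made-up inline table, not the real page, to contrast that lookup with a CSS selector, which matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up table whose classes are in the opposite order
html_demo = """
<table class="noborder datatable"><tr><td>reordered classes</td></tr></table>
"""
demo = BeautifulSoup(html_demo, "html.parser")

# Exact-string match on the class attribute: misses the reordered table
exact = demo.find("table", attrs={"class": "datatable noborder"})

# CSS selector: matches any table carrying both classes, in any order
flexible = demo.select_one("table.datatable.noborder")

print(exact)
print(flexible is not None)
```

If the site ever reorders or adds classes, the selector form is the more forgiving of the two.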

Since my goal is to extract (scrape) the data from this table, I will need to store each row as a dictionary, with the headings as keys, to match the format expected by the **csv** module.
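As a small illustration of that target format, the sketch below writes two invented rows with `csv.DictWriter` into an in-memory buffer instead of a file. The field names and values are placeholders, not data from the site.

```python
import csv
import io

# Invented rows, only to show the shape DictWriter expects
fieldnames = ["Rank", "Player", "Salary"]
rows = [
    {"Rank": "1", "Player": "Player A", "Salary": "$1,000"},
    {"Rank": "2", "Player": "Player B", "Salary": "$900"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dictionary

print(buffer.getvalue())
```

Values containing commas, such as the first salary, are quoted automatically by the writer.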

After a long process of trial and error, I am able to store the information as I need it.

In [7]:
mlb_table_header = mlb_table.thead.find_all("tr") # Headers
mlb_table_data = mlb_table.tbody.find_all("tr")  # Rows

# Extract Information from the Table
# Get all the headings
headings = []
for th in mlb_table_header[0].find_all("th"):
    # remove any newlines and extra spaces from left and right
    headings.append(th.text.replace('\n', ' ').strip())
    
data = []
for tr in mlb_table_data: # find all tr's from table's tbody

    row = {}
    # Each row is stored in the form of
    # row = {'Rank': '', 'Player': '',etc...}

    # find all td's in tr and zip it with headings
    for td, th in zip(tr.find_all("td"), headings):
        # Getting the player name only
        # (td.get avoids a KeyError when a cell has attributes but no class)
        if "rank-name" in td.get("class", []):
            row[th] = td.find('h3').text.replace('\n', '').strip()
            # Creating custom column for the team code
            row['code'] = td.find('div', "rank-position").text.replace('\n', '').strip()
            continue
        row[th] = td.text.replace('\n', '').strip()
    data.append(row)

# Adding custom column to headings
headings.insert(3, 'code')
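The core of the cell above is pairing each cell with its heading. Stripped of the HTML handling, the idea can be sketched with plain lists; the names ending in `_demo` and the values are placeholders.

```python
headings_demo = ["Rank", "Player", "Salary"]
cells_demo = ["1", "Player A", "$1,000"]

# zip pairs items by position and stops at the shorter sequence
row_demo = dict(zip(headings_demo, cells_demo))

# A custom key added to the row must also be added to the headings,
# because csv.DictWriter raises ValueError for keys it was not told about
row_demo["code"] = "XYZ"
headings_demo.insert(2, "code")

print(row_demo)
print(headings_demo)
```

The position passed to `insert` only controls where the new column lands in the exported file.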

Now I am ready to export the data as a csv file.

In [None]:
# Exporting Table as a CSV File
with open("mlb_player_salary.csv", 'w', newline='') as out_file:
    writer = csv.DictWriter(out_file, headings)
    writer.writeheader()
    writer.writerows(data)

Now that I know what I am doing, I can automate the process to get the player salaries for each team.

### List of Teams Url

First, I need a list of all the teams on the website.

After inspecting the source code, I notice that the team URLs are in the **select** tag with the attribute **name="teamUrl1"**.

In [8]:
mlb_teams = soup.find("select", attrs={"name": "teamUrl1"})
mlb_teams_data = mlb_teams.find_all('option')

# Storing team url
teams = []
for option in mlb_teams_data:
    teams.append(option.attrs['value'])
    
print(teams)

['', 'arizona-diamondbacks', 'atlanta-braves', 'baltimore-orioles', 'boston-red-sox', 'chicago-cubs', 'chicago-white-sox', 'cincinnati-reds', 'cleveland-indians', 'colorado-rockies', 'detroit-tigers', 'houston-astros', 'kansas-city-royals', 'los-angeles-angels', 'los-angeles-dodgers', 'miami-marlins', 'milwaukee-brewers', 'minnesota-twins', 'new-york-mets', 'new-york-yankees', 'oakland-athletics', 'philadelphia-phillies', 'pittsburgh-pirates', 'san-diego-padres', 'san-francisco-giants', 'seattle-mariners', 'st-louis-cardinals', 'tampa-bay-rays', 'texas-rangers', 'toronto-blue-jays', 'washington-nationals']


Then, I request the page for each team in turn and extract the data using the code from before.

In [None]:
for team in teams:
    if not team:  # skip the blank placeholder option at the start of the list
        continue
    url=f"https://www.spotrac.com/mlb/rankings/2021/salary/{team}"

    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text

    # Parse the html content
    soup = BeautifulSoup(html_content, "lxml")
    
    # Find Table of interest on the webpage
    mlb_table = soup.find("table", attrs={"class": "datatable noborder"})

    mlb_table_header = mlb_table.thead.find_all("tr") # Headers
    mlb_table_data = mlb_table.tbody.find_all("tr")  # Rows
    
    # Extract Information from the Table
    # Get all the headings
    headings = []
    for th in mlb_table_header[0].find_all("th"):
        # remove any newlines and extra spaces from left and right
        headings.append(th.text.replace('\n', ' ').strip())
    
    data = []
    for tr in mlb_table_data: # find all tr's from table's tbody

        row = {}
        # Each row is stored in the form of
        # row = {'Rank': '', 'Player': '',etc...}

        # find all td's in tr and zip it with headings
        for td, th in zip(tr.find_all("td"), headings):
            # Getting the player name only
            # (td.get avoids a KeyError when a cell has attributes but no class)
            if "rank-name" in td.get("class", []):
                row[th] = td.find('h3').text.replace('\n', '').strip()
                # Creating custom column for team code
                row['code'] = td.find('div', "rank-position").text.replace('\n', '').strip()
                continue
            row[th] = td.text.replace('\n', '').strip()
        data.append(row)

    # Adding custom column to headings
    headings.insert(3, 'code')
    
    # Exporting Table as a CSV File
    with open(f"mlb_player_salary_{team}.csv", 'w', newline = '') as out_file:
        writer = csv.DictWriter(out_file, headings)
        writer.writeheader()
        writer.writerows(data)
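When requesting thirty pages in a row, it can be considerate to pause between requests and to skip the blank entry that appears first in the printed team list. The sketch below is one possible way to structure that, not part of the original notebook: `team_pages`, its `fetch` parameter, and the one-second default delay are all hypothetical choices made for illustration.

```python
import time

BASE_URL = "https://www.spotrac.com/mlb/rankings/2021/salary/"


def team_pages(teams, fetch, delay=1.0):
    """Yield (team, html) pairs, skipping blank slugs and pausing between requests.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    slugs = [team for team in teams if team]  # drop the empty placeholder option
    for i, team in enumerate(slugs):
        if i:
            time.sleep(delay)  # wait before every request after the first
        yield team, fetch(f"{BASE_URL}{team}")
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and the loop body would parse each `html` value exactly as in the cell above.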