Webscraping
===
MAIC - Spring, Week 2<br>
```
  _____________
 /0   /     \  \
/  \ M A I C/  /\
\ / *      /  / /
 \___\____/  @ /
          \_/_/
```
(Rosie is not needed!)

Prereqs:
- Install [VSCode](https://code.visualstudio.com/)
- Install [Python](https://www.python.org/downloads/)
- Ensure you can run notebooks in VSCode.

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**What is webscraping?**

Are you in need of data? Maybe you want to analyze some data for insights. Or maybe you just want to train a model. In any case, you may be able to get the data you need via webscraping!

Webscraping is the process of *automatically* extracting data from websites. You can extract website data by hand in your browser via "inspect," but automating the process is the way to go if you need more than a few samples.

- Go to any website (for instance, the [MAIC](https://msoe-maic.com/) site).
- Right-click anywhere on the page. Select the "inspect" option or something labeled similarly. This is usually at the bottom of the pop-up menu.
- Note the panel that opened. It shows the page's HTML (and possibly its JS/CSS). This is the data we want to scrape automatically.
- Use the element selector at the top left of the inspect window to see the HTML of specific elements.

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

**That's cool. How can I scrape automatically?**

Let's try scraping the MAIC leaderboard!

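Basic scraping only needs the `requests` library. It is a third-party package rather than part of Python's standard library, so run `pip install requests` if the import in the next cell fails.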

In [None]:
import requests

URL = 'https://msoe-maic.com'

html = requests.get(URL).text # Make a request to the URL and get the HTML via `.text`
print(html[:500]) # Print some of the resulting HTML
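
If the request hangs or the site returns an error page, the one-liner above will quietly hand you bad data (or never finish). Here is a minimal, more defensive sketch. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are all assumptions made for illustration, not something the MAIC site requires.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML at `url`, raising requests.RequestException on failure."""
    response = requests.get(
        url,
        timeout=timeout,  # give up instead of waiting forever (10 s is an arbitrary choice)
        headers={"User-Agent": "maic-workshop-scraper/0.1"},  # hypothetical identifier
    )
    response.raise_for_status()  # turn 4xx/5xx status codes into an exception
    return response.text


# Usage (makes a real network request, so it is left commented out here):
# html = fetch_html('https://msoe-maic.com')
```

Wrapping the call in `try: ... except requests.RequestException:` lets you decide what to do when the site is unreachable.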

The `html` string now contains the leaderboard. But how do we extract it?

One easy way is to *inspect* the page in your browser and check whether something in the HTML identifies the leaderboard. Here, the leaderboard is a `<table>` element with the "leaderboard-table" class:

```html
<table border="1" class="dataframe leaderboard-table" id="df_data">
    ...
</table>
```

We could try searching for "leaderboard-table" in the HTML string, but there's a better way. `BeautifulSoup` is a Python library that makes parsing HTML easy.
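
To see why plain string search gets awkward, here is a small sketch on a made-up snippet (the `sample_html` string is an assumption, not the real page). The first match lands in ordinary paragraph text rather than in the table, and even a correct match only gives you a character position, not the rows and cells.

```python
# A tiny stand-in for the page; the real HTML from msoe-maic.com will differ.
sample_html = """
<p>See the leaderboard-table below.</p>
<table border="1" class="dataframe leaderboard-table" id="df_data">
    <tr><th>Name</th></tr>
</table>
"""

first_hit = sample_html.find("leaderboard-table")  # index of the first match
table_start = sample_html.find("<table")           # index where the table really starts

print(first_hit < table_start)          # the first match comes before the table
print(sample_html[first_hit - 11:first_hit + 24])  # text around the first match
```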

In [None]:
!pip install beautifulsoup4 # Install BeautifulSoup and possibly restart your notebook, being sure to re-run prior cells.

In [None]:
from bs4 import BeautifulSoup # We can now use BeautifulSoup to parse the HTML

soup = BeautifulSoup(html, 'html.parser') 
print(soup.prettify()[:500]) # print it as before, but now it's prettified
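
The parsed `soup` is more than a nicely printed string: it is a tree you can navigate. A rough sketch on a made-up page (the `sample_html` content is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this example.
sample_html = """
<html>
  <head><title>Sample club page</title></head>
  <body>
    <a href="/about">About</a>
    <a>no link here</a>
    <a href="https://example.com/">Example</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.get_text(strip=True)  # text inside <title>
links = [a["href"] for a in sample_soup.find_all("a", href=True)]  # only <a> tags that have an href

print(page_title)
print(links)
```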

Now we can use BeautifulSoup to find the "leaderboard-table" element.

In [None]:
# find the table element with class "leaderboard-table"

leaderboard_table = soup.find('table', {'class': 'leaderboard-table'})

print(leaderboard_table.prettify()[:500]) # print the first 500 characters of the table
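
There is more than one way to point BeautifulSoup at the same element, and `find` returns `None` when nothing matches, so it is worth checking before you call methods on the result. A minimal sketch, assuming a made-up `sample_html` that mimics the table's attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; only the class and id mirror the snippet shown earlier.
sample_html = """
<table class="dataframe other-table"><tr><td>not this one</td></tr></table>
<table border="1" class="dataframe leaderboard-table" id="df_data">
  <tr><th>Name</th></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_class = sample_soup.find("table", class_="leaderboard-table")  # keyword form of the class filter
by_css = sample_soup.select_one("table.leaderboard-table")        # CSS selector
by_id = sample_soup.find(id="df_data")                            # lookup by id attribute
missing = sample_soup.find("table", class_="no-such-class")       # nothing matches

print(by_class["id"], by_css["id"], by_id["id"])

if missing is None:
    print("Table not found; the page layout may have changed.")
```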

Not only can BeautifulSoup find the element, it also lets us easily extract the data.

In [None]:
# Extract table data into a list of dictionaries

rows = leaderboard_table.find_all('tr') # Find all rows in the table
header = [cell.text for cell in rows[0].find_all('th')] # Get the header row
data = [
    {header[i]: cell.text for i, cell in enumerate(row.find_all('td'))} # Create a dictionary for each row using the header to name the keys
    for row in rows[1:]
]

data

Pretty neat, right?
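
If you want to experiment without hitting the site, here is the same idea on a tiny made-up table. The names and scores are invented, and the sketch assumes every data row has exactly one `<td>` per header cell. It pairs headers with cells using `zip` and strips stray whitespace with `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Made-up table; the real leaderboard columns may differ.
sample_html = """
<table class="dataframe leaderboard-table">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td> Ada </td><td>42</td></tr>
  <tr><td>Grace</td><td>37</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table", class_="leaderboard-table")
sample_rows = sample_table.find_all("tr")

# Header names come from the <th> cells of the first row.
sample_header = [th.get_text(strip=True) for th in sample_rows[0].find_all("th")]

# Pair each header with the matching <td> in every remaining row.
sample_data = [
    dict(zip(sample_header, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in sample_rows[1:]
]

print(sample_data)
# [{'Name': 'Ada', 'Score': '42'}, {'Name': 'Grace', 'Score': '37'}]
```

Note that every value comes back as a string, so convert with `int(...)` or `float(...)` before doing any math on it.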

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

TODO