Webscraping
===
MAIC - Spring, Week 3<br>
```
  _____________
 /0   /     \  \
/  \ M A I C/  /\
\ / *      /  / /
 \___\____/  @ /
          \_/_/
```
(Rosie is not needed!)

Prereqs:
- Install [VSCode](https://code.visualstudio.com/)
- Install [Python](https://www.python.org/downloads/)
  - "The best language" - Guido van Rossum
- Ensure you can run notebooks in VSCode.

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**What is webscraping?**

Are you in need of data? Maybe you want to analyze some data for insights. Or maybe you just want to train a model. In any case, you may be able to get the data you need via webscraping!

Webscraping is the process of *automatically* extracting data from websites. You can manually extract website data on your browser via "inspect," but automating this process is ideal if you need anything more than a few samples.

- Go to any website (for instance, the [MAIC](https://msoe-maic.com/) site).
- Right-click anywhere on the page. Select the "inspect" option or something labeled similarly. This is usually at the bottom of the pop-up menu.
- Note the window that opened. It contains the raw HTML (and possibly JS/CSS) site data. This is what we want to scrape automatically.
- Use the element selector at the top left of the inspect window to see the HTML of specific elements.

---

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

**That's cool. How can I scrape automatically?**

Let's try scraping the MAIC leaderboard!

Basic scraping only needs the `requests` library.

In [None]:
%pip install requests

In [None]:
import requests

URL = 'https://msoe-maic.com'

html = requests.get(URL).text # Make a request to the URL and get the HTML via `.text`
print(html[:500]) # Print some of the resulting HTML

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vect


This HTML now contains the leaderboard for us to extract. But how do we extract it?

One easy way is to *inspect* the page in your browser and look for something in the HTML that identifies the leaderboard. Here, the leaderboard element has the "leaderboard-table" class:

```html
<table border="1" class="dataframe leaderboard-table" id="df_data">
    ...
</table>
```

We could try looking for "leaderboard-table" in the HTML string, but there's a better way. `BeautifulSoup` is a Python library that makes parsing HTML easy.
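
To see why plain string search falls short, here is a minimal sketch on a made-up snippet (the table contents are invented for illustration, not taken from the real leaderboard). `str.find` can tell us where the class name appears, but it knows nothing about where the table ends or which cells belong to which row.

```python
# A made-up HTML snippet standing in for the real page (assumption for illustration).
sample_html = """
<table border="1" class="dataframe leaderboard-table" id="df_data">
  <tr><th>Name</th><th>Points</th></tr>
  <tr><td>Ada</td><td>42</td></tr>
</table>
"""

# Plain string search only gives a character offset...
position = sample_html.find('leaderboard-table')
print(position != -1)  # True: the class name is somewhere in the string

# ...and everything after that is manual slicing with no notion of rows or cells.
print(sample_html[position:position + 40])
```

An HTML parser avoids this by turning the string into a tree of elements that can be searched by tag, class, or id.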

In [None]:
# Install BeautifulSoup and possibly restart your notebook, being sure to re-run prior cells.
%pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup # We can now use BeautifulSoup to parse the HTML

soup = BeautifulSoup(html, 'html.parser') 
print(soup.prettify()[:500]) # print it as before, but now it's prettified

Now we can use BeautifulSoup to find the "leaderboard-table" element.

In [None]:
# find the table element with class "leaderboard-table"

leaderboard_table = soup.find('table', {'class': 'leaderboard-table'})

print(leaderboard_table.prettify()[:500]) # print the first 500 characters of the table

Not only can BeautifulSoup find the element, it also lets us easily extract the data.

In [None]:
# Extract table data into a list of dictionaries

rows = leaderboard_table.find_all('tr') # Find all rows in the table
header = [cell.text for cell in rows[0].find_all('th')] # Get the header row
data = [
    {header[i]: cell.text for i, cell in enumerate(row.find_all('td'))} # Create a dictionary for each row using the header to name the keys
    for row in rows[1:]
]

data

Pretty neat, right?
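
The dictionary comprehension in the cell above pairs each cell with the header in the same column position. The same pairing can be seen in isolation with plain lists; the names and scores below are made up, and `zip` is used here as an alternative to indexing with `enumerate`.

```python
# Made-up header and rows standing in for the scraped cell text (assumption).
header = ['Rank', 'Name', 'Points']
rows = [
    ['1', 'Ada', '42'],
    ['2', 'Grace', '37'],
]

# zip pairs each header name with the cell in the same column position.
data = [dict(zip(header, row)) for row in rows]

print(data[0])  # {'Rank': '1', 'Name': 'Ada', 'Points': '42'}
```

Note that every value is a string, because scraped text always is. Convert with `int(...)` or `float(...)` before doing any math on it.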

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**It's not always this easy.**

Some pages dynamically generate content using JavaScript. This is a problem for us because the `requests` library cannot run JavaScript code. Let's try to scrape content from a page that uses a lot of JavaScript.

- Go to [the MAIC research groups page](https://msoe-maic.com/library?nav=Research).
- Use the element selector to select a group's section.
- Note the id of the element.

For instance, the page has this div with an id of `agent-simulation-experts`.

```html
<div class="MuiPaper-root MuiPaper-elevation MuiPaper-rounded MuiPaper-elevation1 MuiCard-root modal css-1kil0ip" id="agent-simulation-experts">
    ...
</div>
```

It's important to note, however, that this element was generated with JavaScript. So what happens if we try scraping this element with `requests`?

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

In [None]:
import requests

URL = 'https://msoe-maic.com/library?nav=Research'

html = requests.get(URL).text # Make a request to the URL and get the HTML via `.text`

# We don't seem to get much HTML from this page
print(html)

In [None]:
# In fact, the HTML has zero mentions of the div we saw earlier!
print('agent-simulation-experts' in html)


<span style="color:#5555ff;font-weight:bold;">
    Try this yourself.
</span>

Go to some websites and see what HTML you can scrape with `requests`. See if anything in the browser inspection tool appears in the `html` variable. You may find that many websites aren't easily scrapable this way.

Some sites to try:
- https://www.youtube.com/
- https://www.google.com/search?q=your+search+here
- https://www.reddit.com/
- https://stackoverflow.com/questions
- https://github.com/MSOE-AI-Club
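
One way to run this comparison is a small helper that reports which strings from the inspector actually show up in the raw HTML. This is only a sketch: `find_markers` is a hypothetical helper, and the HTML below is a made-up stand-in for a JavaScript-driven page, not a real response.

```python
def find_markers(html, markers):
    """Report which marker strings (ids, class names, text) occur in raw HTML."""
    return {marker: marker in html for marker in markers}

# Made-up response body resembling a JavaScript-driven page: an empty
# container plus a script tag, with none of the rendered content.
raw_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

report = find_markers(raw_html, ['root', 'agent-simulation-experts'])
print(report)  # {'root': True, 'agent-simulation-experts': False}
```

In the notebook, pass `requests.get(URL).text` as the first argument, with ids or class names copied from the inspector as the markers.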

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**Q: So how do we scrape pages that use JavaScript?**

A: Use Selenium.

Selenium is a browser automation tool. It can drive a headless browser that executes page JavaScript.

The `requests` library cannot run JavaScript, so any page content generated by JavaScript is impossible to scrape with `requests` alone. Luckily, browsers are *made* to run JavaScript. Selenium drives a regular browser such as Chrome, so it runs JavaScript exactly as that browser would. It can also run that browser without a UI (headless mode), so you can interact with pages programmatically.


<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

We'll wrap `selenium` in a function call to make it work similarly to `requests`. Feel free to read the function comments if you want to dive deeper into `selenium`.

In [None]:
%pip install selenium
%pip install webdriver-manager

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.edge.service import Service as EdgeService
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.edge.options import Options as EdgeOptions
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.microsoft import EdgeChromiumDriverManager
from bs4 import BeautifulSoup
import time

def setup_chrome():
    options = ChromeOptions()
    options.add_argument('--headless')  # Run in headless mode (no GUI)
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    
    print('Opening Chrome Webdriver')
    return webdriver.Chrome(
        service=ChromeService(ChromeDriverManager().install()),
        options=options
    )
        
def setup_edge():
    options = EdgeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    
    print('Opening Edge Webdriver')
    return webdriver.Edge(
        service=EdgeService(EdgeChromiumDriverManager().install()),
        options=options
    )

driver = None
try:
    driver = setup_chrome()
except Exception as e:
    print(f"Chrome failed: {e}")
    print("Falling back to Edge...")
    driver = setup_edge()

def get_page_content(url):
    """
    Opens a URL using Selenium and retrieves the page contents.
    Tries Chrome first, falls back to Edge if Chrome fails.
    
    Args:
        url (str): The URL to open
        
    Returns:
        tuple: (raw_html, parsed_text) where raw_html is the page source and 
               parsed_text is the cleaned text content
    """
    
    print('Scrape')
    
    try:
        # Open the URL
        driver.get(url)
        
        # Wait for a short time to ensure the page loads
        time.sleep(2)
        
        # Get the page source
        page_content = driver.page_source
        
        # Parse with BeautifulSoup
        soup = BeautifulSoup(page_content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.decompose()
        
        # Find all code blocks and wrap their text in backticks
        code_blocks = soup.find_all(['pre', 'code'])
        for block in code_blocks:
            if block.string:
                block.string = f'```{block.string}```'
            else:
                # Handle nested elements within code blocks
                block.string = f'```{block.get_text()}```'
                
        # Get text and clean it
        text = soup.get_text().replace("```Copy", "```")
        
        # Clean up the text
        # Break into lines and remove leading/trailing space on each
        lines = (line.strip() for line in text.splitlines())
        
        # Break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        
        # Drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        
        return page_content, text
    
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None, None
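
The fixed `time.sleep(2)` in the function above is a guess: too short on a slow connection, wasted time on a fast one. A common alternative is to poll until the content shows up. The sketch below shows the idea with a hypothetical `wait_until` helper and a stand-in condition; Selenium provides its own version of this through `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Condition not met within {timeout} seconds')
        time.sleep(interval)

# Stand-in for "has the element appeared yet?": succeeds on the third check.
attempts = {'count': 0}

def element_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3

wait_until(element_ready, timeout=2.0, interval=0.01)
print(attempts['count'])  # 3
```

With the real driver, the condition could be something like `lambda: 'agent-simulation-experts' in driver.page_source`.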

In [None]:
URL = 'https://msoe-maic.com/library?nav=Research'

# YOU NEED Chrome or Edge installed. Sorry Mac users :(
html, _ = get_page_content(URL) # Get the page html, but this time with selenium. NOTE: This will take a while to run the first time around b/c the webdriver has to be installed
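
The second return value, discarded above as `_`, is the cleaned text. To see what the cleanup steps inside `get_page_content` do, here they are pulled out into a hypothetical `clean_text` helper and run on a made-up string (the product line and price are invented for illustration).

```python
def clean_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)

# Made-up text resembling what soup.get_text() returns for a messy page.
raw = "  iPhone 14 Pro  \n\n\n   Refurbished    $759.00  \n"

print(clean_text(raw))
# iPhone 14 Pro
# Refurbished
# $759.00
```

Blank lines disappear, and pieces separated by runs of spaces end up on their own lines, which gives an LLM much less noise to read.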

In [None]:
# Find the element with id 'agent-simulation-experts' and then find any descendant with class 'modal-header'
soup = BeautifulSoup(html, 'html.parser')
agent_sim_div = soup.find('div', {'id': 'agent-simulation-experts'})
modal_header = agent_sim_div.find(class_='modal-header') if agent_sim_div else None # Guard against the div not having rendered yet
print(modal_header.get_text().strip() if modal_header else "No modal-header found")
print(agent_sim_div)
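
The two-step lookup above can also be written as a single CSS selector with BeautifulSoup's `select_one`, which returns `None` when nothing matches. This is a sketch on made-up markup that only loosely mirrors the rendered page; the heading text is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup loosely shaped like the rendered page (assumption).
sample_html = """
<div id="agent-simulation-experts">
  <div class="modal-header">Agent Simulation Experts</div>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# '#id .class' means: any descendant with that class inside the element with that id.
header = soup.select_one('#agent-simulation-experts .modal-header')
missing = soup.select_one('#no-such-group .modal-header')

print(header.get_text(strip=True))  # Agent Simulation Experts
print(missing)  # None
```

Checking for `None` before calling `.get_text()` avoids an `AttributeError` when the page did not finish rendering.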

<span style="color:#ff5555;font-weight:bold;font-size:1.5rem;">
    STOP
</span>

... or keep going if you want to work ahead.

---

**Using LLMs to summarize scraped data.**

If you're scraping unstructured data, then LLMs are a must. Although there is structure in the HTML elements, it can often be easier to ask an LLM to structure the output for you.

Let's structure the output of a page listing refurbished iPhones for sale.

You will need a Gemini API key to run this example. [Link to Gemini API](https://aistudio.google.com/)

<span style="color:#55ff55;font-weight:bold;font-size:1.5rem;">
    GO
</span>

This example will:
- Use Selenium to scrape for refurbished iPhones.
- Use an LLM to summarize the results into a structured format.

In [None]:
# "You're doing great kid" - Linus Torvalds
%pip install pydantic pydantic-ai pandas

In [57]:
import os
from pydantic_ai import Agent
from typing import List
from pydantic import BaseModel, Field

In [None]:
URL = 'https://www.apple.com/shop/refurbished/iphone/iphone-14-pro'

# --- NOTE: put your key here ---
os.environ["GEMINI_API_KEY"] = ...

In [47]:
# This is where you can specify the output structure

class ProductResult(BaseModel):  
    model: str = Field(description='The model of the product')
    description: str = Field(description='The description of the product')
    cost: int = Field(description="The cost of the product")
    isp: str = Field(description="The internet service provider")
    color: str = Field(description="The color of the product")
    refurbished: bool = Field(description="Whether the product is refurbished")

# We are storing a list of ProductResults in the final output
class RequestResults(BaseModel):
    products: List[ProductResult] = Field(description='The list of product results')
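
These models do more than document the shape of the output: Pydantic checks incoming data against them. The sketch below uses trimmed-down, hypothetical `MiniProduct` and `MiniResults` models and made-up data, and assumes Pydantic v2 (where `model_validate` is available).

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

# Trimmed-down, hypothetical stand-ins for the models above.
class MiniProduct(BaseModel):
    model: str = Field(description='The model of the product')
    cost: int = Field(description='The cost of the product')

class MiniResults(BaseModel):
    products: List[MiniProduct]

# Made-up data shaped like what an LLM might return.
good = MiniResults.model_validate({'products': [{'model': 'iPhone 14 Pro', 'cost': 759}]})
print(good.products[0].cost)  # 759

# A value that cannot be read as an integer is rejected.
try:
    MiniResults.model_validate({'products': [{'model': 'iPhone 14 Pro', 'cost': 'cheap'}]})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```

This is why a structured result type is useful with an LLM: malformed answers fail loudly instead of slipping into the DataFrame.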

In [64]:
# "Here's where the AI comes in" - Andrej Karpathy
agent = Agent( # Create an agent that will structure the output
    'google-gla:gemini-1.5-flash',
    result_type=RequestResults,
    system_prompt='Be concise, reply with one sentence.',  
)

# Agent system prompt - tell it what to do
@agent.system_prompt
async def add_extraction_goal(ctx) -> str:
    return "Your goal is to extract product information from web scraped pages and format it to a structured response."

In [None]:
_, text = get_page_content(URL) # Scrape the list of iPhones, and get the text (so the LLM can read it more easily)

In [66]:
result = await agent.run(text)

In [None]:
import pandas as pd
from IPython.display import display

# Structure the output into a DataFrame to see our results
item_dicts = [item.model_dump() for item in result.data.products]
df = pd.DataFrame(item_dicts)
display(df.head(20))

In [None]:
driver.quit() # Always remember to close the webdriver when you're done with it