# Web Scraping

## Marvin lets you build web scrapers that automatically *evolve with* your source material.

One of the hardest parts of web scraping at scale is maintaining custom code for nearly every page. If you're trying to monitor and extract new products from, say, Apple, Amazon, eBay, etc., you're on the hook for writing new code for each site. When they change their markup, your pipeline breaks, and you lose your data for that run. Large Language Models can deduce and infer where in the webpage your desired data lives. As your source material's schema changes, an LLM can reason about where to find the new data instead of failing.
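
To make that failure mode concrete, here is a minimal sketch of a hand-written scraper that depends on a single class name. The HTML snippets, the `price` and `product-cost` class names, and the `ClassTextExtractor` helper are all made up for illustration; only the standard library is used.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self._capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # An attribute without a value is reported as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.text is None and self.css_class in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.text = data.strip()
            self._capturing = False


def extract_price(html: str):
    """Return the text of the element with class 'price', or None."""
    parser = ClassTextExtractor("price")
    parser.feed(html)
    return parser.text


# Hypothetical markup before and after a site redesign.
page_v1 = '<div><span class="price">$999</span></div>'
page_v2 = '<div><span class="product-cost">$999</span></div>'

price_before = extract_price(page_v1)  # found via the hardcoded class
price_after = extract_price(page_v2)   # class renamed, nothing found
```

The second call returns `None` even though the price is still on the page. The scraper is tied to the class name, not to the data itself.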

## Define your data model and let Marvin *infer* and *deduce* it from the source material.

We'll first define our data model using Pydantic. Say we have a signup form on our website where a user enters their work email on our 'Contact Us' page.
We want to learn more about that company so that we can score leads better or customize our outreach.

In [1]:
import marvin
from pydantic import BaseModel, Field
from enum import Enum


class CompanyType(Enum):
    startup = "startup"
    scaleup = "scaleup"
    enterprise = "enterprise"


@marvin.ai_model
class Company(BaseModel):
    name: str = Field(..., description="Company name")
    description_short: str
    description_long: str
    industries: list[str]
    type: CompanyType
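
The `CompanyType` enum is what limits the `type` field to a fixed set of labels. As a small illustration using only the standard library (it is independent of Marvin and assumes nothing about how Marvin validates model output internally), looking up a member by value succeeds for a known label and raises `ValueError` for anything else:

```python
from enum import Enum


class CompanyType(Enum):
    startup = "startup"
    scaleup = "scaleup"
    enterprise = "enterprise"


# A known label maps to its enum member.
known = CompanyType("enterprise")

# A label outside the enum is rejected rather than passed through.
try:
    CompanyType("unicorn")
    rejected = False
except ValueError:
    rejected = True
```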

Now we get an email from ford@prefect.io, whose company isn't in our database. Our team is trying to customize our messaging for enterprise users, so we'll fire off a request to www.prefect.io and see what comes back.

In [2]:
from bs4 import BeautifulSoup as bs
import requests

# Fetch the Prefect homepage and parse the HTML
soup = bs(requests.get("https://www.prefect.io").content, "html.parser")

# Flatten the page to plain text, with a single space between text nodes
text = soup.get_text(strip=True, separator=" ")

# Pass the text to the model
company = Company(text)

{
  "name": "Prefect",
  "description_short": "The New Standard in Dataflow Automation",
  "description_long": "Prefect is a modern workflow orchestration tool for coordinating all of your data tools. Orchestrate and observe your dataflow using Prefect's open source Python library, the glue of the modern data stack. Scheduling, executing and visualizing your data workflows has never been easier.",
  "industries": ["Dataflow Automation", "Workflow Orchestration", "Data Tools"],
  "type": "startup"
}
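
The request above hardcodes the URL. One way to connect it to the signup email is to derive the site from the address's domain. This is only a sketch under stated assumptions: `company_url_from_email` is a hypothetical helper, it assumes the company site lives at `www.` plus the email domain, and the set of personal email providers is illustrative rather than exhaustive.

```python
from typing import Optional

# Illustrative, non-exhaustive set of personal email providers to skip.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}


def company_url_from_email(email: str) -> Optional[str]:
    """Guess a company homepage from a work email, or None if not possible."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return None  # not a usable email address
    if domain in PERSONAL_DOMAINS:
        return None  # personal address, no company site to scrape
    return f"https://www.{domain}"


url = company_url_from_email("ford@prefect.io")
```

The resulting URL can then be passed to `requests.get` in place of the hardcoded string.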
