# Example Usage of Scraipe Library

Here's a quick example using scraipe to extract mentions of celebrities in news articles.

## Setup
Install and import the packages we need. We will use `NewsScraper` and `OpenAiAnalyzer` from the `scraipe.extras` subpackage, which is installed with the `scraipe[extras]` pip extra.

We will also load [your OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key) from a file called `openai_key.txt`, which should be in the same folder as this notebook.

In [9]:
# Install Scraipe
%pip install scraipe[extras] --quiet
!pip show scraipe | grep Version

Note: you may need to restart the kernel to use updated packages.
Version: 0.1.16


In [2]:

# Import modules
import pandas as pd
from scraipe import Workflow
from scraipe.extras import NewsScraper, OpenAiAnalyzer
from pydantic import BaseModel

# Load OpenAI API key (the with-block closes the file after reading)
with open("openai_key.txt") as key_file:
    OPENAI_API_KEY = key_file.read().strip()
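
If you would rather not keep the key in a file next to the notebook, one option is to check an environment variable first and fall back to the file. The sketch below is only an illustration: the helper name `load_api_key` and the environment variable name `OPENAI_API_KEY` are assumptions, not part of scraipe.

```python
import os
from pathlib import Path


def load_api_key(path="openai_key.txt", env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from a file.

    Hypothetical helper for illustration; scraipe does not provide it.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"No key in ${env_var} and no file at {key_file}")
    key = key_file.read_text().strip()
    if not key:
        raise ValueError(f"{key_file} is empty")
    return key
```

With this in place, `OPENAI_API_KEY = load_api_key()` would behave like the cell above whenever the environment variable is not set.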

## Extract links
First, we need a list of links to target with scraipe. We will extract all article links from the front page of https://apnews.com.

In [3]:
import re
import time

import requests

url = "https://apnews.com/"
response = requests.get(url)
html = response.text

# Use a regex to extract article links
pattern = r'href=["\'](?=[^"\']*/article)([^"\']+)["\']'
news_links = re.findall(pattern, html)

# Remove duplicates
news_links = list(set(news_links))

# Display a summary of the links
news_links_df = pd.DataFrame(news_links, columns=['link'])
print(f"Found {len(news_links_df)} front page AP News links on {time.strftime('%Y-%m-%d')}")
display(news_links_df.head())

Found 142 front page AP News links on 2025-04-02


| | link |
|---|---|
| 0 | https://apnews.com/article/nasa-stuck-astronau... |
| 1 | https://apnews.com/article/minecraft-movie-rev... |
| 2 | https://apnews.com/article/south-carolina-acco... |
| 3 | https://apnews.com/article/black-hair-formalde... |
| 4 | https://apnews.com/article/nuclear-dow-xenergy... |
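
The regex above keeps each `href` exactly as it appears in the page, and `set` does not preserve order. If a page mixes relative and absolute links, a variant like the following sketch may help: it resolves every match against the base URL, drops `#fragment` suffixes, and de-duplicates in first-seen order. The function name `extract_article_links` and the sample HTML are made up for illustration.

```python
import re
from urllib.parse import urldefrag, urljoin

# Same pattern as in the cell above
ARTICLE_HREF = re.compile(r'href=["\'](?=[^"\']*/article)([^"\']+)["\']')


def extract_article_links(html, base_url):
    """Return absolute, de-duplicated article links in first-seen order."""
    links = {}  # dict keys keep insertion order
    for href in ARTICLE_HREF.findall(html):
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.setdefault(absolute, None)
    return list(links)


# Hypothetical HTML snippet, not real AP News markup
sample_html = '''
<a href="/article/example-one">One</a>
<a href="https://apnews.com/article/example-one#comments">One again</a>
<a href='https://apnews.com/article/example-two'>Two</a>
<a href="/hub/politics">Not an article</a>
'''
print(extract_article_links(sample_html, "https://apnews.com/"))
# → ['https://apnews.com/article/example-one', 'https://apnews.com/article/example-two']
```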


## Configure Workflow
Now we'll configure the scraipe workflow using `NewsScraper` and `OpenAiAnalyzer`.

`NewsScraper` uses the `trafilatura` library to extract article content from a news site without all the HTML clutter.

`OpenAiAnalyzer` uses OpenAI models to extract data from the article content.

In [None]:
#===Configure NewsScraper===
# NewsScraper doesn't require any additional configuration
scraper = NewsScraper()

#===Configure OpenAiAnalyzer===
# Define the instruction for the LLM. Ensure the instruction specifies a return schema.
instruction = '''
Extract a list of celebrities mentioned in the article text.
Return a JSON dictionary with the following schema:
{"celebrities":["celebrity1", "celebrity2", ...]}
'''

# (Optional) Create a pydantic schema to validate the LLM output
from typing import List
class ExpectedOutput(BaseModel):
    celebrities: List[str]
    
# Create the analyzer with the API key, instruction, and schema
analyzer = OpenAiAnalyzer(OPENAI_API_KEY, instruction, pydantic_schema=ExpectedOutput)

#===Create Workflow===
# Create a workflow with the configured scraper and analyzer
workflow = Workflow(scraper, analyzer)
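
To make the purpose of the schema concrete, here is a hand-written, standard-library sketch of the kind of check that `ExpectedOutput` expresses: the model's reply must be a JSON object whose `celebrities` value is a list of strings. This is only a stand-in for illustration. It is not how `OpenAiAnalyzer` or pydantic perform validation internally, and `parse_celebrities` is a hypothetical name.

```python
import json


def parse_celebrities(raw_output):
    """Parse a JSON string and check it looks like {"celebrities": [str, ...]}."""
    data = json.loads(raw_output)
    if not isinstance(data, dict) or "celebrities" not in data:
        raise ValueError("expected a JSON object with a 'celebrities' key")
    names = data["celebrities"]
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("'celebrities' must be a list of strings")
    return names


# A well-formed reply passes; a malformed one is rejected
print(parse_celebrities('{"celebrities": ["Bill Gates"]}'))
try:
    parse_celebrities('{"celebrities": "Bill Gates"}')
except ValueError as error:
    print(f"rejected: {error}")
```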

## Scrape content from news links
Next, we will scrape content from the news links. The content is saved in the workflow's scrape store.

In [5]:
# Scrape the news links
workflow.scrape(news_links)
# Display the scraped content
scrape_store_df = workflow.get_scrapes()
display(scrape_store_df.head())

Scraping 142/142 new or retry links...


Scraping URLs: 100%|██████████| 142/142 [00:55<00:00,  2.57it/s]

Successfully scraped 142/142 links.





| | link | content | scrape_success | scrape_error |
|---|---|---|---|---|
| 0 | https://apnews.com/article/nasa-stuck-astronau... | NASA’s newly returned astronauts say they woul... | True | |
| 1 | https://apnews.com/article/minecraft-movie-rev... | Movie Review: Jason Momoa shines in ‘A Minecra... | True | |
| 2 | https://apnews.com/article/south-carolina-acco... | A $1.8 billion mistake could cost the South Ca... | True | |
| 3 | https://apnews.com/article/black-hair-formalde... | Black women’s hair products are in the safety ... | True | |
| 4 | https://apnews.com/article/nuclear-dow-xenergy... | Dow wants to power its Texas manufacturing com... | True | |


## Analyze content with OpenAI
Next, we will run the analyzer over the stored scrapes.

In [6]:
# Analyze the scraped content
workflow.analyze()

# Display the analyses
analysis_store_df = workflow.get_analyses()
display(analysis_store_df.head())

Analyzing 142/142 new or retry links with content...


Analyzing content: 100%|██████████| 142/142 [02:25<00:00,  1.03s/it]

Successfully analyzed 142/142 links.





| | link | output | analysis_success | analysis_error |
|---|---|---|---|---|
| 0 | https://apnews.com/article/nasa-stuck-astronau... | {'celebrities': ['Butch Wilmore', 'Suni Willia... | True | |
| 1 | https://apnews.com/article/minecraft-movie-rev... | {'celebrities': ['Jason Momoa', 'Jennifer Cool... | True | |
| 2 | https://apnews.com/article/south-carolina-acco... | {'celebrities': []} | True | |
| 3 | https://apnews.com/article/black-hair-formalde... | {'celebrities': ['Javon Ford', 'James Rogers',... | True | |
| 4 | https://apnews.com/article/nuclear-dow-xenergy... | {'celebrities': ['Bill Gates']} | True | |


## Compile the results
Finally, let's export the completed analysis and save it to `celebrities.csv`.

In [10]:
export_df = workflow.export()
display(export_df)
export_df.to_csv('celebrities.csv', index=False)

| | link | scrape_success | analysis_success | celebrities |
|---|---|---|---|---|
| 0 | https://apnews.com/article/nasa-stuck-astronau... | True | True | [Butch Wilmore, Suni Williams, Elon Musk, Dona... |
| 1 | https://apnews.com/article/minecraft-movie-rev... | True | True | [Jason Momoa, Jennifer Coolidge, Jack Black, E... |
| 2 | https://apnews.com/article/south-carolina-acco... | True | True | [] |
| 3 | https://apnews.com/article/black-hair-formalde... | True | True | [Javon Ford, James Rogers, Jasmine McDonald, G... |
| 4 | https://apnews.com/article/nuclear-dow-xenergy... | True | True | [Bill Gates] |
| ... | ... | ... | ... | ... |
| 137 | https://apnews.com/article/marc-fogel-pittsbur... | True | True | [Marc Fogel, Alexander Vinnik] |
| 138 | https://apnews.com/article/michigan-marijuana-... | True | True | [] |
| 139 | https://apnews.com/article/boston-ice-district... | True | True | [] |
| 140 | https://apnews.com/article/wnba-womens-basketb... | True | True | [Kitty Henderson, Natalie White, Deion Sanders... |
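
One detail to keep in mind when reloading the exported file: a list stored in a CSV cell is written as its string representation, so it comes back as text rather than as a list. The sketch below shows the round trip with the standard-library `csv` module standing in for pandas, and with made-up rows. `ast.literal_eval` turns the text back into a Python list.

```python
import csv
import io
from ast import literal_eval

# Made-up rows for illustration only
rows = [
    {"link": "https://example.com/a", "celebrities": ["Ada Lovelace", "Alan Turing"]},
    {"link": "https://example.com/b", "celebrities": []},
]

# Write to an in-memory CSV; each list is stored via str(list)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["link", "celebrities"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["celebrities"]
parsed_value = literal_eval(raw_value)

print(type(raw_value).__name__, type(parsed_value).__name__)
# → str list
```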


## Analyze the results
Now you can conduct your own analysis on the structured data collected by the scraipe workflow.

In [11]:
# Load the extracted data
from ast import literal_eval
celebrities_df = pd.read_csv('celebrities.csv')
# The list column was saved to CSV as text, so parse each value back into a list
celebrities_df['celebrities'] = celebrities_df['celebrities'].apply(literal_eval)

# Explode the nested list of celebrities
celebrities_df = celebrities_df.explode('celebrities')
celebrities_df['celebrities'] = celebrities_df['celebrities'].str.strip()

# Display the top 10 most mentioned celebrities
celebrities_df = celebrities_df['celebrities'].value_counts().reset_index()
celebrities_df.columns = ['celebrity', 'mentions']
celebrities_df = celebrities_df.sort_values('mentions', ascending=False)
celebrities_df.head(10)

| | celebrity | mentions |
|---|---|---|
| 0 | Donald Trump | 41 |
| 1 | Elon Musk | 13 |
| 2 | Joe Biden | 7 |
| 3 | Cory Booker | 4 |
| 4 | Benjamin Netanyahu | 4 |
| 5 | Robert F. Kennedy Jr. | 4 |
| 6 | Giorgia Meloni | 3 |
| 7 | Claudia Sheinbaum | 3 |
| 8 | Taylor Swift | 3 |
| 9 | Kristi Noem | 3 |
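
The explode-and-count step above can also be sketched without pandas, which may make the logic easier to follow: flatten the per-article lists, strip stray whitespace, and tally each name. The names below are placeholder data, not results from the workflow.

```python
from collections import Counter

# Placeholder per-article name lists (one inner list per article)
article_mentions = [
    ["Ada Lovelace", "Alan Turing "],
    [],
    ["Alan Turing", "Grace Hopper"],
    ["Alan Turing", "Ada Lovelace"],
]

# Flatten, strip whitespace, and count each name
counts = Counter(name.strip() for names in article_mentions for name in names)
top = counts.most_common(2)
print(top)
# → [('Alan Turing', 3), ('Ada Lovelace', 2)]
```

Articles with an empty list simply contribute nothing to the tally, which matches how the pandas version drops the missing values produced by `explode`.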
