# Celebrity Mentions

In this example, we will use scraipe to extract mentions of celebrities in news articles.

## Setup
Install and import the packages we need. We will use `NewsScraper` and `OpenAiAnalyzer` from the `scraipe.extended` module, which belongs to the `scraipe[extended]` install extra.

We will also load [your OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key) from a file called `openai_key.txt`, which should be in the same folder as this notebook.

In [2]:
# Install scraipe from PyPI:
%pip install -qU "scraipe[extended]"
# Alternatively, install scraipe package from repo
#%pip install -qe ..

# Check package version
!pip show scraipe | grep Version

# Install notebook widgets for cleaner output
%pip install --quiet ipywidgets

Note: you may need to restart the kernel to use updated packages.
Version: 0.1.67
Note: you may need to restart the kernel to use updated packages.


In [3]:

# Import modules
import pandas as pd
from scraipe import Workflow
from scraipe.extended import NewsScraper, OpenAiAnalyzer
from pydantic import BaseModel

# Load OpenAI API key
# Use a context manager so the file handle is closed after reading
with open("openai_key.txt") as key_file:
    OPENAI_API_KEY = key_file.read().strip()
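
If you would rather not keep the key in a file, a small loader can check an environment variable first and fall back to the file. The sketch below is only an illustration: `load_api_key` is a hypothetical helper that is not part of scraipe, and the variable name `OPENAI_API_KEY` is an assumed convention.

```python
import os
from pathlib import Path


def load_api_key(path="openai_key.txt", env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from a text file.

    Hypothetical helper for this notebook; not part of scraipe.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"Set {env_var} or create {path} next to this notebook."
        )
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty.")
    return key
```

With this helper, the assignment in the cell above could become `OPENAI_API_KEY = load_api_key()`.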

## Extract links
First, we need a list of links to target with scraipe. We will extract all links from the front page of https://apnews.com.

In [None]:
import re
import time

import requests

url = "https://apnews.com/"
# Fail fast on network problems or HTTP errors instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.text

# Use a regex to extract article links
pattern = r'href=["\'](?=[^"\']*/article)([^"\']+)["\']'
news_links = re.findall(pattern, html)

# Remove duplicates
news_links = list(set(news_links))

# Display a summary of the links
news_links_df = pd.DataFrame(news_links, columns=['link'])
print(f"Found {len(news_links_df)} front page AP News links on {time.strftime('%Y-%m-%d')}")
display(news_links_df)

Found 135 front page AP News links on 2025-05-11


| | link |
|---|---|
| 0 | https://apnews.com/article/pakistan-india-ipl-... |
| 1 | https://apnews.com/article/fact-check-military... |
| 2 | https://apnews.com/article/gaza-aid-israel-dis... |
| 3 | https://apnews.com/article/ethics-bowl-student... |
| 4 | https://apnews.com/article/gold-prices-high-ta... |
| ... | ... |
| 130 | https://apnews.com/article/panthers-maple-leaf... |
| 131 | https://apnews.com/article/employee-resource-g... |
| 132 | https://apnews.com/article/lindor-mets-my-girl... |
| 133 | https://apnews.com/article/how-to-drop-an-egg-... |
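
To see what the link pattern matches, it can help to run it on a small, made-up HTML snippet. The sketch below wraps the same regex in a hypothetical `extract_article_links` helper and also resolves relative links with `urljoin`. That step is an addition for illustration, on the assumption that a page could contain relative `href` values.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the cell above: capture href values containing "/article".
ARTICLE_HREF = re.compile(r'href=["\'](?=[^"\']*/article)([^"\']+)["\']')


def extract_article_links(html, base_url="https://apnews.com/"):
    """Return sorted, de-duplicated absolute article URLs found in html."""
    hrefs = ARTICLE_HREF.findall(html)
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return sorted({urljoin(base_url, href) for href in hrefs})


# Made-up HTML snippet for illustration only.
sample_html = """
<a href="https://apnews.com/article/example-one">One</a>
<a href='/article/example-two'>Two</a>
<a href="https://apnews.com/article/example-one">One again</a>
<a href="https://apnews.com/hub/sports">Sports hub</a>
"""
print(extract_article_links(sample_html))
# ['https://apnews.com/article/example-one', 'https://apnews.com/article/example-two']
```

Sorting the de-duplicated set gives a stable order between runs, which a plain `list(set(...))` does not guarantee.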


## Configure Workflow
Now we'll configure the scraipe workflow using `NewsScraper` and `OpenAiAnalyzer`.

`NewsScraper` uses `trafilatura` to extract article content from a news site without all the HTML clutter.

`OpenAiAnalyzer` uses OpenAI models to extract data from the article content.

In [5]:
#===Configure NewsScraper===
# NewsScraper doesn't require any additional configuration
scraper = NewsScraper()

#===Configure OpenAiAnalyzer===
# Define the instruction for the LLM. Ensure the instruction specifies a return schema.
instruction = '''
Extract a list of celebrities mentioned in the article text.
Return a JSON dictionary with the following schema:
{"celebrities":["celebrity1", "celebrity2", ...]}
'''

# (Optional) Create a pydantic schema to validate the LLM output
from typing import List
class ExpectedOutput(BaseModel):
    celebrities: List[str]
    
# Create the analyzer with the API key, instruction, and schema
analyzer = OpenAiAnalyzer(OPENAI_API_KEY, instruction, pydantic_schema=ExpectedOutput)

#===Create Workflow===
# Create a workflow with the configured scraper and analyzer
workflow = Workflow(scraper, analyzer)
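
To illustrate what validating against the schema guards against, here is a standard-library sketch of the same kind of check. It is not how `OpenAiAnalyzer` or pydantic implement validation; `parse_celebrities` is a hypothetical stand-in that accepts only a JSON object of the form `{"celebrities": [str, ...]}`.

```python
import json


def parse_celebrities(raw_response):
    """Parse a JSON string and check it matches {"celebrities": [str, ...]}.

    Hypothetical stand-in for schema validation; not scraipe or pydantic code.
    """
    data = json.loads(raw_response)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict) or "celebrities" not in data:
        raise ValueError("expected a JSON object with a 'celebrities' key")
    names = data["celebrities"]
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("'celebrities' must be a list of strings")
    return names


# A well-formed response passes through unchanged.
print(parse_celebrities('{"celebrities": ["Spike Lee", "Donald Trump"]}'))

# A response with the wrong shape is rejected instead of reaching the results.
try:
    parse_celebrities('{"celebrities": "Spike Lee"}')
except ValueError as err:
    print(f"rejected: {err}")
```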

## Scrape links
Next we will scrape content from news links. This content will be saved within the workflow's scrape store.

In [6]:
# Scrape the news links
workflow.scrape(news_links)
# Display the scraped content
display(workflow.get_scrapes())

Scraping:   0%|          | 0/135 [00:00<?, ?link/s]

| | link | content | scrape_success | scrape_error | metadata |
|---|---|---|---|---|---|
| 0 | https://apnews.com/article/spain-fire-chemical... | Fire at chemical plant in northeastern Spain k... | True | | |
| 1 | https://apnews.com/article/zealand-snail-egg-n... | A rare New Zealand snail is filmed for the fir... | True | | |
| 2 | https://apnews.com/article/indigenous-colombia... | Colombia takes regional lead in Indigenous sel... | True | | |
| 3 | https://apnews.com/article/nhl-playoffs-jets-s... | Mikko Rantanen has a goal and 2 assists for St... | True | | |
| 4 | https://apnews.com/article/cannes-film-festiva... | Cannes, the global Colosseum of film, readies ... | True | | |
| ... | ... | ... | ... | ... | ... |
| 130 | https://apnews.com/article/candy-crush-ai-arti... | | False | Failed to get page: Failed to scrape https://a... | |
| 131 | https://apnews.com/article/autism-kennedy-rfk-... | | False | Failed to get page: Failed to scrape https://a... | |
| 132 | https://apnews.com/article/gold-prices-high-ta... | | False | Failed to get page: Failed to scrape https://a... | |
| 133 | https://apnews.com/article/venice-arts-biennal... | | False | Failed to get page: Failed to scrape https://a... | |
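
Some scrapes fail, as the `scrape_success` and `scrape_error` columns show, so it can be useful to separate the failures before moving on. The sketch below works on plain records with the same column names as the table above. It assumes `workflow.get_scrapes()` returns a pandas DataFrame, in which case `workflow.get_scrapes().to_dict("records")` would produce such records. `split_scrapes` is a hypothetical helper, and the sample records are made up.

```python
def split_scrapes(records):
    """Split scrape records into (succeeded_links, failed_link_to_error)."""
    succeeded = [r["link"] for r in records if r["scrape_success"]]
    failed = {
        r["link"]: r["scrape_error"] for r in records if not r["scrape_success"]
    }
    return succeeded, failed


# Made-up records using the column names from the table above.
records = [
    {"link": "https://example.com/article/a", "scrape_success": True,
     "scrape_error": None},
    {"link": "https://example.com/article/b", "scrape_success": False,
     "scrape_error": "Failed to get page"},
]

ok, failed = split_scrapes(records)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for link, error in failed.items():
    print(link, "->", error)
```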


## Analyze content
Next, we will extract the celebrities mentioned in each article using OpenAI.

In [7]:
# Analyze the scraped content
workflow.analyze()

# Display the analyses
display(workflow.get_analyses())

Analyzing:   0%|          | 0/100 [00:00<?, ?link/s]

| | link | output | analysis_success | analysis_error |
|---|---|---|---|---|
| 0 | https://apnews.com/article/spain-fire-chemical... | {'celebrities': []} | True | |
| 1 | https://apnews.com/article/zealand-snail-egg-n... | {'celebrities': []} | True | |
| 2 | https://apnews.com/article/indigenous-colombia... | {'celebrities': []} | True | |
| 3 | https://apnews.com/article/nhl-playoffs-jets-s... | {'celebrities': ['Mikko Rantanen', 'Alexander ... | True | |
| 4 | https://apnews.com/article/cannes-film-festiva... | {'celebrities': ['Donald Trump', 'Kleber Mendo... | True | |
| ... | ... | ... | ... | ... |
| 95 | https://apnews.com/article/eeuu-china-comercio... | {'celebrities': ['Donald Trump', 'Scott Bessen... | True | |
| 96 | https://apnews.com/article/how-to-drop-an-egg-... | {'celebrities': []} | True | |
| 97 | https://apnews.com/article/israel-palestinians... | {'celebrities': ['Donald Trump', 'Benjamin Net... | True | |
| 98 | https://apnews.com/article/poland-ukrainians-p... | {'celebrities': ['Donald Trump', 'Andrzej Duda... | True | |


## Compile the results
Finally, let's export the completed analysis. 

In [8]:
export_df = workflow.export()
display(export_df)

| | link | celebrities |
|---|---|---|
| 0 | https://apnews.com/article/spain-fire-chemical... | [] |
| 1 | https://apnews.com/article/zealand-snail-egg-n... | [] |
| 2 | https://apnews.com/article/indigenous-colombia... | [] |
| 3 | https://apnews.com/article/nhl-playoffs-jets-s... | [Mikko Rantanen, Alexander Petrovic, Connor He... |
| 4 | https://apnews.com/article/cannes-film-festiva... | [Donald Trump, Kleber Mendonça Filho, Spike Le... |
| ... | ... | ... |
| 130 | https://apnews.com/article/candy-crush-ai-arti... | |
| 131 | https://apnews.com/article/autism-kennedy-rfk-... | |
| 132 | https://apnews.com/article/gold-prices-high-ta... | |
| 133 | https://apnews.com/article/venice-arts-biennal... | |


## Analyze the results
Now you can conduct your own analysis on the structured data collected by the scraipe workflow.

In [9]:
# Explode the nested lists into one row per mention and trim whitespace.
# Use new variables so export_df stays intact if this cell is re-run.
mentions = export_df.explode('celebrities')['celebrities'].str.strip()

# Display the top 10 most mentioned celebrities
top_df = mentions.value_counts().reset_index()
top_df.columns = ['celebrity', 'mentions']
top_df = top_df.sort_values('mentions', ascending=False)
top_df.head(10)

| | celebrity | mentions |
|---|---|---|
| 0 | Donald Trump | 26 |
| 1 | Joe Biden | 5 |
| 2 | Robert F. Kennedy Jr. | 5 |
| 3 | Pope Leo XIV | 4 |
| 4 | Pope Francis | 4 |
| 5 | Robert Prevost | 3 |
| 15 | Keir Starmer | 2 |
| 17 | Sheikh Hasina | 2 |
| 16 | Donald Tusk | 2 |
| 6 | Stephen Miller | 2 |
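
For readers less familiar with `explode` and `value_counts`, the same tally can be sketched with the standard library. `count_mentions` is a hypothetical helper and the sample lists are made up; entries that are not lists (such as the missing values left by failed scrapes) are skipped.

```python
from collections import Counter


def count_mentions(celebrity_lists):
    """Count occurrences of each name across per-article lists of names."""
    counts = Counter()
    for names in celebrity_lists:
        # Skip missing values; only real lists of names are counted.
        if not isinstance(names, (list, tuple)):
            continue
        counts.update(name.strip() for name in names)
    return counts


# Made-up per-article lists for illustration only.
sample = [["Donald Trump", "Spike Lee"], [], None, ["Donald Trump "]]
print(count_mentions(sample).most_common(2))
# [('Donald Trump', 2), ('Spike Lee', 1)]
```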
