# Reddit Scrape & LLM Analysis
In this example, we use `RedditLinkCollector` and `RedditSubmissionScraper` to gather the latest posts from [r/coloradohikers](https://reddit.com/r/coloradohikers), [r/colorado](https://reddit.com/r/colorado), [r/denver](https://reddit.com/r/denver), and [r/boulder](https://reddit.com/r/boulder), then use `OpenAiAnalyzer` to extract trail conditions from them. Special thanks to u/ColoRadBros69 for [the use case](https://www.reddit.com/r/opensource/comments/1kjoknx/comment/mrp5bo3/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)!

## Setup
Install and import the packages we need. We will use `RedditLinkCollector`, `RedditSubmissionScraper`, and `OpenAiAnalyzer`, which are imported from `scraipe.extended` (part of `scraipe[extended]`).

We will also load [your OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key) and [Reddit client credentials](https://www.reddit.com/prefs/apps) from an environment file. Copy the code below into a file named `reddit_hiking.env` in the same folder as this notebook, then fill in your secrets.

``` bash
export OPENAI_API_KEY=your_secret_here
export REDDIT_CLIENT_ID=your_secret_here
export REDDIT_CLIENT_SECRET=your_secret_here
```

In [1]:
# Install scraipe from PyPI:
#%pip install --upgrade --quiet scraipe
# Alternatively, install scraipe package from repo
%pip install -qe ..


# Install utility packages
%pip install --quiet ipywidgets
# The `dotenv` module imported below is provided by the python-dotenv package
%pip install --quiet python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:

# Import modules
import pandas as pd
from scraipe import Workflow
from scraipe.extended import RedditLinkCollector, RedditSubmissionScraper, OpenAiAnalyzer
from pydantic import BaseModel

# Load OpenAI and Reddit credentials
import dotenv, os
dotenv.load_dotenv("reddit_hiking.env")
keys = ["OPENAI_API_KEY", "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"]
OPENAI_API_KEY, REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET = [os.getenv(key) for key in keys]
assert all([OPENAI_API_KEY, REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET]), "Please configure environment secrets in reddit_hiking.env"
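The single `assert` above only reports that something is missing, not which secret. As a minimal sketch, a small helper can name the unset variables. The helper `find_missing_keys` and the stand-in `example_env` dictionary are hypothetical and are not part of scraipe or python-dotenv.

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"]

def find_missing_keys(keys, env=None):
    """Return the names in `keys` that are unset or empty in `env`."""
    # Default to the real process environment when no mapping is passed in
    env = os.environ if env is None else env
    return [key for key in keys if not env.get(key)]

# Stand-in environment (hypothetical values) so the example needs no real secrets
example_env = {"OPENAI_API_KEY": "sk-example", "REDDIT_CLIENT_ID": ""}
missing = find_missing_keys(REQUIRED_KEYS, example_env)
print("Missing:", missing)
# prints: Missing: ['REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET']
```

In the notebook, calling `find_missing_keys(keys)` after `dotenv.load_dotenv(...)` would check the real environment instead of the stand-in dictionary.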

## Configure Workflow
Now we'll configure the scraipe workflow using `RedditLinkCollector`, `RedditSubmissionScraper`, and `OpenAiAnalyzer`.

- `RedditLinkCollector` collects submission links from the targeted subreddits using asyncpraw.
- `RedditSubmissionScraper` scrapes each post's selftext, comments, and other data using asyncpraw.
- `OpenAiAnalyzer` uses OpenAI models to extract structured data from the scraped post content.

In [None]:
#===Configure RedditLinkCollector===
# Collects the newest posts from the last week in r/colorado, r/coloradohikers, r/denver, and r/boulder
collector = RedditLinkCollector(
    client_id = REDDIT_CLIENT_ID,
    client_secret = REDDIT_CLIENT_SECRET, 
    subreddits=["colorado","coloradohikers","denver","boulder"],
    sorts="new",
    max_age=60*60*24*7,  # 1 week
    limit=100 # Limit to 100 posts per subreddit
    )

#===Configure RedditSubmissionScraper===
scraper = RedditSubmissionScraper(
    client_id=REDDIT_CLIENT_ID,
    client_secret=REDDIT_CLIENT_SECRET)

#===Configure OpenAiAnalyzer===
# Define the instruction for the LLM. Ensure the instruction specifies a return schema.
instruction = '''
Extract the trail name and conditions from the attached Reddit post. Infer the trail name from the post title and comments.
If the trail name cannot be inferred, leave both fields empty.
The conditions can be but are not limited to: "dry", "wet", "snowy", "icy", "muddy", "clear", "cloudy", "rainy", "sunny", "windy".
Return a JSON dictionary with the following schema:
{
    "trail_name": "name of trail",
    "conditions": ["cond1", "cond2", ...]
}
'''

# (Optional) Create a pydantic schema to validate the LLM output
from typing import List
class ExpectedOutput(BaseModel):
    trail_name: str
    conditions: List[str]
    
# Create the analyzer with the API key, instruction, and schema
analyzer = OpenAiAnalyzer(OPENAI_API_KEY, instruction, pydantic_schema=ExpectedOutput)

#===Create Workflow===
# Create a workflow with the configured link collector, scraper, and analyzer
workflow = Workflow(scraper, analyzer, link_collector=collector)
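To illustrate what the optional schema guards against, here is a rough standard-library sketch of the shape check that `ExpectedOutput` expresses: `trail_name` must be a string and `conditions` must be a list of strings. The function `check_trail_output` is hypothetical. It is not how pydantic or scraipe validate internally, and pydantic's own coercion rules may differ in edge cases.

```python
import json

def check_trail_output(raw):
    """Parse `raw` JSON; return the dict if it matches the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # The model returned something that is not valid JSON
    if not isinstance(data, dict):
        return None
    name = data.get("trail_name")
    conditions = data.get("conditions")
    if not isinstance(name, str):
        return None
    if not isinstance(conditions, list) or not all(isinstance(c, str) for c in conditions):
        return None
    # Keep only the two expected fields; any extra keys are dropped
    return {"trail_name": name, "conditions": conditions}

good = check_trail_output('{"trail_name": "Bear Peak", "conditions": ["icy", "snowy"]}')
bad = check_trail_output('{"trail_name": "Bear Peak", "conditions": "icy"}')
print(good)  # prints: {'trail_name': 'Bear Peak', 'conditions': ['icy', 'snowy']}
print(bad)   # prints: None
```

The empty result requested by the instruction, `{"trail_name": "", "conditions": []}`, also passes this check, which is why the results are filtered for non-empty values later on.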

## Run the workflow

In [4]:
# Collect links from subreddits
workflow.collect_links()
display(workflow.get_links())


| | link |
|---|---|
| 0 | https://reddit.com/r/Colorado/comments/1kkenf2... |
| 1 | https://reddit.com/r/Colorado/comments/1kkbfbt... |
| 2 | https://reddit.com/r/Colorado/comments/1kk5li8... |
| 3 | https://reddit.com/r/Colorado/comments/1kk02dc... |
| 4 | https://reddit.com/r/Colorado/comments/1kjgk49... |
| ... | ... |
| 245 | https://reddit.com/r/coloradohikers/comments/1... |
| 246 | https://reddit.com/r/coloradohikers/comments/1... |
| 247 | https://reddit.com/r/coloradohikers/comments/1... |
| 248 | https://reddit.com/r/coloradohikers/comments/1... |


In [5]:
# Scrape posts from the collected links
workflow.scrape()
workflow.get_scrapes()


| | link | content | scrape_success | scrape_error | metadata |
|---|---|---|---|---|---|
| 0 | https://reddit.com/r/Colorado/comments/1kkenf2... | `\n\n===Comments===\n- u/Lgallegos17:\n Than...` | True | | `{'title': 'Happy Mother's Day Weekend from the...` |
| 1 | https://reddit.com/r/Colorado/comments/1kkbfbt... | `\n\n===Comments===\n- u/Sourkraute:\n 975 s...` | True | | `{'title': 'Abandoned Cabin, 35409 US-24 near L...` |
| 2 | https://reddit.com/r/Colorado/comments/1kk5li8... | `\n\n===Comments===\n- u/Snlxdd:\n Gorgeous,...` | True | | `{'title': 'Sneffels in watercolor', 'author': ...` |
| 3 | https://reddit.com/r/Colorado/comments/1kk02dc... | `\n\n===Comments===\n- u/skovalen:\n I love ...` | True | | `{'title': 'Taken last year around fall', 'auth...` |
| 4 | https://reddit.com/r/Colorado/comments/1kjgk49... | `\n\n===Comments===\n- u/eric_b0x:\n I love ...` | True | | `{'title': 'Chautauqua Park 5/10/25', 'author':...` |
| ... | ... | ... | ... | ... | ... |
| 245 | https://reddit.com/r/coloradohikers/comments/1... | `Hi! My boyfriend is coming in town this weeken...` | True | | `{'title': 'Allenspark Area', 'author': 'aylexa...` |
| 246 | https://reddit.com/r/coloradohikers/comments/1... | `\n\n===Comments===\n- u/delusionalxx:\n Wha...` | True | | `{'title': 'A few pics from hikes this spring',...` |
| 247 | https://reddit.com/r/coloradohikers/comments/1... | `In CO for the weekend, decided to start at Ber...` | True | | `{'title': 'Mt Flora + Colorado Mines Peak', 'a...` |
| 248 | https://reddit.com/r/coloradohikers/comments/1... | `\n\n===Comments===\n- u/None:\n [deleted]\n...` | True | | `{'title': 'Call your Senators and Reps now. Th...` |


In [6]:
# Analyze the scraped posts
workflow.analyze()
# Display the analysis results
workflow.get_analyses()



| | link | output | analysis_success | analysis_error |
|---|---|---|---|---|
| 0 | https://reddit.com/r/Colorado/comments/1kkenf2... | `{'trail_name': '', 'conditions': []}` | True | |
| 1 | https://reddit.com/r/Colorado/comments/1kkbfbt... | `{'trail_name': '', 'conditions': []}` | True | |
| 2 | https://reddit.com/r/Colorado/comments/1kk5li8... | `{'trail_name': '', 'conditions': []}` | True | |
| 3 | https://reddit.com/r/Colorado/comments/1kk02dc... | `{'trail_name': '', 'conditions': []}` | True | |
| 4 | https://reddit.com/r/Colorado/comments/1kjgk49... | `{'trail_name': '', 'conditions': []}` | True | |
| ... | ... | ... | ... | ... |
| 245 | https://reddit.com/r/coloradohikers/comments/1... | `{'trail_name': 'Allenspark', 'conditions': ['s...` | True | |
| 246 | https://reddit.com/r/coloradohikers/comments/1... | `{'trail_name': 'Indian Paint Brush', 'conditio...` | True | |
| 247 | https://reddit.com/r/coloradohikers/comments/1... | `{'trail_name': 'Colorado Mines Peak', 'conditi...` | True | |
| 248 | https://reddit.com/r/coloradohikers/comments/1... | `{'trail_name': '', 'conditions': []}` | True | |


## View results
Finally, let's view the trail conditions extracted from the posts.

In [8]:
# Use `workflow.export()` to flatten the dictionary outputs for convenience.
results = workflow.export()
# Filter for non-empty trail names and conditions
results[(results["trail_name"] != "") & (results["conditions"].apply(len) > 0)]

| | link | trail_name | conditions |
|---|---|---|---|
| 42 | https://reddit.com/r/boulder/comments/1kjzfu3/... | Highway 93 | [clear] |
| 104 | https://reddit.com/r/boulder/comments/1khvjq9/... | Bear Peak | [icy, snowy, wet, muddy] |
| 193 | https://reddit.com/r/Denver/comments/1kim07n/8... | 88 Drive-In | [light and breezy, distant thunderstorms] |
| 228 | https://reddit.com/r/coloradohikers/comments/1... | Blue Lakes | [snowy, wet] |
| 231 | https://reddit.com/r/coloradohikers/comments/1... | Long's Peak | [snowy, mushy] |
| 234 | https://reddit.com/r/coloradohikers/comments/1... | Eldorado Canyon | [sunny] |
| 235 | https://reddit.com/r/coloradohikers/comments/1... | Upper Cheeseman | [perfect, clear, not too busy] |
| 236 | https://reddit.com/r/coloradohikers/comments/1... | Sangre de Cristo Wilderness | [muddy, snowy, wet, flowing water] |
| 237 | https://reddit.com/r/coloradohikers/comments/1... | Staunton State Park | [snowy] |
| 239 | https://reddit.com/r/coloradohikers/comments/1... | High Dune | [breathtaking, dry] |
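As a possible next step, the extracted conditions can be tallied to see what hikers report most often. The sketch below uses only the standard library and hard-codes the ten rows from the filtered results above, so it runs without the workflow. The variable names are illustrative.

```python
from collections import Counter

# (trail_name, conditions) pairs copied from the filtered results above
rows = [
    ("Highway 93", ["clear"]),
    ("Bear Peak", ["icy", "snowy", "wet", "muddy"]),
    ("88 Drive-In", ["light and breezy", "distant thunderstorms"]),
    ("Blue Lakes", ["snowy", "wet"]),
    ("Long's Peak", ["snowy", "mushy"]),
    ("Eldorado Canyon", ["sunny"]),
    ("Upper Cheeseman", ["perfect", "clear", "not too busy"]),
    ("Sangre de Cristo Wilderness", ["muddy", "snowy", "wet", "flowing water"]),
    ("Staunton State Park", ["snowy"]),
    ("High Dune", ["breathtaking", "dry"]),
]

# Count how many posts mention each condition
condition_counts = Counter(c for _, conds in rows for c in conds)
print(condition_counts.most_common(2))
# prints: [('snowy', 5), ('wet', 3)]
```

The tally also shows how free-form the model's labels are ("not too busy", "breathtaking"), and entries such as Highway 93 and 88 Drive-In are not trails. Tightening the instruction or restricting conditions to a fixed list could reduce that noise.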
