# X_Scraper end-to-end walkthrough

This notebook is self-contained, so it runs cleanly in Google Colab without needing a local Python package.
It installs the required dependencies, defines the scraper/cleaning/sentiment helpers, and walks through configuration,
scraping, cleaning, sentiment analysis, and reporting.

Use the configuration cell to change the source URL or thresholds, and refer to the interpretation notes at the end
for guidance on reading the outputs.


In [None]:
%pip install -q --disable-pip-version-check pandas requests vaderSentiment matplotlib


## Helpers (defined inline)

All logic lives in this notebook so you do not need to import `x_scraper`. Feel free to tweak the helper functions
here if you want to customize the cleaning rules or sentiment thresholds.


In [None]:
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple

import pandas as pd
import requests
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


@dataclass
class ScraperConfig:
    """Configuration for fetching, cleaning, and reporting scraped content."""

    source_url: str = "https://jsonplaceholder.typicode.com/posts"
    limit: int = 20
    min_chars: int = 40
    output_dir: Path = Path("artifacts")
    detail_filename: str = "scraped_posts.csv"
    report_filename: str = "sentiment_report.csv"

    def resolved_output_dir(self) -> Path:
        """Return a fully expanded output directory path."""

        return Path(self.output_dir).expanduser().resolve()


class XScraper:
    """Basic scraper and sentiment pipeline."""

    def __init__(self, config: ScraperConfig):
        self.config = config
        self._analyzer = SentimentIntensityAnalyzer()
        self._session = requests.Session()

    def scrape(self) -> pd.DataFrame:
        response = self._session.get(self.config.source_url, timeout=15)
        response.raise_for_status()
        payload = response.json()

        # Values are mixed: `id` is typically an integer, the text fields are strings.
        normalized: List[Dict[str, object]] = []
        for raw_post in payload[: self.config.limit]:
            content = self._merge_fields(raw_post)
            normalized.append(
                {
                    "id": raw_post.get("id"),
                    "title": raw_post.get("title", ""),
                    "body": raw_post.get("body", ""),
                    "content": content,
                }
            )
        return pd.DataFrame(normalized)

    def clean(self, frame: pd.DataFrame) -> pd.DataFrame:
        clean_frame = frame.copy()
        clean_frame["clean_text"] = (
            clean_frame["content"].astype(str)
            .str.replace("\\s+", " ", regex=True)
            .str.strip()
        )
        if self.config.min_chars:
            clean_frame = clean_frame[clean_frame["clean_text"].str.len() >= self.config.min_chars]
        clean_frame.reset_index(drop=True, inplace=True)
        return clean_frame

    def score_sentiment(self, frame: pd.DataFrame) -> pd.DataFrame:
        scored = frame.copy()

        def _score_row(text: str) -> Tuple[float, str]:
            scores = self._analyzer.polarity_scores(text)
            compound = scores["compound"]
            if compound >= 0.05:
                label = "positive"
            elif compound <= -0.05:
                label = "negative"
            else:
                label = "neutral"
            return compound, label

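        # Score each row once, then assign each column from a plain list so the
        # score column gets a float dtype and an empty frame yields empty columns.
        results = [_score_row(text) for text in scored["clean_text"]]
        scored["sentiment_score"] = [score for score, _ in results]
        scored["sentiment"] = [label for _, label in results]
        return scored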

    def generate_reports(self, frame: pd.DataFrame) -> Dict[str, Path]:
        output_dir = self.config.resolved_output_dir()
        output_dir.mkdir(parents=True, exist_ok=True)

        detail_path = output_dir / self.config.detail_filename
        frame.to_csv(detail_path, index=False)

        summary = frame["sentiment"].value_counts().rename_axis("sentiment").reset_index(name="count")
        summary_path = output_dir / self.config.report_filename
        summary.to_csv(summary_path, index=False)

        return {
            "detail_csv": detail_path,
            "sentiment_report_csv": summary_path,
        }

    @staticmethod
    def _merge_fields(raw_post: Dict[str, str]) -> str:
        title = raw_post.get("title", "")
        body = raw_post.get("body", "")
        return f"{title}\n{body}".strip()


## Configuration

Adjust the values below to control the run:

- `source_url`: API endpoint or JSON feed to scrape.
- `limit`: Maximum number of records to fetch.
- `min_chars`: Minimum character length to keep after cleaning (set to 0 to skip filtering).
- `output_dir`: Directory where CSV outputs are written.
- `detail_filename` / `report_filename`: File names for the detailed rows and aggregated sentiment counts.


In [None]:
config = ScraperConfig(
    source_url="https://jsonplaceholder.typicode.com/posts",  # swap in your endpoint
    limit=25,
    min_chars=40,
    output_dir=Path("artifacts"),
)
config


## Scrape

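The cell below downloads posts from the configured `source_url`, which is expected to return a JSON array of objects with `id`, `title`, and `body` fields. For a different dataset, update the configuration above, or skip this step and supply your own DataFrame with a `content` column if you already have raw text.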
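
If you already have raw text, a frame shaped like the scraper output can stand in for `pipeline.scrape()`. The sketch below assumes the only hard requirement is a `content` column, since that is what `clean` reads. The sample rows and the `my_posts` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample records; replace these with your own data.
records = [
    {"id": 1, "title": "Great launch", "body": "The new release works really well for our team."},
    {"id": 2, "title": "Slow sync", "body": "Syncing large folders is painfully slow and often fails."},
]

my_posts = pd.DataFrame(records)

# Mirror the scraper's merge step: title and body joined by a newline.
my_posts["content"] = (my_posts["title"] + "\n" + my_posts["body"]).str.strip()
```

Passing `my_posts` to `pipeline.clean` in place of `raw_posts` should then behave the same way as with scraped data.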


In [None]:
pipeline = XScraper(config)
raw_posts = pipeline.scrape()
raw_posts.head()


## Clean the text

The cleaning step collapses runs of whitespace into single spaces and drops entries shorter than `min_chars` characters. Increase `min_chars` if you see noisy, low-value rows, or set it to `0` to keep everything.
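
To see the effect of the two cleaning rules in isolation, here is a minimal sketch on a two-row sample. The sample text and the `sample`, `cleaned`, and `kept` names are illustrative only; the logic follows the same pattern as `XScraper.clean`.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "content": [
            "  short  ",
            "a  much\n\nlonger   entry with   uneven spacing inside it",
        ]
    }
)
min_chars = 40  # same value as the ScraperConfig default

cleaned = sample.copy()
# Collapse any run of whitespace (spaces, tabs, newlines) into one space.
cleaned["clean_text"] = (
    cleaned["content"].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
)
# Drop whole rows whose cleaned text is shorter than the threshold.
kept = cleaned[cleaned["clean_text"].str.len() >= min_chars].reset_index(drop=True)
```

Only the longer row survives here, and its text is kept in full: the filter removes rows, it never shortens them.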


In [None]:
clean_posts = pipeline.clean(raw_posts)
clean_posts.head()


## Sentiment analysis

VADER sentiment is applied to the cleaned text. The `sentiment_score` is VADER's compound score in `[-1, 1]`, and the `sentiment` label is derived from the thresholds set in `score_sentiment`, which follow the cutoffs commonly recommended for VADER (`>= 0.05` positive, `<= -0.05` negative, otherwise neutral).
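
The labeling rule can be read as a small pure function. The sketch below is not part of the notebook's helpers: `label_from_compound` and its `threshold` parameter are hypothetical names, introduced only to show how the cutoffs map a compound score to a label and how a stricter cutoff would change the result.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores at the boundary count as positive or negative, not neutral.
labels = [label_from_compound(score) for score in (0.05, -0.05, 0.0, 0.049)]

# A stricter, purely illustrative cutoff widens the neutral band.
strict_label = label_from_compound(0.1, threshold=0.2)
```

Widening the neutral band this way is one option if many weakly worded posts end up labeled positive or negative.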


In [None]:
sentiment_posts = pipeline.score_sentiment(clean_posts)
sentiment_posts[["id", "sentiment_score", "sentiment"]].head()


## Reporting

Both detailed and aggregated reports are saved under `output_dir`:

- `scraped_posts.csv`: Each cleaned post with sentiment columns.
- `sentiment_report.csv`: Counts of positive/neutral/negative posts.
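
As a rough illustration of what the aggregated report contains, the sketch below builds the same kind of count table from a tiny made-up frame, writes it to a temporary directory, and reads it back. The `scored` frame and the temporary path are stand-ins, not the notebook's actual data or `output_dir`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up scored rows standing in for the pipeline output.
scored = pd.DataFrame(
    {"id": [1, 2, 3], "sentiment": ["positive", "neutral", "positive"]}
)

# One row per label, with the number of posts carrying that label.
summary = (
    scored["sentiment"]
    .value_counts()
    .rename_axis("sentiment")
    .reset_index(name="count")
)

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "sentiment_report.csv"
    summary.to_csv(report_path, index=False)
    reloaded = pd.read_csv(report_path)
```

The reloaded frame has two columns, `sentiment` and `count`, which is the shape to expect when opening the report in another tool.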


In [None]:
report_paths = pipeline.generate_reports(sentiment_posts)
report_paths


In [None]:
summary = sentiment_posts["sentiment"].value_counts().rename_axis("sentiment").reset_index(name="count")
summary


## Interpreting the outputs

- Review the `summary` table to see whether your dataset skews positive, neutral, or negative.
- Inspect the `scraped_posts.csv` file to confirm that cleaning kept the posts you expect. Cleaning never truncates text; it only drops whole rows, so if posts are missing, lower `min_chars`.
- Check `sentiment_report.csv` for quick counts you can chart elsewhere. You can also use the `summary` DataFrame directly in this notebook to create plots.
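
One possible way to chart the counts is a plain matplotlib bar chart, which was installed in the first cell. The sketch below uses a stand-in table with invented counts (`demo_summary`) so that it runs on its own; in the notebook, the real `summary` DataFrame could be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the notebook's `summary` table; the counts are invented.
demo_summary = pd.DataFrame(
    {"sentiment": ["positive", "neutral", "negative"], "count": [12, 5, 3]}
)

fig, ax = plt.subplots(figsize=(5, 3))
# One bar per sentiment label, with height equal to the number of posts.
ax.bar(demo_summary["sentiment"], demo_summary["count"])
ax.set_xlabel("sentiment")
ax.set_ylabel("posts")
ax.set_title("Posts per sentiment label")
fig.tight_layout()
```

In Colab or Jupyter the figure should render inline at the end of the cell.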
