# X.com user scraper, cleaner, and sentiment walkthrough

Run this notebook end-to-end in Google Colab without any extra files. It installs dependencies, scrapes recent posts for a list of X.com usernames, cleans the text, scores sentiment, and saves CSV reports. Every editable setting is surfaced in the configuration cell so you can point the scraper at your own accounts.

In [None]:
%pip install -q --disable-pip-version-check pandas vaderSentiment matplotlib snscrape

## Helpers (defined inline)

All logic lives in this notebook. The scraper uses [`snscrape`](https://github.com/JustAnotherArchivist/snscrape) to pull posts by username without needing an API token. Sentiment is scored with VADER.

In [None]:
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple

import pandas as pd
import snscrape.modules.twitter as sntwitter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


@dataclass
class ScraperConfig:
    """Configuration for scraping X.com users and reporting sentiment."""

    usernames: List[str]
    max_posts_per_user: int = 50
    min_chars: int = 20
    output_dir: Path = Path("artifacts")
    detail_filename: str = "x_posts.csv"
    report_filename: str = "sentiment_report.csv"

    def resolved_output_dir(self) -> Path:
        return Path(self.output_dir).expanduser().resolve()


class XUserScraper:
    """Scrape posts for one or more X.com usernames."""

    def __init__(self, config: ScraperConfig):
        self.config = config
        self._analyzer = SentimentIntensityAnalyzer()

    def scrape(self) -> pd.DataFrame:
        rows = []
        for username in self.config.usernames:
            # Pass the handle positionally; the keyword name has differed between snscrape releases.
            scraper = sntwitter.TwitterUserScraper(username)
            for idx, tweet in enumerate(scraper.get_items()):
                if idx >= self.config.max_posts_per_user:
                    break
                rows.append(
                    {
                        "username": username,
                        "id": tweet.id,
                        "date": tweet.date.isoformat(),
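                        # Newer snscrape releases expose the text as rawContent; fall back to content.
                        "content": getattr(tweet, "rawContent", None) or getattr(tweet, "content", ""),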
                        "language": getattr(tweet, "lang", None),
                        "likeCount": tweet.likeCount,
                        "replyCount": tweet.replyCount,
                        "retweetCount": tweet.retweetCount,
                        "quoteCount": tweet.quoteCount,
                        "viewCount": getattr(tweet, "viewCount", None),
                        "url": tweet.url,
                    }
                )
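        # Name the columns explicitly so an empty scrape still yields the expected schema.
        columns = [
            "username", "id", "date", "content", "language", "likeCount",
            "replyCount", "retweetCount", "quoteCount", "viewCount", "url",
        ]
        return pd.DataFrame(rows, columns=columns)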

    def clean(self, frame: pd.DataFrame) -> pd.DataFrame:
        clean_frame = frame.copy()
        clean_frame["clean_text"] = (
            clean_frame["content"].astype(str)
            .str.replace(r"\s+", " ", regex=True)
            .str.strip()
        )
        if self.config.min_chars:
            clean_frame = clean_frame[clean_frame["clean_text"].str.len() >= self.config.min_chars]
        clean_frame.reset_index(drop=True, inplace=True)
        return clean_frame

    def score_sentiment(self, frame: pd.DataFrame) -> pd.DataFrame:
        scored = frame.copy()

        def _score_row(text: str) -> Tuple[float, str]:
            scores = self._analyzer.polarity_scores(text)
            compound = scores["compound"]
            if compound >= 0.05:
                label = "positive"
            elif compound <= -0.05:
                label = "negative"
            else:
                label = "neutral"
            return compound, label

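        # Score each post once, then assign the two columns separately so the
        # score stays numeric and an empty frame does not break the assignment.
        results = [_score_row(text) for text in scored["clean_text"]]
        scored["sentiment_score"] = [score for score, _ in results]
        scored["sentiment"] = [label for _, label in results]
        return scored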

    def generate_reports(self, frame: pd.DataFrame):
        output_dir = self.config.resolved_output_dir()
        output_dir.mkdir(parents=True, exist_ok=True)

        detail_path = output_dir / self.config.detail_filename
        frame.to_csv(detail_path, index=False)

        summary = frame["sentiment"].value_counts().rename_axis("sentiment").reset_index(name="count")
        summary_path = output_dir / self.config.report_filename
        summary.to_csv(summary_path, index=False)

        return {
            "detail_csv": detail_path,
            "sentiment_report_csv": summary_path,
        }


## Configuration

Edit the usernames list to target the accounts you want to scrape (10–15 usernames work fine).

- `usernames`: X.com handles without the `@`.
- `max_posts_per_user`: How many recent posts to pull for each username.
- `min_chars`: Minimum character length to keep after cleaning (set to `0` to keep everything).
- `output_dir` / filenames: Where CSV exports are written.
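
Because handles must be supplied without the `@`, it can help to normalize a pasted list before building the config. The sketch below is illustrative only: `normalize_handles` is a hypothetical helper that is not part of this notebook's classes, and it assumes handles should be treated as case-insensitive when removing duplicates.

```python
from typing import Iterable, List


def normalize_handles(raw_handles: Iterable[str]) -> List[str]:
    """Strip whitespace and a leading '@', then drop blanks and case-insensitive repeats."""
    seen = set()
    cleaned = []
    for handle in raw_handles:
        name = handle.strip().lstrip("@")
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            cleaned.append(name)
    return cleaned


example_handles = normalize_handles(["@X", " TwitterDev ", "jack", "@Jack", ""])
print(example_handles)
```

The returned list keeps the first spelling of each handle and preserves input order, so it can be passed straight to `ScraperConfig(usernames=...)`.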

In [None]:
usernames = [
    "X",  # replace with your target handles (no @)
    "TwitterDev",
    "jack",
]

config = ScraperConfig(
    usernames=usernames,
    max_posts_per_user=40,
    min_chars=20,
    output_dir=Path("artifacts"),
)
config

## Scrape

Run the cell below to fetch posts for the configured usernames. The resulting DataFrame includes the posting date and common engagement fields so you can inspect what is available before cleaning.

In [None]:
pipeline = XUserScraper(config)
raw_posts = pipeline.scrape()
raw_posts.head()

Inspect the available columns. Feel free to slice the DataFrame however you like; the `date`, `content`, and engagement counts are already included.
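
Note that `scrape()` stores `date` as an ISO 8601 string (it calls `isoformat()`), so time-based slicing needs the values parsed first. A minimal standard-library sketch, using made-up timestamps rather than scraped data and assuming the strings carry a UTC offset:

```python
from datetime import datetime, timezone

# Made-up ISO 8601 strings in the shape produced by datetime.isoformat().
sample_dates = ["2024-01-15T12:30:00+00:00", "2024-01-14T08:05:00+00:00"]

parsed = [datetime.fromisoformat(value) for value in sample_dates]
newest = max(parsed)

# Keep only timestamps on or after an arbitrary cutoff.
cutoff = datetime(2024, 1, 15, tzinfo=timezone.utc)
recent = [value for value in parsed if value >= cutoff]
print(newest.isoformat(), len(recent))
```

On the DataFrame itself, `pd.to_datetime(raw_posts["date"], utc=True)` performs the same conversion for the whole column.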

In [None]:
raw_posts[["username", "date", "content", "likeCount", "replyCount", "retweetCount", "quoteCount"]].head()

## Clean the text

Runs of whitespace are collapsed to a single space, and rows whose cleaned text is shorter than `min_chars` characters are dropped. Adjust `min_chars` in the config if you want to keep shorter posts.
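
To see the cleaning rule in isolation, here is a plain-Python sketch that mirrors what `clean()` does with pandas. `clean_texts` and the sample strings are hypothetical and exist only for illustration.

```python
import re
from typing import List


def clean_texts(texts: List[str], min_chars: int = 20) -> List[str]:
    """Collapse whitespace runs, trim the ends, and drop texts shorter than min_chars."""
    cleaned = [re.sub(r"\s+", " ", text).strip() for text in texts]
    return [text for text in cleaned if len(text) >= min_chars]


samples = ["  Shipping   the new\nrelease today!  ", "gm", "Thanks\tfor all the feedback"]
print(clean_texts(samples, min_chars=20))
```

With `min_chars=20` the two-character post is filtered out, while the other two survive with their internal whitespace normalized. Passing `min_chars=0` keeps every row.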

In [None]:
clean_posts = pipeline.clean(raw_posts)
clean_posts.head()

## Sentiment analysis

VADER sentiment is applied to the cleaned text. `sentiment_score` is the compound score in `[-1, 1]`; `sentiment` is the bucketed label: positive when the score is at least `0.05`, negative when it is at most `-0.05`, and neutral otherwise.
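
The bucketing step can be tried on its own, without running VADER, by feeding it hand-picked compound scores. This sketch reuses the cutoffs from `score_sentiment`; `label_from_compound` is a hypothetical standalone function, and the sample values are invented.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Bucket a compound score into positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for value in (0.62, 0.05, 0.0, -0.049, -0.4):
    print(value, label_from_compound(value))
```

Scores that land exactly on a cutoff count as positive or negative, so only values strictly between the two cutoffs are labeled neutral.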

In [None]:
sentiment_posts = pipeline.score_sentiment(clean_posts)
sentiment_posts[["username", "date", "sentiment_score", "sentiment"]].head()

## Reporting

Detailed and aggregated CSVs are saved under `output_dir` for reuse outside the notebook.

In [None]:
report_paths = pipeline.generate_reports(sentiment_posts)
report_paths

In [None]:
summary = sentiment_posts["sentiment"].value_counts().rename_axis("sentiment").reset_index(name="count")
summary

## Interpreting the outputs

- Use `summary` to understand whether the scraped accounts skew positive, neutral, or negative.
- Open `x_posts.csv` to audit the cleaned text and ensure important content was preserved. Lower `min_chars` if short posts you expected to see are missing; the filter drops whole rows and never truncates text.
- `sentiment_report.csv` is a quick export of label counts you can plot elsewhere.
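
Raw counts are easier to compare across runs once converted to shares. The following sketch parses text in the same two-column layout as `sentiment_report.csv` and computes each label's fraction of the total. The counts are made up for illustration; in practice you would read the file written by `generate_reports`.

```python
import csv
import io

# Made-up counts in the same two-column layout as sentiment_report.csv.
report_text = "sentiment,count\npositive,12\nneutral,6\nnegative,2\n"

rows = list(csv.DictReader(io.StringIO(report_text)))
total = sum(int(row["count"]) for row in rows)
shares = {row["sentiment"]: int(row["count"]) / total for row in rows}

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```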