# Fetch Cramer tweets from Nitter.net
Since Twitter disabled its search function for users who are not logged in, tools like snscrape no longer work, and neither does the original notebook for Twitter scraping.

To fetch tweets now, we have to scrape Nitter.net, a privacy-friendly alternative front end for Twitter. Nitter is powered by Twitter's GraphQL API, which is still available.

Sample HTML of Jim Cramer's tweets on Nitter.net is stored in `data/tweets/sample_html`.

## Format of the tweets:
Please find exported tweets in the `data/tweets` folder. The fields are:
- time
- content
- comments
- retweets
- quotes
- hearts

## Downstream processing:
1. Perform named-entity recognition (NER) on the tweets to extract the tickers
2. Perform sentiment analysis on the tweets to extract Cramer's sentiment on the tickers
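
As a rough illustration of the ticker-extraction step, the sketch below pulls cashtags such as `$AAPL` out of tweet text with a regular expression. This is a naive stand-in and not NER: it misses tickers written without a `$` and ignores symbols containing dots or digits. The name `extract_cashtags` and the sample text are hypothetical and do not come from the dataset.

```python
import re

# Naive cashtag pattern: a "$" followed by 1-5 letters, e.g. "$AAPL".
# Real tickers can also contain dots or digits; this is only a first pass.
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")


def extract_cashtags(text: str) -> list[str]:
    """Return the unique upper-cased cashtags in `text`, in order of appearance."""
    seen = []
    for match in CASHTAG_RE.finditer(text):
        ticker = match.group(1).upper()
        if ticker not in seen:
            seen.append(ticker)
    return seen


# Hypothetical tweet text, not taken from the dataset
sample = "I like $AAPL here, and $nvda too. Still holding $AAPL, not paying $5 more."
print(extract_cashtags(sample))  # ['AAPL', 'NVDA']
```

A proper NER model would still be needed to catch company names mentioned without a cashtag.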


## Importing packages and setting up logging

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import logging
from rich.logging import RichHandler
from tqdm import tqdm
import os

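# FileHandler does not create missing directories, so make sure ../logs exists
os.makedirs('../logs', exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('../logs/fetch_tweets.log'),
        # logging.StreamHandler()  # uncomment to also log to the console
    ],
)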

LOGGER = logging.getLogger(__name__)

## Defining functions
The functions below parse a single tweet and its stats, find the cursor for the next page, and fetch a user's tweets from Nitter.net page by page.
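
The tweet parser in this notebook reads each timestamp with the format string `%b %d, %Y · %I:%M %p %Z`. The minimal sketch below shows what that format accepts. The sample string is hypothetical, written to match the format, and the example assumes an English locale, since `%b` and `%p` are locale-dependent.

```python
from datetime import datetime

# Same format string as the tweet parser in this notebook
NITTER_TIME_FORMAT = "%b %d, %Y · %I:%M %p %Z"

# Hypothetical title value, written to match the format above
sample_title = "Mar 20, 2022 · 3:15 PM UTC"

parsed = datetime.strptime(sample_title, NITTER_TIME_FORMAT)
print(parsed)         # 2022-03-20 15:15:00
print(parsed.tzinfo)  # None: %Z is matched but does not attach a timezone
```

Because the result is a naive `datetime`, it can be compared directly with naive bounds such as `datetime(2010, 3, 20)`.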

In [14]:
def parse_tweet_stats(tweet) -> dict:
    """Parse tweet stats from a single tweet
    Deals with empty stats, comma in numbers, etc.
    Ignores 'plays' and 'views' stats
    """
    stats = {}
    stat_icons = ["icon-comment", "icon-retweet", "icon-quote", "icon-heart"]
    keys = ["comments", "retweets", "quotes", "hearts"]

    for icon, key in zip(stat_icons, keys):
        stat = tweet.find("span", class_=icon)
        if stat:
            stat_value = stat.find_next("div").get_text(strip=True)
            if stat_value:
                # Use the extracted count (e.g. "1,234"), not the icon's own empty text
                stats[key] = int(stat_value.replace(",", ""))
            else:
                stats[key] = np.nan
        else:
            stats[key] = np.nan

    return stats


def parse_tweet(tweet) -> dict:
    """Parse the time, content and stats of a single tweet."""
    time_element = tweet.find("span", class_="tweet-date")
    try:
        time = datetime.strptime(
            time_element.find("a")["title"], r"%b %d, %Y · %I:%M %p %Z"
        )
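    # A missing <a> makes find() return None, and indexing None raises TypeError
    except (AttributeError, TypeError, KeyError):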
        LOGGER.warning("Failed to parse tweet time: {}".format(tweet))
        time = datetime.strptime(time_element.text, r"%b %d, %Y · %I:%M %p %Z")

    # LOGGER.info("Current tweet time: {}".format(time))

    content = tweet.find("div", class_="tweet-content").text

    stats = tweet.find("div", class_="tweet-stats")
    # LOGGER.info(stats)
    stats_parsed = parse_tweet_stats(stats)

    return {
        "time": time,
        "content": content,
        "comments": stats_parsed['comments'],
        "retweets": stats_parsed['retweets'],
        "quotes": stats_parsed['quotes'],
        "hearts": stats_parsed['hearts'],
    }


def get_next_page_cursor(soup) -> str|None:
    """Get the next page cursor from the soup object

    :param soup: bs4 soup object
    :type soup: bs4.BeautifulSoup
    :return: next page cursor
    :rtype: str or None
    """
    timeline_end = soup.find("h2", class_="timeline-end")
    if timeline_end:
        return None

    load_more_buttons = soup.find_all("div", class_="show-more")
    if not load_more_buttons:
        return None
    load_more = load_more_buttons[-1]
    next_page = load_more.find("a")["href"].split("cursor=")[-1]
    return next_page


def fetch_tweets(user, start_date=None, end_date=None, cursor=None):
    """Fetch tweets from a user from nitter.net
    Specify the start_date and end_date to limit the time range.
    Handles pagination.
    Skips retweets and show-more buttons


    :param user: user name without @
    :type user: str
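    :param start_date: earliest tweet time to keep; fetching stops at the first older tweet, defaults to None
    :type start_date: datetime.datetime, optional
    :param end_date: latest tweet time to keep; newer tweets are skipped, defaults to None
    :type end_date: datetime.datetime, optional
    :param cursor: pagination cursor to resume from, defaults to None
    :type cursor: str, optional
    :return: the cursor of the last page fetched and the list of parsed tweets
    :rtype: tuple[str | None, list[dict]]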
    """
    base_url = f"https://nitter.net/{user}/search?f=tweets&e-native_video=on&e-pro_video=on&e-news=on&e-replies=on&e-nativeretweets=on"
    tweets_data = []
    page_count = 0

    LOGGER.info("Start fetching tweets from {}".format(user))
    if start_date:
        LOGGER.info("Start date: {}".format(start_date))
    if end_date:
        LOGGER.info("End date: {}".format(end_date))

    while True:
        url = base_url
        if cursor:
            url += f"&cursor={cursor}"
        LOGGER.info("current url: " + url)

        # Without a timeout, a stalled connection would block the loop indefinitely
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")

        tweets = soup.find_all("div", class_="timeline-item")
        LOGGER.info("Fetched tweet count: {}".format(len(tweets)))
        for tweet in tweets:
            # skip retweets and show-more
            if (
                tweet.find("div", class_="retweet-header")
                or "show-more" in tweet["class"]
            ):
                continue

            tweet_data = parse_tweet(tweet)
            tweet_date = tweet_data["time"]
            # print(tweet_date)

            if start_date and tweet_date < start_date:
                LOGGER.info("start date reached")
                return cursor, tweets_data
            if end_date and tweet_date > end_date:
                continue

            tweets_data.append(tweet_data)
        LOGGER.info("finished fetching page {} starts from {}".format(page_count, tweets_data[-1]["time"] if tweets_data else None))
        LOGGER.info("Total tweet count: {}".format(len(tweets_data)))
        page_count += 1

        new_cursor = get_next_page_cursor(soup)
        if new_cursor and new_cursor != f"/{user}" and cursor != new_cursor:
            cursor = new_cursor
            LOGGER.info("next page = {}, cursor = {}".format(page_count, cursor))
        else:
            break

    return cursor, tweets_data 
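
`get_next_page_cursor` takes the cursor by splitting the link on `cursor=`. As a hedged alternative, the sketch below reads the `cursor` query parameter with `urllib.parse` from the standard library. The helper name `cursor_from_href` and both sample links are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def cursor_from_href(href: str) -> Optional[str]:
    """Return the value of the `cursor` query parameter, or None if absent."""
    values = parse_qs(urlparse(href).query).get("cursor")
    return values[0] if values else None


# Hypothetical "show more" links
print(cursor_from_href("?f=tweets&cursor=ABC123"))  # ABC123
print(cursor_from_href("/jimcramer"))               # None
```

With this approach a link without a cursor yields `None` instead of the whole link. One caveat is that `parse_qs` decodes percent-encoded characters, so a decoded cursor may need re-encoding before it is appended to a URL again.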


In [None]:
def generate_yearly_date_ranges(start_date, end_date):
    """Split [start_date, end_date] into (start, end) pairs, one per calendar year."""
    date_ranges = []
    current_date = start_date
    while current_date <= end_date:
        next_year_start = current_date.replace(year=current_date.year + 1, month=1, day=1)
        if next_year_start > end_date:
            date_ranges.append((current_date, end_date))
        else:
            date_ranges.append((current_date, next_year_start - timedelta(days=1)))
        current_date = next_year_start
    return date_ranges
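
To make the yearly split concrete, the sketch below restates the same logic under the hypothetical name `yearly_ranges` and runs it on a short, made-up date span.

```python
from datetime import datetime, timedelta


def yearly_ranges(start, end):
    """Split [start, end] into consecutive ranges, one per calendar year."""
    ranges = []
    current = start
    while current <= end:
        next_year = current.replace(year=current.year + 1, month=1, day=1)
        # The final range is cut off at `end`; earlier ones end on 31 December
        last = end if next_year > end else next_year - timedelta(days=1)
        ranges.append((current, last))
        current = next_year
    return ranges


# Made-up span covering parts of three calendar years
for first, last in yearly_ranges(datetime(2020, 11, 15), datetime(2022, 2, 10)):
    print(first.date(), "->", last.date())
# 2020-11-15 -> 2020-12-31
# 2021-01-01 -> 2021-12-31
# 2022-01-01 -> 2022-02-10
```

Each intermediate range ends at 00:00 on 31 December, so a filter of the form `tweet_date > end_date` treats tweets posted later that day as outside the range.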

## Main web scraping loop


In [None]:
user = "jimcramer"
start_date = datetime(2010, 3, 20)
end_date = datetime(2022, 3, 20)

# Set the cursor to skip the previously fetched pages
last_cursor = "DAADDAABCgABFuU7axTaYAAKAAIUgjf_sxcwBwAIAAIAAAACCAADAAAAAAgABAAAALAKAAUW5e3Ch8AnEAoABhbl7cKHpP3wAAA"


In [16]:
date_ranges = generate_yearly_date_ranges(start_date, end_date)

for start, end in tqdm(date_ranges[::-1]):
    filename = f"{user}_tweets_{start.strftime('%Y-%m-%d')}-{end.strftime('%Y-%m-%d')}"
    # Check the same folder the CSV is written to below
    if os.path.exists(f"../data/tweets/{filename}.csv"):
        print(f"Skipping {filename}, already exists")
        continue
    
    last_cursor, tweets_data = fetch_tweets(user, start, end, cursor=last_cursor)
    print("cursor: ", last_cursor)
    tweets_df = pd.DataFrame(tweets_data)

    if len(tweets_df) > 0:
        tweets_df.to_csv(f"../data/tweets/{filename}.csv", index=False)
        print(f"Scraped {len(tweets_df)} tweets from {user} between {start} and {end}")
    else:
        print(f"No tweets found for {user} between {start} and {end}")

  0%|          | 0/13 [00:00<?, ?it/s]

Skipping jimcramer_tweets_2022-01-01-2022-03-20, already exists


 15%|█▌        | 2/13 [05:55<32:34, 177.68s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAISrGa_-pbwAQAIAAIAAAACCAADAAAAAAgABAAAAaAKAAUW5e3Ch8AnEAoABhbl7cKHgF7wAAA
Scraped 4766 tweets from jimcramer between 2021-01-01 00:00:00 and 2021-12-31 00:00:00


 23%|██▎       | 3/13 [14:41<53:49, 323.00s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIQ1IWaVdewAAAIAAIAAAACCAADAAAAAAgABAAAAv0KAAUW5e3Ch8AnEAoABhbl7cKHSx4gAAA
Scraped 6946 tweets from jimcramer between 2020-01-01 00:00:00 and 2020-12-31 00:00:00


 31%|███       | 4/13 [20:54<51:13, 341.45s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIO_1Hwy1dAAAAIAAIAAAACCAADAAAAAAgABAAAA-wKAAUW5e3Ch8AnEAoABhbl7cKHJqYwAAA
Scraped 4763 tweets from jimcramer between 2019-01-01 00:00:00 and 2019-12-31 00:00:00


 38%|███▊      | 5/13 [26:56<46:30, 348.83s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAINKL1WQpfgAAAIAAIAAAACCAADAAAAAAgABAAABNsKAAUW5e3Ch8AnEAoABhbl7cKHAi5AAAA
Scraped 4758 tweets from jimcramer between 2018-01-01 00:00:00 and 2018-12-31 00:00:00


 46%|████▌     | 6/13 [36:03<48:21, 414.54s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAILVAsiyVZAAAAIAAIAAAACCAADAAAAAAgABAAABjoKAAUW5e3Ch8AnEAoABhbl7cKGzJ9QAAA
Scraped 6989 tweets from jimcramer between 2017-01-01 00:00:00 and 2017-12-31 00:00:00


 54%|█████▍    | 7/13 [45:36<46:32, 465.37s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIJfDCL1FawAAAIAAIAAAACCAADAAAAAAgABAAAB6gKAAUW5e3Ch8AnEAoABhbl7cKGlMZwAAA
Scraped 7264 tweets from jimcramer between 2016-01-01 00:00:00 and 2016-12-31 00:00:00


 62%|██████▏   | 8/13 [56:47<44:10, 530.00s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIHqUNNOQhQAAAIAAIAAAACCAADAAAAAAgABAAACXAKAAUW5e3Ch8AnEAoABhbl7cKGTzHwAAA
Scraped 9059 tweets from jimcramer between 2015-01-01 00:00:00 and 2015-12-31 00:00:00


 69%|██████▉   | 9/13 [1:04:30<33:57, 509.38s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIFzppcJggQAAAIAAIAAAACCAADAAAAAAgABAAACscKAAUW5e3Ch8AnEAoABhbl7cKGGtuAAAA
Scraped 6863 tweets from jimcramer between 2014-01-01 00:00:00 and 2014-12-31 00:00:00


 77%|███████▋  | 10/13 [1:12:45<25:15, 505.04s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAID-JUsWAJQAQAIAAIAAAACCAADAAAAAAgABAAADDIKAAUW5e3Ch8AnEAoABhbl7cKF43fQAAA
Scraped 7262 tweets from jimcramer between 2013-01-01 00:00:00 and 2013-12-31 00:00:00


 85%|████████▍ | 11/13 [1:18:56<15:27, 463.94s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAICJD-X2EIwAQAIAAIAAAACCAADAAAAAAgABAAADUgKAAUW5e3Ch8AnEAoABhbl7cKFuQxwAAA
Scraped 5557 tweets from jimcramer between 2012-01-01 00:00:00 and 2012-12-31 00:00:00


 92%|█████████▏| 12/13 [1:23:48<06:51, 411.88s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIATeXRw0AgAAAIAAIAAAACCAADAAAAAAgABAAADiUKAAUW5e3Ch8AnEAoABhbl7cKFl1OgAAA
Scraped 4405 tweets from jimcramer between 2011-01-01 00:00:00 and 2011-12-31 00:00:00


100%|██████████| 13/13 [1:27:53<00:00, 405.64s/it]

cursor:  DAADDAABCgABFuU7axTaYAAKAAIAAAACgld-4gAIAAIAAAACCAADAAAAAAgABAAADt0KAAUW5e3Ch8AnEAoABhbl7cKFe0AgAAA
Scraped 3676 tweets from jimcramer between 2010-03-20 00:00:00 and 2010-12-31 00:00:00



