# Twitter Scraping Notebook Overview
Welcome to the Twitter Scraping Notebook. This notebook extracts tweets from specific Twitter accounts by scraping a Nitter front end (the code targets `https://nitter.rawbit.ninja`) with Selenium and parsing the extracted HTML fragments with BeautifulSoup. Before starting, install the required libraries (`selenium`, `beautifulsoup4`, `pandas`) and set up a WebDriver for Selenium; the code uses Chrome. The code is organized into small functions, each covered in its own section.


## Importing Libraries and Setting Constants
This section imports the Python libraries used for web scraping, date handling, and data manipulation. It also defines `END_DATE`, the cutoff date: tweets older than this date are not collected.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urlparse
import pandas as pd

# Constants
END_DATE = datetime(2023, 10, 7)


## Function to Process Individual Tweets
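The `process_tweet` function extracts the key fields from a single timeline item: the tweet's ID, content, timestamp, author, URL, and engagement counts (replies, retweets, quote retweets, likes), plus the `leaning` label passed in by the caller. Pinned tweets, retweets, and quote tweets are skipped. The function returns a dictionary of tweet data, or `None` when the tweet is skipped or any extraction step raises an exception.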

In [None]:
def process_tweet(div, leaning):
    try:
        # Extracting tweet date (parsed here only to validate the format;
        # the raw string is what gets returned under 'dates')
        date_html = div.find_element(By.CLASS_NAME, 'tweet-date').get_attribute('innerHTML')
        date_soup = BeautifulSoup(date_html, 'html.parser')
        date_str = date_soup.a['title']
        tweet_date = datetime.strptime(date_str, "%b %d, %Y · %I:%M %p UTC")

        # Check for pinned, retweet, and quote tweets
        if div.find_elements(By.CLASS_NAME, 'pinned') or \
           div.find_elements(By.CLASS_NAME, 'retweet-header') or \
           div.find_elements(By.CLASS_NAME, 'quote'):
            return None

        # Extracting tweet stats
        tweet_stats = div.find_element(By.CLASS_NAME, 'tweet-stats').get_attribute('innerHTML')
        soup = BeautifulSoup(tweet_stats, 'html.parser')
        numbers = [stat.get_text(strip=True).replace(",", "") for stat in soup.find_all(class_="tweet-stat")]

        # Extracting the tweet URL and the ID at the end of its path
        # (named tweet_id rather than id to avoid shadowing the built-in)
        tweet_url = div.find_element(By.CLASS_NAME, 'tweet-link').get_attribute('href')
        tweet_id = urlparse(tweet_url).path.split('/')[-1]

        return {
            'tweet_id': tweet_id,
            'tweets': div.find_element(By.CLASS_NAME, 'tweet-content').get_attribute('innerHTML'),
            'dates': date_str,
            'username': div.find_element(By.CLASS_NAME, 'username').get_attribute('innerHTML'),
            'url': tweet_url,
            'replies': numbers[0],
            'retweets': numbers[1],
            'quote_retweets': numbers[2],
            'likes': numbers[3],
            'leaning': leaning
        }
    except Exception as e:
        print(f"Error processing tweet: {e}")
        return None
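
The string handling inside `process_tweet` can be tried without a browser. The sketch below is a minimal illustration, not part of the notebook: the helper names `parse_tweet_date`, `tweet_id_from_url`, and `stat_to_int` are hypothetical, the sample title string and URL are made up to match the format the function expects, and treating an empty counter as zero is an assumption about how the page renders counts.

```python
from datetime import datetime
from urllib.parse import urlparse

# Format of the title attribute that process_tweet reads from the date link
DATE_FORMAT = "%b %d, %Y · %I:%M %p UTC"


def parse_tweet_date(date_str):
    """Turn a title string into a naive datetime (hypothetical helper)."""
    return datetime.strptime(date_str, DATE_FORMAT)


def tweet_id_from_url(tweet_url):
    """Return the last path segment of a tweet URL, ignoring any #fragment."""
    return urlparse(tweet_url).path.split('/')[-1]


def stat_to_int(text):
    """Convert a counter such as '1,234' to int; assumes empty means zero."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned else 0


# Made-up sample values in the shape process_tweet expects
sample_date = parse_tweet_date("Oct 7, 2023 · 3:15 PM UTC")
sample_id = tweet_id_from_url("https://nitter.example/someuser/status/1234567890#m")
sample_likes = stat_to_int("1,234")
```

Converting the counters to integers before saving is optional; the notebook itself stores them as strings.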


## Function for Scraping Tweets from a Page
`get_tweets_from_page` loads a Nitter timeline page and runs `process_tweet` on each timeline item. It collects tweets dated on or after `END_DATE`, stops at the first tweet older than the cutoff, and returns the collected tweets as a list of dictionaries.


In [None]:
def get_tweets_from_page(driver, url, leaning):
    tweet_data = []
    driver.get(url)
    elements = driver.find_elements(By.CLASS_NAME, 'timeline-item')

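    for div in elements:
        data = process_tweet(div, leaning)
        if data is None:
            continue  # skipped (pinned, retweet, quote) or failed to parse
        # 'dates' holds the raw title string, so parse it before comparing
        tweet_date = datetime.strptime(data['dates'], "%b %d, %Y · %I:%M %p UTC")
        if tweet_date < END_DATE:
            break  # assumes newest-first order, so the rest is older too
        tweet_data.append(data)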

    return tweet_data
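
The cutoff rule can be checked in isolation with plain dictionaries. The following is a sketch under stated assumptions: `keep_until_cutoff` is a hypothetical stand-in for the loop in `get_tweets_from_page`, the page content is made up, and the items are assumed to arrive newest first. It parses the `dates` string before comparing, since a string cannot be compared with a `datetime` directly.

```python
from datetime import datetime

END_DATE = datetime(2023, 10, 7)
DATE_FORMAT = "%b %d, %Y · %I:%M %p UTC"


def keep_until_cutoff(items, end_date=END_DATE):
    """Collect newest-first items until one is older than end_date.

    None entries stand in for tweets that process_tweet skipped.
    """
    kept = []
    for item in items:
        if item is None:
            continue
        if datetime.strptime(item['dates'], DATE_FORMAT) < end_date:
            break
        kept.append(item)
    return kept


# Made-up page content, newest first
page = [
    {'tweet_id': '3', 'dates': "Oct 9, 2023 · 8:00 AM UTC"},
    None,
    {'tweet_id': '2', 'dates': "Oct 7, 2023 · 12:00 AM UTC"},
    {'tweet_id': '1', 'dates': "Oct 6, 2023 · 11:59 PM UTC"},
    {'tweet_id': '0', 'dates': "Oct 8, 2023 · 9:00 AM UTC"},
]
kept_ids = [item['tweet_id'] for item in keep_until_cutoff(page)]
```

The last entry is never examined: once an item older than the cutoff appears, the loop stops, which is only safe if the ordering assumption holds.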


## Function to Save Data as CSV
The `save_to_csv` function converts the list of tweet dictionaries into a pandas DataFrame and writes it to a CSV file without the index column, so the data can be stored and analyzed later.


In [None]:
def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
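
As a quick check, the function can be exercised with a couple of made-up rows and a temporary file. This is a sketch, not part of the original notebook: the rows and file name are invented, and reading the file back with `dtype=str` is a suggestion so that tweet IDs stay text rather than being converted to integers.

```python
import os
import tempfile

import pandas as pd


def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)


# Made-up rows with a subset of the keys process_tweet returns
rows = [
    {'tweet_id': '1234567890', 'replies': '3', 'likes': '1234'},
    {'tweet_id': '1234567891', 'replies': '12', 'likes': '7'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    save_to_csv(rows, path)
    # dtype=str keeps every column as text when loading the file back
    loaded = pd.read_csv(path, dtype=str)
```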


## Main Scraping Function by Username
The `get_tweets_by_username` function drives the scrape for one username. It starts a Chrome session through Selenium, follows the "Show more" button from page to page until a page yields no tweets or the button cannot be found, and then calls `save_to_csv` to write the results to `<username>.csv`.


In [None]:
def get_tweets_by_username(username, leaning):
    driver = webdriver.Chrome()
    url = f'https://nitter.rawbit.ninja/{username}'
    all_tweets = []

    while True:
        tweets = get_tweets_from_page(driver, url, leaning)
        if not tweets:
            break
        all_tweets.extend(tweets)

        # Attempt to find and click "Show more" button
        try:
            button = driver.find_element(By.CLASS_NAME, "show-more")
            button.click()
            url = driver.current_url
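        except Exception:
            # No button, or the click failed: treat this as the last page
            break

    driver.quit()  # end the whole WebDriver session, not only the current window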
    save_to_csv(all_tweets, f'{username}.csv')
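
The stopping conditions of the pagination loop can be illustrated without Selenium. In the sketch below, `collect_pages` and `fetch_page` are hypothetical names, and the pages and cursor URLs are made up; `fetch_page` stands in for "load the page, scrape it, and find the next link".

```python
def collect_pages(fetch_page, first_url):
    """Follow next-page links until a page is empty or has no link.

    fetch_page is a hypothetical callable returning (tweets, next_url).
    """
    all_tweets = []
    url = first_url
    while url is not None:
        tweets, next_url = fetch_page(url)
        if not tweets:
            break
        all_tweets.extend(tweets)
        url = next_url
    return all_tweets


# Made-up pages standing in for a browser session
FAKE_PAGES = {
    '/user': (['t5', 't4'], '/user?cursor=a'),
    '/user?cursor=a': (['t3'], '/user?cursor=b'),
    '/user?cursor=b': ([], None),
}

collected = collect_pages(FAKE_PAGES.get, '/user')
```

One consequence of stopping on an empty page is that a page containing only skipped tweets (for example, all retweets) ends the scrape early, even if older pages hold original tweets within the date range.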


## Executing Scraping for Each User in the DataFrame
This final cell runs the scraper for each user listed in `clusters.csv`. It iterates over the rows of the DataFrame, passing the first column as the username and the second as the `leaning` label, and writes one CSV file per user. Make sure `clusters.csv` exists in the working directory and has these two columns in this order.


In [None]:
df = pd.read_csv('clusters.csv')
for index, row in df.iterrows():
    # Positional access: column 0 is the username, column 1 is the leaning label
    get_tweets_by_username(row.iloc[0], row.iloc[1])
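
For reference, here is a sketch of an input file the loop could consume. The header names and all values are made up; the notebook only relies on column order (username first, leaning label second), so the actual headers in `clusters.csv` may differ.

```python
import io

import pandas as pd

# Made-up file content; the header names are an assumption
SAMPLE_CSV = """username,leaning
some_user,label_a
another_user,label_b
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# itertuples(index=False) yields each row's values in file order
jobs = [(row[0], row[1]) for row in df.itertuples(index=False)]
```

Each pair in `jobs` corresponds to one call of `get_tweets_by_username(username, leaning)`.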
