**Cell 1**:
- **Data Loading and Initial Exploration**: This cell loads the tweet dataset, displays data types, summary statistics for numerical columns, identifies missing values, and displays the first few rows of the DataFrame to understand its format. It also prints the number of tweets in the dataset.

In [25]:
# 1. Data Loading and Initial Exploration
import pandas as pd
import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer
from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import LabelEncoder
from multiprocessing import Pool, cpu_count

# Load the dataset
tweets_df = pd.read_csv('TweetData/combined_tweets_data.csv')

# Display data types
print("Data Types:")
print(tweets_df.dtypes)
print("\n")

# Display summary statistics for numerical columns
print("Summary Statistics:")
print(tweets_df.describe())
print("\n")

# Identify missing values
print("Missing Values:")
missing_values = tweets_df.isnull().sum()
print(missing_values)
print("\n")

# Display the first few rows of the dataframe to understand the data format
print("Data Format (first few rows):")
print(tweets_df.head())

print(f"Number of tweets: {tweets_df.shape[0]}")

Data Types:
public_metrics             object
text                       object
conversation_id           float64
edit_history_tweet_ids     object
lang                       object
referenced_tweets          object
author_id                 float64
context_annotations        object
created_at                 object
tweet_id                    int64
in_reply_to_user_id       float64
geo                        object
metrics                    object
total_engagement            int64
log_engage                float64
dtype: object


Summary Statistics:
       conversation_id     author_id      tweet_id  in_reply_to_user_id  \
count     2.388000e+03  2.390000e+03  2.440000e+03         4.390000e+02   
mean      1.762755e+18  7.136270e+17  1.762799e+18         6.279488e+17   
std       4.840185e+15  7.057288e+17  3.995114e+15         6.789884e+17   
min       1.634546e+18  6.124730e+05  1.757409e+18         6.124730e+05   
25%       1.758673e+18  4.313188e+08  1.758643e+18         1.754350

**Cell 2**:
- **Text Preprocessing**: This cell imports necessary NLP libraries and defines a function to preprocess text by converting it to lowercase, removing URLs, mentions, hashtags, punctuation, and applying tokenization and lemmatization. It downloads necessary NLTK resources and applies the preprocessing function to the tweet text column, saving the processed DataFrame.

In [26]:
# 2. Text Preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Downloading necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    """Preprocess text by lowercasing, removing URLs, mentions, hashtags, punctuation, and applying tokenization and lemmatization."""
    # Convert text to lowercase
    text = text.lower()
    # Remove URLs, mentions, and hashtags
    text = re.sub(r'http\S+|www\S+|https\S+|@\w+|#\w+', '', text)
    # Remove non-word characters and tokenize
    tokens = word_tokenize(re.sub(r'\W+', ' ', text))
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)


# Apply preprocessing to the tweet column
tweets_df['processed_tweet'] = tweets_df['text'].apply(preprocess_text)
tweets_df.to_csv('TweetData/combined_tweets_data.csv', index=False)
print(tweets_df.head)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Truck\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<bound method NDFrame.head of                                          public_metrics  \
0     {'retweet_count': 6, 'reply_count': 0, 'like_c...   
1     {'retweet_count': 0, 'reply_count': 0, 'like_c...   
2     {'retweet_count': 0, 'reply_count': 0, 'like_c...   
3     {'retweet_count': 4, 'reply_count': 0, 'like_c...   
4     {'retweet_count': 0, 'reply_count': 0, 'like_c...   
...                                                 ...   
2435  {'retweet_count': 0, 'reply_count': 0, 'like_c...   
2436  {'retweet_count': 0, 'reply_count': 0, 'like_c...   
2437  {'retweet_count': 2, 'reply_count': 0, 'like_c...   
2438  {'retweet_count': 2, 'reply_count': 0, 'like_c...   
2439  {'retweet_count': 0, 'reply_count': 0, 'like_c...   

                                                   text  conversation_id  \
0     RT @RestartProject: It appears the government ...     1.767657e+18   
1     The U.S. Department of Energy's new Electronic...     1.767650e+18   
2     Keep UCF green with 4Green 

**Cell 3**:
- **Removing Non-English Tweets, Duplicates, and Advertisements**: This cell removes duplicate tweets, non-English tweets, and promotional tweets based on specific keywords. It prints the number of tweets removed at each step and saves the cleaned DataFrame.

In [27]:
# 3. Removing non-English tweets, duplicates, and any that might be advertisements
import pandas as pd

# Assuming tweets_df is the DataFrame already loaded from the previous steps

# Initial number of tweets
initial_count = tweets_df.shape[0]

# Step 1: Remove Duplicate Tweets
tweets_df = tweets_df.drop_duplicates(subset='tweet_id', keep='first')
after_duplicates_count = tweets_df.shape[0]
duplicates_removed = initial_count - after_duplicates_count

# Step 2: Remove Non-English Tweets
tweets_df = tweets_df[tweets_df['lang'] == 'en']
after_non_english_count = tweets_df.shape[0]
non_english_removed = after_duplicates_count - after_non_english_count

# Step 3: Remove Advertising/Promotional Tweets


def is_promotional(text):
    """Identify promotional tweets based on common promotional keywords."""
    promotional_keywords = [
        'buy now', 'free', 'discount', 'offer', 'sale', 'shop', 'promotion',
        'sponsored', 'advertisement', 'ad', 'click here', 'visit our site',
        'subscribe', 'check out', 'limited time'
    ]
    for keyword in promotional_keywords:
        if keyword in text.lower():
            return True
    return False


# Apply the is_promotional function to filter out promotional tweets
tweets_df = tweets_df[~tweets_df['text'].apply(is_promotional)]
after_promotional_count = tweets_df.shape[0]
promotional_removed = after_non_english_count - after_promotional_count

tweets_df.to_csv('TweetData/combined_tweets_data.csv', index=False)

# Display the number of tweets removed at each step
print(f"Initial number of tweets: {initial_count}")
print(f"Number of duplicate tweets removed: {duplicates_removed}")
print(f"Number of non-English tweets removed: {non_english_removed}")
print(f"Number of promotional tweets removed: {promotional_removed}")
print(f"Number of tweets after cleaning: {tweets_df.shape[0]}")

# Display a sample of the cleaned data
print("Sample of cleaned data:")
print(tweets_df.head())

Initial number of tweets: 2440
Number of duplicate tweets removed: 0
Number of non-English tweets removed: 185
Number of promotional tweets removed: 729
Number of tweets after cleaning: 1526
Sample of cleaned data:
                                      public_metrics  \
0  {'retweet_count': 6, 'reply_count': 0, 'like_c...   
4  {'retweet_count': 0, 'reply_count': 0, 'like_c...   
5  {'retweet_count': 2, 'reply_count': 0, 'like_c...   
7  {'retweet_count': 20, 'reply_count': 0, 'like_...   
8  {'retweet_count': 2, 'reply_count': 0, 'like_c...   

                                                text  conversation_id  \
0  RT @RestartProject: It appears the government ...     1.767657e+18   
4  @davidfickling @IEA Rich countries import moun...     1.767438e+18   
5  RT @M_Star_Online: Government failing to take ...     1.767644e+18   
7  RT @caniravkaria: This may be beginning of E-w...     1.767640e+18   
8  RT @ahier: Pulling gold out of e-waste suddenl...     1.767639e+18   

    edit_

**Cell 4**:
- **Extracting Metrics and Log Normalization**: This cell defines a function to extract metrics from JSON strings and handle possible errors. It applies this function to the 'public_metrics' column, sums the metrics into a new column, and log normalizes the total engagement. The updated DataFrame is saved.

In [28]:
# 4. Extracting Metrics and Log Normalization
import json

def extract_metrics(json_str):
    """Extract metrics from JSON string and handle possible errors."""
    keys = ['retweet_count', 'reply_count', 'like_count',
            'quote_count', 'bookmark_count', 'impression_count']
    try:
        metrics = json.loads(json_str.replace("'", '"'))
        # Ensure all keys are present, defaulting to 0 if not
        return {key: metrics.get(key, 0) for key in keys}
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}, input: {json_str}")
        return {key: 0 for key in keys}


# Apply the extract_metrics function
tweets_df['metrics'] = tweets_df['public_metrics'].apply(extract_metrics)

# Sum the metrics into a new column
tweets_df['total_engagement'] = tweets_df['metrics'].apply(
    lambda x: sum(x.values()))

# Log normalize the total engagement
tweets_df['log_engage'] = np.log1p(tweets_df['total_engagement'])

tweets_df.to_csv('TweetData/combined_tweets_data.csv', index=False)

**Cell 5**:
- **Generate Embeddings**: This cell defines a custom Dataset class for handling text data and a function to generate embeddings using a pre-trained RoBERTa model. It loads the processed dataset, verifies columns, loads the RoBERTa model and tokenizer, generates embeddings, and saves them to a .npy file.

In [30]:
# 5. Generate embeddings
import pandas as pd
import numpy as np
import torch
from transformers import RobertaModel, RobertaTokenizer
from torch.utils.data import DataLoader, Dataset
import os

# Ensure Numpy prints arrays completely
np.set_printoptions(threshold=np.inf)


class TextDataset(Dataset):
    """
    Custom Dataset class for handling text data.
    """

    def __init__(self, texts):
        """
        Initialize with a list of texts.
        """
        self.texts = texts

    def __len__(self):
        """
        Return the length of the dataset.
        """
        return len(self.texts)

    def __getitem__(self, idx):
        """
        Return the text at the given index.
        """
        return self.texts[idx]
    

def get_embeddings(model, tokenizer, texts, batch_size=16, device='cpu'):
    """
    Generate embeddings for a list of texts using a pre-trained RoBERTa model.
    
    Args:
    - model: Pre-trained RoBERTa model.
    - tokenizer: Corresponding tokenizer.
    - texts: List of texts to process.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - np.ndarray: Array of embeddings.
    """
    dataset = TextDataset(texts)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    model = model.to(device)
    all_embeddings = []

    for batch_texts in data_loader:
        try:
            inputs = tokenizer(batch_texts, return_tensors="pt",
                               padding=True, truncation=True, max_length=512)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = model(**inputs)
            embeddings = outputs.last_hidden_state.mean(
                dim=1).detach().cpu().numpy()
            all_embeddings.append(embeddings)
        except Exception as e:
            print(f"Error processing batch: {e}")

    return np.vstack(all_embeddings)


def save_embeddings(embeddings, file_name):
    """
    Save embeddings to a .npy file for efficient loading and use in PyG.
    
    Args:
    - embeddings: Embeddings to save.
    - file_name: Name of the file to save the embeddings.
    """
    np.save(file_name, embeddings)
    print(f"Embeddings saved successfully to {file_name}.")


if __name__ == "__main__":
    try:
        # Load preprocessed dataset
        print("Loading dataset...")
        tweets_df = pd.read_csv('TweetData/combined_tweets_data.csv')

        # Verify columns
        if 'processed_tweet' not in tweets_df.columns:
            raise ValueError("Processed tweet column not found in the dataset.")
        
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Load RoBERTa model and tokenizer
        print("Loading RoBERTa model and tokenizer...")
        model = RobertaModel.from_pretrained('roberta-base')
        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

        # Get the list of processed texts
        texts = tweets_df['processed_tweet'].tolist()

        # Generate embeddings
        print("Generating embeddings...")
        embeddings = get_embeddings(model, tokenizer, texts, device=device)
        print("Embeddings generation complete.")

        # Save embeddings to .npy file
        embeddings_file = "TweetData/roberta_tweets_embeddings.npy"
        print("Saving embeddings...")
        save_embeddings(embeddings, embeddings_file)
        print("Process completed successfully.")
    
    except Exception as e:
        print(f"An error occurred: {e}")

'''import numpy as np
embeddings = np.load("TweetData/roberta_tweets_embeddings.npy")
# to use embeddings for cosine similarity calculations
'''

Loading dataset...
Using device: cpu
Loading RoBERTa model and tokenizer...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generating embeddings...
Embeddings generation complete.
Saving embeddings...
Embeddings saved successfully to TweetData/roberta_tweets_embeddings.npy.
Process completed successfully.


**Cell 6**:
- **Identify and Remove Duplicate Tweets Using Embeddings**: This cell loads the embeddings and the original dataset, ensures they have the same length, computes pairwise cosine similarity, identifies duplicate tweets based on a similarity threshold, removes duplicates from the DataFrame and embeddings array, and saves the cleaned data.

To identify and remove duplicate tweets using the embeddings, we can follow these steps:

1. **Loading**: Loaded embeddings and the original dataset.
2. **Similarity Calculation**: Computed pairwise cosine similarity.
3. **Thresholding**: Defined a threshold to identify duplicates.
4. **Removing Duplicates**: Identified and removed duplicate tweets based on the similarity threshold, saving the cleaned data.

By following this process, we ensure that we retain unique tweets in the dataset, which will help in improving the accuracy and efficiency of the subsequent analysis, especially in a graph-based model where redundancy can lead to skewed results.

In [38]:
from scipy.spatial.distance import cdist
import numpy as np
import pandas as pd

# Load the embeddings and the original dataset
embeddings = np.load("TweetData/roberta_tweets_embeddings.npy")
tweets_df = pd.read_csv('TweetData/combined_tweets_data.csv')

# Ensure the embeddings and the DataFrame have the same length
assert len(embeddings) == len(
    tweets_df), "Mismatch between embeddings and DataFrame length."

# Compute pairwise cosine similarity
cosine_similarities = 1 - cdist(embeddings, embeddings, metric='cosine')

# Define the similarity threshold for considering tweets as duplicates
similarity_threshold = 0.99999999


def identify_duplicates(similarity_matrix, threshold):
    """
    Identify duplicates in the similarity matrix based on the given threshold.
    
    Args:
    - similarity_matrix: Pairwise cosine similarity matrix.
    - threshold: Cosine similarity threshold to consider tweets as duplicates.
    
    Returns:
    - List of indices of duplicate tweets to be removed.
    """
    num_tweets = similarity_matrix.shape[0]
    duplicates = set()

    for i in range(num_tweets):
        for j in range(i + 1, num_tweets):
            if similarity_matrix[i, j] > threshold:
                duplicates.add(j)

    return list(duplicates)


# Identify duplicate tweet indices
duplicate_indices = identify_duplicates(
    cosine_similarities, similarity_threshold)

# Remove duplicates from the DataFrame
tweets_df_no_duplicates = tweets_df.drop(
    index=duplicate_indices).reset_index(drop=True)

# Remove duplicates from the embeddings array
embeddings_no_duplicates = np.delete(embeddings, duplicate_indices, axis=0)

# Save the cleaned DataFrame
tweets_df_no_duplicates.to_csv(
    'TweetData/combined_tweets_no_duplicates.csv', index=False)

# Save the updated embeddings array to a new file
np.save("TweetData/roberta_tweets_embeddings_no_duplicates.npy",
        embeddings_no_duplicates)

print(f"Removed {len(duplicate_indices)} duplicate tweets. Cleaned data and embeddings saved successfully.")

Removed 352 duplicate tweets. Cleaned data and embeddings saved successfully.


**Cell 7**:
- **Load and Apply Sentiment Analysis Model**: This cell loads a pre-trained sentiment analysis model and tokenizer, performs sentiment analysis on the processed tweet texts, and saves the updated DataFrame with sentiment probabilities and the tensor containing sentiment probabilities.

The `cardiffnlp/twitter-roberta-base-sentiment` model is a pre-trained sentiment analysis model specifically fine-tuned on Twitter data. When you use this model to analyze text, it outputs probabilities for three sentiment classes: positive, neutral, and negative.

### How the Model Responds

When you pass a tweet or text to the model, it tokenizes the text and feeds it through the RoBERTa architecture. The model then outputs a tensor with probabilities for each sentiment class. Here's how you can use the model and interpret its response:

### Interpreting the Response

1. **Sentiment Labels**: The model provides a prediction for one of three sentiment labels:
    - **Negative**: Indicates a negative sentiment in the text.
    - **Neutral**: Indicates a neutral sentiment in the text.
    - **Positive**: Indicates a positive sentiment in the text.

2. **Probabilities**: The model outputs a list of probabilities corresponding to each sentiment class. These probabilities indicate the model's confidence in each class:
    - **Probability Distribution**: The probabilities for the negative, neutral, and positive classes will sum to 1. For instance, if the probabilities are `[0.1, 0.3, 0.6]`, it means the model assigns a 10% chance to negative sentiment, 30% to neutral, and 60% to positive.

### Example Interpretation

Given the text "I love using the new sentiment analysis model!", suppose the model returns the following probabilities:

```python
Probabilities: [0.05, 0.1, 0.85]
```

This output means:
- **Negative**: 5% chance
- **Neutral**: 10% chance
- **Positive**: 85% chance

The highest probability is for the positive class, so the predicted sentiment is **positive**. This indicates that the model is highly confident that the text expresses a positive sentiment.

### Summary

- **Model Response**: The model outputs a probability distribution over three sentiment classes: negative, neutral, and positive.
- **Interpretation**: The class with the highest probability is the predicted sentiment. The probabilities provide a measure of the model's confidence in each sentiment class.
- **Application**: This approach can be used to analyze the sentiment of tweets or any other short text inputs, making it particularly useful for social media analysis.

By using the `cardiffnlp/twitter-roberta-base-sentiment` model, you can accurately and efficiently gauge the sentiment of tweets, which can be valuable for various applications, such as monitoring public opinion, tracking brand sentiment, and more.

In [40]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np


def load_sentiment_model():
    """
    Load the sentiment analysis model and tokenizer.
    
    Returns:
    - tokenizer: Pre-trained tokenizer for sentiment analysis.
    - model: Pre-trained sentiment analysis model.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            "cardiffnlp/twitter-roberta-base-sentiment")
        model = AutoModelForSequenceClassification.from_pretrained(
            "cardiffnlp/twitter-roberta-base-sentiment")
        print("Sentiment model and tokenizer loaded successfully.")
        return tokenizer, model
    except Exception as e:
        print(f"Error loading sentiment model or tokenizer: {e}")
        raise


def sentiment_analysis(texts, tokenizer, model, batch_size=16, device='cpu'):
    """
    Perform sentiment analysis on a list of texts using a pre-trained model.
    
    Args:
    - texts: List of texts to analyze.
    - tokenizer: Pre-trained tokenizer.
    - model: Pre-trained sentiment analysis model.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - torch.Tensor: Tensor containing sentiment probabilities for each text.
    """
    try:
        model = model.to(device)
        all_scores = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            encoded_input = tokenizer(
                batch_texts, return_tensors='pt', truncation=True, max_length=512, padding=True)
            encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
            with torch.no_grad():
                output = model(**encoded_input)
            scores = torch.nn.functional.softmax(
                output.logits, dim=-1).cpu()
            all_scores.append(scores)
        return torch.cat(all_scores, dim=0)
    except Exception as e:
        print(f"Error performing sentiment analysis: {e}")
        raise


def load_tweets(file_path):
    """
    Load the tweets and embeddings from a CSV file.
    
    Args:
    - file_path: Path to the CSV file containing the tweets and embeddings.
    
    Returns:
    - pd.DataFrame: DataFrame containing the tweets and their embeddings.
    """
    try:
        tweets_df = pd.read_csv(file_path)
        print(f"Loaded {len(tweets_df)} tweets from {file_path}.")
        return tweets_df
    except FileNotFoundError as e:
        print(f"Error loading file: {e}")
        raise
    except pd.errors.ParserError as e:
        print(f"Error parsing file: {e}")
        raise


def apply_sentiment_analysis(tweets_df, tokenizer, model, batch_size=16, device='cpu'):
    """
    Apply sentiment analysis to each tweet in the DataFrame.
    
    Args:
    - tweets_df: DataFrame containing the tweets.
    - tokenizer: Pre-trained tokenizer.
    - model: Pre-trained sentiment analysis model.
    - batch_size: Batch size for processing.
    - device: Device to run the model on ('cpu' or 'cuda').

    Returns:
    - pd.DataFrame: Updated DataFrame with sentiment probabilities.
    - torch.Tensor: Tensor containing sentiment probabilities.
    """
    try:
        texts = tweets_df['processed_tweet'].tolist()
        sentiments = sentiment_analysis(
            texts, tokenizer, model, batch_size, device)
        sentiments_df = pd.DataFrame(
            sentiments.numpy(), columns=['positive', 'neutral', 'negative'])
        tweets_df = pd.concat([tweets_df, sentiments_df], axis=1)
        print("Sentiment analysis applied to all tweets.")
        return tweets_df, sentiments
    except Exception as e:
        print(f"Error applying sentiment analysis: {e}")
        raise


def save_updated_dataframe(tweets_df, file_path):
    """
    Save the updated DataFrame with sentiment probabilities to a CSV file.
    
    Args:
    - tweets_df: DataFrame containing the updated tweets.
    - file_path: Path to save the updated CSV file.
    """
    try:
        tweets_df.to_csv(file_path, index=False)
        print(f"Updated DataFrame saved to {file_path}.")
    except Exception as e:
        print(f"Error saving updated DataFrame: {e}")
        raise


def save_tensor(tensor, file_path):
    """
    Save the tensor to a file.
    
    Args:
    - tensor: Tensor to save.
    - file_path: Path to save the tensor file.
    """
    try:
        torch.save(tensor, file_path)
        print(f"Tensor saved to {file_path}.")
    except Exception as e:
        print(f"Error saving tensor: {e}")
        raise


def verify_saved_file(file_path):
    """
    Verify the integrity of the saved CSV file by loading it and checking the first few rows.
    
    Args:
    - file_path: Path to the saved CSV file.
    """
    try:
        df = pd.read_csv(file_path)
        print(f"Verification successful. First few rows of {file_path}:")
        print(df.head())
    except Exception as e:
        print(f"Error verifying saved file: {e}")
        raise


if __name__ == "__main__":
    try:
        # Load the sentiment analysis model and tokenizer
        tokenizer, model = load_sentiment_model()

        # Load the tweets DataFrame
        tweets_df = load_tweets('TweetData/combined_tweets_no_duplicates.csv')

        # Determine device
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")

        # Apply sentiment analysis to the DataFrame
        tweets_df, sentiments_tensor = apply_sentiment_analysis(
            tweets_df, tokenizer, model, device=device)

        # Save the updated DataFrame
        save_updated_dataframe(
            tweets_df, 'TweetData/roberta_tweets_sentiments.csv')

        # Save the tensor containing sentiment probabilities
        save_tensor(sentiments_tensor,
                    'TweetData/roberta_tweets_sentiments_tensor.pt')

        # Verify the saved file
        verify_saved_file('TweetData/roberta_tweets_sentiments.csv')

    except Exception as e:
        print(f"An error occurred in the main execution block: {e}")

Sentiment model and tokenizer loaded successfully.
Loaded 1174 tweets from TweetData/combined_tweets_no_duplicates.csv.
Using device: cpu
Sentiment analysis applied to all tweets.
Updated DataFrame saved to TweetData/roberta_tweets_sentiments.csv.
Tensor saved to TweetData/roberta_tweets_sentiments_tensor.pt.
Verification successful. First few rows of TweetData/roberta_tweets_sentiments.csv:
                                      public_metrics  \
0  {'retweet_count': 6, 'reply_count': 0, 'like_c...   
1  {'retweet_count': 0, 'reply_count': 0, 'like_c...   
2  {'retweet_count': 2, 'reply_count': 0, 'like_c...   
3  {'retweet_count': 20, 'reply_count': 0, 'like_...   
4  {'retweet_count': 2, 'reply_count': 0, 'like_c...   

                                                text  conversation_id  \
0  RT @RestartProject: It appears the government ...     1.767657e+18   
1  @davidfickling @IEA Rich countries import moun...     1.767438e+18   
2  RT @M_Star_Online: Government failing to take 

**Cell 8**:
- **PCA and Normalization**: This cell applies dimensionality reduction using PCA to the embeddings and normalizes them using StandardScaler. It retains 95% variance, saves the updated embeddings, updates the DataFrame with PCA embeddings, and saves the DataFrame.

### PCA and normalization:
Apply dimensionality reduction using Principal Component Analysis (PCA), and normalization. This will help to spread out the embeddings in the feature space, reducing the likelihood of high similarity among embeddings and making connections more meaningful. 

In [59]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the tweet embeddings and the original dataset
tweet_embeddings = np.load(
    "TweetData/roberta_tweets_embeddings_no_duplicates.npy")
tweets_df = pd.read_csv('TweetData/roberta_tweets_sentiments.csv')

# Ensure the embeddings and the DataFrame have the same length
assert len(tweet_embeddings) == len(
    tweets_df), "Mismatch between tweet embeddings and DataFrame length."

# Standardize the embeddings
scaler = StandardScaler()
tweet_embeddings_scaled = scaler.fit_transform(tweet_embeddings)

# Apply PCA to reduce dimensionality while retaining 95% variance
pca = PCA(n_components=0.95)
tweet_embeddings_pca = pca.fit_transform(tweet_embeddings_scaled)

# Save the updated embeddings
np.save("TweetData/roberta_tweets_embeddings_pca.npy", tweet_embeddings_pca)
print(f"Updated tweet embeddings saved to 'TweetData/roberta_tweets_embeddings_pca.npy'.")

# Update the DataFrame with PCA embeddings
tweets_df['pca_embeddings'] = list(tweet_embeddings_pca)
tweets_df.to_csv('TweetData/updated_tweets_with_pca.csv', index=False)

print("DataFrame with PCA embeddings saved successfully.")

Updated tweet embeddings saved to 'TweetData/roberta_tweets_embeddings_pca.npy'.
DataFrame with PCA embeddings saved successfully.


**Cell 9**:
- **Drop Unnecessary Columns and Convert Date Format**: This cell drops specified columns from the DataFrame, converts the 'created_at' column to datetime format, and saves the modified DataFrame. It prints the DataFrame for troubleshooting.

In [61]:
import pandas as pd

# Load the dataset
tweets_df = pd.read_csv('TweetData/updated_tweets_with_pca.csv')

# Drop the specified columns
columns_to_drop = ['public_metrics', 'text', 'edit_history_tweet_ids', 'lang',
                   'referenced_tweets', 'in_reply_to_user_id', 'author_id', 'context_annotations',
                   'geo', 'metrics', 'processed_tweet']
tweets_df.drop(columns=columns_to_drop, inplace=True)

# Convert 'created_at' column to datetime format and focus on month, day, and year
tweets_df['created_at'] = pd.to_datetime(
    tweets_df['created_at']).dt.strftime('%Y-%m-%d')

# Save the modified DataFrame
output_path = 'TweetData/tweets.csv'
tweets_df.to_csv(output_path, index=False)

# Output for troubleshooting
print(f"DataFrame saved to {output_path}")
print(tweets_df.head())

DataFrame saved to TweetData/tweets.csv
   conversation_id  created_at             tweet_id  total_engagement  \
0     1.767657e+18  2024-03-12  1767656760607719717                 6   
1     1.767438e+18  2024-03-12  1767645630338584718                 4   
2     1.767644e+18  2024-03-12  1767643949961785639                 2   
3     1.767640e+18  2024-03-12  1767640057232470519                20   
4     1.767639e+18  2024-03-12  1767639365520388344                 2   

   log_engage  positive   neutral  negative  \
0    1.945910  0.321021  0.646973  0.032006   
1    1.609438  0.452236  0.497398  0.050366   
2    1.098612  0.872248  0.120716  0.007036   
3    3.044522  0.266101  0.707849  0.026050   
4    1.098612  0.076413  0.463236  0.460351   

                                      pca_embeddings  
0  [ 1.23029685e+00  2.54166198e+00 -5.97019196e+...  
1  [-3.64204764e+00 -5.98940229e+00 -2.60534310e+...  
2  [ 2.38513160e+00  5.06190443e+00  2.03545779e-...  
3  [-1.23132029e+0

**Cell 10**:
- **Convert Date to Unix Timestamp**: This cell converts the 'created_at' column to Unix timestamp format and saves the updated DataFrame. It prints the DataFrame and its data types for verification.

In [63]:
import pandas as pd

# Load the dataset
tweets_df = pd.read_csv('TweetData/tweets.csv')

# Convert 'created_at' to datetime and then to int64
tweets_df['created_at'] = pd.to_datetime(tweets_df['created_at'])
tweets_df['created_at'] = tweets_df['created_at'].astype(
    'int64') // 10**9  # Convert to Unix timestamp in seconds

# Save the updated DataFrame
tweets_df.to_csv('TweetData/final_tweets.csv', index=False)

# Output for verification
print(f"DataFrame saved to 'TweetData/final_tweets.csv'")
print(tweets_df.head())
print(tweets_df.dtypes)

DataFrame saved to 'TweetData/final_tweets.csv'
   conversation_id  created_at             tweet_id  total_engagement  \
0     1.767657e+18  1710201600  1767656760607719717                 6   
1     1.767438e+18  1710201600  1767645630338584718                 4   
2     1.767644e+18  1710201600  1767643949961785639                 2   
3     1.767640e+18  1710201600  1767640057232470519                20   
4     1.767639e+18  1710201600  1767639365520388344                 2   

   log_engage  positive   neutral  negative  \
0    1.945910  0.321021  0.646973  0.032006   
1    1.609438  0.452236  0.497398  0.050366   
2    1.098612  0.872248  0.120716  0.007036   
3    3.044522  0.266101  0.707849  0.026050   
4    1.098612  0.076413  0.463236  0.460351   

                                      pca_embeddings  
0  [ 1.23029685e+00  2.54166198e+00 -5.97019196e+...  
1  [-3.64204764e+00 -5.98940229e+00 -2.60534310e+...  
2  [ 2.38513160e+00  5.06190443e+00  2.03545779e-...  
3  [-1.231

**Cell 11**:
- **Final Data Verification**: This cell displays data types, summary statistics for numerical columns, identifies missing values, displays the first few rows of the DataFrame, and prints the total number of tweets.

In [64]:
# Load the dataset
tweets_df = pd.read_csv('TweetData/final_tweets.csv')

# Display data types
print("Data Types:")
print(tweets_df.dtypes)
print("\n")

# Display summary statistics for numerical columns
print("Summary Statistics:")
print(tweets_df.describe())
print("\n")

# Identify missing values
print("Missing Values:")
missing_values = tweets_df.isnull().sum()
print(missing_values)
print("\n")

# Display the first few rows of the dataframe to understand the data format
print("Data Format (first few rows):")
print(tweets_df.head())

print(f"Number of tweets: {tweets_df.shape[0]}")

Data Types:
conversation_id     float64
created_at            int64
tweet_id              int64
total_engagement      int64
log_engage          float64
positive            float64
neutral             float64
negative            float64
pca_embeddings       object
dtype: object


Summary Statistics:
       conversation_id    created_at      tweet_id  total_engagement  \
count     1.144000e+03  1.174000e+03  1.174000e+03       1174.000000   
mean      1.762844e+18  1.709111e+09  1.762955e+18        381.459114   
std       5.444208e+15  9.344493e+05  3.920997e+15       3782.488004   
min       1.634546e+18  1.707782e+09  1.757409e+18          0.000000   
25%       1.758884e+18  1.708128e+09  1.758851e+18          4.000000   
50%       1.765360e+18  1.709683e+09  1.765383e+18         14.000000   
75%       1.766548e+18  1.710007e+09  1.766610e+18         54.750000   
max       1.767657e+18  1.710202e+09  1.767657e+18      78533.000000   

        log_engage     positive      neutral     ne