# What is Sentiment Analysis?
Sentiment analysis is the process of analyzing text to determine its emotional tone. In this notebook, each text is labeled positive, neutral, or negative.

Credit: https://www.datacamp.com/tutorial/text-analytics-beginners-nltk    

Dataset: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset

# Preprocessing

Before we input our data (the tweets) into our sentiment analysis, we will need to clean the text.

## **Tokenization:**
Tokenization is the first step in preprocessing our text. It breaks each tweet down into smaller units (words, subwords, or characters) so that we can analyze the individual words for our sentiment analysis. In our case, each tweet becomes a Python list of word tokens.
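To make the idea concrete, here is a minimal sketch of word-level tokenization using only a regular expression. It is a simplified stand-in, not NLTK's `word_tokenize` (which handles contractions and other edge cases differently); the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Sooo SAD I will miss you here in San Diego!!!")
print(tokens)
# ['sooo', 'sad', 'i', 'will', 'miss', 'you', 'here', 'in', 'san', 'diego', '!', '!', '!']
```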

## **Removing Stop Words:**
Tweets include many words that carry little to no sentiment, such as "and", "the", "of", "it", "from", and "to". These are called stop words. Removing them reduces noise that could otherwise make our analysis less accurate.
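A minimal sketch of stop word removal, assuming a tiny hand-written stop word set built from the examples above. The real NLTK list used later in this notebook is much longer; `STOP_WORDS` and `remove_stop_words` here are illustrative names only.

```python
# Illustrative subset only; not the NLTK English stop word list.
STOP_WORDS = {"and", "the", "of", "it", "from", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]

filtered = remove_stop_words(["the", "start", "of", "it", "was", "great"])
print(filtered)
# ['start', 'was', 'great']
```

Using a `set` keeps each membership check fast, which matters when the same filter runs over tens of thousands of tweets.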

## **Stemming and Lemmatization:**
Stemming reduces words to a base form by stripping suffixes. We then use that base form to decide whether the word is positive or negative. Because stemming follows simple rules, it sometimes produces forms that are not real words. Lemmatization also reduces words to a base form, but it takes each word's part of speech into account. This takes more time to process, but the result is a real dictionary word that better represents the original meaning during the analysis.
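The difference can be illustrated with a toy example. The sketch below is not NLTK's stemmer or `WordNetLemmatizer`: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hypothetical dictionary keyed by word and part of speech.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table: (word, part of speech) -> dictionary form.
TOY_LEMMAS = {
    ("studies", "noun"): "study",
    ("running", "verb"): "run",
    ("better", "adj"): "good",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("studies"), toy_lemmatize("studies", "noun"))
# studi study
print(toy_stem("running"), toy_lemmatize("running", "verb"))
# runn run
```

The stems `studi` and `runn` are not real words, while the lemmas are, which is the trade-off described above.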

## **Step 1: Load the Dataset and Libraries**


In [None]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import seaborn as sns

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

In [None]:
df = pd.read_csv('/content/Tweets.csv')
df['text'] = df['text'].astype(str)
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


# Step 2: Preprocess text

In [None]:
# Binary ground truth: 1 if the tweet is labeled positive, 0 otherwise (neutral or negative)
df['positive'] = (df['sentiment'] == 'positive').astype(int)
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,0
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,0
2,088c60f138,my boss is bullying me...,bullying me,negative,0
3,9642c003ef,what interview! leave me alone,leave me alone,negative,0
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,0


In [None]:
# Build these once; rebuilding them for every token or tweet is slow
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text

# Apply the function to every tweet in the DataFrame
df['text'] = df['text'].apply(preprocess)
df

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"` responded , going","I`d have responded, if I were going",neutral,0
1,549e992a42,sooo sad miss san diego ! ! !,Sooo SAD,negative,0
2,088c60f138,bos bullying ...,bullying me,negative,0
3,9642c003ef,interview ! leave alone,leave me alone,negative,0
4,358bd9e861,"son * * * * , ` put release already bought","Sons of ****,",negative,0
...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,negative,0
27477,4f4c4fc327,"` wondered rake . client made clear .net , ` f...",", don`t force",negative,0
27478,f67aae2310,yay good . enjoy break - probably need hectic ...,Yay good for both of you.,positive,1
27479,ed167662a5,worth * * * * .,But it was worth it ****.,positive,1


# Step 3: Predict and evaluate the model

In [None]:
analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

df['sentiment'] = df['text'].apply(get_sentiment)
df

Unnamed: 0,textID,text,selected_text,sentiment,positive
0,cb774db0d1,"` responded , going","I`d have responded, if I were going",0,0
1,549e992a42,sooo sad miss san diego ! ! !,Sooo SAD,0,0
2,088c60f138,bos bullying ...,bullying me,0,0
3,9642c003ef,interview ! leave alone,leave me alone,0,0
4,358bd9e861,"son * * * * , ` put release already bought","Sons of ****,",0,0
...,...,...,...,...,...
27476,4eac33d1c0,wish could come see u denver husband lost job ...,d lost,1,0
27477,4f4c4fc327,"` wondered rake . client made clear .net , ` f...",", don`t force",1,0
27478,f67aae2310,yay good . enjoy break - probably need hectic ...,Yay good for both of you.,1,1
27479,ed167662a5,worth * * * * .,But it was worth it ****.,1,1


In [None]:
phrase = input("Enter your phrase to be analyzed: ")

Enter your phrase to be analyzed: I love college


In [None]:
# positive = 1, negative = 0
processed = preprocess(phrase)
print(get_sentiment(processed))

1


In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(df['positive'], df['sentiment']))

[[10546  8353]
 [  796  7786]]


In [None]:
from sklearn.metrics import classification_report
print(classification_report(df['positive'], df['sentiment']))

              precision    recall  f1-score   support

           0       0.93      0.56      0.70     18899
           1       0.48      0.91      0.63      8582

    accuracy                           0.67     27481
   macro avg       0.71      0.73      0.66     27481
weighted avg       0.79      0.67      0.68     27481
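The numbers in the classification report can be recomputed by hand from the confusion matrix printed above, where rows are the true labels and columns are the predicted labels. The short sketch below does this for the positive class; the variable names are my own.

```python
# Values copied from the VADER confusion matrix above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 10546, 8353
fn, tp = 796, 7786

precision = tp / (tp + fp)                  # of tweets predicted positive, how many were positive
recall = tp / (tp + fn)                     # of truly positive tweets, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all tweets classified correctly
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.48 recall=0.91 f1=0.63 accuracy=0.67
```

The high recall and low precision follow from the rule in `get_sentiment`: any tweet with a `pos` score above 0 is labeled positive, so most positive tweets are caught, but 8,353 non-positive tweets are labeled positive as well.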



# Modern Approach: Transformers
**But why are Transformers so effective for sentiment analysis?**
<br>
• **Multi-Head Attention Mechanism**: This allows the model to attend to different parts of the input text at the same time, capturing interactions and long-range dependencies between words that are vital for accurate sentiment classification. <br>
• **Positional Encoding**: By explicitly encoding the position of each word, transformers retain word-order information, so they can tell apart phrases that use the same words in a different order, such as "good, not bad" and "bad, not good". <br>
• **Vector Embeddings**: Transformer models produce contextual vector embeddings that go beyond simple word co-occurrence, capturing deeper semantic relationships and enabling a more nuanced understanding of sentiment. <br>
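As a rough illustration of the attention idea, here is a minimal single-head scaled dot-product attention in pure Python. It is a sketch under simplifying assumptions: real transformers apply learned projection matrices to obtain queries, keys, and values and run several heads in parallel, and the 2-dimensional token vectors below are made up.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Made-up embeddings for the tokens "not", "very", "good"
x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs, weights = attention(x, x, x)  # self-attention: Q = K = V = x
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sentence, and each output vector is a mix of all tokens' values according to those weights.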

# Install Libraries

In [None]:
!pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import softmax
import torch
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score



Set up the transformer model to be used for sentiment analysis.

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL, local_files_only=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, local_files_only=False)


In [None]:
# --- Data Loading and Preparation ---
DATA_PATH = '/content/Tweets.csv'
df_original = pd.read_csv(DATA_PATH)
df_original['text'] = df_original['text'].astype(str)
texts_original = df_original['text'].tolist()

# --- Sentiment Analysis ---
results_original = []
for text in texts_original:
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        output = model(**encoded_input)
    scores = output.logits[0].cpu().numpy()  # Move logits to CPU for numpy
    probabilities = softmax(scores)

    sentiment = {
        'text': text,
        'roberta_negative': float(probabilities[0]),
        'roberta_neutral': float(probabilities[1]),
        'roberta_positive': float(probabilities[2]),
        'predicted_sentiment': ['negative', 'neutral', 'positive'][probabilities.argmax()]
    }
    results_original.append(sentiment)
sentiment_df_original = pd.DataFrame(results_original)

# --- Evaluation ---
LABEL_MAP = {'negative': 0, 'neutral': 1, 'positive': 2}
sentiment_df_original['pred_label'] = sentiment_df_original['predicted_sentiment'].map(LABEL_MAP)
y_pred = sentiment_df_original['pred_label']
y_true = df_original['sentiment'].map(LABEL_MAP) # Ensure ground truth labels are also mapped

print(f"\n--- Evaluation on {len(y_true)} Tweets ---")
print("\nConfusion Matrix:\n", confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
print("\nClassification Report:\n", classification_report(y_true, y_pred, target_names=['negative', 'neutral', 'positive'], labels=[0, 1, 2], zero_division=0))
print("\nAccuracy:", accuracy_score(y_true, y_pred))


--- Evaluation on 1000 Tweets ---

Confusion Matrix:
 [[236  45  13]
 [ 73 215 103]
 [  8  33 274]]

Classification Report:
               precision    recall  f1-score   support

    negative       0.74      0.80      0.77       294
     neutral       0.73      0.55      0.63       391
    positive       0.70      0.87      0.78       315

    accuracy                           0.72      1000
   macro avg       0.73      0.74      0.73      1000
weighted avg       0.73      0.72      0.72      1000


Accuracy: 0.725
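The loop above turns the model's raw logits into probabilities with `softmax` and then picks the label with the highest probability. A minimal pure-Python sketch of that step follows; the logits are made-up values for illustration, not real model output.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # same order as in the cell above

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 0.3, 2.1]  # hypothetical logits for one tweet
probabilities = softmax(example_logits)
predicted = LABELS[probabilities.index(max(probabilities))]

print([round(p, 3) for p in probabilities])
print(predicted)  # positive
```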


In [None]:
!pip install huggingface_hub[hf_xet]

Collecting hf-xet>=0.1.4 (from huggingface_hub[hf_xet])
  Downloading hf_xet-1.0.5-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (494 bytes)
Downloading hf_xet-1.0.5-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (54.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 MB 15.9 MB/s eta 0:00:00
Installing collected packages: hf-xet
Successfully installed hf-xet-1.0.5


In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from tqdm.notebook import tqdm  # For progress bar
import numpy as np # Import numpy for argmax

# --- Check for GPU Availability and Set Device ---
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

# --- Model and Tokenizer Loading (Using a fine-tuned sentiment model) ---
# Use a model specifically fine-tuned for sentiment analysis
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device) # Move model to GPU
print("Model and Tokenizer Loaded.") # Added confirmation

# --- Data Loading and Preparation (Keep all sentiments) ---
DATA_PATH = '/content/Tweets.csv' # Make sure this path is correct
print(f"Loading data from: {DATA_PATH}")
df_original = pd.read_csv(DATA_PATH)

# Handle potential missing values in 'text' or 'sentiment'
print(f"Original data shape: {df_original.shape}")
df_original.dropna(subset=['text', 'sentiment'], inplace=True)
print(f"Data shape after dropping NA in text/sentiment: {df_original.shape}")
df_original['text'] = df_original['text'].astype(str)

# We will predict on all sentiments now, as the model supports it
texts = df_original['text'].tolist()
true_sentiments = df_original['sentiment'].tolist() # Keep original labels for evaluation
print(f"Number of texts to process: {len(texts)}")

# --- Sentiment Analysis (Batched for GPU Efficiency) ---
results = []
batch_size = 128  # Lower this (e.g., to 32) if you run into GPU memory errors
print(f"Processing in batches of size: {batch_size}")

# --- Get Labels from Model Config ---
# The model config stores the label mapping learned during fine-tuning
config_labels = model.config.id2label
print(f"Model labels based on config: {config_labels}")
# Example mapping for cardiffnlp model (usually):
# {0: 'negative', 1: 'neutral', 2: 'positive'} # Or sometimes label_0, label_1 etc.

for i in tqdm(range(0, len(texts), batch_size), desc="Processing batches"):
    batch_texts = texts[i:i + batch_size]
    encoded_inputs = tokenizer(batch_texts, return_tensors='pt', truncation=True, padding=True, max_length=512).to(device) # Move inputs to GPU

    with torch.no_grad():
        outputs = model(**encoded_inputs)
    logits = outputs.logits.cpu().numpy() # Move logits back to CPU for numpy
    probabilities = softmax(logits, axis=1)

    for j, text in enumerate(batch_texts):
        scores = probabilities[j]

        predicted_label_index = np.argmax(scores)
        # Use the mapping from the model's config
        predicted_sentiment = config_labels[predicted_label_index]

        # Store scores dynamically based on config labels
        sentiment_scores = {f'roberta_{config_labels[k]}': float(scores[k]) for k in range(scores.shape[0])}

        result = {
            'text': text,
            **sentiment_scores, # Unpack the scores dictionary
            'predicted_sentiment': predicted_sentiment
        }
        results.append(result)

sentiment_df = pd.DataFrame(results)
print("Batch processing complete.")

# Merge predicted sentiments back with original data for easier evaluation
# Ensure indices align if you didn't reset_index after dropna
df_original = df_original.reset_index(drop=True)
sentiment_df = sentiment_df.reset_index(drop=True)

# Add predicted sentiment to the original dataframe (use join if indices might mismatch)
df_original['predicted_sentiment'] = sentiment_df['predicted_sentiment']

# --- Evaluation (Multi-Class Classification) ---
print("\n--- Evaluation on Sentiment (Negative/Neutral/Positive) ---")

# Filter out rows where true sentiment might be missing if any slipped through (redundant if dropna worked)
eval_df = df_original.dropna(subset=['sentiment', 'predicted_sentiment'])

y_true = eval_df['sentiment']
y_pred = eval_df['predicted_sentiment']

# Check if there are any predicted labels not present in the true labels (or vice-versa)
# Map model output labels (e.g., 'negative') to match dataset labels if they differ (e.g., 'Negative') - case insensitive check
# Assuming dataset labels are 'negative', 'neutral', 'positive' - adjust if needed
true_labels_set = set(s.lower() for s in y_true.unique())
pred_labels_set = set(s.lower() for s in y_pred.unique())

print(f"Unique True Labels (lower): {sorted(list(true_labels_set))}")
print(f"Unique Predicted Labels (lower): {sorted(list(pred_labels_set))}")

# Use labels from the dataset as the ground truth standard
all_labels = sorted(list(y_true.unique()))
print(f"Labels used for report (from true labels): {all_labels}")


# Ensure labels parameter includes all possible labels encountered in y_true
print("\nConfusion Matrix:\n", confusion_matrix(y_true, y_pred, labels=all_labels))
# Use target_names based on the sorted true labels for clarity in the report
print("\nClassification Report:\n", classification_report(y_true, y_pred, labels=all_labels, target_names=all_labels, zero_division=0))
print("\nAccuracy:", accuracy_score(y_true, y_pred))

# Display the first few results including scores
print("\n--- Sample Results ---")
# Add the score columns to the preview
score_cols = [col for col in sentiment_df.columns if col.startswith('roberta_')]
print(sentiment_df[['text', 'predicted_sentiment'] + score_cols].head())

# Display counts of predicted vs true
print("\n--- Value Counts ---")
print("True Sentiment Counts:")
print(y_true.value_counts())
print("\nPredicted Sentiment Counts:")
print(y_pred.value_counts())

Using GPU: Tesla T4


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model and Tokenizer Loaded.
Loading data from: /content/Tweets.csv
Original data shape: (27481, 4)
Data shape after dropping NA in text/sentiment: (27480, 4)
Number of texts to process: 27480
Processing in batches of size: 128
Model labels based on config: {0: 'negative', 1: 'neutral', 2: 'positive'}


Processing batches:   0%|          | 0/215 [00:00<?, ?it/s]

Batch processing complete.

--- Evaluation on Sentiment (Negative/Neutral/Positive) ---
Unique True Labels (lower): ['negative', 'neutral', 'positive']
Unique Predicted Labels (lower): ['negative', 'neutral', 'positive']
Labels used for report (from true labels): ['negative', 'neutral', 'positive']

Confusion Matrix:
 [[6246  992  543]
 [2479 5613 3025]
 [ 343  681 7558]]

Classification Report:
               precision    recall  f1-score   support

    negative       0.69      0.80      0.74      7781
     neutral       0.77      0.50      0.61     11117
    positive       0.68      0.88      0.77      8582

    accuracy                           0.71     27480
   macro avg       0.71      0.73      0.71     27480
weighted avg       0.72      0.71      0.70     27480


Accuracy: 0.7065866084425036

--- Sample Results ---
                                                text predicted_sentiment  \
0                I`d have responded, if I were going             neutral   
1      Sooo S
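The batched loop in the cell above steps through the texts in slices of `batch_size`, which is why the progress bar reports 215 batches for 27,480 texts. The slicing logic can be sketched on its own, without the model; `iter_batches` is a hypothetical helper name.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_texts, batch_size = 27480, 128
batches = list(iter_batches(list(range(n_texts)), batch_size))

print(len(batches))                      # 215
print(len(batches[-1]))                  # 88 (the last batch is smaller)
print(math.ceil(n_texts / batch_size))   # 215
```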

In [None]:
# --- Evaluation (Binary Comparison - Positive vs. Non-Positive) ---
print("\n--- Evaluation (Binary Comparison - Positive vs. Non-Positive) ---")

# Map RoBERTa's predictions to binary: 1 if 'positive', 0 otherwise
binary_pred = np.where(eval_df['predicted_sentiment'] == 'positive', 1, 0)

# Create binary true labels: 1 if 'positive', 0 otherwise
binary_true = np.where(eval_df['sentiment'] == 'positive', 1, 0)

print("\nBinary Confusion Matrix (RoBERTa):\n", confusion_matrix(binary_true, binary_pred, labels=[0, 1]))
print("\nBinary Classification Report (RoBERTa):\n", classification_report(binary_true, binary_pred, target_names=['non-positive', 'positive'], labels=[0, 1], zero_division=0))
print("\nBinary Accuracy (RoBERTa):", accuracy_score(binary_true, binary_pred))

# Compare this Binary Accuracy (RoBERTa) to the VADER accuracy (67%)


--- Evaluation (Binary Comparison - Positive vs. Non-Positive) ---

Binary Confusion Matrix (RoBERTa):
 [[15330  3568]
 [ 1024  7558]]

Binary Classification Report (RoBERTa):
               precision    recall  f1-score   support

non-positive       0.94      0.81      0.87     18898
    positive       0.68      0.88      0.77      8582

    accuracy                           0.83     27480
   macro avg       0.81      0.85      0.82     27480
weighted avg       0.86      0.83      0.84     27480


Binary Accuracy (RoBERTa): 0.8328966521106259
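The binary matrix above can also be derived directly from the three-class confusion matrix printed earlier by merging the negative and neutral classes. The sketch below, with a hypothetical helper `collapse_to_binary`, reproduces the same counts (rows are true labels, columns are predicted labels).

```python
# Three-class confusion matrix from the RoBERTa run (order: negative, neutral, positive)
three_class = [
    [6246,  992,  543],
    [2479, 5613, 3025],
    [ 343,  681, 7558],
]
POSITIVE = 2  # index of the positive class

def collapse_to_binary(matrix, positive_index):
    """Merge all non-positive classes into one and return [[TN, FP], [FN, TP]]."""
    tn = fp = fn = tp = 0
    for i, row in enumerate(matrix):          # i = true class
        for j, count in enumerate(row):       # j = predicted class
            if i == positive_index and j == positive_index:
                tp += count
            elif i == positive_index:
                fn += count
            elif j == positive_index:
                fp += count
            else:
                tn += count
    return [[tn, fp], [fn, tp]]

binary = collapse_to_binary(three_class, POSITIVE)
binary_accuracy = (binary[0][0] + binary[1][1]) / sum(sum(row) for row in binary)

print(binary)                     # [[15330, 3568], [1024, 7558]]
print(round(binary_accuracy, 4))  # 0.8329
```

Confusions between negative and neutral no longer count as errors once the two classes are merged, which is why binary accuracy (0.83) is higher than three-class accuracy (0.71).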


In [None]:
# An Exciting Application of Sentiment Analysis: Game Development

In [None]:
analyzer = SentimentIntensityAnalyzer()

def detect_sentiment(text):
    scores = analyzer.polarity_scores(text)
    compound = scores['compound']
    if compound >= 0.2:
        return "positive"
    elif compound <= -0.2:
        return "negative"
    else:
        return "neutral"


In [None]:
def intro():
    print("Welcome to the Emotion Escape Room!")
    print("You wake up in a dark room. A faint light glows from under a door.")
    input("Press Enter to continue...")
    first_choice()

def first_choice():
    print("\nThere’s a voice that says: 'Speak your truth to pass.'")
    action = input("What do you say? ")

    sentiment = detect_sentiment(action)
    print(f"(Detected sentiment: {sentiment})")

    if sentiment == "positive":
        print("A warm light glows. The door unlocks.")
        hallway()
    elif sentiment == "negative":
        print("A trapdoor opens beneath you and you fall.")
        pitfall()
    else:
        print("Nothing happens. The room remains still.")
        retry(first_choice)

def hallway():
    print("\nYou walk into a hallway lined with mirrors.")
    action = input("What do you say as you walk? ")

    sentiment = detect_sentiment(action)
    print(f"(Detected sentiment: {sentiment})")

    if sentiment == "negative":
        print("One mirror shatters. A shadow steps out...")
        print("You are consumed by doubt. Game over.")
        end_game()
    elif sentiment == "positive":
        print("The mirrors reflect a peaceful path forward.")
        final_room()
    else:
        print("A mirror asks: 'Are you ready to confront yourself?'")
        retry(hallway)

def pitfall():
    print("\nYou land in a pit of snakes... But one offers you a deal.")
    action = input("Speak to the snake: ")

    sentiment = detect_sentiment(action)
    print(f"(Detected sentiment: {sentiment})")

    if sentiment == "positive":
        print("The snake smiles. It lifts you to safety.")
        hallway()
    else:
        print("You’re not worthy of trust. You perish.")
        end_game()

def final_room():
    print("\nYou reach the final chamber. A riddle appears on the wall:")
    print("'What frees the mind and opens all locks?'")
    answer = input("Your answer: ")

    sentiment = detect_sentiment(answer)
    print(f"(Detected sentiment: {sentiment})")

    if sentiment == "positive":
        print("🎉 The door opens to sunlight. You have escaped!")
    else:
        print("🚪 The door remains shut. Try again later.")
    end_game()

def retry(func):
    print("Try expressing yourself differently...")
    func()

def end_game():
    print("\n💫 Thank you for playing the Emotion Escape Room!")

intro()
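Because the game reads from `input()` and calls VADER, it is awkward to test. One possible way to check the room logic on its own is to write the transitions as a table and feed in sentiments directly. This is a sketch under my own assumptions, not part of the game above: the `TRANSITIONS` table, the `play` helper, and the end-state names `escaped` and `game_over` are hypothetical, and the table mirrors the branches in the functions above.

```python
# (current room, detected sentiment) -> next room
TRANSITIONS = {
    ("first_choice", "positive"): "hallway",
    ("first_choice", "negative"): "pitfall",
    ("first_choice", "neutral"): "first_choice",   # retry
    ("hallway", "positive"): "final_room",
    ("hallway", "negative"): "game_over",
    ("hallway", "neutral"): "hallway",             # retry
    ("pitfall", "positive"): "hallway",
    ("pitfall", "negative"): "game_over",
    ("pitfall", "neutral"): "game_over",
    ("final_room", "positive"): "escaped",
    ("final_room", "negative"): "game_over",
    ("final_room", "neutral"): "game_over",
}
TERMINAL = {"escaped", "game_over"}

def play(sentiments, start="first_choice"):
    """Follow the table for a sequence of sentiments and return the final room."""
    room = start
    for sentiment in sentiments:
        if room in TERMINAL:
            break
        room = TRANSITIONS[(room, sentiment)]
    return room

print(play(["positive", "positive", "positive"]))  # escaped
print(play(["negative", "negative"]))              # game_over
```

A loop over a table also avoids the growing call stack that the recursive `retry` pattern builds up when a player gives many neutral answers in a row.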