# Project 5: Sentiment Analysis on Twitter Data

This notebook builds a deep learning model to classify the sentiment of tweets from the Sentiment140 dataset. The goal is to determine whether a tweet expresses a positive or negative sentiment.

We will use a bidirectional Long Short-Term Memory (LSTM) network, a type of Recurrent Neural Network (RNN) that is well suited to sequence data such as text.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

## 2. Data Loading and Initial Cleaning

In [None]:
# The dataset has no header, so we provide column names
col_names = ['target', 'id', 'date', 'flag', 'user', 'text']
try:
    df = pd.read_csv('data/training.1600000.processed.noemoticon.csv',
                     encoding='latin-1',
                     names=col_names)
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Data file not found. Please download and place it in the 'data/' directory.")
    raise  # Stop here: every cell below needs df

# Keep only the necessary columns (copy so the assignment below modifies a real DataFrame)
df = df[['target', 'text']].copy()

# The target values are 0 (negative) and 4 (positive). Let's map 4 to 1.
df['target'] = df['target'].replace(4, 1)

print("\nValue counts for sentiment:")
print(df['target'].value_counts())

## 3. Text Preprocessing

In [None]:
# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

In [None]:
def clean_tweet(text):
    """Function to clean tweet text."""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove user mentions
    text = re.sub(r'@\w+', '', text)
    # Remove everything except letters and whitespace (punctuation, digits, emoji)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Split on whitespace, drop stopwords, and stem the remaining tokens
    tokens = [stemmer.stem(token) for token in text.split() if token not in stop_words]
    return " ".join(tokens)

# Apply the cleaning function
# This may take a few minutes to run on the full dataset
df['clean_text'] = df['text'].apply(clean_tweet)
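
To see what each regex step removes, here is a minimal sketch that applies the same substitutions to one made-up tweet. It is an illustration only: `DEMO_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's stopword list, and stemming is skipped so the snippet runs without any downloads.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: tiny subset).
DEMO_STOP_WORDS = {"i", "the", "my", "is", "so", "at", "and"}

def clean_tweet_demo(text):
    """Apply the same regex steps as clean_tweet, without stemming."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'@\w+', '', text)                    # drop @mentions
    text = re.sub(r'[^A-Za-z\s]', '', text)             # drop digits, punctuation, '#'
    text = text.lower()
    return " ".join(t for t in text.split() if t not in DEMO_STOP_WORDS)

sample = "@bob I LOVE the new phone!!! 10/10 http://t.co/abc #happy"
cleaned = clean_tweet_demo(sample)
print(cleaned)  # love new phone happy
```

Note that the hashtag symbol is stripped but its word survives, so `#happy` still contributes the token `happy`.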

## 4. Exploratory Data Analysis (Word Clouds)

In [None]:
positive_tweets = " ".join(df[df['target'] == 1]['clean_text'])
negative_tweets = " ".join(df[df['target'] == 0]['clean_text'])

wordcloud_pos = WordCloud(width=800, height=400, background_color='white').generate(positive_tweets)
wordcloud_neg = WordCloud(width=800, height=400, background_color='black').generate(negative_tweets)

plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.imshow(wordcloud_pos, interpolation='bilinear')
plt.title('Most Common Words in Positive Tweets')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(wordcloud_neg, interpolation='bilinear')
plt.title('Most Common Words in Negative Tweets')
plt.axis('off')

plt.show()
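
A word cloud sizes each word by how often it occurs, which makes exact comparisons hard to read off the image. As a complement, the sketch below counts tokens directly with `collections.Counter`. The helper `top_words` and the sample strings are hypothetical; on the real data you could pass `df[df['target'] == 1]['clean_text']` instead.

```python
from collections import Counter

def top_words(texts, n=3):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up cleaned tweets, used only to demonstrate the helper.
demo_positive = ["love new phone", "love sunni day", "great day"]
top = top_words(demo_positive, n=2)
print(top)  # [('love', 2), ('day', 2)]
```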

## 5. Feature Engineering and Sequencing

In [None]:
# Split data into training and testing sets
X = df['clean_text'].values
y = df['target'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize the text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to ensure uniform length
max_len = 50  # Every sequence is padded or truncated to this many tokens
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)
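
The following pure-Python sketch mimics what the tokenizing and padding steps do, as we understand the Keras defaults: words are ranked by frequency starting at index 1, only indices below `num_words` are kept, unseen words are dropped, and sequences are padded and truncated on the left. The functions `fit_word_index` and `texts_to_padded` are hypothetical names written for illustration; they are not part of Keras.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, num_words, max_len):
    """Map words to indices below num_words, then left-truncate and left-pad with zeros."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < num_words]
        seq = seq[-max_len:]  # keep only the last max_len tokens
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

train_demo = ["good day", "good good movi", "bad day"]
index = fit_word_index(train_demo)
print(index)  # {'good': 1, 'day': 2, 'movi': 3, 'bad': 4}

padded_demo = texts_to_padded(["good unseen day", "bad good movi day good"],
                              index, num_words=4, max_len=3)
print(padded_demo)  # [[0, 1, 2], [3, 2, 1]]
```

In the second example, `bad` has index 4 and is dropped because it is not below `num_words=4`, and the leading `good` is cut off by the left truncation.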

## 6. Building and Training the LSTM Model

In [None]:
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_len))
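# Note: recurrent_dropout > 0 prevents TensorFlow from using the fast cuDNN LSTM kernel,
# so training on a GPU is noticeably slower; set it to 0 if speed matters more.
model.add(Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2)))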
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
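
As a sanity check on the summary, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes used above; `lstm_params` is a hypothetical helper name.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * ((input_dim + units) * units + units)

vocab_size, embed_dim, lstm_units = 10000, 128, 64

embedding = vocab_size * embed_dim                 # one 128-d vector per vocabulary index
bi_lstm = 2 * lstm_params(embed_dim, lstm_units)   # forward and backward LSTM
dense = 2 * lstm_units * 1 + 1                     # concatenated 128-d state -> 1 output, plus bias
total = embedding + bi_lstm + dense

print(embedding, bi_lstm, dense, total)  # 1280000 98816 129 1378945
```

Almost all of the weights sit in the embedding layer, which is one reason the vocabulary size is a useful knob to tune.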

In [None]:
# Hold out 10% of the training data for validation so that the test set
# stays unseen until the final evaluation.
history = model.fit(X_train_pad, y_train,
                    epochs=5, # Using 5 epochs for demonstration; more can improve accuracy
                    batch_size=512,
                    validation_split=0.1,
                    verbose=1)
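
To get a feel for how much training 5 epochs at a batch size of 512 amounts to, the sketch below counts weight updates. It assumes the CSV holds the 1,600,000 rows its filename suggests, so the 80% split leaves 1,280,000 training tweets; if 10% of those are set aside for validation, 1,152,000 remain. `steps_per_epoch` is a hypothetical helper, not a Keras function.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

steps_full = steps_per_epoch(1_280_000, 512)     # all training rows used for fitting
steps_holdout = steps_per_epoch(1_152_000, 512)  # 10% of training rows held out

print(steps_full, steps_full * 5)        # 2500 12500
print(steps_holdout, steps_holdout * 5)  # 2250 11250
```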

## 7. Evaluating the Model

In [None]:
# Plotting training history
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.title('Model Accuracy')

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Model Loss')
plt.show()

In [None]:
# Evaluate on test set
loss, accuracy = model.evaluate(X_test_pad, y_test, verbose=0)
print(f"Test Accuracy: {accuracy:.4f}")

# Get predictions: threshold the sigmoid outputs at 0.5 and flatten (n, 1) to (n,)
y_pred_probs = model.predict(X_test_pad)
y_pred = (y_pred_probs > 0.5).astype('int32').ravel()

# Classification Report and Confusion Matrix
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
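
The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes them in plain Python on a handful of made-up labels and probabilities; `binary_metrics` is a hypothetical helper, and the matrix uses the same layout as scikit-learn (rows are true labels, columns are predicted labels).

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, then derive the confusion matrix and summary metrics."""
    y_hat = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels and sigmoid outputs, for illustration only.
demo_true = [1, 1, 1, 0, 0, 0, 1, 0]
demo_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
metrics = binary_metrics(demo_true, demo_prob)
print(metrics["confusion"])  # [[3, 1], [1, 3]]
print(metrics["precision"], metrics["recall"], metrics["accuracy"])  # 0.75 0.75 0.75
```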

## 8. Conclusion

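This notebook walked through a complete workflow for a common NLP task, sentiment analysis. We cleaned raw tweet text, converted it into padded integer sequences suitable for a deep learning model, and trained a bidirectional LSTM network to classify each tweet as positive or negative. The test accuracy, classification report, and confusion matrix in Section 7 show how well the model performs.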

### Potential Improvements
- **Use Pre-trained Embeddings:** Instead of training our own embedding layer, we could use pre-trained embeddings like GloVe or Word2Vec, which are trained on much larger text corpora and can capture more semantic meaning.
- **Try Different Architectures:** Experiment with GRUs (Gated Recurrent Units) or with Transformer-based models such as BERT.
- **More Sophisticated Preprocessing:** Use techniques like lemmatization instead of stemming, or handle negations more carefully.
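
As one illustration of the negation point: NLTK's English stopword list includes words such as "not" and "no", so the current cleaning step discards them and "not good" ends up looking like "good". A minimal sketch of one common workaround is to fold the negation into the following token before stopwords are removed. The `NEGATIONS` set, the `not_` prefix, and the `mark_negation` helper are all hypothetical choices for illustration, and the input is assumed to be lowercased with punctuation already stripped.

```python
# Hypothetical negation cues, written without apostrophes to match the cleaned text.
NEGATIONS = {"not", "no", "never", "dont", "cant", "isnt", "wont"}

def mark_negation(tokens, scope=1):
    """Drop each negation word and prefix the next `scope` tokens with 'not_'."""
    marked, remaining = [], 0
    for token in tokens:
        if token in NEGATIONS:
            remaining = scope          # start (or restart) a negated span
        elif remaining > 0:
            marked.append("not_" + token)
            remaining -= 1
        else:
            marked.append(token)
    return marked

negated = mark_negation("the movie was not good".split())
print(negated)  # ['the', 'movie', 'was', 'not_good']
```

With this approach `not_good` becomes its own vocabulary entry, so the model can learn a different embedding for it than for `good`. Stemming would need to run before the prefix is attached, or skip prefixed tokens.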