# Twitter Sentiment Analysis

This notebook demonstrates sentiment analysis on Twitter data using Support Vector Machines (SVM) with different text representations:
1. **Baseline**: TF-IDF with basic parameters
2. **Improved TF-IDF**: Enhanced vectorization with n-grams (up to trigrams) and hyperparameter tuning
3. **Modern Embeddings**: Sentence Transformers for semantic understanding

## 1. Import Libraries and Load Data

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import OrdinalEncoder

In [3]:
df = pd.read_csv("../data/twitter_training.csv", header=None)

In [5]:
df = df.dropna()

In [6]:
df.columns = ["tweetID", "entity", "sentiment", "content"]
df.head()

   tweetID       entity sentiment                                            content
0     2401  Borderlands  Positive  im getting on borderlands and i will murder yo...
1     2401  Borderlands  Positive  I am coming to the borders and I will kill you...
2     2401  Borderlands  Positive  im getting on borderlands and i will kill you ...
3     2401  Borderlands  Positive  im coming on borderlands and i will murder you...
4     2401  Borderlands  Positive  im getting on borderlands 2 and i will murder ...


## 2. Data Preprocessing

In [7]:
print(df['sentiment'].value_counts())

sentiment
Negative      22358
Positive      20655
Neutral       18108
Irrelevant    12875
Name: count, dtype: int64


In [33]:
df2 = df.sample(n=20000, random_state=42) # Added random_state for reproducibility
print(f"Shape of original df: {df.shape}")
print(f"Shape of df2 (subset): {df2.shape}")

Shape of original df: (73996, 5)
Shape of df2 (subset): (20000, 5)


In [34]:
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df2['content'], df2['sentiment'], test_size=0.2, random_state=42)

# Step 3: Text encoding using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
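
The split above draws rows purely at random, so the class proportions in the test set can drift slightly from those in `df2`. A minimal sketch, on made-up labels rather than the notebook's data, of how the `stratify` argument of `train_test_split` keeps the proportions fixed:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 placeholder tweets with a 40/30/20/10 class mix
texts = [f"tweet {i}" for i in range(100)]
labels = ["Negative"] * 40 + ["Positive"] * 30 + ["Neutral"] * 20 + ["Irrelevant"] * 10

# stratify=labels asks for the same class mix in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

In the notebook this would correspond to passing `stratify=df2['sentiment']`; whether it changes the reported scores has not been checked here.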

## 3. Baseline Model: Basic TF-IDF + SVM

In [35]:
# Step 4: Train the SVM
model = SVC()
model.fit(X_train_tfidf, y_train)

# Step 5: Make predictions and evaluate
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.68925

Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.67      0.48      0.56       696
    Negative       0.72      0.81      0.76      1171
     Neutral       0.73      0.63      0.67       999
    Positive       0.64      0.75      0.69      1134

    accuracy                           0.69      4000
   macro avg       0.69      0.67      0.67      4000
weighted avg       0.69      0.69      0.68      4000
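
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. A short worked example using the recall and support values printed in the report above:

```python
# Recall and support per class, copied from the baseline report above
recall = {"Irrelevant": 0.48, "Negative": 0.81, "Neutral": 0.63, "Positive": 0.75}
support = {"Irrelevant": 696, "Negative": 1171, "Neutral": 999, "Positive": 1134}

# Macro average: plain mean over the four classes
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its number of test rows
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro avg recall:    {macro_recall:.4f}")
print(f"weighted avg recall: {weighted_recall:.4f}")
```

Because the inputs are already rounded to two decimals, the results only approximately match the report rows. The weaker `Irrelevant` class pulls the macro average down more than the weighted one because it has the smallest support.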



## 4. Hyperparameter Tuning with Grid Search

In [36]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [1,5,7,9,10],
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train_tfidf, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)

# Evaluate the best estimator on the test data
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_tfidf)

print("\nAccuracy with tuned model:", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report with tuned model:\n", classification_report(y_test, y_pred_tuned))

Best parameters found:  {'C': 1, 'kernel': 'poly'}
Best cross-validation accuracy:  0.6867498256886275

Accuracy with tuned model: 0.73075

Classification Report with tuned model:
               precision    recall  f1-score   support

  Irrelevant       0.81      0.53      0.64       696
    Negative       0.66      0.87      0.75      1171
     Neutral       0.81      0.70      0.75       999
    Positive       0.73      0.73      0.73      1134

    accuracy                           0.73      4000
   macro avg       0.75      0.71      0.72      4000
weighted avg       0.74      0.73      0.73      4000
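
In the search above, the TF-IDF vectorizer was fitted once on all of `X_train` before cross-validation, so each validation fold has already influenced the vocabulary and IDF weights. One common way to avoid this is to put the vectorizer and the SVM in a `Pipeline` and search over the pipeline. The sketch below uses a tiny made-up corpus and a reduced grid; the step names `tfidf` and `svc` are arbitrary choices, not names from the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, four examples per class
texts = [
    "love this game", "great fun", "really love it", "great game",
    "hate this game", "awful bugs", "really hate it", "awful game",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# The vectorizer is refitted inside every CV fold, on that fold's training part only
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_pipe = {"svc__C": [1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid_pipe, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(len(search.cv_results_["params"]), "candidates evaluated")
```

The fitted `search` object can then predict on raw text directly, since the pipeline applies the vectorizer itself.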



## 5. Improved TF-IDF Approach

Key improvements over baseline:
- **More features**: 15,000 vs 1,000 features
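- **N-grams**: Unigrams, bigrams, and trigrams (`ngram_range=(1, 3)`) capture short phrases such as "good game". Because `stop_words='english'` removes "not" before n-grams are built, negations like "not good" are not captured with these settings
- **Sublinear TF**: Replaces the raw term count `tf` with `1 + log(tf)`, which dampens the effect of repeated words
- **Advanced tuning**: Also searches over `gamma`; class weighting is left commented out in the grid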

In [None]:
# Improved TF-IDF vectorization
vectorizer_improved = TfidfVectorizer(
    stop_words='english',
    max_features=15000,        # More features for better representation
    ngram_range=(1, 3),       # Unigrams, bigrams, and trigrams
    min_df=2,                 # Filter rare words
    sublinear_tf=True         # Better scaling: 1 + log(tf)
)

X_train_tfidf_improved = vectorizer_improved.fit_transform(X_train)
X_test_tfidf_improved = vectorizer_improved.transform(X_test)

# Expanded parameter grid
param_grid_improved = {
    'C': [1,10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],      # For rbf kernel
    # 'class_weight': [None, 'balanced']  # Handle class imbalance
}

# GridSearchCV with improved parameters
grid_search_improved = GridSearchCV(
    SVC(),
    param_grid_improved,
    cv=2,                    # 2 folds: faster, but a noisier estimate than cv=3 or cv=5
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search_improved.fit(X_train_tfidf_improved, y_train)

print("Best parameters:", grid_search_improved.best_params_)
print("Best CV accuracy:", grid_search_improved.best_score_)

# Evaluate on test set
best_model_improved = grid_search_improved.best_estimator_
y_pred_improved = best_model_improved.predict(X_test_tfidf_improved)

print("\nTest Accuracy:", accuracy_score(y_test, y_pred_improved))
print("\nClassification Report:\n", classification_report(y_test, y_pred_improved))

Fitting 2 folds for each of 8 candidates, totalling 16 fits
Best parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best CV accuracy: 0.68625

Test Accuracy: 0.78325

Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.81      0.67      0.73       696
    Negative       0.80      0.85      0.82      1171
     Neutral       0.81      0.74      0.77       999
    Positive       0.74      0.83      0.78      1134

    accuracy                           0.78      4000
   macro avg       0.79      0.77      0.78      4000
weighted avg       0.79      0.78      0.78      4000
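
The `sublinear_tf=True` setting used in this section replaces a raw term count `tf` with `1 + log(tf)` (natural logarithm). A small worked example of the scaling, written in plain Python rather than through scikit-learn:

```python
import math


def sublinear_tf(tf: int) -> float:
    """Return 1 + ln(tf) for a positive count, and 0 for an absent term."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


# A word repeated 100 times ends up weighted far less than 100x a single mention
for count in (1, 2, 10, 100):
    print(f"tf={count:>3}  ->  {sublinear_tf(count):.3f}")
```

This is why a tweet that repeats one word many times does not dominate the feature vector as strongly as it would with raw counts.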



## 6. Text Preprocessing + Improved TF-IDF

Adding preprocessing to clean the text before vectorization.

In [None]:
import re

def preprocess_text(text):
    """Clean and normalize text for better sentiment analysis"""
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove mentions and hashtags (the hashtag word itself is dropped too)
    text = re.sub(r'@\w+|#\w+', '', text)

    # Keep only lowercase letters and whitespace (drops digits, punctuation, emoji)
    text = re.sub(r'[^a-z\s]', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Apply preprocessing to both train and test
X_train_clean = X_train.apply(preprocess_text)
X_test_clean = X_test.apply(preprocess_text)

# Vectorize the cleaned text (this refits vectorizer_improved, replacing the vocabulary learned in Section 5)
X_train_tfidf_clean = vectorizer_improved.fit_transform(X_train_clean)
X_test_tfidf_clean = vectorizer_improved.transform(X_test_clean)

# Train with cleaned data
grid_search_clean = GridSearchCV(SVC(), param_grid_improved, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search_clean.fit(X_train_tfidf_clean, y_train)

print("Best parameters (cleaned):", grid_search_clean.best_params_)
print("Best CV accuracy (cleaned):", grid_search_clean.best_score_)

y_pred_clean = grid_search_clean.best_estimator_.predict(X_test_tfidf_clean)
print("\nTest Accuracy (cleaned):", accuracy_score(y_test, y_pred_clean))
print("\nClassification Report (cleaned):\n", classification_report(y_test, y_pred_clean))

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters (cleaned): {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best CV accuracy (cleaned): 0.73525

Test Accuracy (cleaned): 0.76175

Classification Report (cleaned):
               precision    recall  f1-score   support

  Irrelevant       0.77      0.65      0.71       696
    Negative       0.78      0.83      0.80      1171
     Neutral       0.78      0.72      0.75       999
    Positive       0.72      0.80      0.76      1134

    accuracy                           0.76      4000
   macro avg       0.76      0.75      0.75      4000
weighted avg       0.76      0.76      0.76      4000
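
To see what the cleaning step in this section removes, it helps to run it on a single example. The sketch below repeats the same regular expressions in a self-contained function and applies them to a made-up tweet (the tweet text and URL are invented for illustration):

```python
import re


def preprocess_text(text: str) -> str:
    """Same cleaning steps as in this section, repeated so the example runs alone."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#\w+', '', text)        # mentions and whole hashtags
    text = re.sub(r'[^a-z\s]', '', text)         # digits, punctuation, emoji
    return ' '.join(text.split())                # collapse whitespace


sample = "Check this out http://t.co/abc @user #fun I LOVE Borderlands 2!!!"
cleaned = preprocess_text(sample)
print(cleaned)
```

The hashtag word "fun", the digit "2", and the exclamation marks all disappear. Signals like these may matter for sentiment, which is one possible explanation for the lower test accuracy in this section compared with Section 5.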



## 7. Modern Approach: Sentence Transformers

Using pre-trained transformer models for semantic text embeddings instead of TF-IDF.

In [39]:
from sentence_transformers import SentenceTransformer

# Load pre-trained model (named embedder so it does not overwrite the baseline SVC stored in `model`)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Small, fast general-purpose model

# Encode the raw (uncleaned) text to dense embeddings
X_train_embeddings = embedder.encode(X_train.tolist(), show_progress_bar=True)
X_test_embeddings = embedder.encode(X_test.tolist(), show_progress_bar=True)

# Use with SVC directly (no TfidfVectorizer needed!)
svc = SVC(kernel="rbf", C=1)
svc.fit(X_train_embeddings, y_train)

Batches: 100%|██████████| 500/500 [00:20<00:00, 24.30it/s]
Batches: 100%|██████████| 125/125 [00:03<00:00, 35.68it/s]

SVC parameters: C=1, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None

In [41]:

# Step 5: Make predictions and evaluate
y_pred_embeddings = svc.predict(X_test_embeddings)

print("Accuracy:", accuracy_score(y_test, y_pred_embeddings))
print("\nClassification Report:\n", classification_report(y_test, y_pred_embeddings))

Accuracy: 0.71925

Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.71      0.55      0.62       696
    Negative       0.72      0.83      0.77      1171
     Neutral       0.72      0.67      0.69       999
    Positive       0.72      0.75      0.74      1134

    accuracy                           0.72      4000
   macro avg       0.72      0.70      0.71      4000
weighted avg       0.72      0.72      0.72      4000
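
Sentence embeddings are useful because texts with similar meaning land close together in vector space, and closeness is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for the 384-dimensional vectors that `all-MiniLM-L6-v2` produces, so it runs without downloading the model:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical stand-in embeddings, not real model output
tweet_a = np.array([1.0, 0.0, 1.0])
tweet_b = np.array([2.0, 0.0, 2.0])   # same direction as tweet_a, different length
tweet_c = np.array([0.0, 1.0, 0.0])   # orthogonal to tweet_a

sim_ab = cosine_similarity(tweet_a, tweet_b)
sim_ac = cosine_similarity(tweet_a, tweet_c)
print(f"a vs b: {sim_ab:.2f}")
print(f"a vs c: {sim_ac:.2f}")
```

The RBF kernel used in this section works on Euclidean distance rather than cosine similarity. The two rank neighbours the same way only when the embeddings are normalized to unit length, which was not checked here.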



## 8. Model Comparison Summary

Compare the performance of all approaches:
- **Baseline TF-IDF**: Simple bag-of-words with basic SVM
- **Grid Search**: Same features but optimized hyperparameters
- **Improved TF-IDF**: Better vectorization + tuning
- **Text Preprocessing**: Cleaning + improved vectorization
- **Sentence Transformers**: Modern semantic embeddings

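On this 20,000-tweet sample, the improved TF-IDF features without text cleaning reached the highest test accuracy (0.783), followed by improved TF-IDF with preprocessing (0.762), the grid-searched baseline features (0.731), Sentence Transformer embeddings with an untuned SVM (0.719), and the baseline (0.689). The Sentence Transformer model captures semantic similarity but requires more computational resources, and its SVM was not tuned here, so its score should not be read as a ceiling. The cleaning step lowered accuracy by about two points, which suggests that digits, hashtags, or punctuation carry useful signal in this dataset.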
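
The test accuracies printed in the sections above can be gathered into one table for a side-by-side view. A minimal sketch with the values copied by hand from the notebook outputs (the column names are arbitrary):

```python
import pandas as pd

# Test accuracies as printed in Sections 3 to 7 of this notebook
results = pd.DataFrame(
    [
        ("Baseline TF-IDF", 0.68925),
        ("Grid Search", 0.73075),
        ("Improved TF-IDF", 0.78325),
        ("Text Preprocessing", 0.76175),
        ("Sentence Transformers", 0.71925),
    ],
    columns=["approach", "test_accuracy"],
)

# Absolute gain over the baseline row
results["gain_vs_baseline"] = results["test_accuracy"] - results.loc[0, "test_accuracy"]

ranked = results.sort_values("test_accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

All five numbers come from a single train/test split of the same 20,000-row sample, so small differences between approaches should be read with caution.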