# Twitter Sentiment Analysis

This notebook demonstrates sentiment analysis on Twitter data using Support Vector Machines (SVM) with different text embedding approaches:
1. **Baseline**: TF-IDF with basic parameters
2. **Improved TF-IDF**: Enhanced vectorization with n-grams (up to trigrams) and hyperparameter tuning
3. **Modern Embeddings**: Sentence Transformers for semantic understanding

## 1. Import Libraries and Load Data

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("crowdflower/twitter-airline-sentiment")

print("Path to dataset files:", path)

Path to dataset files: /Users/pewhite/.cache/kagglehub/datasets/crowdflower/twitter-airline-sentiment/versions/4




In [15]:
df = pd.read_csv("../data/airline_tweets.csv")

In [34]:
df.sample(1000)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
11533,568037876761546753,positive,0.6983,,0.0000,US Airways,,RealcornerboyT,,0,@USAirways still not in the air for deicing of...,,2015-02-18 05:22:38 -0800,Any Martin Luther St.,
3379,568481989776564224,negative,0.7069,Can't Tell,0.3583,United,,MissMcB76,,0,@united no it weighed 45.5 and it was the only...,,2015-02-19 10:47:22 -0800,PA,
1160,569910299983130625,negative,1.0000,Flight Attendant Complaints,0.3623,United,,unvoiced,,0,"@united is unfriendly screw family, that hates...",,2015-02-23 09:22:58 -0800,Taipei,Taipei
1386,569748099616215040,negative,1.0000,Cancelled Flight,0.6667,United,,chocolossus,,0,@united your ground crew was inept and left a ...,,2015-02-22 22:38:27 -0800,"Vancouver, CA",
9356,569970518381744128,negative,1.0000,Customer Service Issue,0.6633,US Airways,,smaguire2,,0,@USAirways you are supposed to be in the busin...,,2015-02-23 13:22:15 -0800,,Atlantic Time (Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1212,569885015615664129,positive,0.6576,,,United,,e_russell,,0,@united Thanks!,,2015-02-23 07:42:30 -0800,Frequently DC/NYC/San Diego,Eastern Time (US & Canada)
9092,570231607757381632,negative,1.0000,Late Flight,1.0000,US Airways,,MadeleineDay,,0,@USAirways @AmericanAir our honeymoon was dela...,,2015-02-24 06:39:44 -0800,St. Louis,Pacific Time (US & Canada)
5073,569369185559699456,negative,1.0000,Customer Service Issue,0.6771,Southwest,,KMadson,,0,@SouthwestAir been on hold for 1.5 hrs. What's...,,2015-02-21 21:32:46 -0800,,Central Time (US & Canada)
4644,569965376186019840,neutral,1.0000,,,Southwest,,Felittle,,0,@southwestair watching planes do their thing h...,,2015-02-23 13:01:49 -0800,"Boston, MA",Pacific Time (US & Canada)


In [16]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [38]:
df.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

## 2. Data Preprocessing

In [39]:
classification = "airline_sentiment"
text = "text"

In [40]:
print(df[classification].value_counts())

airline_sentiment
negative    9178
neutral     3099
positive    2363
Name: count, dtype: int64
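
The counts above show a strong class imbalance, which sets the bar any classifier has to clear. As a rough sketch using only the printed counts (the variable names are illustrative and not part of the notebook), the accuracy of always predicting the majority class can be computed like this:

```python
# Class counts copied from the value_counts() output above
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = counts[majority_label] / total

for label, n in counts.items():
    print(f"{label:>8}: {n / total:.1%}")
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.3f}")
```

Any model accuracy reported later is best read against this floor rather than against zero.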


In [41]:
# Optional: uncomment to work on a 2,000-tweet subset for faster experimentation
# df2 = df.sample(n=2000, random_state=42)
df2 = df  # use the full dataset
print(f"Shape of original df: {df.shape}")
print(f"Shape of df2 (subset): {df2.shape}")

Shape of original df: (14640, 15)
Shape of df2 (subset): (14640, 15)


In [42]:
# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df2[text], df2[classification], test_size=0.2, random_state=42)

# Step 3: Text encoding using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
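
The cell above fits the vectorizer on the training text only and then reuses it for the test text. One possible way to make that ordering hard to get wrong is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below uses invented toy sentences (not rows from the dataset) and hypothetical names such as `toy_texts` and `pipeline`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented purely for illustration (not taken from the dataset)
toy_texts = [
    "great flight, loved the crew",
    "terrible delay, lost my bag",
    "awesome service, thank you",
    "worst airline ever, rude staff",
]
toy_labels = ["positive", "negative", "positive", "negative"]

# The pipeline fits the vectorizer on the training text, then trains the SVM
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=1000)),
    ("svm", SVC()),
])
pipeline.fit(toy_texts, toy_labels)

# At prediction time the already-fitted vectorizer is applied automatically
toy_pred = pipeline.predict(["loved the service"])
print(toy_pred)
```

With a pipeline, cross-validation also refits the vectorizer inside each fold, so held-out text does not influence the vocabulary.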

## 3. Baseline Model: Basic TF-IDF + SVM

In [44]:
# Step 4: Train the SVM
model = SVC()
model.fit(X_train_tfidf, y_train)

# Step 5: Make predictions and evaluate
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7855191256830601

Classification Report:
               precision    recall  f1-score   support

    negative       0.81      0.93      0.87      1889
     neutral       0.65      0.43      0.52       580
    positive       0.78      0.63      0.69       459

    accuracy                           0.79      2928
   macro avg       0.75      0.66      0.69      2928
weighted avg       0.77      0.79      0.77      2928
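
The `macro avg` and `weighted avg` rows differ because the classes are imbalanced. As a small arithmetic sketch, both can be approximately reproduced from the per-class recall and support printed above. Because the inputs are already rounded to two decimals, the result can differ from the report in the last digit.

```python
# Per-class recall and support copied from the classification report above
recall = {"negative": 0.93, "neutral": 0.43, "positive": 0.63}
support = {"negative": 1889, "neutral": 580, "positive": 459}

total_support = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total_support

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The macro figure is pulled down by the weak neutral class, while the weighted figure is dominated by the large negative class.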



## 4. Hyperparameter Tuning with Grid Search

In [45]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [1,5,7,9,10],
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search.fit(X_train_tfidf, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)

# Evaluate the best estimator on the test data
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_tfidf)

print("\nAccuracy with tuned model:", accuracy_score(y_test, y_pred_tuned))
print("\nClassification Report with tuned model:\n", classification_report(y_test, y_pred_tuned))

Best parameters found:  {'C': 1, 'kernel': 'rbf'}
Best cross-validation accuracy:  0.7640881147540983

Accuracy with tuned model: 0.7855191256830601

Classification Report with tuned model:
               precision    recall  f1-score   support

    negative       0.81      0.93      0.87      1889
     neutral       0.65      0.43      0.52       580
    positive       0.78      0.63      0.69       459

    accuracy                           0.79      2928
   macro avg       0.75      0.66      0.69      2928
weighted avg       0.77      0.79      0.77      2928
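
The tuned report is identical to the baseline report, which is what one would expect if the grid search selected the same settings the baseline already used. A quick check, as a sketch, compares the best parameters printed above with scikit-learn's defaults for `SVC`:

```python
from sklearn.svm import SVC

# Defaults used by the baseline's plain SVC()
defaults = SVC().get_params()
best_params = {"C": 1, "kernel": "rbf"}  # copied from the grid search output above

same_as_default = all(defaults[name] == value for name, value in best_params.items())
print({name: defaults[name] for name in best_params})
print("Best parameters equal the defaults:", same_as_default)
```

In other words, this grid did not find anything better than the default configuration on the 1,000-feature TF-IDF representation.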



## 5. Improved TF-IDF Approach

Key improvements over baseline:
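- **More features**: 15,000 vs. 1,000 features
- **N-grams**: Unigrams through trigrams (`ngram_range=(1, 3)`), intended to capture phrases like "not good". Note that `stop_words='english'` removes "not" before n-grams are built, so negations of this kind are not actually kept with the settings below
- **Sublinear TF**: Term frequency is scaled as 1 + log(tf), which dampens the effect of repeated words
- **Advanced tuning**: Also searching over `gamma`; class weighting is listed in the grid but commented out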
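
To make two of these settings concrete, here is a minimal sketch on invented toy sentences (not from the dataset). It shows how the built-in English stop list interacts with n-grams, and what sublinear term-frequency scaling does to raw counts. The names `with_stop`, `without_stop`, and `sublinear_tf` are illustrative only.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the food was not good", "the crew was good"]  # toy sentences

# With the built-in English stop list, stop words are dropped before n-grams are formed
with_stop = TfidfVectorizer(stop_words="english", ngram_range=(1, 3)).fit(docs)
without_stop = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

print(sorted(with_stop.vocabulary_))
print("not good" in with_stop.vocabulary_, "not good" in without_stop.vocabulary_)

# Sublinear TF replaces a raw count tf with 1 + ln(tf)
def sublinear_tf(tf: int) -> float:
    return 1.0 + math.log(tf)

print([round(sublinear_tf(tf), 2) for tf in (1, 2, 10)])
```

If negation phrases matter for the task, dropping `stop_words` or supplying a custom list that keeps "not" is one option worth testing.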

In [46]:
# Improved TF-IDF vectorization
vectorizer_improved = TfidfVectorizer(
    stop_words='english',
    max_features=15000,        # More features for better representation
    ngram_range=(1, 3),       # Unigrams, bigrams, and trigrams
    min_df=2,                 # Filter rare words
    sublinear_tf=True         # Better scaling: 1 + log(tf)
)

X_train_tfidf_improved = vectorizer_improved.fit_transform(X_train)
X_test_tfidf_improved = vectorizer_improved.transform(X_test)

# Expanded parameter grid
param_grid_improved = {
    'C': [1,10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],      # For rbf kernel
    # 'class_weight': [None, 'balanced']  # Handle class imbalance
}

# GridSearchCV with improved parameters
grid_search_improved = GridSearchCV(
    SVC(),
    param_grid_improved,
    cv=2,                    # Only 2 folds to keep the search fast (the baseline grid used 3)
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search_improved.fit(X_train_tfidf_improved, y_train)

print("Best parameters:", grid_search_improved.best_params_)
print("Best CV accuracy:", grid_search_improved.best_score_)

# Evaluate on test set
best_model_improved = grid_search_improved.best_estimator_
y_pred_improved = best_model_improved.predict(X_test_tfidf_improved)

print("\nTest Accuracy:", accuracy_score(y_test, y_pred_improved))
print("\nClassification Report:\n", classification_report(y_test, y_pred_improved))

Fitting 2 folds for each of 8 candidates, totalling 16 fits
Best parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Best CV accuracy: 0.7634050546448088

Test Accuracy: 0.7937158469945356

Classification Report:
               precision    recall  f1-score   support

    negative       0.83      0.92      0.87      1889
     neutral       0.64      0.47      0.54       580
    positive       0.78      0.69      0.73       459

    accuracy                           0.79      2928
   macro avg       0.75      0.69      0.71      2928
weighted avg       0.78      0.79      0.78      2928
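
The "8 candidates, totalling 16 fits" line in the output follows directly from the grid size and the number of folds. The sketch below reproduces that count with `ParameterGrid` and also shows a hypothetical alternative grid (`param_grid_split`, not used in the notebook) that avoids pairing `gamma` with the linear kernel, where it has no effect:

```python
from sklearn.model_selection import ParameterGrid

# Grid copied from the cell above
param_grid_improved = {
    "C": [1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
n_candidates = len(ParameterGrid(param_grid_improved))
n_fits = n_candidates * 2  # cv=2
print(n_candidates, n_fits)

# gamma is ignored by the linear kernel, so half of the linear candidates
# are duplicates. A list of dicts (hypothetical alternative) avoids them.
param_grid_split = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": ["scale", "auto"]},
]
n_candidates_split = len(ParameterGrid(param_grid_split))
print(n_candidates_split)
```

`GridSearchCV` accepts a list of dicts in the same way, so the smaller grid could be passed in unchanged.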



## 6. Text Preprocessing + Improved TF-IDF

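Adding preprocessing to clean the text before vectorization. The outputs below come from a run on a smaller sample (400 test tweets, per the support column, consistent with a 2,000-tweet subset), so they are not directly comparable with the full-dataset results in the earlier sections.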

In [30]:
import re

def preprocess_text(text):
    """Clean and normalize text for better sentiment analysis"""
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # Remove everything except lowercase letters and whitespace (this also drops digits and apostrophes)
    text = re.sub(r'[^a-z\s]', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

# Apply preprocessing to both train and test
X_train_clean = X_train.apply(preprocess_text)
X_test_clean = X_test.apply(preprocess_text)

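# Vectorize the cleaned text with a fresh copy of the improved vectorizer,
# so the vectorizer fitted on the raw text in section 5 is not overwritten
from sklearn.base import clone
vectorizer_clean = clone(vectorizer_improved)
X_train_tfidf_clean = vectorizer_clean.fit_transform(X_train_clean)
X_test_tfidf_clean = vectorizer_clean.transform(X_test_clean)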

# Train with cleaned data
grid_search_clean = GridSearchCV(SVC(), param_grid_improved, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search_clean.fit(X_train_tfidf_clean, y_train)

print("Best parameters (cleaned):", grid_search_clean.best_params_)
print("Best CV accuracy (cleaned):", grid_search_clean.best_score_)

y_pred_clean = grid_search_clean.best_estimator_.predict(X_test_tfidf_clean)
print("\nTest Accuracy (cleaned):", accuracy_score(y_test, y_pred_clean))
print("\nClassification Report (cleaned):\n", classification_report(y_test, y_pred_clean))

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters (cleaned): {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best CV accuracy (cleaned): 0.718125

Test Accuracy (cleaned): 0.745

Classification Report (cleaned):
               precision    recall  f1-score   support

    negative       0.78      0.91      0.84       271
     neutral       0.40      0.28      0.33        69
    positive       0.87      0.55      0.67        60

    accuracy                           0.74       400
   macro avg       0.68      0.58      0.61       400
weighted avg       0.73      0.74      0.73       400
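
To see what the cleaning function actually keeps, it helps to run it on a single example. The sketch below repeats the same regular expressions in a self-contained form and applies them to a hypothetical tweet written for illustration (it is not a row from the dataset):

```python
import re

def preprocess_text(text):
    """Same cleaning steps as the notebook's preprocess_text."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#\w+', '', text)        # mentions and hashtags
    text = re.sub(r'[^a-z\s]', '', text)         # everything but letters/whitespace
    return ' '.join(text.split())

# Hypothetical tweet, written for illustration only
sample = "@united Can't believe it! 2 hrs late http://t.co/abc #fail"
cleaned = preprocess_text(sample)
print(cleaned)
```

The hashtag word, the digit, and the apostrophe in "can't" all disappear, so some sentiment-bearing signals may be lost along with the noise. Whether that helps or hurts is something to measure on the same data split as the uncleaned run.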



## 7. Modern Approach: Sentence Transformers

Using pre-trained transformer models for semantic text embeddings instead of TF-IDF.

In [31]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Small, fast general-purpose model

# Encode text to dense embeddings
X_train_embeddings = embedder.encode(X_train.tolist(), show_progress_bar=True)
X_test_embeddings = embedder.encode(X_test.tolist(), show_progress_bar=True)

# Train an SVC directly on the embeddings (no TfidfVectorizer needed)
svc = SVC(kernel="rbf", C=1)
svc.fit(X_train_embeddings, y_train)

Batches: 100%|██████████| 50/50 [00:03<00:00, 14.59it/s]
Batches: 100%|██████████| 13/13 [00:00<00:00, 31.34it/s]

SVC parameters:
parameter,value
C,1
kernel,'rbf'
degree,3
gamma,'scale'
coef0,0.0
shrinking,True
probability,False
tol,0.001
cache_size,200
class_weight,

In [32]:

# Make predictions with the embedding-based SVC and evaluate
y_pred_embeddings = svc.predict(X_test_embeddings)

print("Accuracy:", accuracy_score(y_test, y_pred_embeddings))
print("\nClassification Report:\n", classification_report(y_test, y_pred_embeddings))

Accuracy: 0.84

Classification Report:
               precision    recall  f1-score   support

    negative       0.86      0.96      0.91       271
     neutral       0.71      0.59      0.65        69
    positive       0.90      0.58      0.71        60

    accuracy                           0.84       400
   macro avg       0.82      0.71      0.75       400
weighted avg       0.84      0.84      0.83       400
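
The report above covers 400 test tweets rather than the 2,928 used in sections 3 to 5, so this run appears to have used a smaller sample. A small arithmetic check, assuming the encoder ran with a batch size of 32 (the default of `SentenceTransformer.encode`, and an assumption here since the notebook does not set it), shows that the batch counts in the progress bars are consistent with a 2,000-tweet subset split 80/20:

```python
import math

batch_size = 32                        # assumed default batch size
train_batches, test_batches = 50, 13   # from the progress bars above
test_support = 271 + 69 + 60           # from the classification report above

# Largest and smallest training-set sizes consistent with 50 batches
max_train = train_batches * batch_size
min_train = (train_batches - 1) * batch_size + 1

# A 2,000-row sample split 80/20 gives 1,600 / 400
subset_train, subset_test = 1600, 400
print(test_support,
      math.ceil(subset_train / batch_size),
      math.ceil(subset_test / batch_size))
```

For a fair comparison with the TF-IDF models, the embedding model would need to be re-run on the same split that produced the earlier reports.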

