Problem 1: Twitter US Airline Sentiment Analysis

(Word2Vec + Logistic Regression)

Objective:
Perform sentiment classification (positive, neutral, negative) on airline-related tweets using classical NLP techniques and pre-trained Word2Vec embeddings.

Key Steps:

Text preprocessing

Word2Vec feature extraction

Train-test split

Multiclass Logistic Regression

Sentiment prediction function

In [3]:
!pip install nltk gensim contractions emoji



In [4]:
import pandas as pd
import numpy as np
import re
import string
import emoji
import contractions

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [5]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We use the Twitter US Airline Sentiment dataset, which contains:

text: Tweet content

airline_sentiment: Target label (positive, neutral, negative)

The dataset is loaded as a Pandas DataFrame.

In [6]:
from google.colab import files
uploaded = files.upload()

df = pd.read_csv(list(uploaded.keys())[0])
df = df[['airline_sentiment', 'text']]
df.head()

Saving Tweets.csv to Tweets.csv


   airline_sentiment  text
0  neutral            @VirginAmerica What @dhepburn said.
1  positive           @VirginAmerica plus you've added commercials t...
2  neutral            @VirginAmerica I didn't today... Must mean I n...
3  negative           @VirginAmerica it's really aggressive to blast...
4  negative           @VirginAmerica and it's a really big bad thing...


Each tweet undergoes the following preprocessing steps:

Convert to lowercase

Expand contractions (e.g., don’t → do not)

Remove URLs, mentions, hashtags

Remove punctuation and emojis

Tokenize text

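Remove stopwords and non-alphabetic tokens

Lemmatize the remaining words using WordNetLemmatizer

These steps strip Twitter-specific noise and function words, so the averaged embedding of each tweet is driven by its content words.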

In [9]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_tweet(text):
    text = text.lower()
    text = contractions.fix(text)
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = emoji.replace_emoji(text, replace='')
    text = text.translate(str.maketrans('', '', string.punctuation))

    tokens = nltk.word_tokenize(text)

    tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word.isalpha() and word not in stop_words
    ]
    return tokens
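
To see what the regex and punctuation steps do in isolation, here is a minimal sketch that uses only the standard library. It deliberately skips contraction expansion, emoji removal, NLTK tokenization, stopword filtering, and lemmatization; the helper name strip_noise and the sample tweet are made up for illustration.

```python
import re
import string

# Same pattern as in preprocess_tweet: URLs, @mentions, and #hashtags
NOISE_PATTERN = re.compile(r"http\S+|www\S+|@\w+|#\w+")

def strip_noise(text):
    """Lowercase, drop URLs/mentions/hashtags, drop punctuation, split on whitespace."""
    text = text.lower()
    text = NOISE_PATTERN.sub("", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()  # whitespace split stands in for nltk.word_tokenize

sample = "@VirginAmerica Loved the flight!! #happy https://t.co/abc123"
print(strip_noise(sample))  # ['loved', 'the', 'flight']
```

Note that the whole hashtag is removed, not just the # symbol, so any sentiment carried by a hashtag word is lost.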

In [11]:
df['tokens'] = df['text'].apply(preprocess_tweet)
df.head()

   airline_sentiment  text                                               tokens
0  neutral            @VirginAmerica What @dhepburn said.                [said]
1  positive           @VirginAmerica plus you've added commercials t...  [plus, added, commercial, experience, tacky]
2  neutral            @VirginAmerica I didn't today... Must mean I n...  [today, must, mean, need, take, another, trip]
3  negative           @VirginAmerica it's really aggressive to blast...  [really, aggressive, blast, obnoxious, enterta...
4  negative           @VirginAmerica and it's a really big bad thing...  [really, big, bad, thing]


We use the Google News Word2Vec model (300-dimensional).
Each tweet is converted into a fixed-length vector by averaging word embeddings of its tokens.

Words not present in the vocabulary are ignored.

In [12]:
import gensim.downloader as api

# Load Google News Word2Vec (300d)
w2v = api.load("word2vec-google-news-300")



In [13]:
def tweet_vector(tokens, model, vector_size=300):
    vectors = [model[word] for word in tokens if word in model]
    if len(vectors) == 0:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

X = np.vstack(df['tokens'].apply(lambda x: tweet_vector(x, w2v)))
y = df['airline_sentiment']
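
The averaging logic can be checked without loading the full Google News model. The sketch below repeats tweet_vector and runs it on a toy 3-dimensional embedding table; the dictionary toy_model and its values are invented for illustration and stand in for the real KeyedVectors object, which supports the same `in` and `[]` operations.

```python
import numpy as np

def tweet_vector(tokens, model, vector_size=300):
    # Keep only tokens that have an embedding; out-of-vocabulary words are skipped
    vectors = [model[word] for word in tokens if word in model]
    if len(vectors) == 0:
        # No known words: fall back to an all-zero vector
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional embeddings (not real Word2Vec values)
toy_model = {
    "good": np.array([1.0, 0.0, 2.0]),
    "flight": np.array([3.0, 2.0, 0.0]),
}

avg = tweet_vector(["good", "flight", "zzzunknown"], toy_model, vector_size=3)
empty = tweet_vector(["zzzunknown"], toy_model, vector_size=3)
print(avg)    # [2. 1. 1.]
print(empty)  # [0. 0. 0.]
```

A tweet whose tokens are all out of vocabulary maps to the zero vector, so all such tweets look identical to the classifier.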

The dataset is split into:

80% Training data

20% Testing data

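Accuracy is then measured on the held-out 20%, which the classifier never sees during training. The split is random (random_state=42) and is not stratified by sentiment label.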

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
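
The split above is not stratified. As a possible variation (not what this notebook does), passing stratify to train_test_split keeps the class proportions the same in both parts. The sketch below shows the effect on made-up toy labels; the label counts are illustrative and are not the real dataset distribution.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels invented for illustration
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
features = list(range(len(labels)))

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.5, random_state=42, stratify=labels
)

# Both halves keep the 6:2:2 ratio of the full label list
print(Counter(y_tr))
print(Counter(y_te))
```

Applying this to the notebook would mean adding stratify=y to the existing call, which would change the reported accuracy slightly because the test set would differ.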

We use Multiclass Logistic Regression:

Efficient for medium-sized datasets

Works well with dense embeddings

Supports softmax for multi-class classification
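
As a minimal sketch of the softmax step, the code below turns three class scores into probabilities and picks the most likely label. The scores are hypothetical; in the fitted model they would come from a linear function of the tweet vector.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

classes = ["negative", "neutral", "positive"]
scores = np.array([2.0, 0.5, -1.0])  # hypothetical linear scores for one tweet

probs = softmax(scores)
predicted = classes[int(np.argmax(probs))]
print(probs, predicted)
```

The predicted class is the one with the highest score, so the softmax itself matters mainly during training and when probabilities are needed.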

In [16]:
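# The default lbfgs solver already fits a multinomial (softmax) model for
# multiclass targets, so multi_class='multinomial' is not needed; recent
# scikit-learn releases deprecate that argument.
model = LogisticRegression(max_iter=1000)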
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))



Test Accuracy: 0.7721994535519126


This function:

Takes a raw tweet

Applies preprocessing

Converts it to Word2Vec representation

Returns predicted sentiment

In [17]:
def predict_tweet_sentiment(model, w2v_model, tweet):
    tokens = preprocess_tweet(tweet)
    vec = tweet_vector(tokens, w2v_model)
    return model.predict([vec])[0]

predict_tweet_sentiment(
    model, w2v,
    "The flight was delayed and the staff was rude"
)

'negative'

Problem 2: Hugging Face BERT Pipeline

(IMDb Sentiment Analysis)

In this task, we fine-tune a pre-trained BERT model for binary sentiment classification using the IMDb dataset.

Pipeline Components:

Hugging Face Datasets

Tokenization using BERT tokenizer

Model fine-tuning

Evaluation (Accuracy & F1-score)

Model saving & inference

In [1]:
!pip install transformers datasets evaluate accelerate



In [2]:
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
import evaluate
import numpy as np

In [3]:
dataset = load_dataset("imdb")
dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Text is tokenized using the bert-base-uncased tokenizer with:

Padding

Truncation

Fixed max length (256 tokens)

Every review becomes a fixed-length sequence of 256 token IDs, the input shape the model is trained on.
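
The effect of padding and truncation can be sketched with plain lists. The helper pad_or_truncate and the token IDs below are hypothetical. The real BERT tokenizer also adds special tokens and keeps the closing [SEP] token when it truncates, which this simplified version does not model.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]        # truncation: drop tokens past the limit
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    # padding: fill the remainder with pad_id and mark it with 0 in the mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)

print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4] [1, 1, 1, 1]
```

The attention mask tells the model which positions are padding so they can be ignored.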

In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset.set_format("torch")

In [5]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We evaluate performance using:

Accuracy → overall correctness

F1-score → balance between precision & recall
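
For reference, both metrics can be computed by hand. The sketch below is a plain-Python version for binary labels; the function name accuracy_and_f1 and the sample predictions are made up for illustration, and the notebook itself relies on the evaluate library.

```python
def accuracy_and_f1(preds, labels, positive=1):
    """Compute accuracy and binary F1 for the given positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return correct / len(pairs), f1

preds = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]
acc, f1_score = accuracy_and_f1(preds, labels)
print(acc, f1_score)  # both are 2/3 for this sample
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3.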

In [6]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels)["f1"]
    }

In [8]:
training_args = TrainingArguments(
    output_dir="./bert-imdb",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs"
)

In [9]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
    compute_metrics=compute_metrics
)

trainer.train()


Epoch  Training Loss  Validation Loss  Accuracy  F1
1      0.2736         0.256318         0.91484   0.912613
2      0.16           0.326935         0.92248   0.923052


TrainOutput(global_step=6250, training_loss=0.24505454956054687, metrics={'train_runtime': 3054.6894, 'train_samples_per_second': 16.368, 'train_steps_per_second': 2.046, 'total_flos': 6577776384000000.0, 'train_loss': 0.24505454956054687, 'epoch': 2.0})
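
The global_step value can be sanity-checked from the training settings. The short calculation below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_examples = 25000        # size of the IMDb train split
per_device_batch_size = 8     # per_device_train_batch_size
num_epochs = 2                # num_train_epochs
num_devices = 1               # assumption: one GPU
grad_accum_steps = 1          # assumption: default, no accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 3125 6250
```

The result matches the reported global_step of 6250, which is consistent with those two assumptions.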

In [10]:
trainer.save_model("bert-imdb-sentiment")
tokenizer.save_pretrained("bert-imdb-sentiment")

('bert-imdb-sentiment/tokenizer_config.json',
 'bert-imdb-sentiment/special_tokens_map.json',
 'bert-imdb-sentiment/vocab.txt',
 'bert-imdb-sentiment/added_tokens.json')

In [12]:
import torch

# Select device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model.to(device)
model.eval()

# Sample input text
text = "The movie was absolutely fantastic and emotionally powerful."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=256
)

# Move inputs to same device as model
inputs = {key: value.to(device) for key, value in inputs.items()}

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Prediction
prediction = torch.argmax(outputs.logits, dim=1).item()

print("Predicted Sentiment:", "Positive" if prediction == 1 else "Negative")

Predicted Sentiment: Positive


This assignment explores both classical and modern approaches to sentiment analysis. In Problem 1, traditional NLP techniques are combined with distributed word representations from the Google News Word2Vec model. Tweets are lowercased, stripped of URLs, mentions, hashtags, punctuation, and emojis, filtered for stopwords, and lemmatized, then converted into dense 300-dimensional vectors by averaging word embeddings. A multiclass Logistic Regression classifier trained on these vectors reaches a test accuracy of about 0.772 and serves as an efficient baseline.

Problem 2 demonstrates a deep learning pipeline built on Hugging Face’s Transformers library. A pre-trained BERT model (bert-base-uncased) is fine-tuned on the IMDb dataset for binary sentiment classification and reaches about 0.922 accuracy and 0.923 F1 on the test split after two epochs. Tokenization, training, evaluation, and inference all go through the library's standard APIs (the tokenizer, TrainingArguments, and Trainer), which keeps the pipeline short. The two accuracy figures come from different datasets and tasks, so they are not directly comparable.

The main challenges are memory usage and computational cost: loading the Google News Word2Vec vectors is memory-intensive, and fine-tuning BERT took about 51 minutes (3,054 seconds) for two epochs. For BERT, the cost is kept manageable by truncating reviews to 256 tokens, using a batch size of 8, and limiting training to two epochs. Together, these approaches highlight the evolution from feature-based NLP to end-to-end transformer models.