### Problem 1


In [None]:
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import contractions

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

df = pd.read_csv('tweets.csv')

lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return ""
    
    # Convert to lowercase
    text = str(text).lower()
    
    # Expand contractions 
    text = contractions.fix(text)

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    
    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    
    # Remove punctuation, emojis, and other non-word symbols
    text = re.sub(r'[^\w\s]', '', text)

    # Remove any remaining characters outside the Basic Multilingual Plane
    text = re.sub(r'[\U00010000-\U0010ffff]', '', text)
    
    # Tokenize and lemmatize
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join tokens back to string
    text = ' '.join(lemmatized_tokens)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply preprocessing to the text column
df['cleaned_text'] = df['text'].apply(preprocess_text)

# Display results
print("Original vs Cleaned Text:")
print(df[['text', 'cleaned_text']].head(10))

# Save the cleaned dataset
df.to_csv('tweets_cleaned.csv', index=False)
print("\nCleaned data saved to 'tweets_cleaned.csv'")

Original vs Cleaned Text:
                                                text  \
0                @VirginAmerica What @dhepburn said.   
1  @VirginAmerica plus you've added commercials t...   
2  @VirginAmerica I didn't today... Must mean I n...   
3  @VirginAmerica it's really aggressive to blast...   
4  @VirginAmerica and it's a really big bad thing...   
5  @VirginAmerica seriously would pay $30 a fligh...   
6  @VirginAmerica yes, nearly every time I fly VX...   
7  @VirginAmerica Really missed a prime opportuni...   
8    @virginamerica Well, I didn'tâ€¦but NOW I DO! :-D   
9  @VirginAmerica it was amazing, and arrived an ...   

                                        cleaned_text  
0                                          what said  
1  plus you have added commercial to the experien...  
2  i did not today must mean i need to take anoth...  
3  it is really aggressive to blast obnoxious ent...  
4          and it is a really big bad thing about it  
5  seriously would pay 30
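
The cleaned output above shows mentions and punctuation being stripped. Because `preprocess_text` also deletes whole hashtags, the word inside a hashtag never reaches the classifier. The sketch below compares that rule with one possible alternative that drops only the `#` character. The helper names and the sample tweet are hypothetical and are not part of the notebook.

```python
import re

def strip_hashtags(text):
    # Same rule as preprocess_text: remove the hashtag and its word.
    return re.sub(r'#\w+', '', text)

def keep_hashtag_words(text):
    # Hypothetical alternative: remove only the '#' so the word survives.
    return re.sub(r'#(\w+)', r'\1', text)

sample = "delayed again #fail"  # made-up tweet for illustration

print(repr(strip_hashtags(sample)))
print(repr(keep_hashtag_words(sample)))
```

Whether the second variant helps depends on the data; comparing test accuracy with each rule would be one way to decide.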

In [None]:
from gensim.models import KeyedVectors

# Load the pre-trained Google News Word2Vec model (the .bin file must already be downloaded locally)
word2vec_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print("Model loaded successfully!")
print(f"Vocabulary size: {len(word2vec_model)}")
print(f"Vector dimension: {word2vec_model.vector_size}")

Model loaded successfully!
Vocabulary size: 3000000
Vector dimension: 300


In [6]:
import numpy as np

def tweet_to_vector(text, model, vector_size=300):
    """Convert a tweet to a fixed-length vector by averaging word vectors."""
    if pd.isna(text) or text == "":
        return np.zeros(vector_size)
    
    words = text.split()
    word_vectors = []
    
    for word in words:
        if word in model:
            word_vectors.append(model[word])
    
    if len(word_vectors) == 0:
        return np.zeros(vector_size)
    
    return np.mean(word_vectors, axis=0)

# Convert all tweets to vectors
df['tweet_vector'] = df['cleaned_text'].apply(lambda x: tweet_to_vector(x, word2vec_model))

# Create a matrix of all tweet vectors
tweet_vectors = np.vstack(df['tweet_vector'].values)

print(f"Tweet vectors shape: {tweet_vectors.shape}")
print(f"Sample vector (first 10 dimensions): {tweet_vectors[0][:10]}")

Tweet vectors shape: (14640, 300)
Sample vector (first 10 dimensions): [ 0.0652771  -0.025177    0.15722656 -0.00170898 -0.10888672  0.06860352
  0.21191406 -0.1796875   0.07128906 -0.04376221]


In [None]:
# Create target column based on airline_sentiment, by mapping 'positive' to 1, 'negative' to -1, and 'neutral' to 0
df['target'] = df['airline_sentiment'].map({'positive': 1, 'negative': -1, 'neutral': 0})

print("Target column created:")
print(df[['airline_sentiment', 'target']].head(10))
print(f"\nTarget value counts:\n{df['target'].value_counts()}")

Target column created:
  airline_sentiment  target
0           neutral       0
1          positive       1
2           neutral       0
3          negative      -1
4          negative      -1
5          negative      -1
6          positive       1
7           neutral       0
8          positive       1
9          positive       1

Target value counts:
target
-1    9178
 0    3099
 1    2363
Name: count, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepare X and y
X = tweet_vectors
y = df['target'].values

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

# Train Multiclass Logistic Regression classifier
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier.fit(X_train, y_train)

# Predict on test set
y_pred = lr_classifier.predict(X_test)

# Calculate and report accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nLogistic Regression Accuracy on Test Set: {accuracy:.4f}")

Training set size: 11712
Testing set size: 2928

Logistic Regression Accuracy on Test Set: 0.7876
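
To put the 0.7876 accuracy in context, it can help to compare it with a predictor that always outputs the most frequent class. The sketch below derives that share from the class counts printed earlier. It uses the counts for the full dataset, so the share in the 20% test split may differ slightly.

```python
# Class counts copied from the "Target value counts" output above.
class_counts = {-1: 9178, 0: 3099, 1: 2363}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)
majority_share = class_counts[majority_label] / total

print(f"Total tweets: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class share of all tweets: {majority_share:.4f}")
```

Since negative tweets make up roughly 63% of the data, per-class metrics would be a useful complement to overall accuracy.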


In [24]:
def predict_tweet_sentiment(classifier, w2v_model, tweet):
    """
    Predict the sentiment of a single tweet.
    
    Parameters:
    - classifier: Trained classifier (e.g., LogisticRegression)
    - w2v_model: Word2Vec model for vectorization
    - tweet: String containing the tweet text
    
    Returns:
    - String: 'positive', 'negative', or 'neutral'
    """
    # Preprocess the tweet
    cleaned_tweet = preprocess_text(tweet)
    
    # Convert to vector
    tweet_vector = tweet_to_vector(cleaned_tweet, w2v_model)
    
    # Reshape for prediction (single sample)
    tweet_vector = tweet_vector.reshape(1, -1)
    
    # Predict
    prediction = classifier.predict(tweet_vector)[0]
    
    # Map prediction to sentiment label
    sentiment_map = {1: 'positive', -1: 'negative', 0: 'neutral'}
    
    return sentiment_map[prediction]

# Test the function
sample_tweet = "It was a fantastic flight! The crew was so friendly and helpful."
predicted_sentiment = predict_tweet_sentiment(lr_classifier, word2vec_model, sample_tweet)
print(f"Tweet: {sample_tweet}")
print(f"Predicted Sentiment: {predicted_sentiment}")

Tweet: It was a fantastic flight! The crew was so friendly and helpful.
Predicted Sentiment: positive


### Problem 2


In [16]:
from datasets import load_dataset
from transformers import BertTokenizer


# Load the IMDB dataset from Hugging Face
imdb_dataset = load_dataset('imdb')

print("Dataset loaded successfully!")
print(imdb_dataset)

# Load the BERT tokenizer for bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define preprocessing function
def preprocess_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Apply preprocessing to the dataset
tokenized_imdb = imdb_dataset.map(preprocess_function, batched=True)

print("\nTokenization complete!")
print(f"Tokenized dataset: {tokenized_imdb}")
print(f"\nSample tokenized input (first 20 tokens): {tokenized_imdb['train'][0]['input_ids'][:20]}")

Dataset loaded successfully!
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


Tokenization complete!
Tokenized dataset: DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

Sample tokenized input (first 20 tokens): [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009]
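
Because `padding='max_length'` is used, every review is stored and processed as 512 tokens regardless of its real length. The sketch below is a rough, library-free estimate of how many token positions that costs compared with padding each batch only to its longest member, which is what `DataCollatorWithPadding` from `transformers` is commonly used for. The token lengths and helper names are hypothetical.

```python
MAX_LENGTH = 512  # same cap as the notebook's tokenizer call

def padded_tokens_fixed(lengths, max_length=MAX_LENGTH):
    # Every sequence is padded (or truncated) to max_length.
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size, max_length=MAX_LENGTH):
    # Each batch is padded only to its longest (truncated) sequence.
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total

toy_lengths = [120, 300, 80, 512, 200, 150, 90, 260]  # hypothetical token counts

fixed = padded_tokens_fixed(toy_lengths)
dynamic = padded_tokens_dynamic(toy_lengths, batch_size=4)
print(f"Fixed padding:   {fixed} token positions")
print(f"Dynamic padding: {dynamic} token positions")
```

The real saving depends on the length distribution of the reviews; long reviews that hit the 512-token cap gain nothing.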


In [19]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Set format for PyTorch
tokenized_imdb.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])

# Report the available device (Trainer places the model and batches on it automatically)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./bert_imdb_results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb['train'],
    eval_dataset=tokenized_imdb['test'],  # the test split doubles as the validation set for checkpoint selection
)

# Fine-tune the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"\nEvaluation Results: {eval_results}")


Using device: cuda


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.3124        | 0.312909        |
| 2     | 0.1804        | 0.248039        |
| 3     | 0.0388        | 0.330846        |



Evaluation Results: {'eval_loss': 0.24803878366947174, 'eval_runtime': 437.3916, 'eval_samples_per_second': 57.157, 'eval_steps_per_second': 7.145, 'epoch': 3.0}
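
The reported `eval_loss` matches the epoch 2 validation loss, which is consistent with `load_best_model_at_end=True` restoring the checkpoint with the lowest evaluation loss (the default selection metric, as far as this configuration shows). The sketch below reproduces that selection from the logged values and prints the gap between validation and training loss, which widens sharply in epoch 3 and hints at overfitting.

```python
# (epoch, training loss, validation loss) copied from the training log above.
history = [
    (1, 0.3124, 0.312909),
    (2, 0.1804, 0.248039),
    (3, 0.0388, 0.330846),
]

# Pick the epoch with the lowest validation loss.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(f"Best epoch by validation loss: {best_epoch} ({best_val_loss:.6f})")

# A growing validation-minus-training gap suggests overfitting.
gaps = {epoch: val - train for epoch, train, val in history}
for epoch, gap in gaps.items():
    print(f"Epoch {epoch}: validation - training loss = {gap:.4f}")
```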


In [20]:
from sklearn.metrics import accuracy_score, f1_score

# Get predictions on the test set
predictions = trainer.predict(tokenized_imdb['test'])
y_pred_bert = predictions.predictions.argmax(axis=-1)
y_true_bert = predictions.label_ids

# Calculate accuracy
bert_accuracy = accuracy_score(y_true_bert, y_pred_bert)

# Calculate F1-score (binary classification)
bert_f1 = f1_score(y_true_bert, y_pred_bert)

print("BERT Model Performance on IMDB Test Set:")
print(f"Accuracy: {bert_accuracy:.4f}")
print(f"F1-Score: {bert_f1:.4f}")

BERT Model Performance on IMDB Test Set:
Accuracy: 0.9336
F1-Score: 0.9344
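
For reference, `f1_score` with its default binary averaging treats label 1 (positive reviews) as the positive class and returns the harmonic mean of precision and recall. The sketch below computes the same quantity by hand on a tiny made-up example; `binary_f1` and the toy labels are illustrative only.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]

toy_f1 = binary_f1(y_true_toy, y_pred_toy)
print(f"Toy F1: {toy_f1:.4f}")
```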


In [None]:
# Save fine-tuned model + tokenizer after training
save_dir = "./bert_imdb_finetuned"
trainer.save_model(save_dir)          # saves model weights + config
tokenizer.save_pretrained(save_dir)   # saves tokenizer files

print(f"Saved fine-tuned model to: {save_dir}")

# Load for inference on a sample text
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

inf_tokenizer = AutoTokenizer.from_pretrained(save_dir)
inf_model = AutoModelForSequenceClassification.from_pretrained(save_dir)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inf_model.to(device)
inf_model.eval()

sample_text = "The movie was fantastic! I really loved it."

inputs = inf_tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512,
).to(device)

with torch.no_grad():
    outputs = inf_model(**inputs)
    pred_id = outputs.logits.argmax(dim=-1).item()

label_map = {0: "negative", 1: "positive"}  # IMDB convention
print("Text:", sample_text)
print("Predicted label:", label_map[pred_id], f"(id={pred_id})")


Saved fine-tuned model to: ./bert_imdb_finetuned
Text: The movie was fantastic! I really loved it.
Predicted label: positive (id=1)
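
The inference cell reports only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax; in the notebook this would typically be `torch.softmax(outputs.logits, dim=-1)`. The library-free sketch below shows the computation on hypothetical logits, not on values produced by the fine-tuned model.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_map = {0: "negative", 1: "positive"}  # IMDB convention, as in the notebook

example_logits = [-2.0, 3.0]  # hypothetical logits for one review
probs = softmax(example_logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)

print(f"Predicted label: {label_map[pred_id]} (p={probs[pred_id]:.3f})")
```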


## End-to-end Sentiment Analysis Pipeline (Problem 1 + Problem 2)

This notebook implements two sentiment-analysis pipelines.

In **Problem 1**, tweets are cleaned with `preprocess_text` (lowercasing, contraction expansion, removal of URLs, mentions, hashtags, emojis, and punctuation, then tokenization and lemmatization). Each cleaned tweet is converted into a fixed-length feature vector by `tweet_to_vector`, which averages the **Google News Word2Vec** embeddings of its in-vocabulary words and falls back to a zero vector when none are found; this yields `tweet_vectors` / `X` with shape `(14640, 300)`. Labels are mapped to a 3-class target (`negative=-1`, `neutral=0`, `positive=1`), and a **multinomial Logistic Regression** model (`lr_classifier`) is trained on `X_train` (11,712 tweets) and evaluated on `X_test` (2,928 tweets), reaching a test accuracy of 0.7876 (`accuracy`).

In **Problem 2**, the **IMDB** dataset is tokenized with the `bert-base-uncased` tokenizer (padded or truncated to 512 tokens), and `BertForSequenceClassification` is fine-tuned for 3 epochs with `Trainer`. Validation loss is lowest after epoch 2 (0.2480), and the final evaluation reports the same loss. On the IMDB test set the model reaches an accuracy of 0.9336 and an F1-score of 0.9344 (`bert_accuracy`, `bert_f1`).

The design contrasts a lightweight baseline (averaged Word2Vec + Logistic Regression) with a higher-capacity contextual model (BERT). The main challenges are compute and memory: loading the 3-million-word Word2Vec model and fine-tuning BERT are both resource-intensive, so training on a GPU (`device`), checkpointing, and small batch sizes help. Preprocessing must also avoid removing sentiment-bearing cues such as hashtags and emojis, both of which `preprocess_text` strips; validating each cleaning step and monitoring out-of-vocabulary (OOV) rates mitigate this risk.
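
Monitoring the OOV rate can be as simple as counting how many tokens are missing from the embedding vocabulary. The sketch below is one minimal way to do it; `oov_rate`, the toy vocabulary, and the toy texts are hypothetical. In the notebook the call would presumably be `oov_rate(df['cleaned_text'], word2vec_model)`, since `tweet_to_vector` already relies on the model supporting the `in` operator.

```python
def oov_rate(texts, vocab):
    """Fraction of whitespace-separated tokens that are not in vocab."""
    total = 0
    missing = 0
    for text in texts:
        for token in text.split():
            total += 1
            if token not in vocab:
                missing += 1
    return missing / total if total else 0.0

# Toy stand-ins for the cleaned tweets and the Word2Vec vocabulary.
toy_vocab = {"flight", "was", "great", "late"}
toy_texts = ["flight was great", "flght was late"]  # "flght" is a deliberate typo

rate = oov_rate(toy_texts, toy_vocab)
print(f"OOV rate: {rate:.3f}")
```

A high rate after a cleaning change would suggest that the step is producing tokens the embedding model cannot represent.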

Note: the `bert_imdb_results` and `bert_imdb_finetuned` folders could not be uploaded due to GitHub's file constraints.