# CS-5313/7313 Project 4

## IMPORTANT - Make a copy of this colab notebook before working on it.

In this project, you will explore various pre-built language models based on the transformer architecture. Transformer networks are a state-of-the-art approach to language and time-series modeling that makes use of a concept called "attention". This design was first proposed in the paper "Attention is All You Need" by Vaswani et al. The paper can be read here: https://arxiv.org/pdf/1706.03762.pdf

Here you will use the pre-built transformer pipelines provided by Hugging Face. Refer to this link for how to use the package for each task you are trying to complete: https://huggingface.co/transformers/task_summary.html

## Task 1 - Text generation

In this task you will be using the "text-generation" pipeline to generate text.

Use three prompts of different sizes (1 word, ~5 words, and ~10 words) to generate sequences of length n+5 words, n+20 words, and 100 words, where n is the number of words in the prompt. Generate 3 sequences for each prompt and output-length pair. Since this evaluation is qualitative, comment in your report on the relative quality of the generated text and include examples. How do the parameters affect the quality of the output? In addition to the report, submit another document containing each generated sequence together with its prompt.

In [5]:
import logging
from transformers import pipeline

# Suppress HF logs
logging.getLogger("transformers").setLevel(logging.ERROR)

# Prompts
one_word_prompt = "Asta"
five_word_prompt = "Jotaro's stand is called Star"
ten_word_prompt = "Princess Mononoke is a film by Hayao Miyazaki all about"
prompts = [one_word_prompt, five_word_prompt, ten_word_prompt]

# Setup generator
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

# Output file
output_path = "generated_responses.txt"

# Generate and save to file
with open(output_path, "w") as f:
    for prompt in prompts:
        f.write(f"Prompt: {prompt}\n")
        word_count = len(prompt.split())
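        # Target output lengths in words, per the task statement: n+5, n+20, and 100
        for target_words in [word_count + 5, word_count + 20, 100]:
            # max_length counts tokens (prompt included), not words;
            # 5 tokens per word is a generous upper-bound estimate
            max_len = target_words * 5
            results = generator(
                prompt,
                max_length=max_len,
                num_return_sequences=3,  # three samples per prompt/length pair
                do_sample=True,
                temperature=0.9,
                top_p=0.95,
                pad_token_id=50256
            )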
            for res in results:
                continuation = res['generated_text'].strip()
                f.write(f"→ Response (~{max_len} tokens): {continuation}\n")
        f.write("\n" + "-"*60 + "\n\n")

print(f"✅ All generations saved to: {output_path}")

✅ All generations saved to: generated_responses.txt
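
Because `max_length` counts tokens (prompt included) rather than words, the generated text will rarely land exactly on the n+5, n+20, or 100-word targets. One possible workaround, sketched below, is to over-generate and then trim the result to the word target. `truncate_to_words` is a hypothetical helper (not part of the transformers library), the generated string is made up for illustration, and "word" is assumed to mean a whitespace-separated chunk.

```python
def truncate_to_words(text: str, target_words: int) -> str:
    """Keep at most `target_words` whitespace-separated words of `text`."""
    words = text.split()
    return " ".join(words[:target_words])


# Hypothetical generated text standing in for res["generated_text"]
prompt = "Jotaro's stand is called Star"
generated = prompt + " Platinum and it is known for its speed and precision in battle"

n = len(prompt.split())                       # number of words in the prompt
trimmed = truncate_to_words(generated, n + 5)  # prompt plus five generated words
print(trimmed)
```

Note that `split()` collapses runs of whitespace, so newlines in the generated text are replaced by single spaces after trimming.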


### You'll have to run everything from here on out on Google Colab...

## Task 2 - Sentiment Analysis

Here you will use the "sentiment-analysis" pipeline to look at the sentiment of Amazon reviews. The data is provided at this link: https://jmcauley.ucsd.edu/data/amazon/ . It is recommended to use one of the smaller datasets, such as the Musical Instruments dataset with 10,261 reviews.

Each review has both the text of the review, as well as the reviewer's rating.

### Subtask 2.1
Perform sentiment analysis on each review, and compare the model's output to the user's rating to get a sense of the accuracy of the model. The user's rating, which is out of a maximum of 5 stars, is found in the "overall" data field. For this, assume that a value of 3 or higher in the "overall" data field is a positive review.
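
The comparison rule above can be sketched on a few made-up examples. This is a minimal illustration, not the graded solution: `rating_to_positive` and `accuracy_and_confusion` are hypothetical helper names, the labels `"POSITIVE"` and `"NEGATIVE"` are assumed to be what the sentiment model returns, and the toy ratings are invented. Counting the four outcome types alongside accuracy can help when most reviews are positive, since a model that always answers "POSITIVE" would still score well on accuracy alone.

```python
def rating_to_positive(overall: float, threshold: float = 3.0) -> bool:
    """A rating of `threshold` stars or more counts as a positive review."""
    return overall >= threshold


def accuracy_and_confusion(predicted_labels, ratings):
    """Return (accuracy, counts) where counts holds TP/FP/TN/FN tallies."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, rating in zip(predicted_labels, ratings):
        predicted = label == "POSITIVE"   # assumed label name
        actual = rating_to_positive(rating)
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and not actual:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    total = sum(counts.values())
    accuracy = (counts["TP"] + counts["TN"]) / total if total else 0.0
    return accuracy, counts


# Toy data: model labels paired with the "overall" star ratings
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
toy_ratings = [5.0, 3.0, 2.0, 1.0]
toy_accuracy, toy_counts = accuracy_and_confusion(toy_labels, toy_ratings)
print(toy_accuracy, toy_counts)
```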


In [None]:
# Subtask 2.1

import json
from tqdm import tqdm

from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("sentiment-analysis", truncation=True, padding=True, device=device)

# Load data
reviews = []
with open("Musical_Instruments.json", 'r') as f:
    for line in f:
        # Create list of json objects by the line
        reviews.append(json.loads(line))

# Filter usable reviews (non-empty text + score present)
filtered_reviews = [
    r for r in reviews
    if r.get("reviewText") is not None and isinstance(r.get("overall"),float)
]

# Batch size for review analysis
batch_size = 16

# Run sentiment analysis in batches
correct = 0
total = 0

for i in tqdm(range(0, len(filtered_reviews), batch_size)):
    batch = filtered_reviews[i:i+batch_size]
    # Grab the information for this batch of reviews
    texts = [r["reviewText"][:512] for r in batch]  # Cap at 512 characters to keep inputs short; the model's token limit is enforced by truncation=True
    true_scores = [r["overall"] for r in batch]

    try:
        # Parallel runtime for the batch
        predictions = classifier(texts)
        for pred, score in zip(predictions, true_scores):
            model_sentiment = pred["label"] == "POSITIVE" # True - positive review
            # Our criteria is that a score of 3 or higher is a positive review
            actual_sentiment = score >= 3
            if model_sentiment == actual_sentiment:
                correct += 1
            total += 1
    except Exception as e:
        print(f"⚠️ Batch at index {i} failed: {e}")

print(f"\n✅ Accuracy: {correct / total:.2%} over {total} reviews")

### Results
![Task 2.1](Results%20Task%202.1.png)

### Subtask 2.2
In addition to looking at the accuracy of the model for each review, also compare the percentage of products with more positive reviews than negative reviews to the true percentage.

Reminder: Refer to the Hugging Face link for how to perform the sentiment analysis task.

In [3]:
# Subtask 2.2
# First we need to group the data by product ID
import json
from collections import defaultdict

# Load data
reviews_by_product = defaultdict(list)
with open("Musical_Instruments.json", 'r') as f:
    for line in f:
        # Create list of json objects by the line
        review = json.loads(line)
        product_id = review.get("asin")
        if product_id:
            reviews_by_product[product_id].append(review)

# Now we can analyze the reviews for each product
count_positive = 0
count_negative = 0
for product_id, reviews in reviews_by_product.items():
    # Filter out empty reviews
    filtered_reviews = [r for r in reviews if r.get("reviewText") and isinstance(r.get("overall"), float)]
    if not filtered_reviews:
        continue

    # Use the mean rating as a proxy for "more positive than negative reviews"
    avg_score = sum(r["overall"] for r in filtered_reviews) / len(filtered_reviews)
    if avg_score >= 3:
        count_positive += 1
    else:
        count_negative += 1

print(f"Percentage of products with more positive reviews: {count_positive / (count_positive + count_negative) * 100:.2f}%")

Percentage of products with more positive reviews: 90.75%
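
The task statement asks for products with more positive than negative reviews, and a mean rating of 3 or higher is only a proxy for that: a product rated 5, 2, and 2 has a mean of 3.0 but only one positive review out of three. A direct count could look like the sketch below. `share_majority_positive` is a hypothetical helper and the toy reviews are invented; only the `asin` and `overall` field names come from the dataset.

```python
from collections import defaultdict


def share_majority_positive(reviews, threshold=3.0):
    """Fraction of products with strictly more positive than negative reviews."""
    counts = defaultdict(lambda: [0, 0])  # asin -> [positive, negative]
    for review in reviews:
        index = 0 if review["overall"] >= threshold else 1
        counts[review["asin"]][index] += 1
    if not counts:
        return 0.0
    majority = sum(1 for positive, negative in counts.values() if positive > negative)
    return majority / len(counts)


# Toy data: product A has mean 3.0 but only one positive review out of three
toy_reviews = [
    {"asin": "A", "overall": 5.0},
    {"asin": "A", "overall": 2.0},
    {"asin": "A", "overall": 2.0},
    {"asin": "B", "overall": 4.0},
    {"asin": "B", "overall": 3.0},
]
toy_share = share_majority_positive(toy_reviews)
print(f"{toy_share:.2%}")
```

On this toy data the mean-rating rule would count both products as positive, while the direct count accepts only product B.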


In [None]:
# Subtask 2.2 (continued)
# Predicted share of products with more positive than negative reviews

import json
from tqdm import tqdm
from collections import defaultdict

from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("sentiment-analysis", truncation=True, padding=True, device=device)

# Load data
reviews = []
with open("Musical_Instruments.json", 'r') as f:
    for line in f:
        try:
            reviews.append(json.loads(line))
        except json.JSONDecodeError as e:
            print(f"⚠️ Skipping malformed line: {e}")

# Filter usable reviews (non-empty text + score present)
filtered_reviews = [
    r for r in reviews
    if r.get("reviewText") is not None and isinstance(r.get("overall"),float)
]

# Batch size for review analysis
batch_size = 16

# Run sentiment analysis in batches and tally the predicted labels per product

# Create a dictionary to store the positive and negative review counts for each product
counts_by_product = defaultdict(lambda: {"POSITIVE": 0, "NEGATIVE": 0})

for i in tqdm(range(0, len(filtered_reviews), batch_size)):
    batch = filtered_reviews[i:i+batch_size]
    # Grab the information for this batch of reviews
    texts = [r["reviewText"][:512] for r in batch]  # Cap at 512 characters to keep inputs short; the model's token limit is enforced by truncation=True
    product_ids = [r["asin"] for r in batch]

    try:
        # Parallel runtime for the batch
        predictions = classifier(texts)
        for pred, product_id in zip(predictions, product_ids):  # avoid shadowing the built-in id()
            counts_by_product[product_id][pred["label"]] += 1
    except Exception as e:
        print(f"⚠️ Batch at index {i} failed: {e}")

positive_count = sum(1 for v in counts_by_product.values() if v["POSITIVE"] > v["NEGATIVE"])
negative_count = len(counts_by_product) - positive_count
print(f"Percentage of products with more PREDICTED positive reviews: {positive_count / (positive_count + negative_count) * 100:.2f}%")

### Results
![Task 2.2](Results%20Task%202.2.png)

## Task 3 - Masked Language Modeling

In this task you will be completing a sentence or phrase that has a missing word. I will prepare three datasets for you so you can perform this task. The three datasets will contain missing verbs in one, missing nouns in another, and missing adjectives in the last.

Your task will be to look at the success rate of generating the true missing word in the top 1, top 5, and top 10 generated words for a given sentence. You will then compare the success rate between the three datasets.
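
The top-1, top-5, and top-10 success rates can be sketched independently of any model, given a ranked list of candidate words per sentence. This is an illustrative sketch only: `top_k_success` is a hypothetical helper, the toy predictions are invented, and lowercasing both sides is an assumption that makes sense when an uncased model is used (its vocabulary contains no capital letters, so a capitalized label could otherwise never match).

```python
def top_k_success(ranked_predictions, true_words, ks=(1, 5, 10)):
    """Fraction of examples whose true word appears in the top k predictions."""
    hits = {k: 0 for k in ks}
    for predictions, truth in zip(ranked_predictions, true_words):
        truth = truth.strip().lower()                 # assumption: case-insensitive match
        predictions = [p.lower() for p in predictions]
        for k in ks:
            if truth in predictions[:k]:
                hits[k] += 1
    total = len(true_words)
    return {k: (hits[k] / total if total else 0.0) for k in ks}


# Toy data: ranked candidates (best first) and the true missing words
toy_predictions = [
    ["ran", "walked", "went"],
    ["cat", "dog", "bird"],
]
toy_truth = ["walked", "Fish"]
toy_rates = top_k_success(toy_predictions, toy_truth, ks=(1, 2))
print(toy_rates)
```

By construction the rate can only stay the same or grow as k increases, which is a quick sanity check on any reported numbers.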

In [1]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# We need pre-trained tokenizers and prediction models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()
if torch.cuda.is_available():
    model.to(device)

def prepare_input(sentence):
    # Necessary because bert-base-uncased expects a [MASK] token instead of __MASKED__
    return sentence.replace("__MASKED__", "[MASK]")

def predict_masked_word(sentence, top_k=10):
    masked_sentence = prepare_input(sentence)
    inputs = tokenizer(masked_sentence, return_tensors="pt").to(device)  # keep inputs on the same device as the model
    input_ids = inputs["input_ids"]

    # Check for presence of [MASK] token
    mask_token_indices = torch.where(input_ids == tokenizer.mask_token_id)[1]
    if len(mask_token_indices) == 0:
        print(f"⚠️ No [MASK] token found in: {sentence}")
        return []

    mask_index = mask_token_indices[0]

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    mask_logits = logits[0, mask_index, :]
    topk_ids = torch.topk(mask_logits, top_k, dim=0).indices.tolist()

    return [tokenizer.decode([idx]).strip() for idx in topk_ids]



In [2]:
def evaluate_mlm(dataset):
    top_1 = 0
    top_5 = 0
    top_10 = 0
    total = len(dataset)

    for example in dataset:
        # Each example is a sentence with a __MASKED__ placeholder (turned into [MASK] by prepare_input above)
        text = example["text"]
        true_word = example["label"]
        predictions = predict_masked_word(text, top_k=10)

        if true_word in predictions[:1]:  # slice avoids an IndexError when no [MASK] was found
            top_1 += 1
        if true_word in predictions[:5]:
            top_5 += 1
        if true_word in predictions[:10]:
            top_10 += 1

    return {
        "top_1_acc": top_1 / total,
        "top_5_acc": top_5 / total,
        "top_10_acc": top_10 / total
    }

In [9]:
# Get our hands on the verbs, nouns, and adjectives data
import csv

def load_words(file_name):
    data = []
    with open(file_name, "r", encoding="utf-8") as f:
        reader = csv.reader(f, quotechar='"', delimiter=',', skipinitialspace=True)
        for row in reader:
            if len(row) != 2:
                print(f"⚠️ Bad row: {row}")
                continue
            sentence, label = row
            data.append({"text": sentence, "label": label.strip()})
    return data

verb_data = load_words("masked_verbs.csv")
noun_data = load_words("masked_nouns.csv")
adj_data = load_words("masked_adjs.csv")
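
The loader above expects two columns per row: a sentence containing the `__MASKED__` placeholder and the true missing word. The sketch below parses an in-memory sample with the same `csv.reader` settings to show why the quoting matters when a sentence contains commas. The sample rows are invented for illustration, and the exact layout of the provided CSV files is an assumption.

```python
import csv
import io

# Hypothetical sample in the assumed layout: "sentence with __MASKED__", label
sample_csv = (
    '"She __MASKED__ to the store, then went home.", walked\n'
    '"The dog is __MASKED__.", happy\n'
)

reader = csv.reader(
    io.StringIO(sample_csv),
    quotechar='"',
    delimiter=',',
    skipinitialspace=True,   # drops the space that follows each comma
)
sample_rows = [
    {"text": sentence, "label": label.strip()}
    for sentence, label in reader
]
print(sample_rows)
```

Without the quotes, the comma inside the first sentence would split it into three fields and the row would be rejected by the two-column check.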

In [10]:
# Run on all three datasets
results = {}
for category, dataset in [("verbs", verb_data), ("nouns", noun_data), ("adjectives", adj_data)]:
    results[category] = evaluate_mlm(dataset)

# Optional: Pretty print
for category, scores in results.items():
    print(f"\n{category.upper()} dataset:")
    for k, v in scores.items():
        print(f"  {k}: {v:.2%}")


VERBS dataset:
  top_1_acc: 26.60%
  top_5_acc: 52.20%
  top_10_acc: 63.40%

NOUNS dataset:
  top_1_acc: 18.60%
  top_5_acc: 32.40%
  top_10_acc: 37.80%

ADJECTIVES dataset:
  top_1_acc: 27.20%
  top_5_acc: 48.00%
  top_10_acc: 55.20%
