In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 10: Named Entity Recognition

**Total Points: 50**
- Reading Questions: 8 points
- Part 1 (Named Entities for Analysis): 6 points
- Part 2 (Fine-tune BERT Models): 12 points
- Part 3 (LLM for NER): 12 points
- Part 4 (Comparison): 8 points
- Part 5 (Reflection): 2 points

## Reading Questions (8 points)

Answer the following questions based on Chapter 4: Multilingual Named Entity Recognition from *Natural Language Processing with Transformers*.

**Question 1 (2 points):** Explain the BIO tagging scheme used in Named Entity Recognition. What do the B-, I-, and O tags represent, and why is this tagging scheme necessary for NER tasks instead of simply labeling entity types?

📝 **YOUR ANSWER HERE:**

**Question 2 (2 points):** Describe the tokenization pipeline used in transformer models for NER. What are the four main steps in this pipeline, and what is the purpose of each step? How does this process prepare text for NER tasks?

📝 **YOUR ANSWER HERE:**

**Question 3 (2 points):** Explain the difference between token-level and entity-level evaluation for NER. Why is entity-level F1 score (using metrics like seqeval) generally preferred over token-level accuracy for evaluating NER models?

📝 **YOUR ANSWER HERE:**

**Question 4 (2 points):** What are nested entities in NER, and why do they pose a challenge for traditional sequence labeling approaches? Provide an example of nested entities and explain how they complicate the BIO tagging scheme.

📝 **YOUR ANSWER HERE:**

## Part 1 - Using Named Entities for Analysis (6 points)

NER is often used to find trends or to support other analyses of text data. Once you have the NER tags, you can use them to extract the entities from the text and analyze them.

Here we'll use a dataset of made-up movie reviews. The idea is to use the entity tags to extract the actors and directors from the reviews, then to figure out which actors and directors are most likely to be involved with positive sentiment movies and negative sentiment movies. We'll load the dataset for you.

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "hobbes99/fake_movie_reviews_ner_sentiment"
)
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)

Here's an entry in the training set to get you started:

In [None]:
dataset["train"][0]

Notice that NER tags are stored as integers corresponding to their indices in `label_list`. You'll need to use those tags to extract the actor and director names. You can also extract the sentiment.
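
As a rough illustration of what "using the tags" means, here is a minimal sketch that maps integer tags back to label strings and groups each `B-` token with the `I-` tokens that follow it. The function name `extract_entities`, the toy label order, and the toy sentence are all made up for this example; check `label_list` for the real label order before relying on any index.

```python
def extract_entities(tokens, tag_ids, label_list):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = label_list[tag_id]  # integer tag -> string label
        if label.startswith("B-"):
            # A new entity starts; close any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag: close any open entity and skip this token.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Toy data: this label order and sentence are hypothetical, not from the dataset.
toy_labels = ["O", "B-ACTOR", "I-ACTOR", "B-DIRECTOR", "I-DIRECTOR"]
toy_tokens = ["Jane", "Doe", "shines", "in", "a", "film", "by", "John", "Q", "Public"]
toy_tags = [1, 2, 0, 0, 0, 0, 0, 3, 4, 4]

print(extract_entities(toy_tokens, toy_tags, toy_labels))
```

Returning `(type, text)` pairs keeps the extraction step separate from the counting step, so the same function could feed both the actor and director tallies.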

For the training split, find and display in order:
* The three actors most likely to appear in positive films.
* The three actors most likely to appear in negative films.
* The three directors most likely to have directed positive films.
* The three directors most likely to have directed negative films.

In [None]:
# Step 1: Extract entities from training set

# Your code here:
# 1. Create empty lists or dictionaries to store:
#    - Actors with positive sentiment
#    - Actors with negative sentiment
#    - Directors with positive sentiment
#    - Directors with negative sentiment
# 2. Loop through the training dataset
# 3. For each example, extract:
#    - The sentiment (0 or 1)
#    - Tokens tagged as B-ACTOR or I-ACTOR
#    - Tokens tagged as B-DIRECTOR or I-DIRECTOR
# 4. Keep track of counts for each actor/director by sentiment

# Hint: You'll need to:
# - Convert token lists to strings (join each B- token with the I- tokens that follow it)
# - Track how many times each actor appears in positive vs negative reviews
# - Use label_list to convert numeric tags to string labels

# Example structure:
# from collections import defaultdict
# 
# actor_positive = defaultdict(int)
# actor_negative = defaultdict(int)
# director_positive = defaultdict(int)
# director_negative = defaultdict(int)
#
# for example in dataset["train"]:
#     tokens = example["tokens"]
#     ner_tags = example["ner_tags"]
#     sentiment = example["sentiment"]
#     
#     # Your extraction logic here


In [None]:
# Step 2: Calculate proportions and find top 3

# Your code here:
# 1. For each actor/director, calculate:
#    total_appearances = positive_count + negative_count
#    positive_proportion = positive_count / total_appearances
# 2. Sort by positive proportion (descending for positive, ascending for negative)
# 3. Select top 3 for each category
# 4. Display results with counts and proportions

# Example:
# import pandas as pd
# 
# # Create DataFrame for analysis
# actor_stats = []
# for actor in set(actor_positive) | set(actor_negative):
#     pos_count = actor_positive[actor]
#     neg_count = actor_negative[actor]
#     total = pos_count + neg_count
#     if total >= 2:  # Filter out rare appearances
#         actor_stats.append({
#             'name': actor,
#             'positive': pos_count,
#             'negative': neg_count,
#             'total': total,
#             'positive_pct': pos_count / total
#         })
# 
# df_actors = pd.DataFrame(actor_stats)
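# df_actors_sorted = df_actors.sort_values('positive_pct', ascending=False)
# 
# print("Top 3 actors in positive films:")
# display(df_actors_sorted.head(3))
# 
# print("\nTop 3 actors in negative films:")
# # Sort ascending so the actor with the lowest positive proportion is listed first
# display(df_actors.sort_values('positive_pct', ascending=True).head(3))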


## Part 2 - Fine Tuning Two BERT NER Models (12 points)

The MIT Movie Corpus is designed for movie-related NER tasks and includes the following entity types in BIO format:
- **Actor**: Names of actors or actresses (e.g., "Leonardo DiCaprio").
- **Character**: Names of characters in movies (e.g., "Jack Dawson").
- **Director**: Names of movie directors (e.g., "Christopher Nolan").
- **Genre**: Movie genres (e.g., "Action", "Drama").
- **Title**: Titles of movies (e.g., "Inception").
- **Year**: Year the movie was made.

The original movie corpus includes more entity types, but we've produced a simplified version for this assignment.

In this part of the assignment, fine-tune "distilbert-base-uncased" and "bert-base-uncased" for NER on the dataset "hobbes99/mit-movie-ner-simplified". The dataset has "train" and "valid" splits. Use the "train" split for fine-tuning and the "valid" split for evaluation, computing the metrics with seqeval as shown in the lesson.
* Figure out a way to plot precision, recall, and F1 by entity type.
* Find two movie reviews on the internet and run inference on them to extract the named entities.
* Write a brief summary of the results. Include answers to:
    * Which entity types does the model struggle with?
    * Which does it do well on?
* The "distilbert-base-uncased" model is a distilled version of "bert-base-uncased" (distillation trains a smaller "student" model to reproduce the behavior of a larger, already-trained "teacher" model). The larger "bert-base-uncased" model should lead to better results here. Does it? Discuss.
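
To make "precision, recall, and F1 by entity type" concrete, the sketch below computes entity-level scores in plain Python: a predicted entity counts as correct only if its type and both boundaries match a true entity exactly. This is a simplified illustration, not seqeval's implementation. The helper names `bio_spans` and `per_entity_scores` are hypothetical, and stray `I-` tags are ignored here, which may differ from how seqeval treats them.

```python
from collections import defaultdict


def bio_spans(labels):
    """Return the set of (entity_type, start, end) spans in one BIO sequence."""
    spans, start, etype = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, label in enumerate(list(labels) + ["O"]):
        continues = label.startswith("I-") and label[2:] == etype
        if start is not None and not continues:
            spans.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans


def per_entity_scores(true_seqs, pred_seqs):
    """Entity-level precision, recall, and F1 for each entity type."""
    counts = defaultdict(lambda: {"tp": 0, "n_pred": 0, "n_true": 0})
    for true_labels, pred_labels in zip(true_seqs, pred_seqs):
        true_spans, pred_spans = bio_spans(true_labels), bio_spans(pred_labels)
        for etype, _, _ in true_spans:
            counts[etype]["n_true"] += 1
        for etype, _, _ in pred_spans:
            counts[etype]["n_pred"] += 1
        # Exact matches on type and both boundaries are true positives.
        for etype, _, _ in true_spans & pred_spans:
            counts[etype]["tp"] += 1
    scores = {}
    for etype, c in counts.items():
        precision = c["tp"] / c["n_pred"] if c["n_pred"] else 0.0
        recall = c["tp"] / c["n_true"] if c["n_true"] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy sequences, made up for illustration.
true_seqs = [["B-ACTOR", "I-ACTOR", "O", "B-YEAR"]]
pred_seqs = [["B-ACTOR", "O", "O", "B-YEAR"]]
scores = per_entity_scores(true_seqs, pred_seqs)
print(scores)
```

In the toy example the predicted actor span stops one token early, so it earns no credit at the entity level even though its first token is tagged correctly. A dictionary shaped like `scores` converts directly into a DataFrame for a grouped bar chart.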

In [None]:
# Step 1: Load the dataset
from datasets import load_dataset

# Your code here:
# 1. Load the "hobbes99/mit-movie-ner-simplified" dataset
# 2. Examine the 'train' and 'valid' splits
# 3. Print an example to see the structure
# 4. Get the label names from the dataset features

# Example:
# dataset = load_dataset("hobbes99/mit-movie-ner-simplified")
# print("Train examples:", len(dataset["train"]))
# print("Valid examples:", len(dataset["valid"]))
# print("\nFirst training example:")
# print(dataset["train"][0])
# 
# label_names = dataset["train"].features["ner_tags"].feature.names
# print("\nNER labels:", label_names)


In [None]:
# Step 2: Prepare for fine-tuning - Tokenization
from transformers import AutoTokenizer

# Your code here:
# 1. Load two tokenizers:
#    - "distilbert-base-uncased"
#    - "bert-base-uncased"
# 2. Create a tokenization function that:
#    - Tokenizes the tokens (use is_split_into_words=True)
#    - Aligns the NER labels with tokenized output
#    - Handles subword tokens (use -100 for ignored tokens)
# 3. Apply tokenization to both train and valid splits

# Example structure:
# tokenizer_distilbert = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
# 
# def tokenize_and_align_labels(examples, tokenizer):
#     tokenized_inputs = tokenizer(
#         examples["tokens"],
#         truncation=True,
#         is_split_into_words=True,
#         padding="max_length",
#         max_length=128
#     )
#     
#     labels = []
#     for i, label in enumerate(examples["ner_tags"]):
#         word_ids = tokenized_inputs.word_ids(batch_index=i)
#         label_ids = []
#         previous_word_idx = None
#         
#         for word_idx in word_ids:
#             if word_idx is None:
#                 label_ids.append(-100)
#             elif word_idx != previous_word_idx:
#                 label_ids.append(label[word_idx])
#             else:
#                 label_ids.append(-100)  # Set subword tokens to -100
#             previous_word_idx = word_idx
#         
#         labels.append(label_ids)
#     
#     tokenized_inputs["labels"] = labels
#     return tokenized_inputs
# 
# # Apply to datasets
# tokenized_train_distilbert = dataset["train"].map(
#     lambda x: tokenize_and_align_labels(x, tokenizer_distilbert),
#     batched=True
# )
# tokenized_valid_distilbert = dataset["valid"].map(
#     lambda x: tokenize_and_align_labels(x, tokenizer_distilbert),
#     batched=True
# )


In [None]:
# Step 3: Set up metrics for evaluation
import evaluate
import numpy as np

# Your code here:
# 1. Load the seqeval metric
# 2. Create a compute_metrics function that:
#    - Extracts predictions and labels
#    - Removes ignored tokens (-100)
#    - Converts numeric labels to string labels
#    - Computes precision, recall, F1 overall and per-entity

# Example:
# seqeval = evaluate.load("seqeval")
# 
# def compute_metrics(eval_pred):
#     predictions, labels = eval_pred
#     predictions = np.argmax(predictions, axis=2)
#     
#     # Remove ignored index (-100: special tokens, padding, non-first subwords) and convert to labels
#     true_predictions = [
#         [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
#         for prediction, label in zip(predictions, labels)
#     ]
#     true_labels = [
#         [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
#         for prediction, label in zip(predictions, labels)
#     ]
#     
#     results = seqeval.compute(predictions=true_predictions, references=true_labels)
#     return {
#         "precision": results["overall_precision"],
#         "recall": results["overall_recall"],
#         "f1": results["overall_f1"],
#         "accuracy": results["overall_accuracy"],
#     }


In [None]:
# Step 4: Fine-tune distilbert-base-uncased
from transformers import (
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)

# Your code here:
# 1. Load the model with num_labels matching your dataset
# 2. Create TrainingArguments (output_dir, num_epochs, batch_size, etc.)
# 3. Create a DataCollator for token classification
# 4. Create Trainer with model, args, datasets, tokenizer, data_collator, compute_metrics
# 5. Train the model

# Example:
# num_labels = len(label_names)
# 
# model_distilbert = AutoModelForTokenClassification.from_pretrained(
#     "distilbert-base-uncased",
#     num_labels=num_labels,
#     id2label={i: label for i, label in enumerate(label_names)},
#     label2id={label: i for i, label in enumerate(label_names)}
# )
# 
# training_args = TrainingArguments(
#     output_dir="./results_distilbert",
#     evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
#     save_strategy="epoch",
#     learning_rate=2e-5,
#     per_device_train_batch_size=16,
#     per_device_eval_batch_size=16,
#     num_train_epochs=3,
#     weight_decay=0.01,
#     load_best_model_at_end=True,
# )
# 
# data_collator = DataCollatorForTokenClassification(tokenizer_distilbert)
# 
# trainer_distilbert = Trainer(
#     model=model_distilbert,
#     args=training_args,
#     train_dataset=tokenized_train_distilbert,
#     eval_dataset=tokenized_valid_distilbert,
#     tokenizer=tokenizer_distilbert,  # newer transformers releases name this argument processing_class
#     data_collator=data_collator,
#     compute_metrics=compute_metrics,
# )
# 
# trainer_distilbert.train()


In [None]:
# Step 5: Evaluate distilbert and get per-entity metrics

# Your code here:
# 1. Run evaluation on the valid set
# 2. Get predictions for detailed analysis
# 3. Calculate precision, recall, F1 for each entity type
# 4. Create a visualization (bar chart) comparing metrics by entity

# Example:
# # Get predictions
# predictions_distilbert = trainer_distilbert.predict(tokenized_valid_distilbert)
# preds_distilbert = np.argmax(predictions_distilbert.predictions, axis=2)
# 
# # Convert to label strings
# true_predictions_distilbert = [
#     [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
#     for prediction, label in zip(preds_distilbert, predictions_distilbert.label_ids)
# ]
# true_labels_distilbert = [
#     [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
#     for prediction, label in zip(preds_distilbert, predictions_distilbert.label_ids)
# ]
# 
# # Get detailed results
# results_detailed_distilbert = seqeval.compute(
#     predictions=true_predictions_distilbert,
#     references=true_labels_distilbert
# )
# 
# print("DistilBERT Results:")
# print(f"Overall F1: {results_detailed_distilbert['overall_f1']:.3f}")
# print("\nPer-entity results:")
# for entity_type in results_detailed_distilbert.keys():
#     if entity_type not in ['overall_precision', 'overall_recall', 'overall_f1', 'overall_accuracy']:
#         print(f"{entity_type}: {results_detailed_distilbert[entity_type]}")


In [None]:
# Step 6: Create visualization of per-entity performance
import matplotlib.pyplot as plt
import pandas as pd

# Your code here:
# 1. Extract precision, recall, F1 for each entity type
# 2. Create a DataFrame with entity types and metrics
# 3. Plot a grouped bar chart comparing precision, recall, F1

# Example:
# # Extract entity metrics (skip 'overall_*' keys)
# entity_metrics = []
# for key, value in results_detailed_distilbert.items():
#     if not key.startswith('overall_'):
#         entity_metrics.append({
#             'Entity': key,
#             'Precision': value.get('precision', 0),
#             'Recall': value.get('recall', 0),
#             'F1': value.get('f1', 0)
#         })
# 
# df_metrics = pd.DataFrame(entity_metrics)
# 
# # Plot
# df_metrics.plot(x='Entity', y=['Precision', 'Recall', 'F1'], kind='bar', figsize=(10, 6))
# plt.title('DistilBERT NER Performance by Entity Type')
# plt.ylabel('Score')
# plt.xlabel('Entity Type')
# plt.legend(loc='lower right')
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()


In [None]:
# Step 7: Repeat for bert-base-uncased

# Your code here:
# 1. Tokenize the dataset with bert-base-uncased tokenizer
# 2. Load bert-base-uncased model
# 3. Create new Trainer with BERT model
# 4. Train the BERT model
# 5. Evaluate and compare with DistilBERT

# (Follow same steps as DistilBERT - cells above)


In [None]:
# Step 8: Run inference on movie reviews from internet
from transformers import pipeline

# Your code here:
# 1. Find 2 movie reviews from the internet (IMDB, Rotten Tomatoes, etc.)
# 2. Create a NER pipeline with your fine-tuned model
# 3. Run inference on the reviews
# 4. Display the extracted entities with their types

# Example:
# # Create NER pipeline
# ner_pipeline_distilbert = pipeline(
#     "ner",
#     model=model_distilbert,
#     tokenizer=tokenizer_distilbert,
#     aggregation_strategy="simple"
# )
# 
# # Example review
# review_1 = '''
# The new Christopher Nolan film starring Leonardo DiCaprio is a masterpiece.
# Set in 2010, this Science Fiction thriller takes viewers on a mind-bending journey.
# '''
# 
# review_2 = '''
# I watched The Dark Knight last night. Heath Ledger's performance as the Joker
# was incredible. This 2008 Action film directed by Christopher Nolan is a must-see.
# '''
# 
# # Run NER
# entities_1 = ner_pipeline_distilbert(review_1)
# entities_2 = ner_pipeline_distilbert(review_2)
# 
# print("Review 1 entities:")
# for entity in entities_1:
#     print(f"  {entity['word']}: {entity['entity_group']} (score: {entity['score']:.3f})")


In [None]:
# Step 9: Compare DistilBERT vs BERT

# Your code here:
# 1. Create a comparison table with:
#    - Model name
#    - Overall F1 score
#    - Training time (approximate)
#    - Model size (parameters)
#    - Best/worst entity types
# 2. Discuss which model performed better
# 3. Analyze whether any gain in BERT's performance justifies its larger size and longer training time

# Example:
# comparison_data = {
#     'Model': ['DistilBERT', 'BERT'],
#     'Parameters': ['66M', '110M'],
#     'Overall F1': [None, None],  # Replace None with your measured F1 scores
#     'Training Time': ['X min', 'Y min'],
#     'Best Entity': ['...', '...'],
#     'Worst Entity': ['...', '...']
# }
# 
# df_comparison = pd.DataFrame(comparison_data)
# display(df_comparison)
# 
# print("\nAnalysis:")
# print("BERT performed [better/worse] than DistilBERT by [X]% F1 score.")
# print("The improvement [was/was not] worth the extra training time because...")


## Part 3 - Using an LLM for NER (12 points)

For the first 100 texts in the "valid" split, mimic what we did in the lesson to extract the "Actor", "Character", "Director", "Genre", "Title" and "Year" entities using an LLM. Start with just a few examples to refine your prompt and instructions, then ramp up to 100 or more examples. Get the final evaluation metrics as shown in the lesson.

**Hint:** You can import the `llm_ner_extractor` function from `Lesson_10_Helpers` to streamline your LLM-based extraction, similar to how we used `llm_classifier` in Lesson 8.
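
The trickiest step with an LLM is turning its free-form answer back into per-token BIO tags so that it can be scored like the BERT models. The sketch below shows one way this could work, assuming the LLM returns a JSON object that maps entity types to lists of strings and that labels follow the `B-ACTOR` / `I-ACTOR` uppercase pattern. The helpers `parse_llm_json` and `entities_to_bio` are hypothetical and are not the implementation inside `llm_ner_extractor`.

```python
import json


def parse_llm_json(response):
    """Parse the LLM response; fall back to no entities on malformed JSON."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def entities_to_bio(tokens, entities):
    """Tag tokens with BIO labels by matching each entity string's token span."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for etype, names in entities.items():
        if not isinstance(names, list):
            continue  # skip values that are not lists of entity strings
        for name in names:
            words = str(name).lower().split()
            n = len(words)
            if n == 0:
                continue
            for start in range(len(tokens) - n + 1):
                window_free = all(tag == "O" for tag in tags[start:start + n])
                if lowered[start:start + n] == words and window_free:
                    tags[start] = f"B-{etype.upper()}"
                    for k in range(start + 1, start + n):
                        tags[k] = f"I-{etype.upper()}"
                    break  # tag only the first untagged match of this string
    return tags


# Toy example: the sentence and the LLM response are made up for illustration.
tokens = "what 1994 drama did frank darabont direct".split()
response = '{"Director": ["Frank Darabont"], "Genre": ["drama"], "Year": ["1994"], "Actor": []}'
tags = entities_to_bio(tokens, parse_llm_json(response))
print(list(zip(tokens, tags)))
```

Matching is case-insensitive and exact at the token level, so an entity the LLM paraphrases or misspells is silently left as `O`. That choice keeps the sketch short, but it means extraction errors and matching errors both show up as lower recall.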

In [None]:
# Step 1: Prepare validation subset
from Lesson_10_Helpers import llm_ner_extractor

# Your code here:
# 1. Get first 100 examples from valid split
# 2. Extract texts and true labels
# 3. Start with 2-3 examples to test your prompt

# Example:
# valid_subset = dataset["valid"].select(range(100))
# valid_texts = [" ".join(example["tokens"]) for example in valid_subset]
# valid_labels = valid_subset["ner_tags"]
# 
# # Test with just 3 examples first
# test_texts = valid_texts[:3]
# print("Testing with 3 examples:")
# for i, text in enumerate(test_texts):
#     print(f"\nExample {i+1}: {text}")


In [None]:
# Step 2: Design prompt for LLM-based NER

# Your code here:
# 1. Create system_prompt explaining the task
# 2. Create prompt_template with:
#    - Instructions to extract entities
#    - Entity types to look for: Actor, Character, Director, Genre, Title, Year
#    - Request structured output (JSON format recommended)
#    - Include the {text} placeholder

# Example:
# system_prompt = '''You are an expert at identifying named entities in movie-related text.
# Your task is to extract entities and classify them into the following types:
# - Actor: Names of actors or actresses
# - Character: Names of characters in movies
# - Director: Names of movie directors
# - Genre: Movie genres (Action, Drama, Comedy, etc.)
# - Title: Titles of movies
# - Year: Years when movies were made
# '''
# 
# prompt_template = '''Extract all named entities from the following text and classify them by type.
# 
# Return ONLY a JSON object with this structure:
# {{
#   "Actor": ["name1", "name2"],
#   "Character": ["name1"],
#   "Director": ["name1"],
#   "Genre": ["genre1"],
#   "Title": ["title1"],
#   "Year": ["year1"]
# }}
# 
# If no entities of a type are found, use an empty list [].
# 
# Text: {text}
# 
# JSON output:'''


In [None]:
# Step 3: Test prompt on small subset with llm_ner_extractor

# Your code here:
# 1. Use llm_ner_extractor function (similar to llm_classifier from Lesson 8)
# 2. Start with a small API model or local model
# 3. Test on 3-5 examples
# 4. Refine your prompt based on results
# 5. Parse JSON output and convert to BIO format

# Example:
# from introdl import llm_generate
# import json
# 
# # Test with one model first
# model_name = "gemini-flash-lite"  # or try a local model
# 
# # Generate predictions for test examples
# predictions_raw = []
# for text in test_texts:
#     prompt = prompt_template.format(text=text)
#     response = llm_generate(
#         prompt=prompt,
#         system_prompt=system_prompt,
#         model=model_name,
#         temperature=0.1,  # Low temperature for more consistent outputs
#         max_tokens=500
#     )
#     predictions_raw.append(response)
#     print(f"\nText: {text}")
#     print(f"Response: {response}")


In [None]:
# Step 4: Convert LLM outputs to BIO format

# Your code here:
# 1. Parse JSON from LLM response
# 2. Match entity spans to token positions
# 3. Convert to BIO format (B-ACTOR, I-ACTOR, etc.)
# 4. Handle errors/malformed JSON gracefully

# This is complex! You may want to use llm_ner_extractor from Lesson_10_Helpers
# which handles the conversion for you.

# Example using llm_ner_extractor:
# predictions_bio = llm_ner_extractor(
#     model_name=model_name,
#     texts=test_texts,
#     system_prompt=system_prompt,
#     prompt_template=prompt_template,
#     label_names=label_names,
#     estimate_cost=True
# )
# 
# print("\nBIO predictions:", predictions_bio)


In [None]:
# Step 5: Scale up to 100 examples and evaluate

# Your code here:
# 1. Once prompt is refined, run on all 100 validation examples
# 2. Convert predictions to BIO format
# 3. Calculate metrics using seqeval (like Part 2)
# 4. Generate classification report

# Example:
# # Run on all 100 examples
# predictions_llm = llm_ner_extractor(
#     model_name="gemini-flash-lite",
#     texts=valid_texts,  # All 100 texts
#     system_prompt=system_prompt,
#     prompt_template=prompt_template,
#     label_names=label_names,
#     estimate_cost=True
# )
# 
# # Evaluate
# results_llm = seqeval.compute(
#     predictions=predictions_llm,
#     references=[[label_names[l] for l in labels] for labels in valid_labels]
# )
# 
# print("\nLLM NER Results:")
# print(f"Overall F1: {results_llm['overall_f1']:.3f}")
# print(f"Precision: {results_llm['overall_precision']:.3f}")
# print(f"Recall: {results_llm['overall_recall']:.3f}")


## Part 4 - Comparison (8 points)

* Compare the results of the two entity recognition techniques (fine-tuned BERT models vs LLM zero-shot) both quantitatively and qualitatively.
* Consider the difficulty of obtaining labeled data in your comparison. Tagged text is time-consuming and/or costly to produce. The LLM approach does not require it, but may be less accurate.
* Which approach would you choose for a production system and why? Consider accuracy, speed, cost, and maintenance requirements.
* Give a brief summary of what you learned in this assignment.

📝 **YOUR COMPARISON AND SUMMARY HERE:**

## Part 5 - Reflection (2 points)

1. What, if anything, did you find difficult to understand in the lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()