In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 10: Named Entity Recognition

**Total Points: 50**
- Reading Questions: 8 points
- Part 1 (Named Entities for Analysis): 6 points
- Part 2 (Fine-tune BERT Models): 12 points
- Part 3 (LLM for NER): 12 points
- Part 4 (Comparison): 8 points
- Part 5 (Reflection): 2 points

## Reading Questions (8 points)

Answer the following questions based on Chapter 4: Multilingual Named Entity Recognition from *Natural Language Processing with Transformers*.

**Question 1 (2 points):** Explain the BIO tagging scheme used in Named Entity Recognition. What do the B-, I-, and O tags represent, and why is this tagging scheme necessary for NER tasks instead of simply labeling entity types?

📝 **YOUR ANSWER HERE:**

**Question 2 (2 points):** Compare the traditional BiLSTM-CRF architecture to transformer-based models (like BERT) for NER tasks. What are the main advantages of using transformer models for NER, and what challenges remain?

📝 **YOUR ANSWER HERE:**

**Question 3 (2 points):** Explain the difference between token-level and entity-level evaluation for NER. Why is entity-level F1 score (using metrics like seqeval) generally preferred over token-level accuracy for evaluating NER models?

📝 **YOUR ANSWER HERE:**

**Question 4 (2 points):** What are nested entities in NER, and why do they pose a challenge for traditional sequence labeling approaches? Provide an example of nested entities and explain how they complicate the BIO tagging scheme.

📝 **YOUR ANSWER HERE:**

## Part 1 - Using Named Entities for Analysis (6 points)

NER is often used to find trends in text data or to support other analyses. Once you have the NER tags, you can use them to extract the entities from the text for analysis.

Here we'll use a dataset of made-up movie reviews. The idea is to use the entity tags to extract the actors and directors from the reviews, then to figure out which actors and directors are most often associated with positive-sentiment movies and which with negative-sentiment movies. We'll load the dataset for you.

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "hobbes99/fake_movie_reviews_ner_sentiment"
)
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)

Here's an entry in the training set to get you started:

In [None]:
dataset["train"][0]

Notice that NER tags are stored as integers corresponding to their indices in `label_list`. You'll need to use those tags to extract the actor and director names. You can also extract the sentiment.
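As a starting point, here is a minimal sketch of how integer BIO tags can be grouped back into entity strings. It runs on a toy example only: the label names in `toy_labels` and the idea that each record has a list of tokens alongside its `ner_tags` are assumptions for illustration, so check the printed `label_list` and the sample entry above for the real names before adapting it.

```python
def extract_entities(tokens, tag_ids, label_names):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = label_names[tag_id]  # integer tag -> string label
        if label.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Toy data (hypothetical labels and tokens, not taken from the dataset).
toy_labels = ["O", "B-ACTOR", "I-ACTOR", "B-DIRECTOR", "I-DIRECTOR"]
toy_tokens = ["Jane", "Doe", "shines", "in", "a", "film", "by", "John", "Roe", "."]
toy_tags = [1, 2, 0, 0, 0, 0, 0, 3, 4, 0]

print(extract_entities(toy_tokens, toy_tags, toy_labels))
```

From there, one option is to pair each extracted name with the review's sentiment and count occurrences, for example with `collections.Counter`.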

For the training split, find and display in order:
* The three actors most likely to appear in positive films.
* The three actors most likely to appear in negative films.
* The three directors most likely to have directed positive films.
* The three directors most likely to have directed negative films.

## Part 2 - Fine Tuning Two BERT NER Models (12 points)

The MIT Movie Corpus is designed for movie-related NER tasks and includes the following entity types in BIO format:
- **Actor**: Names of actors or actresses (e.g., "Leonardo DiCaprio").
- **Character**: Names of characters in movies (e.g., "Jack Dawson").
- **Director**: Names of movie directors (e.g., "Christopher Nolan").
- **Genre**: Movie genres (e.g., "Action", "Drama").
- **Title**: Titles of movies (e.g., "Inception").
- **Year**: Year the movie was made.

The original movie corpus includes more entity types, but we've produced a simplified version for this assignment.

In this part of the assignment you should fine-tune "distilbert-base-uncased" and "bert-base-uncased" for NER on the dataset "hobbes99/mit-movie-ner-simplified". The dataset has "train" and "valid" splits. Use the "train" split for fine-tuning and the "valid" split for evaluation, computing the metrics with seqeval as shown in the lesson.
* Figure out a way to plot precision, recall, and F1 by entity type.
* Find two movie reviews on the internet and run inference on them to extract the named entities.
* Write a brief summary of the results. Include answers to:
    * Which entity types does the model struggle with?
    * Which does it do well on?
* The "distilbert-base-uncased" model is a distilled version of the "bert-base-uncased" model (distillation trains a smaller model using the larger, already-trained model as a "teacher"). The "bert-base-uncased" model should lead to better results here. Does it? Discuss.
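For the per-entity plot, one possible first step is to reshape the metrics into plot-ready lists. The sketch below assumes a report dictionary shaped like the one seqeval's `classification_report` returns with `output_dict=True` (one entry per entity type plus aggregate rows whose names end in "avg"). The helper name `per_entity_metrics` is hypothetical, and the numbers in `placeholder_report` are made-up placeholders, not results from any model.

```python
def per_entity_metrics(report):
    """Split a seqeval-style report dict into plot-ready lists."""
    # Skip aggregate rows such as "micro avg" and "macro avg".
    entity_types = sorted(key for key in report if not key.endswith("avg"))
    metrics = {
        name: [report[entity][name] for entity in entity_types]
        for name in ("precision", "recall", "f1-score")
    }
    return entity_types, metrics


# Placeholder values for illustration only (not real evaluation results).
placeholder_report = {
    "TITLE": {"precision": 1.0, "recall": 1.0, "f1-score": 1.0, "support": 2},
    "ACTOR": {"precision": 0.5, "recall": 0.5, "f1-score": 0.5, "support": 4},
    "micro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 0.75, "support": 6},
}

entity_types, metrics = per_entity_metrics(placeholder_report)
print(entity_types)
print(metrics)
```

The three lists in `metrics` line up with `entity_types`, so they can go straight into a grouped bar chart with one group per entity type.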

## Part 3 - Using an LLM for NER (12 points)

For the first 100 texts in the "valid" split, mimic what we did in the lesson to extract the "Actor", "Character", "Director", "Genre", "Title" and "Year" entities using an LLM. Start with just a few examples to refine your prompt and instructions, then ramp up to 100 or more examples. Get the final evaluation metrics as shown in the lesson.

**Hint:** You can import the `llm_ner_extractor` function from `Lesson_10_Helpers` to streamline your LLM-based extraction, similar to how we used `llm_classifier` in Lesson 8.
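To score LLM output with seqeval, the extracted entities have to be turned back into one BIO tag per token. Here is a minimal sketch of that step, assuming the LLM's answer has already been parsed into a dictionary mapping entity type to a list of strings. That format, the function name `entities_to_bio`, and the label spellings are assumptions for illustration; the output of `llm_ner_extractor` and the dataset's label names may differ, so follow the lesson where they do.

```python
def entities_to_bio(tokens, extracted):
    """Tag tokens with BIO labels by matching extracted entity strings."""
    tags = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    for entity_type, mentions in extracted.items():
        for mention in mentions:
            words = mention.lower().split()
            n = len(words)
            if n == 0:
                continue  # ignore empty strings returned by the LLM
            for start in range(len(tokens) - n + 1):
                # Only tag spans that match exactly and are still untagged.
                span_is_free = all(tag == "O" for tag in tags[start:start + n])
                if lowered[start:start + n] == words and span_is_free:
                    tags[start] = f"B-{entity_type}"
                    for i in range(start + 1, start + n):
                        tags[i] = f"I-{entity_type}"
    return tags


# Toy data (hypothetical tokens and LLM output, not taken from the dataset).
toy_query_tokens = ["show", "me", "a", "comedy", "with", "jane", "doe"]
toy_extracted = {"Actor": ["Jane Doe"], "Genre": ["comedy"]}

print(entities_to_bio(toy_query_tokens, toy_extracted))
```

Exact, case-insensitive matching is a deliberately simple choice: if the LLM paraphrases or re-spells an entity, the span stays tagged "O" and counts as a miss, which is worth keeping in mind when you interpret the metrics.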

## Part 4 - Comparison (8 points)

* Compare the results of the two entity recognition techniques (fine-tuned BERT models vs LLM zero-shot) both quantitatively and qualitatively.
* Consider the difficulty of obtaining labeled data in your comparison. Tagged text is time-consuming and/or costly to obtain. The LLM approach doesn't require it, but may be less accurate.
* Which approach would you choose for a production system and why? Consider accuracy, speed, cost, and maintenance requirements.
* Give a brief summary of what you learned in this assignment.

📝 **YOUR COMPARISON AND SUMMARY HERE:**

## Part 5 - Reflection (2 points)

1. What, if anything, did you find difficult to understand in the lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()