In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

## Part 1 - Using Named Entities for Analysis (6 points)

NER is often used to look for trends or to support other analyses of text data.  Once you have the NER tags, you can use them to extract the entities from the text for analysis.

Here we'll use a dataset of made-up movie reviews.  The idea is to use the entity tags to extract the actors and directors from the reviews, then to figure out which actors and directors are most likely to be involved with movies that receive positive sentiment and movies that receive negative sentiment.  We'll load the dataset for you.

In [None]:
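from datasets import load_dataset

# cache_dir is machine-specific; change or remove it to match your setup
dataset = load_dataset(
    "hobbes99/fake_movie_reviews_ner_sentiment",
    cache_dir="C:/Users/bagge/huggingface_cache"
)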
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)

['O', 'B-ACTOR', 'I-ACTOR', 'B-CHARACTER', 'I-CHARACTER', 'B-DIRECTOR', 'I-DIRECTOR', 'B-GENRE', 'I-GENRE', 'B-TITLE', 'I-TITLE', 'B-YEAR', 'I-YEAR']


Here's an entry in the training set to get you started:

In [None]:
dataset["train"][0]

{'tokens': ['Even',
  'the',
  'usually',
  'reliable',
  'Logan',
  'Dark',
  "can't",
  'save',
  'Eternal',
  'Oath',
  ',',
  'a',
  'thriller',
  'film',
  'from',
  '1988',
  "that's",
  'as',
  'clumsy',
  'as',
  'they',
  'come',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  1,
  2,
  0,
  0,
  9,
  10,
  0,
  0,
  7,
  0,
  0,
  11,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'review': "Even the usually reliable Logan Dark can't save Eternal Oath, a thriller film from 1988 that's as clumsy as they come.",
 'sentiment': 'negative',
 'entities': {'Actor': 'Logan Dark',
  'Character': None,
  'Director': None,
  'Genre': 'thriller',
  'Title': 'Eternal Oath',
  'Year': '1988'},
 'movie_rating': 3}

Notice that NER tags are stored as integers corresponding to their indices in `label_list`.  You'll need to use those tags to extract the actor and director names.  You can also extract the sentiment.  
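As a starting point, here is a minimal sketch of decoding integer BIO tags into entity spans. The helper name `extract_spans` is hypothetical (it is not part of the course tools), and the sketch assumes every entity starts with a `B-` tag and continues with `I-` tags of the same type.

```python
from typing import List, Tuple

def extract_spans(tokens: List[str], tag_ids: List[int],
                  label_list: List[str]) -> List[Tuple[str, str]]:
    """Decode BIO tag ids into (entity_type, text) pairs.

    Hypothetical helper: an I- tag that does not continue the current
    entity is treated like O, which is a simplifying assumption.
    """
    spans = []
    current_type = None
    current_tokens = []
    for token, tag_id in zip(tokens, tag_ids):
        label = label_list[tag_id]
        if label.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # O tag (or a stray I- tag): close any open entity.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

label_list = ['O', 'B-ACTOR', 'I-ACTOR', 'B-CHARACTER', 'I-CHARACTER',
              'B-DIRECTOR', 'I-DIRECTOR', 'B-GENRE', 'I-GENRE',
              'B-TITLE', 'I-TITLE', 'B-YEAR', 'I-YEAR']

# First ten tokens and tags of the training entry shown above.
tokens = ["Even", "the", "usually", "reliable", "Logan", "Dark",
          "can't", "save", "Eternal", "Oath"]
tag_ids = [0, 0, 0, 0, 1, 2, 0, 0, 9, 10]

spans = extract_spans(tokens, tag_ids, label_list)
print(spans)
```

Joining tokens with a single space is enough for names such as "Logan Dark", but it will not reproduce the original spacing around punctuation.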

For the training split, find and display in order:
* The three actors most likely to appear in positive films.
* The three actors most likely to appear in negative films.
* The three directors most likely to have directed positive films.
* The three directors most likely to have directed negative films.
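"Most likely" can be read in more than one way. The sketch below takes one possible reading: rank each name by the fraction of its reviews that are positive. The function name `rank_by_positive_rate`, the `min_reviews` cutoff, and the toy names are all hypothetical, and the sketch assumes the sentiment label for good reviews is the string `'positive'` (only `'negative'` appears in the example above).

```python
from collections import Counter, defaultdict

def rank_by_positive_rate(pairs, min_reviews=1):
    """Rank names by the share of their reviews labeled 'positive'.

    pairs: iterable of (name, sentiment) tuples.
    min_reviews: hypothetical cutoff to ignore names seen too rarely.
    Returns a list of (name, positive_rate), highest rate first.
    """
    counts = defaultdict(Counter)
    for name, sentiment in pairs:
        counts[name][sentiment] += 1

    rates = {}
    for name, sentiment_counts in counts.items():
        total = sum(sentiment_counts.values())
        if total >= min_reviews:
            rates[name] = sentiment_counts["positive"] / total

    # Sort by rate (descending), breaking ties alphabetically.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))

# Toy data with made-up names, for illustration only.
pairs = [
    ("Actor A", "positive"), ("Actor A", "positive"),
    ("Actor B", "positive"), ("Actor B", "negative"),
    ("Actor C", "negative"),
]

ranking = rank_by_positive_rate(pairs)
most_positive = ranking[:3]
most_negative = ranking[::-1][:3]
print(most_positive)
print(most_negative)
```

A rate-based ranking rewards names that appear only once or twice, so a raw count of positive (or negative) reviews is an equally defensible choice. Whichever you pick, state it in your answer.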


## Part 2 - Fine Tuning Two BERT NER Models (14 points)

The MIT Movie Corpus is designed for movie-related NER tasks and includes the following entity types in BIO format:
- **Actor**: Names of actors or actresses (e.g., "Leonardo DiCaprio").
- **Character**: Names of characters in movies (e.g., "Jack Dawson").
- **Director**: Names of movie directors (e.g., "Christopher Nolan").
- **Genre**: Movie genres (e.g., "Action", "Drama").
- **Title**: Titles of movies (e.g., "Inception").
- **Year**: Year the movie was made.

The original movie corpus includes more entity types, but we've produced a simplified version for this assignment.

In this part of the assignment you should fine-tune "distilbert-base-uncased" and "bert-base-uncased" for NER on the dataset "hobbes99/mit-movie-ner-simplified".  The dataset has "train" and "valid" splits.  Use the "train" split for fine-tuning and evaluate the metrics using seqeval as shown in the lesson.
* Figure out a way to plot precision, recall, and F1 by entity type.
* Find two movie reviews on the internet and run inference on them to extract the named entities.
* Write a brief summary of the results.  Include answers to:
    * Which entity types does the model struggle with?  
    * Which does it do well on?
* The "distilbert-base-uncased" model is a distilled version of the "bert-base-uncased" model (distillation means training a smaller model using the larger trained model as a "teacher").  The "bert-base-uncased" model should lead to better results here.  Does it?  Discuss.
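One possible way to approach the per-entity plot is sketched below. It assumes you already have a per-type report dictionary shaped like the one seqeval's `classification_report(..., output_dict=True)` returns (entity types as keys, each mapping to `precision`, `recall`, `f1-score`, and `support`, plus summary rows ending in `" avg"`). The function names and the numbers in `toy_report` are placeholders, not real results.

```python
def report_to_series(report, metrics=("precision", "recall", "f1-score")):
    """Split a per-type report dict into entity names and metric lists.

    Assumes summary rows have keys ending in ' avg' and skips them.
    """
    entity_types = sorted(key for key in report if not key.endswith(" avg"))
    series = {m: [report[t][m] for t in entity_types] for m in metrics}
    return entity_types, series

def plot_report(report, title=""):
    """Draw a grouped bar chart with one group per entity type."""
    import matplotlib.pyplot as plt  # imported lazily; needed only for plotting

    entity_types, series = report_to_series(report)
    width = 0.8 / len(series)
    fig, ax = plt.subplots()
    for i, (metric, values) in enumerate(series.items()):
        positions = [x + i * width for x in range(len(entity_types))]
        ax.bar(positions, values, width=width, label=metric)
    # Center the tick labels under each group of bars.
    centers = [x + width * (len(series) - 1) / 2 for x in range(len(entity_types))]
    ax.set_xticks(centers)
    ax.set_xticklabels(entity_types, rotation=45, ha="right")
    ax.set_ylim(0, 1)
    ax.set_ylabel("score")
    ax.set_title(title)
    ax.legend()
    fig.tight_layout()
    return fig

# Placeholder numbers for illustration only; substitute your own report.
toy_report = {
    "ACTOR": {"precision": 0.9, "recall": 0.8, "f1-score": 0.85, "support": 10},
    "TITLE": {"precision": 0.7, "recall": 0.6, "f1-score": 0.65, "support": 12},
    "micro avg": {"precision": 0.8, "recall": 0.7, "f1-score": 0.75, "support": 22},
}

entity_types, series = report_to_series(toy_report)
print(entity_types, series)
```

Calling `plot_report(report, title="bert-base-uncased")` once per model gives two charts that can be compared side by side.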



## Part 3 - Using an LLM for NER (14 points)

For the first 100 texts in the "valid" split,  mimic what we did in the lesson to extract the "Actor", "Character", "Director", "Genre", "Title" and "Year" entities using an LLM.  Start with just a few examples to refine your prompt and instructions, then ramp up to 100 or more examples.  Get the final evaluation metrics as shown in the lesson.
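If your prompt asks the LLM to reply with a JSON object, a small parser like the sketch below can turn each reply into a dictionary with the six expected keys. This is an assumption about the reply format, not the method from the lesson; `parse_llm_entities` and the sample reply are hypothetical, and the call to the LLM itself is left out.

```python
import json

ENTITY_KEYS = ["Actor", "Character", "Director", "Genre", "Title", "Year"]

def parse_llm_entities(reply: str) -> dict:
    """Pull the outermost JSON object out of an LLM reply.

    Returns a dict with every key in ENTITY_KEYS; keys the model
    omitted (or a reply that cannot be parsed) map to None.
    """
    start, end = reply.find("{"), reply.rfind("}")
    parsed = {}
    if start != -1 and end > start:
        try:
            parsed = json.loads(reply[start:end + 1])
        except json.JSONDecodeError:
            parsed = {}  # malformed JSON: fall back to all-None
    if not isinstance(parsed, dict):
        parsed = {}
    return {key: parsed.get(key) for key in ENTITY_KEYS}

# Hypothetical reply text, including chatter around the JSON object.
sample_reply = (
    "Here are the entities:\n"
    '{"Actor": "Logan Dark", "Title": "Eternal Oath", "Year": "1988"}'
)

entities = parse_llm_entities(sample_reply)
print(entities)
```

Falling back to `None` instead of raising an error keeps a long batch running when a single reply is malformed. You can count those failures separately when you report your metrics.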

## Part 4 - Comparison and Reflection (6 points)

* Compare the results of the two entity recognition techniques both quantitatively and qualitatively.  Consider the difficulty of obtaining labeled data in your comparison.  Tagged text is time-consuming and/or costly to produce, but the LLM approach does not require it, although it may be less accurate.

* Give a brief summary of what you learned in this assignment.

* What did you find most difficult to understand?
