In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Part 1 Tutorial: Understanding Named Entity Extraction

**This tutorial will help you understand the data structure and what you need to do for Part 1.**

If you've been struggling with Part 1, this notebook will walk you through it step by step with working code you can run and modify.

## Import Libraries

First, let's import everything we'll need:

In [None]:
from datasets import load_dataset
from collections import defaultdict
import pandas as pd

## Step 1: Understanding What the Data Looks Like

Let's start by looking at ONE example from the dataset to understand the structure:

In [None]:
# Load the dataset
dataset = load_dataset("hobbes99/fake_movie_reviews_ner_sentiment")
label_list = dataset["train"].features["ner_tags"].feature.names

# Look at the FIRST example
example = dataset["train"][0]
print("Full example:")
print(example)
print("\n" + "="*60)

### What does this mean?

The output shows a dictionary with three keys:
- **tokens**: A list of words in the review (one review broken into individual words)
- **ner_tags**: Numbers that represent the entity type for each word
- **sentiment**: 0 = negative, 1 = positive

### Understanding the `label_list`

The `label_list` is the "decoder" for the numbers in `ner_tags`:

In [None]:
print("Label list (the decoder):")
print(label_list)
print("\nWhat each number means:")
for i, label in enumerate(label_list):
    print(f"  {i} → '{label}'")

Each number corresponds to a label:
- 0 → 'O' (Outside - not an entity)
- 1 → 'B-ACTOR' (Beginning of an actor's name)
- 2 → 'I-ACTOR' (Inside/continuation of an actor's name)
- 3 → 'B-DIRECTOR' (Beginning of a director's name)
- 4 → 'I-DIRECTOR' (Inside/continuation of a director's name)

### Let's decode the example to make it readable:

In [None]:
# Show the example with decoded labels
tokens = example['tokens']
ner_tags = example['ner_tags']
sentiment = example['sentiment']

print("Let's decode this example:\n")
print(f"{'Token':<15} {'Tag (number)':<15} {'Tag (label)':<15}")
print("="*50)
for token, tag_num in zip(tokens, ner_tags):
    tag_label = label_list[tag_num]
    print(f"{token:<15} {tag_num:<15} {tag_label:<15}")

print("\nSentiment:", "Positive (1)" if sentiment == 1 else "Negative (0)")

**Can you see the pattern?**

When you see B-ACTOR followed by I-ACTOR, that means those tokens together form one actor's name!
- First token gets B-ACTOR (Begin)
- Following tokens get I-ACTOR (Inside/continuation)

Same logic for directors.

## Step 2: What Does "Extract" Mean?

**"Extract" means:** Find all the actors and directors from the review and save them as complete names.

From the example above, we want to **extract**:
- Complete actor names (combining B-ACTOR + I-ACTOR tokens)
- Complete director names (combining B-DIRECTOR + I-DIRECTOR tokens)
- The sentiment (positive or negative)

### Where do we store this information?

We'll create **dictionaries** to count how many times each actor/director appears in positive vs negative reviews.

In [None]:
# Example of what we're building toward:
example_storage = {
    "Tom Hanks": {"positive": 5, "negative": 2},
    "Meryl Streep": {"positive": 3, "negative": 1}
}

print("This is the kind of structure we want:")
print(example_storage)
print("\nThis would mean:")
print("- Tom Hanks appeared in 5 positive reviews and 2 negative reviews")
print("- Meryl Streep appeared in 3 positive reviews and 1 negative review")
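
To make the counting idea concrete before `defaultdict` shows up in Step 6, here is a minimal sketch that builds the same structure with a plain dictionary. The `observations` list is made up for illustration and does not come from the dataset.

```python
# Hypothetical (name, sentiment) observations, invented for this sketch.
observations = [
    ("Tom Hanks", "positive"),
    ("Tom Hanks", "negative"),
    ("Meryl Streep", "positive"),
    ("Tom Hanks", "positive"),
]

counts = {}
for name, sentiment_label in observations:
    # Create the inner dictionary the first time a name is seen.
    if name not in counts:
        counts[name] = {"positive": 0, "negative": 0}
    counts[name][sentiment_label] += 1

print(counts)
```

The `if name not in counts` check is the part that `defaultdict` takes care of automatically later on.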

## Step 3: Understanding "Token Lists to Strings"

### What is a "token list"?

A **token list** is just the list of words that make up a name.

In [None]:
# For an actor like "Tom Hanks"
token_list = ['Tom', 'Hanks']
print("Token list:", token_list)
print("Type:", type(token_list))

# Convert to a string by joining with spaces
name_string = ' '.join(token_list)
print("\nConverted to string:", name_string)
print("Type:", type(name_string))

### What about "consecutive I- tags with B- tag"?

This means when you see a B-ACTOR followed by I-ACTOR tags, they belong to the SAME person:

In [None]:
# Example with multiple actors
tokens_example = ['Tom', 'Hanks', 'and', 'Rita', 'Wilson']
tags_example = ['B-ACTOR', 'I-ACTOR', 'O', 'B-ACTOR', 'I-ACTOR']

print(f"{'Token':<10} {'Tag':<10}")
print("="*25)
for token, tag in zip(tokens_example, tags_example):
    print(f"{token:<10} {tag:<10}")

print("\nWe should extract TWO actors:")
print("1. 'Tom Hanks' (B-ACTOR + I-ACTOR)")
print("2. 'Rita Wilson' (B-ACTOR + I-ACTOR)")

## Step 4: Let's Look at a Real Example in Detail

In [None]:
# Let's examine the first example more carefully
example = dataset["train"][0]

tokens = example['tokens']
ner_tags = example['ner_tags']
sentiment = example['sentiment']

print("Full review:", ' '.join(tokens))
print("\nWord-by-word breakdown:")
print(f"{'Token':<15} {'Tag':<15} {'Meaning':<30}")
print("="*60)

for token, tag_num in zip(tokens, ner_tags):
    tag_label = label_list[tag_num]
    if tag_label.startswith('B-'):
        meaning = "START of " + tag_label[2:]
    elif tag_label.startswith('I-'):
        meaning = "CONTINUES " + tag_label[2:]
    else:
        meaning = "Not an entity"
    print(f"{token:<15} {tag_label:<15} {meaning:<30}")

print(f"\nSentiment: {'Positive' if sentiment == 1 else 'Negative'}")

## Step 5: Extracting Entities from One Example

Now let's write a function to extract actors and directors from ONE example:

In [None]:
def extract_entities_from_one_example(example, label_list):
    """
    Extract actor and director names from one example.
    Returns a dictionary with 'actors' and 'directors' lists.
    """
    tokens = example['tokens']
    ner_tags = example['ner_tags']
    
    actors = []  # Will store complete actor names
    directors = []  # Will store complete director names
    
    current_entity = []  # Store tokens for current entity we're building
    current_type = None  # Is it ACTOR or DIRECTOR?
    
    for i in range(len(tokens)):
        token = tokens[i]
        tag = label_list[ner_tags[i]]  # Convert number to label string
        
        if tag == 'B-ACTOR':
            # Before starting new entity, save the previous one if it exists
            if current_entity and current_type == 'ACTOR':
                actors.append(' '.join(current_entity))
            elif current_entity and current_type == 'DIRECTOR':
                directors.append(' '.join(current_entity))
            
            # Start new actor
            current_entity = [token]
            current_type = 'ACTOR'
        
        elif tag == 'I-ACTOR':
            # Continue current actor name
            if current_type == 'ACTOR':
                current_entity.append(token)
        
        elif tag == 'B-DIRECTOR':
            # Save previous entity
            if current_entity and current_type == 'ACTOR':
                actors.append(' '.join(current_entity))
            elif current_entity and current_type == 'DIRECTOR':
                directors.append(' '.join(current_entity))
            
            # Start new director
            current_entity = [token]
            current_type = 'DIRECTOR'
        
        elif tag == 'I-DIRECTOR':
            # Continue current director name
            if current_type == 'DIRECTOR':
                current_entity.append(token)
        
        else:  # tag == 'O' (outside any entity)
            # Save previous entity
            if current_entity and current_type == 'ACTOR':
                actors.append(' '.join(current_entity))
            elif current_entity and current_type == 'DIRECTOR':
                directors.append(' '.join(current_entity))
            
            # Reset
            current_entity = []
            current_type = None
    
    # IMPORTANT: Don't forget the last entity!
    # If the review ends with an actor/director name, we need to save it
    if current_entity and current_type == 'ACTOR':
        actors.append(' '.join(current_entity))
    elif current_entity and current_type == 'DIRECTOR':
        directors.append(' '.join(current_entity))
    
    return {'actors': actors, 'directors': directors}


# Test it on the first example
example = dataset["train"][0]
entities = extract_entities_from_one_example(example, label_list)

print("Review:", ' '.join(example['tokens']))
print("\nExtracted entities:")
print("  Actors:", entities['actors'])
print("  Directors:", entities['directors'])
print("  Sentiment:", "positive" if example['sentiment'] == 1 else "negative")

Let's test on a few more examples to make sure it works:

In [None]:
# Test on first 5 examples
print("Testing extraction on first 5 examples:\n")
print("="*80)

for i in range(5):
    example = dataset["train"][i]
    entities = extract_entities_from_one_example(example, label_list)
    
    print(f"\nExample {i+1}:")
    print(f"Review: {' '.join(example['tokens'][:15])}...")  # Show first 15 words
    print(f"Actors: {entities['actors']}")
    print(f"Directors: {entities['directors']}")
    print(f"Sentiment: {'Positive' if example['sentiment'] == 1 else 'Negative'}")
    print("-" * 80)
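
The function in Step 5 repeats the same "save the previous entity" block several times. For comparison, here is a more compact sketch of the same idea. The name `group_bio_spans` is hypothetical, it takes label strings (not tag numbers), and it handles one case differently from Step 5: an `I-` tag that does not continue the current entity closes that entity and is dropped.

```python
def group_bio_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, name) pairs.

    `tags` holds label strings such as 'B-ACTOR', 'I-ACTOR', or 'O'.
    """
    spans = []
    current, current_type = [], None
    # A trailing 'O' sentinel flushes an entity that ends the review.
    for token, tag in zip(list(tokens) + [""], list(tags) + ["O"]):
        prefix, _, entity_type = tag.partition("-")  # 'B-ACTOR' -> 'B', '-', 'ACTOR'
        if prefix == "I" and entity_type == current_type:
            current.append(token)  # continue the entity in progress
            continue
        if current:
            spans.append((current_type, " ".join(current)))  # save finished entity
        if prefix == "B":
            current, current_type = [token], entity_type  # start a new entity
        else:
            current, current_type = [], None  # 'O' or a stray I- tag
    return spans


tokens_demo = ["Tom", "Hanks", "and", "Rita", "Wilson", "directed", "by", "Steven", "Spielberg"]
tags_demo = ["B-ACTOR", "I-ACTOR", "O", "B-ACTOR", "I-ACTOR", "O", "O", "B-DIRECTOR", "I-DIRECTOR"]
spans_demo = group_bio_spans(tokens_demo, tags_demo)
print(spans_demo)
```

To use something like this on the dataset, you would first convert each number in `ner_tags` to its label with `label_list`, as in Step 1.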

## Step 6: Counting Actors/Directors by Sentiment

Now we need to count how many times each actor/director appears in positive vs negative reviews.

We'll use `defaultdict` which automatically creates the nested dictionary structure we need:

In [None]:
# Create storage for counts using defaultdict
# This automatically creates {"positive": 0, "negative": 0} for new entries
actor_sentiment = defaultdict(lambda: {"positive": 0, "negative": 0})
director_sentiment = defaultdict(lambda: {"positive": 0, "negative": 0})

# Process FIRST 50 examples (for testing - later you'll do all of them)
print("Processing first 50 examples...\n")

for i in range(50):
    example = dataset["train"][i]
    entities = extract_entities_from_one_example(example, label_list)
    
    # Determine sentiment
    sentiment_label = "positive" if example['sentiment'] == 1 else "negative"
    
    # Count each actor
    for actor in entities['actors']:
        actor_sentiment[actor][sentiment_label] += 1
    
    # Count each director
    for director in entities['directors']:
        director_sentiment[director][sentiment_label] += 1

print(f"Found {len(actor_sentiment)} unique actors")
print(f"Found {len(director_sentiment)} unique directors")

# Show some results
print("\nFirst 5 actors and their counts:")
for actor, counts in list(actor_sentiment.items())[:5]:
    total = counts['positive'] + counts['negative']
    print(f"  {actor}: {counts['positive']} positive, {counts['negative']} negative (total: {total})")
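
If `defaultdict` is new to you, the short sketch below isolates its behavior. The names used as keys are placeholders. One thing worth knowing: simply reading a missing key also creates an entry, which can inflate `len(...)` if you look names up carelessly.

```python
from collections import defaultdict

counts = defaultdict(lambda: {"positive": 0, "negative": 0})

# Incrementing a name that has never been seen works without any setup.
counts["Tom Hanks"]["positive"] += 1
print(dict(counts))

# Reading a missing key inserts it with the default value.
_ = counts["Nobody"]
print(len(counts))

# Use `in` to check for a name without creating an entry.
print("Someone Else" in counts)
print(len(counts))
```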

## Step 7: Finding Top 3 Actors/Directors

To find who is most associated with positive/negative films, we calculate the **proportion** of their appearances that were positive:

```
positive_proportion = positive_count / (positive_count + negative_count)
```

A proportion close to 1.0 (100%) means mostly positive films.
A proportion close to 0.0 (0%) means mostly negative films.
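
As a quick worked example, the sketch below applies the formula to the illustrative counts from Step 2. These are made-up numbers, not results from the dataset.

```python
# Counts copied from the illustrative structure in Step 2 (not real results).
example_counts = {
    "Tom Hanks": {"positive": 5, "negative": 2},
    "Meryl Streep": {"positive": 3, "negative": 1},
}

proportions = {}
for name, c in example_counts.items():
    total = c["positive"] + c["negative"]
    proportions[name] = c["positive"] / total
    print(f"{name}: {c['positive']}/{total} = {proportions[name]:.2%}")
```

Here 5 out of 7 is about 71%, and 3 out of 4 is 75%, so Meryl Streep would rank above Tom Hanks on positive proportion even though Tom Hanks has more positive reviews in total.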

In [None]:
def calculate_proportions(entity_sentiment_dict):
    """
    Calculate positive proportion for each entity.
    Returns a list of dictionaries with name, counts, and proportion.
    """
    results = []
    
    for entity_name, counts in entity_sentiment_dict.items():
        positive = counts['positive']
        negative = counts['negative']
        total = positive + negative
        
        if total > 0:  # Avoid division by zero
            positive_proportion = positive / total
            results.append({
                'name': entity_name,
                'positive': positive,
                'negative': negative,
                'total': total,
                'positive_proportion': positive_proportion
            })
    
    return results


# Calculate proportions
actor_results = calculate_proportions(actor_sentiment)
director_results = calculate_proportions(director_sentiment)

print(f"Calculated proportions for {len(actor_results)} actors")
print(f"Calculated proportions for {len(director_results)} directors")

### Example: Find Top 3 Actors in POSITIVE Films (from 50 examples)

In [None]:
# Sort by positive proportion (highest first)
actor_results_sorted = sorted(actor_results, key=lambda x: x['positive_proportion'], reverse=True)

print("TOP 3 ACTORS MOST LIKELY TO APPEAR IN POSITIVE FILMS (from first 50 examples)")
print("="*70)
for i in range(min(3, len(actor_results_sorted))):
    actor = actor_results_sorted[i]
    print(f"\n{i+1}. {actor['name']}")
    print(f"   Positive: {actor['positive']}, Negative: {actor['negative']}, Total: {actor['total']}")
    print(f"   Positive Proportion: {actor['positive_proportion']:.2%}")

### Example: Find Top 3 Actors in NEGATIVE Films (from 50 examples)

In [None]:
# Sort by positive proportion (lowest first) - these actors appear mostly in negative films
actor_results_sorted = sorted(actor_results, key=lambda x: x['positive_proportion'])

print("TOP 3 ACTORS MOST LIKELY TO APPEAR IN NEGATIVE FILMS (from first 50 examples)")
print("="*70)
for i in range(min(3, len(actor_results_sorted))):
    actor = actor_results_sorted[i]
    print(f"\n{i+1}. {actor['name']}")
    print(f"   Positive: {actor['positive']}, Negative: {actor['negative']}, Total: {actor['total']}")
    print(f"   Positive Proportion: {actor['positive_proportion']:.2%}")
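
Sorting on proportion alone often produces ties: an actor seen once in a positive review has the same 100% as an actor seen nine times. The sketch below shows one possible way to break ties by total appearances and, optionally, to ignore rarely seen names. The data and the `min_total` threshold are assumptions for illustration; check the assignment instructions before applying either idea to your answer.

```python
# Hypothetical results in the same shape that calculate_proportions returns.
results = [
    {"name": "Actor A", "positive": 1, "negative": 0, "total": 1, "positive_proportion": 1.0},
    {"name": "Actor B", "positive": 9, "negative": 0, "total": 9, "positive_proportion": 1.0},
    {"name": "Actor C", "positive": 3, "negative": 1, "total": 4, "positive_proportion": 0.75},
]

# Highest proportion first; among ties, the entity with more appearances first.
ranked = sorted(results, key=lambda r: (-r["positive_proportion"], -r["total"]))
ranked_names = [r["name"] for r in ranked]
print(ranked_names)

# Optional: drop entities seen fewer than `min_total` times (threshold is an assumption).
min_total = 2
frequent_names = [r["name"] for r in ranked if r["total"] >= min_total]
print(frequent_names)
```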

## Step 8: Your Turn - Complete the Assignment! 🎯

Now that you've seen how to:
- ✅ Extract entities from examples
- ✅ Count by sentiment
- ✅ Calculate proportions
- ✅ Find top 3 from a subset

**It's your turn to put it all together!**

You need to process **ALL** training examples (not just 50) and generate the four required lists:
1. Top 3 actors in positive films
2. Top 3 actors in negative films
3. Top 3 directors in positive films
4. Top 3 directors in negative films

### Step 8.1: Process All Training Examples

Use the code from Step 6, but change it to process ALL examples instead of just 50.

**What you need to change:**
- Look at Step 6 above where we processed 50 examples
- Change `range(50)` to `range(len(dataset["train"]))`
- This will loop through all training examples

**Hint:** You already have the `extract_entities_from_one_example` function ready to use!

In [None]:
# TODO: Process ALL training examples and count actors/directors by sentiment

# Step 1: Create storage dictionaries (reset them to start fresh)
actor_sentiment = defaultdict(lambda: {"positive": 0, "negative": 0})
director_sentiment = defaultdict(lambda: {"positive": 0, "negative": 0})

# Step 2: Loop through ALL training examples
# HINT: Change the number in range() to process all examples
print(f"Processing ALL {len(dataset['train'])} training examples...")

for i in range(???):  # TODO: Replace ??? with len(dataset["train"])
    # Get the example
    example = dataset["train"][i]
    
    # Extract entities (you have this function already!)
    entities = ???  # TODO: Call extract_entities_from_one_example(example, label_list)
    
    # Determine if it's positive or negative
    sentiment_label = ???  # TODO: "positive" if example['sentiment'] == 1 else "negative"
    
    # Count each actor
    for actor in entities['actors']:
        ???  # TODO: actor_sentiment[actor][sentiment_label] += 1
    
    # Count each director
    for director in entities['directors']:
        ???  # TODO: director_sentiment[director][sentiment_label] += 1

print("Processing complete!")
print(f"Found {len(actor_sentiment)} unique actors")
print(f"Found {len(director_sentiment)} unique directors")

### Step 8.2: Calculate Proportions

Use the `calculate_proportions` function (from Step 7) to calculate proportions for all actors and directors.

In [None]:
# TODO: Calculate proportions for actors and directors

# HINT: Use the calculate_proportions function you already have!
actor_results = ???  # TODO: calculate_proportions(actor_sentiment)
director_results = ???  # TODO: calculate_proportions(director_sentiment)

print(f"Calculated proportions for {len(actor_results)} actors")
print(f"Calculated proportions for {len(director_results)} directors")

### Step 8.3: Display Top 3 in Each Category

Now create four separate displays (you saw examples of this in Step 7).

For each category, you need to:
1. Sort the results by `positive_proportion`
   - For "positive films": sort with `reverse=True` (highest proportion first)
   - For "negative films": sort with `reverse=False` or no reverse (lowest proportion first)
2. Loop through the first 3 results
3. Print the name, counts, and proportion

**Example from Step 7 you can adapt:**
```python
sorted_results = sorted(actor_results, key=lambda x: x['positive_proportion'], reverse=True)
for i in range(min(3, len(sorted_results))):
    entity = sorted_results[i]
    print(f"{i+1}. {entity['name']}")
    print(f"   Positive: {entity['positive']}, Negative: {entity['negative']}")
```
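
An equivalent way to write the "first 3" loop, shown here only as an alternative sketch, is to slice the list and number the entries with `enumerate`. Slicing with `[:3]` is safe even when the list has fewer than three items. The two-entry list below is hypothetical and deliberately short to show that.

```python
# Hypothetical two-entry list: shorter than 3 on purpose.
sorted_results = [
    {"name": "Actor A", "positive": 4, "negative": 1},
    {"name": "Actor B", "positive": 2, "negative": 2},
]

top3 = sorted_results[:3]  # returns at most 3 items, never raises on a short list
for rank, entity in enumerate(top3, start=1):
    print(f"{rank}. {entity['name']}")
    print(f"   Positive: {entity['positive']}, Negative: {entity['negative']}")
```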

In [None]:
# TODO: Top 3 actors in POSITIVE films

print("\n" + "="*80)
print("TOP 3 ACTORS MOST LIKELY TO APPEAR IN POSITIVE FILMS")
print("="*80)

# HINT: Sort actor_results by positive_proportion, HIGHEST first (reverse=True)
actor_positive = sorted(???, key=lambda x: x['???'], reverse=???)  # TODO: Fill in the ???

# HINT: Loop through first 3
for i in range(min(3, len(actor_positive))):
    actor = actor_positive[i]
    print(f"\n{i+1}. {actor['name']}")
    print(f"   Positive: {actor['positive']}, Negative: {actor['negative']}, Total: {actor['total']}")
    print(f"   Positive Proportion: {actor['positive_proportion']:.2%}")

In [None]:
# TODO: Top 3 actors in NEGATIVE films

print("\n" + "="*80)
print("TOP 3 ACTORS MOST LIKELY TO APPEAR IN NEGATIVE FILMS")
print("="*80)

# HINT: Sort actor_results by positive_proportion, LOWEST first (no reverse or reverse=False)
actor_negative = sorted(???, key=lambda x: x['???'])  # TODO: Fill in the ???

# TODO: Loop through first 3 and display (same format as above)
for i in range(???):
    # TODO: Your code here
    pass

In [None]:
# TODO: Top 3 directors in POSITIVE films

print("\n" + "="*80)
print("TOP 3 DIRECTORS MOST LIKELY TO DIRECT POSITIVE FILMS")
print("="*80)

# HINT: Same as actors positive, but use director_results
# TODO: Your code here

In [None]:
# TODO: Top 3 directors in NEGATIVE films

print("\n" + "="*80)
print("TOP 3 DIRECTORS MOST LIKELY TO DIRECT NEGATIVE FILMS")
print("="*80)

# HINT: Same as actors negative, but use director_results
# TODO: Your code here

## Checklist ✓

Before you consider Part 1 complete, make sure you can check off all these items:

- [ ] I processed ALL training examples (not just 50)
- [ ] I extracted actors and directors using the BIO tags
- [ ] I counted how many times each appeared in positive vs negative reviews
- [ ] I calculated the positive proportion for each actor and director
- [ ] I displayed the top 3 actors in positive films
- [ ] I displayed the top 3 actors in negative films
- [ ] I displayed the top 3 directors in positive films
- [ ] I displayed the top 3 directors in negative films
- [ ] My output shows the counts (positive, negative, total) for each entity
- [ ] My output shows the positive proportion as a percentage

## Debugging Tips

If something isn't working:

1. **Check your loop range**: Are you processing all examples?
   ```python
   print(f"Processing {len(dataset['train'])} examples")
   ```

2. **Verify entity extraction**: Print a few to make sure they look right
   ```python
   entities = extract_entities_from_one_example(example, label_list)
   print(entities)
   ```

3. **Check your counts**: Do the numbers make sense?
   ```python
   print(f"Total actors: {len(actor_sentiment)}")
   print("Sample:", list(actor_sentiment.items())[:3])
   ```

4. **Verify sorting direction**:
   - Positive films: `reverse=True` (highest proportion first)
   - Negative films: `reverse=False` or omit (lowest proportion first)

5. **Common error**: `SyntaxError: invalid syntax`
   - This usually means you forgot to fill in a `???` somewhere!
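
A leftover `???` is not valid Python, so the cell is rejected before any of it runs. If you want to confirm what Python reports for an unfilled placeholder, this small sketch compiles one such line (the line itself is just a sample) and records the error type.

```python
# A sample line that still contains the ??? placeholder.
unfinished_line = "entities = ???"

try:
    compile(unfinished_line, "<cell>", "exec")
    error_name = None
except SyntaxError as err:
    error_name = type(err).__name__

print(error_name)
```

Because the error is raised while parsing, none of the `print` statements in that cell will produce output until every `???` is replaced.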

## Copy to Your Assignment

Once your code is working:
1. Copy the working code cells to your assignment notebook
2. Add the output showing your four lists
3. Make sure the output is clearly labeled
4. You're done with Part 1! 🎉

## What You've Learned

By completing this tutorial, you've learned:
- How to work with structured data (lists, dictionaries)
- How to extract information from BIO-tagged text
- How to count and aggregate data
- How to calculate proportions and rank results
- How to sort and filter data in Python

These skills will be useful throughout the rest of the homework and in future data analysis tasks!