# AI-Generated Review Summaries: Keyword Extraction and Integration

This notebook analyzes a dictionary containing AI-generated summaries based on audience reviews of various films. Each entry includes a general film summary and detailed commentary on thematic aspects such as cinematography, direction, character development, narrative complexity, and others.

The goal is to extract keywords from these summaries. Because the summaries are derived from user reviews, the extracted keywords reflect how audiences interpreted and received each film, not just what happens in the plot.

The keywords are deduplicated, stripped of non-informative phrases (e.g., "reviewers say"), and ranked by relevance score. They are then merged into the existing keyword dataset `keywords_ground_truth.pkl`, which contains mostly plot-based keywords, so that the dataset also covers audience-perspective descriptors.

New keywords receive synthetic "Helpful" vote counts that start just above the highest existing count for each film and increase with relevance score, so the highest-scoring new keyword gets the highest count.
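
As a small worked example of this vote assignment (the keyword scores and the existing maximum of 12 are made up for illustration):

```python
# Hypothetical inputs: three new keywords with relevance scores, and the
# highest "Helpful" count already present for the film.
new_keywords = {"dark humor": 0.68, "builds suspense": 0.71, "film pacing": 0.50}
max_helpful_existing = 12

# Rank new keywords from most to least relevant.
ranked = sorted(new_keywords.items(), key=lambda item: item[1], reverse=True)

# Count down from max + n to max + 1 so the top-ranked keyword gets the most votes.
votes = range(max_helpful_existing + len(ranked), max_helpful_existing, -1)

assigned = {kw: vote for (kw, _), vote in zip(ranked, votes)}
print(assigned)
# {'builds suspense': 15, 'dark humor': 14, 'film pacing': 13}
```

Every new keyword therefore outranks every pre-existing keyword for the same film, which is worth keeping in mind when the vote counts are used downstream.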

> **Note:** AI-generated summaries for *Star Wars: Episode I – The Phantom Menace*, *Star Wars: Episode II – Attack of the Clones*, and *La La Land* are currently missing from the dataset.


## Setup: Installing and Importing Required Libraries

In [56]:
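import importlib
import subprocess
import sys

# Required third-party packages, mapped from pip name to import name.
# The two can differ (e.g. "sentence-transformers" is imported as
# "sentence_transformers"). pickle is part of the standard library and
# needs no installation.
required_packages = {
    "pandas": "pandas",
    "keybert": "keybert",
    "sentence-transformers": "sentence_transformers",
    "tqdm": "tqdm",
}

def install_package(pip_name, import_name):
    """Install a package with pip if it cannot be imported."""
    try:
        importlib.import_module(import_name)
        print(f"{pip_name} is already installed.")
    except ImportError:
        print(f"Installing {pip_name}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

# Check and install missing packages
for pip_name, import_name in required_packages.items():
    install_package(pip_name, import_name)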

pandas is already installed.
pickle is already installed.
keybert is already installed.
Installing sentence-transformers...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
tqdm is already installed.


In [57]:
import pandas as pd
import pickle
from keybert import KeyBERT
from tqdm import tqdm

## Loading the AI-Generated Summaries

This cell loads a dictionary containing AI-generated summaries from the file `summary_IA.pkl`. Each dictionary entry corresponds to a specific film and includes a general summary alongside detailed thematic evaluations such as cinematography, direction, character development, and more.

The cell prints the number of films with available summaries to confirm that loading succeeded, then lists the themes included for the first three films to show the structure of each entry.

In [None]:
# Load the dictionary containing AI-generated summaries
with open("../Dataset/summary_IA.pkl", "rb") as f:
    all_summaries = pickle.load(f)

# Print how many films are included
print(f"Loaded summaries for {len(all_summaries)} films.\n")

# Show the structure (themes) for the first few films nicely formatted
for i, (film, themes) in enumerate(all_summaries.items()):
    if i >= 3:  # limit to first 3 films
        break
    print(f"Film: {film}")
    print("Themes included:")
    for theme in themes.keys():
        print(f"  - {theme}")
    print()


Loaded summaries for 12 films.

Film: Parasite
Themes included:
  - summary
  - Cinematography
  - Dark humor
  - Direction
  - Dramatic tension
  - Suspenseful
  - Narrative complexity
  - Social commentary
  - Symbolism
  - Thought-provoking

Film: Oppenheimer
Themes included:
  - summary
  - Cinematography
  - Performance
  - Sound design
  - Historical accuracy
  - Inner conflict
  - Narrative complexity
  - Scientific depth
  - Pacing

Film: The Good, the Bad and the Ugly
Themes included:
  - summary
  - Action sequences
  - Atmospheric
  - Cinematography
  - Direction
  - Dramatic tension
  - Great ensemble
  - Iconic scene
  - Iconic score
  - Thin story
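
Given the structure printed above, a quick check can flag malformed entries before extraction starts. The sketch below assumes each film maps theme names to plain strings and that a `summary` key is always present; `validate_summaries` and the toy dictionary are hypothetical and not part of the notebook.

```python
def validate_summaries(summaries):
    """Return a list of human-readable problems found in the summaries dict."""
    problems = []
    for film, themes in summaries.items():
        if not isinstance(themes, dict):
            problems.append(f"{film}: expected a dict of themes")
            continue
        if "summary" not in themes:
            problems.append(f"{film}: missing 'summary' entry")
        for theme, text in themes.items():
            # Empty or non-string entries would yield no keywords later on.
            if not isinstance(text, str) or not text.strip():
                problems.append(f"{film} / {theme}: empty or non-text entry")
    return problems


# Toy data standing in for the real pickle contents.
toy_summaries = {
    "Film A": {"summary": "A tense thriller.", "Direction": "Confident direction."},
    "Film B": {"Cinematography": ""},
}
for problem in validate_summaries(toy_summaries):
    print(problem)
```

On the real data this could be called as `validate_summaries(all_summaries)`; an empty list would mean every entry has the expected shape.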



## Keyword Extraction from AI-Generated Summaries Using KeyBERT

This cell initializes the KeyBERT model with `all-MiniLM-L6-v2`, a compact general-purpose sentence-transformer embedding model that keeps extraction fast.

It iterates over the AI-generated summaries for each film and each thematic section, extracting the top 5 keywords or keyphrases per text. The extraction considers unigrams and bigrams and applies Maximal Marginal Relevance (MMR) to promote diversity among the selected keywords.

The extracted keywords are stored as tuples containing the keyword and its relevance score, organized in a nested dictionary structure keyed by film and theme. Empty or missing texts are skipped gracefully.
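
To make the role of MMR concrete, here is a minimal NumPy sketch of the selection rule: start from the candidate most similar to the document, then repeatedly add the candidate that best trades off similarity to the document against similarity to what has already been selected. It illustrates the idea only and is not KeyBERT's internal code; the function name `mmr_select`, the way `diversity` is weighted, and the toy 2-D vectors are assumptions made for this example.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mmr_select(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.5):
    """Pick top_n candidates, balancing relevance against redundancy."""
    relevance = cosine(cand_vecs, doc_vec[None, :]).ravel()  # candidate vs. document
    pairwise = cosine(cand_vecs, cand_vecs)                  # candidate vs. candidate

    selected = [int(np.argmax(relevance))]  # most relevant candidate first
    remaining = [i for i in range(len(candidates)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Redundancy = highest similarity to anything already selected.
        redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        mmr = (1 - diversity) * relevance[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)

    return [candidates[i] for i in selected]


# Toy embeddings: the first two candidates are near-duplicates.
doc = np.array([1.0, 0.2])
cands = ["dark humor", "dark comedy", "class divide"]
vecs = np.array([[1.0, 0.15], [1.0, 0.1], [0.3, 1.0]])

print(mmr_select(doc, vecs, cands, top_n=2, diversity=0.7))
# ['dark humor', 'class divide']
print(mmr_select(doc, vecs, cands, top_n=2, diversity=0.0))
# ['dark humor', 'dark comedy']
```

With a high `diversity` the near-duplicate "dark comedy" is passed over in favor of a less relevant but distinct phrase; with `diversity=0.0` the selection reduces to plain relevance ranking.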

In [59]:
# Initialize the KeyBERT model with a compact sentence-transformer
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# Dictionary to store the extracted keywords and their scores for each film and each theme
extracted_keywords = {}

# Iterate over all films with progress bar
for film in tqdm(all_summaries, desc="Processing films"):
    extracted_keywords[film] = {}
    
    # Iterate over all thematic sections within each film
    for theme, text in all_summaries[film].items():
        # Skip empty or missing text (None, non-string, or whitespace-only)
        if not isinstance(text, str) or not text.strip():
            extracted_keywords[film][theme] = []
            continue

        # Extract top 5 keywords or keyphrases (using unigrams and bigrams)
        keywords = kw_model.extract_keywords(
            text,
            keyphrase_ngram_range=(1, 2),
            stop_words="english",
            top_n=5,
            use_mmr=True,  # Promote diversity among keywords
        )

        # Store tuples of (keyword, score)
        extracted_keywords[film][theme] = keywords

Processing films: 100%|██████████| 12/12 [00:36<00:00,  3.03s/it]


The cell below iterates over the extracted keywords dictionary, which contains thematic keyword lists with associated relevance scores for each film.

For each film, it prints a clear header to separate the output visually. Then, for every thematic section within the film, it displays the extracted keywords formatted with their corresponding scores rounded to three decimal places. If no keywords were extracted for a particular theme, a message indicating the absence of keywords is printed instead.

In [60]:
# Iterate over each film and its associated thematic keyword lists with scores
for film, themes in extracted_keywords.items():
    # Print a header for each film to separate the output clearly
    print(f"\n=== Keywords for {film} ===\n")
    
    # Iterate over each theme and the list of (keyword, score) tuples extracted for that theme
    for theme, keywords in themes.items():
        # Check if there are any keywords extracted for the current theme
        if keywords:
            # Format each keyword with its score (rounded to 3 decimals)
            kw_list = ", ".join([f"{kw} ({score:.3f})" for kw, score in keywords])
            # Print the theme and its corresponding formatted keywords
            print(f"\t{theme}: {kw_list}\n")
        else:
            # Print a message if no keywords were extracted for the current theme
            print(f"\t{theme}: No keywords extracted.\n")


=== Keywords for Parasite ===

	summary: parasite (0.629), director bong (0.379), thriller horror (0.363), writing critics (0.286), class disparity (0.195)

	Cinematography: cinematography breathtaking (0.751), storytelling enhances (0.487), captivating immersive (0.422), composition visual (0.381), reviewers say (0.258)

	Dark humor: dark humor (0.681), film memorable (0.390), shift tones (0.308), characters situations (0.286), blending seamlessly (0.190)

	Direction: blends genres (0.538), film pacing (0.496), direction seen (0.492), appreciate director (0.388), manipulates emotions (0.322)

	Dramatic tension: engaging climax (0.688), enhance suspense (0.635), dramatic tension (0.610), film masterfully (0.455), particularly praised (0.338)

	Suspenseful: builds suspense (0.710), immersive thrilling (0.594), cinematography sound (0.504), intricate plot (0.417), described gripping (0.381)

	Narrative complexity: narrative complexity (0.670), blend genres (0.495), depth criticized (0.4

This cell consolidates the extracted keywords for each film by aggregating all thematic keyword lists into a single set of unique keywords.
For each keyword, only the highest relevance score across all themes is retained. To improve data quality, the phrase "reviewers say" (case-insensitive) is explicitly removed if present.

The keywords are then sorted in descending order by their maximum scores and printed alongside their scores rounded to three decimal places.
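
Because the same consolidation step is needed again during integration, it could be factored into a small helper. The sketch below is one way to do that; `consolidate_keywords`, its `blocklist` parameter, and the toy input are hypothetical names introduced here, not part of the notebook.

```python
def consolidate_keywords(themes, blocklist=("reviewers say",)):
    """Merge per-theme (keyword, score) lists into one ranked list.

    Keeps the highest score seen for each keyword, drops blocklisted
    phrases (case-insensitive), and sorts by descending score.
    """
    blocked = {phrase.lower() for phrase in blocklist}
    best = {}
    for keywords in themes.values():
        for kw, score in keywords:
            if kw.lower() in blocked:
                continue
            best[kw] = max(score, best.get(kw, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy input shaped like one film's entry in `extracted_keywords`.
toy_themes = {
    "summary": [("dark humor", 0.41), ("class disparity", 0.20)],
    "Dark humor": [("dark humor", 0.68), ("Reviewers say", 0.26)],
}
print(consolidate_keywords(toy_themes))
# [('dark humor', 0.68), ('class disparity', 0.2)]
```

Passing the blocklist as a parameter makes it easy to add other filler phrases later without touching the loop.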

In [61]:
# Iterate over each film and its associated thematic keyword lists with scores
for film, themes in extracted_keywords.items():
    # Temporary dictionary to store the highest score for each keyword
    keyword_max_scores = {}

    # Loop through all lists of (keyword, score) tuples across all categories
    for keywords in themes.values():
        for kw, score in keywords:
            # Update the maximum score for each keyword encountered
            if kw not in keyword_max_scores or score > keyword_max_scores[kw]:
                keyword_max_scores[kw] = score

    # Remove the keyword "reviewers say" if present (case insensitive)
    to_remove_reviewers_say = [kw for kw in keyword_max_scores if kw.lower() == "reviewers say"]
    for kw in to_remove_reviewers_say:
        keyword_max_scores.pop(kw)

    # Sort the keywords by their maximum score in descending order
    sorted_keywords = sorted(keyword_max_scores.items(), key=lambda x: x[1], reverse=True)

    # Print the film header with total unique keywords count
    print(f"\n=== Unique Keywords for {film} (total: {len(sorted_keywords)}) ===")
    
    # If keywords exist, print them with scores rounded to 3 decimals
    if sorted_keywords:
        print(", ".join([f"{kw} ({score:.3f})" for kw, score in sorted_keywords]))
    else:
        # Inform if no keywords were extracted for the film
        print("No keywords extracted.")



=== Unique Keywords for Parasite (total: 46) ===
cinematography breathtaking (0.751), use symbolism (0.713), builds suspense (0.710), engaging climax (0.688), dark humor (0.681), narrative complexity (0.670), social commentary (0.661), enhance suspense (0.635), parasite (0.629), dramatic tension (0.610), immersive thrilling (0.594), blends genres (0.538), opinions depth (0.517), cinematography sound (0.504), film pacing (0.496), blend genres (0.495), depth criticized (0.493), direction seen (0.492), storytelling enhances (0.487), characters stereotypical (0.478), film masterfully (0.455), class disparities (0.432), disparities moral (0.430), captivating immersive (0.422), intricate plot (0.417), film memorable (0.390), appreciate director (0.388), themes oversimplified (0.383), composition visual (0.381), described gripping (0.381), director bong (0.379), thriller horror (0.363), realism film (0.352), storytelling (0.349), particularly praised (0.338), innovative disorienting (0.325),

## Integration of AI-Extracted Keywords into the Existing Ground Truth Dataset

This cell performs the integration of newly extracted keywords from AI-generated film review summaries into the pre-existing keyword dataset (`keywords_ground_truth.pkl`).

The process includes the following steps:

1. **Loading and Mapping:** The existing dataset is loaded, and a mapping from film titles to unique movie identifiers (`Movie_ID`) is defined to ensure accurate association of keywords with their respective films.

2. **Keyword Consolidation:** For each film in the extracted keywords dictionary, the code consolidates the highest relevance scores for each keyword across all thematic categories.

3. **Filtering:** Non-informative phrases such as `"reviewers say"` are removed explicitly. Additionally, any keyword that already exists in the dataset for that film (compared case-insensitively as an exact string match) is excluded from insertion to prevent duplicates.

4. **Ranking and Voting:** Remaining new keywords are sorted in descending order by relevance score. Helpful vote counts are assigned incrementally starting just above the maximum existing vote count for the film, ensuring that the most relevant keywords receive the highest votes.

5. **Dataframe Construction:** A new dataframe is created for the filtered keywords, including their associated `Movie_ID`, keyword text, and assigned helpful and not helpful vote counts.

6. **Dataset Update:** The new keywords dataframe is concatenated on top of the existing dataset, and the updated dataset is saved back to the original file path.

This workflow enhances the ground truth keyword dataset by enriching it with relevant audience-derived descriptors while maintaining data consistency and quality.
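
The duplicate check in step 3 compares lowercased strings, so it catches only exact matches up to letter case. Near-duplicates pass through, which is consistent with the Parasite output above containing both "blends genres" and "blend genres". A toy illustration in plain Python (the keyword lists here are invented):

```python
# Keywords already stored for a film (hypothetical), lowercased for matching.
existing = {kw.lower() for kw in ["Social Commentary", "class disparity"]}

# Newly extracted candidates with relevance scores (hypothetical).
candidates = {
    "social commentary": 0.66,   # same phrase, different case -> dropped
    "class disparities": 0.43,   # plural variant -> kept
    "dark humor": 0.68,
}

new_only = {kw: s for kw, s in candidates.items() if kw.lower() not in existing}
print(sorted(new_only, key=new_only.get, reverse=True))
# ['dark humor', 'class disparities']
```

If such variants matter for the ground truth, a normalization step (for example lemmatization) before the comparison would be one option; that is not something the notebook currently does.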


In [62]:
# Load the existing dataset
df_existing = pd.read_pickle("../Dataset/keywords_ground_truth.pkl")

# Map film titles to Movie_IDs (replace with your actual mapping)
movie_id_map = {
    "Parasite": "tt6751668",
    "Raiders of the Lost Ark": "tt0082971",
    "Oppenheimer": "tt15398776",
    "Harry Potter and the Sorcerer's Stone": "tt0241527",
    "The Good, the Bad and the Ugly": "tt0060196",
    "Star Wars Episode 3": "tt0121766",
    "Star Wars Episode 4": "tt0076759",
    "Star Wars Episode 5": "tt0080684",
    "Star Wars Episode 6": "tt0086190",
    "Star Wars Episode 7": "tt2488496",
    "Star Wars Episode 8": "tt2527336",
    "Star Wars Episode 9": "tt2527338",
}

dfs_to_prepend = []

for film, themes in extracted_keywords.items():
    if film not in movie_id_map:
        print(f"Warning: Movie_ID not found for film '{film}'. Skipping.")
        continue
    movie_id = movie_id_map[film]

    # Check if Movie_ID actually exists in the existing dataset
    if movie_id not in df_existing["Movie_ID"].values:
        print(f"Error: Movie_ID '{movie_id}' for film '{film}' not found in existing dataset. Skipping.")
        continue

    # Collect max score per keyword across all themes
    keyword_max_scores = {}
    for keywords in themes.values():
        for kw, score in keywords:
            if kw not in keyword_max_scores or score > keyword_max_scores[kw]:
                keyword_max_scores[kw] = score

    # Remove the keyword "reviewers say" if present (case insensitive)
    to_remove_reviewers_say = [kw for kw in keyword_max_scores if kw.lower() == "reviewers say"]
    if to_remove_reviewers_say:
        print(f"Removing 'reviewers say' keyword for '{film}'")
    for kw in to_remove_reviewers_say:
        keyword_max_scores.pop(kw)

    # Get existing keywords for this movie (lowercase for case-insensitive matching)
    existing_keywords = set(
        df_existing[df_existing["Movie_ID"] == movie_id]["Keyword"].str.lower()
    )

    # Identify keywords to remove (already present)
    to_remove = [kw for kw in keyword_max_scores if kw.lower() in existing_keywords]
    if to_remove:
        print(f"Removing existing keywords for '{film}': {to_remove}")

    # Remove them from new keywords
    for kw in to_remove:
        keyword_max_scores.pop(kw)

    # Sort remaining new keywords by descending score
    sorted_new_kws = sorted(keyword_max_scores.items(), key=lambda x: x[1], reverse=True)

    # Find max Helpful vote currently existing for this movie, default 0 if none
    max_helpful_existing = df_existing[df_existing["Movie_ID"] == movie_id]["Helpful"].max()
    if pd.isna(max_helpful_existing):
        max_helpful_existing = 0

    # Assign Helpful votes starting from max_helpful_existing + len(new keywords) down to max_helpful_existing + 1
    # So the keyword with highest score gets highest Helpful vote
    helpful_votes = list(range(int(max_helpful_existing) + len(sorted_new_kws), int(max_helpful_existing), -1))

    # Create dataframe of new keywords with votes and movie ID
    df_new = pd.DataFrame({
        "Movie_ID": movie_id,
        "Keyword": [kw for kw, _ in sorted_new_kws],
        "Helpful": helpful_votes,
        "Not_Helpful": [0] * len(sorted_new_kws)
    })

    dfs_to_prepend.append(df_new)

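# Concatenate new keywords on top of existing ones
# (pd.concat raises ValueError on an empty list, so guard that case)
if dfs_to_prepend:
    df_new_keywords = pd.concat(dfs_to_prepend, ignore_index=True)
    df_updated = pd.concat([df_new_keywords, df_existing], ignore_index=True)
else:
    df_updated = df_existing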

# Save updated dataframe
df_updated.to_pickle("../Dataset/keywords_ground_truth.pkl")
print("Updated dataset saved as 'keywords_ground_truth.pkl'")

Removing 'reviewers say' keyword for 'Parasite'
Removing existing keywords for 'Parasite': ['social commentary']
Removing 'reviewers say' keyword for 'Oppenheimer'
Removing existing keywords for 'Oppenheimer': ['manhattan project']
Removing 'reviewers say' keyword for 'The Good, the Bad and the Ugly'
Removing 'reviewers say' keyword for 'Harry Potter and the Sorcerer's Stone'
Removing existing keywords for 'Star Wars Episode 3': ['star wars', 'jedi council']
Removing existing keywords for 'Star Wars Episode 4': ['star wars', 'death star', 'alien']
Removing existing keywords for 'Star Wars Episode 5': ['asteroid field']
Removing existing keywords for 'Star Wars Episode 7': ['star wars']
Removing 'reviewers say' keyword for 'Star Wars Episode 8'
Removing existing keywords for 'Star Wars Episode 8': ['canto bight']
Removing 'reviewers say' keyword for 'Star Wars Episode 9'
Updated dataset saved as 'keywords_ground_truth.pkl'


## Loading and Inspecting the Updated Keywords Dataset

This cell loads the updated keyword dataset from the specified pickle file path (`keywords_ground_truth.pkl`) and provides an initial inspection.

It displays the total number of keywords contained in the dataset, followed by a preview of the first few rows to illustrate the data structure.

Additionally, it computes and prints the count of keywords associated with each movie, offering insight into the distribution of keyword data across the films in the dataset.

In [63]:
# File path
file_path = "../Dataset/keywords_ground_truth.pkl"

# Load data
with open(file_path, 'rb') as file:
    keywords_df = pickle.load(file)

# Total number of keywords
total_keywords = len(keywords_df)
print(f"Total number of keywords: {total_keywords}\n")

# Display the first few rows
display(keywords_df.head())

# Count number of keywords per movie
print("\nNumber of keywords per movie:")
keywords_per_movie = keywords_df.groupby("Movie_ID").size()
print(keywords_per_movie)


Total number of keywords: 6193



|   | Movie_ID | Keyword | Helpful | Not_Helpful |
|---|---|---|---|---|
| 0 | tt6751668 | cinematography breathtaking | 58 | 0 |
| 1 | tt6751668 | use symbolism | 57 | 0 |
| 2 | tt6751668 | builds suspense | 56 | 0 |
| 3 | tt6751668 | engaging climax | 55 | 0 |
| 4 | tt6751668 | dark humor | 54 | 0 |



Number of keywords per movie:
Movie_ID
tt0060196     322
tt0076759     394
tt0080684     345
tt0082971     272
tt0086190     272
tt0120915     410
tt0121765     386
tt0121766     523
tt0241527     449
tt15398776    367
tt2488496     562
tt2527336     589
tt2527338     651
tt3783958     290
tt6751668     361
dtype: int64
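
A useful extra check after a merge like this is that no film ends up with the same keyword twice. The sketch below runs on a small invented frame with the same column names as the dataset; `find_duplicate_keywords` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def find_duplicate_keywords(df):
    """Return rows whose (Movie_ID, lowercased Keyword) pair occurs more than once."""
    key = df.assign(_kw=df["Keyword"].str.lower())
    mask = key.duplicated(subset=["Movie_ID", "_kw"], keep=False)
    return df[mask]


# Toy frame mirroring the dataset's columns.
toy_df = pd.DataFrame({
    "Movie_ID": ["tt0000001", "tt0000001", "tt0000002"],
    "Keyword": ["dark humor", "Dark Humor", "dark humor"],
    "Helpful": [5, 3, 2],
    "Not_Helpful": [0, 0, 1],
})

dupes = find_duplicate_keywords(toy_df)
print(len(dupes))  # 2: the two case variants for tt0000001
```

On the real data, `find_duplicate_keywords(keywords_df)` returning an empty frame would indicate that the case-insensitive filter did its job.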
