# AI-Generated Review Summaries: Keyword Extraction and Integration

This notebook analyzes a dictionary of AI-generated summaries derived from user reviews of films. Summaries are available for only a subset of the films in the dataset. Each entry includes a general summary and detailed reflections on specific aspects such as cinematography, direction, character development, and other thematic dimensions.

The objective is to extract relevant keywords from these textual summaries. Since the content is rooted in audience reviews, the extracted keywords are particularly valuable for capturing subjective interpretations and reception patterns.

The most significant keywords will be identified and integrated into the existing file `keywords_ground_truth.pkl`. This file currently includes keywords primarily based on plot-level information. By appending review-based keywords, the keyword set is enriched with descriptors that better reflect audience perspectives, thereby improving the comprehensiveness and relevance of the feature set.

> **Note**: AI summaries for *Star Wars: Episode I – The Phantom Menace*, *Star Wars: Episode II – Attack of the Clones*, and *La La Land* are currently missing from the dataset.
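
The integration step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it assumes `keywords_ground_truth.pkl` holds a dictionary mapping each film title to a list of keyword strings (its real layout is not shown here), and `merge_keywords` is a hypothetical helper name. Films without review-based keywords, such as those listed in the note, simply keep their plot-level keywords.

```python
def merge_keywords(ground_truth, review_keywords):
    """Return a new dict with review-based keywords appended per film.

    ASSUMPTION: both arguments map film title -> list of keyword strings.
    """
    merged = {}
    for film, plot_kws in ground_truth.items():
        combined = list(plot_kws)  # copy so the input is not mutated
        seen = {kw.lower() for kw in combined}
        # Films with no review-based keywords fall back to an empty list
        for kw in review_keywords.get(film, []):
            if kw.lower() not in seen:  # case-insensitive de-duplication
                combined.append(kw)
                seen.add(kw.lower())
        merged[film] = combined
    return merged


# Toy data for illustration only (not taken from the real files)
ground_truth = {
    "Parasite": ["family", "basement"],
    "La La Land": ["jazz", "los angeles"],
}
review_keywords = {"Parasite": ["class disparity", "Family", "dark humor"]}

merged = merge_keywords(ground_truth, review_keywords)
print(merged["Parasite"])
print(merged["La La Land"])
```

Building a new dictionary instead of modifying the loaded one in place keeps the original plot-level keywords available for comparison before anything is written back to disk.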

## Setup: Installing and Importing Required Libraries

In [6]:
import subprocess
import sys

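# Map each pip package name to the module name used at import time.
# The two differ for sentence-transformers; pickle is part of the standard
# library, so it needs no installation and is not listed here.
required_packages = {
    "pandas": "pandas",
    "keybert": "keybert",
    "sentence-transformers": "sentence_transformers",
    "tqdm": "tqdm",
}

def install_package(package, module_name):
    """Installs a package using pip if its module cannot be imported."""
    try:
        __import__(module_name)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package, module_name in required_packages.items():
    install_package(package, module_name)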

pandas is already installed.
pickle is already installed.
keybert is already installed.
Installing sentence-transformers...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
tqdm is already installed.


In [7]:
import pandas as pd
import pickle
from keybert import KeyBERT
from tqdm import tqdm

## Load the Dictionary of AI-Generated Summaries

This cell loads the dictionary of AI-generated summaries from the file `summary_IA.pkl`. Each entry in the dictionary corresponds to a film and contains both a general summary and a set of thematic evaluations. The total number of films with available summaries is displayed to verify successful loading.

In [8]:
# Load the dictionary containing AI-generated summaries
with open("../Dataset/summary_IA.pkl", "rb") as f:
    all_summaries = pickle.load(f)

# Check how many films are included
print(f"Loaded summaries for {len(all_summaries)} films.")

Loaded summaries for 12 films.
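
The later cells iterate over `all_summaries[film].items()` and call `.strip()` on each value, so they rely on every film mapping to a dictionary of section name to text. A quick sanity check along these lines could surface malformed entries before extraction. The helper name `find_invalid_sections` and the sample data are hypothetical; the real section names and texts come from `summary_IA.pkl`.

```python
def find_invalid_sections(summaries):
    """Return (film, section) pairs whose text is missing, non-string, or blank."""
    problems = []
    for film, sections in summaries.items():
        if not isinstance(sections, dict):
            # The whole entry has an unexpected type
            problems.append((film, None))
            continue
        for section, text in sections.items():
            if not isinstance(text, str) or not text.strip():
                problems.append((film, section))
    return problems


# Hypothetical sample mirroring the assumed layout
sample = {
    "Film A": {"summary": "A tense, well-paced thriller.", "Direction": "   "},
    "Film B": {"summary": None},
}
print(find_invalid_sections(sample))
```

Running the same check on `all_summaries` would list any sections that the extraction loop should skip.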


## Keyword Extraction from AI-Generated Summaries Using KeyBERT

This cell performs automatic keyword extraction from the AI-generated summaries using the KeyBERT library.

For each film in the dataset, all thematic sections (e.g., cinematography, character development) are processed individually. If the associated text is non-empty, KeyBERT is used to extract the top 5 most relevant keywords or keyphrases, considering both unigrams and bigrams. English stop words are removed, and Maximal Marginal Relevance (MMR) is enabled to reduce redundancy among the selected keyphrases.

The model used is `all-MiniLM-L6-v2`, a lightweight and efficient sentence-transformer well-suited for semantic similarity tasks.

The output is stored in a nested dictionary structure where each film maps to its thematic categories and the corresponding extracted keywords. The `tqdm` library is used to display a progress bar during processing, providing real-time feedback on the execution status.

In [10]:
# Initialize the KeyBERT model with a compact sentence-transformer
kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# Dictionary to store the extracted keywords for each film and each theme
extracted_keywords = {}

# Iterate over all films with progress bar
for film in tqdm(all_summaries, desc="Processing films"):
    extracted_keywords[film] = {}
    
    # Iterate over all thematic sections within each film
    for theme, text in all_summaries[film].items():
        # Skip empty or missing text (None and other non-string values included)
        if not isinstance(text, str) or not text.strip():
            extracted_keywords[film][theme] = []
            continue

        # Extract top 5 keywords or keyphrases (using unigrams and bigrams)
        keywords = kw_model.extract_keywords(
            text,
            keyphrase_ngram_range=(1, 2),
            stop_words="english",
            top_n=5,
            use_mmr=True,  # Promote diversity among keywords
        )

        # Store only the keyword strings (exclude scores)
        extracted_keywords[film][theme] = [kw for kw, _ in keywords]


Processing films: 100%|██████████| 12/12 [00:36<00:00,  3.02s/it]
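
The cell above passes `use_mmr=True` to diversify the selected keyphrases. The general idea behind Maximal Marginal Relevance can be sketched in plain Python: pick the candidate most similar to the document first, then repeatedly pick the candidate that balances similarity to the document against similarity to what is already selected. This is a simplified illustration of the technique, not KeyBERT's own code; `mmr_select`, the 2-D vectors, and the candidate "black comedy" are made up for the example, whereas real embeddings come from the sentence-transformer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr_select(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.5):
    """Greedy MMR: trade off relevance to the document against redundancy."""
    doc_sim = [cosine(doc_vec, v) for v in cand_vecs]
    # Start with the single most relevant candidate
    selected = [max(range(len(candidates)), key=lambda i: doc_sim[i])]

    def score(i):
        # Redundancy = highest similarity to any already-selected candidate
        redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
        return (1 - diversity) * doc_sim[i] - diversity * redundancy

    while len(selected) < min(top_n, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in selected]
        selected.append(max(remaining, key=score))
    return [candidates[i] for i in selected]


# Made-up 2-D "embeddings" for illustration only
doc = [1.0, 1.0]
candidates = ["dark humor", "black comedy", "class disparity"]
vectors = [[1.0, 0.9], [1.0, 0.8], [0.3, 1.0]]

with_mmr = mmr_select(doc, vectors, candidates, top_n=2, diversity=0.5)
without_mmr = mmr_select(doc, vectors, candidates, top_n=2, diversity=0.0)
print(with_mmr)     # ['dark humor', 'class disparity']
print(without_mmr)  # ['dark humor', 'black comedy']
```

With `diversity=0.0` the selection reduces to plain relevance ranking, so the near-duplicate candidate is kept; a higher value penalizes it in favor of a less similar phrase.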


In [13]:
for film, themes in extracted_keywords.items():
    print(f"\n=== Keywords for {film} ===\n")
    for theme, keywords in themes.items():
        if keywords:
            kw_list = ", ".join(keywords)
            print(f"{theme}: {kw_list}\n")
        else:
            print(f"{theme}: No keywords extracted.")


=== Keywords for Parasite ===

summary: parasite, director bong, thriller horror, writing critics, class disparity

Cinematography: cinematography breathtaking, storytelling enhances, captivating immersive, composition visual, reviewers say

Dark humor: dark humor, film memorable, shift tones, characters situations, blending seamlessly

Direction: blends genres, film pacing, direction seen, appreciate director, manipulates emotions

Dramatic tension: engaging climax, enhance suspense, dramatic tension, film masterfully, particularly praised

Suspenseful: builds suspense, immersive thrilling, cinematography sound, intricate plot, described gripping

Narrative complexity: narrative complexity, blend genres, depth criticized, innovative disorienting, twists plot

Social commentary: social commentary, characters stereotypical, depth criticized, disparities moral, realism film

Symbolism: use symbolism, depth criticized, storytelling, viewers appreciate, making film

Thought-provoking: soc