# Movie Reviews Keyword Extraction with KeyBERT Variants

This notebook focuses on **extracting representative keywords** from movie reviews using **KeyBERT** and its extended variants. The goal is to generate concise, meaningful keyword sets for each review that can later be used for evaluation.

## Models Used
We compare and apply the following keyword extraction models:
- **KeyBERT (base)**: Extracts keywords based on semantic similarity using BERT embeddings.
- **KeyBERT + Sentiment Reranker**: Reranks keywords based on their sentiment alignment with the review.
- **KeyBERT + Sentiment-Aware Selection**: Integrates sentiment in the candidate selection phase using a continuous sentiment model.
- **KeyBERT + Metadata**: Enriches document and candidate embeddings using review-level metadata (utility, length, polarity, recency).

## Workflow
1. **Select a movie** from the dataset (`.pkl` files).
2. **Load and run all models** to extract the top keywords for each review.
3. **Save the output** to a new `.pkl` file containing all extracted keyword columns.
4. **(Optional)**: Load and inspect the saved file to ensure correctness.

> This setup allows us to perform a comparative analysis of keyword extraction techniques with a focus on enhancing semantic quality through additional signals like sentiment and metadata.


## Setup: Installing and Importing Required Libraries

In [1]:
import subprocess
import sys

# List of required packages
required_packages = [
    "pandas", "tqdm", "keybert", "sentence-transformers",
]

def install_package(package):
    """Installs a package using pip if it's not already installed."""
    try:
        __import__(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    install_package(package)


pandas is already installed.
tqdm is already installed.


  from .autonotebook import tqdm as notebook_tqdm


keybert is already installed.
Installing sentence-transformers...
Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Add the custom module path for KeyBERTSentimentAware
sys.path.append("../KeyBERTSentimentAware")

# Add the custom module path for KeyBERTMetadata
sys.path.append("../KeyBERTMetadata")    

# Import custom extension of KeyBERT that integrates sentiment awareness in keyword scoring
from models.KeyBertSentimentAware import KeyBERTSentimentAware  # type: ignore

# Import custom reranker that uses sentiment polarity after keyword extraction to re-score them
from models.KeyBertSentimentReranker import KeyBERTSentimentReranker  # type: ignore

# Import custom extension of KeyBERT that enriches embeddings with metadata
from KeyBertMetadata import KeyBERTMetadata  # type: ignore

# Import the original KeyBERT model for semantic-based keyword extraction
from keybert import KeyBERT

# Import the SentenceTransformer class from the sentence-transformers library
from sentence_transformers import SentenceTransformer

# Import tqdm for progress bars
from tqdm import tqdm

# Import pandas for data manipulation
import pandas as pd

# Import os for file path operations
import os

## Load Available Movies from Dataset

This section lists all the available movies stored as `.pkl` files inside the review dataset directory.

- It defines the root path (`../Dataset/Reviews_By_Movie`) where all review files are saved.
- It automatically detects and lists all movie filenames (removing the `.pkl` extension).
- These names can then be used to dynamically select a movie for keyword extraction.

> This allows flexible selection and processing of any movie in the dataset without hardcoding paths.


In [3]:
# Define root directory
root_dir = "../Dataset/Reviews_By_Movie"

# List all available movies
available_movies = sorted([f[:-4] for f in os.listdir(root_dir) if f.endswith(".pkl")])
print("Available movies:", available_movies)

Available movies: ['GoodBadUgly', 'HarryPotter', 'IndianaJones', 'LaLaLand', 'Oppenheimer', 'Parasite', 'SW_Episode1', 'SW_Episode2', 'SW_Episode3', 'SW_Episode4', 'SW_Episode5', 'SW_Episode6', 'SW_Episode7', 'SW_Episode8', 'SW_Episode9']


## Select and Load a Specific Movie

In this step, we manually select one of the available movies listed earlier.

- Set the `movie_name` variable to one of the printed movie titles.
- The script constructs the full file path and loads the corresponding `.pkl` file using `pandas`.
- It then displays the number of reviews loaded for that movie.

> This forms the input dataset for keyword extraction using various models.

In [None]:
# Choose the movie (manually change this)
movie_name = "Parasite"  # Choose from printed list

# Load the selected movie
movie_path = os.path.join(root_dir, f"{movie_name}.pkl")
selected_film = pd.read_pickle(movie_path)

selected_film = selected_film.head(10)

print(f"Loaded {movie_name} with {len(selected_film)} reviews.")

Loaded Parasite with 5 reviews.


## Keyword Extraction with Multiple Models

In this section, we perform keyword extraction from movie reviews using four different models:

- `base`: The standard KeyBERT model using semantic similarity.
- `reranker`: A version that re-ranks extracted keywords using post-hoc sentiment alignment.
- `sentiment`: A sentiment-aware model that incorporates sentiment into keyword selection during the extraction process.
- `metadata`: A custom model that leverages review metadata to improve keyword selection, using a batch embedding strategy.

### Process Overview:
1. Metadata for all reviews is extracted once.
2. Each model is applied to the `Preprocessed_Review` text of each review.
3. For the `metadata` model, batch embeddings are computed for efficiency.
4. Extracted keywords are stored (only the keyword strings, scores are removed).
5. The results are stored in a new DataFrame `keywords_df` with the following columns:
   - `Movie_ID`
   - `Review_Text`
   - `keywords_base`
   - `keywords_reranker`
   - `keywords_sentiment`
   - `keywords_metadata`

This DataFrame will later be saved and evaluated to compare model performance.

In [5]:
# Define the sentence embedding model to be used
model_name = "all-MiniLM-L6-v2"  # A compact and fast transformer model from SentenceTransformers
embedding_model = SentenceTransformer(model_name)  # Load the model to generate sentence embeddings

# Initialize the keyword extraction models
models = {
    "base": KeyBERT(embedding_model),  # Standard KeyBERT model using only semantic similarity
    "reranker": KeyBERTSentimentReranker(embedding_model),  # KeyBERT variant that reranks keywords based on sentiment alignment
    "sentiment": KeyBERTSentimentAware(embedding_model),  # KeyBERT variant integrating sentiment in the candidate selection phase
    "metadata": KeyBERTMetadata(embedding_model),  # KeyBERT variant that incorporates external metadata for keyword extraction
}


In [6]:
# Extract metadata once for the entire dataset
metadata = KeyBERTMetadata.extract_metadata(selected_film)

# Define the n-gram range for keyword candidates
keyphrase_ngram_range = (1, 2)  # Unigrams and bigrams
top_n = 5  # Number of top keywords to extract

# Prepare a dictionary to collect results from all models
keyword_results = {
    "Movie_ID": selected_film["Movie_ID"].tolist(),
    "Review_ID": selected_film["Review_ID"].tolist(),
    "Review_Text": selected_film["Review_Text"].tolist()
}

# Iterate through each keyword extraction model
for model_name, model in models.items():
    tqdm.pandas(desc=f"Extracting keywords with {model_name}")

    # Metadata-aware model requires batch embedding
    if model_name == "metadata":
        try:
            # Compute document and candidate embeddings using metadata
            doc_emb, word_emb = model.extract_embeddings_mean(
                docs=list(selected_film["Preprocessed_Review"]),
                metadata=metadata,
                keyphrase_ngram_range=keyphrase_ngram_range,
            )

            keyword_results[f"keywords_{model_name}"] = selected_film["Preprocessed_Review"].progress_apply(
                lambda _: model.extract_keywords(
                    docs = list(selected_film["Preprocessed_Review"]),
                    doc_embeddings=doc_emb,
                    word_embeddings=word_emb,
                    keyphrase_ngram_range=keyphrase_ngram_range,
                    top_n=top_n
                )
            )
            
            # Convert results to a list of tuples (keyword, score)
            keyword_results[f"keywords_{model_name}"] = keyword_results[f"keywords_{model_name}"].apply(
                lambda x: [kw for kw_group in x for kw in kw_group]
            ).tolist()

        
        # Handle any exceptions that may occur during batch processing
        except Exception as e:
            print(f"Batch error in metadata model: {e}")
            keyword_results[f"keywords_{model_name}"] = [[] for _ in range(len(selected_film))]

    # All other models use per-review keyword extraction
    else:
        keyword_results[f"keywords_{model_name}"] = selected_film["Preprocessed_Review"].progress_apply(
            lambda text: [(kw[0], kw[1]) for kw in model.extract_keywords(
                text,
                top_n=top_n,
                keyphrase_ngram_range=keyphrase_ngram_range
            )]
        ).tolist()

# Create final DataFrame with keywords from all models
keywords_df = pd.DataFrame(keyword_results)

Extracting keywords with base: 100%|██████████| 5/5 [00:02<00:00,  1.79it/s]
Extracting keywords with reranker: 100%|██████████| 5/5 [00:05<00:00,  1.04s/it]
Extracting keywords with sentiment: 100%|██████████| 5/5 [00:17<00:00,  3.51s/it]
Extracting keywords with metadata: 100%|██████████| 5/5 [00:00<00:00, 93.07it/s]


## Save Extracted Keywords to File

After extracting the keywords for each review using all models, we save the results for future evaluation or analysis.

### What this cell does:
1. **Extracts the movie name** from the original `.pkl` file path.
2. **Defines an output path** with the prefix `kw_` (e.g., `kw_Parasite.pkl`) inside the `../Dataset/Extracted_Keywords` directory.
3. **Ensures the output directory exists**, creating it if necessary.
4. **Saves the `keywords_df`** (containing Movie ID, original text, and all extracted keyword columns) as a pickle file.

This allows us to reuse extracted keywords without re-running the extraction pipeline.

When complete, a message confirms the save location.


In [7]:
# Extract movie name from the original file path
movie_name = os.path.splitext(os.path.basename(movie_path))[0]

# Define output path with prefix 'kw_'
output_dir = "../Dataset/Extracted_Keywords"
output_path = os.path.join(output_dir, f"kw_{movie_name}.pkl")

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Save the DataFrame as a pickle file
keywords_df.to_pickle(output_path)

print(f"Saved keywords for '{movie_name}' to: {output_path}")


Saved keywords for 'Parasite' to: ../Dataset/Extracted_Keywords/kw_Parasite.pkl


## Load Extracted Keywords from File

This cell verifies that the extracted keywords for the selected movie were correctly saved and can be successfully reloaded for further analysis or evaluation.

### What this cell does:
1. **Builds the input file path** using the `movie_name` (e.g., `kw_Parasite.pkl`).
2. **Attempts to load the DataFrame** using `pandas.read_pickle()`.
3. **Handles errors gracefully**, printing a clear message if the file is not found or any other issue occurs.
4. **Displays the first few rows** of the loaded DataFrame to confirm its content.

Use this to ensure that the extraction pipeline completed correctly and the output is ready for use.


In [8]:
# Define the input path for the keywords DataFrame
input_path = os.path.join("../Dataset/Extracted_Keywords", f"kw_{movie_name}.pkl")

# Load the DataFrame
try:
    loaded_df = pd.read_pickle(input_path)
    print(f"Successfully loaded file: {input_path}\n")
    print(f"DataFrame shape: {loaded_df.shape}")
    display(loaded_df.head())  # Show the first few rows if in Jupyter
except FileNotFoundError:
    print(f"File not found: {input_path}")
except Exception as e:
    print(f"Error loading file: {e}")


Successfully loaded file: ../Dataset/Extracted_Keywords/kw_Parasite.pkl

DataFrame shape: (5, 7)


Unnamed: 0,Movie_ID,Review_ID,Review_Text,keywords_base,keywords_reranker,keywords_sentiment,keywords_metadata
0,tt6751668,9637661,I'm genuinely baffled this film won not only b...,"[(korean culture, 0.5476), (seeing korean, 0.5...","[(korean culture, 0.4473), (korean, 0.4152), (...","[(tried hard, 0.4114), (films, 0.3814), (break...","[(korean culture, 0.6205), (seeing korean, 0.5..."
1,tt6751668,5510542,Just watch it. It has everything; entertainmen...,"[(suspense drama, 0.4686), (drama tragedy, 0.4...","[(movie messages, 0.3052), (shown metaphorical...","[(watch entertainment, 0.3874), (movie, 0.3463...","[(korean culture, 0.6205), (seeing korean, 0.5..."
2,tt6751668,5182892,First Hit: I really enjoyed this story as it d...,"[(korean family, 0.6138), (family kim, 0.539),...","[(korean family, 0.4633), (family kim, 0.4168)...","[(ki woo, 0.4314), (perfect, 0.4024), (woo, 0....","[(korean culture, 0.6205), (seeing korean, 0.5..."
3,tt6751668,5499682,I was not expecting that much of this movie. N...,"[(expecting movie, 0.5192), (expect movie, 0.4...","[(expect movie, 0.691), (expecting movie, 0.44...","[(original oscar, 0.4264), (oscar deserved, 0....","[(korean culture, 0.6205), (seeing korean, 0.5..."
4,tt6751668,6094155,"Good acting, cinematography, twists and screen...","[(screenplay liked, 0.5552), (good acting, 0.5...","[(good acting, 0.6689), (screenplay liked, 0.5...","[(liked location, 0.611), (really good, 0.6048...","[(korean culture, 0.6205), (seeing korean, 0.5..."
