# BERTopic Analysis for Screenplays

## Current Notebook Workflow

### 1. Import Data
- Load screenplay text.
- Load words to remove (stop words, character names, common screenplay words).

### 2. Clean the Screenplay
- Convert to lowercase.
- Remove numbers and special characters.
- Filter out words to remove (stop words, character names, etc.).

### 3. Format Text for BERT
- Convert screenplay text into **chunks** (currently non-overlapping).
- **Planned improvement:** Implement **rolling window chunking** (e.g., 300-word chunks with 100-word overlap).

### 4. Load Subgenres & Genres
- Load predefined **genres and subgenres**.
- Concatenate **descriptions** to each genre/subgenre label to improve **zero-shot classification accuracy**.

### 5. Run BERTopic
- Train BERTopic model on the screenplay chunks.
- Generate and save visualizations:
  - **Topic Bar Chart**
  - **Dendrogram**
  - **Inter-Topic Distance Map**

### 6. Zero-Shot Classification
- Perform **zero-shot classification** on genres and subgenres **for each chunk**.
- Assign predicted genres/subgenres to chunks based on topic distributions.

---

## Next Steps

### 1. Improve Chunking
- Modify text preprocessing to **include overlapping chunks** (e.g., 300-word chunks with 100-word overlap).
- Update chunking logic to ensure **better context retention** across segments.

### 2. Train BERTopic on a Larger Corpus
- Use **30 selected screenplays** from the `screenplay_text/` folder as a **training set**.
- Train BERTopic **on all 30 scripts** to **learn better topic structures**.

### 3. Evaluate Model Performance
- **Determine a metric for evaluation** (currently undecided).
- Possible approach: Measure **accuracy** of BERTopic’s **top predicted genres/subgenres**.
- **Challenge:** No ground truth labels available yet → Need to **label some scripts manually**.

### 4. Optimize Zero-Shot Classification Strategy
- Consider **removing zero-shot classification from BERTopic** and **running it separately**.
- Alternative approach:
  - Export **BERTopic’s topic word distributions**.
  - Use **zero-shot classification in a separate environment** (e.g., **Hugging Face**).
  - Classify topics **based on their top words** instead of per-chunk classification.

### 5. Optimize Performance & Scaling
- Consider **precomputing BERT embeddings** and storing them in a **vector database** (e.g., FAISS, Weaviate) to **speed up training**.
- Monitor memory usage as dataset size increases.

---

### Notes:
- **Next Immediate Task:** Implement **rolling window chunking** and test performance.
- **Potential Bottleneck:** Increasing dataset size may require **parallelized processing** for efficiency.
- **Long-Term Goal:** Create a stable **BERTopic model trained on a corpus of screenplays**, which can be used for more robust **genre/subgenre classification**.

### PREP TEXT

In [1]:
from bertopic import BERTopic
import os

def get_filenames(directory):
    return [f for f in os.listdir(directory) if f.endswith('.txt')]


def get_screenplay(movie_title):
    screenplay_path = f"../data/screenplays/text_from_ocr/{movie_title}"

    if not os.path.exists(screenplay_path):
        print(f"Error: Screenplay '{movie_title}.txt' not found in the folder.")
        return None

    try:
        with open(screenplay_path, "r", encoding="utf-8") as f:
            screenplay_text = f.read()
        return screenplay_text
    except Exception as e:
        print (f"Error reading screenplay '{movie_title}.txt': {e}")
        return None

# Example usage
directory_path = "../data/screenplays/train"
train_screenplays = get_filenames(directory_path)
print(len(train_screenplays), "screenplays loaded for training.\n")

screenplay_text = get_screenplay(train_screenplays[0])
print(screenplay_text[:300])

100 screenplays loaded for training.

2001 A SPACE ODYSSEY
Screenplay
by
Stanley Kubrick and Arthur C Clark

Hawk Films Ltd
co M-G-M Studios
Boreham Wood
Herts

TITLE PART I
AFRICA
3000000 YEARS AGO

Al
VIEWS OF AFRICAN DRYLANDS - DROUGHT

The remorseless drought had lasted now for ten million years
and would not end for another million


### LOAD WORDS TO REMOVE FROM SCREENPLAY
- Character names
- stopwords
- 200 most common words in screenplays

In [2]:
# load character names
import nltk
from nltk.corpus import stopwords
import json
import re

def load_character_names(movie_title):
    with open(f"../data/movie_data/{movie_title.replace(" ", "_")}.json", "r", encoding="utf-8") as file2:
        character_names = json.load(file2)
        
    characters = [dicty["character"] for dicty in character_names["actors_characters"]]
    characters_cleaned = []
    for char in characters:
         names = char.split(" ")
         for name in names: 
             name = re.sub(r"[^a-z]", "", name.lower())
             if name != "":
                 characters_cleaned.append(name)

    return characters_cleaned

def load_words(top_n_common=200):
    # stop words
    with open("word_frequencies.json", "r", encoding="utf-8") as file:
        word_frequencies = json.load(file)

    common_words = list(word_frequencies.keys())[:top_n_common]
    stop_words = list(set(stopwords.words('english')))

    return common_words + stop_words
    
words_to_remove = load_words(top_n_common=200)
characters = load_character_names(movie_title="Die Hard")
print("Words loaded.\n")
characters[:5]

Words loaded.



['john', 'mcclane', 'hans', 'gruber', 'karl']

### LOAD SUB-GENRES FOR ZERO-SHOT CLASSIFICATION

In [3]:
# load subgenres

import os
import json

def load_genres_and_subgenres(json_path="../data/IMDb/parsed_subgenres.json"):
    # Check if file exists
    if not os.path.exists(json_path):
        return "Error: JSON file not found."

    try:
        # Load JSON data
        with open(json_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        
        # Separate genres and subgenres into dictionaries
        genres_dicts = []
        subgenres_dicts = []

        for entry in data:
            title = entry.get("title")
            description = entry.get("description")
            category = entry.get("type", "").lower()  # Normalize category name

            if title and description:
                temp_dict = {"title": title, "description": description}
                
                if "subgenre" == category:  # If categorized as a subgenre
                    subgenres_dicts.append(temp_dict)
                elif "genre" == category:  # Default to genre
                    genres_dicts.append(temp_dict)

        return genres_dicts, subgenres_dicts, data

    except json.JSONDecodeError:
        print("Error: Failed to parse JSON file (invalid format).")
    except Exception as e:
        print(f"Error reading file: {e}")


genres_dicts, subgenres_dicts, data = load_genres_and_subgenres()

verbose = False
if verbose:
    for d in genres_dicts:
        print(d["title"])
        print(d["description"])
        print()
        
subgenres_list = [dicty["title"] for dicty in subgenres_dicts]
subgenres_descriptions_list = [dicty["title"] + " " + dicty["description"] for dicty in subgenres_dicts]

genres_list = [dicty["title"] for dicty in genres_dicts]
genres_descriptions_list = [dicty["title"] + " " + dicty["description"] for dicty in genres_dicts]

genres_descriptions_list

['Action The action genre features fast-paced, thrilling, and intense sequences of physical feats, combat, and excitement. The characters of these stories are involved in daring and often dangerous situations, requiring them to rely on their physical prowess, skills, and quick thinking to overcome challenges and adversaries.',
 'Adventure The adventure genre features exciting journeys, quests, or expeditions undertaken by characters who often face challenges, obstacles, and risks in pursuit of a goal. Adventures can take place in a wide range of settings, from exotic and fantastical locations to historical or even everyday environments.',
 'Biography The biography, or "biopic", is a genre that portrays the life story of a real person, often a notable individual or historical figure. They aim to provide a depiction of the subject\'s personal history, achievements, challenges, and impact on society.',
 'Comedy The comedy genre refers to a category of entertainment that aims to amuse and 

### CLEAN TEXT
- Lowercase, remove non-letter characters
- filter out common words and stopwords and character names

In [4]:
# clean text
def clean_and_chunk_text(screenplay_text, movie_title, top_n_common=200, verbose=True, chunk_size=35):
    cleaned_words = screenplay_text.lower().replace("\n", " ")                
    cleaned_words = [re.sub(r"[^a-zA-Z]", "", word.strip()) for word in cleaned_words.split(" ") if word.strip() != ""]
    
    # Filter out stop words, character names, and common words
    words_to_remove = load_words(top_n_common=200)
    characters = load_character_names(movie_title=movie_title)
    filtered_words = [word for word in cleaned_words if word not in (words_to_remove or characters)]
    
    # filter out short words
    filtered_words = [word for word in filtered_words if len(word) > 2]

    # chunks
    chunks = [" ".join(filtered_words[i:i+chunk_size]) for i in range(0, len(filtered_words), chunk_size)]

    if verbose:
        print(f"Original words: {len(cleaned_words)}, Filtered words: {len(filtered_words)}")
        print(f"Total chunks: {len(chunks)}")

    return chunks



### RUN BERTOPIC

In [5]:
import hdbscan
from umap import UMAP
import matplotlib.pyplot as plt
import plotly.io as pio

def create_plots(topic_model, movie_title, top_n_topics=10):
    # barcharts
    fig = topic_model.visualize_barchart(top_n_topics=top_n_topics)
    fig.update_layout(title_text=f"Top 10 BERTopic distributions in {movie_title}", title_x=0.5)  # Centered title
    fig.write_html(f"plots/{movie_title}_bertopic_barchart.html")

    # intertopic distances
    fig2 = topic_model.visualize_topics()
    fig2.update_layout(title_text=f"Intertopic Distance Map for {movie_title}", title_x=0.5)  # Centered title
    fig2.write_html(f"plots/{movie_title}_intertopic_distance.html")

    # dendogram
    fig3 = topic_model.visualize_hierarchy()
    fig3.write_html(f"plots/{movie_title}_dendrogram.html")  # Save as PNG

    print("Plots saved.")


def create_model_and_plots(chunks):
    
    hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=3,
                                   min_samples=1,
                                   cluster_selection_epsilon=0.1)
    
    umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.05, metric='cosine')
    
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topics, probs = topic_model.fit_transform(chunks)
    
    # View the most common topics
    #topic_model.get_topic_info()
    
    # save plots
    create_plots(topic_model, movie_title, top_n_topics=10)

### RUN THE JEWELS

In [None]:
import time as time

start_time = time.time()

movie_title = "High Noon"
screenplay_text = get_screenplay(movie_title)
if screenplay_text:
    chunks = clean_and_chunk_text(screenplay_text, top_n_common=200, verbose=True, chunk_size=35)
    create_model_and_plots(chunks)

end_time = time.time()

print(f"Time: {round(end_time - start_time, 2)} seconds")

### RUN THE JEWELS 2

In [6]:
from sentence_transformers import SentenceTransformer
import time as time

start_time = time.time()

# CREATE EMBEDDINGS
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = []

for filename in train_screenplays:
    screenplay_text = get_screenplay(filename)
    movie_title = filename.replace(".txt", "")
    
    if screenplay_text:
        chunks += clean_and_chunk_text(screenplay_text, movie_title, top_n_common=200, verbose=False, chunk_size=150)

#create_model_and_plots(chunks)

embeddings = embedder.encode(chunks, show_progress_bar=True)
end_time = time.time()
    
print(f"Time: {round(end_time - start_time, 2)} seconds\nEmbeddings created!")

FileNotFoundError: [Errno 2] No such file or directory: '../data/movie_data/Ground_Hog_Day.json'

In [None]:
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=10,
                                   min_samples=2,
                                   cluster_selection_epsilon=0.2)
    
umap_model = UMAP(n_components=5, n_neighbors=25, min_dist=0.1, metric='cosine')

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

topics, probs = topic_model.fit_transform(chunks, embeddings)


In [None]:
topic_model.get_topic_info()


### ZERO-SHOT CLASSIFICATION

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from collections import Counter

movie_title = "Transformers 2007"
print("Screenplay:", movie_title, "\n")

# genres
zeroshot_topic_list = genres_descriptions_list

topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.8,
    representation_model=KeyBERTInspired()
)

screenplay_text = get_screenplay(movie_title)
chunks = clean_and_chunk_text(screenplay_text, top_n_common=200, verbose=True, chunk_size=30)
topics, _ = topic_model.fit_transform(chunks)
predicted_genres = []

verbose = False
for i, topic_id in enumerate(topics[:50]):  # Check first 10 documents
    topic =genres_list[topic_id]
    predicted_genres.append(topic)
    if verbose:
        print(f"Document {i}: Predicted Genre - {topic}")
        print(f"Text Snippet: {chunks[i][:200]}")  # Show first 200 characters
        print("-" * 50)


# Count occurrences of each unique predicted subgenre
genre_counts = Counter(predicted_genres)

# Total number of documents
total_docs = sum(genre_counts.values())

# Sort subgenres by count in descending order
sorted_counts = sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)

# Print results
print("\nPredicted Genres:")
for genre, count in sorted_counts:
    percent = (count / total_docs) * 100  # Calculate percentage
    print(f"{percent:.0f}% | {genre}")

# subgenres
zeroshot_topic_list = subgenres_descriptions_list

topic_model = BERTopic(
    embedding_model="thenlper/gte-small", 
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.8,
    representation_model=KeyBERTInspired()
)

topics, _ = topic_model.fit_transform(chunks)
predicted_subgenres = []

verbose = False
for i, topic_id in enumerate(topics[:50]):  # Check first 10 documents
    topic =subgenres_list[topic_id]
    predicted_subgenres.append(topic)
    if verbose:
        print(f"Document {i}: Predicted Subgenre - {topic}")
        print(f"Text Snippet: {chunks[i][:200]}")  # Show first 200 characters
        print("-" * 50)


# Count occurrences of each unique predicted subgenre
subgenre_counts = Counter(predicted_subgenres)

# Total number of documents
total_docs = sum(subgenre_counts.values())

# Sort subgenres by count in descending order
sorted_counts = sorted(subgenre_counts.items(), key=lambda x: x[1], reverse=True)

# Print results
print("\nPredicted Subgenres:")
for subgenre, count in sorted_counts:
    percent = (count / total_docs) * 100  # Calculate percentage
    print(f"{percent:.0f}% | {subgenre}")


In [None]:
# other plots

#topic_model.visualize_term_rank()
#topic_model.visualize_heatmap()