In [None]:
# Exercise sheet 13 with movies.txt

Load the data set into your console.

In Moodle you will find the file movies.txt, which contains summaries of different movies. However, we do not know the genres of these movies. Your task is to perform an unsupervised analysis and make an educated guess as to which genres might be part of the data set.

This is an open exercise. You will not be tasked with a specific model to use. Instead, you can use any model we have talked about in the exercises and the lecture so far to solve this task, except for Transformer models. This exercise sheet can be seen as preparation for the practical part of the exam: you should be able to solve this exercise within 60 minutes on your own, without the help of any AI assistant (you are allowed to use your previous solutions and Google, though).



# Task 1
Argue the usefulness of the following models for the task at hand in 2-3 sentences:
• Dictionary-based analysis
• Latent Dirichlet Allocation
• Word2Vec
Which method do you deem to be the best for this task?


• Dictionary-based analysis: Simple and interpretable; works well if we have predefined lists of genre-related words.
Cons:
Requires a genre-specific dictionary, which we don’t have.
Struggles with polysemy (words with multiple meanings).
Cannot discover new themes; only matches predefined words.

Verdict: Not ideal, as we aim for unsupervised learning without a predefined dictionary.
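To make the limitation concrete, here is a minimal sketch of what a dictionary-based approach could look like. The genre dictionary, the word lists, and the function names `score_genres` and `best_genre` are all hypothetical; no such dictionary is provided with the exercise.

```python
from collections import Counter

# Hypothetical hand-written genre dictionary (an assumption for illustration only).
GENRE_DICTIONARY = {
    "crime": {"murder", "detective", "police", "killer"},
    "romance": {"love", "wedding", "heart", "couple"},
    "horror": {"ghost", "haunted", "vampire", "curse"},
}

def score_genres(summary, dictionary=GENRE_DICTIONARY):
    """Count how many tokens of the summary appear in each genre's word list."""
    tokens = summary.lower().split()
    counts = Counter()
    for genre, vocabulary in dictionary.items():
        counts[genre] = sum(token in vocabulary for token in tokens)
    return counts

def best_genre(summary, dictionary=GENRE_DICTIONARY):
    """Return the genre with the most hits, or None if no dictionary word matches."""
    genre, hits = score_genres(summary, dictionary).most_common(1)[0]
    return genre if hits > 0 else None

print(best_genre("a detective investigates a murder in a small town"))
print(best_genre("two robots explore a distant planet"))
```

The second summary matches no entry at all, which illustrates why this method cannot discover genres that were not anticipated when the dictionary was written.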

• Latent Dirichlet Allocation: Topic modeling approach that discovers hidden themes in text. Suitable for unsupervised genre detection. Can identify multiple genres per movie.
Cons: Requires choosing the number of topics beforehand.
Struggles if summaries are too short or contain noise.

Verdict: A strong choice for discovering genres in an unsupervised way.

• Word2Vec: Captures word meanings based on context.
Can cluster similar movie summaries together based on semantic meaning, for example by averaging the word vectors of each summary.
Cons:
Requires a large dataset to train meaningful embeddings.
Does not directly provide topic categories.

Verdict: Good for clustering but not ideal for direct topic modeling.
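A rough sketch of how word vectors could be turned into summary-level vectors for clustering is shown below. The three-dimensional `toy_vectors` are made up for illustration; a trained Word2Vec model would supply real vectors, and the helper names `document_vector` and `cosine` are hypothetical.

```python
import math

# Made-up 3-dimensional word vectors (an assumption); real ones would come from a trained model.
toy_vectors = {
    "murder": [0.9, 0.1, 0.0],
    "detective": [0.8, 0.2, 0.1],
    "love": [0.1, 0.9, 0.0],
    "wedding": [0.0, 0.8, 0.2],
}

def document_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

crime_a = document_vector(["detective", "murder"], toy_vectors)
crime_b = document_vector(["murder", "town"], toy_vectors)  # "town" is unknown and skipped
romance = document_vector(["love", "wedding"], toy_vectors)

print(cosine(crime_a, crime_b))  # similarity of the two crime-like summaries
print(cosine(crime_a, romance))  # similarity of a crime-like and a romance-like summary
```

Similar summaries end up close together, but the resulting clusters still have no names or top words attached, which is the sense in which Word2Vec does not directly provide topic categories.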

Best Method: LDA
Latent Dirichlet Allocation (LDA) is the best choice because it allows us to extract hidden topics (potential genres) from movie summaries without any labeled data.
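One practical caveat of LDA is that the number of topics has to be fixed beforehand. A possible way to compare candidates is held-out perplexity (lower is better), sketched below with scikit-learn. The toy corpus and the helper `heldout_perplexity` are assumptions for illustration; on the real data the preprocessed summaries would be split into a training and a held-out part, and perplexity should be treated as a rough guide next to manual inspection of the topics.

```python
import math
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus (hypothetical); replace with the preprocessed movie summaries.
train_docs = [
    "detective murder case police killer",
    "police detective killer mystery murder",
    "love wedding couple romance heart",
    "romance couple love heart wedding",
    "alien spaceship planet robot future",
    "robot planet future alien spaceship",
]
heldout_docs = [
    "murder police mystery detective",
    "couple love wedding",
    "spaceship alien robot",
]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_heldout = vectorizer.transform(heldout_docs)  # reuse the training vocabulary

def heldout_perplexity(k):
    """Fit LDA with k topics on the training part and score the held-out part."""
    lda = LatentDirichletAllocation(n_components=k, random_state=42)
    lda.fit(X_train)
    return lda.perplexity(X_heldout)

perplexities = {k: heldout_perplexity(k) for k in (2, 3, 4)}
for k, value in perplexities.items():
    print(f"{k} topics: held-out perplexity = {value:.1f}")
```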

# Task 2
Preprocess the data so that it is fit for the pipeline you intend to solve the task with. Briefly explain every step of preprocessing and why it is useful for this analysis.


Preprocessing Steps:
✔ Lowercasing: To ensure consistency ("Action" and "action" are treated the same).
✔ Removing Punctuation & Special Characters: LDA works with word distributions, not punctuation.
✔ Tokenization: Split text into words.
✔ Removing Stopwords: Words like "the", "is", "and" occur in every summary and carry little topic-specific meaning.
✔ Lemmatization: Convert words to their base form ("running" → "run").


In [2]:
# Load text file
with open('/Users/oayanwale/Downloads/NLP_Exercise_24_25/movies.txt', 'r', encoding='utf-8') as file:
    movie_summaries = file.readlines()

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Initialize tools
nltk.download('stopwords')
nltk.download('punkt')  # Tokenizer data; newer NLTK releases may instead need nltk.download('punkt_tab')
nltk.download('wordnet')
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(r'[^a-z\s]', '', text)  # Keep only a-z and whitespace (also drops digits and accented letters)
    words = word_tokenize(text)  # Tokenization
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [lemmatizer.lemmatize(word) for word in words]  # Lemmatization
    return " ".join(words)

# Apply preprocessing
processed_summaries = [preprocess_text(summary) for summary in movie_summaries]


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oayanwale/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/oayanwale/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/oayanwale/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
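One side effect of the regex `[^a-z\s]` is that accented letters are deleted outright, so a word like "fiancé" is truncated to "fianc". A minimal sketch of folding accents to plain ASCII before the regex is applied, using only the standard library; the helper names `fold_accents` and `strip_non_letters` are hypothetical and not part of the solution above.

```python
import re
import unicodedata

def fold_accents(text):
    """Decompose accented characters and drop the combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def strip_non_letters(text):
    """Same cleaning step as in preprocess_text: lowercase, keep a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

print(strip_non_letters("Her fiancé"))                # accent is lost together with the letter
print(strip_non_letters(fold_accents("Her fiancé")))  # letter survives as plain "e"
```

Whether this matters depends on how many names and loanwords with accents the summaries contain.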


# Task 3
Solve the task by creating a pipeline with the model you deem to be optimal for this task.


Applying LDA for Genre Detection
Now, we’ll use LDA to extract topics from movie summaries.

Steps for LDA Pipeline:
✔ Convert processed text into a bag-of-words representation.
✔ Train an LDA model with a chosen number of topics (5 here, as a starting point).
✔ Analyze the topics to infer movie genres.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Convert text into a bag-of-words representation
vectorizer = CountVectorizer()
text_matrix = vectorizer.fit_transform(processed_summaries)

# Train LDA model
num_topics = 5  # Adjust based on dataset
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(text_matrix)

# Display topics
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx + 1}:")
    print([words[i] for i in topic.argsort()[-10:]])  # Top 10 words in ascending order of weight (strongest last)
    print()


Topic 1:
['take', 'year', 'one', 'young', 'must', 'world', 'new', 'family', 'find', 'life']

Topic 2:
['go', 'love', 'story', 'young', 'find', 'get', 'take', 'world', 'two', 'one']

Topic 3:
['murder', 'mysterious', 'get', 'family', 'young', 'two', 'friend', 'life', 'new', 'find']

Topic 4:
['new', 'town', 'find', 'friend', 'love', 'must', 'two', 'one', 'young', 'life']

Topic 5:
['school', 'one', 'man', 'girl', 'get', 'young', 'find', 'woman', 'new', 'life']



Topic 1: Likely Genre: Drama / Coming-of-Age
Topic 2: Likely Genre: Romance / Adventure
Topic 3: Likely Genre: Mystery / Thriller
Topic 4: Likely Genre: Comedy / Slice of Life
Topic 5: Likely Genre: Teen Drama / Romance

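LDA yields five topics that can be read as tentative genres.
✔ Only a few words are distinctive (for example "murder" and "mysterious" in Topic 3, "school" and "girl" in Topic 5); generic words such as "life", "find", "young", and "new" dominate most topics, so the genre labels above are educated guesses.
✔ This approach works without labeled data, but the topics still need manual interpretation.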
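To check how distinct the five topics really are, the sketch below measures the overlap of their top-10 word lists, copied verbatim from the output above. The variable and function names (`top_words`, `jaccard`, `distinctive`) are chosen here for illustration.

```python
from itertools import combinations

# Top-10 words per topic, copied from the CountVectorizer-based LDA output above.
top_words = {
    1: ['take', 'year', 'one', 'young', 'must', 'world', 'new', 'family', 'find', 'life'],
    2: ['go', 'love', 'story', 'young', 'find', 'get', 'take', 'world', 'two', 'one'],
    3: ['murder', 'mysterious', 'get', 'family', 'young', 'two', 'friend', 'life', 'new', 'find'],
    4: ['new', 'town', 'find', 'friend', 'love', 'must', 'two', 'one', 'young', 'life'],
    5: ['school', 'one', 'man', 'girl', 'get', 'young', 'find', 'woman', 'new', 'life'],
}

def jaccard(a, b):
    """Share of words two lists have in common, relative to all words in either list."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Pairwise overlap between all topics.
overlaps = {(i, j): jaccard(top_words[i], top_words[j]) for i, j in combinations(top_words, 2)}

# Words that appear in every topic carry no information about genre.
shared_by_all = set.intersection(*(set(words) for words in top_words.values()))

# Words that appear in exactly one topic are the most useful for labeling it.
distinctive = {
    i: sorted(w for w in words if all(w not in top_words[j] for j in top_words if j != i))
    for i, words in top_words.items()
}

print("Shared by all topics:", sorted(shared_by_all))
for i, words in distinctive.items():
    print(f"Topic {i} distinctive words: {words}")
```

Filtering very frequent words before fitting (for instance via the `max_df` parameter of `CountVectorizer`) is one way to reduce this overlap.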

In [8]:

#  Task 3 option 2: Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_pipeline(documents, num_topics=5):
    # Preprocess the documents
    preprocessed_documents = [preprocess_text(doc) for doc in documents]
    
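    # Vectorization (using TF-IDF). Note: LDA is formulated for raw word counts,
    # so TF-IDF weights are a heuristic here; CountVectorizer is the standard input.
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')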
    tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
    
    # LDA Modeling
    lda_model = LatentDirichletAllocation(n_components=num_topics, max_iter=10, learning_method='online', random_state=42)
    lda_model.fit(tfidf_matrix)
    
    # Extract and print topics
    feature_names = vectorizer.get_feature_names_out()
    topics = []
    for topic_idx, topic in enumerate(lda_model.components_):
        top_feature_indexes = topic.argsort()[:-10 - 1:-1]
        top_features = [feature_names[i] for i in top_feature_indexes]
        topics.append(top_features)
        print(f"Topic {topic_idx+1}: {', '.join(top_features)}")

    return topics

In [9]:
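# Run LDA pipeline. Note: processed_summaries is already preprocessed and lda_pipeline
# preprocesses again; passing movie_summaries instead would avoid the redundant second pass.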
topics = lda_pipeline(processed_summaries, num_topics=5)

Topic 1: computer, shark, accused, honor, dracula, hacker, jigsaw, altered, horde, diabolical
Topic 2: sport, obsession, rose, sound, annie, industry, fianc, attractive, funny, challenged
Topic 3: life, young, new, family, world, friend, man, woman, year, love
Topic 4: unsuspecting, pacific, surfer, abusive, stalk, giovanni, bed, asterix, patriarch, hearing
Topic 5: vega, heiress, dinosaur, ex, walter, bobby, dangerously, imprisoned, wendy, millionaire
