# Recession Pop Analysis: Dynamic Topic Modeling with KeyNMF

This notebook quantifies how "recession pop" themes evolve between 2000 and 2025 by fitting a dynamic `KeyNMF` topic model from the `turftopic` library to song lyrics.

## 1. Install and Import Dependencies
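First, we import the libraries used throughout the notebook. The cell does not install anything, so the packages must already be available in the environment (for example via `pip install kagglehub turftopic pandas numpy matplotlib seaborn plotly`).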

In [None]:
import kagglehub
import pandas as pd
import numpy as np
from turftopic import KeyNMF
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

## 2. Download and Load Dataset
We use `kagglehub` to download the dataset and load it into a pandas DataFrame.

In [None]:
# Download latest version
path = kagglehub.dataset_download("rithikkotha/music-dataset")
print("Path to dataset files:", path)

# Find the CSV file
csv_file = [f for f in os.listdir(path) if f.endswith('.csv')][0]
csv_path = os.path.join(path, csv_file)

# Load the dataset
df = pd.read_csv(csv_path)
print(f"Dataset shape: {df.shape}")
df.head()
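
The list-indexing step above raises a bare `IndexError` when the download contains no CSV, and it picks whichever file `os.listdir` happens to return first when there are several. A minimal sketch of a more defensive lookup follows. `find_csv` is a hypothetical helper, not part of `kagglehub`, and it assumes the CSV sits in the top level of the downloaded folder.

```python
import os


def find_csv(folder):
    """Return the path of the alphabetically first CSV file in `folder`.

    Hypothetical helper: raises a descriptive error instead of an
    IndexError when the folder holds no CSV file.
    """
    csv_files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {folder!r}")
    return os.path.join(folder, csv_files[0])
```

In the notebook this would be called as `csv_path = find_csv(path)` before `pd.read_csv(csv_path)`.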

## 3. Data Preprocessing and Filtering
We convert `release_date` to datetime, keep only tracks released from 2000 through 2025, and drop rows with missing lyrics or dates. Because the analysis targets "recession pop", it is worth inspecting the genre labels before modeling. The optional genre filter in the cell is left commented out, so all genres are analyzed by default.

In [None]:
# Convert release_date to datetime
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# Filter for years 2000-2025
df = df[(df['release_date'].dt.year >= 2000) & (df['release_date'].dt.year <= 2025)]

# Drop rows with missing lyrics or release_date
df = df.dropna(subset=['lyrics', 'release_date'])

# Optional: Filter for Pop genre if 'genre' column exists and has 'Pop'
# Let's check unique genres first (commented out for now, assuming we want all or will filter later)
# print(df['genre'].unique())
# df = df[df['genre'].str.contains('Pop', case=False, na=False)]

# Sort by date
df = df.sort_values('release_date')

print(f"Filtered dataset shape: {df.shape}")

# Prepare corpus and timestamps
corpus = df['lyrics'].tolist()
timestamps = df['release_date'].tolist()

print(f"Number of documents: {len(corpus)}")
print(f"Time range: {min(timestamps)} to {max(timestamps)}")
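
Before fitting a dynamic model, it can help to check how many lyrics fall into each year, since sparsely covered years give the model little to work with. The sketch below counts documents per calendar year. `documents_per_year` is a hypothetical helper, and the toy timestamps are for illustration only; it assumes each timestamp exposes a `.year` attribute, as both `datetime` and pandas `Timestamp` objects do.

```python
from collections import Counter
from datetime import datetime


def documents_per_year(timestamps):
    """Count documents per calendar year, returned in chronological order."""
    counts = Counter(ts.year for ts in timestamps)
    return dict(sorted(counts.items()))


# Toy timestamps for illustration only; in the notebook, pass `timestamps`.
example = [datetime(2008, 3, 1), datetime(2008, 9, 15), datetime(2020, 4, 2)]
print(documents_per_year(example))
```

Years missing from the result have no documents at all, which is worth knowing before interpreting any trend across them.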

## 4. Initialize KeyNMF Topic Model
We initialize the `KeyNMF` model, which uses a sentence-transformer encoder for embeddings (the `turftopic` default is usually `all-MiniLM-L6-v2` or similar). We set `n_components` (the number of topics) and `top_n` (the number of keywords extracted from each document, per the `turftopic` documentation). We start with 10 topics for an overview.

In [None]:
# Initialize KeyNMF
# n_components: number of topics
# top_n: number of keywords extracted from each document (see the turftopic docs)
model = KeyNMF(n_components=10, top_n=10, random_state=42)
print("Model initialized.")

## 5. Fit Dynamic Topic Model
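We use `fit_transform_dynamic` to fit the model and track topics over time. With `bins=25`, the observed time range is split into 25 bins, roughly one per year. Note that 2000 through 2025 inclusive spans 26 calendar years, so the bins do not line up exactly with year boundaries.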

In [None]:
# Fit the dynamic model
# bins=25 creates roughly yearly bins (2000-2025 inclusive spans 26 calendar years)
# This might take some time as it computes embeddings and fits the model
document_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=25)

print("Model fitting complete.")
model.print_topics()
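
If bins aligned to calendar years are preferred over an integer bin count, explicit bin edges can be built by hand. The sketch below assumes that the `bins` argument of `fit_transform_dynamic` also accepts a list of `datetime` edges, as the `turftopic` documentation describes; check this against the installed version. `yearly_bin_edges` is a hypothetical helper.

```python
from datetime import datetime


def yearly_bin_edges(start_year, end_year):
    """Return January 1 edges covering start_year..end_year inclusive.

    N calendar years need N + 1 edges, so the last edge is January 1
    of the year after end_year.
    """
    return [datetime(year, 1, 1) for year in range(start_year, end_year + 2)]


edges = yearly_bin_edges(2000, 2025)
print(len(edges) - 1, "bins from", edges[0].date(), "to", edges[-1].date())

# Assumed usage (not verified here): pass the edges in place of the integer.
# document_topic_matrix = model.fit_transform_dynamic(
#     corpus, timestamps=timestamps, bins=edges
# )
```

With these edges each bin covers exactly one calendar year, which makes it easier to line results up with dated economic events.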

## 6. Visualize Topics Over Time
We visualize the evolution of topics with `plot_topics_over_time()`, which returns an interactive figure showing how topic importance changes across the time bins.

In [None]:
# Plot topics over time
fig = model.plot_topics_over_time()
fig.show()

## 7. Analyze Specific Topic Trends
We can extract the underlying data to perform custom analysis, focusing on themes related to "recession pop" (e.g., escapism, partying, money, or economic struggle). Note: You will need to identify which topic ID corresponds to these themes from the previous output.

In [None]:
# Temporal topic importance is stored on the model after fit_transform_dynamic.
# Per the turftopic docs it should have shape (n_time_bins, n_components);
# confirm with temporal_importance.shape before interpreting the plot.
temporal_importance = getattr(model, 'temporal_importance_', None)

if temporal_importance is None:
    # A possible fallback is model.prepare_dynamic_topic_data(corpus, timestamps=timestamps, bins=25),
    # but the structure of the returned object should be inspected first.
    print("temporal_importance_ not found; check the installed turftopic version.")
else:
    temporal_importance = np.asarray(temporal_importance)
    n_bins, n_topics = temporal_importance.shape

    # Build a long-format DataFrame with columns 'Time', 'Topic', 'Importance'.
    # Bins are labeled by position (0 = earliest); map them to dates if bin edges are available.
    temporal_df = (
        pd.DataFrame(temporal_importance, columns=[f"Topic {i}" for i in range(n_topics)])
        .assign(Time=range(n_bins))
        .melt(id_vars='Time', var_name='Topic', value_name='Importance')
    )

    # Plot every topic; filter temporal_df on 'Topic' to focus on a subset
    sns.lineplot(data=temporal_df, x='Time', y='Importance', hue='Topic')
    plt.title("Topic Importance Over Time")
    plt.xlabel("Time bin")
    plt.show()
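
Identifying which topic IDs relate to "recession pop" can be partly automated by comparing each topic's top words with a hand-picked seed vocabulary. The sketch below is one simple way to do that, not a feature of `turftopic`. `rank_topics_by_seed_overlap`, the seed list, and the toy topics are all hypothetical, and the toy topics are made-up words rather than model output.

```python
def rank_topics_by_seed_overlap(topic_words, seed_words):
    """Rank topic IDs by how many of their top words appear in seed_words.

    topic_words: dict mapping topic ID -> list of top words.
    seed_words: iterable of theme keywords chosen by the analyst.
    Returns a list of (topic_id, n_matches, matched_words), best match first.
    """
    seeds = {w.lower() for w in seed_words}
    ranked = []
    for topic_id, words in topic_words.items():
        matched = sorted({w.lower() for w in words} & seeds)
        ranked.append((topic_id, len(matched), matched))
    # Sort by match count (descending), then by topic ID for a stable order
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Illustrative seed list and made-up topic words, not model output
seeds = ["party", "dance", "money", "broke", "club", "tonight"]
toy_topics = {
    0: ["love", "heart", "baby", "kiss"],
    1: ["party", "club", "dance", "tonight"],
    2: ["money", "cash", "broke", "rent"],
}
for topic_id, n_matches, matched in rank_topics_by_seed_overlap(toy_topics, seeds):
    print(topic_id, n_matches, matched)
```

To apply this to the fitted model, `topic_words` would have to be built from the model's own topic descriptions, for instance from the words shown by `model.print_topics()`. Exact word overlap is a crude signal, so the ranking is best treated as a shortlist for manual review.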
