# Genre Fix

This dataset has too many genres, and they are too granular. This notebook aims to:
1. Split "joint" genres into their parts, e.g. "Science and Drama".
2. Normalize genres to common names.
3. Combine genres until there are 10 main ones.

## Import data

In [1]:
import os
import sys
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd  # used below as pd.DataFrame

notebook_dir = Path().resolve()
base_path = os.path.abspath(notebook_dir.parent.parent)
sys.path.append(base_path)
from src.data_utils import *


In [2]:
embeddings, movie_ids = load_movie_embeddings(os.path.join(base_path, "data", "data_final"))

In [3]:
embeddings.shape

(161553, 1024)

In [4]:
movie_ids.shape

(161553,)

In [5]:
movie_df = load_movie_data(os.path.join(base_path, "data", "data_final"))

In [6]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213903 entries, 0 to 213902
Data columns (total 22 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   movie_id        213903 non-null  object
 1   title           213903 non-null  object
 2   summary         0 non-null       object
 3   release_date    213903 non-null  object
 4   genre           176075 non-null  object
 5   director        190332 non-null  object
 6   actors          145306 non-null  object
 7   duration        131113 non-null  object
 8   imdb_id         201370 non-null  object
 9   country         204335 non-null  object
 10  sitelinks       213903 non-null  object
 11  wikipedia_link  213903 non-null  object
 12  budget          8609 non-null    object
 13  box_office      7786 non-null    object
 14  awards          9830 non-null    object
 15  set_in_period   3500 non-null    object
 16  year            213903 non-null  int64 
 17  popularity      181485 non-nu

In [7]:
movie_df[movie_df.year==1950].shape

(1523, 22)

## Genre exploration

In [8]:
all_genres = movie_df[movie_df.genre.notna()].genre.unique()

In [9]:
all_genres

array(['musical, fantasy film, melodrama, musical film, romance film, family film, cinematic fairy tale',
       'drama film, comedy film, film based on literature',
       'drama film, samurai cinema, crime film, medieval film, flashback film',
       ..., 'fantasy drama',
       'documentary film, prison film, musical film, LGBT-related film',
       'thriller film, time-travel film'], shape=(13188,), dtype=object)

In [10]:
# Sample randomly and see examples of genres
import random

for _ in range(5):
    print(random.choice(all_genres))

action film, war film, science fiction film, thriller film, speculative fiction film, apocalyptic film
horror film, erotic film, zombie film
fantasy film, action film, horror film, adventure film, science fiction film, superhero film
Nazi exploitation, exploitation film
comedy film, science fiction film, science fiction comedy, dystopian film


In [11]:
random.choice(all_genres[np.array(["and" in i for i in all_genres])])

'drama film, fantasy film, action film, sword-and-sandal film'

The common delimiters seem to be commas and "and". Splitting on these should give us all possible genres, although "and" also occurs inside hyphenated names such as "sword-and-sandal", and there are weirder entries like "film from a novel".
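A minimal sketch of splitting on both delimiters, assuming that only a standalone word "and" surrounded by whitespace separates genres, so hyphenated names such as "sword-and-sandal" stay intact. The helper name `split_genres` is hypothetical. This rule would also split names like "horror anime and manga", so the result would need a manual check.

```python
import re

# Split on commas, or on the standalone word "and" surrounded by whitespace.
_DELIMITER = re.compile(r"\s*,\s*|\s+and\s+", flags=re.IGNORECASE)


def split_genres(raw):
    """Split a raw genre string into lower-cased genre names (hypothetical helper)."""
    parts = _DELIMITER.split(raw)
    return [p.strip().lower() for p in parts if p.strip()]


print(split_genres("Science and Drama"))
print(split_genres("drama film, fantasy film, action film, sword-and-sandal film"))
```

The hyphenated "sword-and-sandal" is left alone because its "and" has no whitespace on either side.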

In [12]:
# Split on "," only (a standalone "and" is not split here), then lower-case and drop the word "film"
import re

new_genres = []
for row in all_genres:
    split_genres = row.split(",")
    split_genres = [i.lower().replace("film", "").strip() for i in split_genres]
    new_genres.extend(split_genres)

In [13]:
from collections import Counter

In [14]:
genre_counter = Counter(new_genres)

In [15]:
len(genre_counter)

975

In [16]:
genre_counter.most_common(20)

[('drama', 4274),
 ('action', 2416),
 ('comedy', 2388),
 ('thriller', 1783),
 ('science fiction', 1655),
 ('horror', 1632),
 ('fantasy', 1623),
 ('adventure', 1619),
 ('crime', 1394),
 ('romance', 1073),
 ('based on a novel', 1045),
 ('mystery', 883),
 ('musical', 831),
 ('lgbt-related', 822),
 ('comedy drama', 742),
 ('based on literature', 728),
 ('biographical', 674),
 ('family', 656),
 ('teen', 612),
 ('romantic comedy', 595)]

Embed the genre strings with a sentence-transformer model and cluster the embeddings with k-means.

In [17]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

In [18]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [19]:
genre_embeddings = model.encode(new_genres)

In [20]:
genre_embeddings.shape

(47655, 384)
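The shape above has one row per genre occurrence (47655), although only 975 distinct strings exist. As a sketch, each distinct string could be encoded once and its frequency carried along as a weight. `unique_with_weights` and `expand_labels` are hypothetical helpers, and the commented calls assume the counts are passed to `KMeans.fit_predict` through its `sample_weight` argument. Cluster assignments obtained this way are not guaranteed to match the unweighted run in this notebook.

```python
from collections import Counter


def unique_with_weights(genres):
    """Return distinct genres (first-seen order) and their occurrence counts."""
    counts = Counter(genres)
    unique = list(counts)
    weights = [counts[g] for g in unique]
    return unique, weights


def expand_labels(genres, unique, unique_labels):
    """Map one label per distinct genre back onto every occurrence."""
    label_of = dict(zip(unique, unique_labels))
    return [label_of[g] for g in genres]


# Toy data standing in for `new_genres`.
toy_genres = ["drama", "comedy", "drama", "horror", "drama"]
unique, weights = unique_with_weights(toy_genres)

# With the notebook's objects this might look like (not executed here):
#   unique_embeddings = model.encode(unique)
#   unique_labels = kmeans.fit_predict(unique_embeddings, sample_weight=weights)
toy_unique_labels = [0, 1, 2]  # placeholder labels, for illustration only
labels = expand_labels(toy_genres, unique, toy_unique_labels)
print(unique, weights, labels)
```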

In [21]:
# Cluster the genre embeddings with k-means (fixed seed for reproducibility)
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=123)
labels = kmeans.fit_predict(genre_embeddings)


In [22]:
genre_clusters = pd.DataFrame({"genre": new_genres, "cluster": labels})

In [23]:
genre_clusters[genre_clusters.cluster==9].genre.unique()

array(['noir', 'thriller', 'suspense', 'crime thriller',
       'political thriller', 'horror', 'psychological thriller', 'ghost',
       'comedy thriller', 'romantic thriller', 'vampire', 'zombie',
       'supernatural horror', 'science fiction horror', 'horror fiction',
       'psychological horror', 'natural horror', 'neo-noir',
       'action thriller', 'erotic thriller', 'midnight movie',
       'animal horror', 'supernatural', 'horror western', 'body horror',
       'horror fantasy', 'gothic horror', 'thriller anime',
       'erotic horror', 'list of holiday horror s',
       'horror anime and manga', 'thriller television series',
       'horror television series', 'techno-thriller', 'legal thriller',
       'body horror television program', 'lovecraftian horror',
       'japanese horror', 'medical thriller', 'horrorcore', 'halloween',
       'horror literature', 'cosmic horror', 'teen horror',
       'financial thriller', 'paranormal fiction', 'folk horror',
       'thriller pla

The 10 main clusters:
* cluster 0: Others
* cluster 1: Comedy
* cluster 2: Romance
* cluster 3: Drama
* cluster 4: Fantasy
* cluster 5: Action
* cluster 6: Science Fiction
* cluster 7: Family
* cluster 8: Mystery
* cluster 9: Thriller
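The names above look like they were assigned by reading each cluster's members. As a sketch of how that inspection could be made quicker, the most frequent genres per cluster can be listed. `top_genres_per_cluster` is a hypothetical helper that takes plain lists in place of `new_genres` and `labels`, and the toy data is illustrative.

```python
from collections import Counter, defaultdict


def top_genres_per_cluster(genres, labels, n=5):
    """Return {cluster_id: [(genre, count), ...]} with the n most frequent genres."""
    per_cluster = defaultdict(Counter)
    for genre, label in zip(genres, labels):
        per_cluster[int(label)][genre] += 1
    return {c: counter.most_common(n) for c, counter in sorted(per_cluster.items())}


# Toy data: a few occurrences with their cluster ids.
toy_genres = ["thriller", "horror", "thriller", "western", "melodrama", "western"]
toy_labels = [9, 9, 9, 0, 0, 0]

summary = top_genres_per_cluster(toy_genres, toy_labels, n=2)
for cluster_id, top in summary.items():
    print(cluster_id, top)
```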

In [24]:
cluster_label_mapping = {
    0: "others",
    1: "comedy",
    2: "romance",
    3: "drama",
    4: "fantasy",
    5: "action",
    6: "scifi",
    7: "family",
    8: "mystery",
    9: "thriller"
}

In [25]:
genre_clusters["new_label"] = genre_clusters["cluster"].apply(lambda x: cluster_label_mapping[x])

In [30]:
movie_df.genre

0         musical, fantasy film, melodrama, musical film...
1         drama film, comedy film, film based on literature
2         drama film, samurai cinema, crime film, mediev...
3         drama film, film noir, adventure film, mystery...
4                     drama film, film noir, flashback film
                                ...                        
213898                                          comedy film
213899                                    biographical film
213900                                     documentary film
213901                                                  NaN
213902                                          action film
Name: genre, Length: 213903, dtype: object

In [28]:
for genre in cluster_label_mapping.values():
    print(f"Genre - {genre.upper()}")
    original_genres = genre_clusters[genre_clusters.new_label == genre].genre.unique()
    for original_g in original_genres:
        print(f"    - {original_g}")

Genre - OTHERS
    - melodrama
    - cinematic fairy tale
    - samurai cinema
    - medieval
    - flashback
    - heist
    - gangster
    - western
    - auteur
    - sword-and-sandal
    - disaster
    - propaganda
    - swashbuckler
    - slapstick
    - art
    - revisionist western
    - boxing
    - pirate
    - silent
    - experimental
    - documentary
    - trial
    - post-apocalyptic
    - nature documentary
    - historical
    - animated
    - anthology
    - heimat
    - american football
    - epic
    - puppet
    - fairy tale
    - association football
    - docudrama
    - slice of life
    - rumberas
    - telenovela
    - operetta
    - educational
    - semidocumentary
    - short
    - exploitation
    - neorealism
    - animated cartoon
    - coming-of-age
    - monster
    - staliniana
    - superhero
    - buddy
    - adaptation
    - mumbai noir
    - political
    - live-action/animated
    - athletics
    - dance
    - spaghetti western
    - women in pri

In [41]:
# Build the genre-to-cluster-label mapping (one entry per distinct genre)
map_df = genre_clusters.drop_duplicates("genre").drop(columns=["cluster"])
mapping_dict = dict(zip(map_df["genre"], map_df["new_label"]))

In [44]:
len(mapping_dict)

975

In [47]:
import json

In [49]:
with open("../genre_fix_mapping.json", "w") as f:
    json.dump(mapping_dict, f)

In [52]:
with open("../genre_fix_mapping.json", "r") as f:
    test = json.load(f)

In [54]:
test["musical"]

'comedy'
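One way the saved mapping might be applied back to the raw `genre` column is sketched below. `to_main_genres` is a hypothetical helper: it repeats the normalization used earlier (comma split, lower-casing, removing "film", stripping), returns an empty list for missing values, and sends unmapped genres to "others", which is an assumption. The small `mapping` dict is illustrative and not the full file.

```python
def to_main_genres(raw, mapping, default="others"):
    """Convert a raw comma-separated genre string into a sorted list of main genres."""
    if not isinstance(raw, str):  # NaN / None rows have no genre
        return []
    parts = [p.lower().replace("film", "").strip() for p in raw.split(",")]
    # A set removes duplicates when several fine-grained genres share a main genre.
    return sorted({mapping.get(p, default) for p in parts if p})


# Illustrative subset of the mapping.
mapping = {"horror": "thriller", "zombie": "thriller", "melodrama": "others"}

print(to_main_genres("horror film, zombie film", mapping))
print(to_main_genres("melodrama, horror film", mapping))
print(to_main_genres(float("nan"), mapping))

# With the notebook's objects this might look like (not executed here):
#   movie_df["main_genres"] = movie_df.genre.apply(lambda g: to_main_genres(g, test))
```

Returning a list per movie keeps multi-genre films multi-label. Picking a single main genre per film would need an extra rule that this notebook does not define.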