# ***Building a Movie Recommendation System with Synopses Using K-Means Clustering***

## Preliminaries

---

### Imports

We first define the imports needed for the project. Make sure you have read `README.md`, or at least run `pip install -r requirements.txt`, before proceeding!

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np


from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.metrics import silhouette_samples, silhouette_score

In [2]:
# Allows auto reloading of modules
%load_ext autoreload
%autoreload 2

In [3]:
# Add path to src folder to the path variable 
PATH_TO_SRC = "./src"

if PATH_TO_SRC not in sys.path:
  sys.path.append(PATH_TO_SRC)

In [4]:
# Local imports from .py files
from data_preprocessor import DataPreprocessor
from data_clusterer import DataClusterer

#### Parameters
We load the variables (i.e. the parameters for the model) from `src/config.py`.

In [5]:
from config import MOVIES_PATH, GENRES_PATH, TEST_SIZE, RANDOM_STATE, MAX_DF, MIN_DF, STOP_WORDS, N_COMPONENTS, NORMALIZER_COPY, N_CLUSTERS, MAX_ITER, N_INIT

#### Instantiated classes
We instantiate the `KMeans` object with the number of clusters `N_CLUSTERS`, the maximum number of iterations `MAX_ITER` and the number of initializations `N_INIT`.

In [6]:
kmeans = KMeans(n_clusters=N_CLUSTERS, max_iter=MAX_ITER, n_init=N_INIT)
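To make the roles of these three arguments concrete, here is a minimal sketch on toy data. The blob data, the literal parameter values, and `random_state` are assumptions added for illustration; they are not taken from `src/config.py`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs (illustrative only, not the movie data)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# n_init: number of restarts from different initial centroids; the run with
#         the lowest inertia is kept.
# max_iter: upper bound on the iterations of each individual run.
# random_state is an addition here, used only to make the sketch reproducible.
toy_kmeans = KMeans(n_clusters=2, max_iter=100, n_init=5, random_state=0)
toy_labels = toy_kmeans.fit_predict(X_toy)

print(toy_kmeans.cluster_centers_.shape)
print(toy_kmeans.inertia_)
```

Larger `n_init` values trade run time for a lower chance of ending in a poor local optimum.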

## Preprocessing

---

We will now preprocess the data using the `DataPreprocessor` class defined in `src/data_preprocessor.py`, which contains the following methods:

- `countGenres()`: **Counts** the number of unique genres in the dataset.

- `handleDataFrame()`: **Removes** duplicate rows from `df_genres`, **merges** it with `df_movies`, **drops** certain columns from `df_movies`, and **filters** out rows with null values in the *description* column.

- `splitData()`: **Splits** the dataset into training and testing sets. It takes in two parameters: `test_size` which is the proportion of the dataset to include in the test split, and `rand_state` which is the seed used by the random number generator for shuffling the data.

- `vectorizeData()`: **Converts** a collection of raw data into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. *(More on this later)*

- `reduceDim()`: **Performs** dimensionality reduction on a given dataset using the TruncatedSVD algorithm. *(More on this later)*
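As a rough illustration of what `handleDataFrame()` is described as doing, the sketch below uses toy frames. Keeping the first genre per movie, dropping only `tagline`, and the exact call order are assumptions; the real method in `src/data_preprocessor.py` may differ.

```python
import pandas as pd

# Toy stand-ins for df_movies and df_genres (values are illustrative)
df_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "tagline": ["t1", "t2", "t3"],
    "description": ["First synopsis", None, "Third synopsis"],
    "rating": [4.0, 3.5, None],
})
df_genres = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "genre": ["Comedy", "Fantasy", "Drama", "Horror", "Thriller"],
})

# Keep one genre per movie (assumption: the first one listed)
one_genre = df_genres.drop_duplicates(subset="id", keep="first")

# Merge, drop unused columns, and remove rows without a description
df = (
    one_genre.merge(df_movies, on="id", how="inner")
    .drop(columns=["tagline"])
    .dropna(subset=["description"])
    .reset_index(drop=True)
)

print(df[["id", "genre", "name"]])
```

Movie 2 is removed because its description is null, and each remaining movie carries a single genre.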

In [56]:
preprocessor = DataPreprocessor(MOVIES_PATH, GENRES_PATH)
preprocessor.df_movies.head()

|  | id | name | date | tagline | description | minute | rating |
|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Barbie | 2023.0 | She's everything. He's just Ken. | Barbie and Ken are having the time of their li... | 114.0 | 3.91 |
| 1 | 1000002 | Parasite | 2019.0 | Act like you own the place. | All unemployed, Ki-taek's family takes peculia... | 133.0 | 4.57 |
| 2 | 1000003 | Everything Everywhere All at Once | 2022.0 | The universe is so much bigger than you realize. | An aging Chinese immigrant is swept up in an i... | 140.0 | 4.32 |
| 3 | 1000004 | Fight Club | 1999.0 | Mischief. Mayhem. Soap. | A ticking-time-bomb insomniac and a slippery s... | 139.0 | 4.27 |
| 4 | 1000005 | Interstellar | 2014.0 | Mankind was born on Earth. It was never meant ... | The adventures of a group of explorers who mak... | 169.0 | 4.32 |


In [68]:
preprocessor.df_movies.head()

|  | id | genre | name | description | rating |
|---|---|---|---|---|---|
| 1 | 1000001 | Comedy | Barbie | Barbie and Ken are having the time of their li... | 3.91 |
| 2 | 1000002 | Comedy | Parasite | All unemployed, Ki-taek's family takes peculia... | 4.57 |
| 3 | 1000003 | Science Fiction | Everything Everywhere All at Once | An aging Chinese immigrant is swept up in an i... | 4.32 |
| 4 | 1000004 | Drama | Fight Club | A ticking-time-bomb insomniac and a slippery s... | 4.27 |
| 5 | 1000005 | Science Fiction | Interstellar | The adventures of a group of explorers who mak... | 4.32 |


In [69]:
preprocessor.df_movies.loc[-1] = [None, None, None, "Hi", None]
preprocessor.df_movies.index += 1
preprocessor.df_movies.sort_index()

|  | id | genre | name | description | rating |
|---|---|---|---|---|---|
| 0 |  |  |  | Hi |  |
| 1 | Hi | Hi | Hi | Hi | Hi |
| 2 | 1000001 | Comedy | Barbie | Barbie and Ken are having the time of their li... | 3.91 |
| 3 | 1000002 | Comedy | Parasite | All unemployed, Ki-taek's family takes peculia... | 4.57 |
| 4 | 1000003 | Science Fiction | Everything Everywhere All at Once | An aging Chinese immigrant is swept up in an i... | 4.32 |
| ... | ... | ... | ... | ... | ... |
| 638637 | 1896371 | Drama | Xigarette | Xanat lives a secret identity under her teen n... |  |
| 638638 | 1896372 | Drama | Your Way | Udi gets lost in the woods. At the worst time,... |  |
| 638639 | 1896373 | Comedy | YusufPutus1 Baru Saja Mengunggah Video | Yusuf, a worker who recently got viral online ... |  |
| 638644 | 1896391 | Music | 柴咲コウ CONCERT TOUR 2023 ACTOR'S THE BEST | KO SHIBASAKI CONCERT TOUR 2023 - ACTOR'S THE BEST |  |


In [73]:
preprocessor.df_genres.head(20)

|  | id | genre |
|---|---|---|
| 0 | 1000001 | Comedy |
| 1 | 1000001 | Fantasy |
| 2 | 1000001 | Adventure |
| 3 | 1000002 | Comedy |
| 4 | 1000002 | Thriller |
| 5 | 1000002 | Drama |
| 6 | 1000003 | Science Fiction |
| 7 | 1000003 | Action |
| 8 | 1000003 | Adventure |
| 9 | 1000004 | Drama |


#### Data Cleaning and Splitting
Here we clean and split the data using the methods `handleDataFrame` and `splitData`, respectively.

In [58]:
preprocessor.handleDataFrame()
X_train, X_test, y_train, y_test = preprocessor.splitData(TEST_SIZE, RANDOM_STATE)
X_train_desc_only = X_train.description
X_train_desc_only_copy = X_train_desc_only.copy()  # real copy, so later edits do not alter X_train_desc_only

Reduced Length:  540740
Reduced Columns:  Index(['id', 'genre', 'name', 'description', 'rating'], dtype='object')
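A minimal sketch of how `splitData` might wrap `train_test_split` is shown below. Using the genre column as the target and the specific `test_size` and `random_state` values are assumptions for illustration. As a side note, 540740 cleaned rows and the 378518 training samples reported in this notebook would be consistent with a test fraction of 0.3.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned movie data (values are illustrative)
toy = pd.DataFrame({
    "description": [f"synopsis {i}" for i in range(20)],
    "genre": ["Drama", "Comedy"] * 10,
})

# Assumption: the target is the genre column; the real splitData may differ
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy["genre"], test_size=0.25, random_state=42
)

print(len(X_tr), len(X_te))  # 15 5
```

Fixing `random_state` makes the shuffle, and therefore the split, reproducible between runs.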


In [59]:
X_train_desc_only_copy.loc[-1] = "Dr. Kenzou Tenma, an elite neurosurgeon recently engaged to his hospital director's daughter, is well on his way to ascending the hospital hierarchy. That is until one night, a seemingly small event changes Dr. Tenma's life forever. While preparing to perform surgery on someone, he gets a call from the hospital director telling him to switch patients and instead perform life-saving brain surgery on a famous performer. His fellow doctors, fiancée, and the hospital director applaud his accomplishment; but because of the switch, a poor immigrant worker is dead, causing Dr. Tenma to have a crisis of conscience. So when a similar situation arises, Dr. Tenma stands his ground and chooses to perform surgery on the young boy Johan Liebert instead of the town's mayor. Unfortunately, this choice leads to serious ramifications for Dr. Tenma—losing his social standing being one of them. However, with the mysterious death of the director and two other doctors, Dr. Tenma's position is restored. With no evidence to convict him, he is released and goes on to attain the position of hospital director. Nine years later when Dr. Tenma saves the life of a criminal, his past comes back to haunt him—once again, he comes face to face with the monster he operated on. He must now embark on a quest of pursuit to make amends for the havoc spread by the one he saved."
X_train_desc_only_copy.index += 1
X_train_desc_only_copy = X_train_desc_only_copy.sort_index()

X_train_desc_only_copy

0         Dr. Kenzou Tenma, an elite neurosurgeon recent...
1         Barbie and Ken are having the time of their li...
2         All unemployed, Ki-taek's family takes peculia...
4         A ticking-time-bomb insomniac and a slippery s...
5         The adventures of a group of explorers who mak...
                                ...                        
638633    A long term couple sit down for a romantic mea...
638634    Winterland is a celebration of ski and snowboa...
638636    Xanat lives a secret identity under her teen n...
638637    Udi gets lost in the woods. At the worst time,...
638638    Yusuf, a worker who recently got viral online ...
Name: description, Length: 378519, dtype: object

#### Vectorization
We will now explore the vectorization of the dataset through the following steps.
1. The method `vectorizeData` has the following parameters:

    - `X`: The raw text data, e.g. a pandas Series of synopses such as `X_train.description`.

    - `max_df`:  A float or integer representing the maximum document frequency threshold. Terms with a document frequency higher than this threshold will be ignored.

    - `min_df`: An integer representing the minimum document frequency threshold. Terms with a document frequency lower than this threshold will be ignored.

    - `custom_stop_words`: A dictionary of custom stop words to be added to the existing English stop words.
    
    - `verbose`: A boolean indicating whether to print additional information during the vectorization process.

2. It combines the existing English stop words along with the custom stop words provided.

3. It creates an instance of the `TfidfVectorizer` class, which is a subclass of `CountVectorizer`. `CountVectorizer` is responsible for converting a collection of text documents into a matrix of token counts.

4. It sets the parameters of the `TfidfVectorizer` instance, including `max_df`, `min_df`, and `stop_words`.

5. It measures the time it takes to vectorize the data by calling the `fit_transform` method of the `TfidfVectorizer` instance on the input data `X`. This method converts the raw data into a matrix of TF-IDF features.

6. It calculates the time taken for vectorization and prints it if the `verbose` parameter is set to `True`.

7. It prints additional information about the vectorized data, including the number of samples, number of features, and percentage of nonzero entries.

8. Finally, it returns the vectorized data (`X_tfidf`) and the feature names obtained from the `TfidfVectorizer` instance.

In [11]:
X_tfidf, features = preprocessor.vectorizeData(X_train_desc_only, MAX_DF, MIN_DF)

Vectorized in 7.756 s
# of Samples: 378518
# of Features: 81591
% of Nonzero Entries: 0.010


#### Dimensionality Reduction

Dimensionality reduction is performed through the following steps.

1. The method `reduceDim` has the following parameters:

    - `X`: The input data, here the TF-IDF matrix `X_tfidf`.

    - `num_comp`: An integer specifying the desired dimensionality of the output data.

    - `normalizer_copy`: A boolean that presumably maps to the `copy` argument of `Normalizer`; when `False`, the data is normalized in place instead of on a copy.

    - `verbose`: A boolean indicating whether to print verbose output.

2. The method first creates a pipeline using the `make_pipeline` function, which chains together the TruncatedSVD transformer and the Normalizer transformer. The TruncatedSVD transformer performs the dimensionality reduction, while the Normalizer transformer normalizes the data.

3. Next, the method fits the pipeline to the input data (`X`) using the `fit_transform` method. This step reduces the dimensionality of the data and returns the transformed data (`X_lsa`).

4. The method then calculates the explained variance of the SVD step and the time taken for the dimensionality reduction. If the `verbose` parameter is set to `True`, it prints the time taken and the explained variance.

5. Finally, the method returns the transformed data (`X_lsa`) and the pipeline (`lsa`).
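A minimal sketch of these steps is given below, assuming a random sparse matrix in place of the real TF-IDF features. The matrix size, `n_components=20`, and `random_state` are illustrative assumptions rather than the values in `src/config.py`.

```python
from time import time
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Stand-in for the TF-IDF matrix: 200 samples x 500 sparse features
X_sparse = sparse_random(200, 500, density=0.05, format="csr", random_state=0)

# Step 2: chain SVD and L2 normalization (copy=False normalizes in place)
lsa_toy = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),
    Normalizer(copy=False),
)

# Step 3: fit the pipeline and transform the data in one pass
t0 = time()
X_lsa_toy = lsa_toy.fit_transform(X_sparse)
elapsed = time() - t0

# Step 4: explained variance is read from the SVD step of the pipeline
explained = lsa_toy[0].explained_variance_ratio_.sum()

print(f"LSA done in {elapsed:.3f} s")
print(f"Explained variance of the SVD step: {explained * 100:.1f}%")
print(X_lsa_toy.shape)
```

Because every row is scaled to unit length, Euclidean distance between rows becomes a monotonic function of cosine similarity, which tends to suit k-means on text features.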

In [13]:
genres = preprocessor.countGenres()
X_lsa, lsa = preprocessor.reduceDim(X_tfidf, N_COMPONENTS, NORMALIZER_COPY)

# of Genres:  19
Unique Genres:
                     id
genre                  
Action            19194
Adventure          8827
Animation         39939
Comedy            89726
Crime             15459
Documentary      138760
Drama            153192
Family            12476
Fantasy            9163
History            7233
Horror            31020
Music             29414
Mystery            8362
Romance           23825
Science Fiction    9519
TV Movie          10236
Thriller          21435
War                4647
Western            6205
LSA done in 7.316 s
Explained variance of the SVD step: 5.0%


In [14]:
X_lsa.shape

(378518, 50)

## Clustering and Optimizing K

---

In [None]:
# clusterer = DataClusterer(X_lsa, N_CLUSTERS, MAX_ITER, N_INIT)
# K, inertias = clusterer.computeKMeansInertia()

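silhouettes = []
k_values = range(2, N_CLUSTERS + 1)

for k in k_values:
    # Use a separate name so the `kmeans` object defined earlier is not overwritten
    km = KMeans(n_clusters=k, max_iter=MAX_ITER, n_init=N_INIT, random_state=RANDOM_STATE)
    print("Instantiated", k)
    km.fit(X_lsa)
    print("Model fitted")
    # The full silhouette needs all pairwise distances, so score a random subsample
    # (sample_size=10000 is an arbitrary choice, not a tuned value)
    score = silhouette_score(X_lsa, km.labels_, sample_size=10000, random_state=RANDOM_STATE)
    silhouettes.append(score)
    print("Silhouette computed")

plt.plot(k_values, silhouettes, 'bx-')
plt.xlabel('K value')
plt.ylabel('Silhouette')
plt.show()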

In [33]:
X_lsa.shape

(378518, 50)

In [18]:
clusterer = DataClusterer(X_lsa, N_CLUSTERS, MAX_ITER, N_INIT)

In [19]:
optimal_k = 15
centers, y_kmeans, inertia = clusterer.computeKMeans(optimal_k)
clusterer.getClusters(optimal_k, centers, features, lsa)

Cluster 0: school high student students teacher girls college group class girl friends new year life children friend day young gets way 
Cluster 1: woman young husband man love life mysterious finds meets relationship married falls beautiful gets home men past help story tries 
Cluster 2: series movie short comedy based drama horror directed tv stories animated video films special documentary animation film events set new 
Cluster 3: love young girl father time day wife friends mother people home new couple house night make friend son town years 
Cluster 4: life love new death work way film years time day young documentary people real lives change make journey living takes 
Cluster 5: old year years new life boy girl mother father time friends friend home love lives son make day house parents 
Cluster 6: city new york big lives living takes village way brother music place people night life film help young streets country 
Cluster 7: family life father home son daughter mother young chi

In [63]:
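X_train_clustered = X_train.assign(cluster=y_kmeans)

# Top-rated movies that share a cluster with the row labelled 0
target_cluster = X_train_clustered.loc[0, 'cluster']
same_cluster = X_train_clustered[X_train_clustered['cluster'] == target_cluster]
same_cluster.sort_values('rating', ascending=False).head(10)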

|  | id | genre | name | description | rating | cluster |
|---|---|---|---|---|---|---|
| 4803 | 1004805 | Documentary | Planet Earth | David Attenborough celebrates the amazing vari... | 4.57 | 13 |
| 9364 | 1009384 | Animation | Attack on Titan The Final Chapters: Special 2 | As Eren ruthlessly tramples every inch of the ... | 4.4 | 13 |
| 14498 | 1014524 | Animation | Attack on Titan The Final Chapters: Special 1 | As The Rumbling advances, the remaining member... | 4.36 | 13 |
| 2427 | 1002429 | Animation | Death Note | Light Yagami is an ace student with great pros... | 4.35 | 13 |
| 17294 | 1017330 | Music | SUGA: Road to D-DAY | SUGA of the world-renowned group, BTS, works o... | 4.35 | 13 |
| 7387 | 1007400 | Animation | Primal | A caveman forms a bond with a dinosaur as they... | 4.33 | 13 |
| 1623 | 1001625 | Drama | All That Jazz | Joe Gideon is at the top of the heap, one of t... | 4.3 | 13 |
| 1320 | 1001321 | Drama | La Dolce Vita | Journalist and man-about-town Marcello struggl... | 4.27 | 13 |
| 27401 | 1027473 | Documentary | Human Planet | A cinematic experience bringing you the most a... | 4.27 | 13 |
| 7028 | 1007040 | Drama | Edvard Munch | Edvard Munch's childhood is overshadowed by de... | 4.26 | 13 |


In [62]:
X_train_clustered.loc[0]['cluster']

13

In [31]:
# np.save('cluster_centers.npy', centers) # Centroids
np.save('vectorized_data.npy', X_lsa) # Data as 'points'
# X_train_clustered.to_csv('clustered_movies.csv') # Data as presented to user

In [None]:
# TODO
# Export clustered data
# Given a block of text, convert it into an N_COMPONENTS-dimensional LSA vector
# Compute euclidean distance for each centroid
# Return cluster with minimum distance
# Given the returned cluster, shoot out all other movies in the same cluster
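The centroid-lookup part of this plan can be sketched with NumPy alone. The helper name `nearest_cluster` and the toy centroids are hypothetical; the explicit shape check is an addition meant to surface dimension mismatches early instead of leaving them to broadcasting.

```python
import numpy as np

def nearest_cluster(vector, centers):
    """Return the index of the centroid closest to `vector` and all distances."""
    vector = np.asarray(vector, dtype=float).reshape(1, -1)
    centers = np.asarray(centers, dtype=float)
    # Fail loudly on a dimension mismatch instead of relying on broadcasting
    if vector.shape[1] != centers.shape[1]:
        raise ValueError(
            f"vector has {vector.shape[1]} dims, centroids have {centers.shape[1]}"
        )
    # Euclidean distance from the vector to every centroid
    distances = np.linalg.norm(centers - vector, axis=1)
    return int(np.argmin(distances)), distances

# Toy centroids in 3 dimensions (illustrative values)
toy_centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]])
cluster_id, dists = nearest_cluster([0.9, 1.2, 1.0], toy_centers)

print(cluster_id)  # 1
```

In the notebook, the input vector would presumably come from calling `transform` (not `fit_transform`) on the already fitted vectorizer and `lsa` pipeline, so that it has the same number of components as the centroids.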

In [53]:
from input_processor import processInput, computeDistances

text = "Dr. Kenzou Tenma, an elite neurosurgeon recently engaged to his hospital director's daughter, is well on his way to ascending the hospital hierarchy. That is until one night, a seemingly small event changes Dr. Tenma's life forever. While preparing to perform surgery on someone, he gets a call from the hospital director telling him to switch patients and instead perform life-saving brain surgery on a famous performer. His fellow doctors, fiancée, and the hospital director applaud his accomplishment; but because of the switch, a poor immigrant worker is dead, causing Dr. Tenma to have a crisis of conscience. So when a similar situation arises, Dr. Tenma stands his ground and chooses to perform surgery on the young boy Johan Liebert instead of the town's mayor. Unfortunately, this choice leads to serious ramifications for Dr. Tenma—losing his social standing being one of them. However, with the mysterious death of the director and two other doctors, Dr. Tenma's position is restored. With no evidence to convict him, he is released and goes on to attain the position of hospital director. Nine years later when Dr. Tenma saves the life of a criminal, his past comes back to haunt him—once again, he comes face to face with the monster he operated on. He must now embark on a quest of pursuit to make amends for the havoc spread by the one he saved."
# Use a new name so the training matrix `X_lsa` is not overwritten
input_lsa, cluster = processInput(text)
recommendations = computeDistances(input_lsa, cluster) # clustered_movies.loc[clustered_movies['cluster'] == cluster].sort_values('rating', ascending=False).head(10)

input_lsa

(1, 90)


ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (100,)  and requested shape (1,90)

In [51]:
dist = np.linalg.norm(np.array([1]) - np.array([1, 2, 3]))

print(dist)

2.23606797749979


## Results and Discussion

---