Take Home Exam by **Richmond Tetteh Djwerter** and **Kwaku Adou**

# **ABSTRACT**



This Multi-Modal Movie Recommendation System leverages a combination of structured numerical features and deep text embeddings to provide accurate and personalized movie recommendations. Unlike traditional recommender systems that rely solely on user ratings or metadata, this model integrates numerical attributes (such as popularity and ratings) with semantic information extracted from movie overviews.

The system precomputes two representations for each movie: normalized numeric features and text embeddings produced by the SentenceTransformer model (all-MiniLM-L6-v2). At inference time, it calculates cosine similarity within each modality, combines the two scores, and returns a ranked list of the most relevant movie suggestions.

The use of a precomputed embedding cache ensures fast query response times, making this approach suitable for real-time applications such as movie streaming platforms and content recommendation engines. The system is deployed as an interactive Gradio application, allowing users to input a movie title and receive a list of recommended movies.

# **Model Card: Autoencoder-Based Multi-Modal Movie Recommendation System**

**Model Overview**

This model is a multi-modal movie recommendation system that suggests movies based on both numeric metadata (e.g., popularity, average rating) and semantic similarity (based on movie descriptions). The combination of structured and unstructured data allows for more context-aware recommendations.

**Intended Use**

**Primary Purpose:** Recommends similar movies based on user input.

**Target Users:**

- Movie enthusiasts
- Streaming platforms (Netflix, Disney+, etc.)
- AI researchers in recommendation systems

**Inputs:**

- **Movie Title:** Finds movies with similar attributes and descriptions.
- **Number of Recommendations:** Defines how many similar movies to return.

**Outputs:** A ranked list of recommended movies.

**Model Architecture**

The system consists of three primary components:

**Numeric Feature Processing:**

- Uses popularity and vote average as key numerical features.
- Data is normalized using StandardScaler for similarity computation.

**Text-Based Feature Extraction:**

- Uses SentenceTransformer (all-MiniLM-L6-v2) to encode movie overviews into 384-dimensional vectors.
- Embeddings are precomputed and cached for fast retrieval.
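
As a rough illustration of the normalization step in the numeric component, the sketch below standardizes two feature columns with plain NumPy. It mirrors the default behaviour of `StandardScaler` (subtract the column mean, divide by the population standard deviation). The helper name `standardize` and the sample values are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def standardize(features: np.ndarray) -> np.ndarray:
    """Z-score each column: (x - mean) / std, using the population std (ddof=0)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (features - mean) / std

# Hypothetical rows of [vote_average, popularity]; not taken from the dataset.
raw = np.array([[8.4, 84.0],
                [7.6, 80.0],
                [6.5, 20.0],
                [5.5, 16.0]])
scaled = standardize(raw)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Standardizing first keeps a feature with a large raw range, such as popularity, from dominating the similarity score simply because of its scale.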

**Similarity Computation:**

- Uses cosine similarity on both numeric and text features.
- A weighted combination of both similarities determines final recommendations.
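
The weighted combination can be sketched as follows: for a query movie and a candidate, the final score is `w_num * cos(numeric) + w_text * cos(text)`. The NumPy version below uses random stand-in vectors instead of real features. The function names are hypothetical, and the equal 0.5/0.5 weights are simply the defaults used later in this notebook.

```python
import numpy as np

def cosine_to_all(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and every row of a matrix."""
    query_norm = np.linalg.norm(query)
    row_norms = np.linalg.norm(matrix, axis=1)
    # A small epsilon avoids division by zero for all-zero vectors.
    return (matrix @ query) / (row_norms * query_norm + 1e-12)

def combined_scores(num_q, num_all, text_q, text_all, w_num=0.5, w_text=0.5):
    """Weighted sum of the numeric and text cosine similarities."""
    return (w_num * cosine_to_all(num_q, num_all)
            + w_text * cosine_to_all(text_q, text_all))

rng = np.random.default_rng(0)
num_all = rng.normal(size=(5, 2))     # stand-in for scaled numeric features
text_all = rng.normal(size=(5, 384))  # stand-in for 384-dimensional text embeddings
scores = combined_scores(num_all[0], num_all, text_all[0], text_all)
ranking = np.argsort(-scores)         # best match first
print(ranking[0])  # 0: the query row is most similar to itself
```

In the notebook the query movie is removed from the candidate set before ranking, so it never recommends itself; the sketch keeps it only to make the expected top result obvious.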

**Dataset Used**  
- **Source:** TMDB Movies Dataset 2023 (Kaggle)  
- **Size:** About **930,000 movie entries**

**Performance & Optimization**

**Fast Inference:**
- Uses precomputed embeddings for text features to avoid API calls and reduce runtime.
- Numerical data is vectorized and normalized for efficient similarity search.
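
A load-or-compute cache of this kind might look like the sketch below. The helper `load_or_compute` and the toy `expensive_embeddings` function are hypothetical names used only for illustration; in the notebook the expensive step is the SentenceTransformer encoding, and the cache is a pickle file.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached object if the file exists, otherwise compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_embeddings():
    # Placeholder for the real embedding computation.
    calls.append(1)
    return {"inception": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_cache.pkl")
    first = load_or_compute(path, expensive_embeddings)   # computes and writes the file
    second = load_or_compute(path, expensive_embeddings)  # served from the cache

print(len(calls))  # 1: the expensive function ran only once
```

One caveat of this pattern is that a stale cache is silently reused, so the file should be deleted whenever the movie sample or the embedding model changes.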

**Batch Processing:**
- Text embeddings are generated in batches of 512 for optimized memory usage.
- Parallelized processing enables embedding computation in under 15 minutes for 50K movies.
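
The batching idea itself is simple and can be shown without any model. In this minimal sketch, `iter_batches` is a hypothetical helper and the texts are placeholders; each yielded slice would be passed to the encoder in turn.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"overview {i}" for i in range(1200)]  # placeholder texts
batches = list(iter_batches(texts, batch_size=512))
print([len(b) for b in batches])  # [512, 512, 176]
```

Only one batch of texts and embeddings needs to be held by the encoder at a time, which is what bounds peak memory.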

**Performance Metrics**  
- **Mean Squared Error (MSE):** Evaluated reconstruction loss of the autoencoder.  
- **Cosine Similarity:** Measures recommendation accuracy based on similarity scores.  

**Limitations & Risks**  
- **Limited Image Understanding:** Movie posters are **not yet integrated** into recommendations.  
- **Bias in Data:** Popular movies may dominate recommendations due to dataset imbalances.  
- **Cold Start Problem:** New movies without sufficient metadata may not be well recommended.  

**Deployment**  
- **Gradio App:** Provides an interactive UI for testing recommendations.  

In [None]:
import time
start_time = time.time()

In [None]:
!pip install kaggle
!pip install opendatasets
!pip install gradio


In [None]:
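from google.colab import userdata
# The secrets are looked up here, but the return values are not stored,
# so opendatasets still prompts for the Kaggle username and key below.
userdata.get('KAGGLE_USER')
userdata.get('KAGGLE_KEY')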

import opendatasets as od
od.download("https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: djwerterrichmond
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
Downloading tmdb-movies-dataset-2023-930k-movies.zip to ./tmdb-movies-dataset-2023-930k-movies


100%|██████████| 211M/211M [00:02<00:00, 98.5MB/s]





In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
main_df = pd.read_csv('/content/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv')
display(main_df.head())

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


In [None]:
df = main_df[main_df['vote_average']!=0]
df.reset_index(inplace=True)
display(df.shape)

(351919, 25)

# Feature Selection

In [None]:
df.columns

Index(['index', 'id', 'title', 'vote_average', 'vote_count', 'status',
       'release_date', 'revenue', 'runtime', 'adult', 'backdrop_path',
       'budget', 'homepage', 'imdb_id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'tagline', 'genres',
       'production_companies', 'production_countries', 'spoken_languages',
       'keywords'],
      dtype='object')

In [None]:
df = df.drop( ['id' , 'vote_count' , 'status' , 'release_date', 'revenue' , 'backdrop_path',
              'budget','homepage','imdb_id','original_title','overview','poster_path',
              'tagline' , 'production_companies','production_countries' ,'spoken_languages' ,'keywords'], axis=1)

df['org_title']=df['title']

In [None]:
df.isna().sum()

Unnamed: 0,0
index,0
title,0
vote_average,0
runtime,0
adult,0
original_language,0
popularity,0
genres,59762
org_title,0


In [None]:
df['genres'] = df['genres'].fillna('unknown')

df.isna().sum()

Unnamed: 0,0
index,0
title,0
vote_average,0
runtime,0
adult,0
original_language,0
popularity,0
genres,0
org_title,0


In [None]:
print(df.duplicated().sum())

0


In [None]:
display(df.head())

Unnamed: 0,index,title,vote_average,runtime,adult,original_language,popularity,genres,org_title
0,0,Inception,8.364,148,False,en,83.952,"Action, Science Fiction, Adventure",Inception
1,1,Interstellar,8.417,169,False,en,140.241,"Adventure, Drama, Science Fiction",Interstellar
2,2,The Dark Knight,8.512,152,False,en,130.643,"Drama, Action, Crime, Thriller",The Dark Knight
3,3,Avatar,7.573,162,False,en,79.932,"Action, Adventure, Fantasy, Science Fiction",Avatar
4,4,The Avengers,7.71,143,False,en,98.082,"Science Fiction, Action, Adventure",The Avengers


In [None]:
dff= df.copy()

# MultiLabel Encoder

In [None]:
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

genre_l = dff['genres'].apply(lambda x: x.split(','))
genre_l = pd.DataFrame(genre_l)

In [None]:
genre_l

Unnamed: 0,genres
0,"[Action, Science Fiction, Adventure]"
1,"[Adventure, Drama, Science Fiction]"
2,"[Drama, Action, Crime, Thriller]"
3,"[Action, Adventure, Fantasy, Science Fiction]"
4,"[Science Fiction, Action, Adventure]"
...,...
351914,[unknown]
351915,[Comedy]
351916,"[Documentary, TV Movie, War]"
351917,[unknown]


In [None]:
genre_l['genres'] = genre_l['genres'].apply(lambda x :[ y.strip().lower().replace(' ','') for y in x] )

In [None]:
MLB = MultiLabelBinarizer()

genre_encoded = MLB.fit_transform(genre_l['genres'])

genre_encoded

array([[1, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])
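
For readers unfamiliar with multi-label binarization, the following dependency-free sketch reproduces the idea behind `MultiLabelBinarizer`: collect the sorted set of labels, then emit one 0/1 column per label. The helper name `multi_hot` is hypothetical. The three genre lists are the first three rows of the dataset after lowercasing and removing spaces.

```python
def multi_hot(label_lists):
    """Return (classes, rows) where each row has a 1 for every label present."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows

genres = [
    ["action", "sciencefiction", "adventure"],
    ["adventure", "drama", "sciencefiction"],
    ["drama", "action", "crime", "thriller"],
]
classes, encoded = multi_hot(genres)
print(classes)     # ['action', 'adventure', 'crime', 'drama', 'sciencefiction', 'thriller']
print(encoded[0])  # [1, 1, 0, 0, 1, 0]
```

Unlike one-hot encoding, a row may contain several ones, which is why this encoder is used for genres while `OneHotEncoder` is used for single-valued columns.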

In [None]:
genre_encoded_df = pd.DataFrame(genre_encoded, columns=MLB.classes_)
genre_encoded_df=genre_encoded_df.reset_index()

mod_df = dff.drop(['genres'],axis=1)
mod_df=mod_df.reset_index()

df = pd.concat([mod_df,genre_encoded_df],axis=1).drop('index',axis=1)
df.head()

Unnamed: 0,level_0,title,vote_average,runtime,adult,original_language,popularity,org_title,action,adventure,...,horror,music,mystery,romance,sciencefiction,thriller,tvmovie,unknown,war,western
0,0,Inception,8.364,148,False,en,83.952,Inception,1,1,...,0,0,0,0,1,0,0,0,0,0
1,1,Interstellar,8.417,169,False,en,140.241,Interstellar,0,1,...,0,0,0,0,1,0,0,0,0,0
2,2,The Dark Knight,8.512,152,False,en,130.643,The Dark Knight,1,0,...,0,0,0,0,0,1,0,0,0,0
3,3,Avatar,7.573,162,False,en,79.932,Avatar,1,1,...,0,0,0,0,1,0,0,0,0,0
4,4,The Avengers,7.71,143,False,en,98.082,The Avengers,1,1,...,0,0,0,0,1,0,0,0,0,0


# Feature Engineering

In [None]:
df['title'] = df['title'].apply(lambda x :x.strip().lower().replace(' ','') )
df['original_language'] = df['original_language'].apply(lambda x :x.strip().lower().replace(' ','') )

df.head()

Unnamed: 0,level_0,title,vote_average,runtime,adult,original_language,popularity,org_title,action,adventure,...,horror,music,mystery,romance,sciencefiction,thriller,tvmovie,unknown,war,western
0,0,inception,8.364,148,False,en,83.952,Inception,1,1,...,0,0,0,0,1,0,0,0,0,0
1,1,interstellar,8.417,169,False,en,140.241,Interstellar,0,1,...,0,0,0,0,1,0,0,0,0,0
2,2,thedarkknight,8.512,152,False,en,130.643,The Dark Knight,1,0,...,0,0,0,0,0,1,0,0,0,0
3,3,avatar,7.573,162,False,en,79.932,Avatar,1,1,...,0,0,0,0,1,0,0,0,0,0
4,4,theavengers,7.71,143,False,en,98.082,The Avengers,1,1,...,0,0,0,0,1,0,0,0,0,0


In [None]:
# Keep five selected languages and bucket every other language as 'else'.
kept_languages = ['en', 'fr', 'es', 'de', 'ja']
df.loc[~df['original_language'].isin(kept_languages), 'original_language'] = 'else'
df['original_language'].unique()

array(['en', 'else', 'fr', 'ja', 'es', 'de'], dtype=object)

# BERT Sentence Transformer

In [None]:
userdata.get('HF_TOKEN')

In [None]:
from sentence_transformers import SentenceTransformer

bert_model = SentenceTransformer('all-MiniLM-L6-v2')


# One-Hot Encoding

In [None]:
OHE = OneHotEncoder(sparse_output=False)

In [None]:
df['adult'] = df['adult'].astype('str')
adult_enc = OHE.fit_transform(df[['adult']])
adult_enc_df = pd.DataFrame(adult_enc,columns=OHE.get_feature_names_out())

In [None]:
adult_enc_df = adult_enc_df.drop('adult_True',axis=1)

In [None]:
lang_enc = OHE.fit_transform(df[['original_language']])
lang_enc_df = pd.DataFrame(lang_enc,columns=OHE.get_feature_names_out())

In [None]:
mod_df = df.drop(['adult','original_language'],axis=1)

In [None]:
df = pd.concat([mod_df,adult_enc_df,lang_enc_df],axis=1)

In [None]:
df.head()

Unnamed: 0,level_0,title,vote_average,runtime,popularity,org_title,action,adventure,animation,comedy,...,unknown,war,western,adult_False,original_language_de,original_language_else,original_language_en,original_language_es,original_language_fr,original_language_ja
0,0,inception,8.364,148,83.952,Inception,1,1,0,0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1,interstellar,8.417,169,140.241,Interstellar,0,1,0,0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,thedarkknight,8.512,152,130.643,The Dark Knight,1,0,0,0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,3,avatar,7.573,162,79.932,Avatar,1,1,0,0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,4,theavengers,7.71,143,98.082,The Avengers,1,1,0,0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


# Normalization

In [None]:
from sklearn.preprocessing import StandardScaler
SC = StandardScaler()
df_norm = SC.fit_transform(df.drop(['title','org_title',],axis=1))
df_norm_df = pd.DataFrame(df_norm, columns=[x for x in df.columns if x not in ['title', 'org_title']])
df = pd.concat([df[['title','org_title']],df_norm_df],axis=1)
df.head()

Unnamed: 0,title,org_title,level_0,vote_average,runtime,popularity,action,adventure,animation,comedy,...,unknown,war,western,adult_False,original_language_de,original_language_else,original_language_en,original_language_es,original_language_fr,original_language_ja
0,inception,Inception,-1.732046,1.140673,1.198285,6.080628,3.469359,4.803235,-0.254726,-0.524382,...,-0.452277,-0.13278,-0.113835,0.244871,-0.214372,-0.589887,0.962388,-0.272946,-0.275521,-0.201109
1,interstellar,Interstellar,-1.732036,1.167532,1.520933,10.287577,-0.288238,4.803235,-0.254726,-0.524382,...,-0.452277,-0.13278,-0.113835,0.244871,-0.214372,-0.589887,0.962388,-0.272946,-0.275521,-0.201109
2,thedarkknight,The Dark Knight,-1.732026,1.215675,1.259741,9.570238,3.469359,-0.208193,-0.254726,-0.524382,...,-0.452277,-0.13278,-0.113835,0.244871,-0.214372,-0.589887,0.962388,-0.272946,-0.275521,-0.201109
3,avatar,Avatar,-1.732016,0.73982,1.413383,5.78018,3.469359,4.803235,-0.254726,-0.524382,...,-0.452277,-0.13278,-0.113835,0.244871,-0.214372,-0.589887,0.962388,-0.272946,-0.275521,-0.201109
4,theavengers,The Avengers,-1.732007,0.809247,1.121464,7.136681,3.469359,4.803235,-0.254726,-0.524382,...,-0.452277,-0.13278,-0.113835,0.244871,-0.214372,-0.589887,0.962388,-0.272946,-0.275521,-0.201109


# Handling Duplicates

In [None]:
df = df.drop_duplicates(subset=['title'])
df=df.set_index(['title'])
df_fin=df.drop(['org_title'],axis=1)

# Cosine Similarity

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

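# 70/30 train/test split, then a further 70/30 train/validation split of the training part.
X_train, X_test = train_test_split(df_fin, test_size=0.3, random_state=42)
X_train, X_val = train_test_split(X_train, test_size=0.3, random_state=42)

# Dense autoencoder: compress the features to a 32-unit bottleneck, then reconstruct them.
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(X_train.shape[1]))  # linear output with the same width as the input

# Reconstruction quality is measured with mean squared error.
model.compile(optimizer='adam', loss='mean_squared_error')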



In [None]:
model.fit(X_train, X_train, epochs=20, batch_size=64, validation_data=(X_val, X_val))

X_test_pred = model.predict(X_test)

Epoch 1/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.2311 - val_loss: 0.0043
Epoch 2/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - loss: 0.0090 - val_loss: 0.0011
Epoch 3/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 11s 3ms/step - loss: 0.0102 - val_loss: 0.0010
Epoch 4/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - loss: 0.0023 - val_loss: 0.0012
Epoch 5/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - loss: 0.0032 - val_loss: 0.0027
Epoch 6/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - loss: 0.0013 - val_loss: 6.7422e-04
Epoch 7/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - loss: 0.0018 - val_loss: 0.0195
Epoch 8/20
2369/2369 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - loss: 0.0057 - val_loss: 4.7411e-04
Epoch 9/20


In [None]:
test_mse = mean_squared_error(X_test, X_test_pred)
#cosine_sim = cosine_similarity(X_test_pred, X_test)
#avg_cosine_sim = np.mean(cosine_sim)

In [None]:
print(f"Mean Squared Error (MSE) on the test set: {test_mse}")
#print(f"Average Cosine Similarity on the test set: {avg_cosine_sim}")

Mean Squared Error (MSE) on the test set: 0.000338031561113894


**The Mean Squared Error (MSE) of 0.000338** on the test set corresponds to a root mean squared error (RMSE) of about 0.0184, since the square root of 0.000338 is approximately 0.0184. Because the features were standardized, this means the autoencoder's reconstructions differ from the original values by roughly 0.018 standard deviations in the root-mean-square sense. This is a small reconstruction error. Note that it measures how closely the autoencoder reproduces its inputs, not the quality of the recommendations themselves.
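
The relationship between the reported MSE and its square root can be checked directly. In the sketch below, the `mse` helper and the toy arrays are made up for illustration; only `test_mse` is the value printed by the notebook.

```python
import math

import numpy as np

def mse(actual, predicted):
    """Mean of the squared element-wise differences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(diff ** 2))

# Toy check with made-up numbers: one of four entries is off by 0.2.
toy_mse = mse([[0.0, 1.0], [2.0, 3.0]], [[0.0, 1.0], [2.0, 3.2]])
print(round(toy_mse, 4))             # 0.01
print(round(math.sqrt(toy_mse), 4))  # 0.1

# Value printed by the notebook for the held-out test set.
test_mse = 0.000338031561113894
rmse = math.sqrt(test_mse)
print(round(rmse, 4))                # 0.0184
```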

In [None]:
movie_name = 'the dark knight'
movie_name=movie_name.strip().lower().replace(' ','')
new_df= df_fin.loc[[movie_name]]
new_df = new_df.values.reshape(1,-1)

df_other = df_fin.loc[df_fin.index!=movie_name,:]
df_titles = df.loc[df.index!=movie_name,'org_title']
cosine_sim_matrix = cosine_similarity(new_df,df_other)
cosine_sim_df = pd.DataFrame(cosine_sim_matrix,index=[movie_name],columns=df_titles)

cosine_sim_df

org_title,Inception,Interstellar,Avatar,The Avengers,Deadpool,Avengers: Infinity War,Fight Club,Guardians of the Galaxy,Pulp Fiction,Forrest Gump,...,Lethal Attractions,Charli XCX Live in Chicago,The Man who Loves Hebrew,Henry Kissinger: Secrets of a Superpower,"Jessico, Una Historia de Rock en Tiempos Convulsos",Uncle Elephant,Night Snorkeling,Yankee Lady,Kizu: The Untold Story of Unit 731,Dastaan
thedarkknight,0.59248,0.675678,0.505434,0.632574,0.630235,0.734473,0.779742,0.353362,0.887801,0.701335,...,-0.100238,-0.123889,-0.249124,-0.10666,-0.118395,-0.136751,-0.151717,-0.168196,-0.109211,-0.198233


In [None]:
sorted_row = cosine_sim_df.loc[movie_name].sort_values(ascending=False)[0:20]

In [None]:
sorted_row.index[5]

'The Equalizer 2'

# Deployment

In [None]:
import gradio as gr

def predict(movie_name, no_movies):
    """Return the titles most similar to `movie_name`, one per line."""
    # Normalize the query the same way the index was built (lowercase, no spaces).
    movie_name = movie_name.strip().lower().replace(' ', '')
    if movie_name not in df_fin.index:
        return "Sorry, this movie isn't in our database.\nTry another one!"
    try:
        no_movies = int(no_movies)
    except ValueError:
        return "Invalid number of recommendations."

    query = df_fin.loc[[movie_name]].values.reshape(1, -1)
    df_other = df_fin.loc[df_fin.index != movie_name, :]
    df_titles = df.loc[df.index != movie_name, 'org_title']
    # Cosine similarity between the query movie and every other movie.
    cosine_sim_matrix = cosine_similarity(query, df_other)
    cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=[movie_name], columns=df_titles)
    sorted_row = cosine_sim_df.loc[movie_name].sort_values(ascending=False)[0:no_movies]
    # Joining the index avoids an IndexError when fewer rows than requested exist.
    return '\n'.join(sorted_row.index)

interface = gr.Interface(
    fn=predict,
    inputs=[gr.Textbox(label="Movie Name : "),
            gr.Textbox(label='No.of Recommendations: ',value= '5')],
    outputs=gr.Textbox(label="Recommendations : "),
    examples = [["the dark knight",6], ["inception",3]]
)
interface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://a625a8cfb74dcaddad.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
interface.close()

Closing server running on port: 7860


# Multi-Modal Integration

In [None]:
userdata.get('OPENAI_API_KEY')

In [None]:
import os
import pickle
from tqdm import tqdm

main_df = pd.read_csv('/content/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv')
main_df = main_df.sort_values(by='popularity', ascending=False)
sample_df = main_df.head(50000).copy()

In [None]:
display(sample_df.head())

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
3869,565770,Blue Beetle,7.139,1023,Released,2023-08-16,124818235,128,False,/1syW9SNna38rSl9fnXwc9fP7POW.jpg,...,Blue Beetle,Recent college grad Jaime Reyes returns home f...,2994.357,/mXLOHHc1Zeuwsl4xYKjKh2280oL.jpg,Jaime Reyes is a superhero whether he likes it...,"Action, Science Fiction, Adventure","Warner Bros. Pictures, The Safran Company, DC ...",United States of America,"English, Portuguese, Spanish","armor, superhero, family relationships, family..."
5048,980489,Gran Turismo,8.068,702,Released,2023-08-09,114800000,135,False,/xFYpUmB01nswPgbzi8EOCT1ZYFu.jpg,...,Gran Turismo,The ultimate wish-fulfillment tale of a teenag...,2680.593,/51tqzRtKMMZEYUpSYkrUE7v9ehm.jpg,From gamer to racer.,"Action, Drama, Adventure","PlayStation Productions, 2.0 Entertainment, Co...",United States of America,"English, German, Japanese","based on true story, racing, based on video ga..."
51198,754720,A Female Boss with Big Tits and Her Cherry Boy...,9.0,19,Released,2020-01-16,0,120,True,/mWFn8XFCPOuv0vqVjekPscWSSGZ.jpg,...,巨乳上司と童貞部下が出張先の相部屋ホテルで…いたずら誘惑を真に受けた部下が10発射精の絶倫性...,Yuzuru is this clumsy permavirgin employee who...,2020.286,/upVlyO6Zc5AecmTNv0XZ1oiryZf.jpg,,Drama,S1 NO. 1 STYLE,Japan,Japanese,"cheating, office, big tits, unfaithful wife"
7924,968051,The Nun II,6.545,365,Released,2023-09-06,231200000,110,False,/53z2fXEKfnNg2uSOPss2unPBGX1.jpg,...,The Nun II,"In 1956 France, a priest is violently murdered...",1692.778,/c9kVD7W8CT5xe4O3hQ7bFWwk68U.jpg,Confess your sins.,"Horror, Mystery, Thriller","New Line Cinema, Atomic Monster, The Safran Co...",United States of America,"English, French","france, bullying, sequel, religion, demon, got..."
2130,615656,Meg 2: The Trench,6.912,2034,Released,2023-08-02,384056482,116,False,/5mzr6JZbrqnqD8rCEvPhuCE5Fw2.jpg,...,Meg 2: The Trench,An exploratory dive into the deepest depths of...,1567.273,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,Back for seconds.,"Action, Science Fiction, Horror","Apelles Entertainment, Warner Bros. Pictures, ...","China, United States of America",English,"based on novel or book, sequel, shark, kaiju, ..."


In [None]:
overview_df = sample_df[['title', 'overview']].copy()
overview_df['title_processed'] = overview_df['title'].astype(str).apply(lambda x: x.strip().lower().replace(' ', ''))
overview_df = overview_df.drop_duplicates(subset=['title_processed']).set_index('title_processed')

numeric_features = sample_df[['vote_average', 'popularity']].copy()
numeric_features['title_processed'] = sample_df['title'].astype(str).apply(lambda x: x.strip().lower().replace(' ', ''))
numeric_features = numeric_features.drop_duplicates(subset=['title_processed']).set_index('title_processed')

scaler = StandardScaler()
df_fin = pd.DataFrame(scaler.fit_transform(numeric_features),
                      index=numeric_features.index,
                      columns=numeric_features.columns)

In [None]:
EMBEDDING_CACHE_FILE = 'local_text_embeddings_cache.pkl'
model = SentenceTransformer('all-MiniLM-L6-v2')

if os.path.exists(EMBEDDING_CACHE_FILE):
    with open(EMBEDDING_CACHE_FILE, 'rb') as f:
        text_embeddings = pickle.load(f)
    print("Loaded text embeddings from cache.")
else:
    text_embeddings = {}
    movie_titles = list(overview_df.index)
    # Process in batches using the model's encode method, which is highly optimized.
    batch_size = 512
    for i in tqdm(range(0, len(movie_titles), batch_size), desc="Computing text embeddings"):
        batch_titles = movie_titles[i:i+batch_size]
        batch_texts = ["" if pd.isna(overview_df.loc[m, 'overview']) or overview_df.loc[m, 'overview'].strip() == ""
                       else overview_df.loc[m, 'overview'] for m in batch_titles]
        batch_embeddings = model.encode(batch_texts, show_progress_bar=False)
        for title, emb in zip(batch_titles, batch_embeddings):
            text_embeddings[title] = emb
    with open(EMBEDDING_CACHE_FILE, 'wb') as f:
        pickle.dump(text_embeddings, f)
    print("Text embeddings computed and cached.")


Computing text embeddings:   8%|▊         | 7/91 [02:40<32:08, 22.95s/it]


KeyboardInterrupt: 

In [None]:
def get_combined_similarity(movie_name, top_n=5, weight_numeric=0.5, weight_text=0.5):
    """Rank all other movies by a weighted sum of numeric and text cosine similarity."""
    movie_name_processed = movie_name.strip().lower().replace(' ', '')
    if movie_name_processed not in df_fin.index:
        return "Sorry, this movie isn't in our database.\nTry another one!"

    # Numeric modality: scaled vote_average and popularity.
    numeric_vector = df_fin.loc[movie_name_processed].values.reshape(1, -1)
    numeric_others = df_fin.drop(index=movie_name_processed).values
    numeric_similarity = cosine_similarity(numeric_vector, numeric_others)[0]

    # Text modality: cached overview embeddings, in the same row order as numeric_others.
    if movie_name_processed not in text_embeddings:
        return "Text embedding for the movie is missing. Try another movie."
    text_vector = text_embeddings[movie_name_processed].reshape(1, -1)
    other_movies = [title for title in df_fin.index if title != movie_name_processed]
    text_matrix = np.array([text_embeddings[m] for m in other_movies])
    text_similarity = cosine_similarity(text_vector, text_matrix)[0]

    # Blend the two similarity vectors and keep the top_n highest scores.
    combined_similarity = weight_numeric * numeric_similarity + weight_text * text_similarity
    sim_df = pd.DataFrame({'title': other_movies, 'similarity': combined_similarity})
    sim_df = sim_df.sort_values(by='similarity', ascending=False).head(top_n)

    # Titles are returned in their processed form (lowercase, spaces removed).
    return "\n".join(sim_df['title'].values)


In [None]:
def predict(movie_name, no_movies):
    try:
        no_movies = int(no_movies)
    except ValueError:
        return "Invalid number of recommendations."
    return get_combined_similarity(movie_name, top_n=no_movies)


In [None]:
interface = gr.Interface(
    fn=predict,
    inputs=[gr.Textbox(label="Movie Name : "),
            gr.Textbox(label='No. of Recommendations: ', value='5')],
    outputs=gr.Textbox(label="Recommendations : "),
    examples=[["inception", 3], ["the dark knight", 6]]
)
interface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://50415a90af3c998a81.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
#interface.close()

In [None]:
print("Total Runtime:", time.time() - start_time, "seconds")

Total Runtime: 528.3999252319336 seconds


# **Outlook**
If we had additional time, we would focus on the following improvements:

**Incorporating More Features**

- Adding cast, director, and genre embeddings to further refine recommendations.
- Using movie posters with a vision-based model (e.g., CLIP) to include visual similarity.


**Scaling Up for Larger Datasets**

- Expanding from 50,000 movies to the full 930,000-movie database.
- Using FAISS (Facebook AI Similarity Search) for faster large-scale retrieval.
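
To make the retrieval step concrete, here is a brute-force NumPy stand-in for what a similarity index would do. It is not FAISS: it runs an exact inner-product search over L2-normalized vectors, which is equivalent to ranking by cosine similarity. A flat inner-product FAISS index would, as far as we understand, return the same neighbours, with approximate indexes trading some accuracy for speed. The function name `top_k_cosine` and the random data are assumptions for illustration.

```python
import numpy as np

def top_k_cosine(query, matrix, k):
    """Indices of the k rows of `matrix` most cosine-similar to `query`, best first."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit                 # inner product == cosine on unit vectors
    candidates = np.argpartition(-scores, k - 1)[:k]  # unordered top k without a full sort
    return candidates[np.argsort(-scores[candidates])]

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 384))            # random stand-in for text embeddings
query = embeddings[7] + 0.01 * rng.normal(size=384)  # slightly perturbed copy of row 7
neighbours = top_k_cosine(query, embeddings, k=5)
print(neighbours[0])  # 7
```

Normalizing the embedding matrix once and reusing it across queries would avoid recomputing the row norms on every request, which the current `cosine_similarity` call does each time.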