In [13]:
import pandas as pd
import re
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

In [14]:
path = "movies_data.csv"
movies_data = pd.read_csv(path)

In [15]:
movies_data.head(2)

Unnamed: 0,ID,Movie Title,Description,Director,Star Rating,Critic Review 1,Critic Review 2,Critic Review 3,Synopsis,Year
0,0,Arctic Chuckles,Penguins trying stand-up comedy to uplift spir...,Sofia Mendoza,4.0,An endearing and hilarious animation that both...,Mendoza showcases that humor is truly universa...,"Pure joy from start to finish, it's the feel-g...","In the heart of the frosty Arctic, where the s...",1974
1,1,Ballad of the Lonely Lighthouse,A reclusive lighthouse keeper's life is illumi...,Dmitri Ivanov,4.9,Ivanov’s storytelling brilliance shines as bri...,"A touching tale of isolation, connection, and ...","Between the vast sea and towering lighthouse, ...","In a remote coastal town, atop a craggy cliff ...",1963


In [16]:

movies_data.shape

(120, 10)

In [17]:
movies_data.isnull().sum()

ID                 0
Movie Title        0
Description        0
Director           0
Star Rating        0
Critic Review 1    0
Critic Review 2    0
Critic Review 3    0
Synopsis           0
Year               0
dtype: int64

In [22]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Movie Title      120 non-null    object 
 1   Description      120 non-null    object 
 2   Director         120 non-null    object 
 3   Star Rating      120 non-null    float64
 4   Critic Review 1  120 non-null    object 
 5   Critic Review 2  120 non-null    object 
 6   Critic Review 3  120 non-null    object 
 7   Synopsis         120 non-null    object 
 8   Year             120 non-null    int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 8.6+ KB


In [None]:
movies_data= movies_data.drop(['ID'], axis=1)

In [None]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Movie Title      120 non-null    object 
 1   Description      120 non-null    object 
 2   Director         120 non-null    object 
 3   Star Rating      120 non-null    float64
 4   Critic Review 1  120 non-null    object 
 5   Critic Review 2  120 non-null    object 
 6   Critic Review 3  120 non-null    object 
 7   Synopsis         120 non-null    object 
 8   Year             120 non-null    int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 8.6+ KB


In [24]:
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r'[^A-Za-z\s]', '', text)
        text = text.lower()
    else:
        text = ''
    return text

In [30]:
# First, fill any NaN values to avoid issues when concatenating
movies_data.fillna('', inplace=True)

# Define the columns you want to merge
text_columns = [
    'Movie Title', 'Description', 'Director', 'Star Rating',
    'Critic Review 1', 'Critic Review 2', 'Critic Review 3',
    'Synopsis', 'Year'
]

# Convert all values to string and join them into one column
movies_data['document'] = movies_data[text_columns].astype(str).agg(' '.join, axis=1)



In [26]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Error while downloading from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/model.safetensors: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Read timed out.
Trying to resume download...


In [31]:
embeddings = model.encode(movies_data['document'].values)

In [32]:
embeddings = np.array(embeddings)

In [33]:
np.save('embedding_data.npy',embeddings)

In [34]:
embeddings = np.load('embedding_data.npy')

In [35]:
dimensions = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimensions)

In [36]:
faiss_index.add(embeddings)

In [37]:
faiss.write_index(faiss_index, 'faiss_index.faiss')

In [38]:
def get_similar_movies(query,model,faiss_index,count=5):
    query_embeddings = model.encode([query])
    distance, indices = faiss_index.search(query_embeddings,count)

    for i in range(count):
        print(f"Movies {i+1}")
        print(f"Distance: {distance[0][1]}")
        print(movies_data['document'].iloc[indices[0][1]])

In [39]:
get_similar_movies("comedy",model,faiss_index,2)

Movies 1
Distance: 1.2223610877990723
Robo's First Laugh An AI designed for serious tasks starts experiencing humor, leading to unexpected and comical situations. Marco Bianchi 4.2 Bianchi crafts a world where machines challenge our understanding of emotion and humor. A delightful watch.  At the intersection of circuits and chuckles, this film shines with genuine comedic brilliance. It's rare to find a film that combines cutting-edge tech with gut-busting humor so seamlessly.  In the bustling city of Technoville, the future has arrived. Skyscrapers touch the heavens, hovercars roam the streets, and robots are an integral part of daily life. Among them is R-421, nicknamed “Robo” by his colleagues, the latest AI designed by the prodigious Dr. Elena Clark. Robo's primary function is to assist in serious tasks: managing city infrastructure, decoding complex algorithms, and ensuring safety protocols.

One day, while undergoing a routine software update at Dr. Clark's lab, a glitch occurs. A