# Intro
My plan is to use SBERT (Sentence-BERT) to embed movie plots, along with cast, genres, country and other metadata from a movie dataset, and then find similar titles with FAISS (Facebook AI Similarity Search). This is the content-based filtering part. To surface movies that may not be in the dataset, I will add collaborative filtering: recommending titles that users with similar preferences have interacted with, whether or not those titles are in the dataset. The result is a hybrid approach that aims to recommend accurately while still branching out beyond the dataset. Any unseen movies found this way can also be added to the dataset to improve future recommendations.
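
As a rough illustration of the collaborative filtering idea, here is a minimal sketch. It assumes interactions are stored as a mapping from user to a set of titles; the function names, the Jaccard similarity and the toy data are all hypothetical and not part of this notebook's pipeline.

```python
def jaccard(a, b):
    """Overlap between two sets of titles: 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_from_neighbours(target, interactions, n_neighbours=2):
    """Suggest titles that the most similar users watched and the target has not."""
    seen = interactions[target]
    neighbours = [
        (jaccard(seen, titles), user)
        for user, titles in interactions.items()
        if user != target
    ]
    neighbours.sort(reverse=True)  # most similar users first

    scores = {}
    for similarity, user in neighbours[:n_neighbours]:
        if similarity <= 0:
            continue  # ignore users with no overlap at all
        for title in interactions[user] - seen:
            scores[title] = scores.get(title, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical toy data: user -> titles they interacted with.
interactions = {
    "user_a": {"Zodiac", "Zombieland"},
    "user_b": {"Zodiac", "Zombieland", "Title Outside Catalog"},
    "user_c": {"Kota Factory"},
}
print(recommend_from_neighbours("user_a", interactions))
```

Because the suggestions come from other users' histories rather than from the embedded dataset, a title that was never indexed can still be recommended.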

# Importing data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
df = pd.read_csv('drive/MyDrive/netflix_titles.csv')
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [None]:
result = df[df['title'].str.lower().str.contains('demon slayer')]
result

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1385,s1386,TV Show,Demon Slayer: Kimetsu no Yaiba,,"Natsuki Hanae, Akari Kito, Hiro Shimono, Yoshi...",Japan,"January 22, 2021",2019,TV-14,1 Season,"Anime Series, International TV Shows",After a demon attack leaves his family slain a...


In [None]:
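#!pip install fuzzywuzzy
#from fuzzywuzzy import process

# Fuzzy-matching alternative (disabled): rank by match score, then by title length.
# def movie_finder(title):
#   all_titles = df['title'].tolist()
#   matches = process.extract(title, all_titles)
#   matches.sort(key=lambda x: (x[1], len(x[0])), reverse=True)
#   return matches

def movie_finder(title):
  # Case-insensitive literal substring match. regex=False stops characters
  # such as '(' or '+' in the query from being read as a regex pattern.
  result = df[df['title'].str.lower().str.contains(title.lower(), regex=False)]
  if result.empty:
    return None  # no title contains the query
  return result['title'].iloc[0]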

print(movie_finder('squid game'))

Squid Game
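
The substring lookup above only works when the query is spelled exactly as it appears in the title. As a hedged alternative to the disabled fuzzywuzzy code, the sketch below uses `difflib` from the standard library to tolerate typos. The function name `fuzzy_title_matches`, the `cutoff` value and the short title list are assumptions made for illustration.

```python
import difflib


def fuzzy_title_matches(query, titles, n=3, cutoff=0.6):
    """Return up to n titles whose lowercase form is close to the query."""
    # Map lowercase title -> original title so the match is case-insensitive.
    lowered = {}
    for title in titles:
        lowered.setdefault(title.lower(), title)
    close = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[c] for c in close]


# A few titles taken from the dataset preview, used here as a stand-in list.
titles = ["Squid Game", "Zodiac", "Zombieland", "Zoom"]
print(fuzzy_title_matches("sqiud game", titles))  # ['Squid Game']
```

In the notebook this could be called with `df['title'].tolist()` in place of the short list.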


In [None]:
def create_feature_text(row):
    # Build one text string per title for the sentence encoder; missing fields are skipped.
    description = row['description'] if pd.notna(row['description']) else None
    cast = row['cast'] if pd.notna(row['cast']) else None
    director = row['director'] if pd.notna(row['director']) else None  # read but not added to the text
    country = row['country'] if pd.notna(row['country']) else None
    listed_in = row['listed_in'] if pd.notna(row['listed_in']) else None
    rating = row['rating'] if pd.notna(row['rating']) else None
    release_year = row['release_year'] if pd.notna(row['release_year']) else None

    #text = f"Title: {row['title']}. "
    text = ''

    if description:
        text += f"Description (important): {description}. "
    if listed_in:
        # Appended twice, presumably to give genre more weight in the embedding.
        text += f"Genre (important): {listed_in}. "
        text += f"Genre (important): {listed_in}. "
    if cast:
        text += f"Cast: {cast}. "
    if release_year:
        text += f"Year: {release_year}. "
    if country:
        # Appended twice, presumably to give country more weight as well.
        text += f"Country: {country}. "
        text += f"Country: {country}. "
    if rating:
        text += f"Rating: {rating}. "

    return text

df['feature_text'] = df.apply(create_feature_text, axis=1)
df['feature_text'][0]

'Description (important): As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.. Genre (important): Documentaries.Genre (important): Documentaries.Year: 2020. Country: United States. Country: United States. Rating: PG-13. '

In [None]:
missing = df.isna().sum()
missing

Unnamed: 0,0
show_id,0
type,0
title,0
director,2634
cast,825
country,831
date_added,10
release_year,0
rating,4
duration,3


In [None]:
#!pip install faiss-gpu
#!pip install sentence-transformers

import faiss
from sentence_transformers import SentenceTransformer
import requests
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
movie_descriptions = df['feature_text'].tolist()
print(len(movie_descriptions))
print((len(movie_descriptions)-1) // 50 + 1)



8807
177
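
The second number printed above is a ceiling division: 8807 texts in batches of 50 need 177 batches, and the last batch holds only 7 texts. A small sketch of the same arithmetic, with a hypothetical helper name `batch_bounds`:

```python
import math


def batch_bounds(n_items, batch_size):
    """Return (start, end) slice bounds covering n_items in batches of batch_size."""
    num_batches = (n_items - 1) // batch_size + 1  # ceiling division for n_items >= 0
    return [
        (i * batch_size, min((i + 1) * batch_size, n_items))
        for i in range(num_batches)
    ]


bounds = batch_bounds(8807, 50)
print(len(bounds), math.ceil(8807 / 50))  # 177 177
print(bounds[-1])                         # (8800, 8807)
```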


In [None]:
def encode_texts_in_batches(texts, batch_size=50):
    # Encode in fixed-size batches so progress can be reported along the way.
    encoded_texts = []
    num_batches = (len(texts) - 1) // batch_size + 1  # ceiling division
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = min((i + 1) * batch_size, len(texts))
        batch_texts = texts[start_idx:end_idx]
        batch_embeddings = model.encode(batch_texts)  # one embedding vector per text
        encoded_texts.extend(batch_embeddings)
        print(f"Processed batch {i+1}/{num_batches}")
    return encoded_texts

embeddings = encode_texts_in_batches(movie_descriptions)
print(len(embeddings))


In [None]:
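# FAISS expects a 2-D float32 array of shape (n_items, dimension).
embeddings_array = np.array(embeddings, dtype='float32')
print(embeddings_array.shape)

dimension = embeddings_array.shape[1]
# IndexFlatL2 is an exact (brute-force) index; it reports squared L2 distances.
index = faiss.IndexFlatL2(dimension)
index.add(embeddings_array)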

query_text = '''
Title: Aasai Aruvi: The Divine Rift
Description (important): In the ancient Tamil kingdom of Aruvapura, a young fisherman discovers a hidden celestial artifact that grants him extraordinary powers. As he grapples with his newfound abilities, he must unite with a band of misfit warriors to prevent a powerful sorcerer from using dark magic to tear the fabric of reality. The journey will reveal long-buried secrets about their ancestors and test the strength of their alliance in a clash that could reshape their world forever.
Director: Karthik Rajan
Cast: Vikram Ravi, Anjali Menon, Suriya Kumar
Country: India
Listed in: Action, Fantasy, Adventure
'''
query_embedding = model.encode([query_text])
query_embedding_array = np.array(query_embedding)

movie_title = movie_finder('demon slayer')
query_text_2 = df[df['title'] == movie_title]['feature_text'].iloc[0]
query_embedding_2 = np.array(model.encode([query_text_2]))

k = 10
distances, indices = index.search(query_embedding_2, k)
print(distances)
print(indices)

# FAISS returns row positions in the order the vectors were added, so use .iloc.
for i in indices[0]:
    print(df['title'].iloc[i])

(8807, 384)
[[7.6944397e-13 4.3122911e-01 4.5207605e-01 4.6600467e-01 4.8764223e-01
  4.9159360e-01 5.1341462e-01 5.1521713e-01 5.1732409e-01 5.2049071e-01]]
[[1385 3088 2180 5096 3173  843 7088   59 3696 5092]]
Demon Slayer: Kimetsu no Yaiba
The Disastrous Life of Saiki K.: Reawakened
Toradora!
Fullmetal Alchemist: Brotherhood
Teasing Master Takagi-san
JoJo's Bizarre Adventure
Inuyasha the Movie - L'isola del fuoco scarlatto
Naruto Shippuden: The Movie
Record of Grancrest War
Devilman Crybaby
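
The distances printed above are squared L2 distances, which is what `IndexFlatL2` reports. If the embeddings have unit length (an assumption here; it is worth checking with `np.linalg.norm` on a few rows), a squared L2 distance `d2` corresponds to a cosine similarity of `1 - d2 / 2`. The sketch below checks that identity on two small unit vectors and applies it to the second distance in the output. The helper names are hypothetical.

```python
import math


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_from_squared_l2(d2):
    """Convert a squared L2 distance to cosine similarity (unit vectors only)."""
    return 1.0 - d2 / 2.0


# Two unit-length vectors: both routes give the same similarity.
u, v = [0.6, 0.8], [1.0, 0.0]
assert abs(cosine_from_squared_l2(squared_l2(u, v)) - cosine(u, v)) < 1e-9

# Nearest neighbour after the query title itself, under the unit-length assumption.
print(round(cosine_from_squared_l2(4.3122911e-01), 3))  # 0.784
```

The first hit has a distance of about zero because the query title is itself in the index, so it can be skipped when presenting recommendations.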


In [None]:
faiss.write_index(index, 'drive/MyDrive/netflix_movies_index_v3.index')

In [None]:
index = faiss.read_index('drive/MyDrive/netflix_movies_index_v3.index')
query_text = ''' Title: Aasai Aruvi: The Divine Rift
Description (important): In the ancient kingdom of Aruvapura, a young fisherman discovers a hidden celestial artifact that grants him extraordinary powers. As he grapples with his newfound abilities, he must unite with a band of misfit warriors to prevent a powerful sorcerer from using dark magic to tear the fabric of reality. The journey will reveal long-buried secrets about their ancestors and test the strength of their alliance in a clash that could reshape their world forever.
Director: Karthik Rajan
Cast: Vikram Ravi, Anjali Menon, Suriya Kumar
Country: India
Listed in: Action, Fantasy, Adventure '''

query_embedding = np.array(model.encode([query_text]))
k=10
distance, indices = index.search(query_embedding, k)
print(distance)
print(indices)

# FAISS returns row positions in the order the vectors were added, so use .iloc.
for i in indices[0]:
    print(df['title'].iloc[i])

[[1.0669923 1.0925444 1.0958458 1.1081268 1.1150116 1.115929  1.1185582
  1.1204969 1.1208205 1.1226608]]
[[7161 3916 6461 4860 2919 4243 7450  417 2717 1009]]
Kaaliyan
Rainbow Jelly
Chhota Bheem Aur Kaala Yodha
Aiyaary
Dragon Quest Your Story
The Seven Deadly Sins the Movie: Prisoners of the Sky
Mi Shivajiraje Bhosale Boltoy
Chhota Bheem in African Safari
Chhota Bheem and the Curse of Damyaan
Rudra: Secret of the Black Moon
