# Movie Recommender

In [None]:
%pip install sentence-transformers


---
*Make sure to **make a copy** of this notebook and to work from there (File --> Save a Copy in Drive).*


---


Hi! This notebook will help you build recommendations for movies by applying
NLP to the movie descriptions. Run the code blocks one by one by clicking the Run symbol or by pressing **Shift + Enter**.

# Getting started!

We're first going to import some packages so that we can read and process the data.

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load a small pretrained sentence-embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Example DataFrame
df = pd.DataFrame({"description": ["This is a test.", "Another test description."]})

# Add an 'embedding' column holding one vector per description
df["embedding"] = df["description"].apply(lambda x: model.encode(x))


In [None]:
print(df)

                 description  \
0            This is a test.   
1  Another test description.   

                                           embedding  
0  [0.033590816, 0.0105125075, -0.01749978, 0.029...  
1  [0.039427444, 0.04022672, -0.031039717, 0.0504...  


In [None]:
# Import pandas
import pandas as pd

# Make sure we can see all columns in this notebook
pd.set_option("display.max_colwidth", 999)
pd.set_option("display.max_columns", 999)

In [None]:
# Read the data from GitHub. Skip lines that can't be parsed
df_complete = pd.read_csv(r"https://raw.githubusercontent.com/enniasuijkerbuijk/bb-nlp-case/main/data/movies.csv", on_bad_lines="skip")



In [None]:
# This server is a bit slow, so work with a random sample of 5,000 movies
df = df_complete.sample(n=5_000, random_state=5)
df = df.reset_index(drop=True)

In [None]:
# Inspect the data
df.head()

Unnamed: 0,title,description,belongs_to_collection,budget,homepage,id,imdb_id,original_language,original_title,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,adult,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genres
0,Mobile Suit Gundam: Char's Counterattack,In UC 0093 the Federation has recovered from its defeat and has created a new anti-colonial special forces unit to deal with rebel forces: Londo Bell. Elsewhere in space Char Aznable re-appears out of self imposed hiding with a declaration that he now commands his own Neo-Zeon movementand intends to force the emigration of Earth's inhabitants to space by bringing about an apocalypse.,,0,http://www.gundam-cca.net/,16157,tt0095262,ja,機動戰士 ガンダム 逆襲のシャア,1.639329,/bbK6N3Dlfapu8mWZVwxM6nJ5cb5.jpg,"[{'name': 'Sunrise', 'id': 3153}, {'name': 'Sotsu Agency', 'id': 4719}, {'name': 'Bandai Visual', 'id': 5844}, {'name': 'Shochiku', 'id': 5906}, {'name': 'Nagoya Broadcasting Network (NBN)', 'id': 81418}]","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",3/12/1988,0.0,124.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,False,7.1,13.0,False,Animation,Action,Science Fiction,,,,,,"[{'id': 16, 'name': 'Animation'}, {'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]"
1,Red Cliff Part II,"In 208 A.D., in the final days of the Han Dynasty, shrewd Prime Minster Cao convinced the fickle Emperor Han the only way to unite all of China was to declare war on the kingdoms of Xu in the west and East Wu in the south. Thus began a military campaign of unprecedented scale. Left with no other hope for survival, the kingdoms of Xu and East Wu formed an unlikely alliance.","{'id': 96677, 'name': 'Red Cliff Collection', 'poster_path': '/3KFgWRuNk3d9QGCnQUXKJSsfrLC.jpg', 'backdrop_path': '/46G7BAqK6LDAxIFVLq926rzN65o.jpg'}",80000000,http://www.redclifffilm.com,15384,tt1326972,zh,赤壁 2,7.309903,/s6fUmPUR5YY8HqkCnlthHsVLoDC.jpg,"[{'name': 'Metropolitan Filmexport', 'id': 656}, {'name': 'Lion Rock Productions', 'id': 2812}, {'name': 'Showbox', 'id': 3491}]","[{'iso_3166_1': 'CN', 'name': 'China'}]",1/7/2009,121059225.0,136.0,"[{'iso_639_1': 'zh', 'name': '普通话'}, {'iso_639_1': 'th', 'name': 'ภาษาไทย'}]",Released,Destiny Lies In The Wind,False,7.1,110.0,False,War,Action,Drama,History,Thriller,,,,"[{'id': 10752, 'name': 'War'}, {'id': 28, 'name': 'Action'}, {'id': 18, 'name': 'Drama'}, {'id': 36, 'name': 'History'}, {'id': 53, 'name': 'Thriller'}]"
2,Careless Love,"Linh (Nammi Le) is a Vietnamese Australian university student who secretly starts part-time work as an escort. She develops a close rapport with one of her clients, an enigmatic American art dealer, who books her on a regular basis. For a time she manages to keep her two lives in separate compartments. But when she falls for a fellow student her worlds collide and she must deal with the emotional chaos that follows.",,0,http://www.carelesslovefilm.com,105676,tt1835920,en,Careless Love,0.085536,/4OlhiiNHs13yb6rrZ3LwIUe09eQ.jpg,[],"[{'iso_3166_1': 'AU', 'name': 'Australia'}]",5/16/2012,0.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Sometimes you have to be two different people. What happens when they meet?,False,8.0,1.0,False,,,,,,,,,[]
3,5 Days of War,"An American journalist and his cameraman are caught in the combat zone during the first Russian airstrikes against Georgia. Rescuing Tatia, a young Georgian schoolteacher separated from her family during the attack, the two reporters agree to help reunite her with her family in exchange for serving as their interpreter. As the three attempt to escape to safety, they witness--and document--the devastation from the full-scale crossfire and cold-blooded murder of innocent civilians.",,20000000,,50601,tt1486193,en,5 Days of War,3.174512,/7hRpmThsUm68m1F8MNtxEKa4Irc.jpg,"[{'name': 'Midnight Sun Pictures', 'id': 17887}, {'name': 'Dispictures', 'id': 24262}, {'name': 'Georgia International Films', 'id': 24264}, {'name': 'Rex Media', 'id': 24266}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",4/14/2011,17479.0,113.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'ka', 'name': 'ქართული'}]",Released,Their only weapon is the truth.,False,5.8,63.0,False,War,Drama,,,,,,,"[{'id': 10752, 'name': 'War'}, {'id': 18, 'name': 'Drama'}]"
4,My Dear Killer,"Following a mysterious decapitation (via mechanical digger) of an insurance investigator, Police Inspector Peretti is put onto the case. Slowly more people are found dead... a man supposedly commits suicide, a women is strangled, another attacked in her flat... but all the clues lead to an unsolved case of kidnapping and murder. Can Peretti find the murderer, if his major clue is a little girls drawing???",,0,,69579,tt0067434,en,Mio caro assassino,0.774628,/ppzWTmTtMzqqamR7jj1nFFjEFw1.jpg,"[{'name': 'B.R.C. Produzione S.r.l.', 'id': 7855}]","[{'iso_3166_1': 'IT', 'name': 'Italy'}, {'iso_3166_1': 'ES', 'name': 'Spain'}]",2/3/1972,0.0,96.0,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,,False,6.6,6.0,False,Drama,Thriller,Foreign,,,,,,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}, {'id': 10769, 'name': 'Foreign'}]"


# Data processing/cleaning

In [None]:
# Data preprocessing: turn missing descriptions into empty strings before
# casting to str (otherwise NaN becomes the literal text "nan")
df["description_clean"] = df["description"].fillna("").astype(str).str.strip()

# Drop rows where the description isn't filled in
df = df[df["description_clean"] != ""].copy()

In [None]:
# TODO: What cleaning steps would you like to apply?
...
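
One possible starting point for the cleaning step is sketched below. The helper name `clean_description` is hypothetical, and the three steps are only assumptions based on patterns visible in the sample output (HTML entities such as `&amp;` and stray spaces before punctuation). Other steps may suit your data better.

```python
import html
import re


def clean_description(text: str) -> str:
    """Hypothetical helper: normalise one movie description."""
    # Decode HTML entities such as "&amp;" back to "&"
    text = html.unescape(text)
    # Remove whitespace that precedes punctuation, e.g. "Kolkata , with"
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

It could be applied with `df["description_clean"] = df["description_clean"].map(clean_description)`.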

In [None]:
# Pick a movie for which you want the recommendation
index = 4362 #TODO fill in a number here
df.loc[[index]]  # select by index label, matching the lookup in get_closest

Unnamed: 0,title,description,belongs_to_collection,budget,homepage,id,imdb_id,original_language,original_title,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,adult,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genres,description_clean,embedding,similarity
4362,Jurassic World,"Twenty-two years after the events of Jurassic Park, Isla Nublar now features a fully functioning dinosaur theme park, Jurassic World, as originally envisioned by John Hammond.","{'id': 328, 'name': 'Jurassic Park Collection', 'poster_path': '/qIm2nHXLpBBdMxi8dvfrnDkBUDh.jpg', 'backdrop_path': '/pJjIH9QN0OkHFV9eue6XfRVnPkr.jpg'}",150000000,http://www.jurassicworld.com/,135397,tt0369610,en,Jurassic World,32.790475,/jjBgi2r5cRt36xF6iNUEhzscEcb.jpg,"[{'name': 'Universal Studios', 'id': 13}, {'name': 'Amblin Entertainment', 'id': 56}, {'name': 'Legendary Pictures', 'id': 923}, {'name': 'Fuji Television Network', 'id': 3341}, {'name': 'Dentsu', 'id': 6452}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",6/9/2015,1513529000.0,124.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The park is open.,False,6.5,8842.0,False,Action,Adventure,Science Fiction,Thriller,,,,,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]","Twenty-two years after the events of Jurassic Park, Isla Nublar now features a fully functioning dinosaur theme park, Jurassic World, as originally envisioned by John Hammond.","[0.0075168377, -0.0073733325, 0.0408921, -0.058211718, 0.04788657, -0.015973186, -0.09272771, -0.0128216, 0.011668127, 0.030926455, -0.04825614, 0.011415787, -0.04692573, 0.097221285, -0.020575453, -0.008768093, -0.029688735, 0.008495336, 0.108765535, -0.061208595, -0.028183024, 0.068037175, 0.028113872, -0.0064741997, -0.06035439, 0.042427443, 0.019126924, 0.09799467, -0.016061565, -0.025110483, 0.024138443, -0.012121526, 0.046112243, -0.06612382, 0.027757788, 0.036631204, -0.05229918, -0.013524659, -0.0093390085, -0.006619366, -0.04647201, -0.062587455, 0.025268946, 0.01554889, -0.05916187, 0.061688367, 0.0031289782, -0.026575888, 0.042741403, 0.016759183, 0.04660276, -0.041598793, 0.050644692, -0.15045401, 0.0074561513, 0.069428094, 0.018925816, -0.01750128, 
0.010070058, -0.054564744, 0.03049806, -0.06449751, 0.00027729638, 0.009522896, 0.02898214, -0.0374767, 0.06837103, 0.050112985, 0.016460324, -0.05660898, 0.06861404, -0.058936883, 0.017215692, -0.087829925, -0.07441692, 0....",0.422406


In [None]:
print(df.shape)

(5000, 34)


# NLP Modelling

In [None]:
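from tqdm import tqdm

# Create the column up front with object dtype so each cell can hold a vector
df["embedding"] = None

# Use `i` as the loop variable so the `index` chosen above is not overwritten
for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    df.at[i, "embedding"] = model.encode(row["description_clean"])

# Save a copy so the embeddings are not lost if df is modified by accident
df_with_embeddings = df.copy()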

100%|██████████| 5000/5000 [04:08<00:00, 20.11it/s]


In [None]:
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(a,b):
    """
    Get cosine similarity between two arrays a,b
    :param a: array one
    :param b: array two
    """
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

def get_closest(df, ix):
    """
    Rank all movies by similarity to the movie at the given index
    :param df: movies dataframe, including the embeddings
    :param ix: Index label of the movie you're interested in
    :return: dataframe with title, description and similarity, most similar first
    """
    base_embed = df.at[ix, "embedding"]
    print(f"Find closest movie for:\nTitle: {df.at[ix, 'title']}\nOverview: {df.at[ix, 'description']}")

    for i in df.index:
        this_embed = df.at[i, "embedding"]
        try:
            df.at[i, "similarity"] = cosine_similarity(base_embed, this_embed)
        except (TypeError, ValueError):
            # Show rows whose embedding is missing or malformed
            print(df[df.index == i])

    return df[df["similarity"].notnull()].sort_values("similarity", ascending=False)[["title", "description", "similarity"]]

In [None]:
# Use get_closest to rank all movies by similarity to the movie chosen above
get_closest(df, index)

Find closest movie for:
Title: Jurassic World
Overview: Twenty-two years after the events of Jurassic Park, Isla Nublar now features a fully functioning dinosaur theme park, Jurassic World, as originally envisioned by John Hammond.


Unnamed: 0,title,description,similarity
4362,Jurassic World,"Twenty-two years after the events of Jurassic Park, Isla Nublar now features a fully functioning dinosaur theme park, Jurassic World, as originally envisioned by John Hammond.",1.000000
4515,Walking With Dinosaurs,"Walking with Dinosaurs 3D is a film depicting life-like 3D dinosaur characters set in photo-real landscapes that transports audiences to the prehistoric world as it existed 70 million years ago. The film is based on the 1999 documentary television miniseries Walking with Dinosaurs, produced by the BBC. Walking with Dinosaurs 3D is being produced by Evergreen Studios, the company that produced Happy Feet, and it is was released on October 11, 2013.",0.485142
2367,Dinosaur 13,Documentary about the discovery of the largest T-Rex fossil found.,0.427046
2861,Coo of The Far Seas,"After a storm hits their island, Yusuke (driving his jet-ski) finds a baby dinosaur. He show it to his dad, they keep it and call it Coo. But there are other parties interested in a 65 milion year old creature ...",0.423010
2342,Cowboys vs. Dinosaurs,"After an accidental explosion at a local mine, dinosaurs emerge from the rubble to terrorize a small western town. Now, a group of gunslingers must defend their home if anyone is going to survive in a battle of cowboys versus dinosaurs.",0.422406
...,...,...,...
3705,Now & Later,"Sex, politics and American culture are mixed into a combustible combination in Now &amp; Later. Angela is an illegal Latina immigrant living in Los Angeles who stumbles across Bill, a disgraced banker on the run. She takes him in. Through passionate sex, soul-searching conversations ranging from politics to philosophy, and other worldly pleasures, Angela introduces Bill to another worldview. As their affair heats up, the course of Bill's life begins to take an abrupt and unexpected turn.",-0.175261
1975,Yuva,"Michael (Ajay Devgan) , Arjun (Vivek Oberoi) and Lallan (Abhishek Bachchan) are three young men in Kolkata , with different ideals and objectives . Michael is an idealistic youth leader who dreams of a better India being created by the youth power . Arjun is a self-centered , opportunistic , easygoing fellow whose objective is to immigrate to a developed country and make big money . Lallan is a goon who works for Prosenjit Chatterjee (Om Puri) , an immoral politician . The lives of these three different people become intertwined following a murder attempt and an accident in broad daylight on the Hooghly bridge",-0.177731
2499,The Vagabond King,"Louis XI of France drafts Paris's popular ""king"" of criminals as Provost Marshal in his fight against usurper Charles of Burgundy and the traitorous nobles who rally around him.",-0.180614
4078,Girls Against Boys,"After a series of bad experiences with men, Shae teams up with her co-worker, Lu, who has a simple, deadly way of dealing with the opposite sex.",-0.193963
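
The loop in `get_closest` computes one cosine similarity at a time, and the chosen movie itself always comes out on top with similarity 1.0. A vectorised alternative can be sketched as follows. The helper name `rank_by_similarity` and its parameters are hypothetical, and the sketch assumes every embedding is a non-zero vector of the same length.

```python
import numpy as np


def rank_by_similarity(embeddings, query_pos, top_k=5):
    """Hypothetical helper: positions and scores of the top_k most similar rows.

    embeddings: array-like of shape (n_movies, dim)
    query_pos: row position of the movie to compare against
    """
    matrix = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[query_pos]
    # Sort from most to least similar and skip the query movie itself
    order = [int(i) for i in np.argsort(-scores) if i != query_pos]
    top = order[:top_k]
    return top, scores[top]
```

With the notebook's data, the matrix could be built with `np.stack(df["embedding"].to_list())` and the row position looked up with `df.index.get_loc(index)`. The returned positions can then be passed to `df.iloc[...]` to show the titles.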


# Improvements
You've now found recommendations for your chosen movie, congratulations!

Are the recommendations good? Do you have ideas on tweaks or methods you could implement to improve the recommendations?

In [None]:
# TODO: Improve the recommendations
...
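
One idea to try, sketched below, is to embed a richer text than the description alone, for example by adding the genres and the tagline. The helper name `build_movie_text` is hypothetical. Whether this actually improves the recommendations is an assumption you would need to check by comparing results before and after.

```python
def build_movie_text(description, genres=(), tagline=None):
    """Hypothetical helper: combine several fields into one text to embed."""
    parts = []
    # Keep only genres that are filled in (missing values are NaN, None or empty)
    names = [g for g in genres if isinstance(g, str) and g.strip()]
    if names:
        parts.append("Genres: " + ", ".join(names) + ".")
    # Add the tagline and description only when they contain text
    if isinstance(tagline, str) and tagline.strip():
        parts.append(tagline.strip())
    if isinstance(description, str) and description.strip():
        parts.append(description.strip())
    return " ".join(parts)
```

In this notebook the inputs could come from the `description_clean`, `tagline` and `genre_1` to `genre_8` columns. The combined text would then be encoded with `model.encode` in place of the description.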