<a href="https://colab.research.google.com/github/enniasuijkerbuijk/bb-nlp-case/blob/main/Case%20Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender

---
*Make sure to **make a copy** of this notebook and to work from there (File --> Save a Copy in Drive).*


---


Hi! This notebook will help you build recommendations for movies by applying 
NLP to the movie descriptions. Run the code blocks one by one by clicking the Run symbol or by pressing **Shift + Enter**. 

# Getting started!

We're first going to import some packages, such that we read & process the data.

In [1]:
# Download the W2V model
import spacy.cli
spacy.cli.download("en_core_web_sm")

# Import pandas
import pandas as pd

# Make sure we can see all columns in this notebook
pd.set_option("display.max_colwidth", 999)
pd.set_option("display.max_columns", 999)



[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [52]:
# Read the data from this github. Ignore lines that can't be parsed
df_complete = pd.read_csv(r"https://raw.githubusercontent.com/enniasuijkerbuijk/bb-nlp-case/main/data/movies.csv", on_bad_lines="skip")

  df_complete = pd.read_csv(r"https://raw.githubusercontent.com/enniasuijkerbuijk/bb-nlp-case/main/data/movies.csv", on_bad_lines="skip")


In [53]:
# This server is a bit slow, so sample movies
df = df_complete.sample(n=5_000, random_state=5)
df = df.reset_index(drop=True)

In [54]:
# Inspect the data
df.head()

Unnamed: 0,title,description,belongs_to_collection,budget,homepage,id,imdb_id,original_language,original_title,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,adult,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genres
0,Mobile Suit Gundam: Char's Counterattack,In UC 0093 the Federation has recovered from its defeat and has created a new anti-colonial special forces unit to deal with rebel forces: Londo Bell. Elsewhere in space Char Aznable re-appears out of self imposed hiding with a declaration that he now commands his own Neo-Zeon movementand intends to force the emigration of Earth's inhabitants to space by bringing about an apocalypse.,,0,http://www.gundam-cca.net/,16157,tt0095262,ja,機動戰士 ガンダム 逆襲のシャア,1.639329,/bbK6N3Dlfapu8mWZVwxM6nJ5cb5.jpg,"[{'name': 'Sunrise', 'id': 3153}, {'name': 'Sotsu Agency', 'id': 4719}, {'name': 'Bandai Visual', 'id': 5844}, {'name': 'Shochiku', 'id': 5906}, {'name': 'Nagoya Broadcasting Network (NBN)', 'id': 81418}]","[{'iso_3166_1': 'JP', 'name': 'Japan'}]",3/12/1988,0.0,124.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,False,7.1,13.0,False,Animation,Action,Science Fiction,,,,,,"[{'id': 16, 'name': 'Animation'}, {'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]"
1,Red Cliff Part II,"In 208 A.D., in the final days of the Han Dynasty, shrewd Prime Minster Cao convinced the fickle Emperor Han the only way to unite all of China was to declare war on the kingdoms of Xu in the west and East Wu in the south. Thus began a military campaign of unprecedented scale. Left with no other hope for survival, the kingdoms of Xu and East Wu formed an unlikely alliance.","{'id': 96677, 'name': 'Red Cliff Collection', 'poster_path': '/3KFgWRuNk3d9QGCnQUXKJSsfrLC.jpg', 'backdrop_path': '/46G7BAqK6LDAxIFVLq926rzN65o.jpg'}",80000000,http://www.redclifffilm.com,15384,tt1326972,zh,赤壁 2,7.309903,/s6fUmPUR5YY8HqkCnlthHsVLoDC.jpg,"[{'name': 'Metropolitan Filmexport', 'id': 656}, {'name': 'Lion Rock Productions', 'id': 2812}, {'name': 'Showbox', 'id': 3491}]","[{'iso_3166_1': 'CN', 'name': 'China'}]",1/7/2009,121059225.0,136.0,"[{'iso_639_1': 'zh', 'name': '普通话'}, {'iso_639_1': 'th', 'name': 'ภาษาไทย'}]",Released,Destiny Lies In The Wind,False,7.1,110.0,False,War,Action,Drama,History,Thriller,,,,"[{'id': 10752, 'name': 'War'}, {'id': 28, 'name': 'Action'}, {'id': 18, 'name': 'Drama'}, {'id': 36, 'name': 'History'}, {'id': 53, 'name': 'Thriller'}]"
2,Careless Love,"Linh (Nammi Le) is a Vietnamese Australian university student who secretly starts part-time work as an escort. She develops a close rapport with one of her clients, an enigmatic American art dealer, who books her on a regular basis. For a time she manages to keep her two lives in separate compartments. But when she falls for a fellow student her worlds collide and she must deal with the emotional chaos that follows.",,0,http://www.carelesslovefilm.com,105676,tt1835920,en,Careless Love,0.085536,/4OlhiiNHs13yb6rrZ3LwIUe09eQ.jpg,[],"[{'iso_3166_1': 'AU', 'name': 'Australia'}]",5/16/2012,0.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Sometimes you have to be two different people. What happens when they meet?,False,8.0,1.0,False,,,,,,,,,[]
3,5 Days of War,"An American journalist and his cameraman are caught in the combat zone during the first Russian airstrikes against Georgia. Rescuing Tatia, a young Georgian schoolteacher separated from her family during the attack, the two reporters agree to help reunite her with her family in exchange for serving as their interpreter. As the three attempt to escape to safety, they witness--and document--the devastation from the full-scale crossfire and cold-blooded murder of innocent civilians.",,20000000,,50601,tt1486193,en,5 Days of War,3.174512,/7hRpmThsUm68m1F8MNtxEKa4Irc.jpg,"[{'name': 'Midnight Sun Pictures', 'id': 17887}, {'name': 'Dispictures', 'id': 24262}, {'name': 'Georgia International Films', 'id': 24264}, {'name': 'Rex Media', 'id': 24266}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",4/14/2011,17479.0,113.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'ka', 'name': 'ქართული'}]",Released,Their only weapon is the truth.,False,5.8,63.0,False,War,Drama,,,,,,,"[{'id': 10752, 'name': 'War'}, {'id': 18, 'name': 'Drama'}]"
4,My Dear Killer,"Following a mysterious decapitation (via mechanical digger) of an insurance investigator, Police Inspector Peretti is put onto the case. Slowly more people are found dead... a man supposedly commits suicide, a women is strangled, another attacked in her flat... but all the clues lead to an unsolved case of kidnapping and murder. Can Peretti find the murderer, if his major clue is a little girls drawing???",,0,,69579,tt0067434,en,Mio caro assassino,0.774628,/ppzWTmTtMzqqamR7jj1nFFjEFw1.jpg,"[{'name': 'B.R.C. Produzione S.r.l.', 'id': 7855}]","[{'iso_3166_1': 'IT', 'name': 'Italy'}, {'iso_3166_1': 'ES', 'name': 'Spain'}]",2/3/1972,0.0,96.0,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,,False,6.6,6.0,False,Drama,Thriller,Foreign,,,,,,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}, {'id': 10769, 'name': 'Foreign'}]"


# Data processing/cleaning

In [55]:
# Data preprocessing
df["description_clean"] = df["description"].astype(str)

# Drop rows where the description isn't filled in correctly
df = df[df["description_clean"] != ""]
df = df[df["description_clean"].notnull()]
df = df[df["description_clean"] != ""]
df = df[df["description_clean"].notnull()]

In [None]:
# TODO: What cleaning steps you would like to do? 
...

In [70]:
# Pick a movie for which you want the recommendation 
index = 2342 #TODO fill in a number here
df.iloc[index:index+1,]

Unnamed: 0,title,description,belongs_to_collection,budget,homepage,id,imdb_id,original_language,original_title,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,adult,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,genre_8,genres,description_clean,embedding,similarity
2342,Cowboys vs. Dinosaurs,"After an accidental explosion at a local mine, dinosaurs emerge from the rubble to terrorize a small western town. Now, a group of gunslingers must defend their home if anyone is going to survive in a battle of cowboys versus dinosaurs.",,0,,337208,tt3252786,en,Cowboys vs. Dinosaurs,1.121188,/s0bFAVkEgRxM3hxUPqnc916W1FF.jpg,"[{'name': 'Oracle Film Group', 'id': 52305}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",5/19/2015,0.0,89.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Let the Best Species Win,False,4.3,14.0,False,Science Fiction,Action,,,,,,,"[{'id': 878, 'name': 'Science Fiction'}, {'id': 28, 'name': 'Action'}]","After an accidental explosion at a local mine, dinosaurs emerge from the rubble to terrorize a small western town. Now, a group of gunslingers must defend their home if anyone is going to survive in a battle of cowboys versus dinosaurs.","[0.17869939, -0.19270253, 0.016139043, -0.09272385, 0.059241135, -0.022864835, 0.30100754, -0.025572274, 0.10966, 0.232948, -0.074068226, -0.19797756, -0.25803798, 0.03660716, -0.14684346, -0.20335218, 0.20014171, 0.19364086, 0.16049604, 0.060791593, -0.23717698, 0.02520218, -0.3075304, -0.35554937, 0.25069267, -0.09250803, 0.48756066, 0.09285031, 0.13303941, 0.41174105, -0.059028924, 0.12712108, 0.47067052, -0.3799655, 0.03149418, -0.21008416, 0.22052965, 0.2413929, -0.06826611, 0.078029275, -0.2413724, 0.17844485, -0.2946879, 0.06524317, 0.12241088, 0.16333705, 0.0132577475, 0.20054571, 0.3162138, -0.12779132, -0.52333194, 0.12857863, -0.08041629, -0.16588, -0.10913316, 0.2177906, -0.17047232, 0.13843554, -0.032032926, -0.33110243, 0.061178286, -0.4893927, -0.06979269, -0.15948716, 0.073672995, -0.22971264, 0.31181028, -0.016890602, 0.32075348, 0.29394984, 0.07243651, 0.30164975, -0.004544348, 0.00988805, -0.24355167, -0.36713678, -0.2055822, -0.0575707, 0.1806719, -0.33091328, ...",0.769871


# NLP Modelling

In [71]:
# Get W2V embedding - Takes about 2 minutes on the sampled dataset for 5K rows
model = spacy.load('en_core_web_sm')

# Construct the embeddings using the model
df["embedding"] = df['description_clean'].apply(lambda x:model(x).vector)

# Drop any rows without embeddings
df = df[~df["embedding"].isna()]
df = df.reset_index(drop=True)

In [61]:
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(a,b):
    """
    Get cosine similarity between two arrays a,b
    :param a: array one
    :param b: array two
    """
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

def get_closest(df, ix):
    """
    Get most similar movie based on the index given 
    :param df: movies dataframe, including the embeddings
    :param ix: Index of the movie you're interested in 
    """
    base_embed = df.at[ix, "embedding"]
    print(f"Find closest movie for \nTitle:{df.at[ix, 'title']}: \nOverwiew: {df.at[ix, 'description']}")
    
    for i in df.index:
        this_embed = df.at[i, "embedding"]
        try:
          df.at[i, "similarity"] = cosine_similarity(base_embed, this_embed)
        except:
          print(df[df.index == i])
        
    
    return df[df["similarity"].notnull()].sort_values("similarity", ascending=False)[["title", "description", "similarity"]]

In [72]:
# Use get_closest to get movie that is most similar to my little pony
get_closest(df, index)

Find closest movie for 
Title:Cowboys vs. Dinosaurs: 
Overwiew: After an accidental explosion at a local mine, dinosaurs emerge from the rubble to terrorize a small western town. Now, a group of gunslingers must defend their home if anyone is going to survive in a battle of cowboys versus dinosaurs.


Unnamed: 0,title,description,similarity
2342,Cowboys vs. Dinosaurs,"After an accidental explosion at a local mine, dinosaurs emerge from the rubble to terrorize a small western town. Now, a group of gunslingers must defend their home if anyone is going to survive in a battle of cowboys versus dinosaurs.",1.000000
1530,Blackwood,"Having recovered from a shattering emotional breakdown, college professor Ben Marshall relocates to the countryside with his wife and young son, hoping for a fresh start. He has a teaching job lined up and a new home to move into; things finally look to be going Ben's way. Until, that is, he starts to feel that something isn't quite right in the house. Finding himself plagued by spectral visions, Ben becomes obsessed with uncovering the truth behind a local mystery that appears to be putting the lives of his family in danger",0.886405
2540,Moll Flanders,"The daughter of a thief, young Moll is placed in the care of a nunnery after the execution of her mother. However, the actions of an abusive priest lead Moll to rebel as a teenager, escaping to the dangerous streets of London. Further misfortunes drive her to accept a job as a prostitute from the conniving Mrs. Allworthy. It is there that Moll first meets Hibble, who is working as Allworthy's servant but takes a special interest in the young woman's well-being. With his help, she retains hope for the future, ultimately falling in love with an unconventional artist who promises the possibility of romantic happiness.",0.872958
4849,The Saga of the Viking Women and Their Voyage to the Waters of the Great Sea Serpent,"A group of lonely Viking women build a ship and set off across the sea to locate their missing menfolk, only to fall into the clutches of the barbarians that also hold their men captive. There is a cameo appearance by the sea serpent.",0.869338
3838,Fading of the Cries,"Jacob, a young man armed with a deadly sword, saves Sarah, a teenage girl, from Mathias, a malevolent evil that has begun plaguing a small farmland town while in search of an ancient necklace that had belonged to Sarahs Uncle. Jacob sets out to get Sarah home safely, running through streets, fields, churches and underground tunnels, while being pursued by hordes of demonic creatures. Along the way, both come to terms with the demons within themselves - Sarah begins to understand her hatred towards her mother and sister may be unjustified and Jacob discovers the secrets of his past, realizing the only way to truly defeat the demons is to return to the very place his family was murdered.",0.868834
...,...,...,...
968,The Missing Star,No overview found.,0.116961
3413,The Nautical Chart,No overview found.,0.116961
2523,Gasht-e Ershad,Not Available,0.061525
2856,,Released,0.020068


# Improvements
You've now found recommendations for your chosen movie, congratuations! 

Are the recommendations good? Do you have ideas on tweaks or methods you could implement to improve the recommendations?

In [None]:
# TODO: Improve the recommendations
...