# Movie Recommender

---
*Make sure to **make a copy** of this notebook and to work from there (File --> Save a Copy in Drive).*


---


Hi! This notebook will help you build recommendations for movies by applying 
NLP to the movie descriptions. Run the code blocks one by one by clicking the Run symbol or by pressing **Shift + Enter**. 

# Getting started!

We're first going to import some packages so that we can read and process the data.

In [None]:
# Download the small English spaCy model, used later to embed the descriptions
import spacy
import spacy.cli
spacy.cli.download("en_core_web_sm")

# Import pandas
import pandas as pd

# Make sure we can see all columns in this notebook
pd.set_option("display.max_colwidth", 999)
pd.set_option("display.max_columns", 999)

In [None]:
# Read the data from GitHub, skipping lines that can't be parsed
df_complete = pd.read_csv(r"https://raw.githubusercontent.com/enniasuijkerbuijk/bb-nlp-case/main/data/movies.csv", on_bad_lines="skip")

In [None]:
# This server is a bit slow, so work with a random sample of 5,000 movies
df = df_complete.sample(n=5_000, random_state=5)
df = df.reset_index(drop=True)

In [None]:
# Inspect the data
df.head()

# Data processing/cleaning

In [None]:
# Data preprocessing
# Drop rows with a missing description first; astype(str) would otherwise turn NaN into the string "nan"
df = df[df["description"].notnull()].copy()
df["description_clean"] = df["description"].astype(str).str.strip()

# Drop rows where the description is empty
df = df[df["description_clean"] != ""]
df = df.reset_index(drop=True)

In [None]:
# TODO: What cleaning steps would you like to apply?
...
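
If you're not sure where to start with the cleaning TODO above, here is a minimal sketch of one possible cleaning step. The function name `clean_description` and the choice of steps (lowercasing, removing punctuation, collapsing whitespace) are assumptions for illustration, not part of the original notebook. Whether these steps actually improve the recommendations is something to test for yourself.

```python
import re


def clean_description(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    text = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    # Note: this also removes accented letters, which may not be what you want.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_description("  A Hero's  journey -- begins (again)!  "))
```

To try it on the data, you could apply it to the column with `df["description_clean"] = df["description_clean"].apply(clean_description)` and compare the recommendations before and after.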

In [None]:
# Pick a movie for which you want recommendations
index = 2342  # TODO: fill in a number here
df.iloc[index:index+1]

# NLP Modelling

In [None]:
# Load the spaCy model - embedding the 5K sampled rows takes about 2 minutes
model = spacy.load('en_core_web_sm')

# Construct one embedding per description; doc.vector averages the token vectors
df["embedding"] = df['description_clean'].apply(lambda x: model(x).vector)

# Drop any rows without embeddings
df = df[~df["embedding"].isna()]
df = df.reset_index(drop=True)

In [None]:
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(a,b):
    """
    Get cosine similarity between two arrays a,b
    :param a: array one
    :param b: array two
    """
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

def get_closest(df, ix):
    """
    Rank all movies by similarity to the movie at the given index
    :param df: movies dataframe, including the embeddings
    :param ix: Index of the movie you're interested in
    :return: titles and descriptions sorted by similarity, most similar first
             (the chosen movie itself typically ends up at the top)
    """
    base_embed = df.at[ix, "embedding"]
    print(f"Find closest movie for \nTitle: {df.at[ix, 'title']} \nOverview: {df.at[ix, 'description']}")

    for i in df.index:
        this_embed = df.at[i, "embedding"]
        try:
            df.at[i, "similarity"] = cosine_similarity(base_embed, this_embed)
        except Exception:
            # Show the offending row instead of stopping the whole loop
            print(df.loc[[i]])

    return df[df["similarity"].notnull()].sort_values("similarity", ascending=False)[["title", "description", "similarity"]]
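
The loop in `get_closest` computes one similarity at a time. As a hedged alternative, the sketch below computes all cosine similarities in one matrix operation and leaves out the chosen movie itself. The helper name `top_k_similar` and the toy vectors are assumptions made up for illustration; they are not part of the original notebook.

```python
import numpy as np


def top_k_similar(embeddings, ix, k=5):
    """Return (row, cosine similarity) pairs for the k rows most similar to row ix."""
    norms = np.linalg.norm(embeddings, axis=1)
    # Guard against all-zero vectors to avoid dividing by zero
    norms = np.where(norms == 0, 1.0, norms)
    unit = embeddings / norms[:, None]
    # Dot products of unit vectors are cosine similarities
    sims = unit @ unit[ix]
    sims[ix] = -np.inf  # exclude the query movie itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy 2-D vectors standing in for the real embeddings
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
result = top_k_similar(toy, ix=0, k=2)
print(result)
```

With the real data, the matrix could be built with `np.vstack(df["embedding"].tolist())`, assuming every row has an embedding of the same length.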

In [None]:
# Use get_closest to find the movies most similar to the one you picked
get_closest(df, index)

# Improvements
You've now found recommendations for your chosen movie, congratulations!

Are the recommendations good? Do you have ideas on tweaks or methods you could implement to improve the recommendations?

In [None]:
# TODO: Improve the recommendations
...
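
One idea you could compare against the embedding approach is TF-IDF, which weights words by how rare they are across descriptions. The sketch below is a minimal pure-Python version under stated assumptions: the helper names `tfidf_vectors` and `sparse_cosine` and the three toy descriptions are made up for illustration, and tokenisation is a plain whitespace split. It is a starting point for experimenting, not the notebook's intended solution.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts of word -> weight) for a list of strings."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many descriptions does each word occur?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up toy descriptions, not taken from the dataset
docs = [
    "a young wizard attends a school of magic",
    "a wizard battles a dark lord with magic",
    "two detectives investigate a murder in the city",
]
vecs = tfidf_vectors(docs)
print(sparse_cosine(vecs[0], vecs[1]), sparse_cosine(vecs[0], vecs[2]))
```

In this toy example the two wizard descriptions share rare words, so they score higher than the wizard and detective pair, whose only shared word appears in every description and therefore gets zero weight.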