<a href="https://colab.research.google.com/github/enniasuijkerbuijk/bb-nlp-case/blob/main/Case%20Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make sure to **make a copy** of this notebook and to work from there (*File --> Save a Copy in Drive*).


This notebook will take you through the case. Run the code blocks one by one by clicking the Run symbol or by pressing **Shift + Enter**. 

# Getting started!

In [None]:
# Download the small English spaCy pipeline (en_core_web_sm); note that it ships without static word vectors
import spacy.cli
spacy.cli.download("en_core_web_sm")

In [None]:
# Read the data
import pandas as pd

# Make sure we can see all columns in this notebook
pd.set_option("display.max_colwidth", 999)
pd.set_option("display.max_columns", 999)

# Read the data from GitHub. Skip lines that can't be parsed
df = pd.read_csv(r"https://raw.githubusercontent.com/maartenvanhooft/avans-nlp/main/movies.csv", on_bad_lines="skip")

# The server is a bit slow, so work with a random sample of 5,000 movies
df = df.sample(n=5_000, random_state=5)
df = df.reset_index(drop=True)

# Data preprocessing
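# Fill missing overviews with "" first: a plain astype(str) would turn NaN into the string "nan"
df["overview_clean"] = df["overview"].fillna("").astype(str)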

# TODO: Which cleaning steps would you like to apply?
...

# Remove empty overviews and titles
df = df[df["overview_clean"] != ""]
df = df[df["overview_clean"].notnull()]
df = df[df["title"] != ""]
df = df[df["title"].notnull()]

# We want a recommendation for this movie:
# Match case-insensitively, since titles are usually capitalised
df[df["title"].str.contains("my little pony", case=False)]
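
For the cleaning TODO in the cell above, here is a minimal sketch of what such steps could look like. The helper name `clean_overview` and the choice of steps (unescaping HTML entities, dropping tags, collapsing whitespace) are assumptions for illustration; which steps actually help depends on what the overviews contain.

```python
import html
import re


def clean_overview(text: str) -> str:
    """Normalise one overview string (hypothetical helper, not part of the notebook)."""
    text = html.unescape(text)            # turn entities such as "&amp;" into "&"
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML tags, if any, with a space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace and newlines
    return text.strip()


# Made-up example string, only to show the effect
cleaned = clean_overview("  A <b>pony</b>\n adventure &amp; more ")
```

In the notebook this could be applied with `df["overview_clean"] = df["overview_clean"].map(clean_overview)`, before the empty overviews are removed.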

In [None]:
# Get embeddings from the spaCy model. Takes about 2 minutes on the 5K-row sample
model = spacy.load('en_core_web_sm')

# TODO: Construct the embeddings using the model
df["embedding"] = ... 

# If we don't have embeddings, drop the row
df = df[~df["embedding"].isna()]
df = df.reset_index(drop=True)
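
One possible way to fill in the embedding TODO is to run every overview through the loaded pipeline and keep `doc.vector`. The sketch below is an assumption about how that could be done, not the intended solution: `embed_texts` is a hypothetical helper, and with `en_core_web_sm` the vector is an average of the pipeline's context tensors rather than a true Word2Vec embedding, because the small model has no static word vectors.

```python
import numpy as np


def embed_texts(nlp, texts, batch_size=64):
    """Return one vector per text using a loaded spaCy pipeline (hypothetical helper).

    Texts that yield no tokens or an empty vector are mapped to None,
    so they can be dropped afterwards with isna().
    """
    embeddings = []
    # nlp.pipe processes the texts in batches, which is faster than calling nlp() per row
    for doc in nlp.pipe(texts, batch_size=batch_size):
        vector = doc.vector
        if len(doc) > 0 and vector.size > 0:
            embeddings.append(np.asarray(vector))
        else:
            embeddings.append(None)
    return embeddings


# Possible use in the notebook (assumes `model` and `df` from the cells above):
# df["embedding"] = pd.Series(embed_texts(model, df["overview_clean"].tolist()), index=df.index)
```

Wrapping the result in a `pd.Series` keeps each vector as a single object per row, which is what `get_closest` expects when it reads `df.at[ix, "embedding"]`.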

In [None]:
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(a,b):
    """
    Get cosine similarity between two arrays a,b
    :param a: array one
    :param b: array two
    """
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

def get_closest(df, ix):
    """
    Get most similar movie based on the index given 
    :param df: movies dataframe, including the embeddings
    :param ix: Index of the movie you're interested in 
    """
    base_embed = df.at[ix, "embedding"]
    print(f"Find closest movie for \nTitle: {df.at[ix, 'title']}\nOverview: {df.at[ix, 'overview']}")

    for i in df.index:
        this_embed = df.at[i, "embedding"]
        try:
            df.at[i, "similarity"] = cosine_similarity(base_embed, this_embed)
        except Exception:
            # Show the row whose embedding could not be compared, then carry on
            print(df[df.index == i])

    # The movie itself ends up on top with similarity 1, so look at the rows below it
    return df[df["similarity"].notnull()].sort_values("similarity", ascending=False)[["title", "overview", "similarity"]]
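
The row-by-row loop in `get_closest` can also be written as a single matrix operation. The sketch below is an optional alternative, assuming all embeddings have the same length; `cosine_to_all` is a hypothetical name and the small `vectors` array is made up for illustration.

```python
import numpy as np


def cosine_to_all(base, matrix):
    """Cosine similarity between `base` (shape d) and every row of `matrix` (shape n x d)."""
    base = np.asarray(base, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(base)
    sims = np.full(len(matrix), np.nan)
    # Rows with zero norm have no direction, so they stay NaN instead of dividing by zero
    np.divide(matrix @ base, norms, out=sims, where=norms > 0)
    return sims


# Made-up vectors: identical, orthogonal, at 45 degrees, and all zeros
vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]])
sims = cosine_to_all(vectors[0], vectors)
```

In the notebook, the matrix could be built with `np.vstack(df["embedding"].to_numpy())` and the result assigned to `df["similarity"]` in one step.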

In [None]:
# TODO: Use get_closest to get the movie that is most similar to My Little Pony
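
`get_closest` expects a row index rather than a title, so the title has to be looked up first. A minimal sketch of that lookup, assuming a hypothetical helper `find_title_index` and a made-up demo frame:

```python
import pandas as pd


def find_title_index(df, query):
    """Return index labels of rows whose title contains `query`, ignoring case (hypothetical helper)."""
    mask = df["title"].str.contains(query, case=False, na=False, regex=False)
    return df.index[mask].tolist()


# Made-up titles, only to show the call pattern
demo = pd.DataFrame({"title": ["My Little Pony: The Movie", "Heat", None]})
matches = find_title_index(demo, "my little pony")

# In the notebook one might then call: get_closest(df, matches[0]).head(10)
```

If several titles match, it is worth inspecting `matches` before picking one.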


If you've got time left, see if you can improve the recommendations by using other data in the movies dataframe (`df`).
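
One way to bring in other data is to blend the overview similarity with a second signal. The sketch below assumes the dataframe has some list-like genre information per movie, which is not confirmed by this notebook; `jaccard`, `blended_score`, and the `weight` of 0.7 are hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Share of items the two collections have in common (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def blended_score(text_sim, genres_a, genres_b, weight=0.7):
    """Weighted mix of overview similarity and genre overlap (assumed genre lists)."""
    return weight * text_sim + (1 - weight) * jaccard(genres_a, genres_b)


# Made-up values: overview similarity 0.8, one of three genres shared
score = blended_score(0.8, ["Animation", "Family"], ["Family", "Comedy"])
```

Sorting on such a blended score instead of on `similarity` alone would favour movies that match on both the overview and the extra signal.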