<a href="https://colab.research.google.com/github/thomasnamink/nlp-movie-recommender-case/blob/main/NLP%20Case%202024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender
Welcome to the Movie Recommender case!

---
*Make sure to **make a copy** of this notebook and to work from there (File --> Save a Copy in Drive).*


---


# Getting started!
Let's first install a required package

In [None]:
#Install a required package. Note that this may take a minute.
!pip install sentence-transformers

Import pandas for data processing and set some configurations

In [5]:
#Import pandas for data processing
import pandas as pd

# Make sure we can see all columns in this notebook
pd.set_option("display.max_colwidth", 999)
pd.set_option("display.max_columns", 999)

In [None]:
# Read the data from the GitHub repository. Skip lines that can't be parsed
df_complete = pd.read_csv(r"https://raw.githubusercontent.com/thomasnamink/nlp-movie-recommender-case/main/movies.csv", on_bad_lines="skip")

# Data processing/cleaning

Since generating embeddings is quite slow, we first take a subset of the 2500 highest-grossing movies (by the `revenue` column) to create our recommendations on.

In [7]:
df = df_complete.nlargest(2500, 'revenue').reset_index(drop=True)

In [None]:
#TODO: analyze the data, and create some descriptive statistics!
df
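As a starting point for the analysis, here is a minimal sketch of a few descriptive statistics. It runs on a small made-up DataFrame (`toy` is hypothetical and only mirrors the `title`, `revenue` and `description` columns used in this notebook); you can apply the same calls to `df`.

```python
import pandas as pd

# Toy stand-in for the movie DataFrame; the real data comes from movies.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "revenue": [100.0, 250.0, 50.0],
    "description": ["A short plot.", None, "A somewhat longer plot summary."],
})

# Summary statistics (count, mean, min, max, ...) for the numeric columns
numeric_summary = toy.describe()

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Description length in words; missing descriptions stay NaN
words_per_description = toy["description"].str.split().str.len()

print(numeric_summary)
print(missing_per_column)
print(words_per_description.describe())
```

Looking at missing values and description lengths first is useful here, because very short or empty descriptions give the embedding model little to work with.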

Let's continue with doing some basic data cleaning of the movie descriptions

In [9]:
# Drop rows where the description is missing. Do this before astype(str), which would turn NaN into the string "nan"
df = df[df["description"].notnull()].copy()

# Create a new column, 'description_clean', in string format with surrounding whitespace removed
df["description_clean"] = df["description"].astype(str).str.strip()

# Drop rows where the description is empty
df = df[df["description_clean"] != ""]

Data cleaning is an important step. Can you think of more data cleaning steps of the 'description_clean' column?

In [10]:
# TODO: What cleaning steps would you like to do?
...
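One possible set of extra cleaning steps is sketched below. This is only a suggestion, not the intended answer: the helper `clean_descriptions` and the `min_words` threshold are hypothetical, and the example runs on a made-up DataFrame rather than the real data.

```python
import pandas as pd

def clean_descriptions(frame, column="description_clean", min_words=3):
    """Return a cleaned copy of frame; `column` must contain strings."""
    out = frame.copy()
    # Collapse runs of whitespace (including newlines) and trim the ends
    out[column] = out[column].str.replace(r"\s+", " ", regex=True).str.strip()
    # Drop placeholder strings that a string conversion of missing values can leave behind
    out = out[~out[column].str.lower().isin(["", "nan", "none"])]
    # Drop descriptions that are too short to carry much meaning
    out = out[out[column].str.split().str.len() >= min_words]
    # Drop exact duplicate descriptions
    out = out.drop_duplicates(subset=column)
    return out.reset_index(drop=True)

toy = pd.DataFrame({"description_clean": [
    "  A robot   saves\nthe world. ",
    "nan",
    "Short.",
    "A robot saves the world.",
]})
cleaned = clean_descriptions(toy)
print(cleaned)
```

Heavier steps such as lowercasing or stop-word removal are deliberately left out: sentence embedding models are trained on natural text, so aggressive normalization may not help.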

Awesome, you have now prepared the descriptions for our NLP technique! Let's pick a base movie you like (or dislike) and generate recommendations for it, based on movies with similar descriptions.

In [None]:
# Pick a movie for which you want to create recommendations
df.head(50)['title'].tolist()

In [19]:
#TODO
chosen_movie = "Transformers"

# NLP Modelling

Let's start our NLP modelling! Before we start, load the model that we will use for creating the embeddings. As explained, this model generates 384-dimensional embeddings of the movie descriptions. More details about the specific model are given further down.

In [None]:
#Import SentenceTransformer, to most easily calculate embeddings
from sentence_transformers import SentenceTransformer

#Initialize the embedding model we have chosen for this case
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

Let's check if our embedder correctly generates the embedding vector

In [None]:
test_sentence = "Fill in something here"
model.encode(test_sentence)

Let's start generating the embeddings for all the movies! We do this in the same way as in the example above, and store the result in a new embedding column in our movie DataFrame. You can start by running the cell below.

There are many models you can use for calculating embeddings. The model we have chosen for this case is sentence-transformers/all-MiniLM-L6-v2, a very popular model for calculating embeddings of English texts. If you are curious about the embedding model, take a look at the HuggingFace page: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2.

In [None]:
#Import the tqdm package, that can easily be used for creating progress bars!
from tqdm import tqdm

#Create an empty embedding column
df["embedding"] = None

#Loop over all rows in our movie DataFrame (at most 2500 after cleaning), using tqdm for the progress bar. Encode the movie description at every row.
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    df.at[index, "embedding"] = model.encode(row["description_clean"])

df_with_embeddings = df.copy() #Save a copy so you won't lose the embedding if you accidentally do something wrong
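The row-by-row loop above is easy to follow, but `model.encode` also accepts a list of sentences and processes them in batches, which is usually faster. Below is a minimal sketch of that alternative; the helper name `embed_descriptions` and the batch size of 64 are assumptions for illustration.

```python
import numpy as np

def embed_descriptions(model, descriptions, batch_size=64):
    """Encode all descriptions in one batched call and return one vector per description."""
    vectors = model.encode(
        list(descriptions),
        batch_size=batch_size,
        show_progress_bar=True,  # SentenceTransformer shows its own progress bar
    )
    # Split the 2-D result into a list of 1-D arrays, one per description
    return list(np.asarray(vectors))

# Usage in this notebook (assumes `model` and `df` from the cells above):
# df["embedding"] = embed_descriptions(model, df["description_clean"])
```

The function takes the model as an argument, so it works with any object that has a compatible `encode` method.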

Let's create some functions that will help us create the recommendations

In [20]:
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(a,b):
    """
    Get cosine similarity between two arrays a,b
    :param a: array one
    :param b: array two
    """
    cos_sim = dot(a, b)/(norm(a)*norm(b))
    return cos_sim

def get_closest(df, chosen_movie):
    """
    Rank all movies by similarity to the chosen movie
    :param df: movies dataframe, including the embeddings
    :param chosen_movie: Title of the movie you're interested in
    :return: dataframe sorted from most to least similar
    """

    print(f"Finding closest movies for title: {chosen_movie}")

    #Try finding the base embedding; .iloc[0] raises an IndexError when no title matches
    try:
        base_embed = df[df['title'] == chosen_movie]['embedding'].iloc[0]
    except IndexError:
        raise Exception("Error finding base embedding! Did you specify a movie title that is in the movie dataset?")

    #Create an empty similarity column
    df["similarity"] = None

    for i in df.index:
        this_embed = df.at[i, 'embedding']
        df.at[i, 'similarity'] = cosine_similarity(base_embed, this_embed)

    return df[df["similarity"].notnull()].sort_values("similarity", ascending=False)
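The loop in `get_closest` computes one cosine similarity per row. The same numbers can be computed in a single step by stacking the embeddings into a matrix. The sketch below shows this on made-up 2-dimensional vectors; the function name `cosine_similarities` is hypothetical.

```python
import numpy as np

def cosine_similarities(base, matrix):
    """Cosine similarity between one vector and every row of a 2-D array."""
    base = np.asarray(base, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of each row with the base vector, divided by the product of the norms
    return matrix @ base / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(base))

vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarities(vectors[0], vectors)
print(sims)  # similarity of the first vector to itself, an orthogonal vector, and a diagonal one
```

In this notebook the matrix could be built with `np.stack(df["embedding"].to_list())`, assuming every row has an embedding.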

In [None]:
#You can change your chosen movie here also
#chosen_movie = "Some other movie you want to try"

# Use get_closest to get movie that is most similar to your chosen movie
df_results = get_closest(df, chosen_movie)

#Show the top 10 movies with the most similar embeddings
df_results[["title", "description", "similarity"]].head(10)

# Improvements
You've now found recommendations for your chosen movie, congratulations! What do you think about the quality of the recommendations?

Right now, the movie recommendations are based only on the similarity of the descriptions, but more information about the movies is available! Can you think of a way to improve the recommendations? Make sure your method is general, in the sense that it can be applied to any base movie that is chosen!

In [None]:
#Right now, the recommendation score is only the similarity score
df['score'] = df['similarity']

In [None]:
# TODO: Improve the recommendations
df['score'] = df['similarity'] # + df['???']
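One possible direction, sketched under stated assumptions: add a small popularity bonus based on the `revenue` column and leave the chosen movie itself out of the results. The helper `add_score` and the weight of 0.1 are hypothetical choices, and the example uses made-up numbers rather than the real dataset.

```python
import numpy as np
import pandas as pd

def add_score(frame, chosen_movie, popularity_weight=0.1):
    """Combine description similarity with a revenue-based popularity bonus."""
    # The chosen movie always matches itself perfectly, so leave it out
    out = frame[frame["title"] != chosen_movie].copy()
    # Log-scale revenue so that a few blockbusters do not dominate
    log_rev = np.log1p(out["revenue"].clip(lower=0))
    spread = log_rev.max() - log_rev.min()
    # Min-max scale to the range 0..1 (fall back to 0 if all revenues are equal)
    popularity = (log_rev - log_rev.min()) / spread if spread > 0 else 0.0
    out["score"] = out["similarity"].astype(float) + popularity_weight * popularity
    return out.sort_values("score", ascending=False)

toy = pd.DataFrame({
    "title": ["Base", "Close but obscure", "Close and popular", "Unrelated hit"],
    "similarity": [1.0, 0.60, 0.58, 0.10],
    "revenue": [5e8, 1e6, 9e8, 2e9],
})
ranked = add_score(toy, "Base")
print(ranked[["title", "score"]])
```

With a small weight, popularity only breaks near-ties between similar movies; an unrelated blockbuster still ranks last. Other signals such as genre overlap or ratings could be combined the same way if the dataset contains them.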