# Exercise: Recommender systems

In this exercise, we will build a content-based recommendation system using a dataset of Netflix titles. We will preprocess the text data, convert it into numerical features with TF-IDF, and compute item similarities to generate recommendations. This hands-on activity will help us understand and implement key techniques in content-based filtering.

## Learning objectives

By the end of this exercise, we should be able to:
* Understand content-based recommendation systems.
* Clean and preprocess text data.
* Convert text data into numerical features using TF-IDF.
* Compute item similarities using cosine similarity.
* Build and evaluate a content-based recommendation model.

## Introduction

In this notebook, we will build a `content-based recommendation system` using the `Netflix` dataset. The primary goal of this task is to recommend similar titles to users based on the attributes of the media they have already interacted with. This will enhance the user experience by providing personalised content recommendations, thereby increasing user engagement and satisfaction. By predicting which titles a user might enjoy based on their previous interactions, content-based recommendation systems help platforms like `Netflix` keep users engaged and encourage them to explore a broader range of content.

The dataset is derived from Netflix's collection of movies and TV shows. This dataset includes various attributes for each title, such as:

* show_id: Unique identifier for each title.
* type: The type of media (e.g., Movie, TV Show).
* title: The name of the media.
* director: Directors involved in the media.
* cast: Main actors involved in the media.
* country: Countries where the media was produced.
* date_added: The date when the media was added to Netflix.
* release_year: The year the media was released.
* rating: The maturity rating assigned to the media (e.g., TV-PG, TV-MA).
* duration: Duration of the media (e.g., 90 min, 1 Season).
* listed_in: Categories or genres the media belongs to.
* description: Brief summary or synopsis of the media.

The data was collected to provide a comprehensive overview of the available media on `Netflix`. It allows for detailed analysis and exploration of the media's attributes, which is essential for building a recommendation system.

Let's dive in!

Import the necessary libraries and read the data.

In [None]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# For text handling and regular expressions
import re
from sklearn.feature_extraction.text import TfidfVectorizer  # For converting text to numerical data

# For computing cosine similarity
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity 


In [58]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/unsupervised_sprint/netflix_titles.csv', index_col=0)
df.head()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6234 entries, 81145628 to 70153404
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          6234 non-null   object
 1   title         6234 non-null   object
 2   director      4265 non-null   object
 3   cast          5664 non-null   object
 4   country       5758 non-null   object
 5   date_added    6223 non-null   object
 6   release_year  6234 non-null   int64 
 7   rating        6224 non-null   object
 8   duration      6234 non-null   object
 9   listed_in     6234 non-null   object
 10  description   6234 non-null   object
dtypes: int64(1), object(10)
memory usage: 584.4+ KB


## Exercises

In this exercise, we focus on the columns `cast`, `title`, `description`, and `listed_in`. These text features describe what each title is about, who stars in it, and which genres it belongs to, which is the information a content-based filtering approach needs to capture similarities between media items.

### Exercise 1: Data cleaning and preprocessing

Before proceeding with our recommender system, we need to clean and process our data first to get the most accurate results. 

We need to do the following:

* Remove rows with missing (NaN) values in the columns we will use: `cast`, `title`, `description`, and `listed_in`.
<br>
<br>
* Remove punctuation and extra spaces in the text data. This helps to standardise and clean the text, ensuring consistency in the dataset and facilitating accurate analysis and modelling by eliminating unnecessary noise and variations in the text.

**Hint**: 
> * For all the text columns, remove all characters that are not alphanumeric or whitespace.
> * For the 'cast' column, first remove all spaces and then replace commas with spaces. This ensures that the cast members' names are treated as single entities separated by spaces.
<br>
<br>
* Combine the columns `listed_in`, `cast`, `title`, and `description` into a single feature for the recommendation system. This creates a richer and more complete representation of each item, enhancing the effectiveness of the recommendation system by allowing it to consider all aspects of the content simultaneously.<br> 
**Hint**: Remember to drop the individual columns as they are now combined into one.
<br>
<br>   
* Drop the remaining columns so that only `type`, `title`, and `combined` are left: `type` and `title` provide context and identification, while `combined` is the main feature for calculating similarities.
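
The two cleaning rules from the hint can be tried on single strings before applying them to whole columns. This is a minimal sketch using the standard-library `re` module; the helper names `clean_text` and `clean_cast` are hypothetical and are not part of the exercise solution.

```python
import re

def clean_text(text):
    """Drop characters that are neither word characters nor whitespace, then collapse spaces."""
    no_punct = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", no_punct).strip()

def clean_cast(cast):
    """Fuse each name into one token: remove spaces, then turn commas into spaces."""
    return cast.replace(" ", "").replace(",", " ")

# Sample values taken from the first row of the dataset preview
sample_title = "Norm of the North: King Sized Adventure"
sample_cast = "Alan Marriott, Andrew Toth, Brian Dobson"

cleaned_title = clean_text(sample_title)
cleaned_cast = clean_cast(sample_cast)

print(cleaned_title)  # Norm of the North King Sized Adventure
print(cleaned_cast)   # AlanMarriott AndrewToth BrianDobson
```

Fusing each name into one token means that `AlanMarriott` is treated as a single word later on, so two titles only match on cast when they share the same full name rather than just a common first name.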


In [None]:
# 1. Remove rows with NaN values in the columns used by the recommender
df.dropna(subset=['cast', 'title', 'description', 'listed_in'], inplace=True, axis=0)

# Identify text columns (object dtype)
text_columns = df.select_dtypes(include=['object']).columns

# 2. Clean text data
for col in text_columns:
    if col == 'cast':
        # For the 'cast' column: first remove all spaces then replace commas with spaces.
        df[col] = df[col].apply(lambda x: x.replace(" ", "").replace(",", " "))
    else:
        # Remove characters that are not alphanumeric or whitespace using regex
        df[col] = df[col].str.replace(r'[^\w\s]', '', regex=True)
        # Remove extra spaces (including leading/trailing)
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()

# 3. Combine the relevant columns into a single feature.
# Combine listed_in, cast, title, and description into a new column "combined"
df['combined'] = df['listed_in'] + ' ' + df['cast'] + ' ' + df['title'] + ' ' + df['description']

# 4. Drop the individual columns that have been combined.
df.drop(columns=['listed_in', 'cast', 'description'], inplace=True)

# 5. Drop any other columns so that we only have 'type', 'title', and 'combined'
df = df[['type', 'title', 'combined']]

# Display the cleaned DataFrame
df.head()

show_id,type,title,combined
81145628,Movie,Norm of the North King Sized Adventure,Children Family Movies Comedies AlanMarriott A...
80117401,Movie,Jandino Whatever it Takes,StandUp Comedy JandinoAsporaat Jandino Whateve...
70234439,TV Show,Transformers Prime,Kids TV PeterCullen SumaleeMontano FrankWelker...
80058654,TV Show,Transformers Robots in Disguise,Kids TV WillFriedle DarrenCriss ConstanceZimme...
80125979,Movie,realityhigh,Comedies NestaCooper KateWalsh JohnMichaelHigg...


**Note!!**

Dropping rows with NaN values can discard useful information. When values are not missing at random (for example, when they follow a trend or pattern), dropping them can bias the analysis. In such cases, imputation or other specialised techniques for handling missing values may be more suitable. The right choice depends on the data, the context of the analysis, and the goals of the study.
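
To make the trade-off concrete, the sketch below compares dropping rows with one simple alternative: filling missing text with an empty string, so the row is kept but contributes no tokens for that field. The `toy` DataFrame and its values are hypothetical stand-ins for the Netflix data, and empty-string filling is only one possible strategy, not necessarily the best one for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "cast": ["Ann Lee, Bo Kim", np.nan, "Cy Day"],
    "description": ["first", "second", np.nan],
})

text_cols = ["cast", "description"]

# Strategy used in the exercise: drop any row with a missing text field
dropped = toy.dropna(subset=text_cols)

# Alternative: keep every row and replace missing text with an empty string
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("")

print(len(dropped), len(filled))  # 1 3
```

In this toy case, dropping keeps only one of the three titles, while filling keeps all three at the cost of some rows having less text to compare.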

### Exercise 2: Feature extraction
Next, we want to convert the combined text feature into numerical features using TF-IDF.
This enables the application of mathematical and statistical techniques for measuring similarities between different media items. In its raw form, text data cannot be directly used for similarity calculations or machine learning algorithms. By transforming the text into numerical representations, we can leverage these techniques to analyse and compare the content effectively.

* Use TF-IDF to convert the `combined` column into numerical vectors that represent how important each word is to each title. Initialise the TF-IDF vectoriser without specifying any parameters, so that it defaults to single-word tokens (unigrams).
* Compute the cosine similarity between these vectors to measure how similar the titles are.
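
Before running these steps on the full dataset, it can help to see them on a tiny corpus. The sketch below uses three made-up documents (`toy_docs` is hypothetical, not taken from the dataset) to show how shared words translate into higher cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus: the first two documents share three words,
# the third shares none with the others
toy_docs = [
    "kids tv robots adventure",
    "kids tv robots battle",
    "standup comedy special",
]

# Each row of the matrix is the TF-IDF vector of one document
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Entry [i, j] is the cosine similarity between documents i and j
toy_sim = cosine_similarity(toy_matrix)

print(toy_sim.round(2))
```

The diagonal is 1 because every document is identical to itself, the first two documents get a clearly positive score, and the third document scores 0 against both because it shares no words with them.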


In [60]:
# Initialize vectorizer 
tfidf = TfidfVectorizer()

# Convert the 'combined' column to a TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(df['combined'])

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Display the cosine similarity matrix for the first few entries
print(cosine_sim[:5, :5])

[[1.         0.01002575 0.01349178 0.00179204 0.01134698]
 [0.01002575 1.         0.03540881 0.02772973 0.01855767]
 [0.01349178 0.03540881 1.         0.23672432 0.01587988]
 [0.00179204 0.02772973 0.23672432 1.         0.01009383]
 [0.01134698 0.01855767 0.01587988 0.01009383 1.        ]]
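
The imports at the top of the notebook also include `linear_kernel`, which computes plain dot products. Because `TfidfVectorizer` L2-normalises each row by default, the dot product of two TF-IDF rows already equals their cosine similarity, so either function could be used here. The sketch below checks this on a hypothetical toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus used only for this check
toy_docs = ["kids tv robots", "kids tv comedy", "crime drama london"]

# With default settings, every row of the TF-IDF matrix has unit L2 norm
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Dot products of unit-length vectors are cosine similarities
same = np.allclose(cosine_similarity(toy_matrix), linear_kernel(toy_matrix))
print(same)  # True
```

This equivalence only holds while the vectoriser keeps its default L2 normalisation; with a different `norm` setting, `cosine_similarity` is the safer choice.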


**Note!!**

Adjusting parameters like the n-grams in the vectoriser could capture more complex relationships, such as those between cast members. This could potentially lead to recommendations that reflect relationships like sequels or movies with similar casts. Exploring different n-gram ranges can be valuable for enhancing the recommendation system’s performance and capturing nuanced similarities within the text data.
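
As a rough illustration of what changing the n-gram range does, the sketch below fits two vectorisers on a hypothetical pair of documents and compares their vocabularies. The `ngram_range` parameter belongs to `TfidfVectorizer`; the documents themselves are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents in the style of the 'combined' column
toy_docs = [
    "PeterCullen FrankWelker robots",
    "FrankWelker PeterCullen comedy",
]

# Default: single-word tokens only
unigram = TfidfVectorizer().fit(toy_docs)

# Unigrams plus pairs of adjacent words
bigram = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs)

print(sorted(unigram.vocabulary_))
print(sorted(bigram.vocabulary_))
```

The bigram vocabulary contains every single word plus each adjacent pair, such as `petercullen frankwelker`. Tokens are lowercased by default, and word order now matters, so the feature space grows quickly on a real dataset.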

### Exercise 3: Building the recommendation function

Now, we can generate recommendations based on cosine similarity.

Define a function that, given a title, finds similar titles by looking up their cosine similarity scores and returns the top 10 recommendations based on these scores.

In [84]:
def get_recommendations(title, df, cosine_sim):
    """
    Given a title, return the top 10 recommended titles based on cosine similarity.
    """
    # Reset index to ensure alignment with cosine_sim matrix
    df = df.reset_index(drop=True)
    
    # Find the index of the given title (case-insensitive match)
    title_lower = title.lower()
    df['title_lower'] = df['title'].str.lower()
    idx_list = df.index[df['title_lower'] == title_lower].tolist()
    
    if not idx_list:
        return f"Title '{title}' not found in the dataset."
    
    idx = idx_list[0]
    
    # Get the pairwise similarity scores for the given title
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the titles based on similarity score in descending order
    # Skip the first element as it is the same title (similarity score = 1)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
    
    # Get the indices of the top 10 similar titles
    recommended_indices = [i[0] for i in sim_scores]
    
    # Return the recommended titles
    return df.iloc[recommended_indices]['title']
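
One possible variation on the function above is to exclude the query title explicitly instead of skipping the first sorted element, and to return the scores alongside the titles. The sketch below is an assumption-laden alternative, not the exercise solution: `top_k_similar` and the toy similarity matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_similar(idx, sim_matrix, titles, k=10):
    """Return the k titles most similar to row idx, with scores, excluding idx itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                # exclude the query row explicitly
    top = np.argsort(scores)[::-1][:k]   # indices of the k largest scores
    return pd.DataFrame({"title": titles.iloc[top].to_numpy(), "score": scores[top]})

# Hypothetical titles and a hand-written symmetric similarity matrix
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.9, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
])

result = top_k_similar(0, toy_sim, toy_titles, k=2)
print(result)
```

For title `A`, this returns `C` (score 0.9) followed by `B` (score 0.2). Seeing the scores makes it easier to judge whether a recommendation is a strong match or merely the least dissimilar option.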

### Exercise 4: Test the recommender

Say you are trying to decide what to watch next, and you particularly enjoyed the series `The Crown`. Run our recommender for this title and see what recommendations we get.

Would you want to watch any of these titles?


In [85]:
recommendations = get_recommendations("The Crown", df=df, cosine_sim=cosine_sim)
print(f"Recommendations: \n {recommendations}")

Recommendations: 
 369                         Witches A Century of Murder
1829                                         London Spy
5068                                              Reign
2612                                     My Hotter Half
692     The Blue Planet A Natural History of the Oceans
3915                        The Real Football Factories
1753                                         Collateral
5474                                           Lovesick
4830               Planet Earth The Complete Collection
2724                                       Age Gap Love
Name: title, dtype: object


---