# Recommender systems

## Introduction

Recommender systems are among the most popular applications of data science today. They are used to predict the "rating" or "preference" that a user would give to an item. Almost every major tech company has applied them in some form. Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

What's more, for companies such as Netflix, Amazon Prime, Hulu, and Hotstar, the business model and its success revolve around the quality of their recommendations. Netflix even ran a competition, the Netflix Prize, offering a million dollars to anyone who could improve its system by 10%; the prize was awarded in 2009.

There are also popular recommender systems for domains like restaurants, courses, and online dating, and recommender systems have been developed for exploring research articles, experts, collaborators, and financial services. YouTube applies recommendation at a large scale, suggesting videos based on your history: if you watch a lot of educational videos, it will suggest more videos of that type.



Some really good reading that covers what we do in this class and goes beyond it:
- [Recommender | Recommender Systems | Overview of systems](https://towardsdatascience.com/the-4-recommendation-engines-that-can-predict-your-course-tastes-109dc4e10c52) (19 min.) nicely explained with some code snippets.

Please watch the following videos (~60 min.): 
- [Recommender | Recommender Systems | Introduction](https://youtu.be/giIXNoiqO_U) (8 min.) Problem formulation.
- [Recommender | Intro to Recommender Systems](https://youtu.be/gxXn9LDAdcU) (4 min.)
- [Recommender | Types of Recommender Systems](https://youtu.be/QRzfpJa3iJk) (3 min.)
- [Recommender | Content Based](https://youtu.be/IlqnNWuqToo) (21 min.) A bit long but solid foundations.
- [Recommender | Collaborative Filtering](https://youtu.be/3Sl_nFQbLQA) (21 min.).
- Alternatively, go to Coursera and enroll in 'intro to recommender system' (it is **free when you select the 'audit course' option**), then go to week 3, 'Content-Based Filtering Using TFIDF': [Recommender | Content-Based Filtering Using TFIDF](https://www.coursera.org/learn/recommender-systems-introduction/home/week/3) (3 videos, approx. 60 min.)

## But what are these recommender systems?

Broadly, recommender systems can be classified into 3 types:

- **Simple recommenders**: offer generalized recommendations to every user, based on item popularity and/or genre. The basic idea is that items that are more popular and critically acclaimed have a higher probability of being liked by the average audience. An example is the IMDB Top 250.
- **Content-based recommenders**: suggest items similar to a particular item. This system uses item metadata (such as genre, director, description, and actors for movies, or description and objectives for courses) to make these recommendations. The general idea is that if a person likes a particular item, he or she will also like an item that is similar to it, so the system draws on the metadata of items the user has liked in the past. A good example is YouTube, which suggests new videos based on your history.
- **Collaborative filtering engines**: these systems are widely used; they try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.
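
To make the first category concrete, here is a minimal sketch of a simple (popularity-based) recommender. It assumes a hypothetical table with `votes` and `avg_rating` columns, which the course data used later in this tutorial does not contain, and it uses the weighted-rating formula popularised by IMDB as one possible scoring choice. The titles, the numbers, and the threshold `m` are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings summary; titles and numbers are made up for illustration
ratings = pd.DataFrame({
    "title": ["Course A", "Course B", "Course C", "Course D"],
    "votes": [500, 5, 300, 50],
    "avg_rating": [4.5, 5.0, 4.0, 3.0],
})

C = ratings["avg_rating"].mean()  # mean rating across all items
m = 50                            # assumed number of votes needed for "full trust"

def weighted_rating(row, m=m, C=C):
    """Shrink an item's average rating towards the global mean C when it has few votes."""
    v, R = row["votes"], row["avg_rating"]
    return (v / (v + m)) * R + (m / (v + m)) * C

ratings["score"] = ratings.apply(weighted_rating, axis=1)
top = ratings.sort_values("score", ascending=False)
print(top[["title", "score"]])
```

With this scoring, an item with a perfect average but only a handful of votes is pulled towards the global mean, so it does not automatically outrank a well-rated item with many votes.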

#  1- Content Based Recommender 


In this tutorial, you will learn how to build a basic model of content-based recommender systems. While this model will be nowhere close to the industry standard in terms of complexity, quality, or accuracy, it will help you to get started with building more complex models that produce even better results.

Let's get started. Go to the following link to download the dataset:
https://www.kaggle.com/rounakbanik/the-courses-dataset

## The dataset : IMDB 250 Recommender systems
courses_metadata.csv: this file contains information on ~45,000 courses featured in the Full courseLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.

## Preparing the Data

In [8]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

df = pd.read_csv('course_061120.csv')
df.head()

Unnamed: 0,ID,Title,Description,Objectives,Solutions,Duration
0,d.4,Emotional Intelligence,<p>Emotional intelligence is the skill at perc...,<p>Here are the topics you will learn about by...,,<p>2 hour 35 minutes</p>
1,d.5,Stress Management & Balance,<p>This is a self-learning program for learnin...,<p>At the end of this track you will be able t...,,<p>48 minutes.</p>
2,d.6,Time Management & Productivity,<p>You'll learn strategies to better mangage y...,"<p>At the end of this, you will be able to:</p...",,"<p>1 hour, 36 minutes</p>"
3,d.7,Public Speaking,"<p>In today’s business world, most of us need ...","<p>After studying the Public Speaking, you wil...",,"<p>4 hours, 30 min</p>"
4,d.8,Virtual Team Communication,"<p><span style=""color: rgb(78, 78, 78);"">This ...",<p>Here are the topics you will learn about by...,,<p>Maximum 5 hours.</p>


In [9]:
df.shape

(6204, 6)

From the shape, we can see that we have data on 6,204 courses.

We also see that we have 6 columns. Each column represents a feature or a piece of metadata about the course. When we ran df.head(), we saw that most of the columns were truncated to fit in the display. To view all the columns (henceforth called features) we have, we can run the following:

In [10]:
#Output the columns of df
df.columns

Index(['ID', 'Title', 'Description', 'Objectives', 'Solutions', 'Duration'], dtype='object')

From our output, it is quite clear which features we do and do not require. Now, let's reduce our DataFrame to only contain features that we need for our model:

In [11]:
#Only keep those features that we require 
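# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
# when the columns are modified in the cells below
df = df[['Title','Description', 'Objectives', 'Duration']].copy()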

df.head()

Unnamed: 0,Title,Description,Objectives,Duration
0,Emotional Intelligence,<p>Emotional intelligence is the skill at perc...,<p>Here are the topics you will learn about by...,<p>2 hour 35 minutes</p>
1,Stress Management & Balance,<p>This is a self-learning program for learnin...,<p>At the end of this track you will be able t...,<p>48 minutes.</p>
2,Time Management & Productivity,<p>You'll learn strategies to better mangage y...,"<p>At the end of this, you will be able to:</p...","<p>1 hour, 36 minutes</p>"
3,Public Speaking,"<p>In today’s business world, most of us need ...","<p>After studying the Public Speaking, you wil...","<p>4 hours, 30 min</p>"
4,Virtual Team Communication,"<p><span style=""color: rgb(78, 78, 78);"">This ...",<p>Here are the topics you will learn about by...,<p>Maximum 5 hours.</p>


In [12]:
def remove_html_tags(text):
    if pd.isna(text):
        return text
    return BeautifulSoup(text, "html.parser").get_text()

df['Title'] = df['Title'].apply(remove_html_tags)
df['Description'] = df['Description'].apply(remove_html_tags)
df['Objectives'] = df['Objectives'].apply(remove_html_tags)
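
One detail worth checking is how `get_text()` joins text from adjacent tags. By default it concatenates the pieces with no separator, so a paragraph followed by a list item can run together. The sketch below compares the default with `get_text(separator=" ", strip=True)`; the sample HTML string and the helper name `strip_html` are made up for illustration and are not part of the cells above.

```python
from bs4 import BeautifulSoup

# Made-up sample resembling an Objectives cell: a paragraph followed by a list
sample = "<p>At the end you will be able to:</p><ul><li>Identify stress</li></ul>"

def strip_html(text, separator=" "):
    """Hypothetical variant of remove_html_tags that puts a space between text fragments."""
    if not isinstance(text, str):   # leave NaN / None untouched
        return text
    return BeautifulSoup(text, "html.parser").get_text(separator=separator, strip=True)

glued = BeautifulSoup(sample, "html.parser").get_text()   # default: no separator
spaced = strip_html(sample)

print(repr(glued))
print(repr(spaced))
```

The separator matters most when two fragments meet without punctuation between them: `<li>Python</li><li>Java</li>` would otherwise become the single token `PythonJava`.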





With the HTML tags stripped from Title, Description, and Objectives, let us take a look at the cleaned data. Note that Duration was not passed through remove_html_tags, so it still contains tags:

In [13]:
#Print the head of the cleaned DataFrame
df.head(10)

Unnamed: 0,Title,Description,Objectives,Duration
0,Emotional Intelligence,Emotional intelligence is the skill at perceiv...,Here are the topics you will learn about by ta...,<p>2 hour 35 minutes</p>
1,Stress Management & Balance,This is a self-learning program for learning s...,At the end of this track you will be able to:I...,<p>48 minutes.</p>
2,Time Management & Productivity,You'll learn strategies to better mangage your...,"At the end of this, you will be able to:use bo...","<p>1 hour, 36 minutes</p>"
3,Public Speaking,"In today’s business world, most of us need to ...","After studying the Public Speaking, you will:K...","<p>4 hours, 30 min</p>"
4,Virtual Team Communication,This curriculum addresses the challenges of wo...,Here are the topics you will learn about by ta...,<p>Maximum 5 hours.</p>
5,Interpersonal Communication,Good interpersonal communication skills help y...,Here are the topics you will learn about by ta...,<p>Maximum 4 hours.</p>
6,Cross Cultural Awareness and Communication,It is a field of study that looks at how peopl...,Here are the topics you will learn about by ta...,<p>Maximum 4 hours</p>
7,Software Engineering,The Software Engineering Community is a self-l...,Taking this learning program will allow you to...,<p>33:54 hours</p>
8,Problem Solving and Decision Making Fundamentals,In this community you will start by reviewing ...,These are the topics you will learn about by t...,<p>3 hour 52 minutes</p>
9,Web Programming,The Web Programming Community is a self-learni...,Discover and learn how to develop web applicat...,"Up to 17 hours, depending on your level of exp..."


##  Implementing the Content based recommender

In this section, you will learn how to build a system that recommends courses that are similar to a particular course.

Essentially, the models we are building compute the pairwise similarity between bodies of text. In our case we will use the course description and objectives to calculate the similarity between two courses and recommend courses based on that similarity score.

The challenge is that the description is text. Hence, you need to extract some kind of features from the text before you can compute the similarity or dissimilarity between courses. This is done by representing each course as a numerical word vector.

But what are the values of these vectors? That depends on the vectorizer we use to convert our documents into vectors. The two most popular vectorizers are CountVectorizer and TfidfVectorizer. We will use TfidfVectorizer because some words occur much more frequently in descriptions than others. It is therefore a good idea to weight each word in a document according to the TF-IDF formula: **tf-idf = term frequency × inverse document frequency**, where the inverse document frequency is typically the logarithm of the total number of documents divided by the number of documents containing the term.

TF-IDF is used for finding and prioritizing the important words in a document. Every word gets a TF-IDF score, and higher scores indicate core terms. In essence, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across course descriptions and, therefore, their influence on the final similarity score.
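
A worked example may help here. The sketch below computes TF-IDF weights by hand for a made-up three-document corpus, using the smoothed idf and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults (raw counts for tf, `ln((1 + n) / (1 + df)) + 1` for idf). The corpus and helper names are illustrative only, and splitting on whitespace is a simplification of scikit-learn's tokenizer.

```python
import math
from collections import Counter

# Tiny made-up corpus: "learn" appears in every document, the other words in one each
docs = [
    "learn python programming",
    "learn public speaking",
    "learn time management",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(tokens):
    tf = Counter(tokens)                                    # raw term counts
    weights = {t: tf[t] * idf(t) for t in tf}               # tf x idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 length
    return {t: w / norm for t, w in weights.items()}        # scale to unit length

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In every document the ubiquitous word "learn" ends up with a smaller weight than the words that are specific to that document, which is the down-weighting described above.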

Scikit-learn gives you a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.
- Import TfidfVectorizer from scikit-learn;
- Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic;
- Replace not-a-number values with a blank string;
- Finally, construct the TF-IDF matrix on the data.

Additionally, you can watch this video on TF-IDF: https://youtu.be/6HuKFh0BatQ


In [14]:
#Import TfidfVectorizer from the scikit-learn library
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stopwords
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
df['Description'] = df['Description'].fillna('')

#Replace NaN with an empty string
df['Objectives'] = df['Objectives'].fillna('')

df['text'] = df['Description'] + ' ' + df['Objectives']

#Construct the required TF-IDF matrix by applying the fit_transform method on the combined text feature
tfidf_matrix = tfidf.fit_transform(df['text'])


#Output the shape of tfidf_matrix
tfidf_matrix.shape

(6204, 20641)

We see that the vectorizer has created a 20,641-dimensional vector for the combined description and objectives text of every course.
tfidf_matrix is a matrix in which each row represents a course and each column represents a token (word).
It is stored as a SciPy sparse matrix because it consists almost entirely of zeros, with only a handful of
nonzero elements per row; the sparse format is much more efficient storage-wise.
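
To see the sparsity for yourself, a quick check along the following lines can be run. It is a sketch on a made-up three-document corpus rather than on `df['text']`; the names `corpus` and `toy_matrix` are illustrative. On the real data the same attributes (`shape`, `nnz`) apply to `tfidf_matrix`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for df['text']
corpus = [
    "learn python programming",
    "public speaking for managers",
    "time management and productivity",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(corpus)

n_rows, n_cols = toy_matrix.shape
density = toy_matrix.nnz / (n_rows * n_cols)  # share of cells that are non-zero

print(type(toy_matrix).__name__, toy_matrix.shape, sparse.issparse(toy_matrix))
print(f"non-zero entries: {toy_matrix.nnz}, density: {density:.2f}")
```

The larger and more varied the corpus, the lower the density tends to be, because each document uses only a small part of the overall vocabulary.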



Every course is now represented by a TF-IDF keyword vector.

The next step is to calculate the pairwise **cosine similarity score** of every course. In other words, we are going to create a 6,204 × 6,204 matrix, where the cell in the ith row and jth column represents the similarity score between courses i and j. We can easily see that this matrix is symmetric and that every element on the diagonal is 1, since it is the similarity score of a course with itself.

You will use the cosine similarity to calculate a numeric quantity that denotes the similarity between two courses. The cosine similarity score is independent of magnitude and is relatively easy and fast to calculate, especially in conjunction with TF-IDF scores: because TfidfVectorizer normalizes every row to unit length by default, the dot product of two rows is already their cosine similarity, which is why linear_kernel can be used to compute it. The higher the cosine similarity (equivalently, the lower the cosine distance), the closer the courses are.
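
The formula behind the score is cos(θ) = (a · b) / (‖a‖ ‖b‖). The NumPy sketch below, using two made-up vectors, illustrates the two properties relied on here: rescaling a vector does not change the score, and for unit-length vectors the plain dot product gives the same number. The function name `cosine_similarity_1d` is hypothetical.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

sim = cosine_similarity_1d(a, b)
sim_scaled = cosine_similarity_1d(10 * a, b)   # same direction, 10x the length

# After scaling both vectors to unit length, the plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = float(np.dot(a_unit, b_unit))

print(sim, sim_scaled, dot_of_units)
```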

Note: the following command could take some time to complete.





In [15]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [16]:
cosine_sim.shape

(6204, 6204)

In [17]:
cosine_sim[1]

array([0.04074644, 1.        , 0.12493763, ..., 0.01987448, 0.        ,
       0.        ])

With the similarity score of every course against every other course, we are now in a good position to write our final recommender function.
First, we need a mechanism to identify the index of a course in the DataFrame, given its title.
Let's create a reverse mapping: a pandas Series with the course title as the index and the corresponding row index of the main DataFrame as the value.

In [18]:
#Construct a reverse mapping from course titles to row indices
indices = pd.Series(df.index, index=df['Title'])
#Drop duplicate titles, if any, keeping the first occurrence
indices = indices[~indices.index.duplicated(keep='first')]


In [19]:
indices[:10]

print(indices['Emotional Intelligence'])

0
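
Dropping duplicate titles matters because looking up a label that occurs more than once in a Series returns several values instead of a single index, which would break the lookup into cosine_sim. The sketch below shows the mapping on a made-up three-row catalogue; `toy` and `toy_indices` are illustrative names.

```python
import pandas as pd

# Made-up catalogue with one duplicated title
toy = pd.DataFrame({"Title": ["Public Speaking", "Web Programming", "Public Speaking"]})

# Title -> row position
toy_indices = pd.Series(toy.index, index=toy["Title"])
print(len(toy_indices["Public Speaking"]))   # duplicated label: two values come back

# Keep only the first occurrence of each title so every lookup is a single integer
toy_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]

print(toy_indices["Web Programming"])
print(toy_indices["Public Speaking"])
```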


We will perform the following steps in building the recommender function:

- Declare the title of the course as an argument.
- Obtain the index of the course from the indices reverse mapping.
- Get the list of cosine similarity scores for that particular course with all courses using cosine_sim. Convert this into a list of tuples where the first element is the position and the second is the similarity score.
- Sort this list of tuples on the basis of the cosine similarity scores.
- Get the top 10 elements of this list. Ignore the first element as it refers to the similarity score with itself (the course most similar to a particular course is obviously the course itself).
- Return the titles corresponding to the indices of the top 10 elements, excluding the first:

In [20]:
# Function that takes in course title as input and gives recommendations 
def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Obtain the index of the course that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all courses with that course
    # And convert it into a list of tuples as described above
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the courses based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar courses. Ignore the first course.
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar courses
    return df['Title'].iloc[course_indices]
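
The function above drops the first element of the sorted list on the assumption that it is the query course itself. If another course has identical text, its score ties at 1 and the query course may not sort first. As one possible variation (not part of the function above), the sketch below removes the query by index instead and also returns the scores; `top_similar` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def top_similar(title, sim_matrix, titles, indices, n=10):
    """Return the n most similar titles with scores, skipping the query course by index."""
    idx = indices[title]
    scores = np.asarray(sim_matrix[idx], dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest similarity first
    order = order[order != idx][:n]              # drop the query itself, keep n
    return pd.Series(scores[order], index=titles.iloc[order].to_numpy())

# Made-up 4x4 similarity matrix; courses 0 and 1 are exact duplicates of each other
toy_titles = pd.Series(["A", "A (copy)", "B", "C"])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])

result = top_similar("A (copy)", toy_sim, toy_titles, toy_indices, n=2)
print(result)
```

On the real data the call would look like `top_similar('Emotional Intelligence', cosine_sim, df['Title'], indices)`, assuming the objects built in the cells above.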

**Congratulations!** You've built your very first content-based recommender. Now it is time to see our recommender in action! Let's ask it for recommendations of courses similar to Emotional Intelligence:



In [None]:
#Get recommendations for Emotional Intelligence
content_recommender('Emotional Intelligence')

4656                         Navigating Your Own Emotions
1516    Emotional Intelligence: Being Aware of the Emo...
4657                   Navigating Other People's Emotions
1517    Emotional Intelligence: Building Self-Manageme...
4661          Emotional Intelligence: Applying EI at Work
4893     Learn - Emotional Intelligence / Motivate others
4213                    Leveraging Emotional Intelligence
4659                  Leading with Emotional Intelligence
4655               Developing Your Emotional Intelligence
3360    Emotional Intelligence talk with Niklas Nordli...
Name: Title, dtype: object