## Introduction
Recommendation systems are a collection of algorithms that recommend items to users based on information about those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases, and job finders.

In this notebook, we walk through an implementation of a **content-based** recommendation system.

The notebook has 2 `Parts`:
1. Data cleaning: This part has 2 **Step**s, cleaning the `movies` dataframe and cleaning the `ratings` dataframe.
2. Implementing a content-based recommender system: This part is also divided into 2 **Step**s. First the user input is processed, then the input's preferences are learned. It closes with a short analysis of the technique, showing its pros and cons.

## Importing Necessary Libraries

In [18]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

## Reading in the Data Files 

In [19]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')

# Part 1: Cleaning the Dataframes 

### Step 1: Cleaning the `movies_df` dataframe

The first thing that comes to mind is to separate the **year** from the `title` column and add it as a new column to `movies_df`.

One can do so with pandas and regular expressions, as below:

In [20]:
# extracting the year from the 'title' column and adding it to a new 'year' column
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# removing the parentheses from the 'year' column
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# removing the years from the 'title' column
# (regex=True is needed in recent pandas versions, where str.replace is literal by default)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# applying the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())



After that, the dataframe looks like this:

In [21]:
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017
9739,193585,Flint,Drama,2017
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018
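
As a side note, the year extraction above can probably be done in a single pass by putting the capture group inside the parentheses. The following is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its rows are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Toy rows standing in for the `title` column (hypothetical data)
toy_df = pd.DataFrame({"title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled Film"]})

# The capture group holds only the four digits, so no second pass is needed
toy_df["year"] = toy_df["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" part, then trim any leftover whitespace
toy_df["title"] = toy_df["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(toy_df)
```

Titles without a year in parentheses keep their text and get a missing value in `year`, which is worth checking for before using the column.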


Also, it would be beneficial to split the values in the `genres` column into a **list of genres**. Since every genre is separated by a `|`, we simply have to call the split function to get the job done.

Since keeping genres in a list format isn't optimal for the content-based recommendation technique, we will use One Hot Encoding to convert the list of genres to a vector where each column corresponds to one possible value of the feature. This encoding is needed to feed categorical data into computations that expect numeric input.

In this case, we store every different genre in columns that contain values equal to either 1 or 0 (as a boolean representation), so `1` shows that a movie has that genre and `0` shows that it doesn't. 

Let's also store this dataframe in another variable since genres won't be important for our first recommendation system.


In [22]:
movies_df['genres'] = movies_df.genres.str.split('|')

# copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
movies_with_genres_df = movies_df.copy()

# for every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        movies_with_genres_df.at[index, genre] = 1

# filling in the NaN values with 0 to show that a movie doesn't have that column's genre
movies_with_genres_df = movies_with_genres_df.fillna(0)
movies_with_genres_df.head()



Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
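
The row-by-row loop above works, but on larger datasets it can be slow. A possible vectorized alternative is `Series.str.get_dummies`, which splits on a separator and builds the 0/1 columns in one call. The sketch below uses a hypothetical toy dataframe (`toy_movies`) and assumes the `genres` column still holds the raw `|`-separated strings, that is, before it has been split into lists:

```python
import pandas as pd

# Hypothetical toy data with raw pipe-separated genre strings
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": ["Adventure|Animation|Children", "Adventure|Children", "Comedy|Romance"],
})

# One integer 0/1 column per distinct genre found in the strings
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original rows
toy_encoded = pd.concat([toy_movies, genre_flags], axis=1)

print(toy_encoded)
```

Unlike the loop, this yields integer columns rather than floats and needs no `fillna(0)` step.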


### Step 2: Cleaning the `ratings_df` dataframe
Now it's time to clean the `ratings_df` dataframe.


In [23]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Every row in the ratings dataframe holds a user ID, the ID of a movie that user rated, the rating, and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save memory.


In [24]:
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()



Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


Now the dataframes are cleaned and ready for use.

# Part 2: Content-Based recommendation system

This technique attempts to figure out what a user's favourite aspects of an item are, and then recommends items that present those aspects.

In our case, we're going to try to figure out the input's favorite genres from the movies and ratings given.

Let's begin with an arbitrary input user as an example to recommend movies to:


## Step 1: User Input Processing

In [25]:
# remember that you can write any arbitrary input instead of the one presented below
user_input = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
            ] 
input_movies = pd.DataFrame(user_input)
input_movies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


#### Adding the `movieId` column

To better analyse the user input, we can extract the input movies' IDs from the movies dataframe (`movies_df`) and add them to the above dataframe (`input_movies`).

We can achieve this by first filtering out the rows that contain the input movies' titles and then merging this subset with the input dataframe. It's good practice to drop unnecessary columns from the input to save memory space.


In [26]:
# filtering out the movies by title
user_input_titles = input_movies['title'].tolist()  # put all the titles in a python list
input_id = movies_df[movies_df['title'].isin(user_input_titles)]  # store all rows with titles in the list above  

# merge the dataframes `input_id` and `input_movies` on their common column, 'title'
input_movies = pd.merge(input_id, input_movies, on='title')

# dropping columns we won't need
input_movies = input_movies.drop(columns=['genres', 'year'])

# the final input dataframe

# if a movie you added above isn't here, then it might not be in the original
# dataframe or it might be spelled differently, so please check capitalisation.
input_movies



Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
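
Because an inner merge silently drops titles that are missing or spelled differently, it may help to detect them explicitly. The sketch below is one way to do that, assuming a left merge with `indicator=True`. The `catalog` and `rated` dataframes are hypothetical stand-ins for `movies_df` and the user input:

```python
import pandas as pd

# Hypothetical stand-ins for the movie catalog and the user's ratings
catalog = pd.DataFrame({
    "movieId": [1, 2, 296],
    "title": ["Toy Story", "Jumanji", "Pulp Fiction"],
})
rated = pd.DataFrame({
    "title": ["Toy Story", "Pulp Fiction", "Not A Real Movie"],
    "rating": [3.5, 5.0, 4.0],
})

# A left merge keeps every rated title; `indicator=True` adds a `_merge` column
merged = rated.merge(catalog, on="title", how="left", indicator=True)

# Titles with no match in the catalog
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()

# Matched rows only, with movieId restored to an integer type
matched = merged.loc[merged["_merge"] == "both", ["movieId", "title", "rating"]]
matched = matched.astype({"movieId": "int64"})

print("Not found:", unmatched)
print(matched)
```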


So now we have the `movieId` column added to the `input_movies` dataframe we had previously.

Now, we can start learning the input's preferences, so let's get the subset of movies that the input has watched from the dataframe containing genres defined with binary values.


In [27]:
# filtering out the movies from the input
user_input_id = input_movies['movieId'].tolist()
user_movies = movies_with_genres_df[movies_with_genres_df['movieId'].isin(user_input_id)]
user_movies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
973,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1445,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Since we'll only need the actual genre table, we can clean up the dataframe a bit by resetting the index and dropping the `movieId`, `title`, `genres` and `year` columns.


In [28]:
# resetting the index 
user_movies = user_movies.reset_index(drop=True)

# dropping unnecessary columns 
user_genre_table = user_movies.drop(columns=['movieId', 'title', 'genres', 'year'])

user_genre_table



Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Step 2: Learning the input's preferences
To do so, we're going to turn each genre into a weight. We can do this by multiplying each movie's genre row by the input's rating for that movie and then summing the resulting table by column. This operation is a **dot product** between a matrix and a vector, so we can compute it by calling the Pandas **dot** function.

In [29]:
input_movies['rating']

0    3.5
1    2.0
2    5.0
3    4.5
4    5.0
Name: rating, dtype: float64

In [30]:
# dot product to get weights
user_profile = user_genre_table.transpose().dot(input_movies['rating'])

user_profile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
War                    0.0
Musical                0.0
Documentary            0.0
IMAX                   0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
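
To see where these numbers come from, the dot product can be checked by hand for a few genres. The sketch below restricts the genre table to three columns (Adventure, Comedy, Drama) and uses NumPy directly. The array names are illustrative and not part of the notebook:

```python
import numpy as np

# Ratings for Toy Story, Jumanji, Pulp Fiction, Akira, The Breakfast Club
ratings = np.array([3.5, 2.0, 5.0, 4.5, 5.0])

# Rows: the same five movies; columns: Adventure, Comedy, Drama
genre_flags = np.array([
    [1, 1, 0],  # Toy Story
    [1, 0, 0],  # Jumanji
    [0, 1, 1],  # Pulp Fiction
    [1, 0, 0],  # Akira
    [0, 1, 1],  # The Breakfast Club
])

# Each weight is the sum of the ratings of the movies that carry that genre
profile = genre_flags.T @ ratings

print(profile)
```

For example, the Adventure weight is 3.5 + 2.0 + 4.5 = 10.0, which matches the value in the profile above.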

Now we have a weight for each of the user's preferences. This is known as the **User Profile**.

Using this, we can recommend movies that satisfy the user's preferences.


In [31]:
# get the genres of every movie in our original dataframe, indexed by movieId
genre_table = movies_with_genres_df.set_index('movieId')

# drop the unnecessary information
genre_table = genre_table.drop(columns=['title', 'genres', 'year'])

genre_table.head()



movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With the input's profile and the complete list of movies and their genres in hand, we're going to compute a weighted score for every movie (the sum of the profile weights of its genres, divided by the sum of all weights) and recommend the top twenty movies that most satisfy the profile.


In [32]:
# multiply the genres by the weights and then take the weighted average
recommendation_table_df = ((genre_table * user_profile).sum(axis=1)) / (user_profile.sum())

recommendation_table_df.head()

movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
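
As a sanity check, the first two scores can be reproduced from the user profile alone. The sketch below copies the non-zero profile weights from the earlier output and rebuilds the genre rows for movies 1 and 2. The variable names are illustrative:

```python
import pandas as pd

# Non-zero weights copied from the user profile output (zero-weight genres omitted,
# which does not change the sum)
user_profile_sketch = pd.Series({
    "Adventure": 10.0, "Animation": 8.0, "Children": 5.5, "Comedy": 13.5,
    "Fantasy": 5.5, "Drama": 10.0, "Action": 4.5, "Crime": 5.0,
    "Thriller": 5.0, "Sci-Fi": 4.5,
})

# Genre flags for Toy Story (movieId 1) and Jumanji (movieId 2)
genre_rows = pd.DataFrame(
    0.0, index=pd.Index([1, 2], name="movieId"), columns=user_profile_sketch.index
)
genre_rows.loc[1, ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]] = 1.0
genre_rows.loc[2, ["Adventure", "Children", "Fantasy"]] = 1.0

# Sum of matching weights divided by the sum of all weights
scores = genre_rows.dot(user_profile_sketch) / user_profile_sketch.sum()

print(scores.round(6))
```

Toy Story scores 42.5 / 71.5 and Jumanji 21.0 / 71.5, in line with the first two values shown above. Note that this score favours movies tagged with many genres, since each extra matching genre can only add to the sum.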

In [33]:
# sort recommendations in descending order
recommendation_table_df = recommendation_table_df.sort_values(ascending=False)

recommendation_table_df.head()

movieId
134853    0.734266
148775    0.685315
117646    0.678322
6902      0.678322
81132     0.671329
dtype: float64

## The Final Result of the Content-Based Technique


In [34]:
movies_df.loc[movies_df['movieId'].isin(recommendation_table_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
559,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
1390,1907,Mulan,"[Adventure, Animation, Children, Comedy, Drama...",1998
2250,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
3460,4719,Osmosis Jones,"[Action, Animation, Comedy, Crime, Drama, Roma...",2001
4631,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
5490,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
5819,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
6047,40339,Chicken Little,"[Action, Adventure, Animation, Children, Comed...",2005
6448,51939,TMNT (Teenage Mutant Ninja Turtles),"[Action, Adventure, Animation, Children, Comed...",2007
6455,52287,Meet the Robinsons,"[Action, Adventure, Animation, Children, Comed...",2007
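
Note that filtering with `isin` returns the movies in catalog order, not in score order, and it does not exclude movies the user has already rated. One possible refinement, sketched here on hypothetical toy data (`toy_catalog`, `toy_scores` and `already_rated` are made up for illustration), is to drop the rated IDs, sort, and then merge the titles onto the ranked scores:

```python
import pandas as pd

# Hypothetical catalog and scores keyed by movieId
toy_catalog = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["A", "B", "C", "D"],
})
toy_scores = pd.Series(
    [0.2, 0.9, 0.5, 0.7], index=pd.Index([10, 20, 30, 40], name="movieId")
)
already_rated = [20]  # assumption: IDs of movies the user has already rated

ranked = (
    toy_scores.drop(index=already_rated)   # skip movies the user already rated
    .sort_values(ascending=False)          # best score first
    .head(2)                               # keep the top N (2 here)
    .rename("score")
    .reset_index()                         # movieId becomes a regular column
    .merge(toy_catalog, on="movieId", how="left")  # left merge keeps the ranking order
)

print(ranked)
```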


## Advantages and Disadvantages of Content-Based Filtering

#### Advantages
Content-based filtering offers several advantages. Firstly, it can effectively learn and adapt to individual user preferences over time, leading to highly personalized recommendations. This personalized approach can result in increased user satisfaction and engagement with the recommendations provided.

#### Disadvantages
However, content-based filtering has its limitations. Since it primarily relies on analyzing the characteristics of items and user preferences, it may overlook the influence of social validation or collective opinions on the quality of recommendations, potentially leading to the promotion of low-quality items. Additionally, the process of extracting relevant data for content-based filtering systems can be complex, and identifying the specific characteristics that drive user preferences or dislikes may not always be straightforward.