# Content based recommendation

A second way to recommend movies for a user is using movies a user has liked in the past. The premise is that by finding movies similar to movies a user has rated high, we can assume that the user would like those movies as well.
<br><br>
For example, let's assume user 1 likes Toy Story and Cars, we can assume that the user likes animated movies or that are kid friendly or that the user likes movies made by directors John Lasseter etc and recommend accordingly.
<br>
<pre>
For our set of movies we are going to focus on the following features:
i The genres of the movie
ii The tags associated with a movie
iii The director/actors of a movie
</pre>

The steps we'll follow to find similar movies are as follows:
1. Combine the movie name, genre, tags, director and actor name to a single string
2. Perform count vectorization and convert the strings to vectors
3. Find similarities between the vectors using Cosine or Jaccard similarities
4. For a particular movie, extract the respective row.
5. Sort in descending order, the movie with highest similarity should be the movie to be recommended.

In [1]:
user_id = 10

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [3]:
ratings = pd.read_csv('Data/ratings.csv')
movies = pd.read_csv('Data/movies.csv')
crew= pd.read_csv('Data/crew.csv')
tags = pd.read_csv('Data/tags.csv')

In [4]:
user_movies = ratings[((ratings.userId == user_id)& (ratings.rating >= 4))]

### Preparing the string for comparison

The first step would be to create the string using which we would compare the movies. We need one single string to perform the comparison. Therefore, we will create a string that combines the movie name, genres, tags, director and cast. 
<br> We are going to perform a few enhancements to our string though.<br>
1. The Director of a movie generally plays a major role when it comes to choosing a movie based on another movie. As we wouldn't be able to add weights to a part of the string, we would repeat the director name twice to add more weightage.<br>
2. We are going to remove the space between the first name and last name as we don't want a common first name or last name to skew the result 

In [5]:
## Preparing the director details
crew['Director'] = crew['Director'].str.replace(" ","")

In [6]:
## Preparing the actors details
crew['Cast'] = crew['Cast'].str.replace(" ","")
crew['Cast'] = crew['Cast'].str.replace("|"," ")

In [7]:
## Preparing the genre details 
movies['genres'] = movies['genres'].str.replace("|"," ")

In [8]:
## Preparing the tags detailes
tags = tags.groupby(['movieId'])['tag'].apply(' '.join).reset_index()

In [9]:
## Joining the dataframes together
intermediate_df = pd.merge(movies,crew, left_on='movieId',right_on = 'MovieId',how='left')
final_df = pd.merge(intermediate_df, tags, on = 'movieId', how='outer')
final_df['tag'] = final_df['tag'].replace(np.nan,'')


In [10]:
metadata = final_df['title']+' '+ final_df['genres']+' '+ final_df['genres']+ ' '+ final_df['tag']+' '+final_df['Cast']+' '+final_df['Director']+' '+final_df['Director']+' '+final_df['Director']

### Vectorizing the string

There are 2 techniques we could use for vectorizing the string, either a Count vectorizer or tfidf, Term frequency Inverse document frequency. <br><br>
In the <b>Count Vectorizer</b>, we give more weightage to words appearing several times. In our metadata, we have added director twice and hence that would be given more weightage.<br><br>
<b>tfidf </b>does the opposite where it penalizes more frequently occuring words. This would be more applicable if we were taking into consideration the description of the movies. In such cases, the words " a the in on" could occur in almost all descriptions. Providing more weightage to these words could skew the results. 
<br><br>
Since we are not using description and just succinct tags, we would be using count vectorizer.

In [11]:
metadata = metadata.astype(str)

In [12]:
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(metadata)

### Finding similarities using Cosine similarities method

Now that we have created matrix to represent each of the movies' metadata features, we will now find similarities between the movies. <br>
For this we would be using Cosine values to measure similarities.<br>
The reason we use cosine is that <b>larger the distance between the 2 vectors, smaller the distance</b> value will be. This linearity is helpful in finding out the most closely related movies.

In [13]:
cosine_matrix = cosine_similarity(matrix)
a = cosine_similarity(matrix)

### Extracting the cosine matrix row for a particular movie

We now will need to extract the corresponding row in the cosine_matrix for a particular movie. 
<br>
The user must enter the movie name as in the movies.csv file, ie along with the year of release
eg Toy Story (1995)<br>
From that we calculate the <b>movie id</b> which in term we use to find the **index** in the final_df dataframe.<br>
We then use this index to extract a **particular row** from the cosine matrix

In [14]:
## Enter movie name here
movie_name = 'Harry Potter and the Chamber of Secrets (2002)'

In [15]:
selected_movie = final_df[final_df['title'].str.find(movie_name)==0]
selected_index = selected_movie.iloc[0].name
selected_index

4076

In [16]:
movie_row = cosine_matrix[selected_index]
movie_row

array([0.29201253, 0.37605072, 0.09231862, ..., 0.11396058, 0.        ,
       0.        ])

### Sorting the resulting row in descending order and finding top 5 recommended movies

From the previous step we have a row that corresponds to a particular movie. This list now contains the cosine similarity values. We want to find the top 5 values and the movies corresponding to that

In [17]:
movie_df = pd.DataFrame(movie_row)
movie_df_sorted = movie_df.sort_values(0,ascending=False)

In [18]:
for i in range(6):
    if(i==0):
        continue
    print(final_df.iloc[movie_df_sorted.iloc[i].name].title)

Pokémon: The First Movie (1998)
Harry Potter and the Prisoner of Azkaban (2004)
Lord of the Rings: The Fellowship of the Ring, The (2001)
Harry Potter and the Goblet of Fire (2005)
Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)


### Evaluation

In [19]:
user_prediction = []

In [20]:
user_list = ratings.userId.unique()

In [21]:
for user_id in user_list:
    user_rated_movies = ratings[(ratings.userId==user_id) & (ratings.rating >3)]
    msk = np.random.rand(len(user_rated_movies)) < 0.85
    training_data = user_rated_movies[msk]
    test_data = user_rated_movies[~msk]
   
    pred_movie_id = []
    flag=0
    for movieId in training_data.movieId:
        selected_movie_u = final_df[final_df.movieId == movieId]
        selected_index_u = selected_movie_u.iloc[0].name
        movie_row = cosine_matrix[selected_index_u]
        for i in range(1,20):
            pred_movie_id.append(final_df.iloc[movie_df_sorted.iloc[i].name].movieId)
    for movieId in test_data.movieId:
        if(movieId in pred_movie_id):
            flag+= 1
    if(len(test_data.index!=0)):
        user_prediction.append(flag/len(test_data.index))

In [22]:
np.mean(user_prediction)*100

1.6233515581976543