<img src="google_search_movie.png">

There are three ways to build a recommendation engine:

1. Popularity based recommendation engine
2. Content based recommendation engine
3. Collaborative filtering based recommendation engine

Let's discuss the differences between these three:

### Popularity based recommendation engine:

This is the simplest kind of recommendation engine that we will come across. The trending list we see on YouTube or Netflix is based on this kind of algorithm. It keeps track of the view count for each movie/video and then lists the movies by views in descending order. Very simple, yet effective.
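As a minimal sketch of this idea, we can sort a small table by view count and keep the top entries. The titles, the view counts and the column names `title` and `views` are made up for illustration; a real system would read them from its own usage data.

```python
import pandas as pd

# Hypothetical view counts, invented for this example
views = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "views": [1200, 5400, 300, 2500],
})

# Sort by view count, highest first, and keep the top 3 as the "trending" list
trending = views.sort_values("views", ascending=False).head(3)
trending_titles = trending["title"].tolist()
print(trending_titles)
```

Every user gets the same list here, which is the main limitation of this approach.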


### Content based recommendation engine:

This type of recommendation system takes a movie that a user currently likes as input. Then it analyzes the contents (storyline, genre, cast, director etc.) of the movie to find other movies with similar content. Finally, it ranks the similar movies according to their similarity scores and recommends the most relevant ones to the user.

### Collaborative filtering based recommendation engine:

This algorithm first tries to find similar users based on their activities and preferences (for example, both users watch the same type of movies, or movies directed by the same director). Then, for two such users (A and B), if user A has seen a movie that user B has not seen yet, that movie gets recommended to user B, and vice versa. In other words, the recommendations get filtered based on the collaboration between similar users’ preferences (thus the name “Collaborative Filtering”).
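The description above can be sketched on a tiny, made-up watch history. This is only an illustration under stated assumptions: the `watched` matrix, the movie names and the helper `cosine` are hypothetical, and comparing users by the cosine similarity of their watch vectors is just one possible choice.

```python
import numpy as np

# Hypothetical watch matrix: rows are users, columns are movies (1 = watched)
movies = ["Movie A", "Movie B", "Movie C", "Movie D"]
watched = np.array([
    [1, 1, 1, 0],  # user 0
    [1, 1, 0, 0],  # user 1, the user we want recommendations for
    [0, 0, 1, 1],  # user 2
])

def cosine(u, v):
    # Cosine of the angle between two watch vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 1

# Similarity between the target user and every other user
sims = {
    other: cosine(watched[target], watched[other])
    for other in range(len(watched))
    if other != target
}
most_similar = max(sims, key=sims.get)

# Recommend what the most similar user has watched but the target has not
recommended = [
    movies[i]
    for i in range(len(movies))
    if watched[most_similar, i] == 1 and watched[target, i] == 0
]
print(most_similar, recommended)
```

Here user 0 shares two movies with user 1, so the one movie that only user 0 has seen becomes the recommendation.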

#### Another type of recommendation system can be created by mixing properties of two or more types of recommendation systems. Such systems are known as hybrid recommendation systems.
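One simple way (among several) to mix two recommenders is to take a weighted average of their scores. The sketch below assumes both systems already produce scores between 0 and 1 for the same movies; the scores, the movie names and the `weight` value are invented for illustration.

```python
# Hypothetical scores from two separate recommenders, both scaled to [0, 1]
content_scores = {"Movie A": 0.9, "Movie B": 0.4, "Movie C": 0.6}
collab_scores = {"Movie A": 0.2, "Movie B": 0.8, "Movie C": 0.7}

# Share of the final score that comes from the content based recommender
weight = 0.5

# Blend the two scores for every movie
hybrid_scores = {
    title: weight * content_scores[title] + (1 - weight) * collab_scores[title]
    for title in content_scores
}

# Rank the movies by blended score, highest first
ranked_titles = sorted(hybrid_scores, key=hybrid_scores.get, reverse=True)
print(ranked_titles)
```

Changing `weight` shifts the ranking toward one recommender or the other.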

Here we are going to implement a content based recommendation system using the scikit-learn library.

### Finding the similarity

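To find the similarity, we'll use cosine similarity, which is the cosine of the angle between two vectors: for word-count vectors it is 1 when the two texts use the same words in the same proportions and 0 when they share no words. We will represent the texts as vectors using the `CountVectorizer` class from the `sklearn.feature_extraction.text` module, and then compute the similarity using the `cosine_similarity()` function from the `sklearn.metrics.pairwise` module.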
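Before applying this to the movie data, here is a small worked example on two short texts. The texts and the variable names (`texts`, `cv_demo`, `counts`, `sim`) are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["London Paris London", "Paris Paris London"]

# Turn each text into a vector of word counts (words are lowercased by default)
cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(texts)

# Mapping from each word to its column in the count matrix
print(cv_demo.vocabulary_)

# One row per text, one column per word: [[2 1], [1 2]]
print(counts.toarray())

# Pairwise cosine similarity between the two texts
sim = cosine_similarity(counts)
print(sim)
```

The two count vectors are (2, 1) and (1, 2), so their cosine similarity is (2·1 + 1·2) / (√5 · √5) = 0.8, while each text has similarity 1 with itself.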

### Building the recommendation engine:

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("tmdb_5000_movies.csv")

If we inspect the dataset, we will see that it holds a lot of extra information about each movie. We don’t need all of it. So, we choose the keywords, cast, genres and director columns as our feature set (the so-called “content” of the movie).

In [2]:
features = ['keywords','cast','genres','director']

Next, we create a function that combines the values of these columns into a single string.

In [3]:
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

**Preprocessing the data for use. We will fill all the NaN values in the feature columns with empty strings, because concatenating a string with NaN would raise an error.**

In [4]:
for feature in features:
    # Replace missing values (NaN) with an empty string
    df[feature] = df[feature].fillna('')

# Apply combine_features() to each row of the dataframe and store
# the combined string in the "combined_features" column
df["combined_features"] = df.apply(combine_features, axis=1)
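To see why the NaN values have to be filled first, here is a small sketch on a toy dataframe. The dataframe `toy`, its two columns and the helper `combine` are made up for illustration; they only mimic the shape of the real data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing keyword entry
toy = pd.DataFrame({
    "keywords": ["space war", np.nan],
    "genres": ["Action", "Drama"],
})

def combine(row):
    # Same idea as combine_features(), but for the two toy columns
    return row["keywords"] + " " + row["genres"]

# NaN is a float, so adding a string to it raises a TypeError
try:
    toy.apply(combine, axis=1)
    failed = False
except TypeError:
    failed = True
print(failed)

# After filling NaN with empty strings, the concatenation works for every row
filled = toy.fillna("")
combined = filled.apply(combine, axis=1).tolist()
print(combined)
```

The second row simply ends up without keywords instead of stopping the whole computation.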

In [5]:
df.iloc[0].combined_features

'culture clash future space war space colony society Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Michelle Rodriguez Action Adventure Fantasy Science Fiction James Cameron'

Now we have the combined strings. Next, we feed them to a `CountVectorizer()` object to get the count matrix.

In [6]:
#creating new CountVectorizer() object
cv = CountVectorizer()

#feeding combined strings(movie contents) to CountVectorizer() object
count_matrix = cv.fit_transform(df["combined_features"])

Now, we need to obtain the cosine similarity matrix from the count matrix.

In [7]:
cosine_sim = cosine_similarity(count_matrix)

In [8]:
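def get_title_from_index(index):
    # Look up the title stored at the given dataframe index
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    # Use the dataframe's own index, so no separate "index" column is required
    return df[df.title == title].index[0]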
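Note that a title lookup like the one above raises an `IndexError` when the title is not in the dataset. A hedged sketch of a variant that fails with a clearer message is shown below; the function `find_index` and the dataframe `toy` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the movie dataframe
toy = pd.DataFrame({"title": ["Avatar", "Dracula Untold"]})

def find_index(frame, title):
    # Index labels of all rows whose title matches exactly
    matches = frame[frame["title"] == title].index
    if len(matches) == 0:
        raise ValueError(f"Movie not found in dataset: {title!r}")
    return int(matches[0])

found = find_index(toy, "Dracula Untold")
print(found)

# A title that is not in the data now gives a readable error
try:
    find_index(toy, "No Such Movie")
    missing_error = None
except ValueError as err:
    missing_error = str(err)
print(missing_error)
```

The match is exact and case-sensitive, so the title has to be spelled the same way as in the dataset.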

Our next step is to get the title of the movie that the user currently likes and find the index of that movie. We then access the row corresponding to this movie in the similarity matrix, which gives us the similarity scores between this movie and every movie in the dataset. Finally, we enumerate over those scores to make a list of (movie index, similarity score) tuples.

In [9]:
movie_user_likes = "Dracula Untold"
movie_index = get_index_from_title(movie_user_likes)

# Access the row for the given movie, which holds its similarity scores,
# and pair each score with its movie index
similar_movies = list(enumerate(cosine_sim[movie_index]))

Now we will sort the list `similar_movies` according to similarity scores in descending order. Since the most similar movie to a given movie will be itself, we will discard the first element after sorting the movies.

In [10]:
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
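The enumerate-and-sort step can be followed on a toy row of scores. The scores and `liked_index` below are invented for illustration. The second variant is an alternative, not the notebook's method: it removes the liked movie by its index, which also works if another movie happens to tie with it at the top.

```python
# Hypothetical similarity scores of four movies to the movie at index 1
scores = [0.2, 1.0, 0.7, 0.0]
liked_index = 1

# Pair every score with its movie index: (index, score)
pairs = list(enumerate(scores))

# Sort by score, highest first, and drop the first entry (the movie itself)
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)[1:]
print(ranked)

# Alternative: remove the liked movie by index instead of by position
ranked_safe = [
    pair
    for pair in sorted(pairs, key=lambda pair: pair[1], reverse=True)
    if pair[0] != liked_index
]
print(ranked_safe)
```

Both variants give the same ranking here, with the movie at index 2 first.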

We will run a loop to print the first 7 entries from the `sorted_similar_movies` list.

In [11]:
print("Top 7 similar movies to " + movie_user_likes + " are:\n")
count = 0

for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))
    count += 1
    if count > 6:
        break

Top 7 similar movies to Dracula Untold are:

Abraham Lincoln: Vampire Hunter
Amidst the Devil's Wings
Immortals
The Devil's Double
Underworld: Awakening
Blade: Trinity
Alien³


Judging by this output, I can say the recommender engine works reasonably well, as I have seen Dracula Untold and also some of the movies recommended here. In conclusion, I think the recommendations are good for a basic implementation, but they can be improved further.

### Ideas for future improvement:
We can implement hybrid filtering and see how the results compare to the ones we got from content based filtering.