# Content-Based Recommender System

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

import warnings 
warnings.simplefilter('ignore')

Reference: https://www.kaggle.com/rounakbanik/movie-recommender-systems (Kaggle kernel by rounakbanik)

In [2]:
data = pd.read_csv('the-movies-dataset/movies_metadata.csv')
data.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [3]:
data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

## Sample-Dataset

- Because memory is limited, we start with a sample of the dataset.

- The sample is defined by the ids provided in links_small.csv; for each of those ids we extract all variables from the main (movies_metadata.csv) dataset.


In [4]:
links_small = pd.read_csv('the-movies-dataset/links_small.csv')
links_small.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


In [5]:
links_small[links_small['tmdbId'].notnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 0 to 9124
Data columns (total 3 columns):
movieId    9112 non-null int64
imdbId     9112 non-null int64
tmdbId     9112 non-null float64
dtypes: float64(1), int64(2)
memory usage: 284.8 KB


In [6]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

Drop three malformed rows so that 'id' can be cast to int in the next step

In [7]:
data = data.drop([19730, 29503, 35587])

Convert datatype of 'id' to int

In [8]:
data['id'] = data['id'].astype('int')

- Filter the main dataset by id.
- Take the ids from the links_small dataset and keep the metadata rows from the main dataset whose id appears there.

In [9]:
sample_data = data[data['id'].isin(links_small)]
sample_data.shape

(9099, 24)

- The 'overview' variable provides most of the textual information about each title, so for a content-based method we concentrate on that text, combined with 'tagline'.

In [10]:
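# Treat a missing tagline as empty text
sample_data['tagline'] = sample_data['tagline'].fillna('')
# Concatenate overview and tagline into one text field per movie
sample_data['description'] = sample_data['overview'] + sample_data['tagline']
# A missing overview makes the sum NaN; replace it with an empty string
sample_data['description'] = sample_data['description'].fillna('')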
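
One detail of the cell above: when 'overview' is missing, the concatenation is NaN and the final fillna('') discards the tagline as well. The sketch below, on a made-up toy frame (the `toy` data and the space separator are assumptions, not part of the original notebook), shows one possible way to keep whatever text is available. Applying it to sample_data would change the text slightly, so the vocabulary size reported later could differ.

```python
import pandas as pd

# Toy frame standing in for sample_data (values are made up for illustration).
toy = pd.DataFrame({
    "overview": ["A cowboy doll feels threatened.", None, "A mob family saga."],
    "tagline": ["The adventure takes off!", "Only a tagline", None],
})

# Naive concatenation: a missing value on either side makes the result NaN.
naive = toy["overview"] + toy["tagline"]

# Filling both columns first keeps the text that is present; the space
# separator avoids gluing the last and first words together.
safe = (toy["overview"].fillna("") + " " + toy["tagline"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```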

- Several approaches convert text to numerical features, such as bag-of-words, tf-idf and word2vec. Of these we prefer tf-idf, which produces a sparse matrix with one row per description and one column per term. The next cell converts the text to numerical values.
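
To make the tf-idf conversion concrete, here is a minimal sketch on three invented sentences (the `docs` list is an assumption for illustration only). It uses the same analyzer, n-gram range and stop-word settings as the notebook and shows that the result is a sparse matrix whose columns are unigrams and bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up descriptions standing in for sample_data['description'].
docs = [
    "a toy cowboy and a space ranger",
    "a crime family in new york",
    "a space crew fights an alien",
]

# Same settings as the notebook: word unigrams and bigrams, English stop words removed.
vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

# X is a SciPy sparse matrix: one row per document, one column per term.
print(type(X).__name__, X.shape)
# A few of the learned terms, in alphabetical order.
print(sorted(vec.vocabulary_)[:5])
```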


In [11]:
# min_df=1 keeps every term that appears in at least one description
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(sample_data['description'])

In [12]:
tfidf_matrix.shape

(9099, 268124)

- Items can be compared using a similarity or a distance measure: the distance between every pair of items is calculated and these distances are compared. Common choices are Euclidean distance, cosine similarity and cosine distance. For unit-length (L2-normalised) vectors these are interchangeable:
    - Cos-distance = 1 - Cos-similarity
    - (Euclidean distance)² = 2 (Cos-distance)

- As cosine similarity is cheap to compute, we go with that technique.
    - Cos-similarity
    - $ cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|} $
    - TfidfVectorizer L2-normalises each row by default, so $\|x\| = \|y\| = 1$ and the plain dot product computed by linear_kernel already equals the cosine similarity.
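
The relations above can be checked numerically. The following is a small sketch on invented sentences (the `docs` list is an assumption); it relies on the default L2 normalisation of TfidfVectorizer, and both checks should hold up to floating-point error.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import (
    cosine_similarity,
    euclidean_distances,
    linear_kernel,
)

# Made-up descriptions; each tf-idf row is L2-normalised by default.
docs = [
    "a toy cowboy and a space ranger",
    "a crime family in new york",
    "a space crew fights an alien",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

cos_sim = cosine_similarity(X)                    # pairwise cosine similarity
cos_dist = 1.0 - cos_sim                          # cosine distance
lin = linear_kernel(X, X)                         # plain dot products
sq_euclid = euclidean_distances(X, squared=True)  # squared Euclidean distance

# On unit-length rows the dot product is the cosine similarity.
print(np.allclose(lin, cos_sim))
# On unit-length rows, squared Euclidean distance is twice the cosine distance.
print(np.allclose(sq_euclid, 2.0 * cos_dist))
```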


In [13]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

- Titles in the sample dataset are one of the important parameters.
- In the dataset, 'title' is the key used to look up recommendations.

In [14]:
sample_data = sample_data.reset_index()
titles = sample_data['title']

- Build a Series that maps each title to its row position in the re-indexed sample data.

In [15]:
indices = pd.Series(sample_data.index, index=sample_data['title'])

In [16]:
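def get_recommendations(title):
    # Row position of the requested title
    idx = indices[title]
    # Pair every movie position with its similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (assumed to be the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]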
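
The function above assumes that the requested movie sorts first and that each title is unique; with a duplicated title, indices[title] yields several positions rather than one. The sketch below is one possible variant on a made-up toy frame, not the notebook's method: the names `toy`, `toy_sim` and `recommend` are hypothetical. It excludes the query row explicitly and uses np.argsort instead of sorting a Python list.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy data standing in for sample_data (titles and texts are invented).
toy = pd.DataFrame({
    "title": ["Mob Saga", "Mob Saga II", "Space Toys", "Alien Crew"],
    "description": [
        "crime family boss in new york",
        "the crime family boss returns to new york",
        "toy cowboy meets a space ranger",
        "space crew fights an alien",
    ],
})
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy["description"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)


def recommend(title, top_n=2):
    """Hypothetical helper: return the top_n titles most similar to `title`."""
    matches = np.flatnonzero((toy["title"] == title).to_numpy())
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {title!r}")
    idx = matches[0]           # use the first match if the title is duplicated
    scores = toy_sim[idx].copy()
    scores[idx] = -np.inf      # exclude the query movie itself explicitly
    top = np.argsort(scores)[::-1][:top_n]
    return toy["title"].iloc[top]


print(recommend("Mob Saga", top_n=1).tolist())
```

Choosing the first match for a duplicated title is only one option; disambiguating by release year or id would be another.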

* Evaluation

In [17]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object