<a href="https://www.kaggle.com/code/sedatparlak/content-based-recommendation?scriptVersionId=111067303" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Content Based Recommendation

How content-based recommendation systems work: https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/


Steps: 

1- Create the TF-IDF Matrix<br>
2- Create the Cosine-Similarity Matrix<br>
3- Make the recommendation based on similarities

**Import Libraries**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

warnings.filterwarnings('ignore')

**Read csv file**

In [2]:
df = pd.read_csv('/kaggle/input/movies-metadata/movies_metadata.csv', low_memory=False)

In [3]:
df.shape

(45466, 24)

In [4]:
df.head()

| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | NaN | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | NaN | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | NaN | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | NaN | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |
| 3 | False | NaN | 16000000 | [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... | NaN | 31357 | tt0114885 | en | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | ... | 1995-12-22 | 81452156.0 | 127.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Friends are the people who let you be yourself... | Waiting to Exhale | False | 6.1 | 34.0 |
| 4 | False | {'id': 96871, 'name': 'Father of the Bride Col... | 0 | [{'id': 35, 'name': 'Comedy'}] | NaN | 11862 | tt0113041 | en | Father of the Bride Part II | Just when George Banks has recovered from his ... | ... | 1995-02-10 | 76578911.0 | 106.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Just When His World Is Back To Normal... He's ... | Father of the Bride Part II | False | 5.7 | 173.0 |


**We will work on the `overview` column**

In [5]:
df['overview']

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
45461          Rising and falling between a man and woman.
45462    An artist struggles to finish his work while a...
45463    When one of her hits goes wrong, a professiona...
45464    In a small town live two brothers, one a minis...
45465    50 years after decriminalisation of homosexual...
Name: overview, Length: 45466, dtype: object

In [6]:
df['overview'].isnull().sum()

954

**Fill null values with an empty string**

In [7]:
df['overview'] = df['overview'].fillna('')

In [8]:
df['overview'].isnull().sum()

0

## Step 1: Create TF-IDF Matrix

In [9]:
tf_idf = TfidfVectorizer(stop_words='english')

In [10]:
tf_idf_matrix = tf_idf.fit_transform(df['overview'])

In [11]:
tf_idf_matrix.shape

(45466, 75827)
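
To make the weights in this matrix less opaque, here is a minimal pure-Python sketch of the computation, assuming scikit-learn's default settings for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. The toy corpus and the helper names `tokenize` and `tfidf_row` are hypothetical, and the sketch skips the English stop-word removal used above.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for df['overview'].
docs = [
    "toys come to life",
    "a detective solves a case",
    "toys solve a case",
]

def tokenize(text):
    # Lowercase tokens of two or more word characters (no stop-word removal here).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Term count times IDF, then scale the row to unit L2 length.
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

matrix = [tfidf_row(toks) for toks in tokenized]

print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

Each row corresponds to one document and each column to one vocabulary term, which is why the real matrix has one row per movie and one column per distinct token found in the overviews.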

In [12]:
tf_idf.get_feature_names_out()[0:25].tolist()

['00',
 '000',
 '000km',
 '000th',
 '001',
 '006',
 '007',
 '008',
 '009',
 '0093',
 '01',
 '0123',
 '02',
 '03',
 '04',
 '042',
 '05',
 '05pm',
 '06',
 '07',
 '077',
 '07am',
 '08',
 '088',
 '09']

In [13]:
len(tf_idf.get_feature_names_out())

75827

In [14]:
tf_idf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
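
Converting the sparse matrix to a dense array is expensive at this size. The back-of-the-envelope sketch below estimates the memory needed, assuming 8-byte `float64` values (an assumption, not something the notebook states). `dense_gigabytes` is a hypothetical helper, and the shapes are the ones reported by `tf_idf_matrix.shape`.

```python
def dense_gigabytes(rows, cols, bytes_per_value=8):
    """Size of a dense rows x cols array in GB (10**9 bytes)."""
    return rows * cols * bytes_per_value / 1e9

n_movies, n_terms = 45466, 75827  # shape reported by tf_idf_matrix.shape

# Dense copy of the TF-IDF matrix (movies x terms).
print(f"dense TF-IDF matrix:   {dense_gigabytes(n_movies, n_terms):.1f} GB")

# Any dense movie-by-movie matrix, such as a full pairwise similarity matrix.
print(f"movie-by-movie matrix: {dense_gigabytes(n_movies, n_movies):.1f} GB")
```

Under that assumption the dense TF-IDF array needs roughly 27.6 GB and a movie-by-movie matrix roughly 16.5 GB, so on a smaller machine it may be safer to keep the matrix sparse and inspect only a slice, for example `tf_idf_matrix[:5].toarray()`.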

## Step 2: Create Cosine Similarity Matrix

In [15]:
cosine_sim = cosine_similarity(tf_idf_matrix, tf_idf_matrix)

In [16]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])
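
As a reference for what each entry of `cosine_sim` means, here is a small pure-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The vectors are hypothetical, and returning `0.0` for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term-weight vectors for three overviews.
vectors = [
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
    [0.0, 0.0, 0.0],  # an empty overview contributes no terms at all
]

# Pairwise similarity matrix: entry [i][j] compares overview i with overview j.
sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim:
    print([round(s, 2) for s in row])
```

The matrix is symmetric, and a non-empty overview scores 1.0 against itself. An overview that was filled with an empty string has an all-zero vector, so it cannot be meaningfully compared with anything.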

## Step 3: Make recommendations based on similarities

In [17]:
indices = pd.Series(df.index, index=df['title'])

In [18]:
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

In [19]:
indices.index.value_counts()

Cinderella              11
Hamlet                   9
Alice in Wonderland      9
Beauty and the Beast     8
Les Misérables           8
                        ..
Cluny Brown              1
Babies                   1
The Green Room           1
Captain Conan            1
Queerama                 1
Name: title, Length: 42277, dtype: int64

In [20]:
indices = indices[~indices.index.duplicated(keep='last')]

In [21]:
indices.index.value_counts()

Toy Story                   1
Russell Madness             1
Attack of the Sabretooth    1
The Millennials             1
X/Y                         1
                           ..
Wife! Be Like a Rose!       1
Adelheid                    1
PEEPLI [Live]               1
The Moth                    1
Queerama                    1
Name: title, Length: 42277, dtype: int64
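
The effect of `keep='last'` can be sketched without pandas: when a title occurs several times, only its last row index survives. The titles below are hypothetical, and the dictionary is a stand-in for the `indices` Series.

```python
# Hypothetical titles in row order; "Cinderella" appears at rows 0, 2 and 4.
titles = ["Cinderella", "Hamlet", "Cinderella", "Toy Story", "Cinderella"]

# Later rows overwrite earlier ones, so each title ends up mapped to its
# last row, which is the same choice that keep='last' makes.
title_to_row = {title: row for row, title in enumerate(titles)}

print(title_to_row["Cinderella"])  # row index of the last "Cinderella"
print(len(title_to_row))           # number of distinct titles
```

A lookup by title therefore silently picks one film among remakes that share a name. If that matters, one option is to build the key from the title plus the year taken from `release_date`.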

**Sherlock Holmes Recommendations**

In [22]:
movie_index = indices['Sherlock Holmes']
movie_index

35116

In [23]:
similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=['score'])

In [24]:
similarity_scores

| | score |
|---|---|
| 0 | 0.000000 |
| 1 | 0.003928 |
| 2 | 0.004768 |
| 3 | 0.000000 |
| 4 | 0.000000 |
| ... | ... |
| 45461 | 0.000000 |
| 45462 | 0.000000 |
| 45463 | 0.000000 |
| 45464 | 0.006792 |


In [25]:
movie_indices = similarity_scores.sort_values('score', ascending=False)[1:11].index
movie_indices

Int64Index([34737, 14821, 34750, 9743, 4434, 29706, 18258, 24665, 6432, 29154], dtype='int64')

In [26]:
df['title'].iloc[movie_indices]

34737    Приключения Шерлока Холмса и доктора Ватсона: ...
14821                                    The Royal Scandal
34750    The Adventures of Sherlock Holmes and Doctor W...
9743                           The Seven-Per-Cent Solution
4434                                        Without a Clue
29706                       How Sherlock Changed the World
18258                   Sherlock Holmes: A Game of Shadows
24665     The Sign of Four: Sherlock Holmes' Greatest Case
6432                   The Private Life of Sherlock Holmes
29154                          Sherlock Holmes in New York
Name: title, dtype: object
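
Slicing `[1:11]` assumes the queried movie sorts into position 0. That can fail when another movie ties with it, for example a duplicate overview that also scores 1.0. The sketch below is one hedged alternative using only the standard library: it excludes the query by row index and then takes the top `k`. `top_k_similar` and the score list are hypothetical.

```python
import heapq

def top_k_similar(scores, query_index, k=10):
    """Return the row indices of the k highest scores, skipping the query row itself."""
    candidates = (
        (score, idx) for idx, score in enumerate(scores) if idx != query_index
    )
    # Highest score first; ties are broken in favour of the lower row index.
    best = heapq.nlargest(k, candidates, key=lambda pair: (pair[0], -pair[1]))
    return [idx for _score, idx in best]

# Hypothetical similarity row for the movie at index 2.
# Index 1 ties with the query at 1.0, as a duplicated overview would.
scores = [0.10, 1.00, 1.00, 0.40, 0.00, 0.40]
print(top_k_similar(scores, query_index=2, k=3))
```

Here the tied movie at index 1 is still recommended, while the query itself never is, regardless of how the sort orders equal scores.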

### Wrap all steps in a function

In [27]:
def content_based_recommender(title, cosine_sim, dataframe):
    
    # Map each title to its row index, keeping the last row for duplicated titles
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]
    
    # Look up the row index of the requested title
    movie_index = indices[title]
    
    # Similarity of the requested movie to every movie in the dataframe
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=['score'])
    
    # Top 10 most similar movies; position 0 is skipped because it is expected to be the movie itself
    movie_indices = similarity_scores.sort_values('score', ascending=False)[1:11].index
    
    return dataframe['title'].iloc[movie_indices]

In [28]:
content_based_recommender('Toy Story', cosine_sim, df)

15348                                     Toy Story 3
2997                                      Toy Story 2
10301                          The 40 Year Old Virgin
24523                                       Small Fry
23843                     Andy Hardy's Blonde Trouble
29202                                      Hot Splash
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
Name: title, dtype: object

In [29]:
content_based_recommender('Jumanji', cosine_sim, df)

21633         Table No. 21
45253                 Quiz
41573         Snowed Under
35509             The Mend
44376    Liar Game: Reborn
17223       The Dark Angel
8801               Quintet
6166             Brainscan
30981         Turkey Shoot
9503             Word Wars
Name: title, dtype: object
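
The recommender looks a title up by exact match, so a misspelled or missing title ends in a bare `KeyError`. As a hedged sketch, a small guard built on the standard library's `difflib` can suggest close matches instead. `resolve_title` and the short title list are hypothetical; in the notebook the known titles would come from `indices.index`.

```python
import difflib

def resolve_title(title, known_titles, max_suggestions=3):
    """Return title if it is known; otherwise raise KeyError listing close matches."""
    if title in known_titles:
        return title
    suggestions = difflib.get_close_matches(title, known_titles, n=max_suggestions)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise KeyError(f"Unknown title {title!r}.{hint}")

# Hypothetical subset of titles.
known = ["Toy Story", "Toy Story 2", "Jumanji", "Sherlock Holmes"]

print(resolve_title("Jumanji", known))
try:
    resolve_title("Jumanj", known)
except KeyError as err:
    print(err)
```

Calling such a guard before `content_based_recommender` keeps the recommender itself unchanged while giving a more useful error message.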