<a href="https://colab.research.google.com/github/Al95200/recommander_system/blob/master/recommender_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Importing and loading the dataset

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

path = "/content/drive/My Drive/recommender_system/"

metadata = pd.read_csv(path + 'movies_metadata.csv')

overview = metadata['overview']



In [None]:
overview = overview[:30000]
print(overview)

0        Led by Woody, Andy's toys live happily in his ...
1        When siblings Judy and Peter discover an encha...
2        A family wedding reignites the ancient feud be...
3        Cheated on, mistreated and stepped on, the wom...
4        Just when George Banks has recovered from his ...
                               ...                        
29995    Cat and Gene have been divorced for several ye...
29996    A beautiful lonely girl named Melissa tries to...
29997    A journalist starts an investigation into the ...
29998    When an out of work actress unexpectedly lands...
29999    Bumbling navy officer Lieutenant Humphrey Fair...
Name: overview, Length: 30000, dtype: object


# Preprocessing data

First, we need to build a term-document matrix. Then we compute the TF-IDF score of each term, which is higher when the word appears in few documents, because such a word describes a document more precisely. To do that, we can use TfidfVectorizer from the sklearn library. It returns a sparse matrix, that is, a matrix mostly filled with zeros: most words of the vocabulary do not appear in a given document, and in that case the TF-IDF score is 0.
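
To make the weighting concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which mirrors the defaults of TfidfVectorizer. The corpus and the helper name `tfidf_rows` are hypothetical, and the sketch simplifies tokenization to whitespace splitting with no stop-word removal, so its numbers will not match the notebook's output.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from movies_metadata.csv).
docs = [
    "toys live happily in the room",
    "toys discover a magic board game",
    "a wedding reignites an old feud",
]

def tfidf_rows(documents):
    """Return one {term: weight} dict per document (absent terms are implicitly 0)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # L2-normalize the row so every document vector has length 1.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(docs)
first = rows[0]
# "toys" occurs in two documents, "happily" in one, so "happily" weighs more.
print(first["toys"], first["happily"])
# "wedding" never occurs in the first document, so it has no stored weight.
print("wedding" in first)
```

Storing only the non-zero weights, as the dictionaries do here, is the same idea as the sparse matrix returned by `fit_transform`.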

In [None]:
tfidf = TfidfVectorizer(stop_words='english')
overview = overview.fillna('')
tfidf_matrix = tfidf.fit_transform(overview)
print(tfidf_matrix)
tfidf_matrix.shape

  (0, 14227)	0.13755167851661015
  (0, 3435)	0.1487068980299446
  (0, 29937)	0.10228892137895437
  (0, 17561)	0.10438673324789216
  (0, 15776)	0.13453206511481544
  (0, 37975)	0.10467125649268563
  (0, 46519)	0.1324868043781035
  (0, 9858)	0.12877600096023176
  (0, 39968)	0.13474869509303505
  (0, 23314)	0.1048150644684269
  (0, 39810)	0.09436528246336888
  (0, 31042)	0.12145001058350167
  (0, 1495)	0.14622598358693117
  (0, 45793)	0.11680551182427533
  (0, 30462)	0.20238436807229349
  (0, 7781)	0.4943421218569454
  (0, 7129)	0.10850606765347512
  (0, 5841)	0.12524286331842582
  (0, 44543)	0.11424701823835179
  (0, 22927)	0.13496771882111694
  (0, 30712)	0.08988554091503144
  (0, 53026)	0.1548546171031509
  (0, 2485)	0.41900392243537965
  (0, 57350)	0.4722473412715093
  (0, 29976)	0.10903467086520731
  :	:
  (29999, 53145)	0.15782568843624337
  (29999, 24683)	0.17487235702595763
  (29999, 13908)	0.1850199091745502
  (29999, 46401)	0.19057141981012868
  (29999, 9473)	0.13012978625827365

(30000, 58562)

Because each overview is now represented by a vector of TF-IDF scores, we can use cosine similarity to measure how similar two movie overviews are. The cosine_similarity function of sklearn computes it for every pair of overviews in our dataset at once.
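
As an illustration of what is computed for each pair, here is a small sketch of cosine similarity on hand-written vectors. The vectors and the helper name `cosine` are hypothetical. Returning 0 for an all-zero vector (for example an empty overview) is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows over a 4-term vocabulary.
doc_a = [0.6, 0.8, 0.0, 0.0]
doc_b = [0.8, 0.6, 0.0, 0.0]
doc_c = [0.0, 0.0, 1.0, 0.0]

print(cosine(doc_a, doc_b))  # the two documents share terms: close to 1
print(cosine(doc_a, doc_c))  # no shared terms: 0.0
```

Since TfidfVectorizer L2-normalizes its rows by default, the cosine similarity of two rows reduces to their dot product. Note also that a dense 30000 × 30000 matrix of 64-bit floats takes about 7.2 GB of memory, which may matter on a Colab runtime.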

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim.shape

(30000, 30000)

Finally, to get the 'x' most similar documents, we need a function that reads them from the cosine similarity matrix, skipping the first result because it is the document itself.

In [None]:
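# Title -> row position lookup; drop_duplicates() compares the positions (already unique), so repeated titles are kept.
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()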

In [None]:
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64

In [None]:
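def best_match(x, title, cosine_sim=cosine_sim):
    """Return the titles of the x movies whose overviews are most similar to `title`."""
    # Row position of the movie; it must be below 30000, the size of cosine_sim.
    idx = indices[title]

    # Pair every movie position with its similarity to the requested movie.
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by similarity, highest first.
    sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next x.
    sim_scores = sim_scores[1: x + 1]

    movie_indices = [i[0] for i in sim_scores]

    return metadata['title'].iloc[movie_indices]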
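
The selection step can also be sketched in isolation. The helper below, with the hypothetical name `top_matches`, is a variant and not the function above. It excludes the query by position instead of dropping the first sorted entry, on the assumption that another movie with an identical overview could tie with it at the top. It uses `heapq.nlargest` so the whole row does not need to be sorted.

```python
import heapq

def top_matches(sim_row, idx, x):
    """Return the positions of the x highest scores in sim_row, excluding idx."""
    # Keep (position, score) pairs for every document except the query itself.
    candidates = ((pos, score) for pos, score in enumerate(sim_row) if pos != idx)
    best = heapq.nlargest(x, candidates, key=lambda pair: pair[1])
    return [pos for pos, _ in best]

# Hypothetical similarity row for document 0 (its similarity with itself is 1.0).
row = [1.0, 0.2, 0.7, 0.0, 0.5]
print(top_matches(row, idx=0, x=3))  # → [2, 4, 1]
```

In the notebook, the equivalent call would pass `cosine_sim[idx]` as `sim_row`. The returned positions can then be used with `metadata['title'].iloc[...]`, as in `best_match`.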

In [None]:
best_match(12,'Jumanji')

21633                   Table No. 21
6166                       Brainscan
8801                         Quintet
17223                 The Dark Angel
9503                       Word Wars
16843                         DeVour
8079                         Masques
13601    The Mindscape of Alan Moore
6055                Poolhall Junkies
19726                 Wreck-It Ralph
13711                     Rhinoceros
2486                        eXistenZ
Name: title, dtype: object

In [None]:
best_match(12,'Toy Story')

15348                       Toy Story 3
2997                        Toy Story 2
10301            The 40 Year Old Virgin
24523                         Small Fry
23843       Andy Hardy's Blonde Trouble
29202                        Hot Splash
8327                          The Champ
27206        Life Begins for Andy Hardy
1071              Rebel Without a Cause
26304            You're Only Young Once
29369    Andy Hardy's Private Secretary
1932                          Condorman
Name: title, dtype: object