https://www.kaggle.com/code/mehmetisik/content-based-recommendation

In [1]:
import os
# Check whether the code is running in Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

path_absolute = ''
if IN_COLAB:
    print("The code is running in Google Colab.")
    from google.colab import drive

    drive.mount('/content/drive')
    path_absolute = '/content/drive/Othercomputers/Mi_portátil/TFM/WorkSpace/'

    # Change to your folder in Google Drive
    os.chdir(path_absolute)

    # List the files and folders in the current directory
    contenido_carpeta = os.listdir(path_absolute)
    print("Contents of the Google Drive folder:")
    print(contenido_carpeta)
else:
    print("The code is running in a local environment.")
    path_absolute = os.getcwd().replace("\\", "/")

datasets_path = "/datasets"
path_absolute = path_absolute+datasets_path

The code is running in a local environment.


![CBR](https://miro.medium.com/v2/resize:fit:1400/1*H_MMnrpLQrqTSJHdDOCMoA.png)

# What is Content Based Recommendation

Content-based recommendation, also known as content-based filtering, is a type of recommender system that suggests items to a user based on their interests and preferences. Such systems analyze the user's past preferences and likes, and suggest new items with similar content.

Content-based recommendation analyzes the content of items and determines the ones that are suitable for the user based on similarity criteria. For example, when making a movie recommendation, the system can take into account the genres, actors, directors, and other features of the movies the user has liked or watched. Based on this information, the system suggests other movies with similar characteristics.

This recommendation system can utilize text analysis, tagging, categorization, or other content features along with the user profile or history to better understand the user's preferences. For instance, when making a music recommendation, the system can analyze features such as genres, instruments, tempo, and rhythm.

Content-based recommendation systems can be effective at providing personalized recommendations. Because the recommended items are derived from the user's own past data, they are more likely to match the user's interests and provide a better user experience.
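
As a minimal sketch of the idea described above, assume each movie is described by a hand-made binary genre vector. The titles, the genre flags, and the `recommend_from_profile` helper are hypothetical and are not part of this notebook's dataset. The user profile is taken as the mean of the vectors of the liked movies, and the remaining movies are ranked by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical catalogue: each row is a movie, each column a genre flag.
genres = ["action", "comedy", "romance", "sci-fi"]
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = np.array([
    [1, 0, 0, 1],   # Movie A: action, sci-fi
    [0, 1, 1, 0],   # Movie B: comedy, romance
    [1, 0, 0, 0],   # Movie C: action
    [0, 1, 0, 0],   # Movie D: comedy
], dtype=float)


def recommend_from_profile(liked, features, titles, top_n=2):
    """Rank movies the user has not liked yet by cosine similarity to the user profile."""
    liked_idx = [titles.index(t) for t in liked]
    profile = features[liked_idx].mean(axis=0)
    # Cosine similarity between the profile and every movie.
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    scores = features @ profile / np.where(norms == 0, 1.0, norms)
    # Stable sort keeps the catalogue order for tied scores.
    order = np.argsort(-scores, kind="stable")
    ranked = [i for i in order if i not in liked_idx]
    return [titles[i] for i in ranked[:top_n]]


recommendations = recommend_from_profile(["Movie A"], features, titles)
print(recommendations)
```

The rest of this notebook follows the same pattern, except that the feature vectors come from the words of each movie overview rather than from hand-made genre flags.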

# Business Problem
Recommend movies similar to the ones that a person who comes to our site has watched.

# Road Map

- 1. Creating the **TF-IDF Matrix**
- 2. Creation of **Cosine Similarity Matrix**
- 3. Making Recommendations Based on Similarities
- 4. Preparation of the Study Script
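
The first three steps of the road map can be sketched end to end on a tiny corpus. The three titles and overviews below are hypothetical and only illustrate the flow; the real notebook applies the same calls to `movies_metadata.csv`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue (not rows from movies_metadata.csv).
toy = pd.DataFrame({
    "title": ["Space War", "Star Battle", "Love Letter"],
    "overview": [
        "A pilot fights an alien fleet in deep space",
        "Rebel pilots battle an alien fleet among the stars",
        "Two strangers fall in love through letters",
    ],
})

# Step 1: TF-IDF matrix (one row per overview, one column per term).
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])

# Step 2: pairwise cosine similarity between all overviews.
toy_sim = cosine_similarity(toy_tfidf)

# Step 3: most similar title to "Space War", skipping the movie itself.
query = 0
ranking = pd.Series(toy_sim[query], index=toy["title"]).drop(toy["title"][query])
best_match = ranking.idxmax()
print(best_match)
```

"Star Battle" shares the terms "alien" and "fleet" with the query, while "Love Letter" shares none, so the former is ranked first.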


In [2]:
# import Required Libraries

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Adjusting Row Column Settings

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)

In [4]:
# Loading the Data Set

df = pd.read_csv(path_absolute+"/movies_metadata.csv")


In [5]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [6]:
# Preliminary examination of the data set

def check_df(dataframe, head=5):
    print('##################### Shape #####################')
    print(dataframe.shape)
    print('##################### Types #####################')
    print(dataframe.dtypes)
    print('##################### Head #####################')
    print(dataframe.head(head))
    print('##################### Tail #####################')
    print(dataframe.tail(head))
    print('##################### NA #####################')
    print(dataframe.isnull().sum())
    print('##################### Quantiles #####################')
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

##################### Shape #####################
(45466, 24)
##################### Types #####################
adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object
##################### Head #####################
   adult                         

In [7]:
# Replace 'df' with the name of your DataFrame and set 'columna_deseada' to the name of your column
columna_deseada = 'title'
valor_a_verificar = 'The Promise'

# Find the rows whose value in the selected column equals 'valor_a_verificar'
filas_duplicadas = df[df[columna_deseada] == valor_a_verificar]


# # Find the duplicated rows in the selected column
# filas_duplicadas = df[df.duplicated(subset=[columna_deseada], keep=False)]
# # Find the duplicated rows in the selected column, excluding the first occurrence
# filas_duplicadas = df[df.duplicated(subset=[columna_deseada], keep='first')]

# Show the rows that share the same value in the selected column
filas_duplicadas.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
676,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",0.122178,/5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg,"[{'name': 'Studio Babelsberg', 'id': 264}, {'n...","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
1465,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",0.122178,/5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg,"[{'name': 'Studio Babelsberg', 'id': 264}, {'n...","[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,False,5.0,1.0
10601,False,,0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 18, 'na...",http://thepromisemovie.net,2008,tt0417976,zh,Wu Ji,"An orphaned girl, driven by poverty at such a ...",2.447818,/zAzI1wMRZzaBlc56DXJkjiv79Ut.jpg,[],"[{'iso_3166_1': 'CN', 'name': 'China'}, {'iso_...",2005-12-15,0.0,98.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,the promise,The Promise,False,5.0,29.0
17378,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,109251,tt0079756,en,The Promise,A rich student's fiancee has her face destroye...,0.217476,/jCyhSgXX6yEwFIYHhtaKIbIPv3w.jpg,"[{'name': 'Universal Pictures', 'id': 33}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1979-03-08,0.0,97.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,The Promise,False,6.0,1.0
40506,False,,0,"[{'id': 36, 'name': 'History'}, {'id': 10749, ...",http://www.survivalpictures.org/the-promise/,354859,tt4776998,en,The Promise,Set during the last days of the Ottoman Empire...,9.555114,/iDWer2VFikupdqDc7d5sxCWZ3gW.jpg,"[{'name': 'Babieka', 'id': 20656}, {'name': 'W...","[{'iso_3166_1': 'ES', 'name': 'Spain'}, {'iso_...",2017-04-21,0.0,130.0,"[{'iso_639_1': 'hy', 'name': ''}, {'iso_639_1'...",Released,Empires fall. Love survives.,The Promise,False,7.3,69.0


# 1. Creating the TF-IDF Matrix

In [8]:
df["overview"].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [9]:
df["overview"].isnull().sum()

954

In [10]:
# Drop English stop words (a, an, the, and, but, ...) that carry little meaning for comparing overviews.

tfidf = TfidfVectorizer(stop_words="english")

In [11]:
# Fill the null values in the overview column with empty strings to avoid errors in the following steps

df['overview'] = df['overview'].fillna('')

In [12]:
df["overview"].isnull().sum()

0

In [13]:
# Fit the tfidf object on the overviews and transform them.
# Rows are the 'overview' texts; columns are the unique words.

tfidf_matrix = tfidf.fit_transform(df['overview'])

In [14]:
tfidf_matrix.shape

(45466, 75827)
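
To make the scores in this matrix less opaque, the sketch below recomputes TF-IDF by hand on a hypothetical three-document corpus and compares the result with `TfidfVectorizer`. It assumes the vectorizer's default settings: raw term counts, the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalisation of every row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["toy cowboy toy", "space ranger toy", "cowboy ranger"]  # hypothetical mini corpus

# Raw term counts (tf) for each document.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)          # number of documents containing each term

# Smoothed idf, matching TfidfVectorizer's default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then L2-normalise every row (the default norm="l2").
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Because every row ends up with unit length, the dot product of two rows is already their cosine similarity.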

In [15]:
# To see all the unique words (the column names of the matrix)

tfidf.get_feature_names_out()


array(['00', '000', '000km', ..., '첫사랑', 'ﬁrst', 'ﬁve'], dtype=object)

In [16]:
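# TF-IDF scores as a dense array (this stores every zero explicitly, so it needs a lot of memory on the full matrix)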

tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# 2. Creation of Cosine Similarity Matrix

In [17]:
# Compute the cosine similarity for every pair of overviews. Row i of cosine_sim holds the similarity of movie i to every movie.

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [18]:
cosine_sim.shape

(45466, 45466)
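
A dense 45466 x 45466 matrix of 64-bit floats takes roughly 16.5 GB, which may not fit in memory on every machine. One possible workaround, sketched here on a hypothetical corpus, is to compute a single row on demand. It relies on the assumption that the TF-IDF rows are L2-normalised (the `TfidfVectorizer` default), in which case `linear_kernel` (a plain dot product) gives the same values as `cosine_similarity`. The `similarity_row` helper is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [  # hypothetical overviews
    "detective solves a mystery in London",
    "a London detective and his loyal friend",
    "toys come to life when humans leave",
]
m = TfidfVectorizer(stop_words="english").fit_transform(docs)


def similarity_row(matrix, i):
    """Similarity of document i to every document, without the full n x n matrix."""
    # TF-IDF rows are L2-normalised by default, so a dot product equals the cosine.
    return linear_kernel(matrix[i], matrix).ravel()


row = similarity_row(m, 0)
full = cosine_similarity(m)
print(np.allclose(row, full[0]))
```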

In [19]:
# To see how similar the movie at index 1 is to all the other movies

cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [39]:
cosine_sim[1].shape

(45466,)

# 3. Making Recommendations Based on Similarities

In [20]:
# Let's create a pd series of indexes and movie names

indices = pd.Series(df.index, index=df['title'])

In [21]:
# Count how often each title appears; in the next step, repeated titles are reduced to their last (most recent) entry

indices.index.value_counts().head()

Cinderella              11
Hamlet                   9
Alice in Wonderland      9
Beauty and the Beast     8
Les Misérables           8
Name: title, dtype: int64

In [22]:
indices = indices[~indices.index.duplicated(keep='last')]

In [23]:
indices["Cinderella"]

45406

In [24]:
indices["Sherlock Holmes"]

35116
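
The effect of `keep='last'` can be seen on a small example. The three titles below are hypothetical; the point is that a title lookup on an index with duplicates returns several positions, while the de-duplicated index returns exactly one.

```python
import pandas as pd

# Hypothetical titles; "Cinderella" appears at positions 0 and 2.
toy_titles = pd.Series(["Cinderella", "Hamlet", "Cinderella"], name="title")
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

# With duplicates, a label lookup returns a Series rather than one integer.
ambiguous = toy_indices["Cinderella"]

# After dropping the earlier duplicates, the same lookup returns a single position.
deduped = toy_indices[~toy_indices.index.duplicated(keep="last")]
print(list(ambiguous), deduped["Cinderella"])
```

Note that this choice silently decides which film a title refers to: above, "Sherlock Holmes" resolves to index 35116, the last row with that title.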

In [25]:
# I assign the index of the movie 'Sherlock Holmes' to the variable

movie_index = indices['Sherlock Holmes']

In [26]:
cosine_sim[movie_index]

array([0.        , 0.00392837, 0.00476764, ..., 0.        , 0.0067919 ,
       0.        ])

In [38]:
cosine_sim[movie_index].shape

(45466,)

In [27]:
# Let's see the similarity scores between the movie 'Sherlock Holmes' and the other movies


similarity_scores = pd.DataFrame(cosine_sim[movie_index],
                                 columns=["score"])

In [28]:
# The similarities between the movie 'Sherlock Holmes' and all other movies

similarity_scores.head()

Unnamed: 0,score
0,0.0
1,0.003928
2,0.004768
3,0.0
4,0.0


In [46]:
aux = df['title'].iloc[similarity_scores.sort_values("score", ascending=False).index]

In [48]:
aux.shape

(45466,)

In [49]:
# aux

In [29]:
# List the similarity scores for 'Sherlock Holmes' in descending order. The slice starts at 1 to skip position 0, which is normally the movie itself.

movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index

In [50]:
movie_indices

Int64Index([34737, 14821, 34750, 9743, 4434, 29706, 18258, 24665, 6432, 29154], dtype='int64')

In [30]:
# Look up the titles at the selected indexes in the original DataFrame

df['title'].iloc[movie_indices]

34737    Приключения Шерлока Холмса и доктора Ватсона: ...
14821                                    The Royal Scandal
34750    The Adventures of Sherlock Holmes and Doctor W...
9743                           The Seven-Per-Cent Solution
4434                                        Without a Clue
29706                       How Sherlock Changed the World
18258                   Sherlock Holmes: A Game of Shadows
24665     The Sign of Four: Sherlock Holmes' Greatest Case
6432                   The Private Life of Sherlock Holmes
29154                          Sherlock Holmes in New York
Name: title, dtype: object
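
The `[1:11]` slice assumes the queried movie is ranked first. That may not hold when another row has an identical overview, as with the repeated 'The Promise' rows shown earlier. A possible alternative, sketched below with a hypothetical `top_k_similar` helper and made-up scores, excludes the query by its index and selects the top entries with NumPy instead of sorting a DataFrame.

```python
import numpy as np


def top_k_similar(scores, self_index, k=10):
    """Indices of the k largest scores, excluding self_index, best first."""
    scores = np.asarray(scores, dtype=float).copy()
    scores[self_index] = -np.inf             # never recommend the query itself
    k = min(k, scores.size - 1)
    if k <= 0:
        return np.array([], dtype=int)
    candidates = np.argpartition(-scores, k - 1)[:k]   # unordered top k
    return candidates[np.argsort(-scores[candidates], kind="stable")]


# Hypothetical similarity row; the query movie sits at index 1.
example_scores = np.array([0.10, 1.00, 0.40, 0.00, 0.75])
result = top_k_similar(example_scores, self_index=1, k=3)
print(result)
```

`np.argpartition` avoids fully sorting all 45466 scores when only ten are needed.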

In [41]:
# Replace 'df' with the name of your DataFrame and set 'columna_deseada' to the name of your column
columna_deseada = 'title'
valor_a_verificar = 'Sherlock Holmes'

# Find the rows whose value in the selected column equals 'valor_a_verificar'
filas_duplicadas = df[df[columna_deseada] == valor_a_verificar]


# # Find the duplicated rows in the selected column
# filas_duplicadas = df[df.duplicated(subset=[columna_deseada], keep=False)]
# # Find the duplicated rows in the selected column, excluding the first occurrence
# filas_duplicadas = df[df.duplicated(subset=[columna_deseada], keep='first')]

# Show the rows that share the same value in the selected column
filas_duplicadas.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
14557,False,"{'id': 102322, 'name': 'Sherlock Holmes Collec...",90000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://sherlock-holmes-movie.warnerbros.com/,10528,tt0988045,en,Sherlock Holmes,"Eccentric consulting detective, Sherlock Holme...",15.68604,/22ngurXbLqab7Sko6aTSdwOCe5W.jpg,"[{'name': 'Village Roadshow Pictures', 'id': 7...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",2009-12-23,524028679.0,128.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Nothing escapes him.,Sherlock Holmes,False,7.0,5883.0
17830,False,,1000000,"[{'id': 14, 'name': 'Fantasy'}, {'id': 18, 'na...",,33555,tt1522835,en,Sherlock Holmes,Sherlock Holmes and Watson are on the trail of...,2.719197,/yVeRjNrFPNHkjV200Ncmldd6p6p.jpg,"[{'name': 'The Asylum', 'id': 1311}]","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2010-01-26,0.0,89.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The world's greatest detective has finally met...,Sherlock Holmes,False,5.5,41.0
32040,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,53800,tt0013597,en,Sherlock Holmes,Starring John Barrymore as Holmes and Roland Y...,1.206513,/7ojb5QcSs8SliK0TqOdVP4cRTJ2.jpg,"[{'name': 'Goldwyn Pictures Corporation', 'id'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1922-03-07,0.0,85.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Sherlock Holmes,False,5.0,1.0
35116,False,,0,"[{'id': 9648, 'name': 'Mystery'}, {'id': 18, '...",,342831,tt0007338,en,Sherlock Holmes,"Long considered lost since its first release, ...",0.096912,/yykgiz38HclvUBiZFUbqrMwVtec.jpg,[{'name': 'The Essanay Film Manufacturing Comp...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1916-06-30,0.0,116.0,[],Released,,Sherlock Holmes,False,0.0,0.0


# 4. Preparation of the Study Script

In [31]:
def content_based_recommender(title, cosine_sim, dataframe):
    # create indexes
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]
    # capturing index of title
    movie_index = indices[title]
    # calculate similarity scores based on title
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # Bringing the top 10 movies except for itself
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe['title'].iloc[movie_indices]

In [32]:
content_based_recommender("Sherlock Holmes", cosine_sim, df)

34737    Приключения Шерлока Холмса и доктора Ватсона: ...
14821                                    The Royal Scandal
34750    The Adventures of Sherlock Holmes and Doctor W...
9743                           The Seven-Per-Cent Solution
4434                                        Without a Clue
29706                       How Sherlock Changed the World
18258                   Sherlock Holmes: A Game of Shadows
24665     The Sign of Four: Sherlock Holmes' Greatest Case
6432                   The Private Life of Sherlock Holmes
29154                          Sherlock Holmes in New York
Name: title, dtype: object

In [33]:
content_based_recommender("The Matrix", cosine_sim, df)

44161                        A Detective Story
44167                              Kid's Story
44163                             World Record
33854                                Algorithm
167                                    Hackers
20707    Underground: The Julian Assange Story
6515                                  Commando
24202                                 Who Am I
22085                           Berlin Express
9159                                  Takedown
Name: title, dtype: object

In [34]:
content_based_recommender("The Godfather", cosine_sim, df)

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

In [35]:
content_based_recommender('The Dark Knight Rises', cosine_sim, df)

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object
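
`content_based_recommender` raises a `KeyError` when the title is not in the dataset. The variant below is only a sketch of one way to handle that case: `safe_recommender` and the toy data are hypothetical, and returning an empty Series for unknown titles is a design assumption rather than the notebook's behaviour.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def safe_recommender(title, cosine_sim, dataframe, top_n=10):
    """Variant of content_based_recommender that tolerates unknown titles."""
    indices = pd.Series(np.arange(len(dataframe)), index=dataframe["title"])
    indices = indices[~indices.index.duplicated(keep="last")]
    if title not in indices.index:
        return pd.Series(dtype=object, name="title")   # nothing to recommend
    movie_pos = indices[title]
    # Exclude the query by its position instead of assuming it ranks first.
    scores = pd.Series(cosine_sim[movie_pos]).drop(movie_pos)
    best = scores.sort_values(ascending=False).head(top_n).index
    return dataframe["title"].iloc[best]


toy = pd.DataFrame({  # hypothetical data
    "title": ["Holmes Returns", "Holmes in Paris", "Toy Party"],
    "overview": [
        "Detective Holmes hunts a thief",
        "Detective Holmes visits Paris",
        "Toys throw a party",
    ],
})
toy_sim = cosine_similarity(
    TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
)
known = safe_recommender("Holmes Returns", toy_sim, toy, top_n=1)
unknown = safe_recommender("Not In Catalogue", toy_sim, toy)
print(known.tolist(), len(unknown))
```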

In [36]:
def calculate_cosine_sim(dataframe):
    tfidf = TfidfVectorizer(stop_words='english')
    dataframe['overview'] = dataframe['overview'].fillna('')
    tfidf_matrix = tfidf.fit_transform(dataframe['overview'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return cosine_sim
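
One detail of `calculate_cosine_sim` worth knowing is that it writes the filled `overview` column back into the DataFrame passed in. If that side effect is unwanted, a non-mutating variant could look like the sketch below. The name `calculate_cosine_sim_pure`, the `text_column` parameter, and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def calculate_cosine_sim_pure(dataframe, text_column="overview"):
    """Like calculate_cosine_sim, but leaves the caller's DataFrame untouched."""
    texts = dataframe[text_column].fillna("")        # new Series, no assignment back
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return cosine_similarity(matrix)


toy = pd.DataFrame({  # hypothetical data
    "title": ["A", "B", "C"],
    "overview": ["space pilot adventure", None, "space pilot comedy"],
})
sim = calculate_cosine_sim_pure(toy)
print(sim.shape, toy["overview"].isnull().sum())
```

A movie with an empty overview becomes an all-zero TF-IDF row, so its similarity to every movie, itself included, is 0. Such movies can therefore never receive meaningful recommendations from this approach.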

In [37]:
# cosine_sim = calculate_cosine_sim(df)
# content_based_recommender('The Dark Knight Rises', cosine_sim, df)