
# Introduction

In this project, I build a content-based recommender system that uses film metadata (cast, genres, directors, and writers) to calculate similarity between films and generate recommendations. When a user selects a film, the system returns a list of the films whose metadata is most similar to it. Because the recommendations depend only on film content, they follow the user's specific film preferences and help users discover new movies that match their interests.

# File Loading and Checking Dataset


In [1]:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
pd.set_option('display.max_columns', None)

Mounted at /content/drive


In [2]:
movie_rating_df = pd.read_csv('/content/drive/MyDrive/Kerja/DA/Portofolio/Recommender System with Similarity Function/Dataset/movie_rating_df.csv')
director_writers = pd.read_csv('/content/drive/MyDrive/Kerja/DA/Portofolio/Recommender System with Similarity Function/Dataset/directors_writers.csv')
name_df = pd.read_csv('/content/drive/MyDrive/Kerja/DA/Portofolio/Recommender System with Similarity Function/Dataset/actor_name.csv')

In [4]:
movie_rating_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1.0,"Documentary,Short",5.6,1608
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5.0,"Animation,Short",6.0,197
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1285
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,12.0,"Animation,Short",6.1,121
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1.0,"Comedy,Short",6.1,2050


In [5]:
movie_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751614 entries, 0 to 751613
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          751614 non-null  object 
 1   titleType       751614 non-null  object 
 2   primaryTitle    751614 non-null  object 
 3   originalTitle   751614 non-null  object 
 4   isAdult         751614 non-null  int64  
 5   startYear       751614 non-null  float64
 6   endYear         16072 non-null   float64
 7   runtimeMinutes  751614 non-null  float64
 8   genres          486766 non-null  object 
 9   averageRating   751614 non-null  float64
 10  numVotes        751614 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 63.1+ MB


In [6]:
director_writers.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,David Kirkland,"John Emerson,Anita Loos"
1,tt0011890,Roy William Neill,"Arthur F. Goodrich,Burns Mantle,Mary Murillo"
2,tt0014341,"Buster Keaton,John G. Blystone","Jean C. Havez,Clyde Bruckman,Joseph A. Mitchell"
3,tt0018054,Cecil B. DeMille,Jeanie Macpherson
4,tt0024151,James Cruze,"Max Miller,Wells Root,Jack Jevne"


In [7]:
director_writers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tconst         986 non-null    object
 1   director_name  986 non-null    object
 2   writer_name    986 non-null    object
dtypes: object(3)
memory usage: 23.2+ KB


Some films have several directors and writers, stored as one comma-separated string. To handle them individually, split each value in the director_name and writer_name columns into a list.

In [8]:
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))
director_writers.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
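
The same split can also be written with the vectorized `Series.str.split`. The sketch below is only an illustration: the frame `toy` is a hypothetical two-row sample built from values shown in the preview above, not the real dataset.

```python
import pandas as pd

# Hypothetical two-row sample mimicking the director_writers layout
toy = pd.DataFrame({
    'tconst': ['tt0011414', 'tt0014341'],
    'director_name': ['David Kirkland', 'Buster Keaton,John G. Blystone'],
})

# str.split(',') turns each comma-separated string into a Python list
toy['director_name'] = toy['director_name'].str.split(',')

print(toy['director_name'].tolist())
```

A film with a single director still ends up with a one-element list, so every cell has the same type afterwards.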


# Cleaning and Processing Cast Table

In [9]:
name_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm1774132,Nathan McLaughlin,1973,\N,"special_effects,make_up_department","tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,\N,\N,actor,tt7718088
2,nm1021485,Brandon Fransvaag,\N,\N,miscellaneous,tt0168790
3,nm6940929,Erwin van der Lely,\N,\N,miscellaneous,tt4232168
4,nm5764974,Svetlana Shypitsyna,\N,\N,actress,tt3014168


In [10]:
name_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   nconst             1000 non-null   object
 1   primaryName        1000 non-null   object
 2   birthYear          1000 non-null   object
 3   deathYear          1000 non-null   object
 4   primaryProfession  891 non-null    object
 5   knownForTitles     1000 non-null   object
dtypes: object(6)
memory usage: 47.0+ KB


Select only the relevant columns: 'nconst', 'primaryName', and 'knownForTitles'.

In [11]:
name_df = name_df[['nconst','primaryName','knownForTitles']]
name_df.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168


As with director_name and writer_name, convert the knownForTitles column into lists.

In [12]:
#assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the column subset
name_df = name_df.assign(knownForTitles=name_df['knownForTitles'].apply(lambda row: row.split(',')))
name_df.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"[tt0417686, tt1713976, tt1891860, tt0454839]"
1,nm10683464,Bridge Andrew,[tt7718088]
2,nm1021485,Brandon Fransvaag,[tt0168790]
3,nm6940929,Erwin van der Lely,[tt4232168]
4,nm5764974,Svetlana Shypitsyna,[tt3014168]


Because some people are known for several films, each person and film pair needs its own row.

In [13]:
df_uni = []

for x in ['knownForTitles']:
    #repeat each row's index once per element of its knownForTitles list
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())

    #flatten the lists from all rows into a single-column dataframe
    df1 = pd.DataFrame({x: np.concatenate(name_df[x].values)})

    #replace the dataframe's index with the idx defined above
    df1.index = idx

    #append each resulting dataframe to the bucket list
    df_uni.append(df1)

#combine all dataframes into one
df_concat = pd.concat(df_uni, axis=1)

#left join with value from the initial dataframe
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')

#select columns according to the initial dataframe
unnested_df = unnested_df[name_df.columns.tolist()]

unnested_df

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,tt0417686
0,nm1774132,Nathan McLaughlin,tt1713976
0,nm1774132,Nathan McLaughlin,tt1891860
0,nm1774132,Nathan McLaughlin,tt0454839
1,nm10683464,Bridge Andrew,tt7718088
...,...,...,...
998,nm5245804,Eliza Jenkins,tt1464058
999,nm0948460,Greg Yolen,tt0436869
999,nm0948460,Greg Yolen,tt0476663
999,nm0948460,Greg Yolen,tt0109723
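
For comparison, pandas also provides `DataFrame.explode`, which unnests a list column in one call. This is a minimal sketch on made-up data (the frame `toy_names` and its values are assumptions for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Small made-up frame in the same shape as name_df after the split step
toy_names = pd.DataFrame({
    'nconst': ['nm1', 'nm2'],
    'primaryName': ['Actor A', 'Actor B'],
    'knownForTitles': [['tt01', 'tt02'], ['tt02']],
})

# explode() emits one row per list element and repeats the original index
toy_unnested = toy_names.explode('knownForTitles')
print(toy_unnested)
```

As in the output above, the index values repeat (here 0, 0, 1), which shows which original row each unnested row came from.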


# Nesting primaryName Grouped by knownForTitles

In [14]:
unnested_drop = unnested_df.drop(['nconst'], axis=1)

#Set up a bucket for a dataframe
df_uni = []

for col in ['primaryName']:
    #aggregate the primaryName column into one list per knownForTitles value
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    #append the aggregated result to the bucket list
    df_uni.append(dfi)

df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
df_grouped

Unnamed: 0,knownForTitles,cast_name
0,tt0008125,[Charles Harley]
1,tt0009706,[Charles Harley]
2,tt0010304,[Natalie Talmadge]
3,tt0011414,[Natalie Talmadge]
4,tt0011890,[Natalie Talmadge]
...,...,...
1893,tt9610496,[Stefano Baffetti]
1894,tt9714030,[Kevin Kain]
1895,tt9741820,[Caroline Plyler]
1896,tt9759814,[Ethan Francis]
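
The grouping step can be checked on a tiny example. The sketch below uses made-up names and title ids (all values are assumptions for illustration) and collects the people attached to each title into a list.

```python
import pandas as pd

# Made-up unnested rows: one row per (person, title) pair
toy_unnested = pd.DataFrame({
    'primaryName': ['Actor A', 'Actor A', 'Actor B'],
    'knownForTitles': ['tt01', 'tt02', 'tt02'],
})

# Collect all names that share a title into a single list
toy_grouped = (
    toy_unnested.groupby('knownForTitles')['primaryName']
    .agg(list)
    .reset_index(name='cast_name')
)
print(toy_grouped)
```

Title `tt02` ends up with both names, while `tt01` keeps a one-element list, which mirrors the shape of `df_grouped`.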


# Joining with Movie Table

In [15]:
#Join between Movie Table and Cast Table
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')

#Join between base_df and the director_writers table
base_df = pd.merge(base_df, director_writers, on='tconst', how='left')

base_df.head()

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Natalie Talmadge],tt0011890,movie,Yes or No,Yes or No,0,1920.0,,72.0,,6.3,7,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,[Natalie Talmadge],tt0014341,movie,Our Hospitality,Our Hospitality,0,1923.0,,65.0,"Comedy,Romance,Thriller",7.8,9621,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Reeka Roberts],tt0018054,movie,The King of Kings,The King of Kings,0,1927.0,,155.0,"Biography,Drama,History",7.3,1826,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Hackett],tt0024151,movie,I Cover the Waterfront,I Cover the Waterfront,0,1933.0,,80.0,"Drama,Romance",6.3,455,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
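
The two joins behave differently: the inner join keeps only titles present in both tables, while the left join keeps every remaining title and leaves the crew columns empty where no match exists. A minimal sketch on made-up tables follows; the frames `cast`, `movies`, and `crew` and all their values are assumptions for illustration.

```python
import pandas as pd

cast = pd.DataFrame({'knownForTitles': ['tt01', 'tt02', 'tt99'],
                     'cast_name': [['Actor A'], ['Actor A', 'Actor B'], ['Actor C']]})
movies = pd.DataFrame({'tconst': ['tt01', 'tt02', 'tt03'],
                       'primaryTitle': ['Film 1', 'Film 2', 'Film 3']})
crew = pd.DataFrame({'tconst': ['tt01'], 'director_name': [['Director X']]})

# inner join: keeps only titles found in both tables (tt99 and tt03 drop out)
base = pd.merge(cast, movies, left_on='knownForTitles', right_on='tconst', how='inner')

# left join: keeps every row of base; titles without crew data get NaN
base = pd.merge(base, crew, on='tconst', how='left', indicator=True)
print(base[['tconst', 'primaryTitle', 'director_name', '_merge']])
```

The `indicator=True` option adds a `_merge` column that marks rows without a crew match as `left_only`. In the notebook's data, 1060 titles survive the inner join and 986 of them have director and writer values.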


In [16]:
#Drop the knownForTitles column
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop.info()

#Replacing NULL values in the genres column with 'Unknown'
base_drop['genres'] = base_drop['genres'].fillna('Unknown')

#Calculate the number of NULL values in each column
print('\nTotal NULL\n',base_drop.isnull().sum())

#Replacing NULL values in director_name and writer_name columns with 'unknown'
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('unknown')

#genres holds comma-separated values, so split each string into a list
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cast_name       1060 non-null   object 
 1   tconst          1060 non-null   object 
 2   titleType       1060 non-null   object 
 3   primaryTitle    1060 non-null   object 
 4   originalTitle   1060 non-null   object 
 5   isAdult         1060 non-null   int64  
 6   startYear       1060 non-null   float64
 7   endYear         110 non-null    float64
 8   runtimeMinutes  1060 non-null   float64
 9   genres          745 non-null    object 
 10  averageRating   1060 non-null   float64
 11  numVotes        1060 non-null   int64  
 12  director_name   986 non-null    object 
 13  writer_name     986 non-null    object 
dtypes: float64(4), int64(2), object(8)
memory usage: 124.2+ KB

Total NULL
 cast_name           0
tconst              0
titleType           0
primaryTitle        0
original

In [17]:
#Drop column tconst, isAdult, endYear, originalTitle
base_drop2 = base_drop.drop(['tconst','isAdult', 'endYear', 'originalTitle'], axis=1)

base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres','averageRating','numVotes','cast_name','director_name','writer_name']]

#Rename the column
base_drop2.columns = ['title','type','start','duration','genres','rating','votes','cast_name','director_name','writer_name']
base_drop2.head()

Unnamed: 0,title,type,start,duration,genres,rating,votes,cast_name,director_name,writer_name
0,The Love Expert,movie,1920.0,60.0,"[Comedy, Romance]",4.9,136,[Natalie Talmadge],[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,movie,1920.0,72.0,[Unknown],6.3,7,[Natalie Talmadge],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,movie,1923.0,65.0,"[Comedy, Romance, Thriller]",7.8,9621,[Natalie Talmadge],"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,movie,1927.0,155.0,"[Biography, Drama, History]",7.3,1826,[Reeka Roberts],[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,movie,1933.0,80.0,"[Drama, Romance]",6.3,455,[James Hackett],[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


The table is now ready to use.

# Creating Content-based Recommender System

In [18]:
#Feature selection
#Keep the title plus the metadata columns used for similarity: cast_name, genres, director_name, and writer_name
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]

feature_df.head()

Unnamed: 0,title,cast_name,genres,director_name,writer_name
0,The Love Expert,[Natalie Talmadge],"[Comedy, Romance]",[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,[Natalie Talmadge],[Unknown],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,[Natalie Talmadge],"[Comedy, Romance, Thriller]","[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,[Reeka Roberts],"[Biography, Drama, History]",[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,[James Hackett],"[Drama, Romance]",[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


In [19]:
#Create a function that removes spaces from, and lowercases, each cell and each of its elements
def sanitize(x):
    try:
        #if the cell contains a list
        if isinstance(x, list):
            return [i.replace(' ','').lower() for i in x]
        #if the cell contains a string
        else:
            return [x.replace(' ','').lower()]
    except AttributeError:
        #non-string values (e.g. NaN) are printed and treated as having no tokens
        print(x)
        return []

#Columns: cast_name, genres, writer_name, director_name
feature_cols = ['cast_name','genres','writer_name','director_name']

#Work on an explicit copy so the column assignments do not trigger SettingWithCopyWarning
feature_df = feature_df.copy()

#Apply sanitize function
for col in feature_cols:
    feature_df[col] = feature_df[col].apply(sanitize)
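
Two details of this step are easy to miss. After the left join and the `fillna('unknown')` call, a column can hold lists in some rows and plain strings in others, so the function has to accept both. Removing the spaces also keeps each full name as one token, so two people who share only a first name do not look similar. The sketch below uses `sanitize_cell`, a hypothetical stand-in written for this illustration, on made-up input.

```python
import pandas as pd

def sanitize_cell(x):
    # Hypothetical stand-in for the notebook's sanitize function
    if isinstance(x, list):
        return [i.replace(' ', '').lower() for i in x]
    return [str(x).replace(' ', '').lower()]

# A column mixing a list (matched row) and a plain string (filled-in row)
mixed = pd.Series([['Buster Keaton', 'John G. Blystone'], 'unknown'])
cleaned = mixed.apply(sanitize_cell).tolist()
print(cleaned)
```

Both cells come back as lists of lowercase, space-free strings, which is the shape the later join into one text string relies on.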


In [20]:
#create a function that builds a metadata soup (all features joined into one string) for each title
#columns used: cast_name, genres, director_name, writer_name
def soup_feature(x):
    return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['writer_name']) + ' ' + ' '.join(x['director_name'])

#store the soup in a new column; assign returns a new DataFrame and avoids SettingWithCopyWarning
feature_df = feature_df.assign(soup=feature_df.apply(soup_feature, axis=1))


In [21]:
#set up CountVectorizer (stop_words=English) and fit it with the soup
#import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#define the CountVectorizer and transform the soup into a token-count matrix
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])

print(count)
print(count_matrix.shape)

CountVectorizer(stop_words='english')
(1060, 10026)
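
It helps to look at the tokens the vectorizer actually produces. With its default token pattern, `CountVectorizer` splits on punctuation, so a sanitized name that contains an initial, such as `jeanc.havez`, becomes two tokens. The sketch below assumes the soup string for "Our Hospitality" as rebuilt by hand from the rows shown earlier; it is an illustration, not output taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Soup for "Our Hospitality", rebuilt by hand (an assumption for illustration)
soup = ('natalietalmadge comedy romance thriller '
        'jeanc.havez clydebruckman josepha.mitchell '
        'busterkeaton johng.blystone')

vec = CountVectorizer(stop_words='english')
vec.fit([soup])

# vocabulary_ maps each learned token to its column index
tokens = sorted(vec.vocabulary_)
print(len(tokens), tokens)
```

The nine space-separated words yield twelve tokens, because the three names with a period are each split in two. This may be acceptable here, but it means the vocabulary size is larger than the number of distinct names and genres.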


In [22]:
#Similarity model built on the count matrix
#Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Compute the pairwise cosine similarity between all rows of count_matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)

print(cosine_sim)

[[1.         0.15430335 0.35355339 ... 0.         0.         0.13608276]
 [0.15430335 1.         0.10910895 ... 0.         0.         0.        ]
 [0.35355339 0.10910895 1.         ... 0.         0.08703883 0.09622504]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.08703883 ... 0.         1.         0.10050378]
 [0.13608276 0.         0.09622504 ... 0.         0.10050378 1.        ]]
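
The top-left entries of this matrix can be reproduced by hand. For two films whose tokens each occur once, the cosine similarity is the number of shared tokens divided by the square root of the product of their token counts. The sketch below assumes the three soups as rebuilt by hand from the first three rows shown earlier; under that assumption the values agree with the 0.15430335 and 0.35355339 printed above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hand-built soups (assumed to match the notebook's first three rows)
soups = [
    # The Love Expert
    'natalietalmadge comedy romance johnemerson anitaloos davidkirkland',
    # Yes or No
    'natalietalmadge unknown arthurf.goodrich burnsmantle marymurillo roywilliamneill',
    # Our Hospitality
    'natalietalmadge comedy romance thriller jeanc.havez clydebruckman '
    'josepha.mitchell busterkeaton johng.blystone',
]

counts = CountVectorizer(stop_words='english').fit_transform(soups)
sim = cosine_similarity(counts)

# Manual check for films 0 and 2: 3 shared tokens, 6 and 12 tokens in total
manual = 3 / np.sqrt(6 * 12)
print(round(sim[0, 2], 8), round(manual, 8))
```

Films 0 and 1 share only the cast token, with 6 and 7 tokens respectively, which gives 1 / sqrt(42) and explains the lower score.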


In [25]:
#content based recommender system

#Map each title to its row position; keep the first row when a title appears more than once
indices = pd.Series(feature_df.index, index=feature_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def content_recommender(title):
    #get the index of the given movie title
    idx = indices[title]

    #pair every movie index with its cosine similarity to the given title
    sim_scores = list(enumerate(cosine_sim[idx]))

    #sort movies from highest similarity to lowest
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

    #skip the first item (the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    #get the indices of the titles in sim_scores
    movie_indices = [i[0] for i in sim_scores]

    #use iloc to look up the recommended rows by position
    return base_drop2.iloc[movie_indices]
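
One detail worth checking in a title lookup such as `indices`: `Series.drop_duplicates` compares the values of the Series (the row positions), not its index labels (the titles). If a title appears twice, looking it up returns two positions unless the labels themselves are deduplicated. The sketch below uses made-up titles to show the difference.

```python
import pandas as pd

# Made-up titles; 'Film A' appears twice
titles = pd.Series(['Film A', 'Film B', 'Film A'])
lookup = pd.Series(titles.index, index=titles)

# drop_duplicates() compares the values (0, 1, 2), so nothing is removed
print(len(lookup.drop_duplicates()))

# index.duplicated() compares the labels, so the second 'Film A' is dropped
deduped = lookup[~lookup.index.duplicated(keep='first')]
print(deduped['Film A'])
```

With the deduplicated lookup, each title maps to a single position, so it can index one row of the similarity matrix.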

In [26]:
#Example
content_recommender('The King of Kings')

Unnamed: 0,title,type,start,duration,genres,rating,votes,cast_name,director_name,writer_name
472,Lincoln,movie,2012.0,150.0,"[Biography, Drama, History]",7.3,236809,[Susan Migliore],[Steven Spielberg],"[Tony Kushner, Doris Kearns Goodwin]"
413,Mangal Pandey: The Rising,movie,2005.0,150.0,"[Biography, Drama, History]",6.6,9033,[Subhash Raul],[Ketan Mehta],"[H. Banerjee, Farrukh Dhondy, Ranjit Kapoor]"
778,Renoir,movie,2012.0,111.0,"[Biography, Drama, History]",6.5,5046,[Régis Quintal],[Gilles Bourdos],"[Jacques Renoir, Gilles Bourdos, Jérôme Tonner..."
557,The Iron Lady,movie,2011.0,105.0,"[Biography, Drama]",6.4,99611,[Robert Noble],[Phyllida Lloyd],[Abi Morgan]
81,I Know Why the Caged Bird Sings,tvMovie,1979.0,96.0,"[Biography, Drama]",7.2,241,[Torain],[Fielder Cook],"[Maya Angelou, Leonora Thuna]"
409,Troy,movie,2004.0,163.0,"[Drama, History]",7.2,471976,[Jitka Holickova],[Wolfgang Petersen],"[Homer, David Benioff]"
437,The Greatest Game Ever Played,movie,2005.0,120.0,"[Biography, Drama, Sport]",7.4,26889,[Terry Reid],[Bill Paxton],[Mark Frost]
696,My Week with Marilyn,movie,2011.0,99.0,"[Biography, Drama]",7.0,81331,[Robert Noble],[Simon Curtis],"[Adrian Hodges, Colin Clark]"
725,Czarny czwartek. Janek Wisniewski padl,movie,2011.0,105.0,"[Drama, History]",6.6,980,[Robert Kampa],[Antoni Krauze],"[Miroslaw Piepka, Michal Pruski]"
759,Rush,movie,2013.0,123.0,"[Biography, Drama, Sport]",8.1,413812,[Robert Noble],[Ron Howard],[Peter Morgan]
