# Movie Recommender based on Variable Similarities on each Movie

In this first project, I build a movie recommender system based on the similarity of several variables (genres, cast, directors, and writers) between movies. For a given movie, the suggested movies are those with the highest similarity values, sorted from most to least similar.

## 1) Import Data
The datasets that will be used in this project consist of:
* Data that contains movie titles, genres, ratings, etc. (located in **movie_rating_df.csv** and saved as **film_rating**).
* Data that contains the actors or actresses who played in the movie (located in **actor_name.csv** and saved as **actor**).
* Data that contains directors and writers involved in the movie (located in **directors_writers.csv** and saved as **dir_wri**)

The code below also shows a preview of each dataset and the data type of each column.

In [1]:
# Import libraries and datasets
import pandas as pd
import numpy as np

film_rating = pd.read_csv('movie_rating_df.csv')
dir_wri = pd.read_csv('directors_writers.csv')
actor = pd.read_csv('actor_name.csv')

# film_rating dataset preview and information about the dataset
print('film_rating dataset preview\n')
print(film_rating.head(),'\n')
print('film_rating dataset info\n')
film_rating.info()  # info() prints its report directly and returns None
print()

# director and writer name dataset preview and information
print('dir_wri dataset preview\n')
print(dir_wri.head(),'\n')
print('dir_wri dataset info\n')
dir_wri.info()  # info() prints its report directly and returns None
print()

# actor and actress names dataset preview and information
print('actor dataset preview\n')
print(actor.head(),'\n')
print('actor dataset info\n')
actor.info()  # info() prints its report directly and returns None
print()

film_rating dataset preview

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult  startYear  endYear  runtimeMinutes                    genres  \
0        0     1894.0      NaN             1.0         Documentary,Short   
1        0     1892.0      NaN             5.0           Animation,Short   
2        0     1892.0      NaN             4.0  Animation,Comedy,Romance   
3        0     1892.0      NaN            12.0           Animation,Short   
4        0     1893.0      NaN             1.0              Comedy,Short   

   averageRating  numVotes  
0            5.6      1608  
1          

## 2) Handling dir_wri dataframe

The first dataframe to be handled is **dir_wri**. It is the easiest one to handle because it has no defects such as missing data. As the **dir_wri** preview shows, it has a column with the names of the movie directors (*director_name*) and a column with the names of the screenwriters (*writer_name*). In this step, every value in those columns is converted from a comma-separated string into a list of strings.

In [2]:
# Splitting director and writer names
for a in ['director_name', 'writer_name']:
    dir_wri[a] = dir_wri[a].apply(lambda x: x.split(','))
print(dir_wri.head())

      tconst                      director_name  \
0  tt0011414                   [David Kirkland]   
1  tt0011890                [Roy William Neill]   
2  tt0014341  [Buster Keaton, John G. Blystone]   
3  tt0018054                 [Cecil B. DeMille]   
4  tt0024151                      [James Cruze]   

                                         writer_name  
0                         [John Emerson, Anita Loos]  
1   [Arthur F. Goodrich, Burns Mantle, Mary Murillo]  
2  [Jean C. Havez, Clyde Bruckman, Joseph A. Mitc...  
3                                [Jeanie Macpherson]  
4               [Max Miller, Wells Root, Jack Jevne]  


## 3) Handling actor Dataframe

As can be seen in the **actor** dataframe, one person can be known for more than one movie. As in the previous step, the values in the *knownForTitles* column are converted to lists of strings. The dataframe is then reshaped and saved under another name (**unnested**), in which each row pairs one actor/actress with one movie. This one-row-per-person-and-title layout makes it easy to count how many movies each actor/actress is linked to and, later, to group the cast names by title.

In [3]:
# Rearrange actor dataset, each data point consists of 1 movie title (knownForTitle column)
actor['knownForTitles'] = actor['knownForTitles'].apply(lambda x: x.split(','))
for b in ['knownForTitles']:
    idx = actor.index.repeat(actor['knownForTitles'].str.len())
    repeated_index = pd.DataFrame({b:np.concatenate(actor[b].values)}, index = idx)
unnested = repeated_index.join(actor.drop(columns = 'knownForTitles', axis = 1), how = 'left')
unnested = unnested[actor.columns.to_list()]
print(unnested.head(),'\n')

       nconst        primaryName birthYear deathYear  \
0   nm1774132  Nathan McLaughlin      1973        \N   
0   nm1774132  Nathan McLaughlin      1973        \N   
0   nm1774132  Nathan McLaughlin      1973        \N   
0   nm1774132  Nathan McLaughlin      1973        \N   
1  nm10683464      Bridge Andrew        \N        \N   

                    primaryProfession knownForTitles  
0  special_effects,make_up_department      tt0417686  
0  special_effects,make_up_department      tt1713976  
0  special_effects,make_up_department      tt1891860  
0  special_effects,make_up_department      tt0454839  
1                               actor      tt7718088   



In this preview, Nathan McLaughlin appears in four rows, which means he is linked to more than one movie: four, to be exact.
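
The same one-row-per-title reshaping can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). The toy frame `toy_actor` below is hypothetical, with values borrowed from the preview above; it is an alternative sketch, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the actor dataframe
toy_actor = pd.DataFrame({
    'primaryName': ['Nathan McLaughlin', 'Bridge Andrew'],
    'knownForTitles': ['tt0417686,tt1713976,tt1891860,tt0454839', 'tt7718088'],
})

# Split the comma-separated titles into lists, then emit one row per list element
toy_actor['knownForTitles'] = toy_actor['knownForTitles'].str.split(',')
toy_unnested = toy_actor.explode('knownForTitles')

# Count how many titles each person is linked to
titles_per_person = toy_unnested.groupby('primaryName')['knownForTitles'].count()
print(toy_unnested)
print(titles_per_person)
```

As with the loop above, the original row index is repeated for every title of the same person.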

## 4) Merge All Imported and Treated Data
Before merging, a dataframe that maps each title to its actors/actresses is needed. The names are grouped by title and collected into lists; this dataframe is named **df_ud**. It is then merged with the **film_rating** dataframe using an inner join, and the result is named **base_df**. The inner join keeps only the rows whose value in the *knownForTitles* column of **df_ud** matches a value in the *tconst* column of **film_rating**.

**base_df** is then merged with the **dir_wri** dataframe on the *tconst* column using a left join. A left join keeps every row of **base_df**; titles without a match in **dir_wri** get missing values in *director_name* and *writer_name*. The missing-value check in the next section reports 74 such rows, so an inner join would drop those titles instead of giving an identical result.
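
How a left join and an inner join treat unmatched titles can be checked on a small example. The sketch below uses hypothetical toy frames (`toy_base`, `toy_dir`) and the `indicator` option of `pd.merge`, which adds a `_merge` column showing whether each row found a match.

```python
import pandas as pd

# Hypothetical miniature versions of base_df and dir_wri
toy_base = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                         'title': ['A', 'B', 'C']})
toy_dir = pd.DataFrame({'tconst': ['tt1', 'tt3'],
                        'director_name': [['Dir One'], ['Dir Three']]})

# Left join keeps every title; unmatched rows get NaN in director_name
left = pd.merge(toy_base, toy_dir, on='tconst', how='left', indicator=True)
# Inner join keeps only the titles present in both frames
inner = pd.merge(toy_base, toy_dir, on='tconst', how='inner')

print(left)
print(len(left), len(inner))
```

The two joins give the same rows only when every title on the left has a match on the right.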

In [4]:
# Merge all datasets into single dataframe as base of movie recommender system
for c in ['primaryName']:
    df_ud = unnested.groupby('knownForTitles')[c].apply(list).reset_index()
df_ud = df_ud.rename(columns = {'primaryName':'cast_name'})
base_df = pd.merge(left = df_ud, right = film_rating, left_on = 'knownForTitles', right_on = 'tconst', how = 'inner')
base_df = pd.merge(left = base_df, right = dir_wri, left_on = 'tconst', right_on = 'tconst', how = 'left')
print(base_df.head())

  knownForTitles           cast_name     tconst titleType  \
0      tt0011414  [Natalie Talmadge]  tt0011414     movie   
1      tt0011890  [Natalie Talmadge]  tt0011890     movie   
2      tt0014341  [Natalie Talmadge]  tt0014341     movie   
3      tt0018054     [Reeka Roberts]  tt0018054     movie   
4      tt0024151     [James Hackett]  tt0024151     movie   

             primaryTitle           originalTitle  isAdult  startYear  \
0         The Love Expert         The Love Expert        0     1920.0   
1               Yes or No               Yes or No        0     1920.0   
2         Our Hospitality         Our Hospitality        0     1923.0   
3       The King of Kings       The King of Kings        0     1927.0   
4  I Cover the Waterfront  I Cover the Waterfront        0     1933.0   

   endYear  runtimeMinutes                   genres  averageRating  numVotes  \
0      NaN            60.0           Comedy,Romance            4.9       136   
1      NaN            72.0        

## 5) Cleaning data and handling missing data

A new dataframe called **base_df_drop** is **base_df** after dropping the *tconst* column, which holds the same values as the *knownForTitles* column. A missing-value check shows gaps in the *genres*, *director_name*, and *writer_name* columns, so these must be handled (*endYear* also has missing values, but that column is dropped below). The gaps are simply filled with the string 'unknown' to mark that no data was entered for that data point.

**base_df_drop_2** contains the same values as **base_df_drop**, except that the columns not needed for the recommender system (*knownForTitles*, *endYear*, *isAdult*, and *originalTitle*) are dropped and some of the remaining columns are renamed.

In [5]:
base_df_drop = base_df.drop(columns = 'tconst', axis = 1)

# Missing value checking
print(base_df_drop.isna().sum(),'\n')

# Handling missing value
base_df_drop['genres'] = base_df_drop['genres'].fillna('unknown')
base_df_drop[['director_name','writer_name']] = base_df_drop[['director_name','writer_name']].fillna('unknown')
print(base_df_drop.isna().sum(),'\n')

# Preview base_df_drop dataset
print(base_df_drop.head(),'\n')

# Splitting genres into list
base_df_drop['genres'] = base_df_drop['genres'].apply(lambda x: x.split(','))

# Drop knownForTitles, endYear, isAdult, originalTitle. Rename some of remaining columns
base_df_drop_2 = base_df_drop.drop(columns = ['knownForTitles','endYear','isAdult','originalTitle'], axis = 1)
base_df_drop_2 = base_df_drop_2.rename(columns = {'titleType':'type', 'primaryTitle':'title',
                                                 'startYear':'year', 'runtimeMinutes':'duration',
                                                 'averageRating':'rating', 'numVotes':'votes',})

print('base_df_drop dataframe preview after columns drop and renaming columns\n')
print(base_df_drop_2.head())

knownForTitles      0
cast_name           0
titleType           0
primaryTitle        0
originalTitle       0
isAdult             0
startYear           0
endYear           950
runtimeMinutes      0
genres            315
averageRating       0
numVotes            0
director_name      74
writer_name        74
dtype: int64 

knownForTitles      0
cast_name           0
titleType           0
primaryTitle        0
originalTitle       0
isAdult             0
startYear           0
endYear           950
runtimeMinutes      0
genres              0
averageRating       0
numVotes            0
director_name       0
writer_name         0
dtype: int64 

  knownForTitles           cast_name titleType            primaryTitle  \
0      tt0011414  [Natalie Talmadge]     movie         The Love Expert   
1      tt0011890  [Natalie Talmadge]     movie               Yes or No   
2      tt0014341  [Natalie Talmadge]     movie         Our Hospitality   
3      tt0018054     [Reeka Roberts]     movie       The K

A dataframe called **metadata_df** is then created. It contains only the movie title, the genres, the actors/actresses in the movie, and the names of the directors and screenwriters. The first step in building the metadata for the recommender system is a function that removes all spaces and lowercases the characters in each list of strings, so that a full name is no longer split at its spaces. The function is applied to every column whose values are lists of strings.

In [6]:
# Making metadata classification based on movie genre, name of casts, directors name, and writers name
metadata = ['genres','cast_name','director_name','writer_name']
metadata_df = base_df_drop_2[metadata]
metadata_df = pd.concat([base_df_drop_2['title'], metadata_df], axis = 1) 
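def meta_soup_material(cols):
    # Normalise a list of names (or a single string) into lowercase tokens without spaces
    try:
        if isinstance(cols, list):
            return [col.replace(' ','').lower() for col in cols]
        else:
            return [cols.replace(' ','').lower()]
    except AttributeError:
        # Non-string value (for example NaN): report it and fall back to an empty list
        print(cols)
        return []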
for d in metadata:
    metadata_df[d] = metadata_df[d].apply(meta_soup_material)
print(metadata_df.head())

                    title                       genres          cast_name  \
0         The Love Expert            [comedy, romance]  [natalietalmadge]   
1               Yes or No                    [unknown]  [natalietalmadge]   
2         Our Hospitality  [comedy, romance, thriller]  [natalietalmadge]   
3       The King of Kings  [biography, drama, history]     [reekaroberts]   
4  I Cover the Waterfront             [drama, romance]     [jameshackett]   

                    director_name  \
0                 [davidkirkland]   
1               [roywilliamneill]   
2  [busterkeaton, johng.blystone]   
3                [cecilb.demille]   
4                    [jamescruze]   

                                      writer_name  
0                        [johnemerson, anitaloos]  
1    [arthurf.goodrich, burnsmantle, marymurillo]  
2  [jeanc.havez, clydebruckman, josepha.mitchell]  
3                              [jeaniemacpherson]  
4               [maxmiller, wellsroot, jackjevne]  


The first step is done. The last step is another function that adds a new column containing the metadata soup to the **metadata_df** dataframe. The soup is a single string that joins the values of the *cast_name*, *genres*, *director_name*, and *writer_name* columns, separated by spaces. It will later be turned into a count matrix, from which the similarity between movies is calculated.

In [10]:
# Making metadata soup for recommender system
def meta_soup(col_element):
    return (' '.join(col_element['cast_name']) + ' ' +
            ' '.join(col_element['genres']) + ' ' +
            ' '.join(col_element['director_name']) + ' '+
            ' '.join(col_element['writer_name']) + ' ')
metadata_df['soup'] = metadata_df.apply(meta_soup, axis = 1)
print('metadata_df dataframe preview after making metadata soup in it\n')
print(metadata_df.head(),'\n')

metadata_df dataframe preview after making metadata soup in it

                    title                       genres          cast_name  \
0         The Love Expert            [comedy, romance]  [natalietalmadge]   
1               Yes or No                    [unknown]  [natalietalmadge]   
2         Our Hospitality  [comedy, romance, thriller]  [natalietalmadge]   
3       The King of Kings  [biography, drama, history]     [reekaroberts]   
4  I Cover the Waterfront             [drama, romance]     [jameshackett]   

                    director_name  \
0                 [davidkirkland]   
1               [roywilliamneill]   
2  [busterkeaton, johng.blystone]   
3                [cecilb.demille]   
4                    [jamescruze]   

                                      writer_name  \
0                        [johnemerson, anitaloos]   
1    [arthurf.goodrich, burnsmantle, marymurillo]   
2  [jeanc.havez, clydebruckman, josepha.mitchell]   
3                              [jeanie

## 6) Make the model

In this model, only 2 tools from scikit-learn are needed: **CountVectorizer** and **cosine_similarity**.

### i) CountVectorizer
This tool converts text into a vector based on the count of each word in the text. Because it is created with `stop_words = 'english'` in the code below, common English words such as "the", "is", "a", "are", and "had" are ignored, since they appear in almost any text and carry little information.
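
A small sketch of what the vectorizer produces is given below. The two soup strings are hypothetical and only mimic the style of the *soup* column. As far as I understand the default tokenizer, a token is a run of two or more word characters, so a name that still contains a period (such as `johng.blystone`) is split into two tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical soup strings in the same style as the soup column
toy_soup = [
    'natalietalmadge comedy romance davidkirkland',
    'natalietalmadge comedy thriller busterkeaton johng.blystone',
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_soup)

# vocabulary_ maps each token to its column index in the count matrix
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_counts.toarray())
```

Each row of the count matrix is one text, and each column is one token of the vocabulary.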

### ii) cosine_similarity
The main idea of this tool is based on this equation:

$$ \theta =\cos^{-1} (\frac{\mathbf{a} \bullet \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||}) $$

$$ \cos({\theta}) =  \frac{\mathbf{a} \bullet \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} $$

This cosine value normally ranges from -1 to 1.

The vectors ($ \mathbf{a} $ and $ \mathbf{b} $) represent rows, that is, the texts in the metadata column (*soup*). Each vector component is the count of one word from the vocabulary of the whole *soup* column, so each word corresponds to one column of the count matrix. Because word counts are never negative, the dot product cannot be negative either, so in this information retrieval case the cosine value ranges from 0 to 1 instead of -1 to 1. This range is also easier to interpret: a value of 0 means the two vectors share no component (no word in common), while a value of 1 means the two vectors point in exactly the same direction.
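
The formula can be checked by hand on a few small vectors. The sketch below is a plain-Python illustration with hypothetical count vectors over a four-word vocabulary; the helper `cosine` is written only for this example and is not the scikit-learn function.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical count vectors over the vocabulary [comedy, romance, drama, thriller]
movie_a = [1, 1, 0, 0]   # comedy romance
movie_b = [1, 1, 0, 1]   # comedy romance thriller
movie_c = [0, 0, 1, 0]   # drama

print(cosine(movie_a, movie_b))  # two shared words
print(cosine(movie_a, movie_c))  # no shared words
print(cosine(movie_a, movie_a))  # identical vectors
```

For `movie_a` and `movie_b` the dot product is 2 and the lengths are $ \sqrt{2} $ and $ \sqrt{3} $, so the cosine value is $ 2 / \sqrt{6} $, roughly 0.82.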

In [15]:
# import another scikit-learn library to use count vectorizer function and cosine similarity function on metadata_df dataframe
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vocab_count = CountVectorizer(stop_words = 'english')
vocab_vector = vocab_count.fit_transform(metadata_df['soup'])
print('shape of vocabulary vector consists of {} rows and {} columns\n'.format(vocab_vector.shape[0], vocab_vector.shape[1]))
print('matrix of vocabulary vector\n')
print(vocab_vector,'\n')

# search the similarities in every data point
similarity_matrix = cosine_similarity(vocab_vector, vocab_vector)
print('matrix of movie similarities\n')
print(similarity_matrix,'\n')

shape of vocabulary vector consists of 1060 rows and 10026 columns

matrix of vocabulary vector

  (0, 6898)	1
  (0, 1843)	1
  (0, 8261)	1
  (0, 2172)	1
  (0, 4620)	1
  (0, 517)	1
  (1, 6898)	1
  (1, 9570)	1
  (1, 8342)	1
  (1, 665)	1
  (1, 3364)	1
  (1, 1249)	1
  (1, 6274)	1
  (2, 6898)	1
  (2, 1843)	1
  (2, 8261)	1
  (2, 9304)	1
  (2, 1253)	1
  (2, 4633)	1
  (2, 1023)	1
  (2, 4215)	1
  (2, 3623)	1
  (2, 1814)	1
  (2, 4844)	1
  (2, 6789)	1
  :	:
  (1056, 1895)	1
  (1056, 6858)	1
  (1056, 9584)	1
  (1056, 758)	1
  (1056, 5620)	1
  (1056, 282)	1
  (1056, 782)	1
  (1057, 2538)	1
  (1057, 1895)	1
  (1057, 4786)	1
  (1057, 7830)	2
  (1058, 9304)	1
  (1058, 3771)	1
  (1058, 8467)	1
  (1058, 9060)	2
  (1058, 7794)	1
  (1058, 7733)	1
  (1058, 3465)	1
  (1058, 9068)	1
  (1059, 1843)	1
  (1059, 3771)	1
  (1059, 2884)	1
  (1059, 5463)	1
  (1059, 3462)	1
  (1059, 4575)	2 

matrix of movie similarities

[[1.         0.15430335 0.35355339 ... 0.         0.         0.13608276]
 [0.15430335 1.       

After running the code, the first thing to notice is the shape of the count matrix. From the 1060 texts in the *soup* column, 10026 distinct tokens were found after removing stop words. Below the shape, the sparse matrix is printed as (row, column) pairs followed by a count, that is, how often a given token occurs in a given text. At the bottom of the output is the matrix that results from the cosine similarity calculation. As expected, its values range from 0 to 1, which corresponds to angles between $ 90 ^\circ $ and $ 0 ^\circ $:

$$ \cos{(90{^\circ})} = 0 $$

$$ \cos{(0{^\circ})} = 1 $$

From this explanation, if a text (in this case, a vector) is perpendicular to another, there is no similarity between them, because the vectors share no words. In the opposite case the cosine value is 1, because both vectors point in the same direction.

This cosine similarity matrix, named **similarity_matrix** in the code, is indexed by movie index number: row *i* holds the similarity between movie *i* and every movie in the data, which is how movies similar to a specified title are found.

At this point, a function is needed that turns a specified movie title into its index number, in order to find the movies with the highest similarity values. Before writing the function, a data series that maps movie titles to index numbers is created. When a specific movie title is passed to the function, the series returns the index number of that movie.

The function, called **recommender_system**, works in 5 steps:

1) Look up the index number that corresponds to the specified movie title.

2) Build a list of enumerate objects from the row of **similarity_matrix** at that index number. The list pairs every movie index with its cosine similarity to the specified movie.

3) Sort the list from the highest to the lowest cosine value. The first entry is skipped, because the highest value is expected to belong to the specified movie itself (cosine value of 1) and the system should recommend **only** other movies.

4) Pick the 10 highest remaining cosine values and take their index values.

5) Use the index values to locate the movies that are similar to the specified movie in the **base_df** dataframe. As a reminder, **base_df** is the merge of the dataframe that contains the lists of actors/actresses (**df_ud**) with 2 datasets (**film_rating** and **dir_wri**).
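
The five steps above can be sketched on a tiny example in plain Python. The titles, the similarity values, and the helper name `top_similar` are hypothetical. Unlike the function in the next cell, this sketch removes the query movie by its index rather than by skipping the first sorted entry, which also works when another movie has a similarity of exactly 1.

```python
def top_similar(similarity_rows, titles, query_title, n=2):
    # Step 1: position of the query title
    idx = titles.index(query_title)
    # Step 2: pair every movie index with its similarity to the query
    scores = list(enumerate(similarity_rows[idx]))
    # Step 3: drop the query itself by index, then sort from highest to lowest score
    scores = [pair for pair in scores if pair[0] != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Steps 4 and 5: keep the top n and map the indexes back to titles
    return [(titles[i], score) for i, score in scores[:n]]

# Hypothetical titles and similarity values
toy_titles = ['A', 'B', 'C', 'D']
toy_similarity = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

print(top_similar(toy_similarity, toy_titles, 'A'))
```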

In [19]:
# Building the recommender system
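# Map each title to its row index; keep only the first row when a title appears more than once
indices = pd.Series(metadata_df.index, index = metadata_df['title'])
indices = indices[~indices.index.duplicated(keep = 'first')]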
def recommender_system(film_title):
    # Number of index based on specified movie title
    idx = indices[film_title]
    print('Movie with title "{}" is in {}th index\n'.format(film_title, idx))
    
    # cosine similarity array
    score_similar = list(enumerate(similarity_matrix[idx]))
    
    # sort similarity from highest score to lowest score
    score_similar = sorted(score_similar, key = lambda x: x[1], reverse = True)
    
    # skip the first entry (the movie itself) and keep the index numbers of the next 10
    movie_sim_score = score_similar[1:11]
    movie_sim_index = [i[0] for i in movie_sim_score]
    print('movie similarity score\n')
    print(movie_sim_score,'\n')
    
    # locate the movie according to the specified index number
    back_to_base = base_df.iloc[movie_sim_index]
    print('top 10 movies that is similar to "{}"\n'.format(film_title))
    print(back_to_base)

# Call the recommender system function
recommender_system('The Lion King')

Movie with title "The Lion King" is in 974th index

movie similarity score

[(848, 0.3779644730092272), (383, 0.30151134457776363), (1002, 0.2773500981126146), (73, 0.2721655269759087), (232, 0.2721655269759087), (556, 0.2721655269759087), (9, 0.2519763153394848), (191, 0.2519763153394848), (803, 0.2519763153394848), (983, 0.24253562503633294)] 

top 10 movies that is similar to "The Lion King"

     knownForTitles                   cast_name      tconst titleType  \
848       tt3040964  [Cristina Carrión Márquez]   tt3040964     movie   
383       tt0286336          [Francisco Bretas]   tt0286336  tvSeries   
1002      tt7222086          [Hiroki Matsukawa]   tt7222086  tvSeries   
73        tt0075147             [Joaquín Parra]   tt0075147     movie   
232       tt0119051            [Chris Kosloski]   tt0119051     movie   
556      tt10068158          [Hiroki Matsukawa]  tt10068158     movie   
9         tt0028657            [Bernard Loftus]   tt0028657     movie   
191       tt01078

"The Lion King" is the reference movie used to test the recommender system. As can be seen in the recommendation, 10 movies are suggested as similar to "The Lion King":

1) "The Jungle Book" with cosine value of **0.3779644730092272**

2) "The Animals of Farthing Wood" with cosine value of **0.30151134457776363**

3) "Made in Abyss" with cosine value of **0.2773500981126146**

4) "Robin and Marian" with cosine value of **0.2721655269759087**

5) "The Edge" with cosine value of **0.2721655269759087**

6) "Made in Abyss: Journey's Dawn" with cosine value of **0.2721655269759087**

7) "Boss of Lonely Valley" with cosine value of **0.2519763153394848**

8) "The Princess and the Goblin" with cosine value of **0.2519763153394848**

9) "Ostwind" with cosine value of **0.2519763153394848**

10) "The Skinner Boys: Guardians of the Lost Secrets" with cosine value of **0.24253562503633294**

Judging from the recommendations, the suggested movies are not very similar to "The Lion King". The cosine values are far from 1, and the angles between the vectors range from about $ 67.8 ^\circ $ to about $ 76.0 ^\circ $, closer to perpendicular than to parallel. To analyse what the problem with the system is, a different movie title is passed as the function argument.
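
The angles follow from the inverse cosine of the similarity scores. A minimal check with the standard `math` module, using the highest and lowest scores from the output above:

```python
import math

# Highest and lowest similarity scores reported for "The Lion King"
scores = [0.3779644730092272, 0.24253562503633294]

# Convert each cosine value to the angle between the vectors, in degrees
angles = [math.degrees(math.acos(score)) for score in scores]
print([round(angle, 2) for angle in angles])
```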

In [32]:
# Try another movie, a biography movie from 2013 titled "Rush"
recommender_system('Rush')

Movie with title "Rush" is in 759th index

movie similarity score

[(557, 0.5477225575051662), (437, 0.5000000000000001), (696, 0.5000000000000001), (81, 0.3333333333333334), (170, 0.3333333333333334), (3, 0.3086066999241838), (31, 0.3086066999241838), (102, 0.3086066999241838), (134, 0.3086066999241838), (355, 0.3086066999241838)] 

top 10 movies that is similar to "Rush"

    knownForTitles         cast_name     tconst titleType  \
557      tt1007029    [Robert Noble]  tt1007029     movie   
437      tt0388980      [Terry Reid]  tt0388980     movie   
696      tt1655420    [Robert Noble]  tt1655420     movie   
81       tt0079321          [Torain]  tt0079321   tvMovie   
170      tt0101393    [F. Pat Burns]  tt0101393     movie   
3        tt0018054   [Reeka Roberts]  tt0018054     movie   
31       tt0057188  [Ángela R. Hill]  tt0057188     movie   
102      tt0085380    [Richard Pine]  tt0085380     movie   
134      tt0093054    [Marie Marini]  tt0093054     movie   
355      tt02

In [36]:
# To demonstrate the error, another movie title will be typed. "Avengers" movie for example
recommender_system('Avengers')

KeyError: 'Avengers'

With the movie "Rush", the recommender system works slightly better than before. For example, the cosine similarity between "Rush" and "The Iron Lady" is **0.5477225575051662**, an angle between vectors of about $ 56.8 ^\circ $, which is better than the value between "The Lion King" and "The Jungle Book" (**0.3779644730092272**, or about $ 67.8 ^\circ $).

Another movie title is then used as the function argument. One of the Marvel Studios movies, "Avengers", is not found among the titles in the data. The lookup needs an exact title match, so a title that is missing from the data (or spelled differently there) raises a `KeyError` in the function.
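
One possible way to soften this error is to check the title first and offer close spellings when it is not found. The sketch below uses `difflib.get_close_matches` from the Python standard library. The helper name `suggest_titles` and the short title list are hypothetical; in this notebook the list could be built with `metadata_df['title'].tolist()`.

```python
import difflib

def suggest_titles(film_title, known_titles, n=3):
    # Return the title itself when it exists, otherwise the closest spellings
    if film_title in known_titles:
        return [film_title]
    return difflib.get_close_matches(film_title, known_titles, n=n, cutoff=0.6)

# Hypothetical title list standing in for the titles in the data
toy_titles = ['The Lion King', 'Rush', 'The Jungle Book', 'The Iron Lady']

print(suggest_titles('Rush', toy_titles))
print(suggest_titles('The Lion Kin', toy_titles))
print(suggest_titles('Avengers', toy_titles))
```

An empty list means that no title in the data is close enough, so the recommender function can print a message instead of raising an error.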

From this analysis, it can be concluded that the quality of the recommendations depends heavily on which movie title is given as input. In other words, if the user types a movie that stands out from the other movies in the data (one that shares few genres, cast members, directors, or writers with them), the system cannot give good recommendations (low cosine values). The system has another limitation: the **film_rating** dataset does not contain complete movie data, so the function raises an error whenever it cannot find the specified movie in the dataset.