# Movie Recommender System
#### TASK: Given a movie, the recommender system performs content-based filtering to recommend 10 movies that share features with the given movie.

## Import Modules
#### pandas: Python library for data manipulation and analysis
#### numpy: Python library that supports large, multi-dimensional arrays and matrices
#### sklearn (scikit-learn): Python library that provides tools for statistical modeling and machine learning

In [83]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

## Read in Data
#### Data set used: MovieLens 20M Dataset from Kaggle

In [84]:
movies = pd.read_csv('/Users/alessia/Desktop/Jupyter/MovieRec/movie.csv')

print(movies.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [85]:
tags = pd.read_csv('/Users/alessia/Desktop/Jupyter/MovieRec/tag.csv')

print(tags.head())

   userId  movieId            tag            timestamp
0      18     4141    Mark Waters  2009-04-24 18:19:40
1      65      208      dark hero  2013-05-10 01:41:18
2      65      353      dark hero  2013-05-10 01:41:19
3      65      521  noir thriller  2013-05-10 01:39:43
4      65      592      dark hero  2013-05-10 01:41:18


## Analyze Data

In [86]:
movies.info()
print('\n\n')
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   userId     465564 non-null  int64 
 1   movieId    465564 non-null  int64 
 2   tag        465548 non-null  object
 3   timestamp  465564 non-null  object
dtypes: int64(2), object(2)
memory usage: 14.2+ MB


## Clean Data

In [87]:
# Drop columns that are not needed for content-based filtering
tags = tags.drop(['userId', 'timestamp'], axis=1)

print(tags.head())

   movieId            tag
0     4141    Mark Waters
1      208      dark hero
2      353      dark hero
3      521  noir thriller
4      592      dark hero


In [88]:
# Separate genre terms for metadata merge
movies['genres'] = movies['genres'].str.replace('|',' ', regex=False)

# Convert titles to lowercase for case-insensitive searching
movies['title'] = movies['title'].str.lower()

print(movies.head())

   movieId                               title  \
0        1                    toy story (1995)   
1        2                      jumanji (1995)   
2        3             grumpier old men (1995)   
3        4            waiting to exhale (1995)   
4        5  father of the bride part ii (1995)   

                                        genres  
0  Adventure Animation Children Comedy Fantasy  
1                   Adventure Children Fantasy  
2                               Comedy Romance  
3                         Comedy Drama Romance  
4                                       Comedy  


## Merge movies and tags DataFrames

In [89]:
movies_tags = pd.merge(left=movies, right=tags, on='movieId', how='left')

print(movies_tags.head())

   movieId             title                                       genres  \
0        1  toy story (1995)  Adventure Animation Children Comedy Fantasy   
1        1  toy story (1995)  Adventure Animation Children Comedy Fantasy   
2        1  toy story (1995)  Adventure Animation Children Comedy Fantasy   
3        1  toy story (1995)  Adventure Animation Children Comedy Fantasy   
4        1  toy story (1995)  Adventure Animation Children Comedy Fantasy   

                                      tag  
0                                 Watched  
1                      computer animation  
2                 Disney animated feature  
3                         Pixar animation  
4   Téa Leoni does not star in this movie  


## Create Metadata

In [90]:
# Replace missing tags (movies without any tag) with empty strings
movies_tags['tag'] = movies_tags['tag'].fillna('')

# Combine all tags for the same movieId into a single lowercase string
movies_tags = movies_tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x.str.lower())).to_frame()

print(movies_tags.head())

                                                       tag
movieId                                                   
1        watched computer animation disney animated fea...
2        time travel adapted from:book board game child...
3        old people that is actually funny sequel fever...
4        chick flick revenge characters chick flick cha...
5        diane keaton family sequel steve martin weddin...


In [91]:
# Remerge data
content = pd.merge(movies, movies_tags, on='movieId', how='left')

# Create metadata by merging 'genres' and 'tag' columns
content['metadata'] = content[['genres', 'tag']].apply(lambda x: ' '.join(x), axis=1)

print(content.head())

   movieId                               title  \
0        1                    toy story (1995)   
1        2                      jumanji (1995)   
2        3             grumpier old men (1995)   
3        4            waiting to exhale (1995)   
4        5  father of the bride part ii (1995)   

                                        genres  \
0  Adventure Animation Children Comedy Fantasy   
1                   Adventure Children Fantasy   
2                               Comedy Romance   
3                         Comedy Drama Romance   
4                                       Comedy   

                                                 tag  \
0  watched computer animation disney animated fea...   
1  time travel adapted from:book board game child...   
2  old people that is actually funny sequel fever...   
3  chick flick revenge characters chick flick cha...   
4  diane keaton family sequel steve martin weddin...   

                                            metadata  
0  Adv

In [92]:
# Example metadata for Jumanji (1995)
content.loc[1, 'metadata']

'Adventure Children Fantasy time travel adapted from:book board game childhood recaptured game herds of cgi animals scary see also:zathura time time travel board game fantasy robin williams scary time travel robin williams joe johnston robin williams kid flick jungle robin williams board game robin williams animals lebbat robin williams time travel adventure robin williams children fantasy robin williams dynamic cgi action kirsten dunst robin williams robin williams fantasy kid flick animals animals fantasy for children fantasy adapted from:book childish children kid flick time travel joe johnston fantasy robin williams time travel animals board game children fantasy kirsten dunst robin williams time travel board game time travel children kid flick filmed in bc fantasy robin williams animals bad cgi based on a book board game chris van allsburg robin williams robin williams game magic board game monkey adapted from:book animals bad cgi based on a book board game childhood recaptured ch

## Perform Textual Analysis (TF-IDF)

#### TfidfVectorizer().fit_transform() converts a collection of raw documents to a matrix of TF-IDF features. It learns the vocabulary of the passed document set, calculates each term's IDF score, and returns a TF-IDF-weighted document-term matrix.

In [93]:
TFIDF_matrix = TfidfVectorizer(stop_words='english').fit_transform(content['metadata'])

print("Matrix Dimensions:", TFIDF_matrix.shape, '\n')
print(TFIDF_matrix)

Matrix Dimensions: (27278, 23865) 

  (0, 22599)	0.006491097793447923
  (0, 21969)	0.012070103775451308
  (0, 17608)	0.007104700136518033
  (0, 14772)	0.0070038886011277
  (0, 5909)	0.010557745307297191
  (0, 10419)	0.010435301471900858
  (0, 6259)	0.005216887398726472
  (0, 6789)	0.005917107839440688
  (0, 6257)	0.004698003318667564
  (0, 4134)	0.005298861381222637
  (0, 22622)	0.010474302070069847
  (0, 1945)	0.004251248805804899
  (0, 23023)	0.009235005693364823
  (0, 6715)	0.008683880226854514
  (0, 11493)	0.008277485486808423
  (0, 14453)	0.00611745706504942
  (0, 210)	0.009167963287355628
  (0, 20323)	0.008123753945951235
  (0, 17644)	0.010927326986799627
  (0, 134)	0.009083691251822493
  (0, 6668)	0.010927326986799627
  (0, 403)	0.005671661487227229
  (0, 22775)	0.009635545205966082
  (0, 20425)	0.0054236646073774
  (0, 4951)	0.007800893339116515
  :	:
  (27268, 18147)	0.7955752044012239
  (27268, 4270)	0.6058548457691421
  (27269, 4270)	0.4539164888497601
  (27269, 1003)	0.8910

In [94]:
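# To inspect the learned terms and their IDF scores, keep a reference to the fitted vectorizer:
# vectorizer = TfidfVectorizer(stop_words='english').fit(content['metadata'])
# for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
#     print(term, ':', idf)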

## Calculate Cosine Similarity

#### sklearn's cosine_similarity() function computes the L2-normalized dot product of two vectors, whereas linear_kernel() computes the plain dot product without normalization. Because TfidfVectorizer() produces L2-normalized vectors by default (norm='l2'), both functions return the same result here; linear_kernel() is slightly faster because it skips the redundant normalization step.
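
The equivalence can be checked on a small scale. The sketch below uses a hypothetical three-document corpus (`toy_docs` is not part of this notebook) and compares the two functions with and without the vectorizer's default L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "adventure fantasy board game",
    "comedy romance sequel",
    "adventure comedy animation",
]

# Default settings: rows are L2-normalized, so the dot product is the cosine
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)
sim_linear = linear_kernel(toy_tfidf, toy_tfidf)
sim_cosine = cosine_similarity(toy_tfidf, toy_tfidf)
kernels_match = np.allclose(sim_linear, sim_cosine)
print("Normalized vectors, results match:", kernels_match)

# With normalization switched off, the plain dot product is no longer a cosine
raw_tfidf = TfidfVectorizer(norm=None).fit_transform(toy_docs)
raw_match = np.allclose(
    linear_kernel(raw_tfidf, raw_tfidf),
    cosine_similarity(raw_tfidf, raw_tfidf),
)
print("Unnormalized vectors, results match:", raw_match)
```

The shortcut therefore depends on the vectorizer's `norm` setting: if it is changed from the default, `cosine_similarity()` is the safer choice.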

In [95]:
# Because the TF-IDF vectors are already L2-normalized, linear_kernel yields the cosine similarity directly
cosine_sim = linear_kernel(TFIDF_matrix, TFIDF_matrix)

print('Matrix Shape:', cosine_sim.shape, '\n')
print('Using linear_kernel()\n', cosine_sim, '\n\n')

# For comparison, demonstrating resulting matrices are identical
# cosine_sim_alt = cosine_similarity(TFIDF_matrix, TFIDF_matrix)
# print('Using cosine_similarity()\n', cosine_sim_alt)

Matrix Shape: (27278, 27278) 

Using linear_kernel()
 [[1.         0.06569937 0.0108549  ... 0.00823828 0.         0.05529475]
 [0.06569937 1.         0.00102618 ... 0.00414804 0.         0.13284636]
 [0.0108549  0.00102618 1.         ... 0.         0.         0.        ]
 ...
 [0.00823828 0.00414804 0.         ... 1.         0.         0.09477059]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.05529475 0.13284636 0.         ... 0.09477059 0.         1.        ]] 




In [96]:
cosine_sim_df = pd.DataFrame(cosine_sim)

print(cosine_sim_df.iloc[0:5, 0:5])

          0         1         2         3         4
0  1.000000  0.065699  0.010855  0.004902  0.035365
1  0.065699  1.000000  0.001026  0.063021  0.024147
2  0.010855  0.001026  1.000000  0.027040  0.115535
3  0.004902  0.063021  0.027040  1.000000  0.024689
4  0.035365  0.024147  0.115535  0.024689  1.000000


## Get Recommendations

In [97]:
# Create Series for movie index lookup
movie_indices = pd.Series(content.index, index=content['title'])
print(movie_indices)

title
toy story (1995)                          0
jumanji (1995)                            1
grumpier old men (1995)                   2
waiting to exhale (1995)                  3
father of the bride part ii (1995)        4
                                      ...  
kein bund für's leben (2007)          27273
feuer, eis & dosenbier (2002)         27274
the pirates (2014)                    27275
rentun ruusu (2001)                   27276
innocence (2014)                      27277
Length: 27278, dtype: int64


In [98]:
def get_recommendations(title, cosine_sim=cosine_sim):
    # Look up the row index of the film (titles are stored in lowercase)
    film_idx = movie_indices[title.lower()]
    # A duplicated title returns several indices; use the first match
    if isinstance(film_idx, pd.Series):
        film_idx = film_idx.iloc[0]

    # Rank all films by similarity to the query film, highest first
    sim_scores = sorted(enumerate(cosine_sim[film_idx]), key=lambda x: x[1], reverse=True)
    # Drop the query film itself and keep the 10 most similar films
    rec_indices = [i for i, _ in sim_scores if i != film_idx][:10]

    # Return title-cased names without modifying the movies DataFrame
    return movies['title'].iloc[rec_indices].str.title()

## Recommendation Search

In [99]:
get_recommendations('Toy story (1995)')

3027              Toy Story 2 (1999)
2270            Bug'S Life, A (1998)
4790           Monsters, Inc. (2001)
5121                  Ice Age (2002)
6271             Finding Nemo (2003)
15401             Toy Story 3 (2010)
8278         Incredibles, The (2004)
11614             Ratatouille (2007)
19186                 Tin Toy (1988)
25462    The Legend Of Mor'Du (2012)
Name: title, dtype: object

In [100]:
get_recommendations('Jumanji (1995)')

1643                  Flubber (1997)
496            Mrs. Doubtfire (1993)
8256           Final Cut, The (2004)
752                      Jack (1996)
1479             Fathers' Day (1997)
3359    Good Morning, Vietnam (1987)
2168                     Toys (1992)
5431           One Hour Photo (2002)
5168         Death To Smoochy (2002)
2346              Patch Adams (1998)
Name: title, dtype: object

In [101]:
get_recommendations('Mrs. Doubtfire (1993)')

1643                  Flubber (1997)
8256           Final Cut, The (2004)
752                      Jack (1996)
1479             Fathers' Day (1997)
3359    Good Morning, Vietnam (1987)
1                     Jumanji (1995)
2168                     Toys (1992)
5431           One Hour Photo (2002)
5168         Death To Smoochy (2002)
139             Birdcage, The (1996)
Name: title, dtype: object

In [102]:
get_recommendations('Charlie And The Chocolate Factory (2005)')

2994                                  Sleepy Hollow (1999)
2206                            Edward Scissorhands (1990)
12321    Sweeney Todd: The Demon Barber Of Fleet Street...
232                                         Ed Wood (1994)
14937                           Alice In Wonderland (2010)
10430                                  Corpse Bride (2005)
18995                                  Dark Shadows (2012)
653                       James And The Giant Peach (1996)
547                 Nightmare Before Christmas, The (1993)
6429     Pirates Of The Caribbean: The Curse Of The Bla...
Name: title, dtype: object