# Recommender System (Part 2)

## Recommendation Based on Content

In this section, I am going to create 2 recommendation engines that provide recomendation based on content. The first one would be using the plot of the movie as a feature. The second engine would have slightly more complex features; namely the keywords in the plot, the name of the Director, the genre, the name of the screenwriter and the cast. 

For these systems, I used the TfidfVecotirizer and CountVectorizer to extract information from the textual features. Both these vectorizers gives us the importance of a word in the corpus.

To determine similarity, I used Cosine similarity. We could also use any other metric for determining similarity.

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = pd.read_csv("movies_metadata.csv")
data.shape
data['genres'] = data['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
data.head(5)

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


## Based on Plot Description


In [3]:
df = data[['adult','genres','title','overview','vote_average','vote_count']]
df = df.iloc[:20000,:]
df['overview'] = df['overview'].fillna('')

In [4]:
tfidf = TfidfVectorizer(stop_words='english')

In [5]:
tfidf_matrix = tfidf.fit_transform(df['overview'])
tfidf_matrix.shape
cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)

In [6]:
#df = df.reset_index()
indices = pd.Series(df.index, index = df['title'])
indices['Jumanji']

1

In [7]:
def get_reco(title, cosine_sim=cosine_sim):
    idx = indices[title]
    sim_score = list(enumerate(cosine_sim[idx]))
    sim_score = sorted(sim_score,key = lambda x : x[1], reverse = True)
    sim_score = sim_score[1:11]
    movie_indices = [i[0] for i in sim_score]
    return df['title'].iloc[movie_indices]

In [8]:
get_reco("The Dark Knight Rises")

12481                            The Dark Knight
150                               Batman Forever
1328                              Batman Returns
15511                 Batman: Under the Red Hood
585                                       Batman
9230          Batman Beyond: Return of the Joker
18035                           Batman: Year One
19792    Batman: The Dark Knight Returns, Part 1
3095                Batman: Mask of the Phantasm
10122                              Batman Begins
Name: title, dtype: object

In [9]:
get_reco("Toy Story")

15348               Toy Story 3
2997                Toy Story 2
10301    The 40 Year Old Virgin
8327                  The Champ
1071      Rebel Without a Cause
11399    For Your Consideration
1932                  Condorman
3057            Man on the Moon
485                      Malice
11606              Factory Girl
Name: title, dtype: object

In [10]:
get_reco("Rocky")

2298                          Rocky IV
11446                     Rocky Balboa
2296                          Rocky II
2299                           Rocky V
2297                         Rocky III
17302    The Prizefighter and the Lady
7873           Angels with Dirty Faces
15023                   Cain and Mabel
10148       Somebody Up There Likes Me
16144                      The Fighter
Name: title, dtype: object

In [11]:
get_reco("The Fighter")

17302    The Prizefighter and the Lady
2299                           Rocky V
2612                     Killer's Kiss
14437                   The Hard Corps
1845                             Rocky
16215                     Killer McCoy
935            They Made Me a Criminal
8921                   Split Decisions
10359                       Golden Boy
14365                  99 River Street
Name: title, dtype: object

## Credits, Genre and Keywords

In [12]:
#data = pd.read_csv("movies_metadata.csv")
credits = pd.read_csv("credits.csv")
keywords = pd.read_csv("keywords.csv")

In [13]:
data = data.drop([19730,29503,35587])

In [14]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
data['id'] = data['id'].astype('int')

In [15]:
data = data.merge(keywords,on='id')
data = data.merge(credits,on='id')

In [16]:
data.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,keywords,cast,crew
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."


In [17]:
features = ['cast', 'crew', 'keywords']
for feature in features:
    data[feature] = data[feature].apply(literal_eval)

In [18]:
def get_list(x):
    if isinstance(x,list):
        words = []
        for i in x:
            words.append(i['name'])
        
        if len(words)>3:
            words = words[:3]
            
        return words
    return[]


features = ['cast', 'keywords']
for feature in features:
    data[feature] = data[feature].apply(get_list)

In [19]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

data['director'] = data['crew'].apply(get_director)

In [20]:
def get_screenplay(x):
    names = []
    for i in x:
        if i['job'] == 'Screenplay':
            return i['name']
    
    return np.nan
        
data['Screenplay'] = data['crew'].apply(get_screenplay)    

In [21]:
data = data[['genres','title','keywords','cast','director','Screenplay']]


In [22]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
    

In [23]:
features = ['genres','keywords','cast','director','Screenplay']

for feature in features:
    data[feature] = data[feature].apply(clean_data)

In [24]:
data.head(5)

Unnamed: 0,genres,title,keywords,cast,director,Screenplay
0,"[animation, comedy, family]",Toy Story,"[jealousy, toy, boy]","[tomhanks, timallen, donrickles]",johnlasseter,josswhedon
1,"[adventure, fantasy, family]",Jumanji,"[boardgame, disappearance, basedonchildren'sbook]","[robinwilliams, jonathanhyde, kirstendunst]",joejohnston,jonathanhensleigh
2,"[romance, comedy]",Grumpier Old Men,"[fishing, bestfriend, duringcreditsstinger]","[waltermatthau, jacklemmon, ann-margret]",howarddeutch,
3,"[comedy, drama, romance]",Waiting to Exhale,"[basedonnovel, interracialrelationship, single...","[whitneyhouston, angelabassett, lorettadevine]",forestwhitaker,ronaldbass
4,[comedy],Father of the Bride Part II,"[baby, midlifecrisis, confidence]","[stevemartin, dianekeaton, martinshort]",charlesshyer,nancymeyers


In [25]:
def soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['Screenplay'])+ ' ' + ' '.join(x['genres'])


In [26]:
data['metadata'] = data.apply(soup,axis = 1)

In [27]:
data.head(1)

Unnamed: 0,genres,title,keywords,cast,director,Screenplay,metadata
0,"[animation, comedy, family]",Toy Story,"[jealousy, toy, boy]","[tomhanks, timallen, donrickles]",johnlasseter,josswhedon,jealousy toy boy tomhanks timallen donrickles ...


In [28]:
data = data.iloc[:20000,:]


count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(data['metadata'])
cosine_sim2 = linear_kernel(count_matrix,count_matrix)

In [29]:
indices = pd.Series(data.index , index = data['title'])
indices.head(2)

title
Toy Story    0
Jumanji      1
dtype: int64

In [30]:
def get_reco2(title,cosine_sim = cosine_sim2):
    ind = indices[title]
    sim_scores = list(enumerate(cosine_sim[ind]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in  sim_scores]
    return data['title'].iloc[movie_indices]

In [31]:
get_reco2('The Dark Knight Rises')

12589            The Dark Knight
10210              Batman Begins
516            Romeo Is Bleeding
2668       Wanted: Dead or Alive
5106        The Long Good Friday
7772                    Mitchell
9038              State of Grace
11463               The Prestige
11524                Harsh Times
12490    Rise of the Footsoldier
Name: title, dtype: object

In [32]:
get_reco2('Toy Story')

3024                              Toy Story 2
15519                             Toy Story 3
17551                                  Cars 2
1996           One Hundred and One Dalmatians
2262                             A Bug's Life
3336                        Creature Comforts
3659     The Adventures of Rocky & Bullwinkle
11074                                    Cars
11209                           Monster House
11690               The Ugly Duckling and Me!
Name: title, dtype: object

In [40]:
get_reco2('The Fighter')

419                   Blue Chips
568          Spanking the Monkey
2801                 Three Kings
4064        Carman: The Champion
7696              Gladiator 1992
19920    Silver Linings Playbook
123       Flirting with Disaster
144       The Basketball Diaries
277                Nobody's Fool
349                         Cobb
Name: title, dtype: object