# Basic Recommendation System

By Justin Wong

## Poll Results

![Poll](screenshots/poll.png "poll")


# What We'll Do

1. Build a blockchain application that executes a transaction and sends funds from one account to another, using a local blockchain (Ganache).

2. Go over simple recommendation systems.

3. Combine the recommendation system with executing a transaction.

In [78]:
import numpy as np
import pandas as pd

# TfidfVectorizer builds the TF-IDF matrix from the plot overviews
from sklearn.feature_extraction.text import TfidfVectorizer

# CountVectorizer builds the count matrix from the metadata "soup"
from sklearn.feature_extraction.text import CountVectorizer

# cosine_similarity computes the similarity matrix from the count matrix
from sklearn.metrics.pairwise import cosine_similarity

# linear_kernel computes dot products, used for the TF-IDF similarity matrix
from sklearn.metrics.pairwise import linear_kernel

# literal_eval parses the stringified features into their corresponding Python objects
from ast import literal_eval

In [4]:
# Downloaded from https://www.kaggle.com/rounakbanik/the-movies-dataset
!ls data

credits.csv         links.csv           movies_metadata.csv ratings_small.csv
keywords.csv        links_small.csv     ratings.csv


In [5]:
fn = 'data/movies_metadata.csv'
movies_metadata = pd.read_csv(fn, low_memory=False)
movies_metadata.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


## Simple Recommender Based on Rankings/Reports

### Scoring our Rankings

$$ \text{WeightedRating} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right) $$

v = number of votes for the movie

m = minimum number of votes required to be listed in the chart

R = average rating of the movie

C = mean vote across the whole report (here, the full dataset)

In [8]:
# Calculate mean of vote average column
C = movies_metadata['vote_average'].mean()

# Calculate the minimum number of votes required to be in the chart, m
m = movies_metadata['vote_count'].quantile(0.90)

print("Vote average: {C}".format(C=C))
print("Vote count 90th percentile: {m}".format(m=m))


Vote average: 5.618207215133889
Vote count 90th percentile: 160.0


In [10]:
# Filter out all qualified movies into a new DataFrame
q_movies = movies_metadata.copy().loc[movies_metadata['vote_count'] >= m]

print("Filtered for top 10% by vote count, shape: {s}".format(s=q_movies.shape))
print("Full dataset shape: {s}".format(s=movies_metadata.shape))

Filtered for top 10% by vote count, shape: (4555, 24)
Full dataset shape: (45466, 24)


In [11]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)


In [14]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)


In [17]:
#Print the top 10 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

| | title | vote_count | vote_average | score |
|---|---|---|---|---|
| 314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 834 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 2843 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 522 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 23673 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |
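
As a sanity check on the table above, the weighted-rating formula can be evaluated by hand for two of its rows. The sketch below is plain Python using the values of `m` and `C` printed earlier; the helper name `weighted_rating_scalar` is introduced here purely for illustration.

```python
# Values printed earlier in this notebook
C = 5.618207215133889   # mean vote_average over the full dataset
m = 160.0               # 90th percentile of vote_count

def weighted_rating_scalar(v, R, m=m, C=C):
    """Weighted rating for a single movie with v votes and average rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# (vote_count, vote_average) pairs copied from the top-10 table
shawshank = weighted_rating_scalar(8358.0, 8.5)
ddlj = weighted_rating_scalar(661.0, 9.1)

print(round(shawshank, 6))
print(round(ddlj, 6))
```

Dilwale Dulhania Le Jayenge has the higher raw average (9.1 versus 8.5), but with only 661 votes its score is pulled further toward the global mean `C`, so it ranks below The Shawshank Redemption.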


## Comparing with IMDB

Top movies: https://www.imdb.com/chart/top/

![Top 10 from IMDB](screenshots/imdb_top_10.png "Top 10 from IMDB")

## Content-Based Recommender

Recommendations are based on movies that are similar to a particular movie.


In [23]:
#Print the plot overviews of 5 randomly sampled movies.
movies_metadata['overview'].sample(5)

40103    The holiday season is turned upside down for a...
15308    The story of the uncompromising artist and fig...
39667    Hamilton is assigned to infiltrate a terrorist...
14287    The tragic story of the beautiful daughter of ...
20171    After he loses his high-paying job, Dory takes...
Name: overview, dtype: object

## NLP

We have an NLP problem here: we need to extract features from the overviews that let us group movies by how similar or dissimilar they are.

We'll need to compute word vectors for each overview/document. To do so, we'll compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors.



In [27]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movies_metadata['overview'] = movies_metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_metadata['overview'])


In [28]:
print("Shape of TFIDF matrix: {s}".format(s=tfidf_matrix.shape))

Shape of TFIDF matrix: (45466, 75827)


The output above shows that the roughly 45,000 movie overviews (45,466 rows) are described by a vocabulary of 75,827 distinct words.
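
To make the idea of a TF-IDF matrix concrete, here is a minimal plain-Python sketch on three made-up overviews. It follows the smoothed idf, $\ln\frac{1+n}{1+df}+1$, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but it is an illustration only: the function `tfidf_vectors` and the toy documents are assumptions, and tokenization here is plain whitespace splitting with no stop-word removal.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Hypothetical toy overviews, not taken from the dataset
docs = [
    "batman fights crime in gotham",
    "a toy cowboy fights jealousy",
    "batman returns to gotham",
]
vocab, vectors = tfidf_vectors(docs)
print(len(docs), len(vocab))  # rows x columns of the toy matrix
```

Within the first document, a word that occurs in only one overview (`crime`) receives a larger weight than a word shared with another overview (`batman`), which is the behaviour that makes TF-IDF useful for telling documents apart.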

In [29]:
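#Array mapping from feature integer indices to feature names.
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement.
list(tfidf.get_feature_names_out()[5000:5010])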

['avails',
 'avaks',
 'avalanche',
 'avalanches',
 'avallone',
 'avalon',
 'avant',
 'avanthika',
 'avanti',
 'avaracious']

We can compute similarity scores using different mathematical functions, such as Manhattan distance, Euclidean distance, or cosine similarity. Here we use cosine similarity.

![Cosine Similarity](screenshots/cos_similarity.png "Cosine similarity")
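
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below, in plain Python with hypothetical helper names, also shows why a plain dot product is enough once the vectors are scaled to unit length, which is the property `linear_kernel` relies on when it is applied to L2-normalized TF-IDF rows.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (norm(a) * norm(b))

def unit(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

print(cosine(a, b))            # cosine similarity of the raw vectors
print(dot(unit(a), unit(b)))   # same value from a dot product of unit vectors
```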

In [32]:
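# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product from linear_kernel equals the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)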

In [33]:
print("Shape of cosine_sim: {s}".format(s=cosine_sim.shape))

print(cosine_sim[1]) ## to see what it actually looks like

Shape of cosine_sim: (45466, 45466)
[0.01504121 1.         0.04681953 ... 0.         0.02198641 0.00929411]


## Define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies.

In [34]:
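#Construct a reverse map from movie title to row index
indices = pd.Series(movies_metadata.index, index=movies_metadata['title'])
#Keep the first row for each repeated title so indices[title] is a single integer
indices = indices[~indices.index.duplicated(keep='first')]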

In [35]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

In [38]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies_metadata['title'].iloc[movie_indices]
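
The slice `sim_scores[1:11]` assumes the query movie itself sorts first, which holds as long as no earlier row ties its self-similarity of 1.0. A slightly more defensive variant, sketched below in plain Python on a toy similarity row, excludes the query by index instead. The function name `top_k_similar` and the sample scores are hypothetical.

```python
def top_k_similar(sim_row, query_idx, k=10):
    """Return the indices of the k highest-scoring items, excluding query_idx."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != query_idx]
    # Highest similarity first
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]

# Toy row of similarity scores for the item at position 1
sim_row = [0.1, 1.0, 0.5, 0.0, 0.3]
print(top_k_similar(sim_row, query_idx=1, k=2))
```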

In [41]:
display(get_recommendations('The Dark Knight Rises'))
display(get_recommendations('The Godfather'))

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

## How can we do better?

We can add additional metadata, such as:

- credits
- genres
- keywords

In [44]:
!ls data

credits.csv         links.csv           movies_metadata.csv ratings_small.csv
keywords.csv        links_small.csv     ratings.csv


In [54]:
movies_metadata.iloc[[19730, 29503, 35587]]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,


In [56]:
# Load keywords and credits
credits_fn = 'data/credits.csv'
keywords_fn = 'data/keywords.csv'
credits = pd.read_csv(credits_fn)
keywords = pd.read_csv(keywords_fn)

# Remove rows with bad IDs.
movies_metadata = movies_metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
movies_metadata['id'] = movies_metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
movies_metadata = movies_metadata.merge(credits, on='id')
movies_metadata = movies_metadata.merge(keywords, on='id')

In [59]:
# Print the first two movies of your newly merged metadata
movies_metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."


### Data Cleaning

The `cast`, `crew`, `keywords`, and `genres` columns are stored as "stringified" lists, so they first need to be parsed back into Python objects. The cleaning steps are:

1. Extract the director and up to the first 3 cast members, keywords, and genres.
2. Make everything lower case.
3. Remove the spaces inside each name or keyword.
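
The steps above can be tried on a single value first. The sketch below parses one stringified `genres` entry as it appears in the metadata (the Father of the Bride Part II row) and cleans the Toy Story cast names; the variable names are chosen here for illustration.

```python
from ast import literal_eval

# A genres cell exactly as it is stored in the CSV: a string, not a list
raw_genres = "[{'id': 35, 'name': 'Comedy'}]"

# literal_eval only evaluates Python literals, so it is a safer choice than eval
genres = literal_eval(raw_genres)
genre_names = [g['name'] for g in genres]

# Lower-case each name and remove its internal spaces
cast = ['Tom Hanks', 'Tim Allen', 'Don Rickles']
cleaned = [name.replace(' ', '').lower() for name in cast]

print(genre_names)
print(cleaned)
```

Joining first and last names into one token keeps the vectorizer from treating two different people who share a first name as partially the same word.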

In [61]:
features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(literal_eval)

In [63]:
# Get the director's name from the crew feature
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

# Return the top 3 elements or the entire list
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [64]:
# Define new director, cast, genres and keywords features that are in a suitable form.
movies_metadata['director'] = movies_metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(get_list)

In [66]:
# Print the new features of the first 3 films
movies_metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

| | title | cast | director | keywords | genres |
|---|---|---|---|---|---|
| 0 | Toy Story | [Tom Hanks, Tim Allen, Don Rickles] | John Lasseter | [jealousy, toy, boy] | [Animation, Comedy, Family] |
| 1 | Jumanji | [Robin Williams, Jonathan Hyde, Kirsten Dunst] | Joe Johnston | [board game, disappearance, based on children'... | [Adventure, Fantasy, Family] |
| 2 | Grumpier Old Men | [Walter Matthau, Jack Lemmon, Ann-Margret] | Howard Deutch | [fishing, best friend, duringcreditsstinger] | [Romance, Comedy] |


In [67]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [68]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    movies_metadata[feature] = movies_metadata[feature].apply(clean_data)

### Combining everything together

In [69]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [72]:
# Create a new soup feature
movies_metadata['soup'] = movies_metadata.apply(create_soup, axis=1)

In [74]:
movies_metadata[['soup']].head(2)

| | soup |
|---|---|
| 0 | jealousy toy boy tomhanks timallen donrickles ... |
| 1 | boardgame disappearance basedonchildren'sbook ... |


In [76]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies_metadata['soup'])

In [77]:
print("Shape of Count Vectorizer matrix: {s}".format(s=count_matrix.shape))

Shape of Count Vectorizer matrix: (46628, 73881)


In [79]:
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
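
The same computation on raw token counts can be sketched in plain Python. The three soups below are made up for illustration (they are not rows from the dataset), and `count_cosine` is a hypothetical helper that mimics, in simplified form, what `CountVectorizer` followed by `cosine_similarity` computes for one pair of documents.

```python
import math
from collections import Counter

def count_cosine(soup_a, soup_b):
    """Cosine similarity between whitespace-token count vectors of two strings."""
    ca, cb = Counter(soup_a.split()), Counter(soup_b.split())
    # Counter returns 0 for missing tokens, so only shared tokens contribute
    dot = sum(ca[tok] * cb[tok] for tok in ca)
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b)

# Hypothetical soups: keyword, keyword, director, actor, genre
soup_1 = "hero city directora actorx action"
soup_2 = "hero villain directora actorx action"
soup_3 = "wedding family directorb actory comedy"

print(count_cosine(soup_1, soup_2))  # four of five tokens shared
print(count_cosine(soup_1, soup_3))  # nothing shared
```

One commonly cited reason for using counts here instead of TF-IDF is that TF-IDF would down-weight a director or actor who appears in many films, even though that shared credit is exactly the signal this recommender is looking for.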


In [80]:
# Reset index of your main DataFrame and construct reverse mapping as before
movies_metadata = movies_metadata.reset_index()
indices = pd.Series(movies_metadata.index, index=movies_metadata['title'])

In [83]:
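# cosine_sim2 was built after the merge, so it lines up with the re-indexed DataFrame
display(get_recommendations('The Dark Knight Rises', cosine_sim2))
# The default cosine_sim predates the merge and re-index, so its rows no longer match
# `indices`; the overview-based lists printed below are misaligned
display(get_recommendations('The Dark Knight Rises'))

display(get_recommendations('The Godfather', cosine_sim2))
display(get_recommendations('The Godfather'))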

12589      The Dark Knight
10210        Batman Begins
9311                Shiner
9874       Amongst Friends
7772              Mitchell
516      Romeo Is Bleeding
11463         The Prestige
24090            Quicksand
25038             Deadfall
41063                 Sara
Name: title, dtype: object

3777                 A Couch in New York
40653                              Wacko
38251                             Agyaat
1304                    April Fool's Day
16844                 A Hole in the Soul
43127                     Hunting Season
16510                   Morsian yllättää
19970          H.P. Lovecraft's The Tomb
10230    Me and You and Everyone We Know
16042                    The Last Letter
Name: title, dtype: object

1934            The Godfather: Part III
1199             The Godfather: Part II
15609                   The Rain People
18940                         Last Exit
34488                              Rege
35802            Manuscripts Don't Burn
35803            Manuscripts Don't Burn
8001     The Night of the Following Day
18261                 The Son of No One
28683            In the Name of the Law
Name: title, dtype: object

25374                              The Fearmakers
3844                   Sorority House Massacre II
3244                                    Jail Bait
20667                         The Goddess of 1967
33253                                 Son of Saul
37854                                          RR
39265    Texas - Doc Snyder hält die Welt in Atem
37097                                  Reptilicus
23349          Celestial Wives of the Meadow Mari
183                                   Nine Months
Name: title, dtype: object

# References:

- https://www.datacamp.com/community/tutorials/recommender-systems-python

