### Content Based Recommendations

In the previous notebook, you were introduced to a way of making recommendations using collaborative filtering.  However, using this technique there are a large number of users who were left without any recommendations at all.  Other users were left with fewer than the ten recommendations that were set up by our function to retrieve....

In order to help these users out, let's try another technique: **content based** recommendations. Let's start off where we were in the previous notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import HTML
from progressbar import ProgressBar
import tests as t
import pickle

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

all_recs = pickle.load(open("all_recs.p", "rb"))

In [2]:
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [3]:
reviews[['user_id', 'movie_id', 'rating']].head()

Unnamed: 0,user_id,movie_id,rating
0,1,68646,10
1,1,113277,10
2,2,422720,8
3,2,454876,8
4,2,790636,7


In [4]:
list(all_recs.items())[:1]

[(2,
  ['Philadelphia (1993)',
   'Training Day (2001)',
   'About Schmidt (2002)',
   'Insomnia (2002)',
   'The United States of Leland (2003)',
   'Shattered Glass (2003)',
   'Man on Fire (2004)',
   'Flipped (2010)',
   'Silver Linings Playbook (2012)',
   'Lawless (2012)',
   '50/50 (2011)',
   'Crazy, Stupid, Love. (2011)',
   'The Perks of Being a Wallflower (2012)',
   'Before I Go to Sleep (2014)',
   'Zero Dark Thirty (2012)',
   'American Hustle (2013)',
   'Django Unchained (2012)',
   'Side Effects (2013)',
   'Gone Girl (2014)',
   'Enough Said (2013)',
   'Nightcrawler (2014)'])]

### Datasets

From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

For the individuals in **all_recs** who did receive 10 recommendations using collaborative filtering, we don't really need to worry about them.  However, there were a number of individuals in our dataset who did not receive any recommendations.

-----

`1.` Let's start with finding all of the users in our dataset who didn't get all 10 ratings we would have liked them to have using collaborative filtering.  

In [5]:
# Store user ids who have all their recommendations in this (10 or more)
# and those who don't: we'll be giving them content-based recommendations
users_with_all_recs = []
users_who_need_recs = []

for user, movie_recs in all_recs.items():
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)
    else:
        users_who_need_recs.append(user)

print('Of {} users, {} have at least 10 recommendations and {} have less.'.format(
    len(all_recs), len(users_with_all_recs), len(users_who_need_recs)))

Of 23512 users, 22187 have at least 10 recommendations and 1325 have less.


In [6]:
# A quick test
assert len(users_with_all_recs) == 22187
print("That's right there were still another 1325 users who needed recommendations when we only used collaborative filtering!")

That's right there were still another 1325 users who needed recommendations when we only used collaborative filtering!


### Content Based Recommendations

You will be doing a bit of a mix of content and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we didn't make recommendations earlier.     

`2.` Before finding recommendations, rank the user's ratings from highest to lowest. You will move through the movies in this order looking for other similar movies.

In [7]:
# create a dataframe similar to reviews, but ranked by rating for each user
reviews_descending = reviews[['user_id', 'movie_id', 'rating']].sort_values(by=['user_id', 'rating'], ascending=[1, 0])
reviews_descending.head()

Unnamed: 0,user_id,movie_id,rating
0,1,68646,10
1,1,113277,10
16,2,1798709,10
18,2,2024544,10
22,2,2726560,9


### Similarities

In the collaborative filtering sections, you became quite familiar with different methods of determining the similarity (or distance) of two users.  We can perform similarities based on content in much the same way.  

In many cases, it turns out that one of the fastest ways we can find out how similar items are to one another (when our matrix isn't totally sparse like it was in the earlier section) is by simply using matrix multiplication.  If you are not familiar with this, an explanation is available [here by 3blue1brown](https://www.youtube.com/watch?v=LyGKycYT2v0) and another quick explanation is provided [in the post here](https://math.stackexchange.com/questions/689022/how-does-the-dot-product-determine-similarity).

For us to pull out a matrix that describes the movies in our dataframe in terms of content, we might just use the indicator variables related to **year** and **genre** for our movies.  

Then we can obtain a matrix of how similar movies are to one another by taking the dot product of this matrix with itself.  Notice below that the dot product where our 1 values overlap gives a value of 2 indicating higher similarity.  In the second dot product, the 1 values don't match up.  This leads to a dot product of 0 indicating lower similarity.

<img src="images/dotprod1.png" alt="Dot Product" height="500" width="500">

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  In the below image, you can see that movies 1 and 8 are most similar, movies 2 and 8 are most similar, and movies 3 and 9 are most similar for this subset of the data.  The diagonal elements of the matrix will contain the similarity of a movie with itself, which will be the largest possible similarity (and will also be the number of 1's in the movie row within the orginal movie content matrix).

<img src="images/moviemat.png" alt="Dot Product" height="500" width="500">


`3.` Create a numpy array that is a matrix of indicator variables related to year (by century) and movie genres by movie.  Perform the dot product of this matrix with itself (transposed) to obtain a similarity matrix of each movie with every other movie.  The final matrix should be 31245 x 31245.

In [8]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = movies.set_index('movie_id')
movie_content.drop(columns=['movie', 'genre', 'date'], inplace=True)

# instructor's solution: movie_content = np.array(movies.iloc[:,4:])

movie_content.head()

Unnamed: 0_level_0,1800's,1900's,2000's,History,News,Horror,Musical,Film-Noir,Mystery,Adventure,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
10,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
12,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
25,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [9]:
# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))
dot_prod_movies.head()

movie_id,8,10,12,25,91,417,439,443,628,833,...,8144778,8144868,8206708,8289196,8324578,8335880,8342748,8342946,8402090,8439854
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8,3,3,3,1,2,1,1,0,1,1,...,1,0,0,1,0,0,0,0,0,0
10,3,3,3,1,2,1,1,0,1,1,...,1,0,0,1,0,0,0,0,0,0
12,3,3,3,1,2,1,1,0,1,1,...,1,0,0,1,0,0,0,0,0,0
25,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
91,2,2,2,1,3,1,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [10]:
dot_prod_movies.shape

(31245, 31245)

In [11]:
dot_prod_movies.iloc[0, 0]

3

In [12]:
# create checks for the dot product matrix
assert dot_prod_movies.shape[0] == 31245
assert dot_prod_movies.shape[1] == 31245
assert dot_prod_movies.iloc[0, 0] == np.max(dot_prod_movies.iloc[0])
print("Looks like you passed all of the tests.  Though they weren't very robust - if you want to write some of your own, I won't complain!")

Looks like you passed all of the tests.  Though they weren't very robust - if you want to write some of your own, I won't complain!


### For Each User...


Now you have a matrix where each user has their ratings ordered.  You also have a second matrix where movies are each axis, and the matrix entries are larger where the two movies are more similar and smaller where the two movies are dissimilar.  This matrix is a measure of content similarity. Therefore, it is time to get to the fun part.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until 10 recommendations or until there are no additional movies.

As a final note, you may need to adjust the criteria for 'most similar' to obtain 10 recommendations.  As a first pass, I used only movies with the highest possible similarity to one another as similar enough to add as a recommendation.

`3.` In the cell below, complete each of the functions needed for making content based recommendations.

In [13]:
# Playing around...

dot_prod_movies[dot_prod_movies.iloc[25] == np.max(dot_prod_movies.iloc[25])].index

Int64Index([   2354,    3863,    4100,    4101,    4210,    4395,    4518,
               4546,    4936,    5074,
            ...
             231639,  274518,  291476,  312908,  351080,  411819,  860507,
            2927244, 2942796, 3083958],
           dtype='int64', name='movie_id', length=163)

In [14]:
dot_prod_movies.iloc[25, 25]

3

In [15]:
def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - an array of the most similar movies by title
    '''
    # find the most similar movie indices - to start I said they need to be the same for all content
    similar_movie_ids = []
    max_value = np.max(dot_prod_movies.loc[movie_id])
    
    for id in dot_prod_movies.index:
        val = dot_prod_movies.loc[movie_id, id]
        
        if (val == max_value) & (id != movie_id):    # don't add the movie to its list of similar movies
            similar_movie_ids.append(id)
    
    # pull the movie titles based on the indices
    similar_movies = get_movie_names(similar_movie_ids)
    
    return similar_movies
    
# You made this function in an earlier notebook - using again here    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst


def make_recs(max_recs=10):
    '''
    INPUT
    None
    OUTPUT
    recs - a dictionary with keys of the user and values of the recommendations
    '''
    recs = dict()
    pbar = ProgressBar()
    
    for u, user in enumerate(pbar(users_who_need_recs)):
        
        user_recs = np.array([])
        
        # Pull only the reviews the user has seen
        user_has_seen = np.array(reviews_descending[reviews_descending['user_id'] == user]['movie_id'])

        # Look at each of the movies (highest ranked first), pull the movies the user hasn't seen that are most similar
        # These will be the recommendations - continue until 10 recs or you have depleted the movie list for the user
        for movie_id in user_has_seen:
            similar_movies = find_similar_movies(movie_id)
            similar_movies_not_seen = np.setdiff1d(similar_movies, user_has_seen, assume_unique=True)
            user_recs = np.unique(np.concatenate([similar_movies_not_seen, user_recs], axis=0))
            
            # If the max number of recommendations has been reached, break out of the movie loop and go to the next user
            if (len(user_recs) >= max_recs):
                user_recs = user_recs[:max_recs]
                break
        
        if (u % 100 == 0):
            print('\nUser {} has the following recommendations:\n{}\n'.format(user, user_recs))
            
        recs[user] = user_recs
    
    return recs

In [16]:
recs = make_recs()

  mask &= (ar1 != a)
  0% (1 of 1325) |                       | Elapsed Time: 0:00:10 ETA:   3:57:05


User 26 has the following recommendations:
['$5 a Day (2008)' '10 Items or Less (2006)' '10 Years (2011)'
 '10,000 Saints (2015)' '10.000 Km (2014)' '100 metros (2016)'
 '11:14 (2003)' '13 game sayawng (2006)' '17 Again (2009)'
 '2 Days in New York (2012)']



  7% (101 of 1325) |#                    | Elapsed Time: 0:00:57 ETA:   0:10:40


User 3889 has the following recommendations:
['$5 a Day (2008)' '10 Items or Less (2006)' '10 Years (2011)'
 '10,000 Saints (2015)' '10.000 Km (2014)' '100 metros (2016)'
 '11:14 (2003)' '13 game sayawng (2006)' '17 Again (2009)'
 '2 Days in New York (2012)']



 15% (201 of 1325) |###                  | Elapsed Time: 0:01:42 ETA:   0:07:52


User 8278 has the following recommendations:
['102 Dalmatians (2000)' '16-Love (2012)' '3 Holiday Tails (2011)'
 '3 Idiotas (2017)' 'A Christmas Snow (2010)' 'A Cinderella Story (2004)'
 'A Merry Little Christmas (2006)' 'A Street Cat Named Bob (2016)'
 'Adventures of a Teenage Dragonslayer (2010)'
 'Alexander and the Terrible, Horrible, No Good, Very Bad Day (2014)']



 22% (301 of 1325) |####                 | Elapsed Time: 0:02:29 ETA:   0:07:27


User 12341 has the following recommendations:
['A Wrinkle in Time (2018)' 'Alice Through the Looking Glass (2016)'
 'Alice in Wonderland (2010)' 'Back to the Secret Garden (2001)'
 'City of Ember (2008)' 'Fantastic Beasts and Where to Find Them (2016)'
 'Five Children and It (2004)'
 'Harry Potter and the Chamber of Secrets (2002)'
 'Harry Potter and the Deathly Hallows: Part 1 (2010)'
 'Harry Potter and the Goblet of Fire (2005)']



 30% (401 of 1325) |######               | Elapsed Time: 0:03:15 ETA:   0:06:30


User 16832 has the following recommendations:
['.com for Murder (2002)' '28 Days Later... (2002)'
 '3G - A Killer Connection (2013)' 'Absence (2013)' 'AfterDeath (2015)'
 'Alien Abduction (2014)' 'Alligator X (2010)' 'Almost Human (2013)'
 'Altered (2006)' 'Anatomie 2 (2003)']



 37% (501 of 1325) |#######              | Elapsed Time: 0:03:59 ETA:   0:05:51


User 20935 has the following recommendations:
['$ (1971)' '$30 (1999)' "'A' gai wak (1983)"
 "'Crocodile' Dundee II (1988)" "'Til There Was You (1997)"
 '*batteries not included (1987)' '...Più forte ragazzi! (1972)'
 '10 (1979)' '10 Things I Hate About You (1999)' '101 Dalmatians (1996)']



 45% (601 of 1325) |#########            | Elapsed Time: 0:04:45 ETA:   0:05:53


User 24507 has the following recommendations:
['3:10 to Yuma (1957)' 'A Man Called Horse (1970)' 'Arrowhead (1953)'
 'Bad Company (1972)' 'Bells of San Fernando (1947)'
 'Cahill U.S. Marshal (1973)' 'Cimarron (1931)' 'Clearcut (1991)'
 'Comanche Station (1960)' 'Comes a Horseman (1978)']



 52% (701 of 1325) |###########          | Elapsed Time: 0:05:32 ETA:   0:04:18


User 28412 has the following recommendations:
['10 Things I Hate About You (1999)' '2 secondes (1998)'
 '24 7: Twenty Four Seven (1997)' 'About Last Night... (1986)'
 'Absolute Giganten (1999)' "Adam's Rib (1949)" 'Adventure (1945)'
 'Afterglow (1997)' 'Al-kompars (1993)' 'All Night Long (1981)']



 60% (801 of 1325) |############         | Elapsed Time: 0:06:15 ETA:   0:03:41


User 33196 has the following recommendations:
['$ (1971)' '2 Days in the Valley (1996)' '48 Hrs. (1982)'
 '8 Heads in a Duffel Bag (1997)' 'A Fish Called Wanda (1988)'
 'A Life Less Ordinary (1997)' 'A Low Down Dirty Shame (1994)'
 'A Man Betrayed (1941)' 'A Matter of WHO (1961)'
 'A Night for Crime (1943)']



 68% (901 of 1325) |##############       | Elapsed Time: 0:07:01 ETA:   0:02:54


User 36577 has the following recommendations:
['$5 a Day (2008)' '10 Items or Less (2006)' '10 Years (2011)'
 '10,000 Saints (2015)' '10.000 Km (2014)' '100 metros (2016)'
 '11:14 (2003)' '13 game sayawng (2006)' '17 Again (2009)'
 '2 Days in New York (2012)']



 75% (1001 of 1325) |###############     | Elapsed Time: 0:07:46 ETA:   0:02:14


User 41639 has the following recommendations:
['12 Years a Slave (2013)'
 '2 Filhos de Francisco: A História de Zezé di Camargo &amp; Luciano (2005)'
 'A Mighty Heart (2007)' 'A Tale of Love and Darkness (2015)'
 'Admiral (2008)' 'Amazing Grace (2006)' 'Argo (2012)'
 'Cadillac Records (2008)' 'Castles in the Sky (2014)'
 'Catch a Fire (2006)']



 83% (1101 of 1325) |################    | Elapsed Time: 0:08:30 ETA:   0:02:00


User 44897 has the following recommendations:
['A Sound of Thunder (2005)' 'Aftershock (2012)' 'All Girls Weekend (2016)'
 'Animal (2014)' 'Battle of the Damned (2013)' 'Beyond Skyline (2017)'
 'Cloverfield (2008)' 'Darkest Day (2015)' 'Exists (2014)'
 'Hansel & Gretel: Warriors of Witchcraft (2013)']



 90% (1201 of 1325) |##################  | Elapsed Time: 0:09:15 ETA:   0:01:16


User 48674 has the following recommendations:
['Aap Kaa Surroor: The Moviee - The Real Luv Story (2007)' 'Allure (2017)'
 'Amar (2017)' 'Any Day (2015)' 'Asylum (2005)' 'Before the Rains (2007)'
 'Charlotte Gray (2001)' 'Death Defying Acts (2007)' 'Dot the I (2003)'
 'Down in the Valley (2005)']



 98% (1301 of 1325) |################### | Elapsed Time: 0:09:59 ETA:   0:00:10


User 52494 has the following recommendations:
['$ (1971)' '$30 (1999)' "'A' gai wak (1983)" "'Breaker' Morant (1980)"
 "'Crocodile' Dundee II (1988)" "'Doc' (1971)" "'Gator Bait (1974)"
 "'Hukkunud Alpinisti' hotell (1979)" "'I Know Where I'm Going!' (1945)"
 "'Je vous salue, Marie' (1985)"]



100% (1325 of 1325) |####################| Elapsed Time: 0:10:11 Time:  0:10:11


### How Did We Do?

Now that you have made the recommendations, how did we do in providing everyone with a set of recommendations?

`4.` Use the cells below to see how many individuals you were able to make recommendations for, as well as explore characteristics about individuals for whom you were not able to make recommendations.  

In [17]:
# Some characteristics of my content based recommendations

users_without_all_recs = []
users_with_all_recs = []
no_recs = []

for user, movie_recs in recs.items():
    if len(movie_recs) < 10:
        users_without_all_recs.append(user)
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)
    if len(movie_recs) == 0:
        no_recs.append(user)

print("There were {} users without all 10 recommendations we would have liked to have.".format(len(users_without_all_recs)))
print("There were {} users with all 10 recommendations we would like them to have.".format(len(users_with_all_recs)))
print("There were {} users with no recommendations at all!".format(len(no_recs)))

There were 0 users without all 10 recommendations we would have liked to have.
There were 1325 users with all 10 recommendations we would like them to have.
There were 0 users with no recommendations at all!


### Now What?  

Well, if you were really strict with your criteria for how similar two movies are (like I was initially), then you still have some users that don't have all 10 recommendations (and a small group of users who have no recommendations at all). 

As stated earlier, recommendation engines are a bit of an **art** and a **science**.  There are a number of things we still could look into - how do our collaborative filtering and content based recommendations compare to one another? How could we incorporate user input along with collaborative filtering and/or content based recommendations to improve any of our recommendations?  How can we truly gain recommendations for every user?

`5.` In this last step feel free to explore any last ideas you have with the recommendation techniques we have looked at so far.  You might choose to make the final needed recommendations using the first technique with just top ranked movies.  You might also loosen up the strictness in the similarity needed between movies.  Be creative and share your insights with your classmates!