               
# **Movie Recommender Systems** 

Recommender systems are AI systems that predict which new items a user will like and recommend them (e.g. YouTube videos, Netflix shows, Amazon products).

In this project, we'll use recommender systems to try to find a good movie for our next movie night!

Here's what we need to do:

* **Step 1:** Get a dataset of movie ratings, and make sure we understand how the dataset is structured.
* **Step 2:** Generate a non-personalized set of recommendations for Sambit, Parin and Hasmeet, to see if we can find a movie to watch that way.
* **Step 3:** Get personalized ratings for Sambit, Parin and Hasmeet, and import them into the system in the correct format.
* **Step 4:** Train a User-User collaborative filtering model to provide personalized recommendations based on Sambit's, Parin's and Hasmeet's prior ratings.
* **Step 5:** Combine ratings to generate a single ranked recommendation list for our movie night together!



We'll use an existing dataset published by MovieLens, which contains about 100,000 user ratings for about 10,000 different movies. You can read more about this dataset here: http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html

We'll also use the LensKit API to implement our recommender systems algorithms.

***STEP 1***

**Step 1.1**

In [None]:
!pip install lenskit


import lenskit.datasets as ds
import pandas as pd
!git clone https://github.com/crash-course-ai/lab4-recommender-systems.git


data = ds.MovieLens('lab4-recommender-systems/')

print("Successfully installed dataset.")

Collecting lenskit
  Downloading https://files.pythonhosted.org/packages/c1/51/9cd302b097f08835884588bc400724f3861bd8eefd73d0229be51b6541f9/lenskit-0.10.1.tar.gz (74kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting pyarrow>=0.15
  Downloading https://files.pythonho

It's important to understand how a dataset is structured and to make sure that the dataset imported correctly.  Let's print out a few rows of the rating data. 

As the output below shows, MovieLens stores a user's ID number (the first few rows look like they're all ratings from user 1), the item's ID (in this case each ID is a different movie), the rating the user gave this item, and a timestamp for when the rating was left.

**Step 1.2**

In [None]:
rows_to_show = 10  # <-- Try changing this number to see more rows of data
data.ratings.head(rows_to_show)  # <-- Try changing "ratings" to "movies", "tags", or "links" to see the kinds of data that's stored in the other MovieLens files

Unnamed: 0,user,item,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


A big aspect of recommender system datasets is how they handle missing data. Recommender systems usually have a LOT of missing data, because most users only rate a few movies and most movies only receive ratings from a few users. 

For example, we can see that user #1 gave a rating of 4.0 to item #1 and a rating of 4.0 to item #3. But there isn't a rating for item #2 at all, which means that user #1 never rated this item. It's helpful to know that this dataset doesn't store unrated items at all, instead of, for example, storing unrated items as 0 ratings.
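
To make the missing-data point concrete, here is a minimal sketch that measures how sparse a ratings table is. The `toy_ratings` rows and the `sparsity` helper are made up for illustration; they are not part of MovieLens or LensKit.

```python
# Toy (user, item, rating) rows in the same "long" layout as data.ratings.
# These values are illustrative, not taken from MovieLens.
toy_ratings = [
    (1, 1, 4.0),
    (1, 3, 4.0),
    (2, 1, 5.0),
    (3, 2, 3.0),
]

def sparsity(rows):
    """Return the fraction of user-item pairs that have no rating."""
    users = {user for user, _, _ in rows}
    items = {item for _, item, _ in rows}
    possible = len(users) * len(items)
    observed = len({(user, item) for user, item, _ in rows})
    return 1 - observed / possible

print(f"{sparsity(toy_ratings):.2%} of the user-item pairs are unrated")
```

With the real dataset, something like `sparsity(list(data.ratings[['user', 'item', 'rating']].itertuples(index=False)))` should give the same kind of number, assuming the column names shown in the table above.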

But here we have another small issue: names like item #1 and item #2 aren't very descriptive, so we can't tell what those movies are. Thankfully, MovieLens also has a data table called "movies" that includes information about titles and genres. We can get a more meaningful look at these data by joining the two data files. 

**Step 1.3**

In [None]:
joined_data = data.ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data.head(rows_to_show)

Unnamed: 0,user,item,rating,timestamp,genres,title
0,1,1,4.0,964982703,Adventure|Animation|Children|Comedy|Fantasy,Toy Story (1995)
1,1,3,4.0,964981247,Comedy|Romance,Grumpier Old Men (1995)
2,1,6,4.0,964982224,Action|Crime|Thriller,Heat (1995)
3,1,47,5.0,964983815,Mystery|Thriller,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,964982931,Crime|Mystery|Thriller,"Usual Suspects, The (1995)"
5,1,70,3.0,964982400,Action|Comedy|Horror|Thriller,From Dusk Till Dawn (1996)
6,1,101,5.0,964980868,Adventure|Comedy|Crime|Romance,Bottle Rocket (1996)
7,1,110,4.0,964982176,Action|Drama|War,Braveheart (1995)
8,1,151,5.0,964984041,Action|Drama|Romance|War,Rob Roy (1995)
9,1,157,5.0,964984100,Comedy|War,Canadian Bacon (1995)


***STEP 2***

Now that we have ratings, let's create a generic set of recommended movies by looking at the highest rated films. We can average all the ratings by item, sort the list in descending order, and print that top set of recommendations.

**Step 2.1**

In [None]:
average_ratings = (data.ratings).groupby(['item']).mean()
sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[1:]]

print("RECOMMENDED FOR SAMBIT:")
joined_data.head(rows_to_show)

RECOMMENDED FOR SAMBIT:


item,rating,timestamp,genres,title
88448,5.0,1315438000.0,Comedy|Drama,Paper Birds (Pájaros de papel) (2010)
100556,5.0,1456151000.0,Documentary,"Act of Killing, The (2012)"
143031,5.0,1520409000.0,Comedy|Drama|Romance,Jump In! (2007)
143511,5.0,1526207000.0,Documentary,Human (2015)
143559,5.0,1520410000.0,Comedy|Crime|Fantasy,L.A. Slasher (2015)
6201,5.0,1100120000.0,Drama|Romance,Lady Jane (1986)
102217,5.0,1443200000.0,Comedy,Bill Hicks: Revelations (1993)
102084,5.0,1493422000.0,Action|Animation|Fantasy,Justice League: Doom (2012)
6192,5.0,1063275000.0,Romance,Open Hearts (Elsker dig for evigt) (2002)
145994,5.0,1526207000.0,Comedy,Formula of Love (1984)


That seemed like a good idea, but the results are strange... _Paper Birds_? _Bill Hicks: Revelations_? Those are pretty obscure movies. Let's see what's actually happening here.

In [None]:
average_ratings = (data.ratings).groupby('item') \
       .agg(count=('user', 'size'), rating=('rating', 'mean')) \
       .reset_index()

sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[1:]]


print("RECOMMENDED FOR HASMEET:")
joined_data.head(rows_to_show)

RECOMMENDED FOR HASMEET:


Unnamed: 0,count,rating,genres,title
7638,1,5.0,Comedy|Drama,Paper Birds (Pájaros de papel) (2010)
8089,1,5.0,Documentary,"Act of Killing, The (2012)"
9065,1,5.0,Comedy|Drama|Romance,Jump In! (2007)
9076,1,5.0,Documentary,Human (2015)
9078,1,5.0,Comedy|Crime|Fantasy,L.A. Slasher (2015)
4245,1,5.0,Drama|Romance,Lady Jane (1986)
8136,1,5.0,Comedy,Bill Hicks: Revelations (1993)
8130,1,5.0,Action|Animation|Fantasy,Justice League: Doom (2012)
4240,1,5.0,Romance,Open Hearts (Elsker dig for evigt) (2002)
9104,1,5.0,Comedy,Formula of Love (1984)


Adding the "count" column, we can see that each of these movies was given a perfect 5.0 rating but by just ONE person. They might be good movies, but we can't be very confident in these recommendations.

To improve this list, we should only include movies in this recommendation list if they have more than a certain number of ratings, so we can be more confident that each movie is generally good. Let's start with movies that more than 20 people rated.
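
As a rough illustration of why the rating count matters, the sketch below ranks items by mean rating with and without a minimum-count filter. The data and the `top_items` helper are hypothetical and use plain Python rather than pandas.

```python
from collections import defaultdict

# Illustrative (item, rating) pairs: item "A" has a single perfect rating,
# while "B" and "C" have several ratings each.
toy_ratings = [("A", 5.0), ("B", 4.5), ("B", 4.0), ("B", 5.0), ("C", 3.0), ("C", 3.5)]

def top_items(rows, minimum_count):
    """Rank items by mean rating, keeping items with more than minimum_count ratings."""
    by_item = defaultdict(list)
    for item, rating in rows:
        by_item[item].append(rating)
    means = {
        item: sum(values) / len(values)
        for item, values in by_item.items()
        if len(values) > minimum_count
    }
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)

print(top_items(toy_ratings, minimum_count=0))  # "A" ranks first on one rating
print(top_items(toy_ratings, minimum_count=1))  # "A" is filtered out
```

Raising the threshold trades coverage for confidence: fewer movies qualify, but each surviving average is backed by more ratings.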

**Step 2.2**

In [None]:
minimum_to_include = 20 #<-- You can try changing this minimum to include movies rated by fewer or more people

average_ratings = (data.ratings).groupby(['item']).mean()
rating_counts = (data.ratings).groupby(['item']).count()
average_ratings = average_ratings.loc[rating_counts['rating'] > minimum_to_include]
sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['genres'], on='item')
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[3:]]

print("RECOMMENDED FOR PARIN:")
joined_data.head(rows_to_show)

RECOMMENDED FOR PARIN:


item,genres,title
318,Crime|Drama,"Shawshank Redemption, The (1994)"
922,Drama|Film-Noir|Romance,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)
898,Comedy|Drama|Romance,"Philadelphia Story, The (1940)"
475,Drama,In the Name of the Father (1993)
1204,Adventure|Drama|War,Lawrence of Arabia (1962)
246,Documentary,Hoop Dreams (1994)
858,Crime|Drama,"Godfather, The (1972)"
1235,Comedy|Drama|Romance,Harold and Maude (1971)
168252,Action|Sci-Fi,Logan (2017)
2959,Action|Crime|Drama|Thriller,Fight Club (1999)


These movies are better known, and because each one has more than 20 ratings we can put more trust in its average. But these movies span a bunch of genres.

Let's try to get a list of recommendations from two specific genres instead: Sci-Fi and Animation. So in addition to filtering by the number of ratings, let's also filter by a particular genre. We'll run the recommendations for a Sci-Fi movie fan, then for an animated movie fan.
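
One detail worth knowing before filtering: MovieLens stores a movie's genres as a single pipe-delimited string, and the filter in the next cell uses a substring match (`str.contains`). That works for the genre names used here, but a substring can over-match. The sketch below shows the difference with a hypothetical `has_genre` helper that splits on `|` first.

```python
def has_genre(genres, wanted):
    """Return True if `wanted` is exactly one of the pipe-separated genres."""
    return wanted in genres.split("|")

# Genre strings copied from the tables above.
print(has_genre("Action|Sci-Fi", "Sci-Fi"))          # True
print(has_genre("Drama|Film-Noir|Romance", "Noir"))  # False: "Noir" alone is not a genre
print("Noir" in "Drama|Film-Noir|Romance")           # True: plain substring match
```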

**Step 2.3**

In [None]:
average_ratings = (data.ratings).groupby(['item']).mean()
rating_counts = (data.ratings).groupby(['item']).count()
average_ratings = average_ratings.loc[rating_counts['rating'] > minimum_to_include]
average_ratings = average_ratings.join(data.movies['genres'], on='item')
average_ratings = average_ratings.loc[average_ratings['genres'].str.contains('Sci-Fi')]

sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[3:]]
print("RECOMMENDED FOR A SCI-FI MOVIE FAN:")
joined_data.head(rows_to_show)

RECOMMENDED FOR A SCI-FI MOVIE FAN:


item,genres,title
168252,Action|Sci-Fi,Logan (2017)
260,Action|Adventure|Sci-Fi,Star Wars: Episode IV - A New Hope (1977)
1196,Action|Adventure|Sci-Fi,Star Wars: Episode V - The Empire Strikes Back...
2571,Action|Sci-Fi|Thriller,"Matrix, The (1999)"
1199,Fantasy|Sci-Fi,Brazil (1985)
7361,Drama|Romance|Sci-Fi,Eternal Sunshine of the Spotless Mind (2004)
741,Animation|Sci-Fi,Ghost in the Shell (Kôkaku kidôtai) (1995)
1210,Action|Adventure|Sci-Fi,Star Wars: Episode VI - Return of the Jedi (1983)
541,Action|Sci-Fi|Thriller,Blade Runner (1982)
1223,Adventure|Animation|Children|Comedy|Sci-Fi,"Grand Day Out with Wallace and Gromit, A (1989)"


In [None]:
average_ratings = (data.ratings).groupby(['item']).mean()
rating_counts = (data.ratings).groupby(['item']).count()
average_ratings = average_ratings.loc[rating_counts['rating'] > minimum_to_include]
average_ratings = average_ratings.join(data.movies['genres'], on='item')
average_ratings = average_ratings.loc[average_ratings['genres'].str.contains('Animation')]

sorted_avg_ratings = average_ratings.sort_values(by="rating", ascending=False)
joined_data = sorted_avg_ratings.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[3:]]
print("RECOMMENDED FOR AN ANIMATED MOVIE FAN:")
joined_data.head(rows_to_show)

RECOMMENDED FOR AN ANIMATED MOVIE FAN:


item,genres,title
5618,Adventure|Animation|Fantasy,Spirited Away (Sen to Chihiro no kamikakushi) ...
741,Animation|Sci-Fi,Ghost in the Shell (Kôkaku kidôtai) (1995)
78499,Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 (2010)
720,Adventure|Animation|Comedy,Wallace & Gromit: The Best of Aardman Animatio...
1223,Adventure|Animation|Children|Comedy|Sci-Fi,"Grand Day Out with Wallace and Gromit, A (1989)"
31658,Adventure|Animation|Fantasy|Romance,Howl's Moving Castle (Hauru no ugoku shiro) (2...
6350,Action|Adventure|Animation|Children|Fantasy|Sc...,Laputa: Castle in the Sky (Tenkû no shiro Rapy...
60069,Adventure|Animation|Children|Romance|Sci-Fi,WALL·E (2008)
2761,Adventure|Animation|Children|Drama|Sci-Fi,"Iron Giant, The (1999)"
1148,Animation|Children|Comedy|Crime,Wallace & Gromit: The Wrong Trousers (1993)


There are actually two movies on both of these lists: _Ghost in the Shell_ and _A Grand Day Out with Wallace and Gromit_. But Parin doesn't want to rewatch either of them.

So, while Step 2 produced some generic recommendations, our AI hasn't given us a new movie we want to watch together.

***STEP 3***

Step 3 is personalizing our recommender system AI. Parin and Hasmeet each need to provide their own movie ratings as data, so each of them filled out a simple spreadsheet. We've uploaded these spreadsheets to GitHub.

But, we need to provide these personalized ratings in the correct format. By looking at the documentation for LensKit (https://lkpy.lenskit.org/en/stable/interfaces.html#lenskit.algorithms.Recommender.recommend), we know that we need to provide a dictionary of item-rating pairs for each person. This means that we need to import the two spreadsheets from GitHub and format the data in a way that will make sense to our AI: two dictionaries.

To test that it worked, let's also print both ratings for _The Princess Bride_, since we know that's a movie both of them have watched.

**Step 3.1**

In [None]:
import csv

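# The variable names follow the CSV file names:
# jabril_rating_dict holds Parin's ratings, jgb_rating_dict holds Hasmeet's.
jabril_rating_dict = {}
jgb_rating_dict = {}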

with open("/content/lab4-recommender-systems/jabril-movie-ratings.csv", newline='') as csvfile:
  ratings_reader = csv.DictReader(csvfile)
  for row in ratings_reader:
    if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 6)):
      jabril_rating_dict.update({int(row['item']): float(row['ratings'])})
      
with open("/content/lab4-recommender-systems/jgb-movie-ratings.csv", newline='') as csvfile:
  ratings_reader = csv.DictReader(csvfile)
  for row in ratings_reader:
    if ((row['ratings'] != "") and (float(row['ratings']) > 0) and (float(row['ratings']) < 6)):
      jgb_rating_dict.update({int(row['item']): float(row['ratings'])})
     
print("Rating dictionaries assembled!")
print("Sanity check:")
print("\tParin's rating for 1197 (The Princess Bride) is " + str(jabril_rating_dict[1197]))
print("\tHasmeet's rating for 1197 (The Princess Bride) is " + str(jgb_rating_dict[1197]))


Rating dictionaries assembled!
Sanity check:
	Parin's rating for 1197 (The Princess Bride) is 4.5
	Hasmeet's rating for 1197 (The Princess Bride) is 3.5


***STEP 4***

In Step 4, we want to actually train a new collaborative filtering model to provide recommendations. We'll use the UserUser algorithm from LensKit to do this. For a given user (in this case, Parin or Hasmeet), this algorithm finds a neighborhood of users with similar movie ratings and uses those neighbors' ratings to predict how that user would rate movies they haven't rated yet.
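
To give a feel for what "similar users" means, here is a minimal sketch of one common way to compare two users: cosine similarity over mean-centered ratings. The users and helper functions are hypothetical, and LensKit's internal implementation may differ in its details.

```python
from math import sqrt

def mean_center(ratings):
    """Subtract the user's mean rating so generous and harsh raters are comparable."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: value - mean for item, value in ratings.items()}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (missing items count as 0)."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical users: item id -> rating.
alice = {1: 5.0, 2: 4.0, 3: 1.0}
bob = {1: 4.5, 2: 4.0, 3: 2.0}    # similar taste to alice
carol = {1: 1.0, 2: 2.0, 3: 5.0}  # opposite taste to alice

sim_bob = cosine_similarity(mean_center(alice), mean_center(bob))
sim_carol = cosine_similarity(mean_center(alice), mean_center(carol))
print(sim_bob)    # close to +1
print(sim_carol)  # close to -1
```

Users with a similarity near +1 make good neighbors; users near -1 tend to disagree and are usually ignored.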






**Step 4.1**

In [None]:
from lenskit.algorithms import Recommender
from lenskit.algorithms.user_knn import UserUser

num_recs = 10  #<---- This is the number of recommendations to generate. You can change this if you want to see more recommendations

user_user = UserUser(15, min_nbrs=3) #These two numbers set the minimum (3) and maximum (15) number of neighbors to consider. These are considered "reasonable defaults," but you can experiment with others too
algo = Recommender.adapt(user_user)
algo.fit(data.ratings)

print("Set up a User-User algorithm!")

Set up a User-User algorithm!


Now that the model has been fit to the MovieLens ratings, we can give it our personal ratings to get the top 10 recommended movies for Hasmeet and for Parin.

For each of us, the User-User algorithm will find a neighborhood of users similar to us based on their movie ratings.
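
The neighborhood idea can be sketched in a few lines of plain Python. This is an illustrative approximation, not LensKit's code: the `predict` helper, the toy users, and the choice of cosine similarity over mean-centered ratings are all assumptions, and the `max_nbrs` / `min_nbrs` parameters only mirror the two numbers passed to `UserUser` in Step 4.1.

```python
from math import sqrt

def centered(ratings):
    """Return (mean, mean-centered ratings) for one user."""
    mean = sum(ratings.values()) / len(ratings)
    return mean, {i: r - mean for i, r in ratings.items()}

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict(new_ratings, known_users, item, max_nbrs=15, min_nbrs=3):
    """Predict a new user's rating for `item` from their most similar neighbors."""
    new_mean, new_vec = centered(new_ratings)
    candidates = []
    for ratings in known_users.values():
        if item not in ratings:
            continue  # this user has no opinion on the item
        mean, vec = centered(ratings)
        sim = cosine(new_vec, vec)
        if sim > 0:
            candidates.append((sim, ratings[item] - mean))
    neighbors = sorted(candidates, reverse=True)[:max_nbrs]
    if len(neighbors) < min_nbrs:
        return None  # too few similar users to make a prediction
    weighted = sum(sim * offset for sim, offset in neighbors)
    return new_mean + weighted / sum(sim for sim, _ in neighbors)

# Hypothetical data: three existing users and one new user who has not rated item 4.
known_users = {
    "u1": {1: 5.0, 2: 4.0, 3: 1.0, 4: 5.0},
    "u2": {1: 4.0, 2: 4.0, 3: 2.0, 4: 4.5},
    "u3": {1: 1.0, 2: 2.0, 3: 5.0, 4: 1.0},
}
new_user = {1: 4.5, 2: 4.0, 3: 1.5}

print(predict(new_user, known_users, item=4, min_nbrs=2))  # high: u1 and u2 agree with us
print(predict(new_user, known_users, item=4, min_nbrs=3))  # None: only two similar users
```

Passing the new user's ratings as a dictionary is the same idea as the `ratings=` argument used in the next cell: the new user is not in the training data, so their ratings are supplied at recommendation time.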

**Step 4.2**

In [None]:
jabril_recs = algo.recommend(-1, num_recs, ratings=pd.Series(jabril_rating_dict))  #Here, -1 tells it that it's not an existing user in the set, that we're giving new ratings, while 10 is how many recommendations it should generate

joined_data = jabril_recs.join(data.movies['genres'], on='item')      
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[2:]]
print("\n\nRECOMMENDED FOR PARIN:")
joined_data



RECOMMENDED FOR PARIN:


Unnamed: 0,genres,title
0,Comedy|Drama,"Last Detail, The (1973)"
1,Comedy,Love and Death (1975)
2,Drama,Before Night Falls (2000)
3,Drama,"Magdalene Sisters, The (2002)"
4,Drama|Horror|Mystery|Sci-Fi|Thriller,Black Mirror: White Christmas (2014)
5,Action|Animation|Drama|Fantasy|Sci-Fi,Neon Genesis Evangelion: The End of Evangelion...
6,Action|Adventure|Thriller,Raiders of the Lost Ark: The Adaptation (1989)
7,Comedy|Drama|Romance,Submarine (2010)
8,Adventure|Drama,Nebraska (2013)
9,Documentary,"Endless Summer, The (1966)"


In [None]:
jgb_recs = algo.recommend(-1, num_recs, ratings=pd.Series(jgb_rating_dict))  #Here, -1 tells it that it's not an existing user in the set, that we're giving new ratings, while 10 is how many recommendations it should generate

joined_data = jgb_recs.join(data.movies['genres'], on='item')      
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[2:]]
print("RECOMMENDED FOR HASMEET:")
joined_data

RECOMMENDED FOR HASMEET:


Unnamed: 0,genres,title
0,Comedy,The Night Before (2015)
1,Adventure|Drama|Sci-Fi,"Day of the Doctor, The (2013)"
2,Drama|Fantasy|Romance,Wristcutters: A Love Story (2006)
3,Comedy|Musical,Holiday Inn (1942)
4,Comedy,Outside Providence (1999)
5,Comedy|Romance,Adam's Rib (1949)
6,Drama,Reign Over Me (2007)
7,Drama,Guess Who's Coming to Dinner (1967)
8,Drama,Half Nelson (2006)
9,Comedy,Fired Up (2009)


Now we have "top 10" lists of movies for both Parin and Hasmeet! Each list only contains movies that person hasn't watched before (or at least didn't include in their personal ratings). These lists include both popular movies and more obscure ones.
But our lists don't overlap at all.
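
A quick way to confirm the lack of overlap is a set intersection. The sketch below uses the titles exactly as printed in the two tables above (including the truncated one).

```python
# Titles copied from the two recommendation tables above.
parin_recs = {
    "Last Detail, The (1973)",
    "Love and Death (1975)",
    "Before Night Falls (2000)",
    "Magdalene Sisters, The (2002)",
    "Black Mirror: White Christmas (2014)",
    "Neon Genesis Evangelion: The End of Evangelion...",
    "Raiders of the Lost Ark: The Adaptation (1989)",
    "Submarine (2010)",
    "Nebraska (2013)",
    "Endless Summer, The (1966)",
}
hasmeet_recs = {
    "The Night Before (2015)",
    "Day of the Doctor, The (2013)",
    "Wristcutters: A Love Story (2006)",
    "Holiday Inn (1942)",
    "Outside Providence (1999)",
    "Adam's Rib (1949)",
    "Reign Over Me (2007)",
    "Guess Who's Coming to Dinner (1967)",
    "Half Nelson (2006)",
    "Fired Up (2009)",
}

shared = parin_recs & hasmeet_recs
print(shared if shared else "No overlap between the two lists")
```

In the notebook itself, the same check could be written against the recommendation tables, for example `set(jabril_recs['item']) & set(jgb_recs['item'])`, since both are joined on their `item` column above.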

***STEP 5***

That brings us to Step 5, making a combined movie recommendation list. Because rating preferences are stored as numbers, we can create a hybrid!

We'll also do a quick sanity check by looking at _The Princess Bride_ again. Parin rated it 4.5 (because it's awesome!!) and Hasmeet rated it 3.5, so we'd expect our combined list to have it as a 4.0.

**Step 5.1**

In [None]:
combined_rating_dict = {}
for k in jabril_rating_dict:
  if k in jgb_rating_dict:
    combined_rating_dict.update({k: float((jabril_rating_dict[k]+jgb_rating_dict[k])/2)})
  else:
    combined_rating_dict.update({k:jabril_rating_dict[k]})
for k in jgb_rating_dict:
   if k not in combined_rating_dict:
      combined_rating_dict.update({k:jgb_rating_dict[k]})
      
print("Combined ratings dictionary assembled!")
print("Sanity check:")
print("\tCombined rating for 1197 (The Princess Bride) is " + str(combined_rating_dict[1197]))

Combined ratings dictionary assembled!
Sanity check:
	Combined rating for 1197 (The Princess Bride) is 4.0


Looks like everything checks out. So now, we have a combined dictionary that we can plug right into our User-User model to output a ranked list of new movies that we should both enjoy!
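
The same averaging idea extends to more than two people, for example if Sambit's ratings were added. The sketch below is one way to write it; `combine_ratings` is a hypothetical helper, and all ratings except the two for item 1197 (taken from the Step 3.1 sanity check) are made up.

```python
def combine_ratings(*rating_dicts):
    """Average each item's rating over the people who actually rated it."""
    totals = {}
    counts = {}
    for ratings in rating_dicts:
        for item, rating in ratings.items():
            totals[item] = totals.get(item, 0.0) + rating
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}

# Item 1197 uses the ratings printed in Step 3.1; everything else is illustrative.
parin = {1197: 4.5, 1: 4.0}
hasmeet = {1197: 3.5, 2: 5.0}
sambit = {1: 3.0}  # made-up third person

combined = combine_ratings(parin, hasmeet, sambit)
print(combined[1197])  # 4.0
print(combined[1])     # 3.5
print(combined[2])     # 5.0
```

An item rated by only one person keeps that person's rating, which matches what the two-person loop in Step 5.1 does.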

**Step 5.2**

In [None]:
combined_recs = algo.recommend(-1, num_recs, ratings=pd.Series(combined_rating_dict))  #Here, -1 tells it that it's not an existing user in the set, that we're giving new ratings, while 10 is how many recommendations it should generate

joined_data = combined_recs.join(data.movies['genres'], on='item')      
joined_data = joined_data.join(data.movies['title'], on='item')
joined_data = joined_data[joined_data.columns[2:]]
print("\n\nRECOMMENDED FOR PARIN / HASMEET HYBRID:")
joined_data



RECOMMENDED FOR PARIN / HASMEET HYBRID:


Unnamed: 0,genres,title
0,Comedy|Drama|Romance,Submarine (2010)
1,Drama|Romance,Call Me by Your Name (2017)
2,Drama|Sci-Fi,"Man Who Fell to Earth, The (1976)"
3,Comedy|Romance,Adam's Rib (1949)
4,Drama|War,Gallipoli (1981)
5,Drama,Before Night Falls (2000)
6,Adventure|Drama|Sci-Fi,"Day of the Doctor, The (2013)"
7,Action|Adventure|Thriller,Raiders of the Lost Ark: The Adaptation (1989)
8,Adventure|Drama|Western,True Grit (1969)
9,Comedy,Love and Death (1975)


The number one recommendation is _[Submarine](https://www.imdb.com/title/tt1440292/)_ which is a quirky movie from 2010. If this is too obscure, we could pick a different recommendation from this list like _[True Grit](https://www.imdb.com/title/tt1403865/)_.

We could also go back to Step 4.1 and set different parameters. The trade-off between unconventional and popular results is what really characterizes recommender systems!