# Lab 7

## Collaborative filtering and recommendations

### At the end of this lab, I should be able to
* Understand how item-item and user-user collaborative filtering perform recommendations
* Explain an experiment where we tested item-item versus user-user

**Note:** Exercises can be autograded and count towards your lab and assignment score. Problems are graded for participation.

**Video Introduction:**
https://calpoly.zoom.us/rec/share/TNZApiQaUUNRyMrKIl8MAVEh1FCIFNUTCthC81lA1Cn-Vw2CVn3hWBd6Wtde2WXj.V_7lCBK8Yk1MXx_r?startTime=1646613203000

In [2]:
from pathlib import Path
home = str(Path.home()) # all other paths are relative to this path. change to something else if this is not the case on your system

In [3]:
%load_ext autoreload
%autoreload 2

# make sure you run the cell above before running this
import Lab7_helper

## Real dataset: Movielens

https://grouplens.org/datasets/movielens/

> MovieLens is a collaborative filtering system for movies. A
user of MovieLens rates movies using 1 to 5 stars, where 1 is "Awful" and 5 is "Must
See". MovieLens then uses the ratings of the community to recommend other movies
that user might be interested in, predict what that user might rate a movie,
or perform other tasks. - "Collaborative Filtering Recommender Systems"

In [4]:
import pandas as pd
import numpy as np

ratings = pd.read_csv(f'{home}/csc-466-student/data/movielens-small/ratings.csv') # you might need to change this path
ratings = ratings.dropna()
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
movies = pd.read_csv(f'{home}/csc-466-student/data/movielens-small/movies.csv')
movies = movies.dropna()
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Joining the data together
We need to join those two source dataframes into a single one called ``data``. I do this by setting the index of each to movieId and then specifying an ``inner`` join, which means that a movie has to exist on both sides of the join to be kept. Then I reset the index so that I can later set the multi-index of userId and movieId. The results are displayed below. Pandas is awesome, but it takes some getting used to.

In [6]:
data = movies.set_index('movieId').join(ratings.set_index('movieId'),how='inner').reset_index()
data = data.drop('timestamp',axis=1) # We won't need timestamp here
data.head()

Unnamed: 0,movieId,title,genres,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5
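
To see what the ``inner`` join keeps and drops, the same pattern can be run on two tiny frames. This is a minimal sketch: the frames `toy_movies` and `toy_ratings` and their rows are made up for illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Made-up stand-ins for movies.csv and ratings.csv (not MovieLens rows)
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 4], "rating": [4.0, 3.0, 5.0]}
)

# Same pattern as the lab cell: index both sides by movieId, inner join, reset the index
toy_data = (
    toy_movies.set_index("movieId")
    .join(toy_ratings.set_index("movieId"), how="inner")
    .reset_index()
)
print(toy_data)
```

Movies 2 and 3 have no ratings and the rating for movieId 4 has no matching movie, so only the two ratings of movie 1 survive the join.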


In [7]:
ratings = data.set_index(['userId','movieId'])['rating']
ratings # as Series

userId  movieId
1       1          4.0
5       1          4.0
7       1          4.5
15      1          2.5
17      1          4.5
                  ... 
184     193581     4.0
        193583     3.5
        193585     3.5
        193587     3.5
331     193609     4.0
Name: rating, Length: 100836, dtype: float64

In [8]:
ratings.unstack()
ratings.unstack().loc[1].dropna()

movieId
1       4.0
3       4.0
6       4.0
47      5.0
50      5.0
       ... 
3744    4.0
3793    5.0
3809    4.0
4006    4.0
5060    5.0
Name: 1, Length: 232, dtype: float64
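
The two calls above turn the (userId, movieId) Series into a user-by-movie matrix and then pull out one user's rated movies. A small sketch of the same idea on made-up values (the Series `toy` is hypothetical):

```python
import pandas as pd

# Made-up ratings indexed by (userId, movieId), mirroring the shape of `ratings`
toy = pd.Series(
    [4.0, 3.0, 5.0],
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 20), (2, 10)], names=["userId", "movieId"]
    ),
    name="rating",
)

matrix = toy.unstack()             # rows: userId, columns: movieId, NaN where unrated
user_1 = matrix.loc[1].dropna()    # only the movies user 1 actually rated
print(matrix)
print(user_1)
```

User 2 never rated movie 20, so that cell of `matrix` is NaN; most of the real MovieLens matrix looks like that.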

#### Exercise 1
I provide a structure for predicting recommendations using user-user collaborative filtering. For this exercise, please complete the missing components.

``data_raw`` - your entire dataframe

``x_raw`` - the data from a single user

``N`` - neighborhood size

``frac`` - fraction for your test dataset

In [9]:
mae = Lab7_helper.predict_user_user(ratings.unstack(),ratings.unstack().loc[1])
mae

0.8241596814667028
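
For orientation, here is a minimal sketch of one common form of user-user prediction: mean-center each user, compare users with cosine similarity, and average the ratings of the N nearest neighbors. It is not the code in `Lab7_helper`, whose internals are not shown here; the function names `cosine` and `predict_user_user_sketch` and the toy matrix are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def cosine(a, b):
    """Cosine similarity of two 1-D arrays; 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_user_user_sketch(db, user, movie, N=2):
    """Predict `user`'s rating of `movie` from the N most similar other users."""
    user_means = db.mean(axis=1)                       # per-user mean over rated movies
    centered = db.sub(user_means, axis=0).fillna(0)    # unrated entries become 0
    rated_movie = db[movie].notna()                    # users who rated the target movie
    others = centered[rated_movie].drop(index=user, errors="ignore")
    features = centered.columns.drop(movie)            # compare users on the other movies
    target = centered.loc[user, features].to_numpy()
    sims = others[features].apply(lambda row: cosine(row.to_numpy(), target), axis=1)
    top = sims.sort_values(ascending=False).head(N).index  # userId labels, not positions
    return user_means[user] + centered.loc[top, movie].mean()

# Made-up 4 users x 4 movies rating matrix
toy = pd.DataFrame(
    [[5, 4, 1, 5], [5, 5, 1, 5], [1, 2, 5, 1], [4, 5, 2, 4]],
    index=["u1", "u2", "u3", "u4"], columns=["m1", "m2", "m3", "m4"], dtype=float,
)
train = toy.copy()
train.loc["u1", "m4"] = np.nan                 # hold out one known rating
pred = predict_user_user_sketch(train, "u1", "m4", N=2)
abs_error = abs(pred - toy.loc["u1", "m4"])    # MAE averages this over held-out ratings
print(round(pred, 3), round(abs_error, 3))
```

The held-out rating plays the role of the test fraction (``frac``): hide some known ratings, predict them, and average the absolute errors to get the MAE.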

#### Exercise 2
I provide a structure for predicting recommendations using item-item collaborative filtering. For this exercise, please complete the missing components.

In [10]:
mae = Lab7_helper.predict_item_item(ratings.unstack(),ratings.unstack().loc[1])
mae


0.74
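
The item-item variant flips the comparison: instead of finding users similar to the target user, it finds movies similar to the target movie among those the user has already rated. The sketch below is one possible formulation under that assumption, not the `Lab7_helper` implementation; `predict_item_item_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def cosine(a, b):
    """Cosine similarity of two 1-D arrays; 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_item_item_sketch(db, user, movie, N=2):
    """Predict `user`'s rating of `movie` from the N most similar movies they rated."""
    user_means = db.mean(axis=1)
    centered = db.sub(user_means, axis=0).fillna(0)
    items = centered.drop(index=user).T            # rows: movies, columns: other users
    rated = db.loc[user].dropna().index.drop(movie, errors="ignore")
    target = items.loc[movie].to_numpy()
    sims = items.loc[rated].apply(lambda row: cosine(row.to_numpy(), target), axis=1)
    top = sims.sort_values(ascending=False).head(N)
    top = top[top > 0]                             # keep positively similar movies only
    if top.empty:
        return user_means[user]                    # fall back to the user's mean
    weights = top / top.sum()
    # Weighted average of the user's centered ratings on the similar movies
    return user_means[user] + (centered.loc[user, top.index] * weights).sum()

# Made-up 4 users x 4 movies rating matrix
toy = pd.DataFrame(
    [[5, 4, 1, 5], [5, 5, 1, 5], [1, 2, 5, 1], [4, 5, 2, 4]],
    index=["u1", "u2", "u3", "u4"], columns=["m1", "m2", "m3", "m4"], dtype=float,
)
train = toy.copy()
train.loc["u1", "m4"] = np.nan                     # hold out one known rating
pred = predict_item_item_sketch(train, "u1", "m4", N=2)
print(round(pred, 3))
```

In this toy matrix the other users rate m1 and m4 identically, so m1 dominates the weighted average for the held-out m4 rating.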

#### Problem 1
This is an open-ended question that requires you to code. I have provided my own ratings for some of the movies in the dataset. What would you recommend to me based on my ratings if you applied user-user filtering? Feel free to substitute your own ratings. I rated the top 5 movies according to the count of users who have rated each movie.

**Upload your solution here:** https://canvas.calpoly.edu/courses/89325/assignments/594374

In [11]:
data[['movieId','title']].value_counts()

movieId  title                                 
356      Forrest Gump (1994)                       329
318      Shawshank Redemption, The (1994)          317
296      Pulp Fiction (1994)                       307
593      Silence of the Lambs, The (1991)          279
2571     Matrix, The (1999)                        278
                                                  ... 
4093     Cop (1988)                                  1
4089     Born in East L.A. (1987)                    1
58351    City of Men (Cidade dos Homens) (2007)      1
4083     Best Seller (1987)                          1
193609   Andrew Dice Clay: Dice Rules (1991)         1
Length: 9724, dtype: int64

In [12]:
counts = data[['movieId','title']].value_counts().reset_index()

In [13]:
user_ratings = pd.DataFrame(index=['Dr. Anderson'],columns=counts['title'])
user_ratings.loc["Dr. Anderson","Forrest Gump (1994)"] = 4
user_ratings.loc["Dr. Anderson","Shawshank Redemption, The (1994)"] = 5
user_ratings.loc["Dr. Anderson","Pulp Fiction (1994)"] = 3
user_ratings.loc["Dr. Anderson","Silence of the Lambs, The (1991)"] = 2
user_ratings.loc["Dr. Anderson","Matrix, The (1999)"] = 5
user_ratings

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Doomsday (2008),Gardens of Stone (1987),"Fourth Protocol, The (1987)",Mongol (2007),War Dance (2007),Cop (1988),Born in East L.A. (1987),City of Men (Cidade dos Homens) (2007),Best Seller (1987),Andrew Dice Clay: Dice Rules (1991)
Dr. Anderson,4,5,3,2,5,,,,,,...,,,,,,,,,,


In [14]:
ratings_reordered = ratings.unstack().T.loc[counts['movieId']].T # reorder the ratings to be the same as above
ratings_reordered.columns = user_ratings.columns
ratings_reordered

userId,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Doomsday (2008),Gardens of Stone (1987),"Fourth Protocol, The (1987)",Mongol (2007),War Dance (2007),Cop (1988),Born in East L.A. (1987),City of Men (Cidade dos Homens) (2007),Best Seller (1987),Andrew Dice Clay: Dice Rules (1991)
1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,5.0,...,,,,,,,,,,
2,,3.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,0.5,...,,,,,,,,,,
4,,,1.0,5.0,1.0,5.0,,,,,...,,,,,,,,,,
5,,3.0,5.0,,,,,4.0,3.0,5.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,4.0,3.5,5.0,4.5,5.0,4.5,2.5,3.5,3.5,5.0,...,,,,,,,,,,
607,,5.0,3.0,5.0,5.0,3.0,4.0,5.0,4.0,5.0,...,,,,,,,,,,
608,3.0,4.5,5.0,4.0,5.0,3.5,3.0,4.0,3.0,4.0,...,,,,,,,,,,
609,4.0,4.0,4.0,,,,3.0,3.0,3.0,,...,,,,,,,,,,


In [218]:
import numpy as np
### Start coding your solution here
#set up
N = 10
db = ratings_reordered
x = user_ratings

#removing duplicate columns
db = db.loc[:,~db.columns.duplicated()].copy()
x = x.loc[:,~x.columns.duplicated()].copy()

#normalizing data
db = (db.T - db.T.mean()).fillna(0).T
x = (x.T - x.T.mean()).fillna(0).T

#grabbing the most similar users
sims = db.apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum(axis=1))),axis=1)
sims = sims.dropna()
sims = pd.DataFrame(sims.values, columns=["similarity"])
sorted_sims = sims.sort_values(by="similarity")[::-1]
top_sims = sorted_sims.iloc[:N]
ids = top_sims.index
similar_users = db.loc[ids]

#removing movies you've seen before
similar_users.drop(columns=x.loc[:, x.iloc[0] != 0].columns, inplace=True)

#grabbing the top-rated movies on average by users similar to you
average_movie_ratings = similar_users.mean(axis=0)
average_movie_ratings.sort_values(ascending=False).head(10)


title
Fargo (1996)                                                 0.425971
Apocalypse Now (1979)                                        0.328971
How to Train Your Dragon (2010)                              0.317286
Lord of the Rings: The Fellowship of the Ring, The (2001)    0.308066
Lord of the Rings: The Return of the King, The (2003)        0.308066
Schindler's List (1993)                                      0.301963
Eternal Sunshine of the Spotless Mind (2004)                 0.298943
Donnie Darko (2001)                                          0.276377
Toy Story (1995)                                             0.262505
Lord of the Rings: The Two Towers, The (2002)                0.258066
dtype: float64
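
One detail that may be worth checking in a solution like this: rebuilding a frame from `sims.values` discards the userId labels, so the index of the sorted result holds row positions, while `.loc` looks rows up by label. The following sketch uses made-up similarity values to show the difference; whether it changes the list above depends on the actual data.

```python
import pandas as pd

# Made-up similarities for three users whose ids start at 1, like MovieLens userIds
sims = pd.Series([0.9, 0.1, 0.7], index=pd.Index([1, 2, 3], name="userId"))

# Rebuilding from .values replaces the userId labels with positions 0..n-1
repositioned = pd.DataFrame(sims.values, columns=["similarity"])
top_positions = list(
    repositioned.sort_values(by="similarity", ascending=False).index[:2]
)

# Sorting the Series directly keeps the userId labels, which is what .loc expects
top_user_ids = list(sims.sort_values(ascending=False).index[:2])

print(top_positions)  # [0, 2]
print(top_user_ids)   # [1, 3]
```

Passing `top_user_ids` to `db.loc[...]` selects the intended users, whereas the positional values would select different rows (or raise a `KeyError` for 0).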

#### Problem 2
Repeat problem 1 but recommend movies using item-item. Any difference? Which one do you think is more reasonable?

**Upload your solution here:** https://canvas.calpoly.edu/courses/89325/assignments/594375

In [209]:
N = 10
db = ratings_reordered
x = user_ratings

#removing duplicate columns
db = db.loc[:,~db.columns.duplicated()].copy()
x = x.loc[:,~x.columns.duplicated()].copy()

#normalizing data
db = (db.T - db.T.mean()).fillna(0).T
x = (x.T - x.T.mean()).fillna(0).T

#setting up X
x = x.iloc[0]
x = x[x != 0]
x = x.reset_index()
x = x.rename(columns = {"Dr. Anderson": "score"})
#transposing db so that rows are movies and columns are users
db = db.T

x

Unnamed: 0,title,score
0,Forrest Gump (1994),0.2
1,"Shawshank Redemption, The (1994)",1.2
2,Pulp Fiction (1994),-0.8
3,"Silence of the Lambs, The (1991)",-1.8
4,"Matrix, The (1999)",1.2


In [210]:
db

title,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
Forrest Gump (1994),-0.366379,0.000000,0.0,0.000000,0.000000,1.506369,1.769737,-0.574468,0.0,0.221429,...,0.000000,-0.392593,-0.507953,0.00,-0.210407,0.34246,0.000000,-0.134176,0.72973,-0.688556
"Shawshank Redemption, The (1994)",0.000000,-0.948276,0.0,0.000000,-0.636364,1.506369,0.000000,1.425532,0.0,0.000000,...,0.574257,1.607407,0.000000,0.00,0.000000,-0.15754,1.213904,1.365824,0.72973,-0.688556
Pulp Fiction (1994),-1.366379,0.000000,0.0,-2.555556,1.363636,-1.493631,0.000000,0.425532,0.0,-2.278571,...,0.000000,1.607407,1.492047,1.52,-1.210407,1.34246,-0.786096,1.865824,0.72973,1.311444
"Silence of the Lambs, The (1991)",-0.366379,0.000000,0.0,1.444444,0.000000,0.506369,1.769737,0.425532,0.0,0.000000,...,0.000000,1.607407,1.492047,1.52,0.000000,0.84246,1.213904,0.865824,0.00000,0.811444
"Matrix, The (1999)",0.633621,0.000000,0.0,-2.555556,0.000000,0.000000,0.000000,0.000000,0.0,-2.778571,...,0.574257,0.000000,1.492047,0.00,0.000000,1.34246,1.213904,1.865824,0.00000,1.311444
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cop (1988),0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.00,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000
Born in East L.A. (1987),0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.00,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000
City of Men (Cidade dos Homens) (2007),0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.00,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000
Best Seller (1987),0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.00,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000


In [216]:
results = {}
for row in x.iterrows():
    #grab the N most similar movies to the movie
    source_title = row[1]["title"]
    movie_ratings = db.loc[source_title]
    score = row[1]["score"]
    sims = db.drop(index=[source_title]).apply(lambda y: (y.values*movie_ratings.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((movie_ratings**2).sum())),axis=1)
    sims = sims.dropna()
    sims = pd.DataFrame(sims.values, columns=["similarity"])
    sorted_sims = sims.sort_values(by="similarity")[::-1]
    
    top_sims = sorted_sims.iloc[:N]
    ids = top_sims.index
    #for each movie similar to the original movie, multiply how much you like the original movie
    #by how similar this new movie is to the original movie.
    #we keep track of accumulated scores so feeding different movies affects the model
    for index in ids:
        title = db.iloc[index].name
        similarity = sims.loc[index, "similarity"]
        result = similarity * score
        results[title] = results.get(title, 0) + result

results = pd.Series(results) 


Schindler's List (1993)                                      0.522112
Lord of the Rings: The Two Towers, The (2002)                0.498916
Toy Story (1995)                                             0.463041
Jurassic Park (1993)                                         0.447092
Bourne Identity, The (2002)                                  0.429872
Beauty and the Beast (1991)                                  0.411120
Mars Attacks! (1996)                                         0.407891
Lord of the Rings: The Fellowship of the Ring, The (2001)    0.399222
Matrix, The (1999)                                           0.394116
Apollo 13 (1995)                                             0.375928
dtype: float64

In [269]:
#removing movies you've already rated
rated_movies = user_ratings.T.loc[~user_ratings.T.isna().values].index
for movie in rated_movies:
    if movie in results.index:
        results.drop(index=[movie], inplace=True)
results.sort_values(ascending=False).head(10)

Schindler's List (1993)                                      0.522112
Lord of the Rings: The Two Towers, The (2002)                0.498916
Toy Story (1995)                                             0.463041
Jurassic Park (1993)                                         0.447092
Bourne Identity, The (2002)                                  0.429872
Beauty and the Beast (1991)                                  0.411120
Mars Attacks! (1996)                                         0.407891
Lord of the Rings: The Fellowship of the Ring, The (2001)    0.399222
Apollo 13 (1995)                                             0.375928
Godfather: Part II, The (1974)                               0.184641
dtype: float64
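
The accumulation idea used above (add similarity times the user's centered score for each neighbor of each rated movie) can be written so that every lookup is by title rather than by row position. This is a sketch under assumptions: `recommend_item_item_sketch`, `items`, and `liked` are hypothetical names, and the matrix is made up.

```python
import numpy as np
import pandas as pd

def recommend_item_item_sketch(items, liked, N=2):
    """items: movies x users matrix of mean-centered ratings (0 means unrated).
    liked: Series mapping each rated title to the user's centered score."""
    norms = np.sqrt((items ** 2).sum(axis=1))
    scores = {}
    for title, score in liked.items():
        others = items.drop(index=title)
        # Cosine similarity to every other movie, still indexed by title
        sims = (others @ items.loc[title]) / (norms.drop(title) * norms[title])
        sims = sims.dropna()                       # movies with no ratings give 0/0
        for other, sim in sims.sort_values(ascending=False).head(N).items():
            scores[other] = scores.get(other, 0.0) + sim * score
    ranked = pd.Series(scores).drop(index=liked.index, errors="ignore")
    return ranked.sort_values(ascending=False)     # already-rated titles removed

# Made-up centered ratings: B matches A exactly, C is its opposite, D is unrated
items = pd.DataFrame(
    [[1.0, -1.0, 0.5], [1.0, -1.0, 0.5], [-1.0, 1.0, -0.5], [0.0, 0.0, 0.0]],
    index=["A", "B", "C", "D"], columns=["u1", "u2", "u3"],
)
liked = pd.Series({"A": 1.0})
recs = recommend_item_item_sketch(items, liked, N=2)
print(recs)
```

Because the similarities keep their title index throughout, there is no need to translate positions back into titles before accumulating scores.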

Some of the same movies show up in the top 10 of both methods, which makes me more confident in how well they performed. I am worried that kids' movies ranked so highly in these results when you didn't have any kids' movies among the movies you rated. Action movies like The Lord of the Rings, The Bourne Identity, and Jurassic Park make sense given how much you liked The Matrix, and Schindler's List and The Godfather: Part II are mature, critically acclaimed dramas like The Shawshank Redemption, so it makes sense that they got there. Ultimately, though, I'm inclined to go with the user-user recommendations, since Beauty and the Beast just seems out of left field.

#### Having trouble with the test cases and the autograder?

You can always load up the answers for the autograder. The autograder runs your code and compares your answer to the expected answer. I manually review your code, so there is no need to hide this from you.

```python
import joblib
answers = joblib.load(f"{home}/csc-466-student/tests/answers_Lab7.joblib")
answers.keys()
```