# Name(s)
Quinn Coleman & Andrew Keshishian

## Should we grade this notebook? (Answer yes or no)

yes

**Instructions:** Pair programming assignment. Submit only a single notebook unless you deviate significantly after lab on Thursday. If you submit individually, make sure you indicate who you worked with originally. Make sure to include your first and last names. For those students who push to individual repos but still work in groups, please indicate which notebook should be graded.

# Recommendation Systems

## Lab Assignment

This is a pair programming assignment. I strongly discourage individual work for this (and other team/pair programming) lab(s), even if you think you can do it all by yourself. Also, this is a pair programming assignment, not a "work in teams of two" assignment. Pair programming requires joint work on all aspects of the project without delegating portions of the work to individual team members. For this lab, I want all your work (discussion, software development, analysis of the results, report writing) to be products of joint work.

Students enrolled in the class can pair with other students enrolled in the class. Students on the waitlist can pair with other students on the waitlist. In "odd person out" situations, a team of three people can be formed, but that team must (a) ask and answer one additional question, and (b) work as a pair would, without delegation of any work off-line.

## At the end of this lab, I should be able to
* Understand how item-item and user-user collaborative filtering perform recommendations
* Explain an experiment where we tested item-item versus user-user

In [1]:
# We need a better version
!pip install -U scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/41/b6/126263db075fbcc79107749f906ec1c7639f69d2d017807c6574792e517e/scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
ERROR: auto-sklearn 0.5.2 has requirement scikit-learn<0.20,>=0.19, but you'll have scikit-learn 0.22.2.post1 which is incompatible.
Installing collected packages: scikit-learn
  Found existing installation: scikit-learn 0.19.2
    Uninstalling scikit-learn-0.19.2:
      Successfully uninstalled scikit-learn-0.19.2
Successfully installed scikit-learn-0.22.2.post1
You should consider upgrading via the 'pip install --upgrade pip' command.


## Our data
We will be using the well-known MovieLens dataset (small version).

### Here are all the imports that I've used

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

In [3]:
ratings = pd.read_csv('~/csc-466-student/data/movielens-small/ratings.csv') # you might need to change this path

In [4]:
ratings = ratings.dropna()
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [5]:
len(ratings.userId.unique())

610

In [6]:
movies = pd.read_csv('~/csc-466-student/data/movielens-small/movies.csv')

In [7]:
movies = movies.dropna()
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


### Joining the data together
We need to join the two source dataframes into a single one called ``data``. I do this by setting the index of each to ``movieId`` and specifying an ``inner`` join, which means a movie has to exist on both sides of the join to be kept. Then I reset the index so that I can set a multi-index of ``userId`` and ``movieId`` and keep only the ``rating`` column. The results are displayed below. Pandas is awesome, but it takes some time to get used to how everything works.

In [8]:
data = movies.set_index('movieId').join(ratings.set_index('movieId'),how='inner').reset_index()
#data["movieId"] = data["title"]+" "+data["movieId"].astype(str)
data = data.set_index(['userId','movieId'])[["rating"]]
data

Unnamed: 0_level_0,Unnamed: 1_level_0,rating
userId,movieId,Unnamed: 2_level_1
1,1,4.0
5,1,4.0
7,1,4.5
15,1,2.5
17,1,4.5
...,...,...
184,193581,4.0
184,193583,3.5
184,193585,3.5
184,193587,3.5
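
The join above can be tried on a tiny example. This is a minimal sketch: the frames `toy_movies` and `toy_ratings` and all their values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical stand-ins for movies.csv and ratings.csv (values are made up).
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.0, 5.0],
})

# Inner join on movieId: movie 3 has no ratings, so it drops out.
toy = (
    toy_movies.set_index("movieId")
    .join(toy_ratings.set_index("movieId"), how="inner")
    .reset_index()
)

# Keep only the rating column, indexed by (userId, movieId).
toy = toy.set_index(["userId", "movieId"])[["rating"]]
print(toy)
```

Because the join is ``inner``, a movie that nobody rated never reaches the rating matrix, so every column built later has at least one rating.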


### Turning data into a matrix instead of a series
The functions ``stack()`` and ``unstack()`` are called multiple times in this lab. They let me move easily between a long format (one row per rating) and a wide format (one row per user). Below I use ``unstack()`` to pivot the ``movieId`` index level into columns. The important thing to note is that each row is now a user! NaN values are inserted where a user did not rate a movie.

In [9]:
data=data.unstack()
data

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,
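
To see what ``unstack()`` and ``stack()`` do in isolation, here is a small sketch on made-up ratings. The names `long_df`, `wide`, and `back` are illustrative only.

```python
import pandas as pd

# Long format: one row per (userId, movieId) rating. Values are made up.
long_df = pd.DataFrame(
    {"rating": [4.0, 3.0, 5.0]},
    index=pd.MultiIndex.from_tuples(
        [(10, 1), (10, 2), (11, 1)], names=["userId", "movieId"]
    ),
)

# Wide format: one row per user, one ('rating', movieId) column per movie.
# User 11 never rated movie 2, so that cell becomes NaN.
wide = long_df.unstack()
print(wide)

# Back to long format; dropna() discards the cells for unrated movies.
back = wide.stack().dropna()
print(back)
```

The round trip returns the three original ratings, which is a quick way to check that no rating was lost when pivoting.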


## Let's take a look at some useful code together before the exercises.

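First let's look at code that centers the data by subtracting each movie's mean rating (important for cosine similarity) and then fills in missing values with 0, so that an unrated movie counts as an average rating for that movie.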

In [10]:
data_centered = data-data.mean()
data_centered = data_centered.fillna(0)
data_centered

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,0.07907,0.000000,0.740385,0.0,0.0,0.053922,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.00000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.00000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.07907,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,-1.42093,0.000000,0.000000,0.0,0.0,0.000000,-0.685185,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.07907,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,-1.42093,-1.431818,-1.259615,0.0,0.0,0.000000,0.000000,0.0,0.0,0.503788,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,-0.92093,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.503788,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
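
The centering step is easy to verify by hand on a tiny matrix. The sketch below uses a made-up 3-user, 2-movie frame named `toy`; it only assumes that `DataFrame.mean()` averages each column while skipping NaN, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movies.
toy = pd.DataFrame(
    {"m1": [4.0, 2.0, np.nan], "m2": [np.nan, 5.0, 3.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Per-movie means, computed over the users who rated each movie.
col_means = toy.mean()          # m1 -> 3.0, m2 -> 4.0

# Subtract the movie mean, then treat "not rated" as exactly average (0).
centered = (toy - col_means).fillna(0)
print(centered)
```

After centering, a positive value means the user liked the movie more than the typical rater and a negative value means less, so the zeros filled in for missing ratings do not pull a similarity in either direction.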


### Now what if we want to grab a specific user? Let's say we want the one with user ID of 1.

In [11]:
x = data_centered.loc[1]
x

        movieId
rating  1          0.079070
        2          0.000000
        3          0.740385
        4          0.000000
        5          0.000000
                     ...   
        193581     0.000000
        193583     0.000000
        193585     0.000000
        193587     0.000000
        193609     0.000000
Name: 1, Length: 9724, dtype: float64

### Finding the neighborhood
If we want to predict ratings for this user, user-user collaborative filtering says to find the ``N`` most similar users. We first drop user 1 from the database, because a user's own row would trivially be the best match. We then compute the cosine similarity between this user ``x`` and every other user in ``db``, sort the similarities in descending order, and display the results.

In [12]:
db = data_centered.drop(1)
sims = db.apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
sorted_sims = sims.sort_values()[::-1]
sorted_sims

userId
452    0.144083
597    0.141746
527    0.109020
171    0.107520
17     0.105684
         ...   
293   -0.131293
368   -0.134358
428   -0.136580
370   -0.151998
217   -0.162144
Length: 609, dtype: float64
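
The row-by-row ``apply`` above can also be written as one matrix product. The following is a sketch, not the lab's required solution; the helper name `cosine_to_all` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def cosine_to_all(db, x):
    """Cosine similarity between vector x and every row of db (hypothetical helper)."""
    dots = db.values @ x.values
    norms = np.linalg.norm(db.values, axis=1) * np.linalg.norm(x.values)
    # An all-zero row has norm 0, so its similarity is 0/0 = NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = dots / norms
    return pd.Series(sims, index=db.index)

# Made-up centered ratings for four users and two movies.
toy = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0], [0.0, 0.0]],
    index=[1, 2, 3, 4],
)
x = toy.loc[1]
sims = cosine_to_all(toy.drop(1), x)
print(sims.sort_values(ascending=False))
```

User 3 rated in the exact opposite direction and gets a similarity of -1, user 2 shares no rated movie and gets 0, and user 4 has no centered ratings at all and gets NaN, which is why the exercise code calls ``dropna()`` on the similarities before sorting.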

### Grabbing similar users
Let's set the neighborhood size to 10, and then grab those users :)

In [13]:
N=10
userIds = sorted_sims.iloc[:N].index
data_centered.loc[userIds]

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.503788,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.07907,0.0,0.0,0.0,0.0,-0.946078,-2.185185,0.0,0.0,-0.496212,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
527,0.0,0.568182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
171,1.07907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.57907,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
555,0.07907,0.0,1.740385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
484,0.57907,-0.931818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
414,0.07907,-0.431818,0.740385,0.0,-1.071429,-0.946078,-0.185185,0.125,0.0,-0.496212,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
380,1.07907,1.568182,0.0,0.0,0.0,1.053922,0.0,0.0,0.0,1.503788,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### How about a prediction?
We could compute the mean over the neighborhood for each movie and use that as the prediction. Because ``db`` is centered, the result is on the centered scale.

In [14]:
db.loc[userIds]
db.loc[userIds].mean()
# data.loc[1].mean()
# db.loc[userIds].mean()+data.loc[1].mean()

        movieId
rating  1          0.355349
        2          0.077273
        3          0.248077
        4          0.000000
        5         -0.107143
                     ...   
        193581     0.000000
        193583     0.000000
        193585     0.000000
        193587     0.000000
        193609     0.000000
Length: 9724, dtype: float64

### What if we want to weight by the similarity?

In [15]:
display(db.loc[userIds].multiply(2))
display(db.loc[userIds].multiply(2, axis=0))
db.loc[userIds].multiply(2, axis=1)

db.loc[userIds].multiply(sorted_sims.iloc[:N],axis=0).sum()/sorted_sims.iloc[:N].sum()+data.loc[1].mean()

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.007576,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.15814,0.0,0.0,0.0,0.0,-1.892157,-4.37037,0.0,0.0,-0.992424,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
527,0.0,1.136364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
171,2.15814,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,1.15814,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
555,0.15814,0.0,3.480769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
484,1.15814,-1.863636,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
414,0.15814,-0.863636,1.480769,0.0,-2.142857,-1.892157,-0.37037,0.25,0.0,-0.992424,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
380,2.15814,3.136364,0.0,0.0,0.0,2.107843,0.0,0.0,0.0,3.007576,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.007576,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.15814,0.0,0.0,0.0,0.0,-1.892157,-4.37037,0.0,0.0,-0.992424,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
527,0.0,1.136364,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
171,2.15814,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,1.15814,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
555,0.15814,0.0,3.480769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
484,1.15814,-1.863636,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
414,0.15814,-0.863636,1.480769,0.0,-2.142857,-1.892157,-0.37037,0.25,0.0,-0.992424,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
380,2.15814,3.136364,0.0,0.0,0.0,2.107843,0.0,0.0,0.0,3.007576,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


        movieId
rating  1          4.699488
        2          4.436795
        3          4.596111
        4          4.366379
        5          4.267809
                     ...   
        193581     4.366379
        193583     4.366379
        193585     4.366379
        193587     4.366379
        193609     4.366379
Length: 9724, dtype: float64
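
The weighted prediction is easier to follow on two neighbors and two movies. This sketch uses made-up centered ratings and similarities; `neighbors` and `weights` are illustrative names. Note that ``multiply(..., axis=0)`` lines the weights up with the row index (users), which is what the weighting needs.

```python
import pandas as pd

# Made-up centered ratings of two neighbors (rows) for two movies (columns).
neighbors = pd.DataFrame({"m1": [1.0, -1.0], "m2": [0.5, 0.0]}, index=[7, 8])

# Made-up similarities of those neighbors to the target user.
weights = pd.Series([0.75, 0.25], index=[7, 8])

# Each neighbor's row is scaled by its similarity, then normalized by the total weight.
weighted = neighbors.multiply(weights, axis=0).sum() / weights.sum()

# Unweighted neighborhood mean, for comparison.
plain = neighbors.mean()

print(weighted)   # m1 leans toward neighbor 7's opinion
print(plain)
```

One caveat worth keeping in mind: if the selected similarities include negative values, their sum can be close to zero, and dividing by it can blow up the prediction.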

## Finally to the exercises!
I want you to implement user-user, item-item, and a combination of item-item and user-user.

In [16]:
data

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,


## Exercise 1 (Worth 5 points)
Complete the following function that predicts using user-user collaborative filtering. 

In [17]:
def predict_user_user(data_raw,x_raw,N=10,frac=0.02):
    # data_raw is our uncentered data matrix. We want to make sure we drop the name of the user we
    # are predicting:
    db = data_raw.drop(x_raw.name)
    # We of course want to center and fill in missing values
    db = (db-db.mean()).fillna(0)
    # Now this is a little tricky to think about, but we want to create a train test split of the movies
    # that user x_raw.name has rated. We need some of them but want some of them removed for testing.
    # This is where the frac parameter is used. I want you to think about how to select movies for training
    # ix_raw, ix_raw_test = train_test_split(???,test_size=frac,random_state=42) # Got to ignore some movies
    
    # Filter out movies that user x hasn't rated
    # Find all movies user x has rated
    
    ix_raw, ix_raw_test = train_test_split(x_raw.dropna().index,test_size=frac,random_state=42)
    # Here is where we use what you figured out above
    x_raw_test = x_raw.loc[ix_raw_test]
    x_raw = x_raw.copy()
    x_raw.loc[ix_raw_test] = np.nan # ignore the movies in test
    x = (x_raw - x_raw.mean()).fillna(0)

    preds = []
    for movie in ix_raw_test:
        sims = db.loc[data_raw.drop(x_raw.name)[movie].isnull()==False].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        #sims = db.apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = sims.dropna()
        try:
            sorted_sims = sims.sort_values()[::-1]
        except:
            preds.append(0) # means there is no one that also rated this movie amongst all other users
            continue
        top_sims = sorted_sims.iloc[:N]
        ids = top_sims.index
        preds.append(db.loc[ids][movie].mean())
        #preds.append(x_raw[ids])
    pred = pd.Series(preds,index=x_raw_test.index)
    actual = x_raw_test-x_raw.mean()
    mae = (actual-pred).abs().mean()
    return mae
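
The hold-out logic inside ``predict_user_user`` can be traced on a single made-up user. In this sketch the test movie is picked by hand (the lab draws it with ``train_test_split``), and the prediction value is invented purely to show how the error is computed.

```python
import numpy as np
import pandas as pd

# Made-up ratings for one user; m2 was never rated.
user = pd.Series([5.0, np.nan, 3.0, 4.0], index=["m1", "m2", "m3", "m4"], name=1)

# Hypothetical test set, chosen by hand for illustration.
test_ix = ["m4"]

held_out = user.loc[test_ix]      # true ratings we will try to predict
train = user.copy()
train.loc[test_ix] = np.nan       # hide the test ratings from the model

# Center with the mean of the remaining ratings only (m1 and m3 -> 4.0).
x = (train - train.mean()).fillna(0)

# Invented centered prediction, compared against the centered truth.
pred_centered = pd.Series([0.5], index=test_ix)
actual_centered = held_out - train.mean()
mae = (actual_centered - pred_centered).abs().mean()
print(x.tolist(), mae)
```

The key point is that the prediction and the actual rating are compared on the same scale: both are centered here, so the error is meaningful.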

In [18]:
data.loc[1]


        movieId
rating  1          4.0
        2          NaN
        3          4.0
        4          NaN
        5          NaN
                  ... 
        193581     NaN
        193583     NaN
        193585     NaN
        193587     NaN
        193609     NaN
Name: 1, Length: 9724, dtype: float64

In [19]:
mae = predict_user_user(data,data.loc[1])
mae

0.7277949129743602

In [20]:
maes = data.head(20).apply(lambda x: predict_user_user(data,x),axis=1)

In [21]:
np.mean(maes)

0.8819436531525158

## Exercise 2 (Worth 5 points)
Complete the following function that predicts using item-item collaborative filtering. 

In [22]:
def predict_item_item(data_raw,x_raw,N=10,frac=0.02,debug={}):
    # x_raw is a user (row)
    ix_raw, ix_raw_test = train_test_split(x_raw.dropna().index,test_size=frac,random_state=42) # Got to ignore some movies
    
    print('ix_raw and ix_raw_test:')
    print(ix_raw, ix_raw_test)
    
    # Indices of movies our user has seen
    x_raw_test = x_raw.loc[ix_raw_test]
    
    print('x_raw_test:')
    print(x_raw_test)
    
    db = data_raw.drop(x_raw.name)
    db = (db-db.mean()).fillna(0)
    # ??? db = FIX DB SO WE CAN KEEP CODE SIMILAR BUT DO ITEM-ITEM ???
    db = db.T # Columns are all users except ours, rows are all movies
    
    preds = []
    for movie in ix_raw_test:
        x = db.loc[movie]
        # x is a row of db, all users except ours who have also rated this movie we're testing
        
        print('Movie:', movie)
        
#         breakpoint()
#         sims = db.drop(movie).loc[??? ONLY SELECT MOVIES IN TRAINING SET WHICH USER HAS RATED ???].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
#         sims = db.drop(movie).loc[x_raw.drop(x_raw.name)[movie].isnull()==False].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = db.drop(movie).loc[data_raw.T[x_raw.name].isnull()==False].apply(
            lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        
        print('Sims:', sims)
        
        # db has movie rows except movie row we're testing, where our user rated this movie
        
        sims = sims.dropna()
        sorted_sims = sims.sort_values()[::-1]
        top_sims = sorted_sims.iloc[:N]
        ids = top_sims.index
        #preds.append(??? HOW TO PREDICTION ???)
#         breakpoint()
        preds.append((data_raw-data_raw.mean()).fillna(0).T.loc[ids][x_raw.name].mean())
    
        # Pred is the avg of movie ratings of this user, 
        # where the movies are the ones in the neighborhood
    
#         preds.append(db.loc[ids].mean().mean())
        # Prediction is the avg of the ratings of most similar movies
        
    pred = pd.Series(preds,index=x_raw_test.index)
    actual = x_raw_test
    mae = (actual-pred).abs().mean()
    return mae

# def predict_user_user(data_raw,x_raw,N=10,frac=0.02):
#     # x_raw is a user
#     db = data_raw.drop(x_raw.name)
#     db = (db-db.mean()).fillna(0)
#     # db has no user x_raw (removed row)
    
#     ix_raw, ix_raw_test = train_test_split(x_raw.dropna().index,test_size=frac,random_state=42)
#     # Indices of movies x_raw has seen
#     x_raw_test = x_raw.loc[ix_raw_test]
#     # (Test) movies x_raw seen
#     x_raw = x_raw.copy()
#     x_raw.loc[ix_raw_test] = np.NaN # ignore the movies in test
#     # x_raw is moves x_raw ignoring test movies
#     x = (x_raw - x_raw.mean()).fillna(0)

#     preds = []
#     for movie in ix_raw_test:
#         # movie is an index of seen movie by user
#         sims = db.loc[data_raw.drop(x_raw.name)[movie].isnull()==False].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
#         # Find all the people who have seen the movie seen by user, compute similarities
#         sims = sims.dropna()
#         try:
#             sorted_sims = sims.sort_values()[::-1]
#         except:
#             preds.append(0) # means there is no one that also rated this movie amongst all other users
#             continue
#         top_sims = sorted_sims.iloc[:N]
#         ids = top_sims.index
#         preds.append(db.loc[ids][movie].mean())
#         # Prediction is the avg of the ratings of most similar users
#     pred = pd.Series(preds,index=x_raw_test.index)
#     actual = x_raw_test-x_raw.mean()
#     mae = (actual-pred).abs().mean()
#     return mae
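
For item-item filtering the same similarity code can be reused once the matrix is transposed, so that each row is a movie. The sketch below assumes a small made-up matrix `raw` with flat column names; it computes similarities only between the target movie and the other movies the user has rated, and stops before the prediction step.

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users 0..3, columns are movies A, B, C.
raw = pd.DataFrame({
    "A": [3.0, np.nan, 1.0, 5.0],
    "B": [np.nan, 4.0, 3.0, np.nan],
    "C": [2.0, np.nan, 2.0, 3.0],
})
user_id, target = 0, "C"

# Drop the target user, center each movie, fill gaps, then transpose:
# rows are now movies and columns are the remaining users.
items = raw.drop(user_id)
items = (items - items.mean()).fillna(0).T

# Candidate neighbors: movies the user rated, excluding the one being predicted.
rated = raw.loc[user_id].dropna().index.drop(target)

x = items.loc[target]
sims = items.loc[rated].apply(
    lambda y: (y.values * x.values).sum()
    / (np.sqrt((y**2).sum()) * np.sqrt((x**2).sum())),
    axis=1,
)
print(sims)
```

Restricting the candidates to movies the user has rated matters because the prediction is built from the user's own ratings of the neighboring movies.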

In [23]:
data_dict = {('rating', "A"): [3, np.nan, 1, 5],
             ("rating", "B"): [np.nan, 4, 3, np.nan],
             ("rating", "C"): [2, np.nan, 2, 3], 
             ("rating", "D"): [np.nan, 3, np.nan, np.nan]}
data = pd.DataFrame(data_dict)
display(data)

mae = predict_item_item(data,data.loc[0])
mae

Unnamed: 0_level_0,rating,rating,rating,rating
Unnamed: 0_level_1,A,B,C,D
0,3.0,,2.0,
1,,4.0,,3.0
2,1.0,3.0,2.0,
3,5.0,,3.0,


ix_raw and ix_raw_test:
MultiIndex([('rating', 'A')],
           ) MultiIndex([('rating', 'C')],
           )
x_raw_test:
rating  C    2.0
Name: 0, dtype: float64
Movie: ('rating', 'C')
Sims: rating  A    1.0
dtype: float64


2.0

In [24]:
maes = data.head(20).apply(lambda x: predict_item_item(data,x),axis=1)

ix_raw and ix_raw_test:
MultiIndex([('rating', 'A')],
           ) MultiIndex([('rating', 'C')],
           )
x_raw_test:
rating  C    2.0
Name: 0, dtype: float64
Movie: ('rating', 'C')
Sims: rating  A    1.0
dtype: float64
ix_raw and ix_raw_test:
MultiIndex([('rating', 'B')],
           ) MultiIndex([('rating', 'D')],
           )
x_raw_test:
rating  D    3.0
Name: 1, dtype: float64
Movie: ('rating', 'D')
Sims: rating  B   NaN
dtype: float64
ix_raw and ix_raw_test:
MultiIndex([('rating', 'B'),
            ('rating', 'C')],
           ) MultiIndex([('rating', 'A')],
           )
x_raw_test:
rating  A    1.0
Name: 2, dtype: float64
Movie: ('rating', 'A')
Sims: rating  B    NaN
        C    1.0
dtype: float64
ix_raw and ix_raw_test:
MultiIndex([('rating', 'A')],
           ) MultiIndex([('rating', 'C')],
           )
x_raw_test:
rating  C    3.0
Name: 3, dtype: float64
Movie: ('rating', 'C')
Sims: rating  A   NaN
dtype: float64




In [25]:
np.mean(maes)

1.6666666666666667

**For this very simple experiment, which method seems better?**

YOUR ANSWER HERE

## Exercise 3 (Worth 5 points)
Create new versions of predict_user_user and predict_item_item, but now perform a weighted prediction as was demonstrated above. Did our accuracy get any better?

In [84]:
# Weighted Avg
# db.loc[userIds].multiply(sorted_sims.iloc[:N],axis=0).sum()/sorted_sims.iloc[:N].sum()+data.loc[1].mean()

def predict_item_item(data_raw,x_raw,N=10,frac=0.02,debug={}):
    ix_raw, ix_raw_test = train_test_split(x_raw.dropna().index,test_size=frac,random_state=42) # Got to ignore some movies
    x_raw_test = x_raw.loc[ix_raw_test]
    
    db = data_raw.drop(x_raw.name)
    db = (db-db.mean()).fillna(0)
    # ??? db = FIX DB SO WE CAN KEEP CODE SIMILAR BUT DO ITEM-ITEM ???
    db = db.T
    
    preds = []
    for movie in ix_raw_test:
        x = db.loc[movie]
        # sims = db.drop(movie).loc[??? ONLY SELECT MOVIES IN TRAINING SET WHICH USER HAS RATED ???].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = db.drop(movie).loc[data_raw.T[x_raw.name].isnull()==False].apply(
            lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = sims.dropna()
        sorted_sims = sims.sort_values()[::-1]
        top_sims = sorted_sims.iloc[:N]
        ids = top_sims.index
        #preds.append(??? HOW TO PREDICTION ???)
        preds.append((data_raw-data_raw.mean()).fillna(0).T.loc[ids][x_raw.name]
                     .multiply(top_sims,axis=0).sum()/top_sims.sum())

    pred = pd.Series(preds,index=x_raw_test.index)
    actual = x_raw_test
    mae = (actual-pred).abs().mean()
    return mae

def predict_user_user(data_raw,x_raw,N=10,frac=0.02):
    # data_raw is our uncentered data matrix. We want to make sure we drop the name of the user we
    # are predicting:
    db = data_raw.drop(x_raw.name)
    # We of course want to center and fill in missing values
    db = (db-db.mean()).fillna(0)
    # Now this is a little tricky to think about, but we want to create a train test split of the movies
    # that user x_raw.name has rated. We need some of them but want some of them removed for testing.
    # This is where the frac parameter is used. I want you to think about how to select movies for training
    #ix_raw, ix_raw_test = train_test_split(???,test_size=frac,random_state=42) # Got to ignore some movies
    ix_raw, ix_raw_test = train_test_split(x_raw.dropna().index,test_size=frac,random_state=42)

    # Here is where we use what you figured out above
    x_raw_test = x_raw.loc[ix_raw_test] 
    x_raw = x_raw.copy()
    x_raw.loc[ix_raw_test] = np.nan # ignore the movies in test
    x = (x_raw - x_raw.mean()).fillna(0)

    preds = []
    for movie in ix_raw_test:
        #sims = db.loc[??? Only look at users who have rated this movie ???].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = db.loc[data_raw.drop(x_raw.name)[movie].isnull()==False].apply(lambda y: (y.values*x.values).sum()/(np.sqrt((y**2).sum())*np.sqrt((x**2).sum())),axis=1)
        sims = sims.dropna()
        try:
            sorted_sims = sims.sort_values()[::-1]
        except:
            preds.append(0) # means there is no one that also rated this movie amongst all other users
            continue
        top_sims = sorted_sims.iloc[:N]
        ids = top_sims.index
        #preds.append(??? using ids how do you predict ???)
        preds.append(db.loc[ids][movie].multiply(top_sims,axis=0).sum()/top_sims.sum())
    pred = pd.Series(preds,index=x_raw_test.index)
    actual = x_raw_test-x_raw.mean()
    mae = (actual-pred).abs().mean()
    return mae

In [85]:
mae = predict_item_item(data,data.loc[1])
mae



3.3859081423023576

In [86]:
mae = predict_user_user(data,data.loc[1])
mae

0.6436909405848068

## Exercise 4 (Worth 5-10 extra credit points for one or both implementations)
Combine in sequence item-item and user-user AND/OR user-user and item-item.