In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## Issues with User-User Collaborative Filtering

### Sparsity
* Item sets are large, but each user rates only a small number of items
* Often no recommendation can be made for a user

### Computational Performance
* With millions of users (or more), computing all-pairs correlations is expensive
* Even incremental approaches were expensive
* User profiles can change quickly, so recommendations must be recomputed in real time to keep the user happy; for example, when a user rates three more items, that user's recommendations need to be recomputed
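
To get a feel for the all-pairs cost, note that the number of distinct user pairs grows quadratically with the number of users. A rough sketch (the helper name `n_pairs` is made up for illustration):

```python
def n_pairs(n_users: int) -> int:
    """Number of distinct unordered user pairs among n_users users."""
    return n_users * (n_users - 1) // 2

# Compare a small and a large user base
for n in (1_000, 1_000_000):
    print(f"{n:>9,} users -> {n_pairs(n):,} pairs")
```

A thousand users give about half a million pairs, while a million users give roughly 5 x 10^11, which is why recomputing every correlation on each profile change is impractical.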


####  Reference
Item-Based Collaborative Filtering Recommendation Algorithms 
http://files.grouplens.org/papers/www10_sarwar.pdf

## Item-Item Collaborative Filtering
* Fairly stable
    * The average item has many more ratings than the average user
    * Intuitively, items don't generally change rapidly
* Item similarity is a route to predicting a user's preference for an item

## 2-Step Process
* Compute similarity between pairs of items
    * Correlation between rating vectors (where items are rated on a scale such as 1-5 or 1-10)
        * used only for multi-level ratings

    * Cosine of item rating vectors
        * can be used with unary or multi-level ratings
        * adjusted ratings (normalize before computing the cosine)
* Predict user-item rating
    * Weighted sum of rated "item-neighbors": take the items this user has already rated or consumed that are closest to the item being evaluated, then compute a weighted sum of the user's ratings for those items
    * Linear regression to estimate the rating
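
The weighted-sum prediction can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the function name `predict_rating`, the neighbor count `k`, and the toy numbers are assumptions, and a rating of 0 is taken to mean "not rated" (matching the `fillna(0)` used below). The prediction is the similarity-weighted average of the user's ratings for the `k` rated items most similar to the target.

```python
import numpy as np

def predict_rating(user_ratings, item_sims, k=2):
    """Weighted-sum prediction for one target item.

    user_ratings: the user's rating for every item, 0 meaning "not rated".
    item_sims:    similarity of every item to the target item.
    k:            number of rated neighbors to use (illustrative default).
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    item_sims = np.asarray(item_sims, dtype=float)

    rated = np.flatnonzero(user_ratings > 0)      # items the user has rated
    if rated.size == 0:
        return float("nan")                       # nothing to base a prediction on

    # Keep the k rated items most similar to the target
    order = np.argsort(item_sims[rated])[::-1][:k]
    neighbors = rated[order]

    weights = item_sims[neighbors]
    denom = np.abs(weights).sum()
    if denom == 0:
        return float("nan")
    return float(weights @ user_ratings[neighbors] / denom)

# Toy data (made up): the user rated items 0, 1 and 3; item 2 is the target.
ratings_u = [4, 2, 0, 5]
sims_to_target = [0.9, 0.1, 1.0, 0.6]
print(predict_rating(ratings_u, sims_to_target, k=2))
```

With these numbers the two neighbors are items 0 and 3, so the prediction is (0.9 * 4 + 0.6 * 5) / (0.9 + 0.6), roughly 4.4.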

In [2]:
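# Load the ratings sheet; unrated cells are read as NaN
ratings = pd.read_excel('Data/Input/CF2.xlsx', sheet_name='Ratings')
# Treat missing ratings as 0 so cosine similarity can be computed
ratings = ratings.fillna(0)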

In [3]:
ratings.values[:-1,1:-1].shape

(20, 20)

## Step-1
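Similarity between rating vectors. Here we use cosine similarity. Note that `cosine_similarity` compares rows, so the call below compares users; the item-item matrix is built further down by transposing the ratings.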

In [4]:
cosine_similarity(ratings.iloc[:-1,1:-1]).shape

(20, 20)
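
The "adjusted ratings" variant listed under the 2-step process subtracts each user's mean rating before taking the cosine, so that generous and harsh raters become comparable. Below is a NumPy-only sketch under stated assumptions: the function name `adjusted_cosine` and the toy matrix are made up, 0 means "not rated", and unrated cells are left at 0 after centering (a common simplification; other formulations restrict each pair of items to co-rating users).

```python
import numpy as np

def adjusted_cosine(R):
    """Item-item adjusted cosine similarity.

    R: users x items array, 0 meaning "not rated".
    """
    R = np.asarray(R, dtype=float)
    rated = R > 0

    # Mean of each user's observed ratings (users with no ratings get mean 0)
    counts = rated.sum(axis=1)
    user_mean = R.sum(axis=1) / np.maximum(counts, 1)

    # Center observed ratings per user; unrated cells stay at 0
    centered = np.where(rated, R - user_mean[:, None], 0.0)

    # Normalize each item column to unit length, guarding against zero columns
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0
    unit = centered / norms

    return unit.T @ unit                          # items x items

# Toy ratings (made up): 3 users x 3 items
R_toy = [[5, 3, 0],
         [4, 0, 2],
         [1, 5, 4]]
S = adjusted_cosine(R_toy)
print(S.round(3))
```

Unlike plain cosine on non-negative ratings, the adjusted version can produce negative similarities, as it does here for items 0 and 1.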

In [5]:
matrix = pd.read_excel('Data/Input/CF2.xlsx', sheet_name='Matrix')
matrix.head()
#matrix.set_index('Unnamed: 0')

Unnamed: 0.1,Unnamed: 0,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),...,2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
0,1: Toy Story (1995),,,,,,,,,,...,,,,,,,,,,
1,1210: Star Wars: Episode VI - Return of the Je...,,,,,,,,,,...,,,,,,,,,,
2,356: Forrest Gump (1994),,,,,,,,,,...,,,,,,,,,,
3,"318: Shawshank Redemption, The (1994)",,,,,,,,,,...,,,,,,,,,,
4,"593: Silence of the Lambs, The (1991)",,,,,,,,,,...,,,,,,,,,,


In [7]:
# Transpose so that rows are items: cosine_similarity compares rows, giving an item-item matrix
matrix = pd.DataFrame(cosine_similarity(ratings.values[:-1,1:-1].T), index=matrix.index, columns=matrix.columns[1:])
matrix.head()

Unnamed: 0,1: Toy Story (1995),1210: Star Wars: Episode VI - Return of the Jedi (1983),356: Forrest Gump (1994),"318: Shawshank Redemption, The (1994)","593: Silence of the Lambs, The (1991)",3578: Gladiator (2000),260: Star Wars: Episode IV - A New Hope (1977),2028: Saving Private Ryan (1998),296: Pulp Fiction (1994),1259: Stand by Me (1986),2396: Shakespeare in Love (1998),2916: Total Recall (1990),780: Independence Day (ID4) (1996),541: Blade Runner (1982),1265: Groundhog Day (1993),"2571: Matrix, The (1999)",527: Schindler's List (1993),"2762: Sixth Sense, The (1999)",1198: Raiders of the Lost Ark (1981),34: Babe (1995)
0,1.0,0.644995,0.58054,0.667424,0.570229,0.587852,0.747409,0.534579,0.667846,0.492659,0.376659,0.623056,0.690665,0.383067,0.661016,0.50501,0.463817,0.421637,0.466817,0.61807
1,0.644995,1.0,0.563029,0.456052,0.516566,0.483187,0.589805,0.408752,0.685662,0.534324,0.533429,0.391934,0.605856,0.515397,0.526952,0.535673,0.573529,0.565297,0.252604,0.511576
2,0.58054,0.563029,1.0,0.293041,0.381346,0.569209,0.59555,0.463003,0.399114,0.527926,0.647153,0.491498,0.498741,0.487713,0.29829,0.631039,0.320494,0.602943,0.288275,0.456849
3,0.667424,0.456052,0.293041,1.0,0.589,0.212846,0.565577,0.598344,0.538219,0.340151,0.329203,0.332674,0.617366,0.531981,0.437319,0.255345,0.497511,0.459446,0.467347,0.542782
4,0.570229,0.516566,0.381346,0.589,1.0,0.551612,0.682137,0.64059,0.400471,0.661958,0.484751,0.414499,0.738445,0.585662,0.673091,0.530856,0.75763,0.715565,0.702452,0.309159


## Step-2
Find nearest neighbors

In [12]:
# Top 5 items most similar to Toy Story; ask for 6 because the item itself ranks first (similarity 1.0)
matrix['1: Toy Story (1995)'].nlargest(6)

0     1.000000
6     0.747409
12    0.690665
8     0.667846
3     0.667424
14    0.661016
Name: 1: Toy Story (1995), dtype: float64
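
Since the first hit is always the item itself, one option is to drop it before ranking. The sketch below assumes a hypothetical helper `top_k_similar` and rebuilds the six values printed above as a small Series so that it runs on its own; in the notebook the same call could be applied to `matrix['1: Toy Story (1995)']`.

```python
import pandas as pd

def top_k_similar(sim_col: pd.Series, self_label, k: int = 5) -> pd.Series:
    """Return the k largest similarities, excluding the item's own entry."""
    return sim_col.drop(index=self_label).nlargest(k)

# Values copied from the output above; the index is the row position in `matrix`.
toy_story = pd.Series(
    {0: 1.000000, 6: 0.747409, 12: 0.690665, 8: 0.667846, 3: 0.667424, 14: 0.661016},
    name="1: Toy Story (1995)",
)
neighbors = top_k_similar(toy_story, self_label=0, k=5)
print(neighbors)
```

The row labels are positions rather than titles because `matrix` was built with a default integer index. Assuming rows follow the same order as the columns, position 6 would be "260: Star Wars: Episode IV - A New Hope (1977)" and position 12 "780: Independence Day (ID4) (1996)".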