# Social Network Data for Collaborative Filtering
![](banner_cf.jpg)

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)                       
update_geom_defaults("point", list(size=6, colour="grey50"))

## Introduction

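* **Recommender System:** predicts which items a user is likely to prefer, based on known ratings
* **Bipartite Graph or Bigraph:** a graph whose nodes fall into two sets (here, users and items), with links only between the sets
* **Collaborative Filtering:** predict a user's preferences from the ratings given by many users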
  * **Neighborhood Approach** 
    * **User-Based:** predict a user's preference for an item based on ratings for that item by similar users   
    * **Item-Based:** predict a user's preference for an item based on ratings of similar items by that user
  * **Latent Factor Approach:** characterize both users and items in terms of variables inferred from user ratings

## Data

<img src="cf.jpg" align="left" width="500">

In [2]:
data.a = data.frame(user_a=c(0,0,0,0,0,0,0,0,0,0), 
                    user_b=c(0,0,0,0,0,0,0,0,0,0), 
                    user_c=c(0,0,0,0,0,0,0,0,0,0), 
                    user_d=c(0,0,0,0,0,0,0,0,0,0), 
                    item_1=c(1,1,0,1,0,0,0,0,0,0), 
                    item_2=c(1,0,1,0,0,0,0,0,0,0), 
                    item_3=c(1,1,1,0,0,0,0,0,0,0), 
                    item_4=c(0,1,0,1,0,0,0,0,0,0), 
                    item_5=c(1,1,1,1,0,0,0,0,0,0), 
                    item_6=c(1,1,1,1,0,0,0,0,0,0))
rownames(data.a) = colnames(data.a)

fmt(data.a, "data (adjacency matrix)", TRUE)

Unnamed: 0,user_a,user_b,user_c,user_d,item_1,item_2,item_3,item_4,item_5,item_6
user_a,0,0,0,0,1,1,1,0,1,1
user_b,0,0,0,0,1,0,1,1,1,1
user_c,0,0,0,0,0,1,1,0,1,1
user_d,0,0,0,0,1,0,0,1,1,1
item_1,0,0,0,0,0,0,0,0,0,0
item_2,0,0,0,0,0,0,0,0,0,0
item_3,0,0,0,0,0,0,0,0,0,0
item_4,0,0,0,0,0,0,0,0,0,0
item_5,0,0,0,0,0,0,0,0,0,0
item_6,0,0,0,0,0,0,0,0,0,0


In [3]:
data = data.frame(item_1=c(4,4,NA,3), item_2=c(5,NA,4,NA), item_3=c(5,5,1,NA), item_4=c(NA,2,NA,5), item_5=c(4,4,2,4), item_6=c(1,2,1,3))
rownames(data) = c("user_a","user_b","user_c","user_d")

fmt(data, row.names=TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,,4,1
user_b,4.0,,5.0,2.0,4,2
user_c,,4.0,1.0,,2,1
user_d,3.0,,,5.0,4,3
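
The two representations above are linked: a user-item link exists wherever a rating is present. As a minimal sketch (the names `ratings`, `links`, `nodes`, and `adjacency` are illustrative, not from the notebook), the adjacency matrix can be derived from the ratings table. Like the matrix shown earlier, only the user rows are populated.

```r
# Ratings table from the Data section (NA = no rating yet).
ratings = data.frame(item_1=c(4,4,NA,3), item_2=c(5,NA,4,NA), item_3=c(5,5,1,NA),
                     item_4=c(NA,2,NA,5), item_5=c(4,4,2,4), item_6=c(1,2,1,3))
rownames(ratings) = c("user_a","user_b","user_c","user_d")

# A link exists wherever a rating is present (1 = link, 0 = no link).
links = (!is.na(as.matrix(ratings))) * 1

# Embed the user-by-item block in a square matrix over all nodes.
nodes = c(rownames(ratings), colnames(ratings))
adjacency = matrix(0, length(nodes), length(nodes), dimnames=list(nodes, nodes))
adjacency[rownames(ratings), colnames(ratings)] = links

adjacency
```

The zeros in the user rows (for example `user_a`, `item_4`) are the would-be links whose ratings collaborative filtering tries to predict.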


## About Collaborative Filtering - Simple Version

Neighborhood approach, user-based.  Predict the rating for each would-be link, that is, each user-item pair that has no rating yet.  Assume that similar users rate a particular item similarly.

* Use only the rating of the most similar user
* Measure similarity by correlation
* Disregard missing data (compute each correlation over co-rated items only); treat negative correlations as no similarity

Specifically, predictions for ratings are calculated like this:

$
\begin{align}
\hat{r}_{u,i} = r_{x,i}
\end{align}
$

where ...
* $\hat{r}_{u,i}$ is the prediction for user $u$'s rating of item $i$
* $x$ is the index of the user most similar to user $u$ 
* $r_{x,i}$ is user $x$'s rating of item $i$

## Collaborative Filtering - Simple Version

**Calculate similarity matrix:**

In [5]:
data.p = data

similarity = cor(t(data.p), method="pearson", use="pairwise.complete.obs")
similarity[similarity < 0] = 0

layout(fmt(t(data.p), "data (transposed)", TRUE),
       fmt(similarity, "similarity matrix (correlation after negatives removed)", TRUE))


Unnamed: 0,user_a,user_b,user_c,user_d
item_1,4.0,4.0,,3.0
item_2,5.0,,4.0,
item_3,5.0,5.0,1.0,
item_4,,2.0,,5.0
item_5,4.0,4.0,2.0,4.0
item_6,1.0,2.0,1.0,3.0

Unnamed: 0,user_a,user_b,user_c,user_d
user_a,1.0,0.9941348,0.4980582,0.5
user_b,0.9941348,1.0,0.1889822,0.0
user_c,0.4980582,0.1889822,1.0,1.0
user_d,0.5,0.0,1.0,1.0
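
To see where one entry of the similarity matrix comes from, the sketch below recomputes the `user_a`/`user_b` correlation by hand, using only the items both users rated. The variable names (`a`, `b`, `both`, `r`) are illustrative and not part of the notebook.

```r
# Ratings by user_a and user_b, in item order item_1..item_6 (NA = not rated).
a = c(4, 5, 5, NA, 4, 1)
b = c(4, NA, 5, 2, 4, 2)

# Keep only the items both users rated (items 1, 3, 5, 6).
both = !is.na(a) & !is.na(b)
a2 = a[both]
b2 = b[both]

# Pearson correlation from its definition, with means taken over co-rated items.
r = sum((a2 - mean(a2)) * (b2 - mean(b2))) /
    sqrt(sum((a2 - mean(a2))^2) * sum((b2 - mean(b2))^2))

r                                        # hand-computed value
cor(a, b, use="pairwise.complete.obs")   # same pair via cor()
```

Both values agree with the 0.9941348 in the matrix. Note that `user_c` and `user_d` share only two rated items (item_5 and item_6), so their correlation of exactly 1 rests on very little evidence.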


**Predict User A's rating of Item 4:**

In [6]:
nn = names(which.max(similarity[-1,"user_a"]))
fmt(data.frame(user="user_a", similar_user=nn, item_4=data.p[nn, "item_4"]), "prediction")

user,similar_user,item_4
user_a,user_b,2


In [7]:
data.p["user_a","item_4"] = data.p["user_b","item_4"]
fmt(data.p, "data (after filtered just for user_a)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,2.0,4,1
user_b,4.0,,5.0,2.0,4,2
user_c,,4.0,1.0,,2,1
user_d,3.0,,,5.0,4,3


**Predict all missing ratings:**

In [8]:
data.p = data

nn = c(names(which.max(similarity[-1,"user_a"])),
       names(which.max(similarity[-2,"user_b"])),
       names(which.max(similarity[-3,"user_c"])),
       names(which.max(similarity[-4,"user_d"])))

sur = data.frame(user=c("user_a","user_b","user_c","user_d"),
                 similar_user=nn,
                 item_1=data.p[nn, "item_1"],
                 item_2=data.p[nn, "item_2"],
                 item_3=data.p[nn, "item_3"],
                 item_4=data.p[nn, "item_4"],
                 item_5=data.p[nn, "item_5"],
                 item_6=data.p[nn, "item_6"])

fmt(sur, "similar user rating")

user,similar_user,item_1,item_2,item_3,item_4,item_5,item_6
user_a,user_b,4.0,,5.0,2.0,4,2
user_b,user_a,4.0,5.0,5.0,,4,1
user_c,user_d,3.0,,,5.0,4,3
user_d,user_c,,4.0,1.0,,2,1


In [9]:
data.p["user_a","item_4"] = sur[sur$user=="user_a","item_4"]
data.p["user_b","item_2"] = sur[sur$user=="user_b","item_2"]
data.p["user_c","item_1"] = sur[sur$user=="user_c","item_1"]
data.p["user_c","item_4"] = sur[sur$user=="user_c","item_4"]
data.p["user_d","item_2"] = sur[sur$user=="user_d","item_2"]
data.p["user_d","item_3"] = sur[sur$user=="user_d","item_3"]

fmt(data.p, "data (after filtered)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4,5,5,2,4,1
user_b,4,5,5,2,4,2
user_c,3,4,1,5,2,1
user_d,3,4,1,5,4,3
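
The per-user steps above can also be written as a loop, so that nothing is hard-coded for four users. This is a sketch under the same assumptions as the simple version (nearest neighbor only, negative correlations set to zero); the names `R`, `S`, and `filled` are illustrative.

```r
# Ratings table from the Data section, as a matrix (NA = no rating yet).
R = cbind(item_1=c(4,4,NA,3), item_2=c(5,NA,4,NA), item_3=c(5,5,1,NA),
          item_4=c(NA,2,NA,5), item_5=c(4,4,2,4), item_6=c(1,2,1,3))
rownames(R) = c("user_a","user_b","user_c","user_d")

# User-by-user similarity: correlation over co-rated items, negatives set to 0.
S = cor(t(R), method="pearson", use="pairwise.complete.obs")
S[S < 0] = 0
diag(S) = -Inf   # a user must never be picked as their own nearest neighbor

filled = R
for (u in rownames(R)) {
  nn = names(which.max(S[u, ]))        # most similar other user
  missing = is.na(R[u, ])              # items user u has not rated
  filled[u, missing] = R[nn, missing]  # copy the neighbor's ratings
}

filled
```

For this data the result matches the filtered table above. In general a gap remains wherever the nearest neighbor has not rated the item either.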


## About Collaborative Filtering - Weighted Version

* Use ratings of all other users, adjusted by similarity
* Measure similarity by correlation
* Disregard missing data (compute each correlation over co-rated items only); treat negative correlations as no similarity

Specifically, predictions for ratings are calculated like this:

$
\begin{align}
\hat{r}_{u,i} = \frac{\sum_{x \in N_u}{w_{x,u} r_{x,i}}}{\sum_{x \in N_u}{w_{x,u}}}
\end{align}
$

where ...
* $\hat{r}_{u,i}$ is the prediction for user $u$'s rating of item $i$
* $N_u$ is the set of users other than user $u$ (users with no rating for item $i$ are skipped)
* $w_{x,u}$ is the measure of how similar user $x$ is to user $u$
* $r_{x,i}$ is user $x$'s rating of item $i$

## Collaborative Filtering - Weighted Version

**Calculate similarity matrix:**

In [10]:
data.p = data

similarity = cor(t(data.p), method="pearson", use="pairwise.complete.obs")
similarity[similarity < 0] = 0

layout(fmt(t(data.p), "data (transposed)", TRUE),
       fmt(similarity, "similarity matrix (correlation after negatives removed)", TRUE))


Unnamed: 0,user_a,user_b,user_c,user_d
item_1,4.0,4.0,,3.0
item_2,5.0,,4.0,
item_3,5.0,5.0,1.0,
item_4,,2.0,,5.0
item_5,4.0,4.0,2.0,4.0
item_6,1.0,2.0,1.0,3.0

Unnamed: 0,user_a,user_b,user_c,user_d
user_a,1.0,0.9941348,0.4980582,0.5
user_b,0.9941348,1.0,0.1889822,0.0
user_c,0.4980582,0.1889822,1.0,1.0
user_d,0.5,0.0,1.0,1.0


**Predict User A's rating of Item 4:**<br>
The prediction is the weighted mean of the other users' ratings, where the weights are the similarity measures and missing ratings are ignored.

In [11]:
rating = data.p[-1,"item_4"]
weight = similarity["user_a",-1]
item_4.predicted = weighted.mean(rating, weight, na.rm=TRUE)

calc = data.frame(user=rownames(data.p)[-1],
                  weight,
                  rating,
                  contribution_to_weighted_mean=rating*weight)

prediction = data.frame(user="user_a", item_4=item_4.predicted)

layout(fmt(calc, "calculation for user_a rating of item_4"), fmt(prediction))


user,weight,rating,contribution_to_weighted_mean
user_b,0.9941348,2.0,1.98827
user_c,0.4980582,,
user_d,0.5,5.0,2.5

user,item_4
user_a,3.003925
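
Written out with the numbers from the calculation table, the weighted mean is as follows. User C has no rating for item 4, so it drops out of both the numerator and the denominator.

$
\begin{align}
\hat{r}_{a,4} = \frac{(0.9941348)(2) + (0.5)(5)}{0.9941348 + 0.5} = \frac{4.4882696}{1.4941348} \approx 3.003925
\end{align}
$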


In [12]:
data.p["user_a","item_4"] = item_4.predicted
fmt(data.p, "data (after filtered just for user_a)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,3.003925,4,1
user_b,4.0,,5.0,2.0,4,2
user_c,,4.0,1.0,,2,1
user_d,3.0,,,5.0,4,3


**Predict all missing ratings:**

In [13]:
data.p = data

prediction = data.frame(user=c("user_a","user_b","user_c","user_d"),
                        item_1=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_1"], similarity[i,-i], na.rm=TRUE)),
                        item_2=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_2"], similarity[i,-i], na.rm=TRUE)),
                        item_3=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_3"], similarity[i,-i], na.rm=TRUE)),
                        item_4=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_4"], similarity[i,-i], na.rm=TRUE)),
                        item_5=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_5"], similarity[i,-i], na.rm=TRUE)),
                        item_6=aaply(1:4, 1, function(i) weighted.mean(data.p[-i,"item_6"], similarity[i,-i], na.rm=TRUE)))
                                
fmt(prediction)

user,item_1,item_2,item_3,item_4,item_5,item_6
user_a,3.665358,4.0,3.664896,3.003925,3.49999,2.000975
user_b,4.0,4.840268,4.36107,,3.680535,1.0
user_c,3.407246,5.0,5.0,4.523166,4.0,2.297528
user_d,4.0,4.333333,2.333333,,2.666667,1.0
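
The same predictions can be obtained with two matrix products instead of one call per item. This is an alternative formulation, sketched under the assumption that every pair of users shares enough rated items for the correlation to be defined; the names `R`, `S`, `observed`, `R0`, and `predicted` are illustrative.

```r
# Ratings table from the Data section, as a matrix (NA = no rating yet).
R = cbind(item_1=c(4,4,NA,3), item_2=c(5,NA,4,NA), item_3=c(5,5,1,NA),
          item_4=c(NA,2,NA,5), item_5=c(4,4,2,4), item_6=c(1,2,1,3))
rownames(R) = c("user_a","user_b","user_c","user_d")

# Similarity weights: correlation over co-rated items, negatives set to 0.
S = cor(t(R), method="pearson", use="pairwise.complete.obs")
S[S < 0] = 0
diag(S) = 0                      # exclude each user's own ratings

observed = (!is.na(R)) * 1       # 1 where a rating exists, 0 otherwise
R0 = R
R0[is.na(R0)] = 0                # missing ratings contribute nothing to the sums

numerator   = S %*% R0           # sum of weight * rating over users who rated the item
denominator = S %*% observed     # sum of weights over those same users
predicted   = numerator / denominator

round(predicted, 6)
```

Entries where the only available raters have zero similarity come out as `NaN` (0/0), which corresponds to the blank cells in the table above, for example `user_b`, `item_4`.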


In [14]:
data.p["user_a","item_4"] = prediction[prediction$user=="user_a","item_4"]
data.p["user_b","item_2"] = prediction[prediction$user=="user_b","item_2"]
data.p["user_c","item_1"] = prediction[prediction$user=="user_c","item_1"]
data.p["user_c","item_4"] = prediction[prediction$user=="user_c","item_4"]
data.p["user_d","item_2"] = prediction[prediction$user=="user_d","item_2"]
data.p["user_d","item_3"] = prediction[prediction$user=="user_d","item_3"]

fmt(data.p, "data (after filtered)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,3.003925,4,1
user_b,4.0,4.840268,5.0,2.0,4,2
user_c,3.407246,4.0,1.0,4.523166,2,1
user_d,3.0,4.333333,2.333333,5.0,4,3


## About Collaborative Filtering - Adjusted Weighted Version

* Calibrate to account for differences in users' average ratings
* Use ratings of only the most similar users (nearest neighbors), weighted by similarity
* Measure similarity by correlation
* Disregard missing data (compute each correlation over co-rated items only); treat negative correlations as no similarity

Specifically, predictions for ratings are calculated like this:

$
\begin{align}
\hat{r}_{u,i} = \bar{r}_u + \alpha \frac{\sum_{x \in N^t_u}{w_{x,u} (r_{x,i} - \bar{r}_x)}}{\sum_{x \in N^t_u}{w_{x,u}}}
\end{align}
$

where ...
* $\hat{r}_{u,i}$ is the prediction for user $u$'s rating of item $i$
* $\bar{r}_u$ is the average of user $u$'s ratings over the items that user has rated
* $\bar{r}_x$ is the average of user $x$'s ratings over the items that user has rated
* $\alpha$ is a parameter that scales how strongly other users' ratings affect the prediction
* $N^t_u$ is the set of the $t$ users most similar to user $u$ (the threshold parameter $t$ sets the number of similar users to use)
* $w_{x,u}$ is the measure of how similar user $x$ is to user $u$
* $r_{x,i}$ is user $x$'s rating of item $i$

## Collaborative Filtering - Adjusted Weighted Version

**Set parameters:**

In [15]:
threshold = 2 # number of nearest neighbors considered
alpha = 1 # magnitude of effect of neighbors 

data.frame(threshold, alpha)

threshold,alpha
2,1


**Calculate similarity matrix:**

In [16]:
data.p = data

similarity = cor(t(data), method="pearson", use="pairwise.complete.obs")
similarity[similarity < 0] = 0

layout(fmt(t(data), "data (transposed)", TRUE),
       fmt(similarity, "similarity matrix (correlation after negatives removed)", TRUE))


Unnamed: 0,user_a,user_b,user_c,user_d
item_1,4.0,4.0,,3.0
item_2,5.0,,4.0,
item_3,5.0,5.0,1.0,
item_4,,2.0,,5.0
item_5,4.0,4.0,2.0,4.0
item_6,1.0,2.0,1.0,3.0

Unnamed: 0,user_a,user_b,user_c,user_d
user_a,1.0,0.9941348,0.4980582,0.5
user_b,0.9941348,1.0,0.1889822,0.0
user_c,0.4980582,0.1889822,1.0,1.0
user_d,0.5,0.0,1.0,1.0


**Predict User A's rating of Item 4:**

In [17]:
i = which(colnames(similarity) == "user_a")
nn = which(rownames(data) %in% names(sort(similarity[-i,i], decreasing=TRUE))[1:threshold])
fmt(data[nn,], "nearest neighbors", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_b,4,,5.0,2,4,2
user_d,3,,,5,4,3


In [18]:
weight = similarity["user_a",nn]
rating = data[nn,"item_4"]
mean_rating.nn = rowMeans(data[nn,], na.rm=TRUE)
diff = rating - mean_rating.nn
weighted_mean_part = weighted.mean(diff, weight, na.rm=TRUE)
        
calc = data.frame(weight,
                  rating,
                  mean_rating.nn,
                  diff,
                  contribution_to_weighted_mean=diff*weight)

fmt(calc, "calculation for user_a rating of item_4", TRUE)

Unnamed: 0,weight,rating,mean_rating.nn,diff,contribution_to_weighted_mean
user_b,0.9941348,2,3.4,-1.4,-1.391789
user_d,0.5,5,3.75,1.25,0.625


In [19]:
mean_rating.i = rowMeans(data["user_a",], na.rm=TRUE)
item_4.predicted = mean_rating.i + (alpha*weighted_mean_part)

prediction = data.frame(mean_rating=mean_rating.i,
                        alpha,
                        weighted_mean_part,
                        item_4=item_4.predicted)

fmt(prediction, row.names=TRUE)

Unnamed: 0,mean_rating,alpha,weighted_mean_part,item_4
user_a,3.8,1,-0.5131992,3.286801
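
Substituting the values from the two tables above into the formula, with $\bar{r}_a = 3.8$, $\alpha = 1$, and the two nearest neighbors User B and User D:

$
\begin{align}
\hat{r}_{a,4} = 3.8 + 1 \cdot \frac{(0.9941348)(-1.4) + (0.5)(1.25)}{0.9941348 + 0.5} = 3.8 + \frac{-0.7667887}{1.4941348} \approx 3.8 - 0.5131992 = 3.286801
\end{align}
$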


In [20]:
data.p["user_a","item_4"] = item_4.predicted
fmt(data.p, "data (after filtered just for user_a)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,3.286801,4,1
user_b,4.0,,5.0,2.0,4,2
user_c,,4.0,1.0,,2,1
user_d,3.0,,,5.0,4,3


**Predict all missing ratings:**

In [21]:
# predict.cf is not defined in this notebook; presumably it is loaded by setup.R
prediction = predict.cf(data, similarity, threshold=threshold, alpha=alpha)
fmt(prediction, row.names=TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,3.948234,,5.4,3.286801,4.282875,2.6175172
user_b,3.6,4.727786,4.248589,,3.568053,0.8875185
user_c,1.565846,3.2,3.2,3.25,2.233376,0.5684381
user_d,3.95,5.483333,3.483333,,3.816667,2.15
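
The notebook does not show the body of `predict.cf`. The sketch below is a guess at what such a function might do, built only from the formula and the single-prediction steps above. The name `predict_cf_sketch` and its internals are hypothetical, not the notebook's actual implementation.

```r
# Hypothetical stand-in for predict.cf: mean-adjusted, similarity-weighted
# prediction from the `threshold` nearest neighbors of each user.
predict_cf_sketch = function(R, S, threshold=2, alpha=1) {
  R = as.matrix(R)
  user_mean = rowMeans(R, na.rm=TRUE)          # each user's average over rated items
  predicted = matrix(NA_real_, nrow(R), ncol(R), dimnames=dimnames(R))
  for (u in rownames(R)) {
    others = setdiff(rownames(R), u)
    nn = names(sort(S[u, others], decreasing=TRUE))[1:threshold]  # nearest neighbors
    for (i in colnames(R)) {
      diff = R[nn, i] - user_mean[nn]          # neighbors' deviations from their own means
      predicted[u, i] = user_mean[u] + alpha * weighted.mean(diff, S[u, nn], na.rm=TRUE)
    }
  }
  predicted
}

# Ratings and similarity as in the sections above.
ratings = cbind(item_1=c(4,4,NA,3), item_2=c(5,NA,4,NA), item_3=c(5,5,1,NA),
                item_4=c(NA,2,NA,5), item_5=c(4,4,2,4), item_6=c(1,2,1,3))
rownames(ratings) = c("user_a","user_b","user_c","user_d")
similarity = cor(t(ratings), method="pearson", use="pairwise.complete.obs")
similarity[similarity < 0] = 0

p = predict_cf_sketch(ratings, similarity, threshold=2, alpha=1)
round(p, 6)
```

Under these assumptions the sketch reproduces the table above, including the `NaN` entries where neither neighbor has rated the item. Predictions are not clipped to the rating scale, which is why values such as 5.483333 can appear.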


In [22]:
data.p["user_a","item_4"] = prediction["user_a","item_4"]
data.p["user_b","item_2"] = prediction["user_b","item_2"]
data.p["user_c","item_1"] = prediction["user_c","item_1"]
data.p["user_c","item_4"] = prediction["user_c","item_4"]
data.p["user_d","item_2"] = prediction["user_d","item_2"]
data.p["user_d","item_3"] = prediction["user_d","item_3"]

fmt(data.p, "data (after filtered)", TRUE)

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6
user_a,4.0,5.0,5.0,3.286801,4,1
user_b,4.0,4.727786,5.0,2.0,4,2
user_c,1.565846,4.0,1.0,3.25,2,1
user_d,3.0,5.483333,3.483333,5.0,4,3


## Code

### Useful Functions

### Templates

## Expectations

## Further Reading



<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised July 17, 2020
</span>
</p>