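## Explicit and Implicit Ratings
User ratings come in two kinds: explicit and implicit. An explicit rating is one where the user states outright how they rate an item.
An implicit rating is one where we do not ask the user to rate anything; instead we infer their preferences by observing their behavior.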

Problems with explicit ratings:
- People are lazy and do not bother to rate items
- People may lie or be biased
- People do not update their ratings

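What implicit ratings can we collect?
- Web pages: clicks on a page, time spent on it, repeat visits, how often it is referred to others, number of times a video is watched on Hulu
- Music players: tracks played, tracks skipped, play counts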

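## User-Based Collaborative Filtering
Everything described so far is user-based collaborative filtering: we compare one user against every other user to find similar people. This approach has two drawbacks:
1. Scalability. As noted earlier, the amount of computation grows with the number of users. The approach works well with a few thousand users but becomes a bottleneck at a million.
2. Sparsity. Most recommendation systems carry far more items than any one user can rate, so each user has rated only a small fraction of them and the rating data is sparse. Amazon, for example, lists millions of books, but a given user has rated only a handful, which makes it hard to find two similar users.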
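To make the sparsity point concrete, here is a minimal sketch that measures what fraction of the user-item matrix is actually filled in. The function name `density` and the toy `ratings` dictionary are illustrative assumptions, not part of the original notes.

```python
def density(ratings):
    """Return the fraction of (user, item) cells that hold a rating."""
    items = set()
    for user_ratings in ratings.values():
        items.update(user_ratings)
    filled = sum(len(user_ratings) for user_ratings in ratings.values())
    total = len(ratings) * len(items)
    # An empty data set has no cells; report 0.0 rather than dividing by zero
    return filled / total if total else 0.0

# Hypothetical toy data: 3 users, 4 books, 5 ratings in total
ratings = {
    "u1": {"book_a": 5, "book_b": 3},
    "u2": {"book_c": 4},
    "u3": {"book_a": 2, "book_d": 1},
}
print(density(ratings))  # 5 filled cells out of 12
```

With real catalogs the density is typically far lower than in this toy case, which is why two users rarely share enough rated items to be compared reliably.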

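## Item-Based Collaborative Filtering
Note the difference between the two approaches. User-based collaborative filtering computes distances between users to find the most similar users and recommends the items they rated to the target user. Item-based collaborative filtering finds the most similar items and combines them with the user's own ratings to produce recommendations.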

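## Adjusted Cosine Similarity
Different users use the rating scale differently, a phenomenon known as "grade inflation". To compensate, we subtract each user's mean rating from every rating that user gave. The resulting measure is the adjusted cosine similarity.
$$s(i,j)=\frac{\sum_{u \in U} (R_{u,i}-\bar{R}_u)(R_{u,j}-\bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i}-\bar{R}_u)^2}\,\sqrt{\sum_{u \in U}(R_{u,j}-\bar{R}_u)^2}}$$
U is the set of users who rated both item i and item j, and $\bar{R}_u$ is the mean of all ratings given by user u.

s(i,j) is the similarity between items i and j. The numerator multiplies, for each user in U, the mean-adjusted ratings of i and j and sums the products. The denominator is the product of two square roots: the sum of squared adjusted ratings of i over U, and the sum of squared adjusted ratings of j over U.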

The table below shows five students' ratings of five artists.
![title](rating4.png)

In [1]:
# -*- coding: utf-8 -*-
from math import sqrt
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1}, 
          "Matt": {"Imagine Dragons": 3, "Daft Punk": 4,"Lorde": 4, "Fall Out Boy": 1}, 
          "Ben": {"Kacey Musgraves": 4, "Imagine Dragons": 3,"Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,"Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori": {"Kacey Musgraves": 5, "Imagine Dragons": 4,"Daft Punk": 5, "Fall Out Boy": 3}}

In [2]:
def computeSimilarity(band1, band2, userRatings):
    averages = {}
    for (key, ratings) in userRatings.items():
        averages[key] = (float(sum(ratings.values())) / len(ratings.values()))
        
    num = 0   # numerator
    dem1 = 0  # first factor of the denominator (band1)
    dem2 = 0  # second factor of the denominator (band2)
    
    for (user, ratings) in userRatings.items():
        if band1 in ratings and band2 in ratings:
            avg = averages[user]
            num += (ratings[band1] - avg) * (ratings[band2] - avg)
            dem1 += (ratings[band1] - avg) ** 2
            dem2 += (ratings[band2] - avg) ** 2
            
    return num / (sqrt(dem1) * sqrt(dem2))

In [3]:
print(computeSimilarity('Kacey Musgraves', 'Lorde', users3))

0.320959291340884
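
This value can be checked by hand. Only Ben and Chris rated both Kacey Musgraves and Lorde, and their mean ratings are 2.75 and 3.2 respectively, so:

$$\begin{aligned}
\text{numerator} &= (4-2.75)(3-2.75) + (4-3.2)(3-3.2) = 0.3125 - 0.16 = 0.1525 \\
\text{denominator} &= \sqrt{(4-2.75)^2 + (4-3.2)^2}\,\sqrt{(3-2.75)^2 + (3-3.2)^2} = \sqrt{2.2025}\,\sqrt{0.1025} \approx 0.4751 \\
s(\text{Kacey Musgraves}, \text{Lorde}) &\approx \frac{0.1525}{0.4751} \approx 0.3210
\end{aligned}$$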


In [4]:
print(computeSimilarity('Imagine Dragons', 'Lorde', users3)) 
print(computeSimilarity('Daft Punk', 'Lorde', users3)) 

-0.2525265372291518
0.7841149584671063


Computing the similarity for every pair of artists gives the following matrix:
![title](rating5.png)
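
A matrix like this can be built by evaluating the similarity for every pair of artists. The sketch below is one possible way to do it and is self-contained: `raw_ratings`, `adjusted_cosine`, and `similarity_matrix` are names introduced here for illustration, and returning 0.0 when the denominator is zero is an assumption of this sketch, not something the original notes specify.

```python
from itertools import combinations
from math import sqrt

# Same 1-5 star ratings as the students table above
raw_ratings = {
    "David": {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1},
    "Matt": {"Imagine Dragons": 3, "Daft Punk": 4, "Lorde": 4, "Fall Out Boy": 1},
    "Ben": {"Kacey Musgraves": 4, "Imagine Dragons": 3, "Lorde": 3, "Fall Out Boy": 1},
    "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4, "Daft Punk": 4,
              "Lorde": 3, "Fall Out Boy": 1},
    "Tori": {"Kacey Musgraves": 5, "Imagine Dragons": 4, "Daft Punk": 5,
             "Fall Out Boy": 3},
}

def adjusted_cosine(item1, item2, ratings):
    """Adjusted cosine similarity over users who rated both items."""
    num = dem1 = dem2 = 0.0
    for user_ratings in ratings.values():
        if item1 in user_ratings and item2 in user_ratings:
            avg = sum(user_ratings.values()) / len(user_ratings)
            d1 = user_ratings[item1] - avg
            d2 = user_ratings[item2] - avg
            num += d1 * d2
            dem1 += d1 ** 2
            dem2 += d2 ** 2
    denom = sqrt(dem1) * sqrt(dem2)
    # Assumption: treat an undefined similarity as 0.0
    return num / denom if denom else 0.0

def similarity_matrix(ratings):
    """Return a nested dict: matrix[a][b] is the similarity of items a and b."""
    items = sorted({item for r in ratings.values() for item in r})
    matrix = {item: {} for item in items}
    for a, b in combinations(items, 2):
        s = adjusted_cosine(a, b, ratings)
        matrix[a][b] = s
        matrix[b][a] = s  # the measure is symmetric
    return matrix

matrix = similarity_matrix(raw_ratings)
for item, row in matrix.items():
    print(item, {other: round(s, 4) for other, s in row.items()})
```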

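How do we use it to make recommendations?

$$p(u,i)=\frac{\sum_{N \in similarTo(i)} (S_{i,N} \times R_{u,N})}{\sum_{N \in similarTo(i)} (|S_{i,N}|)}$$

- p(u,i) is the predicted rating of user u for item i.
- N ranges over the items similar to i, that is, the items that user u has rated and for which a similarity to i is available (the matrix above).
- $S_{i,N}$ is the similarity between items i and N, and $R_{u,N}$ is user u's rating of item N.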

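The formula works best when ratings lie between -1 and 1. Our ratings run from 1 to 5 stars, so we first map them onto that range: 1 maps to -1, 2 to -0.5, 3 to 0, 4 to 0.5, and 5 to 1. We predict with the normalized ratings $NR_{u,N}$ in place of $R_{u,N}$ and then convert the result back to the 1-5 scale.

$$NR_{u,N}=\frac{2(R_{u,N}-Min_R)-(Max_R-Min_R)}{Max_R-Min_R}$$

To convert a normalized value back to the original scale, invert the mapping:

$$R_{u,N}=\frac{1}{2}\left((NR_{u,N}+1)(Max_R-Min_R)\right)+Min_R$$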
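
The normalization can be written directly from the formula instead of case by case. The following is a minimal sketch; the function names `normalize` and `denormalize` and the default bounds of 1 and 5 are assumptions made for this example, and `denormalize` is simply the algebraic inverse of the normalization formula.

```python
def normalize(rating, min_r=1, max_r=5):
    """Map a rating in [min_r, max_r] onto [-1, 1]."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)

def denormalize(nr, min_r=1, max_r=5):
    """Map a value in [-1, 1] back onto [min_r, max_r]."""
    return 0.5 * ((nr + 1) * (max_r - min_r)) + min_r

for stars in range(1, 6):
    print(stars, normalize(stars))
```

Because neither function modifies its input, the original ratings stay available for other computations.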

In [5]:
# Map the 1-5 star ratings onto [-1, 1] in place (run once only; a second run would corrupt the data)
def computeNRu(users):
    for (user,ratings) in users.items():
        for (band,score) in ratings.items():
            if ratings[band] == 1:
                ratings[band] = -1
            elif ratings[band] == 2:
                ratings[band] = -0.5
            elif ratings[band] == 3:
                ratings[band] = 0
            elif ratings[band] == 4:
                ratings[band] = 0.5
            else:
                ratings[band] = 1

computeNRu(users3)
users3

{'David': {'Imagine Dragons': 0,
  'Daft Punk': 1,
  'Lorde': 0.5,
  'Fall Out Boy': -1},
 'Matt': {'Imagine Dragons': 0,
  'Daft Punk': 0.5,
  'Lorde': 0.5,
  'Fall Out Boy': -1},
 'Ben': {'Kacey Musgraves': 0.5,
  'Imagine Dragons': 0,
  'Lorde': 0,
  'Fall Out Boy': -1},
 'Chris': {'Kacey Musgraves': 0.5,
  'Imagine Dragons': 0.5,
  'Daft Punk': 0.5,
  'Lorde': 0,
  'Fall Out Boy': -1},
 'Tori': {'Kacey Musgraves': 1,
  'Imagine Dragons': 0.5,
  'Daft Punk': 1,
  'Fall Out Boy': 0}}

In [18]:
# Predict David's rating for Kacey Musgraves
def getscore(username,bandname,users):
    for (user,ratings) in users.items():
        if user == username:
            for (band,score) in ratings.items():
                if band == bandname:
                    return score
            
def pred_D():
    ID = computeSimilarity('Kacey Musgraves','Imagine Dragons',users3)
    DP = computeSimilarity('Kacey Musgraves','Daft Punk',users3)
    L = computeSimilarity('Kacey Musgraves','Lorde',users3)
    FOB = computeSimilarity('Kacey Musgraves','Fall Out Boy',users3)
    ID_rate = getscore('David','Imagine Dragons',users3)
    DP_rate = getscore('David','Daft Punk',users3)
    L_rate = getscore('David','Lorde',users3)
    FOB_rate = getscore('David','Fall Out Boy',users3)
    numerator = ID*ID_rate + DP*DP_rate + L*L_rate + FOB*FOB_rate
    denominator = abs(ID) + abs(DP) + abs(L) + abs(FOB)
    rate = numerator/denominator
    pred_rate = 1/2*((rate+1)*(5-1))+1
    print(pred_rate)
    return pred_rate

In [19]:
a = pred_D()

4.509984014468131
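
The same prediction can be written for any user and item rather than hard-coding David and Kacey Musgraves. The sketch below is self-contained and works on the original 1-5 star ratings, normalizing on the fly. The names `raw_ratings`, `adjusted_cosine`, and `predict` are introduced here for illustration, and returning `None` when no similarity is available is an assumption of this sketch.

```python
from math import sqrt

raw_ratings = {
    "David": {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1},
    "Matt": {"Imagine Dragons": 3, "Daft Punk": 4, "Lorde": 4, "Fall Out Boy": 1},
    "Ben": {"Kacey Musgraves": 4, "Imagine Dragons": 3, "Lorde": 3, "Fall Out Boy": 1},
    "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4, "Daft Punk": 4,
              "Lorde": 3, "Fall Out Boy": 1},
    "Tori": {"Kacey Musgraves": 5, "Imagine Dragons": 4, "Daft Punk": 5,
             "Fall Out Boy": 3},
}

def adjusted_cosine(item1, item2, ratings):
    """Adjusted cosine similarity over users who rated both items."""
    num = dem1 = dem2 = 0.0
    for user_ratings in ratings.values():
        if item1 in user_ratings and item2 in user_ratings:
            avg = sum(user_ratings.values()) / len(user_ratings)
            d1 = user_ratings[item1] - avg
            d2 = user_ratings[item2] - avg
            num += d1 * d2
            dem1 += d1 ** 2
            dem2 += d2 ** 2
    denom = sqrt(dem1) * sqrt(dem2)
    return num / denom if denom else 0.0

def predict(user, item, ratings, min_r=1, max_r=5):
    """Predict user's rating of item on the original min_r..max_r scale."""
    span = max_r - min_r
    num = den = 0.0
    for other, stars in ratings[user].items():
        if other == item:
            continue
        s = adjusted_cosine(item, other, ratings)
        nr = (2 * (stars - min_r) - span) / span  # normalize to [-1, 1]
        num += s * nr
        den += abs(s)
    if den == 0:
        return None  # assumption: no usable similarity, so no prediction
    return 0.5 * ((num / den + 1) * span) + min_r  # back to the star scale

print(predict("David", "Kacey Musgraves", raw_ratings))
```

Called as above, this yields roughly 4.51, in line with the output of `pred_D`.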


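## The Slope One Algorithm
### Step 1: Compute the deviations
The deviation between two items is computed as:
$$dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i-u_j}{card(S_{i,j}(X))}$$
$S_{i,j}(X)$ is the set of users who rated both i and j, and $card(S_{i,j}(X))$ is the number of such users. Each term of the sum is one user's rating of i minus that user's rating of j, so $dev_{i,j}$ is the average amount by which i is rated higher than j.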
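
As an illustration, the deviation can be computed for a single pair of items with a few lines of Python. This is a minimal sketch: the function name `deviation` and its `(average, count)` return value are assumptions made for this example, and the data is the small four-user set used at the end of this notebook.

```python
def deviation(i, j, ratings):
    """Return (dev_ij, count): the average of u_i - u_j over users who rated both."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    if not diffs:
        return None, 0  # assumption: no common raters means no deviation
    return sum(diffs) / len(diffs), len(diffs)

ratings = {
    "Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
    "Ben": {"Taylor Swift": 5, "PSY": 2},
    "Clara": {"PSY": 3.5, "Whitney Houston": 4},
    "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3},
}

print(deviation("Taylor Swift", "PSY", ratings))     # Amy: 4-3, Ben: 5-2
print(deviation("Whitney Houston", "PSY", ratings))  # Amy: 4-3, Clara: 4-3.5
```

Swapping the two items flips the sign of the deviation while the count stays the same.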
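### Step 2: Predict with Weighted Slope One
$$P^{wS1}(u)_j=\frac{\sum_{i \in S(u)-\{j\}} (dev_{j,i}+u_i)c_{j,i}}{\sum_{i \in S(u)-\{j\}} c_{j,i}}$$

Here $S(u)$ is the set of artists user u has rated and $c_{j,i} = card(S_{j,i}(X))$ is the number of users who rated both j and i. For every artist i the user has rated other than j, the numerator takes the deviation of j relative to i, adds the user's rating of i, and multiplies the result by the number of users who rated both artists. The denominator is the sum of those counts.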
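
As a worked example, take the small four-user data set used at the end of this notebook and predict Ben's rating of Whitney Houston. Ben rated Taylor Swift 5 and PSY 2. Two users (Amy and Daisy) rated both Whitney Houston and Taylor Swift, giving a deviation of $((4-4)+(3-5))/2=-1$. Two users (Amy and Clara) rated both Whitney Houston and PSY, giving a deviation of $((4-3)+(4-3.5))/2=0.75$. Therefore:

$$P^{wS1}(\text{Ben})_{\text{Whitney Houston}} = \frac{(-1+5)\cdot 2 + (0.75+2)\cdot 2}{2+2} = \frac{13.5}{4} = 3.375$$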

In [28]:
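import codecs  # needed by loadBookDB
from math import sqrt  # needed by pearson

class recommender:
    def __init__(self, data, k=1, metric='pearson', n=5):
        """ Initialize the recommender
        data   training data
        k      the k in k-nearest neighbors
        metric which distance function to use
        n      number of recommendations to return
        """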
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # The following variables are used by the Slope One algorithm
        self.frequencies = {}
        self.deviations = {}
        # Remember which distance function to use
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        # If data is a dictionary, keep it; otherwise ignore it
        if type(data).__name__ == 'dict':
            self.data = data
        
    def computeDeviations(self):
        # For each user's ratings
        for ratings in self.data.values():
            # For each (item, rating) pair of that user
            for (item, rating) in ratings.items():
                self.frequencies.setdefault(item, {})
                self.deviations.setdefault(item, {})
                # Go through that user's ratings a second time
                for (item2, rating2) in ratings.items():
                    if item != item2:
                        # Accumulate the pair count and the rating difference
                        self.frequencies[item].setdefault(item2, 0)
                        self.deviations[item].setdefault(item2, 0.0)
                        self.frequencies[item][item2] += 1
                        self.deviations[item][item2] += rating - rating2
                        
        # Divide each accumulated difference by its count to get the average deviation
        for (item, ratings) in self.deviations.items():
            for item2 in ratings:
                ratings[item2] /= self.frequencies[item][item2]
      
    def slopeOneRecommendations(self, userRatings):
        recommendations = {}
        frequencies = {}
        # For each (item, rating) pair of the target user
        for (userItem, userRating) in userRatings.items():
            # Consider only items the target user has not rated
            for (diffItem, diffRatings) in self.deviations.items():
                if diffItem not in userRatings and userItem in self.deviations[diffItem]:
                    freq = self.frequencies[diffItem][userItem]
                    recommendations.setdefault(diffItem, 0.0)
                    frequencies.setdefault(diffItem, 0)
                    # Numerator
                    recommendations[diffItem] += (diffRatings[userItem] + userRating) * freq
                    
                    # Denominator
                    frequencies[diffItem] += freq
        recommendations = [(k, v / frequencies[k]) for (k, v) in recommendations.items()]
        # Sort by predicted rating, highest first
        recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse=True)

        return recommendations
    
    def convertProductID2name(self, id):
        """Return the product name for a product ID (or the ID itself if unknown)"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id
        
    def userRatings(self, id, n):
        """Print the user's n highest-rated items"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)for (k, v) in ratings]
        # Sort by rating, highest first
        ratings.sort(key=lambda artistTuple: artistTuple[1],reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))
            
    def loadBookDB(self, path=''):
        """Load the BX book data set; path is the location (prefix) of the data files"""
        self.data = {}
        i = 0
        # Read the book ratings into self.data
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        # Close the ratings file only after every line has been read
        f.close()
        # Read the book information into self.productid2name
        # (ISBN, title, author, and so on)
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        # Read the user information into self.userid2name and self.username2id
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + ' (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        # Total number of lines read across the three files
        print(i)
            
    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # Compute the denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator
        
    def computeNearestNeighbor(self, username):
        """Return all other users sorted by similarity to username, most similar first"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],self.data[instance])
                distances.append((instance, distance))
        # Sort once after the loop; Pearson is a similarity, so larger means closer
        distances.sort(key=lambda artistTuple: artistTuple[1],reverse=True)
        return distances
    
    def recommend(self, user):
        """Return a list of recommendations"""
        recommendations = {}
        # First, find the nearest neighbors
        nearest = self.computeNearestNeighbor(user)
        # Items the user has already rated
        userRatings = self.data[user]
        # Total similarity of the k nearest neighbors
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # Accumulate the ratings of the k nearest neighbors
        for i in range(self.k):
            # This neighbor's share of the total (its slice of the pie)
            weight = nearest[i][1] / totalDistance
            # The neighbor's name
            name = nearest[i][0]
            # The neighbor's ratings
            neighborRatings = self.data[name]
            # Items the neighbor rated that the user has not
            for artist in neighborRatings:
                if not artist in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist] * weight)
                    else:
                        recommendations[artist] = (recommendations[artist] + neighborRatings[artist] * weight)
        # Build the recommendation list
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v) for (k, v) in recommendations]
        # Sort by score, highest first
        recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse = True)
        # Return the top n results
        return recommendations[:self.n]

In [22]:
users2 = {"Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
          "Ben": {"Taylor Swift": 5, "PSY": 2},
          "Clara": {"PSY": 3.5, "Whitney Houston": 4},
          "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3}}

In [30]:
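r = recommender(users2)
r.computeDeviations()  # the deviations must be computed before asking for recommendations
r.slopeOneRecommendations(users2['Ben'])

[('Whitney Houston', 3.375)]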