## SVD 应用 -- 推荐系统

SVD 的另一个应用就是推荐系统了。其原理是先用 SVD 构建一个主题空间，然后在该空间下计算 items 之间的相似度。

### 奇异值怎么保留？

在上一节的例子中，我们是直接保留前两个奇异值的。一个能用的文案是将奇异值从大到小依次相加累计合达到 90% 以上就可以了。

### 相似度计算的方法

介绍三个方法，这三种方法我都将其值域限制在 [0, 1] 中，值越接近于 1 就越相似，而越接近于 0 就越不相似。

1. 欧氏距离
最简单的计算方法即：

$$
{\displaystyle {\begin{aligned}d(\mathbf {p} ,\mathbf {q} )=d(\mathbf {q} ,\mathbf {p} )&={\sqrt {(q_{1}-p_{1})^{2}+(q_{2}-p_{2})^{2}+\cdots +(q_{n}-p_{n})^{2}}}\\[8pt]&={\sqrt {\sum _{i=1}^{n}(q_{i}-p_{i})^{2}}}.\end{aligned}}} 
$$

若两个向量相等则，欧氏距离为 0，即最相似。为了和我们值域定义相一致，我们用公式来做下调解，即 

$$ sim = \frac {1} {1 + d(\textbf{p}, \textbf{q})} $$

2. cosine 

cosine 相似公式如下：
$$
{\displaystyle \cos(\theta )={\mathbf {A} \cdot \mathbf {B} \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}},} 
$$

cosine 的值域范围是 [-1, 1] 之间，我们将值域除 2，再加上 0.5, 将其变换到 [0, 1] 中，即有下面的公式:

$$
{\displaystyle sim = 0.5 + 0.5 \times {\mathbf {A} \cdot \mathbf {B} \over \|\mathbf {A} \|\|\mathbf {B} \|}} 
$$

3. 皮尔逊相关系数

皮尔逊系数是概率中引入的，计算方法如下：

$$
{\displaystyle \rho _{X,Y}={\frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}}}
$$

其中,
- $ \operatorname {cov(X, Y)} $ 是协方差, ${cov(X, Y)} = E\{ [X - E(X)] [Y - E(Y)] \}$。
- $\sigma _{X}$ 是 X 的标准方差。
- $\sigma _{Y}$ 是 Y 的标准方差。

皮尔逊系数的值域也是 [-1, 1], 为了保持一致我们也做一下转换，即：

$$
 sim = 0.5 + 0.5 \times \rho_{X,Y}
$$

通过下面的图，可以看出两个随机变量的皮尔逊系数，表示相关度还是可以的，但是距离的话，同一量级会有点问题。
![两个随机变量之间的皮尔逊相关系数](correlation.png)

**注意：** 皮尔逊系数对量级并不敏感，即如果一个向量全是 5，而另一个向量全是 1，那么皮尔逊相关系统会认为是一致的。

### 算法实现

下面一个例子是根据食客和菜品之间的推荐，每一行是一个食客而每一列是一个菜品。我们先通过 SVD 建立模型的主题空间，然后通过计算菜品的相似性来为食客未评过分的菜品进行评分。

食客(customer) 和 菜品 (meal) 组成的矩阵 X, 通过 SVD 最终的分解形式为：

$$
{\displaystyle {\begin{matrix}&X&&&U&&\Sigma &&V^{T}\\&({\textbf {m}}_{j})&&&&&&&({\hat {\textbf {m}}}_{j})\\&\downarrow &&&&&&&\downarrow \\({\textbf {c}}_{i}^{T})\rightarrow &{\begin{bmatrix}x_{1,1}&\dots &x_{1,j}&\dots &x_{1,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{m,1}&\dots &x_{m,j}&\dots &x_{m,n}\\\end{bmatrix}}&=&({\hat {\textbf {c}}}_{i}^{T})\rightarrow &{\begin{bmatrix}{\begin{bmatrix}\,\\\,\\{\textbf {u}}_{1}\\\,\\\,\end{bmatrix}}\dots {\begin{bmatrix}\,\\\,\\{\textbf {u}}_{l}\\\,\\\,\end{bmatrix}}\end{bmatrix}}&\cdot &{\begin{bmatrix}\sigma _{1}&\dots &0\\\vdots &\ddots &\vdots \\0&\dots &\sigma _{l}\\\end{bmatrix}}&\cdot &{\begin{bmatrix}{\begin{bmatrix}&&{\textbf {v}}_{1}&&\end{bmatrix}}\\\vdots \\{\begin{bmatrix}&&{\textbf {v}}_{l}&&\end{bmatrix}}\end{bmatrix}}\end{matrix}}} 
$$

我们找出奇异值总量超过 90% 的 k 值，所以最终可以写成：

$$
{\displaystyle X_{k}=U_{k}\Sigma _{k}V_{k}^{T}}
$$

$V_{k}^{T}$ 即是菜品在低维空间的新的变换坐标。

推荐算法：指定食客 customer, 找出 customer 未评分的菜品(unrated_meals) 和已经做出评分后的菜品 (rated_meal)。通过计算未评分菜品与所有评分菜品之间的相似度来给未评分的菜品一个推荐值，这就是基于菜品(items)的推荐。

In [190]:
import numpy as np

def ecludSim(inA, inB):
    '''
        计算欧氏相似度
    '''
    return 1 / (1 + np.linalg.norm(inA - inB))

def cosineSim(inA, inB):
    '''
        计算 cosine 相似度
    '''
    s = sum(inA * inB)
    f = np.linalg.norm(inA) * np.linalg.norm(inB)
    return 0.5 + 0.5 * (s/f)

def pearsonSim(inA, inB):
    '''
        计算 Pearson 相似度
    '''
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar = 0)[0][1]

class MealRec(object):
    '''
        菜品推荐系统
    '''
    def __init__(self, sim = cosineSim):
        self.sim = sim
        self.data = None
        self.U = None
        self.Sigma = None
        self.VT = None
        self.m = 0
        self.n = 0
        self.k = 0
        return
    
    def get90k(self, Sigma):
        k = 0
        acc = 0
        total = np.sum(Sigma)
        while ((acc / total) * 100 < 90):
            acc += Sigma[k]
            k += 1
        return k
    
    def model(self, data):
        self.data = data
        self.U, Sigma, self.VT = np.linalg.svd(data)
        print(Sigma)
        self.k = self.get90k(Sigma)
#         self.k = 4 
        self.Sigma = np.eye(self.k) * Sigma[:self.k]
        self.m = self.U.shape[0]
        self.n = self.VT.shape[0]
        return
    
    def estimate(self, customer, meal, rated_meals):
        if (len(rated_meals) == 0): return 0
        total = 0
        totalSim = 0
        VT = self.VT[0:self.k, :]
        for i in rated_meals:
            sim = self.sim(VT[:, meal], VT[:, i])
            totalSim = self.data[customer][i] * sim
            total += sim
            
        return totalSim / total
        
    
    def rec(self, customer):
        meals = ['Meal ' + str(i + 1) for i in range(self.n)]
        unrated_meals = []
        reated_meals = []
        for index, r in enumerate(self.data[customer]):
            if r == 0:
                unrated_meals.append(index)
            else:
                reated_meals.append(index)
        
        print(unrated_meals, list(map(str, unrated_meals)))
        if (len(unrated_meals) == 0):
            print('The customer has no unrated meals!')
        else:
            print("The customer's unrated meals are: ", ','.join([meals[i] for i in unrated_meals]))
            itemScores = []
            for i in unrated_meals:
                rating = self.estimate(customer, i, reated_meals)
                itemScores.append((i, rating))
                print("The meal {0}'s estimated rating is {1}".format(i, rating))
            print(sorted(itemScores, key = lambda item : item[1], reverse = True)[:3])
        return 
        

In [191]:
def loadExData():
    return[[0, 0, 0, 2, 2],
           [0, 0, 0, 3, 3],
           [0, 0, 0, 1, 1],
           [1, 1, 1, 0, 0],
           [2, 2, 2, 0, 0],
           [5, 5, 5, 0, 0],
           [1, 1, 1, 0, 0]]
data = loadExData()
rec = MealRec()
rec.model(data)
rec.rec(1)

[9.64365076e+00 5.29150262e+00 8.36478329e-16 6.91811207e-17
 1.11917251e-33]
[0, 1, 2] ['0', '1', '2']
The customer's unrated meals are:  Meal 1,Meal 2,Meal 3
The meal 0's estimated rating is 1.5
The meal 1's estimated rating is 1.5
The meal 2's estimated rating is 1.5
[(0, 1.5), (1, 1.5), (2, 1.5)]


In [192]:
def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

data = loadExData2()
rec = MealRec(pearsonSim)
rec.model(data)
rec.rec(1)
rec.rec(0)
VT = rec.VT[0:4, :]



[15.77075346 11.40670395 11.03044558  4.84639758  3.09292055  2.58097379
  1.00413543  0.72817072  0.43800353  0.22082113  0.07367823]
[0, 1, 2, 4, 6, 7, 8, 9] ['0', '1', '2', '4', '6', '7', '8', '9']
The customer's unrated meals are:  Meal 1,Meal 2,Meal 3,Meal 5,Meal 7,Meal 8,Meal 9,Meal 10
The meal 0's estimated rating is 1.1941939483801414
The meal 1's estimated rating is 1.0604446596974981
The meal 2's estimated rating is 1.0714481420479423
The meal 4's estimated rating is 0.9875830158134445
The meal 6's estimated rating is 0.9240840832595169
The meal 7's estimated rating is 1.3600484169927023
The meal 8's estimated rating is 1.3761430642061743
The meal 9's estimated rating is 0.967818555131494
[(8, 1.3761430642061743), (7, 1.3600484169927023), (0, 1.1941939483801414)]
[0, 1, 2, 3, 4, 6, 7, 8, 9] ['0', '1', '2', '3', '4', '6', '7', '8', '9']
The customer's unrated meals are:  Meal 1,Meal 2,Meal 3,Meal 4,Meal 5,Meal 7,Meal 8,Meal 9,Meal 10
The meal 0's estimated rating is 3.08472778

### 说明

在 《机器学习实战》 书用了 $ X_{k}=U_{k}\Sigma _{k}V_{k}^{T}, V = X^T U_k \Sigma_k^{-1} $ 来计算新的空间中的 V, 然后通过 V 来计算相似度，其实不用这么麻烦，因为在 SVD 的时候 VT 已经计算好了。下面可以看两者是等价的。

In [194]:
# print(VT)
print(VT[:,0])
print(VT[:,1])
V = np.mat(data).T * rec.U[:, :2] * np.mat(rec.Sigma).I
# print(V)
print(V[0])
print(V[1])


[-0.45137416  0.03084799 -0.00290108  0.01189185]
[-0.36239706  0.02584428 -0.00189127  0.01348796]


ValueError: shapes (11,2) and (5,5) not aligned: 2 (dim 1) != 5 (dim 0)