<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-读取数据" data-toc-modified-id="1-读取数据-1">1 Load the data</a></span></li><li><span><a href="#2-基于邻域的协同过滤——4种算法对比" data-toc-modified-id="2-基于邻域的协同过滤——4种算法对比-2">2 Neighborhood-based collaborative filtering: comparing 4 algorithms</a></span><ul class="toc-item"><li><span><a href="#2.1-KNNBasic" data-toc-modified-id="2.1-KNNBasic-2.1">2.1 KNNBasic</a></span><ul class="toc-item"><li><span><a href="#2.1.1-基于用户的协同过滤" data-toc-modified-id="2.1.1-基于用户的协同过滤-2.1.1">2.1.1 User-based collaborative filtering</a></span></li><li><span><a href="#2.1.2-基于商品的协同过滤" data-toc-modified-id="2.1.2-基于商品的协同过滤-2.1.2">2.1.2 Item-based collaborative filtering</a></span></li></ul></li><li><span><a href="#2.2-KNNBasicWithMeans" data-toc-modified-id="2.2-KNNBasicWithMeans-2.2">2.2 KNNWithMeans</a></span><ul class="toc-item"><li><span><a href="#2.2.1-基于用户" data-toc-modified-id="2.2.1-基于用户-2.2.1">2.2.1 User-based</a></span></li><li><span><a href="#2.2.2-基于商品" data-toc-modified-id="2.2.2-基于商品-2.2.2">2.2.2 Item-based</a></span></li></ul></li><li><span><a href="#2.3-KNNWithZScore" data-toc-modified-id="2.3-KNNWithZScore-2.3">2.3 KNNWithZScore</a></span><ul class="toc-item"><li><span><a href="#2.3.1-基于用户" data-toc-modified-id="2.3.1-基于用户-2.3.1">2.3.1 User-based</a></span></li><li><span><a href="#2.3.2-基于商品" data-toc-modified-id="2.3.2-基于商品-2.3.2">2.3.2 Item-based</a></span></li></ul></li><li><span><a href="#2.4-KNNBaseline" data-toc-modified-id="2.4-KNNBaseline-2.4">2.4 KNNBaseline</a></span><ul class="toc-item"><li><span><a href="#2.4.1-基于用户" data-toc-modified-id="2.4.1-基于用户-2.4.1">2.4.1 User-based</a></span></li><li><span><a href="#2.4.2-基于商品" data-toc-modified-id="2.4.2-基于商品-2.4.2">2.4.2 Item-based</a></span></li></ul></li></ul></li><li><span><a href="#3-总结：" data-toc-modified-id="3-总结：-3">3 Summary</a></span></li></ul></div>

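Action 3:

Using neighborhood-based collaborative filtering (any one of KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline),

run collaborative filtering on the MovieLens dataset with k-fold cross-validation (k=3) and print the RMSE and MAE of each fold.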

In [1]:
import pandas as pd

In [2]:
from surprise import accuracy
from surprise import Dataset, Reader
from surprise.model_selection import KFold
from surprise.model_selection.split import train_test_split
from surprise import KNNBaseline, KNNWithMeans, KNNWithZScore, KNNBasic

# 1 Load the data

In [3]:
# Tell the reader how each line of the ratings file is laid out
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)

In [4]:
# Load the ratings into a Surprise dataset
dataset = Dataset.load_from_file('./ratings.csv', reader=reader)

In [5]:
df = pd.read_csv('./ratings.csv')

In [6]:
df.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676


# 2 Neighborhood-based collaborative filtering: comparing 4 algorithms

## 2.1 KNNBasic

### 2.1.1 User-based collaborative filtering

In [7]:
# 3-fold cross-validation iterator
kf = KFold(n_splits=3)
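
To make explicit what a 3-fold split does, here is a minimal pure-Python sketch. It is not Surprise's implementation, and `k_fold_indices` is a hypothetical helper introduced only for illustration: the indices are shuffled, cut into k chunks, and each chunk serves as the test set exactly once.

```python
import random

def k_fold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Integer arithmetic spreads any remainder across the folds.
        start = fold * n_samples // n_splits
        stop = (fold + 1) * n_samples // n_splits
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

folds = list(k_fold_indices(10, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Surprise's `KFold` also accepts a `random_state` argument; fixing it would let every algorithm below be evaluated on the same three folds.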

In [8]:
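sim_options = {'user_based': True,  # True: user-based collaborative filtering
               'name': 'MSD',       # similarity between users: mean squared difference
               'min_support': 10}   # fewer than 10 common items -> similarity 0

# 'verbose' is a constructor argument (default True), not a sim_options key,
# so it is dropped from the dict; the progress messages below are still printed.
knnbasic_u = KNNBasic(k=40,  # maximum number of neighbors
                      sim_options=sim_options)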
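
As a rough illustration of what the `'MSD'` and `min_support` settings mean, the sketch below computes a mean-squared-difference similarity between two rating dictionaries. The function name and the sample users are made up for this example; it mirrors the documented formula `1 / (msd + 1)` but is not the library's code.

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """MSD similarity on co-rated entries; 0.0 when the overlap is too small.

    ratings_a / ratings_b map item id -> rating for two users
    (or user id -> rating for two items in the item-based case).
    """
    common = set(ratings_a) & set(ratings_b)
    if not common or len(common) < min_support:
        return 0.0  # too little overlap: treat the pair as dissimilar
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Hypothetical users sharing two movies (m1, m2)
alice = {"m1": 4.0, "m2": 3.0, "m3": 5.0}
bob = {"m1": 5.0, "m2": 3.0, "m4": 2.0}

sim_default = msd_similarity(alice, bob)
sim_strict = msd_similarity(alice, bob, min_support=3)
print(sim_default, sim_strict)
```

With `min_support=10`, as used in this notebook, any pair with fewer than ten co-rated entries contributes nothing to a prediction.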

In [9]:
%%time
for train, test in kf.split(dataset):
    # Fit on the training folds
    knnbasic_u.fit(train)
    # Predict on the held-out fold
    predictions = knnbasic_u.test(test)
    # Report RMSE and MAE for this fold
    accuracy.rmse(predictions, verbose=True)
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8961
MAE:  0.6841
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8974
MAE:  0.6859
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8947
MAE:  0.6838
CPU times: user 9min 32s, sys: 5.34 s, total: 9min 37s
Wall time: 9min 54s
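
For reference, the two metrics printed above can be written out in a few lines. This is a sketch on made-up (true, estimated) pairs, not the `surprise.accuracy` source.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Hypothetical predictions, for illustration only
pairs = [(3.5, 3.0), (4.0, 4.5), (2.0, 3.0)]
print(rmse(pairs), mae(pairs))
```

RMSE squares each error before averaging, so it penalizes large misses more than MAE does; that is why RMSE is the larger of the two numbers in every fold above.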


In [10]:
knnbasic_u.predict(uid='1', iid='2', r_ui=3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.390175951663089, details={'actual_k': 40, 'was_impossible': False})
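
The estimate returned by KNNBasic is a similarity-weighted average of the ratings that the nearest neighbors gave to the target item. A minimal sketch of that rule follows, assuming the neighbors' similarities are already known; the function and the numbers are illustrative, not taken from the notebook.

```python
import heapq

def knn_basic_predict(neighbors, k=40):
    """Similarity-weighted mean of the k most similar neighbors' ratings.

    neighbors: list of (similarity, rating) for users who rated the target item.
    Returns None when no neighbor has positive similarity.
    """
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    top_k = [(s, r) for s, r in top_k if s > 0]
    if not top_k:
        return None
    total_weight = sum(s for s, _ in top_k)
    return sum(s * r for s, r in top_k) / total_weight

# Three hypothetical neighbors; with k=2 only the two most similar are used
neighbors = [(0.8, 4.0), (0.5, 3.0), (0.2, 1.0)]
estimate = knn_basic_predict(neighbors, k=2)
print(estimate)
```

The `actual_k` field in the `Prediction` above reports how many neighbors were actually used; here it equals the configured `k=40`.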

In [36]:
(0.8947 + 0.8974 + 0.8961) / 3

0.8960666666666667

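### 2.1.2 Item-based collaborative filtering

In [11]:
# 3-fold cross-validation iterator
kf = KFold(n_splits=3)

In [12]:
sim_options = {'user_based': False,  # False: item-based collaborative filtering
               'name': 'MSD',        # similarity between items: mean squared difference
               'min_support': 10}    # fewer than 10 common users -> similarity 0

# 'verbose' is a constructor argument (default True), not a sim_options key
knnbasic_i = KNNBasic(k=40,  # maximum number of neighbors
                      sim_options=sim_options)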

In [13]:
%%time
for train, test in kf.split(dataset):
    # Fit on the training folds
    knnbasic_i.fit(train)
    # Predict on the held-out fold
    predictions = knnbasic_i.test(test)
    # Report RMSE for this fold
    accuracy.rmse(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9067
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9054
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9026
CPU times: user 6min 14s, sys: 11.8 s, total: 6min 26s
Wall time: 6min 37s


In [14]:
knnbasic_i.predict(uid='1', iid='2', r_ui=3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.671374060490465, details={'actual_k': 40, 'was_impossible': False})

In [37]:
(0.9067 + 0.9054 + 0.9026) / 3

0.9049

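## 2.2 KNNWithMeans

### 2.2.1 User-based

In [15]:
# 3-fold cross-validation iterator
kf = KFold(n_splits=3)

In [16]:
sim_options = {'user_based': True,  # True: user-based collaborative filtering
               'name': 'MSD',       # similarity between users: mean squared difference
               'min_support': 10}   # fewer than 10 common items -> similarity 0

# 'verbose' is a constructor argument (default True), not a sim_options key
knnmeans_u = KNNWithMeans(k=40,  # maximum number of neighbors
                          sim_options=sim_options)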
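
KNNWithMeans differs from KNNBasic in that it averages each neighbor's deviation from that neighbor's own mean rating, then adds the result to the target user's mean. A small sketch of that idea, assuming the top-k neighbors have already been selected; names and numbers are hypothetical.

```python
def knn_with_means_predict(target_mean, neighbors):
    """Target mean plus the similarity-weighted mean-centered neighbor ratings.

    neighbors: (similarity, neighbor_rating, neighbor_mean) triples.
    Falls back to the target mean when no neighbor has positive similarity.
    """
    used = [(s, r, m) for s, r, m in neighbors if s > 0]
    if not used:
        return target_mean
    total_weight = sum(s for s, _, _ in used)
    offset = sum(s * (r - m) for s, r, m in used) / total_weight
    return target_mean + offset

# One neighbor rated the item 1 point above their mean, the other exactly at it
neighbors = [(0.75, 5.0, 4.0), (0.25, 3.0, 3.0)]
estimate = knn_with_means_predict(3.0, neighbors)
print(estimate)
```

Centering removes the effect of users who rate everything high or low, which is one plausible reason the RMSE here is lower than for KNNBasic.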

In [17]:
%%time
for train, test in kf.split(dataset):
    # Fit on the training folds
    knnmeans_u.fit(train)
    # Predict on the held-out fold
    predictions = knnmeans_u.test(test)
    # Report RMSE for this fold
    accuracy.rmse(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8751
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8747
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8771
CPU times: user 9min 38s, sys: 4.99 s, total: 9min 43s
Wall time: 9min 55s


In [18]:
knnmeans_u.predict(uid='1', iid='2', r_ui=3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.3996500571907347, details={'actual_k': 40, 'was_impossible': False})

In [38]:
(0.8751 + 0.8747 + 0.8771) / 3

0.8756333333333334

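### 2.2.2 Item-based

In [19]:
# 3-fold cross-validation iterator
kf = KFold(n_splits=3)

In [20]:
sim_options = {'user_based': False,  # False: item-based collaborative filtering
               'name': 'MSD',        # similarity between items: mean squared difference
               'min_support': 10}    # fewer than 10 common users -> similarity 0

# 'verbose' is a constructor argument (default True), not a sim_options key
knnmeans_i = KNNWithMeans(k=40,  # maximum number of neighbors
                          sim_options=sim_options)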

In [21]:
%%time
for train,tests in kf.split(dataset):
    knnmeans_i.fit(train)
    preds = knnmeans_i.test(tests)
    accuracy.rmse(preds, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8588
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8589
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8572
CPU times: user 6min 14s, sys: 12.4 s, total: 6min 27s
Wall time: 6min 38s


In [22]:
knnmeans_i.predict(uid='1', iid='2', r_ui=3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.6045238476068833, details={'actual_k': 40, 'was_impossible': False})

In [39]:
(0.8572 + 0.8589 + 0.8588) / 3

0.8583

## 2.3 KNNWithZScore

### 2.3.1 User-based

In [23]:
# 3-fold cross-validation iterator
kf = KFold(n_splits=3)

In [24]:
sim_options = {
    'user_based': True,
    'name':'MSD',
    'min_support': 10,
    # 'verbose' is not a sim_options key; it is a constructor argument (default True)
}

knnzscore_u = KNNWithZScore(k=40, sim_options=sim_options) 
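
KNNWithZScore goes one step further than mean-centering: each neighbor's rating is converted to a z-score using that neighbor's mean and standard deviation, and the weighted average is scaled back by the target user's standard deviation. The sketch below assumes all standard deviations are positive and that the neighbors are already the selected top-k; the values are invented for illustration.

```python
def knn_zscore_predict(target_mean, target_std, neighbors):
    """Target mean plus target std times the similarity-weighted neighbor z-scores.

    neighbors: (similarity, rating, neighbor_mean, neighbor_std) tuples,
    with every neighbor_std assumed to be > 0.
    """
    used = [n for n in neighbors if n[0] > 0]
    if not used:
        return target_mean
    total_weight = sum(s for s, _, _, _ in used)
    weighted_z = sum(s * (r - m) / sd for s, r, m, sd in used) / total_weight
    return target_mean + target_std * weighted_z

# z-scores of the two hypothetical neighbors: +1.0 and -0.5
neighbors = [(0.75, 5.0, 4.0, 1.0), (0.25, 2.0, 3.0, 2.0)]
estimate = knn_zscore_predict(3.0, 0.5, neighbors)
print(estimate)
```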

In [25]:
%%time
for train,tests in  kf.split(dataset):
    knnzscore_u.fit(train)
    preds=knnzscore_u.test(tests)
    accuracy.rmse(preds, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8758
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8744
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8733
CPU times: user 10min 1s, sys: 7.61 s, total: 10min 9s
Wall time: 10min 36s


In [26]:
knnzscore_u.predict('1', '2', 3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.604964020611294, details={'actual_k': 40, 'was_impossible': False})

In [40]:
(0.8733 + 0.8744 + 0.8758) / 3

0.8744999999999999

### 2.3.2 Item-based

In [27]:
kf = KFold(n_splits=3)

sim_options = {
    'user_based': False,
    'name':'MSD',
    'min_support': 10,
    # 'verbose' is not a sim_options key; it is a constructor argument (default True)
}

knnzscore_i = KNNWithZScore(k=40, sim_options=sim_options) 

In [28]:
for train,tests in kf.split(dataset):
    knnzscore_i.fit(train)
    preds = knnzscore_i.test(tests)
    accuracy.rmse(preds, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8605
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8606
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8608


In [29]:
knnzscore_i.predict('1', '2', 3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.621014084779232, details={'actual_k': 40, 'was_impossible': False})

In [41]:
(0.8608 + 0.8606 + 0.8605) / 3

0.8606333333333334

## 2.4 KNNBaseline

### 2.4.1 User-based

In [30]:
kf = KFold(n_splits=3)

sim_options = {
    'user_based': True,
    'name': 'MSD',
    'min_support': 10,
    # 'verbose' is not a sim_options key; it is a constructor argument (default True)
}

knnbl_u = KNNBaseline(k=40, sim_options=sim_options)
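
KNNBaseline replaces the per-user mean with a baseline estimate (global mean plus a user bias plus an item bias) and averages the neighbors' deviations from their own baselines. The sketch below takes the biases as given, whereas the library estimates them during `fit`; all names and numbers here are hypothetical.

```python
def baseline(global_mean, user_bias, item_bias):
    """Baseline estimate for one (user, item) pair."""
    return global_mean + user_bias + item_bias

def knn_baseline_predict(target_baseline, neighbors):
    """Target baseline plus similarity-weighted deviations from neighbor baselines.

    neighbors: (similarity, neighbor_rating, neighbor_baseline) triples.
    """
    used = [(s, r, b) for s, r, b in neighbors if s > 0]
    if not used:
        return target_baseline
    total_weight = sum(s for s, _, _ in used)
    offset = sum(s * (r - b) for s, r, b in used) / total_weight
    return target_baseline + offset

# Made-up biases: a slightly generous user rating a below-average item
b_ui = baseline(global_mean=3.5, user_bias=0.25, item_bias=-0.5)
neighbors = [(0.5, 4.0, 3.5), (0.5, 3.0, 3.0)]
estimate = knn_baseline_predict(b_ui, neighbors)
print(b_ui, estimate)
```

When no neighbor qualifies, the sketch falls back to the baseline itself, which already accounts for user and item tendencies.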

In [31]:
for train,tests in kf.split(dataset):
    knnbl_u.fit(train)
    preds = knnbl_u.test(tests)
    accuracy.rmse(preds, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8566
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8530
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8534


In [32]:
knnbl_u.predict('1', '2', 3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.2954546001656913, details={'actual_k': 40, 'was_impossible': False})

In [42]:
round((0.8566 + 0.8530 + 0.8534) / 3, 4)

0.8543

### 2.4.2 Item-based

In [33]:
kf = KFold(n_splits=3)

sim_options = {
    'user_based': False,
    'name': 'MSD',
    'min_support': 10,
    # 'verbose' is not a sim_options key; it is a constructor argument (default True)
}

knnbl_i = KNNBaseline(k=40, sim_options=sim_options)

In [34]:
%%time
for train,tests in kf.split(dataset):
    knnbl_i.fit(train)
    preds = knnbl_i.test(tests)
    accuracy.rmse(preds, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8491
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8501
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8492
CPU times: user 6min 57s, sys: 15.1 s, total: 7min 12s
Wall time: 7min 24s


In [35]:
knnbl_i.predict('1', '2', 3.5)

Prediction(uid='1', iid='2', r_ui=3.5, est=3.5491711471944325, details={'actual_k': 40, 'was_impossible': False})

In [43]:
(0.8492 + 0.8501 + 0.8491) / 3

0.8494666666666667

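# 3 Summary

1. For a given mode, the algorithms take about the same time: in the timed runs, roughly 10 minutes of wall time user-based and 6.5 to 7.5 minutes item-based.
2. As the table below shows, KNNBaseline has the lowest RMSE in both modes, and its item-based variant gives the predicted rating closest to the true value for the sample pair.

Mean 3-fold RMSE of each algorithm and its predicted rating for the same user and item (user 1, item 2; true rating 3.5)

| Algorithm | User-based RMSE | Item-based RMSE | User-based predicted rating | Item-based predicted rating |
| :----------------: | :-----------: | :-----------: | :--------------: | :--------------: |
| KNNBasic | 0.8961 | 0.9049 | 3.39 | 3.67 |
| KNNWithMeans | 0.8756 | 0.8583 | 3.40 | 3.60 |
| KNNWithZScore | 0.8745 | 0.8606 | 3.60 | 3.62 |
| KNNBaseline | 0.8543 | 0.8495 | 3.30 | 3.55 |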
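
The mean RMSE values can be recomputed directly from the per-fold numbers printed in section 2, which avoids copying mistakes when averaging by hand. The dictionary below transcribes those printed values; the variable names are chosen for this sketch.

```python
from statistics import mean

# Per-fold RMSE values transcribed from the cell outputs in section 2
fold_rmse = {
    ("KNNBasic", "user"): [0.8961, 0.8974, 0.8947],
    ("KNNBasic", "item"): [0.9067, 0.9054, 0.9026],
    ("KNNWithMeans", "user"): [0.8751, 0.8747, 0.8771],
    ("KNNWithMeans", "item"): [0.8588, 0.8589, 0.8572],
    ("KNNWithZScore", "user"): [0.8758, 0.8744, 0.8733],
    ("KNNWithZScore", "item"): [0.8605, 0.8606, 0.8608],
    ("KNNBaseline", "user"): [0.8566, 0.8530, 0.8534],
    ("KNNBaseline", "item"): [0.8491, 0.8501, 0.8492],
}

# Average over the three folds, rounded to four decimals
mean_rmse = {key: round(mean(values), 4) for key, values in fold_rmse.items()}
best = min(mean_rmse, key=mean_rmse.get)

for (algo, mode), value in mean_rmse.items():
    print(f"{algo:14s} {mode:5s} {value:.4f}")
print("lowest mean RMSE:", best)
```

Because each cell built its own `KFold` without a fixed `random_state`, the algorithms were scored on different random splits, so differences in the fourth decimal place should be read with some caution.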