# Easy to use, with support for many recommendation algorithms
* Baseline algorithms
* Neighborhood methods (collaborative filtering)
* Matrix factorization-based methods (SVD, PMF, SVD++, NMF)

| Algorithm class | Description |
| --- | --- |
| **random_pred.NormalPredictor** | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
| **baseline_only.BaselineOnly** | Algorithm predicting the baseline estimate for given user and item. |
| **knns.KNNBasic** | A basic collaborative filtering algorithm. |
| **knns.KNNWithMeans** | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| **knns.KNNBaseline** | A basic collaborative filtering algorithm taking into account a baseline rating. |
| **matrix_factorization.SVD** | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| **matrix_factorization.SVDpp** | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| **matrix_factorization.NMF** | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
| **slope_one.SlopeOne** | A simple yet accurate collaborative filtering algorithm. |
| **co_clustering.CoClustering** | A collaborative filtering algorithm based on co-clustering. |

# Neighborhood methods (collaborative filtering) support several similarity measures

| Similarity measure | Description |
| --- | --- |
| **cosine** | Compute the cosine similarity between all pairs of users (or items). |
| **msd** | Compute the Mean Squared Difference similarity between all pairs of users (or items). |
| **pearson** | Compute the Pearson correlation coefficient between all pairs of users (or items). |
| **pearson_baseline** | Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means. |
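To make two of these measures concrete, here is a minimal pure-Python sketch of the cosine and MSD similarities between two users, computed over the items both have rated. It is not Surprise's implementation; the helper names and the toy ratings are hypothetical, and the MSD similarity is taken as `1 / (msd + 1)`.

```python
import math


def common_ratings(ratings_u, ratings_v):
    """Return paired ratings for the items that both users have rated."""
    common = sorted(set(ratings_u) & set(ratings_v))
    return [(ratings_u[i], ratings_v[i]) for i in common]


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity restricted to the commonly rated items."""
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)


def msd_sim(ratings_u, ratings_v):
    """Mean Squared Difference turned into a similarity: 1 / (msd + 1)."""
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)


# Hypothetical toy ratings: item id -> rating
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 3, "m4": 2}

cos_ab = cosine_sim(alice, bob)
msd_ab = msd_sim(alice, bob)
print(cos_ab, msd_ab)
```

Only `m1` and `m2` are shared here, so both measures ignore `m3` and `m4`.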

# Supported evaluation metrics

| Metric | Description |
| --- | --- |
| **rmse** | Compute RMSE (Root Mean Squared Error). |
| **mae** | Compute MAE (Mean Absolute Error). |
| **fcp** | Compute FCP (Fraction of Concordant Pairs). |
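As a quick illustration of the first two metrics, the sketch below computes RMSE and MAE by hand from a list of (true rating, estimated rating) pairs. The function names and the sample pairs are hypothetical, not output from Surprise.

```python
import math


def rmse(pairs):
    """Root Mean Squared Error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean Absolute Error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]

rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(rmse_value, mae_value)
```

RMSE penalizes large errors more heavily than MAE, which is why it is never smaller than MAE on the same predictions.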

# SVD example

In [5]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
# Download location: /Users/zhaoyadong/.surprise_data/ml-100k
data = Dataset.load_builtin('ml-100k')

# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] 

 y


Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/zhaoyadong/.surprise_data/ml-100k
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9320  0.9384  0.9327  0.9429  0.9340  0.9360  0.0041  
MAE (testset)     0.7330  0.7406  0.7365  0.7420  0.7378  0.7380  0.0031  
Fit time          4.27    4.47    4.49    4.48    4.47    4.44    0.08    
Test time         0.17    0.13    0.12    0.15    0.12    0.14    0.02    


{'test_rmse': array([0.93196378, 0.93842642, 0.9327256 , 0.94291277, 0.93403339]),
 'test_mae': array([0.73302103, 0.74057704, 0.73654968, 0.7419691 , 0.7377506 ]),
 'fit_time': (4.268312931060791,
  4.472671031951904,
  4.493643045425415,
  4.479049921035767,
  4.474570989608765),
 'test_time': (0.17134571075439453,
  0.12970495223999023,
  0.1226348876953125,
  0.15154194831848145,
  0.12108802795410156)}
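The `Mean` and `Std` columns of the summary table can be recovered from the per-fold values in the returned dictionary. The sketch below copies the `test_rmse` values from the output above into a plain list and assumes, based on the numbers matching, that `Std` is the population standard deviation.

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross_validate output above
test_rmse = [0.93196378, 0.93842642, 0.9327256, 0.94291277, 0.93403339]

rmse_mean = mean(test_rmse)
rmse_std = pstdev(test_rmse)  # population standard deviation (assumption)

print(f"RMSE mean={rmse_mean:.4f} std={rmse_std:.4f}")
```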

# Loading your own dataset

In [18]:
import os
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate,train_test_split

# Path to the data file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
file_path

'/Users/zhaoyadong/.surprise_data/ml-100k/ml-100k/u.data'

In [19]:
# Tell the reader how each line of the file is formatted
reader = Reader(line_format='user item rating timestamp', sep='\t')
reader

<surprise.reader.Reader at 0x11c3eb5d0>

In [20]:
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)
data

<surprise.dataset.DatasetAutoFolds at 0x11331e910>
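To see what the `line_format` and `sep` arguments describe, the following sketch parses a single tab-separated line by hand. It only illustrates the file layout and is not how Surprise's `Reader` is implemented; `parse_line` and the sample line are hypothetical.

```python
def parse_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one line into (user, item, rating, timestamp) following line_format."""
    fields = line_format.split()
    values = line.rstrip("\n").split(sep)
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    record = dict(zip(fields, values))
    # Ids stay strings (raw ids); only the rating is converted to a number.
    return (record["user"], record["item"],
            float(record["rating"]), record.get("timestamp"))


# Hypothetical sample line in the 'user item rating timestamp' layout
sample = "196\t242\t3\t881250949\n"
parsed = parse_line(sample)
print(parsed)
```

Keeping the ids as strings is why later cells look users and items up with values such as `'196'` and `'242'`.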

# Comparing different recommendation algorithms

In [23]:
### Using NormalPredictor
from surprise import NormalPredictor
from surprise.model_selection import cross_validate
algo = NormalPredictor()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

{'test_rmse': array([1.52112873, 1.52385424, 1.52311202]),
 'test_mae': array([1.22258649, 1.22541031, 1.22553219]),
 'fit_time': (0.10647702217102051, 0.12001776695251465, 0.13048481941223145),
 'test_time': (0.275972843170166, 0.27295994758605957, 0.20937275886535645)}

In [25]:
### Using BaselineOnly
from surprise import BaselineOnly
from surprise.model_selection import cross_validate
algo = BaselineOnly()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'test_rmse': array([0.94493839, 0.94228404, 0.95453153]),
 'test_mae': array([0.74860023, 0.7482294 , 0.75718467]),
 'fit_time': (0.2533440589904785, 0.2319469451904297, 0.2509419918060303),
 'test_time': (0.21754813194274902, 0.23023319244384766, 0.23073101043701172)}
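The baseline estimate has the form `b_ui = mu + b_u + b_i`: the global mean plus a user bias and an item bias. The sketch below estimates the biases in a single regularized pass over toy data. It is a simplification, not Surprise's ALS procedure (which alternates for several epochs); the function names, the toy ratings, and the regularization values `reg_i=10` and `reg_u=15` are assumptions made for illustration.

```python
from collections import defaultdict


def fit_baselines(ratings, reg_i=10, reg_u=15):
    """One regularized pass: item biases first, then user biases."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: shrunk average of (rating - global mean)
    item_residuals = defaultdict(list)
    for _, item, r in ratings:
        item_residuals[item].append(r - mu)
    b_i = {i: sum(v) / (reg_i + len(v)) for i, v in item_residuals.items()}

    # User bias: shrunk average of what remains after removing the item bias
    user_residuals = defaultdict(list)
    for user, item, r in ratings:
        user_residuals[user].append(r - mu - b_i[item])
    b_u = {u: sum(v) / (reg_u + len(v)) for u, v in user_residuals.items()}
    return mu, b_u, b_i


def baseline_estimate(mu, b_u, b_i, user, item):
    """b_ui = mu + b_u + b_i, with a zero bias for unknown users or items."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)


# Hypothetical toy ratings: (user, item, rating)
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u2", "i2", 2)]
mu, b_u, b_i = fit_baselines(ratings)
print(baseline_estimate(mu, b_u, b_i, "u1", "i1"))
```

The regularization terms shrink the biases of users and items with few ratings toward zero, so sparse data pulls the estimate back to the global mean.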

In [27]:
### Using basic collaborative filtering (KNNBasic)
from surprise import KNNBasic
from surprise.model_selection import cross_validate
algo = KNNBasic()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.98133296, 0.99188314, 0.98792539]),
 'test_mae': array([0.77566632, 0.78575441, 0.77919485]),
 'fit_time': (0.2804989814758301, 0.2587110996246338, 0.25031518936157227),
 'test_time': (4.2373881340026855, 4.165133714675903, 4.306735992431641)}

In [28]:
### Using collaborative filtering with means (KNNWithMeans)
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate
algo = KNNWithMeans()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.95731906, 0.95732487, 0.95480123]),
 'test_mae': array([0.75462424, 0.75545846, 0.75230773]),
 'fit_time': (0.3175926208496094, 0.2771129608154297, 0.28166699409484863),
 'test_time': (4.626938104629517, 4.3745622634887695, 4.506859064102173)}

In [29]:
### Using collaborative filtering with baselines (KNNBaseline)
from surprise import KNNBaseline
from surprise.model_selection import cross_validate
algo = KNNBaseline()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.9438566 , 0.93318902, 0.9321911 ]),
 'test_mae': array([0.74604609, 0.73321011, 0.73435405]),
 'fit_time': (0.5125937461853027, 0.40860581398010254, 0.4391019344329834),
 'test_time': (5.216365098953247, 5.137558221817017, 5.172612190246582)}

In [30]:
### Using SVD
from surprise import SVD
from surprise.model_selection import cross_validate
algo = SVD()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

{'test_rmse': array([0.94885668, 0.94459868, 0.94461429]),
 'test_mae': array([0.74998314, 0.74544288, 0.74404708]),
 'fit_time': (3.548034906387329, 3.535391092300415, 3.7439491748809814),
 'test_time': (0.20096206665039062, 0.27524304389953613, 0.3032660484313965)}

In [31]:
### Using SVD++ (slow)
from surprise import SVDpp
from surprise.model_selection import cross_validate
algo = SVDpp()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

{'test_rmse': array([0.92375337, 0.93268092, 0.93254211]),
 'test_mae': array([0.72516722, 0.7333364 , 0.7326444 ]),
 'fit_time': (108.41236877441406, 107.79821062088013, 108.69368195533752),
 'test_time': (5.014692068099976, 4.52144193649292, 5.16087007522583)}

In [32]:
### Using NMF
from surprise import NMF
from surprise.model_selection import cross_validate
algo = NMF()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

{'test_rmse': array([0.97079889, 0.97108764, 0.97510558]),
 'test_mae': array([0.76379115, 0.7641557 , 0.76433039]),
 'fit_time': (3.9752910137176514, 4.075232982635498, 3.928323984146118),
 'test_time': (0.24855399131774902, 0.2504160404205322, 0.23262906074523926)}
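To compare the runs above at a glance, the sketch below averages the 3-fold `test_rmse` values reported in this section and sorts the algorithms from lowest to highest error. The numbers are copied from the outputs above; the dictionary and variable names are only for illustration.

```python
from statistics import mean

# 3-fold test RMSE values copied from the outputs in this section
reported_rmse = {
    "NormalPredictor": [1.52112873, 1.52385424, 1.52311202],
    "BaselineOnly": [0.94493839, 0.94228404, 0.95453153],
    "KNNBasic": [0.98133296, 0.99188314, 0.98792539],
    "KNNWithMeans": [0.95731906, 0.95732487, 0.95480123],
    "KNNBaseline": [0.9438566, 0.93318902, 0.9321911],
    "SVD": [0.94885668, 0.94459868, 0.94461429],
    "SVDpp": [0.92375337, 0.93268092, 0.93254211],
    "NMF": [0.97079889, 0.97108764, 0.97510558],
}

# Sort by mean RMSE, best (lowest) first
ranking = sorted((mean(scores), name) for name, scores in reported_rmse.items())

for score, name in ranking:
    print(f"{name:16s} {score:.4f}")
```

On these runs SVD++ has the lowest mean RMSE, at the cost of a fit time of roughly 108 seconds per fold, while NormalPredictor is clearly the weakest.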

# Building and saving models

## 1. Build a collaborative-filtering model and make predictions

* MovieLens example

In [91]:
from __future__ import (absolute_import, division, print_function, unicode_literals)
import os
import io
from surprise import KNNBaseline
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Path to the data file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Tell the reader how each line of the file is formatted
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)

# Similarity settings: pearson_baseline similarity, computed between items (item-based CF)
sim_options = {'name': 'pearson_baseline', 'user_based': False}
# Build the recommender with KNNBaseline (a collaborative filtering algorithm)
algo = KNNBaseline(sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE','MAE'], cv=3, verbose=False)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.9234231 , 0.92908341, 0.93160281]),
 'test_mae': array([0.72610348, 0.72945552, 0.72817546]),
 'fit_time': (1.2767589092254639, 1.2908000946044922, 1.1555171012878418),
 'test_time': (5.418834209442139, 5.452134847640991, 5.47606897354126)}

In [41]:
algo.bi

array([ 0.38756778, -0.49873698, -0.07257962, ..., -0.07891285,
        0.0410197 , -0.07891285])

In [42]:
algo.bsl_options

{}

In [44]:
algo.bu.shape

(943,)

In [45]:
algo.bx.shape

(1626,)

In [46]:
algo.by.shape

(943,)

In [58]:
algo.compute_baselines()[0].shape

(943,)

In [61]:
algo.compute_similarities().shape

Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


(1626, 1626)

In [63]:
algo.default_prediction()

3.532677336613317

In [66]:
algo.estimate?

Signature: algo.estimate(u, i)
Docstring: <no docstring>
File:      ~/opt/anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/knns.py
Type:      method


In [71]:
algo.fit?

Signature: algo.fit(trainset)
Docstring:
Train an algorithm on a given training set.

This method is called by every derived class as the first basic step
for training an algorithm. It basically just initializes some internal
structures and set the self.trainset attribute.

Args:
    trainset(:obj:`Trainset <surprise.Trainset>`) : A training
        set, as returned by the :meth:`folds
        <surprise.dataset.Dataset.folds>` method.

Returns:
    self
File:      ~/opt/anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/knns.py
Type:      method


In [74]:
algo.get_neighbors?

Signature: algo.get_neighbors(iid, k)
Docstring:
Return the ``k`` nearest neighbors of ``iid``, which is the inner id
of a user or an item, depending on the ``user_based`` field of
``sim_options`` (see :ref:`similarity_measures_configuration`).

As the similarities are computed on the basis of a similarity measure,
this method is only relevant for algorithms using a similarity measure,
such as the :ref:`k-NN algorithms <pred_package_knn_inpired>`.

For a usage example, see the :ref:`FAQ <get_k_nearest_neighbors>`.

Args:
    iid(int): The (inner) id of the user (or item) for which we want
        the nearest neighbors. See :ref:`this note<raw_inner_note>`.

    k(int): The number of neighbors to retrieve.

Returns:
    The list of the ``k`` (inner) ids of the closest users (or items)
    to ``iid``.
File:      ~/opt/anaconda3/lib/python3.7/site-packages

In [75]:
algo.k

40

In [76]:
algo.min_k

1

In [77]:
algo.n_x

1626

In [78]:
algo.n_y

943

In [81]:
algo.predict?

Signature: algo.predict(uid, iid, r_ui=None, clip=True, verbose=False)
Docstring:
Compute the rating prediction for given user and item.

The ``predict`` method converts raw ids to inner ids and then calls the
``estimate`` method which is defined in every derived class. If the
prediction is impossible (e.g. because the user and/or the item is
unkown), the prediction is set according to :meth:`default_prediction()
<surprise.prediction_algorithms.algo_base.AlgoBase.default_prediction>`.

Args:
    uid: (Raw) id of the user. See :ref:`this note<raw_inner_note>`.
    iid: (Raw) id of the item. See :ref:`this note<raw_inner_note>`.
    r_ui(float): The true rating :math:`r_{ui}`. Optional, default is
        ``None``.
    clip(bool): Whether to cli

In [83]:
algo.sim.shape

(1626, 1626)

In [84]:
algo.sim_options

{'name': 'pearson_baseline', 'user_based': False}

In [86]:
algo.verbose

True

In [92]:
# After fitting a collaborative-filtering model, algo.get_neighbors() retrieves the items most similar to a given item.
# Read the item (movie) title information
def read_item_names():
    """
    Build the mappings from raw movie id to movie title and from movie title to raw movie id.
    """
    file_name = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.item')
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid

# Build the raw id -> title and title -> raw id mappings
rid_to_name, name_to_rid = read_item_names()

# Look up the raw id of Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
print(toy_story_raw_id)

1


In [93]:
# Convert the raw id of Toy Story into the inner id used by the model
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
print(toy_story_inner_id)

182


In [94]:
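# Look up the title stored under raw id '182'.
# Note: 182 above is the inner id of Toy Story, so this raw-id lookup returns a different movie.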
name = rid_to_name['182']
print(name)

GoodFellas (1990)
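The difference between raw ids and inner ids is easy to mix up, as the lookup above shows. The sketch below models the idea with a small mapper that assigns consecutive inner ids in order of first appearance. It is a conceptual illustration only; `IdMapper` and the sample ids are hypothetical and this is not Surprise's `Trainset` code.

```python
class IdMapper:
    """Two-way mapping between raw ids (strings) and inner ids (0, 1, 2, ...)."""

    def __init__(self):
        self._raw_to_inner = {}
        self._inner_to_raw = []

    def add(self, raw_id):
        """Register a raw id, giving it the next free inner id if it is new."""
        if raw_id not in self._raw_to_inner:
            self._raw_to_inner[raw_id] = len(self._inner_to_raw)
            self._inner_to_raw.append(raw_id)
        return self._raw_to_inner[raw_id]

    def to_inner(self, raw_id):
        return self._raw_to_inner[raw_id]

    def to_raw(self, inner_id):
        return self._inner_to_raw[inner_id]


# Hypothetical order in which item ids are first seen in the ratings file
items = IdMapper()
for raw in ["242", "302", "1", "242"]:
    items.add(raw)

print(items.to_inner("1"), items.to_raw(2))
```

Because inner ids depend on the order in which ids appear in the training data, an inner id should always be converted back to a raw id before it is used to look up a title.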


In [101]:
# Get the inner ids of the 20 nearest neighbors of Toy Story
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=20)
print(toy_story_neighbors)

[24, 298, 748, 373, 134, 436, 241, 164, 395, 397, 1026, 221, 689, 262, 151, 987, 259, 487, 269, 547]


In [102]:
# Convert the neighbors' inner ids back to raw movie ids
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors)
# Convert the raw movie ids to movie titles
toy_story_neighbors = (rid_to_name[rid] for rid in toy_story_neighbors)
# Print the recommendations
print("The 20 movies most similar to Toy Story are:")
for movie in toy_story_neighbors:
    print(movie)

The 20 movies most similar to Toy Story are:
Liar Liar (1997)
Lion King, The (1994)
That Thing You Do! (1996)
Seven (Se7en) (1995)
Jurassic Park (1993)
Wizard of Oz, The (1939)
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
Star Trek: The Wrath of Khan (1982)
Craft, The (1996)
Army of Darkness (1993)
Abyss, The (1989)
Evil Dead II (1987)
Empire Strikes Back, The (1980)
Indiana Jones and the Last Crusade (1989)
So I Married an Axe Murderer (1993)
E.T. the Extra-Terrestrial (1982)
Long Kiss Goodnight, The (1996)
Princess Bride, The (1987)
Mask, The (1994)


## 2. Find the N users most similar to user A
## 3. Find the N items most similar to item A

In [103]:
import os
from surprise import SVD
from surprise import SVDpp
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import KNNBasic
from surprise import BaselineOnly
from surprise import Reader
from surprise.model_selection import KFold
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV

# 1. item_user_rate_time.txt data format: user item rating timestamp (user id, item id, rating, timestamp)
# 2. Read the data and train the model
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
reader = Reader(line_format='user item rating timestamp', sep='\t')
surprise_data = Dataset.load_from_file(file_path, reader=reader)

all_trainset = surprise_data.build_full_trainset()
algo = KNNBasic(k=40,
                min_k=3,
                sim_options={
                    'user_based': True})
# Alternative: sim_options={'name': 'cosine', 'user_based': True}; 'name' can be cosine, msd, pearson or pearson_baseline
algo.fit(all_trainset)

# 3. Find similar users
def getSimilarUsers(top_k,u_id):
    user_inner_id = algo.trainset.to_inner_uid(u_id)
    user_neighbors = algo.get_neighbors(user_inner_id, k=top_k)
    user_neighbors = (algo.trainset.to_raw_uid(inner_id) for inner_id in user_neighbors)
    return user_neighbors
print(list(getSimilarUsers(5,'196')))
# ['241', '162', '80', '36', '61']

# 4. Find similar items: set user_based to False in sim_options so that similarities are computed between items
item_algo = KNNBasic(k=40,
                     min_k=3,
                     sim_options={
                         'user_based': False})
# Alternative: sim_options={'name': 'cosine', 'user_based': False}; 'name' can be cosine, msd, pearson or pearson_baseline
item_algo.fit(all_trainset)

def getSimilarItems(top_k, item_id):
    item_inner_id = item_algo.trainset.to_inner_iid(item_id)
    item_neighbors = item_algo.get_neighbors(item_inner_id, k=top_k)
    f_item_neighbors = (item_algo.trainset.to_raw_iid(inner_id)
                        for inner_id in item_neighbors)
    return f_item_neighbors
print(list(getSimilarItems(10, '242')))
# ['1081', '1444', '842', '1110', '812', '626', '1150', '1334', '1327', '1346']

Computing the msd similarity matrix...
Done computing similarity matrix.
['241', '162', '80', '36', '61']
Computing the msd similarity matrix...
Done computing similarity matrix.
['1081', '1444', '842', '1110', '812', '626', '1150', '1334', '1327', '1346']
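Conceptually, retrieving neighbors amounts to picking the `k` largest entries in one row of the similarity matrix, excluding the entry for the user or item itself. The sketch below shows that selection on a small hand-written matrix. It is an assumption about the general technique, not a copy of `get_neighbors`; `nearest_neighbors` and the matrix values are hypothetical.

```python
import heapq


def nearest_neighbors(sim, inner_id, k):
    """Return the inner ids of the k entries most similar to inner_id."""
    # Pair every other inner id with its similarity to inner_id
    others = [(x, sim[inner_id][x]) for x in range(len(sim)) if x != inner_id]
    # Keep the k pairs with the highest similarity
    top = heapq.nlargest(k, others, key=lambda pair: pair[1])
    return [x for x, _ in top]


# Hypothetical symmetric similarity matrix for four users (or items)
sim = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]

neighbors_of_0 = nearest_neighbors(sim, 0, k=2)
print(neighbors_of_0)
```

The returned values are inner ids, which is why `getSimilarUsers` and `getSimilarItems` above convert them with `to_raw_uid` and `to_raw_iid` before returning.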
