# Core Competency Bootcamp, Business Intelligence Track, Cohort 004, Week 5

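### Thinking 1: In practice, which is used more, FM or MF, and why?

FM is used more in practice. MF can only use the user-item rating information and cannot take in any other features, so it discards much of the information available in real applications and its rating predictions are less accurate. FM can take in many additional features, so both prediction accuracy and generalization are usually much better in practice.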
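The contrast can be made concrete with a minimal NumPy sketch of the second-order FM score. The feature layout (one-hot user, one-hot item, one-hot gender), the sizes, and the random parameters are assumptions chosen for illustration, not values from this notebook. With only the user and item one-hots active, the FM score reduces to an MF-style score with biases; adding a side feature is just one more non-zero entry in `x`.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score for one feature vector x of shape [n]."""
    linear = w0 + w @ x
    # Pairwise term via the O(k*n) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((V^T x)_f^2 - ((V^2)^T x^2)_f)
    xv = V.T @ x
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n_users, n_items, n_genders, k = 3, 4, 2, 2   # hypothetical sizes
n = n_users + n_items + n_genders
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

# MF-style input: only user 1 and item 2 are active.
x_mf = np.zeros(n)
x_mf[1] = 1.0
x_mf[n_users + 2] = 1.0

# FM input: the same pair plus a side feature (gender 0) that MF cannot use.
x_fm = x_mf.copy()
x_fm[n_users + n_items] = 1.0

score_mf = fm_predict(x_mf, w0, w, V)
score_fm = fm_predict(x_fm, w0, w, V)

# With only user and item active, FM equals bias terms plus <v_user, v_item>.
mf_like = w0 + w[1] + w[n_users + 2] + V[1] @ V[n_users + 2]
```
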

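### Thinking 2: How does FFM differ from FM?

In FM, each feature has a single latent vector and uses that same vector when interacting with every other feature, regardless of which field the other feature belongs to. FFM introduces the notion of a field: each feature learns a separate latent vector for every field, and the interaction between two features uses the vector each one holds for the other's field. This keeps interactions with different fields from interfering with one another and gives a finer-grained latent representation, at the cost of more parameters.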
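A small sketch of the two pairwise terms may help. It assumes binary features (value 1 when active) and a hypothetical feature-to-field map `field_of`; the sizes and random parameters are illustrative only. The last lines show that tying each feature's per-field vectors to a single vector turns the FFM term back into the FM term.

```python
import numpy as np

def fm_pairwise(active, V):
    """FM: feature i always interacts through the same vector V[i]."""
    return sum(V[i] @ V[j] for a, i in enumerate(active) for j in active[a + 1:])

def ffm_pairwise(active, field_of, W):
    """FFM: W[i, f] is the vector feature i uses against features of field f."""
    return sum(W[i, field_of[j]] @ W[j, field_of[i]]
               for a, i in enumerate(active) for j in active[a + 1:])

rng = np.random.default_rng(0)
n_features, n_fields, k = 6, 3, 2
field_of = np.array([0, 0, 1, 1, 2, 2])      # hypothetical feature-to-field map
V = rng.normal(size=(n_features, k))          # FM: one vector per feature
W = rng.normal(size=(n_features, n_fields, k))  # FFM: one vector per (feature, field)
active = [0, 2, 5]                            # one active binary feature per field

fm_term = fm_pairwise(active, V)
ffm_term = ffm_pairwise(active, field_of, W)

# If every feature used the same vector for all fields, FFM would reduce to FM.
W_tied = np.repeat(V[:, None, :], n_fields, axis=1)
ffm_tied = ffm_pairwise(active, field_of, W_tied)
```
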

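### Thinking 3: What problems does DeepFM solve compared with FM, and how does it work?

FM usually models only second-order feature interactions; adding higher-order terms greatly increases the number of parameters and makes training inefficient. DeepFM adds a DNN alongside the FM component to learn interactions above second order. Both components share the same feature embeddings, which are learned end to end, so manual feature engineering is largely avoided.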
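The structure can be sketched as a single forward pass in NumPy. This is a simplified illustration under stated assumptions, not DeepCTR's implementation: one shared embedding table `E`, one active feature per field, a single hidden ReLU layer, and a raw (regression) output. All sizes and parameters are made up for the example.

```python
import numpy as np

def deepfm_forward(idx, b0, w, E, W1, b1, w2):
    """idx holds the global id of the one active feature in each field."""
    e = E[idx]                                   # [m, k] embeddings shared by both parts
    linear = b0 + w[idx].sum()                   # first-order term
    # FM part: second-order interactions between the m active embeddings
    fm2 = 0.5 * np.sum(e.sum(axis=0) ** 2 - (e ** 2).sum(axis=0))
    # Deep part: an MLP over the concatenated embeddings (one hidden layer here)
    h = np.maximum(0.0, e.ravel() @ W1 + b1)
    deep = h @ w2
    return linear + fm2 + deep

rng = np.random.default_rng(0)
n_features, m, k, hidden = 10, 3, 4, 8           # hypothetical sizes
b0 = 0.0
w = rng.normal(size=n_features)
E = rng.normal(size=(n_features, k))
W1 = rng.normal(size=(m * k, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden)

idx = np.array([0, 4, 7])                        # e.g. a user, a movie, a gender value
y_hat = deepfm_forward(idx, b0, w, E, W1, b1, w2)

# Zeroing the first DNN layer removes the deep part and leaves a plain FM score.
fm_only = deepfm_forward(idx, b0, w, E, np.zeros_like(W1), b1, w2)
```
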

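### Thinking 4: How does the baseline algorithm in Surprise work? What is the difference between BaselineOnly and KNNBaseline?

BaselineOnly:  
$$\hat{r}_{ui}=b_{ui}=\mu+b_u+b_i$$
The rating is predicted from the global mean rating $\mu$, the user bias $b_u$, and the item bias $b_i$. In Surprise, if the user or the item is unknown, the corresponding bias is set to 0.  
KNNBaseline:  
$$\hat{r}_{ui}=b_{ui}+\frac{\sum_{v\in N_i^k(u)}sim(u,v)\cdot (r_{vi}-b_{vi})}{\sum_{v\in N_i^k(u)}sim(u,v)}$$ 
or
$$\hat{r}_{ui}=b_{ui}+\frac{\sum_{j\in N_u^k(i)}sim(i,j)\cdot (r_{uj}-b_{uj})}{\sum_{j\in N_u^k(i)}sim(i,j)}$$  
Pick a similarity measure, compute the similarities, rank by similarity, and keep the top K neighbors (users or items). Each neighbor's residual (its rating minus its baseline estimate) is weighted by its similarity to the target; the weighted residuals are summed, normalized by the sum of the K similarities, and added to the target's baseline $b_{ui}$. The difference is therefore that BaselineOnly predicts with the baseline estimate alone, while KNNBaseline corrects that baseline with a neighborhood term.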
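The two formulas can be traced on a toy dataset. The sketch below is an illustration under assumptions, not Surprise's code: the biases are fitted with a simple alternating least squares loop, the regularization values and epoch count are illustrative, the similarity matrix is a placeholder of ones, and only neighbors with positive similarity are used.

```python
import numpy as np

# (user, item, rating) triples; a toy dataset for illustration only
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 4.0), (2, 2, 1.0)]
n_users, n_items = 3, 3
mu = np.mean([r for _, _, r in ratings])

def fit_baselines(reg_u=15.0, reg_i=10.0, n_epochs=10):
    """Alternating least squares estimates of the user and item biases."""
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    for _ in range(n_epochs):
        num, cnt = np.zeros(n_items), np.zeros(n_items)
        for u, i, r in ratings:
            num[i] += r - mu - bu[u]
            cnt[i] += 1
        bi = num / (reg_i + cnt)
        num, cnt = np.zeros(n_users), np.zeros(n_users)
        for u, i, r in ratings:
            num[u] += r - mu - bi[i]
            cnt[u] += 1
        bu = num / (reg_u + cnt)
    return bu, bi

bu, bi = fit_baselines()

def baseline(u, i):
    """BaselineOnly prediction: mu + b_u + b_i."""
    return mu + bu[u] + bi[i]

def knn_baseline(u, i, sim, k=2):
    """Item-based KNNBaseline: baseline plus similarity-weighted residuals."""
    rated = [(j, r) for v, j, r in ratings if v == u and j != i and sim[i, j] > 0]
    top = sorted(rated, key=lambda jr: sim[i, jr[0]], reverse=True)[:k]
    if not top:
        return baseline(u, i)
    num = sum(sim[i, j] * (r - baseline(u, j)) for j, r in top)
    den = sum(sim[i, j] for j, _ in top)
    return baseline(u, i) + num / den

sim = np.ones((n_items, n_items))   # placeholder item-item similarities
pred = knn_baseline(0, 2, sim)
```

With no usable neighbors the KNNBaseline prediction falls back to the BaselineOnly value, which is the sense in which the former extends the latter.
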

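### Thinking 5: Which algorithms belong to neighborhood-based collaborative filtering? Briefly describe how they work.

1. User-based collaborative filtering (UserCF):  
    1. Find the set of users whose interests are similar to the target user's. Collaborative filtering infers interest similarity from behavioral similarity, which can be computed with Jaccard or cosine similarity.  
    2. Recommend to the target user the items that users in this set like and that the target user has not yet come across.
2. Item-based collaborative filtering (ItemCF):  
    1. Compute the similarity between items.  
    2. Generate a recommendation list for the user from the item similarities and the user's historical behavior.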
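The ItemCF steps above can be sketched on a tiny implicit-feedback matrix. The matrix, the choice of cosine similarity, and the helper names are assumptions made for this example; UserCF follows the same pattern with similarities computed between rows (users) instead of columns (items).

```python
import numpy as np

# Rows are users, columns are items; 1 means the user interacted with the item.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def item_cosine_sim(M):
    """Cosine similarity between the columns (items) of M."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    norms[norms == 0] = 1.0              # avoid division by zero for unseen items
    S = (M.T @ M) / (norms.T @ norms)
    np.fill_diagonal(S, 0.0)             # an item is not its own neighbor
    return S

def item_cf_recommend(M, user, top_n=2):
    """Score unseen items by their similarity to the user's history."""
    S = item_cosine_sim(M)
    scores = M[user] @ S
    scores[M[user] > 0] = -np.inf        # never recommend already-seen items
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]

recs = item_cf_recommend(R, user=0)
```
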

### Action 1: Use libfm to predict ratings on MovieLens with the SGD optimizer

#### Splitting the MovieLens dataset

In [1]:
import pandas as pd
import random

In [2]:
data = pd.read_csv("data/MovieLens/ratings.csv")
data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [3]:
# Split the dataset into a training set and a test set
def train_test_split(data, ratio, seed=100):
    random.seed(seed)
    train_index = []
    test_index = []
    for i in range(len(data)):
        # Each row goes to the test set with probability `ratio`
        if random.random() < ratio:
            test_index.append(i)
        else:
            train_index.append(i)
    print("Training samples: %d, test samples: %d" % (len(train_index), len(test_index)))
    return data.iloc[train_index].reset_index(drop=True), data.iloc[test_index].reset_index(drop=True)

In [4]:
train_data, test_data = train_test_split(data, 0.2)

Training samples: 839203, test samples: 209372


In [5]:
train_data.to_csv("data/MovieLens/movielens_train.csv", index=False)

In [6]:
test_data.to_csv("data/MovieLens/movielens_test.csv", index=False)

#### Convert the training and test sets to libfm format with the triple_format_to_libfm.pl script

Run in a terminal:  
perl data/triple_format_to_libfm.pl -in data/MovieLens/movielens_train.csv -target 2 -delete_column 3 -header 1 -separator ","  
perl data/triple_format_to_libfm.pl -in data/MovieLens/movielens_test.csv -target 2 -delete_column 3 -header 1 -separator ","

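This produces two files, movielens_train.csv.libfm and movielens_test.csv.libfm. Note that the libFM manual advises converting split files in a single call (`-in train_file,test_file`): when each file is converted separately, feature ids are assigned independently per file and may not match between the training and test sets.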
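For intuition about the output format, here is a rough Python sketch of this kind of conversion, assuming the libFM sparse text format `target index:value ...` with one-hot user and movie features. The helper `to_libfm_lines` is hypothetical and is not part of libFM; it only illustrates why one id map shared by both splits keeps feature indices consistent.

```python
def to_libfm_lines(rows, feature_ids):
    """Turn (user, movie, rating) rows into libFM-style lines.

    feature_ids is a dict shared across calls so that the same user or
    movie always maps to the same feature index in every output file.
    """
    lines = []
    for user, movie, rating in rows:
        u = feature_ids.setdefault(("user", user), len(feature_ids))
        m = feature_ids.setdefault(("movie", movie), len(feature_ids))
        lines.append(f"{rating} {u}:1 {m}:1")
    return lines

train_rows = [(1, 2, 3.5), (1, 29, 3.5)]
test_rows = [(1, 32, 3.5)]

shared_ids = {}
train_lines = to_libfm_lines(train_rows, shared_ids)
test_lines = to_libfm_lines(test_rows, shared_ids)
```

Because `shared_ids` is reused, user 1 keeps index 0 in both splits, and a movie seen only in the test rows gets a fresh index rather than colliding with a training feature.
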

#### Train and predict with libfm using SGD

Training:

Run in a terminal:  
libfm -task r -train data/MovieLens/movielens_train.csv.libfm -test data/MovieLens/movielens_test.csv.libfm -dim '1,1,8' -out data/MovieLens/movielens_out.txt -method sgd   
<b>This fails with an error:</b>   
<font color="red">Assertion failed: (lr.size() == 1) || (lr.size() == 3), file libfm.cpp, line 345</font>  
According to the manual <a href="libfm-1.42.manual.pdf">libfm-1.42.manual.pdf</a>, the following options must be set for SGD:  
* -learn_rate: the learning rate aka step size of SGD which should have a non-zero or positive value.
* -regular: the regularization parameters which should have zero or positive value. For SGD you can specify the regularization values the following way:
    * One value (-regular value): all model parameters use the same regularization value.
    * Three values (-regular 'value0,value1,value2'): 0-way interactions (w0) use value0 as regularization, 1-way interactions (w) use value1 and pairwise ones (V) use value2.
    * No value: if the parameter -regular is not specified at all, this corresponds to no regularization, i.e. -regular 0.
* -init_stdev: the standard deviation of the normal distribution that is used for initializing the parameters V. You should use a non-zero, positive value here.

Run in a terminal:  
libfm -task r -train data/MovieLens/movielens_train.csv.libfm -test data/MovieLens/movielens_test.csv.libfm -dim '1,1,8' -out data/MovieLens/movielens_out.txt -method sgd -learn_rate 0.01 -regular '0,0,0.01' -init_stdev 0.1  
The model now trains and predicts without errors.
<img src="libfm.PNG">  
Result:  
<b>Final   Train=0.687886  Test=1.73966</b>

### Action 2: Use DeepFM to predict ratings on MovieLens

In [7]:
# Imports
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, get_feature_names

In [8]:
# Load the data
data = pd.read_csv("data/MovieLens/movielens_sample.txt")
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
target = ['rating']

In [9]:
# Label-encode the sparse features
for feature in sparse_features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature])

In [10]:
# Count the number of distinct values of each feature
fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) for feature in sparse_features]
print(fixlen_feature_columns)
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

[SparseFeat(name='movie_id', vocabulary_size=187, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001CEC0725780>, embedding_name='movie_id', group_name='default_group', trainable=True), SparseFeat(name='user_id', vocabulary_size=193, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001CEC07257F0>, embedding_name='user_id', group_name='default_group', trainable=True), SparseFeat(name='gender', vocabulary_size=2, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001CEC0725860>, embedding_name='gender', group_name='default_group', trainable=True), SparseFeat(name='age', vocabulary_size=7, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001CEC

In [11]:
# Split the dataset into a training set and a test set
train, test = train_test_split(data, test_size=0.2)
train_model_input = {name:train[name].values for name in feature_names}
test_model_input = {name:test[name].values for name in feature_names}

In [12]:
# Train DeepFM
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
model.compile("adam", "mse", metrics=['mse'])
history = model.fit(train_model_input, train[target].values, batch_size=256, epochs=1, verbose=True, validation_split=0.2)

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Train on 128 samples, validate on 32 samples
Epoch 1/1


In [13]:
# Predict with DeepFM
pred_ans = model.predict(test_model_input, batch_size=256)
pred_ans

array([[0.00966977],
       [0.00964086],
       [0.00970086],
       [0.01015913],
       [0.00983655],
       [0.00987573],
       [0.0097288 ],
       [0.01038986],
       [0.00980756],
       [0.01030437],
       [0.01148523],
       [0.00975049],
       [0.0095842 ],
       [0.009939  ],
       [0.00972055],
       [0.01072308],
       [0.01009732],
       [0.00974382],
       [0.01010538],
       [0.01143121],
       [0.00971854],
       [0.00988787],
       [0.00960259],
       [0.01022805],
       [0.00966798],
       [0.0107019 ],
       [0.01020534],
       [0.01017312],
       [0.00975659],
       [0.00994749],
       [0.0098782 ],
       [0.01109778],
       [0.01000077],
       [0.00982914],
       [0.00972634],
       [0.01028812],
       [0.00984843],
       [0.00962214],
       [0.01000965],
       [0.01100385]], dtype=float32)

In [14]:
# Report RMSE and MSE
mse = round(mean_squared_error(test[target].values, pred_ans), 4)
rmse = mse ** 0.5
print("test RMSE", rmse)
print("test MSE", mse)

test RMSE 3.830900677386455
test MSE 14.6758


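The sample dataset is small (128 training samples) and the model was trained for a single epoch, so the predictions (around 0.01) are far from the actual ratings and the result is poor.  
<b>Next, use the movielens-1m dataset.</b>  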
<font color="red">These files contain 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.</font>

In [15]:
ratings = pd.read_csv("data/MovieLens/ml-1m/ratings.dat", header=None, sep="::", engine='python')
ratings.columns="UserID::MovieID::Rating::Timestamp".split("::")
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [16]:
movies = pd.read_csv("data/MovieLens/ml-1m/movies.dat", header=None, sep="::", engine='python')
movies.columns="MovieID::Title::Genres".split("::")
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [17]:
users = pd.read_csv("data/MovieLens/ml-1m/users.dat", header=None, sep="::", engine='python')
users.columns="UserID::Gender::Age::Occupation::Zip-code".split("::")
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [18]:
data = pd.merge(ratings, movies,how="left", on="MovieID")

In [19]:
data = pd.merge(data, users, how="left", on="UserID")
data.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Title,Genres,Gender,Age,Occupation,Zip-code
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,F,1,10,48067
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical,F,1,10,48067
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance,F,1,10,48067
3,1,3408,4,978300275,Erin Brockovich (2000),Drama,F,1,10,48067
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy,F,1,10,48067


In [20]:
data.to_csv("data/MovieLens/ml-1m/ml-1m.csv",index=0)

In [32]:
# Preprocess the MovieLens dataset: load the three files and join them
def movie_lens_preprocess(ratings_file, users_file, movies_file, 
                          rating_col="UserID::MovieID::Rating::Timestamp",
                          user_col="UserID::Gender::Age::Occupation::Zip-code", 
                          movie_col="MovieID::Title::Genres"):
    ratings = pd.read_csv(ratings_file, header=None, sep="::", engine='python')
    ratings.columns=rating_col.split("::")
    movies = pd.read_csv(movies_file, header=None, sep="::", engine='python')
    movies.columns=movie_col.split("::")
    users = pd.read_csv(users_file, header=None, sep="::", engine='python')
    users.columns=user_col.split("::")
    data = pd.merge(ratings, movies,how="left", on="MovieID")
    data = pd.merge(data, users, how="left", on="UserID")
    return data

In [29]:
# DeepFM training and prediction
def DeepFM_train_predict(data, sparse_features, target):
    # Label-encode the sparse features
    for feature in sparse_features:
        lbe = LabelEncoder()
        data[feature] = lbe.fit_transform(data[feature])
    # Count the number of distinct values of each feature
    fixlen_feature_columns = [SparseFeat(feature, data[feature].nunique()) for feature in sparse_features]
    print(fixlen_feature_columns)
    linear_feature_columns = fixlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
    # Split the dataset into a training set and a test set
    train, test = train_test_split(data, test_size=0.2)
    train_model_input = {name: train[name].values for name in feature_names}
    test_model_input = {name: test[name].values for name in feature_names}
    # Train DeepFM
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
    model.compile("adam", "mse", metrics=['mse'])
    history = model.fit(train_model_input, train[target].values, batch_size=256, epochs=1, verbose=True, validation_split=0.2)
    # Predict with DeepFM
    pred_ans = model.predict(test_model_input, batch_size=256)
    # Report RMSE and MSE; take the square root before rounding
    mse = mean_squared_error(test[target].values, pred_ans)
    rmse = mse ** 0.5
    print("\n\n", "*" * 150)
    print("test RMSE", round(rmse, 4))
    print("test MSE", round(mse, 4))

In [33]:
data = movie_lens_preprocess("data/MovieLens/ml-1m/ratings.dat", 
                             "data/MovieLens/ml-1m/users.dat", 
                             "data/MovieLens/ml-1m/movies.dat")
DeepFM_train_predict(data, ["MovieID", "UserID", "Gender", "Age", "Occupation", "Zip-code"], ['Rating'])

[SparseFeat(name='MovieID', vocabulary_size=3706, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001D00E738E48>, embedding_name='MovieID', group_name='default_group', trainable=True), SparseFeat(name='UserID', vocabulary_size=6040, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001D00E738898>, embedding_name='UserID', group_name='default_group', trainable=True), SparseFeat(name='Gender', vocabulary_size=2, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001D01342AD30>, embedding_name='Gender', group_name='default_group', trainable=True), SparseFeat(name='Age', vocabulary_size=7, embedding_dim=4, use_hash=False, dtype='int32', embeddings_initializer=<tensorflow.python.keras.initializers.RandomNormal object at 0x000001D0129

### Action 3: Apply neighborhood-based collaborative filtering (any one of KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline) to the MovieLens dataset with k-fold cross-validation (k=3), and output the RMSE and MAE of each fold

In [41]:
from surprise import KNNWithMeans, KNNWithZScore
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import KFold

In [36]:
# Load the data
reader = Reader(line_format='user item rating timestamp', sep=',', skip_lines=1)
data = Dataset.load_from_file('data/MovieLens/ratings.csv', reader=reader)
trainset = data.build_full_trainset()

In [37]:
# ItemCF scoring
# Only the k most similar items are used as neighbors
algo = KNNWithMeans(k=50, sim_options={'user_based': False}, verbose=True)

In [39]:
# Define the K-fold cross-validation iterator, K=3
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    # Train and predict
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8561
MAE:  0.6545
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8588
MAE:  0.6562
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8545
MAE:  0.6541
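For reference, the two metrics printed per fold can be written out directly. The sketch below is a plain-Python illustration of the standard RMSE and MAE definitions over (true rating, estimated rating) pairs; the sample pairs are made up and the helper names are not Surprise functions.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical predictions for one fold
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
fold_rmse = rmse(pairs)
fold_mae = mae(pairs)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE, which is why the RMSE values above are consistently higher than the MAE values.
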


In [42]:
algo2 = KNNWithZScore(k=50, sim_options={'user_based': False}, verbose=True)
# Define the K-fold cross-validation iterator, K=3
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    # Train and predict
    algo2.fit(trainset)
    predictions = algo2.test(testset)
    # Compute RMSE
    accuracy.rmse(predictions, verbose=True)
    # Compute MAE
    accuracy.mae(predictions, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8595
MAE:  0.6559
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8565
MAE:  0.6548
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.8594
MAE:  0.6560
