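## Description
This notebook uses FM (Factorization Machines) to show how a regression task and a classification task can each be solved with an existing package. The regression task uses the movie rating dataset from the collaborative filtering example; the classification task uses a synthetically generated dataset.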

## Regression task
The regression task again uses the movie rating dataset (ml-100k), which can be downloaded from: [http://www.grouplens.org/system/files/ml-100k.zip](http://www.grouplens.org/system/files/ml-100k.zip)

In [1]:
# Import packages
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from pyfm import pylibfm

In [2]:
# Define the data loader: each line is user, movie, rating, timestamp separated by tabs
def loadData(filename, path='ml-100k/'):
    data = []
    y = []
    users = set()
    items = set()
    with open(path+filename) as f:
        for line in f:
            (user, movieid, rating, ts) = line.split('\t')
            data.append({'user_id': str(user), 'movie_id': str(movieid)})
            y.append(float(rating))
            users.add(user)
            items.add(movieid)
    
    return (data, np.array(y), users, items)
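
To make the expected input format concrete, the sketch below parses a single tab-separated record in the same way `loadData` does. The sample line is made up for illustration and is only assumed to follow the user, item, rating, timestamp layout; `parse_rating_line` is a hypothetical helper, not part of the notebook.

```python
def parse_rating_line(line):
    """Split one 'user<TAB>item<TAB>rating<TAB>timestamp' record into (features, target)."""
    # strip() removes the trailing newline that would otherwise stick to the timestamp
    user, movieid, rating, ts = line.strip().split('\t')
    # IDs are kept as strings so that they are later treated as categories, not numbers
    features = {'user_id': user, 'movie_id': movieid}
    return features, float(rating)


# Hypothetical sample record (not copied from the dataset)
sample_line = "1\t1\t5\t874965758\n"
features, target = parse_rating_line(sample_line)
print(features, target)
```

Keeping the IDs as strings matters: a string value is one-hot encoded downstream, whereas a numeric value would be used as a single real-valued feature.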

In [3]:
# Load the training and test splits
(train_data, y_train, train_users, train_items) = loadData('ua.base')
(test_data, y_test, test_users, test_items) = loadData('ua.test')
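
The user and item sets returned by `loadData` are not used again in this notebook. One possible use, sketched here with made-up toy sets, is to check for IDs that appear only in the test split. `DictVectorizer.transform` silently ignores features it did not see during fitting, so such IDs would contribute no active one-hot column. `unseen_ids` is a hypothetical helper.

```python
def unseen_ids(train_ids, test_ids):
    """Return the IDs that occur in the test split but never in the training split."""
    return set(test_ids) - set(train_ids)


# Hypothetical toy ID sets standing in for train_items / test_items
toy_train_items = {'1', '2', '3', '5'}
toy_test_items = {'2', '3', '4'}

missing = unseen_ids(toy_train_items, toy_test_items)
print(sorted(missing))  # ['4']
```

With the real data the same call would be `unseen_ids(train_items, test_items)`, and likewise for users.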

In [4]:
train_data

[{'user_id': '1', 'movie_id': '1'},
 {'user_id': '1', 'movie_id': '2'},
 {'user_id': '1', 'movie_id': '3'},
 {'user_id': '1', 'movie_id': '4'},
 {'user_id': '1', 'movie_id': '5'},
 {'user_id': '1', 'movie_id': '6'},
 {'user_id': '1', 'movie_id': '7'},
 {'user_id': '1', 'movie_id': '8'},
 {'user_id': '1', 'movie_id': '9'},
 {'user_id': '1', 'movie_id': '10'},
 {'user_id': '1', 'movie_id': '11'},
 {'user_id': '1', 'movie_id': '12'},
 {'user_id': '1', 'movie_id': '13'},
 {'user_id': '1', 'movie_id': '14'},
 {'user_id': '1', 'movie_id': '15'},
 {'user_id': '1', 'movie_id': '16'},
 {'user_id': '1', 'movie_id': '17'},
 {'user_id': '1', 'movie_id': '18'},
 {'user_id': '1', 'movie_id': '19'},
 {'user_id': '1', 'movie_id': '21'},
 {'user_id': '1', 'movie_id': '22'},
 {'user_id': '1', 'movie_id': '23'},
 {'user_id': '1', 'movie_id': '24'},
 {'user_id': '1', 'movie_id': '25'},
 {'user_id': '1', 'movie_id': '26'},
 {'user_id': '1', 'movie_id': '27'},
 {'user_id': '1', 'movie_id': '28'},
 ...

In [5]:
# Convert the dicts to one-hot encoded features
v = DictVectorizer()
X_train = v.fit_transform(train_data)
X_test = v.transform(test_data)
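
As a rough illustration of what the encoding step produces for string-valued fields, the sketch below builds one 0/1 column per `field=value` pair. It is a simplified dense re-implementation for illustration only, not how `DictVectorizer` works internally (which returns a SciPy sparse matrix by default); `one_hot_rows` and the toy records are hypothetical.

```python
def one_hot_rows(records):
    """Encode a list of {field: category} dicts as dense 0/1 rows."""
    # One column per distinct "field=category" pair, in sorted order
    columns = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    index = {name: j for j, name in enumerate(columns)}
    rows = []
    for rec in records:
        row = [0] * len(columns)
        for k, v in rec.items():
            row[index[f"{k}={v}"]] = 1  # switch on the column for this pair
        rows.append(row)
    return columns, rows


# Hypothetical toy records in the same shape as train_data
toy = [
    {'user_id': '1', 'movie_id': '1'},
    {'user_id': '1', 'movie_id': '2'},
    {'user_id': '2', 'movie_id': '1'},
]
columns, rows = one_hot_rows(toy)
print(columns)  # ['movie_id=1', 'movie_id=2', 'user_id=1', 'user_id=2']
for row in rows:
    print(row)
```

Each row has exactly two active entries, one for the user and one for the movie, which is why the real feature matrix is extremely sparse.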

In [6]:
# Build the FM model
fm = pylibfm.FM(num_factors=10, num_iter=100, verbose=True, task='regression', initial_learning_rate=0.001, learning_rate_schedule='optimal')

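The parameters of `FM` are listed below; the ones that most need to be set are in bold (see the source code for details):
* **num_factors**: dimensionality of the latent vectors, i.e. k
* **num_iter**: number of iterations; because training uses SGD (stochastic gradient descent), this is the number of epochs to run
* k0, k1: k0 controls whether the bias term is used (see the FM formula); k1 controls whether the second term, i.e. the per-feature first-order term, is used; both default to True
* init_stdev: standard deviation used to initialize the latent vectors, default 0.01
* **validation_size**: proportion of the training set held out for validation, default 0.01
* learning_rate_schedule: learning rate schedule, one of constant, optimal, or invscaling; see the source code for the exact formulas
* **initial_learning_rate**: initial learning rate, default 0.01
* power_t, t0: the exponent of the inverse-scaling learning rate and the constant in the denominator of the optimal learning rate; both feed into the schedules above
* **task**: classification or regression; must be specified
* verbose: whether to print the current iteration and the training error
* shuffle_training: whether to shuffle the training set before learning
* seed: random seed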
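
To connect `num_factors`, `k0`, and `k1` to the FM formula, here is a minimal NumPy sketch of the standard second-order FM score for a single dense feature vector. It is an illustration under stated assumptions, not the `pyfm` implementation: the function names and the random parameters are made up, and the pairwise term uses the usual O(n·k) reformulation, checked against the explicit double loop.

```python
import numpy as np


def fm_predict(x, w0, w, V, k0=True, k1=True):
    """Second-order FM score for one dense feature vector.

    x: (n,) features, w0: bias, w: (n,) linear weights, V: (n, k) latent factors.
    """
    score = 0.0
    if k0:
        score += w0                 # global bias term
    if k1:
        score += float(w @ x)       # first-order (per-feature) term
    # Pairwise term via the identity
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
    xv = x @ V                      # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)      # shape (k,)
    score += 0.5 * float(np.sum(xv ** 2 - x2v2))
    return score


def fm_predict_naive(x, w0, w, V):
    """Same score with the explicit double loop over feature pairs, for checking."""
    n = len(x)
    score = w0 + float(w @ x)
    for i in range(n):
        for j in range(i + 1, n):
            score += float(V[i] @ V[j]) * x[i] * x[j]
    return score


# Hypothetical random parameters: n=6 features, k=3 latent factors
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
w0, w = 0.1, rng.normal(size=n)
V = rng.normal(scale=0.01, size=(n, k))  # small random init, in the spirit of init_stdev

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
print(np.isclose(fast, slow))
```

Here `k` plays the role of `num_factors`; setting `k0=False` or `k1=False` simply drops the corresponding term from the score.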

In [7]:
# Train the model
fm.fit(X_train, y_train)

Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training MSE: 0.59438
-- Epoch 2
Training MSE: 0.51745
-- Epoch 3
Training MSE: 0.48994
-- Epoch 4
Training MSE: 0.47416
-- Epoch 5
Training MSE: 0.46356
-- Epoch 6
Training MSE: 0.45625
-- Epoch 7
Training MSE: 0.45053
-- Epoch 8
Training MSE: 0.44590
-- Epoch 9
Training MSE: 0.44219
-- Epoch 10
Training MSE: 0.43918
-- Epoch 11
Training MSE: 0.43638
-- Epoch 12
Training MSE: 0.43391
-- Epoch 13
Training MSE: 0.43176
-- Epoch 14
Training MSE: 0.42992
-- Epoch 15
Training MSE: 0.42817
-- Epoch 16
Training MSE: 0.42656
-- Epoch 17
Training MSE: 0.42497
-- Epoch 18
Training MSE: 0.42359
-- Epoch 19
Training MSE: 0.42216
-- Epoch 20
Training MSE: 0.42088
-- Epoch 21
Training MSE: 0.41962
-- Epoch 22
Training MSE: 0.41846
-- Epoch 23
Training MSE: 0.41727
-- Epoch 24
Training MSE: 0.41613
-- Epoch 25
Training MSE: 0.41488
-- Epoch 26
Training MSE: 0.41367
-- Epoch 27
Training MSE: 0.41254
-- Epoch 28
...

In [8]:
# Predict on the test set for evaluation
preds = fm.predict(X_test)

In [9]:
preds

array([4.02375697, 3.48672592, 4.00912143, ..., 3.17436818, 2.89122399,
       3.31057522])

In [10]:
from sklearn.metrics import mean_squared_error

In [11]:
print('FM MSE: %.4f' % mean_squared_error(y_test, preds))

FM MSE: 0.9000
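
For reference, a test MSE of 0.9000 corresponds to an RMSE of about 0.949 in rating units. The sketch below spells out the metric that `mean_squared_error` reports, using made-up toy values; `mse` is a hypothetical helper written for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical toy ratings and predictions
toy_true = [4.0, 3.0, 5.0, 2.0]
toy_pred = [3.5, 3.0, 4.0, 3.0]

toy_mse = mse(toy_true, toy_pred)
toy_rmse = toy_mse ** 0.5  # RMSE is in the same units as the ratings
print(toy_mse, toy_rmse)   # 0.5625 0.75
```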


## Classification task

In [12]:
from sklearn.datasets import make_classification   # generates a random classification dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

In [13]:
X, y = make_classification(n_samples=1000, n_features=100, n_clusters_per_class=1) # 1000 samples with 100 features each
# Turn each row into a {feature_index: value} dict, the input format DictVectorizer expects
data = [dict(enumerate(row)) for row in X]
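
The conversion above maps each row of `X` to a dictionary keyed by column index. A small sketch with a made-up two-row matrix shows the resulting shape; `row_to_feature_dict` is a hypothetical helper used only here.

```python
def row_to_feature_dict(row):
    """Map each position in a numeric row to {feature_index: value}."""
    return dict(enumerate(row))


# Hypothetical toy matrix standing in for X (2 samples, 3 features)
toy_X = [
    [0.5, -1.2, 3.0],
    [1.5, 0.0, -0.7],
]
toy_data = [row_to_feature_dict(row) for row in toy_X]
print(toy_data[0])  # {0: 0.5, 1: -1.2, 2: 3.0}
```

Because the values are numbers rather than strings, vectorizing these dictionaries keeps them as real-valued features instead of one-hot encoding them.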

In [14]:
x_train, x_test, y_train, y_test = train_test_split(data, y, test_size=0.1, random_state=42)

In [15]:
v = DictVectorizer()
x_train = v.fit_transform(x_train)
x_test = v.transform(x_test)

In [16]:
x_train.toarray()

array([[-0.50084367, -1.15958151, -1.34485874, ..., -0.1791191 ,
         0.44743747, -2.26121926],
       [-0.51000164,  1.50170871,  1.58228806, ..., -0.19630376,
         0.9405414 ,  0.86885242],
       [-0.89178052,  1.26177447, -1.79259503, ...,  0.54888147,
         0.16758642, -0.68694894],
       ...,
       [-0.9115421 ,  0.92118578,  1.89000283, ..., -0.81724515,
         0.76858602, -0.8606871 ],
       [ 0.75156516, -0.09004403, -2.28495839, ..., -0.49285608,
         0.58150397, -1.40983261],
       [-0.10283194, -0.09581366,  1.05650123, ...,  1.5900855 ,
         0.31219612,  0.03709867]])

In [17]:
# Build the model
fm = pylibfm.FM(num_factors=50, num_iter=10, verbose=True, task='classification', initial_learning_rate=0.0001, learning_rate_schedule='optimal')

In [18]:
fm.fit(x_train, y_train)

Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training log loss: 2.12467
-- Epoch 2
Training log loss: 1.74185
-- Epoch 3
Training log loss: 1.42232
-- Epoch 4
Training log loss: 1.16085
-- Epoch 5
Training log loss: 0.94964
-- Epoch 6
Training log loss: 0.78052
-- Epoch 7
Training log loss: 0.64547
-- Epoch 8
Training log loss: 0.53758
-- Epoch 9
Training log loss: 0.45132
-- Epoch 10
Training log loss: 0.38187


In [19]:
y_pre = fm.predict(x_test)

In [20]:
print('validation log loss: %.4f' % log_loss(y_test, y_pre))

validation log loss: 1.3678
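
To help read this number, the sketch below computes binary log loss by hand on made-up values, assuming the predictions are probabilities of the positive class (as the `log_loss` call implies). A predictor that always outputs 0.5 scores ln 2 ≈ 0.693, so a held-out log loss of 1.3678 next to a final training log loss of 0.38187 suggests the model generalizes poorly with these settings. `binary_log_loss` is a hypothetical helper.

```python
import numpy as np


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for overconfident predictions
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


# Hypothetical labels and predicted probabilities of class 1
toy_y = [1, 0, 1, 0]
toy_p = [0.9, 0.1, 0.5, 0.5]

toy_loss = binary_log_loss(toy_y, toy_p)
baseline = binary_log_loss(toy_y, [0.5] * len(toy_y))  # constant 0.5 predictor
print(round(toy_loss, 4), round(baseline, 4))
```

Confident but wrong predictions are penalized heavily by this metric, which is how a classifier can have a log loss well above the 0.5 baseline.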
