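In real-world settings the training set is often very large, which raises some practical problems. For example:

1. The full dataset cannot be loaded into memory at once for training.
2. New data arrives every day. Is there an incremental training scheme that keeps updating the model?

Here we implement SVD-style matrix factorization as a neural network in TensorFlow.
A latent factor model (LFM) assumes that a user's rating of an item has an underlying basis: it depends on k latent factors. Every user has a k-dimensional user vector and every item has a k-dimensional item vector. SVD here is one way to implement an LFM.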

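### The prediction formula
$y_{pred[u, i]} = bias_{global} + bias_{user[u]} + bias_{item[i]} + \langle embedding_{user[u]}, embedding_{item[i]} \rangle$

The prediction is the inner product of the user vector and the item vector, plus three bias terms (global, user, and item).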
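
To make the formula concrete, here is a minimal NumPy sketch of the same computation for a small batch. The toy sizes, the random initial values, and the helper name `predict` are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration
num_users, num_items, k = 4, 3, 2
rng = np.random.default_rng(0)

bias_global = 3.5                                   # scalar global bias
bias_user = rng.normal(0, 0.1, size=num_users)      # one bias per user
bias_item = rng.normal(0, 0.1, size=num_items)      # one bias per item
embedding_user = rng.normal(0, 0.02, size=(num_users, k))
embedding_item = rng.normal(0, 0.02, size=(num_items, k))

def predict(u, i):
    """Predict ratings for arrays of user indices u and item indices i."""
    # Row-wise inner product of the looked-up user and item vectors
    inner = np.sum(embedding_user[u] * embedding_item[i], axis=1)
    return bias_global + bias_user[u] + bias_item[i] + inner

u_batch = np.array([0, 1, 3])
i_batch = np.array([2, 0, 1])
y_pred = predict(u_batch, i_batch)   # one predicted rating per (user, item) pair
```

Indexing the parameter arrays with `u` and `i` plays the same role as an embedding lookup: only the rows needed by the current batch are read.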

### The loss to minimize (with a regularization term)
$\sum_{u, i} \left( |y_{pred[u, i]} - y_{true[u, i]}|^2 + \lambda(\|embedding_{user[u]}\|^2 + \|embedding_{item[i]}\|^2) \right)$
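
As a small worked example of this loss, the sketch below evaluates it in NumPy for a batch of two ratings. The function name `regularized_loss` and the numbers are made up for illustration.

```python
import numpy as np

def regularized_loss(y_pred, y_true, emb_user, emb_item, lam):
    """Squared error plus lam times the squared norms of the batch embeddings."""
    squared_error = np.sum((y_pred - y_true) ** 2)
    penalty = np.sum(emb_user ** 2) + np.sum(emb_item ** 2)
    return squared_error + lam * penalty

y_pred = np.array([3.0, 4.0])
y_true = np.array([4.0, 4.0])
emb_user = np.array([[1.0, 0.0], [0.0, 2.0]])   # user vectors of the two samples
emb_item = np.array([[0.0, 1.0], [1.0, 0.0]])   # item vectors of the two samples

# squared error = 1.0, penalty = 5.0 + 2.0 = 7.0, so loss = 1.0 + 0.1 * 7.0 = 1.7
loss = regularized_loss(y_pred, y_true, emb_user, emb_item, lam=0.1)
```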

## 1. Get the data
This model uses MovieLens as the example. Each record has the fields user, item, rating, timestamp.

In [5]:
#!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
#!sudo unzip ml-1m.zip -d ./movielens

## 2. Data preprocessing

In [19]:
import numpy as np
import pandas as pd
import tensorflow as tf


In [16]:
def read_data_and_data_process(filename, sep='\t'):
    """Read the ratings file and preprocess it."""
    # Note: ratings.dat in ml-1m is delimited by '::', so pass sep='::' for that file
    col_names = ['user', 'item', 'rate', 'st']
    df = pd.read_csv(filename, sep=sep, header=None, names=col_names)
    df['user'] -= 1  # make indices start at 0
    df['item'] -= 1
    for col in ("user", "item"):
        df[col] = df[col].astype(np.int32)
    df["rate"] = df["rate"].astype(np.float32)
    return df
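
The same preprocessing can be tried on a few in-memory rows without downloading anything. This is only a sketch: the three rows are made up, laid out as `UserID::MovieID::Rating::Timestamp` like `ratings.dat` in ml-1m, and `engine="python"` is passed because pandas' C parser does not handle multi-character separators.

```python
import io
import numpy as np
import pandas as pd

# Three made-up rows in the ml-1m layout: UserID::MovieID::Rating::Timestamp
sample = io.StringIO(
    "1::10::5::978300760\n"
    "1::20::3::978302109\n"
    "2::10::4::978301968\n"
)

df = pd.read_csv(sample, sep="::", header=None,
                 names=["user", "item", "rate", "st"], engine="python")
df["user"] -= 1   # shift ids so that indices start at 0
df["item"] -= 1
df = df.astype({"user": np.int32, "item": np.int32, "rate": np.float32})
```

Zero-based ids matter later because they are used directly as row indices into the embedding tables.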

In [14]:
class ShuffleDataIterator(object):
    """Yield random batches of data, one batch at a time."""
    def __init__(self, inputs, batch_size=10):
        # inputs: a sequence of equal-length columns, e.g. [users, items, rates]
        self.inputs = inputs
        self.batch_size = batch_size
        self.num_cols = len(self.inputs)
        self.len = len(self.inputs[0])  # number of samples, not number of columns
        # Stack the columns into an array of shape (num_samples, num_cols)
        self.inputs = np.transpose(np.vstack([np.array(self.inputs[i]) for i in range(self.num_cols)]))

    def __len__(self):
        """Total number of samples."""
        return self.len

    def __iter__(self):
        return self

    def __next__(self):
        """Return the next batch."""
        return self.next()

    def next(self):
        """Draw batch_size random indices (with replacement) and return those samples."""
        ids = np.random.randint(0, self.len, (self.batch_size,))
        out = self.inputs[ids, :]
        return [out[:, i] for i in range(self.num_cols)]

class OneEpochDataIterator(ShuffleDataIterator):
    """Yield the data of one epoch in sequential order."""
    def __init__(self, inputs, batch_size=10):
        super(OneEpochDataIterator, self).__init__(inputs, batch_size=batch_size)
        if batch_size > 0:
            # np.array_split needs an integer number of sections
            self.idx_group = np.array_split(np.arange(self.len), int(np.ceil(self.len / batch_size)))
        else:
            self.idx_group = [np.arange(self.len)]
        self.group_id = 0

    def next(self):
        if self.group_id >= len(self.idx_group):
            # Reset so the iterator can be reused for the next epoch
            self.group_id = 0
            raise StopIteration
        out = self.inputs[self.idx_group[self.group_id], :]
        self.group_id += 1
        return [out[:, i] for i in range(self.num_cols)]
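
The two iterators differ only in how they choose sample indices. The sketch below isolates that index logic in plain NumPy, with hypothetical sizes (`num_samples = 10`, `batch_size = 4`) chosen for illustration.

```python
import numpy as np

num_samples, batch_size = 10, 4
rng = np.random.default_rng(0)

# Random batch: indices are drawn with replacement, so repeats are possible
random_ids = rng.integers(0, num_samples, size=batch_size)

# One sequential epoch: split 0..num_samples-1 into ceil(n / batch_size) groups
num_groups = int(np.ceil(num_samples / batch_size))
epoch_groups = np.array_split(np.arange(num_samples), num_groups)
group_sizes = [len(g) for g in epoch_groups]   # [4, 3, 3]
```

Because `np.array_split` balances the group sizes, a sequential batch can be smaller than `batch_size`, but every sample is visited exactly once per epoch. The random iterator never raises `StopIteration`, so the training loop has to decide when to stop.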

In [18]:


# Build the network structure using matrix factorization
# (written against the TensorFlow 1.x graph API)
def inference_svd(user_batch, item_batch, user_num, item_num, dim=5, device="/cpu:0"):
    # Keep the variables and the embedding lookups on the CPU
    with tf.device("/cpu:0"):
        # Create the bias variables
        global_bias = tf.get_variable("global_bias", shape=[])
        w_bias_user = tf.get_variable("embd_bias_user", shape=[user_num])
        w_bias_item = tf.get_variable("embd_bias_item", shape=[item_num])
        # Bias values looked up for the samples in the batch
        bias_user = tf.nn.embedding_lookup(w_bias_user, user_batch, name="bias_user")
        bias_item = tf.nn.embedding_lookup(w_bias_item, item_batch, name="bias_item")

        w_user = tf.get_variable("embd_user", shape=[user_num, dim],
                                 initializer=tf.truncated_normal_initializer(stddev=0.02))
        w_item = tf.get_variable("embd_item", shape=[item_num, dim],
                                 initializer=tf.truncated_normal_initializer(stddev=0.02))
        # User vectors and item vectors for the batch
        embd_user = tf.nn.embedding_lookup(w_user, user_batch, name="embedding_user")
        embd_item = tf.nn.embedding_lookup(w_item, item_batch, name="embedding_item")

    with tf.device(device):
        # Compute the prediction following the formula
        # First take the inner product of the user and item vectors
        infer = tf.reduce_sum(tf.multiply(embd_user, embd_item), 1)
        # Then add the bias terms
        infer = tf.add(infer, global_bias)
        infer = tf.add(infer, bias_user)
        infer = tf.add(infer, bias_item, name="svd_inference")
        # L2 regularization term on the looked-up embeddings
        regularizer = tf.add(tf.nn.l2_loss(embd_user), tf.nn.l2_loss(embd_item), name="svd_regularizer")
    return infer, regularizer

# Optimization step
def optimization(infer, regularizer, rate_batch, learning_rate=0.001, reg=0.1, device="/cpu:0"):
    # A global step must already exist in the graph
    global_step = tf.train.get_global_step()
    assert global_step is not None
    with tf.device(device):
        # tf.nn.l2_loss(x) computes sum(x ** 2) / 2
        cost_l2 = tf.nn.l2_loss(tf.subtract(infer, rate_batch))
        penalty = tf.constant(reg, dtype=tf.float32, shape=[], name="l2")
        # Total cost = squared-error term + reg * regularization term
        cost = tf.add(cost_l2, tf.multiply(regularizer, penalty))
        # Minimize the cost with the Adam optimizer
        train_op = tf.train.AdamOptimizer(learning_rate).minimize(cost, global_step=global_step)
    return cost, train_op
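
The TensorFlow code leaves the gradient computation to `AdamOptimizer`. To show what a single mini-batch update does, here is a rough NumPy sketch that swaps in plain SGD (not Adam) and writes the gradients of the same cost out by hand. The synthetic ratings, the hyperparameters, and the names `P`, `Q`, `g_bias`, `b_user`, `b_item` are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, k = 20, 15, 3

# Synthetic ratings, made up for illustration: 500 random (user, item, rating) triples
users = rng.integers(0, num_users, size=500)
items = rng.integers(0, num_items, size=500)
rates = rng.integers(1, 6, size=500).astype(np.float64)

g_bias = 0.0
b_user = np.zeros(num_users)
b_item = np.zeros(num_items)
P = rng.normal(0, 0.02, size=(num_users, k))   # user embeddings
Q = rng.normal(0, 0.02, size=(num_items, k))   # item embeddings

def mse():
    """Mean squared error over the whole synthetic dataset."""
    pred = g_bias + b_user[users] + b_item[items] + np.sum(P[users] * Q[items], axis=1)
    return np.mean((pred - rates) ** 2)

lr, reg, batch_size = 0.01, 0.1, 50
mse_before = mse()
for step in range(200):
    # Each step touches only one random mini-batch of samples
    ids = rng.integers(0, len(rates), size=batch_size)
    u, i, r = users[ids], items[ids], rates[ids]
    pu, qi = P[u], Q[i]                          # copies of the rows used by the batch
    err = g_bias + b_user[u] + b_item[i] + np.sum(pu * qi, axis=1) - r
    # Gradients of 0.5 * sum(err^2) + reg * 0.5 * (sum(pu^2) + sum(qi^2))
    g_bias -= lr * err.sum()
    # np.add.at accumulates correctly when an index repeats within the batch
    np.add.at(b_user, u, -lr * err)
    np.add.at(b_item, i, -lr * err)
    np.add.at(P, u, -lr * (err[:, None] * qi + reg * pu))
    np.add.at(Q, i, -lr * (err[:, None] * pu + reg * qi))
mse_after = mse()
```

Since every update reads and writes only the rows that appear in the current batch, the full dataset never has to sit in memory at once, and the same loop can keep running on newly arrived batches, which is the incremental-training idea raised at the start.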
