# Baseline Prediction

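* Compute the global mean $\mu$ of all ratings
* Find the rating bias $b_u$ of each user
* Find the rating bias $b_i$ of each item
* Predict the rating as $\hat{r}_{ui} = \mu + b_u + b_i$
* The loss can be minimized with **gradient descent** or **alternating least squares (ALS)**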
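
The loss mentioned in the last bullet is typically a regularized squared error over the observed ratings. The form below is a sketch: the set $K$ of observed (user, item) pairs and the regularization weight $\lambda$ are notation assumed here for illustration (the notebook code names the weight `lamda`).

$$
\min_{b_u,\, b_i} \sum_{(u,i) \in K} \left(r_{ui} - \mu - b_u - b_i\right)^2 + \lambda \left(\sum_u b_u^2 + \sum_i b_i^2\right)
$$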

## Optimizing the baseline loss with gradient descent
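
For a single observed rating, the stochastic gradient descent step can be derived as follows. This is a sketch that assumes the per-rating objective $\ell_{ui}$ written below, with learning rate $\alpha$, regularization weight $\lambda$, and the constant factor 2 absorbed into $\alpha$.

$$
\begin{aligned}
e_{ui} &= r_{ui} - \mu - b_u - b_i \\
\ell_{ui} &= e_{ui}^2 + \lambda \left(b_u^2 + b_i^2\right) \\
\frac{\partial \ell_{ui}}{\partial b_u} &= -2 e_{ui} + 2 \lambda b_u, \qquad
\frac{\partial \ell_{ui}}{\partial b_i} = -2 e_{ui} + 2 \lambda b_i \\
b_u &\leftarrow b_u + \alpha \left(e_{ui} - \lambda b_u\right) \\
b_i &\leftarrow b_i + \alpha \left(e_{ui} - \lambda b_i\right)
\end{aligned}
$$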

In [1]:
# Load the dataset
import numpy as np
import pandas as pd
dtype = {"userId":np.int32, "movieId":np.int32, "rating":np.float32}
data = pd.read_csv("../dataset/ml-latest-small/ratings.csv", usecols=range(3), dtype=dtype)

In [2]:
data

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0
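
The frame has only three columns because `usecols=range(3)` drops the trailing `timestamp` column of `ratings.csv`, and `dtype` narrows the remaining columns to 32-bit types. A minimal sketch of the same call on an in-memory sample (the rows in `csv_text` are a hypothetical sample, and `sample` is an illustrative name):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical four-column sample in the layout of MovieLens ratings.csv.
csv_text = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
2,318,3.0,1445714835
"""

dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
# usecols=range(3) keeps columns 0..2 and skips the trailing timestamp.
sample = pd.read_csv(io.StringIO(csv_text), usecols=range(3), dtype=dtype)

print(sample.columns.tolist())
print(sample.dtypes)
```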


In [3]:
user_ratings = data.groupby("userId").agg([list])
item_ratings = data.groupby("movieId").agg([list])

In [4]:
user_ratings.head()

userId,movieId (list),rating (list)
1,"[1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163,...","[4.0, 4.0, 4.0, 5.0, 5.0, 3.0, 5.0, 4.0, 5.0, ..."
2,"[318, 333, 1704, 3578, 6874, 8798, 46970, 4851...","[3.0, 4.0, 4.5, 4.0, 4.0, 3.5, 4.0, 4.0, 4.5, ..."
3,"[31, 527, 647, 688, 720, 849, 914, 1093, 1124,...","[0.5, 0.5, 0.5, 0.5, 0.5, 5.0, 0.5, 0.5, 0.5, ..."
4,"[21, 32, 45, 47, 52, 58, 106, 125, 126, 162, 1...","[3.0, 2.0, 3.0, 2.0, 3.0, 3.0, 4.0, 5.0, 1.0, ..."
5,"[1, 21, 34, 36, 39, 50, 58, 110, 150, 153, 232...","[4.0, 4.0, 4.0, 4.0, 3.0, 4.0, 5.0, 4.0, 3.0, ..."


In [5]:
item_ratings.head()

movieId,userId (list),rating (list)
1,"[1, 5, 7, 15, 17, 18, 19, 21, 27, 31, 32, 33, ...","[4.0, 4.0, 4.5, 2.5, 4.5, 3.5, 4.0, 3.5, 3.0, ..."
2,"[6, 8, 18, 19, 20, 21, 27, 51, 62, 68, 82, 91,...","[4.0, 4.0, 3.0, 3.0, 3.0, 3.5, 4.0, 4.5, 4.0, ..."
3,"[1, 6, 19, 32, 42, 43, 44, 51, 58, 64, 68, 91,...","[4.0, 5.0, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 3.0, ..."
4,"[6, 14, 84, 162, 262, 411, 600]","[3.0, 3.0, 3.0, 3.0, 1.0, 2.0, 1.5]"
5,"[6, 31, 43, 45, 58, 66, 68, 84, 103, 107, 111,...","[5.0, 3.0, 5.0, 3.0, 4.0, 4.0, 2.0, 3.0, 4.0, ..."
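
To make the grouped structure easier to read, here is a small sketch of the same `groupby(...).agg([list])` call on a toy frame. The data and the names `toy` and `toy_user_ratings` are hypothetical and serve only as illustration.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the real dataset.
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 5.0, 3.0],
})

# Each remaining column is collapsed into one Python list per group.
toy_user_ratings = toy.groupby("userId").agg([list])

# Columns form a MultiIndex such as ("movieId", "list").
movies_of_user_1 = toy_user_ratings.loc[1, ("movieId", "list")]
ratings_of_user_1 = toy_user_ratings.loc[1, ("rating", "list")]
print(movies_of_user_1, ratings_of_user_1)
```

Because the index of the grouped frame holds the distinct ids, it can be reused directly as the key set for per-user or per-item parameters.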


In [6]:
# Compute the global mean rating
global_mean = data["rating"].mean()
global_mean

3.501556873321533

In [7]:
# Initialize the user biases bu and item biases bi to zero
bu = dict(zip(user_ratings.index, np.zeros(len(user_ratings.index))))
bi = dict(zip(item_ratings.index, np.zeros(len(item_ratings.index))))

In [8]:
bu

{1: 0.0,
 2: 0.0,
 3: 0.0,
 4: 0.0,
 5: 0.0,
 6: 0.0,
 7: 0.0,
 8: 0.0,
 9: 0.0,
 10: 0.0,
 11: 0.0,
 12: 0.0,
 13: 0.0,
 14: 0.0,
 15: 0.0,
 16: 0.0,
 17: 0.0,
 18: 0.0,
 19: 0.0,
 20: 0.0,
 21: 0.0,
 22: 0.0,
 23: 0.0,
 24: 0.0,
 25: 0.0,
 26: 0.0,
 27: 0.0,
 28: 0.0,
 29: 0.0,
 30: 0.0,
 31: 0.0,
 32: 0.0,
 33: 0.0,
 34: 0.0,
 35: 0.0,
 36: 0.0,
 37: 0.0,
 38: 0.0,
 39: 0.0,
 40: 0.0,
 41: 0.0,
 42: 0.0,
 43: 0.0,
 44: 0.0,
 45: 0.0,
 46: 0.0,
 47: 0.0,
 48: 0.0,
 49: 0.0,
 50: 0.0,
 51: 0.0,
 52: 0.0,
 53: 0.0,
 54: 0.0,
 55: 0.0,
 56: 0.0,
 57: 0.0,
 58: 0.0,
 59: 0.0,
 60: 0.0,
 61: 0.0,
 62: 0.0,
 63: 0.0,
 64: 0.0,
 65: 0.0,
 66: 0.0,
 67: 0.0,
 68: 0.0,
 69: 0.0,
 70: 0.0,
 71: 0.0,
 72: 0.0,
 73: 0.0,
 74: 0.0,
 75: 0.0,
 76: 0.0,
 77: 0.0,
 78: 0.0,
 79: 0.0,
 80: 0.0,
 81: 0.0,
 82: 0.0,
 83: 0.0,
 84: 0.0,
 85: 0.0,
 86: 0.0,
 87: 0.0,
 88: 0.0,
 89: 0.0,
 90: 0.0,
 91: 0.0,
 92: 0.0,
 93: 0.0,
 94: 0.0,
 95: 0.0,
 96: 0.0,
 97: 0.0,
 98: 0.0,
 99: 0.0,
 100: 0.0,
 101: 0.0,
 ...}

In [9]:
list(data.itertuples())[:5]

[Pandas(Index=0, userId=1, movieId=1, rating=4.0),
 Pandas(Index=1, userId=1, movieId=3, rating=4.0),
 Pandas(Index=2, userId=1, movieId=6, rating=4.0),
 Pandas(Index=3, userId=1, movieId=47, rating=5.0),
 Pandas(Index=4, userId=1, movieId=50, rating=5.0)]

In [10]:
list(data.itertuples(index=False))[:5]

[Pandas(userId=1, movieId=1, rating=4.0),
 Pandas(userId=1, movieId=3, rating=4.0),
 Pandas(userId=1, movieId=6, rating=4.0),
 Pandas(userId=1, movieId=47, rating=5.0),
 Pandas(userId=1, movieId=50, rating=5.0)]

[This article explains the difference between itertuples and iteritems](https://blog.csdn.net/qq_27575895/article/details/90034037)

In [11]:
# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent.
list(data.items())[:5]

[('userId',
  0           1
  1           1
  2           1
  3           1
  4           1
           ... 
  100831    610
  100832    610
  100833    610
  100834    610
  100835    610
  Name: userId, Length: 100836, dtype: int32),
 ('movieId',
  0              1
  1              3
  2              6
  3             47
  4             50
             ...  
  100831    166534
  100832    168248
  100833    168250
  100834    168252
  100835    170875
  Name: movieId, Length: 100836, dtype: int32),
 ('rating',
  0         4.0
  1         4.0
  2         4.0
  3         5.0
  4         5.0
           ... 
  100831    4.0
  100832    5.0
  100833    5.0
  100834    5.0
  100835    3.0
  Name: rating, Length: 100836, dtype: float32)]
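
The outputs above show that `itertuples` walks the frame row by row, while the column iterator yields one `(name, Series)` pair per column. A compact sketch of the contrast on a hypothetical toy frame follows; it uses `DataFrame.items()`, since `iteritems()` is no longer available from pandas 2.0 onward.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
toy = pd.DataFrame({"userId": [1, 2], "movieId": [10, 20], "rating": [4.0, 3.5]})

# Row-wise: one namedtuple per rating, unpackable into three variables.
rows = [(uid, iid, rating) for uid, iid, rating in toy.itertuples(index=False)]

# Column-wise: one (column name, Series) pair per column.
column_names = [name for name, column in toy.items()]

print(rows)
print(column_names)
```

Row-wise iteration is the shape a per-rating update loop needs, which is why the training cell uses `itertuples(index=False)`.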

In [12]:
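# Stochastic gradient descent over individual ratings
epoch_num = 10   # number of passes over the data
alpha = 0.01     # learning rate
lamda = 0.01     # regularization weight
for epoch in range(epoch_num):
    print("Epoch: %d" % epoch)
    for uid, iid, rating in data.itertuples(index=False):
        # Residual of the current prediction for this rating
        error = rating - global_mean - bu[uid] - bi[iid]
        # Step each bias toward the residual, shrunk by the regularizer
        bu[uid] += alpha * (error - lamda * bu[uid])
        bi[iid] += alpha * (error - lamda * bi[iid])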


Epoch: 0
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
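
The training cell prints only the epoch counter, so it gives no signal about whether the biases are converging. One possible way to check is to record the training RMSE after every epoch. The sketch below applies the same per-rating update to a hypothetical toy data set in plain Python. The helper names `rmse` and `fit_baseline_sgd`, the toy ratings, and the larger `alpha` and `epochs` values are assumptions made for illustration.

```python
import math

def rmse(ratings, mu, bu, bi):
    """Root-mean-square error of mu + bu[u] + bi[i] over (u, i, r) triples."""
    squared = [(r - mu - bu[u] - bi[i]) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

def fit_baseline_sgd(ratings, epochs=100, alpha=0.05, lamda=0.01):
    """Per-rating SGD on the biases, recording the training RMSE per epoch."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u, _, _ in ratings}
    bi = {i: 0.0 for _, i, _ in ratings}
    history = [rmse(ratings, mu, bu, bi)]  # error with all biases at zero
    for _ in range(epochs):
        for u, i, r in ratings:
            error = r - mu - bu[u] - bi[i]
            bu[u] += alpha * (error - lamda * bu[u])
            bi[i] += alpha * (error - lamda * bi[i])
        history.append(rmse(ratings, mu, bu, bi))
    return mu, bu, bi, history

# Hypothetical toy ratings as (userId, movieId, rating) triples.
toy_ratings = [(1, 10, 5.0), (1, 20, 4.0), (2, 10, 3.0),
               (2, 20, 2.0), (3, 10, 4.0), (3, 30, 1.0)]
mu, bu_toy, bi_toy, history = fit_baseline_sgd(toy_ratings)
print("RMSE before: %.3f, after: %.3f" % (history[0], history[-1]))
```

This measures error on the training ratings only. Judging generalization would need a held-out split, which this sketch does not include.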


In [13]:
bu

{1: 0.781645080712339,
 2: 0.05084129089064704,
 3: -0.9856783273888134,
 4: -0.2648176988499452,
 5: -0.08442373074844758,
 6: 0.10590347232924022,
 7: -0.5166782894807255,
 8: -0.036429070489573535,
 9: -0.3261681694414816,
 10: -0.33643800173637034,
 11: 0.3108314997495554,
 12: 0.9498664660409566,
 13: 0.13744581114733412,
 14: -0.10843077331458348,
 15: -0.4049404378095475,
 16: -0.29679202302524743,
 17: 0.18006032629716537,
 18: 0.07417122949773162,
 19: -0.8744083071918849,
 20: -0.08299295728593677,
 21: -0.5021384954276988,
 22: -1.5153943552287248,
 23: -0.2634960872077756,
 24: -0.13965932435717823,
 25: 0.8021940895437081,
 26: -0.28545947465262467,
 27: -0.13081936614604808,
 28: -0.5989213922888809,
 29: 0.27204186024087174,
 30: 0.7173915479983927,
 31: 0.2551552081339989,
 32: 0.16804431293811994,
 33: 0.04658882434957617,
 34: -0.08576432673975187,
 35: 0.4759342508428386,
 36: -0.9390170192264451,
 37: 0.40684685470220094,
 38: -0.31678118688946105,
 39: 0.1514268339

In [14]:
# Predict a rating
def predict(uid, iid):
    # Baseline prediction: global mean plus the user and item biases
    return global_mean + bu[uid] + bi[iid]

In [15]:
userid = 1
movieid = 5
predict_rating = predict(userid, movieid)
predict_rating

3.7223945847712807
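
The `predict` function looks up both biases with plain dictionary indexing, so it raises `KeyError` for a user or movie that has no entry in `bu` or `bi`. A possible variant, sketched below, falls back to a zero bias for unseen ids and clips the result to the rating scale. The name `predict_safe`, the toy bias values, and the 0.5 to 5.0 bounds are assumptions for illustration, not part of the original notebook.

```python
def predict_safe(uid, iid, global_mean, bu, bi, low=0.5, high=5.0):
    """Baseline prediction that tolerates unseen ids and clips to the scale."""
    # An unknown user or item contributes a zero bias.
    raw = global_mean + bu.get(uid, 0.0) + bi.get(iid, 0.0)
    return min(high, max(low, raw))

# Hypothetical bias values for illustration only.
toy_mean = 3.5
toy_bu = {1: 0.78, 2: 2.0}
toy_bi = {5: -0.56, 7: 1.0}

print(predict_safe(1, 5, toy_mean, toy_bu, toy_bi))    # known user and item
print(predict_safe(999, 5, toy_mean, toy_bu, toy_bi))  # unseen user
print(predict_safe(2, 7, toy_mean, toy_bu, toy_bi))    # clipped to the upper bound
```

Passing the parameters explicitly instead of reading globals keeps the function usable with any trained set of biases.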