### Background:

1. word2vec: takes (doc, words) as input and produces word embeddings
2. item2vec: takes (userid, itemids) as input and produces item embeddings
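
To make the analogy concrete, here is a minimal sketch of how a user's item list can play the role of a sentence: each element is paired with its neighbours inside a window, which is the kind of (target, context) pair that skip-gram style training consumes. The helper name `context_pairs` and the window size are illustrative assumptions, not part of any library.

```python
def context_pairs(sequence, window=2):
    """Return (target, context) pairs for every element of a sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an element is not its own context
                pairs.append((target, sequence[j]))
    return pairs

# word2vec view: a document is a list of words
doc_pairs = context_pairs(["the", "cat", "sat"], window=1)

# item2vec view: a user's history is a list of item ids (made-up ids)
item_pairs = context_pairs(["1", "3", "6"], window=1)
```

The same function serves both cases, which is the point: only the meaning of the tokens changes. If the order of items in a history is not meaningful, the window can be set as large as the longest list so that every pair of items in a history co-occurs.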

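Notes:

1. Recommending with embeddings of the tokenized title/content is content-similarity recommendation.
2. Recommending with embeddings learned from behavior lists is behavior-based recommendation, which typically works better in practice than content-similarity recommendation.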
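
Whichever kind of embedding is used, the recommendation step is commonly a nearest-neighbour lookup by cosine similarity. The sketch below assumes the embeddings are already available as a plain dict; the names `cosine`, `most_similar`, and the toy vectors are hypothetical and only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(item_id, embeddings, top_n=2):
    """Rank all other items by cosine similarity to the given item."""
    query = embeddings[item_id]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up 2-d vectors for three movie ids
toy_embeddings = {"1": [1.0, 0.0], "3": [0.9, 0.1], "6": [0.0, 1.0]}
neighbours = most_similar("1", toy_embeddings, top_n=1)
```

With content embeddings the neighbours are items that are described similarly; with behavior embeddings they are items that tend to appear in the same users' histories.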

Extensions:

1. Summing or averaging the word embeddings of a document gives a document embedding;
2. Summing or averaging the item embeddings of a user's items gives a user embedding.
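
The averaging step can be sketched in a few lines. The vectors and the `mean_vector` helper below are made up for illustration; in practice the item vectors would come from a trained model.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

# Made-up item embeddings keyed by movie id
toy_item_embeddings = {"1": [1.0, 0.0], "3": [0.0, 1.0], "6": [1.0, 1.0]}

# A user who liked movies "1" and "3"
watched = ["1", "3"]
user_embedding = mean_vector([toy_item_embeddings[m] for m in watched])
```

A document embedding is built the same way, with the words of the document in place of the user's items. Summing instead of averaging keeps the direction of the vector but lets its length grow with the number of items, so averaging is often the more convenient choice when histories differ a lot in size.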

#### Data preparation

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv("./dataset/datas/ml-latest-small/ratings.csv")

In [3]:
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [4]:
df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [6]:
# Keep only the ratings at or above the mean rating
df_new = df[df["rating"] >= df["rating"].mean()]

In [7]:
df_new

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100830,610,166528,4.0,1493879365
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047


In [9]:
# Group by user id to get (userId, movieIds), with the ids as strings
df_group = df_new.groupby("userId")["movieId"].apply(lambda x: [str(m) for m in x]).reset_index()
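
For readers who want to see what the filter and group steps do without pandas, the following is a rough pure-Python equivalent on a handful of made-up rows. The row values and variable names are assumptions for illustration only.

```python
from collections import defaultdict

# Toy (userId, movieId, rating) rows; the values are made up
rows = [(1, 10, 5.0), (1, 20, 2.0), (2, 10, 4.0), (2, 30, 3.0)]

# Mean rating over all rows, used as the "liked" threshold
mean_rating = sum(rating for _, _, rating in rows) / len(rows)

# Collect, per user, the ids of movies rated at or above the mean
liked = defaultdict(list)
for user_id, movie_id, rating in rows:
    if rating >= mean_rating:
        liked[user_id].append(str(movie_id))

sequences = dict(liked)
```

The ids are converted to strings because each list is later treated as a sentence of tokens; Spark's `Word2Vec` expects its input column to be an array of strings.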

In [10]:
df_group

Unnamed: 0,userId,movieId
0,1,"[1, 3, 6, 47, 50, 101, 110, 151, 157, 163, 216..."
1,2,"[333, 1704, 3578, 6874, 46970, 48516, 58559, 6..."
2,3,"[849, 1587, 2288, 2851, 3024, 3703, 4518, 5181..."
3,4,"[106, 125, 162, 176, 215, 232, 260, 265, 319, ..."
4,5,"[1, 21, 34, 36, 50, 58, 110, 232, 247, 261, 29..."
...,...,...
604,606,"[17, 18, 29, 32, 46, 50, 68, 70, 73, 80, 82, 1..."
605,607,"[1, 36, 86, 110, 150, 165, 188, 241, 292, 318,..."
606,608,"[10, 16, 47, 50, 110, 170, 172, 293, 296, 318,..."
607,609,"[10, 253, 296, 318, 356, 457, 590, 731, 1150, ..."


#### pyspark.ml Word2Vec

In [16]:
import findspark
findspark.init()

In [17]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("pyspark item2vec") \
        .getOrCreate()

In [18]:
sc = spark.sparkContext