## Training item2vec in Python for Related-Movie Recommendations

Background:
* word2vec: takes (doc, words) as input and produces word embeddings
* item2vec: takes (userid, itemids) as input and produces item embeddings
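
As a minimal sketch of this analogy, each user's liked items can be grouped into one "sentence" of item-ID tokens, which is the input shape a word2vec trainer expects. The toy `ratings` tuples, the `LIKE_THRESHOLD` value, and the helper name `build_sentences` are made up for illustration and are not taken from MovieLens or the notebook:

```python
from collections import defaultdict

# Hypothetical toy interactions: (user_id, item_id, rating).
ratings = [
    (1, 296, 5.0),
    (1, 306, 3.5),
    (1, 307, 5.0),
    (2, 110, 4.0),
    (2, 296, 2.0),
]

# Assumed threshold for "liked"; the notebook uses the mean rating instead.
LIKE_THRESHOLD = 3.5


def build_sentences(rows, threshold):
    """Group liked item IDs per user into token lists (one 'sentence' per user)."""
    sentences = defaultdict(list)
    for user_id, item_id, rating in rows:
        if rating > threshold:
            sentences[user_id].append(str(item_id))
    return dict(sentences)


sentences = build_sentences(ratings, LIKE_THRESHOLD)
print(sentences)
```

Item IDs are stored as strings because word2vec treats each token as a word, not as a number.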

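Notes:
* Recommending with embeddings of the tokenized title/content is content-similarity recommendation
* Recommending with embeddings trained on behavior lists is behavior-based recommendation, which typically works better than content-similarity recommendation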

Extensions:
* Summing or averaging word embeddings gives a document embedding.
* Summing or averaging item embeddings gives a user embedding.
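
A minimal sketch of the averaging step, assuming a hypothetical dictionary `item_vectors` with invented 3-dimensional embeddings (real vectors would come from a trained model) and a hypothetical helper `average_embedding`:

```python
import numpy as np

# Hypothetical item embeddings; values invented for illustration.
item_vectors = {
    "296": np.array([1.0, 0.0, 2.0]),
    "307": np.array([3.0, 2.0, 0.0]),
    "665": np.array([2.0, 4.0, 1.0]),
}


def average_embedding(item_ids, vectors):
    """Mean of the embeddings of the given items; unknown IDs are skipped."""
    known = [vectors[i] for i in item_ids if i in vectors]
    if not known:
        raise ValueError("none of the item IDs has an embedding")
    return np.mean(known, axis=0)


# "999" has no embedding and is ignored.
user_embedding = average_embedding(["296", "307", "665", "999"], item_vectors)
print(user_embedding)
```

The same function would produce a document embedding if `vectors` held word embeddings and `item_ids` held the words of a document.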

### 1. Load the data

In [2]:
import pandas as pd

In [30]:
df = pd.read_csv("./datas/ml-25m/ratings.csv")
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [31]:
df["rating"].mean()

3.533854451353085

In [32]:
# Keep only ratings above the mean and treat them as each user's liked list
df = df[df["rating"] > df["rating"].mean()].copy()
df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
5,1,1088,4.0,1147868495
8,1,1237,5.0,1147868839


In [33]:
# Aggregate into one row per userId with a space-separated list of movieIds
df_group = df.groupby(['userId'])['movieId'].apply(lambda x: ' '.join([str(m) for m in x])).reset_index()
df_group.head()

Unnamed: 0,userId,movieId
0,1,296 307 665 1088 1237 1250 1653 2351 2573 2632...
1,2,110 150 151 236 260 318 333 349 356 364 457 49...
2,3,1 29 32 50 111 172 214 260 293 296 318 356 527...
3,4,296 541 589 924 1036 1136 1196 1197 1198 1220 ...
4,5,1 19 32 36 47 50 88 104 141 147 150 170 216 22...


In [34]:
df_group.to_csv("./datas/movielens_uid_movieids.csv", index=False)

In [35]:
!hadoop fs -put ./datas/movielens_uid_movieids.csv /dataset/ml-25m/movielens_uid_movieids.csv

2021-03-24 16:22:09,282 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false


### 3. Train item2vec with PySpark

In [36]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("PySpark Item2vec") \
    .getOrCreate()

sc = spark.sparkContext

#### Read the CSV data with PySpark

In [37]:
df = spark.read.csv("/dataset/ml-25m/movielens_uid_movieids.csv", header=True)
df.show(5)

+------+--------------------+
|userId|             movieId|
+------+--------------------+
|     1|296 307 665 1088 ...|
|     2|110 150 151 236 2...|
|     3|1 29 32 50 111 17...|
|     4|296 541 589 924 1...|
|     5|1 19 32 36 47 50 ...|
+------+--------------------+
only showing top 5 rows



In [38]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Split the space-separated string into a list
df = df.withColumn('movie_ids', F.split(df.movieId, " "))

#### Train word2vec and extract the vectors

In [39]:
# https://spark.apache.org/docs/2.4.6/ml-features.html#word2vec

from pyspark.ml.feature import Word2Vec

word2Vec = Word2Vec(
    vectorSize=5, 
    minCount=0, 
    inputCol="movie_ids", 
    outputCol="movie_2vec")

model = word2Vec.fit(df)

In [40]:
# Instead of computing an embedding per user, get the embedding of each item
model.getVectors().show(3, truncate=False)

+------+-------------------------------------------------------------------------------------------------------+
|word  |vector                                                                                                 |
+------+-------------------------------------------------------------------------------------------------------+
|186427|[0.3085939586162567,-0.24025790393352509,-0.17938601970672607,-0.3837507665157318,0.03265058994293213] |
|132561|[0.5781996250152588,-1.0694571733474731,0.27098149061203003,-0.884647786617279,0.26468193531036377]    |
|133718|[0.22011743485927582,-0.43878602981567383,0.07164935022592545,-0.20302961766719818,0.11126227676868439]|
+------+-------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [41]:
model.getVectors().select("word", "vector") \
           .toPandas() \
           .to_csv('./datas/movielens_movie_embedding.csv', index=False)

### 4. Find the 10 most similar movies for a given movie

In [42]:
df_embedding = pd.read_csv("./datas/movielens_movie_embedding.csv")
df_embedding.head(3)

Unnamed: 0,word,vector
0,186427,"[0.3085939586162567,-0.24025790393352509,-0.17..."
1,132561,"[0.5781996250152588,-1.0694571733474731,0.2709..."
2,133718,"[0.22011743485927582,-0.43878602981567383,0.07..."


In [43]:
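# Note: titles are read from ml-latest-small while the embeddings were trained on ml-25m,
# so the inner merge below keeps only movies present in both files
df_movie = pd.read_csv("./datas/ml-latest-small/movies.csv")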
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [44]:
df_merge = pd.merge(left=df_embedding, 
                    right=df_movie,
                    left_on="word",
                    right_on="movieId")
df_merge.head()

Unnamed: 0,word,vector,movieId,title,genres
0,26985,"[0.16287139058113098,0.1859545111656189,-1.185...",26985,Nirvana (1997),Action|Sci-Fi
1,5451,"[-0.14193858206272125,0.9305868744850159,-0.28...",5451,Pumpkin (2002),Comedy|Drama|Romance
2,4018,"[0.3904050588607788,0.39037156105041504,0.7572...",4018,What Women Want (2000),Comedy|Romance
3,184641,"[0.2873527407646179,-0.25498709082603455,-0.08...",184641,Fullmetal Alchemist 2018 (2017),Action|Adventure|Fantasy
4,4056,"[0.4343894124031067,0.2795574367046356,0.55841...",4056,"Pledge, The (2001)",Crime|Drama|Mystery|Thriller


In [45]:
# Parse each vector string (a JSON array) into a NumPy array
import numpy as np
import json
df_merge["vector"] = df_merge["vector"].map(lambda x : np.array(json.loads(x)))

In [46]:
# Pick an arbitrary movie: 4018 What Women Want (2000)
movie_id = 4018
df_merge.loc[df_merge["movieId"]==movie_id]

Unnamed: 0,word,vector,movieId,title,genres
2,4018,"[0.3904050588607788, 0.39037156105041504, 0.75...",4018,What Women Want (2000),Comedy|Romance


In [47]:
movie_embedding = df_merge.loc[df_merge["movieId"]==movie_id, "vector"].iloc[0]
movie_embedding

array([ 0.39040506,  0.39037156,  0.75722688, -0.70923549,  1.7250036 ])

In [48]:
# Cosine similarity (1 - cosine distance)
from scipy.spatial import distance
df_merge["sim_value"] = df_merge["vector"].map(lambda x : 1 - distance.cosine(movie_embedding, x))
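
For reference, `scipy.spatial.distance.cosine` returns the cosine distance, so subtracting it from 1 gives the cosine similarity between the query vector $a$ and a candidate vector $b$. The symbols $a$, $b$, $n$ and $d_{\cos}$ are introduced here for illustration:

$$
\mathrm{sim}(a, b) = 1 - d_{\cos}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The value lies in $[-1, 1]$, and a vector compared with itself gives 1, which is why the query movie 4018 scores 1.0 against itself.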

In [49]:
df_merge[["movieId", "title", "genres", "sim_value"]].head(3)

Unnamed: 0,movieId,title,genres,sim_value
0,26985,Nirvana (1997),Action|Sci-Fi,-0.056454
1,5451,Pumpkin (2002),Comedy|Drama|Romance,0.573961
2,4018,What Women Want (2000),Comedy|Romance,1.0


In [50]:
# Sort by similarity in descending order and show the top 10
df_sorted = df_merge.sort_values(by="sim_value", ascending=False)[["movieId", "title", "genres", "sim_value"]]
df_sorted.head(10)

Unnamed: 0,movieId,title,genres,sim_value
2,4018,What Women Want (2000),Comedy|Romance,1.0
3148,4019,Finding Forrester (2000),Drama,0.999715
5190,4022,Cast Away (2000),Drama,0.999557
1624,3999,Vertical Limit (2000),Action|Adventure,0.999478
5057,4016,"Emperor's New Groove, The (2000)",Adventure|Animation|Children|Comedy|Fantasy,0.999336
8920,4014,Chocolat (2000),Drama|Romance,0.999333
5019,4015,"Dude, Where's My Car? (2000)",Comedy|Sci-Fi,0.999196
7963,3994,Unbreakable (2000),Drama|Sci-Fi,0.99914
4029,3993,Quills (2000),Drama|Romance,0.999022
6429,3992,Malèna (2000),Drama|Romance|War,0.998928
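
The lookup steps above could be wrapped into a reusable helper. The following is a sketch under assumptions: the function name `top_k_similar` and the toy IDs and vectors are hypothetical and not part of the notebook. It computes the cosine similarity against all items in one pass with NumPy instead of calling SciPy per row, and, unlike the table above, it excludes the query item from the result:

```python
import numpy as np


def top_k_similar(query_id, ids, vectors, k=10):
    """Return the k (id, cosine similarity) pairs most similar to query_id.

    ids:     list of item IDs, aligned with the rows of `vectors`
    vectors: 2-D array-like of shape (n_items, dim) with non-zero rows
    The query item itself is excluded from the result.
    """
    vectors = np.asarray(vectors, dtype=float)
    query = vectors[ids.index(query_id)]
    # Cosine similarity of the query against every row at once.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / norms
    order = np.argsort(-sims)  # indices sorted by descending similarity
    ranked = [(ids[i], float(sims[i])) for i in order if ids[i] != query_id]
    return ranked[:k]


# Hypothetical toy embeddings, invented for illustration.
toy_ids = [10, 20, 30, 40]
toy_vectors = [
    [1.0, 0.0],   # item 10 (the query)
    [2.0, 0.0],   # same direction as item 10
    [0.0, 3.0],   # orthogonal to item 10
    [-1.0, 0.0],  # opposite direction
]

result = top_k_similar(10, toy_ids, toy_vectors, k=2)
print(result)
```

With the notebook's data, `ids` would correspond to `df_merge["movieId"].tolist()` and `vectors` to `np.stack(df_merge["vector"])`, assuming the vectors have already been parsed into arrays.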
