## Movie Recommendation in Python with Spark ALS Matrix Factorization

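Background:
* Collaborative filtering: in short, recommending items a user may be interested in based on the preferences of a group with similar tastes and shared experience, i.e. the wisdom of the crowd
* Matrix factorization: decompose the (user, item, interaction) matrix into two sub-matrices, (user, latent vector) and (item, latent vector), and recommend using the latent vectors
* ALS: alternating least squares. Start from an initial value U(0) for U, compute V(0) from U(0), then compute U(1) from V(0), and iterate until convergence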
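
The alternating update described in the last bullet can be sketched in plain NumPy. This is only a toy illustration under stated assumptions: the function name `als_factorize`, the small rating matrix, and the plain L2 penalty `reg` are all made up for this example, and Spark's ALS differs in its details (for instance in how regularization is scaled and how the work is distributed).

```python
import numpy as np

def als_factorize(R, mask, rank=2, reg=0.1, n_iter=10, seed=0):
    """Toy ALS: approximate the observed entries of R (users x items) by U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # U(0): random initial guess
    V = np.zeros((n_items, rank))                    # overwritten in the first half-step
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(n_iter):
        # Fix U and solve a small ridge regression for each item vector.
        for i in range(n_items):
            rated = mask[:, i]
            if rated.any():
                Ui = U[rated]
                V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[rated, i])
        # Fix V and solve a small ridge regression for each user vector.
        for u in range(n_users):
            rated = mask[u]
            if rated.any():
                Vu = V[rated]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, rated])
        # Regularized squared error on the observed entries only.
        err = (R - U @ V.T)[mask]
        losses.append(float(err @ err + reg * ((U ** 2).sum() + (V ** 2).sum())))
    return U, V, losses

# Hypothetical 5-user x 4-item rating matrix; 0 means "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
U, V, losses = als_factorize(R, mask)
print(np.round(U @ V.T, 2))  # reconstructed ratings, including the unrated cells
print(losses[0], losses[-1])
```

Because each half-step solves its sub-problem exactly while the other factor is held fixed, the regularized loss in this sketch should not increase from one iteration to the next.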

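Demo goals:
1. Perform the matrix factorization to obtain user embeddings and item embeddings
2. For a target user, run a nearest-neighbor search to get the list of recommended items (watched items must be removed, and movie titles looked up)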

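Extensions:
1. Searching within the user embeddings themselves yields recommendations of like-minded users
2. Searching within the item embeddings themselves yields related-item recommendations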
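
As a rough sketch of the second extension, the snippet below finds the nearest neighbors of one item among a set of item embeddings using cosine similarity. The helper `top_k_similar` and the toy vectors are hypothetical, and the sketch assumes no embedding is the zero vector. Passing user embeddings instead of item embeddings would give the first extension.

```python
import numpy as np

def top_k_similar(embeddings, ids, query_id, k=3):
    """Return (id, cosine similarity) for the k embeddings closest to query_id."""
    ids = np.asarray(ids)
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row
    query = X[ids == query_id][0]
    sims = X @ query                                  # cosine similarity to the query
    sims[ids == query_id] = -np.inf                   # never return the query itself
    order = np.argsort(-sims)[:k]
    return [(int(ids[j]), float(sims[j])) for j in order]

# Hypothetical 2-dimensional embeddings for four items.
toy_ids = [10, 20, 30, 40]
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = top_k_similar(toy_vecs, toy_ids, query_id=10, k=2)
print(neighbors)
```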

In [1]:
import pandas as pd
import numpy as np
import json

import findspark
findspark.init()

from pyspark.sql import SparkSession

### 1. Reading CSV data with PySpark

In [2]:
spark = SparkSession \
    .builder \
    .appName("PySpark ALS") \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

In [4]:
# Specify the column types used to parse the CSV file
customSchema = T.StructType([
    T.StructField("userId", T.IntegerType(), True),        
    T.StructField("movieId", T.IntegerType(), True),
    T.StructField("rating", T.FloatType(), True),
    T.StructField("timestamp", T.LongType(), True),
])

In [9]:
df = spark.read.csv(
    "/ratings-small.csv", 
    header=True,
    schema=customSchema
)
df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|   1193|   5.0|978300760|
|     1|    661|   3.0|978302109|
|     1|    914|   3.0|978301968|
|     1|   3408|   4.0|978300275|
|     1|   2355|   5.0|978824291|
+------+-------+------+---------+
only showing top 5 rows



In [10]:
df.select("userId").distinct().count()

6040

In [11]:
df.select("movieId").distinct().count()

3706

In [12]:
df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- timestamp: long (nullable = true)



### 2. Matrix factorization with Spark ALS

In [13]:
from pyspark.ml.recommendation import ALS

In [14]:
als = ALS(
    maxIter=5, 
    regParam=0.01, 
    userCol="userId", 
    itemCol="movieId", 
    ratingCol="rating",
    coldStartStrategy="drop")

# Train the model
model = als.fit(df)

#### Saving the user embeddings

In [15]:
model.userFactors.show(5)

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[-1.4438248, 0.04...|
| 20|[-1.1907033, -0.1...|
| 30|[-1.398386, 0.508...|
| 40|[-0.5107502, 0.17...|
| 50|[-0.4870928, 0.06...|
+---+--------------------+
only showing top 5 rows



In [16]:
model.userFactors.count()

6040

In [17]:
model.userFactors.select("id", "features") \
           .toPandas() \
           .to_csv('./datas/movielens_sparkals_user_embedding.csv', index=False)

#### Saving the item embeddings

In [18]:
model.itemFactors.show(5)

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[-0.4235651, 0.13...|
| 20|[-0.46852362, 0.3...|
| 30|[-0.23346072, -0....|
| 40|[-0.8263968, -0.6...|
| 50|[-0.4746313, 0.16...|
+---+--------------------+
only showing top 5 rows



In [19]:
model.itemFactors.count()

3706

In [20]:
model.itemFactors.select("id", "features") \
           .toPandas() \
           .to_csv('./datas/movielens_sparkals_item_embedding.csv', index=False)

### 4. Finding the 10 movies a given user is most likely to enjoy

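Approach:
1. Look up the target user's embedding
2. Compute the similarity between the target user's embedding and every movie embedding
3. Compute the set of movies the user has already watched
4. Remove the watched movies from the result of step 2, then pick the top 10 movies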
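
The four steps above can be sketched as a single function. This is an illustrative outline only: `recommend_for_user` and the toy DataFrames are hypothetical, and the column names (`id`, `features`, `userId`, `movieId`) are assumed to match the files used in this notebook, with `features` already parsed into numeric vectors.

```python
import numpy as np
import pandas as pd

def recommend_for_user(user_id, df_user_emb, df_item_emb, df_ratings, k=10):
    # Step 1: look up the target user's embedding.
    row = df_user_emb.loc[df_user_emb["id"] == user_id, "features"].iloc[0]
    user_vec = np.asarray(row, dtype=float)
    # Step 2: cosine similarity between the user and every item embedding.
    items = np.array(df_item_emb["features"].tolist(), dtype=float)
    sims = items @ user_vec / (np.linalg.norm(items, axis=1) * np.linalg.norm(user_vec))
    # Step 3: the set of items the user has already rated.
    watched = set(df_ratings.loc[df_ratings["userId"] == user_id, "movieId"])
    # Step 4: drop watched items and keep the k most similar ones.
    scored = df_item_emb[["id"]].assign(sim_value=sims)
    scored = scored[~scored["id"].isin(watched)]
    return scored.sort_values("sim_value", ascending=False).head(k)

# Hypothetical toy data: one user, three items, one of them already watched.
toy_users = pd.DataFrame({"id": [1], "features": [[1.0, 0.0]]})
toy_items = pd.DataFrame({"id": [100, 200, 300],
                          "features": [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]})
toy_ratings = pd.DataFrame({"userId": [1], "movieId": [100]})
toy_result = recommend_for_user(1, toy_users, toy_items, toy_ratings, k=2)
print(toy_result)
```

Item 100 is the closest match but is filtered out because the toy user has already rated it.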

In [21]:
# Target user ID
target_user_id = 1

#### 4.1 Loading the data files

In [22]:
df_movie = pd.read_csv("./datas/ml-25m/movies.csv")
df_movie_embedding = pd.read_csv("./datas/movielens_sparkals_item_embedding.csv")
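# Caution: the model above was trained on /ratings-small.csv, while this loads the ml-25m ratings;
# the userId values are only comparable if both files come from the same dataset.
df_rating = pd.read_csv("./datas/ml-25m/ratings.csv")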
df_user_embedding = pd.read_csv("./datas/movielens_sparkals_user_embedding.csv")

In [23]:
# Parse each embedding from its string form into a numeric vector
df_movie_embedding["features"] = df_movie_embedding["features"].map(lambda x : np.array(json.loads(x)))
df_user_embedding["features"] = df_user_embedding["features"].map(lambda x : np.array(json.loads(x)))
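
To make the parsing step concrete, here is a small round trip on made-up data: a list-valued column is written to CSV, read back as text, and parsed with `json.loads`. The toy DataFrame is an assumption for illustration. The approach relies on each cell being a Python list when it is written; if the cells were NumPy arrays, the text would have no commas and `json.loads` would fail.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical embeddings stored as Python lists, one list per row.
toy = pd.DataFrame({"id": [10, 20], "features": [[-1.5, 0.25], [0.5, 2.0]]})

# Write to an in-memory CSV and read it back; the list becomes plain text.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
raw_cell = restored.loc[0, "features"]
print(type(raw_cell), raw_cell)

# Parse the text back into a numeric vector.
restored["features"] = restored["features"].map(lambda s: np.array(json.loads(s)))
print(restored.loc[0, "features"])
```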

#### 4.2 Looking up the user's embedding

In [24]:
df_user_embedding.head(3)

Unnamed: 0,id,features
0,10,"[-1.4438247680664062, 0.0444270484149456, 0.49..."
1,20,"[-1.190703272819519, -0.1952604204416275, -0.2..."
2,30,"[-1.398386001586914, 0.5086374282836914, -0.43..."


In [25]:
user_embedding = df_user_embedding[df_user_embedding["id"] == target_user_id].iloc[0, 1]
user_embedding

array([-0.82889444,  0.06750389,  1.12229645,  1.00567043, -1.98782277,
        0.43003121,  0.45709184,  2.77200723, -0.18764918, -1.09612036])

#### 4.3 Computing the similarity between the user embedding and all item embeddings

In [26]:
df_movie_embedding.head(3)

Unnamed: 0,id,features
0,10,"[-0.4235650897026062, 0.13135862350463867, -0...."
1,20,"[-0.46852362155914307, 0.3792363405227661, 0.4..."
2,30,"[-0.23346072435379028, -0.09688711166381836, 0..."


In [27]:
# Cosine similarity
from scipy.spatial import distance
df_movie_embedding["sim_value"] = (
    df_movie_embedding["features"].map(lambda x : 1 - distance.cosine(user_embedding, x)))
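
The per-row `map` above can also be written as one matrix operation. The sketch below computes the same cosine score in vectorized form and, for comparison, the raw dot product, which is how an ALS model scores a (user, item) pair when predicting a rating. The function names and toy vectors are hypothetical, and the choice between the two scores is a design decision rather than something this notebook prescribes.

```python
import numpy as np

def cosine_scores(user_vec, item_matrix):
    """Cosine similarity between one user vector and every row of item_matrix."""
    user_vec = np.asarray(user_vec, dtype=float)
    item_matrix = np.asarray(item_matrix, dtype=float)
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_vec)
    return item_matrix @ user_vec / norms

def dot_scores(user_vec, item_matrix):
    """Raw dot product between one user vector and every row of item_matrix."""
    return np.asarray(item_matrix, dtype=float) @ np.asarray(user_vec, dtype=float)

# Hypothetical vectors: items 0 and 1 point the same way but differ in length.
toy_user = np.array([1.0, 2.0])
toy_item_matrix = np.array([[2.0, 4.0], [0.1, 0.2], [3.0, 0.0]])
cos = cosine_scores(toy_user, toy_item_matrix)
dot = dot_scores(toy_user, toy_item_matrix)
print(cos)
print(dot)
```

Cosine similarity ignores vector length, so the first two toy items tie, whereas the dot product ranks the longer vector higher. Which behavior is preferable depends on whether the magnitude of an embedding should influence the ranking.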

In [28]:
df_movie_embedding.head(3)

Unnamed: 0,id,features,sim_value
0,10,"[-0.4235650897026062, 0.13135862350463867, -0....",0.759754
1,20,"[-0.46852362155914307, 0.3792363405227661, 0.4...",0.746101
2,30,"[-0.23346072435379028, -0.09688711166381836, 0...",0.344172


#### 4.4 Computing the set of movieIds the user has watched

In [29]:
df_rating.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828


In [30]:
# Filter by user, select one column, deduplicate, convert to a set
watched_ids = set(df_rating[df_rating["userId"] == target_user_id]["movieId"].unique())
len(watched_ids)

70

#### 4.5 Selecting the 10 recommended movie IDs

In [31]:
df_movie_embedding.head(3)

Unnamed: 0,id,features,sim_value
0,10,"[-0.4235650897026062, 0.13135862350463867, -0....",0.759754
1,20,"[-0.46852362155914307, 0.3792363405227661, 0.4...",0.746101
2,30,"[-0.23346072435379028, -0.09688711166381836, 0...",0.344172


In [32]:
# Drop watched movies, sort by similarity, keep the top 10 IDs
df_target_movieIds = (
    df_movie_embedding[~df_movie_embedding["id"].isin(watched_ids)]
        .sort_values(by="sim_value", ascending=False)
        .head(10)
        [["id", "sim_value"]]
)
df_target_movieIds

Unnamed: 0,id,sim_value
138,1470,0.970562
221,2420,0.969357
1339,2423,0.955955
1526,404,0.948907
394,311,0.931636
3351,139,0.926449
770,292,0.925331
46,480,0.919895
2900,3147,0.918364
593,2421,0.9173


#### 4.6 Looking up the movie titles to show to the user

In [33]:
pd.merge(
    left=df_target_movieIds,
    right=df_movie,
    left_on="id",
    right_on="movieId"
)[["movieId", "title", "genres", "sim_value"]]

Unnamed: 0,movieId,title,genres,sim_value
0,1470,Rhyme & Reason (1997),Documentary,0.970562
1,2420,"Karate Kid, The (1984)",Drama,0.969357
2,2423,Christmas Vacation (National Lampoon's Christm...,Comedy,0.955955
3,404,Brother Minister: The Assassination of Malcolm...,Documentary,0.948907
4,311,Relative Fear (1994),Horror|Thriller,0.931636
5,139,Target (1995),Action|Drama,0.926449
6,292,Outbreak (1995),Action|Drama|Sci-Fi|Thriller,0.925331
7,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,0.919895
8,3147,"Green Mile, The (1999)",Crime|Drama,0.918364
9,2421,"Karate Kid, Part II, The (1986)",Action|Adventure|Drama,0.9173
