Data loading

In [1]:
import pandas as pd

# Demo check: load the data and inspect its shape
train_df = pd.read_csv('Data/Movie_RS.csv')
print(train_df.shape)
train_df.head(1)

(10000, 13)


Unnamed: 0,ID,Movie_Name,Movie_Score,Review_Count,Movie_Star_Distribution,Collect_Date,Username,Post_Date,Score,User_Comment,User_Comment_Distribution,Comment_Like_Count,Movie_Tags
0,0,"1988年的妮可 Nico, 1988",7.5,565,15.2%48.2%32.3%3.4%0.8%,2019-10-05,尾黑,2018-06-23,3,成本低廉的PPT电影，用Nico生命中最后一年发生的事给Nico的歌配上情节，倒不算尴尬。女...,66%31%3%,4,"['音乐', '电影', '儿子', '丝绒', '人物', '传记', '传记片', '歌..."


#### Data preprocessing

In [2]:
# Drop rows that contain any null value
train_df.dropna(axis=0, how='any', inplace=True)

# Drop duplicate (Movie_Name, Username) pairs, keeping the first occurrence
train_df.drop_duplicates(subset=['Movie_Name', 'Username'], keep='first', inplace=True)

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9999 entries, 0 to 9999
Data columns (total 13 columns):
ID                           9999 non-null int64
Movie_Name                   9999 non-null object
Movie_Score                  9999 non-null float64
Review_Count                 9999 non-null int64
Movie_Star_Distribution      9999 non-null object
Collect_Date                 9999 non-null object
Username                     9999 non-null object
Post_Date                    9999 non-null object
Score                        9999 non-null int64
User_Comment                 9999 non-null object
User_Comment_Distribution    9999 non-null object
Comment_Like_Count           9999 non-null int64
Movie_Tags                   9999 non-null object
dtypes: float64(1), int64(4), object(8)
memory usage: 1.1+ MB


In [3]:
from sklearn.metrics.pairwise import cosine_similarity
from lightfm import LightFM, cross_validation
from scipy.sparse import csr_matrix, coo_matrix
from lightfm.evaluation import auc_score
from lightfm.data import Dataset
import numpy as np

### Data preprocessing

In [4]:
# Build a dict that maps each username to an integer id
user_dict = {}
for index, value in enumerate(train_df['Username'].unique()):
    user_dict[value] = index
train_df['uid_int'] = train_df['Username'].apply(lambda x: user_dict[x])

# Invert the dict (id -> username)
reverse_user_dict = {v: k for k, v in user_dict.items()}

# Build a dict that maps each movie name to an integer id
item_dict = {}
for index, value in enumerate(train_df['Movie_Name'].unique()):
    item_dict[value] = index
train_df['item_int'] = train_df['Movie_Name'].apply(lambda x: item_dict[x])

# Invert the dict (id -> movie name)
reverse_item_dict = {v: k for k, v in item_dict.items()}
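
The same id encoding can also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance, just like `enumerate(...unique())` above. This is a minimal sketch on a made-up `toy_df` (a stand-in for `train_df`, not the real data):

```python
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Username": ["ann", "bob", "ann", "cat"],
    "Movie_Name": ["M1", "M1", "M2", "M2"],
})

# factorize returns (integer codes, unique values in order of first appearance).
toy_df["uid_int"], user_index = pd.factorize(toy_df["Username"])
toy_df["item_int"], item_index = pd.factorize(toy_df["Movie_Name"])

# Forward and reverse lookups, playing the role of user_dict / reverse_user_dict.
user_dict = {name: i for i, name in enumerate(user_index)}
reverse_user_dict = dict(enumerate(user_index))

print(toy_df)
```

The item-side dicts can be built the same way from `item_index`.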

#### Set up movie and user features

In [5]:
train_df.head(1)

Unnamed: 0,ID,Movie_Name,Movie_Score,Review_Count,Movie_Star_Distribution,Collect_Date,Username,Post_Date,Score,User_Comment,User_Comment_Distribution,Comment_Like_Count,Movie_Tags,uid_int,item_int
0,0,"1988年的妮可 Nico, 1988",7.5,565,15.2%48.2%32.3%3.4%0.8%,2019-10-05,尾黑,2018-06-23,3,成本低廉的PPT电影，用Nico生命中最后一年发生的事给Nico的歌配上情节，倒不算尴尬。女...,66%31%3%,4,"['音乐', '电影', '儿子', '丝绒', '人物', '传记', '传记片', '歌...",0,0


In [6]:
# Movie features
items_f = ['Movie_Score', 'Review_Count', 'item_int','Movie_Name']

# User features
users_f = ['uid_int']

#### Data splitting

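Optimizing feature extraction:
1. Split out the movie information and encode it separately.
2. Split out the user information and encode it separately.
3. Join the two through the interaction records.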
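
The three steps can be sketched in pandas roughly as follows. The frame `toy_df` and its values are made up for illustration; the column names follow the notebook, but which columns belong in each table is an assumption:

```python
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "uid_int": [0, 1, 0],
    "item_int": [0, 0, 1],
    "Movie_Score": [7.5, 7.5, 7.9],
    "Review_Count": [565, 565, 1509],
    "Score": [3, 4, 5],
})

# 1. Movie table: one row per movie, holding movie-level features only.
movies = toy_df[["item_int", "Movie_Score", "Review_Count"]].drop_duplicates("item_int")

# 2. User table: one row per user (only the id here; user features would go alongside it).
users = toy_df[["uid_int"]].drop_duplicates()

# 3. Interaction table, joined back to both sides through the ids.
events = toy_df[["uid_int", "item_int", "Score"]]
joined = events.merge(users, on="uid_int").merge(movies, on="item_int")

print(joined)
```

Keeping movie and user features in their own tables means each one is encoded once per entity rather than once per interaction row.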

Splitting out the user interaction table

In [7]:
user_post_event = train_df[['uid_int', 'item_int', 'Score','Post_Date']]
user_post_event.shape

(9999, 4)

In [8]:
# Sort by time in descending order so the most recent records come first
user_post_event = user_post_event.sort_values(by='Post_Date', ascending=False)
# Look at the last few rows (the oldest records)
user_post_event.tail()

Unnamed: 0,uid_int,item_int,Score,Post_Date
4507,4298,12,3,2006-04-11
242,240,1,4,2006-02-03
3292,3150,9,4,2005-12-18
2072,2025,6,5,2005-10-13
204,203,1,2,2005-07-19


Add a time filter; very old data is not needed.

In [9]:
Time_Threshold = '2018-01-01'
# Filter directly on the date column (ISO-format date strings compare correctly as plain text)
user_post_event = user_post_event[user_post_event['Post_Date'] > Time_Threshold]
user_post_event.tail()

Unnamed: 0,uid_int,item_int,Score,Post_Date
4049,3872,11,4,2018-01-04
4758,4525,12,4,2018-01-03
4123,3943,11,4,2018-01-03
155,155,0,5,2018-01-02
5014,4740,13,4,2018-01-02


In [10]:
user_post_event.shape

(1755, 4)
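
The filter above compares date strings, which works because `Post_Date` is stored in `YYYY-MM-DD` form. If the format were less uniform, parsing the column first would be safer. A minimal sketch on a made-up `toy_events` frame:

```python
import pandas as pd

# Toy interaction table (values are made up for illustration).
toy_events = pd.DataFrame({
    "uid_int": [0, 1, 2],
    "Post_Date": ["2018-06-23", "2018-01-01", "2005-07-19"],
})

# Parse the strings into real timestamps, then compare against a Timestamp threshold.
threshold = pd.Timestamp("2018-01-01")
post_date = pd.to_datetime(toy_events["Post_Date"], format="%Y-%m-%d")
recent = toy_events[post_date > threshold]

print(recent)
```

As in the notebook, the comparison is strict, so rows dated exactly on the threshold are dropped.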

In [11]:
# Collect each user's sequence of movie ids
raw_movies = [user_post_event[user_post_event['uid_int'] == i]['item_int'].unique().tolist() for i in user_post_event['uid_int'].unique()]

# Convert the movie ids to strings so Word2Vec can treat them as tokens
raw_id = []
for r_list in raw_movies:
    raw_id.append([str(i) for i in r_list])
len(raw_id)

1698
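
The list comprehension above rescans the whole table once per user. A single `groupby` pass should produce the same per-user sequences; the sketch below uses a made-up `toy_events` frame and hypothetical variable names:

```python
import pandas as pd

# Toy interaction table (values are made up for illustration).
toy_events = pd.DataFrame({
    "uid_int": [0, 1, 0, 1, 0],
    "item_int": [11, 12, 13, 12, 11],
})

# Group once by user; sort=False keeps users in order of first appearance,
# and unique() keeps each user's movies in order of first appearance.
per_user = (
    toy_events.assign(item_str=toy_events["item_int"].astype(str))
    .groupby("uid_int", sort=False)["item_str"]
    .unique()
)
sequences = [movies.tolist() for movies in per_user]

print(sequences)
```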

### Nearest-neighbor movie recommendation based on user reviews

In [12]:
from gensim import corpora, similarities
from gensim.models import Word2Vec
import multiprocessing

# Train the model (note: gensim >= 4.0 renames the `size` parameter to `vector_size`)
%time model = Word2Vec(raw_id, window=3, size=300, workers=multiprocessing.cpu_count()*2, min_count=1)

CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 22.8 ms


Using the learned movie-id vectors, find the `topn` nearest movies.

In [13]:
# Similarity query
topn = 10
# Assume this is the current movie id
movie_check = '11'

# Run the nearest-neighbor search
%time ANN_List = [int(i[0]) for i in model.wv.most_similar(movie_check, topn=topn)]
ANN_List

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 830 µs


[22, 4, 14, 21, 7, 1, 12, 20, 15, 23]
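
`most_similar` ranks the other tokens by cosine similarity to the query vector. The idea can be sketched with plain NumPy; the helper `top_n_cosine` and the toy vectors below are hypothetical and are not gensim's implementation:

```python
import numpy as np

def top_n_cosine(vectors, keys, query_key, topn):
    """Return the topn (key, cosine similarity) pairs closest to query_key."""
    q = keys.index(query_key)
    # Normalize each row so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[q]
    # Sort by descending similarity and skip the query itself.
    order = [i for i in np.argsort(-sims) if i != q][:topn]
    return [(keys[i], float(sims[i])) for i in order]

# Toy 2-d vectors keyed by movie id strings (made up for illustration).
keys = ["11", "22", "4"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
neighbors = top_n_cosine(vectors, keys, "11", topn=2)

print(neighbors)
```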

Print the information for the neighboring movies.

In [14]:
ANN_List_Info = train_df.loc[train_df['item_int'].isin(ANN_List)].drop_duplicates('item_int', keep='first', inplace=False)
ANN_List_Info[items_f].head()

Unnamed: 0,Movie_Score,Review_Count,item_int,Movie_Name
159,7.9,1509,1,24小时狂欢派对 24 Hour Party People
1574,6.1,700,4,JT·莱罗伊 JT Leroy
2132,7.5,1539,7,一代巨星桑杰君 Sanju
4477,7.6,2620,12,七小福
5277,6.8,527,14,万视瞩目 All Eyez on Me


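#### Extensions
1. Nearest-neighbor movie recommendation based on user-movie interactions.
2. Nearest-neighbor user recommendation based on user-user relations.
3. Compute the inner product of the movie and user vectors to produce personalized recommendations for users and push notifications for movies.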
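
Item 3 could look roughly like the sketch below. It assumes, purely for illustration, that a user vector is the mean of the vectors of the movies the user has reviewed; the notebook does not specify how user vectors are built, and the toy vectors stand in for `model.wv`:

```python
import numpy as np

# Hypothetical movie vectors, one row per movie id (made up for illustration).
movie_vecs = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
seen = [0]  # indices of the movies this user has already reviewed

# Assumption: represent the user as the mean of the seen movie vectors.
user_vec = movie_vecs[seen].mean(axis=0)

# Inner product of the user vector with every movie vector gives one score per movie.
scores = movie_vecs @ user_vec

# Exclude movies the user has already seen, then rank the rest by descending score.
scores[seen] = -np.inf
ranked = np.argsort(-scores)[: len(movie_vecs) - len(seen)]

print(ranked.tolist())
```

Scoring one movie vector against all user vectors in the same way would give the list of users to push that movie to.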