## Content-Based Recommendation Algorithm

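Goal:  
A content-based recommendation method is very direct: it makes recommendations from the descriptive content of items, and is essentially a direct analysis and computation over the features or attributes of the items and of the users themselves.
For example, suppose we know that movie A is a comedy and we also know that a certain user likes comedies; based on this information, we can recommend movie A to that user.  
Algorithm principle:  
TF-IDF is **a method used in natural language processing to compute the weight of a word or phrase in a document**. It is the product of **term frequency** (TF) and inverse document frequency (IDF). TF is the number of times a given term appears in the document. This number is usually normalized so that it is not biased toward long documents (the same term may have a higher raw count in a long document than in a short one, regardless of how important it is). IDF measures how important a term is in general: the IDF of a term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient.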
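
As a minimal sketch of the definition above (not the gensim implementation used later in this notebook), the following computes TF-IDF by hand on a toy corpus. The function name `tf_idf` and the toy documents are illustrative assumptions; TF is normalized by document length and IDF uses the natural logarithm.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical): "Comedy" appears in every document
docs = [
    ["Comedy", "pixar", "pixar"],
    ["Comedy", "Romance"],
    ["Comedy", "Drama"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as `Comedy` here, gets an IDF of log(1) = 0, while the rarer `pixar` gets a positive weight.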

In [201]:
import numpy as np
import pandas as pd
import collections
from functools import reduce

### Building the Dataset

In [38]:
df = pd.read_csv("data/tags.csv", usecols=range(3))
movies = pd.read_csv("data/movies.csv", index_col="movieId")

In [39]:
# Collect the tags of each movie into a list
tags = df.groupby(by="movieId")["tag"].agg([list])
tags[:3]

movieId,list
1,"[pixar, pixar, fun]"
2,"[fantasy, magic board game, Robin Williams, game]"
3,"[moldy, old]"


In [40]:
# Split the genres string into a list
movies["genres"] = movies["genres"].apply(lambda x: x.split("|"))

In [41]:
movies.head(3)

movieId,title,genres
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
2,Jumanji (1995),"[Adventure, Children, Fantasy]"
3,Grumpier Old Men (1995),"[Comedy, Romance]"


In [51]:
# Attach the user tags to the movies
movies_index = set(movies.index)&set(tags.index)
new_tags = tags.loc[list(movies_index)]
ret = movies.join(new_tags)
ret.head()

movieId,title,genres,list
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[pixar, pixar, fun]"
2,Jumanji (1995),"[Adventure, Children, Fantasy]","[fantasy, magic board game, Robin Williams, game]"
3,Grumpier Old Men (1995),"[Comedy, Romance]","[moldy, old]"
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",
5,Father of the Bride Part II (1995),[Comedy],"[pregnancy, remake]"


In [52]:
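# Merge genres and tags; movies with no tag data get an empty list.
# isinstance() is used because the missing value is NaN, not a list.
movies_dataset = pd.DataFrame(
    [(x[0], x[1], x[2], x[2] + x[3]) if isinstance(x[3], list)
     else (x[0], x[1], x[2], []) for x in ret.itertuples()],
    columns=["movieId", "title", "genres", "tags"])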
movies_dataset.head()

Unnamed: 0,movieId,title,genres,tags
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[Adventure, Animation, Children, Comedy, Fanta..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]","[Adventure, Children, Fantasy, fantasy, magic ..."
2,3,Grumpier Old Men (1995),"[Comedy, Romance]","[Comedy, Romance, moldy, old]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",[]
4,5,Father of the Bride Part II (1995),[Comedy],"[Comedy, pregnancy, remake]"


In [53]:
movies_dataset.set_index("movieId", inplace=True)
movies_dataset.head()

movieId,title,genres,tags
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[Adventure, Animation, Children, Comedy, Fanta..."
2,Jumanji (1995),"[Adventure, Children, Fantasy]","[Adventure, Children, Fantasy, fantasy, magic ..."
3,Grumpier Old Men (1995),"[Comedy, Romance]","[Comedy, Romance, moldy, old]"
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",[]
5,Father of the Bride Part II (1995),[Comedy],"[Comedy, pregnancy, remake]"


## Item Profile

### Extracting Keywords with TF-IDF to Build Movie Profiles

In [62]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from pprint import pprint

In [63]:
# Token lists per movie, used to compute each movie's TF-IDF values
datasets = movies_dataset["tags"].values
datasets

array([list(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'pixar', 'pixar', 'fun']),
       list(['Adventure', 'Children', 'Fantasy', 'fantasy', 'magic board game', 'Robin Williams', 'game']),
       list(['Comedy', 'Romance', 'moldy', 'old']), ..., list([]),
       list([]), list([])], dtype=object)

In [70]:
# Show the tags of the first two movies
datasets[0:2]

array([list(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'pixar', 'pixar', 'fun']),
       list(['Adventure', 'Children', 'Fantasy', 'fantasy', 'magic board game', 'Robin Williams', 'game'])],
      dtype=object)

In [77]:
# Get (token id, term frequency) pairs
dic = Dictionary(datasets)  # build the dictionary from the token lists
corpus = [dic.doc2bow(line) for line in datasets]
corpus[:2]

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2)],
 [(0, 1), (2, 1), (4, 1), (7, 1), (8, 1), (9, 1), (10, 1)]]
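
To make the `(token id, count)` pairs above concrete, here is a rough pure-Python sketch of a bag-of-words encoding. It is not gensim's implementation; `build_bow` is a hypothetical helper, and assigning ids to unseen tokens in sorted order within each document is an assumption chosen so that the result matches the output shown above.

```python
from collections import Counter

def build_bow(docs):
    """Map tokens to integer ids and encode each doc as sorted (id, count) pairs."""
    token2id = {}
    corpus = []
    for doc in docs:
        counts = Counter(doc)
        # Assumption: unseen tokens receive the next free ids in sorted order
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        corpus.append(sorted((token2id[t], c) for t, c in counts.items()))
    return token2id, corpus

docs = [
    ["Adventure", "Animation", "Children", "Comedy", "Fantasy", "pixar", "pixar", "fun"],
    ["Adventure", "Children", "Fantasy", "fantasy", "magic board game", "Robin Williams", "game"],
]
token2id, corpus = build_bow(docs)
print(corpus[0])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2)]
print(corpus[1])  # [(0, 1), (2, 1), (4, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
```

Note that `pixar` occurs twice in the first movie's tags, which is why id 6 carries a count of 2.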

In [83]:
datasets[0]

['Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Fantasy',
 'pixar',
 'pixar',
 'fun']

In [84]:
dic.get(0), dic.get(1), dic.get(2)

('Adventure', 'Animation', 'Children')

In [97]:
dic[5]

'fun'

In [91]:
# Compute TF-IDF values: (token id, TF-IDF value)
model = TfidfModel(corpus=corpus)
model[corpus[1]]

[(0, 0.20506508893148376),
 (2, 0.250737357659749),
 (4, 0.23699778877133218),
 (7, 0.4358410782580723),
 (8, 0.39847806133235253),
 (9, 0.49506005899914796),
 (10, 0.49506005899914796)]
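
The weights above appear to be L2-normalized, so each movie's vector has unit length. The sketch below checks this on the values copied from the output, and shows one way such weights could be computed. It assumes gensim's default settings (raw term count, base-2 logarithm for IDF, unit-length normalization), which is my understanding of the defaults rather than something stated in this notebook; `tfidf_unit` and the toy document frequencies are hypothetical.

```python
import math

def tfidf_unit(counts, doc_freq, n_docs):
    """TF-IDF with log2 IDF, scaled so the vector has L2 norm 1."""
    raw = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

# Values copied from the model[corpus[1]] output above
shown = [0.20506508893148376, 0.250737357659749, 0.23699778877133218,
         0.4358410782580723, 0.39847806133235253,
         0.49506005899914796, 0.49506005899914796]
shown_norm = math.sqrt(sum(w * w for w in shown))
print(round(shown_norm, 6))

# Toy example (hypothetical counts and document frequencies)
toy = tfidf_unit({"game": 1, "Fantasy": 1}, {"game": 2, "Fantasy": 8}, n_docs=16)
print(toy)
```

Because of the normalization, weights are comparable across movies with different numbers of tags, and a rarer tag such as `game` ends up with a larger share of the vector than a common genre word.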

In [100]:
movie_profile = {}

for i, mid in enumerate(movies_dataset.index):
    vector = model[corpus[i]]
    # Keep the top-N keywords by TF-IDF weight
    movie_tag = sorted(vector, key=lambda x:x[1], reverse=True)[:30]
    movie_profile[mid] = dict(map(lambda x: (dic[x[0]], x[1]), movie_tag))

In [108]:
# Print the TF-IDF values of the first movie only
for key, value in movie_profile.items():
    print(key, "\n", value)
    break

1 
 {'pixar': 0.837374709121301, 'fun': 0.34531665530514855, 'Animation': 0.21562355612017706, 'Children': 0.21205621229134275, 'Fantasy': 0.2004362408431816, 'Adventure': 0.17342978500637887, 'Comedy': 0.1335891911679789}


### Refining the Profile Keywords

In [115]:
_movie_profile = []

for i, data in enumerate(movies_dataset.itertuples()):
    mid = data[0]
    title = data[1]
    genres = data[2]
    
    # Get the TF-IDF values
    vector = model[corpus[i]]
    movie_tag = sorted(vector, key=lambda x:x[1], reverse=True)[:30]
    TopN_tags_weight = dict(map(lambda x: (dic[x[0]], x[1]), movie_tag))
    
    # Add the genre words with an initial weight of 1.0
    for g in genres:
        TopN_tags_weight[g] = 1.0
    
    # Collect the corresponding tags (the keys of the weight dict)
    TopN_tag = list(TopN_tags_weight)
    _movie_profile.append((mid, title, TopN_tag, TopN_tags_weight))

In [117]:
_movie_profile[1]

(2,
 'Jumanji (1995)',
 ['game',
  'magic board game',
  'Robin Williams',
  'fantasy',
  'Children',
  'Fantasy',
  'Adventure'],
 {'Adventure': 1.0,
  'Children': 1.0,
  'Fantasy': 1.0,
  'Robin Williams': 0.4358410782580723,
  'fantasy': 0.39847806133235253,
  'game': 0.49506005899914796,
  'magic board game': 0.49506005899914796})

In [118]:
TopN_tags_weight

{'Comedy': 1.0}

In [120]:
# Build the movie profile: the tags and their corresponding TF-IDF values
movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
movie_profile.set_index("movieId", inplace=True)
movie_profile.head()

movieId,title,profile,weights
1,Toy Story (1995),"[pixar, fun, Animation, Children, Fantasy, Adv...","{'pixar': 0.837374709121301, 'fun': 0.34531665..."
2,Jumanji (1995),"[game, magic board game, Robin Williams, fanta...","{'game': 0.49506005899914796, 'magic board gam..."
3,Grumpier Old Men (1995),"[moldy, old, Romance, Comedy]","{'moldy': 0.669101789463952, 'old': 0.66910178..."
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]","{'Comedy': 1.0, 'Drama': 1.0, 'Romance': 1.0}"
5,Father of the Bride Part II (1995),"[pregnancy, remake, Comedy]","{'pregnancy': 0.7029528753875794, 'remake': 0...."


In [281]:
list(movie_profile["weights"].items())[:2]

[(1,
  {'Adventure': 1.0,
   'Animation': 1.0,
   'Children': 1.0,
   'Comedy': 1.0,
   'Fantasy': 1.0,
   'fun': 0.34531665530514855,
   'pixar': 0.837374709121301}),
 (2,
  {'Adventure': 1.0,
   'Children': 1.0,
   'Fantasy': 1.0,
   'Robin Williams': 0.4358410782580723,
   'fantasy': 0.39847806133235253,
   'game': 0.49506005899914796,
   'magic board game': 0.49506005899914796})]

### Inverted Index
Goal: recommend movies through their key features

In [143]:
inverted_table = {}
for mid, weights in movie_profile["weights"].items():
    for tag, weight in weights.items():
        # Fetch this tag's list of (mid, weight) pairs, or start a new one
        _ = inverted_table.get(tag, [])
        _.append((mid, weight))
        inverted_table.setdefault(tag, _)

In [284]:
# tag:(mid, weight)
inverted_table["fun"]

[(1, 0.34531665530514855),
 (89745, 0.3284369053807601),
 (108932, 0.31964654815096755),
 (122918, 0.747908115567127)]
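
As a small illustration of how such an inverted table can be built and queried, the sketch below uses a toy `weights_by_movie` dict (hypothetical data, not taken from the dataset) and `dict.setdefault` to append in one step. `top_movies_for_tag` is an illustrative helper, not part of the notebook.

```python
def build_inverted_table(weights_by_movie):
    """Map each tag to a list of (movie id, weight) pairs."""
    table = {}
    for mid, weights in weights_by_movie.items():
        for tag, weight in weights.items():
            # setdefault returns the stored list, creating it on first use
            table.setdefault(tag, []).append((mid, weight))
    return table

def top_movies_for_tag(table, tag, n=2):
    """Return the n movies with the highest weight for a tag."""
    return sorted(table.get(tag, []), key=lambda pair: pair[1], reverse=True)[:n]

# Toy profiles (hypothetical values)
weights_by_movie = {
    1: {"fun": 0.35, "Comedy": 1.0},
    2: {"fun": 0.75},
    3: {"Comedy": 1.0, "Drama": 1.0},
}
table = build_inverted_table(weights_by_movie)
print(table["fun"])                      # [(1, 0.35), (2, 0.75)]
print(top_movies_for_tag(table, "fun"))  # [(2, 0.75), (1, 0.35)]
```

The lookup cost depends only on how many movies carry the tag, not on the total number of movies, which is the reason for inverting the profile in the first place.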

## User Profile

In [231]:
movies_dataset[:5]

movieId,title,genres,tags
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]","[Adventure, Animation, Children, Comedy, Fanta..."
2,Jumanji (1995),"[Adventure, Children, Fantasy]","[Adventure, Children, Fantasy, fantasy, magic ..."
3,Grumpier Old Men (1995),"[Comedy, Romance]","[Comedy, Romance, moldy, old]"
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",[]
5,Father of the Bride Part II (1995),[Comedy],"[Comedy, pregnancy, remake]"


In [232]:
movie_profile.head()

movieId,title,profile,weights
1,Toy Story (1995),"[pixar, fun, Animation, Children, Fantasy, Adv...","{'pixar': 0.837374709121301, 'fun': 0.34531665..."
2,Jumanji (1995),"[game, magic board game, Robin Williams, fanta...","{'game': 0.49506005899914796, 'magic board gam..."
3,Grumpier Old Men (1995),"[moldy, old, Romance, Comedy]","{'moldy': 0.669101789463952, 'old': 0.66910178..."
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]","{'Comedy': 1.0, 'Drama': 1.0, 'Romance': 1.0}"
5,Father of the Bride Part II (1995),"[pregnancy, remake, Comedy]","{'pregnancy': 0.7029528753875794, 'remake': 0...."


In [233]:
watch_record = pd.read_csv("data/ratings.csv"
                           , usecols=range(2)
                           , dtype={"userId":np.int32, "movieId": np.int32}
                          )
watch_record[:5]

Unnamed: 0,userId,movieId
0,1,1
1,1,3
2,1,6
3,1,47
4,1,50


In [234]:
watch_record = watch_record.groupby("userId").agg([list])
watch_record.head()

userId,movieId (list)
1,"[1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163,..."
2,"[318, 333, 1704, 3578, 6874, 8798, 46970, 4851..."
3,"[31, 527, 647, 688, 720, 849, 914, 1093, 1124,..."
4,"[21, 32, 45, 47, 52, 58, 106, 125, 126, 162, 1..."
5,"[1, 21, 34, 36, 39, 50, 58, 110, 150, 153, 232..."


In [271]:
# User profile
user_profile = {}

for uid, mids in watch_record.itertuples():
    
    # Profiles of the movies watched by this user
    record_movie_profile = movie_profile.loc[list(mids)]
    # Count the keywords the user is most interested in
    counter = collections.Counter(reduce(lambda x, y:list(x)+list(y), record_movie_profile["profile"].values))
    interest_word = counter.most_common(50)
    maxcount = interest_word[0][1]
    # Normalize by the highest count
    interest_word = [(w,round(c/maxcount, 4)) for w,c in interest_word]
    # {uid: [(word, weight), ...]}
    user_profile[uid] = interest_word

In [292]:
user_profile[1][:10]

[('Action', 1.0),
 ('Adventure', 0.9444),
 ('Comedy', 0.9222),
 ('Drama', 0.7556),
 ('Thriller', 0.6111),
 ('Fantasy', 0.5222),
 ('Crime', 0.5),
 ('Children', 0.4667),
 ('Sci-Fi', 0.4444),
 ('Animation', 0.3222)]
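
The counting-and-normalizing step can be sketched on its own as follows. This is a minimal version under the assumption that each watched movie contributes its profile keywords once; `build_interest_words` is a hypothetical helper, and `itertools.chain` is used here in place of `reduce` only to flatten the lists in a single pass.

```python
from collections import Counter
from itertools import chain

def build_interest_words(profiles, top_k=50):
    """Count keywords over watched movies and scale counts by the maximum."""
    counter = Counter(chain.from_iterable(profiles))
    most_common = counter.most_common(top_k)
    if not most_common:  # user with no keywords at all
        return []
    max_count = most_common[0][1]
    return [(word, round(count / max_count, 4)) for word, count in most_common]

# Toy watch history (hypothetical): keyword lists of three watched movies
profiles = [
    ["Action", "Adventure"],
    ["Action", "Comedy"],
    ["Action", "Adventure", "Drama"],
]
interest = build_interest_words(profiles)
print(interest)
```

The most frequent keyword always ends up with weight 1.0, and every other keyword is expressed as a fraction of it.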

## Generating Top-N Recommendations for Users
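
The notebook leaves this section empty. One possible approach, sketched below under stated assumptions, is to look up each of the user's interest words in the inverted table and accumulate a score per movie as (user interest weight) × (tag weight), skipping movies the user has already watched. The scoring rule, the function name `recommend_top_n`, and the toy data are all assumptions for illustration, not the author's method.

```python
from collections import defaultdict

def recommend_top_n(interest_words, inverted_table, watched, n=10):
    """Score unseen movies by summing user-weight * tag-weight over shared tags."""
    scores = defaultdict(float)
    for word, user_weight in interest_words:
        for mid, tag_weight in inverted_table.get(word, []):
            if mid not in watched:
                scores[mid] += user_weight * tag_weight
    # Highest score first; ties broken by movie id for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy inputs shaped like user_profile[uid] and inverted_table (hypothetical values)
interest_words = [("Action", 1.0), ("Comedy", 0.5)]
inverted_table = {
    "Action": [(10, 1.0), (20, 1.0)],
    "Comedy": [(20, 1.0), (30, 1.0)],
}
top = recommend_top_n(interest_words, inverted_table, watched={10}, n=2)
print(top)  # [(20, 1.5), (30, 0.5)]
```

With the real data, `interest_words` would be `user_profile[uid]` and `watched` the set of movie ids in that user's watch record.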

## Key Points

In [169]:
T = {}
test = {"a":4, "b":5}
test

{'a': 4, 'b': 5}

In [170]:
# Get the value for a key: returns the stored value if the key exists, otherwise []
b = T.get("a", [])
b.append((11, 4, 5, 6))

In [171]:
# Store the list under the key (setdefault only sets it if the key is missing)
c = T.setdefault("a", b)
c

[(11, 4, 5, 6)]

In [172]:
T

{'a': [(11, 4, 5, 6)]}

In [268]:
a = collections.Counter(reduce(lambda x, y: list(x)+list(y), record_movie_profile["profile"].values))
b = a.most_common()[:10]
b

[('Action', 90),
 ('Adventure', 85),
 ('Comedy', 83),
 ('Drama', 68),
 ('Thriller', 55),
 ('Fantasy', 47),
 ('Crime', 45),
 ('Children', 42),
 ('Sci-Fi', 40),
 ('Animation', 29)]

In [269]:
maxcount = b[0][1]
maxcount

90

In [270]:
for w, c in b:
    print((w, round(c/maxcount, 4)))

('Action', 1.0)
('Adventure', 0.9444)
('Comedy', 0.9222)
('Drama', 0.7556)
('Thriller', 0.6111)
('Fantasy', 0.5222)
('Crime', 0.5)
('Children', 0.4667)
('Sci-Fi', 0.4444)
('Animation', 0.3222)
