# Goal

Goal: build a simple news recommendation system based on a large language model (LLM).

Q: What can an LLM do in a recommender system? (LLM in Rec?)

A:

![image.png](assets/img.png)

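Our application uses the LLM in three places: feature engineering, feature encoding, and scoring/ranking.

> Reference:
> 
> [大语言模型在推荐系统的实践应用](https://mp.weixin.qq.com/s/pUrqdglF26ww1nDK9hANTA) (Practical Applications of Large Language Models in Recommender Systems)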

# Dataset
## MIND: Microsoft News Recommendation Dataset

Official site: https://msnews.github.io

Paper: https://msnews.github.io/assets/doc/ACL2020_MIND.pdf

Alibaba Tianchi's introduction to the MIND dataset: https://tianchi.aliyun.com/dataset/89539


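> The MIcrosoft News Dataset (MIND) is a large-scale dataset **for news recommendation research**. It was collected from anonymized behavior logs of the Microsoft News website. The mission of MIND is to serve as a benchmark dataset for news recommendation and to facilitate research on news recommendation and recommender systems.

> MIND contains about 160k English **news articles** and more than 15 million **impression logs** generated by 1 million users.
> Every news article contains rich textual content, including title, abstract, body, category, and entities.
> 
> Each impression log contains the click events, non-click events, and historical news click behaviors of the user before this impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.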

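- MINDsmall_train.zip
  - news.tsv: information about the news articles
  - behaviors.tsv: the click histories and impression logs of users
  - entity_embedding.vec: embeddings of the entities in the news, extracted from a knowledge graph
  - relation_embedding.vec: embeddings of the relations between entities, extracted from a knowledge graph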

### news.tsv

Details of the news articles. There are 8 columns:

- News ID
- Category
- SubCategory
- Title
- Abstract
- URL
- Title Entities (entities contained in the title of this news)
- Abstract Entities (entities contained in the abstract of this news)

In [1]:
import pandas as pd

In [2]:
# Set the column display width to None so that text columns are shown in full
pd.set_option('display.max_colwidth', None)

In [5]:
df_news = pd.read_csv(
    './MIND/MINDsmall_train/news.tsv',
    names=["news_id", "category", "sub_category", "title", "abstract", "url", "title_entities", "abstract_entities"],
    sep='\t',
    header=None
)

In [6]:
df_news.head(2)

Unnamed: 0,news_id,category,sub_category,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By","Shop the notebooks, jackets, and more that the royals can't live without.",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"", ""Type"": ""P"", ""WikidataId"": ""Q80976"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [48], ""SurfaceForms"": [""Prince Philip""]}, {""Label"": ""Charles, Prince of Wales"", ""Type"": ""P"", ""WikidataId"": ""Q43274"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [28], ""SurfaceForms"": [""Prince Charles""]}, {""Label"": ""Elizabeth II"", ""Type"": ""P"", ""WikidataId"": ""Q9682"", ""Confidence"": 0.97, ""OccurrenceOffsets"": [11], ""SurfaceForms"": [""Queen Elizabeth""]}]",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good.,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [20], ""SurfaceForms"": [""Belly Fat""]}]","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [97], ""SurfaceForms"": [""belly fat""]}]"


In [7]:
df_news.iloc[0]

news_id         N55528
category        lifestyle
sub_category

In [8]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51282 entries, 0 to 51281
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   news_id            51282 non-null  object
 1   category           51282 non-null  object
 2   sub_category       51282 non-null  object
 3   title              51282 non-null  object
 4   abstract           48616 non-null  object
 5   url                51282 non-null  object
 6   title_entities     51279 non-null  object
 7   abstract_entities  51278 non-null  object
dtypes: object(8)
memory usage: 3.1+ MB


In [9]:
df_news.describe()

| | news_id | category | sub_category | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| count | 51282 | 51282 | 51282 | 51282 | 48616 | 51282 | 51279 | 51278 |
| unique | 51282 | 17 | 264 | 50434 | 47309 | 51281 | 34472 | 36277 |
| top | N55528 | news | newsus | Photos of the Day | What's the weather today? What's the weather for the week? Here's your forecast. | [] | [] | [] |
| freq | 1 | 15774 | 6564 | 15 | 124 | 2 | 13842 | 13825 |
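
The `title_entities` and `abstract_entities` columns hold JSON arrays stored as strings, and a few rows are missing (51279 and 51278 non-null values out of 51282). The sketch below shows one way to decode such a cell. The helper name `parse_entities` is hypothetical, and the sample string is copied from the second row of the `head()` output above.

```python
import json


def parse_entities(cell):
    """Decode one title_entities / abstract_entities cell into a list of dicts."""
    # pandas reads a missing cell as NaN (a float), so treat non-strings as "no entities".
    if not isinstance(cell, str) or not cell.strip():
        return []
    return json.loads(cell)


# Cell value copied from the N19639 row shown above.
sample_cell = (
    '[{"Label": "Adipose tissue", "Type": "C", "WikidataId": "Q193583", '
    '"Confidence": 1.0, "OccurrenceOffsets": [20], "SurfaceForms": ["Belly Fat"]}]'
)

entities = parse_entities(sample_cell)
wikidata_ids = [entity["WikidataId"] for entity in entities]
print(wikidata_ids)  # ['Q193583']
```

On the real frame this could be applied column-wise, for example `df_news["title_entities"].apply(parse_entities)`. The resulting `WikidataId` values are the keys used in `entity_embedding.vec`.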


In [12]:
df_news.head(3)

Unnamed: 0,news_id,category,sub_category,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By","Shop the notebooks, jackets, and more that the royals can't live without.",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"", ""Type"": ""P"", ""WikidataId"": ""Q80976"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [48], ""SurfaceForms"": [""Prince Philip""]}, {""Label"": ""Charles, Prince of Wales"", ""Type"": ""P"", ""WikidataId"": ""Q43274"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [28], ""SurfaceForms"": [""Prince Charles""]}, {""Label"": ""Elizabeth II"", ""Type"": ""P"", ""WikidataId"": ""Q9682"", ""Confidence"": 0.97, ""OccurrenceOffsets"": [11], ""SurfaceForms"": [""Queen Elizabeth""]}]",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good.,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [20], ""SurfaceForms"": [""Belly Fat""]}]","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [97], ""SurfaceForms"": [""belly fat""]}]"
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War,"Lt. Ivan Molchanets peeked over a parapet of sand bags at the front line of the war in Ukraine. Next to him was an empty helmet propped up to trick snipers, already perforated with multiple holes.",https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId"": ""Q212"", ""Confidence"": 0.946, ""OccurrenceOffsets"": [87], ""SurfaceForms"": [""Ukraine""]}]"



### behaviors.tsv

The click histories and impression logs of users. There are 5 columns:

- Impression ID. The ID of an impression.
- User ID. The anonymous ID of a user.
- Time. The impression time, in the format "MM/DD/YYYY HH:MM:SS AM/PM".
- History. The news click history (ID list of clicked news) of this user before this impression. The clicked news articles are ordered by time.
- Impressions. The list of news displayed in this impression and the user's click behavior on each of them (1 for click, 0 for non-click). The order of the news within an impression has been shuffled.

In [17]:
# Behavior data
df_behaviors = pd.read_csv('./MIND/MINDsmall_train/behaviors.tsv', 
                           names=["impression_id", "user_id", "time", "click_history", "impression_lpg"],
                           sep='\t', 
                           header=None)

In [18]:
df_behaviors.head(2)

Unnamed: 0,impression_id,user_id,time,click_history,impression_lpg
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129 N1569 N17686 N13008 N21623 N6233 N14340 N48031 N62285 N44383 N23061 N16290 N6244 N45099 N58715 N59049 N7023 N50528 N42704 N46082 N8275 N15710 N59026 N8429 N30867 N56514 N19709 N31402 N31741 N54889 N9798 N62612 N2663 N16617 N6087 N13231 N63317 N61388 N59359 N51163 N30698 N34567 N54225 N32852 N55833 N64467 N3142 N13912 N29802 N44462 N29948 N4486 N5398 N14761 N47020 N65112 N31699 N37159 N61101 N14761 N3433 N10438 N61355 N21164 N22976 N2511 N48390 N58224 N48742 N35458 N24611 N37509 N21773 N41011 N19041 N25785,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N22407-0 N14592-0 N17059-1 N33677-0 N7821-0 N6890-0


In [19]:
df_behaviors.iloc[0]

impression_id                                                                  1
user_id                                                                   U13740
time                                                       11/11/2019 9:05:58 AM
click_history     N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801
impression_lpg                                                 N55689-1 N35729-0
Name: 0, dtype: object

In [20]:
df_behaviors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156965 entries, 0 to 156964
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   impression_id   156965 non-null  int64 
 1   user_id         156965 non-null  object
 2   time            156965 non-null  object
 3   click_history   153727 non-null  object
 4   impression_lpg  156965 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.0+ MB


In [21]:
df_behaviors.describe()

| | impression_id |
|---|---|
| count | 156965.0 |
| mean | 78483.0 |
| std | 45312.036839 |
| min | 1.0 |
| 25% | 39242.0 |
| 50% | 78483.0 |
| 75% | 117724.0 |
| max | 156965.0 |


In [22]:
unique_user_ids = df_behaviors['user_id'].unique()
print(len(unique_user_ids))
unique_user_ids

50000


array(['U13740', 'U91836', 'U73700', ..., 'U43157', 'U66493', 'U72015'],
      dtype=object)
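
The `time`, `click_history`, and `impression_lpg` columns are all plain strings, so they need to be split before they can be used. Below is a minimal sketch of that parsing, assuming the formats described in the column list above. The helper names are hypothetical and the sample values are copied from the first row of `df_behaviors`.

```python
from datetime import datetime


def parse_history(cell):
    """Split the click_history cell into news IDs; a missing history becomes []."""
    return cell.split() if isinstance(cell, str) else []


def parse_impressions(cell):
    """Turn 'N55689-1 N35729-0' into [('N55689', 1), ('N35729', 0)]."""
    pairs = []
    for item in cell.split():
        news_id, label = item.rsplit("-", 1)
        pairs.append((news_id, int(label)))
    return pairs


def parse_time(cell):
    """Parse the 'MM/DD/YYYY HH:MM:SS AM/PM' timestamp."""
    return datetime.strptime(cell, "%m/%d/%Y %I:%M:%S %p")


# Values copied from the first row of df_behaviors shown above.
history = parse_history("N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801")
impressions = parse_impressions("N55689-1 N35729-0")
clicked = [news_id for news_id, label in impressions if label == 1]
shown_at = parse_time("11/11/2019 9:05:58 AM")

print(len(history), clicked, shown_at.isoformat())
```

`parse_history` accepts non-string input because `click_history` is empty for some rows (153727 non-null values out of 156965), which pandas reads as NaN.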

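### entity_embedding.vec and relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain 100-dimensional embeddings of the entities and relations, learned with the TransE method from a subgraph of the WikiData knowledge graph. In both files, the first column is the ID of the entity/relation and the remaining columns are the embedding vector values. The dataset authors hope this data can facilitate research on knowledge-aware news recommendation.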

In [23]:
# Path to the .vec file
vec_file_path = './MIND/MINDsmall_train/entity_embedding.vec'

# Read the .vec file
with open(vec_file_path, 'r', encoding='utf-8') as file:

    # Read the embeddings line by line
    for line in file:
        # Each line is an ID followed by tab-separated vector values
        parts = line.strip().split('\t')
        word = parts[0]
        vector = [float(value) for value in parts[1:]]
        print(f'Word: {word}, Vector: {vector}')

        # Stop once the ID we are looking for has been printed
        if word == 'Q41':
            break

Word: Q41, Vector: [-0.063388, -0.181451, 0.057501, -0.091254, -0.076217, -0.052525, 0.0505, -0.224871, -0.018145, 0.030722, 0.064276, 0.073063, 0.039489, 0.159404, -0.128784, 0.016325, 0.026797, 0.13709, 0.001849, -0.059103, 0.012091, 0.045418, 0.000591, 0.211337, -0.034093, -0.074582, 0.014004, -0.099355, 0.170144, 0.109376, -0.014797, 0.071172, 0.080375, 0.045563, -0.046462, 0.070108, 0.015413, -0.020874, -0.170324, -0.00113, 0.05981, 0.054342, 0.027358, -0.028995, -0.224508, 0.066281, -0.200006, 0.018186, 0.082396, 0.167178, -0.136239, 0.055134, -0.080195, -0.00146, 0.031078, -0.017084, -0.091176, -0.036916, 0.124642, -0.098185, -0.054836, 0.152483, -0.053712, 0.092816, -0.112044, -0.072247, -0.114896, -0.036541, -0.186339, -0.16061, 0.037342, -0.133474, 0.11008, 0.070678, -0.005586, -0.046667, -0.07201, 0.086424, 0.026165, 0.030561, 0.077888, -0.117226, 0.211597, 0.112512, 0.079999, -0.083398, -0.121117, 0.071751, -0.017654, -0.134979, -0.051949, 0.001861, 0.124535, -0.151043, -0.

In [24]:
# Path to the .vec file
vec_file_path = './MIND/MINDsmall_train/relation_embedding.vec'

# Read the .vec file
with open(vec_file_path, 'r', encoding='utf-8') as file:

    # Read the embeddings line by line
    for line in file:
        # Each line is an ID followed by tab-separated vector values
        parts = line.strip().split('\t')
        word = parts[0]
        vector = [float(value) for value in parts[1:]]
        print(f'Word: {word}, Vector: {vector}')

        # Stop once the ID we are looking for has been printed
        if word == 'P31':
            break


Word: P31, Vector: [-0.073467, -0.132227, 0.034173, -0.032769, 0.008289, -0.107088, -0.031712, -0.039581, 0.101882, -0.106961, -0.053441, 0.068202, -0.045584, -0.140448, -0.079402, 0.001022, 0.059921, -0.06251, 0.102848, 0.077947, -0.063644, 0.05007, -0.01918, 0.064456, -0.052222, 0.071078, -0.036413, -0.039235, 0.137947, 0.067378, -0.137468, 0.103482, 0.121755, -0.006587, 0.063077, -0.024954, -0.0313, -0.056833, -0.139115, -0.05357, 0.165815, -0.022143, 0.006561, -0.108691, -0.149139, 0.080943, 0.054542, -0.034564, 0.082343, -0.095843, -0.068758, 0.01385, -0.025589, -0.012451, 0.116367, -0.066981, -0.006472, 0.136078, -0.057084, -0.066427, -0.035916, -0.028447, -0.070395, -0.052364, -0.040038, 0.037342, -0.073347, 0.112529, 0.106537, 0.107426, 0.086297, 0.085833, 0.054393, 0.053187, 0.066242, 0.058507, -0.04718, -0.086089, 0.050148, 0.053491, -0.04237, -0.110435, -0.058929, 0.063987, -0.037393, -0.057942, -0.032128, 0.141226, -0.106979, 0.072183, -0.045641, -0.050068, -0.053686, -0.04
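
The two cells above stop at the first matching ID. For repeated lookups it may be more convenient to load a whole file into a dictionary once. The sketch below does that with a hypothetical helper, `load_embeddings`. It is demonstrated on a tiny temporary file with made-up 3-dimensional values rather than the real 100-dimensional data.

```python
import os
import tempfile


def load_embeddings(path):
    """Read a .vec file (ID, then tab-separated floats) into {id: [float, ...]}."""
    embeddings = {}
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            parts = line.strip().split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Demo file: the values are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as file:
        file.write("Q41\t0.1\t-0.2\t0.3\t\n")
        file.write("P31\t0.0\t0.5\t-0.5\t\n")
    demo = load_embeddings(demo_path)

print(sorted(demo), len(demo["Q41"]))  # ['P31', 'Q41'] 3
```

With the real files, the call would be `load_embeddings('./MIND/MINDsmall_train/entity_embedding.vec')`, and the keys would line up with the `WikidataId` values in `news.tsv`.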

# System Design

![img.png](assets/palr.png)

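> Image source: PALR, https://arxiv.org/abs/2305.07622

Recommending news to one user: the system infers the user's preferences from their behavior records and recommends news articles that match those preferences.

- Iterate over the rows of df_behaviors. Each row is one user behavior record, and the recommendation is based on the information in that record.
- Ranking algorithm
  - Rank with a prompt built from the click history, the user profile, and the candidate set
    - Click history: taken from the user's behavior record. It is a list of news IDs, which can be looked up in df_news to get the details of each article.
    - User profile: build a prompt from the details of the clicked news and let the LLM output a user profile (for example, the topics the user likes to read and the regions they follow).
    - Candidate set: selected from the full pool of news articles by the recall algorithm.
- Recall algorithm
  - First pass: rule-based filtering, down to a few thousand articles
    - Count the (category, sub-category) combinations in the user's click history to find the preferred ones, and keep only the news that matches them.
  - Second pass: vector-similarity filtering, down to 20 articles
    - Embed the user profile string to get user_emb.
    - Embed the news articles and store the vectors (news_emb) in a vector store for retrieval.
    - Compute the similarity between user_emb and each news_emb and keep only the top 20.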
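
The two recall passes can be sketched as follows. This is a minimal illustration on toy data, not the project's actual implementation. The function names, the `top_n` cut-off for preferred combinations, the exclusion of already-clicked news, and the use of cosine similarity are all assumptions. It uses plain Python lists, whereas a real pipeline would normally use NumPy or a vector store.

```python
import math
from collections import Counter


def preferred_pairs(history_ids, news_by_id, top_n=3):
    """Most frequent (category, sub_category) pairs among the clicked news."""
    counts = Counter(
        (news_by_id[nid]["category"], news_by_id[nid]["sub_category"])
        for nid in history_ids
        if nid in news_by_id
    )
    return {pair for pair, _ in counts.most_common(top_n)}


def rule_filter(news_by_id, pairs, exclude=()):
    """First pass: keep news whose (category, sub_category) is preferred."""
    return [
        nid for nid, news in news_by_id.items()
        if (news["category"], news["sub_category"]) in pairs and nid not in exclude
    ]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_filter(user_emb, candidate_ids, news_emb, k=20):
    """Second pass: keep the k candidates most similar to the user embedding."""
    ranked = sorted(
        candidate_ids,
        key=lambda nid: cosine(user_emb, news_emb[nid]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: IDs, categories, and 2-dimensional vectors are invented.
news_by_id = {
    "N1": {"category": "sports", "sub_category": "football_nfl"},
    "N2": {"category": "sports", "sub_category": "football_nfl"},
    "N3": {"category": "news", "sub_category": "newsus"},
    "N4": {"category": "sports", "sub_category": "football_nfl"},
}
news_emb = {"N1": [1.0, 0.0], "N2": [0.9, 0.1], "N3": [0.0, 1.0], "N4": [0.2, 0.8]}

history = ["N1"]
pairs = preferred_pairs(history, news_by_id)
candidates = rule_filter(news_by_id, pairs, exclude=set(history))
top = vector_filter([1.0, 0.0], candidates, news_emb, k=1)
print(candidates, top)  # ['N2', 'N4'] ['N2']
```

Running the cheap rule filter first keeps the similarity search small, which matters when the embedding comparison is done in plain Python as here.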


## LLM for Feature Engineering

Paper: https://arxiv.org/pdf/2305.06566v4.pdf

![image.png](assets/3prompt.png)


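This work (GENRE) targets the news recommendation setting and uses an LLM with three different prompts: one to rewrite news summaries, one to build user profiles, and one to augment training samples.

First, it takes a news article's title, abstract, and category as input and asks the LLM to generate a summary, which is then fed to the downstream model as a new feature of that article.

Second, for user profiling, it gives the LLM the titles of the news the user has read and asks which topics the user is interested in, that is, the user's preferences and the region they are located in.

Third, because some users have read very few articles, it uses the LLM to augment the samples. The categories and titles of the news a user has read are fed to the LLM, which is asked to generate "pseudo news" that the user has not read but may be interested in. These pseudo interactions are then added to the training set.

The experiments in the paper show that each of these techniques improves on the original recommendation results.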
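
As an illustration of the user-profiling step, the sketch below builds a prompt from clicked news and parses a structured answer. The prompt wording and the two-line answer format are assumptions made for this example and are not the prompts from the GENRE paper. The LLM call itself is left out, and `fake_answer` stands in for a model response.

```python
def build_profile_prompt(clicked_news, max_items=30):
    """Build a user-profiling prompt from (category, title) pairs of clicked news."""
    lines = [
        f"{i}. [{category}] {title}"
        for i, (category, title) in enumerate(clicked_news[-max_items:], start=1)
    ]
    return (
        "Describe a news reader based on their click history.\n"
        "Clicked news, oldest first:\n"
        + "\n".join(lines)
        + "\n\nAnswer in exactly two lines:\n"
        "topics: <comma-separated topics the user likes>\n"
        "regions: <comma-separated regions the user follows>"
    )


def parse_profile(answer):
    """Parse the two-line answer into {'topics': [...], 'regions': [...]}."""
    profile = {"topics": [], "regions": []}
    for line in answer.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in profile:
            profile[key] = [item.strip() for item in value.split(",") if item.strip()]
    return profile


# Categories and titles copied from the news.tsv sample rows above.
clicked_news = [
    ("lifestyle", "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By"),
    ("health", "50 Worst Habits For Belly Fat"),
]
prompt = build_profile_prompt(clicked_news)

# Stand-in for the model's reply; a real reply would come from the LLM.
fake_answer = "topics: royals, health\nregions: United Kingdom"
profile = parse_profile(fake_answer)
print(profile)
```

Asking for a fixed answer format keeps the parsing trivial. The parsed profile can later be turned back into a string and embedded as `user_emb`.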

## LLM for Feature Encoding

Embedding model leaderboard: https://huggingface.co/spaces/mteb/leaderboard

The embedding model we chose: https://huggingface.co/DMetaSoul/Dmeta-embedding
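
A rough sketch of how news text might be encoded is shown below. It assumes the `sentence-transformers` package is installed and that the model can be loaded through it under the ID taken from the URL above. Neither assumption is confirmed by this document. The helper names are hypothetical, and the encoder is defined but not called here because it downloads the model.

```python
def news_to_text(title, abstract=None):
    """Join title and abstract; the abstract is missing for some news.tsv rows."""
    if isinstance(abstract, str) and abstract.strip():
        return f"{title}\n{abstract}"
    return title


def encode_texts(texts, model_name="DMetaSoul/Dmeta-embedding"):
    """Encode texts into L2-normalised vectors (downloads the model on first use)."""
    # Imported lazily so news_to_text stays usable without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True)


# Title and abstract copied from the N19639 row of news.tsv shown earlier.
text = news_to_text(
    "50 Worst Habits For Belly Fat",
    "These seemingly harmless habits are holding you back and keeping you "
    "from shedding that unwanted belly fat for good.",
)
print(text)
```

With normalised vectors, the dot product of `user_emb` and `news_emb` equals their cosine similarity. In practice the model would be loaded once and reused rather than re-created on every call.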

## LLM for Scoring and Ranking

Chat-REC: https://arxiv.org/pdf/2303.14524.pdf

![image.png](assets/rank.png)
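
To make the ranking step concrete, here is a minimal sketch of a ranking prompt built from the user profile, click history, and candidate set, plus a tolerant parser for the reply. The prompt wording, function names, and fallback behavior are assumptions for illustration and are not taken from the Chat-REC paper. The candidate titles are placeholders, and `fake_answer` stands in for a model response.

```python
import re


def build_rank_prompt(profile, history_titles, candidates):
    """Build a ranking prompt; candidates is a list of (news_id, title) pairs."""
    history = "\n".join(f"- {title}" for title in history_titles)
    pool = "\n".join(f"{news_id}: {title}" for news_id, title in candidates)
    return (
        f"User profile: {profile}\n"
        f"Recently clicked news:\n{history}\n"
        f"Candidate news:\n{pool}\n"
        "Rank the candidates from most to least likely to be clicked. "
        "Reply with the news IDs only, separated by commas."
    )


def parse_ranking(answer, candidate_ids):
    """Keep known IDs in the order the model gave them; append any it left out."""
    known = set(candidate_ids)
    ranked = []
    for news_id in re.findall(r"N\d+", answer):
        if news_id in known and news_id not in ranked:
            ranked.append(news_id)
    return ranked + [news_id for news_id in candidate_ids if news_id not in ranked]


# IDs come from the first behaviors.tsv row; the titles are placeholders.
candidates = [("N55689", "placeholder title A"), ("N35729", "placeholder title B")]
prompt = build_rank_prompt(
    "topics: royals, health; regions: United Kingdom",
    ["50 Worst Habits For Belly Fat"],
    candidates,
)

# Stand-in reply containing an unknown ID and a duplicate.
fake_answer = "N35729, N99999, N55689, N35729"
ranking = parse_ranking(fake_answer, [news_id for news_id, _ in candidates])
print(ranking)  # ['N35729', 'N55689']
```

The parser drops IDs that were never offered and appends candidates the model omitted, so a malformed reply still yields a complete ordering. The result can be compared with the click labels in `impression_lpg` for evaluation.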

