# Using embeddings in recommendation systems

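Recommendations are everywhere on the web.

- Bought that item? Try these similar items.
- Liked that book? Try these similar titles.
- Can't find the help page you're looking for? Try these similar pages.

This notebook demonstrates how to use embeddings to find similar items to recommend. In particular, we use AG's corpus of news articles as our dataset.

Our model will answer the question: given an article, which other articles are most similar to it?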

In [1]:
import pandas as pd
import pickle
from lxml import etree

from utils.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

EMBEDDING_MODEL = "text-embedding-3-small"


## Load the XML data (slightly more involved than CSV this time)

In [2]:
# Load and inspect the dataset
input_datapath = "./archive/newsspace200.xml"  # change this to your own path

# Parse the XML file
tree = etree.parse(input_datapath)
root = tree.getroot()

# Extract the data from every <all_news> node
data = []
for news in root.findall('.//all_news'):
    sources = news.findall("source")
    urls = news.findall("url")
    titles = news.findall("title")
    images = news.findall("image")
    categories = news.findall("category")
    descriptions = news.findall("description")
    ranks = news.findall("rank")
    pubdates = news.findall("pubdate")
    
    # The i-th occurrence of each field belongs to the i-th article
    for i in range(len(sources)):
        row = {
            "source": sources[i].text if i < len(sources) else "",
            "url": urls[i].text if i < len(urls) else "",
            "title": titles[i].text if i < len(titles) else "",
            "image": images[i].text if i < len(images) else "",
            "category": categories[i].text if i < len(categories) else "",
            "description": descriptions[i].text if i < len(descriptions) else "",
            "rank": ranks[i].text if i < len(ranks) else "",
            "pubdate": pubdates[i].text if i < len(pubdates) else "",
        }
        data.append(row)

# Build the DataFrame
df = pd.DataFrame(data)

# Drop rows with missing values first: casting to str beforehand can turn
# None into the string "None", after which dropna() finds nothing to drop
df = df.dropna()

# Make sure every column is a string
df = df.astype(str)

# Take a quick look at the first few rows
n_examples = 5
df.head(n_examples)

|   | source | url | title | image | category | description | rank | pubdate |
|---|---|---|---|---|---|---|---|---|
| 0 | Yahoo Business | http://us.rd.yahoo.com/dailynews/rss/business/... | Wall St. Pullback Reflects Tech Blowout (Reuters) | none | Business | Reuters - Wall Street's long-playing drama,\"W... | 5 | 0000-00-00 00:00:00 |
| 1 | Yahoo Business | http://us.rd.yahoo.com/dailynews/rss/business/... | Wall St. Bears Claw Back Into the Black (Reuters) | none | Business | Reuters - Short-sellers, Wall Street's dwindli... | 5 | 0000-00-00 00:00:00 |
| 2 | Yahoo Business | http://us.rd.yahoo.com/dailynews/rss/business/... | Carlyle Looks Toward Commercial Aerospace (Reu... | none | Business | Reuters - Private investment firm Carlyle Grou... | 5 | 0000-00-00 00:00:00 |
| 3 | Yahoo Business | http://us.rd.yahoo.com/dailynews/rss/business/... | Oil and Economy Cloud Stocks' Outlook (Reuters) | none | Business | Reuters - Soaring crude prices plus worries\ab... | 5 | 0000-00-00 00:00:00 |
| 4 | Yahoo Business | http://us.rd.yahoo.com/dailynews/rss/business/... | Iraq Halts Oil Exports from Main Southern Pipe... | none | Business | Reuters - Authorities have halted oil export\f... | 5 | 0000-00-00 00:00:00 |
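
The loading code above implies that the XML stores each field as flat, repeated siblings under `<all_news>`, so the i-th `<source>`, `<title>`, and so on belong to the i-th article. The sketch below illustrates that grouping on a tiny made-up sample. The sample XML, the `FIELDS` list, and the `rows_from_xml` helper are all hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-article sample that mimics the assumed layout:
# fields appear as flat, repeated siblings under <all_news>.
SAMPLE_XML = """
<root>
  <all_news>
    <source>Feed A</source>
    <title>First headline</title>
    <description>First description</description>
    <source>Feed B</source>
    <title>Second headline</title>
    <description>Second description</description>
  </all_news>
</root>
"""

FIELDS = ["source", "title", "description"]


def rows_from_xml(xml_text: str) -> list[dict]:
    """Group the i-th occurrence of each field into one row per article."""
    root = ET.fromstring(xml_text)
    rows = []
    for news in root.findall(".//all_news"):
        columns = [news.findall(field) for field in FIELDS]
        # zip stops at the shortest column, so a trailing partial record is skipped
        for elements in zip(*columns):
            rows.append({field: (el.text or "") for field, el in zip(FIELDS, elements)})
    return rows


rows = rows_from_xml(SAMPLE_XML)
print(rows)
```

Unlike the notebook's loop, which pads missing fields with empty strings, this version drops incomplete trailing records. Which behavior is preferable depends on how clean the real file is.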


Let's look at the first few rows in full, this time without the ellipsis truncation.

In [3]:
for idx, row in df.head(n_examples).iterrows():
    print("")
    print(f"Title: {row['title']}")
    print(f"Description: {row['description']}")
    print(f"Label: {row['source']}")



Title: Wall St. Pullback Reflects Tech Blowout (Reuters)
Description: Reuters - Wall Street's long-playing drama,\"Waiting for Google," is about to reach its final act, but its\stock market debut is ending up as more of a nostalgia event\than the catalyst for a new era.
Label: Yahoo Business

Title: Wall St. Bears Claw Back Into the Black (Reuters)
Description: Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Label: Yahoo Business

Title: Carlyle Looks Toward Commercial Aerospace (Reuters)
Description: Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
Label: Yahoo Business

Title: Oil and Economy Cloud Stocks' Outlook (Reuters)
Description: Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next 

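## Build a cache to save embeddings

Before generating embeddings for these articles, let's set up a cache to store the embeddings we generate. In general, saving your embeddings is a good idea so that you can reuse them later. If you don't save them, you pay again every time you recompute them.

The cache is a dictionary that maps tuples of (text, model) to an embedding, which is a list of floats. The cache is saved as a Python pickle file.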

In [4]:
# Set up an embedding cache to avoid recomputing embeddings
# The cache is a dict mapping (text, model) tuples to embeddings, saved as a pickle file

# Path to the embedding cache
embedding_cache_path = "./archive/recommendations_embeddings_cache.pkl"

# Load the cache if it exists, otherwise start with an empty one; then save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

# Retrieve an embedding from the cache if present, otherwise request it via the API
def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    """Return the embedding of a given string, using a cache to avoid recomputing."""
    if (string, model) not in embedding_cache:
        embedding_cache[(string, model)] = get_embedding(string, model)
        # Persist the updated cache after every new embedding
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]
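
To see the caching behavior without calling the API, here is a minimal offline sketch of the same pattern. The stub `fake_embedding`, the `cached_embedding` wrapper, the `"demo-model"` name, and the temporary cache path are all hypothetical stand-ins, not part of the notebook's utilities.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for an API call, so the sketch runs offline.
api_calls = 0


def fake_embedding(text: str, model: str) -> list[float]:
    global api_calls
    api_calls += 1
    return [float(len(text)), float(len(model))]


cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
cache: dict[tuple[str, str], list[float]] = {}


def cached_embedding(text: str, model: str = "demo-model") -> list[float]:
    """Return the cached embedding, computing and persisting it on a miss."""
    key = (text, model)
    if key not in cache:
        cache[key] = fake_embedding(text, model)
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return cache[key]


first = cached_embedding("hello")
second = cached_embedding("hello")  # served from the cache, no new call
print(api_calls)

# The pickle on disk can be reloaded in a later session.
with open(cache_path, "rb") as f:
    reloaded = pickle.load(f)
print(reloaded == cache)
```

Rewriting the whole pickle on every miss is simple but gets slower as the cache grows. For larger datasets it may be worth saving once after a batch of new embeddings instead.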


Let's check that it works by getting an embedding.

In [5]:
# As an example, take the first description from the dataset
example_string = df["description"].values[0]
print(f"\nExample string: {example_string}")

# Print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")



Example string: Reuters - Wall Street's long-playing drama,\"Waiting for Google," is about to reach its final act, but its\stock market debut is ending up as more of a nostalgia event\than the catalyst for a new era.

Example embedding: [0.007917104288935661, -0.0023091554176062346, -0.0028465643990784883, -0.004404586274176836, 0.004630700685083866, -0.0017918797675520182, 0.021174227818846703, 0.03268438205122948, 0.03459241986274719, -0.014458936639130116]...


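## Recommend similar articles based on embeddings

To find similar articles, let's follow a three-step plan:
1. Get the embeddings of all the article descriptions
2. Compute the distance between the source article and every other article
3. Print the other articles closest to the source article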
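
Steps 2 and 3 of this plan can be sketched in plain Python on toy vectors. This is only an illustration of cosine distance and nearest-neighbor ranking, not the implementation of the helpers in `utils.embeddings_utils`. The `cosine_distance` function and the 2-D `toy_embeddings` are assumptions made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical 2-D "embeddings"; real ones have many more dimensions.
toy_embeddings = [
    [1.0, 0.0],   # index 0: the source item
    [0.0, 1.0],   # index 1: orthogonal to the source
    [1.0, 1.0],   # index 2: 45 degrees from the source
    [2.0, 0.0],   # index 3: same direction as the source
]

source = toy_embeddings[0]
distances = [cosine_distance(source, e) for e in toy_embeddings]

# Sort indices from nearest to farthest, then drop the source itself.
ranked = sorted(range(len(distances)), key=lambda i: distances[i])
neighbors = [i for i in ranked if i != 0]
print(neighbors)  # [3, 2, 1]
```

Cosine distance ignores vector length, which is why index 3 (twice as long but pointing the same way) is at distance 0 from the source.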

In [6]:
from utils.openai_base_api import translate_text

def print_recommendations_from_strings(
    strings: list[str],              # list of strings to search
    index_of_source_string: int,     # index of the source string in the list
    k_nearest_neighbors: int = 1,    # number of nearest neighbors to print (default 1)
    model=EMBEDDING_MODEL,           # model used to generate the embeddings
) -> list[int]:                      # indices sorted from nearest to farthest
    """
    Print the k nearest neighbors of a given source string from a list of strings.

    :return: the indices of all strings, sorted from nearest to farthest.
    """
    print("正在生成嵌入表示...")
    # Get embeddings for all strings
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    print("获取源字符串的嵌入表示...")
    # Get the embedding of the source string
    query_embedding = embeddings[index_of_source_string]
    print("计算源字符串嵌入与其他所有字符串嵌入之间的余弦距离...")
    # Cosine distance between the source embedding and every embedding
    distances = distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine")
    print("根据距离找出最近邻的索引...")
    # Indices sorted by increasing distance
    indices_of_nearest_neighbors = indices_of_nearest_neighbors_from_distances(distances)
    print("打印源字符串信息...")
    query_string = strings[index_of_source_string]
    print(f"源字符串: {query_string}")
    # Print a Chinese translation of the source string for readability
    translate_string = translate_text(query_string, "Chinese")
    print(f"源字符串翻译: {translate_string}")
    # Print the k nearest neighbors and their distances
    k_counter = 0
    for i in indices_of_nearest_neighbors:
        # Skip any string identical to the source string
        if query_string == strings[i]:
            continue
        # Stop after printing k neighbors
        if k_counter >= k_nearest_neighbors:
            break
        k_counter += 1
        translate_target_string = translate_text(strings[i], "Chinese")
        # Print the similar string, its translation, and its distance
        print(
            f"""
        --- 推荐 #{k_counter} (最近邻第{k_counter}个/{k_nearest_neighbors}个) ---
        字符串: {strings[i]}
        翻译：{translate_target_string}
        距离: {distances[i]:0.3f}"""
        )

    return indices_of_nearest_neighbors


In [7]:
# Backslashes are doubled so that \s and \t are not treated as escape sequences
translate_string = translate_text("Reuters - Wall Street's long-playing drama,\"Waiting for Google,\" is about to reach its final act, but its\\stock market debut is ending up as more of a nostalgia event\\than the catalyst for a new era.", "Chinese")
print(f"源字符串翻译: {translate_string}")


源字符串翻译: 路透社 - 华尔街的长期戏剧《等待谷歌》即将迎来最后一幕，但其股票市场首次亮相似乎更像是一场怀旧活动，而非新纪元的催化剂。


## Example recommendations

Let's look for articles similar to the first one.

In [8]:
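# [:20] keeps only the first 20 descriptions; otherwise the wait is too long
article_descriptions = df["description"].tolist()[:20]

tony_blair_articles = print_recommendations_from_strings(
    strings=article_descriptions,  # base similarity on the article descriptions
    index_of_source_string=0,  # articles similar to the first one, about Google's stock market debut
    k_nearest_neighbors=5,  # the 5 most similar articles
)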


正在生成嵌入表示...
获取源字符串的嵌入表示...
计算源字符串嵌入与其他所有字符串嵌入之间的余弦距离...
根据距离找出最近邻的索引...
打印源字符串信息...
源字符串: Reuters - Wall Street's long-playing drama,\"Waiting for Google," is about to reach its final act, but its\stock market debut is ending up as more of a nostalgia event\than the catalyst for a new era.
源字符串翻译: 路透社 - 华尔街的长篇大戏《等待谷歌》即将迎来其最后一幕，但其股市首次亮相更像是一场怀旧活动，而不是一个引发新时代的催化剂。

        --- 推荐 #1 (最近邻第1个/5个) ---
        字符串:  NEW YORK (Reuters) - Wall Street's long-playing drama,  "Waiting for Google," is about to reach its final act, but its  stock market debut is ending up as more of a nostalgia event  than the catalyst for a new era.
        翻译：纽约（路透社）- 华尔街的长篇剧作“等待谷歌”即将迎来最后一幕，但它的股市首秀却更像是一场怀旧活动，而非新纪元的催化剂。
        距离: 0.065

        --- 推荐 #2 (最近邻第2个/5个) ---
        字符串:  WASHINGTON/NEW YORK (Reuters) - The auction for Google  Inc.'s highly anticipated initial public offering got off to a  rocky start on Friday after the Web search company sidestepped  a bullet from U.S. securities regulators.
        

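Pretty good! Of the 5 recommendations, 3 explicitly mention Wall Street, one mentions Google, and one mentions the stock market, so all of them relate to the first article.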
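
Note that recommendation #1 above, at distance 0.065, is almost the same text as the source, apparently the same story from a different feed. If such near-duplicates are not useful as recommendations, one option is to skip neighbors that fall below a distance cutoff. The sketch below assumes a hypothetical `filter_near_duplicates` helper and an arbitrary cutoff of 0.1; both are illustrations, and the toy distances (apart from 0.065) are made up.

```python
def filter_near_duplicates(
    ranked_indices: list[int],
    distances: list[float],
    min_distance: float = 0.1,  # hypothetical cutoff; tune on your own data
) -> list[int]:
    """Keep only neighbors at least `min_distance` away from the source."""
    return [i for i in ranked_indices if distances[i] >= min_distance]


# Toy values: index 0 is the source, index 1 is a near-duplicate of it.
demo_distances = [0.0, 0.065, 0.31, 0.42]
demo_ranked = [0, 1, 2, 3]
kept = filter_near_duplicates(demo_ranked, demo_distances)
print(kept)  # [2, 3]
```

Because the source itself has distance 0, the same filter also removes it from the ranked list.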

Let's see how our recommender does on another source article: the one at index 5, an AFP piece about soaring world oil prices.

In [9]:
chipset_security_articles = print_recommendations_from_strings(
    strings=article_descriptions,  # base similarity on the article descriptions
    index_of_source_string=5,  # articles similar to the one at index 5, about soaring world oil prices
    k_nearest_neighbors=5,  # the 5 most similar articles
)


正在生成嵌入表示...
获取源字符串的嵌入表示...
计算源字符串嵌入与其他所有字符串嵌入之间的余弦距离...
根据距离找出最近邻的索引...
打印源字符串信息...
源字符串: AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.
源字符串翻译: 法新社 - 世界油价飞涨，打破纪录并给钱包带来压力，在美国总统选举前不到三个月的时间里，呈现出一种新的经济威胁。

        --- 推荐 #1 (最近邻第1个/5个) ---
        字符串: Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
        翻译：路透社 - 原油价格飙升，加上对经济和盈利前景的担忧，预计将影响下周股市的表现，正值夏季淡季的深处。
        距离: 0.434

        --- 推荐 #2 (最近邻第2个/5个) ---
        字符串:  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock market next week during the depth of the  summer doldrums.
        翻译：纽约（路透社）- 猛增的原油价格加上对经济和盈利前景的担忧，预计将在下周的夏季低迷期压制股市。
        距离: 0.463

        --- 推荐 #3 (最近邻第3个/5个) ---
        字符串

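From the printed distances, you can see that the #1 recommendation is the closest (0.434), with #2 not far behind (0.463); the remaining ones are in the 0.5x range. The top recommendations, both about soaring crude prices, look very similar to the starting article.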