# Question answering with GLM models using a search API and re-ranking

This notebook adapts the code from the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_a_search_API.ipynb) and implements the same functionality with the GLM family of models.

**Step 1: Search**

1. The user asks a question.
2. GLM-4 generates a list of potential search queries.
3. The search queries are executed. They are independent of each other, so they can run in parallel.

**Step 2: Rerank**

1. Use the embedding of each result to calculate its semantic similarity to a generated ideal answer to the user's question.
2. Sort and filter the results based on this similarity measure.

**Step 3: Answer**

1. Given the top search results, the model generates an answer to the user's question, including citations and links.

This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint without the need to maintain a vector database. We use the [News API](https://newsapi.org/) as the example search domain.

## 1. Setting the API keys

First, we set the API keys: one for the [News API](https://newsapi.org/) and one for the ZhipuAI API. We also define a helper function, `json_glm`, that asks the model for JSON output and parses the response.

In [None]:
api_key  = "your zhipuAI API KEY"
news_api_key = "your newsapi.org API KEY"

In [1]:
import json
from numpy import dot
import requests
from tqdm.notebook import tqdm
from openai import OpenAI

client = OpenAI(
    api_key=api_key,
    base_url="https://open.bigmodel.cn/api/paas/v4/"
)

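def json_glm(prompt: str):
    # Ask the model for a JSON-only reply and parse it into Python objects.
    completion = client.chat.completions.create(
        model="glm-4-0520",
        messages=[
            # System prompt (Chinese): "Follow the user's instructions strictly; output must be a JSON blob."
            {"role": "system", "content": "请你严格按照用户指令的要求执行，必须按照 JSON BLOB的格式输出"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.5
    )

    text = completion.choices[0].message.content
    # str.strip('```json') strips a set of characters rather than a prefix,
    # so the optional Markdown fence is removed explicitly (Python 3.9+).
    json_content = text.strip().removeprefix('```json').removeprefix('```')
    json_content = json_content.removesuffix('```').strip()
    return json.loads(json_content)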
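
Chat models often wrap JSON replies in a Markdown code fence, which is why the helper has to clean the text before calling `json.loads`. The sketch below shows one way to do that cleanup as a standalone function that can be tested without calling the API. The name `extract_json` and the regular expression are illustrative assumptions, not part of the original notebook.

```python
import json
import re

# Matches a reply that is entirely wrapped in a Markdown fence,
# with or without the "json" language tag after the opening backticks.
_FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)


def extract_json(text: str):
    """Parse JSON from a model reply that may be wrapped in a Markdown code fence."""
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        # Keep only the payload between the opening and closing fence.
        cleaned = match.group(1)
    return json.loads(cleaned)


fenced_reply = '```json\n{"queries": ["query_1", "query_2"]}\n```'
plain_reply = '{"queries": []}'
print(extract_json(fenced_reply))
print(extract_json(plain_reply))
```

Matching the fence as a whole prefix and suffix avoids the pitfall of `str.strip('```json')`, which removes any of the characters `` ` ``, `j`, `s`, `o`, `n` from both ends rather than the literal fence.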




## 2. Determine the search question

First, we define the question that this search should answer.

In [2]:
# Chinese for: "Today's US presidential election debate"
USER_QUESTION = "今天美国总统大选的辩论情况"

To make the search as exhaustive as possible, we now ask the model to generate a diverse list of queries based on this question.

In [3]:
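# The prompt below is written in Chinese. In summary: you have access to a search API
# that returns recent news articles; generate an array of search queries relevant to
# the question, vary the keywords, prefer English, and reply strictly in the format
# {"queries": ["query_1", "query_2", "query_3"]}.
QUERIES_INPUT = f"""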
您可以访问返回最新新闻文章的搜索 API。生成与此问题相关的搜索查询数组。使用相关关键字的变体进行查询，尽量尽可能通用。包括您能想到的尽可能多的查询，包括和排除术语。这是搜索的关键词例子：
['keyword_1 keyword_2', 'keyword_1', 'keyword_2'] 
发挥创意。您包含的查询越多，找到相关结果的可能性就越大。最好使用英语进行搜索，因为这个网站需要使用英语的关键词。
用户问题：{USER_QUESTION}
返回格式：
{{"queries": ["query_1", "query_2", "query_3"]}}
请你严格按照返回格式的要求输出。
"""

queries = json_glm(QUERIES_INPUT)["queries"]
queries.append(USER_QUESTION)
queries

['President election debate today',
 '2023 presidential debate',
 "today's presidential election debate",
 'President debate live',
 'election 2023 debate',
 'debate between presidential candidates',
 'Presidential candidates debate today',
 'latest presidential debate news',
 "today's political debate",
 'President election 2023 debate',
 '今天美国总统大选的辩论情况']

## 3. Search

Next, let's run the searches.

In [4]:
def search_news(
        query: str,
        news_api_key: str,
        num_articles: int = 50,
        from_datetime: str = "2024-06-27",
        to_datetime: str = "2024-06-29",
) -> dict:
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )

    return response.json()


articles = []

for query in tqdm(queries):
    result = search_news(query=query, news_api_key=news_api_key)
    if result["status"] != "ok":
        raise RuntimeError(result["message"])
    articles.extend(result["articles"])

# remove duplicates
articles = list({article["url"]: article for article in articles}.values())

print("Total number of articles:", len(articles))
print("Top 5 articles of query 1:", "\n")

for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()



Total number of articles: 263
Top 5 articles of query 1: 

Title: The first US presidential debate will have some noticeable differences
Description: It didn't come easy, and it won't feel quite the same, but we're getting our first US presidential debate tonight.
Content: Chip Somodevilla/Getty; Scott Eisen/Getty Images; BI
<ul><li>This post originally appeared in the I...

Title: Kari Lake refused to debate Mark Lamb. Now I know why
Description: Opinion: Mark Lamb is the Republican U.S. Senate candidate that Ruben Gallego doesn't want to face. Neither did Kari Lake.
Content: There he stood on the debate stage, all by his lonesome.
Pinal County Sheriff Mark Lamb, rocking a ...

Title: Calls for Biden to Step Aside Are About to Get Deafening
Description: The evening left Democratic insiders gobsmacked.
Content: This article is part of The D.C. Brief, TIMEs politics newsletter. Sign up here to get stories like ...

Title: Clash Of The Titans? Charlamagne Tha God On Biden Vs. Trump CNN

As we can see, the search queries often return a large number of results, many of which are not relevant to the user's original question. To improve the quality of the final answer, we use embeddings to re-rank and filter the results.
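
The loop above issues the queries one after another, while the overview describes running them in parallel. Since each query is an independent HTTP request, a thread pool is one way to do that. The sketch below is an assumption-laden illustration: `run_searches` and `fake_search` are hypothetical names, and `fake_search` stands in for the real search call so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_searches(
    queries: list[str],
    search_fn: Callable[[str], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Run search_fn for every query concurrently and merge the articles, de-duplicated by URL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `queries`.
        results = list(pool.map(search_fn, queries))

    by_url: dict[str, dict] = {}
    for result in results:
        if result.get("status") != "ok":
            raise RuntimeError(result.get("message", "search failed"))
        for article in result["articles"]:
            # A later article with the same URL replaces the earlier one.
            by_url[article["url"]] = article
    return list(by_url.values())


# Hypothetical stand-in for a real search function; it needs no network access.
def fake_search(query: str) -> dict:
    url = f"https://example.com/{query[0]}"
    return {"status": "ok", "articles": [{"url": url, "title": query}]}


merged = run_searches(["alpha", "apple", "beta"], fake_search)
print(len(merged))
```

With the notebook's own function, the call would look like `run_searches(queries, lambda q: search_news(query=q, news_api_key=news_api_key))`. Keeping `max_workers` small is a cautious default, since search APIs commonly enforce rate limits.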

## 4. Re-rank

We first generate a hypothetical ideal answer to rerank and compare our results against. This helps prioritize results that look like good answers, rather than those that are merely similar to our question. Here is the prompt we use to generate the hypothetical answer.


In [5]:
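# The prompt below is written in Chinese. In summary: generate a hypothetical answer to
# the user's question for ranking search results; use placeholders such as NAME or PLACE
# instead of real facts, and answer in English.
HA_INPUT = f"""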
为用户的问题生成一个假设答案。此答案将用于对搜索结果进行排名。假装您拥有回答所需的所有信息，但不要使用任何实际事实。相反，使用占位符例如 NAME 做了某事，或 NAME 在 PLACE 说了某事，请你用英语来输出。
用户问题: {USER_QUESTION}
格式要求: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""
hypothetical_answer = json_glm(HA_INPUT)["hypotheticalAnswer"]
hypothetical_answer

'Today, during the presidential debate, Candidate NAME showcased strong arguments on the topic of ECONOMIC POLICY, while Opponent NAME emphasized their stance on FOREIGN POLICY. The debate took place at VENUE NAME, with Moderator NAME facilitating the discussion.'

Now, let's generate embeddings for the search results and the hypothetical answer. We then compare these embeddings to obtain a semantic similarity metric. Note that the code computes a plain dot product in place of a full cosine similarity calculation. The original cookbook can do this because OpenAI embeddings are returned normalized; here we assume the same holds for the `embedding-2` vectors. If it does not, divide the dot product by the product of the two vector norms.
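
For reference, here is a minimal sketch of the full cosine similarity calculation in plain Python. It does not rely on the embeddings being normalized, and for unit-length vectors it gives the same value as the dot product. The function name `cosine_similarity` and the zero-vector convention are assumptions made for this illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot_product / (norm_a * norm_b)


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Swapping this function in for `dot` in the similarity loop makes the ranking independent of vector length, at the cost of two extra norm computations per comparison.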


In [6]:
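def embeddings(texts: list[str]) -> list[list[float]]:
    # Embed a batch of texts; each embedding is a list of floats.
    response = client.embeddings.create(model="embedding-2", input=texts)
    return [data.embedding for data in response.data]


# Wrap the single answer in a list so the call matches the function signature.
hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]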
articles_list = [f"{article['title']} {article['description']} {article['content'][0:100]}" for article in articles]
articles_list[0:10]

["The first US presidential debate will have some noticeable differences It didn't come easy, and it won't feel quite the same, but we're getting our first US presidential debate tonight. Chip Somodevilla/Getty; Scott Eisen/Getty Images; BI\r\n<ul><li>This post originally appeared in the I",
 "Kari Lake refused to debate Mark Lamb. Now I know why Opinion: Mark Lamb is the Republican U.S. Senate candidate that Ruben Gallego doesn't want to face. Neither did Kari Lake. There he stood on the debate stage, all by his lonesome.\r\nPinal County Sheriff Mark Lamb, rocking a ",
 'Calls for Biden to Step Aside Are About to Get Deafening The evening left Democratic insiders gobsmacked. This article is part of The D.C. Brief, TIMEs politics newsletter. Sign up here to get stories like ',
 'Clash Of The Titans? Charlamagne Tha God On Biden Vs. Trump CNN Debate, Power Of Political Plain Speaking On ElectionLine Podcast Editor’s note:\xa0It’s Debate Night in America and the\xa0Deadline ElectionLine 

In [7]:
batch_size = 50
all_article_embeddings = []

for i in range(0, len(articles_list), batch_size):
    batch = articles_list[i:i + batch_size]
    batch_embeddings = embeddings(batch)
    all_article_embeddings.extend(batch_embeddings)

print(len(all_article_embeddings))

cosine_similarities = []
for article_embedding in all_article_embeddings:
    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))

cosine_similarities[0:10]

263


[0.5579376441846292,
 0.39300627131951305,
 0.4572582644647345,
 0.4668488777936274,
 0.5289598787154811,
 0.48261735464648114,
 0.521426815078863,
 0.36705116044372865,
 0.46313505343410993,
 0.4557141810237793]

## 5. Using the similarity scores to re-rank the results

Finally, we use these similarity scores to sort and filter the results.


In [8]:
scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print top 5 articles
print("Top 5 articles:", "\n")

for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()


Top 5 articles: 

Title: How have debate topics changed? What to expect tonight.
Description: Will the topics in Thursday’s presidential debate cover what voters care about most?
Content: Among the top issues for uncommitted and sporadic voters from battleground states who will likely pi...
Score: 0.6052164837899081

Title: 2024 presidential debate analysis, discussion
Description: President Joe Biden and his rival Donald Trump squared off as the candidates attempt to lure currently undecided voters in the first general election debate on Thursday, June 27.
Content: MILWAUKEE - The first general election debate of the 2024 season kicked off on Thursday, June 27.
U...
Score: 0.596790424103

Title: Trump Should Never Have Had This Platform
Description: The debate was a travesty—because its whole premise was to treat a failed coup leader as a legitimate candidate for the presidency again.
Content: The first question about January 6 was asked at Minute 41.Donald Trump replied with a barra
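
The cell above sorts the articles and keeps the top five. If filtering by a minimum score is also wanted, both steps can be combined in one helper. The sketch below is a hedged illustration: the name `rerank`, the `min_score` parameter, and the sample data are assumptions, and a sensible threshold depends on the embedding model and would need to be tuned.

```python
def rerank(
    items: list[dict],
    scores: list[float],
    top_k: int = 5,
    min_score: float = 0.0,
) -> list[tuple[dict, float]]:
    """Pair items with scores, drop scores below min_score, return the top_k best first."""
    scored = [(item, score) for item, score in zip(items, scores) if score >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up sample data, only to show the call shape.
sample_articles = [{"title": "A"}, {"title": "B"}, {"title": "C"}]
sample_scores = [0.39, 0.61, 0.52]

for article, score in rerank(sample_articles, sample_scores, top_k=2, min_score=0.45):
    print(article["title"], score)
```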

## 6. Get the answer

Finally, given the top search results, the model generates an answer to the user's question, including citations and links.


In [9]:
formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:5]
]

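# The prompt below is written in Chinese. In summary: answer the user's question from the
# given search results, include as much information as possible, cite relevant result URLs
# as markdown links, and answer in Chinese.
ANSWER_INPUT = f"""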
根据给定的搜索结果生成用户问题的答案。
TOP_RESULTS：{formatted_top_results}
USER_QUESTION：{USER_QUESTION}
在答案中包含尽可能多的信息。将相关搜索结果网址引用为 markdown 链接。用中文输出答案。
"""

completion = client.chat.completions.create(
    model="glm-4-0520",
    messages=[{"role": "user", "content": ANSWER_INPUT}],
    temperature=0.5,
    stream=False,
)
completion.choices[0].message.content

'今天美国总统大选的辩论情况如下：总统乔·拜登和其对手唐纳德·特朗普在6月27日的首次总统选举辩论中交锋，双方试图吸引目前尚未决定投票给谁选民。辩论的议题可能包括经济、堕胎、民主和战争等，这些议题是否涵盖了选民最关心的问题仍存在争议。\n\n- 关于辩论议题的变化以及今晚辩论的预期，请参考：[How have debate topics changed? What to expect tonight.](https://www.washingtonpost.com/politics/2024/06/27/debate-topics-economy-abortion-democracy-war/)\n- 对于2024年总统辩论的分析和讨论，请参考：[2024 presidential debate analysis, discussion](https://www.fox6now.com/news/2024-presidential-debate-analysis)\n- 关于特朗普不应该拥有这个辩论平台的观点，请参考：[Trump Should Never Have Had This Platform](https://www.theatlantic.com/politics/archive/2024/06/debate-trump-platform-january-6/678818/?utm_source=feed)\n- 关于拜登提前进行辩论是否是一个巨大的赌博，请参考：[Early debate was a huge gamble Biden may regret](https://news.sky.com/story/us-presidential-election-early-debate-was-a-huge-gamble-biden-may-regret-13160215)\n- 关于谁将赢得首次总统辩论的预测，请参考：[Here’s who will win the first presidential debate](https://nypost.com/2024/06/26/us-news/heres-who-will-win-the-first-presidential-debate/)'
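
The answer prompt above embeds the Python `repr` of the result list. One possible variation, sketched below under stated assumptions, serializes the sources as JSON instead, which keeps quoting unambiguous and non-ASCII text readable. The function name `build_answer_prompt`, the English prompt wording, and the sample data are illustrative and not part of the original notebook.

```python
import json


def build_answer_prompt(question: str, results: list[dict]) -> str:
    """Build an answer prompt whose sources are serialized as JSON."""
    sources = [
        {"title": r["title"], "description": r["description"], "url": r["url"]}
        for r in results
    ]
    # ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the prompt.
    sources_json = json.dumps(sources, ensure_ascii=False, indent=2)
    return (
        "Generate an answer to the user's question based on the given search results.\n"
        f"TOP_RESULTS: {sources_json}\n"
        f"USER_QUESTION: {question}\n"
        "Include as much information as possible in the answer. "
        "Reference the relevant search result URLs as markdown links."
    )


# Made-up sample result, only to show the call shape.
sample_results = [
    {"title": "Debate recap", "description": "What happened", "url": "https://example.com/recap", "content": "..."}
]
prompt = build_answer_prompt("今天美国总统大选的辩论情况", sample_results)
print(prompt)
```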