# Article Search Using Embeddings
This notebook covers article search with embeddings, with the ERNIE (文心) large model used for answering; see [Question Answering](../Question_Answering_using_embedding.ipynb).

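The main steps are:
1. Obtain the authentication parameters for the backend you want to use; see the [authentication documentation](../../docs/authentication.md).
2. Fetch the relevant articles with the Wikipedia API.
3. Split each article into chunks at the subheading level.
4. Call the ERNIE Baizhong (文心百中) semantic model to obtain the embeddings, and store them.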
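
Step 3 can be sketched as follows. This is a minimal illustration, assuming the article is plain text in which section headings appear on their own line as `== Heading ==` (the form commonly seen in Wikipedia plain-text extracts). The function name `split_by_headings` and the `Summary` label for the lead section are hypothetical and not part of this notebook.

```python
import re
from typing import List, Tuple

# Assumption: headings look like "== History ==" or "=== Bidding ===" on their own line.
HEADING_RE = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


def split_by_headings(text: str, lead_heading: str = "Summary") -> List[Tuple[str, str]]:
    """Split plain article text into (heading, content) chunks."""
    chunks: List[Tuple[str, str]] = []
    heading = lead_heading  # label for the text before the first heading
    buffer: List[str] = []
    for line in text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            # A new heading closes the previous chunk; empty sections are dropped.
            content = "\n".join(buffer).strip()
            if content:
                chunks.append((heading, content))
            heading = match.group(2)
            buffer = []
        else:
            buffer.append(line)
    content = "\n".join(buffer).strip()
    if content:
        chunks.append((heading, content))
    return chunks


sample = (
    "Intro text.\n\n== History ==\nFounded long ago.\n\n"
    "=== Bidding ===\nTokyo won.\n\n== Empty ==\n"
)
for heading, content in split_by_headings(sample):
    print(heading, "->", content)
```

Each `(heading, content)` pair would correspond to one row of the `heading` and `content` columns used later in the notebook.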

In [36]:
import math
import os
import time
from typing import List

import erniebot
import pandas as pd
from tqdm import tqdm

erniebot.api_type = 'aistudio'
erniebot.access_token = os.getenv("EB_ACCESS_TOKEN")

def get_embedding(texts: List[str]) -> List[List[float]]:
    """Embed `texts` and return one embedding vector per input string."""
    batch_size = 16  # number of texts sent per request
    embeddings: List[List[float]] = []
    for start in tqdm(range(0, len(texts), batch_size)):
        response = erniebot.Embedding.create(
            model='ernie-text-embedding',
            input=texts[start:start + batch_size],
        )
        embeddings.extend(response.get_result())
        time.sleep(1)  # pause between consecutive requests
    return embeddings

In [32]:
# The articles come from the Wikipedia entries related to the 2020 Olympics and have already been split into chunks by subheading: 3,910 chunks in total
df = pd.read_csv('../data/olympics_data.csv')
df.shape

(3910, 3)

In [39]:
df.content.to_list()[0]

'The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and also known as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July 2021. Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013.The Games were originally scheduled to take place from 24 July to 9 August 2020, but due to the global COVID-19 pandemic, on 24 March 2020 the event was postponed to 2021, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 branding for marketing purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of eme

In [37]:
olympics_doc = get_embedding(df.content.to_list())

  0%|          | 0/245 [00:00<?, ?it/s]


InvalidParameterError: embeddings max tokens per batch size is 384
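
The error message suggests that at least one input exceeds a limit of 384 tokens. One possible workaround is to shorten each chunk before embedding it. The sketch below assumes that character count is an acceptable rough stand-in for the model's token count, which may not hold for every text; `truncate_texts`, `batched`, and the `max_chars` default are hypothetical and would need tuning against the real tokenizer.

```python
from typing import List


def truncate_texts(texts: List[str], max_chars: int = 384) -> List[str]:
    """Cut every text to at most `max_chars` characters.

    Assumption: character count is only a rough proxy for the model's
    token count; the real tokenizer may count differently.
    """
    return [text[:max_chars] for text in texts]


def batched(items: List[str], batch_size: int = 16) -> List[List[str]]:
    """Group items into consecutive batches of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


texts = ["a" * 1000, "short text"] + ["x"] * 20
safe = truncate_texts(texts)
batches = batched(safe)
print(len(batches), max(len(t) for t in safe))
```

In this notebook the call would become `get_embedding(truncate_texts(df.content.to_list()))`. Truncation discards the tail of long chunks, so splitting long chunks into several shorter ones may be preferable when that text matters.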

In [29]:
df = df[['title','heading','content']]
df.to_csv('olympics_data.csv',index = False)

In [5]:
import wikipedia
wikipedia.search("2022 Winter Olympics")

['2022 Winter Olympics',
 '2022 Winter Olympics medal table',
 'Winter Olympic Games',
 'India at the 2022 Winter Olympics',
 "Ice hockey at the 2022 Winter Olympics – Men's tournament",
 'All-time Olympic Games medal table',
 'Figure skating at the 2022 Winter Olympics',
 'Ice hockey at the 2022 Winter Olympics',
 'United States at the 2022 Winter Olympics',
 "Figure skating at the 2022 Winter Olympics – Women's singles"]

In [9]:
page = wikipedia.page(wikipedia.search("2022 Winter Olympics")[1])

PageError: Page id "2021 winter olympics medal table" does not match any pages. Try another id!
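
The `PageError` names a title ("2021 winter olympics medal table") that differs from the one requested, which is consistent with the `wikipedia` package rewriting the title through its `auto_suggest` option. A possible workaround, sketched below, is to pass `auto_suggest=False` to `wikipedia.page`. The helper name `load_page_exact` and its injectable `loader` argument are hypothetical, added only so the sketch can be exercised without network access.

```python
from typing import Any, Callable, Optional


def load_page_exact(title: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load a page by its exact title, without title auto-suggestion."""
    if loader is None:
        import wikipedia  # imported lazily; requires the `wikipedia` package

        loader = wikipedia.page
    return loader(title, auto_suggest=False)


# Offline demonstration with a stand-in loader that echoes its arguments.
def fake_loader(title: str, **kwargs: Any) -> dict:
    return {"title": title, **kwargs}


print(load_page_exact("2022 Winter Olympics medal table", loader=fake_loader))
```

With the real package installed, the failing cell would become `page = load_page_exact(wikipedia.search("2022 Winter Olympics")[1])`.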

In [15]:
import wikipediaapi

title = "china"
wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
page = wiki.page(title)
language = "zh"
lpage = page.langlinks[language]  # fr es ...
print(lpage.text)


TypeError: Wikipedia.__init__() missing 1 required positional argument: 'user_agent'
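
The `TypeError` shows that the installed version of `wikipediaapi` requires a `user_agent` argument. A minimal sketch of the fix follows, assuming the argument accepts a free-form identifying string. The project name, version, and contact address are placeholders, and `make_user_agent` and `build_wiki_client` are hypothetical helpers.

```python
def make_user_agent(project: str, version: str, contact: str) -> str:
    """Build an identifying string of the form 'project/version (contact)'."""
    return f"{project}/{version} ({contact})"


def build_wiki_client(user_agent: str, language: str = "en"):
    """Create a Wikipedia client; needs the `wikipediaapi` package installed."""
    import wikipediaapi  # imported lazily so the helper above works without it

    return wikipediaapi.Wikipedia(
        user_agent=user_agent,
        language=language,
        extract_format=wikipediaapi.ExtractFormat.WIKI,
    )


# Placeholder values; replace them with your own project and contact details.
agent = make_user_agent("ArticleSearchDemo", "0.1", "you@example.com")
print(agent)
```

The cell above could then use `wiki = build_wiki_client(agent)` followed by `page = wiki.page(title)`, leaving the `langlinks` lookup unchanged.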

In [17]:
wikipedia.set_lang("fr")
wikipedia.search("Olympics",results=20)

['Olympic Games',
 '1988 Summer Olympics',
 '2020 Summer Olympics',
 'Summer Olympic Games',
 '1936 Olympics',
 '1972 Olympics',
 '2000 Summer Olympics',
 'International Olympic Committee',
 '1972 Summer Olympics',
 '1936 Summer Olympics',
 'Winter Olympic Games',
 '2008 Summer Olympics',
 '1996 Summer Olympics',
 '2010 Winter Olympics',
 '1992 Summer Olympics',
 '1980 Summer Olympics',
 '2004 Summer Olympics',
 'Tokyo Olympics',
 '2016 Summer Olympics',
 '1948 Olympics']

In [18]:
# Get the URL of the Article
page = wikipedia.page("2020 Summer Olympics")
print(page.url)


https://en.wikipedia.org/wiki/2020_Summer_Olympics
