## Prompt Optimization for Text Question Answering

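### Task
- Task description: combine the retrieved results with the question to construct a prompt, then answer the question
- Task requirements:
    - Construct the prompt
    - Call the API to answer the question
    - Check-in requirement: complete the full RAG pipeline and submit the results for scoring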
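
The flow described above (retrieve, build a prompt, ask the model) can be sketched end to end as follows. This is a minimal illustration rather than the notebook's implementation: `rag_answer`, `first_stage_score`, `rerank_score`, and `ask` are hypothetical names, and the scorers and the model call are passed in as plain functions so the sketch runs without a model or an API key.

```python
from typing import Callable, Dict, Sequence


def rag_answer(
    question: str,
    pages: Sequence[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    ask: Callable[[str], str],
    top_k: int = 3,
) -> Dict[str, str]:
    """Hypothetical helper: retrieve, rerank, build a prompt, ask."""
    if not pages or top_k < 1:
        raise ValueError("need at least one page and top_k >= 1")
    # Stage 1: cheap score for every page, keep the top_k candidates.
    ranked = sorted(
        range(len(pages)),
        key=lambda i: first_stage_score(question, pages[i]),
        reverse=True,
    )
    candidates = ranked[:top_k]
    # Stage 2: a more expensive scorer picks the single best candidate.
    best = max(candidates, key=lambda i: rerank_score(question, pages[i]))
    # Combine the retrieved page with the question and query the model.
    prompt = f"Material: {pages[best]}\n\nQuestion: {question}"
    return {"reference": f"page_{best + 1}", "answer": ask(prompt)}


def shared_words(question: str, page: str) -> int:
    """Toy scorer (assumption): number of whitespace-separated words in common."""
    return len(set(question.split()) & set(page.split()))


pages = ["seat ventilation", "child seat anchor", "hood release"]
# The "model" here simply echoes the prompt back.
result = rag_answer("child seat", pages, shared_words, shared_words, ask=lambda p: p)
print(result["reference"])  # page_2
```

In the notebook itself, BM25 plays the role of `first_stage_score`, the BGE reranker that of `rerank_score`, and `ask_glm` that of `ask`.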

### Code

In [1]:
import time 
import jwt
import requests
import jieba
import re
from tqdm import tqdm
import json
import pdfplumber
from langchain.schema import Document
from sentence_transformers import SentenceTransformer
page_content  = []



In [2]:

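def extract_page_text(filepath, max_len=256, overlap_len=100):
    """Return (overlapping chunks of the whole document, cleaned text per page)."""
    page_content = []
    page_count = 0  # never advanced: chunks below span pages, so 'page' is always 1
    with pdfplumber.open(filepath) as pdf:
        for page in tqdm(pdf.pages):
            # extract_text() can return None for a page without a text layer
            page_text = (page.extract_text() or '').strip()
            raw_text = [text.strip() for text in page_text.split('\n')]
            new_text = '\n'.join(raw_text)
            # Drop 2-3 digit numbers at line starts (presumably printed page numbers)
            new_text = re.sub(r'\n\d{2,3}\s?', '\n', new_text)
            # Skip near-empty pages and pages with dot leaders (table of contents),
            # but keep a placeholder so that list index + 1 equals the page number
            if len(new_text) > 10 and '..............' not in new_text:
                page_content.append(new_text)
            else:
                page_content.append('  ')

    # Slide a max_len window over the joined text, stepping by max_len - overlap_len
    cleaned_chunks = []
    i = 0
    all_str = ''.join(page_content).replace('\n', '')
    while i < len(all_str):
        cur_s = all_str[i:i + max_len]
        if len(cur_s) > 10:
            cleaned_chunks.append(Document(page_content=cur_s, metadata={'page': page_count + 1}))
        i += (max_len - overlap_len)

    return cleaned_chunks, page_content

# apikey: the real API key in "id.secret" form; exp_seconds: token lifetime in seconds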
def generate_token(apikey: str, exp_seconds: int):
    try:
        id, secret = apikey.split(".")
    except Exception as e:
        raise Exception("invalid apikey", e)

    payload = {
        "api_key": id,
        "exp": int(round(time.time() * 1000)) + exp_seconds * 1000,
        "timestamp": int(round(time.time() * 1000)),
    }
    return jwt.encode(
        payload,
        secret,
        algorithm="HS256",
        headers={"alg": "HS256", "sign_type": "SIGN"},
    )
def ask_glm(content):
    url = "https://open.bigmodel.cn/api/paas/v4/chat/completions"
    headers = {
      'Content-Type': 'application/json',
      'Authorization': generate_token("f1a0b6c3d36d46d3eed74a6c7de3e9e4.pZ88EkbBscyHXXcJ", 1000)
    }

    data = {
        "model": "glm-3-turbo",
        "messages": [{"role": "user", "content": content}]
    }

    response = requests.post(url, headers=headers, json=data)
    return response.json()
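
`extract_page_text` cuts the concatenated text into windows of `max_len` characters that overlap by `overlap_len`. The windowing on its own can be sketched as below. `sliding_chunks` and its `min_len` parameter are illustrative names, not part of the notebook, and the check on `overlap_len` is an added safeguard against a loop that never advances.

```python
from typing import List


def sliding_chunks(text: str, max_len: int, overlap_len: int, min_len: int = 0) -> List[str]:
    """Hypothetical helper: fixed-size character windows with overlap."""
    if not 0 <= overlap_len < max_len:
        raise ValueError("overlap_len must satisfy 0 <= overlap_len < max_len")
    step = max_len - overlap_len  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_len]
        # Keep only windows longer than min_len (the notebook uses 10).
        if len(piece) > min_len:
            chunks.append(piece)
    return chunks


print(sliding_chunks("abcdefghij", max_len=4, overlap_len=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With the notebook's defaults (`max_len=256`, `overlap_len=100`), each window starts 156 characters after the previous one.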



In [3]:
questions = json.load(open("./data/questions.json"))
filepath = './data/初赛训练数据集.pdf'
_,pdf_content = extract_page_text(filepath, max_len=256, overlap_len=100)

100%|██████████| 354/354 [00:06<00:00, 51.54it/s]


In [5]:
from sklearn.preprocessing import normalize
from rank_bm25 import BM25Okapi
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('/root/code/quietnight/bge-reranker-large/')
rerank_model = AutoModelForSequenceClassification.from_pretrained('/root/code/quietnight/bge-reranker-large/')
rerank_model.cuda()


XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-23): 24 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=1024, out_fe

In [6]:
pdf_content_words = [jieba.lcut(x ) for x in pdf_content]
bm25 = BM25Okapi(pdf_content_words)


Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.571 seconds.
Prefix dict has been built successfully.


In [7]:
# First score every page against the question with BM25 and keep the top 3
# Then rerank those top 3 with the BGE reranker
for query_idx in range(5):
    doc_scores = bm25.get_scores(jieba.lcut(questions[query_idx]["question"]))
    max_score_page_idxs = doc_scores.argsort()[-3:]

    pairs = []
    for idx in max_score_page_idxs:
        pairs.append([questions[query_idx]["question"], pdf_content[idx] ])

    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        inputs = {key: inputs[key].cuda() for key in inputs.keys()}
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()
    max_score_page_idx = max_score_page_idxs[scores.cpu().numpy().argmax()]
    questions[query_idx]['reference'] = 'page_' + str(max_score_page_idx + 1)

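    # The prompt stays in Chinese to match the dataset. In English:
    # "You are a car expert. Use the given material to answer the question.
    #  If the answer cannot be found in the material, output: Based on the
    #  given material, the question cannot be answered."
    # followed by "Material: {0}" and "Question: {1}".
    prompt = '''你是一个汽车专家，帮我结合给定的资料，回答所给的问题。如果问题无法从资料中获得，请输出:结合给定的资料，无法回答问题。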
资料：{0}

问题：{1}
    '''.format(
        pdf_content[max_score_page_idx] ,
        questions[query_idx]["question"]
    )
    answer = ask_glm(prompt)['choices'][0]['message']['content']
    questions[query_idx]['answer'] = answer
    print(query_idx,questions[query_idx]["question"])
    print(answer)

    # break

0 “前排座椅通风”的相关内容在第几页？
结合给定的资料，无法回答问题。因为资料中并没有提到“前排座椅通风”的相关内容在哪几页。
1 "关于车辆的儿童安全座椅固定装置，在哪一页可以找到相关内容？"
根据给定资料，关于车辆的儿童安全座椅固定装置的相关内容可以在第二页找到。
2 “打开前机舱盖”的相关信息在第几页？
结合给定的资料，无法回答问题。因为资料中并没有提及“打开前机舱盖”的信息所在的页数。
3 “打开前机舱盖”这个操作在哪一页？
“打开前机舱盖”这个操作在资料的第3页。
4 “查看行车记录仪视频”这一项内容在第几页？
根据给定的资料，无法回答“查看行车记录仪视频”这一项内容在第几页的问题，因为资料中没有提到查看视频的具体页面信息。
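
In the output above, the model either replies that the question cannot be answered from the material or names a page even though the prompt states no page number. The prompt passes only the page text, while the retrieved page index is stored separately in `reference`. One possible refinement, an assumption rather than something the notebook does, is to label the material with its page number. `build_prompt_with_page` is a hypothetical helper, and whether it improves the score would need to be checked on the task.

```python
def build_prompt_with_page(page_text: str, page_number: int, question: str) -> str:
    """Hypothetical variant of the notebook's prompt with a page label on the material."""
    return (
        "你是一个汽车专家，帮我结合给定的资料，回答所给的问题。"
        "如果问题无法从资料中获得，请输出:结合给定的资料，无法回答问题。\n"
        # "资料（第N页）" reads "Material (page N)"; the page label is the only addition.
        f"资料（第{page_number}页）：{page_text}\n\n"
        f"问题：{question}\n"
    )


# "<page text>" is a placeholder for the retrieved page content.
example_prompt = build_prompt_with_page("<page text>", 3, "“打开前机舱盖”这个操作在哪一页？")
print(example_prompt)
```

Inside the loop, the call would take `pdf_content[max_score_page_idx]`, `max_score_page_idx + 1`, and `questions[query_idx]["question"]`.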


In [12]:

with open('prompt_submit2.json', 'w', encoding='utf8') as up:
    json.dump(questions, up, ensure_ascii=False, indent=4)

In [26]:
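# Relevance-check prompt (kept in Chinese). In English: "You are an expert on cars.
# Decide whether the question below is related to car usage. Answer only 相关
# (relevant) or 不相关 (not relevant) and nothing else. The question is: {}"
prompt = """
你是一个汽车方面的专家，请判断下面的提问回答是否与汽车使用相关。只能回答相关或者是不相关，不要回答其他内容
问题是:
{}
"""

# Draft extraction prompt, left as an unassigned string. In English: "You are a vehicle
# driving safety officer, well versed in driving, repair and maintenance. I will give you
# a passage on these topics that was converted from a PDF, so its formatting may be off;
# please extract the key information from it. Return a YAML dictionary named key_word
# whose keys are the key items and whose values are the extracted information; if nothing
# can be extracted, return an empty value."
"""
你是一个汽车驾驶安全员,精通有关汽车驾驶、维修和保养的相关知识。我会给你一段汽车驾驶、维修和保养相关的文本，这是从PDF文件转换而来，里面格式可能会有些问题，需要你帮忙从中提取一些关键信息出来，
    请返回一个yaml脚本的字典命名为key_word，key为关键信息，value为提取到的信息，如果提取不到，则返回一个空value

"""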

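The first prompt in the cell above asks the model to reply with only 相关 (relevant) or 不相关 (not relevant). A reply of that kind could be mapped to a boolean roughly as follows. `parse_relevance` is a hypothetical helper, and returning `None` for an unexpected reply is an assumption about how such cases should be handled.

```python
from typing import Optional


def parse_relevance(reply: str) -> Optional[bool]:
    """Hypothetical helper: map the model's reply to True, False, or None."""
    text = reply.strip()
    # Check the negative label first, because "不相关" contains "相关".
    if "不相关" in text:
        return False
    if "相关" in text:
        return True
    return None  # the model did not follow the instruction


print(parse_relevance("相关"))      # True
print(parse_relevance("不相关。"))  # False
```

Questions judged not relevant could then be given the fallback answer directly, without a retrieval step.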