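# Let GPT Read Documents for You: A Simple Implementation

GPT-4 reads a document much the way a person does. Imagine being handed a PDF that runs to dozens of pages: which parts do you look at first? The abstract, the summary, and the table of contents. You then form a handful of questions in your head (roughly 3 to 5) and keep reading with those questions in mind.

To read efficiently with GPT-4 and work around the per-request token limit, we use OpenAI's official embedding tools. Put simply, an embedding maps a piece of text to a vector of numbers, a bit like committing a passage of the article to memory.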
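To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity on toy 3-dimensional vectors. The vector values and the `cosine` helper are illustrative assumptions; real `text-embedding-ada-002` vectors have 1536 dimensions and come from the API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values).
question = [1.0, 0.0, 1.0]
chunk_close = [0.9, 0.1, 1.1]   # points in nearly the same direction
chunk_far = [0.0, 1.0, 0.0]     # orthogonal to the question

# The closer chunk scores higher, so it would be retrieved first.
print(cosine(question, chunk_close) > cosine(question, chunk_far))
```

A score near 1 means the two texts point in almost the same direction in embedding space; a score near 0 means they are unrelated.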

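The program therefore has the following steps:

Step 1: Clean and slice the PDF

1. Clean the PDF by removing repeated headers and footers and the long dot leaders in the table of contents, which keeps the number of API calls (each of which costs money) as low as possible.
2. Slice the document by paragraph, splitting any overly long paragraph into two parts.
3. Send every slice to the API to generate an embedding, and store the results in a Parquet file for later reuse.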
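The cleaning and slicing in Step 1 can be sketched without any PDF library. The helpers `clean_pages` and `split_chunks`, the "appears on more than half of the pages" rule for detecting headers, and the sample pages are assumptions made for illustration; the notebook's own cell works on font sizes instead.

```python
import re
from collections import Counter

def clean_pages(pages):
    """Drop lines repeated across pages (likely headers/footers) and collapse dot leaders."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, len(pages) // 2 + 1)  # seen on more than half of the pages
    repeated = {line for line, c in counts.items() if c >= threshold}
    cleaned = []
    for page in pages:
        kept = [re.sub(r"\.{2,}", " ", line)
                for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

def split_chunks(text, max_len=2000):
    """Split text on blank lines; cut any paragraph longer than max_len into two halves."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) > max_len:
            mid = len(para) // 2
            chunks.extend([para[:mid], para[mid:]])
        else:
            chunks.append(para)
    return chunks

pages = [
    "Annual Report\nIntroduction .......... 3\nBody text one",
    "Annual Report\nBody text two",
    "Annual Report\nBody text three",
]
cleaned = clean_pages(pages)
print(cleaned[0])
print(split_chunks("a" * 10 + "\n\n" + "b" * 5, max_len=6))
```

Splitting at the midpoint is the crudest way to halve a paragraph; splitting at the nearest sentence boundary would keep the halves more readable.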

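Step 2: Generate an overview and propose questions

1. Read the first 10 pages of the document (no more than 4096 tokens of content; the code below approximates this with a 3,000-character cap) and submit them to GPT-4 to generate an overview.
2. Have GPT-4 propose five relevant questions based on the overview. This completes the first pass: reading the document and raising questions.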
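The model returns its five questions as free text, usually a numbered list, so the reply has to be split into individual questions before Step 3. The sketch below is one way to do that; the `parse_questions` helper and the sample reply are hypothetical, and a reply that opens with a preamble sentence would need extra filtering.

```python
import re

def parse_questions(reply, limit=5):
    """Extract up to `limit` questions from a numbered or bulleted model reply."""
    questions = []
    for line in reply.splitlines():
        # Strip leading list markers such as "1.", "2)", "3、", "-", "*".
        text = re.sub(r"^\s*(?:\d+[.)、]|[-*•])\s*", "", line).strip()
        if text:  # skip blank lines between items
            questions.append(text)
    return questions[:limit]

reply = "1. What data sources are used?\n\n2) How is congestion detected?\n- What are the limits?"
print(parse_questions(reply))
```

Skipping blank lines matters here: an empty string passed on as a "question" would trigger a wasted, and possibly failing, embedding request.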

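Step 3: Answer the questions

Taking "Question 1" as an example, we do the following:

1. Send "Question 1" to the API to generate embedding-1.
2. Compare embedding-1 against each of the previously generated embeddings and compute the cosine similarity.
3. Sort the results and keep the Top N most similar embeddings.
4. Submit the original text behind the embeddings selected in step 3 to GPT-4 and have it write a fluent answer.
5. Output the original text of the Top N embeddings from step 3 so the source of the answer is visible.

Repeat this process four more times so that GPT-4 answers all five questions, then combine everything into a single Markdown file and save it.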
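The retrieval and reporting part of Step 3 can be sketched offline as follows. The toy 2-dimensional embeddings, the `top_n` and `format_answer` helpers, and the stub answer are all assumptions for illustration; in the notebook the vectors come from the embeddings API and the answer from the chat model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(chunks, query_vec, n=3):
    """Return the n chunks most similar to query_vec, best first."""
    scored = [(cosine(c["embedding"], query_vec), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:n]]

def format_answer(question, answer, hits, preview=150):
    """Render one question, its answer, and the source excerpts as Markdown text."""
    lines = ["## " + question, "", answer, ""]
    for hit in hits:
        lines.append("- Page {}: {}...".format(hit["page"], hit["text"][:preview]))
    return "\n".join(lines)

# Hypothetical chunks with made-up embeddings.
chunks = [
    {"page": 1, "text": "Probe vehicle data", "embedding": [1.0, 0.0]},
    {"page": 2, "text": "Funding and governance", "embedding": [0.0, 1.0]},
    {"page": 3, "text": "Sensor fusion methods", "embedding": [0.8, 0.6]},
]
hits = top_n(chunks, [1.0, 0.0], n=2)
print(format_answer("Which data sources are used?", "stub answer", hits))
```

Listing the page and a short preview next to each answer is what lets a reader check the model's claim against the original text.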

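Step 4: Support additional questions

Sometimes the questions GPT-4 proposes are not satisfactory, and we want to keep querying the document. Here we use Python's input function to run the same question-answering flow from the command line. Once we have asked everything we want, the answers to these follow-up questions are collected in a second Markdown file saved in the same directory as the PDF.

With these steps, GPT-4 helps us read documents more efficiently and understand their content in depth. The approach saves reading time and makes it easier to work through large volumes of documents.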
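The follow-up loop in Step 4 can be sketched with the model call and the keyboard input both replaced by stand-ins, so that it runs without network access. The `chat_loop` function, its `answer_fn` and `read` parameters, and the scripted questions are assumptions for illustration; in real use `read` would default to `input` and `answer_fn` would call the chat API.

```python
import os
import tempfile

def chat_loop(answer_fn, history_path, read=input):
    """Ask questions until the user types 'exit'; append each Q&A to history_path."""
    asked = 0
    while True:
        question = read("Please enter your question: ").strip()
        if question.lower() == "exit":
            break
        if not question:  # ignore empty input
            continue
        answer = answer_fn(question)
        # Append mode keeps earlier questions in the same Markdown file.
        with open(history_path, "a", encoding="utf-8") as f:
            f.write("## {}\n\n{}\n\n".format(question, answer))
        asked += 1
    return asked

# Scripted demo: a fake reader and a fake answer function replace input() and the API.
scripted = iter(["What is the scope?", "", "exit"])
demo_path = os.path.join(tempfile.mkdtemp(), "demo_chat.md")
count = chat_loop(lambda q: "stub answer to: " + q, demo_path, read=lambda _: next(scripted))
print(count)
```

Opening the history file in append mode is the design choice that matters: write mode would leave only the most recent answer in the file.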

In [None]:
import json
import logging as logger
import os

import openai
import pandas as pd
import pdfplumber
# embeddings_utils ships only with openai-python releases before 1.0
from openai.embeddings_utils import get_embedding, cosine_similarity
from tqdm import tqdm

In [None]:
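def search_embeddings(df, query, n=3, pprint=True):
    # Embed the query with the same model used for the document chunks
    query_embedding = get_embedding(
        query,
        engine="text-embedding-ada-002"
    )
    # Score every chunk against the query (adds a "similarity" column to df)
    df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))

    results = df.sort_values("similarity", ascending=False, ignore_index=True).head(n)
    # Record the matched pages and a 150-character preview in a module-level list
    # that the answer cells read after each call
    global sources
    sources = []
    # Iterate over the rows actually returned, which may be fewer than n
    for _, row in results.iterrows():
        sources.append({'Page ' + str(row['page']): row['text'][:150] + '...'})
    return results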

In [None]:
def create_prompt(df, user_input, strategy=None):
    result = search_embeddings(df, user_input)
    # Only the two most similar excerpts go into the prompt (fewer if the search returns fewer)
    excerpts = "\n".join(
        "{}. {}".format(rank, text)
        for rank, text in enumerate(result["text"].head(2), start=1)
    )
    ordering = "ordered by the cosine similarity of their embeddings to the query. "
    if strategy == "paper":
        instructions = (
            "You are a large language model whose expertise is reading and summarizing scientific papers. "
            "You are given a query and a series of text excerpts from a paper, " + ordering +
            "Use the given excerpts to write a detailed summary of the paper that answers the query."
        )
    elif strategy == "handbook":
        instructions = (
            "You are a large language model whose expertise is reading and summarizing financial handbooks. "
            "You are given a query and a series of text excerpts from a handbook, " + ordering +
            "Use the given excerpts to write a very detailed answer in Chinese that answers the query. "
            "Unless there is a reason not to, use the original text as much as possible. "
            "Write in clear and concise Chinese, using appropriate grammar and vocabulary, "
            "and focus on answering the specific query provided."
        )
    elif strategy == "contract":
        instructions = (
            "As a large language model specializing in reading and summarizing, your task is to read a query "
            "and a sequence of text excerpts, " + ordering +
            "Your goal is to provide a Chinese answer to the query using the given excerpts. "
            "Where possible, use the original text in your answer. "
            "Please ensure that your response adheres to the terms of the agreement. "
            "Focus on the specific query, providing relevant information and details based on the excerpts. "
            "Strive for clarity and conciseness, summarizing key points while maintaining accuracy and relevance. "
            "Prioritize understanding the context and meaning of both the query and the excerpts before answering."
        )
    else:
        instructions = (
            "As a language model specialized in reading and summarizing documents, your task is to provide "
            "a concise answer in Chinese based on a given query and a series of text excerpts from the document, "
            + ordering +
            "Use as much original text as possible. Your answer should be highly concise and accurate, "
            "directly answer the query, and be written in clear Chinese with appropriate grammar and vocabulary. "
            "Base your response on the provided excerpts."
        )
    # Shared tail: the question, the retrieved excerpts, and the final instruction
    prompt = (
        instructions
        + "\nGiven the question: " + user_input
        + "\n\nand the following excerpts as data:\n\n" + excerpts
        + "\n\nReturn a concise and accurate answer:"
    )
    logger.info('Done creating prompt')
    return prompt

In [None]:
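# Initialize the runtime environment
# Route requests through a local proxy; remove these two lines if you connect directly
os.environ["http_proxy"] = "http://127.0.0.1:1088"
os.environ["https_proxy"] = "http://127.0.0.1:1088"

# Placeholder credentials: replace with your own organization ID and API key
openai.organization = "org-your_org"
openai.api_key = "sk-your_api_key"

# Accumulates the Markdown report across the cells below
full_report = ""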

In [None]:
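# PDF input

pdf_path = "/Users/januswing/data/PDFReadTest/大数据在路况监测中的应用（英） PIARC 2023.pdf"
# Strip the extension; the prefix names the cached embeddings and the report files
file_name_prefix = os.path.splitext(pdf_path)[0]
# Open the file only long enough to count its pages
with pdfplumber.open(pdf_path) as pdf:
    number_of_pages = len(pdf.pages)
# Report header, in Chinese: "Analyzed document {path}, total pages {n}"
full_report = "{}分析文档{}，总页数{}\n\n".format(full_report, pdf_path, number_of_pages)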

In [None]:
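# Read the PDF content
import re

paper_text = []
pre_read_content = ""

with pdfplumber.open(pdf_path) as pdf:

    # Process the text of each page
    for i, page in enumerate(pdf.pages):
        # Collect text from the first 10 pages, capped at 3000 characters, for the overview
        if i < 10 and len(pre_read_content) < 3000:
            # extract_text() can return None for a page without text
            pre_read_content += page.extract_text() or ""
            pre_read_content = pre_read_content[:3000]
        words = page.extract_words(extra_attrs=['size'])
        blob_font_size = None
        blob_text = ''
        processed_text = []

        # Merge consecutive words that share a font size into one blob
        for word in words:
            if word['size'] == blob_font_size:
                blob_text += f" {word['text']}"
                if len(blob_text) >= 2000:  # upper bound on blob length; longer runs are cut here
                    processed_text.append({
                        'fontsize': blob_font_size,
                        'text': re.sub(r'\.{2,}', ' ', blob_text),  # collapse dot leaders
                        'page': i  # 0-based page index
                    })
                    blob_font_size = None
                    blob_text = ''
            else:
                # Font size changed: store the finished blob and start a new one
                if blob_font_size is not None and len(blob_text) >= 1:
                    processed_text.append({
                        'fontsize': blob_font_size,
                        'text': re.sub(r'\.{2,}', ' ', blob_text),
                        'page': i
                    })
                blob_font_size = word['size']
                blob_text = word['text']
        # Store the last blob on the page, which the loop above never flushes
        if blob_font_size is not None and blob_text:
            processed_text.append({
                'fontsize': blob_font_size,
                'text': re.sub(r'\.{2,}', ' ', blob_text),
                'page': i
            })
        # Add this page's blobs once per page, not once per word
        paper_text += processed_text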
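The font-size grouping used in the cell above is easier to follow on a toy word list. The sketch below assumes words arrive as dictionaries with `text` and `size` keys, as pdfplumber's `extract_words(extra_attrs=['size'])` returns them; the `group_by_font_size` helper and the sample words are hypothetical.

```python
def group_by_font_size(words, max_len=2000):
    """Merge consecutive words that share a font size into text blobs."""
    blobs = []
    size, text = None, ""
    for word in words:
        if word["size"] == size and len(text) < max_len:
            text += " " + word["text"]
        else:
            # Font size changed (or the blob is full): store it and start a new one.
            if text:
                blobs.append({"fontsize": size, "text": text})
            size, text = word["size"], word["text"]
    if text:  # flush the final blob
        blobs.append({"fontsize": size, "text": text})
    return blobs

words = [
    {"text": "Big", "size": 18.0},
    {"text": "Data", "size": 18.0},
    {"text": "Traffic", "size": 10.0},
    {"text": "sensors", "size": 10.0},
    {"text": "report", "size": 10.0},
]
for blob in group_by_font_size(words):
    print(blob)
```

A change in font size is a cheap proxy for a paragraph or heading boundary, which is why headings and body text end up in separate blobs.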

In [None]:
# Generate embeddings
embeddings_file_path = '{}.parquet'.format(file_name_prefix)
if os.path.exists(embeddings_file_path):
    # Reuse embeddings cached by an earlier run
    df = pd.read_parquet(embeddings_file_path, engine='pyarrow')
else:
    filtered_pdf = []
    for row in paper_text:
        # Skip fragments too short to carry meaning (stray headers, page numbers)
        if len(row['text']) < 30:
            continue
        # Cap very long chunks at 8000 characters
        if len(row['text']) > 8000:
            row['text'] = row['text'][:8000]
        filtered_pdf.append(row)
    df = pd.DataFrame(filtered_pdf)
    df = df.drop_duplicates(subset=['text', 'page'], keep='first')
    df['length'] = df['text'].str.len()

    embedding_model = "text-embedding-ada-002"
    embeddings = []
    for text in tqdm(df.text.values, desc="Generating embeddings"):
        embeddings.append(get_embedding(text, engine=embedding_model))
    df["embeddings"] = embeddings
    # Save the embeddings for later reuse
    df.to_parquet(embeddings_file_path, engine='pyarrow')
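The compute-once, reuse-later pattern in the cell above can be shown without pandas or network access. The sketch below swaps Parquet for a JSON file and the API call for a hypothetical `fake_embed` function purely so that it runs offline; `load_or_build` is likewise an illustrative name, not part of the notebook.

```python
import json
import os
import tempfile

def load_or_build(cache_path, texts, embed_fn):
    """Return {text: vector}, reading cache_path if it exists, else computing and saving."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    table = {text: embed_fn(text) for text in texts}
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(table, f)
    return table

calls = []  # records every text that reaches the embedding function

def fake_embed(text):
    """Stand-in for a paid embedding request (hypothetical)."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]

cache_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.json")
first = load_or_build(cache_path, ["alpha beta", "gamma"], fake_embed)
second = load_or_build(cache_path, ["alpha beta", "gamma"], fake_embed)
print(len(calls))
```

The second call reads the file instead of calling `fake_embed` again, which is the saving the notebook gets from its Parquet cache. Note that a cache keyed only by file path goes stale if the chunking changes, so the cached file should be deleted after editing the slicing logic.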

In [None]:
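# Generate the document overview
prompt_messages = []
# System prompt, in Chinese: "You are an information analyst"
prefix = '你是信息分析员'
# User prompt, in Chinese: "Summarize the following article excerpt in Chinese; the article content is ..."
i_say = f'对下面的文章片段用中文做概述，文章内容是 ```{pre_read_content}```'

system_content = {"role": "system", "content": prefix}
user_content_final = {"role": "user", "content": i_say}
prompt_messages.append(system_content)
prompt_messages.append(user_content_final)
# This cell calls gpt-3.5-turbo; change the model argument to use GPT-4 as the prose describes
r = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=prompt_messages)
# The response supports dict-style access, so no JSON round trip is needed
overview = r['choices'][0]['message']['content']
full_report = "{}{}\n\n".format(full_report, overview)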

In [None]:
# Propose questions
# User prompt, in Chinese: "Propose five possible questions about this article"
i_say = '对这篇文章提出可能的五个问题'

user_content_final = {"role": "user", "content": i_say}
prompt_messages.append(user_content_final)
r = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=prompt_messages)
questions = r['choices'][0]['message']['content']
# Report label, in Chinese: "Possible questions:"
full_report = "{}可能的问题：{}\n\n".format(full_report, questions)

In [None]:
# Answer each question in turn
# Drop blank lines so that no empty string is sent to the embeddings API
qlist = [q.strip() for q in questions.split('\n') if q.strip()]
for question in qlist:
    # Each question gets a fresh single-turn conversation
    prompt_messages = [{"role": "user", "content": create_prompt(df, question)}]
    r = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=prompt_messages)

    answer = r['choices'][0]['message']['content']
    # sources is the module-level list filled by search_embeddings during create_prompt
    response = {'answer': answer, 'sources': sources}
    # Report labels, in Chinese: "Answer to {question}:" and "From the original, {page}: {excerpt}"
    full_report = "{}回答{}：\n{}\n\n".format(full_report, question, response['answer'])
    for source in response['sources']:
        for key, value in source.items():
            full_report = "{}来自原文{}:{}\n\n".format(full_report, key, value)

In [None]:
from datetime import datetime

current_date_time = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = "{}_{}.md".format(file_name_prefix, current_date_time)

with open(filename, "w", encoding="utf-8") as file:
    file.write(full_report)

In [None]:
# Custom question answering
current_date_time = datetime.now().strftime("%Y%m%d_%H%M%S")
history_filename = "{}_chat_{}.md".format(file_name_prefix, current_date_time)

def ask_question(df, history_filename):
    question = input("Please enter your question: ")

    # Typing "exit" ends the session; the caller receives None
    if question.lower() == "exit":
        return None

    prompt_messages = [{"role": "user", "content": create_prompt(df, question)}]
    r = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=prompt_messages)

    answer = r['choices'][0]['message']['content']
    # sources is the module-level list filled by search_embeddings during create_prompt
    response = {'answer': answer, 'sources': sources}
    # Output labels, in Chinese: "Answer to {question}:" and "From the original, {page}: {excerpt}"
    history = "{}的回答：\n{}\n\n".format(question, response['answer'])
    print(history)
    for source in response['sources']:
        for key, value in source.items():
            entry = "来自原文{}:{}\n\n".format(key, value)
            print(entry)
            history += entry
    # Append to the conversation history so earlier answers are not overwritten
    with open(history_filename, "a", encoding="utf-8") as file:
        file.write(history)
    return history

In [None]:
while True:
    try:
        # ask_question returns None once the user types "exit"
        if ask_question(df, history_filename) is None:
            break
    except KeyboardInterrupt:
        print("\nCtrl+C detected. Exiting the program.")
        break