# Retrieval-Augmented Generation

## Document Extractor

Raw documents are first loaded into the system and preprocessed. The LangChain open-source library can be used for this step: it provides document loaders for many formats and many different sources.

### Load all .pdf files

In [1]:
# All .pdf files in folder '/datasets'
import os
from langchain.document_loaders import PyMuPDFLoader

# Raw string so that backslashes in the Windows path are not treated as escape sequences
folder_path = r"E:\桌面\jzyfinal_project\datasets"

file_names = os.listdir(folder_path)

pdf_files = [file_name for file_name in file_names if file_name.endswith(".pdf")]

all_pages = []
for pdf_file in pdf_files:
    pdf_file_path = os.path.join(folder_path, pdf_file)
    loader = PyMuPDFLoader(pdf_file_path)
    # loader = PyMuPDFLoader(pdf_file_path, extract_images=True)
    pages = loader.load()
    all_pages.extend(pages)
    
    print(f"Loaded {pdf_file} with {len(pages)} pages.")


Loaded M1-航空概论R1.pdf with 151 pages.
Loaded M2-航空器维修R1.pdf with 226 pages.
Loaded M5-航空涡轮发动机R1.pdf with 309 pages.
Loaded M3-飞机结构和系统R1.pdf with 934 pages.
Loaded M6-活塞发动机及其维修.pdf with 306 pages.
Loaded M7-航空器维修基本技能.pdf with 946 pages.
Loaded M8-航空器维修实践R1.pdf with 457 pages.
Loaded M4-直升机结构和系统.pdf with 571 pages.


In [78]:
print(len(all_pages))

3900


### Using the `M1.pdf` file to compare the results of different PDF loaders

In [21]:
# from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import PyMuPDFLoader

# loader = PyPDFLoader("/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf")  # no images; the fastest, but prints error messages
# loader = PyPDFLoader("/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf", extract_images=True)  # also extracts images
loader = PyMuPDFLoader(r"E:\桌面\jzyfinal_project\datasets\M1-航空概论R1.pdf")  # PyMuPDF: more accurate and no error messages (raw string keeps the backslashes literal)
# loader = PyMuPDFLoader("/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf", extract_images=True)  # PyMuPDF with images: more accurate but slower
pages = loader.load()

In [22]:
print(len(pages))

151


In [23]:
# page that has table (M1 Page 31)
print(pages[30])
# print(pages[30].page_content)

page_content='民用航空器维修基础系列教材第1 册\n空气动力学基础\n23\n表2-1\n国际标准大气\n高度\n（米）\n大气温度\n大气压强\n（百帕）\n大气密度\n（千克/立方米）\n声速\n（米/秒）\n（K）\n0\n288.150\n1013.25\n1.2250\n340.29\n1000\n281.651\n898.76\n1.1117\n336.43\n2000\n275.154\n795.01\n1.0066\n332.53\n3000\n268.659\n701.21\n0.9093\n328.58\n4000\n262.166\n616.60\n0.8194\n324.59\n5000\n255.676\n540.48\n0.7364\n320.55\n6000\n249.187\n472.17\n0.6601\n316.45\n7000\n242.700\n411.05\n0.5900\n312.31\n8000\n236.215\n356.51\n0.5258\n308.11\n9000\n229.733\n308.00\n0.4671\n303.83\n10000\n223.252\n264.99\n0.4135\n299.53\n11000\n216.774\n226.99\n0.3648\n295.15\n12000\n216.650\n193.39\n0.3119\n295.07\n13000\n216.650\n165.79\n0.2666\n295.07\n14000\n216.650\n141.70\n0.2279\n295.07\n15000\n216.650\n121.11\n0.1948\n295.07\n16000\n216.650\n103.52\n0.1665\n295.07\n17000\n216.650\n88.497\n0.1423\n295.07\n18000\n216.650\n75.652\n0.1217\n295.07\n19000\n216.650\n64.674\n0.1040\n295.07\n20000\n216.650\n55.293\n0.0889\n295.07\n21000\n217.581\n47.289\n0.0757\n295.70\n22000\n218.57

In [24]:
# page that has image
print(len(pages[8].page_content))
print(pages[8]) # M1 Page 9

524
page_content='民用航空器维修基础系列教材第1 册\n航空器的概念与分类\n1\n第1 章航空器的概念与分类\n1.1 航空器的定义与分类\n1.1.1 航空器的定义和分类\n任何由人工制造、能飞离地面、在空间进行由人来控制的飞行的物体称为飞行器。飞行\n器中，能够在大气层之外飞行的称为航天器，而在大气层中进行飞行的飞行器称为航空器。\n航空器根据获得升力方式的不同分为两大类：\n第一大类航空器总体的密度轻于空气，依靠空气的浮力而漂浮于空中，称为轻于空气的\n航空器，主要是气球和飞艇。气球和飞艇的主要区别在于，气球上没有安装动力，飞行方向\n不由本身控制；而飞艇上装有动力，可用本身的动力控制飞行的方向。\n第二大类航空器本身重于空气，依靠自身与空气之间的相对运动，产生空气动力克服重\n力而升空。这类航空器又分为非动力驱动的和动力驱动两类，非动力驱动的主要是滑翔机，\n动力驱动的分为飞机（或称固定翼航空器）\n、旋翼航空器和扑翼机三类。\n航空器的典型分类如图1-1 所示。\n图1-1 航空器分类\n自由气球\n非动力驱动：气球\n系留气球\n轻于空气的航空器\n动力驱动：飞艇\n航空器\n非动力驱动：滑翔机\n重于空气的航空器\n飞机（固定翼航空器）\n直升机\n动力驱动\n旋翼航空器\n自转旋翼机\n扑翼机' metadata={'source': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf', 'file_path': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf', 'page': 8, 'total_pages': 151, 'format': 'PDF 1.7', 'title': '', 'author': 'elancer', 'subject': '', 'keywords': '', 'creator': 'WPS 文字', 'producer': '', 'creationDate': "D:20201229093138+01'31'", 'modDate': "D:20201229093138+01'31'", 'trappe

## Document Preprocessing

After loading, the documents are typically transformed. One common transformation is text segmentation (chunking), which breaks long texts into smaller segments. This is crucial for fitting text into embedding models such as e5-large-v2, whose maximum input length is 512 tokens. Although text segmentation sounds straightforward, it is a nuanced process that requires careful design of the splitting function. You can also use the splitters from the LangChain open-source library, but you will need to adapt them to Chinese text, for example by adding Chinese punctuation to the separator list.
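
To make the idea concrete, here is a minimal pure-Python sketch of sentence-aware chunking for Chinese text. It is not LangChain's implementation: the helper name `split_chinese_text`, its `chunk_size` parameter, and the choice of punctuation marks are assumptions made for illustration. It splits after sentence-ending punctuation, then greedily packs sentences into chunks of at most `chunk_size` characters (no overlap, for brevity).

```python
import re
from typing import List

# Zero-width split points right after Chinese sentence-ending punctuation or a newline.
_SENTENCE_END = re.compile(r"(?<=[。！？；\n])")


def split_chinese_text(text: str, chunk_size: int = 200) -> List[str]:
    """Greedily pack sentences into chunks of at most `chunk_size` characters.

    Hypothetical helper for illustration only. A sentence longer than
    `chunk_size` is hard-cut so that no chunk exceeds the limit.
    """
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-cut over-long sentences into pieces of at most chunk_size characters.
        pieces = [sentence[i:i + chunk_size] for i in range(0, len(sentence), chunk_size)]
        for piece in pieces:
            # Flush the current chunk if adding this piece would overflow it.
            if current and len(current) + len(piece) > chunk_size:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


sample = "第一句。第二句！第三句？"
print(split_chinese_text(sample, chunk_size=8))
```

Because the sketch measures length in characters, it only approximates a token budget; a tokenizer-based length function would be needed to guarantee a limit expressed in tokens.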

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

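# The default separator list of RecursiveCharacterTextSplitter is ["\n\n", "\n", " ", ""].
# To split Chinese text, besides defining new separators, split_text_with_regex also has to be modified.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  # maximum chunk length (in characters, because length_function=len)
    chunk_overlap=20,   # overlap between adjacent chunks
    length_function=len,
    separators=["\n\n", "\n", "。", "，", "；", "！", "？", " "], # adjusted for Chinese text
    keep_separator=True,
    is_separator_regex=True # modified in split_text_with_regex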
)

In [3]:
chunks = splitter.split_documents(all_pages)

In [4]:
print(len(chunks))
for i, chunk in enumerate(chunks):
    if i < 5:
        print(f"chunk #{i}, context:{chunk.page_content}")
        print("----------------------------------------------------------------")
    else:
        break

16072
chunk #0, context:民用航空器维修基础系列教材·共8 册
（第1 册）
航空概论
Introduction to Aviation
中国民用航空维修协会推荐
----------------------------------------------------------------
chunk #1, context:内容简介
本书为民用航空器维修基础系列教材之一。全书分为五章，包括：第一章航空器的概念
和分类，介绍航空器的概念、分类和基本特点；第二章空气动力学基础，介绍大气环境、空
气动力学基本原理、机翼外形特点和参数、作用在飞机上的空气动力以及高速飞行的基本特
点；第三章飞行原理，介绍飞机运动基础、飞机的稳定性和操纵性以及旋翼航空器的基本飞
----------------------------------------------------------------
chunk #2, context:行原理；第四章航空器动力装置，介绍航空活塞式发动机和燃气涡轮发动机的分类和基本工
作原理；第五章航空仪表和机载设备，介绍航空仪表以及其他机载电子、电气、机械系统的
基本功能和组成。
编者在编写中力求做到言简意赅，
深入浅出，
着重于清晰透彻的定性分析，
力求做到所有内容尽量与目前我国民航机务维修人员的实际工作紧密结合。
本书内容图文并茂通俗易懂，是民用航空器维修执照人员必须掌握的基本知识。通过学
----------------------------------------------------------------
chunk #3, context:习，机务维修人员不但易于掌握教材中的内容，而且能够起到提高其自身的素质和业务水平
的作用。本书适合在职飞机维修人员学习和相关院校专业做专业教材。
本教材的著作权归中国民用航空维修协会所有，任何单位和个人不得以营利为目的使用
本教材，侵权必究。
----------------------------------------------------------------
chunk #4, context:民用航空器维修基础系列教材编写委员会
主任委员：吴溪浚
副主任委员：杨卫东、刘英俊、杨国余、徐建星、蒋陵平、罗亮
生、刘韬然
编委：王会来、刘韬然、安辉、李珈、杨国余、何冠

## Embedding Generation

Before the extracted data can be searched, it must be converted into a format that the system can handle efficiently. Generating embeddings means converting each text chunk into a high-dimensional numeric vector, which requires an embedding model. Use a model with strong Chinese capabilities; otherwise retrieval quality will suffer.
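
The embedding configuration in this notebook passes `normalize_embeddings: True`. As a small illustration of why that option is convenient, the sketch below shows that for unit-length vectors the plain dot product equals the cosine similarity. The helper names are hypothetical and the two-dimensional vectors are toy values, not real model output.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the dot product and the cosine similarity coincide.
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b)))
```

In practice this means similarity scores from normalized embeddings can be compared with a single inner product per pair.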

In [5]:
import os
from langchain.embeddings import HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
huggingface_embeddings_config = {
    "m3e-base": {
        "model_name": "moka-ai/m3e-base",
        "model_path": r"E:\桌面\jzyfinal_project\m3e-base",  # raw string keeps the backslashes literal
        "encode_kwargs": {"normalize_embeddings": True},
        "max_len": 512,
    },
}

def load_huggingface_embedding(name, device: str = "cuda") -> HuggingFaceEmbeddings:
    print("load_huggingface_embedding")
    model_path = huggingface_embeddings_config[name]["model_path"]
    model_name: str = huggingface_embeddings_config[name]["model_name"]
    encode_kwargs = huggingface_embeddings_config[name]["encode_kwargs"]

    model_name_or_path = model_path if model_path else model_name
    # Hub IDs usually carry an organization prefix, so test for a substring rather than a prefix
    if "bge" in model_name.lower():
        embedding = HuggingFaceBgeEmbeddings(
            model_name=model_name_or_path,
            encode_kwargs=encode_kwargs,
            model_kwargs={"device": device},
        )
    else:
        embedding = HuggingFaceEmbeddings(
            model_name=model_name_or_path,
            encode_kwargs=encode_kwargs,
            model_kwargs={"device": device},
        )

    return embedding
embedding_used = 'm3e-base'
embedding_model = load_huggingface_embedding(name=embedding_used)

load_huggingface_embedding




### m3e model

In [6]:
embeddings = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Print the first five embeddings
for i, (sentence, embedding) in enumerate(zip(chunks, embeddings)):
    if i >= 5:
        break
    print("Sentence:", sentence)
    print("Embedding length:", len(embedding))
    print("Embedding:", embedding)
    print("")

Sentence: page_content='民用航空器维修基础系列教材·共8 册\n（第1 册）\n航空概论\nIntroduction to Aviation\n中国民用航空维修协会推荐' metadata={'source': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf', 'file_path': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M1-航空概论R1.pdf', 'page': 0, 'total_pages': 151, 'format': 'PDF 1.7', 'title': '', 'author': 'elancer', 'subject': '', 'keywords': '', 'creator': 'WPS 文字', 'producer': '', 'creationDate': "D:20201229093138+01'31'", 'modDate': "D:20201229093138+01'31'", 'trapped': ''}
Embedding length: 768
Embedding: [0.016590654850006104, -0.01424453966319561, 0.014683118090033531, 0.012106791138648987, -0.006413804367184639, -0.013891653157770634, 0.027410021051764488, 0.0017125473823398352, -0.013848017901182175, 0.037884749472141266, 0.03478787839412689, -0.015970800071954727, 0.0701654776930809, -0.019551223143935204, -0.022151121869683266, 0.0038251951336860657, 0.022853229194879532, 0.024012411013245583, -0.07546903192996979, -0.028793837

## Storing Embeddings in a Vector Database

Processed data and generated embeddings are stored in a dedicated database known as a vector database. These databases are optimized for handling vectorized data to enable fast search and retrieval operations.
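
As a rough mental model of what a vector database does, the following sketch implements a toy in-memory store with exhaustive inner-product search. It is an illustration under stated assumptions, not how FAISS works internally: the class name `ToyVectorStore` and its methods are hypothetical, and real vector databases use optimized index structures instead of a linear scan.

```python
import heapq
from typing import List, Sequence, Tuple


class ToyVectorStore:
    """Hypothetical in-memory store that ranks texts by inner product."""

    def __init__(self) -> None:
        self._texts: List[str] = []
        self._vectors: List[Sequence[float]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        """Store a text together with its embedding vector."""
        self._texts.append(text)
        self._vectors.append(vector)

    def search(self, query_vector: Sequence[float], k: int = 3) -> List[Tuple[float, str]]:
        """Return the k (score, text) pairs with the highest inner product."""
        scored = (
            (sum(q * v for q, v in zip(query_vector, vec)), text)
            for vec, text in zip(self._vectors, self._texts)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = ToyVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [0.6, 0.8])
print(store.search([1.0, 0.0], k=2))
```

For unit-length vectors, ranking by inner product and ranking by Euclidean distance give the same order, since the squared distance equals 2 minus twice the inner product.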

In [7]:
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(chunks, embedding_model)

## LLM (Large Language Model)

LLMs are the generation component of the RAG (Retrieval-Augmented Generation) process. These general-purpose language models are trained on vast datasets, enabling them to understand and generate human-like text. In a RAG system, the LLM generates a complete response from the user's query and the context retrieved from the vector database.

In [20]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import time
from openai import AzureOpenAI
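# Read the API keys from environment variables rather than hardcoding secrets in the notebook
GPT4_AZURE_OPENAI_KEY = os.environ["GPT4_AZURE_OPENAI_KEY"]
GPT35_AZURE_OPENAI_KEY = os.environ["GPT35_AZURE_OPENAI_KEY"]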

class OpenAI_LLM:
    def __init__(self, model_name):
        self.model_name = model_name
        if model_name.endswith("_azure"):
            if "gpt-4" in model_name:
                self.client = AzureOpenAI(
                    azure_endpoint="https://zhishenggpt40.openai.azure.com/",
                    api_key=GPT4_AZURE_OPENAI_KEY,
                    api_version="2024-02-15-preview",
                )
                self.model = "GPT4"
                now = time.localtime()
                current_date = time.strftime("%Y-%m", now)
                self.system_prompt = f'You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2023-04\nCurrent date: {current_date}'
            elif "gpt-3.5" in model_name:
                self.client = AzureOpenAI(
                    azure_endpoint="https://zhishenggpt.openai.azure.com/",
                    api_key=GPT35_AZURE_OPENAI_KEY,
                    api_version="2024-02-15-preview",
                )
                self.model = "GPT-35"
                now = time.localtime()
                current_date = time.strftime("%Y-%m", now)
                self.system_prompt = f'You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2021-09\nCurrent date: {current_date}'
            else:
                raise ValueError(f"Unsupported model name: {model_name}")
        else:
            raise ValueError(f"Unsupported model name: {model_name}")

    def _call(
            self,
            messages,
            generation_config=None,
            temperature=0.7,
            max_tokens=4096,
            top_p=0.95,
            frequency_penalty=0,
            presence_penalty=0,
            stop=None,
            stream=False,
            add_system_prompt=False,
    ):
        if add_system_prompt:
            # Prepend the default system prompt unless the caller already supplied one
            # (the constructor only accepts "_azure" models, so a single branch suffices).
            if not messages or messages[0]["role"] != "system":
                messages = [{"role": "system", "content": self.system_prompt}] + messages

        completion = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
            frequency_penalty=frequency_penalty,
            presence_penalty=presence_penalty,
            stop=stop,
            stream=stream  # return the response as a stream when True
        )

        return completion

In [21]:
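# Create the prompt template (written in Chinese to match the corpus). English gloss:
#   "Please refer to the knowledge below and combine it with what you already know
#   to answer the question. {knowledge} Question: {query} Please give a detailed,
#   natural and helpful answer to the question above, drawing on the reference knowledge."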
template = """\
请参考以下知识，结合你已有的知识回答问题。

{knowledge}

问题: {query}

请针对以上问题，结合参考知识给出详细的、自然的、对用户有帮助的回答。\
"""

In [22]:
llm_35 = OpenAI_LLM('gpt-3.5-turbo-1106_azure')

## Querying

When a user submits a query, the RAG system embeds it with the same embedding model and compares the resulting query vector with the vectors stored in the vector database to find the most relevant chunks. The LLM then formulates a response using the retrieved text.

In [23]:
# Query in Chinese; English gloss: "What are the main types of aviation hydraulic fluid?"
query = "航空液压油主要有哪几类?"
result_simi = db.similarity_search(query, k=3)
knowledge = "\n".join([x.page_content for x in result_simi])
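
The cell above joins every retrieved chunk as-is. When `k` grows, it can help to drop duplicate passages and cap the total context length before filling the prompt. The following is a minimal sketch under those assumptions; the helper name `build_knowledge` and the `max_chars` budget are hypothetical and not part of the original notebook.

```python
from typing import Iterable, List


def build_knowledge(passages: Iterable[str], max_chars: int = 1500) -> str:
    """Join unique passages with newlines without exceeding `max_chars` characters."""
    seen = set()
    kept: List[str] = []
    used = 0
    for passage in passages:
        text = passage.strip()
        # Skip empty passages and exact duplicates.
        if not text or text in seen:
            continue
        # Account for the newline that will join this passage to the previous one.
        cost = len(text) + (1 if kept else 0)
        if used + cost > max_chars:
            break
        seen.add(text)
        kept.append(text)
        used += cost
    return "\n".join(kept)


print(repr(build_knowledge(["a", "a", "bb", "ccc"], max_chars=4)))
```

With the retrieval results from the cell above, it could be called as `build_knowledge(x.page_content for x in result_simi)`.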

In [24]:
print(len(result_simi))

3


In [32]:
print(result_simi[0])
print(result_simi[1])
print(result_simi[2])

page_content='5.3.2 航空液压油的分类及识别......................................................................................126' metadata={'source': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M7-航空器维修基本技能.pdf', 'file_path': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M7-航空器维修基本技能.pdf', 'page': 6, 'total_pages': 946, 'format': 'PDF 1.7', 'title': '', 'author': 'leou', 'subject': '', 'keywords': '', 'creator': 'WPS 文字', 'producer': '', 'creationDate': "D:20200503113632+03'30'", 'modDate': "D:20200525092737+08'00'", 'trapped': ''}
page_content='具有较好的低温工作特性和\n低腐蚀性，广泛用于现代大型飞机的液压系统。  \n在维护过程中，应使用飞机维护手册或附件说明书所规定牌号的液压油。 \n表3-2 航空液压油的分类识别和运用 \n类型 \n名称 \n颜色 \n用途 \n航空植物基液压油 \nMIL-H-7644 \n蓝色 \n用于早期老式民航客机 \n航空矿物基液压油 \nMIL-H-5606 \nMIL-H-6803 \nBMS3-32 Ⅱ \n红色' metadata={'source': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M3-飞机结构和系统R1.pdf', 'file_path': '/home/jxy/program/course/CS224n/jzyfinal_project/datasets/M3-飞机结构和系统R1.pdf', 'page': 127, 'total_pages': 9
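
The first hit in the output above is a table-of-contents entry (a heading followed by dot leaders and a page number), which carries no answer content. One possible mitigation, sketched here as an assumption rather than something this notebook does, is to filter such chunks out before indexing. The helper name `is_toc_chunk`, the six-dot threshold, and the `min_fraction` parameter are all hypothetical.

```python
import re

# A run of at least six dots (or ellipsis characters) followed by a page number at line end.
_TOC_LEADER = re.compile(r"[.…]{6,}\s*\d+\s*$")


def is_toc_chunk(text: str, min_fraction: float = 0.5) -> bool:
    """Return True if at least `min_fraction` of the non-empty lines look like TOC entries."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(1 for line in lines if _TOC_LEADER.search(line))
    return hits / len(lines) >= min_fraction


toc = "5.3.2 航空液压油的分类及识别" + "." * 40 + "126"
body = "在维护过程中，应使用飞机维护手册或附件说明书所规定牌号的液压油。"
print(is_toc_chunk(toc), is_toc_chunk(body))  # prints: True False
```

Applied before building the index, a list comprehension such as `[c for c in chunks if not is_toc_chunk(c.page_content)]` would keep only the content-bearing chunks.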

In [25]:
llm_35 = OpenAI_LLM('gpt-3.5-turbo-1106_azure')
messages = [{"role": "system", "content": template.format(knowledge=knowledge, query=query)}]
response = llm_35._call(messages).choices[0].message.content
print(response)

航空液压油主要分为两类：航空植物基液压油和航空矿物基液压油。

航空植物基液压油通常符合MIL-H-7644标准，呈蓝色，主要用于早期老式民航客机。它由蓖麻油和酒精组成，具有刺鼻的酒精味，并且染成蓝色。这种油液适用于天然橡胶密封件，但是易燃。

航空矿物基液压油则符合MIL-H-5606、MIL-H-6803或BMS3-32 II标准，呈红色，主要用于现代大型飞机的液压系统。它是从石油中提炼出来的，具有刺激性的气味。航空矿物基液压油具有较好的低温工作特性和低腐蚀性，适用于现代大型飞机的液压系统。

在维护过程中，应使用飞机维护手册或附件说明书所规定牌号的液压油，以确保液压系统的正常运行和安全性。


In [26]:
llm_4 = OpenAI_LLM('gpt-4-1106-preview_azure')
messages = [{"role": "system", "content": template.format(knowledge=knowledge, query=query)}]
response = llm_4._call(messages).choices[0].message.content
print(response)

航空液压油主要分为两类：植物基液压油和矿物基液压油。

1. 植物基液压油：这类液压油主要是由蓖麻油和酒精组成的，通常染成蓝色，具有刺鼻的酒精味。植物基液压油用在较老式的飞机上，其兼容性较好，适用于天然橡胶密封件，但是这种类型的油液是易燃的。典型代表为MIL-H-7644。

2. 矿物基液压油：这类液压油是从石油中提炼出来的，它具有刺激性的气味并呈现红色。矿物基液压油具有较好的低温工作特性和低腐蚀性，因此广泛用于现代大型飞机的液压系统。常见的矿物基液压油牌号有MIL-H-5606、MIL-H-6803和BMS3-32 Ⅱ。

在飞机维护过程中，应严格按照飞机维护手册或附件说明书规定的牌号使用相应的液压油，以确保液压系统的正常运行和飞机的安全。
