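# Contract Key Information Extraction Assistant

#### The Contract Key Information Extraction Assistant extracts key information from an uploaded contract file and compiles it into a downloadable CSV table

## Contract key information extraction workflow

### 1. Install the dependencies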

In [1]:
! pip install streamlit
! pip install pdf2image
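! brew install tesseract  # macOS (Homebrew) only; pdf2image also needs poppler, and lang='chi_sim' needs Tesseract's Chinese language data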
! pip install pytesseract
! pip install pillow
! pip install pandas
! pip install requests
! pip install openai

### 2. Import the dependencies

In [2]:
import requests
import streamlit as st
from pdf2image import convert_from_bytes
import pytesseract
from PIL import Image
import pandas as pd

### 3. Configure the Azure OpenAI API
Two parameters are required, `API_KEY` and `ENDPOINT`. Both can be found under `Keys and Endpoint` for the `Azure OpenAI` resource.

In [3]:
API_KEY = 'f03dc7146838448a88e59f32363ab9b7'
ENDPOINT = "https://jcyopenai2.openai.azure.com/openai/deployments/jcy4o/chat/completions?api-version=2024-02-15-preview"

headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}
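
A key written directly into the notebook is easy to leak when the file is shared. The following is a minimal sketch of reading both values from environment variables instead. The variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` and the helper `build_azure_config` are assumptions made for illustration and do not appear in the original notebook.

```python
import os


def build_azure_config(env=None):
    """Return (endpoint, headers) read from environment variables.

    The variable names are hypothetical; use whatever names your setup defines.
    """
    env = os.environ if env is None else env
    api_key = env.get("AZURE_OPENAI_API_KEY", "")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "")
    if not api_key or not endpoint:
        raise RuntimeError("Set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT first.")
    # Same header layout as the cell above, but without a literal key.
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return endpoint, headers
```

With this in place, `ENDPOINT, headers = build_azure_config()` could stand in for the literal assignments.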

### 4. Prompts
Different prompts, and the fields they map to, are defined for each contract type.

In [4]:
# Keys are the output column names, kept in Chinese: end user's company name,
# contract amount, service term, buyer's company address, taxpayer
# identification number, bank information.
sales_contract_prompts = {
    "最终用户的公司名称": "Please provide only the final user's company name from this contract text.",
    "合同金额": "Please provide only the contract amount in yuan from this contract text.",
    "服务期限": "Please provide only the service term or contract expiration time from this contract text.",
    "买方的公司地址": "Please provide only the buyer's company address from this contract text.",
    "纳税人识别号": "Please provide only the taxpayer identification number from this contract text.",
    "银行信息": "Please provide only the bank account information from this contract text."
}

# Keys: service duration, service validity period.
msp_agreement_prompts = {
    "服务时长": "Please provide only the duration of the service from this service agreement.",
    "服务期限": "Please provide only the validity period of the service from this service agreement."
}

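#### Initialize the structure of the output DataFrame
Different output columns are set for each contract type, and every field starts as the placeholder `"/"`.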

In [5]:
columns_for_sales = [
    "最终用户的公司名称",
    "合同金额",
    "服务期限",
    "买方的公司地址",
    "纳税人识别号",
    "银行信息"
]
columns_for_msp = [
    "服务时长",
    "服务期限"
]

extracted_info_sales = {col: "/" for col in columns_for_sales}
extracted_info_msp = {col: "/" for col in columns_for_msp}

### 5. Helper functions for extraction

#### Function for interacting with the GPT model
GPT-4o is used to obtain the content of each column in the table.

In [6]:
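def extract_content_with_azure(text, prompt):
    """Send one prompt plus one text segment to the model and return its answer."""
    payload = {
        "messages": [
            {
                "role": "system",
                # System prompt (kept in Chinese): the model acts as a contract
                # key-information extraction assistant for 北京信诺时代科技发展有限公司,
                # working on sales contracts and MSP service agreements.
                "content": "你是一个北京信诺时代科技发展有限公司的合同关键信息提取助手，用户将询问有关签约客户的合同关键信息，你需要在与不同客户签署的买卖合同和MSP服务协议中，提取出如下的关键信息。"
            },
            {
                "role": "user",
                # Ask for "/" when the field is absent so the caller can try the next segment.
                "content": f'{prompt} If the information is not present, reply with only "/".\n\n{text}'
            }
        ],
        "temperature": 0.5,
        "top_p": 0.95,
        "max_tokens": 150
    }

    response = requests.post(ENDPOINT, headers=headers, json=payload, timeout=60)
    response.raise_for_status()

    response_json = response.json()
    if 'choices' in response_json and len(response_json['choices']) > 0:
        return clean_extracted_content(response_json['choices'][0]['message']['content'])
    # No answer returned: fall back to the "not found" placeholder.
    return "/"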
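
The function above reads the answer from `choices[0].message.content` in the JSON response. As a hedged illustration of that response shape, here is a small sketch of a defensive parser. The helper name `first_choice_text` and the sample payload are made up for this example and are not part of the notebook.

```python
def first_choice_text(response_json, default="/"):
    """Return the first choice's message text, or `default` if it is missing."""
    choices = response_json.get("choices") or []
    if not choices:
        return default
    content = (choices[0].get("message") or {}).get("content")
    if not content:
        # The content can be absent or empty; treat that as "not found".
        return default
    return content.strip()


# Hand-written sample shaped like a chat completions response (not real output).
sample = {"choices": [{"message": {"role": "assistant", "content": " 100000元 \n"}}]}
print(first_choice_text(sample))
print(first_choice_text({"choices": []}))
```

Returning the `"/"` placeholder instead of `None` keeps the later `.strip()` comparison from failing on an empty reply.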

#### Function for cleaning the extracted content

In [7]:
def clean_extracted_content(content):
    # Strip leading and trailing whitespace, including newlines
    cleaned_content = content.strip()
    return cleaned_content
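
Model answers sometimes arrive wrapped in quotes or spread over several lines. The sketch below shows one way the cleaning step could be extended to handle that. It is an assumption about what "unnecessary text" might cover, and `clean_answer` is a hypothetical name, not the notebook's function.

```python
import re


def clean_answer(content):
    """Strip whitespace, one pair of wrapping quotes, and repeated whitespace."""
    text = content.strip()
    # Drop one pair of matching ASCII quotes wrapped around the whole answer.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text)
```

For an input such as `'  "北京 \n 公司"  '` this yields `北京 公司`, while a bare `"/\n"` still comes back as `/`, so the placeholder check keeps working.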

#### Function for splitting text into smaller segments
Splitting keeps each segment short enough for GPT to process reliably.

In [8]:
def split_text(text, max_length=3000):
    """Splits text into chunks of roughly max_length characters (not tokens)."""
    paragraphs = text.split('\n')
    chunks = []
    current_chunk = []

    for paragraph in paragraphs:
        if sum(len(p) for p in current_chunk) + len(paragraph) <= max_length:
            current_chunk.append(paragraph)
        else:
            # Skip the flush when current_chunk is empty (a single overlong
            # paragraph), so no empty chunk is sent to the model.
            if current_chunk:
                chunks.append("\n".join(current_chunk))
            current_chunk = [paragraph]

    if current_chunk:
        chunks.append("\n".join(current_chunk))

    return chunks
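
`split_text` only breaks at newlines, so a single paragraph longer than `max_length` is passed through whole. OCR output can contain such long runs, so here is a sketch of a variant that also cuts overlong paragraphs into fixed-size pieces. `split_text_hard` is a hypothetical name; like the original it counts characters, not tokens, and unlike the original it also counts the joining newline.

```python
def split_text_hard(text, max_length=3000):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    current = ""
    for paragraph in text.split("\n"):
        # Break a paragraph that is too long on its own into fixed-size pieces.
        pieces = [paragraph[i:i + max_length]
                  for i in range(0, len(paragraph), max_length)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n" + piece
            if len(candidate) <= max_length:
                current = candidate
            else:
                # The current chunk is full: store it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text_hard("aaaa\nbb\ncccccccccc", max_length=4))
# → ['aaaa', 'bb', 'cccc', 'cccc', 'cc']
```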

#### Function for segment-by-segment extraction

In [9]:
def extract_from_segments(text, prompts):
    """Extracts information from text divided into segments."""
    extracted_info = {key: "/" for key in prompts.keys()}
    
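    # Split the text into smaller chunks
    text_segments = split_text(text)
    
    for segment in text_segments:
        for key, prompt in prompts.items():
            # Only try to extract fields that have not been filled yet
            if extracted_info[key] == "/":
                result = extract_content_with_azure(segment, prompt)
                # Guard against an empty reply before comparing with the placeholder
                if result and result.strip() != "/":
                    extracted_info[key] = result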

    return extracted_info
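
The loop above fills each field from the first segment that yields an answer other than `"/"`. To make that behaviour easy to check without calling the API, here is a sketch that takes the model call as a parameter and runs it against a stand-in. The names `extract_first_found` and `fake_ask`, the shortened prompts, and the sample segments are all invented for this illustration.

```python
def extract_first_found(segments, prompts, ask):
    """Fill each field from the first segment whose answer is not "/"."""
    found = {key: "/" for key in prompts}
    for segment in segments:
        for key, prompt in prompts.items():
            if found[key] == "/":
                answer = ask(segment, prompt).strip()
                if answer and answer != "/":
                    found[key] = answer
    return found


def fake_ask(segment, prompt):
    """Stand-in for the model: answers only when the segment mentions the field."""
    if "amount" in prompt and "金额" in segment:
        return "100000元"
    if "term" in prompt and "期限" in segment:
        return "2024-01-01 至 2024-12-31"
    return "/"


prompts = {"合同金额": "contract amount", "服务期限": "service term"}
segments = ["甲方与乙方签订本合同", "合同金额为100000元", "服务期限为一年，合同金额见上"]
print(extract_first_found(segments, prompts, fake_ask))
```

Here the amount is taken from the second segment and the term from the third. Once a field is filled it is never asked for again, which also saves API calls on long contracts.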

### 6. Set up Streamlit

In [10]:
st.title("合同关键信息提取助手")
uploaded_file = st.file_uploader("请选择一个PDF文件", type="pdf")

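### 7. Extract key information
Using the segment extraction and model interaction functions, key information is extracted according to the detected contract type and stored in a pandas DataFrame.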

In [11]:
df = None  # stays None if no file is uploaded or the contract type is not recognized

if uploaded_file is not None:
    # Convert the uploaded PDF into a list of images, one per page
    images = convert_from_bytes(uploaded_file.read())

    # Extract text from each image (page)
    extracted_text = []
    for img in images:
        text = pytesseract.image_to_string(img, lang='chi_sim')  # assumes the document is in Simplified Chinese
        extracted_text.append(text)

    # Merge all text into a single string
    full_text = "\n".join(extracted_text)

    # Determine the contract type and extract the relevant information
    if "买卖合同" in full_text or "服务合同" in full_text:
        st.write("检测到买卖合同或服务合同")
        extracted_info_sales = extract_from_segments(full_text, sales_contract_prompts)
        df = pd.DataFrame([extracted_info_sales])

    elif "MSP服务协议" in full_text:
        st.write("检测到MSP服务协议")
        extracted_info_msp = extract_from_segments(full_text, msp_agreement_prompts)
        df = pd.DataFrame([extracted_info_msp])

    else:
        st.warning("未能识别合同类型，请确认合同是否包含明确的标题。")
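
The contract type is decided by looking for title keywords in the OCR text. OCR of Chinese documents can insert stray spaces or line breaks between characters, which would make a plain substring test miss the title. The sketch below removes whitespace before matching. The function name `detect_contract_type` and the return labels `"sales"` and `"msp"` are assumptions for illustration.

```python
import re


def detect_contract_type(text):
    """Return "sales", "msp", or None based on title keywords in the OCR text."""
    # Remove all whitespace so that "买 卖 合 同" still matches "买卖合同".
    compact = re.sub(r"\s+", "", text)
    if "买卖合同" in compact or "服务合同" in compact:
        return "sales"
    if "MSP服务协议" in compact:
        return "msp"
    return None
```

The order of the checks mirrors the cell above: the sales and service contract keywords are tested first, then the MSP agreement keyword.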

### 8. Display the DataFrame

In [2]:
# Only display a result when a contract was actually processed
if df is not None:
    st.write(df)

### 9. Save and download the DataFrame

#### Save the DataFrame to a CSV file

In [None]:
csv_filename = "合同关键信息.csv"
# Write the file only when there is a result to save
if df is not None:
    df.to_csv(csv_filename, index=False)

#### Offer the CSV file for download

In [None]:
if uploaded_file is None:
    # Nothing uploaded yet: prompt the user for a PDF
    st.warning("请上传一个PDF文件。")
elif df is not None:
    with open(csv_filename, 'rb') as f:
        st.download_button(
            label="下载CSV文件",
            data=f,
            file_name=csv_filename,
            mime='text/csv'
        )
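
Writing the CSV to disk and reopening it works, but the download can also be fed from memory. The sketch below builds the CSV bytes with the standard library `csv` module rather than pandas, purely to keep the example dependency-free. The helper name `rows_to_csv_bytes` and the sample row are assumptions for illustration.

```python
import csv
import io


def rows_to_csv_bytes(rows, columns):
    """Serialize a list of dicts to CSV bytes with a UTF-8 byte-order mark."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    # "utf-8-sig" prepends a BOM, which helps Excel detect UTF-8 Chinese text.
    return buffer.getvalue().encode("utf-8-sig")


data = rows_to_csv_bytes([{"合同金额": "100000元", "服务期限": "/"}], ["合同金额", "服务期限"])
```

The resulting bytes could be passed as `data=` to `st.download_button`, which would avoid the temporary file on the server.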