<a href="https://colab.research.google.com/gist/NeoIntelligence/e50745c167fe40329b7c5f95b75a6846/-01_semi_structured_data-ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Prepare Environment

Let's install the necessary Python packages.

In [None]:
!pip install langchain unstructured[all-docs] pydantic lxml openai chromadb tiktoken -q -U


Download the PDF file and save it as `statement_of_changes.pdf`.

In [None]:
!wget -O statement_of_changes.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf

--2024-03-20 14:38:50--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/381953f9-934e-4cc8-b099-144910676bad.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 13.35.37.166, 13.35.37.47, 13.35.37.63, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|13.35.37.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 119037 (116K) [application/pdf]
Saving to: ‘statement_of_changes.pdf’


2024-03-20 14:38:52 (294 KB/s) - ‘statement_of_changes.pdf’ saved [119037/119037]



Install required platform packages:

- poppler-utils
  
  A collection of command-line utilities, built on the Poppler library API, for manipulating PDF files and extracting their contents

- tesseract-ocr

  Optical character recognition engine

In [None]:
!apt-get install poppler-utils tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


In [None]:
!pip install openai




In [None]:
import os
os.environ["OPENAI_API_BASE"] = "https://api.openai-proxy.org/v1"

os.environ["OPENAI_API_KEY"] = "your API key"
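
Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key is supplied through the environment or typed at a hidden prompt; the helper name `load_api_key` and its `prompt_fn` parameter are hypothetical, not part of any library used here.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "OPENAI_API_KEY", prompt_fn=getpass) -> str:
    """Return the key from the environment, prompting only if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key is not echoed into the notebook.
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} was not provided")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` once before building any chain leaves `OPENAI_API_KEY` set in the environment without the value ever appearing in a cell.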

In [None]:
!pip install numpy==1.24.4




### Coding

1. Use the `unstructured` library to partition the PDF document into different types of elements.

In [None]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    # Note: this run partitions a different PDF than the statement_of_changes.pdf downloaded above.
    filename="数据驱动的电力系统运行方式分析_侯庆春.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)



2. Categorize the elements

In [None]:
category_counts = {}

for element in raw_pdf_elements:
    category = str(type(element))
    if category in category_counts:
        category_counts[category] += 1
    else:
        category_counts[category] = 1

unique_categories = set(category_counts.keys())
category_counts

{"<class 'unstructured.documents.elements.CompositeElement'>": 11,
 "<class 'unstructured.documents.elements.Table'>": 1}

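The counting loop above can also be written with `collections.Counter`. The sketch below keys on the class name rather than the full `str(type(...))` representation; the two stand-in classes are hypothetical and exist only so the example runs without `unstructured` installed.

```python
from collections import Counter


def count_element_types(elements) -> Counter:
    """Count elements by class name, e.g. 'CompositeElement' or 'Table'."""
    return Counter(type(element).__name__ for element in elements)


# Stand-in classes so the sketch runs without the unstructured library.
class CompositeElement:
    pass


class Table:
    pass


demo = [CompositeElement(), CompositeElement(), Table()]
print(count_element_types(demo))
```

With the real partition result, `count_element_types(raw_pdf_elements)` should give the same tallies as the dictionary above, keyed by short class names.
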
In [None]:
class Element(BaseModel):
    type: str
    text: Any

table_elements = []
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        table_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        text_elements.append(Element(type="text", text=str(element)))

In [None]:
print(len(table_elements))
print(len(text_elements))

1
11


In [None]:
table_elements[0]

Element(type='table', text='火电装机/MW 低比例情形 中比例情形 高比例情形 18260 18260 18260 水电装机/MW 9258 9258 9258 风电装机/MW 19 12772 14901 光伏装机/MW 可再生能源电量渗透率 7 0% 6502 20% 11002 30%')

In [None]:
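# Stale cell: only one table was extracted above, so index 2 would raise IndexError on this document.
# The output below appears to come from an earlier run on statement_of_changes.pdf.
table_elements[2]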

Element(type='table', text='1. Title of Derivate |2. 3. Trans. 3A. Deemed |4. Trans. Code |5. Number of 6. Date Exercisable 7. Title and Amount of 8. Price of ]9. Number of | 10. Security Conversion | Date Execution |(Instr, 8) Derivative Securities |and Expiration Date _| Securities Underlying Derivative }derivative | Ownership] (Instr. 3) or Exercise Date, if any Acquired (A) or Derivative Security Security |Securities |Formof Price of Disposed of (D) (Instr. 3 and 4) (Instr. 5) |Beneficially | Derivative | Derivative (Instr. 3, 4 and 5) Owned Security: | Security Following —_| Direct (D) - Reported _| or Indirect Date Expiration] 1... | Amount or Number of Transaction(s)] (1) (Instr. coe |v | (a) (D) | Exercisable|Date Shares (instr. 4) 4)')

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

3. Build a summarization chain with the LangChain framework

In [None]:
prompt_text = """
  You are responsible for concisely summarizing table or text chunk:

  {element}
"""
prompt = ChatPromptTemplate.from_template(prompt_text)
summarize_chain = {"element": lambda x: x} | prompt | ChatOpenAI(temperature=0, model="gpt-3.5-turbo") | StrOutputParser()
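
The `|` operator chains the stages so that each stage's output becomes the next stage's input. The toy model below illustrates that idea in plain Python; it is not LangChain's implementation, and the `Step` class and the four stand-in stages are hypothetical.

```python
class Step:
    """Wrap a function so steps can be chained with `|` (toy model only)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Composition: run self first, then feed its result to `other`.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for the four stages of summarize_chain.
to_inputs = Step(lambda x: {"element": x})
fill_prompt = Step(lambda d: f"Summarize concisely:\n{d['element']}")
fake_model = Step(lambda prompt: {"content": f"[summary of {len(prompt)} chars]"})
parse_output = Step(lambda message: message["content"])

toy_chain = to_inputs | fill_prompt | fake_model | parse_output
print(toy_chain.invoke("Wind capacity: 14901 MW"))
```

Reading the real chain the same way: the input is mapped to `{"element": ...}`, formatted into the prompt, sent to the model, and the reply is reduced to a plain string.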

4. Summarize each text and table element

In [None]:
tables = [i.text for i in table_elements]
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

texts = [i.text for i in text_elements]
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
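
`max_concurrency` caps how many summarization calls run at the same time, and the results need to come back in input order because later cells pair summaries with chunks by index. A rough stand-in for that behaviour using a thread pool is sketched below; `batch_apply` and `fake_summarize` are hypothetical names, and this is not LangChain's code.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_apply(func, inputs, max_concurrency: int = 5):
    """Apply func to every input with at most max_concurrency calls in flight.

    Results are returned in input order, so results[i] corresponds to inputs[i].
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, inputs))


# Hypothetical stand-in for a call to the summarization model.
def fake_summarize(text: str) -> str:
    return text[:20]


print(batch_apply(fake_summarize, ["first chunk of text", "second chunk"]))
```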

In [None]:
print(text_summaries)

['This text is a summary of a research paper titled "Data-driven Power System Operation Mode Analysis" published in the Proceedings of the CSEE. The paper discusses the impact of increasing renewable energy penetration and power electronic devices integration on power system operation modes. It proposes a data-driven method to analyze power system operation modes and their variation based on high dimensional simulated chronological power system operation data. The method involves preprocessing the data, identifying representative operation mode patterns using clustering algorithm, and extracting key features for visualization. The paper also introduces indices to evaluate power system operation modes space dispersion, seasonal consistency, and time variation. A case study on Gansu provincial power system in China is presented to validate the proposed data-driven method, showing the impacts of high renewable energy penetration on power system operation modes.', 'This text discusses the 

5. Use LangChain's `MultiVectorRetriever` to associate the table and text summaries with the original chunks in a parent-child relationship.

In [None]:
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    id_key=id_key,
)

# Add texts
doc_ids = [str(uuid.uuid4()) for _ in texts]
summary_texts = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]
retriever.vectorstore.add_documents(summary_texts)
retriever.docstore.mset(list(zip(doc_ids, texts)))

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content=s, metadata={id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))
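
The parent-child idea is that search runs over the short summaries, while the answer context is the original chunk each summary points to through a shared `doc_id`. A toy sketch of that lookup follows; word overlap stands in for embedding similarity, and `ToyMultiVectorStore` is a hypothetical class, not the LangChain retriever.

```python
import uuid


def tokenize(text: str) -> set:
    return set(text.lower().split())


class ToyMultiVectorStore:
    """Search over summaries, return the original chunks they point to."""

    def __init__(self, id_key: str = "doc_id"):
        self.id_key = id_key
        self.summaries = []  # list of (summary_text, metadata)
        self.docstore = {}   # doc_id -> original chunk

    def add(self, summaries, originals):
        for summary, original in zip(summaries, originals):
            doc_id = str(uuid.uuid4())
            self.summaries.append((summary, {self.id_key: doc_id}))
            self.docstore[doc_id] = original

    def retrieve(self, query: str, k: int = 1):
        # Word overlap stands in for embedding similarity (an assumption of this sketch).
        query_words = tokenize(query)
        ranked = sorted(
            self.summaries,
            key=lambda item: len(query_words & tokenize(item[0])),
            reverse=True,
        )
        # Follow each matching summary's doc_id back to the original chunk.
        return [self.docstore[meta[self.id_key]] for _, meta in ranked[:k]]


store = ToyMultiVectorStore()
store.add(
    summaries=["table of installed wind and solar capacity", "abstract of the paper"],
    originals=["RAW TABLE TEXT", "RAW ABSTRACT TEXT"],
)
print(store.retrieve("wind capacity"))
```

The query matches the summary, but what comes back is the raw chunk, which is why the model downstream sees full table text rather than a paraphrase.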



In [None]:
print(summary_texts)

[Document(page_content='This text is a summary of a research paper titled "Data-driven Power System Operation Mode Analysis" published in the Proceedings of the CSEE. The paper discusses the impact of increasing renewable energy penetration and power electronic devices integration on power system operation modes. It proposes a data-driven method to analyze power system operation modes and their variation based on high dimensional simulated chronological power system operation data. The method involves preprocessing the data, identifying representative operation mode patterns using clustering algorithm, and extracting key features for visualization. The paper also introduces indices to evaluate power system operation modes space dispersion, seasonal consistency, and time variation. A case study on Gansu provincial power system in China is presented to validate the proposed data-driven method, showing the impacts of high renewable energy penetration on power system operation modes.', met

In [None]:
from langchain.schema.runnable import RunnablePassthrough

template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(temperature=0, model="gpt-4")
    | StrOutputParser()
)

In [None]:
print(chain)

first={
  context: MultiVectorRetriever(vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7f8a4805f640>, docstore=<langchain.storage.in_memory.InMemoryBaseStore object at 0x7f8a47edeb90>),
  question: RunnablePassthrough()
} middle=[ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Answer the question based only on the following context, which can include text and tables:\n{context}\nQuestion: {question}\n'))]), ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7f8a46ff2f50>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7f8a46f28be0>, model_name='gpt-4', temperature=0.0, openai_api_key='<redacted>', openai_api_base='https://api.openai-proxy.org/v1', openai_proxy='')] last=StrOutputParser()


In [None]:
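# Question (in English): "In the Gansu power grid case study, what is the highest installed
# wind capacity across the scenarios, and what is the corresponding renewable energy penetration?"
chain.invoke("甘肃电网算例中，各种情境下，风电装机最高是多少?对应的可再生能源电量渗透率有多少？")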

'在甘肃电网的算例中，风电装机在各种情境下的最高值为14901MW，对应的可再生能源电量渗透率为30%。'


6. Experiment with GPT-3.5

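It looks like it doesn't perform quite as well as GPT-4: in the run below it returns the same figures, but it answers in English even though the question was asked in Chinese.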

In [None]:
# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
    | StrOutputParser()
)
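# Same question as before: highest installed wind capacity across the Gansu scenarios
# and the corresponding renewable energy penetration.
chain.invoke("甘肃电网算例中，各种情境下，风电装机最高是多少?对应的可再生能源电量渗透率有多少？")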

'Based on the provided context, the highest wind power installed capacity in the Gansu power grid scenario is 14901 MW, and it corresponds to a renewable energy penetration rate of 30%.'