Reference
https://python.langchain.com/docs/tutorials/rag/    

#### Configuration

1. VS Code > Kernel > Install Python + Jupyter   

2. Create a virtual environment for Python 

3. Install tools    

In [2]:
pip install --upgrade pip

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.2.4
    Uninstalling pip-21.2.4:
      Successfully uninstalled pip-21.2.4
Successfully installed pip-25.1.1
Note: you may need to restart the kernel to use updated packages.


#### Install LangChain & Tools

In [None]:
pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install -qU "langchain[openai]"

Note: you may need to restart the kernel to use updated packages.


##### Store the OpenAI API key in a .env file
https://abc-notes.data.tech.gov.sg/notes/topic-6-ai-agents-with-tools/2.-a-more-secure-way-to-store-credentials.html

In [None]:
pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Using cached python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


##### Vector DB - Elasticsearch    

* Tutorial : https://python.langchain.com/docs/integrations/vectorstores/elasticsearch/    

* Blog : https://www.elastic.co/search-labs/blog/dataset-translation-langchain-python-elastic#:~:text=Loading%20the%20translated%20articles%20into%20a%20vector%20database%20and%20searching

##### Elastic Cloud

* Web : https://cloud.elastic.co

* Free-trial : https://cloud.elastic.co/registration?utm_source=langchain&utm_content=documentation    

* Get started with Elasticsearch Serverless : 
https://www.elastic.co/docs/solutions/search/serverless-elasticsearch-get-started

In [15]:
pip install langchain-elasticsearch

Note: you may need to restart the kernel to use updated packages.


In [1]:
%pip install pdf2image -q
# pdfminer.six provides the pdfminer module imported below; the legacy "pdfminer" package is not needed
%pip install pdfminer.six -q
%pip install openai -q
%pip install scikit-learn -q
%pip install rich -q
%pip install tqdm -q
%pip install pandas -q

Note: you may need to restart the kernel to use updated packages.


# Install poppler in the terminal (Homebrew); pdf2image requires it
brew install poppler

#### Code

In [None]:
# Imports
from pdf2image import convert_from_path
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

from pdfminer.high_level import extract_text
import base64
import io
import os
from openai import OpenAI
import re 
import numpy as np
from rich import print

def convert_doc_to_images(path):
    images = convert_from_path(path)
    return images

def extract_text_from_doc(path):
    text = extract_text(path)
    return text

In [9]:
%pip install pillow -q

Note: you may need to restart the kernel to use updated packages.


#### Convert pdf to image
https://cookbook.openai.com/examples/parse_pdf_docs_for_rag


In [None]:
file_path = os.path.abspath("data.pdf")
imgs = convert_doc_to_images(file_path)
for img in imgs:
    display(img)

In [None]:
import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_openai import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser

In [182]:
load_dotenv()

True
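
`load_dotenv()` returns `True` when it finds a `.env` file, but that does not confirm every key the notebook needs was set. A minimal sketch of a fail-fast check follows. The helper name `missing_env_vars` and the `REQUIRED_VARS` list are assumptions for illustration; `ELASTICSEARCH_URL` and `ELASTIC_API_KEY` are the names read later in this notebook, and `OPENAI_API_KEY` is the variable the OpenAI client reads by default.

```python
import os

# Variable names this notebook relies on (assumed list, adjust as needed)
REQUIRED_VARS = ["OPENAI_API_KEY", "ELASTICSEARCH_URL", "ELASTIC_API_KEY"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` surfaces a misnamed or empty key before any API call is made.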

In [None]:
# Converting images to base64 encoded images
def get_img_uri(img):
    png_buffer = io.BytesIO()
    img.save(png_buffer, format="PNG")
    png_buffer.seek(0)

    base64_png = base64.b64encode(png_buffer.read()).decode('utf-8')

    data_uri = f"data:image/png;base64,{base64_png}"
    return data_uri
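
To make the structure of the data URI concrete, here is a small sketch that splits one back into its MIME type and raw bytes. `split_data_uri` is a hypothetical helper, and the payload is placeholder bytes rather than a real PNG.

```python
import base64

def split_data_uri(data_uri):
    """Split a base64 data URI into its MIME type and decoded bytes."""
    header, encoded = data_uri.split(",", 1)
    # header looks like "data:image/png;base64"
    mime_type = header[len("data:"):].split(";", 1)[0]
    return mime_type, base64.b64decode(encoded)

# Round trip with placeholder bytes instead of a real image
payload = b"\x89PNG placeholder"
uri = "data:image/png;base64," + base64.b64encode(payload).decode("utf-8")
mime_type, decoded = split_data_uri(uri)
```

A round trip like this is a quick way to confirm the encoding is lossless before sending images to the model.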

#### Extract text from image

In [138]:
# PDF contents: explanatory text / tables / diagrams / code / references

extract_prompt = '''
You will be provided with images containing text, representing multiple pages of a document.

Your task is to extract all readable text from the images **as accurately and completely as possible**, and organize it by logical sections.

- **Do Not** add any explanations, summaries, or interpretations.
- **Do Not** mark or mention page numbers in the output.
- Remove any isolated date stamps, page numbers, headers, footers, or other non-content elements.
- Exclude any repeated headers titled "Prompt Engineering"
- If a subheading is clearly present (e.g., bolded, underlined, capitalized, or formatted distinctly), start a new section from that point.
- Maintain the original order, flow of the content, and line breaks.

- **Diagrams**: Explain each component and how they interact. For example, "The process begins with X, which then leads to Y and results in Z."
- **Tables**: Break down the information logically. In the table, bolded characters represent the type or name of the data (such as variable names or categories), while non-bolded characters represent the actual data values. For example, "Product A costs 100 dollars, while Product B is priced at 200 dollars."

Output should be structured as a sequence of content sections divided by meaningful subheadings or topic shifts.

------

If there is an identifiable title, present the output in the following format:

{TITLE}
{Content description}
'''

client = OpenAI()

def analyze_image(data_uri):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": extract_prompt},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": data_uri}
                    }
                ]
            },
        ],
        max_tokens=500,  # caps the reply length; dense pages may be cut off
        temperature=0,
        top_p=0.1
    )
    return response.choices[0].message.content
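
Because `max_tokens` caps the reply, a text-heavy page can come back incomplete. Chat completion responses expose a `finish_reason` on each choice, and the value `"length"` indicates the token limit was hit. The sketch below shows one way to detect that; `was_truncated` is a hypothetical helper, and `fake_response` is a stand-in object shaped like a response, not real API output.

```python
from types import SimpleNamespace

def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stand-in object shaped like a chat completion response, for illustration only
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="partial text"),
        )
    ]
)

if was_truncated(fake_response):
    print("Output was cut off; consider raising max_tokens for this page.")
```

A check like this could be called inside `analyze_image` before returning the content, so truncated pages are flagged rather than silently indexed.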

In [74]:
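# Clean up the extracted text
def clean_text(text):
    # Drop "space + newline" sequences, then collapse any run of 2+ newlines into one
    content = re.sub(r"\n{2,}", "\n", text.replace(' \n', '')).strip()
    # Strip Markdown emphasis markers (* and **)
    content = re.sub(r"\*{1,2}", "", content)
    return content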
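
Chained `str.replace` calls are order-sensitive when collapsing blank lines: once pairs of newlines have been replaced, a later pattern for three newlines may no longer match. The sketch below compares that approach with a single regular expression. The `sample` string is made up for illustration and is not output from the PDF.

```python
import re

sample = "Title\n\n\nBody line one\n\n\n\nBody line two"

# Chained replace: pairs are replaced first, so a run of 3 newlines becomes 2
chained = sample.replace("\n\n", "\n").replace("\n\n\n", "\n")

# Regex: any run of 2 or more newlines collapses to exactly one
collapsed = re.sub(r"\n{2,}", "\n", sample)

print(repr(chained))
print(repr(collapsed))
```

The regex form does not depend on the order or number of passes, which makes it the safer choice for model output with irregular spacing.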

In [139]:
# Extract text
documents = []
for i, img in enumerate(imgs):
    data_uri = get_img_uri(img)  # convert the img object to a data URI
    text = analyze_image(data_uri)
    text = clean_text(text)
    doc = Document(page_content=text, metadata={"image_name": f"page_{i+1}"})
    documents.append(doc)

In [None]:
for doc in documents:
    print(doc)

#### Data cleaning

In [None]:
doc = documents[11]
print(doc)

new_content = doc.page_content.replace('-', '')
new = Document(metadata=doc.metadata, page_content=new_content)
print(new)

In [None]:
documents[11] = new
print(documents[11])

In [149]:
doc = documents[12]
print(doc)

new_content = re.sub(r'\s{2,}', ' ', doc.page_content.replace('-', ''))
new = Document(metadata=doc.metadata, page_content=new_content)
print(new)

In [None]:
documents[12] = new
print(documents[12])

In [None]:
doc = documents[14]
print(doc)

new_content = re.sub(r'\s{2,}', ' ', doc.page_content.replace('```', ''))
new = Document(metadata=doc.metadata, page_content=new_content)

print(new)

In [None]:
documents[14] = new
print(documents[14])

In [156]:
doc = documents[18]
print(doc)

new_content = re.sub(r'\s{2,}', ' ', doc.page_content.replace('-', ''))
new = Document(metadata=doc.metadata, page_content=new_content)

print(new)

In [None]:
documents[18] = new
print(documents[18])

In [159]:
doc = documents[27]
print(doc)

new_content = re.sub(r'\s{2,}', ' ', doc.page_content.replace('---', ''))
new = Document(metadata=doc.metadata, page_content=new_content)

print(new)

In [None]:
documents[27] = new
print(documents[27])

In [166]:
doc = documents[57]
print(doc)

new_content = re.sub(r'[-|｜]', ' ', doc.page_content)
new_content = re.sub(r'\s{2,}', ' ', new_content)
new = Document(metadata=doc.metadata, page_content=new_content)

print(new)

In [None]:
documents[57] = new
print(documents[57])

In [169]:
doc = documents[58]
print(doc)

new_content = re.sub(r'\s{2,}', ' ', doc.page_content.replace('-', ''))
new = Document(metadata=doc.metadata, page_content=new_content)

print(new)

In [None]:
documents[58] = new
print(documents[58])
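
The per-page cells above repeat the same pattern: remove a marker, collapse whitespace, write the result back. A minimal sketch of a table-driven version follows, working on plain strings so it runs without LangChain. `strip_markers`, `clean_pages`, and `CLEANUP_RULES` are hypothetical names, and the rules shown cover only some of the pages handled above. Unlike the cell for page 11, this helper always collapses whitespace.

```python
import re

def strip_markers(text, markers, replacement=""):
    """Replace each literal marker in text, then collapse runs of whitespace."""
    for marker in markers:
        text = text.replace(marker, replacement)
    return re.sub(r"\s{2,}", " ", text)

# Page index -> markers to drop (illustrative subset of the manual cells)
CLEANUP_RULES = {
    12: ["-"],
    18: ["-"],
    27: ["---"],
}

def clean_pages(pages, rules):
    """Return a copy of pages (list of str) with the rules applied per index."""
    cleaned = list(pages)
    for index, markers in rules.items():
        if index < len(cleaned):
            cleaned[index] = strip_markers(cleaned[index], markers)
    return cleaned
```

With real documents, the same idea applies to `doc.page_content`, rebuilding each `Document` with its original metadata as the cells above do.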

#### Elasticsearch

In [183]:
# Connect to Elasticsearch (free trial until 05/28)
from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=os.getenv("ELASTICSEARCH_URL"),
    api_key=os.getenv("ELASTIC_API_KEY")
)

In [None]:
print(es.info())

#### Vector Embedding

In [62]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [174]:
# Store the documents together with their embeddings
es_store = ElasticsearchStore.from_documents(
    documents=documents,
    es_connection=es,
    index_name="llm-elastic",
    embedding=embeddings
)

In [185]:
# Search test
results = es_store.similarity_search("What is top k?", k=1)
for r in results:
    print(r.page_content)
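
Conceptually, a similarity search ranks stored vectors by how close they are to the query vector. The toy sketch below illustrates that idea with cosine similarity on hand-written 3-dimensional vectors. It is not how Elasticsearch implements the search, and the names `cosine_similarity`, `top_k`, and `toy_vectors` are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hand-written stand-ins for real embeddings
toy_vectors = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.0, 1.0, 0.0],
    "page_3": [0.7, 0.7, 0.0],
}
best = top_k([0.9, 0.1, 0.0], toy_vectors, k=2)
print(best)
```

Real embeddings from `text-embedding-3-small` have far more dimensions, but the ranking principle is the same.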

#### Question & Answer

In [None]:
llm = init_chat_model("gpt-4o", model_provider="openai")

In [106]:
from langchain_core.prompts import ChatPromptTemplate

In [None]:
system_prompt = '''
You will be provided with an input prompt and content as context that can be used to reply to the prompt.
    
You will do 2 things:
    
1. First, you will internally assess whether the content provided is relevant to reply to the input prompt.
2a. If the content is relevant, answer directly, using elements found in the content to craft a reply to the input prompt.
2b. If the content is not relevant, use your own knowledge to reply or say that you don't know how to respond if your knowledge is not sufficient to answer.
    
Stay concise with your answer, replying specifically to the input prompt without mentioning additional information provided in the context content.
'''

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("user", "Content:\n{content}\n\nQuestion: {question}")
])

# Vector search + GPT answer
def answer_question(question):
    retriever = es_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"score_threshold": 0.5, "k": 3})
    
    docs = retriever.invoke(question)  # get_relevant_documents() is deprecated
    content = "\n\n".join([doc.page_content for doc in docs])
    
    messages = {
        "content" : content,
        "question" : question
    }
    
    chain = prompt | llm | StrOutputParser()

    response = chain.invoke(messages)
    answer = f"Answer:\n{response}\n\n------\nRelated content:\n{content}"
    return answer
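
The retriever settings combine two limits: at most `k` chunks, and only chunks whose relevance score clears `score_threshold`. The sketch below mimics that selection on plain `(text, score)` pairs so the effect is easy to see. `select_context` is a hypothetical helper, the sample texts and scores are invented, and the exact comparison LangChain applies is assumed here to be "greater than or equal".

```python
def select_context(scored_chunks, score_threshold=0.5, k=3):
    """Keep up to k chunks whose score meets the threshold, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [text for text, score in ranked[:k] if score >= score_threshold]
    # Same separator that answer_question uses to join page contents
    return "\n\n".join(kept)

# Invented chunks and scores, for illustration only
context = select_context([
    ("Top-K sampling keeps the K most likely tokens.", 0.82),
    ("Unrelated appendix text.", 0.31),
    ("Temperature controls randomness.", 0.57),
])
print(context)
```

When nothing clears the threshold the context is an empty string, which is the case the system prompt's step 2b is written to handle.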

In [186]:
user_question = "Explain contextual prompting"
answer = answer_question(user_question)
print(answer)