<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/06_PDF_CAG.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 6. PDF CAG (Credibility-Aware Generation)

## Overview  
In this exercise, we apply Credibility-Aware Generation (CAG) to PDF documents with Solar. The notebook extracts text from a PDF, passes it to the model as context, and instructs the model to answer only from that context, stating explicitly when the answer is not there.
 
## Purpose of the Exercise
The purpose of this exercise is to demonstrate how Credibility-Aware Generation combines with PDF document processing. By the end of this tutorial, you will be able to load a PDF, check what text was actually extracted, and query Solar so that its answers stay tied to the document, which makes the results easier to trust.


In [5]:
! pip3 install -qU langchain-upstage langchain-community pypdf python-dotenv pdf2image pytesseract

In [6]:
# @title set API key
import os
import getpass
from pprint import pprint
import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata
    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass.getpass("Enter your Upstage API key: ")


![SolarSample](figures/solar_sample.png)

In [27]:
from pdf2image import convert_from_path
import pytesseract

# Convert the PDF pages to images (requires Poppler to be installed)
images = convert_from_path("C:\\Users\\yes10\\Downloads\\example_ir.pdf")

# Extract the text of the first page with OCR
page_text = pytesseract.image_to_string(images[0], lang="eng")
print(page_text[:1000])  # Print the first part of the first page's text

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
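
The error above means `pdf2image` could not find Poppler's command-line tools. A minimal sketch of a pre-flight check follows; `missing_poppler_tools` is a hypothetical helper (not part of `pdf2image`), and it assumes the two executables `pdf2image` relies on are `pdfinfo` and `pdftoppm`.

```python
import shutil

# Assumption: these are the Poppler executables pdf2image shells out to.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Poppler tools not found on PATH:", ", ".join(missing))
else:
    print("Poppler tools found; convert_from_path should be able to run.")
```

If Poppler is installed but not on `PATH` (common on Windows), `convert_from_path` also accepts a `poppler_path` argument pointing at the folder that contains these executables.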

In [20]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("C:\\Users\\yes10\\Downloads\\example_ir.pdf")
docs = loader.load()  # or loader.lazy_load()
print(docs[0].page_content[:1000])

24. 11. 2. 오후  12:02 영문IR
https://www.jointips.or.kr/resources/ir_en/ 1/1
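
The output above contains only the browser's print header and footer, which suggests the page body is an image with no text layer. A rough sketch of a check for this situation follows; `needs_ocr` is a hypothetical helper and the 200-character threshold is an arbitrary assumption, not a recommended value.

```python
def needs_ocr(page_texts, min_chars=200):
    """Heuristic: True when the average extracted text per page is very short.

    `min_chars` is an assumed cutoff; tune it for your own documents.
    """
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars


# The single page extracted above: a timestamp, a title, and a URL only.
extracted = ["24. 11. 2. 오후  12:02 영문IR\nhttps://www.jointips.or.kr/resources/ir_en/ 1/1"]
print(needs_ocr(extracted))
```

With the loader from the cell above, the same check would be `needs_ocr([d.page_content for d in docs])`.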


In [19]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [22]:
chain.invoke({"question": "Explain Business Model?", "Context": docs})

'The information is not present in the context.'
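
Passing `docs` directly means the prompt receives the string form of a Python list of `Document` objects rather than clean page text. A minimal sketch of building the context string explicitly follows; `format_context` is a hypothetical helper, `FakeDoc` is a stand-in so the example runs without LangChain, and the sample sentences are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document: only the attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(documents, separator="\n\n"):
    """Join non-empty page texts, labelling each with its page number."""
    parts = []
    for index, doc in enumerate(documents):
        text = doc.page_content.strip()
        if not text:
            continue  # skip pages where nothing was extracted
        page = doc.metadata.get("page", index)
        parts.append(f"[page {page}]\n{text}")
    return separator.join(parts)


# Illustrative, made-up page contents.
sample = [
    FakeDoc("Revenue grew in 2023.", {"page": 0}),
    FakeDoc("   "),
    FakeDoc("Team of 40.", {"page": 2}),
]
print(format_context(sample))
```

With real documents the call would become `chain.invoke({"question": "...", "Context": format_context(docs)})`. An empty result is also a quick signal that the PDF yielded no usable text.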

In [12]:
chain.invoke({"question": "What is the MMLU score of SOLAR 10.7B?", "Context": docs})

'The MMLU score of SOLAR 10.7B is 66.04.'

In [13]:
chain.invoke({"question": "What is ARC of Mistral?", "Context": docs})

'The information is not present in the context.'

# Exercise

How can we easily extract information from complicated tables so that an LLM can use it?
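
One possible direction, offered as a sketch rather than the intended answer: once table cells have been recovered by some parser, serialize them in a format that keeps rows and columns explicit, such as a Markdown table, before placing them in the prompt. `rows_to_markdown` is a hypothetical helper, and the single data row reuses the MMLU value quoted earlier in this notebook.

```python
def rows_to_markdown(rows):
    """Render rows (first row is the header) as a Markdown table string."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(row):
        # Escape pipes so cell text cannot break the table layout.
        cells = [str(cell).replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells[:width]) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


table = [["Model", "MMLU"], ["SOLAR 10.7B", "66.04"]]
print(rows_to_markdown(table))
```

Plain text extraction tends to flatten tables into a stream of numbers, so keeping the header next to each column gives the model a better chance of matching a value to the right model and metric.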