# Use AI to process a PDF
> Convert PDF to text and process the text through an LLM

## Ollama CLI demo

This cell runs a shell command from the notebook; no Python is needed yet.

In [None]:
!ollama run llama3.2 "What is a PDF in 3 bullet points?"

## Read PDF

Load the PDF as text using Python. Pass a URL or a local path to a PDF.

This example reads a psychology textbook from [openstax.org](https://openstax.org/), a source of free PDFs.
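
Since the input can be either a URL or a local path, it can help to check which one you have before loading. The sketch below uses only the standard library; `describe_pdf_input` is a hypothetical helper for illustration and is not part of LangChain.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe_pdf_input(pdf_input: str) -> str:
    """Classify a PDF input as 'url', 'file', or 'missing' (hypothetical helper)."""
    scheme = urlparse(pdf_input).scheme
    if scheme in ("http", "https"):
        return "url"
    # Anything else is treated as a local path and checked on disk.
    return "file" if Path(pdf_input).is_file() else "missing"


print(describe_pdf_input("https://assets.openstax.org/oscms-prodcms/media/documents/Psychology2e_WEB.pdf"))
print(describe_pdf_input("no_such_file.pdf"))
```

A result of `missing` usually points to a typo in a local path, which is cheaper to catch here than inside the loader.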

In [None]:
from langchain_community.document_loaders import PyPDFLoader

In [None]:
PDF_INPUT = "https://assets.openstax.org/oscms-prodcms/media/documents/Psychology2e_WEB.pdf"

In [None]:
loader = PyPDFLoader(PDF_INPUT)
docs = loader.load_and_split()

content = [d.page_content for d in docs]

In [None]:
# load_and_split() loads one document per page and then runs a text splitter,
# so long pages may be split further and the chunk count may not equal the page count.
print(f"chunks: {len(docs)}")
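
To see how chunks relate to pages, one option is to count chunks per page number. This is a minimal sketch that assumes each chunk's metadata has a `page` key, which `PyPDFLoader` normally sets; `chunks_per_page` and the sample metadata are hypothetical. In the notebook you would pass `[d.metadata for d in docs]` instead of `sample`.

```python
from collections import Counter


def chunks_per_page(metadatas):
    """Count how many chunks came from each page number."""
    return Counter(m["page"] for m in metadatas if "page" in m)


# Hypothetical metadata standing in for [d.metadata for d in docs].
sample = [
    {"source": "book.pdf", "page": 0},
    {"source": "book.pdf", "page": 1},
    {"source": "book.pdf", "page": 1},
    {"source": "book.pdf", "page": 2},
]
counts = chunks_per_page(sample)

# Pages that produced more than one chunk.
split_pages = sorted(page for page, n in counts.items() if n > 1)
print(split_pages)
```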


In [None]:
# First line for each chunk.
for chunk in content:
    print(chunk.splitlines()[0])
    print('---')

In [None]:
# This demo uses the Taste section, found in this chunk (page 177 of the PDF, page 164 of the book).
docs[194].metadata

In [None]:
print(content[194])
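
A hard-coded index such as 194 depends on this exact PDF and on how it was split. As an alternative, a keyword search over the chunks might locate the section. The sketch below is an assumption about how you could do that; `find_chunks` and `sample_content` are hypothetical, and in the notebook you would search `content` instead.

```python
def find_chunks(chunks, keyword):
    """Return indexes of chunks that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [i for i, text in enumerate(chunks) if needle in text.lower()]


# Hypothetical stand-in for the notebook's `content` list.
sample_content = [
    "Vision\nLight waves enter the eye...",
    "Taste (Gustation)\nTaste buds detect...",
    "Smell (Olfaction)\nOlfactory receptors...",
]
matches = find_chunks(sample_content, "taste")
print(matches)
```

On a full textbook a common word will match many chunks, so a more specific phrase, such as a section heading, is likely to work better.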

## Translate the PDF text

This example translates from English to Dutch. To do something else, swap in another prompt, such as the commented-out sample below.

In [None]:
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [None]:
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "llama3.2")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "dummy")
OPENAI_API_URL = os.getenv("OPENAI_API_URL", "http://localhost:11434/v1")

SYSTEM_PROMPT = "You are a helpful assistant."

In [None]:
model = ChatOpenAI(
    base_url=OPENAI_API_URL,
    openai_api_key=OPENAI_API_KEY,
    model=OPENAI_MODEL,
)

In [None]:
user_prompt = """
You are an expert at translating text from English to Dutch. Keep the original structure and use Markdown format.
Give only the translated answer and nothing else: no explanation, preamble, or conclusion.

Context: '''{context}'''
""".strip()

# user_prompt = """
# Explain like I'm 5.

# Explain the following in basic terms in bullet points. Give only the summarized content, with no preamble or conclusion.

# Context: '''{context}'''
# """.strip()


template = ChatPromptTemplate.from_messages(
    [("system", SYSTEM_PROMPT), ("human", user_prompt)]
)
chain = template | model
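
To make the `{context}` placeholder concrete, here is a plain-Python stand-in for the substitution step, using `str.format` rather than LangChain. It is only an illustration under that assumption: `demo_prompt` and `chunk` are hypothetical, and the real template also wraps the text in system and human messages.

```python
# Shortened, hypothetical prompt with the same placeholder as the notebook's prompt.
demo_prompt = """
You are an expert at translating text from English to Dutch.

Context: '''{context}'''
""".strip()

chunk = "Taste buds contain receptor cells."  # hypothetical chunk text

# The placeholder is replaced by the chunk text before the prompt is sent.
rendered = demo_prompt.format(context=chunk)
print(rendered)
```

Each call to `chain.invoke({"context": ...})` fills the placeholder in a similar way, once per chunk.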

In [None]:
# Use the target chunk and a few more.
sliced_content = content[194:194+3]

for i, chunk in enumerate(sliced_content, start=1):
    result = chain.invoke({"context": chunk})

    print(i)
    print("ORIGINAL")
    print(chunk)
    print()
    print("TRANSLATED")
    print(result.content)
    print()
    print('='*80)
    print()