**Document Analysis using LLMs with Python**

In [1]:
! pip install pdfplumber



In [21]:
import re

In [2]:
import pdfplumber

pdf_path = "/content/google_cloud_ai_agent_trends_2026_report.pdf"
output_text_file = "text_extract.txt"

with pdfplumber.open(pdf_path) as pdf:
  extracted_text = ""
  for page in pdf.pages:
    # extract_text() may return None for a page without a text layer
    # (depending on the pdfplumber version), so fall back to ""
    extracted_text += page.extract_text() or ""

with open(output_text_file, "w", encoding="utf-8") as text_file:
  text_file.write(extracted_text)
print(f"Text extracted and saved to {output_text_file}")

Text extracted and saved to text_extract.txt


In [3]:
# Read the extracted text back and preview the first 500 characters
with open(output_text_file, "r", encoding="utf-8") as file:
  document_text = file.read()
print(document_text[:500])

AI agent
trends
2026
Five shifts that will
redefine roles, workflows,
and business value in 2026.
1About this
report
This report provides key insights for business
leaders to shape their AI agent strategy for
2026 and beyond. Within each trend, you will
find real-life examples, technical resources,
and customer stories to share with your
teams for deeper learning.
These trends were identified using a blend
of qualitative and quantitative data, including
internal Google Cloud and Google DeepMind



In [4]:
# Summarize the document with the pretrained t5-small summarization model from transformers
from transformers import pipeline
# Load the summarization pipeline
summarizer = pipeline("summarization", model="t5-small")
# do_sample=False gives deterministic output
summary = summarizer(document_text, max_length=150, min_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (7345 > 512). Running this sequence through the model will result in indexing errors
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summary: AI is transforming the security landscape in both offense and defense . it is a fundamental shift in the role of the employee as strategic . the key to achieving this gap is to build a faster, smarter workforce . if you are a business owner, you should be able to work with a specialized agent .
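
The warning above reports 7345 tokens against a 512-token limit for t5-small, so a single call on the whole document is unlikely to reflect all of it. One common workaround is to summarize the text in chunks and join the partial summaries. The sketch below is a minimal illustration, not part of the original notebook: `chunk_words`, `summarize_in_chunks`, and the 300-word chunk size are assumptions, and `summarize_fn` stands for any callable that maps a text to its summary.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_in_chunks(
    text: str,
    summarize_fn: Callable[[str], str],
    max_words: int = 300,
) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(s.strip() for s in partial if s.strip())
```

With the pipeline from the cell above, `summarize_fn` could be `lambda t: summarizer(t, max_length=60, min_length=10, do_sample=False, truncation=True)[0]["summary_text"]`. Word counts only approximate token counts, so `truncation=True` acts as a guard for chunks that still exceed the limit.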


In [5]:
# Split the document into sentences and passages
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(document_text)  # List of sentences

passages = []
current_passage = ""

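for sentence in sentences:
    # Keep adding sentences while the passage stays under 20 words
    if len(current_passage.split()) + len(sentence.split()) < 20:
        current_passage += " " + sentence
    else:
        # Avoid storing an empty passage when the very first
        # sentence is already 20 or more words long
        if current_passage:
            passages.append(current_passage.strip())
        current_passage = sentence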

# add last passage
if current_passage:
    passages.append(current_passage.strip())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Implementing the Cluster → Select → Generate strategy to reduce computational cost:
1. Split the document into passages.
2. Select the most informative passages using a heuristic importance score (here, passage length in words).
3. Generate questions only from the selected passages using an LLM.
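
The selection step can be sketched as follows. This is a minimal illustration that uses word count as the importance heuristic, as the final cell of this notebook does; the function name `select_passages` and the `min_words` filter are hypothetical additions.

```python
from typing import List


def select_passages(
    passages: List[str],
    top_k: int = 10,
    min_words: int = 5,
) -> List[str]:
    """Return the top_k longest passages, ignoring very short ones."""
    candidates = [p for p in passages if len(p.split()) >= min_words]
    # sorted() is stable, so passages of equal length keep document order.
    ranked = sorted(candidates, key=lambda p: len(p.split()), reverse=True)
    return ranked[:top_k]
```

Length is a cheap proxy for information content. It avoids an extra model call, at the cost of favoring long passages regardless of topic.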


In [6]:
def deduplicate_questions(questions):
    seen = set()
    unique = []
    for q in questions:
        nq = q.lower().strip()
        if nq not in seen:
            seen.add(nq)
            unique.append(q)
    return unique


Duplicate Question Handling:
1. LLM-based question generation often produces repeated questions, so a normalization-based deduplication strategy is applied.
2. Each question is lowercased and stripped of surrounding whitespace, and only the first occurrence of each normalized form is retained after every generation step. This removes repeats that differ only in case or surrounding whitespace; paraphrased near-duplicates are not detected.
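
If repeats that differ only in punctuation or internal spacing also need to be caught, the normalization can be made slightly more aggressive. The sketch below is one possible variant, not the notebook's implementation; `normalize_question` and `deduplicate_normalized` are hypothetical names.

```python
import re
from typing import List


def normalize_question(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = question.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate_normalized(questions: List[str]) -> List[str]:
    """Keep the first occurrence of each question under normalize_question."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize_question(q)
        # Skip questions that normalize to nothing or were already seen.
        if key and key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```

For example, `"What is AI?"` and `"what is  ai"` share the same key under this normalization, so only the first is kept.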


In [23]:
# Generate questions from the passage using LLMs
qg_pipeline = pipeline(
    "text2text-generation",
    model="valhalla/t5-base-qg-hl"
)

def generate_questions(passage, min_questions=5):
    results = qg_pipeline(f"generate diverse questions: {passage}")

    questions = results[0]["generated_text"].split("<sep>")
    questions = [q.strip() for q in questions if q.strip()]
    questions = deduplicate_questions(questions)

    if len(questions) < min_questions:
        sentences = passage.split(". ")
        for i in range(len(sentences)):
            if len(questions) >= min_questions:
                break

            extra = " ".join(sentences[i:i+2])
            r = qg_pipeline(f"generate questions: {extra}")
            more_qs = r[0]["generated_text"].split("<sep>")
            questions.extend([q.strip() for q in more_qs if q.strip()])
            questions = deduplicate_questions(questions)

    return questions[:min_questions]


Device set to use cuda:0


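We implemented a hybrid QA system. An extractive model is tried first for every question, and a context-grounded LLM is invoked only when the extractive answer has a low score or is trivial, as tends to happen with reasoning-based questions. Because the LLM is prompted to answer only from context assembled from the document's passages, this setup is meant to reduce hallucinations and limit compute cost.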
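
The routing idea can be sketched independently of any particular model. In this minimal illustration, `route_answer`, `extract_fn`, and `generate_fn` are hypothetical names: `extract_fn` returns an (answer, score) pair and `generate_fn` returns a generated answer. The default thresholds mirror the values used later in this notebook (a score of 0.35 and a six-word minimum).

```python
from typing import Callable, Tuple


def route_answer(
    question: str,
    passage: str,
    extract_fn: Callable[[str, str], Tuple[str, float]],
    generate_fn: Callable[[str, str], str],
    min_score: float = 0.35,
    min_words: int = 6,
) -> Tuple[str, str]:
    """Try the extractive model first; fall back to generation if it is weak."""
    answer, score = extract_fn(question, passage)
    if score >= min_score and len(answer.split()) >= min_words:
        return answer, "extractive"
    # The generative model is called only when extraction is not good enough.
    return generate_fn(question, passage), "abstractive"
```

Calling the cheaper extractive model first means the generative model runs only for the questions that need it.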

In [22]:
def truncate_text(text, max_words=300):
    return " ".join(text.split()[:max_words])


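def clean_context(text):
    # Drop standalone four-digit numbers (e.g. years), bracketed
    # spans such as [12], and runs of whitespace
    text = re.sub(r"\b\d{4}\b", "", text)
    text = re.sub(r"\[[^\]]*\]", "", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()


def clean_repetition(text):
    # Allow a word to appear at most three times in a row;
    # further consecutive repeats are dropped
    words = text.split()
    cleaned = []
    for w in words:
        if len(cleaned) < 3 or cleaned[-3:] != [w] * 3:
            cleaned.append(w)
    return " ".join(cleaned)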


def clean_extractive_answer(ans):
    ans = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+.*?(Google|Forbes|Cloud).*", "", ans)
    ans = re.sub(r"\d+", "", ans)
    ans = ans.strip("”\"- ,.")
    return ans


def normalize_sentence(text):
    sents = sent_tokenize(text)
    if not sents:
        return ""
    sent = sents[0]
    if not sent.endswith("."):
        sent += "."
    return sent


def is_trivial_answer(ans):
    if len(ans.split()) < 6:
        return True
    if sum(c.isdigit() for c in ans) / max(len(ans), 1) > 0.2:
        return True
    return False


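def confidence_score(mode, extractive_score=None):
    # Extractive answers report the model's own score; the other modes
    # use fixed heuristic values rather than calibrated probabilities
    if mode == "extractive" and extractive_score is not None:
        return round(extractive_score, 2)
    if mode == "abstractive":
        return 0.75
    if mode == "extractive-fallback":
        return 0.60
    return 0.0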

In [24]:
# QA Models
extractive_qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)

abstractive_llm = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device=0
)

ABSTRACTIVE_HINTS = [
    "why", "how", "role", "impact", "benefits",
    "future", "expected", "significance",
    "types", "how are", "what could"
]

def needs_abstraction(question):
    q = question.lower()
    return any(h in q for h in ABSTRACTIVE_HINTS)


Device set to use cuda:0
Device set to use cuda:0


In [25]:
# Context Builder
def build_context(passage, all_passages, max_words=300):
    context = [passage]
    for p in all_passages:
        if p != passage:
            context.append(p)
        if sum(len(c.split()) for c in context) >= max_words:
            break

    return truncate_text(clean_context(" ".join(context)), max_words)


def abstractive_answer(question, context):
    prompt = f"""
Answer the question using ONLY the context.
Write a complete, clear sentence.
Avoid names, citations, numbers, and lists.

Context:
{context}

Question:
{question}

Answer:
"""

    output = abstractive_llm(
        prompt,
        max_new_tokens=120,
        repetition_penalty=1.3,
        do_sample=False
    )[0]["generated_text"]

    output = clean_repetition(output.strip())
    return normalize_sentence(output)
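
`build_context` above adds the other passages in list order until the word budget is reached. A retrieval-style alternative would rank passages by relevance to the question first. The sketch below uses simple word overlap as the relevance score; it is an assumption for illustration, not the notebook's method, and `rank_by_overlap` is a hypothetical name.

```python
import re
from typing import List


def rank_by_overlap(question: str, passages: List[str], top_k: int = 3) -> List[str]:
    """Rank passages by how many distinct question words they contain."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(passage: str) -> int:
        return len(q_words & set(re.findall(r"\w+", passage.lower())))

    # Stable sort: equally scored passages keep their original order.
    return sorted(passages, key=overlap, reverse=True)[:top_k]
```

The ranked list could then be passed to `build_context` in place of `all_passages`, so that the most relevant passages fill the word budget first.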


In [28]:
def answer_question(question, passage, all_passages):
    passage = truncate_text(passage)

    #  Extractive
    ext = extractive_qa(question=question, context=passage)
    cleaned_ext = clean_extractive_answer(ext["answer"])

    if ext["score"] >= 0.35 and not is_trivial_answer(cleaned_ext):
        final = normalize_sentence(cleaned_ext)
        return final, "extractive", confidence_score("extractive", ext["score"])

    #  Abstractive
    context = build_context(passage, all_passages)
    abs_ans = abstractive_answer(question, context)

    if not is_trivial_answer(abs_ans):
        return abs_ans, "abstractive", confidence_score("abstractive")

    #  Safe fallback (still meaningful)
    fallback = normalize_sentence(truncate_text(passage, 35))
    return fallback, "extractive-fallback", confidence_score("extractive-fallback")

In [27]:
important_passages = sorted(
    passages,
    key=lambda x: len(x.split()),
    reverse=True
)[:10]

seen_questions = set()

for idx, passage in enumerate(important_passages):
    print(f"\n================ Passage {idx+1} ================\n")

    questions = generate_questions(passage)

    questions = [q for q in questions if q.lower() not in seen_questions]
    for q in questions:
        seen_questions.add(q.lower())

    if not questions:
        continue

    print("Generated Questions:")
    for q in questions:
        print("-", q)

    print("\nAnswers:\n")

    for q in questions:
        ans, mode, conf = answer_question(q, passage, important_passages)

        print(f"Q: {q}")
        print(f"A: {ans}")
        print(f"Mode: {mode}")
        print(f"Confidence: {conf}")
        print("-" * 50)




Generated Questions:
- What is the focus of the Cloud Learning Services Market Pulse?
- What is the name of the market Pulse that is fielded Sept.-Nov 2024?
- What is the name of the country in the U.S. in 2024?

Answers:

Q: What is the focus of the Cloud Learning Services Market Pulse?
A: Cloud Learning Services Market Pulse will be the year when every employee can go from guessing to knowing—but only if their organizations invest in the skills to make it possible.” Andrew Milo Global Director, Customer Training, Cloud Learning Services, Google Cloud 7 Forbes, AI Puts The Squeeze On The Shrinking Half-life Of Skills, 391 2 3 4 5 Agents for scale What executives are saying:8 of decision-makers agree that 82% technical learning resources help their organization stay ahead in AI of organizations surveyed realize 71% an increase in revenue since engaging.
Mode: abstractive
Confidence: 0.75
--------------------------------------------------
Q: What is the name of the market Pulse that i