# Document Analysis using LLMs with Python

This notebook demonstrates how to analyze documents using Large Language Models (LLMs). We'll extract text from a PDF document, summarize it, generate questions from the content, and answer those questions using pre-trained models.

## Overview

Document analysis is the process of extracting and interpreting the information contained in a document. Pre-trained language models in the GPT, BERT, and T5 families can capture context, generate summaries, answer questions, and surface key points with relatively little code.

## Step 1: Install Required Libraries

First, let's install all the necessary libraries for our document analysis task.

In [None]:
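# Install required packages
# sentencepiece is added on the assumption that the T5 tokenizers used below need it to load
!pip install pdfplumber transformers nltk pandas torch sentencepiece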

## Step 2: Import Libraries

Now, let's import all the necessary libraries for our document analysis.

In [None]:
import pdfplumber
import pandas as pd
import nltk
from transformers import pipeline
import torch
import os

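# Download the NLTK sentence tokenizer data
# Recent NLTK releases look for 'punkt_tab'; older releases use 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')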

## Step 3: Extract Text from PDF

For this analysis, we'll use Google's Terms of Service document. First, we need to extract the text from the PDF file.

In [None]:
# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None or "" for pages without a text layer
            text += (page.extract_text() or "") + "\n"
    return text

# Path to your PDF file
# Note: You need to download the Google Terms of Service PDF
# You can replace this with the path to your downloaded PDF
pdf_path = "google_terms_of_service.pdf"

# Check if file exists
if os.path.exists(pdf_path):
    # Extract text from PDF
    extracted_text = extract_text_from_pdf(pdf_path)
    
    # Save extracted text to a file
    with open("extracted_text.txt", "w", encoding="utf-8") as f:
        f.write(extracted_text)
    
    print("Text extracted and saved to extracted_text.txt")
else:
    print(f"File {pdf_path} not found. Please download the Google Terms of Service PDF.")
    # For demonstration purposes, we'll create a sample text
    extracted_text = """GOOGLE TERMS OF SERVICE
Effective May 22, 2024 | Archived versions
What's covered in these terms
We know it's tempting to skip these Terms of
Service, but it's important to establish what you
can expect from us as you use Google services,
and what we expect from you.
These Terms of Service reflect the way Google's business works, the laws that apply to
our company, and certain things we've always believed to be true. As a result, these Terms
of Service help define Google's relationship with you as you interact with our services. For
example, these terms include the following topic headings:
What you can expect from us, which describes how we provide and develop our
services
What we expect from you, which establishes certain rules for using our services
Content in Google services, which describes the intellectual property rights to the
content you find in our services — whether that content belongs to you, Google, or
others
In case of problems or disagreements, which describes other legal rights you have,
and what to expect in case someone violates these terms
Understanding these terms is important because, by accessing or using our services,
you're agreeing to these terms."""
    print("Using sample text for demonstration purposes.")
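
Text extracted from a PDF is usually hard-wrapped at the page width, as the sample text above shows. The sketch below is one possible cleanup step, not part of the original pipeline: the hypothetical helper `normalize_pdf_text` joins wrapped lines into paragraphs and collapses repeated whitespace. It assumes blank lines mark paragraph breaks and does not repair words hyphenated across lines.

```python
import re

def normalize_pdf_text(raw_text: str) -> str:
    """Join hard-wrapped lines and collapse runs of whitespace."""
    # Treat one or more blank lines as a paragraph break
    paragraphs = re.split(r"\n\s*\n", raw_text)
    cleaned = []
    for paragraph in paragraphs:
        # Replace single newlines, tabs, and repeated spaces with one space
        joined = re.sub(r"\s+", " ", paragraph).strip()
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)

sample = "We know it's tempting to skip these Terms of\nService, but it's important.\n\nNext  paragraph."
print(normalize_pdf_text(sample))
```

Running `extracted_text = normalize_pdf_text(extracted_text)` before the later steps may give the sentence tokenizer cleaner input.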

## Step 4: Preview the Extracted Text

Let's preview the first 1,000 characters of the extracted text to confirm that it was captured correctly.

In [None]:
# Preview the first 1000 characters of the extracted text
print(extracted_text[:1000])

## Step 5: Summarize the Document

Now, let's use a pre-trained summarization model to get a high-level overview of the document.

In [None]:
# Initialize the summarization pipeline with the t5-small model
summarizer = pipeline("summarization", model="t5-small")

# Summarize the first 1000 characters of the document
# Note: t5-small works with inputs of up to 512 tokens, so we summarize only a portion;
# truncation=True guards against inputs that still exceed that limit
document_summary = summarizer(extracted_text[:1000], max_length=150, min_length=30, do_sample=False, truncation=True)

# Print the summary
print("Summary:", document_summary[0]['summary_text'])
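
The cell above summarizes only the first 1,000 characters. One way to cover the whole document is to summarize it window by window and collect the partial summaries. The sketch below illustrates that idea with a hypothetical helper, `summarize_long_text`, and a stand-in summarizer so that it runs without downloading a model. The fixed character windows are an assumption made for simplicity and can cut words in half; the sentence-based passages built in Step 6 are a better unit in practice.

```python
from typing import Callable, List

def summarize_long_text(text: str, summarize_fn: Callable[[str], str], window: int = 1000) -> List[str]:
    """Summarize consecutive fixed-size character windows of text."""
    windows = [text[i:i + window] for i in range(0, len(text), window)]
    # Skip windows that contain only whitespace
    return [summarize_fn(w) for w in windows if w.strip()]

# Stand-in summarizer, for illustration only: returns the first three words
def first_three_words(chunk: str) -> str:
    return " ".join(chunk.split()[:3])

partial_summaries = summarize_long_text("alpha beta gamma delta " * 4, first_three_words, window=46)
print(partial_summaries)
```

To use the notebook's model instead of the stand-in, pass a function such as `lambda chunk: summarizer(chunk, max_length=150, min_length=30, do_sample=False)[0]['summary_text']` as `summarize_fn`.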

## Step 6: Split the Document into Sentences and Passages

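For more detailed analysis, we split the document into sentences and then group them into passages of roughly 200 words, small enough for the question-generation and question-answering models to process.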

In [None]:
from nltk.tokenize import sent_tokenize

# Tokenize the document into sentences
sentences = sent_tokenize(extracted_text)

# Combine sentences into passages (chunks of text)
passages = []
current_passage = ""
word_limit = 200  # Limit each passage to approximately 200 words

for sentence in sentences:
    # Add the sentence to the current passage
    temp_passage = current_passage + " " + sentence if current_passage else sentence
    
    # Check if adding this sentence exceeds the word limit
    if len(temp_passage.split()) > word_limit and current_passage:
        # If it does, add the current passage to the list and start a new one
        passages.append(current_passage)
        current_passage = sentence
    else:
        # If not, update the current passage
        current_passage = temp_passage

# Add the last passage if it's not empty
if current_passage:
    passages.append(current_passage)

# Print the number of passages created
print(f"Document split into {len(passages)} passages.")

# Preview the first passage
print("\nPassage 1:\n", passages[0])
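
To see how the grouping rule behaves, it can help to run it on a toy input. The sketch below wraps the same logic in a hypothetical function, `group_sentences`, and applies it with a small word limit so the result is easy to check by hand. Note that a passage is closed only when adding the next sentence would exceed the limit, so a single long sentence can still produce a passage above the limit.

```python
from typing import List

def group_sentences(sentences: List[str], word_limit: int = 200) -> List[str]:
    """Group consecutive sentences into passages of at most word_limit words where possible."""
    passages, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate.split()) > word_limit and current:
            # Adding this sentence would exceed the limit: close the passage
            passages.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        passages.append(current)
    return passages

toy = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
print(group_sentences(toy, word_limit=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```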

## Step 7: Generate Questions from the Passages

Now, let's generate questions based on the document's content using a question generation model.

In [None]:
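# Initialize the question generation pipeline
# Note: valhalla/t5-base-qg-hl is an answer-aware model. Its model card describes inputs of the form
# "generate question: ... <hl> answer <hl> ...", so passages without highlights may yield weaker questions.
question_generator = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

def generate_questions_pipeline(text, min_questions=3):
    # Generate questions from the text
    # Beam search is needed because greedy decoding cannot return more than one sequence
    questions = question_generator(
        f"generate question: {text}",
        max_length=512,
        num_beams=max(4, min_questions),
        num_return_sequences=min_questions,
        truncation=True,
    )
    return [q['generated_text'] for q in questions]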

# Generate questions for each passage
all_questions = []
for i, passage in enumerate(passages[:5]):  # Limit to first 5 passages for demonstration
    print(f"Passage {i+1}:\n{passage[:500]}...\n")  # Print first 500 chars of passage
    
    # Generate at least 3 questions for each passage
    questions = generate_questions_pipeline(passage)
    
    # If we couldn't generate enough questions, try with smaller chunks
    if len(questions) < 3 and len(passage.split()) > 100:
        # Split the passage into two halves
        # (a separate name avoids overwriting the document-level `sentences` list)
        passage_sentences = sent_tokenize(passage)
        mid = len(passage_sentences) // 2
        chunk1 = " ".join(passage_sentences[:mid])
        chunk2 = " ".join(passage_sentences[mid:])
        
        # Generate questions from each chunk
        questions1 = generate_questions_pipeline(chunk1)
        questions2 = generate_questions_pipeline(chunk2)
        questions = questions1 + questions2
    
    print("Generated Questions:")
    for q in questions:
        print(f"- {q}")
        all_questions.append((q, passage))  # Store question with its context
    
    print("\n" + "-"*50 + "\n")
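
Returning several sequences per passage often produces questions that differ only in casing or punctuation. The sketch below shows one possible way to drop such near-duplicates before answering them. The helper name `dedupe_questions` and the normalization rule (lowercase, alphanumeric characters only) are assumptions chosen for illustration.

```python
import re
from typing import Iterable, List

def dedupe_questions(questions: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each question, ignoring case and punctuation."""
    seen = set()
    unique = []
    for question in questions:
        # Compare on a normalized key: lowercase, alphanumerics separated by single spaces
        key = re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(question.strip())
    return unique

candidates = ["What do these terms cover?", "what do these terms cover", "Who provides the services?"]
print(dedupe_questions(candidates))
# ['What do these terms cover?', 'Who provides the services?']
```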

## Step 8: Answer the Generated Questions

Finally, let's use a question-answering model to find answers to the generated questions within the text.

In [None]:
# Initialize the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Function to answer unique questions
def answer_unique_questions(questions_with_context):
    answered_questions = set()  # To track which questions we've already answered
    
    for question, context in questions_with_context:
        # Skip if we've already answered this question
        if question in answered_questions:
            continue
        
        # Add to the set of answered questions
        answered_questions.add(question)
        
        # Get the answer
        answer = qa_pipeline(question=question, context=context)
        
        # Print the question and answer
        print(f"Q: {question}")
        print(f"A: {answer['answer']}\n")

# Answer the generated questions
answer_unique_questions(all_questions)
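
The question-answering pipeline returns a dictionary that includes a `score` alongside the `answer`, which makes it possible to discard low-confidence answers. The sketch below shows one way to do that. The helper `answer_with_threshold`, the stand-in `fake_qa` function, and the 0.3 cutoff are all assumptions for illustration; a suitable threshold depends on the model and the documents.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def answer_with_threshold(
    pairs: Iterable[Tuple[str, str]],
    qa_fn: Callable[..., Dict],
    min_score: float = 0.3,
) -> List[Dict]:
    """Answer each (question, context) pair and keep answers scoring at least min_score."""
    results = []
    for question, context in pairs:
        result = qa_fn(question=question, context=context)
        if result["score"] >= min_score:
            results.append({"question": question, "answer": result["answer"], "score": result["score"]})
    return results

# Stand-in for qa_pipeline, for illustration only: "finds" the word Google if present
def fake_qa(question: str, context: str) -> Dict:
    found = "Google" in context
    return {"answer": "Google" if found else "", "score": 0.9 if found else 0.05, "start": 0, "end": 0}

pairs = [("Who provides the services?", "Google provides the services."),
         ("Who wrote this?", "No author is named.")]
print(answer_with_threshold(pairs, fake_qa))
```

With the notebook's objects, the call would be `answer_with_threshold(all_questions, qa_pipeline)`, and the resulting list of dictionaries can be passed to `pd.DataFrame` for inspection.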

## Conclusion

In this notebook, we've demonstrated how to use Large Language Models (LLMs) for document analysis. We've:

1. Extracted text from a PDF document
2. Summarized the document using a pre-trained model
3. Split the document into manageable passages
4. Generated questions from the passages
5. Answered those questions using a question-answering model

This approach can be applied to many types of documents for information extraction, summarization, and question-answering tasks.