## Multi-hop Question Answering Generation

This notebook generates multi-hop question-answer (QA) pairs from financial documents (PDFs) using a large language model. It extracts text from uploaded PDFs, formulates questions that require combining information from different parts of one or more documents, and exports the results for analysis.

The process involves:
1. Setting up the environment and API access.
2. Uploading and processing PDF documents.
3. Generating multi-hop QA pairs using a language model.
4. Exporting the generated QA data to JSON, CSV, and Excel formats.

This notebook is built to run on Google Colab. To run it in another environment, you will need to replace the Colab-specific pieces (Drive mounting, authentication, secrets, and the file upload widget).
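
As one example of such a change, the sketch below reads the API key from Colab secrets when they are available and otherwise falls back to an environment variable. The helper name `load_api_key` is hypothetical and not part of the notebook, and the environment-variable fallback is an assumption about how you might store the key outside Colab.

```python
import os


def load_api_key(name: str = "CLAUDE_API_KEY") -> str:
    """Return the API key from Colab secrets if possible, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Outside Colab (or if the secret is not set), fall back to an
        # environment variable with the same name (an assumption).
        key = os.environ.get(name)
    if not key:
        raise ValueError(f"Missing {name}")
    return key
```

Outside Colab you would set the variable before starting Jupyter, then call `load_api_key()` in place of `userdata.get('CLAUDE_API_KEY')`.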

I have included a set of sample PDFs along with generated questions and evaluations in the data folder.

## Set up Claude

In [None]:
!pip install pdfplumber anthropic google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client requests ipywidgets

In [None]:
from google.colab import auth, userdata, drive, files
from google.auth import default
from googleapiclient.discovery import build
import requests
import time
import json
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import pandas as pd
import os
import anthropic
import pdfplumber
from typing import List, Dict
import csv
from datetime import datetime
import tempfile
from io import StringIO, BytesIO
from openpyxl import Workbook, load_workbook

# Mount Google Drive
drive.mount('/content/drive')

# Authenticate for Google APIs
auth.authenticate_user()
creds, _ = default()

# Access secrets
try:
    CLAUDE_API_KEY = userdata.get('CLAUDE_API_KEY')

    if not CLAUDE_API_KEY:
        raise ValueError("Missing CLAUDE_API_KEY. Please add it in Tools > Settings > Secrets")

except Exception as e:
    print(f"⚠️ Error with API configuration: {e}")

# Initialize global variables
generated_questions = []
uploaded_files = []

## Upload PDFs and Extract Text

This section allows you to upload the PDF documents you want to use for generating multi-hop questions. The code will process these PDFs, extract their text content, and prepare them for the question generation process. You can upload multiple files.
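
The staging logic in the cell below can also be exercised without the Colab upload widget. The following is a minimal sketch, assuming the uploads arrive as a mapping from filename to raw bytes (the same shape the cell iterates over). The function name `stage_pdfs` is hypothetical.

```python
import os
import tempfile
from typing import Dict, List


def stage_pdfs(uploaded: Dict[str, bytes]) -> List[dict]:
    """Write each uploaded PDF to a temporary file and return its metadata."""
    staged = []
    for filename, content in uploaded.items():
        # Skip anything that does not look like a PDF by extension
        if not filename.lower().endswith(".pdf"):
            continue
        # delete=False keeps the file on disk after the handle is closed
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            tmp.write(content)
        staged.append({
            "name": filename,
            "path": tmp.name,
            "size": os.path.getsize(tmp.name),
        })
    return staged
```

In a local environment, the mapping could be built by reading files from disk, for example `{p: open(p, "rb").read() for p in paths}`.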



In [None]:
# PDF upload and management - this will show the upload widget immediately
print("📂 Upload PDF files below:")
from google.colab import files
uploaded = files.upload()

# Process uploaded files
uploaded_files = []
for filename, content in uploaded.items():
    if not filename.lower().endswith('.pdf'):
        print(f"⚠️ Ignoring non-PDF file: {filename}")
        continue

    # Create a temporary file
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf')
    temp_file.write(content)
    temp_file.close()

    # Store file information
    file_info = {
        'name': filename,
        'path': temp_file.name,
        'size': len(content)
    }

    uploaded_files.append(file_info)
    print(f"✅ Uploaded: {filename} ({len(content)/1024:.1f} KB)")

if len(uploaded_files) < 2:
    print("ℹ️ Note: For best results, upload at least 2 PDF files for multi-hop questions.")
else:
    print(f"✅ Total files uploaded: {len(uploaded_files)}")

# Helper function to list uploaded PDFs
def list_uploaded_pdfs():
    """List all currently uploaded PDFs"""
    global uploaded_files

    if not uploaded_files:
        print("No PDFs have been uploaded yet.")
        return

    print(f"📑 Currently uploaded PDFs ({len(uploaded_files)} total):")
    for i, file in enumerate(uploaded_files, 1):
        print(f"  {i}. {file['name']} ({file['size']/1024:.1f} KB)")

list_uploaded_pdfs()

## Multi-hop QA Generation Functions

This section contains the core logic for generating multi-hop question-answering pairs.

It includes functions for:
- Extracting text from individual PDF files.
- Loading and processing multiple PDFs from a specified folder.
- Generating the prompt template for the language model, considering document preference and previously generated questions.
- Interacting with the language model (Anthropic Claude) to generate a single multi-hop QA pair.
- Orchestrating the generation of multiple QA pairs, handling progress saving and API calls.
- Saving the generated QA pairs to a JSON file for later use.
- Exporting the generated QA pairs to a CSV file.

The main function `main_generate_qa` ties these steps together.
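
One fragile point in this pipeline is pulling the JSON object out of the model's reply. The cell below slices from the first `{` to the last `}`, which can fail if the reply contains extra braces after the object. A possible alternative, sketched here under the assumption that the reply contains exactly one well-formed object, uses the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first parseable JSON object embedded in free-form text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    raise ValueError("No JSON object found in response")
```

For example, a reply such as `'Result:\n{"question": "Q?"}\nThanks {team}'` parses cleanly with this approach, while first-to-last brace slicing would include the trailing `{team}` and fail.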

In [None]:
# Simplified Multi-hop QA Generator
# Focus on generating JSON data, export to spreadsheet at the end

import json
import anthropic
import pdfplumber
import tempfile
import os
import glob
from datetime import datetime
from typing import List, Dict
import time

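# Note: generated_questions and uploaded_files are initialized in the setup and
# upload cells. They are not reset here, so re-running this cell does not
# discard the list of uploaded files.

# The Anthropic client is created inside main_generate_qa from the API key, e.g.:
# CLAUDE_API_KEY = "your_api_key_here"
# client = anthropic.Anthropic(api_key=CLAUDE_API_KEY)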

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text content from a PDF file using pdfplumber."""
    text = ''
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                page_text = page.extract_text()
                if page_text:
                    text += f"Page {page_num}: {page_text}\n"
    except Exception as e:
        print(f"⚠️ Error extracting text from PDF: {e}")
        raise

    if len(text) < 100:
        print("⚠️ Extracted text is too short. Please check the PDF file.")
        raise ValueError("Extracted text is too short")

    return text

def load_pdfs_from_folder(folder_path: str, max_files: int = None) -> List[Dict]:
    """Load all PDFs from a folder and extract their text."""
    # Sort so that the max_files truncation below is deterministic
    pdf_files = sorted(glob.glob(os.path.join(folder_path, "*.pdf")))

    if max_files:
        pdf_files = pdf_files[:max_files]

    print(f"Found {len(pdf_files)} PDF files in {folder_path}")

    documents = []
    for pdf_path in pdf_files:
        filename = os.path.basename(pdf_path)
        file_size = os.path.getsize(pdf_path)

        print(f"Processing {filename} ({file_size/1024:.1f} KB)...")

        try:
            text = extract_text_from_pdf(pdf_path)
            documents.append({
                'name': filename,
                'path': pdf_path,
                'size': file_size,
                'text': text
            })
            print(f"✅ Successfully processed {filename}")
        except Exception as e:
            print(f"⚠️ Error processing {filename}: {e}")
            continue

    return documents

def get_prompt_template(doc_preference: float, previous_questions: List[str] = None) -> str:
    """Generate prompt template based on document preference."""

    if previous_questions is None:
        previous_questions = []

    base_instruction = """
    Generate 1 practical financial analysis question that requires two distinct information retrievals in sequence.
    Focus on questions that a financial analyst or executive would realistically ask when analyzing company performance.
    The first retrieval MUST provide specific information needed to know what to look for in the second retrieval.

    REQUIREMENTS:
    - Questions should explicitly mention the company name and time period
    - Questions should follow natural business logic and analysis patterns
    - Use clear, objective metrics rather than vague references
    - Focus on meaningful business insights that require multi-step reasoning

    EXCELLENT EXAMPLES of realistic two-step retrievals:
    1. "For [Company]'s Q3 FY2024, what was the year-over-year revenue change in the segment that management identified as the biggest driver of growth, and what risks did management highlight for this segment?"

    2. "In [Company]'s Q3 FY2024, for the product line with the highest profit margin, what were the key investments made this quarter and associated risk factors?"

    3. "Among [Company]'s regions with double-digit revenue growth in Q3 2024, which one had the highest customer acquisition cost, and what operational risks were identified?"

    TOPIC DIVERSITY:
    - Each question should explore a different aspect of the company's financial story
    - Avoid repeating topics from these previously asked questions: {previous_questions}
    - Consider various aspects like segments, regions, strategic initiatives, risk factors, capital allocation, etc.
    """

    if doc_preference < 0.3:
        doc_guidance = """
        Focus on finding high-quality multi-hop questions within individual documents.
        """
    elif doc_preference < 0.7:
        doc_guidance = """
        Look for high-quality multi-hop questions either within individual documents or across documents.
        """
    else:
        doc_guidance = """
        Prioritize questions that connect information across multiple documents.
        """

    # Format previous questions
    prev_questions = "\n".join([f"- {q}" for q in previous_questions[-5:]])
    formatted_base = base_instruction.format(
        previous_questions=prev_questions if previous_questions else "None yet"
    )

    return f"""{formatted_base}\n\n{doc_guidance}\n\n
    Return ONLY a JSON object in this exact format:
    {{
        "question": "question text",
        "answer": "detailed answer that naturally references source documents and includes relevant numerical data and risk factors",
        "steps": [
            {{
                "step": 1,
                "description": "what needs to be found first",
                "document": "filename of the document used",
                "evidence": "EXACT 10-15 word snippet from the document that contains the key information - copy the text verbatim"
            }},
            {{
                "step": 2,
                "description": "what needs to be found second",
                "depends_on": "specific output from step 1 that was needed",
                "document": "filename of the document used",
                "evidence": "EXACT 10-15 word snippet from the document that contains the key information - copy the text verbatim"
            }}
        ],
        "multi_hop_reasoning": "explanation of why finding the information in step 1 was necessary to know what to look for in step 2"
    }}

    CRITICAL: For the "evidence" field in each step, you MUST copy an exact 10-15 word snippet directly from the source document text. This snippet should contain the specific data point or information mentioned in that step. Do not paraphrase or summarize - copy the exact words as they appear in the document."""

def generate_qa_pair(documents: List[Dict], doc_preference: float = 0.7,
                    previous_questions: List[str] = None, client=None) -> Dict:
    """Generate a single QA pair from the documents."""

    if client is None:
        raise ValueError("Anthropic client not provided")

    # Prepare document content for the prompt (limit size for API)
    docs_content = "\n\n".join([
        f"Document: {doc['name']}\n{doc['text'][:50000]}"
        for doc in documents
    ])

    # Get prompt template
    prompt_template = get_prompt_template(doc_preference, previous_questions)

    try:
        response = client.messages.create(
            model="claude-sonnet-4-0",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": f"Available documents:\n{docs_content}\n\n{prompt_template}"
            }]
        )

        response_text = response.content[0].text.strip()

        # Extract and parse JSON
        json_start = response_text.find('{')
        json_end = response_text.rfind('}') + 1

        if json_start == -1 or json_end == 0:
            raise ValueError("No JSON object found in response")

        json_str = response_text[json_start:json_end]
        qa_pair = json.loads(json_str)

        # Add metadata
        qa_pair['generated_at'] = datetime.now().isoformat()
        qa_pair['doc_preference'] = doc_preference
        qa_pair['source_documents'] = [doc['name'] for doc in documents]

        return qa_pair

    except Exception as e:
        print(f"⚠️ Error generating QA pair: {e}")
        raise

def generate_multiple_qa_pairs(documents: List[Dict], num_questions: int = 5,
                              doc_preference: float = 0.7, client=None) -> List[Dict]:
    """Generate multiple QA pairs from documents."""

    print(f"🚀 Generating {num_questions} QA pairs...")
    print(f"📄 Using {len(documents)} documents: {[doc['name'] for doc in documents]}")
    print(f"🔧 Document preference: {doc_preference} (higher = more cross-document connections)")

    qa_pairs = []
    generated_questions = []

    for i in range(num_questions):
        print(f"\n⏳ Generating question {i+1}/{num_questions}...")

        try:
            qa_pair = generate_qa_pair(
                documents=documents,
                doc_preference=doc_preference,
                previous_questions=generated_questions,
                client=client
            )

            qa_pairs.append(qa_pair)
            generated_questions.append(qa_pair['question'])

            print(f"✅ Generated: {qa_pair['question'][:100]}...")

            # Save progress after each question
            with open('qa_backup.json', 'w') as f:
                json.dump(qa_pairs, f, indent=2)

            # Brief pause between requests
            time.sleep(1)

        except Exception as e:
            print(f"⚠️ Failed to generate question {i+1}: {e}")
            continue

    print(f"\n🎉 Successfully generated {len(qa_pairs)} QA pairs!")
    return qa_pairs

def save_qa_pairs_to_json(qa_pairs: List[Dict], output_path: str = None) -> str:
    """Save QA pairs to a JSON file."""

    if output_path is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_path = f"qa_pairs_{timestamp}.json"

    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(qa_pairs, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved {len(qa_pairs)} QA pairs to {output_path}")
    return output_path

def export_to_csv(qa_pairs: List[Dict], output_path: str = None) -> str:
    """Export QA pairs to CSV format for easy spreadsheet import."""
    import csv

    if output_path is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_path = f"qa_pairs_{timestamp}.csv"

    with open(output_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)

        # Write header
        writer.writerow([
            'Question', 'Answer', 'Multi-hop Reasoning', 'Evidence Steps (JSON)',
            'Generated At', 'Source Documents'
        ])

        # Write data
        for qa in qa_pairs:
            writer.writerow([
                qa.get('question', ''),
                qa.get('answer', ''),
                qa.get('multi_hop_reasoning', ''),
                json.dumps(qa.get('steps', [])),
                qa.get('generated_at', ''),
                ', '.join(qa.get('source_documents', []))
            ])

    print(f"📊 Exported {len(qa_pairs)} QA pairs to {output_path}")
    return output_path

def main_generate_qa(folder_path: str, num_questions: int = 5,
                    doc_preference: float = 0.7, max_files: int = None,
                    claude_api_key: str = None):
    """Main function to generate QA pairs from a folder of PDFs."""

    # Initialize client
    if claude_api_key is None:
        raise ValueError("Claude API key is required")
    client = anthropic.Anthropic(api_key=claude_api_key)

    # Load documents
    print("📂 Loading PDFs...")
    documents = load_pdfs_from_folder(folder_path, max_files=max_files)

    if not documents:
        print("❌ No documents loaded successfully")
        # Return a 3-tuple so callers that unpack the result do not fail
        return None, None, None

    # Generate QA pairs
    qa_pairs = generate_multiple_qa_pairs(
        documents=documents,
        num_questions=num_questions,
        doc_preference=doc_preference,
        client=client
    )

    if not qa_pairs:
        print("❌ No QA pairs generated successfully")
        # Return a 3-tuple so callers that unpack the result do not fail
        return None, None, None

    # Save results
    json_path = save_qa_pairs_to_json(qa_pairs)
    csv_path = export_to_csv(qa_pairs)

    print(f"\n✅ Complete! Generated {len(qa_pairs)} QA pairs")
    print(f"📄 JSON output: {json_path}")
    print(f"📊 CSV output: {csv_path}")

    return qa_pairs, json_path, csv_path

## Run QA Generation

Run this cell to generate multi-hop questions from the PDFs uploaded earlier, using the parameters set below: `num_questions`, `doc_preference`, and `max_files`. Run time depends on the number and size of the documents and on the number of questions requested.
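
Because `generate_multiple_qa_pairs` rewrites `qa_backup.json` after every question, an interrupted run does not have to lose its results. The sketch below shows one way to read that backup. The helper name `load_backup` is hypothetical, and the notebook itself does not resume from the backup automatically.

```python
import json
import os
from typing import Dict, List


def load_backup(path: str = "qa_backup.json") -> List[Dict]:
    """Return the QA pairs saved so far, or an empty list if none are usable."""
    if not os.path.exists(path):
        return []
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError:
        # A run interrupted mid-write can leave a truncated file
        return []
    return data if isinstance(data, list) else []
```

After an interruption, `len(load_backup())` tells you how many questions were completed, so you can request only the remainder on the next run.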


In [None]:

CLAUDE_API_KEY = userdata.get('CLAUDE_API_KEY')
qa_pairs, json_file, csv_file = main_generate_qa(
    folder_path="/content/",
    num_questions=100,
    doc_preference=0.5,  # 0.0 = single doc focus, 1.0 = cross-doc focus
    max_files=7,         # Limit number of PDFs to process
    claude_api_key=CLAUDE_API_KEY
)

## Export the JSON to a Spreadsheet
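
Before exporting, it may be worth checking that each step's `evidence` snippet really appears in the named document, since the prompt asks the model to copy it verbatim. The following is a minimal sketch, assuming `documents` is the list returned by `load_pdfs_from_folder` (with `name` and `text` keys). The function name `check_evidence` and the whitespace and case normalization are assumptions, not part of the notebook.

```python
import re
from typing import Dict, List


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so PDF line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_evidence(qa_pair: Dict, documents: List[Dict]) -> List[Dict]:
    """Report, per step, whether the evidence snippet occurs in its source document."""
    texts = {doc["name"]: _normalize(doc["text"]) for doc in documents}
    results = []
    for step in qa_pair.get("steps", []):
        snippet = _normalize(step.get("evidence", ""))
        doc_text = texts.get(step.get("document", ""), "")
        results.append({
            "step": step.get("step"),
            "found": bool(snippet) and snippet in doc_text,
        })
    return results
```

Steps reported with `found` set to `False` are candidates for manual review. A miss can also come from text extraction differences rather than from a fabricated snippet.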

In [None]:
# JSON to Spreadsheet Exporter
# Handles exporting the generated QA pairs to Excel

import json
import pandas as pd
from google.oauth2.credentials import Credentials
from google.auth import default
from googleapiclient.discovery import build
import openpyxl
from datetime import datetime

def load_qa_pairs_from_json(json_path: str) -> list:
    """Load QA pairs from JSON file."""
    with open(json_path, 'r', encoding='utf-8') as f:
        qa_pairs = json.load(f)
    print(f"📖 Loaded {len(qa_pairs)} QA pairs from {json_path}")
    return qa_pairs

def qa_pairs_to_dataframe(qa_pairs: list, multi_evidence_rows: bool = False) -> pd.DataFrame:
    """Convert QA pairs to a pandas DataFrame."""

    if multi_evidence_rows:
        # Each evidence step gets its own row
        rows = []
        for qa_id, qa in enumerate(qa_pairs, 1):
            steps = qa.get('steps', [])
            for step_idx, step in enumerate(steps):
                row = {
                    'QA_ID': qa_id,
                    'Question': qa['question'] if step_idx == 0 else '',  # Only on first row
                    'Answer': qa['answer'] if step_idx == 0 else '',
                    'Multi_hop_Reasoning': qa.get('multi_hop_reasoning', '') if step_idx == 0 else '',
                    'Step_Number': step.get('step', step_idx + 1),
                    'Step_Description': step.get('description', ''),
                    'Source_Document': step.get('document', ''),
                    'Evidence': step.get('evidence', ''),
                    'Depends_On': step.get('depends_on', ''),
                    'Generated_At': qa.get('generated_at', ''),
                    'Source_Documents': ', '.join(qa.get('source_documents', []))
                }
                rows.append(row)

    else:
        # Single row per QA pair (evidence as JSON)
        rows = []
        for qa_id, qa in enumerate(qa_pairs, 1):
            row = {
                'QA_ID': qa_id,
                'Question': qa['question'],
                'Answer': qa['answer'],
                'Multi_hop_Reasoning': qa.get('multi_hop_reasoning', ''),
                'Evidence_Steps': json.dumps(qa.get('steps', []), indent=2),
                'Generated_At': qa.get('generated_at', ''),
                'Source_Documents': ', '.join(qa.get('source_documents', []))
            }
            rows.append(row)

    return pd.DataFrame(rows)

def export_to_excel(qa_pairs: list, output_path: str = None,
                   multi_evidence_rows: bool = False) -> str:
    """Export QA pairs to Excel file."""

    if output_path is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        layout = "multi_row" if multi_evidence_rows else "single_row"
        output_path = f"qa_pairs_{layout}_{timestamp}.xlsx"

    # Convert to DataFrame
    df = qa_pairs_to_dataframe(qa_pairs, multi_evidence_rows)

    # Write to Excel with formatting
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        df.to_excel(writer, sheet_name='QA_Pairs', index=False)

        # Get the workbook and worksheet
        workbook = writer.book
        worksheet = writer.sheets['QA_Pairs']

        # Auto-adjust column widths
        for column in worksheet.columns:
            max_length = 0
            column_letter = column[0].column_letter

            for cell in column:
                # Treat empty cells as zero-width instead of measuring the string "None"
                cell_length = len(str(cell.value)) if cell.value is not None else 0
                max_length = max(max_length, cell_length)

            adjusted_width = min(max_length + 2, 50)  # Cap at 50 characters
            worksheet.column_dimensions[column_letter].width = adjusted_width

        # Style the header row
        header_font = openpyxl.styles.Font(bold=True)
        header_fill = openpyxl.styles.PatternFill(start_color="CCCCCC", end_color="CCCCCC", fill_type="solid")

        for cell in worksheet[1]:
            cell.font = header_font
            cell.fill = header_fill

    print(f"📊 Exported to Excel: {output_path}")
    return output_path

In [None]:
%ls -al

In [None]:
# Load the generated QA pairs from the JSON file.
# Replace "qa_pairs_20250528_211321.json" with the actual filename if it differs.
qa_pairs = load_qa_pairs_from_json("qa_pairs_20250528_211321.json")

# Export the loaded QA pairs to an Excel file using the multi-evidence row format.
export_to_excel(qa_pairs, multi_evidence_rows=True)
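
The cell above hardcodes a timestamped filename. As an alternative, the sketch below picks the most recent `qa_pairs_*.json` file automatically. It relies on the `%Y%m%d_%H%M%S` timestamp in the names written by `save_qa_pairs_to_json`, which sorts chronologically as plain text. The helper name `latest_qa_json` is hypothetical.

```python
import glob
import os
from typing import Optional


def latest_qa_json(folder: str = ".", pattern: str = "qa_pairs_*.json") -> Optional[str]:
    """Return the path of the newest timestamped QA file, or None if there is none."""
    candidates = glob.glob(os.path.join(folder, pattern))
    if not candidates:
        return None
    # Timestamps like 20250528_211321 sort chronologically as strings
    return max(candidates, key=os.path.basename)
```

With this helper, the load step could become `load_qa_pairs_from_json(latest_qa_json())`, after checking that the result is not `None`.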