**HECTOR** Playing around with **DSPy**


**What is DSPy?**


DSPy is a framework that makes working with language models (such as GPT-4) easier and more modular. Instead of hand-writing and tuning a complex prompt for each task, you describe the task in Python code. DSPy is declarative: you state what you want (the inputs you will provide and the outputs you expect), and the framework builds the prompt and handles the call to the model.


**Install Required Libraries**

In [1]:
!pip install openai PyPDF2 dspy


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting dspy
  Downloading dspy-2.6.11-py3-none-any.whl.metadata (7.3 kB)
Collecting backoff (from dspy)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting ujson (from dspy)
  Downloading ujson-5.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Collecting datasets<3.0.0,>=2.14.6 (from dspy)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting optuna (from dspy)
  Downloading optuna-4.2.1-py3-none-any.whl.metadata (17 kB)
Collecting magicattr~=0.1.6 (from dspy)
  Downloading magicattr-0.1.6-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting litellm<2.0.0,>=1.59.8 (from dspy)
  Downloading litellm-1.63.7-py3-none-any.whl.metadata (36 kB)
Collecting diskcache (from dspy)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting json-repair (from dspy)
  Downloading json_repair-0.39.1-py3-none-any.whl.metadata (11 kB

**Upload Your PDF File**

In [2]:
from google.colab import files

# Upload your proposal PDF
uploaded = files.upload()

# Get the filename (assumes a single file is uploaded)
pdf_filename = list(uploaded.keys())[0]
print("Uploaded file:", pdf_filename)


Saving sample-research-proposal.pdf to sample-research-proposal.pdf
Uploaded file: sample-research-proposal.pdf


**Extract Text from the PDF**

In [3]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF file using PyPDF2."""
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        text = ""
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Extract text from the uploaded PDF
pdf_text = extract_text_from_pdf(pdf_filename)
print("Extracted text length:", len(pdf_text))


Extracted text length: 22449
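
Text extracted from PDFs often contains words hyphenated across line breaks and irregular whitespace. The helper below is a minimal cleanup sketch, not part of the original notebook; the name `normalize_pdf_text` is hypothetical, and the hyphen rule is a heuristic that can also merge genuine compounds (for example "post-" at a line end followed by "divorce").

```python
import re

def normalize_pdf_text(text):
    """Tidy raw PDF text (heuristic sketch, assumptions noted in comments)."""
    # Rejoin words split by a hyphen at a line break: "implemen-\ntation" -> "implementation".
    # Assumption: a hyphen directly before a newline marks a broken word.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space (newlines are kept).
    text = re.sub(r"[ \t]+", " ", text)
    # Remove stray spaces around newlines.
    text = re.sub(r" ?\n ?", "\n", text)
    # Limit consecutive blank lines to one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "The implemen-\ntation   is\tsimple.\n\n\n\nNext  page"
print(normalize_pdf_text(sample))
```

If this step is useful for your document, it could be applied as `pdf_text = normalize_pdf_text(pdf_text)` before the text is processed further.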


**Configure DSPy and OpenAI API**

In [4]:
import openai
import dspy

# Set your OpenAI API key here
openai.api_key = "YOUR_OPENAI_API_KEY"

# Configure DSPy with your language model (e.g., GPT-4)
lm = dspy.LM("openai/gpt-4", api_key=openai.api_key)
dspy.configure(lm=lm)
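
Hardcoding an API key in a notebook makes it easy to leak when the notebook is shared. One alternative is to read the key from an environment variable. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of DSPy or the OpenAI client.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the environment variable `env_var`.

    Raises RuntimeError if the variable is unset or empty, so a missing key
    is reported here rather than on the first model call.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key
```

With this in place, the configuration cell could pass `api_key=load_api_key()` to `dspy.LM` instead of a literal string.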


**Define the Summarization Signature and Module**

In [5]:
# Define a custom signature for summarization
class SummarizeSignature(dspy.Signature):
    """
    Given a passage (e.g. a proposal document), generate a concise summary.
    """
    passage: str = dspy.InputField(desc="The document text to be summarized.")
    summary: str = dspy.OutputField(desc="A clear and concise summary of the document.")

# Build a summarization module using DSPy's ChainOfThought
summarizer = dspy.ChainOfThought(SummarizeSignature)

# Define a custom prompt to guide the summarization for proposal PDFs
custom_prompt = (
    "You are an expert summarizer for business proposals. "
    "Please summarize the following document, highlighting the key objectives, "
    "methodology, deliverables, budget, and timelines in a clear and concise manner."
)



**Split PDF Text into Chunks**


In [6]:
def split_text(text, max_chunk_size=2000):
    """
    Split text into chunks of at most max_chunk_size characters,
    preserving whole words.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        # +1 accounts for space between words
        if current_length + len(word) + 1 <= max_chunk_size:
            current_chunk.append(word)
            current_length += len(word) + 1
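        else:
            # Flush the current chunk; skip it if empty (e.g. when the very
            # first word is longer than max_chunk_size)
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word) + 1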
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Split the extracted PDF text into chunks
chunks = split_text(pdf_text, max_chunk_size=2000)
print("Number of chunks:", len(chunks))


Number of chunks: 11
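
Splitting on hard boundaries can separate a sentence from the context it needs. A common variation is to repeat the last few words of each chunk at the start of the next one. The function below is a sketch of that idea, not the notebook's method; `split_text_with_overlap` and its `overlap_words` parameter are hypothetical names introduced here.

```python
def split_text_with_overlap(text, max_chunk_size=2000, overlap_words=20):
    """Split text into whole-word chunks of at most max_chunk_size characters.

    Every chunk after the first starts with up to `overlap_words` words
    repeated from the end of the previous chunk. A single word longer than
    max_chunk_size still becomes its own (oversized) chunk.
    """
    words = text.split()
    chunks = []
    current = []
    current_len = 0  # length of " ".join(current)
    for word in words:
        # A separating space is only needed when the chunk already has words.
        added = len(word) if not current else len(word) + 1
        if current and current_len + added > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry over the tail of the finished chunk as overlap.
            current = current[-overlap_words:] if overlap_words > 0 else []
            # Drop leading overlap words until the next word fits.
            while current and len(" ".join(current)) + 1 + len(word) > max_chunk_size:
                current.pop(0)
            current_len = len(" ".join(current))
            added = len(word) if not current else len(word) + 1
        current.append(word)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

print(split_text_with_overlap("a b c d e f g h", max_chunk_size=5, overlap_words=1))
# -> ['a b c', 'c d e', 'e f g', 'g h']
```

Overlap increases the number of chunks, and therefore the number of model calls, so a small value relative to `max_chunk_size` is usually a reasonable starting point.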



**Summarize Each Chunk and Combine Summaries**

In [7]:
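# Attach the custom instructions to the signature. `prompt` is not a declared
# input field of SummarizeSignature, so passing `prompt=custom_prompt` at call
# time would most likely never reach the model.
summarizer = dspy.ChainOfThought(SummarizeSignature.with_instructions(custom_prompt))

# Summarize each chunk using our summarization module
chunk_summaries = []
for i, chunk in enumerate(chunks):
    result = summarizer(passage=chunk)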
    summary_text = result.summary
    chunk_summaries.append(summary_text)
    print(f"Chunk {i+1} summary:")
    print(summary_text)
    print("-" * 50)

# Combine the chunk summaries into one text
combined_summary_text = "\n".join(chunk_summaries)



Chunk 1 summary:
The document introduces two research proposals from the Department of Social Policy and Criminology. The first proposal explores the experiences of fathers post-divorce or separation, focusing on their employment and caring responsibilities. It emphasizes the need for more research on the experiences of working-class fathers. The second proposal, not detailed in the passage, is about police governance. Both proposals serve as examples for postgraduate research.
--------------------------------------------------
Chunk 2 summary:
The proposed research aims to explore the identity and roles of fathers post-divorce or separation, focusing on their experiences, perceptions, and how they negotiate employment and caring responsibilities. The study contributes to the growing field of 'family practice' sociology and could challenge traditional gender roles in earning and caring. The research is politically and sociologically significant, as it may provide insights into the real