<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/dspy_rag_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSPy-based "naive" RAG

* Directly follows the RAG tutorial in DSPy
* Uses our news data as the document collection
* Uses OpenAI embeddings for retrieval

In [1]:
!pip3 install dspy

Collecting dspy
  Downloading dspy-2.6.8-py3-none-any.whl.metadata (7.3 kB)

In [3]:
import dspy
import json


# api_keys.py defines a GPT4o_API_KEY variable
from api_keys import GPT4o_API_KEY

lm = dspy.LM('openai/gpt-4o-mini', api_key=GPT4o_API_KEY)
dspy.configure(lm=lm)

In [4]:
!wget http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl

--2025-03-02 20:18:13--  http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3385882 (3.2M) [application/octet-stream]
Saving to: ‘news-en-2021.jsonl’


2025-03-02 20:18:14 (11.0 MB/s) - ‘news-en-2021.jsonl’ saved [3385882/3385882]



In [5]:
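# Load the news articles, one JSON object per line
news = []
with open("news-en-2021.jsonl", encoding="utf-8") as f:
    for line in f:
        news.append(json.loads(line))

# Wrap each article text as a DSPy example with a single input field
news_dspy = [dspy.Example(document=s["text"]).with_inputs("document") for s in news]
print(news_dspy[0])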

Example({'document': 'Finland\'s government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon. \n "There are still many open questions that need to be answered. At this point, it is impossible to promise that the pass will come or when it will come," Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting. \n "The government has given the green light to the Covid pass and preparations will continue," Marin added. \n Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that there was broad agreement between the coalition parties over the need for the certificate. \n "It [the pass] is an important tool so that we will not need restrictions any more," Lintilä said. \n The government also decided at Thursday afternoon\'s meeting to offer coronavirus vaccines to all 12- to 15-year-olds, starting as early as next week.

# Embed

* This embeds the documents for later retrieval
* To keep things lean and simple, we embed whole documents (no chunking, no sliding window, etc.)
* Only the first 4,000 characters of each document are embedded; the rest is ignored
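
For comparison, here is a minimal sketch of the truncation used in this notebook next to the sliding-window chunking it deliberately skips. The function names, the 500-character overlap, and the placeholder document are assumptions made for illustration; they do not come from DSPy or from the rest of this notebook.

```python
def truncate(text: str, max_chars: int = 4000) -> str:
    """What this notebook does: keep only the first max_chars characters."""
    return text[:max_chars]


def sliding_window_chunks(text: str, size: int = 4000, overlap: int = 500) -> list[str]:
    """Hypothetical alternative: overlapping fixed-size character windows."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


doc = "x" * 9000  # placeholder standing in for a long news article
print(len(truncate(doc)))                            # characters kept by truncation
print([len(c) for c in sliding_window_chunks(doc)])  # lengths of the overlapping chunks
```

With chunking, every part of a long article would be searchable, at the cost of more embedding calls and a corpus whose entries are fragments rather than whole documents.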

In [7]:
embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512, api_key=GPT4o_API_KEY)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=[d.document[:4000] for d in news_dspy], k=5)
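
Conceptually, the `search` retriever above embeds the query and returns the `k` documents whose embeddings are closest to it. The sketch below illustrates that idea with toy vectors, assuming cosine similarity as the score. It is not the implementation of `dspy.retrievers.Embeddings`, whose internals may differ, and the `cosine` and `top_k` helpers are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], corpus_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k corpus vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; the embedder above is configured for 512 dimensions.
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, corpus_vecs, k=2))  # indices of the two nearest documents
```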

In [8]:
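class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought predictor: adds a `reasoning` field before `response`
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        # Retrieve the k most similar documents, then answer from them
        context = search(question).passages
        return self.respond(context=context, question=question)

rag = RAG()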
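
The module above is just two steps: retrieve passages, then answer from them. To make that data flow visible without any API calls, here is a sketch in which plain Python callables stand in for the retriever and the language model. Everything in it (`OfflineRAG`, `keyword_retrieve`, `template_respond`, and the two toy documents) is hypothetical and only illustrates the pattern.

```python
from typing import Callable


class OfflineRAG:
    """Same two-step flow as the RAG module, with plain callables instead of DSPy objects."""

    def __init__(self,
                 retrieve: Callable[[str], list[str]],
                 respond: Callable[[list[str], str], str]):
        self.retrieve = retrieve
        self.respond = respond

    def __call__(self, question: str) -> str:
        context = self.retrieve(question)       # step 1: fetch passages
        return self.respond(context, question)  # step 2: answer from the passages


# Toy documents, invented for this sketch
docs = ["Rents rose in Helsinki.", "The government approved the Covid pass."]


def content_words(text: str) -> set[str]:
    """Lowercased words longer than three characters, punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split() if len(w.strip("?.,")) > 3}


def keyword_retrieve(question: str) -> list[str]:
    """Stand-in retriever: keyword overlap instead of embedding similarity."""
    return [d for d in docs if content_words(question) & content_words(d)]


def template_respond(context: list[str], question: str) -> str:
    """Stand-in responder: a fixed template instead of a language model."""
    return f"{len(context)} passage(s) retrieved for: {question}"


offline_rag = OfflineRAG(keyword_retrieve, template_respond)
print(offline_rag("What is the average rent in Helsinki?"))
```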

In [9]:
rag(question="What is the average rent in Helsinki?")

Prediction(
    reasoning='The context provides specific figures for rental prices in Helsinki. It states that the median rent for a studio apartment in central Helsinki is 809 euros per month. Additionally, it mentions that larger homes, such as three-room apartments, have a median rent of 1,634 euros in downtown Helsinki. Therefore, the average rent can be inferred from these figures, particularly focusing on the studio apartment as a common rental type.',
    response='The average rent for a studio apartment in central Helsinki is 809 euros per month.'
)

In [10]:
lm.inspect_history(1)

[2025-03-02T20:20:42.436532]

System message:

Your input fields are:
1. `context` (str)
2. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `response` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «««
    Rental fees for non-subsidised apartments rose across most of Finland during April to June, compared to the same period a year ago, according to data from Statistics Finland. 
     On average, rents rose by 0.9 percent during that period across the country. 
     Timo Metsola , board chair of rental agency Vuokraturva, attributed the increase to growing deman