# RAG using DeepSeek-R1


<img src="./resources/rag_architecture.png" width=800px>

In [1]:
import nest_asyncio
from dotenv import load_dotenv
from IPython.display import Markdown, display

from llama_index.core import PromptTemplate, Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

In [2]:
# allows nested access to the event loop
nest_asyncio.apply()

In [3]:
# add your documents in this directory, you can drag & drop
input_dir_path = '/teamspace/studios/this_studio/test-dir'
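Before indexing, it can help to confirm that the directory actually contains PDFs. The following is a minimal sketch using only the standard library. `list_pdfs` is a hypothetical helper, not part of LlamaIndex, and it only approximates which files a recursive, `.pdf`-only reader would pick up.

```python
from pathlib import Path


def list_pdfs(directory: str) -> list[Path]:
    """Return all .pdf files under `directory` (recursive), sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory is reported as "no PDFs" rather than an error.
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdfs(input_dir_path)` and checking that the result is non-empty is a cheap sanity check before loading anything.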

In [4]:

# set up the LLM and the embedding model
llm = Ollama(model="deepseek-r1:7b", request_timeout=120.0)
# alternative embedding model:
# embed_model = HuggingFaceEmbedding(model_name="Snowflake/snowflake-arctic-embed-m", trust_remote_code=True)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)

In [7]:
# load data
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()

# create an index over the loaded data
Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# create the query engine (default retrieval, no reranker is configured here)
Settings.llm = llm
query_engine = index.as_query_engine()

# ====== Customise prompt template ======
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, I want you to think step by step "
    "to answer the query in a crisp manner. In case you don't know the "
    "answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str)

query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt_tmpl}
)

# Generate the response
response = query_engine.query("What exactly is DSPy?")

Parsing nodes:   0%|          | 0/17 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/26 [00:00<?, ?it/s]
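To see roughly what the customised template sends to the model, the two placeholders can be filled by hand. The sketch below uses plain `str.format` rather than LlamaIndex itself, so it only approximates the final prompt. The `build_prompt` helper and the blank-line separator between chunks are assumptions made for illustration.

```python
# Plain-Python approximation of the QA template; this is not the LlamaIndex API.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information above, I want you to think step by step "
    "to answer the query in a crisp manner. In case you don't know the "
    "answer, say 'I don't know!'.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(context_chunks: list[str], query: str) -> str:
    """Join retrieved chunks (assumed separator: blank line) and fill both placeholders."""
    context = "\n\n".join(context_chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)
```

Printing `build_prompt(["some retrieved text"], "What exactly is DSPy?")` shows where the retrieved context and the query end up relative to the instructions.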

In [10]:
display(Markdown(str(response)))

<think>
Okay, so I need to figure out what exactly DSPy is based on the context provided. Let me start by reading through the information given.

From page_label:3, there's a section titled "DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines". It talks about something called DSPy being a framework that uses modules to solve tasks with language and retrieval models. The key point is that it replaces brittle prompt engineering by using composable modules and automatic optimizers.

Looking further, in section 2.1, it describes the DSPy Programming model. It says that instead of using free-form string prompts, programmers define signatures to specify what an LM needs to do declaratively. For example, a module might take a question and return an answer. The framework uses Predict modules which consume inputs and construct prompts from supplied signatures.

Then there's section 2.2 with a motivating example where a developer named Aiden builds an LM pipeline for multi-hop Q&A using DSPy's ChainOfThought module. This suggests that DSPy supports creating such pipelines efficiently.

The challenges mentioned are about lacking constructs to specify computational constraints and the need for these constraints to refine LM outputs or teach them to respect specifications at compile time. To address this, they integrated LM Assertions as first-class primitives in DSPy, inspired by Pythonic assertions. These allow setting strict rules, guidelines, debugging statements, etc., which can help improve the quality of LM outputs.

In section B.2.3, there are metrics and constraints for quiz choice generation, including valid formatting, correct answer inclusion, and choices' plausibility. They use DSPy modules to enforce these via Suggest statements and a Predict module with assessments.

Putting this together, it seems that DSPy is a framework designed to help build language model pipelines by allowing modular composition of modules, each handling specific tasks or steps in the pipeline. It uses assertions (called LM Assertions) to add constraints during the training or inference phases of the model, ensuring that outputs meet certain criteria like correct format, inclusion of the right answer, and validity of distractors.

So, DSPy not only provides a way to structure the pipeline but also integrates mechanisms to enforce these structured constraints on the language models, making it self-refining by improving the quality of generated answers through these checks.
</think>

DSPy is a framework designed for creating efficient language model pipelines. It allows developers to compose modules that handle specific tasks declaratively, enabling self-refining capabilities by enforcing computational constraints. These constraints ensure output quality through assertions and optimizers, facilitating better structured problem-solving in tasks like multi-hop question answering and quiz generation.

**Answer:**  
DSPy is a framework for building efficient language model pipelines by composing modules with declarative signatures, allowing self-refinement through computational constraints and assertions to ensure high-quality outputs.
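The response above includes the model's `<think>` block ahead of the final answer. If only the answer should be displayed, the reasoning can be removed first. This is a minimal sketch, assuming the reasoning is always wrapped in literal `<think>...</think>` tags as in the output shown here; `strip_reasoning` is a hypothetical helper.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think> blocks and surrounding whitespace from model output."""
    return _THINK_BLOCK.sub("", text).strip()
```

In the notebook this could be used as `display(Markdown(strip_reasoning(str(response))))`.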

### ❗️❗️ To use Streamlit from here, first clear GPU memory by clicking the Restart button above

In [7]:
# check GPU usage

!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
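The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch follows. It should run before any tokenizer is used, for example in the first cell, and choosing `false` is an assumption that gives up tokenizer parallelism in exchange for a quiet fork.

```python
import os

# Set this before the tokenizers library is first used, so that forking a
# subprocess (such as `!nvidia-smi`) no longer triggers the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```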


Wed Sep 25 19:39:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   44C    P0              31W /  72W |  19223MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
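To check memory use programmatically rather than by eye, the memory field of the table above can be parsed. This is a rough sketch that assumes the `usedMiB / totalMiB` layout shown in this output; `parse_gpu_memory` is a hypothetical helper, and the sample row is copied from the table above.

```python
import re


def parse_gpu_memory(smi_line: str) -> tuple[int, int]:
    """Extract (used_mib, total_mib) from an nvidia-smi memory-usage row."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_line)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output in this notebook.
row = "| N/A   44C    P0              31W /  72W |  19223MiB / 23034MiB |      0%      Default |"
used, total = parse_gpu_memory(row)
print(f"{used} / {total} MiB in use ({used / total:.0%})")
```

A high fraction with 0% utilisation, as here, suggests the memory is held by loaded models rather than by active work.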
                                                                    

In [6]:
llm = Ollama(model="deepseek-r1:7b", request_timeout=120.0)

In [7]:
resp = llm.complete("What is the capital of France?")
resp

ConnectError: [Errno 111] Connection refused
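A `Connection refused` error like the one above usually means nothing was listening at the address the client tried. Below is a minimal sketch of a pre-flight check, assuming the Ollama server runs locally on its default port 11434. `is_server_reachable` is a hypothetical helper, not part of LlamaIndex or Ollama.

```python
import socket


def is_server_reachable(
    host: str = "localhost", port: int = 11434, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

If this returns `False`, starting the server (for example with `ollama serve`) before calling `llm.complete` is the usual remedy.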

In [8]:
print(resp.text)

<think>

</think>

The capital of France is Paris.
