# DSPy + Snowflake: Towards Secure and Future-Proof LLM Pipelines

DSPy is an open-source framework for declaratively building LLM pipelines and automatically optimizing prompts. Snowflake Cortex is a managed LLM service that lets users leverage LLMs without moving data out of Snowflake. Together, the two let users write secure, future-proof, and cost-efficient pipelines.

This notebook is based on the [DSPy tutorials](https://dspy-docs.vercel.app/docs/category/tutorials) and walks you through how to use DSPy for:
* Basic RAG - Build a simple RAG program using the declarative programming paradigm and Snowflake Cortex LLMs
* End to End RAG in Snowflake - RAG example using a knowledge base and embeddings stored in Snowflake using the DSPy Snowflake Retriever
* Multi Hop RAG - Build an architecture that can break down complex questions and ask follow-up questions
* Pipeline Optimization - Automatically optimize Snowflake Cortex prompts to eliminate the need for manual prompt engineering

# DSPy Setup

If you don't already have DSPy and the Snowpark dependencies installed on your machine, you can install them with pip. The DSPy Snowflake integration will be available starting with DSPy version 2.5.0, so in the meantime you can install from the latest source on GitHub as follows:

In [None]:
!pip install git+https://github.com/stanfordnlp/dspy.git
!pip install snowflake-snowpark-python

The fundamental elements of a RAG architecture include:
* Knowledge Base - A database with passages and embeddings of content that will be required to generate the desired response.
* A Language Model (LM) - Given context from a particular knowledge base, an LM generates the response to the user's prompt.
* Retriever - A mechanism for retrieving the relevant context required to generate a response to the user's prompt.



To start, we will import the requirements for our program and load our Snowflake credentials.

In [3]:
import dspy
from dspy.evaluate.evaluate import Evaluate
from dspy.retrieve.snowflake_rm import SnowflakeRM
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

from snowflake.snowpark import Session
import os

from dotenv import load_dotenv
load_dotenv()

# Snowflake connection settings, read from environment variables (loaded from .env above).
connection_parameters = {
    "account": os.getenv('SNOWFLAKE_ACCOUNT'),
    "user": os.getenv('SNOWFLAKE_USER'),
    "password": os.getenv('SNOWFLAKE_PASSWORD'),
    "role": os.getenv('SNOWFLAKE_ROLE'),
    "warehouse": os.getenv('SNOWFLAKE_WAREHOUSE'),
    "database": os.getenv('SNOWFLAKE_DATABASE'),
    "schema": os.getenv('SNOWFLAKE_SCHEMA'),
}

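`os.getenv` returns `None` for any variable that is not set, so a missing entry in the `.env` file would only surface later, when the session is created. The following is a minimal sketch of an early check. The names `REQUIRED_KEYS` and `missing_connection_params` are hypothetical helpers introduced here for illustration; they are not part of DSPy or Snowpark.

```python
# Hypothetical helper, not part of DSPy or Snowpark.
# Keys that the connection dictionary above is expected to contain.
REQUIRED_KEYS = ("account", "user", "password", "role", "warehouse", "database", "schema")


def missing_connection_params(params):
    """Return the required keys whose values are missing or empty."""
    return [key for key in REQUIRED_KEYS if not params.get(key)]


# Illustrative usage with a deliberately incomplete dictionary.
example_params = {"account": "my_account", "user": "my_user", "password": None}
print(missing_connection_params(example_params))
```

In the notebook, calling `missing_connection_params(connection_parameters)` before creating a session would list any variables that still need to be defined.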

Below we configure the basic program requirements:
* Knowledge Base: Containing the Snowflake Annual Reports available from the investor relations page [here](https://investors.snowflake.com/financials/sec-filings/default.aspx)
* LM: A Snowflake Cortex hosted Mixtral 8x7B model
* Retriever: DSPy Snowflake Retriever to leverage embeddings stored in a Snowflake table

### Prepare the Embeddings

If our embeddings are not yet stored in Snowflake, but we have the raw data in a local directory or a Snowflake stage, we can easily generate and load the embeddings into a new Snowflake table using the SnowVecDB utility, which you can find [here](https://github.com/Snowflake-Labs/sf-samples/tree/main/samples/snowfake-cortex/cortexRAG).

Below we create a new `SEC_EMBEDDINGS` table using the annual reports that we've downloaded to a local directory. If your data is already in a Snowflake Stage, instead of using a local directory, you can point the helper module to generate the embeddings using your staged files, by using the `stage` argument in SVDB.

In [None]:
from snowvecdb import SnowVectorDB
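# Open a Snowpark session with the credentials loaded earlier.
snowpark = Session.builder.configs(connection_parameters).create()
# Configure the chunking used when embedding the documents.
SVDB = SnowVectorDB(snowflake_session=snowpark, chunk_size=500, chunk_overlap=75)
# Embed the local annual reports and load them into the SEC_EMBEDDINGS table.
SVDB(vector_table_name="SEC_EMBEDDINGS", data_source_directory="your_local_directory/snowflake_annual_reports/")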

The SVDB utility creates the embeddings table with generic names for the columns containing the passage (`CHUNK`) and the related embeddings (`CHUNK_VEC`). By default, the Snowflake retriever (`SnowflakeRM`) assumes your embeddings table has these headers, but you can easily override them with the `embeddings_field` and `embeddings_field_text` arguments.
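To give a feel for what the `chunk_size` and `chunk_overlap` arguments control, here is a minimal sketch of overlapping chunking. It assumes simple character-based windows; the splitting strategy SnowVecDB actually uses is not described here and may differ (for example, it could be token-based or separator-aware). The function `chunk_text` is a hypothetical name used only for this illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=75):
    """Split text into character windows that overlap by chunk_overlap.

    Illustrative only: assumes character-based splitting, which may not
    match what SnowVecDB does internally.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1200-character document yields three windows, each starting 425 characters apart.
sample_chunks = chunk_text("x" * 1200)
print([len(chunk) for chunk in sample_chunks])
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.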

### Configure Language + Retriever Models

Once we have the knowledge base + embeddings ready, we can configure the Snowflake LM and RM in DSPy as follows:

In [None]:
# Snowflake Cortex Language Model Definition
turbo = dspy.Snowflake(model="mixtral-8x7b",credentials=connection_parameters)

# Snowflake Retriever Model Definition
snowflake_retriever = SnowflakeRM(
  snowflake_table_name="SEC_EMBEDDINGS",
  snowflake_credentials=connection_parameters
)

# Configure which LM and RM to use in DSPy
dspy.settings.configure(lm=turbo,rm=snowflake_retriever)

With the above configuration, we ensure that future DSPy pipeline calls will use the Snowflake Cortex Mixtral model and the `SEC_EMBEDDINGS` table for context. 

# DSPy + Snowflake RAG Pipeline Definition

Given a user's query, the simplest RAG architecture will:
1) Retrieve the K most relevant passages from our knowledge base for the user's query
2) Generate a response to the query using the relevant passages retrieved in step 1.
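The two steps above can be sketched without any framework. The example below is a toy illustration only: the three-dimensional embeddings are hand-written placeholders, and `cosine_similarity`, `retrieve_top_k`, and `build_prompt` are hypothetical helpers, not DSPy or Snowflake functions. In the real pipeline, the retriever and the Cortex LLM take over these roles.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, passages, k=2):
    """Step 1: return the k passage texts most similar to the query."""
    ranked = sorted(passages, key=lambda p: cosine_similarity(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context):
    """Step 2 (input side): combine retrieved passages and the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Placeholder knowledge base: (text, embedding) pairs with made-up vectors.
toy_passages = [
    ("passage about the IPO", [0.9, 0.1, 0.0]),
    ("passage about revenue", [0.1, 0.9, 0.0]),
    ("passage about hiring", [0.0, 0.1, 0.9]),
]
toy_query_vec = [1.0, 0.2, 0.0]  # made-up embedding of the user's question

toy_context = retrieve_top_k(toy_query_vec, toy_passages, k=2)
print(build_prompt("When was the IPO?", toy_context))
```

The prompt produced in the last line is what a language model would receive in step 2.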

The building blocks of DSPy include:
- DSPy Signatures to define the expected inputs and outputs of the program
- DSPy Modules to define the core flow of your program

In [21]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 10 words")

class RAG(dspy.Module):
    def __init__(self, num_passages=5):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


Notice above that the implementation of our RAG pipeline is decoupled from our underlying data, from our language model, and from our prompt. `dspy.Retrieve` and `dspy.ChainOfThought` will use the user-configured retriever (`snowflake_retriever`) and the user-configured language model (`turbo`) when the RAG pipeline is called.
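The decoupling described above relies on a shared configuration object that modules consult at call time. The sketch below illustrates that general pattern in plain Python. It is a conceptual toy, not DSPy's implementation; `ToySettings`, `toy_settings`, and `toy_answer` are hypothetical names.

```python
from contextlib import contextmanager


class ToySettings:
    """Toy stand-in for a global configuration object (not DSPy's code)."""

    def __init__(self):
        self._config = {}

    def configure(self, **kwargs):
        """Set defaults that apply to all later calls."""
        self._config.update(kwargs)

    def get(self, key):
        return self._config[key]

    @contextmanager
    def context(self, **overrides):
        """Temporarily override settings, restoring them on exit."""
        previous = dict(self._config)
        self._config.update(overrides)
        try:
            yield
        finally:
            self._config = previous


toy_settings = ToySettings()


def toy_answer(question):
    # The pipeline looks up the model when called, not when defined.
    lm = toy_settings.get("lm")
    return lm(question)


toy_settings.configure(lm=lambda q: f"[model-a] {q}")
default_answer = toy_answer("hello")

with toy_settings.context(lm=lambda q: f"[model-b] {q}"):
    overridden_answer = toy_answer("hello")

restored_answer = toy_answer("hello")
print(default_answer, overridden_answer, restored_answer, sep="\n")
```

Because `toy_answer` never names a specific model, swapping models requires no change to the pipeline code, which is the property the RAG module above benefits from.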

#### Test the DSPy + Snowflake Pipeline

Above, we updated our DSPy settings to use the Snowflake retriever, so calls to the RAG pipeline will use the `SEC_EMBEDDINGS` table under the hood.

In [24]:
rag = RAG()
rag("In what fiscal year did Snowflake IPO?")

Prediction(
    context=['to the 2020 ESPP.\nWe intend to continue to make significant investments in research and development as we enhance our platform. We also intend to invest in our sales and marketing\norganization to drive future revenue growth. As a result of the closing of our IPO, we have incurred and expect to continue to incur additional expenses as a result of operating as a\npublic company, including costs to comply with the rules and regulations applicable to companies listed on a national securities exchange, costs related to compliance and reporting\nobligations, and increased expenses for insurance, investor relations, and professional services.\nKey Business Metrics\nThree Months Ended\nJanuary 31,\n2021 October  31, 2020July 31,\n2020April 30,\n2020January 31,\n2020October  31,\n2019July 31,\n2019April 30,\n2019\nProduct revenue (in millions) $ 178.3 $ 148.5 $ 125.2 $ 101.8 $ 82.4 $ 69.2 $ 57.8 $ 42.8 \nJanuary 31,\n2021 October  31, 2020July 31,\n2020April 30,\n202

We can see that the pipeline returns a `dspy.Prediction` object that contains the relevant context retrieved from our knowledge base and the final answer generated by the language model using the Chain of Thought technique.

# LLM Performance Evaluation

There are various ways to evaluate the performance of RAG systems. Fortunately, with DSPy, evaluation metrics are easy to define.

Below we'll demonstrate two different approaches:

* Exact Answer Match: Using DSPy utils to evaluate whether the generated response is an exact match of the ground truth answer
* Semantic Match: Using an LLM as a Judge to determine whether the answer is correct


**Note: because these LLMs exhibit non-deterministic behavior, as you rerun the cells below your results may vary.**
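Because individual runs vary, it can help to repeat an evaluation several times and report the mean and spread. The following is a minimal sketch of that idea. `summarize_runs` is a hypothetical helper, and the scores fed to it here are arbitrary placeholders chosen to make the arithmetic easy to follow; they are not results from any model.

```python
import statistics


def summarize_runs(run_evaluation, num_runs=5):
    """Call run_evaluation num_runs times; return (mean, sample std dev)."""
    scores = [run_evaluation() for _ in range(num_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder evaluator returning arbitrary numbers (NOT real results).
# In the notebook, run_evaluation would wrap an actual evaluation call.
_placeholder_scores = iter([10.0, 20.0, 30.0, 40.0, 50.0])


def placeholder_evaluation():
    return next(_placeholder_scores)


mean_score, score_spread = summarize_runs(placeholder_evaluation, num_runs=5)
print(f"mean={mean_score:.1f}, stdev={score_spread:.2f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two pipelines is larger than the run-to-run noise.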


### Training Data Ingestion

For evaluation purposes, we'll use a standard labeled industry dataset, HotPotQA, so we will switch to an open-source Wikipedia knowledge base and a ColBERTv2 retriever. For more detail, see the docs [here](https://hotpotqa.github.io/wiki-readme.html).

In [28]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=50, eval_seed=1000, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)

### Exact Answer Evaluation

The most stringent evaluation metric requires that our output exactly match the ground truth answer.

In [26]:
def validate_exact_answer(example, pred, trace=None):
    # True only when the prediction matches the ground-truth answer exactly.
    return bool(dspy.evaluate.answer_exact_match(example, pred))

evaluate = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=0)
evaluate(RAG(),validate_exact_answer)

  0%|          | 0/50 [00:00<?, ?it/s]

Average Metric: 12 / 50  (24.0): 100%|██████████| 50/50 [02:07<00:00,  2.56s/it]


24.0
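To see why exact match is so strict, consider a minimal sketch of a normalized exact-match check in the style commonly used for question-answering benchmarks (lowercasing, then stripping punctuation and articles). This is an illustration under that assumption; `dspy.evaluate.answer_exact_match` may normalize differently, and `normalize_answer` and `exact_match` are hypothetical names.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """True only if both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


# Formatting differences are forgiven...
print(exact_match("The Empire State Building.", "empire state building"))
# ...but a correct answer with different wording is not.
print(exact_match("18", "18 days"))
```

The second case shows the limitation: an answer can be factually right and still score zero, which motivates the semantic metric in the next section.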

### Semantic Evaluation

For some use cases, an exact match metric may be too restrictive. If we don't care about the exact language and simply want to evaluate whether the generated response is factually correct, we can use another LLM to do so (sometimes known as the LLM-as-a-Judge approach).

DSPy allows us to use arbitrary user-defined methods as evaluation metrics, so we can use another DSPy program to evaluate the performance of our primary agent's predictions. Below is an example of how to do this by defining a semantic similarity metric that uses a separate DSPy program for evaluation. We want an independent judge when evaluating performance, so we'll use the DSPy context manager to have a Reka Flash model assess our Mixtral 8x7B pipeline.

In [30]:
class Judge(dspy.Signature):
    """Judge if the predicted answer contains the ground truth answer."""

    ground_truth = dspy.InputField(desc="ground truth")
    prediction = dspy.InputField(desc="predicted answer")
    assessment_answer: bool = dspy.OutputField(desc="only True or False without any rationale")

reka = dspy.Snowflake(model="reka-flash",credentials=connection_parameters)
judge = dspy.ChainOfThought(Judge)

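def semantic_similarity(example, pred, trace=None):
    # Run the judge with the Reka Flash model instead of the pipeline's default LM.
    with dspy.settings.context(lm=reka):
        equivalent = judge(ground_truth=example.answer, prediction=pred.answer)

    # Cast to str so the check works whether the field comes back as text or as a bool.
    return "true" in str(equivalent.assessment_answer).lower()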
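One caveat of a substring check on the judge's output is that it also fires on responses such as "untrue" or "True or False". Below is a minimal sketch of a stricter parser that accepts a verdict only when exactly one of the two words appears. `parse_verdict` is a hypothetical helper, not part of DSPy, and how to score an ambiguous (`None`) verdict is left as a design choice.

```python
import re


def parse_verdict(text):
    """Return True or False if exactly one verdict word appears, else None."""
    words = set(re.findall(r"\b(true|false)\b", str(text).lower()))
    if words == {"true"}:
        return True
    if words == {"false"}:
        return False
    return None  # ambiguous or missing verdict


for raw in ["True", "Answer: False.", "untrue", "True or False"]:
    print(repr(raw), "->", parse_verdict(raw))
```

A metric built on this parser could count `None` as incorrect, which is the conservative option, or log those cases for manual review.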

## Performance Comparison: Mixtral 8x7B vs. Llama3-70B

Let's evaluate the performance of our zero-shot pipeline using two different models: Mixtral 8x7B and Llama3-70B.

### Mixtral 8x7B Performance

In [31]:
evaluate(RAG(),semantic_similarity)

Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [04:31<00:00,  5.43s/it]


80.0

Five evaluation runs yield an average accuracy of 75% for the baseline Mixtral 8x7B pipeline.

### Llama3-70B Performance

To evaluate the performance of the Llama3-70B model, all we need to do is change the dspy context. We don't need to change anything about the RAG pipeline.

In [32]:
llama_turbo = dspy.Snowflake(model="llama3-70b",credentials=connection_parameters)
with dspy.settings.context(lm=llama_turbo):

    print(evaluate(RAG(),semantic_similarity))

Average Metric: 41 / 50  (82.0): 100%|██████████| 50/50 [04:21<00:00,  5.22s/it]
82.0


Llama3-70B outperforms our Mixtral 8x7B pipeline. Five evaluation runs yield an average score of 83%.


# Pipeline Optimization

DSPy's built-in optimizers let us tune our LLM pipelines, automatically adjusting our prompts and LM weights to improve performance. Below we use the `BootstrapFewShotWithRandomSearch` optimizer to maximize our `semantic_similarity` metric.

DSPy allows us to use a larger teacher model to train our pipeline. This allows us to take advantage of the performance of a larger model with the cost profile of a smaller model. Below we use a Mistral Large teacher to train our Mixtral 8x7B pipeline.
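The general idea behind bootstrapping few-shot demonstrations with random search can be illustrated with a toy example: a teacher produces candidate answers, only those accepted by the metric are kept as demonstrations, and random subsets of them are scored to find the best set. The sketch below is a heavily simplified illustration of that idea, not DSPy's algorithm. All names (`bootstrap_demos`, `random_search`, the lookup-table teacher, and the demo-counting score function) are hypothetical stand-ins.

```python
import random


def bootstrap_demos(teacher, trainset, metric):
    """Keep teacher outputs that the metric accepts as candidate demos."""
    demos = []
    for question, gold in trainset:
        prediction = teacher(question)
        if metric(gold, prediction):
            demos.append((question, prediction))
    return demos


def random_search(demos, score_fn, num_candidates=8, max_demos=2, seed=0):
    """Score random demo subsets; return the best (score, subset) found."""
    rng = random.Random(seed)
    best_score, best_subset = score_fn([]), []  # zero-shot baseline
    if not demos:
        return best_score, best_subset
    for _ in range(num_candidates):
        size = rng.randint(1, min(max_demos, len(demos)))
        subset = rng.sample(demos, size)
        score = score_fn(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


# Toy stand-ins: a lookup-table "teacher" that gets one answer wrong.
toy_teacher_answers = {"2+2": "4", "3+3": "6", "5+5": "11"}
toy_trainset = [("2+2", "4"), ("3+3", "6"), ("5+5", "10")]

toy_demos = bootstrap_demos(
    toy_teacher_answers.get, toy_trainset, lambda gold, pred: gold == pred
)
# Toy score: pretend that more demonstrations always help.
toy_best_score, toy_best_subset = random_search(toy_demos, score_fn=len)
print(toy_demos)
print(toy_best_score, toy_best_subset)
```

In a real optimizer the score function would run the student pipeline with the candidate demonstrations on held-out examples; here it only counts demonstrations so that the example stays self-contained.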

### Mixtral 8x7B - Optimized Performance

In [34]:
mistral = dspy.Snowflake(model="mistral-large",credentials=connection_parameters)
optimizer = BootstrapFewShotWithRandomSearch(metric=semantic_similarity,teacher_settings=dict(lm=mistral))
optimized_pipeline = optimizer.compile(RAG(), teacher=RAG(), trainset=trainset)
evaluate(optimized_pipeline,semantic_similarity)

Average Metric: 10 / 34  (29.4):  68%|██████▊   | 34/50 [00:35<00:15,  1.05it/s]
2024-06-02T19:36:55.137631Z [error    ] Error for example in dev set: 		 1 values are expected. [dspy.evaluate.evaluate] filename=evaluate.py lineno=180
Average Metric: 17.0 / 50  (34.0): 100%|██████████| 50/50 [00:50<00:00,  1.01s/it]
Average Metric: 25 / 50  (50.0): 100%|██████████| 50/50 [00:54<00:00,  1.09s/it]
 12%|█▏        | 6/50 [00:56<06:55,  9.45s/it]
Average Metric: 19 / 36  (52.8):  72%|███████▏  | 36/50 [00:36<00:12,  1.13it/s]
2024-06-02T19:39:39.132762Z [error    ] Error for example in dev set: 		 1 values are expected. [dspy.evaluate.evaluate] filename=evaluate.py lineno=180
Average Metric: 29.0 / 50  (58.0): 100%|██████████| 50/50 [00:52<00:00,  1.05s/it]
 10%|█         | 5/50 [00:45<06:46,  9.03s/it]
Average Metric: 30 / 

Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [04:51<00:00,  5.82s/it]


94.0

Five evaluation runs of the optimized Mixtral pipeline give an average accuracy of 88%. This is an improvement of 13 percentage points (roughly 17% in relative terms) over the baseline Mixtral performance, and it outperforms the Llama3-70B pipeline. Note: Mixtral 8x7B is 5x cheaper than Llama3-70B in Snowflake Cortex!
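The size of the improvement can be expressed either in percentage points or relative to the baseline. A short worked calculation using the two averages reported in this notebook (75% baseline, 88% optimized):

```python
baseline_accuracy = 75.0   # average baseline Mixtral 8x7B accuracy reported above (%)
optimized_accuracy = 88.0  # average optimized Mixtral 8x7B accuracy reported above (%)

# Absolute gain is the difference in percentage points.
absolute_gain = optimized_accuracy - baseline_accuracy
# Relative gain expresses that difference as a share of the baseline.
relative_gain = 100 * absolute_gain / baseline_accuracy

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

Stating which of the two measures is meant avoids ambiguity when comparing pipelines.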


### What's going on under the hood?

Our optimized pipeline includes question/answer examples from the training data, as well as examples of Q&A responses generated by our teacher program during the bootstrapping process.

In [35]:
optimized_pipeline("What castle did David Gregory inherit?")
turbo.inspect_history()




Answer questions with short factoid answers.

---

Question: Who composed "Sunflower Slow Drag" with the King of Ragtime?
Answer: Scott Hayden

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?
Answer: 2010

Question: The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?
Answer: 1950

Question: Chester Bennett was executed during an occupation that happened after how many days of fierce fighting ?
Answer: 18 days

Question: Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?
Answer: Rosario Dawson

Question: What Brazilian professional racing driver who races for Rebellion Racing has a mother named Viviane Senna da Silva Lalli?
Answer: Bruno Senna Lalli

Question: Which is taller, the Empire State Building or the Bank of America Tower?
Answer: The Empire State Building

Questio
