# Retrieval-Augmented Generation with Gemma using Weaviate and DSPy

## Tech Stack 
- ```Gemma```
- ```HuggingFace```
- ```Weaviate```
- ```DSPy```

### What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) is a technique that provides LLMs with additional context to reduce hallucinations and increase accuracy, a goal it shares with traditional fine-tuning.

- Retrieval: The user's query is used to retrieve additional context from an external knowledge source. The external knowledge source stores pieces of information and their vector embeddings. At query time, the user query is embedded into the same vector space and used to retrieve similar context by calculating the closest data points.
- Augmentation: The user query and retrieved additional context are then used to augment a prompt template.
- Generation: The augmented prompt is used to generate a more factually accurate answer than the user query alone.
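
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `retrieve` and `augment` are hypothetical helper names, not part of Weaviate or DSPy. The generation step is left out, since it would simply pass the resulting prompt to the LM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Retrieval: rank the documents by similarity to the query and keep the top k."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def augment(query, passages):
    """Augmentation: fill a simple prompt template with the retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


documents = [
    "PetFinder.my Adoption Prediction Kaggle Competition: How cute is that doggy in the shelter?",
    "Quora Insincere Questions Classification Kaggle Competition: Detect toxic content to improve online conversations",
    "M5 Forecasting - Accuracy Kaggle Competition: Estimate the unit sales of Walmart retail goods",
]

query = "Which competition is about toxic content in online conversations?"
passages = retrieve(query, documents, k=1)
prompt = augment(query, passages)
print(prompt)  # Generation: this prompt would be sent to the language model
```

A real pipeline replaces the toy embedding with a trained embedding model and the sorted list with a vector index, which is what the rest of this notebook sets up.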

### What is DSPy?
DSPy is a framework that helps developers build pipelines using language models (LMs) similar to LangChain or LlamaIndex.

DSPy introduces signatures and modules to enable developers to define LM-based programs similarly to neural network architectures in PyTorch. After you have defined your DSPy program, you can use some sample data, an optimizer (called teleprompter in DSPy), a metric, and a DSPy compiler to optimize your DSPy program, similarly to how you would train a neural network with training data, an optimizer, and a metric.

### 1. Install Prerequisites

In [41]:
%%capture
!pip install -U transformers dspy-ai weaviate-client


1. Request access to the [Gemma Model](https://huggingface.co/google/gemma-7b-it) family on Hugging Face to be able to use Gemma via DSPy.
2. Set your Hugging Face token in the Kaggle secrets: You will need a Hugging Face API key (Gemma can be used for free without a Pro account). To obtain one, register with Hugging Face and then create a token under your profile via Settings > Access tokens. Use Kaggle secrets (Add-ons > Secrets) to use your Hugging Face API key in this Kaggle Notebook without sharing it with others.

In [43]:
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("Hugging_Face_API")

login(hf_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### 2. Load Data

Add the Meta Kaggle [dataset](https://www.kaggle.com/datasets/kaggle/meta-kaggle?select=Competitions.csv) as input to the notebook.

In [44]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("/kaggle/input/meta-kaggle/Competitions.csv")

df[:5]

Unnamed: 0,Id,Slug,Title,Subtitle,HostSegmentTitle,ForumId,OrganizationId,EnabledDate,DeadlineDate,ProhibitNewEntrantsDeadlineDate,...,CanQualifyTiers,TotalTeams,TotalCompetitors,TotalSubmissions,ValidationSetName,ValidationSetValue,EnableSubmissionModelHashes,EnableSubmissionModelAttachments,HostName,CompetitionTypeId
0,2408,Eurovision2010,Forecast Eurovision Voting,"This competition requires contestants to forecast the voting for this year's Eurovision Song Contest in Norway on May 25th, 27th and 29th.",Featured,2,,04/07/2010 07:57:43,05/25/2010 18:00:00,,...,False,22,25,22,,,False,False,,1
1,2435,hivprogression,Predict HIV Progression,"This contest requires competitors to predict the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information.",Featured,1,,04/27/2010 21:29:09,08/02/2010 12:32:00,,...,True,107,116,855,,,False,False,,1
2,2438,worldcup2010,World Cup 2010 - Take on the Quants,Quants at Goldman Sachs and JP Morgan have modeled the likely outcomes of the 2010 World Cup. Can you do better?,Featured,3094129,,06/03/2010 08:08:08,06/11/2010 13:29:00,,...,False,0,0,0,,,False,False,,1
3,2439,informs2010,INFORMS Data Mining Contest 2010,The goal of this contest is to predict short term movements in stock prices. The winners of this contest will be honoured of the INFORMS Annual Meeting in Austin-Texas (November 7-10).,Featured,4,,06/21/2010 21:53:25,10/10/2010 02:28:00,,...,True,145,153,1483,,,False,False,,1
4,2442,worldcupconf,World Cup 2010 - Confidence Challenge,The Confidence Challenge requires competitors to assign a level of confidence to their World Cup predictions.,Featured,3,,06/03/2010 08:08:08,06/11/2010 13:28:00,,...,False,63,64,63,,,False,False,,1


In [45]:
df.shape

(5680, 42)

In [46]:
df = df[(df.HostSegmentTitle == "Featured") & (df.Subtitle.notna())][-100:]
df[['Title', 'Subtitle']].head()

| | Title | Subtitle |
|---|---|---|
| 1080 | PetFinder.my Adoption Prediction | How cute is that doggy in the shelter? |
| 1085 | Traveling Santa 2018 - Prime Paths | But does your code recall, the most efficient route of all? |
| 1087 | Quora Insincere Questions Classification | Detect toxic content to improve online conversations |
| 1271 | Google Cloud & NCAA® ML Competition 2019-Men's | Apply Machine Learning to NCAA® March Madness® |
| 1272 | Google Cloud & NCAA® ML Competition 2019-Women's | Apply Machine Learning to NCAA® March Madness® |


### 3. Populate Vector Database

For the retriever model, use the ```Weaviate Vector Database```:

- Weaviate Embedded: a local instance that runs inside the notebook, so no Weaviate API key is needed (free to use)
- Schema of the vector database: DSPy requires a property named "content".
- Vectorizer module: the embedding model that generates vector embeddings of the data at import and query time, here ```sentence-transformers/all-MiniLM-L6-v2```

#### Connect to Weaviate Client

In [49]:
import weaviate
from weaviate.embedded import EmbeddedOptions

# Connect to Weaviate client in embedded mode
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    # The Hugging Face token lets the text2vec-huggingface vectorizer call the Hugging Face API
    additional_headers={"X-Huggingface-Api-Key": hf_token},
)

embedded weaviate is already listening on port 8079


#### Create Weaviate Schema

In [50]:
# Create Weaviate schema
schema = {
    "classes": [
        {
            "class": "RAG_Gemma",
            "vectorizer": "text2vec-huggingface",
            "moduleConfig": {
                "text2vec-huggingface": {
                    "model": "sentence-transformers/all-MiniLM-L6-v2",
                }
            },
            "properties": [
                {
                    "name": "content",  # This is a required property name to be able to use Weaviate as RM with DSPy
                    "dataType": ["text"]
                }
            ]
        }
    ]
}

In [51]:
# Delete existing data collection if it already exists from a previous run
if client.schema.exists("RAG_Gemma"):
    client.schema.delete_class("RAG_Gemma")
client.schema.create(schema)

{"level":"info","msg":"Created shard rag_gemma_Nz3ye2LnYTHo in 1.192387ms","time":"2024-06-17T09:51:34Z"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-06-17T09:51:34Z","took":107925}


#### Populate Vector database

In [52]:
# Populate vector database in batches
client.batch.configure(batch_size=100)  # Configure batch (for only 100 data points this is not really necessary...)

with client.batch as batch:  # Initialize a batch process
    for _, row in df.iterrows():
        properties = {
            "content": row['Title'] + ' Kaggle Competition: ' + row['Subtitle']
        }
        batch.add_data_object(
            data_object=properties,
            class_name="RAG_Gemma"
        )

{'error': [{'message': 'update vector: failed with status: 429 error: Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate'}]}


In [53]:
client.query.aggregate("RAG_Gemma").with_meta_count().do()

{'data': {'Aggregate': {'RAG_Gemma': [{'meta': {'count': 99}}]}}}
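
The count of 99 is worth comparing against the number of rows that were sent to the batch. The sketch below reads the count out of the aggregate response shown above and compares it with an expected value; `stored_count` is a hypothetical helper, and the expected value of 100 is an assumption based on the `[-100:]` slice applied to the DataFrame earlier. A shortfall of one object would be consistent with the rate-limit error reported during the import, although the output alone does not prove that this is the cause.

```python
def stored_count(aggregate_response, class_name):
    """Read the object count out of a Weaviate aggregate response dict."""
    return aggregate_response["data"]["Aggregate"][class_name][0]["meta"]["count"]


# Response copied from the cell output above
aggregate_response = {'data': {'Aggregate': {'RAG_Gemma': [{'meta': {'count': 99}}]}}}

expected = 100  # assumption: the filtered DataFrame holds 100 rows (in the notebook: len(df))
missing = expected - stored_count(aggregate_response, "RAG_Gemma")
print(f"{missing} object(s) missing from the collection")
```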

#### Retrieve some example data points.

In [54]:
import json

response = (
    client.query
    .get("RAG_Gemma", ["content"])
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "RAG_Gemma": [
                {
                    "content": "G-Research Crypto Forecasting Kaggle Competition: Use your ML expertise to predict real crypto market data"
                },
                {
                    "content": "M5 Forecasting - Accuracy Kaggle Competition: Estimate the unit sales of Walmart retail goods"
                }
            ]
        }
    }
}


Run an example vector similarity search: pass the query as the value of the ```"concepts"``` key to the ```.with_near_text()``` method.

In [55]:
response = (
    client.query
    .get("RAG_Gemma", ["content"])
    .with_near_text({"concepts": ["Medical"]})
    .with_limit(3)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "RAG_Gemma": null
        }
    },
    "errors": [
        {
            "locations": [
                {
                    "column": 6,
                    "line": 1
                }
            ],
            "message": "explorer: get class: vectorize params: vectorize params: vectorize params: vectorize keywords: remote client vectorize: failed with status: 429 error: Rate limit reached. You reached free usage limit (reset hourly). Please subscribe to a plan at https://huggingface.co/pricing to use the API at this rate",
            "path": [
                "Get",
                "RAG_Gemma"
            ]
        }
    ]
}
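
Because a failed query still returns a dictionary, with `"RAG_Gemma": null` and an `"errors"` list as in the output above, it can help to check for errors before reading the results. The following is a minimal sketch that works on plain response dictionaries shaped like the two outputs in this section; `extract_contents` is a hypothetical helper, not part of the Weaviate client, and the error message in the sample is shortened from the one shown above.

```python
def extract_contents(response, class_name, prop="content"):
    """Return the retrieved property values, or raise if the query reported errors."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise RuntimeError(f"Weaviate query failed: {messages}")
    # A class entry of None (JSON null) is treated as "no results"
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [obj[prop] for obj in objects]


# Sample responses modelled on the outputs in this section
ok_response = {
    "data": {"Get": {"RAG_Gemma": [
        {"content": "M5 Forecasting - Accuracy Kaggle Competition: Estimate the unit sales of Walmart retail goods"},
    ]}}
}
failed_response = {
    "data": {"Get": {"RAG_Gemma": None}},
    "errors": [{"message": "failed with status: 429 error: Rate limit reached."}],
}

print(extract_contents(ok_response, "RAG_Gemma"))
try:
    extract_contents(failed_response, "RAG_Gemma")
except RuntimeError as err:
    print(err)
```

Raising instead of returning an empty list keeps a rate-limit failure from being mistaken for a query that simply has no matches.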


### 4. Configure DSPy Settings

Configure both RM (Retriever Model) and LM (Language Model)

- Retriever Model : ```Weaviate```
- Language Model : ```Google Gemma```

Gemma Versions:  
- ```gemma_2b_en``` (for mobile devices)
- ```gemma_7b_en``` (for desktop computers)

In [56]:
# In case of the error "AttributeError: module 'google._upb._message' has no attribute 'MessageMapContainer'", run:
# !pip install proto-plus==1.24.0.dev1

In [59]:
import dspy

# Configure language model
llm = dspy.HFModel(model = 'google/gemma-2b')


In [61]:
# Invoke Gemma with a sample query

example_query = "Which Kaggle competition should I look at to learn more about recommender systems in e-commerce?"

# Example of the kind of answer we are hoping for:
# "You might be interested in the H&M Personalized Fashion Recommendations or Elo Merchant Category Recommendation Kaggle competition."

response = llm(example_query)

print(response)

Observation: Gemma isn't able to give a good answer to the example query without any additional context. This is the main reason to implement a RAG pipeline.

### 5. Configure Settings of Retriever Model: Weaviate

In [None]:
from dspy.retrieve.weaviate_rm import WeaviateRM

# Configure retriever model
rm = WeaviateRM("RAG_Gemma", 
                weaviate_client = client)

### 6. Configure both LM and RM in the overall DSPy settings

In [None]:
# Configure DSPy to use the following language model and retrieval model by default
dspy.settings.configure(lm = llm, 
                        rm = rm)

### 7. Write DSPy Program 

We can write a simple DSPy program for our RAG pipeline. This is similar to defining a neural network architecture in PyTorch:

- In the ```__init__()``` method, define your modules:
  - the ```Retrieve``` module to retrieve additional context from the vector database
  - the ```ChainOfThought``` module to prompt Gemma with a chain-of-thought ("Let's think step by step") prompting technique
- In the ```forward()``` method, define the flow of information among the defined modules.

In [None]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In [None]:
uncompiled_rag = RAG()

response = uncompiled_rag(example_query)

print(response.answer)

In [None]:
llm.inspect_history(n=1)

### 8. Compile DSPy Program

To compile a DSPy program, you need:
- Training data
- A metric to optimize
- An optimizer (called teleprompter in DSPy)
- The DSPy compiler

In [30]:
example_question1 = "Has there been a Kaggle competition about cute animals and if yes, which one?"
example_answer1 = "You might be interested in the PetFinder.my Adoption Prediction competition."

example_question2 = "I'm interested in autonomous driving. What Kaggle competition can you recommend?"
example_answer2 = "You might be interested in the Lyft Motion Prediction for Autonomous Vehicles or Lyft 3D Object Detection for Autonomous Vehicles competition."

example_question3 = "What Kaggle competitions should I look at to learn more about predicting the stock market?"
example_answer3 = "You might be interested in the JPX Tokyo Stock Exchange Prediction or Jane Street Market Prediction or Ubiquant Market Prediction competition."

In [31]:
# Small training set with question and answer pairs
trainset = [
    dspy.Example(question=example_question1, answer=example_answer1).with_inputs('question'),
    dspy.Example(question=example_question2, answer=example_answer2).with_inputs('question'),
    dspy.Example(question=example_question3, answer=example_answer3).with_inputs('question'),
]

In [32]:
from dspy.teleprompt import BootstrapFewShot

# The teleprompter will bootstrap missing labels: reasoning chains and retrieval contexts
teleprompter = BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

100%|██████████| 3/3 [00:00<00:00, 3129.30it/s]
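
The `answer_exact_match` metric requires the predicted answer to match the reference answer essentially verbatim, which free-form recommendations like the training answers above will rarely do. One possible alternative, shown here only as a sketch, is a more lenient metric based on token overlap. The names `token_f1` and `answer_f1_metric` and the threshold of 0.5 are assumptions made for illustration; the `(example, pred, trace=None)` argument order follows the convention DSPy uses for metrics, and `SimpleNamespace` objects stand in for DSPy examples and predictions so the snippet runs without a language model.

```python
import re
from collections import Counter
from types import SimpleNamespace


def token_f1(prediction, reference):
    """Token-level F1 between two strings (case-insensitive, punctuation ignored)."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_f1_metric(example, pred, trace=None, threshold=0.5):
    """Hypothetical metric: accept a prediction whose token F1 reaches the threshold."""
    return token_f1(pred.answer, example.answer) >= threshold


# Stand-ins for a DSPy example and two predictions
example = SimpleNamespace(
    answer="You might be interested in the PetFinder.my Adoption Prediction competition."
)
good = SimpleNamespace(answer="Take a look at the PetFinder.my Adoption Prediction competition.")
bad = SimpleNamespace(answer="I am looking for a competition that is not too difficult.")

print(answer_f1_metric(example, good), answer_f1_metric(example, bad))
```

If this behaves as intended on real predictions, it could be passed to the teleprompter as `BootstrapFewShot(metric=answer_f1_metric)`; the threshold would need tuning against actual model output.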


In [62]:
response = compiled_rag(example_query)

print(response.answer)

In [34]:
llm.inspect_history(n=1)




Which Kaggle competition should I look at to learn more about recommender systems in e-commerce?
Which Kaggle competition should I look at to learn more about recommender systems in e-commerce?

I am a beginner in recommender systems and I am looking for a competition that I can learn from.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but also not too easy.

I am looking for a competition that is not too difficult, but




