<a href="https://colab.research.google.com/github/An-Aeonic-Ant/MYRAG/blob/main/MYRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation (RAG) for Personalized Prompts
This project implements a Retrieval-Augmented Generation (RAG) model to generate personalized titles based
on a user's past preferences. RAG combines retrieval systems with generative language models to create more
relevant outputs.

The system retrieves relevant examples from a user's history and uses a pre-trained language model to generate new titles aligned with their style and preferences. The key idea is that previously stored titles, whether generated by the model or edited by the user, reflect the user's preferences. They can therefore be placed in the prompt as examples (called soft prompts in this notebook) to guide the model toward titles of a similar flavour.
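
As a rough illustration of the retrieve-then-prompt idea described above, here is a minimal sketch in plain Python. The helper names `cosine`, `top_matches`, and `build_prompt`, and the toy 2-d vectors, are assumptions made for illustration only; the notebook itself uses sentence embeddings and scikit-learn for this step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_matches(query_vec, history, top_n=3):
    """Rank stored (text, title, vector) rows by similarity to the query."""
    scored = [(text, title, cosine(query_vec, vec)) for text, title, vec in history]
    return sorted(scored, key=lambda row: row[2], reverse=True)[:top_n]

def build_prompt(matches, new_text):
    """Turn retrieved (text, title, score) rows into few-shot examples."""
    shots = "\n".join(f'Input: {text}\n"title": "{title}"' for text, title, _ in matches)
    return f'{shots}\nInput: {new_text}\n"title":'

# Toy 2-d vectors stand in for real sentence embeddings (hypothetical data).
history = [
    ("Milk, Bread, Eggs", "Grocery Payment", [1.0, 0.0]),
    ("Hammer, Nails", "Hardware necessities", [0.0, 1.0]),
]
matches = top_matches([0.9, 0.1], history, top_n=1)
prompt = build_prompt(matches, "Butter, Cheese")
```

The retrieved pairs end up as `Input:` / `"title":` lines ahead of the new input, so the model sees the user's own titling style before it is asked to continue.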

In [None]:
from google.colab import userdata
hugging_face_token=userdata.get('hugging_face')
!huggingface-cli login --token {hugging_face_token}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
The token `gemma` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `gemma`


In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-2-2b-it")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


In [None]:
%load_ext sql

In [None]:
%%sql
sqlite:///rag.db


In [None]:
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Database setup
db_path = "rag.db"
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

cursor.execute('''
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    document_data TEXT,
    title TEXT,
    embedding BLOB
)
''')
conn.commit()

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
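
The `embedding BLOB` column in the schema above holds each vector as raw bytes. The following is a small sketch of that round trip using only the standard library, with `array("f")` standing in for a float32 NumPy array; the table name `vectors` and the in-memory database are assumptions for the demo and are not part of `rag.db`.

```python
import sqlite3
from array import array

# Throwaway in-memory database so the demo does not touch rag.db.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE vectors (id INTEGER PRIMARY KEY, embedding BLOB)")

# 'f' is a 32-bit float; these values are exactly representable in float32.
vector = array("f", [0.25, -1.5, 3.0])
conn_demo.execute("INSERT INTO vectors (embedding) VALUES (?)", (vector.tobytes(),))

# Read the bytes back and decode them with the same element type.
blob = conn_demo.execute("SELECT embedding FROM vectors").fetchone()[0]
restored = array("f")
restored.frombytes(blob)
```

The element type used when decoding has to match the one used when encoding; reading float32 bytes as another type would silently yield wrong values or a vector of the wrong length.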

In [None]:
# Helper functions
def combine_document_features(document):
    return f"Invoice #{document['invoice_number']} on {document['date']} with items {', '.join(document['items'])}"

def add_document(document_data, title):
    doc_text = combine_document_features(document_data)
    embedding = embedding_model.encode([doc_text]).tobytes()
    cursor.execute('''
    INSERT INTO documents (document_data, title, embedding)
    VALUES (?, ?, ?)
    ''', (str(document_data), title, embedding))
    conn.commit()

def retrieve_similar_documents(new_document, top_n=3):
    """Return the top_n most similar stored documents and the query text."""
    new_doc_text = combine_document_features(new_document)
    new_embedding = embedding_model.encode([new_doc_text])  # shape (1, dim)

    cursor.execute('SELECT id, document_data, title, embedding FROM documents')
    results = cursor.fetchall()

    if not results:
        return None, new_doc_text  # No documents exist

    # Decode each BLOB back into a float32 vector (dtype must match what was stored)
    embeddings = np.array([np.frombuffer(row[3], dtype=np.float32) for row in results])
    similarities = cosine_similarity(new_embedding, embeddings)[0]
    # Indices of the most similar documents first
    top_indices = np.argsort(-similarities)[:top_n]

    # Each entry is (document_data, title, similarity)
    return [(results[i][1], results[i][2], similarities[i]) for i in top_indices], new_doc_text

def generate_title(soft_prompt, new_document):
    examples='''Input: AI is revolutionising the world
    "title": "AI revolution"
    Input: chocolate, ice cream, candy.
    "title": "chocolate, ice cream and candy"'''
    if soft_prompt:
        soft_prompt_examples = '\n'.join(
            f'Input: {key}\n"title": "{value}"' for key, value in soft_prompt.items()
        )
    else:
        soft_prompt_examples = examples

    input_prompt = f'''Generate a title based on items in the invoice. Do not generate anything else. Try to keep it in a limit of 30 characters.
    Look at the below examples to understand the task.
    {soft_prompt_examples}
    Now using the examples above, try giving the output for the following input enclosed within the | delimiter.
    | {new_document} |.
    Give the answer as instructed above.'''
    outputs = pipe(input_prompt)
    # Note: generated_text contains the prompt followed by the model's continuation
    return outputs[0]["generated_text"]

def suggest_title(new_document, similarity_threshold=0.9):
    similar_docs, new_doc_text = retrieve_similar_documents(new_document)

    if similar_docs and similar_docs[0][2] >= similarity_threshold:
        # If similarity is above the threshold, suggest the top result
        return f"Suggested Title: {similar_docs[0][1]}"
    # Otherwise, pass the retrieved (document_data, title) pairs as prompt examples
    soft_prompt = {doc: title for doc, title, _ in similar_docs} if similar_docs else None
    generated_title = generate_title(soft_prompt, new_doc_text)
    return f"Generated Title: {generated_title}"

def update_user_feedback_and_save(doc_id, user_title):
    # Update the user title for the existing document
    cursor.execute('''
    UPDATE documents
    SET title = ?
    WHERE id = ?
    ''', (user_title, doc_id))
    conn.commit()
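
The recorded outputs in this notebook show the whole prompt echoed back ahead of the generated title. One possible way to keep only the title is to post-process the text, as in the sketch below. The helper `extract_title` is hypothetical and not part of the notebook, and it assumes the model answers with a `"title": "..."` line as in the prompt examples. The Transformers text-generation pipeline also has a `return_full_text=False` option that may avoid the echo; check it against the installed version before relying on it.

```python
def extract_title(generated_text, prompt):
    """Return only the model's continuation, cleaned up as a title."""
    # Drop the echoed prompt if the output starts with it.
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    # Take the first non-empty line of the continuation.
    for line in generated_text.splitlines():
        line = line.strip()
        if line:
            break
    else:
        return ""
    # Remove an optional '"title":' label and the surrounding quotes.
    if line.lower().startswith('"title":'):
        line = line[len('"title":'):]
    return line.strip().strip('"')

# Shortened stand-in for the real prompt and pipeline output.
prompt = "Give the answer as instructed above."
raw = prompt + '\n    "title": "Clothing Shopping"\n    '
title = extract_title(raw, prompt)
```

Under these assumptions, `generate_title` could return `extract_title(outputs[0]["generated_text"], input_prompt)` instead of the raw text.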




In [None]:
# cursor.execute('DELETE FROM documents')
# conn.commit()


Predicting titles without any user personalization


In [None]:
new_document={"invoice_number": "12352", "date": "2024-12-04", "items": ["watch", "skirt","underwear","pant","jacket"]}
suggested_title = suggest_title(new_document)
print("Suggested Title After Adding More Documents:", suggested_title)

new_document={"invoice_number": "12353", "date": "2024-12-04", "items": ["burger", "chicken-wings","sausage"]}
suggested_title = suggest_title(new_document)
print("Suggested Title After Adding More Documents:", suggested_title)

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


Suggested Title After Adding More Documents: Generated Title: Generate a title based on items in the invoice.DO not generate anything else. Try to keep it in a limit of 30 characters.
    Look at the below examples to understand the task.
    Input: AI is revolutionising the world
    "title": "AI revolution"
    Input: chocolate, ice cream, candy.
    "title": "chocolate, ice cream and candy"
    Now using the examples above, try giving the output for the following input enclosed within the | delimiter.
    | Invoice #12352 on 2024-12-04 with items watch, skirt, underwear, pant, jacket |.
    Give the answer as instructed above.
    "title": "watch, skirt, underwear, pant, jacket" 
    

Suggested Title After Adding More Documents: Generated Title: Generate a title based on items in the invoice.DO not generate anything else. Try to keep it in a limit of 30 characters.
    Look at the below examples to understand the task.
    Input: AI is revolutionising the world
    "title": "AI rev

Adding dummy entries to simulate user preferences


In [None]:
add_document(
    {
        "invoice_number": "12347",
        "date": "2024-12-03",
        "items": ["Milk", "Bread", "Eggs"]
    },
    "Grocery Payment"
);

add_document(
    {
        "invoice_number": "12348",
        "date": "2024-12-04",
        "items": ["Headphones", "Smartwatch", "Phone Charger"]
    },
    "Electronics shopping"
);

add_document(
    {
        "invoice_number": "12349",
        "date": "2024-12-05",
        "items": ["Pizza", "Pasta", "Salad"]
    },
    "Restaurant Bill"
);

add_document(
    {
        "invoice_number": "12350",
        "date": "2024-12-06",
        "items": ["Novel", "Textbook", "Cookbook"]
    },
    "Book Store"
);

add_document(
    {
        "invoice_number": "12351",
        "date": "2024-12-07",
        "items": ["Hammer", "Screwdriver", "Nails"]
    },
    "Hardware necessities"
);

Predicting titles based on the user's existing data

In [None]:
new_document={"invoice_number": "12352", "date": "2024-12-04", "items": ["watch", "skirt","underwear","pant","jacket"]}
suggested_title = suggest_title(new_document)
print("Suggested Title After Adding More Documents:", suggested_title)

new_document={"invoice_number": "12353", "date": "2024-12-04", "items": ["burger", "chicken-wings","sausage"]}
suggested_title = suggest_title(new_document)
print("Suggested Title After Adding More Documents:", suggested_title)

Suggested Title After Adding More Documents: Generated Title: Generate a title based on items in the invoice.DO not generate anything else. Try to keep it in a limit of 30 characters.
    Look at the below examples to understand the task.
    Input: {'invoice_number': '12348', 'date': '2024-12-04', 'items': ['Headphones', 'Smartwatch', 'Phone Charger']}
"title": "Electronics shopping"
Input: {'invoice_number': '12347', 'date': '2024-12-03', 'items': ['Milk', 'Bread', 'Eggs']}
"title": "Grocery Payment"
Input: {'invoice_number': '12349', 'date': '2024-12-05', 'items': ['Pizza', 'Pasta', 'Salad']}
"title": "Restaurant Bill"
    Now using the examples above, try giving the output for the following input enclosed within the | delimiter.
    | Invoice #12352 on 2024-12-04 with items watch, skirt, underwear, pant, jacket |.
    Give the answer as instructed above.
    "title": "Clothing Shopping"
    
Suggested Title After Adding More Documents: Generated Title: