# Private Contextual LLM Generation

In this example, we demonstrate how to build a simple system that tracks individual chat histories and filters them when generating responses. The system uses Retrieval Augmented Generation (RAG) to retrieve content relevant to each user prompt. The retrieved content is restricted to the current user's own history plus a shared knowledge base, which creates a private context between the LLM and each user.

This example leverages the txtai library (https://github.com/neuml/txtai) for embeddings and the transformers (https://huggingface.co/docs/transformers/en/index) library for the LLM.

In [1]:
# If you have multiple GPUs, select the ones to use here; otherwise you can skip this cell
# This example will most likely require multiple GPUs
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3' 

In [2]:
# Imports
from txtai.embeddings import Embeddings
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch



In [3]:
# Load Tokenizer and LLM 
# this may take several minutes if you're running for the first time

tokenizer = AutoTokenizer.from_pretrained("amazon/MistralLite", model_max_length = 2000)
model = AutoModelForCausalLM.from_pretrained("amazon/MistralLite", pad_token_id = tokenizer.eos_token_id, device_map="auto")

#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#model = model.to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.72s/it]


In [4]:
# Generate a baseline response to our test question with no additional context (i.e., no RAG)
# Depending on your hardware this response may take more than a minute to generate

q = "What is the best food to eat in Chicago?"
mistral_q = f"<|prompter|>{q}</s><|assistant|>"

# Move the inputs to the device that holds the model's first layer
inputs = tokenizer(
    mistral_q,
    return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=1000, use_cache=True, do_sample=True,
    temperature=0.2, top_p=0.95)

text = tokenizer.batch_decode(outputs)[0]



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [5]:
print(text)

<s><|prompter|> What is the best food to eat in Chicago?</s><|assistant|> Chicago is known for its diverse and delicious food scene, and there are many options to choose from. Here are some of the best foods to try in Chicago:

1. Deep-dish pizza: Chicago-style pizza is famous for its thick crust, cheese, and toppings. Some of the best places to try deep-dish pizza in Chicago include Lou Malnati's, Giordano's, and Pizzeria Uno.

2. Hot dogs: Chicago-style hot dogs are served on a poppy seed bun with mustard, relish, onions, tomato wedges, pickle spear, sport peppers, and celery salt. The best place to try a Chicago-style hot dog is at Portillo's.

3. Italian beef sandwich: This sandwich is made with thinly sliced roast beef, served on a long roll with au jus and topped with sweet peppers. The best place to try an Italian beef sandwich is at Al's #1 Italian Beef.

4. Chicago-style hot dogs: This is a variation of the regular hot dog, served on a poppy seed bun with mustard, relish, onio
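
The decoded text above still contains the echoed prompt and the special tokens. Below is a minimal sketch of trimming it down to the model's reply, assuming the `<|prompter|>`/`<|assistant|>` template used in this notebook. `extract_reply` is a hypothetical helper name, not part of transformers.

```python
def extract_reply(decoded: str) -> str:
    """Return only the text generated after the final <|assistant|> tag."""
    # Everything before the last assistant tag is the echoed prompt.
    reply = decoded.rsplit("<|assistant|>", 1)[-1]
    # Drop the end-of-sequence marker if generation finished cleanly.
    return reply.replace("</s>", "").strip()


sample = "<s><|prompter|> Hi?</s><|assistant|> Hello there.</s>"
print(extract_reply(sample))
```

In the notebook this could be applied as `print(extract_reply(text))`. If the tag is missing, the function returns the input string with whitespace stripped.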

# Add the RAG solution

In [6]:
# Create the knowledge base
embeddings = Embeddings(
    {"path": "sentence-transformers/multi-qa-mpnet-base-dot-v1", "content": True, "tokenize": True}
)
chat_data = []
user1_context = ["I hate spicy food.", "My favorite type of foods are seafood."]
user2_context = ["I love southern food.", "I am open to trying all types of food."]
shared_knowledge = ["The best seafood restaurant in Chicago is LeTour!", "The best pizza restaurant in Chicago is Pequod's Pizza."]
for d in user1_context:
    chat_data.append({"text": d, "user_id": 1})
for d in user2_context:
    chat_data.append({"text": d, "user_id": 2})

# User 0 is our shared knowledge space
for d in shared_knowledge:
    chat_data.append({"text": d, "user_id": 0})
embeddings.index(chat_data)
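
With more than two users, the per-user loops above could be folded into one helper. The sketch below is one possible shape, not the notebook's original code: `build_records` and its `shared_id` parameter are hypothetical names, and it assumes that user id 0 stays reserved for shared knowledge as in the cell above.

```python
def build_records(user_contexts, shared_knowledge, shared_id=0):
    """Flatten per-user notes and shared facts into {"text", "user_id"} dicts."""
    # The shared id must never belong to a real user, or private notes would leak.
    if shared_id in user_contexts:
        raise ValueError("shared_id collides with a real user id")
    records = []
    for user_id, notes in user_contexts.items():
        records.extend({"text": note, "user_id": user_id} for note in notes)
    records.extend({"text": fact, "user_id": shared_id} for fact in shared_knowledge)
    return records


records = build_records(
    {1: ["I hate spicy food."], 2: ["I love southern food."]},
    ["The best pizza restaurant in Chicago is Pequod's Pizza."],
)
```

The resulting list has the same structure as `chat_data`, so it could be passed to `embeddings.index` in the same way.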


In [7]:
# Retrieve context relevant to the question but filter responses to just one user's chat history and the shared knowledge

user_id = 1
query = f"select * from txtai where similar('{q}') AND (user_id={user_id} or user_id=0) LIMIT 10"
results = embeddings.search(query)
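
The query above interpolates the question directly into the SQL string, so a question containing an apostrophe (for example "What's the best food?") would break the `similar('...')` clause. The sketch below builds the query with a bind parameter instead. It assumes the installed txtai version accepts a `parameters` argument to `embeddings.search`, so check this against your version. `build_private_query` is a hypothetical helper name.

```python
def build_private_query(question, user_id, shared_id=0, limit=10):
    """Return (sql, parameters) limited to one user's rows plus the shared rows."""
    # int() raises ValueError for non-numeric ids, so they cannot alter the SQL text.
    uid, sid, lim = int(user_id), int(shared_id), int(limit)
    sql = (
        "select * from txtai where similar(:question) "
        f"and (user_id = {uid} or user_id = {sid}) limit {lim}"
    )
    # The question travels separately from the SQL text, so quotes need no escaping.
    return sql, {"question": question}


sql, params = build_private_query("What's the best food to eat in Chicago?", 1)

# Assumed usage with the index built earlier in this notebook:
# results = embeddings.search(sql, parameters=params)
```

Keeping the user filter inside the query means rows belonging to other users never reach the prompt.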



In [8]:
# Combine retrieved context with a new prompt for the LLM

fr = " ".join([ r['text'] for r in results])

mistral_qc = f"<|prompter|>With the following additional context: {fr} Provide an answer to the following question: {q}</s><|assistant|>"

# Move the inputs to the device that holds the model's first layer
inputs = tokenizer(
    mistral_qc,
    return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=1000, use_cache=True, do_sample=True,
    temperature=0.2, top_p=0.95)

text = tokenizer.batch_decode(outputs)[0]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [9]:
print(text)


<s><|prompter|> With the following additional context: The best pizza restaurant in Chicago is Pequod's Pizza. The best seafood restaurant in Chicago is LeTour! My favorite type of foods are seafood. I hate spicy food. Provide an answer to the following question: What is the best food to eat in Chicago?</s><|assistant|> The best food to eat in Chicago depends on your personal preferences. If you enjoy seafood, LeTour! is a great option. If you prefer pizza, Pequod's Pizza is highly regarded. However, Chicago is known for its diverse food scene, so there are many other options to explore. Some other popular food choices in Chicago include deep-dish pizza, Chicago-style hot dogs, Italian beef sandwiches, and Chicago-style pizza.</s>
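
The baseline prompt and the RAG prompt differ only in the context prefix, so both could come from a single function. The sketch below is illustrative: `build_prompt` is a hypothetical name, the template strings are copied from the cells above, and `context_rows` is assumed to be a list of dicts with a `text` key, like the search results used earlier.

```python
def build_prompt(question, context_rows=()):
    """Format a MistralLite prompt, optionally prefixed with retrieved context."""
    context = " ".join(row["text"] for row in context_rows)
    if context:
        body = (
            f"With the following additional context: {context} "
            f"Provide an answer to the following question: {question}"
        )
    else:
        # No retrieved rows: fall back to the plain baseline prompt.
        body = question
    return f"<|prompter|>{body}</s><|assistant|>"


rows = [{"text": "I hate spicy food."}, {"text": "My favorite type of foods are seafood."}]
baseline_prompt = build_prompt("What is the best food to eat in Chicago?")
rag_prompt = build_prompt("What is the best food to eat in Chicago?", rows)
```

Rerunning the retrieval with `user_id = 2` and passing those results to the same function would produce a prompt built only from the second user's history and the shared knowledge.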
