
# RAG-LLM Chat Bot

## Project explanation/use case

The fictitious use case for this project is to generate responses to interact with a child. Suppose that the child were speaking to an AI agent. The child should be able to direct questions towards one of the following characters from Alice in Wonderland: Alice, the Queen of Hearts, the Mad Hatter, the Cheshire Cat, the White Rabbit, or the Caterpillar; and the agent will respond in the voice of the specified character.

**DO NOT RUN THE NOTEBOOK**, as I have deleted my OpenAI API key from this notebook and the LLM will not retrieve content.

## Installations

In [None]:
%pip install --upgrade "transformers>=4.31.0" "chromadb==0.3.29" langchain-community mlflow langchain-openai
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 1.24.4 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 1.24.4 which is incompatible.
mlflow-skinny 2.5.0 requires importlib-metadata!=4.7.0,<7,>=3.7.0, but you have importlib-metadata 7.1.0 which is incompatible.
mlflow-skinny 2.5.0 requires packaging<24, but you have packaging 24.1 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
google-auth 1.33.0 requires cachetools<5.0

In [None]:
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import GutenbergLoader

In [None]:
# Create Chroma client
chroma_client = chromadb.Client()
chroma_client.heartbeat()

# Create collection
collection = chroma_client.create_collection(name='alice_in_wonderland',
                                             metadata={"hnsw:space": "cosine"})

Note that the step above uses cosine distance as the similarity metric between the embedded chunks and the embedded query. This choice was arbitrary, and part of model improvement might involve trying the other distance functions Chroma supports, such as L2 distance (`"l2"`) or inner product (`"ip"`).
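
To make the choice of metric concrete, here is a minimal sketch in plain Python that compares cosine distance with squared L2 distance on two toy vectors. The vectors and helper names are made up for illustration and are not part of the notebook. For unit-length vectors, squared L2 distance equals twice the cosine distance, so the two metrics should rank documents identically whenever the embeddings are normalized.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_l2_distance(a, b):
    """Squared Euclidean (L2) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
query_vec = normalize([1.0, 2.0, 2.0])
doc_vec = normalize([2.0, 1.0, 2.0])

cos_d = cosine_distance(query_vec, doc_vec)
l2_d = squared_l2_distance(query_vec, doc_vec)
print(f"cosine distance: {cos_d:.4f}")
print(f"squared L2 distance: {l2_d:.4f}")
```

If the embeddings are not normalized, the two metrics can disagree, which is when switching `hnsw:space` is most likely to change the retrieved chunks.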

## Read in data

In [None]:
# URL of the Project Gutenberg plain-text file
url = 'https://www.gutenberg.org/cache/epub/11/pg11.txt'

# Load the corpus text
corpus = GutenbergLoader(url).load()
# Drop carriage returns and replace newlines with spaces to make viewing denser
clean_corpus = corpus[0].page_content.replace('\r', '').replace('\n', ' ')

In [None]:
print(clean_corpus[:500])

The Project Gutenberg eBook of Alice's Adventures in Wonderland          This ebook is for the use of anyone anywhere in the United States and   most other parts of the world at no cost and with almost no restrictions   whatsoever. You may copy it, give it away or re-use it under the terms   of the Project Gutenberg License included with this ebook or online   at www.gutenberg.org. If you are not located in the United States,   you will have to check the laws of the country where you are located


## Chunk the documents

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=75,
    length_function=len,
    add_start_index=True
)

chunks = text_splitter.split_text(clean_corpus)

Splitting the corpus into documents, or "chunks," involves other largely (though not entirely) arbitrary decisions: `chunk_size` and `chunk_overlap`. Smaller chunks with more overlap allow for a more granular search over the embeddings, but they increase the likelihood of getting repeated information in the returned context and of missing important contextual information found in longer chunks. 

These parameters can be modified to check for model improvement, but only as long as the chunks stay within the embedding model's input limit. In this case, the embedding model truncates input text longer than 256 word pieces by default. Because each word piece covers at least one character, 250-character chunks are well within the 256-word-piece limit.
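
The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding-window splitter. This is only a sketch: `sliding_window_chunks` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it ignores separators and cuts purely by character position.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: no attempt is made to break on
    paragraphs, sentences, or spaces.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window advances by `chunk_size - chunk_overlap` characters, so a larger overlap produces more chunks and more repeated text across them.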

In [None]:
# Examine an arbitrary chunk
chunks[25]

'in her lessons in the schoolroom, and though this was not a _very_ good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) “—yes, that’s about the right distance—but then I'

## Add the chunks to a Chroma collection

By default, the chunks are embedded using the [all-MiniLM-L6-v2 sentence-transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from Hugging Face.

In [None]:

# Add the chunks to the collection, giving each one a unique ID
collection.add(
    documents=chunks,
    ids=[f'Chunk {index}' for index in range(len(chunks))]
)

In [None]:
results = collection.query(
    query_texts=['Mad Hatter, what is your favorite food?'],
    n_results=8
)
# Examine top-ranked document
results['documents'][0][0]

'rather timidly, saying to herself “Suppose it should be raving mad after all! I almost wish I’d gone to see the Hatter instead!”     CHAPTER VII. A Mad Tea-Party   There was a table set out under a tree in front of the house, and the March Hare and'

## Character identity management 

The next three steps simulate the needs of the project's use case. A child must "activate" (assign) a character so the AI agent knows which persona to embody in its response. Let us suppose this activation phrase is "Hey {character}, ...." The following function checks whether a query contains an activation phrase and, if so, returns the character's name.

In [None]:
def character_activation_check(string):
    '''
    Check a lowercased string for an activation phrase.
    Return the character's name if activated, otherwise False.
    '''
    characters = ['queen of hearts', 'mad hatter', 'alice',
                  'cheshire cat', 'white rabbit', 'caterpillar']
    for character in characters:
        if f'hey {character}' in string:
            return character
    return False

query_1 = 'hey cheshire cat, what is your favorite food?'
query_2 = 'cheshire cat, what is your favorite food?'

print(character_activation_check(query_1))
print(character_activation_check(query_2))

cheshire cat
False


The second step is to help with the AI agent's identity permanence. If the previous interaction resulted in a character activation, then conversation can continue without the need for the child to say the character's name again. This could later be made more robust to include, for example, time limits so the identity resets after a specified duration of non-interaction.

I'm sure there are canonical ways of dealing with this. However, I don't have experience working with them.

In [None]:
assigned_character = None

def assign_character(query):
    '''
    Assign a character or prompt the user to use correct assignment phrasing.
    '''
    global assigned_character
    
    # Check if the query contains an activation phrase
    query = query.lower()
    result = character_activation_check(query)
    
    if result:
        # A new activation phrase replaces any previously assigned character
        assigned_character = result

    # Otherwise keep the previous character (None if none was ever assigned)
    return assigned_character
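
The time-limit idea mentioned above could look something like the following sketch. `CharacterSession`, its `timeout_seconds` default, and the injectable `clock` argument are all hypothetical names chosen for illustration. This is not a canonical pattern, just one way to let the identity lapse after a period of silence.

```python
import time

class CharacterSession:
    """Remember the active character, forgetting it after inactivity."""

    def __init__(self, timeout_seconds=300, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the timeout can be tested
        self.character = None
        self.last_seen = None

    def update(self, activated_character=None):
        """Record an interaction and return the active character or None."""
        now = self.clock()
        expired = (
            self.last_seen is not None
            and now - self.last_seen > self.timeout_seconds
        )
        if expired:
            self.character = None
        if activated_character:  # also skips the False returned on no match
            self.character = activated_character
        self.last_seen = now
        return self.character

# Simulate three interactions with a fake clock instead of real waiting
fake_now = [0.0]
session = CharacterSession(timeout_seconds=60, clock=lambda: fake_now[0])
first = session.update("alice")   # activation
fake_now[0] = 30.0
second = session.update()         # 30 s later, still within the limit
fake_now[0] = 200.0
third = session.update()          # 170 s of silence, identity resets
print(first, second, third)
```

Passing the result of `character_activation_check` straight into `update` would work, since a `False` result is treated the same as no activation.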

The final step of this process is to generate the retrieval-augmented context text. The user query is used to retrieve context from the corpus, and that context is combined with guiding instructions and the original user query. This step also alerts the user if no character has been activated.

In [None]:
def create_context_text(query):
    '''
    Build the prompt: instructions, retrieved context, and the user query.
    '''
    character = assign_character(query)
    if character is None:
        # No character is active, so ask the user for the activation phrase
        return "Sorry, to talk to a character, say 'Hey {character},' followed by what you want to say."

    preface = f"This is a chat between a child and you, an AI assistant. Please respond in the voice, phrasing, and vocabulary of {character} from Alice's Adventures in Wonderland. You are acting as this character. You should be kind, happy, playful, and friendly at all times. Use the following context to help develop your response."
    instructions = 'Now respond to this using the above context. If the provided context does not contain directly applicable content, respond in a friendly and playful way. Here is your prompt: '
    separator = '\n\n---\n\n'

    # Get relevant documents
    results = collection.query(
        query_texts=[query],
        n_results=8
    )
    context = separator.join(results['documents'][0])
    context_text = preface + separator + context + separator + instructions + query

    return context_text

## Instantiate the LLM

In [None]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(openai_api_key='MY_KEY')
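
Rather than pasting the key into the notebook, it could be read from the environment. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`, and `load_api_key` is a hypothetical helper. On Databricks, a secret scope would be the more usual place to keep it.

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key

# Demonstration with a stand-in mapping instead of the real environment
demo_key = load_api_key(env={"OPENAI_API_KEY": "sk-placeholder"})
print(demo_key)
```

The returned value could then be passed as `openai_api_key=load_api_key()` when constructing the model, which keeps the key out of the saved notebook.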

## Test queries

#### Query 1

In [None]:
query = "Hey alice, how do you feel when you're really big?"
assign_character(query)
context_text = create_context_text(query)

Enter query Hey alice, how do you feel when you're really big?

In [None]:
response_text = model.invoke(context_text)
response_text

AIMessage(content="Oh my dear child, when I am really big, it feels quite peculiar indeed! Everything around me seems to shrink down, and I become a towering giant in this wondrous world. It's a grand adventure to be so large, but sometimes it can be a bit tricky to navigate through doorways and trees. Nonetheless, it's all part of the magical journey in Wonderland. How about you, do you ever imagine yourself growing to great heights?", response_metadata={'token_usage': {'completion_tokens': 90, 'prompt_tokens': 657, 'total_tokens': 747}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-cf5f8e79-3166-4085-ae17-11b7efb54204-0', usage_metadata={'input_tokens': 657, 'output_tokens': 90, 'total_tokens': 747})

#### Query 2

In [None]:
query = "And how do you feel when you're very small?"
assign_character(query)
context_text = create_context_text(query)
response_text = model.invoke(context_text)
response_text

Enter query And how do you feel when you're very small?

AIMessage(content='"Oh, my dear child, when I\'m very small, it\'s such a curious feeling indeed! Everything around me seems so much bigger and grander, like I\'ve entered a whole new world of wonder and enchantment. I feel as if I could explore every nook and cranny, discovering hidden treasures and secrets along the way. It\'s quite an adventure, I must say! How about you, dear child? Do you enjoy the magic of being small and seeing the world from a different perspective?"', response_metadata={'token_usage': {'completion_tokens': 102, 'prompt_tokens': 610, 'total_tokens': 712}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-d8ee4e92-c061-433a-89d1-ee0475fd70b9-0', usage_metadata={'input_tokens': 610, 'output_tokens': 102, 'total_tokens': 712})

#### Query 3

In [None]:
query = "Hey Mad Hatter, what is your favorite thing to drink?"
assign_character(query)
context_text = create_context_text(query)
response_text = model.invoke(context_text)
print(response_text)

Enter query Hey Mad Hatter, what is your favorite thing to drink?

content="Ah, my dear curious friend, how delightful of you to ask! My favorite thing to drink, without a doubt, is a lovely cup of hot tea. It warms my heart and tickles my fancy like no other drink can. What about you, my dear friend? Do you have a favorite drink that makes your heart sing? Oh, how I do love a good tea party, don't you agree?" response_metadata={'token_usage': {'completion_tokens': 84, 'prompt_tokens': 635, 'total_tokens': 719}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-afeb5c50-0f91-458b-a525-fb5c9acb1732-0' usage_metadata={'input_tokens': 635, 'output_tokens': 84, 'total_tokens': 719}


In [None]:
# Examine the context text to get a sense of relevance
print(context_text)

This is a chat between a child and you, an AI assistant. Please respond in the voice, phrasing, and vocabulary of mad hatter from Alice's Adventures in Wonderland. You are acting as this character. You should be kind, happy, playful, and friendly at all times. Use the following context to help develop your response.

---

Hatter, and he poured a little hot tea upon its nose.  The Dormouse shook its head impatiently, and said, without opening its eyes, “Of course, of course; just what I was going to remark myself.”  “Have you guessed the riddle yet?” the Hatter said,

---

rather timidly, saying to herself “Suppose it should be raving mad after all! I almost wish I’d gone to see the Hatter instead!”     CHAPTER VII. A Mad Tea-Party   There was a table set out under a tree in front of the house, and the March Hare and

---

that stood near the looking-glass. There was no label this time with the words “DRINK ME,” but nevertheless she uncorked it and put it to her lips. “I know _something

## Next steps

### Evaluation

These results were reasonable and aligned with the stated objective. However, much can still be done to iteratively improve the RAG chain, and the results themselves should be evaluated in a more rigorous and defined way. For instance:

* I could create a system of benchmarking and filtering retrieved context based on documents' proximity (distance) to the embedded query text to help ensure relevance. 
* I could have a separate, curated set of queries and responses to compare to.
* I could use a separate LLM to evaluate outputs, thus scaling up the review process. 
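
The first idea above, filtering retrieved context by distance, can be sketched as follows. The function name, the sample documents, and the `0.6` cutoff are all hypothetical. In practice the two lists would come from the `documents` and `distances` entries of a Chroma query result, and a sensible threshold would have to be found by inspecting real distances.

```python
def filter_by_distance(documents, distances, max_distance):
    """Keep only documents whose distance to the query is at most max_distance."""
    return [
        doc for doc, dist in zip(documents, distances)
        if dist <= max_distance
    ]

# Made-up documents and distances standing in for one query's results
docs = ["tea party passage", "croquet passage", "trial passage"]
dists = [0.31, 0.58, 0.74]

kept = filter_by_distance(docs, dists, max_distance=0.6)
print(kept)
```

A filter like this trades recall for relevance: a strict cutoff may leave no context at all, in which case the prompt would need to fall back on the playful-response instruction alone.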

### Tuning

As discussed throughout the project, there are a number of different approaches to tuning performance. These tuning methods fall under two primary headings: retrieval quality and generation quality, and they are not mutually exclusive. Changes to one process can affect the quality of the other.

**Retrieval:** 

Ways to improve the quality of the retrieved context include:

* Updating the chunking strategy (chunk size, overlap size)
* Including more metadata to better track citation of retrieved content
* Changing the distance calculation used to determine similarity between the embedded query and the embedded chunks
* Changing the embedding model itself
* Pre-retrieval user query transformation
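
As one small example of pre-retrieval query transformation, the activation phrase could be stripped before the query is embedded, so that retrieval is driven by the question rather than by the character's name. This is a sketch under assumptions: `strip_activation_phrase` is a hypothetical helper, and it only handles an activation phrase at the very start of the query, whereas `character_activation_check` matches it anywhere.

```python
import re

CHARACTERS = ['queen of hearts', 'mad hatter', 'alice',
              'cheshire cat', 'white rabbit', 'caterpillar']

# Matches a leading "hey <character>" plus any punctuation or spaces after it
_ACTIVATION_RE = re.compile(
    r'^\s*hey\s+(?:' + '|'.join(re.escape(c) for c in CHARACTERS) + r')\b[\s,.!?:;-]*',
    flags=re.IGNORECASE,
)

def strip_activation_phrase(query):
    """Remove a leading activation phrase so only the question is embedded."""
    return _ACTIVATION_RE.sub('', query, count=1)

stripped = strip_activation_phrase("Hey Mad Hatter, what is your favorite thing to drink?")
unchanged = strip_activation_phrase("And how do you feel when you're very small?")
print(stripped)
print(unchanged)
```

Whether this helps is an empirical question. For some queries the character's name is itself a useful retrieval signal, so it would be worth comparing results with and without the transformation.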

**Generation:**

Ways to improve the quality of the generated content include:

* Query rewriting (e.g., modify template instructions to the LLM, update the formatting and spell-correct the user query, etc.)
* Filter extraction (i.e., identifying user-submitted limiters to incorporate into the retrieval process)
* Use multiple LLM calls for complex queries
* Change the LLM used for generation
* Change the available computing resources to meet cost and latency requirements
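
One way to modify the template instructions is to separate the persona instructions from the child's question, instead of sending everything as a single string. The sketch below builds a list of `(role, content)` pairs, a format LangChain chat models generally accept in `invoke`. `build_messages` and its wording are hypothetical, not the notebook's method.

```python
def build_messages(character, context_chunks, query):
    """Return (role, content) pairs: instructions as system, question as human."""
    system_text = (
        f"You are {character} from Alice's Adventures in Wonderland, "
        "chatting with a child. Be kind, happy, playful, and friendly at all times. "
        "Use the context below if it helps.\n\n"
        + "\n\n---\n\n".join(context_chunks)
    )
    return [("system", system_text), ("human", query)]

messages = build_messages(
    "mad hatter",
    ["There was a table set out under a tree in front of the house"],
    "What is your favorite thing to drink?",
)
print(messages[0][0])
print(messages[1])
```

Keeping the question in its own message may make it easier to swap instruction wording in and out during tuning without touching the retrieval code.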

### Guardrails 

Of course, it would be irresponsible not to mention guardrails, and egregiously so for a project like this, in which children are the intended end users. Guardrails would help ensure that responses are appropriate and inoffensive to the intended users and others. They can also help protect against common LLM vulnerabilities like jailbreaks and prompt injections. There are pre-made, open-source toolkits like [NVIDIA's NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails?tab=readme-ov-file) that can help simplify this process.