**Homework 12** - **Part A: Code Chatbot**

**Building RAG Chatbots with LangChain**

### Part A Goal

Build a code-understanding chatbot using LangChain, OpenAI, and the Pinecone vector database. The chatbot learns from external data using **Retrieval Augmented Generation (RAG)**.

By uploading notebook files from a previous assignment, we will be able to ask questions that use the code file as context.

By the end of this example we will have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on a knowledge base.

### Prerequisites

Install the following Python libraries:

- **langchain**: This is a framework for building applications on top of language models. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

**NOTE**: *OpenAI dataloaders can be difficult to run locally on on-prem devices. To simplify the use of these loaders, an online notebook environment such as Google Colab is recommended.*

In [3]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.1.0 \
    tiktoken==0.5.2


### BACKGROUND: Building a Chatbot (no RAG)

We will be using the LangChain library for our chatbot. We initialize a `ChatOpenAI` object using the API key stored in the `OPENAI_API_KEY` environment variable.

In [4]:
import os
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)



Conversations with chat models are typically structured in plain text in the format below:

```
System: You are a helpful assistant.

User: Hi AI, how are you today?

Assistant: I'm great thank you. How can I help you?

User: I'd like to understand string theory.

Assistant:
```

The final `"Assistant:"` without a response is what would prompt the model to continue the conversation. In the official OpenAI `ChatCompletion` endpoint these would be passed to the model in a format like:

```python
[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."}
]
```
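
To make the relationship between the two representations concrete, here is a minimal sketch that renders a list of role/content dictionaries into the plain-text transcript shown earlier. `render_transcript` is a hypothetical helper written only for illustration; it is not part of the OpenAI client or LangChain.

```python
def render_transcript(messages):
    """Render role/content dicts as a plain-text chat transcript.

    Hypothetical helper for illustration; not an OpenAI or LangChain API.
    """
    # Map API role names to the speaker labels used in the transcript.
    labels = {"system": "System", "user": "User", "assistant": "Assistant"}
    lines = [f"{labels[m['role']]}: {m['content']}" for m in messages]
    # The trailing "Assistant:" cue is where the model continues the conversation.
    lines.append("Assistant:")
    return "\n\n".join(lines)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi AI, how are you today?"},
    {"role": "assistant", "content": "I'm great thank you. How can I help you?"},
    {"role": "user", "content": "I'd like to understand string theory."},
]

print(render_transcript(example_messages))
```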

In LangChain there is a slightly different format. We use three _message_ objects like so:

In [9]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful expert data scientist python coder."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand pytorch and Langchain in two paragraphs.")
]

The format is very similar; we're just swapping the role of `"user"` for `HumanMessage`, and the role of `"assistant"` for `AIMessage`.

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [10]:
res = chat(messages)
res

AIMessage(content='PyTorch is an open-source machine learning library for Python that provides a flexible and dynamic computational graph system, which allows for easy and efficient building of deep learning models. It offers a wide range of tools and modules that make it easier to create and train neural networks, such as layers, optimizers, and loss functions. PyTorch is known for its simplicity and ease of use, making it a popular choice among researchers and practitioners in the deep learning community.\n\nLangchain, on the other hand, is a domain-specific language (DSL) designed for creating and training neural networks. It aims to simplify the process of developing deep learning models by providing a high-level abstraction that allows users to define their models using a more human-readable syntax. Langchain abstracts away the complexities of low-level programming and focuses on the essentials of building neural networks, making it easier for developers to experiment with differe

In response we get another AI message object. We can print it more clearly like so:

In [11]:
print(res.content)

PyTorch is an open-source machine learning library for Python that provides a flexible and dynamic computational graph system, which allows for easy and efficient building of deep learning models. It offers a wide range of tools and modules that make it easier to create and train neural networks, such as layers, optimizers, and loss functions. PyTorch is known for its simplicity and ease of use, making it a popular choice among researchers and practitioners in the deep learning community.

Langchain, on the other hand, is a domain-specific language (DSL) designed for creating and training neural networks. It aims to simplify the process of developing deep learning models by providing a high-level abstraction that allows users to define their models using a more human-readable syntax. Langchain abstracts away the complexities of low-level programming and focuses on the essentials of building neural networks, making it easier for developers to experiment with different architectures and 

### Stringing Messages for a Conversation
Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [12]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Why do data scientists believe it can produce general artificial intelligence?"
)
# add to messages
messages.append(prompt)

# send to chat-gpt
res = chat(messages)

print(res.content)

Data scientists believe that artificial intelligence (AI) has the potential to achieve general intelligence due to its ability to learn from large amounts of data and adapt to new tasks and environments. Machine learning algorithms, such as deep learning models, have shown impressive capabilities in performing complex tasks, such as image recognition, natural language processing, and game playing, which were previously thought to require human-level intelligence.

Moreover, the advancements in AI research, particularly in areas like reinforcement learning, transfer learning, and meta-learning, have shown promising results in enabling AI systems to generalize their knowledge and skills across different domains. By leveraging these techniques and combining them with large-scale data and computational resources, data scientists believe that it is possible to develop AI systems that exhibit human-like intelligence and can perform a wide range of cognitive tasks.

While achieving general ar

### Dealing with Hallucinations

We have our chatbot, but the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, like about the new (and very popular) Llama 3 LLM.

In [13]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Llama 3?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [14]:
print(res.content)

As of my last update, there is no specific information available about a technology or product called "Llama 3." It is possible that it may refer to a new or upcoming technology, software, device, or concept that has been developed since then. If you have any specific details or context about Llama 3 that you would like me to explore further, please provide more information so that I can assist you better.


Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It is very clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it _does_ know the answer, and this can be very hard to detect.

The same limitation appears when we ask about a specific LangChain component, `LLMChain`:

In [15]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me more about the LLMChain in LangChain?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [16]:
print(res.content)

I apologize for the confusion earlier. As of my last update, there is no specific information available about a technology or concept called "LLMChain" in relation to LangChain or any other known system. It is possible that it may refer to a new or emerging concept in the field of blockchain or distributed ledger technology that has been developed since then. If you have any specific details or context about LLMChain in LangChain that you would like me to explore further, please provide more information so that I can assist you better.


### Feed the LLM More Data Manually [Not scalable]
There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the LLMChain question. We can take a description of this object from the LangChain documentation.

In [17]:
llmchain_information = [
    "A LLMChain is the most common type of chain. It consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, uses the PromptTemplate to format them into a prompt. It then passes that to the model. Finally, it uses the OutputParser (if provided) to parse the output of the LLM into a final format.",
    "Chains is an incredibly generic concept which returns to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.",
    "LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an api, but will also: (1) Be data-aware: connect a language model to other sources of data, (2) Be agentic: Allow a language model to interact with its environment. As such, the LangChain framework is designed with the objective in mind to enable those types of applications."
]

source_knowledge = "\n".join(llmchain_information)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [18]:
query = "Can you tell me about the LLMChain in LangChain?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

Now we feed this into our chatbot as we were before.

In [19]:
# create a new user prompt
prompt = HumanMessage(
    content=augmented_prompt
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [20]:
print(res.content)

In LangChain, the LLMChain is a common type of chain that plays a crucial role in leveraging language models within the framework. The LLMChain comprises a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain is designed to handle multiple input variables, where the PromptTemplate is used to format these inputs into a prompt. The processed prompt is then passed to the model for generating an output. Additionally, the LLMChain can utilize an OutputParser, if provided, to further refine and parse the output generated by the language model into a final format. Overall, the LLMChain within LangChain facilitates the seamless integration of language models into applications, enhancing their functionality and capabilities in processing and interacting with data and environments.


The quality of this answer is phenomenal. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?

We learned in the previous chapters about Pinecone and vector databases. Well, they can help us here too. But first, we'll need a dataset.

### Importing the Data

In [22]:
# !pip install pypdf

In [23]:
from langchain_community.document_loaders import PyPDFLoader

#load pdf files
loader = PyPDFLoader('AnbuValluvan_Devadasan_Assignment9_Transformer_German_to_English.pdf')
data = loader.load()
print(data)



In [24]:
data

[Document(page_content='AnbuV alluvan_Devadasan_Assignment9_T ransformer_German_to_English\nMay 8, 2024\nStep1. Run the demo and train a model on the original German-to-English training\nset.\n[1]:%matplotlib inline\n1 Data Sourcing and Processing\nW e will use Multi30k dataset from torchtext library that yields a pair of source-target raw sentences.\nT o access torchtext datasets, please install torchdata following instructions at https://github.\ncom/pytorch/data .\n[2]: from torchtext .data .utils importget_tokenizer\nfrom torchtext .vocab importbuild_vocab_from_iterator\nfrom torchtext .datasets importmulti30k, Multi30k\nfrom typing importIterable, List\nmulti30k .URL["train"]="https://raw.githubusercontent.com/neychev/\n↪small_DL_repo/master/datasets/Multi30k/training.tar.gz "\nmulti30k .URL["valid"]="https://raw.githubusercontent.com/neychev/\n↪small_DL_repo/master/datasets/Multi30k/validation.tar.gz "\nSRC_LANGUAGE =\'de\'\nTGT_LANGUAGE =\'en\'\n# Place-holders\ntoken_transform 

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split text data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=25)
text_chunks = text_splitter.split_documents(data)
print(len(text_chunks))

28
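
In the splitter above, `chunk_size` caps the number of characters per chunk and `chunk_overlap` is the number of characters that consecutive chunks may share. The sketch below is a deliberately simplified fixed-window splitter meant only to illustrate those two parameters. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks), and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size=1024, chunk_overlap=25):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 5  # 50 characters
chunks = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)
for c in chunks:
    print(len(c), c)
```

A small overlap like the 25 characters used here helps a statement that straddles a chunk boundary appear in full in at least one chunk.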


In [26]:
# check the chunks
text_chunks[2]

Document(page_content='[7]:token_transform[SRC_LANGUAGE] =get_tokenizer( \'spacy\',␣\n↪language =\'de_core_news_sm \')\ntoken_transform[TGT_LANGUAGE] =get_tokenizer( \'spacy\',␣\n↪language =\'en_core_web_sm \')\n# helper function to yield list of tokens\ndefyield_tokens (data_iter: Iterable, language: str)->List[str]:\nlanguage_index ={SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}\nfordata_sample indata_iter:\nyieldtoken_transform[language](data_sample[language_index[language]])\n# Define special symbols and indices\nUNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX =0,1,2,3\n# Make sure the tokens are in order of their indices to properly insert them in ␣\n↪vocab\nspecial_symbols =[\'<unk>\',\'<pad>\',\'<bos>\',\'<eos>\']\n"""\nPurpose: Defines special tokens used in the vocabulary for machine learning ␣\n↪tasks with text data.\nSpecial Tokens:\n<unk>: "Unknown" token (represents words not in the vocabulary)\n<pad>: Padding token (to make sequences the same length)\n<bos>: "Beginning of Sequence"\n<eos>: "End o

In [27]:
# reformat chunks to improve vectorization; match 'jamescalam/llama-2-arxiv-papers-chunked' format sourced from Llama 2 ArXiv papers on huggingface
dataset = []

for i, chunk in enumerate(text_chunks):
    dataset.append({
        'doi': '',  # you can add a DOI here if available
        'chunk-id': str(i),
        'chunk': chunk,
        'id': '',  # you can add an ID here if available
        'title': '',  # you can add a title here if available
        'summary': '',  # you can add a summary here if available
        'source': '',  # you can add a source here if available
        'authors': [],  # you can add authors here if available
        'categories': [],  # you can add categories here if available
        'comment': '',  # you can add a comment here if available
        'journal_ref': None,  # you can add a journal reference here if available
        'primary_category': '',  # you can add a primary category here if available
        'published': '',  # you can add a published date here if available
        'updated': '',  # you can add an updated date here if available
        'references': []  # you can add references here if available
    })

print(dataset[3])

{'doi': '', 'chunk-id': '3', 'chunk': Document(page_content="# Training data Iterator\ntrain_iter =Multi30k(split ='train', language_pair =(SRC_LANGUAGE, ␣\n↪TGT_LANGUAGE))\n# Create torchtext's Vocab object\nvocab_transform[ln] =build_vocab_from_iterator(yield_tokens(train_iter, ␣\n↪ln),\nmin_freq =1,\nspecials =special_symbols,\nspecial_first =True)\n# Set ``UNK_IDX`` as the default index. This index is returned when the token ␣\n↪is not found.\n# If not set, it throws ``RuntimeError`` when the queried token is not found in ␣\n↪the Vocabulary.\nforln in[SRC_LANGUAGE, TGT_LANGUAGE]:\nvocab_transform[ln] .set_default_index(UNK_IDX)\n2", metadata={'source': 'AnbuValluvan_Devadasan_Assignment9_Transformer_German_to_English.pdf', 'page': 1}), 'id': '', 'title': '', 'summary': '', 'source': '', 'authors': [], 'categories': [], 'comment': '', 'journal_ref': None, 'primary_category': '', 'published': '', 'updated': '', 'references': []}


#### Dataset Overview

The dataset used is the Transformers assignment from DATA_255.

Because most **Large Language Models** (**LLMs**) only contain knowledge of the world as it was during training, they cannot answer our questions about the assignment's code without example data.

### Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [28]:
from pinecone import Pinecone

# Initialize connection to Pinecone using the environment variable
api_key = os.environ.get("PINECONE_API_KEY")
pc = Pinecone(api_key=api_key)




Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [29]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [30]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
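
The `metric='dotproduct'` argument tells the index how to score stored vectors against a query vector. As a rough illustration of what that scoring means (not of how Pinecone implements search internally), the toy sketch below ranks a few hand-made 3-dimensional vectors by dot product. All names and values here are invented for the example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, stored, k=2):
    """Return the k (id, score) pairs with the highest dot-product score."""
    scored = [(vec_id, dot(query_vec, vec)) for vec_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for embedded chunks; real embeddings have 1536 dimensions.
toy_index = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.6, 0.8, 0.0],
    "chunk-2": [0.0, 0.0, 1.0],
}
toy_query = [0.8, 0.6, 0.0]

print(top_k(toy_query, toy_index, k=2))
```

OpenAI embeddings are normalized to unit length, so for them ranking by dot product gives the same order as ranking by cosine similarity.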

Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [31]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")



Using this model we can create embeddings like so:

In [32]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.

We're now ready to embed and index all of our data! We do this by looping through our dataset, embedding and inserting everything in batches.

**NOTE**: *Ensure that chunks are strings and that they are correctly assigned to metadata (use the `.page_content` attribute for this).*

In [33]:
import pandas as pd
from tqdm.auto import tqdm  # for progress bar

data = pd.DataFrame(dataset) # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed (the chunk text itself, not the Document repr)
    texts = [x['chunk'].page_content for _, x in batch.iterrows()]

    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'].page_content,
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))

100%|██████████| 1/1 [00:02<00:00,  2.02s/it]
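
In the loop above, ids are built from `doi` and `chunk-id`. Since `doi` is empty in this dataset, the ids come out as `-0`, `-1`, and so on. If a second document were indexed the same way, its ids would collide with these, and an upsert with an existing id overwrites the stored vector. One possible alternative, sketched below under the assumption that the source file name is unique per document, derives ids from the file name. `make_chunk_ids`, `batched`, and the file name `assignment9.pdf` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def make_chunk_ids(source_path, n_chunks):
    """Build ids such as 'assignment9-0', 'assignment9-1' from a file name."""
    stem = Path(source_path).stem
    return [f"{stem}-{i}" for i in range(n_chunks)]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder file name and chunk count, for illustration only.
ids = make_chunk_ids("assignment9.pdf", 5)
for batch in batched(ids, batch_size=2):
    print(batch)
```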


We can check that the vector index has been populated using `describe_index_stats` like before:

In [34]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 28}},
 'total_vector_count': 28}

#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [35]:
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)



Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about the assignment.

In [36]:
query = "Did they use Pytorch or Tensorflow in the Assignment?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='AnbuV alluvan_Devadasan_Assignment9_T ransformer_German_to_English\nMay 8, 2024\nStep1. Run the demo and train a model on the original German-to-English training\nset.\n[1]:%matplotlib inline\n1 Data Sourcing and Processing\nW e will use Multi30k dataset from torchtext library that yields a pair of source-target raw sentences.\nT o access torchtext datasets, please install torchdata following instructions at https://github.\ncom/pytorch/data .\n[2]: from torchtext .data .utils importget_tokenizer\nfrom torchtext .vocab importbuild_vocab_from_iterator\nfrom torchtext .datasets importmulti30k, Multi30k\nfrom typing importIterable, List\nmulti30k .URL["train"]="https://raw.githubusercontent.com/neychev/\n↪small_DL_repo/master/datasets/Multi30k/training.tar.gz "\nmulti30k .URL["valid"]="https://raw.githubusercontent.com/neychev/\n↪small_DL_repo/master/datasets/Multi30k/validation.tar.gz "\nSRC_LANGUAGE =\'de\'\nTGT_LANGUAGE =\'en\'\n# Place-holders\ntoken_transform 

We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to connect the output from our `vectorstore` to our `chat` chatbot. To do that we can use the same logic as we used earlier.

In [37]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [38]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    AnbuV alluvan_Devadasan_Assignment9_T ransformer_German_to_English
May 8, 2024
Step1. Run the demo and train a model on the original German-to-English training
set.
[1]:%matplotlib inline
1 Data Sourcing and Processing
W e will use Multi30k dataset from torchtext library that yields a pair of source-target raw sentences.
T o access torchtext datasets, please install torchdata following instructions at https://github.
com/pytorch/data .
[2]: from torchtext .data .utils importget_tokenizer
from torchtext .vocab importbuild_vocab_from_iterator
from torchtext .datasets importmulti30k, Multi30k
from typing importIterable, List
multi30k .URL["train"]="https://raw.githubusercontent.com/neychev/
↪small_DL_repo/master/datasets/Multi30k/training.tar.gz "
multi30k .URL["valid"]="https://raw.githubusercontent.com/neychev/
↪small_DL_repo/master/datasets/Multi30k/validation.tar.gz "
SRC_LANGUAGE ='de'
TGT_LANGUAGE ='en'
# Place-holders
t
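
The printed prompt above carries the four-space indentation of the function body into the "Contexts:" and "Query:" lines. This is probably harmless, but it adds stray whitespace to every prompt. A possible variant, sketched here with a hypothetical `build_augmented_prompt` that takes the retrieved texts as a plain list so it can be tried without a vector store, assembles the prompt line by line instead:

```python
def build_augmented_prompt(contexts, query):
    """Assemble the RAG prompt from context strings and a query."""
    source_knowledge = "\n".join(contexts)
    parts = [
        "Using the contexts below, answer the query.",
        "",
        "Contexts:",
        source_knowledge,
        "",
        f"Query: {query}",
    ]
    return "\n".join(parts)


# Placeholder contexts standing in for retrieved chunks.
demo = build_augmented_prompt(
    ["first retrieved chunk", "second retrieved chunk"],
    "Did they use Pytorch or Tensorflow in the Assignment?",
)
print(demo)
```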

In [39]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)

They used PyTorch in the assignment. This can be inferred from the code snippets provided in the context, where they imported and utilized PyTorch modules and classes such as `torch`, `torch.nn`, and `nn`. The specific lines of code indicate the use of PyTorch for building and training the model for German-to-English translation.


We can continue with more questions about the Assignment. Let's try _without_ RAG first:

In [40]:
prompt = HumanMessage(
    content="Which dataset did they use in the Assignment?"
)

res = chat(messages + [prompt])
print(res.content)

In the Assignment described, they used the Multi30k dataset from the torchtext library. The Multi30k dataset provides a pair of source-target raw sentences for training machine translation models from German to English. The dataset was accessed using torchtext datasets and was used to train a model for German-to-English translation.


The chatbot is able to respond about the assignment thanks to its conversational history stored in `messages`.
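
The cells in this part call `chat(messages + [prompt])` rather than appending the prompt to `messages`, so the new prompt is sent along with the history without being stored in it. The short sketch below shows the difference using plain strings as stand-ins for message objects:

```python
history = ["system message", "first question", "first answer"]
new_prompt = "follow-up question"

# Concatenation builds a new list for this one call; history is untouched.
one_off = history + [new_prompt]
print(len(history), len(one_off))

# append() mutates history, so every later call would also see the prompt.
history.append(new_prompt)
print(len(history))
```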

In [41]:
prompt = HumanMessage(
    content=augment_prompt(
        "Which deep learning model did they use in the assignment?"
    )
)

res = chat(messages + [prompt])
print(res.content)

In the assignment "Transformer_German_to_English" mentioned in the context, they used a deep learning model called "Seq2SeqTransformer." This model was trained and utilized for translating text from German to English.


In [42]:
prompt = HumanMessage(
    content=augment_prompt(
        "Show the code that built the Seq2SeqTransformer"
    )
)

res = chat(messages + [prompt])
print(res.content)

Here is the code snippet that defines the `Seq2SeqTransformer` class in the context provided:

```python
# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
      

In [43]:
prompt = HumanMessage(
    content=augment_prompt(
        "Can you talk about the transformer architecture in the Assignment?"
    )
)

res = chat(messages + [prompt])
print(res.content)

The transformer architecture used in the assignment is a Seq2SeqTransformer model. This model consists of several key components, including the Transformer module, positional encoding, token embedding, and a linear generator layer. The Transformer module is responsible for processing the input source and target embeddings using multi-head self-attention mechanisms and feedforward neural networks. It consists of encoder and decoder layers with specified parameters such as the number of layers, feedforward dimension, and dropout rate.

The positional encoding is added to the token embeddings to introduce a notion of word order in the model. It helps the model capture the sequential information of the input sentences. The token embedding layers for both the source and target languages are used to convert token indices into dense vectors suitable for processing by the Transformer module.

Additionally, the linear generator layer is responsible for mapping the output embeddings from the Tra

In [44]:
prompt = HumanMessage(
    content=augment_prompt(
        "Did they use any Transfer learning in the assignment? If yes, where?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided context, the assignment involved training a model on the original German-to-English training set using PyTorch. The assignment focused on using the Multi30k dataset from the torchtext library, implementing a Seq2SeqTransformer model for the translation task, and addressing techniques to enhance model performance such as training on a larger and more diverse dataset.

However, there is no specific mention of transfer learning being used in the assignment within the provided context. Transfer learning typically involves utilizing pre-trained models on a related task to improve performance on a new task, but without explicit information in the context regarding the use of transfer learning, it is not possible to confirm whether it was employed in this particular assignment.


In [45]:
prompt = HumanMessage(
    content=augment_prompt(
        "Show the input and the prediction"
    )
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided context, the input and prediction for the translation task using the Seq2SeqTransformer model in the German-to-English translation assignment are as follows:

Input: "Eine Gruppe von Menschen steht vor einem Iglu."
Prediction: "A group of people standing in front an igloo."

Input: "Das Auto steht vor dem Haus."
Prediction: "The car is standing in front of the house."

Input: "Die Katze liegt auf dem Sofa."
Prediction: "The cat is laying on the couch."

These are examples of input sentences in German and their corresponding translations predicted by the Seq2SeqTransformer model trained on the German-to-English training set.


With RAG, the responses draw on details retrieved from the assignment itself, such as the `Seq2SeqTransformer` class and its code, which the model could not know from its training data alone.

**Observations and Limitations:**
- While the RAG model offered more comprehensive insights derived from textual content, the LLM struggled to generate code examples based on embeddings.
- PDF conversion of notebooks often includes special characters, whose removal is likely to enhance response quality.
- Chunking format ensures proper data loading and ingestion.
- Adding prompts and responses to messages expands content, enabling the chatbot to engage in conversation.
- Adjusting system parameters influences the results of the LLM.
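
As a sketch of the special-character cleanup mentioned in the observations, the function below removes the line-wrap markers visible in the extracted text above (`␣` followed by a newline and `↪`). It assumes that `␣` before a wrap stands for a space and that a bare `↪` marks a wrap inside a token such as a URL, which matches the output shown but has not been verified against every page. `clean_pdf_text` is a hypothetical helper, and fused tokens such as `importget_tokenizer` are not addressed.

```python
import re


def clean_pdf_text(text):
    """Remove line-wrap markers left by notebook-to-PDF conversion."""
    # "␣\n↪" marks a wrap at a space: rejoin the line with a single space.
    text = re.sub(r"\s*␣\s*\n\s*↪\s*", " ", text)
    # "\n↪" without "␣" marks a wrap inside a token: rejoin directly.
    text = re.sub(r"\n↪", "", text)
    # Drop any stray markers that remain.
    return text.replace("␣", "").replace("↪", "")


sample = "get_tokenizer( 'spacy',␣\n↪language ='de')"
print(clean_pdf_text(sample))
```

Such a function could be applied to each chunk's `page_content` before embedding, so that the stored text and the embedded text are both cleaned.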

Delete the index when finished to free resources and avoid unnecessary charges:

In [46]:
pc.delete_index(index_name)

References:

1. DATA_255 RAG Chatbot with LangChain demo.

2. [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io)

3. [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings/use-cases)

4. [RAGs with OpenAI](https://cookbook.openai.com/examples/parse_pdf_docs_for_rag)

5. https://github.com/jrgosalvez/data255_DL

---