**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Build a Simple RAG System

## Install OpenAI and LangChain dependencies

In [2]:
%run setup.ipynb

In [6]:
from pprint import pprint

### OpenAI Embedding Models

LangChain gives us access to OpenAI embedding models, including the newest ones: the smaller, highly efficient `text-embedding-3-small` and the larger, more powerful `text-embedding-3-large`.

In [41]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates

openai_embed_model = OpenAIEmbeddings(model="text-embedding-3-small")

text = "Hello, world!"

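# Note: embed_documents expects a list of strings; passing a bare string
# here is a mistake, and its effect is explained below.
embedding = openai_embed_model.embed_documents(text)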

print(embedding)

[[0.027492856606841087, 0.001843890524469316, 0.015607842244207859, 0.05215075612068176, 0.012032992206513882, 0.01713435724377632, -0.01124636922031641, 0.00863727368414402, -0.009159092791378498, -0.06486133486032486, -0.0009935006964951754, 0.009540721774101257, -0.08261875808238983, -0.008979961276054382, 0.009922350756824017, 0.016480136662721634, -0.015397557057440281, 0.008481507189571857, 0.005564772058278322, 0.05878642201423645, -0.0074300807900726795, 0.015459863469004631, 0.03950248286128044, 0.026028648018836975, -0.003847442101687193, 5.3210213081911206e-05, -0.009229187853634357, -0.00957187544554472, 0.05000117048621178, -0.0231002289801836, 0.050281550735235214, -0.02685421146452427, 0.031885482370853424, -0.00673691788688302, -0.011277522891759872, 0.029159560799598694, -0.02300676889717579, 0.0509357713162899, -0.00034414746914990246, 0.037352900952100754, -0.00795968808233738, 0.011285311542451382, -0.012702790088951588, -0.015311885625123978, 0.049751944839954376, 

In [None]:
len(embedding)


13

### Why is the length 13?
- The string "Hello, world!" contains exactly 13 characters, including the comma, space, and exclamation mark:
- H-e-l-l-o-,-[space]-w-o-r-l-d-!
- `embed_documents` expects a list of strings. Given a bare string, it iterates over it character by character, so one embedding vector is created for each character and the result has length 13.
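
The behavior comes from Python itself rather than from anything specific to OpenAI: a string is an iterable of characters, so code that loops over its argument sees 13 one-character "documents". A minimal sketch, where `count_inputs` is a hypothetical stand-in for the embedding call and makes no API request:

```python
def count_inputs(texts):
    """Hypothetical stand-in for a batch embedding call.

    Returns how many items the call would embed when it iterates over `texts`.
    """
    return len([item for item in texts])


single_string = "Hello, world!"

as_string = count_inputs(single_string)    # iterates over 13 characters
as_list = count_inputs([single_string])    # iterates over 1 document

print(as_string)  # 13
print(as_list)    # 1
```

Wrapping the text in a list is therefore enough to get one vector per document.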

### The Correct Way
Here are the proper ways to use OpenAI embeddings:

For a single document:


In [None]:
# Option 1: Use embed_query for single text
embedding = openai_embed_model.embed_query("Hello, world!")
print(f"Length: {len(embedding)}")  # This will be 1536 (for text-embedding-3-small)

# Option 2: Use embed_documents with a list
embedding = openai_embed_model.embed_documents(["Hello, world!"])
print(f"Length: {len(embedding)}")  # This will be 1 (one embedding vector)
print(f"Vector dimension: {len(embedding[0])}")  # This will be 1536

Length: 1536
Length: 1
Vector dimension: 1536


For multiple documents:


In [None]:
texts = ["Hello, world!", "How are you?", "This is a test."]
embeddings_list = openai_embed_model.embed_documents(texts)
print(f"Number of embeddings: {len(embeddings_list)}")  # 3
print(f"Each embedding dimension: {len(embeddings_list[0])}")  # 1536

Number of embeddings: 3
Each embedding dimension: 1536


### What You're Actually Seeing
Each embedding vector from OpenAI's `text-embedding-3-small` model has 1536 dimensions, not 13. The 13 seen earlier is the number of per-character embeddings produced by passing a bare string to `embed_documents`.
Use `openai_embed_model.embed_query("Hello, world!")` instead to get a single embedding vector with 1536 dimensions.

### Load and Process JSON Documents

In [19]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='../../docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)

wiki_docs = loader.load()

print(wiki_docs[0].page_content)

{"id": "84801", "title": "Chinese New Year", "paragraphs": ["Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.", "The Chinese New Year is of the most important holidays for Chinese people all over the world. Its 7th day used to be used instead of birthdays to count people's ages in China. The holiday is still used to tell people which \"animal\" of the Chinese zodiac they are part of. The holiday is a time for gifts to children and for family gatherings with large meals, just like Christmas in Europe and in other Christian areas. Unlike Christmas, the children usually get gifts of c

In [20]:
len(wiki_docs)

1801

In [23]:
wiki_docs[3].page_content

'{"id": "71548", "title": "Chi-square distribution", "paragraphs": ["In probability theory and statistics, the chi-square distribution (also chi-squared or formula_1\\u00a0 distribution) is one of the most widely used theoretical probability distributions. Chi-square distribution with formula_2 degrees of freedom is written as formula_3. It is a special case of gamma distribution.", "Chi-square distribution is primarily used in statistical significance tests and confidence intervals. It is useful, because it is relatively easy to show that certain probability distributions come close to it, under certain conditions. One of these conditions is that the null hypothesis must be true. Another one is that the different random variables (or observations) must be independent of each other."]}'

In [28]:
import json
from langchain_core.documents import Document
wiki_docs_processed = []

for doc in wiki_docs:
    doc = json.loads(doc.page_content)
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
        "source": "Wikipedia"
    }
    data = ' '.join(doc['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))

In [30]:
print(wiki_docs_processed[3])

page_content='In probability theory and statistics, the chi-square distribution (also chi-squared or formula_1  distribution) is one of the most widely used theoretical probability distributions. Chi-square distribution with formula_2 degrees of freedom is written as formula_3. It is a special case of gamma distribution. Chi-square distribution is primarily used in statistical significance tests and confidence intervals. It is useful, because it is relatively easy to show that certain probability distributions come close to it, under certain conditions. One of these conditions is that the null hypothesis must be true. Another one is that the different random variables (or observations) must be independent of each other.' metadata={'title': 'Chi-square distribution', 'id': '71548', 'source': 'Wikipedia'}


### Load and Process PDF documents

#### Create Simple Document Chunks for Standard Retrieval

Here we use simple chunking: each chunk is at most 3500 characters, with an overlap of 200 characters between neighboring chunks so that text near a chunk boundary is not cut off from its context. You can and should experiment with different chunk sizes and overlaps.
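
To make the size and overlap parameters concrete, here is a minimal sketch of a plain sliding-window chunker. It is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and sentences, so its chunks are usually shorter than the limit. The `chunk_text` helper and the sample text are assumptions made for this example.

```python
def chunk_text(text, chunk_size=3500, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Simplified sketch:
    it cuts at fixed character offsets and ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 8000-character sample so the overlap is easy to check.
sample = "".join(str(i % 10) for i in range(8000))
chunks = chunk_text(sample, chunk_size=3500, chunk_overlap=200)

print([len(c) for c in chunks])          # [3500, 3500, 1400]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

A larger overlap repeats more text across chunks, which raises storage and embedding cost but reduces the chance that a sentence is split away from the context it needs.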

In [31]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_simple_chunks(file_path, chunk_size=3500, chunk_overlap=0):

    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()

    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                              chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)
    print('Finished processing:', file_path)
    print()
    return doc_chunks

In [None]:
from glob import glob

pdf_files = glob('../../docs/*.pdf')


In [34]:
pdf_files

['../../docs/layoutparser_paper.pdf', '../../docs/cnn_paper.pdf', '../../docs/vision_transformer.pdf', '../../docs/resnet_paper.pdf', '../../docs/WEB_How_and_Why_to_UseLLMs_for_Chunk_Based_Information_Retrieval_Carlo_Peron_Oct_2024_TowardsDataScience.pdf', '../../docs/attention_paper.pdf', '../../docs/Vision Transformers.pdf']

In [35]:
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_simple_chunks(file_path=fp,
                                           chunk_size=3500,
                                           chunk_overlap=200))

Loading pages: ../../docs/layoutparser_paper.pdf
Chunking pages: ../../docs/layoutparser_paper.pdf
Finished processing: ../../docs/layoutparser_paper.pdf

Loading pages: ../../docs/cnn_paper.pdf
Chunking pages: ../../docs/cnn_paper.pdf
Finished processing: ../../docs/cnn_paper.pdf

Loading pages: ../../docs/vision_transformer.pdf
Chunking pages: ../../docs/vision_transformer.pdf
Finished processing: ../../docs/vision_transformer.pdf

Loading pages: ../../docs/resnet_paper.pdf
Chunking pages: ../../docs/resnet_paper.pdf
Finished processing: ../../docs/resnet_paper.pdf

Loading pages: ../../docs/WEB_How_and_Why_to_UseLLMs_for_Chunk_Based_Information_Retrieval_Carlo_Peron_Oct_2024_TowardsDataScience.pdf
Chunking pages: ../../docs/WEB_How_and_Why_to_UseLLMs_for_Chunk_Based_Information_Retrieval_Carlo_Peron_Oct_2024_TowardsDataScience.pdf
Finished processing: ../../docs/WEB_How_and_Why_to_UseLLMs_for_Chunk_Based_Information_Retrieval_Carlo_Peron_Oct_2024_TowardsDataScience.pdf

Loading page

In [36]:
len(paper_docs)

134

### Combine all document chunks in one list

In [37]:
len(wiki_docs_processed)

1801

In [60]:
type(paper_docs)

<class 'list'>

In [59]:
type(wiki_docs_processed)

<class 'list'>

In [64]:
print(wiki_docs_processed[3])

page_content='In probability theory and statistics, the chi-square distribution (also chi-squared or formula_1  distribution) is one of the most widely used theoretical probability distributions. Chi-square distribution with formula_2 degrees of freedom is written as formula_3. It is a special case of gamma distribution. Chi-square distribution is primarily used in statistical significance tests and confidence intervals. It is useful, because it is relatively easy to show that certain probability distributions come close to it, under certain conditions. One of these conditions is that the null hypothesis must be true. Another one is that the different random variables (or observations) must be independent of each other.' metadata={'title': 'Chi-square distribution', 'id': '71548', 'source': 'Wikipedia'}


In [65]:
print(paper_docs[0])

page_content='LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ( ), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify deep learning (DL) model
development in disciplines like natural langu

In [38]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

1935

## Index Document Chunks and Embeddings in Vector DB

Here we create a Chroma vector DB from the document chunks and their embeddings. Because we also want the index saved to disk, we pass the directory where the data should be stored via `persist_directory`.

In [None]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")

### Load Vector DB from disk

This shows that once a vector database exists on disk, you can load it and create a connection to it at any time.

In [43]:
# load from disk
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

In [44]:
chroma_db

<langchain_chroma.vectorstores.Chroma object at 0x123bbadd0>

### Semantic Similarity based Retrieval

We use cosine similarity here and retrieve the top 5 most similar documents for the user's query.
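
As a rough illustration of what "top k by cosine similarity" means, the sketch below ranks a few toy vectors by brute force. It is not what Chroma does internally (Chroma searches an approximate HNSW index and reports distances rather than similarities); the three-dimensional vectors, the document names, and the `top_k` helper are all made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones here have 1536 dimensions.
toy_docs = {
    "machine learning": [0.9, 0.1, 0.0],
    "deep learning": [0.8, 0.3, 0.1],
    "chinese new year": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_docs, k=2)
print(ranked)  # ['machine learning', 'deep learning']
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.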

In [45]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

In [46]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

In [47]:
query = "what is machine learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '6360', 'source': 'Wikipedia', 'title': 'Artificial intelligence'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental facu


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [66]:
query = "what is the difference between transformers and vision transformers?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../../docs/Vision Transformers.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 7, 'producer': 'pdfTeX-1.40.21', 'source': '../../docs/Vision Transformers.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Published as a conference paper at ICLR 2021
4.4
SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-
trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the
end of the model name stands not for the patch size, but for the total dowsampling ratio in the ResNet
backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5
for details on computational costs). Detailed results per mode


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 7, 'producer': 'pdfTeX-1.40.21', 'source': '../../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Published as a conference paper at ICLR 2021
4.4
SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-
trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the
end of the model name stands not for the patch size, but for the total dowsampling ratio in the ResNet
backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5
for details on computational costs). Detailed results per mode


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 0, 'producer': 'pdfTeX-1.40.21', 'source': '../../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very well on image classiﬁcation tasks. When pre-traine


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../../docs/Vision Transformers.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 0, 'producer': 'pdfTeX-1.40.21', 'source': '../../docs/Vision Transformers.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very well on image classiﬁcation tasks. When pre-traine


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 1, 'producer': 'pdfTeX-1.40.21', 'source': '../../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Published as a conference paper at ICLR 2021
inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well
when trained on insufﬁcient amounts of data.
However, the picture changes if the models are trained on larger datasets (14M-300M images). We
ﬁnd that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent
results when pre-trained at sufﬁcient scale and transferred to tasks with fewer datapoints. When
pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches
or beats state of the art on multiple image recognition benchmarks. In particular, the best model
reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100,
and 77.63% on the VTAB suite of 19 tasks.
2
RELATED WORK
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since be-
come the state of the art method in many NLP tasks. Large Transformer-based mo




## Build the RAG Pipeline

In [67]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [68]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_rag_chain = (
    {
        "context": (similarity_retriever
                      |
                    format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)
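
The chain above can be read as three steps: the question goes to the retriever and is also passed through unchanged, the retrieved texts are joined and placed into the prompt, and the filled prompt is sent to the model. The sketch below mirrors that data flow in plain Python. `fake_retriever` and `fake_llm` are hypothetical stand-ins that make no API calls, and the prompt is shortened, so this shows only the plumbing, not LangChain's implementation.

```python
def fake_retriever(question):
    """Hypothetical stand-in for similarity_retriever: returns document texts."""
    return [
        "Machine learning is a subfield of computer science.",
        "Supervised learning infers a function from labelled data.",
    ]


def join_docs(texts):
    """Same idea as format_docs: separate documents with a blank line."""
    return "\n\n".join(texts)


# Shortened version of the RAG prompt, with the same two placeholders.
PROMPT = "Question:\n{question}\n\nContext:\n{context}\n\nAnswer:"


def fake_llm(prompt):
    """Hypothetical stand-in for the chat model: reports the prompt size."""
    return f"(model would answer from a {len(prompt)}-character prompt)"


def run_chain(question):
    # Step 1: build both prompt inputs from the same question.
    inputs = {
        "context": join_docs(fake_retriever(question)),
        "question": question,
    }
    # Step 2: fill the prompt template.
    prompt = PROMPT.format(**inputs)
    # Step 3: send the filled prompt to the model.
    return prompt, fake_llm(prompt)


prompt, answer = run_chain("What is machine learning?")
print(prompt)
print(answer)
```

Printing the filled prompt like this is a useful debugging habit, since a poor answer is often explained by what did or did not end up in the context.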

In [69]:
from IPython.display import display, Markdown

query = "What is machine learning?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The concept was introduced by Arthur Samuel in 1959 and is rooted in artificial intelligence. Machine learning focuses on the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. These algorithms follow programmed instructions but can also adapt and improve their performance by building models from sample inputs.

Machine learning is particularly useful in scenarios where designing and programming explicit algorithms is impractical. Some common applications of machine learning include:

- Spam filtering
- Detection of network intruders or malicious insiders
- Optical character recognition (OCR)
- Search engines
- Computer vision

Within machine learning, there are different approaches, such as supervised learning, where a function is inferred from labeled training data. In this case, the system learns to produce correct results based on known outcomes, typically using vectors for training data and results.

Additionally, deep learning is a specialized area of machine learning that utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). Deep learning is effective for complex tasks like speech recognition, image understanding, and handwriting recognition, which are challenging for computers but relatively easy for humans. 

Overall, machine learning represents a significant advancement in the ability of computers to process information and make decisions autonomously, evolving from traditional programming methods.

In [70]:
query = "What is a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A CNN, or Convolutional Neural Network, is a type of artificial neural network primarily used for image-driven pattern recognition tasks. CNNs are designed to process data with a grid-like topology, such as images, and they consist of three main types of layers: convolutional layers, pooling layers, and fully-connected layers.

### Key Features of CNNs:

1. **Architecture**:
   - CNNs are structured to handle the spatial dimensionality of input data, which includes height, width, and depth (for color channels in images).
   - The architecture typically involves stacking multiple convolutional layers followed by pooling layers, which helps in reducing the dimensionality of the data while retaining important features.

2. **Convolutional Layers**:
   - These layers apply a convolution operation to the input, which involves sliding a filter (or kernel) over the input image to produce feature maps. Each neuron in a convolutional layer is connected to a local region of the input, allowing the network to learn spatial hierarchies of features.

3. **Pooling Layers**:
   - Pooling layers are used to down-sample the feature maps, reducing their dimensionality and helping to make the representation more manageable. This also aids in making the model invariant to small translations in the input.

4. **Fully-Connected Layers**:
   - After several convolutional and pooling layers, the high-level reasoning in the neural network is done through fully-connected layers, where every neuron is connected to every neuron in the previous layer.

5. **Activation Functions**:
   - CNNs often use activation functions like the Rectified Linear Unit (ReLU) to introduce non-linearity into the model, which helps in learning complex patterns.

6. **Efficiency**:
   - CNNs are designed to be computationally efficient, especially when dealing with large images, by reducing the number of parameters through shared weights in convolutional layers.

7. **Applications**:
   - CNNs are widely used in various applications, including image classification, object detection, and image segmentation, due to their ability to automatically learn and extract features from images.

In summary, CNNs are a powerful class of neural networks that excel in tasks involving image data, leveraging their unique architecture to efficiently learn from and process visual information.

In [71]:
query = "What is NLP and its relation to linguistics?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Natural Language Processing (NLP) is a field within Artificial Intelligence that focuses on enabling computers to automatically understand and generate human languages. The term "Natural Language" specifically refers to human languages, distinguishing them from programming languages. The overarching goal of NLP is to facilitate seamless interaction between humans and machines through language.

NLP is closely related to linguistics, as it draws upon linguistic principles to enhance the understanding and processing of human language. Linguistics, the scientific study of language, provides the foundational theories and frameworks that inform NLP techniques and algorithms. By leveraging insights from linguistics, NLP aims to improve the accuracy and effectiveness of language comprehension and generation by computers.

In [72]:
query = "How is self-attention important in transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Self-attention is crucial in transformers for several reasons:

1. **Global Information Integration**: Self-attention allows the model to integrate information across the entire input, such as an image, even in the lowest layers of the network. This capability enables the model to attend to relevant regions of the input that are semantically important for tasks like classification.

2. **Attention Distance**: "Attention distance" measures the average distance in image space over which a head integrates information, based on its attention weights. Some attention heads attend to most of the image from the very first layers, indicating that the model uses global information early in its processing.

3. **Localized vs. Global Attention**: While some attention heads exhibit highly localized attention in the lower layers, this is less pronounced in hybrid models that incorporate convolutional layers before the transformer. This suggests that self-attention can serve a similar purpose to early convolutional layers in CNNs, allowing for both localized and global attention mechanisms.

4. **Layer Depth and Attention**: The attention distance tends to increase with the depth of the network, meaning that as the model goes deeper, it can attend to larger areas of the input. This characteristic is beneficial for capturing complex relationships and dependencies within the data.

5. **Architecture Design**: The transformer architecture employs a multi-head self-attention mechanism within its encoder and decoder stacks. This design allows the model to learn different representations and focus on various parts of the input simultaneously, enhancing its ability to process and generate sequences effectively.

In summary, self-attention is fundamental to the transformer's architecture as it enables the model to capture both local and global dependencies in the data, facilitating improved performance on tasks such as image classification and natural language processing.
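
To make the "every token can attend to every other token" idea concrete, here is a minimal single-head sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are the input tokens themselves (a real transformer applies learned projection matrices and uses multiple heads), and the three 2-dimensional `tokens` are made-up values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption for brevity: no learned projections and a single head.
    Returns (outputs, weights), where weights[i][j] is how much token i attends to token j.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

# Hypothetical input: three tokens (or image patches) with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Every entry of `weights` is strictly positive, so even this single layer mixes information from all positions, which is the global integration described in point 1.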

In [73]:
query = "What is an Agentic AI System?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The context provided does not contain specific information about an "Agentic AI System." Therefore, I don't know what an Agentic AI System is.

In [74]:
query = "What is LangChain?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

In [75]:
query = "What is LangGraph?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

LangGraph is not mentioned in the provided context. Therefore, I don't know what LangGraph is.

In [76]:
query = "What is the difference between transformers and vision transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The difference between transformers and vision transformers (ViTs) primarily lies in their application and the way they process input data.

1. **Architecture and Input Data**:
   - **Transformers**: Originally designed for natural language processing (NLP), transformers operate on sequences of tokens (words) and utilize self-attention mechanisms to capture relationships between these tokens. They are typically pre-trained on large text corpora and fine-tuned on specific tasks.
   - **Vision Transformers (ViTs)**: ViTs adapt the transformer architecture for image data by treating image patches as a sequence of tokens. An image is divided into smaller patches, which are then flattened and linearly embedded into a lower-dimensional space. This sequence of embeddings is fed into the transformer, allowing it to process image data similarly to how it processes text.

2. **Processing Mechanism**:
   - In ViTs, the first layer projects the flattened patches into a lower-dimensional space, and position embeddings are added to encode spatial information. This allows the model to learn the structure of the image, such as the distance between patches and their relative positions.
   - ViTs leverage self-attention to integrate information across the entire image, enabling the model to capture global context even in the early layers. This is different from traditional convolutional neural networks (CNNs), which typically focus on local features through convolutional layers.

3. **Performance and Efficiency**:
   - ViTs have been shown to outperform traditional CNNs (like ResNets) in terms of performance per compute cost, using approximately 2-4 times less compute to achieve similar performance on various datasets. This efficiency is particularly notable when pre-trained on large datasets and transferred to smaller tasks.
   - While hybrids that combine CNNs with ViTs can perform well, ViTs have demonstrated that they can achieve competitive results without relying on convolutional structures.

In summary, the key differences between transformers and vision transformers lie in their input data (text tokens vs. image patches), how that input is embedded (word-token embeddings vs. linearly projected, flattened patches), and their performance characteristics in various tasks.
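
The patch-to-token step described above can be sketched as follows. This is a toy illustration, not the ViT reference implementation: the 4x4 grayscale `image`, the `projection` weights, and the `position_embeddings` are hypothetical values (in a real ViT both the projection and the position embeddings are learned), and the helper names are invented for this example.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grayscale image into non-overlapping patches, each flattened to a vector."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append([
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ])
    return patches

def embed(patches, projection, position_embeddings):
    """Linearly project each flattened patch, then add that patch's position embedding."""
    tokens = []
    for patch, pos in zip(patches, position_embeddings):
        projected = [sum(p * w for p, w in zip(patch, row)) for row in projection]
        tokens.append([x + e for x, e in zip(projected, pos)])
    return tokens

# Hypothetical 4x4 image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Four 2x2 patches, each flattened to a length-4 vector.
patches = image_to_patches(image, patch_size=2)

# Hypothetical projection from 4 dimensions down to 2 (one weight row per output dimension).
projection = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, -1.0]]

# Hypothetical position embeddings, one per patch.
position_embeddings = [[0.0, 0.1 * i] for i in range(len(patches))]

tokens = embed(patches, projection, position_embeddings)
print(patches)
print(tokens)
```

The resulting `tokens` list plays the same role as a sequence of word embeddings in a text transformer, so the same self-attention layers can consume it.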

In [77]:
query = "How is a resnet better than a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A ResNet (Residual Network) is itself a CNN; it is considered better than a plain CNN without shortcut connections for several reasons, primarily related to its architecture and training efficiency:

1. **Overcoming Optimization Difficulties**: Traditional deep CNNs often face challenges such as increased training error and optimization difficulties as the depth of the network increases. In contrast, ResNets utilize shortcut connections that allow gradients to flow more easily during backpropagation, which helps mitigate these issues. This results in lower training errors even as the network depth increases.

2. **Increased Depth with Improved Performance**: ResNets can be significantly deeper than traditional CNNs without suffering from the degradation problem, where adding more layers leads to worse performance. For instance, the 34-layer ResNet outperforms the 18-layer ResNet by 2.8% in top-1 error, demonstrating that deeper networks can achieve better accuracy due to their architecture.

3. **No Extra Parameters**: ResNets can achieve better performance without increasing the number of parameters compared to their plain counterparts. This is because the shortcut connections do not add additional parameters, allowing for deeper networks that maintain computational efficiency.

4. **Generalization**: ResNets have been shown to generalize better across various datasets and tasks. For example, the 152-layer ResNet achieved a top-5 validation error of 4.49%, outperforming previous models and demonstrating significant accuracy gains from increased depth.

5. **Performance Metrics**: In practical applications, ResNets have achieved better results in tasks such as image classification and object detection. For instance, using ResNet-101 in object detection led to a 6.0% increase in COCO’s standard metric (mAP@[.5, .95]), which is a 28% relative improvement over the VGG-16 baseline.

In summary, ResNets improve upon traditional CNNs by allowing for deeper architectures that are easier to train, do not suffer from degradation, and achieve better performance without increasing the number of parameters. This makes them a powerful choice for various deep learning tasks.
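
The shortcut connection described above can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the two layers are tiny fully-connected layers rather than convolutions, the weights are hypothetical, and the helper names (`plain_block`, `residual_block`, `count_params`) are invented for this example. The point it shows is that the identity shortcut computes `F(x) + x` using exactly the same weights as the plain block.

```python
def linear(x, weights):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def plain_block(x, w1, w2):
    """Two stacked layers: relu(W2 * relu(W1 * x))."""
    return relu(linear(relu(linear(x, w1)), w2))

def residual_block(x, w1, w2):
    """The same two layers, with the input added back before the final ReLU."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

def count_params(*weight_matrices):
    """Total number of weights across the given matrices."""
    return sum(len(row) for w in weight_matrices for row in w)

# Hypothetical all-zero weights: the layers contribute nothing.
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]

plain_out = plain_block(x, zeros, zeros)        # the signal is lost
residual_out = residual_block(x, zeros, zeros)  # the input passes through unchanged
n_params = count_params(zeros, zeros)           # identical for both blocks

print(plain_out, residual_out, n_params)
```

With zero weights the residual block reduces to the identity mapping, so adding such a block should not make a network worse, whereas the plain block discards its input. Both blocks use the same eight weights, matching the "no extra parameters" point.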