# Build a Contextual Retrieval-Based RAG System

Learn how to build a Contextual Retrieval-Augmented Generation (RAG) system using LangChain, OpenAI, and the Chroma vector database. The workflow covers the entire RAG pipeline, including:

- Loading and processing structured (JSON) and unstructured (PDF) documents.
- Chunking documents and generating contextual summaries for each chunk to improve retrieval quality.
- Creating embeddings for all document chunks using OpenAI's embedding models.
- Indexing the embeddings in a persistent Chroma vector database for efficient semantic search.
- Implementing a semantic retriever to fetch relevant document chunks based on user queries.
- Building a RAG pipeline that uses retrieved context to answer questions with a language model.
- Extending the pipeline to provide source citations for generated answers, including quoted evidence from the original documents.
- Displaying results with highlighted citations and context for transparency and traceability.

This notebook is intended as a practical guide to building advanced RAG systems with source attribution, suitable for research, knowledge management, and production AI applications.

In [66]:
# import necessary libraries
import os
import json
import uuid
import time
from glob import glob

from dotenv import load_dotenv

# LangChain core building blocks
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# model, loader, splitter, and vector store integrations
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import JSONLoader, PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

from IPython.display import display, Markdown

# load environment variables
load_dotenv()

# set OpenAI API key
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [None]:
# initialize OpenAI embeddings model
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Loading and Processing the Data

See the notebook in lesson 3 for a more detailed explanation of basic RAG.

In [None]:
# load wiki data JSONL file
loader = JSONLoader(file_path='../../rag_docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()

In [None]:
# process wiki documents
wiki_docs_processed = []

for doc in wiki_docs:
    doc = json.loads(doc.page_content)
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
        "source": "Wikipedia",
        "page": 1
    }
    data = ' '.join(doc['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))
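
The processing loop above parses each record back from JSON because, as far as I can tell, `JSONLoader` with `text_content=False` stores the whole record as a JSON string in `page_content`. The sketch below reproduces the same transformation on a single made-up JSONL line without LangChain. The sample values and the helper name `record_to_text_and_metadata` are assumptions for illustration; only the field names (`id`, `title`, `paragraphs`) come from the loop above.

```python
import json

# Hypothetical JSONL line with the same fields the loop above reads.
# The values are invented for illustration only.
sample_line = json.dumps({
    "id": "0001",
    "title": "Example article",
    "paragraphs": ["First paragraph.", "Second paragraph."],
})


def record_to_text_and_metadata(line):
    """Turn one JSONL line into (page text, metadata), mirroring the loop above."""
    record = json.loads(line)
    metadata = {
        "title": record["title"],
        "id": record["id"],
        "source": "Wikipedia",
        "page": 1,
    }
    # Paragraphs are joined with single spaces into one block of text.
    text = " ".join(record["paragraphs"])
    return text, metadata


text, metadata = record_to_text_and_metadata(sample_line)
print(text)
print(metadata)
```

Each Wikipedia article therefore becomes a single document, with no further chunking, and a fixed `page` value of 1.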

### Load and Process PDF documents

#### Create Chunk Contexts for Contextual Retrieval

![CR](../../images/contextual_rag.png)

In [None]:
# initialize ChatOpenAI model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [None]:
# create chunk context generation chain
def generate_chunk_context(document, chunk):

    chunk_process_prompt = """You are an AI assistant specializing in research paper analysis.
                            Your task is to provide brief, relevant context for a chunk of text
                            based on the following research paper.

                            Here is the research paper:
                            <paper>
                            {paper}
                            </paper>

                            Here is the chunk we want to situate within the whole document:
                            <chunk>
                            {chunk}
                            </chunk>

                            Provide a concise context (3-4 sentences max) for this chunk,
                            considering the following guidelines:

                            - Give a short succinct context to situate this chunk within the overall document
                            for the purposes of improving search retrieval of the chunk.
                            - Answer only with the succinct context and nothing else.
                            - Context should be mentioned like 'Focuses on ....'
                            do not mention 'this chunk or section focuses on...'

                            Context:
                        """

    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    agentic_chunk_chain = (prompt_template
                                |
                            chatgpt
                                |
                            StrOutputParser())

    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    return context

In [None]:

# create contextual chunks from PDF document
def create_contextual_chunks(file_path, chunk_size=3500, chunk_overlap=0):

    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()

    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                              chunk_overlap=chunk_overlap)
    doc_chunks = splitter.split_documents(doc_pages)

    print('Generating contextual chunks:', file_path)
    original_doc = '\n'.join([doc.page_content for doc in doc_chunks])
    contextual_chunks = []
    for chunk in doc_chunks:
        chunk_content = chunk.page_content
        chunk_metadata = chunk.metadata
        chunk_metadata_upd = {
            'id': str(uuid.uuid4()),
            'page': chunk_metadata['page'],
            'source': chunk_metadata['source'],
            'title': chunk_metadata['source'].split('/')[-1]
        }
        context = generate_chunk_context(original_doc, chunk_content)
        contextual_chunks.append(Document(page_content=context+'\n'+chunk_content,
                                          metadata=chunk_metadata_upd))
        time.sleep(10)  # to avoid rate limits
    print('Finished processing:', file_path)
    print()
    return contextual_chunks
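
To make the shape of a contextual chunk concrete, here is a minimal sketch that assembles one without calling the LLM. `fake_context` is a hypothetical stand-in for `generate_chunk_context`, and a plain dict stands in for the LangChain `Document`; the file path and text are made up.

```python
import uuid


def fake_context(document, chunk):
    # Stand-in for generate_chunk_context; a real run would call the LLM chain.
    return "Focuses on an example passage (placeholder context)."


def build_contextual_chunk(document, chunk, page, source, context_fn=fake_context):
    """Prepend the generated context to the chunk and rebuild its metadata."""
    context = context_fn(document, chunk)
    return {
        "page_content": context + "\n" + chunk,
        "metadata": {
            "id": str(uuid.uuid4()),          # fresh unique id per chunk
            "page": page,
            "source": source,
            "title": source.split("/")[-1],   # file name taken from the path
        },
    }


paper = "Residual learning eases training. Shortcut connections skip layers."
chunk = "Shortcut connections skip layers."
result = build_contextual_chunk(paper, chunk, page=0, source="docs/example.pdf")
print(result["page_content"])
print(result["metadata"]["title"])
```

Because the context is prepended to the chunk text, it is embedded together with the chunk, which is what lets the context influence retrieval.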

In [None]:
# get list of PDF files
pdf_files = glob('../../rag_docs/*.pdf')
pdf_files

['../../rag_docs/cnn_paper.pdf',
 '../../rag_docs/vision_transformer.pdf',
 '../../rag_docs/resnet_paper.pdf',
 '../../rag_docs/attention_paper.pdf']

In [None]:
# process PDF documents to create contextual chunks
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_contextual_chunks(file_path=fp, chunk_size=3500))

Loading pages: ../../rag_docs/cnn_paper.pdf
Chunking pages: ../../rag_docs/cnn_paper.pdf
Generating contextual chunks: ../../rag_docs/cnn_paper.pdf
Finished processing: ../../rag_docs/cnn_paper.pdf

Loading pages: ../../rag_docs/vision_transformer.pdf
Chunking pages: ../../rag_docs/vision_transformer.pdf
Generating contextual chunks: ../../rag_docs/vision_transformer.pdf
Finished processing: ../../rag_docs/vision_transformer.pdf

Loading pages: ../../rag_docs/resnet_paper.pdf
Chunking pages: ../../rag_docs/resnet_paper.pdf
Generating contextual chunks: ../../rag_docs/resnet_paper.pdf
Finished processing: ../../rag_docs/resnet_paper.pdf

Loading pages: ../../rag_docs/attention_paper.pdf
Chunking pages: ../../rag_docs/attention_paper.pdf
Generating contextual chunks: ../../rag_docs/attention_paper.pdf
Finished processing: ../../rag_docs/attention_paper.pdf
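
The fixed `time.sleep(10)` pause in `create_contextual_chunks` is simple but slows every call, including those that would have succeeded immediately. One possible alternative, sketched here as an assumption rather than what this notebook does, is to retry only on failure with exponentially growing waits. `call_with_backoff` and `flaky` are hypothetical names, and in practice you would likely narrow the `except` clause to the client's rate-limit exception.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demo: a function that fails twice before succeeding. The waits are
# recorded in a list instead of actually sleeping.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Inside the chunk loop, the LLM call could then be wrapped as `call_with_backoff(lambda: generate_chunk_context(original_doc, chunk_content))`.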



In [None]:
# display first contextual chunk
print(paper_docs[0].page_content)

Focuses on the introduction of Convolutional Neural Networks (CNNs) within the broader field of machine learning, highlighting their significance as advanced architectures of Artificial Neural Networks (ANNs) specifically designed for image-driven pattern recognition tasks. It outlines the foundational concepts of ANNs and sets the stage for a deeper exploration of CNNs and their applications in image analysis.
An Introduction to Convolutional Neural Networks
Keiron O’Shea1 and Ryan Nash2
1 Department of Computer Science, Aberystwyth University, Ceredigion, SY23 3DB
keo7@aber.ac.uk
2 School of Computing and Communications, Lancaster University, Lancashire, LA1
4YW
nashrd@live.lancs.ac.uk
Abstract. The ﬁeld of machine learning has taken a dramatic twist in re-
cent times, with the rise of the Artiﬁcial Neural Network (ANN). These
biologically inspired computational models are able to far exceed the per-
formance of previous forms of artiﬁcial intelligence in common machine
learning task

### Combine all document chunks in one list

In [28]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

1880

## Index Document Chunks and Embeddings in Vector DB

In [None]:
# Index Document Chunks and Embeddings in Vector DB
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='context_db',
                                  embedding=openai_embed_model,
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./context_db")

### Load Vector DB from disk

Once the vector database has been persisted to disk, you can reconnect to it at any time without re-embedding the documents.

In [30]:
# load from disk
chroma_db = Chroma(persist_directory="./context_db",
                   collection_name='context_db',
                   embedding_function=openai_embed_model)
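
Since building the index calls the embedding API for every chunk, it can be useful to check for an existing persisted directory before rebuilding. The helper below is a minimal sketch under that assumption; `has_persisted_index` is a hypothetical name, and it only checks that the directory exists and is non-empty, not that it contains a valid Chroma collection.

```python
import os
import tempfile


def has_persisted_index(persist_directory):
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0


# Demo on a temporary directory rather than the real ./context_db folder.
with tempfile.TemporaryDirectory() as tmp:
    missing = has_persisted_index(os.path.join(tmp, "missing"))
    empty_before = has_persisted_index(tmp)
    with open(os.path.join(tmp, "placeholder.bin"), "w") as f:
        f.write("x")
    filled_after = has_persisted_index(tmp)

print(missing, empty_before, filled_after)
```

In this notebook the check could guard the `Chroma.from_documents(...)` call, falling back to the load-from-disk cell when it returns `True`.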

### Semantic Similarity based Retrieval

In [None]:
# Semantic Similarity based Retrieval
retriever = chroma_db.as_retriever(search_type="similarity",
                                   search_kwargs={"k": 5})
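
Conceptually, a similarity retriever with `k=5` embeds the query and returns the five stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force ranking by cosine similarity over toy two-dimensional vectors. It is an illustration only: Chroma uses an approximate HNSW index rather than an exhaustive scan, and the vectors and document ids here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones come from the embedding model.
toy_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = top_k([1.0, 0.1], toy_vectors, k=2)
print([doc_id for doc_id, _ in results])
```

With the `"hnsw:space": "cosine"` setting used at indexing time, ranking by highest cosine similarity is equivalent to ranking by smallest cosine distance.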

In [None]:

# function to display documents
def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

### Test retriever with basic questions

In [34]:
query = "what is machine learning?"
top_docs = retriever.invoke(query)
display_docs(top_docs)

Metadata: {'title': 'Machine learning', 'id': '564928', 'source': 'Wikipedia', 'page': 1}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'source': 'Wikipedia', 'page': 1, 'id': '359370', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'page': 1, 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'source': 'Wikipedia', 'title': 'Artificial intelligence', 'page': 1, 'id': '6360'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental facu


Metadata: {'id': '44742', 'title': 'Artificial neural network', 'page': 1, 'source': 'Wikipedia'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [35]:
query = "what is the difference between transformers and vision transformers?"
top_docs = retriever.invoke(query)
display_docs(top_docs)

Metadata: {'source': '../../rag_docs/vision_transformer.pdf', 'id': '0ea6821d-e794-4ca0-9dee-0f9ee8a499cc', 'title': 'vision_transformer.pdf', 'page': 7}
Content Brief:


Focuses on a controlled scaling study of various models, including Vision Transformers and ResNets, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers generally outperform ResNets in terms of efficiency and scalability. Additionally, it discusses the implications of hybrid models and the potential for further scaling of Vision Transformers.
Published as a conference paper at ICLR 2021
4.4
SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14


Metadata: {'source': '../../rag_docs/vision_transformer.pdf', 'title': 'vision_transformer.pdf', 'id': '67372578-19cd-48d4-9767-196e5c15e471', 'page': 0}
Content Brief:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a standard Transformer architecture directly to image classification tasks by treating image patches as tokens. It highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision and presents evidence that a pure Transformer can achieve competitive performance on various image recognition benchmarks when pre-trained on large datasets.
Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language pr


Metadata: {'source': '../../rag_docs/vision_transformer.pdf', 'page': 2, 'id': '17fc9c5f-3529-4522-89df-abf28bfb5590', 'title': 'vision_transformer.pdf'}
Content Brief:


Focuses on the architecture and methodology of the Vision Transformer (ViT), detailing how images are processed by splitting them into patches, embedding them, and utilizing a standard Transformer encoder for image classification tasks. It describes the model's design principles, including the use of position embeddings and the integration of a classification token, while referencing foundational work in Transformer architecture.
Published as a conference paper at ICLR 2021
[Figure 1 diagram labels: Vision Transformer (ViT); Patch + Position Embedding; Linear Projection of Flattened Patches; extra learnable [class] embedding; Transformer Encoder (Embedded Patches, Norm, Multi-Head Attention, Norm, MLP, repeated L x); MLP Head; Class: Bird, Ball, Car, ...]
Figure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classiﬁ


Metadata: {'source': '../../rag_docs/vision_transformer.pdf', 'id': '6d2d3b62-4961-479f-96fe-1ed226d74dda', 'page': 1, 'title': 'vision_transformer.pdf'}
Content Brief:


Focuses on the performance of the Vision Transformer (ViT) when trained on large datasets, highlighting its ability to achieve state-of-the-art results in image recognition tasks despite lacking some inductive biases inherent to convolutional neural networks (CNNs). It also discusses related work in the field, particularly the application of self-attention mechanisms in image processing and comparisons with existing models.
Published as a conference paper at ICLR 2021
inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well
when trained on insufﬁcient amounts of data.
However, the picture changes if the models are trained on larger datasets (14M-300M images). We
ﬁnd that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent
results when pre-trained at sufﬁcient scale and transferred to tasks with fewer datapoints. When
pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT ap


Metadata: {'page': 7, 'title': 'vision_transformer.pdf', 'id': 'def024b1-1080-4288-bfaf-48e07c7981da', 'source': '../../rag_docs/vision_transformer.pdf'}
Content Brief:


Focuses on the behavior of attention mechanisms in Vision Transformers, highlighting how attention distances vary across layers and the implications of localized attention in hybrid models that incorporate CNNs. It also discusses the relationship between attention distance and network depth, indicating that deeper layers attend to semantically relevant regions for classification.
have consistently small attention distances in the low layers. This highly localized attention is
less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right),
suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the
attention distance increases with network depth. Globally, we ﬁnd that the model attends to image
regions that are semantically relevant for classiﬁcation (Figure 6).
4.6
SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems
not only from their excellent scalability bu




## Build the RAG Pipeline

In [None]:

# RAG Prompt Template
rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [None]:
# initialize ChatOpenAI model
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# function to format documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# RAG Chain
qa_rag_chain = (
    {
        "context": (retriever
                      |
                    format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)
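
The dictionary at the start of the chain fans the incoming question out into two inputs: the retrieved and formatted context, and the question itself passed through unchanged. The plain-Python sketch below traces that data flow up to the point where the prompt is filled in. `fake_retriever`, `format_chunks`, and the shortened `PROMPT` are hypothetical stand-ins, and no model is called.

```python
def fake_retriever(question):
    # Stand-in for retriever.invoke(question); returns chunk texts directly.
    return ["Chunk one about the topic.", "Chunk two with more detail."]


def format_chunks(chunks):
    # Same joining rule as format_docs: chunks separated by blank lines.
    return "\n\n".join(chunks)


# Shortened version of the RAG prompt, keeping the same two placeholders.
PROMPT = "Question:\n{question}\n\nContext:\n{context}\n\nAnswer:"


def build_prompt(question):
    inputs = {
        "context": format_chunks(fake_retriever(question)),
        "question": question,  # what RunnablePassthrough does: forward the input
    }
    return PROMPT.format(**inputs)


prompt_text = build_prompt("What is machine learning?")
print(prompt_text)
```

In the real chain, the filled prompt is then sent to `chatgpt`, and the returned message's `.content` holds the answer text.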

In [None]:
# Test RAG Chain with basic questions
query = "What is machine learning?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The concept was introduced by Arthur Samuel in 1959 and is rooted in artificial intelligence. Machine learning focuses on the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. These algorithms follow programmed instructions but can also adapt and improve their performance by building models from sample inputs.

Machine learning is particularly useful in scenarios where designing and programming explicit algorithms is impractical. Some common applications of machine learning include:

- Spam filtering
- Detection of network intruders or malicious insiders
- Optical character recognition (OCR)
- Search engines
- Computer vision

Within machine learning, there are different types of learning approaches, such as supervised learning, where a function is inferred from labeled training data. In supervised learning, the system learns to produce correct results based on known outcomes, typically using vectors for training data and results.

Additionally, deep learning is a specialized area of machine learning that utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). Deep learning is effective for complex tasks like speech recognition, image understanding, and handwriting recognition, which are challenging for computers but relatively easy for humans. 

Overall, machine learning represents a significant advancement in the ability of computers to process information and make decisions autonomously, drawing inspiration from biological systems while employing distinct methodologies.

In [40]:
query = "What is a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A CNN, or Convolutional Neural Network, is a specialized type of Artificial Neural Network (ANN) that is particularly effective for image-driven pattern recognition tasks. CNNs are designed to process data with a grid-like topology, such as images, and they utilize a unique architecture that distinguishes them from traditional ANNs.

### Key Features of CNNs:

1. **Three-Dimensional Neuron Organization**: 
   - The neurons in CNNs are organized into three dimensions: height, width, and depth. This structure allows CNNs to capture spatial hierarchies in images.

2. **Layer Types**:
   - CNNs consist of several types of layers:
     - **Convolutional Layers**: These layers apply convolution operations to the input, allowing the network to learn spatial hierarchies of features. Each neuron in a convolutional layer is connected to a small region of the input, which helps in detecting local patterns.
     - **Pooling Layers**: These layers reduce the spatial dimensions of the input, helping to decrease the computational load and control overfitting.
     - **Fully-Connected Layers**: These layers connect every neuron in one layer to every neuron in the next layer, typically used at the end of the network to produce the final output.

3. **Architecture**:
   - A typical CNN architecture may involve stacking multiple convolutional layers followed by pooling layers. This stacking allows the network to learn increasingly complex features of the input data.

4. **Learning Paradigms**:
   - CNNs primarily utilize supervised learning, where the model is trained on labeled data to minimize classification errors. This is crucial for tasks such as image classification.

5. **Applications**:
   - CNNs are widely used in various applications, including image classification, object detection, and image segmentation, due to their ability to effectively encode image-specific features.

In summary, CNNs are a powerful class of neural networks that excel in image analysis and pattern recognition, leveraging their unique architecture to process and learn from visual data efficiently.

In [41]:
query = "How is a resnet better than a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A ResNet (Residual Network) is considered better than a traditional CNN (Convolutional Neural Network) for several reasons, primarily due to its architectural innovations that address optimization challenges associated with deeper networks. Here are the key advantages of ResNets over standard CNNs:

1. **Residual Learning Framework**:
   - ResNets introduce the concept of residual learning, where the network learns to fit a residual mapping instead of the original unreferenced mapping. This is formalized as \( F(x) = H(x) - x \), where \( H(x) \) is the desired mapping. The output is then expressed as \( F(x) + x \). This approach simplifies the optimization process, making it easier for the network to learn.

2. **Shortcut Connections**:
   - ResNets utilize shortcut connections that skip one or more layers, allowing the gradient to flow more easily during backpropagation. This helps mitigate the vanishing gradient problem, which is common in very deep networks. The identity mappings introduced by these shortcuts do not add extra parameters or computational complexity, yet they significantly enhance the training dynamics.

3. **Performance with Increased Depth**:
   - Traditional CNNs often suffer from increased training error as the depth of the network increases, a phenomenon known as the degradation problem. In contrast, ResNets can maintain or even improve performance as they become deeper. For instance, a 34-layer ResNet outperforms an 18-layer ResNet by a notable margin, demonstrating that deeper networks can be effectively trained without degradation in performance.

4. **Generalization and Accuracy**:
   - ResNets have shown superior generalization performance across various tasks, including image classification and object detection. They have achieved state-of-the-art results on benchmarks like ImageNet, with deeper architectures (e.g., 101-layer and 152-layer ResNets) yielding significant accuracy gains compared to shallower models. For example, a 152-layer ResNet achieved a top-5 validation error of 4.49%, outperforming previous models.

5. **Empirical Evidence**:
   - Extensive experiments on datasets like ImageNet and CIFAR-10 have demonstrated that ResNets not only converge faster but also achieve lower training and validation errors compared to their plain counterparts. This is evidenced by the fact that ResNets can be trained with over 100 layers while maintaining lower complexity than traditional networks like VGG.

In summary, ResNets improve upon traditional CNNs by addressing optimization difficulties through residual learning and shortcut connections, allowing for deeper architectures that are easier to train and achieve better performance.

In [42]:
query = "What is NLP and its relation to linguistics?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Natural Language Processing (NLP) is a field within Artificial Intelligence that focuses on enabling computers to automatically understand and generate human languages. The term "Natural Language" specifically refers to human languages, distinguishing them from programming languages. The overarching goal of NLP is to facilitate seamless interaction between humans and machines through language.

NLP is closely related to linguistics, which is the scientific study of language and its structure. Linguistics provides the foundational theories and frameworks that inform the development of NLP technologies. By leveraging insights from linguistics, NLP aims to enhance the ability of computers to process and analyze human language in a way that is meaningful and contextually appropriate. This relationship underscores the importance of understanding language structure, semantics, and syntax in the design of effective NLP systems.

In [43]:
query = "What is the difference between AI, ML and DL?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The difference between AI, ML, and DL can be summarized as follows:

### Artificial Intelligence (AI)
- **Definition**: AI refers to the ability of a computer program or machine to think and learn, mimicking human cognition. It encompasses a broad range of technologies and applications aimed at making machines "smart."
- **Functionality**: AI systems can interpret external data, learn from it, and use those learnings to achieve specific goals through flexible adaptation. The term has evolved, and tasks once considered AI, like optical character recognition, are now seen as routine technologies.

### Machine Learning (ML)
- **Definition**: ML is a subfield of AI that focuses on the study and construction of algorithms that allow computers to learn from data without being explicitly programmed. It emerged from the broader field of artificial intelligence.
- **Functionality**: ML algorithms build models from sample inputs and can make predictions or decisions based on data. It is particularly useful in scenarios where designing explicit algorithms is impractical, such as spam filtering and computer vision.

### Deep Learning (DL)
- **Definition**: DL is a specialized subset of machine learning that primarily uses neural networks with multiple layers (multi-layer neural networks) to process data. It is often referred to as deep structured learning or hierarchical learning.
- **Functionality**: In deep learning, the information processed becomes more abstract with each added layer, making it particularly effective for complex tasks like speech and image recognition. DL models are inspired by the information processing patterns of biological nervous systems.

In summary, AI is the overarching field that includes both ML and DL. ML is a specific approach within AI that focuses on learning from data, while DL is a further specialization of ML that utilizes deep neural networks to handle complex data representations.

In [44]:
query = "What is the difference between transformers and vision transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The difference between transformers and vision transformers primarily lies in their application and the way they process input data.

1. **Transformers**:
   - Originally designed for natural language processing (NLP) tasks, transformers utilize self-attention mechanisms to process sequences of tokens (words).
   - In NLP, the input is typically a sequence of words, and the transformer architecture processes these tokens to understand context and relationships between them.
   - Transformers have become the standard model for various NLP tasks due to their efficiency and scalability, allowing for the training of very large models.

2. **Vision Transformers (ViT)**:
   - Vision Transformers adapt the transformer architecture for image classification tasks by treating image patches as tokens.
   - Instead of processing sequences of words, ViTs split an image into fixed-size patches, flatten these patches, and then embed them into a sequence that is fed into a standard transformer encoder.
   - ViTs incorporate position embeddings to retain spatial information about the patches, which is crucial for understanding the structure of images.
   - The architecture allows for global integration of information across the entire image, leveraging the self-attention mechanism to capture relationships between different parts of the image.

In summary, while both transformers and vision transformers utilize the same underlying architecture, their applications differ significantly: transformers are used for sequential data (like text), whereas vision transformers are specifically designed to handle image data by treating image patches as sequences of tokens.

In [45]:
query = "How is self-attention important in transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Self-attention is crucial in transformers for several reasons, particularly in the context of Vision Transformers (ViTs) and their application in image processing. Here are the key points regarding the importance of self-attention in transformers:

1. **Localized Attention**: In Vision Transformers, self-attention mechanisms exhibit highly localized attention in the lower layers. This means that these layers focus on small, specific regions of the input image, similar to the early convolutional layers in Convolutional Neural Networks (CNNs). This localized attention is essential for capturing fine-grained details in images.

2. **Attention Distance and Network Depth**: The attention distance, which refers to how far the model can attend to different parts of the input, increases with the depth of the network. In deeper layers, the model is able to attend to semantically relevant regions of the image that are important for classification tasks. This hierarchical attention allows the model to build a more comprehensive understanding of the image as it processes it through multiple layers.

3. **Scalability and Efficiency**: Traditional self-attention mechanisms can be computationally expensive, especially when applied to images where each pixel could potentially attend to every other pixel. However, transformers have adapted self-attention to work efficiently with images by using approximations, such as applying attention only within local neighborhoods or using sparse attention mechanisms. This adaptation allows transformers to scale effectively to larger input sizes without losing the benefits of self-attention.

4. **Performance in Image Recognition**: The application of self-attention in transformers has led to impressive performance in image recognition tasks. For instance, the Vision Transformer (ViT) has achieved state-of-the-art results on various benchmarks when trained on large datasets, demonstrating that self-attention can effectively replace some of the inductive biases inherent to CNNs.

5. **Flexibility in Architecture**: Self-attention mechanisms allow for flexibility in how features are processed. They can be combined with CNNs or used independently, enabling a variety of architectures that can be tailored to specific tasks in computer vision.

In summary, self-attention is vital in transformers as it enables localized attention, scales efficiently, enhances performance in image recognition, and provides architectural flexibility, all of which contribute to the model's ability to understand and classify images effectively.

In [46]:
query = "How does a resnet work?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A ResNet, or Residual Network, operates on the principle of residual learning to address the challenges associated with training deep neural networks. Here’s a detailed explanation of how it works:

### Key Concepts of ResNet

1. **Residual Learning Framework**:
   - Instead of directly learning the desired underlying mapping \( H(x) \), ResNets learn a residual mapping \( F(x) = H(x) - x \). This means that the network is trained to predict the difference between the output and the input, which is often easier than learning the output directly.

2. **Shortcut Connections**:
   - ResNets utilize shortcut connections that skip one or more layers. These connections perform identity mapping, allowing the input \( x \) to be added directly to the output of the stacked layers. This can be mathematically represented as:
     \[
     H(x) = F(x) + x
     \]
   - The addition of these shortcuts does not introduce extra parameters or computational complexity, making it efficient to train.

3. **Optimization Benefits**:
   - The architecture allows for easier optimization of deeper networks. Traditional deep networks (plain networks) often suffer from increased training errors as depth increases, a phenomenon known as the degradation problem. In contrast, ResNets can maintain or improve accuracy as they become deeper, effectively overcoming this issue.

4. **Empirical Evidence**:
   - Experiments have shown that ResNets can achieve significantly lower training errors compared to their plain counterparts, even as the number of layers increases. For instance, a 34-layer ResNet outperforms an 18-layer ResNet, demonstrating that deeper networks can be more effective when using residual learning.

5. **Architectural Efficiency**:
   - ResNets can be constructed with various depths (e.g., 18, 34, 50, 101, and 152 layers) while maintaining lower complexity compared to other architectures like VGG. This efficiency allows for the construction of very deep networks without the typical drawbacks associated with depth.

### Conclusion
In summary, ResNets leverage the concept of residual learning through shortcut connections to facilitate the training of very deep networks. This approach not only simplifies the optimization process but also leads to improved performance in various image recognition tasks, as evidenced by their success in competitions like ILSVRC 2015.

In [47]:
query = "What is LangGraph?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

In [48]:
query = "What is an Agentic AI System?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The context provided does not contain specific information about an "Agentic AI System." Therefore, I don't know what an Agentic AI System is.

In [49]:
query = "What is LangChain?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

# Build a RAG System with Source Citations

In [50]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)
rag_prompt_template.pretty_print()


You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            


In [51]:
citations_prompt = """You are an assistant who is an expert in analyzing answers to questions
                      and finding out referenced citations from context articles.

                      Given the following question, context and generated answer,
                      analyze the generated answer and quote citations from context articles
                      that can be used to justify the generated answer.

                      Question:
                      {question}

                      Context Articles:
                      {context}

                      Answer:
                      {answer}
                  """

cite_prompt_template = ChatPromptTemplate.from_template(citations_prompt)
cite_prompt_template.pretty_print()


You are an assistant who is an expert in analyzing answers to questions
                      and finding out referenced citations from context articles.

                      Given the following question, context and generated answer,
                      analyze the generated answer and quote citations from context articles
                      that can be used to justify the generated answer.

                      Question:
                      {question}

                      Context Articles:
                      {context}

                      Answer:
                      {answer}
                  


In [None]:
from pydantic import BaseModel, Field
from typing import List

# define the Citation model (one quoted source per instance)
class Citation(BaseModel):
    id: str = Field(description="""The string ID of a SPECIFIC context article
                                   which justifies the answer.""")
    source: str = Field(description="""The source of the SPECIFIC context article
                                       which justifies the answer.""")
    title: str = Field(description="""The title of the SPECIFIC context article
                                      which justifies the answer.""")
    page: int = Field(description="""The page number of the SPECIFIC context article
                                     which justifies the answer.""")
    quotes: str = Field(description="""The VERBATIM sentences from the SPECIFIC context article
                                      that are used to generate the answer.
                                      Should be exact sentences from context article without missing words.""")

# define QuotedCitations model
class QuotedCitations(BaseModel):
    """Quote citations from given context articles
       that can be used to justify the generated answer. Can be multiple articles."""
    citations: List[Citation] = Field(description="""Citations (can be multiple) from the given
                                                     context articles that justify the answer.""")
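The `quotes` field asks the model for verbatim sentences, but a schema description cannot enforce that. The following is a minimal sketch of a post-hoc check, assuming the quote and the context chunk are available as plain strings. The helper names `normalize` and `quote_is_verbatim` are hypothetical and not part of LangChain or pydantic, and the hyphen-rejoining rule is a rough heuristic for PDF line breaks that can also merge legitimately hyphenated words.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold PDF extraction quirks so that quotes and context compare fairly."""
    # NFKC folds typographic ligatures (e.g. the single-character "fl") into plain letters
    text = unicodedata.normalize("NFKC", text)
    # Collapse all runs of whitespace, including line breaks, into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Heuristic (assumption): rejoin words split across lines, "Trans- former" -> "Transformer"
    text = re.sub(r"(\w)- (\w)", r"\1\2", text)
    return text.lower()


def quote_is_verbatim(quote: str, context: str) -> bool:
    """Return True only if every sentence of `quote` occurs in `context`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", quote.strip()) if s]
    norm_context = normalize(context)
    return all(normalize(s) in norm_context for s in sentences)


context = (
    "we reshape the image into a sequence of \ufb02attened 2D patches. "
    "We provide the embeddings as an input to a Trans- former."
)
print(quote_is_verbatim("a sequence of flattened 2D patches.", context))
print(quote_is_verbatim("ViTs incorporate position embeddings.", context))
```

A check like this could be run over each `Citation` after the chain returns, so that quotes the model paraphrased or copied from its own answer are flagged rather than displayed as evidence.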

In [None]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

# initialize ChatOpenAI model with structured output
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
structured_chatgpt = chatgpt.with_structured_output(QuotedCitations)

# format retrieved documents, with their metadata, into one context string
def format_docs_with_metadata(docs: List[Document]) -> str:
    formatted_docs = [
        f"""Context Article ID: {doc.metadata['id']}
            Context Article Source: {doc.metadata['source']}
            Context Article Title: {doc.metadata['title']}
            Context Article Page: {doc.metadata['page']}
            Context Article Details: {doc.page_content}
         """
        for doc in docs
    ]
    return "\n\n" + "\n\n".join(formatted_docs)

# RAG with Citations Chains
rag_response_chain = (
    {
        "context": (itemgetter('context')
                        |
                    RunnableLambda(format_docs_with_metadata)),
        "question": itemgetter("question")
    }
        |
    rag_prompt_template
        |
    chatgpt
        |
    StrOutputParser()
)

cite_response_chain = (
    {
        "context": itemgetter('context'),
        "question": itemgetter("question"),
        "answer": itemgetter("answer")
    }
        |
    cite_prompt_template
        |
    structured_chatgpt
)

rag_chain_w_citations = (
    {
        "context": retriever,
        "question": RunnablePassthrough()
    }
        |
    RunnablePassthrough.assign(answer=rag_response_chain)
        |
    RunnablePassthrough.assign(citations=cite_response_chain)
)
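The final chain builds up a dictionary step by step: the first stage produces `context` and `question`, and each `RunnablePassthrough.assign` adds one more key while keeping the earlier ones. The plain-Python sketch below mirrors that data flow only. It does not use LangChain, and `assign`, `fake_retriever`, `fake_answer`, and `fake_citations` are hypothetical stand-ins for the retriever and the two LLM calls.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def assign(state: State, **steps: Callable[[State], Any]) -> State:
    """Return a copy of `state` with each step's output stored under its key."""
    return {**state, **{key: fn(state) for key, fn in steps.items()}}


def fake_retriever(question: str) -> List[str]:
    # Stand-in for the vector-store retriever
    return ["ViT splits an image into patches."]


def fake_answer(state: State) -> str:
    # Stand-in for rag_response_chain: sees the context and the question
    return f"Answer to {state['question']!r} using {len(state['context'])} chunk(s)"


def fake_citations(state: State) -> List[Dict[str, str]]:
    # Stand-in for cite_response_chain: sees context, question, and answer
    return [{"quotes": state["context"][0], "for_answer": state["answer"]}]


def run(question: str) -> State:
    state: State = {"context": fake_retriever(question), "question": question}
    state = assign(state, answer=fake_answer)
    state = assign(state, citations=fake_citations)
    return state


result = run("What is ViT?")
print(list(result))
```

Because every stage keeps the earlier keys, the final result carries the retrieved documents, the question, the answer, and the citations together, which is what the display helpers later in the notebook rely on.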

In [55]:
query = "What is machine learning"
result = rag_chain_w_citations.invoke(query)
result

{'context': [Document(id='feb73cb4-4973-4b3f-a6ec-fd392c120650', metadata={'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning', 'page': 1}, page_content='Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'),
  Document(id='119b306a-cefd-4bd8-9425-2c5b89a30bd4', metadata={'page': 1, 'id': '35

In [67]:
result['citations'].model_dump()['citations']

[{'id': '0ea6821d-e794-4ca0-9dee-0f9ee8a499cc',
  'source': '../../rag_docs/vision_transformer.pdf',
  'title': 'vision_transformer.pdf',
  'page': 7,
  'quotes': 'Vision Transformers adapt the standard transformer architecture for image classification tasks by treating image patches as tokens.'},
 {'id': '67372578-19cd-48d4-9767-196e5c15e471',
  'source': '../../rag_docs/vision_transformer.pdf',
  'title': 'vision_transformer.pdf',
  'page': 0,
  'quotes': 'We experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer.'},
 {'id': '17fc9c5f-3529-4522-89df-abf28bfb5590',
  'source': '../../rag_docs/vision_transformer.pdf',
  'title': 'vision_transformer.pdf',
  'page': 2,
  'quotes': 'The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image into 

In [69]:
import re
# used mostly for nice display formatting, ignore if not needed
def get_cited_context(result_obj):
    # Dictionary to hold separate citation information for each unique source and title combination
    source_with_citations = {}

    def highlight_text(context, quote):
        # Normalize whitespace in both the quote and the context
        quote = re.sub(r'\s+', ' ', quote).strip()
        context = re.sub(r'\s+', ' ', context).strip()

        # Split quote into phrases, being careful with punctuation
        phrases = [phrase.strip() for phrase in re.split(r'[.!?]', quote) if phrase.strip()]

        highlighted_context = context

        for phrase in phrases: # for each quoted phrase

            # Escape regex metacharacters so the phrase is matched literally
            escaped_phrase = re.escape(phrase)
            # Match the phrase case-insensitively, anchored on word boundaries
            pattern = re.compile(r'\b' + escaped_phrase + r'\b', re.IGNORECASE)

            # Replace all matched phrases with bolded version
            highlighted_context = pattern.sub(lambda m: f"**{m.group(0)}**", highlighted_context)

        return highlighted_context

    # Process the citation data
    for cite in result_obj['citations'].model_dump()['citations']:
        cite_id = cite['id']
        title = cite['title']
        source = cite['source']
        page = cite['page']
        quote = cite['quotes']

        # Check if the (source, title) key exists, and initialize if it doesn't
        if (source, title) not in source_with_citations:
            source_with_citations[(source, title)] = {
                'title': title,
                'source': source,
                'citations': []
            }

        # Find or create the citation entry for this unique (id, page) combination
        citation_entry = next(
            (c for c in source_with_citations[(source, title)]['citations'] if c['id'] == cite_id and c['page'] == page),
            None
        )
        if citation_entry is None:
            citation_entry = {'id': cite_id, 'page': page, 'quote': [quote], 'context': None}
            source_with_citations[(source, title)]['citations'].append(citation_entry)
        else:
            citation_entry['quote'].append(quote)

    # Process context data
    for context in result_obj['context']:
        context_id = context.metadata['id']
        context_page = context.metadata['page']
        source = context.metadata['source']
        title = context.metadata['title']
        page_content = context.page_content

        # Match the context to the correct citation entry by source, title, id, and page
        if (source, title) in source_with_citations:
            for citation in source_with_citations[(source, title)]['citations']:
                if citation['id'] == context_id and citation['page'] == context_page:
                    # Apply highlighting for each quote in the citation's quote list
                    highlighted_content = page_content
                    for quote in citation['quote']:
                        highlighted_content = highlight_text(highlighted_content, quote)
                    citation['context'] = highlighted_content

    # Convert the dictionary to a list of dictionaries for separate entries
    final_result_list = [
        {
            'title': details['title'],
            'source': details['source'],
            'citations': details['citations']
        }
        for details in source_with_citations.values()
    ]

    return final_result_list
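To see what the highlighting step does in isolation, here is a small standalone sketch of the same idea: split the quote at sentence-ending punctuation, then bold each phrase wherever it occurs in the context. The function name `highlight` and the sample strings are made up for illustration. Note that the match is literal apart from case, so a quote that differs from the context by even one character (a ligature, a line-break hyphen, a paraphrased word) is left unhighlighted.

```python
import re


def highlight(context: str, quote: str) -> str:
    """Wrap each sentence-level phrase of `quote` found in `context` in ** **."""
    quote = re.sub(r"\s+", " ", quote).strip()
    context = re.sub(r"\s+", " ", context).strip()
    for phrase in (p.strip() for p in re.split(r"[.!?]", quote)):
        if not phrase:
            continue
        # Literal, case-insensitive match on word boundaries
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        context = pattern.sub(lambda m: f"**{m.group(0)}**", context)
    return context


sample = "Position embeddings are added. The encoder uses self-attention."
print(highlight(sample, "position embeddings are added."))
print(highlight(sample, "ViTs use image patches."))
```

The first call bolds the matching sentence; the second returns the context unchanged because the quoted phrase does not occur in it.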


In [70]:
get_cited_context(result)

[{'title': 'vision_transformer.pdf',
  'source': '../../rag_docs/vision_transformer.pdf',
  'citations': [{'id': '0ea6821d-e794-4ca0-9dee-0f9ee8a499cc',
    'page': 7,
    'quote': ['Vision Transformers adapt the standard transformer architecture for image classification tasks by treating image patches as tokens.'],
    'context': 'Focuses on a controlled scaling study of various models, including Vision Transformers and ResNets, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers generally outperform ResNets in terms of efficiency and scalability. Additionally, it discusses the implications of hybrid models and the potential for further scaling of Vision Transformers. Published as a conference paper at ICLR 2021 4.4 SCALING STUDY We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck t

In [71]:
from IPython.display import display, Markdown

def display_results(result_obj):
    print('Query:')
    display(Markdown(result_obj['question']))
    print()
    print('Response:')
    display(Markdown(result_obj['answer']))
    print('='*50)
    print('Sources:')
    cited_context = get_cited_context(result_obj)
    for source in cited_context:
        print('Title:', source['title'], ' ', 'Source:', source['source'])
        print('Citations:')
        for citation in source['citations']:
            print('ID:', citation['id'], ' ', 'Page:', citation['page'])
            print('Cited Quotes:')
            display(Markdown('*'+' '.join(citation['quote'])+'*'))
            print('Cited Context:')
            display(Markdown(citation['context']))
            print()


In [72]:
display_results(result)

Query:


What is the difference between transformers and vision transformers?


Response:


The difference between transformers and vision transformers primarily lies in their application and input processing methods.

1. **Transformers**:
   - Originally designed for natural language processing (NLP) tasks, transformers utilize self-attention mechanisms to process sequences of tokens (words).
   - They operate on 1D sequences, where each token is treated independently, allowing the model to capture relationships between tokens regardless of their position in the sequence.
   - The architecture is highly scalable and efficient, making it suitable for large datasets and complex tasks in NLP.

2. **Vision Transformers (ViT)**:
   - Vision Transformers adapt the standard transformer architecture for image classification tasks by treating image patches as tokens.
   - Instead of processing a sequence of words, ViTs split an image into fixed-size patches, flatten these patches, and then embed them into a sequence of vectors. This sequence is then fed into a transformer encoder.
   - ViTs incorporate position embeddings to retain spatial information, as the 2D structure of images is crucial for understanding visual data.
   - They have shown competitive performance in image recognition tasks, especially when pre-trained on large datasets, outperforming traditional convolutional neural networks (CNNs) in terms of efficiency and scalability.

In summary, while both transformers and vision transformers utilize self-attention mechanisms, the key difference lies in their input format and application domain: transformers are designed for sequential data (like text), whereas vision transformers adapt this architecture for 2D image data by treating image patches as tokens.

Sources:
Title: vision_transformer.pdf   Source: ../../rag_docs/vision_transformer.pdf
Citations:
ID: 0ea6821d-e794-4ca0-9dee-0f9ee8a499cc   Page: 7
Cited Quotes:


*Vision Transformers adapt the standard transformer architecture for image classification tasks by treating image patches as tokens.*

Cited Context:


Focuses on a controlled scaling study of various models, including Vision Transformers and ResNets, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers generally outperform ResNets in terms of efficiency and scalability. Additionally, it discusses the implications of hybrid models and the potential for further scaling of Vision Transformers. Published as a conference paper at ICLR 2021 4.4 SCALING STUDY We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre- trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total dowsampling ratio in the ResNet backbone). Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Ap- pendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2 −4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small compu- tational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. 
Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts. 4.5 INSPECTING VISION TRANSFORMER Input Attention Figure 6: Representative ex- amples of attention from the output token to the input space. See Appendix D.7 for details. To begin to understand how the Vision Transformer processes im- age data, we analyze its internal representations. The ﬁrst layer of the Vision Transformer linearly projects the ﬂattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top prin- cipal components of the the learned embedding ﬁlters. The com- ponents resemble plausible basis functions for a low-dimensional representation of the ﬁne structure within each patch. After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position em- beddings, i.e. closer patches tend to have more similar position em- beddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology ex- plains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4). Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Speciﬁcally, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive ﬁeld size in CNNs. We ﬁnd that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads


ID: 67372578-19cd-48d4-9767-196e5c15e471   Page: 0
Cited Quotes:


*We experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer.*

Cited Context:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a standard Transformer architecture directly to image classification tasks by treating image patches as tokens. It highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision and presents evidence that a pure Transformer can achieve competitive performance on various image recognition benchmarks when pre-trained on large datasets. Published as a conference paper at ICLR 2021 AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗, Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,† ∗equal technical contribution, †equal advising Google Research, Brain Team {adosovitskiy, neilhoulsby}@google.com ABSTRACT While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classiﬁcation tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring sub- stantially fewer computational resources to train.1 1 INTRODUCTION Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). 
The dominant approach is to pre-train on a large text corpus and then ﬁne-tune on a smaller task-speciﬁc dataset (Devlin et al., 2019). Thanks to Transformers’ computational efﬁciency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance. In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efﬁcient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet- like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020). Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modiﬁcations. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Trans- former. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classiﬁcation in supervised fashion. When trained on mid-sized datasets such as ImageNet without strong regularization, these mod- els yield modest accuracies of a few percentage points below ResNets of comparable size. 
This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases 1Fine-tuning code and pre-trained models are available at https://github.com/ google-research/vision_transformer 1 arXiv:2010.11929v2 [cs.CV] 3 Jun 2021


ID: 17fc9c5f-3529-4522-89df-abf28bfb5590   Page: 2
Cited Quotes:


*The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image into a sequence of flattened 2D patches.*

Cited Context:


Focuses on the architecture and methodology of the Vision Transformer (ViT), detailing how images are processed by splitting them into patches, embedding them, and utilizing a standard Transformer encoder for image classification tasks. It describes the model's design principles, including the use of position embeddings and the integration of a classification token, while referencing foundational work in Transformer architecture.

Published as a conference paper at ICLR 2021

[Figure 1 diagram labels: Vision Transformer (ViT), Linear Projection of Flattened Patches, Patch + Position Embedding, extra learnable [class] embedding, Transformer Encoder, MLP Head; encoder block: Embedded Patches, Norm, Multi-Head Attention, MLP]

Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. The illustration of the Transformer encoder was inspired by Vaswani et al. (2017).

3 METHOD

In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

3.1 VISION TRANSFORMER (VIT)

An overview of the model is depicted in Figure 1. **The standard Transformer receives as input a 1D sequence of token embeddings**. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.
The Transformer uses constant latent vector size D through all of its layers, so we flatten the patches and map to D dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings. Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches ($z_0^0 = x_\text{class}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time. Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder. The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).


ID: 6d2d3b62-4961-479f-96fe-1ed226d74dda   Page: 1
Cited Quotes:


*ViTs incorporate position embeddings to retain spatial information, as the 2D structure of images is crucial for understanding visual data.*

Cited Context:


Focuses on the performance of the Vision Transformer (ViT) when trained on large datasets, highlighting its ability to achieve state-of-the-art results in image recognition tasks despite lacking some inductive biases inherent to convolutional neural networks (CNNs). It also discusses related work in the field, particularly the application of self-attention mechanisms in image processing and comparisons with existing models.

Published as a conference paper at ICLR 2021

inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

2 RELATED WORK

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020). Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes.
Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators. Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2 from the input image and applies full self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution images, while we handle medium-resolution images as well. There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g.
for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen
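
Note that the quote cited for chunk `6d2d3b62-4961-479f-96fe-1ed226d74dda` does not appear verbatim in the context shown for it, and nothing in that context is bolded. A model can paraphrase rather than quote, so it may be worth checking each cited quote against its source chunk before displaying it. The following is only a minimal sketch of such a check; the helper names `normalize` and `quote_is_verbatim` are hypothetical and are not part of this notebook's `display_results` function, and the sample strings are shortened illustrations.

```python
import re


def normalize(text):
    """Drop markdown bold markers, collapse whitespace, and lowercase."""
    text = text.replace("**", "")
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_verbatim(quote, context):
    """Return True if the quote occurs in the context after normalization."""
    return normalize(quote) in normalize(context)


# Hypothetical sample data, shortened from the outputs above.
context = (
    "Position embeddings are added to the patch embeddings to retain "
    "positional information."
)
grounded = "Position embeddings are added to the patch embeddings"
paraphrased = "ViTs incorporate position embeddings to retain spatial information"

print(quote_is_verbatim(grounded, context))
print(quote_is_verbatim(paraphrased, context))
```

A quote that fails this check could be dropped, flagged in the display, or sent back to the model for a retry; which option fits depends on how strict the attribution needs to be.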




In [73]:
query = "What is AI, ML and DL?"
result = rag_chain_w_citations.invoke(query)
display_results(result)

Query:


What is AI, ML and DL?


Response:


**Artificial Intelligence (AI)**: AI is defined as the ability of a computer program or machine to think and learn. It is also a field of study aimed at making computers "smart," allowing them to operate independently without being explicitly programmed with commands. The term was coined by John McCarthy in 1955. AI encompasses systems that can interpret external data, learn from it, and adapt to achieve specific goals. As technology advances, tasks once considered to require intelligence, such as optical character recognition, are no longer classified as AI but rather as routine technologies.

**Machine Learning (ML)**: ML is a subfield of computer science that provides computers the ability to learn from data without being explicitly programmed. The concept originated from AI research. Machine learning involves the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. These algorithms build models from sample inputs and are particularly useful in scenarios where traditional programming is impractical. Applications of ML include spam filtering, network intrusion detection, optical character recognition, search engines, and computer vision.

**Deep Learning (DL)**: DL is a specialized form of machine learning that primarily utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). It can involve unsupervised, semi-supervised, or supervised learning sessions. Deep learning is particularly effective for complex tasks such as speech recognition, image understanding, and handwriting recognition, which are challenging for computers. The architecture of deep learning models is inspired by the information processing patterns of biological nervous systems, although they differ significantly from the structural and functional properties of human brains.

Sources:
Title: Artificial intelligence   Source: Wikipedia
Citations:
ID: 6360   Page: 1
Cited Quotes:


*Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.*

Cited Context:


**Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn**. It is also a field of study which tries to make computers "smart". **They work on their own without being encoded with commands**. **John McCarthy came up with the name "Artificial Intelligence" in 1955**. **In general use, the term "artificial intelligence" means a programme which mimics human cognition**. **At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do**. **Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation**. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental faculties once thought to require intelligence are removed from the definition. For example, optical character recognition is no longer perceived as an example of "artificial intelligence": it is just a routine technology.


Title: Machine learning   Source: Wikipedia
Citations:
ID: 564928   Page: 1
Cited Quotes:


*Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data.*

Cited Context:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). **It is a subfield of computer science**. **The idea came from work in artificial intelligence**. **Machine learning explores the study and construction of algorithms which can learn and make predictions on data**. **Such algorithms follow programmed instructions, but can also make predictions or decisions based on data**. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Title: Deep learning   Source: Wikipedia
Citations:
ID: 663523   Page: 1
Cited Quotes:


*Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer.*

Cited Context:


**Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks**. **As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised**. **In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer**. Certain tasks, such as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, which make them incompatible with neuroscience evidences.
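
In the output above, sentences from each cited quote appear in bold inside the cited context. One way to get this effect is to split the quote into sentences and wrap each exact match in `**` markers. The sketch below assumes that approach; `highlight_quotes` is a hypothetical helper, not the notebook's actual `display_results` implementation, and sentences that do not match exactly are simply left unhighlighted.

```python
import re


def highlight_quotes(context, quote):
    """Bold each sentence of the quote that occurs verbatim in the context."""
    # Split the quote into sentences at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", quote) if s.strip()]
    for sentence in sentences:
        # Keep the closing punctuation outside the bold span.
        core = sentence.rstrip(".!?")
        if core and core in context:
            context = context.replace(core, f"**{core}**", 1)
    return context


# Hypothetical sample data, shortened from the "Machine learning" context above.
context = (
    "Machine learning gives computers the ability to learn without being "
    "explicitly programmed (Arthur Samuel, 1959). "
    "It is a subfield of computer science. "
    "They build a model from sample inputs."
)
quote = "It is a subfield of computer science."

print(highlight_quotes(context, quote))
```

Because the match is exact, small differences between the quote and the source (for example, a dropped parenthetical) leave that sentence without highlighting.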




In [74]:
query = "How is Machine learning related to supervised learning and clustering?"
result = rag_chain_w_citations.invoke(query)
display_results(result)

Query:


How is Machine learning related to supervised learning and clustering?


Response:


Machine learning is a broad field that encompasses various techniques and methodologies for enabling computers to learn from data. Two important concepts within machine learning are supervised learning and clustering.

### Supervised Learning
- **Definition**: Supervised learning is a specific type of machine learning where the model is trained on labeled data. This means that the training dataset includes both the input data and the corresponding correct outputs (labels).
- **Process**: The system infers a function from this labeled training data, learning how to map inputs to the correct outputs. The goal is to create a classifier that can predict outcomes for new, unseen data based on the patterns learned from the training data.
- **Inductive Reasoning**: Supervised learning typically employs inductive reasoning to generalize from the training data to make predictions.

### Clustering
- **Definition**: Clustering, or cluster analysis, is a type of data analysis that involves grouping a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups.
- **Application**: Clustering is commonly used in data mining and is a form of unsupervised learning, where the model does not rely on labeled outputs. Instead, it identifies patterns and structures in the data based solely on the input features.

### Relationship Between Machine Learning, Supervised Learning, and Clustering
- **Machine Learning as an Umbrella**: Machine learning serves as the overarching field that includes various learning paradigms, including supervised learning and clustering.
- **Supervised Learning vs. Clustering**: While supervised learning requires labeled data and focuses on predicting outcomes, clustering operates without labels and focuses on discovering inherent groupings within the data.
- **Different Goals**: The primary goal of supervised learning is to predict outcomes based on learned relationships, whereas clustering aims to identify and group similar data points.

In summary, machine learning encompasses both supervised learning and clustering, with each serving distinct purposes and methodologies within the field.

Sources:
Title: Supervised learning   Source: Wikipedia
Citations:
ID: 359370   Page: 1
Cited Quotes:


*In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly.*

Cited Context:


**In machine learning, supervised learning is the task of inferring a function from labelled training data**. **The results of the training are known beforehand, the system simply learns how to get to these results correctly**. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Title: Machine learning   Source: Wikipedia
Citations:
ID: 564928   Page: 1
Cited Quotes:


*Machine learning gives computers the ability to learn without being explicitly programmed. It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data.*

Cited Context:


**Machine learning gives computers the ability to learn without being explicitly programmed** (Arthur Samuel, 1959). **It is a subfield of computer science**. **The idea came from work in artificial intelligence**. **Machine learning explores the study and construction of algorithms which can learn and make predictions on data**. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Title: Cluster analysis   Source: Wikipedia
Citations:
ID: 593732   Page: 1
Cited Quotes:


*Clustering or cluster analysis is a type of data analysis. The analyst groups objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters) in some way.*

Cited Context:


**Clustering or cluster analysis is a type of data analysis**. **The analyst groups objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters) in some way**. This is a common task in data mining.
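
The "Sources" listing above groups citations under each document's title and source. A minimal sketch of that grouping step follows, assuming each citation is a plain dict with `id`, `title`, `source`, `page`, and `quote` keys. That structure and the `render_sources` name are assumptions made for illustration; the objects actually returned by `rag_chain_w_citations` may be shaped differently.

```python
from collections import defaultdict


def render_sources(citations):
    """Group citations by (title, source) and render them as markdown text."""
    grouped = defaultdict(list)
    for citation in citations:
        grouped[(citation["title"], citation["source"])].append(citation)

    lines = ["Sources:"]
    # Dicts preserve insertion order, so documents appear in first-cited order.
    for (title, source), items in grouped.items():
        lines.append(f"Title: {title}   Source: {source}")
        lines.append("Citations:")
        for citation in items:
            lines.append(f"ID: {citation['id']}   Page: {citation['page']}")
            lines.append("Cited Quotes:")
            lines.append("")
            lines.append(f"*{citation['quote']}*")
            lines.append("")
    return "\n".join(lines)


# Hypothetical citation records, with quotes shortened from the output above.
citations = [
    {
        "id": "359370",
        "title": "Supervised learning",
        "source": "Wikipedia",
        "page": 1,
        "quote": "In machine learning, supervised learning is the task of "
                 "inferring a function from labelled training data.",
    },
    {
        "id": "593732",
        "title": "Cluster analysis",
        "source": "Wikipedia",
        "page": 1,
        "quote": "Clustering or cluster analysis is a type of data analysis.",
    },
]

print(render_sources(citations))
```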




In [None]:
query = "What is the difference between transformers and vision transformers?"
result = rag_chain_w_citations.invoke(query)
display_results(result)