**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Project: Build a Document Retriever Search Engine on Wikipedia Data

## Install OpenAI and LangChain dependencies

In [2]:
!pip install -qq langchain
!pip install -qq langchain-openai
!pip install -qq langchain-community
!pip install -qq langchain-huggingface
!pip install -qq jq
!pip install -qq pymupdf

## Install Chroma Vector DB and LangChain wrapper

In [3]:
!pip install -qq langchain-chroma==0.1.4

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
transformers 4.53.1 requires tokenizers<0.22,>=0.21, but you have tokenizers 0.20.3 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.

In [4]:
import os
from dotenv import load_dotenv

load_dotenv()

True

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the newer `text-embedding-3-small` (smaller and highly efficient) and `text-embedding-3-large` (larger and more powerful).

In [5]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Loading and Processing the Data

### Load JSON Documents from Wikipedia Dump

In [10]:
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='../docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()

In [11]:
len(wiki_docs)

1801

In [13]:
print(wiki_docs[0].page_content)

{"id": "84801", "title": "Chinese New Year", "paragraphs": ["Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.", "The Chinese New Year is of the most important holidays for Chinese people all over the world. Its 7th day used to be used instead of birthdays to count people's ages in China. The holiday is still used to tell people which \"animal\" of the Chinese zodiac they are part of. The holiday is a time for gifts to children and for family gatherings with large meals, just like Christmas in Europe and in other Christian areas. Unlike Christmas, the children usually get gifts of c

In [18]:
import json
from langchain_core.documents import Document

# Initialize an empty list to store processed Document objects
wiki_docs_processed = []

# Iterate through each document loaded from the JSONLoader
for doc in wiki_docs:
    # Parse the JSON content from the page_content attribute
    # (stored under a new name so the loop variable is not overwritten)
    record = json.loads(doc.page_content)

    # Extract relevant metadata fields for each document
    metadata = {
        "title": record['title'],   # The title of the Wikipedia article
        "id": record['id'],         # The unique identifier for the article
        "source": "Wikipedia"       # The source of the document
    }

    # Concatenate all paragraphs into a single string for the document content
    data = ' '.join(record['paragraphs'])

    # Create a LangChain Document object with the combined content and metadata, and add it to the list
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))

In [19]:
wiki_docs_processed

[Document(metadata={'title': 'Chinese New Year', 'id': '84801', 'source': 'Wikipedia'}, page_content='Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20. The Chinese New Year is of the most important holidays for Chinese people all over the world. Its 7th day used to be used instead of birthdays to count people\'s ages in China. The holiday is still used to tell people which "animal" of the Chinese zodiac they are part of. The holiday is a time for gifts to children and for family gatherings with large meals, just like Christmas in Europe and in other Christian areas. Unlike Christmas

In [20]:
print(wiki_docs_processed[1500])

page_content='Jan Persson ("Janne Lucas"), born 3 October 1947 in Gothenburg's Gamlestad Parish in Gothenburg, Sweden is a Swedish pianist and singer, scoring several chart successes in Sweden during the 1970s and 1980s. Janne Lucas participated at Melodifestivalen 1980 with the song "Växeln hallå", winning the contest. The upcoming year he participated with the song "Rocky Mountain" ending up third. For many years, Janne Lucas also acted as pianist for "Vi i femman" Janne also accompanied the vocal group "Noviserna" for a while, where Anna-Lisa Cederquist participated.' metadata={'title': 'Janne Persson', 'id': '460169', 'source': 'Wikipedia'}
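The same JSON Lines processing can be tried without LangChain on a small in-memory sample. This is only a sketch: the two records below are made up and merely mimic the `id`, `title` and `paragraphs` fields used above, and plain dictionaries stand in for `Document` objects.

```python
import io
import json

# Two made-up JSON Lines records (hypothetical data, one JSON object per line).
sample_jsonl = io.StringIO(
    '{"id": "1", "title": "Alpha", "paragraphs": ["First paragraph.", "Second paragraph."]}\n'
    '{"id": "2", "title": "Beta", "paragraphs": ["Only paragraph."]}\n'
)

processed = []
for line in sample_jsonl:
    if not line.strip():
        continue  # skip blank lines
    record = json.loads(line)
    processed.append({
        # Join all paragraphs into one string, as in the cell above
        "page_content": " ".join(record["paragraphs"]),
        # Keep a few fields as metadata; "sample" marks this as toy data
        "metadata": {"title": record["title"], "id": record["id"], "source": "sample"},
    })

print(len(processed))
print(processed[0]["page_content"])
```

Each line becomes one record whose paragraphs are merged into a single text field, which is the shape the vector database receives later.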


### Create function to generate contextual summaries for chunks

Here we borrow inspiration from Anthropic's [contextual retrieval](https://www.anthropic.com/news/contextual-retrieval) strategy, which involves creating a contextual summary for each chunk and prepending it to the chunk before storing it in the vector database.
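As a rough offline illustration of the idea, the sketch below prepends a context line to each chunk. `fake_context` is a hypothetical stand-in for the LLM call, and `Chunk` is a tiny stand-in class rather than LangChain's `Document`; both exist only so the example runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Tiny stand-in for a document chunk (not the LangChain Document class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_context(document: str, chunk: str) -> str:
    """Hypothetical placeholder for an LLM call that situates the chunk in the document."""
    first_words = ' '.join(chunk.split()[:5])
    n_words = len(document.split())
    return f"Focuses on the passage beginning '{first_words}' in a {n_words}-word document."


def add_context(document: str, chunks: list) -> list:
    """Prepend a short context line to every chunk, keeping its metadata."""
    return [
        Chunk(page_content=fake_context(document, c.page_content) + '\n' + c.page_content,
              metadata=c.metadata)
        for c in chunks
    ]


document = ("Attention lets a model weigh tokens. "
            "Multi-head attention runs several heads in parallel.")
chunks = [Chunk("Attention lets a model weigh tokens.", {"page": 0}),
          Chunk("Multi-head attention runs several heads in parallel.", {"page": 0})]

contextual = add_context(document, chunks)
print(contextual[1].page_content)
```

The combined string (context plus original chunk) is what gets embedded, so a query can match either the summary or the raw text.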

![](https://i.imgur.com/cjnB831.png)

#### Create Chunk Contexts for Contextual Retrieval

![](https://i.imgur.com/LRhKHzk.png)

In [21]:
# load PDF files with langchain
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("../docs/attention_paper.pdf")
doc_pages = loader.load()

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=3500,
                                          chunk_overlap=0)
doc_chunks = splitter.split_documents(doc_pages)

In [23]:
len(doc_chunks)

16
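To get a feel for what size-bounded splitting does, here is a simplified greedy splitter in plain Python. It is only a sketch under stated assumptions: `split_text` is a hypothetical helper, not LangChain's implementation, and it ignores features such as chunk overlap. It tries coarse separators first and falls back to finer ones when a piece is still too long.

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Greedy, simplified character splitter (an illustration, not LangChain's algorithm)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard cuts every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)   # close the open chunk
        if len(piece) <= chunk_size:
            current = piece          # start a new chunk with this piece
        else:
            # Piece alone is too long: split it with the next, finer separator.
            chunks.extend(split_text(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("Para one is short.\n\n"
          "Para two is a little bit longer than the first.\n\n"
          "Three.")
chunks = split_text(sample, chunk_size=45)
for c in chunks:
    print(len(c), repr(c))
```

Every chunk stays within the size limit, and paragraph boundaries are preferred over splitting mid-sentence whenever a paragraph fits.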

In [24]:
# the actual research paper
big_doc = '\n'.join([doc.page_content for doc in doc_chunks])

In [25]:
len(big_doc.split(' '))

5050

In [26]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [None]:
# create a chat prompt
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def generate_chunk_context(document, chunk):

    chunk_process_prompt = """You are an AI assistant specializing in research paper analysis.
                            Your task is to provide brief, relevant context for a chunk of text
                            based on the following research paper.

                            Here is the research paper:
                            <paper>
                            {paper}
                            </paper>

                            Here is the chunk we want to situate within the whole document:
                            <chunk>
                            {chunk}
                            </chunk>

                            Provide a concise context (3-4 sentences max) for this chunk,
                            considering the following guidelines:

                            - Give a short succinct context to situate this chunk within the overall document
                            for the purposes of improving search retrieval of the chunk.
                            - Answer only with the succinct context and nothing else.
                            - Context should be mentioned like 'Focuses on ....'
                            do not mention 'this chunk or section focuses on...'

                            Context:
                        """

    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    agentic_chunk_chain = (prompt_template
                                |
                            chatgpt
                                |
                            StrOutputParser())

    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    return context

In [28]:
print(doc_chunks[5].page_content)

output values. These are concatenated and once again projected, resulting in the final values, as
depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use
$d_k = d_v = d_{\text{model}}/h = 64$. Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
3.2.3
Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and

In [29]:
generate_chunk_context(big_doc, doc_chunks[5].page_content)

'Focuses on the implementation and functionality of multi-head attention within the Transformer architecture, detailing how it allows the model to process information from different representation subspaces simultaneously. It describes the mathematical formulation of multi-head attention and its applications in both encoder-decoder and self-attention layers, as well as the integration of position-wise feed-forward networks.'

### Load and Process PDF Documents

In [30]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_contextual_chunks(file_path):

    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()

    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=3500,
                                              chunk_overlap=0)
    doc_chunks = splitter.split_documents(doc_pages)

    print('Generating contextual chunks:', file_path)
    original_doc = '\n'.join([doc.page_content for doc in doc_chunks])
    contextual_chunks = []
    for chunk in doc_chunks:
        context = generate_chunk_context(original_doc, chunk.page_content)
        contextual_chunks.append(Document(page_content=context+'\n'+chunk.page_content,
                                          metadata=chunk.metadata))
    print('Finished processing:', file_path)
    print()
    return contextual_chunks

In [33]:
from glob import glob

pdf_files = glob('../docs/*.pdf')
pdf_files

['../docs/layoutparser_paper.pdf',
 '../docs/cnn_paper.pdf',
 '../docs/vision_transformer.pdf',
 '../docs/resnet_paper.pdf',
 '../docs/WEB_How_and_Why_to_UseLLMs_for_Chunk_Based_Information_Retrieval_Carlo_Peron_Oct_2024_TowardsDataScience.pdf',
 '../docs/attention_paper.pdf',
 '../docs/Vision Transformers.pdf']

In [36]:
paper_docs = []
for fp in pdf_files:
    if ('attention' in fp) or ('transformer' in fp) or ('vision' in fp):
        paper_docs.extend(create_contextual_chunks(fp))

Loading pages: ../docs/vision_transformer.pdf
Chunking pages: ../docs/vision_transformer.pdf
Generating contextual chunks: ../docs/vision_transformer.pdf
Finished processing: ../docs/vision_transformer.pdf

Loading pages: ../docs/attention_paper.pdf
Chunking pages: ../docs/attention_paper.pdf
Generating contextual chunks: ../docs/attention_paper.pdf
Finished processing: ../docs/attention_paper.pdf



In [37]:
len(paper_docs)

44

In [38]:
paper_docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-04T00:19:58+00:00', 'source': '../docs/vision_transformer.pdf', 'file_path': '../docs/vision_transformer.pdf', 'total_pages': 22, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-04T00:19:58+00:00', 'trapped': '', 'modDate': 'D:20210604001958Z', 'creationDate': 'D:20210604001958Z', 'page': 0}, page_content='Focuses on the introduction of the Vision Transformer (ViT) model, which applies a pure Transformer architecture to image classification tasks by treating image patches as tokens. It highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks with fewer computational resources when pre-trained on large datasets.\nPublished as a conference paper at ICLR 2021\nAN IMAGE IS WORTH 16X16 WORDS:\nTRANSFORMERS 

In [39]:
len(wiki_docs_processed)

1801

In [40]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

1845

## Vector Databases

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector database takes care of storing embedded data and performing vector search for you.
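The embed, store and search loop can be sketched in a few lines of plain Python. Everything here is illustrative: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `TinyVectorStore` is a made-up class that scans every stored vector instead of using an index the way a real vector database does.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical bag-of-words 'embedding'; a real system would call an embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Stores (text, vector) pairs and returns the k most similar texts."""

    def __init__(self):
        self._items = []

    def add(self, texts):
        for t in texts:
            self._items.append((t, toy_embed(t)))

    def search(self, query, k=2):
        q = toy_embed(query)  # embed the query at search time
        scored = [(cosine(q, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = TinyVectorStore()
store.add([
    "machine learning builds models from data",
    "baseball is a popular sport",
    "deep learning is a kind of machine learning",
])
results = store.search("what is machine learning", k=2)
print(results)
```

The two machine-learning sentences rank above the baseball sentence because they share more terms with the query; a learned embedding model plays the same role but captures meaning rather than exact word overlap.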

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

### Create a Vector DB and persist on disk

Here we create a Chroma vector DB from our documents and persist it to disk. To do that, we pass `persist_directory`, the directory where the data should be saved, when initializing Chroma.

In [41]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")
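To see why the choice of distance function can matter, the toy comparison below ranks three made-up vectors against a query using cosine distance and squared Euclidean distance. The vectors and helper names are assumptions for illustration only. For unit-length vectors the two agree, since the squared distance equals 2 - 2·cos, but on unnormalized vectors the rankings can differ.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors: "a" points in the same direction as the query but is twice as long.
query = [1.0, 2.0, 0.5]
docs = {"a": [2.0, 4.0, 1.0], "b": [1.0, 0.0, 0.0], "c": [0.0, 3.0, 3.0]}

by_cosine = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
by_l2_raw = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
by_l2_unit = sorted(
    docs, key=lambda name: squared_l2(normalize(query), normalize(docs[name]))
)

print("cosine:        ", by_cosine)
print("L2 (raw):      ", by_l2_raw)
print("L2 (unit norm):", by_l2_unit)
```

Cosine distance ignores vector length, so "a" ranks first even though it is far from the query in Euclidean terms.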

### Load Vector DB from disk

Once a vector database has been persisted to disk, you can load it and create a connection to it at any time.

In [42]:
# load from disk
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

In [43]:
chroma_db

<langchain_chroma.vectorstores.Chroma at 0x123aaa850>

## Experiment with Vector Database Retrievers

Here we will explore the following retrieval strategies on our Vector Database:

- Similarity or Ranking based Retrieval
- Multi Query Retrieval
- Contextual Compression Retrieval
- Chained Retrieval Pipeline

### Similarity or Ranking based Retrieval

We use cosine similarity here and retrieve the top 5 most similar documents for the user's query.

In [44]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

In [45]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

In [46]:
query = "what is machine learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '6360', 'source': 'Wikipedia', 'title': 'Artificial intelligence'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental facu


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [47]:
query = "what is ML?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '312307', 'source': 'Wikipedia', 'title': 'Standard ML'}
Content Brief:


Standard ML is a functional programming language which is a dialect of ML (programming language). It is sometimes used for writing compilers and in theorem provers. Here is an example of a factorial function written in a simple, non-tail recursive, style.


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '15798', 'source': 'Wikipedia', 'title': 'Major League Baseball'}
Content Brief:


Major League Baseball (MLB) is a professional baseball league in North America. It is often considered to be the highest level of professional baseball in the world. There are two leagues that make up the MLB: the American League, also called AL, and National League, also called NL. There are currently 30 teams in the MLB, 29 from the United States and one from Canada, the Toronto Blue Jays. The official website of MLB is known as "MLB.com" (www.mlb.com). The 30 teams in MLB are divided into two leagues: American and National. Each league is divided into three divisions: East, Central, West. Since the 2013 season, each division has had five teams. The most recent change took place after the 2012 season, when the Houston Astros moved from the NL Central to the AL West.


Metadata: {'id': '196959', 'source': 'Wikipedia', 'title': 'Mathematical Reviews'}
Content Brief:


Mathematical Reviews is a journal and online database published by the American Mathematical Society that contains many articles in mathematics, statistics, and related topics.


Metadata: {'id': '757418', 'source': 'Wikipedia', 'title': 'VRML'}
Content Brief:


VRML (Virtual Reality Modeling Language, pronounced "vermal", or by its initials, known before 1995 as Virtual Reality Markup Language) is a standard 3-dimensional (3D) interactive vector graphics file format designed for the World Wide Web. It has been succeeded by X3D. VRML uses text files. The vertices, edges, surface colors, UV-mapped textures, shininess, transparency and more of a 3D polygon can be specified. Graphical components can be made to fetch web pages or other VRML files from the Internet from URLs when the user clicks on the graphical component. Animations, sounds, lighting, and other things about the virtual world can interact with the user or can happen when external events say so, such as timers. A special Script Node allows program code (such as program code in Java or ECMAScript) to be added to a VRML file. VRML files are commonly called "worlds" and have the .wrl extension (for example, a VRML file can be called island.wrl). VRML files are in plain text and usually




In [48]:
query = "what is the difference between transformers and vision transformers?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 0, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a pure Transformer architecture to image classification tasks by treating image patches as tokens. It highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks with fewer computational resources when pre-trained on large datasets.
Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing ta


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 7, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the scaling study of various models, including Vision Transformers and ResNets, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers outperform ResNets in terms of compute efficiency and suggesting potential for further scaling. Additionally, it discusses the performance of hybrid models in comparison to pure Vision Transformers.
Published as a conference paper at ICLR 2021
4.4
SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14 p


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 2, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the architecture and methodology of the Vision Transformer (ViT), detailing how images are processed by splitting them into patches, embedding them, and utilizing a standard Transformer encoder for image classification tasks. It describes the model's design principles, including the use of position embeddings and the classification head, while referencing foundational work in Transformer architecture.
Published as a conference paper at ICLR 2021
[Figure 1 diagram labels: Vision Transformer (ViT); Transformer Encoder; MLP Head; Class (Bird, Ball, Car, ...); Linear Projection of Flattened Patches; Patch + Position Embedding; extra learnable [class] embedding; Embedded Patches; Multi-Head Attention; Norm; MLP]
Figure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classiﬁcation, we use the


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 7, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the behavior of attention mechanisms in Vision Transformers, highlighting how attention distances vary across layers and the implications for model performance. It discusses the relationship between attention distance and network depth, as well as the role of hybrid models that incorporate convolutional layers. Additionally, it sets the stage for a transition to discussing self-supervised learning methods in Transformers.
have consistently small attention distances in the low layers. This highly localized attention is
less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right),
suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the
attention distance increases with network depth. Globally, we ﬁnd that the model attends to image
regions that are semantically relevant for classiﬁcation (Figure 6).
4.6
SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their succ


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 9, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include key works related to Transformers, image recognition, and self-attention mechanisms. These references provide foundational and contemporary insights that support the development and evaluation of the Vision Transformer (ViT) model presented in the paper.
Published as a conference paper at ICLR 2021
Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In
ICLR, 2019.
I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens. Attention augmented convolutional networks.
In ICCV, 2019.
Lucas Beyer, Olivier J. H´enaff, Alexander Kolesnikov, Xiaohua Zhai, and A¨aron van den Oord. Are
we done with imagenet? arXiv, 2020.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv, 2020.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usun




In [49]:
query = "what is a cnn?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '3615', 'source': 'Wikipedia', 'title': 'CNN'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'id': '407048', 'source': 'Wikipedia', 'title': 'Piers Morgan Live'}
Content Brief:


Piers Morgan Live (previously known as Piers Morgan Tonight) is a television talk show on CNN. It is hosted by Piers Morgan. It started on January 17, 2011. It took over the timeslot that "Larry King Live" was in before Larry King retired. On the show, Morgan interviews guests such as politicians, celebrities, and members of the public. His first guest was Oprah Winfrey. The show was cancelled on February 23, 2014 and the final episode was aired on March 28, 2014.


Metadata: {'id': '246273', 'source': 'Wikipedia', 'title': 'NBA TV'}
Content Brief:


NBA TV is a television specialty channel that is dedicated to showcasing the sport of basketball in the United States. The network is financially backed by the National Basketball Association (NBA), which also uses NBA TV as a way of advertising their out of market package NBA League Pass, and partner channel TNT. Started in 1999 as nba.com TV, the channel, which had its studios at NBA Entertainment in Secaucus, New Jersey, began a multi-year deal with American television companies Cox Communications, Cablevision, and Time Warner on June 28, 2003, allowing the network to expand to 45 million American homes, and 30 different countries. NBA TV replaced Time Warner's CNN/SI on many cable systems after that network shut down a year earlier. NBA TV offers basketball news every day, as well as programming showcasing basketball players' individual lifestyles, life as a basketball team during an NBA season, famous games of the past, and live games typically four days a week during the NBA seas


Metadata: {'id': '14059', 'source': 'Wikipedia', 'title': 'News'}
Content Brief:


News is when people talk about current events (things that are happening right now). News Media is a portrayal of current affairs, perspectives and social influence. News can be given in newspapers, television, magazines, or radio. There are several news channels on cable television that give news all day long, such as Fox News and CNN. There are several news magazines, such as "Time", "The Economist", and "Newsweek". A newsman is a person who helps out with the news. For example, Brian Gotter is a newsman. News Media can be viewed in many forms, such as newspaper, television and radio.


Metadata: {'id': '365266', 'source': 'Wikipedia', 'title': 'New Canaan, Connecticut'}
Content Brief:


New Canaan is a affluent town in Fairfield County, Connecticut, United States. There were 19,738 people according to the 2010 census. The town is one of the richest communities in the nation. In 2011, New Canaan was 8th on CNN Money's list of the top-earning towns in the United States.




In [50]:
query = "what is deep learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 7, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the behavior of attention mechanisms in Vision Transformers, highlighting how attention distances vary across layers and the implications for model performance. It discusses the relationship between attention distance and network depth, as well as the role of hybrid models that incorporate convolutional layers. Additionally, it sets the stage for a transition to discussing self-supervised learning methods in Transformers.
have consistently small attention distances in the low layers. This highly localized attention is
less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right),
suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the
attention distance increases with network depth. Globally, we ﬁnd that the model attends to image
regions that are semantically relevant for classiﬁcation (Figure 6).
4.6
SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their succ




In [51]:
query = "what is nlp?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '335464', 'source': 'Wikipedia', 'title': 'Neurolinguistic programming'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. Certain neuroscientists psychologists and linguists, believe that NLP is unsupported by current scientific evidence and that it uses incorrect and misleading terms and concepts. NLP was invented by Richard Bandler and John Grinder. According to these people, NLP can help solve problems such as phobias, depression, habit disorder, psychosomatic illnesses, and learning disorders.


Metadata: {'id': '40613', 'source': 'Wikipedia', 'title': 'Natural language processing'}
Content Brief:


Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.


Metadata: {'id': '669662', 'source': 'Wikipedia', 'title': 'Loop AI Labs'}
Content Brief:


Loop AI Labs is an AI and cognitive computing company that focuses on language understanding technology. The company was founded in San Francisco in 2012 by Italian entrepreneur Gianmauro Calafiore, who sold his company Gsmbox to in 2004 and then relocated from Italy to San Francisco. Wanting to start an artificial intelligence company, he recruited two veterans of the project, the largest government-funded AI project in history, who had worked on the project at and Stanford University's . The original company name, "Soshoma", was changed to Loop AI Labs in 2015 after the company decided to change its focus from consumer-oriented to enterprise. Loop AI Labs is headquartered in San Francisco, California, with offices in New York, Milan, and Singapore. The company is privately funded. On May 4, 2017, Loop AI Labs entered into a deal with , a leading European provider of mobile messaging and solutions, to bring their cognitive computing technology to LINK's business clients, which cover 2


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic




### Multi Query Retrieval

Retrieval results can change with subtle differences in query wording, or when the embeddings do not capture the semantics of the data well. Prompt engineering or tuning can address these problems manually, but it is tedious.

The [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates this prompt tuning by using an LLM to generate multiple queries, each from a different perspective, for a given user query. It retrieves a set of relevant documents for each generated query and takes the unique union across all of them, which yields a larger set of potentially relevant documents.
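
The "unique union" step can be illustrated without any LangChain dependency. The sketch below is a simplified stand-in, not the library's implementation: `merge_unique`, `toy_retrieve`, and `TOY_INDEX` are hypothetical names, and the toy index maps each query variant to a hand-written list of article titles purely for illustration.

```python
from typing import Callable, Hashable, Iterable, List


def merge_unique(
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],
    key: Callable[[str], Hashable] = lambda doc: doc,
) -> List[str]:
    """Run every query variant and keep each document once, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):
            doc_key = key(doc)
            if doc_key not in seen:
                seen.add(doc_key)
                merged.append(doc)
    return merged


# Hypothetical toy index: query variant -> titles a retriever might return.
TOY_INDEX = {
    "what does cnn stand for": ["CNN", "News"],
    "convolutional neural networks": ["Artificial neural network", "Deep learning"],
    "cnns in machine learning": ["Deep learning", "Machine learning"],
}


def toy_retrieve(query: str) -> List[str]:
    """Stand-in for a vector-store lookup (assumption: exact-match on the query)."""
    return TOY_INDEX.get(query, [])


merged = merge_unique(TOY_INDEX.keys(), toy_retrieve)
print(merged)
# ['CNN', 'News', 'Artificial neural network', 'Deep learning', 'Machine learning']
```

"Deep learning" is returned by two variants but appears once in the merged list, which is why the final result can hold more than `k` documents but fewer than `k` times the number of generated queries.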

In [52]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [53]:
import logging

from langchain.retrievers.multi_query import MultiQueryRetriever

# Base retriever: return the 5 most similar chunks for each query
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

# Wrap the base retriever; the LLM rewrites each input query into several variants
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever, llm=chatgpt
)

# Log at INFO level so we can see which queries the LLM generates
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [54]:
query = "what is a cnn?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does CNN stand for and what are its main functions?  ', 'Can you explain the concept and applications of convolutional neural networks?  ', 'What are the key features and uses of CNNs in machine learning?']


Metadata: {'id': '3615', 'source': 'Wikipedia', 'title': 'CNN'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'id': '14059', 'source': 'Wikipedia', 'title': 'News'}
Content Brief:


News is when people talk about current events (things that are happening right now). News Media is a portrayal of current affairs, perspectives and social influence. News can be given in newspapers, television, magazines, or radio. There are several news channels on cable television that give news all day long, such as Fox News and CNN. There are several news magazines, such as "Time", "The Economist", and "Newsweek". A newsman is a person who helps out with the news. For example, Brian Gotter is a newsman. News Media can be viewed in many forms, such as newspaper, television and radio.


Metadata: {'id': '779837', 'source': 'Wikipedia', 'title': 'Euronews'}
Content Brief:


Euronews is a pan-European TV news channel, with headquarters in Lyon, France. Right now, it broadcasts in 13 languages. 88% of the channel is owned by Media Globe Networks, a company which belongs to Egyptian billionaire Naguib Sawiris. The other 12% belongs to a group of EBU (European Broadcasting Union) members. CNN helped make 24-hour TV news important after the Gulf War. The European Broadcasting Union answered this with the creation of their own news channel. Euronews started broadcasting on 1 January 1993, from Lyon, in five languages (English, French, German, Spanish and Italian). The channel was special in the news market because it had no presenters or studios, just videos showing the news. The channel's most important segment is "No Comment", which shows videos with no voiceover. Later, the channel started broadcasting in Portuguese in 1999, in Russian in 2001, in Arabic in 2008, in Turkish and in Persian in 2010, in Ukrainian in 2011, in Greek in 2012, and in Hungarian in 2


Metadata: {'id': '246273', 'source': 'Wikipedia', 'title': 'NBA TV'}
Content Brief:


NBA TV is a television specialty channel that is dedicated to showcasing the sport of basketball in the United States. The network is financially backed by the National Basketball Association (NBA), which also uses NBA TV as a way of advertising their out of market package NBA League Pass, and partner channel TNT. Started in 1999 as nba.com TV, the channel, which had its studios at NBA Entertainment in Secaucus, New Jersey, began a multi-year deal with American television companies Cox Communications, Cablevision, and Time Warner on June 28, 2003, allowing the network to expand to 45 million American homes, and 30 different countries. NBA TV replaced Time Warner's CNN/SI on many cable systems after that network shut down a year earlier. NBA TV offers basketball news every day, as well as programming showcasing basketball players' individual lifestyles, life as a basketball team during an NBA season, famous games of the past, and live games typically four days a week during the NBA seas


Metadata: {'id': '429703', 'source': 'Wikipedia', 'title': 'JTBC'}
Content Brief:


JTBC is a South Korean nationwide general cable television network and broadcasting company. It was established on March 21, 2011 and launched on December 1, 2011. JTBC is supported by "JoongAng Ilbo", which is one of the three biggest newspapers published in Seoul. JTBC's largest shareholder is JoongAng Media Network with 25% of shares. JTBC's corporate identity (CI) is motivated by the rainbow, representing creativity and variety. JTBC is in alliance with other overseas broadcasting companies, including CNN, FOX, HBO, BBC, SMG, TV Asahi, K-channel, Al Masry, Al Youm and Standard Group. JTBC insists Tongyang Broadcasting Corporation (TBC) is ground for the return of "JoongAng Ilbo" to television in JTBC. TBC, which was a part of the Samsung Group, was launched in 1966 and ran the network for 16 years. In 1980, however, TBC was combined with the state-run Korean Broadcasting System (KBS) under the Chun Doo-hwan military regime. JTBC opened on December 1, 2011 to honor TBC, which was as


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 9, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include key works related to Transformers, image recognition, and self-attention mechanisms. These references provide foundational and contemporary insights that support the development and evaluation of the Vision Transformer (ViT) model presented in the paper.
Published as a conference paper at ICLR 2021
Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In
ICLR, 2019.
I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens. Attention augmented convolutional networks.
In ICCV, 2019.
Lucas Beyer, Olivier J. H´enaff, Alexander Kolesnikov, Xiaohua Zhai, and A¨aron van den Oord. Are
we done with imagenet? arXiv, 2020.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv, 2020.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usun


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 0, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a pure Transformer architecture to image classification tasks by treating image patches as tokens. It highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks with fewer computational resources when pre-trained on large datasets.
Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing ta


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 1, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the exploration of recent models that apply Transformers to image recognition, particularly highlighting the image GPT (iGPT) model and its unsupervised training approach. It discusses the significance of utilizing larger datasets, such as ImageNet-21k and JFT-300M, for achieving state-of-the-art results in image classification, contrasting with previous works that primarily employed CNNs. The text emphasizes the trend of leveraging additional data sources to enhance model performance in computer vision tasks.
et al., 2020c; Lu et al., 2019; Li et al., 2019).
Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers
to image pixels after reducing image resolution and color space. The model is trained in an unsu-
pervised fashion as a generative model, and the resulting representation can then be ﬁne-tuned or
probed linearly for classiﬁcation performance, achieving a maximal accuracy of 72% on ImageNet.
Our work adds to the increasing c


Metadata: {'id': '779314', 'source': 'Wikipedia', 'title': 'Binary Neural Network'}
Content Brief:


Binary neural network is an artificial neural network, where commonly used floating-point weights are replaced with binary ones. It largely saving the storage and computation, serves as a technique for deploying deep models on resource-limited devices. Usage of binary values can bring up to 58 times speedup, while accuracy and information capacity of binary neural network can be manually controlled. Binary neural networks do not achieve the same accuracy as their full-precision counterparts, but improvements are being made to close this gap.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.




In [55]:
query = "what is nlp?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does NLP stand for and what are its main applications?  ', 'Can you explain the concept of natural language processing and its significance?  ', 'What are the key components and techniques involved in NLP?']


Metadata: {'id': '40613', 'source': 'Wikipedia', 'title': 'Natural language processing'}
Content Brief:


Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.


Metadata: {'id': '335464', 'source': 'Wikipedia', 'title': 'Neurolinguistic programming'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. Certain neuroscientists psychologists and linguists, believe that NLP is unsupported by current scientific evidence and that it uses incorrect and misleading terms and concepts. NLP was invented by Richard Bandler and John Grinder. According to these people, NLP can help solve problems such as phobias, depression, habit disorder, psychosomatic illnesses, and learning disorders.


Metadata: {'id': '669662', 'source': 'Wikipedia', 'title': 'Loop AI Labs'}
Content Brief:


Loop AI Labs is an AI and cognitive computing company that focuses on language understanding technology. The company was founded in San Francisco in 2012 by Italian entrepreneur Gianmauro Calafiore, who sold his company Gsmbox to in 2004 and then relocated from Italy to San Francisco. Wanting to start an artificial intelligence company, he recruited two veterans of the project, the largest government-funded AI project in history, who had worked on the project at and Stanford University's . The original company name, "Soshoma", was changed to Loop AI Labs in 2015 after the company decided to change its focus from consumer-oriented to enterprise. Loop AI Labs is headquartered in San Francisco, California, with offices in New York, Milan, and Singapore. The company is privately funded. On May 4, 2017, Loop AI Labs entered into a deal with , a leading European provider of mobile messaging and solutions, to bring their cognitive computing technology to LINK's business clients, which cover 2


Metadata: {'author': '', 'creationDate': 'D:20230803000729Z', 'creationdate': '2023-08-03T00:07:29+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/attention_paper.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20230803000729Z', 'moddate': '2023-08-03T00:07:29+00:00', 'page': 11, 'producer': 'pdfTeX-1.40.25', 'source': '../docs/attention_paper.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include foundational works in natural language processing, machine translation, and neural network architectures. These references support the development and evaluation of the Transformer model, highlighting its contributions to sequence transduction tasks and its performance compared to previous models.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference,
pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforce


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'id': '52820', 'source': 'Wikipedia', 'title': 'Signal processing'}
Content Brief:


Signal processing is the analysis, interpretation and manipulation of signals. Signals of interest include sound, images, biological signals such as ECG, radar signals, and many others. Processing of such signals includes storage and reconstruction, separation of information from noise (e.g., aircraft identification by radar), compression (e.g., image compression), and feature extraction (e.g., converting text to speech). For analog signals, signal processing may involve the amplification and filtering of audio signals for audio equipment or the modulation and demodulation of signals for telecommunication. For digital signals, signal processing may involve the compression, error checking and error detection of digital signals.


Metadata: {'author': '', 'creationDate': 'D:20230803000729Z', 'creationdate': '2023-08-03T00:07:29+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/attention_paper.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20230803000729Z', 'moddate': '2023-08-03T00:07:29+00:00', 'page': 10, 'producer': 'pdfTeX-1.40.25', 'source': '../docs/attention_paper.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include foundational works on recurrent neural networks, attention mechanisms, and various neural architectures relevant to sequence modeling and machine translation. These references support the development and evaluation of the Transformer model, highlighting its innovations and comparisons to existing methodologies in the field.
[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
machine translation. CoRR, abs/1406.1078, 2014.
[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv
preprint arXiv:1610.02357, 2016.
[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation
of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. R


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic




In [56]:
query = "what is ML?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does machine learning (ML) refer to?  ', 'Can you explain the concept of machine learning?  ', 'What are the key principles and applications of ML?']


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '6360', 'source': 'Wikipedia', 'title': 'Artificial intelligence'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental facu


Metadata: {'id': '312307', 'source': 'Wikipedia', 'title': 'Standard ML'}
Content Brief:


Standard ML is a functional programming language which is a dialect of ML (programming language). It is sometimes used for writing compilers and in theorem provers. Here is an example of a factorial function written in a simple, non-tail recursive, style.


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 10, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include foundational works in deep learning, optimization methods, and notable contributions to image classification and representation learning. These references support the development and validation of the Vision Transformer (ViT) model and its performance in image recognition tasks.
Published as a conference paper at ICLR 2021
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. 2015.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly,
and Neil Houlsby. Big transfer (BiT): General visual representation learning. In ECCV, 2020.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classiﬁcation with deep conv


Metadata: {'id': '501529', 'source': 'Wikipedia', 'title': 'Logistic Regression'}
Content Brief:


Logistic Regression, also known as "Logit Regression" or "Logit Model", is a mathematical model used in statistics to estimate (guess) the probability of an event occurring having been given some previous data. Logistic Regression works with binary data, where either the event happens (1) or the event does not happen (0). So given some feature x it tries to find out whether some event y happens or not. So y can either be 0 or 1. In the case where the event happens, y is given the value 1. If the event does not happen, then y is given the value of 0. For example, if y represents whether a sports team wins a match, then y will be 1 if they win the match or y will be 0 if they do not. This is known as "Binomial Logistic Regression". There is also another form of Logistic Regression which uses multiple values for the variable y. This form of Logistic Regression is known as "Multinomial Logistic Regression". Logistic Regression uses the logistic function to find a model that fits with the d




### Contextual Compression Retrieval

The information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned.

Compression can take two forms:

- Extraction: remove the parts of each retrieved document that are not relevant to the query, keeping only the relevant passages.

- Filtering: drop whole documents that are not relevant to the query, leaving the content of the remaining documents unchanged.
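
To make the difference between the two forms concrete, here is a minimal, LLM-free sketch. It is not how LangChain implements compression: the `Doc` class, the helper functions, and the keyword-overlap relevance test are all hypothetical stand-ins for the LLM judgment used by the real compressors.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    title: str
    text: str


def is_relevant(passage, query_terms):
    # Toy relevance test (assumption): any keyword overlap counts as relevant.
    # The real compressors ask an LLM to make this decision.
    words = {w.strip(".,?!()\"'").lower() for w in passage.split()}
    return bool(words & query_terms)


def extract_relevant(docs, query_terms):
    """Form 1: keep only the relevant sentences of each document."""
    compressed = []
    for doc in docs:
        kept = [s for s in doc.text.split(". ") if is_relevant(s, query_terms)]
        if kept:  # a document with nothing relevant left is dropped
            compressed.append(Doc(doc.title, ". ".join(kept)))
    return compressed


def filter_docs(docs, query_terms):
    """Form 2: keep or drop whole documents without editing their text."""
    return [doc for doc in docs if is_relevant(doc.text, query_terms)]


docs = [
    Doc("Deep learning",
        "Deep learning is a kind of machine learning. "
        "It is easy for humans to read handwriting"),
    Doc("CNN", "The Cable News Network is a television channel"),
]
terms = {"machine", "learning"}

extracted = extract_relevant(docs, terms)  # shortened text, irrelevant doc dropped
filtered = filter_docs(docs, terms)        # full text kept, irrelevant doc dropped
```

Both calls drop the unrelated "CNN" document, but only `extract_relevant` shortens the text of the document it keeps.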

Here we wrap our multi-query retriever in a `ContextualCompressionRetriever` and give it an `LLMChainExtractor` as its compressor. The extractor iterates over the initially returned documents and extracts from each only the content that is relevant to the query; documents with no relevant content are dropped.

In [57]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


# extracts from each document only the content that is relevant to the query
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# retrieves the documents similar to query and then applies the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=mq_retriever
)
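
A quick way to gauge what the compressor does is to compare document counts and total character counts before and after compression. The sketch below is illustrative only: `compression_stats` and `FakeDoc` are hypothetical names, and the only assumption about the real documents is that they expose a `page_content` string, as LangChain `Document` objects do.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing the same `page_content` attribute."""
    page_content: str


def compression_stats(before, after):
    """Summarise how much text survived compression."""
    chars_before = sum(len(d.page_content) for d in before)
    chars_after = sum(len(d.page_content) for d in after)
    return {
        "docs_before": len(before),
        "docs_after": len(after),
        "chars_before": chars_before,
        "chars_after": chars_after,
        # Guard against an empty retrieval result.
        "chars_kept_ratio": chars_after / chars_before if chars_before else 0.0,
    }


# Toy data: two retrieved documents, one shortened survivor after compression.
before = [FakeDoc("a" * 80), FakeDoc("b" * 20)]
after = [FakeDoc("a" * 25)]
stats = compression_stats(before, after)
```

With the notebook's own objects, one could pass `mq_retriever.invoke(query)` as `before` and `compression_retriever.invoke(query)` as `after`. Each call generates fresh query variants with the LLM, so the two results may not be drawn from exactly the same candidate set. In this notebook's runs for "what is ML?", the multi-query retriever returned eight documents and the extractor-backed retriever returned five, several of them shortened.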

In [58]:
query = "what is ML?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does machine learning (ML) refer to?  ', 'Can you explain the concept of machine learning?  ', 'What are the key principles and applications of ML?']


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised.


Metadata: {'id': '312307', 'source': 'Wikipedia', 'title': 'Standard ML'}
Content Brief:


Standard ML is a functional programming language which is a dialect of ML (programming language).


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


"Neural networks are an example of machine learning, where a program can change as it learns to solve a problem."




In [59]:
query = "what is a cnn?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does CNN stand for and what are its main functions?  ', 'Can you explain the concept and applications of convolutional neural networks?  ', 'What are the key features and uses of CNNs in machine learning?']


Metadata: {'id': '3615', 'source': 'Wikipedia', 'title': 'CNN'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System.


Metadata: {'id': '14059', 'source': 'Wikipedia', 'title': 'News'}
Content Brief:


"There are several news channels on cable television that give news all day long, such as Fox News and CNN."


Metadata: {'id': '779837', 'source': 'Wikipedia', 'title': 'Euronews'}
Content Brief:


CNN helped make 24-hour TV news important after the Gulf War.


Metadata: {'id': '246273', 'source': 'Wikipedia', 'title': 'NBA TV'}
Content Brief:


NBA TV replaced Time Warner's CNN/SI on many cable systems after that network shut down a year earlier.


Metadata: {'id': '429703', 'source': 'Wikipedia', 'title': 'JTBC'}
Content Brief:


JTBC is in alliance with other overseas broadcasting companies, including CNN.


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 0, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


"highlights the limitations of traditional convolutional neural networks (CNNs) in computer vision"  
"In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016)."  
"some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a)."  
"Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020)."


Metadata: {'author': '', 'creationDate': 'D:20210604001958Z', 'creationdate': '2021-06-04T00:19:58+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/vision_transformer.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20210604001958Z', 'moddate': '2021-06-04T00:19:58+00:00', 'page': 1, 'producer': 'pdfTeX-1.40.21', 'source': '../docs/vision_transformer.pdf', 'subject': '', 'title': '', 'total_pages': 22, 'trapped': ''}
Content Brief:


"contrasting with previous works that primarily employed CNNs."  
"Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M."




In [60]:
query = "what is nlp?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does NLP stand for and what are its main applications?  ', 'Can you explain the concept of natural language processing and its significance?  ', 'What are the key components and techniques involved in NLP?']


Metadata: {'id': '40613', 'source': 'Wikipedia', 'title': 'Natural language processing'}
Content Brief:


Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.


Metadata: {'id': '335464', 'source': 'Wikipedia', 'title': 'Neurolinguistic programming'}
Content Brief:


Neurolinguistic programming is a way of communicating, created in the 1970s. It is often shortened to "NLP". The discipline assumes there is a link between neurological processes, language and behavior. According to NLP, it is possible to achieve certain goals in life by changing one's behaviour. NLP was invented by Richard Bandler and John Grinder.




In [61]:
query = "what is clustering?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What are the key concepts and techniques involved in clustering?  ', 'Can you explain the different types of clustering methods and their applications?  ', 'How does clustering work, and what are its main purposes in data analysis?']


Metadata: {'id': '593732', 'source': 'Wikipedia', 'title': 'Cluster analysis'}
Content Brief:


Clustering or cluster analysis is a type of data analysis. The analyst groups objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters) in some way. This is a common task in data mining.




In [62]:
query = "what is a neural network?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What are the key components and functions of a neural network?  ', 'Can you explain the concept and workings of neural networks in simple terms?  ', 'What are the different types of neural networks and their applications?']


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'id': '587280', 'source': 'Wikipedia', 'title': 'Generative adversarial networks'}
Content Brief:


Generative adversarial networks (GANs) are artificial neural networks that work together to give better answers. One neural network is the tricky network, and the other one is the useful network. The tricky network will try to give an input to the useful network that will cause the useful network to give a bad answer. The useful network will then learn not to give a bad answer, and the tricky network will try to trick the useful network again. As this continues, the useful network will get better and not become tricked as often, and the useful network will be able to be used to make good predictions.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems.




The `LLMChainFilter` is a slightly simpler but more robust compressor. It uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return, without modifying the document contents.

In [63]:
from langchain.retrievers.document_compressors import LLMChainFilter

# decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# retrieves the documents similar to query and then applies the filter
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=mq_retriever
)

In [64]:
query = "what is ML?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does machine learning (ML) refer to?  ', 'Can you explain the concept of machine learning?  ', 'What are the key principles and applications of ML?']


Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'source': 'Wikipedia', 'title': 'Supervised learning'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'id': '312307', 'source': 'Wikipedia', 'title': 'Standard ML'}
Content Brief:


Standard ML is a functional programming language which is a dialect of ML (programming language). It is sometimes used for writing compilers and in theorem provers. Here is an example of a factorial function written in a simple, non-tail recursive, style.


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [65]:
query = "what is NLP?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does NLP stand for and what are its main applications?  ', 'Can you explain the concept of Natural Language Processing and its significance?  ', 'What are the key components and techniques involved in NLP?']


Metadata: {'id': '40613', 'source': 'Wikipedia', 'title': 'Natural language processing'}
Content Brief:


Natural Language Processing (NLP) is a field in Artificial Intelligence, and is also related to linguistics. On a high level, the goal of NLP is to program computers to automatically understand human languages, and also to automatically write/speak in human languages. We say "Natural Language" to mean human language, and to indicate that we are not talking about computer (programming) languages.


Metadata: {'author': '', 'creationDate': 'D:20230803000729Z', 'creationdate': '2023-08-03T00:07:29+00:00', 'creator': 'LaTeX with hyperref', 'file_path': '../docs/attention_paper.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20230803000729Z', 'moddate': '2023-08-03T00:07:29+00:00', 'page': 11, 'producer': 'pdfTeX-1.40.25', 'source': '../docs/attention_paper.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': ''}
Content Brief:


Focuses on the references cited in the research paper, which include foundational works in natural language processing, machine translation, and neural network architectures. These references support the development and evaluation of the Transformer model, highlighting its contributions to sequence transduction tasks and its performance compared to previous models.
[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated
corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference,
pages 152–159. ACL, June 2006.
[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
model. In Empirical Methods in Natural Language Processing, 2016.
[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforce




In [66]:
query = "what is a neural network?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What are the key components and functions of a neural network?  ', 'Can you explain the concept and workings of neural networks in simple terms?  ', 'What are the different types of neural networks and their applications?']


Metadata: {'id': '44742', 'source': 'Wikipedia', 'title': 'Artificial neural network'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.


Metadata: {'id': '779314', 'source': 'Wikipedia', 'title': 'Binary Neural Network'}
Content Brief:


Binary neural network is an artificial neural network, where commonly used floating-point weights are replaced with binary ones. It largely saving the storage and computation, serves as a technique for deploying deep models on resource-limited devices. Usage of binary values can bring up to 58 times speedup, while accuracy and information capacity of binary neural network can be manually controlled. Binary neural networks do not achieve the same accuracy as their full-precision counterparts, but improvements are being made to close this gap.


Metadata: {'id': '587280', 'source': 'Wikipedia', 'title': 'Generative adversarial networks'}
Content Brief:


Generative adversarial networks (GANs) are artificial neural networks that work together to give better answers. One neural network is the tricky network, and the other one is the useful network. The tricky network will try to give an input to the useful network that will cause the useful network to give a bad answer. The useful network will then learn not to give a bad answer, and the tricky network will try to trick the useful network again. As this continues, the useful network will get better and not become tricked as often, and the useful network will be able to be used to make good predictions.


Metadata: {'id': '663523', 'source': 'Wikipedia', 'title': 'Deep learning'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic




In [67]:
query = "what is a cnn?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What does CNN stand for and what are its main functions?  ', 'Can you explain the concept and applications of convolutional neural networks?  ', 'What are the key features and uses of CNNs in machine learning?']


Metadata: {'id': '3615', 'source': 'Wikipedia', 'title': 'CNN'}
Content Brief:


The Cable News Network (CNN) is an American cable news television channel. It was founded in 1980 by Ted Turner. The Cable News Network first aired on television on June 1, 1980. The Cable News Network's first newscast was anchored (hosted) by David Walker and his wife Lois Hart. In its first year CNN hired many political analysts, including Rowland Evans and Robert Novak. On January 1, 1982 CNN launched a 24-hour sister newscast channel with no talk shows or commentary shows called CNN2. CNN broadcasts programs from its headquarters at the CNN Center in Atlanta, or from the Time Warner Center in New York City, or from studios in Washington, D.C., and Los Angeles. CNN is owned by Time Warner, and the U.S. news channel is a part of the Turner Broadcasting System. The hosts of its opinion shows are Don Lemon, Chris Cuomo, Fredricka Whitfield, Erin Burnett, Brianna Keiler and Brooke Baldwin. CNN has been criticized by the right-wing Media Research Center for having a left-wing bias. Accor


Metadata: {'id': '14059', 'source': 'Wikipedia', 'title': 'News'}
Content Brief:


News is when people talk about current events (things that are happening right now). News Media is a portrayal of current affairs, perspectives and social influence. News can be given in newspapers, television, magazines, or radio. There are several news channels on cable television that give news all day long, such as Fox News and CNN. There are several news magazines, such as "Time", "The Economist", and "Newsweek". A newsman is a person who helps out with the news. For example, Brian Gotter is a newsman. News Media can be viewed in many forms, such as newspaper, television and radio.




### Chained Retrieval Pipeline

This strategy chains several retrievers in sequence, with each stage narrowing the previous stage's results down to the most relevant documents. The flow is:

Similarity Retrieval → Compression Filter → Reranker Model Retrieval

![](http://i.imgur.com/77pXxLu.gif)
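To make the data flow concrete before wiring up the real components, here is a minimal, dependency-free sketch of the same three stages. It is only an illustration: the toy corpus, the stopword list, and the helper names (`similarity_stage`, `filter_stage`, `rerank_stage`, `chained_retrieve`) are assumptions made for this example. Word overlap stands in for embedding similarity, a keyword check stands in for the LLM filter, and a position-based score stands in for the cross-encoder reranker.

```python
from typing import Callable, List, Set

# Hypothetical stopword list, used only by the toy filter and reranker below.
STOPWORDS: Set[str] = {"what", "is", "a", "the", "of"}


def content_words(text: str) -> Set[str]:
    """Lowercased tokens of `text` with the toy stopwords removed."""
    return {t for t in text.lower().split() if t not in STOPWORDS}


def similarity_stage(query: str, corpus: List[str], k: int = 5) -> List[str]:
    """Stage 1: top-k by Jaccard word overlap (stand-in for vector similarity)."""
    q = set(query.lower().split())

    def score(doc: str) -> float:
        d = set(doc.lower().split())
        return len(q & d) / len(q | d) if (q | d) else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def filter_stage(query: str, docs: List[str],
                 is_relevant: Callable[[str, str], bool]) -> List[str]:
    """Stage 2: drop documents the relevance judge rejects (stand-in for an LLM filter)."""
    return [d for d in docs if is_relevant(query, d)]


def rerank_stage(query: str, docs: List[str],
                 score_pair: Callable[[str, str], float], top_n: int = 3) -> List[str]:
    """Stage 3: re-score each (query, doc) pair and keep at most top_n documents."""
    return sorted(docs, key=lambda d: score_pair(query, d), reverse=True)[:top_n]


def toy_is_relevant(query: str, doc: str) -> bool:
    """Relevant if the doc shares at least one non-stopword with the query."""
    return bool(content_words(query) & content_words(doc))


def toy_score_pair(query: str, doc: str) -> float:
    """Higher score when a query content word appears earlier in the doc."""
    wanted = content_words(query)
    positions = [i for i, tok in enumerate(doc.lower().split()) if tok in wanted]
    return 1.0 / (1 + positions[0]) if positions else 0.0


def chained_retrieve(query: str, corpus: List[str]) -> List[str]:
    """Similarity retrieval -> filter -> rerank, mirroring the flow above."""
    candidates = similarity_stage(query, corpus, k=5)
    kept = filter_stage(query, candidates, toy_is_relevant)
    return rerank_stage(query, kept, toy_score_pair, top_n=3)


corpus = [
    "Machine learning lets computers learn from data",
    "Deep learning is a kind of machine learning",
    "CNN is a cable news channel",
    "Statistics is the study of data",
    "Clustering groups similar data points",
]

result = chained_retrieve("what is machine learning", corpus)
for doc in result:
    print(doc)
```

In this toy run the filter stage keeps only two of the five candidates, so the pipeline returns two documents even though `top_n=3`. The reranker can only reorder and truncate what the filter passes on, so the final result may contain fewer than `top_n` documents.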

In [69]:
!pip install -qq sentence-transformers

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.2 which is incompatible.

In [70]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain.retrievers import ContextualCompressionRetriever

# Retriever 1 - simple cosine distance based retriever
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

# LLM-based filter - decides which of the initially retrieved documents to drop and which to keep
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Retriever 2 - retrieves the documents similar to the query and then applies the filter
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=similarity_retriever
)

# download an open-source reranker model - BAAI/bge-reranker-large
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=3)
# Retriever 3 - Uses a Reranker model to rerank retrieval results from the previous retriever
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor, base_retriever=compressor_retriever
)

In [71]:
query = "what is ML?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.




In [None]:
query = "what is a neural network?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [None]:
query = "what is a transformer model?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [None]:
query = "what is nlp?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [None]:
query = "what is clustering?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [None]:
query = "what is a vision transformer"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [None]:
query = "what is statistics"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)