<a href="https://colab.research.google.com/github/Alex112525/LangChain-with-LLMs/blob/main/Retrieval_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain pypdf sentence_transformers chromadb einops accelerate

In [2]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain import HuggingFacePipeline
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [3]:
loaders = [
    PyPDFLoader("/content/Attention.pdf"), # https://arxiv.org/abs/1706.03762
    PyPDFLoader("/content/Bert.pdf")      # https://arxiv.org/abs/1810.04805v2
]

docs = []
for loader in loaders:
  docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

In [4]:
persist_dir = "docs/chroma"

vectordb = Chroma.from_documents(
    documents=splits,
    embedding = HuggingFaceEmbeddings(),
    persist_directory=persist_dir
)

print(vectordb._collection.count())

360


### Search with max_marginal_relevance_search()

The *max_marginal_relevance_search()* method of Chroma returns documents selected by maximal marginal relevance (MMR). MMR balances similarity to the query against diversity among the selected documents, which helps when a plain similarity search returns near-duplicate chunks. The method accepts the parameters query, k, fetch_k, lambda_mult, and filter, plus additional keyword arguments. For comparison, the next cells first run a plain similarity search.

In [5]:
question = "What is attention?"
docs_ss = vectordb.similarity_search(question, k=3)

In [6]:
docs_ss[0].page_content

'sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position ican depend only on the known outputs at positions less than i.\n3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3'

In [7]:
docs_ss[1].page_content

'sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position ican depend only on the known outputs at positions less than i.\n3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3'

Here is a brief explanation of each parameter:

* query: Text to look up documents similar to.
* k: Number of documents to return. Defaults to 4.
* fetch_k: Number of documents to fetch by similarity and pass to the MMR algorithm. Defaults to 20.
* lambda_mult: Number between 0 and 1 that sets the degree of diversity among the selected documents, with 0 giving maximum diversity and 1 minimum diversity. Defaults to 0.5.
* filter: A dictionary of metadata conditions that restricts the search, for example `{"source": "/content/Bert.pdf"}`.
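
The trade-off controlled by lambda_mult can be sketched in plain Python. This is a minimal illustration of greedy MMR selection over toy vectors, not Chroma's implementation; the helper names `cosine` and `mmr_select` and the example vectors are assumptions made for this sketch. In the setting above, `doc_vecs` would stand in for the fetch_k candidates returned by the initial similarity search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR (hypothetical helper)."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Relevance: similarity between the candidate and the query.
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to any document already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data: documents 0 and 1 are exact duplicates, document 2 is different.
toy_docs = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]

print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With lambda_mult=1.0 only relevance counts, so the duplicate is returned; lowering it penalizes the duplicate and the second slot goes to the different document.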

In [8]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3, lambda_mult=0.8)

In [9]:
docs_mmr[0].page_content

'sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position ican depend only on the known outputs at positions less than i.\n3.2 Attention\nAn attention function can be described as mapping a query and a set of key-value pairs to an output,\nwhere the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n3'

In [10]:
docs_mmr[1].page_content

'Attention Visualizations\nInput-Input Layer5\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nIt\nis\nin\nthis\nspirit\nthat\na\nmajority\nof\nAmerican\ngovernments\nhave\npassed\nnew\nlaws\nsince\n2009\nmaking\nthe\nregistration\nor\nvoting\nprocess\nmore\ndifficult\n.\n<EOS>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\n<pad>\nFigure 3: An example of the attention mechanism following long-distance dependencies in the\nencoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of\nthe verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for\nthe word ‘making’. Different colors represent different heads. Best viewed in color.\n13'

In [11]:
docs_mmr[2]

Document(page_content='Input-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nInput-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:\nFull attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5\nand 6. Note that the attentions are very sharp for this word.\n14', metadata={'page': 13, 'source': '/content/Attent

### Search specifying source

In [12]:
docs_s = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "/content/Bert.pdf"}
)

In [13]:
for doc in docs_s:
  print(doc.metadata)

{'page': 2, 'source': '/content/Bert.pdf'}
{'page': 2, 'source': '/content/Bert.pdf'}
{'page': 2, 'source': '/content/Bert.pdf'}


In [14]:
docs_s[0]

Document(page_content='we will omit an exhaustive background descrip-\ntion of the model architecture and refer readers to\nVaswani et al. (2017) as well as excellent guides\nsuch as “The Annotated Transformer.”2\nIn this work, we denote the number of layers\n(i.e., Transformer blocks) as L, the hidden size as\nH, and the number of self-attention heads as A.3\nWe primarily report results on two model sizes:\nBERT BASE (L=12, H=768, A=12, Total Param-\neters=110M) and BERT LARGE (L=24, H=1024,\nA=16, Total Parameters=340M).\nBERT BASE was chosen to have the same model\nsize as OpenAI GPT for comparison purposes.\nCritically, however, the BERT Transformer uses\nbidirectional self-attention, while the GPT Trans-\nformer uses constrained self-attention where every\ntoken can only attend to context to its left.4\n1https://github.com/tensorﬂow/tensor2tensor\n2http://nlp.seas.harvard.edu/2018/04/03/attention.html\n3In all cases we set the feed-forward/ﬁlter size to be 4H,\ni.e., 3072 for the 

### ContextualCompressionRetriever

The **ContextualCompressionRetriever** is designed to improve the results of vector store similarity searches by taking the context of the query into account. It wraps another retriever and adds a document compressor as an intermediate step: after the initial similarity search, the compressor removes information irrelevant to the query from the retrieved documents. This reduces the amount of distracting text a subsequent chain has to parse before making its final judgement.
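
The retrieve-then-compress flow can be sketched without any LLM. The example below is only an illustration of the idea, not LangChain's implementation: `base_retrieve`, `compress`, and `compressed_retrieve` are hypothetical helpers, and the compressor is a simple word-overlap filter standing in for an LLM-based extractor.

```python
import re


def words(text):
    """Lowercased word set of a text."""
    return set(re.findall(r"\w+", text.lower()))


def base_retrieve(query, documents, k=2):
    """Hypothetical base retriever: rank documents by words shared with the query."""
    query_words = words(query)
    return sorted(
        documents, key=lambda doc: len(query_words & words(doc)), reverse=True
    )[:k]


def compress(query, document):
    """Hypothetical compressor: keep only sentences sharing a word with the query."""
    query_words = words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return " ".join(s for s in sentences if query_words & words(s))


def compressed_retrieve(query, documents, k=2):
    """Retrieve first, then compress each hit and drop hits left empty."""
    hits = base_retrieve(query, documents, k)
    compressed = [compress(query, doc) for doc in hits]
    return [text for text in compressed if text]


toy_documents = [
    "Attention maps a query and key-value pairs to an output. "
    "The model was trained on eight GPUs.",
    "BERT uses bidirectional self-attention. Training took four days.",
    "Dropout is applied to each sub-layer.",
]

for text in compressed_retrieve("attention", toy_documents):
    print(text)
```

Each retrieved document is reduced to the sentences that mention the query term, so the downstream chain sees less irrelevant text. A word-overlap filter is far cruder than an LLM extractor (it has no stop-word handling, for instance), which is why the toy query is a single keyword.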

In [15]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from IPython.display import display, Markdown

In [16]:
def printmd(string):
    display(Markdown(string))

In [17]:
import torch
from transformers import AutoTokenizer, pipeline

model = "bertin-project/bertin-gpt-j-6B-alpaca"  # The name of the model to use.

tokenizer = AutoTokenizer.from_pretrained(model)  # The tokenizer to use to encode and decode text.

pipeline = pipeline("text-generation",            # The name of the pipeline to use.
                    model=model,                  # The model to use for text generation.
                    tokenizer=tokenizer,          # The tokenizer to use to encode and decode text.
                    torch_dtype=torch.bfloat16,   # The data type to use for the model's computations.
                    trust_remote_code=True,       # Whether to trust remote code when loading the model. This should only be set to True if you trust the source of the model.
                    device_map="auto",            # The device to use for the model's computations.
                    max_length=512,               # The maximum total length in tokens (prompt plus generated text).
                    do_sample=True,               # Whether to use sampling when generating text.
                    top_k=2,                      # The number of candidates to keep when sampling.
                    num_return_sequences=1,       # The number of text sequences to generate.
                    eos_token_id=tokenizer.eos_token_id  # The ID of the end-of-sequence token.
                    )


In [18]:
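# Wrap the transformers pipeline so LangChain can call it as an LLM.
llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0.0})
# The extractor prompts the LLM to keep only the passages relevant to the query.
compressor = LLMChainExtractor.from_llm(llm)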

In [19]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [42]:
question = "What is Attention?"
compressed_docs = compression_retriever.get_relevant_documents(question)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [43]:
for doc in compressed_docs:
  print(20 * "*** ")
  printmd(doc.page_content)

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


"question", "and", "context"

> Question: What is Attention?

> Context:

### Respuesta:
La función de atención es la siguiente: 

mapping_query(query, set(keys, values)) 

Donde query es la consulta, set es el conjunto de claves y valores, y output es el vector de salida.

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


"mapping"
"query",
"key-value pairs"
"to"
"an output"
"varies with time and place"
"weighed sum"

> Question:
¿Cuál es el número máximo de personas que pueden estar presentes en un aula a la vez?

### Respuesta:
El número máximo de personas que pueden estar presentes en un aula a la vez es veinte.

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking,
combined with fact that the output embeddings are offset by one position, ensures that the
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
3.2.1 Attention
An attention function that maps a query to an output that is only partially relevant to the
query.
3.2.2 Attention
An attention function that maps a query and an additional key-value pair to an output that is
completly relevant to the query.
3.3 Attention
An attention function that maps a query and an additional key-value pair to an output that is
3.4 Attention
An attention function that maps a query, an additional key-value pair, and an additional
key-value pair to an output that is completely relevant to the query.
3.5 Attention
An attention function that maps a query, an additional key-value pair, and an additional
key-value pair, and an additional key-value pair to an output that are all completely
3.6 Attention
An attention function that maps a query, an additional key-value pair, an additional key-value
pair, and an additional key-value pair to an output that are all partially relevant to the

### Respuesta:
3.6.1 Attention
An attention function that maps a query, an additional key-value pair,

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


"mapping"
"query", "set of keys and values"
"output"
"vector"

> Respuesta:
La función de atención es una función que asigna un peso a cada palabra en la consulta y en el conjunto de pares de clave-valor, de manera que la salida sea un vector que representa la suma ponderada de las palabras en la consulta y el conjunto de pares de clave-valor.

In [22]:
compression_retriever_mmr = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [47]:
question = "What is the positional encoding?"
compressed_docs_mmr = compression_retriever_mmr.get_relevant_documents(question)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [48]:
for doc in compressed_docs_mmr:
  print(20 * "** ")
  printmd(doc.page_content)
  print(doc.metadata)

** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** 


sub-layer, in, decoder, stack, to, prevent, positions, from, attending, subsequent, positions

### Respuesta:
La encriptación de posición es una técnica en la que los datos se codifican en función de la posición en un vector o matriz, en lugar de hacerlo en función del valor absoluto o del valor final. Esta técnica se utiliza para evitar que los datos sean visibles durante la transmisión o el almacenamiento. La encriptación de posición es una técnica común utilizada en aplicaciones de seguridad, como la autenticación de contraseñas y la encriptación de datos.

{'page': 2, 'source': '/content/Attention.pdf'}
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** 


We use a scale-free embedded embeddings of the input to represent the text.

We use a scale-free softmax to represent the text.

> Question: What is the position of the capital "C"?
> Context:
### Entrada:
The capital of France is Paris.

### Respuesta:
La capital de Francia es París.

{'page': 4, 'source': '/content/Attention.pdf'}
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** 


chose this function because we hypothesised it would allow the model to easily learn to attend
by absolute positions.

### Respuesta:
La codificación posicional es una técnica de codificación en la que las letras, números y símbolos se representan mediante posiciones relativas en una matriz. Se utiliza comúnmente en tareas de aprendizaje automático supervisado, como la cl

{'page': 5, 'source': '/content/Attention.pdf'}
** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** ** 


Encoder:
6 layers, 2 sub-layers per layer

Decoder: 
2 sub-layers
3rd sub-layer: Multi-head attention
Residual connections:
[ 11, 1, 1, 1, 1, 1]

### Output:

Model: 

Inputs:
[X, Y, Z, A

{'page': 2, 'source': '/content/Attention.pdf'}


### TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. A term scores highly when it occurs often in one document but rarely across the rest of the collection. It is often used as a weighting factor in information retrieval, text mining, and user modeling.
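
As a rough illustration of the statistic, the sketch below computes a bare-bones TF-IDF (relative term frequency times log inverse document frequency) in plain Python and ranks texts by the summed weight of the query terms. The function names and toy texts are assumptions for this example; real libraries typically add smoothing and normalization, so their scores will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    """Return the idf table and one {term: weight} dict per text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: (count / len(tokens)) * idf[term] for term, count in counts.items()}
        )
    return idf, vectors


def rank(query, texts):
    """Return text indices ordered by summed TF-IDF weight of the query terms."""
    _, vectors = tfidf_vectors(texts)
    query_terms = tokenize(query)
    scores = [sum(vec.get(term, 0.0) for term in query_terms) for vec in vectors]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)


toy_texts = [
    "attention is a weighted sum of values",
    "the encoder has six layers",
    "the decoder has six layers",
]

toy_idf, _ = tfidf_vectors(toy_texts)
print(toy_idf["attention"] > toy_idf["the"])  # True: rarer terms get a higher idf
print(rank("attention", toy_texts)[0])        # 0
```

Because the score depends only on term overlap, a chunk that merely repeats the query word can outrank a chunk that actually explains the concept.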

In [28]:
# Load the paper and join all pages into a single string
loader = PyPDFLoader("/content/Attention.pdf")
pages = loader.load()
all_pages = [p.page_content for p in pages]
joined_pages = " ".join(all_pages)

# Split the joined text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_text(joined_pages)

In [36]:
from langchain.retrievers import TFIDFRetriever

In [37]:
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [32]:
question = "What is attention?"

In [40]:
docs_tf = tfidf_retriever.get_relevant_documents(question)
for doc in docs_tf:
  print("*** " * 20)
  printmd(doc.page_content)

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


[Token labels from the attention visualization omitted]
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the
sentence. We give two such examples above, from two different heads from the encoder self-attention
at layer 5 of 6. The heads clearly learned to perform different tasks.
15

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for
the word ‘making’. Different colors represent different heads. Best viewed in color.
13
Input-Input Layer5
[Token labels from the attention visualization omitted]
Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14
Input-Input Layer5
[Token labels from the attention visualization omitted; the chunk ends partway through them]

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


layers, produce outputs of dimension dmodel = 512 .
Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position ican depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
3 Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 


reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-
aligned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [ 5,2,35].
Here, the encoder maps an input sequence of symbol representations (x1, ..., x n)to a sequence