In [None]:
%pip install langchain_anthropic gpt4all

In [1]:
import os
from dotenv import load_dotenv


In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import GPT4AllEmbeddings

In [3]:
load_dotenv()

True

In [4]:

TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
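
`os.getenv` returns `None` when a variable is missing from `.env`, and that usually only shows up later as an authentication error. A minimal sketch of an earlier check follows; `require_env` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to wherever your keys actually live.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

It could replace the plain lookups above, for example `TAVILY_API_KEY = require_env("TAVILY_API_KEY")`.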


Load the article (a local PDF of "Attention Is All You Need")

In [5]:
article_path = os.path.join(os.getcwd(), "NIPS-2017-attention-is-all-you-need-Paper.pdf")

In [6]:


loader = PyPDFLoader(article_path)
docs = loader.load_and_split()

Split the article into chunks of at most 700 characters

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700)
documents = text_splitter.split_documents(docs)
documents

[Document(page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention', metadata={'source': '/home/varad/Work/Claude_llm/NIPS-2017-attention-is-all-you-need-Paper.pdf', 'page': 0}),
 Document(page_content='convolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. W

In [8]:
len(documents)

67

In [9]:
print(documents[30].page_content)

Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel
size of convolutions and rthe size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2·d) O(1) O(1)
Recurrent O(n·d2) O(n) O(n)
Convolutional O(k·n·d2)O(1) O(logk(n))
Self-Attention (restricted) O(r·n·d)O(1) O(n/r)
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,


In [10]:
len(documents[30].page_content)

700
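
The chunks above overlap: the second chunk starts with a sentence that also ends the first. That matches the splitter's `chunk_overlap` setting, which I believe defaults to 200 characters. The sketch below shows only the sliding-window idea in plain Python. It is not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break at paragraph, line and word boundaries. `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.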

Embed the chunks and store them in a FAISS vector index.

In [12]:
embeddings = GPT4AllEmbeddings()



In [13]:
embeddings.embed_documents(["hello how are you?"])

[[0.0028873372357338667,
  0.03898775577545166,
  0.08888968825340271,
  0.07305361330509186,
  -0.026213297620415688,
  -0.07340902835130692,
  0.05525078997015953,
  -0.010802335105836391,
  -0.08910194784402847,
  0.017528029158711433,
  -0.002462751464918256,
  -0.005296293646097183,
  -0.02521679922938347,
  -0.021081289276480675,
  0.07688184827566147,
  -0.03447548300027847,
  0.09120288491249084,
  -0.0904010534286499,
  -0.11484559625387192,
  0.04438171535730362,
  -0.06186788156628609,
  0.02854953519999981,
  0.03158390149474144,
  0.06960074603557587,
  -0.055351778864860535,
  -0.05987855792045593,
  0.030876602977514267,
  0.031193938106298447,
  0.03783831372857094,
  -0.08316593617200851,
  -0.056789711117744446,
  0.06846589595079422,
  -0.014690886251628399,
  -0.004066290333867073,
  -0.025891607627272606,
  0.05491568520665169,
  -0.02755533903837204,
  -0.11880224943161011,
  0.004968432709574699,
  -0.014188005588948727,
  0.01071904692798853,
  -0.05204920843243

In [14]:
vector = FAISS.from_documents(documents, embeddings)

In [15]:
retriever = vector.as_retriever()

In [16]:
retriever.invoke("What is the decoder?")

[Document(page_content='Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head\nattention over the output of the encoder stack. Similar to the encoder, we employ residual connections\naround each of the sub-layers, followed by layer normalization. We also modify the self-attention\nsub-layer in the decoder stack to prevent positions from attending to subsequent positions. This\nmasking, combined with fact that the output embeddings are offset by one position, ensures that the\npredictions for position ican depend only on the known outputs at positions less than i.\n3.2 Attention', metadata={'source': '/home/varad/Work/Claude_llm/NIPS-2017-attention-is-all-you-need-Paper.pdf', 'page': 2}),
 Document(page_content='Figure 1: The Transformer - model architecture.\nwise fully connected feed-forward network. We employ a residual connection [ 10] around
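
What the retriever does can be pictured as a nearest-neighbour search: embed the query, then return the chunks whose vectors are closest to it. The brute-force sketch below assumes Euclidean (L2) distance, which I believe is the default for the LangChain FAISS wrapper. The three-dimensional vectors and the names `l2_distance`, `top_k` and `toy_index` are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """indexed is a list of (text, vector) pairs; return the k closest texts."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy index with hand-written vectors (hypothetical values).
toy_index = [
    ("decoder stack", [0.9, 0.1, 0.0]),
    ("encoder stack", [0.7, 0.3, 0.1]),
    ("training data", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]
nearest = top_k(query, toy_index, k=2)
print(nearest)
```

FAISS does the same ranking with optimized index structures instead of a Python sort.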

In [17]:
# create tool for retrieving documents
from langchain.tools.retriever import create_retriever_tool

In [18]:
retriever_tool = create_retriever_tool(
    retriever=retriever,
    name="transformer_search",
    description=(
        "Search for information about transformers in AI and the attention mechanism. "
        "For any questions about the transformer architecture and the attention mechanism, you must use this tool!"
    ),
)

In [19]:
# tool for connecting to the web
search = TavilySearchResults()

In [20]:
search.dict()

{'name': 'tavily_search_results_json',
 'description': 'A search engine optimized for comprehensive, accurate, and trusted results. Useful for when you need to answer questions about current events. Input should be a search query.',
 'args_schema': langchain_community.tools.tavily_search.tool.TavilyInput,
 'return_direct': False,
 'verbose': False,
 'tags': None,
 'metadata': None,
 'handle_tool_error': False,
 'handle_validation_error': False,
 'api_wrapper': {'tavily_api_key': SecretStr('**********')},
 'max_results': 5}

In [21]:
# list of tools to be used by the agent
tools = [search, retriever_tool]

In [22]:
# create the llm
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(temperature=0, model_name="claude-3-opus-20240229")



Create the agent

In [23]:
from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub

In [24]:
# get the prompt for execution
prompt = hub.pull("hwchase17/react")

In [25]:
print(prompt.template)

Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}
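
The prompt asks the model to answer in a fixed text format, so the framework has to turn each reply back into either a tool call or a final answer. Below is a simplified sketch of that parsing step. It is not LangChain's actual output parser, and `parse_react_step` is a hypothetical name.

```python
import re

# Matches "Action: <tool>" followed on the next line by "Action Input: <input>".
ACTION_RE = re.compile(
    r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)", re.DOTALL
)


def parse_react_step(text: str):
    """Return ("finish", answer) or ("action", tool_name, tool_input)."""
    if "Final Answer:" in text:
        return ("finish", text.split("Final Answer:", 1)[1].strip())
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError("reply contains neither an action nor a final answer")
    return ("action", match.group(1).strip(), match.group(2).strip())


reply = (
    "Thought: I should search.\n"
    "Action: transformer_search\n"
    "Action Input: scaled dot-product attention"
)
print(parse_react_step(reply))
```

A reply that follows neither pattern raises an error here. In a real run that corresponds to an output parsing failure, which `AgentExecutor` can be told to tolerate with `handle_parsing_errors=True`.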


In [26]:
agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

In [27]:
agent_executer = AgentExecutor(agent=agent, tools=tools, verbose=True)
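
Conceptually, the executor runs a loop. It asks the model for the next step, runs the named tool, and appends the observation to the scratchpad, repeating until the model gives a final answer. The toy sketch below shows that loop with a scripted stand-in for the model. `run_agent`, `scripted_decide` and `toy_tools` are hypothetical names, and this is not how `AgentExecutor` is implemented internally.

```python
def run_agent(decide, tools, question, max_steps=5):
    """Loop until `decide` returns a finish step or the step limit is hit."""
    scratchpad = []  # list of (tool, tool_input, observation) tuples
    for _ in range(max_steps):
        step = decide(question, scratchpad)
        if step["type"] == "finish":
            return step["output"]
        observation = tools[step["tool"]](step["tool_input"])  # dispatch by name
        scratchpad.append((step["tool"], step["tool_input"], observation))
    return "Agent stopped: step limit reached"


def scripted_decide(question, scratchpad):
    """Stand-in for the LLM: call one tool, then answer from its observation."""
    if not scratchpad:
        return {"type": "action", "tool": "lookup", "tool_input": question}
    return {"type": "finish", "output": f"Found: {scratchpad[-1][2]}"}


toy_tools = {"lookup": lambda q: f"notes about {q}"}
answer = run_agent(scripted_decide, toy_tools, "decoder")
print(answer)
```

The step limit plays the same role as the executor's `max_iterations` setting: it keeps a confused model from looping forever.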

In [28]:
agent_executer.invoke({"input":"what is the weather in New Delhi"})



> Entering new AgentExecutor chain...
Question: what is the weather in New Delhi
Thought: To find the current weather in New Delhi, I should do a search on a weather website or service.
Action: tavily_search_results_json
Action Input: New Delhi weather
[{'url': 'https://www.weatherapi.com/', 'content': "{'location': {'name': 'New Delhi', 'region': 'Delhi', 'country': 'India', 'lat': 28.6, 'lon': 77.2, 'tz_id': 'Asia/Kolkata', 'localtime_epoch': 1713682111, 'localtime': '2024-04-21 12:18'}, 'current': {'last_updated_epoch': 1713681900, 'last_updated': '2024-04-21 12:15', 'temp_c': 34.0, 'temp_f': 93.2, 'is_day': 1, 'condition': {'text': 'Mist', 'icon': '//cdn.weatherapi.com/weather/64x64/day/143.png', 'code': 1030}, 'wind_mph': 10.5, 'wind_kph': 16.9, 'wind_degree': 290, 'wind_dir': 'WNW', 'pressure_mb': 1008.0, 'pressure_in': 29.77, 'precip_mm': 0.0, 'precip_in': 0.0, 'humidity': 23, 'cloud': 50, 'feelslike_c': 32.0, 'feelslike_f': 89.6, 'vis_km':

{'input': 'what is the weather in New Delhi',
 'output': 'It is currently 34°C (93°F) and partly cloudy in New Delhi, India. The forecast high is 36°C (97°F) with mostly cloudy skies, light winds, and very low chances of rain. The heat and strong sun are the main weather impacts today.'}

In [29]:
agent_executer.invoke({"input":"What is the scaled dot-product in attention mechanism?"})



> Entering new AgentExecutor chain...
Question: What is the scaled dot-product in attention mechanism?

Thought: To answer this question about the attention mechanism in transformers, I should use the transformer_search tool to find relevant information.

Action: transformer_search
Action Input: scaled dot-product attention

Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
query with all keys, divide each by√dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices KandV. We compute
the matrix of outputs as:
Attention(Q,K,V ) = softmax(QKT
√dk)V (1)
The two most commonly used attention functions are additive attention [ 2], and do

{'input': 'What is the scaled dot-product in attention mechanism?',
 'output': 'Scaled dot-product attention is an attention mechanism used in transformers that efficiently computes the attention weights between queries and keys by taking the scaled dot-product between them. It takes the dot product of the queries and keys, scales the result by 1/sqrt(key dimension), and applies a softmax to obtain the weights to blend the values for the output. This scaled dot-product formulation allows for faster, more memory-efficient computation compared to additive attention.'}
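
The formula in the retrieved chunk is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A small plain-Python sketch of that computation follows, for illustration only. Real implementations use batched matrix operations, and the function names and toy inputs here are my own.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns one output row per query."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k), the key dimension
    outputs = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
print(out)
```

The query matches the first key more closely, so the output row lies nearer to the first value vector than to the second.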

In [31]:
agent_executer.invoke({"input": "What is the cost of the gpu "
                                "used in the models in transformer architecture?"})



> Entering new AgentExecutor chain...
I do not have enough information to answer the question about the cost of GPUs used in transformer models. To find this, I will search for information specifically about the transformer architecture and attention mechanism.

Action: transformer_search
Action Input: cost of gpu used in transformer models
architectures from the literature. We estimate the number of ﬂoating point operations used to train a
model by multiplying the training time, the number of GPUs used, and an estimate of the sustained
single-precision ﬂoating-point capacity of each GPU5.
6.2 Model Variations
To evaluate the importance of different components of the Transformer, we varied our base model
in different ways, measuring the change in performance on English-to-German translation on the
development set, newstest2013. We used beam search as described in the previous section, but no
checkpoint averaging. We present these results in Table 

{'input': 'What is the cost of the gpu used in the models in transformer architecture?',
 'output': "Based on the information found, the NVIDIA Tesla P100 GPUs used to train the original Transformer model are very expensive, high-end components. While exact prices weren't given, the P100 PCIe version has a single-unit cost of $9,430 before discounts, with higher-end models being even pricier. Refurbished P100s are also listed at high prices. \n\nAs the Transformer used 8 of these premium GPUs for training, that would represent a very significant cost, likely totaling tens of thousands of dollars in GPU hardware. So the GPUs enabling the Transformer's breakthrough performance came at a steep price."}