# Project: Summarization App Using LangChain and OpenAI

In [2]:
%%capture 
!pip install -r requirements.txt

In [3]:
import os
import openai
import getpass

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")
os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your PINECONE_API_KEY: ")

Enter your OPENAI_API_KEY:  ········
Enter your PINECONE_API_KEY:  ········


## Summarizing Using a Basic Prompt

In [6]:
from langchain_openai import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [7]:
text= r"""
Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability \
of AI hardware and extensibility of AI models.
Mojo is a new programming language that bridges the gap between research and production \ 
by combining the best of Python syntax with systems programming and metaprogramming.
With Mojo, you can write portable code that’s faster than C and seamlessly inter-op with the Python ecosystem.
When we started Modular, we had no intention of building a new programming language. \
But as we were building our platform with the intent to unify the world’s ML/AI infrastructure, \
we realized that programming across the entire stack was too complicated. Plus, we were writing a \
lot of MLIR by hand and not having a good time.
And although accelerators are important, one of the most prevalent and sometimes overlooked "accelerators" \
is the host CPU. Nowadays, CPUs have lots of tensor-core-like accelerator blocks and other AI acceleration \
units, but they also serve as the “fallback” for operations that specialized accelerators don’t handle, \
such as data loading, pre- and post-processing, and integrations with foreign systems. \
"""

messages = [
    SystemMessage(content='You are an expert copywriter with expertise in summarizing documents'),
    HumanMessage(content=f'Please provide a short and concise summary of the following text:\n TEXT: {text}')
]

llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')

In [8]:
llm.get_num_tokens(text)

235

In [9]:
summary_output = llm.invoke(messages)

In [11]:
summary_output.content

"Mojo is a new programming language that merges Python's usability with C's performance, enhancing AI hardware programmability and model extensibility. It aims to simplify the transition from research to production by offering a syntax similar to Python while incorporating systems programming and metaprogramming. Mojo enables the creation of portable, high-speed code that integrates well with the Python ecosystem. The development of Mojo arose from the challenges faced while building a unified ML/AI infrastructure, highlighting the importance of both specialized accelerators and the host CPU in AI operations."

## Summarizing Using Prompt Templates

In [12]:
from langchain_core.prompts import PromptTemplate  # root-level `from langchain import ...` is deprecated
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

In [13]:
llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')
parser = StrOutputParser()

template = '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''
prompt = PromptTemplate(
    input_variables=['text', 'language'],
    template=template
)
chain = prompt | llm | parser

In [14]:
llm.get_num_tokens(prompt.format(text=text, language='English'))

256
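As a rough illustration of what `prompt.format(...)` hands to the token counter, the sketch below fills the same template with Python's built-in `str.format`. It only approximates `PromptTemplate`, and `sample_text` and `fill_template` are hypothetical names introduced for this example, not part of the notebook.

```python
# Minimal sketch: fill the summarization template with plain str.format.
# Assumption: sample_text is a made-up stand-in for the notebook's `text`.
template = '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''

sample_text = "Mojo combines the usability of Python with the performance of C."


def fill_template(template: str, **variables: str) -> str:
    """Substitute {name} placeholders; raises KeyError if a variable is missing."""
    return template.format(**variables)


filled = fill_template(template, text=sample_text, language="English")
print(filled)
```

Because the instructions and the translation line are part of the formatted prompt, its token count is higher than that of the raw text alone.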

In [15]:
summary = chain.invoke({'text': text, 'language':'English'})

In [16]:
summary

"Mojo is a new programming language that merges Python's usability with C's performance, enhancing AI hardware programmability and model extensibility. It aims to simplify the complexities of programming across the machine learning and AI infrastructure by offering a syntax similar to Python while enabling high-speed, portable code that integrates well with the Python ecosystem. The development of Mojo arose from the challenges faced while building a unified ML/AI platform, highlighting the importance of both specialized accelerators and the host CPU in AI operations."

In [17]:
summary = chain.invoke({'text': text, 'language':'Jamaican patois'})

In [18]:
summary

"**Summary:** Mojo is a new programming language that merges Python's ease of use with C's performance, making it ideal for AI hardware and model extensibility. It allows for fast, portable code that works well within the Python ecosystem. The creators of Mojo aimed to simplify programming across machine learning and AI infrastructure, recognizing the importance of both specialized accelerators and host CPUs in the process.\n\n**Jamaican Patois Translation:** Mojo a one new programming language weh mix up Python ease wid C performance, mek it perfect fi AI hardware an model extensibility. It allow yuh fi write fast, portable code weh work good inna di Python ecosystem. Di creators a Mojo did waan fi simplify di programming across machine learnin an AI infrastructure, knowin seh both specialized accelerators an host CPUs important fi di process."

In [20]:
from IPython.display import Markdown, display

In [21]:
display(Markdown(summary))

**Summary:** Mojo is a new programming language that merges Python's ease of use with C's performance, making it ideal for AI hardware and model extensibility. It allows for fast, portable code that works well within the Python ecosystem. The creators of Mojo aimed to simplify programming across machine learning and AI infrastructure, recognizing the importance of both specialized accelerators and host CPUs in the process.

**Jamaican Patois Translation:** Mojo a one new programming language weh mix up Python ease wid C performance, mek it perfect fi AI hardware an model extensibility. It allow yuh fi write fast, portable code weh work good inna di Python ecosystem. Di creators a Mojo did waan fi simplify di programming across machine learnin an AI infrastructure, knowin seh both specialized accelerators an host CPUs important fi di process.

## Summarizing Using StuffDocumentsChain

In [22]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain_core.documents import Document

In [29]:
with open("files/sj.txt", encoding="utf-8") as f:
    text = f.read()

docs = [Document(page_content=text)]
llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')

In [30]:
template = '''Write a concise and short summary of the following text.
TEXT: `{text}`
'''
prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)
prompt

PromptTemplate(input_variables=['text'], input_types={}, partial_variables={}, template='Write a concise and short summary of the following text.\nTEXT: `{text}`\n')

In [31]:
docs

[Document(metadata={}, page_content='I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.\n\nThe first story is about connecting the dots.\n\nI dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?\n\nIt started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife. Except that when I popped out they decided at the last minute that they really wanted a girl. So my parents, who were on a waiting list, got 

In [32]:
chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=False
)
output_summary = chain.invoke(docs)

In [33]:
display(Markdown(output_summary["output_text"]))

In a commencement speech, the speaker shares three personal stories that highlight key life lessons. The first story emphasizes the importance of trusting one's intuition and the ability to connect past experiences to future success, illustrated by his decision to drop out of college and later apply his calligraphy skills to the Macintosh computer. The second story reflects on love and loss, detailing how being fired from Apple led to a creative resurgence, resulting in the founding of NeXT and Pixar. The third story addresses the inevitability of death, urging graduates to live authentically and follow their passions. He concludes with the advice to "Stay Hungry. Stay Foolish," encouraging a lifelong pursuit of curiosity and innovation.
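Conceptually, the `stuff` strategy places every document into one prompt and makes a single model call, so the whole input has to fit in the model's context window. The sketch below illustrates that idea in plain Python. It is not LangChain's implementation: `fake_llm` is a hypothetical stand-in for the chat model, and the `"\n\n"` separator is an assumption.

```python
from typing import Callable, List


def stuff_summarize(docs: List[str], template: str, llm: Callable[[str], str]) -> str:
    """Join all documents into one prompt and summarize with a single call."""
    combined = "\n\n".join(docs)  # assumed separator between documents
    return llm(template.format(text=combined))


# Hypothetical stand-in for a chat model: records each prompt, returns a canned reply.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary of {len(prompt)} characters"


template = "Write a concise and short summary of the following text.\nTEXT: `{text}`\n"
result = stuff_summarize(["First document.", "Second document."], template, fake_llm)
print(result)
print(len(calls))  # one call, regardless of how many documents were passed
```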

## Summarizing Large Documents Using map_reduce

In [34]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [80]:
llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')

with open('files/sj.txt', encoding='utf-8') as f:
    text = f.read()

In [81]:
llm.get_num_tokens(text)

2643

In [82]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = text_splitter.create_documents([text])

In [83]:
len(chunks)

2
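Note that `chunk_size` is measured in characters by default, not tokens, which is why a 2,643-token text can still exceed a 10,000-character limit and be split in two. The sketch below shows the basic idea of overlapping chunks with a fixed-size window. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on paragraph, line, and word boundaries, and `split_with_overlap` is a hypothetical helper written for this example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=2)
print([len(c) for c in chunks])  # → [12, 12, 10]
```

The overlap repeats a little context at each boundary so that a sentence cut in half is still partly visible in the next chunk.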

In [84]:
chunks[-1]

Document(metadata={}, page_content='No one wants to die. Even people who want to go to heaven don’t want to die to get there. And yet death is the destination we all share. No one has ever escaped it. And that is as it should be, because Death is very likely the single best invention of Life. It is Life’s change agent. It clears out the old to make way for the new. Right now the new is you, but someday not too long from now, you will gradually become the old and be cleared away. Sorry to be so dramatic, but it is quite true.\n\nYour time is limited, so don’t waste it living someone else’s life. Don’t be trapped by dogma — which is living with the results of other people’s thinking. Don’t let the noise of others’ opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary.\n\nWhen I was young, there was an amazing publication called The Whole Earth 

In [85]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
output_summary = chain.invoke(chunks)

In [86]:
display(Markdown(output_summary["output_text"]))

In his commencement speech, Steve Jobs shares three personal stories that convey essential life lessons. He emphasizes trusting intuition and connecting past experiences, illustrated by his decision to drop out of college and take a calligraphy class, which later influenced Macintosh design. He reflects on love and loss through his firing from Apple, which led to new ventures and ultimately his return to the company. Lastly, he discusses the inevitability of death, prompted by his cancer diagnosis, which encouraged him to prioritize what truly matters. Jobs urges graduates to pursue their passions and live authentically, echoing the message of "Stay Hungry. Stay Foolish" to inspire curiosity and boldness in their journeys.

In [87]:
with open('files/state_of_the_union.txt', encoding='utf-8') as f:
    text = f.read()

llm.get_num_tokens(text)

8220

In [88]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = text_splitter.create_documents([text])
len(chunks)

4

In [56]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
output_summary = chain.invoke(chunks)

In [57]:
display(Markdown(output_summary["output_text"]))

In a recent address, the President called for unity among Americans in response to global challenges, particularly the conflict in Ukraine, where he condemned Russian aggression and highlighted U.S. support for Ukraine. He discussed domestic issues, including economic recovery, job creation, and the need for infrastructure investment to compete globally, particularly against China. The Bipartisan Infrastructure Law aims to modernize infrastructure and promote domestic manufacturing. The President also outlined measures to combat inflation, improve healthcare, and support small businesses, while emphasizing the importance of voting rights and bipartisan cooperation on issues like police funding and gun control. Additionally, he honored retiring Supreme Court Justice Stephen Breyer and nominated Judge Ketanji Brown Jackson as his successor. A speaker advocating for immigration reform and women's rights emphasized a "Unity Agenda" focused on mental health, veterans' support, and cancer research, expressing optimism about America's resilience.

In [58]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [89]:
file = "files/attention_is_all_you_need.pdf"
loader = PyPDFLoader(file)
data = loader.load()
# data

In [62]:
len(data)

15

In [90]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
pdf_chunks = text_splitter.split_documents(data)
len(pdf_chunks)

15

In [91]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)
output_summary = chain.invoke(pdf_chunks)

In [92]:
display(Markdown(output_summary["output_text"]))

The paper "Attention Is All You Need" presents the Transformer, a groundbreaking neural network architecture that utilizes only attention mechanisms, eliminating the need for recurrent and convolutional layers. This design enhances parallelization and reduces training time, leading to superior performance in machine translation, achieving BLEU scores of 28.4 for English-to-German and 41.8 for English-to-French, surpassing previous models. The Transformer consists of an encoder and decoder with stacked self-attention and feed-forward layers, employing techniques like multi-head attention and positional encoding to effectively model dependencies and handle long-range sequences. The architecture demonstrates strong generalization across various tasks, including constituency parsing, and sets new benchmarks in translation efficiency. The study also explores variations in model configurations and their impact on performance, highlighting the potential for future research in applying attention mechanisms to other modalities.

In [69]:
display(Markdown(chain.llm_chain.prompt.template))

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:

In [70]:
display(Markdown(chain.combine_document_chain.llm_chain.prompt.template))

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:
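Both the map step and the combine step use the same default prompt, as the two outputs above show. A minimal sketch of how the two steps fit together follows. It assumes a stub `fake_llm` in place of the chat model and leaves out any intermediate collapsing that the real chain may perform when the partial summaries are too long.

```python
from typing import Callable, List

# Default prompt text as displayed above for both the map and the combine step.
DEFAULT_TEMPLATE = 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'


def map_reduce_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    map_template: str = DEFAULT_TEMPLATE,
    combine_template: str = DEFAULT_TEMPLATE,
) -> str:
    """Summarize each chunk on its own, then summarize the joined summaries."""
    # Map step: one independent call per chunk.
    partial_summaries = [llm(map_template.format(text=chunk)) for chunk in chunks]
    # Reduce step: a single call over all partial summaries.
    combined = "\n\n".join(partial_summaries)
    return llm(combine_template.format(text=combined))


# Hypothetical stand-in for a chat model: records prompts and numbers its replies.
prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts.append(prompt)
    return f"summary-{len(prompts)}"


final = map_reduce_summarize(["chunk A", "chunk B", "chunk C"], fake_llm)
print(final)         # → summary-4
print(len(prompts))  # → 4 (three map calls plus one combine call)
```

Because the map calls do not depend on each other, they can in principle run in parallel; the trade-off is that each chunk is summarized without seeing the others.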

### map_reduce with Custom Prompts

In [71]:
map_prompt = '''
Write a short and concise summary of the following:
Text: `{text}`
CONCISE SUMMARY:
'''
map_prompt_template = PromptTemplate(
    input_variables=['text'],
    template=map_prompt
)

In [72]:
combine_prompt = '''
Write a concise summary of the following text that covers the key points.
Add a title to the summary.
Start your summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
Text: `{text}`
'''
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=['text'])

In [76]:
summary_chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=False
)
output = summary_chain.invoke(chunks)

In [77]:
display(Markdown(output["output_text"]))

### Summary of Recent Address on Unity and Progress

In a recent address, the speaker called for unity among Americans while addressing both international and domestic challenges. They condemned Russia's aggression in Ukraine, praised the resilience of the Ukrainian people, and reaffirmed U.S. support through sanctions and military aid. The speaker also discussed the impact of the COVID-19 pandemic and outlined a vision for economic revitalization focused on the middle class and infrastructure investment.

- **International Focus:**
  - Condemnation of Russia's actions under Putin.
  - Support for Ukraine through economic sanctions and military assistance.
  
- **Domestic Issues:**
  - Reflection on the COVID-19 pandemic and the American Rescue Plan.
  - Advocacy for a new economic vision prioritizing middle-class investment and infrastructure.
  
- **Infrastructure Investment:**
  - U.S. infrastructure ranked 13th globally; need for modernization.
  - Bipartisan Infrastructure Law aims to create jobs and address climate change.
  - Key initiatives include electric vehicle charging stations, lead pipe replacement, and high-speed internet expansion.
  
- **Economic Measures:**
  - Proposals to lower costs for prescription drugs, energy, and childcare.
  - Emphasis on fair taxation for corporations and the wealthy.
  - Significant deficit reduction, with over a trillion dollars cut in one year.
  
- **Social Issues:**
  - Plans to improve nursing home standards, raise the minimum wage, and support workers.
  - Call for immigration reform, women's rights, and LGBTQ+ rights.
  - Unity Agenda focusing on the opioid epidemic, mental health services, veterans' support, and cancer research.

In conclusion, the speaker's address emphasized the importance of unity in overcoming challenges, both at home and abroad, while advocating for policies that promote economic equity and social justice.

In [78]:
output = summary_chain.invoke(pdf_chunks)

In [79]:
display(Markdown(output["output_text"]))

### Summary of "Attention Is All You Need"

The paper "Attention Is All You Need" presents the Transformer, a groundbreaking neural network architecture that utilizes attention mechanisms exclusively, eliminating the need for recurrent and convolutional layers. This innovation leads to enhanced performance in machine translation tasks, showcasing significant improvements in efficiency and translation quality.

- **Introduction of the Transformer**: 
  - A novel architecture that relies solely on attention mechanisms.
  - Achieves state-of-the-art BLEU scores: 28.4 (English-to-German) and 41.8 (English-to-French).
  - More parallelizable and requires less training time than previous models.

- **Architecture Overview**:
  - Comprises an encoder and decoder, each with six identical layers.
  - Utilizes multi-head self-attention and position-wise feed-forward networks.
  - Incorporates residual connections and layer normalization for improved performance.

- **Attention Mechanisms**:
  - Scaled Dot-Product Attention and Multi-Head Attention are key components.
  - Multi-Head Attention allows simultaneous focus on various representation subspaces.

- **Efficiency and Performance**:
  - Self-attention layers are more efficient than recurrent layers, especially for shorter sequences.
  - The Transformer model outperforms traditional models in translation tasks with reduced training costs.

- **Model Variations and Generalization**:
  - Different configurations of the Transformer show competitive performance in various tasks, including constituency parsing.
  - Future research aims to extend attention-based models to other domains like images and audio.

- **Training and Optimization**:
  - Utilizes the Adam optimizer and various regularization techniques.
  - Highlights the importance of hyperparameter tuning for optimal results.

In conclusion, the Transformer architecture represents a significant advancement in neural network design, particularly for language processing tasks, demonstrating strong performance and efficiency that could influence future research and applications in various fields.

## Summarizing Using the refine CombineDocumentsChain

In [93]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

In [96]:
file = "files/attention_is_all_you_need.pdf"
loader = PyPDFLoader(file)
pdf_data = loader.load()

In [123]:
display(Markdown(pdf_data[0].page_content))

Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser ∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it successfully to English constituency parsing both with
large and limited training data.
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been ﬁrmly established as state of the art approaches in sequence modeling and
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the ﬁrst Transformer models and
has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
attention and the parameter-free position representation and became the other person involved in nearly every
detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
efﬁcient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
arXiv:1706.03762v5  [cs.CL]  6 Dec 2017

In [98]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=100)
pdf_chunks = text_splitter.split_documents(pdf_data)
len(pdf_chunks)

15

In [124]:
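def print_embedding_cost(texts):
    """Print the total token count and estimated embedding cost for a list of Documents."""
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-3-small')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # Rate used here: 0.00002 USD per 1K tokens (0.02 USD per 1M tokens)
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.00002:.6f}')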

In [125]:
print_embedding_cost(pdf_chunks)

Total Tokens: 10235
Embedding Cost in USD: 0.000205
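The figure above can be reproduced by hand from the token count. The sketch below repeats the arithmetic with the rate as a parameter. The default of 0.00002 USD per 1K tokens is simply the constant used in this notebook and should be checked against current pricing. It is an embedding rate, so it does not describe the cost of the chat-model calls made by the summarization chains.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float = 0.00002) -> float:
    """Estimated cost in USD; the default rate is the notebook's constant (an assumption)."""
    return total_tokens / 1000 * usd_per_1k_tokens


cost = embedding_cost_usd(10235)
print(f"{cost:.6f}")  # → 0.000205
```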


In [100]:
llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    verbose=True
)
output_summary = chain.invoke(pdf_chunks)



> Entering new RefineDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser ∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirel

In [101]:
display(Markdown(output_summary["output_text"]))

The paper "Attention Is All You Need" introduces the Transformer, a novel neural network architecture that relies solely on attention mechanisms, eliminating the need for recurrent and convolutional layers. This approach enhances parallelization and reduces training time, allowing the model to achieve state-of-the-art results in machine translation, with a BLEU score of 28.4 for English-to-German and 41.8 for English-to-French. The Transformer demonstrates strong generalization capabilities across various tasks, including English constituency parsing.

The architecture consists of an encoder and a decoder, each composed of a stack of six identical layers. The encoder features a multi-head self-attention mechanism and a position-wise fully connected feed-forward network, with residual connections and layer normalization applied to each sub-layer. The decoder includes an additional multi-head attention sub-layer that attends to the encoder's output, with modifications to prevent attending to future positions, ensuring that predictions depend only on known outputs.

A key component of the Transformer is the Scaled Dot-Product Attention, which computes attention scores by taking the dot products of queries and keys, scaling them by the square root of their dimension, and applying a softmax function to obtain weights for the values. This method is efficient and effective, particularly when implemented with matrix multiplication. The model employs Multi-Head Attention, which allows it to project queries, keys, and values into multiple learned representations, performing attention in parallel to capture diverse information from different representation subspaces. Specifically, the Transformer uses eight parallel attention heads, each with reduced dimensionality, maintaining computational efficiency while enhancing the model's ability to jointly attend to information from different representation subspaces.

The Transformer utilizes multi-head attention in three distinct ways: in encoder-decoder attention layers, where the decoder queries attend to the encoder's output; in self-attention layers within the encoder, allowing each position to attend to all previous positions; and in self-attention layers within the decoder, which are masked to prevent leftward information flow, preserving the auto-regressive property.

In addition to attention sub-layers, each layer in the encoder and decoder contains a position-wise feed-forward network, consisting of two linear transformations with a ReLU activation in between. The model also incorporates learned embeddings to convert input and output tokens into vectors, and it uses positional encoding to inject information about the order of the sequence, compensating for the lack of recurrence and convolution. The positional encodings are derived from sine and cosine functions of different frequencies, allowing the model to learn relative positions effectively and potentially extrapolate to longer sequences than those encountered during training.

The authors emphasize the collaborative effort behind the development and implementation of the Transformer, which represents a significant advancement in the field of sequence transduction. The paper discusses the advantages of self-attention over recurrent and convolutional layers, highlighting its lower computational complexity, greater parallelization potential, and shorter path lengths for learning long-range dependencies, making it particularly suitable for tasks involving variable-length sequences. Additionally, the authors note that self-attention could yield more interpretable models, as individual attention heads learn to perform different tasks and exhibit behavior related to the syntactic and semantic structure of sentences.

The training of the Transformer models was conducted on large datasets, including the WMT 2014 English-German and English-French datasets, utilizing the Adam optimizer with a specific learning rate schedule and regularization techniques such as residual dropout. The training process was optimized for efficiency, leveraging powerful hardware to achieve significant performance improvements. Notably, the Transformer outperforms previous state-of-the-art models on the English-to-German and English-to-French translation tasks, achieving BLEU scores of 28.4 and 41.8, respectively, while requiring significantly less training cost in terms of floating point operations (FLOPs). The authors also employed label smoothing during training to improve accuracy and BLEU scores, further demonstrating the model's effectiveness.

Additionally, the paper presents variations on the Transformer architecture, showing that larger models generally yield better performance, while dropout is effective in preventing overfitting. The Transformer also generalizes well to English constituency parsing, achieving competitive results compared to previous models, indicating its versatility across different tasks. The findings suggest that while certain architectural choices, such as the size of attention keys, can impact model quality, the overall design of the Transformer remains robust and effective across various applications. The authors also highlight the model's performance in small-data regimes, noting that it outperforms traditional RNN sequence-to-sequence models even with limited training data, further underscoring its efficiency and effectiveness. Future research directions include extending the Transformer to other modalities beyond text and exploring local attention mechanisms for handling larger inputs and outputs.

The paper also includes attention visualizations, demonstrating how the attention mechanism can effectively capture long-distance dependencies. For instance, in the encoder's self-attention layer, certain attention heads focus on distant dependencies, such as the verb "making" in a sentence, illustrating the model's ability to maintain contextual relevance across longer sequences. This capability enhances the interpretability of the model, as different attention heads can be observed to learn distinct aspects of the input data. Notably, the paper highlights specific attention heads involved in tasks like anaphora resolution, showcasing the model's ability to focus sharply on relevant words, further emphasizing the interpretability and effectiveness of the attention mechanism in understanding complex linguistic structures. The attention visualizations also reveal that many attention heads exhibit behavior related to the structure of sentences, indicating that the model learns to perform different tasks effectively.
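The length and detail of this summary follow from how the `refine` strategy works: it summarizes the first chunk, then passes the running summary plus the next chunk to the model, one chunk at a time. The sketch below illustrates that loop in plain Python. It is not LangChain's implementation: `fake_llm` is a hypothetical stub and both templates are shortened placeholders.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    initial_template: str,
    refine_template: str,
) -> str:
    """Build a summary incrementally: one sequential call per chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # First call: summarize the first chunk alone.
    summary = llm(initial_template.format(text=chunks[0]))
    # Later calls: refine the running summary with each further chunk.
    for chunk in chunks[1:]:
        summary = llm(refine_template.format(existing_answer=summary, text=chunk))
    return summary


# Hypothetical stand-in for a chat model: records prompts and numbers its replies.
prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts.append(prompt)
    return f"S{len(prompts)}"


initial_template = "Summarize:\n{text}"
refine_template = "Existing summary: {existing_answer}\nRefine it with:\n{text}"

final = refine_summarize(["page 1", "page 2", "page 3"], fake_llm, initial_template, refine_template)
print(final)         # → S3
print(len(prompts))  # → 3 (one call per chunk, executed in order)
```

Each call depends on the previous result, so the calls cannot run in parallel, and the summary tends to grow as more chunks are folded in.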

### refine with Custom Prompts

In [102]:
prompt_template = """Write a concise summary of the following extracting the key information:
Text: `{text}`
CONCISE SUMMARY:"""
initial_prompt = PromptTemplate(template=prompt_template, input_variables=['text'])

refine_template = '''
    Your job is to produce a final summary.
    I have provided an existing summary up to a certain point: {existing_answer}.
    Please refine the existing summary with some more context below.
    ------------
    {text}
    ------------
    Start the final summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED
    by BULLET POINTS if possible AND end the summary with a CONCLUSION PHRASE.
    
'''
refine_prompt = PromptTemplate(
    template=refine_template,
    input_variables=['existing_answer', 'text']
)
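
To check what the model receives at each refine step, it can help to render the template with placeholder values before running the chain. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate.format`, which should fill `{existing_answer}` and `{text}` the same way for a simple template like this one. The shortened template `demo_refine_template`, the helper `render_refine_prompt`, and the sample strings are hypothetical and not taken from the notebook or the PDF.

```python
# Shortened, illustrative copy of the refine template (assumption: same
# placeholder names as the notebook's refine_template).
demo_refine_template = (
    "Your job is to produce a final summary.\n"
    "I have provided an existing summary up to a certain point: {existing_answer}.\n"
    "Please refine the existing summary with some more context below.\n"
    "------------\n"
    "{text}\n"
    "------------\n"
)


def render_refine_prompt(existing_answer: str, text: str) -> str:
    """Fill both placeholders, as happens for every chunk after the first."""
    return demo_refine_template.format(existing_answer=existing_answer, text=text)


# Hypothetical values, for illustration only.
rendered = render_refine_prompt(
    existing_answer="The paper introduces the Transformer",
    text="Multi-head attention lets the model attend to several subspaces.",
)
print(rendered)
```

Printing the rendered prompt makes it easy to spot a missing or misspelled placeholder before spending tokens on a full run.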

In [103]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=False,
)
output_summary = chain.invoke(pdf_chunks)
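
Conceptually, the `refine` chain walks the chunks in order: it summarizes the first chunk with the initial prompt, then passes each later chunk together with the running summary to the refine prompt. The following is a rough sketch of that control flow, not LangChain's actual implementation. `refine_summarize` and `toy_llm` are hypothetical names, and `toy_llm` only counts its calls, standing in for a real model.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    initial_template: str,
    refine_template: str,
) -> str:
    """Build a summary sequentially, with one model call per chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    # First chunk: plain summarization with the initial prompt.
    summary = llm(initial_template.format(text=chunks[0]))
    # Remaining chunks: refine the running summary with the new context.
    for chunk in chunks[1:]:
        summary = llm(refine_template.format(existing_answer=summary, text=chunk))
    return summary


calls: List[str] = []


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a marker string."""
    calls.append(prompt)
    return f"summary after {len(calls)} call(s)"


result = refine_summarize(
    chunks=["chunk one", "chunk two", "chunk three"],
    llm=toy_llm,
    initial_template="Summarize: {text}",
    refine_template="Existing: {existing_answer}\nNew context: {text}\nRefine:",
)
print(result)  # summary after 3 call(s)
```

Because each call depends on the previous summary, the calls in this sketch cannot run in parallel, and the number of model calls equals the number of chunks.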

In [104]:
display(Markdown(output_summary["output_text"]))

**Introduction**  
The paper "Attention Is All You Need" introduces the Transformer model, a groundbreaking architecture that has significantly advanced the field of natural language processing (NLP) by relying solely on attention mechanisms. This model addresses the shortcomings of traditional recurrent and convolutional networks, resulting in enhanced training efficiency, superior performance, and the ability to effectively manage long-range dependencies in data. The Transformer not only sets new standards for sequence transduction tasks but also opens avenues for future innovations in machine learning. The paper thoroughly explores the components of the Transformer and showcases its effectiveness across various applications, including machine translation and other NLP tasks.

**Key Points:**
- **Architecture**: The Transformer consists of an encoder-decoder structure, each with six identical layers utilizing multi-head self-attention and position-wise feed-forward networks.
- **Multi-Head Attention**: This feature enables the model to focus on different representation subspaces, capturing a diverse range of relationships within the data.
- **Attention Mechanism**: The attention function maps queries to key-value pairs, producing outputs as weighted sums based on the compatibility of queries with keys.
- **Scaled Dot-Product Attention**: This method computes dot products of queries and keys, scales them, and applies a softmax function to derive weights, effectively addressing gradient flow issues.
- **Residual Connections and Layer Normalization**: These techniques improve training stability and overall performance in both encoder and decoder layers.
- **Position-wise Feed-Forward Networks**: Each layer includes a fully connected feed-forward network, applied independently to each position.
- **Positional Encoding**: The model employs positional encodings to maintain sequence order, using sine and cosine functions to learn relative positions.
- **Performance**: The Transformer achieves state-of-the-art results in machine translation, significantly improving BLEU scores and outperforming previous models while reducing training time.
- **Efficiency**: The model can be trained in as little as twelve hours on eight P100 GPUs, demonstrating its computational efficiency.
- **Versatility**: The Transformer has been successfully applied to various tasks, including English constituency parsing, reading comprehension, and abstractive summarization.
- **Self-Attention**: This mechanism enhances the model's ability to capture long-range dependencies, improving performance across diverse tasks.
- **Comparative Analysis**: Self-attention layers exhibit lower computational complexity and greater parallelization compared to recurrent and convolutional layers.
- **Training Regime**: The models were trained on WMT 2014 datasets, utilizing byte-pair encoding and a shared vocabulary, with a focus on optimization techniques.
- **Results**: The Transformer outperforms previous models in BLEU scores for English-to-German and English-to-French translations, achieving these results with lower computational costs.
- **Model Variations**: The paper explores variations in the Transformer model, revealing optimal ranges for attention heads and the significance of dropout in preventing overfitting.
- **Generalization**: The architecture generalizes well to tasks beyond translation, achieving competitive F1 scores in English constituency parsing.
- **Small Data Regimes**: The Transformer shows promising results in small-data scenarios, outperforming traditional RNN models.
- **Attention Visualizations**: The paper includes visualizations of the attention mechanism, illustrating how the model captures long-distance dependencies. For example, in the encoder's self-attention layer, attention heads focus on the verb "making," maintaining contextual relevance across distances. Additionally, attention heads in layer 5 demonstrate their role in anaphora resolution, showcasing sharp attentions for the word "its."
- **Layer-Specific Behavior**: Attention heads in layer 5 exhibit behavior related to sentence structure, indicating that different heads have learned to perform distinct tasks, further enhancing the model's interpretability.

**Conclusion**  
The introduction of the Transformer architecture represents a significant leap forward in neural network design, providing a robust framework for addressing a wide range of sequence transduction challenges. Its innovative attention mechanisms not only enhance performance but also lay the groundwork for future developments in natural language processing and related fields. The ongoing exploration of self-attention and its implications for model interpretability and efficiency underscores the Transformer's potential to shape the future of machine learning.

## Summarizing Using LangChain Agents

In [112]:
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain.agents import create_react_agent, Tool, AgentExecutor
from langchain_community.utilities import WikipediaAPIWrapper

In [114]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LANGCHAIN_API_KEY: ")

Enter your LANGCHAIN_API_KEY:  ········


In [115]:
llm = ChatOpenAI(temperature=0, model_name='gpt-4o-mini')
wikipedia = WikipediaAPIWrapper()
prompt = hub.pull("hwchase17/react")



In [108]:
tools = [
    Tool(
        name="Wikipedia", 
        func=wikipedia.run,
        description="Useful for when you need to get information from wikipedia about a single topic"
    )
]

In [116]:
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)
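
The executor drives a loop in which the model alternates between choosing a tool and reading the tool's result, until it emits a final answer. Below is a simplified sketch of such a loop under stated assumptions: the `Action:` / `Action Input:` / `Final Answer:` markers follow the ReAct-style text format, while `run_react_loop`, `scripted_llm`, and `fake_wikipedia` are hypothetical stand-ins rather than LangChain APIs.

```python
import re
from typing import Callable, Dict


def run_react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate between model output and tool calls until a final answer appears."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(scratchpad)
        scratchpad += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Extract the tool name and its input from the model's reply.
        match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", reply)
        if match is None:
            raise ValueError(f"could not parse model output: {reply!r}")
        tool_name, tool_input = match.group(1), match.group(2).strip()
        observation = tools[tool_name](tool_input)
        # Feed the tool result back so the next model call can use it.
        scratchpad += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


def fake_wikipedia(query: str) -> str:
    """Stub tool: returns a placeholder instead of querying Wikipedia."""
    return f"(stub article about {query})"


def scripted_llm(prompt: str) -> str:
    """Stub model: asks for one lookup, then answers once an observation exists."""
    if "Observation:" in prompt:
        return (
            "Thought: I have enough information.\n"
            "Final Answer: RAG combines retrieval with generation."
        )
    return (
        "Thought: I should look this up.\n"
        "Action: Wikipedia\n"
        "Action Input: Retrieval-augmented generation"
    )


answer = run_react_loop(scripted_llm, {"Wikipedia": fake_wikipedia}, "What is RAG?")
print(answer)  # RAG combines retrieval with generation.
```

The `max_steps` cap in the sketch keeps a model that never produces a final answer from looping indefinitely.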

In [119]:
output = agent_executor.invoke({"input": "Can you please provide a summary of LLMs, RAG, and LLM agents?"})

In [122]:
display(Markdown(output["output"]))

**Large Language Models (LLMs)** are advanced computational models designed for natural language processing tasks, such as language generation. They learn statistical relationships from vast amounts of text through self-supervised and semi-supervised training. The most capable LLMs utilize a transformer-based architecture, allowing efficient processing and generation of text. They can be fine-tuned for specific tasks and are influenced by prompt engineering.

**Retrieval-Augmented Generation (RAG)** is a technique that enhances LLMs by integrating information retrieval capabilities. This allows LLMs to respond to queries using both their training data and specific documents, enabling them to provide more accurate and up-to-date information. RAG is particularly useful in applications like chatbots that need access to internal data or authoritative sources.

**LLM Agents** refer to multi-agent systems that leverage LLMs to enable sophisticated interactions among multiple intelligent agents. These systems can tackle complex problems that individual agents cannot solve alone. The integration of LLMs into multi-agent systems represents a new research area, enhancing the capabilities of agents in various applications, including online trading and disaster response.