# Phase 1: Extracting Text from PDFs and Text Files

Using `pymupdf` for PDFs

In [5]:
import fitz


doc = fitz.open("attention_paper.pdf")

text = "\n".join([page.get_text() for page in doc])
print(text)


Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exper

In [7]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/devansh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/devansh/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/devansh/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /home/devansh/nltk_data...


True
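
The cells above only cover PDFs, although this phase is also meant to handle plain text files. A minimal sketch of one way to dispatch on the file extension is shown below. `extract_text` is a hypothetical helper name, and the UTF-8 default with `errors="replace"` is an assumption about the input files rather than something the notebook specifies.

```python
from pathlib import Path


def extract_text(path, encoding="utf-8"):
    """Return the text of a .pdf or plain-text file (hypothetical helper)."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        import fitz  # PyMuPDF, imported lazily so text files work without it

        with fitz.open(str(path)) as doc:
            return "\n".join(page.get_text() for page in doc)
    # Assumption: anything that is not a PDF is treated as plain text
    return path.read_text(encoding=encoding, errors="replace")
```

With this helper, `text = extract_text("attention_paper.pdf")` would produce the same string as the PDF cell above, and a `.txt` path would be read directly.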

# Phase 2: Chunking The Extracted Text

In [8]:
def gen_chunks(text, chunk_size=300, overlap=50):
    words = text.split()
    chunks = []
    current_index = 0

    while current_index < len(words):
        # Get the chunk
        end_index = current_index + chunk_size
        chunk = " ".join(words[current_index:end_index])

        # Store
        chunks.append(chunk)

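        # Stop at the end of the text; another pass would only repeat overlap words
        if end_index >= len(words):
            break

        # Move index forward with overlap
        current_index += chunk_size - overlap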

    return chunks

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def gen_chunks_token(text, max_tokens=512, stride=50):
    tokens = tokenizer(text, return_tensors="pt", truncation=False)["input_ids"][0]
    chunks = []
    start = 0

    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_ids = tokens[start:end]
        chunk = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        chunks.append(chunk)
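        # Stop at the last token; another pass would only repeat overlap tokens
        if end == len(tokens):
            break
        start += max_tokens - stride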

    return chunks

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def gen_chunks_sent(text, chunk_size=300, overlap=1):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        words = sentence.split()
        if current_length + len(words) <= chunk_size:
            current_chunk.append(sentence)
            current_length += len(words)
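        else:
            # Flush the current chunk (skip if a single oversized sentence left it empty)
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            # Start new chunk with `overlap` previous sentences
            # (a plain [-overlap:] slice would keep everything when overlap is 0)
            carried = current_chunk[-overlap:] if overlap > 0 else []
            current_chunk = carried + [sentence]
            current_length = sum(len(s.split()) for s in current_chunk)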

    # Append last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


chunks = gen_chunks_sent(text)

[nltk_data] Downloading package punkt to /home/devansh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
chunks

['Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works. Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence
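
To make the effect of the overlap parameter easier to see, here is a small toy sketch that uses the same step size (`chunk_size - overlap`) as `gen_chunks`, applied to eleven placeholder words instead of the paper text. `word_chunks` is a hypothetical name used only for this illustration.

```python
def word_chunks(words, chunk_size, overlap):
    """Sliding word window that advances by chunk_size - overlap each step."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = [f"w{i}" for i in range(11)]
for chunk in word_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
# ['w9', 'w10']
```

Each chunk starts with the last word of the previous one, which is what lets a sentence that straddles a boundary appear in full in at least one chunk.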

# Phase 3: Embedding the Chunks

Storing them in dicts for now; a vector database will replace this later.

In [10]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


In [11]:
chunk_embeddings = [{
    'chunk': chunk,
    'embedding': model.encode(chunk, normalize_embeddings=True),
    'meta': {
        'chapter': 1
    }
} for chunk in chunks]
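
Each record carries a `meta` dict alongside the chunk and its embedding. One thing that field could be used for is narrowing the candidates before ranking. The sketch below assumes records shaped like the entries of `chunk_embeddings`; `filter_by_meta` is a hypothetical helper and the toy embeddings are made up.

```python
def filter_by_meta(records, **wanted):
    """Keep records whose 'meta' dict matches every key/value in `wanted`."""
    return [
        r for r in records
        if all(r["meta"].get(k) == v for k, v in wanted.items())
    ]


# Toy records shaped like the entries of chunk_embeddings (embeddings are made up)
records = [
    {"chunk": "intro text", "embedding": [1.0, 0.0], "meta": {"chapter": 1}},
    {"chunk": "method text", "embedding": [0.0, 1.0], "meta": {"chapter": 2}},
]
chapter_one = filter_by_meta(records, chapter=1)
print([r["chunk"] for r in chapter_one])  # ['intro text']
```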

# Phase 4: Semantic Search

Using cosine similarity to compare query embeddings and chunk embeddings.

In [12]:
import numpy as np

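def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def return_top_k_chunks(query, k=3):
    query_embedding = model.encode(query)
    ranked_chunks = [(cosine_sim(query_embedding, entry['embedding']), entry) for entry in chunk_embeddings]
    # Sort on the score only; comparing whole tuples would fall back to
    # comparing dicts on tied scores and raise a TypeError
    ranked_chunks.sort(key=lambda pair: pair[0], reverse=True)

    return ranked_chunks[:k]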
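
The chunk embeddings were stored with `normalize_embeddings=True`, so they already have unit length. For unit vectors, cosine similarity reduces to a plain dot product. The pure-Python sketch below checks this on two made-up 2-D vectors; it is an illustration of the identity, not a replacement for the NumPy version above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))

print(cosine(a, b))   # 24 / 25, up to floating-point rounding
print(dot_of_units)   # same value, up to floating-point rounding
```

If the query were also encoded with `normalize_embeddings=True`, the division by the norms could be skipped entirely.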

In [13]:
query = "What is the difference between an algorithm and a program?"
best_chunks = return_top_k_chunks(query)
print(f'Query: {query}\n')
print('Best Chunks')
for sim, chunk_info in best_chunks:
    # Double quotes outside, single inside: nested same-type quotes need Python 3.12+
    print(f"Similarity: {sim}\nChunk: {chunk_info['chunk']}")
    print()

Query: What is the difference between an algorithm and a program?

Best Chunks
Similarity: 0.1249820739030838
Chunk: The fundamental
constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs. 2
Background
The goal of reducing sequential computation also forms the founda

In [15]:
from google import genai
import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

In [16]:
query = input()
best_chunks = return_top_k_chunks(query)
print(f'Query: {query}\n')
print('Best Chunks')
for sim, chunk_info in best_chunks:
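    print(f"Similarity: {sim}\nChunk: {chunk_info['chunk']}")
    print()

# Build the context first; a backslash inside an f-string expression needs Python 3.12+
context = "\n".join(chunk['chunk'] for _, chunk in best_chunks)
prompt = f"""{context}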

Based on the paragraphs above, answer the question below

{query}"""

response = client.models.generate_content(
    model='gemini-2.5-flash', contents = prompt
)

print(response.text)

Query: what is attention?

Best Chunks
Similarity: 0.4772815704345703
Chunk: Convolutional layers are generally more expensive than
recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity
considerably, to O(k · n · d + n · d2). Even with k = n, however, the complexity of a separable
convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,
the approach we take in our model. As side benefit, self-attention could yield more interpretable models. We inspect attention distributions
from our models and present and discuss examples in the appendix. Not only do individual attention
heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic
and semantic structure of the sentences. 5
Training
This section describes the training regime for our models. 5.1
Training Data and Batching
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 mi

# Phase 5: Porting Transformers for Offline Functionality

Using `all-MiniLM-L6-v2` for chunk embedding, `flan-t5-base` for summarization, and `roberta-base-squad2` for question answering.

In [20]:
from transformers import pipeline
import torch

import re

def clean_text(text):
    text = re.sub(r'\b(EOS|EOS>)\b', '', text)
    text = re.sub(r'\bFigure \d+.*?\.', '', text)
    text = re.sub(r'\n+', ' ', text)
    return text.strip()

pipe = pipeline("text2text-generation", model="google/flan-t5-base", device=0, torch_dtype=torch.float16)

summary = ''
for chunk_embedding in chunk_embeddings:
    chunk_text = clean_text(chunk_embedding['chunk'])
    # Construct a better prompt for flan-t5-base
    prompt = (
        "Summarize this research paper excerpt for a high school student. "
        "Explain what it proposes, how it works, and why it matters:\n\n"
        f"{chunk_text}"
    )

    result = pipe(prompt, do_sample=False)[0]['generated_text']
    summary += result.strip() + '\n'

# Join all chunk summaries and summarize again (this second pipeline also runs on the GPU, in full precision)
pipe_final = pipeline("text2text-generation", model="google/flan-t5-base", device=0)
final_summary = pipe_final("Summarize:\n" + summary, max_length=200, do_sample=False)[0]['generated_text']

print(final_summary)


Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors
Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (3395 > 512). Running this sequence through the model will result in indexing errors
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for as little as twelve hours on eight GPUs, a small fraction of the training costs of the best models from the literature. Work performed while at Google Brain. Work performed while at Google Research. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. arXiv:1706.03762v7 [cs.CL] 2 Aug 2023 1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transducti
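
The warnings above report inputs of 546 and 3395 tokens against a 512-token limit. One possible way to keep every call under a limit is to summarize in rounds: group the texts into batches that fit, summarize each batch, and repeat on the results. The sketch below assumes a `summarize` callable (for example, a thin wrapper around the text2text pipeline) and uses whitespace word counts as a rough stand-in for token counts. Both function names and the 300-word default are hypothetical.

```python
def batch_by_budget(texts, max_words):
    """Group consecutive texts so each group stays within max_words (whitespace words)."""
    batches, current, count = [], [], 0
    for text in texts:
        n = len(text.split())
        if current and count + n > max_words:
            batches.append(current)
            current, count = [], 0
        current.append(text)
        count += n
    if current:
        batches.append(current)
    return batches


def hierarchical_summary(texts, summarize, max_words=300):
    """Summarize in rounds until the remaining text fits in one call.

    `summarize` is any callable str -> str; it is assumed to return
    text shorter than its input.
    """
    while len(texts) > 1:
        batches = batch_by_budget(texts, max_words)
        if len(batches) == len(texts):
            # No two texts fit in one batch: stop to avoid looping forever
            break
        texts = [summarize("\n".join(batch)) for batch in batches]
    # Final pass over whatever is left (a single text in the normal case)
    return summarize("\n".join(texts))
```

Word counts undercount tokens for most tokenizers, so the budget would need to sit well below the model's actual token limit.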

In [None]:
import gc, torch
del pipe
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
#pipe = pipeline("text2text-generation", model="google/flan-t5-large", device=0, torch_dtype=torch.float16)

In [9]:
from transformers import pipeline

# Initialize the QA pipeline with a Squad2-trained model
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2", tokenizer="deepset/roberta-base-squad2")

Device set to use cuda:0


In [13]:
query = 'What is a transformer?'
top_k_chunks = return_top_k_chunks(query, k=1)
context = "\n".join(chunk[1]['chunk'] for chunk in top_k_chunks)
# Pass question and context as keyword arguments
result = qa_pipeline(question=query, context=context)
print(result['answer'])

the first transduction model


In [10]:
from transformers import pipeline
import torch
pipe = pipeline("text2text-generation", model="google/flan-t5-large", device=0, torch_dtype=torch.float16)

Device set to use cuda:0


In [12]:
query = input()
top_k_chunks = return_top_k_chunks(query, k=3)
context = "\n".join(chunk[1]['chunk'] for chunk in top_k_chunks)
prompt = f"""
You are an AI tutor helping high school students understand research papers.

Context: {context}

Question: {query}

Answer briefly and clearly in simple everyday language.
Avoid repeating the context or using technical or academic terms.
"""

response = pipe(prompt)[0]['generated_text']
print("Query:", query)
print("Answer:", response)
import gc  # was missing in this cell, which raised NameError: name 'gc' is not defined
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

Query: transformer
Answer: Work performed while at Google Brain. Work performed while at Google Research. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. arXiv:1706.03762v7 [cs.CL] 2 Aug 2023 1 Introduction Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht1 and the input for position t. This inherently sequential nature precludes paral

In [None]:
import gc
query = input()
compressed_context = ""
top_k_chunks = return_top_k_chunks(query, k=3)
for _, chunk_info in top_k_chunks:
    # Each entry is a (similarity, record) tuple; summarize only the chunk text
    summary_prompt = f"Summarize this for a high school student:\n\n{chunk_info['chunk']}"
    compressed = pipe(summary_prompt, max_length=100, do_sample=False)[0]['generated_text']
    compressed_context += compressed + "\n"

# Now use compressed_context in your final QnA prompt
final_prompt = f"""
You are an AI tutor helping high school students understand research papers.

Context: {compressed_context}

Question: {query}

Answer in short, simple sentences using non-technical language.
"""

response = pipe(final_prompt, max_length=200, do_sample=False)[0]['generated_text']
print("Query:", query)
print("Answer:", response)
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

In [32]:
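import gc, torch
# Drop the pipeline only if it still exists; a bare `del pipe` raises
# NameError when this cell is re-run after the pipeline was already deleted
if 'pipe' in globals():
    del pipe
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()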