In [22]:
# Importing Libraries
from langchain_community.llms import Ollama
from langchain_ollama import OllamaLLM

- from langchain_community.llms import Ollama:
Imports the community version of Ollama LLM wrapper for LangChain. May be used in older setups or custom workflows.
- from langchain_ollama import OllamaLLM:
Imports the official Ollama integration for LangChain. Preferred for current, stable use with local models like LLaMA, Mistral, etc.
Use one of them depending on your setup. For new projects, go with OllamaLLM.


In [2]:

# Initialize the model
llm = Ollama(model='llama3.1:8b')

  llm = Ollama(model='llama3.1:8b')


- llm = Ollama(model='llama3.1:8b'):
This initializes a local language model using Ollama.
It loads the model named 'llama3.1:8b', which likely refers to LLaMA 3.1 with 8 billion parameters.
‚úÖ After this, llm can be used to generate text or interact with the model in LangChain workflows


In [3]:
question = "Popular actor of india"
response = llm.invoke(question)
print(response)

Here are some of the most popular actors in India:

**Male Actors:**

1. **Shah Rukh Khan**: Known as the "King of Bollywood", he has acted in numerous hit films like Dilwale Dulhania Le Jayenge, Kuch Kuch Hota Hai, and Kabhi Khushi Kabhie Gham.
2. **Amitabh Bachchan**: A legendary actor who has been active in the industry for over 50 years, known for his iconic roles in films like Sholay, Deewar, and Black.
3. **Salman Khan**: One of the highest-paid actors in India, known for his blockbuster hits like Bajrangi Bhaijaan, Sultan, and Dabangg.
4. **Hrithik Roshan**: A versatile actor known for his energetic performances in films like Kaho Naa Pyaar Hai, Dhoom 2, and Kaabil.
5. **Ranveer Singh**: Known for his energetic and flamboyant performances in films like Bajirao Mastani, Padmaavat, and Gully Boy.

**Female Actors:**

1. **Priyanka Chopra**: A popular actress who has acted in numerous hit films like Kaminey, Barfi!, and Mary Kom.
2. **Deepika Padukone**: One of the highest-paid act

In [4]:
# Generate answers to a question
question = "Who is sunil shetty "
response = llm.invoke(question)
print(response)

Suniel Shetty is an Indian actor, film producer, and television personality who has been active in the Hindi film industry since the late 1980s. He was born on August 19, 1961, in Mulki, Karnataka, India.

Shetty began his acting career with a small role in the 1988 film "Balwan", but it was his breakthrough role as a villain in the 1992 film "Dil" that brought him to prominence. He then went on to play lead roles in several successful films, including "Hum Hain Khalnayak" (1994), "Gopi Kishan" (1994), and "Aaditya" (1995).

Shetty's most notable role was perhaps as the villainous Rakka in Rajiv Mehra's 1993 film "Dilwale Dulhania Le Jayenge", which is one of the highest-grossing films of all time in Indian cinema.

In addition to his acting career, Shetty has also ventured into production with his company, Sunshine Productions. He has produced several films, including "Gadar: Ek Prem Katha" (2001), "Krishna Cottage" (2006), and "Tera Mera Ki Rishta" (2010).

Shetty has also made appea


### Implementing RAG for custom data

- PyPDFLoader: Loads text content from a PDF file and converts it into LangChain Document objects.
- RecursiveCharacterTextSplitter: Splits large text into smaller chunks (e.g., 500‚Äì1000 characters) while preserving sentence structure as much as possible. Useful for feeding into LLMs that have token limits.


In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

- Loads the full PDF into a list of **Document** objects.
- Each **Document** contains text and metadata (like page number).
- Breaks the full text into smaller overlapping chunks.
- **chunk_size=1100**: Each chunk has ~1100 characters.
- **chunk_overlap=140**: Each chunk overlaps the previous by 140 characters to preserve context.
- Prints how many chunks were created.
- Displays the first 800 characters of the first chunk


# Notes -

### ‚úÖ How to Decide Chunk Size Based on PDF Pages (Correct Method)

- You decide parameters using 3 factors:
- 1Ô∏è‚É£ Step 1 ‚Äî Understand PDF Type
| PDF Type           | Examples                | Best Chunk Size |
| ------------------ | ----------------------- | --------------- |
| Research Paper     | Transformers, ML papers | **1500‚Äì2000**   |
| Textbooks          | ML, DS books            | **1800‚Äì2500**   |
| Code/Documentation | LangChain, APIs         | **800‚Äì1200**    |
| Stories/Novels     | Fiction, articles       | **1000‚Äì1500**   |
| Legal/Contracts    | Agreements, policies    | **1200‚Äì1800**   |

- 2Ô∏è‚É£ Step 2 ‚Äî Estimate Text Density (Important)

The number of pages alone is meaningless because:

A 50-page paper = ~20,000 words

A 50-page textbook = ~40,000‚Äì60,000 words

A 50-page presentation-style PDF = ~4,000 words

So instead, ask:

Does 1 page contain heavy text?

If yes, use larger chunks.
- Simple rule:
| Page Text Density | Signs                      | Chunk Size    |
| ----------------- | -------------------------- | ------------- |
| High Density      | Full paragraphs, equations | **1500‚Äì2000** |
| Medium            | Normal text                | **1200‚Äì1500** |
| Low               | Bullet slides              | **600‚Äì900**   |

- 3Ô∏è‚É£ Step 3 ‚Äî Simple Formula for Chunk Size Based on Pages

If you still want a formula, here is the best practical one:
Formula (ML research/general books):
chunk_size = 35000 / number_of_chunks_you_want


To get good RAG performance:

Best practice:
Aim for 50‚Äì80 chunks total.


So:

For a 52-page research paper ‚Üí Target 60‚Äì70 chunks

Use:

chunk_size = (total_characters / 60)


But you don't know characters ‚Üí so use this shortcut:

Shortcut Rule
chunk_size = 1500 + (pages / 10 * 50)


For 52 pages:

chunk_size = 1500 + (52/10 * 50)
            = 1500 + 260
            = 1760


Perfect for research papers.

üéØ Final Decision Shortcut (use this always)
üìò If PDF has dense text (research/math/books):

üëâ chunk_size = 1600‚Äì2000
üëâ chunk_overlap = 200‚Äì300

üìÑ If PDF has normal paragraphs:

üëâ chunk_size = 1200‚Äì1500
üëâ chunk_overlap = 150‚Äì200

üñ•Ô∏è If PDF has slides (PPT-like):

üëâ chunk_size = 600‚Äì800
üëâ chunk_overlap = 100‚Äì150


### ‚≠ê Super Simple Guide
| Pages   | Type           | Recommended Chunk Size |
| ------- | -------------- | ---------------------- |
| 20‚Äì50   | Research paper | **1500‚Äì1800**          |
| 50‚Äì150  | Textbook       | **1800‚Äì2200**          |
| 150‚Äì500 | Long books     | **2000‚Äì2500**          |
| 10‚Äì30   | Slides         | **600‚Äì900**            |


In [6]:

pdf_loader = PyPDFLoader("Attention Is All You Need.pdf")
documents = pdf_loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1300,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = text_splitter.split_documents(documents)

print("Total Chunks:", len(chunks))
print(chunks[0].page_content[:15])




Total Chunks: 39
Provided proper


In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Create embeddings
embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

# Build FAISS vector store
db = FAISS.from_documents(documents=chunks, embedding=embeddings)

# Create retriever
retriever = db.as_retriever()


  embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")


- **HuggingFaceEmbeddings**: Loads a sentence transformer model to convert text into numerical vectors (embeddings).
- **FAISS**: A fast vector similarity search library used to store and search embeddings efficiently


- Loads the **all-mpnet-base-v2 model** from Hugging Face.
- This model turns each text chunk into a dense vector that captures its meaning.
- Converts all **chunks** into vectors using the embedding model.
- Stores them in a FAISS index for fast similarity search.
- Converts the **FAISS index** into a retriever object.
- You can now use **retriever.get_relevant_documents(query)** to fetch chunks similar to a user query.


# Notes -
#### ‚úÖ Recommended Embedding Models for RAG
1Ô∏è‚É£ all-mpnet-base-v2 (your current choice)

Model: "sentence-transformers/all-mpnet-base-v2"

Pros:

Very strong semantic understanding

Excellent for short & long texts

High-quality embeddings for English research papers

Cons:

Slightly slower than smaller models

Verdict: ‚úÖ Excellent choice for research papers

### 2Ô∏è‚É£ Other HuggingFace Sentence Transformers
| Model                       | Strengths                             | Use Case                                                             |
| --------------------------- | ------------------------------------- | -------------------------------------------------------------------- |
| `all-MiniLM-L6-v2`          | Lightweight, fast, smaller embeddings | If you want **faster retrieval** and can sacrifice a little accuracy |
| `all-mpnet-base-v2`         | High accuracy                         | Best **balance for research papers**                                 |
| `multi-qa-MiniLM-L6-cos-v1` | Optimized for **question-answering**  | If RAG is mostly **QA-based**                                        |
| `all-mpnet-base-v1`         | Slightly older version                | Slightly less accurate, cheaper                                      |

3Ô∏è‚É£ For Large Docs / Dense Research Papers

You can use multi-qa-mpnet-base-dot-v1 or all-mpnet-base-v2

Why?

Handles longer contexts

Good for similarity search

Works with FAISS / Chroma / Milvus

4Ô∏è‚É£ Light / Fast Option
HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


Embedding size: 384 (smaller)

Fast, but slightly less semantic precision

üéØ Recommendation for 15‚Äì50 page research papers

Best Accuracy: all-mpnet-base-v2 ‚úÖ

Fast + Good Accuracy: all-MiniLM-L6-v2

Your current choice (all-mpnet-base-v2) is perfect for Attention Is All You Need.

In [8]:
llm = OllamaLLM(model="llama3.1:8b",gpu=False)

In [9]:
!pip install --upgrade langchain




In [10]:
import langchain
print(langchain.__version__)


1.1.0


In [13]:
!pip show langchain


Name: langchain
Version: 1.1.0
Summary: Building applications with LLMs through composability
Home-page: https://docs.langchain.com/
Author: 
Author-email: 
License: MIT
Location: C:\Users\weare\ansel\Lib\site-packages
Requires: langchain-core, langgraph, pydantic
Required-by: 


In [17]:
!pip install --upgrade langchain




In [11]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["chat_history", "context", "question"],
    template="""
You are a helpful AI assistant.

Chat History:
{chat_history}

Context:
{context}

Question:
{question}

Answer:
"""
)




- This creates a custom prompt template for a Retrieval-Augmented Generation (RAG) chatbot using LangChain.


- PromptTemplate: A LangChain utility to define dynamic prompts with placeholders.
- input_variables: These are the dynamic fields (chat_history, context, question) that will be filled at runtime.
- template: The actual prompt structure. It guides the LLM to:
- Read the chat history (for continuity),
- Use the retrieved context (from vector store),
- Answer the latest user question.

‚úÖ Use Case
This prompt is ideal for chatbots with memory + retrieval, where:
- chat_history maintains conversation flow,
- context comes from relevant document chunks,
- question is the current user query.


Explanation (in short):
Takes a list of documents (docs).
Extracts the text (page_content) from each document.
Joins all the text together with two new lines (\n\n) between them.
Returns one clean combined string.
So it basically merges multiple document texts into one formatted text.


In [15]:
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

In [29]:
def ask_question(query):
    docs = db.similarity_search(query, k=5)       # 1. Find top 5 most relevant chunks from the vector DB
    context = format_docs(docs)                   # 2. Convert those chunks into a single text block

    history_text = "\n".join(chat_history)        # 3. Convert chat history list ‚Üí single string

    final_prompt = prompt.format(                 # 4. Fill your RAG prompt template with:
        chat_history=history_text,
        context=context,
        question=query
    )

    response = llm.invoke(final_prompt)           # 5. Ask the LLM using the combined prompt

    chat_history.append(f"User: {query}")         # 6. Add new user message to history
    chat_history.append(f"AI: {response}")        # 7. Add AI reply to history

    return response                               # 8. Return answer to user


In [18]:
query1 = "Explain self-attention mechanism"
answer1 = ask_question(query1)
print(answer1)

query2 = "How does positional encoding work?"
answer2 = ask_question(query2)
print(answer2)


The self-attention mechanism is a crucial component of the Transformer model, allowing it to attend to different parts of the input sequence and weigh their importance for generating the output. It's a way for the model to understand relationships between different tokens or positions in the input.

In simple terms, self-attention works by computing a weighted sum of the values from different positions in the input, where the weights are learned during training. The input is split into three components: queries (Q), keys (K), and values (V). The model then computes attention scores between each query and key, which represents how relevant each pair is to each other.

The attention mechanism consists of two main steps:

1. **Attention calculation**: The model computes the attention weights (Œ±) by taking the dot product of Q and K, and applying a softmax function to get the normalized weights.
2. **Weighted sum**: The model takes the weighted sum of the values V, using the attention wei

#### Sends the question ‚ÄúExplain self-attention mechanism‚Äù to your ask_question() function.

-  RAG system retrieves relevant chunks ‚Üí builds prompt ‚Üí gets LLM answer.

- Stores the answer in answer1.

- Prints the answer.

In [19]:
query3 = "Encoder and Decoder Stacks?"
answer3 = ask_question(query2)
print(answer2)


In this case, the question is already answered in detail. However, I will provide a concise summary and highlight the key points.

**Positional Encoding**

The positional encoding is used to inject information about the relative or absolute position of tokens in a sequence into the model. This is necessary because self-attention mechanisms do not have a natural notion of order or position.

**How it Works**

The positional encoding uses sine and cosine functions of different frequencies to represent each dimension of the input embeddings. The formula for positional encoding is:

P E(pos,2i) = sin(pos/100002i/dmodel )
P E(pos,2i+1) = cos(pos/100002i/dmodel )

Where `pos` is the position, and `i` is the dimension.

**Key Points**

* Positional encoding is used to provide information about token positions in a sequence.
* It uses sine and cosine functions of different frequencies to represent each dimension.
* The positional encodings have the same dimension as the embeddings (dmodel), so

In [26]:
query4 = "Conclusion"
answer4 = ask_question(query4)
print(answer4)
docs = retriever.invoke(question)
for d in docs:
    print(d.page_content[:400])


It seems we've reached the end of our conversation! To summarize, we discussed self-attention mechanisms and their application in natural language processing. We covered the following topics:

1. **Self-Attention Mechanism**: Self-attention is a crucial component of the Transformer model that allows it to attend to different parts of the input sequence and weigh their importance for generating the output.
2. **Positional Encoding**: Positional encoding is used to inject information about the relative or absolute position of tokens in a sequence into the model. This is necessary because self-attention mechanisms do not have a natural notion of order or position.
3. **Why Self-Attention**: Self-attention was used in this particular work because it addresses three key desiderata for mapping one variable-length sequence to another: parallelization, scalability, and interpretability.

We also reviewed some figures and papers related to self-attention mechanisms and neural machine translatio

In [27]:
query4 = "author of pdf "
answer4 = ask_question(query4)
print(answer4)
docs = retriever.invoke(question)
for d in docs:
    print(d.page_content[:400])


The authors of the PDF are not explicitly stated in the provided snippet, but based on the references and citations mentioned, it appears to be a collection of research papers and articles related to natural language processing and neural machine translation.

However, I can provide some information about the specific papers cited:

* The paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014) is attributed to Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
* The paper "Massive Exploration of Neural Machine Translation Architectures" (Britz et al., 2017) is attributed to Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le.

If you're looking for the authors of the original PDF, I'd be happy to help you investigate further!
4To illustrate why the dot products get large, assume that the components of q and k are independent random
variables with mean 0 and variance 1. Then their dot product, q ¬∑ k = Pdk
i=1 qiki, has mean 0 and va

In [28]:
query4 = "Scaled Dot-Product Attention"
answer4 = ask_question(query4)
print(answer4)
docs = retriever.invoke(question)
for d in docs:
    print(d.page_content[:400])


The Scaled Dot-Product Attention is a type of attention mechanism used in the Transformer model. It works by computing the dot products of the query with all keys, dividing each by ‚àödk, and applying a softmax function to obtain the weights on the values.

The formula for Scaled Dot-Product Attention is given by:

Attention(Q, K, V) = softmax(QKT
‚àödk
)V

Where Q is the matrix of queries, K is the matrix of keys, V is the matrix of values, and dk is the dimension of the keys. The dot products are scaled by ‚àödk to prevent the dot products from growing too large in magnitude.

The Scaled Dot-Product Attention has several advantages over other attention mechanisms, including additive attention, such as:

* It can be implemented using highly optimized matrix multiplication code.
* It is much faster and more space-efficient than additive attention.
* It outperforms additive attention for small values of dk.

However, for larger values of dk, the dot products can grow large in magnitude,