<div style="border-radius: 10px; border: 1px solid #000; padding: 10px; background: #fff; color: #000; text-align: center; box-shadow: 0px 2px 4px rgba(0, 0, 0, 0.2);">
    <h1 style="color: #000; font-weight: bold; margin-bottom: 10px; font-size: 18px;">
        I commented out most of the code because I don't want to reveal my API keys, so it can't be run as-is or made public on Kaggle. It took me 10 days to develop, and I hope it helps you become an expert in RAG.
    </h1>
</div>


# Overview of RAG 🧠✨

**Retrieval-augmented generation (RAG)** is an advanced approach in Natural Language Processing (NLP) that combines retrieval-based methods and generative models to provide more accurate and contextually relevant responses. 

### How RAG Works 🛠️

1. **Retrieval Step 🔍**: Given a query, a set of relevant documents or passages is retrieved from a knowledge base.
2. **Generation Step ✍️**: The retrieved documents are then used as additional context to generate a response.
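
The two steps can be sketched without any external service. This is a minimal illustration, not the pipeline used later in this notebook: the tiny `KNOWLEDGE_BASE`, the word-overlap ranking in `retrieve`, and the template-based `generate` (a stand-in for an LLM call) are all assumptions made for the example.

```python
import re

# Hypothetical three-document knowledge base, invented for this sketch.
KNOWLEDGE_BASE = [
    "RAG combines a retriever with a generator.",
    "Chroma is a vector store for embeddings.",
    "Rerankers reorder retrieved documents by relevance.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieval step: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """Generation step: a plain template stands in for the LLM call."""
    return f"Question: {query}\nContext: {' '.join(context)}"

query = "What does a retriever do in RAG?"
context = retrieve(query)
print(generate(query, context))
```

In a real system the overlap score would be replaced by a vector search and the template by a model call, but the data flow (query, then retrieved context, then generated answer) stays the same.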

### Advanced RAG 🚀

The advanced RAG model includes improvements in both the retrieval and generation processes to enhance the quality and relevance of the responses.

#### Advanced Retrieval Techniques 🌐

1. **Dense Passage Retrieval (DPR) 🏗️**: Uses dense vector representations of passages and queries to improve retrieval accuracy.
2. **Hybrid Retrieval 🔄**: Combines both dense and sparse retrieval methods to balance precision and recall.
3. **Context-Aware Retrieval 🧩**: Adjusts retrieval strategies based on the context of the conversation or the user's history.
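
To make hybrid retrieval concrete, here is a small sketch of one common way to blend a sparse (keyword) score with a dense (vector) score. Everything in it is an assumption for illustration: the 2-d "embeddings" are hand-made, and the weighted sum controlled by `alpha` is only one of several possible fusion rules.

```python
import math

def sparse_score(query: str, doc: str) -> float:
    """Fraction of query words that also appear in the document (keyword match)."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hybrid_score(sparse: float, dense: float, alpha: float = 0.5) -> float:
    """Weighted blend: alpha weights the dense score, (1 - alpha) the sparse one."""
    return alpha * dense + (1 - alpha) * sparse

# Hand-made 2-d "embeddings", invented purely for illustration.
query, query_vec = "car repair", [1.0, 0.0]
docs = {
    "car repair manual": [0.6, 0.8],            # exact keywords, weaker vector match
    "fixing an automobile engine": [1.0, 0.0],  # no shared keywords, strong vector match
}

for text, vec in docs.items():
    s, d = sparse_score(query, text), cosine(query_vec, vec)
    print(f"{text!r}: sparse={s:.2f} dense={d:.2f} hybrid={hybrid_score(s, d):.2f}")
```

Neither score alone ranks both documents sensibly: the keyword score misses the paraphrase, and the vector score underrates the exact match. The blend keeps both in play.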

#### Enhanced Generation ✨

1. **Fusion-in-Decoder (FiD) 🛠️**: Integrates multiple retrieved passages in the decoder to generate more informed responses.
2. **Cross-Attention Mechanisms 🎯**: Employs advanced attention mechanisms to focus on the most relevant parts of the retrieved documents.
3. **Knowledge-Aware Generation 🧠**: Incorporates structured knowledge (e.g., knowledge graphs) to ensure the generated content is factual and relevant.

### Modular RAG 🧩

The modular RAG approach breaks down the RAG architecture into distinct, interchangeable components. This modularity allows for flexibility and easy upgrades of individual components without redesigning the entire system.

#### Components of Modular RAG 🛠️

1. **Retrieval Module 🔍**: Responsible for fetching relevant documents.
   - **Indexer 📇**: Creates and maintains an index of the knowledge base.
   - **Retriever 🕵️**: Searches the index to find relevant documents.
2. **Generation Module ✍️**: Generates the final response using the retrieved documents.
   - **Encoder 🎛️**: Encodes the retrieved documents and the query.
   - **Decoder 💡**: Generates the response using the encoded information.
3. **Knowledge Base 📚**: The source of information for the retriever.
   - **Static Knowledge Base 📖**: Predefined set of documents.
   - **Dynamic Knowledge Base 🌐**: Continuously updated with new information.
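
The idea of interchangeable components can be sketched with two small interfaces. The class names below (`KeywordRetriever`, `TemplateGenerator`, `RagPipeline`) are hypothetical and exist only for this example; the point is that the pipeline depends on the interfaces, so either side can be swapped without touching the other.

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, context: list[str]) -> str: ...

class KeywordRetriever:
    """Retrieval module backed by a static, in-memory knowledge base."""
    def __init__(self, knowledge_base: list[str]) -> None:
        self.knowledge_base = knowledge_base

    def retrieve(self, query: str) -> list[str]:
        # Keep every document that shares at least one word with the query.
        words = set(query.lower().split())
        return [doc for doc in self.knowledge_base if words & set(doc.lower().split())]

class TemplateGenerator:
    """Generation module; a string template stands in for an LLM."""
    def generate(self, query: str, context: list[str]) -> str:
        return f"{query} -> based on {len(context)} document(s)"

class RagPipeline:
    """Wires any retriever to any generator; either can be replaced independently."""
    def __init__(self, retriever: Retriever, generator: Generator) -> None:
        self.retriever = retriever
        self.generator = generator

    def answer(self, query: str) -> str:
        return self.generator.generate(query, self.retriever.retrieve(query))

pipeline = RagPipeline(
    KeywordRetriever(["modular rag uses interchangeable parts", "chroma stores vectors"]),
    TemplateGenerator(),
)
print(pipeline.answer("what is modular rag"))
```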

### Diagrams 📊

#### Basic RAG Model

```plaintext
+------------+       +--------------+            +--------------+
|   Query    |  -->  |  Retriever   |    -->     |  Generator   |
+------------+       +--------------+            +--------------+
                         /      \                    /      \
                       /          \                /          \
                +----------+  +----------+  +----------+  +----------+
                | Document |  | Document |  | Document |  | Document |
                +----------+  +----------+  +----------+  +----------+
```

#### Advanced RAG Model 🚀

```plaintext
+------------+       +--------------+            +--------------+
|   Query    |  -->  |  Retriever   |    -->     |  Generator   |
+------------+       +--------------+            +--------------+
                         /      \                    /      \
                       /          \                /          \
                +----------+  +----------+  +----------+  +----------+
                | Document |  | Document |  | Document |  | Document |
                +----------+  +----------+  +----------+  +----------+
                     |             |             |             |
                +----------+  +----------+  +----------+  +----------+
                |   DPR    |  |  Hybrid  |  | Context  |  |  Fusion  |
                |Retrieval |  |Retrieval |  |Aware Ret.|  |in Decoder|
                +----------+  +----------+  +----------+  +----------+
```

#### Modular RAG Architecture 🧩

```plaintext
+------------+       +--------------+            +--------------+
|   Query    |  -->  |  Retrieval   |    -->     |  Generation  |
|            |       |    Module    |            |    Module    |
+------------+       +--------------+            +--------------+
                         /      \                    /      \
                       /          \                /          \
                +----------+  +----------+  +----------+  +----------+
                | Indexer  |  |Retriever |  | Encoder  |  | Decoder  |
                +----------+  +----------+  +----------+  +----------+
```

### Updated Information for 2024 📅

1. **Scalability 📈**: RAG models have been scaled to handle larger knowledge bases, making them more suitable for real-world applications.
2. **Adaptability 🔄**: Advances in transfer learning allow RAG models to adapt quickly to new domains with minimal additional training.
3. **Multi-Language Support 🌍**: Improved multilingual capabilities enable RAG models to support a broader range of languages.
4. **Real-Time Updates ⏱️**: Integration with dynamic knowledge bases allows RAG models to incorporate the latest information in real-time.
5. **User Personalization 👤**: Personalized retrieval and generation techniques enhance user-specific responses, improving user experience.

By combining retrieval and generation in a sophisticated, modular framework, RAG models provide powerful tools for generating accurate and contextually relevant responses, making them invaluable for various NLP applications.

In [1]:
# ! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain cohere
# # !pip install 'protobuf<=3.20.1' --force-reinstall
# !python3 -m pip install pip --upgrade
# !pip install pyopenssl --upgrade
# # !pip install pymupdf
# !pip install langchain-cohere

#### RAG examples from scratch to end

In [2]:
# import os
# import re
# import nltk
# import bs4
# from langchain import hub
# from langchain_core.output_parsers import StrOutputParser
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain_core.runnables import RunnablePassthrough
# from langchain_community.llms import Ollama
# from langchain_community.document_loaders import WebBaseLoader,ArxivLoader
# from langchain_openai import OpenAIEmbeddings
# from langchain_cohere import CohereEmbeddings
# from langchain_community.vectorstores import Chroma
# from langchain_core.documents import Document
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
# from langchain_community.vectorstores.utils import filter_complex_metadata
# from langchain.prompts import ChatPromptTemplate
# from langchain_openai import ChatOpenAI





In [3]:
# import os
# os.environ['LANGCHAIN_TRACING_V2'] = 'true'
# os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
# # os.environ['LANGCHAIN_API_KEY'] = ""

# os.environ['COHERE_API_KEY'] = ""
# # os.environ['OPENAI_API_KEY'] = ""

In [4]:
### Paid
# llm= ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# embeddings=OpenAIEmbeddings()

### Free
# from langchain.llms import GooglePalm

# api_key = ""  # set your own Google API key here; never commit a real key
# llm = GooglePalm(google_api_key=api_key, temperature=0.2)
# print(llm('what is GenAI'))
# embeddings=CohereEmbeddings()
# embeddings

In [5]:
# !pip install arxiv
# !pip install pymupdf

In [6]:
# from langchain_community.document_loaders import ArxivLoader

In [7]:
# # Post-processing
# def format_docs(docs):
#     return "\n\n".join(doc.page_content for doc in docs)


# # Step 1: Load the documents
# docs = ArxivLoader(query='2312.10997', load_max_docs=1, load_all_available_meta=True).load()
# # print(docs)
# load_docs = docs[0].page_content
# # print(load_docs)

# # Step 2: Filter complex metadata
# filtered_docs = filter_complex_metadata(docs)

# # Step 3: Split the documents
# text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300, chunk_overlap=50)
# splits = text_splitter.split_documents(filtered_docs)

# # Step 4: Embed the documents
# vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)

# # Step 5: Create the retriever
# retriever = vectorstore.as_retriever()

# # Prompt
# prompt = hub.pull("rlm/rag-prompt")

# # Chain
# rag_chain = (
#     {"context": retriever | format_docs , "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

# # Question
# rag_chain.invoke("What is Modular RAG ?")
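
Step 3 above uses `chunk_size=300` and `chunk_overlap=50`. What the overlap does is easier to see at a small scale. The sketch below is a simplified sliding window, not LangChain's actual splitting algorithm (which also looks for separators): it treats whitespace-separated words as stand-in tokens, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = "a b c d e f g h".split()
for chunk in sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each window repeats the tail of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.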

How to write prompt templates

In [8]:
# # Prompt
# template = """You are a Q&A assistant. Refer to the context provided and answer the question.
# If you don't know the answer, reply that you don't know the answer:
# {context}
# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)
# prompt

### Tweaking RAG: work on queries, work on prompts, work with rerankers, then see what you get!

#### Multiquery Perspective

In [9]:
# # Multi Query: Different Perspectives
# template = """You are an AI language model assistant.
# Your task is to generate five different versions of the given user question to retrieve relevant documents from a vector
# database.
# By generating multiple perspectives on the user question, your goal is to help
# the user overcome some of the limitations of the distance-based similarity search.
# Provide these alternative questions separated by newlines. Original question:
# {question}"""

# prompt_perspectives = ChatPromptTemplate.from_template(template)


# generate_queries = (
#     prompt_perspectives
#     | llm
#     | StrOutputParser()
#     | (lambda x: x.split("\n"))
# )

# from langchain.load import dumps, loads

# def get_unique_union(documents: list[list]):
#     """ Unique union of retrieved docs """
#     # Flatten list of lists, and convert each Document to string
#     flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
#     # Get unique documents
#     unique_docs = list(set(flattened_docs))
#     # Return
#     return [loads(doc) for doc in unique_docs]

# # Retrieve
# question = "What is Modular RAG ?"
# retrieval_chain = generate_queries | retriever.map() | get_unique_union
# docs = retrieval_chain.invoke({"question":question})
# print(docs[0].page_content)

In [10]:
# # RAG
# from operator import itemgetter

# template = """Answer the following question based on this context:
# {context}
# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)

# final_rag_chain = (
#     {"context": retrieval_chain,
#      "question": itemgetter("question")}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

# final_rag_chain.invoke({"question":question})

You can see that our answer depends on the documents returned by the multi-query retrieval chain.
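
One detail of the union step is worth seeing in isolation: several rewritten queries often return the same chunk, so the results must be deduplicated, and a plain `set` does not keep retrieval order. A minimal sketch, with plain strings standing in for serialized documents and `unique_union` as a hypothetical helper:

```python
def unique_union(result_lists: list[list[str]]) -> list[str]:
    """Flatten per-query results and drop duplicates, keeping first-seen order."""
    flattened = [doc for results in result_lists for doc in results]
    return list(dict.fromkeys(flattened))

# Hypothetical results for three rewritten queries; strings stand in for documents.
results_per_query = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk D"],
]
print(unique_union(results_per_query))
```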

#### Decomposition

In [11]:
# from langchain.prompts import ChatPromptTemplate

# # Decomposition
# template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
# The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
# Generate multiple search queries related to: {question} \n
# Output (3 queries):"""
# prompt_decomposition = ChatPromptTemplate.from_template(template)
# # Chain
# generate_queries_decomposition = ( prompt_decomposition | llm | StrOutputParser() | (lambda x: x.split("\n")))

# # Run
# question = "I dont understand RAG ,Can you help me understand what are the components and one more thing I would like to know about whether is it same as Advanced RAG ?"
# #### I gave an ambiguous query that bundles 3 questions: 1. understanding RAG 2. components of RAG 3. difference between RAG and Advanced RAG
# questions = generate_queries_decomposition.invoke({"question":question})

# questions

In [12]:
# # Prompt
# template = """Here is the question you need to answer:

# \n --- \n {question} \n --- \n

# Here is any available background question + answer pairs:

# \n --- \n {q_a_pairs} \n --- \n

# Here is additional context relevant to the question:

# \n --- \n {context} \n --- \n

# Use the above context and any background question + answer pairs to answer the question: \n {question}
# """

# decomposition_prompt = ChatPromptTemplate.from_template(template)

# from operator import itemgetter
# from langchain_core.output_parsers import StrOutputParser

# def format_qa_pair(question, answer):
#     """Format Q and A pair"""

#     formatted_string = ""
#     formatted_string += f"Question: {question}\nAnswer: {answer}\n\n"
#     return formatted_string.strip()



# q_a_pairs = ""
# for q in questions:

#     rag_chain = (
#     {"context": itemgetter("question") | retriever,
#      "question": itemgetter("question"),
#      "q_a_pairs": itemgetter("q_a_pairs")}
#     | decomposition_prompt
#     | llm
#     | StrOutputParser())

#     answer = rag_chain.invoke({"question":q,"q_a_pairs":q_a_pairs})
#     q_a_pair = format_qa_pair(q,answer)
#     q_a_pairs = q_a_pairs + "\n---\n"+  q_a_pair

# print(q_a_pairs)

It answered all the sub-questions one by one. Now it's time to go back to the original query and combine them into a single answer. Let's print it and see what happens!

In [13]:
# print(answer)

### HyDE (Hypothetical Document Embeddings)

In [14]:
# from langchain.prompts import ChatPromptTemplate

# # HyDE document generation
# template = """Please write a scientific paper passage to answer the question
# Question: {question}
# Passage:"""
# prompt_hyde = ChatPromptTemplate.from_template(template)

# from langchain_core.output_parsers import StrOutputParser

# generate_docs_for_retrieval = (
#     prompt_hyde | llm | StrOutputParser()
# )

# # Run
# generate_docs_for_retrieval.invoke({"question":question})
# # Retrieve
# retrieval_chain = generate_docs_for_retrieval | retriever
# retrieved_docs = retrieval_chain.invoke({"question":question})
# retrieved_docs

# # RAG
# template = """Answer the following question based on this context:

# {context}

# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)

# final_rag_chain = (
#     prompt
#     | llm
#     | StrOutputParser()
# )

# final_rag_chain.invoke({"context":retrieved_docs,"question":question})

### RAG Fusion

In [15]:
# from langchain.prompts import ChatPromptTemplate

# # RAG-Fusion
# template = """You are an assistant that generates multiple search queries based on a single input query. \n
# Generate multiple search queries related to: {question} \n
# Output (3 queries):
# """
# prompt_rag_fusion = ChatPromptTemplate.from_template(template)

# from langchain_core.output_parsers import StrOutputParser


# generate_queries = (
#     prompt_rag_fusion
#     | llm
#     | StrOutputParser()
#     | (lambda x: x.split("\n"))
# )

# from langchain.load import dumps, loads

# def reciprocal_rank_fusion(results: list[list], k=60):
#     """ Reciprocal_rank_fusion that takes multiple lists of ranked documents
#         and an optional parameter k used in the RRF formula """

#     # Initialize a dictionary to hold fused scores for each unique document
#     fused_scores = {}

#     # Iterate through each list of ranked documents
#     for docs in results:
#         # Iterate through each document in the list, with its rank (position in the list)
#         for rank, doc in enumerate(docs):
#             # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
#             doc_str = dumps(doc)
#             # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
#             if doc_str not in fused_scores:
#                 fused_scores[doc_str] = 0
#             # Add this list's contribution using the RRF formula: 1 / (rank + k)
#             # (rank is 0-based here because it comes from enumerate)
#             fused_scores[doc_str] += 1 / (rank + k)

#     # Sort the documents based on their fused scores in descending order to get the final reranked results
#     reranked_results = [
#         (loads(doc), score)
#         for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
#     ]

#     # Return the reranked results as a list of tuples, each containing the document and its fused score
#     return reranked_results

# question = "What is pattern in Modular RAG ?"
# retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
# docs = retrieval_chain_rag_fusion.invoke({"question": question})

# print(docs)
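
To see how the fused scores behave, here is a small worked example of the same formula on toy data. The document IDs and rankings are invented, and `rrf_scores` is a hypothetical helper that works on plain strings instead of LangChain documents.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranked list, with rank counted from 0."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return scores

# Hypothetical rankings returned for two generated queries.
rankings = [["A", "B", "C"], ["B", "C", "A"]]
fused = sorted(rrf_scores(rankings).items(), key=lambda item: item[1], reverse=True)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document B comes out on top (1/61 + 1/60) because it ranks well in both lists, while A is first in one list but last in the other (1/60 + 1/62). With `k=60` the gaps between scores are small, so appearing consistently near the top matters more than a single first place.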


In [16]:
# from operator import itemgetter
# from langchain_core.runnables import RunnablePassthrough

# # RAG
# template = """
# Answer the following question based on this context,
# If you don't find an answer, just reply with 'Answer not found'.
# context: {context}
# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)

# #llm = ChatOpenAI(temperature=0)

# final_rag_chain = (
#     {"context": retrieval_chain_rag_fusion,
#      "question": itemgetter("question")}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

# final_rag_chain.invoke({"question":question})

##### Let's look at Advanced RAG: using the Cohere reranker

In [17]:
# from langchain_community.llms import Cohere
# from langchain.retrievers import  ContextualCompressionRetriever
# from langchain.retrievers.document_compressors import CohereRerank

# retriever = vectorstore.as_retriever(search_kwargs={"k": 25})

# # Chain
# normal_rag_chain = (
#     {"context": retriever | format_docs , "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
# # Question
# normal_rag_chain.invoke("What is pattern in Modular RAG ?")

In [18]:
# # Re-rank
# top_k=5
# compressor = CohereRerank(top_n=top_k)
# compression_retriever = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=retriever
# )
# question="What is pattern in Modular RAG ?"
# compressed_docs = compression_retriever.get_relevant_documents(question)

# #### After using reranker
# reranked_rag_chain = (
#     prompt
#     | llm
#     | StrOutputParser()
# )

# print("Answer after reranking comes out to be: ")
# print(reranked_rag_chain.invoke({"context":compressed_docs,"question":question}))


# # The retrieved source documents (iterate over the reranked list instead of always printing index 0)
# print("\nRetrieved Documents:")
# for i, doc in enumerate(compressed_docs):
#     print(f"\nDocument {i+1}:")
#     print(doc.page_content)  # or doc.text depending on the document structure

#### You can see how the answer changes once you add a reranker on top of the retriever!

### Advanced RAG with the Cohere reranker: what reranking looks like

In [19]:
# ###### RAG & Applying Cohere Reranker for document extraction
# import fitz  # PyMuPDF
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# # from langchain.embeddings import OpenAIEmbeddings
# from langchain.vectorstores import FAISS
# # from langchain.llms import OpenAI
# from langchain.chains import RetrievalQA


# from langchain import hub

# # # Loads the latest version
# # prompt = hub.pull("rlm/rag-prompt", api_url="https://api.hub.langchain.com")

In [20]:
# # Path to the PDF file

# # Step 1: Load the documents
# docs = ArxivLoader(query='2312.10997', load_max_docs=1, load_all_available_meta=True).load()
# load_docs = docs[0].page_content

# # Step 2: Filter complex metadata
# data = filter_complex_metadata(docs)



# # Split
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
# all_splits = text_splitter.split_documents(data)

# # Store splits
# # from langchain.embeddings import OpenAIEmbeddings
# from langchain.vectorstores import Chroma

# # Create a vector store with Chroma
# vectorstore = Chroma.from_documents(documents=all_splits, embedding=embeddings)

# # RetrievalQA
# from langchain.chains import RetrievalQA
# from langchain.prompts import PromptTemplate
# # from langchain.llms import OpenAI


# # Set up the prompt template if needed
# prompt_template = """
# Answer the following question based on the provided context.

# {context}

# Question: {question}
# Answer:
# """

# prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

# # Create the retriever with a specified top_k value
# retriever = vectorstore.as_retriever(search_kwargs={"k": 25})  # Set top_k to 25

# # Create the QA chain with the retriever and prompt
# qa_chain = RetrievalQA.from_chain_type(
#     llm, retriever=retriever, chain_type_kwargs={"prompt": prompt}, return_source_documents=True
# )

# # Run a query and see the results along with the context documents
# query = "Explain different components of Modular RAG?"
# result = qa_chain(query)

# # The answer to the question
# print("Answer:", result['result'])

# # The retrieved source documents
# print("\nRetrieved Documents:")
# for i, doc in enumerate(result['source_documents']):
#     print(f"\nDocument {i+1}:")
#     print(doc.page_content)  # or doc.text depending on the document structure


In [21]:
# # Re-Rank them with cohere
# import cohere
# # Get your cohere API key on: www.cohere.com
# co = cohere.Client(f"{os.environ['COHERE_API_KEY']}")
# docs = [doc.page_content for doc in result['source_documents']]

# # Re-Rank them with cohere
# top_n=5
# rerank_hits = co.rerank(query=query, documents=docs, top_n=top_n, model='rerank-multilingual-v3.0')
# print(rerank_hits)
# #[doc[rerank_hits.results[i].index] for i in range(5)]

# # Each hit carries the index of the original document and its relevance score
# for rank, hit in enumerate(rerank_hits.results, start=1):
#     print(f"\nDocument {rank}:")
#     print(f"Relevance score on the basis of reranking is : {hit.relevance_score}")
#     print(docs[hit.index])


### Character Text Splitting

**Text:**
```
"The quick brown fox jumps over the lazy dog."
```

**Fixed Limit:** 10 characters

**Result:**
```
1. "The quick "
2. "brown fox "
3. "jumps over"
4. " the lazy "
5. "dog."
```
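
The result above can be reproduced with plain slicing. A minimal sketch, where `split_fixed` is a hypothetical helper name:

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive slices of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_fixed("The quick brown fox jumps over the lazy dog.", 10)
print(chunks)
# ['The quick ', 'brown fox ', 'jumps over', ' the lazy ', 'dog.']
```

The cuts ignore word boundaries entirely, which is why one chunk starts with a space and another could just as easily end mid-word.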

### Recursive Character Text Splitting

**Text:**
```
"The quick brown fox jumps over the lazy dog."
```

**Fixed Limit:** 10 characters

**Result:**
```
1. "The quick"
2. "brown fox"
3. "jumps"
4. "over"
5. "the lazy"
6. "dog."
```

### MarkdownTextSplitter

**Text:**
```
"## Introduction\n\nThis is a sample markdown text with different sections and paragraphs.\n\n### Section 1\n\nSome content in section 1.\n\n### Section 2\n\nContent in section 2."
```

**Splitting Criteria:** Section headers (`##`, `###`)

**Result:**
```
1. "## Introduction\n\nThis is a sample markdown text with different sections and paragraphs."
2. "### Section 1\n\nSome content in section 1."
3. "### Section 2\n\nContent in section 2."
```
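
A simplified sketch of header-based splitting that reproduces the result above. It is not how LangChain's `MarkdownTextSplitter` works internally (that class also enforces a chunk size); the regular expression and the helper name `split_markdown_sections` are assumptions made for this example.

```python
import re

def split_markdown_sections(text: str) -> list[str]:
    """Split at every blank line that is directly followed by a Markdown header."""
    return re.split(r"\n\n(?=#{1,6} )", text)

markdown_text = (
    "## Introduction\n\nThis is a sample markdown text with different sections and paragraphs."
    "\n\n### Section 1\n\nSome content in section 1."
    "\n\n### Section 2\n\nContent in section 2."
)
for section in split_markdown_sections(markdown_text):
    print(repr(section))
```

The lookahead `(?=#{1,6} )` keeps each header attached to the section it introduces instead of consuming it as a separator.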

### PythonCodeTextSplitter

**Text:**
```
"def calculate_sum(a, b):\n    return a + b\n\ndef calculate_product(a, b):\n    return a * b\n\nresult_sum = calculate_sum(3, 5)\nresult_product = calculate_product(3, 5)\nprint(result_sum)\nprint(result_product)"
```

**Splitting Criteria:** Function definitions (`def`)

**Result:**
```
1. "def calculate_sum(a, b):\n    return a + b\n"
2. "def calculate_product(a, b):\n    return a * b\n"
3. "result_sum = calculate_sum(3, 5)\nresult_product = calculate_product(3, 5)\nprint(result_sum)\nprint(result_product)"
```
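
One way to get the same kind of split with only the standard library is to parse the code and cut at top-level definitions. This is a sketch of the idea using `ast`, not the separator-based approach of LangChain's `PythonCodeTextSplitter`; `split_python_source` is a hypothetical helper, and its chunks match the result above apart from trailing newlines.

```python
import ast

def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class; group other consecutive statements together."""
    chunks: list[str] = []
    loose: list[str] = []  # consecutive top-level statements that are not definitions
    for node in ast.parse(source).body:
        segment = ast.get_source_segment(source, node)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if loose:
                chunks.append("\n".join(loose))
                loose = []
            chunks.append(segment)
        else:
            loose.append(segment)
    if loose:
        chunks.append("\n".join(loose))
    return chunks

source = (
    "def calculate_sum(a, b):\n    return a + b\n\n"
    "def calculate_product(a, b):\n    return a * b\n\n"
    "result_sum = calculate_sum(3, 5)\nresult_product = calculate_product(3, 5)\n"
    "print(result_sum)\nprint(result_product)"
)
for chunk in split_python_source(source):
    print(repr(chunk))
```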

### Semantic Chunking

Semantic chunking groups text based on meaning or semantic units rather than characters or specific syntax. Example scenarios could involve grouping sentences based on topics or intents, which is more complex to demonstrate in a simple example.
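
A toy version is still possible if we accept a crude similarity measure. The sketch below starts a new chunk whenever two neighbouring sentences share too few words. Real semantic chunkers typically compare sentence embeddings instead; the Jaccard word overlap, the `threshold` value, and the sample sentences are all assumptions made for illustration.

```python
import re

def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity: shared words divided by all distinct words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever a sentence is too dissimilar from the previous one."""
    chunks: list[list[str]] = []
    previous_words: set[str] = set()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if chunks and jaccard(previous_words, words) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
        previous_words = words
    return chunks

sentences = [
    "The retriever searches the index.",
    "The index stores the document vectors.",
    "Cricket fans cheer loudly.",
    "Loudly cheering fans fill cricket stadiums.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The first two sentences end up in one chunk and the two cricket sentences in another, because the topic change between sentence two and three leaves no words in common.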


Here are the short definitions:

### Character Text Splitting
Breaks text based on a fixed number of characters.

### Recursive Character Text Splitting
Breaks text respecting natural language boundaries and recursively if needed.

### MarkdownTextSplitter
Splits Markdown text based on Markdown syntax like headers.

### PythonCodeTextSplitter
Divides Python code into segments based on syntactic units such as functions.

### Semantic Chunking
Groups text based on meaning or semantic units rather than specific syntax or characters.

### Splitting Strategies

In [22]:
# # 1. Character Text Splitting
# print("#### Character Text Splitting ####")

# text = """In 2024, Pakistan will be at the epicenter of global attention, hosting two momentous events:
#  the highly anticipated general elections and the ICC Cricket World Cup. The general elections will witness millions
#  of eligible voters participating in a democratic exercise of unprecedented scale. Political parties are actively engaging their
#   supporters with campaigns focused on pressing issues such as economic development, social equality, and national security.
#   Concurrently, the nation will be swept up in cricket euphoria as teams from across the globe vie for supremacy in the Cricket World Cup.
#   Cricket stadiums will reverberate with the enthusiastic cheers of fans, and the cricket pitches will witness thrilling displays of talent and sportsmanship.
#   Amidst the fervor of political rallies and cricket matches, these parallel
#  events will highlight Pakistan's unique national spirit—rooted in a steadfast commitment to democracy and an unwavering passion for cricket."""
# # Manual Splitting
# chunks = []
# chunk_size = 35 # Characters
# for i in range(0, len(text), chunk_size):
#     chunk = text[i:i + chunk_size]
#     chunks.append(chunk)
# documents = [Document(page_content=chunk, metadata={"source": "local"}) for chunk in chunks]
# print(documents)

# # Automatic Text Splitting
# from langchain.text_splitter import CharacterTextSplitter
# text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)
# documents = text_splitter.create_documents([text])
# print(documents)

In [23]:
# # 2. Recursive Character Text Splitting
# print("#### Recursive Character Text Splitting ####")

# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0) # default separators: ["\n\n", "\n", " ", ""]; try chunk_size 65 or 450
# print(text_splitter.create_documents([text]))


In [24]:
# # 3. Document Specific Splitting
# print("#### Document Specific Splitting ####")

# # Document Specific Splitting - Markdown
# from langchain.text_splitter import MarkdownTextSplitter
# splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)
# markdown_text = text
# print(splitter.create_documents([markdown_text]))

# # Document Specific Splitting - Python
# from langchain.text_splitter import PythonCodeTextSplitter
# python_text = """
# class Person:
#   def __init__(self, name, age):
#     self.name = name
#     self.age = age

# p1 = Person("John", 36)

# for i in range(10):
#     print (i)
# """
# python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
# print(python_splitter.create_documents([python_text]))

# # Document Specific Splitting - Javascript
# from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
# javascript_text = """
# // Function is called, the return value will end up in x
# let x = myFunction(4, 3);

# function myFunction(a, b) {
# // Function returns the product of a and b
#   return a * b;
# }
# """
# js_splitter = RecursiveCharacterTextSplitter.from_language(
#     language=Language.JS, chunk_size=65, chunk_overlap=0
# )
# print(js_splitter.create_documents([javascript_text]))

