<a href="https://colab.research.google.com/github/Rami-RK/Retrieval_Augmented_Generation_RAG/blob/main/RAG_Step_1_Document_Loaders_and_Document_Splitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **RAG: Step 1 - Document Loaders and Document Splitting**

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs or a set of videos).

### **Objectives**

By the end of this notebook you will be able to understand and use:

1. the different document loaders available in LangChain
2. document splitting

These are the first steps of a RAG pipeline.

### **Installing and importing packages**

In [None]:
!pip install openai
!pip install langchain
!pip install pypdf

In [None]:
import os
import openai

#### **Authentication for OpenAI API**

In [None]:
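# Read the key with a context manager so the file is closed afterwards,
# and strip the trailing newline that text files usually end with.
with open('/content/openapi_key.txt') as f:
    api_key = f.read().strip()
type(api_key)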

str

In [None]:
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv('OPENAI_API_KEY')

### **Loading the PDFs**

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/Doc 1.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [None]:
len(pages)

1

In [None]:
page = pages[0]
page

Document(page_content="India, officially known as the Republic of India, is a diverse and vibrant country located in South\nAsia. With a rich history spanning thousands of years, India is known for its cultural heritage, \nreligious diversity, and vast landscapes. From the majestic Himalayas in the north to the serene\nbackwaters of Kerala in the south, India encompasses a wide range of geographical features, \nincluding deserts, plains, mountains, and coastlines, making it a land of incredible natural \nbeauty.\nIndia is the seventh-largest country by land area and the second-most populous country in the \nworld, with a population exceeding 1.3 billion people. It is a federal parliamentary democratic \nrepublic, with a president as the head of state and a prime minister as the head of government. \nThe country follows a multi-tiered administrative structure, with 28 states and 9 union territories,\neach having its own elected government.\nIndia has a rich cultural heritage that has ev

In [None]:
print(page.page_content[0:500])

India, officially known as the Republic of India, is a diverse and vibrant country located in South
Asia. With a rich history spanning thousands of years, India is known for its cultural heritage, 
religious diversity, and vast landscapes. From the majestic Himalayas in the north to the serene
backwaters of Kerala in the south, India encompasses a wide range of geographical features, 
including deserts, plains, mountains, and coastlines, making it a land of incredible natural 
beauty.
India is t


In [None]:
page.metadata

{'source': '/content/Doc 1.pdf', 'page': 0}

#### **Document loader for YouTube & OpenAI Whisper**

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
! pip install yt_dlp
! pip install pydub

In [None]:
url="https://www.youtube.com/shorts/5xp0taGM3Kg"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/shorts/5xp0taGM3Kg
[youtube] 5xp0taGM3Kg: Downloading webpage
[youtube] 5xp0taGM3Kg: Downloading ios player API JSON
[youtube] 5xp0taGM3Kg: Downloading android player API JSON
[youtube] 5xp0taGM3Kg: Downloading m3u8 information
[info] 5xp0taGM3Kg: Downloading 1 format(s): 140
[download] Destination: docs/youtube//Andrew Ng's Secret to Mastering Machine Learning - Part 1 #shorts.m4a
[download] 100% of  757.83KiB in 00:00:00 at 2.71MiB/s   
[FixupM4a] Correcting container of "docs/youtube//Andrew Ng's Secret to Mastering Machine Learning - Part 1 #shorts.m4a"
[ExtractAudio] Not converting audio docs/youtube//Andrew Ng's Secret to Mastering Machine Learning - Part 1 #shorts.m4a; file is already in target format m4a
Transcribing part 1!


In [None]:
docs[0].page_content[0:500]

"What would you recommend? How do they go about day-to-day, sort of specific advice about learning in the world of deep learning, machine learning? Getting the habit of learning is key, and that means regularity. And for myself, I've picked up a habit of spending some time every Saturday and every Sunday reading or studying. And so I don't wake up on a Saturday and have to make a decision. Do I feel like reading or studying today or not? It's just what I do. And the fact it's a habit makes it eas"

### **Document Splitting**

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap)

c_splitter = CharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap)

**Recursive splitter**

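Why doesn't this split the string below? The string is 26 characters long, which fits within `chunk_size=26`, so it comes back as a single chunk.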

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [None]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
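The two results above are consistent with a window of at most `chunk_size` characters that moves forward by `chunk_size - chunk_overlap` characters each time. The sketch below is a plain-Python illustration of that idea for text without separators. It is not LangChain's implementation, and `sliding_chunks` is a hypothetical helper introduced only for this example.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_chunks('abcdefghijklmnopqrstuvwxyz', 26, 4))
# ['abcdefghijklmnopqrstuvwxyz']
print(sliding_chunks('abcdefghijklmnopqrstuvwxyzabcdefg', 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts at index 22 (`26 - 4`), which is why it begins with `wxyz`, the last four characters of the first chunk.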

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [None]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

**Character splitter**

In [None]:
c_splitter.split_text(text3)
# No split happens here: CharacterTextSplitter splits on a single separator,
# which defaults to "\n\n", and text3 does not contain one.

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [None]:
c_splitter = CharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap,separator = ' ')
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
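When the text does contain the separator, both splitters return the same three chunks for `text3`. One way to picture this is: split on the separator, then merge the pieces back together until the next piece would exceed `chunk_size`, carrying over at most `chunk_overlap` characters into the next chunk. The following is a simplified sketch under that assumption, not LangChain's actual code; `merge_words` is a hypothetical name.

```python
def merge_words(words, chunk_size, chunk_overlap, sep=" "):
    """Greedily merge `words` into chunks of at most `chunk_size` characters,
    keeping up to `chunk_overlap` trailing characters as overlap (sketch)."""
    chunks = []
    current = []
    for word in words:
        candidate = sep.join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(sep.join(current))
            # Drop words from the front until what remains fits the overlap budget.
            while current and len(sep.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(sep.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_words(text3.split(" "), chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

The overlap is `l m` (3 characters) rather than 4 characters because pieces are kept whole: adding `k` would give `k l m`, which is 5 characters and exceeds the overlap of 4.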

### **Recursive splitting details**

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

496

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [None]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [None]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
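The recursive splitter produced a cleaner result because it tries the separators in order: it first splits on `"\n\n"`, and only falls back to the next separator for pieces that are still too long. The sketch below illustrates that descent in plain Python. It is a simplified assumption of how such a splitter can work (no overlap, no whitespace stripping between chunks), and `recursive_split` is a hypothetical helper, not the library function.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` using the first separator, merging pieces up to
    `chunk_size`, and recurse with the remaining separators on any piece
    that is still too long (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to split on; return the long piece as-is
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: try the finer-grained separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, ["\n\n", " ", ""], chunk_size=7))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in one chunk and is kept whole. Only the second paragraph, which is too long, is split further on spaces.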

Let's reduce the chunk size a bit and add a period to our separators:

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
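The two separator lists above differ only in how the period is expressed: `\. ` matches the period and the space, while the lookbehind `(?<=\. )` matches the empty position right after them. The difference is easiest to see with Python's `re` module directly, as in the sketch below. Whether LangChain interprets separators as regular expressions depends on the installed version (newer releases expose an `is_separator_regex` flag). The identical outputs above suggest that in this run the period patterns were not matched as regex, so treat this as an illustration of the patterns themselves.

```python
import re

sentence_text = "One. Two. Three."

# Plain pattern: the matched ". " is consumed, so periods are lost.
print(re.split(r"\. ", sentence_text))
# ['One', 'Two', 'Three.']

# Lookbehind: splits at the position after ". ", so each sentence keeps its period.
print(re.split(r"(?<=\. )", sentence_text))
# ['One. ', 'Two. ', 'Three.']
```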

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/Doc 1.pdf")
pages = loader.load()

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
len(docs)

4

In [None]:
docs[0].page_content

'India, officially known as the Republic of India, is a diverse and vibrant country located in South\nAsia. With a rich history spanning thousands of years, India is known for its cultural heritage, \nreligious diversity, and vast landscapes. From the majestic Himalayas in the north to the serene\nbackwaters of Kerala in the south, India encompasses a wide range of geographical features, \nincluding deserts, plains, mountains, and coastlines, making it a land of incredible natural \nbeauty.\nIndia is the seventh-largest country by land area and the second-most populous country in the \nworld, with a population exceeding 1.3 billion people. It is a federal parliamentary democratic \nrepublic, with a president as the head of state and a prime minister as the head of government. \nThe country follows a multi-tiered administrative structure, with 28 states and 9 union territories,\neach having its own elected government.'

In [None]:
docs[1].page_content

"The country follows a multi-tiered administrative structure, with 28 states and 9 union territories,\neach having its own elected government.\nIndia has a rich cultural heritage that has evolved over thousands of years. It is home to various\nreligions, including Hinduism, Islam, Christianity, Sikhism, Buddhism, and Jainism, among \nothers. These religions coexist harmoniously, contributing to India's multicultural fabric. \nFestivals like Diwali, Eid, Christmas, and Holi are celebrated with great enthusiasm and bring \npeople from different communities together.\nThe history of India is characterized by ancient civilizations, invasions, and the establishment of\npowerful empires. The Indus Valley Civilization, one of the world's oldest urban civilizations, \nflourished in the northwestern part of the Indian subcontinent around 2500 BCE. Over the \ncenturies, India witnessed the rise and fall of several dynasties, including the Maurya, Gupta,"

In [None]:
docs[2].page_content

"centuries, India witnessed the rise and fall of several dynasties, including the Maurya, Gupta, \nand Mughal empires. The Mughal period, in particular, left a lasting impact on Indian culture, \nart, and architecture.\nIndia's struggle for independence from British colonial rule is a significant chapter in its history. \nLed by Mahatma Gandhi and other freedom fighters, the non-violent resistance movement \ngained momentum and eventually led to India's independence on August 15, 1947. This day is \ncelebrated annually as Independence Day.\nIndia's economy is one of the fastest-growing in the world. It has transitioned from an agrarian \neconomy to a service-oriented and industrialized economy. The country is known for its \nsoftware and information technology services, pharmaceuticals, textiles, agriculture, and \nmanufacturing sectors. Major cities like Mumbai, Delhi, Bangalore, and Chennai are hubs of \nbusiness and commerce, attracting investments and fostering innovation."
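The effect of `chunk_overlap=150` is visible in the outputs above: `docs[1]` begins with the same lines that end `docs[0]`. The small sketch below measures that shared text using excerpts copied from those two outputs. `shared_overlap` is a hypothetical helper written for this illustration.

```python
def shared_overlap(left, right):
    """Return the longest prefix of `right` that is also a suffix of `left`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Tail of docs[0] and head of docs[1], copied from the outputs above.
tail_of_first = (
    "head of government. \n"
    "The country follows a multi-tiered administrative structure, "
    "with 28 states and 9 union territories,\n"
    "each having its own elected government."
)
head_of_second = (
    "The country follows a multi-tiered administrative structure, "
    "with 28 states and 9 union territories,\n"
    "each having its own elected government.\n"
    "India has a rich cultural heritage"
)

overlap = shared_overlap(tail_of_first, head_of_second)
print(repr(overlap))
print(len(overlap) <= 150)  # the overlap stays within chunk_overlap
```

Because the splitter works on whole lines (`separator="\n"`), the overlap consists of complete lines and can be shorter than the 150-character limit.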

In [None]:
len(pages)

1

### **Token splitting**

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

A token is often around 4 characters of English text.
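That rule of thumb allows a quick back-of-the-envelope conversion between character-based chunk sizes and token budgets. The sketch below assumes exactly 4 characters per token, which is only an approximation; real counts depend on the tokenizer and the text. `estimate_tokens` is a hypothetical helper.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (approximation only)."""
    return math.ceil(len(text) / chars_per_token)


# A 1000-character chunk is roughly 250 tokens under this assumption.
print(estimate_tokens("x" * 1000))
# 250
```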

In [None]:
pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 10.1 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.5.1


In [None]:
import tiktoken

In [None]:
from langchain.text_splitter import TokenTextSplitter

In [None]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [None]:
text1 = "foo bar bazzyfoo"

In [None]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
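With `chunk_size=1`, each chunk is a single token, so the output above shows how the tokenizer cut the string. Tokens do not line up with words, and leading spaces belong to the token that follows them. The sketch below takes that token list (copied from the output above, not recomputed with `tiktoken`) and groups it into larger chunks, to illustrate what a bigger `chunk_size` means. `group_tokens` is a hypothetical helper.

```python
def group_tokens(tokens, chunk_size):
    """Join consecutive tokens into chunks of `chunk_size` tokens (no overlap)."""
    return [
        "".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# Token strings copied from the TokenTextSplitter output above.
tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

print("".join(tokens) == "foo bar bazzyfoo")  # the tokens reassemble the input
# True
print(group_tokens(tokens, chunk_size=2))
# ['foo bar', ' baz', 'zyfoo']
```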

In [None]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
docs[3]

Document(page_content='\nlearning class. So what I wanna do today', metadata={'source': '/content/MachineLearning-Lecture01.pdf', 'page': 0})

In [None]:
pages[0].metadata

{'source': '/content/MachineLearning-Lecture01.pdf', 'page': 0}
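In the output above, the chunk's metadata matches the metadata of `pages[0]`: when documents are split, each chunk keeps a copy of its parent's metadata. The sketch below mimics that behaviour with a stand-in class. `SimpleDocument` and this `split_documents` function are hypothetical and only mirror the `page_content` and `metadata` attributes described earlier, not LangChain's classes.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Stand-in with the two attributes a Document exposes in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(documents, chunk_size):
    """Cut each document into fixed-size pieces, copying the parent's metadata."""
    chunks = []
    for doc in documents:
        text = doc.page_content
        for start in range(0, len(text), chunk_size):
            chunks.append(
                SimpleDocument(text[start:start + chunk_size], dict(doc.metadata))
            )
    return chunks


# The file name below is a placeholder for illustration.
page = SimpleDocument("abcdefghij", {"source": "example.pdf", "page": 0})
for chunk in split_documents([page], chunk_size=4):
    print(chunk.page_content, chunk.metadata)
```

Keeping the source and page number on every chunk makes it possible to trace a retrieved chunk back to the document it came from.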

### **Context-aware splitting**

Chunking aims to keep text with common context together.

A text splitter often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown files) have structure (headers) that can be used explicitly when splitting.
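To make the idea concrete, here is a minimal plain-Python sketch of header-aware splitting: the text is cut at Markdown headers, and each chunk records the headers it sits under. This is an illustrative assumption of how such a splitter can behave, not the library's implementation; `split_by_headers` and the sample text are hypothetical.

```python
def split_by_headers(markdown, levels=("#", "##")):
    """Split Markdown at header lines and attach the active headers
    to each chunk as metadata (illustrative sketch)."""
    chunks = []
    current_headers = {}
    buffer = []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"content": body, "metadata": dict(current_headers)})
        buffer.clear()

    for line in markdown.splitlines():
        marker, _, title = line.partition(" ")
        if marker in levels and title:
            flush()
            # A new header replaces headers at the same or a deeper level.
            for key in [k for k in current_headers if len(k) >= len(marker)]:
                del current_headers[key]
            current_headers[marker] = title.strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample_md = "# Title\n\nIntro text.\n\n## Section A\n\nBody A.\n\n## Section B\n\nBody B."
for chunk in split_by_headers(sample_md):
    print(chunk)
```

Each chunk carries the headers it belongs to, so a retrieved chunk can be read in the context of its section.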

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.