<a href="https://colab.research.google.com/github/Adhira-Deogade/RAG-chat-with-my-blog/blob/main/langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Loading

## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

In [None]:
print("Hello RAG")

Hello RAG


In [None]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
!pip install langchain
!pip install python-dotenv openai
!pip install pypdf
!pip install yt_dlp
!pip install pydub
!pip install -U langchain-community
!pip install tiktoken
!pip install chromadb




In [None]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv


In [None]:
env_loaded = load_dotenv(find_dotenv('vars.env'))  # read the local vars.env file
print(env_loaded)  # True when at least one variable was set
openai.api_key = os.environ['OPENAI_API_KEY']

True


## URLs

In [None]:
#from langchain.document_loaders import WebBaseLoader
#url = "https://0ma.in/wipp/doc.pdf"
#oldurl = "https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md"
#loader = WebBaseLoader(url)

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("DBT.pdf")
pages = loader.load()



> Note: the URL sent to the WebBaseLoader differs from the one shown in the video because it was updated for 2024.

In [None]:
docs = pages

In [None]:

print(len(docs))
# print(docs[0].page_content)  # docs is a list, so index a page before reading its text

213


In [None]:
# Document splitting

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
recursive_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
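    length_function=len,
    # Try paragraph breaks first, then line breaks, sentence ends, spaces, and single characters
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,  # treat the separators as regexes so the lookbehind works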
)


In [None]:
op = recursive_text_splitter.split_documents(docs)
print(len(op))

516
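To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries the separators in order). `chunk_with_overlap` is a hypothetical helper written only for illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.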


In [None]:
for i in range(5):
  print(f"i = {i}\n{op[i].page_content}")
  print('-----')

i = 0
Blended-Care DBT 
 Program Materials 
 All materials were taken directly from  DBT Skills Training Handouts 
 and Worksheets  by Marsha M. Linehan (2005), Second Edition 
 For individual client use only
-----
i = 1
Welcome!  We’re really glad to have you here! 
 How to use this packet: 
 This packet includes all the handouts and worksheets you will 
 need to participate in Lyra’s BC-DBT Program. Handouts contain 
 information that will correspond to the skills being taught each 
 week. Worksheets refer to your homework pages. 
 In addition, each week you will have Digital Lessons and Activities 
 assigned through your Lyra Portal to continue to support your 
 practice! 
 If you have any questions, please feel free to reach out to your group 
 leaders or your individual therapist!
-----
i = 2
Pretreatment and Orientation 
 Mindfulness Week 1 
 Homework Mindfulness Week 1 
 Mindfulness Week 2 
 Homework Mindfulness Week 2 
 Distress Tolerance Week 1 
 Homework Distress Tolerance We

In [None]:
# Let's try a token text splitter, which splits text on token count
# This is useful because LLM context windows are usually measured in tokens
# A token is often around 4 characters long

In [None]:
from langchain.text_splitter import TokenTextSplitter

In [None]:
token_text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=10)
op_token = token_text_splitter.split_documents(docs)
for i in range(5):
  print(f"i = {i}\n{len(op_token[i].page_content)}")
  print('-----')

i = 0
202
-----
i = 1
405
-----
i = 2
230
-----
i = 3
360
-----
i = 4
368
-----
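As a rough check of the "about 4 characters per token" rule of thumb, the sketch below divides the character counts printed above by the configured `chunk_size` of 100 tokens. It assumes every chunk holds a full 100 tokens; a chunk that ends a page may hold fewer, so these ratios are lower bounds rather than measurements.

```python
chunk_size_tokens = 100
char_lengths = [202, 405, 230, 360, 368]  # character counts printed for the first five chunks

# Characters per token for each chunk, assuming each chunk is a full 100 tokens
chars_per_token = [n / chunk_size_tokens for n in char_lengths]
average = sum(char_lengths) / (chunk_size_tokens * len(char_lengths))

print(chars_per_token)
print(round(average, 2))
```

The ratios fall between roughly 2 and 4, which is in line with the rule of thumb for the longer chunks.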


In [None]:
# Context-aware splitting
# Split a Markdown file on its headers and store each header in the chunk's metadata field
# That metadata is carried along with every chunk produced from the section
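The notebook does not run this step, so here is a minimal pure-Python sketch of the idea under stated assumptions. LangChain ships a `MarkdownHeaderTextSplitter` for this purpose; the code below is not that class, and `split_markdown_by_headers` and the `"Header N"` metadata keys are hypothetical names chosen for illustration. It also ignores the case where a `#` line sits inside a fenced code block.

```python
import re


def split_markdown_by_headers(markdown_text):
    """Return one {'content', 'metadata'} dict per header-delimited section."""
    header_pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    current_headers = {}  # header level -> title currently in effect
    sections = []
    buffer = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(current_headers.items())}
            sections.append({"content": content, "metadata": metadata})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = header_pattern.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for stale in [k for k in current_headers if k >= level]:
                del current_headers[stale]
            current_headers[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


sample = """# Title

Intro text.

## Chapter 1

Hi, this is Jim.

## Chapter 2

Hi, this is Molly.
"""

sections = split_markdown_by_headers(sample)
for section in sections:
    print(section["metadata"], "->", section["content"])
```

Because each chunk carries the headers it sits under, a retriever can later filter or rank by section instead of relying on the chunk text alone.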

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()



In [None]:
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any

# Embed every chunk and index it in an in-memory Chroma collection
vectordb = Chroma.from_documents(
    documents=op_token,
    embedding=embeddings,
    # persist_directory=persist_directory,  # uncomment to store the index on disk
)
print(vectordb._collection.count())  # number of chunks stored

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
store = vectordb.get()  # fetch the collection once; it is a dict with keys such as 'ids' and 'documents'
print(type(store))
for key, value in store.items():
    print(key)
    print(value)

In [None]:
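question = "Painful moments"
results = vectordb.similarity_search(question, k=3)  # the 3 chunks most similar to the question
print(len(results))
for i, doc in enumerate(results):
    print(f"i = {i}\n{doc.page_content}")
# persist() requires a store created with persist_directory, which is commented out above
# vectordb.persist()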

In [None]:
# What are the edge cases?
# 1. Duplicate chunks: uploading the same document more than once stores its text repeatedly
# 2. Structural queries are not understood, e.g. "tell me about the 5th chapter",
#    whose answer lives in the 5th document
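One possible way to guard against the first edge case is to drop exact duplicates before indexing. The sketch below hashes each chunk's text and keeps only the first occurrence. `deduplicate_chunks` is a hypothetical helper, the sample strings are made up, and the approach only catches byte-identical text, not near-duplicates.

```python
import hashlib


def deduplicate_chunks(chunks):
    """Keep the first occurrence of each chunk, comparing by SHA-256 of its text."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


# Illustrative strings only; in the notebook these would be the page_content values
chunks = ["Mindfulness Week 1", "Distress Tolerance Week 1", "Mindfulness Week 1"]
unique_chunks = deduplicate_chunks(chunks)
print(unique_chunks)
```

The same digests could also serve as stable chunk identifiers, so that re-uploading a document overwrites existing entries instead of adding new ones.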

In [None]:
# Retrieval is important at query time
# When the query comes in, we want to get the most relevant results
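To illustrate what "most relevant" can mean at query time, here is a minimal sketch of top-k retrieval by cosine similarity. The three-dimensional vectors are invented for the example; real embeddings come from a model and have many more dimensions, and a vector store may use a different distance metric. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=3):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed_vectors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (chunk text, made-up embedding)
toy_index = [
    ("chunk about mindfulness", [0.9, 0.1, 0.0]),
    ("chunk about distress tolerance", [0.1, 0.9, 0.1]),
    ("chunk about homework", [0.0, 0.2, 0.9]),
]
query_vector = [0.2, 0.8, 0.0]  # made-up embedding of the question

results = top_k(query_vector, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the second dimension, so the chunk whose vector does the same ranks first.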