## Part 1: Document Loading

Types of documents covered:
- PDFs
- YouTube videos
- Websites

Import and standardise them so that we obtain:
- Content
- Metadata

In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

### PDF

Each page is a Document. A Document contains text (page_content) and metadata.
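To make the structure concrete, here is a minimal sketch of what a loaded page looks like. The `Document` dataclass below is a simplified stand-in written for illustration (an assumption, not LangChain's actual class), and the file name and page texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for a LangChain Document (illustrative only)."""
    page_content: str                              # the text of one page
    metadata: dict = field(default_factory=dict)   # e.g. source file and page number


# Hypothetical result of loading a two-page PDF: one Document per page
pages = [
    Document("Text of the first page", {"source": "example.pdf", "page": 0}),
    Document("Text of the second page", {"source": "example.pdf", "page": 1}),
]

first = pages[0]
preview = first.page_content[:500]   # first 500 characters of the page
page_number = first.metadata["page"]
```

Every loader in this part produces the same shape (content plus metadata), which is what makes the later splitting and embedding steps loader-agnostic.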

In [None]:
# pip install pypdf 

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/Deep_Learning_A4.pdf")
pages = loader.load()

In [None]:
page = pages[0]
print(page.page_content[0:500])  # first 500 characters of the page
print(page.metadata)  # source file and page number

### YouTube (broken, to be fixed)
- Issue: yt_dlp can't find the ffmpeg binaries, even though ffmpeg is installed on the local device. Cause not yet identified.
- Tutorial used: https://www.youtube.com/watch?v=IECI72XEox0

In [None]:
# ! pip install yt_dlp
# ! pip install pydub

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser # transcribes the downloaded YouTube audio to text (OpenAI Whisper via LangChain)
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
# Youtube URL (video: Josh Angrist: What's the Difference Between Econometrics and Data Science? - 2 min)
url = "https://www.youtube.com/watch?v=2EhRT2mOXm8&t=2s"

# Directory where to save audio
save_dir = "C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/"


In [None]:
# DOESN'T WORK YET - yt_dlp fails to find ffmpeg, even though it is installed

# Note: may take a while & will give error if the content is already present/downloaded in file directory
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

docs = loader.load()

In [None]:
doc = docs[0]
print(doc.page_content[0:500])  # first 500 characters of the transcript
print(doc.metadata)

### URLs

Note: the extracted text is probably poorly formatted, so we should post-process it for readability.

In [None]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://richie-lee.github.io/post/2021_uplift/")

In [None]:
docs = loader.load()

In [None]:
doc = docs[0]
doc.page_content[:1500]
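One possible post-processing step, sketched under the assumption that the main problem is leftover whitespace from the HTML layout. The helper name `clean_page_text` and the sample string are made up for illustration; real pages may need more (e.g. removing navigation text).

```python
import re


def clean_page_text(text: str) -> str:
    """Collapse whitespace runs that typically remain after HTML extraction."""
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> a single space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # three or more newlines -> one blank line
    return text.strip()


# Hypothetical raw text, similar in shape to what a web loader may return
raw = "\n\n\n  Uplift   modelling \n\n\n\n\t A short   intro \n"
cleaned = clean_page_text(raw)
```

In the notebook this would be applied as `clean_page_text(doc.page_content)` before splitting.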

---
## Part 2: Document splitting

Happens *after* loading the data and *before* feeding it into the vector store.

Fundamental concept: split the text into chunks of a given size, with some overlap between consecutive chunks. The overlap helps ensure that no information is lost at the chunk boundaries.
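A minimal pure-Python sketch of chunk size and overlap. The function name `chunk_with_overlap` is made up for illustration; LangChain's splitters also respect separators, so this shows the idea rather than their implementation.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size chunks where neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk already reached the end
            break
    return chunks


small = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
alphabet = chunk_with_overlap("abcdefghijklmnopqrstuvwxyzabcdefg", chunk_size=26, chunk_overlap=4)
```

In `small`, each chunk starts with the last character of the previous one, so text that straddles a boundary appears in both chunks.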

Types of splitting:
- **CharacterTextSplitter():** based on characters
- **MarkdownHeaderTextSplitter():** based on MD headers
- **TokenTextSplitter():** based on tokens
- **RecursiveCharacterTextSplitter():** recursively tries to split by different characters to see what works
- **Language():** language-specific separators for Python, Ruby, Markdown, ...
- **NLTKTextSplitter():** based on sentences, using NLTK (Natural Language Toolkit)
- **SpacyTextSplitter():** based on sentences, using spaCy




In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

Intuitive examples of (Recursive) character text splitters:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
# initialise two different text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
# n = 26 (equal to chunk size): returned as a single chunk
text1 = 'abcdefghijklmnopqrstuvwxyz'
print(r_splitter.split_text(text1))

# n = 33 > 26 (chunk size): split into two chunks that share 4 characters
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
print(r_splitter.split_text(text2))

Issue with the character text splitter: it splits on a single separator, by default a double newline (`"\n\n"`), and the text below doesn't contain any.

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(r_splitter.split_text(text3)) # recursive character text splitting
print(c_splitter.split_text(text3)) # character text splitting: no split, because the default separator ("\n\n") does not occur in the text

In [None]:
# Note - with a space as separator, the character splitter gives the same result as the recursive splitter for this text
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

Recursive splitting details:

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""] # default separators, shown explicitly for illustration - they are tried from left to right, recursively
)

In [None]:
# Only splits on spaces
c_splitter.split_text(some_text)

In [None]:
# Splits on "\n\n" first and only falls back to the next separators when a piece is still too long - better quality, since paragraphs stay together where possible
r_splitter.split_text(some_text)
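The separator hierarchy can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores overlap, and it drops the separators it splits on between chunks.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the next ones for pieces that are too long."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate              # piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)           # close the chunk built so far
        if len(piece) <= chunk_size:
            current = piece                  # start a new chunk with this piece
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
result = recursive_split(sample, ["\n\n", " "], chunk_size=8)
```

The first paragraph fits within 8 characters and is kept whole; only the second, longer paragraph is split further on spaces.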

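For periods, define a regex with a lookbehind for better results: "(?<=\. )". The lookbehind splits *after* the period and space, so each period stays attached to its own sentence.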

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
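    # Lookbehind keeps each period with its sentence; a plain "\. " separator would misplace it.
    # Separators are treated as regex here (newer LangChain versions may need is_separator_regex=True).
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]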
)
r_splitter.split_text(some_text)
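The effect of the lookbehind can be checked with the standard `re` module alone, independent of LangChain. The sample sentence is made up; splitting on an empty (zero-width) match requires Python 3.7 or later.

```python
import re

sentence_text = "One. Two. Three."

# Plain pattern: the matched ". " is consumed, so the periods disappear
plain = re.split(r"\. ", sentence_text)

# Lookbehind: matches the empty position right after ". ", so nothing is consumed
lookbehind = re.split(r"(?<=\. )", sentence_text)
```

Here `plain` is `['One', 'Two', 'Three.']`, while `lookbehind` is `['One. ', 'Two. ', 'Three.']`: every sentence keeps its period.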

Try with real example:

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/Deep_Learning_A4.pdf")
pages = loader.load()

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
# To illustrate difference it may make
print(len(docs), len(pages))

**Token splitting:** LLM context windows are measured in tokens (one token is roughly 4 characters), so splitting on tokens keeps chunk sizes aligned with model limits.
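As a rough worked example of the 4-characters rule of thumb: the helpers below are hypothetical and only estimate token counts; an exact count requires the model's actual tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic token estimate; not a substitute for a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_window: int) -> bool:
    """Check whether the estimated token count stays within a context window."""
    return estimate_tokens(text) <= context_window


estimate = estimate_tokens("foo bar bazzyfoo")        # 16 characters
chunk_ok = fits_context("x" * 1000, context_window=300)  # about 250 tokens
```

The estimate is only useful for quick sizing; a token-based splitter counts real tokens instead.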

In [None]:
from langchain.text_splitter import TokenTextSplitter

In [None]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [None]:
text1 = "foo bar bazzyfoo"

In [None]:
text_splitter.split_text(text1)

In [None]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
# Note: each chunk keeps the metadata of the page it came from (which is good).
print(docs[0])
print(pages[0].metadata)

**Context-aware splitting:** adds metadata to the text chunks

- chunks aim to keep text with common context together
- text splitting often uses sentences or other delimiters to keep related text together, but some documents have an explicit structure that can be used instead (e.g. Markdown headers); the headers then become metadata
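How headers can become metadata, as a plain-Python sketch. The function `split_on_headers` is hypothetical and much simpler than `MarkdownHeaderTextSplitter` (for instance, it does not handle code fences inside the Markdown); it only illustrates the bookkeeping.

```python
def split_on_headers(markdown: str, headers_to_split_on: list[tuple[str, str]]) -> list[dict]:
    """Group text under its Markdown headers and record those headers as metadata."""
    prefix_for = {name: prefix for prefix, name in headers_to_split_on}
    # Check "###" before "##" before "#", so the longest matching prefix wins
    longest_first = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)

    chunks, metadata, buffer = [], {}, []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        match = next(((p, n) for p, n in longest_first if line.startswith(p + " ")), None)
        if match is None:
            if line:
                buffer.append(line)  # ordinary text belongs to the current section
            continue
        if buffer:  # a new header closes the current chunk
            chunks.append({"page_content": "\n".join(buffer), "metadata": dict(metadata)})
            buffer = []
        prefix, name = match
        # Keep only headers of a higher level, then record the new one
        metadata = {n: t for n, t in metadata.items() if len(prefix_for[n]) < len(prefix)}
        metadata[name] = line[len(prefix):].strip()
    if buffer:
        chunks.append({"page_content": "\n".join(buffer), "metadata": dict(metadata)})
    return chunks


example_md = "# Title\n\n## Chapter 1\n\nHi this is Jim\n\n### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
header_chunks = split_on_headers(
    example_md, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
```

Each chunk carries the headers it sits under, so a retriever can later filter or rank on, say, `Header 2` without the header text having to appear in the chunk itself.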

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [None]:
print(len(md_header_splits)) # number chunks

print(md_header_splits[0]) # first chunk

---
## Part 3: Vectorstores & Embeddings
Retrieval augmented generation workflow:

Documents => smaller splits => embedding => store in vectorstore 

- Embedding: numerical representation of text (similar embeddings = similar texts)
- Vectorstore: database that stores the embeddings and looks up the ones most similar to a query



In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A5.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [None]:
splits = text_splitter.split_documents(docs)

### Embeddings

Simple example to understand what's happening under the hood

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [None]:
# convert to vectors/embeddings
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [None]:
# Test similarity of embeddings
import numpy as np
print(f"[1, 2]: {np.dot(embedding1, embedding2)}")
print(f"[1, 3]: {np.dot(embedding1, embedding3)}")
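The dot product works as a similarity score when the vectors have unit length (which, to my understanding, holds for OpenAI embeddings). Cosine similarity divides out the lengths and therefore works for any vectors. A small sketch with made-up 3-dimensional vectors; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model)
dogs = [0.9, 0.1, 0.0]
canines = [0.8, 0.2, 0.0]
weather = [0.0, 0.1, 0.9]

sim_dogs_canines = cosine_similarity(dogs, canines)
sim_dogs_weather = cosine_similarity(dogs, weather)
```

The two sentences about dogs point in nearly the same direction and score close to 1, while the weather vector is almost orthogonal and scores close to 0.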

### Vectorstores

- Vectorstore used: Chroma (lightweight & in-memory)
- Other vector stores can be hosted, which would be better for larger scale projects

In [None]:
# Installation is not necessarily straightforward:
# - Go to https://visualstudio.microsoft.com/visual-cpp-build-tools/
# - Download the installer & make sure to select the C++ option

# !pip install chromadb

In [None]:
from langchain.vectorstores import Chroma

In [None]:
# Save to this directory (for future use) - first check that nothing is stored there already, as leftover data can corrupt or duplicate the index
persist_directory = 'C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/'

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory = persist_directory # chroma-specific keyword
)

In [None]:
# Note: should equal the number of splits from before, i.e. len(splits)
print(vectordb._collection.count())

**Similarity search**:

In [None]:
question = "What's the advantage of a recurrent neural network"

In [None]:
docs = vectordb.similarity_search(question,k=3) # k = number of documents we want to return
len(docs)

In [None]:
# Looks at (embeddings of) chunks and sees which one matches the best 
docs[0].page_content
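What a similarity search does under the hood, sketched without Chroma or OpenAI. The bag-of-words `toy_embed` is a deliberately crude, hypothetical stand-in for a real embedding model, and the chunks are made up; only the ranking logic is the point.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


stored_chunks = [
    "recurrent networks process sequences step by step",
    "convolutional networks process images",
    "the weather is ugly outside",
]
top = similarity_search("how do recurrent networks handle sequences", stored_chunks, k=2)
```

A real vectorstore embeds each chunk once at indexing time and uses an approximate nearest-neighbour index instead of sorting everything per query.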

In [None]:
# Save to use later
vectordb.persist()

### Failure Modes
Edge cases that cause problems with standard implementations:
- Duplicates: duplicate input documents produce duplicate search results
- Sub-optimal chunks: the search relies on regular sentence semantics and does not leverage structured information

In [None]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k = 5)

# Produces duplicates (due to duplicate input data) - no additional value, distinct chunks would be more valuable
print(docs[0], "\n\n", docs[1])
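One possible mitigation, which is my own assumption and not something the notebook applies: drop chunks with identical text before they are embedded and stored. The helper `drop_duplicate_chunks` is hypothetical and works on plain strings.

```python
def drop_duplicate_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing text with normalised whitespace."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = " ".join(chunk.split())  # ignore differences in spacing and line breaks
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# Hypothetical chunks: the first two differ only in whitespace
raw_chunks = ["MATLAB is used in class", "MATLAB  is used\nin class", "Octave also works"]
unique_chunks = drop_duplicate_chunks(raw_chunks)
```

This only catches exact (whitespace-insensitive) duplicates; near-duplicates would still need a diversity-aware retrieval strategy.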

In [None]:
question = "what did they say about reinforcement?"
docs = vectordb.similarity_search(question,k=5)

for doc in docs:
    print(doc.metadata)

# The search doesn't prioritise structured information (e.g. the source in the metadata) over normal sentence semantics, so it may not return the most relevant chunks
print(docs[4].page_content)