# Dependencies

1. `PyPDFLoader` depends on `pypdf` to parse PDFs
2. `YoutubeAudioLoader` depends on `yt_dlp`, `pydub` and `librosa`
    - `yt_dlp`: Downloads the audio of the YouTube videos
    - `pydub`: Splits the audio into chunks that fit within OpenAI Whisper's 25 MB limit
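
To illustrate the splitting step, here is a minimal sketch that only computes the chunk boundaries, without touching any audio. The function name `chunk_ranges` and the 20-minute chunk length are assumptions made for this example; they are not taken from this notebook.

```python
def chunk_ranges(total_ms, chunk_ms):
    """Return (start, end) millisecond offsets that cover [0, total_ms)."""
    if chunk_ms <= 0:
        raise ValueError('chunk_ms must be positive')
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]

# Assumed values: a 50-minute recording cut into 20-minute chunks
MINUTE_MS = 60 * 1000
ranges = chunk_ranges(50 * MINUTE_MS, 20 * MINUTE_MS)
print(ranges)
```

With `pydub`, each pair could then be applied as a slice on an `AudioSegment` (`audio[start:end]`), since `AudioSegment` slicing works in milliseconds.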

[Here is the list of all other document loaders](https://python.langchain.com/docs/integrations/document_loaders)

In [1]:
from langchain.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page
pdf_loader = PyPDFLoader('docs/pdf/hyperion.pdf')
pdf_docs = pdf_loader.load()
print(f'Number of pages: {len(pdf_docs)}')

Number of pages: 570


# Possible Bugs

I found recent versions of `openai` (>=1.0.0) to be incompatible with the latest version of `langchain` (0.0.333). To work around this, I made the following changes:

1. Run `openai migrate <path-to-langchain>/document_loaders/parsers/audio.py`, replacing `<path-to-langchain>` with the path to your langchain installation
2. Change line 66 in `<path-to-langchain>/document_loaders/parsers/audio.py` to `transcript = client.audio.transcriptions.create(model="whisper-1", file=file_obj)`
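
One way to find the value for `<path-to-langchain>` is to ask Python where the package is installed. The helper below is a sketch: the name `package_dir` is made up for this example, and it uses only the standard library, so it does not import langchain itself.

```python
import importlib.util
from pathlib import Path

def package_dir(name):
    """Return the install directory of a package, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])

root = package_dir('langchain')
if root is None:
    print('langchain is not installed in this environment')
else:
    # Path of the file that the two steps above modify
    print(root / 'document_loaders' / 'parsers' / 'audio.py')
```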

In [None]:
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser

# Download the video's audio into docs/youtube, then transcribe it with OpenAI Whisper
url = 'https://youtu.be/_PPWWRV6gbA?si=hQFeGBgt6yawfuPI'
youtube_loader = GenericLoader(YoutubeAudioLoader([url], 'docs/youtube'), OpenAIWhisperParser())
youtube_docs = youtube_loader.load()

# Preprocessing & Splitting

### Preprocessing
1. The `PyPDFLoader` returns a list of `Document` objects, each with `page_content` and `metadata` attributes
2. The `page_content` attribute is then preprocessed to add `#` before each chapter number
3. The `metadata` attribute is then updated to include the chapter number, story and character name
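
The two preprocessing steps can be tried in isolation on plain strings. This is a sketch rather than the notebook's exact code: the helper names `mark_chapters` and `lookup_story` are made up for this example, and the page ranges are a two-entry excerpt of the mapping used in the cell further down.

```python
import re

# A line that holds nothing but digits is treated as a chapter number
CHAPTER_LINE = re.compile(r'^\s*(\d+)\s*$', flags=re.MULTILINE)

def mark_chapters(text):
    """Prefix standalone chapter numbers with '#'."""
    return CHAPTER_LINE.sub(r'#\1', text)

# (story, character) -> (first page, last page), 1-indexed and inclusive
PAGE_RANGES = {
    ('Priest', 'Lenar Hoyt'): (35, 114),
    ('Soldier', 'Fedmahn Kassad'): (144, 204),
}

def lookup_story(page_number):
    """Return (story, character) for a page, or ('Plot', 'None') outside every range."""
    for key, (first, last) in PAGE_RANGES.items():
        if first <= page_number <= last:
            return key
    return ('Plot', 'None')

print(repr(mark_chapters('intro line\n12\nIt was late.')))
print(lookup_story(35), lookup_story(128))
```

A number inside a sentence is left alone because the pattern only matches lines made up entirely of digits and whitespace.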


### Splitting
Langchain provides numerous splitting options. Some of the most common ones are:
1. `CharacterTextSplitter`: Splits the text on a single separator into chunks of up to a fixed size, with a fixed overlap
2. `RecursiveCharacterTextSplitter`: Similar to `CharacterTextSplitter`, but tries a list of separators in order and recursively splits any chunk that is still too large
3. `MarkdownTextSplitter`: Splits the text along Markdown-specific separators such as headers
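
To make the recursive idea concrete, here is a simplified pure-Python sketch. It is not Langchain's implementation: it ignores chunk overlap, drops the separator at chunk boundaries, and the function name `recursive_split` is made up for this example.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ''
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ''
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split('aaaa bbbb cccc\ndddd eeee', 9, ['\n', ' '])
print(chunks)
```

Coarse separators are tried first, so lines stay together when they fit, and only oversized pieces are broken on spaces.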

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

def preprocess_pdf(docs):
    # 'Story/Character' -> [first page, last page], 1-indexed and inclusive
    sc_mapping = {'Priest/Lenar Hoyt': [35, 114], 'Soldier/Fedmahn Kassad': [144, 204], 'Poet/Martin Silenus': [210, 271], 'Scholar/Sol Weintraub': [285, 356], 'Detective/Brawne Lamia': [376, 470], 'Consul/Consul': [484, 541]}
    chapter = 0
    for page in docs:
        page_number = page.metadata['page'] + 1  # PyPDFLoader pages are 0-indexed
        # Prefix standalone chapter numbers with '#'
        page_content = re.sub(r'^\s*(\d+)\s*$', r'#\1', page.page_content, flags=re.MULTILINE)
        # Carry the most recent chapter number forward across pages
        match = re.search(r'#(\d+)', page_content)
        if match:
            chapter = int(match.group(1))
        # Default for pages that fall outside every story's range
        story, character = 'Plot', 'None'
        for k, v in sc_mapping.items():
            if v[0] <= page_number <= v[1]:
                story, character = k.split('/')
                break
        page.page_content = page_content
        page.metadata['chapter'] = chapter
        page.metadata['story'] = story
        page.metadata['character'] = character
        page.metadata['page'] = page_number
    return docs

pdf_docs = preprocess_pdf(pdf_docs)
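# Repeat the page list four times in total (original + 3 copies);
# every page then appears four times in the splits and in the index
pdf_docs_copy = pdf_docs.copy()
for _ in range(3):
    pdf_docs += pdf_docs_copy.copy()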
print(f'Number of pages: {len(pdf_docs)}')
print(f'random page: {pdf_docs[320].page_content[:100]}')
print(f'random page metadata: {pdf_docs[320].metadata}')

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=['#', '\n', '"', ' ', '']
)
splits = text_splitter.split_documents(pdf_docs)
print(f'Number of splits: {len(splits)}')

Number of pages: 2280
random page: last contact, Arundez had aged but little—Sol guessed that
he was still in his late twenties. But th
random page metadata: {'source': 'docs/pdf/hyperion.pdf', 'page': 321, 'chapter': 4, 'story': 'Scholar', 'character': 'Sol Weintraub'}
Number of splits: 4340


# Embedding & Vectorstore

### Embedding
Langchain provides numerous embedding options. Some of the most common ones are:
1. `HuggingFaceEmbeddings`: Uses the `sentence-transformers` library to generate embeddings locally
2. `OpenAIEmbeddings`: Uses the OpenAI embeddings API to generate embeddings

We chose `HuggingFaceEmbeddings` because OpenAI was rate limiting the number of requests we could make to their API.

### Vectorstore
A vectorstore is a database of embeddings, each corresponding to one document in a set. Langchain provides numerous vectorstore options. One of the most common is:
1. `FAISS`: Uses the FAISS library to index the embeddings and run similarity searches over them
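
Conceptually, a vectorstore pairs each text with its embedding and answers a query by returning the texts whose embeddings lie closest to the query's embedding. The toy class below is only a sketch of that idea: `ToyVectorStore` is a made-up name, the two-dimensional vectors are hand-written stand-ins for real embeddings, and it uses a linear scan with Euclidean distance where FAISS uses optimized index structures.

```python
import math

class ToyVectorStore:
    """Brute-force nearest-neighbour search over (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    def similarity_search(self, query_vector, k=2):
        # Rank stored items by Euclidean distance to the query, closest first
        ranked = sorted(self._items,
                        key=lambda item: math.dist(item[1], query_vector))
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add('about ships', [1.0, 0.0])
store.add('about poets', [0.0, 1.0])
store.add('about ships and poets', [0.7, 0.7])

results = store.similarity_search([0.9, 0.1], k=2)
print(results)
```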

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
import os

def retrieve_vectorstore(documents):
    # Name the store after the source file, e.g. 'hyperion' for 'docs/pdf/hyperion.pdf'
    store_name, _ = os.path.splitext(os.path.basename(documents[0].metadata['source']))
    store_path = os.path.join('docs/vectorstores', store_name)
    embeddings = HuggingFaceEmbeddings()
    # Reuse a previously saved index if one exists; otherwise build and save it
    if os.path.exists(store_path):
        return FAISS.load_local(store_path, embeddings)
    vectorstore = FAISS.from_documents(documents=documents, embedding=embeddings)
    vectorstore.save_local(store_path)
    return vectorstore

vectordb = retrieve_vectorstore(splits)
query = 'Who are the Outcasters?'
# `k` sets how many results to return
print(vectordb.similarity_search(query, k=2))

[Document(page_content='out. Some are waiting forlhe farcaster to be built, but most\ndon’t believe it’ll happen in time. They’re afraid.”\n“Of the Ousters?”\n“Them too,” said Theo, “but mostly of the Shrike.”\nThe Consul turned his face from the coolness of the\ncanopy. “It’s come south of the Bridle Range then?”\nTheo laughed without humor. “It’s everywhere. Or they’r e\neverywhere. Most people are convinced that there are', metadata={'source': 'docs/pdf/hyperion.pdf', 'page': 128, 'chapter': 2, 'story': 'Plot', 'character': 'None'}), Document(page_content='out. Some are waiting forlhe farcaster to be built, but most\ndon’t believe it’ll happen in time. They’re afraid.”\n“Of the Ousters?”\n“Them too,” said Theo, “but mostly of the Shrike.”\nThe Consul turned his face from the coolness of the\ncanopy. “It’s come south of the Bridle Range then?”\nTheo laughed without humor. “It’s everywhere. Or they’r e\neverywhere. Most people are convinced that there are', metadata={'source': 'docs/p