# Dependencies

1. `PyPDFLoader` depends on `pypdf` to parse the PDFs
2. `YoutubeAudioLoader` depends on `yt_dlp`, `pydub` and `librosa`
    - `yt_dlp`: To download the audio tracks of the YouTube videos
    - `pydub`: To split the audio into chunks that stay under OpenAI Whisper's 25 MB limit
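
To illustrate the splitting step, here is a minimal sketch of how chunk boundaries could be computed before slicing an audio track. The `chunk_ranges` helper and the 20-minute chunk length are assumptions made for illustration, not values taken from this notebook. With `pydub`, each `(start, end)` pair in milliseconds could then be used to slice an `AudioSegment`.

```python
def chunk_ranges(total_ms, chunk_ms):
    """Return (start, end) millisecond pairs that cover the whole track."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]

# Assumed example: a 45-minute track cut into 20-minute chunks
twenty_minutes_ms = 20 * 60 * 1000
ranges = chunk_ranges(45 * 60 * 1000, twenty_minutes_ms)
print(ranges)
```

The last chunk is simply shorter than the others, so no audio is dropped or padded.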

[Here is the list of all other document loaders](https://python.langchain.com/docs/integrations/document_loaders)

In [1]:
from langchain.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader('docs/pdf/hyperion.pdf')
pdf_docs = pdf_loader.load()
print(f'Number of pages: {len(pdf_docs)}')

Number of pages: 570


# Possible Bugs

I found recent versions of `openai` (>=1.0.0) to be incompatible with the latest version of langchain (0.0.333). As a workaround, I had to make the following changes:

1. Run `openai migrate <path-to-langchain>/document_loaders/parsers/audio.py`, replacing `<path-to-langchain>` with the path to the installed langchain package
2. Change line 66 in `<path-to-langchain>/document_loaders/parsers/audio.py` to `transcript = client.audio.transcriptions.create(model="whisper-1", file=file_obj)`
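
Before patching anything, it may help to check which `openai` version is actually installed. The sketch below uses only the standard library; the helper names `major_version` and `needs_audio_parser_patch` are hypothetical, and the rule "major version 1 or higher needs the patch" is an assumption based on the version numbers quoted above.

```python
from importlib.metadata import PackageNotFoundError, version

def major_version(version_string):
    """Return the leading integer of a version string such as '1.3.0'."""
    return int(version_string.split('.')[0])

def needs_audio_parser_patch(openai_version):
    # Assumption: only openai >= 1.0.0 requires the changes listed above
    return major_version(openai_version) >= 1

try:
    installed = version('openai')
    print(f'openai {installed}: patch needed = {needs_audio_parser_patch(installed)}')
except PackageNotFoundError:
    print('openai is not installed')
```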

In [None]:
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser

url = 'https://youtu.be/_PPWWRV6gbA?si=hQFeGBgt6yawfuPI'
youtube_loader = GenericLoader(YoutubeAudioLoader([url],'docs/youtube'), OpenAIWhisperParser())
youtube_docs = youtube_loader.load()

# Preprocessing & Splitting

### Preprocessing
1. The `PyPDFLoader` returns a list of `Document` objects, each of which has a `page_content` and `metadata` attribute
2. The `page_content` attribute is then preprocessed to add `#` before each chapter number
3. The `metadata` attribute is then updated to include the chapter number, story and character name
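
The two preprocessing ideas above can be sketched on a made-up page, independently of langchain. The sample text and the `story_for_page` helper are hypothetical; the page ranges are the first two used later in this notebook.

```python
import re

# Made-up page text: a chapter number alone on a line, surrounded by prose
sample = 'End of the previous chapter.\n  3  \nStart of the next chapter.'

# A line holding only digits (plus optional whitespace) becomes '#<digits>'
marked = re.sub(r'^\s*(\d+)\s*$', r'#\1', sample, flags=re.MULTILINE)
print(marked)

def story_for_page(page_number, ranges):
    """Return the story whose inclusive page range contains page_number."""
    for name, (first, last) in ranges.items():
        if first <= page_number <= last:
            return name
    return 'Plot'  # pages outside every range belong to the frame story

ranges = {'Priest': (35, 114), 'Soldier': (144, 204)}
print(story_for_page(40, ranges), story_for_page(120, ranges))
```

Marking chapter numbers with `#` gives the splitter a cheap, unambiguous separator to cut on.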


### Splitting
Langchain provides numerous splitting options. Some of the most common ones are:
1. `CharacterTextSplitter`: Splits the text on a single separator and merges the pieces into chunks of a bounded size, with a fixed overlap
2. `RecursiveCharacterTextSplitter`: Similar to `CharacterTextSplitter`, but tries a list of separators in order and recursively splits any piece that is still too large
3. `MarkdownTextSplitter`: Splits the text into chunks along markdown structure such as headers
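
To make the recursive idea concrete, here is a simplified sketch in plain Python. It is not langchain's implementation: it has no chunk overlap, and the function name `recursive_split` is made up for this example. It only shows how coarse separators are tried first and finer ones are used for pieces that are still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == '':
        # Nothing left to split on: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ''
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ''
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split('alpha beta\ngamma delta epsilon', ['\n', ' ', ''], 12)
print(chunks)
```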

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

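def preprocess_pdf(docs):
    # Maps 'story/character' to the inclusive, 1-based page range of that character's tale
    sc_mapping = {
        'Priest/Lenar Hoyt': [35, 114],
        'Soldier/Fedmahn Kassad': [144, 204],
        'Poet/Martin Silenus': [210, 271],
        'Scholar/Sol Weintraub': [285, 356],
        'Detective/Brawne Lamia': [376, 470],
        'Consul/Consul': [484, 541],
    }
    # A line that holds only a number is treated as a chapter heading
    chapter_pattern = re.compile(r'^\s*(\d+)\s*$', flags=re.MULTILINE)
    chapter = 0
    for page in docs:
        page_number = page.metadata['page'] + 1  # PyPDFLoader page numbers are 0-based
        # The last seen chapter carries over until a new heading appears
        match = chapter_pattern.search(page.page_content)
        if match:
            chapter = int(match.group(1))
        # Default for pages that fall outside every tale
        story, character = 'Plot', 'None'
        for key, (first, last) in sc_mapping.items():
            if first <= page_number <= last:
                story, character = key.split('/')
                break
        # Prefix each chapter number with '#' so the splitter can cut on it
        page.page_content = chapter_pattern.sub(r'#\1', page.page_content)
        page.metadata['chapter'] = chapter
        page.metadata['story'] = story
        page.metadata['character'] = character
        page.metadata['page'] = page_number
    return docs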

pdf_docs = preprocess_pdf(pdf_docs)
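pdf_docs_copy = pdf_docs.copy()
# Note: this appends three more references to the same pages, so the corpus
# holds every page four times and retrieval can return identical chunks
for _ in range(3):
    pdf_docs += pdf_docs_copy.copy()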
print(f'Number of pages: {len(pdf_docs)}')
print(f'random page: {pdf_docs[320].page_content[:100]}')
print(f'random page metadata: {pdf_docs[320].metadata}')

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    separators=['#', '\n', '"', ' ', '']
)
splits = text_splitter.split_documents(pdf_docs)
print(f'Number of splits: {len(splits)}')

Number of pages: 2280
random page: last contact, Arundez had aged but little—Sol guessed that
he was still in his late twenties. But th
random page metadata: {'source': 'docs/pdf/hyperion.pdf', 'page': 321, 'chapter': 4, 'story': 'Scholar', 'character': 'Sol Weintraub'}
Number of splits: 4340


# Embedding & Vectorstore

### Embedding
Langchain provides numerous embedding options. Some of the most common ones are:
1. `HuggingFaceEmbeddings`: Uses the `sentence-transformers` library to generate embeddings locally with a HuggingFace model
2. `OpenAIEmbeddings`: Uses the OpenAI embeddings API to generate embeddings

We chose `HuggingFaceEmbeddings` because OpenAI was rate limiting the number of requests we could make to their API.

### Vectorstore
A vectorstore is a database of embeddings, each of which corresponds to one document (here, one split). Langchain provides numerous vectorstore options. Some of the most common ones are:
1. `FAISS`: Facebook AI Similarity Search, a library for similarity search over dense vectors
2. `Chroma`: An open-source embedding database that langchain's own examples often use
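
As a rough mental model, a vectorstore keeps one vector per document and returns the documents whose vectors are closest to the query vector. The toy class below sketches that with cosine similarity in plain Python. `ToyVectorStore` and the three-dimensional vectors are made up for illustration; real stores such as FAISS or Chroma use learned embeddings and far more efficient indexes.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.entries.append((text, vector))

    def similarity_search(self, query_vector, k=2):
        # Rank every stored vector by its similarity to the query
        ranked = sorted(self.entries,
                        key=lambda entry: cosine_similarity(query_vector, entry[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add('doc A', [0.9, 0.1, 0.0])
store.add('doc B', [0.1, 0.9, 0.0])
store.add('doc C', [0.7, 0.2, 0.1])
top = store.similarity_search([1.0, 0.0, 0.0], k=2)
print(top)
```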

In [3]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
import os

embedding = HuggingFaceEmbeddings()
save_dir = r'docs/vectorstores'

def retrieve_FAISS_vectorstore(documents):
    store_name, _ = os.path.splitext(os.path.basename(documents[0].metadata['source']))
    store_path = os.path.join(os.getcwd(), save_dir, 'faiss', store_name)
    if not os.path.exists(store_path):
        vectorstore = FAISS.from_documents(documents=documents, embedding=embedding)
        vectorstore.save_local(store_path)
        return vectorstore
    else:
        return FAISS.load_local(store_path, embedding)

def retrieve_Chroma_vectorstore(documents):
    store_name, _ = os.path.splitext(os.path.basename(documents[0].metadata['source']))
    store_path = os.path.join(os.getcwd(), save_dir, 'chroma', store_name)
    if not os.path.exists(store_path):
        os.makedirs(store_path)
        vectorstore = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory=store_path)
    else:
        vectorstore = Chroma(persist_directory=store_path, embedding_function=embedding)
    return vectorstore


vectordb = retrieve_Chroma_vectorstore(splits)
query = 'Who are the Ousters?'
docs_ss = vectordb.similarity_search(query, k=2)
print(docs_ss)

[Document(page_content='the Ousters. The SDF forces have been running wild. Much of\nthe carnage could be their doing.”\n“With no bodies?” laughed Martin Silenus. “Wishful\nthinking. Our absent hosts downstairs dangle now on the\nShrike’s steel tree. Where, ere long, we too will be.”\n“Shut up,” Brawne Lamia said tiredly.', metadata={'chapter': 6, 'character': 'None', 'page': 476, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}), Document(page_content='the Ousters. The SDF forces have been running wild. Much of\nthe carnage could be their doing.”\n“With no bodies?” laughed Martin Silenus. “Wishful\nthinking. Our absent hosts downstairs dangle now on the\nShrike’s steel tree. Where, ere long, we too will be.”\n“Shut up,” Brawne Lamia said tiredly.', metadata={'chapter': 6, 'character': 'None', 'page': 476, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'})]


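# Maximal Marginal Relevance
Maximal marginal relevance (MMR) search introduces diversity into the retrieved documents: it first fetches the `fetch_k` most similar chunks, then picks the `k` among them that are relevant to the query but dissimilar to one another.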
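
The diversity idea can be sketched in plain Python as a greedy selection. This is a simplified illustration, not langchain's code: the `mmr` function, the two-dimensional vectors and the 0.5 weighting are assumptions chosen so that the effect is visible on a corpus containing a duplicate.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query, vectors, k=2, lambda_mult=0.5):
    """Greedily pick k indices, balancing relevance against redundancy."""
    candidates = list(range(len(vectors)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            # Similarity to the closest document that was already picked
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
vectors = [[1.0, 0.5], [1.0, 0.5], [1.0, -0.6]]  # index 1 duplicates index 0

top_k = sorted(range(len(vectors)),
               key=lambda i: cosine(query, vectors[i]), reverse=True)[:2]
diverse = mmr(query, vectors, k=2)
print('similarity:', top_k, 'mmr:', diverse)
```

Plain similarity ranking returns the duplicate pair, while the MMR selection skips the second copy in favour of a different vector.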

In [4]:
docs_mmr = vectordb.max_marginal_relevance_search(query, k=2, fetch_k=5)
for i in range(2):
    print(f'{i+1}. Similarity search: {docs_ss[i].metadata}')
    print(f'{i+1}. Marginal relevance: {docs_mmr[i].metadata}')
    print('')

1. Similarity search: {'chapter': 0, 'character': 'None', 'page': 12, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}
1. Marginal relevance: {'chapter': 0, 'character': 'None', 'page': 12, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}

2. Similarity search: {'chapter': 0, 'character': 'None', 'page': 12, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}
2. Marginal relevance: {'chapter': 0, 'character': 'None', 'page': 13, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}



**Note**: Unfortunately, there was no way around using `openai` version 0.28 for the following part. Please downgrade it with the command `pip install openai==0.28`.

### Self Query
Self query lets langchain use an LLM to infer both a metadata filter and a semantic query from the user's question, which tends to return more relevant documents.

For example, the query 'What are some movies about aliens made in 1980' would be passed through an LLM and divided into the following parts:
1. `query`: alien movies
2. `metadata filter`: eq("year", 1980)
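
A minimal sketch of how such a structured query might be applied, assuming the LLM has already produced the two parts. The `eq` helper, the dictionary layout and the movie records are all hypothetical; langchain's own structured query objects look different.

```python
def eq(attribute, value):
    """Build a predicate that keeps records whose attribute equals value."""
    return lambda metadata: metadata.get(attribute) == value

# Assumed output of the LLM step for the example question above
structured_query = {'query': 'alien movies', 'filter': eq('year', 1980)}

# Made-up metadata records
movies = [
    {'title': 'Movie A', 'year': 1980},
    {'title': 'Movie B', 'year': 1995},
    {'title': 'Movie C', 'year': 1980},
]

# The filter narrows the candidates; the remaining 'query' text would then
# be used for an ordinary similarity search over those candidates
matches = [m['title'] for m in movies if structured_query['filter'](m)]
print(matches)
```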

In [5]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name = 'source',
        description = 'File path of the document',
        type = 'text',
    ),
    AttributeInfo(
        name = 'chapter',
        description = 'Chapter number of the 570-page document ranging from 0-6',
        type = 'integer',
    ),
    AttributeInfo(
        name = 'page',
        description = 'Page number of the document ranging from 1-570',
        type = 'integer',
    ),
    AttributeInfo(
        name = 'story',
        description = 'Profession of the character who is narrating their story. Plot is used in the case no character is narrating their story. Possible values are: Priest, Soldier, Poet, Scholar, Detective, Consul, Plot',
        type = 'text',
    ),
    AttributeInfo(
        name = 'character',
        description = 'Name of the character who is narrating their story. None is used in the case no character is narrating their story. Possible values are: Lenar Hoyt, Fedmahn Kassad, Martin Silenus, Sol Weintraub, Brawne Lamia, Consul, None',
        type = 'text',
    ),
]
document_content_description = 'Hyperion book by Dan Simmons'
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True,
)
query_2 = 'Who did Lenar Hoyt find while narrating his story?'
docs_ss_2 = vectordb.similarity_search(query_2, k=2)
docs_sqr = retriever.get_relevant_documents(query_2)
print('Similarity search:\n', [doc.metadata for doc in docs_ss_2])
print('Self query retrieval:\n', [doc.metadata for doc in docs_sqr])

Similarity search:
 [{'chapter': 5, 'character': 'None', 'page': 361, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}, {'chapter': 5, 'character': 'None', 'page': 361, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Plot'}]
Self query retrieval:
 [{'chapter': 1, 'character': 'Lenar Hoyt', 'page': 37, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Priest'}, {'chapter': 1, 'character': 'Lenar Hoyt', 'page': 37, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Priest'}, {'chapter': 1, 'character': 'Lenar Hoyt', 'page': 37, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Priest'}, {'chapter': 1, 'character': 'Lenar Hoyt', 'page': 37, 'source': 'docs/pdf/hyperion.pdf', 'story': 'Priest'}]
