## Part 1: Document Loading

Types of documents covered:
- PDFs
- YouTube videos
- Websites

Import and standardise them so that each source yields:
- Content
- Metadata

In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

### PDF

Each page is a Document. A Document contains text (page_content) and metadata.

In [None]:
# pip install pypdf 

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/Deep_Learning_A4.pdf")
pages = loader.load()

In [None]:
page = pages[0]
print(page.page_content[0:500])  # first 500 characters of the page
page.metadata  # source file path and page number

### YouTube (broken, to be fixed)
- Issue: yt_dlp cannot find the ffmpeg binaries, even though they are installed on the local device. The root cause has not been identified yet.
- Tutorial used: https://www.youtube.com/watch?v=IECI72XEox0

In [None]:
# ! pip install yt_dlp
# ! pip install pydub

In [None]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser # converts youtube audio to text format (langchain model)
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [None]:
# Youtube URL (video: Josh Angrist: What's the Difference Between Econometrics and Data Science? - 2 min)
url = "https://www.youtube.com/watch?v=2EhRT2mOXm8&t=2s"

# Directory where to save audio
save_dir = "C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/"


In [None]:
# Currently fails: yt_dlp does not detect the installed ffmpeg binaries (see issue above)

# Note: may take a while, and raises an error if the audio was already downloaded to save_dir
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

docs = loader.load()

In [None]:
doc = docs[0]
doc.page_content[0: 500]
doc.metadata

### URLs

Note: the extracted page text is likely to be poorly formatted (excess whitespace, navigation clutter), so it should be post-processed for readability.

In [None]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://richie-lee.github.io/post/2021_uplift/")

In [None]:
docs = loader.load()

In [None]:
doc = docs[0]
doc.page_content[:1500]

---
## Part 2: Document splitting

Splitting happens *after* loading the data and *before* feeding it into the vector store.

Fundamental concept: split the text into chunks of a given size, with some overlap between consecutive chunks. The overlap helps ensure that no information is lost at the chunk boundaries.
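
To make the overlap idea concrete, here is a minimal sketch in plain Python. It is a fixed-window chunker and not the algorithm LangChain uses internally; the function name `chunk_with_overlap` is made up for this illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reaches the end
            break
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.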

Types of splitting:
- **CharacterTextSplitter():** based on characters
- **MarkdownHeaderTextSplitter():** based on MD headers
- **TokenTextSplitter():** based on tokens
- **RecursiveCharacterTextSplitter():** recursively tries to split by different characters to see what works
- **Language():** enum of supported languages (Python, Ruby, Markdown, ...) used for language-aware splitting
- **NLTKTextSplitter():** based on sentences, using NLTK (Natural Language Toolkit)
- **SpacyTextSplitter():** based on sentences, using spaCy




In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

Intuitive examples of (Recursive) character text splitters:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
# initialise two different text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
# len(text1) == 26 == chunk_size, so the text is not split
text1 = 'abcdefghijklmnopqrstuvwxyz'
print(r_splitter.split_text(text1))

# len(text2) == 33 > chunk_size, so the text is split (with a 4-character overlap)
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
print(r_splitter.split_text(text2))

Issue with character text splitting: it only splits on a single separator (by default a double newline, `\n\n`), and the text below contains none.

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(r_splitter.split_text(text3))  # recursive character text splitting: falls back to splitting on spaces
print(c_splitter.split_text(text3))  # character text splitting: the default separator ("\n\n") does not occur, so nothing is split

In [None]:
# Note: with a space as separator, the character splitter gives the same result as the recursive splitter here
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

Recursive splitting details:

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]  # the default separators, shown explicitly for illustration; they are tried in order from left to right
)

In [None]:
# Only splits on spaces
c_splitter.split_text(some_text)

In [None]:
# Splits on \n\n first, then falls back to the finer separators in order, so paragraph boundaries are respected where possible
r_splitter.split_text(some_text)
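
The fallback through the separator list can be sketched in plain Python. This is a simplified illustration without overlap, not LangChain's actual implementation; `recursive_split` is a hypothetical name, and the sketch assumes the separator list ends with the empty string.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (no overlap): coarse separators first, finer ones on demand."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


paragraphs = "Para one is short.\n\nPara two is a bit longer than the first one."
chunks = recursive_split(paragraphs, chunk_size=30)
print(chunks)
```

The first paragraph fits and is kept whole; only the second one is broken down further, on spaces.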

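To split on sentence boundaries, add a regex lookbehind such as `(?<=\. )` to the separators: it splits after the period without detaching it from its sentence. Depending on the LangChain version, `is_separator_regex=True` may also be required.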

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # not a plain "\. ": separators are treated as regex, and the lookbehind keeps the period at the end of its sentence
)
r_splitter.split_text(some_text)
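
The effect of the lookbehind can be checked with the standard `re` module alone. This is only an illustration of the regex behaviour (Python 3.7+), independent of LangChain; the sample sentence is made up.

```python
import re

text = "First sentence. Second sentence. Third one."

# Plain pattern: the matched ". " is consumed and disappears from the pieces
consumed = re.split(r"\. ", text)

# Lookbehind: a zero-width match right after ". ", so nothing is removed
kept = re.split(r"(?<=\. )", text)

print(consumed)  # ['First sentence', 'Second sentence', 'Third one.']
print(kept)      # ['First sentence. ', 'Second sentence. ', 'Third one.']
```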

Try with real example:

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain for LLM applications/Deep_Learning_A4.pdf")
pages = loader.load()

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
# Splitting increases the number of documents: number of chunks vs. number of original pages
print(len(docs), len(pages))

**Token splitting:** LLM context windows are usually measured in tokens (one token is roughly 4 characters of English text), so splitting on tokens keeps chunk sizes aligned with the model's limits.

In [None]:
from langchain.text_splitter import TokenTextSplitter

In [None]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [None]:
text1 = "foo bar bazzyfoo"

In [None]:
text_splitter.split_text(text1)

In [None]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [None]:
docs = text_splitter.split_documents(pages)

In [None]:
# Note: each chunk keeps the metadata of the page it came from (which is good).
print(docs[0])
pages[0].metadata

**Context-aware splitting:** adds metadata to the text chunks

- chunks aim to keep text with common context together
- text splitting often uses sentences or other delimiters to keep related text together, but some docs have explicit structures that can be used (e.g. markdown headers) - headers become metadata
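
How headers turn into metadata can be sketched in a few lines of plain Python. This is a simplified stand-in for the idea, not the implementation of `MarkdownHeaderTextSplitter`; `split_by_headers` and its output format (dicts with `content` and `metadata`) are assumptions for this illustration.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group text under its enclosing headers; the headers end up in metadata, not in the content."""
    chunks, current_meta, current_lines = [], {}, []

    def flush():
        content = "\n".join(current_lines).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(current_meta)})
        current_lines.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for prefix, name in headers_to_split_on:
            # prefix + " " ensures that "#" does not match a "##" line
            if line.startswith(prefix + " "):
                flush()
                # A new header invalidates headers of the same or a deeper level
                for other_prefix, other_name in headers_to_split_on:
                    if len(other_prefix) >= len(prefix):
                        current_meta.pop(other_name, None)
                current_meta[name] = line[len(prefix):].strip()
                break
        else:
            if line:
                current_lines.append(line)
    flush()
    return chunks


doc = ("# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
       "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly")
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
header_chunks = split_by_headers(doc, headers)
for chunk in header_chunks:
    print(chunk["metadata"], "->", repr(chunk["content"]))
```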

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [None]:
print(len(md_header_splits)) # number of chunks

print(md_header_splits[0]) # first chunk

---
## Part 3: Vectorstores & Embeddings
Retrieval augmented generation workflow:

Documents => smaller splits => embedding => store in vectorstore 

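- Embedding: numerical representation of text (similar embeddings = similar texts)
- Vectorstore: database that stores the embeddings and returns the entries most similar to a query embedding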



In [None]:
import openai
import os

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A5.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [None]:
splits = text_splitter.split_documents(docs)

### Embeddings

Simple example to understand what's happening under the hood

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [None]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [None]:
# convert to vectors/embeddings
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [None]:
# Test similarity of embeddings
import numpy as np
print(f"[1, 2]: {np.dot(embedding1, embedding2)}")
print(f"[1, 3]: {np.dot(embedding1, embedding3)}")
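
The dot product works as a similarity score here because OpenAI embeddings are normalised to unit length, in which case dot product and cosine similarity coincide. For vectors of arbitrary length, cosine similarity is the safer choice. A minimal sketch with made-up three-dimensional vectors (real embeddings have far more dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
dogs = [0.9, 0.1, 0.0]
canines = [1.8, 0.2, 0.0]  # same direction as dogs, twice the length
weather = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(dogs, canines)
sim_far = cosine_similarity(dogs, weather)
print(sim_close, sim_far)
```

The length of `canines` does not affect the result; only the direction matters.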

### Vectorstores

- Vectorstore used: Chroma (lightweight & in-memory)
- Other vector stores can be hosted, which would be better for larger scale projects

In [None]:
# Installation on Windows is not necessarily straightforward:
# - Go to https://visualstudio.microsoft.com/visual-cpp-build-tools/
# - Download the installer and make sure to select the C++ build tools

# !pip install chromadb

In [None]:
from langchain.vectorstores import Chroma

In [None]:
# Directory where the index is persisted for later use. Check that it does not already contain an older index, as that can lead to duplicated or inconsistent entries
persist_directory = 'C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/'

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory = persist_directory # chroma-specific keyword
)

In [None]:
# Note: this equals the number of splits created above
print(vectordb._collection.count())

**Similarity search**:

In [None]:
question = "What's the advantage of a recurrent neural network"

In [None]:
docs = vectordb.similarity_search(question,k=3) # k = number of documents we want to return
len(docs)

In [None]:
# Looks at (embeddings of) chunks and sees which one matches the best 
docs[0].page_content

In [None]:
# Save to use later
vectordb.persist()

### Failure Modes
Edge cases that cause problems with the standard implementation:
- Duplicate results (caused by duplicate input)
- Sub-optimal chunks, because the search ranks by semantic similarity only and does not use structured information such as metadata
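
One way to avoid the first failure mode is to remove duplicate chunks before they are embedded. The following is a sketch of that idea, not something the original notebook does; `deduplicate` is a hypothetical helper that works on plain strings.

```python
import hashlib


def deduplicate(chunks):
    """Keep the first occurrence of each chunk, comparing whitespace-normalised text."""
    seen, unique = set(), []
    for chunk in chunks:
        normalised = " ".join(chunk.split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


unique_chunks = deduplicate(["a b", "a  b", "c"])
print(unique_chunks)  # ['a b', 'c']
```

With LangChain documents, the same check could be applied to `page_content` before calling `Chroma.from_documents`.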

In [None]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k = 5)

# Produces duplicates (due to the duplicated input data); these add no value, distinct chunks would be more useful
print(docs[0], "\n\n", docs[1])

In [None]:
question = "what did they say about reinforcement?"
docs = vectordb.similarity_search(question,k=5)

for doc in docs:
    print(doc.metadata)

# Similarity search does not prioritise structured information (e.g. which document a chunk comes from) over sentence semantics, so the most relevant chunk may not rank first
print(docs[4].page_content)

---
## Part 4: Retrieval
A relatively new topic (2023).

- **Maximum marginal relevance (MMR)**: trade off relevance against diversity of the results
- **LLM-aided retrieval**: use metadata and an LLM to filter
- **Compression**: keep only the relevant parts of each retrieved chunk

In [None]:
# !pip install lark

In [None]:
import os
import openai
import sys
sys.path.append('../..')

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

**Similarity search**

In [None]:
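from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
# Reload the vector store persisted in Part 3 (nothing was ever written to 'docs/chroma/')
persist_directory = 'C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)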

The vector store was already built in Part 3 (load, split, embed, store), so here we only check that it is populated:

In [None]:
print(vectordb._collection.count())

Create small toy data example:

In [None]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [None]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [None]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [None]:
smalldb.similarity_search(question, k=2)

In [None]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3) # fetch all 3 documents originally

### Addressing Diversity: Maximum marginal relevance (MMR)
- Enforcing diversity in search results while preserving relevance
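
The selection rule behind MMR can be sketched in plain Python: repeatedly pick the candidate that is most relevant to the query and least similar to what has already been picked. This is a simplified illustration with made-up two-dimensional vectors, not LangChain's implementation; the parameter name `lambda_mult` mirrors the one LangChain exposes, everything else is an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return the indices of k documents, trading off relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.2], [1.0, 0.19], [0.1, 1.0]]  # docs 0 and 1 are near-duplicates

by_similarity = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
by_mmr = mmr(query, docs, k=2)
print(by_similarity, by_mmr)  # [0, 1] [0, 2]
```

Plain similarity returns both near-duplicates, whereas MMR replaces the second one with the less similar but still relevant document. Setting `lambda_mult=1.0` reduces MMR to plain similarity ranking.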



In [None]:
question = "what did they say about Reinforcement?"

Test performance on real vector store:

In [None]:
# Regular: potentially dupes
docs_ss = vectordb.similarity_search(question,k=3)
print(docs_ss[0].page_content[:100], "\n\n", docs_ss[1].page_content[:100])

In [None]:
# MMR Variant: no dupes!
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
print(docs_mmr[0].page_content[:100], "\n\n", docs_mmr[1].page_content[:100])

### Addressing specificity: working with metadata

The question may contain keywords (here: "third lecture") that are better handled as a metadata filter than as part of the semantic search.

In [None]:
question = "what did they say about regression in the third lecture?"

In [None]:
# Specify the source filter manually
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"}
)

for d in docs:
    print(d.metadata)
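
Conceptually, a metadata filter restricts the candidate set before the ranking happens. A minimal sketch of that idea, using plain dicts and a made-up keyword score in place of embeddings (all names and data are hypothetical):

```python
def filtered_search(docs, score_fn, k, metadata_filter=None):
    """Keep only documents whose metadata matches every filter entry, then return the top k by score."""
    metadata_filter = metadata_filter or {}
    eligible = [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(eligible, key=score_fn, reverse=True)[:k]


toy_docs = [
    {"text": "regression basics", "metadata": {"source": "A4.pdf", "page": 0}},
    {"text": "regression with regularisation", "metadata": {"source": "A5.pdf", "page": 2}},
    {"text": "neural networks", "metadata": {"source": "A4.pdf", "page": 1}},
]


def keyword_score(doc):
    return doc["text"].count("regression")  # stand-in for embedding similarity


hits = filtered_search(toy_docs, keyword_score, k=2, metadata_filter={"source": "A4.pdf"})
print([h["text"] for h in hits])  # ['regression basics', 'neural networks']
```

The chunk from `A5.pdf` is excluded even though it matches the query text, which is exactly the behaviour wanted for questions that name a specific document.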

### Addressing specificity: working with metadata using self-query retriever

To infer metadata from queries: use **SelfQueryRetriever**, which uses an LLM to extract
- *Query* string to use for vector search
- Metadata filter (supported by most vector databases / indexes)

Argued to be strongest if you can get the LLM to recognise (nested) metadata structures

In [None]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [None]:
# Specify fields in the metadata and what they refer to (important to make it as descriptive as possible as it will be passed in directly)
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf`, or `C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A5.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the report",
        type="integer",
    ),
]

In [None]:
# Describe what the document store contains
document_content_description = "Report notes"
llm = OpenAI(temperature = 0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [None]:
question = "what did they say about LSTMs in the first report?"

**You will receive a warning** about predict_and_parse being deprecated the first time you execute the next line. This can be safely ignored.

In [None]:
docs = retriever.get_relevant_documents(question)

In [None]:
for d in docs:
    print(d.metadata)

### Contextual compression
Extract only the relevant parts of each retrieved document and pass only those on to the final LLM call.
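
The idea can be illustrated without an LLM. In the sketch below a simple keyword match stands in for the LLM-based extractor, so it only shows the shape of the operation (a document goes in, a shorter query-specific document comes out); the function name, the stopword list and the sample text are made up.

```python
import re

STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a", "is"})


def compress(document, query):
    """Keep only the sentences that share at least one non-stopword with the query."""
    keywords = {w for w in re.findall(r"\w+", query.lower()) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    relevant = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(relevant)


page = ("The report covers CNNs. Reinforcement learning is discussed in chapter 4. "
        "The weather was nice.")
compressed = compress(page, "what did they say about reinforcement?")
print(compressed)  # Reinforcement learning is discussed in chapter 4.
```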

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [None]:
# Wrap our vectorstore / create compressor
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [None]:
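# Helper to print the retrieved documents (it was used below but never defined in this notebook)
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join(f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)))

# contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

# Note: documents are a lot shorter than the original ones (but there are dupes -> use MMR)
question = "what did they say about reinforcement?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)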

In [None]:
# Fix dupes using MMR
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr") 
)

question = "what did they say about reinforcement?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

### Other types of retrieval (non-vectordb based)

More like traditional NLP:
- SVM
- TF-IDF
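
To show what a TF-IDF retriever does without embeddings, here is a small self-contained sketch: each text becomes a sparse vector of term weights, and documents are ranked by cosine similarity to the query. It is an illustration of the technique with a smoothed IDF, not the code behind `TFIDFRetriever`; the sample texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tfidf_rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get a higher weight
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        return {t: counts[t] * idf[t] for t in counts if t in idf}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scores = [cosine(query_vec, vectorize(tokens)) for tokens in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


corpus = [
    "Recurrent networks process sequences.",
    "Convolutional networks process images.",
    "The weather is nice today.",
]
ranking = tfidf_rank("recurrent sequences", corpus)
print(ranking[0])  # 0
```

Unlike embeddings, this only matches on shared words, so a query using synonyms of the document's terms would score zero.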

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load PDF
loader = PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [None]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [None]:
# SVM
question = "What are major topics for this report?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

In [None]:
# TF-IDF (in example, result a bit worse)
question = "What are major topics for this report?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

---
## Part 5: Question answering

LLM process: a question comes in, the relevant chunks are looked up in the vector store, the chunks are passed to the LLM together with the question, and the LLM returns an answer.

In [None]:
import os
import openai
import sys
sys.path.append('../..')

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A5.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split in chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)

# Embedding
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

# Storing in Vectorstore
from langchain.vectorstores import Chroma
persist_directory = 'C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory = persist_directory # chroma-specific keyword
)

In [None]:
# Check if vectorstore created successfully
print(vectordb._collection.count())

In [None]:
# Similarity search
question = "What are major topics for this report?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

In [None]:
# Initialise language model
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name= "gpt-3.5-turbo", temperature=0)

### RetrievalQA chain

In [None]:
from langchain.chains import RetrievalQA

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [None]:
result = qa_chain({"query": question})

In [None]:
result["result"]

Example with a custom prompt: the chain fills the retrieved documents and the question into the template and passes the result to the model.

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [None]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
question = "Is deep learning a topic in this document?"

In [None]:
result = qa_chain({"query": question})

In [None]:
result["result"]

In [None]:
result["source_documents"][0]

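### RetrievalQA chain types
The default *stuff* method puts all retrieved documents into a single prompt, so it is limited by the context window. The alternatives below generally make more LLM calls, but they can handle an arbitrary number of documents (scalability):
- Map reduce
- Refine

(more exist)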
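
The difference between the three strategies is mostly control flow, which can be sketched with a stub in place of the LLM. The prompts and function names below are invented for illustration and are not the prompts LangChain uses; the point is the number and order of LLM calls.

```python
def stuff(llm, docs, question):
    """One call: all documents go into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(llm, docs, question):
    """One call per document (independent of each other), plus one call to combine the answers."""
    partial = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    return llm("Combine these partial answers:\n" + "\n".join(partial) + f"\n\nQuestion: {question}")


def refine(llm, docs, question):
    """One call per document, each building on the previous answer (sequential)."""
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for d in docs[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context:\n{d}\n\nRefine the answer to: {question}")
    return answer


def count_calls(strategy, docs, question="q"):
    """Run a strategy against a stub LLM and report how many calls it made."""
    calls = []

    def fake_llm(prompt):
        calls.append(prompt)
        return f"answer {len(calls)}"

    strategy(fake_llm, docs, question)
    return len(calls)


three_docs = ["doc A", "doc B", "doc C"]
call_counts = {s.__name__: count_calls(s, three_docs) for s in (stuff, map_reduce, refine)}
print(call_counts)  # {'stuff': 1, 'map_reduce': 4, 'refine': 3}
```

The per-document calls in map reduce do not depend on each other and could run in parallel, while refine is inherently sequential.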

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

In [None]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

Note: this chain is stateless; in other words, it keeps no memory of earlier questions or answers. The next section covers this extension.

---
## Part 6: Chat
Adds concept of chat history (memory/context) - useful for follow-up questions
- Retrieval methods become useful here

In [None]:
import os
import openai
import sys
sys.path.append('../..')

import panel as pn  # GUI
pn.extension()

# Connecting to account
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("C:/Users/richi/OneDrive/Documents/OpenAI API practice/openai_api_key.env")) # read local .env file
openai.api_key = os.environ["OPENAI_API_KEY"]

- The code below pins the OpenAI LLM version that was used when the course was filmed, until that version is deprecated (scheduled for September 2023 at the time of writing).
- LLM responses always vary somewhat, but they may differ significantly when a different model version is used.

In [None]:
import datetime
current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9, 2):
    llm_name = "gpt-3.5-turbo-0301"
else:
    llm_name = "gpt-3.5-turbo"
print(llm_name)

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"),
    PyPDFLoader("C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A5.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split in chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)

# Embedding
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

# Load the vectorstore that was persisted earlier instead of embedding everything again
from langchain.vectorstores import Chroma
persist_directory = 'C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/'
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
# Check if vectorstore created successfully
print(vectordb._collection.count())

In [None]:
# Initialise language model
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)
llm.predict("Hello world!")

In [None]:
# Build prompt
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)

# Run chain
from langchain.chains import RetrievalQA
question = "Is reinforcement a class topic?"
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectordb.as_retriever(),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})


result = qa_chain({"query": question})
result["result"]

### Memory

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key = "chat_history",
    return_messages = True # return the chat history as a list of messages instead of a single string
)

### ConversationalRetrievalChain

- Goes beyond the RetrievalQA chain by *adding a step that takes the chat history and the new question, condenses them into a standalone question, and passes that to the vectorstore to look up relevant documents*.
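
The extra condensing step can be sketched with stubs for the LLM and the retriever. This shows the flow only; the prompts, the function name and the stubs are assumptions made for this illustration, not LangChain internals.

```python
def conversational_answer(llm, retrieve, question, chat_history):
    """Condense history + follow-up into a standalone question, retrieve with it, then answer."""
    if chat_history:
        history = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        standalone = llm(
            f"Chat history:\n{history}\nFollow-up: {question}\n"
            "Rephrase the follow-up as a standalone question:"
        )
    else:
        standalone = question  # nothing to condense on the first turn
    docs = retrieve(standalone)
    answer = llm("Context:\n" + "\n\n".join(docs) + f"\n\nQuestion: {standalone}")
    chat_history.append((question, answer))
    return answer, standalone


# Stubs that record what the retriever is asked
queries = []


def fake_retrieve(query):
    queries.append(query)
    return ["doc"]


def fake_llm(prompt):
    return "STANDALONE" if "standalone question" in prompt else "ANSWER"


history = []
conversational_answer(fake_llm, fake_retrieve, "Is reinforcement a topic?", history)
conversational_answer(fake_llm, fake_retrieve, "how does it work?", history)
print(queries)  # ['Is reinforcement a topic?', 'STANDALONE']
```

The second retrieval uses the condensed question rather than the raw follow-up "how does it work?", which on its own would not tell the vectorstore what "it" refers to.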

In [None]:
from langchain.chains import ConversationalRetrievalChain
retriever=vectordb.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [None]:
# Question
question = "Is Reinforcement a topic in these reports?"
result = qa({"question": question})
result['answer']

In [None]:
# Follow-up question
question = "how does it work?"
result = qa({"question": question})
result['answer']

---
## Part 7: User Interface (UI)
Creating a chatbot that can interact with document uploads

- Alternate memory/retrievers by configuring them in `load_db` and `convchain`
- Extend GUI with panel (https://panel.holoviz.org/) and param (https://param.holoviz.org/)
- Source: https://github.com/sophiamyang/tutorials-LangChain

In [None]:
input_document_directory = "C:/Users/richi/OneDrive/Documents/OpenAI API practice/Langchain chat with your data/DL_A4.pdf"

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA,  ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader

Full walkthrough of the whole process (covered above):
- Note: memory is not passed into the chain here; the chat history is managed outside the chain, which is convenient for the GUI.

In [None]:
def load_db(file, chain_type, k):
    # load documents
    loader = PyPDFLoader(file)
    documents = loader.load()
    
    # split documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(documents)
    
    # define embedding
    embeddings = OpenAIEmbeddings()
    
    # create vector database from data
    db = DocArrayInMemorySearch.from_documents(docs, embeddings)
    
    # define retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    
    # create a chatbot chain. Memory is managed externally.
    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        chain_type=chain_type, 
        retriever=retriever, 
        return_source_documents=True,
        return_generated_question=True,
    )
    return qa 

In [None]:
import panel as pn
import param

class cbfs(param.Parameterized):
    chat_history = param.List([])
    answer = param.String("")
    db_query  = param.String("")
    db_response = param.List([])
    
    def __init__(self,  **params):
        super(cbfs, self).__init__( **params)
        self.panels = []
        self.loaded_file = input_document_directory # input document is passed in here
        self.qa = load_db(self.loaded_file,"stuff", 4)
    
    def call_load_db(self, count):
        if count == 0 or file_input.value is None:  # init or no file specified :
            return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")
        else:
            file_input.save("temp.pdf")  # local copy
            self.loaded_file = file_input.filename
            button_load.button_style="outline"
            self.qa = load_db("temp.pdf", "stuff", 4)
            button_load.button_style="solid"
        self.clr_history()
        return pn.pane.Markdown(f"Loaded File: {self.loaded_file}")

    def convchain(self, query):
        if not query:
            return pn.WidgetBox(pn.Row('User:', pn.pane.Markdown("", width=600)), scroll=True)
        result = self.qa({"question": query, "chat_history": self.chat_history}) # Passing in chat history
        self.chat_history.extend([(query, result["answer"])]) # Extending query with result of context
        self.db_query = result["generated_question"]
        self.db_response = result["source_documents"]
        self.answer = result['answer'] 
        self.panels.extend([
            pn.Row('User:', pn.pane.Markdown(query, width=600)),
            pn.Row('ChatBot:', pn.pane.Markdown(self.answer, width=600, styles={'background-color': '#F6F6F6'}))
        ])
        inp.value = ''  #clears loading indicator when cleared
        return pn.WidgetBox(*self.panels,scroll=True)

    @param.depends('db_query')
    def get_lquest(self):
        if not self.db_query:
            return pn.Column(
                pn.Row(pn.pane.Markdown(f"Last question to DB:", styles={'background-color': '#F6F6F6'})),
                pn.Row(pn.pane.Str("no DB accesses so far"))
            )
        return pn.Column(
            pn.Row(pn.pane.Markdown(f"DB query:", styles={'background-color': '#F6F6F6'})),
            pn.pane.Str(self.db_query )
        )

    @param.depends('db_response', )
    def get_sources(self):
        if not self.db_response:
            return 
        rlist=[pn.Row(pn.pane.Markdown(f"Result of DB lookup:", styles={'background-color': '#F6F6F6'}))]
        for doc in self.db_response:
            rlist.append(pn.Row(pn.pane.Str(doc)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    @param.depends('convchain', 'clr_history') 
    def get_chats(self):
        if not self.chat_history:
            return pn.WidgetBox(pn.Row(pn.pane.Str("No History Yet")), width=600, scroll=True)
        rlist=[pn.Row(pn.pane.Markdown(f"Current Chat History variable", styles={'background-color': '#F6F6F6'}))]
        for exchange in self.chat_history:
            rlist.append(pn.Row(pn.pane.Str(exchange)))
        return pn.WidgetBox(*rlist, width=600, scroll=True)

    def clr_history(self,count=0):
        self.chat_history = []
        return 

Create the Chatbot:

In [None]:
cb = cbfs()

file_input = pn.widgets.FileInput(accept='.pdf')
button_load = pn.widgets.Button(name="Load DB", button_type='primary')
button_clearhistory = pn.widgets.Button(name="Clear History", button_type='warning')
button_clearhistory.on_click(cb.clr_history)
inp = pn.widgets.TextInput( placeholder='Enter text here…')

bound_button_load = pn.bind(cb.call_load_db, button_load.param.clicks)
conversation = pn.bind(cb.convchain, inp) 

jpg_pane = pn.pane.Image( './img/convchain.jpg')

tab1 = pn.Column(
    pn.Row(inp),
    pn.layout.Divider(),
    pn.panel(conversation,  loading_indicator=True, height=300),
    pn.layout.Divider(),
)
tab2= pn.Column(
    pn.panel(cb.get_lquest),
    pn.layout.Divider(),
    pn.panel(cb.get_sources ),
)
tab3= pn.Column(
    pn.panel(cb.get_chats),
    pn.layout.Divider(),
)
tab4=pn.Column(
    pn.Row( file_input, button_load, bound_button_load),
    pn.Row( button_clearhistory, pn.pane.Markdown("Clears chat history. Can use to start a new topic" )),
    pn.layout.Divider(),
    pn.Row(jpg_pane.clone(width=400))
)
dashboard = pn.Column(
    pn.Row(pn.pane.Markdown('# ChatWithYourData_Bot')),
    pn.Tabs(('Conversation', tab1), ('Database', tab2), ('Chat History', tab3),('Configure', tab4))
)
dashboard

---
## Extra: Info for Langchain plus
If you wish to experiment on the `LangChain plus platform` (which adds debugging capabilities):

 * Go to [langchain plus platform](https://www.langchain.plus/) and sign up
 * Create an API key from your account's settings
 * Use this API key in the code below
 * Uncomment the code

Note: the endpoint in the video differs from the one below. Use the one below.

In [None]:
#import os
#os.environ["LANGCHAIN_TRACING_V2"] = "true"
#os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
#os.environ["LANGCHAIN_API_KEY"] = "..." # replace dots with your api key