<a href="https://colab.research.google.com/github/LadyHermitage/Langchain_Berkshire_chatbot/blob/main/LangChain_chatbot_BerkshireHathaway.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a chatbot on select data (10 years of Berkshire Hathaway letters)

MY STORY
When I worked as a Knowledge Manager on an infrastructure team at Square/Block, I could see the role AI would play in my field. All of our internal documentation could be queried by a chatbot. No more hunting and pecking or pinging people on Slack. Information operations would become more efficient and comprehensive.

Alas, tech layoffs! I did not have the opportunity to implement this, but I still had my enthusiasm and curiosity. Over the next 3 months, I learned about ML > generative AI > NLP > LLMs > Python > Pandas > SQL > IDEs > LangChain and more!

Now I understand what it takes to create this tool (thanks to Codecademy, Kaggle and DeepLearning.AI), and it looks a little something like this... hit it!

In [None]:
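# Set up with your OpenAI API key
import os
import sys

import openai
from dotenv import load_dotenv, find_dotenv
from langchain.chat_models import ChatOpenAI

sys.path.append('../..')  # make modules two directories up importable

_ = load_dotenv(find_dotenv())  # read the local .env file, if one exists

# Raises KeyError if OPENAI_API_KEY is not set in the environment or in .env
openai.api_key = os.environ['OPENAI_API_KEY']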

In [None]:
#! pip install langchain

In [None]:
#! pip install pypdf

In [None]:
from langchain.document_loaders import PyPDFLoader

# Load your data
loaders = [
    # You need to update to the location on your machine
    PyPDFLoader("/BerkshireHathaway_letters/2013ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2014ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2015ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2016ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2017ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2018ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2019ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2020ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2021ltr.pdf"),
    PyPDFLoader("/BerkshireHathaway_letters/2022ltr.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
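
The ten loader lines above differ only in the year, so the list of paths could also be generated. This is a minimal sketch, assuming every file follows the `<year>ltr.pdf` naming shown above; `letter_paths` is a hypothetical helper written for this example, not part of LangChain.

```python
from pathlib import PurePosixPath


def letter_paths(folder, first_year, last_year):
    """Return one PDF path per year, following the <year>ltr.pdf naming."""
    return [
        str(PurePosixPath(folder) / f"{year}ltr.pdf")
        for year in range(first_year, last_year + 1)
    ]


# The folder is the placeholder location used in the cell above.
paths = letter_paths("/BerkshireHathaway_letters", 2013, 2022)
print(len(paths), paths[0], paths[-1])
```

Each path can then be passed to `PyPDFLoader(path)` exactly as in the cell above.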

In [None]:
# Split the docs into overlapping chunks (chunk_size and chunk_overlap are in characters)
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
splits = text_splitter.split_documents(docs)
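
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this sketch does not attempt. `split_with_overlap` is a hypothetical name, and the tiny sizes are chosen so the overlap is easy to spot.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap gives a sentence that straddles a boundary a chance to appear whole in at least one chunk.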


In [None]:
# Set up embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [None]:
#! pip install chromadb

In [None]:
# Set up the vector store
from langchain.vectorstores import Chroma
# You need to update to the location on your machine
persist_directory = '/docs/chroma/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()
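
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings sit closest to the query's embedding. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real OpenAI embeddings have far more dimensions, and Chroma uses its own index and distance settings, which are not reproduced here. `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data invented for this example; these are not real embeddings.
stored = [
    ("insurance float", [0.9, 0.1, 0.0]),
    ("railroad capital spending", [0.1, 0.9, 0.1]),
    ("share repurchases", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]
best = top_k(query, stored, k=2)
print(best)
```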

In [None]:
# Set up a baseline question-answering chain that uses the vector store as its retriever
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever()
)

In [None]:
# Set up the prompt template (the chain that uses it is built in the next cell)
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
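
The `{context}` and `{question}` placeholders in the template are filled in at query time: the retrieved chunks are joined into the context, and the user's question is dropped in below them. The sketch below imitates that step with plain `str.format` on a shortened copy of the template. The two chunk strings are placeholders, and joining them with a blank line is an assumption about how the chain combines documents.

```python
# Shortened copy of the notebook's template, for illustration only.
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Hypothetical retrieved chunks; in the notebook these come from the retriever.
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

filled = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is The Secret Sauce?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to check what the model is actually being asked.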

In [None]:
# Set up the chain with the custom prompt (this replaces the earlier qa_chain)
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
# Finally, enter your question!
question = "What is The Secret Sauce?"

In [None]:
# Run the chain; the answer text is stored under the "result" key
result = qa_chain({"query": question})
result["result"]

In [None]:
# Check the source doc
result["source_documents"][0]
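
Looking only at the first source document can hide the fact that several chunks were retrieved. The sketch below lists every source once, assuming each document carries `source` and `page` entries in its `metadata` dictionary, as documents loaded from PDFs typically do. `describe_sources` is a hypothetical helper, and the stand-in objects and their page values are invented so the example runs without an API call.

```python
from types import SimpleNamespace


def describe_sources(documents):
    """Return one 'source, page N' line per document, skipping repeats."""
    seen = []
    for doc in documents:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        line = f"{source}, page {page}"
        if line not in seen:
            seen.append(line)
    return seen


# Stand-ins for result["source_documents"]; metadata values are made up.
fake_docs = [
    SimpleNamespace(metadata={"source": "/BerkshireHathaway_letters/2016ltr.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "/BerkshireHathaway_letters/2016ltr.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "/BerkshireHathaway_letters/2020ltr.pdf", "page": 7}),
]
sources = describe_sources(fake_docs)
for line in sources:
    print(line)
```

In the notebook, the same helper could be called as `describe_sources(result["source_documents"])`.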