<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0"> </div>
    <div style="float: left; margin-left: 10px;"> <h1>LangChain for Generative AI</h1>
<h1>ChatBot</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

import langchain

from langchain.prompts import PromptTemplate
from langchain.document_loaders import GutenbergLoader

from langchain.memory import ConversationBufferMemory
from langchain.memory.chat_message_histories.in_memory import ChatMessageHistory

from langchain.schema import messages_from_dict, messages_to_dict

from langchain.agents import Tool
from langchain.agents import initialize_agent
from langchain.agents import AgentType

from langchain.chains import LLMChain, ConversationalRetrievalChain, ConversationChain
from langchain.chains import RetrievalQA

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

import langchain_openai
from langchain_openai import ChatOpenAI

import tempfile

import watermark

%load_ext watermark
%matplotlib inline

We start by printing the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 23.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 0b932bf32c4e9b1e4ff9126fc7e8e7ac7b4205ff

json            : 2.0.9
watermark       : 2.4.3
langchain_openai: 0.1.8
matplotlib      : 3.8.0
langchain       : 0.2.2
pandas          : 2.2.3
numpy           : 1.26.4



Load default figure style

In [3]:
plt.style.use('d4sci.mplstyle')

# Start

In [5]:
cache_dir = "./cache"

In [6]:
loader = GutenbergLoader(
    "https://www.gutenberg.org/cache/epub/1513/pg1513.txt"
)

document = loader.load()

extrait = ' '.join(document[0].page_content.split()[:100])
display(extrait + " .......")

'The Project Gutenberg eBook of Romeo and Juliet This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: Romeo and Juliet Author: William Shakespeare Release date: November .......'

In [7]:
text_splitter = CharacterTextSplitter(
    chunk_size=1024,  # Maximum size of each chunk, in characters
    chunk_overlap=128  # Neighboring chunks overlap by up to 128 characters
) 

texts = text_splitter.split_documents(document)
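To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking written in plain Python. It is a simplification: it cuts fixed-size character windows, whereas `CharacterTextSplitter` first splits on a separator and then merges the pieces up to the size limit. The helper name `split_with_overlap` and the sample text are hypothetical and used only for illustration.

```python
def split_with_overlap(text, chunk_size=1024, chunk_overlap=128):
    """Cut text into character windows that share chunk_overlap characters.

    Simplified illustration only; this is not how CharacterTextSplitter
    is implemented.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
    return chunks


# Hypothetical sample text: 3000 characters cycling through the alphabet
sample = "".join(chr(ord("a") + i % 26) for i in range(3000))
chunks = split_with_overlap(sample)

print([len(chunk) for chunk in chunks])
print(chunks[0][-128:] == chunks[1][:128])
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.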

In [8]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"

embeddings = HuggingFaceEmbeddings(
    model_name=model_name, 
    cache_folder=cache_dir
)  # Use a pre-cached model



In [9]:
vectordb = Chroma.from_documents(
    texts, 
    embeddings, 
    persist_directory=cache_dir
)

In [10]:
question = "Romeo!"

docs = vectordb.similarity_search(question, k=2)
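Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates the idea with hand-made three-dimensional vectors and cosine similarity. Both are assumptions made for illustration: real embeddings from this model have hundreds of dimensions, and the metric Chroma actually uses depends on how the collection is configured. The names `cosine_similarity`, `top_k`, `toy_docs`, and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # Most similar first
    return [idx for _, idx in scored[:k]]


# Hypothetical "embeddings" for four chunks and one query
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

A production vector store replaces the exhaustive loop with an approximate nearest-neighbor index so that the search stays fast as the number of chunks grows.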

In [11]:
# Check how many documents were returned
print(len(docs))

2


In [12]:
# Check the content of the two returned documents
print(docs[0].page_content)
print("="*20)
print(docs[1].page_content)

Romeo! My cousin Romeo! Romeo!





MERCUTIO.


He is wise,


And on my life hath stol’n him home to bed.





BENVOLIO.


He ran this way, and leap’d this orchard wall:


Call, good Mercutio.





MERCUTIO.


Nay, I’ll conjure too.
====================
Romeo! My cousin Romeo! Romeo!





MERCUTIO.


He is wise,


And on my life hath stol’n him home to bed.





BENVOLIO.


He ran this way, and leap’d this orchard wall:


Call, good Mercutio.





MERCUTIO.


Nay, I’ll conjure too.


Wrap the vector database in a retriever, a standard interface that chains can call to fetch the documents most similar to a query.

In [13]:
retriever = vectordb.as_retriever()
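The retriever is a thin adapter: it hides the details of the store behind a single "query in, documents out" call. The sketch below mimics that idea with toy classes. `ToyVectorStore` and `ToyRetriever` are hypothetical stand-ins rather than LangChain classes, and the word-overlap ranking is only a placeholder for a real embedding search.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vector store; ranks texts by shared words."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        # Score each text by how many query words it contains
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class ToyRetriever:
    """Exposes any store with a similarity_search method through invoke()."""

    def __init__(self, store, k=4):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


toy_store = ToyVectorStore([
    "romeo leaps the orchard wall",
    "juliet is a capulet",
    "mercutio is a friend of romeo",
])
toy_retriever = ToyRetriever(toy_store, k=1)
print(toy_retriever.invoke("orchard wall"))
```

Because chains depend only on the retriever interface, the underlying store can be swapped without changing the rest of the pipeline.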

In [25]:
llm = ChatOpenAI(model='gpt-4o', temperature=0)

In [26]:
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 

{context}

Question: {question}

Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [27]:
pprint(QA_CHAIN_PROMPT.template)

('Use the following pieces of context to answer the question at the end. If '
 "you don't know the answer, just say that you don't know, don't try to make "
 'up an answer. Use three sentences maximum. Keep the answer as concise as '
 'possible. Always say "thanks for asking!" at the end of the answer. \n'
 '\n'
 '{context}\n'
 '\n'
 'Question: {question}\n'
 '\n'
 'Helpful Answer:')
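The `{context}` and `{question}` placeholders are filled in when the chain runs. With the default f-string style template this behaves much like Python's own `str.format`, which the sketch below uses on a shortened, hypothetical copy of the template. The context and question strings are made up for illustration.

```python
# Shortened, hypothetical version of the prompt template
toy_template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""

# Made-up values standing in for retrieved text and the user's query
filled_prompt = toy_template.format(
    context="MERCUTIO. Nay, I'll conjure too.",
    question="Who is Mercutio?",
)

print(filled_prompt)
```

The text that reaches the model is therefore one plain string: instructions, retrieved context, the question, and the trailing "Helpful Answer:" cue.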


In [28]:
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
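By default `RetrievalQA.from_chain_type` builds a "stuff" chain: it retrieves the relevant chunks, stuffs them all into the prompt's `{context}` slot, and sends the result to the LLM in a single call. The sketch below reproduces that flow with stub functions so it runs without an API key. `fake_retrieve`, `fake_llm`, and `answer_with_stuffing` are hypothetical names, and the stub answer is not real model output.

```python
toy_template = "{context}\n\nQuestion: {question}\n\nHelpful Answer:"
seen_prompts = []  # Records every prompt the stub LLM receives


def fake_retrieve(question):
    """Stub retriever: always returns the same two made-up chunks."""
    return [
        "Juliet is the daughter of Capulet.",
        "Romeo is the son of Montague.",
    ]


def fake_llm(prompt):
    """Stub LLM: records the prompt and returns a fixed placeholder answer."""
    seen_prompts.append(prompt)
    return "stub answer"


def answer_with_stuffing(question, retrieve, llm, template):
    """Retrieve chunks, stuff them into the prompt, and call the LLM once."""
    docs = retrieve(question)
    context = "\n\n".join(docs)  # All chunks go into a single context block
    prompt = template.format(context=context, question=question)
    return {"query": question, "result": llm(prompt), "source_documents": docs}


toy_result = answer_with_stuffing(
    "What is Juliet's family?", fake_retrieve, fake_llm, toy_template
)
print(toy_result["result"])
print(seen_prompts[0])
```

Stuffing is the simplest strategy and works well while the retrieved chunks fit in the model's context window. With many or very long chunks, the combined prompt can exceed that limit.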

In [29]:
query = "What is Juliet's family?"

query_results_juliet = qa.invoke(query)
print("#" * 12)
query_results_juliet['result']

############


"Juliet's family is the Capulet family. Thanks for asking!"

In [30]:
query = "What happens to Romeo and Juliet?"
query_results_romeo = qa.invoke(query)
print("#" * 12)
query_results_romeo['result']

############


'Romeo and Juliet both die, leading to a reconciliation between their feuding families, the Capulets and the Montagues. Thanks for asking!'

In [31]:
query = "Who is Mercutio?"
query_results_romeo = qa.invoke(query)
print("#" * 12)
query_results_romeo['result']

############


'Mercutio is a character in William Shakespeare\'s play "Romeo and Juliet." He is a close friend of Romeo and is known for his witty and playful nature. Thanks for asking!'

In [32]:
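# Build a second chain that also returns the retrieved source documents
qa_chain_docs = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,  # Include the retrieved chunks in the output
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# Reuse the earlier similarity-search question ("Romeo!") so the sources can be compared
result = qa_chain_docs.invoke({"query": question})
result["result"]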

'It seems you are quoting from a scene in "Romeo and Juliet" where Benvolio and Mercutio are looking for Romeo after he has leapt over the orchard wall. They believe he has gone home to bed, but he is actually hiding nearby. Thanks for asking!'

In [33]:
len(result['source_documents'])

4

In [34]:
print(result['source_documents'][2].page_content)

Romeo! My cousin Romeo! Romeo!





MERCUTIO.


He is wise,


And on my life hath stol’n him home to bed.





BENVOLIO.


He ran this way, and leap’d this orchard wall:


Call, good Mercutio.





MERCUTIO.


Nay, I’ll conjure too.


<center>
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>