<a href="https://colab.research.google.com/github/Iftekhar-mobin/AI_NLP_Journey/blob/main/LANGCHAIN_LLM_RAG_OPENAI_GutenBurg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

# Placeholder value: replace "Fill" with your own Hugging Face Hub token.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "Fill"

In [None]:
%%capture

!pip install chromadb==0.4.10 tiktoken==0.3.3 sqlalchemy==2.0.15
!pip install langchain==0.0.249
!pip install --force-reinstall pydantic==1.10.6
!pip install sentence_transformers

In [None]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, ConversationalRetrievalChain, ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.schema import messages_from_dict, messages_to_dict
from langchain.memory.chat_message_histories.in_memory import ChatMessageHistory
from langchain.agents import Tool
from langchain.agents import initialize_agent
from langchain.agents import AgentType

In [None]:
cache_dir = "./cache"

In [None]:
import pandas as pd
# Use the full option names so the display settings are unambiguous.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

In [None]:
from langchain.document_loaders import GutenbergLoader

loader = GutenbergLoader(
    "https://www.gutenberg.org/cache/epub/100/pg100.txt"
)

document = loader.load()

extrait = ' '.join(document[0].page_content.split()[:100])
display(extrait + " .......")

'The Project Gutenberg eBook of The Complete Works of William Shakespeare This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: The Complete Works of William Shakespeare .......'

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import tempfile

# Split the text into chunks of up to 1024 characters with a 256-character overlap.
# Building the vector database index from these chunks (next cell) takes approx. 10 minutes with this embedding model.
text_splitter = CharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=256
)
texts = text_splitter.split_documents(document)
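
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal sketch. It assumes a plain fixed-width sliding window over characters; `CharacterTextSplitter` itself first splits on a separator (by default a blank line) and then merges the pieces up to `chunk_size`, so its real chunk boundaries will differ. The helper name `sliding_window_chunks` and the toy values are hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy values (assumption): 10-character chunks with a 4-character overlap.
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.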

In [None]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    cache_folder=cache_dir
)  # Use a pre-cached model

vectordb = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory=cache_dir
)


In [None]:
question = "Romeo!"

docs = vectordb.similarity_search(question,k=2)

# Check how many documents were retrieved (should equal k)
print(len(docs))

# Check the content of the first document
print(docs[0].page_content)

# Persist the database to use it later
vectordb.persist()

2
Romeo! My cousin Romeo! Romeo!

MERCUTIO.
He is wise,
And on my life hath stol’n him home to bed.

BENVOLIO.
He ran this way, and leap’d this orchard wall:
Call, good Mercutio.

MERCUTIO.
Nay, I’ll conjure too.
Romeo! Humours! Madman! Passion! Lover!
Appear thou in the likeness of a sigh,
Speak but one rhyme, and I am satisfied;
Cry but ‘Ah me!’ Pronounce but Love and dove;
Speak to my gossip Venus one fair word,
One nickname for her purblind son and heir,
Young Abraham Cupid, he that shot so trim
When King Cophetua lov’d the beggar-maid.
He heareth not, he stirreth not, he moveth not;
The ape is dead, and I must conjure him.
I conjure thee by Rosaline’s bright eyes,
By her high forehead and her scarlet lip,
By her fine foot, straight leg, and quivering thigh,
And the demesnes that there adjacent lie,
That in thy likeness thou appear to us.

BENVOLIO.
An if he hear thee, thou wilt anger him.
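
Conceptually, `similarity_search(question, k=2)` embeds the question and returns the `k` chunks whose embeddings are closest to it. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. Both are assumptions made for illustration: the real embeddings come from the sentence-transformers model, and Chroma's index and distance metric may differ. The names `cosine_similarity`, `top_k`, `toy_docs` and `toy_query` are hypothetical.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumption): one vector per chunk.
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.2, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
nearest_indices = [index for index, _ in nearest]
print(nearest_indices)
```

A brute-force scan like this is linear in the number of chunks; vector databases exist largely to avoid that scan on large collections.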






In [None]:
from langchain.llms import HuggingFacePipeline

# Convert the vector store into a retriever: a thin wrapper around the
# vector database that searches for chunks similar to a query and returns them.
retriever = vectordb.as_retriever()

# The QA chain over the document needs three things:
# 1 - An LLM to interpret the question and generate the answer
# 2 - A vector database that can perform document retrieval
# 3 - A specification of how to combine the retrieved data (the chain type)

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        # "temperature": 0,
        "do_sample": True,  # sampling is enabled, so answers can differ between runs
        "max_length": 2048,
        "cache_dir": cache_dir,
    },
)
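
Because `do_sample` is set to `True`, the model draws each token at random from its predicted distribution instead of always taking the most likely one. The sketch below illustrates the difference on a toy list of logits. It is a simplified picture under stated assumptions, not the transformers implementation; `softmax`, `pick_token` and `toy_logits` are hypothetical names.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def pick_token(logits: List[float], do_sample: bool, temperature: float = 1.0,
               rng: Optional[random.Random] = None) -> int:
    """Greedy decoding returns the argmax; sampling draws from the softmax."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    probabilities = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]


toy_logits = [2.0, 1.5, 0.1]  # toy scores for three candidate tokens

greedy_picks = [pick_token(toy_logits, do_sample=False) for _ in range(5)]
sampled_picks = [pick_token(toy_logits, do_sample=True, rng=random.Random(seed))
                 for seed in range(5)]
print("greedy: ", greedy_picks)
print("sampled:", sampled_picks)
```

Greedy decoding is repeatable, while sampling can pick a different token on each run, which is one reason the answers in the cells below may not be reproduced exactly.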




In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="refine",
    retriever=retriever
)
query = "Who is the main character in the Merchant of Venice?"
query_results_venice = qa.run(query)
print("#" * 12)
query_results_venice

############


'Antonio, Salarino and Solanio'
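
The `"refine"` chain type handles the retrieved chunks one at a time: it drafts an answer from the first chunk, then asks the LLM to revise that draft with each further chunk. The control flow can be sketched as follows. The prompt wording, the `refine_answer` helper and the `fake_llm` stand-in are all hypothetical; LangChain uses its own prompt templates and calls the real model.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Draft an answer from the first chunk, then revise it once per remaining chunk."""
    if not chunks:
        raise ValueError("at least one retrieved chunk is required")
    initial_prompt = f"Context:\n{chunks[0]}\n\nQuestion: {question}\nAnswer:"
    answer = llm(initial_prompt)
    for chunk in chunks[1:]:
        refine_prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"Additional context:\n{chunk}\n"
            "Refine the existing answer if the new context helps:"
        )
        answer = llm(refine_prompt)
    return answer


# Stand-in for the LLM (assumption): records each prompt and returns a numbered answer.
recorded_prompts: List[str] = []


def fake_llm(prompt: str) -> str:
    recorded_prompts.append(prompt)
    return f"answer-{len(recorded_prompts)}"


final_answer = refine_answer(
    "Who is the main character?",
    ["chunk A", "chunk B", "chunk C"],
    fake_llm,
)
print(final_answer, "after", len(recorded_prompts), "LLM calls")
```

One consequence of this design is that the chain makes one LLM call per retrieved chunk, and a weak model can degrade a good draft when a later chunk is irrelevant.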

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="refine",
    retriever=retriever
)
query = "What happens to Romeo and Juliet?"
query_results_romeo = qa.run(query)
print("#" * 12)
query_results_romeo

############


'Willing to marry all that they have, and to save her life from her second marriage to Tybalt. Romeo cannot marry her, and therefore is betrothed to Juliet. Then she asks in vain for Romeo to find his wife Juliet who will not be deceived for her. Romeo has the help of a handsome young man, who makes her promise to help her find Tybalt. But Juliet is not to be conjoint. She is to die. Romeo accepts his fate, and forges for her a good death.'

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="refine",
    retriever=retriever
)
query = "When does King John die?"
query_results_john = qa.run(query)
print("#" * 12)
query_results_john

############


'John the Black Prince died before his father And left behind him Richard, his only son, Who after Edward the Third’s death reigned as king, Till Henry Bolingbroke, Duke of Lancaster, Crowned by the name of Henry the Fourth, Seized on the realm, deposed the rightful king, Sent his poor queen to France, from whence she came, And him to Pomfret; where, as all you know, Harmless Richard was murdered traitorously.'