<a href="https://colab.research.google.com/github/MudasirRasheed1/Google-Colab/blob/main/Copy_of_lc_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LEAP - LangChain Workshop

Notebook for our workshop on [LangChain](https://www.langchain.com/), a framework for building LLM-powered applications.

### Installing Dependencies

In [None]:
!pip install langchain huggingface_hub chromadb pypdf sentence-transformers



### Imports
- `RecursiveCharacterTextSplitter` splits text recursively. It tries a list of separators in order, from coarsest to finest (paragraphs -> sentences -> words), until the chunks are small enough.

- `PyPDFLoader` is used to load the pdf document into the `Document` format.

- `Chroma` is the vector store that we use for storing and retrieving vector embeddings.

- `HuggingFaceEmbeddings` is used for computing embeddings of both the document chunks and the queries.

- `HuggingFaceHub` allows us to access pre-trained models from Hugging Face Hub.

- `ConversationalRetrievalChain` is the chain that we use to have "conversations" with a document.
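
To make the splitting order concrete, here is a minimal pure-Python sketch of the idea behind recursive splitting. It is not LangChain's implementation: the function name `recursive_split` and its default separators are assumptions for illustration, and the real splitter also handles chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)  # flush what has accumulated so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph is broken at word boundaries only because no coarser separator could make it fit.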

[API Reference](https://api.python.langchain.com/en/latest/langchain_api_reference.html#)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.llms import HuggingFaceHub
from langchain.chains import ConversationalRetrievalChain

### Setting up the environment

In [None]:
import os
from getpass import getpass

# Prompt for the Hugging Face API token at runtime; never hard-code tokens in a shared notebook.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face API token: ")

# If the token is being rate-limited, re-run this cell and enter a different one.

### Loading & Splitting the Document


In [None]:
loader = PyPDFLoader("./Untitled document.pdf")
docs = loader.load()

[Document(page_content='Christopher\nHenry\nGayle\nOD\n(born\n18-06-2006)\nis\na\neurope\ncricketer\nwho\nhas\nplayed\ninternational\ncricket\nfor\nthe\nIndies\nfrom\n1999\nto\n2021.\n[1]\nNicknamed\n"The\nUniverse\nBoss",\nGayle\nis\nwidely\nregarded\nas\nthe\ngreatest\nbatsman\never\nto\nhave\nplayed\nTwenty20\ncricket\n.\n[2]\n[3]\nHe\nplayed\na\ncrucial\nrole\nin\nthe\nWest\nIndies\nteams\nthat\nwon\nthe\n2004\nICC\nChampions\nTrophy\n,\n2012\nICC\nWorld\nTwenty20\nand\n2016\nICC\nWorld\nTwenty20\n.\nHe\nis\nthe\nonly\nbatsman\nto\nscore\na\ncentury\nin\nT20I,\na\ndouble\ncentury\nin\nOne\nDay\nInternationals\nand\na\ntriple\ncentury\nin\nfootball', metadata={'source': './Untitled document.pdf', 'page': 0})]


### Generating & Storing Embeddings

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=128)
split_docs = splitter.split_documents(docs)
print(docs)

[Document(page_content='Christopher\nHenry\nGayle\nOD\n(born\n18-06-2006)\nis\na\neurope\ncricketer\nwho\nhas\nplayed\ninternational\ncricket\nfor\nthe\nIndies\nfrom\n1999\nto\n2021.\n[1]\nNicknamed\n"The\nUniverse\nBoss",\nGayle\nis\nwidely\nregarded\nas\nthe\ngreatest\nbatsman\never\nto\nhave\nplayed\nTwenty20\ncricket\n.\n[2]\n[3]\nHe\nplayed\na\ncrucial\nrole\nin\nthe\nWest\nIndies\nteams\nthat\nwon\nthe\n2004\nICC\nChampions\nTrophy\n,\n2012\nICC\nWorld\nTwenty20\nand\n2016\nICC\nWorld\nTwenty20\n.\nHe\nis\nthe\nonly\nbatsman\nto\nscore\na\ncentury\nin\nT20I,\na\ndouble\ncentury\nin\nOne\nDay\nInternationals\nand\na\ntriple\ncentury\nin\nfootball', metadata={'source': './Untitled document.pdf', 'page': 0})]
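
The `chunk_size` and `chunk_overlap` arguments are easier to picture with a simplified fixed-window model. The sketch below is an assumption for illustration (the helper `sliding_chunks` is hypothetical, not a LangChain function): each window starts `chunk_size - chunk_overlap` characters after the previous one, so with the values used above a new chunk would begin every 512 - 128 = 384 characters. The real splitter cuts on separators, so its overlaps are only approximately this size.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# The last 4 characters of one chunk reappear at the start of the next.
print(chunks[0][-4:], chunks[1][:4])
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of storing some text twice.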


In [None]:
embeddings = HuggingFaceEmbeddings()
db = Chroma.from_documents(split_docs, embeddings)

### Initializing the LLM & Making the Chain
[Google Flan T5](https://huggingface.co/google/flan-t5-large)

In [None]:
llm = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    model_kwargs={"temperature": 0.5, "max_length": 1024},
)
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 1}),  # retrieve only the single closest chunk
)
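
The retriever's `k` setting controls how many stored chunks are handed to the LLM for each question. As a rough sketch of what a top-k vector search does, the example below ranks toy vectors by cosine similarity. The three-dimensional vectors, the `store` list, and the helper names are all made up for illustration, and cosine similarity is an assumption here: the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Return the k (text, score) pairs whose vectors are most similar to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; real sentence-transformers vectors have hundreds of dimensions.
store = [
    ("chunk about cricket", [0.9, 0.1, 0.0]),
    ("chunk about football", [0.1, 0.9, 0.0]),
    ("chunk about cooking", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]

for text, score in top_k(query, store, k=1):
    print(f"{text}: {score:.3f}")
```

Note that a top-k search always returns k results, however weak the best match is. With `k` set to 1, the chain will answer from whichever chunk is closest, even when the question concerns something the document never mentions.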

### Querying

In [None]:
qa.invoke({"question" : "who is mudasir ", "chat_history": []})

{'question': 'who is mudasir ',
 'chat_history': [],
 'answer': 'europe cricketer'}
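
The query above passes an empty `chat_history`, so every call is treated as the start of a new conversation. For follow-up questions, the chain accepts earlier turns as a list of `(question, answer)` tuples. The sketch below shows one way to keep that list up to date; the `ask` helper and the `EchoChain` stand-in are hypothetical names introduced so the example runs without a model or API token. In the notebook, `qa` would be passed in place of `EchoChain()`.

```python
def ask(chain, question, chat_history):
    """Send one question plus the prior turns, then record the new turn in chat_history."""
    result = chain.invoke({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


class EchoChain:
    """Hypothetical stand-in for `qa`: returns a canned answer instead of calling an LLM."""

    def invoke(self, inputs):
        turns = len(inputs["chat_history"])
        return {**inputs, "answer": f"answer #{turns + 1} to {inputs['question']!r}"}


chain = EchoChain()  # in the notebook: chain = qa
history = []
ask(chain, "Who is Chris Gayle?", history)
ask(chain, "Which teams did he play for?", history)

for question, answer in history:
    print(question, "->", answer)
```

Because the second call receives the first turn in `chat_history`, a real chain can resolve references such as "he" before it searches the vector store.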