# Build a RAG pipeline using open-source Large Language Models

In this notebook, we build a "Chat with Website" use case using the Zephyr 7B model.

## Installation

In [None]:
!pip install langchain faiss-cpu sentence-transformers chromadb

## Import the RAG components required to build the pipeline

In [None]:
from langchain.llms import HuggingFaceHub
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.chains import RetrievalQA, LLMChain

## Set up a HuggingFace Access Token

- Log in to [HuggingFace.co](https://huggingface.co/).
- Click your profile icon in the top-right corner, then choose [“Settings”](https://huggingface.co/settings/).
- In the left sidebar, navigate to [“Access Tokens”](https://huggingface.co/settings/tokens).
- Generate a new access token and assign it the “write” role.


In [None]:
import os
from getpass import getpass

HF_TOKEN = getpass("HF Token:")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

HF Token:··········
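
For non-interactive runs, one option is to read the token from the environment first and prompt only when it is missing. The sketch below assumes the same `HUGGINGFACEHUB_API_TOKEN` variable used above; `resolve_token` and its `env` and `prompt_fn` parameters are hypothetical names introduced for illustration.

```python
import os
from getpass import getpass


def resolve_token(env=os.environ, prompt_fn=getpass):
    """Return the token from the environment, prompting only if it is absent."""
    token = env.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        # Fall back to an interactive prompt, as in the cell above.
        token = prompt_fn("HF Token:")
    return token
```

With this helper, the cell above would reduce to `HF_TOKEN = resolve_token()`.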


## External data/document - ETL

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
WEBSITE_URL = "https://tarunjain.netlify.app/"

In [None]:
loader = WebBaseLoader(WEBSITE_URL)
loader.requests_per_second = 1
docs = loader.aload()

Fetching pages: 100%|##########| 1/1 [00:00<00:00, 15.21it/s]


In [None]:
docs

## Text Splitting - Chunking

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)
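
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts fixed-size character windows with a shared overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraphs and newlines, so its chunks vary in length. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=20):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(600))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The last 20 characters of each piece reappear at the start of the next one, which is the same effect the overlap setting is meant to produce in the splitter.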

In [None]:
chunks[0]

Document(page_content="Tarun Jain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\nAbout Me\nEvents\nAchievements\nProjects\nWork\nConnect\n\nResume\n\n\nAI with Tarun\n\n\n\n\n\n\n\n\n\n\n\n\nIt's Me...  Tarun Jain\n\n DevRel @AI Planet🥑 ||  Community Lead @Embedchain.ai🤗||  GDE in ML🚀 \nKnow more", metadata={'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain', 'language': 'pt-BR'})

In [None]:
chunks[1]

Document(page_content='About Me', metadata={'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain', 'language': 'pt-BR'})

In [None]:
chunks[2]

Document(page_content="Hello! My name is Tarun R Jain. I'm a passionate coder with expertise in Machine Learning, Image Processing, and Deep Learning. I've published over 80 blog articles documenting my coding journey, and I'm actively involved in various", metadata={'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain', 'language': 'pt-BR'})

In [None]:
chunks[3]

Document(page_content='involved in various communities, including Hugging Face-Keras Working Group, Deep Learning AI- Bangalore Ambassador, TensorFlow User Group Bangalore- Assistant Organizer and Geeksforgeeks- Technical content writer.', metadata={'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain', 'language': 'pt-BR'})

In [None]:
chunks[4]

Document(page_content="What I love most about coding is the ability to create something new and solve complex problems. I'm constantly learning and experimenting with new techniques to improve my skills and knowledge. In addition to my technical pursuits, I also", metadata={'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain', 'language': 'pt-BR'})
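
The first chunk above consists mostly of navigation labels and long runs of newlines. One possible mitigation, sketched here under the assumption that blank lines carry no meaning for this page, is to normalize whitespace in the page text before splitting. `normalize_whitespace` is a hypothetical helper, not part of LangChain.

```python
import re


def normalize_whitespace(text):
    """Trim each line and collapse runs of blank lines into a single newline."""
    lines = [line.strip() for line in text.splitlines()]
    collapsed = "\n".join(lines)
    return re.sub(r"\n{2,}", "\n", collapsed).strip()


raw = "Tarun Jain\n\n\n\n\nHome\nAbout Me\n\n\nResume"
print(normalize_whitespace(raw))
```

In the notebook this could be applied to each loaded document's `page_content` before calling `split_documents`, so that the 256-character budget is spent on text rather than on empty lines.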

## Embeddings

In [None]:
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN, model_name="BAAI/bge-base-en-v1.5"
)

## Vector Store - FAISS or ChromaDB

In [None]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x7c4a34463a00>

In [None]:
query = "Where does Tarun work?"
search = vectorstore.similarity_search(query)
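
Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by how close they are to it. The sketch below uses cosine similarity on toy 2-D vectors to show the idea. The metric Chroma actually applies depends on how the collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for chunk embeddings and a query embedding.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query = [0.9, 0.1]
print(top_k(toy_query, toy_docs, k=2))
```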

In [None]:
search[0].page_content

"Tarun Jain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\nAbout Me\nEvents\nAchievements\nProjects\nWork\nConnect\n\nResume\n\n\nAI with Tarun\n\n\n\n\n\n\n\n\n\n\n\n\nIt's Me...  Tarun Jain\n\n DevRel @AI Planet🥑 ||  Community Lead @Embedchain.ai🤗||  GDE in ML🚀 \nKnow more"

## Retriever

In [None]:
retriever = vectorstore.as_retriever(
    search_type="mmr",  # alternative: "similarity"
    search_kwargs={'k': 4}  # number of chunks to return
)

In [None]:
retriever.get_relevant_documents(query)

[Document(page_content="Tarun Jain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHome\nAbout Me\nEvents\nAchievements\nProjects\nWork\nConnect\n\nResume\n\n\nAI with Tarun\n\n\n\n\n\n\n\n\n\n\n\n\nIt's Me...  Tarun Jain\n\n DevRel @AI Planet🥑 ||  Community Lead @Embedchain.ai🤗||  GDE in ML🚀 \nKnow more", metadata={'language': 'pt-BR', 'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain'}),
 Document(page_content='MangaPi\n\n\n\n\n\n\n\n\n\n\nHyperspectral Image Compression Using MATLAB (Hardware)\n\n\n\n\n\n\n\n\n\n\nAppreciate my Work\nFeel free to check this resources...\n\n\n\n\n\n\nAI With Tarun:\nSubscribe...', metadata={'language': 'pt-BR', 'source': 'https://tarunjain.netlify.app/', 'title': 'Tarun Jain'}),
 Document(page_content='MediaPipe Tasks API Bootcamp\n\r\n                 I explained the need for Mediapipe and its applications. And later the participants implemented Mediapipe Tasks projects in the Audio, Image and Text domains.', metadata={'language': 'pt-
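
The `mmr` search type stands for maximal marginal relevance: instead of returning the `k` nearest chunks, it picks chunks one at a time, trading relevance to the query against similarity to chunks already chosen. The sketch below illustrates that selection rule on toy vectors. It is not LangChain's implementation, and `lambda_mult` here is simply the label this sketch gives to the trade-off weight.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Greedily select k indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Redundancy is the closest match among chunks already picked.
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 is different but less relevant.
toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
toy_query = [1.0, 0.1]
print(mmr(toy_query, toy_docs, k=2))
print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))
```

With the weight at 1.0 the rule reduces to plain similarity ranking and returns both near-duplicates. With 0.5 it skips the duplicate in favor of the more diverse vector.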

## Large Language Model - Open Source

In [None]:
llm = HuggingFaceHub(
    repo_id="huggingfaceh4/zephyr-7b-alpha",
    model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512}
)

## Prompt Template and User Input (Augment - Step 2)
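
The prompt in this section wraps the user query in the chat markers this notebook uses for Zephyr: `<|system|>`, `<|user|>`, and `<|assistant|>`, with `</s>` closing each turn. As a minimal sketch, the same layout can be produced by a small helper; `build_zephyr_prompt` is a hypothetical name, and the default system message is copied from the cell below.

```python
DEFAULT_SYSTEM = (
    "You are an AI assistant that follows instruction extremely well.\n"
    "Please be truthful and give direct answers\n"
)


def build_zephyr_prompt(user_message, system_message=DEFAULT_SYSTEM):
    """Wrap a system message and a user message in Zephyr-style chat markers."""
    return (
        f"<|system|>\n{system_message}</s>\n"
        f"<|user|>\n{user_message}</s>\n"
        f"<|assistant|>\n"
    )


print(build_zephyr_prompt("Name the projects Tarun has worked on?"))
```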

In [None]:
query = "Name the projects Tarun has worked on?"

prompt = f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

## RAG RetrievalQA chain

In [None]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever)
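
The `refine` chain type answers from the first retrieved chunk and then revises that answer once per remaining chunk, so the model is called once per chunk. The sketch below shows that control flow only. The prompt wording is illustrative rather than LangChain's own, and `refine_answer` and `fake_llm` are hypothetical stand-ins.

```python
def refine_answer(question, documents, llm):
    """Answer from the first document, then revise the answer with each later one."""
    if not documents:
        raise ValueError("at least one document is required")
    answer = llm(f"Context:\n{documents[0]}\n\nQuestion: {question}\nAnswer:")
    for doc in documents[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"Additional context:\n{doc}\n"
            "Refine the existing answer if the new context helps; otherwise repeat it."
        )
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in for a model call that records each prompt it receives."""
    calls.append(prompt)
    return f"answer after {len(calls)} call(s)"


result = refine_answer("Name the projects", ["doc A", "doc B", "doc C"], fake_llm)
print(result)
```

Because each step sees only one chunk plus the running answer, this strategy keeps individual prompts short at the cost of several sequential model calls.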

In [None]:
response = qa.run(prompt)

In [None]:
response

"\n\nIn addition to MangaPi, a hardware-based hyperspectral image compression project using MATLAB, Tarun has worked on a variety of other projects. One such project was a computer vision project that involved developing a face recognition system using OpenCV. Additionally, Tarun participated in an Object Tracking Bootcamp, where he worked on a vehicle tracking project. \n\nTarun's experience in these projects, as well as his participation in hackathons and competitions, demonstrates his passion for innovation and problem-solving in the fields of computer vision, machine learning, and artificial intelligence."

## Chain

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [None]:
template = """
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
response = rag_chain.invoke("Name the projects Tarun has worked on?")

In [None]:
print(response)

I do not have access to specific information about any particular person unless it is publicly available. however, if you provide me with the name "tarun" and specify which tarun you are referring to, i can help you with that. please provide me with more context or details.
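
The answer above ignores the website content, which is consistent with the template: it contains only a `{query}` placeholder, so the chunks the chain passes under the `context` key are never inserted into the prompt. The sketch below reproduces the effect with plain `str.format` as a stand-in for the prompt template. The shortened templates, the system wording, and the `format_docs` helper are hypothetical, and the context strings are toy values taken from a retrieved chunk shown earlier.

```python
template_without_context = "<|user|>\n{query}</s>\n<|assistant|>\n"

template_with_context = (
    "<|system|>\nAnswer using only the context below.\n\nContext:\n{context}</s>\n"
    "<|user|>\n{query}</s>\n"
    "<|assistant|>\n"
)


def format_docs(page_contents):
    """Join retrieved chunk texts into a single context string."""
    return "\n\n".join(page_contents)


values = {
    "context": format_docs(
        ["MangaPi", "Hyperspectral Image Compression Using MATLAB (Hardware)"]
    ),
    "query": "Name the projects Tarun has worked on?",
}

# str.format ignores keyword arguments the template does not reference,
# so the first prompt never contains the retrieved text.
prompt_a = template_without_context.format(**values)
prompt_b = template_with_context.format(**values)
print("MangaPi" in prompt_a)
print("MangaPi" in prompt_b)
```

Applied to the chain above, this suggests adding a `{context}` placeholder to `template` so that the retrieved chunks reach the model.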
