# Build a RAG Pipeline Using Open-Source Large Language Models

## Made by: Wilfredo Aaron Sosa Ramos

In this notebook we build a chat-with-website use case using the Zephyr 7B model.

## Installation

In [1]:
!pip install langchain faiss-cpu sentence-transformers chromadb

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting chromadb
  Downloading chromadb-0.5.23-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.4-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)

## Import RAG components required to build pipeline

In [3]:
!pip install langchain_core langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.12-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.12 (from langchain_community)
  Downloading langchain-0.3.12-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain_core
  Downloading langchain_core-0.3.25-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.7.0-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.23.1-py3-none-any.whl.metadata (7.5 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)

In [4]:
# Third-party integrations live in langchain_community; the langchain.* paths for them are deprecated.
from langchain_community.llms import HuggingFaceHub
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
from langchain.chains import RetrievalQA, LLMChain



## Setup HuggingFace Access Token

- Log in to [HuggingFace.co](https://huggingface.co/).
- Click your profile icon in the top-right corner, then choose [“Settings”](https://huggingface.co/settings/).
- In the left sidebar, open [“Access Tokens”](https://huggingface.co/settings/tokens).
- Generate a new access token and assign it the “write” role.


In [5]:
import os
from getpass import getpass

HF_TOKEN = getpass("HF Token:")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

HF Token:··········


## External data/document - ETL

In [6]:
import nest_asyncio

nest_asyncio.apply()

In [7]:
WEBSITE_URL = "https://developers.google.com/machine-learning/resources/intro-llms?hl=en"

In [8]:
loader = WebBaseLoader(WEBSITE_URL)
loader.requests_per_second = 1
docs = loader.aload()

Fetching pages: 100%|##########| 1/1 [00:00<00:00,  1.95it/s]


In [9]:
docs

[Document(metadata={'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers', 'language': 'en'}, page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIntroduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n          Machine Learning\n        \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n/\n\n\n\n\n\n\n\n\n\n\nEnglish\n\n\nDeutsch\n\n\nEspañol\n\n\nEspañol – América Latina\n\n\nFrançais\n\n\nIndonesia\n\n\nItaliano\n\n\nPolski\n\n\nPortuguês – Brasil\n\n\nTiếng Việt\n\n\nTürkçe\n\n\nРусский\n\n\nעברית\n\n\nالعربيّة\n\n\nفارسی\n\n\nहिंदी\n\n\nবাংলা\n\n\nภาษาไทย\n\n\n中文 – 简体\n\n\n中文 – 繁體\n\n\n日本語\n\n\n한국어\n\n\n\n\nSign in\n\n\n\n\n\n\n\n\n\n\n\n    Home\n  \n    \n\n\n\n    Resources\n  \n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n          Machine Learning\n    

## Text Splitting - Chunking

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)

In [12]:
chunks

[Document(metadata={'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers', 'language': 'en'}, page_content='Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n          Machine Learning\n        \n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n/\n\n\n\n\n\n\n\n\n\n\nEnglish\n\n\nDeutsch\n\n\nEspañol\n\n\nEspañol – América Latina'),
 Document(metadata={'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers', 'language': 'en'}, page_content='Français\n\n\nIndonesia\n\n\nItaliano\n\n\nPolski\n\n\nPortuguês – Brasil\n\n\nTiếng Việt\n\n\nTürkçe\n\n\nРусский\n\n\nעברית\n\n\nالعربيّة\n\n\nفارسی\n\n\nहिंदी\n\n\nবাংলা\n\n\nภาษาไทย\n\n\n中文 

In [13]:
len(chunks)

50
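
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break on separators such as paragraph and line breaks. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 20) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input (assumption): 600 digits, so the overlap is easy to inspect.
sample = "".join(str(i % 10) for i in range(600))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-20:] == pieces[1][:20])
```

A smaller `chunk_size` gives more, narrower chunks to embed; a larger `chunk_overlap` reduces the chance that a sentence is cut off from the context it needs.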

## Embeddings

In [14]:
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN, model_name="BAAI/bge-base-en-v1.5"
)

## Vector Store - FAISS or ChromaDB

In [15]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [16]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x7aca6c2b19c0>

In [23]:
query = "What is a LLM"
search = vectorstore.similarity_search(query)
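
Conceptually, `similarity_search` embeds the query and returns the chunks whose embeddings are closest to it. The sketch below shows that idea with cosine similarity over toy 2-dimensional vectors. The vectors and the helper names `cosine_similarity` and `top_k` are assumptions for illustration; the distance metric and index that the vector store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings (assumption): real embeddings have hundreds of dimensions.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.2]
ranked = top_k(toy_query, toy_docs, k=2)
print(ranked)
```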

In [24]:
search

[Document(metadata={'language': 'en', 'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers'}, page_content="work).\nLLMs are excellent at mimicking human speech patterns. Among other things,\nthey're great at combining information with different styles and tones.\nHowever, LLMs can be components of models that do more than just"),
 Document(metadata={'language': 'en', 'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers'}, page_content='Define language models and large language models (LLMs).\n\nDefine key LLM concepts, including Transformers and self-attention.\n\nDescribe the costs and benefits of LLMs, along with common use cases.'),
 Document(metadata={'language': 'en', 'source': 'https://developers.googl

In [25]:
search[0].page_content

"work).\nLLMs are excellent at mimicking human speech patterns. Among other things,\nthey're great at combining information with different styles and tones.\nHowever, LLMs can be components of models that do more than just"

## Retriever

In [26]:
retriever = vectorstore.as_retriever(
    search_type="mmr",  # maximal marginal relevance; the alternative is "similarity"
    search_kwargs={'k': 4}
)

In [28]:
retriever.invoke(query)

[Document(metadata={'language': 'en', 'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers'}, page_content="work).\nLLMs are excellent at mimicking human speech patterns. Among other things,\nthey're great at combining information with different styles and tones.\nHowever, LLMs can be components of models that do more than just"),
 Document(metadata={'language': 'en', 'source': 'https://developers.google.com/machine-learning/resources/intro-llms?hl=en', 'title': 'Introduction to Large Language Models \xa0|\xa0 Machine Learning \xa0|\xa0 Google for Developers'}, page_content='Define language models and large language models (LLMs).\n\nDefine key LLM concepts, including Transformers and self-attention.\n\nDescribe the costs and benefits of LLMs, along with common use cases.'),
 Document(metadata={'language': 'en', 'source': 'https://developers.googl
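
With `search_type="mmr"` the retriever uses maximal marginal relevance, which trades relevance to the query against redundancy with chunks that were already picked. The following is a minimal greedy sketch over toy vectors, not LangChain's implementation. The helper names are hypothetical, and `lambda_mult` is assumed here to weight relevance (1.0) against diversity (0.0).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns the selected indices in pick order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected vector (0 if none selected yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (assumption): the first two are near-duplicates of each other.
toy_vecs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
toy_q = [1.0, 0.1]
balanced = mmr_select(toy_q, toy_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(toy_q, toy_vecs, k=2, lambda_mult=1.0)
print(balanced, relevance_only)
```

With equal weighting the second pick skips the near-duplicate in favour of a more distinct vector, while `lambda_mult=1.0` reduces to a plain similarity ranking.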

## Large Language Model - Open Source

In [29]:
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512},
)


## Prompt Template and User Input (Augment - Step 2)

In [30]:
query = "Define what is a Large Language Model, a Transformer and Self-Attention"

prompt = f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

## RAG RetrievalQA chain

In [31]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever)

In [34]:
response = qa.invoke(prompt)

In [35]:
response

{'query': '\n <|system|>\nYou are an AI assistant that follows instruction extremely well.\nPlease be truthful and give direct answers\n</s>\n <|user|>\n Define what is a Large Language Model, a Transformer and Self-Attention\n </s>\n <|assistant|>\n',
 'result': 'The original question is as follows: \n <|system|>\nYou are an AI assistant that follows instruction extremely well.\nPlease be truthful and give direct answers\n</s>\n <|user|>\n Define what is a Large Language Model, a Transformer and Self-Attention\n </s>\n <|assistant|>\n\nWe have provided an existing answer: The original question is as follows: \n <|system|>\nYou are an AI assistant that follows instruction extremely well.\nPlease be truthful and give direct answers\n</s>\n <|user|>\n Define what is a Large Language Model, a Transformer and Self-Attention\n </s>\n <|assistant|>\n\nWe have provided an existing answer: The original question is as follows: \n <|system|>\nYou are an AI assistant that follows instruction extr

## Chain

In [36]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [37]:
template = """
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

In [38]:
prompt = ChatPromptTemplate.from_template(template)

In [39]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
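
Note that the template above contains only the `{query}` placeholder, so the documents supplied under the `context` key are never inserted into the prompt. The sketch below shows one way a template could carry the retrieved text, using plain `str.format` so it runs without LangChain. `RAG_TEMPLATE`, `format_docs` and `build_prompt` are hypothetical names, and `SimpleNamespace` objects stand in for LangChain `Document` instances.

```python
from types import SimpleNamespace

# Assumed template: same Zephyr-style tags as above, plus a {context} slot.
RAG_TEMPLATE = """<|system|>
You are an AI assistant that follows instructions extremely well.
Answer truthfully and directly, using only the context below.

Context:
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""


def format_docs(docs) -> str:
    """Join the page_content of each retrieved document into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, query: str) -> str:
    """Fill the template with the retrieved context and the user query."""
    return RAG_TEMPLATE.format(context=format_docs(docs), query=query)


# Stand-ins for retrieved documents (assumption): only page_content is needed here.
fake_docs = [
    SimpleNamespace(page_content="LLMs predict tokens."),
    SimpleNamespace(page_content="Transformers use self-attention."),
]
filled = build_prompt(fake_docs, "What is an LLM?")
print(filled)
```

In the chain itself, the equivalent change would likely be a template with a `{context}` placeholder and a mapping such as `{"context": retriever | format_docs, "query": RunnablePassthrough()}`.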

In [40]:
response = rag_chain.invoke("Define what is a Large Language Model, a Transformer and Self-Attention")

In [41]:
print(response)

Human: 
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 Define what is a Large Language Model, a Transformer and Self-Attention
 </s>
 <|assistant|>
A Large Language Model (LLM) is a type of artificial intelligence model that is trained on a massive dataset of texts to learn patterns and generate human-like responses to input text. LLMs can be used to perform various natural language processing tasks, such as text generation, text classification, and question answering.

A Transformer is a type of neural network architecture that is commonly used in LLMs. Unlike traditional recurrent neural networks (RNNs) that iteratively process input data, transformers use a self-attention mechanism to process input data simultaneously.

Self-attention is a technique used in transformers to allow the model to focus on specific parts of the input data that are most relevant to the task at hand. Self-attention invo
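
The printed response repeats the whole prompt before the answer. One possible post-processing step, assuming the model's reply always follows the last `<|assistant|>` tag, is sketched below. The helper name `extract_answer` and the sample string are hypothetical.

```python
ASSISTANT_TAG = "<|assistant|>"


def extract_answer(generated_text: str, tag: str = ASSISTANT_TAG) -> str:
    """Return the text after the last assistant tag, or the whole text if the tag is absent."""
    _, found, tail = generated_text.rpartition(tag)
    return tail.strip() if found else generated_text.strip()


# Sample output (assumption): an echoed prompt followed by the model's reply.
raw = (
    "<|system|>\nBe direct.</s>\n"
    "<|user|>\nWhat is an LLM?</s>\n"
    "<|assistant|>\nA language model trained on large text corpora."
)
answer = extract_answer(raw)
print(answer)
```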