https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476



**Installing the Libraries**
Start by installing the required libraries with pip.

In [2]:
!pip install pypdf
!pip install sentence-transformers
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes faiss-gpu sentence_transformers

Collecting pypdf
  Downloading pypdf-3.16.2-py3-none-any.whl (276 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.16.2
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

**Initializing the Hugging Face Pipeline**

You have to initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires three things that you must initialize first:

1. An LLM, in this case meta-llama/Llama-2-7b-chat-hf.
2. The respective tokenizer for the model.
3. A stopping criteria object.

You have to initialize the model and move it to a CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5–10 minutes.

You also need an access token so that your code can download the model from Hugging Face. Go to your Hugging Face Profile > Settings > Access Token > New Token > Generate a Token, then copy the token and add it to the code below.
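
Rather than pasting the token directly into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `HF_TOKEN` (a name chosen here for illustration, not something the notebook defines) and fails early with a clear message if it is missing.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face access token stored in an environment variable.

    `env_var` is an assumed name for illustration; use whatever your
    environment actually provides.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

You could then write `hf_auth = load_hf_token()` in place of a hard-coded string, which keeps the token out of shared notebooks.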

In [3]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token
# replace the placeholder with your own token and never publish a real one
hf_auth = '<YOUR_HF_ACCESS_TOKEN>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")



Model loaded on cuda:0
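
To get a feel for why the 4-bit quantization configuration helps, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes about 7 billion parameters (as the "7b" in the model name suggests) and ignores activations, the KV cache, and the small overhead of quantization constants, so treat the numbers as approximate.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7 billion parameters, based on the "7b" in the model name.
NUM_PARAMS = 7e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")  # ~14.0 GB
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")   # ~3.5 GB
```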


The pipeline requires a tokenizer, which translates human-readable plaintext into token IDs the LLM can read. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which can be initialized with this code:

In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)




Now, we need to define the stopping criteria of the model. Stopping criteria let us specify when the model should stop generating text. If we don’t provide any, the model tends to go off on a tangent after answering the initial question.

In [5]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

You have to convert these stop token ids into LongTensor objects.

In [6]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

You can do a quick spot check that no <unk> token IDs (0) appear in the stop_token_ids. There are none, so we can move on to building the stopping criteria object, which checks whether the stopping criteria have been satisfied, that is, whether any of these token ID combinations have been generated.

In [7]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
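
The tail comparison inside StopOnTokens can be tried out on plain Python lists. One thing worth checking: the stop sequences printed earlier each begin with the ids 1 and 29871. In the Llama 2 tokenizer, id 1 is the beginning-of-sequence token, which normally appears only at the start of the input, so these sequences may never match the end of a generation. The sketch below uses made-up generated ids to show the difference; if the criteria never fire for you, tokenizing the stop strings with `add_special_tokens=False` is one thing to try (an assumption on my part, not something the original notebook does).

```python
from typing import List, Sequence


def ends_with_any(generated: Sequence[int], stop_sequences: List[Sequence[int]]) -> bool:
    """Return True if `generated` ends with any of the stop sequences."""
    for stop in stop_sequences:
        n = len(stop)
        if n and len(generated) >= n and list(generated[-n:]) == list(stop):
            return True
    return False


# Token ids for '\nHuman:' as printed earlier, with and without the two leading ids.
with_prefix = [1, 29871, 13, 29950, 7889, 29901]
without_prefix = with_prefix[2:]

# Hypothetical generated ids that end in the last four ids of the stop sequence.
generated = [1, 500, 600, 13, 29950, 7889, 29901]

print(ends_with_any(generated, [with_prefix]))     # False
print(ends_with_any(generated, [without_prefix]))  # True
```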

You are ready to initialize the Hugging Face pipeline. There are a few additional parameters that we must define here. Comments are included in the code for further explanation.

In [9]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic (applies when sampling is enabled)
    max_new_tokens=2000,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
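
To see what the temperature parameter does conceptually, here is a minimal sketch of a softmax with temperature over made-up logits. It is an illustration of the idea only, not the transformers implementation; note also that in transformers the temperature affects generation only when sampling is enabled (`do_sample=True`).

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With the lower temperature, almost all of the probability mass lands on the highest-scoring token, which is why low values give more predictable output.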

Run this code to confirm that everything is working fine.

In [8]:
res = generate_text("write 10 knock knock story for a 5 years old")
print(res[0]["generated_text"])

write 10 knock knock story for a 5 years old child. everybody can do it!

Knock, knock!
Who's there?
Boo.
Boo who?
Don't cry, it's a joke! Haha!


**Implementing HF Pipeline in LangChain**

Now, you have to wrap the Hugging Face pipeline in LangChain. You will still get the same kind of output, since nothing different is being done here. However, this code will allow you to use LangChain’s advanced agent tooling, chains, etc., with Llama 2.

In [14]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
print(llm(prompt="list the difference between Data Lakehouse and Data Warehouse."))

 everybody knows that data warehousing is a process of collecting, storing, and managing large amounts of data from various sources in a structured manner to support business intelligence activities such as reporting, analysis, and data visualization.
A data lakehouse is a centralized repository that stores all of an organization's data, both structured and unstructured, in its original form. It is designed to handle large volumes of data and provide fast and easy access to data for various use cases, including data science, machine learning, and real-time analytics.
Here are some key differences between a data warehouse and a data lakehouse:
1. Structure: A data warehouse is typically structured around a predefined schema or star architecture, while a data lakehouse has a flexible and dynamic structure that can accommodate different types of data and evolving business needs.
2. Data Types: A data warehouse primarily stores structured data, while a data lakehouse stores both structured

**Ingesting Data using Document Loader**

You have to ingest data using the WebBaseLoader document loader, which collects data by scraping webpages. In this case, you will be collecting data from the Databricks documentation website.

In [13]:

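from langchain.document_loaders import PyPDFLoader

# Optional alternative: load and preprocess a local PDF document instead.
# The path must point to a file that exists in the Colab session; in this run
# the call raised a ValueError, most likely because the file was not uploaded.
loader = PyPDFLoader('/content/CIT 701.pdf')

documents = loader.load()

ValueError: ignored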

In [15]:
from langchain.document_loaders import WebBaseLoader

web_links = ["https://www.databricks.com/","https://help.databricks.com","https://databricks.com/try-databricks","https://help.databricks.com/s/","https://docs.databricks.com","https://kb.databricks.com/","http://docs.databricks.com/getting-started/index.html","http://docs.databricks.com/introduction/index.html","http://docs.databricks.com/getting-started/tutorials/index.html","http://docs.databricks.com/release-notes/index.html","http://docs.databricks.com/ingestion/index.html","http://docs.databricks.com/exploratory-data-analysis/index.html","http://docs.databricks.com/data-preparation/index.html","http://docs.databricks.com/data-sharing/index.html","http://docs.databricks.com/marketplace/index.html","http://docs.databricks.com/workspace-index.html","http://docs.databricks.com/machine-learning/index.html","http://docs.databricks.com/sql/index.html","http://docs.databricks.com/delta/index.html","http://docs.databricks.com/dev-tools/index.html","http://docs.databricks.com/integrations/index.html","http://docs.databricks.com/administration-guide/index.html","http://docs.databricks.com/security/index.html","http://docs.databricks.com/data-governance/index.html","http://docs.databricks.com/lakehouse-architecture/index.html","http://docs.databricks.com/reference/api.html","http://docs.databricks.com/resources/index.html","http://docs.databricks.com/whats-coming.html","http://docs.databricks.com/archive/index.html","http://docs.databricks.com/lakehouse/index.html","http://docs.databricks.com/getting-started/quick-start.html","http://docs.databricks.com/getting-started/etl-quick-start.html","http://docs.databricks.com/getting-started/lakehouse-e2e.html","http://docs.databricks.com/getting-started/free-training.html","http://docs.databricks.com/sql/language-manual/index.html","http://docs.databricks.com/error-messages/index.html","http://www.apache.org/","https://databricks.com/privacy-policy","https://databricks.com/terms-of-use"]

loader = WebBaseLoader(web_links)
documents = loader.load()
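
To make "scraping webpages" a little more concrete, here is a standard-library sketch of the text-extraction step: turning an HTML string into plain text. It is only an illustration and not how WebBaseLoader itself is implemented; the class and function names are made up for this example, and it works on a hard-coded HTML snippet so that no network access is needed.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Docs</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```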

**Splitting in Chunks using Text Splitters**

You have to split the text into small pieces. To do so, initialize a RecursiveCharacterTextSplitter and pass the documents to its split_documents method.

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
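
To illustrate what chunk_size and chunk_overlap mean, here is a simplified sliding-window splitter. It is a sketch only: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs, lines, and spaces, whereas this version cuts at fixed character offsets. The function name is made up for this example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the overlap of 20 characters plays in the cell above.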

**Creating Embeddings and Storing in Vector Store**

You have to create embeddings for each small chunk of text and store them in the vector store (i.e. FAISS). You will be using the all-mpnet-base-v2 Sentence Transformer to convert every piece of text into a vector before storing it in the vector store.

In [17]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)
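
Conceptually, a vector store answers the question "which stored vectors are closest to this query vector?". The brute-force sketch below shows that idea with made-up 2-dimensional vectors standing in for real embeddings and Euclidean distance as the metric. It is an illustration of exhaustive nearest-neighbour search, not the FAISS implementation, and the function names are chosen for this example.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to `query`."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up vectors; real sentence embeddings have hundreds of dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
query = [0.9, 0.1]

for index, distance in nearest(query, stored, k=2):
    print(index, round(distance, 3))
```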


**Initializing Chain**

You have to initialize ConversationalRetrievalChain. This chain allows you to have a chatbot with memory while relying on a vector store to find relevant information from your document.

Additionally, you can return the source documents used to answer the question by passing the optional parameter return_source_documents=True when constructing the chain.

In [18]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
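
The flow behind a conversational retrieval chain can be sketched in three steps: rewrite the follow-up question so it stands alone, retrieve relevant chunks, and answer using them as context. The toy function below shows this with plain callables; the prompt wording and the names `llm` and `retrieve` are hypothetical stand-ins, not LangChain's actual prompts or API.

```python
from typing import Callable, List, Tuple

ChatHistory = List[Tuple[str, str]]


def conversational_retrieval(
    question: str,
    chat_history: ChatHistory,
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> dict:
    """Toy version of the rewrite -> retrieve -> answer flow."""
    # 1. With prior turns, ask the LLM to rewrite the question so it stands alone.
    if chat_history:
        turns = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        standalone = llm(
            f"Chat history:\n{turns}\nFollow-up: {question}\nStandalone question:"
        )
    else:
        standalone = question
    # 2. Fetch the chunks most relevant to the standalone question.
    docs = retrieve(standalone)
    # 3. Answer using the retrieved chunks as context.
    context = "\n\n".join(docs)
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    return {"answer": answer, "source_documents": docs}
```

The returned dictionary mirrors the two keys used later in this walkthrough, `answer` and `source_documents`.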

Now, it’s time to do some Question-Answering on your own data!

In [20]:
chat_history = []

query = "write a Minecraft command related to Movement and Teleportation Commands, I want only the title for the command,the command and detailed description of the command. Use emojis"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

 Sure! Here are some movement and teleportation commands in Minecraft:

Title: Warp 🚀
Command: /warp [location]
Description: Warps the player to a specific location. You can enter any valid Minecraft coordinate, including X, Y, Z, or a full address like "100, 50, 10".

Title: Home 🏠
Command: /home
Description: Automatically warps the player back to their home location. This command is especially useful when exploring new areas and want to quickly return home.

Title: Tp 🔄
Command: /tp [player] [location]
Description: Instantly transports another player to a specified location. You can use this command to help friends or fellow players who are stuck in a difficult situation.



This time, your previous question and answer will be included as chat history, which makes it possible to ask follow-up questions.

In [None]:
chat_history = [(query, result["answer"])]

query = "write 10 article ideas for a kitchen blog"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

 Sure! Artificial intelligence (AI) is being used in various ways in kitchens today, including:

1. Smart cooking appliances: Many kitchen appliances, such as slow cookers, pressure cookers, and even toasters, are now equipped with AI features like automatic temperature control, smart sensors, and voice assistants. These appliances can help users prepare meals more efficiently and accurately.

2. Meal planning and grocery shopping: AI-powered meal planning apps can suggest recipes based on dietary preferences and ingredient availability, while also generating grocery lists for users. Some apps even offer personalized nutrition plans and recommendations based on individual health goals.

3. Food preparation and presentation: AI-assisted food preparation tools can chop, slice, and dice ingredients with greater precision than human chefs. Additionally, AI-powered kitchen robots can help with food presentation, such as arranging ingredients into visually appealing patterns.

4. Culinary in
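
If you plan to ask several follow-up questions, the pattern above can be wrapped in a small helper that appends each turn to the history. This is a sketch with a hypothetical function name, `ask`; note that it accumulates all previous turns, whereas the cell above keeps only the most recent one.

```python
from typing import Callable, List, Tuple


def ask(
    chain: Callable[[dict], dict],
    query: str,
    chat_history: List[Tuple[str, str]],
) -> str:
    """Send `query` with the running history, then record the new turn."""
    result = chain({"question": query, "chat_history": chat_history})
    answer = result["answer"]
    chat_history.append((query, answer))
    return answer


# Usage with the chain built earlier (left as comments since it needs the model):
# chat_history = []
# print(ask(chain, "first question", chat_history))
# print(ask(chain, "follow-up question", chat_history))
```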



You can also see the source of the information used to generate the answer.

In [None]:
print(result['source_documents'])
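
Printing the raw list can be hard to read. The sketch below formats each source as a URL plus a short preview. It uses a small stand-in class with the same two attributes as a LangChain Document (`page_content` and `metadata`) so that it runs on its own; the example document content is made up, and the assumption that `metadata` holds a `source` key should be checked against your own loader's output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def summarize_sources(docs: List[Doc], preview_chars: int = 80) -> List[str]:
    """Return one 'source: preview' line per document."""
    lines = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        # Collapse runs of whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source}: {preview}")
    return lines


# Made-up example document, for illustration only.
example = [Doc("Delta Lake   is a storage layer.", {"source": "https://docs.databricks.com"})]
for line in summarize_sources(example):
    print(line)
```

With the real chain you would call `summarize_sources(result['source_documents'])` instead.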

Finally…
Et voilà! You now have the capability to do question-answering on your own data using a powerful language model. Additionally, you can further develop it into a chatbot application using Streamlit.