#**Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data**
You can perform Question-Answering (QA) like a chatbot over documents of your choice using Meta's [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model with the LangChain framework and the FAISS library. In this notebook, I've used the [Databricks documentation](https://docs.databricks.com/en/index.html) as the data source, retrieving it directly from the official website to showcase the functionality.

📚 For more details, you can [click here](https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476) to read the full article written by me on Medium.com


##Getting Started
You can use the open-source **Llama-2-7b-chat** model in both Hugging Face transformers and LangChain. However, you first have to request access to the Llama 2 models via the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and agree to share your account details with Meta on the [Hugging Face website](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Access is typically granted within a few minutes to a few hours.

🚨 Note that your Hugging Face account email **MUST** match the email you provided on the Meta website, or your request will not be approved.

If you’re using Google Colab to run the code, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4 in your notebook. You will need ~8GB of GPU RAM for inference; running on CPU is practically impossible.

##Installing the Libraries
First, install all the required libraries using `pip install`.

In [None]:
!pip install accelerate==0.21.0 transformers==4.31.0 tokenizers==0.13.3
!pip install bitsandbytes==0.40.0 einops==0.6.1
!pip install xformers==0.0.22.post7
!pip install langchain==0.1.4
!pip install faiss-gpu==1.7.1.post3
!pip install sentence_transformers

##Initializing the Hugging Face Pipeline
You have to initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires the following three things that you must initialize:

*   An LLM, in this case `meta-llama/Llama-2-7b-chat-hf`.
*   The respective tokenizer for the model.
*   A stopping criteria object.

You have to initialize the model and move it to a CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5–10 minutes.

Also, you need to generate an access token so that your code can download the model from Hugging Face. For that, go to your Hugging Face Profile > Settings > Access Token > New Token > Generate a Token. Copy the token and add it to the code below.
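
If you would rather not paste the token into the notebook, one option is to read it from an environment variable. This is a minimal sketch; the helper `load_hf_token` and the variable name `HF_TOKEN` are assumptions for illustration, not part of the original notebook.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face access token from an environment variable.

    Assumption: the token was exported under `env_var` before the notebook
    started (the name HF_TOKEN is illustrative).
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

With this in place, `hf_auth = load_hf_token()` could stand in for the hard-coded string in the next cell.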

In [1]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token
hf_auth = 'Your Token'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")




Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda122.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 122
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda122.so...



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


Model loaded on cuda:0
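
To see why the 4-bit configuration matters on a single Colab GPU, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It is a sketch under stated assumptions: a nominal 7 billion parameters, and no accounting for activations, the KV cache, non-quantized layers or CUDA overhead, so real usage will be higher.

```python
# Rough weight-memory estimate; ignores activations, KV cache and CUDA overhead.
N_PARAMS = 7e9  # assumption: nominal "7B" parameter count


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")  # ~14.0 GB
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")   # ~3.5 GB
```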


The pipeline requires a tokenizer, which translates human-readable plaintext into LLM-readable token IDs. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which can be initialized with this code:

In [2]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)



Now, we need to define the *stopping criteria* of the model. The stopping criteria allow us to specify when the model should stop generating text. If we don’t provide stopping criteria, the model tends to go off on a tangent after answering the initial question.

In [3]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

You have to convert these stop token ids into `LongTensor` objects.

In [4]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

You can do a quick spot check that no `<unk>` token IDs (`0`) appear in `stop_token_ids`. There are none, so we can move on to building the stopping criteria object, which checks whether any of these token ID sequences has been generated.

In [5]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
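
The check inside `StopOnTokens` is a suffix comparison: generation stops when the tail of the generated IDs equals one of the stop sequences. The same idea can be sketched with plain Python lists, which makes it easy to try out without a GPU. The function name `ends_with_any` and the token IDs below are made up for illustration.

```python
from typing import List, Sequence


def ends_with_any(generated: Sequence[int], stop_sequences: List[Sequence[int]]) -> bool:
    """Return True if `generated` ends with any of the stop sequences."""
    for stop in stop_sequences:
        n = len(stop)
        # Guard against empty stop sequences and outputs shorter than the stop sequence.
        if n and len(generated) >= n and list(generated[-n:]) == list(stop):
            return True
    return False


# Toy token IDs (made up for illustration, not real Llama 2 IDs)
stops = [[13, 50, 60], [13, 70]]

print(ends_with_any([5, 6, 13, 50, 60], stops))  # True: tail matches the first stop sequence
print(ends_with_any([5, 6, 13, 50], stops))      # False: only a partial match
```

Because the comparison is on the exact tail, a stop sequence only fires if every ID in it appears at the end of the output. It may be worth checking whether the leading IDs the tokenizer prepends (such as a beginning-of-sequence ID) can actually occur at that position in generated text.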

You are ready to initialize the Hugging Face pipeline. There are a few additional parameters that we must define here. Comments are included in the code for further explanation.

In [6]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; lower is more deterministic (only applies when sampling is enabled)
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
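
To build some intuition for the `temperature` parameter, here is a small standalone sketch of how temperature rescales scores before they are turned into probabilities. The logits are made-up numbers for three candidate tokens, and `softmax_with_temperature` is an illustrative helper, not a transformers function.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability mass sits on the highest-scoring token, while higher temperatures flatten the distribution.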

Run this code to confirm that everything is working fine.

In [8]:
res = generate_text("Explain me the difference between Data Lakehouse and Data Warehouse.")
print(res[0]["generated_text"])

Explain me the difference between Data Lakehouse and Data Warehouse. Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of structured and unstructured data. A data lakehouse is a centralized repository that stores all types of data, including structured, semi-structured, and unstructured data. A data warehouse, on the other hand, is a repository that stores structured data in a specific format, typically optimized for querying and analysis.

Here are some key differences between a data lakehouse and a data warehouse:

1. Structure: A data lakehouse stores data in its raw form, without any predefined schema or structure. A data warehouse, on the other hand, stores data in a structured format, with a defined schema that defines the relationships between different data entities.
2. Data Types: A data lakehouse can store all types of data, including structured, semi-structur

##Implementing HF Pipeline in LangChain
Now, you have to wrap the Hugging Face pipeline in LangChain. You will get an equivalent answer, since the same pipeline still does the generation. However, this code will allow you to use LangChain’s advanced agent tooling, chains, etc., with **Llama 2**.

In [7]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm(prompt="Explain me the difference between Data Lakehouse and Data Warehouse.")


' Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of data but have different architectures and use cases. A data lakehouse is a centralized repository that stores all the raw data from various sources in its original form, without transforming or processing it. A data warehouse, on the other hand, is a structured repository that stores data in a specific format, typically after cleaning, transforming, and aggregating it.\n\nHere are some key differences between a data lakehouse and a data warehouse:\n\n1. Data Structure: A data lakehouse stores data in its raw, unprocessed form, while a data warehouse stores data in a structured format, typically after cleaning, transforming, and aggregating it.\n2. Data Processing: A data lakehouse does not process data, whereas a data warehouse processes data through various transformations, such as data cleansing, data transformati

##Ingesting Data using Document Loader
You have to ingest data using the `WebBaseLoader` document loader, which collects data by scraping webpages. In this case, you will be collecting data from the Databricks documentation website.

In [8]:
from langchain.document_loaders import WebBaseLoader

# web_links = ["https://www.databricks.com/","https://help.databricks.com","https://databricks.com/try-databricks","https://help.databricks.com/s/","https://docs.databricks.com","https://kb.databricks.com/","http://docs.databricks.com/getting-started/index.html","http://docs.databricks.com/introduction/index.html","http://docs.databricks.com/getting-started/tutorials/index.html","http://docs.databricks.com/release-notes/index.html","http://docs.databricks.com/ingestion/index.html","http://docs.databricks.com/exploratory-data-analysis/index.html","http://docs.databricks.com/data-preparation/index.html","http://docs.databricks.com/data-sharing/index.html","http://docs.databricks.com/marketplace/index.html","http://docs.databricks.com/workspace-index.html","http://docs.databricks.com/machine-learning/index.html","http://docs.databricks.com/sql/index.html","http://docs.databricks.com/delta/index.html","http://docs.databricks.com/dev-tools/index.html","http://docs.databricks.com/integrations/index.html","http://docs.databricks.com/administration-guide/index.html","http://docs.databricks.com/security/index.html","http://docs.databricks.com/data-governance/index.html","http://docs.databricks.com/lakehouse-architecture/index.html","http://docs.databricks.com/reference/api.html","http://docs.databricks.com/resources/index.html","http://docs.databricks.com/whats-coming.html","http://docs.databricks.com/archive/index.html","http://docs.databricks.com/lakehouse/index.html","http://docs.databricks.com/getting-started/quick-start.html","http://docs.databricks.com/getting-started/etl-quick-start.html","http://docs.databricks.com/getting-started/lakehouse-e2e.html","http://docs.databricks.com/getting-started/free-training.html","http://docs.databricks.com/sql/language-manual/index.html","http://docs.databricks.com/error-messages/index.html","http://www.apache.org/","https://databricks.com/privacy-policy","https://databricks.com/terms-of-use"]
# web_links = ["https://www.databricks.com/product/startups", "https://www.databricks.com/why-databricks/executives", "https://www.databricks.com/product/data-lakehouse"]
# web_links = ["https://www.databricks.com/product/startups"]
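# list used for the run shown below; swap in one of the commented-out lists above to index Databricks pages
web_links = ["https://www.leab.eu/en/products","https://www.leab.eu/en/services/application-engineering"]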
loader = WebBaseLoader(web_links)
documents = loader.load()
len(documents)

2

In [17]:
doc = documents[0].page_content

In [18]:
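import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

# drop English stop words from the first scraped page
filtered_words = [word for word in doc.split(' ') if word not in stop_words]
filtered_text = ' '.join(filtered_words)

print(filtered_text)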

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Products // LEAB // mobile energy
Menu
Close search
Term*
Products
Services
Application Engineering
Logistics
Production
Training courses
Industries
Service industry
Transportation industry
Emergency services
Leisure industry
Support
Product consultation
Repair maintenance
Questions answers
Service order
Contact
Downloads
Company
Trade fair dates
Career
Quality environment
Extension
Blog
Search
Favourites
Comparison
Cart
DE
Start
Products
Our product portfolioInnovative high quality - mobile power supplyFor emergency, special commercial vehicles, also l
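
Scraped pages tend to come with stray indentation and many empty lines around menu labels. A minimal sketch of tidying that up before chunking is shown below. The helper `collapse_whitespace` is an illustrative name, not part of LangChain, and it does not remove the navigation labels themselves.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Strip each line, squeeze runs of spaces/tabs, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "\n\n   Menu\n   \n\nClose   search\n\n\nProducts\n"
print(collapse_whitespace(raw))
# Menu
# Close search
# Products
```

The same function could be applied to each document's `page_content` before the text is split.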

##Splitting in Chunks using Text Splitters
You have to split the text into small chunks. To do that, initialize `RecursiveCharacterTextSplitter` and call its `split_documents` method with the loaded documents.

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
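
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that slices a string into fixed-size windows, each sharing a few characters with the previous one. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs, lines and spaces); `chunk_text` is a hypothetical helper for illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slice `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk.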

##Creating Embeddings and Storing in Vector Store
You have to create embeddings for each chunk of text and store them in the vector store (i.e. FAISS). You will be using the `all-mpnet-base-v2` Sentence Transformer model to convert all pieces of text into vectors while storing them in the vector store.

In [21]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)
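
Conceptually, retrieval from the vector store means finding the stored vectors closest to the query vector. The sketch below does this by brute force over toy 2-D vectors, assuming Euclidean distance. Real embeddings have hundreds of dimensions and FAISS does the search far more efficiently; the helper names here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = sorted((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:k]]


# Toy "embeddings" (made up); each index would map back to a text chunk
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
nearest = top_k([0.9, 0.1], toy_vectors, k=2)
print([i for i, _ in nearest])  # [1, 0]
```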


##Initializing Chain
You have to initialize `ConversationalRetrievalChain`. This chain allows you to have a chatbot with memory while relying on a vector store to find relevant information from your document.

Additionally, you can return the source documents used to answer the question by passing the optional parameter `return_source_documents=True` when constructing the chain.

In [22]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
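
As a mental model, the chain can be thought of as three steps: fold the chat history into a standalone question, retrieve the most relevant chunks, and answer using those chunks. The sketch below mirrors that flow with plain functions. Everything in it (`conversational_retrieval` and the `fake_*` stand-ins) is hypothetical and only approximates what LangChain does internally.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def conversational_retrieval(
    question: str,
    chat_history: History,
    condense: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    # 1. Fold the chat history into a standalone question (skipped when there is no history).
    standalone = condense(question, chat_history) if chat_history else question
    # 2. Fetch the chunks most relevant to the standalone question.
    docs = retrieve(standalone)
    # 3. Answer the question using the retrieved chunks as context.
    return {"answer": answer(standalone, docs), "source_documents": docs}


# Stand-in callables so the sketch runs without a model (all hypothetical)
def fake_condense(q: str, history: History) -> str:
    return f"{q} (context: {history[-1][0]})"


def fake_retrieve(q: str) -> List[str]:
    return [f"chunk about: {q}"]


def fake_answer(q: str, docs: List[str]) -> str:
    return f"Answer to '{q}' using {len(docs)} chunk(s)"


out = conversational_retrieval("What is FAISS?", [], fake_condense, fake_retrieve, fake_answer)
print(out["answer"])  # Answer to 'What is FAISS?' using 1 chunk(s)
```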

Now, it’s time to do some Question-Answering on your own data!

In [23]:
chat_history = []

query = "What is Data lakehouse architecture in Databricks?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])


 Data Lakehouse architecture in Databricks is a new approach to data engineering that combines the benefits of both data warehousing and data lakes. It provides a unified platform for data engineering, data science, and data analytics, enabling users to work with data in a more agile and flexible manner. The Data Lakehouse architecture in Databricks consists of three main components: the Data Ingestion Layer, the Data Storage Layer, and the Data Processing Layer. Each layer has its own set of features and capabilities, which work together to provide a comprehensive solution for data engineering and analysis.


This time, your previous question and answer are passed in as chat history, which lets you ask follow-up questions.

In [24]:
chat_history = [(query, result["answer"])]

query = "What are Data Governance and Interoperability in it?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

  Of course! Data Governance is crucial when designing a Data Lakehouse architecture in Databricks. Here are some key aspects to consider:

1. Data Quality: Ensure that the data stored in the Data Lakehouse is accurate, complete, and consistent. This includes data validation, data cleansing, and data enrichment.
2. Data Security: Implement appropriate security measures to protect sensitive data from unauthorized access, theft, or misuse. This may involve encryption, access controls, and other security mechanisms.
3. Data Lineage: Establish clear data lineage and provenance to track the origin, history, and usage of data within the Data Lakehouse. This helps identify data sources, data transformations, and data consumers.
4. Data Retention: Define retention policies for data stored in the Data Lakehouse, including how long data should be retained, how it should be archived, and how it should be disposed of when no longer needed.
5. Data Access Control: Establish access controls to regul
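
For longer conversations, rebuilding `chat_history` by hand after every turn gets tedious. One possible approach is a small helper that queries the chain and appends each question and answer pair to a running list. The names `ask` and `fake_chain` are assumptions for illustration; `fake_chain` only stands in for the real chain so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Tuple


def ask(chain: Callable[[Dict], Dict], query: str, chat_history: List[Tuple[str, str]]) -> str:
    """Query the chain, then append the (question, answer) pair to chat_history."""
    result = chain({"question": query, "chat_history": chat_history})
    chat_history.append((query, result["answer"]))
    return result["answer"]


# Stand-in for the real chain (hypothetical): numbers its answers by turn
def fake_chain(inputs: Dict) -> Dict:
    return {"answer": f"answer #{len(inputs['chat_history']) + 1}"}


history: List[Tuple[str, str]] = []
ask(fake_chain, "first question", history)
ask(fake_chain, "follow-up", history)
print(history)
# [('first question', 'answer #1'), ('follow-up', 'answer #2')]
```

With the real chain, the call would look like `ask(chain, query, chat_history)`.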

In [17]:
print(result['source_documents'])

[Document(page_content='Databricks for Startups | DatabricksSkip to main contentWhy Databricks DiscoverFor ExecutivesFor Startups Lakehouse Architecture CustomersFeatured StoriesSee All CustomersPartnersCloud ProvidersDatabricks on AWS, Azure, and GCPConsulting & System IntegratorsExperts to build, deploy and migrate to DatabricksTechnology PartnersConnect your existing tools to your LakehouseC&SI Partner ProgramBuild, deploy or migrate to the LakehouseData PartnersAccess the ecosystem of data consumersPartner SolutionsFind custom industry and migration solutionsBuilt on DatabricksBuild, market and grow your businessProduct Databricks PlatformPlatform OverviewA unified platform for data, analytics and AIData ManagementData reliability, security and performanceSharingAn open, secure, zero-copy sharing for all dataData WarehousingETL and orchestration for batch and streaming dataGovernanceUnified governance for all data, analytics and AI assetsReal-Time AnalyticsReal-time analytics, AI a

##Finally…
Et voilà! You now have the capability to do question-answering on your own data using a powerful language model. Additionally, you can further develop it into a chatbot application using Streamlit or, as in the cells below, Gradio.

In [25]:
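def predict(query, history):
    # Gradio passes the running conversation as `history`, but it is not used here:
    # an empty chat history is sent, so every question is answered on its own
    chat_history = []

    result = chain({"question": query, "chat_history": chat_history})
    response = result['answer']

    # log the URL of the top-ranked source document
    source = result['source_documents'][0].metadata['source']
    print(source)

    return response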

In [None]:
print(result['source_documents'][0].metadata['source'])

http://docs.databricks.com/lakehouse-architecture/index.html


In [26]:
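import locale
# Colab workaround: report UTF-8 as the preferred encoding so that shell commands such as `!pip` do not fail with a locale error
locale.getpreferredencoding = lambda: "UTF-8"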

In [27]:
!pip install gradio



In [None]:
import gradio as gr

demo = gr.ChatInterface(fn=predict, title="Llama 2 QA Bot")
demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://f93614a2d8d0f4998a.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


https://www.leab.eu/en/products
https://www.leab.eu/en/products
https://www.leab.eu/en/products
