In [14]:
#importing the libraries:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.llms import HuggingFaceHub
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

In [4]:
#reading every pdf in the folder:
loader = PyPDFDirectoryLoader("./docs")
documents = loader.load()

#splitting into chunks:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)
final_document = text_splitter.split_documents(documents)
final_document[0]

Document(metadata={'source': 'docs\\Databricks-Big-Book-Of-GenAI-FINAL.pdf', 'page': 1}, page_content='THE BIG BOOK OF GENERATIVE AICONTENTSIntroduction  ............................................................................................................................................................................................................ 3\nThe Path to Deploying Production-Quality GenAI Applications  ............................................................................................. 5')
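
To make the `chunk_size=500` / `chunk_overlap=50` settings concrete, here is a minimal sketch of a fixed-window splitter. It is an illustration only, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraphs, lines and spaces, so its chunks are usually shorter than the limit and the overlap is approximate. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Fixed-window splitter: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so neighbours share chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# hypothetical sample text: 1200 digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The last 50 characters of one chunk reappear as the first 50 of the next, which is what keeps a sentence that straddles a boundary retrievable from either side.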

In [7]:
#initializing embedding technique:
hugging_face_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings':True}
)


In [11]:
#creating the vector store from the first 100 chunks only:
vector_store = FAISS.from_documents(final_document[:100],hugging_face_embeddings)
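
The embeddings were created with `normalize_embeddings=True`, so every vector has unit length. A small sketch of why that matters: for unit vectors the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by smallest distance and ranking by highest cosine similarity give the same order. This assumes the store ranks by Euclidean distance, which is my understanding of the default for the LangChain FAISS wrapper; the helper names and toy vectors below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# toy 3-dimensional "embeddings" (real bge-small vectors are much longer)
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)            # cosine similarity, since both are unit vectors
distance = squared_l2(a, b)   # squared Euclidean distance
print(round(cosine, 4), round(distance, 4), round(2 - 2 * cosine, 4))
```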

In [12]:
#querying with similarity search:
query = "What is DBRX?"
relevant_documents = vector_store.similarity_search(query)
print(relevant_documents[0].page_content)

What Is DBRX?
DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token 
prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 
36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared 
to other open MoE models like Mixtral and Grok-1, DBRX is fine-grained, meaning it uses a larger number of


In [13]:
#creating a retriever object:
retriever = vector_store.as_retriever(search_type='similarity',
                                      search_kwargs={"k":3})
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceBgeEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001C9592D48F0>, search_kwargs={'k': 3})
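
What `search_kwargs={"k":3}` asks for can be sketched as a brute-force top-k lookup: score every stored vector against the query and keep the three best. This is a toy stand-in, not how FAISS is implemented; `top_k` and the 2-dimensional vectors are hypothetical.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors with the highest dot product
    against query_vec, best match first."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# toy unit vectors standing in for chunk embeddings
toy_docs = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
]
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_docs, k=3))
```

A larger `k` gives the LLM more context to work with, at the cost of a longer prompt.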

In [18]:
#setting the hugging face api token (paste your own token between the quotes):
import os
os.environ['HUGGINGFACEHUB_API_TOKEN']= ""

In [20]:
#loading a hugging face model:
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-v0.1",
    model_kwargs={"temperature":0.1,
                  "max_length":500}
)

In [22]:
#creating a prompt template:
template = '''
use the following context to answer the questions asked.
{context}
Question:{question}
'''

prompt = PromptTemplate(template=template,
                        input_variables=["context","question"])

In [23]:
#creating a retrieval QA chain:
retrievalQA = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt":prompt}
)
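
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the `{context}` slot of the template in a single prompt. A rough sketch of that assembly, assuming chunks are joined with a blank line (which matches the output printed further down); `build_stuff_prompt`, `TEMPLATE` and the toy chunks are hypothetical stand-ins, not the chain's internals.

```python
# same template text as in the notebook
TEMPLATE = '''
use the following context to answer the questions asked.
{context}
Question:{question}
'''

def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks and fill the template in one shot."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)

# toy strings standing in for retrieved page_content values
toy_chunks = [
    "DBRX is a mixture-of-experts model.",
    "DBRX is available via APIs.",
]
prompt_text = build_stuff_prompt(toy_chunks, "What is DBRX?")
print(prompt_text)
```

Because everything goes into one prompt, `k` times `chunk_size` has to fit within the model's context window.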

In [25]:
#testing the model with a query:
query = "How to Get Started With DBRX on Databricks"
response = retrievalQA.invoke({"query":query})
print(response['result'])


use the following context to answer the questions asked.
of models we have built and brought to production with our customers.
To build DBRX, we leveraged the same suite of Databricks tools that are available to our customers. We 
managed and governed our training data using Unity Catalog. We explored this data using newly acquired  
Lilac AI . We processed and cleaned this data using Apache Spark™ and Databricks notebooks. We trained 
DBRX using optimized versions of our open-source training libraries: MegaBlocks , LLM Foundry , Composer ,

THE BIG BOOK OF GENERATIVE AIThe weights of the base model ( DBRX Base ) and the fine-tuned model ( DBRX Instruct ) are available on Hugging 
Face under an open license. Starting today, DBRX is available for Databricks customers to use via APIs, and 
Databricks customers can pretrain their own DBRX-class models from scratch or continue training on top of  
one of our checkpoints using the same tools and science we used to build it. DBRX is already
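
The printed result above begins with the template's own first line, which suggests the model is returning the prompt followed by its continuation. One possible workaround, sketched here with a hypothetical helper, is to strip the prompt from the front of the generated text. In the notebook the exact prompt would first have to be rebuilt from `response['source_documents']` and the query; the strings below are toy placeholders.

```python
def strip_echoed_prompt(generated, prompt_text):
    """Drop the prompt from the front of the generated text if it was echoed."""
    if generated.startswith(prompt_text):
        return generated[len(prompt_text):].lstrip()
    return generated

# toy example: the text sent to the model, and a reply that echoes it
sent = "use the following context to answer the questions asked.\n<context>\nQuestion:What is DBRX?\n"
echoed = sent + "DBRX is an open LLM."
print(strip_echoed_prompt(echoed, sent))
```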