### Installing the Libraries

In [2]:
!pip install -q langchain
!pip install -q torch
!pip install -q transformers
!pip install -q sentence-transformers
!pip install -q datasets
!pip install -q faiss-cpu
!pip install -q sentencepiece

### Importing the Libraries

In [1]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
import sentencepiece



In [10]:
from transformers import AutoModelForSeq2SeqLM

### Document Loading

Dataset: Databricks Dolly 15k

In [2]:
# Specify the dataset name and the column containing the content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"  # or any other column you're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()

# Display the first 2 entries
data[:2]



[Document(page_content="Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}),
 Document(page_content='', metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'})]
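
The second record above has an empty `page_content`, since at least some rows in the dataset carry no `context`. Empty documents add nothing useful to a vector index, so it may be worth dropping them before splitting and embedding. The following is a minimal sketch of that filter; `Doc` and `drop_empty` are hypothetical names, with `Doc` standing in for LangChain's `Document` so that the snippet runs on its own. The same list comprehension should work on `data` directly.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_empty(documents):
    """Keep only documents whose page_content has non-whitespace text."""
    return [d for d in documents if d.page_content.strip()]


sample = [
    Doc("Virgin Australia is an Australian-based airline.", {"category": "closed_qa"}),
    Doc("", {"category": "classification"}),
    Doc("   ", {"category": "open_qa"}),
]

kept = drop_empty(sample)
print(len(kept), "of", len(sample), "documents kept")
```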

### Document Transformers

#### Text Splitters

In [3]:
# Create a RecursiveCharacterTextSplitter.
# It splits text into chunks of at most 1000 characters, with up to 150 characters of overlap between consecutive chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the loaded documents; split them into smaller chunks with the text splitter.
docs = text_splitter.split_documents(data)

In [4]:
docs[0]

Document(page_content="Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'})
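
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class tries to break on separators such as paragraphs and sentences first); `chunk_text` is a hypothetical helper that only illustrates how consecutive chunks share their boundary characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # the window already reached the end of the text
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk begins with the last character of the previous one, which is the same idea as the 150-character overlap used above, only at a smaller scale.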

### Text Embedding

In [5]:
# Using the HuggingFaceEmbeddings

# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-L6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,     # Provide the pre-trained model's path
    model_kwargs=model_kwargs, # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)




In [6]:
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.03833853453397751, 0.1234646737575531, -0.02864297851920128]
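
The `normalize_embeddings` option controls whether each vector is scaled to unit length. With it set to `False`, as above, a raw dot product between two embeddings depends on their magnitudes as well as their direction. The sketch below uses plain Python and made-up 2-dimensional vectors (an assumption for illustration; real embeddings from this model have many more dimensions) to show the difference between a dot product and cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    return dot(l2_normalize(a), l2_normalize(b))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print("dot product:", dot(a, b))
print("cosine similarity:", cosine_similarity(a, b))
```

The two vectors point in the same direction, so their cosine similarity is (up to floating-point rounding) 1, while the raw dot product grows with their lengths.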

#### Vector Stores

FAISS vector store

In [7]:
db = FAISS.from_documents(docs, embeddings)

In [8]:
# Search the vector store for the chunks most similar to our question

question = "What is butter making?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

The earliest references to peanut butter can be traced to Aztec and Inca civilizations, who ground roasted peanuts into a paste.[clarification needed] However, several people can be credited with the invention of modern peanut butter and the processes involved in making it.[according to whom?]

The US National Peanut Board credits three modern inventors with the earliest patents related to the production of modern peanut butter. Marcellus Gilmore Edson of Montreal, Quebec, Canada, obtained the first patent for a method of producing peanut butter from roasted peanuts using heated surfaces in 1884. Edson's cooled product had "a consistency like that of butter, lard, or ointment" according to his patent application which described a process of milling roasted peanuts until the peanuts reached "a fluid or semi-fluid state". He mixed sugar into the paste to harden its consistency.[citation needed]
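
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors lie closest to it. The following brute-force sketch illustrates that idea with squared Euclidean distance. The metric, the toy vectors, and the helper names `squared_l2` and `top_k` are assumptions for illustration; the index type and metric actually used depend on how the FAISS store was built.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(squared_l2(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort()  # smallest distance first
    return [(i, dist) for dist, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real ones have many more dimensions.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]

hits = top_k([1.0, 1.0], doc_vecs, k=2)
for index, distance in hits:
    print(index, distance)
```

A scan like this also explains the result above: the store returns whatever is nearest in embedding space, so a question about butter making surfaces a passage about peanut butter when nothing closer exists in the index.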


### Preparing the LLM Model

#### Using Google/Pegasus-Large

In [11]:
# Create a tokenizer object by loading the pretrained "google/pegasus-large" tokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")

# Create a sequence-to-sequence model object by loading the pretrained "google/pegasus-large" model.
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Creating a Question Answering Pipeline

In [17]:
# Specify the model name you want to use
model_name = "google/pegasus-large"

# Load the tokenizer associated with the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Define a text2text-generation pipeline using the model and tokenizer
question_answerer = pipeline(
    "text2text-generation", 
    model=model_name, 
    tokenizer=tokenizer,
)

# Create an instance of HuggingFacePipeline, which wraps the text2text-generation pipeline
# so LangChain can use it as an LLM, with model-specific arguments (temperature and max_length)
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Retrievers

In [18]:
# Create a retriever object from 'db' using the 'as_retriever' method.
# The retriever wraps the vector store and returns the documents most similar to a query.
retriever = db.as_retriever()

In [19]:
docs = retriever.get_relevant_documents("What is butter making?")
print(docs[0].page_content)

The earliest references to peanut butter can be traced to Aztec and Inca civilizations, who ground roasted peanuts into a paste.[clarification needed] However, several people can be credited with the invention of modern peanut butter and the processes involved in making it.[according to whom?]

The US National Peanut Board credits three modern inventors with the earliest patents related to the production of modern peanut butter. Marcellus Gilmore Edson of Montreal, Quebec, Canada, obtained the first patent for a method of producing peanut butter from roasted peanuts using heated surfaces in 1884. Edson's cooled product had "a consistency like that of butter, lard, or ointment" according to his patent application which described a process of milling roasted peanuts until the peanuts reached "a fluid or semi-fluid state". He mixed sugar into the paste to harden its consistency.[citation needed]


#### Retrieval QA Chain

In [20]:
# Create a retriever object from the 'db' with a search configuration where it retrieves up to 4 relevant splits/documents.
retriever = db.as_retriever(search_kwargs={"k": 4})

# Create a question-answering instance (qa) using the RetrievalQA class.
# It is configured with a language model (llm), the "refine" chain type, and the retriever we created.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever)
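
With the "refine" chain type, the retrieved documents are processed one at a time: the model produces an initial answer from the first document, then is asked to refine that answer with each further document. The sketch below shows that control flow only. The prompt wording, the `refine_answer` helper, and the `fake_llm` stub are all hypothetical and do not reproduce LangChain's actual prompts or implementation.

```python
def refine_answer(llm, question, contexts):
    """Answer from the first context, then refine the answer with each later one."""
    if not contexts:
        raise ValueError("at least one context is required")

    # Initial answer from the first retrieved chunk.
    answer = llm(f"Context:\n{contexts[0]}\n\nQuestion: {question}\nAnswer:")

    # Each remaining chunk gets one refinement call that sees the current answer.
    for ctx in contexts[1:]:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{ctx}\n"
            "Refine the existing answer if the new context helps."
        )
        answer = llm(prompt)
    return answer


calls = []


def fake_llm(prompt):
    """Stub model (illustration only): records the prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer(fake_llm, "Who is Thomas Jefferson?", ["chunk A", "chunk B", "chunk C"])
print(final, "after", len(calls), "model calls")
```

Because the model is called once per retrieved document, the number of calls grows with `k`, and the quality of the final answer depends on the model following the refinement instruction.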

In [24]:
question = "Who is Thomas Jefferson?"
result = qa.run({"query": question})


In [26]:
print(result)

We have provided an existing answer: ------------ Given the new context, refine the original answer to better answer the question.
