In [19]:
!pip install fsspec==2025.3.0
!pip install gcsfs==2025.3.0



Our task is to implement RAG using Langchain and Hugging Face!

1. Set up your environment: This ensures all the necessary tools are available to build the RAG system. Each library serves a specific role: LangChain orchestrates the components, transformers provides pre-trained models, sentence-transformers generates embeddings, datasets loads sample data, and FAISS enables fast similarity search.

Open your terminal or notebook environment.
Install all required libraries by running these commands:

In [20]:
!pip install -q langchain
!pip install -q torch
!pip install -q transformers
!pip install -q sentence-transformers
!pip install -q datasets
!pip install -q faiss-cpu
!pip install -U langchain-community



2. Load the dataset: To provide the system with information to retrieve from, you’ll load a real-world dataset. HuggingFaceDatasetLoader simplifies the process of accessing Hugging Face datasets and formatting them into documents that Langchain can process.

Before loading the dataset, upgrade the datasets library:

In [21]:
!pip install -Uq datasets

Import HuggingFaceDatasetLoader from langchain_community.document_loaders (the older langchain.document_loaders path is deprecated).
Specify the dataset name and content column:

In [23]:
from langchain_community.document_loaders import HuggingFaceDatasetLoader
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"

Create a HuggingFaceDatasetLoader instance and load the data as documents:

In [24]:
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
data = loader.load()
print(data[:2]) # Optional: Print the first 2 entries to verify loading

[Document(metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}, page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."'), Document(metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'}, page_content='""')]
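The second document printed above has a page_content of '""' because that row of the dataset has no context. A minimal sketch of dropping such rows before splitting follows. `Doc` is a hypothetical stand-in for LangChain's Document so the snippet runs without any installs, and `has_content` is an illustrative helper name, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def has_content(doc) -> bool:
    """Return True when the text is non-empty after removing wrapping quotes."""
    # The loader output above wraps each context in literal double quotes,
    # so an empty context shows up as the two-character string '""'.
    return bool(doc.page_content.strip().strip('"').strip())


sample = [
    Doc('"Virgin Australia is an Australian-based airline."'),
    Doc('""'),
]
non_empty = [doc for doc in sample if has_content(doc)]
print(len(non_empty))
```

In the notebook, the same check could be applied as `data = [d for d in data if has_content(d)]` before the documents are split.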


3. Split the documents: Language models and embedding models can only process a limited amount of text at once. Splitting large documents into smaller, overlapping chunks keeps each piece a manageable size for embedding and retrieval, and the overlap reduces the chance that context is cut off at a chunk boundary.

Import RecursiveCharacterTextSplitter from langchain.text_splitter.
Create a RecursiveCharacterTextSplitter instance with a chunk_size of 1000 and chunk_overlap of 150 (both measured in characters by default):

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

Split the loaded documents:

In [26]:
docs = text_splitter.split_documents(data)
print(docs[0]) # Optional: Print the first document chunk

page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."' metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
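To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch of overlapping chunks using a fixed-width sliding window. This is not the algorithm RecursiveCharacterTextSplitter uses (that class splits on separators such as paragraphs, lines, and words before merging pieces); `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same idea as the 150-character overlap configured above.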


4. Embed the text: Text needs to be converted into numerical representations (embeddings) so that similar pieces of text can be found efficiently. Using a sentence-transformer model creates embeddings that capture semantic meaning, enabling effective retrieval later.

Import HuggingFaceEmbeddings from langchain_community.embeddings.
Define the model path, model configurations, and encoding options:

In [27]:
from langchain_community.embeddings import HuggingFaceEmbeddings

modelPath = "sentence-transformers/all-MiniLM-L6-v2"  # sentence-transformer model used for embeddings
model_kwargs = {'device':'cpu'}  # run the embedding model on CPU
encode_kwargs = {'normalize_embeddings': False}  # keep raw (non unit-length) vectors

Initialize HuggingFaceEmbeddings:

In [28]:
embeddings = HuggingFaceEmbeddings(
  model_name=modelPath,
  model_kwargs=model_kwargs,
  encode_kwargs=encode_kwargs
)

(Optional) Test embedding creation:

In [29]:
text = "This is a test document."
query_result = embeddings.embed_query(text)
print(query_result[:3])

[-0.038338541984558105, 0.12346471846103668, -0.02864297851920128]
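The list returned by embed_query is a plain vector of floats, so two texts can be compared by comparing their vectors. The sketch below shows cosine similarity on toy vectors and illustrates what the normalize_embeddings option changes: once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity. The vectors and helper names here are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```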


5. Create a vector store: A vector store like FAISS indexes the embeddings, allowing fast and scalable similarity searches. This is how the system quickly finds relevant pieces of text when a query is made.

Import FAISS from langchain_community.vectorstores.
Create a FAISS vector store from the document chunks and embeddings:

In [30]:
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(docs, embeddings)
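Conceptually, a similarity search compares the query vector with the stored vectors and returns the k closest ones. The brute-force sketch below assumes squared Euclidean distance as the metric and uses made-up two-dimensional vectors; it illustrates the idea only and is not how FAISS is implemented internally.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Return (index, distance) pairs for the k vectors nearest to query."""
    scored = ((squared_l2(query, vec), idx) for idx, vec in enumerate(vectors))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]


# Toy "index": one vector per text, values invented for the example.
texts = ["airline history", "fish species", "album release"]
vectors = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

query = [1.0, 0.0]
for idx, dist in top_k(query, vectors, k=2):
    print(texts[idx], round(dist, 3))
```

A dedicated index library becomes worthwhile when the number of stored vectors makes this kind of full scan too slow.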

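6. Prepare the model: The model is responsible for producing answers from the retrieved documents. Loading a pre-trained model and wrapping it in a LangChain pipeline makes it easy to integrate with the retrieval system. The model used below, Intel/dynamic_tinybert, is an extractive question-answering model: it selects an answer span from a given context rather than generating free text.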

Import the necessary classes from transformers and langchain_community:

In [31]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain_community.llms import HuggingFacePipeline

Load the tokenizer and question-answering model:

In [32]:
tokenizer = AutoTokenizer.from_pretrained("Intel/dynamic_tinybert")
model = AutoModelForQuestionAnswering.from_pretrained("Intel/dynamic_tinybert")


Create a question-answering pipeline:

In [33]:
model_name = "Intel/dynamic_tinybert"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)
# Build a transformers question-answering pipeline around the model
question_answerer = pipeline(
  "question-answering",
  model=model_name,
  tokenizer=tokenizer,
  return_tensors='pt'
)

Invalid model-index. Not loading eval results into CardData.
Device set to use cpu


Create a LangChain pipeline wrapper:

In [34]:
llm = HuggingFacePipeline(
  pipeline=question_answerer,
  model_kwargs={"temperature": 0.7, "max_length": 512},
)



7. Build the Retrieval QA Chain: The Retrieval QA Chain connects the retriever (which finds relevant documents) with the LLM (which generates answers). This chain enables the full RAG process, where the system retrieves helpful context and then answers the user’s query based on that context.

Import RetrievalQA from langchain.chains.
Create a retriever from your FAISS database:

In [35]:
from langchain.chains import RetrievalQA

retriever = db.as_retriever(search_kwargs={"k": 4}) # Optional: You can adjust k for number of documents retrieved

Build the RetrievalQA chain:

In [36]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=False)
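With chain_type="refine", the retrieved documents are processed one at a time: the first document produces an initial answer, and each later document is used to revise it. The sketch below shows that control flow in plain Python. `fake_llm` is a stand-in callable rather than a real model, and the prompt wording is simplified for illustration, so treat it as an assumption about the general shape rather than LangChain's exact templates.

```python
def refine_answer(llm, question, documents):
    """Answer from the first document, then revise the answer with each later one."""
    if not documents:
        return ""

    initial_prompt = (
        "Context information is below.\n"
        "------------\n"
        f"{documents[0]}\n"
        "------------\n"
        "Given the context information and not prior knowledge, "
        f"answer the question: {question}\n"
    )
    answer = llm(initial_prompt)

    for doc in documents[1:]:
        refine_prompt = (
            f"The question is: {question}\n"
            f"The existing answer is: {answer}\n"
            "Refine the existing answer using the new context below, "
            "or repeat it if the context does not help.\n"
            f"{doc}\n"
        )
        answer = llm(refine_prompt)
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in model: records the prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer(fake_llm, "When did Virgin Blue start?", ["doc A", "doc B", "doc C"])
print(final, len(calls))
```

One consequence of this design is that the model is called once per retrieved document, so k=4 means four model calls per question.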

8. Test your RAG system: Running a test query allows you to verify that all components are working together. This step ensures that documents are retrieved correctly and that the model generates meaningful answers based on the retrieved context.

Define your question:

In [43]:
question = "Why were all the women I have dated toxic?"

Run the QA chain and print the result:

In [44]:
result = qa.run({"query": question})
print(result) # Or print(result["result"]) if the output is a dictionary



ValueError: Context information is below. 
------------
"For Bitter or Worse is the sixth studio album from the Dutch singer Anouk. The album was released on 18 September 2009, via the record label EMI.\n\nThe first single from the album, \"Three Days in a Row\" was released in August. It reached the top of the Netherlands charts in September 2009, making it Anouk's first number one in the country. In June of the same year, one of the songs recorded for the album, \"Today\", was released as promo material. It was so successful that, despite never being released as an official single, the song reached number 50 in the Dutch chart. The second single Woman, was sent to radio stations at the end of October 2009. After just one day the single was at number one on airplay chart. The single was released physically on 24 November 2009."
------------
Given the context information and not prior knowledge, answer the question: Why were all the women I have dated toxic?
 argument needs to be of type (SquadExample, dict)
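
The error above is raised by the transformers question-answering pipeline, which expects a question paired with a context rather than a single prompt string, while the chain sends its whole prompt as plain text. One possible workaround, sketched here under assumptions, is to skip the chain for this model: fetch documents from the retriever and hand the question and the joined context to the pipeline directly. `build_qa_input` is a hypothetical helper, and `qa_pipeline` in the comments stands for the question-answering pipeline object created in step 6.

```python
def build_qa_input(question, contexts, separator="\n\n"):
    """Pair the question with the retrieved text, as an extractive QA pipeline expects."""
    cleaned = [c.strip() for c in contexts if c and c.strip()]
    if not cleaned:
        raise ValueError("No non-empty context to answer from")
    return {"question": question, "context": separator.join(cleaned)}


qa_input = build_qa_input(
    "When did Virgin Australia start operating?",
    ["Virgin Australia commenced services on 31 August 2000 as Virgin Blue.", ""],
)
print(qa_input["context"])

# Assumed usage inside the notebook (not executed here; needs the model and index):
# retrieved = retriever.invoke(question)
# qa_input = build_qa_input(question, [d.page_content for d in retrieved])
# print(qa_pipeline(**qa_input)["answer"])
```

To keep the RetrievalQA chain instead, the usual alternative is to wrap a generative pipeline (for example a text-generation or text2text-generation model) in HuggingFacePipeline, since those accept a single prompt string.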