# Data Ingestion

In [1]:
from langchain_community.document_loaders import TextLoader

In [2]:
loader = TextLoader("exmples.txt")
text_docs = loader.load()
print(text_docs)

[Document(metadata={'source': 'exmples.txt'}, page_content='This is a small sample document for testing RAG pipelines.\nIt contains a few sentences about LangChain and retrieval.\nYou can add more lines to improve chunking and recall.\nRAG systems fetch relevant text before generating answers.\n')]


In [3]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

load_dotenv()

True

In [4]:
llm = ChatOpenAI(
    model = "arcee-ai/trinity-large-preview:free",
    temperature = 0.9,
    base_url = "https://openrouter.ai/api/v1",
    default_headers={
        "HTTP-Referer": "http://localhost:8000",
        "X-Title": "RAG with Langchain"
    }
)

In [5]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load the web page (chunking and indexing follow in later cells)

loader = WebBaseLoader(
    web_path="https://www.geeksforgeeks.org/artificial-intelligence/what-is-artificial-intelligence-ai/",
    bs_kwargs={"parse_only": bs4.SoupStrainer(
        class_=["ArticleHeader_article-title__futDC", "MainArticleContent_articleMainContentCss__b_1_R article--viewer_content", "html-chunk"]
    )}
)

USER_AGENT environment variable not set, consider setting it to identify your requests.
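
The warning above appears when no `USER_AGENT` is present in the environment. A minimal sketch of one way to silence it, assuming the variable is read when the loader module is imported (so this would need to run before the `WebBaseLoader` import); the value `rag-notebook/0.1` is a hypothetical placeholder, not something from the original notebook:

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that is already configured (for example via .env).
os.environ.setdefault("USER_AGENT", "rag-notebook/0.1")
print(os.environ["USER_AGENT"])
```

Putting `USER_AGENT=...` in the `.env` file and calling `load_dotenv()` before the import should have the same effect.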


In [6]:
docs = loader.load()
print(docs[0].page_content)

What is Artificial Intelligence (AI)Artificial Intelligence (AI) is a technology that enables machines and computers to perform tasks that typically require human intelligence. It helps systems learn from data, recognize patterns and make decisions to solve complex problems. It is used in healthcare, finance, e-commerce and transportation offering personalized recommendations and enabling self-driving cars.Core Concepts of AIAI is based on core concepts and technologies that enable machines to learn, reason and make decisions on their own. Let's see some of the concepts:1. Machine Learning (ML)Machine Learning is a subset of artificial intelligence (AI) that focuses on building systems that can learn from and make decisions based on data. Instead of being explicitly programmed to perform a task, a machine learning model uses algorithms to identify patterns within data and improve its performance over time without human intervention.2. Generative AIGenerative AI is designed to create ne

In [7]:
from langchain_community.document_loaders import PyPDFLoader

path = "AI-for-Education-RAG.pdf"
loader = PyPDFLoader(path)
docs = loader.load()

# Data Splitting

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter


In [9]:
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = recursive_splitter.split_documents(docs)
print(len(documents))

18


In [10]:
print(documents[0].page_content)
print(documents[1].page_content)

Using your own content in LLM’s -
Retrieval Augmented Generation
(RAG)
Gen-AI can help with education in many ways
Saves time
• AI lesson planning reduces the time teachers spend preparing lessons. 
• Can automate marking using AI – image recognition or voice technologies to assess students as 
they read aloud.
Improved quality
• Adapt high quality resources to your context (languages or images).
• Can help teachers clarify concepts they may have forgotten.
• Adapt lessons/examples to children’s previous answers.
Improved scalability
• Access on phones (e.g. chatbot to support teachers) enables wide adoption. 
• Rapid integration of new national policies, curricula and best practices by automatically updating 
knowledge bases of LLMs.


In [11]:
character_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
character_documents = character_splitter.split_documents(docs)
print(len(character_documents))

18
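
Both splitters report 18 chunks, which matches the 18 pages of the PDF (`total_pages: 18` in the metadata further down), presumably because each page is already shorter than `chunk_size`. To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm LangChain uses (the real splitters prefer to cut at separators such as paragraph breaks); `sliding_window_chunks` is a hypothetical helper for illustration only:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.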


# Embeddings and Vector Store

In [12]:
from langchain_huggingface import HuggingFaceEndpointEmbeddings

embeddings = HuggingFaceEndpointEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2",
    task="feature-extraction",
)

In [14]:
from langchain_community.vectorstores import Chroma

In [None]:
from langchain_community.embeddings import VoyageEmbeddings
import os

# Fail fast with a KeyError if the API key is missing from the environment
os.environ["VOYAGE_API_KEY"]

voyage_embeddings = VoyageEmbeddings(
    model="voyage-2",
    batch_size=32
)
vectorstore = Chroma.from_documents(
    documents,
    voyage_embeddings,
    persist_directory="chroma_db"
)
# Writes the index to disk. On Chroma 0.4+ this happens automatically when
# persist_directory is set, and persist() is deprecated there.
vectorstore.persist()


In [None]:
query = "What is Retrieval Augmented Generation"

In [27]:
result = vectorstore.similarity_search(query)
result

[Document(metadata={'producer': 'PyPDF', 'source': 'AI-for-Education-RAG.pdf', 'page_label': '7', 'moddate': '2024-03-04T11:16:13+00:00', 'total_pages': 18, 'page': 6, 'creator': 'PyPDF', 'creationdate': '2024-03-04T11:16:13+00:00'}, page_content='What is Retrieval Augmented Generation (RAG)?\n• RAG is an extension of Large Language Models (LLMs)\n• LLMs are things like GPT4, Gemini, Claude, LLaMA, Mistral, etc.\n• So, to understand what RAG is, it first helps to recap what an LLM is (and \nwhat it isn’t)\n7'),
 Document(metadata={'creator': 'PyPDF', 'page': 4, 'moddate': '2024-03-04T11:16:13+00:00', 'producer': 'PyPDF', 'page_label': '5', 'total_pages': 18, 'creationdate': '2024-03-04T11:16:13+00:00', 'source': 'AI-for-Education-RAG.pdf'}, page_content='In this series, we will talk through different ways \nusing core education materials. \nToday we focus on Retrieval Augmented Generation \n(RAG)\n1. Prompt Engineering \n2. RAG systems\n3. Fine-tuning \n4. Rebuilding foundational model

In [22]:
result[0].page_content

'What is Retrieval Augmented Generation (RAG)?\n• RAG is an extension of Large Language Models (LLMs)\n• LLMs are things like GPT4, Gemini, Claude, LLaMA, Mistral, etc.\n• So, to understand what RAG is, it first helps to recap what an LLM is (and \nwhat it isn’t)\n7'

In [23]:
result[1].page_content

'In this series, we will talk through different ways \nusing core education materials. \nToday we focus on Retrieval Augmented Generation \n(RAG)\n1. Prompt Engineering \n2. RAG systems\n3. Fine-tuning \n4. Rebuilding foundational models.'

In [24]:
result[2].page_content

'Using your own content in LLM’s -\nRetrieval Augmented Generation\n(RAG)'

In [25]:
result[3].page_content

'There are many considerations for a RAG model\nSplit into the two key components: Retrieval and Generation\nRetrieval: Finding relevant information\n• It’s like going to a huge library and finding the most relevant books to answer the \nquestion\nGeneration: How do we use the retrieved information for response\n• Like an expert scholar summarises the information in the books you have picked out\n15'
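
The four results above are the chunks whose embeddings lie closest to the embedding of the query. A minimal sketch of that ranking step, using cosine similarity on made-up three-dimensional vectors; the metric a real vector store applies depends on its configuration, and `toy_vectors`, `toy_query`, and `top_k` are hypothetical names for illustration:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up embeddings; real ones have hundreds of dimensions.
toy_vectors = {
    "what_is_rag": [0.9, 0.1, 0.0],
    "series_intro": [0.6, 0.4, 0.1],
    "lesson_planning": [0.1, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_vectors, k=2)
print(ranking)
# ['what_is_rag', 'series_intro']
```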

# Retriever and Chain in LangChain

In [28]:
model = ChatOpenAI(
    model = "arcee-ai/trinity-large-preview:free",
    temperature = 0.9,
    base_url = "https://openrouter.ai/api/v1",
    default_headers={
        "HTTP-Referer": "http://localhost:8000",
        "X-Title": "RAG with Langchain"
    }
)

In [33]:
from langchain_core.prompts import ChatPromptTemplate

prompt1 = ChatPromptTemplate.from_template("""
    Answer the following question based only on the provided context.
    Make sure the answer is satisfactory.
    <context>
    {context}
    </context>
    Question: {input}
    """)

## Chain

In [32]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain


In [34]:
chain_document = create_stuff_documents_chain(model, prompt1)
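
"Stuffing" means placing the text of every retrieved document into the `{context}` slot of the prompt in a single pass. A rough sketch of that idea in plain Python, not LangChain's implementation; `Doc`, `TEMPLATE`, and `stuff_prompt` are hypothetical stand-ins, and the blank-line separator is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is needed here."""
    page_content: str


TEMPLATE = (
    "Answer the following question based only on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)


def stuff_prompt(docs, question, separator="\n\n"):
    """Join all document texts and substitute them into the prompt template."""
    context = separator.join(d.page_content for d in docs)
    return TEMPLATE.format(context=context, input=question)


toy_docs = [
    Doc("RAG is an extension of LLMs."),
    Doc("Retrieval finds relevant information."),
]
filled = stuff_prompt(toy_docs, "What is RAG?")
print(filled)
```

Because everything lands in one prompt, this approach only works while the retrieved chunks fit in the model's context window.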

## Retriever

In [35]:
retriever = vectorstore.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'VoyageEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000001E3B0C84CD0>, search_kwargs={})

In [None]:
# Retriever with chain
from langchain_classic.chains import create_retrieval_chain

In [37]:
retrieval_chain = create_retrieval_chain(retriever, chain_document)

In [40]:
response = retrieval_chain.invoke({"input": "What is Retrieval Augmented Generation"})
response

{'input': 'What is Retrieval Augmented Generation',
 'context': [Document(metadata={'page_label': '7', 'creator': 'PyPDF', 'creationdate': '2024-03-04T11:16:13+00:00', 'page': 6, 'total_pages': 18, 'source': 'AI-for-Education-RAG.pdf', 'moddate': '2024-03-04T11:16:13+00:00', 'producer': 'PyPDF'}, page_content='What is Retrieval Augmented Generation (RAG)?\n• RAG is an extension of Large Language Models (LLMs)\n• LLMs are things like GPT4, Gemini, Claude, LLaMA, Mistral, etc.\n• So, to understand what RAG is, it first helps to recap what an LLM is (and \nwhat it isn’t)\n7'),
  Document(metadata={'page': 4, 'creator': 'PyPDF', 'page_label': '5', 'total_pages': 18, 'source': 'AI-for-Education-RAG.pdf', 'moddate': '2024-03-04T11:16:13+00:00', 'producer': 'PyPDF', 'creationdate': '2024-03-04T11:16:13+00:00'}, page_content='In this series, we will talk through different ways \nusing core education materials. \nToday we focus on Retrieval Augmented Generation \n(RAG)\n1. Prompt Engineering \n

In [41]:
response['answer']

'Retrieval Augmented Generation (RAG) is an extension of Large Language Models (LLMs) that combines two key components: Retrieval and Generation. The Retrieval component involves finding relevant information from a large dataset, similar to searching for the most relevant books in a library to answer a question. The Generation component involves using the retrieved information to create a coherent and accurate response, much like an expert scholar summarizing the information from the selected books. RAG enhances the capabilities of LLMs by incorporating external data to improve the quality and relevance of the generated responses.'
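
The response dictionary above carries three keys: the original `input`, the retrieved `context`, and the generated `answer`. A simplified sketch of the flow that produces that shape, with stub functions in place of the real retriever and LLM chain; `run_retrieval_chain`, `fake_retrieve`, and `fake_combine` are hypothetical names, not LangChain APIs:

```python
def run_retrieval_chain(retrieve, combine_docs, inputs):
    """Retrieve context for inputs["input"], then generate an answer from it."""
    context = retrieve(inputs["input"])
    answer = combine_docs({**inputs, "context": context})
    return {**inputs, "context": context, "answer": answer}


# Stubs standing in for the vector store retriever and the LLM-backed chain.
def fake_retrieve(question):
    return ["RAG is an extension of Large Language Models (LLMs)"]


def fake_combine(payload):
    return f"Answer built from {len(payload['context'])} retrieved chunk(s)."


toy_response = run_retrieval_chain(fake_retrieve, fake_combine, {"input": "What is RAG?"})
print(sorted(toy_response))
# ['answer', 'context', 'input']
```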