### Different ways of loading data

In [2]:
# Data Ingestion
from langchain_community.document_loaders import TextLoader
loader=TextLoader("speech.txt")
text_document=loader.load()
text_document

[Document(metadata={'source': 'speech.txt'}, page_content='Fourscore and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. That noble declaration was not merely a statement for its time, but a promise meant for all generations to come. The founders pledged their lives, their fortunes, and their sacred honor to establish a republic built on freedom and justice.\n\nToday, we stand upon this field, consecrated by the sacrifice of those who gave their last full measure of devotion. They did not falter, though the burden was heavy and the cost immeasurable. Their courage and their blood have made this ground hallowed far beyond any words we can speak. We cannot dedicate, we cannot consecrate, we cannot hallow this soil; for it has already been sanctified by the deeds of brave men, both living and dead, who struggled here with unyielding spirit.\n\nYet it falls to us, the living, 

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

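# Fail early if the key is missing: assigning None to os.environ raises a TypeError.
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
os.environ["OPENAI_API_KEY"] = api_key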

In [4]:
# web based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load only the post title, header, and content from the HTML page
# (chunking and indexing happen in later cells).
loader = WebBaseLoader(
    web_path="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/",
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header"))
    ),
)

text_documents=loader.load()
text_documents

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://lilianweng.github.io/posts/2024-11-28-reward-hacking/'}, page_content='\n\n      Reward Hacking in Reinforcement Learning\n    \nDate: November 28, 2024  |  Estimated Reading Time: 37 min  |  Author: Lilian Weng\n\n\nReward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task. Reward hacking exists because RL environments are often imperfect, and it is fundamentally challenging to accurately specify a reward function.\nWith the rise of language models generalizing to a broad spectrum of tasks and RLHF becomes a de facto method for alignment training, reward hacking in RL training of language models has become a critical practical challenge. Instances where the model learns to modify unit tests to pass coding tasks, or where responses contain biases that mimic a user’s preference, are pretty concerning and are 

In [5]:
# pdf reader
from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('attention.pdf')
docs = loader.load()
docs

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser ∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe super

### Text splitter

In [6]:
# chunk_size=1000: each chunk holds at most 1000 characters.
# chunk_overlap=200: consecutive chunks share up to 200 characters.
# The overlap matters because an important idea can be cut off at the end of
# one chunk; repeating its tail in the next chunk keeps that context available
# when the LLM processes consecutive chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
documents=text_splitter.split_documents(docs)
documents[:5]

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Attention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser ∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe super
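
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size sliding-window chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); the helper name `chunk_with_overlap` and the sample string are assumptions made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only, not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk starts with the last four characters of the previous one, which is the same idea as the 200-character overlap used above, only at a smaller scale.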

### Vector embedding techniques

In [9]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

db=Chroma.from_documents(documents[:20],OpenAIEmbeddings())

In [10]:
# vector database
query = "author of paper"
result=db.similarity_search(query)
result[0].page_content

'efﬁcient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and\nimplementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating\nour research.\n†Work performed while at Google Brain.\n‡Work performed while at Google Research.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.'

In [12]:
# FAISS vector store
from langchain_community.vectorstores import FAISS
db1=FAISS.from_documents(documents[:20],OpenAIEmbeddings())

In [13]:
query = "author of paper"
result=db1.similarity_search(query)
result[0].page_content

'efﬁcient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and\nimplementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating\nour research.\n†Work performed while at Google Brain.\n‡Work performed while at Google Research.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.'
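
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. The vectors, texts, and the helper names `cosine_similarity` and `top_k` are assumptions for illustration; the real stores may use a different metric (for example Euclidean distance) and an index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
toy_store = [
    ("chunk about authors", [0.9, 0.1, 0.0]),
    ("chunk about attention", [0.1, 0.9, 0.2]),
    ("chunk about training", [0.0, 0.3, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_store, k=2))
# ['chunk about authors', 'chunk about attention']
```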

In [14]:
from langchain_community.llms import Ollama
llm = Ollama(model="gemma3:1b")
llm

Ollama(model='gemma3:1b')

In [None]:
# Design the chat prompt template
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context. Think step by step before providing a detailed answer. I will tip you $1000 if the user finds the answer helpful.
<context>{context}
</context>
Question: {input}""")
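
The `{context}` and `{input}` placeholders are filled in when the prompt is invoked. As a rough sketch of that substitution, the snippet below uses plain `str.format` as a stand-in for the template; `ChatPromptTemplate` itself produces chat messages rather than a bare string. The shortened template, the retrieved chunks, and the question are hypothetical values for illustration.

```python
# Shortened stand-in for the prompt text above (assumption: plain string formatting).
template = (
    "Answer the following question based only on the provided context.\n"
    "<context>{context}\n"
    "</context>\n"
    "Question: {input}"
)

# Hypothetical chunks, standing in for what a vector store search would return.
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

filled = template.format(
    context="\n\n".join(retrieved_chunks),  # all chunks joined into one context block
    input="Who wrote the paper?",           # the user's question
)
print(filled)
```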