## DATA INGESTION

In [1]:
from langchain_community.document_loaders import TextLoader

loader=TextLoader("Speech.txt")
text_documents=loader.load()

text_documents

[Document(metadata={'source': 'Speech.txt'}, page_content='Long years ago, we made a tryst with destiny, and now the time comes when we shall redeem our pledge, not wholly or in full measure, but very substantially.\n\nAt the stroke of the midnight hour, when the world sleeps, India will awake to life and freedom. A moment comes, which comes but rarely in history, when we step out from the old to the new—when an age ends, and when the soul of a nation, long suppressed, finds utterance.\n\nIt is fitting that at this solemn moment, we take the pledge of dedication to the service of India and her people and to the still larger cause of humanity.\n\nTo the people of India, whose representatives we are, we make an appeal to join us with faith and confidence in this great adventure. The future beckons to us. Where do we go and what shall be our endeavor? To bring freedom and opportunity to the common man, to fight and end poverty and ignorance, and to build a strong, independent, and peacefu

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from a local .env file into os.environ

os.environ['OPENAI_API_KEY']=os.getenv("OPENAI_API_KEY")
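
The assignment in the cell above raises a `TypeError` when `OPENAI_API_KEY` is not defined, because `os.environ` only accepts string values. A minimal sketch of a guarded lookup follows; the helper name `require_env` is hypothetical and not part of `dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper written for this notebook.
    """
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value
```

With this in place, `os.environ["OPENAI_API_KEY"] = require_env("OPENAI_API_KEY")` would stop early with a readable error instead of a `TypeError`.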

In [3]:
# Web-based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load the HTML page, keeping only the post title, header, and content elements

loader=WebBaseLoader(web_path="https://lilianweng.github.io/posts/2023-06-23-agent/",
                     bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                         class_=("post-title","post-content","post-header")
                     )))

text_documents=loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
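
The warning above is harmless, but it can be avoided by defining `USER_AGENT` before the loader module is imported. The sketch below assumes the check runs when `langchain_community.document_loaders` is first imported, so the variable must be set earlier in the session; the identifier string is a made-up placeholder.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that is already defined in the environment.
os.environ.setdefault("USER_AGENT", "data-ingestion-notebook/0.1")

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
```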


In [4]:
text_documents

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake

In [5]:
# PDF loader (returns one Document per page)
from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader("Prospectus.pdf")

In [6]:
docs=loader.load()

In [7]:
docs

[Document(metadata={'producer': 'Microsoft® Office Word 2007', 'creator': 'Microsoft® Office Word 2007', 'creationdate': 'D:20230228111715', 'title': '1', 'author': 'Compaq', 'moddate': 'D:20230228111715', 'source': 'Prospectus.pdf', 'total_pages': 61, 'page': 0, 'page_label': '1'}, page_content='1\n \nPROSPECTUS \n \n(202\n3\n \n–\n \n202\n4\n)\n \n \n \n \n \n \n \n \n \n \nFor Admissions to Undergraduate, Masters‟ & \nPh.D. Programmes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nGOVIND BALLABH PANT \n \nUNIVERSITY OF AGRICULTURE & TECHNOLOGY, \nPANTNAGAR \n–\n \n263145, U.S. NAGAR\n \nUTTARAKHAND, INDIA'),
 Document(metadata={'producer': 'Microsoft® Office Word 2007', 'creator': 'Microsoft® Office Word 2007', 'creationdate': 'D:20230228111715', 'title': '1', 'author': 'Compaq', 'moddate': 'D:20230228111715', 'source': 'Prospectus.pdf', 'total_pages': 61, 'page': 1, 'page_label': '2'}, page_content='2 \nCONTENTS \nPARTICULARS                                   Page \nSOURCES OF

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

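# Split each page into chunks of at most 1000 characters, with 200 characters shared between neighbours
text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
documents=text_splitter.split_documents(docs)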

documents[:5]

[Document(metadata={'producer': 'Microsoft® Office Word 2007', 'creator': 'Microsoft® Office Word 2007', 'creationdate': 'D:20230228111715', 'title': '1', 'author': 'Compaq', 'moddate': 'D:20230228111715', 'source': 'Prospectus.pdf', 'total_pages': 61, 'page': 0, 'page_label': '1'}, page_content='1\n \nPROSPECTUS \n \n(202\n3\n \n–\n \n202\n4\n)\n \n \n \n \n \n \n \n \n \n \nFor Admissions to Undergraduate, Masters‟ & \nPh.D. Programmes\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nGOVIND BALLABH PANT \n \nUNIVERSITY OF AGRICULTURE & TECHNOLOGY, \nPANTNAGAR \n–\n \n263145, U.S. NAGAR\n \nUTTARAKHAND, INDIA'),
 Document(metadata={'producer': 'Microsoft® Office Word 2007', 'creator': 'Microsoft® Office Word 2007', 'creationdate': 'D:20230228111715', 'title': '1', 'author': 'Compaq', 'moddate': 'D:20230228111715', 'source': 'Prospectus.pdf', 'total_pages': 61, 'page': 1, 'page_label': '2'}, page_content='2 \nCONTENTS \nPARTICULARS                                   Page \nSOURCES OF
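
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks before merging pieces); the function name `sliding_chunks` is hypothetical and only illustrates the two parameters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as the 200-character overlap used in the cell above: context that straddles a boundary appears in both neighbouring chunks.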

## VECTOR EMBEDDINGS AND VECTOR STORE

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

# Embed with a local Ollama model (requires a running Ollama server with gemma:2b pulled)
embedding_model = OllamaEmbeddings(model="gemma:2b")

# Index the first 20 chunks in a Chroma vector store
db = Chroma.from_documents(documents[:20], embedding_model)

# Alternative: OpenAI embeddings
# from langchain_community.embeddings import OpenAIEmbeddings
# db = Chroma.from_documents(documents[:20], OpenAIEmbeddings())




In [None]:
# Query the Chroma vector store

query="Is there an Entrance exam for B.Tech"

result=db.similarity_search(query)
result[0].page_content

'CS249: ADVANCED DATA MINING\nInstructor: Yizhou Sun\nyzsun@cs.ucla.edu\nMay 2, 2017\nClustering Evaluation and Practical Issues'
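
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it (four by default in LangChain). The sketch below shows that idea on toy two-dimensional vectors with a brute-force scan. It assumes Euclidean distance; the metric and index a real store uses depend on its configuration, and the names `l2_distance` and `top_k` are hypothetical.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k stored vectors closest to the query."""
    scored = sorted((l2_distance(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" for three chunks and one query.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(top_k([0.9, 1.2], doc_vecs, k=2))
# → [1, 0]
```

Real vector stores replace the brute-force scan with an approximate or optimised index, but the ranking principle is the same.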

In [None]:
# Query the Chroma vector store again

query="Are there Sanctioned Seats for Phd"

result=db.similarity_search(query)
result

[Document(metadata={'page': 2, 'source': 'Clustering.pdf'}, page_content='Learnt \nClustering \nMethods\n3\nVector Data Text Data Recommender \nSystem\nGraph & Network\nClassification Decision Tree; Naïve \nBayes; Logistic \nRegression\nSVM; NN\nLabel Propagation\nClustering K-means; hierarchical\nclustering; DBSCAN; \nMixture Models; \nkernel k-means\nPLSA;\nLDA\nMatrix Factorization SCAN; Spectral \nClustering\nPrediction Linear Regression\nGLM\nCollaborative Filtering\nRanking PageRank\nFeature \nRepresentation\nWord embedding Network embedding'),
 Document(metadata={'page': 11, 'source': 'Clustering.pdf'}, page_content='Question\n•If we flip the ground truth cluster labels \n(2->1 and 1->2), will the evaluation results \nbe the same?\n12\nData points Output clustering Ground truth \nclustering (class) \na 1 2\nb 1 2\nc 2 2\nd 2 1'),
 Document(metadata={'page': 4, 'source': 'Clustering.pdf'}, page_content='Measuring Clustering Quality\n• Two methods: extrinsic vs. intrinsic  \n• Ext

In [None]:
# FAISS vector store built from the same 20 chunks

from langchain_community.vectorstores import FAISS
db1=FAISS.from_documents(documents[:20],embedding_model)

In [None]:
# Query the FAISS vector store

query="What is the application Procedure"

result=db1.similarity_search(query)
result

[Document(metadata={'source': 'Clustering.pdf', 'page': 11}, page_content='Question\n•If we flip the ground truth cluster labels \n(2->1 and 1->2), will the evaluation results \nbe the same?\n12\nData points Output clustering Ground truth \nclustering (class) \na 1 2\nb 1 2\nc 2 2\nd 2 1'),
 Document(metadata={'source': 'Clustering.pdf', 'page': 9}, page_content='Precision and Recall\n• Random Index (RI) = (TP+TN)/(TP+FP+FN+TN)\n• F-measure: 2P*R/(P+R)\n• P = TP/(TP+FP)\n• R = TP/(TP+FN)\n•Consider pairs of data points: \n• hopefully, two data points that are in the same cluster will be \nclustered into the same cluster (TP), and two data points that are \nin different clusters will be clustered into different clusters (TN).\n10\nSame cluster Different clusters\nSame class TP FN\nDifferent classes FP TN'),
 Document(metadata={'source': 'Clustering.pdf', 'page': 15}, page_content='Example: \nData Matrix and Dissimilarity Matrix\n16\npoi nt attribute1 attribute2\nx1 1 2\nx2 3 5\nx3 2 0\n