# Data Ingestion

In [24]:
from langchain_community.document_loaders import TextLoader

## Load a plain-text file as a list of Document objects
loader = TextLoader("speech.txt")
text_documents = loader.load()
text_documents

[Document(page_content='I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation.\n\nFive score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity.\n\nBut one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languished in the corners of American society and finds himself an exile in his own land. And so we\'ve come here today to dramatize a 

In [25]:
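import os
from dotenv import load_dotenv

## Read variables from a local .env file into the process environment
load_dotenv()

## Fail early if the key is missing (assigning None to os.environ would raise a TypeError)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")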

In [26]:
## Web-based loader

from langchain_community.document_loaders import WebBaseLoader
import bs4

## Load the HTML page, keeping only the post title, header, and content

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)

text_documents = loader.load()

In [27]:
text_documents

[Document(page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final re

In [28]:
## PDF Loader

from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('sample_pdf.pdf')

text_documents=loader.load()


could not convert string to float: '0.00-51177066' : FloatObject (b'0.00-51177066') invalid; use 0.0 instead
could not convert string to float: '0.00-60790265' : FloatObject (b'0.00-60790265') invalid; use 0.0 instead
could not convert string to float: '0.00-56221883' : FloatObject (b'0.00-56221883') invalid; use 0.0 instead


In [29]:
text_documents

[Document(page_content='Unsupervised Deep Embedding for Clustering Analysis\nJunyuan Xie JXIE@CS.WASHINGTON .EDU\nUniversity of Washington\nRoss Girshick RBG@FB.COM\nFacebook AI Research (FAIR)\nAli Farhadi ALI@CS.WASHINGTON .EDU\nUniversity of Washington\nAbstract\nClustering is central to many data-driven appli-\ncation domains and has been studied extensively\nin terms of distance functions and grouping al-\ngorithms. Relatively little work has focused on\nlearning representations for clustering. In this\npaper, we propose Deep Embedded Clustering\n(DEC), a method that simultaneously learns fea-\nture representations and cluster assignments us-\ning deep neural networks. DEC learns a map-\nping from the data space to a lower-dimensional\nfeature space in which it iteratively optimizes a\nclustering objective. Our experimental evalua-\ntions on image and text corpora show signiﬁcant\nimprovement over state-of-the-art methods.\n1. Introduction\nClustering, an essential data analysis a

In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

## Split into chunks of at most 1000 characters, with up to 20 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)
documents[:5]

[Document(page_content='Unsupervised Deep Embedding for Clustering Analysis\nJunyuan Xie JXIE@CS.WASHINGTON .EDU\nUniversity of Washington\nRoss Girshick RBG@FB.COM\nFacebook AI Research (FAIR)\nAli Farhadi ALI@CS.WASHINGTON .EDU\nUniversity of Washington\nAbstract\nClustering is central to many data-driven appli-\ncation domains and has been studied extensively\nin terms of distance functions and grouping al-\ngorithms. Relatively little work has focused on\nlearning representations for clustering. In this\npaper, we propose Deep Embedded Clustering\n(DEC), a method that simultaneously learns fea-\nture representations and cluster assignments us-\ning deep neural networks. DEC learns a map-\nping from the data space to a lower-dimensional\nfeature space in which it iteratively optimizes a\nclustering objective. Our experimental evalua-\ntions on image and text corpora show signiﬁcant\nimprovement over state-of-the-art methods.\n1. Introduction\nClustering, an essential data analysis a
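
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks will generally not match these. The helper name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input (an assumption for illustration, not data from the notebook)
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

A small overlap such as the 20 characters used above mainly keeps a sentence that straddles a chunk boundary from being cut off in both neighbouring chunks.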

In [31]:
## Chroma vector store

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

## Embed each chunk with OpenAI embeddings and index the vectors in Chroma
db = Chroma.from_documents(documents, OpenAIEmbeddings())

In [32]:
## Vector Database Querying

query="What is this paper about?"
result = db.similarity_search(query)
result[0].page_content

'ence on , pp. 8595–8598. IEEE, 2013.\nLeCun, Yann, Bottou, L ´eon, Bengio, Yoshua, and Haffner,\nPatrick. Gradient-based learning applied to document\nrecognition. Proceedings of the IEEE , 86(11):2278–\n2324, 1998.\nLewis, David D, Yang, Yiming, Rose, Tony G, and Li, Fan.\nRcv1: A new benchmark collection for text categoriza-\ntion research. JMLR , 2004.\nLi, Tao, Ma, Sheng, and Ogihara, Mitsunori. Entropy-\nbased criterion in categorical clustering. In ICML , 2004.\nLiu, Huan and Yu, Lei. Toward integrating feature selection\nalgorithms for classiﬁcation and clustering. IEEE Trans-\nactions on Knowledge and Data Engineering , 2005.\nLong, Jonathan, Shelhamer, Evan, and Darrell, Trevor.\nFully convolutional networks for semantic segmentation.\narXiv preprint arXiv:1411.4038 , 2014.\nMacQueen, James et al. Some methods for classiﬁcation\nand analysis of multivariate observations. In Proceed-\nings of the ﬁfth Berkeley symposium on mathematical\nstatistics and probability , pp. 281–297
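
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity in plain Python. The vectors, the chunk labels, and the helper names `cosine_similarity` and `top_k` are all made up for illustration; real embeddings have far more dimensions, and the distance metric a given vector store uses may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (text, embedding) pairs with invented values
toy_store = [
    ("abstract chunk", [0.9, 0.1, 0.0]),
    ("methods chunk", [0.6, 0.7, 0.1]),
    ("references chunk", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]

ranked = top_k(toy_query, toy_store, k=2)
print(ranked)
```

Because the ranking depends only on vector proximity, the top hit is not guaranteed to be the most useful passage; in the output above it is a bibliography chunk. Inspecting more than `result[0]`, for example by passing `k` to `similarity_search`, can help when judging retrieval quality.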

In [33]:
## FAISS Vector Database

from langchain_community.vectorstores import FAISS
db1=FAISS.from_documents(documents,OpenAIEmbeddings())

In [34]:
## Vector Database Querying

query="What is this paper about?"
result = db1.similarity_search(query)
result[0].page_content

'international conference on Information and knowledge\nmanagement , 2000.\nSrivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex,\nSutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A\nsimple way to prevent neural networks from overﬁtting.\nJMLR , 2014.\nSteinbach, Michael, Ert ¨oz, Levent, and Kumar, Vipin. The\nchallenges of clustering high dimensional data. In New\nDirections in Statistical Physics , pp. 273–309. Springer,\n2004.'