### Data Ingestion

### **LOADING DATA**

In [1]:
from langchain_community.document_loaders import TextLoader
loader=TextLoader("speech.txt")

text_documents=loader.load()
text_documents

[Document(page_content='This is a day I’ve been looking forward to for two and a half years. Every once in a while, a revolutionary product comes along that changes everything. And Apple has been—well, first of all, one’s very fortunate if you get to work on just one of these in your career. Apple’s been very fortunate. It’s been able to introduce a few of these into the world. 1984, we introduced the Macintosh. It didn’t just change Apple, it changed the whole computer industry. In 2001, we introduced the first iPod. And it didn’t just change the way we all listen to music, it changed the entire music industry. Well, today, we’re introducing three revolutionary products of this class.', metadata={'source': 'speech.txt'})]
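
The output above shows the shape a loader returns: a list of documents, each with `page_content` and a `metadata` dict holding the `source` path. As a rough illustration of that shape only, here is a minimal standard-library sketch. `SimpleDocument` and `load_text` are hypothetical names made up for this example, not LangChain classes, and the sample file is created on the fly so the snippet runs without `speech.txt`.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read a whole text file into a single document and record its source path."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("hello loader", encoding="utf-8")
    loaded = load_text(sample_path)

print(loaded[0].page_content)
print(loaded[0].metadata)
```

Keeping the source path in `metadata` is what lets later steps, such as splitting and retrieval, trace each chunk back to the file it came from.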

In [2]:
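import os
from dotenv import load_dotenv

# Read key=value pairs from a local .env file into the process environment.
load_dotenv()

# Turn on LangSmith tracing for this notebook.
os.environ["LANGCHAIN_TRACING_V2"]="true"
# Assumes LANGCHAIN_API_KEY is set (for example in .env); if it is missing,
# os.getenv returns None and this assignment raises a TypeError.
os.environ["LANGCHAIN_API_KEY"]=os.getenv("LANGCHAIN_API_KEY")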

##### Web-based loader

In [3]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

#### Load the content of the web page; chunking and indexing come later.

In [4]:
loader=WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-single", "post-content", "post-header"))
    ),
)

text_documents=loader.load()
# text_documents

##### Reading PDF

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader=PyPDFLoader("study.pdf")
docs=loader.load()
docs

[Document(page_content='ITDO6014 \nAI AND DS-1 \nModule 4: Introduction to DS CS380 1', metadata={'source': 'study.pdf', 'page': 0}),
 Document(page_content='Introduction to Data Science \n◻What is Data? \n◻Data is the collection of facts and bits of information. In \nthe real world, the data is either structured or \nunstructured. \n◻Structured data \xa0is data that has an order and a \nwell-defined structure. As the structured data is consistent \nand well-defined, it is an easy task to store and access it. \nAlso, searching for data is easy as we can use indexes to \nstore structured data.\xa0 \n◻\xa02', metadata={'source': 'study.pdf', 'page': 1}),
 Document(page_content='Introduction to Data Science \n◻Another type is unstructured data. It is an inconsistent \ntype as it doesn’t have any structure, format, or \nsequence. The unstructured data is error-prone when we \nperform indexing on it. Hence, it is a difficult task to \nunderstand and operate on unstructured data. \nInteresti

### **DATA TRANSFORMATION**

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents=text_splitter.split_documents(docs)
documents[:5]

[Document(page_content='ITDO6014 \nAI AND DS-1 \nModule 4: Introduction to DS CS380 1', metadata={'source': 'study.pdf', 'page': 0}),
 Document(page_content='Introduction to Data Science \n◻What is Data? \n◻Data is the collection of facts and bits of information. In \nthe real world, the data is either structured or \nunstructured. \n◻Structured data \xa0is data that has an order and a \nwell-defined structure. As the structured data is consistent \nand well-defined, it is an easy task to store and access it. \nAlso, searching for data is easy as we can use indexes to \nstore structured data.\xa0 \n◻\xa02', metadata={'source': 'study.pdf', 'page': 1}),
 Document(page_content='Introduction to Data Science \n◻Another type is unstructured data. It is an inconsistent \ntype as it doesn’t have any structure, format, or \nsequence. The unstructured data is error-prone when we \nperform indexing on it. Hence, it is a difficult task to \nunderstand and operate on unstructured data. \nInteresti
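
To get a feel for what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain fixed-size sliding window. This is a deliberate simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to break on separators such as paragraphs and lines). `split_with_overlap` is a hypothetical helper written for this example.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size sliding-window splitter (hypothetical helper, not LangChain's algorithm)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role `chunk_overlap=200` plays above: text that straddles a chunk boundary still appears whole in at least one chunk. Pages shorter than `chunk_size`, like the PDF pages in the output above, come back as a single unchanged chunk.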

In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

db=Chroma.from_documents(documents[:20], OllamaEmbeddings())

KeyboardInterrupt: 

In [None]:
query="What is the pdf about?"
result=db.similarity_search(query)
result
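
Conceptually, a similarity search embeds the query, compares it with the stored chunk embeddings, and returns the closest chunks. The sketch below illustrates that idea with the standard library only. The bag-of-words `embed` function is a toy stand-in for a real embedding model such as the one behind `OllamaEmbeddings`, cosine similarity is chosen for illustration (the vector store may use a different metric), and `toy_similarity_search` is a hypothetical name.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "structured data has an order and a well-defined structure",
    "unstructured data has no structure, format, or sequence",
    "the first ipod changed the music industry",
]
top = toy_similarity_search("what is structured data", chunks, k=1)
print(top)
# ['structured data has an order and a well-defined structure']
```

A real embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk even when they share few words; the ranking step is the same idea.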