### Data Ingestion - Document Loaders

https://docs.langchain.com/oss/python/integrations/document_loaders

#### Loading Text Files

In [None]:
## TextLoader: load plain-text files

from langchain_community.document_loaders import TextLoader  # the community loaders are all available in langchain_community.document_loaders

# Initialize the loader; the file is not read until load() is called
loader = TextLoader("Datasets/Time_Series_Basics.txt")
## Displaying the loader shows only the loader object, not the file contents
loader

<langchain_community.document_loaders.text.TextLoader at 0x109acf040>

In [3]:
text_documents = loader.load()  # load() reads the file and returns a list of Document objects
text_documents

[Document(metadata={'source': 'Datasets/Time_Series_Basics.txt'}, page_content='1. What is a Time Series?\n\nA time series is a sequence of observations recorded over time, usually at regular intervals.\n\nExamples:\n\t•\tDaily stock prices\n\t•\tMonthly sales\n\t•\tHourly temperature\n\t•\tWeekly demand\n\nX_1, X_2, X_3, \\dots, X_t\n\n\n2. Key Components of a Time Series\n\nMost time series can be broken into four parts:\n\n(a) Trend (T)\n\nLong-term direction of the series.\n\t•\tUpward, downward, or flat\nExample: Gradual increase in yearly sales.\n\n(b) Seasonality (S)\n\nRegular repeating patterns at fixed intervals.\n\t•\tDaily, weekly, monthly, yearly\nExample: Ice cream sales higher every summer.\n\n(c) Cyclic (C)\n\nLonger-term ups and downs, not fixed length.\nExample: Economic booms and recessions.\n\n(d) Noise / Residual (R)\n\nRandom variation that cannot be explained.\nExample: Unexpected spikes or drops.\n\n\\text{Time Series} = T + S + C + R\n\n\n3. Types of Time Serie
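
The output above shows the shape of each loaded item: a `page_content` string plus a `metadata` dict holding the `source` path. As a rough illustration of that shape, here is a standard-library-only sketch. `SimpleDocument` and `load_text_file` are hypothetical names invented for this example; this is not LangChain's implementation.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the two fields seen in the loader output."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-item list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo on a temporary file so the sketch runs without the Datasets folder
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("1. What is a Time Series?\n", encoding="utf-8")
    sample_docs = load_text_file(sample_path)

print(len(sample_docs))
print(sample_docs[0].page_content.strip())
print(sample_docs[0].metadata)
```

With the real loader, the same two attributes can be read as `text_documents[0].page_content` and `text_documents[0].metadata`.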

#### Loading PDF Files

In [11]:
## Reading a PDF File
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('Datasets/Demand_Forecasting.pdf')
pdf_docs = loader.load()
pdf_docs

[Document(metadata={'producer': 'Microsoft® Office Word 2007', 'creator': 'Microsoft® Office Word 2007', 'creationdate': 'D:20221010100057', 'author': 'Lenovo', 'moddate': 'D:20221010100057', 'source': 'Datasets/Demand_Forecasting.pdf', 'total_pages': 18, 'page': 0, 'page_label': '1'}, page_content='Demand Forecasting \nIntroduction \nAn important aspect of demand analysis from the management point of view is concerned with \nforecasting demand for products, either existing or new. Demand forecasting refers to an \nestimate of most likely future demand for product under given conditions. Su ch forecasts are of \nimmense use in making decisions with regard to production, sales, investment, expansion, \nemployment of manpower etc., both in the short run as well as in the long run.  \nMeaning And Features \nDemand forecasting seeks to investigate and me asure the forces that determine sales for existing \nand new products. Generally companies plan their business – production or sales in a
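
The metadata above carries a zero-based `page` index, a `page_label`, and `total_pages`, because `PyPDFLoader` by default returns one `Document` per page. Below is a minimal sketch of using those fields to pick a page range and to stitch pages back together. The records are hypothetical plain dicts standing in for the loaded pages (only the first snippet of text echoes the output above), and `select_pages` and `join_pages` are illustrative helper names, not part of LangChain.

```python
# Hypothetical stand-ins for per-page documents; real items expose the same
# information as doc.page_content and doc.metadata.
pages = [
    {"page_content": "Demand Forecasting", "metadata": {"page": 0, "page_label": "1"}},
    {"page_content": "second page text", "metadata": {"page": 1, "page_label": "2"}},
    {"page_content": "third page text", "metadata": {"page": 2, "page_label": "3"}},
]


def select_pages(docs, first, last):
    """Return the docs whose zero-based 'page' index lies in [first, last]."""
    return [d for d in docs if first <= d["metadata"]["page"] <= last]


def join_pages(docs, separator="\n\n"):
    """Concatenate page texts in page order into one string."""
    ordered = sorted(docs, key=lambda d: d["metadata"]["page"])
    return separator.join(d["page_content"] for d in ordered)


subset = select_pages(pages, 1, 2)
full_text = join_pages(pages)

print([d["metadata"]["page_label"] for d in subset])
print(full_text)
```

Note that `page` is zero-based while `page_label` follows the numbering printed in the PDF, so the first page has `page` 0 and `page_label` '1'.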

#### Loading Web Pages

In [None]:
## Web-based loader
import bs4  # Beautiful Soup, used here to filter the parsed HTML
from langchain_community.document_loaders import WebBaseLoader

## Fetch the page and keep only the elements with the listed CSS classes
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)
loader.load()

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake
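
The loaded text above contains only the post title, header, and body because the strainer restricts parsing to elements with those classes. The idea of class-based filtering can be sketched with the standard library alone. This is an illustration of the concept, not how Beautiful Soup or `WebBaseLoader` work internally; `ClassFilter` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ClassFilter(HTMLParser):
    """Collect text from elements whose class is in `wanted`, plus their descendants."""

    # Tags without a closing tag; ignored so they do not disturb the depth count
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


sample_html = (
    '<html><body><nav class="menu">Home</nav>'
    '<h1 class="post-title">LLM Powered Autonomous Agents</h1>'
    '<div class="post-content"><p>Building agents</p></div>'
    "<footer>Footer</footer></body></html>"
)

parser = ClassFilter(["post-title", "post-content", "post-header"])
parser.feed(sample_html)
print(parser.chunks)
```

Navigation and footer text are dropped because they sit outside any element carrying one of the wanted classes.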

#### Loading arXiv Papers

In [18]:
## ArxivLoader: fetch papers from arXiv (the query below is an arXiv paper ID)
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1706.03762", load_max_docs=2).load()
len(docs)

1

In [19]:
docs

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation 

#### Loading Wikipedia Pages

In [22]:
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="Generative AI", load_max_docs=2).load()
print(len(docs))  # a bare len(docs) is displayed only when it is the last expression in the cell
print(docs)

