<a href="https://colab.research.google.com/github/Alex112525/LangChain-with-LLMs/blob/main/Langchain_Documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **Document** class in LangChain is a base class for representing documents. It has two main properties:

* __page_content__: The text content of the document.
* __metadata__: Arbitrary metadata about the document, such as the source of the document or its relationships to other documents.

A Document can hold text from any source, such as a web page, a PDF, or a text file. A larger work, such as a book or a dataset, is usually represented as a list of Document objects (for example, one per file, page, or chunk) rather than as a single Document.

The Document class does not define dedicated getter or setter methods; both fields are plain attributes. Read the text with `doc.page_content`, and read or update the metadata dictionary through `doc.metadata`, for example `doc.metadata["source"] = "Bing"`.

In [1]:
%%capture
!pip install langchain

In [2]:
!pip show langchain

Name: langchain
Version: 0.0.268
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, dataclasses-json, langsmith, numexpr, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


**Document class**


```{python}
class Document(Serializable):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)
```
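To illustrate the shape of the class above without depending on LangChain, here is a minimal sketch built on the standard library's `dataclasses`. `SimpleDocument` is a hypothetical stand-in, not the real `Document` (which is a pydantic model), but it mirrors the same two fields and shows why the metadata default is created by a factory: each instance gets its own dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields of LangChain's Document."""

    page_content: str
    # default_factory builds a fresh dict per instance instead of sharing one
    metadata: dict = field(default_factory=dict)


first = SimpleDocument(page_content="first text")
second = SimpleDocument(page_content="second text")

# Fields are plain attributes: read them directly, update metadata like any dict
first.metadata["source"] = "notes.txt"

print(first.page_content)
print(first.metadata)
print(second.metadata)
```

Updating `first.metadata` leaves `second.metadata` empty, which is the behavior the `default_factory=dict` declaration is there to guarantee.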

In [3]:
from langchain.schema import Document

page_content = "The Document class in LangChain is a piece of unstructured data that consists of page_content (the content of the data) and metadata (auxiliary pieces of information describing attributes of the data)"
metadata = {"source" : "Bing"}

doc = Document(
    page_content=page_content,
    metadata=metadata
)

In [4]:
doc

Document(page_content='The Document class in LangChain is a piece of unstructured data that consists of page_content (the content of the data) and metadata (auxiliary pieces of information describing attributes of the data)', metadata={'source': 'Bing'})

In [5]:
doc.page_content

'The Document class in LangChain is a piece of unstructured data that consists of page_content (the content of the data) and metadata (auxiliary pieces of information describing attributes of the data)'

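The **Document** class does not read files itself. LangChain's document loaders read a variety of file formats and return their contents as Document objects. Supported formats include: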

* Text files (.txt)
* PDF files (.pdf)
* HTML files (.html)
* JSON files (.json)
* Markdown files (.md)
* CSV files (.csv)
* etc.

You can also use `DirectoryLoader` to load every matching file in a directory.
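As a rough sketch of what a directory loader does, the snippet below walks a folder and builds one record per matching file, each with the text and a `source` entry in its metadata. It uses only the standard library; `load_text_directory` is a hypothetical helper and not LangChain's implementation, and plain dictionaries stand in for Document objects.

```python
import tempfile
from pathlib import Path


def load_text_directory(directory, pattern="*.txt"):
    """Hypothetical helper: return one record per file matching the pattern."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                # Only the file name is stored here to keep the output short
                "metadata": {"source": path.name},
            }
        )
    return records


# Build a throwaway directory so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("not matched by *.txt", encoding="utf-8")
    records = load_text_directory(tmp)

sources = [record["metadata"]["source"] for record in records]
print(sources)
```

The file that does not match the pattern is skipped, so only the two `.txt` files produce records.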

In [6]:
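import requests

# Download the paper and stop early if the server returns an HTTP error
url = "https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf"
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save the raw PDF bytes to disk
with open('Distributed_Representations.pdf', 'wb') as f:
    f.write(response.content)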

In [7]:
response.content[560:870]

b'In this paper we present several improvements that make the Skip\\055gram model more expressive and enable it to learn higher quality vectors more rapidly\\056  We show that by subsampling frequent words we obtain significant speedup\\054  and also learn higher quality representations as measured by our tasks\\05'
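The `\055`, `\056`, and `\054` sequences in the raw bytes above look like three-digit octal character escapes, which PDF string literals allow. Under that assumption, a small sketch can decode them to see the punctuation they stand for; `decode_octal_escapes` is a hypothetical helper for illustration, not part of any library used here.

```python
import re


def decode_octal_escapes(raw: bytes) -> bytes:
    """Replace each backslash followed by three octal digits with its byte."""
    return re.sub(
        rb"\\([0-7]{3})",
        # Mask to one byte so values above 0o377 cannot raise an error
        lambda match: bytes([int(match.group(1), 8) & 0xFF]),
        raw,
    )


# Fragments copied from the raw PDF bytes shown above
hyphen = decode_octal_escapes(rb"make the Skip\055gram model more expressive")
period = decode_octal_escapes(rb"more rapidly\056")
comma = decode_octal_escapes(rb"significant speedup\054")

print(hyphen)
print(period)
print(comma)
```

Slicing `response.content` only works here because this part of the file happens to be stored as readable text; a PDF parser is needed to extract text reliably.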

**unstructured** is an open-source Python library for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. It provides components for extracting text, converting between formats, and cleaning and normalizing data.

The **unstructured** library is designed to prepare data for large language models (LLMs). Raw files such as PDFs carry layout and encoding details that a model cannot use directly, so the text has to be extracted and cleaned before it is used for training or passed to a model.

In [None]:
!pip install unstructured==0.10.4

In [10]:
from langchain.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./Distributed_Representations.pdf")
data = loader.load()

In [11]:
type(data[0])

langchain.schema.document.Document

In [22]:
data[0].page_content[364:820]

'Abstract\n\nThe recently introduced continuous Skip-gram model is an efﬁcient method for learning high-quality distributed vector representations that capture a large num- ber of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain signiﬁcant speedup and also learn more regular word representations.'
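The extracted text above still carries two PDF artifacts: the "ﬁ" ligature character and a word split as "num- ber" at a line break. A minimal cleanup sketch with the standard library follows; `tidy_extracted_text` is a hypothetical helper, and the rejoin rule is a heuristic that can also merge intentional constructions such as "pre- and post-", so treat it as a starting point only.

```python
import re
import unicodedata


def tidy_extracted_text(text: str) -> str:
    """Hypothetical cleanup for two common PDF extraction artifacts."""
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
    # into ordinary letters; it also changes other compatibility characters
    text = unicodedata.normalize("NFKC", text)
    # Heuristic: rejoin a word that was split as "num- ber" at a line break
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


# Fragments taken from the extracted abstract shown above
fragments = [
    "an ef\ufb01cient method for learning",
    "a large num- ber of precise syntactic",
]
cleaned = [tidy_extracted_text(fragment) for fragment in fragments]

for line in cleaned:
    print(line)
```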

In [13]:
data[0].metadata

{'source': './Distributed_Representations.pdf'}