This notebook walks through LangChain's document loader, embedding, and vector store abstractions.

In [None]:
# pip install langchain-community pypdf

In [3]:
import getpass
import os

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

os.environ["LANGSMITH_TRACING"] = "true"
if "LANGSMITH_API_KEY" not in os.environ:
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

### Documents and Document Loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

```page_content```: a string representing the content;

```metadata```: a dict containing arbitrary metadata;

```id```: (optional) a string identifier for the document.

In [4]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata = {"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata = {"source": "mammal-pets-doc"},
    )
]

Loading documents

In [5]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
print(len(docs))

107


In [6]:
print(f"{docs[0].page_content[:200]}\n")
print(f"{docs[0].metadata}")

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


Splitting

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, chunk_overlap = 200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)

# add_start_index=True records the character offset at which each split
# Document starts within its source Document, as the "start_index" metadata attribute.

516
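
To make `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplified stand-in, not the recursive, separator-aware algorithm that `RecursiveCharacterTextSplitter` uses; the function name `split_with_overlap` and the dict layout are assumptions made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration; LangChain's splitter instead
    prefers to break on separators such as paragraphs and sentences.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"start_index": start, "page_content": text[start:start + chunk_size]}
        )
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["page_content"])
```

Each chunk begins with the last three characters of the previous one, and `start_index` lets you map a chunk back to its position in the original text.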

### Embeddings
Vector search is a common way to store and search over unstructured data such as free-form text. The idea is to store a numeric vector for each piece of text. Given a query, we can embed it as a vector of the same dimension and use a vector similarity metric (such as cosine similarity) to identify related text.
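
As a rough illustration of the similarity step, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made-up values chosen for readability, not real embeddings, and the function name `cosine_similarity` is an assumption for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not produced by an embedding model).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # points the same way as the query
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Cosine similarity depends only on direction, not magnitude, so a vector that is a scaled copy of the query scores close to 1 while an orthogonal vector scores 0.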

In [None]:
# You may see warnings such as the following:

# IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
# from .autonotebook import tqdm as notebook_tqdm

# UserWarning: Could not download mistral tokenizer from Huggingface for calculating batch sizes. Set a Huggingface token via the HF_TOKEN environment variable to download the real tokenizer. Falling back to a dummy tokenizer that uses `len()`.
#   warnings.warn(

# To resolve these issues:
# pip install --upgrade jupyter ipywidgets
# pip install transformers

In [11]:
if not os.environ.get("MISTRAL_API_KEY"):
    os.environ["MISTRAL_API_KEY"] = getpass.getpass("Enter API key for MistralAI: ")

from langchain_mistralai import MistralAIEmbeddings

embeddings = MistralAIEmbeddings(model="mistral-embed")

In [13]:
vector1 = embeddings.embed_query(all_splits[0].page_content)
vector2 = embeddings.embed_query(all_splits[1].page_content)
# embeddings.embed_query(text) converts the text into an embedding
# (a numerical vector representation)
assert len(vector1) == len(vector2)
# ensures that both embeddings have the same length.
print(f"Generated vectors of Length {len(vector1)}\n")
print(vector1[:10])

Generated vectors of Length 1024

[-0.0029048919677734375, 0.038848876953125, 0.03326416015625, -0.01450347900390625, 0.01552581787109375, 0.03680419921875, 0.02630615234375, -0.0110321044921875, 0.0010709762573242188, 0.0011987686157226562]


### Vector Stores
LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
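
The idea of a store that is initialized with an embedding function, accepts text, and answers similarity queries can be sketched in plain Python. Everything below is hypothetical and for illustration only: `ToyVectorStore`, `toy_embed`, and the five-word vocabulary are assumptions, and the method names merely echo the add-and-query pattern described above. A real vector store would use a learned embedding model and an indexed backend.

```python
import math
import re
from typing import Callable

Vector = list[float]

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCAB = ["dogs", "cats", "loyal", "independent", "pets"]


def toy_embed(text: str) -> Vector:
    """Count how often each vocabulary word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as unrelated


class ToyVectorStore:
    """In-memory store: keeps (text, vector) pairs and ranks them per query."""

    def __init__(self, embed: Callable[[str], Vector]):
        self._embed = embed
        self._rows: list[tuple[str, Vector]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((text, self._embed(text)))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vector = self._embed(query)
        ranked = sorted(
            self._rows, key=lambda row: cosine(query_vector, row[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["Dogs are loyal pets.", "Cats are independent pets."])
top = store.similarity_search("Which pets are independent?", k=1)
print(top)
```

The embedding function is passed in at construction time, which is why the choice of embedding model determines how text is translated to vectors and therefore which results rank highest.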

In [None]:
from langchain_postgres import PGVector

vector_store = PGVector(
    embeddings=embeddings,
    collection_name="my_docs",
    connection="",  # left blank here; set this to your Postgres connection string
)