# Documents and Document Loaders

LangChain provides a Document abstraction that represents a unit of text and its associated metadata. It has three attributes:
* page_content: a string holding the content
* metadata: a dictionary holding arbitrary data and details about the document
* id: (optional) a string identifier for the document.

The metadata can capture information about the document's source, its relationship to other documents, and other details. A Document commonly represents a chunk of a longer document.

Simple documents can be created as follows.

In [1]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
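
Because a Document often represents a chunk of a longer text, its metadata is a natural place to record where the chunk came from. The sketch below illustrates this with a plain-Python stand-in so it runs without LangChain installed. `SimpleDocument`, `chunk_by_paragraph`, and the `chunk_index` metadata key are hypothetical names chosen for illustration; they are not part of LangChain.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the three attributes described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None


def chunk_by_paragraph(text: str, source: str) -> list[SimpleDocument]:
    """Turn each non-empty paragraph into a document that records its origin."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        SimpleDocument(
            page_content=paragraph,
            metadata={"source": source, "chunk_index": index},
            id=f"{source}-{index}",
        )
        for index, paragraph in enumerate(paragraphs)
    ]


long_text = (
    "Dogs are great companions.\n\n"
    "Cats are independent pets.\n\n"
    "Goldfish are popular pets for beginners."
)
chunks = chunk_by_paragraph(long_text, source="pets-doc")
for chunk in chunks:
    print(chunk.id, chunk.metadata, chunk.page_content)
```

Each chunk keeps a pointer to its source and its position, which is the kind of relationship between documents that metadata is meant to capture.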

Within its ecosystem, LangChain offers document loaders that integrate with many common data sources, making it easy to bring data into an application.

## Loading documents
Next, a PDF document is loaded into Document objects. LangChain has several document loaders for PDF files; PyPDFLoader is a fairly lightweight option for this task.

In [9]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./nke-10k-2023.pdf"
loader = PyPDFLoader(file_path=file_path)

docs = loader.load()

print(len(docs))

107


PyPDFLoader loads one Document per page of the file. For each page, you can then access:
* The string with the page content
* Metadata with information such as the file name and page number.

In [10]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': './nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}
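
The metadata printed above is enough to build a human-readable citation for a page. The following is a minimal sketch, assuming the keys `source`, `page`, `page_label`, and `total_pages` are present as in that output; `format_citation` is a hypothetical helper, not a LangChain function. In the output, `page` is 0 for the first page while `page_label` is `'1'`, so the sketch treats `page` as zero-based.

```python
def format_citation(metadata: dict) -> str:
    """Build a short citation string from PyPDFLoader-style page metadata."""
    # Prefer the page label; fall back to the zero-based index plus one.
    label = metadata.get("page_label") or str(metadata["page"] + 1)
    return f"{metadata['source']}, page {label} of {metadata['total_pages']}"


# A subset of the metadata printed above for the first page.
first_page_metadata = {
    "source": "./nke-10k-2023.pdf",
    "total_pages": 107,
    "page": 0,
    "page_label": "1",
}
print(format_citation(first_page_metadata))
```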


## Splitting

For information retrieval and question-answering tasks, a page can be too coarse a representation. The end goal is to retrieve a Document object that answers the user's query, so splitting the PDF further helps ensure that the meaning of relevant portions of the document is not washed out by the surrounding text.

Text splitters can be used for this task. Here the document is split into chunks of 1000 characters with 200 characters of overlap between chunks, which mitigates the risk of separating a sentence from the context that defines it. RecursiveCharacterTextSplitter is used: it recursively splits the document using common separators and is the recommended splitter for generic use cases.
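
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the actual RecursiveCharacterTextSplitter implementation: it ignores overlap, and the function name `recursive_split` and the separator list are assumptions chosen for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long; split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
pieces = recursive_split(sample, chunk_size=10)
print(pieces)
```

Paragraph breaks are tried first, then line breaks, then spaces, so a chunk is only cut mid-word when nothing coarser fits.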

Setting add_start_index=True preserves the character index at which each split begins within the original Document, stored in the metadata under the key "start_index".

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

516
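
To see how chunk size, overlap, and start indices relate, here is a minimal sketch that cuts a string into fixed-size windows. It is an illustration under simplifying assumptions: the real splitter prefers to cut at separators, so its chunks need not have exactly this size or spacing. `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2500-character "page" made of repeating digits.
page_text = "".join(str(i % 10) for i in range(2500))
windows = split_with_overlap(page_text, chunk_size=1000, chunk_overlap=200)
for window in windows:
    print(window["start_index"], len(window["text"]))
```

With a chunk size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, and the recorded start index lets you locate the chunk in the original text.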

# Embeddings

Vector search is a common way to store and search over unstructured data. The idea is to store numeric vectors associated with the text. Given a query, it can be embedded as a vector of the same dimension, and similarity metrics (such as cosine similarity) can then be used to identify related text.

LangChain supports embeddings from many providers. These models specify how text should be converted into a numeric vector.

In [12]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [13]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[0.009308322332799435, -0.01595355197787285, 0.00031549992854706943, 0.006354539655148983, 0.020617090165615082, -0.0391337126493454, -0.007451659068465233, 0.04100913181900978, -0.007989278994500637, 0.05981331691145897]
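
Vectors like these are compared with a similarity metric. The sketch below computes cosine similarity in plain Python on small hand-written vectors; the example vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors standing in for embeddings.
query_vector = [1.0, 0.0]
close_vector = [0.9, 0.1]
far_vector = [0.0, 1.0]

print(cosine_similarity(query_vector, close_vector))
print(cosine_similarity(query_vector, far_vector))
```

The value depends only on direction, not length, which is why the query and the stored texts must be embedded into the same vector space.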


# Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using similarity metrics.

In [14]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [15]:
ids = vector_store.add_documents(documents=all_splits)

Many VectorStore implementations let you connect to existing vector store solutions.
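
Conceptually, an in-memory vector store keeps each text next to its vector and ranks stored entries by similarity to the query vector. The toy version below shows that idea in plain Python. It is a sketch only and not how InMemoryVectorStore is implemented: `ToyVectorStore`, `toy_embed`, and the fixed `VOCABULARY` are hypothetical, and word counting stands in for a real embedding model.

```python
import math
import re

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCABULARY = ["dogs", "cats", "loyalty", "independent", "pets", "space"]


def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self._embed = embed
        self._entries = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((text, self._embed(text)))

    def search_by_vector(self, vector, k=4):
        """Return up to k (text, score) pairs, most similar first."""
        scored = [(text, cosine_similarity(vector, stored)) for text, stored in self._entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    def search(self, query, k=4):
        return self.search_by_vector(self._embed(query), k)


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
top = store.search("Which pets like their own space?", k=1)
print(top)
```

Searching by string is just embedding the query and then searching by vector, which is why both entry points exist.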

Once a VectorStore containing documents has been instantiated, it can be queried. VectorStore includes methods for querying:
* Synchronously and asynchronously
* By string query and by vector
* With and without returning similarity scores
* By similarity and by maximum marginal relevance (which balances similarity to the query against diversity in the results).

These methods generally return a list of Document objects in their output.
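
Maximum marginal relevance is the least self-explanatory of these options, so here is a sketch of one common formulation in plain Python. It is an independent illustration and not LangChain's code; the parameter name `lambda_mult` mirrors the one LangChain uses, while `mmr_select` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k vectors chosen by maximum marginal relevance."""
    relevance = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates that resemble something already selected.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: the first two are near duplicates, the third is different.
query = [1.0, 1.0]
docs = [[1.0, 0.02], [1.0, 0.05], [0.0, 1.0]]

by_similarity = mmr_select(query, docs, k=2, lambda_mult=1.0)
by_mmr = mmr_select(query, docs, k=2, lambda_mult=0.5)
print(by_similarity, by_mmr)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near duplicates; lowering it trades some relevance for a second result that covers different ground.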

## Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This makes it possible to retrieve relevant information just by passing in a question, without knowing any specific keywords used in the document.

Return documents based on similarity to a string query:

In [17]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 

Asynchronous query:

In [18]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'cr
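
The bare `await` above works because notebooks already run an event loop. In a regular script the coroutine has to be driven explicitly, and the async form pays off when several queries are issued concurrently. The sketch below shows that pattern with the standard library only; `fake_async_search` is a hypothetical stand-in for an awaitable search call and does not query a real vector store.

```python
import asyncio


async def fake_async_search(query: str) -> str:
    """Hypothetical stand-in for an awaitable search call."""
    await asyncio.sleep(0.01)  # simulate waiting on I/O
    return f"result for: {query}"


async def run_queries(queries: list[str]) -> list[str]:
    # gather runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_async_search(q) for q in queries))


questions = [
    "When was Nike incorporated?",
    "How many distribution centers does Nike have in the US?",
]
answers = asyncio.run(run_queries(questions))
print(answers)
```

Inside a notebook cell, `await run_queries(questions)` would be used instead of `asyncio.run`, since a loop is already running there.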

Return scores:

In [21]:
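# Note that providers implement different scores. InMemoryVectorStore
# computes cosine similarity, so a higher score here means a closer match;
# other stores may instead return a distance that varies inversely with similarity.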

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.6886661237334674

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad
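
Scores are often used to discard weak matches. Because the direction of the score depends on the store, the sketch below makes that direction an explicit argument. `filter_by_score` is a hypothetical helper, and the (text, score) pairs are made-up values standing in for real search results.

```python
def filter_by_score(scored_results, threshold, higher_is_better=True):
    """Keep (document, score) pairs that pass the threshold, best match first."""
    if higher_is_better:
        kept = [(doc, score) for doc, score in scored_results if score >= threshold]
    else:
        kept = [(doc, score) for doc, score in scored_results if score <= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=higher_is_better)


# Hypothetical (text, score) pairs standing in for search results.
scored_results = [
    ("chunk about leases", 0.41),
    ("chunk about fiscal 2023 revenue", 0.69),
    ("chunk about gross margin", 0.55),
]
best_matches = filter_by_score(scored_results, threshold=0.5)
print(best_matches)
```

For a store that reports a distance, pass `higher_is_better=False` so that smaller scores are kept and ranked first.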

Return documents based on similarity to an embedded query:

In [22]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
• Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
• Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
• Unfavorable changes in net foreign currency exchange rates, including hedges; and
• Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:'

In [1]:
["SOME PART OF THE TUTORIAL IS MISSING HERE..."]

['SOME PART OF THE TUTORIAL IS MISSING HERE...']