## **PDF Data Loading**

Dependencies

[PyPDFLoader](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/)

In [1]:
%%capture
!pip install -qU langchain_community pypdf langchain-text-splitters

In [2]:
from langchain_community.document_loaders import PyPDFLoader

pdf_file = "/kaggle/input/fashion-data/fashion_data.pdf"

loader = PyPDFLoader(pdf_file)

In [3]:
# load the original PDF as a list of documents (one per page);
# this list is what we pass to the splitter
document = loader.load()

# number of PDF pages loaded
#print(f"Number of documents loaded: {len(document)}")

#document[0]
#type(document)

## **Text Splitting**

Dependencies

[How to recursively split text by characters](https://python.langchain.com/docs/how_to/recursive_text_splitter/)

[Documents Splitting with LangChain](https://www.kaggle.com/code/youssef19/documents-splitting-with-langchain)

In [4]:
# import from the langchain-text-splitters package installed in cell 1
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len,
)

docs = text_splitter.split_documents(document)

In [5]:
# compare the number of chunks created with the number of pages in the original PDF
print(len(docs), len(document))

37 21
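
The 37 chunks from 21 pages come from breaking each page on the separator and regrouping the pieces up to `chunk_size`, with some trailing text repeated at the start of the next chunk. The sketch below is a simplified, pure-Python illustration of that idea; it is not LangChain's implementation, and the function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=150):
    """Greedy separator-based splitting with overlap (simplified illustration)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        added = len(piece) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            # Close the current chunk.
            chunks.append(separator.join(current))
            # Carry trailing pieces into the next chunk, up to chunk_overlap characters.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                extra = len(prev) + (len(separator) if overlap else 0)
                if overlap_len + extra > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += extra
            current, current_len = overlap, overlap_len
            added = len(piece) + (len(separator) if current else 0)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# Hypothetical sample: ten short lines, split with small limits to make the overlap visible.
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, separator="\n", chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch, a single piece longer than `chunk_size` is kept whole rather than cut, so chunks can exceed the limit when the text has few separators.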


In [6]:
# inspect one of the created chunks
docs[3].page_content

'"Color": "Royal Blue",  \n        "Sizes Available": ["S", "M", "L"],  \n        "Features": "Floor -length, halter neck, with draped silhouette and side slit."  \n    }, \n    { \n        "Item": "Mountaineer Thermal Vest",  \n        "Type": "Jacket",  \n        "Style": "Sporty",  \n        "Season": "Winter",  \n        "Material": "Insulated synthetic",  \n        "Color": "Mountain Gray",  \n        "Sizes Available": ["M", "L", "XL"],  \n        "Features": "Lightweight, water -resistant, with thermal pockets and zip -front closure."  \n    }, \n    { \n        "Item": "Heritage Wool Sweater",  \n        "Type": "Shirt",  \n        "Style": "Casual",  \n        "Season": "Winter",  \n        "Material": "Wool",  \n        "Color": "Burgundy",  \n        "Sizes Available": ["S", "M", "L"],  \n        "Features": "Cable knit, crew neck, with ribbed trim and elbow patches."  \n    }, \n    { \n        "Item": "Nomad Leather Backpack",  \n        "Type": "Accessory",  \n        "St

### **Text Embedding**

[Hugging Face Models](https://python.langchain.com/docs/integrations/platforms/huggingface/)

In [7]:
%%capture
!pip install --upgrade --quiet  langchain sentence_transformers langchain_huggingface

In [8]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [9]:
# Convert the chunks to embeddings
try:
    embed = embeddings.embed_documents([doc.page_content for doc in docs])
    print("Embedding vectors created successfully")

except Exception as e:
    print(f"Error creating embedding vectors: {e}")

Embedding vectors created successfully


In [10]:
%%capture
!pip install prettytable

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
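
The warning above appears when a shell command such as `!pip install` forks the process after the tokenizer has already been used. Following the message's own suggestion, one option is to set `TOKENIZERS_PARALLELISM` explicitly near the top of the notebook, before the embedding model is loaded. The value `"false"` is an assumption here; it trades tokenizer parallelism for a quiet log.

```python
import os

# Run this before the embedding model (and its tokenizer) is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```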


In [11]:
from prettytable import PrettyTable

#type(embed) # a list
#len(embed) # number of vectors
#embed[1] # contents of one vector

# How many values does each vector have?

table = PrettyTable(["Vector", "Length"])

# Fill the table with one row per vector
for i, vector in enumerate(embed):
    table.add_row([i, len(vector)])

# Print the table
print(table)

+--------+--------+
| Vector | Length |
+--------+--------+
|   0    |  768   |
|   1    |  768   |
|   2    |  768   |
|   3    |  768   |
|   4    |  768   |
|   5    |  768   |
|   6    |  768   |
|   7    |  768   |
|   8    |  768   |
|   9    |  768   |
|   10   |  768   |
|   11   |  768   |
|   12   |  768   |
|   13   |  768   |
|   14   |  768   |
|   15   |  768   |
|   16   |  768   |
|   17   |  768   |
|   18   |  768   |
|   19   |  768   |
|   20   |  768   |
|   21   |  768   |
|   22   |  768   |
|   23   |  768   |
|   24   |  768   |
|   25   |  768   |
|   26   |  768   |
|   27   |  768   |
|   28   |  768   |
|   29   |  768   |
|   30   |  768   |
|   31   |  768   |
|   32   |  768   |
|   33   |  768   |
|   34   |  768   |
|   35   |  768   |
|   36   |  768   |
+--------+--------+
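
Since every row of the table reports the same value, a more compact check is to collect the distinct vector lengths into a set. The sketch below uses a hypothetical stand-in list, `fake_embed`, with the same shape as the real output (37 vectors of 768 values); in the notebook, `embed` would be passed instead.

```python
def embedding_dimensions(vectors):
    """Return the set of distinct vector lengths found in `vectors`."""
    return {len(vector) for vector in vectors}


# Stand-in data with the same shape as the notebook's `embed` list (assumption).
fake_embed = [[0.0] * 768 for _ in range(37)]

dims = embedding_dimensions(fake_embed)
print(f"{len(fake_embed)} vectors, dimensions found: {dims}")
```

A set with exactly one element means all chunks were embedded into the same vector space, which is what the vector store expects.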


### **Storing in Vector Database (Chroma)**

[Embeddings and Vector Databases With ChromaDB by RealPython](https://realpython.com/chromadb-vector-database/)

[Chroma by LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

In [12]:
%%capture
!pip install -qU "langchain-chroma>=0.1.2"


In [13]:
from langchain_chroma import Chroma

# initialize the Chroma vector store
vector_store = Chroma(
    embedding_function=embeddings,
    persist_directory="data"
)

# add the documents to the vector store
vector_store.add_documents(documents=docs)

['5726ddfa-35c9-48b6-b964-0665471b937a',
 '7840ddb6-8aa1-4ebf-9077-d72fc85f93ac',
 'fd442ef6-7479-4ffd-ac09-6617fe927b84',
 '2ecb14c2-3702-4f8b-9183-2a64f92a9b1c',
 '3c941c4e-d28a-4818-b41e-8d45a6f4b9a4',
 '0f7f804a-6ba2-4a9e-93f8-8955e2b01848',
 '901e6c84-4395-476b-a0d6-0a296b840506',
 '4d549081-a11e-44fc-91ee-80f8d1a42986',
 'c530fd48-1039-41c9-93ff-9f8055151544',
 '0af88142-5e3c-4afd-9086-8f7b6af3c4d1',
 '43facb2d-e538-4d6b-bef9-4aefbb2edc8a',
 'f379028c-776d-4b99-a5c1-0da7b2373caf',
 'a9ef541d-9105-43e5-aef8-946848761771',
 'a5a7e92f-2f80-4834-8110-9331e643817e',
 '8f253659-05d3-4077-b039-e88691081b05',
 'bd441bdb-05ed-4d3d-bbdd-9b80ac044cf4',
 '918e1c9b-6f04-464c-b722-db64468bebbb',
 'fa3cbb6a-a82a-416f-8c78-a6fe0bc361b3',
 'ba58d29d-abc3-4e66-8ffa-26d51ce100a9',
 '084c2379-f7b3-4a09-bdb8-86ec6f42a241',
 '9070f7f1-b420-4c5b-972e-3aa830dad52b',
 '9f81d1d0-5f85-4e67-ae38-e679cc7f8439',
 '104522a9-7630-4995-8be3-6cbf095071a6',
 '2b92a117-80db-48a7-ac37-a9278f1e3902',
 'd0f94f3e-e642-
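
The IDs above are random UUIDs, so re-running the cell against the persisted `data` directory may store the same chunks again under new IDs. One possible way to avoid that is to derive a deterministic ID from each chunk's source, page, and content, and pass the list through the `ids` argument of `add_documents`. The helper `stable_id` and the sample chunks below are hypothetical, and the approach assumes the store overwrites entries that share an ID.

```python
import hashlib


def stable_id(page_content, metadata):
    """Build a deterministic ID from a chunk's source, page, and text."""
    key = f"{metadata.get('source', '')}|{metadata.get('page', '')}|{page_content}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical stand-ins for two chunks; the notebook would use `docs` instead.
sample_chunks = [
    ("first chunk text", {"source": "fashion_data.pdf", "page": 0}),
    ("second chunk text", {"source": "fashion_data.pdf", "page": 0}),
]

ids = [stable_id(text, meta) for text, meta in sample_chunks]
print(ids)

# In the notebook this might look like:
# ids = [stable_id(d.page_content, d.metadata) for d in docs]
# vector_store.add_documents(documents=docs, ids=ids)
```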

### **Validate the setup**

In [14]:
from collections import OrderedDict

try:
    # Data retrieval
    test_query = "What are some popular items for winter?"
    results = vector_store.search(query=test_query, search_type='similarity')

    # Drop results with identical text, keeping the original ranking order
    unique_results = OrderedDict()
    for doc in results:
        if doc.page_content not in unique_results:
            unique_results[doc.page_content] = doc

    final_results = list(unique_results.values())[:3]
    print(f'Unique results: {final_results}')

except Exception as e:
    print(f'Error during the retrieval test: {e}')

Unique results: [Document(metadata={'page': 2, 'source': '/kaggle/input/fashion-data/fashion_data.pdf'}, page_content='"Color": "Royal Blue",  \n        "Sizes Available": ["S", "M", "L"],  \n        "Features": "Floor -length, halter neck, with draped silhouette and side slit."  \n    }, \n    { \n        "Item": "Mountaineer Thermal Vest",  \n        "Type": "Jacket",  \n        "Style": "Sporty",  \n        "Season": "Winter",  \n        "Material": "Insulated synthetic",  \n        "Color": "Mountain Gray",  \n        "Sizes Available": ["M", "L", "XL"],  \n        "Features": "Lightweight, water -resistant, with thermal pockets and zip -front closure."  \n    }, \n    { \n        "Item": "Heritage Wool Sweater",  \n        "Type": "Shirt",  \n        "Style": "Casual",  \n        "Season": "Winter",  \n        "Material": "Wool",  \n        "Color": "Burgundy",  \n        "Sizes Available": ["S", "M", "L"],  \n        "Features": "Cable knit, crew neck, with ribbed trim and elbow pa
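
A similarity search ranks stored vectors by how close they are to the query vector. The toy sketch below illustrates the idea with cosine similarity on hand-made 2-dimensional vectors. Cosine similarity is an assumption chosen for illustration; the metric the vector store actually uses depends on its configuration. The names `cosine_similarity`, `top_k`, `toy_docs`, and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: three "document" vectors and one "query" vector.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query = [0.9, 0.1]

ranking = top_k(toy_query, toy_docs, k=2)
print(ranking)
```

With real embeddings the vectors have 768 values instead of 2, but the ranking step works the same way.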