# Vector Stores

In this lab we will see how a vector store works and how to search it for the passages most similar to a query.
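To make the idea concrete before bringing in any library, here is a minimal sketch of what a vector store does: it keeps (vector, text) pairs and answers a query by returning the texts whose vectors are closest to the query vector. The class `TinyVectorStore`, the three-dimensional vectors, and the sample texts are all hypothetical and hand-written for illustration; a real store would obtain its vectors from an embedding model.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyVectorStore:
    """Hypothetical in-memory store: brute-force nearest-neighbour search."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=4):
        # Score every stored vector against the query; lower distance = closer.
        scored = [(text, l2_squared(vec, query_vector)) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "text about countries")
store.add([0.0, 1.0, 0.0], "text about mathematicians")
store.add([0.9, 0.1, 0.0], "text about populations")

# The query vector is closest to the first and third entries.
results = store.search([1.0, 0.0, 0.0], k=2)
for text, distance in results:
    print(text, round(distance, 4))
```

Production libraries replace the linear scan with specialised index structures, but the contract is the same: vectors in, nearest texts and their distances out.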







## Dependencies


In [1]:
!pip install datasets openai langchain langchain-community faiss-cpu langchain-openai tiktoken


## Loading documents into the vector store

We will download some Wikipedia articles.

We will use a Cohere dataset of Wikipedia articles that have already been run through Cohere's embedding model.

[Cohere/wikipedia-2023-11-embed-multilingual-v3](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3)


In [2]:
from datasets import load_dataset

lang = "simple"  # Simple English Wikipedia subset
top_k = 5

docs_stream = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", lang, split="train", streaming=True)


We will keep only the text field from this dataset.


In [3]:
from langchain_core.documents import Document

texts = []
max_docs = 10000

# Read records lazily from the stream and stop once max_docs have been collected.
for doc in docs_stream:
    texts.append(Document(page_content=doc["text"]))
    if len(texts) >= max_docs:
        break
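The same "take the first N records from a stream" pattern can also be written with `itertools.islice`, which stops pulling from the iterator once the limit is reached. The sketch below is one possible alternative, not the notebook's own method; `fake_stream` is a hypothetical stand-in for `docs_stream` so that the example runs without downloading anything.

```python
from itertools import islice


def fake_stream():
    """Hypothetical endless stream of records shaped like the dataset rows."""
    n = 0
    while True:
        yield {"text": f"paragraph {n}"}
        n += 1


limit = 3  # illustrative value; the notebook uses max_docs = 10000

# islice consumes at most `limit` items, so the endless stream is never exhausted.
sample_texts = [row["text"] for row in islice(fake_stream(), limit)]
print(sample_texts)
```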

We load OpenAI's embedding model.


In [4]:
import getpass

# langchain-openai is installed above and replaces the deprecated langchain.embeddings.openai import.
from langchain_openai import OpenAIEmbeddings

api_key = getpass.getpass("Enter your OpenAI API Key:")
embedding = OpenAIEmbeddings(api_key=api_key)


## Creating the vector store

In [5]:
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(texts, embedding)

## Queries

In [6]:
query = "Can you tell me the andorran population?"
docs = db.similarity_search_with_score(query)
for doc in docs:
  print(doc)
  print("------------------")

(Document(id='5b2fefbc-f2c0-400b-aecf-821bc4545ecc', metadata={}, page_content="There are about 84,000 people living in the country. The capital is Andorra la Vella. It is ruled by a Spanish Bishop and the French President, who both hold the title of Co-Prince. Andorra's government is a parliamentary democracy."), np.float32(0.3091945))
------------------
(Document(id='59bf439d-076b-4fd5-884a-9d3ab27067b4', metadata={}, page_content='Andorra is a rich country mostly because of tourism.  There are about 10.2\xa0million visitors each year.'), np.float32(0.32245833))
------------------
(Document(id='8a9b613f-7d70-4807-9682-11b2c6bc2fbb', metadata={}, page_content='The population of Andorra is mostly (90%) Roman Catholic. Their patron saint is Our Lady of Meritxell.'), np.float32(0.33076102))
------------------
(Document(id='43bc5063-eb11-4bea-ad28-7cb122d2ac7c', metadata={}, page_content="Andorra doesn't have an Army. France and Spain help to defend Andorra.  The country has a police forc
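The second element of each tuple is a distance, so lower values mean a closer match. As a rough way to read these numbers, the sketch below converts them to cosine similarity. It rests on two assumptions that are not stated in the notebook: that the FAISS index returns squared Euclidean (L2) distances, and that the embedding vectors have unit length. Under those assumptions, d² = 2 − 2·cos, so cos = 1 − d²/2. The helper name `l2_squared_to_cosine` is hypothetical.

```python
def l2_squared_to_cosine(score):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Valid only under the stated assumptions: unit-length vectors and a
    squared-L2 score.
    """
    return 1.0 - score / 2.0


# Scores copied from the first three Andorra results above.
scores = [0.3091945, 0.32245833, 0.33076102]
cosines = [round(l2_squared_to_cosine(s), 3) for s in scores]
print(cosines)  # [0.845, 0.839, 0.835]
```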

In [7]:
query = "Who is Alan Turing?"
docs = db.similarity_search_with_score(query)
for doc in docs:
  print(doc)
  print("------------------")

(Document(id='2a494d47-31ff-4b3b-bb75-94f549c3ed02', metadata={}, page_content='Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.'), np.float32(0.21449159))
------------------
(Document(id='fb39b00a-a810-4746-8667-d02fdbc12927', metadata={}, page_content='Alan was a brilliant mathematician and cryptographer. He became the founder of modern-day computer science and artificial intelligence. He designed a machine at Bletchley Park to break secret Enigma encrypted messages used by the Nazi German war machine to protect sensitive commercial, diplomatic and military communications during World War 2. This made the single biggest contribution to the Allied victory in the war against Nazi Germany. It possibly saved the lives of an estimated 2 million people, and shortened World War II.'), np.float32(0.22104444))
------------------
(Document(id='d7378b69-5afd-49a7-89f6-60a

In [8]:
query = "Who is Javier Milei?"
docs = db.similarity_search_with_score(query)
for doc in docs:
  print(doc)
  print("------------------")

(Document(id='5752f7f9-a4d0-43a7-acfc-9b506246aaaf', metadata={}, page_content='Jalal Allakhverdiyev, member of the Academy of Sciences of the Azerbaijan Soviet Socialist Republic (later called the Azerbaijan National Academy of Sciences); Mathematics; died in 2017'), np.float32(0.43834937))
------------------
(Document(id='d532375d-a3d7-4083-8d6d-7ea4daa7f9a8', metadata={}, page_content="March 22 - Mijailo Mijailovic is sentenced to life imprisonment for the equivalent of First-degree murder, found guilty of assassination of Sweden's Foreign Minister Anna Lindh, September 10, 2003."), np.float32(0.4475514))
------------------
(Document(id='174c0e9c-4b36-47c1-b3d2-6484376967d9', metadata={}, page_content='Nicanor Parra, got the Cervantes Prize, the most important literary prize in the Spanish-speaking world'), np.float32(0.45396686))
------------------
(Document(id='91a94df8-44c3-4502-9469-2ce4e0625df5', metadata={}, page_content='The Clay Mathematics Institute has said it will give on
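None of these results mention the person asked about, and their scores (about 0.44 and above) are noticeably higher than those of the Alan Turing query (about 0.21 to 0.22). A similarity search always returns its nearest neighbours, even when nothing relevant was indexed, so one common safeguard is to discard results whose distance exceeds a cutoff. The sketch below illustrates the idea on plain `(label, score)` tuples; the helper `filter_by_score`, the labels, and the cutoff of 0.35 are hypothetical and chosen only for illustration. A real cutoff would need tuning on your own data.

```python
def filter_by_score(results, max_score):
    """Keep only (item, score) pairs whose distance is at most max_score."""
    return [(item, score) for item, score in results if score <= max_score]


# Scores copied from the outputs above; labels are placeholders for the documents.
sample_results = [
    ("turing-1", 0.21449159),
    ("turing-2", 0.22104444),
    ("milei-1", 0.43834937),
    ("milei-2", 0.4475514),
]

# Illustrative cutoff, not a recommended value.
kept = filter_by_score(sample_results, max_score=0.35)
print(kept)
```

With an empty result after filtering, an application can report that it found nothing relevant instead of presenting unrelated passages.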

In [9]:
query = "Give the population of Argentina"
docs = db.similarity_search_with_score(query)
for doc in docs:
  print(doc)
  print("------------------")

(Document(id='8c71d1fa-11c2-4128-96f5-fc4e33e418ec', metadata={}, page_content='The majority of the Argentineans are descendants of Europeans mainly from Spain, Italy, Germany, Ireland, France, other Europeans countries and Mestizo representing more than 90% of the total population of the country. More than 300,000 Roma gypsies live in Argentina. Since the 1990s, Romanian, Brazilian and Colombian gypsies arrived in Argentina.'), np.float32(0.30113024))
------------------
(Document(id='118a0d73-961b-4753-a6d4-c27b8b0e713e', metadata={}, page_content="Argentina is a Christian country. Most of Argentina's people (80 percent) are Roman Catholic. Argentina also has the largest population of Jewish community after Israel and US. Middle Eastern immigrants who were Muslims converted to Catholicism, but there are still Muslims as well."), np.float32(0.3060348))
------------------
(Document(id='ec2f4be8-5c52-4989-b551-2b745ec3dbdd', metadata={}, page_content='Argentina (officially the Argentine 