In [1]:
import chromadb
import pandas as pd

In [2]:
chroma_client = chromadb.Client()

To persist the database to disk instead of keeping it in memory:

In [None]:
# from chromadb.config import Settings
# chroma_client = chromadb.Client(
#     Settings(
#         persist_directory='my_personal_vector_db',
#     )
# )

We name the collection we are going to use "my_news"

In [3]:
collection_name = "my_news"

**chroma_client.list_collections()** returns a list with information about the collections that currently exist in the database

If a collection with the same name as the one we are about to create already exists, it is deleted so that the process starts again from scratch

In [4]:
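# Check every existing collection, not only the first one.
# getattr covers Chroma versions that return plain names instead of Collection objects.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing_names:
    # Drop the old collection so the process starts from scratch
    chroma_client.delete_collection(name=collection_name)

# Always (re)create the collection so that `collection` is defined below
print(f"Creating collection: '{collection_name}'")
collection = chroma_client.create_collection(name=collection_name)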


Creating collection: 'my_news'


For custom embeddings, you need to create a new **embedding function** that can process text and pass it when creating the collection

```python
collection = chroma_client.create_collection(name="my_collection", embedding_function=emb_fn)
```
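As a rough illustration of what such an `emb_fn` could look like, the sketch below implements a toy embedding function: a hashed bag-of-words vector, L2-normalized. Everything in it is hypothetical (the class name `HashingEmbeddingFunction`, the `dim` parameter, the hashing scheme). It also assumes that Chroma expects a callable that receives a list of texts and returns one vector per text; the exact interface, including the argument name `input`, may differ between Chroma versions. A real setup would wrap an actual embedding model instead.

```python
import hashlib
import math
from typing import List


class HashingEmbeddingFunction:
    """Toy embedding function: hashed bag-of-words, L2-normalized.

    Hypothetical example; a real setup would wrap an actual embedding model.
    """

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        embeddings = []
        for text in input:
            vector = [0.0] * self.dim
            for token in text.lower().split():
                # Stable hash so the same token always lands in the same bucket
                digest = hashlib.md5(token.encode("utf-8")).hexdigest()
                vector[int(digest, 16) % self.dim] += 1.0
            # Avoid dividing by zero for empty texts
            norm = math.sqrt(sum(v * v for v in vector)) or 1.0
            embeddings.append([v / norm for v in vector])
        return embeddings


emb_fn = HashingEmbeddingFunction(dim=64)
vectors = emb_fn(["NASA maps Beirut explosion damage", "Orbital space tourism"])
print(len(vectors), len(vectors[0]))
```

A hashing scheme like this carries no semantic information, so it only serves to show the shape of the interface: one fixed-length list of floats per input text.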


In [5]:
pdf = pd.read_csv("labelled_newscatcher_coloured.csv", index_col=0)

In [6]:
pdf_subset = pdf.head(1000)

In [7]:
pdf_subset

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
995,TECHNOLOGY,https://www.androidcentral.com/mate-40-will-be...,androidcentral.com,2020-08-07 17:12:33,The Mate 40 will be the last Huawei phone with...,en,995
996,SCIENCE,https://www.cnn.com/2020/08/17/africa/stone-ag...,cnn.com,2020-08-17 17:10:00,"Early humans knew how to make comfy, pest-free...",en,996
997,HEALTH,https://www.tenterfieldstar.com.au/story/68776...,tenterfieldstar.com.au,2020-08-13 03:26:06,Regional Vic set for virus testing blitz,en,997
998,HEALTH,https://news.sky.com/story/coronavirus-trials-...,news.sky.com,2020-08-13 13:22:58,Coronavirus: Trials of second contact-tracing ...,en,998


## CRUD operations
On this database we are going to use CRUD operations (Create, Read, Update & Delete) with a syntax similar to MongoDB's

When inserting into a vector database, you provide the documents that are going to be vectorized, together with the IDs used to identify those documents and the metadata associated with them

Besides documents, ChromaDB also supports inserting **embeddings** directly, without specifying any document. This is useful for running text searches over image databases with models such as [CLIP](https://huggingface.co/openai/clip-vit-large-patch14), which can handle both kinds of information (text and visual)
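The insert call takes documents, IDs, and metadata as three parallel lists, so they must line up one-to-one. A minimal sketch of preparing those lists from row-like records is shown here; the helper name `build_add_payload` and the sample records are illustrative assumptions, not part of ChromaDB.

```python
from typing import Dict, List, Tuple


def build_add_payload(
    records: List[Dict[str, str]], id_prefix: str = "id"
) -> Tuple[List[str], List[Dict[str, str]], List[str]]:
    """Turn row-like records into the parallel lists an insert expects."""
    documents = [record["title"] for record in records]
    metadatas = [{"topic": record["topic"]} for record in records]
    ids = [f"{id_prefix}{i}" for i in range(len(records))]
    # IDs must be unique, and the three lists must line up one-to-one
    assert len(set(ids)) == len(ids)
    assert len(documents) == len(metadatas) == len(ids)
    return documents, metadatas, ids


# Hypothetical records shaped like rows of the news dataset
sample_records = [
    {"title": "Glaciers Could Have Sculpted Mars Valleys: Study", "topic": "SCIENCE"},
    {"title": "Regional Vic set for virus testing blitz", "topic": "HEALTH"},
]
documents, metadatas, ids = build_add_payload(sample_records)
print(ids)
```

The three returned lists could then be passed as the `documents`, `metadatas`, and `ids` arguments of `collection.add`.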

In [8]:
collection.add(
    documents=pdf_subset["title"][:100].to_list(),
    metadatas=[{"topic": topic} for topic in pdf_subset["topic"][:100].tolist()],
    ids=[f"id{x}" for x in range(100)],
)

In [9]:
import json

results = collection.query(
    query_texts=["space"],
    # query_texts=["espacio"],
    n_results=10
)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id72",
            "id7",
            "id30",
            "id26",
            "id23",
            "id76",
            "id69",
            "id40",
            "id47",
            "id75"
        ]
    ],
    "distances": [
        [
            1.225035309791565,
            1.3089773654937744,
            1.391038179397583,
            1.4064621925354004,
            1.4391297101974487,
            1.4898790121078491,
            1.572824239730835,
            1.5738128423690796,
            1.5835297107696533,
            1.5864628553390503
        ]
    ],
    "metadatas": [
        [
            {
                "topic": "TECHNOLOGY"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENC

In [10]:
import json

results = collection.query(
    query_texts=["bombs"],
    n_results=3
)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id5",
            "id11",
            "id58"
        ]
    ],
    "distances": [
        [
            1.319362759590149,
            1.4891374111175537,
            1.5543259382247925
        ]
    ],
    "metadatas": [
        [
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            }
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "NASA Releases In-Depth Map of Beirut Explosion Damage",
            "NASA Finds Ammonia-Linked 'Mushballs' and 'Shallow Lightning' on Jupiter",
            "Asteroid 29075 1950 DA would be the greatest catastrophe for Earth, Tsunami of 400 toes excessive waves"
        ]
    ]
}
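The query result is a dict of parallel nested lists, with one inner list per query text. When inspecting hits it can be easier to work with one record per hit. The helper below is a sketch under that assumption; `flatten_query_result` is a hypothetical name, and the sample dict is copied from the "bombs" output, trimmed to two hits.

```python
from typing import Any, Dict, List


def flatten_query_result(result: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Zip the parallel lists for one query text into one dict per hit."""
    return [
        {"id": id_, "distance": distance, "topic": metadata["topic"], "document": document}
        for id_, distance, metadata, document in zip(
            result["ids"][query_index],
            result["distances"][query_index],
            result["metadatas"][query_index],
            result["documents"][query_index],
        )
    ]


# Sample copied from the "bombs" query output, trimmed to two hits
sample = {
    "ids": [["id5", "id11"]],
    "distances": [[1.319362759590149, 1.4891374111175537]],
    "metadatas": [[{"topic": "SCIENCE"}, {"topic": "SCIENCE"}]],
    "embeddings": None,
    "documents": [[
        "NASA Releases In-Depth Map of Beirut Explosion Damage",
        "NASA Finds Ammonia-Linked 'Mushballs' and 'Shallow Lightning' on Jupiter",
    ]],
}

for hit in flatten_query_result(sample):
    print(f"{hit['distance']:.3f}  {hit['id']:<5} {hit['document']}")
```

Hits come back ordered by increasing distance, so the first record is the closest match to the query text.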


In [11]:
collection.query(
    query_texts=["space"],
    # "where" filters accept operators such as $and, $or, $gte, etc.,
    # in the same way as MongoDB queries
    where={"topic": "SCIENCE"}, 
    n_results=10,
)

{'ids': [['id7',
   'id30',
   'id26',
   'id23',
   'id76',
   'id69',
   'id40',
   'id47',
   'id75',
   'id52']],
 'distances': [[1.3089773654937744,
   1.391038179397583,
   1.4064621925354004,
   1.4391297101974487,
   1.4898790121078491,
   1.572824239730835,
   1.5738128423690796,
   1.5835297107696533,
   1.5864628553390503,
   1.59842848777771]],
 'metadatas': [[{'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'}]],
 'embeddings': None,
 'documents': [['Orbital space tourism set for rebirth in 2021',
   'NASA drops "insensitive" nicknames for cosmic objects',
   '‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon',
   'Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars',
   "Aust

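To make the semantics of such MongoDB-style filters concrete, here is a small pure-Python evaluator. It is only a sketch of how a filter can be interpreted against a metadata dict, not ChromaDB's implementation. The function name `matches` and the `year` field are assumptions made for the example (the notebook's metadata only contains `topic`).

```python
from typing import Any, Dict

# Comparison operators, named as in MongoDB-style filters
_COMPARATORS = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if a metadata dict satisfies a MongoDB-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2020}}
            value = metadata.get(key)
            for operator, target in condition.items():
                if value is None or not _COMPARATORS[operator](value, target):
                    return False
        else:
            # A bare value is shorthand for equality, e.g. {"topic": "SCIENCE"}
            if metadata.get(key) != condition:
                return False
    return True


# "year" is a hypothetical metadata field used only for this illustration
item = {"topic": "SCIENCE", "year": 2020}
print(matches(item, {"topic": "SCIENCE"}))
print(matches(item, {"$and": [{"topic": "SCIENCE"}, {"year": {"$gte": 2021}}]}))
print(matches(item, {"$or": [{"topic": "HEALTH"}, {"year": {"$lt": 2021}}]}))
```

In this sketch a missing key never matches an operator condition, which is a simplification chosen for brevity.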
We delete the first element of the collection

In [12]:
collection.delete(
    ids=["id0"],
)

We verify that it is no longer available

In [13]:
collection.get(
    ids=["id0"],
)

{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': []}

Now we walk through an example of updating a document

In [14]:
collection.get(
    ids=["id2"],
)

{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'SCIENCE'}],

For document 2, we change its topic from "SCIENCE" to "TECHNOLOGY"

In [15]:
collection.update(
    ids=["id2"],
    metadatas=[{"topic": "TECHNOLOGY"}],  # one metadata dict per ID
)

We verify that it has changed

In [16]:
collection.get(
    ids=["id2"],
)

{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'TECHNOLOGY'}],

Next we build a simple Q&A pipeline that uses the database we just created with ChromaDB as its data source.

The goal is to provide context to a generative language model (GPT-2 in this particular case), aiming to improve the model's performance and to shorten the context window it needs to work properly

Keep in mind that GPT-2 is a freely available and much older predecessor of GPT-4; do not expect it to perform as well as GPT-3 and later models

In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline



In [18]:
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # cache_dir='cache'
)

lm_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # cache_dir='cache',
)

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [19]:
results

{'ids': [['id5', 'id11', 'id58']],
 'distances': [[1.319362759590149, 1.4891374111175537, 1.5543259382247925]],
 'metadatas': [[{'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'}]],
 'embeddings': None,
 'documents': [['NASA Releases In-Depth Map of Beirut Explosion Damage',
   "NASA Finds Ammonia-Linked 'Mushballs' and 'Shallow Lightning' on Jupiter",
   'Asteroid 29075 1950 DA would be the greatest catastrophe for Earth, Tsunami of 400 toes excessive waves']]}

In [20]:
results = collection.query(
    query_texts=["space"],
    # query_texts=["espacio"],
    n_results=10
)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id72",
            "id7",
            "id30",
            "id26",
            "id23",
            "id76",
            "id69",
            "id40",
            "id47",
            "id75"
        ]
    ],
    "distances": [
        [
            1.225035309791565,
            1.3089773654937744,
            1.391038179397583,
            1.4064621925354004,
            1.4391297101974487,
            1.4898790121078491,
            1.572824239730835,
            1.5738128423690796,
            1.5835297107696533,
            1.5864628553390503
        ]
    ],
    "metadatas": [
        [
            {
                "topic": "TECHNOLOGY"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENCE"
            },
            {
                "topic": "SCIENC

In [21]:
question = "What's the latest news on space development?"

We build the context needed to answer the user's question

In [22]:
context = " ".join(f"\n#{doc}" for doc in results["documents"][0])
print(context)


#Beck teams up with NASA and AI for 'Hyperspace' visual album experience 
#Orbital space tourism set for rebirth in 2021 
#NASA drops "insensitive" nicknames for cosmic objects 
#‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon 
#Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars 
#Australia's small yet crucial part in the mission to find life on Mars 
#NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico 
#SpaceX's Starship spacecraft saw 150 meters high 
#NASA’s InSight lander shows what’s beneath Mars’ surface 
#Alien base on Mercury: ET hunters claim to find huge UFO


In [23]:
prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"
print(prompt_template)

Relevant context: 
#Beck teams up with NASA and AI for 'Hyperspace' visual album experience 
#Orbital space tourism set for rebirth in 2021 
#NASA drops "insensitive" nicknames for cosmic objects 
#‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon 
#Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars 
#Australia's small yet crucial part in the mission to find life on Mars 
#NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico 
#SpaceX's Starship spacecraft saw 150 meters high 
#NASA’s InSight lander shows what’s beneath Mars’ surface 
#Alien base on Mercury: ET hunters claim to find huge UFO

 The user's question: What's the latest news on space development?
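Since one aim of retrieval is to keep the context window small, the prompt assembly can also enforce a size limit. The sketch below keeps only as many top-ranked titles as fit in a character budget; `build_prompt` and `max_context_chars` are hypothetical names, the budget ignores the separators between titles, and a token-based budget would be more precise for a real model.

```python
from typing import List


def build_prompt(documents: List[str], question: str, max_context_chars: int = 500) -> str:
    """Assemble the prompt, keeping only the top-ranked titles that fit the budget."""
    kept: List[str] = []
    used = 0
    for document in documents:  # assumed to be ordered by relevance
        line = f"\n#{document}"
        if used + len(line) > max_context_chars:
            break
        kept.append(line)
        used += len(line)
    context = " ".join(kept)
    return f"Relevant context: {context}\n\n The user's question: {question}"


titles = [
    "Orbital space tourism set for rebirth in 2021",
    "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
    "SpaceX's Starship spacecraft saw 150 meters high",
]
prompt = build_prompt(titles, "What's the latest news on space development?", max_context_chars=120)
print(prompt)
```

With a budget of 120 characters only the first two titles are kept, because adding the third would exceed the limit.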


In [24]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Relevant context: 
#Beck teams up with NASA and AI for 'Hyperspace' visual album experience 
#Orbital space tourism set for rebirth in 2021 
#NASA drops "insensitive" nicknames for cosmic objects 
#‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon 
#Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars 
#Australia's small yet crucial part in the mission to find life on Mars 
#NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico 
#SpaceX's Starship spacecraft saw 150 meters high 
#NASA’s InSight lander shows what’s beneath Mars’ surface 
#Alien base on Mercury: ET hunters claim to find huge UFO

 The user's question: What's the latest news on space development? The user's asking: What's the current situation in space? What are the questions behind the space age? Well, there may be more to life on Mars than the mere notion of some massive alien, alien civilization being unl
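As the output shows, the generated text echoes the whole prompt before the model's continuation. A small helper can strip that echo so only the new text remains. `extract_completion` is a hypothetical name and the strings are shortened stand-ins for the notebook's prompt and output. Recent versions of transformers also accept `return_full_text=False` in the `text-generation` pipeline, which should have a similar effect.

```python
def extract_completion(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt so that only the newly generated text remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was not echoed verbatim
    return generated_text.strip()


# Shortened stand-ins for the notebook's prompt and generated text
prompt = (
    "Relevant context: \n#Orbital space tourism set for rebirth in 2021"
    "\n\n The user's question: What's the latest news on space development?"
)
generated = prompt + " The user's asking: What's the current situation in space?"
print(extract_completion(generated, prompt))
```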

For more information on how to incorporate GPT-4 and build a Q&A chatbot, see [Embeddings with OpenAI](https://docs.trychroma.com/embeddings) to understand how ChromaDB can be integrated