# Liar Plus to ChromaDB
This notebook loads the Liar Plus dataset into a ChromaDB vector database. Gemini can then query that database and base its responses on statements that Liar Plus labels as true.

**Prerequisites**:
* Docker (I personally prefer Docker Desktop to see the images and containers)
* Python 3.11

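You only need to run this notebook once: the data is stored in the `chroma_volume` Docker volume, so the database stays populated after the container shuts down. Make sure that you **do not** delete that volume! Running the notebook a second time adds the same statements again under new IDs.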

In a terminal, run the following:
```bash
docker run --rm --name chromadb -v chroma_volume:/chroma/chroma -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE -p 8000:8000 chromadb/chroma
```
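This starts a Docker container from the ChromaDB image and exposes the server on port 8000. Because of `--rm`, the container itself is removed when it stops; the data stays in the volume.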

Once you are done, shut down the server with:
```bash
docker stop chromadb
```

In [None]:
# to improve, read this https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide
# import packages
import pandas as pd
import chromadb
import requests

In [None]:
# Create a client that connects to the ChromaDB server running in the Docker container
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
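
Before loading any data, it can help to confirm that the container is actually reachable. The sketch below polls the client's `heartbeat()` method a few times; the helper name `wait_for_server` and the `retries` and `delay` parameters are hypothetical and not part of ChromaDB.

```python
import time


def wait_for_server(client, retries=5, delay=1.0):
    """Return True as soon as client.heartbeat() succeeds, else False.

    `client` is any object with a heartbeat() method, such as the
    chroma_client created above. The retry settings are assumptions.
    """
    for attempt in range(retries):
        try:
            client.heartbeat()
            return True
        except Exception:
            # Server not reachable yet; pause before the next attempt.
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

Calling `wait_for_server(chroma_client)` right after creating the client gives an early, readable failure instead of an error in a later cell.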

In [None]:
# Data preprocessing: drop the first column, name the remaining columns, and remove rows with a missing speaker or label
train_data = pd.read_csv("../data/train2.tsv", sep="\t", header=None).drop(0, axis=1)
train_data.columns = ['ID', 'label', 'statement', 'subject', 'speaker', 'speaker_title', 'state', 'party_affliation', 'barely_true_counts', 'false_counts', 'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts', 'context', 'extracted_justification']
train_data = train_data[(train_data['speaker'].notna()) & (train_data['label'].notna())].reset_index(drop=True)
train_data.head()

In [None]:
# Get or create a collection named "Misinformation"
collection = chroma_client.get_or_create_collection(name="Misinformation")
# Current number of entries, used below as the offset for new IDs
collection_count = collection.count()

In [None]:
"""
A ChromaDB collection takes three parallel lists: documents, one metadata dict per document, and one unique ID per document.
Here the statement is the document; label, speaker, party affiliation, and justification are the metadata.
IDs continue from the current collection count, so new entries are appended after any existing ones.
"""
documents = []
metadatas = []
ids = []
for i in range(train_data.shape[0]):
    documents.append(train_data.loc[i, 'statement'])
    metadatas.append({
        "label": train_data.loc[i, 'label'],
        "speaker": train_data.loc[i, "speaker"],
        "party_affliation": train_data.loc[i, "party_affliation"],
        "justification": train_data.loc[i, "extracted_justification"],
    })
    ids.append("id" + str(collection_count + i))
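
Only rows with a missing speaker or label are dropped above, so fields such as the justification can still be empty, and pandas represents those as `NaN`. ChromaDB metadata values are meant to be strings, numbers, or booleans, and whether a `NaN` is accepted may depend on the chromadb version. A minimal sketch of a clean-up step follows; the helper name `clean_metadata` and the empty-string default are assumptions.

```python
import math


def clean_metadata(meta, fill=""):
    """Return a copy of `meta` with None/NaN values replaced by `fill`."""
    cleaned = {}
    for key, value in meta.items():
        # pandas hands back missing cells as float NaN
        is_nan = isinstance(value, float) and math.isnan(value)
        cleaned[key] = fill if value is None or is_nan else value
    return cleaned
```

It could be applied to each dict before it is appended, for example `metadatas.append(clean_metadata({...}))`.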

In [None]:
# Upload every document, metadata dict, and ID into the collection
collection.add(documents=documents,
               metadatas=metadatas,
               ids=ids)
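
Sending the whole training set in one `add` call produces a large request, and some ChromaDB versions may reject a call above a maximum batch size. One possible workaround is to upload in slices. The helper name `add_in_batches` and the default `batch_size=1000` are assumptions, not values taken from ChromaDB.

```python
def add_in_batches(collection, documents, metadatas, ids, batch_size=1000):
    """Call collection.add on consecutive slices; return the number of calls."""
    calls = 0
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        calls += 1
    return calls
```

With the lists built above, `add_in_batches(collection, documents, metadatas, ids)` would replace the single `collection.add` call.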

In [None]:
# Test query to ensure it works: the two closest statements labeled "true"
results = collection.query(
    query_texts=["Hillary Clinton says Trump is more unhinged, more unstable than in 2016"],
    n_results=2,
    where={"label": "true"},
)
print(results)
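
The printed result is a dictionary of nested lists, with one inner list per query text, which is hard to read at a glance. The sketch below flattens the matches for one query into a list of rows. The helper name `flatten_results` is hypothetical, and the sample values are made up purely to show the shape of a response.

```python
def flatten_results(results, query_index=0):
    """Turn the matches for one query text into a list of flat dicts."""
    rows = []
    matches = zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    )
    for doc_id, statement, meta, distance in matches:
        rows.append({"id": doc_id, "statement": statement,
                     "distance": distance, **(meta or {})})
    return rows


# Made-up response with the same nesting as a query result
sample = {
    "ids": [["id1", "id2"]],
    "documents": [["first statement", "second statement"]],
    "metadatas": [[{"label": "true"}, {"label": "true"}]],
    "distances": [[0.25, 0.5]],
}
rows = flatten_results(sample)
```

Each row then holds the statement, its distance to the query, and its metadata side by side, which is convenient for `pd.DataFrame(rows)`.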

In [None]:
# A more advanced query: combine a speaker filter and a label filter with $and.
# collection.query(query_texts=["Hillary Clinton says Trump is more unhinged, more unstable than in 2016"], 
#                  n_results=2,
#                  where=
#                  {
#                      "$and": [
#                          {
#                              "speaker": {
#                              "$eq": "hillary-clinton"
#                              } 
#                          },
#                          {
#                              "label": {
#                                  "$eq": "true"
#                              }
#                          }
#                      ]
#                  })
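
Nested filters like the one above are easy to get wrong by hand. A small builder can generate them from keyword arguments. This is only a sketch: `build_where` is a hypothetical helper, and it assumes equality is the only comparison needed. A single condition is returned without the `$and` wrapper, since `$and` is meant to combine two or more expressions.

```python
def build_where(**conditions):
    """Build a ChromaDB-style `where` filter from field=value pairs."""
    clauses = [{field: {"$eq": value}} for field, value in conditions.items()]
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no $and
    return {"$and": clauses}
```

For example, `build_where(speaker="hillary-clinton", label="true")` produces the same filter as the commented-out query and can be passed as `where=`.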