# Liar Plus to ChromaDB
This notebook loads the Liar Plus dataset into a ChromaDB vector database. Gemini will then query this database so that it can base its responses on true statements from Liar Plus.

**Prerequisites**:
* Docker (I personally prefer Docker Desktop to see the images and containers)
* Python 3.11

You only need to run this notebook once: the data is stored in the `chroma_volume` Docker volume, so the database stays populated after the container shuts down. Make sure that you **do not** delete that volume!

In a terminal, run the following:
```bash
docker run --rm --name chromadb -v chroma_volume:/chroma/chroma -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE -p 8000:8000 chromadb/chroma
```
This starts a Docker container named `chromadb` from the `chromadb/chroma` image and exposes the ChromaDB server on port 8000.

Once you are done, shut down the server with:
```bash
docker stop chromadb
```

In [1]:
# to improve, read this https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide
import pandas as pd
import chromadb
import requests

In [2]:
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
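
If the container has only just started, the server may not accept connections yet. The sketch below retries the connection a few times before giving up. `connect_with_retry`, `retries`, and `delay` are hypothetical names introduced here, not part of ChromaDB; the only ChromaDB call it relies on is `heartbeat()`. It takes a factory function because, depending on the ChromaDB version, `chromadb.HttpClient(...)` itself may raise when the server is unreachable.

```python
import time


def connect_with_retry(make_client, retries=5, delay=1.0):
    """Create a client via make_client() and confirm it answers heartbeat().

    make_client is a zero-argument callable (hypothetical helper argument),
    e.g. lambda: chromadb.HttpClient(host='localhost', port=8000).
    """
    last_error = None
    for attempt in range(retries):
        try:
            client = make_client()
            client.heartbeat()  # raises if the server cannot be reached
            return client
        except Exception as error:
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay)  # wait before the next attempt
    raise ConnectionError(
        f"ChromaDB server not reachable after {retries} attempts"
    ) from last_error


# Possible usage in this notebook (assumes the Docker command above is running):
# chroma_client = connect_with_retry(
#     lambda: chromadb.HttpClient(host='localhost', port=8000))
```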

In [3]:
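# Data preprocessing: load the TSV (no header row), drop its leading index column,
# name the columns, and keep only rows that have both a speaker and a label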
train_data = pd.read_csv("../data/train2.tsv", sep="\t", header=None).drop(0, axis=1)
train_data.columns = ['ID', 'label', 'statement', 'subject', 'speaker', 'speaker_title', 'state', 'party_affliation', 'barely_true_counts', 'false_counts', 'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts', 'context', 'extracted_justification']
train_data = train_data[(train_data['speaker'].notna()) & (train_data['label'].notna())].reset_index(drop=True)
train_data.head()

Unnamed: 0,ID,label,statement,subject,speaker,speaker_title,state,party_affliation,barely_true_counts,false_counts,half_true_counts,mostly_true_counts,pants_on_fire_counts,context,extracted_justification
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer,That's a premise that he fails to back up. Ann...
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.,"Surovell said the decline of coal ""started whe..."
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver,Obama said he would have voted against the ame...
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release,The release may have a point that Mikulskis co...
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN,"Crist said that the economic ""turnaround start..."


In [4]:
# Build three parallel lists: one document (the statement), one metadata dict, and one id per row
documents = []
metadatas = []
ids = []
for i in range(train_data.shape[0]):
    documents.append(train_data.loc[i, 'statement'])
    metadatas.append({
        "label": train_data.loc[i, 'label'],
        "speaker": train_data.loc[i, "speaker"],
        "party_affliation": train_data.loc[i, "party_affliation"],
        "justification": train_data.loc[i, "extracted_justification"],
    })
    ids.append("id" + str(i))
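
The loop above copies values straight from the DataFrame, so any missing `party_affliation` or `extracted_justification` ends up in the metadata as `NaN`. The notebook does not check whether that causes problems. The following is a minimal sketch of one alternative that replaces missing values with empty strings. `clean_value`, `build_records`, and `METADATA_FIELDS` are hypothetical helpers, and using an empty string as the placeholder is an assumption, not something the dataset prescribes.

```python
import math

# Metadata key -> source column, mirroring the loop above
METADATA_FIELDS = {
    "label": "label",
    "speaker": "speaker",
    "party_affliation": "party_affliation",
    "justification": "extracted_justification",
}


def clean_value(value):
    """Replace None or NaN with an empty string (an assumed placeholder)."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return value


def build_records(rows):
    """Turn a list of row dicts into parallel documents, metadatas, and ids lists."""
    documents, metadatas, ids = [], [], []
    for i, row in enumerate(rows):
        documents.append(row["statement"])
        metadatas.append(
            {key: clean_value(row[column]) for key, column in METADATA_FIELDS.items()}
        )
        ids.append("id" + str(i))
    return documents, metadatas, ids


# Possible usage with the DataFrame from the preprocessing cell:
# documents, metadatas, ids = build_records(train_data.to_dict("records"))
```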

In [4]:
# Get the collection that will hold the documents, metadata, and ids (created if it does not exist yet)
collection = chroma_client.get_or_create_collection(name="Misinformation")

In [6]:
# Upload every document, metadata dict, and id into the collection
collection.add(documents=documents,
               metadatas=metadatas,
               ids=ids)
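
The call above sends the whole training set in a single request. Some ChromaDB versions limit how many records one `add` call may contain, so if the upload fails on the full dataset, splitting it into batches is one option. This is a sketch only: `iter_batches` and `add_in_batches` are hypothetical helpers, and the default `batch_size` of 1000 is an arbitrary assumption rather than a documented limit.

```python
def iter_batches(documents, metadatas, ids, batch_size=1000):
    """Yield aligned slices of the three parallel lists."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield documents[start:end], metadatas[start:end], ids[start:end]


def add_in_batches(collection, documents, metadatas, ids, batch_size=1000):
    """Call collection.add once per batch and return the number of records sent."""
    total = 0
    for docs, metas, batch_ids in iter_batches(documents, metadatas, ids, batch_size):
        collection.add(documents=docs, metadatas=metas, ids=batch_ids)
        total += len(batch_ids)
    return total


# Possible usage in place of the single add call:
# add_in_batches(collection, documents, metadatas, ids)
```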

In [5]:
# Test query to ensure it works!
results = collection.query(
    query_texts=["Hillary Clinton says Trump is more unhinged, more unstable than in 2016"],
    n_results=2,
    where={"label": "true"},
)
print(results)

{'ids': [['id4994', 'id9142']], 'distances': [[1.1043428182601929, 1.1649692058563232]], 'embeddings': None, 'metadatas': [[{'justification': 'But that quote is pulled from a story in which Obama expresses a sentiment that now that the war has started, the U. S.  should do the best job it can to steer Iraq toward stability. Obama joined the U. S.  Senate in 2005. He has voted several times to continue funding for the war, saying that troops in Iraq should be funded even if he disagreed with the overall war. (The measure passed 97 to zero.', 'label': 'true', 'party_affliation': 'none', 'speaker': 'glenn-beck'}, {'justification': 'She gave $2,400 this year to Rubio. Crist, in his statement to the St.  Petersburg Times editorial board, is off a little bit by saying 14 of the 20 have asked for a refund. Actually, 11 never gave Crist "a penny" for his 2010 Senate campaign. And a 12th person had his money back before sending the letter to Crist.', 'label': 'true', 'party_affliation': 'democr
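
The raw result is a dict of nested lists, with one inner list per query text, which is hard to read when printed. A minimal sketch of flattening it into one row per match follows. `summarize_results` is a hypothetical helper; it assumes the result contains an `ids` key and treats `distances`, `metadatas`, and `documents` as optional.

```python
def summarize_results(results, query_index=0):
    """Return one dict per match for the query at query_index."""
    ids = results["ids"][query_index]

    def column(key):
        # Some keys may be missing or None depending on what the query included
        values = results.get(key)
        return values[query_index] if values else [None] * len(ids)

    rows = []
    for doc_id, distance, metadata, document in zip(
        ids, column("distances"), column("metadatas"), column("documents")
    ):
        metadata = metadata or {}
        rows.append({
            "id": doc_id,
            "distance": distance,
            "label": metadata.get("label"),
            "speaker": metadata.get("speaker"),
            "statement": document,
        })
    return rows


# Possible usage with the query result above:
# for row in summarize_results(results):
#     print(row)
```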

In [None]:
# collection.query(
#     query_texts=["doc10", "thus spake zarathustra", ...],
#     n_results=10,
#     where={"metadata_field": "is_equal_to_this"},
#     where_document={"$contains":"search_string"}
# )

In [None]:
# More advanced queries that you could do. 
# collection.query(query_texts=["Hillary Clinton says Trump is more unhinged, more unstable than in 2016"], 
#                  n_results=2,
#                  where=
#                  {
#                      "$and": [
#                          {
#                              "speaker": {
#                              "$eq": "hillary-clinton"
#                              } 
#                          },
#                          {
#                              "label": {
#                                  "$eq": "true"
#                              }
#                          }
#                      ]
#                  })
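
Writing nested `$and` filters by hand gets verbose quickly. The sketch below builds the same structure from keyword arguments. `build_where` is a hypothetical helper, not part of ChromaDB. It returns a single clause unwrapped when only one condition is given because, as far as I know, ChromaDB expects `$and` to contain at least two clauses.

```python
def build_where(**conditions):
    """Build an equality filter for collection.query(where=...).

    Each keyword argument becomes {"field": {"$eq": value}}; two or more
    clauses are combined with "$and".
    """
    clauses = [{field: {"$eq": value}} for field, value in conditions.items()]
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]  # a single clause is passed on its own
    return {"$and": clauses}


# Possible usage, equivalent to the commented-out query above:
# collection.query(
#     query_texts=["Hillary Clinton says Trump is more unhinged, more unstable than in 2016"],
#     n_results=2,
#     where=build_where(speaker="hillary-clinton", label="true"),
# )
```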