## Full Wikipedia Example

This example walks through creating a minDB object for the entirety of English Wikipedia. With the parameters set in this example, the trained and compressed faiss index takes up only ~1.5GB, which is small enough to hold in memory on most computers. The full vectors and text take up ~150GB on disk.

#### Pre-requisites:
1. At least 250GB of free space on disk. The final result only takes up ~150GB, but the download step leaves you with both the raw downloaded files and the post-processed files (the `.arrow` files). This is why a later step removes the downloads folder, which is no longer needed once the `.arrow` files exist.
2. `datasets` and `pyarrow`. These can be installed with `pip install pyarrow datasets`.
3. A Cohere API key (only needed for making a test query). You can get one for free [here](https://dashboard.cohere.ai/welcome/register).

### Setup

Load the necessary packages and add the parent directory to the import path so that minDB can be imported locally.

In [None]:
from datasets import load_dataset # Run pip install datasets
import pyarrow as pa # Run pip install pyarrow
import os
import sys
import numpy as np

# Load in minDB from the local directory
current_dir = os.getcwd()
sys.path.append(current_dir + "/../")

from mindb.mindb import minDB

##### Define some helper functions for reading in the data

In [None]:
### Helper functions ###
def read_embeddings(data):
    """Return the "emb" column of the DataFrame as a list of vectors."""
    return list(data["emb"])

def read_text(data):
    """Return the "text" column of the DataFrame as a list of strings."""
    return list(data["text"])

### Download the data

Download the datasets from HuggingFace. This will take ~30-60 minutes depending on your internet connection.

THIS WILL TAKE UP ~250GB OF SPACE ON DISK. PLEASE MAKE SURE YOU HAVE THIS MUCH SPACE BEFORE PROCEEDING
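
Before starting the download, it can be worth checking the free space programmatically. The following is a minimal sketch using only the standard library; `has_enough_space` is a hypothetical helper name (not part of minDB or `datasets`), and the 250GB threshold is the figure quoted above.

```python
import shutil

REQUIRED_GB = 250  # figure quoted in the prerequisites above

def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper for illustration; 1 GB is counted as 1024**3 bytes,
    which makes the check slightly conservative.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# "." checks the current drive; pass the folder that holds your
# HuggingFace cache if it lives on a different drive.
if not has_enough_space("."):
    print(f"Warning: less than {REQUIRED_GB}GB free on this drive")
```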

In [None]:
docs = load_dataset("Cohere/wikipedia-22-12-en-embeddings", split="train")

In [None]:
# Define the filepath where the data was saved (this will be printed out above when the download is complete)
# It should be something like "/Users/{username}/.cache/huggingface/datasets/Cohere___parquet/..."
filepath = "/Users/{username}/.cache/huggingface/datasets/Cohere___parquet/..."
files = os.listdir(filepath)
files.sort()
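
The cache folder also contains files that are not `.arrow` files, so it can help to filter the listing up front. This is a small sketch under that assumption; `list_arrow_files` is a hypothetical helper and not something minDB provides.

```python
import os

def list_arrow_files(directory, max_files=None):
    """Return the sorted full paths of the .arrow files in `directory`.

    Hypothetical helper for illustration. If `max_files` is given,
    only the first `max_files` paths are returned.
    """
    names = sorted(n for n in os.listdir(directory) if n.endswith(".arrow"))
    if max_files is not None:
        names = names[:max_files]
    return [os.path.join(directory, n) for n in names]
```

Calling `list_arrow_files(filepath, max_files=25)` would then give a subset that contains only `.arrow` files, rather than counting the other files in the folder toward the limit.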

In [None]:
# Delete the downloads folder, since it is no longer needed
# The filepath for that should be something like "/Users/{username}/.cache/huggingface/datasets/downloads"

### PLEASE CONFIRM YOU HAVE THE CORRECT FILEPATH BEFORE RUNNING ###
import shutil
download_dir = '/Users/{username}/.cache/huggingface/datasets/downloads'
shutil.rmtree(download_dir)
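
Because `shutil.rmtree` deletes a whole folder tree without confirmation, a guard around it may be useful. The sketch below is one possible approach; `remove_downloads_dir` and its `dry_run` flag are hypothetical names introduced here, and the check that the folder is called `downloads` is an assumption based on the path shown above.

```python
import os
import shutil

def remove_downloads_dir(path, dry_run=True):
    """Delete `path` only if it is an existing directory named 'downloads'.

    Hypothetical helper for illustration. With dry_run=True (the default)
    nothing is deleted and False is returned; otherwise the folder is
    removed and True is returned.
    """
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) != "downloads":
        raise ValueError(f"Refusing to delete {path!r}: not a 'downloads' folder")
    if not os.path.isdir(normalized):
        raise FileNotFoundError(normalized)
    if dry_run:
        print(f"Would delete {normalized}")
        return False
    shutil.rmtree(normalized)
    return True
```

Running it once with the default `dry_run=True` prints the folder that would be removed, so the path can be confirmed before calling it again with `dry_run=False`.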

### Create the minDB object

In [None]:
# Create the minDB object
db_name = 'wikipedia_database'
db = minDB(db_name)

### Add data to the minDB object

This section parses each file that was downloaded from Cohere to get the embeddings and text. It then creates the list of tuples needed for the `db.add()` method.

If you want to test this on a subset of the data first, you can set `max_files` to a smaller number (25, for example, would add ~10% of the data to the minDB object).

In [None]:
# Read in each file and add the vectors and text to a minDB object (this takes ~45-60 minutes for the entire dataset)

# Optional - set a max number of files to read in (the entire Wikipedia dataset is 252 files)
max_files = 500 # initialize to a large number to read in everything

for i, file in enumerate(files):
    if i >= max_files:
        break

    filename = os.path.join(filepath, file)
    # We only care about the .arrow files
    if not filename.endswith('.arrow'):
        continue
    print(i)  # progress indicator

    # Memory-map the file and read it into a pandas DataFrame
    mmapped_file = pa.memory_map(filename, 'r')
    reader = pa.ipc.open_stream(mmapped_file)
    table = reader.read_all()
    data = table.to_pandas()

    embeddings = read_embeddings(data)
    text = read_text(data)

    # Build the (vector, metadata) tuples; j is used so the file counter i is not reused
    add_data = [(embeddings[j], {"text": text[j]}) for j in range(len(embeddings))]
    db.add(add_data)
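
The list passed to `db.add()` above pairs each vector with a metadata dictionary. As an illustration of that shape only, here is a small sketch on toy data; `build_add_data` is a hypothetical helper, and the length check is an added safeguard rather than something the loop above does.

```python
def build_add_data(embeddings, texts):
    """Pair each embedding with a {"text": ...} metadata dict.

    Hypothetical helper for illustration; it produces the same
    (vector, metadata) tuple shape that the loop above builds.
    """
    if len(embeddings) != len(texts):
        raise ValueError("embeddings and texts must have the same length")
    return [(emb, {"text": t}) for emb, t in zip(embeddings, texts)]

# Toy example with two 3-dimensional vectors
toy = build_add_data([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], ["first", "second"])
print(toy[0])
```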

### Train the minDB object

Training on the entire dataset should take 3-4 hours.

Make sure your computer doesn't go to sleep, or the training will pause.

For more information on these parameters, you can visit the GitHub Wiki [here](https://github.com/D-Star-AI/minDB/wiki/Tunable-parameters).

In [None]:
db.train(True, pca_dimension=256, compressed_vector_bytes=32, omit_opq=True)
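
To see roughly where the ~1.5GB index figure could come from, here is a back-of-the-envelope sketch. It rests on several assumptions that are not stated in this example: that the dataset holds about 35 million vectors of 768 float32 values each, that `compressed_vector_bytes` is the stored size of each compressed vector, and that the index keeps an 8-byte ID per vector. The function names are hypothetical.

```python
def index_size_gb(num_vectors, compressed_vector_bytes, id_bytes=8):
    """Approximate compressed index size in GB (1 GB = 1e9 bytes).

    Assumes each vector costs its compressed code plus an `id_bytes` ID;
    other index overhead is ignored.
    """
    return num_vectors * (compressed_vector_bytes + id_bytes) / 1e9

def raw_size_gb(num_vectors, dimension, bytes_per_value=4):
    """Approximate size in GB of the uncompressed float32 vectors."""
    return num_vectors * dimension * bytes_per_value / 1e9

NUM_VECTORS = 35_000_000  # assumption: approximate row count of the dataset
DIMENSION = 768           # assumption: embedding dimension before PCA

print(index_size_gb(NUM_VECTORS, compressed_vector_bytes=32))
print(raw_size_gb(NUM_VECTORS, DIMENSION))
```

Under these assumptions the compressed index comes out at about 1.4GB and the raw vectors at a little over 100GB, which is in line with the ~1.5GB and ~150GB (vectors plus text) figures given at the top of this example.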

### Query the minDB

You can ask anything you like here. 

In [None]:
import cohere
co = cohere.Client("ENTER_YOUR_API_KEY_HERE")

query = ["Who was the founder of YouTube?"]
embeddings = co.embed(texts=query, model="embed-multilingual-v2.0").embeddings
# Convert to a numpy array and normalize the query embedding to unit length
query_embedding = np.array(embeddings[0]) / np.linalg.norm(embeddings[0])
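
The query embedding is divided by its norm so that it has unit length. A likely reason, stated here as an assumption, is that the dot product of two unit vectors equals their cosine similarity. The dependency-free sketch below illustrates that property; `normalize` and `dot` are hypothetical helper names.

```python
import math

def normalize(vector):
    """Return `vector` scaled to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([6.0, 8.0])  # same direction as a, different length
# After normalization, vectors pointing the same way have a dot product of ~1
print(dot(a, b))
```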

In [None]:
# Query the index and print the metadata of the first five results
results = db.query(query_embedding)
print(results["metadata"][0:5])
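
Printing the raw metadata can be hard to read for long passages. The sketch below shows one way to print shortened snippets. It assumes that `results["metadata"]` is a list of dictionaries, each with the `"text"` key that was stored when the data was added; `top_texts` is a hypothetical helper, and the sample `results` dictionary is made up for illustration.

```python
def top_texts(results, k=5, max_chars=200):
    """Return the first `k` result texts, each cut to `max_chars` characters.

    Hypothetical helper; assumes results["metadata"] is a list of dicts
    that each contain a "text" key.
    """
    return [m["text"][:max_chars] for m in results["metadata"][:k]]

# Made-up results dictionary with the assumed shape
sample_results = {"metadata": [{"text": "First passage."}, {"text": "Second passage."}]}
for rank, snippet in enumerate(top_texts(sample_results), start=1):
    print(rank, snippet)
```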