
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RelevanceAI/RelevanceAI-readme-docs/blob/v0.33.2/docs/GENERAL_FEATURES/creating-a-dataset/_notebooks/creating-a-dataset.ipynb)

# Installation

In [1]:
!pip install -U 'RelevanceAI[notebook]==0.33.2'






# Setup

In [2]:
from relevanceai import Client

"""
You can sign up/login and find your credentials here: https://cloud.relevance.ai/sdk/api
Once you have signed up, click on the value under `Authorization token` and paste it here
"""
client = Client()



### Creating a dataset

To create a new dataset, pass the name under which you wish to save it to `client.Dataset`, then insert your documents, as shown below. In this example, we use `ecommerce-sample-dataset` as the name and populate it with a sample e-commerce dataset.

In [3]:
from relevanceai.datasets import get_ecommerce_dataset_encoded

documents = get_ecommerce_dataset_encoded()
{k: v for k, v in documents[0].items() if '_vector_' not in k}


{'_id': '39e05c63-c6af-49a4-acaa-5f45fdba652d',
 'product_image': 'https://ak1.ostkcdn.com/images/products/9645968/P16829728.jpg',
 'query': 'workout clothes for women',
 'product_price': '$89.99 ',
 'source': 'Overstock',
 'product_title': "Electric Yoga Women's Black Graffiti Workout Pants",
 'product_link': 'https://www.overstock.com/Clothing-Shoes/Electric-Yoga-Womens-Black-Graffiti-Workout-Pants/9645968/product.html?refccid=YT3JBWC34MQKPKKCPMCSDWOOAU&searchidx=5',
 'insert_date_': '2021-12-24T07:11:17.450Z'}

In [4]:
df = client.Dataset("ecommerce-sample-dataset")
df.insert_documents(documents)


while inserting, you can visit your dashboard at https://cloud.relevance.ai/dataset/ecommerce-sample-dataset/dashboard/monitor/


{'inserted': 739, 'failed_documents': [], 'failed_documents_detailed': []}


See [Inserting and updating documents](doc:inserting-data) for more details on how to insert/upload documents into a dataset.


* Id field: The Relevance AI platform identifies unique entries within a dataset using a field called `_id`; every document in the dataset must include an `_id` field with a value unique to that document.
* Vector fields: The name of every vector field must end in `_vector_`.
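
Both conventions can be checked locally before uploading. The following is a minimal sketch, not part of the SDK: `check_documents` is a hypothetical helper, and the sample documents are made up for illustration. It reports documents missing an `_id`, duplicated `_id` values, and the names of fields ending in `_vector_`.

```python
from collections import Counter


def check_documents(documents):
    """Hypothetical pre-upload check; not part of the RelevanceAI SDK."""
    # Indices of documents that lack an `_id` field altogether.
    missing_id = [i for i, doc in enumerate(documents) if "_id" not in doc]
    # `_id` values that occur more than once.
    counts = Counter(doc["_id"] for doc in documents if "_id" in doc)
    duplicate_ids = sorted((_id for _id, n in counts.items() if n > 1), key=str)
    # Field names that follow the `_vector_` suffix convention.
    vector_fields = sorted(
        {key for doc in documents for key in doc if key.endswith("_vector_")}
    )
    return {
        "missing_id": missing_id,
        "duplicate_ids": duplicate_ids,
        "vector_fields": vector_fields,
    }


# Made-up documents: one duplicate `_id` and one document without `_id`.
sample_docs = [
    {"_id": "a", "product_title": "Yoga pants", "product_title_clip_vector_": [0.1, 0.2]},
    {"_id": "a", "product_title": "Necklace"},
    {"product_title": "No id here"},
]
report = check_documents(sample_docs)
print(report)
```

Running a check like this before `insert_documents` keeps malformed documents from reaching the platform in the first place.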


### List your datasets


You can see a list of all datasets you have uploaded to your account in the dashboard.

<img src="https://github.com/RelevanceAI/RelevanceAI-readme-docs/blob/v0.33.2/docs_template/GENERAL_FEATURES/creating-a-dataset/_assets/dataset-list-view.png?raw=true" alt="Datasets List View" />


Alternatively, you can list your datasets through the Python SDK, as shown below:



In [5]:

client.list_datasets()


You can view all your datasets at https://cloud.relevance.ai/datasets.


{'datasets': ['workshop-jan-21',
  'tweets_opensea',
  'tweets_elonmusk',
  'test_dataset_sdk_v0300',
  'test_dataset_sdk_v0.31.0',
  'sydney_real_estate_april_2021',
  'sydney-house-prices',
  'skills',
  'sample_test_insert_pd',
  'sample_test_insert_csv',
  'sample_data_chunk',
  'sales-report',
  'quickstart_tfhub_qa',
  'quickstart_text_search',
  'quickstart_text_searc',
  'quickstart_sample',
  'quickstart_multi_vector_search',
  'quickstart_kmeans_clustering',
  'quickstart_clustering_metadata',
  'quickstart_clustering_list_furthest',
  'quickstart_clustering_list_closest',
  'quickstart_clustering_kmeans',
  'quickstart_clustering_aggregation',
  'quickstart_clustering',
  'quickstart_clip',
  'quickstart_chunk_data_encoding',
  'quickstart_auto_clustering_kmeans',
  'quickstart_aggregation',
  'quickstart-example',
  'quickstart',
  'nft-sample-dataset',
  'national_skills_commission_occupations_titles',
  'national_skills_commission_occupations_specialist_skills',
  'nation


### Monitoring a specific dataset

RelevanceAI's dashboard at https://cloud.relevance.ai is the most straightforward place to monitor your data.

<img src="https://github.com/RelevanceAI/RelevanceAI-readme-docs/blob/v0.33.2/docs_template/GENERAL_FEATURES/creating-a-dataset/_assets/monitor-dataset.png?raw=true" width="1263" alt="vector_health.png" />

Alternatively, you can monitor the health of a dataset with the command below. For each field, it returns the number of documents in which the field exists and the number in which it is missing; for vector fields, it also reports how many documents contain zero vectors.


In [6]:
df.health


You can view your dashboard at: https://cloud.relevance.ai/dataset/ecommerce-sample-dataset/dashboard/monitor/schema


{'insert_date_': {'missing': 0, 'exists': 739},
 'product_image': {'exists': 739, 'missing': 0},
 'product_image_clip_vector_': {'missing': 0,
  'exists': 739,
  'number_of_documents_with_zero_vectors': 1},
 'product_link': {'missing': 0, 'exists': 739},
 'product_price': {'exists': 739, 'missing': 0},
 'product_title': {'missing': 0, 'exists': 739},
 'product_title_clip_vector_': {'missing': 0,
  'exists': 739,
  'number_of_documents_with_zero_vectors': 0},
 'query': {'exists': 739, 'missing': 0},
 'source': {'exists': 739, 'missing': 0}}
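
A health report of this shape can be scanned programmatically for fields that need attention. The sketch below is an illustration only: `summarise_health` is a hypothetical helper, and it operates on a hand-copied subset of the output above, on the assumption that `df.health` returns a plain dictionary like the one printed.

```python
def summarise_health(health):
    """Return only the fields that have missing values or zero vectors."""
    problems = {}
    for field, stats in health.items():
        missing = stats.get("missing", 0)
        # Only vector fields carry this key; default to 0 for the rest.
        zero_vectors = stats.get("number_of_documents_with_zero_vectors", 0)
        if missing or zero_vectors:
            problems[field] = {"missing": missing, "zero_vectors": zero_vectors}
    return problems


# A subset of the health output shown above, copied into a plain dict.
health = {
    "product_image": {"exists": 739, "missing": 0},
    "product_image_clip_vector_": {
        "missing": 0,
        "exists": 739,
        "number_of_documents_with_zero_vectors": 1,
    },
    "product_title_clip_vector_": {
        "missing": 0,
        "exists": 739,
        "number_of_documents_with_zero_vectors": 0,
    },
}
problems = summarise_health(health)
print(problems)
```

For this sample, only `product_image_clip_vector_` is flagged, because one of its documents holds a zero vector.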

In [7]:
df.schema


{'insert_date_': 'date',
 'product_image': 'text',
 'product_image_clip_vector_': {'vector': 512},
 'product_link': 'text',
 'product_price': 'text',
 'product_title': 'text',
 'product_title_clip_vector_': {'vector': 512},
 'query': 'text',
 'source': 'text'}
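
The schema makes it easy to find the vector fields and their dimensionality. As a rough sketch, assuming `df.schema` returns a dictionary shaped like the output above, a hypothetical helper `vector_dimensions` could extract them:

```python
def vector_dimensions(schema):
    """Map each vector field name to its dimensionality."""
    return {
        field: spec["vector"]
        for field, spec in schema.items()
        # Vector fields are described by a dict such as {'vector': 512};
        # other fields are described by a plain type name such as 'text'.
        if isinstance(spec, dict) and "vector" in spec
    }


# The schema output shown above, copied into a plain dict.
schema = {
    "insert_date_": "date",
    "product_image": "text",
    "product_image_clip_vector_": {"vector": 512},
    "product_link": "text",
    "product_price": "text",
    "product_title": "text",
    "product_title_clip_vector_": {"vector": 512},
    "query": "text",
    "source": "text",
}
dims = vector_dimensions(schema)
print(dims)
```

Comparing these dimensions against the length of the vectors in a document before inserting it is one way to spot mismatches early.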

### Deleting a dataset

You can delete an existing dataset from the dashboard, using the delete option available for each dataset, or through the Python SDK:




In [8]:
client.delete_dataset(dataset_id="ecommerce-sample-dataset")


{'status': 'complete', 'message': 'ecommerce-sample-dataset deleted'}

## Inserting Documents

Inserting new documents into a dataset is as simple as the command below.

In [9]:
df.insert_documents(documents=documents)


while inserting, you can visit your dashboard at https://cloud.relevance.ai/dataset/ecommerce-sample-dataset/dashboard/monitor/


{'inserted': 739, 'failed_documents': [], 'failed_documents_detailed': []}
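
The returned dictionary can be checked in code rather than by eye. The helper below is a hypothetical sketch, not an SDK function; it assumes the result always has the keys shown in the output above and raises if any document failed or the inserted count is not what was expected.

```python
def assert_all_inserted(result, expected_count):
    """Raise if an insert result reports failures or an unexpected count."""
    failed = result.get("failed_documents", [])
    if failed:
        raise RuntimeError(f"{len(failed)} document(s) failed to insert: {failed}")
    if result.get("inserted") != expected_count:
        raise RuntimeError(
            f"expected {expected_count} inserted, got {result.get('inserted')}"
        )
    return True


# The result shown above, copied into a plain dict.
result = {"inserted": 739, "failed_documents": [], "failed_documents_detailed": []}
ok = assert_all_inserted(result, expected_count=739)
print(ok)
```

In a notebook, this would typically wrap the call directly, for example by passing the return value of `df.insert_documents(documents=documents)` together with `len(documents)`.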

## Upserting Documents

To update only specific documents, use `upsert_documents`, as shown in the example below:


In [10]:
SAMPLE_DOCUMENT = {
    '_id': '711160239',
    'product_image': 'https://thumbs4.ebaystatic.com/d/l225/pict/321567405391_1.jpg',
    'product_image_clip_vector_': [0.1, 0.1, 0.1],
    'product_link': 'https://www.ebay.com/itm/20-36-Mens-Silver-Stainless-Steel-Braided-Wheat-Chain-Necklace-Jewelry-3-4-5-6MM-/321567405391?pt=LH_DefaultDomain_0&var=&hash=item4adee9354f',
    'product_price': '$7.99 to $12.99',
    'product_title': '20-36Mens Silver Stainless Steel Braided Wheat Chain Necklace Jewelry 3/4/5/6MM"',
    'product_title_clip_vector_': [0.1, 0.1, 0.1],
    'query': 'steel necklace',
    'source': 'eBay'
}


In [11]:
df.upsert_documents(documents=[SAMPLE_DOCUMENT])


{'inserted': 1, 'failed_documents': [], 'failed_documents_detailed': []}
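
One way to picture an upsert is as an insert-or-update keyed on `_id`. The sketch below models that idea with a plain dictionary. It is an illustration only, not the SDK's implementation: the `upsert` function, the `new-doc` id, and the choice to merge new fields into an existing document (rather than replace it wholesale) are all assumptions that this guide does not confirm.

```python
def upsert(store, documents):
    """Insert-or-update documents in `store`, a dict keyed by `_id`."""
    inserted, updated = 0, 0
    for doc in documents:
        _id = doc["_id"]
        if _id in store:
            # Assumed behaviour: merge the new fields over the existing document.
            store[_id] = {**store[_id], **doc}
            updated += 1
        else:
            # Unknown `_id`: store a copy as a new document.
            store[_id] = dict(doc)
            inserted += 1
    return {"inserted": inserted, "updated": updated}


# A local stand-in for a dataset that already holds one document.
store = {"711160239": {"_id": "711160239", "query": "steel necklace", "source": "eBay"}}
summary = upsert(
    store,
    [
        {"_id": "711160239", "product_price": "$7.99 to $12.99"},  # existing id
        {"_id": "new-doc", "query": "silver chain"},  # hypothetical new id
    ],
)
print(summary)
```

Under these assumptions, the existing document keeps its `query` and `source` fields and gains `product_price`, while the unknown `_id` is added as a new document.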

## Inserting CSV

Uploading a CSV file is as simple as specifying the path to the file.

In [None]:
df = client.Dataset('quickstart_insert_csv')

csv_fpath = "./sample_data/mnist_tests.csv"
df.insert_csv(filepath_or_buffer=csv_fpath)
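
Before uploading, it can help to preview how CSV rows map onto documents. The sketch below does this locally with the standard library; it does not describe how `insert_csv` works internally, and `csv_to_documents`, the `id_column` parameter, and the inline sample data are hypothetical. Rows without a designated id column get their row index as `_id`, which is an assumption made for this illustration.

```python
import csv
import io


def csv_to_documents(file_obj, id_column=None):
    """Read CSV rows into dicts, adding an `_id` from `id_column` or the row index."""
    documents = []
    for index, row in enumerate(csv.DictReader(file_obj)):
        doc = dict(row)
        # Use the chosen column as `_id` if given, otherwise the row index.
        doc["_id"] = row[id_column] if id_column else str(index)
        documents.append(doc)
    return documents


# Inline sample data stands in for a file on disk.
sample_csv = io.StringIO("label,pixel_sum\n7,10423\n2,9876\n")
docs = csv_to_documents(sample_csv)
print(docs)
```

Note that the `csv` module reads every value as a string, so numeric columns need explicit conversion if this preview is used for anything beyond inspection.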
