# Phoenix Dataset Object

This short tutorial demonstrates how to use the 🔥🐦 Phoenix `Dataset` object.

A `Dataset` currently consists of a dataframe and a schema. Data can be loaded from:
* a pandas DataFrame directly
* local files: CSV and HDF5

In [1]:
import pandas as pd
from phoenix.datasets import Dataset, Schema, EmbeddingColumnNames

In [2]:
test_filename = "NLP_sentiment_classification_language_drift"

df1 = pd.read_csv(f"./fixtures/{test_filename}.csv")
df1.head()

Unnamed: 0.1,Unnamed: 0,prediction_ts,reviewer_age,reviewer_gender,product_category,language,text,text_vector,label,pred_label
0,0,1650092000.0,21,female,apparel,english,Poor quality of fabric and ridiculously tight ...,[-7.05169961e-02 6.64003372e-01 3.35792184e-...,negative,negative
1,1,1650092000.0,29,male,kitchen,english,"Love these glasses, thought they'd be everyday...",[-2.44109239e-03 -5.40627480e-01 3.17134917e-...,positive,positive
2,2,1650093000.0,26,female,sports,english,"These are disgusting, it tastes like you are ""...",[ 4.04878825e-01 8.23539615e-01 3.83339435e-...,neutral,neutral
3,549,1650175000.0,22,female,beauty,english,"Their ok, not steady, not exactly necessary.",[-1.78982448e-02 -4.12485093e-01 4.79548365e-...,neutral,neutral
4,550,1650176000.0,34,male,other,english,I actually got these rain ponchos to use on my...,[ 5.38790300e-02 -5.83797634e-01 1.63948029e-...,positive,positive


Define the schema the same way you would in our SDK.

In [3]:
features = [
    "reviewer_age",
    "reviewer_gender",
    "product_category",
    "language",
]

embedding_features = {
    "embedding_feature": EmbeddingColumnNames(
        vector_column_name="text_vector",  # Will be name of embedding feature in the app
        data_column_name="text",
    ),
}

# Define a Schema object so that we can map your data's columns to a Dataset
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features,
)
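Because a schema only maps roles to column names, it can be useful to confirm that each name actually exists in the dataframe before building a `Dataset`. The following is a minimal sketch using plain pandas. `missing_columns` is a hypothetical helper, not part of Phoenix, and the small dataframe stands in for `df1`.

```python
from typing import List

import pandas as pd


def missing_columns(df: pd.DataFrame, column_names: List[str]) -> List[str]:
    """Return the names in column_names that are not columns of df (hypothetical helper)."""
    return [name for name in column_names if name not in df.columns]


# Small stand-in for df1 with a subset of its columns
toy_df = pd.DataFrame(
    {
        "prediction_ts": [1650092000.0],
        "reviewer_age": [21],
        "label": ["negative"],
        "pred_label": ["negative"],
    }
)

# Column names that a schema like the one above would refer to
wanted = ["prediction_id", "prediction_ts", "pred_label", "label", "reviewer_age"]
missing = missing_columns(toy_df, wanted)
print(missing)
```

In this tutorial's data, `prediction_id` is not among the columns shown by `df1.head()`, so a check like this would flag it. Whether Phoenix requires that column to exist is not covered here.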

You are now ready to define a `Dataset`.

In [4]:
# Defined directly from dataframe
dataset1 = Dataset(df1, schema)
dataset2 = Dataset.from_dataframe(df1, schema)
print(dataset1.get_embedding_vector_column("embedding_feature"))
# Defined from csv
dataset3 = Dataset.from_csv(f"./fixtures/{test_filename}.csv", schema=schema)
# Defined from hdf5
# dataset4 = Dataset.from_hdf(f"./fixtures/{test_filename}.hdf5", schema=schema, key="training")

0       [-7.05169961e-02  6.64003372e-01  3.35792184e-...
1       [-2.44109239e-03 -5.40627480e-01  3.17134917e-...
2       [ 4.04878825e-01  8.23539615e-01  3.83339435e-...
3       [-1.78982448e-02 -4.12485093e-01  4.79548365e-...
4       [ 5.38790300e-02 -5.83797634e-01  1.63948029e-...
                              ...                        
7449    [ 9.93590578e-02  3.58262956e-02 -1.73807964e-...
7450    [-1.67410597e-01 -9.89676654e-01  4.06790376e-...
7451    [ 0.02008727 -0.2938754   0.39946282 -0.183999...
7452    [ 0.02993855 -0.48610854  0.08177334 -0.270041...
7453    [-4.18550447e-02 -3.66181880e-01  2.91265190e-...
Name: text_vector, Length: 7454, dtype: object
<class 'str'> guard: True
0       [-7.05169961e-02  6.64003372e-01  3.35792184e-...
1       [-2.44109239e-03 -5.40627480e-01  3.17134917e-...
2       [ 4.04878825e-01  8.23539615e-01  3.83339435e-...
3       [-1.78982448e-02 -4.12485093e-01  4.79548365e-...
4       [ 5.38790300e-02 -5.83797634e-01  1.63948029e-...
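The `text_vector` column printed above has dtype `object`, and when the data comes from a CSV file each entry is a string rather than an array. One possible workaround is to parse the strings back into NumPy arrays before building a `Dataset`. The sketch below assumes each string is a bracketed, whitespace-separated list of floats, as the truncated preview suggests. `parse_vector` is a hypothetical helper and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd


def parse_vector(text: str) -> np.ndarray:
    """Convert a string such as "[0.1 -0.2 0.3]" into a 1-D float array."""
    # Drop the surrounding brackets, then split on any whitespace (including newlines)
    return np.array(text.strip().strip("[]").split(), dtype=float)


# Stand-in for a column read from CSV: each embedding is a string, not an array
raw = pd.Series(["[-7.05e-02 6.64e-01 3.36e-01]", "[ 4.05e-01 8.24e-01\n 3.83e-01]"])
vectors = raw.apply(parse_vector)

print(type(raw.iloc[0]).__name__)
print(type(vectors.iloc[0]).__name__)
print(vectors.iloc[0].shape)
```

With real data, the same function could be applied to the embedding column right after `pd.read_csv`, provided the stored format matches the assumption above.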

The following is an issue we need to investigate. All datasets appear to be equal, which seems fine at first glance. However, when loading a CSV file, the embeddings are read as strings (an issue has been filed to fix this), so the following comparison should not evaluate to `True`.

In [5]:
# dataset4 is left out because the HDF5 example above is commented out
dataset1 == dataset2 == dataset3

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
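The `ValueError` above is what pandas raises when a DataFrame is used where a single boolean is expected. A chained comparison `a == b == c` is evaluated as `(a == b) and (b == c)`, so if `==` returns an elementwise result, the `and` has to coerce it to a boolean. Whether Phoenix's `Dataset` comparison delegates to the underlying DataFrame in this way is an assumption; the sketch below reproduces the behavior with plain pandas only.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": ["p", "q"]})
b = a.copy()

# Elementwise comparison returns a DataFrame of booleans, not a single bool
elementwise = a == b
print(type(elementwise).__name__)

# A chained == must coerce that DataFrame to a bool, which raises ValueError
try:
    a == b == a
    raised = False
except ValueError:
    raised = True
print(raised)

# DataFrame.equals checks the whole frame and returns a single boolean
same = a.equals(b)
print(same)
```

For a whole-object comparison, `DataFrame.equals` is usually the more appropriate tool than `==`.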

In [None]:
df2 = df1.copy()
df2.rename(columns={"prediction_ts": "timestamp", "label": "actual_label"}, inplace=True)
df2.head()

In [None]:
# Define a Schema object to map your columns to a Dataset
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="timestamp",
    prediction_label_column_name="pred_label",
    actual_label_column_name="actual_label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features,
)
dataset5 = Dataset(df2, schema)  # use the renamed dataframe that matches this schema

This is another issue. Here we have different dataframes with different schemas, yet the `Dataset` objects still compare as equal?

In [None]:
dataset1 == dataset5
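The behavior one might expect from such a comparison can be sketched with a toy class whose equality takes both the schema and the dataframe into account. `ToySchema` and `ToyDataset` are hypothetical stand-ins written for this illustration; they do not reflect how Phoenix implements `Dataset` or `Schema`.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ToySchema:
    """Hypothetical stand-in for a schema: just two column-name fields."""

    timestamp_column_name: str
    actual_label_column_name: str


class ToyDataset:
    """Hypothetical stand-in for a dataset: a dataframe plus a schema."""

    def __init__(self, dataframe: pd.DataFrame, schema: ToySchema) -> None:
        self.dataframe = dataframe
        self.schema = schema

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToyDataset):
            return NotImplemented
        # Equal only if both the schema and the full dataframe contents match
        return self.schema == other.schema and self.dataframe.equals(other.dataframe)


df_a = pd.DataFrame({"prediction_ts": [1.0, 2.0], "label": ["pos", "neg"]})
df_b = df_a.rename(columns={"prediction_ts": "timestamp", "label": "actual_label"})

ds_a = ToyDataset(df_a, ToySchema("prediction_ts", "label"))
ds_a_copy = ToyDataset(df_a.copy(), ToySchema("prediction_ts", "label"))
ds_b = ToyDataset(df_b, ToySchema("timestamp", "actual_label"))

print(ds_a == ds_a_copy)
print(ds_a == ds_b)
```

Under this design, two datasets built from identical data and schemas compare equal, while renaming columns or changing the schema makes them differ.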