# Phoenix Dataset Object

This short tutorial demonstrates how to use the 🔥🐦 Phoenix `Dataset` object.

A `Dataset` currently consists of a dataframe and a schema. Data can be loaded from:
* A pandas DataFrame, directly
* Local files: CSV and HDF5

In [22]:
import pandas as pd
from phoenix.datasets import Dataset, Schema, EmbeddingColumnNames

In [23]:
test_filename = "NLP_sentiment_classification_language_drift"

df1 = pd.read_csv(f"./fixtures/{test_filename}.csv")
df1.head()

Unnamed: 0,prediction_ts,reviewer_age,reviewer_gender,product_category,language,text,text_vector,label,pred_label
0,1650092000.0,21,female,apparel,english,Poor quality of fabric and ridiculously tight ...,[-7.05169961e-02 6.64003372e-01 3.35792184e-...,negative,negative
1,1650092000.0,29,male,kitchen,english,"Love these glasses, thought they'd be everyday...",[-2.44109239e-03 -5.40627480e-01 3.17134917e-...,positive,positive
2,1650093000.0,26,female,sports,english,"These are disgusting, it tastes like you are ""...",[ 4.04878825e-01 8.23539615e-01 3.83339435e-...,negative,negative
3,1650093000.0,26,male,other,english,My husband has a pair of TaoTronics so I decid...,[ 0.01881652 0.53441304 0.4907303 -0.024163...,neutral,neutral
4,1650093000.0,37,male,home_improvement,english,"Threads too deep. Engages on tank, but gasket ...",[-0.25348073 0.31603432 0.35810202 -0.246728...,negative,negative


Define the schema the same way you would in our SDK:

In [24]:
features = [
    'reviewer_age',
    'reviewer_gender',
    'product_category',
    'language',
]

embedding_features = {
    # The dictionary key will be the name of the embedding feature in the app
    "embedding_feature": EmbeddingColumnNames(
        vector_column_name="text_vector",
        data_column_name="text",
    ),
}

# Define a Schema() object so that Phoenix reads the data from the correct columns
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)
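Because a schema only stores column names, it can be useful to confirm that those names actually exist in the dataframe before building a `Dataset`. The snippet below is a minimal sketch of such a check using plain pandas; `missing_columns` is a hypothetical helper written for this illustration, not part of Phoenix, and the small dataframe is made-up stand-in data.

```python
import pandas as pd


def missing_columns(dataframe: pd.DataFrame, column_names: list) -> list:
    """Return the names in column_names that are not columns of dataframe."""
    return [name for name in column_names if name not in dataframe.columns]


# Made-up stand-in data; only the column names matter for this check.
example_df = pd.DataFrame(
    {
        "prediction_ts": [1650092000.0],
        "pred_label": ["negative"],
        "label": ["negative"],
    }
)

# Hypothetical list of the column names a schema would refer to.
required = ["prediction_id", "prediction_ts", "pred_label", "label"]
not_found = missing_columns(example_df, required)
print(not_found)
```

An empty list means every name the schema refers to is present in the dataframe.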



You are now ready to define a `Dataset`:

In [10]:
# Defined directly from dataframe
dataset1 = Dataset(df1, schema)
dataset2 = Dataset.from_dataframe(df1, schema)
# Defined from csv
dataset3 = Dataset.from_csv(f"./fixtures/{test_filename}.csv", schema=schema)
# Defined from hdf5
dataset4 = Dataset.from_hdf(f"./fixtures/{test_filename}.hdf5", schema=schema, key="training")

The following behavior needs investigation. All of the datasets compare equal, which seems fine at first glance. However, when a CSV file is loaded, the embeddings are read as strings (an issue has been filed to fix this). Hence the following condition should not be `True`:

In [11]:
dataset1 == dataset2 == dataset3 == dataset4

True
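The string-embedding behavior mentioned above can be reproduced with pandas alone. The following is a minimal sketch, assuming the vectors are stored in the CSV as the text form of a NumPy array (for example `[0.1 0.2]`); `parse_vector` is a hypothetical helper written for this illustration, not part of Phoenix, and the data is made up.

```python
import io

import numpy as np
import pandas as pd


def parse_vector(cell: str) -> np.ndarray:
    """Convert a string such as '[0.1 0.2]' back into a float array."""
    return np.array(cell.strip("[]").split(), dtype=float)


# Made-up dataframe with one embedding column holding NumPy arrays.
original = pd.DataFrame(
    {"text_vector": [np.array([0.1, 0.2]), np.array([0.3, 0.4])]}
)

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

# After the round trip each cell is a plain string, not an array.
raw_cell = roundtrip["text_vector"][0]
print(type(raw_cell))

# Parse the strings back into arrays.
roundtrip["text_vector"] = roundtrip["text_vector"].apply(parse_vector)
```

The same parsing could be applied at load time through the `converters` argument of `pd.read_csv`.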

In [19]:
df2 = df1.copy()
df2.rename(
    columns={
        "prediction_ts":"timestamp",
        "label":"actual_label"
    },
    inplace=True
)
df2.head()

Unnamed: 0,timestamp,reviewer_age,reviewer_gender,product_category,language,text,text_vector,actual_label,pred_label
0,1650092000.0,21,female,apparel,english,Poor quality of fabric and ridiculously tight ...,[-7.05169961e-02 6.64003372e-01 3.35792184e-...,negative,negative
1,1650092000.0,29,male,kitchen,english,"Love these glasses, thought they'd be everyday...",[-2.44109239e-03 -5.40627480e-01 3.17134917e-...,positive,positive
2,1650093000.0,26,female,sports,english,"These are disgusting, it tastes like you are ""...",[ 4.04878825e-01 8.23539615e-01 3.83339435e-...,negative,negative
3,1650093000.0,26,male,other,english,My husband has a pair of TaoTronics so I decid...,[ 0.01881652 0.53441304 0.4907303 -0.024163...,neutral,neutral
4,1650093000.0,37,male,home_improvement,english,"Threads too deep. Engages on tank, but gasket ...",[-0.25348073 0.31603432 0.35810202 -0.246728...,negative,negative


In [20]:
# Define a Schema() object so that Phoenix reads the data from the correct columns
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="timestamp",
    prediction_label_column_name="pred_label",
    actual_label_column_name="actual_label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)
dataset5 = Dataset(df2, schema)  # use the renamed dataframe that matches this schema

This is another issue. In this case the dataframes and the schemas are different, yet the `Dataset` objects compare equal:

In [21]:
dataset1 == dataset5

True
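One way to get the expected behavior is for equality to compare both the schema and the dataframe contents. The sketch below illustrates that idea with stand-in classes; `SketchSchema` and `SketchDataset` are hypothetical names invented for this example and do not reflect how Phoenix implements `Dataset` or `Schema`.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class SketchSchema:
    """Hypothetical stand-in for a schema; dataclasses compare field by field."""

    timestamp_column_name: str
    actual_label_column_name: str


class SketchDataset:
    """Hypothetical stand-in for a dataset made of a dataframe and a schema."""

    def __init__(self, dataframe: pd.DataFrame, schema: SketchSchema):
        self.dataframe = dataframe
        self.schema = schema

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SketchDataset):
            return NotImplemented
        # Equal only if the schemas match and the frames have identical
        # column names, dtypes, and values.
        return self.schema == other.schema and self.dataframe.equals(
            other.dataframe
        )


# Made-up data mirroring the rename performed in the cells above.
frame_a = pd.DataFrame({"prediction_ts": [1.0, 2.0], "label": ["neg", "pos"]})
frame_b = frame_a.rename(
    columns={"prediction_ts": "timestamp", "label": "actual_label"}
)

schema_a = SketchSchema("prediction_ts", "label")
schema_b = SketchSchema("timestamp", "actual_label")

same = SketchDataset(frame_a, schema_a) == SketchDataset(frame_a.copy(), schema_a)
different = SketchDataset(frame_a, schema_a) == SketchDataset(frame_b, schema_b)
print(same, different)
```

Note that `DataFrame.equals` is applied here to columns of scalar values; columns that hold NumPy arrays, such as embeddings, may need an element-wise comparison instead.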