# Phoenix Dataset Object

This short tutorial demonstrates how to use the 🔥🐦 Phoenix `Dataset` object.

A `Dataset` is currently composed of a dataframe and a schema. Data can be loaded from:
* a pandas DataFrame, directly
* local files: CSV and HDF5

In [None]:
import pandas as pd
from phoenix.datasets import Dataset, Schema, EmbeddingColumnNames

In [None]:
test_filename = "NLP_sentiment_classification_language_drift"

df1 = pd.read_csv(f"./fixtures/{test_filename}.csv")
df1.head()

Define the schema the same way you would in our SDK.

In [None]:
features = [
    "reviewer_age",
    "reviewer_gender",
    "product_category",
    "language",
]

embedding_features = {
    "embedding_feature": EmbeddingColumnNames(
        vector_column_name="text_vector",  # Will be name of embedding feature in the app
        data_column_name="text",
    ),
}

# Define a Schema() object that maps each role to the matching dataframe column
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features,
)
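Because the schema refers to columns by name, a misspelled or renamed column is easy to miss. The following is a minimal sketch of a pre-flight check, written in plain Python. The helper `missing_columns` is hypothetical and not part of Phoenix, and the column lists are copied by hand from the schema above rather than read from a `Schema` object.

```python
def missing_columns(available, required):
    """Return the required column names absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names referenced by the schema above (copied by hand for this sketch).
schema_columns = [
    "prediction_id", "prediction_ts", "pred_label", "label",
    "reviewer_age", "reviewer_gender", "product_category", "language",
    "text_vector", "text",
]

# Stand-in for `list(df1.columns)`; "language" is deliberately left out
# to show what the check reports.
dataframe_columns = [
    "prediction_id", "prediction_ts", "pred_label", "label",
    "reviewer_age", "reviewer_gender", "product_category",
    "text_vector", "text",
]

print(missing_columns(dataframe_columns, schema_columns))  # ['language']
```

With a real dataframe, passing `df1.columns` as `available` should work the same way, since the function only needs an iterable of column names.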

You are now ready to define a `Dataset`.

In [None]:
# Defined directly from dataframe
dataset1 = Dataset(df1, schema)
dataset2 = Dataset.from_dataframe(df1, schema)
# Defined from csv
dataset3 = Dataset.from_csv(f"./fixtures/{test_filename}.csv", schema=schema)
# Defined from hdf5
dataset4 = Dataset.from_hdf(f"./fixtures/{test_filename}.hdf5", schema=schema, key="training")

The following is an issue we need to investigate: all four datasets compare as equal. At first glance that seems fine. However, when loading a CSV file, the embeddings are read as strings (an issue has been filed to fix this), so the following condition should not be `True`.

In [None]:
dataset1 == dataset2 == dataset3 == dataset4
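To see why a CSV round trip changes the embeddings, here is a minimal sketch that uses only the Python standard library instead of pandas or Phoenix. The vector values and the `prediction_id` are made up for illustration. CSV has no list type, so the vector is written as its text representation and comes back as a string.

```python
import ast
import csv
import io

# A made-up embedding vector stored as a Python list.
original_vector = [0.1, 0.2, 0.3]

# Write one row to an in-memory CSV "file".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["prediction_id", "text_vector"])
writer.writerow(["abc", original_vector])

# Read it back: every CSV field is returned as a string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
loaded_vector = rows[0]["text_vector"]

print(type(loaded_vector).__name__)      # str
print(loaded_vector == original_vector)  # False

# Parsing the text back into a list restores equality.
parsed_vector = ast.literal_eval(loaded_vector)
print(parsed_vector == original_vector)  # True
```

This parsing step assumes the column was written from Python lists. NumPy arrays are printed without commas, so their text form would need a different parser.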

In [None]:
df2 = df1.copy()
df2.rename(columns={"prediction_ts": "timestamp", "label": "actual_label"}, inplace=True)
df2.head()

In [None]:
# Define a second Schema() object that matches the renamed columns
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="timestamp",
    prediction_label_column_name="pred_label",
    actual_label_column_name="actual_label",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features,
)
dataset5 = Dataset(df2, schema)  # df2 holds the renamed columns this schema refers to

This is another issue. Here the dataframes differ and so do the schemas, yet the `Dataset` objects still compare as equal.

In [None]:
dataset1 == dataset5
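For reference, the behavior one might expect from the comparison is that two datasets are equal only when both their data and their schemas match. The sketch below illustrates that expectation with hypothetical stand-in classes (`SketchSchema`, `SketchDataset`) built on `dataclasses`. It is not how Phoenix implements `Dataset`, and it uses lists of dicts in place of dataframes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSchema:
    """Hypothetical stand-in for a schema with only two column names."""
    timestamp_column_name: str
    actual_label_column_name: str


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a dataset: rows as dicts, plus a schema.

    The generated __eq__ compares `rows` and `schema` field by field.
    """
    rows: list
    schema: SketchSchema


rows_a = [{"prediction_ts": 1, "label": "positive"}]
rows_b = [{"timestamp": 1, "actual_label": "positive"}]
schema_a = SketchSchema("prediction_ts", "label")
schema_b = SketchSchema("timestamp", "actual_label")

# Same data and same schema: equal.
same = SketchDataset(rows_a, schema_a) == SketchDataset(list(rows_a), schema_a)
# Different column names and different schema: not equal.
different = SketchDataset(rows_a, schema_a) == SketchDataset(rows_b, schema_b)

print(same, different)  # True False
```

With real pandas dataframes, `==` compares element by element and returns a dataframe, so an equality method would more likely rely on `DataFrame.equals` for the data part.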