# Visualize Dataset Statistics using Facets

In this example, we build a simple pipeline whose importer step produces two
`pd.DataFrame` objects, one holding the training data and one holding the test
data. We then use the predefined `facets_visualization_step` to compare the
summary statistics of the two datasets.

Let's start by defining our pipeline:

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from zenml import pipeline, step
from zenml.integrations.facets.steps.facets_visualization_steps import (
    facets_visualization_step,
)
from zenml.steps import Output

@pipeline()
def facets_pipeline():
    """Simple pipeline comparing two datasets using Facets."""
    X_train, X_test, y_train, y_test = importer()
    facets_visualization_step(X_train, X_test)

Next, let's define a step to load the Iris dataset as pandas DataFrames:

In [None]:
@step
def importer() -> Output(
    X_train=pd.DataFrame,
    X_test=pd.DataFrame,
    y_train=pd.Series,
    y_test=pd.Series,
):
    """Load the Iris dataset as a tuple of pandas DataFrames and Series."""
    iris = load_iris(as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
    )
    return X_train, X_test, y_train, y_test
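
As a quick sanity check outside of ZenML, the same split can be run as plain Python to confirm what the step hands to the rest of the pipeline. This is only a sketch that repeats the `train_test_split` call from the step above; it assumes scikit-learn and pandas are installed and does not use any ZenML API.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same loading and splitting logic as the `importer` step, without the decorator.
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
)

# Features come back as DataFrames, targets as Series.
print(type(X_train).__name__, type(y_train).__name__)
print(X_train.shape, X_test.shape)
print(list(X_train.columns))
```

With the 150-row Iris dataset and `test_size=0.2`, this yields 120 training rows and 30 test rows, each with the same four feature columns.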

Using the predefined `facets_visualization_step`, we can now compare the
statistics of the training and test splits:

In [None]:
facets_pipeline()

In [None]:
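# Fetch the latest run of the pipeline and render the Facets output
# produced by the visualization step.
last_run = facets_pipeline.get_runs()[0]
last_run.get_step("facets_visualization_step").visualize()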
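
If the interactive Facets widget is not available (for example, when the notebook is rendered as static text), a rough tabular approximation of the comparison can be produced with pandas alone. The sketch below is not what `facets_visualization_step` does internally; the helper name `compare_splits` is hypothetical, and only a handful of statistics are shown.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def compare_splits(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Place a few summary statistics of two DataFrames side by side.

    Hypothetical helper for illustration; not part of ZenML or Facets.
    """
    stats = ["mean", "std", "min", "max"]
    # describe() puts statistics in rows; transpose so each row is a feature.
    train_stats = train.describe().T[stats]
    test_stats = test.describe().T[stats]
    # The dict keys become the outer level of the resulting column MultiIndex.
    return pd.concat({"train": train_stats, "test": test_stats}, axis=1)


# Recreate the same split as the `importer` step.
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=True, random_state=42
)

comparison = compare_splits(X_train, X_test)
print(comparison.round(2))
```

Each row of the resulting table is one feature, with the training and test statistics in adjacent column groups, so large gaps between the two splits are easy to spot.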