# 📈 Whylogs Profile Visualization
### gives you various ways to simplify the process of detecting dataset drift. Below you can find instructions for creating a dummy dataset and a list of the currently available visual reports.

### 🗂️Install dependencies and make imports

In [None]:
!pip install faker
!pip install pybars3

In [3]:
import numpy as np
import pandas as pd
import datetime
from collections import OrderedDict
from faker import Faker

In [1]:
from whylogs import get_or_create_session
from whylogs.viz import NotebookProfileViewer
from whylogs.core.statistics.constraints import (
    columnValuesInSetConstraint,
    containsEmailConstraint,
    minBetweenConstraint,
    maxLessThanEqualConstraint,
    parametrizedKSTestPValueGreaterThanConstraint,
    columnsMatchSetConstraint,
    columnPairValuesInSetConstraint,
    sumOfRowValuesOfMultipleColumnsEqualsConstraint,
    columnValuesUniqueWithinRow,
    DatasetConstraints
)



### ♻️Create dummy data

In [4]:
locales = OrderedDict([
    ('en-US', 1),
    ('fr-FR', 2),
    ('ja_JP', 2),
])
fake = Faker(locales)
distribution = np.concatenate((np.random.normal(0.1, 0.1, 500), np.random.normal(0.6, 0.2, 500)))
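
To make the shape of `distribution` concrete: it concatenates 500 draws centered at 0.1 with 500 draws centered at 0.6, so picking values from it at random yields a two-peaked mixture. The sketch below reproduces the idea with the standard library only. Using `random.gauss` instead of NumPy, the fixed seed, and the names `low`, `high` and `mixture` are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable (assumption)

# Two Gaussian components with the same parameters as the cell above.
low = [random.gauss(0.1, 0.1) for _ in range(500)]
high = [random.gauss(0.6, 0.2) for _ in range(500)]

# Concatenating them gives a pool to sample from, like `distribution`.
mixture = low + high

# The pooled mean sits between the two component means.
mixture_mean = statistics.fmean(mixture)
print(round(statistics.fmean(low), 2), round(statistics.fmean(high), 2), round(mixture_mean, 2))
```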

### 📝 Log it


In [5]:
session = get_or_create_session()
def profile_generator():
    with session.logger("mytestytest", dataset_timestamp=datetime.datetime(2021, 6, 2)) as logger:
        for _ in range(500):
            logger.log({"uniform_integers": np.random.randint(0,50)})
            logger.log({"strings": fake.name()})
            logger.log({"mixture_distribution": np.random.choice(distribution, 1)[0]}) 
            logger.log({"1mixture_distribution": np.random.choice(distribution, 1)[0]})
            logger.log({"2mixture_distribution": np.random.choice(distribution, 1)[0]})
            logger.log({"3mixture_distribution": np.random.choice(distribution, 1)[0]})
            logger.log({"4mixture_distribution": np.random.choice(distribution, 1)[0]})
            logger.log({"nulls": None})
        logger.log({"moah_data": 1})
        logger.log({"moah_data": 1})
        logger.log({"moah_data": 5})

        return logger.profile
    
target_profile = profile_generator()

reference_profile = profile_generator()

WARN: Missing config


## ✨ Visualize profiles with Whylogs

### Initialization
Initialize the profile viewer by passing the profiles you want to visualize.

In [6]:
visualization = NotebookProfileViewer()
visualization.set_profiles(target_profile=target_profile, reference_profile=reference_profile)

###### `*target_profile`: Profiled dataset which will be referred to as `target`
###### `*reference_profile`: Profiled dataset which will be referred to as `reference`

### Summary Drift Report

You can get a summary drift report comparing the features of the `target` and `reference` profiles.

In [7]:
visualization.summary_drift_report(preferred_cell_height="1000px")

###### `preferred_cell_height`: height in `px` for generated visualization cell 
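
This notebook does not show how the report scores drift internally. As a rough illustration of what "drift" between a target and a reference sample can mean for a numerical feature, the sketch below computes a two-sample Kolmogorov-Smirnov statistic on raw values. The function name `ks_statistic` is hypothetical, and working on raw lists rather than on profiles is a simplifying assumption.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])         # identical samples
shifted = ks_statistic([1, 2, 3, 4], [11, 12, 13, 14])  # no overlap at all
print(same, shifted)  # 0.0 1.0
```

A value near 0 means the two samples look alike, and a value near 1 means they barely overlap.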

### Double histogram

You can get a double histogram for numerical features.

In [8]:
visualization.double_histogram(feature_names="uniform_integers")

###### `*feature_names`: string or list of strings containing the names of the features for which you want to see a double histogram
###### `preferred_cell_height`: height in `px` for generated visualization cell 
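
A double histogram overlays two histograms on the same axis, which only works if both samples are counted against the same bin edges. The sketch below shows that idea on raw values. The helper `shared_bin_counts` and the equal-width binning are assumptions for illustration, not a description of how whylogs derives its bins.

```python
def shared_bin_counts(target, reference, n_bins=5):
    """Count both samples against one common set of equal-width bins."""
    lo = min(min(target), min(reference))
    hi = max(max(target), max(reference))
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / n_bins or 1.0

    def count(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        return counts

    return count(target), count(reference)


target_counts, reference_counts = shared_bin_counts(
    [0, 1, 2, 3, 4], [5, 6, 7, 8, 10], n_bins=5
)
print(target_counts)     # [2, 2, 1, 0, 0]
print(reference_counts)  # [0, 0, 1, 2, 2]
```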

### Distribution chart

You can get a distribution chart for categorical features.

In [9]:
visualization.distribution_chart(feature_names="strings")

###### `*feature_names`: string or list of strings containing the names of the features for which you want to see a distribution chart
###### `preferred_cell_height`: height in `px` for generated visualization cell 

### Differenced distribution chart

You can get a differenced distribution chart for categorical features.

In [10]:
visualization.difference_distribution_chart(feature_names="strings")

###### `*feature_names`: string or list of strings containing the names of the features for which you want to see a differenced distribution chart
###### `preferred_cell_height`: height in `px` for generated visualization cell 
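
One way to read a differenced distribution chart is as the per-category gap between the two profiles. The sketch below computes such gaps from raw categorical values with `collections.Counter`. The helper name `category_count_differences` and the sign convention (target minus reference) are assumptions, since the notebook does not state which convention the chart uses.

```python
from collections import Counter


def category_count_differences(target, reference):
    """Per-category count in target minus count in reference."""
    target_counts = Counter(target)
    reference_counts = Counter(reference)
    # Cover categories that appear in only one of the two samples.
    categories = sorted(set(target_counts) | set(reference_counts))
    return {c: target_counts[c] - reference_counts[c] for c in categories}


diff = category_count_differences(["a", "a", "b", "c"], ["a", "b", "b", "d"])
print(diff)  # {'a': 1, 'b': -1, 'c': 1, 'd': -1}
```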

### Feature Statistics

You can get a set of useful statistics for a feature by passing the feature name and the profile to use.

In [11]:
visualization.feature_statistics(feature_name="mixture_distribution", profile="reference")

###### `*feature_name`: Any feature name from your profiled dataset
###### `profile`: `"target"` or `"reference"`
###### `preferred_cell_height`: height in `px` for generated visualization cell

### Generate constraints

In [12]:
def get_sample_dataset_constraints():
    cvisc = columnValuesInSetConstraint(value_set={2, 5, 8})
    email_constraint = containsEmailConstraint()

    min_gt_constraint = minBetweenConstraint(lower_value=1, upper_value=5)
    max_le_constraint = maxLessThanEqualConstraint(value=100)

    distribution = np.random.normal(0, 1, 50)

    ks_test_p_value_constraint = parametrizedKSTestPValueGreaterThanConstraint(
        distribution,
        p_value=0.5,
        name="has a standard normal distribution"
    )

    set1 = set(["col1", "col2"])
    columns_match_constraint = columnsMatchSetConstraint(set1)

    val_set = {(1, 2), (3, 5)}
    col_set = ["A", "B"]
    mcv_constraints = [
        columnPairValuesInSetConstraint(column_A="A", column_B="B", value_set=val_set),
        sumOfRowValuesOfMultipleColumnsEqualsConstraint(columns=col_set, value=100),
        columnValuesUniqueWithinRow(column_A="A", verbose=True),
    ]

    return DatasetConstraints(
        None,
        value_constraints={"A": [cvisc], "users": [email_constraint]},
        summary_constraints={"B": [max_le_constraint, min_gt_constraint], "value": [ks_test_p_value_constraint]},
        table_shape_constraints=[columns_match_constraint],
        multi_column_value_constraints=mcv_constraints,
    )

data = pd.DataFrame({
    "A": [1, 2, 2, 5, 7, 6],
    "B": [5, 4, 5, 1, 6, 0],
    "users": ["john", "jane@example.com", "alex", "bob", "anna@example.com", "dave"],
    "value": [23.4, 123.2, 423.3, 32.1, 42.2, 344.2],
})

dc = get_sample_dataset_constraints()
constraints_profile = session.log_dataframe(data, "test.data", constraints=dc)
constraints_profile.apply_summary_constraints()
constraints_profile.apply_table_shape_constraints()
session.close()
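
To see what some of these constraints assert, the sketch below applies a plain-Python reading of four of them to the same sample data. This is an interpretation based on the constraint names, not whylogs' implementation. The simplified email pattern and the inclusive bounds in the min check are assumptions.

```python
import re

a_values = [1, 2, 2, 5, 7, 6]
b_values = [5, 4, 5, 1, 6, 0]
users = ["john", "jane@example.com", "alex", "bob", "anna@example.com", "dave"]

# Values of "A" that fall outside the allowed set {2, 5, 8}.
in_set_failures = sum(1 for v in a_values if v not in {2, 5, 8})

# Entries of "users" that do not look like an email address.
# Simplified pattern for illustration only (assumption).
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
email_failures = sum(1 for u in users if not EMAIL.fullmatch(u))

# Summary checks on "B": max <= 100, and min between 1 and 5 (inclusive).
max_ok = max(b_values) <= 100
min_ok = 1 <= min(b_values) <= 5

print(in_set_failures, email_failures, max_ok, min_ok)  # 3 4 True False
```

Under this reading the min check on `B` fails because the smallest value is 0.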

### Constraints report

In [13]:
visualization.constraints_report(dc)

### Download preferred cell output

You can also download any of these visualizations in `HTML` format for further analysis by passing the visualization to `download()`.

In [None]:
visualization.download(html=visualization.summary_drift_report(), html_file_name='example')

Call the `download()` method of the profile viewer and pass it the visualization call, the path to save to (optional), and the file name you prefer (optional). Example: `visualization.download(visualization.summary_drift_report(), preferred_path="example/path", html_file_name="example_html_file_name")`. The report is saved as an HTML file.

If no path is passed, the file is saved to `html_reports`, located in the whylogs directory, by default.

If no file name is passed, the default is the name of the dataset followed by the timestamp of the profile.

###### `*html`: output of the visualization you want to download, for example `visualization.summary_drift_report()`
###### `preferred_path`: save path `default:` `/html_reports` located in whylogs directory
###### `html_file_name`: name of the file `default:` name of the dataset followed by timestamp of the profile