Summary

 Facets Overview gives users a quick understanding of the distribution of values across the features of their dataset(s). It helps uncover both common and uncommon issues, such as unexpected feature values, feature values missing for a large number of observations, training/serving skew, and train/test/validation set skew.
 It takes input feature data from any number of datasets, analyzes them feature by feature, and visualizes the analysis.
Since facets-overview is not installed in this notebook environment, you must install it with pip.
For more information see https://github.com/PAIR-code/facets/tree/master/facets_overview and https://pair-code.github.io/facets/.
 



In [None]:

%pip install facets-overview

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd

In [None]:
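# Download the data from https://covid.ourworldindata.org/data/ecdc/locations.csv
# and split it into train and test sets, holding out 20% of the rows for testing.
from sklearn.model_selection import train_test_split

features = ["countriesAndTerritories", "location", "continent", "population_year", "population"]
# header=0 makes pandas treat the file's first line as a header and replace it
# with `features`; without it, that line would be read as a data row
# (this assumes the CSV starts with a header line).
train_data = pd.read_csv(
    "https://covid.ourworldindata.org/data/ecdc/locations.csv", names=features, header=0)
train, test = train_test_split(train_data, test_size=0.2)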
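
Because the split is random, the two subsets should show similar value distributions. A large gap between them is the kind of train/test skew Facets Overview is meant to expose. As a rough illustration only, the sketch below compares the share of each value of one categorical feature across two small, made-up lists (not the downloaded data). The helper name `value_shares` and the sample values are hypothetical.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the list (shares sum to 1)."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Made-up stand-ins for one categorical column in each subset.
train_continents = ["Europe", "Europe", "Asia", "Africa", "Asia", "Europe", "Africa", "Asia"]
test_continents = ["Europe", "Asia"]

train_shares = value_shares(train_continents)
test_shares = value_shares(test_continents)

# Largest absolute difference in share over all values seen in either subset.
all_values = set(train_shares) | set(test_shares)
max_gap = max(abs(train_shares.get(v, 0.0) - test_shares.get(v, 0.0)) for v in all_values)

print(train_shares)
print(test_shares)
print(f"largest share gap: {max_gap:.3f}")
```

With pandas, `Series.value_counts(normalize=True)` gives the same per-value shares for a DataFrame column.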


In [None]:
# Create the feature statistics proto from the pandas DataFrames
# using the ProtoFromDataFrames method of the GenericFeatureStatisticsGenerator class.

from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import base64

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': train},
                                  {'name': 'test', 'table': test}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
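
The base64 step is needed because a serialized proto is raw bytes, which cannot be embedded safely in an HTML or JavaScript string. The sketch below shows the same encode/decode round trip on arbitrary stand-in bytes (not a real Facets proto), using only the standard library.

```python
import base64

# Stand-in for proto.SerializeToString(): arbitrary bytes, some non-printable.
raw = bytes([0, 255, 10, 34, 60, 62])

# Encode to ASCII-safe text, as is done for protostr.
encoded = base64.b64encode(raw).decode("utf-8")

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == raw)
```

The encoded text uses only letters, digits, `+`, `/`, and `=`, so it contains no quotes or angle brackets that could break the surrounding markup.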

Understanding the Visualization

 The visualization contains two tables: one for numeric features and one for categorical (string) features. Each table contains a row for each feature of that type. Each row contains calculated statistics and charts showing the distribution of values for that feature across the dataset(s).
 Potentially problematic statistics, such as a feature that is missing (has no value) for a large number of the examples in a dataset, are shown in bold red.
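
 To make the missing-value check concrete, here is a minimal sketch that computes the fraction of missing values per feature on a few made-up records. The 50% threshold and the helper name `missing_fraction` are assumptions for illustration; they are not Facets Overview's actual rule for highlighting a statistic.

```python
def missing_fraction(records, feature):
    """Fraction of records in which `feature` is absent or None."""
    missing = sum(1 for record in records if record.get(feature) is None)
    return missing / len(records)

# Made-up records; None marks a missing value.
records = [
    {"location": "France", "population": 65000000},
    {"location": "Japan", "population": None},
    {"location": "Kenya", "population": None},
    {"location": None, "population": None},
]

THRESHOLD = 0.5  # hypothetical cut-off, not taken from Facets

fractions = {f: missing_fraction(records, f) for f in ("location", "population")}
flagged = [f for f, frac in fractions.items() if frac > THRESHOLD]

print(fractions)
print(flagged)
```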

In [None]:
# Display the Facets Overview visualization for this data.
# A proto can easily be visualized in a Jupyter notebook using the installed nbextension.
# The proto is stringified and then provided as input to a facets-overview Polymer web component via the protoInput property on the element.
# The web component is then displayed in the output cell of the notebook.
from IPython.display import display, HTML

HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))