In [2]:
from structured_data_profiling.profiler import DatasetProfiler

In [3]:
# Step 1: instantiate the profiler
# Required input: the path to the CSV dataset; optionally, the name of a target column

data_profiler = DatasetProfiler('../test/resources/datasets/adult/uci_adult.csv', target='income')

In [4]:
# Run the profiling
data_profiler.profile()

Found 4 numerical columns and 9 categorical columns.


Profiling finished.


In [5]:
# Show the warnings identified during the profiling process

data_profiler.warnings()

17.83 % of the rows has at least one duplicate.


The following columns might be categorical/ordinal:


education {'Some-college': 0, 'Masters': 1, 'HS-grad': 2, 'Bachelors': 3, 'Doctorate': 4, 'Assoc-acdm': 5, 'Assoc-voc': 6, 'Prof-school': 7, 'Preschool': 8, '1st-4th': 9, '5th-6th': 10, '7th-8th': 11, '9th': 12, '10th': 13, '11th': 14, '12th': 15}


The following categorical labels are too rare (frequency<0.005%):


('work_class', Index(['Without-pay', 'Never-worked'], dtype='object'))
('education', Index(['Preschool'], dtype='object'))
('marital_status', Index(['Married-AF-spouse'], dtype='object'))
('occupation', Index(['Armed-Forces'], dtype='object'))
('native_country', Index(['Puerto-Rico', 'Canada', 'Dominican-Republic', 'Italy', 'Cuba',
       'El-Salvador', 'India', 'England', 'South', 'Columbia', 'Guatemala',
       'China', 'Poland', 'Iran', 'Jamaica', 'Taiwan', 'Portugal', 'Vietnam',
       'Japan', 'Ireland', 'Ecuador', 'France', 'Nicaragua', 'Thailand',
       'Haiti', '
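
The two kinds of warning above (duplicated rows and rare categorical labels) can be illustrated with a small standalone sketch. This is not the profiler's implementation: the helper names `duplicate_row_share` and `rare_labels`, the toy data, and the `min_frequency` threshold are all assumptions made for illustration, and the profiler may count duplicates or define "rare" differently. With pandas, `DataFrame.duplicated(keep=False)` gives a similar duplicate mask.

```python
from collections import Counter


def duplicate_row_share(rows):
    """Percentage of rows that occur more than once (hypothetical helper)."""
    counts = Counter(tuple(row) for row in rows)
    duplicated = sum(n for n in counts.values() if n > 1)
    return 100.0 * duplicated / len(rows)


def rare_labels(values, min_frequency):
    """Labels whose relative frequency is below min_frequency, given as a fraction."""
    counts = Counter(values)
    total = len(values)
    return sorted(label for label, n in counts.items() if n / total < min_frequency)


# Toy rows, not taken from uci_adult.csv: the first two are identical.
rows = [
    (39, "Private", "Bachelors"),
    (39, "Private", "Bachelors"),
    (50, "Self-emp", "HS-grad"),
    (28, "Private", "Masters"),
]
share = duplicate_row_share(rows)

# Toy column: two labels appear once each in 100 values.
work_class = ["Private"] * 98 + ["Self-emp", "Never-worked"]
rare = rare_labels(work_class, min_frequency=0.02)

print(f"{share:.2f} % of the rows have at least one duplicate")
print("Rare labels:", rare)
```

Every copy of a repeated row is counted here, so two identical rows out of four give a share of 50 %.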

In [6]:
# Convert the findings collected during profiling into data expectations.
# The docs argument tells great_expectations to build HTML data docs.

# To generate expectations, great_expectations must be initialised in your working directory.

# This can be done by running: $ great_expectations init

data_profiler.generate_expectations(docs=True)

{
  "ge_cloud_id": null,
  "expectation_suite_name": "local_suite",
  "meta": {
    "great_expectations_version": "0.13.41"
  },
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_values_to_be_between",
      "ge_cloud_id": null,
      "kwargs": {
        "column": "age",
        "min_value": 17,
        "max_value": 90
      }
    },
    {
      "meta": {},
      "expectation_type": "expect_column_mean_to_be_between",
      "ge_cloud_id": null,
      "kwargs": {
        "column": "age",
        "min_value": 30.90376,
        "max_value": 46.35564
      }
    },
    {
      "meta": {},
      "expectation_type": "expect_column_stdev_to_be_between",
      "ge_cloud_id": null,
      "kwargs": {
        "column": "age",
        "min_value": 9.646867853445436,
        "max_value": 17.91561172782724
      }
    },
    {
      "meta": {},
      "expectation_type": "expect_column_most_common_value_to_be_in_set",
      "ge_cloud_id": null,
      "kwargs": {
   


If you generated expectations with `docs=True`, you can find the data docs in the `great_expectations/uncommitted/data_docs` folder.
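
To make the expectations printed above more concrete, here is a minimal sketch of what the first two checks on `age` amount to. It does not call great_expectations; the functions `values_between` and `mean_between` and the toy `ages` list are hypothetical, and only the bounds are copied from the generated suite. Both bounds are treated as inclusive.

```python
from statistics import mean


def values_between(values, min_value, max_value):
    """True if every value lies in [min_value, max_value]."""
    return all(min_value <= v <= max_value for v in values)


def mean_between(values, min_value, max_value):
    """True if the mean of the values lies in [min_value, max_value]."""
    return min_value <= mean(values) <= max_value


# Toy ages, not taken from the dataset.
ages = [17, 25, 38, 46, 52, 90]

# Bounds copied from the generated expectation suite.
range_ok = values_between(ages, 17, 90)
mean_ok = mean_between(ages, 30.90376, 46.35564)

# A new batch with an out-of-range value fails the range check.
range_ok_new_batch = values_between(ages + [120], 17, 90)

print(range_ok, mean_ok, range_ok_new_batch)
```

A later batch of data that violates one of these ranges would be flagged, which is how the expectations act as a regression test for the dataset.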

Below are two data doc examples, one for a numerical column:
![data docs example 1](./num_columns.png)
and one for a categorical column:
![data docs example 2](./cat_columns.png)

As the examples show, expectations can also be used to characterise the dataset through conditional distributions. In this example, the profiler determined that young individuals are less likely to be married and that widowed individuals are more likely to be female.
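
A conditional distribution of this kind can be computed with a few lines of plain Python. The sketch below is only illustrative: `conditional_distribution`, the age groups, and the records are invented toy data, not values from the dataset or the profiler's own method.

```python
from collections import Counter, defaultdict


def conditional_distribution(pairs):
    """Return P(label | group) for an iterable of (group, label) pairs."""
    by_group = defaultdict(Counter)
    for group, label in pairs:
        by_group[group][label] += 1
    return {
        group: {label: n / sum(counts.values()) for label, n in counts.items()}
        for group, counts in by_group.items()
    }


# Toy (age group, marital status) records, invented for illustration.
records = (
    [("under_25", "Never-married")] * 4
    + [("under_25", "Married")] * 1
    + [("25_plus", "Married")] * 3
    + [("25_plus", "Never-married")] * 2
)

dist = conditional_distribution(records)
for group, probabilities in dist.items():
    print(group, probabilities)
```

In this toy data the share of married individuals is lower in the younger group, which is the kind of relationship a conditional expectation can encode.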