## pandas-profiling

The documentation for pandas-profiling can be found at https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/, and the getting started guide at https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/getting_started.html.

Start by installing the package and loading your pandas DataFrame, for example:

In [1]:
# !pip install pandas-profiling[notebook]

In [None]:
import sys

!{sys.executable} -m pip install -U pandas-profiling
!jupyter nbextension enable --py widgetsnbextension

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the report, run:

In [None]:
profile = ProfileReport(df)

###### Explore deeper
You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, unicode information), files (file size, creation time) and images (dimensions, exif information). If you are interested in the exact settings used, you can compare it with the default configuration file.


Learn more about configuring pandas-profiling on the Advanced usage page: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

###### Jupyter Notebook
We recommend generating reports interactively in a Jupyter notebook. There are two interfaces: widgets and an HTML report.

The widgets interface is shown by displaying the report. In a Jupyter notebook, run:

In [None]:
profile.to_widgets()

The HTML report can be included in a Jupyter notebook:

In [None]:
profile.to_notebook_iframe()

###### Saving the report
If you want to generate an HTML report file, assign the ProfileReport to a variable and call its to_file() method:

In [None]:
profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

In [None]:
# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
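Because `to_json()` returns a string, it can be parsed with the standard `json` module. The sketch below lists the top-level sections of a report. The helper name `report_sections` and the stand-in JSON string are assumptions made for illustration; the section names in a real report depend on the pandas-profiling version.

```python
import json


def report_sections(json_data: str) -> list:
    """Return the sorted top-level keys of a JSON report string."""
    parsed = json.loads(json_data)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(parsed)


# Stand-in for `profile.to_json()`; this payload is made up for the example.
example_json = '{"table": {"n": 100}, "variables": {"a": {}, "b": {}}}'
print(report_sections(example_json))
```

With a real report, pass `profile.to_json()` instead of `example_json`.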

###### Large datasets
Version 2.4 introduces minimal mode.

This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).

Use the following syntax:

In [None]:
# profile = ProfileReport(large_dataset, minimal=True)
# profile.to_file("output.html")
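One way to decide when to switch to minimal mode is to look at the size of the table. The sketch below is only an illustration: the helper `profile_options` and the `max_cells` threshold are hypothetical and not part of pandas-profiling, so pick a threshold that suits your hardware.

```python
def profile_options(n_rows: int, n_cols: int, max_cells: int = 1_000_000) -> dict:
    """Return keyword arguments for ProfileReport based on table size.

    `max_cells` is an arbitrary illustrative threshold, not a library default.
    """
    return {"minimal": n_rows * n_cols > max_cells}


# Intended use (requires pandas-profiling and a DataFrame `df`):
#   profile = ProfileReport(df, **profile_options(*df.shape))
print(profile_options(100, 5))
print(profile_options(5_000_000, 20))
```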

Benchmarks are available here: https://pandas-profiling.github.io/pandas-profiling/dev/bench/

###### Command line usage
For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.

Run the following for information about options and arguments.

In [None]:
#pandas_profiling -h
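The executable can also be driven from Python with `subprocess`. The sketch below assumes the command takes the input CSV and the output report as positional arguments and accepts the `--title` and `--minimal` options; confirm these against the output of `pandas_profiling -h` for your installed version. The file names and the helper `profiling_command` are hypothetical.

```python
import os
import shutil
import subprocess


def profiling_command(csv_path, report_path, title=None, minimal=False):
    """Build the argument list for the pandas_profiling executable."""
    command = ["pandas_profiling"]
    if title is not None:
        command += ["--title", title]
    if minimal:
        command.append("--minimal")
    # Positional arguments: input CSV first, then the output report.
    command += [csv_path, report_path]
    return command


command = profiling_command("data.csv", "report.html", title="Example report")
print(command)

# Run only when the executable is on PATH and the (hypothetical) input exists.
if shutil.which("pandas_profiling") and os.path.exists("data.csv"):
    subprocess.run(command, check=True)
```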

###### Advanced usage
A set of options is available to adapt the generated report:

- title (str): Title for the report ('Pandas Profiling Report' by default).
- pool_size (int): Number of workers in the thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
- progress_bar (bool): If True, pandas-profiling will display a progress bar.
- infer_dtypes (bool): When True (default), the dtypes of variables are inferred by visions using its typeset logic (for instance, a column that has integers stored as strings will be analyzed as if it were numeric).
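The options listed above are passed to `ProfileReport` as keyword arguments. A minimal sketch of collecting them in one place follows; the helper `report_options` is hypothetical, and the `progress_bar` value shown as the starting point is a choice made for this example rather than a documented default.

```python
def report_options(**overrides):
    """Return ProfileReport keyword arguments, starting from explicit values."""
    options = {
        "title": "Pandas Profiling Report",
        "pool_size": 0,  # 0 lets the library use all available CPUs
        "progress_bar": True,  # example value, see the note above
        "infer_dtypes": True,
    }
    unknown = set(overrides) - set(options)
    if unknown:
        raise KeyError(f"unsupported option(s): {sorted(unknown)}")
    options.update(overrides)
    return options


options = report_options(title="Sensor data", pool_size=2, progress_bar=False)
print(options)

# Intended use (requires pandas-profiling and a DataFrame `df`):
#   profile = ProfileReport(df, **options)
```

Rejecting unknown names early catches typos such as `poolsize` before a long profiling run starts.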


More settings can be found in the default configuration file (https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml) and the minimal configuration file (https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml).

The configuration docs are on the advanced usage page: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html

###### Example

profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")

###### Great Expectations

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, pandas-profiling integrates with Great Expectations (https://greatexpectations.io/). 

This is a world-class open-source library that helps you maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). pandas-profiling features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store and use to validate another (or future) dataset.

You can find more details on the Great Expectations integration here: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/great_expectations_integration.html
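To make the idea of validation rules defined in terms of statistics concrete, here is a plain-Python sketch. It does not use the Great Expectations or pandas-profiling API; the functions `learn_range` and `validate` and the sample values are made up for illustration. A rule (the observed minimum and maximum) is derived from a reference dataset and then used to check a new one.

```python
def learn_range(reference):
    """Derive a simple rule (observed min and max) from reference values."""
    return {"min": min(reference), "max": max(reference)}


def validate(values, rule):
    """Return the values that fall outside the learned range."""
    return [v for v in values if not rule["min"] <= v <= rule["max"]]


rule = learn_range([0.2, 0.5, 0.9, 0.4])
print(rule)

# Values outside [0.2, 0.9] are reported as violations.
print(validate([0.3, 1.4, 0.8, -0.1], rule))
```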