# Exploratory Data Analysis with Pandas Profiling

Pandas Profiling is an open-source Python module for quick exploratory data analysis with just a few lines of code.

It automatically generates a profile report that summarizes the dataset and its columns.

**The report consists of the following:**
- An overview of the DataFrame,
- A description of each attribute (column) of the DataFrame,
- Correlations between attributes (Pearson correlation and Spearman correlation), and
- A sample of the DataFrame.
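
To make the two correlation measures in the list above concrete, here is a minimal sketch using plain pandas rather than the profiling library itself. The `toy` DataFrame and its columns are hypothetical, chosen so that `y` grows monotonically but non-linearly with `x`.

```python
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset):
# y increases with x, but along a cubic curve rather than a straight line.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association between the raw values.
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman measures monotonic association by correlating the ranks.
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the relationship is perfectly monotonic, the Spearman coefficient reaches 1, while the Pearson coefficient stays below 1 because the relationship is not linear.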


_Documentation: <https://pandas-profiling.ydata.ai/docs/master/index.html>_

In [None]:
# !pip3 install pandas-profiling  # newer releases are published as ydata-profiling

In [None]:
import pandas as pd

df = pd.read_csv("data/titanic.csv")

# df.head()
# df.describe()

Generate the report using these commands:

In [None]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df)
profile.to_file("outputs/output.html")
# profile.to_file("outputs/output.json")

In [None]:
## Show report in Jupyter notebook

# !pip3 install -U pandas-profiling notebook
# profile.to_widgets()
# profile.to_notebook_iframe()

In [None]:
# Disable samples, correlations, missing diagrams, duplicates and interactions at once
profile = ProfileReport(
    df,
    samples=None,
    correlations=None,
    missing_diagrams=None,
    duplicates=None,
    interactions=None,
)

In [None]:
# Change the configuration as desired
profile = ProfileReport(df)

profile.config.html.style.primary_colors = "red"

profile.to_file("outputs/output_red.html")


## Use cases

- Profiling large datasets
- Sensitive data
- Dataset Comparison (#ML)

## Disadvantage

Report generation time grows substantially as the dataset gets larger. Sampling and minimal mode are two ways to reduce it.


### Sampling

We can randomize the order of the data and select a representative sample.

In [None]:
# Profile a random sample of at most 10,000 rows
# (asking for more rows than the DataFrame holds would raise a ValueError)
prof = ProfileReport(df.sample(n=min(10_000, len(df))))
prof.to_file(output_file="outputs/output_sampled.html")
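
Before profiling a sample, it can be worth checking that the sample looks like the full data. The sketch below uses only pandas; `sample_rows`, `full`, and the `value` column are hypothetical names introduced for illustration, and the synthetic DataFrame stands in for a large dataset.

```python
import pandas as pd


def sample_rows(frame: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Return up to n randomly chosen rows; a fixed seed makes the draw repeatable."""
    return frame.sample(n=min(n, len(frame)), random_state=seed)


# Hypothetical stand-in for a large dataset.
full = pd.DataFrame({"value": range(100_000)})
subset = sample_rows(full, 10_000)

# Put the summary statistics of the full data and the sample side by side.
comparison = pd.DataFrame(
    {
        "full": full["value"].describe(),
        "sample": subset["value"].describe(),
    }
)
print(comparison)
```

If the two columns differ noticeably in mean, spread, or quartiles, a larger sample or a different seed may be needed before the profile of the sample can be trusted.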

### Minimal mode

In minimal mode, a simplified report is generated. It contains less information than the full report, but it can be produced relatively quickly for a large dataset.

In [None]:
profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="outputs/output_min.html")