# **2. DATA PROFILING**

**YDATA-PROFILING LIBRARY**

*ydata_profiling* is a useful library for building a profiling report. This library automatically generates a profile report from a pandas DataFrame for data understanding.

The following information is presented in an interactive report:

*Overview*: mostly global details and statistics information about the whole dataset.

*Alerts*: a list of potential data quality issues (*e.g.,* high correlation, skewness, uniformity, zeros, missing values, constant values).

*Reproduction*: technical details about the analysis (time, version and configuration).

*Variables* (for each column): data types, distict values, missing values, quantile & descriptive statistics (*e.g.,* min, median, max, Q1, Q3, range, etc.), descriptive statistics (*e.g.,* mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, skewness, etc.), histograms, common and extremes values.

*Interactions & Correlations* between variables (heatmap).

*Missing Values*: count, matrix, heatmap.

*Sample of the data* and *Duplicated rows*.

In [1]:
import sys
!{sys.executable} -m pip install -U ydata-profiling[notebook]
!pip install jupyter-contrib-nbextensions

"c:\Users\User\Desktop\OneDrive" non � riconosciuto come comando interno o esterno,
 un programma eseguibile o un file batch.


Collecting jupyter-contrib-nbextensions
  Downloading jupyter_contrib_nbextensions-0.7.0.tar.gz (23.5 MB)
     ---------------------------------------- 0.0/23.5 MB ? eta -:--:--
     -- ------------------------------------- 1.6/23.5 MB 8.4 MB/s eta 0:00:03
     ------ --------------------------------- 3.9/23.5 MB 9.8 MB/s eta 0:00:02
     ---------- ----------------------------- 6.0/23.5 MB 10.0 MB/s eta 0:00:02
     ------------- -------------------------- 8.1/23.5 MB 9.9 MB/s eta 0:00:02
     ----------------- --------------------- 10.5/23.5 MB 10.1 MB/s eta 0:00:02
     --------------------- ----------------- 12.8/23.5 MB 10.5 MB/s eta 0:00:02
     ------------------------- ------------- 15.2/23.5 MB 10.5 MB/s eta 0:00:01
     ----------------------------- --------- 17.8/23.5 MB 10.7 MB/s eta 0:00:01
     --------------------------------- ----- 20.2/23.5 MB 10.7 MB/s eta 0:00:01
     ------------------------------------- - 22.3/23.5 MB 10.8 MB/s eta 0:00:01
     --------------------

In [2]:
!jupyter nbextension enable --py widgetsnbextension

usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
               [--paths] [--json] [--debug]
               [subcommand]

Jupyter: Interactive Computing

positional arguments:
  subcommand     the subcommand to launch

options:
  -h, --help     show this help message and exit
  --version      show the versions of core jupyter packages and exit
  --config-dir   show Jupyter config dir
  --data-dir     show Jupyter data dir
  --runtime-dir  show Jupyter runtime dir
  --paths        show all Jupyter paths. Add --json for machine-readable
                 format.
  --json         output paths as machine-readable json
  --debug        output debug information about paths

Available subcommands: contrib dejavu events execute kernel kernelspec lab
labextension labhub migrate nbconvert nbextensions_configurator notebook run
server troubleshoot trust

Jupyter command `jupyter-nbextension` not found.


In [4]:
from ydata_profiling import ProfileReport
import pandas as pd
import json

In [5]:
BEERS = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv')

In [None]:
#create a profile report
PROFILE = ProfileReport(BEERS, title="Profiling Report")
PROFILE

In [8]:
# in html
PROFILE.to_file("BEERS_REPORT.html")

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [9]:
#create a profile report in json
PROFILE.to_file("BEERS_REPORT.json")

Render JSON:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [10]:
file = open("BEERS_REPORT.json")
JFILE = json.load(file)

In [11]:
JFILE

{'analysis': {'title': 'Profiling Report',
  'date_start': '2024-11-26 15:52:31.312830',
  'date_end': '2024-11-26 15:52:35.722829'},
 'time_index_analysis': 'None',
 'table': {'n': 2419,
  'n_var': 7,
  'memory_size': 135596,
  'record_size': 56.05456800330715,
  'n_cells_missing': 1074,
  'n_vars_with_missing': 3,
  'n_vars_all_missing': 0,
  'p_cells_missing': 0.06342644540246856,
  'types': {'Numeric': 4, 'Text': 3},
  'n_duplicates': 9,
  'p_duplicates': 0.0037205456800330715},
 'variables': {'abv': {'n_distinct': 74,
   'p_distinct': 0.031395842172252865,
   'is_unique': False,
   'n_unique': 9,
   'p_unique': 0.003818413237165889,
   'type': 'Numeric',
   'hashable': True,
   'value_counts_without_nan': {'0.05': 216,
    '55.0': 159,
    '0.06': 126,
    '65.0': 124,
    '0.052': 107,
    '0.07': 93,
    '45.0': 90,
    '48.0': 72,
    '0.0579999999999999': 66,
    '0.0559999999999999': 66,
    '51.0': 62,
    '53.0': 60,
    '62.0': 60,
    '49.0': 59,
    '0.08': 58,
    '47.0

In [12]:
#inspect json profile report
JFILE["table"]["n"]

2419

In [13]:
JFILE["variables"]["ibu"]["n_distinct"]

107