# Installation

To use ydata-profiling, install the package from pip. Inside a notebook, prefix the command with the shell escape (`!`). Here we also request the `[notebook]` extra, which adds support for rendering the report in Jupyter notebook widgets.
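To check whether the package is already present in the environment, and at which version, a small sketch using only the standard library may help. The helper name `installed_version` is hypothetical and not part of ydata-profiling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("ydata-profiling"))
```

If this prints `None`, run the install cell that follows.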



In [1]:
!pip install -U ydata-profiling[notebook]==4.0.0 matplotlib==3.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ydata-profiling[notebook]==4.0.0
  Downloading ydata_profiling-4.0.0-py2.py3-none-any.whl (344 kB)
Collecting matplotlib==3.5.1
  Downloading matplotlib-3.5.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.3 MB)
Collecting statsmodels<0.14,>=0.13.2
  Downloading statsmodels-0.13.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
Collecting visions[type_image_path]==0.7.5
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)

# Getting started
Once installed, you only need to `import` the package. Using ydata-profiling is then a two-step process:
1. Create a `ProfileReport` object by passing a pandas DataFrame to the `ProfileReport` constructor.
2. Call `to_notebook_iframe()` to render the report in the notebook. You can also save the report to an **HTML** file with `to_file()`.

Let's get started by importing ydata-profiling and pandas. We will then load the HCC dataset, which we use throughout this notebook:


In [1]:
import pandas as pd
from ydata_profiling import ProfileReport

# Read the data
Next, load the HCC dataset. Here we read the file directly from our GitHub repository. Alternatively, you can download the file, upload it to your working directory, and read it with `pd.read_csv('hcc.csv')`. For other ways to load data into Google Colab, see https://towardsdatascience.com/7-ways-to-load-external-data-into-google-colab-7ba73e7d5fc7.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Data-Centric-AI-Community/awesome-data-centric-ai/master/medium/data-profiling-tools/data/hcc.csv')
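If you keep a downloaded copy of the file, one option is to prefer the local copy and fall back to the URL otherwise. The sketch below only picks the source string; the helper name `csv_source` is hypothetical, and `hcc.csv` is the local file name suggested in the text above.

```python
from pathlib import Path

# URL taken from the notebook cell above.
HCC_URL = (
    "https://raw.githubusercontent.com/Data-Centric-AI-Community/"
    "awesome-data-centric-ai/master/medium/data-profiling-tools/data/hcc.csv"
)


def csv_source(local_path: str, fallback_url: str) -> str:
    """Return local_path if that file exists, otherwise fallback_url."""
    return local_path if Path(local_path).is_file() else fallback_url


source = csv_source("hcc.csv", HCC_URL)
print(source)
```

In the notebook, the result can be passed straight to pandas, for example `df = pd.read_csv(source)`.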

## Generate and show the Report

In [3]:
profile = ProfileReport(df, title="HCC Profiling Report")

In [4]:
profile.to_notebook_iframe()


## Save the report to .html

In [5]:
profile.to_file("report.html")


# Additional Features
Let's perform a basic imputation (mean imputation) on the `Ferritin` column and check the impact it has on the data.

In [6]:
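# Impute missing values in the Ferritin column with the column mean
from sklearn.impute import SimpleImputer

df_transformed = df.copy()  # keep the original DataFrame untouched for the comparison
mean_imputer = SimpleImputer(strategy="mean")
# fit_transform takes a 2-D input and returns a 2-D array; ravel() flattens it to one column
df_transformed["Ferritin"] = mean_imputer.fit_transform(df_transformed[["Ferritin"]]).ravel()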
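To see what mean imputation does to a column's summary statistics, here is a minimal sketch on a toy series. The values are invented for illustration and are not taken from the HCC dataset. It uses pandas `fillna` rather than `SimpleImputer`, which gives the same result for a single numeric column.

```python
import pandas as pd

# Toy values invented for illustration; they are not taken from the HCC dataset.
toy = pd.Series([10.0, None, 30.0, None, 50.0], name="Ferritin")

# Mean imputation: replace each missing value with the mean of the observed ones.
imputed = toy.fillna(toy.mean())

print(imputed.tolist())            # missing entries become the observed mean
print(toy.mean(), imputed.mean())  # the mean is preserved
print(toy.std(), imputed.std())    # the spread shrinks after imputation
```

The mean stays the same while the standard deviation drops, which is the kind of distribution change the comparison report makes visible.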

In [7]:
transformed_profile = ProfileReport(df_transformed, title="Transformed Data")
comparison_report = profile.compare(transformed_profile)
comparison_report.to_file("original_vs_transformed.html")
