# Installation

To use ydata-profiling, install the package with pip. Inside a notebook, prefix the command with the shell escape (`!`). Here we also request the `[notebook]` extra, which adds support for rendering the report in Jupyter notebook widgets.



In [1]:
!pip install -U ydata-profiling[notebook]==4.0.0 matplotlib

Collecting matplotlib
  Using cached matplotlib-3.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)


# Getting started
Once installed, you just need to `import` the module. Then, using ydata-profiling is a simple two-step process:
1. Create a `ProfileReport` object from a pandas DataFrame, for example `ProfileReport(df, title="...")`.
2. Call `to_notebook_iframe()` to render the report inline. You can also save the report to an **HTML** file with `to_file()`.

Let's get started and import ydata-profiling, pandas, and the HCC dataset, which we will use for this notebook:


In [2]:
import pandas as pd
from ydata_profiling import ProfileReport



# Read the data
Next, load the HCC dataset. Here we read the file directly from our GitHub repository. Alternatively, you can download the file, upload it to your working directory, and read it with `pd.read_csv('hcc.csv')`. See this post (https://towardsdatascience.com/7-ways-to-load-external-data-into-google-colab-7ba73e7d5fc7) for different ways to load data into Google Colab.

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/Data-Centric-AI-Community/awesome-data-centric-ai/master/medium/data-profiling-tools/data/hcc.csv')
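
The local-file alternative mentioned above can be sketched as a small helper that prefers a copy in the working directory and falls back to a remote URL only when the file is missing. The helper name `load_csv_local_first`, the demo file, and the placeholder URL are hypothetical and not part of the original notebook; treat this as a minimal sketch rather than a recommended loader.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_local_first(local_path, url):
    """Read local_path if it exists, otherwise read the CSV from url.

    Hypothetical helper for illustration; not part of ydata-profiling.
    """
    path = Path(local_path)
    source = path if path.exists() else url
    return pd.read_csv(source)


# Demo with a tiny, made-up file so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Gender,Age,Outcome\nMale,67,Alive\nFemale,62,Dead\n")
    # The URL is a placeholder; it is never contacted because the file exists.
    demo_df = load_csv_local_first(demo_path, url="https://example.invalid/demo.csv")

print(demo_df.shape)
```

With the notebook's dataset, the call would presumably look like `load_csv_local_first('hcc.csv', url)`, where `url` is the raw GitHub address used in the cell above.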

In [11]:
df.columns

Index(['Gender', 'Age', 'Alcohol', 'Hallmark', 'PS', 'Encephalopathy',
       'Hemoglobin', 'HBeAg', 'MCV', 'Total_Bil', 'O2', 'Dir_Bil', 'Ferritin',
       'Outcome'],
      dtype='object')

In [14]:
df.head(6)

Unnamed: 0,Gender,Age,Alcohol,Hallmark,PS,Encephalopathy,Hemoglobin,HBeAg,MCV,Total_Bil,O2,Dir_Bil,Ferritin,Outcome
0,Male,67,Yes,AYes,Active,,13.7,No,106.6,2.1,999,0.5,,Alive
1,Female,62,No,BYes,Active,,,No,103.4,,999,,,Alive
2,Male,78,Yes,CYes,Ambulatory,,8.9,No,79.8,0.4,999,0.1,16.0,Alive
3,Male,77,Yes,DYes,Active,,13.4,No,97.1,0.4,999,0.2,,Dead
4,Male,76,Yes,EYes,Active,,14.3,No,95.1,0.7,999,,22.0,Alive
5,Male,75,Yes,FYes,Restricted,,13.4,,91.5,3.5,999,1.4,111.0,Dead


In [17]:
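# Encode the target labels as booleans for the SweetViz analysis below
d = {'Dead': False, 'Alive': True}
# Work on a copy so the original df keeps its string labels
df1 = df.copy()
df1['Outcome'] = df1['Outcome'].map(d)

df1.head(4)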

Unnamed: 0,Gender,Age,Alcohol,Hallmark,PS,Encephalopathy,Hemoglobin,HBeAg,MCV,Total_Bil,O2,Dir_Bil,Ferritin,Outcome
0,Male,67,Yes,AYes,Active,,13.7,No,106.6,2.1,999,0.5,,True
1,Female,62,No,BYes,Active,,,No,103.4,,999,,,True
2,Male,78,Yes,CYes,Ambulatory,,8.9,No,79.8,0.4,999,0.1,16.0,True
3,Male,77,Yes,DYes,Active,,13.4,No,97.1,0.4,999,0.2,,False
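
One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN`, so a typo or an unexpected category silently turns into a missing value. The sketch below uses made-up data rather than the HCC file; it applies the mapping to a copy and then lists the labels that were not mapped.

```python
import pandas as pd

# Made-up sample; "Unknown" is deliberately absent from the mapping.
sample = pd.DataFrame({"Outcome": ["Alive", "Dead", "Unknown", "Alive"]})
mapping = {"Dead": False, "Alive": True}

encoded = sample.copy()  # leave the original frame untouched
encoded["Outcome"] = encoded["Outcome"].map(mapping)

# Rows that are NaN after mapping: look up their original labels.
unmapped = sample.loc[encoded["Outcome"].isna(), "Outcome"].unique().tolist()

print(encoded["Outcome"].tolist())
print("Unmapped labels:", unmapped)
```

Running a check like this before profiling helps keep an encoding step from adding missing values to the target column unnoticed.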


## Generate and show the Report

In [4]:
profile = ProfileReport(df, title="HCC Profiling Report")

In [5]:
profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

## Save the report to .html

In [6]:
profile.to_file("report.html")

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

# Additional Features
Let's perform a basic data imputation (mean imputation) on `Ferritin` and check the impact it has on the data by comparing the profiles before and after the transformation.

In [7]:
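# Impute missing values in Ferritin with the column mean
from sklearn.impute import SimpleImputer

df_transformed = df.copy()
mean_imputer = SimpleImputer(strategy="mean")
# fit_transform expects 2D input and returns a 2D array; ravel() flattens it back to one column
df_transformed['Ferritin'] = mean_imputer.fit_transform(df_transformed['Ferritin'].values.reshape(-1, 1)).ravel()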
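
Before opening the comparison report, the expected effect of mean imputation can be checked by hand. The following sketch uses a made-up column and plain pandas (`fillna` with the column mean, which for a single numeric column gives the same values as `SimpleImputer(strategy="mean")`). Missing values disappear, the mean stays the same, and the standard deviation shrinks because the filled values sit exactly at the mean.

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries (not taken from the HCC dataset).
ferritin = pd.Series([16.0, np.nan, 22.0, 111.0, np.nan, 51.0], name="Ferritin")

# Series.mean() skips NaN, so the fill value is the mean of the observed entries.
imputed = ferritin.fillna(ferritin.mean())

summary = pd.DataFrame(
    {
        "missing": [ferritin.isna().sum(), imputed.isna().sum()],
        "mean": [ferritin.mean(), imputed.mean()],
        "std": [ferritin.std(), imputed.std()],
    },
    index=["original", "imputed"],
)

print(summary)
```

In the comparison report you would expect the same pattern for `Ferritin`: no missing values after the transformation, an unchanged mean, and a narrower spread.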

In [8]:
transformed_profile = ProfileReport(df_transformed, title="Transformed Data")
comparison_report = profile.compare(transformed_profile)
comparison_report.to_file("original_vs_transformed.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

In [9]:
! pip install sweetviz

Collecting sweetviz
  Downloading sweetviz-2.3.1-py3-none-any.whl (15.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.1/15.1 MB 31.8 MB/s eta 0:00:00
Installing collected packages: sweetviz
Successfully installed sweetviz-2.3.1


In [18]:
# Using SweetViz
import sweetviz as sv

my_report = sv.analyze(df1, target_feat="Outcome")
my_report.show_html(filepath='SWEETVIZ_REPORT.html',
                    open_browser=False,
                    layout='widescreen',
                    scale=None)

                                             |          | [  0%]   00:00 -> (? left)

Report SWEETVIZ_REPORT.html was generated.
