## Data description with YData Profiling

### By:
Jose R. Zapata

### Date:
2024-03-01

### Description:

Data description and exploration with YData Profiling


## 📚 Import libraries

In [1]:
# standard library
from pathlib import Path

# base libraries for data science
import pandas as pd
from ydata_profiling import ProfileReport

## 💾 Load data

In [2]:
# the data directory sits two levels above the notebook's working directory
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet",
    engine="pyarrow",
)
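
As a minimal sketch of how the `parents[1]` lookup resolves, the example below uses a hypothetical directory layout (the path is an assumption for illustration, not the real project location). `parents[0]` is the directory containing the working directory and `parents[1]` is one level further up.

```python
from pathlib import PurePosixPath

# Hypothetical working directory of the notebook (illustration only)
notebook_dir = PurePosixPath("/home/user/project/notebooks/exploration")

# parents[0] -> /home/user/project/notebooks
# parents[1] -> /home/user/project
data_dir = notebook_dir.parents[1] / "data"

print(data_dir)  # /home/user/project/data
```

If the notebook is moved to a different depth in the project tree, the index passed to `parents` has to change accordingly.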

## 📊 Data description

In [3]:
# pclass is ordinal, so it has to be converted to an ordered categorical again
# (category order: 3rd class < 2nd class < 1st class)
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"],
    categories=[3, 2, 1],
    ordered=True,
)
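
To illustrate what the ordered categorical provides, here is a small sketch on made-up values (the sample list is an assumption, not taken from the Titanic data). With `categories=[3, 2, 1]` and `ordered=True`, sorting and min/max follow the category order rather than the numeric order.

```python
import pandas as pd

# Made-up passenger classes, for illustration only
sample = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# Sorting follows the declared category order: 3 < 2 < 1
sorted_values = sample.sort_values().tolist()
print(sorted_values)  # [3, 3, 2, 1]

# min/max also use the category order, so the "lowest" class is 3
lowest, highest = sample.min(), sample.max()
print(lowest, highest)  # 3 1
```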

In [4]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool    
 2   name      1309 non-null   object  
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64 
 5   sibsp     1309 non-null   int8    
 6   parch     1309 non-null   int8    
 7   fare      1308 non-null   float64 
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int8(2), object(1)
memory usage: 38.9+ KB
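
The non-null counts above can be turned into missing-value counts and percentages. The following is a rough sketch that only uses the numbers printed by `info()`; the variable names are illustrative.

```python
# Values copied from the info() output above
total_rows = 1309
non_null = {"age": 1046, "fare": 1308, "embarked": 1307}

# Missing values per column = total rows - non-null count
missing = {col: total_rows - count for col, count in non_null.items()}

# Share of missing values per column, in percent
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

print(missing)      # {'age': 263, 'fare': 1, 'embarked': 2}
print(missing_pct)  # {'age': 20.09, 'fare': 0.08, 'embarked': 0.15}
```

On the loaded DataFrame itself, `titanic_df.isna().sum()` should give the same counts directly.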


In [5]:
# column types passed to YData Profiling through type_schema
type_schema = {
    "embarked": "categorical",
    "pclass": "categorical",
    "survived": "boolean",
    # "name": "string",
    "sex": "categorical",
}

In [6]:
profile = ProfileReport(
    titanic_df,
    title="Titanic Pandas Profiling Report",
    explorative=True,
    correlations={"auto": {"calculate": False}},
    type_schema=type_schema
)

In [7]:
profile.to_widgets()

  annotation = ("{:" + self.fmt + "}").format(val)
(using `df.profile_report(missing_diagrams={"Heatmap": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: '--'')



### HTML report embedded in the notebook

In [8]:
profile.to_notebook_iframe()

### 💾 Save report to file

In [9]:
profile.to_file(DATA_DIR / "08_reporting/titanic_pandas_profiling_report.html")

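## 📊 Analysis of Results and Conclusions

This notebook shows how to use YData Profiling to describe and explore a dataset, using the Titanic data as an example. The report is rendered as interactive widgets, embedded in the notebook as HTML, and saved to an HTML file.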


## 💡 Proposals and Ideas

Compare YData Profiling with other tools to determine which is best suited to describing, exploring, and analyzing data.


## 📖 References

- <https://docs.profiling.ydata.ai/latest/>
- <https://github.com/ydataai/ydata-profiling>