# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#4F200D; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #4F200D">Introduction to YData-profiling (formerly Pandas Profiling)</p>



<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

Notebook from : https://www.kaggle.com/code/waalbannyantudre/ydata-profiling-tutorial-quick-efficient-eda
    
In this notebook I will share with you a magic tool I discovered recently for Exploratory Data Analysis.
As you may have guessed, it's none other than **`ydata-profiling`**!  

For this task, I'll use the `Data Science Salaries 2023` and `Data Science Salaries` datasets to explain in detail some of this library's various features and benefits.

![](https://warehouse-camo.ingress.us-east-2.pypi.io/4c3692279382b860ef92ba7097363eefd6335d5a/68747470733a2f2f79646174612d70726f66696c696e672e79646174612e61692f646f63732f6173736574732f6c6f676f5f6865616465722e706e67)


**About ydata-profiling**  

According to the documentation, the primary goal of `ydata-profiling` is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. Like the handy pandas `describe()` method, ydata-profiling delivers an extended analysis of a DataFrame, while allowing the analysis to be exported in different formats such as HTML and JSON.
The package outputs a simple and digested analysis of a dataset, including time-series and text.  

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#E57C23'>💡 Note:</font></h3>
    
Before going any further, it is important to note that the old name of the package was `pandas-profiling`. The new name to be used is `ydata-profiling` !!!

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#4F200D; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #4F200D">Setting up ⚙️🛠</p>

In [None]:
import numpy as np
import pandas as pd
import sys
import warnings
import zipfile

warnings.filterwarnings('ignore')

In your Kaggle account, go to Settings, then API, and create a new token. This downloads a file named ```kaggle.json```.

Upload this file in the next cell.

In [None]:
from google.colab import files
_ = files.upload()

Saving kaggle.json to kaggle.json


Ensure kaggle.json is in the location ```~/.kaggle/kaggle.json``` to use the API.

In [None]:
!rm -rf ~/.kaggle
!mkdir -p ~/.kaggle
!mv ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
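
Outside Colab, the same placement step can be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` and its `kaggle_dir` parameter are hypothetical, and it assumes a POSIX-style file system where owner-only permissions are meaningful.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir=None):
    """Copy a kaggle.json token into the Kaggle config directory (hypothetical helper)."""
    source = Path(token_path)
    if not source.is_file():
        raise FileNotFoundError(f"Token file not found: {source}")

    # Default to ~/.kaggle, the location named in the text above.
    target_dir = Path(kaggle_dir) if kaggle_dir is not None else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Owner read/write only, the equivalent of `chmod 600`.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token("kaggle.json")` copies the token to `~/.kaggle/kaggle.json`; unlike the shell cell, it copies rather than moves the file and leaves any other content of the directory untouched.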

In [None]:
!kaggle datasets download -d arnabchaki/data-science-salaries-2023

Downloading data-science-salaries-2023.zip to /content
  0% 0.00/25.4k [00:00<?, ?B/s]
100% 25.4k/25.4k [00:00<00:00, 48.2MB/s]


In [None]:
# The context manager closes the archive automatically.
with zipfile.ZipFile('data-science-salaries-2023.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
train = pd.read_csv('./ds_salaries.csv')

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#4F200D; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #4F200D">Basic Exploratory Data Analysis - EDA</p>

In [None]:
train.head()

|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |


In [None]:
train.columns

Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [None]:
train.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [None]:
train.duplicated().sum()

1171
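
The duplicate count above is worth a closer look before deciding what to do with those rows. The sketch below uses a toy DataFrame `df` (an assumption standing in for `train`) to show how the count is obtained, how to inspect the duplicated rows, and how to drop them. Whether dropping is appropriate is a judgment call: in a salary survey, identical rows may well be distinct respondents.

```python
import pandas as pd

# Toy frame standing in for `train`; rows 0 and 2 are identical.
df = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "ML Engineer", "Data Scientist", "Data Engineer"],
        "salary_in_usd": [120000, 30000, 120000, 95000],
    }
)

# duplicated() flags every repeat of an earlier row (keep="first" is the default).
n_duplicates = int(df.duplicated().sum())

# keep=False flags all members of each duplicate group, which is handy for inspection.
duplicate_groups = df[df.duplicated(keep=False)]

# drop_duplicates() returns a copy without the repeated rows.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(duplicate_groups), deduplicated.shape)  # 1 2 (3, 2)
```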

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
Another commonly used pandas method is `describe()`. Called on a DataFrame, it outputs a descriptive statistical summary of the numeric features by default: count, mean, standard deviation, min, quartiles, and max.

Here is how to call `describe()`:

In [None]:
train.describe()

|   | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| count | 3755.0 | 3755.0 | 3755.0 | 3755.0 |
| mean | 2022.373635 | 190695.6 | 137570.38988 | 46.271638 |
| std | 0.691448 | 671676.5 | 63055.625278 | 48.58905 |
| min | 2020.0 | 6000.0 | 5132.0 | 0.0 |
| 25% | 2022.0 | 100000.0 | 95000.0 | 0.0 |
| 50% | 2022.0 | 138000.0 | 135000.0 | 0.0 |
| 75% | 2023.0 | 180000.0 | 175000.0 | 100.0 |
| max | 2023.0 | 30400000.0 | 450000.0 | 100.0 |


<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

For a more readable layout, you can transpose the output frame with the `.T` attribute, and pass `include='all'` to `describe()` to include all types of variables.  
I prefer it like that :)

In [None]:
train.describe(include='all').T

|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| work_year | 3755.0 |  |  |  | 2022.373635 | 0.691448 | 2020.0 | 2022.0 | 2022.0 | 2023.0 | 2023.0 |
| experience_level | 3755.0 | 4.0 | SE | 2516.0 |  |  |  |  |  |  |  |
| employment_type | 3755.0 | 4.0 | FT | 3718.0 |  |  |  |  |  |  |  |
| job_title | 3755.0 | 93.0 | Data Engineer | 1040.0 |  |  |  |  |  |  |  |
| salary | 3755.0 |  |  |  | 190695.571771 | 671676.500508 | 6000.0 | 100000.0 | 138000.0 | 180000.0 | 30400000.0 |
| salary_currency | 3755.0 | 20.0 | USD | 3224.0 |  |  |  |  |  |  |  |
| salary_in_usd | 3755.0 |  |  |  | 137570.38988 | 63055.625278 | 5132.0 | 95000.0 | 135000.0 | 175000.0 | 450000.0 |
| employee_residence | 3755.0 | 78.0 | US | 3004.0 |  |  |  |  |  |  |  |
| remote_ratio | 3755.0 |  |  |  | 46.271638 | 48.58905 | 0.0 | 0.0 | 0.0 | 100.0 | 100.0 |
| company_location | 3755.0 | 72.0 | US | 3040.0 |  |  |  |  |  |  |  |


<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
Another useful pandas feature is the `shape` attribute, a tuple containing the number of rows and columns.

In [None]:
train.shape

(3755, 11)

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#4F200D; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #4F200D">Reports with YData-profiling 📊</p>

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

Though the pandas `describe()` method is useful for understanding the data, it doesn't offer many features.  
For instance, how can you easily print other statistics such as Pearson correlation or skewness?  
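
With plain pandas, each of these statistics takes its own call. As a point of comparison, here is a minimal sketch on a toy DataFrame `df` (an assumption standing in for `train`); it selects the numeric columns first so the calls behave the same across pandas versions.

```python
import pandas as pd

# Toy frame standing in for `train`.
df = pd.DataFrame(
    {
        "work_year": [2020, 2021, 2022, 2023, 2023],
        "salary_in_usd": [50000, 80000, 120000, 150000, 400000],
        "company_size": ["S", "M", "M", "L", "L"],
    }
)

# Keep only numeric columns; skewness and Pearson correlation are undefined for strings.
numeric = df.select_dtypes(include="number")

skewness = numeric.skew()                  # sample skewness per numeric column
pearson = numeric.corr(method="pearson")   # pairwise Pearson correlation matrix

print(skewness)
print(pearson)
```

This works, but every extra statistic means another call and another output to read.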

    
This is where what I call "the magic tool" comes into play. ydata-profiling offers report generation for the dataset with lots of features and customizations for the report generated.  

Let's explore this library in depth: we will look at the features it provides, along with some advanced use cases and integrations that can prove useful for creating stunning reports from DataFrames!  
![](https://editor.analyticsvidhya.com/uploads/74223Pandas%20Profiling.png)


In [None]:
!pip install -U ydata-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting ydata-profiling[notebook]
  Downloading ydata_profiling-4.6.4-py2.py3-none-any.whl (357 kB)
Collecting visions[type_image_path]==0.7.5 (from ydata-profiling[notebook])
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)

## Import

In [None]:
import ydata_profiling

In [None]:
from ydata_profiling import ProfileReport

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
To start profiling a DataFrame, you have two options:  

1. You can call the `.profile_report()` method directly on the pandas DataFrame. This method is not part of the pandas API, but as soon as you import the profiling library, it is added to DataFrame objects.  
2. You can pass the DataFrame to the `ProfileReport` class and then use the resulting object to generate the report.  

Either way, you will get the same report.  
Let's use the second method to generate the standard profiling report:

In [None]:
profile = ydata_profiling.ProfileReport(train)

In [None]:
profile.to_notebook_iframe() # use this line to show the output
#profile                     # or this one

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

## Sections & Details of the Report  

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

    
1. **Overview:**
    
This section is divided into 3 tabs: Overview, Alerts (named Warnings in older pandas-profiling releases), and Reproduction.  
        - **Overview:** overall statistics including the number of variables, rows, missing cells, % of missing cells, duplicates, % of duplicate rows, and total size in memory.  
        - **Alerts:** warnings related to cardinality, correlation with other variables, missing values, zeroes, skewness of the variables, and many others.  
        - **Reproduction:** information related to the report generation, such as the start and end time of the analysis, the time taken to generate the report, the ydata-profiling software version, and a configuration download option.  
        
        
2. **Variables:**  
    
This section of the report gives a detailed analysis of all the variables of the dataset. The information presented varies depending on the data type of the variable:  
        - **For numeric features**, you get information about the distinct values, missing values, min and max, mean, and the count of negative values. You also get a small histogram of the values. The toggle button expands to the Statistics, Histogram, Common values, and Extreme values tabs.  
        - **For string variables**, you get the distinct (unique) values, distinct percentage, missing values, missing percentage, memory size, and a horizontal bar chart of the unique values with their counts. The toggle button expands to the Overview, Categories, Words, and Characters tabs.
        
        
3. **Correlations:**  
    
Correlation describes the degree to which two variables move in coordination with one another. In the ydata-profiling report, you can access 5 types of correlation coefficients: Pearson's r, Spearman's ρ, Kendall's τ, Phik (φk), and Cramér's V (φc).

4. **Missing values:**  
    
The generated report also contains visualizations of the missing values present in the dataset. You get 3 types of plot: count, matrix, and dendrogram. The count plot is a basic bar plot with the column names on the x-axis, where the length of each bar represents the number of non-null values in that column. The matrix and dendrogram plots present the same missing-value information in other forms.

5. **Sample:**  
    
This section displays the first and last 10 rows of the dataset.


## Saving the Report

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
<h3 align="left"><font color='#E57C23'>💡 Note:</font></h3>
    
You can save the report by calling the `to_file()` method on the profile object, in the format of your choice:
- HTML
- JSON

In [None]:
profile.to_file("Exploratory Data Analysis.html")

In [None]:
profile.to_file("Exploratory Data Analysis.json")
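
In the two cells above, the output format follows from the file extension. A small wrapper can save both formats in one go and reject extensions other than the two listed. This is only a sketch: `save_report` and `SUPPORTED_SUFFIXES` are hypothetical names, not part of ydata-profiling, and the only library call used is `profile.to_file()`.

```python
from pathlib import Path

# The two formats listed above; anything else is rejected by this sketch.
SUPPORTED_SUFFIXES = {".html", ".json"}


def save_report(profile, stem, formats=("html", "json"), out_dir="."):
    """Call profile.to_file() once per requested format and return the paths used."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for fmt in formats:
        suffix = "." + fmt.lower().lstrip(".")
        if suffix not in SUPPORTED_SUFFIXES:
            raise ValueError(f"Unsupported report format: {fmt!r}")
        path = out_dir / f"{stem}{suffix}"
        profile.to_file(path)  # the file extension selects HTML or JSON output
        paths.append(path)
    return paths
```

For example, `save_report(profile, "Exploratory Data Analysis")` would write the same two files as the cells above.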

## Customizations

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
The report generated by ydata-profiling is a complete analysis that needs no input from the user other than the DataFrame object. All the elements of the report are chosen automatically, using default values.

There might be some elements in the report that you don't want to include, or you may need to add your own metadata to the final report. This is where the advanced usage of the library comes in: you can control every aspect of your report by changing the default configuration.
Some of the ways in which you can customize your reports are:  
    
- **Add metadata:**  
You can add information such as “title”, “description”, “creator”, “author”, “URL”, and “copyright_holder”. This information appears in the dataset overview section, in a new tab called “dataset”. To add it to the report, pass it as a dictionary to the `dataset` parameter of `ProfileReport`.


In [None]:
profile = ProfileReport(
    train,
    title="Data Science Salaries",
    dataset={
        "description": "Salaries of jobs in the Data Science domain",
        "copyright_holder": "CC0: Public Domain",
        "url": "https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries",
    },
)
profile

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

- **Controlling parameters of the Report:**
Suppose you don't want to display all types of correlation coefficients. You can disable the ones you don't need through the `correlations` configuration. This is also a dictionary, passed to `ProfileReport`.

In [None]:
profile = ProfileReport(train,
                        title="Data Science Salaries",
                        correlations={
                                        "pearson" : {"calculate": True},
                                        "spearman": {"calculate": True},
                                        "kendall" : {"calculate": False},
                                        "phi_k"   : {"calculate": False},
                                    }
                       )
profile
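
Writing the dictionary by hand gets repetitive when you toggle coefficients often. One possible shortcut, sketched below, builds it from the set of coefficients you want to keep. The helper `correlation_config` and the constant `CORRELATION_KEYS` are hypothetical; the key names are only the four used in the cell above.

```python
# The four configuration keys used in the cell above.
CORRELATION_KEYS = ("pearson", "spearman", "kendall", "phi_k")


def correlation_config(enabled, known=CORRELATION_KEYS):
    """Build a `correlations` dictionary that enables only the names in `enabled`."""
    unknown = set(enabled) - set(known)
    if unknown:
        raise ValueError(f"Unknown correlation names: {sorted(unknown)}")
    return {name: {"calculate": name in enabled} for name in known}


config = correlation_config({"pearson", "spearman"})
print(config)
# {'pearson': {'calculate': True}, 'spearman': {'calculate': True},
#  'kendall': {'calculate': False}, 'phi_k': {'calculate': False}}
```

The result can be passed as `ProfileReport(train, correlations=config)`, which matches the hand-written dictionary above.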

### Widgets in Jupyter notebook

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

There are two interfaces to consume the report inside a Jupyter notebook:
- through an embedded HTML report (like we have seen above)
- through widgets

When you run ydata-profiling normally in a Jupyter notebook, the HTML report is rendered inside the code cell's output (see the output above), which can make the notebook harder to navigate. You can instead display the report as a widget, which is easily accessible and offers a compact view. To do this, simply call `.to_widgets()` on your profile object:

In [None]:
profile.to_widgets()

## Datasets Comparison

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

ydata-profiling can be used to compare multiple versions of the same dataset. This is useful when comparing data from multiple time periods, such as Data Science Salaries in 2022 vs. 2023.  
Another common scenario is to compare the dataset profiles of the training, validation, and test sets in Machine Learning.

In [None]:
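# NOTE: `train2` is not defined earlier in this notebook; load the second
# dataset into it before running this cell.
train_report  = ProfileReport(train,  title="Data Science Salaries 2022")
train_report2 = ProfileReport(train2, title="Data Science Salaries 2023")

comparison_report = train_report.compare(train_report2)
comparison_report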
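
The cell above relies on a second DataFrame, `train2`, that this notebook never loads. One possible way to get two comparable frames without a second download (an assumption, not the original author's approach) is to split `train` on its `work_year` column, which spans 2020 to 2023. The helper `split_by_year` is hypothetical and is shown here on a toy DataFrame.

```python
import pandas as pd


def split_by_year(df, first, second, year_col="work_year"):
    """Return two DataFrames holding the rows for each of the two years."""
    parts = []
    for year in (first, second):
        part = df[df[year_col] == year].reset_index(drop=True)
        if part.empty:
            raise ValueError(f"No rows found for {year_col} == {year}")
        parts.append(part)
    return parts[0], parts[1]


# Toy frame standing in for `train`.
df = pd.DataFrame(
    {
        "work_year": [2022, 2022, 2023],
        "salary_in_usd": [100000, 110000, 130000],
    }
)
salaries_2022, salaries_2023 = split_by_year(df, 2022, 2023)
print(len(salaries_2022), len(salaries_2023))  # 2 1
```

Each part can then be wrapped in its own `ProfileReport` and compared with `.compare()`, exactly as in the cell above.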

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#E57C23'>💡 Note:</font></h3>

The widgets interface is not yet supported for comparison reports, so use the HTML rendering instead.

In [None]:
comparison_report.to_file("comparison2022-VS-2023.html")

# <p style="font-family:JetBrains Mono; font-weight:bold; letter-spacing: 2px; color:#4F200D; font-size:140%; text-align:center;padding: 0px; border-bottom: 3px solid #4F200D">References 📚</p>

<div style="border-radius:10px; border:#E57C23 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">
    
Documentation : [ydata-profiling](https://ydata-profiling.ydata.ai/docs/master/index.html)  
    
Article       : [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/06/generate-reports-using-pandas-profiling-deploy-using-streamlit/)