## Contents

- [Preparation](#1)
- [Overview of the EDA component of dataprep](#2)
    - create_report() 
- [Tips](#3)
    - Tips1: light way to overview
    - Tips2: analyze variable with eda.plot()
    - Tips3: analyze interactions
    - Tips4: show correlations with eda.plot_correlation()
    - Tips5: show missing with eda.plot_missing()
    - Tips6: check difference with dataprep

The dataprep EDA component is a very cool and elegant module that reads a dataset loaded into a pandas DataFrame and automatically profiles and visualizes it. Anybody can do good EDA with dataprep.

### You can find the [dataprep documentation here!](https://dataprep.ai/)

Recently, some Kagglers have published notebooks that use dataprep for EDA, and I am very interested in it.

However, while the functionality of dataprep is really great, there are some challenges unique to Kaggle, and I feel that we are not taking full advantage of its capabilities.

In this notebook, I will share some of the challenges I have noticed and the tips I have come up with.

## <u>Excellent points</u>

### Anyone can create a well-organized report in a short time✨

At a minimum, you can just pass a pandas DataFrame to eda.create_report() and get a neat and tidy report. In addition to basic, common visualizations such as bar charts, it can also generate Q-Q plots and world maps depending on your data, so the visualization results are quite exhaustive.

### Using HTML, it is possible to visualize a huge number of graphs in a very compact way💪

For example, in the Interactions section of a generated report, HTML tabs let us switch between multiple plots. This is especially powerful when you have a lot of features, as in the Tabular Playground Series.

## <u>Challenges in kaggle</u>

### Duplication with kaggle's Data Explorer 📊

If we use dataprep's features as is, the output tends toward information overload. For example, eda.plot() makes it easy to plot every column of a DataFrame, but the result is almost identical to the Data Explorer on each competition's Data page.

### Our notebook may become heavy ⌛

This overlaps with the first point: the notebook tends to become heavy because the eda module generates content in proportion to the number of columns passed in. If you like creating or reading notebooks, you have probably been frustrated by notebooks with lots of graphs that are slow to display.

To avoid these challenges and use the EDA component more effectively in Kaggle notebooks, we will first get an overview of the EDA component with create_report() and then see how to use its individual functions effectively.

<a id="1"></a>
# <div class="alert alert-block alert-info">Preparation</div> 

Kaggle's docker-python container image doesn't include dataprep, so we first have to install it with pip.

In [None]:
!pip install dataprep
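
If you rerun the notebook often, you may want to skip the install when dataprep is already available. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name for illustration, not part of dataprep.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical check: only remind the reader to install when the module is missing.
if not is_installed("dataprep"):
    print("dataprep is missing: run `!pip install dataprep` first.")
```

The helper only looks the module up, so it stays cheap even for large packages.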

In [None]:
import dataprep
from dataprep import eda
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

In [None]:
print(f"Version of dataprep is {dataprep.__version__}.")

In [None]:
tps06_train = pd.read_csv("../input/tabular-playground-series-jun-2021/train.csv")
titanic_train = pd.read_csv("../input/titanic/train.csv")
titanic_test = pd.read_csv("../input/titanic/test.csv")
world_happiness_report_2021 = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv")
bitcoin = pd.read_csv("../input/meme-cryptocurrency-historical-data/Bitcoin.csv")

In [None]:
# Remove the ID column as it is in the way.

tps06_train = tps06_train.drop('id', axis=1)

<a id="2"></a>
# <div class="alert alert-block alert-success">Overviewing of EDA component of dataprep</div> 

## create_report()

With create_report(), we can automatically profile and visualize a dataset and output the results as a report. This one function roughly summarizes what the EDA component can produce.

According to the documentation, the output is as follows:

> 1. Overview: detect the types of columns in a dataframe
> 2. Variables: variable type, unique values, distinct count, missing values
> 3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
> 4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
> 5. Text analysis for length, sample and letter
> 6. Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
> 7. Missing Values: bar chart, heatmap and spectrum of missing values

<font size="2">The description is from <I>
[create_report: generate profile reports](https://docs.dataprep.ai/user_guide/eda/create_report.html).</I></font>

Additionally, in the Interactions section, we can check scatter plots between numerical variables.

Let's look at an example.

### <span style="color: orange; ">↓↓↓ You can pass DataFrame to eda.create_report().</span>

In [None]:
# I thought that passing all the columns would make the notebook too heavy, so I will pass only some of them.

tps06_report_target = list(tps06_train.columns[:10])

eda.create_report(tps06_train[tps06_report_target], title="Tabular Playground Series - Jun 2021 report")

With a few columns this is not a problem, but when the DataFrame has many columns, the amount of content quickly becomes too much and the notebook suffers from information overload. (You may already have noticed a short loading delay when this notebook was displayed.) Also, some of the content in the report may be nearly duplicated or already known.

Therefore, an EDA notebook becomes more readable and accessible if we use alternative functions where needed, rather than calling create_report() on all the data.

From this perspective, we will look at tips and visualization examples for each section of the report.
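
When a full report is still wanted for a wide DataFrame, one way to keep each output manageable is to split the columns into small batches. This is only a sketch: `column_batches` is a hypothetical helper, and the column names below are stand-ins for `tps06_train.columns`.

```python
from typing import Iterator, List, Sequence


def column_batches(columns: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` column names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(columns), size):
        yield list(columns[start:start + size])


# Stand-in column names; in the notebook this would be tps06_train.columns.
example_columns = [f"feature_{i}" for i in range(25)]
batches = list(column_batches(example_columns, 10))
print([len(batch) for batch in batches])  # → [10, 10, 5]
```

Each batch could then be used to select columns, as in `tps06_train[batch]`, before passing the smaller DataFrame to eda.create_report().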

<a id="3"></a>
# <div class="alert alert-block alert-warning">Tips</div> 

## <b>Tips1: light way to overview</b>

Using the DataFrame methods describe(), info() and sum(), we can compute statistics and profiles. With the DataFrame.style property, we can also add visual effects to the output.

In [None]:
tps06_train.describe().T.style.bar(subset=['mean'], color='#20c8f2')\
                      .background_gradient(subset=['std'], cmap='YlGn')

In [None]:
tps06_train.info()

In [None]:
pd.DataFrame(tps06_train.isna().sum()/len(tps06_train), columns=["missing_rate"])\
                               .style.bar(subset=['missing_rate'], color='#20c8f2')

In [None]:
pd.DataFrame((tps06_train==0).sum()/len(tps06_train), columns=["zero_rate"])\
                             .style.bar(subset=['zero_rate'], color='#20c8f2')
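
The separate tables above can also be combined into one. Here is a minimal sketch on a small made-up DataFrame; the data and the helper name `light_overview` are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def light_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Collect dtype, missing rate, zero rate and unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_rate": df.isna().mean(),
        "zero_rate": (df == 0).mean(),
        "n_unique": df.nunique(),
    })


# Made-up data: column "b" has one missing value and one zero.
toy = pd.DataFrame({
    "a": [0, 0, 1, 2],
    "b": [1.0, np.nan, 0.0, 3.0],
})
overview = light_overview(toy)
print(overview)
```

In a notebook, the result can be styled in the same way as the cells above, for example with `overview.style.bar(subset=["missing_rate", "zero_rate"], color="#20c8f2")`.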

## <b>Tips2: analyze variable with eda.plot()</b>

Using eda.plot(), we can generate the same profiles and visualizations as the Variables section of the report.

In [None]:
eda.plot(tps06_train, "feature_0")

If the column to analyze is categorical or contains country names, a Word Cloud and a World Map will also be displayed.

In [None]:
eda.plot(world_happiness_report_2021, "Country name")

## <b>Tips3: analyze interactions</b>

By using seaborn.pairplot(), we can check the scatter plots for all pairs of columns.

In [None]:
sns.pairplot(tps06_train[tps06_report_target], corner=True)

However, in some cases, such as the Tabular Playground Series, there are too many columns to visualize well this way. In such cases, you can limit the output of create_report() to the Interactions section only.

In [None]:
eda.create_report(tps06_train[tps06_report_target], display=["Interactions"])

We can also use eda.plot() to display the same plots as the Interactions section by passing the two numerical columns we are interested in. It additionally displays a Hexbin Plot and a Box Plot.

In [None]:
eda.plot(tps06_train, "feature_0", "feature_1")
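
To decide which pairs are worth plotting, one option is to rank column pairs by absolute correlation first. This is a rough pandas sketch on synthetic data, not something dataprep does for you; `top_correlated_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, k: int = 3) -> pd.Series:
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(k)


# Synthetic data: "x_noisy" is built from "x", while "unrelated" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})
pairs = top_correlated_pairs(toy, k=1)
print(pairs)
```

Each pair in `pairs.index` is a tuple of two column names, which can be passed as the two column arguments of eda.plot().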

Different combinations of variable types passed to eda.plot() produce different visualizations.

In [None]:
eda.plot(titanic_train, "Age", "Sex")

When the figure is too large, we can adjust its size through the config argument of eda.plot().

In [None]:
eda.plot(titanic_train, "Sex", "Survived", config={"height": 400, "width": 450})

## <b>Tips4: show correlations with eda.plot_correlation()</b>

We can use eda.plot_correlation() to produce the same visualization as the Correlations section.

In [None]:
eda.plot_correlation(tps06_train)

We can also use the value_range argument to set the range of correlation values shown in the heatmap, and the display argument to choose which analyses to show.

In [None]:
eda.plot_correlation(bitcoin, value_range=[0.6, 1], display=["Pearson", "Stats"])

It is also possible to narrow down the vertical axis of the heat map to a specific variable.

In [None]:
eda.plot_correlation(tps06_train, "feature_2", display=["Spearman"])

## <b>Tips5: show missing with eda.plot_missing()</b>

By using eda.plot_missing(), we can get the same visualization as the Missing Values section.

In [None]:
eda.plot_missing(titanic_train)

By specifying the col1 and col2 arguments of plot_missing(), we can also see how dropping the missing values in col1 affects col2.

In [None]:
eda.plot_missing(titanic_train, "Age", "Survived")
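
The idea behind this comparison can be checked by hand with pandas. The sketch below uses made-up numbers shaped like the Titanic columns; it is not how dataprep computes its plots.

```python
import numpy as np
import pandas as pd

# Made-up rows: two of the six passengers have no recorded Age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0, 4.0],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# Distribution of Survived on all rows versus only the rows where Age is present.
before = toy["Survived"].value_counts(normalize=True).sort_index()
after = toy.dropna(subset=["Age"])["Survived"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"all_rows": before, "age_present": after})
print(comparison)
```

If the two columns of `comparison` differ a lot, dropping the rows with a missing Age would change the distribution of Survived.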

## <b>Tips6: check difference with dataprep</b>

One feature not found in create_report() is eda.plot_diff(). With this function, you can visualize the corresponding columns of two DataFrames and see how they differ.

In [None]:
diff_cols = ["Age", "Sex", "Pclass"]
eda.plot_diff([titanic_train[diff_cols], titanic_test[diff_cols]])
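
For a lighter numeric check of the same kind, the summary statistics of two DataFrames can be placed side by side with pandas. This is only a sketch with made-up values standing in for `titanic_train` and `titanic_test`.

```python
import pandas as pd

# Made-up stand-ins for the train and test DataFrames.
train = pd.DataFrame({"Age": [22, 38, 26, 35], "Pclass": [3, 1, 3, 1]})
test = pd.DataFrame({"Age": [34, 47, 62, 27], "Pclass": [3, 3, 2, 3]})

# Put the summary statistics of both frames under labelled column groups.
summary = pd.concat({"train": train.describe().T, "test": test.describe().T}, axis=1)

# Difference of the per-column means (test minus train).
mean_shift = summary[("test", "mean")] - summary[("train", "mean")]
print(mean_shift)
```

A large value in `mean_shift` hints at a column whose distribution differs between the two sets and may deserve a closer look with eda.plot_diff().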

# I would be very happy if you could comment on this notebook with <u>your thoughts and constructive advice</u>. Thanks!