## 📖 Background
### OVERVIEW
The June edition of the 2022 Tabular Playground series is all about data imputation. The dataset has similarities to the May 2022 Tabular Playground, except that there are no targets. Rather, there are missing data values in the dataset, and your task is to predict what these values should be.
## 💾 The data
For this challenge, you are given (simulated) manufacturing control data that contains missing values due to electronic errors. Your task is to predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)


Good luck!

### Files

data.csv - the file includes normalized continuous data and categorical data; your task is to predict the values of the missing data.

sample_submission.csv - a sample submission file in the correct format; the row-col indicator corresponds to the row and column of each missing value in data.csv
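
To make the submission format described above concrete, the sketch below lists every missing cell of a small made-up frame and fills each one with its column mean. The `<row>-<column>` identifier layout and the `row-col` / `value` column names are assumptions inferred from the file description, so check them against the real sample_submission.csv. Column-mean filling is only a naive baseline, and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for data.csv (hypothetical values and column names).
toy = pd.DataFrame({
    "F_1_0": [1.0, np.nan, 3.0],
    "F_1_1": [np.nan, 2.0, 4.0],
    "F_2_0": [5, 6, 7],  # categorical-like column without missing values
})

# Naive baseline: one fill value per column (mean of the observed values).
col_means = toy.mean()

# Collect one record per missing cell, keyed as "<row>-<column>" (assumed format).
rows = []
for col in toy.columns:
    for row in toy.index[toy[col].isna()]:
        rows.append({"row-col": f"{row}-{col}", "value": col_means[col]})

submission = pd.DataFrame(rows, columns=["row-col", "value"])
print(submission)
```

Any stronger imputation method can replace `col_means` here, as long as it yields one value per missing cell.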


> **“Numbers have an important story to tell. They rely on you to give them a clear and convincing voice.” - Stephen Few**

Most industries now recognize data as a valuable asset. What you do with that data, though, is what turns it into extra profit or a new discovery.

When you start working with a dataset, most trends and patterns are not apparent. Exploratory data analysis (EDA) helps you examine the data carefully and get an overall sense of what is happening in it. Uncovering these hidden relationships and patterns is critical for building analytical and learning models on top of the data.

The general workflow of EDA looks as follows:

![image.png](attachment:d2e22460-f0ac-4e0c-af60-c4cf9ee4d773.png)




<h2 style=color:green align="left"> Table-of-contents </h2>

* [1) Introduction](#1)

* [2) Load Required Libraries](#2)
* [3) Read Data](#3)
* [4) DataPrep (AutoEDA)](#4)
  *  [a) Train  Dataset Part 1](#a)
     * [4.1) Analyze distributions with plot()](#4.1)
     
     * [4.2) Analyze correlations with plot_correlation()](#4.2)
     * [4.3) Analyze missing values with plot_missing()](#4.3)
     * [4.4) Create a profile report with create_report()](#4.4)
     
  *  [b) Train  Dataset Part 2](#b)
  
  *  [c) Time Series Data Analysis](#c)

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 1) Introduction </h1>

In [None]:
#from IPython.display import Image
#Image("../input/images/dataprep_01.png",width=400,height=400)

![image.png](attachment:a5e027c0-4777-4421-868a-4d8346f1bbb1.png)

# Introduction to dataprep.eda
DataPrep is an initiative by the SFU Data Science Research Group to speed up data science. dataprep.eda attempts to simplify the entire EDA process with very few lines of code. Because EDA is an essential and time-consuming part of the data science pipeline, a tool that eases the process is a boon.

This notebook aims to give you an easy, hands-on tour of what you can do with dataprep.eda. So let's dive into it, shall we?

![image.png](attachment:b3c325eb-1ce5-4892-99fd-2c3c676d91c8.png)

### Introduction to Exploratory Data Analysis and dataprep.eda


- According to its authors, DataPrep.EDA is the fastest and easiest EDA (Exploratory Data Analysis) tool in Python. It lets you understand a Pandas/Dask DataFrame with a few lines of code in seconds.

- You can create a profile report from a Pandas/Dask DataFrame with the create_report function. The project lists the following advantages over other tools:

  - DataPrep.EDA is **10-100X faster** than **Pandas-based profiling** tools due to its highly optimized Dask-based computing module.

  - DataPrep.EDA generates interactive visualizations in a report, which makes the report more appealing to end users.

  - DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

- **Exploratory Data Analysis (EDA)** is the process of exploring a dataset and getting an understanding of its main characteristics. The dataprep.eda package simplifies this process by allowing the user to explore important characteristics with simple APIs. Each API allows the user to analyze the dataset from a high level to a low level, and from different perspectives. Specifically, dataprep.eda provides the following functionality:

#### 1) Analyze distributions with plot()

- **Analyze column distributions with plot().** The function plot() explores the column distributions and statistics of the dataset. It will detect the column type, and then output various plots and statistics that are appropriate for the respective type. The user can optionally pass one or two columns of interest as parameters: If one column is passed, its distribution will be plotted in various ways, and column statistics will be computed. If two columns are passed, plots depicting the relationship between the two columns will be generated.

 - **plot(df):** plots the distribution of each column and calculates dataset statistics

 - **plot(df, x):** plots the distribution of column x in various ways and calculates column statistics

 - **plot(df, x, y):** generates plots depicting the relationship between columns x and y

#### 2) Analyze correlations with plot_correlation()

- **Analyze correlations with plot_correlation().** The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. By default, it plots correlation matrices with various metrics. The user can optionally pass one or two columns of interest as parameters: If one column is passed, the correlation between this column and all other columns will be computed and ranked. If two columns are passed, a scatter plot and regression line will be plotted.

  - **plot_correlation()**: explores the correlation between columns in various ways and using multiple correlation metrics. It generates correlation matrices using **Pearson, Spearman, and KendallTau correlation coefficients**.

  - **plot_correlation(df):** plots correlation matrices (correlations between all pairs of columns)

  - **plot_correlation(df, x):** plots the most correlated columns to column x

  - **plot_correlation(df, x, y):** plots the joint distribution of column x and column y and computes a regression line
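
To see why it helps to look at more than one correlation metric, here is a minimal sketch using plain pandas rather than dataprep. The data is made up: `y` rises monotonically with `x` but not linearly, so the rank-based Spearman coefficient reaches 1 while Pearson stays below it. Kendall's tau is left out to keep the sketch short.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
demo = pd.DataFrame({"x": range(1, 11)})
demo["y"] = demo["x"] ** 3

# DataFrame.corr returns a full matrix; pick the (x, y) entry for each metric.
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```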

#### 3) Analyze missing values with plot_missing()

- **Analyze missing values with plot_missing().** The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it will generate various plots which display the amount of missing values for each column and any underlying patterns of the missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter. Then, plot_missing() will generate the distribution of each column with and without the missing values from the given column, enabling a thorough understanding of their impact.

 - **plot_missing()**: enables thorough analysis of the missing values and their impact on the dataset

 - **plot_missing(df):** plots the amount and position of missing values, and their relationship between columns

 - **plot_missing(df, x):** plots the impact of the missing values in column x on all other columns

 - **plot_missing(df, x, y):** plots the impact of the missing values from column x on column y in various ways.
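
The "with and without the missing values" comparison can be pictured with plain pandas. This is a sketch of the idea only, not of how dataprep implements it: the made-up frame `toy` has gaps in `x`, and the summary of `y` is computed once over all rows and once over only the rows where `x` is observed.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "x" has gaps, column "y" is complete.
toy = pd.DataFrame({
    "x": [0.5, np.nan, 1.5, np.nan, 2.5, 3.0],
    "y": [10.0, 50.0, 12.0, 60.0, 11.0, 13.0],
})

x_missing = toy["x"].isna()

# Summary of y over all rows versus only the rows where x is observed.
comparison = pd.DataFrame({
    "all rows": toy["y"].describe(),
    "x observed": toy.loc[~x_missing, "y"].describe(),
})
print(comparison)
```

If the two columns of `comparison` differ noticeably, as they do for this toy data, dropping the rows with missing `x` would shift the distribution of `y`.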

#### 4) Create a profile report with create_report()

 - **create_report()**: generates a comprehensive profile report of the dataset.

 - **Overview:** detect the types of columns in a dataframe

 - **Variables:** variable type, unique values, distinct count, missing values

 - **Quantile** statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

 - **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

 - **Text analysis** for length, sample and letter

 - **Correlations:** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

 - **Missing Values:** bar chart, heatmap and spectrum of missing values
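
For readers who want to know what the quantile and descriptive statistics in the report mean, the sketch below computes several of them by hand with pandas for a made-up series. It illustrates the quantities themselves and does not reproduce the report's internals; the series name `toy_feature` is hypothetical.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy_feature")

# Quartiles using pandas' default linear interpolation.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

summary = pd.Series({
    "min": values.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": values.max(),
    "range": values.max() - values.min(),
    "IQR": q3 - q1,
    "mean": values.mean(),
    "std": values.std(),  # sample standard deviation (ddof=1)
    "median abs. deviation": (values - values.median()).abs().median(),
    "coeff. of variation": values.std() / values.mean(),
    "skewness": values.skew(),
})
print(summary)
```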

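#### 5) Time Series Data Analysis

The calls below come from a different, time-series dataset (a `covid` DataFrame that is not loaded in this notebook) and are shown for reference only:

 - `plot(covid, "Date", "Confirmed", "State/UnionTerritory", agg='sum')`
 - `eu = covid.loc[covid['State/UnionTerritory'] == 'Maharashtra']`, followed by `plot(eu, "Date", "Confirmed", "State/UnionTerritory", agg='sum', ngroups=50)`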

![image.png](attachment:22797e40-066b-4b00-b8cc-235109e8e581.png)

### I want an overview of the dataset
#### plot(df)

---------------------------------------------------------

### Understand Missing Value
#### plot_missing(df)

--------------------------------------------------------

### Understand Correlation
#### plot_correlation(df)

-------------------------------------------------------

### Understand Numerical Column
#### plot(df, 'Age')

-------------------------------------------------------

### Understand Text Column
#### plot(df, 'Name')

-------------------------------------------------------

### Understand Column Relationship
#### plot(df, 'Price', bins=50)

-------------------------------------------------------

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 2) Load Required Libraries </h1>

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 3) Read Data </h1>

In [None]:
train = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")

In [None]:
display(train.head())

In [None]:
display(train.shape)


In [None]:
# info() prints its summary and returns None, so display() is not needed
train.info()

In [None]:
display(train.isnull().sum())

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 4) DataPrep (AutoEDA) </h1>

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> a) Train  Dataset Part 1 </h1>

<h2 style=color:green align="left"> Table-of-contents </h2>

* [4.1) Analyze distributions with plot()](#4.1)

* [4.2) Analyze correlations with plot_correlation()](#4.2)

* [4.3) Analyze missing values with plot_missing()](#4.3)

* [4.4) Create a profile report with create_report()](#4.4)

In [None]:
!pip install dataprep

In [None]:
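# Import only the functions used in this notebook instead of a wildcard import
from dataprep.eda import plot, plot_correlation, plot_missing, create_report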

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.1) Analyze distributions with plot() </h1>

 - **a) The function plot()** explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.


 - **b) plot(df):** plots the distribution of each column and calculates dataset statistics (“I want to see an overview of the dataset”)


 - **c) plot(df, x):** plots the distribution of column x in various ways and calculates column statistics (“I want to understand the column x”)


 - **d) plot(df, x, y):** generates plots depicting the relationship between columns x and y. (“I want to understand the relationship between x and y”)

In [None]:
from dataprep.eda import plot

In [None]:
# plots the distribution of each column and calculates dataset statistics
plot(train.iloc[0:5000])
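
The cells in this notebook pass `train.iloc[0:5000]`, i.e. the first 5,000 rows, to keep plotting fast. If the file happens to be ordered in some way, a random sample may be more representative. The sketch below contrasts the two on a made-up frame; it is a suggestion, not something this notebook does, and `frame` is a hypothetical stand-in for `train`.

```python
import pandas as pd

# Made-up frame standing in for `train`.
frame = pd.DataFrame({"row_id": range(100)})

# First 10 rows, analogous to train.iloc[0:5000].
head_slice = frame.iloc[0:10]

# 10 rows drawn at random; random_state makes the draw reproducible.
random_slice = frame.sample(n=10, random_state=0)

print(head_slice["row_id"].tolist())
print(sorted(random_slice["row_id"].tolist()))
```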

In [None]:
# plots the distribution of column x in various ways and calculates column statistics
plot(train.iloc[0:5000], 'F_1_0')

In [None]:
train.iloc[0:5000]['F_2_6'].unique()

In [None]:
# generates plots depicting the relationship between columns x and y
plot(train.iloc[0:5000], 'F_2_0','F_1_0')

In [None]:
# generates plots depicting the relationship between columns x and y
plot(train.iloc[0:5000], 'F_2_1',  'F_2_0')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.2) Analyze correlations with plot_correlation() </h1>

 - The function **plot_correlation()** explores the **correlation between columns** in various ways and using multiple correlation metrics. It generates correlation matrices using **Pearson, Spearman, and KendallTau correlation** coefficients 

 - **plot_correlation(df):** plots correlation matrices (correlations between all pairs of columns)

 - **plot_correlation(df, x):** plots the most correlated columns to column x

 - **plot_correlation(df, x, y):** plots the joint distribution of column x and column y and computes a regression line

In [None]:
from dataprep.eda import plot_correlation

In [None]:
plot_correlation(train.iloc[0:5000])

In [None]:
# plots the most correlated columns to column 'F_2_2'
plot_correlation(train.iloc[0:5000], 'F_2_2')

In [None]:
# plots the joint distribution of column x and column y and computes a regression line
plot_correlation(train.iloc[0:5000], 'F_1_2', 'F_2_2')

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.3) Analyze missing values with plot_missing() </h1>

 - The function **plot_missing()** enables thorough analysis of the missing values and their impact on the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

 - **plot_missing(df):** plots the amount and position of missing values, and their relationship between columns (“I want to understand the missing values of the dataset”)

 - **plot_missing(df, x):** plots the impact of the missing values in column x on all other columns

 - **plot_missing(df, x, y):** plots the impact of the missing values from column x on column y in various ways.

In [None]:
from dataprep.eda import plot_missing

In [None]:
plot_missing(train.iloc[0:5000])

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> 4.4) Create a profile report with create_report() </h1>


- The function **create_report()** generates a **comprehensive profile report** of the dataset. create_report() **combines the individual components** of the dataprep.eda package and outputs them into a nicely formatted **HTML** document. The document contains the following information:

 - **Overview:** detect the types of columns in a dataframe

 - **Variables:** variable type, unique values, distinct count, missing values

 - **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range

 - **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

 - **Text analysis** for length, sample and letter

 - **Correlations:** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

 - **Missing Values:** bar chart, heatmap and spectrum of missing values

In [None]:
create_report(train.iloc[0:5000])

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> b) Train Dataset Part 2  </h1>

**i) Data Exploration ( dataprep.eda )**

**ii) Data Cleaning( dataprep.clean )**

**iii) Data Collection ( dataprep.connector )**

**i) DataPrep Exploration**

- DataPrep lets us create an interactive profile report with one line of code. The report is an HTML object, separate from the notebook, that offers many ways to explore the data. Let's try the API on a sample of the data.

In [None]:
from dataprep.eda import create_report


In [None]:
create_report(train.iloc[0:5000]).show_browser()

**ii) DataPrep Cleaning**

- DataPrep Cleaning API collection offers more than 140 APIs to clean and validate our DataFrame. For example, the APIs we can use are:

 - Column Headers

 - Country Names
 
 - Dates and Times

 - Duplicate Values

 - Email Addresses

In [None]:
# With case='const', all column names are converted to capitals.
from dataprep.clean import clean_headers
clean_headers(train.iloc[0:5000], case = 'const').head()

In [None]:
# With case='camel', column names are converted to camel case.
clean_headers(train.iloc[0:5000], case = 'camel').head()

In [None]:
from dataprep.clean import clean_df
inferred_dtypes, cleaned_df = clean_df(train.iloc[0:5000])

In [None]:
inferred_dtypes

In [None]:
cleaned_df

In [None]:
# Missing values
plot_missing(cleaned_df)

In [None]:
cleaned_df.columns

In [None]:
# Impact of the missing values in column f_2_1 on column f_2_2
plot_missing(cleaned_df, "f_2_1", "f_2_2")

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> a) plot(): analyze distributions </h1>

<h1 style="background-color:magenta; font-family:newtimeroman; font-size:170%; text-align:left;"> I) Univariate Analysis </h1>

In [None]:
plot(train.iloc[0:5000], 'F_2_0')

<h1 style="background-color:magenta; font-family:newtimeroman; font-size:170%; text-align:left;"> II) Bivariate Analysis </h1>
 
 - Numerical and Numerical
 
 - Categorical and Categorical
 
 - Numerical and Categorical

In [None]:
# Numerical and Numerical
plot(train.iloc[0:5000], 'F_1_0', 'F_1_1')

In [None]:
# Categorical and Categorical
plot(train.iloc[0:5000], 'F_2_1', 'F_2_0')

In [None]:
# Numerical and Categorical
plot(train.iloc[0:5000], 'F_1_0', 'F_2_1')

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> b) plot_correlation(): analyze correlations </h1>

- plot_correlation() analyzes the correlation between columns. By default it plots the correlation matrix of all columns. If you are interested in one specific column, e.g., the columns most correlated with 'F_2_2', you can get a more detailed analysis by passing column names as parameters.

In [None]:
plot_correlation(train.iloc[0:5000])

In [None]:
plot_correlation(train.iloc[0:5000], k=1)

In [None]:
# Correlation of specified element with all other attributes
plot_correlation(train.iloc[0:5000] , 'F_2_2')

In [None]:
# Only correlation values with F_2_2 that lie within the given range, here [-1, 0.3], appear in the plot
plot_correlation(train.iloc[0:5000], "F_2_2", value_range=[-1, 0.3])
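
As a rough picture of what a value-range filter on correlations means, the sketch below does the filtering by hand with pandas on made-up data. It assumes that `value_range` keeps the correlations falling inside the given interval; the frame `toy` and its column names are hypothetical, and this is not dataprep's implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5],
    "up":     [2, 4, 6, 8, 10],  # perfectly positively correlated with target
    "down":   [5, 3, 4, 2, 1],   # strongly negatively correlated with target
    "flat":   [1, 3, 2, 3, 1],   # uncorrelated with target
})

# Pearson correlation of every other column with "target".
corr_with_target = toy.corr(method="pearson")["target"].drop("target")

# Keep only the correlations inside the closed interval [low, high].
low, high = -1, 0.3
in_range = corr_with_target[corr_with_target.between(low, high)]
print(in_range)
```

Here `up` is filtered out because its correlation lies above the upper bound, while `down` and `flat` remain.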

In [None]:
# Correlation between two attributes with line of best fit and most influential points
plot_correlation(train.iloc[0:5000], "F_1_0", "F_1_1", k=5)

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> c)  plot_missing(): analyze missing values </h1>

In [None]:
# Missing values
plot_missing(train.iloc[0:5000])

- To understand the impact of the missing values in a specific column, pass the column name as a parameter. plot_missing() then compares the distribution of each column with and without the rows where the given column is missing, which shows how much the missing values matter.

In [None]:
# Impact of the missing values in column F_1_0 on column F_2_2
plot_missing(train.iloc[0:5000], 'F_1_0', 'F_2_2')

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> d)  create_report: generate profile reports from a Pandas dataframe </h1>

- The goal of create_report is to generate profile reports from a pandas DataFrame. create_report utilizes the functionalities and formats the plots from dataprep. It provides information like overview, variables, quantile statistics (minimum value, Q1, median, Q3, maximum, range, interquartile range), descriptive statistics (mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness), text analysis for length, sample and letter, correlations and missing values.

In [None]:
#create_report(train)

<h1 style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:left;"> c)  Time Series Data Analysis </h1>

In [None]:
# `covid` is a separate time-series dataset that is not loaded in this notebook
# plot(covid, "Date", "Confirmed", "State/UnionTerritory", agg='sum')

In [None]:
# eu = covid.loc[covid['State/UnionTerritory'] == 'Maharashtra']
# plot(eu, "Date", "Confirmed", "State/UnionTerritory", agg='sum', ngroups=50)

#### GPS TO LAT/LON 

In [None]:
# clean_lat_long(df, "lat_long", output_format="ddh")

### Conclusion
Data science is evolving quickly, and it now reaches almost every domain, from sports analytics to medical imaging. That makes it worthwhile to put AI and machine learning to work on things that have an impact and save time. Tools like DataPrep let us carry out the preparatory tasks more efficiently.

dataprep.eda does well at checking data distributions, correlations, and missing values, and at the EDA process in general. Its concise, readable API also makes the library approachable for novices. All in all, dataprep.eda can serve as a one-stop library for preparatory analysis, and it looks like a library with a promising future.


<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:left;"> If you like the kernel, don't forget to upvote! </h1>

References:

https://towardsdatascience.com/dataprep-eda-accelerate-your-eda-eb845a4088bc

https://www.kaggle.com/code/sureshmecad/dataprep-autoeda

https://www.kaggle.com/discussions/general/233781#1279626

Complete  Auto EDA  Lib:

