# My 6-part Powerful EDA Template That Speaks of Ultimate Skill
## EDA - done right...
<img src='images/mary.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@mary-taylor?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Mary Taylor</a>
        on 
        <a href='https://www.pexels.com/photo/energetic-man-standing-on-railing-near-fence-6009207/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Why Are You Stuck?

It is hard to get started when the blank void of a Jupyter notebook is staring at you. You have a dataset with a hundred features and no idea where to begin. Your gut feeling tells you: "Start with a feature that is normally distributed." As always...

You dive head-first into the data, moving from feature to feature until you find yourself chasing wild geese in a forest without any purpose whatsoever. So, what is the reason?

Well, to start, it means you don't have a clear process. Many say Exploratory Data Analysis (EDA) is the most important part of any project and that it takes up most of your time. A structured, process-based approach goes a long way toward a successful EDA. It also shows that you know what you are doing.

Don't have that process yet? Fear not, I've got you covered.

Today, we will talk about an EDA template that I picked up from many online courses and that fits any project.

### #1. Intro to the Dataset and Clearly State the Aim of the EDA

The first section of any EDA should give whatever information is needed to build an initial understanding of the dataset and the problem it tries to solve.

Just like any written content, you should write the beginning in such a way that it keeps the audience reading till the end. 

> Even though you will be showing code, it doesn't have to be boring.

In one of the [notebooks](https://www.kaggle.com/piantic/osic-pulmonary-fibrosis-progression-basic-eda) for the OSIC Pulmonary Fibrosis competition on Kaggle, I saw an excellent strategy for building a good EDA intro. The notebook starts with background information on the problem and why it is important to solve it. Then, it moves on to basic info about the dataset, how it was collected, and what the notebook tries to achieve.

While writing this section, don't turn it into a wall of text. Use nice formatting and proper visuals to make your EDA memorable. 

In a separate sub-section, import the necessary libraries and modules. I recommend doing this in a single cell. You can go the extra mile by importing helpful libraries such as `tqdm` and `colorama` and tweaking `matplotlib`'s `rcParams` to your liking. If you want to know more about a perfect project setup, I recommend reading my other article on it:

https://towardsdatascience.com/from-kagglers-best-project-setup-for-ds-and-ml-ffb253485f98?source=your_stories_page-------------------------------------
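As a rough sketch, a single setup cell could look like the one below. The specific `rcParams` values and display options are illustrative assumptions, not the settings from any particular notebook, so adjust them to taste.

```python
# Single setup cell: every import and global setting lives here.
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative values (assumptions): tweak until plots look right to you.
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 14
plt.rcParams["axes.grid"] = True

# Show every column when printing wide DataFrames.
pd.set_option("display.max_columns", None)

# Hide noisy deprecation warnings while exploring.
warnings.filterwarnings("ignore", category=DeprecationWarning)
```

Keeping all of this in one cell means a reader can see every dependency at a glance and rerun the setup without hunting through the notebook.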

### #2. Basic Exploration And Preprocessing

Before moving on to visualizing, it is common to take a high-level overview of the dataset. In a small sub-section, get to know your data better by using common `pandas` functions such as `head`, `describe`, `info`, etc. 

Doing this is important because it will allow you to identify basic cleaning issues that violate data constraints like data type, uniqueness and range. 
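Here is a minimal sketch of that first look. The toy DataFrame, its column names, and the 0 to 120 age range are all made up for illustration; swap in your own data and constraints.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],          # repeated id: uniqueness issue
        "age": [25, -3, 41, 130],    # -3 and 130: range issue
        "signup": [                  # dates stored as text: data type issue
            "2021-01-05", "2021-02-11", "2021-02-11", "2021-03-20",
        ],
        "income": [52000.0, np.nan, 61000.0, 48000.0],
    }
)

print(df.head())      # first few rows
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns

# Quick constraint checks (the valid age range is an assumption).
n_duplicate_ids = df["id"].duplicated().sum()
n_bad_ages = (~df["age"].between(0, 120)).sum()
n_missing = df.isna().sum()

print(n_duplicate_ids, n_bad_ages)
print(n_missing)
```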

What I recommend is to first highlight all the issues and deal with them separately. Data cleaning is stressful and boring, so finding an issue and immediately diving into solving it makes the process even worse. 

> Try to find all the issues with a clear mind without worrying about how to fix them. 

I like recording all my issues in a single cell like this:

![image.png](attachment:image.png)

This allows me to cross off each issue as I fix it. While fixing each issue, I usually follow this pattern:

![image.png](attachment:image.png)

I declare the issue with a heading and fix it in a single cell. To check for mistakes, I use `assert` statements, which produce no output if the check passes and raise an `AssertionError` otherwise.
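In code, the fix-then-check pattern might look like this. The `price` column and its currency formatting are a hypothetical example, not taken from the notebook in the screenshots.

```python
import pandas as pd

# Hypothetical example: prices arrived as strings with a currency sign.
df = pd.DataFrame({"price": ["$10.50", "$3.20", "$7.00"]})

# Issue: `price` is stored as text instead of a float.
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Checks: silent on success, AssertionError on failure.
assert pd.api.types.is_float_dtype(df["price"])
assert (df["price"] >= 0).all()
```

If a later change breaks the column again, rerunning the cell fails loudly instead of letting the problem slip into the plots.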

For massive datasets, even the smallest of operations can take a long time. When something takes much longer than expected, it is likely you are doing it the slow way. Try searching for faster ways to do it; chances are others have been in your situation.

In the example notebook I prepared for this article, I noticed that `pd.to_datetime` was taking almost 2 minutes to convert a single million-row column to `datetime`. I searched for this on Stack Overflow and found out that providing a format string to the function significantly reduces the execution time:

![image.png](attachment:image.png)

The solution took a few seconds for a million-row dataset compared to a couple of minutes. 

> I am sure there are many such speed tricks for cleaning operations, so make sure you search for them. 
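A small sketch of the trick is below. The timestamp layout and the row count are assumptions for illustration, and the actual speed-up depends on your pandas version and on how ambiguous the date strings are, so time it on your own data.

```python
import time

import pandas as pd

# Hypothetical column of timestamp strings in one fixed, known layout.
raw = pd.Series(["14/03/2021 09:26:53"] * 100_000)

start = time.perf_counter()
# An explicit format spares pandas from guessing the layout.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
elapsed = time.perf_counter() - start

print(f"Parsed {len(parsed):,} rows in {elapsed:.3f}s")
```

An explicit format also removes day-first versus month-first ambiguity, which is worth having even when speed is not a concern.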

### #3. Univariate Exploration

Section 3 is the start of visual exploration. Specifically, univariate exploration is about visualizing single variables. 

Using distribution plots such as histograms, [PMF and PDF plots, CDFs](https://towardsdatascience.com/3-best-often-better-histogram-alternatives-avoid-binning-bias-1ffd3c811a31?source=your_stories_page-------------------------------------) helps you identify the distribution of each numerical feature. This matters when you later feed those variables into Machine Learning models. 

It is helpful to know about different probability distributions like normal, Poisson, binomial and many others.

> Recently, I have written a series of articles specifically aimed at probability distributions. Read them [here](https://towardsdatascience.com/how-to-think-probabilistically-with-discrete-distributions-ea28e2bcafdc?source=your_stories_page-------------------------------------) to learn about them and how to find out whether your data follows a given distribution. 

In particular, having a [normally distributed](https://towardsdatascience.com/how-to-use-normal-distribution-like-you-know-what-you-are-doing-1cf4c55241e3?source=your_stories_page-------------------------------------) variable is the best thing you could hope for. 

![image.png](attachment:image.png)
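For a numerical feature, a histogram next to an empirical CDF is a reasonable starting pair. The sketch below uses synthetic, roughly normal data as a stand-in for a real column, and `ecdf` is a small helper written for this example, not a library function.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, roughly normal feature (assumption: stands in for a real column).
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1_000)


def ecdf(data):
    """Return the sorted values and their cumulative proportions."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf(values)

fig, (ax_hist, ax_ecdf) = plt.subplots(1, 2, figsize=(12, 4))
ax_hist.hist(values, bins=30)
ax_hist.set(title="Histogram", xlabel="value", ylabel="count")
ax_ecdf.plot(x, y, marker=".", linestyle="none")
ax_ecdf.set(title="Empirical CDF", xlabel="value", ylabel="proportion <= value")
fig.tight_layout()
```

The ECDF has no bins to tune, so it is a handy cross-check when the histogram's shape changes with the bin count.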

For categorical features, use bar charts and countplots to see the proportion of each category in the dataset. This is crucial for classification problems because it lets you check for class imbalance before fitting models.

![image.png](attachment:image.png)
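Before plotting, the same proportions can be read straight from `value_counts`. The `churn` target and its 90/10 split below are hypothetical.

```python
import pandas as pd

# Hypothetical binary target with a 90/10 split.
target = pd.Series(["no"] * 90 + ["yes"] * 10, name="churn")

counts = target.value_counts()
proportions = target.value_counts(normalize=True)

# Simple imbalance measure: majority class size over minority class size.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```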

### #4. Bivariate Exploration

Now, start looking at two variables at a time. Explore relationships between variable pairs such as numerical vs. categorical and numerical vs. numerical.

This is where your skills at generating plots like scatterplots, boxplots and heatmaps shine. Even though these are simple plots, getting them right can be hard. Consider this:

![image.png](attachment:image.png)

As you can see, tweaking the parameters of a plot can give drastically different results. It is also a good idea to compute a correlation matrix to identify linear relationships between numerical features. 

When looking at correlation, it is important [not to misunderstand it](https://towardsdatascience.com/how-to-not-misunderstand-correlation-75ce9b0289e). Generally, a coefficient close to 1 or -1 suggests a strong positive or negative **linear** relationship. A coefficient close to 0 only rules out a linear relationship: the variables may be unrelated, or they may be related in a non-linear way. Finding out the type of relationship is helpful for Regression tasks.

![image.png](attachment:image.png)
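The toy example below illustrates why a near-zero coefficient does not mean "no relationship". The data is synthetic: `quadratic` is completely determined by `x`, yet its Pearson correlation with `x` is essentially zero because the relationship is not linear.

```python
import numpy as np
import pandas as pd

# Values of x placed symmetrically around zero.
x = np.linspace(-1, 1, 201)

df = pd.DataFrame(
    {
        "x": x,
        "linear": 3 * x + 1,   # perfectly linear in x
        "quadratic": x ** 2,   # fully determined by x, but not linearly
    }
)

# DataFrame.corr computes Pearson correlation by default.
corr = df.corr()
print(corr.round(2))
```

This is why a scatterplot should accompany the correlation matrix: the plot reveals the curve that the coefficient misses.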

In this section, you should also continue your exploration of distributions. Now, you start comparing them to each other instead of looking at them individually:

![image.png](attachment:image.png)
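A numeric companion to such side-by-side plots is a grouped summary. In this sketch, the `group` and `value` columns and the two synthetic distributions are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two synthetic groups whose distributions differ only in their centre.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame(
    {
        "group": ["a"] * 500 + ["b"] * 500,
        "value": np.concatenate(
            [rng.normal(0, 1, 500), rng.normal(2, 1, 500)]
        ),
    }
)

# One row of summary statistics per group.
summary = df.groupby("group")["value"].describe()
print(summary[["mean", "std", "50%"]].round(2))
```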

### #5. Multivariate Exploration