# Introduction to Exploratory Data Analysis (EDA)

Written by: [KV Subbaih Setty](https://www.kvssetty.com).

This series of articles is inspired by: https://doi.org/10.18434/M32189

This series of articles and tutorials presents the principles, assumptions, and techniques necessary to gain insight into data via EDA--exploratory data analysis.

**EDA is the most important and most neglected step in the Machine Learning model development process.** Neglecting it causes many complications in the later stages of ML model building, so we must understand its concepts and importance. Every aspiring Data Scientist and Citizen Data Scientist must learn the art and science of EDA to become a successful Data Scientist and ML engineer/practitioner.

This series of articles and tutorials aims to teach aspiring data scientists, beginner data scientists, and budding citizen data scientists to do EDA as the first step after receiving data for developing high-performance ML models.


**This Introduction to Exploratory Data Analysis (EDA) series of tutorials is divided into the following five sections:**

* Section ONE: EDA Introduction.
* Section TWO: EDA Assumptions.
* Section THREE: EDA Techniques.
* Section FOUR: EDA Case Studies.
* Section FIVE: EDA Libraries in Python.

Each section is further subdivided into a number of topics, and each topic is covered in its own tutorial.

# Section ONE: EDA Introduction
Table of Topics:
1. **What is EDA?**
2. **EDA vs Classical & Bayesian analysis**
3. **EDA vs Summary statistics**
4. **EDA Goals**
5. **The Role of Graphics in EDA**
6. **An EDA Example**
7. **General EDA Problem Categories**

Each topic is covered in a separate article/notebook.

## 1. What is EDA?

### EDA is an Approach
Exploratory Data Analysis (EDA) is an approach/philosophy
for data analysis that employs a variety of techniques (mostly
graphical) to
1. maximize insight into a data set;
2. uncover underlying data structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models;
7. determine optimal feature representation; and
8. detect the presence of any missing values.
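
Two of the goals above, detecting missing values and detecting outliers, can be illustrated with a minimal sketch. It uses only the Python standard library so that it runs anywhere. The `readings` list is made up for illustration, `None` is assumed to be the missing-value marker, and the 1.5 × IQR fence is a common rule of thumb rather than something prescribed by this article. The helper names `count_missing` and `iqr_outliers` are hypothetical.

```python
import statistics


def count_missing(values):
    """Count entries recorded as None (the missing-value marker assumed here)."""
    return sum(1 for v in values if v is None)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # the three quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in present if v < low or v > high]


# Hypothetical readings, made up for illustration only.
readings = [9.8, 10.1, None, 10.0, 9.9, 10.2, 42.0, None, 10.3, 9.7]

print("missing:", count_missing(readings))   # missing: 2
print("outliers:", iqr_outliers(readings))   # outliers: [42.0]
```

A numeric rule like this only flags candidates. Whether `42.0` is a recording error or a genuine extreme value is a judgment that still has to come from looking at the data.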

### Focus of EDA  
The EDA approach is precisely that--an approach--not a set of
techniques or rules, but an attitude/philosophy about how a data
analysis should be carried out.

>*"'Exploratory data analysis' is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there."*
>
>-- John Tukey, author of *Exploratory Data Analysis*

For exploratory data analysis, the focus is on the data--its structure, its outliers, the hints it provides for data preparation, and the models it suggests.

EDA is both art and science, and it comes with a lot of practice and experience. It is not just plotting a set of graphs, as many believe. Although we use many statistical plots as EDA techniques, drawing inferences from the plots is the core of EDA.

### Philosophy of EDA 
EDA is not identical to *statistical graphics* although the two
terms are used almost interchangeably. Statistical graphics is a
collection of techniques--all graphically based and all
focusing on one data characterization aspect. EDA
encompasses a larger landscape; EDA is an approach to data
analysis that postpones the usual assumptions about what kind
of model the data follow in favor of the more direct approach of
allowing the data itself to reveal its underlying structure and
model. EDA is not a mere collection of techniques; EDA is a
philosophy as to:
- how we dissect a data set;
- what we look for in the data;
- how we look at data; and
- how we interpret data.

It is true that EDA heavily uses the collection of techniques that we call
*statistical graphics*, but it is not identical to statistical
graphics per se.

### Techniques of EDA
Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that, by its very nature, the main role of EDA is to explore open-mindedly, and graphics give analysts unparalleled power to do so: they entice the data to reveal its structural secrets and keep the analyst ready to gain some new, often unsuspected, insight into the data. Combined with the natural pattern-recognition capabilities that we all possess, graphics make this kind of exploration practical.

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

- Plotting the raw data, such as **data traces, histograms, bi-histograms, probability plots, lag plots, block plots, and [Youden plots](https://en.wikipedia.org/wiki/Youden%27s_J_statistic)**.

- Plotting simple statistics such as **mean plots, standard deviation plots, box plots, and main effects plots of the raw data**.

- Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page and overplotting.
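
As a rough sketch of the "multiple plots per page" idea, the snippet below places a data trace, a lag plot, a histogram, and a box plot of one variable on a single figure. It assumes matplotlib is installed. The function name `plots_on_one_page`, the simulated `sample`, and the choice of these four plots are illustrative assumptions, not a layout prescribed by this series.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to display windows
import matplotlib.pyplot as plt


def plots_on_one_page(values, path=None):
    """Draw four simple views of one variable on a single page (hypothetical helper)."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6))

    # Data trace: values in the order in which they were collected.
    axes[0, 0].plot(range(len(values)), values, linewidth=0.8)
    axes[0, 0].set_title("Data trace")

    # Lag plot: each value plotted against the value just before it.
    axes[0, 1].scatter(values[:-1], values[1:], s=8)
    axes[0, 1].set_title("Lag plot (lag 1)")

    # Histogram: shape of the distribution.
    axes[1, 0].hist(values, bins=20)
    axes[1, 0].set_title("Histogram")

    # Box plot: median, quartiles, and points beyond the whiskers.
    axes[1, 1].boxplot(values)
    axes[1, 1].set_title("Box plot")

    fig.tight_layout()
    if path is not None:
        fig.savefig(path)  # write the page to disk only when a path is given
    return fig


# Simulated data, for illustration only.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
fig = plots_on_one_page(sample)
```

Passing a file name, for example `plots_on_one_page(sample, "page.png")`, saves the page as an image. Seeing the four panels side by side lets the eye compare location, spread, ordering effects, and unusual points at a glance.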


In the next tutorial, we focus on various approaches to data analysis and why EDA takes center stage among them.