**[Reference](https://kite.com/blog/python/data-analysis-visualization-python#detailed_explanation_of_EDA)**

## Introduction

There is so much data in today’s world. Modern businesses and academics alike collect vast amounts of data on myriad processes and phenomena. 
New data analysis and visualization programs make it possible to reach an even deeper understanding of that data.

Data analytics allows businesses to understand their efficiency and performance, and ultimately helps them make more informed decisions.
For example, an e-commerce company might be interested in analyzing customer attributes in order to display targeted ads for improving sales. 


## Defining Exploratory Data Analysis

**Exploratory Data Analysis** (EDA) plays a critical role in understanding the what, why, and how of the problem statement. It is the first operation a data analyst performs when handed a new data source and problem statement.

Exploratory Data Analysis is an approach to analyzing data sets by summarizing their main characteristics with visualizations. The EDA process is a crucial step prior to building a model in order to unravel various insights that later become important in developing a robust algorithmic model.

Different operations where EDA comes into play:
* First and foremost, EDA provides a stage for breaking down problem statements into smaller experiments which can help understand the dataset
* EDA provides relevant insights which help analysts make key business decisions
* The EDA step provides a platform to run all thought experiments and ultimately guides us towards making a critical decision

## Overview

This article introduces the key components of Exploratory Data Analysis, along with a few examples to get you started on analyzing your own data.

We’ll cover the relevant theory and walk through sample code so that you can apply these techniques to your own data set.

The main objective of the introductory article is to cover how to:
* Read and examine a dataset and classify variables by their type: quantitative vs. categorical
* Handle categorical variables with numerically coded values
* Perform univariate and bivariate analysis and derive meaningful insights about the dataset
* Identify and treat missing values and remove dataset outliers
* Build a correlation matrix to identify relevant variables

Above all, we’ll learn about the important APIs of the Python packages that help us perform various EDA techniques.


## A detailed explanation of an EDA on sales data

We’ll look into some code and learn to interpret key insights from the different operations that we perform.
Our requirements include the [pandas](https://kite.com/python/docs/pandas), [numpy](https://kite.com/python/docs/numpy), [seaborn](https://kite.com/python/docs/seaborn), and [matplotlib](https://kite.com/python/docs/matplotlib) [Python](https://kite.com/blog/python/) packages.


```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt
```

For reading data and performing EDA operations, we’ll primarily use the numpy and pandas Python packages, which offer simple APIs that allow us to plug in our data sources and perform the desired operations.
For the output, we’ll be using the Seaborn package which is a Python-based data visualization library built on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Data visualization is an important part of analysis because it lets even non-programmers decipher trends and patterns.

Kaggle is a great community of data scientists analyzing data together, and a good place to find data for practicing these skills.

We’ll be analyzing a [Kaggle data set](https://www.kaggle.com/flenderson/sales-analysis) on a company’s sales and inventory patterns.
The dataset contains **a detailed set of products in an inventory** and the **main problem statement here is to determine the products that should continue to sell, and which products to remove from the inventory.**
The file contains the **observations of both historical sales and active inventory data.** The end solution here is to create **a model that will predict which products to keep and which to remove from the inventory** – we’ll **perform EDA on this data to understand the data better**. 


You can follow along with **[a companion Kaggle notebook](https://www.kaggle.com/dvigneshwer/kernele7f4dbb964)**.

## Quick peek at functions: an example

Let’s analyze the dataset and take a closer look at its content. The aim here is to find details like the number of columns and other metadata which will help us to gauge size and other properties such as the range of values in the columns of the dataset.

```python
sales_data = pd.read_csv("../input/SalesKaggle3.csv")
sales_data.head()

# Similarly, sales_data.tail() shows the bottom rows of the pandas DataFrame.
```

| | Order | File_Type | SKU_number | SoldFlag | SoldCount | MarketingType | ReleaseNumber | New_Release_Flag | StrengthFactor | PriceReg | ReleaseYear | ItemCount | LowUserPrice | LowNetPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Historical | 1737127 | 0.0 | 0.0 | D | 15 | 1 | 682743.0 | 44.99 | 2015 | 8 | 28.97 | 31.84 |
| 1 | 3 | Historical | 3255963 | 0.0 | 0.0 | D | 7 | 1 | 1016014.0 | 24.81 | 2005 | 39 | 0.0 | 15.54 |
| 2 | 4 | Historical | 612701 | 0.0 | 0.0 | D | 0 | 0 | 340464.0 | 46.0 | 2013 | 34 | 30.19 | 27.97 |
| 3 | 6 | Historical | 115883 | 1.0 | 1.0 | D | 4 | 1 | 334011.0 | 100.0 | 2006 | 20 | 133.93 | 83.15 |
| 4 | 7 | Historical | 863939 | 1.0 | 1.0 | D | 2 | 1 | 1287938.0 | 121.95 | 2010 | 28 | 4.0 | 23.99 |

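Beyond `head()`, a few DataFrame attributes report the size and column metadata mentioned above. The sketch below rebuilds part of the five previewed rows in memory so that it runs without the Kaggle file; the name `sample` and the choice of columns are assumptions made for illustration, and with the real data you would call the same attributes on `sales_data`.

```python
import pandas as pd

# A subset of the five rows shown in the head() output, rebuilt in memory.
# With the real CSV loaded, use sales_data instead of this stand-in.
sample = pd.DataFrame({
    "Order": [2, 3, 4, 6, 7],
    "File_Type": ["Historical"] * 5,
    "SKU_number": [1737127, 3255963, 612701, 115883, 863939],
    "SoldFlag": [0.0, 0.0, 0.0, 1.0, 1.0],
    "SoldCount": [0.0, 0.0, 0.0, 1.0, 1.0],
    "MarketingType": ["D"] * 5,
    "PriceReg": [44.99, 24.81, 46.0, 100.0, 121.95],
    "ItemCount": [8, 39, 34, 20, 28],
})

n_rows, n_cols = sample.shape        # (number of rows, number of columns)
column_names = list(sample.columns)  # column labels, in order
column_types = sample.dtypes         # one dtype per column

print(n_rows, n_cols)
print(column_types)
```

`sample.info()` prints a similar summary in one call, including the number of non-null entries per column.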

*Types of variables and descriptive statistics*

Once we have loaded the dataset into the Python environment, our next step is to understand what these columns actually contain: the range of values each one takes, which ones are categorical in nature, and so on.

To put the data in context, we also need to understand what the columns mean for the business. This helps establish rules for the potential transformations that can be applied to the column values.
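One way to make this first pass is to let pandas split the columns by dtype and then look for numeric columns with very few distinct values, since those are often coded categories. The sketch below is illustrative only: `sample` holds a few columns from the five rows previewed earlier, and the cut-off of two distinct values is an arbitrary assumption rather than a rule from the data set.

```python
import pandas as pd

# A few columns from the five previewed rows (stand-in for sales_data).
sample = pd.DataFrame({
    "File_Type": ["Historical"] * 5,
    "MarketingType": ["D"] * 5,
    "SoldFlag": [0.0, 0.0, 0.0, 1.0, 1.0],
    "New_Release_Flag": [1, 1, 0, 1, 1],
    "PriceReg": [44.99, 24.81, 46.0, 100.0, 121.95],
    "ItemCount": [8, 39, 34, 20, 28],
})

# Columns stored as numbers vs. everything else (here, text labels).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with at most two distinct values are candidates for
# numerically coded categories (the threshold is an assumption).
distinct_counts = sample[numeric_cols].nunique()
coded_candidates = distinct_counts[distinct_counts <= 2].index.tolist()

print(text_cols)
print(coded_candidates)
```

On this sample, `SoldFlag` and `New_Release_Flag` are flagged as candidates even though pandas stores them as numbers.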

Here are the definitions for a few of the columns:
* **File_Type**: The value “Active” means that the particular product needs investigation.
* **SoldFlag**: 1 = sold in the past six months, 0 = not sold.
* **SKU_number**: The unique identifier for each product.
* **Order**: Just a sequential counter; it can be ignored.
* **MarketingType**: Two categories of how we market the product.
* **New_Release_Flag**: Any product that has had a future release (i.e., Release Number > 1).

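These definitions suggest two simple transformations: dropping the `Order` counter and treating the numerically coded flags as categories rather than quantities. The following is a minimal sketch on a stand-in frame built from the five previewed rows; the name `sample` is hypothetical, and whether to convert the flags at all is a judgment call rather than something the data set prescribes.

```python
import pandas as pd

# Stand-in for sales_data, using values from the five previewed rows.
sample = pd.DataFrame({
    "Order": [2, 3, 4, 6, 7],
    "File_Type": ["Historical"] * 5,
    "SoldFlag": [0.0, 0.0, 0.0, 1.0, 1.0],
    "MarketingType": ["D"] * 5,
    "New_Release_Flag": [1, 1, 0, 1, 1],
})

# Order is only a sequential counter, so it carries no information.
sample = sample.drop(columns=["Order"])

# Mark the coded flags as categorical so they are not averaged by mistake.
for col in ["SoldFlag", "New_Release_Flag"]:
    sample[col] = sample[col].astype("category")

# Frequency of each category, as a plain dict for easy inspection.
sold_counts = sample["SoldFlag"].value_counts().to_dict()
print(sold_counts)
```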
```python
sales_data.describe()
```

| | Order | SKU_number | SoldFlag | SoldCount | ReleaseNumber | New_Release_Flag | StrengthFactor | PriceReg | ReleaseYear | ItemCount | LowUserPrice | LowNetPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 198917.0 | 198917.0 | 75996.0 | 75996.0 | 198917.0 | 198917.0 | 198917.0 | 198917.0 | 198917.0 | 198917.0 | 198917.0 | 198917.0 |
| mean | 106483.543242 | 861362.6 | 0.171009 | 0.322306 | 3.412202 | 0.642248 | 1117115.0 | 90.895243 | 2006.016414 | 41.426283 | 30.982487 | 46.832053 |
| std | 60136.716784 | 869979.4 | 0.376519 | 1.168615 | 3.864243 | 0.47934 | 1522090.0 | 86.736367 | 9.158331 | 37.541215 | 69.066155 | 128.513236 |
| min | 2.0 | 50001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.275 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 55665.0 | 217252.0 | 0.0 | 0.0 | 1.0 | 0.0 | 161418.8 | 42.0 | 2003.0 | 21.0 | 4.91 | 17.95 |
| 50% | 108569.0 | 612208.0 | 0.0 | 0.0 | 2.0 | 1.0 | 582224.0 | 69.95 | 2007.0 | 32.0 | 16.08 | 33.98 |
| 75% | 158298.0 | 904751.0 | 0.0 | 0.0 | 5.0 | 1.0 | 1430083.0 | 116.0 | 2011.0 | 50.0 | 40.24 | 55.49 |
| max | 208027.0 | 3960788.0 | 1.0 | 73.0 | 99.0 | 1.0 | 17384450.0 | 12671.48 | 2018.0 | 2542.0 | 14140.21 | 19138.79 |

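In the output above, `SoldFlag` and `SoldCount` report a count of 75,996 while the other columns report 198,917. That gap points to missing values in those two columns, because `describe()` leaves NaN entries out of its statistics. A minimal sketch with made-up values (the frame `toy` is hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks rows where no sales flag was recorded.
toy = pd.DataFrame({
    "SoldFlag": [0.0, 1.0, np.nan, np.nan, 0.0],
    "PriceReg": [44.99, 24.81, 46.0, 100.0, 121.95],
})

summary = toy.describe()

# count is the number of non-NaN entries, so it differs per column.
print(summary.loc["count"])

# The mean of SoldFlag is taken over the three recorded values only.
print(summary.loc["mean", "SoldFlag"])
```

Comparing the `count` row against the total number of rows is therefore a quick first check for missing data.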

The `describe()` function returns a pandas DataFrame of descriptive statistics that summarize the central tendency, dispersion, and shape of each numeric column’s distribution, excluding NaN values.
The **three main numerical measures** for the center of a distribution are the **mode, mean (µ), and median (M)**.
The mode is **the most frequently occurring value**.
The mean is the average value, while the median is the middle value.
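The three measures can be computed directly on a pandas Series. The values below are made up for illustration (a handful of small counts plus one large one, loosely echoing the skew visible in `SoldCount`), so treat this as a sketch rather than a result from the sales data.

```python
import pandas as pd

# Hypothetical unit counts: mostly small values and one large outlier.
sold = pd.Series([0, 0, 0, 1, 1, 2, 73])

# mode() returns a Series because several values can tie for most frequent.
mode_value = sold.mode().iloc[0]
mean_value = sold.mean()      # arithmetic average
median_value = sold.median()  # middle value of the sorted data

print(mode_value, mean_value, median_value)
```

The single large value pulls the mean well above the median, which is why comparing the two is a quick way to spot a skewed distribution.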