# Exploratory Data Analysis



### 1- First steps when you get a new dataset?

* How many data points/row count/observations are there?
* What are the headers/column names/column descriptions?
* What is the type of the data set?
* What is the preliminary objective of the dataset? If it is classification:
  * How many classes are there?
  * Is there class imbalance?
* Are there NaN/missing values?
* Are there extreme values?
* Are there incomplete data entries or rows?
* What is the range of each column, and what is the acceptable range?
* Who is giving us the data - *can we trust them?*
* How do we check that we received all the data?
* Perform column analysis
* What is the refresh rate of the data?
  * Is this data a one-time lookup or a one-time dump?
  * If we receive it at a regular clip, what is the frequency - second, minute, hour, day, quarter, year?
  * What is the volume/size of the transmit
  * What is the velocity/rate - batch or real time or near real time 
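
A few of these checks can be sketched in plain Python. This is only a minimal illustration: the toy rows, the column names (`age`, `income`, `label`) and the acceptable age range of 0 to 120 are all made-up assumptions, not part of any real dataset.

```python
from collections import Counter

# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 29, "income": None, "label": "no"},
    {"age": 41, "income": 61000, "label": "no"},
    {"age": 230, "income": 58000, "label": "yes"},  # age outside a plausible range
]

# How many observations are there, and which columns?
n_rows = len(rows)
columns = list(rows[0].keys())

# Missing values per column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}

# Class balance for the (assumed) target column.
class_counts = Counter(r["label"] for r in rows)

# Observed range of a numeric column, plus values outside the assumed acceptable range.
ages = [r["age"] for r in rows if r["age"] is not None]
age_range = (min(ages), max(ages))
out_of_range = [a for a in ages if not 0 <= a <= 120]

print(n_rows, columns)
print(missing)
print(class_counts)
print(age_range, out_of_range)
```

On a real project the same questions would usually be asked of a data frame rather than a list of dicts, but the checks themselves stay the same.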


### 2-Basic Plots and Measures 

 **<u>Scatter Plot</u>** 

* Good to identify relationships between columns 
* Pair plots (pairwise scatter plots) help when there are more than three dimensions 
* If a straight line can separate the classes correctly, the data set is called linearly separable 
* Example pair plot of petal length vs petal width from the iris data set

![scatter_plot_iris](../images/scatter_plot_iris.png)     

**Histogram/Probability Density Function (PDF)** 

- A histogram is a frequency count of points. Typically the data is broken into buckets (binning/bucketing); for example, if age is a feature and we have 100 people, we can count the number of people in the age groups 10-20, 21-30, 31-40, etc. 

- A smoothed histogram approximates the probability density function (PDF); the process of smoothing out a histogram is called Kernel Density Estimation (KDE)

- The area under a PDF is always one 

- PDFs show where the data points are dense and where they are sparse 

- PDFs apply to continuous random variables; discrete random variables have a Probability Mass Function (PMF)
- PDFs are used to calculate the probability of a range, not of an individual point, because the area under the curve at a single point is zero
- The value of a PDF is a density, not a probability 
- Example PDF of the feature petal width from the iris data set. This shows that we can clearly separate the setosa class from the other two classes

![scatter_plot_iris](../images/pdf_iris.png)
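
The binning idea and the "area is one" property can be checked by hand. The sketch below assumes a made-up list of ten ages and a bin width of 10; it builds a histogram and then rescales the counts into densities. It does not perform KDE smoothing.

```python
# Hypothetical ages of 10 people.
ages = [12, 15, 22, 25, 27, 31, 33, 38, 45, 52]

bin_width = 10
bin_edges = list(range(10, 70, bin_width))  # [10, 20, 30, 40, 50, 60]

# Histogram: count the points that fall in each half-open bin [left, left + width).
counts = [sum(left <= a < left + bin_width for a in ages) for left in bin_edges[:-1]]

# Density: divide by (n * bin_width) so that the bar areas sum to one.
n = len(ages)
density = [c / (n * bin_width) for c in counts]
total_area = sum(d * bin_width for d in density)

print(counts)
print(density)
print(total_area)
```

The density values are heights per unit of age, not probabilities. The probability of landing in a bin is its height times its width.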

**Univariate Analysis using PDF**

* One-variable analysis 

* Helpful in identifying feature importance in a classification problem. For example, the figure below has three features F1, F2 and F3 and three classes C1, C2 and C3. The plots are the PDFs of all three features. We can see that F1 is the most important feature (and has the maximum variance) since it separates the classes most clearly 

![scatter_plot_iris](../images/pdf_features.png)

**Cumulative Distribution Function CDF**

* The area under the PDF curve up to a point $ x $ gives the CDF at $ x $
* The Y-axis of the CDF is a probability
* Differentiating the CDF gives the PDF; integrating the PDF gives the CDF
* Example: in the figure shown below, 82% of setosa flowers have a petal length <= 1.6 
![scatter_plot_iris](../images/cdf_pdf_iris.png)
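
An empirical CDF can be computed directly as "the fraction of points less than or equal to x". The sketch below uses made-up petal lengths (not the actual iris measurements), and `ecdf` is a hypothetical helper name.

```python
# Hypothetical petal lengths in cm; not the actual iris measurements.
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9, 4.5, 5.1]

def ecdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Reading the CDF at one point gives a probability.
p_at_most_1_6 = ecdf(petal_lengths, 1.6)   # 6 of the 10 values are <= 1.6

# The probability of a range is a difference of two CDF values.
p_between = ecdf(petal_lengths, 2.0) - ecdf(petal_lengths, 1.4)

print(p_at_most_1_6)
print(p_between)
```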

**Mean**
* The mean is the arithmetic average. It is sensitive to outliers because every value, however extreme, contributes to the sum 

\begin{equation}
\mu=\frac{\sum_{i=1}^n a_i}{n}
\end{equation}

**Variance**
* Measure of spread: the average squared distance of the data points from the mean.
\begin{equation}
\sigma^2=\frac{\sum_{i=1}^n (a_i-\mu)^2}{n}
\end{equation}

**Standard Deviation**
* Quantifies the spread of values in the same units as the data. A low standard deviation means the points are clustered around the mean; a large one means the points are spread out widely

\begin{equation}
\sigma=\sqrt{\frac{\sum_{i=1}^n (a_i-\mu)^2}{n}}
\end{equation}
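
The three formulas above divide by $ n $, which corresponds to the population versions in Python's `statistics` module (`pvariance`, `pstdev`). A minimal sketch with made-up numbers, including the effect of a single extreme value on the mean:

```python
import statistics

a = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mu = statistics.mean(a)         # sum(a) / n
var = statistics.pvariance(a)   # population variance: divides by n
sigma = statistics.pstdev(a)    # square root of the population variance

# The same quantities written out from the formulas.
n = len(a)
mu_manual = sum(a) / n
var_manual = sum((x - mu_manual) ** 2 for x in a) / n
sigma_manual = var_manual ** 0.5

print(mu, var, sigma)

# A single extreme value pulls the mean far away from the bulk of the data.
mean_with_outlier = statistics.mean(a + [100])
print(mean_with_outlier)
```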

**Median**

* Sort the data points and pick the middle value. Let $ n $ be the number of data points and $ a_{(k)} $ the $ k $-th smallest value. The median is not distorted by outliers: about 50% of the data must be corrupted before the median can be pushed to an arbitrary value 



\begin{align}
median =
\begin{cases}
a_{\left(\frac{n+1}{2}\right)},  & \text{if $n$ is odd} \\[2ex]
\frac{1}{2}\left(a_{\left(\frac{n}{2}\right)} + a_{\left(\frac{n}{2}+1\right)}\right), & \text{if $n$ is even}
\end{cases}
\end{align}
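
The odd and even cases can be written out as a short function. This is a sketch with made-up data; `median` here is a hand-rolled helper, and `statistics.median` from the standard library would give the same results.

```python
import statistics

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                 # 0-based index of the (n+1)/2-th value
    return (s[mid - 1] + s[mid]) / 2  # average of the two middle values

data = [3, 1, 4, 2, 5]
print(median(data))         # 3
print(median(data + [6]))   # 3.5

# One corrupted value leaves the median unchanged but drags the mean away.
corrupted = [3, 1, 4, 2, 5000]
print(median(corrupted), statistics.mean(corrupted))
```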


**Percentile**
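* The $ p $-th percentile is the value below which $ p $ percent of the data falls; percentiles run from 0 to 100 and are used for ranking items in relative (ascending) order
* The 50th percentile is the median
* Quartiles are the 25th, 50th and 75th percentiles (the 100th percentile is the maximum)
* Example: if the 95th percentile of food delivery time is four days, then 95 percent of people get their food delivered within 4 days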
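
There are several conventions for computing percentiles. The sketch below uses the simple nearest-rank rule, which is only one of them, on made-up delivery times; `percentile_nearest_rank` is a hypothetical helper name.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of the data is <= it (0 < p <= 100)."""
    s = sorted(values)
    rank = math.ceil(p * len(s) / 100)  # 1-based rank in the sorted data
    return s[rank - 1]

# Hypothetical delivery times in days for 20 orders.
delivery_days = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 9]

p50 = percentile_nearest_rank(delivery_days, 50)
p95 = percentile_nearest_rank(delivery_days, 95)

print(p50)  # 3
print(p95)  # 4: 95% of these orders arrive within 4 days
```

Note that the single slow order (9 days) does not affect the 95th percentile, which is why percentiles are often preferred over the maximum for this kind of reporting.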


**Median Absolute Deviation (MAD)**

* Spread of the data points around the median; robust to outliers.
\begin{equation}
\text{MAD}=\text{median}_{i=1}^{n}\left(\left|x_i-\text{median}(x)\right|\right)
\end{equation}

```
X = {1,2,3,4,5}

median = 3

|1-3| = 2
|2-3| = 1
|3-3| = 0
|4-3| = 1
|5-3| = 2

median(2,1,0,1,2) = 1

```
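
The same calculation as a small Python function, using `statistics.median` from the standard library. The helper name `mad` and the second data set with an outlier are illustrative assumptions.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    m = statistics.median(values)
    return statistics.median([abs(x - m) for x in values])

x = [1, 2, 3, 4, 5]
print(mad(x))  # 1

# Replacing one value with an outlier leaves the MAD unchanged,
# while the standard deviation grows considerably.
x_outlier = [1, 2, 3, 4, 500]
print(mad(x_outlier), statistics.pstdev(x_outlier))
```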

**Inter Quartile Range IQR**
* Measure of variability based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts; the IQR is the difference between the third and first quartiles, $ Q_3 - Q_1 $
![scatter_plot_iris](../images/iqr.png)
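
A minimal sketch of the IQR using `statistics.quantiles` (Python 3.8+) on made-up data. Quartile conventions differ between tools; this uses the function's default `"exclusive"` method, so other libraries may give slightly different cut points.

```python
import statistics

data = [7, 1, 3, 11, 5, 9, 2, 4, 6, 8, 10]  # made-up values, unsorted on purpose

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
```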

**Box Plot**
* standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.
<table><tr>
<td> <img src="../images/boxplot_2.png"  style="width: 350px;"/> </td>
<td> <img src="../images/boxplot.png"  style="width: 500px;"/> </td>
</tr></table>
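
The five numbers behind a box plot can be collected in one place. This sketch assumes made-up data and again relies on the default quartile method of `statistics.quantiles`; `five_number_summary` is a hypothetical helper name.

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]  # made-up values with one large point
summary = five_number_summary(data)
print(summary)
```

Here the maximum (30) sits far from the third quartile, which is the kind of gap a box plot makes visible at a glance.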


**Violin Plot**
* Combines a PDF (density estimate) with a box plot
![scatter_plot_iris](../images/violin_plot.png)

**Effects of Outliers**
* Anscombe's quartet consists of four datasets constructed to have identical summary statistics, including the same regression line and correlation coefficient. It illustrates the importance of exploring data graphically and the effect of outliers.
![scatter_plot_iris](../images/effect_of_outliers.png)
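
The matching summary statistics can be verified numerically. The sketch below types in the published Anscombe values and computes the mean of x, the mean of y and the Pearson correlation for each of the four sets; `pearson_r` is a hand-written helper, not a library call.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out from its definition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (mean of x, mean of y, correlation) for each data set.
summary = {
    name: (statistics.mean(xs), statistics.mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mean_x, mean_y, r) in summary.items():
    print(name, mean_x, round(mean_y, 2), round(r, 3))
```

The printed rows are almost indistinguishable, even though the scatter plots look nothing alike. In sets III and IV a single point largely determines the fitted line and the correlation.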
