# EXPLORATORY DATA ANALYSIS

## Introduction to EDA

<b>EDA:</b>
- Critical process of performing initial investigations on data to:
    - Discover patterns
    - Detect outliers
    - Test hypotheses
    - Check assumptions with summary statistics & graphical representations.
    
<b>Case Study:</b>
- Imagine that you decide to watch a movie on Netflix that you haven't heard of.
- You find yourself with a lot of questions that need to be answered before you can make a decision.
- So you start investigating:
   - What is the cast of the movie?
   - What is the crew of the movie?
   - What does the trailer look like?
   - What ratings & reviews has it received from other viewers?
- Exploring these details before deciding is the same process that EDA applies to a dataset.
   
<b>Significance of EDA:</b>
- It allows data scientists to analyze the data before making any assumptions. It helps ensure that the results produced are valid and applicable to business outcomes and goals.
- Helps in identifying errors in datasets.
- Gives a better understanding of the dataset.
- Helps detect outliers.
- Helps understand the dataset variables & the relationships among them.

<b>Why EDA?</b>
- Provides summary statistics for each column of the dataset.
- e.g.: Let's observe Anscombe's Quartet & its summary statistics.
<img src="assets/Anscombe-Quartet.png" width="600">  

- Now let's observe the same datasets visually.
<img src="assets/Anscombe-Quartet-Visual.png" width="600">  

- We can clearly observe that all four datasets have different properties, even though their summary statistics are nearly identical.

<b>Summary of Anscombe's Quartet:</b>
- Dataset 1: Fits the linear regression model well.
- Dataset 2: A linear regression model does not fit well because the relationship is non-linear.
- Dataset 3: The relationship is linear, but a single outlier pulls the regression line away from the remaining points.
- Dataset 4: All x values are identical except one; that single high-leverage point alone determines the fitted line.
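
The near-identical summary statistics of the quartet can be checked directly. The sketch below is only an illustration: it uses Python's standard library, the values are the published quartet (Anscombe, 1973), and `pearson_r` is a helper name introduced here, not something taken from these notes.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet. Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_x": mean(x),
        "var_x": variance(x),   # sample variance (N - 1 in the denominator)
        "mean_y": mean(y),
        "var_y": variance(y),
        "r": pearson_r(x, y),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

All four rows print almost the same numbers, which is why the plots are needed to tell the datasets apart.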

<b>Key thing to remember:</b>
- All the important features in the dataset must be visualized before implementing any Machine Learning Algorithm.

## Data Analysis Techniques

<b>Data Analysis:</b>
- Graphical: Exploratory Data Analysis
- Quantitative: Classical Statistical Methods

<b>EDA Classification:</b>
- Univariate EDA:
    - Uni means one and variate means variable. So, univariate analysis examines only one variable (column) at a time.
    - The variable can be categorical or numerical.
    
- Bivariate EDA:
    - Bi means two and variate means variable. So, bivariate analysis examines the relationship between two variables.
    
- Multivariate EDA:
    - Multivariate analysis is required when more than two variables must be analyzed simultaneously.


All three types of EDA can be performed using non-graphical & graphical techniques.

## Univariate Non-Graphical EDA

- Helps us to understand the underlying sample distribution and make observations about the population.
- For categorical data: Build a table containing the count and relative frequency (proportion) of each category.
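
A frequency table for a categorical column might be built as in the sketch below. The `genres` values are made up for illustration, and the standard-library `Counter` is just one option; with pandas, `Series.value_counts` would be the usual equivalent.

```python
from collections import Counter

# Hypothetical categorical column (illustrative values only).
genres = ["Drama", "Comedy", "Drama", "Action", "Comedy", "Drama"]

counts = Counter(genres)
total = sum(counts.values())

# Each row: category, absolute count, relative frequency.
freq_table = [(category, n, n / total) for category, n in counts.most_common()]

for category, n, rel in freq_table:
    print(f"{category:<8}{n:>3}{rel:>9.1%}")
```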

<b>Analysis of Numerical Data:</b>  
Numerical data can be analyzed with:
- Measures of Central Tendency
- Measures of Dispersion
- Measures of Position
- Measures of Shape
- Existence of Outliers
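
As a rough sketch, the first four groups of measures can be computed with Python's `statistics` module. The sample values are hypothetical, and the skewness used here is the simple moment-based version (an assumption of this example; other definitions exist).

```python
from statistics import mean, median, mode, pstdev, quantiles

# Hypothetical numerical sample (illustrative values only).
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
centre = {"mean": mean(data), "median": median(data), "mode": mode(data)}

# Measures of dispersion
spread = {"range": max(data) - min(data), "std": pstdev(data)}

# Measures of position: the three quartiles
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Measure of shape: moment-based skewness (positive = longer right tail)
m, s = centre["mean"], spread["std"]
skewness = sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

print(centre, spread, (q1, q2, q3), round(skewness, 3))
```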

<b>Existence of Outliers:</b>
- An outlier is an observation that appears to deviate markedly from other observations in the sample.
<img src="assets/Outliers.png" width="300">

<br>

<b>How can an Outlier occur in the sample dataset?</b>
- Data Entry/Experimental Measurement Error:  
    Typos or faulty measurements; usually easy to identify.
- Sampling Problems:  
    A random sample can, by chance, include extreme observations.
- Natural Variation:  
    Genuine but rare values that arise under abnormal conditions.
    
    
<b>Existence of Outliers - Impact:</b>
- They have a significant impact on the mean and the standard deviation.
- If the outliers are non-randomly distributed, they can decrease normality.
- They can bias or influence estimates that may be of substantive interest.
- They can violate the basic assumptions of regression, ANOVA and other statistical models.
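
The effect on the mean and standard deviation is easy to see on a small, made-up sample. In the sketch below, the readings are hypothetical; the median is printed alongside only for contrast.

```python
from statistics import mean, median, stdev

clean = [10, 12, 11, 13, 12, 11, 10, 12]   # hypothetical readings
with_outlier = clean + [95]                 # same readings plus one extreme value

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    print(f"{label:<13} mean={mean(values):6.2f} "
          f"std={stdev(values):6.2f} median={median(values):5.1f}")
```

A single extreme value shifts the mean and inflates the standard deviation many times over, while the median barely moves.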

<b>Methods of Identifying Outliers:</b>
- Sort the dataset.
- Using Graphical Methods.
- Standard Score (z-score).
- Interquartile Range (IQR).
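
The last two methods can be sketched as follows. The thresholds (|z| > 3 and 1.5 × IQR) are common conventions assumed for this example, not values fixed by these notes, and the function names and data are hypothetical.

```python
from statistics import mean, quantiles, stdev

# Hypothetical sample with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 10, 11, 12, 95]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print("z-score:", zscore_outliers(data))
print("IQR:    ", iqr_outliers(data))
```

In very small samples the outlier itself inflates the standard deviation, so a z-score threshold of 3 may never trigger; the IQR rule is less sensitive to this.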

## Univariate Graphical EDA

- This technique helps us look graphically at the distribution of the sample.  
    Graphical methods are more qualitative. They help us to visualize the patterns in the dataset.
- Types of Graphs used to summarize and organize data:
    - Pie Charts
    - Bar graphs
    - Histograms
    - Line graphs
    - Box plots
    - Scatter plots
    - Dot plots
    
<b>Statistical Graphs - Pie Charts:</b>
- A pie chart is a circular statistical graphical chart, which is divided into slices to illustrate numerical proportions.
- In a pie chart, the central angle, area and arc length of each slice are proportional to the quantity or percentage it represents.
- Pie charts are useful when the dataset has a limited number of categories.  
    If the number of categories is greater than 7, the pie chart may not give an effective visualization.
<img src="assets/Pie-Chart.png" width="500">

<br>

<b>Statistical Graphs - Bar Graphs:</b>
- A bar graph consists of horizontal or vertical bars that are separated from each other.  
    Horizontal bars: Bar graphs  
    Vertical bars: Column graphs
- Although they look similar, bar graphs and histograms have one important difference -  
    Discrete data is plotted on a Bar Graph &  
    Continuous data is plotted on a Histogram.
<img src="assets/Bar-Graph.png" width="400">
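
The difference shows up in how the bar heights are computed. In this sketch (hypothetical data, with a bin width of 10 chosen arbitrarily), discrete values are counted as they are, while continuous values are first grouped into equal-width bins.

```python
from collections import Counter

# Discrete data: one bar per distinct value.
ratings = [3, 5, 4, 4, 5, 3, 4, 5, 5]            # hypothetical star ratings
bar_heights = Counter(ratings)

# Continuous data: values are grouped into equal-width bins first.
durations = [92.5, 101.0, 118.3, 95.2, 134.9, 107.7, 125.0, 99.4]  # hypothetical minutes
bin_width, start = 10, 90
hist_heights = Counter(
    start + bin_width * int((d - start) // bin_width) for d in durations
)

for value in sorted(bar_heights):
    print(f"rating {value}: {'#' * bar_heights[value]}")
for left in sorted(hist_heights):
    print(f"[{left}, {left + bin_width}): {'#' * hist_heights[left]}")
```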

<br>

<b>Statistical Graphs - Line Graphs:</b>
- A line graph is usually used to show how a quantity changes over a period of time.
- The line graph is often viewed as a time series graph and is one of the most popular graphs in quantitative univariate analysis.
<img src="assets/Line-Graph.png" width="400">

<br>

<b>Statistical Graphs - Box Plots:</b>
- Box plots are used for detecting and illustrating location and variation changes between different groups of data.
- Box plots are very good at presenting information about how tightly data are grouped, the central tendency and skewness as well as outliers.
<img src="assets/Box-Plot.png" width="600">

<br>

<b>Statistical Graphs - Dot Plots:</b>
- Dot plots use dots to show the data values (or scores) in a distribution and are used for relatively small datasets.
- A dot plot is a simple histogram-like chart that uses dots instead of bars.
<img src="assets/Dot-Plot.png" width="400">


## Multivariate Non-Graphical EDA

- Like univariate analysis, multivariate analysis involves both computing summary statistics and producing visual displays.
- The appropriate type of multivariate analysis depends on the nature of the data, numerical or categorical.
- Under Multivariate Non-Graphical EDA:
    - Cross-tabulation
    - Calculate correlation and covariance
- Under Multivariate Graphical EDA:
    - Correlation using scatter plots
    - Correlation using heat maps
    
<b>Cross-Tabulation:</b>
- For categorical data (and numerical data with only a few distinct values), cross-tabulation is performed by making a two-way table. The column headings match the levels of one variable and the row headings match the levels of the other. Each cell is then filled with the count of all subjects that share that pair of levels.
<img src="assets/Cross-Tabulation.png" width="500">
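
A two-way table of this kind might be built as in the sketch below. The `(plan, device)` records are hypothetical; with pandas, `pandas.crosstab` would be the usual one-line equivalent.

```python
from collections import Counter

# Hypothetical records: one (plan, device) pair per subject.
records = [
    ("Basic", "Mobile"), ("Premium", "TV"), ("Basic", "Mobile"),
    ("Premium", "Mobile"), ("Basic", "TV"), ("Premium", "TV"),
]

pair_counts = Counter(records)
rows = sorted({plan for plan, _ in records})      # levels of the first variable
cols = sorted({device for _, device in records})  # levels of the second variable

# table[row][col] = number of subjects sharing that pair of levels
table = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}

print(f"{'':<10}" + "".join(f"{c:>8}" for c in cols))
for r in rows:
    print(f"{r:<10}" + "".join(f"{table[r][c]:>8}" for c in cols))
```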

<br>


<b>Covariance:</b>
- Covariance indicates whether the variables are positively or negatively related.
- Covariance indicates the direction, but not the strength, of the linear relationship between variables.
- Covariance has units (the product of the units of the two variables).
- Formula:  
    $\mathrm{cov}_{x,y} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$
<img src="assets/Covariance.png" width="400">

<b>Correlation:</b>
- Correlation indicates the degree to which the variables are related.
- Correlation measures both the strength and the direction of the linear relationship between two variables.
- Correlation is dimensionless.
- Formula:  
    $r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2 \sum_{i=1}^{N}(y_i - \bar{y})^2}}$
<img src="assets/Correlation.png" width="400">

Correlation is a function of covariance: it is the covariance divided by the product of the two standard deviations.  
Covariance & Correlation help analyze the linear relationship between variables.
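
The sketch below implements both quantities from their definitions and illustrates why only covariance has units. The height and weight values are hypothetical, and the function names are introduced for this example.

```python
from math import sqrt


def covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson r: covariance scaled by both sample standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


height_m = [1.50, 1.60, 1.70, 1.80, 1.90]   # hypothetical heights in metres
weight_kg = [52, 58, 66, 75, 80]            # hypothetical weights in kilograms
height_cm = [h * 100 for h in height_m]     # same heights, different unit

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
r_m, r_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)

print(f"covariance:  {cov_m:.3f} (m*kg) vs {cov_cm:.1f} (cm*kg)")
print(f"correlation: {r_m:.4f} vs {r_cm:.4f}")
```

Changing the unit of height multiplies the covariance by 100 but leaves the correlation unchanged.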

<br>

<b>Association:</b>
- Represents any relationship between two variables.
- This connection can be much less strictly defined than correlation.
- e.g.: A certain smell can remind someone of their home, or a certain sound of another event that was important to them.

<b>Confounding Variables:</b>
- Sometimes there is a third variable that is not accounted for, that can affect the relationship between the two variables under study.
- Suppose a researcher collects data on ice-cream sales and shark attacks and finds that the two variables are highly correlated. It is unlikely that one causes the other.
- The more likely cause is the confounding variable, temperature.
- When it is warmer outside, more people buy ice-cream and more people go into the ocean.
- Requirements for Confounding Variables:
    - It must be correlated with the independent variable.
    - It must have a causal relationship with the dependent variable.
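
The ice-cream example can be imitated with synthetic data. Everything in the sketch below is simulated under assumed coefficients and noise levels (no real sales or attack figures are used): temperature drives both series, and the correlation between them largely disappears once the linear effect of temperature is removed from each.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the simulation is repeatable


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Synthetic data: both series depend on temperature, not on each other.
temperature = [random.uniform(15, 35) for _ in range(500)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
shark_attacks = [0.5 * t + random.gauss(0, 2) for t in temperature]

raw_r = pearson_r(ice_cream, shark_attacks)
adjusted_r = pearson_r(residuals(ice_cream, temperature),
                       residuals(shark_attacks, temperature))

print(f"raw correlation:               {raw_r:.2f}")
print(f"after removing temperature:    {adjusted_r:.2f}")
```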
    
<b>Causation:</b>
- Causation means that a change in one variable directly causes a change in the other.
- e.g.: Smoking and Cancer.
- The number of studies and the collective evidence give a strong indication that lung cancer is causally related to tobacco consumption.
- Cigarette smoking directly increases the relative risk of lung cancer.
- Causality implies association, but association does not imply causality.
<img src="assets/Causation.png" width="400">

<br>

<b>Correlation Matrix:</b>
- A correlation matrix is a matrix in which the (i, j)ᵗʰ entry is the correlation between the iᵗʰ and jᵗʰ variables of the dataset.
- The value (r) provides the strength and direction of association as follows:
    - Value of r ranges from -1 to 1.
    - A positive value indicates a positive association and a negative value indicates a negative association.
    - The association is strongest when |r| = 1 and weakens as r approaches 0.
<img src="assets/Correlation-Matrix.png" width="500">
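
A correlation matrix can be assembled by computing r for every pair of columns, as in this sketch. The column names and values are hypothetical; with pandas, `DataFrame.corr` would produce the same kind of table.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical dataset: three numerical columns of equal length.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 7, 6, 4, 2],
}
names = list(columns)

# Entry (i, j) is the correlation between column i and column j.
corr_matrix = [[pearson_r(columns[a], columns[b]) for b in names] for a in names]

print(f"{'':<14}" + "".join(f"{n:>15}" for n in names))
for name, row in zip(names, corr_matrix):
    print(f"{name:<14}" + "".join(f"{r:>15.3f}" for r in row))
```

The diagonal is always 1 and the matrix is symmetric, so only one triangle carries information.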

## Multivariate Graphical EDA

Multivariate Graphical EDA techniques include:
- Scatter plots
- Heat maps

<b>Scatter Plot:</b>
- A scatter plot represents individual pieces of data using dots.
- These plots make it easier to see if two variables are related to each other.
- The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between two variables.
- A scatter plot is the visual counterpart of a correlation coefficient; a scatter plot matrix is the visual counterpart of a correlation matrix.
<img src="assets/Scatter-Plot.png" width="400">

<br>

<b>Scatter Plot Matrix:</b>
<img src="assets/Pair-Plot.png" width="900">

<br>

<b>Heat Maps:</b>
- A data visualization technique that shows magnitude of a phenomenon as color in two dimensions.
- Heat maps are used to show relationships between two variables, one plotted on each axis.
- In EDA, a heat map of the correlation matrix depicts the association between two or more variables.
<img src="assets/Heat-Map.png" width="400">

## Steps in EDA

<b>i. Variable Identification:</b>
- In this step, we identify every variable by discovering its type.
- According to our needs, we can change the datatype of any variable.
<img src='assets/Variable-Identification.png' width='300'>


<b>ii. Univariate Analysis:</b>
- Here, we study individual characteristics of every feature/variable available in the dataset.
- Continuous Variable:
    - Histogram, KDE, Boxplot & Q-Q plot
- Categorical Variable:
    - Bar plot, Pie chart, Frequency table


<b>iii. Bi/Multivariate Analysis:</b>
- Here, we study the relationship between any two or more variables which can be categorical-continuous, categorical-categorical or continuous-continuous.
- Continuous-Continuous:  
    Scatterplot, Heatmap, Jointplot, Pairplot.
- Categorical-Continuous:  
    Factor plot, Swarm plot, Violin plot, Strip plot.
- Categorical-Categorical:  
    Cross-Tabulation, Stacked Bar, Bar chart.


<b>iv. Missing value Treatment:</b>
- If we don't treat the missing values, they can distort the patterns in the data, which in turn can degrade the model's performance.
- For a numerical column, we usually fill the missing values with the mean of that column.
- If the column has outliers, we fill the missing values with the median instead, since the median is less affected by extreme values.
- If the column holds string (or categorical) values, we usually fill the missing values with the mode.
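
The three filling rules can be sketched with one small helper. The `impute` function and the sample columns are hypothetical, and missing values are represented by `None` as an assumption of this example; with pandas, `fillna` would typically be used instead.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [22, 25, None, 24, 23]               # numerical, no outliers -> mean
incomes = [30, 32, None, 31, 400]           # numerical with an outlier -> median
cities = ["Pune", None, "Delhi", "Pune"]    # categorical -> mode

print(impute(ages, "mean"))
print(impute(incomes, "median"))
print(impute(cities, "mode"))
```

For the income column, the mean of the observed values would be pulled far above the typical value by the single outlier, which is why the median is the safer fill there.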


<b>v. Outlier Removal:</b>
- Outliers distort summary statistics such as the mean and standard deviation.
- The presence of outliers can degrade the model's performance.
- Hence, they must be treated appropriately.

### Summary of Graphical EDA

<b>EDA Techniques based on Variables:</b>
<img src='assets/EDA-based-on-Variables.png' width='450'>

<br>

<b>EDA Techniques based on the Objectives of Analysis:</b>
<img src='assets/EDA-based-on-Objectives-of-Analysis.png' width='500'>