# EDA Assumptions


## Section 02: EDA Assumptions

Written by: [KV Subbaih Setty](https://www.kvssetty.com).

This series of articles is inspired by: https://doi.org/10.18434/M32189

**This series of articles and tutorials is divided into the following five sections:**

* [Section One:   EDA Introduction.](https://medium.com/nerd-for-tech/what-general-types-of-problems-eda-approach-can-solve-71d84a766975)
* **Section Two:   EDA Assumptions.** (This Section)
* Section Three: EDA techniques.
* Section Four:  EDA case studies.
* Section Five:  EDA Python libraries.

Each section is further subdivided into a number of topics, and each topic is covered in a separate tutorial.

### Introduction to EDA Assumptions


The range of scientific, engineering, financial, social, manufacturing, health care and other experimentation that feeds data analysis is virtually limitless.
In this sea of diversity, is there any common basis that allows the analyst to systematically and validly arrive at supportable, repeatable decisions and conclusions?

Fortunately, there is such a basis, and it is rooted in the fact that every measurement process, however complicated, has certain underlying assumptions. This section deals with what those assumptions are, why they are important, how to go about testing their validity, and what the consequences are if the assumptions do not hold.

### Topics of Section 02: EDA Assumptions

1. **Underlying Assumptions** (This Article)
2. **Importance of Assumptions** (This Article)
3. Testing Assumptions
4. Importance of Graphs
5. Consequences if assumptions do not hold.

# Underlying Assumptions

### **Assumptions Underlying a Measurement Process (dataset observations)**

There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":
1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location; and
4. with the distribution having fixed variation.

For example, consider the **univariate (single response variable)** case.

The simplest problem type is univariate; that is, a single variable. For the univariate problem, the general model

**response = deterministic component + random component**

becomes

**response = constant + error**

### **Assumptions for Univariate Model**

The *fixed location* referred to in item 3 above differs for different problem types.

For this case, the *fixed location* is simply the unknown constant. We can thus imagine the process at hand to be operating under constant conditions that produce a single column of data with the properties that
- the data are uncorrelated with one another;
- the deterministic component consists of only a constant;
- the random component has a fixed distribution; and
- the random component has fixed variation.
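
The univariate model above can be illustrated with a small simulation. The sketch below is not from the original series: it assumes normally distributed errors, and the values (`constant=10.0`, `sigma=2.0`) and helper names (`simulate_univariate`, `lag1_autocorrelation`, `half_summaries`) are hypothetical. It generates data from response = constant + error and then makes two rough numerical checks: the lag-1 autocorrelation should be near zero if the data are uncorrelated, and the mean and standard deviation of the two halves should be similar if location and variation are fixed.

```python
import random
import statistics


def simulate_univariate(constant, sigma, n, seed=0):
    """Generate n observations of response = constant + error.

    Assumption: the error is drawn from a normal distribution with
    mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)
    return [constant + rng.gauss(0.0, sigma) for _ in range(n)]


def lag1_autocorrelation(data):
    """Sample lag-1 autocorrelation; close to 0 for uncorrelated data."""
    mean = statistics.fmean(data)
    num = sum((a - mean) * (b - mean) for a, b in zip(data, data[1:]))
    den = sum((v - mean) ** 2 for v in data)
    return num / den


def half_summaries(data):
    """Return (mean, stdev) for the first half and for the second half."""
    mid = len(data) // 2
    first, second = data[:mid], data[mid:]
    return (
        (statistics.fmean(first), statistics.stdev(first)),
        (statistics.fmean(second), statistics.stdev(second)),
    )


# Hypothetical process: constant 10.0, error standard deviation 2.0.
y = simulate_univariate(constant=10.0, sigma=2.0, n=1000)
(first_mean, first_sd), (second_mean, second_sd) = half_summaries(y)
r1 = lag1_autocorrelation(y)

print(f"lag-1 autocorrelation: {r1:.3f}")
print(f"first half : mean={first_mean:.2f}, stdev={first_sd:.2f}")
print(f"second half: mean={second_mean:.2f}, stdev={second_sd:.2f}")
```

These checks are a crude numerical screen rather than a formal test, but they show what "fixed location" and "fixed variation" mean in practice: the summaries should not depend on which part of the run you look at.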

### **Extending to a Function of Many Variables (Multivariate)**

The power and importance of the univariate model lie in the fact that it can easily be extended to the more general case where the deterministic component is not just a constant, but is in fact a function of many variables, as shown below:

**response = deterministic component + random component**

becomes

**response = function(variable1, variable2, variable3, ...) + error**

And the engineering objective is to **characterize and model the function**.




### **Residuals Will Behave According to Univariate Assumptions**

The key point is that regardless of how many features/variables there are, and regardless of how complicated the function is, if the ML engineer succeeds in choosing a good model, then the differences (residuals) between the raw response data (ground truth data) and the predicted values from the fitted model should themselves behave like a univariate process. Furthermore, the residuals from this univariate process fit will behave like:
- random drawings;
- from a fixed distribution;
- with fixed location (namely, 0 in this case); and
- with fixed variation.

### **Validation of Model**

Thus, if the **residuals from the fitted model** do in fact behave like the ideal, then testing of the underlying assumptions becomes a tool for validating the chosen model and assessing its quality of fit. On the other hand, if the residuals from the chosen fitted model violate one or more of the above univariate assumptions, then the chosen fitted model is inadequate and an opportunity exists for arriving at an improved model.
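
A minimal sketch of this idea, under stated assumptions: the data are hypothetical (generated as response = 2 + 3 * x + error with normal errors), and the helpers `fit_line` and `lag1_autocorrelation` are illustrative names, not part of the original series. Two candidate models are fitted to the same data, and the lag-1 autocorrelation of each set of residuals is used as one rough indicator of whether the residuals still behave like random drawings.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def lag1_autocorrelation(data):
    """Sample lag-1 autocorrelation; close to 0 for uncorrelated data."""
    mean = statistics.fmean(data)
    num = sum((a - mean) * (b - mean) for a, b in zip(data, data[1:]))
    den = sum((v - mean) ** 2 for v in data)
    return num / den


# Hypothetical data: response = 2 + 3 * x + error, with x in increasing order.
rng = random.Random(42)
x = [i / 10 for i in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Candidate A (inadequate here): response = constant + error.
constant = statistics.fmean(y)
residuals_a = [yi - constant for yi in y]

# Candidate B (matches how the data were generated): straight line in x.
intercept, slope = fit_line(x, y)
residuals_b = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

r1_a = lag1_autocorrelation(residuals_a)
r1_b = lag1_autocorrelation(residuals_b)

print(f"constant-only model: lag-1 autocorrelation of residuals = {r1_a:.3f}")
print(f"straight-line model: lag-1 autocorrelation of residuals = {r1_b:.3f}")
print(f"mean residual of straight-line model = {statistics.fmean(residuals_b):.2e}")
```

The residuals of the constant-only model inherit the trend in `x`, so neighbouring residuals are strongly correlated and the "random drawings" assumption fails. The straight-line residuals have a mean of essentially 0 and little serial correlation, which is what the univariate assumptions lead us to expect from an adequate fit.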

# Importance of Assumptions

### **Predictability and Statistical Control**
Predictability is an all-important goal in science and engineering. If the four underlying assumptions hold, then we have achieved probabilistic predictability, that is, the ability to make probability statements not only about the process in the past, but also about the process in the future. In short, such processes are said to be **in statistical control**.

### **Validity of Engineering and Business Conclusions and Decisions**
Moreover, if the four assumptions are valid, then the process is amenable to the generation of valid scientific, engineering and business conclusions and decisions. If the four assumptions are not valid, then the process is drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A simple characterization of such a process by a location estimate, a variation estimate, or a distribution estimate inevitably leads to engineering conclusions that are not valid, not supportable (scientifically or legally), and not repeatable in the laboratory or survey.
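
The contrast between an in-control process and a drifting one can be sketched with a small simulation. Everything here is an illustrative assumption rather than something taken from the original series: the two hypothetical processes, the normal errors, the drift of 0.01 per observation, the helper name `coverage_of_future`, and the choice of a band of mean ± 3 standard deviations. The band is estimated from the first half of each run ("the past") and then checked against the second half ("the future").

```python
import random
import statistics


def coverage_of_future(data):
    """Fraction of second-half points inside mean +/- 3 * stdev of the first half."""
    mid = len(data) // 2
    past, future = data[:mid], data[mid:]
    center = statistics.fmean(past)
    spread = statistics.stdev(past)
    lower, upper = center - 3 * spread, center + 3 * spread
    inside = sum(lower <= v <= upper for v in future)
    return inside / len(future)


rng = random.Random(7)
n = 1000

# Hypothetical stable process: fixed location 50.0, fixed variation 1.0.
stable = [50.0 + rng.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical drifting process: location rises by 0.01 per observation.
drifting = [50.0 + 0.01 * i + rng.gauss(0.0, 1.0) for i in range(n)]

stable_coverage = coverage_of_future(stable)
drifting_coverage = coverage_of_future(drifting)

print(f"stable process  : {stable_coverage:.1%} of future points inside the band")
print(f"drifting process: {drifting_coverage:.1%} of future points inside the band")
```

For the stable process, a band built from past data keeps describing future data, which is what probabilistic predictability means. For the drifting process, the same single location and variation estimate stops describing the data as the location moves away, so conclusions based on that estimate are not repeatable.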

In the next article, we will discuss techniques for testing the above four assumptions using graphical methods.


The source code for all the tutorials in this series can be found in my GitHub [repository](https://github.com/KVSSetty/Draft-of-EDA-Hand-book-by-kvs).