# Exploratory Data Analysis:
Starting today, we are going to explore in depth the first and most important phase of the machine-learning pipeline: Exploratory Data Analysis (EDA). EDA is the process of identifying and sorting out insights from a given dataset from an analytical point of view. This gives us a good grip on what our data represents, how well each field performs statistically, and how many values need to be transformed to get good results. Exploratory Data Analysis can be divided into steps that take place one after another, in the order illustrated in the figure. (This classification is taken from the article [Advanced exploratory data analysis (EDA) by Michael P. Notter](https://miykael.github.io/blog/2022/advanced_eda/).)

![image-2.png](attachment:image-2.png)

Let's cover each of these steps briefly:

### 1. Structural Investigation:
The first step of EDA deals with the structure of the data being analyzed. The structure of the data refers to the shape and data types of the tabular data. Structural Investigation can comprise different sub-steps such as:

* __Computing Shape of Data:__ Finding the total number of records and fields in the provided data. This may later help us judge the purity of the dataset, e.g. what percentage of the provided data was useful for drawing insights or usable for prediction purposes.

* __Investigating Attribute Datatypes:__ Finding the different datatypes carried by the attributes. This helps us identify the type of data we are dealing with, i.e. how many varieties of would-be features are in place and the ratio between the qualitative and quantitative variables in the given data.

* __Classification of Data Variables:__ One of the most crucial insights to get when analyzing the data is the kind of data represented by each attribute (remember that by kind, I mean the kind of variable it is, not the datatype used to store it). After this step, we know how many numeric and non-numeric variables are present and, furthermore, which of the numeric attributes hold nominal, ordinal or continuous values. Similarly, we can classify the non-numeric variables into classes such as datetimes, URLs, text passages or some other type of object. This gives us a good grip on what kind of preprocessing/analysis techniques should be applied to each variable or combination of variables to get the most out of the result.
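
The three sub-steps above can be sketched in a few lines. This is only a minimal illustration, assuming the data lives in a pandas `DataFrame` (the text does not name a library); the dataset and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [34, 27, 45, 27],
    "plan": ["basic", "premium", "basic", "basic"],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-02-11", "2023-03-20"],
    "monthly_spend": [19.99, 49.50, 22.00, 18.75],
})

# Shape of the data: (number of records, number of fields).
n_rows, n_cols = df.shape

# Datatypes that pandas inferred for each attribute.
print(df.dtypes)

# A first, rough split into numeric and non-numeric attributes.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

# The kind of a variable can differ from its datatype:
# customer_id is stored as a number but is unique per record,
# so it behaves like a nominal identifier rather than a quantity.
id_like = df["customer_id"].is_unique

# signup_date is stored as text, but every value parses as a date.
parsed = pd.to_datetime(df["signup_date"], errors="coerce")
looks_like_date = bool(parsed.notna().all())

print(n_rows, n_cols)
print(numeric_cols, non_numeric_cols)
print(id_like, looks_like_date)
```

Note that nothing here modifies `df`; the parsed dates are only used to check what kind of variable `signup_date` is.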

### 2. Quality Investigation:
The second step of EDA deals with the quality of the data being analyzed. The quality of the data refers to the different ways in which the data is dirty. Quality Investigation can comprise different sub-steps such as:

* __Finding Duplicate Values:__ Duplicate values are a very common problem found in many datasets. In later stages, duplicate values result in bias towards particular record(s). There are different ways to deal with duplicate records, e.g. dropping them based on some subset of fields, improving the data source that produces them, etc.

* __Finding Missing Values:__ Missing values are another common impurity found in most datasets. Since a missing value carries no information, missing values must be dealt with before performing any descriptive analysis or predictions. Missing values come in different types, which call for different ways of dealing with them, e.g. dropping them, imputing them via statistical measures, etc.

* __Unique Values:__ Unique values tell us how many times a particular record or subset of a record occurs. This helps in determining whether the dataset is imbalanced and which particular sets of values are the most common among different records. Datasets with too many unique values, or with an imbalanced proportion of them, must be handled before analysis. To deal with such cases we can use measures such as data permutation, data sampling, discretization, etc.
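
As a rough sketch of these three checks, again assuming pandas and a made-up dataset (the column names are hypothetical), we can count duplicates, missing values and unique values without changing the data:

```python
import pandas as pd

# Hypothetical toy dataset with one repeated row and two missing entries.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104, 105, 106],
    "plan": ["basic", "premium", "premium", "basic", None, "basic"],
    "monthly_spend": [19.99, 49.50, 49.50, None, 22.00, 18.75],
})

# Duplicates: rows that repeat an earlier row exactly.
n_duplicate_rows = int(df.duplicated().sum())

# Missing values: how many entries are missing in each column.
missing_per_column = df.isna().sum()

# Unique values: number of distinct (non-missing) values per column.
unique_per_column = df.nunique()

# Share of each category, to spot an imbalanced variable.
plan_share = df["plan"].value_counts(normalize=True)

print(n_duplicate_rows)
print(missing_per_column)
print(unique_per_column)
print(plan_share)
```

To look for duplicates on only some fields, `df.duplicated(subset=["customer_id"])` restricts the comparison to the listed columns.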

### 3. Content Investigation:
The third and last step of EDA deals with the content depicted by the stored data. By content, I mean the statistical significance each variable/record carries for the predictive analysis phase. Content Investigation can comprise different sub-steps such as:

* __Computing Descriptive Statistics:__ Descriptive statistics such as the mean, median, variance, standard deviation, etc. are useful for analyzing the distribution of the variables in the dataset. They help us find out how widely the values of a particular attribute are spread, where they are centered and what trends they follow. In the first stage, descriptive statistics are computed as a univariate analysis.

* __Finding Outliers:__ Outliers are a very common occurrence and are detected with the help of statistical measures. The descriptive statistics provide a good approximation of which values might be outliers. Depending on the situation, outliers may be handled in different ways, e.g. dropping records, using a trimmed statistical measure, clipping, etc.

* __Finding Correlation:__ An important part of exploratory data analysis is finding the relationships between the different variables (be they bivariate or multivariate). Correlation gives you insight into which attributes are highly related to others. Highly correlated attributes carry redundant information and are therefore often candidates for removal, which can help a lot in later stages of the prediction process.
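
A minimal sketch of these sub-steps follows, once more assuming pandas and an invented dataset. The text above does not prescribe a specific outlier test, so the 1.5 × IQR rule and the 0.9 correlation threshold used here are my own illustrative choices, not the only options.

```python
import pandas as pd

# Hypothetical toy dataset: one extreme age, and height stored in two units.
heights_cm = [160, 172, 168, 181, 175, 190, 165, 170]
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 90],
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],
})

# Descriptive statistics (count, mean, std, min, quartiles, max) per column.
print(df.describe())

# The mean is pulled up by the extreme value, the median is not.
age_mean = df["age"].mean()
age_median = df["age"].median()

# Outliers via the 1.5 * IQR rule (an assumed, commonly used heuristic).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
age_outliers = df.loc[is_outlier, "age"].tolist()

# Pairwise Pearson correlation, then the pairs above an assumed threshold.
corr = df.corr()
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]

print(age_mean, age_median)
print(age_outliers)
print(high_pairs)
```

Here the outlier and the redundant pair are only reported; deciding what to do with them is left for later.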

### 4. Bonus Stage: Prediction Analysis:
Sometimes we can make a prediction without moving on to the machine learning phase at all. Such datasets are usually simple in nature, as statistical techniques alone provide insight into the values to be predicted. In most cases, however, machine learning models are used for this part, because computational hurdles prevent us from obtaining the predictions with basic statistical techniques, e.g. a large dataset, a dataset with many features, a dataset with complex hidden patterns, etc.
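
To make the simple case concrete, here is one possible sketch of a purely statistical prediction: looking up the median of the target within each group. The dataset, the column names and the helper `predict_spend` are all hypothetical, and a group median is just one example of such a technique.

```python
import pandas as pd

# Hypothetical toy dataset where the plan alone explains the spend well.
df = pd.DataFrame({
    "plan": ["basic", "basic", "premium", "premium", "basic"],
    "monthly_spend": [20.0, 22.0, 50.0, 48.0, 21.0],
})

# Typical spend per plan, computed with a plain group median.
typical_spend = df.groupby("plan")["monthly_spend"].median()


def predict_spend(plan: str) -> float:
    """Predict the spend of a new record from its plan (illustrative helper)."""
    return float(typical_spend[plan])


print(predict_spend("basic"))
print(predict_spend("premium"))
```

With many features or complex patterns, a lookup like this stops being practical, which is where machine learning models come in.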

One thing to note before I finish: although I have described the measures that are taken to get rid of the different problems identified in this process, remember that we do not apply any of them in EDA, as the purpose of EDA is just to investigate and not to take action on the data. All of these measures take place in the data preprocessing step, which acts on the insights we have gathered in the EDA phase as well as on the problem being targeted.

That's it for today! We will continue with each step and sub-step of EDA in detail from tomorrow!