# Exploratory Data Analysis Fundamentals

**Chapter overview**:
This chapter reviews the fundamental aspects of exploratory data analysis (EDA). We will look at why EDA matters, walk through the main data analysis tasks, and try to make sense of data. We will also use Python to explore different types of data, including numerical, time-series, geospatial, and categorical data.

**EDA**: The process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures. The primary aim of EDA is to examine what the data can tell us before we go through formal modeling or hypothesis formulation.

**Chapter topics**
- Understanding data science.
- The significance of EDA.
- Making sense of data.
- Comparing EDA with classical and Bayesian analysis.

## Understanding data science

**Stages of data analysis**

- **Data requirements**: This stage determines what type of data the organization needs to collect, curate, and store. The data must also be categorized as numerical or categorical, and the format for storage and dissemination must be decided.
- **Data collection**: Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within the company. Data can be collected from several objects during several events using different types of sensors and storage tools.
- **Data processing**: Preprocessing involves curating the dataset before the actual analysis. Common tasks include correctly exporting the dataset, placing the data in the right tables, structuring it, and exporting it in the correct format.
- **Data cleaning**: Preprocessed data is still not ready for detailed analysis. It must first be checked for incompleteness, duplicates, errors, and missing values. These tasks are performed in the data cleaning stage, which involves matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in missing values. Data cleaning depends on the types of data under study. Finding data issues in a dataset requires applying some analytical techniques, which we call *data transformation*.
- **EDA**: This is the stage where we actually start to understand the message contained in the data. Several types of data transformation techniques might be required during the process of exploration.
- **Modelling and algorithm**: In general, a model describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables. An example of inferential statistics is regression analysis.
- **Data product**: Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product. A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.
- **Communication**: This stage deals with disseminating the results to end *stakeholders* so that they can use them for *business intelligence*. One of the most notable steps in this stage is data visualization, which relies on information relay techniques such as tables, charts, and summary diagrams to show the analyzed results.
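
The data cleaning stage described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a definitive cleaning pipeline: the records, the field names (`id`, `name`, `weight_kg`), and the choice of median imputation are all assumptions made for the example.

```python
from statistics import median

# Hypothetical raw records; None marks a missing weight.
raw_records = [
    {"id": 1, "name": "Ann", "weight_kg": 61.0},
    {"id": 2, "name": "Bob", "weight_kg": None},
    {"id": 2, "name": "Bob", "weight_kg": None},  # exact duplicate
    {"id": 3, "name": "Cid", "weight_kg": 75.0},
    {"id": 4, "name": "Dee", "weight_kg": 68.0},
]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def fill_missing(records, field):
    """Return new records where None in `field` is replaced by the median
    of the observed values (one possible imputation strategy)."""
    observed = [r[field] for r in records if r[field] is not None]
    fill_value = median(observed)
    return [
        {**r, field: fill_value if r[field] is None else r[field]}
        for r in records
    ]


cleaned = fill_missing(drop_duplicates(raw_records), "weight_kg")
for row in cleaned:
    print(row)
```

Both helpers return new lists rather than modifying `raw_records`, so the raw data stays available for comparison after cleaning.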

## The significance of EDA

To be certain of the insights that the collected data provides, and to make further decisions, we perform data mining, which involves several distinct analysis processes. Exploratory data analysis is key, and is usually the first exercise in data mining. It allows us to visualize data in order to understand it and to create hypotheses for further analysis. Exploratory analysis centers on creating a synopsis of the data, or insights, for the next steps in a data mining project. EDA reveals the ground truth about the content without making any underlying assumptions. This is why data scientists use this process to understand what types of models and hypotheses can be created. Key components of EDA include *summarizing data*, *statistical analysis*, and *visualization of data*.

### Steps in EDA

- **Problem definition**: The problem definition is the driving force behind the execution of a data analysis plan. The main tasks involved in problem definition are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing a cost/benefit analysis.
- **Data preparation**: This step involves methods for preparing the dataset before the actual analysis. In this step, we define the sources of the data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into the required chunks for analysis.
- **Data analysis**: This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve *summarizing the data*, *finding the hidden correlation and relationships among the data*, *developing predictive models*, *evaluating the models*, and *calculating the accuracies*. Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
- **Development and representation of the results**: This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. This is also an essential step, because the results of the analysis should be interpretable by the business stakeholders, which is one of the major goals of EDA. Common graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
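
As a small illustration of the *summarizing the data* task in the data analysis step, the sketch below computes a few descriptive statistics with Python's built-in `statistics` module. The sample values and the `summarize` helper are hypothetical, and the quartiles use the module's default method, so other tools may report slightly different quartile values.

```python
from statistics import mean, quantiles, stdev

# Hypothetical sample: body weights in kilograms.
weights_kg = [61.0, 68.0, 75.0, 59.5, 82.0, 70.5, 66.0]


def summarize(values):
    """Return a small summary table (as a dict) for one numerical variable."""
    q1, q2, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
    }


summary = summarize(weights_kg)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

The minimum, quartiles, and maximum are the same five numbers that a box plot draws, so a table like this is a natural companion to the graphical techniques listed above.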

## Making sense of data

Different disciplines store different kinds of data for different purposes. A dataset contains many observations about a particular object. For instance, a dataset about patients in a hospital can contain many observations. A *patient* can be *described* by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a *variable*. Each observation (a patient) can have a specific value for each of these variables.

Most of this data is stored in some sort of database management system in tables/schemas. Most datasets broadly fall into two groups: numerical and categorical data.

Understanding the type of data is relevant in understanding what type of computation you can perform, what type of model you should fit on the dataset, and what type of visualization you can generate. 

### Numerical data


This data is often referred to as quantitative data in statistics. Numerical data can be either discrete or continuous.
- **Discrete data**: This is data that is countable and its values can be listed out (finite possible values). 
- **Continuous data**: A variable that can have an infinite number of numerical values within a specific range is classified as continuous data. Continuous data can follow an interval measure of scale or ratio measure of scale. 

### Categorical data

This data is often referred to as qualitative data in statistics. A variable describing categorical data is referred to as a categorical variable. These types of variables can take one of a limited number of values. There are different types of categorical variables:
- A **binary categorical variable** can take exactly two values and is also referred to as a **dichotomous variable**.
- **Polytomous variables** are categorical variables that can take more than two possible values.

Most categorical datasets follow either nominal or ordinal measurement scales.
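
To make the numerical/categorical distinction concrete, here is a rough sketch that classifies each variable of a small patient dataset. The records, field names, and the `describe_variable` helper are hypothetical, and the rules are deliberately naive: integers are assumed to be counts (discrete), floats are assumed to be measurements (continuous), and the binary/polytomous decision is based only on the labels that happen to be observed. Real datasets need domain knowledge as well, since, for example, a numeric patient ID is a label rather than a quantity.

```python
from numbers import Number

# Hypothetical patient observations; every field name is illustrative.
patients = [
    {"gender": "Female", "blood_group": "A", "visits": 3, "weight_kg": 61.2},
    {"gender": "Male", "blood_group": "O", "visits": 1, "weight_kg": 82.5},
    {"gender": "Female", "blood_group": "B", "visits": 4, "weight_kg": 70.1},
]


def describe_variable(values):
    """Classify one variable with a simple type-based heuristic."""
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        # Assumption: whole-number columns are counts, so treat them as discrete.
        if all(isinstance(v, int) for v in values):
            return "numerical (discrete)"
        return "numerical (continuous)"
    # Non-numeric values are treated as category labels.
    if len(set(values)) > 2:
        return "categorical (polytomous)"
    # Assumption: two or fewer observed labels means a binary variable.
    return "categorical (binary)"


kinds = {
    name: describe_variable([p[name] for p in patients])
    for name in patients[0]
}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```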

### Measurement scales

There are four different types of measurement scales described in statistics: nominal, ordinal, interval, and ratio. These scales are used mostly in academic settings.

$$
\begin{array}{|c|c|c|c|c|}
\hline
\textbf{Provides: } & \textbf{Nominal} & \textbf{Ordinal} & \textbf{Interval} & \textbf{Ratio} \\ \hline
\text{The "order" of values is known} &  & ✓ & ✓ & ✓ \\ \hline
\text{"Counts", aka "Frequency of Distribution"} & ✓ & ✓ & ✓ & ✓ \\ \hline
\text{Mode} & ✓ & ✓ & ✓ & ✓ \\ \hline
\text{Median} &  & ✓ & ✓ & ✓ \\ \hline
\text{Mean} &  &  & ✓ & ✓ \\ \hline
\text{Can quantify the difference between each value} &  &  & ✓ & ✓ \\ \hline
\text{Can add or subtract values} &  &  & ✓ & ✓ \\ \hline
\text{Can multiply and divide values} &  &  &  & ✓ \\ \hline
\text{Has "true zero"} &  &  & & ✓ \\ \hline
\end{array}
$$


#### Nominal

These are used for labeling variables without any quantitative value. The scale values are generally referred to as **labels**. They are mutually exclusive and do not carry any numerical importance. Examples:
- What is your gender?
    - Male
    - Female
    - Third gender/Non binary
    - I prefer not to answer
    - Other

Other examples include the following: the languages that are spoken in a particular country, biological species, parts of speech in grammar (noun, pronoun, adjective, and so on), and taxonomic domains in biology (Archaea, Bacteria, and Eukarya).

Nominal scales are considered qualitative scales and the measurements that are taken using qualitative scales are considered qualitative data. No form of arithmetic calculation can be made on nominal measures. 

In the case of a nominal dataset, you can certainly know the following:
- **Frequency** is the number of times a label occurs within the dataset.
- **Proportion** can be calculated by dividing the frequency by the total number of events.
- Then, you could compute the **percentage** from each proportion.
- And to **visualize** the nominal dataset, you can use either a pie chart or a bar chart.
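
The frequency, proportion, and percentage calculations listed above can be sketched with `collections.Counter` from the standard library. The survey responses are made up for illustration.

```python
from collections import Counter

# Hypothetical answers to the "What is your gender?" question.
responses = [
    "Female", "Male", "Female", "Other",
    "Male", "Female", "I prefer not to answer", "Female",
]

# Frequency: how many times each label occurs.
frequency = Counter(responses)
total = sum(frequency.values())

# Proportion: frequency divided by the total number of responses.
proportion = {label: count / total for label, count in frequency.items()}

# Percentage: proportion scaled to 100.
percentage = {label: 100 * p for label, p in proportion.items()}

# The mode is the only measure of central tendency defined for nominal data.
mode_label = frequency.most_common(1)[0][0]

for label in frequency:
    print(f"{label}: n={frequency[label]}, {percentage[label]:.1f}%")
print("mode:", mode_label)
```

The `frequency` mapping is also the input a bar chart or pie chart would need: one bar or slice per label, sized by its count.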


#### Ordinal

The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a significant factor. For example, the responses to the statement *WordPress is making content managers' lives easier* are scaled down to five different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. Scales like these are referred to as *Likert scales*.

Consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on). The **median** item is allowed as the measure of central tendency; however, the **average** is not permitted. 
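
As a sketch of why the median works on ordinal data, the example below maps each Likert label to its rank, finds the middle rank, and maps it back to a label. The answers are invented, and `median_low` is used here as one reasonable choice because it always returns a value that actually occurs in the data, whereas an interpolated median between two ranks would not correspond to any label.

```python
from statistics import median_low

# Likert labels listed from lowest to highest; the position defines the rank.
LIKERT = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
rank = {label: position for position, label in enumerate(LIKERT)}

# Hypothetical survey answers.
answers = ["Agree", "Neutral", "Strongly Agree", "Agree", "Disagree", "Agree"]

# Only the order of the ranks is used; the gaps between them carry no meaning.
median_rank = median_low(rank[a] for a in answers)
median_label = LIKERT[median_rank]

print("median response:", median_label)
```

Averaging the ranks would implicitly assume that the distance between Disagree and Neutral equals the distance between Agree and Strongly Agree, which an ordinal scale does not guarantee.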



#### Interval

In interval scales, both the order and exact differences between the values are significant. Interval scales are widely used in statistics, for example, in measures of central tendency (mean, median, and mode) and in standard deviations. Examples include location in Cartesian coordinates and direction measured in degrees from magnetic north. The mean, median, and mode are allowed on interval data.

#### Ratio

Ratio scales contain order, exact values, and an absolute zero, which makes it possible to use them in descriptive and inferential statistics. These scales provide numerous possibilities for statistical analysis. Mathematical operations, measures of central tendency, the **measure of dispersion**, and the **coefficient of variation** can all be computed from such scales. Examples include measures of energy, mass, length, duration, electrical energy, plane angle, and volume.
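
The coefficient of variation mentioned above is the standard deviation divided by the mean, which only makes sense when the scale has a true zero. The following sketch, using made-up mass measurements and a hypothetical `coefficient_of_variation` helper, shows that both the coefficient and the ratio between two values stay the same when the unit changes.

```python
import math
from statistics import mean, stdev

# Hypothetical mass measurements in kilograms (a ratio-scale quantity).
masses_kg = [2.0, 4.0, 4.0, 5.0, 5.0]
masses_g = [m * 1000 for m in masses_kg]  # same data expressed in grams


def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean (unitless)."""
    return stdev(values) / mean(values)


cv_kg = coefficient_of_variation(masses_kg)
cv_g = coefficient_of_variation(masses_g)

# Ratios are meaningful on a ratio scale: 4 kg is twice 2 kg in any unit.
ratio_kg = masses_kg[1] / masses_kg[0]
ratio_g = masses_g[1] / masses_g[0]

print(f"CV in kg: {cv_kg:.4f}, CV in g: {cv_g:.4f}")
print(f"ratio in kg: {ratio_kg}, ratio in g: {ratio_g}")
```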

## Comparing EDA with classical and Bayesian analysis

There are several approaches to data analysis. Three of them are:

- **Classical data analysis**: In the classical data analysis approach, the problem definition and data collection steps are followed by model development, which is followed by analysis and result communication.

- **Exploratory data analysis approach**: The EDA approach follows the same steps as classical data analysis, except that the model imposition and data analysis steps are swapped. The main focus is on the data, its structure, outliers, models, and visualizations. Generally, in EDA we do not impose any deterministic or probabilistic models on the data.

- **Bayesian data analysis approach**: The Bayesian approach incorporates *prior probability distribution* knowledge into the analysis steps, after model development and before data analysis. The prior probability distribution of any quantity expresses the belief about that quantity before considering some evidence.
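
To illustrate how a prior belief is combined with evidence in the Bayesian approach, here is a minimal sketch using a Beta prior for an unknown success probability and binomial data, a standard textbook pairing in which the update reduces to adding counts. The prior parameters, the observed counts, and the helper names are all assumptions chosen for the example.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial outcomes."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior belief: the success probability is probably near 0.5.
prior = (2.0, 2.0)

# Hypothetical evidence: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 4))
```

The posterior mean lands between the prior mean and the observed success rate of 0.7, which shows how the prior tempers the evidence when the sample is small.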

Data analysts and data scientists freely mix the steps mentioned in the preceding approaches to get meaningful insights from the data. In addition, it is inherently difficult to judge or estimate which approach is best for data analysis. Each has its own paradigm and is suitable for different types of data analysis.