# Introduction to the ML Process

The typical Machine Learning process consists of the following steps:
- Gathering data
- Data preparation
- Model selection
- Training and evaluation
- Prediction and deployment

## What is Data Preparation?

Data preparation is part of the `Data Preprocessing` step in the established data pipeline. Preparing data for a machine learning model involves three significant stages:

1. Data Exploration:
    - Explore Data
    - Explore the target variable
2. Find relationships between target and other variables
3. Data cleaning
    - Missing values
    - Outlier detection

## Importance of Preparing Data

A general framework for data preparation can include the following steps:

1. `Data Understanding`: Familiarize yourself with the data by reviewing its structure, size, and content. Use visualizations and summary statistics.
2. `Data Cleaning`: Handle missing, duplicate, or inconsistent values, and remove irrelevant data to ensure the quality of the data being used.
3. `Data Transformation`: Convert the data into a format suitable for analysis and modeling, which includes normalizing, scaling, and encoding variables.
4. `Data Reduction`: Reduce the size of the data, if necessary, by removing features that are redundant or irrelevant to the problem.
5. `Data Splitting`: Split the data into training and testing datasets. The training dataset is used to build the model, while the testing dataset is used to evaluate its performance.

Once completed, the data is ready to be used for machine learning.
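
The cleaning, transformation, and splitting steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: the toy records, the median imputation, the min-max scaling, and the 80/20 split ratio are all assumptions chosen for the example.

```python
import random
from statistics import median

# Hypothetical toy records: (age, income, label); None marks a missing value.
rows = [
    (25, 40_000, 0),
    (32, None, 1),
    (47, 82_000, 1),
    (25, 40_000, 0),  # exact duplicate of the first row
    (51, 95_000, 1),
    (38, 61_000, 0),
]

# Data cleaning: drop exact duplicates while keeping the original order.
unique_rows = list(dict.fromkeys(rows))

# Data cleaning: fill missing incomes with the median of the observed incomes.
income_median = median(r[1] for r in unique_rows if r[1] is not None)
cleaned = [
    (age, income_median if income is None else income, label)
    for age, income, label in unique_rows
]

# Data transformation: min-max scale a feature column to the range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = min_max([r[0] for r in cleaned])
incomes = min_max([r[1] for r in cleaned])
labels = [r[2] for r in cleaned]
dataset = list(zip(ages, incomes, labels))

# Data splitting: shuffle with a fixed seed, then hold out 20% for testing.
rng = random.Random(0)
rng.shuffle(dataset)
n_test = max(1, round(0.2 * len(dataset)))
test_set, train_set = dataset[:n_test], dataset[n_test:]

print(len(train_set), "training rows,", len(test_set), "testing rows")
```

The sketch scales before splitting to mirror the order of the list above. In practice, scaling statistics are often computed on the training split only, so that no information from the testing data leaks into the model.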

## Data Exploration

`Data Exploration` is about describing the data by means of statistical and visualization techniques. Exploring data brings important aspects of that data into focus for further analysis.

### 1. Univariate Analysis

`Univariate analysis` explores variables (attributes) in isolation. Variables can be either `categorical` or `numerical`, and each type has its own statistical and visualization techniques.

- `Numerical` variables can be transformed into categorical counterparts by a process called `binning` or `discretization`.
- `Categorical` variables can be transformed into numerical counterparts by a process called `encoding`.
- Finally, proper handling of missing values is an important issue in mining data.
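
As a small sketch of the two conversions above: the values, the age cut points, and the choice of one-hot encoding are assumptions made for illustration; other binning rules and encodings are equally valid.

```python
# Hypothetical example values, not taken from any real dataset.
ages = [5, 17, 34, 52, 71]
colors = ["red", "green", "blue", "green"]

def bin_age(age):
    """Discretize a numerical age into a labelled bin (assumed cut points)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

# Binning: numerical -> categorical.
age_bins = [bin_age(a) for a in ages]

# One-hot encoding: categorical -> numerical, one 0/1 column per category.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(age_bins)
print(categories)
print(one_hot)
```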

#### 1A. Categorical Values

A categorical or discrete variable is one that has two or more categories (values). There are two types of categorical variable, `nominal` and `ordinal`.

- A `nominal` variable has no intrinsic ordering to its categories, such as gender.
- An `ordinal` variable has a clear ordering, such as `low`, `medium`, and `high` value categories.
- A `frequency table` is a way of counting how often each category of the variable in question occurs. It may be enhanced by the addition of the percentage of observations that fall into each category.
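
A frequency table with percentages can be built in a few lines. The sketch below uses `collections.Counter` on a made-up ordinal variable; the data is purely illustrative.

```python
from collections import Counter

# Hypothetical observations of an ordinal variable.
sizes = ["low", "high", "medium", "low", "low", "medium", "high", "low"]

counts = Counter(sizes)
total = sum(counts.values())

# Map each category to (count, percentage of all observations).
freq_table = {cat: (n, 100 * n / total) for cat, n in counts.most_common()}

for cat, (n, pct) in freq_table.items():
    print(f"{cat:<8}{n:>3}{pct:>8.1f}%")
```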

#### 1B. Numerical Variables

A numerical or continuous variable (attribute) is one that may take on any value within a finite or infinite interval (e.g. height, weight, temperature, blood glucose). There are two types of numerical variables: `interval` and `ratio`.

- An `interval` variable has values whose differences are interpretable, but it does not have a true zero, such as temperature in centigrade. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided.
- A `ratio` variable has values with a true zero and can be added, subtracted, multiplied, or divided (e.g. weight).
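
A short arithmetic sketch shows why ratios are not meaningful on an interval scale. The two temperatures are arbitrary example values; Kelvin is used here as a ratio scale because it has a true zero.

```python
# Two temperatures on the Celsius (interval) scale.
c1, c2 = 10.0, 20.0

# Differences are interpretable: the second reading is 10 degrees warmer.
diff = c2 - c1

# The naive ratio is 2.0, but 20 C is not "twice as hot" as 10 C,
# because 0 C is an arbitrary reference point rather than a true zero.
naive_ratio = c2 / c1

# On the Kelvin (ratio) scale the same two temperatures differ by about 3.5%.
k1, k2 = c1 + 273.15, c2 + 273.15
kelvin_ratio = k2 / k1

print(diff, naive_ratio, round(kelvin_ratio, 4))
```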

### 2. Bivariate Analysis

`Bivariate` analysis is the simultaneous analysis of two variables (attributes). It explores the relationship between two variables: whether there is an association and how strong it is, or whether there are differences between the two variables and how significant those differences are.

#### 2A. Numerical & Numerical

##### Scatter Plot

A `scatter plot` is useful for visual representation of the relationship between two numerical variables (attributes) and is usually drawn before working out a linear correlation or fitting a regression line. The resulting pattern indicates the type (linear or non-linear) and strength of the relationship between two variables.

More information can be added to a 2-D scatter plot (e.g. labeling points to indicate a third variable). When dealing with many variables, a `scatter plot matrix` presents all possible scatter plots of two variables at a time.

![Scatter Plot Matrix](https://www.saedsayad.com/images/ScatterPlot_Matrix_1.png)

##### Linear Correlation

`Linear correlation` quantifies the strength of a linear relationship between two numerical variables. When there is no correlation between two variables, there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity.

The linear correlation coefficient `r` measures only the strength of a linear relationship. It always lies between -1 and +1, where -1 means perfect negative linear correlation, +1 means perfect positive linear correlation, and 0 means no linear correlation.
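
The sketch below computes `r` from scratch, assuming `r` refers to the Pearson correlation coefficient (the text does not name it explicitly). The function name `pearson_r` and the sample data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y falls linearly with x
r_parabola = pearson_r(x, [4, 1, 0, 1, 4])    # y = (x - 3)**2

print(r_positive, r_negative, r_parabola)
```

The last case is a reminder that `r` near zero rules out only a linear relationship: here `y` is fully determined by `x`, yet the linear correlation is zero.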

#### 2B. Categorical & Categorical

##### Stacked Column Chart

`Stacked Column Chart` is a useful graph to visualize the relationship between two categorical variables. It compares the percentage that each category from one variable contributes to a total across categories of the second variable.

![Stacked Column Chart](https://www.saedsayad.com/images/Orange_Stackedplot.png)
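
The percentages behind such a chart can be computed before any plotting. The sketch below uses invented (outlook, play) observations and computes, for each category of the first variable, the share contributed by each category of the second; the variable names are assumptions.

```python
from collections import Counter

# Hypothetical paired observations of two categorical variables.
pairs = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("overcast", "yes"),
]

cell_counts = Counter(pairs)
column_totals = Counter(outlook for outlook, _ in pairs)

# Percentage of each column (outlook) made up by each play category.
shares = {
    (outlook, play): 100 * n / column_totals[outlook]
    for (outlook, play), n in cell_counts.items()
}

for (outlook, play), pct in sorted(shares.items()):
    print(f"{outlook:<10}{play:<5}{pct:6.1f}%")
```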

##### Combination Chart

A `Combination Chart` uses two or more chart types to emphasize that the chart contains different kinds of information. This visualization method is used to demonstrate the predictive power of a predictor (X-axis) against a target (Y-axis).

![Combination Chart](https://www.saedsayad.com/images/Temperature.gif)

##### Chi-Square Test

The `Chi-Square Test` can be used to test for an association between categorical variables. It is based on the difference between the expected frequencies (e) and the observed frequencies (n) in one or more categories of the frequency table. The `chi-square distribution` returns a probability (p-value) for the computed `chi-square` statistic and the degrees of freedom. A probability close to zero is strong evidence that the two categorical variables are dependent, while a probability close to one means the data are consistent with the two variables being independent.

`Tschuprow's Contingency Coefficient` (also transliterated as Tchouproff) measures the amount of dependency between two categorical variables.
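
As a sketch of the computation, the block below derives the expected frequencies, the chi-square statistic, the degrees of freedom, and the contingency coefficient (commonly spelled Tschuprow's T) for a made-up 2x3 table. The observed counts are assumptions, and the standard formula T = sqrt(chi2 / (n * sqrt((rows - 1) * (cols - 1)))) is used.

```python
from math import sqrt

# Hypothetical observed counts: rows and columns are the categories
# of the two categorical variables.
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Tschuprow's T: 0 means no association, larger values mean stronger association.
tschuprow_t = sqrt(chi2 / (n * sqrt(dof)))

print(round(chi2, 4), dof, round(tschuprow_t, 4))
```

Turning the statistic into a probability requires the chi-square distribution, which the standard library does not provide. One option, if SciPy is available, is `scipy.stats.chi2.sf(chi2, dof)`.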

#### 2C. Categorical & Numerical

##### Line Chart with Error Bars

A `line chart` with error bars displays information as a series of data points connected by straight line segments. Each data point is the average of the numerical data for the corresponding category of the categorical variable, with an error bar showing the standard error. It is a way to summarize how pieces of information are related and how they vary depending on one another.

![Line Chart](https://www.saedsayad.com/images/Linechart_Errorbar.png)
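
The points and error bars of such a chart come from per-category means and standard errors. The sketch below computes both with the `statistics` module; the category names and measurements are invented, and the standard error is taken as the sample standard deviation divided by the square root of the group size.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical numerical measurements grouped by a categorical variable.
groups = {
    "low": [4.0, 5.0, 6.0],
    "medium": [6.0, 8.0, 10.0],
    "high": [9.0, 12.0, 15.0],
}

# For each category: (mean, standard error of the mean).
summary = {
    cat: (mean(values), stdev(values) / sqrt(len(values)))
    for cat, values in groups.items()
}

for cat, (avg, se) in summary.items():
    print(f"{cat:<8}mean={avg:5.2f}  se={se:4.2f}")
```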

##### Combination Chart

A `combination chart` uses two or more chart types to emphasize that the chart contains different kinds of information. It is an effective visualization method for demonstrating the predictive power of a predictor (X-axis) against a target (Y-axis).

![Combination Chart](https://www.saedsayad.com/images/Bivar_sepal_length.gif)

##### Z-test and t-test

`Z-test` and `t-test` serve the same purpose: they assess whether the averages of two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for two categories of a categorical variable.

The smaller the probability associated with Z (the p-value), the more significant the difference between the two averages.

When either sample size, `n1` or `n2`, is less than 30, a `t-test` is used instead.
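
A two-sample Z-test can be sketched with the standard library. The function name `two_sample_z` and the data are hypothetical; the sketch assumes both groups have at least 30 observations, estimates the variances from the samples, and reports a two-sided probability from the standard normal distribution.

```python
from math import erfc, sqrt
from statistics import mean, variance

def two_sample_z(a, b):
    """Return (z, two-sided p-value) for the difference between two means."""
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p = erfc(abs(z) / sqrt(2))
    return z, p

# Hypothetical groups of 40 observations each; group_b is shifted up by 1.
group_a = [float(i % 5) for i in range(40)]
group_b = [float(i % 5) + 1.0 for i in range(40)]

z, p = two_sample_z(group_a, group_b)
print(f"z = {z:.4f}, p = {p:.4f}")
```

For smaller groups, the same statistic would be compared against a t distribution rather than the normal distribution, which needs a statistics library beyond the standard one.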

##### Analysis of Variance (ANOVA)

The `ANOVA` test assesses whether the averages of more than two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for more than two categories of a categorical variable.
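
The one-way ANOVA F statistic compares the variation between group averages with the variation within groups. The sketch below computes it by hand; the function name `one_way_anova_f` and the three groups are assumptions for illustration.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, (df_between, df_within)) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, (k - 1, n - k)

# Hypothetical measurements for three categories.
samples = [[4.0, 5.0, 6.0], [6.0, 8.0, 10.0], [9.0, 12.0, 15.0]]
f_stat, dfs = one_way_anova_f(samples)
print(round(f_stat, 4), dfs)
```

A large F value relative to the F distribution with these degrees of freedom suggests that at least one group average differs from the others.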