# Machine Learning Intro

Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task effectively without explicit instructions, relying on patterns in the data instead. Machine learning algorithms are broadly divided into supervised and unsupervised algorithms.

## Supervised Learning 
In supervised learning, the target is already known for the training data and is used to fit the model, which then predicts the target for new observations.

- **Classification**: when the target variable is categorical
- **Regression**: when the target variable is continuous


## Unsupervised Learning 
In unsupervised learning, the target is not known, and the model is expected to discover structure in the data on its own.

- **Clustering**: grouping similar observations, for example customer segmentation
- **Association**: finding items that occur together, for example market basket analysis

![image.png](attachment:image.png)

# Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features or variables to use in machine learning. The following topics are covered in this section:

- Types of variables 
- Variable characteristics 
- Variable transformation 
- Treating categorical variables 
- Feature scaling 
- Discretization or binning 
- Missing value imputation 
- Outlier treatment 

## Variable types

- **Numerical variable**: can be discrete or continuous. A discrete variable takes only whole numbers. A continuous variable takes any value within some range.
- **Categorical variable**: can be ordinal or nominal. An ordinal variable takes categories that can be meaningfully ordered. A nominal variable takes labels that have no intrinsic order.
- **Mixed variable**: contains both numbers and labels, either in different observations (some values are numbers, others are labels) or within the same observation (a single value combines a number and a label).

## Variable characteristics

- **Cardinality**: the number of distinct labels in a categorical variable. As cardinality increases, the chance of over-fitting also increases.
- **Skewed distribution**: one tail of the distribution is longer than the other. For a skewed distribution, the median is a better choice than the mean for imputation.
- **Magnitude**: the scale of a feature affects the regression coefficients. Features with larger magnitudes dominate features with smaller magnitudes. **Feature scaling** helps by bringing all features into the same range.
- **Missing data**: occurs when no value is stored for a certain observation in the variable. It can have a significant impact on the model.
- **Outliers**: data points that are significantly different from the remaining data. Depending on the context, outliers either deserve special attention or should be completely ignored.
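
As a rough illustration of the characteristics listed above, the following sketch inspects a small pandas DataFrame. The data and the column names (`city`, `income`) are made up for this example and are not part of the original notes.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A", "B"],
    "income": [20.0, 22.0, 21.0, 23.0, None, 200.0],
})

# Cardinality: number of distinct labels in a categorical variable.
cardinality = df["city"].nunique()

# Skewness: a positive value indicates a longer right tail.
skewness = df["income"].skew()

# Missing data: fraction of observations with no stored value.
missing_share = df["income"].isna().mean()

# Magnitude: the range of values the numerical feature covers.
value_range = df["income"].max() - df["income"].min()

print(cardinality, round(skewness, 2), round(missing_share, 2), value_range)
```

Checks like these are usually a first step before deciding which of the treatments in the following sections is needed.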


## Variable transformation

If the distribution of a variable is skewed, transformations are applied to bring it closer to a normal distribution.

Data transformation also involves smoothing and aggregation techniques.

Aggregation depends on gathering accurate, high-quality data in large quantities in order to produce relevant results.
Smoothing, on the other hand, is the process of removing noise from the data with algorithms that highlight its important features. It also helps in identifying the patterns present in the data correctly.

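Common transformations for reducing skew, with the values of X that each one accepts:

- Logarithmic (X > 0)
- Exponential (any X, but very large values of X may cause overflow errors)
- Reciprocal (X ≠ 0)
- Box-Cox (X > 0)
- Yeo-Johnson (any real X)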
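
The transformations above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the sample `x`, the helper names `box_cox`, `yeo_johnson` and `skewness`, and the hand-picked values of lambda are all assumptions made for this example. In practice lambda is normally estimated from the data rather than fixed by hand.

```python
import numpy as np

# Hypothetical right-skewed, strictly positive sample.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 100.0])

log_x = np.log(x)        # logarithmic: requires x > 0
reciprocal_x = 1.0 / x   # reciprocal: requires x != 0


def box_cox(values, lam):
    """Box-Cox transform for a fixed lambda; requires values > 0."""
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam


def yeo_johnson(values, lam):
    """Yeo-Johnson transform for a fixed lambda; accepts any real value."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    pos = values >= 0
    if lam != 0:
        out[pos] = ((values[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(values[pos])
    if lam != 2:
        out[~pos] = -((1.0 - values[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-values[~pos])
    return out


def skewness(values):
    """Simple moment-based skewness, used only to compare before and after."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print(round(skewness(x), 2), round(skewness(log_x), 2))
```

On this sample the logarithm reduces the skewness noticeably, which is the effect the section describes.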


## Treating categorical variables

Most machine learning algorithms work only with numerical variables. Categories are therefore replaced with numerical representations so that the models can use these variables.

- **One hot encoding**: Consists of encoding each categorical variable with a set of **Boolean variables**. For a variable with K categories, **K-1 dummies** are created, because the remaining category is implied when all dummies are 0. One hot encoding of top categories considers only the most frequent categories.
- **Ordinal encoding**: Consists of replacing the categories with integers from 0 to K-1 (or 1 to K), where K is the number of distinct categories. The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models.
- **Count or frequency encoding**: Categories are replaced by the count or the percentage of observations that show that category. It captures the representation of each label.
- **Target guided encoding**: Helps to obtain a monotonic relationship between the variable and the target. Categories are replaced with integers from 1 to K, where K is the number of distinct categories in the variable, and the order of the integers is informed by the mean of the target for each category. In probability ratio encoding, each category is replaced by the odds ratio or by the weight of evidence.
- **Mean encoding**: Replaces each category with the average target value for that category.
- **Rare label encoding**: Rare labels are those that appear in only a tiny proportion of the observations in the dataset. These labels are grouped together into a single label.
- **Binary encoding**: Each category is assigned an integer, which is then written in binary code with one column per binary digit. It needs fewer columns than one hot encoding, but the result lacks human-readable meaning.
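
Several of the encodings above can be sketched in a few lines of pandas. The DataFrame and its column names (`color`, `target`) are hypothetical. For brevity the mappings are computed on the full toy dataset; in a real project they would be learned on the training set only.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red", "blue"],
    "target": [1, 0, 1, 0, 0, 1],
})

# One hot encoding with K-1 dummies (the first category is dropped).
one_hot = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Ordinal encoding: arbitrary integers from 0 to K-1.
codes, labels = pd.factorize(df["color"])
df["color_ordinal"] = codes

# Frequency encoding: share of observations per category.
freq = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq)

# Mean encoding: average target value per category.
target_mean = df.groupby("color")["target"].mean()
df["color_mean"] = df["color"].map(target_mean)

# Target guided ordinal encoding: integers 1..K ordered by the target mean.
ordered = target_mean.sort_values().index
rank = {category: i for i, category in enumerate(ordered, start=1)}
df["color_target_ordinal"] = df["color"].map(rank)

print(df)
```

Here `green` has the lowest target mean and `red` the highest, so the target guided encoding assigns them 1 and 3 respectively.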

## Feature scaling

Feature scaling is the method used to normalize the range of values, so that all variables are on the same scale.

- **Standardization** [Z = (x - u) / s, where u is the mean and s the standard deviation]: Centers the variable at mean = 0 and scales it to standard deviation = 1. It preserves the shape of the distribution and the outliers.
- **Mean normalization** [Z = (x - mean) / (max - min)]: Centers the mean at 0 and rescales the values into the range [-1, 1]. As a linear rescaling it preserves the shape of the distribution, but the variance changes.
- **Min-max scaling** [Z = (x - min) / (max - min)]: Rescales the variable to the range [0, 1], so all values are non-negative. It preserves outliers.
- **Maximum absolute scaling** [Z = x / max|x|]: Rescales the variable to the range [-1, 1]. The mean is not centered at 0 and the variance is not scaled.
- **Scaling to median and IQR** [Z = (x - median) / (Q3 - Q1)]: The median is centered at 0. Because the median and the IQR are barely affected by extreme values, this method is robust to outliers.
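
The five formulas above translate directly into NumPy. The sample `x` is a made-up array with one large value, chosen only to show how each method reacts to an outlier.

```python
import numpy as np

# Hypothetical sample with one outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: mean 0, standard deviation 1.
standardized = (x - x.mean()) / x.std()

# Mean normalization: mean 0, values within [-1, 1].
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Min-max scaling: values within [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Maximum absolute scaling: values within [-1, 1], no centering.
max_abs = x / np.abs(x).max()

# Scaling to median and IQR: median 0, robust to the outlier.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max.round(3))
print(robust.round(3))
```

With min-max scaling the four ordinary values are squeezed close to 0 by the outlier, whereas scaling to median and IQR keeps them spread out around 0.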

## Discretization or binning 

Discretization is the process of transforming continuous variables into discrete variables by dividing their range into a set of contiguous intervals. It is also called binning.

Data binning, also known as bucketing, groups data into bins or buckets and replaces the values that fall within a small interval with a representative value for that interval. Binning can improve the accuracy of models, especially predictive models. It derives a new categorical feature from the data and reduces noise or non-linearity in the dataset.

Binning is also used for **data smoothing**.

In that case the data is first sorted, and the sorted values are then distributed into several buckets or bins. Because binning methods consult neighboring values, this is also known as **local smoothing**.

![image.png](attachment:image.png)

- **Equal width**: Divides the variable into ***K bins of the same width***. It does not improve the value spread, so we observe the **same distribution**.
- **Equal frequency**: Divides the variable into ***K bins with the same number of observations***. Interval boundaries correspond to quantiles. It handles outliers and **improves the spread of the variable**.
- **K-means**: Applies K-means clustering to the continuous variable and divides it into clusters according to the centroids.
- **Decision trees**: Consists of using decision trees to identify the optimal bins. It creates a **discrete variable as well as a monotonic relationship** with the target. It handles outliers.
- **Note on monotonic relationship**: The intervals can be re-ordered so that they have a monotonic relationship with the target. A monotonic relationship can improve the performance of machine learning models and leads to shallower trees.
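
Equal-width and equal-frequency binning, the first two strategies in the list above, can be sketched with `pd.cut` and `pd.qcut`. The `ages` series is made-up data for this illustration.

```python
import pandas as pd

# Hypothetical ages, skewed towards younger values.
ages = pd.Series([2, 5, 11, 18, 24, 31, 38, 45, 52, 80])

# Equal width: 4 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal frequency: 4 bins with roughly the same number of observations.
equal_freq = pd.qcut(ages, q=4, labels=False)

print(equal_width.value_counts().sort_index().tolist())
print(equal_freq.value_counts().sort_index().tolist())
```

The equal-width bins keep the uneven spread of the original values, while the equal-frequency bins hold a similar number of observations each.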


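Example: the categorical columns are first replaced with integer codes (ordinal encoding).

```python
# `data` is assumed to be a pandas DataFrame that contains these columns.
# The result is assigned back because recent pandas versions warn that
# in-place replacement on a selected column may not update `data`.
data['Sex'] = data['Sex'].replace(['male', 'female'], [0, 1])
data['Embarked'] = data['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
data['Initial'] = data['Initial'].replace(['Mr', 'Mrs', 'Miss', 'Master', 'Other'], [0, 1, 2, 3, 4])
```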
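The continuous `Age` column is then converted into a categorical feature using the binning method, here with bins that are 16 years wide:

```python
# Default bin; rows with a missing Age also keep this value.
data['Age_cat'] = 0
data.loc[data['Age'] <= 16, 'Age_cat'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age_cat'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age_cat'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age_cat'] = 3
data.loc[data['Age'] > 64, 'Age_cat'] = 4
```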
Similarly, other columns such as `Fare` can be converted into categorical features by using the binning method:

```python
# Default bin; rows with a missing Fare also keep this value.
data['Fare_cat'] = 0
data.loc[data['Fare'] <= 7.775, 'Fare_cat'] = 0
data.loc[(data['Fare'] > 7.775) & (data['Fare'] <= 8.662), 'Fare_cat'] = 1
data.loc[(data['Fare'] > 8.662) & (data['Fare'] <= 14.454), 'Fare_cat'] = 2
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 26.0), 'Fare_cat'] = 3
data.loc[(data['Fare'] > 26.0) & (data['Fare'] <= 52.369), 'Fare_cat'] = 4
data.loc[data['Fare'] > 52.369, 'Fare_cat'] = 5
```


## Missing data imputation 

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal is to produce a complete dataset that can be used to train machine learning models.

- **Complete Case Analysis**: List-wise **deletion**, that is, discarding observations where the value of any variable is missing. Only observations with information available for all variables are analyzed. Suitable for numerical and categorical variables. Should be used when data is missing at random and no more than 5% of the data is missing.

- **Mean or Median Imputation**: Consists of replacing all occurrences of missing values within a variable with either the mean or the median. Suitable for **numerical** variables. If the variable is **normally distributed, use the mean**; if the distribution is **skewed, use the median**. Should be used when data is **missing at random and the missing observations most likely look like the majority of the data**. The mean or median should be **calculated only on the training set**, and that **value should be used to replace missing values in both the train and test datasets**. This prevents information from the test set leaking into training and helps to avoid over-fitting.

- **Arbitrary Value Imputation**: Consists of replacing missing values with an arbitrary value, for example **the label 'missing' for categorical variables and 999 for numerical variables**. This should be used when **data is not missing at random**. It **works well with tree-based algorithms** but not with linear regression or logistic regression.

- **Frequent Category Imputation / Mode Imputation**: Consists of replacing all occurrences of missing values within a variable with the mode. Suitable for **categorical** variables. Should be used when **data is missing at random and the missing observations most likely look like the majority of observations**.

- **Missing Category Imputation**: Consists of treating missing data as an additional label or category. This is a widely used method for **categorical** variables.

- **Random Sample Imputation**: Consists of taking random observations from the pool of available observations of the variable and using them to fill the missing values. Suitable for both **numerical and categorical** variables.

- **Missing indicator**: An additional **binary variable** is added which indicates whether the data was missing for an observation or not. Suitable for numerical and categorical variables. It ***should be used together with other methods that assume data is missing at random*** (mean, median or mode imputation and random sample imputation). If the data is missing at random, this is captured by the mean or median; if it is not missing at random, this is captured by the binary variable. If more than 5% of the data is missing, adding a missing indicator is advised.

- **KNN Imputation**: Estimates missing data points as the **weighted average of the values of their K nearest neighbors**. The K nearest neighbors are determined using the other variables, and the weighted average of their values is used to impute the missing value. Suitable when a **small percentage of the data is missing**.

- **MICE**: A series of models in which each variable is modeled conditional upon the other variables in the data. Each incomplete variable is imputed by a separate model.

- **MissForest**: An iterative imputation in the style of MICE that uses Random Forest as the model. Works well with mixed data types and can handle non-linear relationships.
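
The following sketch combines three of the simpler methods above: median imputation, mode imputation and a missing indicator. The `train` and `test` DataFrames and their column names (`age`, `cabin`) are hypothetical. The key point it illustrates is that the statistics are learned on the training set only and then reused for the test set.

```python
import pandas as pd

# Hypothetical train/test split; column names are illustrative only.
train = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0, 41.0],
    "cabin": ["A", "B", None, "B", "B"],
})
test = pd.DataFrame({"age": [None, 30.0], "cabin": [None, "A"]})

# Learn the replacement values on the training set only.
age_median = train["age"].median()
cabin_mode = train["cabin"].mode()[0]

for part in (train, test):
    # Missing indicator, created before the gaps are filled in.
    part["age_missing"] = part["age"].isna().astype(int)
    # Median imputation for the numerical variable.
    part["age"] = part["age"].fillna(age_median)
    # Mode (frequent category) imputation for the categorical variable.
    part["cabin"] = part["cabin"].fillna(cabin_mode)

print(test)
```

The missing `age` in the test set is filled with the training median, and `age_missing` records that the value was imputed.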


## Outlier treatment 

An outlier is a data point that is significantly different from the remaining data. Outliers may affect the performance of linear models; however, their impact is **minimal on tree-based algorithms**.

Outliers can be identified using:
- Gaussian distribution: values outside u -/+ 3*s, where u is the mean and s the standard deviation
- Interquartile range (IQR = Q3 - Q1): values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
- Quantiles: values below the 1st percentile or above the 99th percentile
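
The three detection rules above can be compared on a small sample. The array `x` is made-up data with one obvious outlier; the 1.5 multiplier for the IQR rule is the conventional choice and is an assumption here, since the notes do not state it.

```python
import numpy as np

# Hypothetical sample with one obvious outlier (95).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Gaussian rule: mean -/+ 3 standard deviations.
gauss_low = x.mean() - 3 * x.std()
gauss_high = x.mean() + 3 * x.std()

# IQR rule: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_low, iqr_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quantile rule: 1st and 99th percentiles as boundaries.
pct_low, pct_high = np.percentile(x, [1, 99])

iqr_outliers = x[(x < iqr_low) | (x > iqr_high)]
gauss_outliers = x[(x < gauss_low) | (x > gauss_high)]

print(iqr_outliers, gauss_outliers)
```

On this tiny sample the IQR rule flags 95, while the Gaussian rule does not, because the outlier itself inflates the standard deviation. This is one reason the choice of rule depends on the distribution of the variable.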



Once identified, outliers can be treated in the following ways:

- **Trimming**: **Removes the outliers** from the dataset. However, it can remove a large proportion of the data.
- **Capping**: Outliers are replaced by the boundary values, so no data is removed. However, it **distorts the variable distribution**.
- **Missing data**: The outliers are treated as missing data.
- **Discretization**: The outliers are placed into the lowest and highest bins.
- **Arbitrary capping**: The minimum and maximum caps are chosen by hand, which requires domain knowledge of the variable.
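
A minimal sketch of trimming, capping and treating outliers as missing data, using IQR boundaries. The sample `x` and the 1.5 multiplier are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical sample with one obvious outlier (95).
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

# Boundaries from the IQR rule.
q1, q3 = np.percentile(x, [25, 75])
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
is_outlier = (x < low) | (x > high)

# Trimming: drop the outlying observations.
trimmed = x[~is_outlier]

# Capping: keep every observation but limit values to the boundaries.
capped = np.clip(x, low, high)

# Missing data: mark the outliers as missing for later imputation.
as_missing = np.where(is_outlier, np.nan, x)

print(trimmed.size, capped.max(), int(np.isnan(as_missing).sum()))
```

Trimming shortens the array, whereas capping and the missing-data approach keep its length unchanged.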


## Shuffling

Data shuffling can be implemented using only rank-order data, so it provides a nonparametric method for masking, and it applies equally to small and large datasets. In machine learning, the dataset has to be split into training, validation and test sets. It is important to shuffle the dataset well before the split, and before training begins, so that the resulting subsets do not carry bias or ordering patterns. A well-shuffled split generally improves the quality and the predictive performance of the model.
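
A minimal sketch of shuffling before a train/validation/test split, using NumPy. The arrays `X` and `y`, the fixed seed and the 60/20/20 proportions are assumptions chosen for this example.

```python
import numpy as np

# Fixed seed so that the shuffle is reproducible.
rng = np.random.default_rng(seed=42)

# Hypothetical features and targets, initially in a fixed order.
n = 10
X = np.arange(n * 2).reshape(n, 2)
y = np.arange(n)

# One shared permutation keeps features and targets aligned.
idx = rng.permutation(n)
X_shuffled, y_shuffled = X[idx], y[idx]

# 60% train, 20% validation, 20% test.
n_train, n_val = 6, 2
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

print(train_idx, val_idx, test_idx)
```

Using a single permutation for both `X` and `y` is the important detail: shuffling them independently would break the pairing between observations and targets.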



# References

- https://analyticsindiamag.com/ai-trends/study-notes-on-machine-learning-pipeline-feature-engineering-feature-selection-and-hyper-parameters-optimization/

- https://analyticsindiamag.com/ai-trends/common-feature-engineering-techniques-to-tackle-real-world-data/