# ML Cheatsheet

### Pre-Processing

#### EDA

- Covariance and correlation: both describe the relationship between two variables.  
Correlation is bounded: [-1, 1].  
<u>Pearson’s correlation</u> only measures linear relationships. For a nonlinear but monotonic relationship, use a rank-based measure such as Spearman's or Kendall's correlation.  
Use _correlation_ between _numerical features_, _chi-square_ between _categorical features_, and _ANOVA_ for a numerical feature against a categorical one.

- VIF (variance inflation factor):  
checks for multicollinearity. To mitigate it, drop or combine the correlated features, or use regularization.  


- The mean, variance, and standard deviation are sensitive to outliers; the median and the median absolute deviation (MAD) are not.



- PCA:  
_First drop (or merge) highly correlated features: correlated variables make PCA put extra weight on the information they share, which is misleading._
<br>
PCA transforms the features of a dataset into uncorrelated linear combinations. These new features, the principal components, are ordered by the variance they capture (the first principal component has the most variance, the second has the second most, and so on).
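
To illustrate the note above that Pearson's correlation only captures linear relationships, here is a minimal sketch in plain Python. The `pearson` helper and the toy data are assumptions made for illustration, not part of any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
quadratic = [x ** 2 for x in xs]   # fully determined by x, but not linear

r_linear = pearson(xs, linear)        # close to 1
r_quadratic = pearson(xs, quadratic)  # close to 0 despite full dependence
```

The quadratic feature is completely determined by `x`, yet its Pearson correlation is about zero, which is why a near-zero value should not be read as "no relationship".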

#### Imputation

>models ok with missing values:  
_trees, naive bayes_

First, check whether there is a pattern in the missing values (for example, a survey question that was deliberately left unanswered). If there is, fill all of them with the same placeholder value;  
otherwise:


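1. Median, mode, or mean
2. Regression-based (iterative): predict the missing value with a model such as linear regression
3. Classifier-based: predict the missing value with a model such as KNN
4. Distance-based: fill in values from the nearest complete rows
5. Masking:  
add an indicator that is 1 if the value exists and 0 if not
6. Multiple imputation:  
create several imputed datasets, each with a different set of imputed values, and then combine the results to obtain a final estimate.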
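
Options 1 and 5 from the list above are often combined. The sketch below is one possible way to do that in plain Python; the function name `impute_median_with_mask` and the `ages` data are hypothetical.

```python
from statistics import median

def impute_median_with_mask(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, mask), where mask is 1 if the original
    value existed and 0 if it was missing.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    mask = [0 if v is None else 1 for v in values]
    return filled, mask

ages = [22, None, 35, 41, None, 29]
filled, mask = impute_median_with_mask(ages)
```

Keeping the mask as an extra feature lets a downstream model learn whether "missing" itself carries signal.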

#### Data Encoding

>models ok with cat data:  
_Naive Bayes, tree-based (random forest, Catboost, LightGBM (not XGBoost))_


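1. Label: one integer per category
2. One-hot: one binary column per category
3. Binary: the category index written in binary digits, spread across a few columns
4. Count (frequency): replace each category with how often it occurs
5. Bucketizer: bin a continuous value into discrete ranges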
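
As a rough illustration of one-hot and count encoding from the list above, here is a small standard-library sketch. The helper names and the `colors` data are assumptions; in practice a library encoder would normally be used.

```python
from collections import Counter

def one_hot(values):
    """Return the sorted category list and one binary row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def count_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]

colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
counts = count_encode(colors)
```

One-hot encoding adds one column per category, so high-cardinality features are usually better served by count or binary encoding.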

#### Outliers

>models robust to outliers:  
_trees, GBM, SVM, NN_


Finding outliers:
1. IQR: flag points beyond the usual quartile fences (e.g. 1.5 × IQR below Q1 or above Q3); good when the data is not normally distributed
2. Distribution-based: e.g. z-score when the data is roughly normally distributed (flag points more than about 3 to 3.5 std from the mean)
3. Models: 
    * Isolation forest: good when outliers are in sparse regions
    * LOF: density-based, good when outliers are in low-density regions
    * One-class SVM, KDE


Handling Outliers:
1. Winsorize (cap at a threshold):
    - the extreme values are replaced with the value at a specified percentile. For example, if we want to winsorize a dataset at the 5th and 95th percentiles, any values below the 5th percentile would be replaced with the value at the 5th percentile, and any values above the 95th percentile would be replaced with the value at the 95th percentile.
2. Transform to reduce skew (using Box-Cox or similar):
    - Stabilizing variance: stabilize the variance across different levels of the data, which can reduce the impact of outliers.
    - Normalizing data: make the data more normally distributed, which can help reduce the influence of extreme values (outliers).
    - Reducing skewness: reduce skewness in the data, which can make the data more symmetric and less sensitive to outliers
3. Remove outliers if you're certain they are anomalies or measurement errors.
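
The IQR rule and winsorizing can be sketched together as follows. This is a minimal example under stated assumptions: the `percentile` helper uses simple linear interpolation, the fence multiplier `k=1.5` is the common convention, and values are capped at the IQR fences rather than at fixed percentiles such as the 5th and 95th.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    s = sorted(values)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, low, high):
    """Cap every value to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]

data = [10, 12, 11, 13, 12, 11, 95]
low, high = iqr_bounds(data)
outliers = [v for v in data if v < low or v > high]
capped = winsorize(data, low, high)
```

Here only `95` falls outside the fences, and winsorizing pulls it back to the upper fence while leaving the other values unchanged.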

#### Feature Selection

- Filter-based:  
Specify a metric and filter features based on it. Examples: correlation for numerical features, or chi-square for categorical features (compute the chi-square statistic between the target and each feature and keep only the features with the highest values).
- Wrapper-based:  
treat the selection of a set of features as a search problem. Example: Recursive Feature Elimination.
- Embedded:  
algorithms that have built-in feature selection. For instance, Lasso and Random Forest have their own feature selection mechanisms.

- Use Random Forest or XGBoost and plot a variable importance chart
- Information gain (mutual information)

#### Imbalanced Data

>models more robust to imbalanced data:  
_random forest, GBM, SVM, NN, KNN_

1. Resampling:  
oversample the minority class or undersample the majority class.  
Where possible, address the problem at collection time: specifying a hypothesis first and then collecting data with proper randomization and random sampling guards against bias.
2. Ensemble methods
3. Class weights: penalize mistakes on the minority class more heavily, e.g. penalized SVM and penalized LDA.


------

## ML


* __Frequentist__ (linear regression, logistic regression, SVM, tree-based, ...): suited to <u>large datasets</u>; parameters are point estimates, typically obtained by maximum likelihood
* __Bayesian__ (Bayesian linear regression, ...): suited to <u>small</u> and <u>noisy</u> data

* __Parametric models__:  
a fixed, finite number of parameters. To predict new data, you only need to know the parameters of the model (linear regression, logistic regression, and linear SVMs).
* __Non-parametric models__:  
an unbounded number of parameters that can grow with the data, allowing for more flexibility. To predict new data, you need to know the parameters of the model and the state of the data that has been observed (trees, KNN, and topic models using latent Dirichlet allocation).

* __Generative__:   
_learn how each category of data is distributed_  
best for small datasets; can generate new samples from the learned input <u>distributions</u> <br> 
based on density, prior and posterior (GMM, Naive Bayes, variational autoencoders, ...)<br>
give us more information, since they learn both the <u>input distribution</u> and the <u>class probabilities</u>.<br>
can deal naturally with missing data<br>
more sensitive to outliers, because outliers can have a large effect on the estimated input distributions.<br>

* __Discriminative__:   
    _learn the distinction between different categories of data_  
    1. distribution-free (KNN, trees, SVM)  
    2. probabilistic, based on the posterior (logistic regression, NN, ...)


* __Kernel-based__:  
learn non-linear patterns better  
robust to noise  
efficient in high-dimensional feature spaces thanks to the kernel trick

* __Ensemble models__

### Regression

__loss functions__:
* MSE, MAE
* Huber Loss:  
It uses MSE for small errors and MAE for large errors.  
less sensitive to outliers
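
A minimal sketch of the Huber loss for a single error, in its usual form: quadratic inside a threshold and linear outside it. The threshold name `delta` and its default of 1.0 are assumptions for illustration.

```python
def huber(error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2          # MSE-like region
    return delta * (a - 0.5 * delta)     # MAE-like region

small = huber(0.5)            # 0.125, same as half the squared error
large = huber(3.0)            # 2.5, grows linearly
squared_large = 0.5 * 3.0 ** 2  # 4.5, what the squared loss would give
```

For the large error the Huber loss is well below the squared loss, which is what makes it less sensitive to outliers.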


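For optimization: set the derivatives (gradient) of the loss function to zero, or approach that point iteratively with gradient descent when there is no closed-form solution.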

__Evaluation metrics__:
1. MSE: easier to optimize, penalizes large errors, an absolute measure in squared units of the target

2. RMSE: penalizes large errors, easier to interpret because it is in the units of the target

3. MAE: more robust to outliers than MSE, so not a good fit when large errors must be penalized heavily;
   treats all errors the same;
   doesn't change the units

4. MAPE: a percentage error, 0 is the best; it can exceed 100% and is undefined when a true value is 0

5. R-squared:  
   the share of the variance of the dependent variable that is explained by the independent variables of the model; 1 is the best, and it normally lies between 0 and 1.  
   It does not guard against overfitting: it never decreases when independent variables are added, even if they only make the model more complicated,  
   so use adjusted R-squared when comparing models with different numbers of features.  
   R-squared is easier to compare across problems than RMSE, because RMSE is an absolute measure (it depends on the scale of the variables and is not normalized).  
   Example:  
   in marketing campaigns, companies can measure effectiveness by analyzing the relationship between the amount of money spent on advertising and the resulting increase in sales revenue. A high R-squared would indicate a strong relationship between advertising spend and sales revenue.
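
The metrics above can be computed directly from their definitions. The following is a small sketch with made-up numbers; `regression_metrics` is a hypothetical helper name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared from paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "r2": 1 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
```

MSE, RMSE and MAE change if the target is rescaled, while R-squared does not, which is the sense in which it is a normalized measure.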

Assumptions to check (for linear regression):
- The errors (residuals) are normally distributed and independent of each other.
- There is minimal multicollinearity between explanatory variables.

### Normalization

* __NN__:  
The point of adding a normalization layer to the architecture of a neural network is to improve its convergence. <br>
The faster training approaches a good minimum, the less time and fewer resources it requires. <br>
Adding a normalization layer also helps make the model more robust.<br>

* __Others__:  
normalization should be used with distance-based and other scale-sensitive algorithms (KNN, k-means, SVM, PCA, ...); it doesn't help tree-based models.

    1. Z-Score Normalization (Standard Scaler): mean = 0 and std = 1

    2. Min-Max Scaling: maps values to (0-1), useful when the distribution of the feature is roughly <u>uniform</u>

    3. Log Scaling: helpful when the feature values are heavily skewed (a few very large values);
       can improve the performance of linear models

    4. Robust Scaling: centers on the median and scales by the IQR, useful when the data contains outliers or extreme values that would distort the scaling;
       mostly used with algorithms that are sensitive to the scale of the data, such as KNN and SVM.
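
Z-score and min-max scaling are simple enough to write out by hand. The sketch below uses only the standard library; the helper names and the `heights` data are assumptions, and the population standard deviation (`pstdev`) is a choice made for this example.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]
standardized = z_score(heights)
scaled = min_max(heights)
```

In a real pipeline the statistics (mean, std, min, max) should be computed on the training split only and then reused on the test split.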


### Regularization

> overfit: Low bias and high variance <br>

* Ridge (L2): penalizes the squared magnitude of the coefficients
* Lasso (L1): penalizes the absolute magnitude of the coefficients; since it can shrink some coefficients to exactly 0, it also acts as feature selection
* more data, fewer features, lower complexity
* larger K in KNN, lower depth in trees; tune hyperparameters using techniques such as grid search, random search, or Bayesian optimization
* cross-validation


### Classification

>For classification problems, prefer __stratified sampling__ over plain random sampling, especially with imbalanced classes. Random sampling doesn’t take the proportion of target classes into account, whereas stratified sampling keeps the distribution of the target variable the same in each resulting sample.
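
A stratified split can be sketched by splitting each class separately and then merging the pieces. The function below is an illustrative assumption (its name, the `test_fraction` and `seed` parameters, and the toy labels are all made up), not a replacement for a library splitter.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 16 + [1] * 4        # 20% positives
train_idx, test_idx = stratified_split(labels)
```

With these labels both the training and the test part keep the 20% share of the positive class, which a plain random split of 20 rows would not guarantee.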

* Linear classification: optimized by gradient descent; loss functions: hinge loss, logistic loss, ... <br>
* Logistic regression: based on maximizing the likelihood; activation: sigmoid, loss: log loss (binary cross-entropy)


__loss functions__:
* Cross-entropy (binary / categorical):   
a measure of the difference between two probability distributions (the true labels and the predicted probabilities).  
For models that predict probabilities.  
It is the natural loss for <u>logistic regression</u> trained with gradient descent.

* Hinge loss: 
    used in SVMs for binary classification.


__Evaluation__:
1. Accuracy: (TP + TN) / total

2. Precision: TP / (TP + FP), useful when the cost of false positives is high (like fraud detection); not sufficient on its own for imbalanced data, since it ignores false negatives

3. Recall: TP / (TP + FN), useful when the goal is to detect as many positive cases as possible (like diseases)

4. F1-score: 2 * P * R / (P + R), useful when both false positives and false negatives matter (like spam detection);  
a better choice than accuracy for imbalanced data

5. AUC: 
<br>
(0-1) the higher the better; generated by varying the threshold and calculating TPR and FPR; <b>robust to outliers</b>; defined for binary classes (multiclass needs one-vs-rest or one-vs-one averaging).  
Note that with highly imbalanced datasets, ROC AUC can be misleading.

6. Log loss: useful when the goal is to penalize the model for being overly confident about predicting the wrong class
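
The first four metrics follow directly from the confusion-matrix counts. Here is a minimal sketch for binary labels (1 = positive); the helper name and the toy predictions are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_metrics(y_true, y_pred)
```

Note the parentheses in the denominators: `TP / (TP + FP)`, not `TP / TP + FP`.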

__Model__:  
>for non-linear data, bagging and boosting models usually do better.  
 for deployment, simple models like linear or logistic regression are often preferable to black-box models like SVM, XGBoost, ...


* __maximal margin classifier__: base of SVM, sensitive to outliers
* __SVM__:  
using kernel functions (like the radial basis function) maps the data into a different, often higher-dimensional space where new relations can be found  
_Kernel_: A more complex kernel (e.g., RBF) can reduce bias but increase variance, while a simpler kernel (e.g., linear) can increase bias but reduce variance.   
_Regularization parameter_ (C): Increasing C can lead to a more complex model, which reduces bias but increases variance.  
* __KNN__:  
can handle <u>non-linear</u> relationships,  
        does not perform well with high-dimensional data due to the "curse of dimensionality",  
        <u>not good with outliers</u> because it is an instance-based method.  
        Euclidean distance is the usual default because it measures straight-line distance in any direction; Manhattan distance only adds up the horizontal and vertical (per-axis) differences.  
        _Number of neighbors_ (k): Increasing K leads to a smoother decision boundary, which increases bias but reduces variance. 

* __KDE__:  
useful when the data is not well-represented by a parametric distribution,  
        <u>robust</u> when dealing with small or noisy datasets,   
        helpful in identifying outliers or anomalies.
* __Naive Bayes__: 

* __Decision Trees__:  
_Max depth_: Increasing the maximum depth of the tree can lead to a more complex model, which reduces bias but increases variance.  
_Min samples split_: Increasing the minimum number of samples required to split a node can lead to a simpler model, which increases bias but reduces variance.

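* __Logistic Regression__: more robust to outliers than least-squares methods, since correctly classified points far from the boundary contribute almost nothing to the loss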

* __Neural Networks__:  
the main motivation for using activation functions in a NN is capturing complex non-linear patterns; without them, stacked layers collapse into a single linear map.  
_Number of hidden layers and neurons_: Increasing the number of hidden layers or neurons can lead to a more complex model, which reduces bias but increases variance.  
_Regularization techniques_ (e.g., L1, L2, or dropout): Applying regularization can help control the complexity of the model, reducing variance while potentially increasing bias.


* __Ensemble methods__ (e.g., Random Forests, Gradient Boosting):   
 _Number of base models_: Increasing the number of base models helps reduce variance by averaging the predictions of multiple models (this holds for bagging; in boosting, more rounds mainly reduce bias and can eventually overfit). It may not significantly affect bias.  
_Hyperparameters of base models_: Changing the hyperparameters of the base models affects the bias and variance of the ensemble.  
Deeper trees in a random forest can cause overfitting.

* __Random Forest__:  
Calibration in Random Forest refers to the process of adjusting the predicted probabilities of the model to better match the true probabilities of the outcome  
  methods:  
  - Platt scaling: This method fits a logistic regression model to the predicted probabilities of the Random Forest model and uses the coefficients to adjust the probabilities.  
  - Isotonic regression: This method fits a non-parametric regression model to the predicted probabilities and adjusts them to be monotonic.  
  - Bayesian calibration: This method uses a Bayesian framework to estimate the true probabilities of the outcome and adjusts the predicted probabilities accordingly.


### Density Estimation

a technique used to estimate the probability density function(PDF) of a random variable based on a set of observed data. 
- Kernel density estimation:  
a non-parametric technique that estimates PDF by placing a kernel function at each data point and summing the contributions of all kernels.  
The bandwidth of the kernel function determines the smoothness of the estimated density.
- Gaussian mixture models:  
a parametric technique that models the PDF as a weighted sum of Gaussian distributions.  
The number of Gaussian components and their parameters are estimated from the observed data.
- Autoencoders:  
a neural network architecture trained to reconstruct the input data.  
A plain autoencoder does not output a normalized PDF, but its reconstruction error can serve as a rough proxy: points that are reconstructed poorly tend to lie in low-density regions.
- Variational autoencoders:  
a type of autoencoder that can be used for generative modeling and density estimation.  
Variational autoencoders learn a latent representation of the data and use it to generate new samples that are similar to the observed data.
<br>
<br>


### Information Theory

1. __Entropy__:  
quantifies the amount of information (uncertainty) contained in a dataset or a feature (used by decision trees for splitting nodes)
2. __Mutual Information__:  
a measure of the amount of information that two random variables share
(feature selection: choose the subset of features that maximizes the mutual information between the features and the target variable).
3. __Kullback-Leibler Divergence__:  
a measure of the difference between two probability distributions; it is not symmetric.  
Used to compare two models or to optimize the parameters of a model  
(unsupervised learning: measure the difference between the input data distribution and the model distribution, and minimize the KL divergence by adjusting the model parameters).  
__KL loss__:   
the KL divergence used as a training loss (the formula is based on the log ratio of the two distributions).  
It can measure the difference between the prior distribution and the (approximate) posterior distribution;  
it applies to models that predict probability distributions.   
Useful in <u>generative models</u>, such as VAEs, to train the model to generate samples that are similar to the true data distribution. 
4. __Information Gain__:  
a measure of the reduction in entropy;
can be used to select the best feature to split a node.
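
The quantities above have short closed forms for discrete data. The sketch below uses base-2 logarithms (bits); the function names and the toy "yes"/"no" split are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
gain = information_gain(parent, perfect_split)   # the split removes all uncertainty

kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])  # identical distributions
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])  # different distributions
```

A balanced binary node has 1 bit of entropy, so a split that separates the classes perfectly has an information gain of 1 bit; the KL divergence is zero only when the two distributions match.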

### Optimization


>hyperparameter: learning rate

* Full Batch Gradient Descent:  
        computing the gradient of the cost function using the entire training dataset.
        No shuffle is needed, because each update passes through the entire dataset anyway and the order doesn't matter. 
* Stochastic Gradient Descent:  
        computing the gradient using a single randomly selected training example at a time. 
* Mini-batch Gradient Descent:  
        is a compromise between the two, using a small randomly selected subset of the training dataset
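
To make the three variants concrete, here is a minimal mini-batch gradient descent sketch that fits `y = w * x + b` by minimizing the MSE. Everything in it (function name, learning rate, number of epochs, toy data) is an assumption chosen for illustration. Setting `batch_size=len(xs)` gives full-batch gradient descent and `batch_size=1` gives stochastic gradient descent.

```python
import random

def minibatch_gd(xs, ys, batch_size=2, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle so the batches differ each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the mean squared error over the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
w, b = minibatch_gd(xs, ys)
```

The learning rate is the key hyperparameter here: too large and the updates diverge, too small and convergence is slow.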

<br>


### Clustering

__Models__:

>instead of train-test splitting, clustering algorithms typically use all available data.
However, some clustering algorithms may require tuning of hyperparameters. In these cases, it may be useful to use cross-validation.  


>In clustering, bias refers to the tendency of a model to consistently miss the true structure of the data;  
variance refers to the tendency of a model to be overly sensitive to noise in the data.  
bias can arise when the algorithm is not flexible enough to capture the true structure of the data. For example, if the algorithm assumes that the data is linearly separable when it is actually non-linear, it will consistently miss the true structure of the data. This can lead to underfitting, where the model is too simple to capture the complexity of the data.  
On the other hand, variance can arise when the algorithm is too flexible and captures noise in the data instead of the true structure. 



Algorithm families: centroid-based, hierarchical, density-based, and model-based.
1. __KMeans__:  
find k by variance reduction -> elbow method,   
        suitable for datasets with a large number of features and a moderate number of clusters,  
        works well when the clusters are well-separated and have a roughly spherical shape.  
        regularization makes the decision boundaries more regular.  
   * _Using cosine distance as the distance metric in clustering_:  
     - useful in situations where the magnitude of the feature vectors is not important, and only the direction of the vectors matters. For example, in text clustering, the cosine distance can be used to measure the similarity between two documents based on the angle between their word frequency vectors.  
     - less sensitive to the scale of the input features compared to Euclidean distance.  
     - more effective at capturing the semantic similarity between data points. This is because the cosine distance measures the similarity between the directions of the feature vectors, which can be more meaningful in some applications than the distance between the feature values themselves.  
     - one potential disadvantage is that it can be sensitive to the sparsity of the input features. In high-dimensional spaces, many feature vectors are orthogonal to each other, which gives a cosine similarity of zero (a cosine distance of 1) and makes many pairs look equally dissimilar. Cosine distance also cannot distinguish between data points that are far apart in the feature space but have similar directions.
2. __Hierarchical__:  
suitable for datasets with a small to moderate number of features and a small to moderate number of clusters.

3. __GMM__ :  
modeling complex data distributions,   
         work well when the clusters have different shapes and sizes,  
         can be used for density estimation and anomaly detection.
4. __DBSCAN__:  
(density-based:  density refers to the concentration of data points in a given area or region.)  
           suitable for datasets with arbitrary shapes, densities and sizes of clusters,(spatial data such as the geometrical locations of houses)  
           handle noise and outliers in the data.

5. __Fuzzy C-Means__:  
suitable for datasets where each data point can belong to multiple clusters with different degrees of membership.
                 works well when the clusters have overlapping boundaries and can be used for feature selection and data compression.

6. __LDA__ (latent Dirichlet allocation): generative, best for topic modeling
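
The cosine-distance notes under KMeans above can be checked with a few lines of plain Python. The three "documents" are made-up word-count vectors, and the helper names are assumptions.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 3]    # shares no words with the first two

same_direction = cosine_distance(short_doc, long_doc)   # about 0
orthogonal = cosine_distance(short_doc, other_doc)      # exactly 1
far_apart = euclidean_distance(short_doc, long_doc)     # large despite same direction
```

The short and the long document point in the same direction, so their cosine distance is about 0 even though their Euclidean distance is large; orthogonal vectors get a cosine distance of 1.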

__Evaluation__:


* Internal:
    1. Silhouette coefficient:  
    measures the similarity of a data point to its own cluster compared to other clusters. The higher the better. 
    2. Calinski-Harabasz index:  
    measures the ratio of the between-cluster variance to the within-cluster variance. The higher the better.
    3. Davies-Bouldin index:  
    measures the average similarity between each cluster and its most similar cluster. The lower the better.  

* External:  
    1. Adjusted Rand index:  
    measures the similarity between the clustering results and the ground truth labels. The higher the better.
    2. Fowlkes-Mallows index:  
    measures the geometric mean of the precision and recall of the clustering results compared to the ground truth labels. The higher the better.
    3. Normalized Mutual Information:  
    measures the mutual information between the clustering results and the ground truth labels, normalized by the entropy of the two sets. The higher the better.

### Ensemble Models

>reduce overfitting, more robust  
ensemble learners are built on the premise of combining weak, <b>uncorrelated</b> models. If the models are highly correlated, they make the same mistakes and combining them brings little improvement.



1. __Averaging__:  
   Averaging improves performance and stability, and gives lower variance than the individual models,
   by combining the predictions of multiple models.   
   _simple averaging_, _weighted averaging_, _stacking_ (training a meta-model that takes the predictions of multiple models as input and produces a final prediction). Stacking steps:  
   1. Split the training data into two or more parts.  
   2. Train several base models on the first part of the training data.  
   3. Use the trained base models to make predictions on the second part of the training data.  
   4. Combine the predictions of the base models to create a new dataset.  
   5. Train a meta-model on the new dataset, using the actual target values as the labels.  
   6. For the test data, get the base models' predictions and feed them to the trained meta-model for the final prediction.  

2. __Bagging__:  
   (random forest) reduces variance by training each base model on a random (bootstrap) sample of the data,  
   trains the base models independently, in <u>parallel</u>,  
   useful when the base models are <u>unstable</u>, meaning that small changes in the training data can lead to large changes in the model,   
   so it suits <u>complex</u>, low-bias and high-variance base models such as deep trees; it does little to reduce bias.  
   Predictions are combined using <u>voting</u> (classification) or <u>averaging</u> (regression).  

3. __Boosting__:  
   (AdaBoost, CatBoost) the base models, usually trees, are trained in <u>sequential</u> order.   
   Each tree uses the output of the previous ones to focus on their errors, for example by fitting the residuals.  
   Useful when the base models are <u>weak</u>, meaning that they have low predictive power,  
   so it suits <u>simple</u>, high-bias base models such as shallow trees; it mainly reduces bias.  

### Semi-Supervised

>involves training models on a combination of labeled and unlabeled data.   
Self-organizing maps are a specialized (originally unsupervised) neural network that can also be used in semi-supervised settings.


- Self-training:  
 involves training a model on the labeled data, and then using the model to make predictions on the unlabeled data. The high-confidence predictions are then added to the labeled data, and the model is retrained on the expanded labeled data.
- Co-training:   
involves training two or more models on different subsets of the features. The models make predictions on the unlabeled data, and the high-confidence predictions are used to train the other models.
- Generative models:   
Generative models, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), can be used for semi-supervised learning by learning to generate realistic samples from the data distribution.  
The generative models can be trained on the labeled and unlabeled data, and the generated samples can be used to improve the performance of the discriminative models.

### Online Learning

Online learning is a machine learning technique that lets a model learn continuously from new data as it becomes available, without being retrained from scratch. It is particularly useful when the data is constantly changing or evolving, and it has several advantages over traditional batch learning, including scalability, adaptability, efficiency, and cost-effectiveness.

### RL

__Models__:
* model-based:
        uses the transition function (and the reward function) in order to estimate the optimal policy.
* model-free: 
        estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment.


* Q-learning: model-free
    more: https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c#:~:text=Q%2Dlearning%20is%20a%20model,equation(particularly%20Bellman%20equation).&text=Means%20it%20learns%20the%20value,independently%20of%20the%20agent's%20actions

### Recommender Systems

filtering:
* content-based
* collaborative
* Hybrid

Vectorization:
* tf-idf


### Time Series


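>In time series problems, k-fold cross-validation can be troublesome because it trains on data from the future: there might be a pattern in year 4 or 5 that is not in year 3. Instead, use a forward-chaining strategy, where each fold trains on the earlier periods and tests on the next one.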
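
Forward chaining can be sketched as a generator of (training periods, test period) pairs. The function name and the `min_train` parameter are assumptions for illustration; periods are numbered from 0.

```python
def forward_chaining_splits(n_periods, min_train=1):
    """Yield (train_periods, test_period) pairs that never look into the future."""
    for test in range(min_train, n_periods):
        yield list(range(test)), test

# Five yearly periods: train on years [0], test on 1; train on [0, 1], test on 2; ...
splits = list(forward_chaining_splits(5))
```

Each fold trains only on periods that come before its test period, so no information leaks from the future into training.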


----