<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - Modeling](18.02-mlpg-Other-Considerations-Modeling.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Other Considerations - Algorithm Comparisons](18.04-mlpg-Other-Considerations-Algorithm-Comparisons.ipynb) ]>

# 18. Other Considerations

## 18.3. Algorithms

### 18.3.1. Multinomial Logistic Regression
* **Binomial Logistic Regression**: Standard logistic regression, which predicts a binomial probability (i.e., over two classes) for each input example
* **Multinomial Logistic Regression**: An extension of logistic regression that predicts a multinomial probability (i.e., over more than two classes, for multi-class classification problems) for each input example
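
As a minimal sketch of the multinomial case, the example below fits scikit-learn's `LogisticRegression` on a synthetic three-class dataset. The dataset and its parameters are illustrative assumptions, not taken from the book. In recent scikit-learn versions the default `lbfgs` solver fits a multinomial model whenever the target has more than two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem (illustrative data, not from the book)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=1)

# With the default lbfgs solver, a multinomial model is fitted
# when the target has more than two classes
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One probability per class for each example; every row sums to 1
proba = model.predict_proba(X[:5])
print(proba.round(3))
print(proba.sum(axis=1))
```

Each row of `proba` is the predicted multinomial distribution over the three classes for one input example.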

### 18.3.2. Robust Regression
* **Robust regression** refers to a suite of algorithms that are robust to outliers present in training data
* Robust regression algorithms can be used for data with outliers in the input or target values
* Examples of robust regression algorithms are:
  - Huber Regression
  - RANSAC Regression
  - Theil-Sen Regression

**Lines of best fit from different robust regression algorithms:**<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-RobustRegression.png"><br>
<br><br><br><br><br>
Image credit [ (Source) ](https://machinelearningmastery.com/robust-regression-for-machine-learning-in-python/)
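
A rough way to see the effect of outliers is to fit the three algorithms listed above next to ordinary least squares on data whose true slope is known. The sketch below assumes a synthetic line `y = 2x + 1` with ten deliberately corrupted targets; the data and all parameter values are illustrative choices, not taken from the book.

```python
import numpy as np
from sklearn.linear_model import (HuberRegressor, LinearRegression,
                                  RANSACRegressor, TheilSenRegressor)

rng = np.random.default_rng(0)

# Illustrative data: true line y = 2x + 1 with small Gaussian noise
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)

# Corrupt the last 10 targets so that they act as outliers
y[-10:] -= 40.0

models = {
    "Linear (OLS)": LinearRegression(),
    "Huber": HuberRegressor(),
    "RANSAC": RANSACRegressor(random_state=0),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}

slopes = {}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC wraps a base estimator; the fitted line lives in estimator_
    fitted = model.estimator_ if name == "RANSAC" else model
    slopes[name] = float(fitted.coef_[0])
    print(f"{name:12s} slope = {slopes[name]:.2f}")
```

On data like this, the ordinary least squares slope is pulled toward the corrupted points, while the robust estimators should stay much closer to the true slope of 2.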

### 18.3.3. Linear Discriminant Analysis (LDA)
* Linear Discriminant Analysis, or LDA for short, is a classification machine learning algorithm
* LDA assumes that the input variables are numeric and normally distributed within each class, and that the classes share the same variance (spread). If this is not the case, it may help to transform the data toward a Gaussian distribution and to standardize or normalize it before modeling
* Strongly correlated input variables can make the fit unstable; if they are present, a PCA transform may help remove the linear dependence
* The LDA model is naturally multi-class. This means that it supports two-class classification problems and extends to more than two classes (multi-class classification) without modification or augmentation
* It is a linear classification algorithm, like logistic regression. This means that classes are separated in the feature space by lines or hyperplanes. Extensions of the method can be used that allow other shapes, like Quadratic Discriminant Analysis (QDA), which allows curved shapes in the decision boundary
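
The sketch below fits LDA and QDA on a synthetic three-class problem to illustrate that both handle multi-class data without modification. The blob centers and split sizes are assumptions chosen only so that the classes are well separated.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

# Three Gaussian classes with equal spread (illustrative data)
X, y = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# LDA: linear decision boundaries; QDA: curved (quadratic) boundaries
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

lda_acc = lda.score(X_test, y_test)
qda_acc = qda.score(X_test, y_test)
print(f"LDA accuracy: {lda_acc:.3f}")
print(f"QDA accuracy: {qda_acc:.3f}")
```

Because the synthetic classes here really are Gaussian with equal spread, LDA's assumptions hold and QDA's extra flexibility should bring little benefit.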

### 18.3.4. Nearest Radius Neighbors (NRN) algorithm
* The NRN Classifier (Radius Neighbors Classifier) is a simple extension of the k-nearest neighbors (kNN) classification algorithm
* kNN stores the entire training dataset. At prediction time, it locates the k closest training examples to each new example and assigns the mode (most common) class label among those k neighbors
* Instead of the k closest examples, the radius neighbors variant uses every training example that lies within a fixed radius of the new example
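
A minimal sketch using scikit-learn's `RadiusNeighborsClassifier` follows. The radius of `0.2`, the min-max scaling step, and the synthetic data are assumptions made for illustration; in practice the radius would need tuning. Setting `outlier_label="most_frequent"` handles new examples that have no training example inside the radius.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Three well-separated classes (illustrative data)
X, y = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale features to [0, 1] so that one radius is meaningful on every axis,
# then vote among all training examples within that radius
model = make_pipeline(
    MinMaxScaler(),
    RadiusNeighborsClassifier(radius=0.2, outlier_label="most_frequent"),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Radius neighbors accuracy: {accuracy:.3f}")
```

Unlike kNN, the number of neighbors that vote varies from one prediction to the next: dense regions contribute many votes and sparse regions few.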

### 18.3.5. Gaussian Processes Classification (GPC) algorithm
* The GPC is a non-parametric algorithm that can be applied to binary classification tasks
* A Gaussian process is a generalization of the Gaussian probability distribution
* Gaussian processes can be used as a machine learning algorithm for classification predictive modeling
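
As a hedged illustration, the sketch below fits scikit-learn's `GaussianProcessClassifier` to a small binary problem. The RBF kernel, its starting length scale, and the synthetic data are assumptions; other kernels may suit other datasets better.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Small binary problem (illustrative data); GPC scales poorly with dataset size
X, y = make_blobs(n_samples=100, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

# The kernel hyperparameters are optimized during fit
kernel = 1.0 * RBF(length_scale=1.0)
model = GaussianProcessClassifier(kernel=kernel, random_state=0)
model.fit(X, y)

proba = model.predict_proba(X[:5])
accuracy = model.score(X, y)
print(proba.round(3))
print(f"Training accuracy: {accuracy:.3f}")
```

The classifier returns class probabilities rather than only hard labels, which is one reason to consider it when calibrated uncertainty matters.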

### 18.3.6. Clustering algorithms (Unsupervised)
* Clustering is an unsupervised problem of finding natural groups in the feature space of input data
* It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior, so-called pattern discovery, or knowledge discovery
* Examples of clustering problems:
  - Market segmentation
  - Separating normal data from outliers or anomalies
* Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data
* Many algorithms use similarity or distance measures between examples in the feature space to discover dense regions of observations
* As such, it is often good practice to scale data before using clustering algorithms
* Cluster analysis is an iterative process where a subjective evaluation of the identified clusters is fed back into changes to algorithm configuration until a desired or appropriate result is achieved
* 10 popular clustering algorithms:
  - `1) Affinity Propagation`
    - Finding a set of exemplars that best summarize the data
  - `2) Agglomerative Clustering`
    - Merging examples until the desired number of clusters is achieved
  - `3) BIRCH`
    - Constructing a tree structure from which cluster centroids are extracted
  - `4) DBSCAN`
    - Finding high-density areas in the feature space and expanding the regions around them into clusters
  - `5) K-Means`
    - Assigning examples to clusters to minimize the variance within each cluster
  - `6) Mini-Batch K-Means`
    - A modified version of k-means that makes updates to the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise
  - `7) Mean Shift`
    - Finding and adapting centroids based on the density of examples in the feature space
  - `8) OPTICS`
    - A modified version of DBSCAN that orders points by reachability distance, so clusters of varying density can be extracted
  - `9) Spectral Clustering`
    - A general class of clustering methods drawn from linear algebra, which cluster the data using the eigenvectors of a similarity matrix
  - `10) Mixture of Gaussians`
    - Summarizes a multivariate probability density function with a mixture of Gaussian probability distributions
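
To make the list more concrete, the sketch below runs three of these algorithms from scikit-learn on the same scaled synthetic data and scores each result against the known grouping with the adjusted Rand index. The data, the choice of three clusters, and the use of `StandardScaler` are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Three well-separated groups (illustrative data)
X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=1.0, random_state=0)

# Scale first, since these methods rely on distances between examples
X_scaled = StandardScaler().fit_transform(X)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "Mixture of Gaussians": GaussianMixture(n_components=3, random_state=0),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    # Adjusted Rand index: 1.0 means the clustering matches the true groups
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:25s} ARI = {scores[name]:.3f}")
```

On real data there is usually no `y_true` to compare against, which is why the evaluation of clusters tends to be subjective and iterative.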

### 18.3.7. Working of an unsupervised learning algorithm

**Introduction to Unsupervised Learning:**
* Unsupervised learning contrasts with supervised learning in that it uses an **unlabeled** training dataset rather than a labeled one. In other words, there is no vector of expected results, only a dataset of features in which to find structure
* Clustering is good for:
  - Market segmentation
  - Social network analysis
  - Organizing computer clusters
  - Astronomical data analysis, etc.

**K-Means Clustering algorithm:**
* It is one of the most commonly used algorithms for automatically grouping data into coherent subsets, and it works as follows:
  - **Step 1**: `Initialize`
    - Randomly initialize **K** _`cluster centroids`_, typically by picking **K** data points from the dataset
  - **Step 2**: `Cluster assignment`
    - Assign each example to exactly one group, that of the cluster centroid it is closest to
  - **Step 3**: `Move centroid`
    - Compute the average of all the points assigned to each centroid, then,
    - Move each cluster centroid to that average
  - **Step 4**: `Repeat`
    - Re-run steps 2 & 3 until the cluster assignments stop changing
* The main variables/parameters are:
  - **K** (_`number of clusters`_)
  - **X** (_`Training dataset`_ [$x^{(1)}, x^{(2)}, \dots, x^{(m)}$])
* If we have a cluster centroid with 0 points assigned to it, we can `randomly re-initialize` that centroid to a new point, or, simply `eliminate` that cluster group
* After several iterations, the algorithm converges: further iterations no longer change the clusters
* Non-separated clusters:
  - Some datasets have no real inner separation or natural structure
  - K-Means can still partition such data into **K** subsets, which can still be useful
* **`Choosing the Number of Clusters`** (i.e., **`Step 1`**):
  - Choosing K can be quite arbitrary and ambiguous
  - **Elbow method:** 
    - Plot the cost function **J** against the number of clusters **K**
    - **J** should decrease as we increase **K**, then flatten out
    - Choose **K** at the point where the cost **J** starts to flatten out (the "elbow")
    - However, fairly often the **curve is very gradual**, so there is **no clear elbow**
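
Steps 1 to 4 above can be sketched in plain NumPy as follows. This is an illustrative implementation under stated assumptions, not the book's own code: the function names `closest_centroid` and `k_means`, the iteration cap, and the synthetic two-blob data are all hypothetical, and an empty cluster is handled by re-initializing its centroid to a random training example.

```python
import numpy as np

def closest_centroid(X, centroids):
    """Index of the nearest centroid for every example in X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def k_means(X, k, n_iters=100, seed=0):
    """Plain K-Means; returns (centroids, labels, cost)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct training examples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(n_iters):
        # Step 2: assign every example to its single closest centroid
        labels = closest_centroid(X, centroids)
        # Step 3: move each centroid to the mean of its assigned examples
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
            else:
                # Empty cluster: re-initialize to a random training example
                new_centroids[j] = X[rng.integers(len(X))]
        # Step 4: repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    labels = closest_centroid(X, centroids)
    # Cost: mean squared distance from each example to its centroid
    cost = ((X - centroids[labels]) ** 2).sum(axis=1).mean()
    return centroids, labels, cost

# Illustrative data: two Gaussian blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
               rng.normal(loc=[8, 8], scale=1.0, size=(100, 2))])

centroids, labels, cost = k_means(X, k=2)
print(centroids.round(2))
print(f"cost = {cost:.3f}")
```

Running the function with several different `seed` values and keeping the result with the lowest cost is a common way to reduce the influence of the random initialization.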

**NOTE:**
* When K-Means reaches a good optimum, the cost function **J** decreases as **K** is increased
* The exception is when K-Means gets stuck at a bad local optimum; if **J** rises after **K** is increased, re-run K-Means with several random initializations
* Another way to choose **K** is to observe how well K-Means performs for a `downstream purpose`; in other words, choose the **K** that proves most useful for the goal the clusters are meant to serve
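
The text uses the cost **J** without defining it. A common definition, assumed here, is the average squared distance between each example and the centroid of the cluster it is assigned to. The symbols $c^{(i)}$ (index of the cluster assigned to example $x^{(i)}$) and $\mu_k$ (centroid of cluster $k$) are notation introduced for this sketch:

$$
J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2
$$

Under this definition, the cluster assignment step minimizes **J** with respect to the assignments $c^{(i)}$ while the centroids are held fixed, and the move centroid step minimizes **J** with respect to the centroids $\mu_k$ while the assignments are held fixed.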

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-ChoosingKValue1.png"><br>
<br><br><br><br>
Image credit [ (Source) ](https://www.coursera.org/in)

<img align="left" style="padding-right:10px;" src="figures/MLPG-OC-ChoosingKValue2.png"><br>
<br><br><br><br><br><br><br><br>
Image credit [ (Source) ](https://www.coursera.org/in)
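
The elbow method can be sketched with scikit-learn's `KMeans` as below. The synthetic data (three true groups) and the range of **K** values are assumptions for illustration. The fitted attribute `inertia_` holds the within-cluster sum of squared distances, so dividing it by the number of examples gives an average cost per example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with three natural groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 9))
costs = []
for k in ks:
    # Several initializations (n_init) reduce the risk of a bad local optimum
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Average squared distance of each example to its centroid
    costs.append(km.inertia_ / len(X))

for k, cost in zip(ks, costs):
    print(f"K = {k}: J = {cost:.3f}")
```

Plotting `costs` against `ks` (for example with matplotlib) should show a sharp drop up to K = 3 for this data, followed by a much flatter tail.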

### 18.3.8. Limitations of k-Means clustering
* K-Means works well for simple clusters that are similar in size, well separated, and globular in shape
* It does not do well with irregular, complex cluster shapes
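
One way to see this limitation is to cluster two interleaved half-moons, a standard non-globular test shape, and compare K-Means with a density-based method. DBSCAN is swapped in here purely as a contrast; the dataset and the `eps` and `min_samples` values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescent-shaped groups (illustrative data)
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means the clustering matches the true groups
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

K-Means tends to cut straight across both crescents because it assigns points to the nearest centroid, whereas a density-based method can follow the curved shapes.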
