<br>
<br>
<br>
<br>

# DAV 6150 Module 8: "Distance-based" Machine Learning Algorithms
<br>
<br>
<br>

## Project 1 Review

#### Be very careful when making decisions regarding which attributes or categories to discard from your data

The categorical attributes within the NYSED data set provide very valuable information, e.g., there is a very strong relationship between dropout count and school district, between dropout count and county, etc. These relationships could have been discovered via appropriate EDA work. The response variable is Dropouts, so ALL categories could feasibly have been used for purposes of training a model. 

Arbitrarily discarding categorical data results in an enormous amount of valuable predictive information being removed from the dataset, which is never an appropriate outcome.

#### "Outliers": Are they or aren't they?

During EDA and Data Preparation we need to be __VERY__ careful regarding our approach to identifying and treating suspected outliers. Be sure to ask yourself:

- __Is my approach to identifying outliers appropriate for each of the attributes I believe may have outlying values__? Make sure you do some research / develop some domain knowledge relative to the data you are working with so that you are well-informed regarding the possible __valid__ data values for each of your attributes.


- __If I choose to remove observations that I suspect may have outliers from the data set, am I discarding too many observations__? If your outlier analysis results in the discarding of a large percentage of the observations contained within a data set, make sure you have a strong set of supporting facts / domain knowledge that justifies your actions.


- __Is there some other method (other than deletion) I can make use of when I have identified a likely outlier__? Think about why the supposed outlying values may be present within the data set: Are they actually valid values? Could they be the result of a data entry or computational error of some sort? Are there similar observations within the data set? Can we preserve otherwise valid data via the use of a well-founded imputation of a new value for the supposed outlier? etc.

#### Be careful of "Data Leakage" when selecting attributes as explanatory variables

"In statistics and machine learning, __leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time__, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment." (SOURCE: https://en.wikipedia.org/wiki/Leakage_(machine_learning)). 


In the P1 data set, we are provided with pairs of attributes whose meaning/context is nearly identical, e.g., the __pct__ and __cnt__ attributes. Of most concern is the presence of the __dropout_pct__ attribute, which is both highly correlated with the response variable and conveys the same type of information found in the __dropout_cnt__ response variable. If we were attempting to predict __dropout_cnt__ and we ALREADY had access to the __dropout_pct__ data, we would have no need to construct any type of predictive model, i.e., we could simply convert the percentage to a count using the other data provided within the dataset. This is an obvious case of __data leakage__, i.e., if we were truly trying to predict __dropout_cnt__, it is highly unlikely that we would have already had access to an attribute that provided us with the __percentage__ of students that dropped out. As such, the __dropout_pct__ attribute should have been __EXCLUDED__ from your models.
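To make the leakage concrete, the short sketch below recovers a dropout count directly from a dropout percentage. The `enroll_cnt` field name and all of the numbers are hypothetical and are used only for illustration; they are not taken from the P1 data.

```python
# Hypothetical record: the field names and values are illustrative only.
record = {"enroll_cnt": 400, "dropout_pct": 5.0}

# If the percentage is already known, the count follows from simple arithmetic,
# so no predictive model is needed. That is exactly what makes it leakage.
recovered_dropout_cnt = round(record["enroll_cnt"] * record["dropout_pct"] / 100)

print(recovered_dropout_cnt)
```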

#### Use of the sm.GLM() function for Negative Binomial models requires that we derive a value for the 'alpha' parameter from our data.

As is explained in the Assigned Reading from M6 (https://towardsdatascience.com/negative-binomial-regression-f99031bb25b4), we need to calculate a value for alpha before training a negative binomial regression model. This article (which was highlighted in the __M6 Lecture Notes__ as an appropriate approach to use) https://dius.com.au/2017/08/03/using-statsmodels-glms-to-model-beverage-consumption/#cameron also provides a detailed explanation of how to calculate an appropriate value for alpha. Use of an arbitrary value for the 'alpha' parameter is never appropriate.

#### Make sure you discuss + explain your model coefficients

The "directionality" of model coefficients can provide a great deal of information we can use to help explain a model to others. Therefore, we should __always__ review and discuss the directionality of our model coefficients for purposes of explaining to others the effects that various explanatory variables have upon our response variable.

#### An R^2 metric cannot serve as a basis of comparison between linear models and Poisson or Negative Binomial models

As we learned in M5 and M6, R^2 is a metric that applies to linear regression models: It measures the strength of a LINEAR relationship between independent and dependent variables. As such it has no applicability whatsoever to the assessment of any predictions made by either negative binomial or Poisson regression models. When comparing models of different types, we need to rely upon common metrics, e.g., log likelihood scores, RMSE scores, etc.

#### Would a linear regression model have been appropriate for this data?

As was discussed during the Module 6 Live Session + Lecture Notes, __linear regression models can generate negative and non-whole values__. Since our response variable was a non-negative integer count, use of a linear regression model would not be appropriate for purposes of estimating the number of dropouts.
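The following sketch illustrates the problem with a straight-line fit to count data. The counts are made up for this example, and `np.polyfit` is used here simply as a convenient least-squares line fit.

```python
import numpy as np

x = np.arange(10)
# Made-up non-negative integer counts, for illustration only
y = np.array([0, 0, 0, 0, 1, 1, 2, 4, 8, 15])

# Ordinary least-squares straight line: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

# Some predictions are negative and none are restricted to whole numbers
print(predictions.round(2))
```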

# K-Nearest Neighbors

__K-Nearest Neighbors (KNN)__ is a __supervised machine learning algorithm__ most frequently used for solving __classification__ problems.


KNN has two underlying assumptions:

- 1) We can use a distance metric to calculate the “distance” between any two given data observations within a data set


- 2) Data observations that are “near” to one another are likely to be similar to each other. 


__“K”__ is a constant representing the number of nearby / neighboring training set data points (or data observations) to be used to predict a valid classification for a given data point.


We can select from a wide variety of distance metrics for use within a KNN implementation. __Minkowski Distance__ is __a generalized distance formula__ that can be used as a framework for calculating a variety of distance measures.

### The Minkowski Distance generalized formula is as follows:

## $\left (\sum _{i=1}^{n} |x_{i} - y_{i}|^p \right) ^{1/p}$

<br>

We can manipulate the value of $p$ in the above formula to derive different distance metrics, each of which is explained graphically __here__: http://www.ieee.ma/uaesb/pdf/distances-in-classification.pdf
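As a minimal sketch, the generalized formula can be written as a small Python function. The function name `minkowski_distance` and the sample points are chosen for this example only.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


a = [0, 0]
b = [3, 4]

print(minkowski_distance(a, b, p=1))  # p = 1
print(minkowski_distance(a, b, p=2))  # p = 2
```

Changing only the value of `p` changes which distance metric the same function computes.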



#### p = 1: Manhattan Distance - Calculate Distance via a Grid-Like Path

## $d = {\sum _{i=1}^n |x_{i} - y_{i}| } $ 

<br>

####  p = 2: Euclidean Distance - Calculate "As the Crow Flies" (Straight-Line) Distance Between 2 Points

This is the classic Euclidean formula: 

## $d(x,y) = \sqrt{\sum _{i=1}^{n} \left(x_{i}-y_{i}\right)^2 }$

<br>


#### p = $\infty$:  Chebyshev Distance - Calculate the Largest Single-Coordinate Difference Between Two Vectors

Chebyshev Distance is also sometimes referred to as __chessboard distance__


## $d_{chebyshev} (x,y) = \max\limits_i (|x_{i} - y_{i}|) $
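The sketch below, using made-up vectors and hypothetical function names, computes the Chebyshev distance directly and shows that the Minkowski distance approaches the same value as $p$ grows.

```python
def chebyshev_distance(x, y):
    """Largest absolute coordinate-wise difference between x and y."""
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski_distance(x, y, p):
    """Minkowski distance for a finite value of p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


a = [1, 5, 2]
b = [4, 1, 2]

print("chebyshev:", chebyshev_distance(a, b))

# As p increases, the largest coordinate difference dominates the sum
for p in (1, 2, 10, 100):
    print(p, minkowski_distance(a, b, p))
```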

<br>


#### Other Distance Metrics

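__Mahalanobis Distance__: For a given data point and distribution $D$, measure how many standard deviations away the point is from the mean of $D$. In the formula below, $\overrightarrow{\mu}$ is the mean vector of $D$ and $S$ is its covariance matrix.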

#### $D_{M}(\overrightarrow{x}) = \sqrt{(\overrightarrow{x} - \overrightarrow{\mu})^T S^{-1} (\overrightarrow{x} - \overrightarrow{\mu})  } $

<br>
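A minimal NumPy sketch of the Mahalanobis formula follows. The data are simulated for illustration (an assumption of this example), with more spread along the first axis than the second, so that the same raw offset counts as "further" along the low-variance axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 2-D data: standard deviation of about 2 on axis 0 and about 1 on axis 1
data = rng.multivariate_normal(mean=[0, 0], cov=[[4, 0], [0, 1]], size=500)

mu = data.mean(axis=0)            # estimated mean vector
S = np.cov(data, rowvar=False)    # estimated covariance matrix
S_inv = np.linalg.inv(S)


def mahalanobis_distance(x, mu, S_inv):
    """sqrt((x - mu)^T S^-1 (x - mu))"""
    delta = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(delta @ S_inv @ delta))


# Same Euclidean offset from the origin, different Mahalanobis distances
d_wide_axis = mahalanobis_distance([2, 0], mu, S_inv)
d_narrow_axis = mahalanobis_distance([0, 2], mu, S_inv)
print(d_wide_axis, d_narrow_axis)
```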


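__Cosine Distance__: Most frequently used to measure similarity of documents; Applied to term frequency vectors constructed from the content of documents. The formula below gives the cosine __similarity__ of two vectors; the corresponding cosine distance is $1 - \cos\theta$.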

## $\cos\theta = \frac {\overrightarrow{a} \cdot \overrightarrow{b}} {||\overrightarrow{a}|| ||\overrightarrow{b}||} $

Compare the result of the equation to the following cosine angle values to determine how similar your documents are:

- $\cos\theta = 1$ : Vectors are pointing in the same direction => documents are very similar


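- $\cos\theta = 0$ : Vectors are orthogonal => Documents share no terms and are unlikely to be related to one another


- $\cos\theta = (- 1)$ : Vectors are pointing in opposite directions => Documents are completely dissimilar. Note that term frequency vectors contain only non-negative counts, so for them $\cos\theta$ always falls between 0 and 1; negative values can only arise for vectors that have negative components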
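The interpretation above can be checked with a small sketch that builds term frequency vectors using only the Python standard library. The function name and the sample documents are made up for this example, and the whitespace tokenization is a deliberate simplification.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two non-empty documents via term frequency vectors."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    vocabulary = set(tf_a) | set(tf_b)
    dot = sum(tf_a[term] * tf_b[term] for term in vocabulary)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    return dot / (norm_a * norm_b)


doc1 = "the cat sat on the mat"
doc2 = "the cat sat on the mat"
doc3 = "dogs chase birds"

print(cosine_similarity(doc1, doc2))  # identical documents
print(cosine_similarity(doc1, doc3))  # no shared terms
```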


Unfortunately, __there is no single "rule of thumb" or specific set of guidelines for determining which distance function to apply__.  Therefore, __apply your empirical skills__ and test various distance functions to derive a KNN model that works best relative to your data. 

### Implementing KNN in Python

__sklearn__ includes a pre-built KNN classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


An example from the assigned readings for Module 8: https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
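As a minimal sketch of testing several distance functions empirically, the example below fits __KNeighborsClassifier__ with three different metrics. The iris data set, the value K = 5, and the train/test split settings are assumptions made for this illustration and do not come from the assigned readings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate distance metrics; p is the Minkowski exponent
candidate_metrics = {
    "manhattan (p=1)": {"metric": "minkowski", "p": 1},
    "euclidean (p=2)": {"metric": "minkowski", "p": 2},
    "chebyshev": {"metric": "chebyshev"},
}

scores = {}
for name, params in candidate_metrics.items():
    model = KNeighborsClassifier(n_neighbors=5, **params)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

In practice the comparison would normally be made with cross-validation and over several values of K rather than with a single split.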

# Support Vector Machines

[Youtube Video:](https://yu.instructure.com/courses/63488/pages/m8-a-dot-4-support-vector-machines-basics?module_item_id=1157320)

PAGE 113 https://github.com/mattharrison/ml_pocket_reference/blob/master/ch10.ipynb

__Support Vector Machines (SVM)__ are __supervised machine learning algorithms__ most frequently used for solving __classification__ problems.


- SVM uses the concept of __margin classification__, wherein we attempt to identify classifications within a data set by deriving a decision boundary that maximizes the distance between groups of data points. 


- SVM identifies __parallel hyperplanes__ that separate the classes of data __via the maximum distance possible__ relative to the constraints of the data set. 


- The region bounded by the hyperplanes is called the __"margin"__, and __the maximum-margin hyperplane is the hyperplane that lies halfway between them__. The __“Support Vectors”__ are __the data points that lie along the edges of the margin__, i.e., on the parallel hyperplanes on either side of the maximum-margin hyperplane. From the assigned readings: https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989


- SVM can be used for both linear + non-linear classification tasks.


- SVM is well suited for use with small to medium size, relatively complex data sets
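The ideas of margin and support vectors can be inspected on a fitted model. The sketch below assumes a linear kernel and a synthetic two-cluster data set generated with `make_blobs`; both are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# With a linear kernel, coef_ holds the weight vector w of the separating hyperplane;
# the distance between the two parallel margin hyperplanes is 2 / ||w||
w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)

print("support vectors:", len(clf.support_vectors_), "of", len(X), "points")
print("margin width:", round(margin_width, 3))
```

Only a small subset of the training points ends up as support vectors; the remaining points have no influence on where the decision boundary is placed.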


### Using SVM with Non-Linear Data

SVM can be successfully applied to non-linear data by adding additional polynomial features to a model. However, instead of adding a significant number of features to a model, which would negatively impact our ability to implement an effective model, we make use of a __kernel trick__, which allows us to achieve the results of including a high degree of new features without actually adding them to our data. 


__How the "kernel trick" approach works__:  We map our non-linearly separable data into a higher dimensional space via a mathematical function. We then try to find a hyperplane within that higher dimensional space that can effectively separate the samples.
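A small worked example may help show why this is called a "trick". For a degree-2 polynomial kernel on 2-D inputs, the kernel value $(u \cdot v)^2$ equals the dot product of the explicitly mapped 3-D feature vectors, so the higher dimensional features never need to be constructed. The function names and sample points below are chosen for this illustration.

```python
import math


def phi(v):
    """Explicit degree-2 feature map for a 2-D point: R^2 -> R^3."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(u, v) ** 2


u = (1.0, 2.0)
v = (3.0, 0.5)

explicit = dot(phi(u), phi(v))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(u, v)    # same value without ever building phi(u) or phi(v)

print(explicit, implicit)
```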


__What types of kernel tricks are commonly used?__: 

- Polynomial


- Radial Basis Function (RBF)


- Gaussian (a special case of RBF)


- Sigmoid


See this link for a detailed discussion of kernel tricks: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d


Unfortunately, __there is no single "rule of thumb" or specific set of guidelines for determining which of the non-linear kernel tricks to apply__.


If your data is non-linear, defaulting to the use of the __RBF__ kernel trick is often suggested. However, the __polynomial__ kernel can be effective in many instances. Therefore, __apply your empirical skills__ and test various combinations of kernel tricks + SVM tuning parameter settings to derive an SVM model that works best relative to your data. 


__How do we implement SVM + kernel tricks in Python?__:  We can use a pre-built SVM classifier provided within the __scikit-learn__ library:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html


The __sklearn.svm.SVC()__ class includes a __kernel__ parameter that allows us to select the kernel function to be applied within the SVM classifier. We can choose from ‘linear’ (use when you have data that is known to be linearly separable), ‘poly’ (polynomial), ‘rbf’ (radial basis function, the default), and ‘sigmoid’. We can also construct + use our own kernel function if we prefer (simply set the "kernel =" parameter to the name of your Python function).
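A minimal sketch of comparing kernels empirically is shown below. The `make_moons` data set, the split settings, and the `my_linear_kernel` function are assumptions made for this example; all other __SVC__ settings are left at their defaults.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linearly separable data (illustrative only)
X, y = make_moons(n_samples=300, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def my_linear_kernel(A, B):
    """Hypothetical user-defined kernel: returns the matrix of pairwise dot products."""
    return A @ B.T


kernels = ["linear", "poly", "rbf", "sigmoid", my_linear_kernel]

scores = {}
for kernel in kernels:
    name = kernel if isinstance(kernel, str) else kernel.__name__
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

A fuller comparison would also tune parameters such as `C`, `gamma`, and `degree` for each kernel, for example with a cross-validated grid search.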

## Module 8 Assignment Guidelines / Requirements
[See link](https://yu.instructure.com/courses/63488/assignments/324310)
