## Choosing an algorithm when you have a...

#### Binary target variable

Try logistic regression or probit regression.

#### Categorical target variable

- Multinomial logistic regression (for example, in R)
- Naive Bayes
- Linear Discriminant Analysis and Quadratic Discriminant Analysis (LDA and QDA)
- Decision Trees or Random Forests
- k-Nearest Neighbors

#### Ordinal target variable

Try ordinal logistic regression. Mord offers a newer [Python implementation](http://pythonhosted.org/mord/), but it may be wiser to use [Proportional Odds Logistic Regression in R](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/polr.html).

# There are many ways to characterize algorithms.

## I. Generative versus Discriminative Algorithms

#### Generative models
Attempt to model the class-conditional distribution of the features, P(X|y), and the prior distribution of the classes, P(y), and then combine the two with Bayes' rule to find the most likely outcome given a set of features.

$$ P(y|X) = \frac{P(X|y)P(y)}{P(X)}$$

So in order to find P(y|X), we first have to estimate the likelihood P(X|y) and the prior distribution P(y) [(see Appendix I for why P(X) is not needed)](#why-no-px). Both are usually estimated from the training data, for example by maximum likelihood estimation (MLE).

**Bonus 1:** If you create a generative model, you can *generate* synthetic data by sampling a class from the prior P(y) and then sampling features from P(X|y).

**Bonus 2:** Generative models often outperform discriminative models on smaller datasets, as their stronger assumptions make them less prone to overfitting.

#### Examples of generative models
- Naive Bayes
- LDA and QDA
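
As a rough illustration of the generative recipe above, the following sketch fits a Gaussian Naive Bayes classifier in plain Python. It assumes each feature is normally distributed and independent of the others given the class. The function names (`fit_gaussian_nb`, `log_joint`, `predict`, `sample`) and the toy data are made up for this example and do not come from any library.

```python
import math
import random
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior P(y) and per-feature Gaussian parameters for P(X|y)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum likelihood variance, floored to avoid division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = {"prior": n / len(X), "means": means, "variances": variances}
    return model

def log_joint(model, x, label):
    """Return log P(x|y) + log P(y), assuming features are independent given y."""
    params = model[label]
    total = math.log(params["prior"])
    for value, mean, var in zip(x, params["means"], params["variances"]):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total

def predict(model, x):
    """Pick the class with the largest joint probability P(x, y)."""
    return max(model, key=lambda label: log_joint(model, x, label))

def sample(model, rng):
    """Generate one synthetic (features, label) pair from the fitted model."""
    labels = list(model)
    label = rng.choices(labels, weights=[model[l]["prior"] for l in labels])[0]
    params = model[label]
    features = [rng.gauss(m, math.sqrt(v))
                for m, v in zip(params["means"], params["variances"])]
    return features, label

# Toy data (made up for illustration): two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 4.0]))
print(sample(model, random.Random(0)))
```

Note that `predict` never computes P(X): comparing the joint scores across classes is enough to pick the most likely one.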

#### Discriminative models
Attempt to find a decision boundary between classes by directly estimating the posterior probabilities:
$$ P(y|X) $$

**Bonus 1:** Tend to outperform generative models on larger datasets, as they learn P(y|X) directly.

**Bonus 2:** Discriminative models require fewer assumptions about the joint distribution of your features, such as their conditional independence.

#### Examples of discriminative models
- Logistic Regression
- SVM
- kNN
- neural networks
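
For contrast with the generative approach, here is a minimal sketch of a discriminative model: a logistic regression that learns P(y|X) directly by gradient ascent on the conditional log-likelihood. The learning rate, epoch count, function names, and toy data are assumptions chosen for illustration, not recommended settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient ascent on the log-likelihood of y given X."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)
            err = label - p  # gradient of the log-likelihood w.r.t. the logit
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Return the modeled P(y = 1 | x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data (made up for illustration): two features, labels 0 and 1.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba(w, b, [1.1, 2.0]))  # expected to fall below 0.5
print(predict_proba(w, b, [3.1, 4.0]))  # expected to fall above 0.5
```

Nothing here models how the features themselves are distributed; only the boundary between the classes is learned.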

#### More details:
1. [Stats.StackExchange](http://stats.stackexchange.com/questions/12421/generative-vs-discriminative)
2. [A useful lecture from Columbia](https://www.ee.columbia.edu/~dpwe/e6820/lectures/L03-ml.pdf)

## II. Parametric versus non-parametric

### Parametric algorithms
Start with a particular form for the mapping function, such as this combination of features, which is linear in its coefficients:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

Features may be transformed to satisfy the form's underlying assumptions (e.g. normality or linearity), and the coefficients are found by fitting the model's form to the training data.

In fact, many parametric models assume the target follows an exponential-family distribution and apply a transformation (a link function) so that the model operates as a linear combination of features. Examples, such as logistic regression, can be seen here: http://www.cs.princeton.edu/courses/archive/spr09/cos513/scribe/lecture11.pdf
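
To make "fitting the training data to the model's form" concrete, the sketch below fits the two-feature form with an interaction term shown above by ordinary least squares, solving the normal equations with a small hand-written solver. The coefficients used to generate the toy data and all function names are hypothetical, chosen only so the fit can be checked against known values.

```python
def design_row(x1, x2):
    """Map raw features to the columns of the assumed form, including the interaction."""
    return [1.0, x1, x2, x1 * x2]

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_least_squares(points, targets):
    """Return the coefficients minimizing squared error via the normal equations."""
    Z = [design_row(x1, x2) for x1, x2 in points]
    k = len(Z[0])
    ZtZ = [[sum(row[i] * row[j] for row in Z) for j in range(k)] for i in range(k)]
    Zty = [sum(row[i] * t for row, t in zip(Z, targets)) for i in range(k)]
    return solve(ZtZ, Zty)

# Noise-free toy data generated from hypothetical coefficients.
true_beta = [1.0, 2.0, -1.0, 0.5]
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3), (3, 2)]
targets = [sum(b * z for b, z in zip(true_beta, design_row(x1, x2)))
           for x1, x2 in points]

beta = fit_least_squares(points, targets)
print([round(b, 6) for b in beta])
```

However much data is supplied, the fitted model is always summarized by the same four coefficients, which is what makes it parametric.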

#### Examples: #### 
- Logistic Regression
- Linear Discriminant Analysis
- Perceptron
- Naive Bayes
- Simple Neural Networks

#### Choose a parametric algorithm when: ####
- you want the results to be interpretable and insightful, not just predictive
- you know how the distributions should behave, and you may not have a very large dataset
- you have intuition into how your features behave and interact to produce a target variable

#### Beware of: ####
- underfitting: the assumed form may be too rigid to capture the true relationship

### Non-parametric algorithms

Make fewer assumptions about the form of the mapping function, which lets them fit the underlying data more flexibly while still aiming to generalize to new data.

#### Examples: ####
- k-Nearest Neighbors
- Decision Trees
- Support Vector Machines

#### Choose a non-parametric algorithm when: ####
- you don't have an ideal mapping function that applies to every case
- you have a lot of data

#### Beware of: ####
- overfitting
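
A k-nearest-neighbors classifier is a compact way to see both the flexibility and the overfitting risk. In the sketch below there are no coefficients to fit: the training data itself is the model. The function name, the toy data, and the deliberately mislabeled point are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up); the last point is a deliberately mislabeled outlier.
train_X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
           [3.0, 4.1], [3.2, 3.9], [2.9, 4.0],
           [1.1, 2.2]]
train_y = ["a", "a", "a", "b", "b", "b", "b"]

query = [1.1, 2.15]
print(knn_predict(train_X, train_y, query, k=1))  # follows the single noisy neighbor
print(knn_predict(train_X, train_y, query, k=5))  # smoothed by a larger neighborhood
```

With `k=1` the prediction copies the noisy label next to the query, while a larger `k` averages it away. Tuning `k` is one way to trade flexibility against overfitting.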


## Appendix

<a name="why-no-px"> I. Modeling the joint likelihood for generative models</a> 

We are trying to find P(y|X) given the following function: $$ P(y|X) = \frac{P(X|y)P(y)}{P(X)}$$

In this case, the normalizing constant P(X) does not need to be estimated in order to find $\underset{y}{\operatorname{argmax}}\, P(X|y)P(y)$, as it is invariant with respect to y. Furthermore, $P(X|y)P(y) = P(X,y)$, so generative models are in effect modeling the joint probability P(X,y) instead of modeling the conditional probability P(y|X) directly.
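
A small numeric check can make this tangible. The probabilities below are hypothetical values for a single observed X and two classes. Dividing by P(X) rescales every class score by the same amount, so the winning class does not change.

```python
# Hypothetical numbers for one observed X and two classes.
prior = {"spam": 0.3, "ham": 0.7}          # P(y)
likelihood = {"spam": 0.05, "ham": 0.01}   # P(X|y) for the observed X

# Joint probability P(X, y) = P(X|y) P(y).
joint = {label: likelihood[label] * prior[label] for label in prior}

# Evidence P(X) is the sum of the joint over all classes.
evidence = sum(joint.values())

# Posterior P(y|X) from Bayes' rule.
posterior = {label: joint[label] / evidence for label in joint}

best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_joint, best_by_posterior)
```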