# Machine Learning for Data Science and Analytics

# Machine Learning

## Algorithms in Machine Learning
### Massive Data Sets
- Dealing with massive data sets requires the use of efficient data structures  
- Need to compute summaries and samples of data  
- Need randomized algorithms  
- Need to store data so that it can be efficiently accessed (*binary trees*, *hashing*, ...)  
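
As one illustration of randomized sampling from data too large to hold in memory, the sketch below uses reservoir sampling. This is a standard technique that these notes do not name explicitly, and the function name `reservoir_sample` and its parameters are illustrative choices.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

# Only k items are ever stored, no matter how long the stream is.
sample = reservoir_sample(range(1_000_000), k=5)
```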

### Types of learning problems 
#### Supervised
**Labeled patterns are available**  
The task of learning a function that maps an input to an output based on example input-output pairs.  

#### Unsupervised
**No labeled patterns**  
Helps find previously unknown patterns in a data set without pre-existing labels.  

#### A machine learning method is a computer algorithm that searches for patterns in data
- Patterns are learned from data  
- Mathematical tools for describing patterns: Statistics and probability  
- Machine Learning combines these with optimization, algorithms, and other tools  

**f(state, action) = next state**

## Classification
Predicts the category to which a data point belongs.  

![classification](https://cdn-images-1.medium.com/max/788/1*I4zVi0P_jUaq2FmYvMQg1Q.jpeg)

- Need to compute **distances** and minimum distances between a point and a line (**shortest path**)  
- Computing a **linear separator** (*classifier*) is a linear programming problem  
- Computing a **non-linear separator** (*support vector machine*) requires algorithms for problems such as **semi-definite programming** or **convex programming**
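
The distance from a point x to a separating line (or hyperplane) w·x + b = 0 is |w·x + b| / ||w||. A minimal sketch of that formula, where `distance_to_hyperplane`, `w`, and `b` are illustrative names rather than anything defined in these notes:

```python
import math

def distance_to_hyperplane(x, w, b):
    """Shortest Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot + b) / math.hypot(*w)

# Line 3x + 4y - 5 = 0 and the origin.
origin_distance = distance_to_hyperplane((0.0, 0.0), w=(3.0, 4.0), b=-5.0)
```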

##### Feature space
- Data = points in ***R***<sup>d</sup>  
- Dimensions = scalar measurements  

##### Classifier functions
- A classifier for K classes is a function  
    - f : ***R***<sup>d</sup> -> {1, ..., K}  
- Classifiers carve up the feature space into regions 

#### Quantifying mistakes
- Loss function for K classes : 
    - loss : {1, ..., K} x {1, ..., K} -> [0, ∞)  
    - loss(f(x), true class of x)
    - If all mistakes are equally bad :  
        - loss(i, j) = { 1 if i != j; 0 if i = j}
- If the class distributions are known :  
    - Risk of classifier is the expected loss  
        - risk(f) = E[loss(f(X), true class of X)]  
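
The expectation in the risk cannot be computed without the true distribution, but averaging the loss over a labeled sample gives an estimate of it. The sketch below shows the 0-1 loss and that sample average; the toy classifier and data are made up for illustration.

```python
def zero_one_loss(predicted, actual):
    """All mistakes are equally bad: 1 for a wrong class, 0 for the right one."""
    return 0 if predicted == actual else 1

def empirical_risk(classifier, points, labels, loss=zero_one_loss):
    """Average loss over a labeled sample, an estimate of the expected loss."""
    total = sum(loss(classifier(x), y) for x, y in zip(points, labels))
    return total / len(points)

def threshold_classifier(x):
    """Hypothetical 1-D classifier: class 1 for negative inputs, class 2 otherwise."""
    return 1 if x < 0 else 2

points = [-2.0, -1.0, 0.5, 3.0]
labels = [1, 2, 2, 2]
risk_estimate = empirical_risk(threshold_classifier, points, labels)
```

With the 0-1 loss, this estimate is simply the proportion of misclassified points.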

    
### Nearest Neighbor Classification
Use training data as classifier  
- Given : Data point x  
- Find training data point closest to x  
- Assign x the label of closest point
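
The three steps above can be sketched as a linear scan over the training set. Euclidean distance is an assumption here, since the notes do not fix a distance measure, and the data are made up.

```python
import math

def nearest_neighbor_classify(x, training_points, training_labels):
    """Return the label of the training point closest to x (Euclidean distance)."""
    best_index = min(
        range(len(training_points)),
        key=lambda i: math.dist(x, training_points[i]),
    )
    return training_labels[best_index]

training_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
training_labels = ["a", "a", "b", "b"]
predicted = nearest_neighbor_classify((4.0, 4.5), training_points, training_labels)
```

Every query visits every training point and every coordinate, so the cost per query grows with both the size of the data set and its dimension.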

#### Drawbacks
- In a large data set, finding the nearest data point is expensive  
- Expense also grows with dimension
- Was an important method when data sets were small  

### K-means Clustering
**Clustering algorithm**.  
K is the **number of clusters** into which the data points are grouped.  

<img src="https://cdn-images-1.medium.com/max/1600/1*K3DzIBwc6jlBuGM0W0vYjQ.png" alt="k-means" width="400"/>

- Requires :  
    - Shortest paths  
    - Randomization  
    - Graph models  
    - Approximating NP-complete problems  
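
Finding the optimal K-means clustering is hard in general, so in practice it is approximated. The sketch below uses Lloyd's algorithm, the usual heuristic, with random initialization. The notes do not say which procedure they have in mind, so treat this as one possible instance; the function name, the fixed iteration count, and the toy data are all assumptions.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's heuristic: alternate between assigning points and moving centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomized initialization
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignments

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = k_means(points, k=2)
```

The result depends on the random starting centers, which is why the procedure is often restarted several times.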


## Linear Classifiers
Classification decision based on the value of a **linear combination** of the characteristics.  
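
For two classes, such a decision can be written as the sign of w·x + b. A minimal sketch, where the weights `w`, the offset `b`, and the -1/+1 class labels are illustrative assumptions:

```python
def linear_classify(x, w, b):
    """Classify x by the sign of the linear combination w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical weights: class +1 whenever the first feature is at least the second.
w, b = (1.0, -1.0), 0.0
label_a = linear_classify((2.0, 1.0), w, b)
label_b = linear_classify((1.0, 2.0), w, b)
```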

#### Hyperplane
Hyperplanes define decision boundaries.  
Support vector machines are built on them.  
![hyperplane](https://images.deepai.org/glossary-terms/3bb86574825445cba73a67222b744648/hyperplane.png)  

#### Limitations
- Problem 1 : Curved optimal decision boundary :  
    - Can be addressed using the **kernel trick**  
- Problem 2 : Classes may overlap; SVMs address this by :  
    - Permitting misclassified training points
    - Each such point contributes a *cost* to the optimization target function  
- Problem 3 : More than two classes :  
    - Can be addressed by combining multiple linear classifiers  
    - There are several ways to do so; each has drawbacks  
 

## Ensemble Classifiers
- Train many *weak* classifiers  
- Combine results by majority vote  

#### Error rate
- Proportion of misclassified points  
- Expected number of errors

#### Weak Classifier
- Consider two classes of equal size  
- Assigning class by coin flip : *50 %* expected error  
- Weak classifier = error rate slightly below 50 % (slightly better than a coin flip)  

#### Classification by majority vote 
- **m classifiers** take a vote, m is an odd number  
- Two choices :  
    - One is correct  
    - One is wrong  
- Decision is made by simple majority  
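
To see why voting helps, suppose each of the m classifiers is wrong with the same probability p < 0.5 and that they err independently. Independence is a simplifying assumption that real ensembles only approximate. The majority is then wrong only when more than half of the classifiers are wrong, which is a binomial tail probability:

```python
from math import comb

def majority_vote_error(m, p):
    """Probability that more than half of m independent classifiers are wrong,
    when each one is wrong with probability p (m is assumed odd)."""
    return sum(
        comb(m, j) * p**j * (1 - p) ** (m - j)
        for j in range(m // 2 + 1, m + 1)
    )

error_3 = majority_vote_error(3, 0.4)
error_101 = majority_vote_error(101, 0.4)
```

Under these assumptions the error of the vote shrinks as m grows, even though each individual classifier is only slightly better than chance.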

For two classes labeled -1 and +1 and classifiers f1, ..., fm :  
- Majority vote at input x = sgn(∑fi(x))
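
A minimal sketch of this vote, assuming each classifier returns -1 or +1 and m is odd, so the sum is never zero. The three threshold classifiers are made up for illustration.

```python
def majority_vote(classifiers, x):
    """Return the sign of the summed votes; each classifier returns -1 or +1."""
    total = sum(f(x) for f in classifiers)
    return 1 if total > 0 else -1

# Three hypothetical threshold classifiers on a 1-D input.
classifiers = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 1.0 else -1,
    lambda x: 1 if x > 2.0 else -1,
]
vote_low = majority_vote(classifiers, 0.5)   # individual votes: +1, -1, -1
vote_high = majority_vote(classifiers, 1.5)  # individual votes: +1, +1, -1
```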

#### Tree Classifiers 
Uses a **decision tree** to go from observations about an item to conclusions about the item's target value.  
![tree classifier](https://upload.wikimedia.org/wikipedia/commons/2/25/Cart_tree_kyphosis.png) 

#### Training
- Input n training points of classes 1, ..., K  
    - Select n points uniformly at random with replacement  
    - Train a tree on the randomized data set  
- Repeat m times  
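
The resampling loop above can be sketched as follows. The tree trainer is passed in as `train_tree`, a hypothetical callable; the stub used here only predicts the most common label, so that the example runs without a real tree implementation.

```python
import random
from collections import Counter

def bootstrap_sample(points, labels, rng):
    """Draw n (point, label) pairs uniformly at random with replacement."""
    n = len(points)
    indices = [rng.randrange(n) for _ in range(n)]
    return [points[i] for i in indices], [labels[i] for i in indices]

def train_ensemble(points, labels, train_tree, m, seed=0):
    """Train m classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_tree(*bootstrap_sample(points, labels, rng)) for _ in range(m)]

def train_majority_stub(points, labels):
    """Placeholder for a real tree trainer: always predicts the most common label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda x: most_common

points = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = [1, 1, 1, 2]
ensemble = train_ensemble(points, labels, train_majority_stub, m=5)
```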

#### Empirical observation
- A tree ensemble typically performs reasonably well  
- However, the individual trees are too dependent on (similar to) one another  
    
#### Tree training
- In each step, compute the best split point along each axis  
- Then split along the axis that minimizes the error  
- The split is therefore optimized over all axes  
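
One step of this search can be sketched as below. It assumes the error of a split is the number of training points that disagree with the majority label on their side, and it tries midpoints between neighboring values as split points. Both are illustrative choices; practical tree learners often use other split criteria.

```python
from collections import Counter

def split_errors(points, labels, axis, threshold):
    """Count training errors when each side of the split predicts its majority class."""
    left = [y for p, y in zip(points, labels) if p[axis] <= threshold]
    right = [y for p, y in zip(points, labels) if p[axis] > threshold]
    errors = 0
    for side in (left, right):
        if side:
            errors += len(side) - Counter(side).most_common(1)[0][1]
    return errors

def best_split(points, labels):
    """Return (axis, threshold, errors) for the best axis-aligned split,
    or None if no axis has two distinct values."""
    best = None
    for axis in range(len(points[0])):
        values = sorted({p[axis] for p in points})
        # Candidate thresholds: midpoints between neighboring distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            errors = split_errors(points, labels, axis, threshold)
            if best is None or errors < best[2]:
                best = (axis, threshold, errors)
    return best

points = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
split = best_split(points, labels)
```

In this toy data the classes separate cleanly along the first axis, so the search settles on a split there.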
    
#### Random Forests
A random forest constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.  
- In each step, select small random subset of axes  
- Only optimize over those  
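
The only change relative to ordinary tree training is which axes are searched. In the sketch below, `axis_error` is a hypothetical callable that returns the error of the best split along a given axis; a per-axis error table stands in for it here.

```python
import random

def choose_split_axis(num_axes, subset_size, axis_error, rng):
    """Pick the best axis among a small random subset of all axes."""
    candidates = rng.sample(range(num_axes), subset_size)
    return min(candidates, key=axis_error)

# Made-up per-axis errors standing in for a real split search.
errors_by_axis = [3, 0, 2, 5, 1]
rng = random.Random(0)
chosen = choose_split_axis(5, 2, errors_by_axis.__getitem__, rng)
```

Because each step sees only a few axes, different trees tend to split on different features, which makes them less alike.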

## Model Selection

### Terminology
- Model  
    - Family of classifiers
- Parameter  
    - Indexes different classifiers within model  
- Hyperparameter  
    - Indexes different models  

#### Training error of a classifier
Training error = number of misclassified training points / number of training points  

#### Prediction error of a classifier
- Defined with respect to the distribution of the underlying data source  
    - Prediction error = E[proportion of misclassified points]  
    - If we assume all errors cost the same, this is the risk  

#### Classifier parameters
- Tree classifier  
    - Number of splits  
- Tree ensemble  
    - Tree parameters  
    - Number of trees  
- Random Forest  
    - Tree parameters  
    - Number of trees  
    - Number of random dimensions

### How do we select an adequate model based on sample data?
- Model selection chooses a model complexity (*hyperparameter*)  
- Training a classifier chooses parameter values  
- The training can often be formulated as minimizing the training error  
- Model selection **cannot** be performed by minimizing the training error; that would lead to overfitting

#### If we knew underlying distribution
- Train classifiers with different hyperparameters on training data  
- Compute prediction errors under true distribution  
- Choose the one with smallest prediction error  

**Separating model selection and training prevents** ***overfitting*** 

#### Approximation by sample data 
- Split training data set  
- Train on set 1  
- Test predictive performance on set 2  
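
A minimal sketch of the split itself, assuming the data can be shuffled freely. The function name `random_split` and the 70/30 proportion are illustrative choices.

```python
import random

def random_split(items, fraction_first, seed=0):
    """Shuffle a copy of the items and cut it into two disjoint parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(fraction_first * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))  # stands in for (point, label) pairs
set_1, set_2 = random_split(data, 0.7)
```

Every item lands in exactly one of the two sets, so the classifier is never tested on a point it was trained on.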

Data splitting estimates the prediction error from data.  
Prediction error estimates can be used in two ways :  
- Model selection  
- Classifier assessment  

Classifier assessment estimates the prediction error of the final choice of classifier.  

- Model selection -> Optimize performance  
- Classifier assessment -> Interpret performance  

## Cross Validation
**Cross validation selects model and assesses classifier**

- Split data into *three* sets :  
    1) Training set  
    2) Test set  
    3) Validation set (*hold-out set*)
- Train classifiers with different hyperparameters on training set  
- Select the one with smallest prediction error on test set  
- Estimate performance on validation set  
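
The procedure above can be sketched as follows, using the set names of these notes (select on the test set, assess on the validation set). The `trainers` mapping from a hyperparameter value to a training function is a hypothetical interface, and the threshold classifier with an `offset` hyperparameter is a toy example.

```python
def error_rate(classifier, points, labels):
    """Proportion of misclassified points."""
    mistakes = sum(1 for x, y in zip(points, labels) if classifier(x) != y)
    return mistakes / len(points)

def select_and_assess(trainers, train_set, test_set, validation_set):
    """Train one classifier per hyperparameter, pick the best on the test set,
    then estimate its performance on the untouched validation set."""
    fitted = {h: fit(*train_set) for h, fit in trainers.items()}
    best = min(fitted, key=lambda h: error_rate(fitted[h], *test_set))
    return best, error_rate(fitted[best], *validation_set)

def make_trainer(offset):
    """Toy model family: threshold at the training mean plus a fixed offset."""
    def fit(points, labels):
        threshold = sum(points) / len(points) + offset
        return lambda x: 1 if x > threshold else 0
    return fit

trainers = {offset: make_trainer(offset) for offset in (-1.0, 0.0, 1.0)}
train_set = ([0.0, 0.5, 1.5, 2.0], [0, 0, 1, 1])
test_set = ([0.5, 1.5, 2.5, 0.2], [0, 1, 1, 0])
validation_set = ([0.8, 1.2, 3.0, 0.9], [0, 1, 1, 1])
best_offset, validation_error = select_and_assess(
    trainers, train_set, test_set, validation_set
)
```

The validation set plays no part in choosing the hyperparameter, so its error estimate is not biased by the selection step.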

**Prediction error estimate on test set is confounded by model selection**  

### How to split the data
- If samples are assumed independent and identically distributed -> split at random (sampling without replacement)  
- How large should each set be ? 
    - Large training set -> More accurate classifier  
    - Small training set -> Reflects variation between sample sets  
    
![cross validation](https://3gp10c1vpy442j63me73gy3s-wpengine.netdna-ssl.com/wp-content/uploads/2018/03/Screen-Shot-2018-03-21-at-4.26.53-PM.png)

### K-fold cross validation
- Remove validation set and set it aside  
- Subdivide remaining data into K equally sized blocks


Once a classifier is chosen, estimate its performance on the validation set. 
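
The notes stop after the subdivision step. In the usual form of K-fold cross validation, each block is held out once while a classifier is trained on the other K - 1 blocks, and the K error estimates are averaged. The sketch below follows that common scheme as an assumption. The `fit` argument is a hypothetical training function, the data are assumed to be already shuffled, and there must be at least K points.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous blocks of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    blocks, start = [], 0
    for size in sizes:
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def cross_validation_error(fit, points, labels, k):
    """Average held-out error over k folds."""
    errors = []
    for held_out in k_fold_indices(len(points), k):
        held = set(held_out)
        train_idx = [i for i in range(len(points)) if i not in held]
        classifier = fit([points[i] for i in train_idx], [labels[i] for i in train_idx])
        mistakes = sum(1 for i in held_out if classifier(points[i]) != labels[i])
        errors.append(mistakes / len(held_out))
    return sum(errors) / k

def fit_majority(points, labels):
    """Toy trainer: always predicts the most common training label."""
    top = Counter(labels).most_common(1)[0][0]
    return lambda x: top

points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [1, 1, 1, 1, 1, 0]
cv_error = cross_validation_error(fit_majority, points, labels, k=3)
```

Running this once per hyperparameter value and keeping the value with the smallest averaged error takes the place of a single fixed test set.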

# Machine Learning Applications
## Probabilistic Modeling