# **CLASSIFICATION – PREDICT CATEGORY (SUPERVISED)**  

The dependent variable y (the outcome we're trying to predict) is a category.  
Given input features (independent variables x), we try to produce discrete labels (dependent variable y).  

## **Properties**
- Output Type: Discrete (class labels)
- What you're trying to find: Decision Boundary
- Evaluation: Accuracy  

## Examples
An ML algorithm takes data and transforms it into a decision surface that lets us determine the class of any future case.  
We try to learn to predict the target (Y) given the input (X)

- Whether or not it will rain tomorrow  
- Whether or not Google’s stock price will rise or fall tomorrow  
- Whether Katie will like or dislike a song
    - We don't input raw music; instead we extract 'features' (intensity, tempo, genre, gender) from the music
    - then Katie's brain processes them into one of two categories: Like and Don't like.  

## Discriminative vs Generative Classifiers
### **Discriminative**
- We start with X, we get Y
- Classifiers like logistic regression model this directly (discriminative)  
    
### **Generative**
- We start with Y (the class) and model X  
- Think of each class as a ‘data-making machine’ (it ‘generates’ the data)   
- Example: Naïve Bayes    

## Coding 
### **Split > Fit > Predict > Score It**   
Fit aka 'train'  
Algorithms that can learn from observational data and make predictions based on it  

### **Using 2 Main Functions**   
train(x, y), called fit(X, y) in scikit-learn  
predict(x)


### **Data Types and Shapes**    
scikit-learn requires everything to be numerical  

X is a **MATRIX** of shape N x D (two brackets)  
X = dataset[['feature1','feature2','feature3']]   
- N = number of observations (rows)
- D = number of features (dimensions or columns)

y is a **VECTOR** of length N (1 bracket)  
y = dataset['target']
- for classification, y contains discrete integers from 0...k-1, where k is the number of classes
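
A minimal sketch of these shapes, assuming pandas is available. The toy `dataset` and its values are made up for illustration; only the column names mirror the placeholders above.

```python
import pandas as pd

# Hypothetical toy dataset; values are invented for illustration.
dataset = pd.DataFrame({
    "feature1": [5.1, 4.9, 6.2, 5.9],
    "feature2": [3.5, 3.0, 2.9, 3.0],
    "feature3": [1.4, 1.4, 4.3, 5.1],
    "target":   [0, 0, 1, 2],
})

X = dataset[["feature1", "feature2", "feature3"]]  # two brackets -> 2-D table
y = dataset["target"]                               # one bracket  -> 1-D vector

print(X.shape)  # (4, 3): N = 4 observations, D = 3 features
print(y.shape)  # (4,)
```
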

--- 

## **1.Logistic Regression  - - LINEAR**    
a.	Pros: Probabilistic approach, gives information about statistical significance of features  
b.	Cons: The Logistic Regression Assumptions  
c.	Use Logistic Regression (or Naive Bayes if the problem is non-linear) when you want to rank your predictions by their probability.   
- For example, you may want to rank your customers from the highest probability that they buy a certain product to the lowest.   
- That ranking lets you target your marketing campaigns. For this type of business problem, use Logistic Regression if your problem is linear.  
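
The ranking idea can be sketched with `predict_proba`. This is only an illustration: the data is synthetic (`make_classification` stands in for real customer records), and label 1 is assumed to mean "bought the product".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer data; label 1 = bought the product (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one column per class; column 1 is P(y = 1 | x).
buy_probability = clf.predict_proba(X)[:, 1]

# Customer indices sorted from most to least likely to buy.
ranking = np.argsort(buy_probability)[::-1]
print(ranking[:5], buy_probability[ranking[:5]])
```
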

---

## **2.Support Vector Machine (SVM) -- LINEAR**    
Use when you want to predict which segment your customers belong to.  
Use SVC to classify data using SVM      
These can be any kind of segments, for example market segments you identified earlier with clustering.  
Can use different ‘kernels’ (linear, rbf, polynomial). Some will work better than others for a given data set    
Puts correct classification of the labels first and foremost, then maximizes the margin
- Margin = distance between the line and the nearest point of each of the two classes    
- we want to maximize it because that is where the separator is most robust  

Pros:   
- Performant, not biased by outliers, not sensitive to overfitting  
- Works well for classifying higher dimensional data (lots of features)

Cons:   
- Not appropriate for non linear problems, not the best choice for large # of features  
- Need to scale features  
 
#Linear SVM  

#Split  
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .20, random_state= 0)  

#Fit  
from sklearn.svm import SVC  
clf = SVC(kernel = 'linear')  
clf.fit(X_train, y_train)  
 
#Predict  
pred = clf.predict(X_test)  

#Score  
from sklearn.metrics import accuracy_score  
acc = accuracy_score(y_test, pred)  
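
Since the cons above note that features need scaling, here is a runnable sketch of the same Split > Fit > Predict > Score flow with a `StandardScaler` in front of the SVM. The synthetic data and the pipeline are assumptions added for illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Fit: the pipeline fits the scaler on the training data only, then the SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)

# Predict
pred = clf.predict(X_test)

# Score: accuracy_score expects (y_true, y_pred).
acc = accuracy_score(y_test, pred)
print(acc)
```
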

---

## **3.K-Nearest Neighbors (K-NN) – NON LINEAR**    
Used to classify new data points based on ‘distance’ to known data on a scatter plot  
- Find K nearest neighbors, based on your distance metric  
- Let them all vote on the classification  

Pros: Simple to understand, fast and efficient  
Cons: Need to choose the number of neighbours k  
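
The two steps above (find the K nearest neighbors, let them vote) can be sketched from scratch with NumPy. The function name `knn_predict` and the toy points are hypothetical; in practice scikit-learn's `KNeighborsClassifier` does the same job.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_train, y_train, np.array([5.0, 4.9])))  # 1
```
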


--- 
## **4.Kernel SVM – NON LINEAR**     

### **Kernel Trick**  
There are functions that take a low dimensional input space (feature space) and map it to a very high dimensional space.   
What used to be not linearly separable becomes a separable problem.     

These functions are called kernels.

Find the best linear separator between the different classes in that high dimensional space, and what you effectively get is a very powerful way to set data sets apart where the division line might be non-linear.  
                     
x, y not separable  > Kernels > x1 x2 x3 x4 x5  separable


Steps when you apply the kernel trick:  
- change your input space from x, y into a much larger input space >   
- separate the data points using support vector machines >  
- then take the solution and go back to the original space >   
- you now have a non-linear separation.   
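
A rough illustration of these steps, under stated assumptions: the data is two concentric circles from `make_circles`, and the extra feature x² + y² is a hand-picked mapping chosen for this example. A real kernel SVM never builds the larger space explicitly, it only computes the kernel function.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in (x, y).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear separator in the original 2-D space does poorly.
raw_acc = SVC(kernel="linear").fit(X, y).score(X, y)

# Hand-made mapping to a larger space: add the feature x^2 + y^2.
X_mapped = np.column_stack([X, (X ** 2).sum(axis=1)])
mapped_acc = SVC(kernel="linear").fit(X_mapped, y).score(X_mapped, y)

# The rbf kernel gets a non-linear boundary without an explicit mapping.
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(raw_acc, mapped_acc, rbf_acc)
```
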


### **Pros:**  
- High performance on nonlinear problems, not biased by outliers, not sensitive to overfitting 
- Works well in complicated domains where there is a clear margin of separation 

### **Cons:**   
- On a very big data set with lots and lots of features, an SVM right out of the box might be very slow and prone to overfitting to the noise in your data. 
- Doesn't work well on very large data sets 
- Not the best choice for a large number of features, more complex  
- Doesn't work well with lots of noise 
    - where the classes overlap heavily, you have to count independent evidence.   
    - That's when a Naive Bayes classifier would be better   
- Need to scale features 

### Parameters in Machine Learning
Arguments passed when you create your classifier (before fitting)    
For SVM: kernel, C, and gamma all affect how much the model overfits  

**1. Kernel**   
- Kernel options: linear, poly, rbf, sigmoid, precomputed, or a callable

**2. C:** controls the tradeoff between a smooth decision boundary and one that classifies all the training points correctly  
- small C: something very straight, but at the cost of a few points being misclassified
- large C: something wiggly, where you get potentially all of the training points correct
    - something that complicated probably won't generalize well to the test set 
- A large value of C means you're going to get more training points correct (more intricate decision boundaries that can wiggle around individual data points to get everything correct)  

**3. Gamma:** defines how far the influence of a single training example reaches
- low values - far reach (even far away points get taken into consideration when deciding where to draw the decision boundary)
    - makes the decision boundary more linear, smoother, less jagged
- high values - close reach (ignores points that are farther away)  
    - can end up with a wiggly decision boundary
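
A small sketch of how C and gamma change the fit. The data (`make_moons`) and the two parameter settings are arbitrary choices for illustration; the point is only to compare training accuracy with test accuracy for a smooth and a wiggly boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class toy data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

settings = {
    "smooth": SVC(kernel="rbf", C=0.1, gamma=0.1),    # small C, far reach
    "wiggly": SVC(kernel="rbf", C=1000, gamma=100),   # large C, close reach
}

scores = {}
for name, clf in settings.items():
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting.
    scores[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(name, scores[name])
```

A large gap between training and test accuracy for the wiggly setting is the usual sign of overfitting.
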

---

## **5.Naive Bayes – NON LINEAR**     
Use Naive Bayes (if the problem is non-linear) when you want to rank your predictions by their probability. For example, you may want to rank your customers from the highest probability that they buy a certain product to the lowest. That ranking lets you target your marketing campaigns. For this type of business problem, use Naive Bayes if your problem is non-linear (Logistic Regression if your problem is linear).    

One particular feature of Naive Bayes is that it’s a good algorithm for working with text classification. When dealing with text, it’s very common to treat each unique word as a feature, and since the typical person’s vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and the independent features assumption of Naive Bayes make it a strong performer for classifying texts.
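
A minimal sketch of word-count text classification. The six example messages and the spam/ham labels are invented for illustration, and `MultinomialNB` is used because it is the usual Naive Bayes variant for word counts (the notes below use `GaussianNB`).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and labels.
texts = [
    "win money now", "free money offer", "win free prize",
    "meeting at noon", "lunch meeting tomorrow", "project meeting notes",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Each unique word becomes one feature (a column of word counts).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

new_texts = vectorizer.transform(["free money", "meeting tomorrow"])
predictions = clf.predict(new_texts)
print(list(predictions))  # ['spam', 'ham']
```
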

### Pros:   
- Efficient, not biased by outliers, works on nonlinear problems, probabilistic approach  
- Grounded in probability, which can be powerful  

### Cons:  
- Based on the assumption that features are independent of each other and have the same statistical relevance  

from sklearn.naive_bayes import GaussianNB    
clf = GaussianNB()  
clf.fit(features_train, labels_train)  
pred = clf.predict(features_test)  

### Bayes Rule
Posterior Probability = Prior Probability * Test Evidence / Normalizer (so the posteriors sum to 1)  

### Non-Naive Bayes    
- Usually we just call it ‘Bayes Classifier’  
- More generally we can have a ‘Bayes Model’ 


### Naive Bayes - Text Learning Example  
With naïve Bayes we assume all of the features are independent  

Chris
- Love .10
- Deal .80
- Life .10  

Sara
- Love .50  
- Deal .20  
- Life .30    

**Text: "Life Deal"**     

**Prior Probability**  
P(Chris) = 0.50    
P(Sara) = 0.50    

**Evidence * Prior Probability**  
Chris: .10 x .80 x .50 = 0.04 <<< Chris wrote it    
Sara: .30 x .20 x .50 = 0.03  

**Posterior Probability**     
P(Chris | 'Life Deal') = 0.04 / (0.04 + 0.03) = 0.57    
P(Sara | 'Life Deal') = 0.03 / (0.04 + 0.03) = 0.43  

**Text: "Love Deal"**  

**Prior Probability**            
P(Chris) = 0.50    
P(Sara) = 0.50    

**Evidence * Prior Probability**  
Chris: 0.10 x 0.80 x 0.50 = 0.04    
Sara: 0.50 x 0.20 x 0.50 = 0.05   

**Posterior Probability**  
P(Chris | 'Love Deal') = 0.04 / (0.04 + 0.05) = 0.444      
P(Sara | 'Love Deal') = 0.05 / (0.04 + 0.05) = 0.556 <<< Sara wrote it 
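
The two worked examples can be reproduced in a few lines. The word probabilities and priors are the ones given above; the function name `posterior` is just a label chosen for this sketch.

```python
# Word probabilities and priors from the example above.
word_probs = {
    "Chris": {"love": 0.10, "deal": 0.80, "life": 0.10},
    "Sara":  {"love": 0.50, "deal": 0.20, "life": 0.30},
}
priors = {"Chris": 0.50, "Sara": 0.50}


def posterior(text):
    """Return P(author | text), treating the words as independent."""
    words = text.lower().split()
    joint = {}
    for author, prior in priors.items():
        p = prior
        for word in words:
            p *= word_probs[author][word]   # evidence * prior
        joint[author] = p
    total = sum(joint.values())             # normalizer
    return {author: p / total for author, p in joint.items()}


print({a: round(p, 3) for a, p in posterior("Life Deal").items()})
# {'Chris': 0.571, 'Sara': 0.429}
print({a: round(p, 3) for a, p in posterior("Love Deal").items()})
# {'Chris': 0.444, 'Sara': 0.556}
```
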

### Why is Naive Bayes Naive?   
- Because it assumes the features (words) are independent: it ignores word order and just looks at word frequency 
### Naive Bayes Strengths and Weaknesses   
- Easy to implement  
- Can break with phrases    
 
---








##  **6.Decision Tree Classification – NON LINEAR**   
Decision trees use a trick that lets you do non-linear decision making with simple linear decision surfaces.   
They let you ask multiple linear questions, one after the other.    
Basically a bunch of nested if statements   

What makes it ML is how we choose the conditions  
- Based on information theory  
- We only look at one attribute at a time (Usually call these ‘input features’ but called ‘attributes’ when talking about decision trees)    
  
### Entropy  
A measure of a dataset’s disorder – how same or different its examples are   
Measure of impurity in a bunch of examples    

If we classify a dataset into N different classes (ex: a data set of animal attributes and their species)  
- Entropy = 0: All examples are the same class (everyone is an iguana)   
- Entropy is at its maximum when the examples are evenly split between the classes (1 for two classes, log2(N) for N classes)  

Controls how a DT decides where to split the data (we want to minimize impurity in the splits)


### Information Gain   
Decision tree looks at all the training examples, all of the different features that are available to it and it uses information gain criterion in deciding which variable to split on, and how to make the splits.  
- Decision tree algorithm: maximize information gain  
- This is how it will choose which feature to make a split on  

Information Gain = entropy(parent) - [weighted average] entropy(children)  

import scipy.stats  
print(scipy.stats.entropy([2,1],base=2))    
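
The formula above can be sketched in plain Python. The function names and the tiny slow/fast example are assumptions made up for illustration; a real decision tree evaluates this for every candidate split.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """entropy(parent) minus the weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical split of four examples into two child nodes.
parent = ["slow", "slow", "fast", "fast"]
left, right = ["slow", "slow", "fast"], ["fast"]

print(entropy(parent))                                     # 1.0
print(round(information_gain(parent, [left, right]), 4))   # 0.3113
```
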
      
### Bias-Variance Dilemma  
**High Bias ML algorithm:**
- one that practically ignores the data
- almost no capacity to learn anything 
- Ex: a biased car: no matter which way I train it, it doesn't do anything differently    

**High Variance ML algorithm:**
- extremely perceptive to data   
- can only replicate stuff it's seen before  
- reacts very poorly to situations it hasn't seen before because it doesn't have the right bias to generalize to new stuff.   


**Want something in the middle**
- has some authority to generalize but is still very open to listen to the data  


### Pros: 
- Interpretability, no need for feature scaling, works on both linear / nonlinear problems  
- Scaling optional  
- Use when you want to have clear interpretation of your model results.

### Cons:  
- Poor results on too small datasets, overfitting can easily occur 
- Susceptible to overfitting; especially if the data has lots and lots of features     
- Decision trees have a major flaw—they overfit to the training data.   
    - Because we build up a very "deep" decision tree in terms of splits, we end up with a lot of rules that are specific to the quirks of the training data, and not generalizable to new data sets.  
    - one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.

### Speeding Up Performance  
- A general rule is that the parameters can tune the complexity of the algorithm, with more complex algorithms generally running more slowly.

- Another way to control the complexity of an algorithm is via the number of features that you use in training/testing. The more features the algorithm has available, the more potential there is for a complex fit.

### Decision Tree Parameters    

**min_samples_split:** governs whether there are enough samples in a node to continue splitting it further (default is 2; the higher the number, the simpler the boundary)   

### Tuning Criterion Parameter  
criterion: default = 'gini' (Gini index), another similar metric of impurity; 'entropy' is the alternative  
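
A short sketch of both parameters on synthetic data. The value 50 for `min_samples_split` is an arbitrary choice for illustration; comparing node counts is just one simple way to see that the tree gets simpler.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Default: min_samples_split=2, criterion='gini'.
default_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Nodes with fewer than 50 samples are not split further.
simpler_tree = DecisionTreeClassifier(min_samples_split=50, random_state=0).fit(X, y)

# Same idea, but impurity is measured with entropy instead of Gini.
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(default_tree.tree_.node_count, simpler_tree.tree_.node_count)
```
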



---

## **7.Random Forest Classification – NON LINEAR**  
'ensemble method'  
meta classifier built from (usually) decision trees  

Build hundreds of trees with slightly randomized input data, and slightly randomized split points. Each tree in a random forest gets a random subset of the overall training data. The algorithm performs each split point in each tree on a random subset of the potential columns to split on. By averaging the predictions of all of the trees, we get a stronger overall prediction and minimize overfitting.

Construct several alternate decision trees and let them “vote” on the final classification  
- Randomly re-sample the input data for each tree   
- Called “bootstrap aggregating” or “bagging”  
- Randomize a subset of the attributes each step is allowed to choose from 

### **When To Use Random Forests**
While the random forest algorithm is incredibly powerful, it isn't applicable to all tasks. 

### **The main strengths of a random forest are:**    
- Pros: Powerful and accurate, good performance on many problems, including non linear


- Very accurate predictions - Random forests achieve near state-of-the-art performance on many machine learning tasks. Along with neural networks and gradient-boosted trees, they're typically one of the top-performing algorithms.  


- Resistance to overfitting - Due to their construction, random forests are fairly resistant to overfitting. We still need to set and tweak parameters like max_depth though.

### **The main weaknesses of using a random forest are:**  
- Cons: No interpretability, overfitting can easily occur, need to choose the number of trees    


- They're difficult to interpret - Because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.


- They take longer to create - Making two trees takes twice as long as making one, making three takes three times as long, and so on. Fortunately, we can exploit multicore processors to parallelize tree construction. scikit-learn allows us to do this through the n_jobs parameter on RandomForestClassifier.

### **Random Forest v Decision Tree Bottom Line**    
**Random Forest**  
Given these trade-offs, it makes sense to use random forests in situations where accuracy is of the utmost importance; being able to interpret or explain the decisions the model is making isn't key. 
Use when you are just looking for high performance with less need for interpretation.   


**Decision Tree**  
In cases where time is of the essence or interpretability is important, a single decision tree may be a better choice.

---

# MODEL EVALUATION  

## Confusion Matrices  
A matrix that compares actual class v predicted class (2 x 2 for a two-class problem) 

## Accuracy  
Accuracy = All data points labeled correctly / All data points

**Pros**  
Simple metric  

**Cons**    
Not ideal for skewed classes    

## Recall (aka Sensitivity)
Out of all the items that are actually positive, how many were correctly classified as positive. How many positive items were 'recalled' from the data set  
TP / (TP + FN)  
 

##   Precision     
Out of all the items labeled as positive, how many truly belong in the positive class.  
TP / (TP + FP)  


There is usually a trade-off between precision and recall.  

## F-1 Score    

Considers both the precision and recall to compute the score

F1 = (2)(Precision x Recall)/(Precision + Recall)  

A high F1 score is the best of both worlds:  
- If my identifier finds a Person of Interest then the person is almost certainly a Person of Interest and  
- if the identifier does not flag someone, then they are almost certainly not a Person of Interest.  
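
The metrics above can be checked by hand on a tiny example. The labels below are made up (1 = positive class); `sklearn.metrics` offers ready-made versions of the same calculations.

```python
# Made-up actual and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)                   # 3 1 1 5
print(accuracy, recall, precision, f1)  # 0.8 0.75 0.75 0.75
```
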

In a multiclass classification problem like this one (more than 2 labels to apply), accuracy is a less-intuitive metric than in the 2-class case. Instead, a popular metric is the F1 score.

We’ll learn about the F1 score properly in the lesson on evaluation metrics, but you’ll figure out for yourself whether a good classifier is characterized by a high or low F1 score. You’ll do this by varying the number of principal components and watching how the F1 score changes in response.


---

## **IMPROVING PERFORMANCE** 

### **Tuning Parameters To Improve Accuracy**

- The first (and easiest) thing we can do to improve the accuracy of the random forest is to increase the number of trees we're using. Training more trees will take more time, but because we're averaging many predictions we made on different subsets of the data, having more trees will greatly increase accuracy (up to a point).


- We can also tweak the min_samples_split and min_samples_leaf variables to reduce overfitting. Because of the way a decision tree works, very deep splits in a tree can make it fit to quirks in the data set, rather than true signal. For this reason, increasing min_samples_split and min_samples_leaf can reduce overfitting. This will actually improve our score because we're making predictions on unseen data. A model that's less overfit and can generalize better will actually perform better on unseen data, but worse on seen data. 

- A popular method for choosing hyperparameters is k-fold cross-validation

- Can simply compare the mean score across folds   

- Can also use statistical testing to check if one parameter setting is ‘statistically significantly’ better than the other
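
A sketch of comparing parameter settings with k-fold cross-validation, assuming synthetic data and three arbitrary candidate values for `min_samples_leaf`. It picks the setting with the best mean score and does no statistical testing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

mean_scores = {}
for min_samples_leaf in (1, 5, 20):
    clf = RandomForestClassifier(
        n_estimators=50, min_samples_leaf=min_samples_leaf, random_state=0
    )
    # 5-fold cross-validation: five accuracy scores, one per held-out fold.
    fold_scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[min_samples_leaf] = fold_scores.mean()

best_setting = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_setting)
```
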


### **Identifying the Best Features to Use**  
- Feature engineering is the most important part of any machine learning task, and there are a lot more features we could calculate. However, we also need a way to figure out which features are the best.


- One way to accomplish this is to use univariate feature selection. This approach essentially involves reviewing a data set column by column to identify the ones that correlate most closely with what we're trying to predict (Survived).


- As usual, sklearn has a tool that will help us with feature selection. The SelectKBest class selects the best features from the data. We can specify how many features we want it to select.
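
A minimal sketch of univariate feature selection with `SelectKBest`. The data is synthetic, and both `k=3` and the `f_classif` scoring function are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 columns, only 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Score each column on its own against y and keep the 3 best.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the kept columns
print(X_selected.shape)                    # (300, 3)
```
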

### **Gradient Boosting**  
- Another technique that builds on decision trees is gradient boosting. Boosting involves training decision trees one after another, and feeding the errors from one tree into the next tree. 


- This method allows each tree to build on all the ones that came before it. Using this method can lead to overfitting if we build too many trees, though. As we get above 100 trees, it's very easy to overfit and train to quirks in the data set. Because our data set is extremely small, we'll limit the tree count to 25.

 
- Another way to reduce overfitting is to limit the depth to which we can build each tree in the gradient boosting process. We'll limit the tree depth to 3 here to avoid overfitting.
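
The two limits mentioned above (25 trees, depth 3) map onto `n_estimators` and `max_depth` in scikit-learn's `GradientBoostingClassifier`. The data here is synthetic; this is a sketch, not the original exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the small data set mentioned above.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

gb = GradientBoostingClassifier(
    n_estimators=25,  # limit the number of trees
    max_depth=3,      # limit the depth of each tree
    random_state=1,
)

# Cross-validated accuracy, then a final fit on all of the data.
cv_scores = cross_val_score(gb, X, y, cv=3)
gb.fit(X, y)
print(cv_scores.mean())
```
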

### **Making Predictions with Multiple Classifiers (Ensembling)**  
- One thing we can do to improve the accuracy of our predictions is ensemble different classifiers. Ensembling means generating predictions based on information from a set of classifiers, instead of just one. In practice, this means that we average their predictions.


- Generally speaking, the more diverse the models we ensemble, the higher our accuracy will be. Diversity means that the models generate their results from different columns, or use very different methods to generate predictions. Ensembling a random forest classifier with a decision tree probably won't work extremely well, because they're very similar. On the other hand, ensembling a linear regression with a random forest can yield very good results.


- One caveat with ensembling is that the classifiers we use have to be about the same in terms of accuracy. Ensembling one classifier that's much less accurate than the other will probably make the final result worse.


- In this case, we'll ensemble logistic regression we trained on the most linear predictors (the ones that have a linear order, as well as some correlation to Survived) with a gradient-boosted tree we trained on all of the predictors.
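
A rough sketch of averaging two classifiers. The data is synthetic, and `linear_columns` is a hypothetical stand-in for "the most linear predictors"; the original column choice is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real data set.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

linear_columns = [0, 1, 2]  # hypothetical choice of "linear" predictors

# Logistic regression on a subset of columns, boosted trees on all of them.
lr = LogisticRegression().fit(X_train[:, linear_columns], y_train)
gb = GradientBoostingClassifier(
    n_estimators=25, max_depth=3, random_state=1
).fit(X_train, y_train)

# Average the predicted probabilities of class 1, then threshold at 0.5.
proba_lr = lr.predict_proba(X_test[:, linear_columns])[:, 1]
proba_gb = gb.predict_proba(X_test)[:, 1]
proba_avg = (proba_lr + proba_gb) / 2
ensemble_pred = (proba_avg >= 0.5).astype(int)

ensemble_acc = (ensemble_pred == y_test).mean()
print(ensemble_acc)
```
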
