# Classification


## Confusion Matrix

Each row in a confusion matrix represents an actual class, while each column represents a predicted class.
<img src="imgs/confusion matrix.png" alt="confusion matrix" style="width: 230px;"/>

__Precision__ is the accuracy of the positive predictions: the proportion of the instances the model labeled positive that are actually positive:
$$precision = \frac{TP}{TP + FP}$$


__Recall__, also called _sensitivity_ or _true positive rate_, is the ratio of positive instances that are correctly detected by the classifier. It represents the ability to find all positive instances in the dataset:
$$recall = \frac{TP}{TP + FN}$$

__F1 Score__ is the harmonic mean of precision and recall. Because the harmonic mean gives more weight to low values, the classifier only gets a high F1 score if both recall and precision are high:
$$F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = \frac{TP}{TP + \frac{FN + FP}{2}}$$
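The three formulas above can be checked on a small worked example. This is a minimal sketch; the function name and the counts are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + (fn + fp) / 2)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision (0.8) is noticeably higher than recall (about 0.667), and F1 (about 0.727) sits between the two, closer to the lower value.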

> If someone says "let's reach 99% precision," you should ask, "at what recall?"

### The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It plots the __true positive rate (recall)__ against the __false positive rate__ (FPR, the ratio of negative instances that are incorrectly classified as positive).

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, whereas a purely random classifier has a ROC AUC of 0.5.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# sgd_clf, X_train and y_train_5 are assumed to be defined earlier
# (a classifier and a binary-labelled training set).
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal of a purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)
plt.show()
```
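To make the area under the curve concrete, here is a from-scratch sketch that builds the ROC points by lowering the threshold one score at a time and then integrates with the trapezoidal rule. The helper names and the four-instance toy data are assumptions for illustration; in practice `sklearn.metrics.roc_auc_score` does this for you.

```python
def roc_points(y_true, y_scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(y_scores), reverse=True):
        predicted = [s >= threshold for s in y_scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy data (hypothetical): two negatives and two positives.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_scores)
auc = auc_trapezoid(fpr, tpr)
print(auc)  # 0.75
```

One of the two positives is scored below a negative, so the curve falls short of the top-left corner and the AUC drops from 1 to 0.75.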

> As a rule of thumb, prefer the precision/recall (PR) curve whenever the positive class is rare or when you care more about false positives than false negatives, and the ROC curve otherwise.

## Training Models

### Gradient descent

_Stochastic gradient descent_ ends up very close to the minimum, but once there it keeps bouncing around and never settles down. The same randomness helps it jump out of local minima, so when the cost function is very irregular it has a better chance of finding the global minimum than _batch gradient descent_ does.
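The difference between the two update rules can be sketched on a toy linear regression. Everything here is an assumption made for illustration: the data ($y = 2 + 3x$ plus Gaussian noise), the constant learning rate, the epoch counts and the function names.

```python
import random

def make_data(m=20, noise=0.1, seed=42):
    """Toy data y = 2 + 3x + Gaussian noise (hypothetical example values)."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(m)]
    ys = [2 + 3 * x + rng.gauss(0, noise) for x in xs]
    return xs, ys

def batch_gd(xs, ys, lr=0.1, epochs=1000):
    """Each step uses the MSE gradient computed over the whole training set."""
    b, w = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_b = sum(2 * (b + w * x - y) for x, y in zip(xs, ys)) / m
        grad_w = sum(2 * (b + w * x - y) * x for x, y in zip(xs, ys)) / m
        b -= lr * grad_b
        w -= lr * grad_w
    return b, w

def sgd(xs, ys, lr=0.1, epochs=50, seed=0):
    """Each step uses the gradient of a single randomly picked instance."""
    rng = random.Random(seed)
    b, w = 0.0, 0.0
    for _ in range(epochs):
        for _ in range(len(xs)):
            i = rng.randrange(len(xs))
            err = b + w * xs[i] - ys[i]
            b -= lr * 2 * err
            w -= lr * 2 * err * xs[i]
    return b, w

xs, ys = make_data()
b_batch, w_batch = batch_gd(xs, ys)
b_sgd, w_sgd = sgd(xs, ys)
print("batch:", b_batch, w_batch)
print("sgd:  ", b_sgd, w_sgd)
```

With a constant learning rate the SGD estimate lands near the batch solution but keeps fluctuating from step to step; gradually reducing the learning rate is the usual way to let it settle.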

### The Bias / Variance Tradeoff

Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance.

## Decision Tree

__Decision Trees do not require _feature scaling or centering_ at all__. They make very few assumptions about the training data.

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees. The algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$. To choose $k$ and $t_k$, it searches for the pair ($k$, $t_k$) that produces the purest subsets, weighted by their size, and then splits each subset recursively in the same way. The training algorithm in Scikit-Learn is stochastic: features are randomly permuted at each split, and when `max_features` is lower than the number of features, only a random subset of features is evaluated at each node.
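The search for the pair ($k$, $t_k$) can be sketched as an exhaustive loop that minimizes the size-weighted Gini impurity of the two subsets. This is a simplified illustration, not Scikit-Learn's implementation: the function names, the midpoint thresholds and the four-instance dataset are all assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (k, t_k, cost) minimizing the size-weighted Gini impurity of the two subsets."""
    m, n_features = len(X), len(X[0])
    best = (None, None, float("inf"))
    for k in range(n_features):
        values = sorted(set(row[k] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Hypothetical data: feature 1 separates the two classes perfectly.
X = [[2.0, 1.0], [3.0, 1.5], [1.0, 4.0], [2.5, 5.0]]
y = [0, 0, 1, 1]
k, t_k, cost = best_split(X, y)
print(k, t_k, cost)  # 1 2.75 0.0
```

A full tree would apply `best_split` recursively to each subset until a stopping condition (such as a maximum depth) is reached.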

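__Training complexity__: $O(n\times m\log(m))$, where $n$ is the number of features and $m$ the number of training instances, since the algorithm compares all features on all samples at each node.

__Prediction complexity__: roughly $O(\log_2(m))$ for an approximately balanced tree, which is independent of the number of features. Making a prediction requires traversing the Decision Tree from the root to a leaf, and each node checks the value of only one feature.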
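The root-to-leaf traversal can be sketched with a hand-built tree. The tuple layout, thresholds and class names below are hypothetical and only serve to show that a prediction costs one feature check per level.

```python
# Hypothetical hand-built tree: internal nodes are
# (feature index, threshold, left subtree, right subtree); leaves are class labels.
tree = (0, 2.45, "setosa", (1, 1.75, "versicolor", "virginica"))

def predict(node, x):
    """Walk from the root to a leaf, checking one feature per node."""
    checks = 0
    while isinstance(node, tuple):
        k, t_k, left, right = node
        node = left if x[k] <= t_k else right
        checks += 1
    return node, checks

label, checks = predict(tree, [5.1, 1.5])
print(label, checks)  # versicolor 2
```

The number of checks equals the depth of the leaf that is reached, regardless of how many features each instance has.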

### Gini Impurity or Entropy?
Gini impurity is slightly faster to compute and tends to isolate the most frequent class in its own branch of the tree.
Entropy tends to produce slightly more balanced trees.
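For reference, both measures can be computed from the class proportions $p_i$ in a node: Gini impurity is $1 - \sum_i p_i^2$ and entropy is $-\sum_i p_i \log_2 p_i$. The sketch below uses hypothetical function names and toy label lists.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; classes absent from the node contribute nothing."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for a pure node and maximal for an even class mix; Gini avoids the logarithm, which is why it is slightly cheaper to compute.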

### Regularization Hyperparameters

- `max_depth`: maximum depth of the Decision Tree. The default is `None` (unlimited).

- `min_samples_split`: the minimum number of samples a node must have before it can be split.

- `min_samples_leaf`: the minimum number of samples a leaf node must have.

- `min_weight_fraction_leaf`: same as `min_samples_leaf` but expressed as a fraction of the total number of weighted instances.

- `max_leaf_nodes`: maximum number of leaf nodes.

- `max_features`: maximum number of features that are evaluated for splitting at each node.

Increasing the `min_*` hyperparameters or reducing the `max_*` hyperparameters will regularize the model.
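As a rough illustration of the effect, the sketch below trains an unrestricted tree and a regularized one on Scikit-Learn's `make_moons` toy dataset. The dataset size, noise level and hyperparameter values are arbitrary choices made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy dataset (arbitrary size and noise level).
X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# No restrictions: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Regularized: limited depth and a minimum number of samples per leaf.
regularized_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=4, random_state=42
).fit(X, y)

print("full:       ", full_tree.get_depth(), full_tree.get_n_leaves())
print("regularized:", regularized_tree.get_depth(), regularized_tree.get_n_leaves())
```

The unrestricted tree fits the training set perfectly, which is a sign of overfitting on noisy data, while the regularized tree is capped at depth 3 and therefore at most 8 leaves.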


### Instability

1. Orthogonal decision boundaries (all splits are perpendicular to an axis) make Decision Trees sensitive to training set rotation. One way to limit this problem is to use PCA, which often results in a better orientation of the training data.
1. Decision Trees are sensitive to small variations in the training data. Since the training algorithm used by Scikit-Learn is stochastic, you may even get very different models on the same training data. Random Forests can limit this instability by averaging predictions over many trees.

## Ensemble learning

Ensemble methods work best when the predictors are as independent from one another as possible. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

There are two ways to get a diverse set of classifiers:
1. Use very different training algorithms.
1. Use the same training algorithm for every predictor, but train each one on a different random subset of the training set. __Bagging__ (short for bootstrap aggregating) samples with replacement; __pasting__ samples without replacement.
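The difference between the two sampling schemes can be sketched in a few lines. The helper name `sample_indices` is hypothetical; the `bootstrap` flag mirrors the parameter of the same name in Scikit-Learn's `BaggingClassifier`.

```python
import random

def sample_indices(n_instances, n_samples, bootstrap, rng):
    """Pick training-instance indices for one predictor."""
    if bootstrap:
        # Bagging: sample with replacement, so an index can appear several times.
        return [rng.randrange(n_instances) for _ in range(n_samples)]
    # Pasting: sample without replacement, so every index is distinct.
    return rng.sample(range(n_instances), n_samples)

rng = random.Random(42)
bagging_subset = sample_indices(10, 10, bootstrap=True, rng=rng)
pasting_subset = sample_indices(10, 6, bootstrap=False, rng=rng)
print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

With bagging, some instances are typically drawn several times for a given predictor while others are not drawn at all.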

__hard voting classifier__: Aggregate the predictions of each classifier and predict the class that gets the most votes.

__soft voting classifier__: If all classifiers are able to estimate class probabilities, predict the class with the highest class probability, averaged over all the individual classifiers.
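The two voting schemes can disagree, as this small sketch shows. The class names and the probability estimates of the three classifiers are made up for illustration.

```python
from collections import Counter

def hard_vote(probas, classes):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [classes[max(range(len(p)), key=p.__getitem__)] for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, classes):
    """Average the class probabilities over all classifiers, then pick the highest."""
    n = len(probas)
    avg = [sum(p[j] for p in probas) / n for j in range(len(classes))]
    return classes[max(range(len(avg)), key=avg.__getitem__)]

classes = ["cat", "dog"]
# One row per classifier: estimated [P(cat), P(dog)] for a single instance.
probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]

print(hard_vote(probas, classes))  # cat
print(soft_vote(probas, classes))  # dog
```

Two classifiers lean slightly toward "cat", so hard voting picks it, but soft voting gives more weight to the third classifier's highly confident "dog" prediction.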

__Boosting__: any ensemble method that can combine several weak learners into a strong learner.
- AdaBoost: sequential training in which the weights of the instances misclassified by the previous predictor are increased.
- Gradient Boosting: fitting each new predictor to the residual errors made by the previous predictor.
- Stochastic Gradient Boosting: training each predictor on a randomly selected fraction of the training instances.
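The residual-fitting idea behind Gradient Boosting can be sketched with one-split regression trees (stumps) on a one-dimensional toy problem. This is a simplified illustration with no learning rate; the function names and the data ($y = x^2$ on a grid) are assumptions.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: (threshold, left_value, right_value)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def gradient_boost(xs, ys, n_stages):
    """Fit each new stump to the residual errors left by the previous ones."""
    stumps, residuals = [], list(ys)
    for _ in range(n_stages):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - stump_predict(stump, x) for x, r in zip(xs, residuals)]
    return stumps

def ensemble_predict(stumps, x):
    """The ensemble prediction is the sum of all stump predictions."""
    return sum(stump_predict(s, x) for s in stumps)

def mse(stumps, xs, ys):
    return sum((ensemble_predict(stumps, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]
mse_1 = mse(gradient_boost(xs, ys, 1), xs, ys)
mse_10 = mse(gradient_boost(xs, ys, 10), xs, ys)
print(mse_1, mse_10)
```

Each additional stump lowers the training error, so the second printed value is smaller than the first; on real data, too many stages would eventually overfit.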

# Dimensionality Reduction

__Curse of Dimensionality__: as the number of features per training instance grows, training becomes extremely slow and it becomes much harder to find a good solution. High-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. The more dimensions the training set has, the greater the risk of overfitting, and the number of training instances required to reach a given density grows exponentially with the number of dimensions.
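One way to see the sparsity is to estimate the average distance between two random points in a unit hypercube as the number of dimensions grows. This is a Monte Carlo sketch; the function name, sample size and seed are arbitrary assumptions.

```python
import math
import random

def mean_pairwise_distance(n_dims, n_pairs=2000, seed=42):
    """Estimate the mean Euclidean distance between two uniform random points in [0, 1]^n_dims."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(n_dims)]
        b = [rng.random() for _ in range(n_dims)]
        total += math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return total / n_pairs

for n_dims in (2, 10, 100):
    print(n_dims, round(mean_pairwise_distance(n_dims), 2))
```

In two dimensions the estimate is roughly 0.52, and it keeps growing with the number of dimensions, so a fixed number of instances covers the space more and more thinly.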

