# Notes on Topics 29 - 32

## Topic 29: Decision Trees
### Intro to Decision Trees
- "Recursive binary splitting": the process of training a decision tree and predicting the target of a dataset is as follows:
    1. Present a dataset of training examples containing features/predictors and a target (similar to classifiers we have seen earlier).
    2. Train the tree model by splitting the training examples on the values of the predictors. Which feature to split on at each node is chosen in the spirit of feature selection, using measures like "information gain" and the "Gini Index". We shall cover these shortly.
    3. The tree is grown until some stopping criterion is met. This could be a set depth of the tree or any other similar measure.
    4. Show a new set of features to the tree, with an unknown class and let the example propagate through a trained tree. The resulting leaf node represents the class prediction for this example datum.
- CART (Classification and Regression Trees) uses the Gini Index as a metric
- ID3 (Iterative Dichotomiser 3) uses the entropy function and information gain as metrics
- $Entropy(p) = -\sum_i P_i \log_2(P_i)$
- High entropy means the classes in a node are more mixed, so the node has less predictive power
- An information gain function takes `D`, the class distribution array of the target, and `a`, the class distributions of the target within each value of the attribute being tested, and calculates $gain(D, a) = Entropy(D) - \sum_i \frac{|D_i|}{|D|} Entropy(D_i)$, where $D_i$ is the class distribution within the $i$-th value of `a`
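
The Gini Index mentioned for CART has no formula in these notes, so here is a minimal sketch. It assumes the usual definition of Gini impurity, $1 - \sum_i P_i^2$, and the function names `gini` and `weighted_gini` are made up for illustration.

```python
def gini(counts):
    """Gini impurity of a node, given the number of examples in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def weighted_gini(children):
    """Size-weighted Gini impurity of the child nodes produced by a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


print(gini([5, 5]))                     # evenly mixed two-class node: 0.5
print(gini([10, 0]))                    # pure node: 0.0
print(weighted_gini([[4, 0], [1, 5]]))  # one pure child, one mixed child
```

A split is better the lower its weighted impurity is, which plays the same role as a higher information gain does for entropy.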

### ID3 Classification Trees: Perfect Split with Information Gain Lab

In [4]:
from math import log
def entropy(pi):
    """
    Return the entropy of a class distribution:
    entropy(p) = - SUM (Pi * log2(Pi))
    pi is a list of how many occurrences there are in each class
    """
    n = sum(pi)  # total number of examples, computed once
    total = 0
    for count in pi:
        p = count / n
        if p != 0:  # a class with no examples contributes 0
            total += p * log(p, 2)
    return -total
def IG(D, a):
    """
    Return the information gain:
    gain(D, a) = entropy(D) - SUM( |Di| / |D| * entropy(Di) )
    D is the list of class counts before the split
    a is a list of class-count lists, one per value of the attribute
    """
    total = 0
    for Di in a:
        # |Di| / |D| is the fraction of examples that fall into this subset
        total += sum(Di) / sum(D) * entropy(Di)
    gain = entropy(D) - total
    return gain
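
As a quick usage check, the sketch below restates the two functions compactly so that it runs on its own, then evaluates them on illustrative class counts. The counts and the name `information_gain` are assumptions for this example, not data from the lab.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a list of class counts; empty classes are skipped."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


# A perfect split: each child holds a single class, so the gain equals
# the full parent entropy of 1 bit.
print(information_gain([5, 5], [[5, 0], [0, 5]]))

# An imperfect three-way split of 14 examples (9 positive, 5 negative).
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.247
```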

### Building Trees using scikit-learn + Lab

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree
# One-hot encode the training data and show the resulting DataFrame with proper column names
ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_ohe = ohe.transform(X_train).toarray()
# Creating this DataFrame is not necessary; it's only to show the result of the ohe
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
ohe_df = pd.DataFrame(X_train_ohe, columns=ohe.get_feature_names_out(X_train.columns))
ohe_df.head()
# Create the classifier and fit it on the training data
clf = DecisionTreeClassifier(criterion='entropy') # or 'gini'
clf.fit(X_train_ohe, y_train)

In [None]:
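import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(3, 3), dpi=300)
# Plain lists are passed for the names, which every scikit-learn version accepts
tree.plot_tree(clf,
               feature_names=list(ohe_df.columns),
               class_names=list(np.unique(y).astype('str')),
               filled=True)
plt.show()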

In [None]:
from sklearn.metrics import roc_curve, auc, ConfusionMatrixDisplay

X_test_ohe = ohe.transform(X_test)
y_pred = clf.predict(X_test_ohe)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred) * 100
print('Accuracy is: {0}'.format(acc))

# Check the AUC for predictions (roc_curve expects a binary target)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC is: {0}'.format(round(roc_auc, 2)))

# Create and print a confusion matrix
print('\nConfusion Matrix')
print('----------------')
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

# or
# Alternative confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement)
ConfusionMatrixDisplay.from_estimator(clf, X_test_ohe, y_test, values_format='.3g')
plt.show()

### Hyperparameter Tuning and Pruning in Decision Trees + Lab
We can prune our trees using:
- Maximum depth (`max_depth`): Reduce the depth of the tree to build a more generalized tree. Try depths such as 3, 5, or 10 and choose among them after verification on validation data
- Minimum samples split (`min_samples_split`): minimum number of samples required to split an internal node.
- `max_depth` and `min_samples_split` are also both related to the computational cost
- Minimum leaf sample size (`min_samples_leaf`): minimum number of samples that we want a leaf node to contain. A split is only allowed if each resulting leaf keeps at least this many samples. Size in terminal nodes can be fixed to 30, 100, 300 or 5% of total
- Maximum leaf nodes (`max_leaf_nodes`): Reduce the number of leaf nodes
- Maximum features (`max_features`): Maximum number of features to consider when splitting a node
- The main difference between `min_samples_split` and `min_samples_leaf` is that `min_samples_leaf` guarantees a minimum number of samples in a leaf, while `min_samples_split` can create arbitrarily small leaves, though `min_samples_split` is more common in practice

- For instance, if `min_samples_split = 5` and there are 7 samples at an internal node, then the split is allowed. But let's say the split results in two leaves, one with 1 sample and another with 6 samples. If `min_samples_leaf = 2`, then the split won't be allowed (even though the internal node has 7 samples) because one of the resulting leaves would have fewer than the minimum number of samples required at a leaf node.
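
The interaction between the two size constraints can be sketched as a small check. This mirrors the rule as described in these notes, not scikit-learn's internal code, and `split_allowed` is a made-up helper name.

```python
def split_allowed(n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a candidate split satisfies both size constraints."""
    n_node = n_left + n_right
    if n_node < min_samples_split:               # node too small to split at all
        return False
    if min(n_left, n_right) < min_samples_leaf:  # a child would be too small
        return False
    return True


# The example from the notes: 7 samples split into leaves of 1 and 6.
print(split_allowed(1, 6, min_samples_split=5, min_samples_leaf=2))  # False
# A more balanced split of the same node passes both checks.
print(split_allowed(3, 4, min_samples_split=5, min_samples_leaf=2))  # True
# Without a leaf-size constraint, the 1/6 split goes through.
print(split_allowed(1, 6, min_samples_split=5, min_samples_leaf=1))  # True
```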

### Regression with CART Trees + Lab
- Recursive partitioning, instead of global model.
- Cost Function: $J(D, \theta) = \frac{n_{left}}{n_{total}} MSE_{left} + \frac{n_{right}}{n_{total}} MSE_{right}$
    - $D$: remaining training examples   
    - $n_{total}$ : number of remaining training examples
    - $\theta = (f, t_f)$: feature and feature threshold
    - $n_{left}/n_{right}$: number of samples in the left/right subset
    - $MSE_{left}/MSE_{right}$: MSE of the left/right subset
- $\hat{y}_m = \frac{1}{n_{m}} \sum_{i \in D_m} y_i$
- $MSE_m = \frac{1}{n_{m}} \sum_{i \in D_m} (y_i - \hat{y}_m)^2$
    - $D_m$: training examples in node $m$
    - $n_{m}$ : total number of training examples in node $m$
    - $y_i$: target value of the $i$-th example
- Without regularization, decision trees are likely to overfit
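
To make the cost function concrete, here is a minimal pure-Python sketch that evaluates $J(D, \theta)$ for a single feature and picks the cheapest threshold. The toy data and the helper names (`mse`, `split_cost`, `best_threshold`) are assumptions for illustration. Trying midpoints between consecutive feature values is one common way to generate candidate thresholds, not the only one.

```python
def mse(ys):
    """Mean squared error of predicting the mean of ys for every example."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def split_cost(xs, ys, threshold):
    """J(D, theta): size-weighted MSE of the subsets made by x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)


def best_threshold(xs, ys):
    """Try midpoints between consecutive distinct x values; keep the cheapest."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_cost(xs, ys, t))


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
t = best_threshold(xs, ys)
print(t, split_cost(xs, ys, t))  # the threshold 6.5 separates the two clusters
```

A full regression tree applies the same search to every feature at every node and recurses on the two subsets.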

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=42, max_depth=3)
regressor.fit(X_train, y_train)
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
y_pred = regressor.predict(X_test)
print('MSE score:', mse(y_test, y_pred))
print('R-sq score:', r2_score(y_test,y_pred))

### Regression Trees and Model Optimization Lab

## Topic 30: Ensemble Methods

- Bagging, short for Bootstrap Aggregation, is a combination of two ideas -- bootstrap resampling and aggregation
- common approach is to treat each classifier in the ensemble's prediction as a "vote" and let our overall prediction be the majority vote
- also common to see ensembles that take the arithmetic mean of all predictions, or compute a weighted average
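
The aggregation step can be sketched in a few lines. The function names and sample predictions are hypothetical. Note that `Counter.most_common` breaks ties by first occurrence, which a real ensemble may want to handle differently.

```python
from collections import Counter


def majority_vote(predictions):
    """Most common class label among the ensemble members' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_average(predictions, weights):
    """Weighted mean of numeric predictions; equal weights give the plain mean."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)


print(majority_vote(["cat", "dog", "cat"]))          # cat
print(weighted_average([2.0, 4.0, 6.0], [1, 1, 1]))  # 4.0
print(weighted_average([2.0, 4.0, 6.0], [1, 1, 2]))  # 4.5
```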

### Random Forests
- classification and regression, ensemble of decision trees 
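- Bagging: each tree is trained on a bootstrap sample drawn with replacement, which ends up containing roughly two-thirds (about 63%) of the distinct training examples
- the algorithm then uses the remaining one-third or so of the data that wasn't sampled for that tree to calculate the Out-Of-Bag Error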
- Subspace sampling method: randomly select a subset of features (exact number is tunable parameter) to use as predictors for each node when training a decision tree
- Benefits: Strong performance and some interpretability through feature importances
- Drawbacks: Computational complexity and memory storage
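
A short simulation shows where the "two-thirds / one-third" figures come from, and what subspace sampling looks like. This is a sketch using only the standard library. The sample size, seed, and feature names are arbitrary choices for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)
n = 10_000
in_bag = set(bootstrap_indices(n, rng))
out_of_bag = set(range(n)) - in_bag

print(len(in_bag) / n)      # close to 1 - 1/e, about 0.632
print(len(out_of_bag) / n)  # close to 1/e, about 0.368

# Subspace sampling: pick a random subset of features for one node.
features = ["f0", "f1", "f2", "f3", "f4"]
feature_subset = rng.sample(features, 2)
print(feature_subset)
```

The out-of-bag rows act as a built-in validation set for the tree that did not see them.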

### Tree Ensembles and Random Forests Lab

### GridSearchCV + Lab
GridSearchCV: combines K-Fold Cross-Validation with a grid search of parameters
- very time consuming and computationally expensive

In [None]:
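from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20]  # an integer min_samples_split must be at least 2
}
gs_tree = GridSearchCV(clf, param_grid, cv=3)
gs_tree.fit(train_data, train_labels)
gs_tree.best_params_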
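
To see why grid search gets expensive, the number of model fits can be counted directly. This sketch uses a grid of the same size as the one above and only the standard library. The variable names are illustrative.

```python
from itertools import product

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [2, 5, 10, 20],
}
cv = 3

# Every combination of parameter values is trained once per fold.
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv
print(len(combinations), n_fits)  # 32 combinations, 96 model fits
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the full training set. Each extra parameter multiplies the total, so grids grow quickly.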


### Gradient Boosting and Weak Learners + Lab
Boosting works as follows:
1. Train a single weak learner
2. Figure out which examples the weak learner got wrong
3. Build another weak learner that focuses on the areas the first weak learner got wrong
4. Continue this process until a predetermined stopping condition is met, such as until a set number of weak learners have been created, or the model's performance has plateaued

Adaboost:
- Each learner is trained on a subset of the training data sampled with replacement, like bagging, except each data point carries a weight. A point's weight increases when a weak learner misclassifies it. Each weight acts as the probability that the sample will go into the bag

Gradient Boosted Trees:
- Starts with a weak learner and then calculates the residuals for each data point.
- Model combines the residuals with a differentiable loss function to calculate the overall loss
- The negative gradients of the loss with respect to the current predictions (for squared error, the residuals themselves) are used as the targets the next tree is trained against.
- Next learner focuses on these harder examples. Loss is reduced because the data points the ensemble currently predicts worst are focused on.
- Gamma, the learning rate, scales how much each new tree contributes to the ensemble
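
The gradient boosting loop above can be sketched in pure Python for squared-error regression on one feature. This is a minimal illustration under stated assumptions: the weak learners are one-split stumps, the toy data is made up, and `fit_stump` and `boost` are hypothetical names rather than any library's API.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump to targets; return a predict function."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump is fit to the residuals."""
    base = sum(ys) / len(ys)  # initial weak model: predict the mean
    stumps, history = [], []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 3.9, 7.0, 7.2]
predict, history = boost(xs, ys)
print(history[0], history[-1])  # training MSE after the first and last round
```

A smaller `learning_rate` makes each round's correction more cautious, so more rounds are needed, which is the usual trade-off when tuning it.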


### XGBoost
- parallelizes the construction of trees across all your computer's CPU cores during the training phase. It also allows for more advanced use cases
- automatically handles missing values
- `conda install py-xgboost`
- https://xgboost.readthedocs.io/en/latest/

## Topic 31: Support Vector Machines
- Max Margin classifier: aims to maximize the margin
    - $b + w^Tx^{(i)} \geq 1$ if $y^{(i)} = 1$ and $b + w^Tx^{(i)} \leq -1$ if $y^{(i)} = -1$, or equivalently $y^{(i)} (b + w^Tx^{(i)}) \geq 1$ for each $i$
    - $w$ is called the **weight vector**, $b$ is called the **bias**
- Soft Margin classifier:
    - $b + w^Tx^{(i)} \geq 1-\xi^{(i)}$ if $y^{(i)} = 1$, and likewise $b + w^Tx^{(i)} \leq -1+\xi^{(i)}$ if $y^{(i)} = -1$
    - $\xi^{(i)} \geq 0$ is a **slack variable**
- Hyperplane defined by weight vector $w$ and bias $b$
- [kernels](https://scikit-learn.org/stable/modules/svm.html#kernel-functions)
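
The margin constraints can be checked numerically. The sketch below assumes labels $y^{(i)} \in \{-1, +1\}$ and computes, for a given hyperplane, the smallest slack that satisfies the soft-margin constraint for each point. The hyperplane, the points, and the function names are hypothetical.

```python
def decision_value(w, b, x):
    """b + w^T x for one example."""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def slack(w, b, x, y):
    """Smallest xi >= 0 with y * (b + w^T x) >= 1 - xi, for y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


w, b = [1.0, 1.0], -3.0     # hypothetical hyperplane x1 + x2 = 3
points = [([3.0, 2.0], 1),  # on the correct side, outside the margin
          ([2.0, 1.5], 1),  # correctly classified but inside the margin
          ([1.0, 1.0], 1)]  # misclassified
for x, y in points:
    print(x, y, slack(w, b, x, y))
```

A slack of 0 means the hard-margin constraint already holds. A slack between 0 and 1 means the point sits inside the margin, and a slack above 1 means it is on the wrong side of the hyperplane.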


### Building an SVM using scikit-learn Lab

### The Kernel Trick + Lab

## Topic 32: ML Pipelines
- Pipelines create an efficient workflow to combine data manipulations, preprocessing, and modeling

[Pipelines](https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html)

[Integrating Grid Search](https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-2.html)

### Pipelines in sklearn Lab

In [None]:
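from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV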

In [None]:
# defining the sequence of actions to perform
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])
# fit the model,
pipe.fit(X_train, y_train)
# score the model on test data
pipe.score(X_test, y_test)

# or implement GridSearch
grid = [{'tree__max_depth': [None, 2, 6, 10], 
         'tree__min_samples_split': [5, 10]}]
gridsearch = GridSearchCV(estimator=pipe, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)
gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)

In [None]:
# Build a pipeline with StandardScaler and KNeighborsClassifier
scaled_pipeline_1 = Pipeline([('ss', StandardScaler()), 
                              ('knn', KNeighborsClassifier())])
scaled_pipeline_1.fit(X_train, y_train)
scaled_pipeline_1.score(X_test, y_test)

scaled_pipeline_2 = Pipeline([('ss', StandardScaler()), 
                              ('RF', RandomForestClassifier(random_state=123))])
grid = [{'RF__max_depth': [4, 5, 6], 
         'RF__min_samples_split': [2, 5, 10], 
         'RF__min_samples_leaf': [1, 3, 5]}]
gridsearch = GridSearchCV(estimator=scaled_pipeline_2, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)
gridsearch.fit(X_train, y_train)
gridsearch.score(X_test, y_test)


## Extra Notes
[scoring parameters](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
