In [None]:
import pandas as pd

# Classification

We'll take a tour of the methods for classification in sklearn. First let's load a toy dataset to use:

In [None]:
from sklearn.datasets import load_breast_cancer
breast = load_breast_cancer()

Let's take a look

In [None]:
# Convert it to a dataframe for better visuals
df = pd.DataFrame(breast.data)
df.columns = breast.feature_names
df

And now look at the targets

In [None]:
print(breast.target_names)
breast.target

## Classification Trees

Using the scikit-learn models is basically the same as in Julia's ScikitLearn.jl: create the model with its hyperparameters, then call `.fit` on the data.

In [None]:
from sklearn.tree import DecisionTreeClassifier
cart = DecisionTreeClassifier(max_depth=2, min_samples_leaf=140)
cart.fit(breast.data, breast.target)

Here's a helper function to plot the trees. It relies on Graphviz, which has to be installed first.

# Installing Graphviz (tedious)

## Windows

1. Download graphviz from https://graphviz.gitlab.io/_pages/Download/Download_windows.html
2. Install it by running the .msi file
3. Set the `PATH` variable:
    (a) Go to Control Panel > System and Security > System > Advanced System Settings >  Environment Variables > Path > Edit
    (b) Add 'C:\Program Files (x86)\Graphviz2.38\bin'
4. Run `conda install graphviz`
5. Run `conda install python-graphviz`

## macOS and Linux

1. Run `brew install graphviz` (install `brew` from https://docs.brew.sh/Installation if you don't have it)
2. Run `conda install graphviz`
3. Run `conda install python-graphviz`


In [None]:
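import graphviz
import sklearn.tree
def visualize_tree(sktree):
    # Export the fitted tree to DOT format, labelling the splits with the
    # column names of the global dataframe `df`, then render it with graphviz
    dot_data = sklearn.tree.export_graphviz(sktree, out_file=None,
                                            filled=True, rounded=True,
                                            special_characters=False,
                                            feature_names=df.columns)
    return graphviz.Source(dot_data)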

In [None]:
visualize_tree(cart)
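
If installing Graphviz is not an option, a plain-text rendering of the tree may be enough. The following is a minimal sketch, assuming a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later). It refits the same tree so that it runs on its own, and the name `tree_text` is introduced here only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

breast = load_breast_cancer()
cart = DecisionTreeClassifier(max_depth=2, min_samples_leaf=140)
cart.fit(breast.data, breast.target)

# export_text renders the fitted tree as indented plain text,
# so no Graphviz installation is needed
tree_text = export_text(cart, feature_names=list(breast.feature_names))
print(tree_text)
```

Each indented line is either a split on a feature or a leaf with its predicted class.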

We can get the label predictions with the `.predict` method

In [None]:
labels = cart.predict(breast.data)
labels

And similarly the predicted probabilities with `.predict_proba`

In [None]:
probs = cart.predict_proba(breast.data)
probs

Just like in Julia, the probabilities are returned for each class, with one row per sample and one column per class

In [None]:
probs.shape
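
As a quick check on how to read this array, the sketch below refits the tree from above and confirms that each row sums to 1 and that the columns follow the order of `cart.classes_`. The names `column_names` and `row_sums` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

breast = load_breast_cancer()
cart = DecisionTreeClassifier(max_depth=2, min_samples_leaf=140)
cart.fit(breast.data, breast.target)

probs = cart.predict_proba(breast.data)

# Column j holds the probability of class cart.classes_[j]
column_names = [str(breast.target_names[c]) for c in cart.classes_]
print(column_names)

# Each row is a probability distribution over the classes
row_sums = probs.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```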

We can extract the second column of `probs` (the predicted probability of class 1) by slicing, just as we did in Julia

In [None]:
probs = cart.predict_proba(breast.data)[:,1]
probs

To evaluate the model, we can use functions from `sklearn.metrics`

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

In [None]:
roc_auc_score(breast.target, probs)

In [None]:
accuracy_score(breast.target, labels)

In [None]:
confusion_matrix(breast.target, labels)
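
To read the confusion matrix, it helps to try it on a tiny example first. The sketch below uses hypothetical labels (`y_true` and `y_pred` are made up for illustration, not taken from the breast cancer data) to show that rows correspond to true classes and columns to predicted classes, and that accuracy can be recovered from the diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen only to illustrate the layout
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# cm[i, j] = number of samples with true class i that were predicted as class j
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the fraction of samples that land on the diagonal
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(y_true, y_pred))
```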

## Random Forests and Boosting

We use random forests and boosting in the same way as CART

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(breast.data, breast.target)

In [None]:
labels = forest.predict(breast.data)
probs = forest.predict_proba(breast.data)[:,1]
print(roc_auc_score(breast.target, probs))
print(accuracy_score(breast.target, labels))
confusion_matrix(breast.target, labels)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
boost.fit(breast.data, breast.target)

In [None]:
labels = boost.predict(breast.data)
probs = boost.predict_proba(breast.data)[:,1]
print(roc_auc_score(breast.target, probs))
print(accuracy_score(breast.target, labels))
confusion_matrix(breast.target, labels)

## Logistic Regression

We can also access logistic regression from sklearn

In [None]:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit(breast.data, breast.target)

In [None]:
labels = logit.predict(breast.data)
probs = logit.predict_proba(breast.data)[:,1]
print(roc_auc_score(breast.target, probs))
print(accuracy_score(breast.target, labels))
confusion_matrix(breast.target, labels)

The sklearn implementation has options for regularization in logistic regression. You can choose between L1 and L2 regularization:

![](http://scikit-learn.org/stable/_images/math/6a0bcf21baaeb0c2b879ab74fe333c0aab0d6ae6.png)


![](http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png)

Note that this regularization is ad hoc and **not equivalent to robustness**. For a robust logistic regression, follow the approach from 15.680.

You control the regularization with the `penalty` and `C` hyperparameters, where `C` is the inverse of the regularization strength (smaller values regularize more). Our model above used the defaults: L2 regularization with $C=1$.
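
To get a feel for what `C` does, the sketch below fits the default L2-penalized model for a few values of `C` and records the size of the coefficient vector. Standardizing the features and raising `max_iter` are additions made here so that the solver converges; they are not part of the cells above, and `coef_norms` is a name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption: standardize the features so the solver converges and the
# penalty treats all coefficients on the same scale
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=10000)
    model.fit(X_scaled, y)
    # Euclidean norm of the fitted coefficients
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(C, coef_norms[C])
```

Smaller values of `C` penalize the coefficients more heavily, so the norm shrinks as `C` decreases.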

### Exercise

Try out unregularized logistic regression as well as L1 regularization. Which of the three options seems best? What if you try changing $C$?

In [None]:
# (Effectively) no regularization: a very large C makes the penalty negligible
logit = LogisticRegression(C=1e10)
logit.fit(breast.data, breast.target)
labels = logit.predict(breast.data)
probs = logit.predict_proba(breast.data)[:,1]
print(roc_auc_score(breast.target, probs))
print(accuracy_score(breast.target, labels))
confusion_matrix(breast.target, labels)

In [None]:
# L1 regularization (needs a solver that supports it, such as liblinear)
logit = LogisticRegression(C=100, penalty='l1', solver='liblinear')
logit.fit(breast.data, breast.target)
labels = logit.predict(breast.data)
probs = logit.predict_proba(breast.data)[:,1]
print(roc_auc_score(breast.target, probs))
print(accuracy_score(breast.target, labels))
confusion_matrix(breast.target, labels)

# Regression

Now let's take a look at regression in sklearn. Again we can start by loading up a dataset.

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

Take a look at the features (X) and then the targets

In [None]:
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df

In [None]:
boston.target

## Regression Trees

We use regression trees in the same way as classification trees

In [None]:
from sklearn.tree import DecisionTreeRegressor
cart = DecisionTreeRegressor(max_depth=2, min_samples_leaf=5)
cart.fit(boston.data, boston.target)
visualize_tree(cart)

As with classification, we get the predictions out with the `.predict` method. Here they are continuous values rather than class labels

In [None]:
preds = cart.predict(boston.data)
preds

There are functions provided by `sklearn.metrics` to evaluate the predictions

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))
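
For reference, the three metrics can be reproduced by hand. The sketch below uses a handful of hypothetical values (`y_true` and `y_pred` are made up for illustration) and checks the manual formulas against the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))
# Mean squared error: average squared error
mse = np.mean(errors ** 2)
# R^2: one minus the ratio of the residual sum of squares
# to the sum of squares around the mean of the targets
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```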

## Random Forests and Boosting

Random forests and boosting for regression work the same as in classification, except we use the `Regressor` version rather than `Classifier`.

### Exercise

Test and compare the (in-sample) performance of random forests and boosting on the Boston data with some sensible parameters.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=100)
forest.fit(boston.data, boston.target)
preds = forest.predict(boston.data)
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.2)
boost.fit(boston.data, boston.target)
preds = boost.predict(boston.data)
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))

## Linear Regression Models

There is a large collection of linear regression models in sklearn. Let's start with a simple ordinary linear regression

In [None]:
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(boston.data, boston.target)
preds = linear.predict(boston.data)
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))

We can also take a look at the fitted coefficients (the betas):

In [None]:
linear.coef_

We can use regularized models as well. Here is ridge regression:

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=10)
ridge.fit(boston.data, boston.target)
preds = ridge.predict(boston.data)
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))
ridge.coef_

And here is lasso

In [None]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1)
lasso.fit(boston.data, boston.target)
preds = lasso.predict(boston.data)
print(mean_absolute_error(boston.target, preds))
print(mean_squared_error(boston.target, preds))
print(r2_score(boston.target, preds))
lasso.coef_
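
One practical difference between the two penalties is that lasso can set coefficients exactly to zero, while ridge only shrinks them. The sketch below illustrates this on synthetic data from `make_regression`, used here as a stand-in so the example runs on its own. The data, the `alpha` values, and the names `n_zero_ridge` and `n_zero_lasso` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data: 20 features, only 5 of which drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=10).fit(X, y)
lasso = Lasso(alpha=1).fit(X, y)

# Count the coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)
```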

There are many other linear regression models available. See the [linear model documentation](http://scikit-learn.org/stable/modules/linear_model.html) for more.

### Exercise

The elastic net is another linear regression method that combines ridge and lasso regularization. Try running it on this dataset, referring to the documentation as needed to learn how to use it and control the hyperparameters.
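
As a starting point, here is a minimal sketch of the `ElasticNet` interface. It runs on synthetic data from `make_regression` rather than the Boston data so that it is self-contained, and the hyperparameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in data with 13 features
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# alpha sets the overall penalty strength; l1_ratio mixes the two penalties
# (l1_ratio=1 gives the lasso penalty, l1_ratio=0 a pure L2 penalty)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)
preds = enet.predict(X)

print(mean_absolute_error(y, preds))
print(mean_squared_error(y, preds))
print(r2_score(y, preds))
print(enet.coef_)
```

Swapping in the Boston features and targets and varying `alpha` and `l1_ratio` is the substance of the exercise.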