# Example notebook for SLU10 - Classification
How to use the very useful sklearn:
- MinMaxScaler
- SGDClassifier
- LogisticRegression

to solve the last exercise of the Exercise Notebook of SLU10.

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDClassifier, LogisticRegression

The Wisconsin Breast Cancer Diagnostic dataset is another data science classic. It is the result of extraction of breast cell's nuclei characteristics to understand which of them are the most relevent for developing breast cancer.

Your quest, is to first analyze this dataset from the materials that you've learned in the previous SLUs and then create a logistic regression model that can correctly classify cancer cells from healthy ones.

Dataset description:

1. Sample code number: id number 
2. Clump Thickness
3. Uniformity of Cell Size
4. Uniformity of Cell Shape
5. Marginal Adhesion 
6. Single Epithelial Cell Size
7. Bare Nuclei
8. Bland Chromatin
9. Normal Nucleoli
10. Mitoses 
11. Class: (2 for benign, 4 for malignant) > We will modify to (0 for benign, 1 for malignant) for simplicity
The data is loaded for you below.

In [2]:
columns = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
           'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli',
           'Mitoses','Class']
data = pd.read_csv('data/breast-cancer-wisconsin.csv',names=columns, index_col=0)
data["Bare Nuclei"] = data["Bare Nuclei"].replace(['?'],np.nan)
data = data.dropna()
data["Bare Nuclei"] = data["Bare Nuclei"].map(int)
data.Class = data.Class.map(lambda x: 1 if x == 4 else 0)
X_train = data.drop('Class').values
Y_train = data.Class.values

How does the dataset (features) and target look like?

In [3]:
X_train

array([[ 5,  1,  1, ...,  1,  1,  0],
       [ 5,  4,  4, ...,  2,  1,  0],
       [ 3,  1,  1, ...,  1,  1,  0],
       ...,
       [ 5, 10, 10, ..., 10,  2,  1],
       [ 4,  8,  6, ...,  6,  1,  1],
       [ 4,  8,  8, ...,  4,  1,  1]])

In [4]:
Y_train

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,

# [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
_Transforms features by scaling each feature to a given range._

You can select the range for your final feature values with argument `feature_range=(0, 1)`

In [5]:
# Init class
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit your class
scaler.fit(X_train)



MinMaxScaler(copy=True, feature_range=(0, 1))

In [6]:
# Transform your data
X_train = scaler.transform(X_train)
X_train

array([[0.44444444, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.44444444, 0.33333333, 0.33333333, ..., 0.11111111, 0.        ,
        0.        ],
       [0.22222222, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.44444444, 1.        , 1.        , ..., 1.        , 0.11111111,
        1.        ],
       [0.33333333, 0.77777778, 0.55555556, ..., 0.55555556, 0.        ,
        1.        ],
       [0.33333333, 0.77777778, 0.77777778, ..., 0.33333333, 0.        ,
        1.        ]])

So, now our features are scaled between 0 and 1.

# [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)
_Linear classifiers (SVM, logistic regression, a.o.) with SGD training._

Let us use the _log loss_ by setting 

In [7]:
sgd_clf = SGDClassifier(loss='log', tol=0.0001, random_state=1)

# Fit it!
sgd_clf.fit(X_train, Y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=1, shuffle=True,
       tol=0.0001, verbose=0, warm_start=False)

# [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
_Logistic Regression (aka logit, MaxEnt) classifier._ In this case let us use the L2 penalty (argument: `penalty='l2'`)

In [8]:
# init with your arguments
logit_clf = LogisticRegression(penalty='l2', random_state=1)

# Fit it!
logit_clf.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

What are the predicted probabilities on the training data (probability of being `1`) with our **Logit** classifier?

In [9]:
# First ten instances
logit_clf.predict_proba(X_train)[:, 1][:10]

array([0.01773246, 0.17356015, 0.01834332, 0.182352  , 0.01958361,
       0.99726463, 0.05117657, 0.0166379 , 0.01618869, 0.01755417])