## Using sklearn for basic data transformation and cross-validation

Many common machine learning operations and procedures are already implemented in various `sklearn` modules.  Here, we see one way of handling a couple of basic tasks: data transformation and cross-validation testing.

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import minmax_scale

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
np.set_printoptions(suppress=True, precision=2)
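# The seaborn theme provides the "pretty" matplotlib look.  The bare 'seaborn'
# style name is no longer available to plt.style.use() in recent matplotlib
# releases, so the theme is set through seaborn itself.
sns.set(font_scale=2)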

### A built-in data set

There are a number of existing data-sets in `sklearn`, many drawn from real-world data-sources. Here, we use the Wisconsin breast cancer set, which allows high-accuracy classification with pretty simple models:  

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

Here, the data is loaded into a dataframe; in basic form, it consists of 569 data-points, each characterized by 30 real-valued features in a number of different units. For more information about the data, see the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
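
As a quick check of the description above, the following sketch loads the data and looks at its shape and at the range of each feature. The variable name `ranges` is only illustrative; the point is that the per-feature minimum and maximum values differ by orders of magnitude.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)
X = dataset.frame.iloc[:, :-1]   # the 30 feature columns
y = dataset.frame.iloc[:, -1]    # the final "target" column (0 = malignant, 1 = benign)

print(X.shape)                   # (number of data-points, number of features)

# Per-feature minimum and maximum, one row per feature
ranges = X.agg(["min", "max"]).T
print(ranges.head())
```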

In [None]:
dataset = load_breast_cancer(as_frame = True)
frame = dataset.frame
X = frame.iloc[:,:-1]
y = frame.iloc[:,-1]

X

### A basic perceptron model: one test

A simple perceptron model will often do a reasonable job on this data.  Performance can vary quite a bit, however, depending upon the exact test/train split we get, which by default is randomized across runs.  Performance can also be hampered somewhat by the fact that the original data is not scaled, so its features span different orders of magnitude.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
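
One way such a single-split test might look is sketched below. It is a minimal example, not the only approach: the `random_state=0` argument is an assumption added here to make the split repeatable, whereas leaving it out gives the randomized behaviour described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Assumption: fix the seed so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Perceptron()
model.fit(X_train, y_train)

# accuracy_score takes the true labels first, then the predictions
acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = accuracy_score(y_test, model.predict(X_test))
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)
```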

In [None]:
print("-----------------\nClassify with base data, 1 split\n-----------------")

acc_train = 0
acc_test = 0
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)

### Scaling data features

We can use the exact same test/train split, but scale all our features to the $[0,1]$ range, independently (i.e., each is scaled according to its own maximum/minimum values).  This tends to give significantly better performance, since large-magnitude features are less likely to dominate the weight updates during the solution process.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html
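
A minimal sketch of this step follows. Reusing a fixed seed (`SEED`, a name chosen here for illustration) is one assumed way of getting the same split for scaled and unscaled data: `train_test_split` selects the same rows whenever the number of samples and the `random_state` match.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Each column is scaled separately to [0, 1]; the result is a NumPy array
X_scaled = minmax_scale(X)

# Assumption: the same seed would be used when splitting the unscaled data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=SEED)

model = Perceptron()
model.fit(X_train, y_train)

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_test = accuracy_score(y_test, model.predict(X_test))
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)
```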

In [None]:
print("-----------------\nClassify with scaled data, 1 split\n-----------------")

acc_train = 0
acc_test = 0
print("Train accuracy: ", acc_train)
print("Test accuracy: ", acc_test)

### Cross-validation testing

Rather than a single randomized test/train split, we can automate the process somewhat by using $k$-fold cross-validation techniques.  As with most things, there are a number of ways of handling this; the one shown here is pretty basic, using a `KFold` object to generate splits of our data automatically.  Here, we do this for our basic, non-scaled data, using 5 folds.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
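
The loop below is one possible sketch of 5-fold testing on the unscaled data. The `shuffle=True` and `random_state=0` arguments are assumptions made here for a repeatable, shuffled split; a plain `KFold(n_splits=5)` splits the rows in their original order instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Assumption: shuffled folds with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
train_scores = []
test_scores = []

for train_idx, test_idx in kfold.split(X):
    # X and y are pandas objects, so rows are selected by position with .iloc
    X_train, X_test = X.iloc[train_idx, :], X.iloc[test_idx, :]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model = Perceptron()  # a fresh model for every fold
    model.fit(X_train, y_train)

    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

print("Average train accuracy: ", np.mean(train_scores))
print("Average test accuracy: ", np.mean(test_scores))
```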

In [None]:
print("-----------------\nClassify with base data, 5 folds\n-----------------")
    
print("\nAverage train accuracy: ", 0)
print("Average test accuracy: ", 0)

### Combining data transformation and cross-validation

We can also do our $k$-fold validation of data after scaling each feature to $[0,1]$.  Of the approaches shown here, this tends to give the best accuracy and the most robust estimate of expected performance.

This code is almost identical to the cross-validation on unscaled data, but differs in how we access the data.  For the unscaled data above, rows are selected with the `X.iloc[index,:]` notation, because the data is still in data-frame form from pandas.  Here, we index more directly as `X[index,:]`, because we have run `minmax_scale()` on it, which converts it from a pandas frame to a more basic array-based structure.

**NB**: the `KFold.split()` function can handle data in either pandas data-frame or basic array-based format, and several more.  In general, most of sklearn is pretty good at handling all manner of basic array-like data.  See the entry for 'array-like' at:

https://scikit-learn.org/stable/glossary.html
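
The indexing difference can be seen on a toy example. The small frame below is made up purely for illustration; the same index arrays from `KFold.split()` select rows from the pandas frame via `.iloc` and from the scaled array via plain brackets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import minmax_scale

# Hypothetical toy data: 4 rows, 2 features on different scales
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                      "b": [10.0, 20.0, 30.0, 40.0]})
scaled = minmax_scale(frame)      # a NumPy array, no longer a DataFrame

# Take the first of two folds; split() accepts the frame or the array
train_idx, test_idx = next(KFold(n_splits=2).split(frame))

frame_rows = frame.iloc[test_idx, :]   # positional indexing on a DataFrame
array_rows = scaled[test_idx, :]       # direct indexing on an array

print(type(frame_rows).__name__, type(array_rows).__name__)
```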

In [None]:
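print("-----------------\nClassify with scaled data, 5 folds\n-----------------")

# Assumption: a default Perceptron and an unshuffled 5-fold splitter; both are
# (re)defined here so that this cell runs on its own
model = Perceptron()
kfold = KFold(n_splits=5)

X_scaled = minmax_scale(X)  # returns a NumPy array, not a DataFrame
train_scores = []
test_scores = []

for train_idx, test_idx in kfold.split(X_scaled):
    X_train, X_test = X_scaled[train_idx,:], X_scaled[test_idx,:]
    # y is still a pandas Series, so select rows by position with .iloc
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]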
    
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    # accuracy_score expects the true labels first, then the predictions
    acc_train = accuracy_score(y_train, pred_train)
    acc_test = accuracy_score(y_test, pred_test)
    print("Train accuracy: ", acc_train)
    print("Test accuracy: ", acc_test)
    
    train_scores.append(acc_train)
    test_scores.append(acc_test)
    
# Average over all folds (not just the accuracy of the last fold)
print("\nAverage train accuracy: ", np.average(train_scores))
print("Average test accuracy: ", np.average(test_scores))