# Scientific Python: Scikit-Learn

## Benefits

### Scikit-Learn

***Scikit-learn*** is an open-source machine learning library that supports supervised and unsupervised learning.  It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. Some of the key benefits of the library include:

- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib for efficient implementation
- Open source and commercially usable

Key features include:

- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing

## Data Manipulation

### Dimensionality Reduction

*Dimensionality reduction* is the process of reducing the number of random variables (features) under consideration in a dataset.  Put another way, it reduces the number of features or variables in your dataset without losing significant predictive power.  One popular family of dimensionality reduction techniques is called *manifold learning*.

*Manifold learning* is an approach to nonlinear dimensionality reduction.  In other words, it enables you to take a dataset with $N$ features and represent it with $\hat{N}$ features, where $\hat{N} < N$.  A few manifold learning techniques are:

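- ***t-distributed stochastic neighbor embedding (t-SNE, `TSNE` in scikit-learn) -*** tool to visualize high-dimensional data; converts similarities between data points into joint probabilities and aims to minimize the Kullback-Leibler (K-L) divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data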
- ***Isomap -*** nonlinear dimensionality reduction through isometric mapping
- ***Multidimensional scaling (MDS) -*** seeks a low-dimensional representation of the data in which the distances between points approximate the distances in the original high-dimensional space
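
To make the $N \rightarrow \hat{N}$ idea concrete before turning to real data, here is a minimal sketch that reduces a synthetic "swiss roll" dataset from $N = 3$ features to $\hat{N} = 2$ with Isomap. The dataset, the sample count, and the `n_neighbors=10` setting are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic 3-feature dataset (illustrative assumption): points on a rolled-up sheet
X_demo, _ = make_swiss_roll(n_samples=500, random_state=0)

# Reduce from N = 3 features to N_hat = 2 features
# n_neighbors=10 is an arbitrary illustrative choice
X_reduced = Isomap(n_neighbors=10, n_components=2).fit_transform(X_demo)

print("Original shape:", X_demo.shape)
print("Reduced shape: ", X_reduced.shape)
```

The number of samples is unchanged; only the number of columns (features) shrinks.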

In [None]:
# Import packages

import time
import numpy
import sklearn
import toyplot
from sklearn import datasets, tree, model_selection
from sklearn.manifold import TSNE, Isomap, MDS

In [2]:
# Load Data

data = sklearn.datasets.load_breast_cancer()
X = data.data
Y = data.target

# Separate data into malignant and benign groups
benign_idx = numpy.where(Y==1)[0]
malignant_idx = numpy.where(Y==0)[0]

benign_data = X[benign_idx]
malignant_data = X[malignant_idx]

# Separate data into training and testing sets
xtrain, xtest, ytrain, ytest = model_selection.train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)


#### Data Transformation: TSNE

In [3]:
Xt = TSNE(n_components=2).fit_transform(X)



In [4]:
canvast = toyplot.Canvas(500,500)
axest = canvast.cartesian()
markt0 = axest.scatterplot(Xt[benign_idx], color='blue')
markt1 = axest.scatterplot(Xt[malignant_idx], color = 'red')
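
Since t-SNE works by minimizing a K-L divergence, it can be useful to inspect the value it reached. The sketch below keeps the fitted `TSNE` object around and reads its `kl_divergence_` attribute. The `random_state=0` argument and the variable names are assumptions added for reproducibility and illustration; the resulting number depends on the scikit-learn version and settings, so none is quoted here.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

# Reload the data so this sketch is self-contained
X_bc = datasets.load_breast_cancer().data

# Keep a reference to the estimator so its fitted attributes can be inspected
tsne = TSNE(n_components=2, random_state=0)
X_embedded = tsne.fit_transform(X_bc)

# K-L divergence after optimization (lower means the embedding's joint
# probabilities are closer to those of the original data)
print("Embedding shape:", X_embedded.shape)
print("Final K-L divergence:", tsne.kl_divergence_)
```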

#### Data Transformation: Isomap

In [5]:
Xi = Isomap(n_components=2).fit_transform(X)

In [6]:
canvasi = toyplot.Canvas(500,500)
axesi = canvasi.cartesian()
marki0 = axesi.scatterplot(Xi[benign_idx], color='blue')
marki1 = axesi.scatterplot(Xi[malignant_idx], color = 'red')

#### Data Transformation: MDS

In [7]:
Xm = MDS(n_components=2).fit_transform(X)

In [8]:
canvasm = toyplot.Canvas(500,500)
axesm = canvasm.cartesian()
markm0 = axesm.scatterplot(Xm[benign_idx], color='blue')
markm1 = axesm.scatterplot(Xm[malignant_idx], color = 'red')

### Preprocessing

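*Preprocessing* refers to the scaling, feature extraction, and/or normalization of data before performing any type of analysis.  Of these, scaling and normalization are probably the most commonly used.  Both can be done easily with ***scikit-learn*** using transformer classes such as:

- Standard Scaler (`StandardScaler`): centers each feature to zero mean and scales it to unit variance
- MinMax Scaler (`MinMaxScaler`): rescales each feature to a given range (by default $[0, 1]$)
- MaxAbs Scaler (`MaxAbsScaler`): divides each feature by its maximum absolute value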
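
As a rough illustration of what these three scalers compute, the sketch below reproduces each one by hand with NumPy and compares the result against scikit-learn. The small `demo` array is an assumption made up for this example (it deliberately has no constant column, which the scalers treat as a special case).

```python
import numpy
from sklearn import preprocessing

# Small fixed example (illustrative assumption); each column is one feature
demo = numpy.array([[1.0, -2.0], [3.0, 0.0], [5.0, 8.0]])

# StandardScaler: subtract the column mean, divide by the column standard deviation
standard_manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

# MinMaxScaler: map each column's [min, max] onto [0, 1] (the default range)
col_min, col_max = demo.min(axis=0), demo.max(axis=0)
minmax_manual = (demo - col_min) / (col_max - col_min)

# MaxAbsScaler: divide each column by its largest absolute value
maxabs_manual = demo / numpy.abs(demo).max(axis=0)

# Compare the hand-computed values against scikit-learn
standard_ok = numpy.allclose(standard_manual, preprocessing.StandardScaler().fit_transform(demo))
minmax_ok = numpy.allclose(minmax_manual, preprocessing.MinMaxScaler().fit_transform(demo))
maxabs_ok = numpy.allclose(maxabs_manual, preprocessing.MaxAbsScaler().fit_transform(demo))

print(standard_ok, minmax_ok, maxabs_ok)  # True True True
```

All three scalers operate column by column, so each feature is scaled independently of the others.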

In [9]:
from sklearn import preprocessing

In [27]:
data = numpy.random.randint(0, 10, (10,10))
print()
print("The original data is:")
print()
print(data)

# Standard Scaler
scaler = preprocessing.StandardScaler()
scaled_data = scaler.fit_transform(data)
print()
print("The standard scaled data is:")
print()
print(scaled_data)

# MinMax Scaler (rescale each feature to the range [-1, 1])
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
scaled_data = scaler.fit_transform(data)
print()
print("The MinMax scaled data is:")
print()
print(scaled_data)

# MaxAbs Scaler
scaler = preprocessing.MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print()
print("The MaxAbs scaled data is:")
print()
print(scaled_data)


The original data is:

[[6 1 0 8 0 1 6 4 7 8]
 [3 9 9 0 4 1 3 5 8 4]
 [2 9 6 6 0 8 5 8 6 4]
 [3 4 7 8 7 5 1 5 2 9]
 [2 6 7 4 2 8 6 4 5 5]
 [8 4 4 1 0 2 4 4 5 4]
 [1 7 6 6 6 0 3 6 2 1]
 [4 7 4 4 1 7 9 4 0 6]
 [0 9 8 4 7 0 5 8 6 8]
 [0 1 7 1 7 4 0 2 9 9]]

The standard scaled data is:

[[ 1.27733275 -1.60175571 -2.37577257  1.40069858 -1.14354375 -0.84622792
   0.72524067 -0.55901699  0.73521462  0.87235674]
 [ 0.04120428  1.12463699  1.31077107 -1.54814054  0.20180184 -0.84622792
  -0.48349378  0.          1.10282193 -0.71374643]
 [-0.37083854  1.12463699  0.08192319  0.6634888  -1.14354375  1.43207802
   0.32232919  1.67705098  0.36760731 -0.71374643]
 [ 0.04120428 -0.57935845  0.49153915  1.40069858  1.21081103  0.45566119
  -1.28931674  0.         -1.10282193  1.26888254]
 [-0.37083854  0.10223973  0.49153915 -0.07372098 -0.47087096  1.43207802
   0.72524067 -0.55901699  0.         -0.31722063]
 [ 2.10141839 -0.57935845 -0.73730873 -1.17953565 -1.14354375 -0.52075564
  -0.0805823  -

## Modeling

### Supervised and Unsupervised Modeling

One of the greatest features of the ***scikit-learn*** library is its implementation of various popular and useful supervised and unsupervised learning models.  Regardless of the model, the library follows a common structure for defining, training (fitting), and testing (predicting on) models.  The common/universal steps are:

- ***Instantiating***: defining the instance of a model
- ***Fitting***: fitting the instantiated model to your training inputs (for both supervised and unsupervised models) and training outputs (for supervised models only)
- ***Predicting***: evaluating the model on (presumably) previously unseen data

In [35]:
import numpy
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

X = numpy.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y = numpy.array([0, 0, 0, 1, 1, 1])

### Unsupervised Model

# Instantiate
gm = GaussianMixture(n_components=2, random_state=0)

# Fit
gm.fit(X)

# Predict
u_preds = gm.predict([[1,-6], [10, 14]])

print("The predictions for the unsupervised (clustering) model are: ", u_preds)

### Supervised Model

# Instantiate
dt = DecisionTreeClassifier()

# Fit 
dt.fit(X, y)

# Predict
s_preds = dt.predict([[1,-6], [10, 14]])

print("The predictions for the supervised (classification) model are: ", s_preds)


The predictions for the unsupervised (clustering) model are:  [1 0]
The predictions for the supervised (classification) model are:  [0 1]
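
The same instantiate, fit, predict pattern carries over to real data. As a minimal sketch, the block below applies it to the breast cancer dataset using the same train/test split settings as earlier in this document, and scores the held-out predictions with `accuracy_score`. The choice of a decision tree, the `random_state=0` argument, and the variable names are assumptions for illustration; the exact accuracy will vary with the scikit-learn version, so no value is quoted.

```python
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Reload and split the data so this sketch is self-contained
data = datasets.load_breast_cancer()
xtrain, xtest, ytrain, ytest = model_selection.train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

# Instantiate
clf = DecisionTreeClassifier(random_state=0)

# Fit on the training inputs and training outputs
clf.fit(xtrain, ytrain)

# Predict on the held-out test inputs
test_preds = clf.predict(xtest)

# Fraction of held-out samples classified correctly
acc = accuracy_score(ytest, test_preds)
print(f"Held-out accuracy: {acc:.3f}")
```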
