<a href="https://colab.research.google.com/github/ekaratnida/Applied-machine-learning/blob/master/colabs/scikit/Simple_Scikit_Integration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/scikit/Simple_Scikit_Integration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{simple-sklearn} -->

<img src="https://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />


# 🏋️‍♀️ W&B + 🧪 Scikit-learn
Use Weights & Biases for machine learning experiment tracking, dataset versioning, and project collaboration.


<img src="https://wandb.me/mini-diagram" width="650" alt="Weights & Biases" />


## What this notebook covers:
* Easy integration of Weights & Biases with scikit-learn.
* W&B Scikit plots for model interpretation and diagnostics for regression, classification, and clustering.

**Note**: Sections starting with _Step_ are all you need to integrate W&B into existing code.


## The interactive W&B Dashboard will look like this:

![](https://i.imgur.com/F1ZgR4A.png)

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn import datasets, cluster

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## Step 0: Install W&B

In [2]:
!pip install wandb -qU

## Step 1: Import W&B and Login

In [3]:
import wandb


In [4]:
wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

# Regression

**Let's check out a quick example**

In [6]:
# Load data
housing = datasets.fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X, y = X[::2], y[::2]  # subsample for faster demo
# ignore warnings about charts being built from subset of data
wandb.errors.term._show_warnings = False

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model, get predictions
reg = Ridge()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

## Step 2: Initialize W&B run

In [7]:
run = wandb.init(project='my-scikit-integration', name="regression")

wandb: Currently logged in as: pokekarat (dads). Use `wandb login --relogin` to force relogin


## Step 3: Visualize model performance

### Residual Plot

Measures and plots the predicted target values (y-axis) vs the difference between actual and predicted target values (x-axis), as well as the distribution of the residual error.

Generally, the residuals of a well-fit model should be randomly distributed because good models will account for most phenomena in a data set, except for random error.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#residuals-plot)


In [8]:
wandb.sklearn.plot_residuals(reg, X_train, y_train)
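To make the quantity behind this plot concrete, here is a minimal sketch that computes residuals by hand. It uses a synthetic dataset from `make_regression` as a stand-in for the housing data (an assumption, so the example runs offline), and the variable names are illustrative only. It is not what `plot_residuals` does internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, keeps the sketch offline).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

ridge_demo = Ridge().fit(X_tr, y_tr)

# Residual = actual target minus predicted target, one value per sample.
residuals_tr = y_tr - ridge_demo.predict(X_tr)
residuals_te = y_te - ridge_demo.predict(X_te)

print(f"train residuals: mean={residuals_tr.mean():.3f}, std={residuals_tr.std():.3f}")
print(f"test residuals:  mean={residuals_te.mean():.3f}, std={residuals_te.std():.3f}")
```

Because `Ridge` fits an intercept by default, the training residuals average to zero; any visible trend in residuals against predictions hints at structure the model has missed.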



### Outlier Candidate

Measures a datapoint's influence on a regression model via Cook's distance. Instances with heavily skewed influence could be outliers, which makes this plot useful for outlier detection.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#outlier-candidates-plot)

In [9]:
wandb.sklearn.plot_outlier_candidates(reg, X_train, y_train)
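As a rough illustration of Cook's distance, the sketch below computes it by hand for ordinary least squares on synthetic data. The data, the planted outlier, and all variable names are assumptions made for illustration. This is not the code `plot_outlier_candidates` runs, and the notebook's `Ridge` model would need a regularized hat matrix instead of the plain one used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2  # p counts fitted parameters: intercept + slope

x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)

# Plant one high-leverage outlier at index 0 (illustrative assumption).
x[0] = 2.0
y[0] = 2.0 * x[0] + 1.0 + 10.0

A = np.column_stack([np.ones(n), x])   # design matrix with intercept column
H = A @ np.linalg.pinv(A)              # hat matrix: maps y to fitted values
leverage = np.diag(H)
resid = y - H @ y
s2 = resid @ resid / (n - p)           # unbiased residual variance estimate

# Cook's distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
cooks_d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

print("most influential index:", int(np.argmax(cooks_d)))
```

Points whose distance stands far above the rest are the outlier candidates; the planted point at index 0 should dominate here.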



## All-in-one: Regression plot

Using this all-in-one API you can:
* Log summary of metrics
* Log learning curve
* Log outlier candidates
* Log residual plot

In [10]:
wandb.sklearn.plot_regressor(reg, X_train, X_test, y_train, y_test, model_name='Ridge')

wandb.finish()

wandb: Plotting Ridge.
wandb: Logged summary metrics.
wandb: Logged learning curve.
wandb: Logged outlier candidates.
wandb: Logged residuals.

# Classification

**Let's check out a quick example.**

In [11]:
# Load data
wbcd = wisconsin_breast_cancer_data = datasets.load_breast_cancer()
feature_names = wbcd.feature_names
labels = wbcd.target_names

X_train, X_test, y_train, y_test = train_test_split(wbcd.data, wbcd.target, test_size=0.2)


# Train model, get predictions
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_probas = model.predict_proba(X_test)
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

## Step 2: Initialize W&B run

In [12]:
run = wandb.init(project='my-scikit-integration', name="classification")

## Step 3: Visualize model performance

### Class Proportions

Plots the distribution of target classes in training and test sets. Useful for detecting imbalanced classes and ensuring that one class doesn't have a disproportionate influence on the model.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#class-proportions)

In [None]:
wandb.sklearn.plot_class_proportions(y_train, y_test, labels)
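For intuition, the same class proportions can be computed directly with NumPy. This is a minimal sketch, not the W&B implementation; the fixed `random_state` and the `stratify` argument are additions that the cell above does not use, included here only to make the output reproducible and the splits balanced.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
# stratify keeps class ratios similar across splits (assumption, not in the original cell).
_, _, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0, stratify=data.target
)

proportions = {}
for split_name, y_split in [("train", y_tr), ("test", y_te)]:
    counts = np.bincount(y_split, minlength=len(data.target_names))
    proportions[split_name] = counts / counts.sum()
    summary = ", ".join(
        f"{name}={frac:.2f}" for name, frac in zip(data.target_names, proportions[split_name])
    )
    print(f"{split_name}: {summary}")
```

If the train and test proportions differ sharply, metrics measured on the test set may not reflect the class mix the model was trained on.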

### Learning Curve

Trains the model on datasets of varying sizes and plots cross-validated scores against dataset size, for both training and test sets.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#learning-curve)

In [None]:
wandb.sklearn.plot_learning_curve(model, X_train, y_train)
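The numbers behind such a plot can be approximated with scikit-learn's own `learning_curve` helper. The sketch below is an assumption about a reasonable setup (3 folds, 4 training sizes, a small forest for speed), not a description of what W&B computes internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X_lc, y_lc = load_breast_cancer(return_X_y=True)

# Fit on 20%, 47%, 73% and 100% of each training fold, with 3-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X_lc,
    y_lc,
    cv=3,
    train_sizes=np.linspace(0.2, 1.0, 4),
    shuffle=True,
    random_state=0,
)

for size, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={size:4d}  train acc={tr:.3f}  cv acc={va:.3f}")
```

A large, persistent gap between the training and cross-validated scores suggests overfitting, while two low, converged curves suggest the model is too simple for the data.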

### ROC

ROC curves plot the true positive rate (y-axis) against the false positive rate (x-axis) across decision thresholds. The ideal point is `TPR = 1` and `FPR = 0`, the top-left corner of the plot. The curve is typically summarized by the area under it (AUC-ROC): the greater the AUC-ROC, the better.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#roc)

In [None]:
wandb.sklearn.plot_roc(y_test, y_probas, labels)
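To see where the curve's points come from, here is a minimal sketch using `sklearn.metrics.roc_curve` and `roc_auc_score` on the same dataset. The fixed seeds and the smaller forest are assumptions for reproducibility and speed; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# roc_curve needs a score for the positive class (label 1), not both columns.
pos_scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, pos_scores)
auc = roc_auc_score(y_te, pos_scores)

print(f"AUC-ROC: {auc:.3f} computed over {len(thresholds)} thresholds")
```

Each threshold yields one (FPR, TPR) pair; the curve always runs from (0, 0) to (1, 1).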

### Precision Recall Curve

Computes the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#precision-recall-curve)

In [None]:
wandb.sklearn.plot_precision_recall(y_test, y_probas, labels)
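The tradeoff described above can be inspected directly with `sklearn.metrics.precision_recall_curve`. This is a sketch under the same assumptions as the other examples (fixed seeds, a smaller forest), and average precision is used here as one common summary of the area under the curve.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pos_scores = clf.predict_proba(X_te)[:, 1]

# One (precision, recall) pair per threshold, plus a final (1, 0) endpoint.
precision, recall, pr_thresholds = precision_recall_curve(y_te, pos_scores)
avg_precision = average_precision_score(y_te, pos_scores)

print(f"average precision: {avg_precision:.3f}")
```

Raising the threshold generally trades recall for precision, which is why the two are reported together.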

### Feature Importances

Evaluates and plots the importance of each feature for the classification task. Only works with classifiers that have a `feature_importances_` attribute, like trees.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#feature-importances)

In [None]:
wandb.sklearn.plot_feature_importances(model);
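As a quick sanity check outside the dashboard, the `feature_importances_` attribute can be ranked and printed by hand. The sketch below retrains a small forest with a fixed seed (an assumption for reproducibility) and lists the five most important features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(data.data, data.target)

importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # indices sorted from most to least important

for rank, idx in enumerate(order[:5], start=1):
    print(f"{rank}. {data.feature_names[idx]:<25s} {importances[idx]:.3f}")
```

For tree ensembles in scikit-learn these impurity-based importances are normalized to sum to 1, so each value reads as a share of the total.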

## All-in-one: Classifier Plot

Using this all-in-one API you can:
* Log feature importance
* Log learning curve
* Log confusion matrix
* Log summary metrics
* Log class proportions
* Log calibration curve
* Log ROC curve
* Log precision-recall curve

In [13]:
wandb.sklearn.plot_classifier(model,
                              X_train, X_test,
                              y_train, y_test,
                              y_pred, y_probas,
                              labels,
                              is_binary=True,
                              model_name='RandomForest')

wandb.finish()

wandb: Plotting RandomForest.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.
wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.

# Clustering

In [17]:
# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
names = iris.target_names

def get_label_ids(classes):
    return np.array([names[aclass] for aclass in classes])
labels = get_label_ids(y)

# Train model
kmeans = KMeans(n_clusters=3, random_state=1)
cluster_labels = kmeans.fit_predict(X)

## Step 2: Initialize W&B run

In [19]:
run = wandb.init(project='my-scikit-integration', name="clustering")

## Step 3: Visualize model performance

### Elbow Plot

Measures and plots the percentage of variance explained as a function of the number of clusters, along with training times. Useful in picking the optimal number of clusters.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#elbow-plot)

In [None]:
wandb.sklearn.plot_elbow_curve(kmeans, X)
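The idea behind an elbow plot can be sketched by refitting `KMeans` for several cluster counts and recording a fit measure for each. The example below uses inertia (within-cluster sum of squares), a closely related quantity to explained variance; the range of `k` values and `n_init=10` are assumptions, and this is not the exact computation W&B performs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

ks = list(range(1, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_iris)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of `k` after which adding clusters yields only small further reductions.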

### Silhouette Plot

Measures and plots how close each point in one cluster is to points in the neighboring clusters. The thickness of each cluster's band corresponds to the cluster size. The vertical line represents the average silhouette score across all points.

[Check out the official documentation here $\rightarrow$](https://docs.wandb.com/library/integrations/scikit#silhouette-plot)

In [None]:
wandb.sklearn.plot_silhouette(kmeans, X, labels)
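The scores shown in a silhouette plot can be reproduced with `sklearn.metrics.silhouette_samples` and `silhouette_score`. The following is a minimal sketch with illustrative variable names; `n_init=10` is an assumption added for stable results across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X_iris = load_iris().data
assignments = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_iris)

per_point = silhouette_samples(X_iris, assignments)  # one score in [-1, 1] per point
overall = silhouette_score(X_iris, assignments)      # mean of the per-point scores

for cluster_id in np.unique(assignments):
    members = per_point[assignments == cluster_id]
    print(f"cluster {cluster_id}: size={members.size}, mean silhouette={members.mean():.3f}")
print(f"overall silhouette: {overall:.3f}")
```

Scores near 1 indicate points well inside their cluster, scores near 0 indicate points on a boundary, and negative scores suggest a point may sit in the wrong cluster.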

## All-in-one: Clusterer Plot

Using this all-in-one API you can:
* Log elbow curve
* Log silhouette plot

In [20]:
wandb.sklearn.plot_clusterer(kmeans, X, cluster_labels, labels, 'KMeans')

wandb.finish()

wandb: Plotting KMeans.
wandb: Logged elbow curve.
wandb: Logged silhouette plot.

# Sweep 101

Use Weights & Biases Sweeps to automate hyperparameter optimization and explore the space of possible models.

## [Check out Hyperparameter Optimization in PyTorch using W&B Sweeps $\rightarrow$](http://wandb.me/sweeps-colab)

Running a hyperparameter sweep with Weights & Biases is very easy. There are just 3 simple steps:

1. **Define the sweep:** We do this by creating a dictionary or a [YAML file](https://docs.wandb.com/library/sweeps/configuration) that specifies the parameters to search through, the search strategy, the optimization metric, etc.

2. **Initialize the sweep:**
`sweep_id = wandb.sweep(sweep_config)`

3. **Run the sweep agent:**
`wandb.agent(sweep_id, function=train)`

And voila! That's all there is to running a hyperparameter sweep! The notebook linked above walks through these 3 steps in more detail.
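As a concrete illustration of step 1, here is a hypothetical sweep configuration for the random forest classifier used earlier. The parameter names, value lists, and metric name are assumptions chosen for this sketch. The `train` helper takes a plain dictionary so the example runs without a W&B account; inside a real sweep it would call `wandb.init()`, read its settings from `wandb.config`, and log the score with `wandb.log`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Step 1: define the sweep (hypothetical search space for illustration).
sweep_config = {
    "method": "random",  # search strategy: "grid", "random" or "bayes"
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"values": [25, 50, 100]},
        "max_depth": {"values": [3, 5, 10]},
    },
}


def train(config):
    """Fit a forest with the given settings and return mean CV accuracy."""
    X_sw, y_sw = load_breast_cancer(return_X_y=True)
    clf = RandomForestClassifier(
        n_estimators=config["n_estimators"],
        max_depth=config["max_depth"],
        random_state=0,
    )
    return cross_val_score(clf, X_sw, y_sw, cv=3).mean()


# Local dry run of one candidate configuration.
accuracy = train({"n_estimators": 25, "max_depth": 3})
print(f"accuracy: {accuracy:.3f}")

# Steps 2 and 3 need a logged-in W&B session, so they are left as comments:
# sweep_id = wandb.sweep(sweep_config, project="my-scikit-integration")
# wandb.agent(sweep_id, function=train_with_wandb, count=5)
```

Here `train_with_wandb` stands for a zero-argument wrapper you would write around `train`; it is a placeholder, not a W&B function.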

<img src="https://imgur.com/sdQXdDz.png" alt="Sweep Result" />


# Example Gallery

See examples of projects tracked and visualized with W&B in our gallery, [Fully Connected →](https://wandb.me/fc)

# Basic Setup
1. **Projects**: Log multiple runs to a project to compare them. `wandb.init(project="project-name")`
2. **Groups**: For multiple processes or cross-validation folds, log each process as a run and group them together. `wandb.init(group='experiment-1')`
3. **Tags**: Add tags to track your current baseline or production model.
4. **Notes**: Type notes in the table to track the changes between runs.
5. **Reports**: Take quick notes on progress to share with colleagues and make dashboards and snapshots of your ML projects.

# Advanced Setup
1. [Environment variables](https://docs.wandb.com/library/environment-variables): Set API keys in environment variables so you can run training on a managed cluster.
2. [Offline mode](https://docs.wandb.com/library/technical-faq#can-i-run-wandb-offline): Use `dryrun` mode (called `offline` in recent W&B releases) to train offline and sync results later.
3. [On-prem](https://docs.wandb.com/self-hosted): Install W&B in a private cloud or air-gapped servers in your own infrastructure. We have local installations for everyone from academics to enterprise teams.
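The environment-variable and offline options above can be sketched as follows. `WANDB_PROJECT`, `WANDB_MODE`, and `WANDB_API_KEY` are environment variables W&B reads; the values shown are placeholders, and the exact mode names accepted may depend on your installed version, so treat this as a starting point rather than a definitive setup.

```python
import os

# Set these before calling wandb.init(), for example at the top of a training script.
os.environ["WANDB_PROJECT"] = "my-scikit-integration"  # default project for new runs
os.environ["WANDB_MODE"] = "offline"                   # record runs locally, no network

# On a managed cluster, supply the key through the scheduler's secret store
# rather than hard-coding it (placeholder shown, not a real key):
# os.environ["WANDB_API_KEY"] = "<your-api-key>"

print({k: v for k, v in os.environ.items() if k in ("WANDB_PROJECT", "WANDB_MODE")})
```

Runs recorded offline are written to a local `wandb` directory and can be uploaded later with the `wandb sync` command.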