## `scikit-learn` is a widely used machine learning library that implements a large collection of supervised and unsupervised learning algorithms.

## Below are some examples of useful functionality provided by the library.


In [None]:
from sklearn import datasets
import numpy as np

### Toy Datasets

#### `scikit-learn` ships with a few small standard datasets that don't need to be downloaded from the web. They are very useful for testing and understanding algorithms before running them on a large dataset.
#### These include datasets for both classification and regression tasks. Let's load one of them, the diabetes dataset, and print some basic information about it.

In [None]:
diabetes = datasets.load_diabetes()
print('Shape of the dataset :', diabetes.data.shape)
print('Shape of the labels:', diabetes.target.shape)
print('features in the dataset:', diabetes.feature_names)

Shape of the dataset : (442, 10)
Shape of the labels: (442,)
features in the dataset: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
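
#### As a small sketch of two other ways to look at the same toy dataset: the loader accepts `return_X_y=True` to return the feature matrix and target vector directly, and the returned object carries a text description in `DESCR`. The variable names below are illustrative only.

```python
from sklearn import datasets

# return_X_y=True skips the container object and returns (features, targets)
X, y = datasets.load_diabetes(return_X_y=True)
print("Features:", X.shape, "Targets:", y.shape)

# The default return value also holds a human-readable description
diabetes_info = datasets.load_diabetes()
print(diabetes_info.DESCR[:300])
```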


### Train/Test Split
#### One of the most important steps in training any machine learning model is creating a good train/test split of your dataset. `scikit-learn` provides a handy utility, `train_test_split`, for this. It also performs input validation and optional subsampling.

#### The following code performs a 75% train and 25% test split of the diabetes dataset loaded in the previous step. We also pass `random_state` so that the result is reproducible, because this utility shuffles the dataset randomly by default.

In [None]:
# Split the dataset into train and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25, random_state=42)

print("Train features shape: ", X_train.shape)
print("Train labels: ", Y_train.shape)
print("Test features shape: ", X_test.shape)
print("Test labels: ", Y_test.shape)

Train features shape:  (331, 10)
Train labels:  (331,)
Test features shape:  (111, 10)
Test labels:  (111,)
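
#### To illustrate why `random_state` matters, here is a minimal sketch that splits the same data three times. The seed values and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Two calls with the same seed produce identical test sets
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
same_seed_identical = np.array_equal(X_test_a, X_test_b)

# A different seed shuffles the rows differently
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
different_seed_identical = np.array_equal(X_test_a, X_test_c)

print("Same seed, same split:", same_seed_identical)
print("Different seed, same split:", different_seed_identical)
```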


### Linear Regression
#### Linear regression is one of the most basic yet useful machine learning models. It's widely used for simple forecasting tasks. In the next cell, we will fit a linear regression model using the `LinearRegression` class.

In [None]:
# import LinearRegression class
from sklearn.linear_model import LinearRegression
# instantiate LinearRegression class
linear_regression = LinearRegression()

# Fit linear regression model on the training dataset created above
linear_regression.fit(X_train, Y_train)

LinearRegression()

### After training the model, we will predict target values for the test dataset

In [None]:
# Now we will perform prediction on the test dataset
Y_predictions = linear_regression.predict(X_test)

print("Test predictions shape: ", Y_predictions.shape)

Test predictions shape:  (111,)
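
#### A fitted `LinearRegression` exposes its learned weights as `coef_` and its bias as `intercept_`. The sketch below refits the same model on the same split and checks that a prediction is just the dot product of the features with the weights plus the bias. The variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# One weight per feature, plus a single intercept term
print("Coefficients shape:", model.coef_.shape)
print("Intercept:", model.intercept_)

# Reproduce predict() by hand: y_hat = X @ w + b
manual_predictions = X_test @ model.coef_ + model.intercept_
matches = np.allclose(manual_predictions, model.predict(X_test))
print("Manual predictions match predict():", matches)
```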


### Error/Loss metrics
#### Measuring how well a model has learned, and how well it generalizes to unseen data, is another important step. For regression models this is done with metrics known as `prediction error` or `prediction loss`.

#### `scikit-learn` provides many built-in functions for computing prediction errors. Two of the most popular are mean squared error (`mse`) and mean absolute error (`mae`). In the next cell, we compute both metrics on the predictions generated above.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Calculate mean squared error which is the mean of the squared difference between predictions and actuals
mse = mean_squared_error(Y_test, Y_predictions)
# Calculate mean absolute error which is the mean of the absolute difference between predictions and actuals
mae = mean_absolute_error(Y_test, Y_predictions)

print("Mean squared error: %4f" % mse)
print("Mean absolute error: %4f" % mae)

Mean squared error: 2848.295308
Mean absolute error: 41.548363
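
#### To make the two definitions concrete, here is a small sketch that computes both metrics by hand with NumPy and compares them with the `scikit-learn` functions. The four values in `y_true` and `y_pred` are made up for illustration and are unrelated to the diabetes results above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

# MSE: mean of the squared differences
mse_manual = np.mean(errors ** 2)
# MAE: mean of the absolute differences
mae_manual = np.mean(np.abs(errors))

print("Manual MSE:", mse_manual, "sklearn MSE:", mean_squared_error(y_true, y_pred))
print("Manual MAE:", mae_manual, "sklearn MAE:", mean_absolute_error(y_true, y_pred))
```

#### Because MSE squares each error, a single large miss affects it more than it affects MAE.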


### SVM (Support Vector Machine) Model

### A support vector machine is used for classification and regression. For classification, it finds a hyperplane in an N-dimensional feature space that separates the classes with the largest possible margin. When the classes can't be separated linearly in the original space, a kernel function implicitly maps the data to a higher-dimensional space where they may become linearly separable.
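
### The effect of moving to a higher-dimensional space can be sketched on synthetic data. The example below is an illustration only and is separate from the breast cancer example in this notebook: it uses `make_circles` to generate two concentric rings, which no straight line can separate, and compares a linear kernel with an RBF kernel. The dataset parameters are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in 2D
X_rings, y_rings = make_circles(
    n_samples=200, noise=0.05, factor=0.3, random_state=0
)

# A linear kernel can only draw a straight boundary
linear_acc = SVC(kernel="linear").fit(X_rings, y_rings).score(X_rings, y_rings)

# An RBF kernel implicitly works in a higher-dimensional space
rbf_acc = SVC(kernel="rbf").fit(X_rings, y_rings).score(X_rings, y_rings)

print("Linear kernel training accuracy: %.3f" % linear_acc)
print("RBF kernel training accuracy: %.3f" % rbf_acc)
```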

### `scikit-learn` also has a built-in class that implements a support vector machine. In the next cells, we will load the breast cancer toy dataset and fit a support vector machine model on the training data.

### Let's load the dataset and split it into training and test sets. We use a stratified split here because class labels can be heavily imbalanced. For example, if there are many more negative examples than positive ones, a model can achieve high accuracy while learning only the majority class. Stratifying keeps the class proportions the same in both splits.

In [None]:
from sklearn.svm import SVC

# Load breast cancer toy dataset
b_cancer = datasets.load_breast_cancer()
X = b_cancer.data
Y = b_cancer.target

# Create training and test split with stratified sampling
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42, stratify=Y)
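
### As a quick check of what `stratify` does, the following sketch repeats the split above and compares the class proportions in the full dataset with those in each split. The variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fraction of samples in each class (index 0 and index 1)
full_ratio = np.bincount(y) / len(y)
train_ratio = np.bincount(y_train) / len(y_train)
test_ratio = np.bincount(y_test) / len(y_test)

print("Full dataset class ratio:", full_ratio)
print("Train class ratio:", train_ratio)
print("Test class ratio:", test_ratio)
```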

### Preprocessing
#### Most real-world datasets can't be used directly to train a model. They may contain missing values, features with very large maximum values, or categorical features. This can make it hard to train a model on raw data, either because it is unclear how to handle missing values or because large unscaled values make training more computationally expensive. Preprocessing the data is therefore a key step in a machine learning workflow.

#### The `sklearn.preprocessing` module provides standard preprocessing tools such as `StandardScaler`. After scaling, each feature of the training data has zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit StandardScaler, which computes the mean and scale of each feature
sc.fit(X_train)

# let's print mean and scale of raw data
print("Raw data mean: ", sc.mean_)
print("Raw data scale: ", sc.scale_)

print("=============================================================================================")
print("=============================================================================================")

X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
print("Scaled train data mean: {} std: {} ".format(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0)))
print("Scaled test data mean: {} std: {} ".format(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0)))

Raw data mean:  [1.40752019e+01 1.92950469e+01 9.15929108e+01 6.49627230e+02
 9.60178404e-02 1.03459577e-01 8.89306613e-02 4.82109366e-02
 1.80202582e-01 6.27792958e-02 3.98923239e-01 1.21372300e+00
 2.82511291e+00 3.91714390e+01 7.07850939e-03 2.55824695e-02
 3.24967761e-02 1.16545986e-02 2.07900657e-02 3.81400728e-03
 1.62025423e+01 2.57082629e+01 1.06803732e+02 8.72524413e+02
 1.32365258e-01 2.56040775e-01 2.77706495e-01 1.14155988e-01
 2.91209624e-01 8.40794601e-02]
Raw data scale:  [3.50479948e+00 4.44258216e+00 2.41565929e+01 3.45533249e+02
 1.33298715e-02 5.26037833e-02 8.08417973e-02 3.84561113e-02
 2.76122781e-02 7.22760813e-03 2.63030497e-01 5.69417242e-01
 1.94955244e+00 4.12775756e+01 3.08496788e-03 1.84850051e-02
 3.27951744e-02 6.33488329e-03 8.85972255e-03 2.83941769e-03
 4.80276118e+00 6.30299305e+00 3.34093443e+01 5.58478747e+02
 2.26226334e-02 1.62163195e-01 2.19105068e-01 6.71725169e-02
 6.52951977e-02 1.87265961e-02]
Scaled train data mean: [-4.68430719e-15  3.92226
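
#### What `StandardScaler` does can be written out by hand: subtract the per-feature mean learned from the training data and divide by the per-feature scale. The sketch below repeats the split and scaling from the cells above and checks this. The variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# transform() is equivalent to (x - mean) / scale, using training statistics
manual_test_scaled = (X_test - scaler.mean_) / scaler.scale_
matches = np.allclose(manual_test_scaled, scaler.transform(X_test))
print("Manual scaling matches transform():", matches)

# Training data ends up with zero mean and unit standard deviation
train_scaled = scaler.transform(X_train)
print("Max |mean| on train:", np.abs(train_scaled.mean(axis=0)).max())
print("Max |std - 1| on train:", np.abs(train_scaled.std(axis=0) - 1).max())
```

#### The test set is scaled with the training statistics, so its mean and standard deviation are close to, but not exactly, 0 and 1.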

### Fit the model on the scaled training dataset and evaluate its performance
#### `sklearn.metrics` also provides several ways to measure classification performance: precision, which is the ratio (True Positives) / (True Positives + False Positives); recall, which is the ratio (True Positives) / (True Positives + False Negatives); and accuracy, the fraction of all predictions that are correct.

In [None]:
# Instantiate SVM classifier and fit on training dataset
svc = SVC(C=1.0, random_state=1, kernel='linear')
svc.fit(X_train_scaled, Y_train)

# let's check the performance of our model
from sklearn.metrics import accuracy_score, recall_score, precision_score

Y_predict = svc.predict(X_test_scaled)

print("Accuracy score %.3f" % accuracy_score(Y_test, Y_predict))
print("Recall: ", recall_score(Y_test, Y_predict))
print("Precision: " , precision_score(Y_test, Y_predict))

Accuracy score 0.986
Recall:  0.9888888888888889
Precision:  0.9888888888888889


### Classification report
#### Another useful function is `classification_report`, which summarizes model performance. It reports precision, recall, f1-score and support (the number of true examples) for each class, along with the overall accuracy.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(Y_test, Y_predict))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        53
           1       0.99      0.99      0.99        90

    accuracy                           0.99       143
   macro avg       0.99      0.99      0.99       143
weighted avg       0.99      0.99      0.99       143
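
#### When the numbers are needed in code rather than as printed text, `classification_report` accepts `output_dict=True` and returns a nested dictionary keyed by class label. The sketch below uses six made-up labels for illustration, not the breast cancer predictions above.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# Per-class entries are keyed by the label converted to a string
class1_recall = report["1"]["recall"]
class1_support = report["1"]["support"]
overall_accuracy = report["accuracy"]

print("Class 1 recall:", class1_recall)
print("Class 1 support:", class1_support)
print("Accuracy:", overall_accuracy)
```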



### Confusion Matrix
#### Yet another useful function is `confusion_matrix`, which helps in visualizing model performance. Rows correspond to true classes and columns to predicted classes. It is especially useful in multiclass classification problems for seeing which classes are most often mislabeled.

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(Y_test, Y_predict)

array([[52,  1],
       [ 1, 89]])
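
#### The metrics reported earlier can be recovered from this matrix. The sketch below copies the four counts from the output above and recomputes precision, recall and accuracy for class 1 by hand. For a binary problem, `ravel()` on the matrix yields the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm = np.array([[52, 1],
               [1, 89]])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / cm.sum()

print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("Accuracy: %.3f" % accuracy)
```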