# 1. Introduction to Scikit-Learn

## sklearn focuses on modeling data. Its most popular groups of models and tools are:
### a) Supervised Learning Algorithms
### b) Unsupervised Learning Algorithms
### c) Clustering
### d) Cross Validation
### e) Dimensionality Reduction
### f) Ensemble Methods
### g) Feature Extraction
---

# Modelling Process

### 1. Dataset Loading
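- A dataset usually consists of a feature matrix (**X**) and a label vector (**y**)
- sklearn ships a few small example datasets, such as **iris** and **digits** for classification and **diabetes** for regression; the **Boston House Price** regression dataset was removed in scikit-learn 1.2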



In [1]:
# Iris Dataset
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data
y = iris.target

features_name = iris.feature_names
target_name = iris.target_names

print("Features Name:", features_name)
print("Targets Name:", target_name)
print("First 10 Rows in X:\n", X[:10])

Features Name: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Targets Name: ['setosa' 'versicolor' 'virginica']
First 10 Rows in X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
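For a regression counterpart to the iris example, a built-in regression dataset can be loaded the same way. The sketch below assumes the **diabetes** dataset is a suitable stand-in; the names `X_reg` and `y_reg` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns the feature matrix and target vector directly
# instead of the Bunch object used in the iris cell above.
X_reg, y_reg = load_diabetes(return_X_y=True)

print("Feature matrix shape:", X_reg.shape)  # one row per sample, one column per feature
print("Target vector shape:", y_reg.shape)   # one continuous target value per sample
```

The target here is a continuous value, so it would be paired with a regressor rather than a classifier.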


## Split the dataset
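- Training : Test = 70 : 30, set by **test_size = 0.3**
- **random_state = 1** fixes the shuffle, so the same split is produced on every run
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)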

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
print("Train data:", X_train.shape, "Train labels:", y_train.shape)
print("Test data:", X_test.shape, "Test labels:", y_test.shape)

Train data: (105, 4) Train labels: (105,)
Test data: (45, 4) Test labels: (45,)
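A plain random split does not guarantee that each class appears in the test set in the same proportion as in the full dataset. As a possible variation (not used in the original cell), `train_test_split` accepts a `stratify` argument that keeps the class proportions. The variable names below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# stratify=y_iris asks for the same class proportions in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=1, stratify=y_iris
)

# Count how many test samples fall into each of the three classes.
print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```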


## Train the Model
- Use the K-Nearest Neighbors classifier with k = 3 (**n_neighbors = 3**) and measure its accuracy on the held-out test set

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9777777777777777
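A single train/test split gives one accuracy figure that depends on which rows landed in the test set. Cross validation, listed in the introduction, averages over several splits. The following is a minimal sketch assuming 5 folds and the same k = 3; the names `knn_cv` and `scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# An unfitted estimator: cross_val_score clones and fits it once per fold.
knn_cv = KNeighborsClassifier(n_neighbors=3)

# cv=5 trains on 4/5 of the data and scores on the remaining 1/5, five times.
scores = cross_val_score(knn_cv, X_iris, y_iris, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```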


### Model Persistence
- A fitted model can be saved to disk and restored later with **dump** and **load** from the **joblib** package


In [4]:
from joblib import dump, load

dump(knn, "knn.joblib")     # save the fitted model to disk
model = load("knn.joblib")  # load it back as a ready-to-use estimator
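One way to check that persistence worked is to compare predictions before and after the round trip. This sketch writes to a temporary directory so it leaves no file behind; that choice, and the names `knn_saved` and `knn_loaded`, are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn_saved = KNeighborsClassifier(n_neighbors=3).fit(X_iris, y_iris)

# Save and reload inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn.joblib")
    dump(knn_saved, path)
    knn_loaded = load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(
    knn_saved.predict(X_iris), knn_loaded.predict(X_iris)
)
print("Same predictions:", same_predictions)
```

joblib files are pickle-based, so they should only be loaded from trusted sources.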

### Preprocessing the Data
- Package: **sklearn.preprocessing**
1. Convert numerical to boolean: **preprocessing.Binarizer(threshold = 2).transform(input_sample)** (values greater than the threshold become 1, all others become 0)


In [7]:
import numpy as np
from sklearn import preprocessing

input_sample = np.array([[2.1, 1.9, -0.5], [3.23, 5.29, 0.36]])
binarized = preprocessing.Binarizer(threshold = 2).transform(input_sample)
print("Binarized Data:", binarized)

Binarized Data: [[1. 0. 0.]
 [1. 1. 0.]]
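The binarization rule can be reproduced with a plain NumPy comparison, which may make the role of the threshold easier to see. The names `sample` and `manual` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

sample = np.array([[2.1, 1.9, -0.5], [3.23, 5.29, 0.36]])
threshold = 2

# Values strictly greater than the threshold map to 1.0, everything else to 0.0.
manual = (sample > threshold).astype(float)

# The same rule applied by scikit-learn's Binarizer.
binarized_check = preprocessing.Binarizer(threshold=threshold).transform(sample)

print("Manual:", manual)
print("Matches Binarizer:", np.array_equal(manual, binarized_check))
```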


2. Standardization (rescale each column to zero mean and unit standard deviation): **data_normalized = preprocessing.scale(input_sample)**

In [12]:
print("Mean =", input_sample.mean(axis = 0)) # per-column mean before scaling
print("Std =", input_sample.std(axis = 0))   # per-column standard deviation before scaling

data_normalized = preprocessing.scale(input_sample)
print("Mean =", data_normalized.mean(axis = 0))
print("Std =", data_normalized.std(axis = 0))

Mean = [ 2.665  3.595 -0.07 ]
Std = [0.565 1.695 0.43 ]
Mean = [0.00000000e+00 1.66533454e-16 0.00000000e+00]
Std = [1. 1. 1.]
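**preprocessing.scale** standardizes a single array in one call. When the data is split into training and test sets, a common alternative is **StandardScaler**, which learns the mean and standard deviation from the training set and reuses them on the test set. The sketch below assumes the same iris split as earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=1
)

# Learn per-column mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_tr)

# Apply the training statistics to both sets.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Train mean:", X_tr_scaled.mean(axis=0))  # close to 0 by construction
print("Test mean:", X_te_scaled.mean(axis=0))   # generally not exactly 0
```

The test set is scaled with statistics it did not contribute to, so no information from it leaks into preprocessing.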


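3. Min-Max Scaling / Normalization: **preprocessing.MinMaxScaler(feature_range = (0, 1)).fit_transform(X)**
  $X_{\text{std}} = \dfrac{X - X_{\min}}{X_{\max} - X_{\min}}$, then $X_{\text{scaled}} = X_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}$, where $X_{\min}$ = X.min(axis=0), $X_{\max}$ = X.max(axis=0), and (min, max) = feature_range  ${\text{default axis = 0}}$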

In [23]:
X = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])
min_max_scaled = preprocessing.MinMaxScaler(feature_range = (0, 1)).fit_transform(X)
print("Min-max Scaled:", min_max_scaled)

Min-max Scaled: [[0.48648649 0.58252427 0.99122807]
 [0.         1.         0.81578947]
 [0.27027027 0.         1.        ]
 [1.         0.99029126 0.        ]]
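The min-max formula can be checked by computing it by hand with NumPy and comparing against **MinMaxScaler**. The names `data`, `lo`, and `hi` are illustrative; `lo` and `hi` stand for the two ends of `feature_range`.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])
lo, hi = 0, 1  # the two ends of feature_range

# Per-column minimum and maximum (axis=0).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Map each column to [0, 1], then stretch and shift to [lo, hi].
data_std = (data - col_min) / (col_max - col_min)
manual_scaled = data_std * (hi - lo) + lo

sk_scaled = preprocessing.MinMaxScaler(feature_range=(lo, hi)).fit_transform(data)
print("Matches MinMaxScaler:", np.allclose(manual_scaled, sk_scaled))
```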


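4. Normalization: rescales each sample (row) to unit norm; default axis = 1
    a. L1-Norm: **preprocessing.normalize(X, norm = "l1")** (the absolute values in each row sum to 1)
    b. L2-Norm: **preprocessing.normalize(X, norm = "l2")** (the squared values in each row sum to 1)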

In [24]:
l1_normed = preprocessing.normalize(X, norm = "l1")
l2_normed = preprocessing.normalize(X, norm = "l2")

print("L1 Normed:", l1_normed)
print("L2 Normed:", l2_normed)

L1 Normed: [[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]
L2 Normed: [[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]
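Row-wise normalization divides every row by its own norm. A minimal NumPy sketch of both norms, compared against **preprocessing.normalize**, is shown below; the names `data`, `l1_manual`, and `l2_manual` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[2.1, -1.9, 5.5], [-1.5, 2.4, 3.5], [0.5, -7.9, 5.6], [5.9, 2.3, -5.8]])

# L1: divide each row by the sum of its absolute values.
l1_manual = data / np.abs(data).sum(axis=1, keepdims=True)

# L2: divide each row by its Euclidean length.
l2_manual = data / np.linalg.norm(data, axis=1, keepdims=True)

print("L1 matches:", np.allclose(l1_manual, preprocessing.normalize(data, norm="l1")))
print("L2 matches:", np.allclose(l2_manual, preprocessing.normalize(data, norm="l2")))
```

Unlike min-max scaling, which works per column, this operates per row, so each sample is rescaled independently of the others.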
