<a href="https://colab.research.google.com/github/bomlme/Machine-Learning-Cheat-Sheets/blob/main/Scikit-Learn%20Cheat%20Sheet/ScikitLearnCheatSheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Scikit-learn

Workflow of a basic example

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()  # 1. Preprocessing
X, y = iris.data[:, :3], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)  # fit on the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuse the training mean and std
knc = KNeighborsClassifier(n_neighbors=3)  # 2. Model creation
knc.fit(X_train_std, y_train)  # 3. Model fitting
y_pred = knc.predict(X_test_std)  # 4. Prediction
accuracy_score(y_test, y_pred)  # 5. Performance evaluation

## 1) Preprocessing
https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

* Number of instances: 150 (50 in each of three classes)
* Attributes: sepal length, sepal width, petal length, petal width (cm)
* Classes: Iris-Setosa, Iris-Versicolour, Iris-Virginica

In [None]:
# Import data
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:,:4], iris.target

In [None]:
# Split training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

* Standardization: use when the features of the input dataset have large differences between their ranges
* Rescales each feature to zero mean and unit standard deviation
* Each attribute then contributes on a comparable scale to the analysis
* Standardization has no bounding range, so outliers are not clipped; they still influence the mean and standard deviation
* X_standardized = StandardScaler().fit_transform(X_train)

In [None]:
# Standardization (z-score = (x - μ) / σ)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)  # learn μ and σ from the training data only
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)
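
As a quick sanity check of the z-score formula above, the following sketch compares `StandardScaler` with a manual computation. The variable names (`manual`, `X_train_std`) are illustrative, and it assumes the scaler uses the population standard deviation, which is what NumPy's default `std` computes.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training split only, then reuse the same statistics for the test split
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Manual z-score using the training mean and (population) standard deviation
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))   # do both computations agree?
print(X_train_std.mean(axis=0).round(6))  # approximately 0 for each feature
print(X_train_std.std(axis=0).round(6))   # approximately 1 for each feature
print(X_test_std.mean(axis=0).round(2))   # close to 0, but generally not exactly 0
```

The test split is only approximately centered because it is scaled with the training statistics, which is the intended behavior.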

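* Normalization (min-max scaling): transform features to a similar scale, e.g. [0, 1] or [-1, 1]
* Use when features are on different scales
* Use when the distribution of the data does not follow a Gaussian distribution
* X_normalized = MinMaxScaler().fit_transform(X_train)
* Note: `Normalizer` is a different transformer; it rescales each sample (row) to unit norm rather than each feature (column) to a range


In [None]:
# Normalization ((X - X_min) / (X_max - X_min)), computed per feature
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train)  # learn X_min and X_max from the training data
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)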
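
The min-max formula in the comment above works column by column, whereas `Normalizer` works row by row. The sketch below contrasts the two on a small made-up array (`X_demo` is a hypothetical example, not part of the iris data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical toy data: 3 samples, 2 features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0]])

# MinMaxScaler works column by column: (x - min) / (max - min)
X_minmax = MinMaxScaler().fit_transform(X_demo)

# Normalizer works row by row: each sample is divided by its L2 norm
X_unit = Normalizer(norm="l2").fit_transform(X_demo)

print(X_minmax)                        # every column now spans [0, 1]
print(np.linalg.norm(X_unit, axis=1))  # every row now has unit length
```

Min-max scaling matches the [0, 1] range described in the bullets, while row-wise normalization is typically used when only the direction of each sample vector matters.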

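* Binarization: threshold numerical features to get boolean values (values above the threshold map to 1, all others to 0)
* X_binarized = Binarizer(threshold=3).fit_transform(X_train)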


In [None]:
# Binarization (numerical features to boolean values)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold = 3).fit(X_train)
X_binarized = binarizer.transform(X_test)

In [None]:
# Imputation of Missing Values
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # replace NaN with the column mean
imp.fit_transform(X_train)

* Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree
* Interaction features (products of distinct features) are added this way

In [None]:
# Generating polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X_train)

* Log transform helps to handle skewed data
* After the transform, the distribution is closer to normal
* Log transform compresses differences in magnitude
* It also decreases the effect of outliers
* `FunctionTransformer` forwards its input to a user-defined function, here `np.log1p`

In [None]:
# Custom transformers
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)  # log1p(x) = log(1 + x), defined at x = 0
X_transformed = transformer.fit_transform(X_train)
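
The effect on skewed data can be illustrated with synthetic values. This is only a sketch: the log-normal sample and the `skewness` helper are assumptions made for the example, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def skewness(a):
    """Hypothetical helper: third standardized moment of all values in a."""
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# Synthetic right-skewed feature (log-normal), one column
rng = np.random.default_rng(0)
X_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Supplying inverse_func makes the transform reversible
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X_log = log_tf.fit_transform(X_skewed)

print(skewness(X_skewed), skewness(X_log))  # skewness before and after
X_back = log_tf.inverse_transform(X_log)    # recovers the original values
```

`np.log1p` is used rather than `np.log` so that zero values stay finite; negative inputs would still need separate handling.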

* Binning can make the model more robust and help prevent overfitting
* This comes at a cost to performance, because binning discards information within each bin
* Binning can be applied to both categorical and numerical data
* Here, continuous data (features in columns) are binned into intervals


In [None]:
# Binning
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
# Fit the bin edges on the training data only to avoid leaking test information
k_bins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans').fit(X_train)
X_train_binned = k_bins.transform(X_train)
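
To see what the discretizer learns, the sketch below bins a made-up one-column array and prints the learned interval edges. The `ages` data is hypothetical, and `strategy="uniform"` (equal-width bins) is used here instead of `kmeans` only because its edges are easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous feature, one column
ages = np.array([[3.0], [12.0], [25.0], [37.0], [52.0], [68.0], [90.0]])

# Three equal-width bins between the minimum (3) and the maximum (90)
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(ages)

print(binner.bin_edges_[0])  # interval edges learned for the first (only) column
print(age_bins.ravel())      # bin index assigned to each sample
```

With `encode="ordinal"` each value is replaced by its bin index; `encode="onehot"` would instead return one indicator column per bin.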


* One-hot encoding of categorical features
* Use this when attributes are nominal (categories with no inherent order)
* It can take a multidimensional array (one set of output columns per input feature)

In [None]:
# One-Hot Encoder
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder() 
enc.fit(y.reshape(-1,1))  
enc.transform(y.reshape(-1,1)).toarray()

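* `LabelEncoder` maps target labels `y` to integers from 0 to n_classes - 1; for ordinal input features, scikit-learn provides `OrdinalEncoder`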


In [None]:
# Label Encoder 
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y_encoded = enc.fit_transform(y)
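
The following sketch contrasts encoding a target with `LabelEncoder` and encoding an ordered feature with `OrdinalEncoder`. The string labels and the `sizes` column are made-up examples, and the explicit category order passed to `OrdinalEncoder` is an assumption of this illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: 1-D target labels; classes are sorted before numbering
labels = ["versicolor", "setosa", "virginica", "setosa"]
le = LabelEncoder()
y_codes = le.fit_transform(labels)
print(le.classes_, y_codes)
print(le.inverse_transform(y_codes))  # map the integers back to the labels

# OrdinalEncoder: 2-D feature columns, here with an explicit category order
sizes = np.array([["medium"], ["small"], ["large"]])
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = oe.fit_transform(sizes)
print(size_codes.ravel())
```

Without the `categories` argument, `OrdinalEncoder` would number the categories in sorted order, which rarely matches the intended ranking.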

## 2) Model Creation

### Supervised learning

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.
<img src="https://66.media.tumblr.com/ff709fe1c77091952fb3e3e6af91e302/tumblr_inline_o9aa8dYRkB1u37g00_540.png"><br>
<img src="https://66.media.tumblr.com/7f12391977435370c1ddf4945dca0575/tumblr_inline_o9aa9nH3WQ1u37g00_540.png" width="300"><br>
The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.<br>
<img src="https://66.media.tumblr.com/9bffea56372d28d2a30f80557451e824/tumblr_inline_o9aabehtqP1u37g00_540.png">
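
As a rough illustration of the mapping idea, the sketch below compares a linear kernel with an RBF kernel on two concentric rings, which no straight line can separate. The dataset (`make_circles`) and its parameters are assumptions chosen for the example; the RBF kernel is one common way to get such a higher-dimensional mapping implicitly.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, random_state=0)

# A linear kernel looks for a separating line in the original space
linear_acc = SVC(kernel="linear").fit(Xc_train, yc_train).score(Xc_test, yc_test)

# An RBF kernel implicitly works in a higher-dimensional feature space
rbf_acc = SVC(kernel="rbf").fit(Xc_train, yc_train).score(Xc_test, yc_test)

print(linear_acc, rbf_acc)
```

On data like this, the RBF model should score well above the linear one, which stays near chance level.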

https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/

In [None]:
# Supervised Learning Estimators - classification

# Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

# K Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=3)

In [None]:
# Supervised Learning Estimators - regression

# Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# K Nearest Neighbor
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=2)

### Unsupervised learning

In [None]:
# Unsupervised Learning Estimators

# K means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

# PCA
# Reduce the number of attributes while preserving as much information as possible
# Uses Singular Value Decomposition of the data to project it to a lower dimensional space
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X_train)

PCA(n_components=2)

In [None]:
# Model fitting
# Supervised learning (fit returns the fitted estimator itself)
svc.fit(X_train, y_train)
knc.fit(X_train, y_train)
gnb.fit(X_train, y_train)

# Unsupervised learning
k_means.fit(X_train)
X_train_pca = pca.fit_transform(X_train)  # returns the projected data, not a model

In [None]:
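# Prediction
# Supervised learning
y_pred_svc = svc.predict(X_test)
y_pred_lr = lr.predict([[3, 3]])  # lr was fit on 2-feature toy data, so it cannot take X_test
y_pred = knc.predict(X_test)  # reused by the classification metrics below

# Unsupervised learning
cluster_labels = k_means.predict(X_test)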

## 3) Model Performance Evaluation

In [None]:
# Classification Metrics

# Accuracy score
from sklearn.metrics import accuracy_score
y_pred = knc.predict(X_test)  # class predictions from the fitted KNN classifier
knc.score(X_test, y_test)
accuracy_score(y_test, y_pred)

# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
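# Regression Metrics
# Fit the KNN regressor defined above; the iris class index is used as a
# numeric target purely for illustration
knr.fit(X_train, y_train)
y_pred_reg = knr.predict(X_test)

# Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred_reg)

# Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred_reg)

# R^2 score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred_reg)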

In [None]:
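# Clustering Metrics
# Compare the true classes with the cluster labels assigned by the fitted K-means model
labels_pred = k_means.predict(X_test)

# Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test, labels_pred)

# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test, labels_pred)

# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test, labels_pred)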

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" width = "300"><br>
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.<br>

By partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.<br>
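
A minimal sketch of the train/validation/test procedure described above, assuming a 60/20/20 split and a KNN classifier. The candidate values of `k` and the variable names are illustrative choices, not taken from the text.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# First hold out the test set, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Tune the hyperparameter on the validation set only
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_val_acc:
        best_k, best_val_acc = k, acc

# Touch the test set once, after the choice has been made
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
print(len(X_train), len(X_val), len(X_test), best_k, test_acc)
```

Only 90 of the 150 samples remain for training here, which is the sample-size cost that cross-validation is meant to reduce.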

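* k-fold cross-validation: the training set is split into k smaller sets (folds); the following is repeated for each fold, and the k scores are averaged:
1. A model is trained using k-1 of the folds as training data;
2. The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).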
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="500">
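
The two steps above can be written out as an explicit loop. This sketch assumes 5 shuffled folds and a default `SVC`; it then compares the result with `cross_val_score`, which runs the same loop internally when given the same splitter.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Step 1: train on k-1 folds
    model = SVC().fit(X[train_idx], y[train_idx])
    # Step 2: validate on the remaining fold (accuracy for classifiers)
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores.mean(), fold_scores.std())

# The helper function gives the same per-fold scores for the same splitter
auto_scores = cross_val_score(SVC(), X, y, cv=kf)
print(np.allclose(fold_scores, auto_scores))
```

Reporting the mean together with the standard deviation gives a sense of how much the estimate depends on the particular split.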

In [None]:
# cross-validation
# https://scikit-learn.org/stable/modules/cross_validation.html

from sklearn.model_selection import cross_val_score
# cross_val_score clones and refits the estimator on each fold, so no prior fit is needed
scores = cross_val_score(svc, X_train, y_train, cv=5, scoring='f1_macro')
scores.mean(), scores.std()

## 4) Model Tuning

In [None]:
# Grid search
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train) 
# To check the results
clf.cv_results_

In [None]:
# Randomized Parameter Optimization
from sklearn.model_selection import RandomizedSearchCV
params = {'n_neighbors': [2, 3, 4], 'weights': ['uniform', 'distance']}
# Only 3 x 2 = 6 combinations exist, so sample fewer than that
# (n_iter=8 would just warn and try all 6)
clf = RandomizedSearchCV(estimator=knc,
                         param_distributions=params,
                         cv=4,
                         n_iter=5,
                         random_state=5)
clf.fit(X_train, y_train)
clf.cv_results_

In [None]:
class Guy(object):

    # attributes
    name = "Bo"
    work = "Cal Poly"
    home = "California"
    hobby = "Traveling"
    interests = "Machine Learning, Mechatronics"
    
    # methods
    def update_status(self, string):
        print("i'm working on", string, "now!")
        
    def introduce(self):
        print("Hello, this is", self.name)
        print("I'm into", self.interests, "and", self.hobby)

Bo = Guy()
Bo.introduce()
Bo.update_status("deep learning")

Hello, this is Bo
I'm into Machine Learning, Mechatronics and Traveling
i'm working on deep learning now!
