# Scikit-Learn Cheat Sheet
Scikit-learn is a robust and popular library for machine learning based on Python, NumPy, SciPy and Matplotlib. It provides a selection of common tools for machine learning and statistical modeling, for example, classification, regression, clustering and dimensionality reduction. This document summarizes the functions you may use with Scikit-Learn.

## Workflow of a basic example
This section shows the workflow of solving a classification problem using machine learning.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris() #1.Preprocessing
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
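scaler = StandardScaler().fit(X_train) # Fit the scaler on the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test) # Reuse the training mean and std on the test data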
knc = KNeighborsClassifier(n_neighbors=3) #2.Model creation
knc.fit(X_train_std,y_train) #3.Model fitting
y_pred = knc.predict(X_test_std) #4.Prediction
accuracy_score(y_test, y_pred) #5.Performance evaluation

# 1. Data Preprocessing


### Data set
The data set here is a 150x4 numpy.ndarray describing 3 different types of irises (Setosa, Versicolour, and Virginica). The rows are the samples and the columns are Sepal Length, Sepal Width, Petal Length and Petal Width.
* Number of instances: 150 (50 in each of three classes)
* Attributes: sepal length, sepal width, petal length, petal width (cm)
* Classes: Iris-Setosa, Iris-Versicolour, Iris-Virginica

In [None]:
# Import data
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:,:4], iris.target

### Training and test data
`train_test_split` splits arrays or matrices into random train and test subsets. The default proportions are 0.75 (train) and 0.25 (test).

In [None]:
# Split training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
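
As a small check of the default proportions, the sketch below prints the resulting shapes for the iris data. It also shows the `stratify` option, which keeps class proportions equal in both subsets. The `test_size=0.2` value and the `X_tr`/`X_te` names are illustrative choices, not part of the workflow above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# stratify=y keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_tr), np.bincount(y_te))
```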

### Feature Scaling - Standardization
Standardization of datasets is a common requirement for many machine learning estimators. It is useful when:
* The data are approximately normally distributed
* The input features have large differences between their ranges
* Zero mean and unit standard deviation are needed
* Each attribute needs to contribute equally to the analysis
* The effect of outliers should be smaller than with min-max scaling (standardization does not have a bounding range)

In [None]:
# Standardization (z score = (x - μ) / σ)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)
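
The z-score formula can be checked by hand. This is a minimal sketch on a made-up toy array (the values are assumptions for illustration). It applies the training mean and standard deviation to the test row and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, assumed for illustration only
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(X_train)

# Manual z-score using the TRAINING mean and standard deviation
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as used by StandardScaler
manual = (X_test - mu) / sigma

print(np.allclose(manual, scaler.transform(X_test)))
```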

### Feature Scaling - Normalization 
Normalization is the process of scaling individual samples to have unit norm, and it is useful when:
* Distribution of data is unknown.
* Quadratic form (dot-product or any other kernel) is used to quantify the similarity of any pair of samples
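* Features need to be transformed to a similar scale, e.g. [0, 1] (to rescale each feature to a fixed range rather than each sample to unit norm, use `MinMaxScaler`)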

In [None]:
# Normalization (each sample x is rescaled to unit norm: x / ||x||)
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)
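
The two notions of "normalization" are easy to confuse, so the sketch below contrasts them on a toy array (values assumed for illustration). `Normalizer` rescales each row to unit norm, while `MinMaxScaler` maps each column to [0, 1] with (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy data, assumed for illustration only
X_toy = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

# Normalizer: each ROW is divided by its own L2 norm
X_unit = Normalizer(norm="l2").fit_transform(X_toy)
print(np.linalg.norm(X_unit, axis=1))

# MinMaxScaler: each COLUMN is mapped to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_toy)
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```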

### Binarization 
Binarization is a common operation on text count data (presence or absence of a feature). It is also useful for estimators that assume boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

In [None]:
# Binarization (numerical features to boolean values)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold = 3).fit(X_train)
X_binarized = binarizer.transform(X_test)
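
To make the threshold rule concrete, here is a small sketch on a toy array (values assumed for illustration). Values strictly greater than the threshold become 1, and everything else becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data, assumed for illustration only
X_toy = np.array([[1.0, 3.0, 5.0], [3.5, 0.0, 2.9]])

# A value equal to the threshold (3.0) is mapped to 0
X_bin = Binarizer(threshold=3).fit_transform(X_toy)
print(X_bin)
```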

### Imputation
Imputation fills in missing values by inferring them from the known part of the data.
* Univariate imputation fills values in the i-th feature dimension using only the non-missing values in that feature dimension (e.g. impute.SimpleImputer).
* Multivariate imputation uses the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).

In [None]:
# Imputation of Missing Values
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_imputed = imp.fit_transform(X_train)
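
The iris data has no missing values, so the effect is easier to see on a toy array (values assumed for illustration). In this sketch the missing entry is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing entry, assumed for illustration only
X_missing = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X_missing)

# The NaN in column 0 becomes (1 + 7) / 2 = 4
print(X_filled)
```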

### Polynomial Feature Transforms
Input features may interact in many ways. Engineering new features can expose these interactions and may improve model performance.
* Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree
* Add new interaction features

In [None]:
# Generating polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X_train)
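
A single toy sample makes the generated columns easy to read. The sketch below expands the assumed sample [a, b] = [2, 3] to degree 2, then repeats the expansion with `interaction_only=True` to keep only the interaction terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy sample with features a=2, b=3 (assumed for illustration)
X_toy = np.array([[2.0, 3.0]])

# Columns: 1, a, b, a^2, a*b, b^2
X_poly = PolynomialFeatures(degree=2).fit_transform(X_toy)
print(X_poly)

# interaction_only=True drops the pure powers: 1, a, b, a*b
X_inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_toy)
print(X_inter)
```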

### Custom Transformers
`FunctionTransformer` converts an existing Python function into a transformer to assist in data cleaning or processing. The popular log transform used here, `np.log1p`, returns the natural logarithm of one plus the input array, element-wise.
* The log transform helps to handle skewed data
* The distribution becomes closer to normal after the transform
* The log transform reduces differences in magnitude
* It also decreases the effect of outliers
* `FunctionTransformer` forwards its arguments to the user-defined function

In [None]:
# Custom transformers
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)  # log1p(x) = log(1 + x)
X_transformed = transformer.fit_transform(X_train)
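
As a sketch of how the log transform compresses magnitudes, the example below applies `np.log1p` to a toy array (values assumed for illustration). It also supplies `np.expm1` as `inverse_func` so that the original values can be recovered.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy data spanning several orders of magnitude (assumed for illustration)
X_toy = np.array([[0.0, 9.0], [99.0, 999.0]])

log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X_log = log_tf.fit_transform(X_toy)       # log(1 + x), element-wise
X_back = log_tf.inverse_transform(X_log)  # exp(x) - 1 undoes the transform

print(X_log)
print(np.allclose(X_back, X_toy))
```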

### Binning 
Binning (discretization) groups values into a small number of intervals (bins):
* Make the model more robust and prevent overfitting
* Binning can be applied on both categorical and numerical data
* Bin continuous data (features in columns) into intervals
* It has a cost to the performance though



In [None]:
# Binning
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
k_bins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans').fit(X_train)  # Fit on the training data only
X_train_binned = k_bins.transform(X_train)
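
Bin assignment is easiest to follow with equal-width bins. This sketch uses `strategy='uniform'` (a different strategy from the k-means one above, chosen only for readability) on a toy column and prints the learned bin edges.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One toy feature with values 0..5 (assumed for illustration)
X_toy = np.arange(6, dtype=float).reshape(-1, 1)

# Three equal-width bins over [0, 5]
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X_toy)

print(binner.bin_edges_[0])  # edges of the three intervals
print(X_binned.ravel())      # bin index assigned to each value
```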

### One-Hot Encoding
One-hot encoding is one method of converting categorical data into a form that an algorithm can use. A one-hot encoder converts each categorical value into a new column and assigns a binary value of 1 or 0 to those columns, so each category is represented as a binary vector.
* Encodes categorical features
* Use this when attributes are nominal (mutually exclusive and unordered)
* It can take a multidimensional array

In [None]:
# One-Hot Encoder
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder() 
enc.fit(y.reshape(-1,1))  
enc.transform(y.reshape(-1,1)).toarray()
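
The iris target is already numeric, so a toy column of strings shows the encoding more clearly. The colour values below are made up for illustration. The learned categories are sorted, and each row gets a single 1 in the column of its category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy nominal feature (assumed for illustration)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()

print(enc.categories_[0])  # learned categories, sorted
print(onehot)              # one column per category
```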

### Label Encoding
Label Encoding converts labels into a numeric form that machine learning algorithms can work with.
* Use this when attributes are ordinal (in scikit-learn, `LabelEncoder` is intended for the target `y`; `OrdinalEncoder` is the counterpart for input features)


In [None]:
# Label Encoder 
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y_encoded = enc.fit_transform(y)
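
Since the iris target is already an integer array, string labels illustrate the mapping better. In this sketch (label values assumed for illustration) the classes are sorted alphabetically and replaced by their index, and `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy string labels (assumed for illustration)
labels = np.array(["setosa", "virginica", "versicolor", "setosa"])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)
decoded = enc.inverse_transform(encoded)

print(enc.classes_)  # sorted unique labels
print(encoded)       # index of each label in classes_
```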

# 2. Model Creation

### Supervised learning - classification
Supervised learning is an approach in which algorithms are trained on input data together with the "correct" output. The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels. This enables the model to predict results given new input data.<br> Classification is a supervised learning technique used to categorize a set of data into classes.

#### Support Vector Machines (SVM)
SVMs are based on the idea of finding a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points. The idea is that the data are mapped into higher and higher dimensions until a hyperplane can be found that separates them.<br>
<img src="https://github.com/mltrends/Machine-Learning-Cheat-Sheets/blob/main/assets/svm3.png?raw=true" width="400"><br>
SVM uses a kernel function to find a Support Vector Classifier in a higher dimension. Common types of kernel functions are:
1. Linear
2. Polynomial
3. Radial Basis Function (rbf)<br>
A kernel function only calculates the relationship between every pair of points as if they were in the higher-dimensional space; it does not actually perform the transformation. This trick, calculating the high-dimensional relationships without actually transforming the data to the higher dimension, is called the Kernel Trick.
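
The kernel trick can be checked numerically. This minimal sketch assumes a degree-2 polynomial kernel of the form (x·z + 1)², chosen for illustration. It writes out the 6-dimensional feature map `phi` whose dot product the kernel reproduces. The names `poly_kernel` and `phi` are hypothetical helpers, not scikit-learn functions.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (np.dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit 6-D feature map whose dot product equals poly_kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)         # no transformation performed
k_explicit = np.dot(phi(x), phi(z))    # transform first, then dot product
print(k_implicit, k_explicit)
```

Both values agree, although the kernel never constructs the 6-dimensional vectors.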

#### Naive Bayes
Naive Bayes is based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.<br>
$P(A|B) = \frac{P(B|A)*P(A)}{P(B)}$<br>
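
A short numeric example may help with reading the formula. The probabilities below are made-up values for illustration. P(B) is obtained from the law of total probability and then used to compute the posterior P(A|B).

```python
# Made-up probabilities, for illustration only
p_a = 0.01              # prior P(A)
p_b_given_a = 0.90      # likelihood P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```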

#### K nearest neighbor (KNN)
KNN is a lazy learning algorithm which stores all instances corresponding to the training data in n-dimensional space. KNN does not construct a general internal model; it simply stores instances of the training data.<br>
Classification is calculated from a simple majority vote of the k nearest neighbors of each point, so whichever label has most of the neighbors is the label for the new point.<br>
<img src="https://github.com/mltrends/Machine-Learning-Cheat-Sheets/blob/main/assets/knn1.png?raw=true" width="300">
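
The majority vote can be sketched in a few lines of NumPy. The helper `knn_predict` and the toy points below are assumptions for illustration. The helper uses Euclidean distance and is compared against `KNeighborsClassifier` on the same data.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, x_new, k=3):
    """Return the majority label among the k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: three points of class 0 and two of class 1 (assumed)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1])
queries = np.array([[0.5, 0.5], [5.5, 5.0]])

manual = [knn_predict(X_toy, y_toy, q) for q in queries]
sk = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(queries)
print(manual, sk.tolist())
```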

In [None]:
# Supervised Learning Estimators - classification
# 1. Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC()
# 2. Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
# 3. K Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=3)

### Supervised Learning - regression
Regression is a supervised machine learning technique and it is used to predict continuous values.

#### Linear Regression
Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables, respectively). It is mostly used for forecasting and for studying how the response depends linearly on the explanatory variables.

In [None]:
# Supervised Learning Estimators - regression
# 1. Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# 2. K Nearest Neighbor
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=2)

### Unsupervised learning
Unsupervised learning is a type of algorithm which learns patterns from untagged data.

#### K means
A K-means clustering algorithm tries to group similar items in the form of clusters. The number of groups is represented by K. 
1. K points are placed into the object data space representing the initial group of centroids. 
2. Each data point is then assigned to the closest centroid, which reduces the within-cluster variance. 
3. After all points are assigned, the positions of the k centroids are recalculated using the k clusters. 
4. Steps 2 and 3 are repeated until the positions of the centroids no longer move.<br>
In summary, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the clusters as small as possible.
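
The four steps above can be sketched directly in NumPy. This is a simplified illustration (random initialisation from the data points, Euclidean distance), not scikit-learn's `KMeans` implementation. The function name `kmeans` and the toy data are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch following steps 1-4."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (assumed for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X_toy, k=2)
print(labels)
```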

<img src="https://github.com/MLtrends/Machine-Learning-Cheat-Sheets/blob/main/assets/k-means.png?raw=true" width=400>

#### Principal component analysis (PCA)
PCA is a way to bring out strong patterns from large and complex datasets. It finds a way to reduce the dimensions of the data by looking at the eigenvectors of the covariance matrix. The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. <br>
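
The link between PCA and the covariance matrix can be verified on the iris data. In this sketch the eigenvalues of the sample covariance matrix are compared with `explained_variance_`, and the eigenvectors with `components_`. Signs are ignored because an eigenvector is only defined up to its sign.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# Eigen-decomposition of the sample covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(n_components=2).fit(X)

# Eigenvalues match the explained variance of each component
print(np.allclose(eigvals[:2], pca.explained_variance_))
# Eigenvectors match the principal directions up to sign
print(np.allclose(np.abs(eigvecs[:, :2].T), np.abs(pca.components_), atol=1e-6))
```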



In [None]:
# Unsupervised Learning Estimators
# K means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters = 3, random_state= 0)
# PCA
# Reduce number of attributes, while preserving as much info as possible
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X_train)

# 3. Model Fitting

In [None]:
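# Model fitting
# Supervised learning
svc.fit(X_train, y_train) # Fit the SVM classifier to the data
knc.fit(X_train, y_train) # Fit the KNN classifier to the data
gnb.fit(X_train, y_train) # Fit the Naive Bayes classifier to the data
lr.fit(X_train, y_train) # Refit the linear regression on the iris data (class index as target, for illustration)
# Unsupervised learning
k_means.fit(X_train) # Fit the clustering model (no labels needed)
X_train_pca = pca.fit_transform(X_train)  # Fit to the data, and transform it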

# 4. Prediction

In [None]:
# Prediction
# Supervised learning
y_pred_svc = svc.predict(X_test) # Predict class labels
y_pred_lr = lr.predict(X_test) # Predict continuous values (lr must be fitted on data with the same features as X_test)
y_pred = knc.predict(X_test) # Predict class labels
# Unsupervised learning
y_pred_cluster = k_means.predict(X_test) # Predict cluster labels

# 5. Model Performance Evaluation

#### Classification Metrics
* Accuracy score: the fraction of correctly classified samples. In multilabel classification it computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.<br>
* Classification report: a text summary of the precision, recall and F1 score for each class.<br>
* Confusion matrix: it is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. It can provide a better idea of what the classification model is getting right and what types of errors it is making.

In [None]:
# Classification Metrics
y_pred = knc.predict(X_test) # Use the classifier's predictions
# 1. Accuracy score
from sklearn.metrics import accuracy_score
knc.score(X_test,y_test)
accuracy_score(y_test,y_pred)
# 2. Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
# 3. Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
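
The confusion matrix contains enough information to derive the other metrics. This sketch uses made-up labels for illustration. Accuracy is the diagonal sum divided by the total, recall divides the diagonal by the row sums, and precision divides it by the column sums.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(cm)

accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)   # per class: correct / actual members
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted members

print(accuracy, accuracy_score(y_true, y_hat))
```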

#### Regression Metrics:
* MAE: The Mean Absolute Error represents the average of the absolute difference between the actual and predicted values in the dataset. It measures the average magnitude of the residuals.
* MSE: The Mean Squared Error represents the average of the squared difference between the actual and predicted values in the dataset. It is closely related to the variance of the residuals.
* RMSE: The Root Mean Squared Error is the square root of the Mean Squared Error. It is comparable to the standard deviation of the residuals and is expressed in the same units as the target.
* The coefficient of determination or R-squared represents the proportion of the variance in the dependent variable which is explained by the linear regression model.

In [None]:
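# Regression Metrics
import numpy as np
y_pred = lr.fit(X_train, y_train).predict(X_test) # Regression predictions (class index as target, for illustration)
# 1. Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
# 2. Mean Squared Error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
# 3. Root Mean Squared Error
np.sqrt(mse)
# 4. R^2 score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)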
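
Each regression metric is a short formula over the residuals. The sketch below computes them by hand on made-up values (assumed for illustration) and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))                # mean absolute error
mse = np.mean(residuals**2)                     # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
ss_res = np.sum(residuals**2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination

print(mae, mse, rmse, r2)
```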

#### Clustering Metrics

* Rand index: a measure of the percentage of correct decisions made by the algorithm, where each decision is whether a pair of samples belongs to the same cluster. It can be computed using the following formula: <br>
${RI = \frac{TP+TN}{TP+FP+FN+TN}}$<br>
TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.<br>
The adjusted Rand index (ARI) corrects the Rand index for chance agreement:<br>
${ARI = \frac{RI-\mathrm{E}[RI]}{\max(RI)-\mathrm{E}[RI]}}$<br>

* Homogeneity score: Clustering results satisfy homogeneity if all its clusters contain only data points that are members of a single class. This metric is independent of the absolute value of labels. It’s defined as:<br>
${h = 1- \frac{H(Y_{true}|Y_{pred})}{H(Y_{true})}}$ <br>

* V-measure score: The V-measure is defined as the (weighted) harmonic mean of homogeneity h and completeness c of the clustering, where ${c = 1- \frac{H(Y_{pred}|Y_{true})}{H(Y_{pred})}}$. Both these measures can be expressed in terms of the mutual information and entropy measures of information theory.<br>
${v = \frac{(1+\beta) \cdot h \cdot c}{\beta \cdot h+c}}$

In [None]:
# Clustering Metrics
y_pred = k_means.predict(X_test) # Cluster labels for the test samples
# Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test, y_pred)
# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test, y_pred)
# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test, y_pred)
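
Two properties of these metrics can be checked on made-up label vectors (assumed for illustration). With the default beta of 1, the V-measure equals the harmonic mean 2hc / (h + c). The scores also do not depend on the absolute label values, so a clustering with permuted labels still scores 1.

```python
import math

from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Made-up class labels and cluster assignments, for illustration only
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = homogeneity_score(labels_true, labels_pred)
c = completeness_score(labels_true, labels_pred)
v = v_measure_score(labels_true, labels_pred)  # beta defaults to 1
print(h, c, v)
print(math.isclose(v, 2 * h * c / (h + c)))

# Same grouping, different label names: still a perfect score
ari_permuted = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(ari_permuted)
```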

#### Cross validation
Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.


In [None]:
# Cross-validation
from sklearn.model_selection import cross_val_score
# cross_val_score clones and refits the estimator on each fold, so no prior fit is needed
scores = cross_val_score(svc, X, y, cv=5, scoring='f1_macro')
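
To show what happens on each iteration, this sketch writes the cross-validation loop by hand with `StratifiedKFold` and compares it with `cross_val_score`. It uses the default accuracy scoring rather than `f1_macro`. The variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
model = SVC()
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Train a fresh copy on four folds, evaluate on the held-out fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

scores = cross_val_score(model, X, y, cv=cv)
print(np.allclose(manual_scores, scores), scores.mean())
```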

# 6. Model Tuning

* GridSearchCV: Hyperparameter tuning is the process of determining the optimal hyperparameter values for a given model. GridSearchCV automates it by trying all combinations of the supplied values (exhaustive method) and selecting the best one by cross-validation.<br>

* RandomizedSearchCV: A random search samples a randomly-selected subset of n combinations. The randomized search process requires considerably less compute time and often delivers a similar result. By checking enough randomly-chosen combinations on the grid, the search is likely to identify one that is similar to the one that an exhaustive process would have identified.


In [None]:
# Grid search
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train) 
# To check the results
clf.cv_results_

In [None]:
# Randomized Parameter Optimization
from sklearn.model_selection import RandomizedSearchCV
params ={'n_neighbors': [2,3,4], 'weights':['uniform','distance']}
clf = RandomizedSearchCV(estimator=knc,
                        param_distributions=params,
                        cv=4,
                        n_iter=5,  # Must not exceed the 6 combinations in params
                        random_state=5)
clf.fit(X_train, y_train)
clf.cv_results_
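
The difference between the two search strategies comes down to which parameter combinations are tried. As a sketch, `ParameterGrid` enumerates every combination of a grid, while `ParameterSampler` draws a random subset. The grid below mirrors the SVC example, and `n_iter=3` is an arbitrary choice.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}

# Exhaustive: every combination (2 kernels x 2 values of C)
all_combos = list(ParameterGrid(param_grid))
print(len(all_combos))

# Randomized: a random subset of the same combinations
sampled = list(ParameterSampler(param_grid, n_iter=3, random_state=0))
print(sampled)
```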

# 7. Pipeline
A pipeline can be used to assemble several steps that can be cross-validated together while setting different hyperparameters. The parameters of each step are set using the step name and the parameter name, separated by a double underscore ('__').

In [8]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline
# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Making the Pipeline: PCA -> Scaling the data -> Classification
pipe = Pipeline([('pca', PCA()), 
                 ('scaler', StandardScaler()), 
                 ('classifer', DecisionTreeClassifier())])
# Fitting the model
parameters = {'pca__n_components': [2, 3, 4],
              'classifer__max_depth': [5, 10, 20]}
grid = GridSearchCV(pipe, parameters).fit(X_train, y_train)
# Stores the optimum model in best_pipe
best_pipe = grid.best_estimator_
print(best_pipe)
print('Test set score: ' + str(best_pipe.score(X_test,y_test)))

Pipeline(steps=[('pca', PCA(n_components=4)), ('scaler', StandardScaler()),
                ('classifer', DecisionTreeClassifier(max_depth=5))])
Test set score: 0.9736842105263158
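
The `step__parameter` naming also works outside a grid search. This sketch builds a separate small pipeline and sets parameters with `set_params`. The step names `scaler`, `pca` and `tree`, and the scale-then-PCA ordering, are illustrative choices that differ from the pipeline above.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# <step name>__<parameter name> addresses a parameter of one step
pipe.set_params(pca__n_components=2, tree__max_depth=3)
print(pipe.named_steps["pca"].n_components)
print(pipe.get_params()["tree__max_depth"])

# The whole pipeline behaves like a single estimator
pipe.fit(X, y)
predictions = pipe.predict(X)
```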
