
# Scikit-Learn Cheat Sheet
Scikit-learn is a robust and popular Python library for machine learning, built on NumPy, SciPy and Matplotlib. It provides a selection of common tools for machine learning and statistical modeling, for example classification, regression, clustering and dimensionality reduction. This document summarizes the functions you are likely to use with scikit-learn.

## Workflow of a basic example
This section shows the workflow of solving a problem using machine learning.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris() #1.Preprocessing
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
scaler = StandardScaler().fit(X_train) # fit the scaler on the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test) # reuse the training mean and std on the test data
knc = KNeighborsClassifier(n_neighbors=3) #2.Model creation
knc.fit(X_train_std,y_train) #3.Model fitting
y_pred = knc.predict(X_test_std) #4.Prediction
accuracy_score(y_test, y_pred) #5.Performance evaluation
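
A pipeline is one way to keep preprocessing and the model together, so that the scaler is fitted on the training split only. The sketch below assumes the same iris data and KNN settings as the workflow above; the names `pipe` and `pipe_accuracy` are only illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and classification are fitted together on the training split only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)

# score() applies the fitted scaler to X_test, then reports mean accuracy
pipe_accuracy = pipe.score(X_test, y_test)
print(pipe_accuracy)
```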

## 1) Preprocessing


### Data set
The data set here is a 150x4 numpy.ndarray describing 3 different types of irises (Setosa, Versicolour, and Virginica). The rows are the samples and the columns are Sepal Length, Sepal Width, Petal Length and Petal Width.
* Number of Instances: 150 (50 in each of three classes)
* Attributes: sepal length, sepal width, petal length, petal width (cm)
* Class: Iris-Setosa, Iris-Versicolour, Iris-Virginica

In [None]:
# Import data
from sklearn import datasets
iris = datasets.load_iris()
X, y = iris.data[:,:4], iris.target

### Training and test data
train_test_split: this function splits arrays or matrices into random train and test subsets. The default proportions are 0.75 (train) and 0.25 (test).

In [None]:
# Split training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
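
The split proportions and class balance can be controlled explicitly. The following is a small sketch on the iris data; the 80/20 ratio and the variable names with the `_s` suffix are just illustrative choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_te.shape)

# 80/20 split that keeps the class proportions equal in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_tr_s), np.bincount(y_te_s))  # samples per class
```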

### Feature Scaling - Standardization
Standardization of datasets is a common requirement for many machine learning estimators, and it is useful when:
* The data is approximately normally distributed
* Features of the input dataset have large differences between their ranges
* The estimator expects zero mean and unit standard deviation
* Each of the attributes needs to contribute equally to the analysis
* The data contains outliers: standardization has no bounding range, so outliers distort it less than min-max scaling, although they still shift the mean and standard deviation

In [None]:
# Standardization (z score = (x - μ) / σ)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_standardized = scaler.transform(X_train)
X_test_standardized = scaler.transform(X_test)

### Feature Scaling - Normalization 
Normalization is the process of scaling individual samples to have unit norm, and it is useful when:
* Distribution of the data is unknown
* A quadratic form (dot-product or any other kernel) is used to quantify the similarity of any pair of samples
* Only the direction of each sample vector matters, not its magnitude (to rescale each feature to a fixed range such as [0, 1], use MinMaxScaler instead)

In [None]:
# Normalization (each sample x is rescaled to unit L2 norm: x / ||x||)
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)
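
Normalizer and MinMaxScaler are easy to confuse, so here is a small comparison on a made-up 3x2 array (`X_demo` is hypothetical data, not part of the iris example). Normalizer rescales each row to unit norm, while MinMaxScaler rescales each column to [0, 1].

```python
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler

X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [6.0, 8.0]])

# Normalizer works row by row: each sample is divided by its L2 norm
X_unit = Normalizer(norm="l2").fit_transform(X_demo)
print(X_unit)                          # rows: [0.6 0.8], [1. 0.], [0.6 0.8]
print(np.linalg.norm(X_unit, axis=1))  # every row now has norm 1

# MinMaxScaler works column by column: (x - min) / (max - min)
X_minmax = MinMaxScaler().fit_transform(X_demo)
print(X_minmax)                        # rows: [0.4 0.5], [0. 0.], [1. 1.]
```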

### Binarization 
Binarization thresholds numerical features to boolean values. It is a common operation on text count data (presence or absence of a feature), and for estimators that consider boolean random variables (for example, modelled using the Bernoulli distribution in a Bayesian setting).

In [None]:
# Binarization (numerical features to boolean values)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold = 3).fit(X_train)
X_binarized = binarizer.transform(X_test)

### Imputation
Imputation fills in missing values by inferring them from the known part of the data.
* Univariate imputation fills values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer).
* Multivariate imputation uses the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).

In [None]:
# Imputation of Missing Values
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit_transform(X_train)
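
The iris data has no missing values, so the effect of the imputer is easier to see on a tiny array with gaps. `X_missing` below is made-up data used only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])

imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp_demo.fit_transform(X_missing)

print(imp_demo.statistics_)  # per-column means of the observed values: [2. 3.]
print(X_filled)              # each NaN is replaced by its column mean
```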

### Polynomial Feature Transforms
Input features may interact in many ways. Engineering new features can expose these interactions and potentially improve model performance.
* Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree
* Add new interaction features (products of distinct features)

In [None]:
# Generating polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X_train)
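
To see which columns are generated, it helps to transform a single sample by hand. In this sketch the sample `[2, 3]` is arbitrary; for two features a and b, degree 2 produces 1, a, b, a², ab, b².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3

# All terms up to degree 2: 1, a, b, a^2, a*b, b^2
full = PolynomialFeatures(degree=2).fit_transform(X_demo)
print(full)         # [[1. 2. 3. 4. 6. 9.]]

# Interaction terms only: 1, a, b, a*b (no squares)
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo)
print(inter)        # [[1. 2. 3. 6.]]
```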

### Custom Transformers
FunctionTransformer converts an existing Python function into a transformer to assist in data cleaning or processing. The popular log transform used here, np.log1p, returns the natural logarithm of one plus the input array, element-wise.
* The log transform helps to handle skewed data
* The distribution becomes closer to normal after the transform
* The log transform reduces differences in magnitude
* It also decreases the effect of outliers
* FunctionTransformer forwards its input X (and any optional keyword arguments) to the user-defined function

In [None]:
# Custom transformers
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X_transformed = transformer.fit_transform(X_train)

### Binning 
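Binning (discretization) partitions continuous features into a small number of intervals.
* It can make the model more robust and help prevent overfitting
* Binning can be applied to both categorical and numerical data
* KBinsDiscretizer bins continuous data (features in columns) into intervals
* It discards information within each bin, which can cost predictive performance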



In [None]:
# Binning
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
# Fit the bin edges on the training data only
k_bins = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans').fit(X_train)
X_train_binned = k_bins.transform(X_train)

### One-Hot Encoding
One-hot encoding is one method of converting categorical data into a numeric form that an algorithm can use. A one-hot encoder creates a new binary column for each categorical value and assigns 1 or 0 to those columns, so each category is represented as a binary vector.
* Encodes categorical features
* Use this when attributes are nominal (categories with no inherent order)
* It can take a multidimensional array

In [None]:
# One-Hot Encoder
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder() 
enc.fit(y.reshape(-1,1))  
enc.transform(y.reshape(-1,1)).toarray()

### Label Encoding
Label encoding converts labels into integers between 0 and n_classes-1, a numeric form that estimators can work with.
* LabelEncoder is intended for the target labels y; for ordinal input features, use OrdinalEncoder


In [None]:
# Label Encoder 
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y_encoded = enc.fit_transform(y)
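
The iris targets are already integers, so the encoding is easier to follow on string labels. The sketch below uses made-up label and size values; it also shows OrdinalEncoder, which scikit-learn provides for input features and which accepts an explicit category order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: target labels -> integers (classes are sorted alphabetically)
labels = ["setosa", "virginica", "versicolor", "setosa"]
le = LabelEncoder()
codes = le.fit_transform(labels)
print(le.classes_)                  # ['setosa' 'versicolor' 'virginica']
print(codes)                        # [0 2 1 0]
print(le.inverse_transform(codes))  # back to the original strings

# OrdinalEncoder: one feature column with an explicit order small < medium < large
sizes = np.array([["small"], ["large"], ["medium"]])
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = oe.fit_transform(sizes)
print(size_codes.ravel())           # [0. 2. 1.]
```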

## 2) Model Creation

### Supervised learning - classification
Supervised learning is an approach in which a model is trained on input data together with the "correct" output. The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels. This enables the model to predict results given new input data. Classification is a supervised learning task that assigns each sample to one of a set of classes.

In [None]:
# Supervised Learning Estimators - classification

# Support Vector Machines (SVM)
from sklearn.svm import SVC
svc = SVC()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

# K Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=3)

#### Support Vector Machines (SVM)
SVMs are based on the idea of finding a hyperplane in an N-dimensional space (N is the number of features) that separates the classes with the largest possible margin. When the data cannot be separated in the original space, a kernel implicitly maps it into a higher-dimensional space in which a separating hyperplane may exist.<br>
<img src="https://github.com/bomlme/Machine-Learning-Cheat-Sheets/blob/main/assets/svm1.png?raw=true" width="400"><br>
SVM works without any modifications for linearly separable data. Kernelized SVM can be used for non-linearly separable data.<br>
<img src="https://github.com/bomlme/Machine-Learning-Cheat-Sheets/blob/main/assets/svm2.png?raw=true" width="400">
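
The difference between a linear and a kernelized SVM can be checked on data that no straight line can separate. This is a sketch on synthetic concentric rings from `make_circles`; the dataset and its parameters are illustrative assumptions, not part of the iris example.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_c, y_c, random_state=0)

# Same estimator, two kernels
linear_acc = SVC(kernel="linear").fit(Xc_train, yc_train).score(Xc_test, yc_test)
rbf_acc = SVC(kernel="rbf").fit(Xc_train, yc_train).score(Xc_test, yc_test)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:", rbf_acc)
```

The RBF kernel should score clearly higher here, because the rings become separable once the data is implicitly mapped to a higher-dimensional space.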

#### Naive Bayes
Bayes' Theorem is useful when working with conditional probabilities, and it provides us with a way to reverse them:<br>
$P(A|B) = \frac{P(B|A)*P(A)}{P(B)}$<br>
For example, the probability that the sentence "had a great night" is positive can be expressed like this:
$P(positive|had\,a\,great\,night) = \frac{P(had\,a\,great\,night|positive)*P(positive)}{P(had\,a\,great\,night)}$<br>
So we just need to compare:
${P(had\,a\,great\,night|positive)*P(positive)}$ with ${P(had\,a\,great\,night|negative)*P(negative)}$<br>
To simplify the calculations, we need to be Naive: we assume that every word in a sentence is independent of the other ones, given the class.<br>
The denominator ${P(had\,a\,great\,night)}$ is the same for both classes, so it can be dropped from the comparison.<br>
${P(had\,a\,great\,night|positive)}$ becomes ${P(had|positive)*P(a|positive)*P(great|positive)*P(night|positive)}$<br>
Each of these factors can now be estimated from how often the individual words show up in the training data for that class.
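
The word-count idea can be tried with scikit-learn's `MultinomialNB` on bag-of-words counts. The four training sentences below are a hypothetical toy corpus made up for this sketch; note that `CountVectorizer` with its default settings drops one-letter tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, for illustration only
docs = ["had a great night", "great food and great service",
        "had a terrible night", "terrible service"]
sentiment = ["positive", "positive", "negative", "negative"]

vec = CountVectorizer()                 # word counts per sentence
X_counts = vec.fit_transform(docs)

# alpha=1.0 (the default) adds Laplace smoothing so unseen words do not give zero probability
nb = MultinomialNB(alpha=1.0).fit(X_counts, sentiment)

new = vec.transform(["had a great night"])
nb_pred = nb.predict(new)[0]
print(nb_pred)
print(dict(zip(nb.classes_, nb.predict_proba(new)[0].round(3))))
```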

### Supervised Learning - regression


In [None]:
# Supervised Learning Estimators - regression

# Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

# K Nearest Neighbor
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=2)

#### Linear Regression
Linear regression is a linear approach for modelling the relationship between a scalar response (the dependent variable) and one or more explanatory variables (the independent variables). This method is mostly used for forecasting and for quantifying how strongly the response depends linearly on each explanatory variable.

#### K nearest neighbor (KNN)
KNN is a lazy learning algorithm which stores all instances of the training data in n-dimensional space. KNN does not construct a general internal model; it simply stores the training instances and defers computation until prediction.<br>
Classification is calculated from a simple majority vote of the k nearest neighbors of each point, so whichever label most of the neighbors have is the label for the new point.<br>
<img src="https://miro.medium.com/max/1151/0*ItVKiyx2F3ZU8zV5" width="400">
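
The majority vote can be written out in a few lines of NumPy. This is a minimal sketch, not scikit-learn's implementation: `knn_predict` is a hypothetical helper that uses Euclidean distance, and the six training points are made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())           # count labels among neighbors
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.0])))  # 1
```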

### Unsupervised learning

In [None]:
# Unsupervised Learning Estimators

# K means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters = 3, random_state= 0)

# PCA
# Reduce number of attributes, while preserving as much info as possible
# Use Singular Value Decomposition of the data to project it to a lower dimensional space
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X_train)

#### K means
A K-means clustering algorithm tries to group similar items into clusters. The number of clusters is represented by K.
1. K points are placed into the data space as the initial centroids.
2. Each data point is then assigned to the cluster whose centroid is closest, which reduces the within-cluster variance.
3. After all points are assigned, the positions of the K centroids are recalculated as the means of the points in each cluster.
4. Steps 2 and 3 are repeated until the centroids no longer move.<br>
In summary, the K-means algorithm identifies K centroids and allocates every data point to the nearest one, while keeping the within-cluster variance as small as possible.
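
The four steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's `KMeans` (which uses smarter initialization and several restarts); the `kmeans` function and the two synthetic blobs are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X_blobs = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, cluster_labels = kmeans(X_blobs, k=2)
print(centers)
```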

#### Principal component analysis (PCA)
PCA is a way to bring out strong patterns from large and complex datasets. It finds a way to reduce the dimensions of the data by projecting it onto lines drawn through data, starting with the line that goes through the data in the direction of the greatest variance. This is calculated by looking at the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue is the first principal component, the eigenvector with the second largest eigenvalue is the second principal component, etc.
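
The eigenvector description can be checked against scikit-learn directly. The sketch below computes the principal components of the iris data from the covariance matrix and compares them with `PCA`; the variable names are illustrative, and the sign of each component may differ between the two results.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_iris = datasets.load_iris().data
X_centered = X_iris - X_iris.mean(axis=0)

# Eigen-decomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components
X_manual = X_centered @ eigvecs[:, :2]

# Compare with scikit-learn
pca_demo = PCA(n_components=2).fit(X_iris)
X_sklearn = pca_demo.transform(X_iris)
print(eigvals[:2])
print(pca_demo.explained_variance_)
```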

In [None]:
# Model fitting
# Supervised learning (fit returns the fitted estimator itself)
svc.fit(X_train, y_train)
knc.fit(X_train, y_train)
gnb.fit(X_train, y_train)
# Unsupervised learning
k_means.fit(X_train)
X_train_pca = pca.fit_transform(X_train) # fit PCA and return the projected data

In [None]:
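# Prediction
# Supervised learning
y_pred = svc.predict(X_test)
y_pred = knc.predict(X_test)
# lr was fitted above on 2-feature toy data, so refit it on the iris features first
y_pred = lr.fit(X_train, y_train).predict(X_test)
# Unsupervised learning (cluster indices, not class labels)
y_pred = k_means.predict(X_test)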

## 3) Model Performance Evaluation

In [None]:
# Classification Metrics
y_pred = knc.predict(X_test) # predictions of the classifier being evaluated

# Accuracy score
from sklearn.metrics import accuracy_score
knc.score(X_test,y_test)
accuracy_score(y_test,y_pred)

# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

# Confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

In [None]:
# Regression Metrics
# Fit the regressor on the iris features and predict continuous values
y_pred = lr.fit(X_train, y_train).predict(X_test)

# Mean Absolute Error
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

# Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

# R^2 score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

In [None]:
# Clustering Metrics
# Compare cluster assignments with the true labels of the same samples
y_pred = k_means.predict(X_test)

# Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test, y_pred)

# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test, y_pred)

# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test, y_pred)

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" width = "300"><br>
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.<br>


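By partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. Cross-validation addresses this: a test set is still held out for final evaluation, but a separate validation set is no longer needed.<br>

* k-fold cross-validation: the training set is split into k smaller sets (folds). The following procedure is repeated for each of the k folds, and the k scores are averaged at the end:
1. A model is trained using k-1 of the folds as training data;
2. The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy).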
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width="500">

In [None]:
# cross-validation
# https://scikit-learn.org/stable/modules/cross_validation.html

from sklearn.model_selection import cross_val_score
# cross_val_score clones and refits the estimator on each fold, so no prior fit is needed
scores = cross_val_score(svc, X_train, y_train, cv=5, scoring='f1_macro')
scores.mean(), scores.std()
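
The loop that `cross_val_score` runs internally can be written out with `KFold`. This is a sketch under simple assumptions (default `SVC`, accuracy as the score, shuffled folds); the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X_cv, y_cv = datasets.load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_cv):
    # Train on k-1 folds
    model = SVC().fit(X_cv[train_idx], y_cv[train_idx])
    # Validate on the held-out fold (accuracy)
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))

print(np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))
```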

## 4) Model Tuning

In [None]:
# Grid search
from sklearn.model_selection import GridSearchCV
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train) 
# To check the results
clf.cv_results_
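
`cv_results_` is a large dictionary; the attributes `best_params_` and `best_score_` are often enough to summarize a search. The sketch below repeats the same grid on the iris training split with an explicit `cv=5` (an assumption for this example) and then scores the refitted best model on the held-out test set.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_gs, y_gs = datasets.load_iris(return_X_y=True)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(X_gs, y_gs, random_state=0)

search = GridSearchCV(SVC(), {"kernel": ("linear", "rbf"), "C": [1, 10]}, cv=5)
search.fit(Xg_train, yg_train)

print(search.best_params_)              # best combination found on the grid
print(search.best_score_)               # its mean cross-validated accuracy
print(search.score(Xg_test, yg_test))   # refitted best model on the held-out test set
```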

In [None]:
# Randomized Parameter Optimization
from sklearn.model_selection import RandomizedSearchCV
params ={'n_neighbors': [2,3,4], 'weights':['uniform','distance']}
# The grid has only 3 x 2 = 6 combinations, so n_iter must not exceed 6
clf = RandomizedSearchCV(estimator=knc,
                             param_distributions=params,
                             cv=4,
                             n_iter=5,
                             random_state=5)
clf.fit(X_train, y_train)
clf.cv_results_