# What is Sklearn?

**Scikit-learn (Sklearn)** is one of Python's most widely used machine learning libraries. It offers efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written predominantly in Python and builds on NumPy, SciPy, and Matplotlib.

Sklearn's functionality is organized into the following areas:
1. Classification
2. Regression
3. Clustering
4. Dimensionality reduction
5. Model selection
6. Preprocessing

# Sklearn preprocessing Package

**Pre-processing** refers to the transformations applied to the data before it is passed to the algorithm. 
In Python, the scikit-learn library provides this functionality in the sklearn.preprocessing module.

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. This module includes methods for scaling, centering, normalization, and binarization.

### Table of Contents:

* <a href='#install'>Installation</a>
* <a href='#import'>Importing sklearn preprocessing module</a>
* <a href='#std'>1.Standardization</a>
    * <a href='#std1'>1.1 Scaling Features to a range</a>
    * <a href='#std2'>1.2 Scaling Sparse Data </a>
    * <a href='#std3'>1.3 Scaling data with Outliers</a>
    * <a href='#std4'>1.4 Centering Kernel Matrices</a>
* <a href='#norm'>2.Normalization</a>    
* <a href='#non-linear'>3.Non-Linear Transformation</a>
    * <a href='#non-linear1'>3.1 Mapping to Uniform Distribution (Quantile Transforms)</a>
    * <a href='#non-linear2'>3.2 Mapping to Gaussian Distribution (Power Transforms)</a>
* <a href='#encode'>4.Encoding</a>
* <a href='#polynomial'>5.Polynomial Features</a>
    * <a href='#polynomial1'>5.1 Polynomial Features</a>
    * <a href='#polynomial2'>5.2 Spline Transformers</a>
* <a href='#custom'>6.Custom Transformers</a>
* <a href='#ref'>References</a>

<a id='install'></a>
# Installation of Sklearn

If NumPy and SciPy are already installed, the two easiest ways to install scikit-learn are:
1. Using pip</br>
`pip install -U scikit-learn`

2. Using conda</br>
`conda install scikit-learn`

Note: If NumPy and SciPy are not yet installed, install them first using either pip or conda, since both are prerequisites.
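
A quick way to confirm that the installation worked is to import the packages and print their versions. This is only a minimal sanity check; the version numbers printed depend on your environment.

```python
# Sanity check: these imports fail with ImportError if a package is missing.
import numpy
import scipy
import sklearn

print('scikit-learn version:', sklearn.__version__)
print('NumPy version:', numpy.__version__)
print('SciPy version:', scipy.__version__)
```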

<a id='import'></a>
# Importing Packages

In [1]:
import numpy as np
from sklearn import preprocessing as skp

<a id='std'></a>
# 1. Standardization
Many machine learning estimators implemented in scikit-learn require dataset standardization; they may behave badly if the individual features do not more or less look like standard normally distributed data (Gaussian with zero mean and unit variance). Standardization is also referred to as mean removal and variance scaling.

In practice, we often ignore the shape of the distribution and simply center the data by subtracting the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
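
The centering and scaling steps described above can be sketched by hand with NumPy and compared against StandardScaler. The array `X_demo` and the variable names are illustrative choices for this sketch, not part of the library.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

# Center each column by subtracting its mean, then divide by its
# population standard deviation (NumPy's default, ddof=0).
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)
X_manual = (X_demo - col_mean) / col_std

# StandardScaler performs the same two steps per feature.
X_sklearn = StandardScaler().fit_transform(X_demo)
print('Manual and StandardScaler results match:', np.allclose(X_manual, X_sklearn))
```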

For example, many elements of a learning algorithm's objective function may assume that all features are centered around zero and have variance of the same order. If one feature has a variance that is orders of magnitude larger than the others, it may dominate the objective function and prevent the estimator from learning from the other features as expected.

The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform the following operation on an array-like dataset:

In [2]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = skp.StandardScaler().fit(X_train)
print(scaler)

StandardScaler()


In [3]:
print('Mean of Scaler:',scaler.mean_)

Mean of Scaler: [1.         0.         0.33333333]


In [4]:
print('Scale: ',scaler.scale_)

Scale:  [0.81649658 0.81649658 1.24721913]


In [5]:
X_scaled = scaler.transform(X_train)
print('Final scaled output:\n',X_scaled)

Final scaled output:
 [[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]


Scaled data has zero mean and unit variance:

In [6]:
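print('Mean of Scaled data:',X_scaled.mean(axis=0))
# std() returns the standard deviation; a standard deviation of 1 implies unit variance
print('Standard deviation of scaled data:',X_scaled.std(axis=0))

Mean of Scaled data: [0. 0. 0.]
Standard deviation of scaled data: [1. 1. 1.]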


It is possible to disable either centering or scaling by either passing **'with_mean=False'** or **'with_std=False'** to the constructor of StandardScaler.
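
As a small illustration of these two options, the sketch below fits one scaler that only rescales and one that only centers. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

# Scale only: divide by the standard deviation without subtracting the mean.
X_scale_only = StandardScaler(with_mean=False).fit_transform(X_demo)

# Center only: subtract the mean without dividing by the standard deviation.
X_center_only = StandardScaler(with_std=False).fit_transform(X_demo)

print('Scaled only:\n', X_scale_only)
print('Centered only:\n', X_center_only)
```

With with_mean=False the zero entries of the input stay zero, which is why this option matters for sparse data.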

Some methods of standardization:
1. Scaling features to a range
2. Scaling sparse data
3. Scaling data with outliers
4. Centering kernel matrices

<a id='std1'></a>
### 1.1 Scaling features to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. **MinMaxScaler** or **MaxAbsScaler** can be used to achieve this. </br>
</br>
The motivations for using this scaling include robustness to very small standard deviations of features and the preservation of zero entries in sparse data (data in which most entries are zero).</br></br>
For Example:

In [7]:
X_train2 = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = skp.MinMaxScaler()
X_train2_minmax = min_max_scaler.fit_transform(X_train2)
print('Here\'s the resultant matrix using MinMaxScaler:\n', X_train2_minmax)


Here's the resultant matrix using MinMaxScaler:
 [[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]


The same transformer instance can then be applied to additional test data that was not seen during the fit call: the same scaling and shifting operations are applied, so the result is consistent with the transformation performed on the training data:

In [8]:
X_new = np.array([[-3., -1.,  4.]])
X_new_minmax = min_max_scaler.transform(X_new)
print('New matrix using the same transformation as X_train data:\n', X_new_minmax)

New matrix using the same transformation as X_train data:
 [[-1.5         0.          1.66666667]]


Let's inspect the scaler attributes to find out the exact nature of the transformation learned on the training data:

In [9]:
print('Scale of Transformation: ', min_max_scaler.scale_)

print('Minimum of Transformation: ', min_max_scaler.min_)

Scale of Transformation:  [0.5        0.5        0.33333333]
Minimum of Transformation:  [0.         0.5        0.33333333]


If **MinMaxScaler** is given an explicit feature_range=(min, max) the full formula is:</br>

*X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))*

*X_scaled = X_std * (max - min) + min*
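
The formula above can be checked with a short NumPy sketch. The target range (-1, 1) and the names `lo` and `hi` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

lo, hi = -1.0, 1.0  # illustrative feature_range=(min, max)

# Apply the two-step formula by hand, column by column.
X_std = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
X_manual = X_std * (hi - lo) + lo

# Let MinMaxScaler do the same with an explicit feature_range.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_demo)
print('Manual and MinMaxScaler results match:', np.allclose(X_manual, X_sklearn))
```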

**MaxAbsScaler** works in a similar manner, but scales the training data so that it falls within the range [-1, 1] by dividing each feature by its largest absolute value. It is intended for data that is already centered at zero, as well as sparse data.</br>
The same example using MaxAbsScaler:

In [10]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = skp.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print('Training Matrix:\n', X_train_maxabs)

X_new = np.array([[ -3., -1.,  4.]])
X_new_maxabs = max_abs_scaler.transform(X_new)
print('New matrix using trained transformation:\n',X_new_maxabs)

print('Scale:',max_abs_scaler.scale_)


Training Matrix:
 [[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]
New matrix using trained transformation:
 [[-1.5 -1.   2. ]]
Scale: [2. 1. 2.]


<a id='std2'></a>
### 1.2 Scaling Sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.
MaxAbsScaler was specifically designed for scaling sparse data.</br></br>
If the centered data is expected to be small enough, explicitly converting the input to an array using the toarray method of sparse matrices is another option.
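
A minimal sketch of scaling a sparse matrix with MaxAbsScaler follows. The small SciPy CSR matrix is made up for illustration; the point is that the output stays sparse and no zero entry becomes non-zero.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A small sparse matrix whose columns are on very different scales.
X_sparse = sparse.csr_matrix(np.array([[0., 10., 0. ],
                                       [4.,  0., 0. ],
                                       [0., -5., 0.5]]))

X_sparse_scaled = MaxAbsScaler().fit_transform(X_sparse)

print('Still sparse:', sparse.issparse(X_sparse_scaled))
print('Stored non-zero entries before and after:', X_sparse.nnz, X_sparse_scaled.nnz)
print(X_sparse_scaled.toarray())
```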

<a id='std3'></a>
### 1.3 Scaling data with outliers
If data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, we can use RobustScaler as a drop-in replacement instead. It uses more robust estimates for the center and range of data.
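
The sketch below contrasts RobustScaler with StandardScaler on a single made-up feature containing one extreme value. With its default settings, RobustScaler centers on the median and scales by the interquartile range (25th to 75th percentile), so the outlier has little effect on how the remaining values are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme outlier (illustrative data).
X_outlier = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler().fit(X_outlier)
X_robust = robust.transform(X_outlier)
X_standard = StandardScaler().fit_transform(X_outlier)

print('Median used as center:', robust.center_)
print('Interquartile range used as scale:', robust.scale_)
print('RobustScaler output:', X_robust.ravel())
print('StandardScaler output:', X_standard.ravel())
```

In the StandardScaler output, the four ordinary values are squeezed close together because the outlier inflates both the mean and the standard deviation.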

<a id='std4'></a>
### 1.4 Centering Kernel matrices
If we have a kernel matrix of a kernel $K$ that computes a dot product in a feature space (possibly implicitly) defined by a function $\phi$, a KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by $\phi$ followed by the removal of the mean in that space. In other words, KernelCenterer computes the centered Gram matrix associated with a positive semidefinite kernel $K$.
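
For a linear kernel the feature map is the identity, so the effect of KernelCenterer can be verified directly: centering the Gram matrix should give the same result as computing the Gram matrix of mean-centered data. The sketch below assumes that linear kernel; the data are made up.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

X_demo = np.array([[ 1., -2.,  2.],
                   [-2.,  1.,  3.],
                   [ 4.,  1., -2.]])

# Linear kernel: K[i, j] is the dot product of samples i and j.
K = X_demo @ X_demo.T

# Center the kernel matrix without ever forming the centered features.
K_centered = KernelCenterer().fit_transform(K)

# Reference: center the features explicitly, then compute the kernel.
X_centered = X_demo - X_demo.mean(axis=0)
K_expected = X_centered @ X_centered.T

print('Centered kernels match:', np.allclose(K_centered, K_expected))
```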

<a id='norm'></a>
# 2. Normalization
**Normalization** is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.

The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the **l1**, **l2**, or **max norms**:</br>
Note: *L2 normalization is also known as **spatial sign preprocessing**.*

In [11]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = skp.normalize(X, norm='l2')

X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API.

In [12]:
normalizer = skp.Normalizer().fit(X)  # fit does nothing
normalizer.transform([[-1.,  1., 0.]])

array([[-0.70710678,  0.70710678,  0.        ]])

In [13]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
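
The examples above use the l2 norm. As a short sketch, the other two norms accepted by normalize can be compared on the same data; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = [[1., -1.,  2.],
          [2.,  0.,  0.],
          [0.,  1., -1.]]

# l1: the absolute values in each row sum to 1.
X_l1 = normalize(X_demo, norm='l1')

# max: each row is divided by its largest absolute value.
X_max = normalize(X_demo, norm='max')

print('l1-normalized:\n', X_l1)
print('max-normalized:\n', X_max)
```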

<a id='non-linear'></a>
# 3. Non-Linear Transformation
Two types of non-linear transformations are available:
1. Quantile transforms 
2. Power transforms

Both the transforms are based on monotonic transformations of the features, which maintain the rank of the values along each feature.

**Quantile transforms** put all features into the same desired distribution, whereas **Power transforms** are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.

<a id='non-linear1'></a>
### 3.1 Mapping to Uniform Distribution (Quantile Transforms) 
QuantileTransformer provides a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1.

In [14]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

#loading iris data
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
#Quantile transformer
quantile_transformer = skp.QuantileTransformer(random_state=0)

#Fitting
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)

np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]) 



array([4.3, 5.1, 5.8, 6.5, 7.9])

This feature corresponds to the sepal length in cm. Once the quantile transformation is applied, those landmarks closely approach the percentiles previously defined:

In [15]:
print('X_train_Trans:',np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100]))
print('X_test:',np.percentile(X_test[:, 0], [0, 25, 50, 75, 100]))
print('X_test_trans',np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100]))

X_train_Trans: [0.         0.23873874 0.50900901 0.74324324 1.        ]
X_test: [4.4   5.125 5.75  6.175 7.3  ]
X_test_trans [0.01351351 0.25       0.47747748 0.60472973 0.94144144]


<a id='non-linear2'></a>
### 3.2 Mapping to Gaussian Distribution (Power Transforms)
Normality of the features in a dataset is desirable in many modeling scenarios. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.

PowerTransformer presently supports two such power transformations: the Yeo-Johnson and Box-Cox transforms.

In [16]:
pt = skp.PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
print('Matrix:\n',X_lognormal)


print('\nResult after mapping:\n',pt.fit_transform(X_lognormal))


Matrix:
 [[1.28331718 1.18092228 0.84160269]
 [0.94293279 1.60960836 0.3879099 ]
 [1.35235668 0.21715673 1.09977091]]

Result after mapping:
 [[ 0.49024349  0.17881995 -0.1563781 ]
 [-0.05102892  0.58863195 -0.57612414]
 [ 0.69420009 -0.84857822  0.10051454]]


In the above example we set the **standardize option to False**. By default, PowerTransformer applies **zero-mean, unit-variance normalization** to the transformed output.
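
The default behaviour can be sketched as follows. Box-Cox requires strictly positive input, so this sketch uses Yeo-Johnson (the default method) on made-up skewed data that contains negative values; the random seed and array shape are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed data shifted so that some values are negative (illustrative only).
rng = np.random.RandomState(0)
X_skewed = rng.lognormal(size=(200, 2)) - 1.0

# Defaults: method='yeo-johnson', standardize=True.
pt_default = PowerTransformer()
X_gauss = pt_default.fit_transform(X_skewed)

print('Fitted lambdas:', pt_default.lambdas_)
print('Column means after transform:', X_gauss.mean(axis=0))
print('Column standard deviations after transform:', X_gauss.std(axis=0))
```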

It is also possible to map data to a normal distribution using QuantileTransformer by setting **output_distribution='normal'**. Using previous example from the iris dataset:


In [17]:
quantile_transformer = skp.QuantileTransformer(
    output_distribution='normal', random_state=0)
X_trans = quantile_transformer.fit_transform(X)
quantile_transformer.quantiles_



array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       [4.5, 2.3, 1.3, 0.1],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.4, 1.3, 0.2],
       [4.7, 2.4, 1.3, 0.2],
       [4.7, 2.4, 1.3, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [5. , 2.6, 1.4, 0.2],
       [5. , 2.6, 1.4, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5.1, 2.7, 1.5, 0.2],
       [5.1, 2.8, 1.5, 0.2],
       [5.1, 2

In the above example, the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input’s minimum and maximum do not become infinite under the transformation.

<a id='encode'></a>
# 4. Encoding categorical values/features
Often features are not given as continuous values but categorical. For example a person could have features ["student", "teacher"], ["in UNT", "in UTA", "in UT"], ["uses Jupyter Notebook", "uses VSCode", "uses Colab", "uses PyCharm"]. Such features can be efficiently coded as integers, for instance ["student", "in UTA", "uses Jupyter Notebook"] could be expressed as [0, 1, 0] while ["teacher", "in UT", "uses PyCharm"] would be [1, 2, 3].

To convert categorical features to such integer codes, we can use the OrdinalEncoder. This estimator transforms each categorical feature to one new feature of integers (0 to n_categories - 1):

In [18]:
enc = skp.OrdinalEncoder()
X = [['student', 'in UTA', 'uses Jupyter Notebook'], ['teacher', 'in UT', 'uses PyCharm']]
enc.fit(X)

enc.transform([['teacher', 'in UTA', 'uses Jupyter Notebook']])

array([[1., 1., 0.]])
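
To see where these integer codes come from, the sketch below inspects the categories learned for each feature and maps the codes back to labels. When categories are inferred from the data, each feature's categories are sorted, and a value's code is its position in that sorted list. The names `X_cat` and `ord_enc` are illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

X_cat = [['student', 'in UTA', 'uses Jupyter Notebook'],
         ['teacher', 'in UT', 'uses PyCharm']]
ord_enc = OrdinalEncoder().fit(X_cat)

# One sorted list of categories per feature; the index is the integer code.
for feature_categories in ord_enc.categories_:
    print(list(feature_categories))

codes = ord_enc.transform([['teacher', 'in UTA', 'uses Jupyter Notebook']])
decoded = ord_enc.inverse_transform(codes)
print('Codes:', codes)
print('Decoded:', decoded)
```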

By default, OrdinalEncoder will also pass through missing values that are indicated by np.nan.

In [19]:
enc = skp.OrdinalEncoder()
X = [['student'], ['teacher'], [np.nan], ['teacher']]
enc.fit_transform(X)

array([[ 0.],
       [ 1.],
       [nan],
       [ 1.]])

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.

In [20]:
enc = skp.OneHotEncoder()
X = [['student', 'in UTA', 'uses Jupyter Notebook'], ['teacher', 'in UT', 'uses PyCharm']]
enc.fit(X)

enc.transform([['teacher', 'in UT', 'uses Jupyter Notebook'],
               ['student', 'in UTA', 'uses PyCharm']]).toarray()


array([[0., 1., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0., 1.]])

It is possible to specify categories explicitly using the parameter categories. Our dataset has two roles, four possible universities, and four tools:

In [21]:
roles = ['student', 'teacher']
universities = ['in UT', 'in UTA', 'in TAMU', 'in UNT']
tools = ['uses Jupyter Notebook', 'uses PyCharm', 'uses Colab', 'uses VSCode']
enc = skp.OneHotEncoder(categories=[roles, universities, tools])
# Note that some categories of the 2nd and 3rd features
# do not appear in the training data
X = [['teacher', 'in UNT', 'uses PyCharm'], ['student', 'in UTA', 'uses Jupyter Notebook']]
enc.fit(X)
enc.transform([['student', 'in TAMU', 'uses VSCode']]).toarray()

array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 1.]])

It is also possible to encode each feature into n_categories - 1 columns instead of n_categories columns by using the drop parameter. With drop='first', the first category of each feature is dropped, which avoids perfectly collinear columns in the encoded matrix:

In [22]:
X = [['student', 'in UTA', 'uses VSCode'],
     ['teacher', 'in TAMU', 'uses PyCharm']]
drop_enc = skp.OneHotEncoder(drop='first').fit(X)
drop_enc.categories_


drop_enc.transform(X).toarray()

array([[0., 1., 1.],
       [1., 0., 0.]])

<a id='polynomial'></a>
# 5. Generating Polynomial Features
It is frequently advantageous to add complexity to a model by taking nonlinear aspects of the input data into account. We have two options, both of which are based on polynomials:
The first employs pure polynomials, while the second uses splines, which are piecewise polynomials.

<a id='polynomial1'></a>
### 5.1 Polynomial Features
Polynomial features are a simple and common way to obtain high-order and interaction terms of features.

In [23]:
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
print('Matrix:',X)

poly = PolynomialFeatures(2)
poly.fit_transform(X)

Matrix: [[0 1]
 [2 3]
 [4 5]]


array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

The features of X have been transformed from $(X_1, X_2)$ to $(1, X_1, X_2, X_1^2, X_1 X_2, X_2^2)$.

<a id='polynomial2'></a>
### 5.2 Spline Transformer
Another way to add nonlinear terms, instead of pure polynomials of features, is to use the SplineTransformer to generate spline basis functions for each feature. Splines are piecewise polynomials, parametrized by their polynomial degree and the positions of the knots. 

Note: The SplineTransformer treats each feature separately, i.e. it won’t give us interaction terms.

In [24]:
import numpy as np
from sklearn.preprocessing import SplineTransformer
X = np.arange(5).reshape(5, 1)
print('Matrix:',X)

spline = SplineTransformer(degree=2, n_knots=3)
print('Result after Transformation:\n',spline.fit_transform(X))

Matrix: [[0]
 [1]
 [2]
 [3]
 [4]]
Result after Transformation:
 [[0.5   0.5   0.    0.   ]
 [0.125 0.75  0.125 0.   ]
 [0.    0.5   0.5   0.   ]
 [0.    0.125 0.75  0.125]
 [0.    0.    0.5   0.5  ]]


<a id='custom'></a>
# 6. Custom Transformers
To help with data cleansing or processing, we may want to turn an existing Python function into a transformer. FunctionTransformer allows us to create a transformer from any function. For example, to create a pipeline transformer that does a log transformation, do the following:

In [25]:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])

# Since FunctionTransformer is no-op during fit, we can call transform directly
print('Transformed Matrix:\n',transformer.transform(X))

Transformed Matrix:
 [[0.         0.69314718]
 [1.09861229 1.38629436]]
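
FunctionTransformer also accepts an inverse_func, which makes the transformation reversible. As a sketch, pairing np.log1p with np.expm1 lets the original values be recovered; the names `log_tf` and `X_counts` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p going forward, expm1 (its exact inverse) going back.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)
X_counts = np.array([[0, 1], [2, 3]])

X_log = log_tf.fit_transform(X_counts)
X_back = log_tf.inverse_transform(X_log)

print('Transformed:\n', X_log)
print('Round trip recovers the input:', np.allclose(X_back, X_counts))
```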


<a id='ref'></a>
# References
https://www.tutorialspoint.com/ </br>
https://www.google.com/ </br>
https://scikit-learn.org/ </br>
https://pypi.org/