# MinMaxScaler

- Rescaling using MinMaxScaler

**Description:** Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

- `X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`
- `X_scaled = X_std * (max - min) + min`
- where `min, max = feature_range`
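The two formulas can be checked by hand on a small array. This is a minimal NumPy sketch, not scikit-learn's implementation; the toy data and the names `range_min` and `range_max` (standing in for `feature_range`) are assumptions made for illustration.

```python
import numpy as np

# Toy data (hypothetical): 3 samples, 2 features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

range_min, range_max = 0.0, 1.0  # plays the role of feature_range=(0, 1)

# Column-wise min and max, exactly as in the formulas above.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (range_max - range_min) + range_min

print(X_scaled)
```

Each column now runs from 0 to 1 regardless of its original scale, which is why the two features end up with identical scaled values here.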

[Source](https://data-flair.training/blogs/python-ml-data-preprocessing/)

In [1]:
import pandas as pd
import numpy as np

Dataset: [winequality-red.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv)

In [2]:
df = pd.read_csv('winequality-red.csv', delimiter=";")

In [3]:
df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [5]:
df.shape

(1599, 12)

In [6]:
array = df.values

In [7]:
type(array)

numpy.ndarray

In [8]:
array.shape

(1599, 12)

In [9]:
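x = array[:, 0:8]   # first 8 feature columns (fixed acidity ... density)
y = array[:, 11]    # target: quality (column 8 is pH, not the target)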

In [10]:
x.shape

(1599, 8)

In [11]:
from sklearn.preprocessing import MinMaxScaler

In [12]:
scaler = MinMaxScaler(feature_range = (0, 1))
rescaledX = scaler.fit_transform(x)

# the line above combines two steps:
# scaler.fit(x)                    # learn the per-feature min and max
# rescaledX = scaler.transform(x)  # apply the scaling (fit returns the scaler, not the data)
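Keeping the two steps separate matters when new data must be scaled with statistics learned from the training data. The sketch below mimics that split with plain NumPy; the arrays `X_train` and `X_new` are hypothetical, and the code only illustrates the idea, it is not how scikit-learn is implemented internally.

```python
import numpy as np

# Hypothetical training data and unseen data, one feature each.
X_train = np.array([[1.0], [3.0], [5.0]])
X_new = np.array([[2.0], [6.0]])

# "fit": learn the per-feature minimum and range from the training data only.
data_min = X_train.min(axis=0)
data_range = X_train.max(axis=0) - data_min

# "transform": apply the stored statistics to any data set.
X_train_scaled = (X_train - data_min) / data_range
X_new_scaled = (X_new - data_min) / data_range

print(X_train_scaled.ravel())
print(X_new_scaled.ravel())
```

Because the statistics come from the training data, a new value outside the training range (6.0 here) maps outside [0, 1].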

In [13]:
# for viewing only 3 decimals
np.set_printoptions(precision=3)

In [14]:
rescaledX[0:5, :]

array([[0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568],
       [0.283, 0.521, 0.   , 0.116, 0.144, 0.338, 0.216, 0.494],
       [0.283, 0.438, 0.04 , 0.096, 0.134, 0.197, 0.17 , 0.509],
       [0.584, 0.11 , 0.56 , 0.068, 0.105, 0.225, 0.191, 0.582],
       [0.248, 0.397, 0.   , 0.068, 0.107, 0.141, 0.099, 0.568]])

# StandardScaler

**Description:** Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

`z = (x - u) / s`
- where `u` is the mean of the training samples (or zero if `with_mean=False`) and `s` is the standard deviation of the training samples (or one if `with_std=False`).
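The standard score can be reproduced by hand as a quick sanity check. This is a small NumPy sketch on made-up data, assuming the population standard deviation (`ddof=0`, NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy data (hypothetical): 3 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

u = X.mean(axis=0)  # per-feature mean
s = X.std(axis=0)   # per-feature population standard deviation (ddof=0)
Z = (X - u) / s     # z = (x - u) / s, applied column by column

print(Z)
print(Z.mean(axis=0), Z.std(axis=0))
```

After the transformation every column has mean 0 and standard deviation 1, up to floating-point round-off.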

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
scaler = StandardScaler()
rescaled = scaler.fit_transform(x)

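# the line above combines two steps:
# scaler.fit(x)                   # learn the per-feature mean and standard deviation
# rescaled = scaler.transform(x)  # apply the standardization (fit returns the scaler, not the data)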

In [17]:
rescaled[0:5, :]

array([[-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558],
       [-0.299,  1.967, -1.391,  0.043,  0.224,  0.873,  0.624,  0.028],
       [-0.299,  1.297, -1.186, -0.169,  0.096, -0.084,  0.229,  0.134],
       [ 1.655, -1.384,  1.484, -0.453, -0.265,  0.108,  0.412,  0.664],
       [-0.528,  0.962, -1.391, -0.453, -0.244, -0.466, -0.379,  0.558]])

# Normalizer

Normalize samples individually to unit norm.

Each sample (row) with at least one non-zero component is rescaled independently of the other samples so that its norm equals one. The default is the L2 norm.

In [19]:
from sklearn.preprocessing import Normalizer

In [20]:
scaler = Normalizer().fit(x)
normalizedX = scaler.transform(x)

In [21]:
normalizedX[0:5, :]

array([[2.024e-01, 1.914e-02, 0.000e+00, 5.196e-02, 2.079e-03, 3.008e-01,
        9.299e-01, 2.729e-02],
       [1.083e-01, 1.222e-02, 0.000e+00, 3.611e-02, 1.361e-03, 3.472e-01,
        9.306e-01, 1.385e-02],
       [1.377e-01, 1.342e-02, 7.061e-04, 4.060e-02, 1.624e-03, 2.648e-01,
        9.533e-01, 1.760e-02],
       [1.767e-01, 4.416e-03, 8.833e-03, 2.997e-02, 1.183e-03, 2.681e-01,
        9.464e-01, 1.574e-02],
       [2.024e-01, 1.914e-02, 0.000e+00, 5.196e-02, 2.079e-03, 3.008e-01,
        9.299e-01, 2.729e-02]])
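Unlike the scalers above, which work column by column, `Normalizer` works row by row: with the default `norm='l2'`, each sample is divided by its own Euclidean length. A minimal NumPy sketch of that operation on made-up data (not scikit-learn's implementation) could look like this:

```python
import numpy as np

# Toy data (hypothetical): each row is one sample.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# Euclidean (L2) length of every row, kept as a column so it broadcasts.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms

print(X_normalized)
print(np.linalg.norm(X_normalized, axis=1))  # every row now has length 1
```

Rows 1 and 3 point in the same direction, so they normalize to the same vector even though their magnitudes differ.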

# Binarizer

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.
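The thresholding rule amounts to a single comparison. Here is a small NumPy sketch on made-up data; the variable names are illustrative only.

```python
import numpy as np

# Toy data (hypothetical) containing zero, positive and negative values.
X = np.array([[0.0, 1.5, -2.0],
              [3.0, 0.0, 0.2]])

threshold = 0.0

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0.
X_binary = (X > threshold).astype(float)

print(X_binary)
```

Values equal to the threshold map to 0, which is why the zero entries stay 0.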

In [22]:
from sklearn.preprocessing import Binarizer

In [23]:
binarizer = Binarizer(threshold=0.0).fit(x)
binaryX = binarizer.transform(x)

In [24]:
binaryX[0:5, :]

array([[1., 1., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 0., 1., 1., 1., 1., 1.]])

# Scaling

Standardize a dataset along any axis.

Center to the mean and scale component-wise to unit variance.

In [25]:
from sklearn.preprocessing import scale

In [26]:
data_standardized = scale(x)
data_standardized.mean(axis=0)

array([ 3.555e-16,  1.733e-16, -8.887e-17, -1.244e-16,  3.733e-16,
       -6.221e-17,  4.444e-17, -3.473e-14])
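The column means above are not exactly zero, only zero up to floating-point round-off (values around 1e-16). A tolerance-based comparison such as `np.allclose` is a reasonable way to check the result. The sketch below standardizes a few values by hand with NumPy rather than calling `scale`; the small array reuses three columns from the first rows of the table above.

```python
import numpy as np

# Three columns (fixed acidity, volatile acidity, residual sugar)
# from the first rows of the wine table.
X = np.array([[7.4, 0.70, 1.9],
              [7.8, 0.88, 2.6],
              [11.2, 0.28, 1.9]])

# Column-wise standardization (axis=0), done by hand.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = standardized.mean(axis=0)
col_stds = standardized.std(axis=0)

# Compare with a tolerance instead of testing for exact equality.
print(np.allclose(col_means, 0.0))
print(np.allclose(col_stds, 1.0))
```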

# Label Encoding

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

In [27]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [28]:
data = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])

In [29]:
label_encoder = LabelEncoder()
interger_encoded = label_encoder.fit_transform(data)

In [30]:
interger_encoded

array([0, 0, 2, 0, 1, 1, 1, 2, 0, 2, 1])
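The integers follow the sorted order of the distinct labels (`cold` < `hot` < `warm`), not the order in which they first appear. The same mapping can be reproduced with `np.unique`, as a NumPy-only sketch of the idea:

```python
import numpy as np

data = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot',
                 'hot', 'warm', 'cold', 'warm', 'hot'])

# classes: sorted distinct labels; codes: index of each label in classes.
classes, codes = np.unique(data, return_inverse=True)

print(classes)
print(codes)
```

Indexing `classes[codes]` recovers the original labels, which is the operation that inverse transformation performs.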

# OneHotEncoding

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the `sparse_output` parameter).

In [31]:
ohe_encoder = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
vals_encoded = interger_encoded.reshape(len(interger_encoded),1)
ohe_encoded = ohe_encoder.fit_transform(vals_encoded)

In [32]:
print(ohe_encoded)

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
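Each row of the output has a single 1 in the column that matches the integer code. For integer codes this can be sketched by indexing an identity matrix; the snippet below is a NumPy illustration of the encoding, not scikit-learn's implementation.

```python
import numpy as np

# Integer codes as produced by the label encoding step above.
codes = np.array([0, 0, 2, 0, 1, 1, 1, 2, 0, 2, 1])

n_classes = int(codes.max()) + 1

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes)[codes]

print(one_hot[:3])
```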


- **inverse_transform**: transform labels back to the original encoding.

In [33]:
# decode the first row; without an index, np.argmax searches the flattened array
inverted = label_encoder.inverse_transform([np.argmax(ohe_encoded[0, :])])

In [34]:
print(inverted)

['cold']
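To decode every row rather than a single one, take the argmax along axis 1: `np.argmax` without an `axis` argument returns one index into the flattened array. The sketch below uses a small hand-written one-hot array and a `classes` array standing in for the encoder's sorted labels; both are assumptions made for illustration.

```python
import numpy as np

# Sorted labels, in the same order a label encoder would store them.
classes = np.array(['cold', 'hot', 'warm'])

# Hypothetical one-hot rows for 'cold', 'warm', 'hot'.
one_hot = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])

# Position of the 1 in each row gives the integer code of that row.
codes = one_hot.argmax(axis=1)

# Look the codes up in the label array to recover the strings.
labels = classes[codes]

print(codes)
print(labels)
```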


In [35]:
input_classes = ["Havells", "Philips", "Syska", "Everady", "Lloyd"]
label_encoder.fit(input_classes)

# classes_ is sorted, so each label's position is its encoded integer
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Everady --> 0
Havells --> 1
Lloyd --> 2
Philips --> 3
Syska --> 4
