# Preprocessing

* Machine learning (ML) helps to automatically find complex and potentially useful patterns in data.
* Preprocessing refers to the transformations applied to raw data before it is fed to the algorithm, turning it into a clean data set.
* It involves both data engineering and feature engineering. Data engineering converts raw data into prepared data.
* Feature engineering then tunes the prepared data to create the features expected by the ML model.

## Imputation

* Many real-world datasets contain missing values, often encoded as blanks, NaNs or other placeholders.
* Imputation fills in these missing values. The `SimpleImputer` transformer used below replaces each NaN with the mean of its column.

In [30]:

import numpy as np
from sklearn.impute import SimpleImputer

raw_data = np.array([[5.2, np.nan, 3.5],
                     [-2.0, 7.0, -6.2],
                     [-7.4, -5.4, np.nan]])

# Replace each missing value (NaN) with the mean of its column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
data = imp_mean.fit_transform(raw_data)
print("\n Raw data before imputation\n", raw_data)
print("\n Preprocessed data after imputation\n", data)


 Raw data before imputation
 [[ 5.2  nan  3.5]
 [-2.   7.  -6.2]
 [-7.4 -5.4  nan]]

 Preprocessed data after imputation
 [[ 5.2   0.8   3.5 ]
 [-2.    7.   -6.2 ]
 [-7.4  -5.4  -1.35]]
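
To make the mean strategy concrete, the same result can be reproduced with plain NumPy. This is only a sketch of the idea, not how `SimpleImputer` is implemented; the names `col_means` and `filled` are introduced here for illustration.

```python
import numpy as np

raw_data = np.array([[5.2, np.nan, 3.5],
                     [-2.0, 7.0, -6.2],
                     [-7.4, -5.4, np.nan]])

# Column means computed over the observed (non-NaN) entries only
col_means = np.nanmean(raw_data, axis=0)

# Keep observed values; fill each NaN with the mean of its column
filled = np.where(np.isnan(raw_data), col_means, raw_data)
print(filled)
```

The second column has two observed values, 7.0 and -5.4, so its missing entry becomes (7.0 - 5.4) / 2 = 0.8.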


## Binarization

* Binarization converts numerical features into binary (0/1) values by comparing them against a threshold, which can make classifier algorithms more efficient.
* In the example below, every value above 2.0 becomes 1 and every remaining value becomes 0.



In [6]:
import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[5.2, -3, 3.5],
                [-2.0,7.0,-6.2],
                [-7.4,-9.9,-5.4]])

# Binarize data 
binarized = Binarizer(threshold=2.0).transform(data)
print("\nBinarized data:\n", binarized)

# Print mean and standard deviation
print("\nBEFORE:")
print("Mean =", data.mean(axis=0))
print("Std deviation =", data.std(axis=0))


Binarized data:
 [[1. 0. 1.]
 [0. 1. 0.]
 [0. 0. 0.]]

BEFORE:
Mean = [-1.4        -1.96666667 -2.7       ]
Std deviation = [5.16139516 6.93797921 4.39621049]
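
The thresholding rule itself is a single comparison. The following sketch applies it with plain NumPy to the same array; `threshold` and `binarized_manual` are names chosen here for illustration.

```python
import numpy as np

data = np.array([[5.2, -3, 3.5],
                 [-2.0, 7.0, -6.2],
                 [-7.4, -9.9, -5.4]])

threshold = 2.0

# Strictly greater than the threshold -> 1.0, everything else -> 0.0
binarized_manual = (data > threshold).astype(float)
print(binarized_manual)
```

Note that the comparison is strict, so a value exactly equal to the threshold maps to 0.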


## Mean Removal

* Removing the mean is a common preprocessing technique in machine learning.
* Centering each feature on zero removes bias from the features in the feature vectors.
* The `scale` function used below also divides each feature by its standard deviation, so the result has unit variance.




In [7]:
import numpy as np
from sklearn.preprocessing import scale

data = np.array([[5.2, -3, 3.5],
                [-2.0,7.0,-6.2],
                [-7.4,-9.9,-5.4]])
# Remove the mean and scale each feature to unit variance
scaled = scale(data)
print("\nAFTER:")
print("Mean =", scaled.mean(axis=0))
print("Std deviation =", scaled.std(axis=0))


AFTER:
Mean = [-6.93889390e-17 -5.55111512e-17  5.55111512e-17]
Std deviation = [1. 1. 1.]
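
As a cross-check, the centering and scaling can be written out by hand. This is a minimal sketch that assumes the population standard deviation (`ddof=0`, NumPy's default); `scaled_manual` is a name introduced here.

```python
import numpy as np

data = np.array([[5.2, -3, 3.5],
                 [-2.0, 7.0, -6.2],
                 [-7.4, -9.9, -5.4]])

# Per-feature (column) statistics
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
scaled_manual = (data - mean) / std
print("Mean =", scaled_manual.mean(axis=0))
print("Std deviation =", scaled_manual.std(axis=0))
```

The printed means are not exactly zero but tiny values on the order of 1e-17, which is ordinary floating-point rounding.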


## MinMax Scaling

* When features span very different ranges, it becomes important to rescale them so that the ML algorithm trains on a level playing field. Min-max scaling maps each feature linearly onto a fixed range, here [0, 1]:

$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$


In [10]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[5.2, -3, 3.5],
                [-2.0,7.0,-6.2],
                [-7.4,-9.9,-5.4]])
# Min max scaling
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
scaled_minmax = scaler_minmax.fit_transform(data)
print("\nMin max scaled data:\n", scaled_minmax)


Min max scaled data:
 [[1.         0.40828402 1.        ]
 [0.42857143 1.         0.        ]
 [0.         0.         0.08247423]]
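
The min-max formula can be applied directly, column by column. The sketch below assumes the target range is [0, 1]; `col_min`, `col_max` and `scaled_manual` are illustrative names.

```python
import numpy as np

data = np.array([[5.2, -3, 3.5],
                 [-2.0, 7.0, -6.2],
                 [-7.4, -9.9, -5.4]])

# Per-feature (column) minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# (x - min) / (max - min): the smallest value maps to 0, the largest to 1
scaled_manual = (data - col_min) / (col_max - col_min)
print(scaled_manual)
```

This naive version divides by zero when a column is constant, so real code needs a guard for that case.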


## Normalization

* Normalization rescales the values in each feature vector (each row) so that they can be measured on a common scale.
* The most common forms rescale each row so that its norm equals 1:
  * L1 normalization, which refers to Least Absolute Deviations, makes the sum of the absolute values in each row equal to 1.
  * L2 normalization, which refers to least squares, makes the sum of the squares in each row equal to 1.
* In general, L1 normalization is considered more robust than L2 normalization, because squaring gives large values more weight in L2.


In [13]:
import numpy as np
from sklearn.preprocessing import normalize
data = np.array([[5.2, -3, 3.5],
                [-2.0,7.0,-6.2],
                [-7.4,-9.9,-5.4]])
# Normalize data
l1_norm = normalize(data, norm='l1')
l2_norm = normalize(data, norm='l2')
print("\nL1 normalized data:\n", l1_norm)
print("\nL2 normalized data:\n", l2_norm)


L1 normalized data:
 [[ 0.44444444 -0.25641026  0.2991453 ]
 [-0.13157895  0.46052632 -0.40789474]
 [-0.32599119 -0.43612335 -0.23788546]]

L2 normalized data:
 [[ 0.74829827 -0.43171054  0.5036623 ]
 [-0.20915194  0.73203177 -0.648371  ]
 [-0.54863001 -0.73397799 -0.40035163]]
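
Both norms operate on rows, not columns. The sketch below divides each row by its own norm and then checks the defining property; `l1_manual` and `l2_manual` are names introduced for illustration.

```python
import numpy as np

data = np.array([[5.2, -3, 3.5],
                 [-2.0, 7.0, -6.2],
                 [-7.4, -9.9, -5.4]])

# L1: divide each row by the sum of its absolute values
l1_manual = data / np.abs(data).sum(axis=1, keepdims=True)

# L2: divide each row by the square root of its sum of squares
l2_manual = data / np.sqrt((data ** 2).sum(axis=1, keepdims=True))

# Each row now has unit norm
print(np.abs(l1_manual).sum(axis=1))
print((l2_manual ** 2).sum(axis=1))
```

With L1, the absolute values in a row sum to 1, and the signed values generally do not.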


## Standardization

* Feature scaling means bringing all features onto the same scale.
* Standardization rescales each feature to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1:
 $$z = \frac{x-\mu}{\sigma}$$

In [17]:
import numpy as np
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
data = np.array([[5.2, -3, 3.5],
                [-2.0,7.0,-6.2],
                [-7.4,-9.9,-5.4]])
# Standardize the data (zero mean, unit variance per feature)
standard_scaler = std.fit_transform(data)
print("\nStandard Scaler:\n", standard_scaler)
print("\n Mean of Standard Scaler \n",standard_scaler.mean(axis=0))


Standard Scaler:
 [[ 1.27872403 -0.14893866  1.41030554]
 [-0.11624764  1.29240322 -0.79614022]
 [-1.16247639 -1.14346456 -0.61416532]]

 Mean of Standard Scaler 
 [7.40148683e-17 0.00000000e+00 1.11022302e-16]
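
In practice, $\mu$ and $\sigma$ are usually estimated once and then reused for data that arrives later. The sketch below shows that idea in plain NumPy. The sample `new_data` is invented for illustration, and the population standard deviation (`ddof=0`) is assumed.

```python
import numpy as np

train = np.array([[5.2, -3, 3.5],
                  [-2.0, 7.0, -6.2],
                  [-7.4, -9.9, -5.4]])

# Statistics are estimated once, from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

# Hypothetical unseen sample, invented for illustration
new_data = np.array([[0.0, 0.0, 0.0]])

# Apply the training statistics to the new sample
z_new = (new_data - mu) / sigma

# Undo the transformation to recover the original values
restored = z_new * sigma + mu
print(z_new)
print(restored)
```

Reusing the training statistics keeps new samples on the same scale the model was trained on.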



## Label Encoder

* When we perform classification, we usually deal with a lot of categorical labels.
* These labels can be words, numbers or something else. However, ML algorithms expect them to be numbers, so each distinct label is mapped to an integer.

In [14]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample input labels
input_labels = ['rainy', 'cloudy', 'rainy', 'sunny', 'cloudy', 'sunny', 'sunny']

# Create label encoder and fit the labels
encoder = LabelEncoder()
encoder.fit(input_labels)

# Print the mapping 
print("\nLabel mapping:")
for i, item in enumerate(encoder.classes_):
    print(item, '-->', i)



Label mapping:
cloudy --> 0
rainy --> 1
sunny --> 2


In [3]:
# Encode a set of labels using the encoder
test_labels = ['sunny', 'sunny', 'rainy']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))

# Decode a set of values using the encoder
encoded_values = [2, 0, 2, 1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)
print("Decoded labels =", list(decoded_list))


Labels = ['sunny', 'sunny', 'rainy']
Encoded values = [2, 2, 1]

Encoded values = [2, 0, 2, 1]
Decoded labels = ['sunny', 'cloudy', 'sunny', 'rainy']
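
The mapping is simply the sorted list of distinct labels, with each label's position used as its code. The following sketch reproduces it using `np.unique`. It illustrates the idea and is not a description of `LabelEncoder` internals; `classes`, `encoded` and `mapping` are names chosen here.

```python
import numpy as np

input_labels = ['rainy', 'cloudy', 'rainy', 'sunny', 'cloudy', 'sunny', 'sunny']

# Sorted distinct labels, plus the index of each input label in that list
classes, encoded = np.unique(input_labels, return_inverse=True)

# Label -> integer lookup table
mapping = {str(label): idx for idx, label in enumerate(classes)}
print(mapping)
print(encoded.tolist())

# Decoding is plain indexing into the sorted class list
decoded = classes[[2, 0, 2, 1]]
print(decoded.tolist())
```

A label that is absent from the fitted classes has no code. The dictionary lookup above would raise a `KeyError`, and `LabelEncoder.transform` raises a `ValueError` in the same situation.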
