# AI with Python – Data Preparation

# Preprocessing the Data

## Step 1: Importing the useful packages:

In [6]:
import numpy as np 
from sklearn import preprocessing as pp 


## Step 2: Defining sample data:

In [5]:
input_data = np.array([[2.1,-1.9,5.5],
                      [-1.5,2.4,3.5],
                      [0.5,-7.9,5.6],
                      [5.9,2.3,-5.8]])


## Step 3: Applying preprocessing techniques

## Techniques for Data Preprocessing

## 1. Binarization
Binarization is the preprocessing technique used to convert numerical values into 
Boolean (0/1) values. We can use scikit-learn's built-in `Binarizer` to binarize the input 
data, here with 0.5 as the threshold value: 

In [None]:
binarized_data = pp.Binarizer(threshold=0.5).transform(input_data)
binarized_data

Running the code above returns an array of the same shape as the input in which every 
value greater than 0.5 (the threshold) is converted to 1 and every value less than or 
equal to 0.5 is converted to 0.
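
To make the threshold rule concrete, here is a minimal NumPy sketch of the same comparison. The names `threshold` and `binarized_by_hand` are illustrative and not part of the original notebook; the sketch assumes the rule described above, where only values strictly greater than the threshold become 1.

```python
import numpy as np

# Same sample data as in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

threshold = 0.5  # illustrative name, same value as in the Binarizer call

# Values strictly greater than the threshold become 1.0, all others 0.0
binarized_by_hand = (input_data > threshold).astype(float)
print(binarized_by_hand)
```

Note that the value 0.5 in the third row sits exactly on the threshold, so it maps to 0.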

## 2. Mean Removal
Mean removal is another very common preprocessing technique in machine learning. 
It subtracts the mean from each feature so that every feature is centered on zero, 
which removes the bias from the features in the feature vector. 
To apply mean removal to the sample data, we first compute the mean and standard 
deviation of each feature (column) with the Python code shown below. A notebook cell 
displays only its last expression, so the output shown is the standard deviation:

In [11]:
the_mean = input_data.mean(axis=0)
the_standard_deviation = input_data.std(axis=0)
the_mean
the_standard_deviation

array([2.71431391, 4.20022321, 4.69414529])

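Now, the code below standardizes the input data: it removes the mean of each feature and scales each feature to unit standard deviation. As before, only the last expression (the standard deviation of the scaled data) is displayed: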

In [15]:
scaled_data = pp.scale(input_data)
the_mean_1 = scaled_data.mean(axis=0)
the_standard_deviation_1 = scaled_data.std(axis=0)
the_mean_1
the_standard_deviation_1

array([1., 1., 1.])
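
The standardization step can also be written out by hand, which shows what "removing the mean" amounts to numerically. This is a sketch under the assumption that the population standard deviation (NumPy's default, `ddof=0`) is used; the names `col_mean`, `col_std`, and `standardized` are illustrative.

```python
import numpy as np

# Same sample data as in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-feature (column) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
standardized = (input_data - col_mean) / col_std

print(standardized.mean(axis=0))  # approximately 0 for every feature
print(standardized.std(axis=0))   # approximately 1 for every feature
```

The printed means are zero only up to floating-point rounding, so tiny values on the order of 1e-17 are expected rather than exact zeros.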

## 3. Scaling
Scaling is another data preprocessing technique, used to bring the feature vectors into 
a common range. It is needed because different features can span very different ranges 
of values. In other words, scaling matters because we do not want any feature to be 
artificially large or small relative to the others. With the following Python 
code, we can scale our input data, i.e., the feature vectors:

### Min-max scaling

In [17]:
data_scaler_minmax = pp.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
data_scaled_minmax

array([[0.48648649, 0.58252427, 0.99122807],
       [0.        , 1.        , 0.81578947],
       [0.27027027, 0.        , 1.        ],
       [1.        , 0.99029126, 0.        ]])
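
For a target range of (0, 1), min-max scaling maps each feature with (x - min) / (max - min), computed per column. The sketch below reproduces that arithmetic in plain NumPy; the names `col_min`, `col_max`, and `minmax_by_hand` are illustrative, and the code assumes no column is constant (which would make the denominator zero).

```python
import numpy as np

# Same sample data as in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-feature (column) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly onto [0, 1]; assumes col_max > col_min everywhere
minmax_by_hand = (input_data - col_min) / (col_max - col_min)
print(minmax_by_hand)
```

Each column's smallest value lands on 0 and its largest on 1, which matches the zeros and ones in the output above.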

## 4. Normalization
Normalization is another data preprocessing technique, used to rescale each feature 
vector (row) so that the vectors can be measured on a common scale. 
The following are two types of normalization which can be used in machine learning:

### L1 Normalization
It is also referred to as Least Absolute Deviations. This kind of normalization rescales 
the values so that the sum of the absolute values in each row is exactly 1. It can be 
implemented on the input data with the help of the following Python code:

In [18]:
normalized_data = pp.normalize(input_data, norm='l1')
normalized_data

array([[ 0.22105263, -0.2       ,  0.57894737],
       [-0.2027027 ,  0.32432432,  0.47297297],
       [ 0.03571429, -0.56428571,  0.4       ],
       [ 0.42142857,  0.16428571, -0.41428571]])
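
As a quick check of the row-sum property, the same L1 normalization can be sketched by dividing each row by the sum of its absolute values. The names `l1_norms` and `l1_by_hand` are illustrative, and the sketch assumes no row consists entirely of zeros.

```python
import numpy as np

# Same sample data as in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L1 norm of each row: sum of absolute values, kept as a column for broadcasting
l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm (assumes no all-zero row)
l1_by_hand = input_data / l1_norms

# The absolute values in each row now sum to 1
print(np.abs(l1_by_hand).sum(axis=1))
```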

### L2 Normalization
It is also referred to as least squares. This kind of normalization rescales the values so 
that the sum of the squares in each row is exactly 1. It can be implemented on the 
input data with the help of the following Python code:

In [19]:
normalized_data_l2 = pp.normalize(input_data, norm='l2')
normalized_data_l2

array([[ 0.33946114, -0.30713151,  0.88906489],
       [-0.33325106,  0.53320169,  0.7775858 ],
       [ 0.05156558, -0.81473612,  0.57753446],
       [ 0.68706914,  0.26784051, -0.6754239 ]])
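
The L2 case works the same way, except each row is divided by its Euclidean length (the square root of the sum of squares). The following sketch uses the illustrative names `l2_norms` and `l2_by_hand` and again assumes no all-zero row.

```python
import numpy as np

# Same sample data as in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L2 (Euclidean) norm of each row, kept as a column for broadcasting
l2_norms = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

# Divide every row by its own L2 norm (assumes no all-zero row)
l2_by_hand = input_data / l2_norms

# The squared values in each row now sum to 1
print((l2_by_hand ** 2).sum(axis=1))
```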

# Labeling the Data