# Data Preparation

Data often does not arrive in the right format. Before you can feed it to your machine learning models, you have to prepare (clean) it.

## Pre-Processing the Data

### Steps for Data Preprocessing

## Step 1: Import Useful packages

In [1]:
import numpy as np

import sklearn.preprocessing as sp

## Step 2: Getting the data

In [2]:
# Create a sample array to work with
input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4,  4.0], [1.3, 4.3, 2.]])

## Step 3: Applying Pre-Processing Technique

### Techniques for Data Preprocessing 

## Binarization


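This is a preprocessing technique that converts numerical values into Boolean values (1/0, i.e. yes/no). With the default threshold of 0.0 used here, every value greater than 0 becomes 1 and every other value becomes 0.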

In [3]:
binarizer = sp.Binarizer().fit(input_data)

print("Binarized data:")
binarizer.transform(input_data)

Binarized data:


array([[1., 0., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
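The default threshold is 0.0, which is why only the negative entry becomes 0 in the output above. As a minimal sketch, assuming a cut-off of 2.5 chosen purely for illustration, the `threshold` parameter of `Binarizer` moves that boundary, and a plain NumPy comparison should give the same result:

```python
import numpy as np
import sklearn.preprocessing as sp

input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4, 4.0], [1.3, 4.3, 2.0]])

# 2.5 is an illustrative threshold, not a value from the original notebook
binarized = sp.Binarizer(threshold=2.5).fit_transform(input_data)

# Manual equivalent: values strictly greater than the threshold become 1
manual = (input_data > 2.5).astype(float)

print(binarized)
print("matches manual version:", np.array_equal(binarized, manual))
```

Choosing the threshold is a modelling decision, so the right value depends on what "yes" should mean for your feature.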

## Mean Removal

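This is a preprocessing technique where you subtract the mean from every feature (each column of the data) so that every feature then has a mean of zero. The `scale` function used below also divides each feature by its standard deviation, so every feature ends up with a standard deviation of one.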

In [4]:

# We will find the mean and standard deviation of the input_data array and output them.

mean = input_data.mean(axis=0)

std_deviation = input_data.std(axis=0)

print("mean:", str(mean))

print("std_dev:", str(std_deviation))

mean: [2.2        1.43333333 3.06666667]
std_dev: [0.7788881  2.32713462 0.82192187]


In [5]:
# Now we remove the mean and scale each feature to unit standard deviation
scaled_data = sp.scale(input_data)

# scaled_data holds the standardized copy; input_data itself is unchanged
# see proof below
mean = scaled_data.mean(axis=0)

std_deviation = scaled_data.std(axis=0)

print("scaled data mean:", str(mean))

print("scaled data std_deviation:", str(std_deviation))

scaled data mean: [-7.40148683e-17  0.00000000e+00  3.70074342e-16]
scaled data std_deviation: [1. 1. 1.]
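The means printed above are not exactly zero only because of floating-point rounding. To make the operation concrete, here is a minimal sketch that standardizes the same array by hand and compares it with `sp.scale`; the name `manual_scaled` is introduced here only for illustration:

```python
import numpy as np
import sklearn.preprocessing as sp

input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4, 4.0], [1.3, 4.3, 2.0]])

# Per-column (per-feature) mean and population standard deviation
mean = input_data.mean(axis=0)
std_deviation = input_data.std(axis=0)

# Subtract the mean, then divide by the standard deviation
manual_scaled = (input_data - mean) / std_deviation

print("matches sp.scale:", np.allclose(manual_scaled, sp.scale(input_data)))
print("means close to zero:", np.allclose(manual_scaled.mean(axis=0), 0.0))
```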


## Scaling

Scaling (also known as feature scaling) normalizes the range of values in the data set. Features often have very different ranges, with some values much larger or smaller than others, so feature scaling maps every feature onto a common range such as 0 to 1.

In [7]:
#Min Max scaling

# Here we provide the range that the data will be scaled to
data_scaler_minmax = sp.MinMaxScaler(feature_range=(0,1))

# Now we fit the min-max scaler to the input_data array and transform it
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)

# Now we output our scaled data
print("Min Max Scaled Data:", str(data_scaled_minmax))

Min Max Scaled Data: [[0.42105263 0.         0.6       ]
 [1.         0.49122807 1.        ]
 [0.         1.         0.        ]]
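For a target range of 0 to 1, min-max scaling amounts to subtracting each column's minimum and dividing by that column's range. The sketch below applies that formula by hand and checks it against `MinMaxScaler`; the helper names `col_min`, `col_max` and `manual_minmax` are illustrative and not part of the original notebook:

```python
import numpy as np
import sklearn.preprocessing as sp

input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4, 4.0], [1.3, 4.3, 2.0]])

# Column-wise minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# (x - min) / (max - min) maps each column onto [0, 1]
manual_minmax = (input_data - col_min) / (col_max - col_min)

sklearn_minmax = sp.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)
print("matches MinMaxScaler:", np.allclose(manual_minmax, sklearn_minmax))
```

Each column's smallest value maps to 0 and its largest to 1, which is why every column in the output above contains exactly one 0 and one 1.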


## Normalization

Normalization is another preprocessing technique. It is used when you want to measure the feature vectors on a common scale.


There are two common types of normalization in Machine Learning

### L1 Normalization 


This is also known as Least Absolute Deviations. It modifies the values so that the sum of the absolute values in each row is exactly one.

In [8]:
#L1 Normalization implementation 

L1_normalized_data = sp.normalize(input_data, norm='l1')

#output L1 normalized data
print("L1 normalized data:", str(L1_normalized_data))

L1 normalized data: [[ 0.31343284 -0.20895522  0.47761194]
 [ 0.37209302  0.1627907   0.46511628]
 [ 0.17105263  0.56578947  0.26315789]]
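Note that it is the absolute values that sum to one: the first row above contains a negative entry, so its plain sum is smaller than one. A minimal sketch of the same computation by hand, with illustrative names `row_abs_sums` and `manual_l1`:

```python
import numpy as np
import sklearn.preprocessing as sp

input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4, 4.0], [1.3, 4.3, 2.0]])

# Sum of absolute values in each row, kept as a column for broadcasting
row_abs_sums = np.abs(input_data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm
manual_l1 = input_data / row_abs_sums

print("matches sp.normalize:", np.allclose(manual_l1, sp.normalize(input_data, norm='l1')))
print("absolute row sums:", np.abs(manual_l1).sum(axis=1))
```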


### L2 Normalization

This is also known as Least Squares. This kind of normalization modifies the values so that the sum of the squares in each row is exactly one.

In [9]:
#L2 Normalization Implementation
L2_normalized_data = sp.normalize(input_data, norm = 'l2')

#output L2 normalized data
print("L2 Normalized data:", str(L2_normalized_data))

L2 Normalized data: [[ 0.51526955 -0.34351303  0.78517265]
 [ 0.60259486  0.26363525  0.75324358]
 [ 0.26437185  0.87446072  0.40672592]]
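Put differently, each row is divided by its Euclidean length. The sketch below reproduces the result with `np.linalg.norm`; the names `row_norms` and `manual_l2` are illustrative:

```python
import numpy as np
import sklearn.preprocessing as sp

input_data = np.array([[2.1, -1.4, 3.2], [3.2, 1.4, 4.0], [1.3, 4.3, 2.0]])

# Euclidean (L2) length of each row, kept as a column for broadcasting
row_norms = np.linalg.norm(input_data, axis=1, keepdims=True)

# Divide every row by its own length
manual_l2 = input_data / row_norms

print("matches sp.normalize:", np.allclose(manual_l2, sp.normalize(input_data, norm='l2')))
print("sum of squares per row:", (manual_l2 ** 2).sum(axis=1))
```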


# Labelling of Data

In machine learning, data has to be in a certain format before it can be fed to a model. In classification, for instance, the data carries labels, and many machine learning algorithms expect those labels in numeric form. If the labels are not numeric, they need to be converted, and this is what label encoding does.

## Steps Involved in Label Encoding

### Step 1: Importing Packages

In [10]:
import numpy as np

from sklearn import preprocessing as sp

### Step 2: Defining Sample Data

After importing the packages we need to define some labels so that we can create and train our label encoder.

In [11]:
#Sample Input Labels

input_labels = ['dog', 'cat', 'bird', 'insects', 'ape', 'snake', 'lizard']

### Step 3: Creating and Training of Label Encoder Object

In [12]:
# Creating the label encoder

encoder = sp.LabelEncoder()

encoder.fit(input_labels)

LabelEncoder()
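After fitting, the encoder stores the distinct labels in sorted order in its `classes_` attribute, and a label's position in that array is its numeric code. As a small sketch, the mapping can be inspected like this (the dictionary name `label_to_code` is introduced here for illustration):

```python
from sklearn import preprocessing as sp

input_labels = ['dog', 'cat', 'bird', 'insects', 'ape', 'snake', 'lizard']

encoder = sp.LabelEncoder()
encoder.fit(input_labels)

# classes_ is sorted; the index of each label is the code it is encoded to
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}

for label, code in label_to_code.items():
    print(label, "-->", code)
```

Because the codes follow alphabetical order, they carry no meaning beyond identifying the label.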

### Step 4: Checking the performance by encoding a random ordered list

In this step we check the label encoder by having it encode a few test labels. Test it only on labels it was fitted on (in our case dog, cat, etc.). If you pass any other label, `transform` raises a `ValueError` saying that `y` contains previously unseen labels.

In [13]:
test_labels = ['snake', 'dog', 'cat']

encoded_values = encoder.transform(test_labels)

print("Encoded Test Label:")
print(encoded_values)

Encoded Test Label:
[6 3 2]
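Passing a label the encoder was not fitted on raises a `ValueError`. The sketch below triggers that on purpose; `'horse'` is a hypothetical label chosen only because it is absent from the training list:

```python
from sklearn import preprocessing as sp

input_labels = ['dog', 'cat', 'bird', 'insects', 'ape', 'snake', 'lizard']
encoder = sp.LabelEncoder()
encoder.fit(input_labels)

try:
    # 'horse' was never seen during fit, so transform is expected to fail
    encoder.transform(['dog', 'horse'])
    error_message = None
except ValueError as err:
    error_message = str(err)

print("Error raised:", error_message)
```

If unseen labels can occur in your data, one option is to catch this error and decide explicitly how to handle them instead of letting the pipeline stop.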


### Step 5: Checking the Performance by decoding a random set of numbers

Earlier we checked how our label encoder encodes labels; now let's see how it decodes a random set of numbers.

In [14]:
#we provide a set of encoded values to be decoded
encoded_values = [3, 0, 4, 1]

#Now we decode the list
decoded_list = encoder.inverse_transform(encoded_values)

print("Decoded_Labels:")
print(decoded_list)

Decoded_Labels:
['dog' 'ape' 'insects' 'bird']
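Encoding and decoding are inverse operations, so a quick sanity check is to encode the original labels and decode them again. A minimal sketch, with illustrative variable names `codes` and `round_trip`:

```python
from sklearn import preprocessing as sp

input_labels = ['dog', 'cat', 'bird', 'insects', 'ape', 'snake', 'lizard']
encoder = sp.LabelEncoder()
encoder.fit(input_labels)

# Encode every training label, then decode the resulting codes
codes = encoder.transform(input_labels)
round_trip = [str(label) for label in encoder.inverse_transform(codes)]

print("codes:", codes)
print("round trip matches:", round_trip == input_labels)
```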


## Labelled vs Unlabelled Data

Unlabelled data consists mostly of samples of natural or human-created objects that can easily be obtained from the world. Examples include audio, video, photos and news articles.

Labelled data is simply data with a tag on it. A picture on its own would usually be unlabelled data, but it becomes labelled once it carries a tag describing what the picture shows.