# **LAB 25-08-2021**

In this tutorial, we continue with data preprocessing and move on to its second topic: data transformation.

We will be working with the dataset named 'Tutorial3' (the file tutorial3.csv), which is attached to the file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('tutorial3.csv', encoding = 'unicode_escape')

In [None]:
df.head(10)

Unnamed: 0,feat1,feat2,feat3,feat4,Output
0,103.492233,11.178533,1177.929685,E,-33.738798
1,105.583257,11.271211,1167.340693,E,-20.015269
2,102.362544,11.02237,1152.2741,D,-121.200909
3,104.179562,11.168655,1205.196633,D,-17.665275
4,107.74884,11.263845,1213.642301,D,29.058837
5,108.92001,11.22355,1135.5457,G,-21.635042
6,104.300347,11.447954,1327.340492,E,205.902826
7,106.874841,10.907883,1223.350902,E,-11.400533
8,104.119113,10.671402,1172.109662,E,-143.765921
9,102.896386,10.627526,1275.573816,G,12.346511


This is the dataset we will work with.



---



---



# DATA TRANSFORMATION

After data cleaning is complete, we move on to the next step of data preprocessing: data transformation. Here we convert the raw data obtained from the dataset into a more useful and efficient form.

# NORMALIZATION

If we look at the dataset we are working with today, we see that three features (feat1, feat2 and feat3) are numeric, stored as floating-point values, and one (feat4) is of character type.

Whenever we get a new dataset, our first step should be to check its size and its summary statistics.

In [None]:
df.shape

(500, 5)

In [None]:
df.describe()

Unnamed: 0,feat1,feat2,feat3,Output
count,500.0,500.0,500.0,500.0
mean,105.032854,10.952021,1241.186038,0.166738
std,1.771904,0.339066,76.437957,117.199198
min,100.0,10.0,1000.0,-403.693521
25%,103.835309,10.694112,1190.368415,-74.842437
50%,105.039326,10.987295,1240.385493,-4.272663
75%,106.349374,11.185465,1292.071193,73.447639
max,110.0,12.0,1500.0,377.698893


I have purposely chosen a dataset with no NaN values to only focus on the data transformation part.
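
If you want to confirm the absence of NaN values yourself, a minimal sketch looks like the following. The frame `toy` and its values are made up for illustration and are not part of tutorial3.csv; on the real data you would call the same methods on `df`.

```python
import pandas as pd

# Made-up frame with one missing value in each column (not the tutorial data)
toy = pd.DataFrame({
    "a": [1.0, float("nan"), 3.0],
    "b": ["x", "y", None],
})

# isna() marks missing entries; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if at least one value anywhere in the frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

For a dataset without missing values, every count would be 0 and `has_missing` would be False.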

Let us take a closer look at the means of the features. They are roughly 105, 11 and 1241 for feat1, feat2 and feat3 respectively, so the features sit on very different scales (on the order of 100, 10 and 1000).

When applying machine learning algorithms, we usually prefer all features to lie in a similar range. This is why we normalize the data: every feature is rescaled to a small range that is nearly the same for all features.

For this tutorial, we will be using **Linear Scaling** (also known as min-max scaling). In this type of normalization, we replace every value in a column using the following formula:

**val = (val - min)/(max - min)**

where 

val = values in column

max = maximum value in column

min = minimum value in column
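
As a quick check of the formula, here is the arithmetic for a single value. The numbers are read from the tables in this tutorial (the first feat1 value, and the min and max of feat1 from `df.describe()`); the variable names are only illustrative.

```python
# First feat1 value from df.head(), min and max of feat1 from df.describe()
val = 103.492233
col_min = 100.0
col_max = 110.0

# Linear (min-max) scaling of one value
scaled = (val - col_min) / (col_max - col_min)
print(round(scaled, 6))
```

The result agrees with the scaled feat1 value shown for row 0 once the column is transformed.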


There is another method known as **Z-score** normalization, which uses the following formula:

**val = (val - mean)/std**

where

val = values in column

mean = mean value of column

std = standard deviation of column
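
We will not use Z-score in the rest of this tutorial, but a minimal sketch of it is given here for comparison. The column `x` and its values are made up and are not part of tutorial3.csv.

```python
import pandas as pd

# Made-up column (not the tutorial data)
toy = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Note: pandas' std() computes the sample standard deviation (ddof=1) by default
toy["x_z"] = (toy["x"] - toy["x"].mean()) / toy["x"].std()

# After the transformation the column has mean ~0 and standard deviation ~1
print(toy["x_z"].mean())
print(toy["x_z"].std())
```

Unlike Linear Scaling, the Z-score does not force the values into a fixed interval; it centres them around 0 instead.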

Now let us implement Linear Scaling :

In [None]:
df['feat1'] = (df['feat1'] - df['feat1'].min())/(df['feat1'].max()-df['feat1'].min())
df.head()

Unnamed: 0,feat1,feat2,feat3,feat4,Output
0,0.349223,11.178533,1177.929685,E,-33.738798
1,0.558326,11.271211,1167.340693,E,-20.015269
2,0.236254,11.02237,1152.2741,D,-121.200909
3,0.417956,11.168655,1205.196633,D,-17.665275
4,0.774884,11.263845,1213.642301,D,29.058837


In [None]:
df['feat2'] = (df['feat2'] - df['feat2'].min())/(df['feat2'].max()-df['feat2'].min())
df.head()

Unnamed: 0,feat1,feat2,feat3,feat4,Output
0,0.349223,0.589267,1177.929685,E,-33.738798
1,0.558326,0.635606,1167.340693,E,-20.015269
2,0.236254,0.511185,1152.2741,D,-121.200909
3,0.417956,0.584327,1205.196633,D,-17.665275
4,0.774884,0.631923,1213.642301,D,29.058837


In [None]:
df['feat3'] = (df['feat3'] - df['feat3'].min())/(df['feat3'].max()-df['feat3'].min())
df.head()

Unnamed: 0,feat1,feat2,feat3,feat4,Output
0,0.349223,0.589267,0.355859,E,-33.738798
1,0.558326,0.635606,0.334681,E,-20.015269
2,0.236254,0.511185,0.304548,D,-121.200909
3,0.417956,0.584327,0.410393,D,-17.665275
4,0.774884,0.631923,0.427285,D,29.058837


Now we see that all the features lie in the same range [0, 1]. Having the features on a common scale often helps scale-sensitive ML algorithms (for example, gradient-based or distance-based methods) train faster and perform better.
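
The three cells above repeat the same expression once per column. As a sketch, the same scaling can be applied to several columns in one step. The frame `toy`, its column names and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame standing in for the tutorial data
toy = pd.DataFrame({
    "a": [100.0, 105.0, 110.0],
    "b": [10.0, 11.0, 12.0],
    "label": ["D", "E", "G"],
})

# Scale only the numeric columns; the character column is left untouched
num_cols = ["a", "b"]

col_min = toy[num_cols].min()  # one minimum per column
col_max = toy[num_cols].max()  # one maximum per column

# pandas aligns the per-column min/max with the matching columns
toy[num_cols] = (toy[num_cols] - col_min) / (col_max - col_min)
print(toy)
```

This assumes no column is constant: if max equals min for a column, the division by zero produces NaN values.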

In [None]:
df.describe()

Unnamed: 0,feat1,feat2,feat3,Output
count,500.0,500.0,500.0,500.0
mean,0.503285,0.47601,0.482372,0.166738
std,0.17719,0.169533,0.152876,117.199198
min,0.0,0.0,0.0,-403.693521
25%,0.383531,0.347056,0.380737,-74.842437
50%,0.503933,0.493647,0.480771,-4.272663
75%,0.634937,0.592732,0.584142,73.447639
max,1.0,1.0,1.0,377.698893


# CATEGORICAL DATA

Now let us take a look at feat4. This column is categorical, with values ranging from A to G. Many machine learning algorithms cannot work with categorical data directly, so we should convert such data to integer type as well.

For this purpose, we will use **Label Encoder**. A Label Encoder assigns an integer code (0, 1, 2, ...) to each distinct category. It converts the categorical data to a numerical type, which is easier to deal with from a machine learning perspective.

Let us learn about a new library which is essential to machine learning : **sklearn**

Scikit-learn is a free, open-source machine learning library for the Python programming language.

LabelEncoder is a class available in the preprocessing module of the sklearn library. Let us see what it does :

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder_feat4 = LabelEncoder()
df['feat4'] = labelencoder_feat4.fit_transform(df['feat4'])
df.head()

Unnamed: 0,feat1,feat2,feat3,feat4,Output
0,0.349223,0.589267,0.355859,4,-33.738798
1,0.558326,0.635606,0.334681,4,-20.015269
2,0.236254,0.511185,0.304548,3,-121.200909
3,0.417956,0.584327,0.410393,3,-17.665275
4,0.774884,0.631923,0.427285,3,29.058837


Let us look at the description of the dataset :

In [None]:
df.describe()

Unnamed: 0,feat1,feat2,feat3,feat4,Output
count,500.0,500.0,500.0,500.0,500.0
mean,0.503285,0.47601,0.482372,3.402,0.166738
std,0.17719,0.169533,0.152876,1.162367,117.199198
min,0.0,0.0,0.0,0.0,-403.693521
25%,0.383531,0.347056,0.380737,3.0,-74.842437
50%,0.503933,0.493647,0.480771,3.0,-4.272663
75%,0.634937,0.592732,0.584142,4.0,73.447639
max,1.0,1.0,1.0,6.0,377.698893


We see that the minimum value for feat4 is 0 and the maximum is 6. This matches the 7 different categories, from A to G, that feat4 previously contained. Label Encoder converts each category to a numerical value, assigned in sorted order (A becomes 0 and G becomes 6). Keep in mind that these integers impose an order on the categories, which some algorithms may treat as meaningful.

There are many other possible data transformations, but which ones are useful depends heavily on the dataset. Not everything is useful everywhere.
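
To see which integer was assigned to which category, a fitted encoder can be inspected through its `classes_` attribute, and `inverse_transform` maps the codes back to the original labels. The sketch below uses a made-up list of labels, not the tutorial data.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels (not the tutorial data)
labels = ["E", "D", "G", "A", "E"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the distinct labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = [str(label) for label in encoder.inverse_transform(codes)]
print(restored)
```

Note that only the categories present in the data receive a code, so here G is mapped to 3 rather than 6.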


That is all for tutorial 3. Now you can move on to the assignment.



---

---

