# <b>Feature Engineering</b>

## <b>Encoding of Categorical Variables</b>

Let's first see what encoding is.

#### **What Is Encoding?**

Encoding transforms categorical variables into numerical features.

For example, we have an attribute called gender in a data set where values are male and female. In this case, the numerical encoded version of the values will be 1 for male, 0 for female, or vice versa.
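The gender example can be sketched with a plain dictionary mapping. This is a minimal illustration on made-up data; the dataframe `people`, the column names, and the choice of 1 for male and 0 for female are assumptions for the example only.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
people = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Map each category to an integer (the choice of 1 and 0 is arbitrary).
gender_codes = {"male": 1, "female": 0}
people["gender_encoded"] = people["gender"].map(gender_codes)

print(people)
```

Swapping the two codes would work equally well, since the numbers only serve as identifiers for the categories.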


Since different kinds of categorical variables capture different amounts of information, we need different techniques to encode them.

### **Label Encoding**

Label encoding is a handy technique to encode categorical variables: it replaces each category with a unique integer.

## Techniques Used for Encoding Variables
Two widely used techniques encode categorical variables: label encoding and one-hot encoding.


### A few libraries are required to encode variables:
* Pandas - It helps to retrieve datasets, handle missing data, and perform data wrangling.
* NumPy - It helps to perform numerical operations on the dataset.
* sklearn.preprocessing - It helps in data transformation.

In [1]:
#Select the cell and click on run icon to import libraries. 

import pandas as pd
import numpy as np

# Import label encoder 
from sklearn import preprocessing 

In [24]:
#Select the cell and click on run icon to retrieve and view the dataset.
# Import dataset 

iris_df = pd.read_csv('Iris.csv') 
iris_df.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [3]:
iris_df.tail()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [4]:
iris_df.shape

(150, 6)


**Note 2:**  
The **info()** method summarizes the dataset: the column names, the number of non-null values in each column, and the data types.

In [5]:
#Select the cell and click on run icon 
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


**Note 3:**

The **LabelEncoder** class is used to convert a categorical variable into integer codes. In our dataset, the variable **`Species`** is categorical, so we convert **`Species`** into a numerical type.

In [6]:
#Select the cell and click on run icon to define LabelEncoder 
label_encoder = preprocessing.LabelEncoder() 

**Note 4:**

The **unique()** function is used to identify the distinct values present in the **`Species`** column of the **`iris_df`** dataframe.

In [7]:
#Select the cell and click on run icon
iris_df['Species'].unique() 

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
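As a small self-contained sketch of what `unique()` returns, the snippet below uses a hypothetical series (not the Iris file) and also calls `nunique()` to count the distinct values.

```python
import pandas as pd

# Hypothetical sample with repeated category values.
species = pd.Series([
    "Iris-setosa", "Iris-setosa", "Iris-virginica",
    "Iris-versicolor", "Iris-virginica",
])

distinct = species.unique()     # distinct values, in order of first appearance
n_distinct = species.nunique()  # number of distinct values

print(distinct, n_distinct)
```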

In [26]:
# i2 = iris_df[iris_df['Species'] == 'Iris-setosa']
# print(i2)

**Observations from the above output:**
>The variable **`Species`** has three categories to be encoded: *'Iris-setosa'*, *'Iris-versicolor'*, and *'Iris-virginica'*.

**Note 5:**

* The **fit_transform()** method of the label encoder learns the distinct categories in the column and replaces each category with an integer code (0, 1, 2, ...), assigned in the sorted order of the categories.
* The **head()** function helps to view the first few rows of the **`iris_df`** dataframe.

In [8]:
#Select the cell and click on run icon
iris_df['Species']= label_encoder.fit_transform(iris_df['Species']) 
iris_df.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,0
1,2,4.9,3.0,1.4,0.2,0
2,3,4.7,3.2,1.3,0.2,0
3,4,4.6,3.1,1.5,0.2,0
4,5,5.0,3.6,1.4,0.2,0


In [9]:
iris_df.tail()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,2
146,147,6.3,2.5,5.0,1.9,2
147,148,6.5,3.0,5.2,2.0,2
148,149,6.2,3.4,5.4,2.3,2
149,150,5.9,3.0,5.1,1.8,2


**Observations from the above output:**
>The variable **`Species`** is converted into a numerical variable with the values 0, 1, and 2, one for each of the three categories.
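To check which integer stands for which category, the fitted encoder can be inspected. The sketch below uses a short hypothetical list of labels rather than the full dataset; `classes_` and `inverse_transform()` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Small hypothetical sample using the same three species labels.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # learn the classes, then encode

# classes_ holds the sorted categories; a category's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is useful because the same mapping can later be applied to new data or reversed for reporting.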

## **One Hot Encoding**


One-hot encoding creates a separate binary column for each category. In the example below, you can see that each category of Variable X is encoded into a separate column: Variable X_Blue, Variable X_Yellow, and Variable X_Red.

![one_hot](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/0.4_Feature_Engineering/Trainer_PPT_and_IPYNB/one_hot.png)
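The idea in the figure can be reproduced with a few rows of made-up data. This is only a sketch: the dataframe `df` and its values are hypothetical, and `dtype=int` is passed so that the result shows 0/1 regardless of the pandas version (some versions return True/False by default).

```python
import pandas as pd

# Hypothetical data mirroring the figure: one column with three colours.
df = pd.DataFrame({"Variable X": ["Blue", "Yellow", "Red", "Blue"]})

# dtype=int requests 0/1 columns instead of boolean columns.
encoded = pd.get_dummies(df, columns=["Variable X"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the new columns, which marks the category that row belongs to.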

A few libraries and functions can be used to perform **One Hot Encoding**:

* **datasets**: It helps to load the datasets from the sklearn library.

* **OneHotEncoder**: It helps to encode categorical variables.

* **get_dummies()**: This pandas function is the one used in the cells below.
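As a complement, here is a minimal sketch of how the `OneHotEncoder` class from `sklearn.preprocessing` might be used. The sample array is hypothetical, and the default settings are assumed, under which the encoder returns a sparse matrix that `toarray()` converts to a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample; OneHotEncoder expects a 2-D input (rows x columns).
species = np.array([
    ["Iris-setosa"],
    ["Iris-versicolor"],
    ["Iris-virginica"],
    ["Iris-setosa"],
])

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() makes it dense.
one_hot = encoder.fit_transform(species).toarray()

print(encoder.categories_[0])  # column order of the encoded output
print(one_hot)
```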

In [49]:
iris_df = pd.read_csv('Iris.csv') 
iris_df.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [50]:
iris_df.iloc[0:151,:]

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [44]:
x = iris_df.iloc[:,1:5]   # all rows, columns at positions 1 to 4 (the end index 5 is excluded): independent variables
y = iris_df.Species       # Dependent var
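The slice `1:5` in `iloc` is end-exclusive, which is why `Id` (position 0) and `Species` (position 5) are left out. A small sketch on a hypothetical two-row frame with the same column layout illustrates this:

```python
import pandas as pd

# Hypothetical two-row frame with the same column layout as Iris.csv.
df = pd.DataFrame({
    "Id": [1, 2],
    "SepalLengthCm": [5.1, 4.9],
    "SepalWidthCm": [3.5, 3.0],
    "PetalLengthCm": [1.4, 1.4],
    "PetalWidthCm": [0.2, 0.2],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

# Positions 1, 2, 3, and 4 are selected; position 5 (Species) is excluded.
features = df.iloc[:, 1:5]

print(list(features.columns))
```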

In [45]:
x

,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [48]:
y

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

In [47]:
#One-Hot encoding the categorical parameters using get_dummies()
y1 = pd.get_dummies(y)
print(y1)

     Iris-setosa  Iris-versicolor  Iris-virginica
0              1                0               0
1              1                0               0
2              1                0               0
3              1                0               0
4              1                0               0
..           ...              ...             ...
145            0                0               1
146            0                0               1
147            0                0               1
148            0                0               1
149            0                0               1

[150 rows x 3 columns]


In [34]:
y1.head()

,Iris-setosa,Iris-versicolor,Iris-virginica
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0


In [35]:
y1.tail()

,Iris-setosa,Iris-versicolor,Iris-virginica
145,0,0,1
146,0,0,1
147,0,0,1
148,0,0,1
149,0,0,1


In [36]:
y1.iloc[51:56,:]

,Iris-setosa,Iris-versicolor,Iris-virginica
51,0,1,0
52,0,1,0
53,0,1,0
54,0,1,0
55,0,1,0
