# Feature Engineering 

So far we have mostly worked with datasets made up of numerical values. A continuous feature column usually holds floating-point numbers. However, we sometimes encounter categorical (discrete) features, which are represented by text rather than by numbers. In classification problems the label (target/output) is usually categorical, whereas in regression problems the label is continuous.
Whatever types of features a dataset contains, how we represent them has a large effect on the performance of machine learning models. The process of choosing how to represent the data for a particular application is known as feature engineering. It plays a major role whenever an ML engineer or data scientist works with a real-world dataset.

## Categorical Variables

Machine learning algorithms generally work only with numerical inputs, so raw data such as text, audio, video and images must be converted into a numerical representation. There are three types of categorical variables: binary, nominal and ordinal. Binary variables represent yes/no, true/false or 1/0 outcomes, for example heads/tails in a coin flip or winning/losing a football game.

Nominal variables represent groups with no rank or order between them, for example gender (male/female), religion (Hindu, Christian, Jewish) or eye colour (blue, green, brown).

Ordinal variables represent groups that are ranked in a specific order, for example letter grades A, B, C (A is the best and C the worst), customer ratings (1-10), or education level: elementary, high school, college (college being the highest level and elementary the lowest).
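The difference between nominal and ordinal variables can be sketched with pandas' `Categorical` type. This is only an illustration on made-up values; the names `eye_colour` and `education` are assumptions introduced here, not part of the datasets used later.

```python
import pandas as pd

# Nominal: no order between categories, so the categorical is unordered.
eye_colour = pd.Categorical(["blue", "green", "brown", "blue"])

# Ordinal: list the categories from lowest to highest and set ordered=True.
education = pd.Categorical(
    ["college", "elementary", "high school", "college"],
    categories=["elementary", "high school", "college"],
    ordered=True,
)

print(eye_colour.ordered)  # False
print(education.ordered)   # True
print(education.codes)     # integer codes follow the declared order
print(education.min(), education.max())  # comparisons only make sense when ordered
```

Because the order is declared explicitly, the integer codes respect the ranking instead of the alphabetical order of the strings.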

### Encoding Ordinal Values

#### Encoding Ordinal Values Using the map and apply Functions

In [2]:
import pandas as pd 

df_ordinal=pd.DataFrame({    # Create a pandas DataFrame from a toy dataset
    "Age":[33,22,44,55,20,21,37,65],
    "Income":["medium","low","high","high","low","low","medium","high"]
})

print(df_ordinal)

   Age  Income
0   33  medium
1   22     low
2   44    high
3   55    high
4   20     low
5   21     low
6   37  medium
7   65    high


The example above shows each person's income category alongside their age. Income has three categories: low, medium and high.
We can use the map method of a pandas Series to encode them, as shown below.

In [3]:
encoded_ordinal_map= df_ordinal.Income.map({"low":1,"medium":2,"high":3})
print(encoded_ordinal_map )

0    2
1    1
2    3
3    3
4    1
5    1
6    2
7    3
Name: Income, dtype: int64


In encoded_ordinal_map, the numbers 1, 2 and 3 have been assigned to the categories low, medium and high respectively.

We can also use the apply method to convert the ordinal values to numerical values, using the same dataset as above.

In [4]:
d={"low":1,"medium":2,"high":3}
encoded_ordinal_apply = df_ordinal.Income.apply(lambda x:d[x])
print(encoded_ordinal_apply)

0    2
1    1
2    3
3    3
4    1
5    1
6    2
7    3
Name: Income, dtype: int64
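The two approaches give the same result here, but they behave differently when a value is missing from the mapping dictionary. The sketch below illustrates this on a small hypothetical Series (`income`, with an extra "very high" value that is not in the toy dataset above).

```python
import pandas as pd

# "very high" is deliberately absent from the mapping (hypothetical value)
income = pd.Series(["low", "medium", "very high"])
mapping = {"low": 1, "medium": 2, "high": 3}

# map: values without a dictionary entry become NaN (result is float)
mapped = income.map(mapping)
print(mapped)

# apply with a plain dictionary lookup: a missing key raises KeyError
try:
    income.apply(lambda x: mapping[x])
    raised = False
except KeyError as err:
    raised = True
    print("KeyError for:", err)
```

Which behaviour is preferable depends on the task: silent NaN values are easy to overlook, while an exception stops the pipeline at the first unexpected category.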


#### Encoding Ordinal Values Using scikit-learn
A convenient way to encode categorical variables is to use scikit-learn, whose encoders are well optimized and easy to use. For ordinal values we can use OrdinalEncoder.

In [5]:
from sklearn.preprocessing import OrdinalEncoder  # import the library

sklearn_ordinal= OrdinalEncoder() # Instantiate the class and create sklearn_ordinal object
encoded_ordinal_sklearn = sklearn_ordinal.fit_transform(df_ordinal[["Income"]])
print(encoded_ordinal_sklearn)

[[2.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [2.]
 [0.]]


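In the encoded data above, scikit-learn has assigned high = 0, low = 1 and medium = 2. By default OrdinalEncoder sorts the categories alphabetically, so these codes do not follow the natural order low < medium < high; to preserve that order, the categories can be passed explicitly through the categories parameter. Don't forget to call the <b>fit_transform</b> method to perform the conversion.
The same encoder can be applied to any other ordinal column in the same way.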
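A minimal sketch of encoding the Income column so that the codes follow the intended ranking, using the `categories` parameter of `OrdinalEncoder`. The variable names `ordered_encoder` and `encoded_ordered` are illustrative, and the DataFrame is rebuilt here only so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_ordinal = pd.DataFrame({
    "Income": ["medium", "low", "high", "high", "low", "low", "medium", "high"]
})

# One list of categories per encoded column, ordered from lowest to highest.
ordered_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded_ordered = ordered_encoder.fit_transform(df_ordinal[["Income"]])

print(ordered_encoder.categories_)
print(encoded_ordered.ravel())  # low -> 0, medium -> 1, high -> 2
```

Note that the codes start at 0 rather than 1 as in the map example; only the relative order matters for an ordinal feature.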

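### Encoding Nominal Values
To represent nominal values we are going to use two popular encoders available in scikit-learn: LabelEncoder and OneHotEncoder. Note that scikit-learn documents LabelEncoder as intended for target labels rather than input features; it is applied to a feature here for illustration.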

#### Label Encoder 

In [8]:
# Let us create a toy dataframe using pandas 
df_nominal=pd.DataFrame({
    "Age":[33,22,44,55,20,21,37,65],
    "Sex":["m","f","m","f","m","f","m","m"]
})
print(df_nominal)

   Age Sex
0   33   m
1   22   f
2   44   m
3   55   f
4   20   m
5   21   f
6   37   m
7   65   m


In [10]:
from sklearn.preprocessing import LabelEncoder

sklearn_label= LabelEncoder()
encoded_label_sklearn = sklearn_label.fit_transform(df_nominal.Sex)
print(encoded_label_sklearn)

[1 0 1 0 1 0 1 1]
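To see which integer corresponds to which label, the fitted encoder can be inspected. The sketch below uses the `classes_` attribute and the `inverse_transform` method of `LabelEncoder`; the names `sex`, `label_encoder`, `codes` and `decoded` are chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

sex = ["m", "f", "m", "f", "m", "f", "m", "m"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(sex)

# classes_ lists the labels in sorted order; a label's position is its code
print(label_encoder.classes_)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
print(decoded)
```

Since the labels are sorted, "f" receives code 0 and "m" receives code 1, which matches the output above.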


#### One-Hot Encoder
One-hot encoding (OHE) works by replacing a categorical variable with one or more new binary features (one per category) that take the values 0 and 1.

In [29]:
from sklearn.preprocessing import OneHotEncoder

sklearn_OHE= OneHotEncoder()
encoded_OHE_sklearn = sklearn_OHE.fit_transform(df_nominal[["Sex"]]).toarray()
print(encoded_OHE_sklearn)


[[0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]


In the encoded values above for male ("m") and female ("f"), the matrix has shape 8x2. Scanning the matrix row by row, there are two distinct vectors: [0. 1.] and [1. 0.], which represent male and female respectively. The first row of df_nominal contains "m" and not "f", so its vector [0. 1.] indicates the presence of "m".
Similarly, the second row of df_nominal contains "f" and not "m", so [1. 0.] indicates the presence of "f".
The single Sex feature in df_nominal has become two columns in encoded_OHE_sklearn. After encoding, we drop the original column from the dataset and keep the encoded ones, so the new dataset looks like this:


In [33]:
pd.concat([df_nominal.Age,pd.DataFrame(encoded_OHE_sklearn,columns=["f","m"])],axis=1)

   Age    f    m
0   33  0.0  1.0
1   22  1.0  0.0
2   44  0.0  1.0
3   55  1.0  0.0
4   20  0.0  1.0
5   21  1.0  0.0
6   37  0.0  1.0
7   65  0.0  1.0
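The column names ["f", "m"] above were typed by hand, which is easy to get wrong when a feature has many categories. One possible alternative, sketched below, reads the names from the encoder's `categories_` attribute. The `Sex_` prefix and the names `encoder`, `ohe_columns` and `df_encoded` are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_nominal = pd.DataFrame({
    "Age": [33, 22, 44, 55, 20, 21, 37, 65],
    "Sex": ["m", "f", "m", "f", "m", "f", "m", "m"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df_nominal[["Sex"]]).toarray()

# categories_ holds one array of categories per input column,
# in the same order as the columns of the encoded matrix
ohe_columns = [f"Sex_{c}" for c in encoder.categories_[0]]

# Drop the original column and attach the encoded ones
df_encoded = pd.concat(
    [
        df_nominal.drop(columns="Sex"),
        pd.DataFrame(encoded, columns=ohe_columns, index=df_nominal.index),
    ],
    axis=1,
)
print(df_encoded)
```

Passing `index=df_nominal.index` keeps the rows aligned during `concat` even if the original DataFrame does not use the default 0..n-1 index.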


# Exercise For Students

For this exercise we are going to use a dataset from kaggle.com, from the "Cat in the Dat" Categorical Feature Encoding Challenge. Download the files, upload them to your Jupyter notebook environment and read them using pandas.read_csv. The dataset is available here: https://www.kaggle.com/c/cat-in-the-dat-ii/data



In [1]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder


# TODO (student)

# Download the dataset (use the file train.csv) from the link above and upload it to the Jupyter notebook
# Read the file using pandas
# Explore the data
# Apply a suitable encoding to each feature. You are responsible for working out which features need which type of encoding
# Use pandas concat (as in the example) to build the final dataframe
# WRITE YOUR CODE BELOW