
# **Practical No: 03** 
# **Aim: Feature Engineering**

##### **What is feature engineering?**
All machine learning algorithms use input data to generate outputs. The input data contains many features, which may not be in a form that can be given to the model directly. They need some processing first, and this is where feature engineering helps. Feature engineering serves two main goals:

1. It prepares the input dataset in the form required by a specific model or machine learning algorithm.

2. It helps improve the performance of machine learning models.

According to some surveys, data scientists spend most of their time on data preparation.

In [1]:
import pandas as pd
import numpy as np

The main feature engineering techniques that will be discussed are:

Missing data imputation

Categorical encoding

Variable transformation

Outlier engineering

Date and time engineering

### **Missing Data Imputation for Feature Engineering**

In your input data, some features or columns may have missing values. This happens when no data is stored for a certain observation in a variable. Missing data is very common and practically unavoidable, especially in real-world datasets. If data containing missing values is used as it is, the results can be significantly affected, and many algorithms cannot handle missing values at all. Imputation is the act of replacing missing data with statistical estimates of the missing values. It lets you complete your training data, which can then be provided to any model or algorithm for prediction.

There are multiple techniques for missing data imputation:

Complete case analysis
Mean / Median / Mode imputation
Missing Value Indicator

**Complete Case Analysis for Missing Data Imputation**

Complete case analysis means analyzing only those observations in the dataset that contain values for all the variables. In other words, remove every observation that contains a missing value. This method is only suitable when few observations have missing data; otherwise it shrinks the dataset so much that little useful data remains.

In real-life datasets the amount of missing data is often large, so in practice complete case analysis is rarely an option, although it can be used when the share of missing data is small.

Let’s see this on the Titanic dataset.

In [3]:
titanic = pd.read_csv('train.csv')
# make a copy of titanic dataset
data1 = titanic.copy()
data1.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

If we remove all observations with missing values, we would end up with a very small dataset, given that Cabin is missing for 77% of the observations.

In [4]:
# check how many observations we would drop
print('total passengers with values in all variables: ', data1.dropna().shape[0])
print('total passengers in the Titanic: ', data1.shape[0])
# plain division already returns a float in Python 3; np.float was removed from NumPy
print('percentage of data without missing values: ', data1.dropna().shape[0] / data1.shape[0])

total passengers with values in all variables:  183
total passengers in the Titanic:  891
percentage of data without missing values:  0.2053872053872054




So, we have complete information for only 20% of our observations in the Titanic dataset. Thus, Complete Case Analysis method would not be an option for this dataset.
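As a minimal sketch of the trade-off described above, the toy frame below (hypothetical values, not the Titanic data) compares complete case analysis over every column with the same check restricted to a subset of columns through the `subset` argument of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-like columns
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Complete case analysis over every column keeps only fully populated rows
complete_all = toy.dropna()

# Restricting the check to selected columns keeps more rows
complete_subset = toy.dropna(subset=["Age", "Fare"])

print(len(toy), len(complete_all), len(complete_subset))
```

When one sparse column such as Cabin is responsible for most of the loss, restricting the check (or dropping that column first) may keep far more observations.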

**Mean/ Median/ Mode for Missing Data Imputation**

Missing values can also be replaced with the mean, median, or mode of the variable (feature). This technique is widely used, including in data science competitions. It is suitable when data is missing at random and in small proportions.

In [5]:
# impute missing values in Age with the median
median = data1.Age.median()
# assign the result back instead of a chained inplace fillna, which newer pandas versions warn about
data1['Age'] = data1['Age'].fillna(median)
data1['Age'].isnull().sum()

0

The result 0 shows that the Age feature now has no null values.

One important point when imputing is that the imputation value should be learned from the training set and then applied to both the training set and the test set. All missing values in both sets should be filled with the value extracted from the training set only. This keeps information from the test set from leaking into the model and avoids over-optimistic results.
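A minimal sketch of this rule, assuming two small hypothetical frames named `train` and `test` (the practical itself works on a single file): the median is computed on the training set only and reused for both sets.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 54.0]})
test = pd.DataFrame({"Age": [np.nan, 2.0, 27.0]})

# Learn the imputation value from the training set only
train_median = train["Age"].median()

# Apply the same value to both sets
train["Age"] = train["Age"].fillna(train_median)
test["Age"] = test["Age"].fillna(train_median)

print(train_median)
print(test)
```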

**Missing Value Indicator for Missing Data Imputation**

This technique adds a binary variable that indicates whether the value is missing for a given observation. The variable takes the value 1 if the value is missing and 0 otherwise. We still need to replace the missing values in the original variable, which is usually done with mean or median imputation. Using the two techniques together means that if missingness has predictive power, it is captured by the missing indicator, and if it does not, it is masked by the mean / median imputation.

In [6]:
data1['Age_NA'] = np.where(data1['Age'].isnull(), 1, 0)

data1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_NA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [7]:
data1.Age.mean(), data1.Age.median()

(29.36158249158249, 28.0)

Now, since the mean and median are close, let’s fill the missing values with the median.

In [8]:
data1['Age'] = data1['Age'].fillna(data1.Age.median())

data1.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_NA
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q,0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,0


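So, the Age_NA variable was created to capture the missingness. Note that Age had already been imputed in In [5], so Age_NA is 0 for every row here (for example row 5, whose age of 28.0 is the imputed median). For the indicator to be useful, it must be created before the missing values are filled.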
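The order of operations matters for this technique. A minimal sketch on a hypothetical toy column, flagging missingness first and imputing afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ages
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# 1) Flag missingness while the NaNs are still present
toy["Age_NA"] = toy["Age"].isnull().astype(int)

# 2) Only then impute the original column
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```

If the two steps are swapped, the indicator contains only zeros and carries no information.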

### **Categorical encoding in Feature Engineering**

Categorical data is data that takes only a limited number of values. Let’s understand this with an example. A Gender variable in a dataset will have categorical values like Male and Female. If a survey asks which car people own, the result will be categorical (because the answers fall into categories like Honda, Toyota, Hyundai, Maruti, None, etc.). So, the point to notice is that the data falls into a fixed set of categories.

If you pass a dataset with categorical variables directly to a model, most algorithms will raise an error because they expect numeric input. Hence, the variables need to be encoded. There are multiple techniques to do so:
<pre>
One-Hot encoding (OHE)
Ordinal encoding
Count and Frequency encoding
Target encoding / Mean encoding
</pre>
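Ordinal encoding appears in the list above but is not demonstrated in this practical. As a rough sketch, assuming a hypothetical `size` column whose categories have a natural order, each category can be replaced by its rank using an ordered pandas `Categorical`:

```python
import pandas as pd

# Hypothetical column with naturally ordered categories
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The order is an assumption supplied by us, not inferred from the data
order = ["small", "medium", "large"]

# Codes follow the position of each category in `order`
toy["size_encoded"] = pd.Categorical(toy["size"], categories=order, ordered=True).codes

print(toy)
```

This only makes sense when the categories really are ordered; for unordered labels such as car brands, the integer ranks would suggest an order that does not exist.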

**One-Hot Encoding**

It is a commonly used technique for encoding categorical variables. It creates one binary variable for each category present in the categorical variable. Each binary variable takes the value 1 if the observation belongs to that category and 0 otherwise. Each new variable is called a dummy variable or binary variable.

Example: If the categorical variable is Gender with the labels female and male, two boolean variables can be generated, called male and female. The male variable takes 1 if the person is male and 0 otherwise, and similarly for the female variable. See the code below for the Titanic dataset.

pd.get_dummies(data1['Sex']).head()

In [9]:
# dtype=int keeps the 0/1 output shown below; pandas 2.x returns booleans by default
pd.concat([data1['Sex'], pd.get_dummies(data1['Sex'], dtype=int)], axis=1).head()

Unnamed: 0,Sex,female,male
0,male,0,1
1,female,1,0
2,female,1,0
3,female,1,0
4,male,0,1


But you can see that we only need one dummy variable to represent the Sex categorical variable. As a general rule, if there are n categories, you only need n-1 dummy variables, so you can drop any one of them. To get n-1 dummy variables, simply use this:

In [10]:
# dtype=int keeps the 0/1 output shown below; pandas 2.x returns booleans by default
pd.get_dummies(data1['Sex'], drop_first=True, dtype=int).head()


Unnamed: 0,male
0,1
1,0
2,0
3,0
4,1


**Count and Frequency Encoding**

In this encoding technique, each category is replaced by the count of observations that show that category in the dataset. The replacement can also be the frequency, that is, the fraction of observations in the dataset that show the category. For example, if 30 of 100 observations are male, we can replace male with 30 or with 0.3.

This approach is popular in data science competitions. In essence, it represents how often each label appears in the dataset.
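A minimal sketch of both variants on a hypothetical toy column, using `value_counts` to build the lookup and `map` to apply it:

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"Sex": ["male", "female", "male", "male", "female"]})

# Number of observations per category
counts = toy["Sex"].value_counts()

# Fraction of observations per category
freqs = toy["Sex"].value_counts(normalize=True)

# Replace each label by its count or its frequency
toy["Sex_count"] = toy["Sex"].map(counts)
toy["Sex_freq"] = toy["Sex"].map(freqs)

print(toy)
```

One caveat of this design is that two categories with the same count receive the same code and can no longer be told apart.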

**Target / Mean Encoding**

In target encoding, also called mean encoding, we replace each category of a variable with the mean value of the target for the observations that show that category. For example, suppose there is a categorical variable “city” and we want to predict whether a customer will buy a TV if we send a letter. If 30 percent of the people in the city “London” buy the TV, we replace London with 0.3. This captures some information about the target while encoding the category, and it does not expand the feature space, so it is a useful encoding option. However, because the encoding is derived from the target itself, it can cause over-fitting, so be careful. Look at this code for the implementation:

In [11]:
import pandas as pd
# creating dataset
data={'CarName':['C1','C2','C3','C1','C4','C3','C2','C1','C2','C4','C1'],
      'Target':[1,0,1,1,1,0,0,1,1,1,0]}
df = pd.DataFrame(data)
print(df)

   CarName  Target
0       C1       1
1       C2       0
2       C3       1
3       C1       1
4       C4       1
5       C3       0
6       C2       0
7       C1       1
8       C2       1
9       C4       1
10      C1       0


In [12]:
df.groupby(['CarName'])['Target'].count()

CarName
C1    4
C2    3
C3    2
C4    2
Name: Target, dtype: int64

In [13]:
df.groupby(['CarName'])['Target'].mean()

CarName
C1    0.750000
C2    0.333333
C3    0.500000
C4    1.000000
Name: Target, dtype: float64

In [14]:
Mean_encoded = df.groupby(['CarName'])['Target'].mean().to_dict()
df['CarName'] = df['CarName'].map(Mean_encoded)
print(df)

     CarName  Target
0   0.750000       1
1   0.333333       0
2   0.500000       1
3   0.750000       1
4   1.000000       1
5   0.500000       0
6   0.333333       0
7   0.750000       1
8   0.333333       1
9   1.000000       1
10  0.750000       0
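One common way to reduce the over-fitting risk mentioned above is to learn the category means on the training data only and reuse them elsewhere. The sketch below assumes hypothetical `train` and `test` frames and falls back to the overall training mean for categories that never appeared in training; that fallback is an illustrative choice, not part of the original practical.

```python
import pandas as pd

# Hypothetical training data with a binary target
train = pd.DataFrame({
    "CarName": ["C1", "C2", "C3", "C1", "C2", "C1"],
    "Target":  [1, 0, 1, 1, 1, 0],
})

# Hypothetical test data; C9 never appears in train
test = pd.DataFrame({"CarName": ["C1", "C3", "C9"]})

# Learn the encoding from the training set only
category_means = train.groupby("CarName")["Target"].mean()
global_mean = train["Target"].mean()

# Unseen categories map to NaN, so fill them with the overall training mean
test["CarName_encoded"] = test["CarName"].map(category_means).fillna(global_mean)

print(test)
```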


### **Variable Transformation**

Some machine learning algorithms, such as linear and logistic regression, tend to perform better when the variables are approximately normally distributed (strictly, linear regression assumes normally distributed residuals rather than normally distributed inputs). If a variable is not normally distributed, it is sometimes possible to find a mathematical transformation so that the transformed variable is closer to Gaussian. Gaussian-distributed variables often improve the performance of such algorithms.

Commonly used mathematical transformations are:

<pre>
Logarithm transformation – log(x)
Square root transformation – sqrt(x)
Reciprocal transformation – 1 / x
Exponential transformation – exp(x)
</pre>
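As a rough sketch of how these transformations can be compared, the snippet below applies three of them to a hypothetical right-skewed, strictly positive series and prints the skewness of each result. The values are made up for illustration, and the exponential transformation is left out because it would make right-skewed data even more skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive values (fare-like)
x = pd.Series([7.25, 7.92, 8.05, 8.46, 11.13, 21.08, 30.07, 51.86, 53.10, 71.28])

transformed = pd.DataFrame({
    "original": x,
    "log": np.log(x),        # requires x > 0
    "sqrt": np.sqrt(x),      # requires x >= 0
    "reciprocal": 1 / x,     # requires x != 0
})

# Skewness closer to 0 suggests a more symmetric distribution
skewness = transformed.skew()
print(skewness)
```

Skewness is only a quick numeric check; a histogram or Q-Q plot gives a fuller picture of how close the transformed variable is to a Gaussian.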

Let’s check these out on the titanic dataset.

In [15]:
cols_required = ['Age', 'Fare', 'Survived']
data1[cols_required].head()

Unnamed: 0,Age,Fare,Survived
0,22.0,7.25,0
1,38.0,71.2833,1
2,26.0,7.925,1
3,35.0,53.1,1
4,35.0,8.05,0


First, we need to fill in the missing data. We will fill the missing values with a random sample drawn from the observed values.

In [16]:
def impute(data1, variable):
    df = data1.copy()
    df[variable+'_random'] = df[variable]
    # extract the random sample to fill the na
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample
    return df[variable+'_random']
# fill na
data1['Age'] = impute(data1, 'Age')

Now, to visualize the distribution of the Age variable, we can plot a histogram and a Q-Q plot.

For worked examples of transforming data in Python, see https://www.statology.org/transform-data-in-python/

### **Date and Time Feature Engineering**

Date variables can be treated as a special type of categorical variable, and if processed well they can enrich the dataset considerably. From a date we can extract useful information such as the month, semester, quarter, day, day of the week, whether it is a weekend, hours, minutes, and more. Let’s take a dataset and do some coding around it.

For this, we will use the Lending Club dataset.

We will use only two columns from the dataset: issue_d and last_pymnt_d.

In [17]:
import pandas as pd
import numpy as np

In [19]:
use_cols = ['issue_d', 'last_pymnt_d']
data = pd.read_csv('loan.csv', usecols=use_cols, nrows=10000)
data.head()

Unnamed: 0,issue_d,last_pymnt_d
0,Dec-2018,Feb-2019
1,Dec-2018,Feb-2019
2,Dec-2018,Feb-2019
3,Dec-2018,Feb-2019
4,Dec-2018,Feb-2019


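Now, parse the dates into datetime format, since they are currently stored as strings.

In [20]:
# an explicit format ('Dec-2018' is month abbreviation, dash, year) avoids ambiguous inference
data['issue_dt'] = pd.to_datetime(data.issue_d, format='%b-%Y')
data['last_pymnt_dt'] = pd.to_datetime(data.last_pymnt_d, format='%b-%Y')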
data[['issue_d','issue_dt','last_pymnt_d', 'last_pymnt_dt']].head()

Unnamed: 0,issue_d,issue_dt,last_pymnt_d,last_pymnt_dt
0,Dec-2018,2018-12-01,Feb-2019,2019-02-01
1,Dec-2018,2018-12-01,Feb-2019,2019-02-01
2,Dec-2018,2018-12-01,Feb-2019,2019-02-01
3,Dec-2018,2018-12-01,Feb-2019,2019-02-01
4,Dec-2018,2018-12-01,Feb-2019,2019-02-01


Now, extract the month and then the quarter from the date.

In [21]:
data['issue_dt_month'] = data['issue_dt'].dt.month
data[['issue_dt', 'issue_dt_month']].head()

Unnamed: 0,issue_dt,issue_dt_month
0,2018-12-01,12
1,2018-12-01,12
2,2018-12-01,12
3,2018-12-01,12
4,2018-12-01,12


In [22]:
data['issue_dt_quarter'] = data['issue_dt'].dt.quarter
data[['issue_dt', 'issue_dt_quarter']].head()

Unnamed: 0,issue_dt,issue_dt_quarter
0,2018-12-01,4
1,2018-12-01,4
2,2018-12-01,4
3,2018-12-01,4
4,2018-12-01,4


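Extract the day of the week from the date. pandas numbers the days from Monday = 0 to Sunday = 6, so 5 means Saturday. Because the source strings contain only a month and a year, every parsed date falls on the first day of its month.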

In [23]:
data['issue_dt_dayofweek'] = data['issue_dt'].dt.dayofweek
data[['issue_dt', 'issue_dt_dayofweek']].head()

Unnamed: 0,issue_dt,issue_dt_dayofweek
0,2018-12-01,5
1,2018-12-01,5
2,2018-12-01,5
3,2018-12-01,5
4,2018-12-01,5
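A few more of the date features mentioned earlier can be derived in the same way. The sketch below uses a small hypothetical frame with the same two column names and string format as the loan data; the derived column names are illustrative.

```python
import pandas as pd

# Hypothetical rows in the same 'Mon-YYYY' format as the loan data
dates = pd.DataFrame({
    "issue_d": ["Dec-2018", "Nov-2018", "Sep-2018"],
    "last_pymnt_d": ["Feb-2019", "Feb-2019", "Jan-2019"],
})

dates["issue_dt"] = pd.to_datetime(dates["issue_d"], format="%b-%Y")
dates["last_pymnt_dt"] = pd.to_datetime(dates["last_pymnt_d"], format="%b-%Y")

# Year of issue
dates["issue_dt_year"] = dates["issue_dt"].dt.year

# Weekend flag: dayofweek is 5 for Saturday and 6 for Sunday
dates["issue_dt_is_weekend"] = (dates["issue_dt"].dt.dayofweek >= 5).astype(int)

# Elapsed time between the two dates, in days
dates["days_to_last_pymnt"] = (dates["last_pymnt_dt"] - dates["issue_dt"]).dt.days

print(dates[["issue_dt", "issue_dt_year", "issue_dt_is_weekend", "days_to_last_pymnt"]])
```

Differences between two date columns are often more informative to a model than either date on its own.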


### **Outlier Engineering**

Outliers are defined as those values that are unusually high or low with respect to the rest of the observations of the variable. Some of the techniques to handle outliers are:
 
1. Outlier removal

2. Treating outliers as missing values

3. Outlier capping

How to identify outliers?

The basic form of detection is an extreme value analysis of the data. If the variable follows a Gaussian distribution, outliers lie outside the mean plus or minus three times the standard deviation of the variable. If the variable is not normally distributed, quantiles can be used instead. Calculate the quartiles and then the interquartile range (IQR):

IQR = 75th quantile - 25th quantile

upper boundary: 75th quantile + (IQR * 1.5)

lower boundary: 25th quantile - (IQR * 1.5)

So, outliers sit outside these boundaries.

**Outlier removal**

In this technique, simply remove the outlier observations from the dataset. If outliers are not abundant, dropping them will not affect the data much. But if multiple variables have outliers, we may end up removing a big chunk of the dataset, so keep this in mind whenever dropping outliers.

**Treating outliers as missing values**

You can also treat outliers as missing values. These missing values then have to be filled, using any of the imputation methods discussed above.

**Outlier capping**

This procedure caps the maximum and minimum values at predefined values, which can be derived from the variable distribution. If a variable is normally distributed, we can cap the maximum and minimum values at the mean plus or minus three times the standard deviation. If the variable is skewed, we can use the interquartile range proximity rule or cap at the top and bottom percentiles.
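The IQR boundaries and the three treatments described above can be sketched as follows. The series holds hypothetical height-like values chosen for illustration, and the variable names are not taken from the practical.

```python
import pandas as pd

# Hypothetical values with one very low and one very high observation
heights = pd.Series([5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5, 6.1, 5.6, 1.2])

# Interquartile range proximity rule
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (heights < lower) | (heights > upper)

# 1) Outlier removal: keep only the values inside the boundaries
removed = heights[~is_outlier]

# 2) Treating outliers as missing values: replace them with NaN for later imputation
as_missing = heights.mask(is_outlier)

# 3) Outlier capping: pull extreme values back to the boundaries
capped = heights.clip(lower=lower, upper=upper)

print(lower, upper)
print(heights[is_outlier])
```

Removal shrinks the dataset, while masking and capping keep every row, which matters when several variables contain outliers.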
 

In [24]:
import pandas as pd

In [26]:
df = pd.read_csv('height.csv')
df.head()

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9


**Detect outliers using percentile**

In [27]:
max_thresold = df['height'].quantile(0.95)
max_thresold

9.689999999999998

In [28]:
df[df['height']>max_thresold]

Unnamed: 0,name,height
9,imran,14.5


In [29]:
min_thresold = df['height'].quantile(0.05)
min_thresold

3.6050000000000004

In [30]:
df[df['height']<min_thresold]

Unnamed: 0,name,height
12,yoseph,1.2


**Remove outliers**

In [31]:
df[(df['height']<max_thresold) & (df['height']>min_thresold)]

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9
5,khusbu,5.4
6,dmitry,6.2
7,selena,6.5
8,john,7.1
10,jose,6.1


In [33]:
df = pd.read_csv("BHP.csv")
df.head()


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [34]:
df.shape

(13320, 9)

In [35]:
df.describe()

Unnamed: 0,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


**Explore samples that are above the 99.9th percentile or below the 0.1th percentile**

In [36]:
min_thresold, max_thresold = df.price.quantile([0.001, 0.999])
min_thresold, max_thresold

(11.159500000000001, 2000.0)

In [37]:
df[df.price < min_thresold]

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
171,Super built-up Area,Ready To Move,Attibele,1 BHK,Jae 1hu,450,1.0,1.0,11.0
942,Built-up Area,Ready To Move,Attibele,1 BHK,Jae 2hu,400,1.0,1.0,11.0
1471,Built-up Area,18-Mar,Kengeri,1 BHK,,340,1.0,1.0,10.0
2437,Built-up Area,Ready To Move,Attibele,1 BHK,Jae 1hu,395,1.0,1.0,10.25
4113,Super built-up Area,18-Jan,BTM Layout,3 BHK,,167Sq. Meter,3.0,2.0,10.0
5410,Super built-up Area,Ready To Move,Attibele,1 BHK,Jae 1hu,400,1.0,1.0,10.0
7482,Super built-up Area,Ready To Move,Alur,1 BHK,,470,2.0,1.0,10.0
8594,Built-up Area,Ready To Move,Chandapura,1 BHK,,450,1.0,1.0,9.0
8653,Plot Area,Ready To Move,Doddaballapur,2 Bedroom,,640,1.0,0.0,10.5
10526,Super built-up Area,Ready To Move,Yelahanka New Town,1 BHK,KHatsFl,284,1.0,1.0,8.0


In [38]:
df[df.price > max_thresold]

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
408,Super built-up Area,19-Jan,Rajaji Nagar,7 BHK,,12000,6.0,3.0,2200.0
605,Super built-up Area,19-Jan,Malleshwaram,7 BHK,,12000,7.0,3.0,2200.0
2623,Plot Area,18-Jul,Dodsworth Layout,4 Bedroom,,30000,4.0,,2100.0
3180,Super built-up Area,Ready To Move,Shanthala Nagar,5 BHK,Kierser,8321,5.0,3.0,2700.0
4162,Built-up Area,Ready To Move,Yemlur,4 Bedroom,Epllan,7000,5.0,,2050.0
6421,Plot Area,18-Sep,Bommenahalli,4 Bedroom,Prood G,2940,3.0,2.0,2250.0
10304,Plot Area,Ready To Move,5th Block Jayanagar,4 Bedroom,,10624,4.0,2.0,2340.0
11080,Super built-up Area,18-Jan,Ashok Nagar,4 BHK,,8321,5.0,2.0,2912.0
11763,Plot Area,Ready To Move,Sadashiva Nagar,5 Bedroom,,9600,7.0,2.0,2736.0
12443,Plot Area,Ready To Move,Dollars Colony,4 Bedroom,,4350,8.0,,2600.0


**Remove outliers**

In [39]:
df2 = df[(df.price<max_thresold) & (df.price>min_thresold)]
df2.shape

(13291, 9)

In [40]:
df2.describe()

Unnamed: 0,bath,balcony,price
count,13219.0,12688.0,13291.0
mean,2.690673,1.584253,110.010361
std,1.335757,0.817169,125.434347
min,1.0,0.0,11.5
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,1950.0
