# Feature Engineering
<hr>

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.

Source : <a href="https://en.wikipedia.org/wiki/Feature_engineering#:~:text=Feature%20engineering%20is%20the%20process,as%20applied%20machine%20learning%20itself.">Wikipedia</a>


### Process
<hr>
    
The feature engineering process is:

1. Brainstorming or testing features
2. Deciding what features to create
3. Creating features
4. Checking how the features work with your model
5. Improving your features if needed
6. Going back to brainstorming or creating more features until the work is done

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

pd.options.display.max_columns = None
%matplotlib inline

In [2]:
# import data set

df_2010 = pd.read_csv('data/data_2010_v1.csv')
df_2020 = pd.read_csv('data/data_2020_v1.csv')

In [3]:
def low_high(x):
    '''
    
    Author : Niladri Ghosh
    Email : niladri1406@gmail.com
    
    
    A function that takes a single argument, a dataframe, and for each numeric column listed 
    below adds an extra column "(column_name)_level" which describes whether each value is low, 
    medium, mod_high or high. The levels are calculated against the whole column's statistics: 
    the .describe() method supplies the min, 25%, 50%, 75% and max values used as bin edges. 
    For instance, for cmb_mpg (combined miles per gallon), the higher the value, the higher the 
    mileage level. The dataframe is modified in place and nothing is returned.
    
    level          cmb_mpg_level   
    
    min - 25%      low        
    25% - 50%      medium    
    50% - 75%      mod_high      
    75% - max      high         
    
    
    '''
    
    
    for col in ['displ','cyl','air_pollution_score','city_mpg','hwy_mpg','cmb_mpg','greenhouse_gas_score']: 
        
        # compute the summary statistics once per column
        stats = x[col].describe()
        # pd.cut bins are closed on the right, so the small offsets put the minimum value in 'low'
        # and move values equal to the 25% / 50% marks into the next bin up
        bin_edges = [stats['min']-0.00001, stats['25%']-0.00002, stats['50%']-0.00001, 
                     stats['75%'], stats['max']]
        bin_names = ['low', 'medium', 'mod_high', 'high']
        x[col+"_level"] = pd.cut(x[col], bin_edges, labels=bin_names)

Now we will create new level columns from the numerical data, dividing each column by its quartiles: a value between the minimum and the 25% mark is assigned low, and similarly values between 25% - 50% are medium, 50% - 75% are mod_high and 75% - max are high.
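
As a rough illustration of the quartile binning described here, the sketch below bins a small made-up series. The values are hypothetical, and it uses `include_lowest=True` rather than the small offsets in `low_high`, so a value sitting exactly on a quartile edge lands in the lower bin instead of the next one up.

```python
import pandas as pd

# Hypothetical toy values standing in for a column such as cmb_mpg
mpg = pd.Series([10, 15, 20, 25, 30, 35, 40, 45, 50], name="cmb_mpg")

# describe() gives min, 25%, 50%, 75% and max, which become the bin edges
stats = mpg.describe()
bin_edges = [stats["min"], stats["25%"], stats["50%"], stats["75%"], stats["max"]]
bin_names = ["low", "medium", "mod_high", "high"]

# include_lowest=True keeps the minimum value inside the first bin
mpg_level = pd.cut(mpg, bin_edges, labels=bin_names, include_lowest=True)

print(pd.DataFrame({"cmb_mpg": mpg, "cmb_mpg_level": mpg_level}))
```

Note that `pd.cut` raises an error when two edges coincide, which can happen on real columns where, say, the 25% and 50% values are equal.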

In [4]:
# data set 2010
low_high(df_2010)

In [5]:
# data set 2020
low_high(df_2020)

We need to __one-hot encode__ the newly created categorical data. By definition - "*__One Hot Encode refers to splitting the column which contains numerical categorical data to many columns depending on the number of categories present in that column. Each column contains “0” or “1” corresponding to which column it has been placed__*". For example, if a column color has the values red, blue and green (categorical data), the corresponding columns created would be -
    
    color          color_red      color_blue    color_green
    
    red            1              0             0
    blue           0              1             0
    green          0              0             1
    blue           0              1             0
    red            1              0             0
    blue           0              1             0
    red            1              0             0
    green          0              0             1
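
The table above can be reproduced with a minimal sketch using `pd.get_dummies`. The `colors` frame is made up for illustration, and `dtype=int` is passed only so the output shows 0/1 (recent pandas versions return booleans by default).

```python
import pandas as pd

# Toy frame matching the color example above
colors = pd.DataFrame(
    {"color": ["red", "blue", "green", "blue", "red", "blue", "red", "green"]}
)

# One new column per category; dtype=int gives 0/1 instead of True/False
onehot = pd.get_dummies(colors, columns=["color"], dtype=int)

# get_dummies orders the new columns alphabetically; reorder to match the table
onehot = onehot[["color_red", "color_blue", "color_green"]]

print(pd.concat([colors, onehot], axis=1))
```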

In [6]:
# copy data set
data_2010_cp = df_2010.copy()

In [7]:
# rename columns
data_2010_cp.rename(columns = lambda  x : x + "_2010", inplace=True)

### Dataset 2010

In [8]:
# copy data set
df_2010_cp = data_2010_cp.copy()

In [9]:
# the model column holds vehicle names rather than a category to encode, so we drop it for the time being
df_2010_cp.drop('model_2010', axis=1, inplace=True)

In [10]:
# create onehot / dummies
df_2010_onehot = pd.get_dummies(df_2010_cp)

In [11]:
# re-insert the model column as the first column
df_2010_onehot.insert(0, 'model_2010' ,data_2010_cp['model_2010'])

In [12]:
df_2010_onehot.head(5)

Unnamed: 0,model_2010,displ_2010,cyl_2010,air_pollution_score_2010,city_mpg_2010,hwy_mpg_2010,cmb_mpg_2010,greenhouse_gas_score_2010,trans_2010_Auto-4,trans_2010_Auto-5,trans_2010_Auto-6,trans_2010_Auto-7,trans_2010_AutoMan-5,trans_2010_AutoMan-6,trans_2010_AutoMan-7,trans_2010_CVT,trans_2010_Man-5,trans_2010_Man-6,trans_2010_Other-1,trans_2010_SemiAuto-4,trans_2010_SemiAuto-5,trans_2010_SemiAuto-6,trans_2010_SemiAuto-7,trans_2010_SemiAuto-8,drive_2010_2WD,drive_2010_4WD,fuel_2010_CNG,fuel_2010_Diesel,fuel_2010_Ethanol,fuel_2010_Gas,fuel_2010_Gasoline,cert_region_2010_CA,cert_region_2010_FA,cert_region_2010_FC,veh_class_2010_SUV,veh_class_2010_large car,veh_class_2010_midsize car,veh_class_2010_minivan,veh_class_2010_pickup,veh_class_2010_small car,veh_class_2010_special purpose,veh_class_2010_station wagon,veh_class_2010_van,smartway_2010_no,smartway_2010_yes,displ_level_2010_low,displ_level_2010_medium,displ_level_2010_mod_high,displ_level_2010_high,cyl_level_2010_low,cyl_level_2010_medium,cyl_level_2010_mod_high,cyl_level_2010_high,air_pollution_score_level_2010_low,air_pollution_score_level_2010_medium,air_pollution_score_level_2010_mod_high,air_pollution_score_level_2010_high,city_mpg_level_2010_low,city_mpg_level_2010_medium,city_mpg_level_2010_mod_high,city_mpg_level_2010_high,hwy_mpg_level_2010_low,hwy_mpg_level_2010_medium,hwy_mpg_level_2010_mod_high,hwy_mpg_level_2010_high,cmb_mpg_level_2010_low,cmb_mpg_level_2010_medium,cmb_mpg_level_2010_mod_high,cmb_mpg_level_2010_high,greenhouse_gas_score_level_2010_low,greenhouse_gas_score_level_2010_medium,greenhouse_gas_score_level_2010_mod_high,greenhouse_gas_score_level_2010_high
0,ACURA MDX,3.7,6.0,7.0,16,21,18,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0
1,ACURA MDX,3.7,6.0,6.0,16,21,18,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0
2,ACURA RDX,2.3,4.0,7.0,19,24,21,5.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0
3,ACURA RDX,2.3,4.0,7.0,17,22,19,4.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0
4,ACURA RDX,2.3,4.0,6.0,19,24,21,5.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0


### Dataset 2020

In [13]:
# create a copy of data
data_2020_cp = df_2020.copy()

In [14]:
# the model column holds vehicle names rather than a category to encode, so we drop it for the time being
data_2020_cp.drop('model', axis=1, inplace=True)

In [15]:
# create onehot / dummies
df_2020_onehot = pd.get_dummies(data_2020_cp)

In [16]:
# re-insert the model column as the first column
df_2020_onehot.insert(0, 'model' ,df_2020['model'])

In [17]:
df_2020_onehot.head(5)

Unnamed: 0,model,displ,cyl,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,trans_AMS-6,trans_AMS-7,trans_AMS-8,trans_AMS-9,trans_Auto-1,trans_Auto-10,trans_Auto-4,trans_Auto-6,trans_Auto-7,trans_Auto-8,trans_Auto-9,trans_AutoMan-6,trans_AutoMan-7,trans_AutoMan-8,trans_CVT,trans_Man-5,trans_Man-6,trans_Man-7,trans_SCV-10,trans_SCV-6,trans_SCV-7,trans_SCV-8,trans_SemiAuto-10,trans_SemiAuto-5,trans_SemiAuto-6,trans_SemiAuto-7,trans_SemiAuto-8,trans_SemiAuto-9,drive_2WD,drive_4WD,fuel_Diesel,fuel_Electricity,fuel_Ethanol,fuel_Gas,fuel_Gasoline,cert_region_CA,cert_region_FA,veh_class_large car,veh_class_midsize car,veh_class_minivan,veh_class_pickup,veh_class_small SUV,veh_class_small car,veh_class_special purpose,veh_class_standard SUV,veh_class_station wagon,veh_class_van,smartway_Elite,smartway_No,smartway_Yes,displ_level_low,displ_level_medium,displ_level_mod_high,displ_level_high,cyl_level_low,cyl_level_medium,cyl_level_mod_high,cyl_level_high,air_pollution_score_level_low,air_pollution_score_level_medium,air_pollution_score_level_mod_high,air_pollution_score_level_high,city_mpg_level_low,city_mpg_level_medium,city_mpg_level_mod_high,city_mpg_level_high,hwy_mpg_level_low,hwy_mpg_level_medium,hwy_mpg_level_mod_high,hwy_mpg_level_high,cmb_mpg_level_low,cmb_mpg_level_medium,cmb_mpg_level_mod_high,cmb_mpg_level_high,greenhouse_gas_score_level_low,greenhouse_gas_score_level_medium,greenhouse_gas_score_level_mod_high,greenhouse_gas_score_level_high
0,ACURA ILX,2.4,4.0,3,24,34,28,6,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0
1,ACURA ILX,2.4,4.0,3,24,34,28,6,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0
2,ACURA MDX,3.0,6.0,3,26,27,27,6,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0
3,ACURA MDX,3.0,6.0,3,26,27,27,6,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0
4,ACURA MDX,3.5,6.0,3,20,27,23,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0


## Combining the data sets into one based on car models

Before combining the original 2010 and 2020 data sets we need to be able to tell the two apart. The 2010 columns were renamed with a `_2010` suffix earlier, so after the merge each column name shows which year it comes from; the unsuffixed columns belong to 2020.

In [18]:
# merge data sets
df_combined = data_2010_cp.merge(df_2020, left_on="model_2010", right_on="model", how="inner")
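
To make the behaviour of this inner merge concrete, here is a small sketch on made-up frames (the rows and values are hypothetical). Because model names repeat within each year, every 2010 row is paired with every 2020 row of the same model, and models present in only one year are dropped.

```python
import pandas as pd

# Hypothetical toy rows; model names repeat, as they do in the real data
left = pd.DataFrame(
    {"model_2010": ["ACURA MDX", "ACURA MDX", "ACURA RDX"],
     "cmb_mpg_2010": [18, 18, 21]}
)
right = pd.DataFrame(
    {"model": ["ACURA MDX", "ACURA MDX", "ACURA ILX"],
     "cmb_mpg": [27, 23, 28]}
)

# Inner join on the model name: only models present in both years survive
merged = left.merge(right, left_on="model_2010", right_on="model", how="inner")

print(merged)
# 2 MDX rows (2010) x 2 MDX rows (2020) = 4 rows; RDX and ILX have no match
print(len(merged))
```

If one row per model is needed, the duplicates would have to be resolved (for example by aggregating) before merging.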

### Combine onehot data set

In [19]:
df_combined_onehot = df_2010_onehot.merge(df_2020_onehot, left_on="model_2010", right_on="model", how="inner")

In [20]:
# check for null values
df_combined_onehot.isna().sum().any()

False

## Save Data set to csv files

In [21]:
# original data sets 2010 and 2020
df_2010.to_csv("data/data_2010_v2.csv", index=False)
df_2020.to_csv("data/data_2020_v2.csv", index=False)

In [22]:
# combined data
df_combined.to_csv("data/data_combined.csv", index=False)

In [23]:
# combined onehot data
df_combined_onehot.to_csv("data/data_combined_onehot.csv", index=False)
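
The `index=False` argument used in these cells matters when the files are read back. The sketch below, on a made-up frame written to an in-memory buffer instead of a file, shows that saving with the default index produces an extra `Unnamed: 0` column on reload.

```python
import io
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"model": ["ACURA ILX", "ACURA MDX"], "cmb_mpg": [28, 27]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'model', 'cmb_mpg']
print(cols_without_index)  # ['model', 'cmb_mpg']
```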