# Feature Engineering
<hr>

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.

Source : <a href="https://en.wikipedia.org/wiki/Feature_engineering#:~:text=Feature%20engineering%20is%20the%20process,as%20applied%20machine%20learning%20itself.">Wikipedia</a>


### Process
<hr>

The feature engineering process is:

1. Brainstorming or testing features
2. Deciding what features to create
3. Creating features
4. Checking how the features work with your model
5. Improving your features if needed
6. Going back to brainstorming and creating more features until the work is done
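
As a rough illustration of steps 3 to 5, the sketch below creates one candidate feature and checks whether it helps a simple model. Everything in it is an assumption made for illustration: the data is synthetic, the model is a plain least-squares fit, and the helper `holdout_mse` is a hypothetical name, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: the target depends on the product of two raw columns
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 * x2 + 0.1 * rng.normal(size=n)


def holdout_mse(features, target, n_train=300):
    """Fit ordinary least squares on the first n_train rows, return the MSE on the rest."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X[:n_train], target[:n_train], rcond=None)
    residual = target[n_train:] - X[n_train:] @ coef
    return float(np.mean(residual ** 2))


# Step 4 with the raw features only
mse_raw = holdout_mse([x1, x2], y)
# Steps 3 and 4 again, this time with one created feature (the product x1 * x2)
mse_engineered = holdout_mse([x1, x2, x1 * x2], y)

print(f"raw features:       {mse_raw:.3f}")
print(f"with x1 * x2 added: {mse_engineered:.3f}")
```

If the created feature lowers the held-out error, it is worth keeping. If it does not, the loop returns to brainstorming.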

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

pd.options.display.max_columns = None
%matplotlib inline

In [2]:
# import clean data set

data_wine = pd.read_csv("winequality_clean.csv")

In [3]:
def low_high(x):
    '''

    Author : Niladri Ghosh
    Email : niladri1406@gmail.com


    A function that takes a single argument, a dataframe, and adds an extra column named "(column_name)_level" for
    each feature column. The new column describes whether each value is low, medium, mod_high or high. The bin edges
    are taken from the column's own summary statistics: we use the .describe() method to fetch the values for min,
    25%, 50%, 75% and max. For pH the labels are reversed, i.e. the higher the value, the lower the concentration
    level. For the remaining columns it is the other way round, i.e. the higher the value, the higher the level.
    The columns 'quality' and 'color' are skipped. The dataframe is modified in place and nothing is returned.

    level          pH          rest of the columns

    min - 25%      high        low
    25% - 50%      mod_high    medium
    50% - 75%      medium      mod_high
    75% - max      low         high


    '''


    for col in x.columns:
        # 'quality' and 'color' are not binned
        if col == 'quality' or col == 'color':
            continue

        # compute the summary statistics once per column
        stats = x[col].describe()

        # the small offset below the minimum keeps the smallest value inside the first bin
        bin_edges = [stats['min']-0.001, stats['25%'], stats['50%'], stats['75%'], stats['max']]

        if col == 'pH':
            # reversed labels for pH
            bin_names = ['high', 'mod_high', 'medium', 'low']
        else:
            bin_names = ['low', 'medium', 'mod_high', 'high']

        x[col+"_level"] = pd.cut(x[col], bin_edges, labels=bin_names)
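

To see what the quartile binning does, here is a minimal sketch on a toy column. The eight values are made up for illustration and are not taken from the wine data set. The bin edges are built the same way as in `low_high`, once with the regular labels and once with the reversed labels used for pH.

```python
import pandas as pd

# Hypothetical toy values, not taken from the wine data set
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], name="toy")

# Same edge construction as in low_high: min (minus a small offset), quartiles, max
stats = values.describe()
bin_edges = [stats["min"] - 0.001, stats["25%"], stats["50%"], stats["75%"], stats["max"]]

# Regular labels (used for most columns) and reversed labels (used for pH)
levels = pd.cut(values, bin_edges, labels=["low", "medium", "mod_high", "high"])
reversed_levels = pd.cut(values, bin_edges, labels=["high", "mod_high", "medium", "low"])

print(pd.DataFrame({"value": values, "level": levels, "reversed": reversed_levels}))
```

Each quartile of the toy column receives two of the eight values, so every label appears twice.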


In [4]:
# create columns with features
low_high(data_wine)

In [5]:
data_wine.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,fixed_acidity_level,volatile_acidity_level,citric_acid_level,residual_sugar_level,chlorides_level,free_sulfur_dioxide_level,total_sulfur_dioxide_level,density_level,pH_level,sulphates_level,alcohol_level
0,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,high,high,low,medium,high,medium,low,high,mod_high,high,medium
1,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,high,high,low,medium,high,low,low,high,medium,high,medium
2,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,high,medium,high,medium,high,medium,low,high,mod_high,mod_high,medium
3,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,mod_high,high,low,medium,high,low,low,high,low,mod_high,low
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,red,mod_high,high,low,low,high,low,low,high,low,mod_high,low


In [6]:
# check shape of dataset
data_wine.shape

(5320, 24)

We need to __one-hot encode__ the newly created categorical columns. *__One-hot encoding splits a column of categorical data into several columns, one for each category present in that column. Each new column contains 1 in the rows that belong to its category and 0 everywhere else.__* For example, if a column color has the values red, blue and green (categorical data), the corresponding columns would be:
    
    color          color_red      color_blue    color_green
    
    red            1              0             0
    blue           0              1             0
    green          0              0             1
    blue           0              1             0
    red            1              0             0
    blue           0              1             0
    red            1              0             0
    green          0              0             1
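
The table above can be reproduced with `pd.get_dummies` on a toy frame. This is a minimal sketch: the frame `toy` is made up to match the example, and `dtype=int` is passed so that the result holds 0/1 rather than the boolean values that recent pandas versions return by default. Note that pandas orders the new columns alphabetically by category, so the column order differs from the table.

```python
import pandas as pd

# Hypothetical toy frame matching the color example above
toy = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red", "blue", "red", "green"]})

# One new 0/1 column per category; dtype=int asks for integers instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(pd.concat([toy, encoded], axis=1))
```

Every row has exactly one 1 across the new columns, which is where the name "one hot" comes from.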

In [7]:
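# one-hot encode every categorical column; dtype=int keeps the 0/1 values shown below
# (recent pandas versions return True/False by default)
df_wine = pd.get_dummies(data_wine, dtype=int)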

In [8]:
df_wine.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color_red,color_white,fixed_acidity_level_low,fixed_acidity_level_medium,fixed_acidity_level_mod_high,fixed_acidity_level_high,volatile_acidity_level_low,volatile_acidity_level_medium,volatile_acidity_level_mod_high,volatile_acidity_level_high,citric_acid_level_low,citric_acid_level_medium,citric_acid_level_mod_high,citric_acid_level_high,residual_sugar_level_low,residual_sugar_level_medium,residual_sugar_level_mod_high,residual_sugar_level_high,chlorides_level_low,chlorides_level_medium,chlorides_level_mod_high,chlorides_level_high,free_sulfur_dioxide_level_low,free_sulfur_dioxide_level_medium,free_sulfur_dioxide_level_mod_high,free_sulfur_dioxide_level_high,total_sulfur_dioxide_level_low,total_sulfur_dioxide_level_medium,total_sulfur_dioxide_level_mod_high,total_sulfur_dioxide_level_high,density_level_low,density_level_medium,density_level_mod_high,density_level_high,pH_level_high,pH_level_mod_high,pH_level_medium,pH_level_low,sulphates_level_low,sulphates_level_medium,sulphates_level_mod_high,sulphates_level_high,alcohol_level_low,alcohol_level_medium,alcohol_level_mod_high,alcohol_level_high
0,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0
1,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0
2,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0
3,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,1,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0


In [9]:
# save data set to csv file 
data_wine.to_csv('winequality_modified.csv', index = False)

In [10]:
# save data set to csv file - ONE HOT ENCODE
df_wine.to_csv('winequality_onehot.csv', index=False)
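
One thing to keep in mind when loading `winequality_modified.csv` later: a CSV file stores only text, so the `_level` columns come back as plain labels and lose their category order. The sketch below shows one possible way to restore it. It uses a made-up single-column frame and an in-memory buffer instead of a file, and the names `toy` and `level_type` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for one "_level" column
level_names = ["low", "medium", "mod_high", "high"]
toy = pd.DataFrame({
    "alcohol_level": pd.Categorical(["low", "high", "medium"],
                                    categories=level_names, ordered=True)
})

# Round trip through CSV, using an in-memory buffer in place of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

dtype_after_load = loaded["alcohol_level"].dtype
print(dtype_after_load)  # no longer a categorical dtype

# Restore the ordered categories explicitly
level_type = pd.CategoricalDtype(categories=level_names, ordered=True)
loaded["alcohol_level"] = loaded["alcohol_level"].astype(level_type)
print(loaded["alcohol_level"].cat.categories.tolist())
```

For the pH column the labels would be listed in the reversed order used by `low_high`.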