## Feature Engineering: Basic Introduction

### Introduction
"Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning (ML) algorithms. Feature engineering can be considered as applied machine learning itself." ([Wikipedia](https://en.wikipedia.org/wiki/Feature_engineering))

Which features are presented to an ML algorithm, and how they are presented, is very important, as this greatly influences the quality of the algorithm's output. So much lies in the quality of the data and how it is presented: "Garbage In, Garbage Out", "Quality In, Quality Out".

At its heart, a machine learning problem is a knowledge representation problem. How the knowledge is represented determines whether, and how quickly, the ML algorithm provides useful results.

#### How many strokes can you count?

In [None]:
from IPython.display import Image
Image(filename='uncounted_strokes.PNG',width=400, height=300)

How many seconds did it take you to count the strokes? Are there approaches we can apply to make this process simpler? What if the strokes were clustered into groups of 5: would that simplify the process for us?

In [None]:
from IPython.display import Image
Image(filename='counted_strokes.PNG',width=400, height=300)

Feature engineering covers the approaches that make it easier to present data to a machine learning algorithm, so that it can find insights and relationships between the features and the target variable.

The type of feature engineering approach we use is influenced by the type of data provided and the machine learning algorithm to be used.

In this Jupyter notebook we focus on numeric, categorical and time series (datetime) data. Other data types that are not considered include text, images and speech.

One approach to feature engineering that is often neglected, yet of great importance, is engaging a subject matter expert (SME). For example, an SME would know that body mass index (BMI), which captures the interaction between height and weight, is a better indicator of being overweight than height or weight alone.
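As a minimal sketch of the BMI idea, the cell below derives the interaction feature from height and weight. The measurements and the column names `height_m` and `weight_kg` are invented for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical measurements, invented purely for illustration
people = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.90],
    "weight_kg": [80.0, 70.0, 85.0],
})

# BMI = weight in kilograms divided by the square of height in metres
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people.round(2))
```

The heaviest person in this toy table does not have the highest BMI, which is the kind of information the two raw columns do not expose on their own.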

Content

1. Feature Engineering Approaches
2. Feature Selection Approaches
3. References 

### Dataset

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

dataset = fetch_california_housing()

retail_analytics_dataset = pd.read_csv("clean_retail_analytics_data.csv")

In [None]:
df_features = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
df_target = dataset.target

In [None]:
df_features.describe()

In [None]:
df_features.head()

In [None]:
df_features[dataset['feature_names'][:-2]].boxplot();

In [None]:
fig = plt.figure(figsize =(15,15))

features_columns = df_features.columns.tolist()

for x,y in enumerate(features_columns):
    plt.subplot(4,3,x+1)
    plt.hist(x=df_features[y], bins=30)
    plt.title(y + ' distribution') 

### 1. Feature Engineering Approaches

a. Feature Transformation  
b. Feature Extraction/Creation 

#### a. Feature Transformation 

- Scaling
  - Normalization - transforming the range of values to lie between 0.0 and 1.0
  - Standardization - transforming the values so that the mean is 0 and the standard deviation is 1
- Log Transformation - taking the log of a value. Other transformations, such as squaring, also exist

The range of a feature's values can have undue influence on an ML algorithm, and the impact varies by algorithm. Algorithms such as linear regression (especially when regularized or fitted by gradient descent) and support vector machines are sensitive to differing feature ranges. Linear regression also relies on statistical assumptions that lend themselves to standardization. Decision trees and random forests are less affected. In practice, it is usually worth scaling the features.

It is important to note that while normalization bounds every feature to a fixed range, it does not guarantee balanced feature scales in the presence of outliers: a single extreme value can squeeze the remaining values into a narrow band. See [Scalers & Outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).

Generally if you expect the feature to be normally distributed you would use standardization.  In practice it does not hurt to try both approaches.  Data Science is both a science and an art.
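To make the two scaling definitions concrete, here is a small sketch that computes them by hand with NumPy and checks the results against scikit-learn's scalers. The single-column array `x` is toy data invented for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature column, invented for illustration
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Normalization: (x - min) / (max - min), giving values in [0, 1]
manual_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_standard = (x - x.mean(axis=0)) / x.std(axis=0)

# Both manual results agree with the scikit-learn scalers
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))
```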

In [None]:
# Normalization - MinMaxScaler
# Standardization - StandardScaler
# Feature Transformation - PowerTransformer
# Categorical variable transformation - OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer, OneHotEncoder

#### Original Data

In [None]:
df2 = df_features.loc[:,['MedInc', 'HouseAge', 'AveBedrms', 'Population']]
df2.head()

In [None]:
df_features.loc[:,['MedInc', 'HouseAge', 'AveBedrms', 'Population']].describe().T

In [None]:
fig = plt.figure(figsize =(12,7))

for x,y in enumerate(['MedInc', 'HouseAge', 'AveBedrms', 'Population']):
    plt.subplot(2,3,x+1)
    plt.hist(x=df_features[y], bins=30)
    plt.title(y + ' distribution') 

#### Normalization 

In [None]:
# Scaling using sklearn

minmax = MinMaxScaler()
df_normalized = minmax.fit_transform(df2)
df_normalized = pd.DataFrame(df_normalized, columns=['MedInc', 'HouseAge', 'AveBedrms', 'Population'])
df_normalized.head()

In [None]:
df_normalized.describe().T

In [None]:
fig = plt.figure(figsize =(12,7))

for x,y in enumerate(df_normalized.columns.tolist()):
    plt.subplot(2,3,x+1)
    plt.hist(x=df_normalized[y], bins=30)
    plt.title(y + ' distribution') 

#### Standardization

In [None]:
# Scaling using sklearn

standard = StandardScaler()
standard.fit(df2)
df_standardized = standard.transform(df2)
df_standardized = pd.DataFrame(df_standardized, columns=['MedInc', 'HouseAge', 'AveBedrms', 'Population'])
df_standardized.head()

In [None]:
df_standardized.describe().T

In [None]:
fig = plt.figure(figsize =(12,7))

for x,y in enumerate(df_standardized.columns.tolist()):
    plt.subplot(2,3,x+1)
    plt.hist(x=df_standardized[y], bins=30)
    plt.title(y + ' distribution') 

Looking at the histograms of the original, normalized and standardized data, what do you observe about the scale and the shape of the feature distributions?

#### Transformation
- log transform
- squared transform
- Power transform (supports Box-Cox and Yeo-Johnson transformations)

Transformations help to make originally skewed data more normally distributed. The log transform is the most frequently used.

However, the log transform is only defined for values greater than zero, as log(0) is undefined. NumPy caters for zero values by providing np.log1p, which computes log(1 + x) and is defined for any x greater than -1.
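The difference between `np.log` and `np.log1p` can be seen on a small array that contains a zero. This is only a sketch on made-up values; `np.expm1` is included to show that the transform can be reversed.

```python
import numpy as np

# Made-up values, including a zero
values = np.array([0.0, 1.0, 9.0, 99.0])

# The plain log of zero is -inf (NumPy warns, so the warning is silenced here)
with np.errstate(divide="ignore"):
    plain_log = np.log(values)

# log1p computes log(1 + x), which is finite for the zero entry
shifted_log = np.log1p(values)

# expm1 computes exp(x) - 1 and therefore inverts log1p
restored = np.expm1(shifted_log)

print(plain_log)
print(shifted_log)
print(restored)
```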

#### log transformation

In [None]:
# np.log1p works element-wise on a DataFrame (DataFrame.applymap is deprecated in recent pandas)
df_log = np.log1p(df2[['MedInc', 'HouseAge', 'AveBedrms', 'Population']])
df_log.head()

In [None]:
df_log.describe().T

In [None]:
fig = plt.figure(figsize =(12,7))

for x,y in enumerate(['MedInc', 'HouseAge', 'AveBedrms', 'Population']):
    plt.subplot(2,3,x+1)
    plt.hist(x=df_log[y], bins=30)
    plt.title(y + ' distribution') 

#### Power Transformation

In [None]:
power = PowerTransformer()
power.fit(df2)
df_power = power.transform(df2)
df_power = pd.DataFrame(df_power, columns=['MedInc', 'HouseAge', 'AveBedrms', 'Population'])
df_power.head()

In [None]:
df_power.describe().T

In [None]:
fig = plt.figure(figsize =(12,7))

for x,y in enumerate(df_power.columns.tolist()):
    plt.subplot(2,3,x+1)
    plt.hist(x=df_power[y], bins=30)
    plt.title(y + ' distribution') 

#### Categorical Variable transformation
1. OneHot encoding
2. Creating categorical variables from numeric variables

Categorical variables need to be converted to numeric variables. Two approaches are label encoding and one-hot encoding. One-hot encoding is usually more appropriate for categories with no natural order, as label encoding introduces an order or some form of magnitude which may not exist in the dataset. E.g. a categorical feature that contains Low, Medium and High would be translated into 1, 2, 3 by label encoding, and into [1,0,0], [0,1,0], [0,0,1] by one-hot encoding.
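The two encodings can be compared side by side on the Low/Medium/High example. This sketch uses scikit-learn's `OrdinalEncoder` as a stand-in for label encoding of a feature column (an assumption of this example, not something prescribed above). Note that it numbers the categories from 0 rather than 1.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# A single categorical feature, one row per observation
levels = np.array([["Low"], ["Medium"], ["High"], ["Low"]])
order = [["Low", "Medium", "High"]]

# Integer codes: Low -> 0, Medium -> 1, High -> 2 (implies an order and a magnitude)
ordinal = OrdinalEncoder(categories=order)
as_integers = ordinal.fit_transform(levels)

# One indicator column per category, with no implied order
onehot = OneHotEncoder(categories=order)
as_onehot = onehot.fit_transform(levels).toarray()

print(as_integers.ravel())
print(as_onehot)
```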

In [None]:
retail_analytics_dataset.head()

In [None]:
retail_analytics_dataset.info()

In [None]:
# converting features to the appropriate variable type
retail_analytics_dataset['region'] = retail_analytics_dataset['region'].astype('category')
retail_analytics_dataset['order_date'] = pd.to_datetime(retail_analytics_dataset['order_date'])

In [None]:
retail_analytics_dataset.head()

In [None]:
retail_analytics_dataset.region.unique()

#### OneHot Encoding

In [None]:
onehotencoder = OneHotEncoder()

onehotencoder.fit(retail_analytics_dataset[['region']])
encoded = onehotencoder.transform(retail_analytics_dataset[['region']])

In [None]:
encoded.shape

In [None]:
encoded.toarray()

In [None]:
pd.get_dummies(retail_analytics_dataset['region']).head()

pandas provides the *pd.get_dummies* function and scikit-learn offers the *OneHotEncoder* class. What is the difference, and when is it appropriate to use each?

scikit-learn's OneHotEncoder lets us apply the encoding learned from the original data to a new dataset using the same fitted encoder. pd.get_dummies does not offer this: it derives its columns from whatever data it is given, so all the categorical data must be encoded together ahead of time.
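A small sketch of that difference, using invented region names rather than the retail dataset: the encoder is fitted on one frame and reused on another that contains an unseen category. `handle_unknown="ignore"` is chosen here so that the unseen value becomes an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented data: "west" appears only in the new data
train = pd.DataFrame({"region": ["north", "south", "east"]})
new = pd.DataFrame({"region": ["south", "west"]})

# Fit once on the original data, then reuse the same encoder on new data
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["region"]])
new_encoded = encoder.transform(new[["region"]]).toarray()

# Column layout is fixed by the training data: east, north, south
print(encoder.categories_[0])
print(new_encoded)

# get_dummies builds its columns from the data it is given: south, west
new_dummies = pd.get_dummies(new["region"])
print(new_dummies.columns.tolist())
```

The encoder output keeps the three training columns, while `pd.get_dummies` produces a different column set that a model trained on the original data would not accept.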

#### Creating categorical variables from numeric variables

Sometimes there is a need to create a categorical variable from a numeric variable, e.g. grouping businesses by revenue into small, medium and large scale enterprises. This can be achieved by creating bins from the numeric variable.
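The revenue example could be sketched with `pd.cut` as follows. The revenue figures and the bin thresholds are hypothetical and chosen only to illustrate the mechanics.

```python
import pandas as pd

# Hypothetical annual revenues
revenue = pd.Series([50_000, 750_000, 4_000_000, 20_000_000], name="revenue")

# Hypothetical thresholds: up to 1M is small, up to 10M is medium, above that is large
bins = [0, 1_000_000, 10_000_000, float("inf")]
size = pd.cut(revenue, bins=bins, labels=["small", "medium", "large"])

print(pd.concat([revenue, size.rename("size")], axis=1))
```

By default each bin includes its right edge, so a revenue of exactly 1,000,000 would fall in the small group.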

In [None]:
df_features[['Population']].describe().T

In [None]:
df_features.isna().sum()

In [None]:
pd.cut(df_features.Population, [0,1000,5000,35682])

In [None]:
# assign the binned values to a single new column
df_features['Population_group'] = pd.cut(df_features.Population, [0,1000,5000,35682], labels=['small', 'medium', 'large'])
df_features.head()

#### b. Feature Extraction & Creation

There are instances where you would like to create additional features from existing ones. This might mean extracting features from a date field, or creating features based on the interaction of already known fields.
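Both ideas are sketched here on a tiny invented frame. The column names mirror ones used in this notebook, but the values are made up, and `bedrooms_per_room` is just one possible interaction feature.

```python
import pandas as pd

# Invented rows that borrow column names used elsewhere in this notebook
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-01-02", "2021-06-15"]),
    "AveRooms": [6.0, 5.0],
    "AveBedrms": [1.5, 1.0],
})

# Extraction: pull calendar parts out of the date field
toy["order_month"] = toy["order_date"].dt.month
toy["order_is_weekend"] = toy["order_date"].dt.dayofweek >= 5  # 5 = Saturday, 6 = Sunday

# Creation: combine two known fields into an interaction feature
toy["bedrooms_per_room"] = toy["AveBedrms"] / toy["AveRooms"]

print(toy)
```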

##### Datetime features

In [None]:
retail_analytics_dataset.head()

In [None]:
# function to create date features
def quarter_checker(x):
    if x in [1,2,3]:
        # Qtr 1
        return 1
    elif x in [4,5,6]:
        # Qtr 2
        return 2
    elif x in [7,8,9]:
        # Qtr 3
        return 3
    else:
        # Qtr 4
        return 4
    
public_holidays_in_brazil = []

def date_feature_engineering(df2, column_name, drop_others=False):
    # work on a copy so the caller's DataFrame is left untouched
    df = df2.copy()
    if drop_others:
        # keep only the date column
        df = df[[column_name]].copy()
    dates = df[column_name]

    # calendar parts
    df[column_name + '_year'] = dates.dt.year
    df[column_name + '_month'] = dates.dt.month
    df[column_name + '_week'] = dates.dt.isocalendar().week
    df[column_name + '_day'] = dates.dt.day
    df[column_name + '_dayofweek'] = dates.dt.dayofweek

    # boolean flags (dayofweek: 5 = Saturday, 6 = Sunday)
    df[column_name + '_ismonth_start'] = dates.dt.day == 1
    df[column_name + '_isweekend'] = dates.dt.dayofweek.isin([5, 6])

    # quarter, named after the date column rather than hard-coded
    df[column_name + '_quarter'] = dates.dt.month.apply(quarter_checker)

    return df

In [None]:
retail_new_features = date_feature_engineering(retail_analytics_dataset,'order_date')
retail_new_features.head()

#### Building ML Models using the Different Transformations

In [None]:
df_features.head()

In [None]:
df_target, dataset.target_names

In [None]:
df_features.isnull().sum()

In [None]:
df_features.info()

In [None]:
from sklearn.model_selection import train_test_split

df_updated = pd.get_dummies(df_features, columns=['Population_group'])
df_updated.head()

In [None]:
df_updated.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_updated, df_target, test_size=0.30, random_state=42)

In [None]:
# a simple regression model with no transformation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

reg = LinearRegression()

reg.fit(X_train, y_train)

print('model without scaling\n')
print('score:', reg.score(X_train, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test))** 0.5)

In [None]:
X_train.columns

In [None]:
# a simple regression model with Normalization (MinMaxScaler)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# transform data
minmax = MinMaxScaler()
minmax.fit(X_train)

X_train_transform = minmax.transform(X_train)
X_test_transform = minmax.transform(X_test)



reg = LinearRegression()



reg.fit(X_train_transform, y_train)

print('model  with Normalization (MinMaxScaler)\n')
print('score:', reg.score(X_train_transform, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train_transform))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test_transform))** 0.5)

In [None]:
X_train_transform

In [None]:
# a simple regression model with Standardization (StandardScaler)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error


# transform data
standard = StandardScaler()
standard.fit(X_train)

X_train_transform_s = standard.transform(X_train)
X_test_transform_s= standard.transform(X_test)



reg = LinearRegression()



reg.fit(X_train_transform_s, y_train)

print('model  with Standardization (StandardScaler)\n')
print('score:', reg.score(X_train_transform_s, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train_transform_s))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test_transform_s))** 0.5)

#### Features Interactions

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
poly.fit(X_train_transform)

X_train_poly = poly.transform(X_train_transform)
X_test_poly = poly.transform(X_test_transform)

print('shape (no interaction): ', X_train_transform.shape)
print('shape (interaction): ', X_train_poly.shape)


reg = LinearRegression()


reg.fit(X_train_poly, y_train)

print('model  with interaction \n')
print('score:', reg.score(X_train_poly, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train_poly))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test_poly))** 0.5)

In [None]:
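# print out the names of the generated interaction features
# (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
print(poly.get_feature_names_out())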

### Feature Selection

Several approaches exist for feature selection. We demonstrate stepwise (sequential) selection in this notebook. Other approaches include:
- Removing features with low variance
- Leveraging the feature importance property of a random forest
- Leveraging a regularized linear regression model
- Dimensionality reduction techniques like PCA
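Two of the approaches listed above can be sketched on synthetic data: dropping a zero-variance column with `VarianceThreshold`, and inspecting the coefficients of a Lasso (L1-regularized) regression. The data, the `alpha` value and the variable names are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Lasso

# Synthetic data: one informative column, one noise column, one constant column
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=200),   # informative
    rng.normal(size=200),   # pure noise
    np.full(200, 3.0),      # constant, carries no information
])
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Low-variance filter: the constant column is removed
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(selector.get_support())

# Regularized linear model: L1 shrinks the coefficients of unhelpful features toward zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_reduced, y)
print(lasso.coef_)
```

Features whose Lasso coefficient is close to zero are candidates for removal. The right `alpha` depends on the data and is normally tuned with cross-validation.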

#### Sequential Feature Selection


In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
poly.fit(X_train_transform)

X_train_poly = poly.transform(X_train_transform)
X_test_poly = poly.transform(X_test_transform)

print('shape (no interaction): ', X_train_transform.shape)
print('shape (interaction): ', X_train_poly.shape)


reg = LinearRegression()

sfs = SequentialFeatureSelector(reg, scoring='neg_root_mean_squared_error')

sfs.fit(X_train_poly, y_train)


In [None]:
sfs.get_support()

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
poly.fit(X_train_transform)

X_train_poly = poly.transform(X_train_transform)
X_test_poly = poly.transform(X_test_transform)

#reduce X to the selected features
X_train_poly_reduced = sfs.transform(X_train_poly) 
X_test_poly_reduced = sfs.transform(X_test_poly) 

print('shape (no interaction): ', X_train_transform.shape)
print('shape (interaction): ', X_train_poly.shape)
print('shape (interaction with feature selection): ', X_train_poly_reduced.shape)


reg = LinearRegression()
reg.fit(X_train_poly_reduced, y_train)

print('model  with interaction and feature selection\n')
print('score:', reg.score(X_train_poly_reduced, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train_poly_reduced))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test_poly_reduced))** 0.5)

#### Feature Importance using Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(random_state=42)



forest.fit(X_train_transform, y_train)

print('model  - Random Forest \n')
print('score:', forest.score(X_train_transform, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, forest.predict(X_train_transform))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, forest.predict(X_test_transform))** 0.5)

In [None]:
features_importance = pd.DataFrame(forest.feature_importances_, X_train.columns).reset_index()
features_importance.columns =['feature', 'measure']
features_importance=features_importance.sort_values('measure', ascending=True)
features_importance

In [None]:
plt.barh('feature','measure', data=features_importance)
plt.title('Features Importance');

#### Building a Regression Model using the Important Features Identified by Random Forest

In [None]:
# a simple regression model using only the important features (scaled with MinMaxScaler)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# transform data
minmax = MinMaxScaler()

importance_features = ['MedInc', 'AveOccup','Latitude', 'Longitude', 'HouseAge', 'AveRooms', 'Population', 'AveBedrms']

X_train2 = X_train.loc[:, importance_features]
X_test2 = X_test.loc[:,importance_features]

minmax.fit(X_train2)

X_train_transform2 = minmax.transform(X_train2)
X_test_transform2 = minmax.transform(X_test2)



reg = LinearRegression()



reg.fit(X_train_transform2, y_train)

print('model  with Important Features Only\n')
print('score:', reg.score(X_train_transform2, y_train))

print ('\ntraining rmse')
print(mean_squared_error(y_train, reg.predict(X_train_transform2))** 0.5)


print ('\ntest rmse')
print(mean_squared_error(y_test, reg.predict(X_test_transform2))** 0.5)

**Reference materials**
1. [Kaggle Feature Engineering Course](https://www.kaggle.com/learn/feature-engineering)
2. [Topic 6. Feature Engineering and Feature Selection](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection)
3. [Feature Selection using sklearn](https://scikit-learn.org/stable/modules/feature_selection.html)
4. [Featuretools](https://www.featuretools.com/)