In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Feature Engineering for Continuous Variables 
In this notebook we will cover scaling, transformations, and interaction features. This is a companion workbook for the 365 Data Science course on ML Process. The in-depth explanations, theory, and pros/cons for each of these techniques can be found there. 

## Feature Scaling
Feature scaling is important when we are using models with a distance metric. If our features are on different scales, the features with larger ranges can dominate the distance calculation. 
- Absolute Max Scaling
- MinMax Scaling
- Z-Score Normalization (Standard Scaler)
- Robust Scaler 
## Transformations 
- Logarithmic 
- Square Root 
- Exponential
- Box-Cox
## Interaction Features
- Arithmetic Interaction
- Binning
- Creative Features 



In [None]:
df = pd.read_csv('/kaggle/input/craigslist-carstrucks-data/vehicles.csv')

#let's add a column for car age that will help us later on: 
df['car_age'] = df['year'].max() - df['year']

In [None]:
df.columns

In [None]:
df.describe()
# Columns we may want to normalize 
# Price, Year, Odometer

In [None]:
#let's just use a few features to create an example model and remove nulls. Learn more about different imputation techniques in this other companion notebook. 
#pd.get_dummies() creates dummy variables for the categorical features (see this notebook for more on that)
df_example = pd.get_dummies(df.loc[:,['price','car_age','odometer','manufacturer','condition']].dropna())


In [None]:
from sklearn.model_selection import train_test_split

X = df_example.drop('price',axis =1 )
y = df_example[['price']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
df.price.plot.box()

In [None]:
#df.price.plot.box()
df.car_age.plot.box()
#df.odometer.plot.box()

In [None]:
df.odometer.plot.box()

# Feature Scaling
Feature scaling is important when we are using models with a distance metric. If our features are on different scales, the features with larger ranges can dominate the distance calculation. 
- Absolute Max Scaling
- MinMax Scaling
- Z-Score Normalization (Standard Scaler)
- Robust Scaler 
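To see why scale matters for a distance metric, here is a minimal sketch using three made-up cars (the values are hypothetical, not taken from the dataset). Without scaling, the odometer column is so much larger than the age column that it decides almost the entire Euclidean distance.

```python
import numpy as np

# Hypothetical cars described as (car_age in years, odometer in miles)
car_a = np.array([2.0, 50_000.0])
car_b = np.array([15.0, 50_500.0])  # 13 years older, almost the same mileage
car_c = np.array([2.0, 52_000.0])   # same age, slightly higher mileage

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

dist_ab = euclidean(car_a, car_b)
dist_ac = euclidean(car_a, car_c)

# car_b looks "closer" to car_a than car_c does, even though it is 13 years older,
# because the age difference is tiny next to the mileage difference
print("distance a-b:", dist_ab)
print("distance a-c:", dist_ac)
```

A nearest-neighbor model fed these raw values would effectively ignore car age, which is the problem the scalers in this section try to address.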


## Absolute Maximum Scaling
Absolute maximum scaling takes the maximum absolute value of each feature and divides the raw data by it, so every value ends up between -1 and 1.

Absolute max scaling works best if our data doesn't have massive outliers. In this case, we would likely want to remove outliers from price and odometer first. It also keeps the same shape of distribution as the original data. Here we apply it to the training features, which include car age (derived from the year data). 
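As a quick check of the arithmetic, here is a small sketch of absolute max scaling done by hand with NumPy on a made-up column. The values are hypothetical and only illustrate the division step.

```python
import numpy as np

# Made-up single feature column, including one negative value
values = np.array([10.0, 20.0, 40.0, -80.0])

# Divide every value by the largest absolute value in the column
abs_max = np.abs(values).max()
scaled = values / abs_max

print(scaled)
```

The sign of each value is preserved and the value with the largest magnitude maps to exactly 1 or -1.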

In [None]:
from sklearn.preprocessing import MaxAbsScaler

#Scale data 
df_am = MaxAbsScaler().fit_transform(X_train)

#convert to dataframe to see table
df_am = pd.DataFrame(df_am, columns = X_train.columns)

#obvious problems with outliers regarding price & odometer 

# Min Max Scaling
Another simple form of scaling is called min max. Min max scaling rescales all our data points to fall between 0 and 1. We subtract the min from the raw data and then divide by the max minus the min: x_scaled = (x - min) / (max - min). 

Again, this approach is not robust to outliers.
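The sketch below applies the min max formula by hand to a few made-up odometer readings, then repeats it with one extreme value added. The numbers are hypothetical and are only meant to show how a single outlier squeezes everything else toward 0.

```python
import numpy as np

def min_max(x):
    """Rescale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Made-up odometer readings in miles
odometer = np.array([20_000.0, 45_000.0, 60_000.0, 90_000.0])
scaled = min_max(odometer)

# Same readings plus one extreme outlier
with_outlier = min_max(np.append(odometer, 10_000_000.0))

print(scaled)
print(with_outlier)
```

With the outlier present, the four ordinary readings all land below 0.01, so the differences between them nearly vanish.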

In [None]:
from sklearn.preprocessing import MinMaxScaler
df_min_max = MinMaxScaler().fit_transform(X_train)
df_min_max = pd.DataFrame(df_min_max, columns = X_train.columns)

# Z Score Normalization (Standard Scaling)

Another approach is standardization which transforms the data into the z-score, where the mean is zero and the standard deviation is 1.

This approach is less sensitive to outliers than min max scaling, but it can still have issues if outliers cause massive changes to the mean and standard deviation. It also works best when the data is roughly normally distributed, which is inaccurate for some of our data (Year).
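For reference, a z-score can be computed by hand as in the sketch below. The car ages are made up. Note that NumPy's default `std()` is the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up car ages in years
car_age = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# z = (x - mean) / standard deviation
z = (car_age - car_age.mean()) / car_age.std()

print(z)
print("mean:", z.mean(), "std:", z.std())
```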

In [None]:
from sklearn.preprocessing import StandardScaler
df_std = X_train.copy()
#only scale numeric variables in this case rather than the dummy variables for categories 
df_std.loc[:,['car_age','odometer']] = StandardScaler().fit_transform(df_std.loc[:, ['car_age','odometer']])
df_std

# Robust Scaler
With Robust Scaler, we’re subtracting the median and then scaling the column by the IQR.

This is the approach most robust to outliers that we will cover.
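Here is a rough sketch of the same calculation by hand on made-up prices with one extreme value, compared against a z-score. It assumes the 25th and 75th percentiles for the IQR, which is the default `quantile_range` of `RobustScaler`.

```python
import numpy as np

# Made-up prices with one extreme outlier
price = np.array([5_000.0, 8_000.0, 10_000.0, 12_000.0, 15_000.0, 3_000_000.0])

# Robust scaling: subtract the median, divide by the interquartile range
median = np.median(price)
q1, q3 = np.percentile(price, [25, 75])
robust = (price - median) / (q3 - q1)

# z-score for comparison: the outlier inflates both the mean and the standard deviation
z = (price - price.mean()) / price.std()

print(robust)
print(z)
```

The five ordinary prices stay well separated after robust scaling, while their z-scores are nearly identical because the outlier dominates the standard deviation.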

In [None]:
from sklearn.preprocessing import RobustScaler
df_rob = X_train.copy()
#only scale numeric variables in this case rather than the dummy variables for categories 
df_rob.loc[:,['car_age','odometer']] = RobustScaler().fit_transform(df_rob.loc[:, ['car_age','odometer']])
df_rob

In [None]:
#let's do a simple example where we compare results with the different feature scaling techniques. We will remove the categorical data for this. 

#the model we will be using is K Nearest Neighbors, which can use Euclidean distance. 

#we will use car_age and odometer to predict price 

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler

num_cols = ['car_age','odometer']

def knn_predict(scaler=None):
    #fit the scaler on the training data only, then apply the same transform to the test data
    #(a model trained on scaled features must also predict on scaled features)
    X_tr, X_te = X_train.loc[:,num_cols], X_test.loc[:,num_cols]
    if scaler is not None:
        X_tr = scaler.fit_transform(X_tr)
        X_te = scaler.transform(X_te)
    neigh = KNeighborsRegressor(n_neighbors=3)
    neigh.fit(X_tr, y_train)
    return neigh.predict(X_te)

#no scaling 
pred = knn_predict()

#absolute max 
am_pred = knn_predict(MaxAbsScaler())

#min max (matches absolute max when each column's minimum is 0)
mm_pred = knn_predict(MinMaxScaler())

#standard (z score)
std_pred = knn_predict(StandardScaler())

#robust scaler 
rob_pred = knn_predict(RobustScaler())



In [None]:
print('No Scaling: %.3f' % mean_absolute_error(y_test,pred))
print('Absolute Max Score: %.3f' % mean_absolute_error(y_test,am_pred))
print('Min Max Score: %.3f' % mean_absolute_error(y_test,mm_pred))
print('Standard Scaling Score: %.3f' % mean_absolute_error(y_test,std_pred))
print('Robust Scaler Score: %.3f' % mean_absolute_error(y_test,rob_pred))


# Transformations 
A data transformation is the process of using a mathematical expression to change the shape of our data's distribution. As we mentioned before, some models need data to fit a specific type of distribution to produce optimal results. Unfortunately, the data we get in the real world doesn't always fit the distributions our models call for. 

Let's look at the shape of our data and if it has any outliers before we do our transforms

In [None]:
# visual of the distribution of the odometer without any outlier removal (see boxplots above)
#data is clearly impacted heavily by outliers 
print("max odometer: " + str(df_example['odometer'].max()))
print("median odometer: " + str(df_example['odometer'].median()))

df_example['odometer'].hist(bins=50)


In [None]:
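#shape of the data after very basic outlier removal (kept only data < 99th percentile)
#clear right skew in data 
#build the mask from df_example itself so its index matches the rows being filtered
df_example[df_example['odometer']<df_example['odometer'].quantile(.99)]['odometer'].hist(bins=50)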

In [None]:
# visual of the distribution of the price without any outlier removal (see boxplots above)
#data is clearly impacted heavily by outliers 
print("max price: " + str(df_example['price'].max()))
print("median price: " + str(df_example['price'].median()))

df_example['price'].hist(bins=50)

In [None]:
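#shape of the data after very basic outlier removal (kept only data < 99th percentile)
#clear right skew in data 
#build the mask from df_example itself so its index matches the rows being filtered
df_example[df_example['price']<df_example['price'].quantile(.99)]['price'].hist(bins=50)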

In [None]:
# car_age was already computed when we loaded the data (newest model year minus the car's year),
# so we can look at it directly; subtracting it from its max again would flip the ages

print("max age: " + str(df_example['car_age'].max()))
print("median age: " + str(df_example['car_age'].median()))

df_example['car_age'].hist(bins=50)

# Logarithmic Transformation
A very common type of transformation is the log transformation. Log transformations are closely related to the family of power transformations (the log is the Box-Cox case where lambda is 0). Typically, we apply logarithmic transformations when our variables are heavily right skewed, often driven by a few outliers. 

Let's see how these transformations impact some of our skewed data (Odometer & Price)

### Transforms we will cover
- Logarithmic 
- Exponential
- Square Root 
- Box-Cox
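Before applying these to the car data, here is a small sketch of what a log transform does to right-skewed values. It uses synthetic log-normal data rather than the dataset, and the `skewness` helper is a hypothetical name written for this illustration.

```python
import numpy as np

# Synthetic right-skewed "prices" (log-normal), not the real dataset
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=9.0, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = x - x.mean()
    return float((centered ** 3).mean() / x.std() ** 3)

before = skewness(skewed)
after = skewness(np.log1p(skewed))  # log(x + 1), safe when x contains zeros

print("skew before:", before)
print("skew after log:", after)
```

A skewness near 0 means the distribution is roughly symmetric, while a large positive value means a long right tail.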

In [None]:
from sklearn.preprocessing import FunctionTransformer

def log_transform(x):
    return np.log(x + 1)

transformer_log = FunctionTransformer(log_transform)
transformed_log = transformer_log.fit_transform(X_train)

In [None]:
transformer_logp = FunctionTransformer(log_transform)
transformed_logp = transformer_logp.fit_transform(y_train)

In [None]:
#as you can see, using log transform in this case actually creates some right skew. 
#It does however almost completely normalize the outliers that were present

transformed_log['odometer'].hist(bins = 100)

In [None]:
transformed_log['car_age'].hist(bins = 20)

In [None]:
#as you can see, using log transform in this case actually creates some right skew. 
#It does however almost completely normalize the outliers that were present

transformed_logp.hist(bins =100)

# Square Root Transform
Square root transformations compress the spread of your larger values while spreading out your lower values. Log transformations have a similar effect but are much more aggressive.
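A quick numeric comparison on made-up values shows the difference in how hard each transform compresses large numbers.

```python
import numpy as np

# Made-up values spanning six orders of magnitude
x = np.array([1.0, 100.0, 10_000.0, 1_000_000.0])

sqrt_x = np.sqrt(x)  # 1, 10, 100, 1000
log_x = np.log(x)    # natural log grows far more slowly

print(sqrt_x)
print(log_x)
```

The largest value is a million times the smallest before transforming, a thousand times after the square root, and the log values span less than 14 units in total.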

In [None]:
def sqrt_transform(x):
    return np.sqrt(x)

transformer_sqrt = FunctionTransformer(sqrt_transform)
transformed_sqrt = transformer_sqrt.fit_transform(X_train)

In [None]:
transformer_sqrtp = FunctionTransformer(sqrt_transform)
transformed_sqrtp = transformer_sqrtp.fit_transform(y_train)

In [None]:
transformed_sqrt['odometer'].hist(bins = 100)

In [None]:
transformed_sqrt['car_age'].hist(bins = 20)

In [None]:
transformed_sqrtp.hist(bins = 100)

In [None]:
transformer_sqrtp = FunctionTransformer(sqrt_transform)
transformed_sqrtp = transformer_sqrtp.fit_transform(y_train[y_train['price'] < y_train['price'].quantile(.99)])

In [None]:
transformed_sqrtp.hist(bins=50) 

# Exponential Transforms (needs work)
A close cousin of the log transform is the exponential transformation. Anytime you apply a log transform to your dataset, you can apply an exponential transformation to revert it back to the original value. 
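As a small sketch of that round trip on made-up odometer readings: because the log transform used earlier in this notebook is log(x + 1), its exact inverse is exp(x) - 1, which NumPy provides as `np.expm1`.

```python
import numpy as np

# Made-up odometer readings, including a zero
odometer = np.array([0.0, 1_500.0, 82_000.0, 250_000.0])

logged = np.log1p(odometer)  # log(x + 1)
restored = np.expm1(logged)  # exp(x) - 1 undoes log(x + 1)

print(logged)
print(restored)
```

This is handy when a model is trained on a log-transformed target and the predictions need to be converted back to the original units.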

In [None]:
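def exp_transform(x):
    #inverse of log_transform above: exp(x) - 1
    return np.expm1(x)

transformer_exp = FunctionTransformer(exp_transform)
#apply to the log-transformed data; np.exp on the raw odometer values overflows to inf
transformed_exp = transformer_exp.fit_transform(transformed_log)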

In [None]:
transformed_exp['odometer']

# Box-Cox Transformation
The Box-Cox transformation helps your data follow a more normal distribution. The benefit is that this opens up statistical tests and models that assume normality. 

Box-Cox is an interesting transformation because it essentially aggregates multiple power transformations into a single transformer, controlled by a parameter called lambda. Lambda is typically searched over the range -5 to 5. Box-Cox requires strictly positive input values. 
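To make the role of lambda concrete, here is a minimal sketch of the standard Box-Cox formula written by hand. The `box_cox` function is a hypothetical helper for illustration. `PowerTransformer` applies the same family but estimates lambda for you by maximum likelihood.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x with a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)          # lambda = 0 is the log transform
    return (x ** lam - 1.0) / lam  # general power-transform case

values = np.array([1.0, 4.0, 9.0])

print(box_cox(values, 1.0))  # lambda = 1: only shifts the data by -1
print(box_cox(values, 0.5))  # lambda = 0.5: a rescaled square root
print(box_cox(values, 0.0))  # lambda = 0: natural log
```

Different lambda values recover the familiar transforms, which is why Box-Cox can be thought of as one dial covering several power transformations.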

In [None]:
X_train.min()

In [None]:
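from sklearn.preprocessing import PowerTransformer

#Box-Cox only accepts strictly positive values, and X_train.min() above shows zeros,
#so we shift the numeric columns by 1 and leave the 0/1 dummy columns out
transformer_bc = PowerTransformer(method ='box-cox')
transformed_bc = transformer_bc.fit_transform(X_train.loc[:,['car_age','odometer']] + 1)

#lambda values fitted for car_age and odometer
transformer_bc.lambdas_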

# Interaction Features
Like a chef remixing ingredients, data scientists have many different ways to engineer new features from existing variables. Here are a few common methods:

- Arithmetic Interaction (addition, subtraction, division, or multiplication of variables)
- Binning (grouping variables in ranges)
- Creative Features (alternative metrics for evaluation)
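A minimal sketch of the first two ideas on a tiny made-up table. The column names `miles_per_year` and `age_bin`, the bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Tiny made-up table using the same kinds of columns as the car data
cars = pd.DataFrame({
    'car_age': [0, 2, 8, 20],
    'odometer': [5_000, 30_000, 90_000, 210_000],
})

# Arithmetic interaction: miles driven per year (add 1 to avoid dividing by zero)
cars['miles_per_year'] = cars['odometer'] / (cars['car_age'] + 1)

# Binning: group car_age into ranges (0-3, 3-10, 10-100 years)
cars['age_bin'] = pd.cut(
    cars['car_age'],
    bins=[0, 3, 10, 100],
    labels=['new', 'mid', 'old'],
    include_lowest=True,  # so that an age of exactly 0 falls in the first bin
)

print(cars)
```

Whether features like these help depends on the model and the data, so it is worth comparing results with and without them.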