# Exploration and Regression on the Auto MPG Data Set
Kernel by [chmaxx](https://www.kaggle.com/chmaxx) – October 2019

My personal goal with this kernel is to use this simple dataset to learn more about: 
- different **scaling methods** (e.g. standard scaling, power transforms, box-cox)
- the various **error and scoring metrics** and **how scikit-learn implements these**
- **dimensionality reduction and clustering with PCA and t-SNE** to get hidden insights about the data

---
>*If you use parts of this notebook in your own scripts or kernels, please give credit (for example link back to this, upvote or send flowers). Thanks! 
And I very much appreciate your feedback or comments! Thanks for that too. 👍*

---

## **For what was the dataset created?**

According to [the dataset source...](https://archive.ics.uci.edu/ml/datasets/auto+mpg) 

>the **«dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.»**

The dataset is a slightly modified version where 8 samples with missing values for the "mpg" feature have been removed.

According to data scientist Ross Quinlan the data **«concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.»**

## **Problem Definition**

We are asked to **predict miles per gallon («mpg») of various car models** from describing features. We have a dataset with 398 samples.

><span style="color:darkgreen">**Our goal is to use the Auto MPG data to build a machine learning model that can predict the fuel consumption of a given car.**

The data in the training set includes the fuel consumption, which makes this **a supervised regression machine learning task**:

>**Supervised**: We have access to both the features and the target and our goal is to train a model that can learn a mapping between the two.
>**Regression**: The fuel consumption is a continuous variable.

## **Machine Learning Methodology – step by step**

We will follow these steps:

1. Exploratory data analysis (EDA)
2. Data cleaning and formatting
3. Try and compare various machine learning models on a performance metric
4. Perform hyperparameter tuning for the most promising model

During **data cleaning and formatting** we will especially take care of:

- Missing values
- Wrong values
- Wrong datatypes
- Outliers
- Skewed distributions



# 1. Import libraries and set globals

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import matplotlib.cm as cm
import seaborn as sns

import pandas as pd
import pandas_profiling
import numpy as np
from numpy import percentile
from scipy import stats
from scipy.stats import skew
from scipy.special import boxcox1p

import os, sys
import re
from tabulate import tabulate

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.metrics import explained_variance_score
from sklearn.metrics import max_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor 
from sklearn.ensemble import GradientBoostingRegressor

import xgboost as xgb
import lightgbm as lgb

from umap import UMAP

import warnings
warnings.filterwarnings('ignore')

plt.rc('font', size=18)        
plt.rc('axes', titlesize=22)      
plt.rc('axes', labelsize=18)      
plt.rc('xtick', labelsize=12)     
plt.rc('ytick', labelsize=12)     
plt.rc('legend', fontsize=12)   

plt.rcParams['font.sans-serif'] = ['Verdana']

pd.options.mode.chained_assignment = None
pd.options.display.max_seq_items = 500
pd.options.display.max_rows = 500
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# **2. Exploratory data analysis**

## **First look into the data**


In [None]:
df = pd.read_csv("../input/autompg-dataset/auto-mpg.csv")
df.head()

- We have 9 columns / features. 
- `mpg` is the target variable that we want to predict. 
- So we have **8 features to predict from**.

In [None]:
# `null_counts` was renamed to `show_counts` in pandas 1.2 and later removed
df.info(verbose=True, show_counts=True)

According to the dataset description we have:

- 5 **continuous features**: `mpg` (the target), `displacement`, `horsepower`, `weight`, `acceleration`
- 4 **categorical features**: `cylinders` and `model_year` (ordinal), `origin` and `name`

## **Data cleaning and preprocessing**
In a first step we do some preprocessing and cleaning in order to be able to understand the dataset.

In [None]:
# just for good measure: reduce the memory footprint 
# by setting floats and ints to 32bit

for col in df.columns:
    if df[col].dtype =='float64': df[col] = df[col].astype('float32')
    if df[col].dtype =='int64': df[col] = df[col].astype('int32')
        
# remove spaces from column names for `car name` and `model year`
df.rename({"car name" : "name",
          "model year" : "model_year"}, axis=1, inplace=True)

# convert `origin` back to actual names for now
car_origin = {1 : "usa", 2 : "europe", 3 : "japan"}
df["origin"] = df.origin.map(car_origin)

Do we have duplicates? No, we don't.

In [None]:
print(df.shape)
print(df.drop_duplicates().shape)

Now checking for missing values in more detail.

In [None]:
missing = [(c, df[c].isna().mean()*100) for c in df]
missing = pd.DataFrame(missing, columns=["feature", "percentage"])
missing = missing[missing.percentage > 0]
display(missing.sort_values("percentage", ascending=False))

We don't seem to have missing values. However, let's look more closely at `horsepower`.

In [None]:
df.horsepower.unique()

> There are question marks ("?") amongst the values. These likely represent missing values. Let's fix this by filling in the mean.

In [None]:
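# use `==` here: `is` checks object identity and is not reliable for strings
df.horsepower = df.horsepower.apply(lambda x: np.nan if x == "?" else x)
df.horsepower = df.horsepower.astype("float32")
# assign the result instead of relying on inplace=True on a column accessor
df.horsepower = df.horsepower.fillna(df.horsepower.mean())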
df.info()
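
As an alternative to the lambda in the cell above, `pd.to_numeric` with `errors="coerce"` turns every unparseable entry into `NaN` in one step. The following is a minimal sketch on a made-up toy column (the values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the raw `horsepower` column (values are made up).
raw = pd.Series(["130", "165", "?", "150"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
hp = pd.to_numeric(raw, errors="coerce")

# Fill the gap with the mean of the observed values, as in the cell above.
hp_filled = hp.fillna(hp.mean())
print(hp_filled.tolist())
```

This variant also catches placeholders other than "?", which may or may not be what you want.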

On to preparing the categorical features. 

- `model_year` is ordinal. So we simply stick to dtype `int`.
- `origin` is a true categorical. We need to one hot encode it later.
- `name` actually contains two useful bits of information: the car's manufacturer and the model. We split that into two new columns.

In [None]:
# split() with expand=True yields one column per list element
# we only split on the first space by setting n=1
df[["manufacturer", "model"]] = df["name"].str.split(" ", n=1, expand=True)
df.drop("name", axis=1, inplace=True)
df.head(1).T

Are the categorical entries correct? No they aren't...

In [None]:
print(sorted(df.manufacturer.unique()))

There are several errors in the manufacturers' names, e.g. «vokswagen», «maxda» etc. We fix these by replacing the wrong entries. We also fix some synonyms like «vw».

In [None]:
errors = {
         "vokswagen" : "volkswagen", 
         "vw" : "volkswagen", 
         "toyouta" : "toyota", 
         "mercedes-benz" : "mercedes", 
         "chevroelt": "chevrolet",
         "chevy" : "chevrolet", 
         "maxda" : "mazda"
         }

df.manufacturer = df.manufacturer.map(errors).fillna(df.manufacturer)
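
The `map(...).fillna(...)` idiom above works because `.map()` returns `NaN` for every value that is not a key of the dictionary. A small sketch on a made-up toy series shows this, along with `.replace()` as a possible one-step alternative:

```python
import pandas as pd

makers = pd.Series(["vw", "ford", "maxda", "toyota"])  # toy values
fixes = {"vw": "volkswagen", "maxda": "mazda"}

# .map() returns NaN for values that are missing from the dict ...
mapped = makers.map(fixes)
# ... so .fillna() restores the original value wherever no fix applied.
cleaned = mapped.fillna(makers)

# .replace() with a dict leaves unmatched values untouched in a single step.
cleaned_alt = makers.replace(fixes)
print(cleaned.tolist())
```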

Some model names also seem redundant or wrong. 

Hand checking would be way too time-consuming, so we only improve this feature by brute force: we remove all special characters.

In [None]:
def alphanumeric(x):
    return re.sub('[^A-Za-z0-9]+', '', (str(x)))

df["model"] = df.model.apply(lambda x: alphanumeric(x))

## **Examine and plot categorical features**

Let's plot bar graphs for the categoricals to gain more insight.

In [None]:
print("Origin")
print(tabulate(pd.DataFrame(df.origin.value_counts())))

plt.figure(figsize=(16,5));
df.groupby("origin")["origin"].count().sort_values(ascending=False).plot(kind="bar")
plt.title("Origin")
plt.ylabel("Count")
plt.xlabel("Country")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"Cars from {df.manufacturer.nunique()} manufacturers (Top10)")
print(tabulate(pd.DataFrame(df.manufacturer.value_counts()[:10])))

plt.figure(figsize=(16,5));
df.groupby("manufacturer")["manufacturer"].count().sort_values(ascending=False).plot(kind="bar")
plt.title("Manufacturers")
plt.ylabel("Count")
plt.xlabel("Manufacturer")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"{df.model.nunique()} car models (Top 10)")
print(tabulate(pd.DataFrame(df.model.value_counts()[:10])))

plt.figure(figsize=(16,5));
df.groupby("model")["model"].count().sort_values(ascending=False)[:20].plot(kind="bar")
plt.title("Car models (Top 20 plotted)")
plt.ylabel("Count")
plt.xlabel("Car models")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

plt.figure(figsize=(16,5));
df.groupby("model_year")["model_year"].count().sort_index().plot(kind="bar")
plt.title("Car models per year")
plt.ylabel("Count")
plt.xlabel("Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

So after the first exploration – **our data in a nutshell:**

- 398 different cars from 1970 to 1982
- 296 distinct models from 30 manufacturers, a few of which dominate, such as Ford and Chevrolet.
- Cars mainly from the USA (249), far fewer from Japan (79) and Europe (70)
- 1973 and 1978 seem a little bit stronger years with more samples.


Quick reexamination: Why do we have just 296 models but 398 samples?

Easy answer: Cars have different technical properties depending on the year they were built. E.g. the Ford Pinto has six different flavours... 

In [None]:
df[df.model == "pinto"]

## **Examine and plot numerical features**

We now have a look at the basic stats of our numerical features.

In [None]:
stats_ = df.describe().T.drop(["count", "25%", "75%"], axis=1)
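# restrict skew() to numeric columns; newer pandas versions raise on object columns
stats_ = pd.concat([stats_, df.select_dtypes("number").skew()], axis=1)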
stats_.columns = ["mean", "std", "min", "median", "max", "skew" ]
cols = ["mean", "median", "std", "skew", "min", "max"]
stats_ = stats_[cols]
print(tabulate(stats_, headers="keys", floatfmt=".1f"))

Again a quick summary:

- The range of our target `mpg` is between 9 and 46.6.
- Cars seem to have a wide range of technical specs.
- All features are positively skewed. 
- `horsepower` seems significantly more skewed than the other features. 

So let's plot all distributions in comparison.

In [None]:
for feature in df.select_dtypes("number").columns:
    if feature == "model_year":
        continue
    plt.figure(figsize=(16,5))
    #sns.distplot(df[feature], hist_kws={"rwidth": 0.9})
    #plt.xlim(df[feature].min(), df[feature].max())
    df[feature].plot(kind="hist", rwidth=0.9, bins=50)
    plt.title(f"{feature.capitalize()}")
    plt.tight_layout()
    plt.show()

And indeed: `horsepower` appears more skewed than the other variables. `acceleration` in turn almost looks normally distributed.

## **Check for outliers**

In the next step we check for outliers that might negatively influence our regression algorithms. **But what exactly is an outlier?** And **can we isolate the outliers with statistical methods?**

[Taken from machinelearningmastery.com:](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)

>**An outlier is an observation that is unlike the other observations.** It is rare, or distinct, or does not fit in some way. 

>_Outliers can have many causes, such as: Measurement or input error, data corruption or true outlier observation (e.g. Michael Jordan in basketball).
**There is no precise way to define and identify outliers in general because of the specifics of each dataset.** Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.
Nevertheless, we can use statistical methods to identify observations that appear to be rare or unlikely given the available data._ 

Outliers can decrease our model's accuracy when they stem from measurement or input errors, because such values do not follow any regularity that the algorithm can learn. Erroneous values may have to be excluded from the training data, whereas true outlier observations carry real information and should be judged case by case.

A sound practical approach for _normally distributed data_ is to filter values that lie beyond 3 standard deviations from the mean. 

In [None]:
# calculate normal and extreme upper and lower cut off
for feature in df.select_dtypes("number").columns:

    cut_off = df[feature].std() * 3
    lower   = df[feature].mean() - cut_off 
    upper   = df[feature].mean() + cut_off
    df_lower = df[df[feature] < lower]
    df_upper = df[df[feature] > upper]
    if df_lower.shape[0] != 0 or df_upper.shape[0] != 0:
        print(f"{feature}")
        print(f"lower bound: {lower:.2f}\nupper bound: {upper:.2f}")
        if df_lower.shape[0] != 0:
            display(df[df[feature] < lower].sort_values(feature))
        if df_upper.shape[0] != 0:
            display(df[df[feature] > upper].sort_values(feature))

We have found 5 outliers in the feature values of `horsepower` and 2 outliers in `acceleration`. We might consider dropping these samples during training. I think though, that they still lie quite close to the normal value range of +/- 3 standard deviations and can safely be used for modelling.
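
The 3-standard-deviations rule assumes roughly normal data, and we saw that `horsepower` is skewed. A common alternative that does not rely on mean and standard deviation is the interquartile-range rule (Tukey's fences). This is a different technique than the one used in the cell above; the helper name `iqr_outlier_mask` and the sample values are made up for illustration:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

sample = np.array([10, 12, 11, 13, 12, 11, 50])  # made-up values
mask = iqr_outlier_mask(sample)
print(sample[mask])
```

With `k=1.5` only the value 50 is flagged here; a larger `k` (e.g. 3.0) would flag only more extreme values.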

## **Examine the data with PCA and t-SNE**

By applying dimensionality reduction methods like PCA and manifold learning algorithms like t-SNE we aim to gain more hidden insights.

>**PCA is a statistical procedure that converts samples of possibly correlated features into linearly uncorrelated features called principal components.** 

The first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated set of features orthogonal to each other. 

We **define the number of target dimensions when instantiating the PCA model**. We fit the instance to our data and PCA will try to retain as much variance as possible under this constraint. 

We can use PCA as a **tool for visualization** (reducing many feature dimensions to 3d or 2d for plotting), for **noise filtering** (remove unneeded signal that doesn't meaningfully contribute to variance), for **feature extraction and engineering** and much more. 

Features influence the result in accordance with their absolute values. So PCA is sensitive to the scaling of the data and [results improve very much by standard scaling before fitting PCA.](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) The PCA implementation of scikit-learn centers the data by default but doesn't standardize, so we add this step in the form of a quick pipeline.
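
To illustrate why scaling matters for PCA, here is a small sketch on synthetic data (not the Auto MPG features). Two independent features live on very different scales; without standardization the first component is dominated almost entirely by the large-scale feature:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

# PCA on the raw data: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]

# PCA after standardization: both features contribute comparably.
pipe = make_pipeline(StandardScaler(), PCA(n_components=1))
pipe.fit(X)
scaled_ratio = pipe.named_steps["pca"].explained_variance_ratio_[0]

print(f"raw: {raw_ratio:.3f}, standardized: {scaled_ratio:.3f}")
```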

In [None]:
pca = PCA(n_components=2, random_state=1)
# fit on all numerical features and reduce dimensionality to two dimensions
cols = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']
pipe = make_pipeline(StandardScaler())
df_std = pipe.fit_transform(df[cols])
data_pca = pca.fit_transform(df_std)

# get percentage of retained variance after dimensionality reduction
ex_variance = pca.explained_variance_ratio_.sum()*100
print(f"The reduced 2-dimensional data still contains {ex_variance:.1f}% of the variance of the original data.")

# we colorize the plot with the two features `cylinders` and `origin`
data_pca = pd.DataFrame(data_pca, columns=["x", "y"])
data_pca = data_pca.join(df)
plt.figure(figsize=(16,7))
sns.scatterplot(x="x", y="y", data=data_pca, hue="cylinders", linewidth=0.2, alpha=0.9)
plt.title(f"PCA of numerical features")
plt.tight_layout()
plt.show()

There seem to be three distinct groups in the data that correspond to the number of cylinders (4, 6 or 8). 

PCA is flexible, fast, and easily interpretable. Yet it **does not perform well when there are nonlinear relationships within the data.** To address this we use so-called manifold learning methods. These are unsupervised estimators that seek to describe datasets as low-dimensional manifolds embedded in high-dimensional spaces. These algorithms basically search for locally connected structures and try to project these into fewer dimensions so that we can plot them.

We start to visualize our data with [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE).

In [None]:
tsne = TSNE(n_components=2, random_state=1)
data_tsne = tsne.fit_transform(df_std)

# build the plotting frame once, then plot it colored by each feature
data_tsne = pd.DataFrame(data_tsne, columns=["x", "y"]).join(df)

for feature in ["cylinders", "origin"]:
    plt.figure(figsize=(16,7))
    sns.scatterplot(x="x", y="y", data=data_tsne, hue=feature, linewidth=0.2, alpha=0.9)
    plt.title(f"TSNE of numerical features")
    plt.tight_layout()
    plt.show()

We observe three clusters that correspond to cylinders and even the origin of the cars. Probably US American cars are the only ones with 8 cylinders in the dataset. The group with 4 cylinders is more heterogeneous.

In [None]:
plt.figure(figsize=(16,7))
sns.swarmplot(x="origin", y="cylinders", hue="origin", data=df)
plt.title(f"Cylinders vs origin")
plt.tight_layout()
plt.show()

## **Correlation among features**
How are the features correlated to each other?

In [None]:
# get correlation among all numerical features with pandas .corr() function
corr = df.select_dtypes("number").corr()
# keep only correlations with an absolute value above 0.5
corr = corr[(corr > 0.5) | (corr < -0.5)]

plt.subplots(figsize=(16,16));
sns.heatmap(corr, cmap="RdBu", square=True, cbar_kws={"shrink": .7}, )
plt.title("Correlation matrix of numerical features")
plt.tight_layout()
plt.show()

Not surprisingly, the features describing the engine are strongly correlated with one another. More cylinders mean more displacement, which means more horsepower, which in turn is appropriate for heavier cars.

## **Examining our target variable**

We now plot the feature correlation more specifically in relation to our target variable.

In [None]:
plt.figure(figsize=(16,5))
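# use the unfiltered correlations so that weaker ones are not masked as NaN,
# and drop the self-correlation of mpg explicitly
df.select_dtypes("number").corr()["mpg"].drop("mpg").sort_values().plot(kind="barh")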
plt.title("Correlation of numerical features to mpg")
plt.xlabel("Correlation to «mpg»")
plt.tight_layout()
plt.show()

- **A high powered engine and a heavier car are negatively correlated to `mpg`.** 
- **Newer cars** and **high acceleration** have a **positive correlation**.

Since several regression algorithms rely on a linear relationship between features and target we examine the distribution of `mpg` more thoroughly.

In [None]:
plt.figure(figsize=(16,5))
df.mpg.plot(kind="hist", bins=100, rwidth=0.9)
plt.title("Miles per Gallon: value distribution")
plt.xlabel("Miles per Gallon (mpg)")
plt.tight_layout()
plt.show()

plt.figure(figsize=(16,5))
df.mpg.plot(kind="box", vert=False)
plt.title("Miles per Gallon: value distribution")
plt.xlabel("Miles per Gallon (mpg)")
plt.yticks([0], [''])
plt.ylabel("Miles per gallon\n", rotation=90)
plt.tight_layout()
plt.show()

[**Skewness**](https://en.wikipedia.org/wiki/Skewness) is a measure of the asymmetry of a distribution around its mean.

A normal distribution has a skewness of `0`. A positive value of skewness means that the tail is on the right, and vice versa with negative values. 
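
As a worked example, the moment-based sample skewness is $g_1 = m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments. The sketch below computes it by hand on a made-up sample and compares it with pandas, which, to my knowledge, reports the bias-adjusted variant $G_1 = g_1 \sqrt{n(n-1)} / (n-2)$:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # made-up sample with a long right tail
n = len(x)

dev = x - x.mean()
m2 = np.mean(dev ** 2)   # second central moment
m3 = np.mean(dev ** 3)   # third central moment
g1 = m3 / m2 ** 1.5      # moment-based sample skewness

# bias-adjusted variant of the same quantity
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)
pandas_skew = pd.Series(x).skew()

print(f"g1 = {g1:.3f}, G1 = {G1:.3f}, pandas = {pandas_skew:.3f}")
```

Both values are positive here, which matches the long tail on the right.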

Pandas provides a convenient function for this. We also add a probability plot to visualize how far `mpg` deviates from a normal distribution.

In [None]:
print(f"The skewness of mpg is: {df.mpg.skew():.2f}")

plt.figure(figsize=(16,7))
_ = stats.probplot(df['mpg'], plot=plt)
plt.title("Probability plot: mpg")
plt.show()

> The data is skewed on the lower end and a little bit on the higher end of values. Can we improve this by e.g. log transforming or boxcox transforming `mpg`?

In [None]:
print(f"The skewness of mpg is                   :  {df.mpg.skew():.2f}")
print(f"The skewness of mpg log transformed is   : {np.log1p(df.mpg).skew():.2f}")
print(f"The skewness of mpg boxcox transformed is: {boxcox1p(df.mpg, 0.15).skew():.2f}")

plt.figure(figsize=(16,7))
_ = stats.probplot(np.log1p(df.mpg), plot=plt)
plt.title("Probability plot: mpg log transformed")
plt.show()

plt.figure(figsize=(16,7))
_ = stats.probplot(boxcox1p(df.mpg, 0.15), plot=plt)
plt.title("Probability plot: mpg boxcox transformed")
plt.show()

Both transforms straighten the probability plot of `mpg` and thus bring it closer to a normal distribution. **The Box-Cox transform brings the values almost completely back to a normal distribution.**

We keep that in mind for later, when we actually model the data.
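
For reference, `boxcox1p(x, lmbda)` computes $((1 + x)^{\lambda} - 1) / \lambda$ for $\lambda \neq 0$ and $\log(1 + x)$ for $\lambda = 0$. The sketch below checks this on a few made-up values and shows how `scipy.special.inv_boxcox1p` can map transformed values (e.g. later predictions) back to the original scale. Note that $\lambda = 0.15$ is simply the fixed value used above, not a fitted one:

```python
import numpy as np
from scipy.special import boxcox1p, inv_boxcox1p

x = np.array([9.0, 18.0, 23.5, 46.6])  # a few illustrative mpg-like values
lmbda = 0.15

# Box-Cox of (1 + x), written out by hand
manual = ((1.0 + x) ** lmbda - 1.0) / lmbda
transformed = boxcox1p(x, lmbda)

# the inverse transform recovers the original values
restored = inv_boxcox1p(transformed, lmbda)

print(np.round(transformed, 3))
```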

Let's get more **detailed stats about the distribution**.

In [None]:
plt.figure(figsize=(16,7))
order = df.groupby("origin")["mpg"].median().sort_values(ascending=True).index
sns.boxplot(x="origin", y="mpg", data=df, order=order, width=0.5)
plt.title("Distribution of miles per gallon in relation to origin")
plt.ylabel("Miles per gallon")
plt.tight_layout()
plt.show()

We observe that:
- **US American cars have the least efficient consumption**.
- **Japanese cars do get the most out of their fuel** and **European cars rank 2nd**.

In [None]:
plt.figure(figsize=(16,7))
top20 = df.manufacturer.value_counts().index[:20]
top20 = df[df.manufacturer.isin(top20)]
order = top20.groupby("manufacturer")["mpg"].median().sort_values(ascending=True).index
sns.boxplot(x="manufacturer", y="mpg", hue="origin", dodge=False, data=top20, order=order)
plt.title("MPG per manufacturer (top20 plotted)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This confirms our previous insight: 

- American manufacturers have a much lower median fuel efficiency.
- European and Japanese manufacturers build significantly more efficient cars.

In [None]:
for color, country in zip(sns.color_palette()[:3], df.origin.unique()):
    plt.figure(figsize=(16,7))
    sns.boxplot(x="model_year", y="mpg", color=color, data=df[df.origin==country])
    plt.ylim(8, 48)
    plt.title(f"Miles per gallon per model_year ({country.upper()})")
    plt.tight_layout()
    plt.show()

- Throughout the years, cars from all three regions get more efficient.
- Interestingly, the US shows slow and steady progress whereas Japan and Europe have ups and downs. This could be due to the small size of the dataset: we have no idea how many car models were actually on the market during these 12 years. At the same time, Japanese and European manufacturers might have reacted much faster and more sensitively to market changes and maybe oil prices.
- [1973 and 1979 were years of oil crisis](https://en.wikipedia.org/wiki/Oil_crisis) and this might have affected car development in later years. We see a huge jump in efficiency for the US from 1979 to the following years. We also notice a jump in Japan from 1973 to the following years.

We are now **examining the relation of our target variable to the other correlated features**.

In [None]:
for feature in ['displacement', 'horsepower', 'weight', 'acceleration']:
    plt.figure(figsize=(16,5));
    sns.scatterplot(x=feature, y="mpg", data=df, linewidth=0.2, alpha=0.7, hue="origin")
    plt.title(f"{feature} vs. mpg")
    plt.legend(bbox_to_anchor=(1, 1), loc=2)
    plt.tight_layout()
    plt.show()

The correlations are clearly noticeable. And again: We see clear groups in regard to origin.

# **3. Data preparation**

In this step we will:
    
- **scale** our numerical features
- **un-skew numerical features** and especially `mpg` to be more normally distributed
- **one hot encode** true categoricals
- remove `model` as a potentially misleading feature since it is (almost) unique to a specific car. We very likely won't see the same model names as categories in unseen data (although we saw in EDA that some models come in several variants throughout the years).

### **A quick [refresher on scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)**:

- **Different scales and outliers degrade the performance of many machine learning algorithms.** 
- Unscaled data can also **slow down or even prevent the convergence of many gradient-based estimators.**
- Many **estimators are designed with the assumption that each feature takes values close to zero or more importantly that all features vary on comparable scales.** Exception: Decision tree-based estimators that are robust to arbitrary scaling of the data.
- sklearn's **`Scalers` are linear transformers** and differ in the way to estimate the parameters used to shift and scale each feature.
- sklearn's **`QuantileTransformer` provides non-linear transformations** in which distances between marginal outliers and inliers are shrunk. 
- sklearn's **`PowerTransformer` provides non-linear transformations** in which data is mapped to a normal distribution to stabilize variance and minimize skewness.
- In addition to scikit's transformers we can apply any mathematical transformation directly, like log transform or box-cox.
- **Normalization** refers to a per sample transformation instead of a per feature transformation.
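
To make the difference between these transformers concrete, here is a small sketch on made-up data that reproduces three of them by hand: standardization as `(x - mean) / std`, robust scaling as `(x - median) / IQR` (scikit-learn's default quantile range of 25 to 75), and row-wise normalization to unit L2 norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one feature with an outlier

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
standard = StandardScaler().fit_transform(x)
manual_standard = (x - x.mean()) / x.std()

# RobustScaler: subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(x)
iqr = np.percentile(x, 75) - np.percentile(x, 25)
manual_robust = (x - np.median(x)) / iqr

# Normalizer: works per row (sample), not per column (feature)
unit_row = Normalizer(norm="l2").fit_transform(np.array([[3.0, 4.0]]))

print(robust.ravel())
print(unit_row)
```

The outlier stays far away from the other values after robust scaling, but it no longer determines how the remaining values are spread.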

<font color="darkred">**Caveat:** 
- We always have to be **careful not to introduce so-called data leakage** by fitting a scaler on the whole dataset (test and train). 
- We **always fit just on the training data** and then just **transform on the training and test data!**
</font>
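
A minimal sketch of this caveat, using synthetic data rather than the Auto MPG features: the scaler learns its mean and standard deviation from the training rows only and then reuses them for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 3))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=1)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics come from the training rows only
X_test_std = scaler.transform(X_test)        # test rows reuse those statistics

print(X_train_std.mean(axis=0).round(3), X_test_std.mean(axis=0).round(3))
```

The test set means will generally not be exactly zero, which is expected: the test data was not used to estimate the scaling parameters.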

>**<font color="darkgreen">Standardization</font>** is done by subtracting the mean and dividing by standard deviation. So a value which lies 1 SD above the mean – say 150 – will have a value of 1 after standardization. However, outliers have an influence on the calculation of the mean and standard deviation. So the spread of the transformed data on each feature can still be very different after standardization. **`StandardScaler()` doesn't guarantee balanced feature scales in the presence of outliers.**

>**<font color="darkgreen">MinMaxScaling</font>** rescales the data such that all feature values are in the range [0, 1]. Like `StandardScaler()`, the **`MinMaxScaler()` is very sensitive to the presence of outliers.**

>**<font color="darkgreen">RobustScaling</font>** rescales the data based on percentiles and is therefore **not influenced by a small number of very large marginal outliers**. The RobustScaler keeps the outliers, so they are still present in the transformed data. If outlier clipping is desirable, a non-linear transformation is required, for example with `QuantileTransformer()`.

>**<font color="darkgreen">QuantileTransformer</font>** transforms the features to follow a uniform or a normal distribution. For a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers. It is therefore a robust preprocessing scheme. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. **This transform is non-linear.** It may **distort linear correlations between variables measured at the same scale** but renders variables measured at different scales more directly comparable.

>**<font color="darkgreen">PowerTransformer</font>** applies a power transformation to each feature **to make the data more Gaussian-like**. scikit-learn's `PowerTransformer()` implements the Yeo-Johnson and Box-Cox transforms. **The power transform finds the optimal scaling factor to stabilize variance and minimize skewness**. By default, PowerTransformer also applies zero-mean, unit variance normalization to the transformed output. **Box-Cox can only be applied to strictly positive data**. If zero or negative values are present, the Yeo-Johnson transform is to be preferred.

>**<font color="darkgreen">Normalization</font>** scales samples **row-wise** to unit norm. Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one. Scaling inputs to unit norms is a **common operation for text classification or clustering**.

Let's plot all these transforms on `mpg` (apart from normalization which doesn't make sense here).

In [None]:
plt.figure(figsize=(16,5))
sns.distplot(df.mpg)
plt.title("mpg – unscaled")
plt.show()

scalers = [StandardScaler(), 
           MinMaxScaler(), 
           RobustScaler(), 
           PowerTransformer(method="box-cox"),
           QuantileTransformer(output_distribution="normal")]

for scaler in scalers:
    pipe = make_pipeline(scaler)
    standardized = pipe.fit_transform(df.mpg.values.reshape(-1, 1))
    plt.figure(figsize=(16,5))
    sns.distplot(pd.DataFrame(standardized))
    plt.title(f"mpg – {scaler.__class__.__name__}")
    plt.show()

As expected – **only PowerTransformer and QuantileTransformer scale to a more normal distribution.** 

## **Setup our data preparation function**

We're now ready to actually model our data and make predictions. I set up a simple data preparation function that allows us to use any of the above scikit-learn scalers.

In [None]:
def prepare_data(data, scaler):
    # remove feature «model» from dataset
    data.pop("model")
    df_numeric = data.select_dtypes(exclude=['object'])
    df_obj = data.select_dtypes(include=['object']).copy()

    cols = []
    for c in df_obj:
        dummies = pd.get_dummies(df_obj[c])
        dummies.columns = [c + "_" + str(x) for x in dummies.columns]
        cols.append(dummies)
    df_obj = pd.concat(cols, axis=1)
    
    pipe = make_pipeline(scaler)
    scaled = pipe.fit_transform(df_numeric)
    df_numeric = pd.DataFrame(scaled, columns=df_numeric.columns)
    
    data = pd.concat([df_numeric, df_obj], axis=1)
    data.reset_index(inplace=True, drop=True)
    return data

clean_df = prepare_data(df.copy(), StandardScaler())
clean_df.head()
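
As a possible alternative to a hand-written preparation function, scaling and one hot encoding can be bundled with the model in a scikit-learn `Pipeline`. When such a pipeline is passed to cross-validation, the preprocessing is fitted on each training fold only. The following is a sketch under assumptions: the toy frame holds made-up values, and `Ridge` is just a placeholder estimator.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up values; column names mirror the notebook.
toy = pd.DataFrame({
    "weight": [3504, 2130, 2372, 4354, 1835, 2672],
    "horsepower": [130, 88, 95, 215, 70, 110],
    "origin": ["usa", "europe", "japan", "usa", "europe", "japan"],
    "mpg": [18.0, 27.0, 24.0, 14.0, 36.0, 21.0],
})
X, y = toy.drop(columns="mpg"), toy["mpg"]

# scale the numerical columns, one hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["weight", "horsepower"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
])
model = Pipeline([("prep", preprocess), ("reg", Ridge(alpha=1.0))])
model.fit(X, y)

n_features = model.named_steps["prep"].transform(X).shape[1]
preds = model.predict(X)
print(n_features, preds.round(1))
```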

# **4. Try out and tune regressors**

Let's start with quick baselines. First we have to choose our error / scoring metric(s).

### **A quick [refresher on model evaluation and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html)**...

How exactly do we choose an appropriate metric for evaluating our models?

- We make a general distinction between **error metrics** and **scoring metrics**. 
- We want **errors to be as small as possible** and **scores, like accuracy, to be as high as possible.**
- scikit-learn provides **three different options to evaluate a model**: 
   - **score methods of estimators**
   - the **scoring parameter of crossvalidation tools** (e.g. cross_val_score()) 
   - and **metric functions**.


- To keep its APIs consistent, scikit-learn follows the convention that for all scorer objects **higher return values are better than lower return values**. [More info here](https://stackoverflow.com/questions/43081251/sklearn-metrics-log-loss-is-positive-vs-scoring-neg-log-loss-is-negative) too.
- Metrics which measure the distance between the model and the data, like `metrics.mean_squared_error`, are therefore available as their negative counterpart, e.g. `neg_mean_squared_error`, which **returns the negated value of the metric**. This way all scorers work in the same numerical direction. 
- Available scorer names can be listed with `sklearn.metrics.get_scorer_names()` (scikit-learn 1.0 and later; older versions exposed them as `sklearn.metrics.SCORERS.keys()`, which has since been removed).
- We can **use many metrics at once** with `GridSearchCV`, `RandomizedSearchCV` and `cross_validate()`.
- **Dummy estimators** are useful to get a baseline of those metrics from trivial predictions, e.g. always predicting the mean of the target.
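
The three evaluation options and the sign convention described above can be illustrated with a minimal sketch. It uses synthetic data from `make_regression` instead of the Auto MPG data, and all variable names (`X_demo`, `y_demo`, `model`, `dummy`) are placeholders chosen for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for this sketch, not the Auto MPG data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
model = LinearRegression().fit(X_demo, y_demo)

# Option 1: the estimator's own score method (R2 for regressors)
r2_train = model.score(X_demo, y_demo)

# Option 2: the scoring parameter of a cross-validation tool
cv_neg_mse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")

# Option 3: a metric function applied to predictions
mse_train = mean_squared_error(y_demo, model.predict(X_demo))

# The scorer returns the negated metric so that "higher is better" holds
neg_mse_train = get_scorer("neg_mean_squared_error")(model, X_demo, y_demo)

# A dummy baseline that always predicts the mean of the training target
dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
r2_dummy = dummy.score(X_demo, y_demo)

print(f"R2 (estimator.score):      {r2_train: .4f}")
print(f"neg. MSE (cross-validated): {cv_neg_mse.mean(): .4f}")
print(f"MSE (metric function):     {mse_train: .4f}")
print(f"neg. MSE (scorer object):  {neg_mse_train: .4f}")
print(f"R2 of the dummy baseline:  {r2_dummy: .4f}")
```

The scorer value is simply the metric function's value with the sign flipped, and the dummy baseline lands at an R2 of zero on the data it was fitted on.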

In [None]:
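from sklearn.metrics import get_scorer_names

# get_scorer_names() needs scikit-learn >= 1.0; older versions used sklearn.metrics.SCORERS.keys()
sorted(get_scorer_names())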

We now set up the regression metrics as a list and pass that list to our cross-validation.

A **regressor can be evaluated using many different metrics** e.g. the following:
- **Mean absolute error** (MAE): The average of absolute errors of all predicted data points. It uses the same scale as the target variable and gives an idea of how close predictions are to the actual values.
- **Median absolute error**: The median of the absolute errors of all predicted data points. Its main advantage is that it is robust to outliers: a single bad point in the test dataset doesn't skew the metric, as it would with a mean error metric.
- **Mean squared error** (MSE): The average of the squares of the errors of all predicted data points and a very common metric. We can take the square root on top of the MSE in order to convert the values back into the original scale of the target variable being estimated. This yields the **root mean squared error (RMSE).**
- **Explained variance score**: This score measures how well our model can account for the variation in our dataset. A score of 1.0 indicates that our model is perfectly able to replicate our target distribution.
- **R2 score**: Indicates the goodness of fit of a regression model. The best possible score is 1.0 (perfect prediction). A model that always predicts the mean of the target scores 0, and a model that does worse than that gets a negative score.

`mean_squared_log_error` can only be used for non-negative target values. Our standardized `mpg` target contains negative values, so we leave it out for now.
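
To make the definitions above concrete, here is a small sketch that computes each metric by hand with NumPy and compares it against the corresponding scikit-learn function. The four values in `y_true` and `y_pred` are toy numbers chosen for illustration, not Auto MPG data.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, median_absolute_error, r2_score)

# Toy values for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
errors = y_true - y_pred

mae = np.mean(np.abs(errors))        # mean absolute error
medae = np.median(np.abs(errors))    # median absolute error
mse = np.mean(errors ** 2)           # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error, same unit as the target

# Explained variance compares the variance of the errors to the variance of the target
explained_var = 1 - np.var(errors) / np.var(y_true)

# R2 compares the residual sum of squares to the total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values agree with scikit-learn's metric functions
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(medae, median_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(explained_var, explained_variance_score(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MAE {mae:.4f} | MedAE {medae:.4f} | MSE {mse:.4f} | RMSE {rmse:.4f}")
print(f"explained variance {explained_var:.4f} | R2 {r2:.4f}")
```

Explained variance and R2 only differ when the errors have a non-zero mean: explained variance ignores such a constant bias, while R2 penalizes it.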

In [None]:
metrics = ["r2",
           "explained_variance",
           "max_error", 
           "neg_mean_absolute_error", 
           "neg_mean_squared_error", 
           "neg_median_absolute_error",
]

clean_df = prepare_data(df.copy(), StandardScaler())
X = clean_df.drop("mpg", axis=1)
y = clean_df.mpg

clf = make_pipeline(LinearRegression())
results = cross_validate(clf, X, y, scoring=metrics)

for k, v in results.items():
    if k in ["fit_time", "score_time"]:
        continue
    print(f"{v.mean(): .4f} {k.replace('test_', '')}")

## **Try out 10 of the most common regressors**

We now set up a number of regressors to try out in combination with the scalers mentioned above. A `DummyRegressor` is included as a baseline.

In [None]:
classifiers = [
               DummyRegressor(),
               LinearRegression(), 
               Ridge(random_state=1), 
               Lasso(random_state=1), 
               ElasticNet(random_state=1),
               KernelRidge(),
               SVR(kernel="linear"),
               RandomForestRegressor(n_jobs=-1, random_state=1),
               GradientBoostingRegressor(random_state=1),
               lgb.LGBMRegressor(n_jobs=-1, random_state=1),
               xgb.XGBRegressor(objective="reg:squarederror", n_jobs=-1, random_state=1),
]

clf_names = [
            "dummy       ", 
            "linear      ", 
            "ridge       ", 
            "lasso       ",
            "elasticnet  ",
            "kernlrdg    ",
            "svr         ",
            "randomforest", 
            "gbm         ", 
            "lgbm        ", 
            "xgboost     ",
]


scalers = [
           StandardScaler(), 
           MinMaxScaler(), 
           RobustScaler(), 
           PowerTransformer(method="box-cox"),
           QuantileTransformer(output_distribution="normal"),
          ]

scalers_names = [
                "Standard", 
                "MinMax", 
                "Robust", 
                "PowerTransform",
                "QuantileTransform",
]

In [None]:
def score_models(data, metric):
    
    frames = []
    
    for scaler_name, scaler in zip(scalers_names, scalers):
        
        clean_df = prepare_data(data.copy(), scaler)
        X = clean_df.drop("mpg", axis=1)
        y = clean_df.mpg

        scores = []

        for clf_name, clf in zip(clf_names, classifiers):
            score = cross_val_score(clf, X, y, cv=10, scoring=metric).mean()
            scores.append(score)

        frames.append(pd.DataFrame(scores, columns=[scaler_name]))

    score_df = pd.concat(frames, axis=1)
    score_df["clf"] = clf_names
    score_df.set_index("clf", inplace=True)

    score_df.sort_values("Standard", ascending=False, inplace=True)
    return score_df

score_models(df, "r2")

Observations:

- LightGBM performs best with all scalers and yields the highest score with standard scaling.
- LinearRegression doesn't work at all. This is very likely due to exploding coefficients and its lack of regularization.

We now compare to **explained_variance**.

In [None]:
score_models(df, "explained_variance")

- According to explained_variance, too, LightGBM performs best with all scalers except PowerTransform. 
- PowerTransform gives a slight edge to the Ridge and KernelRidge estimators, likely because it reduces the skew of the feature and target distributions.
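
The un-skewing effect mentioned above can be checked in isolation. The following sketch applies a Box-Cox `PowerTransformer` to a synthetic right-skewed (log-normal) sample and compares the skewness before and after. The sample is an assumption made for this illustration; it is not one of the Auto MPG columns.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive, right-skewed data (assumption for this sketch)
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Box-Cox requires strictly positive inputs; by default the output is also standardized
pt = PowerTransformer(method="box-cox")
transformed = pt.fit_transform(skewed)

skew_before = skew(skewed.ravel())
skew_after = skew(transformed.ravel())

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
print(f"fitted lambda:   {pt.lambdas_[0]:.3f}")
```

A skewness close to zero after the transform indicates a roughly symmetric distribution, which linear models such as Ridge tend to handle better.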

Let's now try the **mean squared error** as a metric.

In [None]:
score_models(df, "neg_mean_squared_error")

Observations:

- LightGBM again yields the best results overall.
- **At first sight we get the lowest errors with MinMax** and might **be tempted to use this scaling** rather than the others. **However, `prepare_data` scales the target `mpg` too, and MinMax compresses all values to a range between 0 and 1, a much smaller scale than the other scalers produce. It therefore yields much lower errors regardless of the model's quality!**
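
The scale dependence of the error can be demonstrated with a small sketch. It fits the same linear model to one synthetic target that is scaled once with `StandardScaler` and once with `MinMaxScaler`. The data and all names (`X_demo`, `y_raw`, `results`) are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption for this sketch)
X_demo, y_raw = make_regression(n_samples=200, n_features=3, noise=15.0, random_state=1)

results = {}
for name, scaler in [("Standard", StandardScaler()), ("MinMax", MinMaxScaler())]:
    # Scale only the target; the features stay identical in both runs
    y_scaled = scaler.fit_transform(y_raw.reshape(-1, 1)).ravel()
    pred = LinearRegression().fit(X_demo, y_scaled).predict(X_demo)
    results[name] = {
        "mse": mean_squared_error(y_scaled, pred),
        "r2": r2_score(y_scaled, pred),
    }

for name, res in results.items():
    print(f"{name:<10} MSE {res['mse']:.5f}   R2 {res['r2']:.5f}")
```

The fit quality (R2) is the same in both runs, while the MSE shrinks with the square of the target's scale. Errors are therefore only comparable across scalers after converting them back to a common unit.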

Let's stick with standard scaling and compute the RMSE.

In [None]:
clean_df = prepare_data(df.copy(), StandardScaler())
X = clean_df.drop("mpg", axis=1)
y = clean_df.mpg
metric = 'neg_mean_squared_error'

scores = []

# remove the dummy regressor and linear regression from set of estimators
for clf_name, clf in zip(clf_names[2:], classifiers[2:]):
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores.append([clf_name, np.sqrt(-cross_val_score(clf, X, y, cv=kfold, scoring=metric)).mean()])

pd.DataFrame(scores, columns=["clf", "rmse"]).sort_values("rmse")
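
Because `prepare_data` standardizes `mpg` as well, the RMSE values above are expressed in standard deviations of `mpg`, not in miles per gallon. The following sketch shows one way to convert such an error back to the original unit. It assumes that the scaler fitted on the target is still available (which `prepare_data` does not return), and it uses made-up `mpg` values and simulated predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mpg values for illustration only
mpg_demo = np.array([18.0, 15.0, 36.0, 26.0, 31.0, 22.0, 13.0, 29.0]).reshape(-1, 1)

# Hypothetical scaler fitted on the target alone
target_scaler = StandardScaler().fit(mpg_demo)
y_scaled = target_scaler.transform(mpg_demo).ravel()

# Simulated predictions on the standardized scale
rng = np.random.default_rng(1)
pred_scaled = y_scaled + rng.normal(scale=0.3, size=y_scaled.shape)

# RMSE on the standardized scale, then rescaled by the target's standard deviation
rmse_scaled = np.sqrt(np.mean((y_scaled - pred_scaled) ** 2))
rmse_mpg = rmse_scaled * target_scaler.scale_[0]

# Cross-check: transform the predictions back and compute the RMSE directly
pred_mpg = target_scaler.inverse_transform(pred_scaled.reshape(-1, 1))
rmse_mpg_check = np.sqrt(np.mean((mpg_demo - pred_mpg) ** 2))

print(f"RMSE (standardized): {rmse_scaled:.4f}")
print(f"RMSE (mpg):          {rmse_mpg:.4f}")
```

Multiplying by the standard deviation of the target and inverse-transforming the predictions give the same result, so either route works.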

## **Which features do the regressors pick up?**

Which features are relevant for the regressors? We'll look at the feature importances of LightGBM as an example.

In [None]:
clf = lgb.LGBMRegressor(n_jobs=-1, random_state=1)
coeffs = clf.fit(X, y).feature_importances_
df_co = pd.DataFrame(coeffs, columns=["importance_"])
df_co.index = X.columns
df_co.sort_values("importance_", ascending=True, inplace=True)

plt.figure(figsize=(16,16))
df_co.importance_.plot(kind="barh")
plt.title("LightGBM feature importance")
plt.show()

As expected – top correlated features like weight, model_year etc. are very informative to the estimator. 

## **How much sense do the predictions from our model make?**

Now I'd like to do something that I feel I should do way more often: look at the predictions in a systematic way.

Again we train a regressor, make predictions on a test set and examine the predictions in more detail.

In [None]:
clean_df = prepare_data(df.copy(), StandardScaler())
X = clean_df.drop("mpg", axis=1)
y = clean_df.mpg
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
metric = 'neg_mean_squared_error'

clf = lgb.LGBMRegressor(n_jobs=-1, random_state=1)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x=y_test, y=predictions)
plt.plot(np.arange(-1.5, 2.6), np.arange(-1.5, 2.6))
for x_, y_, t_ in zip(y_test, predictions, X_test.index):
    plt.text(x_, y_, t_, fontsize=10)
plt.title("Predicted vs. actual values")
plt.xlabel("actual values")
plt.ylabel("predicted values")
plt.show()

In general our model predicts most of the values close to the actual values, with some deviations and minor outliers. However, the higher `mpg` is, the less precise the model seems to be. We can now examine some of the outliers more closely.
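
Instead of reading the index labels of the outliers off the plot, they could also be picked programmatically from the residuals. The helper `largest_residuals` below is a hypothetical function written for this sketch, and the values are toy numbers, not the actual test set.

```python
import numpy as np
import pandas as pd

def largest_residuals(y_true, y_pred, n=3):
    """Return the n most over-predicted and the n most under-predicted samples."""
    residuals = pd.Series(
        np.asarray(y_pred) - np.asarray(y_true),
        index=y_true.index,
        name="residual",
    )
    too_high = residuals.nlargest(n)   # prediction far above the actual value
    too_low = residuals.nsmallest(n)   # prediction far below the actual value
    return too_high, too_low

# Toy values with arbitrary index labels (assumption for this sketch)
y_true_demo = pd.Series([0.1, 1.2, -0.4, 2.0, 0.7], index=[10, 11, 12, 13, 14])
y_pred_demo = np.array([0.9, 0.5, -0.3, 1.1, 1.0])

too_high, too_low = largest_residuals(y_true_demo, y_pred_demo, n=2)
print("Predictions too high:", list(too_high.index))
print("Predictions too low: ", list(too_low.index))
```

With the real data, the returned index labels could be passed to `df.loc[...]` in the same way as the hand-picked lists.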

In [None]:
print("Predictions too high")
outliers_too_high = [366, 332, 303, 346]
display(df.loc[outliers_too_high])
#display(X_test.loc[outliers_too_high])

print("Predictions too low")
outliers_too_low = [298, 309, 327]
display(df.loc[outliers_too_low])
#display(X_test.loc[outliers_too_low])

# print basic stats for comparison (numeric columns only; recent pandas versions raise on object columns otherwise)
display(df.median(numeric_only=True))

At first glance, nothing looks particularly special about these samples. It only looks like newer car models yield worse predictions. 

In [None]:
plt.figure(figsize=(16,7))
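# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest equivalent
sns.histplot(y_test, kde=True, stat="density", label="mpg: actual values")
sns.histplot(predictions, kde=True, stat="density", label="mpg: predicted")
plt.legend()
plt.title("Distributions of actual and predicted values")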
plt.tight_layout()
plt.show()

This plot, too, confirms that there is a noticeable deviation at the right end, for higher values of `mpg` (see the big blue bar). 

## **Tune hyperparameters**

As an example we grid search the optimal settings for LightGBM, our most promising regressor.

In [None]:
clean_df = prepare_data(df.copy(), StandardScaler())
X = clean_df.drop("mpg", axis=1)
y = clean_df.mpg

clf = lgb.LGBMRegressor(n_jobs=-1, random_state=1)
# increase steps in np.linspace for more granular search
param_grid = {
     'n_estimators' : np.linspace(100, 500, 3, dtype="int"),
     'learning_rate' : np.linspace(0.01, 0.1, 3),
}

metric = 'r2'
# increase cv for more precision but longer processing time
search = GridSearchCV(clf, param_grid, cv=3, scoring=metric, n_jobs=-1, verbose=True)
search.fit(X, y)

print(f"{search.best_params_}")
print(f"{search.best_score_:.4}")

> Job done! We found suitable hyperparameters and could now train our regressor on the full data set and, for example, submit to a competition.
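
With the default `refit=True`, `GridSearchCV` already refits the best parameter combination on all the data passed to `fit` and exposes the result as `best_estimator_`. The sketch below illustrates this with synthetic data and with scikit-learn's `GradientBoostingRegressor` as a stand-in for LightGBM, so that it runs without extra dependencies. All names ending in `_demo` are placeholders for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=1)

param_grid_demo = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

search_demo = GridSearchCV(
    GradientBoostingRegressor(random_state=1),  # stand-in for lgb.LGBMRegressor
    param_grid_demo,
    cv=3,
    scoring="r2",
)
search_demo.fit(X_demo, y_demo)

# best_estimator_ is the model refitted on the full data with the best parameters
final_model = search_demo.best_estimator_
new_predictions = final_model.predict(X_demo[:5])

print(search_demo.best_params_)
print(f"{search_demo.best_score_:.4f}")
```

The refitted `final_model` can be used directly for predictions on new samples, so no separate training step is needed after the search.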

# **Conclusion**

- I tried to **follow a systematic way of examining, preparing and modelling data.**
- I especially tried to dig deeper into the **various ways we can scale data**, **un-skew distributions that aren't normal** and learn more about **error and scoring metrics.**

## **References**

[Jake VanderPlas's Python Data Science Handbook](https://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/00.00-Preface.ipynb)

**My other kernels:**

https://www.kaggle.com/chmaxx/sklearn-pipeline-playground-for-12-classifiers<br>
https://www.kaggle.com/chmaxx/extensive-data-exploration-modelling-python<br>
https://www.kaggle.com/chmaxx/slim-data-cleaning-modelling-weighted-ensemble<br>

https://www.kaggle.com/chmaxx/train-12-classifiers-with-one-line-of-code<br>
https://www.kaggle.com/chmaxx/train-12-regressors-with-just-one-line-of-code<br>

**Utility scripts:**

https://www.kaggle.com/chmaxx/quick-regression<br>
https://www.kaggle.com/chmaxx/quick-classification<br>