# Scaling

Scaling changes the range of features in our dataset.

- why
    - some model types can be thrown off by different feature scales
    - can improve the performance of many models
    - visualize the combination of 2 variables with different scales
    - a better interpretation of the data (e.g. log scaling)
    - combining features
- when
    - data prep / exploration
    - pipeline: prep
    - lifecycle: prep/exploration
    - when one of the conditions above is met. Otherwise, it's better to work with the original units
- where
    - the training dataset
    - usually just the independent variables
    - indep vars are scaled independently, i.e. the scaling of one feature doesn't affect the scaling of another
    - scale whatever goes into the model
- how
    - `sklearn.preprocessing` -- requires 2d array
    - make the thing, fit the thing, use the thing
    - `.fit` to learn parameters, `.transform` to apply the scaling
    - keep the scaled values in separate dataframes and/or columns
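
The "make, fit, use" steps above can be sketched on a tiny example. The arrays here are hypothetical toy values, not the lesson's data; the point is that the scaler takes a 2D array, learns its parameters from the training split only, and scales each column independently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales (2D: rows x columns)
X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0]])
X_validate = np.array([[2.5, 600.0]])

# make the thing
scaler = MinMaxScaler()

# fit the thing: learn each column's min and max from the training data only
scaler.fit(X_train)

# use the thing: apply the learned parameters to any split
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)

print(X_train_scaled)
print(X_validate_scaled)  # unseen data can land outside [0, 1]
```

Because the minimum and maximum come from the training split, a validate value larger than anything seen in training scales to more than 1. That is expected and is not a reason to re-fit the scaler on validate.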

In [10]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import pydataset

from sklearn.neighbors import KNeighborsClassifier

## Why Scale? A Motivating Example

Goal: predict the flavor of ice cream

Note: for the purposes of the demo, we will only be looking at train and validate

In [11]:
train = pd.read_csv('https://gist.githubusercontent.com/zgulde/\
66989745314d2c68ab62fae13743f094/raw/71635c6281b5e2a36e3eb4578cab277eb09743ec/train.csv')
validate = pd.read_csv('https://gist.githubusercontent.com/zgulde/\
66989745314d2c68ab62fae13743f094/raw/71635c6281b5e2a36e3eb4578cab277eb09743ec/test.csv')

In [12]:
print('train shape:', train.shape)
print('validate shape:', validate.shape)

train shape: (90, 3)
validate shape: (10, 3)


In [13]:
train.head()

| Unnamed: 0 | flavor | pints | n_sprinkles |
|---|---|---|---|
| 0 | blueberry | 7.675963 | 921.808798 |
| 1 | blueberry | 7.129386 | 1186.329821 |
| 2 | pistachio | 12.182332 | 443.310335 |
| 3 | pistachio | 13.955832 | 832.502384 |
| 4 | chocolate | 10.748216 | 892.0 |


In [14]:
X_train, X_validate = train[['pints', 'n_sprinkles']], validate[['pints', 'n_sprinkles']]
y_train, y_validate = train.flavor, validate.flavor

In [15]:
# make the thing
model = KNeighborsClassifier(n_neighbors=3)

# fit the thing
model.fit(X_train, y_train)

# use the thing
model.score(X_validate, y_validate)

0.3

### Scale the data
1. make the thing
2. fit the thing
3. use the thing

In [16]:
from sklearn.preprocessing import MinMaxScaler

In [17]:
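# making our scaler
scaler = MinMaxScaler()

# fitting our scaler on train only
# AND!!!!
# using the scaler on train
X_train_scaled = scaler.fit_transform(X_train)

# using our scaler on validate (transform only, no re-fitting)
X_validate_scaled = scaler.transform(X_validate)

#### now re-fit our model with scaled data and score

In [18]:
# make the thing (it's already made)

# fit the thing
model.fit(X_train_scaled, y_train)

# use the thing
model.score(X_validate_scaled, y_validate)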


What's going on?

In [20]:
# train.head()

In [21]:
# sns.relplot(data=train, y='n_sprinkles', x='pints', hue='flavor');

In [22]:
# sns.relplot(data=train, y='n_sprinkles', x='pints', hue='flavor')
# plt.xlim(-800, 800);

Distance between 2 points

$$
\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$
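
To see why this formula matters for a distance-based model like KNN, here is a minimal sketch using three made-up points in (pints, n_sprinkles) units. The values are hypothetical, chosen only to resemble the ranges in the training data.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points given as (x, y) pairs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))

# Hypothetical points in (pints, n_sprinkles) units
a = (7.0, 900.0)
b = (13.0, 910.0)   # very different pints, similar sprinkles
c = (7.5, 1200.0)   # similar pints, very different sprinkles

d_ab = euclidean_distance(a, b)
d_ac = euclidean_distance(a, c)
print(d_ab, d_ac)
```

Here `b` ends up much "closer" to `a` than `c` does, even though `b` differs a lot in pints. The feature with the larger numeric range dominates the distance, so on unscaled data the nearest neighbors are decided almost entirely by `n_sprinkles`.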

### Another Example

Dataset: demographics and self-reported scores for the ACT, SAT Verbal, and SAT Quantitative

In [24]:
df = pydataset.data('sat.act')
# df.head()

#### how does gender relate to each test score?

In [25]:
# df[['gender', 'ACT', 'SATV', 'SATQ']].groupby('gender').mean()
# .plot.bar(figsize=(11, 6), ec='black', width=.9);

#### let's scale our scores

In [26]:
from sklearn.preprocessing import StandardScaler

In [28]:
# defining my columns explicitly to scale
cols = ['ACT', 'SATQ', 'SATV']

# making my scaler object
scaler = StandardScaler()

# calling my fit_transform (note: I don't have a train/test split here)
# and reassigning those transformed values back into the dataframe
df[cols] = scaler.fit_transform(df[cols])


In [29]:
# grouping and plotting as we did before
# df[['gender', 'ACT', 'SATV', 'SATQ']].groupby('gender').mean()
# .plot.bar(figsize=(11, 6), ec='black', width=.95);

## Linear Scaling

- Units are changed, but the relative distances between points within a feature are preserved.

- MinMax: everything between 0 and 1

    $$ x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)} $$

- Standard: a z-score, the number of standard deviations from the mean; **center** + **scale**

    $$ x' = \frac{x - \bar{x}}{s_x} $$

    - **centering**: subtracting the mean
    - **scaling**: dividing by the standard deviation

- Robust: centers on the median and scales by the IQR, so outliers have less influence on the scaling but still remain outliers

    $$ x' = \frac{x - \text{med}(x)}{\text{IQR}_x} $$
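
As a sanity check, the three formulas above can be computed by hand with NumPy and compared against the sklearn scalers. This is a sketch on a small made-up column with outliers. One detail worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `.std()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = np.array([-100, -1, 0, 1, 2, 3, 4, 5, 100, 1000], dtype=float)
X = x.reshape(-1, 1)  # sklearn scalers expect a 2D array

# min-max: subtract the min, divide by the range
minmax_manual = (x - x.min()) / (x.max() - x.min())

# standard: subtract the mean, divide by the (population) standard deviation
standard_manual = (x - x.mean()) / x.std()

# robust: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
robust_manual = (x - np.median(x)) / (q3 - q1)

minmax_sklearn = MinMaxScaler().fit_transform(X).ravel()
standard_sklearn = StandardScaler().fit_transform(X).ravel()
robust_sklearn = RobustScaler().fit_transform(X).ravel()

print(np.allclose(minmax_manual, minmax_sklearn))
print(np.allclose(standard_manual, standard_sklearn))
print(np.allclose(robust_manual, robust_sklearn))
```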

In [30]:
scaling_example = pd.DataFrame()
scaling_example['x1'] = np.arange(1, 11)
scaling_example['x2'] = [-100, -1, 0, 1, 2, 3, 4, 5, 100, 1000]

When scaling a single column, make sure to pass it to the scaler object as a dataframe, not as a series.

In [32]:
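# scaling this
# one column: double brackets give a dataframe (2D) instead of a series (1D)
scaler = MinMaxScaler()
scaler.fit_transform(scaling_example[['x1']])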

#### linear scalers for both columns

In [99]:
# scaler = MinMaxScaler()
# scaling_example[['x1_minmax', 'x2_minmax']] = scaler.fit_transform(scaling_example[['x1','x2']])

In [100]:
# scaler = StandardScaler()
# scaling_example[['x1_standard', 'x2_standard']] = scaler.fit_transform(scaling_example[['x1','x2']])

In [34]:
from sklearn.preprocessing import RobustScaler

In [35]:
# scaler = RobustScaler()
# scaling_example[['x1_robust', 'x2_robust']] = scaler.fit_transform(scaling_example[['x1', 'x2']])

In [36]:
# scaling_example[sorted(scaling_example)]

In [37]:
# plt.scatter(scaling_example.x1, scaling_example.x2);

In [38]:
# plt.scatter(scaling_example.x1_minmax, scaling_example.x2_minmax);

In [39]:
# plt.scatter(scaling_example.x1_robust, scaling_example.x2_robust);

In [40]:
# plt.scatter(scaling_example.x1_standard, scaling_example.x2_standard);

## Non-linear Scaling

- The distance between points is **not** preserved, but order is
- Not as common as linear scalers
- In sklearn: 
    - power transformation: Box-Cox, Yeo-Johnson
    - quantile transformation
- Log

    $$ x' = \log_b{x} $$

    $$ b^{x'} = x $$

    Sometimes you can just set the x/y scale w/ matplotlib instead of
    actually transforming the data
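
A small sketch of the points above, on a hypothetical sorted column: the log transform and its inverse round-trip, the gaps between points change, and the order of the points survives each non-linear transformation. The `PowerTransformer` and `QuantileTransformer` settings here are just one reasonable choice for a tiny example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Hypothetical values, already sorted in increasing order
x = np.array([1.0, 2.0, 5.0, 10.0, 100.0, 1000.0])
X = x.reshape(-1, 1)  # sklearn transformers expect a 2D array

# log scaling with base 10, and its inverse b ** x'
x_log = np.log10(x)
x_back = 10 ** x_log

# sklearn's non-linear transformers
x_power = PowerTransformer(method='yeo-johnson').fit_transform(X).ravel()
x_quantile = QuantileTransformer(n_quantiles=len(x)).fit_transform(X).ravel()

print('round trip recovers x:', np.allclose(x_back, x))
print('gaps before:', np.diff(x))
print('gaps after log:', np.diff(x_log))

# order is preserved: the sorted input stays sorted after each transformation
for name, values in [('log', x_log), ('power', x_power), ('quantile', x_quantile)]:
    print(name, 'keeps order:', bool(np.all(np.diff(values) >= 0)))
```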

In [44]:
np.random.seed(1)
n = 100

df = pd.DataFrame()
df['x1'] = np.random.randn(n)
df['x2'] = 10 ** (df.x1 + np.random.randn(n) * .5)

In [45]:
# fig, ax = plt.subplots(figsize=(10,5))
# ax.scatter(df.x1, df.x2)
# plt.show()

In [46]:
# fig, ax = plt.subplots(figsize=(10,5))
# ax.scatter(df.x1, df.x2)
# ax.set_yscale('log')

If we want to predict $y$ based on $x$ and there is a linear relationship between them, the model

$$ y = mx + b $$

works pretty well.

However, if the relationship is not linear, for example when $x$ grows exponentially as $y$ increases, that model does not make great predictions. Instead, we can transform $x$:

$$ y = m\log(x) + b $$

and still get decent predictions.
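
To make this concrete, here is a rough sketch that regenerates data the same way as the cell above and fits a straight line twice with `np.polyfit`: once on the raw feature and once on its base-10 log. The helper `fit_line_mse` is a hypothetical name introduced for this illustration, and the error is measured on the same data used for fitting, so treat it as a comparison of fit rather than a proper evaluation.

```python
import numpy as np

np.random.seed(1)
n = 100

# Same recipe as the cell above: y plays the role of x1, x the role of x2
y = np.random.randn(n)
x = 10 ** (y + np.random.randn(n) * .5)

def fit_line_mse(feature, target):
    """Fit target = m * feature + b by least squares; return mean squared error."""
    m, b = np.polyfit(feature, target, deg=1)
    predictions = m * feature + b
    return np.mean((target - predictions) ** 2)

mse_raw = fit_line_mse(x, y)            # y = m * x + b
mse_log = fit_line_mse(np.log10(x), y)  # y = m * log10(x) + b

print('MSE with raw x:  ', mse_raw)
print('MSE with log10 x:', mse_log)
```

The line fit on the log-transformed feature should show a noticeably lower error, because the log turns the exponential growth in `x` back into something a straight line can follow.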

## Recap

- Generally use unscaled data for exploration
- Use scaled data for modeling, typically min-max scaler
- Learn parameters for scaling from the training split