# Scaling Features

### Introduction

If we review our process, we can see that we have already been performing feature selection.  Remember that we first loaded our data and fit a model.

In [2]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()

columns = dataset['feature_names']

X = pd.DataFrame(dataset['data'], columns = columns)
y = pd.Series(dataset['target'])

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

And then we sorted the features in our model by the size of the coefficients.

In [3]:
import numpy as np
sorted_coefs = np.argsort(model.coef_)[::-1]
sorted_cols = np.array(columns)[sorted_coefs]
sorted_cols, model.coef_

(array(['AveBedrms', 'MedInc', 'HouseAge', 'Population', 'AveOccup',
        'AveRooms', 'Latitude', 'Longitude'], dtype='<U10'),
 array([ 4.36693293e-01,  9.43577803e-03, -1.07322041e-01,  6.45065694e-01,
        -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01]))

The logic here is that the larger the coefficient, the more important the feature.  So let's try adding more features, from most important to least important, to see if we continue to improve our model's performance on our validation and test data.

Unfortunately, this approach is slightly wrong, and in this lesson we'll see why.

### A new model

Let's say that we are again predicting the prices of homes, and we have a linear model in the following form: 



$$ \hat{price} =  12*sq\_feet + 8*num\_of\_bathrooms$$ 

This says that a one unit increase in square feet increases our predicted price by 12 dollars, and that adding one bathroom increases the predicted price by eight dollars.  Following the logic above, our most important features would simply be the features with the largest coefficients, so we would just select those.

To see the problem with this, let's take a look at some data.

In [19]:
import pandas as pd
df = pd.read_csv('./bathrooms_house_size.csv', index_col = 0)

In [20]:
X = df[['bathrooms', 'square_feet']]
y = df['price']

Our dataset consists of just two features - the number of bathrooms and the number of square feet.

In [21]:
X[:2]

|   | bathrooms | square_feet |
|---|-----------|-------------|
| 2 | 1.0       | 720.0       |
| 3 | 1.0       | 0.0         |


And our target is the listing price.

In [22]:
y[:2]

2    90.0
3    26.0
Name: price, dtype: float64

Now, we fit our model with the following:

In [23]:
from sklearn.linear_model import LinearRegression
model_1 = LinearRegression()
model_1.fit(X, y)
model_1.coef_

array([4.50945542e+01, 3.79320162e-02])

### Introducing a problem

But now let's say that instead of using square feet for the size, we use square meters.  We already have this data loaded.

In [24]:
df[:3]

|   | bathrooms | square_feet | square_meters | price |
|---|-----------|-------------|---------------|-------|
| 2 | 1.0       | 720.0       | 66.889632     | 90.0  |
| 3 | 1.0       | 0.0         | 0.0           | 26.0  |
| 8 | 1.0       | 0.0         | 0.0           | 90.0  |


And now let's refit our model.

In [38]:
from sklearn.linear_model import LinearRegression
model_2 = LinearRegression()
model_2.fit(df[['bathrooms', 'square_meters']], y)
model_2.coef_

array([45.09455419,  0.40830022])

In [27]:
model_1.coef_

array([4.50945542e+01, 3.79320162e-02])

> Take a look at the change in the second coefficient, which represents the size of the house.

So the point is that just by going from square feet to square meters, we increased the coefficient on the size of our listing from about `0.038` to about `0.408`, more than ten times larger.  This makes sense: one square meter is about 10.76 square feet, so increasing the size by one square meter is a bigger deal than increasing it by one square foot.  Notice that the coefficient on bathrooms did not change.

However, if we use coefficient sizes to rank our features, our rankings will arbitrarily depend on the units of each feature.  For example, if we were to convert our listing sizes to square kilometers, the size coefficient would grow by a factor of one million, making size look far more important than the number of bathrooms.
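To make the unit dependence concrete, here is a minimal sketch using synthetic listings (the data, the seed, and the `fit_coefs` helper are made up for illustration and are not the lesson's CSV).  It fits the same model with size expressed in square feet, square meters, and square kilometers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

SQ_FEET_PER_SQ_METER = 10.7639  # approximate conversion factor

# Synthetic listings: assumed values, only for illustration.
rng = np.random.default_rng(0)
bathrooms = rng.integers(1, 4, size=200).astype(float)
square_feet = rng.uniform(300, 2500, size=200)
price = 45 * bathrooms + 0.04 * square_feet + rng.normal(0, 5, size=200)

def fit_coefs(size_column):
    """Fit price ~ bathrooms + size and return [bathroom_coef, size_coef]."""
    X = np.column_stack([bathrooms, size_column])
    return LinearRegression().fit(X, price).coef_

square_meters = square_feet / SQ_FEET_PER_SQ_METER
square_km = square_meters / 1e6

coef_feet = fit_coefs(square_feet)
coef_meters = fit_coefs(square_meters)
coef_km = fit_coefs(square_km)

print(coef_feet)    # size coefficient is per square foot
print(coef_meters)  # size coefficient is scaled up by the conversion factor
print(coef_km)      # size coefficient now dwarfs the bathroom coefficient
```

The bathroom coefficient stays the same in all three fits, while the size coefficient is rescaled by exactly the unit conversion, so a ranking by raw coefficient flips depending on the units we happen to choose.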

### Our fix

So currently we are ranking our features by asking how much our target changes per one unit increase in the feature.  But this is too dependent on our choice of units.

Instead, what we should really ask is the following:

> How much does our target variable change **given an expected amount of movement** in the feature?

So if there's a normal amount of change in the size of a listing and this normal change leads to a large change in price, then size is something we want in our model.  But if this normal change in the size of the listing doesn't lead to a large change in price, then we don't need to include the listing size in the model.

Of course we have a name for "normal change": it's the standard deviation.  So instead of measuring the target's response to an increase in size by a square foot or a square meter, we measure the change in the target per one standard deviation change in our feature.

And this means that we can express each of our feature variables in units of its own standard deviation by simply calculating the z-score.

> Remember: The z-score of a value is the number of standard deviations that the value lies from the mean of its feature.
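A quick way to see why this helps is to multiply each raw coefficient by its feature's standard deviation, which gives the change in the target per one standard deviation of movement.  The sketch below uses synthetic data and a hypothetical `effect_per_std` helper (neither comes from the lesson) to check that this quantity is the same whether size is measured in feet or meters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic listings: assumed values, only for illustration.
rng = np.random.default_rng(1)
bathrooms = rng.integers(1, 4, size=200).astype(float)
square_feet = rng.uniform(300, 2500, size=200)
square_meters = square_feet / 10.7639  # approximate conversion factor
price = 45 * bathrooms + 0.04 * square_feet + rng.normal(0, 5, size=200)

def effect_per_std(size_column):
    """Return each coefficient multiplied by its feature's standard deviation."""
    X = np.column_stack([bathrooms, size_column])
    coefs = LinearRegression().fit(X, price).coef_
    return coefs * X.std(axis=0)

effect_feet = effect_per_std(square_feet)
effect_meters = effect_per_std(square_meters)

print(effect_feet)
print(effect_meters)
```

Both print the same pair of numbers: changing units shrinks the coefficient and stretches the standard deviation by the same factor, so their product does not move.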

### Reviewing the Z-score

Here's the formula for translating each value of a feature into its number of standard deviations from the average.

$z = \frac{X - \hat{X}}{\sigma}$

Here $\hat{X}$ is the average of the feature and $\sigma$ is the feature's standard deviation.  We first calculate the difference between each value and the average for that feature, and then divide this difference by the feature's standard deviation.

Let's do this for square feet.

In [30]:
sq_feet = df.square_feet
sq_feet[:3]

2    720.0
3      0.0
8      0.0
Name: square_feet, dtype: float64

In [31]:
sq_feet_x_hat = sq_feet.mean()
sq_feet_x_hat
# 465.40

sq_feet.std()
# 503.49

503.49193483870715

In [32]:
z_score = (sq_feet - sq_feet_x_hat)/sq_feet.std()

In [33]:
z_score[0:3]

2    0.505666
3   -0.924347
8   -0.924347
Name: square_feet, dtype: float64
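As a check on these rows, we can plug the rounded mean (465.40) and standard deviation (503.49) from above into the formula by hand.  Because the inputs are rounded, the results only match to about four decimal places.

$$ z_{720} = \frac{720 - 465.40}{503.49} \approx 0.5057 \qquad z_{0} = \frac{0 - 465.40}{503.49} \approx -0.9243 $$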

Notice that we get the same z-scores for our feature variables regardless of the units that we use.

In [37]:
sq_meters = df.square_meters
sq_meters_x_hat = sq_meters.mean()

meters_z_score = (sq_meters - sq_meters_x_hat)/sq_meters.std()
meters_z_score[:3]

2    0.505666
3   -0.924347
8   -0.924347
Name: square_meters, dtype: float64

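Of course, we could have just used scipy to make this calculation.  The values below differ slightly from our manual result in the third decimal place, because `zscore` divides by the population standard deviation (`ddof=0`) by default, while the pandas `std()` method uses the sample standard deviation (`ddof=1`) by default.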

In [44]:
from scipy.stats import zscore
sqfeet_z_score = zscore(sq_feet)

sqfeet_z_score[0:3]

array([ 0.50623365, -0.92538518, -0.92538518])

> But it's always good to review the calculation.
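The scipy values above are close to, but not identical with, our manual z-scores.  The two libraries use different defaults for the standard deviation: pandas `std()` uses `ddof=1` and `scipy.stats.zscore` uses `ddof=0`, which would account for the gap.  Here is a small sketch on made-up numbers (not the lesson's data) showing that the results line up once the `ddof` values match.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up sizes, only for illustration.
sizes = pd.Series([720.0, 0.0, 0.0, 1100.0, 450.0])

# Manual z-scores with pandas: std() defaults to the sample version (ddof=1).
manual_sample = (sizes - sizes.mean()) / sizes.std()
# Same calculation with the population standard deviation (ddof=0).
manual_population = (sizes - sizes.mean()) / sizes.std(ddof=0)

# scipy defaults to ddof=0; passing ddof=1 matches the pandas default.
scipy_default = zscore(sizes.to_numpy())
scipy_sample = zscore(sizes.to_numpy(), ddof=1)

print(np.allclose(scipy_default, manual_population))
print(np.allclose(scipy_sample, manual_sample))
```

With only a few rows the two versions differ noticeably; with hundreds of rows, as in the lesson's data, the difference shrinks to the third decimal place.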

So if we change all of the units to z-scores, we can measure the effect of each feature changing by one standard deviation.

### Using SKLearn

The scikit-learn library has its own tool for changing each of our feature variables into their respective z-scores: the `StandardScaler` class.

In [41]:
X = df[['bathrooms', 'square_meters']]
y = df['price']

In [43]:
from sklearn.preprocessing import StandardScaler

In [54]:
scaler = StandardScaler()
scaler.fit(X)
scaled_X = scaler.transform(X)

In [55]:
scaled_X[:3]

array([[-0.35443162,  0.50623365],
       [-0.35443162, -0.92538518],
       [-0.35443162, -0.92538518]])

In [50]:
from sklearn.model_selection import train_test_split

In [56]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.33, random_state=42)

In [57]:
scaled_model = LinearRegression()
scaled_model.fit(X_train, y_train)
scaled_model.coef_

array([14.74860074, 18.08278766])

So now we can see that a one standard deviation change in the number of bathrooms leads to a 14.75 change in the price, and a one standard deviation change in the size of the house leads to an 18.08 change in the price.  So it appears that size is more important than the number of bathrooms.
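Above, we fit the scaler on all of the rows before splitting.  A common alternative is to fit the scaler on the training rows only, so that the test rows do not influence the mean and standard deviation.  The sketch below shows one way to do that with a scikit-learn pipeline; the data is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic listings: assumed values, only for illustration.
rng = np.random.default_rng(2)
bathrooms = rng.integers(1, 4, size=300).astype(float)
square_meters = rng.uniform(30, 230, size=300)
X = np.column_stack([bathrooms, square_meters])
y = 45 * bathrooms + 0.4 * square_meters + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The pipeline fits the scaler on the training rows only, then fits the model.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
scaler_step = pipeline.named_steps["standardscaler"]
model_step = pipeline.named_steps["linearregression"]

print(model_step.coef_)                # change in price per standard deviation
print(pipeline.score(X_test, y_test))  # test rows are scaled with training statistics
```

The coefficients are still read the same way, as the change in price per one standard deviation of each feature.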

### Undoing our changes

If we'd like to change our data back, we can use our scaler to achieve this.

In [60]:
scaler.inverse_transform(scaled_X)[:2]

array([[ 1.        , 66.88963211],
       [ 1.        ,  0.        ]])

And doing so, we are back to our original data.

In [61]:
X[:2]

|   | bathrooms | square_meters |
|---|-----------|---------------|
| 2 | 1.0       | 66.889632     |
| 3 | 1.0       | 0.0           |
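The same scaler can also translate the scaled coefficients back into the original units: dividing each scaled coefficient by the feature's standard deviation (stored in `scaler.scale_`) gives the change in the target per one original unit.  Here is a minimal sketch on synthetic data (assumed values, not the lesson's CSV) that checks this against a model fit directly on the unscaled features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic listings: assumed values, only for illustration.
rng = np.random.default_rng(3)
bathrooms = rng.integers(1, 4, size=200).astype(float)
square_meters = rng.uniform(30, 230, size=200)
X = np.column_stack([bathrooms, square_meters])
y = 45 * bathrooms + 0.4 * square_meters + rng.normal(0, 5, size=200)

scaler = StandardScaler().fit(X)
scaled_X = scaler.transform(X)

scaled_model = LinearRegression().fit(scaled_X, y)
raw_model = LinearRegression().fit(X, y)

# Per-standard-deviation coefficients divided by each standard deviation
# give per-unit coefficients in the original units.
coef_in_original_units = scaled_model.coef_ / scaler.scale_

print(coef_in_original_units)
print(raw_model.coef_)

# inverse_transform multiplies by scale_ and adds mean_ back.
restored_X = scaler.inverse_transform(scaled_X)
```

So scaling changes how we read the coefficients, not what the model predicts.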


### Summary

In this lesson, we saw how we can scale our data to help us rank our feature importances.  As we saw, linear regression suggests we can determine feature importances right out of the box.  This is because our calculated parameters indicate the expected change in our target from a one unit change in the respective feature.  So to determine feature importances, it seems we should simply sort the features by their coefficients.

The problem is that a one unit change in the feature variable is pretty arbitrary.  Because we can change the units of any variable, we can correspondingly change the coefficients.  To fix this, we put each of our features into comparable units by replacing each feature variable with its respective z-score, which measures how many standard deviations each value lies from the mean.  This is useful because the coefficients then describe changes in our target given standard movements in the feature variables.