### Codio Assignment 9.3: Using StandardScaler

**Estimated Time: 45 Minutes**

**Total Points: 40**


This activity focuses on using the `StandardScaler` to scale the data by converting it to $z$-scores.  To begin, you will scale data using just NumPy functions.  Then, you will use the scikit-learn transformer and incorporate it into a `Pipeline` with a `Ridge` regression model.  

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

### The Dataset

For this example, we will use a housing dataset that is part of the scikitlearn datasets module.  The dataset is chosen because we have multiple features on very different scales.  It is loaded and explored below -- your task is to predict `MedHouseVal` using all the other features after scaling and applying regularization with the `Ridge` estimator. 

In [2]:
cali = fetch_california_housing(as_frame=True)

In [3]:
cali.frame.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [4]:
print(cali.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [5]:
cali.frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [6]:
X = cali.frame.drop('MedHouseVal', axis = 1)
y = cali.frame['MedHouseVal']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

### Problem 1

#### Scaling the Train data

**10 Points**

Recall that **standard scaling** consists of subtracting the feature mean from each datapoint and subsequently dividing by the standard deviation of the feature.  Below, you are to scale `X_train` by subtracting the mean and dividing by the standard deviation.  Be sure to use the `numpy` mean and standard deviation functions with default settings.  

Assign your results to `X_train_scaled` below.  

In [8]:
### GRADED

X_train_scaled = ''

### BEGIN SOLUTION
X_train_scaled = (X_train - np.mean(X_train))/np.std(X_train)
### END SOLUTION

# Answer check
print(X_train_scaled.mean())
print('-----------------')
print(X_train_scaled.std())

MedInc        -88.923785
HouseAge      -11.471106
AveRooms      -68.402542
AveBedrms    -384.740548
Population      1.100404
AveOccup      -13.750958
Latitude      -64.435112
Longitude    -146.214246
dtype: float64
-----------------
MedInc        1.000035
HouseAge      1.000035
AveRooms      1.000035
AveBedrms     1.000035
Population    1.000035
AveOccup      1.000035
Latitude      1.000035
Longitude     1.000035
dtype: float64


### Problem 2

#### Scale the test data

**10 Points**

To scale the test data, use the mean and standard deviation of the **training** data.  In practice, you would not have seen the test data, so you would not be able to compute its mean and deviation.  Instead, you assume it is similar to your train data and use what you know to scale it.  

Assign the response as an array to `X_test_scaled` below.

In [10]:
### GRADED

X_test_scaled = ''

### BEGIN SOLUTION
X_test_scaled = (X_test - np.mean(X_train))/np.std(X_train)
### END SOLUTION

# Answer check
print(X_test_scaled.mean())
print('-----------------')
print(X_test_scaled.std())

MedInc       -0.010885
HouseAge      0.016943
AveRooms     -0.012397
AveBedrms    -0.010116
Population   -0.007164
AveOccup     -0.013088
Latitude     -0.029355
Longitude     0.023962
dtype: float64
-----------------
MedInc        0.991144
HouseAge      0.992575
AveRooms      1.027790
AveBedrms     1.185468
Population    0.977216
AveOccup      0.122749
Latitude      1.000142
Longitude     1.000915
dtype: float64


### Problem 3

#### Using `StandardScaler`

**10 Points**

- Instantiate a `StandardScaler` transformer. Assign the result to `scaler`.
- Use the `.fit_transform` method on `scaler` to transform the training data. Assign the result to `X_train_scaled`.
- Use the `.transform` method on `scaler` to transform the test data. Assign the result to `X_test_scaled`.

In [11]:
### GRADED

scaler = ''
X_train_scaled = ''
X_test_scaled = ''

### BEGIN SOLUTION
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
### END SOLUTION

# Answer check
print(scaler.mean_)
print('----------')
print(scaler.scale_)

[ 3.87689155e+00  2.85753738e+01  5.43812463e+00  1.09803314e+00
  1.42792733e+03  3.11923650e+00  3.56506693e+01 -1.19584102e+02]
----------
[1.90484248e+00 1.26131971e+01 2.45348438e+00 4.47482496e-01
 1.14018573e+03 1.23732074e+01 2.13566827e+00 2.00286090e+00]


### Problem 4

#### Building a `Pipeline`

**15 Points**

Now, construct a pipeline with named steps `scaler` and `ridge` that takes in your data, applies the `StandardScaler`, and fits a `Ridge` model with default settings. Next, use the `fit` function to train this pipeline on `X_train` and `y_train`. Assign your pipeline to `scaled_pipe`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_train`. Assign your result to `train_preds`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_test`. Assign your result to `test_preds`.

Use the `mean_squared_error` function to compute the MSE between `y_train` and `train_preds`. Assign your result to `train_mse`.

Use the `mean_squared_error` function to compute the MSE between `y_test` and `test_preds`. Assign your result to `test_mse`.

In [13]:
### GRADED

scaled_pipe = ''
train_preds = ''
test_preds = ''
train_mse = ''
test_mse = ''

### BEGIN SOLUTION
scaled_pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())]).fit(X_train, y_train)
train_preds = scaled_pipe.predict(X_train)
test_preds = scaled_pipe.predict(X_test)
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
### END SOLUTION

# Answer check
print(f'Train MSE: {train_mse}')
print(f'Test MSE: {test_mse}')

Train MSE: 0.5233577493232344
Test MSE: 0.5305437338152265
