In [1]:
import pandas as pd
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv")

In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,LSTAT,MEDV
0,0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,4.98,24.0
1,1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,9.14,21.6
2,2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,4.03,34.7
3,3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,2.94,33.4
4,4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,5.33,36.2


## Data standardization
**Standardizing** data refers to rescaling each variable so that it has mean 0 and standard deviation 1, the same location and scale as a **standard** normal distribution. The transformation is linear, so it does not change the shape of a variable's distribution.

The [`StandardScaler`](http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML240ENSkillsNetwork34171862-2022-01-01#sklearn.preprocessing.StandardScaler) object in scikit-learn performs this transformation.
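To make the definition concrete, here is a minimal NumPy sketch of standardizing a single column by hand. The skewed sample `x` is synthetic (an assumption standing in for a column such as CRIM), and `skewness` is a helper defined only for this illustration. It suggests that the result has mean 0 and standard deviation 1 while the shape of the distribution is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed feature (hypothetical stand-in for a real column)
x = rng.exponential(scale=3.0, size=1000)

# Standardize: subtract the mean, divide by the population standard deviation
z = (x - x.mean()) / x.std()

def skewness(a):
    """Third standardized moment; 0 for a symmetric distribution."""
    centered = a - a.mean()
    return (centered ** 3).mean() / a.std() ** 3

print("mean:", round(z.mean(), 6), "std:", round(z.std(), 6))
print("skewness before:", skewness(x), "after:", skewness(z))
```

`StandardScaler` applies the same per-column formula, also using the population standard deviation.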


In [3]:
df.drop(columns='Unnamed: 0', inplace=True)

In [5]:
y_col = "MEDV"

X = df.drop(y_col, axis=1)
y = df[y_col]

**Import, fit, and transform using `StandardScaler`**

In [6]:
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X_ss = s.fit_transform(X)

In [7]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

y_col = "MEDV"
boston_data = df
X = boston_data.drop(y_col, axis=1)
y = boston_data[y_col]
lr.fit(X, y)
print(lr.coef_)  # raw-scale coefficients; the most negative is about -18.8 (NOX)

[-1.21388618e-01  4.69634633e-02  1.34676947e-02  2.83999338e+00
 -1.87580220e+01  3.65811904e+00  3.61071055e-03 -1.49075365e+00
  2.89404521e-01 -1.26819813e-02 -9.37532900e-01 -5.52019101e-01]


In [8]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X_ss = s.fit_transform(X)
lr2 = LinearRegression()
lr2.fit(X_ss, y)
print(lr2.coef_) # coefficients now "on the same scale"

[-1.04309742  1.09422031  0.0923018   0.72062826 -2.17148707  2.56771612
  0.10153691 -3.13599165  2.51742896 -2.13527146 -2.02770102 -3.93810517]
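The two sets of coefficients describe the same fitted model in different units: each standardized coefficient is the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this on synthetic data with plain least squares; the data, the `ols_coefs` helper, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two synthetic features on very different scales
X = np.column_stack([rng.normal(0, 0.1, n), rng.normal(0, 100.0, n)])
y = 5.0 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.1, n)

def ols_coefs(X, y):
    """Least-squares slopes; an intercept is fitted and then dropped."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

coef_raw = ols_coefs(X, y)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
coef_std = ols_coefs(X_std, y)

print("raw:", coef_raw)
print("standardized:", coef_std)
print("standardized == raw * std:", np.allclose(coef_std, coef_raw * X.std(axis=0)))
```

This is why NOX drops from about -18.8 to about -2.2 after scaling: its values vary over a very small range.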


### Exercise:

Based on these results, what is the most "impactful" feature (this is intended to be slightly ambiguous)? "In what direction" does it affect "y"?

**Hint:** Recall from last week that we can "zip up" the names of the features of a DataFrame `df` with a model `model` fitted on that DataFrame using:

```python
dict(zip(df.columns.values, model.coef_))
```

In [9]:
pd.DataFrame(zip(X.columns, lr2.coef_)).sort_values(by=1)

Unnamed: 0,0,1
11,LSTAT,-3.938105
7,DIS,-3.135992
4,NOX,-2.171487
9,TAX,-2.135271
10,PTRATIO,-2.027701
0,CRIM,-1.043097
2,INDUS,0.092302
6,AGE,0.101537
3,CHAS,0.720628
1,ZN,1.09422
8,RAD,2.517429
5,RM,2.567716


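Looking only at the magnitudes of the standardized coefficients, LSTAT, DIS, RM, and RAD are the most impactful features. LSTAT and DIS have negative coefficients, so higher values are associated with lower MEDV, while RM and RAD have positive coefficients. scikit-learn does not report the statistical significance of each coefficient, which would help strengthen or weaken this claim.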
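One way to make the "strength" reading explicit is to sort by absolute value rather than by signed value. The sketch below uses the "zip up" idea from the hint in plain Python; the coefficient values are copied from the printed `lr2.coef_` output, and the list names are chosen here for illustration.

```python
features = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM",
            "AGE", "DIS", "RAD", "TAX", "PTRATIO", "LSTAT"]
# Standardized coefficients, copied from the printed lr2.coef_ output
coefs = [-1.04309742, 1.09422031, 0.0923018, 0.72062826, -2.17148707, 2.56771612,
         0.10153691, -3.13599165, 2.51742896, -2.13527146, -2.02770102, -3.93810517]

# Pair each feature with its coefficient and rank by magnitude, largest first
ranked = sorted(zip(features, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked[:4]:
    direction = "lowers" if coef < 0 else "raises"
    print(f"{name}: {coef:+.3f} ({direction} the predicted MEDV)")
```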

### Lasso with and without scaling

We discussed Lasso in lecture.

Let's review together:

1.  What is different about Lasso vs. regular Linear Regression?
2.  Is standardization more or less important with Lasso vs. Linear Regression? Why?
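As a reference point for both questions, scikit-learn's documentation gives the objective that `Lasso` minimizes as the following, where $n$ is the number of samples, $w$ the coefficient vector, and $\alpha$ the regularization strength (these are the usual symbols, not necessarily the notation used in lecture):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Plain Linear Regression minimizes only the squared-error term. Because the penalty is applied to the coefficients themselves, a feature measured in small units needs a larger coefficient for the same effect and is penalized more, so the feature scales influence which coefficients Lasso shrinks.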


In [10]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

In [11]:
pf = PolynomialFeatures(degree=2, include_bias=False)
X_pf = pf.fit_transform(X)
X_pf_ss = s.fit_transform(X_pf)
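It helps to know how many columns the degree-2 expansion creates before reading the coefficient arrays that follow. The sketch below counts them in plain Python, assuming the 12 feature columns used above; the term labels are illustrative and are not the names scikit-learn generates.

```python
from itertools import combinations_with_replacement
from math import comb

features = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM",
            "AGE", "DIS", "RAD", "TAX", "PTRATIO", "LSTAT"]

# Degree-1 terms are the original features
degree1 = list(features)
# Degree-2 terms are all squares and pairwise products
degree2 = [f"{a}*{b}" for a, b in combinations_with_replacement(features, 2)]

n_terms = len(degree1) + len(degree2)
print(len(degree1), len(degree2), n_terms)

# Closed form: C(n + d, d) terms including the constant; include_bias=False drops it
print(comb(len(features) + 2, 2) - 1)
```

Going from 12 to 90 columns gives Lasso many candidate terms to set to zero.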

### Lasso

In [12]:
las = Lasso()
las.fit(X_pf_ss, y)
las.coef_ 

array([-0.        ,  0.        , -0.        ,  0.        , -0.        ,
        0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.98616746, -0.        , -0.        ,  0.        , -0.        ,
        0.06674111, -0.        , -0.        , -0.        , -0.10651683,
       -0.        , -0.        , -0.        , -0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.        ,  0.        , -0.        , -0.        , -0.        ,
       -0.06140363, -0.        , -0.        , -0.        , -0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        , -0.        ,  3.42859375, -0.        , -0.        ,
       -0.        , -0.        , -0.        , -3.62119188, -0.  

### Exercise

Compare

*   Sum of magnitudes of the coefficients
*   Number of coefficients that are zero

for Lasso with alpha 0.1 vs. 1.

Before doing the exercise, answer the following questions in one sentence each:

*   Which do you expect to have greater magnitude?
*   Which do you expect to have more zeros?


In [14]:
lasso1 = Lasso(alpha=0.1)
lasso1.fit(X_pf_ss, y)
print('sum of absolute coefficients:', abs(lasso1.coef_).sum())
print('number of nonzero coefficients:', (lasso1.coef_ != 0).sum())

sum of absolute coefficients: 26.905418282924128
number of nonzero coefficients: 21


In [15]:
las1 = Lasso(alpha = 1)
las1.fit(X_pf_ss, y)
print('sum of absolute coefficients:', abs(las1.coef_).sum())
print('number of nonzero coefficients:', (las1.coef_ != 0).sum())

sum of absolute coefficients: 8.427143528690644
number of nonzero coefficients: 7
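The pattern in these two results (a larger `alpha` gives a smaller total magnitude and fewer nonzero coefficients) can be illustrated with soft-thresholding. In the idealized case of uncorrelated, standardized features, the Lasso solution is the least-squares solution with every coefficient pulled toward zero by `alpha`. The sketch below is that simplified special case, not what `Lasso` computes on this dataset, and the starting coefficients are hypothetical.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Shrink each value toward zero by alpha; values within alpha of zero become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

# Hypothetical unpenalized coefficients
w_ols = np.array([3.0, -1.5, 0.8, -0.3, 0.05])

for alpha in (0.1, 1.0):
    w = soft_threshold(w_ols, alpha)
    print(f"alpha={alpha}: sum |w| = {np.abs(w).sum():.2f}, zeros = {int((w == 0).sum())}")
```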


In [16]:
from sklearn.metrics import r2_score
y_pre = las.predict(X_pf_ss)
r2_score(y,y_pre)

0.7184429194697673
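For readers who want to see what `r2_score` measures, here is a small NumPy sketch of the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2` and the toy values are assumptions made for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # error left after the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))      # perfect predictions
print(r2(y_true, [6.0] * 4))   # always predicting the mean
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean, so the value above says the model explains roughly 72% of the variance on the data it was fitted to.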

In [17]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, random_state=72018)
X_train_s = s.fit_transform(x_train)
las.fit(X_train_s, y_train)
X_test_s = s.transform(x_test)
y_pred = las.predict(X_test_s)
r2_score(y_test, y_pred)

0.6758004479036455
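The cell above fits the scaler on the training split and only applies it to the test split. In plain NumPy that pattern looks roughly like the sketch below, which uses synthetic arrays in place of `x_train` and `x_test`; all names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-ins for a training and a test split
X_train = rng.normal(10.0, 2.0, size=(70, 3))
X_test = rng.normal(10.0, 2.0, size=(30, 3))

# "fit": learn the statistics from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "transform": apply the same training statistics to both splits
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

print("train means:", X_train_s.mean(axis=0).round(6))
print("test means: ", X_test_s.mean(axis=0).round(6))
```

The test columns end up close to, but not exactly, mean 0. That is expected, because no information from the test split is used when learning the scaling.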

In [18]:
X_train_s = s.fit_transform(x_train)
lasso1.fit(X_train_s, y_train)
X_test_s = s.transform(x_test)
y_pred = lasso1.predict(X_test_s)
r2_score(y_test, y_pred)

0.7973812985084684

### Exercise

#### Part 1:

Do the same thing with Lasso of:

*   `alpha` of 0.001
*   Increase `max_iter` to 100000 to ensure convergence.

Calculate the $R^2$ of the model.

Feel free to copy-paste code from above, but write a one-sentence comment above each line of code explaining why you're doing what you're doing.

#### Part 2:

Do the same procedure as before, but with Linear Regression.

Calculate the $R^2$ of this model.

#### Part 3:

Compare the sums of the absolute values of the coefficients for both models, as well as the number of coefficients that are zero. Based on these measures, which model is a "simpler" description of the relationship between the features and the target?


In [19]:
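# Weak regularization; a higher max_iter gives the solver room to converge
lassoo1 = Lasso(alpha=0.001, max_iter=100000)
# Learn the scaling from the training split only, then standardize it
X_train_s = s.fit_transform(x_train)
# Fit the Lasso on the standardized training data
lassoo1.fit(X_train_s, y_train)
# Standardize the test split with the training-set statistics
X_test_s = s.transform(x_test)
# Predict on the held-out data
y_pred = lassoo1.predict(X_test_s)
# Score the held-out predictions with R^2
r2_score(y_test, y_pred)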


0.8917369917314388

In [20]:
lr = LinearRegression()
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

r2_score(y_test, y_pred_lr)

0.8799482868812627
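Part 3 has no worked cell above. One possible approach is a small helper that reports the two requested measures for any coefficient vector. The helper name `coef_summary`, its `tol` parameter, and the example vectors are hypothetical.

```python
import numpy as np

def coef_summary(coef, tol=0.0):
    """Return (sum of |coef|, number of coefficients with |coef| <= tol)."""
    coef = np.asarray(coef, dtype=float)
    return float(np.abs(coef).sum()), int((np.abs(coef) <= tol).sum())

# Hypothetical coefficient vectors, for illustration only
dense = np.array([14.3, -11.5, 21.8, -5.0, 4.1])
sparse = np.array([0.0, 0.0, 18.3, -5.4, 0.0])

print("dense :", coef_summary(dense))
print("sparse:", coef_summary(sparse))
```

In the notebook, calling `coef_summary(lassoo1.coef_)` and `coef_summary(lr.coef_)` would give the comparison the exercise asks for. The model with the smaller sum and more zeros is the "simpler" description.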

## L1 vs. L2 Regularization
As mentioned in the deck, `Lasso` and `Ridge` regression have the same syntax in scikit-learn.

Now we're going to compare the results from Ridge vs. Lasso regression:


In [22]:
# Weak regularization, using the same alpha value as the Lasso above
from sklearn.linear_model import Ridge

r = Ridge(alpha=0.001)
X_train_s = s.fit_transform(x_train)
r.fit(X_train_s, y_train)
X_test_s = s.transform(x_test)
y_pred_r = r.predict(X_test_s)

# Inspect the fitted coefficients
r.coef_

array([-1.43017082e+01,  1.15185474e+01, -2.17817850e+01,  5.00353736e+00,
       -4.13509201e+00,  1.30868122e+01,  1.94339817e+01, -2.26454525e+01,
        2.09366189e+01, -1.40106342e+00,  1.68964569e+01,  3.97219520e-01,
        6.49774506e-01,  2.42472380e-01,  6.66775946e-01,  1.38068884e+00,
       -8.37348303e+00,  6.53158555e+00, -7.65358814e-01, -1.16623097e+00,
       -2.52325047e+00, -8.00433885e+00,  2.15527434e+01,  4.50185260e+00,
       -8.12917429e-01, -3.79154996e-02,  2.91198781e-01, -9.77233905e+00,
        1.59002806e+00,  1.12634966e+00, -1.15591658e+00, -1.07548914e+00,
        4.43908941e+00, -4.07048351e+00, -1.23523954e+00,  8.08218938e+00,
       -3.14207888e-01, -8.01161891e+00,  1.28027050e+01,  8.12427445e+00,
        1.58713331e+00,  6.87224963e+00,  5.32012933e-02,  1.57809183e+00,
       -5.40970480e+00,  5.00353737e+00, -4.32883794e+00, -8.05472175e+00,
        2.13226638e+00,  1.62233746e-01,  8.33891342e-02, -9.76359999e-01,
        3.26407951e-01, -

In [24]:
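lassoo1  # same alpha value (0.001) as Ridge above; the two penalties are scaled differently, so the strengths are not directly comparable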

In [25]:
lassoo1.coef_

array([-0.00000000e+00,  0.00000000e+00, -1.82869276e+01,  5.40005682e+00,
       -0.00000000e+00,  1.11657677e+01,  8.69228971e+00, -1.79751693e+01,
        4.94276895e+00,  0.00000000e+00,  6.12489385e+00,  8.73861796e-01,
        9.74404935e-01,  2.13151397e-03,  0.00000000e+00,  1.73896404e+00,
       -9.99432645e+00,  4.92668212e+00, -5.52928098e-01, -7.25516204e-01,
       -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  4.08824127e+00,
        5.22040647e-01,  2.17945394e-01,  1.30088853e-01, -1.26984366e+00,
        4.91283934e-01,  7.66056624e-01, -8.31333579e-01, -7.08257356e-01,
        2.65589223e+00, -8.39092955e-01, -9.99905016e-01,  3.88446848e+00,
       -7.72469478e-01,  6.91632151e+00,  7.29412809e+00,  5.85602446e+00,
        1.40587620e+00,  6.37871457e+00,  5.95018329e-01, -2.08335499e+00,
       -5.58976948e+00,  1.45350681e+00, -3.36312999e+00, -7.49442274e+00,
        2.40744022e+00,  1.03079226e+00, -8.08815677e-01,  0.00000000e+00,
        1.11004858e+00, -
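The Ridge coefficients printed earlier contain no exact zeros, while the Lasso coefficients do. A sketch of why: the squared (L2) penalty shrinks coefficients smoothly toward zero without setting them to zero. The code below solves the textbook penalized least-squares problem in closed form on synthetic data; the data, the centering step, and the helper name `ridge_closed_form` are assumptions, and this is not a reproduction of scikit-learn's solver.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized synthetic features
y = X @ np.array([3.0, 0.0, -2.0, 0.0]) + rng.normal(0, 0.5, n)
yc = y - y.mean()                           # centering y stands in for the intercept

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in (0.001, 10.0, 1000.0):
    w = ridge_closed_form(X, yc, alpha)
    print(alpha, np.round(w, 4), "exact zeros:", int((w == 0).sum()))
```

Even the two features with no true effect keep small nonzero coefficients, whereas an L1 penalty can remove them entirely.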

In [27]:
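# Caution: X_pf_ss was standardized with full-data statistics and includes the
# training rows, so these are not held-out scores.
y_pred = r.predict(X_pf_ss)
print(r2_score(y, y_pred))

y_pred = lassoo1.predict(X_pf_ss)
print(r2_score(y, y_pred))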

0.904550293230595
0.9061473908134146


## Example: Does it matter when you scale?

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_ss, y, test_size=0.3, 
                                                    random_state=72018)

In [29]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
r2_score(y_test, y_pred)

0.6881141968532016

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [31]:
s = StandardScaler()
lr_s = LinearRegression()
X_train_s = s.fit_transform(X_train)
lr_s.fit(X_train_s, y_train)
X_test_s = s.transform(X_test)
y_pred_s = lr_s.predict(X_test_s)
r2_score(y_test, y_pred_s)

0.6881141968532016

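**Conclusion:** For Linear Regression, the raw predictions are the same whether you scale before or after splitting the data, because an ordinary least-squares fit with an intercept is unaffected by linear rescaling of the features. However, it matters for other algorithms, such as the regularized models above. Plus, as we'll see later, we can make scaling part of a `Pipeline`.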
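As a preview, here is a minimal sketch of scaling inside a pipeline, using scikit-learn's `make_pipeline` helper. The data is synthetic and the variable names are assumptions. The sketch also checks the conclusion: for Linear Regression, predictions with and without the scaling step agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic features on very different scales
X = rng.normal(loc=[0.0, 50.0, 300.0], scale=[1.0, 10.0, 100.0], size=(300, 3))
y = X @ np.array([2.0, -0.1, 0.01]) + rng.normal(0, 0.5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unscaled model versus a pipeline that learns the scaling from the training split
plain = LinearRegression().fit(X_train, y_train)
piped = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

same = np.allclose(plain.predict(X_test), piped.predict(X_test))
print("predictions match:", same)
```

Calling `fit` on the pipeline fits the scaler on the training data only, and `predict` reuses those statistics, so the separate `fit_transform` and `transform` calls are no longer needed.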
