## Variable magnitude

### Does the magnitude of the variable matter?

In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of w is partly determined by the magnitude of the units being used for x. If x is a distance variable, just changing the scale from kilometers to miles will cause a change in the magnitude of the coefficient.
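As a small illustration of this unit dependence, the sketch below fits the same relationship twice, once with distance in kilometers and once in miles. The data is synthetic and the variable names are invented for this example; the only point is that the fitted coefficient changes by the conversion factor while the relationship itself is unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends linearly on distance
rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 100, size=200)
y = 3.0 * distance_km + 5.0 + rng.normal(0, 1, size=200)

# Same distances expressed in miles
KM_PER_MILE = 1.609344
distance_miles = distance_km / KM_PER_MILE

# Fit one model per unit
coef_km = LinearRegression().fit(distance_km.reshape(-1, 1), y).coef_[0]
coef_miles = LinearRegression().fit(distance_miles.reshape(-1, 1), y).coef_[0]

# The ratio of the two coefficients equals the unit conversion factor
print("coefficient per km:  ", coef_km)
print("coefficient per mile:", coef_miles)
print("ratio:", coef_miles / coef_km)
```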

In addition, in situations where we estimate the outcome y by contemplating multiple predictors x1, x2, ...xn, predictors with greater numeric ranges dominate over those with smaller numeric ranges.

Gradient descent converges faster when all the predictors (x1 to xn) are on a similar scale, so having features on a similar scale is useful for Neural Networks as well.
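The effect on gradient descent can be sketched with a hand-written batch gradient descent on a least-squares problem. Everything here is an assumption made for illustration: the synthetic data, the helper name `gd_iterations`, the step size of 1 over the largest eigenvalue of the Hessian, and the iteration cap.

```python
import numpy as np

def gd_iterations(X, y, tol=1e-6, max_iter=20_000):
    """Batch gradient descent on mean squared error.

    Returns the number of iterations needed for the gradient norm to fall
    below `tol`, or `max_iter` if that never happens.
    """
    Xb = np.column_stack([np.ones(len(X)), X])    # add an intercept column
    n = len(y)
    hessian = Xb.T @ Xb / n
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # safe fixed step for this quadratic loss
    w = np.zeros(Xb.shape[1])
    for i in range(1, max_iter + 1):
        grad = Xb.T @ (Xb @ w - y) / n
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

# Two features on very different scales (synthetic)
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 200)
x2 = rng.uniform(0, 1000, 200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.003 * x2 + rng.normal(0, 0.1, 200)

# Standardize each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

iters_raw = gd_iterations(X, y)
iters_scaled = gd_iterations(X_scaled, y)
print("iterations, unscaled features:", iters_raw)
print("iterations, scaled features:  ", iters_scaled)
```

With unscaled features the loss surface is strongly elongated, so a step size small enough to be stable along the large-scale feature makes almost no progress along the small-scale one.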

In Support Vector Machines, feature scaling can decrease the time to find the support vectors.

Finally, methods using Euclidean distances or distances in general are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore feature scaling is required for methods that utilise distance calculations like k-nearest neighbours (KNN) and k-means clustering.
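To make the sensitivity of Euclidean distance concrete, the sketch below measures how much each feature contributes to the squared distance between two houses, before and after standardization. The five (size, year) rows are taken from the real estate data used later in this section; the helper `squared_distance_share` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# (size, year) for the first five rows of the real estate dataset
X = np.array([
    [643.09, 2015],
    [656.22, 2009],
    [487.29, 2018],
    [1504.75, 2015],
    [1275.46, 2009],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared_diff = (a - b) ** 2
    return squared_diff / squared_diff.sum()

X_scaled = StandardScaler().fit_transform(X)

# Rows 0 and 1 have almost the same size but are six years apart
share_raw = squared_distance_share(X[0], X[1])
share_scaled = squared_distance_share(X_scaled[0], X_scaled[1])
print("raw    (size, year):", share_raw.round(3))
print("scaled (size, year):", share_scaled.round(3))
```

On the raw values the small difference in size still outweighs the six-year gap, simply because size is measured in larger numbers; after scaling, the year difference accounts for nearly all of the distance.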

In summary:

#### Magnitude matters because:

- The regression coefficient is directly influenced by the scale of the variable
- Variables with bigger magnitude / value range dominate over the ones with smaller magnitude / value range
- Gradient descent converges faster when features are on similar scales
- Feature scaling helps decrease the time to find support vectors for SVMs
- Euclidean distances are sensitive to feature magnitude.

#### The machine learning models affected by the magnitude of the feature are:

- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)

#### Machine learning models insensitive to feature magnitude are the ones based on Trees:

- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees
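The insensitivity of tree-based models can be checked with a quick experiment. The sketch below uses synthetic data loosely shaped like a size/year/price problem (all numbers are made up) and fits the same decision tree on raw and on standardized features. Because standardization preserves the ordering of values within each feature, the tree is expected to find the same splits and return the same predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
size = rng.uniform(480, 1850, 200)
year = rng.integers(2006, 2019, 200).astype(float)
X = np.column_stack([size, year])
y = 220 * size + 1500 * year + rng.normal(0, 30_000, 200)

X_scaled = StandardScaler().fit_transform(X)

# Same model settings, one tree per version of the features
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
print("identical predictions:", np.allclose(pred_raw, pred_scaled))
```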

# Feature scaling with sklearn 

You are given a real estate dataset. 

Real estate is one of those examples that every regression course goes through, as it is extremely easy to understand and there is almost always a causal relationship to be found.

## Standard Scaler

Sklearn's main scaler, the StandardScaler, standardizes each feature: it centers the feature on its mean and scales it to unit variance, using the following formula, where u is the mean and s is the standard deviation of the feature.
x_scaled = (x - u) / s
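The formula can be checked by hand against `StandardScaler`. The sketch below uses the five (size, year) rows displayed further down in this notebook; note that `StandardScaler` uses the population standard deviation (NumPy's default `ddof=0`), not the sample standard deviation that pandas reports in `describe()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# (size, year) for the first five rows of the real estate dataset
X = np.array([
    [643.09, 2015],
    [656.22, 2009],
    [487.29, 2018],
    [1504.75, 2015],
    [1275.46, 2009],
], dtype=float)

# Manual standardization: x_scaled = (x - u) / s, column by column
u = X.mean(axis=0)
s = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - u) / s

# The same transformation with sklearn
scaler = StandardScaler().fit(X)
X_sklearn = scaler.transform(X)

print("same result:", np.allclose(X_manual, X_sklearn))
print("learned mean:", scaler.mean_)
print("learned std: ", scaler.scale_)
```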


## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

# to scale the features
from sklearn.preprocessing import StandardScaler

# to separate into train and test set
from sklearn.model_selection import train_test_split

## Load the data

In [2]:
data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\python code\\real_estate_price_size_year.csv')
data.head()

Unnamed: 0,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009


In [3]:
# let's have a look at the values of those variables
# to get an idea of the feature magnitudes

data.describe()

Unnamed: 0,price,size,year
count,100.0,100.0,100.0
mean,292289.47016,853.0242,2012.6
std,77051.727525,297.941951,4.729021
min,154282.128,479.75,2006.0
25%,234280.148,643.33,2009.0
50%,280590.716,696.405,2015.0
75%,335723.696,1029.3225,2018.0
max,500681.128,1842.51,2018.0


We can see that size varies between 480 and 1843, and year between 2006 and 2018, so the variables have different magnitudes.

In [4]:
# let's now calculate the range

for col in ['size', 'year']:
    print(col, 'range: ', data[col].max() - data[col].min())

size range:  1362.76
year range:  12


The ranges of values that the two variables can take are quite different.

### Declare the dependent and the independent variables

In [5]:

x = data[['size','year']]
y = data['price']

In [6]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    x,y,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((70, 2), (30, 2))

### Linear Regression on unscaled variables

In [7]:
# model built on unscaled variables
# call the model
reg = LinearRegression()

# train the model
reg.fit(X_train, y_train)

# evaluate performance

pred_test = reg.predict(X_test)



In [8]:
pred_test

array([234789.76598258, 196736.22379736, 216244.65467349, 442856.59794862,
       260209.43470158, 263051.09589619, 261331.90196503, 225652.38478638,
       247325.31989282, 216734.47400903, 251779.26768641, 257755.83584681,
       246118.27118608, 332031.81135881, 228029.98997407, 314435.56779708,
       371539.85431297, 211412.73692404, 220330.64770139, 259495.13735212,
       390716.89984298, 371133.90271374, 437490.35182723, 233041.16299408,
       258592.14050752, 244150.91148572, 230330.8105939 , 246768.0946664 ,
       241395.33634217, 370677.07261666])

### Calculate the R-squared

In [9]:
reg.score(X_test,y_test)

0.7808422782486512

### Calculate the Adjusted R-squared

In [10]:
# Let's use the handy function we created
def adj_r2(X_test,y_test):
    r2 = reg.score(X_test,y_test)
    n = X_test.shape[0]
    p = X_test.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

In [11]:
# store the result under the name used in the summary table, then display it
Adj_R2 = adj_r2(X_test,y_test)
Adj_R2

0.7646083729337364
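The function above reads the fitted model from a global variable. A possible alternative, shown here only as a sketch, takes the R-squared, the number of observations n and the number of predictors p as plain arguments, so the same helper works for any model. The name `adjusted_r2` is hypothetical; the inputs are the test-set R-squared and the test-set shape (30 rows, 2 features) reported in this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Values taken from the notebook: R2 on the test set, 30 test rows, 2 features
value = adjusted_r2(0.7808422782486512, n_samples=30, n_features=2)
print(round(value, 6))  # 0.764608
```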

In [12]:
reg.intercept_

-2891242.2628917126

In [13]:
reg.coef_

array([ 221.83147499, 1486.31846289])

### Scale the inputs

In [14]:
# call the scaler
scaler = StandardScaler()

# fit the scaler
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [15]:
#let's have a look at the scaled training dataset

print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [-4.01266322e-16  2.07833750e-14]
Standard Deviation:  [1. 1.]
Minimum value:  [-1.23748987 -1.37274255]
Maximum value:  [3.29596754 1.18971021]


### Linear Regression on scaled variables


In [16]:
# model built on scaled variables
# call the model
reg_scaled = LinearRegression()

# train the model
reg_scaled.fit(X_train_scaled, y_train)

# evaluate performance

pred_test_scaled = reg_scaled.predict(X_test_scaled)

In [17]:

reg_scaled.intercept_

295040.45062857127

In [18]:

reg_scaled.coef_

array([64792.42758435,  6960.4489198 ])
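The scaled and unscaled fits describe the same model in different units. For ordinary least squares with standardized inputs, each scaled coefficient equals the unscaled coefficient multiplied by the standard deviation the scaler learned for that feature, and the intercept becomes the mean of the target in the training data. The sketch below checks this on synthetic data; the data-generating numbers are assumptions chosen only to resemble the size/year/price setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
size = rng.uniform(480, 1850, 100)
year = rng.integers(2006, 2019, 100).astype(float)
X = np.column_stack([size, year])
y = 220 * size + 1500 * year - 2.9e6 + rng.normal(0, 30_000, 100)

scaler = StandardScaler().fit(X)
model_raw = LinearRegression().fit(X, y)
model_scaled = LinearRegression().fit(scaler.transform(X), y)

# Scaled coefficient = unscaled coefficient * feature standard deviation
print(model_scaled.coef_)
print(model_raw.coef_ * scaler.scale_)

# With centered inputs, the intercept is the mean of the target
print(model_scaled.intercept_, y.mean())
```

Read this way, a scaled coefficient is the expected change in price for a one standard deviation change in the feature.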

### Calculate the R-squared

In [19]:
reg_scaled.score(X_test_scaled,y_test)

0.7808422782486508

### Calculate the Adjusted R-squared

In [29]:
# Let's use the handy function we created
def adj_r2(X_test_scaled,y_test):
    r2 = reg_scaled.score(X_test_scaled,y_test)
    n = X_test_scaled.shape[0]
    p = X_test_scaled.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

In [30]:
Adj_R2_scaled = adj_r2(X_test_scaled,y_test)

In [31]:
Adj_R2_scaled

0.7646083729337361

### Compare the R-squared and the Adjusted R-squared

It seems that the R-squared is only slightly larger than the Adjusted R-squared, implying that we were not penalized much for including 2 independent variables.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Comparing the Adjusted R-squared with the R-squared of the simple linear regression, we realize that 'year' is not adding much value to the result.

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [22]:
new_data = [[750,2009]]
new_data_scaled = scaler.transform(new_data)

In [23]:
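# new_data_scaled is standardized, so predict with the model trained on scaled inputs.
# The prediction is roughly 261,145, the same value the unscaled fit gives:
# intercept + 221.83147499 * 750 + 1486.31846289 * 2009
reg_scaled.predict(new_data_scaled)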
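One way to make sure a model only ever receives inputs prepared the way it was trained is to bundle the scaler and the regression into a single sklearn pipeline, so that `predict` applies the scaling automatically. The sketch below uses synthetic data (an assumption for illustration, not the notebook's dataset) and checks that the pipeline and a plain regression on raw inputs give the same prediction for a 750 sq.ft. apartment from 2009.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
size = rng.uniform(480, 1850, 100)
year = rng.integers(2006, 2019, 100).astype(float)
X = np.column_stack([size, year])
y = 220 * size + 1500 * year - 2.9e6 + rng.normal(0, 30_000, 100)

# The pipeline scales the inputs and then fits / applies the regression
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# A plain regression on the raw inputs, for comparison
model_raw = LinearRegression().fit(X, y)

new_data = [[750, 2009]]  # raw, unscaled values
pred_pipe = pipe.predict(new_data)
pred_raw = model_raw.predict(new_data)
print(pred_pipe, pred_raw)
```

Both predictions agree because standardization does not change what ordinary least squares can fit; the pipeline simply removes the need to remember which model expects which version of the inputs.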

### Create a summary table with your findings

In [33]:
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary ['Coefficients_scaled'] = reg_scaled.coef_
reg_summary ['Coefficients'] = reg.coef_
reg_summary ['R2_adjusted_scaled'] = Adj_R2_scaled
reg_summary ['R2_adjusted'] = Adj_R2
reg_summary

Unnamed: 0,Features,Coefficients_scaled,Coefficients,R2_adjusted_scaled,R2_adjusted
0,size,64792.427584,221.831475,0.764608,0.764608
1,year,6960.44892,1486.318463,0.764608,0.764608


We observe that the performance of the linear regression did not change when the features were scaled (compare the R2_adjusted and R2_adjusted_scaled values, both computed on the test set).

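However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of each variable was affecting its coefficient. After scaling, both coefficients are expressed in the same units (the change in price for a one standard deviation change in the feature), so they can be compared directly: size (about 64,792) has a much larger effect on price than year (about 6,960).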