# Linear Transformations - Lab

## Introduction

In this lab, you'll practice your linear transformation skills!

## Objectives

You will be able to:

* Determine if a linear transformation would be useful for a specific model or set of data
* Identify an appropriate linear transformation technique for a specific model or set of data
* Apply linear transformations to independent and dependent variables in linear regression
* Interpret the coefficients of variables that have been transformed using a linear transformation

## Ames Housing Data

Let's look at the Ames Housing data, where each record represents a home sale:

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

ames = pd.read_csv('ames.csv', index_col=0)
ames

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


We'll use this subset of features. These are specifically the _continuous numeric_ variables, which means that we'll hopefully have meaningful mean values.

From the data dictionary (`data_description.txt`):

```
LotArea: Lot size in square feet

MasVnrArea: Masonry veneer area in square feet

TotalBsmtSF: Total square feet of basement area

GrLivArea: Above grade (ground) living area square feet

GarageArea: Size of garage in square feet
```

In [7]:
ames = ames[[
    "LotArea",
    "MasVnrArea",
    "TotalBsmtSF",
    "GrLivArea",
    "GarageArea",
    "SalePrice"
]].copy()
ames

Id,LotArea,MasVnrArea,TotalBsmtSF,GrLivArea,GarageArea,SalePrice
1,8450,196.0,856,1710,548,208500
2,9600,0.0,1262,1262,460,181500
3,11250,162.0,920,1786,608,223500
4,9550,0.0,756,1717,642,140000
5,14260,350.0,1145,2198,836,250000
...,...,...,...,...,...,...
1456,7917,0.0,953,1647,460,175000
1457,13175,119.0,1542,2073,500,210000
1458,9042,0.0,1152,2340,252,266500
1459,9717,0.0,1078,1078,240,142125


We'll also drop any records with missing values for any of these features:

In [9]:
ames.dropna(inplace=True)
ames

Id,LotArea,MasVnrArea,TotalBsmtSF,GrLivArea,GarageArea,SalePrice
1,8450,196.0,856,1710,548,208500
2,9600,0.0,1262,1262,460,181500
3,11250,162.0,920,1786,608,223500
4,9550,0.0,756,1717,642,140000
5,14260,350.0,1145,2198,836,250000
...,...,...,...,...,...,...
1456,7917,0.0,953,1647,460,175000
1457,13175,119.0,1542,2073,500,210000
1458,9042,0.0,1152,2340,252,266500
1459,9717,0.0,1078,1078,240,142125


And plot the distributions of the un-transformed variables:

In [11]:
ames.hist(figsize=(15,10), bins="auto");

## Step 1: Build an Initial Linear Regression Model

`SalePrice` should be the target, and all other columns in `ames` should be predictors.

In [13]:
# Your code here - build a linear regression model with un-transformed features

import statsmodels.api as sm

# Defining the target and features
X = ames.drop("SalePrice", axis=1)
y = ames["SalePrice"]

# Adding a constant to the model 
X = sm.add_constant(X)

# Fitting the model
model = sm.OLS(y, X)
results = model.fit()

# Summary results
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.676
Model:,OLS,Adj. R-squared:,0.675
Method:,Least Squares,F-statistic:,603.0
Date:,"Wed, 16 Apr 2025",Prob (F-statistic):,0.0
Time:,10:32:30,Log-Likelihood:,-17622.0
No. Observations:,1452,AIC:,35260.0
Df Residuals:,1446,BIC:,35290.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.525e+04,4145.934,-3.677,0.000,-2.34e+04,-7113.396
LotArea,0.2568,0.125,2.056,0.040,0.012,0.502
MasVnrArea,55.0481,7.427,7.412,0.000,40.480,69.616
TotalBsmtSF,44.1640,3.324,13.286,0.000,37.643,50.685
GrLivArea,63.8443,2.772,23.030,0.000,58.406,69.282
GarageArea,93.4629,6.795,13.755,0.000,80.134,106.792

Omnibus:,817.744,Durbin-Watson:,1.991
Prob(Omnibus):,0.0,Jarque-Bera (JB):,77147.499
Skew:,-1.709,Prob(JB):,0.0
Kurtosis:,38.546,Cond. No.,50900.0
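
Several of the headline numbers in this summary are tied together by simple formulas. As a rough check (a sketch that uses the rounded values printed above, so the results only agree to a few digits), adjusted R-squared and the F-statistic can be recomputed from R-squared, the number of observations, and the number of predictors:

```python
# Values copied from the summary above (rounded as printed)
r_squared = 0.676
n_obs = 1452         # No. Observations
n_predictors = 5     # Df Model
df_resid = n_obs - n_predictors - 1  # 1446, matches Df Residuals

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor divided by
# unexplained variance per residual degree of freedom
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

print(round(adj_r_squared, 3))
print(round(f_statistic, 1))
```

Because the inputs are rounded, the F-statistic computed this way lands near, not exactly on, the reported 603.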


## Step 2: Evaluate Initial Model and Interpret Coefficients

Describe the model performance overall and interpret the meaning of each predictor coefficient. Make sure to refer to the explanations of what each feature means from the data dictionary!

### Your written answer here

1. **Model Performance (Untransformed Model)**
- **R-squared = 0.676:** The model explains about 67.6% of the variance in house sale prices, which is a moderately strong fit for a first linear model.
- **Adj. R-squared = 0.675:** Almost identical to R-squared, so the penalty for including five predictors is negligible.
- **F-statistic = 603, p-value < 0.001:** The model is statistically significant, meaning at least one predictor is significantly related to SalePrice.
- **Durbin-Watson = 1.991:** Suggests no major autocorrelation in the residuals.
- **JB = 77,147, Skew = -1.709, Kurtosis = 38.546:** The residuals are far from normally distributed (left-skewed with very heavy tails), possibly due to skewed target or predictor variables.
- **Condition Number = 50,900:** This is high, which can signal multicollinearity or numerical instability. Uncentered predictors measured on large scales, as here, can also inflate it.

2. **Interpretation of Coefficients**
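- **Intercept (-15,250):** The expected SalePrice when all predictors are 0. This is not meaningful in a real-world context, since no home in the data has zero area on every measure and a sale price cannot be negative.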

- **LotArea (+0.2568):** For each additional square foot of land, SalePrice increases by about $0.26, holding other features constant. 

- **MasVnrArea (+55.05):** Each additional square foot of masonry veneer (decorative brick/stone on the house) adds about $55 to SalePrice.

- **TotalBsmtSF (+44.16):** For every extra square foot of basement area, SalePrice increases by about $44.

- **GrLivArea (+63.84):** Each additional square foot of above-ground living area increases SalePrice by approximately $64.

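- **GarageArea (+93.46):** Each additional square foot of garage area increases SalePrice by about $93, the largest per-square-foot coefficient in the model. This alone does not make it the most important feature, because the predictors vary over very different ranges.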

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The model overall is statistically significant and explains about 68% of the variance in sale price.

The coefficients are all statistically significant at an alpha of 0.05 (`LotArea` has the largest p-value, 0.040).

* `LotArea`: for each additional square foot of lot area, the price increases by about \\$0.26
* `MasVnrArea`: for each additional square foot of masonry veneer, the price increases by about \\$55
* `TotalBsmtSF`: for each additional square foot of basement area, the price increases by about \\$44
* `GrLivArea`: for each additional square foot of above-grade living area, the price increases by about \\$64
* `GarageArea`: for each additional square foot of garage area, the price increases by about \\$93

</details>
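
One way to make these interpretations concrete is to plug a single home into the fitted equation. The sketch below uses the rounded coefficients from the Step 1 summary and the feature values shown for the home with `Id` 1, so the result is approximate; the variable names are only illustrative.

```python
# Coefficients copied from the Step 1 summary (rounded as printed)
intercept = -15250.0
coefs = {
    "LotArea": 0.2568,
    "MasVnrArea": 55.0481,
    "TotalBsmtSF": 44.1640,
    "GrLivArea": 63.8443,
    "GarageArea": 93.4629,
}

# Feature values for the home with Id 1, copied from the table above
home = {
    "LotArea": 8450,
    "MasVnrArea": 196.0,
    "TotalBsmtSF": 856,
    "GrLivArea": 1710,
    "GarageArea": 548,
}

# Prediction = intercept + sum of (coefficient * feature value)
predicted_price = intercept + sum(coefs[name] * home[name] for name in coefs)
print(round(predicted_price))
```

The prediction comes out somewhat below the recorded sale price of 208,500 for this home, which is the kind of gap expected from a model that explains about two thirds of the variance.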

## Step 3: Express Model Coefficients in Metric Units

Your stakeholder gets back to you and says this is great, but they are interested in metric units.

Specifically they would like to measure area in square meters rather than square feet.

Report the same coefficients, except using square meters. You can do this by building a new model, or by transforming just the coefficients.

The conversion you can use is **1 square foot = 0.092903 square meters**.

In [18]:
# Your code here - building a new model or transforming coefficients
# from initial model so that they are in square meters

# Transforming Coefficients to Metric Units
sqft_to_sqm = 0.092903

# Original coefficients in square feet
coeffs_sqft = {
    "LotArea": 0.2568,
    "MasVnrArea": 55.0481,
    "TotalBsmtSF": 44.1640,
    "GrLivArea": 63.8443,
    "GarageArea": 93.4629
}

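# Converting to dollars per square meter: a square meter is larger than
# a square foot, so each per-square-foot coefficient is divided by the factor
coeffs_sqm = {feature: round(value / sqft_to_sqm, 2) for feature, value in coeffs_sqft.items()}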
coeffs_sqm

{'LotArea': 2.76,
 'MasVnrArea': 592.53,
 'TotalBsmtSF': 475.38,
 'GrLivArea': 687.21,
 'GarageArea': 1006.03}

### Your written answer here

**Interpretation:**
- **LotArea:** Each additional square meter of lot size is associated with a $2.76 increase in sale price, all else equal.

- **MasVnrArea:** Each square meter of masonry veneer adds approximately $592.53 to the house’s value.

- **TotalBsmtSF:** Each square meter of basement area adds about $475.38 to the sale price.

- **GrLivArea:** Above-ground living area increases sale price by roughly $687.21 per square meter.

- **GarageArea:** Garage area has the largest coefficient, about $1,006.03 per square meter.

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

* `LotArea`: for each additional square meter of lot area, the price increases by about \\$2.76
* `MasVnrArea`: for each additional square meter of masonry veneer, the price increases by about \\$593
* `TotalBsmtSF`: for each additional square meter of basement area, the price increases by about \\$475
* `GrLivArea`: for each additional square meter of above-grade living area, the price increases by about \\$687
* `GarageArea`: for each additional square meter of garage area, the price increases by about \\$1,006

</details>
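
Why divide the coefficients by the conversion factor rather than multiply? Multiplying a predictor by a constant divides its coefficient by the same constant and leaves the intercept alone. The sketch below checks this with a hand-written one-predictor least-squares helper (`fit_line`, a stand-in for `statsmodels`) and made-up data, so the numbers are illustrative only.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


SQFT_TO_SQM = 0.092903

# Made-up living areas (square feet) and prices, for illustration only
area_sqft = [1000, 1500, 1700, 2100, 2400]
price = [150000, 185000, 210000, 240000, 275000]

# Same areas expressed in square meters
area_sqm = [a * SQFT_TO_SQM for a in area_sqft]

slope_sqft, intercept_sqft = fit_line(area_sqft, price)
slope_sqm, intercept_sqm = fit_line(area_sqm, price)

# Refitting on square meters matches dividing the original slope
print(slope_sqm, slope_sqft / SQFT_TO_SQM)
# The intercept is unaffected by the change of units
print(intercept_sqm, intercept_sqft)
```

The same relationship holds for each predictor in a multiple regression, which is why transforming the coefficients gives the same answer as refitting the model.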

## Step 4: Center Data to Provide an Interpretable Intercept

Your stakeholder is happy with the metric results, but now they want to know what's happening with the intercept value. Negative \\$15k for a home with zeros across the board...what does that mean?

Center the data so that the mean is 0, fit a new model, and report on the new intercept.

(It doesn't matter whether you use data that was scaled to metric units or not. The intercept should be the same either way, because once every predictor is centered the intercept equals the mean of `SalePrice`.)

In [22]:
# Your code here - center data

# Selecting the features and target
X = ames[["LotArea", "MasVnrArea", "TotalBsmtSF", "GrLivArea", "GarageArea"]]
y = ames["SalePrice"]

# Centering the data by subtracting the mean from each feature
X_centered = X - X.mean()

In [23]:
# Your code here - build a new model

# Adding a constant (intercept)
X_centered = sm.add_constant(X_centered)

# Fitting a new model with centered data
model_centered = sm.OLS(y, X_centered).fit()

# summary results
print(model_centered.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.675
Method:                 Least Squares   F-statistic:                     603.0
Date:                Wed, 16 Apr 2025   Prob (F-statistic):               0.00
Time:                        10:32:30   Log-Likelihood:                -17622.
No. Observations:                1452   AIC:                         3.526e+04
Df Residuals:                    1446   BIC:                         3.529e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        1.806e+05   1186.695    152.200      

### Your written answer here - interpret the new intercept

- If a house has average lot area, masonry veneer area, basement size, above-ground living area, and garage area, its expected sale price is approximately $180,600.

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The new intercept is about \\$181k. This means that a home with average lot area, average masonry veneer area, average total basement area, average above-grade living area, and average garage area would sell for about \\$181k.

</details>
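
The reason the centered intercept is so easy to interpret is that, with every predictor centered, the least-squares intercept equals the mean of the target, while the slopes do not change. Here is a minimal sketch of that property, assuming a hand-written one-predictor helper (`fit_line`, not part of `statsmodels`) and made-up data:

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up living areas (square feet) and prices, for illustration only
area = [1000, 1500, 1700, 2100, 2400]
price = [150000, 185000, 210000, 240000, 275000]

# Center the predictor by subtracting its mean
area_mean = fmean(area)
area_centered = [a - area_mean for a in area]

slope_raw, intercept_raw = fit_line(area, price)
slope_centered, intercept_centered = fit_line(area_centered, price)

# Same check after converting to square meters and centering
area_sqm = [a * 0.092903 for a in area]
sqm_mean = fmean(area_sqm)
_, intercept_sqm_centered = fit_line([a - sqm_mean for a in area_sqm], price)

print(slope_raw, slope_centered)              # slope is unchanged by centering
print(intercept_centered, fmean(price))       # intercept equals the mean price
print(intercept_sqm_centered)                 # same intercept in metric units
```

Centering only shifts where "zero" sits on each predictor's axis, so it moves the intercept without changing how steeply price rises with area.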

## Step 5: Identify the "Most Important" Feature

Finally, either build a new model with transformed features or transform the coefficients from the Step 4 model so that the most important feature can be identified.

Even though all of the features are measured in area, they are different kinds of area (e.g. lot area vs. masonry veneer area) that are not directly comparable as-is. So apply **standardization** (centering each predictor and dividing it by its standard deviation) and identify the feature with the highest standardized coefficient as the "most important".

In [27]:
# Your code here - building a new model or transforming coefficients
# from centered model so that they are in standard deviations

# Selecting features and target
X = ames[["LotArea", "MasVnrArea", "TotalBsmtSF", "GrLivArea", "GarageArea"]]
y = ames["SalePrice"]

# Standardizing predictors 
X_standardized = (X - X.mean()) / X.std()

# Adding a constant 
X_standardized = sm.add_constant(X_standardized)

# Fitting model
model_standardized = sm.OLS(y, X_standardized)
results_standardized = model_standardized.fit()

# Display standardized coefficients
print(results_standardized.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.676
Model:                            OLS   Adj. R-squared:                  0.675
Method:                 Least Squares   F-statistic:                     603.0
Date:                Wed, 16 Apr 2025   Prob (F-statistic):               0.00
Time:                        10:32:30   Log-Likelihood:                -17622.
No. Observations:                1452   AIC:                         3.526e+04
Df Residuals:                    1446   BIC:                         3.529e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        1.806e+05   1186.695    152.200      

### Your written answer here - identify the "most important" feature

- The feature with the highest standardized coefficient is GrLivArea, meaning it is the most important predictor of house sale price among the features in the model.

- A one standard deviation increase in GrLivArea is associated with the largest positive increase in SalePrice compared to other predictors, making it the most influential feature in the model.

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The feature with the highest standardized coefficient is `GrLivArea`. This means that above-grade living area is most important.

</details>
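
Standardized coefficients can also be obtained without refitting: dividing a predictor by its standard deviation multiplies its coefficient by that standard deviation. The sketch below checks this with a hand-written one-predictor helper (`fit_line`, a stand-in for `statsmodels`) and made-up data. It uses the sample standard deviation, which is also the default for `pandas` `.std()`.

```python
from statistics import fmean, stdev


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up living areas (square feet) and prices, for illustration only
area = [1000, 1500, 1700, 2100, 2400]
price = [150000, 185000, 210000, 240000, 275000]

# Standardize: subtract the mean, then divide by the sample standard deviation
area_mean = fmean(area)
area_sd = stdev(area)
area_standardized = [(a - area_mean) / area_sd for a in area]

slope_raw, _ = fit_line(area, price)
slope_standardized, intercept_standardized = fit_line(area_standardized, price)

# Refitting on standardized data matches scaling the original slope
print(slope_standardized, slope_raw * area_sd)
# The intercept is still the mean price, as in the centered model
print(intercept_standardized, fmean(price))
```

A standardized coefficient reads as the expected change in price for a one standard deviation increase in that feature, which puts predictors with very different spreads on a common footing.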

## Summary
Great! You've now got some hands-on practice transforming data and interpreting the results!