# Linear Regression Exercise

---
## Complete the tasks in bold

**TASK: Run the cells under the Imports and Data section to make sure you have imported the correct general libraries as well as the correct datasets. Later on you may need to run further imports from scikit-learn.**

### Imports

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data

In [18]:
df = pd.read_csv("AMES_Final_DF.csv")

In [19]:
df.head()

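| | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | ... | Sale Type_ConLw | Sale Type_New | Sale Type_Oth | Sale Type_VWD | Sale Type_WD | Sale Condition_AdjLand | Sale Condition_Alloca | Sale Condition_Family | Sale Condition_Normal | Sale Condition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.0 | 31770 | 6 | 5 | 1960 | 1960 | 112.0 | 639.0 | 0.0 | 441.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | 144.0 | 270.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | 0.0 | 406.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 93.0 | 11160 | 7 | 5 | 1968 | 1968 | 0.0 | 1065.0 | 0.0 | 1045.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | 0.0 | 137.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |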


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2925 entries, 0 to 2924
Columns: 274 entries, Lot Frontage to Sale Condition_Partial
dtypes: float64(11), int64(263)
memory usage: 6.1 MB


**TASK: The label we are trying to predict is the SalePrice column. Separate the data into X features and y labels.**

In [21]:
X = df.drop('SalePrice',axis=1)
y = df['SalePrice']

**TASK: Use scikit-learn to split X and y into a training set and a test set. Since we will later be using a Grid Search strategy, set your test proportion to 10%. To get the same data split as the solutions notebook, you can specify random_state=101.**

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=101)
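
As a quick sanity check on the split, the sketch below runs the same call on a synthetic stand-in with 2925 rows (the row count reported by `df.info()`). The arrays `X_demo` and `y_demo` are illustrative and not part of the exercise data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the Ames DataFrame
X_demo = np.arange(2925 * 2).reshape(2925, 2)
y_demo = np.arange(2925)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=101
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=101
)
print(np.array_equal(X_te, X_te2))
```

Roughly 10% of the rows land in the test set, and fixing `random_state` is what makes the split repeatable across runs.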

**TASK: The dataset features have a variety of scales and units. For optimal regression performance, scale the X features. Take careful note of what to use for .fit() vs. what to use for .transform().**

In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
scaler = StandardScaler()

In [26]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
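
To make the fit-versus-transform distinction concrete, here is a minimal sketch on toy arrays (`train`, `test`, and `demo_scaler` are illustrative names, not part of the exercise). The scaler learns its mean and standard deviation from the training data only, then reuses those statistics on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

demo_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows
train_scaled = demo_scaler.fit_transform(train)
# transform reuses those training statistics; nothing is learned from the test row
test_scaled = demo_scaler.transform(test)

print("learned mean:", demo_scaler.mean_)
print("learned scale:", demo_scaler.scale_)
print("scaled test value:", test_scaled)
```

Calling `fit` on the test set would leak information about unseen data into preprocessing, which is why only `transform` is applied to it.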

**TASK: Fit the data using a Linear Regression Model**

Now, we'll create and train the standard Linear Regression model. We import the LinearRegression class, create an instance of the model, and then 'fit' it using our scaled training features (scaled_X_train) and the corresponding training target values (y_train).

In [27]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(scaled_X_train, y_train)

With the model trained, we can use it to make predictions on the scaled test features (scaled_X_test).

In [28]:
# Make predictions on the scaled test data
y_pred_linear = linear_model.predict(scaled_X_test)

**TASK: Evaluate your model's performance on the unseen 10% scaled test set.**

Now, we evaluate how well the Linear Regression model performed. We'll use two common metrics:

**Mean Absolute Error (MAE):** The average absolute difference between the actual prices (y_test) and the predicted prices (y_pred_linear). It tells us, on average, how far off our predictions are in dollars.

**Root Mean Squared Error (RMSE):** This is the square root of the Mean Squared Error (MSE). MSE calculates the average of the squared differences between actual and predicted values. RMSE is often preferred because it's in the same units as the target variable (dollars in this case) and penalizes larger errors more heavily than MAE.

We'll import the necessary functions, calculate the metrics, and print them.
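
Before running the metrics on the real predictions, it can help to compute them by hand. The sketch below uses four made-up prices (not taken from the dataset) and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up prices for illustration only
actual = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
predicted = np.array([210_000.0, 140_000.0, 300_000.0, 210_000.0])

errors = actual - predicted                   # [-10000, 10000, 0, 40000]
mae_manual = np.mean(np.abs(errors))          # average absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))   # square root of the average squared error

print(f"MAE:  {mae_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")

# The library functions should agree with the manual formulas
print(np.isclose(mae_manual, mean_absolute_error(actual, predicted)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(actual, predicted))))
```

The single 40,000 miss pulls RMSE well above MAE, which is the "penalizes larger errors more heavily" behavior described above.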

In [29]:
# Import evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Calculate Mean Absolute Error (MAE)
mae_linear = mean_absolute_error(y_test, y_pred_linear)

# Calculate Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
mse_linear = mean_squared_error(y_test, y_pred_linear)
rmse_linear = np.sqrt(mse_linear)

In [30]:
# Print the results
print(f"Linear Regression MAE: ${mae_linear:.2f}")
print(f"Linear Regression RMSE: ${rmse_linear:.2f}")

Linear Regression MAE: $14584.74
Linear Regression RMSE: $20858.43


In [31]:
# You can also compare RMSE to the standard deviation of the actual prices
print(f"Average Sale Price: ${y_test.mean():.2f}")
print(f"Standard Deviation of Sale Price: ${y_test.std():.2f}")

Average Sale Price: $184537.35
Standard Deviation of Sale Price: $72346.11
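
One rough way to read these numbers is to express the RMSE relative to the mean and standard deviation of the test prices. The sketch below simply reuses the values printed above; the variable names are illustrative.

```python
# Values copied from the printed output above
rmse_reported = 20858.43
mean_price = 184537.35
std_price = 72346.11

rmse_vs_mean = rmse_reported / mean_price
rmse_vs_std = rmse_reported / std_price

print(f"RMSE as a fraction of the mean price: {rmse_vs_mean:.3f}")
print(f"RMSE as a fraction of the price std:  {rmse_vs_std:.3f}")
```

A model that always predicted the average test price would have an RMSE close to the standard deviation, so a ratio well below 1 suggests the features carry real predictive signal.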


**TASK: Repeat the above steps using Polynomial Regression and Regularization.**
**Note: Only try one polynomial degree and one regularization technique.**

We'll use Polynomial Features (degree 2) combined with Ridge Regression (which is Linear Regression with L2 regularization). Regularization helps prevent overfitting, especially when we have many features (like after adding polynomial terms).

First, import the necessary tools: PolynomialFeatures to create the interaction and polynomial terms, and Ridge for the regularized regression model.

In [32]:
# Import PolynomialFeatures and Ridge Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

Now, create the polynomial features. We'll choose degree 2 (includes original features, interaction terms like x1*x2, and squared terms like x1^2). We 'fit' this on the original unscaled training data (X_train) and then 'transform' both X_train and X_test.

In [33]:
# Create Polynomial Features object (degree=2)
# include_bias=False omits the column of ones, since LinearRegression and Ridge fit the intercept themselves
polynomial_converter = PolynomialFeatures(degree=2, include_bias=False)

# Fit and transform the training data
X_train_poly = polynomial_converter.fit_transform(X_train)

# Transform the test data
X_test_poly = polynomial_converter.transform(X_test)

# Check the new shape (it will have many more columns)
print(f"Original number of features: {X_train.shape[1]}")
print(f"Polynomial (degree 2) number of features: {X_train_poly.shape[1]}")

Original number of features: 273
Polynomial (degree 2) number of features: 37674
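
The jump from 273 to 37674 columns can be checked with a short count: degree 2 without a bias column keeps the originals and adds one squared term per feature plus one interaction term per pair of distinct features. The helper below is an illustrative sketch, not part of the exercise.

```python
from math import comb

def degree2_feature_count(n_features: int) -> int:
    """Count degree-2 polynomial features without the bias column."""
    originals = n_features
    squares = n_features
    interactions = comb(n_features, 2)  # one term per pair of distinct features
    return originals + squares + interactions

n_poly = degree2_feature_count(273)
print(n_poly)
```

With only 2632 or so training rows and tens of thousands of columns, the model has far more features than samples, which is the situation where regularization matters most.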


We have created many new features. It's crucial to scale these new polynomial features before feeding them into the Ridge model. We'll use StandardScaler again, fitting it only on the transformed training data (X_train_poly) and then transforming both sets.

In [34]:
# Create a new scaler for polynomial features
poly_scaler = StandardScaler()

# Fit the scaler on the polynomial training data only, then transform both the training and test data
scaled_X_train_poly = poly_scaler.fit_transform(X_train_poly)
scaled_X_test_poly = poly_scaler.transform(X_test_poly)

Now, create and train the Ridge Regression model. Ridge takes an alpha parameter that controls the strength of the regularization: a higher alpha means stronger regularization. We'll use alpha=1.0, which is scikit-learn's default. In practice alpha is usually tuned with techniques like cross-validation, but this task does not ask for tuning. We fit the model on the scaled polynomial training data.

In [35]:
# Create an instance of the Ridge Regression model
# alpha=1.0 is a common starting point
ridge_model = Ridge(alpha=1.0)

# Train (fit) the Ridge model on the scaled polynomial training data
ridge_model.fit(scaled_X_train_poly, y_train)
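
The task does not ask for tuning, but for reference, the cross-validated choice of alpha mentioned above could look like the sketch below. It uses `RidgeCV` on small synthetic data; the data, the candidate alphas, and the names `X_syn`, `y_syn`, and `ridge_cv` are all assumptions made for illustration, not values from the exercise.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 20))
true_coef = rng.normal(size=20)
y_syn = X_syn @ true_coef + rng.normal(scale=0.5, size=200)

# Hypothetical candidate strengths; a real search would pick its own grid
candidate_alphas = [0.1, 1.0, 10.0, 100.0]

# 5-fold cross-validation over the candidates
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5)
ridge_cv.fit(X_syn, y_syn)

print("selected alpha:", ridge_cv.alpha_)
```

The selected value is available as `ridge_cv.alpha_`, and the fitted object can then be used for prediction like any other scikit-learn regressor.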

Use the trained Ridge model to make predictions on the scaled polynomial test data.

In [36]:
# Make predictions using the Ridge model on the scaled polynomial test data
y_pred_ridge = ridge_model.predict(scaled_X_test_poly)

Finally, evaluate the Ridge model using MAE and RMSE, just like we did for the simple Linear Regression model.

In [37]:
# Calculate Mean Absolute Error (MAE) for Ridge
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)

# Calculate Mean Squared Error (MSE) for Ridge
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
# Calculate Root Mean Squared Error (RMSE) for Ridge
rmse_ridge = np.sqrt(mse_ridge)

# Print the results for Ridge
print(f"Ridge Regression (Poly Degree 2) MAE: ${mae_ridge:.2f}")
print(f"Ridge Regression (Poly Degree 2) RMSE: ${rmse_ridge:.2f}")

# Compare with Linear Regression results
print("\nComparison:")
print(f"Linear Regression MAE: ${mae_linear:.2f}")
print(f"Linear Regression RMSE: ${rmse_linear:.2f}")
print(f"Ridge Regression MAE: ${mae_ridge:.2f}")
print(f"Ridge Regression RMSE: ${rmse_ridge:.2f}")

Ridge Regression (Poly Degree 2) MAE: $25583.12
Ridge Regression (Poly Degree 2) RMSE: $34155.39

Comparison:
Linear Regression MAE: $14584.74
Linear Regression RMSE: $20858.43
Ridge Regression MAE: $25583.12
Ridge Regression RMSE: $34155.39


## Great work!

----