**INTRODUCTION**

This notebook analyzes the California housing dataset to predict median house values using two regression techniques built on dimensionality reduction: Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). The goal is to mitigate multicollinearity, a common problem in regression where highly correlated predictors produce unstable, unreliable coefficient estimates. We first detect multicollinearity with the Variance Inflation Factor (VIF), then fit PCR and PLSR models on a reduced set of components, and finally compare how accurately the two methods predict house values on a held-out test set.

Author: Tsaqif Wismadi

**0. Import libraries**

In [7]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

**1. Data preparation**

In [8]:
# Step 2: Load and preprocess the data
df = pd.read_csv('housing.csv')

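# Handle missing values (fill missing bedrooms with median)
# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())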

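# Convert the categorical ocean_proximity column into dummy variables
# drop_first=True drops one category as the baseline; dtype=int keeps the dummies as 0/1 on newer pandas
df = pd.get_dummies(df, drop_first=True, dtype=int)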

# Step 3: Separate predictors and target variable
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

In [9]:
df.head(5)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,1,0
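
The table above has four `ocean_proximity_*` columns because `drop_first=True` removes one category, which then acts as the baseline. The following is a minimal sketch on a made-up frame (`toy` and its income values are hypothetical), assuming the fifth category is `<1H OCEAN`, as in the commonly distributed version of this dataset:

```python
import pandas as pd

# Toy frame; the category labels are assumed, the income values are made up.
toy = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"],
    "median_income": [8.3, 2.5, 3.1, 7.2, 4.0],
})

# drop_first=True removes the first category in sorted order; it becomes the baseline.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

dummy_cols = [c for c in encoded.columns if c.startswith("ocean_proximity_")]
print(dummy_cols)

# A row whose dummies are all zero belongs to the dropped baseline category.
row_sums = encoded[dummy_cols].sum(axis=1).tolist()
print(row_sums)
```

Dropping one level avoids a set of dummy columns that always sum to one, which would itself be a source of perfect collinearity.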


**Variables explanation:**

1. longitude: A measure of how far west a block is; values are negative in California, so a more negative value is farther west
2. latitude: A measure of how far north a block is; a higher value is farther north
3. housing_median_age: Median age of a house within a block; a lower number is a newer building
4. total_rooms: Total number of rooms within a block
5. total_bedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households (groups of people residing within a home unit) within a block
8. median_income: Median income for households within a block (measured in tens of thousands of US Dollars)
9. median_house_value: Median house value for households within a block (measured in US Dollars)
10. ocean_proximity: Location of the block relative to the ocean (categorical; encoded above as the ocean_proximity_* dummy columns)

**2. Initial multicollinearity detection using VIF**

In [10]:
# Step 4: Multicollinearity detection using VIF (Before handling with PCR and PLSR)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
vif['Feature'] = X.columns

# Display the VIFs (values above roughly 5 to 10 are commonly read as a sign of problematic multicollinearity)
print("Variance Inflation Factor (VIF) to detect multicollinearity:")
print(vif)

Variance Inflation Factor (VIF) to detect multicollinearity:
          VIF                     Feature
0   18.028444                   longitude
1   19.925764                    latitude
2    1.321927          housing_median_age
3   12.349114                 total_rooms
4   27.040073              total_bedrooms
5    6.342122                  population
6   28.315383                  households
7    1.740468               median_income
8    2.853630      ocean_proximity_INLAND
9    1.002039      ocean_proximity_ISLAND
10   1.565746    ocean_proximity_NEAR BAY
11   1.197133  ocean_proximity_NEAR OCEAN
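
To make these numbers easier to read, recall that the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on all the other predictors. The sketch below computes that definition directly with NumPy on synthetic data. The function `vif_by_hand` and the variables `rooms`, `bedrooms`, and `income` are illustrative stand-ins, not the housing data and not the statsmodels implementation.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centre the columns so no intercept is needed
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / (target @ target)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
n = 500
rooms = rng.normal(size=n)
bedrooms = rooms + 0.2 * rng.normal(size=n)  # nearly a copy of rooms
income = rng.normal(size=n)                  # unrelated to the other two

vifs = vif_by_hand(np.column_stack([rooms, bedrooms, income]))
print(dict(zip(["rooms", "bedrooms", "income"], vifs.round(2))))
```

The two near-duplicate columns receive large VIFs while the independent column stays close to 1, the same pattern that total_bedrooms and households show against median_income in the output above.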


**3. Conducting PCR**

In [11]:
# Step 5: Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: PCR (Principal Component Regression) - addressing multicollinearity
# PCR regresses the target on principal components of the predictors instead of the predictors themselves
# The components are mutually uncorrelated, and dropping the low-variance ones removes near-redundant directions

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a pipeline for PCR
pca = PCA(n_components=10)  # Keep 10 of the 12 components (a tunable hyperparameter)
pcr = Pipeline(steps=[('pca', pca), ('regressor', LinearRegression())])

# Fit the PCR model
pcr.fit(X_train_scaled, y_train)

# Predict and evaluate the PCR model
y_pred_pcr = pcr.predict(X_test_scaled)
mse_pcr = mean_squared_error(y_test, y_pred_pcr)
r2_pcr = r2_score(y_test, y_pred_pcr)

print("\nPCR Results (Multicollinearity handled by selecting 10 principal components):")
print(f'MSE: {mse_pcr}')
print(f'R-squared: {r2_pcr}')


PCR Results (Multicollinearity handled by selecting 10 principal components):
MSE: 4969958348.151292
R-squared: 0.6207322728494282
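
The number of components is fixed at 10 above but is described as tunable. One common way to choose it is cross-validation over the whole pipeline. The sketch below uses synthetic data rather than the housing data, and the names `X_demo`, `y_demo`, `pcr_cv`, and `search` are hypothetical. The scaler is placed inside the pipeline so that each fold is standardized using only its own training portion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for collinear predictors (not the real dataset).
rng = np.random.default_rng(42)
n_samples, n_features = 300, 6
latent = rng.normal(size=(n_samples, 3))
X_demo = np.hstack([latent, latent + 0.1 * rng.normal(size=(n_samples, 3))])
y_demo = latent @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=n_samples)

# Scaling, PCA and regression in one pipeline, so every CV fold refits all three steps.
pcr_cv = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("regressor", LinearRegression()),
])

# Try every possible number of components and score by cross-validated MSE.
search = GridSearchCV(
    pcr_cv,
    param_grid={"pca__n_components": list(range(1, n_features + 1))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["pca__n_components"]
print("Best number of components:", best_k)
print("Cross-validated MSE:", -search.best_score_)
```

Applied to this notebook, the same search would run on `X_train` and `y_train` with a grid from 1 to 12, leaving the test set untouched until the final evaluation.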


**4. Conducting PLSR**

In [12]:
# Step 7: PLSR (Partial Least Squares Regression) - addressing multicollinearity
# PLSR builds new components (latent variables) that covary strongly with the response, using both X and y
# As with PCR, multicollinearity is handled by regressing on a reduced number of these latent variables

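# Create a PLSR model
# PLSRegression standardizes X and y internally by default (scale=True), so passing pre-scaled data is harmless
pls = PLSRegression(n_components=10)  # Keep 10 of the 12 possible latent variables (a tunable hyperparameter)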

# Fit the PLSR model
pls.fit(X_train_scaled, y_train)

# Predict and evaluate the PLSR model
y_pred_pls = pls.predict(X_test_scaled)
mse_pls = mean_squared_error(y_test, y_pred_pls)
r2_pls = r2_score(y_test, y_pred_pls)

print("\nPLSR Results (Multicollinearity handled by selecting 10 latent variables):")
print(f'MSE: {mse_pls}')
print(f'R-squared: {r2_pls}')


PLSR Results (Multicollinearity handled by selecting 10 latent variables):
MSE: 4889438275.137577
R-squared: 0.626876924965732
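
The difference between the two methods lies in how the first component is chosen. PCA takes the direction of largest variance in the predictors and never looks at the response. For a single response, PLS takes a first weight vector proportional to X'y. The NumPy sketch below illustrates this on synthetic, standardized data in which two collinear predictors (`x1`, `x2`) are unrelated to the response and a third (`x3`) drives it. All names are hypothetical, and this is an illustration of the first component only, not a full PLS implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)   # x1 and x2 are nearly collinear...
x2 = z + 0.1 * rng.normal(size=n)   # ...and unrelated to y
x3 = rng.normal(size=n)             # x3 drives y
y = 2.0 * x3 + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the predictors
yc = y - y.mean()

# PCA: first direction = eigenvector of Xs'Xs with the largest eigenvalue (y is not used).
eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)
pca_dir = eigvecs[:, -1]

# PLS with one response: first weight vector is proportional to Xs'y.
pls_dir = Xs.T @ yc
pls_dir = pls_dir / np.linalg.norm(pls_dir)

print("PCA first direction (abs):", np.abs(pca_dir).round(2))
print("PLS first direction (abs):", np.abs(pls_dir).round(2))
```

Here the first principal component loads on the collinear pair, while the first PLS direction loads on the predictor that actually explains the response. This is why PLSR can need fewer components than PCR to reach the same accuracy.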


**CONCLUSION**

In this analysis of the California housing dataset, we modeled median house values from location, housing stock, population, and income variables. The VIF scores revealed substantial multicollinearity: longitude and latitude (VIF of about 18 and 20) and the size-related variables total rooms, total bedrooms, and households (about 12, 27, and 28) all exceed the commonly used threshold of 10. These variables carry largely redundant information, which makes the coefficients of a standard linear regression unstable and hard to interpret. PCR and PLSR address this by replacing the predictors with components that are uncorrelated by construction: PCR selects components that capture the most variance in the predictors, while PLSR selects components that covary most strongly with house values. Both models here keep 10 of 12 possible components, so the reduction in dimensionality is modest.
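
How much a given component count changes the model can be gauged from one fact: PCR with every component retained gives the same fitted values as ordinary least squares, and the two diverge only as components are dropped. The following NumPy sketch checks this on synthetic data. The helper `pcr_fitted` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)  # induce collinearity between two columns
y = X @ rng.normal(size=p) + rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

def pcr_fitted(Xc, yc, k):
    """Regress yc on the first k principal-component scores of Xc; return fitted values."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    return scores @ gamma

# Ordinary least squares on the centred data, for comparison.
ols_coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
ols_fitted = Xc @ ols_coef

# Largest absolute difference between PCR and OLS fitted values for each k.
gaps = {k: float(np.max(np.abs(pcr_fitted(Xc, yc, k) - ols_fitted))) for k in (2, p)}
for k, gap in gaps.items():
    print(f"k={k}: max |PCR - OLS| fitted-value gap = {gap:.2e}")
```

With all components kept, the gap is at the level of floating-point error. A PCR model that keeps most of its components will therefore tend to behave much like the unreduced regression.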

When comparing the two regression methods on the held-out test set, PLSR slightly outperformed PCR, with a higher R-squared (0.627 vs. 0.621) and a lower mean squared error (about 4.89 × 10^9 vs. 4.97 × 10^9). A plausible explanation is that PLSR builds its components from both the predictors and the response, so for a given number of components it tends to retain more of the signal relevant to house values. The margin is small and comes from a single train/test split, so it is best read as a slight edge rather than a decisive result. Because no ordinary least squares baseline was fitted in this notebook, the comparison shows how the two methods rank against each other, not how much either improves on a standard linear regression. Overall, both methods produced usable models in the presence of multicollinearity, and PLSR may offer somewhat better predictive accuracy when the relationship between predictors and target must be preserved in a small number of components.
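
MSE values in squared dollars are hard to interpret, so it can help to convert them to root mean squared error, which is in the same units as the target. The short sketch below does that arithmetic for the two test-set MSE values printed earlier in this notebook. The variable names are illustrative.

```python
import math

# Test-set MSE values copied from the PCR and PLSR outputs in this notebook.
mse_pcr = 4969958348.151292
mse_pls = 4889438275.137577

# RMSE is the square root of MSE, expressed in US Dollars like median_house_value.
rmse_pcr = math.sqrt(mse_pcr)
rmse_pls = math.sqrt(mse_pls)

print(f"PCR RMSE:   ${rmse_pcr:,.0f}")
print(f"PLSR RMSE:  ${rmse_pls:,.0f}")
print(f"Difference: ${rmse_pcr - rmse_pls:,.0f}")
```

Both models are off by roughly 70,000 dollars on a typical block, and the gap between them is small relative to that error.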