# Feature Engineering Linear Regression Pipeline #

California Housing Dataset:
This dataset, available through Scikit-Learn, is derived from the 1990 U.S. census. It includes 20,640 instances with 8 numeric predictive attributes and the target variable, which is the median house value for California districts. The attributes include median income, median house age, average number of rooms and bedrooms per household, population, average household occupancy, and geographical coordinates (latitude and longitude). The target variable is the median house value in hundreds of thousands of dollars. This dataset does not have any missing values, and all features are measured on an interval scale.

1. Formulate the Prediction Question:
Prediction Question: "How accurately can we predict the median house value in a district of California based on various socio-economic indicators?"

2. Locate and Load the Dataset:
The California Housing dataset can be loaded from the Scikit-Learn library. This dataset includes 20,640 instances with 8 numeric predictive attributes and the target.

In [14]:
#Import necessary libraries and load the dataset:

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
feature_names = data.feature_names

To explore the data set, we merge the features (X) and the labels (y) into a pandas DataFrame and display the first rows from it:

In [15]:
mat = np.column_stack((X, y))
df = pd.DataFrame(mat, columns=np.append(feature_names, 'MedianValue'))
df.head()

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedianValue
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0   2.80226     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422


3. Dataset Description and Variable Identification:

Independent Variables (All measured on an interval scale):
MedInc: Median income in block group
HouseAge: Median house age in block group
AveRooms: Average number of rooms per household
AveBedrms: Average number of bedrooms per household
Population: Block group population
AveOccup: Average number of household members
Latitude: Block group latitude
Longitude: Block group longitude

Dependent Variable (Interval scale):
MedHouseVal (named MedianValue in the DataFrame above): Median house value for districts, expressed in hundreds of thousands of dollars

4. Feature Engineering:
Since the California Housing dataset does not contain missing values or categorical variables, and is relatively clean, we can focus on other aspects of feature engineering:
Outlier Handling: Use RobustScaler, which centers each feature on its median and scales it by its interquartile range, statistics that are robust to outliers.
Variable Transformation: Apply PowerTransformer (Yeo-Johnson by default) to make the feature distributions more Gaussian-like, which can be beneficial for regression modeling.
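
As a rough illustration of why median-and-IQR scaling is less sensitive to outliers, the sketch below reproduces the default behaviour of RobustScaler by hand with NumPy and compares it with mean-and-standard-deviation scaling. The `incomes` array is made-up data, not taken from the dataset, and `robust_scale` is a hypothetical helper written for this example.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and scale by the interquartile range (IQR)."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

# Made-up values with one extreme outlier (illustrative only).
incomes = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 50.0])

robust = robust_scale(incomes)
# Mean/standard-deviation scaling, for comparison.
standard = (incomes - incomes.mean()) / incomes.std()

print("robust  :", np.round(robust, 2))
print("standard:", np.round(standard, 2))
```

With robust scaling the five ordinary values keep a usable spread around zero, while with standard scaling the single outlier inflates the mean and standard deviation and squeezes the ordinary values into a narrow band.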


In [16]:
from sklearn.preprocessing import RobustScaler, PowerTransformer
# Split Dataset:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [17]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

In [18]:
#Create and Configure Pipeline:

pipeline = Pipeline([
    ('scaler', RobustScaler()),
    ('power_transformer', PowerTransformer()),
    ('regressor', LinearRegression())
])


9. Fitting the model

In [19]:
pipeline.fit(X_train, y_train)


10. Making predictions and evaluating the model's performance using cross-validation

In [20]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
r2_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')

print("RMSE scores:", rmse_scores)
print("R2 scores:", r2_scores)



RMSE scores: [0.70909481 0.69054084 0.71636558 0.69876307 0.73235967]
R2 scores: [0.6327154  0.63259473 0.61891077 0.62574398 0.60689968]


The output above shows the RMSE (Root Mean Square Error) and R² scores for each fold of a 5-fold cross-validation.

Interpretation:
RMSE Scores: [0.70909481, 0.69054084, 0.71636558, 0.69876307, 0.73235967]
RMSE measures the typical size of the difference between the values predicted by the model and the values actually observed.
Lower RMSE values are better, as they indicate smaller residuals (errors).
The RMSE scores are close to each other across the five folds, suggesting consistency in the model's performance.
Averaged over the five folds, the RMSE is about 0.71 in the units of the target variable. Because the target is expressed in hundreds of thousands of dollars, this corresponds to a typical prediction error of roughly $71,000. Whether this level of error is acceptable depends on the application.
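
To make the definition concrete, here is a minimal sketch of the RMSE calculation in plain Python. The target and prediction values are hypothetical, chosen only to keep the arithmetic easy to follow, and `rmse` is a helper defined for this example (it is not a scikit-learn function).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and predictions, in units of $100,000.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]

error = rmse(y_true, y_pred)
# Multiply by 100,000 to express the error in dollars.
print(f"RMSE: {error:.3f} (about ${error * 100_000:,.0f})")
```

Every prediction in this toy example is off by 0.5, so the RMSE is 0.5, or about $50,000 in the dataset's units.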

R² Scores: [0.6327154, 0.63259473, 0.61891077, 0.62574398, 0.60689968]
R² is a statistical measure that represents the proportion of the variance for the dependent variable that's explained by the independent variables in a regression model.
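On the training data, the R² of a least-squares fit with an intercept lies between 0 and 1. On held-out data it can also be negative, which happens when the model predicts worse than simply using the mean of the target. Higher values indicate better model performance.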
R² scores are all above 0.6, which suggests that more than 60% of the variance in the target variable is predictable from the features in the model.
The consistency in R² scores across the folds also indicates stable model performance.
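
The definition of R² can be sketched in a few lines of plain Python as one minus the ratio of the residual sum of squares to the total sum of squares. The values below are hypothetical and `r2_manual` is a helper written for this example. The three prediction lists are chosen to show a good fit, a mean-only baseline, and a fit that is worse than the baseline.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]

good = [1.5, 1.5, 3.5, 3.5]      # close to the targets
baseline = [2.5, 2.5, 2.5, 2.5]  # always predicts the mean of y_true
bad = [4.0, 3.0, 2.0, 1.0]       # reversed, worse than the mean

for name, preds in (("good", good), ("baseline", baseline), ("bad", bad)):
    print(f"{name:>8}: R2 = {r2_manual(y_true, preds):.2f}")
```

The mean-only baseline scores exactly 0, and predictions that are worse than the baseline produce a negative score.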

Overall Interpretation:

The model shows a moderate level of prediction accuracy, as indicated by the RMSE and R² values.
The consistency in the cross-validation scores suggests that the model is stable and not heavily dependent on the particular subset of data used for training.
However, whether these scores are considered "good" depends on the specific domain and application. In some contexts, an R² around 0.63 might be good, while in others, there might be a need for higher predictive accuracy.
To improve the model, consider experimenting with different feature selection, model complexity, or tuning the hyperparameters. Also, compare these scores with other models to see if there is a significant difference in performance.
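
One simple way to summarize the per-fold results is to report their mean and standard deviation. The sketch below does this with the standard library, using the fold scores printed above. Treating the sample standard deviation as the measure of spread is a choice made for this example.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above.
rmse_scores = [0.70909481, 0.69054084, 0.71636558, 0.69876307, 0.73235967]
r2_scores = [0.6327154, 0.63259473, 0.61891077, 0.62574398, 0.60689968]

for name, scores in (("RMSE", rmse_scores), ("R2", r2_scores)):
    # A small standard deviation relative to the mean indicates stable folds.
    print(f"{name}: mean = {mean(scores):.4f}, std = {stdev(scores):.4f}")
```

The same summary computed for another model gives a quick basis for comparison.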

In [21]:
train_score = pipeline.score(X_train, y_train)
print('R2 score on the training set:', np.round(train_score, 5))

test_score = pipeline.score(X_test, y_test)
print('R2 score on the test set:', np.round(test_score, 5))


R2 score on the training set: 0.62436
R2 score on the test set: 0.60588


R² Scores: The R² score on the training set is 0.62436 and on the test set is 0.60588. This indicates that the model explains approximately 62.44% of the variance in the training dataset and 60.59% of the variance in the test dataset. These scores are relatively close, suggesting that the model is not overfitting significantly. However, there's still room for improvement in terms of model performance.

In [22]:
# Select row 10 as a one-row DataFrame so the feature names seen during fit are preserved
y_pred = pipeline.predict(X_test.iloc[[10]])
print(y_pred[0])


1.1628891379545465




In [23]:
print(y_test.iloc[10])


1.232


Interpreting Predicted Values: The predicted value for sample 10 of the test set is about 1.163, while the actual value (y_test.iloc[10]) is 1.232. Since the target is expressed in hundreds of thousands of dollars, the model predicts about $116,300 against an actual median value of $123,200, a difference of roughly $6,900 (about 5.6%). This is a close prediction for this sample, although a single sample says little about overall accuracy.
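
The arithmetic behind this comparison can be sketched as follows, using the two values printed above. The constant `UNIT_DOLLARS` is a name introduced for this example. It encodes the dataset's convention that the target is measured in units of $100,000.

```python
predicted = 1.1628891379545465  # pipeline output for test sample 10
actual = 1.232                  # corresponding value from y_test

UNIT_DOLLARS = 100_000  # target is in hundreds of thousands of dollars

abs_error = abs(actual - predicted)
rel_error = abs_error / actual

print(f"Predicted: ${predicted * UNIT_DOLLARS:,.0f}")
print(f"Actual:    ${actual * UNIT_DOLLARS:,.0f}")
print(f"Error:     ${abs_error * UNIT_DOLLARS:,.0f} ({rel_error:.1%})")
```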

# Feature Engineering Ridge Regression Pipeline #

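Using a different approach with a different scaler and model:
Scaling: Standardize the features with StandardScaler (zero mean, unit variance). Ridge regression penalizes the size of the coefficients, so its result depends on the scale of the features.
The regression model is ridge regression (Ridge with its default regularization strength, alpha=1.0).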
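
To show what the ridge penalty does, the sketch below fits the closed-form ridge solution with NumPy on synthetic data and prints the coefficients for several values of alpha. The data, the alpha values, and the helper `ridge_fit` are all assumptions made for this example. The helper is not scikit-learn's implementation, only the textbook formula applied to centered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two features on very different scales.
n = 200
X = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Standardize so the penalty treats both features comparably.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

for alpha in (0.0, 1.0, 100.0):
    coef = ridge_fit(X_std, y, alpha)
    print(f"alpha={alpha:>6}: coef={np.round(coef, 3)}, norm={np.linalg.norm(coef):.3f}")
```

With alpha set to 0 the formula reduces to ordinary least squares. Larger values of alpha shrink the coefficient vector toward zero.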

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

ridge_pipeline = Pipeline([    
    ('std_scaler', StandardScaler()),
    ('reg', Ridge())
])

In [25]:
from sklearn.model_selection import train_test_split

# No random_state is set here, so the split (and the scores below) will vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [27]:
ridge_pipeline.fit(X_train, y_train)

In [28]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(ridge_pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
r2_scores = cross_val_score(ridge_pipeline, X_train, y_train, cv=5, scoring='r2')

print("RMSE scores:", rmse_scores)
print("R2 scores:", r2_scores)

RMSE scores: [0.75307967 0.71262224 0.71689694 0.72802147 0.6960968 ]
R2 scores: [0.56987693 0.62101057 0.61488119 0.60066159 0.61429342]


In [32]:
y_pred = ridge_pipeline.predict(X_test)
y_pred

array([2.78361845, 1.8423736 , 2.26260888, ..., 2.1018368 , 2.863246  ,
       2.9732366 ])

In [33]:
y_test

11569    2.041
2341     1.062
12590    1.713
7416     1.535
16291    0.672
         ...  
18147    3.701
11605    1.938
13384    1.557
20444    2.086
7690     4.536
Name: MedHouseVal, Length: 4128, dtype: float64

In [35]:
y_pred = ridge_pipeline.predict(X_test.iloc[0:10])
y_pred

array([2.78361845, 1.8423736 , 2.26260888, 2.06596008, 1.29080603,
       2.19034146, 2.21916562, 2.14140585, 1.16486708, 1.81617117])
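
As a quick check, the first five predictions can be compared with the first five actual values visible in the y_test output shown earlier. The sketch below copies those numbers by hand and assumes that the row order of X_test and y_test matches, which holds when both come from the same train_test_split call.

```python
import math

# First five ridge predictions and the matching y_test values from the outputs above.
y_pred_head = [2.78361845, 1.8423736, 2.26260888, 2.06596008, 1.29080603]
y_true_head = [2.041, 1.062, 1.713, 1.535, 0.672]

# Positive error means the model predicted too high.
errors = [p - t for p, t in zip(y_pred_head, y_true_head)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse_head = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

for p, t, e in zip(y_pred_head, y_true_head, errors):
    print(f"predicted {p:.3f}  actual {t:.3f}  error {e:+.3f}")
print(f"MAE: {mae:.3f}  RMSE: {rmse_head:.3f}")
```

On these five rows the model predicts too high every time, by about 0.64 units (roughly $64,000) on average. Five rows are too few to draw conclusions about the whole test set.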

References:

Pace, R. Kelley, & Barry, Ronald. (1997). Sparse Spatial Autoregressions. Statistics and Probability Letters, 33(4), 291-297.

scikit-learn developers. (n.d.). fetch_california_housing [Data set]. https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset