# House Price Prediction with Ridge Regression
# 1. Project Setup and Data Loading

This section initializes the environment by importing necessary libraries and loading the raw dataset (`data.csv`) for preliminary inspection.

In [120]:
import pandas as pd

df = pd.read_csv('data.csv')
df.head(9)

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,313000.0,3.0,1.5,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2384000.0,5.0,2.5,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,342000.0,3.0,2.0,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,420000.0,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,550000.0,4.0,2.5,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
5,2014-05-02 00:00:00,490000.0,2.0,1.0,880,6380,1.0,0,0,3,880,0,1938,1994,522 NE 88th St,Seattle,WA 98115,USA
6,2014-05-02 00:00:00,335000.0,2.0,2.0,1350,2560,1.0,0,0,3,1350,0,1976,0,2616 174th Ave NE,Redmond,WA 98052,USA
7,2014-05-02 00:00:00,482000.0,4.0,2.5,2710,35868,2.0,0,0,3,2710,0,1989,0,23762 SE 253rd Pl,Maple Valley,WA 98038,USA
8,2014-05-02 00:00:00,452500.0,3.0,2.5,2430,88426,1.0,0,0,4,1570,860,1985,0,46611-46625 SE 129th St,North Bend,WA 98045,USA


## Initial Data Snapshot & Descriptive Statistics

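With the first rows inspected, we use `df.describe()` to view summary statistics (count, mean, std, min, max, quartiles) and gauge the distribution and scale of each numerical feature. Two things stand out: `price` has a minimum of 0 and a maximum of 26,590,000, and `bedrooms` and `bathrooms` both have a minimum of 0, so some records are likely invalid or extreme.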


In [121]:
df.describe()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0
mean,551963.0,3.40087,2.160815,2139.346957,14852.52,1.512065,0.007174,0.240652,3.451739,1827.265435,312.081522,1970.786304,808.608261
std,563834.7,0.908848,0.783781,963.206916,35884.44,0.538288,0.084404,0.778405,0.67723,862.168977,464.137228,29.731848,979.414536
min,0.0,0.0,0.0,370.0,638.0,1.0,0.0,0.0,1.0,370.0,0.0,1900.0,0.0
25%,322875.0,3.0,1.75,1460.0,5000.75,1.0,0.0,0.0,3.0,1190.0,0.0,1951.0,0.0
50%,460943.5,3.0,2.25,1980.0,7683.0,1.5,0.0,0.0,3.0,1590.0,0.0,1976.0,0.0
75%,654962.5,4.0,2.5,2620.0,11001.25,2.0,0.0,0.0,4.0,2300.0,610.0,1997.0,1999.0
max,26590000.0,9.0,8.0,13540.0,1074218.0,3.5,1.0,4.0,5.0,9410.0,4820.0,2014.0,2014.0


# 2. Data Cleaning and Feature Engineering

This section focuses on refining the dataset for modeling. We perform three key operations:
1. Outlier removal based on price constraints.
2. Feature engineering from date and spatial columns.
3. Target Encoding for categorical features (`zipcode` and `city`).


## 2.1. Price Outlier Filtering and Date Features

We keep only sales priced strictly between $\$50,000$ and $\$5,000,000$, which removes zero-price records and extreme values. We then derive two time-based features: `sale_year` (the year of the sale date) and `house_age` (`sale_year` minus `yr_built`).


In [122]:
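# Keep sales priced strictly between $50,000 and $5,000,000; .copy() avoids
# pandas' SettingWithCopyWarning when new columns are added below.
df = df[(df["price"] > 50000) & (df["price"] < 5_000_000)].copy()

# Unparseable dates become NaT, which makes sale_year and house_age NaN for those rows.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["sale_year"] = df["date"].dt.year
df["house_age"] = df["sale_year"] - df["yr_built"]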
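To make the behaviour of this step concrete, here is a minimal sketch on a hypothetical four-row frame (the `toy` values are invented for illustration and are not taken from `data.csv`). It shows that the strict inequalities drop both the zero-price row and the extreme row, and that a date that cannot be parsed yields a missing `house_age`.

```python
import pandas as pd

# Hypothetical rows chosen to hit each edge case; not real records.
toy = pd.DataFrame({
    "date": ["2014-05-02 00:00:00", "2014-06-10 00:00:00", "not a date", "2014-07-01 00:00:00"],
    "price": [0.0, 313000.0, 420000.0, 26590000.0],
    "yr_built": [1990, 1955, 1963, 2000],
})

# Same filter as the notebook: strict bounds on both sides.
kept = toy[(toy["price"] > 50_000) & (toy["price"] < 5_000_000)].copy()

# errors="coerce" turns the unparseable string into NaT instead of raising.
kept["date"] = pd.to_datetime(kept["date"], errors="coerce")
kept["sale_year"] = kept["date"].dt.year
kept["house_age"] = kept["sale_year"] - kept["yr_built"]

print(kept[["price", "sale_year", "house_age"]])
```

Rows with a missing `house_age` are the ones that the later `dropna` call removes.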


## 2.2. Geospatial Feature Encoding (Target Encoding)

We extract the numeric ZIP code from `statezip`, then compute the median sale price per `zipcode` and per `city`. Mapping these medians back onto each row gives two numerical features (`zipcode_value` and `city_value`) that capture the typical price level of a location.


In [123]:
df["zipcode"] = df["statezip"].str.extract(r"(\d+)").astype(float)
zipcode_price = df.groupby("zipcode")["price"].median().to_dict()
df["zipcode_value"] = df["zipcode"].map(zipcode_price)

city_price = df.groupby("city")["price"].median().to_dict()
df["city_value"] = df["city"].map(city_price)
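Note that the cell above computes the medians on the full data frame, before the train/test split in Section 3.1, so each test row's own price contributes to its encoded feature. A possible alternative, sketched here on invented toy data, is to learn the medians from the training rows only and reuse them for the test rows. The frames `train` and `test` and the fallback to a global median are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the training and test partitions; values are illustrative only.
train = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Kent", "Kent", "Kent"],
    "price": [500_000.0, 700_000.0, 300_000.0, 320_000.0, 340_000.0],
})
test = pd.DataFrame({"city": ["Seattle", "Kent", "Redmond"]})

# Learn the per-city median price from training rows only.
city_median = train.groupby("city")["price"].median()
global_median = train["price"].median()

# Apply the learned mapping; cities unseen in training fall back to the global median.
train["city_value"] = train["city"].map(city_median)
test["city_value"] = test["city"].map(city_median).fillna(global_median)

print(test)
```

The same pattern would apply to `zipcode_value`.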

# 3. Data Preparation for Modeling

This stage finalizes the data structure: we select the relevant features, drop any rows that still have missing values in these columns or in `price`, and apply a log transformation to the target (`np.log1p`, i.e. $\log(1 + \text{price})$) to reduce right skew and stabilize the variance before training.


In [124]:
import numpy as np

features = [
    "sqft_living",
    "bathrooms",
    "bedrooms",
    "floors",
    "sqft_lot",
    "house_age",
    "zipcode_value",
    "city_value",
    "view",
    "condition",
]


df = df.dropna(subset=features + ["price"])

X = df[features]
y = np.log1p(df["price"])
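The effect of the log transformation on a skewed target can be checked numerically. The sketch below uses synthetic log-normal "prices" (an assumption made purely for illustration, not the housing data) and compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not drawn from data.csv.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

raw_skew = prices.skew()            # skewness of the raw values
log_skew = np.log1p(prices).skew()  # skewness after log(1 + price)

print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

On the real data, the same two `.skew()` calls on `df["price"]` and `y` would show how much the transformation helps.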

## 3.1. Train-Test Split

The processed feature matrix ($\mathbf{X}$) and target vector ($\mathbf{y}$) are split into training (80%) and testing (20%) sets. A fixed `random_state` is used to ensure reproducibility of the splits.


In [125]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

## 3.2. Feature Scaling

The training data is standardized using `StandardScaler` (`fit_transform`) to set the mean to 0 and the standard deviation to 1. The same fitted scaler is then used to transform the test set (`transform`) to prevent data leakage from the test set into the scaling process.


In [126]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
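One way to make the fit-on-train, transform-on-test discipline harder to break is to bundle the scaler and the model in a scikit-learn pipeline. This is a sketch on synthetic data, not the notebook's method; the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(2000, 900, 200), rng.normal(2, 0.8, 200)])
y_demo = 0.0005 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(0, 0.05, 200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# fit() fits the scaler on the training rows only, then fits Ridge on the scaled result.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(Xd_train, yd_train)

# score() and predict() reuse the training-set mean and std to scale the test rows.
fitted_scaler = pipe.named_steps["standardscaler"]
means_match = np.allclose(fitted_scaler.mean_, Xd_train.mean(axis=0))
test_r2 = pipe.score(Xd_test, yd_test)

print(means_match, round(test_r2, 3))
```

A pipeline also plugs directly into cross-validation, where the scaler is refit inside each fold.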

# 4. Model Training: Ridge Regression

A Ridge Regression model is initialized with the default regularization parameter ($\alpha=1.0$). The model is then trained using the **scaled training data** ($\mathbf{X}_{\text{train\_scaled}}$ and $\mathbf{y}_{\text{train}}$).


In [127]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)

parameter,value
alpha,1.0
fit_intercept,True
copy_X,True
max_iter,None
tol,0.0001
solver,'auto'
positive,False
random_state,None
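Because the features are standardized, the fitted coefficients are on a comparable scale and can be ranked by magnitude. The sketch below illustrates this on synthetic data with two hypothetical features; the assumed relationship between the features and the log price is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic features; the column names mirror the notebook, the values do not.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft_living": rng.normal(2000, 900, 300),
    "house_age": rng.normal(40, 25, 300),
})

# Assumed toy relationship: larger homes cost more, older homes slightly less.
log_price = (
    13
    + 0.0004 * demo["sqft_living"]
    - 0.004 * demo["house_age"]
    + rng.normal(0, 0.1, 300)
)

X_scaled = StandardScaler().fit_transform(demo)
ridge = Ridge(alpha=1.0).fit(X_scaled, log_price)

# Pair each coefficient with its feature name and sort by absolute size.
coefs = pd.Series(ridge.coef_, index=demo.columns).sort_values(key=np.abs, ascending=False)
print(coefs)
```

In the notebook itself, the equivalent would be `pd.Series(model.coef_, index=features)`.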


## 4.1. Model Evaluation

The trained Ridge model is used to make predictions on the **scaled test set** ($\mathbf{X}_{\text{test\_scaled}}$). The performance is then assessed using three common regression metrics:
1.  **Mean Absolute Error (MAE):** The average magnitude of the errors.
2.  **Root Mean Squared Error (RMSE):** The square root of the average of squared errors, sensitive to larger errors.
3.  **Coefficient of Determination ($R^2$):** The proportion of the variance in the dependent variable that is predictable from the independent variables.

*Note: Given that the target variable ($\text{price}$) was log-transformed, these metrics evaluate the performance on the log-scale predictions.*


In [128]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

MAE: 0.1906
RMSE: 0.2682
R²: 0.7725
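Log-scale errors are hard to read in monetary terms. As a rough sketch, the same metric can be recomputed after mapping both the targets and the predictions back with `np.expm1`, the inverse of `np.log1p`. The three price pairs below are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted prices, stored on the log1p scale as in the notebook.
y_true_log = np.log1p(np.array([300_000.0, 450_000.0, 800_000.0]))
y_pred_log = np.log1p(np.array([330_000.0, 420_000.0, 700_000.0]))

# Error on the log scale (what the notebook reports).
mae_log = mean_absolute_error(y_true_log, y_pred_log)

# Error in dollars, after undoing the transformation on both arrays.
mae_dollars = mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

print(f"MAE (log scale): {mae_log:.4f}")
print(f"MAE (dollars): {mae_dollars:,.0f}")
```

With the notebook's variables, this amounts to passing `np.expm1(y_test)` and `np.expm1(y_pred)` to the metric functions.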


# 5. Conclusion and Next Steps

The Ridge Regression model has been trained and evaluated on the standardized data.

## Summary of Performance
The final evaluation metrics on the test set (log scale) are:
*   **MAE:** 0.1906
*   **RMSE:** 0.2682
*   **R²:** 0.7725


## Interpretation
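An $R^2$ of 0.7725 means the model explains about 77% of the variance in log price on the test set. Because the target is `log1p(price)`, the RMSE of 0.2682 is in log units; it corresponds to a typical multiplicative error of about $e^{0.2682} \approx 1.31$, or roughly 30% of the sale price.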

## Next Steps (Recommendations)
1.  **Inverse Transformation:** Apply `np.expm1`, the inverse of `np.log1p` ($\text{Price} = e^{y} - 1$), to $\mathbf{y}_{\text{pred}}$ to obtain price predictions in dollars, which are easier to interpret.
2.  **Hyperparameter Tuning:** Only the default $\alpha=1.0$ was tried here. Search for a better $\alpha$ with cross-validation (e.g., `RidgeCV`) to potentially improve the performance metrics.
3.  **Model Comparison:** Compare the performance of the Ridge model against other models, such as Lasso or ElasticNet, to select the best-performing approach.
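As a starting point for the tuning step, the following sketch runs `RidgeCV` over a logarithmic grid of $\alpha$ values with 5-fold cross-validation. The data is synthetic, and the grid bounds and fold count are assumptions that would need adjusting for the housing data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + rng.normal(0, 0.1, 200)

# Assumed candidate grid: 13 values from 1e-3 to 1e3.
alphas = np.logspace(-3, 3, 13)

# Scaling lives inside the pipeline so it is fit together with the model.
search = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
search.fit(X_demo, y_demo)

# alpha_ holds the regularization strength selected by cross-validation.
best_alpha = search.named_steps["ridgecv"].alpha_
print(f"selected alpha: {best_alpha}")
```

The selected `alpha_` can then be compared against the default of 1.0 using the same test-set metrics as in Section 4.1.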
