# House Price Prediction - Adjusted as a Portfolio Project
This project demonstrates data cleaning, regression modeling, and evaluation on the King County housing dataset.
It started as part of the IBM data analytics course and has been adapted into a portfolio project.
## Steps:
1. Import libraries and load data
2. Data cleaning
3. Simple linear regression
4. Multiple linear regression
5. Pipeline with polynomial features
6. Train-test split
7. Ridge regression
8. Polynomial transformation + Ridge regression
---

## 1. Import Libraries and Load Data

In [28]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

file_name='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'
df=pd.read_csv(file_name)
df.head()

| Unnamed: 0.1 | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


## 2. Data Cleaning

In [30]:
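# Drop identifier columns that carry no predictive information.
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True)
df.describe()

# Fill missing bedroom counts with the column mean.
mean = df['bedrooms'].mean()
df['bedrooms'] = df['bedrooms'].replace(np.nan, mean)

# Fill missing bathroom counts with the column mean.
mean = df['bathrooms'].mean()
df['bathrooms'] = df['bathrooms'].replace(np.nan, mean)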
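
A quick way to confirm that mean imputation did what was intended is to count missing values before and after. The sketch below uses a small made-up frame rather than the King County data: the column names match the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented values) with gaps in the same two columns as the notebook.
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.25, np.nan, 3.0],
})

missing_before = toy.isnull().sum()

# Mean imputation, column by column.
for col in ["bedrooms", "bathrooms"]:
    toy[col] = toy[col].fillna(toy[col].mean())

missing_after = toy.isnull().sum()
print(missing_before.to_dict(), "->", missing_after.to_dict())
```

On the real frame, the same `df.isnull().sum()` call before and after the cleaning cell would show which columns had gaps and confirm that none remain.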


## 3. Simple Linear Regression

In [87]:
X = df[["sqft_living"]]
Y = df["price"]
lm = LinearRegression()
lm.fit(X, Y)
print("R²:", round(lm.score(X, Y),3))

R²: 0.493
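
For readers who want to see what `lm.score` reports, R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on synthetic data (the numbers are invented stand-ins for `sqft_living` and `price`, not the real dataset) and compares it with scikit-learn's value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for sqft_living -> price (invented numbers, not the real data).
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200).reshape(-1, 1)
y = 280.0 * x.ravel() + rng.normal(0, 150_000, size=200)

model = LinearRegression().fit(x, y)
pred = model.predict(x)

ss_res = np.sum((y - pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_manual, 3), round(model.score(x, y), 3))
```

Because the score here is computed on the same rows used for fitting, it is an in-sample figure, as is the 0.493 above.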


## 4. Multiple Linear Regression

In [83]:
Z = df[["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living", "sqft_above", "grade", "sqft_living15"]]
lm1 = LinearRegression()
lm1.fit(Z, Y)
print("R²:", round(lm1.score(Z, Y),3))

R²: 0.658


## 5. Pipeline with Polynomial Features

In [79]:
Input=[('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]
pipe = Pipeline(Input)
Z = Z.astype(float)
pipe.fit(Z, Y)
ypipe = pipe.predict(Z)
print("R²:", round(r2_score(Y, ypipe),3))

R²: 0.751
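
The pipeline uses `PolynomialFeatures` with its default `degree=2`, so the 11 columns in `Z` are expanded before the linear model sees them. As a small sketch of how large that expansion is, the snippet below fits the transformer on a dummy array of the same width (the array itself is a placeholder; only its shape matters).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 11                    # number of columns in Z
dummy = np.zeros((1, n_inputs))  # placeholder values; only the shape matters

full = PolynomialFeatures(degree=2, include_bias=False).fit(dummy)
inter = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True).fit(dummy)

print(full.n_output_features_)   # 77: 11 linear + 11 squares + 55 pairwise products
print(inter.n_output_features_)  # 66: 11 linear + 55 pairwise products
```

With 77 fitted coefficients instead of 11, a higher in-sample R² is expected; whether it carries over to unseen data needs a held-out evaluation.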


## 6. Train-Test Split

In [41]:
X = df.drop("price", axis=1)
Y = df["price"]

X = pd.get_dummies(X, drop_first=True)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
X_train.shape, X_test.shape

((18371, 389), (3242, 389))
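
The jump to 389 columns presumably comes from `pd.get_dummies` one-hot encoding the string `date` column, which produces one column per distinct sale date. The sketch below reproduces the effect on the five dates shown in the `df.head()` output, then shows one possible alternative: parsing the date and keeping numeric parts. The alternative, including the `sale_year` and `sale_month` names, is an assumption for illustration and is not what the notebook does.

```python
import pandas as pd

# Dates and sqft_living copied from the df.head() output above.
toy = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000", "20150225T000000",
             "20141209T000000", "20150218T000000"],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
})

# One-hot encoding: one column per distinct date, minus the dropped first level.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape)  # (5, 4): sqft_living + 3 date dummies

# Alternative (assumption): parse the date and keep numeric parts instead.
parsed = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S")
compact = toy.drop(columns="date").assign(
    sale_year=parsed.dt.year,    # hypothetical column name
    sale_month=parsed.dt.month,  # hypothetical column name
)
print(compact.shape)  # (5, 3)
```

Keeping the column count small matters later, because the polynomial expansion grows roughly with the square of the number of input columns.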

## 7. Ridge Regression

In [69]:
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(X_train, Y_train)
yhat = RidgeModel.predict(X_test)
print("R²:",round(r2_score(Y_test, yhat),3))

R²: 0.687
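
The Ridge penalty acts on coefficient size, so it is sensitive to feature scale, and `alpha=0.1` above is a fixed choice. A minimal sketch of one common refinement, assuming it suits the data, is to standardize inside a pipeline and let `RidgeCV` pick `alpha` from a candidate list. The data below is synthetic and the candidate list is arbitrary.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (invented, for illustration).
rng = np.random.default_rng(1)
n = 500
X_demo = np.column_stack([
    rng.uniform(500, 5000, n),   # square-footage-like
    rng.integers(1, 7, n),       # bedroom-count-like
    rng.uniform(47.1, 47.8, n),  # latitude-like
])
y_demo = 250 * X_demo[:, 0] + 20_000 * X_demo[:, 1] + rng.normal(0, 50_000, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=1)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # arbitrary candidate grid
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X_tr, y_tr)

best_alpha = model.named_steps["ridgecv"].alpha_
print("chosen alpha:", best_alpha)
print("test R²:", round(model.score(X_te, y_te), 3))
```

Fitting the scaler inside the pipeline keeps the test rows out of the scaling statistics, so the held-out score stays an honest estimate.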


## 8. Polynomial Transformation + Ridge Regression

In [65]:
# First attempt, kept for reference: expanding all 389 columns ran into memory issues.
# X_train_small = X_train.astype("float32")
# X_test_small = X_test.astype("float32")
# pr = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
# X_train_pr = pr.fit_transform(X_train_small)
# X_test_pr = pr.transform(X_test_small)
# RidgeModel.fit(X_train_pr, Y_train)
# y_hat = RidgeModel.predict(X_test_pr)
# r2_score(Y_test, y_hat)

# === Memory-safe polynomial step: expand only a few numeric columns ===

candidates = ["sqft_living", "bedrooms", "bathrooms", "floors", "sqft_lot", "grade"]
num_cols = [c for c in candidates if c in X_train.columns]


Xtr = X_train[num_cols].astype("float32")
Xte = X_test[num_cols].astype("float32")

# Polynomial features on ONLY those columns (no bias term, interactions only).
pr = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
Xtr_pr = pr.fit_transform(Xtr)
Xte_pr = pr.transform(Xte)


RidgeModel.fit(Xtr_pr, Y_train)
y_hat = RidgeModel.predict(Xte_pr)
print("Using polynomial on:", num_cols)
print("R²:", round(r2_score(Y_test, y_hat),3))


Using polynomial on: ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'sqft_lot', 'grade']
R²: 0.594
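
A back-of-the-envelope calculation shows why expanding all columns ran into memory trouble. With `degree=2`, `interaction_only=True`, and no bias term, the output holds the original columns plus every pairwise product. The sketch below estimates only the size of the dense training matrix at 4 bytes per `float32` value; actual peak usage during fitting may well be higher.

```python
from math import comb

n_rows, n_cols = 18371, 389  # X_train.shape from the train-test split

# Original columns plus all pairwise products.
n_out_all = n_cols + comb(n_cols, 2)
bytes_float32 = n_rows * n_out_all * 4

# Same expansion restricted to the six selected columns.
n_out_six = 6 + comb(6, 2)

print(n_out_all)                      # 75855
print(round(bytes_float32 / 1e9, 2))  # 5.57 (GB)
print(n_out_six)                      # 21
```

Restricting the expansion to six columns brings the output down to 21 features, at the cost of discarding the other predictors for this model.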




## Conclusions  

### Baseline Model (Linear Regression with one feature)  
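- Established a simple baseline using `sqft_living` as the only predictor (R² = 0.493, measured on the data used for fitting).  
- Useful for understanding the relationship between living area and price, but it leaves about half of the price variance unexplained.  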

### Multiple Linear Regression  
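- Adding ten more features raised the in-sample R² from 0.493 to 0.658.  
- Additional predictors can reduce bias but may introduce collinearity, since several of them (for example `sqft_living`, `sqft_above`, and `sqft_basement`) describe overlapping quantities.  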
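
The collinearity concern can be made concrete with the rows shown by `df.head()`: in all five, `sqft_living` equals `sqft_above + sqft_basement`, and all three columns are in the multiple regression feature set. The sketch below checks that identity on those five rows only; whether it holds across the whole dataset is not verified here.

```python
import pandas as pd

# The five rows shown by df.head() earlier in the notebook.
head = pd.DataFrame({
    "sqft_living":   [1180, 2570, 770, 1960, 1680],
    "sqft_above":    [1180, 2170, 770, 1050, 1680],
    "sqft_basement": [0, 400, 0, 910, 0],
})

# True when living area is exactly the sum of the other two columns in every row.
is_sum = (head["sqft_above"] + head["sqft_basement"] == head["sqft_living"]).all()
print(is_sum)
```

If the same check passes on the full frame, one of the three columns is redundant, and dropping it would remove an exact linear dependency without losing information.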

### Ridge Regression  
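- Applied to handle multicollinearity and stabilize coefficients.  
- Scored R² = 0.687 on the held-out test set. The earlier models were scored on their own training data and used different feature sets, so this figure is not directly comparable with them, and the notebook does not include a plain linear regression on the same split to confirm a generalization gain.  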

### Polynomial Features + Ridge  
- Explored non-linear relationships by expanding features.  
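- The model ran successfully, but produced a warning about an *ill-conditioned matrix*, which indicates that the expanded feature matrix is close to singular: products of unscaled columns are highly correlated and span very different magnitudes.  
- Ridge regularization with `alpha=0.1` was not enough to prevent the warning, and the test R² of 0.594 on the six-column expansion was lower than the 0.687 from Ridge on the full feature set, so the expansion did not improve performance here while making computation more resource-intensive.  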
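
To illustrate the ill-conditioning point, the sketch below compares the condition number of the Gram matrix `X^T X` for interaction features built from unscaled columns against the same features after standardization. The data is synthetic, on roughly house-like scales, and the Gram matrix is used as a simplified proxy for the system a Ridge solver works with (which also adds `alpha` on the diagonal).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic columns on house-like scales (invented values).
rng = np.random.default_rng(0)
n = 1000
raw = np.column_stack([
    rng.uniform(500, 5000, n),    # square-footage-like
    rng.integers(1, 7, n),        # bedroom-count-like
    rng.uniform(1000, 50000, n),  # lot-size-like
]).astype("float64")

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
expanded = poly.fit_transform(raw)
scaled = StandardScaler().fit_transform(expanded)

# Larger condition number = closer to singular = less reliable numerical solve.
cond_raw = np.linalg.cond(expanded.T @ expanded)
cond_scaled = np.linalg.cond(scaled.T @ scaled)
print(f"unscaled: {cond_raw:.2e}  standardized: {cond_scaled:.2e}")
```

Standardizing the expanded features, or raising `alpha`, are two options that may reduce the warning; neither was tested in this notebook.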



## Key Takeaways  
- Linear regression is easy to interpret but limited in predictive power.  
- Ridge regression improves stability when features are correlated.  
- Polynomial features add flexibility but can introduce serious computational and numerical challenges.  
- In practice, careful feature selection and preprocessing are necessary before applying polynomial expansions at scale.  
