
# Predicting House Prices with Simple Linear Regression  
**Author:** Your Name  

This notebook follows a lightweight ML lifecycle using a **simulated (offline)** house-prices dataset with realistic ranges.  
You can later swap the synthetic DataFrame for a real CSV (e.g., the Kaggle King County dataset) and keep the same steps, provided the columns used below are present.

**Lifecycle Steps**  
1) Data selection & loading  
2) Cleaning (missing values, simple outlier handling)  
3) Feature engineering (one new meaningful feature)  
4) Train/test split & Linear Regression  
5) Evaluation (MSE, R²)  
6) Visualization (scatter + regression line)  
7) Sanity-check predictions for two custom inputs


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline
np.random.seed(42)


## 1) Data Selection & Loading

In [None]:

# Simulated "Mini House Prices" dataset (no internet needed).
# Columns: price (target, USD), sqft_living, year_built, bathrooms, occupants
N = 800
sqft = np.random.randint(500, 3500, size=N)
year_built = np.random.randint(1950, 2022, size=N)
bathrooms = np.round(np.random.uniform(1, 4, size=N), 1)
occupants = np.clip(np.round(np.random.normal(2.7, 1.0, size=N), 1), 1, 6)

# Price model with noise (USD):
# base + sqft effect + bathroom effect + (newer houses, higher price) + negative crowding + noise
base = 50_000
price = (
    base
    + 180 * sqft
    + 15_000 * bathrooms
    + 2_000 * (year_built - 1950)
    - 8_000 * (occupants - 2.5)
    + np.random.normal(0, 45_000, size=N)  # noise
)

df = pd.DataFrame({
    "price": price,
    "sqft_living": sqft,
    "year_built": year_built,
    "bathrooms": bathrooms,
    "occupants": occupants
})

df.head()
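
If the synthetic DataFrame is later replaced with a real CSV, a small loader that renames columns and checks the schema lets the remaining cells run unchanged. The sketch below is only an illustration: `load_house_csv`, the `yr_built` source column name, and the in-memory rows are hypothetical stand-ins, not part of this notebook. A real dataset may also lack an `occupants` column, in which case the `sqft_per_occupant` feature would need a substitute.

```python
import io
import pandas as pd

# Made-up rows standing in for a real file; pass a file path instead of StringIO.
csv_text = """price,sqft_living,yr_built,bathrooms
250000,1200,1960,1.0
540000,2600,1995,2.5
190000,800,1952,1.0
"""

REQUIRED = ["price", "sqft_living", "year_built", "bathrooms"]

def load_house_csv(source, rename=None):
    """Read a CSV, optionally rename columns, and verify the expected schema."""
    frame = pd.read_csv(source)
    if rename:
        frame = frame.rename(columns=rename)
    missing = [c for c in REQUIRED if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame[REQUIRED]

real_df = load_house_csv(io.StringIO(csv_text), rename={"yr_built": "year_built"})
print(real_df.dtypes)
```

Failing early on a missing column is usually easier to debug than a `KeyError` several cells later.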


## 2) Data Cleaning

In [None]:

# Introduce a few missing values artificially (to demonstrate handling)
df.loc[np.random.choice(df.index, 10, replace=False), "bathrooms"] = np.nan

print("Missing values before fill:\n", df.isna().sum())

# Simple imputation for missing bathrooms with median
df["bathrooms"] = df["bathrooms"].fillna(df["bathrooms"].median())

# Simple outlier handling: clip sqft at 1st/99th percentiles
low, high = df["sqft_living"].quantile([0.01, 0.99])
df["sqft_living_clipped"] = df["sqft_living"].clip(lower=low, upper=high)

print("\nMissing values after fill:\n", df.isna().sum())
(low, high)
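
To see what the two cleaning steps do, here is a minimal sketch on a made-up five-row frame (the `toy` data is an assumption for illustration only). Median imputation fills the gap with the middle observed value, and percentile clipping pulls extreme values back to the bound without dropping rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bathrooms": [1.0, 2.0, np.nan, 3.0, 2.0],
    "sqft_living": [600, 900, 1200, 1500, 20_000],  # last row is an extreme value
})

# median() skips NaN, so this is the median of the four observed values.
bath_median = toy["bathrooms"].median()
toy["bathrooms"] = toy["bathrooms"].fillna(bath_median)

# Percentile bounds; values outside [low, high] are set to the nearest bound.
low, high = toy["sqft_living"].quantile([0.01, 0.99])
toy["sqft_living_clipped"] = toy["sqft_living"].clip(lower=low, upper=high)

print(toy)
```

In a stricter setup, the median and the percentile bounds would be computed on the training split only and then applied to the test split, so that no test information reaches the model.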


## 3) Feature Engineering

In [None]:

# House age from build year, assuming a fixed "current year" of 2025.
# Note: house_age is computed for reference only; it is not part of X below.
df["house_age"] = 2025 - df["year_built"]

# Engineered feature: sqft per occupant (space per person signal)
df["sqft_per_occupant"] = df["sqft_living_clipped"] / df["occupants"]

# We'll use one raw numeric feature + one engineered feature for the model.
X = df[["sqft_living_clipped", "sqft_per_occupant"]]
y = df["price"]
X.head()
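
One way to keep feature logic consistent between training data and later custom inputs is to wrap it in a function. This is a sketch under assumptions: `add_features` is a hypothetical helper, not something the notebook defines, and the positive-occupants check is an added safeguard (the simulated data already clips occupants to at least 1).

```python
import pandas as pd

def add_features(frame, current_year=2025):
    """Return a copy with house_age and sqft_per_occupant added."""
    out = frame.copy()
    out["house_age"] = current_year - out["year_built"]
    # Guard against division by zero or negative occupant counts.
    if (out["occupants"] <= 0).any():
        raise ValueError("occupants must be positive")
    out["sqft_per_occupant"] = out["sqft_living_clipped"] / out["occupants"]
    return out

raw = pd.DataFrame({
    "sqft_living_clipped": [900, 2400],
    "year_built": [1990, 2010],
    "occupants": [2, 2],
})
features = add_features(raw)
print(features[["house_age", "sqft_per_occupant"]])
```

Because the function returns a copy, the input frame is left untouched and the same call can be reused on new rows.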


## 4) Train/Test Split & Linear Regression

In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print("Intercept:", round(linreg.intercept_, 2))
print("Coefficients [sqft_living_clipped, sqft_per_occupant]:", np.round(linreg.coef_, 2))
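
`LinearRegression` fits ordinary least squares, so a prediction is the intercept plus each coefficient times its feature value. The following sketch reproduces that computation with NumPy alone on made-up, noise-free data (the data and the "true" parameters are assumptions chosen so the fit is exact); it is meant to show the arithmetic, not to replace the scikit-learn call.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
true_intercept, true_coef = 4.0, np.array([2.0, -3.0])
y = true_intercept + X @ true_coef  # noise-free, so least squares recovers the parameters

# Prepend a column of ones so the intercept is estimated together with the slopes.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# A prediction is intercept + sum(coef_i * x_i).
x_new = np.array([1.0, 2.0])
y_new = intercept + x_new @ coef

print("intercept:", round(float(intercept), 6))
print("coefficients:", np.round(coef, 6))
print("prediction for x_new:", round(float(y_new), 6))
```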


## 5) Evaluation (MSE & R²)

In [None]:

y_pred = linreg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:,.2f}")
print(f"R² : {r2:.4f}")
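
For intuition, both metrics can be computed by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses four made-up values (not notebook results) and also reports RMSE, which is in the same units as the target and is often easier to read than MSE.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

An R² of 1 means a perfect fit, 0 means no better than always predicting the mean, and negative values are possible on held-out data.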


## 6) Visualization: Scatter + Regression Line

In [None]:

# Scatter: clipped sqft_living vs price (random sample of 500 rows)
fig = plt.figure()  # single plot, default styles
sample = df.sample(500, random_state=42)
plt.scatter(sample["sqft_living_clipped"], sample["price"], alpha=0.5)

# Regression line: vary sqft_living_clipped across its range and hold sqft_per_occupant
# at its TRAIN mean, so the line is a slice of the fitted plane, not a one-feature fit.
x_line = np.linspace(df["sqft_living_clipped"].min(), df["sqft_living_clipped"].max(), 200)
fixed_sqft_per_occ = X_train["sqft_per_occupant"].mean()
X_line = pd.DataFrame({
    "sqft_living_clipped": x_line,
    "sqft_per_occupant": fixed_sqft_per_occ
})
y_line = linreg.predict(X_line)

plt.plot(x_line, y_line)
plt.xlabel("Square Footage (clipped 1st–99th pct)")
plt.ylabel("Price (USD)")
plt.title("House Price vs. Sqft with Model Regression Line")
plt.show()
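
Because the second feature is held constant, the plotted line is a straight slice through the fitted plane, and its slope equals the coefficient on `sqft_living_clipped`. The sketch below checks this numerically; the intercept, coefficients, and fixed value are hypothetical numbers chosen for illustration, not the notebook's fitted output.

```python
import numpy as np

# Hypothetical fitted parameters (illustration only).
intercept = 100_000.0
coef_sqft, coef_per_occ = 150.0, 40.0
fixed_per_occ = 800.0  # stand-in for the training mean of sqft_per_occupant

x_line = np.linspace(500, 3500, 5)
y_line = intercept + coef_sqft * x_line + coef_per_occ * fixed_per_occ

# Slope between consecutive points on the line.
slopes = np.diff(y_line) / np.diff(x_line)
print(slopes)
```

In the real data `sqft_per_occupant` tends to grow with square footage, so the slice can look flatter or steeper than the overall trend in the scatter.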


## 7) Sanity-Check Predictions (Two Custom Inputs)

In [None]:

cases = pd.DataFrame({
    "sqft_living_clipped": [900, 2400],
    "sqft_per_occupant": [450, 1200]  # e.g., 900 sqft / 2 ppl, 2400 sqft / 2 ppl
})
preds = linreg.predict(cases)
print("Inputs:\n", cases)
print("Predicted Price (USD):", np.round(preds, 0))
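
A linear model extrapolates without warning, so it can help to flag custom inputs that fall outside the range seen during training. The following is a minimal sketch: `outside_training_range` is a hypothetical helper, and the `train` and `cases` frames are made-up stand-ins for `X_train` and the inputs above.

```python
import pandas as pd

def outside_training_range(cases, train):
    """Boolean frame: True where a case value lies outside train's column min/max."""
    lo, hi = train.min(), train.max()
    return cases.lt(lo, axis=1) | cases.gt(hi, axis=1)

train = pd.DataFrame({
    "sqft_living_clipped": [600, 1500, 3400],
    "sqft_per_occupant": [200, 700, 1700],
})
cases = pd.DataFrame({
    "sqft_living_clipped": [900, 5000],
    "sqft_per_occupant": [450, 2500],
})

flags = outside_training_range(cases, train)
print(flags)
```

Rows with any `True` flag are predictions the model has no nearby training data to support and deserve extra skepticism.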
