# Predicting House Prices with Simple Linear Regression
### AI-ML-Assignment-2-Linear-Regression
**Author:** Aden Osman  
**Date:** 2025-11-04

This notebook follows the **ML lifecycle**: Data Loading, Cleaning, Feature Engineering, Model Training, and Evaluation, using a small demo house-price dataset.

**Tip:** To use a real dataset (e.g., Kaggle King County), replace the CSV path in the Data Loading section.

## 1) Data Loading

In [None]:
import pandas as pd

# Change this path to your real dataset if you have one.
csv_path = r"/mnt/data/data/king_county_demo.csv"

df = pd.read_csv(csv_path)
df.head()

## 2) Quick EDA (Shape, Info, Missing Values)

In [None]:
print("Shape:", df.shape)
print("\nInfo:")
df.info()  # info() prints its summary itself and returns None, so wrapping it in print() would also print "None"
print("\nMissing values per column:")
print(df.isnull().sum())
df.describe(include='all')

## 3) Data Cleaning
- Handle missing values (simple imputation)
- Remove obvious outliers in chosen features

We will:
1. Impute missing `sqft_living` and `yr_built` values with each column's median.
2. Remove homes with extreme `sqft_living` (above the 99th percentile).

In [None]:
import numpy as np

# 1) Median imputation for numerical columns we need
for col in ["sqft_living", "yr_built"]:
    if col in df.columns:
        med = df[col].median()
        df[col] = df[col].fillna(med)

# 2) Remove extreme outliers in sqft_living (beyond 99th percentile)
q99 = df["sqft_living"].quantile(0.99)
before = df.shape[0]
df = df[df["sqft_living"] <= q99].copy()
after = df.shape[0]

print(f"Removed {before - after} outlier rows based on sqft_living > 99th percentile ({q99:.0f}).")
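
To make the two cleaning steps concrete, here is a minimal sketch on a made-up five-row frame. The name `toy` and its values are assumptions for illustration only and are not part of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value and one extreme value.
toy = pd.DataFrame({"sqft_living": [1000.0, 1500.0, np.nan, 2000.0, 50000.0]})

# median() skips NaN, so it is computed from the four observed values.
toy_median = toy["sqft_living"].median()
toy["sqft_living"] = toy["sqft_living"].fillna(toy_median)

# Keep rows at or below the 99th percentile of the imputed column.
toy_q99 = toy["sqft_living"].quantile(0.99)
toy_clean = toy[toy["sqft_living"] <= toy_q99].copy()

print("Median used for imputation:", toy_median)
print("Rows before:", len(toy), "rows after:", len(toy_clean))
```

With so few rows the 99th percentile is interpolated between the two largest values, so only the single largest row is dropped. On a real dataset the cutoff removes roughly the top 1% of homes by living area.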

## 4) Feature Engineering
- Keep one numerical feature: `sqft_living`
- Create a new feature: `house_age = 2025 - yr_built`

In [None]:
df["house_age"] = 2025 - df["yr_built"]
df[["sqft_living", "yr_built", "house_age", "bedrooms", "price"]].head()

## 5) Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

# Features and target (use one original + one engineered feature)
X = df[["sqft_living", "house_age"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape
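
As a small illustration of what `test_size=0.2` and `random_state=42` do, the sketch below splits a made-up ten-row frame twice and checks that both splits are identical. The names `demo_X` and `demo_y` and their values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target, ten rows.
demo_X = pd.DataFrame({"sqft_living": range(1000, 2000, 100)})
demo_y = pd.Series(range(10))

# Same arguments as the split above: 20% test rows, fixed seed.
a_train, a_test, _, _ = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(demo_X, demo_y, test_size=0.2, random_state=42)

print("Train rows:", len(a_train), "Test rows:", len(a_test))
print("Same test rows on both runs:", list(a_test.index) == list(b_test.index))
```

Fixing `random_state` keeps the reported metrics reproducible from run to run. Changing it selects a different test set, so the scores will shift slightly.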

## 6) Model: Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
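
The raw `coef_` array is easier to read when each value is paired with its feature name. The sketch below does this on synthetic, noise-free data generated from a known formula, so the recovered coefficients can be compared with the true ones. All `demo_*` names and the formula are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship (made up):
# price = 50_000 + 200 * sqft_living - 1_000 * house_age
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, size=50),
    "house_age": rng.uniform(0, 100, size=50),
})
demo_y = 50_000 + 200 * demo_X["sqft_living"] - 1_000 * demo_X["house_age"]

demo_model = LinearRegression().fit(demo_X, demo_y)

# Label each coefficient with the column it belongs to.
demo_coefs = pd.Series(demo_model.coef_, index=demo_X.columns)
print(demo_coefs)
print("Intercept:", demo_model.intercept_)
```

Each coefficient is the change in predicted price for a one-unit increase in that feature while the other feature is held fixed. The same `pd.Series(model.coef_, index=X_train.columns)` pattern should work on the model trained above.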

## 7) Evaluation (MSE and R²)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)
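
For reference, both metrics can be computed by hand. The sketch below uses four made-up values (the `*_demo` arrays are assumptions, not results from this notebook) and also derives RMSE, which is in the same units as `price` and is therefore often easier to interpret than MSE.

```python
import numpy as np

# Made-up actual and predicted values.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true_demo - y_pred_demo

# MSE: mean of squared residuals. RMSE: its square root.
mse_demo = np.mean(residuals ** 2)
rmse_demo = np.sqrt(mse_demo)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print("MSE:", mse_demo, "RMSE:", rmse_demo, "R²:", r2_demo)
```

An R² of 1 means the predictions match the actual values exactly. A value of 0 means the model does no better than always predicting the mean of `y_test`, and negative values are possible on test data.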

## 8) Visualization: Scatter with Regression Line
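We plot actual and predicted prices against the primary feature `sqft_living`. Because the model also uses `house_age`, the regression line is drawn with `house_age` held at its test-set median, so the predicted points will not fall exactly on the line.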

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# For a clean 2D line, hold house_age at its median when drawing the line vs sqft_living
x_axis = X_test["sqft_living"].values
age_median = np.median(X_test["house_age"].values)

# Create a range of sqft values for a smooth line
sqft_line = np.linspace(X_test["sqft_living"].min(), X_test["sqft_living"].max(), 100)
X_line = pd.DataFrame({
    "sqft_living": sqft_line,
    "house_age": age_median
})
y_line = model.predict(X_line)

plt.figure(figsize=(7,5))
plt.scatter(x_axis, y_test.values, label="Actual")
plt.scatter(x_axis, y_pred, label="Predicted")
plt.plot(sqft_line, y_line, label="Regression line")
plt.title("House Price vs. Square Footage (with Regression Line)")
plt.xlabel("sqft_living")
plt.ylabel("price")
plt.legend()
plt.show()

## 9) Custom Predictions (Two Inputs)
Use the trained model to predict prices for two example houses.

In [None]:
examples = pd.DataFrame({
    "sqft_living": [1200, 2600],
    "house_age": [2025 - 1995, 2025 - 2010]  # e.g., built in 1995 and 2010
})
preds = model.predict(examples)
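# Pair each example row with its prediction instead of indexing back with i-1
for i, (features, p) in enumerate(zip(examples.to_dict("records"), preds), start=1):
    print(f"Example {i} features: {features} -> Predicted price: ${p:,.0f}")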
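
A linear regression prediction is just the intercept plus each coefficient times its feature value. The sketch below checks this against `predict` on a small model fitted to made-up data. The names `train_demo`, `price_demo`, `check_model`, and `new_houses` and all their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data, for illustration only.
train_demo = pd.DataFrame({
    "sqft_living": [800, 1200, 1600, 2400, 3000],
    "house_age": [60, 30, 45, 15, 5],
})
price_demo = pd.Series([210_000, 330_000, 390_000, 600_000, 760_000])

check_model = LinearRegression().fit(train_demo, price_demo)

# Same column order as the training frame.
new_houses = pd.DataFrame({"sqft_living": [1200, 2600], "house_age": [30, 15]})

# Manual prediction: intercept + sum(coefficient * feature value) per row.
manual_preds = check_model.intercept_ + new_houses.to_numpy() @ check_model.coef_
sklearn_preds = check_model.predict(new_houses)

print("Manual and predict() agree:", np.allclose(manual_preds, sklearn_preds))
```

The manual calculation relies on the columns being in the same order as during training, which is also why the `examples` frame above uses the same column names as `X`.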