# 🏠 Predicting Housing Prices with Machine Learning

**Author:** Lenise Muso Nkwain

This notebook trains a Random Forest regressor to predict median house values from geographic and socioeconomic features, then examines the results through feature importances, residual analysis, and statistical tests.

## 📊 Load and Explore the Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visual theme
sns.set_theme(style='whitegrid')

# Load dataset
housing = pd.read_csv('housing 2.csv')
housing.head()

## 🧹 Data Cleaning

In [None]:
# Overview of data
housing.info()

# Drop missing values for simplicity
housing = housing.dropna()

# Summary statistics
housing.describe()
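
Dropping rows is the simplest option, but it discards every record that has even one missing field. One possible alternative is to fill numeric gaps with the column median. The sketch below is only an illustration: `impute_numeric_median` is a hypothetical helper, and the toy frame (including its column names and values) is made up rather than read from `housing 2.csv`.

```python
import numpy as np
import pandas as pd


def impute_numeric_median(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns replaced by the column median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "median_income": [2.5, 3.0, np.nan, 4.5],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

filled = impute_numeric_median(toy)
print(filled)
print("Rows kept:", len(filled), "vs. dropna:", len(toy.dropna()))
```

If this approach were used for modeling, the medians should be computed on the training split only, so that no information from the test rows leaks into training.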

## 📈 Exploratory Data Analysis (EDA)

In [None]:
# Visualizing housing prices distribution
plt.figure(figsize=(8,6))
sns.histplot(housing['median_house_value'], kde=True, bins=50)
plt.title('Distribution of Median House Value')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(10,8))
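# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity,
# which would otherwise make corr() raise an error in pandas 2.x
sns.heatmap(housing.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')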
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Scatter plot: Median Income vs. House Value
plt.figure(figsize=(8,6))
sns.scatterplot(x='median_income', y='median_house_value', data=housing)
plt.title('Income vs. House Value')
plt.show()

## 🧠 Feature Engineering and Encoding

In [None]:
# One-hot encode ocean_proximity; drop_first=True omits one category, which becomes the baseline
ocean_dummies = pd.get_dummies(housing['ocean_proximity'], drop_first=True)
housing = pd.concat([housing.drop('ocean_proximity', axis=1), ocean_dummies], axis=1)
housing.head()
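
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column. The category labels are illustrative placeholders, not values confirmed from the dataset.

```python
import pandas as pd

# Toy column; the labels are assumed for illustration only.
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})

full = pd.get_dummies(toy["ocean_proximity"])
reduced = pd.get_dummies(toy["ocean_proximity"], drop_first=True)

print(list(full.columns))     # one indicator column per category
print(list(reduced.columns))  # first category (alphabetically) is dropped
print(reduced)
```

Rows belonging to the dropped category are encoded as all zeros (or all `False`), so no information is lost while one redundant column is avoided.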

## 🧪 Feature Importance

In [None]:
from sklearn.ensemble import RandomForestRegressor

X = housing.drop('median_house_value', axis=1)
y = housing['median_house_value']

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

# Plot impurity-based feature importances (derived from the data the forest was fit on)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh', figsize=(10,6), title='Feature Importances')
plt.show()
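
Impurity-based importances from `feature_importances_` are computed on the training data, so a cross-check on held-out data can be useful. One option is scikit-learn's `permutation_importance`. The sketch below runs on synthetic data (the `signal` and `noise` columns are invented for the demonstration) and is not the notebook's own analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),  # drives the target
    "noise": rng.normal(size=n),   # unrelated to the target
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and measure the drop in score.
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X_demo.columns)
print(perm_importances.sort_values(ascending=False))
```

In the notebook, the same call could be applied to a fitted forest and a held-out split of `X` and `y` once a train/test split exists.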

## 🤖 Model Training and Evaluation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluation
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
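
A single 80/20 split yields one RMSE and one R², both of which depend on which rows happen to land in the test set. As a hedged complement, k-fold cross-validation reports the spread across several splits. The sketch below uses synthetic arrays (`X_demo`, `y_demo`) as stand-ins for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the housing features and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so RMSE is returned negated; flip the sign back.
rmse_scores = -cross_val_score(
    model_demo, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
r2_scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring="r2")

print(f"RMSE per fold: {np.round(rmse_scores, 3)}")
print(f"RMSE mean ± std: {rmse_scores.mean():.3f} ± {rmse_scores.std():.3f}")
print(f"R² mean: {r2_scores.mean():.3f}")
```

A small standard deviation across folds suggests the single-split metrics are representative; a large one suggests they depend heavily on the split.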

## 📊 Residual Analysis

In [None]:
# Distribution of residuals (actual minus predicted)
residuals = y_test - y_pred
plt.figure(figsize=(8,6))
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residual (actual - predicted)')
plt.show()

In [None]:
# Predicted vs Actual
plt.figure(figsize=(8,6))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Home Prices')
# Dashed red diagonal marks perfect predictions (predicted == actual)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r')
plt.show()

## 🧾 Conclusion and Recommendations

- **Income** and **location** are strong indicators of housing prices.
- The Random Forest model performs well on the held-out 20% test split, as measured by R² and RMSE (RMSE is in the same units as `median_house_value`).
- Future improvements could include:
  - Incorporating time series data for trend analysis
  - Using deep learning models for feature-rich inputs like images
  - Expanding data coverage beyond California


## 🧪 Statistical Tests

### 🔍 Normality Test for Target Variable

In [None]:
from scipy.stats import shapiro

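# Shapiro-Wilk test for normality (H0: the values are normally distributed).
# SciPy warns that the p-value may be inaccurate for N > 5000, so test a random sample.
values = housing['median_house_value']
sample = values.sample(n=min(len(values), 5000), random_state=42)
stat, p = shapiro(sample)
print('Shapiro-Wilk Test Statistic=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('✅ No evidence against normality of house values (fail to reject H0)')
else:
    print('❌ The distribution of house values is not normal (reject H0)')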

### 🔗 Pearson Correlation Test Between Features and Target

In [None]:
# Pearson correlation test for each numeric feature
from scipy.stats import pearsonr

for col in housing.select_dtypes(include=[np.number]).columns:
    if col != 'median_house_value':
        corr, p = pearsonr(housing[col], housing['median_house_value'])
        # .3g keeps very small p-values visible instead of rounding them to 0.000
        print(f"{col}: Pearson correlation = {corr:.3f}, p-value = {p:.3g}")

### 🧪 T-test: High vs Low Income Areas

In [None]:
# Compare house values between high and low income areas
from scipy.stats import ttest_ind

median_income_threshold = housing['median_income'].median()
high_income = housing[housing['median_income'] > median_income_threshold]['median_house_value']
low_income = housing[housing['median_income'] <= median_income_threshold]['median_house_value']

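# Welch's t-test (equal_var=False) does not assume the two groups share a variance
stat, p = ttest_ind(high_income, low_income, equal_var=False)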
print('T-test Statistic=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('✅ Significant difference in home values between income groups')
else:
    print('❌ No significant difference in home values between income groups')
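
With a large number of rows, a t-test can flag even small differences as significant, so an effect size helps judge how large the gap really is. The sketch below computes Cohen's d with a pooled standard deviation. `cohens_d` is a hypothetical helper and the two short lists are made-up values, not results from the dataset.

```python
import numpy as np


def cohens_d(a, b) -> float:
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


# Made-up values for illustration only.
high_demo = [4.0, 5.0, 6.0]
low_demo = [1.0, 2.0, 3.0]

d = cohens_d(high_demo, low_demo)
print(f"Cohen's d = {d:.2f}")
```

In the notebook, `cohens_d(high_income, low_income)` would report the standardized gap between the two income groups alongside the t-test's p-value.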