# House Price Prediction — EDA & Baseline Models

**Author:** Shreyansh Mishra  
**College:** GLA University, Mathura  
**Date:** 2025-11-12

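This notebook walks through exploratory data analysis (EDA) and two baseline regression models (Linear Regression and Random Forest) on a house price dataset. To use your own data, replace `house_prices_sample.csv` with the path to your file. The code below expects the columns `price`, `sqft`, `year_built`, and `location`.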

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid')
%matplotlib inline

In [None]:
df = pd.read_csv('house_prices_sample.csv')
df.head()

In [None]:
df.info()

df.describe()

## Data Cleaning
- Check missing values
- Encode categorical variables


In [None]:
df.isnull().sum()
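
Once the missing-value counts are known, one common way to handle them is simple imputation. The sketch below is only an illustration on a hypothetical toy frame (the `toy` data and its values are made up, not taken from the dataset): numeric columns are filled with the median and non-numeric columns with the most frequent value. For a real model, these statistics would normally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "sqft": [850.0, np.nan, 1200.0, 1500.0],
    "location": ["Urban", "Suburban", None, "Urban"],
    "price": [100000, 150000, 180000, 250000],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()
print(missing_counts)

cleaned = toy.copy()

# Numeric columns: fill gaps with the column median.
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())

# Non-numeric columns: fill gaps with the most frequent value.
cat_cols = cleaned.select_dtypes(exclude="number").columns
for col in cat_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```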

In [None]:
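# Feature engineering

# House age in years, measured from 2025 (the year this notebook was written)
df['age'] = 2025 - df['year_built']
# One-hot encode location; this replaces the 'location' column with dummy
# columns, and drop_first=True drops one category to serve as the baseline
df = pd.get_dummies(df, columns=['location'], drop_first=True)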
df.head()
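
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-category toy frame (the category names `Rural`, `Suburban`, and `Urban` are assumptions for illustration). The first category in sorted order is dropped and becomes the implicit baseline: a row with all dummy columns set to zero belongs to it.

```python
import pandas as pd

# Hypothetical toy data; category names are illustrative only.
toy = pd.DataFrame({
    "location": ["Rural", "Suburban", "Urban", "Urban"],
    "year_built": [1990, 2005, 2015, 2020],
})

toy["age"] = 2025 - toy["year_built"]
encoded = pd.get_dummies(toy, columns=["location"], drop_first=True)

# The original 'location' column is gone; 'Rural' is the dropped baseline.
print(encoded.columns.tolist())
# ['year_built', 'age', 'location_Suburban', 'location_Urban']
```

Dropping one level avoids perfectly collinear dummy columns, which matters for Linear Regression more than for Random Forest.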

## EDA Visualizations
1. Price distribution
2. Price vs sqft
3. Price by location
4. Correlation heatmap

In [None]:
plt.figure(figsize=(8,5))
plt.hist(df['price'], bins=40)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(df['sqft'], df['price'], alpha=0.5)
plt.title('Price vs Sqft')
plt.xlabel('Sqft')
plt.ylabel('Price')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
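# 'location' was one-hot encoded above, so read the raw column back from the CSV
raw_location = pd.read_csv('house_prices_sample.csv', usecols=['location'])['location']
sns.boxplot(x=raw_location, y=df['price'])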
plt.title('Price by Location')
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

## Modeling (Baseline)
- Split data
- Linear Regression
- Random Forest Regressor
- Evaluate with RMSE and MAE
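
As a quick reminder of what the two metrics measure, the sketch below computes them by hand with NumPy on four made-up predictions (the numbers are purely illustrative). MAE averages the absolute errors, while RMSE squares the errors before averaging and so penalizes the single large miss much more heavily.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_hat = np.array([210.0, 290.0, 400.0, 420.0])

errors = y_true - y_hat  # [-10, 10, 0, 80]

# MAE: mean of absolute errors -> (10 + 10 + 0 + 80) / 4
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error -> sqrt((100 + 100 + 0 + 6400) / 4)
rmse = np.sqrt(np.mean(errors ** 2))

print("MAE:", round(mae, 2))    # MAE: 25.0
print("RMSE:", round(rmse, 2))  # RMSE: 40.62
```

A large gap between RMSE and MAE, as here, usually signals a few large errors rather than uniformly mediocre predictions.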

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

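# Drop the target, plus 'year_built' because 'age' is derived directly from it
X = df.drop(columns=['price', 'year_built'])
y = df['price']
# Hold out 20% of the rows for testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)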

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Metrics
for name, ypred in [('Linear Regression', y_pred_lr), ('Random Forest', y_pred_rf)]:
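    # The squared=False argument was removed in recent scikit-learn releases,
    # so take the square root of the MSE explicitly
    rmse = np.sqrt(mean_squared_error(y_test, ypred))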
    mae = mean_absolute_error(y_test, ypred)
    print(name, 'RMSE:', round(rmse,2), 'MAE:', round(mae,2))

In [None]:
# Feature importance from RF
importances = rf.feature_importances_
feat_names = X.columns
imp_df = pd.Series(importances, index=feat_names).sort_values(ascending=False)
imp_df.head(10)

## Conclusion & Next Steps
- This notebook establishes baseline performance for Linear Regression and Random Forest on the sample data.
- Future work: hyperparameter tuning, cross-validation, stacking, training on a real dataset, and adding more features (proximity to amenities, crime rate, schools).
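
As a starting point for the cross-validation item, here is a minimal sketch using scikit-learn's `cross_val_score`. It runs on synthetic data from `make_regression` so that it is self-contained; `X_demo` and `y_demo` are stand-ins, not the house price data, and the fold count and model settings are assumptions rather than tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; swap in X and y from the notebook for real use.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# 5-fold CV with shuffling so each fold is a random slice of the data.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is returned negated.
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores
    print(name, "CV RMSE mean:", round(cv_rmse[name].mean(), 2),
          "std:", round(cv_rmse[name].std(), 2))
```

Reporting the mean and spread across folds gives a more stable estimate than the single 80/20 split used in the baseline above.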