
# 🏠 House Price Index (HPI) Analysis

This notebook analyzes **U.S. House Price Index (HPI)** data using the **FHFA dataset**:  
🔗 [FHFA HPI Datasets](https://www.fhfa.gov/data/hpi/datasets)

### 📊 Project Features
- Cleans and processes HPI data
- Visualizes **national and regional trends**
- Compares **top 5 regions**
- Predicts **future HPI trends** using Linear Regression
- Saves a trained model for later use


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
import pickle

# Set plot style
sns.set_theme(style="whitegrid")  # set_theme is the preferred name; sns.set is an alias
plt.rcParams['figure.figsize'] = [14, 6]



## 1️⃣ Load Dataset & Preprocessing
- Load the FHFA HPI dataset (CSV)
- Fill missing `index_sa` values with the corresponding `index_nsa` values
- Create a `date` column from `yr` and `period` (treating `period` as a month number)


In [None]:

# Load dataset
file_path = "your_file.csv"  # Replace with your CSV path
df = pd.read_csv(file_path)

print("✅ Dataset Loaded")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

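# Fill missing values in index_sa with index_nsa
# (assign the result back; in-place fillna on a column selection is unreliable in recent pandas)
df['index_sa'] = df['index_sa'].fillna(df['index_nsa'])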

# Create datetime column (assumes `period` holds a month number, 1-12)
df['date'] = pd.to_datetime(dict(year=df['yr'], month=df['period'], day=1))

# Preview data
df.head()
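
The date construction above treats `period` as a month number. FHFA files can contain both monthly and quarterly series, in which case `period` would be a quarter (1 to 4) on some rows. The sketch below is one possible way to handle both cases. It assumes the CSV has a `frequency` column with the values `monthly` and `quarterly`; that column name and its values are assumptions here, so check them against `df.columns` first. `build_date` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def build_date(df: pd.DataFrame) -> pd.Series:
    """Return the first day of each row's month or quarter (hypothetical helper)."""
    month = df["period"].astype(int)
    if "frequency" in df.columns:  # assumed column name
        is_quarterly = df["frequency"].str.lower().eq("quarterly")
        # Quarter q starts in month 3*q - 2 (Q1 -> Jan, Q2 -> Apr, Q3 -> Jul, Q4 -> Oct)
        month = month.where(~is_quarterly, month * 3 - 2)
    parts = pd.DataFrame({"year": df["yr"], "month": month, "day": 1})
    return pd.to_datetime(parts)


# Toy rows standing in for the real dataset
sample = pd.DataFrame({
    "frequency": ["monthly", "quarterly", "quarterly"],
    "yr": [2020, 2020, 2021],
    "period": [5, 2, 4],
})
sample["date"] = build_date(sample)
print(sample)
```

If the `frequency` column is absent, the helper falls back to the notebook's behavior and reads `period` as a month.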



## 2️⃣ Exploratory Data Analysis (EDA)
Let's explore the dataset to understand its structure and missing values.


In [None]:

print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None

print("\nMissing Values:")
print(df.isnull().sum())

# List first 20 regions
print("\n🔹 First 20 Available Regions:")
print(df['place_name'].unique()[:20])



## 3️⃣ National Average HPI Trend
Visualize the **average HPI** across all regions over time.


In [None]:

# Compute national average
avg_hpi = df.groupby('date')[['index_nsa','index_sa']].mean().reset_index()

# Plot National Average HPI
plt.plot(avg_hpi['date'], avg_hpi['index_nsa'], label='Average HPI NSA', alpha=0.8)
plt.plot(avg_hpi['date'], avg_hpi['index_sa'], label='Average HPI SA', alpha=0.9)
plt.title("National Average House Price Index Over Time")
plt.xlabel("Date")
plt.ylabel("House Price Index")
plt.legend()
plt.show()
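
The average above is an unweighted mean over whatever rows exist for each date, so its composition can change over time if regions enter or leave the data. As a rough check, the sketch below computes the mean together with the number of observations behind it. The toy frame and the output column names `mean_hpi` and `n_obs` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data: two regions in January, only one in February
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01"]),
    "place_name": ["A", "B", "A"],
    "index_sa": [100.0, 200.0, 110.0],
})

# Per-date mean plus the count of non-missing observations it is based on
summary = (
    toy.groupby("date")["index_sa"]
    .agg(mean_hpi="mean", n_obs="count")
    .reset_index()
)
print(summary)
```

A sharp change in `n_obs` between dates suggests that a move in the average may reflect a change in coverage rather than a change in prices.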



## 4️⃣ Regional Analysis
Select a **specific region** to analyze its HPI trends.


In [None]:

# Select region
region_name = "East North Central"

# Auto-select first available region if not found
if region_name not in df['place_name'].unique():
    print(f"⚠️ Region '{region_name}' not found! Using first available region instead.")
    region_name = df['place_name'].unique()[0]

region_df = df[df['place_name'] == region_name].copy()
print(f"✅ Selected Region: {region_name}, Rows: {len(region_df)}")

# Plot regional trend
plt.plot(region_df['date'], region_df['index_nsa'], label='HPI NSA')
plt.plot(region_df['date'], region_df['index_sa'], label='HPI SA')
plt.title(f"House Price Index - {region_name}")
plt.xlabel("Date")
plt.ylabel("House Price Index")
plt.legend()
plt.show()



## 5️⃣ Compare Top 5 Regions
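Plot seasonally adjusted HPI (`index_sa`) for the **first 5 regions** in the dataset to compare trends. "Top 5" here means the first five by order of appearance, not a ranking by HPI value.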


In [None]:

top_regions = df['place_name'].unique()[:5]

for region in top_regions:
    region_data = df[df['place_name'] == region]
    plt.plot(region_data['date'], region_data['index_sa'], label=region)

plt.title("House Price Index - Top 5 Regions")
plt.xlabel("Date")
plt.ylabel("House Price Index")
plt.legend()
plt.show()
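
The cell above takes the first five regions in file order. If a ranked comparison is wanted instead, one option is to rank regions by their most recent `index_sa` value. The sketch below shows the idea on toy data; `top_regions_by_latest` is a hypothetical helper, and ranking by the latest value is an assumption about what "top" should mean.

```python
import pandas as pd


def top_regions_by_latest(df, n=5, value_col="index_sa"):
    """Return the n regions with the highest most-recent value (hypothetical helper)."""
    latest = (
        df.dropna(subset=[value_col])
        .sort_values("date")
        .groupby("place_name")[value_col]
        .last()  # last row per region after sorting by date
    )
    return latest.nlargest(n).index.tolist()


# Toy data: three regions, two months each
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01"] * 3),
    "place_name": ["A", "A", "B", "B", "C", "C"],
    "index_sa": [100.0, 105.0, 200.0, 190.0, 150.0, 195.0],
})
print(top_regions_by_latest(toy, n=2))
```

Because HPI is an index rather than a price level, a higher value indicates more cumulative appreciation since the base period, not necessarily more expensive homes.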



## 6️⃣ HPI Prediction (Linear Regression)
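Predict the next **12 months** of HPI for the selected region by fitting a Linear Regression of seasonally adjusted HPI (`index_sa`) on a time-step index. This assumes the region's series is monthly.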


In [None]:

region_df = region_df.sort_values('date')
region_df['time_step'] = range(len(region_df))

X = region_df[['time_step']]
y = region_df['index_sa']

if len(region_df) >= 2:  # a trend line needs at least two observations
    model = LinearRegression()
    model.fit(X, y)

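    # Predict next 12 months (keep the training column name to avoid a feature-name warning)
    future_steps = pd.DataFrame({'time_step': np.arange(len(region_df), len(region_df) + 12)})
    future_pred = model.predict(future_steps)
    # Month-start frequency ('MS') matches the day=1 dates built during preprocessing
    future_dates = pd.date_range(start=region_df['date'].iloc[-1], periods=13, freq='MS')[1:]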

    # Plot prediction
    plt.plot(region_df['date'], y, label='Actual HPI')
    plt.plot(future_dates, future_pred, '--', color='red', label='Predicted HPI')
    plt.title(f"HPI Prediction - {region_name}")
    plt.xlabel("Date")
    plt.ylabel("House Price Index")
    plt.legend()
    plt.show()

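    # Save model; create the target directory first so open() does not fail
    import os
    os.makedirs("../models", exist_ok=True)
    with open("../models/hpi_model.pkl", "wb") as file:
        pickle.dump(model, file)

    print("✅ Prediction complete. Model saved to ../models/hpi_model.pkl")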
else:
    print("⚠️ Not enough data for regression in this region!")
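
To reuse the saved model later, it can be loaded back with `pickle` and given new `time_step` values. The sketch below is a self-contained round trip on toy data; it writes to a temporary directory instead of `../models/`, and the toy series is an assumption made only so the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for region_df: index_sa = 100 + 2 * time_step
train = pd.DataFrame({"time_step": range(10)})
train["index_sa"] = 100.0 + 2.0 * train["time_step"]

model = LinearRegression().fit(train[["time_step"]], train["index_sa"])

# Save and reload the model, mirroring the notebook's pickle usage
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "hpi_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

# time_step continues from where the training data ended
next_steps = pd.DataFrame({"time_step": [10, 11, 12]})
forecast = loaded.predict(next_steps)
print(np.round(forecast, 2))
```

Only unpickle files from a trusted source, and load the model with the same scikit-learn version that saved it where possible. The loaded model knows nothing about dates, so the caller has to remember which date `time_step = 0` corresponds to.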



## ✅ Summary
- Cleaned and visualized the **HPI dataset**
- Plotted **national and regional trends**
- Compared **top 5 regions**
- Predicted the **next 12 months of HPI** using Linear Regression

Next steps:
- Try **different ML models** for better predictions  
- Analyze **state-level trends** for deeper insights
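
Comparing models requires a way to score them. One simple approach is to hold out the last 12 months, fit on the rest, and measure the error on the held-out part. The sketch below does this for Linear Regression against a naive "last value carried forward" baseline. The synthetic series, the 12-month horizon, and the choice of mean absolute error are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic monthly series (hypothetical data): linear trend plus a slow oscillation
n_months = 60
series = pd.DataFrame({"time_step": np.arange(n_months)})
series["index_sa"] = (
    100.0 + 0.5 * series["time_step"] + 2.0 * np.sin(series["time_step"] / 6.0)
)

# Chronological split: never shuffle time series before splitting
horizon = 12
train, test = series.iloc[:-horizon], series.iloc[-horizon:]

model = LinearRegression().fit(train[["time_step"]], train["index_sa"])
pred = model.predict(test[["time_step"]])
mae_model = mean_absolute_error(test["index_sa"], pred)

# Naive baseline: repeat the last training value for every held-out month
naive = np.full(horizon, train["index_sa"].iloc[-1])
mae_naive = mean_absolute_error(test["index_sa"], naive)

print(f"Linear Regression MAE: {mae_model:.2f}")
print(f"Naive baseline MAE:    {mae_naive:.2f}")
```

Any alternative model can be dropped in place of `LinearRegression` and scored on the same held-out months, which keeps the comparison consistent.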
