# 🧹 Data Cleaning Notebook

This notebook documents the **data collection and cleaning process** for the HDB Resale Flat Prices dataset.

---

## 📌 Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np

# Display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## 📌 Step 2: Load Raw Data
The raw dataset is stored in `data/raw/`. We start by loading the 2017 resale flat data.

In [None]:
df = pd.read_csv('../data/raw/resale-flat-prices-2017.csv')

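# Preview dataset: info() prints its summary and returns None,
# so call it on its own instead of packing it into a tuple
df.info()
print(df.shape)
df.head()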

## 📌 Step 3: Standardize Column Names
Ensure all column names are lowercase and snake_case.

In [None]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.head(2)
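
The one-liner above only replaces single spaces, so labels containing hyphens, brackets, or repeated spaces would not end up as clean snake_case. A minimal sketch of a stricter normalizer follows; the helper `to_snake_case` and the demo column labels are hypothetical and used only for illustration, not taken from the dataset.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Normalize a column label to lowercase snake_case (hypothetical helper)."""
    name = name.strip().lower()
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-z]+", "_", name)
    # Drop underscores left over at either end
    return name.strip("_")


# Illustrative labels only; the real column names may differ
demo = pd.DataFrame(columns=[" Flat Type", "Storey-Range", "Floor Area (sqm)"])
demo.columns = [to_snake_case(c) for c in demo.columns]
print(list(demo.columns))
```

The same list comprehension could be applied to `df.columns` if the raw file turns out to contain such labels.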

## 📌 Step 4: Convert Data Types
Convert `month` to datetime and derive a `year` column; numerical columns should also be checked for correct dtypes.

In [None]:
df['month'] = pd.to_datetime(df['month'], format='%Y-%m')

# Create year column
df['year'] = df['month'].dt.year
df[['month','year']].head()
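
The step description mentions checking numerical columns, which the cell above does not do. One possible approach, sketched here on a toy frame, is to coerce with `pd.to_numeric(..., errors="coerce")` and count the entries that fail to parse. The column names `floor_area_sqm` and `resale_price` are assumptions for illustration; substitute whichever columns are expected to be numeric.

```python
import pandas as pd

# Toy frame standing in for df; column names and values are assumed
sample = pd.DataFrame({
    "floor_area_sqm": ["44", "67.0", "n/a"],
    "resale_price": ["232000", "250000", "262000"],
})

numeric_cols = ["floor_area_sqm", "resale_price"]
bad_counts = {}
for col in numeric_cols:
    converted = pd.to_numeric(sample[col], errors="coerce")
    # Count values that were present but could not be parsed as numbers
    bad_counts[col] = int((converted.isna() & sample[col].notna()).sum())
    sample[col] = converted

print(bad_counts)
print(sample.dtypes)
```

A non-zero count flags a column worth inspecting before any missing-value handling, since coerced entries become `NaN`.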

## 📌 Step 5: Handle Missing Values
Check for nulls and decide on a cleaning strategy.

In [None]:
print(df.isnull().sum())

# Example strategy: drop every row that contains at least one missing value
df = df.dropna()
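
Dropping rows is irreversible within the session, so it can help to record how much data is affected before doing it. The sketch below, on a toy frame with made-up values, reports the share of nulls per column and the number of rows removed by `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; values are illustrative only
sample = pd.DataFrame({
    "town": ["ANG MO KIO", None, "BEDOK", "BEDOK"],
    "resale_price": [232000.0, 250000.0, np.nan, 262000.0],
})

# Fraction of missing values in each column
missing_share = sample.isnull().mean()

rows_before = len(sample)
cleaned = sample.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_share)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

If a large fraction of rows would be dropped, a column-specific strategy (for example `dropna(subset=[...])` or imputation) may be a better fit than removing every incomplete row.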

## 📌 Step 6: Save Cleaned Data
Export the cleaned dataset to the `data/cleaned/` folder.

In [None]:
df.to_csv('../data/cleaned/hdb_resale_cleaned.csv', index=False)
print('✅ Cleaned dataset saved!')
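
`to_csv` raises an error if the target folder does not exist. A minimal sketch of a safer export is shown below: it creates the folder first and reloads the file as a quick round-trip check. The helper `save_cleaned` is hypothetical, and a temporary directory is used here so the example does not touch the project's `data/` tree.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(frame: pd.DataFrame, path: Path) -> Path:
    """Write frame to path as CSV, creating parent folders if needed (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


# Toy frame standing in for the cleaned df
sample = pd.DataFrame({"town": ["BEDOK"], "resale_price": [262000]})

with tempfile.TemporaryDirectory() as tmp:
    out = save_cleaned(sample, Path(tmp) / "cleaned" / "hdb_resale_cleaned.csv")
    reloaded = pd.read_csv(out)
    # Compare the reloaded rows with the original ones
    round_trip_ok = reloaded.to_dict("records") == sample.to_dict("records")

print(round_trip_ok)
```

In the notebook itself, the same helper could be called with `Path('../data/cleaned/hdb_resale_cleaned.csv')`.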

---

## 📝 Notes & Observations
- List any data quality issues found
- Describe your cleaning decisions
- Document transformations (e.g., new columns created)