# **Project Overview: Income Prediction and Clustering of Canadian Census Tracts**

This project demonstrates how to use machine learning and clustering techniques to analyze Canadian Census data and predict median household income. The workflow includes:

- Data preprocessing and feature engineering  
- Correlation analysis and feature selection  
- Clustering using KMeans and BIRCH  
- Cluster-specific regression modeling  
- Evaluation using RMSE and MAE

### **1. Data Loading and Initial Cleanup**

Load the 2021 census training dataset and correct one truncated column name (the 1981 to 1990 construction-period column ends in `190` instead of `1990`).

In [None]:
import pandas as pd
import numpy as np

# Load census training data
data = pd.read_csv("CensusCanada2021Training.csv")

# Rename misnamed column
data = data.rename(columns={
    'Total Households For Period Of Construction Built Between 1981 And 190':
    'Total Households For Period Of Construction Built Between 1981 And 1990'
})
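
Because the later cells index many long column names, a quick existence check right after loading can catch naming problems like the one fixed above. The following is a minimal sketch on a made-up toy frame; the helper `missing_columns` and the `toy` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df, in order."""
    present = set(df.columns)
    return [name for name in required if name not in present]

# Toy frame standing in for the census file (values are made up).
toy = pd.DataFrame({
    'Total Population': [100, 250],
    'Total Households': [40, 90],
})
required = ['Total Population', 'Total Households', 'Dwellings by Tenure Owner']

absent = missing_columns(toy, required)
print(absent)
```

Running the same check against `data` with the column names used in the next section would flag any mismatch before feature engineering starts.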


### **2. Feature Engineering**

To improve model performance, derive socioeconomic indicators likely to correlate with income:

In [None]:
# Tenure-related variables
data['Pct_Owner'] = (data['Dwellings by Tenure Owner'] / data['Total Households for Tenure']) * 100
data['Pct_Renter'] = (data['Dwellings by Tenure Renter'] / data['Total Households for Tenure']) * 100
data['Ratio_of_Renters_to_Owner'] = data['Pct_Renter'] / data['Pct_Owner']

# Housing age categories
data['Pct_Older_House'] = (
    data['Total Households For Period Of Construction Built Before 1961'] +
    data['Total Households For Period Of Construction Built Between 1961 And 1980']
) / data['Total Households For Period Of Construction'] * 100

data['Pct_New_House'] = (
    data['Total Households For Period Of Construction Built Between 2006 And 2010'] +
    data['Total Households For Period Of Construction Built Between 2011 And 2015'] +
    data['Total Households For Period Of Construction Built Between 2016 And 2021']
) / data['Total Households For Period Of Construction'] * 100

data['Ratio_of_Olderhouse_to_Newhouse'] = data['Pct_Older_House'] / data['Pct_New_House']

# Structure types
data['Total Household for Structure Type'] = (
    data['Total Households For Structure Type Houses'] +
    data['Total Households For Structure Type Apartment, Building Low And High Rise'] +
    data['Total Households For Structure Type Other Dwelling Types']
)

# Scale to percentages so these match the other Pct_ features
data['Pct_Structure_Houses'] = data['Total Households For Structure Type Houses'] / data['Total Household for Structure Type'] * 100
data['Pct_Structure_Apartment'] = data['Total Households For Structure Type Apartment, Building Low And High Rise'] / data['Total Household for Structure Type'] * 100

# Average household size
data['Household_Size'] = data['Total Population'] / data['Total Households']

These features provide meaningful proxies for economic conditions, housing quality, and living arrangements.
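
Several of these ratios divide by a count that can be zero for some tracts, which yields infinite values. One way to guard against that at the source is sketched below; `safe_pct` and the toy columns are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def safe_pct(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Percentage share that yields NaN (not inf) where the denominator is 0."""
    denom = denominator.astype(float).replace(0, np.nan)
    return numerator / denom * 100

# Toy data: the second row has no households at all.
toy = pd.DataFrame({
    'owners': [30, 0, 5],
    'households': [40, 0, 10],
})
toy['Pct_Owner'] = safe_pct(toy['owners'], toy['households'])
print(toy)
```

With this approach, rows with an empty denominator become missing values immediately and can be handled by whatever imputation step follows.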

### **3. Data Cleaning**

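Ratios with a zero denominator produce infinite values. Convert these to missing values, then impute every missing value with its column mean so that the models receive only finite inputs.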

In [None]:
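# Treat infinities (from division by zero) as missing values
data = data.replace([np.inf, -np.inf], np.nan)
# Impute with column means; numeric_only avoids errors on non-numeric columns
data = data.fillna(data.mean(numeric_only=True))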
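
The cell above computes means over the full dataset, so rows that later end up in the test split contribute to the imputed values. A possible variation, sketched here on toy data, computes the means on the training split only. It assumes the split has already been made; the frames `train` and `test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy splits with one infinite and one missing value.
train = pd.DataFrame({'Household_Size': [2.0, np.inf, 4.0]})
test = pd.DataFrame({'Household_Size': [np.nan, 3.5]})

# Treat infinities as missing in both splits.
train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

# Means come from the training split only, then fill both splits.
train_means = train.mean(numeric_only=True)
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```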

### **4. Feature Selection via Decision Tree**

Use a Decision Tree Regressor to rank feature importance in predicting the target variable:
`Median Household Income (Current Year $)`

In [None]:
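from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = [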
    'Pct_Owner', 'Pct_Renter', 'Ratio_of_Renters_to_Owner',
    'Pct_Older_House', 'Pct_New_House', 'Ratio_of_Olderhouse_to_Newhouse',
    'Pct_Structure_Houses', 'Pct_Structure_Apartment', 'Household_Size'
]
target = 'Median Household Income (Current Year $)'

X = data[features]
y = data[target]

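X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Work on copies so cluster labels can be added later without a SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()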

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': tree.feature_importances_
}).sort_values(by='Importance', ascending=False)
importances
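
Impurity-based importances from a single tree are computed on the training data and can be unstable, so a cross-check on held-out rows may be worthwhile. The sketch below uses scikit-learn's `permutation_importance` on synthetic data as one such check. It is a different technique from the one used above, and the `signal` and `noise` features are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on 'signal' only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'signal': rng.uniform(0, 1, 300),
    'noise': rng.uniform(0, 1, 300),
})
toy_y = 10 * toy_X['signal'] + rng.normal(0, 0.1, 300)

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)
toy_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xtr, ytr)

# Impurity-based importances (training data) next to permutation importances (held-out data).
impurity = pd.Series(toy_tree.feature_importances_, index=toy_X.columns)
perm = permutation_importance(toy_tree, Xte, yte, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=toy_X.columns)
print(pd.DataFrame({'impurity': impurity, 'permutation': permutation}))
```

If the two rankings disagree sharply on the census features, that would suggest treating the tree-based ranking with some caution.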

### **5. Clustering Analysis (KMeans and BIRCH)**

Apply KMeans and BIRCH clustering to segment the data into groups of similar census tracts.

In [None]:
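from sklearn.cluster import Birch, KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the features so that no single scale dominates the distance computations
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train[features])

# KMeans clustering (n_init is set explicitly for consistent behavior across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
X_train['Cluster'] = kmeans.fit_predict(X_scaled)

# BIRCH clustering
birch = Birch(n_clusters=2)
X_train['Birch_Cluster'] = birch.fit_predict(X_scaled)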

These clusters serve as subpopulations for tailored modeling in the next stage.
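
The number of clusters is fixed at 2 above. One common way to sanity-check that choice is to compare silhouette scores across several values of `k`, and to measure how closely the KMeans and BIRCH partitions agree. The sketch below does both on two synthetic, well-separated blobs; the toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic blobs standing in for scaled census features.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(100, 2))
toy_scaled = StandardScaler().fit_transform(np.vstack([blob_a, blob_b]))

# Silhouette score for each candidate k (higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_scaled)
    scores[k] = silhouette_score(toy_scaled, labels)
best_k = max(scores, key=scores.get)

# Agreement between the two algorithms (1.0 means identical partitions).
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(toy_scaled)
birch_labels = Birch(n_clusters=best_k).fit_predict(toy_scaled)
agreement = adjusted_rand_score(kmeans_labels, birch_labels)
print(best_k, round(agreement, 3))
```

On the real data, the same loop could be run on `X_scaled`; a low agreement score would indicate that the two algorithms segment the tracts differently.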

### **6. Cluster-Based Regression Modeling**

Fit a separate Random Forest Regressor within each KMeans cluster so that each model can adapt to the local structure of its segment.

In [None]:
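from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assign each test row to a KMeans cluster using the scaler and model fitted on the training data
test_clusters = kmeans.predict(scaler.transform(X_test[features]))

results = {}
for cluster in sorted(X_train['Cluster'].unique()):
    train_mask = X_train['Cluster'] == cluster
    test_mask = test_clusters == cluster
    if not test_mask.any():
        continue  # no test rows fell into this cluster

    # Fit a fresh model on this cluster's training rows only
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train.loc[train_mask, features], y_train[train_mask])

    # Evaluate only on the test rows assigned to the same cluster
    preds = rf.predict(X_test.loc[test_mask, features])
    rmse = np.sqrt(mean_squared_error(y_test[test_mask], preds))
    mae = mean_absolute_error(y_test[test_mask], preds)
    results[cluster] = {'RMSE': rmse, 'MAE': mae, 'n_test': int(test_mask.sum())}

pd.DataFrame(results)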

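Cluster-specific models can lower RMSE and MAE when the clusters capture genuinely different income relationships. To confirm the benefit, compare these metrics against a single Random Forest trained on all training rows.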
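
Whether per-cluster models actually help depends on the data, so a baseline comparison is useful. The sketch below, on synthetic data, trains one global Random Forest and one model per KMeans cluster, routes each test row to its own cluster's model, and reports an overall RMSE for both. All data and names here are assumptions for illustration, and the sketch makes no claim about which approach wins on the census data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data whose target relationship flips sign with the first feature.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(400, 3))
toy_y = np.where(toy_X[:, 0] > 0, 3.0, -3.0) * toy_X[:, 1] + rng.normal(0, 0.1, 400)

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.3, random_state=0)

# Cluster on the training rows, then assign test rows with the fitted model.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xtr)
tr_labels, te_labels = km.labels_, km.predict(Xte)

# Baseline: a single model trained on all training rows.
global_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)
global_rmse = np.sqrt(mean_squared_error(yte, global_rf.predict(Xte)))

# Per-cluster models: each test row is predicted by its own cluster's model.
cluster_preds = np.full(yte.shape, np.nan)
for label in np.unique(tr_labels):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(Xtr[tr_labels == label], ytr[tr_labels == label])
    mask = te_labels == label
    if mask.any():
        cluster_preds[mask] = model.predict(Xte[mask])
cluster_rmse = np.sqrt(mean_squared_error(yte, cluster_preds))

print(f"global RMSE: {global_rmse:.3f}, per-cluster RMSE: {cluster_rmse:.3f}")
```

Pooling the per-cluster predictions into one overall RMSE makes the comparison with the global model direct, since both numbers cover the same test rows.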

### **Conclusion**
This project shows how to combine regression and clustering techniques to model socioeconomic outcomes from census data. Key takeaways:

- Feature engineering from demographic and housing data is essential, since raw counts are not comparable across tracts of different sizes.  
- Decision trees offer interpretable feature importance scores.  
- Clustering segments heterogeneous data and can improve accuracy when the segments behave differently.  
- Cluster-based regression can surface segment-level patterns relevant to policy or urban planning.