<a href="https://colab.research.google.com/github/Theeyecode/Housing-Stress-Canada/blob/eda/descriptive_stat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis for Canada Housing Survey Data 2022

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


url = "https://drive.google.com/uc?id=11Y8p_9-CYw0tpGPFu-jlgzxzWOOVS43F"

In [2]:


df = pd.read_csv(url)
df.head()

Unnamed: 0,PUMFID,EHA_10,EHA_25,FP_05,DWI_05A,DWI_05B,DWI_05C,DWI_05D,NEI_05A,NEI_05B,...,PSTIR_GR,PVISMIN,PWSA_D15,P2DCT_20,P2DCT_25,PATT_05,PATT_10,PATT_15A,PATT_15B,VERDATE
0,63501,4,2,1,2,2,2,2,4,4,...,1,9,999.6,996,6,1,1,6,6,11/08/2025
1,63502,3,2,2,2,2,2,2,4,4,...,1,2,999.6,996,6,2,1,6,6,11/08/2025
2,63503,3,2,1,2,2,2,2,4,4,...,1,2,999.6,996,6,2,1,6,6,11/08/2025
3,63504,3,2,1,2,1,2,2,3,4,...,1,1,999.6,996,6,2,1,6,6,11/08/2025
4,63505,4,2,1,2,2,2,2,4,4,...,1,2,999.6,2,1,1,2,6,3,11/08/2025


Shape of the raw data


In [3]:
df.shape

(38657, 103)

In [4]:
# df.dtypes
df.dtypes.value_counts()


Unnamed: 0,count
int64,98
float64,4
object,1


In [5]:
# Identify non-numeric columns (typically dates or text fields)
df.select_dtypes(include="object").columns.tolist()

['VERDATE']

In [6]:
# Convert verification date to datetime for proper handling
df["VERDATE"] = pd.to_datetime(df["VERDATE"], errors="coerce")
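
The `VERDATE` strings shown in `df.head()` (for example `11/08/2025`) are ambiguous between day-first and month-first order, and `errors="coerce"` silently turns unparseable values into `NaT`. A minimal sketch on toy values, assuming a day/month/year layout (an assumption to confirm against the survey documentation), that passes an explicit format and counts parse failures:

```python
import pandas as pd

# Toy values mimicking the VERDATE strings shown in df.head().
# Assumption: day/month/year order; confirm against the survey documentation.
raw_dates = pd.Series(["11/08/2025", "11/08/2025", "not a date"])

# An explicit format avoids pandas guessing day-first vs month-first.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Count values that failed to parse so coercion does not hide bad input.
n_failed = int(parsed.isna().sum())
print(parsed)
print(f"Unparseable dates: {n_failed}")
```

If the file turns out to be month-first, only the `format` string changes.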

In [7]:
# Check missing values per column (after initial load)
df.isna().sum().sort_values(ascending=False)

Unnamed: 0,0
PUMFID,0
EHA_10,0
EHA_25,0
FP_05,0
DWI_05A,0
...,...
PATT_05,0
PATT_10,0
PATT_15A,0
PATT_15B,0


In [8]:
# Get basic descriptive stats for numeric columns (unweighted, structure check)
df.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
PUMFID,38657.0,82829.0,63501.0,73165.0,82829.0,92493.0,102157.0,11159.459015
EHA_10,38657.0,2.888119,1.0,2.0,3.0,4.0,9.0,1.076674
EHA_25,38657.0,1.953126,1.0,2.0,2.0,2.0,9.0,0.327064
FP_05,38657.0,1.028119,1.0,1.0,1.0,1.0,9.0,0.294491
DWI_05A,38657.0,1.96337,1.0,2.0,2.0,2.0,9.0,0.446082
...,...,...,...,...,...,...,...,...
PATT_05,38657.0,6.103733,0.0,1.0,1.0,2.0,99.0,21.497314
PATT_10,38657.0,1.87803,1.0,1.0,1.0,2.0,9.0,1.892823
PATT_15A,38657.0,5.800321,1.0,6.0,6.0,6.0,9.0,1.272822
PATT_15B,38657.0,5.399669,1.0,6.0,6.0,6.0,9.0,1.733045


In [None]:
# Identify columns that contain obvious reserved codes (e.g., 9, 96, 99, 999, etc.)
reserved_codes = [9, 96, 99, 996, 999, 999.6, 999.9, 99999996, 99999999, 99999999999]

reserved_check = {
    col: df[col].isin(reserved_codes).any()
    for col in df.columns
    if df[col].dtype != "object"
}

[k for k, v in reserved_check.items() if v]
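
The `.any()` check above reports only whether a column contains a candidate code, not which code or how often. Below is a minimal sketch on a toy frame (values are illustrative, not survey data) that records per-column counts instead. A hit is only a candidate: in some variables a value such as 9 may be a legitimate category, so each one still needs to be confirmed against the codebook.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative only).
toy = pd.DataFrame({
    "PCHN": [1, 2, 9, 2],
    "PHTYPE": [5, 99, 2, 99],
    "REGION": [1, 2, 3, 4],
})
reserved_codes = [9, 96, 99, 996, 999]

# For each column, count how often each candidate code occurs.
hits = {
    col: toy.loc[toy[col].isin(reserved_codes), col].value_counts().to_dict()
    for col in toy.columns
}

# Keep only the columns with at least one candidate code.
hits = {col: counts for col, counts in hits.items() if counts}
print(hits)
```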


---
## [Task 1: Handle Reserved Codes as NA](https://emmanuelolajubu90.atlassian.net/browse/SCRUM-12)

* Identify outcome vars (PCHN, PSTIR_GR) → confirm universe + eligibility rules

* Apply logic to convert reserved codes to NA for all relevant variables, without performing any recoding at this stage.

---

In [16]:
# 1. PCHN (Core Housing Need)

print("PCHN (Original Counts):")
df['PCHN'].value_counts(dropna=False).sort_index()

PCHN (Original Counts):


Unnamed: 0_level_0,count
PCHN,Unnamed: 1_level_1
1,6164
2,30938
9,1555


In [13]:
# Create a clean version: Map 9 to NaN
df['PCHN_Clean'] = df['PCHN'].replace({9: np.nan})

In [15]:
# Print the cleaned data count

print(df['PCHN_Clean'].value_counts(dropna=False).sort_index())
print('_'*50)
print(f"Records Excluded (Not Stated): {df['PCHN_Clean'].isna().sum()}")

PCHN_Clean
1.0     6164
2.0    30938
NaN     1555
Name: count, dtype: int64
__________________________________________________
Records Excluded (Not Stated): 1555


In [22]:
# 2. PSTIR_GR (Shelter-cost-to-income ratio group)

print("Definition: 1 (<30%), 2 (30-50%), 3 (50-100%), 4 (>=100%) \n")
print("Reserved Codes: 5 = Not Applicable, 9 = Not Stated \n")

print("PSTIR_GR (Original Counts):")
df['PSTIR_GR'].value_counts(dropna=False).sort_index()

Definition: 1 (<30%), 2 (30-50%), 3 (50-100%), 4 (>=100%) 

Reserved Codes: 5 = Not Applicable, 9 = Not Stated 

PSTIR_GR (Original Counts):


Unnamed: 0_level_0,count
PSTIR_GR,Unnamed: 1_level_1
1,28650
2,6440
3,2012
4,549
5,429
9,577


In [23]:
# Map 5 and 9 to NaN

df['PSTIR_GR_Clean'] = df['PSTIR_GR'].replace({5: np.nan, 9: np.nan})

In [30]:
print("\nPSTIR_GR cleaned data counts:")
display(df['PSTIR_GR_Clean'].value_counts(dropna=False).sort_index())
print(f"\nRecords Excluded (N/A or Not Stated): {df['PSTIR_GR_Clean'].isna().sum()}")


PSTIR_GR cleaned data counts:


Unnamed: 0_level_0,count
PSTIR_GR_Clean,Unnamed: 1_level_1
1.0,28650
2.0,6440
3.0,2012
4.0,549
,1006



Records Excluded (N/A or Not Stated): 1006


In [33]:
# 3. Valid Universe Check : How many households are valid for BOTH measures?

valid_both = df.dropna(subset=['PCHN_Clean', 'PSTIR_GR_Clean'])
print(f"Total rows in dataset: {len(df)}")
print(f"Rows valid for BOTH PCHN and PSTIR_GR: {len(valid_both)}")

Total rows in dataset: 38657
Rows valid for BOTH PCHN and PSTIR_GR: 37102


### Output for Task 1

* **PCHN**: 37,102 households have a valid value (6,164 In Need / 30,938 Not In Need). We excluded 1,555 "Not Stated" records (code 9) by setting them to NA.

* **PSTIR_GR**: 37,651 households have a valid value. We excluded 1,006 records (429 "Not Applicable" + 577 "Not Stated") by setting them to NA.

* **Intersection**: 37,102 households have valid data for both variables, which makes this the valid sample size for our analysis.
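
The two per-variable replacements in Task 1 can be driven by a single lookup of reserved codes. The sketch below is one possible way to do that, not the notebook's own method: the helper name `add_clean_columns` and the toy data are hypothetical, while the reserved codes (9 for `PCHN`; 5 and 9 for `PSTIR_GR`) are the ones used above.

```python
import pandas as pd

# Reserved codes per outcome variable, as used in Task 1.
RESERVED = {"PCHN": [9], "PSTIR_GR": [5, 9]}

def add_clean_columns(frame, reserved):
    """Return a copy with a <var>_Clean column per variable, reserved codes set to NaN."""
    out = frame.copy()
    for var, codes in reserved.items():
        # where() keeps values that are not reserved and puts NaN elsewhere.
        out[f"{var}_Clean"] = out[var].where(~out[var].isin(codes))
    return out

# Toy data (illustrative only).
toy = pd.DataFrame({"PCHN": [1, 2, 9, 2], "PSTIR_GR": [1, 5, 2, 9]})
cleaned = add_clean_columns(toy, RESERVED)

# Rows valid for both outcomes, mirroring the valid-universe check.
valid_both = cleaned.dropna(subset=["PCHN_Clean", "PSTIR_GR_Clean"])
print(cleaned)
print(f"Rows valid for both: {len(valid_both)}")
```

Keeping the original columns untouched and writing to `_Clean` columns, as the notebook does, makes it easy to audit what was converted.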

---
## [Task 2: Predictor Variable Audit](https://emmanuelolajubu90.atlassian.net/browse/SCRUM-13)

* **Goal**: "Sanitize" the independent variables (Demographics, Geography, Socio-economic).

* **Action**: Systematically identify reserved codes (e.g., 99, 99999996) for key columns like Income, Age, and Tenure to prevent them from skewing analysis.

---

In [35]:
# List of all predictor variables we are checking
categorical_vars = [
    'PDCT_05',   # Tenure (Owner/Renter)
    'PMINOR',    # Visible Minority Status
    'PHTYPE',    # Household Type
    'PEMPL',     # Employment Status
    'PHGEDUC',   # Education Level
    'REGION',    # Region (Atlantic, QC, ON, etc.)
    'PDWLTYPE',  # Dwelling Type (Single detached, High-rise, etc.)
    'PAGEP1'     # Age of Reference Person
]

In [37]:
# 1. Audit Categorical Variables

print("1. Categorical Variable Audit (Looking for 9, 99, etc.)")
print("_"*50)
for var in categorical_vars:
    # Get value counts including NaNs
    counts = df[var].value_counts(dropna=False).sort_index()

    # Check for common reserved codes
    has_9 = 9 in counts.index
    has_99 = 99 in counts.index

    flag = ""
    if has_9: flag += "[FLAG: Contains Code 9] "
    if has_99: flag += "[FLAG: Contains Code 99] "

    print(f"\nVariable: {var} {flag}")
    display(counts)

1. Categorical Variable Audit (Looking for 9, 99, etc.)
__________________________________________________

Variable: PDCT_05 [FLAG: Contains Code 9] 


Unnamed: 0_level_0,count
PDCT_05,Unnamed: 1_level_1
1,16399
2,21719
9,539



Variable: PMINOR [FLAG: Contains Code 9] 


Unnamed: 0_level_0,count
PMINOR,Unnamed: 1_level_1
1,6278
2,30227
9,2152



Variable: PHTYPE [FLAG: Contains Code 99] 


Unnamed: 0_level_0,count
PHTYPE,Unnamed: 1_level_1
1,5561
2,8710
3,3056
4,1177
5,16853
6,1165
99,2135



Variable: PEMPL [FLAG: Contains Code 9] 


Unnamed: 0_level_0,count
PEMPL,Unnamed: 1_level_1
1,20471
2,16521
9,1665



Variable: PHGEDUC [FLAG: Contains Code 99] 


Unnamed: 0_level_0,count
PHGEDUC,Unnamed: 1_level_1
1,4502
2,8523
3,3829
4,7886
5,2079
6,6400
7,4159
99,1279



Variable: REGION 


Unnamed: 0_level_0,count
REGION,Unnamed: 1_level_1
1,10797
2,5329
3,7369
4,11294
5,3868



Variable: PDWLTYPE [FLAG: Contains Code 99] 


Unnamed: 0_level_0,count
PDWLTYPE,Unnamed: 1_level_1
1,13564
2,1993
3,3340
4,1648
5,4204
6,12303
99,1605



Variable: PAGEP1 


Unnamed: 0_level_0,count
PAGEP1,Unnamed: 1_level_1
1,2753
2,7982
3,12878
4,15044


In [42]:
# 2. Audit Continuous Variables (Income)

# Specific check for PHHTTINC (Total Household Income)
income_col = 'PHHTTINC'
max_val = df[income_col].max()
reserved_income_code = 99999999999

print(f"Variable: {income_col}")
print(f"Max Value found: {max_val}")

if max_val == reserved_income_code:
    count_reserved = (df[income_col] == reserved_income_code).sum()
    print(f"[FLAG] Found {count_reserved} records with Reserved Code {reserved_income_code}")
else:
    print("No standard reserved code (999...9) found as max value.")

Variable: PHHTTINC
Max Value found: 99999999999
[FLAG] Found 2026 records with Reserved Code 99999999999


This suggests that 2,026 respondents (about 5% of the sample) either refused to disclose their income or did not know it. Excluding these records keeps the reserved value from distorting the descriptive statistics, the mean and standard deviation in particular.

In [45]:
# Show distribution without the garbage code to see real stats
clean_income = df[df[income_col] != reserved_income_code][income_col]

print("\nReal Income Statistics (excluding reserved code)")

display(clean_income.describe().apply(lambda x: format(x, 'f')))


Real Income Statistics (excluding reserved code)


Unnamed: 0,PHHTTINC
count,36631.0
mean,84160.437198
std,82272.363137
min,-72500.0
25%,30000.0
50%,60000.0
75%,110000.0
max,975000.0
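
As an alternative to filtering rows, the reserved income code can be masked to `NaN` so that the column stays aligned with the rest of the frame. A minimal sketch on toy incomes (illustrative values, not survey data) that also shows how strongly a single sentinel value pulls the mean:

```python
import pandas as pd

RESERVED_INCOME = 99999999999  # reserved code found in PHHTTINC

# Toy incomes (illustrative only), with one reserved-code record.
income = pd.Series([30000, 60000, 110000, RESERVED_INCOME], name="PHHTTINC")

# Replace the sentinel with NaN instead of dropping the row.
income_clean = income.mask(income == RESERVED_INCOME)

raw_mean = income.mean()          # dominated by the sentinel
clean_mean = income_clean.mean()  # NaN is skipped by default

print(f"Mean with reserved code:    {raw_mean:,.0f}")
print(f"Mean without reserved code: {clean_mean:,.0f}")
```

Because `mean()`, `median()` and `describe()` skip `NaN` by default, later cells can use the masked column without repeating the filter.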
