
# DATA PREPARATION CLASS NOTEBOOK
### Dataset Survey, Data Dictionary, and Inconsistency Detection


## Learning Objectives
- Understand why dataset surveys are critical before analysis
- Learn how to create a data dictionary
- Identify different types of data inconsistencies
- Detect inconsistencies using Python
- Propose realistic cleaning strategies
- Streamline prep into a reusable Transformer

## 1. Why Dataset Survey Matters


Modelling without first surveying the data risks:
- Misleading averages or distributions
- Hidden bias
- Incorrect assumptions
- Model instability
- Loss of stakeholder trust

### <span style="color:#1f77b4">💬 CLASS DISCUSSION</span>
### <span style="color:#1f77b4">What could go wrong if we immediately start modelling this data?</span>


In [None]:
import pandas as pd
import numpy as np

## 2. Initial Dataset Survey


In [None]:
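from urllib.error import URLError  # HTTPError (e.g. a 404) is a subclass of URLError

DATA_URL = 'https://raw.githubusercontent.com/Dong2Yo/Dataset/refs/heads/main/edmonton_census_synthetic_raw.csv'

try:
    # Reading from a URL fails with URLError/HTTPError, not FileNotFoundError
    df = pd.read_csv(DATA_URL)
    print('Dataset loaded successfully')
    print('Number of rows:', len(df))
except URLError as err:
    print('Error: could not download edmonton_census_synthetic_raw.csv:', err)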

Dataset loaded successfully
Number of rows: 481200


In [None]:
# Shape and structure
print("Dataset shape:", df.shape)
print("\nData types:\n", df.dtypes)

Dataset shape: (481200, 6)

Data types:
 person_id           int64
neighbourhood      object
age                 int64
gender             object
education_level    object
has_children       object
dtype: object


In [None]:
# Missing values
print("\nMissing values:\n", df.isna().sum())


Missing values:
 person_id             0
neighbourhood      1002
age                   0
gender             1502
education_level    2005
has_children          0
dtype: int64
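
Raw counts are easier to judge as a share of all rows; in the output above every affected column is missing in fewer than 0.5% of the 481,200 rows. A minimal sketch of a percentage report follows. The helper `missing_report` and the toy frame `sample` are hypothetical; in the notebook you would pass `df` instead.

```python
import pandas as pd

def missing_report(frame):
    """Return missing counts and percentages for columns with at least one gap."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy rows standing in for df
sample = pd.DataFrame({"gender": ["Female", None, "Male", None], "age": [1, 2, 3, 4]})
report = missing_report(sample)
print(report)
```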


In [None]:
# Check duplicates
duplicate_rows = df[df.duplicated()]
print('Duplicate rows detected:', len(duplicate_rows))

Duplicate rows detected: 1200
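
`df.duplicated()` only flags rows that match on every column. Since `person_id` is meant to be unique, it may also be worth checking for repeated IDs whose other fields differ. The sketch below separates the two cases on a hypothetical toy frame (`sample`); whether the real dataset contains conflicting IDs is not known from the output above.

```python
import pandas as pd

# Hypothetical toy rows: ID 1 is an exact duplicate, ID 2 repeats with a different age
sample = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 3],
    "age": [34, 34, 50, 51, 20],
})

exact_dupes = int(sample.duplicated().sum())                  # identical on all columns
id_dupes = int(sample.duplicated(subset="person_id").sum())   # repeated person_id
conflicting = id_dupes - exact_dupes                          # same ID, different content

print("Exact duplicate rows:", exact_dupes)
print("Repeated person_id values:", id_dupes)
print("Repeated IDs with conflicting fields:", conflicting)
```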


In [None]:
# Quick statistical overview
print("\nAge summary:\n", df["age"].describe())


Age summary:
 count    481200.000000
mean         49.683421
std          32.084380
min         -20.000000
25%          24.000000
50%          49.000000
75%          75.000000
max         999.000000
Name: age, dtype: float64


In [None]:
df.describe(include='object')

        neighbourhood  gender  education_level  has_children
count          480198  479698           479195        481200
unique              7       4                7             2
top        Windermere  Female      High school            No
freq            68952  239715            80032        263964


### <span style="color:#1f77b4">💬 CLASS DISCUSSION</span>
### <span style="color:#1f77b4">What variables already look suspicious from this summary?</span>


## 3. Creating a Data Dictionary

### A data dictionary documents:
- Variable name
- Description
- Data type
- Allowed values / business rules


### This is CRITICAL for:
- Team collaboration
- Sponsor communication
- Long-term project sustainability

In [None]:
data_dictionary = pd.DataFrame({
"variable": [
"person_id",
"neighbourhood",
"age",
"gender",
"education_level",
"has_children"
],
"description": [
"Unique individual identifier",
"Edmonton neighbourhood of residence",
"Age in years",
"Self-reported gender",
"Highest level of education completed",
"Whether the person has children"
],
"data_type": [
"integer",
"categorical",
"integer",
"categorical",
"categorical",
"categorical"
],
"allowed_values / rules": [
"Positive, unique",
"Predefined neighbourhood list",
"0–110",
"Male, Female, Non-binary",
"Standardized education categories",
"Yes / No"
]
})


print(data_dictionary)

          variable                           description    data_type  \
0        person_id          Unique individual identifier      integer   
1    neighbourhood   Edmonton neighbourhood of residence  categorical   
2              age                          Age in years      integer   
3           gender                  Self-reported gender  categorical   
4  education_level  Highest level of education completed  categorical   
5     has_children       Whether the person has children  categorical   

              allowed_values / rules  
0                   Positive, unique  
1      Predefined neighbourhood list  
2                              0–110  
3           Male, Female, Non-binary  
4  Standardized education categories  
5                           Yes / No  


### <span style="color:#1f77b4">💬 CLASS DISCUSSION</span>
### <span style="color:#1f77b4">Can we automate the process? In other words, which rules are assumptions, and which are enforced by data collection?</span>


## 4. Detecting Data Inconsistencies

### 4.1 Age validity checks
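
The data dictionary allows ages from 0 to 110, yet the age summary reports a minimum of -20 and a maximum of 999. One possible check is sketched below. The helper `flag_invalid_age` and the toy frame `sample` are hypothetical; in the notebook you would pass `df`.

```python
import pandas as pd

# Bounds taken from the data dictionary rule for age (0 to 110).
AGE_MIN, AGE_MAX = 0, 110

def flag_invalid_age(frame, column="age"):
    """Return a boolean Series that is True where age falls outside the allowed range."""
    # between() is inclusive on both ends; missing ages would also be flagged.
    return ~frame[column].between(AGE_MIN, AGE_MAX)

# Hypothetical toy rows standing in for df
sample = pd.DataFrame({"person_id": [1, 2, 3, 4], "age": [34, -20, 999, 110]})
invalid_age = flag_invalid_age(sample)

print("Invalid ages:", int(invalid_age.sum()))
print(sample.loc[invalid_age])
```

Values such as 999 are often sentinel codes for "unknown", so it is worth counting each invalid value separately before deciding how to treat it.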

### 4.2 Gender coding inconsistencies
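
The summary shows 4 unique `gender` values, while the data dictionary lists only three allowed categories. A simple way to find the odd ones out is to compare the observed values against the allowed set. The spellings in `sample` below are invented for illustration and are not the actual values in the dataset; `unexpected_values` is a hypothetical helper.

```python
import pandas as pd

ALLOWED_GENDERS = {"Male", "Female", "Non-binary"}  # from the data dictionary

def unexpected_values(series, allowed):
    """Count non-missing values that are not in the allowed set."""
    present = series.dropna()
    return present[~present.isin(allowed)].value_counts()

# Hypothetical toy values; the real variants in df may differ
sample = pd.DataFrame({"gender": ["Female", "female ", "M", "Male", None, "Non-binary"]})

print(sample["gender"].value_counts(dropna=False))  # full picture, including missing
bad_gender = unexpected_values(sample["gender"], ALLOWED_GENDERS)
print(bad_gender)
```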

### 4.3 Education level inconsistencies
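
Free-text categories often differ only by case or stray whitespace. The sketch below groups raw spellings that collapse to the same normalized key, which helps decide on a mapping. It will not catch abbreviations or synonyms, which need a manual mapping. The values in `sample` are invented for illustration, and `spelling_variants` is a hypothetical helper.

```python
import pandas as pd

def spelling_variants(series):
    """Group raw spellings that collapse to the same normalized key."""
    raw = series.dropna().astype(str)
    # Normalize: trim, lowercase, collapse repeated whitespace.
    key = raw.str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
    variants = raw.groupby(key).unique()
    # Keep only keys that appear under more than one raw spelling.
    return variants[variants.map(len) > 1]

# Hypothetical toy values; the real variants in df may differ
sample = pd.Series(
    ["High school", "high school", "HIGH SCHOOL ", "Bachelor", None],
    name="education_level",
)
edu_variants = spelling_variants(sample)
print(edu_variants)
```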

## 💡 Teaching insight

### Validation rules must be defined independently of data collection.

### 4.4 Logical inconsistencies (children vs age)
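
Some records are valid column by column but implausible in combination, for example a young child recorded as having children. The threshold below (`MIN_PARENT_AGE = 15`) is an assumption chosen for illustration, not a rule from the data dictionary, and should be agreed with the data owner. `flag_young_parents` and `sample` are hypothetical.

```python
import pandas as pd

MIN_PARENT_AGE = 15  # assumption for illustration, not a documented business rule

def flag_young_parents(frame, min_age=MIN_PARENT_AGE):
    """True where has_children is 'Yes' and age is valid (>= 0) but below min_age."""
    # Negative ages are excluded here; they are left to the age range rule.
    is_young = frame["age"].between(0, min_age - 1)
    return frame["has_children"].eq("Yes") & is_young

# Hypothetical toy rows standing in for df
sample = pd.DataFrame({
    "age": [5, 40, 12, -20],
    "has_children": ["Yes", "Yes", "No", "Yes"],
})
young_parents = flag_young_parents(sample)
print(sample.loc[young_parents])
```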

## 5. Strategies to Address Inconsistencies
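
One conservative strategy is to remove exact duplicates and replace impossible values with missing values, while keeping a flag column so the change stays auditable. The sketch below is one option among several (others include dropping rows or imputing). `clean_basic`, the flag name `age_was_invalid`, and `sample` are hypothetical.

```python
import pandas as pd

def clean_basic(frame, age_min=0, age_max=110):
    """Drop exact duplicate rows and mask out-of-range ages, keeping an audit flag."""
    out = frame.drop_duplicates().copy()
    out["age_was_invalid"] = ~out["age"].between(age_min, age_max)
    # where() keeps values where the condition is True and sets the rest to missing;
    # the nullable Int64 dtype lets the column stay integer despite the gaps.
    out["age"] = out["age"].where(~out["age_was_invalid"]).astype("Int64")
    return out

# Hypothetical toy rows standing in for df
sample = pd.DataFrame({"person_id": [1, 1, 2, 3], "age": [34, 34, 999, -20]})
cleaned = clean_basic(sample)
print(cleaned)
```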

## 6. Packaging Data Prep into a Transformer
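
A scikit-learn transformer needs only `fit` and `transform`; inheriting from `TransformerMixin` adds `fit_transform`. The class below is a minimal sketch, not a reference solution: the name `CensusCleaner`, its parameters, and the particular rules it applies (trim whitespace, drop exact duplicates, mask ages outside 0 to 110) are assumptions based on the data dictionary.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

TEXT_COLUMNS = ["neighbourhood", "gender", "education_level", "has_children"]

class CensusCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical transformer bundling a few rule-based cleaning steps."""

    def __init__(self, age_min=0, age_max=110, drop_duplicates=True):
        self.age_min = age_min
        self.age_max = age_max
        self.drop_duplicates = drop_duplicates

    def fit(self, X, y=None):
        # Nothing is learned from the data; the rules are fixed parameters.
        return self

    def transform(self, X):
        out = X.copy()
        # Trim whitespace first so near-identical rows can be deduplicated.
        for col in TEXT_COLUMNS:
            if col in out.columns:
                out[col] = out[col].str.strip()
        if self.drop_duplicates:
            out = out.drop_duplicates().copy()
        # Replace out-of-range ages with missing values.
        valid_age = out["age"].between(self.age_min, self.age_max)
        out["age"] = out["age"].where(valid_age).astype("Int64")
        return out

# Hypothetical toy rows standing in for df
sample = pd.DataFrame({
    "person_id": [1, 1, 2],
    "age": [34, 34, 999],
    "gender": ["Female", "Female", " Male "],
})
cleaned = CensusCleaner().fit_transform(sample)
print(cleaned)
```

Keeping the bounds as constructor parameters means the same class can be reused with different rules and dropped into a scikit-learn `Pipeline`.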

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin



# Apply transformer


## 7. Optional: Export Cleaned Dataset
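
A cleaned frame can be written with `DataFrame.to_csv`. Passing `index=False` avoids an extra unnamed index column when the file is read back. The file name `edmonton_census_clean.csv` and the toy frame `cleaned` are hypothetical; the sketch writes to a temporary directory so it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame with a nullable integer age column
cleaned = pd.DataFrame({
    "person_id": [1, 2],
    "age": pd.array([34, None], dtype="Int64"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "edmonton_census_clean.csv")
    cleaned.to_csv(out_path, index=False)  # index=False: no extra unnamed column
    # Re-read to confirm the round trip; the dtype hint keeps age as nullable Int64.
    reloaded = pd.read_csv(out_path, dtype={"age": "Int64"})

print(reloaded)
```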

## 8. Final Reflection
- Dataset survey revealed hidden risks
- Data dictionary clarified assumptions
- Inconsistencies came from multiple sources
- Transformer ensures reproducibility
- Discussion points on ethical cleaning and model impact are available for students