# Data Preprocessing Pipeline (NumPy Only)

## 1. Overview
This notebook executes the data preprocessing pipeline. To ensure **Data Consistency** and prevent **Data Leakage**, we adhere to the following strict protocol:

1.  **Fit on Training Set:** Calculate statistics (Mean, Median, Mode, Categories) solely based on `aug_train.csv`.
2.  **Transform on Test Set:** Apply the *same* statistics and mappings derived from the Training set to `aug_test.csv`.

**Key Steps:**
* **Cleaning:** Handle specific string values (`<1`, `>20`, `never`).
* **Imputation:** Fill missing values using Median (Numeric) or "Unknown"/Mode (Categorical).
* **Feature Engineering:** Log-transform `training_hours`.
* **Encoding:** Ordinal Encoding for ranked data, One-Hot Encoding for nominal data.
* **Scaling:** Z-score Standardization using Training set parameters.
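
The cleaning step turns string-coded ranges into numbers before imputation. The real logic lives in `convert_experience` inside `src.data_processing`, which is not shown in this notebook, so the sketch below is only an illustration: the helper name `to_years` and the numeric stand-ins chosen for `<1` and `>20` are assumptions.

```python
import numpy as np

def to_years(values):
    """Hypothetical helper: map experience strings to floats, NaN when missing."""
    # Assumed stand-ins for the open-ended ranges; the project's own mapping may differ.
    special = {'<1': 0.0, '>20': 21.0}
    out = np.full(len(values), np.nan)
    for i, v in enumerate(values):
        v = str(v).strip()
        if v in special:
            out[i] = special[v]
        elif v not in ('', 'nan'):
            out[i] = float(v)   # plain numeric strings such as '5'
    return out

raw = np.array(['<1', '5', '>20', '', '12'])
years = to_years(raw)
print(years)
```

Missing entries stay as `NaN` here, so a later step can fill them with the Median learned from the Training set.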

In [1]:
import sys
import os
import numpy as np

# Import custom functions from src
sys.path.append(os.path.abspath('..'))
from src.data_processing import (
    load_data, 
    convert_experience, 
    convert_last_new_job, 
    ordinal_encode, 
    one_hot_encode
)

print("Libraries and custom modules loaded.")

Libraries and custom modules loaded.


In [2]:
# --- 1. LOAD DATA & SETUP ---
print("Loading raw data...")
train_data = load_data('../data/raw/aug_train.csv')
test_data = load_data('../data/raw/aug_test.csv')

print(f"Original Train shape: {train_data.shape}")
print(f"Original Test shape : {test_data.shape}")

# 2. Separate Target (Only in Train)
y_train_raw = train_data['target'].astype(int)

# 3. Save IDs for Submission to CSV (NumPy Only)
# Keep the Test set IDs so they can be joined to the submission file later
test_ids = test_data['enrollee_id']

# Create the output directory if it does not exist yet
os.makedirs('../data/processed/', exist_ok=True)

# Save as a CSV file (integer format %d)
np.savetxt(
    '../data/processed/test_ids.csv', 
    test_ids,
    delimiter=",",
    header="enrollee_id",
    comments='',  
    fmt='%d'      
)

print("Test IDs saved to ../data/processed/test_ids.csv")

Loading raw data...
Data loaded successfully. 
Data loaded successfully. 
Original Train shape: (19158,)
Original Test shape : (2129,)
Test IDs saved to ../data/processed/test_ids.csv


In [3]:
def process_data_pipeline(data, is_train=True, stats=None):
    """
    Central data processing function.
    - is_train=True: compute the Median and Categories and store them in 'stats'.
    - is_train=False: reuse 'stats' for processing (nothing is recomputed).
    """
    X_list = []
    feature_names = []
    
    # For Train, start a new stats dict. For Test, stats must be provided.
    if is_train:
        stats = {}
    elif stats is None:
        raise ValueError("Must provide 'stats' when processing Test data!")

    # --- 1. NUMERICAL FEATURES ---
    
    # A. City Development Index (kept as-is)
    cdi = data['city_development_index']
    X_list.append(cdi.reshape(-1, 1))
    if is_train: feature_names.append('city_dev_index')

    # B. Training Hours (Log Transform)
    hours = np.log1p(data['training_hours'].astype(float))
    X_list.append(hours.reshape(-1, 1))
    if is_train: feature_names.append('log_training_hours')

    # C. Experience (Convert + Fill Median)
    exp = convert_experience(data['experience'])
    if is_train:
        # Compute the Median on the Train set only
        stats['exp_median'] = np.nanmedian(exp)
    # Fill with the stored Median
    exp[np.isnan(exp)] = stats['exp_median']
    X_list.append(exp.reshape(-1, 1))
    if is_train: feature_names.append('experience_years')

    # D. Last New Job (Convert + Fill 0)
    lnj = convert_last_new_job(data['last_new_job'])
    lnj[np.isnan(lnj)] = 0
    X_list.append(lnj.reshape(-1, 1))
    if is_train: feature_names.append('last_new_job_years')

    # --- 2. ORDINAL FEATURES (Ranking) ---
    
    # A. Education Level
    edu_map = {'Primary School': 0, 'High School': 1, 'Graduate': 2, 'Masters': 3, 'Phd': 4, '': 2, 'nan': 2}
    edu_encoded = ordinal_encode(data['education_level'].astype(str), edu_map)
    X_list.append(edu_encoded.reshape(-1, 1))
    if is_train: feature_names.append('education_level_ord')

    # B. Company Size
    size_map = {
        '<10': 0, '10/49': 1, '50-99': 2, '100-500': 3, 
        '500-999': 4, '1000-4999': 5, '5000-9999': 6, '10000+': 7, 
        '': -1, 'nan': -1
    }
    size_encoded = ordinal_encode(data['company_size'], size_map)
    X_list.append(size_encoded.reshape(-1, 1))
    if is_train: feature_names.append('company_size_ord')

    # C. Enrolled University
    univ_map = {'no_enrollment': 0, 'Part time course': 1, 'Full time course': 2, '': 0, 'nan': 0}
    univ_encoded = ordinal_encode(data['enrolled_university'], univ_map)
    X_list.append(univ_encoded.reshape(-1, 1))
    if is_train: feature_names.append('enrolled_university_ord')

    # --- 3. NOMINAL FEATURES (One-Hot) ---
    # The Test set must use exactly the same column list (categories) as the Train set
    
    nominal_cols = ['gender', 'company_type', 'major_discipline']
    fill_vals = ['Unknown', 'Unknown', 'STEM'] # Default fill values
    
    for i, col_name in enumerate(nominal_cols):
        raw_col = data[col_name].astype(str)
        # 1. Basic missing-value fill (a fixed-width string dtype may truncate the label)
        raw_col[(raw_col == '') | (raw_col == 'nan')] = fill_vals[i]
        
        # 2. Encoding logic
        if is_train:
            # FIT: find the unique categories and store them in stats
            oh_matrix, cats = one_hot_encode(raw_col)
            stats[f'cat_{col_name}'] = cats # Store the categories
            feature_names.extend([f"{col_name}_{c}" for c in cats])
        else:
            # TRANSFORM: reuse the stored categories
            saved_cats = stats[f'cat_{col_name}']
            oh_matrix = (raw_col[:, None] == saved_cats[None, :]).astype(int)
            
        X_list.append(oh_matrix)

    # Assemble everything into a single matrix
    X_final = np.column_stack(X_list).astype(float)
    
    return X_final, stats, feature_names

### The Core Processing Logic: `process_data_pipeline`

The function above is the core of our preprocessing workflow. It is designed to ensure **Data Consistency** between the Training and Test sets while adhering to the **"Fit on Train, Transform on Test"** principle.

#### Key Objectives:
1.  **Prevent Data Leakage:** We strictly avoid calculating any statistics (Mean, Median, Categories) on the Test set. All parameters are learned solely from the Training set.
2.  **Data Consistency:** We ensure that the processed Test set has exactly the **same number of columns** and the **same column order** as the Training set (crucial for One-Hot Encoding).

#### Mechanism:
The function accepts an `is_train` flag and a `stats` dictionary:

* **FIT Phase (`is_train=True`):**
    * Calculates the **Median** for numerical columns (e.g., `experience`).
    * Identifies unique **Categories** for nominal columns.
    * Saves all these parameters into the `stats` dictionary and returns it for future use.

* **TRANSFORM Phase (`is_train=False`):**
    * Does **not** recalculate anything.
    * Uses the Median from `stats` to fill missing values.
    * Uses the Category list from `stats` to generate the One-Hot columns, so the Test set has the same structure as the Train set even when some categories are absent from the Test data. A Test value that was never seen in Train produces an all-zero row for that feature.
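
The FIT/TRANSFORM switch can be tried in isolation on a single column. The sketch below is a reduced, hypothetical version (`impute_median` is not part of the project code) that follows the same `is_train` / `stats` convention, using toy values.

```python
import numpy as np

def impute_median(col, is_train=True, stats=None):
    """Hypothetical single-column version of the fit/transform switch."""
    col = col.astype(float).copy()
    if is_train:
        # FIT: learn the median from the data passed in
        stats = {'median': np.nanmedian(col)}
    elif stats is None:
        raise ValueError("stats is required when is_train=False")
    # TRANSFORM: fill gaps with the stored median
    col[np.isnan(col)] = stats['median']
    return col, stats

train_col = np.array([1.0, 3.0, np.nan, 9.0])
test_col = np.array([np.nan, 100.0])

train_filled, stats = impute_median(train_col, is_train=True)
test_filled, _ = impute_median(test_col, is_train=False, stats=stats)
print(stats['median'], test_filled)
```

The gap in `test_col` is filled with the median learned from `train_col` (3.0), not with anything derived from the Test values themselves.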

#### Technical Details per Group:
1.  **Numerical Features:**
    * `city_development_index`: Kept as-is.
    * `training_hours`: **Log-transformed** with `np.log1p` to reduce the skew of its distribution.
    * `experience`: Missing values are filled with the **Median** learned from the Training set.
    * `last_new_job`: Missing values are filled with **0**.

2.  **Ordinal Features (Hierarchical Mapping):**
    * Strings are mapped to integers according to a predefined hierarchy to preserve rank information.
    * **`education_level`**: Converted to a **0-4 scale**:
        * `Primary School`: 0 $\rightarrow$ `High School`: 1 $\rightarrow$ `Graduate`: 2 $\rightarrow$ `Masters`: 3 $\rightarrow$ `Phd`: 4.
        * Missing values (`nan` / `''`) are mapped to **2** (`Graduate`).
    * **`company_size`**: Converted to a **0-7 scale**:
        * `<10`: 0 $\rightarrow$ `10/49`: 1 $\rightarrow$ ... $\rightarrow$ `10000+`: 7.
        * Missing values (`nan` / `''`) are mapped to **-1**.
    * **`enrolled_university`**: Converted to a **0-2 scale**:
        * `no_enrollment`: 0 $\rightarrow$ `Part time course`: 1 $\rightarrow$ `Full time course`: 2.
        * Missing values (`nan` / `''`) are mapped to **0** (`no_enrollment`).

3.  **Nominal Features (One-Hot Encoding):**
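    * Missing values are first replaced with a fixed label: `Unknown` for `gender` and `company_type`, `STEM` for `major_discipline`.
    * Uses NumPy **Broadcasting**: `(raw_col[:, None] == saved_cats[None, :])`.
    * Categorical variables are expanded into binary matrices (0/1), where the number of new columns equals the number of unique categories found in the Training set:
        * **`gender`**: Expands into **4 columns** (`Female`, `Male`, `Other`, `Unknown`; the recorded run prints the last label as `Unknow`).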
        * **`company_type`**: Expands into **7 columns** (`Pvt Ltd`, `Funded Startup`, `Early Stage Startup`, `NGO`, `Public Sector`, `Other`, `Unknown`).
        * **`major_discipline`**: Expands into **6 columns** (`STEM`, `Humanities`, `Business Degree`, `Arts`, `No Major`, `Other`).
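
The broadcasting comparison used in the TRANSFORM branch can be checked on a toy column. In the sketch below, `saved_cats` stands in for the categories stored in `stats`, and the values (including the unseen `Nonbinary`) are made up for illustration.

```python
import numpy as np

# Toy categories, standing in for the list learned from the Training set
saved_cats = np.array(['Female', 'Male', 'Other'])
# Toy Test column; the last value never appeared in Train
test_col = np.array(['Male', 'Female', 'Nonbinary'])

# (n_rows, 1) == (1, n_cats) broadcasts to an (n_rows, n_cats) boolean matrix
one_hot = (test_col[:, None] == saved_cats[None, :]).astype(int)
print(one_hot.shape)
print(one_hot)
```

The column count is fixed by `saved_cats`, so it cannot drift between Train and Test. The unseen value matches no stored category and therefore becomes a row of zeros.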

In [4]:
# --- STEP 1: Process TRAIN Data (FIT & TRANSFORM) ---
print("[STEP 1] PROCESSING TRAINING DATA")
print(f"   Input Shape: {train_data.shape}")

# Run the processing and LEARN the parameters (stats)
X_train_processed, train_stats, feature_names = process_data_pipeline(train_data, is_train=True)

print(f"   >> Processing Complete.")
print(f"   Output Shape: {X_train_processed.shape} ({len(feature_names)} features generated)")
print(f"   >> LEARNED STATS (Evidence):")
print(f"      - Median Experience (imputed): {train_stats['exp_median']} years")
print(f"      - Gender Categories found    : {len(train_stats['cat_gender'])} {train_stats['cat_gender']}")
print(f"      - Company Types found        : {len(train_stats['cat_company_type'])}")


# --- STEP 2: Standardization (Scaling) ---
print("\n[STEP 2] STANDARDIZATION (Calculating Z-score parameters)")

# Compute Mean and Std on the Train set ONLY
mean_vals = np.mean(X_train_processed, axis=0)
std_vals = np.std(X_train_processed, axis=0)

# Guard against std = 0 (constant columns) to avoid division by zero
std_vals[std_vals == 0] = 1 

print(f"   >> Reference Statistics (calculated from Train):")
print(f"      - Mean Vector Sample (First 3): {np.round(mean_vals[:3], 4)}")
print(f"      - Std Vector Sample  (First 3): {np.round(std_vals[:3], 4)}")

# Apply the Z-score formula
X_train_scaled = (X_train_processed - mean_vals) / std_vals
print(f"   >> Train Data Scaled.")


# --- STEP 3: Process TEST Data (TRANSFORM ONLY) ---
print("\n[STEP 3] PROCESSING TEST DATA")
print(f"   Input Shape: {test_data.shape}")
print(f"   >> Applying stats learned from STEP 1...")

# Pass 'train_stats' so that the rules learned in STEP 1 are applied
X_test_processed, _, _ = process_data_pipeline(test_data, is_train=False, stats=train_stats)

# Apply the Mean/Std OF THE TRAIN SET
X_test_scaled = (X_test_processed - mean_vals) / std_vals

print(f"   >> Test Data Processed & Scaled.")
print(f"   Output Shape: {X_test_scaled.shape}")


# --- STEP 4: FINAL VALIDATION ---
print("\n[STEP 4] CONSISTENCY CHECK")

# Check the column count
cols_match = X_train_scaled.shape[1] == X_test_scaled.shape[1]
print(f"   1. Column Count Match : {'[PASS]' if cols_match else '[FAIL]'}")

# Check for NaN values
nan_train = np.isnan(X_train_scaled).sum()
nan_test = np.isnan(X_test_scaled).sum()
print(f"   2. NaN Count (Train)  : {'[0]' if nan_train == 0 else f'[FAIL] {nan_train}'}")
print(f"   3. NaN Count (Test)   : {'[0]' if nan_test == 0 else f'[FAIL] {nan_test}'}")

if cols_match and nan_train == 0 and nan_test == 0:
    print("\n>> PIPELINE SUCCESS: Data is ready for modeling.")
else:
    print("\n>> PIPELINE WARNING: Something needs to be fixed.")

[STEP 1] PROCESSING TRAINING DATA
   Input Shape: (19158,)
   >> Processing Complete.
   Output Shape: (19158, 24) (24 features generated)
   >> LEARNED STATS (Evidence):
      - Median Experience (imputed): 9.0 years
      - Gender Categories found    : 4 ['Female' 'Male' 'Other' 'Unknow']
      - Company Types found        : 7

[STEP 2] STANDARDIZATION (Calculating Z-score parameters)
   >> Reference Statistics (calculated from Train):
      - Mean Vector Sample (First 3): [ 0.8288  3.8002 10.0964]
      - Std Vector Sample  (First 3): [0.1234 0.9449 6.7656]
   >> Train Data Scaled.

[STEP 3] PROCESSING TEST DATA
   Input Shape: (2129,)
   >> Applying stats learned from STEP 1...
   >> Test Data Processed & Scaled.
   Output Shape: (2129, 24)

[STEP 4] CONSISTENCY CHECK
   1. Column Count Match : [PASS]
   2. NaN Count (Train)  : [0]
   3. NaN Count (Test)   : [0]

>> PIPELINE SUCCESS: Data is ready for modeling.
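
The gender categories printed in STEP 1 end with `Unknow` rather than `Unknown`. One plausible cause, assuming `load_data` returns fixed-width NumPy string fields (an assumption, since `src.data_processing` is not shown here), is that assigning a longer string into a `<U6` array silently truncates it. The sketch below reproduces that behavior on a toy array and shows a hypothetical workaround that widens the dtype before filling.

```python
import numpy as np

# Toy column: the longest value is 'Female', so NumPy picks a 6-character dtype
gender = np.array(['Male', 'Female', 'Other', ''])

col = gender.astype(str)          # keeps the 6-character width
col[col == ''] = 'Unknown'        # 7 characters are silently cut to 6
truncated = col[-1]

wide = gender.astype('U16')       # hypothetical fix: widen first so the label fits
wide[wide == ''] = 'Unknown'
kept = wide[-1]

print(col.dtype, truncated)
print(wide.dtype, kept)
```

If this is indeed the cause, the same truncation applies to both Train and Test, so the One-Hot columns still line up and only the printed label is affected.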


In [None]:
# --- 4. SAVE TO CSV (NumPy Only) ---

# Create the output directory if it does not exist yet
output_dir = '../data/processed/'
os.makedirs(output_dir, exist_ok=True)

print("Saving to CSV...")

# --- A. TRAIN SET (aug_train.csv) ---
# 1. Stack Features and Target into a single matrix
data_train_export = np.column_stack((X_train_scaled, y_train_raw.reshape(-1, 1)))

# 2. Build the header (list of column names + 'target')
header_train = ",".join(feature_names + ['target'])

# 3. Save the file
train_file_path = os.path.join(output_dir, 'aug_train.csv')
np.savetxt(
    train_file_path,        
    data_train_export,        
    delimiter=",",            
    header=header_train,      
    comments='',              
    fmt='%.6f'                
)
print(f"Saved: {train_file_path} {data_train_export.shape}")


# --- B. TEST SET (aug_test.csv) ---
header_test = ",".join(feature_names)

test_file_path = os.path.join(output_dir, 'aug_test.csv')
np.savetxt(
    test_file_path,
    X_test_scaled,
    delimiter=",",
    header=header_test,
    comments='',
    fmt='%.6f'
)
print(f"Saved: {test_file_path} {X_test_scaled.shape}")

print("\nProcessing Pipeline Completed Successfully.")

Saving to CSV...
Saved: ../data/processed/aug_train.csv (19158, 25)
Saved: ../data/processed/aug_test.csv (2129, 24)

Processing Pipeline Completed Successfully.


In [None]:
# Quick Sanity Check
print("--- Sanity Check ---")
print(f"Any NaN in Train? {np.isnan(X_train_scaled).sum()}")
print(f"Any NaN in Test?  {np.isnan(X_test_scaled).sum()}")

# Check that the column counts match (important!)
assert X_train_scaled.shape[1] == X_test_scaled.shape[1], "CRITICAL ERROR: Train and Test features mismatch!"
print("Column count matches. Pipeline is robust.")

--- Sanity Check ---
Any NaN in Train? 0
Any NaN in Test?  0
Column count matches. Pipeline is robust.
