# 🦷 Dental Implant 10-Year Survival Prediction

## Notebook 02: Data Preprocessing

**Objective:** Clean the data, handle categorical features, and prepare it for model training. The processed data will be saved for use in the modeling notebooks.

---


### 🎨 Setup: Import Libraries


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported!")


---

### 1. Load Data


In [None]:
# TODO: Load the raw training data.
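# Hint: Use pd.read_csv() with the path to data/raw/train.csv, relative to this notebook.
# The save cells later in this notebook write to ../data/processed/, which suggests ../data/raw/train.csv.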
df = ...

# Display basic info
print(f"Shape: {df.shape}")
df.head()


In [None]:
# TODO: Load the test data as well - we'll need to apply the same preprocessing.
# Hint: The test data is at data/raw/test.csv
df_test = ...

print(f"Test shape: {df_test.shape}")
df_test.head()


---

### 2. Identify Column Types


In [None]:
# TODO: Identify the ID column, target column, and feature columns.
# Hint: Usually there's an 'id' column that shouldn't be used as a feature.

# Define column names
id_col = ...  # e.g., 'id' or 'patient_id'
target_col = 'implant_survival_10y'

# Get all feature columns (everything except id and target)
feature_cols = [col for col in df.columns if col not in [id_col, target_col]]

print(f"ID column: {id_col}")
print(f"Target column: {target_col}")
print(f"Feature columns ({len(feature_cols)}): {feature_cols}")


In [None]:
# TODO: Separate numerical and categorical features.
# Hint: Use df[feature_cols].select_dtypes()

numerical_cols = ...
categorical_cols = ...

print(f"Numerical columns: {list(numerical_cols)}")
print(f"Categorical columns: {list(categorical_cols)}")
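
As a minimal sketch of the `select_dtypes()` hint, the split can be tried on a small made-up frame first. The column names here (`age`, `bone_density`, `smoking`, `implant_site`) are hypothetical and not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; these column names are illustrative only.
toy = pd.DataFrame({
    "age": [34, 58, 71],
    "bone_density": [1.2, 0.9, 1.1],
    "smoking": ["No", "Yes", "No"],
    "implant_site": ["molar", "incisor", "premolar"],
})

# include="number" selects integer and float columns;
# exclude="number" returns everything else (strings, categories, booleans).
toy_numerical = toy.select_dtypes(include="number").columns
toy_categorical = toy.select_dtypes(exclude="number").columns

print(list(toy_numerical))    # ['age', 'bone_density']
print(list(toy_categorical))  # ['smoking', 'implant_site']
```

Boolean columns land in the non-numeric group with this selection, so it is worth checking the printed lists before encoding.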


---

### 3. Handle Missing Values


In [None]:
# TODO: Check for missing values in training data.
# Hint: Use df.isnull().sum()
print("Missing values in training data:")
...


In [None]:
# TODO: If there are missing values, decide how to handle them.
# Options:
# 1. For numerical columns: fill with median or mean
#    df[col] = df[col].fillna(df[col].median())
# 2. For categorical columns: fill with mode or a placeholder like 'Unknown'
#    df[col] = df[col].fillna(df[col].mode()[0])
# 3. Drop rows with missing values (not recommended if many rows are affected)
#
# Note: assign the result back instead of calling fillna(..., inplace=True) on df[col].
# The in-place form on a selected column is deprecated in recent pandas and may not modify df.

# Example:
# for col in numerical_cols:
#     if df[col].isnull().sum() > 0:
#         median_val = df[col].median()
#         df[col] = df[col].fillna(median_val)
#         df_test[col] = df_test[col].fillna(median_val)  # Use training median for test!

...
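
The imputation options above can be sketched on toy data as follows. This is only an illustration under assumed column names (`age`, `smoking`); the point is that fill values are computed on the training frame and then reused for the test frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'age' and 'smoking' are illustrative column names.
toy_train = pd.DataFrame({"age": [30.0, np.nan, 50.0, 70.0],
                          "smoking": ["No", None, "No", "Yes"]})
toy_test = pd.DataFrame({"age": [np.nan, 40.0],
                         "smoking": [None, "Yes"]})

# Compute the fill values on the training data only.
age_median = toy_train["age"].median()         # median of 30, 50, 70
smoking_mode = toy_train["smoking"].mode()[0]  # most frequent non-missing value

# Assign the filled column back rather than using inplace=True on a column.
for frame in (toy_train, toy_test):
    frame["age"] = frame["age"].fillna(age_median)
    frame["smoking"] = frame["smoking"].fillna(smoking_mode)

print(toy_test)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.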


---

### 4. Feature Engineering & Selection (Optional)


In [None]:
# TODO: Based on the EDA, decide if any new features should be created.
# Examples of feature engineering:
# - Age groups: df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 70, 100], labels=['young', 'middle', 'senior', 'elderly'])
# - Combining related features: df['total_risk'] = df['risk_factor_1'] + df['risk_factor_2']
# - Interaction terms: df['age_x_smoking'] = df['age'] * df['smoking']

# For now, we will proceed with all features.
# Add your feature engineering code below if needed:

...
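
A small sketch of two of the ideas listed above (age groups and an interaction term), run on made-up values. The columns `age` and `smoking` are assumed to exist and to be numeric; adjust the names to whatever the EDA actually found.

```python
import pandas as pd

# Hypothetical toy frame; 'smoking' is assumed to be already encoded as 0/1.
toy = pd.DataFrame({"age": [25, 30, 45, 68, 82], "smoking": [0, 1, 0, 1, 1]})

# pd.cut bins are right-inclusive by default: (0, 30], (30, 50], (50, 70], (70, 100].
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 30, 50, 70, 100],
    labels=["young", "middle", "senior", "elderly"],
)

# Interaction term: age counts only for smokers, 0 otherwise.
toy["age_x_smoking"] = toy["age"] * toy["smoking"]

print(toy)
```

The new `age_group` column is categorical, so it would need the same encoding as the other categorical features, and any feature created on `df` has to be created on `df_test` as well.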


In [None]:
# TODO: Based on the EDA, decide if any features should be dropped.
# Reasons to drop features:
# - High correlation with other features (multicollinearity)
# - No variance (constant values)
# - Too many missing values
# - Identified as irrelevant during EDA

# Example:
# cols_to_drop = ['irrelevant_feature']
# df = df.drop(columns=cols_to_drop)
# df_test = df_test.drop(columns=cols_to_drop)

...
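
Two of the drop criteria above (no variance and high correlation) can be checked programmatically. The following is a rough sketch on invented columns; the 0.95 threshold is an assumption, not a value taken from this project.

```python
import pandas as pd

# Hypothetical toy frame with one constant and one redundant column.
toy = pd.DataFrame({
    "age": [30, 40, 50, 60, 70],
    "age_months": [360, 480, 600, 720, 840],  # exact multiple of age
    "clinic_code": [1, 1, 1, 1, 1],           # constant
    "bone_density": [1.1, 0.8, 1.3, 0.9, 1.0],
})

# Columns with a single unique value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

# Absolute pairwise correlations of the remaining columns.
corr = toy.drop(columns=constant_cols).corr().abs()
threshold = 0.95  # assumed cut-off; tune per dataset

# Flag a column if it is highly correlated with any column that comes before it.
correlated_cols = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i, i] > threshold).any()
]

print(f"Constant: {constant_cols}")
print(f"Highly correlated: {correlated_cols}")
```

The flagged names could then be collected into `cols_to_drop` and removed from both `df` and `df_test`.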


---

### 5. Handle Categorical Features


In [None]:
# TODO: Look at unique values for each categorical feature.
# This helps decide between label encoding and one-hot encoding.

for col in categorical_cols:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts())


In [None]:
# TODO: Convert binary categorical features (e.g., 'gender', 'smoking') into numerical format (0 and 1).
# Hint: You can use df['column'].map({'Value1': 0, 'Value2': 1})

# Example for a binary column like 'gender':
# df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})
# df_test['gender'] = df_test['gender'].map({'Male': 0, 'Female': 1})

# Identify binary columns (columns with exactly 2 unique values)
binary_cols = [col for col in categorical_cols if df[col].nunique() == 2]
print(f"Binary columns: {binary_cols}")

# TODO: Encode binary columns
...
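
One thing to watch with `.map()` is that any value missing from the dictionary becomes `NaN` without raising an error. A minimal sketch, assuming a hypothetical `smoking` column with `'No'`/`'Yes'` labels:

```python
import pandas as pd

# Hypothetical toy data; note the lowercase 'no' in the test frame.
toy_train = pd.DataFrame({"smoking": ["No", "Yes", "No"]})
toy_test = pd.DataFrame({"smoking": ["Yes", "no"]})

smoking_map = {"No": 0, "Yes": 1}  # assumed labels, check value_counts() first

# Apply the same dictionary to train and test.
toy_train["smoking"] = toy_train["smoking"].map(smoking_map)
toy_test["smoking"] = toy_test["smoking"].map(smoking_map)

# .map() returns NaN for any value that is not a key in the dictionary.
unmapped_test = int(toy_test["smoking"].isna().sum())
print(f"Unmapped test values: {unmapped_test}")
```

Counting the `NaN` values after mapping is a cheap way to catch spelling or casing differences between the two files.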


In [None]:
# TODO: Apply one-hot encoding to multi-class categorical features.
# Hint: Use pd.get_dummies() with drop_first=True to avoid multicollinearity.

# Identify multi-class categorical columns
multiclass_cols = [col for col in categorical_cols if df[col].nunique() > 2]
print(f"Multi-class columns for one-hot encoding: {multiclass_cols}")

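# TODO: Apply one-hot encoding
# df_processed = pd.get_dummies(df, columns=multiclass_cols, drop_first=True)
# Encode test WITHOUT drop_first: if the test set lacks the training baseline category,
# drop_first would drop a different category there and silently mis-encode it.
# The alignment step in the next cell removes the extra baseline column from test.
# df_test_processed = pd.get_dummies(df_test, columns=multiclass_cols)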

df_processed = ...
df_test_processed = ...

print(f"Shape after encoding: {df_processed.shape}")


In [None]:
# TODO: Make sure train and test have the same columns after one-hot encoding.
# Hint: Sometimes test data might not have all categories that appear in training.

# Get columns in train but not in test
train_cols = set(df_processed.columns)
test_cols = set(df_test_processed.columns)

# Add missing columns to test with zeros
missing_in_test = train_cols - test_cols
for col in missing_in_test:
    if col != target_col:  # Don't add target column to test
        df_test_processed[col] = 0

# Remove extra columns from test
extra_in_test = test_cols - train_cols
df_test_processed = df_test_processed.drop(columns=list(extra_in_test), errors='ignore')

print(f"Train columns: {df_processed.shape[1]}")
print(f"Test columns: {df_test_processed.shape[1]}")
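
An alternative to the manual add/drop loop is `DataFrame.reindex`, which aligns the test columns to the training columns in one call. The sketch below uses a made-up `site` column, and it encodes the test frame without `drop_first` so that the baseline category is decided by the training data alone.

```python
import pandas as pd

# Hypothetical toy data: test lacks 'incisor' and contains an unseen 'canine'.
toy_train = pd.DataFrame({"site": ["molar", "incisor", "premolar"], "age": [40, 55, 63]})
toy_test = pd.DataFrame({"site": ["molar", "premolar", "canine"], "age": [47, 71, 52]})

# Train: drop the first category ('incisor') as the baseline.
train_enc = pd.get_dummies(toy_train, columns=["site"], drop_first=True, dtype=int)

# Test: keep every category, then let reindex decide which columns survive.
test_enc = pd.get_dummies(toy_test, columns=["site"], dtype=int)

# reindex keeps train's columns in train's order, fills absent ones with 0,
# and drops columns that only exist in test (here 'site_canine').
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

In the real notebook the target column would have to be excluded from the column list passed to `reindex`, since the test set has no target. Rows with a category never seen in training end up with all zeros, which is the same encoding as the baseline category.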


---

### 6. Separate Features and Target


In [None]:
# TODO: Separate the features (X) from the target variable (y).
# Don't include the ID column in features!

# Get the final feature columns (exclude id and target)
final_feature_cols = [col for col in df_processed.columns if col not in [id_col, target_col]]

X = df_processed[final_feature_cols]
y = df_processed[target_col]

# For test data, keep the ID for submission and get features
test_ids = df_test_processed[id_col] if id_col in df_test_processed.columns else df_test[id_col]
X_test = df_test_processed[final_feature_cols]

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X_test shape: {X_test.shape}")


---

### 7. Save Processed Data


In [None]:
# TODO: Save the processed features (X) and target (y) to the ../data/processed/ folder.
# This will allow other notebooks to load the clean data directly.
# Hint: Use the .to_csv(index=False) method.

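# Create the output folder if it does not exist yet, then save training data
import os
os.makedirs('../data/processed', exist_ok=True)
X.to_csv('../data/processed/X_train.csv', index=False)
y.to_csv('../data/processed/y_train.csv', index=False)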

# Save test data
X_test.to_csv('../data/processed/X_test.csv', index=False)
test_ids.to_csv('../data/processed/test_ids.csv', index=False)

print("✅ Processed data saved to ../data/processed/")
print(f"   - X_train.csv: {X.shape}")
print(f"   - y_train.csv: {y.shape}")
print(f"   - X_test.csv: {X_test.shape}")
print(f"   - test_ids.csv: {test_ids.shape}")


In [None]:
# TODO: Also save the list of feature names for reference.
# This can be useful for feature importance analysis later.

feature_names = pd.DataFrame({'feature_name': final_feature_cols})
feature_names.to_csv('../data/processed/feature_names.csv', index=False)

print(f"Saved {len(final_feature_cols)} feature names.")


---

### 8. Quick Validation


In [None]:
# TODO: Verify the saved data by loading it back and checking.

X_check = pd.read_csv('../data/processed/X_train.csv')
y_check = pd.read_csv('../data/processed/y_train.csv')

print(f"Loaded X_train shape: {X_check.shape}")
print(f"Loaded y_train shape: {y_check.shape}")
print(f"\nFeature columns: {list(X_check.columns)}")
print(f"\nTarget distribution:")
print(y_check.value_counts())
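
Beyond shapes, it can help to confirm that the saved features contain no missing values and no leftover text columns. The helper below, `find_problems`, is a hypothetical name introduced only for this sketch, and it is demonstrated on a toy frame rather than on the saved files.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_problems(features: pd.DataFrame) -> list[str]:
    """Return descriptions of issues that would likely block model training."""
    problems = []
    for col in features.columns:
        if features[col].isna().any():
            problems.append(f"{col}: contains missing values")
        if not is_numeric_dtype(features[col]):
            problems.append(f"{col}: non-numeric dtype {features[col].dtype}")
    return problems


# Hypothetical toy frame with one missing value and one unencoded text column.
toy = pd.DataFrame({"age": [40, 55], "smoking": [0.0, None], "site": ["molar", "incisor"]})
problems = find_problems(toy)
for problem in problems:
    print(problem)
```

Calling `find_problems(X_check)` on the reloaded training features should return an empty list if preprocessing is complete.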


---

### ✅ Data Preprocessing Complete!

**Data has been saved to `../data/processed/`:**
- `X_train.csv` - Training features
- `y_train.csv` - Training target
- `X_test.csv` - Test features
- `test_ids.csv` - Test IDs for submission
- `feature_names.csv` - List of feature names

**Next Step:** Proceed to `03_Baseline_Models.ipynb` to train your first models!


