# MLimputer - Basic Usage

This notebook demonstrates the basic workflow for using MLimputer:
1. Generate synthetic data with missing values
2. Split the data into training and test sets
3. Configure and fit an imputer
4. Transform the training and test data
5. Save and load the fitted imputer
6. Apply the reloaded imputer to new data

## Setup

In [None]:
from sklearn.model_selection import train_test_split
from mlimputer import MLimputer
from mlimputer.schemas.parameters import imputer_parameters
from mlimputer.data.data_generator import ImputationDatasetGenerator

import warnings
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

print("="*60)
print("MLIMPUTER - BASIC USAGE")
print("="*60)

## Generate Dataset with Missing Values

We'll create a synthetic multiclass classification dataset with:
- 2000 samples
- 15% missing values
- 5 categorical features

In [None]:
generator = ImputationDatasetGenerator(random_state=42)
X, y = generator.quick_multiclass(n_samples=2000, missing_rate=0.15, n_categorical=5)

# Count missing cells once and reuse the result
n_missing = X.isnull().sum().sum()

print(f"Dataset shape: {X.shape}")
print(f"Missing values: {n_missing} ({n_missing / X.size:.1%})")
print("\nFirst 5 rows:")
X.head()
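
The generator's internals are not shown in this notebook. As a rough illustration of what a 15% missing rate means, the sketch below masks cells completely at random in a small numeric DataFrame. The helper `inject_missing` and the `demo` frame are hypothetical and are not part of MLimputer; the real generator may use a different mechanism.

```python
import numpy as np
import pandas as pd


def inject_missing(df: pd.DataFrame, missing_rate: float, seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with roughly `missing_rate` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    # Draw one uniform number per cell; cells below the rate become missing.
    mask = rng.random(df.shape) < missing_rate
    return df.mask(mask)


# Hypothetical demo data: 400 rows x 5 numeric columns, no missing values.
demo = pd.DataFrame(np.arange(2000, dtype=float).reshape(400, 5), columns=list("abcde"))
demo_missing = inject_missing(demo, missing_rate=0.15)

observed_rate = demo_missing.isnull().sum().sum() / demo_missing.size
print(f"Observed missing rate: {observed_rate:.1%}")
```

Because each cell is masked independently, the observed rate is close to the requested rate but not exactly equal to it.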

## Train/Test Split

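Split the data into training (80%) and test (20%) sets. Only the features `X` are split, because the imputer in this notebook is fitted on features alone.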

In [None]:
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

## Configure Imputation Strategy

We'll use **KNN imputation** with custom parameters:
- 5 neighbors
- Distance-based weighting

In [None]:
params = imputer_parameters()
params["KNN"]["n_neighbors"] = 5
params["KNN"]["weights"] = "distance"

print("KNN Configuration:")
for key, value in params["KNN"].items():
    print(f"  {key}: {value}")
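
To make the two settings concrete, here is a minimal pure-Python sketch of a k-nearest-neighbor estimate with uniform and inverse-distance weighting. It assumes the `weights` option follows the scikit-learn convention, where `"distance"` weights each neighbor by the inverse of its distance. The function `knn_estimate` is hypothetical and is not MLimputer's implementation.

```python
import math


def knn_estimate(query, neighbors, k=5, weights="distance"):
    """Estimate a missing value from the k nearest complete rows.

    `neighbors` is a list of (features, target) pairs; `query` is the
    feature vector of the row whose target is missing.
    """
    # Rank candidate rows by Euclidean distance to the query and keep the k closest.
    ranked = sorted(
        (math.dist(query, features), target) for features, target in neighbors
    )[:k]
    if weights == "uniform":
        return sum(target for _, target in ranked) / len(ranked)
    # An exact match (distance 0) would get infinite weight, so average those directly.
    exact = [target for dist, target in ranked if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weighted = [(1.0 / dist, target) for dist, target in ranked]
    return sum(w * t for w, t in weighted) / sum(w for w, _ in weighted)


# Two complete rows at distances 1 and 3 from the query.
rows = [((1.0,), 10.0), ((3.0,), 30.0)]
uniform_value = knn_estimate((0.0,), rows, k=2, weights="uniform")
distance_value = knn_estimate((0.0,), rows, k=2, weights="distance")
print(uniform_value, distance_value)
```

With distance weighting, the closer row pulls the estimate toward its own value, whereas uniform weighting gives the plain average of the neighbors.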

## Create and Fit Imputer

In [None]:
imputer = MLimputer(
    imput_model="KNN",
    imputer_configs=params
)

imputer.fit(X_train)

## Transform Data

Apply the fitted imputer to both the training and test sets.

In [None]:
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

print("="*60)
print("IMPUTATION RESULTS")
print("="*60)

# Count missing cells before and after so the imputed total reflects the actual change
train_before = X_train.isnull().sum().sum()
train_after = X_train_imputed.isnull().sum().sum()
test_before = X_test.isnull().sum().sum()
test_after = X_test_imputed.isnull().sum().sum()

print("\nTraining Set:")
print(f"  Missing values: {train_before:,} → {train_after:,}")
print(f"  Imputed: {train_before - train_after:,} values")

print("\nTest Set:")
print(f"  Missing values: {test_before:,} → {test_after:,}")
print(f"  Imputed: {test_before - test_after:,} values")
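
The imputer is fitted on the training set only and then applied to both sets, so nothing learned from the test data influences the filled values. The sketch below shows the same fit/transform pattern with a toy column-mean imputer. Mean imputation is used here purely for brevity and is not the KNN strategy configured above; `MeanImputerSketch` is a hypothetical class, not part of MLimputer.

```python
import numpy as np
import pandas as pd


class MeanImputerSketch:
    """Toy column-mean imputer (not MLimputer) showing the fit/transform split."""

    def fit(self, df: pd.DataFrame) -> "MeanImputerSketch":
        # Statistics are learned from the training data only.
        self.means_ = df.mean(numeric_only=True)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # The stored training means fill gaps in any later dataset.
        return df.fillna(self.means_)


train = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
test = pd.DataFrame({"x": [100.0, np.nan]})

sketch = MeanImputerSketch().fit(train)
test_filled = sketch.transform(test)
print(test_filled)
```

The missing test value is filled with the training mean, not with a value computed from the test set itself.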

## Save and Load Imputer

Save the fitted imputer with `pickle` so it can be reused later without refitting.

In [None]:
import pickle

# Save the fitted imputer
with open("fitted_imputer.pkl", "wb") as f:
    pickle.dump(imputer, f)
print("✓ Imputer saved to 'fitted_imputer.pkl'")

# Load it back from disk
with open("fitted_imputer.pkl", "rb") as f:
    loaded_imputer = pickle.load(f)
print("✓ Imputer loaded successfully")
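
A quick way to gain confidence in a save/load step is a round-trip check. The sketch below writes an object to a temporary file, reads it back, and compares the two. A plain dict stands in for the fitted imputer, and the helper `save_and_reload` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_and_reload(obj, path):
    """Pickle obj to path, then load it back and return the copy."""
    path = Path(path)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    with path.open("rb") as f:
        return pickle.load(f)


# A plain dict stands in for a fitted imputer in this sketch.
stand_in = {"model": "KNN", "n_neighbors": 5, "weights": "distance"}

# The temporary directory is removed automatically when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_and_reload(stand_in, Path(tmp_dir) / "stand_in.pkl")

print(restored == stand_in)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code. A pickled object may also fail to load if the library versions differ between the saving and loading environments.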

## Test on New Data

Verify that the loaded imputer works on newly generated data with a higher missing rate (20%).

In [None]:
# Keep only the features; the target is not needed for imputation
new_data, _ = generator.quick_multiclass(n_samples=100, missing_rate=0.2, n_categorical=5)
new_data_imputed = loaded_imputer.transform(new_data)

print(f"New data imputation: {new_data.isnull().sum().sum()} → {new_data_imputed.isnull().sum().sum()} missing values")
print("\n✓ Basic usage completed successfully!")