# Size Suggestion with IQR Outlier Detection & StandardScaler

This notebook demonstrates an improved approach for size prediction using:
- **IQR (Interquartile Range)** for outlier detection
- **StandardScaler** for proper feature normalization
- **Global statistics** from training data (not per-size-category)

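This approach addresses the data leakage and scale mismatch of the per-size z-score normalization method: the scaler is fitted on the training split only, and every feature is scaled with one global mean and standard deviation rather than per-size statistics.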
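
To make the contrast concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration, not taken from the dataset). Per-size z-scoring needs the size label to pick its statistics, which is not available at prediction time, and it maps every size group onto the same scale. A global z-score uses one mean and one standard deviation for all rows.

```python
import pandas as pd

# Hypothetical toy data: two size groups with clearly different weights
toy = pd.DataFrame({
    "size": ["S", "S", "S", "XL", "XL", "XL"],
    "weight": [50.0, 52.0, 54.0, 78.0, 80.0, 82.0],
})

# Per-size z-score: the statistics depend on the label being predicted
per_size = toy.groupby("size")["weight"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

# Global z-score: one mean and one standard deviation for every row
global_z = (toy["weight"] - toy["weight"].mean()) / toy["weight"].std(ddof=0)

print(pd.DataFrame({"size": toy["size"], "per_size": per_size, "global": global_z}))
```

In this toy frame the S and XL rows receive the same per-size scores, while the global scores keep S below XL.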

## 1. Import Libraries

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import joblib

%matplotlib inline
print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load and Explore Data

In [2]:
# Load the dataset
df = pd.read_csv('../data/final_test.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

Dataset shape: (119734, 4)

Columns: ['weight', 'age', 'height', 'size']


| | weight | age | height | size |
|---|---|---|---|---|
| 0 | 62 | 28.0 | 172.72 | XL |
| 1 | 59 | 36.0 | 167.64 | L |
| 2 | 61 | 34.0 | 165.1 | M |
| 3 | 65 | 27.0 | 175.26 | L |
| 4 | 62 | 45.0 | 172.72 | M |


In [3]:
# Data info
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 119734 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119734 non-null  int64  
 1   age     119477 non-null  float64
 2   height  119404 non-null  float64
 3   size    119734 non-null  str    
dtypes: float64(2), int64(1), str(1)
memory usage: 3.7 MB


In [4]:
# Statistical summary
df.describe()

| | weight | age | height |
|---|---|---|---|
| count | 119734.0 | 119477.0 | 119404.0 |
| mean | 61.756811 | 34.027311 | 165.805794 |
| std | 9.944863 | 8.149447 | 6.737651 |
| min | 22.0 | 0.0 | 137.16 |
| 25% | 55.0 | 29.0 | 160.02 |
| 50% | 61.0 | 32.0 | 165.1 |
| 75% | 67.0 | 37.0 | 170.18 |
| max | 136.0 | 117.0 | 193.04 |


In [5]:
# Size distribution
print("Size Distribution:")
print(df['size'].value_counts())

Size Distribution:
size
M       29712
S       21924
XXXL    21359
XL      19119
L       17587
XXS      9964
XXL        69
Name: count, dtype: int64


## 3. Data Preprocessing (WITHOUT per-size z-score)

In [6]:
# Handle missing values
print("Missing values before:")
print(df.isna().sum())

df['age'] = df['age'].fillna(df['age'].median())
df['height'] = df['height'].fillna(df['height'].median())
df['weight'] = df['weight'].fillna(df['weight'].median())

print("\nMissing values after:")
print(df.isna().sum())

Missing values before:
weight      0
age       257
height    330
size        0
dtype: int64

Missing values after:
weight    0
age       0
height    0
size      0
dtype: int64
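
The cell above fills gaps with medians computed over the full dataset, before the split in Section 4. If the imputation statistics should also come from the training split only, one option is scikit-learn's `SimpleImputer` fitted on the training rows. The following is a minimal sketch on hypothetical toy data (`toy`, `train_part` and `test_part` are illustrative names, not part of the notebook).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with gaps in 'age' and 'height'
toy = pd.DataFrame({
    "weight": [62, 59, 61, 65, 62, 70, 55, 80],
    "age": [28.0, 36.0, np.nan, 27.0, 45.0, 31.0, np.nan, 40.0],
    "height": [172.72, 167.64, 165.10, np.nan, 172.72, 160.02, 170.18, 175.26],
})

train_part, test_part = train_test_split(toy, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy="median")
imputer.fit(train_part)  # medians are learned from the training rows only

train_filled = pd.DataFrame(
    imputer.transform(train_part), columns=toy.columns, index=train_part.index
)
test_filled = pd.DataFrame(
    imputer.transform(test_part), columns=toy.columns, index=test_part.index
)

print(dict(zip(toy.columns, imputer.statistics_)))
```

The same fitted imputer can then be saved next to the scaler so that new inputs are filled with training-set medians.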


In [7]:
# Remove outliers using IQR method on RAW data
def remove_outliers_iqr(df, columns):
    """
    Remove outliers using IQR method.
    This is applied to RAW data before any normalization.
    """
    df_clean = df.copy()
    
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        before_count = len(df_clean)
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]
        after_count = len(df_clean)
        removed = before_count - after_count
        
        print(f"{col}: Removed {removed} outliers (Range: [{lower_bound:.2f}, {upper_bound:.2f}])")
    
    return df_clean

print("Removing outliers using IQR method...")
print(f"Original dataset size: {len(df)}")

df_clean = remove_outliers_iqr(df, ['age', 'height', 'weight'])

print(f"\nCleaned dataset size: {len(df_clean)}")
print(f"Removed: {len(df) - len(df_clean)} rows ({(len(df) - len(df_clean))/len(df)*100:.2f}%)")

Removing outliers using IQR method...
Original dataset size: 119734
age: Removed 6726 outliers (Range: [17.00, 49.00])
height: Removed 167 outliers (Range: [144.78, 185.42])
weight: Removed 3236 outliers (Range: [37.00, 85.00])

Cleaned dataset size: 109605
Removed: 10129 rows (8.46%)
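
As a worked check of the ranges printed above, the fences can be recomputed by hand from the quartiles in the `df.describe()` output. This is a sketch under one assumption: `remove_outliers_iqr` filters column by column, so the quartiles for `height` and `weight` are taken on already-filtered rows and need not equal the raw ones. For this dataset the printed ranges agree with the raw quartiles. The helper name `iqr_bounds` is hypothetical.

```python
# Quartiles (Q1, Q3) copied from the df.describe() output
quartiles = {
    "age": (29.0, 37.0),
    "height": (160.02, 170.18),
    "weight": (55.0, 67.0),
}

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for name, (q1, q3) in quartiles.items():
    lower, upper = iqr_bounds(q1, q3)
    print(f"{name}: [{lower:.2f}, {upper:.2f}]")
```

For `age`, the IQR is 37 - 29 = 8, so the fences are 29 - 12 = 17 and 37 + 12 = 49.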


In [8]:
# Encode size labels
size_mapping_encode = {"XXS":1, "S":2, "M":3, "L":4, "XL":5, "XXL":6, "XXXL":7}
df_clean['size'] = df_clean['size'].map(size_mapping_encode)

print("Size encoding:")
print(df_clean['size'].value_counts().sort_index())

Size encoding:
size
1     9731
2    21127
3    28379
4    16533
5    17747
6       63
7    16025
Name: count, dtype: int64
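
`Series.map` returns `NaN` for any label missing from the dictionary, so a misspelled or unexpected size would silently become a missing target. A minimal sketch of a guard follows; the `labels` series and the invalid value `'xs'` are hypothetical.

```python
import pandas as pd

# Ordered size labels; codes start at 1 as in the notebook
size_order = ["XXS", "S", "M", "L", "XL", "XXL", "XXXL"]
encode = {label: code for code, label in enumerate(size_order, start=1)}
decode = {code: label for label, code in encode.items()}

labels = pd.Series(["M", "XL", "xs", "XXXL"])  # 'xs' is deliberately invalid
codes = labels.map(encode)

# Rows whose label was not found in the mapping
unmapped = labels[codes.isna()]
print(unmapped.tolist())
```

Building `decode` from `encode` keeps the two dictionaries in sync when a size is added or renamed.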


## 4. Train/Validation/Test Split (BEFORE Scaling)

In [9]:
# Prepare features and target
X = df_clean[['age', 'height', 'weight']].copy()
y = df_clean['size'].copy()

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {X.columns.tolist()}")

Features shape: (109605, 3)
Target shape: (109605,)

Features: ['age', 'height', 'weight']


In [10]:
# Split the data: 60% train, 20% validation, 20% test
# IMPORTANT: Split BEFORE scaling to prevent data leakage

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.4, 
    random_state=42, 
    stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, 
    test_size=0.5, 
    random_state=42, 
    stratify=y_temp
)

print(f"Dataset Split Summary:")
print(f"{'='*50}")
print(f"Total samples: {len(X)}")
print(f"\nTraining set:   {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"Testing set:    {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"{'='*50}")

Dataset Split Summary:
Total samples: 109605

Training set:   65763 samples (60.0%)
Validation set: 21921 samples (20.0%)
Testing set:    21921 samples (20.0%)
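
The two-step split yields 60/20/20 because the first call holds out 40% and the second halves that hold-out. With `stratify`, each part should keep roughly the same class shares, which matters for a rare class such as code 6. The sketch below checks this on hypothetical synthetic labels (`X_demo`, `y_demo` and the class counts are invented).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical imbalanced labels: class 6 is rare, as in the notebook
y_demo = np.repeat([1, 2, 3, 6], [300, 400, 280, 20])
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

def share(y, label):
    """Fraction of samples in y that carry the given label."""
    return float(np.mean(y == label))

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), round(share(part, 6), 3))
```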


## 5. Feature Scaling with StandardScaler

In [45]:
# Fit StandardScaler on training data ONLY
scaler = StandardScaler()
scaler.fit(X_train)

# Transform all three sets
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_val_scaled = pd.DataFrame(
    scaler.transform(X_val),
    columns=X_val.columns,
    index=X_val.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

print("StandardScaler fitted and applied!")
print(f"\nScaler parameters (from training data):")
print(f"Mean: {scaler.mean_}")
print(f"Std:  {scaler.scale_}")

# Save the scaler for later use
joblib.dump(scaler, '../models/feature_scaler.pkl')
print("\nScaler saved to '../models/feature_scaler.pkl'")

StandardScaler fitted and applied!

Scaler parameters (from training data):
Mean: [ 32.72575764 165.70530815  60.73199215]
Std:  [6.13875694 6.64634509 8.10532127]

Scaler saved to '../models/feature_scaler.pkl'
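
`StandardScaler.transform` computes z = (x - mean) / scale for each feature. As a worked check, the sketch below applies that formula by hand with the parameters printed above to the input used in Section 8 (age 20, height 180 cm, weight 59 kg). The helper name `standardize` is hypothetical.

```python
# Parameters copied from the scaler output (order: age, height, weight)
mean = [32.72575764, 165.70530815, 60.73199215]
scale = [6.13875694, 6.64634509, 8.10532127]

def standardize(row):
    """Apply z = (x - mean) / scale feature by feature."""
    return [(x - m) / s for x, m, s in zip(row, mean, scale)]

z = standardize([20, 180, 59])  # age 20, height 180 cm, weight 59 kg
print([round(v, 4) for v in z])
```

The result agrees with the `standardized` values reported in the Section 8 output (about -2.073, 2.151 and -0.214).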


## 6. Train Models

### Decision Tree Training 

In [12]:
# Train Decision Tree Model
print("Training Decision Tree Model")

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_scaled, y_train)

Training Decision Tree Model


DecisionTreeClassifier(random_state=42)


#### Evaluate on validation set

In [13]:
# Evaluate on validation set
y_val_pred = dt_model.predict(X_val_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'\nValidation Accuracy: {val_accuracy*100:.2f}%')


Validation Accuracy: 48.51%


#### Evaluate on test set

In [14]:
# Evaluate on test set
y_test_pred = dt_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test Accuracy: {test_accuracy*100:.2f}%')

print('Classification Report (Test Set):')
print(classification_report(y_test, y_test_pred))

Test Accuracy: 47.90%
Classification Report (Test Set):
              precision    recall  f1-score   support

           1       0.49      0.45      0.47      1946
           2       0.47      0.50      0.49      4226
           3       0.47      0.58      0.52      5676
           4       0.34      0.27      0.30      3306
           5       0.42      0.37      0.39      3550
           6       0.00      0.00      0.00        12
           7       0.71      0.64      0.67      3205

    accuracy                           0.48     21921
   macro avg       0.41      0.40      0.40     21921
weighted avg       0.48      0.48      0.48     21921
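
To put the accuracy in context, it helps to compare it with a majority-class baseline. The sketch below derives that baseline from the support column of the report above; no model is involved, only the class counts.

```python
# Test-set support per size code, copied from the classification report
support = {1: 1946, 2: 4226, 3: 5676, 4: 3306, 5: 3550, 6: 12, 7: 3205}

total = sum(support.values())
majority_code = max(support, key=support.get)

# Accuracy of always predicting the most frequent class
baseline_accuracy = support[majority_code] / total

print(f"Total test samples: {total}")
print(f"Majority class: {majority_code}, baseline accuracy: {baseline_accuracy:.2%}")
```

Always predicting size code 3 (M) would be right for roughly a quarter of the test set, so both models are well above that floor while still misclassifying about half of the samples.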



#### Save the model

In [15]:
# Save model
joblib.dump(dt_model, '../models/decision_tree_model_standardized.pkl')
print("\nDecision Tree model saved!")


Decision Tree model saved!


### MLP Neural Network Model Training

In [16]:
# Train MLP Neural Network Model
print("Training MLP Neural Network Model")

mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
mlp.fit(X_train_scaled, y_train)

Training MLP Neural Network Model


MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)


#### Evaluate on validation set

In [17]:
# Evaluate on validation set
y_val_pred = mlp.predict(X_val_scaled)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'\nValidation Accuracy: {val_accuracy*100:.2f}%')


Validation Accuracy: 50.10%


#### Evaluate on test set

In [18]:

y_test_pred = mlp.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test Accuracy: {test_accuracy*100:.2f}%')
print('Classification Report (Test Set):')
print(classification_report(y_test, y_test_pred))

Test Accuracy: 50.25%
Classification Report (Test Set):
              precision    recall  f1-score   support

           1       0.54      0.51      0.53      1946
           2       0.50      0.48      0.49      4226
           3       0.47      0.63      0.54      5676
           4       0.36      0.19      0.25      3306
           5       0.43      0.43      0.43      3550
           6       0.00      0.00      0.00        12
           7       0.72      0.70      0.71      3205

    accuracy                           0.50     21921
   macro avg       0.43      0.42      0.42     21921
weighted avg       0.50      0.50      0.49     21921
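
Size code 6 (XXL) has only 12 test samples and scores 0.00 on every metric. When a class is never predicted, its precision is undefined and scikit-learn reports 0 with a warning. Passing `zero_division=0` makes that choice explicit, and macro F1 or balanced accuracy show how much the rare class pulls the average down. The labels in this sketch are hypothetical.

```python
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    f1_score,
)

# Hypothetical labels: class 6 exists in y_true but is never predicted
y_true = [3, 3, 3, 4, 4, 6, 7, 7]
y_pred = [3, 3, 4, 4, 3, 7, 7, 7]

# zero_division=0 reports undefined precision as 0 without a warning
report = classification_report(y_true, y_pred, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(report)
print(f"macro F1: {macro_f1:.3f}, balanced accuracy: {bal_acc:.3f}")
```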





#### Save the model

In [19]:
joblib.dump(mlp, '../models/mlp_model_standardized.pkl')
print("\nMLP model saved!")


MLP model saved!
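
A saved model is only usable together with the scaler it was trained with, so both files need to be loaded and applied in the same order. The sketch below is a round-trip check on hypothetical synthetic data, written to a temporary directory rather than `../models/`; all `*_demo` names are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical features (age, height, weight) and size codes 1..7
X_demo = rng.normal(loc=[33, 166, 61], scale=[6, 7, 8], size=(200, 3))
y_demo = rng.integers(1, 8, size=200)

scaler_demo = StandardScaler().fit(X_demo)
model_demo = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "feature_scaler.pkl")
    model_path = os.path.join(tmp, "model.pkl")
    joblib.dump(scaler_demo, scaler_path)
    joblib.dump(model_demo, model_path)

    # Reload both artifacts, as a separate serving process would
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

same = np.array_equal(
    model_demo.predict(scaler_demo.transform(X_demo)),
    loaded_model.predict(loaded_scaler.transform(X_demo)),
)
print("Predictions identical after reload:", same)
```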



## 7. Functions for Model Testing

### 7.1. Decision Tree Model

#### Predict Function

In [None]:
# Size mapping for display
size_mapping = {1: "XXS", 2: "S", 3: "M", 4: "L", 5: "XL", 6: "XXL", 7: "XXXL"}

def predict_with_decision_tree(age, height, weight):
    # Prepare input
    input_data = pd.DataFrame({
        'age': [age],
        'height': [height],
        'weight': [weight]
    })
    
    # Check outliers against IQR fences recomputed from X_train (already IQR-filtered in Section 3)
    def check_outlier(value, data):
        Q1, Q3 = np.percentile(data, 25), np.percentile(data, 75)
        IQR = Q3 - Q1
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        return (value < lower or value > upper), lower, upper
    
    outliers = {}
    for feature in ['age', 'height', 'weight']:
        is_out, lower, upper = check_outlier(input_data[feature].values[0], X_train[feature])
        outliers[feature] = {
            'is_outlier': is_out,
            'value': float(input_data[feature].values[0]),
            'valid_range': (float(lower), float(upper))
        }
    
    # Standardize
    input_scaled = scaler.transform(input_data)
    input_scaled_df = pd.DataFrame(input_scaled, columns=['age', 'height', 'weight'])
    
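    # Predict
    prediction = dt_model.predict(input_scaled_df)[0]
    probabilities = dt_model.predict_proba(input_scaled_df)[0]
    # predict_proba columns follow dt_model.classes_, so look the index up there
    # instead of assuming the codes are contiguous and start at 1
    confidence = probabilities[list(dt_model.classes_).index(prediction)] * 100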
    
    return {
        'model': 'Decision Tree',
        'input': {
            'age': age,
            'height': height,
            'weight': weight
        },
        'standardized': {
            'age': float(input_scaled_df['age'].values[0]),
            'height': float(input_scaled_df['height'].values[0]),
            'weight': float(input_scaled_df['weight'].values[0])
        },
        'outliers': outliers,
        'has_outliers': any(v['is_outlier'] for v in outliers.values()),
        'prediction': {
            'size': size_mapping[prediction],
            'size_code': int(prediction),
            'confidence': float(confidence)
        },
        'probabilities': {
            # Pair each probability with its class code from dt_model.classes_
            size_mapping[int(code)]: float(prob*100)
            for code, prob in zip(dt_model.classes_, probabilities)
        }
    }
print("Decision Tree prediction functions created!")

### 7.2. MLP Model

#### Predict Function

In [32]:
def predict_with_mlp(age, height, weight):
    # Prepare input
    input_data = pd.DataFrame({
        'age': [age],
        'height': [height],
        'weight': [weight]
    })
    
    # Check outliers against IQR fences recomputed from X_train (already IQR-filtered in Section 3)
    def check_outlier(value, data):
        Q1, Q3 = np.percentile(data, 25), np.percentile(data, 75)
        IQR = Q3 - Q1
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        return (value < lower or value > upper), lower, upper
    
    outliers = {}
    for feature in ['age', 'height', 'weight']:
        is_out, lower, upper = check_outlier(input_data[feature].values[0], X_train[feature])
        outliers[feature] = {
            'is_outlier': is_out,
            'value': float(input_data[feature].values[0]),
            'valid_range': (float(lower), float(upper))
        }
    
    # Standardize
    input_scaled = scaler.transform(input_data)
    input_scaled_df = pd.DataFrame(input_scaled, columns=['age', 'height', 'weight'])
    
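    # Predict
    prediction = mlp.predict(input_scaled_df)[0]
    probabilities = mlp.predict_proba(input_scaled_df)[0]
    # predict_proba columns follow mlp.classes_, so look the index up there
    # instead of assuming the codes are contiguous and start at 1
    confidence = probabilities[list(mlp.classes_).index(prediction)] * 100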
    
    return {
        'model': 'MLP Neural Network',
        'input': {
            'age': age,
            'height': height,
            'weight': weight
        },
        'standardized': {
            'age': float(input_scaled_df['age'].values[0]),
            'height': float(input_scaled_df['height'].values[0]),
            'weight': float(input_scaled_df['weight'].values[0])
        },
        'outliers': outliers,
        'has_outliers': any(v['is_outlier'] for v in outliers.values()),
        'prediction': {
            'size': size_mapping[prediction],
            'size_code': int(prediction),
            'confidence': float(confidence)
        },
        'probabilities': {
            # Pair each probability with its class code from mlp.classes_
            size_mapping[int(code)]: float(prob*100)
            for code, prob in zip(mlp.classes_, probabilities)
        }
    }
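
The two predict functions differ only in the model they call, so they could share one helper that takes the model and scaler as arguments. The following is a minimal sketch, not the notebook's implementation: `predict_size`, `FEATURES`, `SIZE_NAMES` and the toy training data are hypothetical, and the outlier check is omitted for brevity. It reads class codes from `model.classes_`, which keeps working if a size is absent from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["age", "height", "weight"]
SIZE_NAMES = {1: "XXS", 2: "S", 3: "M", 4: "L", 5: "XL", 6: "XXL", 7: "XXXL"}

def predict_size(model, scaler, age, height, weight):
    """Hypothetical helper: scale one row, then read labels from model.classes_."""
    row = pd.DataFrame([[age, height, weight]], columns=FEATURES)
    scaled = pd.DataFrame(scaler.transform(row), columns=FEATURES)
    probs = model.predict_proba(scaled)[0]
    best = int(np.argmax(probs))
    code = int(model.classes_[best])
    return {
        "size": SIZE_NAMES[code],
        "size_code": code,
        "confidence": float(probs[best] * 100),
        "probabilities": {
            SIZE_NAMES[int(c)]: float(p * 100)
            for c, p in zip(model.classes_, probs)
        },
    }

# Toy training set in which codes 1, 2 and 6 never occur
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.normal([33, 166, 61], [6, 7, 8], size=(120, 3)), columns=FEATURES
)
y_toy = rng.choice([3, 4, 5, 7], size=120)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = DecisionTreeClassifier(random_state=42).fit(
    pd.DataFrame(toy_scaler.transform(X_toy), columns=FEATURES), y_toy
)

result = predict_size(toy_model, toy_scaler, 20, 180, 59)
print(result["size"], result["confidence"])
```

Any classifier exposing `predict_proba` and `classes_` can be passed in, including an `MLPClassifier`.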

### 7.3. Individual Test Function

#### Decision Tree

In [38]:
def test_decision_tree_only(age, height, weight):
    # Get prediction
    result_dt = predict_with_decision_tree(age, height, weight)
    
    # Display result
    print(f"{'DECISION TREE MODEL TEST':^68}")
    
    print(f"{'INPUT':^68}")
    print(f" Age:    {age} years")
    print(f" Height: {height} cm")
    print(f" Weight: {weight} kg")
    
    # Outlier status
    print(f"{'OUTLIER CHECK':^68}")
    for feature, data in result_dt['outliers'].items():
        status = "OUTLIER" if data['is_outlier'] else "Normal"
        lower, upper = data['valid_range']
        print(f"{feature.capitalize():8}: {data['value']:>6.2f} │ [{lower:>6.2f}, {upper:>6.2f}] │ {status:11}")
    
    # Prediction result
    print(f"{'PREDICTION RESULT':^68}")
    print(f"Predicted Size: {result_dt['prediction']['size']:>4} (Code: {result_dt['prediction']['size_code']})")
    print(f"Confidence:     {result_dt['prediction']['confidence']:>6.2f}%")
    print(f"{'PROBABILITY DISTRIBUTION':^68}")
    
    for size, prob in result_dt['probabilities'].items():
        bar_length = int(prob / 2.5)
        bar = '█' * bar_length
        marker = 'PREDICTED' if size == result_dt['prediction']['size'] else ''
        print(f"{size:>4}: {prob:>6.2f}% │{bar:<40}│{marker:14}")
    
    
    # Summary
    print(f"{'FINAL RECOMMENDATION':^68}")
    print(f"Recommended Size: {result_dt['prediction']['size']:>4}")
    print(f"Confidence Level: {result_dt['prediction']['confidence']:>6.2f}%")
    
    if result_dt['has_outliers']:
        print("Note: Input contains outliers")
        
    return result_dt

print("Decision Tree model testing functions created!")

Decision Tree model testing functions created!


#### MLP

In [41]:
def test_mlp_only(age, height, weight):

    # Get prediction
    result_mlp = predict_with_mlp(age, height, weight)
    
    # Display result
    print(f"{'MLP NEURAL NETWORK MODEL TEST':^68}")
    
    print(f"{'INPUT':^68}")
    print(f"Age:    {age} years")
    print(f"Height: {height} cm")
    print(f"Weight: {weight} kg")
    
    # Outlier status
    print(f"{'OUTLIER CHECK':^68}")
    for feature, data in result_mlp['outliers'].items():
        status = "OUTLIER" if data['is_outlier'] else "Normal"
        lower, upper = data['valid_range']
        print(f"{feature.capitalize():8}: {data['value']:>6.2f} [{lower:>6.2f}, {upper:>6.2f}]  {status:11} ")
    
    # Prediction result
    print(f"{'PREDICTION RESULT':^68}")
    print(f"Predicted Size: {result_mlp['prediction']['size']:>4} (Code: {result_mlp['prediction']['size_code']})")
    print(f"Confidence:     {result_mlp['prediction']['confidence']:>6.2f}%")
    print(f"{'PROBABILITY DISTRIBUTION':^68}")
    
    for size, prob in result_mlp['probabilities'].items():
        bar_length = int(prob / 2.5)
        bar = '█' * bar_length
        marker = 'PREDICTED' if size == result_mlp['prediction']['size'] else ''
        print(f"{size:>4}: {prob:>6.2f}% │{bar:<40}│{marker:14}")
    # Summary
    print(f"{'FINAL RECOMMENDATION':^68}")
    print(f"Recommended Size: {result_mlp['prediction']['size']:>4}")
    print(f"Confidence Level: {result_mlp['prediction']['confidence']:>6.2f}%")

    if result_mlp['has_outliers']:
        print("Note: Input contains outliers")
    
    return result_mlp

print("MLP model testing functions created!")

MLP model testing functions created!


## 8. Testing the Model

In [43]:
# Test Case: Young person with potential outlier (very light for height)
test_decision_tree_only(20, 180, 59)

                      DECISION TREE MODEL TEST                      
                               INPUT                                
 Age:    20 years
 Height: 180 cm
 Weight: 59 kg
                           OUTLIER CHECK                            
Age     :  20.00 │ [ 18.50,  46.50] │ Normal     
Height  : 180.00 │ [144.78, 185.42] │ Normal     
Weight  :  59.00 │ [ 40.00,  80.00] │ Normal     
                         PREDICTION RESULT                          
Predicted Size:    L (Code: 4)
Confidence:     100.00%
                      PROBABILITY DISTRIBUTION                      
 XXS:   0.00% │                                        │              
   S:   0.00% │                                        │              
   M:   0.00% │                                        │              
   L: 100.00% │████████████████████████████████████████│PREDICTED     
  XL:   0.00% │                                        │              
 XXL:   0.00% │                               

{'model': 'Decision Tree',
 'input': {'age': 20, 'height': 180, 'weight': 59},
 'standardized': {'age': -2.0730186532353825,
  'height': 2.150759800397594,
  'weight': -0.21368581161863567},
 'outliers': {'age': {'is_outlier': np.False_,
   'value': 20.0,
   'valid_range': (18.5, 46.5)},
  'height': {'is_outlier': np.False_,
   'value': 180.0,
   'valid_range': (144.78000000000003, 185.42000000000002)},
  'weight': {'is_outlier': np.False_,
   'value': 59.0,
   'valid_range': (40.0, 80.0)}},
 'has_outliers': False,
 'prediction': {'size': 'L', 'size_code': 4, 'confidence': 100.0},
 'probabilities': {'XXS': 0.0,
  'S': 0.0,
  'M': 0.0,
  'L': 100.0,
  'XL': 0.0,
  'XXL': 0.0,
  'XXXL': 0.0}}
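
The 100.00% confidence is most likely a property of the tree rather than a sign of certainty. The Decision Tree in Section 6 uses default settings, so it grows until its leaves are pure, and `predict_proba` returns the class fractions of the leaf an input falls into. The sketch below illustrates this on hypothetical random data where the labels are unrelated to the features; `min_samples_leaf=25` is an arbitrary value chosen for the comparison.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))     # hypothetical features
y_demo = rng.integers(1, 4, size=500)  # labels unrelated to the features

# Default settings: the tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
# Larger leaves: probabilities become class fractions within each leaf
pruned_tree = DecisionTreeClassifier(min_samples_leaf=25, random_state=42).fit(
    X_demo, y_demo
)

full_probs = full_tree.predict_proba(X_demo[:5])
pruned_probs = pruned_tree.predict_proba(X_demo[:5])

print("Unconstrained tree:\n", full_probs)
print("min_samples_leaf=25:\n", pruned_probs)
```

The unconstrained tree reports only 0 or 1 even though the labels are pure noise, so its confidence values say little about how reliable a prediction is.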

In [44]:
test_mlp_only(19, 180, 59)

                   MLP NEURAL NETWORK MODEL TEST                    
                               INPUT                                
Age:    19 years
Height: 180 cm
Weight: 59 kg
                           OUTLIER CHECK                            
Age     :  19.00 [ 18.50,  46.50]  Normal      
Height  : 180.00 [144.78, 185.42]  Normal      
Weight  :  59.00 [ 40.00,  80.00]  Normal      
                         PREDICTION RESULT                          
Predicted Size:    M (Code: 3)
Confidence:      44.78%
                      PROBABILITY DISTRIBUTION                      
 XXS:   2.46% │                                        │              
   S:  22.72% │█████████                               │              
   M:  44.78% │█████████████████                       │PREDICTED     
   L:  24.14% │█████████                               │              
  XL:   2.68% │█                                       │              
 XXL:   0.00% │                                        

{'model': 'MLP Neural Network',
 'input': {'age': 19, 'height': 180, 'weight': 59},
 'standardized': {'age': -2.2359180821797984,
  'height': 2.150759800397594,
  'weight': -0.21368581161863567},
 'outliers': {'age': {'is_outlier': np.False_,
   'value': 19.0,
   'valid_range': (18.5, 46.5)},
  'height': {'is_outlier': np.False_,
   'value': 180.0,
   'valid_range': (144.78000000000003, 185.42000000000002)},
  'weight': {'is_outlier': np.False_,
   'value': 59.0,
   'valid_range': (40.0, 80.0)}},
 'has_outliers': False,
 'prediction': {'size': 'M', 'size_code': 3, 'confidence': 44.77860883344016},
 'probabilities': {'XXS': 2.4564014800210368,
  'S': 22.72171423350079,
  'M': 44.77860883344016,
  'L': 24.144205384626797,
  'XL': 2.6849392016068854,
  'XXL': 0.0001453394169834051,
  'XXXL': 3.2139855273873676}}