<h1 style="text-align: center; color: darkblue;">Inference</h1>

### 📑 <font color='blue'> Table of Contents </font>
1. [Introduction](#introduction)
2. [Setup](#setup)
3. [Load model and related components](#load)
4. [Get new data](#new_data) 
5. [Preprocessing](#preprocessing)
6. [Predictions](#predictions) 

<a name="introduction"></a>
## <font color="darkred"> 1. Introduction </font>

In this notebook, we demonstrate how to use our trained model to make predictions on new data.

The workflow is as follows:
    
- Load model and related components (scaler, encoder, etc.)

- Get new data

- Preprocess the data (scaling, encoding, feature selection, etc.)

- Make predictions

- Interpret the predictions

Following this order ensures that new data is prepared exactly as it was during training, so that the resulting predictions can be interpreted meaningfully.

<a name="setup"></a>
## <font color="darkred"> 2. Setup </font>

In [70]:
import json
import os

import joblib
import numpy as np
import pandas as pd
import tensorflow as tf
from joblib import load
from tensorflow import keras

This notebook uses only the most recent experiment.

In [71]:
# get last experiment

base_path = "../outputs/saved_models"

# list all experiment directories
experiments = [d for d in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, d))]

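# sort by the timestamp suffix (YYYYMMDD_HHMMSS), not by the full name,
# so that differently named experiments are still ordered chronologically
experiments.sort(key=lambda name: name.split("_")[-2:])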

# last one (most recent)
latest_experiment = experiments[-1]
latest_path = os.path.join(base_path, latest_experiment)

print("All experiments:", experiments)
print("Latest experiment:", latest_experiment)
print("Path to latest:", latest_path)


All experiments: ['experiment_baseline_standardize_20250905_192125']
Latest experiment: experiment_baseline_standardize_20250905_192125
Path to latest: ../outputs/saved_models/experiment_baseline_standardize_20250905_192125


In [72]:
experiment_path = latest_path

<a name="load"></a>
## <font color="darkred"> 3. Load model and related components </font>

**Load model**

In [73]:
model_path = os.path.join(experiment_path, "model.h5")
model = keras.models.load_model(model_path)



**Load column names**

Why do we need column names in AI projects?

When training a model, the input features have a specific meaning and order. At inference time, if we feed the model data without the same structure, the predictions become unreliable. Column names act as a blueprint: they ensure new data is preprocessed consistently, features are aligned correctly, and nothing is misplaced or missing. Without them, the model might confuse inputs (e.g., treating "age" as "income"), leading to wrong results.

In short: saving column names guarantees that training and inference speak the same “language.”
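
The idea can be sketched with a small helper that checks for missing features and restores the training order before the data reaches the model. The function name `align_columns` and the toy `age`/`income` columns are illustrative assumptions, not part of this project.

```python
import pandas as pd


def align_columns(df, expected_columns):
    """Return df with exactly the expected columns, in the expected order."""
    # Fail loudly if a feature seen during training is absent at inference time.
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Selecting by name reorders the columns and drops any extras.
    return df[list(expected_columns)]


# Hypothetical example: new data arrives with the columns in a different order.
expected = ["age", "income"]
shuffled = pd.DataFrame({"income": [52000.0], "age": [41.0]})

aligned = align_columns(shuffled, expected)
print(list(aligned.columns))
```

Selecting by name rather than by position is what prevents "age" from being read as "income" when the incoming column order differs.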

In [75]:
# Load column names
with open(f"{experiment_path}/columns.json", "r") as f:
    columns = json.load(f)

**Load scaler and encoder**

In [76]:
scaler_path = os.path.join(experiment_path, "scaler.pkl")
encoder_path = os.path.join(experiment_path, "encoder.pkl")

In [77]:
# scaler
scaler = load(scaler_path)

# See type of scaler
print(type(scaler))

# Main learned attributes
print("mean:", getattr(scaler, "mean_", None)) # mean for every feature
print("var:", getattr(scaler, "var_", None)) # variance for every feature


<class 'sklearn.preprocessing._data.StandardScaler'>
mean: [1.41559238e+01 1.93511328e+01 9.21518750e+01 6.58153516e+02
 9.61988672e-02 1.03554531e-01 8.85161713e-02 4.88897402e-02
 1.81255273e-01 6.27087305e-02 4.09529102e-01 1.21794902e+00
 2.90134512e+00 4.10547617e+01 6.94725781e-03 2.51113359e-02
 3.16497336e-02 1.17416348e-02 2.04345078e-02 3.75897129e-03
 1.63169453e+01 2.57480273e+01 1.07621934e+02 8.86556445e+02
 1.32138906e-01 2.53280762e-01 2.71695561e-01 1.14682229e-01
 2.90017188e-01 8.38891016e-02]
var: [1.25668844e+01 1.85788022e+01 5.99698209e+02 1.26879396e+05
 2.01562007e-04 2.81310339e-03 6.48463278e-03 1.53749022e-03
 7.52674738e-04 4.73514193e-05 8.17568545e-02 3.12598201e-01
 4.38201521e+00 2.22211910e+03 8.32386721e-06 2.98804298e-04
 9.40867147e-04 3.93270921e-05 6.84521929e-05 6.59908845e-06
 2.36568213e+01 3.76441772e+01 1.15197767e+03 3.32406115e+05
 5.35499243e-04 2.45697564e-02 4.37180255e-02 4.36030274e-03
 3.76101920e-03 3.24277262e-04]


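**Note:** `joblib.load` unpickles the file, and unpickling can execute arbitrary code, so only load artifacts from sources you trust. See the scikit-learn notes on persistence: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations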


In [78]:
encoder = joblib.load(encoder_path)
encoder

{'M': 1, 'B': 0}

<a name="new_data"></a>
## <font color="darkred"> 4. Get new data </font>

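In this example, we simulate a new sample by randomly perturbing a reference observation: each feature is multiplied by a random factor between 0.3 and 1.7.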

In [98]:
# Original data for reference
original = np.array([7.76, 24.54, 47.92, 181.0, 0.05263, 0.04362, 0.0, 0.0,
       0.1587, 0.05884, 0.3857, 1.428, 2.548, 19.15, 0.007189, 0.00466,
       0.0, 0.0, 0.02676, 0.002783, 9.456, 30.37, 59.16, 268.6, 0.08996,
       0.06444, 0.0, 0.0, 0.2871, 0.07039])

# Generate a new vector: similar scale, but perturbed enough
np.random.seed(42)  # for reproducibility
noise_factor = 0.7  # larger factor → more difference
new_data = original * (1 + noise_factor * (2 * np.random.rand(*original.shape) - 1))

new_data

array([6.39700385e+00, 4.00247407e+01, 6.34840096e+01, 2.06000060e+02,
       2.72847655e-02, 2.26122734e-02, 0.00000000e+00, 0.00000000e+00,
       1.81165733e-01, 7.59801867e-02, 1.26825215e-01, 2.36744378e+00,
       3.73388939e+00, 1.14378116e+01, 3.98669556e-03, 2.59453102e-03,
       0.00000000e+00, 0.00000000e+00, 2.42103882e-02, 1.96958698e-03,
       1.09367534e+01, 1.50420000e+01, 4.19445884e+01, 2.18346708e+02,
       8.44272781e-02, 9.01674345e-02, 0.00000000e+00, 0.00000000e+00,
       3.24245112e-01, 2.56945024e-02])

<a name="preprocessing"></a>
## <font color="darkred"> 5. Preprocess new data </font>

In [99]:
# Add column names

# Convert to DataFrame with proper column names
new_data_df = pd.DataFrame([new_data], columns=columns)

# visualization
new_data_df.head()


|   | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.397004 | 40.024741 | 63.48401 | 206.00006 | 0.027285 | 0.022612 | 0.0 | 0.0 | 0.181166 | 0.07598 | ... | 10.936753 | 15.042 | 41.944588 | 218.346708 | 0.084427 | 0.090167 | 0.0 | 0.0 | 0.324245 | 0.025695 |


In [100]:
scaled_data = scaler.transform(new_data_df)

scaled_data

array([[-2.18870618e+00,  4.79631192e+00, -1.17065515e+00,
        -1.26937606e+00, -4.85404461e+00, -1.52609816e+00,
        -1.09920812e+00, -1.24684236e+00, -3.26372864e-03,
         1.92864403e+00, -9.88711742e-01,  2.05595572e+00,
         3.97713558e-01, -6.28284966e-01, -1.02615183e+00,
        -1.30260681e+00, -1.03182467e+00, -1.87233109e+00,
         4.56377834e-01, -6.96565004e-01, -1.10616414e+00,
        -1.74493577e+00, -1.93505609e+00, -1.15898629e+00,
        -2.06179179e+00, -1.04061246e+00, -1.29942816e+00,
        -1.73675198e+00,  5.58120260e-01, -3.23165085e+00]])
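
As a sanity check, the scaled values can be reproduced by hand from the scaler's learned attributes. The sketch below assumes the scaler was fitted with its default settings (centering and scaling both enabled), in which case each feature is transformed as `(x - mean_) / sqrt(var_)`. The numbers are copied from the outputs above for the first two features only.

```python
import numpy as np

# Values copied from the notebook outputs (first two features only).
mean = np.array([14.1559238, 19.3511328])   # scaler.mean_[:2]
var = np.array([12.5668844, 18.5788022])    # scaler.var_[:2]
x = np.array([6.39700385, 40.0247407])      # new_data[:2]

# Standardization: subtract the training mean, divide by the training std.
z = (x - mean) / np.sqrt(var)
print(z)  # approximately [-2.1887, 4.7963]
```

These match the first two entries of `scaled_data`, which confirms that the new sample is being scaled with the training statistics rather than its own.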

<a name="predictions"></a>
## <font color="darkred"> 6. Predictions </font>

In [101]:
y_pred_prob = model.predict(scaled_data)
y_pred_prob

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step


array([[7.0527065e-05]], dtype=float32)

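The predicted probability is about 0.0000705 (≈0.007%), which is extremely close to zero. Because the encoder maps `'M'` to 1 and `'B'` to 0, this value is the estimated probability of the malignant class (assuming the model was trained on labels encoded this way), so the model is almost certain the tumor is not malignant. Keep in mind that this sample is simulated and that several of its scaled features lie more than three standard deviations from the training mean (for example 4.80 and -4.85), so the confidence should be read with caution.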

In [103]:
y_pred = (y_pred_prob > 0.5).astype(int)
y_pred

array([[0]])
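
To complete the final workflow step, the numeric class can be mapped back to its original label by inverting the encoder dictionary. This is a minimal sketch, assuming the model's single output is the probability of the class encoded as 1 (`'M'`); the names `decoder` and `threshold` are introduced here for illustration only.

```python
import numpy as np

# Mapping shown above when loading encoder.pkl.
encoder = {"M": 1, "B": 0}
# Invert it so that class codes map back to labels.
decoder = {code: label for label, code in encoder.items()}

# Model output copied from the prediction cell above.
y_pred_prob = np.array([[7.0527065e-05]], dtype=np.float32)
threshold = 0.5  # same cut-off used in the notebook

y_pred = (y_pred_prob > threshold).astype(int)
labels = [decoder[int(code)] for code in y_pred.ravel()]
print(labels)  # ['B']
```

The 0.5 threshold is a convention rather than a requirement; in a screening setting a lower cut-off may be preferable if missing a malignant case is costlier than a false alarm.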