## Serialization / Saving the Model
Before deployment, the trained model must be saved in a format that allows easy loading and inference. Common formats include:
- Pickle (.pkl) - Python's built-in object serialization; suited to simple models that stay in Python environments.
- Joblib (.joblib) - Efficient for scikit-learn models, optimized for objects that hold large NumPy arrays.
- ONNX (.onnx) - Cross-platform deployment (models work outside Python), optimized for fast inference in cloud and edge environments.
- JSON/YAML (.json/.yaml) - Text-based storage formats, useful for saving the model structure so it can be rebuilt easily.
- TensorFlow SavedModel / HDF5 (.h5) - Used for deep learning models.
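
As a rough illustration of the difference between a binary format and a text format, the sketch below saves and reloads the same parameters with `pickle` and with `json`. The `model_state` dictionary and the `predict` function are hypothetical stand-ins for a trained model, not part of any library.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical "trained model": the learned parameters of a linear scorer.
model_state = {"weights": [0.4, -0.2], "bias": 0.1}


def predict(state, row):
    """Return 1 if the linear score is positive, else 0."""
    score = sum(w * x for w, x in zip(state["weights"], row)) + state["bias"]
    return int(score > 0)


with tempfile.TemporaryDirectory() as tmp:
    # Pickle: binary, Python-only, can store arbitrary Python objects.
    pkl_path = Path(tmp) / "model.pkl"
    pkl_path.write_bytes(pickle.dumps(model_state))
    from_pickle = pickle.loads(pkl_path.read_bytes())

    # JSON: human-readable text, limited to simple types (numbers, lists, dicts).
    json_path = Path(tmp) / "model.json"
    json_path.write_text(json.dumps(model_state))
    from_json = json.loads(json_path.read_text())

row = [1.0, 2.0]
print(predict(from_pickle, row), predict(from_json, row))
```

Both copies give the same prediction because they hold the same parameters. Note that unpickling can execute arbitrary code, so pickle and joblib files should only be loaded from trusted sources.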

## Pipeline Packaging
After you have trained multiple models, selected the best one, and finalized your preprocessing steps, create a unified pipeline that includes both the preprocessing and the selected model. This ensures that training and inference apply the same transformations and prevents issues such as scaling the data differently at different stages.

- Common Preprocessing Steps Required at Inference Time:
    - Feature scaling/normalization
    - One-hot encoding
    - Feature transformations
    - Handling missing values
    - Text vectorization
    - Dimensionality reduction
    - Any custom feature engineering

- Implementation Methods:
    - Scikit-learn Pipelines
    - TensorFlow Transform
    - Feature stores
    - Custom preprocessing modules bundled with the model

- Key Takeaways:
    - You don't retrain the model; you load the pre-trained model and integrate it into the pipeline.
    - Ensure all preprocessing steps are identical to those used during training.
    - Fit the preprocessing steps (and only those) on the training data before adding them to the pipeline; an unfitted transformer raises an error at inference time.
    - Save the complete pipeline so that inference can be run in a single step later.
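
The takeaways above can be sketched on synthetic data as follows. This is a minimal illustration, not a reference implementation: the data, the column names, and the `LogisticRegression` stand-in for the "already trained" model are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=200),
    "category": rng.choice(["electronics", "grocery", "travel"], size=200),
})
y_train = (X_train["amount"] > 500).astype(int)

# Fit the preprocessing on the training data
preprocessor = ColumnTransformer([
    ("scaler", StandardScaler(), ["amount"]),
    ("encoder", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])
X_train_prepared = preprocessor.fit_transform(X_train)

# Stand-in for the pre-trained model, trained on the prepared features
model = LogisticRegression().fit(X_train_prepared, y_train)

# Assemble the already fitted pieces; pipeline.fit is not called again
pipeline = Pipeline([("preprocessor", preprocessor), ("model", model)])

# The pipeline should reproduce the manual two-step prediction
X_new = pd.DataFrame({"amount": [50.0, 900.0], "category": ["travel", "unknown"]})
manual = model.predict(preprocessor.transform(X_new))
combined = pipeline.predict(X_new)
print(manual, combined)
```

The check at the end compares the pipeline's output with preprocessing and prediction done by hand; if the two disagree, the preprocessing inside the pipeline does not match what the model was trained on.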


#### Example 1

In [None]:
import joblib
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# 1️⃣ Load the selected model (previously trained)
best_model = joblib.load("best_xgboost_model.joblib")

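# 2️⃣ Define the preprocessing steps based on what was used during training
preprocessor = ColumnTransformer([
    ('scaler', StandardScaler(), ['amount', 'transactions']),
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['category'])
])

# The preprocessor must be fitted before it can transform new data.
# ASSUMPTION: X_train is the same DataFrame that was used to train best_model
# (it is not defined in this snippet and must be loaded beforehand).
preprocessor.fit(X_train)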

# 3️⃣ Create the final pipeline (preprocessing + final model)
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', best_model)  # Use the already trained model
])

# 4️⃣ Save the final pipeline
joblib.dump(final_pipeline, "fraud_detection_pipeline.joblib")

print("Final pipeline saved successfully!")


Loading and Running Inference:

Remember to ensure that the new input data is structured like the training data (same column names and types).

In [None]:
import joblib
import pandas as pd

# 1️⃣ Load the saved pipeline
pipeline = joblib.load("fraud_detection_pipeline.joblib")

# 2️⃣ Prepare new data (it should match the format used during training)
new_data = pd.DataFrame({
    'amount': [500.0],
    'transactions': [10],
    'category': ['electronics']
})

# 3️⃣ Run inference (both preprocessing & model prediction are handled in one step)
prediction = pipeline.predict(new_data)

print(f"Predicted class: {prediction[0]}")


# If you need confidence scores, use .predict_proba() instead:

probabilities = pipeline.predict_proba(new_data)
print(f"Fraud Probability: {probabilities[0][1]:.2f}")
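
The reminder that new input must match the training structure can be enforced with a small check before calling the pipeline. The sketch below is one possible approach; the function name `find_schema_problems` and the two column lists are hypothetical and simply mirror the columns used in the example above.

```python
import pandas as pd

# Hypothetical schema, mirroring the columns used in the example above
NUMERIC_COLUMNS = ["amount", "transactions"]
CATEGORICAL_COLUMNS = ["category"]


def find_schema_problems(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; empty if the frame looks valid."""
    problems = []
    # Every column seen during training must be present
    for col in NUMERIC_COLUMNS + CATEGORICAL_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    # Numeric columns must actually hold numeric data
    for col in NUMERIC_COLUMNS:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"column {col} is not numeric")
    return problems


good = pd.DataFrame({"amount": [500.0], "transactions": [10], "category": ["electronics"]})
bad = pd.DataFrame({"amount": ["500"], "category": ["electronics"]})

print(find_schema_problems(good))
print(find_schema_problems(bad))
```

Running a check like this first turns a confusing error from deep inside the pipeline into a clear message about which column is wrong.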



#### Example 2

In [None]:
import joblib
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif

# 1️⃣ Load dataset
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names  # Get feature names for reference

# Introduce some missing values (for demonstration purposes)
np.random.seed(42)
X[np.random.randint(0, X.shape[0], 5), np.random.randint(0, X.shape[1], 5)] = np.nan

# 2️⃣ Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3️⃣ Define a Custom Transformer for manual feature manipulation
class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # No fitting required for transformation
    
    def transform(self, X):
        X = X.copy()  # Ensure we don't modify the original data
        
        # Example manual transformation: sqrt of |col0 * col1| as a new feature.
        # NaNs in either column propagate here and are filled by the imputer step.
        new_feature = np.sqrt(np.abs(X[:, 0] * X[:, 1]))
        new_feature = new_feature.reshape(-1, 1)  # Column vector so it can be stacked
        
        return np.hstack((X, new_feature))  # Append the new feature to the dataset

# 4️⃣ Define a pipeline with multiple preprocessing steps
pipeline = Pipeline([
    ('custom_transform', CustomFeatureTransformer()),  # Manual transformation
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler()),  # Normalize features
    ('feature_selection', SelectKBest(f_classif, k=3)),  # Select top 3 features
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))  # Train the model
])

# 5️⃣ Train the pipeline
pipeline.fit(X_train, y_train)

# 6️⃣ Save the trained pipeline
joblib.dump(pipeline, "iris_pipeline_custom.pkl")

# 7️⃣ Load the saved pipeline
loaded_pipeline = joblib.load("iris_pipeline_custom.pkl")

# 8️⃣ Make predictions (preprocessing + model inference in one step)
y_pred = loaded_pipeline.predict(X_test)
print("Sample predictions:", y_pred[:5])


#### Example 3

In [None]:
import numpy as np
import pandas as pd
import joblib
import onnx
import skl2onnx
import onnxruntime as ort

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# 1️⃣ Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
y = iris.target

# Simulate categorical features for demonstration
X['flower_type'] = np.random.choice(['A', 'B', 'C'], size=len(X))  # Fake categorical feature

# 2️⃣ Define preprocessing: Scale numerical features & encode categorical ones
preprocessor = ColumnTransformer([
    ('scaler', StandardScaler(), ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']),
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['flower_type'])
])

# 3️⃣ Create pipeline with preprocessing + model
model = LogisticRegression()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# 4️⃣ Train the pipeline
pipeline.fit(X, y)

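# 5️⃣ Convert the trained classifier to ONNX format
# Only the classifier is converted here, so preprocessing stays in scikit-learn
# and must be applied before ONNX inference (see step 7).
# Input width: 4 scaled numeric columns + 3 one-hot columns = 7 features.
n_features = X.shape[1] - 1 + 3
initial_type = [('float_input', FloatTensorType([None, n_features]))]
onnx_model = convert_sklearn(pipeline.named_steps['classifier'], initial_types=initial_type)
onnx.save_model(onnx_model, "logistic_regression_pipeline.onnx")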

# 6️⃣ Load ONNX model for inference
ort_session = ort.InferenceSession("logistic_regression_pipeline.onnx")
input_name = ort_session.get_inputs()[0].name

# 7️⃣ Prepare new data for inference (must match training structure)
new_data = pd.DataFrame({
    'sepal_length': [5.1],
    'sepal_width': [3.5],
    'petal_length': [1.4],
    'petal_width': [0.2],
    'flower_type': ['B']
})

# Apply same preprocessing before ONNX inference
new_data_transformed = preprocessor.transform(new_data)

# 8️⃣ Run inference using ONNX
predictions = ort_session.run(None, {input_name: new_data_transformed.astype('float32')})

print("ONNX Predictions:", predictions[0])



* A confidence score represents how sure a model is about its prediction. It is typically a probability value between 0 and 1. Confidence scores are particularly useful in classification models, where they provide a measure of certainty for each predicted class.
* Artifacts are all the files generated during the ML workflow: datasets, saved models, pipelines, logs, stored results, etc.
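
To make the confidence-score note concrete, here is a small sketch that derives a predicted class and its confidence from class probabilities. The probability values and the 0.6 review threshold are made up for illustration; in practice the array would come from something like `predict_proba`.

```python
import numpy as np

# Hypothetical predict_proba-style output: one row per sample, one column per class
probabilities = np.array([
    [0.90, 0.10],
    [0.45, 0.55],
    [0.20, 0.80],
])

predicted_class = probabilities.argmax(axis=1)  # most likely class per row
confidence = probabilities.max(axis=1)          # probability of that class

# Hypothetical policy: flag predictions below a chosen confidence threshold
THRESHOLD = 0.6
needs_review = confidence < THRESHOLD

for cls, conf, review in zip(predicted_class, confidence, needs_review):
    print(f"class={cls} confidence={conf:.2f} needs_review={review}")
```

The second sample is assigned class 1 with a confidence of only 0.55, so under this assumed threshold it would be flagged for review rather than acted on automatically.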