# Getting Started with Neuralk's Foundry

## What is Neuralk's Foundry?

**Neuralk's Foundry** is a modular framework for quickly building and experimenting with data processing pipelines. It puts a strong emphasis on machine learning, especially tabular data and tabular representation learning. Whether you're prototyping new ideas or building something more robust, Neuralk Foundry makes it easy to plug in your own components, customize workflows, and stay productive. Its flexible architecture supports fast iteration while keeping workflows stable, with clear and informative error messages to guide you when things go wrong.

Neuralk Foundry is the core that supports TabBench, our in-house tabular dataset benchmark.

### Why Another ML Pipeline?

Most pipelines used in benchmarks or research follow a rigid structure: they’re built for clean inputs, a single model, and a predictable output. That makes them hard to adapt or chain together when your use case goes beyond the basics. On the other hand, MLOps frameworks such as ZenML offer the flexibility to build more complex workflows, but they often come with heavy tooling and a steep setup cost. Neuralk Foundry strikes a balance: it gives you the building blocks to structure sophisticated workflows while keeping the lightweight feel of prototyping. You can start simple, grow as needed, and stay in control the whole way.



## The Simplest Example Workflow in Foundry

A gentle introduction to Neuralk's Foundry is available in [TabBench](https://github.com/Neuralk-AI/TabBench/), our tabular benchmark.

# Exploring Neuralk’s Foundry

As you may have seen in the previous tutorials, Foundry already offers a wide range of components for building machine learning workflows—even though the current version is still evolving. These components cover everything from data splitting and preprocessing to modeling and evaluation. But beyond the individual steps, Foundry also provides tools and conventions that make it easier to design, organize, and run experiments consistently.

In this section, we’ll explore the structure behind Foundry, understand how the different pieces fit together, and walk through the key utilities that help streamline the creation and execution of custom workflows.


## 🧩 How the Package Is Organized

To use Foundry effectively, it helps to understand its modular design. The `neuralk_foundry_ce` package is structured around a clean separation of concerns, with each submodule addressing a specific stage in the machine learning pipeline.

This is different from frameworks like **scikit-learn**, where pipelines and transformations tend to live in a shared namespace and follow a uniform `fit`/`transform`/`predict` interface. In Foundry, each **Step** explicitly defines its inputs and outputs, allowing greater flexibility for chaining and inspection — particularly useful when dealing with real-world workflows that deviate from standard ML patterns.
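
To make the idea of explicitly declared inputs and outputs concrete, here is a minimal sketch in plain Python. It is not Foundry's actual `Step` class: `ToyStep`, `MeanCenter`, and the `run`/`compute` methods are hypothetical names chosen for illustration. The sketch only shows how a declared contract lets a step check its inputs before running and its outputs afterwards.

```python
class ToyStep:
    """Hypothetical base class for illustration; not Foundry's actual Step."""

    name = "toy-step"
    inputs = ()   # keys the step reads from the shared data dictionary
    outputs = ()  # keys the step is allowed to write back

    def compute(self, **inputs):
        raise NotImplementedError

    def run(self, data):
        # Fail early, with a readable message, if a declared input is absent.
        missing = [key for key in self.inputs if key not in data]
        if missing:
            raise KeyError(f"{self.name}: missing inputs {missing}")
        produced = self.compute(**{key: data[key] for key in self.inputs})
        # Reject anything the step did not declare as an output.
        undeclared = set(produced) - set(self.outputs)
        if undeclared:
            raise ValueError(f"{self.name}: undeclared outputs {sorted(undeclared)}")
        return {**data, **produced}


class MeanCenter(ToyStep):
    name = "mean-center"
    inputs = ("X",)
    outputs = ("X_centered",)

    def compute(self, X):
        mean = sum(X) / len(X)
        return {"X_centered": [value - mean for value in X]}


data = MeanCenter().run({"X": [1.0, 2.0, 3.0]})
print(data["X_centered"])  # [-1.0, 0.0, 1.0]
```

Because every step states what it consumes and produces, steps can be chained and inspected without relying on a shared `fit`/`transform` convention.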

Below is a detailed breakdown of the core modules:
* **📁 `datasets`** — Utilities for loading and managing datasets.

* **🧪 `sample_selection`** — Modules for selecting subsets of samples, for example to create splits or candidate pairs.

  * `splitter` – Train/val/test strategies: Extended versions of `sklearn.model_selection` splitters with better traceability and richer outputs (e.g. `StratifiedShuffleSplitter`).
  * `blocking` – Candidate pair generation: Reduces quadratic cost in record linkage by pre-selecting promising pairs (e.g. `NearestNeighborBlocking`).

* **🧼 `feature_engineering`** — Converts raw inputs into usable representations.

  * `preprocessing` – Imputation, encoding (one-hot, target, integer), standard scaling.
  * `vectorizer` – Turns unstructured data such as text into vectors (e.g. `TfIdfVectorizer`).
  * `blocking` – Combines features from two records; computes distances/similarities.

* **🤖 `models`** — ML models and training steps, decoupled from preprocessing.

  * `classifier` / `regressor` – `XGBoost`, `CatBoost`, `LightGBM`, `MLP`
  * `embedder` – Embedding-based models (e.g., contrastive, metric learning)
  * `clustering` – `ConnectedComponentExtractor`

  Embedders can be chained into downstream classification or clustering workflows.

* **⚙️ `workflow`** — Engine that manages the definition, execution, and orchestration of workflows.
* **🔧 `utils`** — Generic helper functions used across modules.
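
The `blocking` module above is described as reducing the quadratic cost of record linkage by pre-selecting promising pairs. The sketch below illustrates that idea with a simpler technique than nearest-neighbor search: key-based blocking, where only records sharing a key are compared. The records, the `candidate_pairs` helper, and the choice of `city` as the blocking key are all assumptions made for this example; they are not part of Foundry.

```python
from collections import defaultdict
from itertools import combinations

# Toy records, invented for illustration only.
records = [
    {"id": 0, "name": "Acme Corp", "city": "Paris"},
    {"id": 1, "name": "ACME Corporation", "city": "Paris"},
    {"id": 2, "name": "Globex", "city": "Lyon"},
    {"id": 3, "name": "Globex SA", "city": "Lyon"},
    {"id": 4, "name": "Initech", "city": "Paris"},
]


def candidate_pairs(records, key):
    """Generate pairs only among records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Exhaustive comparison: n * (n - 1) / 2 pairs.
all_pairs = list(combinations([record["id"] for record in records], 2))
# Blocked comparison: pairs within each city only.
blocked_pairs = candidate_pairs(records, key=lambda record: record["city"])

print(len(all_pairs), len(blocked_pairs))  # 10 4
```

A nearest-neighbor variant would replace the exact key match with a similarity search over record embeddings, but the goal is the same: compare far fewer pairs than the full cross product.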


# Building a Simple Workflow for a Binary Classification Task

Let’s walk through a minimal but illustrative example of how to build a classification workflow using Foundry's modular steps. This will demonstrate how to connect existing components for a complete cross-validated training pipeline.

## 🧠 Task: Binary Classification on Synthetic Data

To keep things simple and reproducible, we start by generating a synthetic classification dataset with scikit-learn's `make_classification`. The dataset includes 500 samples, 10 features, and balanced binary labels. This removes any dependency on external files or preprocessing.



In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification


X, y = make_classification(n_samples=500, n_features=10, n_informative=5, 
                           n_redundant=2, n_clusters_per_class=1, random_state=42)

X = pd.DataFrame(X)


## 🧪 Building a Simple Classification Workflow



### 🔀 Step 1: Stratified Data Splitting

We use the `StratifiedShuffleSplitter` to split the dataset into training and testing sets. Stratification ensures that both splits preserve the original class distribution — which is particularly important when classes are imbalanced or rare.

Foundry’s splitter step outputs a set of boolean masks for each fold, ready to be consumed by downstream steps. Note that our generated dataset does not require any preprocessing, so we skip the preprocessing step entirely.
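
To make the idea of stratified boolean masks concrete, here is a small standard-library sketch. It is not how `StratifiedShuffleSplitter` is implemented: the `stratified_masks` function, its `test_fraction` and `seed` parameters, and the 80/20 toy labels are assumptions for illustration. It shuffles the indices of each class separately and marks the same fraction of each class as test samples.

```python
import random
from collections import defaultdict


def stratified_masks(labels, test_fraction=0.2, seed=0):
    """Return (train_mask, test_mask) as lists of booleans, stratified by label."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    test_mask = [False] * len(labels)
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        for index in indices[:n_test]:
            test_mask[index] = True
    train_mask = [not flag for flag in test_mask]
    return train_mask, test_mask


# Imbalanced toy labels: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
train_mask, test_mask = stratified_masks(labels, test_fraction=0.2, seed=0)

test_labels = [label for label, flag in zip(labels, test_mask) if flag]
print(len(test_labels), sum(test_labels))  # 20 4
```

The test set keeps the original 80/20 class ratio (16 negatives, 4 positives). In this sketch, changing `seed` plays a role similar to selecting a different fold.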


In [2]:
from neuralk_foundry_ce.sample_selection.splitter import StratifiedShuffleSplitter


splitter = StratifiedShuffleSplitter()
print(splitter.describe())

Stratified shuffle-based data splitter for classification tasks.

Inputs:
- X: Input features of the dataset
- y: Target variable to predict
- fold_index: Index of the train/test split (optional)

Outputs:
- splits: Masks for train, validation, and test sets

Parameters:
  None


### 🔮 Step 2: Model Prediction

Next, we plug in a model. In this case, we use the `XGBoostClassifier` step, an efficient baseline for binary classification tasks. This step takes as input the training/test splits and automatically evaluates model performance on each fold.


In [3]:
from neuralk_foundry_ce.models.classifier import XGBoostClassifier


classifier = XGBoostClassifier()
print(classifier.describe())


Train an XGBoost classifier on tabular data.

Inputs:
- X: Input features of the dataset
- y: Target variable to predict
- splits: Masks for train, validation, and test sets
- metric_to_optimize: Metric to optimize during model selection or hyperparameter tuning
- y_classes: Original classes before label encoding (optional)
- categorical_features: Names of columns corresponding to categorical features

Outputs:
- y_pred: Predicted target values
- y_score: Class probability distribution for each sample

Parameters:
- n_hyperopt_trials: Number of trials attempted for hyperparameter optimization (optional)


### 🧵 Step 3: Compose the Workflow

Once the data, splitter, and model are defined, we compose them into a workflow using Foundry’s orchestration engine. We also add a `ColumnTypeDetection` step, because the classifier expects a `categorical_features` input and this step provides it. The engine handles dependency resolution, runs each step in order, and collects the final outputs, including the metrics for each fold.


In [4]:
from neuralk_foundry_ce.workflow import WorkFlow
from neuralk_foundry_ce.feature_engineering.preprocessing import ColumnTypeDetection


workflow = WorkFlow(steps=[splitter, ColumnTypeDetection(), classifier], cache_dir=None)
workflow.display()
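
The dependency resolution mentioned above can be pictured with a short sketch. This is an illustration under assumptions, not Foundry's engine: `order_steps` is a hypothetical helper, the step labels are invented, and the input and output names are copied from the `describe()` outputs shown in this tutorial (optional inputs omitted). A step becomes runnable once all of its declared inputs are available.

```python
def order_steps(steps, initial_keys):
    """Order steps so that each one runs only after its inputs are available."""
    available = set(initial_keys)
    remaining = list(steps)
    ordered = []
    while remaining:
        ready = [step for step in remaining if set(step["inputs"]) <= available]
        if not ready:
            missing = {
                step["name"]: sorted(set(step["inputs"]) - available)
                for step in remaining
            }
            raise ValueError(f"Unresolvable inputs: {missing}")
        step = ready[0]
        ordered.append(step["name"])
        available |= set(step["outputs"])
        remaining.remove(step)
    return ordered


# Deliberately listed out of order to show the resolution.
steps = [
    {
        "name": "classifier",
        "inputs": ["X", "y", "splits", "metric_to_optimize", "categorical_features"],
        "outputs": ["y_pred", "y_score"],
    },
    {"name": "splitter", "inputs": ["X", "y"], "outputs": ["splits"]},
    {
        "name": "column-type-detection",
        "inputs": ["X", "y"],
        "outputs": [
            "numerical_features",
            "categorical_features",
            "text_features",
            "date_features",
        ],
    },
]

print(order_steps(steps, initial_keys=["X", "y", "metric_to_optimize"]))
# ['splitter', 'column-type-detection', 'classifier']
```

The classifier is scheduled last because it needs both `splits` and `categorical_features`, which only exist after the two other steps have run.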

### ✅ Final Result

At the end of this example, you’ll see:

* Accuracy on each fold
* The average accuracy across folds

In [5]:
all_accuracies = []

# Run the workflow once per fold; `fold_index` selects the train/test split.
for fold_index in range(5):
    _, metrics = workflow.run(init_data={
        'X': X,
        'y': y,
        'metric_to_optimize': 'accuracy',
        'fold_index': fold_index,
    })
    # Metrics are keyed by step name, then by metric name.
    accuracy = metrics['xgboost-classifier']['test_accuracy']
    print(f'Fold {fold_index} accuracy: {accuracy:.2f}')
    all_accuracies.append(accuracy)

print(f'Mean accuracy: {np.mean(all_accuracies):.2f}')



Fold 0 accuracy: 1.00
Fold 1 accuracy: 0.99
Fold 2 accuracy: 0.97
Fold 3 accuracy: 1.00
Fold 4 accuracy: 0.97
Mean accuracy: 0.99
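
When comparing models, it can help to report the spread across folds alongside the mean. The snippet below is a small sketch that reuses the rounded accuracies printed above (so the result is only as precise as those two-decimal values) and computes the sample standard deviation with Python's `statistics` module.

```python
import statistics

# Rounded per-fold accuracies copied from the output above.
fold_accuracies = [1.00, 0.99, 0.97, 1.00, 0.97]

mean_accuracy = statistics.mean(fold_accuracies)
# Sample standard deviation (n - 1 in the denominator).
std_accuracy = statistics.stdev(fold_accuracies)

print(f"accuracy: {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```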


## Exploring Step Internals: `describe` and Workflow Display

Each step in Foundry comes with a `describe` method that provides a clear summary of what the step does, along with its expected inputs and outputs. This helps you understand how the step fits into the overall workflow.

The same information is also used in the workflow explorer, which visually displays each step and how data flows between them. This makes it easier to debug, document, or simply understand what your pipeline is doing at a glance.


In [6]:
from neuralk_foundry_ce.feature_engineering.preprocessing import ColumnTypeDetection


print(ColumnTypeDetection.describe())

Detects column types and forwards them to the rest of the pipeline.

Inputs:
- X: Input features of the dataset
- y: Target variable to predict

Outputs:
- numerical_features: Names of the numerical feature columns
- categorical_features: Names of the categorical feature columns
- text_features: Names of the text feature columns
- date_features: Names of the date feature columns

Parameters:
- numerical_features: Names of the numerical feature columns (optional)
- categorical_features: Names of the categorical feature columns (optional)
- text_features: Names of the text feature columns (optional)
- date_features: Names of the date feature columns (optional)
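
A summary like the ones printed by `describe` can be generated directly from a step's declared metadata. The function below is a hypothetical sketch of that idea, not Foundry's implementation: the `describe` signature and the dictionary layout are assumptions, and the example text is copied from the splitter output shown earlier (required inputs only).

```python
def describe(summary, inputs, outputs, parameters=None):
    """Render a step summary from dictionaries mapping field names to descriptions."""

    def section(title, fields):
        lines = [f"{title}:"]
        if fields:
            lines += [f"- {name}: {text}" for name, text in fields.items()]
        else:
            lines.append("  None")
        return "\n".join(lines)

    parts = [
        summary,
        section("Inputs", inputs),
        section("Outputs", outputs),
        section("Parameters", parameters or {}),
    ]
    return "\n\n".join(parts)


text = describe(
    summary="Stratified shuffle-based data splitter for classification tasks.",
    inputs={
        "X": "Input features of the dataset",
        "y": "Target variable to predict",
    },
    outputs={"splits": "Masks for train, validation, and test sets"},
)
print(text)
```

Keeping the descriptions next to the declared inputs and outputs means the documentation and the workflow display can both be derived from a single source.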


In [7]:
from neuralk_foundry_ce.workflow import notebook_display


notebook_display([StratifiedShuffleSplitter(), ColumnTypeDetection()])

## Easily Swapping Steps with Variants

Sometimes it’s useful to explore alternative options for a given step—especially when tuning a pipeline or comparing model performance. Foundry makes this easy with the `list_variants` method.

This method shows you other steps that serve a similar purpose to the one you're currently using. It's particularly handy for models, allowing you to quickly test different algorithms without changing the structure of your workflow.


In [8]:
XGBoostClassifier.list_variants()

Step [xgboost-classifier] is a leaf step. Listing sibling variants from parent [classifier-model]:
- CatBoostClassifier (name='catboost-classifier'): Train a CatBoost classifier on tabular data.
- LightGBMClassifier (name='lightgbm-classifier'): Train a LightGBM classifier on tabular data.
- MLPClassifier (name='mlp-classifier'): Train a neural network (MLP) classifier with categorical embeddings on tabular data.
- TabICLClassifier (name='tabicl-classifier'): Apply a TabICL classifier to tabular data.
- TabPFNClassifier (name='tabpfn-classifier'): Train a TabPFN classifier on tabular data.
- XGBoostClassifier (name='xgboost-classifier'): Train an XGBoost classifier on tabular data.
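
The output above suggests that variants are discovered through a class hierarchy: a leaf step lists the siblings that share its parent. The sketch below shows one way such a lookup could work in plain Python using `__subclasses__()`. All class names here (`ToyStep`, `ToyClassifierModel`, and the three toy classifiers) are hypothetical and only the step name strings are borrowed from the listing; Foundry's actual mechanism may differ.

```python
class ToyStep:
    """Hypothetical base class for illustration; not Foundry's actual Step."""

    name = "step"

    @classmethod
    def list_variants(cls):
        # A leaf class (no subclasses) lists its siblings via its direct parent;
        # a non-leaf class lists its own direct subclasses.
        parent = cls if cls.__subclasses__() else cls.__mro__[1]
        return sorted(sub.name for sub in parent.__subclasses__())


class ToyClassifierModel(ToyStep):
    name = "classifier-model"


class ToyXGBoost(ToyClassifierModel):
    name = "xgboost-classifier"


class ToyCatBoost(ToyClassifierModel):
    name = "catboost-classifier"


class ToyLightGBM(ToyClassifierModel):
    name = "lightgbm-classifier"


print(ToyXGBoost.list_variants())
# ['catboost-classifier', 'lightgbm-classifier', 'xgboost-classifier']
```

Since sibling steps share the same declared inputs and outputs, any of them can replace the classifier in the `steps` list without changing the rest of the workflow.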
