Merged
Changes from all commits (65 commits)
f74cee4
docstring fixed for default batch size
mkumar73 Jan 4, 2025
63af2eb
fix: extend parameter for preprocessing
mkumar73 Jan 8, 2025
8e8579c
fix splines, include sigmoid and rbf
mkumar73 Jan 8, 2025
7169e96
rbf and sigmoid expansion
mkumar73 Jan 10, 2025
10fd848
scaling strategy included for ple, splines etc.
mkumar73 Jan 10, 2025
b398d13
Merge pull request #201 from basf/feat/splines
AnFreTh Jan 10, 2025
e437f3c
fix sklearn warnings
AnFreTh Jan 15, 2025
d4c61f3
include predict step
AnFreTh Jan 17, 2025
e3e39bf
adjust datamodule and dataset to include prediction dataset
AnFreTh Jan 17, 2025
8cc1e79
fix batch prediction in sklearn models
AnFreTh Jan 17, 2025
792c4a2
format
AnFreTh Jan 17, 2025
f37b6d3
adapt lightningmodule to have custom metrics
AnFreTh Jan 17, 2025
44778a1
assign datasets
AnFreTh Jan 17, 2025
fd9a257
include passing metrics into sklearn models
AnFreTh Jan 17, 2025
c5d2931
fix ensemble prediction bug
AnFreTh Jan 17, 2025
850a5cc
include sentence/word embeddings as preprocessing techniques for cate…
AnFreTh Jan 17, 2025
50a3883
make sentence_transformer input optional dependency
AnFreTh Jan 17, 2025
fac6a1f
include encoding function to create embeddings
AnFreTh Jan 18, 2025
d08af31
adjust order in __getitem__ functionality and batch for lightningmodule
AnFreTh Jan 18, 2025
40fef33
include encoding function in sklearn base classes
AnFreTh Jan 18, 2025
0708f3f
fix: sentence-transformers included
mkumar73 Jan 19, 2025
c4df541
fix: B904
mkumar73 Jan 19, 2025
2473e5c
chore: auto formatting
mkumar73 Jan 19, 2025
75c2d1b
exclude sentence-transformers
mkumar73 Jan 19, 2025
4a76db9
Merge pull request #202 from basf/util_fixes
mkumar73 Jan 19, 2025
2a65660
adapt embedding layer to new input format of tuple information
AnFreTh Jan 23, 2025
4d5f94a
adapt basemodel encoding function to tuple input
AnFreTh Jan 23, 2025
adc6d19
batch now returns tuple and *data is passed to forward method
AnFreTh Jan 23, 2025
a02b9dd
first two basemodels adapted to new logic
AnFreTh Jan 23, 2025
10d1c00
major changes in handling embeddings as array/list inputs in addition…
AnFreTh Jan 23, 2025
cbe8dd3
dataset returns tuple of data (cat, num, emb), label
AnFreTh Jan 23, 2025
b84aa50
adjust two first basemodel configs to handle projection for embeddings
AnFreTh Jan 23, 2025
8cc3e83
adapt first only regressor and classifier to handle embeddings
AnFreTh Jan 23, 2025
6c0bc5c
preprocessor does not preprocess embeddings, but takes them as input …
AnFreTh Jan 23, 2025
743c214
feature dimensions adapted to new output format of get_feature_info
AnFreTh Jan 23, 2025
4ec70f8
adapting all basemodels to new dataset __getitem__ method
AnFreTh Jan 24, 2025
a2c7845
adapt lightning layer and preprocessor to account for no passed embed…
AnFreTh Jan 24, 2025
b8bc5e9
restructure configs to create parent config-class
AnFreTh Feb 12, 2025
a4c5992
fix minor bugs related to imports and dim identification
AnFreTh Feb 12, 2025
6fc04eb
fix bug related to column names in datamodule - turn int to string
AnFreTh Feb 12, 2025
e60dd80
make box-cox strictly positive
AnFreTh Feb 12, 2025
febf165
include unit tests
AnFreTh Feb 12, 2025
161f6de
remove dependence on rotary embeddings
AnFreTh Feb 12, 2025
bd998d3
include params related to [BUG] Missing Configuration Attributes in …
AnFreTh Feb 12, 2025
44d3b3a
test new unit test for pr-requests
AnFreTh Feb 12, 2025
5fc2ed7
change py-version
AnFreTh Feb 12, 2025
e722767
adapt test to .py version 3.10
AnFreTh Feb 12, 2025
1fcb030
install poetry in workflow
AnFreTh Feb 12, 2025
ac27a1d
ensure mambular is locally installed
AnFreTh Feb 12, 2025
c379a7a
Merge pull request #214 from basf/embeddings
AnFreTh Feb 14, 2025
c3e9c90
add JohnsonSU and individual preprocessing
AnFreTh Feb 14, 2025
b10ff52
adapt embedding layer to new preprocessing
AnFreTh Feb 14, 2025
d6380fd
Merge branch 'develop' into johnson_su
AnFreTh Feb 14, 2025
2e87e87
Merge pull request #216 from basf/johnson_su
AnFreTh Feb 14, 2025
7904ae1
adapt base models
AnFreTh Feb 14, 2025
d155e22
Merge pull request #217 from basf/dev_fix
AnFreTh Feb 14, 2025
62cbc6c
adapt readme
AnFreTh Feb 16, 2025
18d954b
version fix
AnFreTh Feb 16, 2025
35ba22b
add baseconfig to init
AnFreTh Feb 16, 2025
fa2c978
lock update after torch version change
mkumar73 Feb 16, 2025
4bbf174
reformatting
mkumar73 Feb 16, 2025
3a769c1
formatting, refactor (used exception instead of assert)
mkumar73 Feb 16, 2025
3cdf998
Merge pull request #219 from basf/rdme_fix
mkumar73 Feb 17, 2025
3a20cc3
increase version
AnFreTh Feb 17, 2025
d14b666
Merge pull request #220 from basf/vs_increase
AnFreTh Feb 17, 2025
50 changes: 50 additions & 0 deletions .github/workflows/pr-tests.yml
@@ -0,0 +1,50 @@
name: PR Unit Tests

on:
  pull_request:
    branches:
      - develop
      - master # Add any other branches where you want to enforce tests

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10" # Change this to match your setup

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH
          export PATH="$HOME/.local/bin:$PATH"

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          poetry install
          pip install pytest

      - name: Install Package Locally
        run: |
          poetry build
          pip install dist/*.whl # Install the built package to fix "No module named 'mambular'"

      - name: Run Unit Tests
        env:
          PYTHONPATH: ${{ github.workspace }} # Ensure the package is discoverable
        run: pytest tests/

      - name: Verify Tests Passed
        if: ${{ success() }}
        run: echo "All tests passed! Pull request is allowed."

      - name: Fail PR on Test Failure
        if: ${{ failure() }}
        run: exit 1 # This ensures the PR cannot be merged if tests fail
124 changes: 62 additions & 62 deletions README.md
@@ -21,6 +21,17 @@

Mambular is a Python library for tabular deep learning. It includes models that leverage the Mamba (State Space Model) architecture, as well as other popular models like TabTransformer, FTTransformer, TabM and tabular ResNets. Check out our paper `Mambular: A Sequential Model for Tabular Deep Learning`, available [here](https://arxiv.org/abs/2408.06291). Also check out our paper introducing [TabulaRNN](https://arxiv.org/pdf/2411.17207) and analyzing the efficiency of NLP-inspired tabular models.

<h3>⚡ What's New ⚡</h3>
<ul>
<li>Individual preprocessing: preprocess each feature differently and use pre-trained models for categorical encoding</li>
<li>Extract latent representations of tables</li>
<li>Use embeddings as inputs</li>
<li>Define custom training metrics</li>
</ul>




<h3> Table of Contents </h3>

- [🏃 Quickstart](#-quickstart)
@@ -30,7 +41,6 @@ Mambular is a Python library for tabular deep learning. It includes models that
- [🛠️ Installation](#️-installation)
- [🚀 Usage](#-usage)
- [💻 Implement Your Own Model](#-implement-your-own-model)
-- [Custom Training](#custom-training)
- [🏷️ Citation](#️-citation)
- [License](#license)

@@ -103,6 +113,7 @@ pip install mamba-ssm
<h2> Preprocessing </h2>

Mambular simplifies data preprocessing with a range of tools designed for easy transformation of tabular data.
Specify a single default method, or pass a dictionary that maps each feature to its own preprocessing method.
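For example, a minimal sketch (the `feature_preprocessing` keyword and the column names are illustrative assumptions, not the confirmed API):

```python
from mambular.models import MambularRegressor

# Hypothetical columns "age", "income" and "city", each with its own method.
model = MambularRegressor(
    numerical_preprocessing="ple",  # default for all numerical features
    feature_preprocessing={         # assumed per-feature override mapping
        "age": "splines",
        "income": "box-cox",
        "city": "one-hot",
    },
)
```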

<h3> Data Type Detection and Transformation </h3>

@@ -116,6 +127,7 @@ Mambular simplifies data preprocessing with a range of tools designed for easy t
- **Polynomial Features**: Automatically generates polynomial and interaction terms for numerical features, enhancing the ability to capture higher-order relationships.
- **Box-Cox & Yeo-Johnson Transformations**: Performs power transformations to stabilize variance and normalize distributions.
- **Custom Binning**: Enables user-defined bin edges for precise discretization of numerical data.
- **Pre-trained Encoding**: Use sentence transformers to encode categorical features.
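A sketch of the last point (only `numerical_preprocessing="ple"` appears verbatim later in this README; the `categorical_preprocessing` keyword and the method name are assumptions):

```python
from mambular.models import MambularClassifier

# Categorical features encoded with a pre-trained sentence transformer;
# "pretrained" is a hypothetical method name used for illustration.
clf = MambularClassifier(
    numerical_preprocessing="ple",
    categorical_preprocessing="pretrained",
)
```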



@@ -147,6 +159,28 @@ preds = model.predict(X)
preds = model.predict_proba(X)
```

Get latent representations for each feature:
```python
# simple encoding
model.encode(X)
```
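The encodings can feed any downstream estimator. A sketch (assuming a NumPy array is returned; the exact output shape depends on the model, hence the flatten):

```python
from sklearn.linear_model import LogisticRegression

z = model.encode(X_train)   # latent representation of the table
z = z.reshape(len(z), -1)   # flatten in case of a per-feature axis
clf = LogisticRegression().fit(z, y_train)
```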

Use unstructured data:
```python
# load pretrained models
image_model = ...
nlp_model = ...

# create embeddings
img_embs = image_model.encode(images)
txt_embs = nlp_model.encode(texts)

# fit model on tabular data and unstructured data
model.fit(X_train, y_train, embeddings=[img_embs, txt_embs])
```
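The same embeddings presumably have to be supplied at inference time as well; a sketch mirroring the `fit` signature (an assumption, not a documented call):

```python
# test-time embeddings from the same pretrained encoders (hypothetical variables)
img_embs_test = image_model.encode(test_images)
txt_embs_test = nlp_model.encode(test_texts)

preds = model.predict(X_test, embeddings=[img_embs_test, txt_embs_test])
```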



<h3> Hyperparameter Optimization </h3>
Since all of the models are sklearn base estimators, you can use sklearn's built-in hyperparameter optimization.
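For example, a sketch with `GridSearchCV` (the tunable parameter names are assumptions; inspect `model.get_params()` for the real keys):

```python
from sklearn.model_selection import GridSearchCV
from mambular.models import MambularRegressor

# Parameter names below are illustrative.
search = GridSearchCV(
    MambularRegressor(),
    param_grid={"lr": [1e-3, 1e-4], "n_layers": [2, 4]},
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)
```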

@@ -222,9 +256,11 @@ MambularLSS allows you to model the full distribution of a response variable, no
- **studentt**: For data with heavier tails, useful with small samples.
- **negativebinom**: For over-dispersed count data.
- **inversegamma**: Often used as a prior in Bayesian inference.
- **johnsonsu**: Four-parameter distribution with location, scale, skewness, and kurtosis parameters.
- **categorical**: For data with more than two categories.
- **quantile**: For quantile regression using the pinball loss.


These distribution classes make MambularLSS versatile in modeling various data types and distributions.
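For example, a minimal sketch with the new family (whether the family is chosen at fit time, as assumed here, or in the constructor may differ):

```python
from mambular.models import MambularLSS

model = MambularLSS()
# Johnson's SU can capture skewed, heavy-tailed responses.
model.fit(X_train, y_train, family="johnsonsu")
```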


@@ -269,13 +305,16 @@ Here's how you can implement a custom model with Mambular:

```python
from dataclasses import dataclass
from mambular.configs import BaseConfig

@dataclass
-class MyConfig:
+class MyConfig(BaseConfig):
    lr: float = 1e-04
    lr_patience: int = 10
    weight_decay: float = 1e-06
    lr_factor: float = 0.1
    n_layers: int = 4
    pooling_method: str = "avg"
```
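Note that `MyConfig` now inherits from `BaseConfig`, which presumably supplies the shared defaults (such as `d_model`) that the base models later read via `self.hparams`; a custom config therefore only declares what it adds or overrides.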

2. **Second, define your model:**
@@ -290,22 +329,32 @@ Here's how you can implement a custom model with Mambular:
class MyCustomModel(BaseModel):
    def __init__(
        self,
-        cat_feature_info,
-        num_feature_info,
+        feature_information: tuple,
        num_classes: int = 1,
        config=None,
        **kwargs,
    ):
-        super().__init__(**kwargs)
-        self.save_hyperparameters(ignore=["cat_feature_info", "num_feature_info"])
+        super().__init__(**kwargs)
+        self.save_hyperparameters(ignore=["feature_information"])
+        self.returns_ensemble = False
+
+        # embedding layer
+        self.embedding_layer = EmbeddingLayer(
+            *feature_information,
+            config=config,
+        )

-        input_dim = get_feature_dimensions(num_feature_info, cat_feature_info)
+        input_dim = np.sum(
+            [len(info) * self.hparams.d_model for info in feature_information]
+        )

        self.linear = nn.Linear(input_dim, num_classes)

-    def forward(self, num_features, cat_features):
-        x = num_features + cat_features
-        x = torch.cat(x, dim=1)
+    def forward(self, *data) -> torch.Tensor:
+        x = self.embedding_layer(*data)
+        B, S, D = x.shape
+        x = x.reshape(B, S * D)

        # Pass through linear layer
        output = self.linear(x)
@@ -329,60 +378,11 @@ Here's how you can implement a custom model with Mambular:
```python
regressor = MyRegressor(numerical_preprocessing="ple")
regressor.fit(X_train, y_train, max_epochs=50)

regressor.evaluate(X_test, y_test)
```
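Here `MyRegressor` is presumably `MyCustomModel` wrapped in one of the sklearn base classes in the lines elided above; a sketch of that wrapping (class and argument names assumed):

```python
from mambular.models import SklearnBaseRegressor

class MyRegressor(SklearnBaseRegressor):
    def __init__(self, **kwargs):
        super().__init__(model=MyCustomModel, config=MyConfig, **kwargs)
```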

-# Custom Training
-If you prefer to setup custom training, preprocessing and evaluation, you can simply use the `mambular.base_models`.
-Just be careful that all basemodels expect lists of features as inputs. More precisely as list for numerical features and a list for categorical features. A custom training loop, with random data could look like this.
-
-```python
-import torch
-import torch.nn as nn
-import torch.optim as optim
-from mambular.base_models import Mambular
-from mambular.configs import DefaultMambularConfig
-
-# Dummy data and configuration
-cat_feature_info = {
-    "cat1": {
-        "preprocessing": "imputer -> continuous_ordinal",
-        "dimension": 1,
-        "categories": 4,
-    }
-}  # Example categorical feature information
-num_feature_info = {
-    "num1": {"preprocessing": "imputer -> scaler", "dimension": 1, "categories": None}
-}  # Example numerical feature information
-num_classes = 1
-config = DefaultMambularConfig()  # Use the desired configuration
-
-# Initialize model, loss function, and optimizer
-model = Mambular(cat_feature_info, num_feature_info, num_classes, config)
-criterion = nn.MSELoss()  # Use MSE for regression; change as appropriate for your task
-optimizer = optim.Adam(model.parameters(), lr=0.001)
-
-# Example training loop
-for epoch in range(10):  # Number of epochs
-    model.train()
-    optimizer.zero_grad()
-
-    # Dummy Data
-    num_features = [torch.randn(32, 1) for _ in num_feature_info]
-    cat_features = [torch.randint(0, 5, (32,)) for _ in cat_feature_info]
-    labels = torch.randn(32, num_classes)
-
-    # Forward pass
-    outputs = model(num_features, cat_features)
-    loss = criterion(outputs, labels)
-
-    # Backward pass and optimization
-    loss.backward()
-    optimizer.step()
-
-    # Print loss for monitoring
-    print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")
-```

# 🏷️ Citation

2 changes: 1 addition & 1 deletion mambular/__version__.py
@@ -16,4 +16,4 @@
#

# The following line *must* be the last in the module, exactly as formatted:
__version__ = "1.1.0"
__version__ = "1.2.0"
11 changes: 2 additions & 9 deletions mambular/arch_utils/layer_utils/attention_utils.py
@@ -5,7 +5,6 @@
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding


class GEGLU(nn.Module):
@@ -25,7 +24,7 @@ def FeedForward(dim, mult=4, dropout=0.0):


class Attention(nn.Module):
-    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
+    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
        super().__init__()
        inner_dim = dim_head * heads
        self.heads = heads
@@ -34,18 +33,13 @@ def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)
        self.dropout = nn.Dropout(dropout)
-        self.rotary = rotary
-        dim = np.int64(dim / 2)
-        self.rotary_embedding = RotaryEmbedding(dim=dim)

    def forward(self, x):
        h = self.heads
        x = self.norm(x)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))  # type: ignore
-        if self.rotary:
-            q = self.rotary_embedding.rotate_queries_or_keys(q)
-            k = self.rotary_embedding.rotate_queries_or_keys(k)
        q = q * self.scale

        sim = torch.einsum("b h i d, b h j d -> b h i j", q, k)
@@ -61,7 +55,7 @@ def forward(self, x):


class Transformer(nn.Module):
-    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary=False):
+    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout):
        super().__init__()
        self.layers = nn.ModuleList([])

@@ -74,7 +68,6 @@ def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary
                        heads=heads,
                        dim_head=dim_head,
                        dropout=attn_dropout,
-                        rotary=rotary,
                    ),
                    FeedForward(dim, dropout=ff_dropout),
                ]