Merged
Changes from all commits (65 commits)
f74cee4
docstring fixed for default batch size
mkumar73 Jan 4, 2025
63af2eb
fix: extend parameter for preprocessing
mkumar73 Jan 8, 2025
8e8579c
fix splines, include sigmoid and rbf
mkumar73 Jan 8, 2025
7169e96
rbf and sigmoid expansion
mkumar73 Jan 10, 2025
10fd848
scaling strategy included for ple, splines etc.
mkumar73 Jan 10, 2025
b398d13
Merge pull request #201 from basf/feat/splines
AnFreTh Jan 10, 2025
e437f3c
fix sklearn warnings
AnFreTh Jan 15, 2025
d4c61f3
include predict step
AnFreTh Jan 17, 2025
e3e39bf
adjust datamodule and dataset to include prediction dataset
AnFreTh Jan 17, 2025
8cc1e79
fix batch prediction in sklearn models
AnFreTh Jan 17, 2025
792c4a2
format
AnFreTh Jan 17, 2025
f37b6d3
adapt lightningmodule to have custom metrics
AnFreTh Jan 17, 2025
44778a1
assign datasets
AnFreTh Jan 17, 2025
fd9a257
include passing metrics into sklearn models
AnFreTh Jan 17, 2025
c5d2931
fix ensemble prediction bug
AnFreTh Jan 17, 2025
850a5cc
include sentence/word embeddings as preprocessing techniques for cate…
AnFreTh Jan 17, 2025
50a3883
make sentence_transformer input optional dependency
AnFreTh Jan 17, 2025
fac6a1f
include encoding function to create embeddings
AnFreTh Jan 18, 2025
d08af31
adjust order in __getitem__ functionality and batch for lightningmodule
AnFreTh Jan 18, 2025
40fef33
include encoding function in sklearn base classes
AnFreTh Jan 18, 2025
0708f3f
fix: sentence-transformers included
mkumar73 Jan 19, 2025
c4df541
fix: B904
mkumar73 Jan 19, 2025
2473e5c
chore: auto formatting
mkumar73 Jan 19, 2025
75c2d1b
exclude sentence-transformers
mkumar73 Jan 19, 2025
4a76db9
Merge pull request #202 from basf/util_fixes
mkumar73 Jan 19, 2025
2a65660
adapt embedding layer to new input format of tuple information
AnFreTh Jan 23, 2025
4d5f94a
adapt basemodel encoding function to tuple input
AnFreTh Jan 23, 2025
adc6d19
batch now returns tuple and *data is passed to forward method
AnFreTh Jan 23, 2025
a02b9dd
first two basemodels adapted to new logic
AnFreTh Jan 23, 2025
10d1c00
major changes in handling embeddings as array/list inputs in addition…
AnFreTh Jan 23, 2025
cbe8dd3
dataset returns tuple of data (cat, num, emb), label
AnFreTh Jan 23, 2025
b84aa50
adjust two first basemodel configs to handle projection for embeddings
AnFreTh Jan 23, 2025
8cc3e83
adapt first only regressor and classifier to handle embeddings
AnFreTh Jan 23, 2025
6c0bc5c
preprocessor does not preprocess embeddings, but takes them as input …
AnFreTh Jan 23, 2025
743c214
feature dimensions adapted to new output format of get_feature_info
AnFreTh Jan 23, 2025
4ec70f8
adapting all basemodels to new dataset __getitem__ method
AnFreTh Jan 24, 2025
a2c7845
adapt lightning layer and preprocessor to account for no passed embed…
AnFreTh Jan 24, 2025
b8bc5e9
restructure configs to create parent config-class
AnFreTh Feb 12, 2025
a4c5992
fix minor bugs related to imports and dim identification
AnFreTh Feb 12, 2025
6fc04eb
fix bug related to column names in datamodule - turn int to string
AnFreTh Feb 12, 2025
e60dd80
make box-cox strictly positive
AnFreTh Feb 12, 2025
febf165
include unit tests
AnFreTh Feb 12, 2025
161f6de
remove dependence on rotary embeddings
AnFreTh Feb 12, 2025
bd998d3
include params related to [BUG] Missing Configuration Attributes in …
AnFreTh Feb 12, 2025
44d3b3a
test new unit test for pr-requests
AnFreTh Feb 12, 2025
5fc2ed7
change py-version
AnFreTh Feb 12, 2025
e722767
adapt test to .py version 3.10
AnFreTh Feb 12, 2025
1fcb030
install poetry in workflow
AnFreTh Feb 12, 2025
ac27a1d
ensure mambular is locally installed
AnFreTh Feb 12, 2025
c379a7a
Merge pull request #214 from basf/embeddings
AnFreTh Feb 14, 2025
c3e9c90
add JohnsonSU and individual preprocessing
AnFreTh Feb 14, 2025
b10ff52
adapt embedding layer to new preprocessing
AnFreTh Feb 14, 2025
d6380fd
Merge branch 'develop' into johnson_su
AnFreTh Feb 14, 2025
2e87e87
Merge pull request #216 from basf/johnson_su
AnFreTh Feb 14, 2025
7904ae1
adapt base models
AnFreTh Feb 14, 2025
d155e22
Merge pull request #217 from basf/dev_fix
AnFreTh Feb 14, 2025
62cbc6c
adapt readme
AnFreTh Feb 16, 2025
18d954b
version fix
AnFreTh Feb 16, 2025
35ba22b
add baseconfig to init
AnFreTh Feb 16, 2025
fa2c978
lock update after torch version change
mkumar73 Feb 16, 2025
4bbf174
reformatting
mkumar73 Feb 16, 2025
3a769c1
formatting, refactor (used exception instead of assert)
mkumar73 Feb 16, 2025
3cdf998
Merge pull request #219 from basf/rdme_fix
mkumar73 Feb 17, 2025
3a20cc3
increase version
AnFreTh Feb 17, 2025
d14b666
Merge pull request #220 from basf/vs_increase
AnFreTh Feb 17, 2025
50 changes: 50 additions & 0 deletions .github/workflows/pr-tests.yml
@@ -0,0 +1,50 @@
name: PR Unit Tests

on:
  pull_request:
    branches:
      - develop
      - master # Add any other branches where you want to enforce tests

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10" # Change this to match your setup

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH
          export PATH="$HOME/.local/bin:$PATH"

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          poetry install
          pip install pytest

      - name: Install Package Locally
        run: |
          poetry build
          pip install dist/*.whl # Install the built package to fix "No module named 'mambular'"

      - name: Run Unit Tests
        env:
          PYTHONPATH: ${{ github.workspace }} # Ensure the package is discoverable
        run: pytest tests/

      - name: Verify Tests Passed
        if: ${{ success() }}
        run: echo "All tests passed! Pull request is allowed."

      - name: Fail PR on Test Failure
        if: ${{ failure() }}
        run: exit 1 # This ensures the PR cannot be merged if tests fail
124 changes: 62 additions & 62 deletions README.md
@@ -21,6 +21,17 @@

Mambular is a Python library for tabular deep learning. It includes models that leverage the Mamba (State Space Model) architecture, as well as other popular models like TabTransformer, FTTransformer, TabM and tabular ResNets. Check out our paper `Mambular: A Sequential Model for Tabular Deep Learning`, available [here](https://arxiv.org/abs/2408.06291). Also check out our paper introducing [TabulaRNN](https://arxiv.org/pdf/2411.17207) and analyzing the efficiency of NLP-inspired tabular models.

<h3>⚡ What's New ⚡</h3>
<ul>
<li>Individual preprocessing: preprocess each feature differently and use pre-trained models for categorical encoding</li>
<li>Extract latent representations of tables</li>
<li>Use embeddings as inputs</li>
<li>Define custom training metrics</li>
</ul>




<h3> Table of Contents </h3>

- [🏃 Quickstart](#-quickstart)
@@ -30,7 +41,6 @@ Mambular is a Python library for tabular deep learning. It includes models that
- [🛠️ Installation](#️-installation)
- [🚀 Usage](#-usage)
- [💻 Implement Your Own Model](#-implement-your-own-model)
-- [Custom Training](#custom-training)
- [🏷️ Citation](#️-citation)
- [License](#license)

@@ -103,6 +113,7 @@ pip install mamba-ssm
<h2> Preprocessing </h2>

Mambular simplifies data preprocessing with a range of tools designed for easy transformation of tabular data.
Specify a single default method, or pass a dictionary that maps each feature to its own preprocessing method.
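For example, a minimal sketch (the `feature_preprocessing` keyword and the column names are illustrative assumptions, not the confirmed API):

```python
from mambular.models import MambularRegressor

# Hypothetical columns "age", "income" and "city", each with its own method.
model = MambularRegressor(
    numerical_preprocessing="ple",  # default for all numerical features
    feature_preprocessing={         # assumed per-feature override mapping
        "age": "splines",
        "income": "box-cox",
        "city": "one-hot",
    },
)
```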

<h3> Data Type Detection and Transformation </h3>

@@ -116,6 +127,7 @@ Mambular simplifies data preprocessing with a range of tools designed for easy t
- **Polynomial Features**: Automatically generates polynomial and interaction terms for numerical features, enhancing the ability to capture higher-order relationships.
- **Box-Cox & Yeo-Johnson Transformations**: Performs power transformations to stabilize variance and normalize distributions.
- **Custom Binning**: Enables user-defined bin edges for precise discretization of numerical data.
- **Pre-trained Encoding**: Use sentence transformers to encode categorical features.
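A sketch of the last point (only `numerical_preprocessing="ple"` appears verbatim later in this README; the `categorical_preprocessing` keyword and the method name are assumptions):

```python
from mambular.models import MambularClassifier

# Categorical features encoded with a pre-trained sentence transformer;
# "pretrained" is a hypothetical method name used for illustration.
clf = MambularClassifier(
    numerical_preprocessing="ple",
    categorical_preprocessing="pretrained",
)
```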



@@ -147,6 +159,28 @@ preds = model.predict(X)
preds = model.predict_proba(X)
```

Get latent representations for each feature:
```python
# simple encoding
model.encode(X)
```
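The encodings can feed any downstream estimator. A sketch (assuming a NumPy array is returned; the exact output shape depends on the model, hence the flatten):

```python
from sklearn.linear_model import LogisticRegression

z = model.encode(X_train)   # latent representation of the table
z = z.reshape(len(z), -1)   # flatten in case of a per-feature axis
clf = LogisticRegression().fit(z, y_train)
```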

Use unstructured data:
```python
# load pretrained models
image_model = ...
nlp_model = ...

# create embeddings
img_embs = image_model.encode(images)
txt_embs = nlp_model.encode(texts)

# fit model on tabular data and unstructured data
model.fit(X_train, y_train, embeddings=[img_embs, txt_embs])
```
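The same embeddings presumably have to be supplied at inference time as well; a sketch mirroring the `fit` signature (an assumption, not a documented call):

```python
# test-time embeddings from the same pretrained encoders (hypothetical variables)
img_embs_test = image_model.encode(test_images)
txt_embs_test = nlp_model.encode(test_texts)

preds = model.predict(X_test, embeddings=[img_embs_test, txt_embs_test])
```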



<h3> Hyperparameter Optimization </h3>
Since all of the models are sklearn base estimators, you can use sklearn's built-in hyperparameter optimization.
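For example, a sketch with `GridSearchCV` (the tunable parameter names are assumptions; inspect `model.get_params()` for the real keys):

```python
from sklearn.model_selection import GridSearchCV
from mambular.models import MambularRegressor

# Parameter names below are illustrative.
search = GridSearchCV(
    MambularRegressor(),
    param_grid={"lr": [1e-3, 1e-4], "n_layers": [2, 4]},
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)
```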

@@ -222,9 +256,11 @@ MambularLSS allows you to model the full distribution of a response variable, no
- **studentt**: For data with heavier tails, useful with small samples.
- **negativebinom**: For over-dispersed count data.
- **inversegamma**: Often used as a prior in Bayesian inference.
- **johnsonsu**: Four-parameter distribution with location, scale, skewness, and kurtosis parameters.
- **categorical**: For data with more than two categories.
- **quantile**: For quantile regression using the pinball loss.


These distribution classes make MambularLSS versatile in modeling various data types and distributions.
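For example, a minimal sketch with the new family (whether the family is chosen at fit time, as assumed here, or in the constructor may differ):

```python
from mambular.models import MambularLSS

model = MambularLSS()
# Johnson's SU can capture skewed, heavy-tailed responses.
model.fit(X_train, y_train, family="johnsonsu")
```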


@@ -269,13 +305,16 @@ Here's how you can implement a custom model with Mambular:

```python
from dataclasses import dataclass
from mambular.configs import BaseConfig

@dataclass
-class MyConfig:
+class MyConfig(BaseConfig):
    lr: float = 1e-04
    lr_patience: int = 10
    weight_decay: float = 1e-06
    lr_factor: float = 0.1
    n_layers: int = 4
    pooling_method: str = "avg"
```
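Note that `MyConfig` now inherits from `BaseConfig`, which presumably supplies the shared defaults (such as `d_model`) that the base models later read via `self.hparams`; a custom config therefore only declares what it adds or overrides.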

2. **Second, define your model:**
@@ -290,22 +329,32 @@ Here's how you can implement a custom model with Mambular:
class MyCustomModel(BaseModel):
    def __init__(
        self,
-        cat_feature_info,
-        num_feature_info,
+        feature_information: tuple,
        num_classes: int = 1,
        config=None,
        **kwargs,
    ):
-        super().__init__(**kwargs)
-        self.save_hyperparameters(ignore=["cat_feature_info", "num_feature_info"])
+        super().__init__(**kwargs)
+        self.save_hyperparameters(ignore=["feature_information"])
+        self.returns_ensemble = False
+
+        # embedding layer
+        self.embedding_layer = EmbeddingLayer(
+            *feature_information,
+            config=config,
+        )

-        input_dim = get_feature_dimensions(num_feature_info, cat_feature_info)
+        input_dim = np.sum(
+            [len(info) * self.hparams.d_model for info in feature_information]
+        )

        self.linear = nn.Linear(input_dim, num_classes)

-    def forward(self, num_features, cat_features):
-        x = num_features + cat_features
-        x = torch.cat(x, dim=1)
+    def forward(self, *data) -> torch.Tensor:
+        x = self.embedding_layer(*data)
+        B, S, D = x.shape
+        x = x.reshape(B, S * D)

        # Pass through linear layer
        output = self.linear(x)
@@ -329,60 +378,11 @@ Here's how you can implement a custom model with Mambular:
```python
regressor = MyRegressor(numerical_preprocessing="ple")
regressor.fit(X_train, y_train, max_epochs=50)

regressor.evaluate(X_test, y_test)
```
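Here `MyRegressor` is presumably `MyCustomModel` wrapped in one of the sklearn base classes in the lines elided above; a sketch of that wrapping (class and argument names assumed):

```python
from mambular.models import SklearnBaseRegressor

class MyRegressor(SklearnBaseRegressor):
    def __init__(self, **kwargs):
        super().__init__(model=MyCustomModel, config=MyConfig, **kwargs)
```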

-# Custom Training
-If you prefer to setup custom training, preprocessing and evaluation, you can simply use the `mambular.base_models`.
-Just be careful that all basemodels expect lists of features as inputs. More precisely as list for numerical features and a list for categorical features. A custom training loop, with random data could look like this.
-
-```python
-import torch
-import torch.nn as nn
-import torch.optim as optim
-from mambular.base_models import Mambular
-from mambular.configs import DefaultMambularConfig
-
-# Dummy data and configuration
-cat_feature_info = {
-    "cat1": {
-        "preprocessing": "imputer -> continuous_ordinal",
-        "dimension": 1,
-        "categories": 4,
-    }
-}  # Example categorical feature information
-num_feature_info = {
-    "num1": {"preprocessing": "imputer -> scaler", "dimension": 1, "categories": None}
-}  # Example numerical feature information
-num_classes = 1
-config = DefaultMambularConfig()  # Use the desired configuration
-
-# Initialize model, loss function, and optimizer
-model = Mambular(cat_feature_info, num_feature_info, num_classes, config)
-criterion = nn.MSELoss()  # Use MSE for regression; change as appropriate for your task
-optimizer = optim.Adam(model.parameters(), lr=0.001)
-
-# Example training loop
-for epoch in range(10):  # Number of epochs
-    model.train()
-    optimizer.zero_grad()
-
-    # Dummy Data
-    num_features = [torch.randn(32, 1) for _ in num_feature_info]
-    cat_features = [torch.randint(0, 5, (32,)) for _ in cat_feature_info]
-    labels = torch.randn(32, num_classes)
-
-    # Forward pass
-    outputs = model(num_features, cat_features)
-    loss = criterion(outputs, labels)
-
-    # Backward pass and optimization
-    loss.backward()
-    optimizer.step()
-
-    # Print loss for monitoring
-    print(f"Epoch [{epoch+1}/10], Loss: {loss.item():.4f}")
-```

# 🏷️ Citation

2 changes: 1 addition & 1 deletion mambular/__version__.py
@@ -16,4 +16,4 @@
#

# The following line *must* be the last in the module, exactly as formatted:
__version__ = "1.1.0"
__version__ = "1.2.0"
11 changes: 2 additions & 9 deletions mambular/arch_utils/layer_utils/attention_utils.py
@@ -5,7 +5,6 @@
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding


class GEGLU(nn.Module):
@@ -25,7 +24,7 @@ def FeedForward(dim, mult=4, dropout=0.0):


class Attention(nn.Module):
-    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
+    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
        super().__init__()
        inner_dim = dim_head * heads
        self.heads = heads
@@ -34,18 +33,13 @@ def __init__(self, dim, heads=8, dim_head=64, dropout=0.0, rotary=False):
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)
        self.dropout = nn.Dropout(dropout)
-        self.rotary = rotary
-        dim = np.int64(dim / 2)
-        self.rotary_embedding = RotaryEmbedding(dim=dim)

    def forward(self, x):
        h = self.heads
        x = self.norm(x)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))  # type: ignore
-        if self.rotary:
-            q = self.rotary_embedding.rotate_queries_or_keys(q)
-            k = self.rotary_embedding.rotate_queries_or_keys(k)
        q = q * self.scale

        sim = torch.einsum("b h i d, b h j d -> b h i j", q, k)
@@ -61,7 +55,7 @@ def forward(self, x):


class Transformer(nn.Module):
-    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary=False):
+    def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout):
        super().__init__()
        self.layers = nn.ModuleList([])

@@ -74,7 +68,6 @@ def __init__(self, dim, depth, heads, dim_head, attn_dropout, ff_dropout, rotary
                        heads=heads,
                        dim_head=dim_head,
                        dropout=attn_dropout,
-                        rotary=rotary,
                    ),
                    FeedForward(dim, dropout=ff_dropout),
                ]