Develop #221

AnFreTh · 2025-02-17T08:32:51Z

This pull request includes changes to improve the preprocessing and embedding layers in the mambular package. The main changes involve adding a feature preprocessing dictionary to the Preprocessor class, updating the forward method in the embedding layer, and refactoring code for better readability and functionality.

Preprocessing improvements:

mambular/preprocessing/preprocessor.py: Added feature_preprocessing parameter to allow custom preprocessing techniques for individual columns. Updated the fit method to use this parameter for both numerical and categorical features. [1] [2] [3] [4] [5]
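To make the per-column override concrete, here is a minimal sketch in plain Python. It is not the `Preprocessor` implementation: `resolve_preprocessing`, its argument names, and the method strings (`"ple"`, `"int"`, `"rbf"`) are assumptions for illustration. It only shows the lookup rule: a column listed in `feature_preprocessing` uses that method, and every other column falls back to the default for its type.

```python
from typing import Dict, List, Optional


def resolve_preprocessing(
    numerical_columns: List[str],
    categorical_columns: List[str],
    numerical_default: str = "ple",
    categorical_default: str = "int",
    feature_preprocessing: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return the preprocessing method to apply to each column.

    Hypothetical helper, not part of mambular: per-column overrides win,
    all remaining columns use the default for their feature type.
    """
    overrides = feature_preprocessing or {}

    # Reject overrides for columns that are not part of the data.
    known = set(numerical_columns) | set(categorical_columns)
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(
            f"Unknown columns in feature_preprocessing: {sorted(unknown)}"
        )

    plan: Dict[str, str] = {}
    for col in numerical_columns:
        plan[col] = overrides.get(col, numerical_default)
    for col in categorical_columns:
        plan[col] = overrides.get(col, categorical_default)
    return plan


plan = resolve_preprocessing(
    numerical_columns=["age", "income"],
    categorical_columns=["city"],
    feature_preprocessing={"income": "rbf"},
)
print(plan)  # {'age': 'ple', 'income': 'rbf', 'city': 'int'}
```

Raising on unknown column names is a design choice in this sketch so that a typo in the dictionary fails early instead of being silently ignored.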

Embedding layer updates:

mambular/arch_utils/layer_utils/embedding_layer.py: Modified the forward method to handle different dimensions of categorical embeddings and ensure they are properly processed. [1] [2]

Allow unstructured data as inputs:

mambular/arch_utils/layer_utils/embedding_layer.py: Modified the forward method to handle num_features, cat_features and pre-embedded unstructured data. [1] [2]
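The idea of feeding numerical, categorical and pre-embedded inputs through one forward pass can be sketched without any deep learning library. The code below is an assumption-laden toy, not the `EmbeddingLayer` code: `embed_all`, `project` and `make_projection` are hypothetical names, the weights are random rather than learned, and categorical features are assumed to be one-hot encoded already (so a linear projection acts like an embedding lookup). It shows each feature, whatever its width, being mapped to one `d_model`-sized token per sample.

```python
import random
from typing import List, Optional

Matrix = List[List[float]]  # rows x cols


def make_projection(in_dim: int, d_model: int, rng: random.Random) -> Matrix:
    """Random (in_dim x d_model) weights standing in for a learned linear layer."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(in_dim)]


def project(batch: Matrix, weights: Matrix) -> Matrix:
    """Plain matrix product: (batch x in_dim) @ (in_dim x d_model)."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in batch
    ]


def embed_all(
    num_features: List[Matrix],
    cat_features: List[Matrix],
    embeddings: Optional[List[Matrix]] = None,
    d_model: int = 4,
    seed: int = 0,
) -> List[Matrix]:
    """Map every feature to one token per sample.

    Returns a nested list shaped (batch, n_features, d_model).
    """
    rng = random.Random(seed)
    groups = list(num_features) + list(cat_features) + list(embeddings or [])
    if not groups:
        raise ValueError("at least one feature is required")

    # Each feature gets its own projection, sized to that feature's width.
    tokens_per_feature = [
        project(feature, make_projection(len(feature[0]), d_model, rng))
        for feature in groups
    ]
    # Regroup from (n_features, batch, d_model) to (batch, n_features, d_model).
    return [list(sample) for sample in zip(*tokens_per_feature)]


num = [[[0.5], [1.5]]]                       # one numerical feature, width 1
cat = [[[1.0, 0.0], [0.0, 1.0]]]             # one categorical feature, one-hot
emb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]   # one pre-embedded feature, width 3
out = embed_all(num, cat, emb)
```

Here `out` holds two samples, each with three tokens of length four. Passing `embeddings=None` gives two tokens per sample.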

Get latent representation of tables:

mambular/base_models/basemodel.py: Updated the encode method to accept a single data parameter instead of separate num_features and cat_features parameters. [1] [2]
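One way to read the signature change is shown in the toy below. It is only a sketch under assumptions: `ToyEncoder` is a hypothetical class, and its "latent representation" is just a list of labels so the call pattern stays visible. It shows why a single `data` tuple is convenient: the caller passes whatever groups the batch contains, and `encode` unpacks them into `forward`.

```python
from typing import List, Optional, Sequence, Tuple


class ToyEncoder:
    """Hypothetical stand-in for a base model; not mambular's implementation."""

    def forward(
        self,
        num_features: Sequence,
        cat_features: Sequence,
        embeddings: Optional[Sequence] = None,
    ) -> List[str]:
        # Produce one placeholder "token" label per input feature.
        tokens = [f"num:{i}" for i in range(len(num_features))]
        tokens += [f"cat:{i}" for i in range(len(cat_features))]
        tokens += [f"emb:{i}" for i in range(len(embeddings or []))]
        return tokens

    def encode(self, data: Tuple[Sequence, ...]) -> List[str]:
        # Single-argument style: the whole batch tuple is unpacked here,
        # so optional groups need no extra parameters on encode itself.
        return self.forward(*data)


encoder = ToyEncoder()
tabular_only = encoder.encode(([0.5, 1.2], [3]))
with_embeddings = encoder.encode(([0.5, 1.2], [3], [[0.1, 0.2]]))
```

In this toy, `tabular_only` has three labels and `with_embeddings` has four, with the same `encode` call shape in both cases.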

Documentation Updates:

Added a new "What's New" section to highlight recent features and improvements, such as individual preprocessing for each feature and the use of embeddings as inputs (README.md).
Enhanced the preprocessing section to include details about specifying individual preprocessing methods for each feature and using pre-trained encoding for categorical features (README.md) [1] [2].
Updated the usage examples to demonstrate how to get latent representations for each feature and use unstructured data (README.md).
Removed the "Custom Training" section and integrated custom model implementation details into other sections (README.md) [1] [2].

Workflow Automation:

Introduced a new GitHub Actions workflow (.github/workflows/pr-tests.yml) to run unit tests on pull requests targeting the develop and master branches. This workflow sets up the environment, installs dependencies, runs the tests, and ensures that pull requests cannot be merged if tests fail.

Codebase Improvements:

Updated the version number from 1.1.0 to 1.2.0 in mambular/__version__.py to reflect the new release.
Removed the rotary_embedding_torch dependency and associated code from the attention mechanism (mambular/arch_utils/layer_utils/attention_utils.py) to simplify the implementation [1] [2] [3] [4] [5].
Enhanced the EmbeddingLayer class to support embedding projections and handle additional embedding types (mambular/arch_utils/layer_utils/embedding_layer.py) [1] [2] [3] [4] [5] [6].

mkumar73 and others added 30 commits January 4, 2025 12:09

docstring fixed for default batch size

f74cee4

fix: extend parameter for preprocessing

63af2eb

fix splines, include sigmoid and rbf

8e8579c

rbf and sigmoid expansion

7169e96

scaling strategy included for ple, splines etc.

10fd848

Merge pull request #201 from basf/feat/splines

b398d13

RBF and Sigmoid, with scaling strategy

fix sklearn warnings

e437f3c

include predict step

d4c61f3

adjust datamodule and dataset to include prediction dataset

e3e39bf

fix batch prediction in sklearn models

8cc1e79

format

792c4a2

adapt lightningmodule to have custom metrics

f37b6d3

assign datasets

44778a1

include passing metrics into sklearn models

fd9a257

fix ensemble prediction bug

c5d2931

include sentence/word embeddings as preprocessing techniques for categorical

850a5cc

make sentence_transformer input optional dependency

50a3883

include encoding function to create embeddings

fac6a1f

adjust order in __getitem__ functionality and batch for lightningmodule

d08af31

include encoding function in sklearn base classes

40fef33

fix: sentence-transformers included

0708f3f

fix: B904

c4df541

chore: auto formatting

2473e5c

exclude sentence-transformers

75c2d1b

Merge pull request #202 from basf/util_fixes

4a76db9

Util fixes

adapt embedding layer to new input format of tuple information

2a65660

adapt basemodel encoding function to tuple input

4d5f94a

batch now returns tuple and *data is passed to forward method

adc6d19

first two basemodels adapted to new logic

a02b9dd

major changes in handling embeddings as array/list inputs in addition to tabular data

10d1c00

AnFreTh and others added 18 commits February 12, 2025 14:39

install poetry in workflow

1fcb030

ensure mambular is locally installed

ac27a1d

Merge pull request #214 from basf/embeddings

c379a7a

Embeddings

add JohnsonSU and individual preprocessing

c3e9c90

adapt embedding layer to new preprocessing

b10ff52

Merge branch 'develop' into johnson_su

d6380fd

Merge pull request #216 from basf/johnson_su

2e87e87

Johnson su

adapt base models

7904ae1

Merge pull request #217 from basf/dev_fix

d155e22

adapt base models

adapt readme

62cbc6c

version fix

18d954b

add baseconfig to init

35ba22b

lock update after torch version change

fa2c978

reformatting

4bbf174

formatting, refactor (used exception instead of assert)

3a769c1

Merge pull request #219 from basf/rdme_fix

3cdf998

Rdme fix

increase version

3a20cc3

Merge pull request #220 from basf/vs_increase

d14b666

increase version

AnFreTh added this to the v1.2.0 milestone Feb 17, 2025

AnFreTh merged commit 25b88a3 into master Feb 17, 2025
1 check passed

3 participants
