imodels🔍 data

Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.

See also https://huggingface.co/imodels

Includes the following datasets and more (see the notebooks for details on each dataset).

To download, use the "Name" field as the key: e.g. imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels').

| Name | Samples | Features | Class 0 | Class 1 | Majority class % |
|---|---|---|---|---|---|
| heart | 270 | 15 | 150 | 120 | 55.6 |
| breast_cancer | 277 | 17 | 196 | 81 | 70.8 |
| haberman | 306 | 3 | 81 | 225 | 73.5 |
| credit_g | 1000 | 60 | 300 | 700 | 70.0 |
| csi_pecarn_prop | 3313 | 97 | 2773 | 540 | 83.7 |
| csi_pecarn_pred | 3313 | 39 | 2773 | 540 | 83.7 |
| juvenile_clean | 3640 | 286 | 3153 | 487 | 86.6 |
| compas_two_year_clean | 6172 | 20 | 3182 | 2990 | 51.6 |
| enhancer | 7809 | 80 | 7115 | 694 | 91.1 |
| fico | 10459 | 23 | 5000 | 5459 | 52.2 |
| iai_pecarn_prop | 12044 | 73 | 11841 | 203 | 98.3 |
| iai_pecarn_pred | 12044 | 58 | 11841 | 203 | 98.3 |
| credit_card_clean | 30000 | 33 | 23364 | 6636 | 77.9 |
| tbi_pecarn_prop | 42428 | 223 | 42052 | 376 | 99.1 |
| tbi_pecarn_pred | 42428 | 121 | 42052 | 376 | 99.1 |
| readmission_clean | 101763 | 150 | 54861 | 46902 | 53.9 |
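
As a quick, illustrative cross-check of the table above (not code from this repo), each "Name" can be passed as the dataset key; this sketch assumes y comes back as a 0/1 numpy array, as the function docstring below describes.

import numpy as np
import imodels

# illustrative sketch: fetch a couple of datasets by name and check size / class balance
for name in ['heart', 'compas_two_year_clean']:
    X, y, feature_names = imodels.get_clean_dataset(name, data_source='imodels')
    majority_pct = 100 * max(np.mean(y), 1 - np.mean(y))
    print(f'{name}: {X.shape[0]} samples, {X.shape[1]} features, majority class {majority_pct:.1f}%')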

Data usage

First, install the imodels package: pip install imodels. Then, use the imodels.get_clean_dataset function.

imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources, including imodels, pmlb, openml, and sklearn. If the data has not been downloaded yet, it is downloaded and cached; otherwise it is loaded locally.

Parameters
----------
dataset_name: str
    unique identifier for the dataset
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')

Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
    names of the features (columns of X)
"""

Example

# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download ca housing from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
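
Once downloaded, the arrays can be fed straight into any scikit-learn-style estimator. The sketch below is an illustration rather than code from this repo: it uses imodels' RuleFitClassifier as the rule-based model, and the split and seed are arbitrary assumptions.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import imodels

# illustrative sketch: fit a rule-based imodels model on a downloaded dataset
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = imodels.RuleFitClassifier()  # any sklearn-compatible classifier would also work
model.fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, model.predict(X_test)))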

Data info

Data comes from various sources - please cite those sources appropriately.

notebooks_fetch_data contains the notebooks that download and preprocess the data.

data_cleaned contains the cleaned CSV file for each dataset.
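
The cleaned CSVs can also be read directly with pandas, bypassing the imodels helper. The path below is only an illustration of the layout; check the actual filename in data_cleaned first.

import pandas as pd

# illustrative sketch: load a cleaned csv directly (exact path/filename is an assumption)
df = pd.read_csv('data_cleaned/compas_two_year_clean/compas_two_year_clean.csv')
print(df.shape, list(df.columns)[:5])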

Clinical decision-rule (PECARN) datasets

To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.

There are two versions of each PECARN (TBI, IAI, and CSI) dataset.

  • prop: missing values have not been imputed
  • pred: missing values have been imputed

Note on csi_pecarn_pred.csv: unlike the rest of the datasets in this repo, which are fully cleaned, csi_pecarn_pred.csv still contains a variable ("SITE") that should be removed before fitting models, as sketched below.
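
A minimal sketch of that preprocessing step, assuming the column appears in feature_names under the name 'SITE':

import numpy as np
import imodels

X, y, feature_names = imodels.get_clean_dataset('csi_pecarn_pred', data_source='imodels')
site_idx = feature_names.index('SITE')      # assumes the column is literally named 'SITE'
X = np.delete(X, site_idx, axis=1)          # drop the site variable before fitting models
feature_names = [f for f in feature_names if f != 'SITE']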

| Dataset | Task | Size | References |
|---|---|---|---|
| iai_pecarn | Predict intra-abdominal injury requiring acute intervention before CT | 12,044 patients, 203 with IAI-I | 📄, 🔗 |
| tbi_pecarn | Predict traumatic brain injuries before CT | 42,412 patients, 376 with ciTBI | 📄, 🔗 |
| csi_pecarn | Predict cervical spine injury in children | 3,314 patients, 540 with CSI | 📄, 🔗 |

Miscellaneous notes

The breast_cancer dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.

Some other cool datasets:

  • moleculenet - benchmarks for molecular datasets
  • srbench - benchmarking for symbolic regression
  • big-bench - language modeling benchmarks