Install pyenv and poetry for environment management. In `pyproject.toml`, see the `python = ...` line for the acceptable Python version range, e.g., `">=3.9,<3.10"`, and do the following from the root directory of the project:

```
pyenv install <python version>
pyenv local <python version>
poetry env use <python version>
poetry install
```

Here, `<python version>` is a concrete version inside the range, e.g., `3.9.17`. For missing packages, use `poetry add <package name>` (see the poetry docs).
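For example, assuming `3.9.17` is the version you picked (any version that satisfies the range above works), the full sequence is:

```
pyenv install 3.9.17
pyenv local 3.9.17
poetry env use 3.9.17
poetry install
```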
```
.
├── README.md
├── pyproject.toml : poetry dependency file
├── poetry.lock : poetry lock file (created after the installation runs)
├── scripts
│   ├── imputation : scripts for missing data imputation
│   └── elicitation : scripts for prior elicitation
├── data (basically ignored by git)
│   ├── openml : raw data of OpenML-CC18
│   ├── cities : cities data for prior elicitation
│   ├── working : calculated features, processing steps
│   └── output : classification results, figures, etc.
└── .env : environment variables (create this manually)
```
You can download the OpenML-CC18 Curated Classification benchmark datasets locally:

```
poetry run python scripts/get-datasets.py
```

The downloaded data will be stored in `data/openml`. The following files will be downloaded:

```
data/openml
├── [OpenML ID]
│   ├── X.csv : feature matrix
│   ├── y.csv : classification labels
│   ├── X_categories.json : list of categorical variables in the features
│   ├── y_categories.json : list of classes in y.csv
│   ├── description.txt : description of the dataset
│   └── details.json : metadata of the dataset
└── openml-datasets-CC18.csv : list of downloaded datasets
```
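As a quick sanity check, you can load one of the downloaded datasets in Python. A minimal sketch assuming `pandas` is available and following the file layout above (the OpenML ID `31` is only an example):

```python
import json
from pathlib import Path

import pandas as pd

# Any downloaded dataset directory; 31 is an example OpenML ID.
dataset_dir = Path("data/openml/31")

# Feature matrix and classification labels.
X = pd.read_csv(dataset_dir / "X.csv")
y = pd.read_csv(dataset_dir / "y.csv")

# Names of the categorical feature columns.
with open(dataset_dir / "X_categories.json") as f:
    categorical_columns = json.load(f)

print(X.shape, y.shape, categorical_columns)
```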
In the preprocessing step, you will split the original OpenML datasets into train and test subsets and generate missing values. Please download the OpenML datasets into `data/openml` in advance (see above).

The split complete datasets will be stored in `data/working/complete`, and the incomplete datasets (datasets with "real" missing values) will be stored in `data/working/incomplete`. For the complete datasets, the code artificially generates missing values based on missingness patterns (MCAR, MAR, MNAR).

```
poetry run python scripts/imputation/preprocess.py
```

```
usage: poetry run python scripts/imputation/preprocess.py
        [--n_corrupted_rows_train N_CORRUPTED_ROWS_TRAIN] [--n_corrupted_rows_test N_CORRUPTED_ROWS_TEST]
        [--n_corrupted_columns N_CORRUPTED_COLUMNS] [--test_size TEST_SIZE]
        [--seed SEED] [--debug]
```
```
required arguments:
    (none)

optional arguments:
    --n_corrupted_rows_train    default: 120
    --n_corrupted_rows_test     default: 30
    --n_corrupted_columns       default: 6; the code will corrupt at most 6 columns
    --test_size                 default: 0.2; fraction of the data held out as the test subset in the train/test split
    --seed                      default: 42
    --debug                     display additional logs in the terminal
```
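For intuition, MCAR ("missing completely at random") corruption masks entries independently of the data. A minimal sketch of the idea, not the repository's actual implementation in `preprocess.py` (the function name and signature are illustrative):

```python
import numpy as np
import pandas as pd

def corrupt_mcar(df: pd.DataFrame, n_corrupted_rows: int, columns: list[str], seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with values masked completely at random."""
    rng = np.random.default_rng(seed)
    corrupted = df.copy()
    # Pick the rows to corrupt, then mask a random subset of the given columns in each.
    rows = rng.choice(df.index.to_numpy(), size=min(n_corrupted_rows, len(df)), replace=False)
    for row in rows:
        n_masked = rng.integers(1, len(columns) + 1)  # at least one missing value per corrupted row
        for col in rng.choice(columns, size=n_masked, replace=False):
            corrupted.loc[row, col] = np.nan
    return corrupted
```

Under MAR, the masking probability would instead depend on observed values; under MNAR, on the (unobserved) values themselves.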
You can use the OpenAI API, or other APIs compatible with the OpenAI API, e.g. llama-cpp-python and vLLM. Instructions for each are the following:

- OpenAI API: please set your API key in `.env` as `OPENAI_API_KEY="YOUR_OPENAI_API_KEY"`.
- Other OpenAI-compatible APIs: please set the base URL of the inference server in `.env` as `CUSTOM_INFERENCE_SERVER_URL="YOUR_CUSTOM_INFERENCE_SERVER_URL"`. If an API key is required, please set it in `.env` as `CUSTOM_API_KEY="YOUR_CUSTOM_API_KEY"`.
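For example, a `.env` at the project root might look like the following. The values are placeholders (the URL shown is just a typical local inference-server endpoint), and you only need the variables for the backend you use:

```
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
CUSTOM_INFERENCE_SERVER_URL="http://localhost:8000/v1"
CUSTOM_API_KEY="YOUR_CUSTOM_API_KEY"
```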
To edit prompts, edit `scripts/imputation/prompts.json`:

```json
{
    "expert_prompt_initialization": {
        "system_prompt": "...",
        "user_prompt_prefix": "...",
        "user_prompt_suffix": "..."
    },
    "non_expert_prompt": "...",
    "data_imputation": {
        "system_prompt_suffix": "...",
        "user_prompt_prefix": "..."
    }
}
```
For each row with missing values, two types of requests to LLMs will be made:

1. Expert prompt initialization: ask the LLM to generate a prompt (`epi_prompt`) that makes LLMs act like experts. System prompt: `system_prompt`. User prompt: `user_prompt_prefix + dataset_description + user_prompt_suffix`, where `dataset_description` is the description of the dataset downloaded from OpenML.
2. Data imputation: using the expert prompt, ask the LLM to guess a missing value. System prompt: `epi_prompt + system_prompt_suffix`. User prompt: `user_prompt_prefix` followed by the target row.

(Note) There may be multiple missing values in the target row. This is handled by repeating step 2 for each missing value in the target row (the other missing values are hidden).
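To make the two-step flow concrete, here is a hedged sketch using the `openai` Python client. The prompt keys mirror `prompts.json` above; the placeholder inputs (`dataset_description`, `row_text`) and the exact serialization of the target row are assumptions for illustration, not the repository's code:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("scripts/imputation/prompts.json") as f:
    prompts = json.load(f)

def chat(system: str, user: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

dataset_description = "..."  # e.g., the contents of description.txt for the dataset
row_text = "..."             # target row with one missing value, serialized as text (assumed format)

# Step 1: expert prompt initialization.
epi = prompts["expert_prompt_initialization"]
epi_prompt = chat(
    epi["system_prompt"],
    epi["user_prompt_prefix"] + dataset_description + epi["user_prompt_suffix"],
)

# Step 2: data imputation, repeated for each missing value in the row.
di = prompts["data_imputation"]
imputed_value = chat(
    epi_prompt + di["system_prompt_suffix"],
    di["user_prompt_prefix"] + row_text,
)
print(imputed_value)
```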
- Please run `generate-missing-values.py` (see above) in advance. Corrupted datasets (datasets with missing values) and the log file (`log.csv`) must be stored in `data/working/complete` and `data/working/incomplete`.
- Evaluation is currently unavailable and needs an update.
You can test multiple imputation methods for missing values on the generated incomplete datasets. The following methods are available:

- Mean/Mode (impute numerical values with the mean and categorical values with the mode)
- K-nearest neighbors
- Random Forest
- LLMs

For example, to impute with the Mean/Mode method, run the following command:

```
poetry run python scripts/imputation/experiment.py --method meanmode
```
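As a rough illustration of what the Mean/Mode baseline does, here is a minimal pandas sketch of the general technique (not the repository's implementation in `experiment.py`):

```python
import pandas as pd

def impute_mean_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numerical columns with the column mean, other columns with the mode."""
    imputed = df.copy()
    for col in imputed.columns:
        if pd.api.types.is_numeric_dtype(imputed[col]):
            imputed[col] = imputed[col].fillna(imputed[col].mean())
        elif not imputed[col].mode().empty:
            imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])
    return imputed
```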
For LLMs, you can test several models: OpenAI GPT models are available, as well as other models served behind an OpenAI-compatible API. To select a model, set the `--llm_model` option. The default model is `gpt-4`. For OpenAI GPT models, please use the official model names, e.g. `gpt-3.5-turbo`. For OpenAI-compatible models, you can freely set a model name, but note that the `--llm_model` option is then required.

```
poetry run python scripts/imputation/experiment.py --method llm --llm_model gpt-3.5-turbo
```

You can also choose whether the LLM should behave like an expert or not. The default role is `expert`.

```
poetry run python scripts/imputation/experiment.py --method llm --llm_model gpt-4 --llm_role nonexpert
```

If you want to run experiments on a specific dataset, pass the OpenML ID and the missingness pattern. For example:

```
poetry run python scripts/imputation/experiment.py --method meanmode --openml_id 31 --missingness MCAR
```

You can also evaluate downstream tasks by adding the `--downstream` flag:

```
poetry run python scripts/imputation/experiment.py --method meanmode --downstream
```
```
usage: poetry run python scripts/imputation/experiment.py
        [--method {meanmode, knn, rf, llm}] [--evaluate] [--downstream]
        [--openml_id OPENML_ID] [--missingness {MCAR, MAR, MNAR}] [--dataset {incomplete, complete}]
        [--llm_model LLM_MODEL] [--llm_role {expert, nonexpert}]
        [--seed SEED] [--debug]
```
```
required arguments:
    --method        select an imputation method to apply (default: meanmode)

optional arguments:
    --evaluate      calculate RMSE or macro F1
    --downstream    evaluate downstream tasks
    --openml_id     specify a target OpenML ID
    --missingness   specify a missingness pattern (MCAR, MAR, or MNAR)
    --dataset       specify a dataset type: complete or incomplete (default: both)
    --llm_model     specify an LLM model (default: gpt-4)
    --llm_role      select whether the LLM acts as an expert or not
    --seed          default: 42
    --debug         display additional logs in the terminal
```
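Options can be combined. For example, to run the LLM imputer on a single MAR-corrupted dataset (the OpenML ID here is only an example):

```
poetry run python scripts/imputation/experiment.py --method llm --llm_model gpt-4 --llm_role expert --openml_id 31 --missingness MAR
```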
If you want to modify the imputation method that uses LLMs, please edit `scripts/imputation/modules/llmimputer.py`.
Setup for LLM APIs is the same as for the LLM imputer. See above.
To edit prompts, edit `scripts/elicitation/prompts.json`.

If you want to modify the elicitation method that uses LLMs, please edit `scripts/elicitation/modules/llmelicitor.py`.