Bachelor Thesis at Charles University

Description

This repository contains everything related to my Bachelor thesis on benchmarking various preprocessing methods for imbalanced classification, written at Charles University, Prague. It also contains the LaTeX sources for the paper on the same topic published at the IEEE International Conference on Big Data.

Quick Preview

You can use Docker for a quick preview of the functionality. Run the following command:

docker run --rm -it -p 5001:5001 rattko/bachelor-thesis:latest

The command downloads a pre-built Docker image from Docker Hub and runs it. The image fires up an MLflow server and executes an experiment consisting of two preprocessing methods over two datasets. Four runs are performed in total, each lasting roughly 90 seconds. One of the runs should fail due to insufficient training time; the other three may or may not finish successfully, depending on your machine's computing power. You can observe the results of the runs in the MLflow UI, accessible at 127.0.0.1:5001. The MLflow server continues running after the experiment has finished until you stop the container or press Ctrl-C.
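To make the "four runs" figure concrete: the experiment executes one run per (dataset, preprocessing) pair, i.e. the cross product of the two sets. A minimal sketch, with hypothetical placeholder names (the image ships its own two datasets and methods):

```python
from itertools import product

# Hypothetical placeholders -- the actual image decides which two
# datasets and two preprocessing methods are used.
datasets = ["dataset_a", "dataset_b"]
preprocessings = ["smote", "tomek_links"]

# One run per (dataset, preprocessing) pair: 2 x 2 = 4 runs in total.
runs = list(product(datasets, preprocessings))
print(len(runs))  # 4
```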

Installation

Unfortunately, one of our core dependencies does not support Windows, so either macOS or Linux is required. We also use the new syntax for type hints, which requires Python 3.10. Furthermore, AutoSklearn requires SWIG, which can be installed using Homebrew on macOS or any package manager on Linux. Once these requirements have been met, we can obtain the source code and proceed with the installation. Run the following commands:

git clone git@github.com:Rattko/Bachelor-Thesis.git && cd Bachelor-Thesis
python3 -m venv .venv && source .venv/bin/activate
pip3 install -r tools/requirements.txt
pip3 install -e .

We also need to patch AutoSklearn to gain complete control over the preprocessing steps in the experiment.

bash tools/patch_autosklearn.sh

Running Experiments

Once the installation described in the previous section is complete, we can proceed to run an experiment or two. First, we need to boot up an MLflow tracking server using:

mlflow server --backend-store-uri sqlite:///.mlruns.db --default-artifact-root .mlruns-artifacts &> .mlflow.logs &

We redirect the server's stdout and stderr to a log file and run the server in the background so that we can continue using the same terminal window. This command starts a server accessible at 127.0.0.1:5000 in the browser. Once an experiment is running, information about its runs appears at that address in real time.
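Because the server boots in the background, it can be handy to confirm it is up before launching an experiment. A small stdlib-only sketch that polls MLflow's health endpoint (the helper below is illustrative and not part of this repository):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 30.0) -> bool:
    """Poll `url` until it responds with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not ready yet; retry
    return False

# After `mlflow server ...` has been started in the background:
# wait_for_server("http://127.0.0.1:5000/health")
```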

Now we need to download some datasets to use in experiments. We can use

./tools/download_openml_datasets.py

for this. The script automatically downloads datasets from OpenML that satisfy pre-specified conditions; a progress bar shows the download status. After a successful download, we are ready to execute an experiment. Run the following command:

python3 src/core/main.py \
    --datasets 310 40900 \
    --preprocessings smote tomek_links \
    --total_time 90 \
    --time_per_run 30

You can consult the script's help page to learn more about the supported switches, but the following should be enough to get you up to speed. The --datasets switch expects the names of the datasets as found in the datasets/ directory, without any extension. Likewise, the --preprocessings switch expects the names of the preprocessing methods found in the src/core/preprocessings/ directory; it also accepts special values to run all, only oversampling, or only undersampling preprocessing methods. The last two switches control the time allocation for model training in AutoSklearn; see the time_left_for_this_task and per_run_time_limit parameters in the AutoSklearn documentation for an explanation.
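As a sketch of how the --datasets values map to files: each value is a file name from datasets/ with its extension stripped. The helper below is hypothetical (not part of the repository), and the .csv extension in the comment is only a guess at the on-disk format:

```python
from pathlib import Path

def dataset_names(datasets_dir: str = "datasets") -> list[str]:
    """Collect dataset names as expected by --datasets: file names in
    the datasets/ directory with their extensions stripped."""
    return sorted(p.stem for p in Path(datasets_dir).iterdir() if p.is_file())

# e.g. datasets/310.csv and datasets/40900.csv would correspond to:
#   python3 src/core/main.py --datasets 310 40900 ...
```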