A collection of the clustering methods I know of, with reproduction code

Mr-SGXXX/Clustering

Clustering Methods Collection & Reproduction

Notice: This repository is far from complete. All the code provided here is shared as-is and will change often, and I make no promises about performance or accurate reproduction.

Introduction

This project collects clustering code that I have reproduced, together with links to the corresponding papers from recent years. It also includes many published methods whose authors have released their code.

In recent years, clustering, as a representative unsupervised task, has received great attention from researchers. Many related works have appeared and achieved significant success. For most of them the authors released code for others to use; however, some of that code is incomplete, and the running environments and code frameworks vary widely or are outdated. So I decided to collect the papers and their code and bring them together.

Install and Usage

  1. To use the code in this repository, first pick a suitable location and clone the repository from GitHub:

    git clone git@github.com:Mr-SGXXX/Clustering.git
    cd Clustering
  2. After cloning the repository, set up a Python environment. I recommend conda, which makes it easy to build an isolated environment without affecting your other projects. You can download Anaconda here, but I suggest installing Miniconda by following this site. To create a new conda environment, run:

    conda create -n clustering python=3.9
    conda activate clustering
  3. To run the code, install the packages this repository depends on:

    pip install -r requirements.txt
  4. Once the environment is ready, choose the dataset and the clustering method in the config file. You can also change the hyper-parameter settings for the chosen dataset or method there.

  5. Run the experiment defined in the config file with the following command:

    python main.py --config_path /path/to/config.cfg --device 0

    or just:

    python main.py -cp /path/to/config.cfg -d 0

    Then just wait, or do anything else you want.

  6. After the experiment finishes, there will be a log file containing the messages collected during the run, as well as a set of figures generated from the features and scores of the experiment. If you have stopped many experiments partway, you can use the following bash script to remove the useless logs:

    bash ./clean_log.sh /path/to/log /path/to/figures 

    Or, if you use the default log and figure paths:

    bash ./clean_log.sh

    Notice: Do not use this script while any experiment is running.

  7. To add a new method or dataset to this repository, first look at the '__init__.py' file of the method package (divided into classical methods and deep methods) or of the dataset package. Then you can design your method and benefit from the pipeline.
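
As a rough illustration of how the command-line options in step 5 could be wired to a config file, here is a minimal sketch using Python's standard `argparse` and `configparser`. Only the flags `--config_path`/`-cp` and `--device`/`-d` come from the commands above. The section and key names (`global`, `dataset`, `method_name`) and the helper functions are hypothetical and are not taken from this repository's real config files or `main.py`.

```python
import argparse
import configparser


def parse_args(argv=None):
    """Parse the two options shown in step 5 (long and short forms)."""
    parser = argparse.ArgumentParser(description="Run a clustering experiment")
    parser.add_argument("-cp", "--config_path", required=True,
                        help="path to the .cfg file")
    # Assumption: the device option is a GPU index such as 0.
    parser.add_argument("-d", "--device", type=int, default=0,
                        help="device index")
    return parser.parse_args(argv)


def load_config(text):
    """Read INI-style config text; a real run would read the file at config_path."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    return cfg


# Hypothetical config content, for illustration only.
EXAMPLE_CFG = """
[global]
dataset = Reuters10K
method_name = DEC
"""

args = parse_args(["-cp", "/path/to/config.cfg", "-d", "0"])
cfg = load_config(EXAMPLE_CFG)
print(args.config_path, args.device, cfg["global"]["method_name"])
```

Check the config files shipped with the repository for the actual section and option names before editing them.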

Methods List

Classical Methods

Methods that do not use deep learning are included in this part.

Deep Methods

Methods that use deep learning are included in this part. Note that multi-view clustering methods and GNN-based clustering methods are not included here.

In the code provided by the authors, a pretrained weight is given for Reuters10K. With it, we can sometimes get a result no lower than the one reported in the article for the Reuters10K dataset. However, when pretraining from scratch with the settings from the article instead of using the pretrained weight, the score rarely reaches the reported level, although it is similar to the score obtained with this repository. Besides, the result is not stable.

This method is designed for clustering large datasets such as ImageNet and does not work well on small datasets. In the official implementation, the authors give detailed scripts for the experiments in the article, which include using conv features from different levels for logistic regression and using these features for object detection. This repository does not offer these parts; it only gives the clustering result obtained by running classical clustering on the fc features, which is also called the last epoch cluster assignments.

In this method, most of the code is the same as DEC, except for the clustering process. Instead of using only the KL loss, IDEC adds a reconstruction loss to the clustering process. Because IDEC uses the same pretraining process as DEC, it directly reuses the DEC pretrained weight to save time.
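
The difference from DEC can be sketched as a single objective: reconstruction error plus a weighted KL term. The pure-Python sketch below is only illustrative. The function names and the weight `gamma` (0.1 is a placeholder value) are assumptions and are not this repository's code, which works on tensors rather than nested lists.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all samples (rows) and clusters (columns)."""
    return sum(
        p_ij * math.log((p_ij + eps) / (q_ij + eps))
        for p_i, q_i in zip(p, q)
        for p_ij, q_ij in zip(p_i, q_i)
    )


def mse(x, x_hat):
    """Mean squared reconstruction error over all entries."""
    n = sum(len(row) for row in x)
    return sum(
        (a - b) ** 2
        for row, row_hat in zip(x, x_hat)
        for a, b in zip(row, row_hat)
    ) / n


def idec_style_loss(x, x_hat, p, q, gamma=0.1):
    """Reconstruction loss plus gamma times the clustering (KL) loss."""
    return mse(x, x_hat) + gamma * kl_divergence(p, q)
```

Setting `gamma` very large makes the objective behave like the KL-only clustering loss, while the reconstruction term keeps the autoencoder's decoder in the loop during clustering.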

In this method, pretraining is the most important part: whether the features are learned well during pretraining directly determines whether the result is good. With a reproduced greedy layer-wise pretraining that follows the DEC paper, the pretrained weight is more likely to be good, and so the DEC method is more likely to reach a good score. Although the best score over many experiments is no lower than the score in the article, the method is still not stable, and scores vary widely between runs.
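
One way to see why pretraining matters so much: the clustering phase builds its training target from the model's own soft assignments, so poor initial features tend to be reinforced. The sketch below follows the soft assignment (Student's t kernel) and sharpened target distribution defined in the DEC paper, written in pure Python for readability. The function names are my own, and it is not the code used in this repository.

```python
def soft_assign(z, centers, alpha=1.0):
    """q_ij: Student's t similarity between embedding z_i and center mu_j,
    normalized so each row sums to 1."""
    q = []
    for z_i in z:
        row = []
        for mu_j in centers:
            dist2 = sum((a - b) ** 2 for a, b in zip(z_i, mu_j))
            row.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """p_ij proportional to q_ij^2 / f_j, where f_j = sum_i q_ij,
    normalized so each row sums to 1."""
    f = [sum(col) for col in zip(*q)]
    p = []
    for q_i in q:
        row = [q_ij ** 2 / f_j for q_ij, f_j in zip(q_i, f)]
        total = sum(row)
        p.append([v / total for v in row])
    return p


# Two toy 2-D embeddings, each sitting on one of two cluster centers.
q = soft_assign([[0.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [4.0, 0.0]])
p = target_distribution(q)
print(q[0], p[0])
```

Because `p` only sharpens whatever `q` already says, the embeddings and initial centers produced by pretraining largely decide which clusters the model converges to.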

Dataset

Image

  • MNIST

  • Fashion MNIST

  • CIFAR-10

  • CIFAR-100

  • STL-10

Text

  • Reuters-10K:

Notice: The Reuters-10K used here is most likely the same as the Reuters-10K used in DEC, which is generated by randomly selecting 10000 samples from the original Reuters dataset of 685071 samples. Because the download URL for the original Reuters dataset in the DEC repository is no longer available, experiments on the full dataset are not possible for now.
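
For illustration, the subsampling described above could look like the sketch below. It only draws the indices. The seed and the function name are hypothetical, so it will not reproduce the exact subset used in DEC or in this repository.

```python
import random

TOTAL_SAMPLES = 685071  # size of the original Reuters set stated above
SUBSET_SIZE = 10000     # size of Reuters-10K


def sample_indices(total=TOTAL_SAMPLES, k=SUBSET_SIZE, seed=0):
    """Draw k distinct row indices uniformly at random, without replacement."""
    rng = random.Random(seed)  # the seed is an arbitrary, hypothetical choice
    return sorted(rng.sample(range(total), k))


indices = sample_indices()
print(len(indices), indices[:5])
```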

Experiment Results

Results Disclaimer

All the experimental results you can see in this repository are obtained based on the code provided in this repository. Due to factors such as experimental environment and parameter settings, these results may differ slightly or greatly from those in the original paper. I strive to ensure the accuracy of the results, but can't guarantee exact correspondence with the original paper.

Possible reasons for the differences, in my personal view:

  • Clustering is usually unstable: different initializations can cause significantly different results.
  • Not all methods were originally implemented in PyTorch, and different PyTorch versions may also cause differences. This repository may implement a method in a different way.
  • Different hardware may produce slightly different results because of small differences in numerical computation.
  • The results of some methods depend heavily on the weights from an excellent but rare pretraining run, which does not occur every time, so the scores are easily lower than what the authors claimed.
  • Some methods do not provide the hyper-parameter settings they used for every dataset; for these methods we use the default hyper-parameters from their code or paper.
  • The public code of some methods cannot run correctly because of bugs or outdated APIs. Although we try to fix these errors, the fixes may cause some differences in the results.
  • Some methods unfairly select the best epoch using clustering evaluation metrics (ACC, ARI, etc.) during the clustering process, which requires ground truth information. (Early stopping does not justify an unfair setting.)
  • There may be bugs in this repository that affect the scores of some methods. If you find a bug, you are welcome to raise an issue or contact me by email.

The hardware environment available to me is as follows:

  • CPU: Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti
  • Memory Size: 64GB

Scores Table

Each method is run several times on each dataset for fairness. For deep methods, only the result of the last epoch, or a result chosen without using ground truth, is used. The table shows the highest score together with the mean and standard deviation, in the format "max, mean(std)". The running time of deep methods includes both pretraining time and clustering time.
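
As a small sketch of how a "max, mean(std)" cell can be produced from the scores of several runs, consider the snippet below. The scores are made-up values, not results from this repository, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the original does not say which estimator is used.

```python
import statistics


def summarize(scores):
    """Format a list of per-run scores as 'max, mean(std)'."""
    best = max(scores)
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # assumption: population std, not sample std
    return f"{best:.4f}, {mean:.4f}({std:.4f})"


# Made-up scores from four hypothetical runs.
print(summarize([0.70, 0.75, 0.65, 0.72]))
```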

Reuters10K

| Method | Runs | ACC | NMI | ARI |
| --- | --- | --- | --- | --- |
| EDESC | 16 | 0.7632, 0.6978(0.0575) | 0.5849, 0.4686(0.0591) | 0.5927, 0.4826(0.0730) |
| DEC | 16 | 0.7366, 0.6440(0.0456) | 0.4879, 0.4228(0.0417) | 0.4591, 0.3936(0.0452) |
| Spectral Clustering | 8 | 0.4441, 0.4441(0.0000) | 0.0905, 0.0905(0.0000) | 0.0175, 0.0175(0.0000) |
| KMeans | 16 | 0.5622, 0.5301(0.0162) | 0.3549, 0.3243(0.0195) | 0.2655, 0.2211(0.0190) |
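
For readers unfamiliar with the ACC column: clustering accuracy is usually computed after finding the best one-to-one mapping between predicted cluster ids and ground-truth labels, which is also why it needs ground truth. The sketch below uses a brute-force search over permutations, which is only practical for a handful of clusters. Real implementations commonly use the Hungarian algorithm instead, and this is not necessarily how this repository computes ACC.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings from cluster ids to labels."""
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with dummy labels so every cluster can be mapped to something,
    # even when there are more clusters than ground-truth classes.
    padded = labels + [None] * max(0, len(clusters) - len(labels))
    best = 0
    for perm in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for t, c in zip(y_true, y_pred))
        best = max(best, correct)
    return best / len(y_true)


# Cluster ids are a pure relabeling of the true labels here, so ACC is 1.0.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0]))
```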

End

In the end, I would like to express my gratitude to all researchers in the field of clustering and to the entire AI community for their contributions, and to thank them for their willingness to open-source their code.

In addition, thanks to GitHub for the Copilot assistant, which greatly improved my efficiency.

Contact

My email is yx_shao@qq.com. If you have any questions or advice, please contact me.