sort DataWarrior clusters and relabel their molecules accordingly

nbehrnd/datawarrior_clustersort

context/motivation

In the post "Assign cluster name based on cluster size" (April 7, 2022), user mcmc observes that DataWarrior labels clusters in the sequence of their creation. mcmc suggests it would be beneficial if the cluster labels reflected their population instead, i.e., the greater the number of molecules in a cluster, the lower the assigned label.

use case

The script runs from the command line and requires an installation of Python 3. All functionality is provided by the standard library; there are no additional dependencies:

datawarrior_clustersort.py [-h] [-r] file

The script reads DataWarrior's clustering result (Chemistry -> cluster compounds/reactions), exported as a text file (File -> Save Special), as input (file). It identifies the column with DataWarrior's cluster labels by searching for the column header Cluster No, which DataWarrior assigns by default. A new record file is written in which the entries are sorted to report the most populous cluster and its entries first, followed by the less populous clusters. To reflect this sequence, now based on the number of entries per cluster, the cluster labels are newly assigned.
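The steps described above (find the label column, count entries per cluster, sort, relabel) can be sketched as follows. This is a minimal illustration, not the code of datawarrior_clustersort.py: the function name `relabel_by_size`, the assumption of a tab-separated export without blank records, and the tie-break between clusters of equal size (lower original label first) are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def relabel_by_size(text, header="Cluster No", reverse=False):
    """Sort records by cluster size and renumber the clusters (hypothetical helper)."""
    # Assumption: the export is tab-separated with one header line.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    head, records = rows[0], rows[1:]
    col = head.index(header)  # raises ValueError if the header is missing

    sizes = Counter(rec[col] for rec in records)
    # Most populous cluster first (or last, if reverse is set).
    # Assumption: ties are broken by the lower original label.
    ranked = sorted(
        sizes,
        key=lambda lab: (sizes[lab] if reverse else -sizes[lab], int(lab)),
    )
    new_label = {old: str(new) for new, old in enumerate(ranked, start=1)}

    relabeled = [rec[:col] + [new_label[rec[col]]] + rec[col + 1:] for rec in records]
    # sort() is stable, so the record order inside each cluster is preserved.
    relabeled.sort(key=lambda rec: int(rec[col]))

    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows([head] + relabeled)
    return out.getvalue()


sample = "Name\tCluster No\nA\t1\nB\t2\nC\t2\nD\t3\nE\t3\nF\t3\n"
print(relabel_by_size(sample))
```

In the sample, the original cluster 3 holds three records and therefore becomes the new cluster 1, while the single-record cluster 1 becomes the new cluster 3.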

The sort can be reversed with the optional flag --reverse, or -r. The script then reports the cluster with the fewest entries first.
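A command-line interface matching the usage line `[-h] [-r] file` could be set up with `argparse` from the standard library. The sketch below is an assumption about how such a parser might look; the help texts and the function name `build_parser` are invented for illustration and need not match the script.

```python
import argparse


def build_parser():
    """Build a parser for: datawarrior_clustersort.py [-h] [-r] file (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="datawarrior_clustersort.py",
        description="sort DataWarrior clusters and relabel their molecules accordingly",
    )
    # Positional argument: the text file exported from DataWarrior.
    parser.add_argument("file", help="DataWarrior cluster export (text file)")
    # Optional flag: report the least populous cluster first.
    parser.add_argument(
        "-r", "--reverse", action="store_true",
        help="report the least populous cluster first",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-r", "100Random_Molecules.txt"])
print(args.file, args.reverse)
```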

test case

A library of 100 random drug-like molecules was generated by DataWarrior and clustered at a low similarity threshold (Structure FragFp 0.4, file 100Random_Molecules.dwar). The export (file 100Random_Molecules.txt) was processed by

python3 datawarrior_clustersort.py 100Random_Molecules.txt

to yield 100Random_Molecules_sort.txt as the newly assigned set. As a preview, the script briefly reports the distribution before and after the sort to the CLI:

DataWarrior's assignment of clusters:
cluster:        1 molecules:       11
cluster:        2 molecules:       28
cluster:        3 molecules:       29
cluster:        4 molecules:       14
cluster:        5 molecules:        8
cluster:        6 molecules:        2
cluster:        7 molecules:        3
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1

clusters newly sorted and labeled:
cluster:        1 molecules:       29
cluster:        2 molecules:       28
cluster:        3 molecules:       14
cluster:        4 molecules:       11
cluster:        5 molecules:        8
cluster:        6 molecules:        3
cluster:        7 molecules:        2
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1
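The relation between the two listings can be checked by hand: ranking the original cluster sizes from largest to smallest gives the new sequence. The short sketch below does this for the numbers printed above. How the script orders clusters of equal size (here the original clusters 6 and 8, and 9 to 11) cannot be read from the output, so the tie-break on the original label is an assumption.

```python
# Cluster sizes as reported before the sort (original label -> molecules).
before = {1: 11, 2: 28, 3: 29, 4: 14, 5: 8, 6: 2, 7: 3, 8: 2, 9: 1, 10: 1, 11: 1}

# Rank by size, largest first; the tie-break on the original label is an assumption.
ranked = sorted(before, key=lambda label: (-before[label], label))
old_to_new = {old: new for new, old in enumerate(ranked, start=1)}
after = {old_to_new[old]: size for old, size in before.items()}

for label in sorted(after):
    print(f"cluster: {label:8d} molecules: {after[label]:8d}")
```

Under this assumption, the original cluster 3 (29 molecules) becomes cluster 1 and the original cluster 1 (11 molecules) becomes cluster 4; the sizes sum to the 100 molecules of the library.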

A running instance of DataWarrior was able to read the newly written file 100Random_Molecules_sort.txt both via File -> Open and via the shortcut Ctrl + O.

content of the project

tree
.
├── datawarrior_clustersort.py
├── LICENSE
├── README.html
├── README.md
├── README.org
└── test_data
    ├── 100Random_Molecules.dwar
    ├── 100Random_Molecules_sort.txt
    └── 100Random_Molecules.txt

2 directories, 8 files
