Numerous studies have demonstrated that biological sequences, such as DNA, RNA, and peptides, can be considered the "language of life". Utilizing pre-trained language models (LMs) like ESM2, GPT, and BERT has yielded state-of-the-art (SOTA) results in many cases. However, the increasing size of datasets exponentially escalates the time and hardware resources required to fine-tune a complete LM. This paper assumes that natural language shares linguistic logic with the "language of life", such as peptides. Taking the BERT LM as an example, we propose a novel Principal Component Analysis (PCA)-based Ying-Yang dilution network over the inter- and intra-BERT layers, termed TaiChiNet, for the feature representation of peptide sequences. The Ying-Yang dilution architecture fuses the PCA transformation matrices trained on the positive and negative samples, respectively. We transferred the TaiChiNet features into a subtractive-layer feature space and observed that TaiChiNet merely rotates the original subtractive features by a certain angle without changing the relative distances among the dimensions. The TaiChiNet-engineered features were integrated with the hand-crafted (HC) ones to build a prediction model for anti-coronavirus peptides (TaiChiACVP). Experimental results demonstrated that the TaiChiACVP model achieves new SOTA performance and remarkably short training time on five imbalanced datasets established for the anti-coronavirus peptide (ACVP) prediction task. The decision paths of the random forest classifier illustrate that TaiChiNet features can complement HC features for better decisions. TaiChiNet has also learned latent features significantly correlated with physicochemical properties, including molecular weight, which makes an explainable connection between the deep-learning-represented features and the ACVP-associated physicochemical properties. Additionally, we extended our work to other LMs, including ESM2 with 6 or 12 layers, the small and base versions of ProGen2, ProtBERT, and ProtGPT2. Due to the limitations of these recent LMs, none of them outperforms TaiChiACVP. However, some limitations of TaiChiNet remain to be investigated in the future, including learnable rotation degrees, extended fusions of more layers, and an end-to-end training architecture.
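For intuition, here is a minimal, self-contained sketch of the Ying-Yang dilution idea: two PCA transforms are fitted on the positive and negative samples separately, and their projection matrices are fused. The equal-weight blend and the shared centering below are illustrative assumptions, not the exact TaiChiNet fusion rule.

```python
# Hedged sketch of a PCA-based Ying-Yang dilution: two PCA transforms are
# fitted on positive and negative samples separately, then their projection
# matrices are fused. The 0.5/0.5 blend is an illustrative assumption,
# not the exact TaiChiNet fusion rule.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(100, 768))   # e.g., BERT embeddings of positive peptides
X_neg = rng.normal(size=(120, 768))   # e.g., BERT embeddings of negative peptides

k = 16
pca_pos = PCA(n_components=k).fit(X_pos)
pca_neg = PCA(n_components=k).fit(X_neg)

# Fuse the two projection matrices (components_ has shape (k, 768)).
W_fused = 0.5 * pca_pos.components_ + 0.5 * pca_neg.components_

def taichi_features(X):
    """Project inputs through the fused matrix after shared centering."""
    X_centered = X - 0.5 * (pca_pos.mean_ + pca_neg.mean_)
    return X_centered @ W_fused.T

print(taichi_features(X_pos).shape)  # (100, 16)
```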

The original `.faa` files are in `PreAntiCoV`.
Run `python feature_extract.py` (the code is derived from PreAntiCoV); the corresponding features will be stored in `./data`.
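For orientation, `feature_extract.py` produces hand-crafted descriptors. Below is a minimal sketch of one classic descriptor, amino-acid composition; the actual PreAntiCoV-derived feature set is broader, and the peptide string is only an example.

```python
# Hedged sketch: amino-acid composition (AAC), one classic hand-crafted
# peptide descriptor. The actual feature_extract.py computes a richer set.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Fraction of each of the 20 standard residues in the peptide."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

print(aac("GIGAVLKVLTTGLPALISWIKRKRQQ"))  # 20-dimensional vector
```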
Check the folder name: the root folder MUST be named `TaiChiACVP`.

```bash
cd TaiChiACVP
conda env create -f environments.yaml
conda activate taichinet
python bert_features.py
```
The corresponding features will be stored in `./embedding_features`.
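For reference, here is a hedged sketch of what per-layer BERT feature extraction can look like with HuggingFace `transformers`. The checkpoint, the pooling, and the `.npy` file naming below are assumptions; `bert_features.py` defines the actual conventions.

```python
# Hedged sketch of per-layer BERT feature extraction for peptides. Model
# name, pooling, and file naming are assumptions; bert_features.py defines
# the actual conventions.
import os
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

peptides = ["GIGAVLKVLTTGLPALISWIKRKRQQ", "KWKLFKKIEKVGQNIRDGIIKAGPAV"]
os.makedirs("embedding_features", exist_ok=True)

with torch.no_grad():
    for i, pep in enumerate(peptides):
        # Space-separated residues so each amino acid becomes one token.
        inputs = tokenizer(" ".join(pep), return_tensors="pt")
        hidden = model(**inputs).hidden_states          # tuple: embeddings + 12 layers
        per_layer = torch.stack(hidden).mean(dim=2)     # mean-pool over tokens
        np.save(f"embedding_features/peptide_{i}_bert.npy",
                per_layer.squeeze(1).numpy())           # (13, 768)
```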
```bash
python raw_bert_org.py
```
The corresponding results will be stored in `./outputs`. Check the Balanced RF (random forest) results for each layer.
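Below is a hedged sketch of such a per-layer Balanced RF comparison, using `imbalanced-learn`'s `BalancedRandomForestClassifier` on stand-in data; the repo's actual metrics and splits may differ.

```python
# Hedged sketch: score each BERT layer with a balanced random forest,
# mirroring the per-layer comparison that raw_bert_org.py reports.
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_layers, n_samples, dim = 13, 200, 768
X_layers = rng.normal(size=(n_layers, n_samples, dim))  # stand-in embeddings
y = rng.integers(0, 2, size=n_samples)                  # stand-in labels

for layer in range(n_layers):
    clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
    score = cross_val_score(clf, X_layers[layer], y, cv=5, scoring="f1").mean()
    print(f"layer {layer:2d}: mean F1 = {score:.3f}")
```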
```bash
python pcxa_layer_org.py
```
The corresponding results will be stored in `./outputs`. Check the cross-validation results.
- Modify the parameters `features.peptide.LLM=true` and `features.peptide.HC=true` in `./config/train_org.yaml` before running:
```yaml
features:
  peptide:
    LLM: true
    HC: true
  properties:
    load: true
```
- Then run:

```bash
python train.py
python interpretation.py
```
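If you prefer to toggle these switches programmatically instead of editing the file by hand, here is a hedged snippet using OmegaConf; it assumes `train_org.yaml` is plain YAML with the structure shown above (note that round-tripping drops any comments in the file).

```python
# Hedged convenience snippet: toggle the feature switches in
# ./config/train_org.yaml with OmegaConf instead of editing by hand.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/train_org.yaml")
cfg.features.peptide.LLM = True
cfg.features.peptide.HC = True
OmegaConf.save(cfg, "config/train_org.yaml")
```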
- Modify the parameter `plot.tree=true` in `./config/train_org.yaml` before running. NOTICE: this change will make the code run really slowly:
```yaml
plot:
  tree: true
```
- Then run:

```bash
python train.py
```
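For context on why this option is slow: tree plotting renders the decision paths of the forest. A hedged, generic sketch with scikit-learn's `plot_tree` (not the repo's own plotting code) illustrates the kind of rendering involved.

```python
# Hedged sketch: draw one estimator of a fitted random forest with
# sklearn's plot_tree; rendering many such trees is what makes
# plot.tree=true slow.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(forest.estimators_[0], max_depth=3, filled=True, ax=ax)
fig.savefig("tree_example.png", dpi=150)
```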
The dilution types `taichi`, `pcxa`, `pp+nn`, `np+n`, `pn+p`, `np`, and `pn` are available for testing.
- Modify the parameter `pretrained.dilation.type` in `./config/train_org.yaml` before running:
```yaml
pretrained:
  dilation:
    type: 'taichi'
```
- Then run:

```bash
python train.py
```
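To compare all dilution types in one go, here is a hedged convenience loop that rewrites the config and re-invokes `train.py` between runs; it assumes the YAML structure shown above.

```python
# Hedged convenience loop: run train.py once per dilution type by rewriting
# the config between runs. Assumes train.py reads ./config/train_org.yaml.
import subprocess
from omegaconf import OmegaConf

for dilation_type in ["taichi", "pcxa", "pp+nn", "np+n", "pn+p", "np", "pn"]:
    cfg = OmegaConf.load("config/train_org.yaml")
    cfg.pretrained.dilation.type = dilation_type
    OmegaConf.save(cfg, "config/train_org.yaml")
    subprocess.run(["python", "train.py"], check=True)
```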
- Users should save the features of other LMs into `./embedding_features`; the file names should follow the same rules as BERT in `bert_features.py` (see the hedged extraction sketch after this section).
- Modify the parameters `pretrained.dilation.topK` and `pretrained.dilation.features_list` in `./config/train_org.yaml` before running. NOTICE: the two lists must have the same length:
```yaml
pretrained:
  dilation:
    type: 'taichi'
    topK: [2]              # must have the same length as features_list
    features_list: ['bert']
    comparison: false
    percentage: -1
```
- Then run:

```bash
python train.py
```
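As referenced above, here is a hedged sketch of extracting per-layer features from another LM (ESM2 via HuggingFace `transformers`). The checkpoint and the output file naming are assumptions; match the naming convention used for BERT in `bert_features.py` before training.

```python
# Hedged sketch: per-layer ESM2 embeddings for custom LM features. The model
# checkpoint and output file naming are assumptions; match the naming
# convention used for BERT in bert_features.py.
import os
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t6_8M_UR50D"  # 6-layer ESM2
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

os.makedirs("embedding_features", exist_ok=True)
with torch.no_grad():
    for i, pep in enumerate(["GIGAVLKVLTTGLPALISWIKRKRQQ"]):
        inputs = tokenizer(pep, return_tensors="pt")
        hidden = model(**inputs).hidden_states        # embeddings + 6 layers
        pooled = torch.stack(hidden).mean(dim=2)      # mean-pool over tokens
        np.save(f"embedding_features/peptide_{i}_esm2.npy",
                pooled.squeeze(1).numpy())
```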
Please contact kwbb1997@gmail.com.
This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), Guizhou Provincial Science and Technology Projects (ZK2023-297), the Science and Technology Foundation of Health Commission of Guizhou Province (gzwkj2023-565), Science and Technology Project of Education Department of Jilin Province (JJKH20220245KJ and JJKH20220226SK), the National Natural Science Foundation of China (62072212 and U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), and the Fundamental Research Funds for the Central Universities, JLU.


