This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Fold2Seq Data and Feature Generation

Data

The CATH IDs of the protein domains in the training set, the validation set, and the two test sets are in pdb_lists/.

Feature Generation:

Input File:

To generate secondary structure element (SSE) density features, first provide a file that describes all input proteins. Each row describes one protein domain, with the following columns:
- Column 1: The path to the PDB file
- Column 2: The PDB ID
- Column 3: The chain ID
- Column 4: The starting residue ID
- Column 5: The ending residue ID
An example of this input file is example/domain_list.txt.
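
Before running feature generation, it can help to sanity-check the input file. The following is a minimal sketch, not part of this repository: it assumes the five columns are separated by whitespace (so paths must not contain spaces), and the names DomainRecord and parse_domain_list, as well as the sample row, are hypothetical. Check example/domain_list.txt for the actual delimiter.

```python
from typing import List, NamedTuple


class DomainRecord(NamedTuple):
    """One row of the domain list (hypothetical helper type)."""
    pdb_path: str
    pdb_id: str
    chain_id: str
    start_res: str  # kept as text: PDB residue IDs may carry insertion codes
    end_res: str


def parse_domain_list(text: str) -> List[DomainRecord]:
    """Parse domain-list text, assuming whitespace-separated columns."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 columns, got {len(fields)}"
            )
        records.append(DomainRecord(*fields))
    return records


# Made-up row for illustration only.
sample_text = "example/1abc.pdb 1abc A 1 120\n"
for record in parse_domain_list(sample_text):
    print(record.pdb_id, record.chain_id, record.start_res, record.end_res)
```

To check a real file, read it first, for example with `parse_domain_list(open("example/domain_list.txt").read())`.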

Secondary Structure Assignment:

You also need to pre-assign a secondary structure element to each residue. We provide an assignment file (ss.txt), obtained from RCSB PDB, that covers most existing PDB entries. First check whether your protein is in this file. If it is not, append an entry for it following the format used in the file.
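
The check described above can be sketched as follows. This is an assumption-laden sketch, not code from this repository: it assumes ss.txt uses FASTA-style header lines of the form `>1ABC:A:sequence` and `>1ABC:A:secstr` (the layout RCSB PDB used for its ss.txt download), and the function name has_ss_entry and the sample lines are hypothetical. Verify the header format against your copy of ss.txt before relying on it.

```python
from typing import Iterable


def has_ss_entry(lines: Iterable[str], pdb_id: str, chain_id: str) -> bool:
    """Return True if a secondary structure record exists for this chain.

    Assumes headers look like '>1ABC:A:secstr' (an assumption about ss.txt).
    """
    wanted = (pdb_id.upper(), chain_id)
    for line in lines:
        if not line.startswith(">"):
            continue  # only header lines identify an entry
        parts = line[1:].strip().split(":")
        if len(parts) == 3 and parts[2] == "secstr":
            if (parts[0].upper(), parts[1]) == wanted:
                return True
    return False


# Made-up lines standing in for the contents of ss.txt.
sample_lines = [
    ">1ABC:A:sequence\n",
    "MKV\n",
    ">1ABC:A:secstr\n",
    " HH\n",
]
print(has_ss_entry(sample_lines, "1abc", "A"))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly (`with open("ss.txt") as fh: has_ss_entry(fh, "1abc", "A")`), which avoids loading the whole file into memory.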

Generating features:

To generate SSE density features, you can run:

python fold_feat_gen.py --domain_list example/domain_list.txt --ss ss.txt --out $path_to_the_output_dictionary

This generates a Python dictionary containing the input information and the fold features in fold_features/.
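
When feature generation is driven from another script, the command above can be assembled programmatically. This is only a sketch: the three flags are the ones shown above, but the helper name build_feature_command and the output path fold_features/example_out are hypothetical placeholders, and the subprocess call is left commented out because it must be run from the directory that contains fold_feat_gen.py.

```python
import shlex
import sys
from typing import List


def build_feature_command(domain_list: str, ss_file: str, out_path: str) -> List[str]:
    """Build the argument list for fold_feat_gen.py (hypothetical helper)."""
    return [
        sys.executable,  # use the same interpreter that runs this script
        "fold_feat_gen.py",
        "--domain_list", domain_list,
        "--ss", ss_file,
        "--out", out_path,
    ]


cmd = build_feature_command(
    "example/domain_list.txt",
    "ss.txt",
    "fold_features/example_out",  # placeholder output path, an assumption
)
print(shlex.join(cmd))  # show the command that would be executed

# To actually run it from the directory containing fold_feat_gen.py:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain unusual characters.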