To demonstrate the ``SequenceFeature().get_df_parts()`` method, we first load an example sequence dataset using the ``load_dataset()`` function:

In [1]:
import aaanalysis as aa
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="SEQ_CAPSID")
aa.display_df(df_seq, n_rows=3, show_shape=True)


DataFrame shape: (7862, 3)


Unnamed: 0,entry,sequence,label
1,CAPSID_1,MVTHNVKINKHVTRR...DTPRIPATKLDEENV,0
2,CAPSID_2,MKKRQKKMTLSNFTD...AMLEAVINARHFGEE,0
3,CAPSID_3,MRYGGSVISQELEVS...PGSGAPDKQEVELVD,0


By default, three sequence parts (``tmd``, ``jmd_n_tmd_n``, ``tmd_c_jmd_c``) are returned, with ``jmd_n`` and ``jmd_c`` each 10 residues long:

In [2]:
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df=df_parts, n_rows=5, show_shape=True)

DataFrame shape: (7862, 3)


Unnamed: 0,tmd,jmd_n_tmd_n,tmd_c_jmd_c
CAPSID_1,HVTRRSYSSAKEVLE...KPDLFKGDDDDTPRI,MVTHNVKINKHVTRR...SPLNDDGSFTNPTVT,ARHGDNNIETPIEDV...DTPRIPATKLDEENV
CAPSID_2,SNFTDTSFQDFVSAE...QMQNGVFMRMAMLEA,MKKRQKKMTLSNFTD...PSQMMLDLMTIHEEF,GHFDGLSIGIVGDLS...AMLEAVINARHFGEE
CAPSID_3,ELEVSLHMAFVEARS...RLSFDGAGKSPGSGA,MRYGGSVISQELEVS...FEEHHNVRYSAAALS,AAAELSARYINDRHL...PGSGAPDKQEVELVD
CAPSID_4,VGRHRRIKAEDVSKYQRIRDEKRRQALE,MERGDIPFKYVGRHRRIKAEDVSK,YQRIRDEKRRQALEELAEMDAGLI
CAPSID_5,LILFTQVSTAQKKAI...TLAAMLQIQMPSGCV,MKRIYLLFAALILFT...KLYMGKNYEMIKSTP,YGNTITLDLAKQAIA...PSGCVGEPITELTHQ
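
How the three default parts relate to each other can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: the helper ``split_sequence`` is hypothetical, the TMD is assumed to be halved at ``len(tmd) // 2``, and the sequence is reconstructed by concatenating the CAPSID_4 parts displayed above.

```python
def split_sequence(seq, jmd_n_len=10, jmd_c_len=10):
    """Hypothetical helper: split a sequence into (jmd_n, tmd, jmd_c)."""
    if jmd_n_len < 0 or jmd_c_len < 0 or jmd_n_len + jmd_c_len > len(seq):
        raise ValueError("JMD lengths must be non-negative and fit into the sequence")
    tmd_stop = len(seq) - jmd_c_len  # index where the C-terminal JMD begins
    return seq[:jmd_n_len], seq[jmd_n_len:tmd_stop], seq[tmd_stop:]


# CAPSID_4, reconstructed from the parts displayed above
seq = "MERGDIPFKYVGRHRRIKAEDVSKYQRIRDEKRRQALEELAEMDAGLI"
jmd_n, tmd, jmd_c = split_sequence(seq)

# Assumption: the TMD is split into N- and C-terminal halves at len(tmd) // 2
half = len(tmd) // 2
default_parts = {
    "tmd": tmd,
    "jmd_n_tmd_n": jmd_n + tmd[:half],
    "tmd_c_jmd_c": tmd[half:] + jmd_c,
}
for name, part in default_parts.items():
    print(f"{name:12s} {part}")
```

For this even-length TMD (28 residues), the sketch reproduces the CAPSID_4 row shown above. How the library treats odd-length TMDs is not covered here.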


Any combination of valid sequence parts can be obtained using the ``list_parts`` parameter:

In [3]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'])
aa.display_df(df=df_parts, n_rows=3, show_shape=True)

DataFrame shape: (7862, 4)


Unnamed: 0,jmd_n,tmd,jmd_c,tmd_jmd
CAPSID_1,MVTHNVKINK,HVTRRSYSSAKEVLE...KPDLFKGDDDDTPRI,PATKLDEENV,MVTHNVKINKHVTRR...DTPRIPATKLDEENV
CAPSID_2,MKKRQKKMTL,SNFTDTSFQDFVSAE...QMQNGVFMRMAMLEA,VINARHFGEE,MKKRQKKMTLSNFTD...AMLEAVINARHFGEE
CAPSID_3,MRYGGSVISQ,ELEVSLHMAFVEARS...RLSFDGAGKSPGSGA,PDKQEVELVD,MRYGGSVISQELEVS...PGSGAPDKQEVELVD


Set the lengths of the two JMDs with the ``jmd_n_len`` and ``jmd_c_len`` parameters:

In [4]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'tmd', 'jmd_c', 'tmd_jmd'], jmd_c_len=5, jmd_n_len=20)
aa.display_df(df=df_parts, n_rows=3, show_shape=True)

DataFrame shape: (7862, 4)


Unnamed: 0,jmd_n,tmd,jmd_c,tmd_jmd
CAPSID_1,MVTHNVKINKHVTRRSYSSA,KEVLEIPPLTEVQTA...KGDDDDTPRIPATKL,DEENV,MVTHNVKINKHVTRR...DTPRIPATKLDEENV
CAPSID_2,MKKRQKKMTLSNFTDTSFQD,FVSAEQVDDKSAMAL...VFMRMAMLEAVINAR,HFGEE,MKKRQKKMTLSNFTD...AMLEAVINARHFGEE
CAPSID_3,MRYGGSVISQELEVSLHMAF,VEARSARHEFITVEH...GAGKSPGSGAPDKQE,VELVD,MRYGGSVISQELEVS...PGSGAPDKQEVELVD
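
The effect of ``jmd_n_len`` on the part boundary can be checked with simple slicing. The following sketch uses only the first 25 residues of CAPSID_1 as they appear in the outputs above; ``split_n_terminus`` is a hypothetical helper for illustration, not part of the package.

```python
def split_n_terminus(prefix, jmd_n_len):
    """Hypothetical helper: cut an N-terminal stretch into (jmd_n, start of TMD)."""
    if not 0 <= jmd_n_len <= len(prefix):
        raise ValueError("jmd_n_len must lie between 0 and the prefix length")
    return prefix[:jmd_n_len], prefix[jmd_n_len:]


# First 25 residues of CAPSID_1: default jmd_n plus the displayed start of the TMD
prefix = "MVTHNVKINK" + "HVTRRSYSSAKEVLE"

for jmd_n_len in (10, 20):
    jmd_n, tmd_head = split_n_terminus(prefix, jmd_n_len)
    print(f"jmd_n_len={jmd_n_len:2d}  jmd_n={jmd_n:20s}  tmd starts with {tmd_head}")
```

Increasing ``jmd_n_len`` from 10 to 20 moves ten residues from the start of the TMD into ``jmd_n``, which matches the two outputs shown for CAPSID_1.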


A JMD length of 0 yields an empty part, as shown here for ``jmd_n``:

In [5]:
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=['jmd_n', 'jmd_n_tmd_n'], jmd_n_len=0)
aa.display_df(df=df_parts, n_rows=3, show_shape=True)

DataFrame shape: (7862, 2)


Unnamed: 0,jmd_n,jmd_n_tmd_n
CAPSID_1,,MVTHNVKINKHVTRR...AQANSPLNDDGSFTN
CAPSID_2,,MKKRQKKMTLSNFTD...GAGQHPSQMMLDLMT
CAPSID_3,,MRYGGSVISQELEVS...RGLKSRFEEHHNVRY
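
When re-implementing such a split by hand, a length of 0 needs care at the C-terminus, because a negative-index slice with 0 returns the whole sequence rather than an empty string. The sketch below uses a toy sequence and illustrates general Python slicing behavior only; it makes no claim about how the library handles this internally.

```python
seq = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, for illustration only
jmd_n_len, jmd_c_len = 0, 0

# N-terminal side: seq[:0] is empty, as intended
jmd_n = seq[:jmd_n_len]

# C-terminal side: seq[-0:] equals seq[0:], i.e. the whole sequence (pitfall)
naive_jmd_c = seq[-jmd_c_len:]

# Computing the start index explicitly gives the intended empty part
safe_jmd_c = seq[len(seq) - jmd_c_len:]

print(repr(jmd_n), repr(naive_jmd_c), repr(safe_jmd_c))
```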


To select all possible parts, set ``all_parts=True``: 

In [6]:
df_parts = sf.get_df_parts(df_seq=df_seq, all_parts=True)
aa.display_df(df=df_parts, n_rows=3, show_shape=True)

DataFrame shape: (7862, 8)


Unnamed: 0,tmd,tmd_n,tmd_c,jmd_n,jmd_c,tmd_jmd,jmd_n_tmd_n,tmd_c_jmd_c
CAPSID_1,HVTRRSYSSAKEVLE...KPDLFKGDDDDTPRI,HVTRRSYSSAKEVLE...SPLNDDGSFTNPTVT,ARHGDNNIETPIEDV...KPDLFKGDDDDTPRI,MVTHNVKINK,PATKLDEENV,MVTHNVKINKHVTRR...DTPRIPATKLDEENV,MVTHNVKINKHVTRR...SPLNDDGSFTNPTVT,ARHGDNNIETPIEDV...DTPRIPATKLDEENV
CAPSID_2,SNFTDTSFQDFVSAE...QMQNGVFMRMAMLEA,SNFTDTSFQDFVSAE...PSQMMLDLMTIHEEF,GHFDGLSIGIVGDLS...QMQNGVFMRMAMLEA,MKKRQKKMTL,VINARHFGEE,MKKRQKKMTLSNFTD...AMLEAVINARHFGEE,MKKRQKKMTLSNFTD...PSQMMLDLMTIHEEF,GHFDGLSIGIVGDLS...AMLEAVINARHFGEE
CAPSID_3,ELEVSLHMAFVEARS...RLSFDGAGKSPGSGA,ELEVSLHMAFVEARS...FEEHHNVRYSAAALS,AAAELSARYINDRHL...RLSFDGAGKSPGSGA,MRYGGSVISQ,PDKQEVELVD,MRYGGSVISQELEVS...PGSGAPDKQEVELVD,MRYGGSVISQELEVS...FEEHHNVRYSAAALS,AAAELSARYINDRHL...PGSGAPDKQEVELVD
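
All eight parts in this output can be derived from the three base parts ``jmd_n``, ``tmd``, and ``jmd_c``. The sketch below illustrates selection by name with a lookup table. ``PART_BUILDERS`` and ``build_parts`` are hypothetical names, the halving of the TMD at ``len(tmd) // 2`` is an assumption, and the base parts are those of CAPSID_4 as derived from the default output shown earlier.

```python
# Hypothetical registry: each part name maps to a function of (jmd_n, tmd, jmd_c).
# Assumption: tmd_n and tmd_c are the two halves of the TMD, split at len(tmd) // 2.
PART_BUILDERS = {
    "tmd": lambda n, t, c: t,
    "tmd_n": lambda n, t, c: t[:len(t) // 2],
    "tmd_c": lambda n, t, c: t[len(t) // 2:],
    "jmd_n": lambda n, t, c: n,
    "jmd_c": lambda n, t, c: c,
    "tmd_jmd": lambda n, t, c: n + t + c,
    "jmd_n_tmd_n": lambda n, t, c: n + t[:len(t) // 2],
    "tmd_c_jmd_c": lambda n, t, c: t[len(t) // 2:] + c,
}


def build_parts(jmd_n, tmd, jmd_c, list_parts=None):
    """Return the requested parts as a dict; all known parts if list_parts is None."""
    names = list(PART_BUILDERS) if list_parts is None else list(list_parts)
    unknown = [name for name in names if name not in PART_BUILDERS]
    if unknown:
        raise ValueError(f"Unknown sequence parts: {unknown}")
    return {name: PART_BUILDERS[name](jmd_n, tmd, jmd_c) for name in names}


# Base parts of CAPSID_4
base = ("MERGDIPFKY", "VGRHRRIKAEDVSKYQRIRDEKRRQALE", "ELAEMDAGLI")
all_parts = build_parts(*base)
subset = build_parts(*base, list_parts=["jmd_n", "tmd_jmd"])
print(list(all_parts))
print(subset)
```

Keeping the part definitions in one table makes it easy to reject misspelled part names before any sequence is processed.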


Entries with sequence gaps can be removed by setting ``remove_entries_with_gaps=True``:

In [7]:
n_total = len(df_parts)
# Disable the warning that entries have been removed and re-instantiate SequenceFeature
aa.options["verbose"] = False
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, remove_entries_with_gaps=True)
n_removed = n_total - len(df_parts)
print(f"{n_removed} sequences with gaps were removed")

5 sequences with gaps were removed
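
The idea behind this filtering step can be sketched without the package. The entries below are hypothetical toy data, and the gap symbol ``-`` is an assumption made for this illustration.

```python
GAP = "-"  # assumption: gaps are encoded as "-"

# Hypothetical toy entries; entry_2 contains a gap
entries = {
    "entry_1": "MKKRQKKMTL",
    "entry_2": "MRYG-SVISQ",
    "entry_3": "MVTHNVKINK",
}

# Keep only the entries whose sequence is free of gap symbols
kept = {name: seq for name, seq in entries.items() if GAP not in seq}
n_removed = len(entries) - len(kept)
print(f"{n_removed} of {len(entries)} entries contain gaps and were dropped")
```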


``SequenceFeature().get_df_parts()`` accepts four different ``df_seq`` formats, which we demonstrate using the ``DOM_GSEC`` dataset of domain-level γ-secretase substrates (see [Breimann24c]_). In addition to the common 'entry', 'sequence', and 'label' columns, this dataset provides the TMD start and stop positions ('tmd_start', 'tmd_stop') and the default sequence parts ('jmd_n', 'tmd', 'jmd_c'):

In [8]:
df_seq = aa.load_dataset(name="DOM_GSEC")
aa.display_df(df_seq, n_rows=5)

Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c
1,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
2,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
3,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER
4,Q03157,MGPTSPAARGQGRRW...HGYENPTYRFLEERP,1,585,607,APSGTGVSRE,ALSGLLIMGAGGGSLIVLSLLLL,RKKKPYGTIS
5,Q06481,MAATGTAAAAATGRL...GYENPTYKYLEQMQI,1,694,716,LREDFSLSSS,ALIGLLVIAVAIATVIVISLVML,RKRQYGTISH


1. ``Position-based format``:

In [9]:
cols_position_based = ["entry", "sequence", "tmd_start", "tmd_stop"]
list_parts = ["jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_position_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

Unnamed: 0,jmd_n,tmd,jmd_c
P05067,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
P14925,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
P70180,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER
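
The position-based format can be mimicked with plain slicing. For P05067, ``tmd_start=701`` and ``tmd_stop=723`` bracket a 23-residue TMD, which suggests 1-based, inclusive positions; the sketch assumes this convention. ``parts_from_positions`` is a hypothetical helper, and because the full sequence is abbreviated in the table, the regions outside the three parts are padded with placeholder ``G`` residues.

```python
def parts_from_positions(seq, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Hypothetical helper: tmd_start and tmd_stop are 1-based and inclusive."""
    start, stop = tmd_start - 1, tmd_stop  # convert to a 0-based, half-open slice
    if not 0 <= start < stop <= len(seq):
        raise ValueError("TMD positions must lie within the sequence")
    # Assumption of this sketch: a JMD is shortened if it would run past a sequence end
    jmd_n = seq[max(0, start - jmd_n_len):start]
    tmd = seq[start:stop]
    jmd_c = seq[stop:stop + jmd_c_len]
    return jmd_n, tmd, jmd_c


# P05067 parts from the table above; everything else is placeholder padding
jmd_n, tmd, jmd_c = "FAEDVGSNKG", "AIIGLMVGGVVIATVIVITLVML", "KKKQYTSIHH"
seq = "G" * 690 + jmd_n + tmd + jmd_c + "G" * 37

print(parts_from_positions(seq, tmd_start=701, tmd_stop=723))
```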


2. ``Part-based format``:

In [10]:
cols_part_based = ["entry", "jmd_n", "tmd", "jmd_c"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_part_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

Unnamed: 0,jmd_n,tmd,jmd_c
P05067,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
P14925,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
P70180,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER


3. ``Sequence-TMD-based format``:

In [11]:
cols_sequence_tmd_based = ["entry", "sequence", "tmd"]
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_tmd_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

Unnamed: 0,jmd_n,tmd,jmd_c
P05067,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH
P14925,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD
P70180,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER
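
When only the sequence and the TMD are given, the TMD first has to be located within the sequence. One way to sketch this is a substring search. ``parts_from_tmd`` is a hypothetical helper, it assumes that the first occurrence of the TMD is the relevant one, and the P14925 parts are embedded in placeholder ``G`` residues because the full sequence is abbreviated in the table.

```python
def parts_from_tmd(seq, tmd, jmd_n_len=10, jmd_c_len=10):
    """Hypothetical helper: locate the TMD in seq and cut out the flanking JMDs."""
    start = seq.find(tmd)  # assumption: the first occurrence is the TMD
    if start == -1:
        raise ValueError("TMD not found in sequence")
    stop = start + len(tmd)
    jmd_n = seq[max(0, start - jmd_n_len):start]
    jmd_c = seq[stop:stop + jmd_c_len]
    return jmd_n, tmd, jmd_c


# P14925 parts from the table above, surrounded by placeholder padding
jmd_n, tmd, jmd_c = "KLSTEPGSGV", "SVVLITTLLVIPVLVLLAIVMFI", "RWKKSRAFGD"
seq = "G" * 5 + jmd_n + tmd + jmd_c + "G" * 5

print(parts_from_tmd(seq, tmd))
```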


4. ``Sequence-based format``:

In [13]:
cols_sequence_based = ["entry", "sequence"]
# With only the sequence provided, jmd_n and jmd_c are the flanking regions of the
# given lengths, and tmd is the remaining middle part
df_parts = sf.get_df_parts(df_seq=df_seq[cols_sequence_based], list_parts=list_parts)
aa.display_df(df_parts, n_rows=3)

Unnamed: 0,jmd_n,tmd,jmd_c
P05067,MLPGLALLLL,AAWTARALEVPTDGN...ERHLSKMQQNGYENP,TYKFFEQMQN
P14925,MAGRARSGLL,LLLLGLLALQSSCLA...EKDEDDGTESEEEYS,APLPKPAPSS
P70180,MRSLLLFTFS,ACVLLARVLLAGGAS...QQEESNIGKHRELRE,DSIRSHFSVA
