<a href="https://colab.research.google.com/github/ItunuIsewon/MACHINE-LEARNING-FOR-HOST-PATHOGEN-PROTEIN-PROTEIN-INTERACTION-PREDICTION-TUTORIAL/blob/main/Feature_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#FEATURE EXTRACTION FOR HP-PPI PREDICTION
Feature extraction transforms protein sequences, which are strings of amino acids, into numerical vectors that machine learning models can process. This transformation is essential because ML algorithms cannot operate directly on strings and require numerical input. During this transformation, raw sequences of varying length are converted into fixed-length numerical representations that capture the biological properties of the protein.
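
To make the idea of a fixed-length representation concrete, here is a minimal sketch of amino acid composition computed by hand. The function name `aac` and the toy sequences are illustrative assumptions and are not part of iFeature; the point is only that sequences of different lengths map to vectors of the same length.

```python
# Standard 20 amino acids in alphabetical one-letter order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    total = len(sequence)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]

# Two toy sequences of different lengths (3 and 11 residues)
short_vec = aac("MKV")
long_vec = aac("MKVLAAGIVKK")

# Both vectors have one entry per amino acid, regardless of sequence length
print(len(short_vec), len(long_vec))
```

Whatever the input length, the model always sees 20 numbers per protein for this descriptor.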

Several feature extraction methods have been developed to achieve this. These descriptors can be used individually or in combination to represent protein sequences. While combining multiple features may enhance the predictive performance of ML models, it increases computation time and memory usage. High-dimensional feature vectors can also lead to overfitting and to the inclusion of redundant or irrelevant features. Each descriptor produces a vector of a different dimension and has a different computation time; descriptors with more features generally take longer to compute.


In this tutorial, we will use the iFeature package, a widely used Python-based toolkit for sequence feature extraction. iFeature offers a wide range of descriptors and is user-friendly.

In this notebook, we will demonstrate how to use iFeature to:
* Extract features from host and pathogen protein sequences,
* Use descriptors both individually and in combination, and
* Prepare the data for machine learning model training.



**Step 1:** Install and import necessary packages

In [None]:
!pip install tqdm




In [None]:
!git clone https://github.com/Superzchen/iFeature.git


Cloning into 'iFeature'...
remote: Enumerating objects: 322, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 322 (delta 40), reused 33 (delta 29), pack-reused 275 (from 1)
Receiving objects: 100% (322/322), 6.72 MiB | 27.32 MiB/s, done.
Resolving deltas: 100% (150/150), done.


In [None]:
#Mounting Google Drive to access files
import os
from google.colab import drive
drive.mount ('/content/my_drive')

#import pandas for data manipulation
import pandas as pd

from tqdm.notebook import tqdm




Mounted at /content/my_drive


**Step 2:** Load your dataset.

For this notebook, we created a new folder named Features in the HPI folder. We also moved the merged HPI dataset into this folder for feature extraction.

In [None]:
file_path = '/content/my_drive/My Drive/HPI/Features'
HPIdata = file_path + "/merged_hpi_dataset.csv"
df = pd.read_csv(HPIdata)

# Preview data
df.head()

Unnamed: 0,host_sequence,pathogen_sequence,label
0,MKIITYFCIWAVAWAIPVPQSKPLERHVEKSMNLHLLARSNVSVQD...,MYEANILLVDDETAILQLLTTILEKEGFSHITTATSAEMALSLTQQ...,0
1,MPGSLPLNAEACWPKDVGIVALEIYFPSQYVDQAELEKYDGVDAGK...,MKTVVIKRDGCQVPFDEVRIKEAVERAALAVGVVDADYCATVARVV...,1
2,MTRRCMPARPGFPSSPAPGSSPPRCHLRPGSTAHAAAGKRTESPGD...,MQRKKGAYAPVFYPAIVIAAILSLLGVLVPVAFANNIDIIQNLILE...,0
3,MSGARCRTLYPFSGERHGQGLRFAAGELITLLQVPDGGWWEGEKED...,MLLHLSIKNFAIIKSTEIDFREGMTVLTGETGAGKSILLDALSFVL...,1
4,MPAESGKRFKPSKYVPVSAAAIFLVGATTLFFAFTCPGLSLYVSPA...,MSDTSTDLQNGFDFAGLAASMALAAKNNEFTMATAAFIGMLNEPVK...,1


In [None]:
len(df)

8894

**Step 3: Generate FASTA files for host and pathogen protein sequences**




In this step, we will assign unique ids to each interacting pair and generate corresponding FASTA files for their sequences. These identifiers maintain the pairing structure for feature extraction.

In [None]:

with open("host.fasta", "w") as host_f, open("pathogen.fasta", "w") as patho_f:
    for i, row in tqdm(df.iterrows(), total=len(df), desc="Writing FASTA"):
        host_f.write(f">H{i+1}\n{row['host_sequence']}\n")
        patho_f.write(f">P{i+1}\n{row['pathogen_sequence']}\n")


Writing FASTA:   0%|          | 0/8894 [00:00<?, ?it/s]
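
Because the later merge steps concatenate host and pathogen features by row order, it can be worth confirming that the two FASTA files list the pairs in the same order. The sketch below uses in-memory toy records in place of `host.fasta` and `pathogen.fasta`; the helper name `read_fasta_ids` is an assumption made for illustration.

```python
import io

def read_fasta_ids(handle):
    """Collect record IDs (header lines without the leading '>')."""
    return [line[1:].strip() for line in handle if line.startswith(">")]

# Toy stand-ins for host.fasta and pathogen.fasta
host_fasta = io.StringIO(">H1\nMKIITY\n>H2\nMPGSLP\n")
patho_fasta = io.StringIO(">P1\nMYEANI\n>P2\nMKTVVI\n")

host_ids = read_fasta_ids(host_fasta)
patho_ids = read_fasta_ids(patho_fasta)

# Pair k must be H{k} on the host side and P{k} on the pathogen side
pairs_aligned = len(host_ids) == len(patho_ids) and all(
    h[1:] == p[1:] for h, p in zip(host_ids, patho_ids)
)
print(pairs_aligned)
```

With the real files, the same function can be applied to `open("host.fasta")` and `open("pathogen.fasta")`.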

**Step 4: Run iFeature for feature extraction**

##AMINO ACID COMPOSITION

###Amino acid composition (AAC)
20 Features

In [None]:
# Amino acid composition
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type AAC --out host_aac.tsv


#for pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type AAC --out pathogen_aac.tsv


Descriptor type: AAC
Descriptor type: AAC


**Step 5:** Merge the extracted features

In [None]:
host_feat = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)

#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/aac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_features.csv'")
combined_features.head()




Final shape: (8894, 41)
Saved as 'aac_features.csv'


Unnamed: 0,H_A,H_C,H_D,H_E,H_F,H_G,H_H,H_I,H_K,H_L,...,P_N,P_P,P_Q,P_R,P_S,P_T,P_V,P_W,P_Y,label
0,0.017679,0.001537,0.199078,0.049193,0.002306,0.061491,0.01153,0.018447,0.033051,0.007686,...,0.034188,0.042735,0.047009,0.042735,0.059829,0.064103,0.055556,0.004274,0.038462,0
1,0.084615,0.019231,0.071154,0.053846,0.034615,0.076923,0.019231,0.048077,0.055769,0.092308,...,0.050562,0.039326,0.047753,0.057584,0.037921,0.050562,0.067416,0.007022,0.043539,1
2,0.044331,0.01364,0.052856,0.075874,0.031543,0.051151,0.034101,0.049446,0.092924,0.098892,...,0.04698,0.053691,0.026846,0.020134,0.040268,0.020134,0.053691,0.013423,0.04698,0
3,0.056723,0.012605,0.046218,0.096639,0.02521,0.058824,0.027311,0.029412,0.079832,0.088235,...,0.04918,0.014572,0.087432,0.034608,0.08561,0.041894,0.052823,0.0,0.030965,1
4,0.057343,0.018182,0.034965,0.057343,0.044755,0.086713,0.01958,0.029371,0.046154,0.083916,...,0.05814,0.011628,0.046512,0.0,0.081395,0.05814,0.046512,0.0,0.011628,1


###Composition of K-Spaced Amino Acid Pairs (CKSAAP)
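1200 Features (400 residue pairs × 3 gaps, for the maximum gap of 2 used below; a maximum gap of 5 would give 2400)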
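
As a rough illustration of what this descriptor counts, the sketch below computes k-spaced pair frequencies by hand. The function `cksaap` is a hypothetical re-implementation for teaching purposes, not iFeature's code; the `XY.gapK` key format simply mirrors the column names that appear in the merged table for this descriptor.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(sequence, max_gap=2):
    """Frequencies of residue pairs separated by 0..max_gap positions."""
    features = {}
    for gap in range(max_gap + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_pairs = 0
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # ignore pairs with non-standard residues
                counts[pair] += 1
                n_pairs += 1
        for pair, count in counts.items():
            features[f"{pair}.gap{gap}"] = count / n_pairs if n_pairs else 0.0
    return features

feats = cksaap("MKVLAAGIVKK", max_gap=2)
print(len(feats))
```

Each extra gap value adds another 400 pair frequencies, which is why the feature count grows linearly with the maximum gap.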

In [None]:
!python3 iFeature/codes/CKSAAP.py host.fasta 2 host_cksaap.tsv
!python3 iFeature/codes/CKSAAP.py pathogen.fasta 2 pathogen_cksaap.tsv

In [None]:
host_feat = pd.read_csv("host_cksaap.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_cksaap.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/cksaap_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'cksaap_features.csv'")
combined_features.head()


Final shape: (8894, 2401)
Saved as 'cksaap_features.csv'


Unnamed: 0,H_AA.gap0,H_AC.gap0,H_AD.gap0,H_AE.gap0,H_AF.gap0,H_AG.gap0,H_AH.gap0,H_AI.gap0,H_AK.gap0,H_AL.gap0,...,P_YN.gap2,P_YP.gap2,P_YQ.gap2,P_YR.gap2,P_YS.gap2,P_YT.gap2,P_YV.gap2,P_YW.gap2,P_YY.gap2,label
0,0.0,0.0,0.000769,0.003077,0.0,0.004615,0.0,0.000769,0.0,0.0,...,0.004329,0.0,0.0,0.0,0.004329,0.004329,0.004329,0.0,0.0,0
1,0.007707,0.003854,0.0,0.007707,0.003854,0.00578,0.0,0.0,0.003854,0.009634,...,0.004231,0.00141,0.00141,0.0,0.007052,0.004231,0.002821,0.0,0.002821,1
2,0.005119,0.0,0.00256,0.001706,0.000853,0.005119,0.000853,0.001706,0.003413,0.005973,...,0.0,0.0,0.0,0.0,0.0,0.0,0.006849,0.0,0.0,0
3,0.002105,0.0,0.006316,0.002105,0.0,0.006316,0.004211,0.0,0.008421,0.002105,...,0.003663,0.0,0.003663,0.0,0.0,0.001832,0.0,0.0,0.001832,1
4,0.004202,0.0,0.001401,0.007003,0.001401,0.004202,0.001401,0.002801,0.002801,0.001401,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


###Dipeptide Composition (DPC)
400 Features

In [None]:
# Dipeptide composition
!python3 iFeature/iFeature.py --file host.fasta --type DPC --out host_dpc.tsv
!python3 iFeature/iFeature.py --file pathogen.fasta --type DPC --out pathogen_dpc.tsv


Descriptor type: DPC
Descriptor type: DPC


In [None]:
host_feat = pd.read_csv("host_dpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_dpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/dpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'dpc_features.csv'")
combined_features.head()


Final shape: (8894, 801)
Saved as 'dpc_features.csv'


Unnamed: 0,H_AA,H_AC,H_AD,H_AE,H_AF,H_AG,H_AH,H_AI,H_AK,H_AL,...,P_YN,P_YP,P_YQ,P_YR,P_YS,P_YT,P_YV,P_YW,P_YY,label
0,0.0,0.0,0.000769,0.003077,0.0,0.004615,0.0,0.000769,0.0,0.0,...,0.004292,0.0,0.0,0.0,0.004292,0.0,0.004292,0.0,0.0,0
1,0.007707,0.003854,0.0,0.007707,0.003854,0.00578,0.0,0.0,0.003854,0.009634,...,0.0,0.004219,0.0,0.001406,0.004219,0.001406,0.0,0.0,0.002813,1
2,0.005119,0.0,0.00256,0.001706,0.000853,0.005119,0.000853,0.001706,0.003413,0.005973,...,0.0,0.006757,0.0,0.006757,0.0,0.006757,0.0,0.0,0.0,0
3,0.002105,0.0,0.006316,0.002105,0.0,0.006316,0.004211,0.0,0.008421,0.002105,...,0.0,0.001825,0.00365,0.0,0.005474,0.0,0.0,0.0,0.0,1
4,0.004202,0.0,0.001401,0.007003,0.001401,0.004202,0.001401,0.002801,0.002801,0.001401,...,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


###Dipeptide deviation from expected mean (DDE)
400 Features
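
DDE standardizes each observed dipeptide frequency against the frequency expected from codon usage. The sketch below follows the commonly cited definition (observed frequency minus theoretical mean, divided by the square root of the theoretical variance). The function `dde` is an illustrative assumption, and it has not been checked value-for-value against iFeature's output.

```python
from itertools import product
from math import sqrt

# Number of codons encoding each amino acid in the standard genetic code
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
          "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
          "T": 4, "V": 4, "W": 1, "Y": 2}
N_SENSE_CODONS = sum(CODONS.values())  # stop codons are excluded

def dde(sequence):
    """Dipeptide deviation from expected mean for all 400 dipeptides."""
    n = len(sequence) - 1  # number of dipeptides in the sequence
    result = {}
    for a, b in product(CODONS, repeat=2):
        # Observed dipeptide frequency (the DPC value)
        observed = sum(
            1 for i in range(n) if sequence[i] == a and sequence[i + 1] == b
        ) / n
        # Theoretical mean and variance derived from codon counts
        t_mean = (CODONS[a] / N_SENSE_CODONS) * (CODONS[b] / N_SENSE_CODONS)
        t_var = t_mean * (1 - t_mean) / n
        result[a + b] = (observed - t_mean) / sqrt(t_var)
    return result

scores = dde("MKVLAAGIVKK")
print(len(scores))
```

Dipeptides that never occur in a sequence get a negative score, which is consistent with the negative entries in the DDE table for this descriptor, whereas DPC values are never below zero.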

In [None]:
# Host DDE
!python3 iFeature/iFeature.py --file host.fasta --type DDE --out host_dde.tsv

# Pathogen DDE
!python3 iFeature/iFeature.py --file pathogen.fasta --type DDE --out pathogen_dde.tsv


Descriptor type: DDE
Descriptor type: DDE


In [None]:
host_feat = pd.read_csv("host_dde.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_dde.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/dde_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'dde_features.csv'")
combined_features.head()


Final shape: (8894, 801)
Saved as 'dde_features.csv'


Unnamed: 0,H_AA,H_AC,H_AD,H_AE,H_AF,H_AG,H_AH,H_AI,H_AK,H_AL,...,P_YN,P_YP,P_YQ,P_YR,P_YS,P_YT,P_YV,P_YW,P_YY,label
0,-2.369396,-1.67361,-1.074811,0.721583,-1.67361,0.173832,-1.67361,-1.56167,-1.67361,-2.905043,...,1.498457,-0.708534,-0.500739,-0.868241,0.28724,-0.708534,0.705873,-0.353981,-0.500739,0
1,1.186282,0.837924,-1.057465,2.733314,0.837924,0.515438,-1.057465,-1.295824,0.837924,0.90613,...,-0.874719,1.191357,-0.874719,-0.855227,0.4677,-0.428018,-1.237705,-0.618353,1.41419,1
2,0.428782,-1.589082,0.302868,-0.327782,-0.958432,0.428782,-0.958432,-0.916867,0.933518,-0.204069,...,-0.399084,1.209991,-0.399084,0.757826,-0.691979,1.209991,-0.564694,-0.282119,-0.399084,0
3,-0.731003,-1.011648,1.960201,-0.021032,-1.011648,0.671452,0.969585,-1.239678,2.950817,-1.182844,...,-0.767934,-0.16433,1.839258,-1.331535,0.928794,-1.086608,-1.086608,-0.542865,-0.767934,1
4,-0.040118,-1.240314,-0.432329,2.799609,-0.432329,-0.040118,-0.432329,-0.199742,0.375655,-1.685432,...,3.007526,-0.427949,-0.302443,-0.524411,-0.524411,-0.427949,-0.427949,-0.213802,-0.302443,1


###Tripeptide composition (TPC)
8000 Features

In [None]:
# Host TPC
!python3 iFeature/iFeature.py --file host.fasta --type TPC --out host_tpc.tsv

# Pathogen TPC
!python3 iFeature/iFeature.py --file pathogen.fasta --type TPC --out pathogen_tpc.tsv


Descriptor type: TPC
object address  : 0x7d8a0cccb580
object refcount : 2
object type     : 0x9d7580
object type name: KeyboardInterrupt
object repr     : KeyboardInterrupt()
lost sys.stderr
^C
Descriptor type: TPC
Traceback (most recent call last):
  File "/content/iFeature/iFeature.py", line 55, in <module>
    encodings = eval(myFun)
                ^^^^^^^^^^^
  File "<string>", line 1, in <module>
  File "/content/iFeature/codes/TPC.py", line 24, in TPC
    tmpCode = [i/sum(tmpCode) for i in tmpCode]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content/iFeature/codes/TPC.py", line 24, in <listcomp>
    tmpCode = [i/sum(tmpCode) for i in tmpCode]
                 ^^^^^^^^^^^^
KeyboardInterrupt
^C
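
The traceback above shows the interrupt landing on a list comprehension that calls `sum(tmpCode)` once for every one of the 8000 entries, which may explain part of the long run time. As a hedged sketch only (not a patch to iFeature, and not verified to reproduce its output exactly), tripeptide composition can be computed with the total taken once per sequence. The names `tpc`, `TRIPEPTIDES`, and `INDEX` are assumptions introduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}

def tpc(sequence):
    """Tripeptide composition with the normalizing total computed once."""
    counts = [0] * len(TRIPEPTIDES)
    for i in range(len(sequence) - 2):
        idx = INDEX.get(sequence[i:i + 3])
        if idx is not None:  # skip tripeptides with non-standard residues
            counts[idx] += 1
    total = sum(counts)  # computed a single time per sequence
    return [c / total for c in counts] if total else counts

vector = tpc("MKVLAAGIVKK")
print(len(vector))
```

Even with this change, 8000 features per protein means a 16000-column matrix for each pair, so memory use remains a consideration.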


###Combined AAC, DPC, and DDE

In [None]:
import pandas as pd

# --- Load and set index ---
# Host
host_aac = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
host_dpc = pd.read_csv("host_dpc.tsv", sep="\t").set_index("#")
host_dde = pd.read_csv("host_dde.tsv", sep="\t").set_index("#")

# Pathogen
patho_aac = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")
patho_dpc = pd.read_csv("pathogen_dpc.tsv", sep="\t").set_index("#")
patho_dde = pd.read_csv("pathogen_dde.tsv", sep="\t").set_index("#")

# --- Prefix column names for clarity ---
host_aac.columns = ["H_AAC_" + col for col in host_aac.columns]
host_dpc.columns = ["H_DPC_" + col for col in host_dpc.columns]
host_dde.columns = ["H_DDE_" + col for col in host_dde.columns]

patho_aac.columns = ["P_AAC_" + col for col in patho_aac.columns]
patho_dpc.columns = ["P_DPC_" + col for col in patho_dpc.columns]
patho_dde.columns = ["P_DDE_" + col for col in patho_dde.columns]

# --- Concatenate features ---
host_combined = pd.concat([host_aac, host_dpc, host_dde], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_aac, patho_dpc, patho_dde], axis=1).reset_index(drop=True)

# Final combined feature matrix
combined_features = pd.concat([host_combined, patho_combined], axis=1)

#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/aac_dpc_dde_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_dpc_dde_features.csv'")
combined_features


Final shape: (8894, 1641)
Saved as 'aac_dpc_dde_features.csv'


Unnamed: 0,H_AAC_A,H_AAC_C,H_AAC_D,H_AAC_E,H_AAC_F,H_AAC_G,H_AAC_H,H_AAC_I,H_AAC_K,H_AAC_L,...,P_DDE_YN,P_DDE_YP,P_DDE_YQ,P_DDE_YR,P_DDE_YS,P_DDE_YT,P_DDE_YV,P_DDE_YW,P_DDE_YY,label
0,0.017679,0.001537,0.199078,0.049193,0.002306,0.061491,0.011530,0.018447,0.033051,0.007686,...,1.498457,-0.708534,-0.500739,-0.868241,0.287240,-0.708534,0.705873,-0.353981,-0.500739,0
1,0.084615,0.019231,0.071154,0.053846,0.034615,0.076923,0.019231,0.048077,0.055769,0.092308,...,-0.874719,1.191357,-0.874719,-0.855227,0.467700,-0.428018,-1.237705,-0.618353,1.414190,1
2,0.044331,0.013640,0.052856,0.075874,0.031543,0.051151,0.034101,0.049446,0.092924,0.098892,...,-0.399084,1.209991,-0.399084,0.757826,-0.691979,1.209991,-0.564694,-0.282119,-0.399084,0
3,0.056723,0.012605,0.046218,0.096639,0.025210,0.058824,0.027311,0.029412,0.079832,0.088235,...,-0.767934,-0.164330,1.839258,-1.331535,0.928794,-1.086608,-1.086608,-0.542865,-0.767934,1
4,0.057343,0.018182,0.034965,0.057343,0.044755,0.086713,0.019580,0.029371,0.046154,0.083916,...,3.007526,-0.427949,-0.302443,-0.524411,-0.524411,-0.427949,-0.427949,-0.213802,-0.302443,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8889,0.071759,0.013889,0.057870,0.069444,0.037037,0.037037,0.025463,0.074074,0.062500,0.125000,...,4.711535,-0.555074,-0.392285,-0.680190,-0.680190,-0.555074,-0.555074,-0.277313,-0.392285,0
8890,0.040799,0.090098,0.055249,0.064173,0.047174,0.059499,0.025074,0.068423,0.047599,0.070973,...,-0.386760,1.283983,-0.386760,-0.670609,-0.670609,1.283983,-0.547255,-0.273407,-0.386760,0
8891,0.041667,0.083333,0.041667,0.055556,0.013889,0.041667,0.000000,0.041667,0.055556,0.166667,...,-0.662619,-0.937589,-0.662619,-0.275733,-0.275733,-0.937589,-0.937589,-0.468416,3.869743,0
8892,0.068646,0.027829,0.037106,0.040816,0.048237,0.046382,0.029685,0.042672,0.027829,0.172542,...,0.687460,-1.010582,-0.714205,-1.238373,-1.238373,0.972739,-0.018922,-0.504883,-0.714205,1
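
The shapes printed by the combined cells follow directly from the per-protein feature counts: the host block and the pathogen block each contribute the same number of columns, plus one label column. The small check below uses the counts quoted in this notebook; the helper name `expected_columns` is an assumption for illustration.

```python
# Per-protein feature counts quoted in this notebook.
# CKSAAP is 1200 here because the cells run it with a maximum gap of 2,
# as the reported shape (8894, 2401) implies.
FEATURES_PER_PROTEIN = {
    "AAC": 20, "DPC": 400, "DDE": 400, "CKSAAP": 1200,
    "GAAC": 5, "CKSAAGP": 150, "GDPC": 25, "GTPC": 125,
    "Moran": 240, "Geary": 240, "NMBroto": 240,
}

def expected_columns(descriptors):
    """Host block + pathogen block + one label column."""
    per_protein = sum(FEATURES_PER_PROTEIN[d] for d in descriptors)
    return 2 * per_protein + 1

print(expected_columns(["AAC", "DPC", "DDE"]))
```

Comparing this number with `combined_features.shape[1]` is a quick way to catch a missing or duplicated descriptor file.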


###Combined AAC + CKSAAP

In [None]:
import pandas as pd

# --- Load and set index ---
# Host
host_aac = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
host_cksaap = pd.read_csv("host_cksaap.tsv", sep="\t").set_index("#")


# Pathogen
patho_aac = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")
patho_cksaap = pd.read_csv("pathogen_cksaap.tsv", sep="\t").set_index("#")

# --- Prefix column names for clarity ---
host_aac.columns = ["H_AAC_" + col for col in host_aac.columns]
host_cksaap.columns = ["H_CKSAAP_" + col for col in host_cksaap.columns]


patho_aac.columns = ["P_AAC_" + col for col in patho_aac.columns]
patho_cksaap.columns = ["P_CKSAAP_" + col for col in patho_cksaap.columns]

# --- Concatenate features ---
host_combined = pd.concat([host_aac, host_cksaap], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_aac, patho_cksaap], axis=1).reset_index(drop=True)

# Final combined feature matrix
combined_features = pd.concat([host_combined, patho_combined], axis=1)

#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/aac_cksaap_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_cksaap_features.csv'")



Final shape: (8894, 2441)
Saved as 'aac_cksaap_features.csv'


##GROUPED AMINO ACID COMPOSITION

###Grouped amino acid composition (GAAC)
5 Features
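
Grouped composition replaces the 20 amino acids with a handful of physicochemical classes. The sketch below assumes the five-class grouping usually described for GAAC (aliphatic, aromatic, positively charged, negatively charged, uncharged); both the group membership and the group names are stated here as assumptions, and iFeature's own column names may be spelled differently.

```python
# Assumed five-class grouping of the 20 standard amino acids
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}

def gaac(sequence):
    """Fraction of residues falling into each group."""
    total = len(sequence)
    return {
        name: sum(sequence.count(aa) for aa in members) / total
        for name, members in GROUPS.items()
    }

vec = gaac("MKVLAAGIVKK")
print(len(vec))
```

The result has one value per group, so each protein is described by 5 numbers instead of 20.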

In [None]:
# For Host proteins
!python3 iFeature/iFeature.py --file host.fasta --type GAAC --out host_gaac.tsv

# For Pathogen proteins
!python3 iFeature/iFeature.py --file pathogen.fasta --type GAAC --out pathogen_gaac.tsv


Descriptor type: GAAC
Descriptor type: GAAC


In [None]:
host_feat = pd.read_csv("host_gaac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gaac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gaac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gaac_features.csv'")


Final shape: (8894, 11)
Saved as 'gaac_features.csv'


###Composition of k-spaced amino acid group pairs (CKSAAGP)
150 Features

In [None]:
#  Host
!python3 iFeature/iFeature.py --file host.fasta --type CKSAAGP --out host_cksaagp.tsv

#  Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CKSAAGP --out pathogen_cksaagp.tsv


Descriptor type: CKSAAGP
Descriptor type: CKSAAGP


In [None]:
host_feat = pd.read_csv("host_cksaagp.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_cksaagp.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/cksaagp_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'cksaagp_features.csv'")


Final shape: (8894, 301)
Saved as 'cksaagp_features.csv'


###Grouped dipeptide composition (GDPC)
25 Features
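
Grouped dipeptide composition applies the same idea to adjacent residue pairs: with five groups there are 5 × 5 = 25 possible group pairs. The sketch below reuses an assumed five-class grouping; the function `gdpc` and the `first.second` key format are illustrative choices, not iFeature's implementation.

```python
from itertools import product

# Assumed five-class grouping of the 20 standard amino acids
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}

def gdpc(sequence):
    """Frequency of each ordered pair of groups among adjacent residues."""
    counts = {f"{a}.{b}": 0 for a, b in product(GROUPS, repeat=2)}
    n_pairs = 0
    for first, second in zip(sequence, sequence[1:]):
        if first in RESIDUE_TO_GROUP and second in RESIDUE_TO_GROUP:
            counts[f"{RESIDUE_TO_GROUP[first]}.{RESIDUE_TO_GROUP[second]}"] += 1
            n_pairs += 1
    return {k: (v / n_pairs if n_pairs else 0.0) for k, v in counts.items()}

vec = gdpc("MKVLAAGIVKK")
print(len(vec))
```

The same pattern extends to group triplets (5 × 5 × 5 = 125, as in GTPC) and to gapped group pairs (25 pairs per gap, as in CKSAAGP).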

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type GDPC --out host_gdpc.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type GDPC --out pathogen_gdpc.tsv


Descriptor type: GDPC
Descriptor type: GDPC


In [None]:
host_feat = pd.read_csv("host_gdpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gdpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gdpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gdpc_features.csv'")

Final shape: (8894, 51)
Saved as 'gdpc_features.csv'


###Grouped tripeptide composition (GTPC)
125 Features

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type GTPC --out host_gtpc.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type GTPC --out pathogen_gtpc.tsv


Descriptor type: GTPC
Descriptor type: GTPC


In [None]:
host_feat = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gtpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gtpc_features.csv'")

Final shape: (8894, 251)
Saved as 'gtpc_features.csv'


###Combined GAAC, CKSAAGP, GDPC and GTPC

In [None]:
import pandas as pd

# Load all 4 descriptors for Host
host_gaac     = pd.read_csv("host_gaac.tsv", sep="\t").set_index("#")
host_cksaagp  = pd.read_csv("host_cksaagp.tsv", sep="\t").set_index("#")
host_gdpc     = pd.read_csv("host_gdpc.tsv", sep="\t").set_index("#")
host_gtpc     = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")

# Load all 4 descriptors for Pathogen
patho_gaac    = pd.read_csv("pathogen_gaac.tsv", sep="\t").set_index("#")
patho_cksaagp = pd.read_csv("pathogen_cksaagp.tsv", sep="\t").set_index("#")
patho_gdpc    = pd.read_csv("pathogen_gdpc.tsv", sep="\t").set_index("#")
patho_gtpc    = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")

# Combine all host features
host_feat = pd.concat([host_gaac, host_cksaagp, host_gdpc, host_gtpc], axis=1)

# Combine all pathogen features
patho_feat = pd.concat([patho_gaac, patho_cksaagp, patho_gdpc, patho_gtpc], axis=1)

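# Prefix column names so host and pathogen features stay distinguishable
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]

# Reset index for alignment
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)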

# Combine host and pathogen features side-by-side
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add labels (assumes your original dataframe with labels is named df)
combined_features["label"] = df["label"]

# Save final feature set
combined_features.to_csv(file_path + "/combined_gaac_cksaagp_gdpc_gtpc.csv", index=False)

# Output shape
print("Final shape:", combined_features.shape)
print("Saved as: combined_gaac_cksaagp_gdpc_gtpc.csv")


Final shape: (8894, 611)
Saved as: combined_gaac_cksaagp_gdpc_gtpc.csv


##AUTOCORRELATION

###Moran Autocorrelation
240 Features
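
Autocorrelation descriptors measure how a physicochemical property at one position relates to the same property a fixed number of residues (the lag) further along the sequence. The column names in the combined table later in this section pair an AAindex identifier with a lag from 1 to 30, so the 240 features are presumably 8 properties × 30 lags. The sketch below computes Moran autocorrelation for a single property; it uses Kyte-Doolittle hydropathy purely as a stand-in scale, which is an assumption and not one of the properties shown in the output.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in property
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def moran(sequence, lag, scale=HYDROPATHY):
    """Moran autocorrelation of one property at a given lag (lag < length)."""
    values = [scale[aa] for aa in sequence]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance

profile = [moran("MKVLAAGIVKK", lag) for lag in range(1, 4)]
print(len(profile))
```

Because the lag must be smaller than the sequence length, very short sequences cannot supply all 30 lags, which is worth keeping in mind when a dataset contains short peptides.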

In [None]:
# Host Moran
!python3 iFeature/iFeature.py --file host.fasta --type Moran --out host_moran.tsv


# Pathogen Moran
!python3 iFeature/iFeature.py --file pathogen.fasta --type Moran --out pathogen_moran.tsv

Descriptor type: Moran
Descriptor type: Moran


In [None]:
host_feat = pd.read_csv("host_moran.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_moran.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/moran_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'moran_features.csv'")

Final shape: (8894, 481)
Saved as 'moran_features.csv'


###Geary Autocorrelation
240 Features

In [None]:
# Host Geary
!python3 iFeature/iFeature.py --file host.fasta --type Geary --out host_geary.tsv


# Pathogen Geary
!python3 iFeature/iFeature.py --file pathogen.fasta --type Geary --out pathogen_geary.tsv

Descriptor type: Geary
Descriptor type: Geary


In [None]:
host_feat = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/geary_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'geary_features.csv'")

Final shape: (8894, 481)
Saved as 'geary_features.csv'


###Normalized Moreau-Broto Autocorrelation (NMBroto)
240 Features

In [None]:

# Host Moreau-Broto
!python3 iFeature/iFeature.py --file host.fasta --type NMBroto --out host_nmbroto.tsv


# Pathogen Moreau-Broto
!python3 iFeature/iFeature.py --file pathogen.fasta --type NMBroto --out pathogen_nmbroto.tsv

Descriptor type: NMBroto
Descriptor type: NMBroto


In [None]:
host_feat = pd.read_csv("host_nmbroto.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_nmbroto.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/nmbroto_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'nmbroto_features.csv'")

Final shape: (8894, 481)
Saved as 'nmbroto_features.csv'


###Combined Autocorrelation

In [None]:
import pandas as pd

# --- Load host features ---
host_moran = pd.read_csv("host_moran.tsv", sep="\t").set_index("#")
host_geary = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
host_moreau = pd.read_csv("host_nmbroto.tsv", sep="\t").set_index("#")

# --- Load pathogen features ---
patho_moran = pd.read_csv("pathogen_moran.tsv", sep="\t").set_index("#")
patho_geary = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")
patho_moreau = pd.read_csv("pathogen_nmbroto.tsv", sep="\t").set_index("#")

# --- Add prefixes for clarity ---
host_moran.columns = ["H_Moran_" + col for col in host_moran.columns]
host_geary.columns = ["H_Geary_" + col for col in host_geary.columns]
host_moreau.columns = ["H_MB_" + col for col in host_moreau.columns]

patho_moran.columns = ["P_Moran_" + col for col in patho_moran.columns]
patho_geary.columns = ["P_Geary_" + col for col in patho_geary.columns]
patho_moreau.columns = ["P_MB_" + col for col in patho_moreau.columns]

# --- Combine features ---
host_combined = pd.concat([host_moran, host_geary, host_moreau], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_moran, patho_geary, patho_moreau], axis=1).reset_index(drop=True)

# --- Combine host and pathogen into one feature matrix ---
combined_features = pd.concat([host_combined, patho_combined], axis=1)


#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/autocorrelation_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'autocorrelation_features.csv'")
combined_features





Final shape: (8894, 1441)
Saved as 'autocorrelation_features.csv'


Unnamed: 0,H_Moran_CIDH920105.lag1,H_Moran_CIDH920105.lag2,H_Moran_CIDH920105.lag3,H_Moran_CIDH920105.lag4,H_Moran_CIDH920105.lag5,H_Moran_CIDH920105.lag6,H_Moran_CIDH920105.lag7,H_Moran_CIDH920105.lag8,H_Moran_CIDH920105.lag9,H_Moran_CIDH920105.lag10,...,P_MB_DAYM780201.lag22,P_MB_DAYM780201.lag23,P_MB_DAYM780201.lag24,P_MB_DAYM780201.lag25,P_MB_DAYM780201.lag26,P_MB_DAYM780201.lag27,P_MB_DAYM780201.lag28,P_MB_DAYM780201.lag29,P_MB_DAYM780201.lag30,label
0,0.249615,0.298017,0.233113,0.218857,0.237162,0.284058,0.238456,0.243610,0.255207,0.291386,...,0.021440,-0.031311,0.036954,0.085184,0.032672,-0.035154,-0.069578,-0.027476,0.074446,0
1,-0.023086,-0.118130,-0.061013,0.019620,0.030569,-0.068660,0.031252,0.005583,-0.014173,-0.082585,...,0.018225,0.010455,0.050791,0.044062,-0.003458,-0.059216,0.055675,0.026134,0.018542,1
2,0.079219,-0.003486,0.050675,0.112781,-0.005620,0.020218,0.023834,0.015892,0.037898,0.037946,...,-0.146189,0.052774,-0.027683,-0.132858,-0.017785,-0.098408,0.008395,-0.207647,-0.075822,0
3,-0.044293,-0.056704,-0.025004,0.017859,-0.009356,-0.085461,0.021619,0.014584,-0.079989,-0.052924,...,0.062592,0.088991,0.092897,0.014658,0.046292,0.120989,0.044969,-0.015386,0.069732,1
4,0.032612,0.112179,0.086560,0.048253,0.093282,0.072423,0.017953,0.099269,0.075509,0.097841,...,0.354901,0.148222,0.115179,0.181689,0.098783,0.274708,0.098860,0.306539,0.111395,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8889,-0.025927,-0.112019,0.057123,-0.018876,-0.034069,-0.018719,0.072873,-0.054761,-0.096700,-0.023733,...,-0.060213,0.090020,-0.121080,0.068097,-0.035871,-0.047676,-0.166371,0.082711,0.046565,0
8890,-0.031370,-0.021766,-0.075884,-0.064188,-0.057623,0.013232,0.033020,0.034734,0.015225,0.026093,...,0.046041,0.049686,-0.031067,0.005401,-0.016418,0.051475,0.013459,-0.060218,0.083532,0
8891,-0.052413,-0.006951,-0.039608,0.001051,0.133807,0.101314,-0.037150,0.014220,0.077066,-0.060335,...,0.052550,-0.000465,0.031952,0.103054,-0.042475,-0.026624,-0.000156,0.037545,-0.079358,0
8892,0.079347,0.060985,0.031621,0.032171,0.054557,0.060549,0.062079,0.041617,0.007722,-0.012845,...,0.045253,-0.011338,0.049261,0.074780,0.000645,0.073995,-0.041918,-0.046551,-0.002469,1


##QUASI SEQUENCE ORDER


###Sequence-order-coupling number (SOCNumber)

60 Features
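
The sequence-order-coupling number summarizes how dissimilar residues are when they sit a given number of positions apart. The sketch below shows the general shape of the calculation with a hypothetical 0/1 toy distance; real implementations use published amino acid distance matrices. Averaging over the number of residue pairs is an assumption made here, since the original definition is often written as a plain sum.

```python
def soc_number(sequence, lag, distance):
    """Mean squared distance between residues `lag` positions apart."""
    n = len(sequence)
    return sum(
        distance(sequence[i], sequence[i + lag]) ** 2 for i in range(n - lag)
    ) / (n - lag)

# Hypothetical toy distance: 0 for identical residues, 1 otherwise
def toy_distance(a, b):
    return 0.0 if a == b else 1.0

taus = [soc_number("MKVLAAGIVKK", lag, toy_distance) for lag in (1, 2, 3)]
print(len(taus))
```

One value is produced per lag and per distance matrix, so 30 lags with two matrices would account for the 60 features listed above.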

In [None]:
##Sequence-order-coupling number (SOCNumber)
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type SOCNumber --out host_socnumber.tsv

# For pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type SOCNumber --out pathogen_socnumber.tsv


Descriptor type: SOCNumber
Descriptor type: SOCNumber


In [None]:

host_feat = pd.read_csv("host_socnumber.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_socnumber.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/socnumber_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'socnumber_features.csv'")

Final shape: (8894, 121)
Saved as 'socnumber_features.csv'
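
The load, prefix, concatenate, and label steps above are repeated for every descriptor in this notebook. One possible way to wrap them in a helper is sketched below. The function name `merge_pair_features` is hypothetical, and the toy DataFrames stand in for the iFeature output. The row-count check is an addition: concatenating by row order silently misaligns pairs if one file has fewer rows than the other.

```python
import pandas as pd


def merge_pair_features(host_feat, patho_feat, labels):
    """Prefix the columns, align by row order, and append the label column."""
    if not (len(host_feat) == len(patho_feat) == len(labels)):
        raise ValueError("host, pathogen, and label row counts differ")
    host = host_feat.add_prefix("H_").reset_index(drop=True)
    patho = patho_feat.add_prefix("P_").reset_index(drop=True)
    combined = pd.concat([host, patho], axis=1)
    # to_numpy() avoids index alignment if labels is a Series with its own index
    combined["label"] = pd.Series(labels).to_numpy()
    return combined


# Toy data standing in for the iFeature output (values are made up)
host = pd.DataFrame({"f1": [0.1, 0.2], "f2": [0.3, 0.4]}, index=["h1", "h2"])
patho = pd.DataFrame({"f1": [0.5, 0.6], "f2": [0.7, 0.8]}, index=["p1", "p2"])

merged = merge_pair_features(host, patho, [1, 0])
print(merged.shape)          # (2, 5)
print(list(merged.columns))  # ['H_f1', 'H_f2', 'P_f1', 'P_f2', 'label']
```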


###Quasi-sequence-order descriptors (QSOrder)
100 Features

In [None]:
##Quasi-sequence-order descriptors (QSOrder)
# Host protein sequences
!python3 iFeature/iFeature.py --file host.fasta --type QSOrder --out host_qsorder.tsv

# Pathogen protein sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type QSOrder --out pathogen_qsorder.tsv


Descriptor type: QSOrder
Descriptor type: QSOrder


In [None]:
##Quasi-sequence-order descriptors (QSOrder)
host_feat = pd.read_csv("host_qsorder.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_qsorder.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/qsorder_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'qsorder_features.csv'")

Final shape: (8894, 201)
Saved as 'qsorder_features.csv'


###Combined SOCNumber and QSOrder

In [None]:
import pandas as pd

# Load features
host_qsorder = pd.read_csv("host_qsorder.tsv", sep="\t").set_index("#")
host_soc     = pd.read_csv("host_socnumber.tsv", sep="\t").set_index("#")

patho_qsorder = pd.read_csv("pathogen_qsorder.tsv", sep="\t").set_index("#")
patho_soc     = pd.read_csv("pathogen_socnumber.tsv", sep="\t").set_index("#")

# Combine all features per side; prefix columns so host and pathogen names stay unique
host_feat = pd.concat([host_qsorder, host_soc], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_qsorder, patho_soc], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

#Add the label column
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_qsorder_soc.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_qsorder_soc.csv")


Final shape: (8894, 321)
Combined feature file saved: combined_qsorder_soc.csv
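
When host and pathogen features are computed with the same descriptor, both tables carry the same column names. A quick check for repeated names before saving can catch this. The sketch below uses toy data and a hypothetical helper, `duplicated_columns`.

```python
import pandas as pd


def duplicated_columns(frame):
    """Return the column names that occur more than once."""
    cols = frame.columns
    return sorted(set(cols[cols.duplicated()]))


# Toy host and pathogen tables with identical column names (values are made up)
host = pd.DataFrame({"f1": [0.1], "f2": [0.2]})
patho = pd.DataFrame({"f1": [0.3], "f2": [0.4]})

unprefixed = pd.concat([host, patho], axis=1)
prefixed = pd.concat([host.add_prefix("H_"), patho.add_prefix("P_")], axis=1)

print(duplicated_columns(unprefixed))  # ['f1', 'f2']
print(duplicated_columns(prefixed))    # []
```

Repeated names do not change the shape of the table, but selecting such a column by name returns several columns at once, and pandas renames the repeats when the CSV is read back.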


##CONJOINT TRIAD


###Conjoint triad (CTriad)
343 Features
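
The 343 features correspond to every ordered triple of seven amino acid classes (7 × 7 × 7). To make this concrete, here is a minimal pure-Python sketch of the counting step. The seven classes follow the standard conjoint triad grouping. The `(count - min) / max` normalisation is an assumption about how the counts are scaled, and the function names are hypothetical; this is an illustration, not iFeature's code.

```python
from itertools import product

# Seven amino acid classes used by the conjoint triad method
GROUPS = {"1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
          "5": "RK", "6": "DE", "7": "C"}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
TRIADS = ["".join(t) for t in product(GROUPS, repeat=3)]  # 7**3 = 343 triads


def ctriad_counts(seq):
    """Count each class triad over all windows of three consecutive residues."""
    counts = dict.fromkeys(TRIADS, 0)
    classes = [AA_TO_GROUP.get(aa) for aa in seq.upper()]
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        if None not in (a, b, c):  # skip windows with non-standard residues
            counts[a + b + c] += 1
    return counts


def ctriad_features(seq):
    """Scale the counts; ASSUMPTION: (count - min) / max normalisation."""
    counts = list(ctriad_counts(seq).values())
    lo, hi = min(counts), max(counts)
    if hi == 0:
        return [0.0] * len(counts)
    return [(c - lo) / hi for c in counts]


print(len(TRIADS))                     # 343
print(ctriad_counts("AGVAGV")["111"])  # 4
```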

In [None]:
#Conjoint Triad
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type CTriad --out host_ctriad.tsv

# For pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTriad --out pathogen_ctriad.tsv


Descriptor type: CTriad
Descriptor type: CTriad


In [None]:
#conjoint Triad
host_feat = pd.read_csv("host_ctriad.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctriad.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctriad_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctriad_features.csv'")

Final shape: (8894, 687)
Saved as 'ctriad_features.csv'


###Conjoint k-spaced triad (KSCTriad)
343 Features

In [None]:
!python3 iFeature/iFeature.py --file host.fasta --type KSCTriad --out host_ksctriad.tsv


!python3 iFeature/iFeature.py --file pathogen.fasta --type KSCTriad --out pathogen_ksctriad.tsv


Descriptor type: KSCTriad
Descriptor type: KSCTriad


In [None]:
host_feat = pd.read_csv("host_ksctriad.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ksctriad.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ksctriad_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ksctriad_features.csv'")

Final shape: (8894, 687)
Saved as 'ksctriad_features.csv'


###Combined CTriad and KSCTriad

In [None]:
# Load features
host_ctriad = pd.read_csv("host_ctriad.tsv", sep="\t").set_index("#")
host_ksctriad = pd.read_csv("host_ksctriad.tsv", sep="\t").set_index("#")

patho_ctriad = pd.read_csv("pathogen_ctriad.tsv", sep="\t").set_index("#")
patho_ksctriad = pd.read_csv("pathogen_ksctriad.tsv", sep="\t").set_index("#")

# Combine all features per side; prefix columns so host and pathogen names stay unique
host_feat = pd.concat([host_ctriad, host_ksctriad], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_ctriad, patho_ksctriad], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_ctriad_ksctriad.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_ctriad_ksctriad.csv")


Final shape: (8894, 1373)
Combined feature file saved: combined_ctriad_ksctriad.csv


##PSEUDO-AMINO ACID COMPOSITION

###Pseudo-amino acid composition (PAAC)
50 Features
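
PAAC adds one sequence-order term per lag to the 20 composition terms, so a lag (lambda) of 30 gives the 50 features listed above. APAAC adds two terms per lag, giving 80. Because each term compares residues that are up to lambda positions apart, very short sequences cannot be encoded. The sketch below assumes lambda = 30 (inferred from the feature counts in this notebook) and a minimum length of lambda + 1; the helper names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {header: sequence} dictionary."""
    records, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def too_short(records, lam=30):
    """Return headers of sequences with fewer than lam + 1 residues."""
    return [rid for rid, seq in records.items() if len(seq) < lam + 1]


def paac_dim(lam=30):
    return 20 + lam        # 20 composition terms + lam sequence-order terms


def apaac_dim(lam=30):
    return 20 + 2 * lam    # two sequence-order terms per lag


# Made-up records for illustration
example = [">short_protein", "ACDEFG", ">long_protein", "ACDEFGHIKL" * 4]
records = read_fasta(example)
print(too_short(records))       # ['short_protein']
print(paac_dim(), apaac_dim())  # 50 80
```

To run the check on the real input, pass an open file handle, for example `read_fasta(open("host.fasta"))`.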

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type PAAC --out host_paac.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type PAAC --out pathogen_paac.tsv


Descriptor type: PAAC
Descriptor type: PAAC


In [None]:
host_feat = pd.read_csv("host_paac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_paac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/paac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'paac_features.csv'")

Final shape: (8894, 101)
Saved as 'paac_features.csv'


###Amphiphilic PAAC (APAAC)
80 Features

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type APAAC --out host_apaac.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type APAAC --out pathogen_apaac.tsv


Descriptor type: APAAC
Descriptor type: APAAC


In [None]:
host_feat = pd.read_csv("host_apaac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_apaac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/apaac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'apaac_features.csv'")

Final shape: (8894, 161)
Saved as 'apaac_features.csv'


###Combined PAAC and APAAC

In [None]:
# Load features
host_paac = pd.read_csv("host_paac.tsv", sep="\t").set_index("#")
host_apaac = pd.read_csv("host_apaac.tsv", sep="\t").set_index("#")

patho_paac = pd.read_csv("pathogen_paac.tsv", sep="\t").set_index("#")
patho_apaac = pd.read_csv("pathogen_apaac.tsv", sep="\t").set_index("#")

# Combine all features per side; prefix columns so host and pathogen names stay unique
host_feat = pd.concat([host_paac, host_apaac], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_paac, patho_apaac], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_paac_apaac.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_paac_apaac.csv")


Final shape: (8894, 261)
Combined feature file saved: combined_paac_apaac.csv


##C/T/D

###Composition (CTDC)
39 Features
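
The 39 features come from 13 physicochemical properties, each of which splits the amino acids into three classes (13 × 3). Composition is the fraction of residues falling in each class. Here is a minimal sketch for a single property. The grouping shown for `hydrophobicity_PRAM900101` is given for illustration and should be checked against the iFeature source before being relied on.

```python
# Three classes for one property (hydrophobicity_PRAM900101).
# ASSUMPTION: grouping shown for illustration; verify against iFeature.
GROUPS = {"G1": "RKEDQN", "G2": "GASTPHY", "G3": "CLVIMFW"}


def ctdc(seq, groups=GROUPS):
    """Fraction of residues in each class."""
    seq = seq.upper()
    if not seq:
        return {name: 0.0 for name in groups}
    return {
        name: sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in groups.items()
    }


print(ctdc("RKGAC"))  # {'G1': 0.4, 'G2': 0.4, 'G3': 0.2}
```

For a sequence made only of the 20 standard residues, the three fractions sum to 1, as they do in the output rows below.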

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDC --out host_ctdc.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDC --out pathogen_ctdc.tsv


Descriptor type: CTDC
Descriptor type: CTDC


In [None]:
host_feat = pd.read_csv("host_ctdc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdc_features.csv'")
combined_features.head()

Final shape: (8894, 79)
Saved as 'ctdc_features.csv'


Unnamed: 0,H_hydrophobicity_PRAM900101.G1,H_hydrophobicity_PRAM900101.G2,H_hydrophobicity_PRAM900101.G3,H_hydrophobicity_ARGP820101.G1,H_hydrophobicity_ARGP820101.G2,H_hydrophobicity_ARGP820101.G3,H_hydrophobicity_ZIMJ680101.G1,H_hydrophobicity_ZIMJ680101.G2,H_hydrophobicity_ZIMJ680101.G3,H_hydrophobicity_PONP930101.G1,...,P_charge.G1,P_charge.G2,P_charge.G3,P_secondarystruct.G1,P_secondarystruct.G2,P_secondarystruct.G3,P_solventaccess.G1,P_solventaccess.G2,P_solventaccess.G3,label
0,0.399693,0.546503,0.053805,0.856264,0.095311,0.048424,0.887779,0.066872,0.04535,0.838586,...,0.111111,0.747863,0.141026,0.444444,0.311966,0.24359,0.405983,0.333333,0.260684,0
1,0.305769,0.401923,0.292308,0.419231,0.307692,0.273077,0.553846,0.182692,0.263462,0.438462,...,0.116573,0.755618,0.127809,0.448034,0.293539,0.258427,0.441011,0.342697,0.216292,1
2,0.362319,0.379369,0.258312,0.43393,0.29838,0.26769,0.542199,0.197783,0.260017,0.537937,...,0.053691,0.879195,0.067114,0.42953,0.328859,0.241611,0.57047,0.194631,0.234899,0
3,0.378151,0.369748,0.252101,0.434874,0.313025,0.252101,0.573529,0.195378,0.231092,0.518908,...,0.118397,0.73224,0.149362,0.520947,0.233151,0.245902,0.393443,0.404372,0.202186,1
4,0.26014,0.486713,0.253147,0.420979,0.283916,0.295105,0.54965,0.158042,0.292308,0.482517,...,0.046512,0.848837,0.104651,0.476744,0.232558,0.290698,0.5,0.255814,0.244186,1


###Transition (CTDT)
39 Features
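
Transition measures how often neighbouring residues switch between two classes, in either direction, as a fraction of all adjacent residue pairs. The column suffixes `Tr1221`, `Tr1331`, and `Tr2332` name the class pairs. A minimal sketch for one property follows; the grouping is illustrative and should be verified against the iFeature source.

```python
# ASSUMPTION: illustrative three-class grouping for one property
GROUPS = {1: "RKEDQN", 2: "GASTPHY", 3: "CLVIMFW"}
AA_TO_CLASS = {aa: cls for cls, members in GROUPS.items() for aa in members}


def ctdt(seq):
    """Fraction of adjacent residue pairs that cross between two classes."""
    classes = [AA_TO_CLASS.get(aa) for aa in seq.upper()]
    pairs = list(zip(classes, classes[1:]))
    if not pairs:
        return {"Tr1221": 0.0, "Tr1331": 0.0, "Tr2332": 0.0}

    def frac(a, b):
        # Count transitions a->b and b->a over all adjacent pairs
        return sum(1 for p in pairs if p in ((a, b), (b, a))) / len(pairs)

    return {"Tr1221": frac(1, 2), "Tr1331": frac(1, 3), "Tr2332": frac(2, 3)}


print(ctdt("RKGAC"))  # {'Tr1221': 0.25, 'Tr1331': 0.0, 'Tr2332': 0.25}
```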

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDT --out host_ctdt.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDT --out pathogen_ctdt.tsv


Descriptor type: CTDT
Descriptor type: CTDT


In [None]:
host_feat = pd.read_csv("host_ctdt.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdt.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdt_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdt_features.csv'")
combined_features.head()

Final shape: (8894, 79)
Saved as 'ctdt_features.csv'


Unnamed: 0,H_hydrophobicity_PRAM900101.Tr1221,H_hydrophobicity_PRAM900101.Tr1331,H_hydrophobicity_PRAM900101.Tr2332,H_hydrophobicity_ARGP820101.Tr1221,H_hydrophobicity_ARGP820101.Tr1331,H_hydrophobicity_ARGP820101.Tr2332,H_hydrophobicity_ZIMJ680101.Tr1221,H_hydrophobicity_ZIMJ680101.Tr1331,H_hydrophobicity_ZIMJ680101.Tr2332,H_hydrophobicity_PONP930101.Tr1221,...,P_charge.Tr1221,P_charge.Tr1331,P_charge.Tr2332,P_secondarystruct.Tr1221,P_secondarystruct.Tr1331,P_secondarystruct.Tr2332,P_solventaccess.Tr1221,P_solventaccess.Tr1331,P_solventaccess.Tr2332,label
0,0.579231,0.043077,0.043846,0.134615,0.06,0.02,0.108462,0.063077,0.012308,0.136154,...,0.167382,0.034335,0.206009,0.261803,0.175966,0.150215,0.27897,0.236052,0.154506,0
1,0.22736,0.184971,0.240848,0.248555,0.248555,0.16763,0.190751,0.306358,0.102119,0.206166,...,0.180028,0.030942,0.185654,0.254571,0.233474,0.189873,0.322082,0.180028,0.136428,1
2,0.249147,0.182594,0.174915,0.240614,0.219283,0.168089,0.200512,0.271331,0.112628,0.200512,...,0.074324,0.006757,0.128378,0.263514,0.236486,0.141892,0.168919,0.256757,0.114865,0
3,0.235789,0.231579,0.164211,0.261053,0.214737,0.149474,0.231579,0.265263,0.077895,0.210526,...,0.177007,0.023723,0.228102,0.220803,0.25365,0.144161,0.337591,0.164234,0.158759,1
4,0.266106,0.114846,0.2493,0.233894,0.256303,0.172269,0.173669,0.333333,0.092437,0.210084,...,0.094118,0.0,0.211765,0.164706,0.270588,0.2,0.258824,0.2,0.117647,1


###Distribution (CTDD)
195 Features
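
Distribution records where in the sequence the residues of each class occur: the positions of the first, 25%, 50%, 75%, and last residue of the class, expressed as a percentage of the sequence length. With 13 properties, 3 classes, and 5 positions, this gives 195 values per protein. A minimal sketch for one class follows; the grouping is illustrative, and the handling of cut-offs below 1 is an assumption.

```python
import math

# ASSUMPTION: illustrative three-class grouping for one property
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def ctdd_one_class(seq, members):
    """Percent positions of the first, 25%, 50%, 75% and last class residue."""
    positions = [i + 1 for i, aa in enumerate(seq.upper()) if aa in members]
    n = len(positions)
    if n == 0:
        return [0.0] * 5
    cutoffs = [1, math.floor(0.25 * n), math.floor(0.5 * n),
               math.floor(0.75 * n), n]
    # Cut-offs that round down to 0 fall back to the first occurrence
    return [100 * positions[max(c, 1) - 1] / len(seq) for c in cutoffs]


print(ctdd_one_class("RKGACRKGAC", GROUPS["1"]))  # [10.0, 10.0, 20.0, 60.0, 70.0]
```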

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDD --out host_ctdd.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDD --out pathogen_ctdd.tsv


Descriptor type: CTDD
Descriptor type: CTDD


In [None]:
host_feat = pd.read_csv("host_ctdd.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdd.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdd_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdd_features.csv'")
combined_features.head()

Final shape: (8894, 391)
Saved as 'ctdd_features.csv'


Unnamed: 0,H_hydrophobicity_PRAM900101.1.residue0,H_hydrophobicity_PRAM900101.1.residue25,H_hydrophobicity_PRAM900101.1.residue50,H_hydrophobicity_PRAM900101.1.residue75,H_hydrophobicity_PRAM900101.1.residue100,H_hydrophobicity_PRAM900101.2.residue0,H_hydrophobicity_PRAM900101.2.residue25,H_hydrophobicity_PRAM900101.2.residue50,H_hydrophobicity_PRAM900101.2.residue75,H_hydrophobicity_PRAM900101.2.residue100,...,P_solventaccess.2.residue25,P_solventaccess.2.residue50,P_solventaccess.2.residue75,P_solventaccess.2.residue100,P_solventaccess.3.residue0,P_solventaccess.3.residue25,P_solventaccess.3.residue50,P_solventaccess.3.residue75,P_solventaccess.3.residue100,label
0,0.153728,23.520369,43.812452,71.637202,100.0,0.38432,33.743274,57.571099,78.708686,99.846272,...,30.34188,55.555556,79.059829,100.0,0.42735,20.940171,44.871795,68.376068,99.57265,0
1,1.538462,21.923077,51.730769,75.192308,99.807692,0.384615,28.076923,51.346154,76.153846,100.0,...,19.94382,48.876404,73.735955,99.719101,0.140449,28.370787,51.685393,80.617978,99.016854,1
2,0.255754,23.955669,45.183291,72.804774,99.914749,0.170503,20.630861,44.671782,67.689685,100.0,...,25.503356,50.33557,75.838926,95.973154,0.671141,34.899329,62.416107,81.879195,99.328859,0
3,1.05042,37.184874,59.87395,78.571429,99.789916,0.420168,19.957983,33.613445,60.714286,98.94958,...,27.504554,50.273224,69.034608,99.271403,0.182149,24.225865,50.637523,73.224044,98.724954,1
4,0.559441,30.629371,53.006993,76.223776,99.58042,0.27972,31.888112,55.664336,79.58042,99.86014,...,15.116279,50.0,72.093023,98.837209,1.162791,6.976744,39.534884,63.953488,96.511628,1


###Combined CTDC, CTDT and CTDD

In [None]:
import pandas as pd

# Load the CTD features for host
host_ctdc = pd.read_csv("host_ctdc.tsv", sep="\t").set_index("#")
host_ctdt = pd.read_csv("host_ctdt.tsv", sep="\t").set_index("#")
host_ctdd = pd.read_csv("host_ctdd.tsv", sep="\t").set_index("#")

# Load the CTD features for pathogen
patho_ctdc = pd.read_csv("pathogen_ctdc.tsv", sep="\t").set_index("#")
patho_ctdt = pd.read_csv("pathogen_ctdt.tsv", sep="\t").set_index("#")
patho_ctdd = pd.read_csv("pathogen_ctdd.tsv", sep="\t").set_index("#")

# Combine host and pathogen features separately; prefix columns so names stay unique
host_feat = pd.concat([host_ctdc, host_ctdt, host_ctdd], axis=1).add_prefix("H_").reset_index(drop=True)
patho_feat = pd.concat([patho_ctdc, patho_ctdt, patho_ctdd], axis=1).add_prefix("P_").reset_index(drop=True)

# Combine host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column from your original dataframe
combined_features["label"] = df["label"]

# Save to CSV
combined_features.to_csv(file_path + "/combined_ctd.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_ctd.csv")


Final shape: (8894, 547)
Combined feature file saved: combined_ctd.csv


##Combine PAAC + CTriad + CKSAAP

##Combined Geary + GTPC

In [None]:
import pandas as pd

# Load features
host_geary = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
host_gtpc = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")

patho_geary = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")
patho_gtpc = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")

# Combine all features per side; prefix columns so host and pathogen names stay unique
host_feat = pd.concat([host_geary, host_gtpc], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_geary, patho_gtpc], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column from original df
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_geary_gtpc.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_geary_gtpc.csv")


Final shape: (8894, 731)
Combined feature file saved: combined_geary_gtpc.csv


**Now that we have extracted all the features, we can proceed to the final notebook.**