#FEATURE EXTRACTION FOR HP-PPI PREDICTION

Feature extraction is the process of converting protein sequences, which are strings of amino acids, into numerical vectors that machine learning models can use. This transformation is necessary because most ML algorithms cannot operate on raw strings and require numeric input. During this step, protein sequences of different lengths are transformed into fixed-length numerical representations that capture important biological properties of the protein.
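
As a minimal illustration of the idea (a sketch, not iFeature's implementation), the snippet below maps two sequences of different lengths to vectors of the same length by computing the fraction of each of the 20 standard amino acids. The function name `aac_vector` and the example sequences are made up for illustration.

```python
# Sketch: turn variable-length sequences into fixed-length vectors (amino acid fractions).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac_vector(sequence):
    """Return the fraction of each standard amino acid in `sequence` (20 values)."""
    sequence = sequence.upper()
    # Count only standard residues so ambiguous symbols (e.g. X) do not affect the fractions.
    valid = [aa for aa in sequence if aa in AMINO_ACIDS]
    if not valid:
        return [0.0] * len(AMINO_ACIDS)
    return [valid.count(aa) / len(valid) for aa in AMINO_ACIDS]


short_seq = "MKVLA"            # hypothetical 5-residue sequence
long_seq = "MKVLAAGIVGLLLAQW"  # hypothetical 16-residue sequence

for seq in (short_seq, long_seq):
    vec = aac_vector(seq)
    print(len(seq), "residues ->", len(vec), "features")
```

Both sequences end up with 20 features regardless of their length, which is what allows them to be stacked into a single table for model training.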

Several feature extraction methods have been developed to achieve this, and these descriptors can be applied individually or combined. Although combining multiple descriptors may improve the predictive performance of machine learning models, it increases computational time and memory requirements. Moreover, high-dimensional feature vectors can lead to overfitting and the inclusion of redundant or irrelevant features.
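
One inexpensive way to limit redundancy after combining descriptors is to drop columns that carry no information. The sketch below uses pandas to remove constant columns and exact duplicate columns. The helper name `drop_uninformative` and the toy data are illustrative assumptions, and this is only one of many possible feature-selection strategies.

```python
import pandas as pd


def drop_uninformative(features):
    """Drop constant columns and exact duplicate columns from a feature table."""
    # A column with a single unique value cannot help separate the classes.
    varying = features.loc[:, features.nunique(dropna=False) > 1]
    # Transposing lets duplicated() compare whole columns; keep the first of each group.
    return varying.loc[:, ~varying.T.duplicated()]


toy = pd.DataFrame(
    {
        "H_A": [0.10, 0.20, 0.30],
        "H_C": [0.00, 0.00, 0.00],  # constant column
        "P_A": [0.10, 0.20, 0.30],  # duplicate of H_A
        "P_C": [0.05, 0.01, 0.02],
    }
)
reduced = drop_uninformative(toy)
print(list(reduced.columns))
```

In a real pipeline the label column should be set aside before filtering, and any selection step should be fitted on the training split only.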

In this tutorial, we will use the iFeature package, a widely used Python-based toolkit for sequence feature extraction. iFeature offers a wide range of descriptors and is user-friendly.

##Objectives of this notebook:
By the end of this notebook, you will be able to:
* Extract numerical features from both host and pathogen protein sequences
* Apply descriptors individually or in combination to represent protein pairs
* Prepare the feature data for machine learning model training and evaluation



**Step 1: Install and import necessary packages**

In [None]:
!pip install tqdm


In [None]:
!git clone https://github.com/Superzchen/iFeature.git


In [None]:
# Mount Google Drive to access files
import os
from google.colab import drive
drive.mount('/content/my_drive')

# Import pandas for data manipulation
import pandas as pd

from tqdm.notebook import tqdm




**Step 2: Load your dataset.**

For this notebook, we created a new folder named Features in the HPI folder. We also moved the merged HPI dataset into this folder for feature extraction.

In [None]:
file_path = '/content/my_drive/My Drive/HPI/Features'
HPIdata = file_path + "/merged_hpi_dataset.csv"
df = pd.read_csv(HPIdata)

# Preview data
df.head()

In [None]:
len(df)

**Step 3: Generate FASTA files for host and pathogen protein sequences**




In this step, we assign a unique ID to each interacting pair (H1/P1, H2/P2, and so on) and write the host and pathogen sequences to separate FASTA files. Because the host and pathogen records of a pair share the same number, the pairing is preserved through feature extraction.

In [None]:

with open("host.fasta", "w") as host_f, open("pathogen.fasta", "w") as patho_f:
    for i, row in tqdm(df.iterrows(), total=len(df), desc="Writing FASTA"):
        host_f.write(f">H{i+1}\n{row['host_sequence']}\n")
        patho_f.write(f">P{i+1}\n{row['pathogen_sequence']}\n")
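
Since every later merge relies on the two FASTA files staying in the same order, it can be worth reading them back and checking the IDs. The following is a small self-contained sketch of that check on toy data. The helper names `write_paired_fasta` and `read_fasta_ids` and the sequences are hypothetical, and the files are written to a temporary directory rather than to the notebook's working folder.

```python
import os
import tempfile


def write_paired_fasta(pairs, host_path, patho_path):
    """Write host/pathogen sequences with matching IDs H1/P1, H2/P2, ..."""
    with open(host_path, "w") as host_f, open(patho_path, "w") as patho_f:
        for i, (host_seq, patho_seq) in enumerate(pairs, start=1):
            host_f.write(f">H{i}\n{host_seq}\n")
            patho_f.write(f">P{i}\n{patho_seq}\n")


def read_fasta_ids(path):
    """Return the record IDs of a FASTA file in file order."""
    with open(path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


# Hypothetical toy pairs standing in for the rows of the dataset.
toy_pairs = [("MKVLA", "MSTNP"), ("GAVLI", "MDEKR"), ("MFYWK", "MCCPQ")]

with tempfile.TemporaryDirectory() as tmp:
    host_path = os.path.join(tmp, "host.fasta")
    patho_path = os.path.join(tmp, "pathogen.fasta")
    write_paired_fasta(toy_pairs, host_path, patho_path)
    host_ids = read_fasta_ids(host_path)
    patho_ids = read_fasta_ids(patho_path)

# Pair k must be H{k} in one file and P{k} in the other.
pairing_ok = [h[1:] for h in host_ids] == [p[1:] for p in patho_ids]
print(len(host_ids), "records per file; pairing intact:", pairing_ok)
```

The same `read_fasta_ids` check could be pointed at the real host.fasta and pathogen.fasta files before running iFeature.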


**Step 4: Run iFeature for feature extraction**

For each protein pair, iFeature generates one numerical vector for the host protein and one for the pathogen protein. These feature vectors are generated separately and then concatenated to represent the host-pathogen pair.

Thus, if a descriptor generates n features per protein, the final feature vector has 2n features (n for the host + n for the pathogen).
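
The arithmetic can be sketched in a few lines. The function name `pair_vector` and the placeholder values are assumptions for illustration only.

```python
# Sketch: a pair is represented by the host vector followed by the pathogen vector.
def pair_vector(host_vec, patho_vec):
    """Concatenate per-protein vectors into one host-pathogen vector."""
    if len(host_vec) != len(patho_vec):
        raise ValueError("host and pathogen vectors must come from the same descriptor")
    return list(host_vec) + list(patho_vec)


n = 20                 # e.g. AAC produces 20 features per protein
host_vec = [0.05] * n  # placeholder values, not real features
patho_vec = [0.05] * n
pair = pair_vector(host_vec, patho_vec)
print(len(pair))
```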

##AMINO ACID COMPOSITION

###Amino acid composition (AAC)
20 Features


In [None]:
# Amino acid composition
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type AAC --out host_aac.tsv


# For pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type AAC --out pathogen_aac.tsv


**Step 5: Merge the extracted features**

In [None]:
host_feat = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)

#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/aac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_features.csv'")
combined_features.head()
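
The merge cells in this notebook rely on the host and pathogen rows being in the same order. A slightly more defensive variant, sketched below, checks that the numeric parts of the H/P IDs match before concatenating. The function name `merge_host_pathogen` is hypothetical, and the in-memory tables are made-up stand-ins for iFeature output files.

```python
import io

import pandas as pd


def merge_host_pathogen(host_tsv, patho_tsv, labels):
    """Combine host and pathogen feature tables after checking that the IDs pair up."""
    host = pd.read_csv(host_tsv, sep="\t").set_index("#")
    patho = pd.read_csv(patho_tsv, sep="\t").set_index("#")

    # IDs were written as H1, H2, ... and P1, P2, ...; the numeric parts must match row by row.
    host_num = host.index.astype(str).str[1:]
    patho_num = patho.index.astype(str).str[1:]
    if len(host) != len(patho) or not (host_num == patho_num).all():
        raise ValueError("host and pathogen feature rows are not aligned")
    if len(host) != len(labels):
        raise ValueError("number of feature rows differs from number of labels")

    combined = pd.concat(
        [
            host.add_prefix("H_").reset_index(drop=True),
            patho.add_prefix("P_").reset_index(drop=True),
        ],
        axis=1,
    )
    combined["label"] = list(labels)
    return combined


# Tiny in-memory stand-ins for iFeature output files (values are made up).
host_tsv = io.StringIO("#\tA\tC\nH1\t0.1\t0.2\nH2\t0.3\t0.4\n")
patho_tsv = io.StringIO("#\tA\tC\nP1\t0.5\t0.6\nP2\t0.7\t0.8\n")
combined = merge_host_pathogen(host_tsv, patho_tsv, labels=[1, 0])
print(combined.shape)
```

With real data, the two arguments would be the paths of the host and pathogen .tsv files and `labels` would be `df["label"]`. A length mismatch raises an error instead of silently producing misaligned rows.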




###Composition of K-Spaced Amino Acid Pairs (CKSAAP)
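2400 Features with a gap of 5 (400 residue pairs × 6 gap values, 0 to 5). The command below passes a gap of 2, which should give 400 × 3 = 1200 features per protein.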
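
The number of CKSAAP features depends on the gap parameter: each gap value from 0 up to the chosen maximum contributes 400 pair frequencies. The sketch below is a simplified re-implementation for illustration, not iFeature's code. The function name `cksaap` and the example sequence are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def cksaap(sequence, max_gap):
    """Return {feature_name: frequency} for residue pairs separated by 0..max_gap residues."""
    features = {}
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair residue i with the residue gap + 1 positions further along the sequence.
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        for pair in PAIRS:
            features[f"{pair}.gap{gap}"] = counts[pair] / total if total else 0.0
    return features


example = cksaap("MKVLAAGIVGLLLAQW", max_gap=2)  # hypothetical sequence
print(len(example))
```

With `max_gap=2` the dictionary holds 3 × 400 entries, and with `max_gap=5` it would hold 6 × 400.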

In [None]:
!python3 iFeature/codes/CKSAAP.py host.fasta 2 host_cksaap.tsv
!python3 iFeature/codes/CKSAAP.py pathogen.fasta 2 pathogen_cksaap.tsv

In [None]:
host_feat = pd.read_csv("host_cksaap.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_cksaap.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/cksaap_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'cksaap_features.csv'")
combined_features.head()


###Dipeptide Composition (DPC)
400 Features

In [None]:
# Dipeptide composition
!python3 iFeature/iFeature.py --file host.fasta --type DPC --out host_dpc.tsv
!python3 iFeature/iFeature.py --file pathogen.fasta --type DPC --out pathogen_dpc.tsv


In [None]:
host_feat = pd.read_csv("host_dpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_dpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/dpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'dpc_features.csv'")
combined_features.head()


###Dipeptide deviation from expected mean (DDE)
400 Features

In [None]:
# Host DDE
!python3 iFeature/iFeature.py --file host.fasta --type DDE --out host_dde.tsv

# Pathogen DDE
!python3 iFeature/iFeature.py --file pathogen.fasta --type DDE --out pathogen_dde.tsv


In [None]:
host_feat = pd.read_csv("host_dde.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_dde.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/dde_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'dde_features.csv'")
combined_features.head()


###Tripeptide composition (TPC)
8000 Features (because of this descriptor's high dimensionality, TPC takes considerably longer to compute)
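
To get a feel for what this dimensionality means for a paired dataset, the sketch below estimates the in-memory size of the host + pathogen TPC matrix. The dataset size is hypothetical and the estimate assumes dense float64 storage.

```python
# Rough size estimate for the paired TPC matrix (assumes float64 storage).
N_AMINO_ACIDS = 20
tpc_per_protein = N_AMINO_ACIDS ** 3  # 8000 tripeptides
tpc_per_pair = 2 * tpc_per_protein    # host + pathogen


def dense_matrix_mib(n_rows, n_cols, bytes_per_value=8):
    """Approximate in-memory size of a dense numeric matrix in MiB."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 2


n_pairs = 10_000  # hypothetical dataset size
print(tpc_per_pair, "features per pair")
print(round(dense_matrix_mib(n_pairs, tpc_per_pair)), "MiB for", n_pairs, "pairs")
```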

In [None]:
# Host TPC
!python3 iFeature/iFeature.py --file host.fasta --type TPC --out host_tpc.tsv

# Pathogen TPC
!python3 iFeature/iFeature.py --file pathogen.fasta --type TPC --out pathogen_tpc.tsv


###Combined AAC, DPC, and DDE

In [None]:
import pandas as pd

# --- Load and set index ---
# Host
host_aac = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
host_dpc = pd.read_csv("host_dpc.tsv", sep="\t").set_index("#")
host_dde = pd.read_csv("host_dde.tsv", sep="\t").set_index("#")

# Pathogen
patho_aac = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")
patho_dpc = pd.read_csv("pathogen_dpc.tsv", sep="\t").set_index("#")
patho_dde = pd.read_csv("pathogen_dde.tsv", sep="\t").set_index("#")

# --- Prefix column names for clarity ---
host_aac.columns = ["H_AAC_" + col for col in host_aac.columns]
host_dpc.columns = ["H_DPC_" + col for col in host_dpc.columns]
host_dde.columns = ["H_DDE_" + col for col in host_dde.columns]

patho_aac.columns = ["P_AAC_" + col for col in patho_aac.columns]
patho_dpc.columns = ["P_DPC_" + col for col in patho_dpc.columns]
patho_dde.columns = ["P_DDE_" + col for col in patho_dde.columns]

# --- Concatenate features ---
host_combined = pd.concat([host_aac, host_dpc, host_dde], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_aac, patho_dpc, patho_dde], axis=1).reset_index(drop=True)

# Final combined feature matrix
combined_features = pd.concat([host_combined, patho_combined], axis=1)

#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/aac_dpc_dde_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_dpc_dde_features.csv'")
combined_features


###Combined AAC + CKSAAP

In [None]:
import pandas as pd

# --- Load and set index ---
# Host
host_aac = pd.read_csv("host_aac.tsv", sep="\t").set_index("#")
host_cksaap = pd.read_csv("host_cksaap.tsv", sep="\t").set_index("#")


# Pathogen
patho_aac = pd.read_csv("pathogen_aac.tsv", sep="\t").set_index("#")
patho_cksaap = pd.read_csv("pathogen_cksaap.tsv", sep="\t").set_index("#")

# --- Prefix column names for clarity ---
host_aac.columns = ["H_AAC_" + col for col in host_aac.columns]
host_cksaap.columns = ["H_CKSAAP_" + col for col in host_cksaap.columns]


patho_aac.columns = ["P_AAC_" + col for col in patho_aac.columns]
patho_cksaap.columns = ["P_CKSAAP_" + col for col in patho_cksaap.columns]

# --- Concatenate features ---
host_combined = pd.concat([host_aac, host_cksaap], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_aac, patho_cksaap], axis=1).reset_index(drop=True)

# Final combined feature matrix
combined_features = pd.concat([host_combined, patho_combined], axis=1)

#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/aac_cksaap_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'aac_cksaap_features.csv'")



##GROUPED AMINO ACID COMPOSITION

###Grouped amino acid composition (GAAC)
5 Features
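
The five features are the fractions of residues that fall into five physicochemical groups. The sketch below illustrates the calculation. The grouping shown is the one we understand iFeature to use, but treat it as an assumption, and note that the function name `gaac` and the example sequence are made up.

```python
# Residue groups as commonly used for GAAC (treat the exact grouping as an assumption).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def gaac(sequence):
    """Return the fraction of residues falling into each physicochemical group."""
    sequence = sequence.upper()
    if not sequence:
        return {name: 0.0 for name in GROUPS}
    return {
        name: sum(sequence.count(aa) for aa in members) / len(sequence)
        for name, members in GROUPS.items()
    }


print(gaac("MKVLAAGIVGLLLAQW"))  # hypothetical sequence
```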

In [None]:
# For Host proteins
!python3 iFeature/iFeature.py --file host.fasta --type GAAC --out host_gaac.tsv

# For Pathogen proteins
!python3 iFeature/iFeature.py --file pathogen.fasta --type GAAC --out pathogen_gaac.tsv


In [None]:
host_feat = pd.read_csv("host_gaac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gaac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gaac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gaac_features.csv'")


###Composition of k-spaced amino acid group pairs (CKSAAGP)
150 Features

In [None]:
#  Host
!python3 iFeature/iFeature.py --file host.fasta --type CKSAAGP --out host_cksaagp.tsv

#  Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CKSAAGP --out pathogen_cksaagp.tsv


In [None]:
host_feat = pd.read_csv("host_cksaagp.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_cksaagp.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/cksaagp_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'cksaagp_features.csv'")


###Grouped dipeptide composition (GDPC)
25 Features

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type GDPC --out host_gdpc.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type GDPC --out pathogen_gdpc.tsv


In [None]:
host_feat = pd.read_csv("host_gdpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gdpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gdpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gdpc_features.csv'")

###Grouped tripeptide composition (GTPC)
125 Features

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type GTPC --out host_gtpc.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type GTPC --out pathogen_gtpc.tsv


In [None]:
host_feat = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/gtpc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'gtpc_features.csv'")

###Combined GAAC, CKSAAGP, GDPC and GTPC

In [None]:
import pandas as pd

# Load all 4 descriptors for Host
host_gaac     = pd.read_csv("host_gaac.tsv", sep="\t").set_index("#")
host_cksaagp  = pd.read_csv("host_cksaagp.tsv", sep="\t").set_index("#")
host_gdpc     = pd.read_csv("host_gdpc.tsv", sep="\t").set_index("#")
host_gtpc     = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")

# Load all 4 descriptors for Pathogen
patho_gaac    = pd.read_csv("pathogen_gaac.tsv", sep="\t").set_index("#")
patho_cksaagp = pd.read_csv("pathogen_cksaagp.tsv", sep="\t").set_index("#")
patho_gdpc    = pd.read_csv("pathogen_gdpc.tsv", sep="\t").set_index("#")
patho_gtpc    = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")

# Combine all host features (the H_ prefix keeps host and pathogen column names distinct)
host_feat = pd.concat([host_gaac, host_cksaagp, host_gdpc, host_gtpc], axis=1).add_prefix("H_")

# Combine all pathogen features (P_ prefix)
patho_feat = pd.concat([patho_gaac, patho_cksaagp, patho_gdpc, patho_gtpc], axis=1).add_prefix("P_")

# Reset index for alignment
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Combine host and pathogen features side-by-side
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add labels (assumes your original dataframe with labels is named df)
combined_features["label"] = df["label"]

# Save final feature set
combined_features.to_csv(file_path + "/combined_gaac_cksaagp_gdpc_gtpc.csv", index=False)

# Output shape
print("Final shape:", combined_features.shape)
print("Saved as: combined_gaac_cksaagp_gdpc_gtpc.csv")


##AUTOCORRELATION

###Moran Autocorrelation
240 Features
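
Moran autocorrelation measures how similar a physicochemical property is between residues that lie a fixed number of positions (the lag) apart. The 240 features are consistent with 8 properties evaluated at 30 lags, which we assume here to be iFeature's default settings. The sketch below computes the statistic for one property profile and one lag. The function name `moran` and the property values are illustrative and are not real AAindex data, and the normalisation of property values that iFeature applies beforehand is omitted.

```python
def moran(values, lag):
    """Moran autocorrelation of a per-residue property profile at the given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Average product of deviations for residues `lag` positions apart.
    covariance = sum(deviations[i] * deviations[i + lag] for i in range(n - lag)) / (n - lag)
    variance = sum(d * d for d in deviations) / n
    return covariance / variance if variance else 0.0


# Hypothetical property values along a short sequence (not real AAindex data).
profile = [1.0, 2.0, 3.0, 4.0]
print(moran(profile, lag=1))

# One value per (property, lag) combination gives the descriptor length.
n_properties, max_lag = 8, 30
print(n_properties * max_lag)
```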

In [None]:
# Host Moran
!python3 iFeature/iFeature.py --file host.fasta --type Moran --out host_moran.tsv


# Pathogen Moran
!python3 iFeature/iFeature.py --file pathogen.fasta --type Moran --out pathogen_moran.tsv

In [None]:
host_feat = pd.read_csv("host_moran.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_moran.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/moran_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'moran_features.csv'")

###Geary Autocorrelation
240 Features

In [None]:
# Host Geary
!python3 iFeature/iFeature.py --file host.fasta --type Geary --out host_geary.tsv


# Pathogen Geary
!python3 iFeature/iFeature.py --file pathogen.fasta --type Geary --out pathogen_geary.tsv

In [None]:
host_feat = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/geary_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'geary_features.csv'")

###Normalized Moreau-Broto Autocorrelation (NMBroto)
240 Features

In [None]:

# Host Moreau-Broto
!python3 iFeature/iFeature.py --file host.fasta --type NMBroto --out host_nmbroto.tsv


# Pathogen Moreau-Broto
!python3 iFeature/iFeature.py --file pathogen.fasta --type NMBroto --out pathogen_nmbroto.tsv

In [None]:
host_feat = pd.read_csv("host_nmbroto.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_nmbroto.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/nmbroto_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'nmbroto_features.csv'")

###Combined Autocorrelation

In [None]:
import pandas as pd

# --- Load host features ---
host_moran = pd.read_csv("host_moran.tsv", sep="\t").set_index("#")
host_geary = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
host_moreau = pd.read_csv("host_nmbroto.tsv", sep="\t").set_index("#")

# --- Load pathogen features ---
patho_moran = pd.read_csv("pathogen_moran.tsv", sep="\t").set_index("#")
patho_geary = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")
patho_moreau = pd.read_csv("pathogen_nmbroto.tsv", sep="\t").set_index("#")

# --- Add prefixes for clarity ---
host_moran.columns = ["H_Moran_" + col for col in host_moran.columns]
host_geary.columns = ["H_Geary_" + col for col in host_geary.columns]
host_moreau.columns = ["H_MB_" + col for col in host_moreau.columns]

patho_moran.columns = ["P_Moran_" + col for col in patho_moran.columns]
patho_geary.columns = ["P_Geary_" + col for col in patho_geary.columns]
patho_moreau.columns = ["P_MB_" + col for col in patho_moreau.columns]

# --- Combine features ---
host_combined = pd.concat([host_moran, host_geary, host_moreau], axis=1).reset_index(drop=True)
patho_combined = pd.concat([patho_moran, patho_geary, patho_moreau], axis=1).reset_index(drop=True)

# --- Combine host and pathogen into one feature matrix ---
combined_features = pd.concat([host_combined, patho_combined], axis=1)


#Add the label column
combined_features["label"] = df["label"]


#save as csv
combined_features.to_csv(file_path + "/autocorrelation_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'autocorrelation_features.csv'")
combined_features





##QUASI SEQUENCE ORDER


###Sequence-order-coupling number (SOCNumber)

60 Features

In [None]:
##Sequence-order-coupling number (SOCNumber)
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type SOCNumber --out host_socnumber.tsv

# For pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type SOCNumber --out pathogen_socnumber.tsv


In [None]:

host_feat = pd.read_csv("host_socnumber.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_socnumber.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/socnumber_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'socnumber_features.csv'")

###Quasi-sequence-order descriptors (QSOrder)
100 Features

In [None]:
##Quasi-sequence-order descriptors (QSOrder)
# Host protein sequences
!python3 iFeature/iFeature.py --file host.fasta --type QSOrder --out host_qsorder.tsv

# Pathogen protein sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type QSOrder --out pathogen_qsorder.tsv


In [None]:
##Quasi-sequence-order descriptors (QSOrder)
host_feat = pd.read_csv("host_qsorder.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_qsorder.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/qsorder_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'qsorder_features.csv'")

###Combined SOCNumber and QSOrder

In [None]:
import pandas as pd

# Load features
host_qsorder = pd.read_csv("host_qsorder.tsv", sep="\t").set_index("#")
host_soc     = pd.read_csv("host_socnumber.tsv", sep="\t").set_index("#")

patho_qsorder = pd.read_csv("pathogen_qsorder.tsv", sep="\t").set_index("#")
patho_soc     = pd.read_csv("pathogen_socnumber.tsv", sep="\t").set_index("#")

# Combine all features per side; H_/P_ prefixes keep host and pathogen column names distinct
host_feat = pd.concat([host_qsorder, host_soc], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_qsorder, patho_soc], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

#Add the label column
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_qsorder_soc.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_qsorder_soc.csv")


##CONJOINT TRIAD


###Conjoint triad (CTriad)
343 Features
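
The 343 features come from grouping the 20 amino acids into 7 classes and counting every triad of consecutive classes (7 × 7 × 7 = 343). The sketch below shows the counting step only. The class grouping is the one commonly used for the conjoint triad method and is stated here as an assumption, the function name `conjoint_triad_counts` is made up, and the normalisation that iFeature applies to the raw counts is omitted.

```python
from itertools import product

# Seven residue classes used by the conjoint triad method (assumed grouping).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: str(i) for i, members in enumerate(CLASSES, start=1) for aa in members}
TRIADS = ["".join(t) for t in product("1234567", repeat=3)]  # 7 ** 3 = 343 triads


def conjoint_triad_counts(sequence):
    """Count each class triad over all windows of three consecutive residues."""
    counts = dict.fromkeys(TRIADS, 0)
    encoded = [CLASS_OF.get(aa) for aa in sequence.upper()]
    for window in zip(encoded, encoded[1:], encoded[2:]):
        if None not in window:  # skip windows containing non-standard residues
            counts["".join(window)] += 1
    return counts


counts = conjoint_triad_counts("MKVLAAGIVGLLLAQW")  # hypothetical sequence
print(len(counts), sum(counts.values()))
```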

In [None]:
#Conjoint Triad
# For host sequences
!python3 iFeature/iFeature.py --file host.fasta --type CTriad --out host_ctriad.tsv

# For pathogen sequences
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTriad --out pathogen_ctriad.tsv


In [None]:
#conjoint Triad
host_feat = pd.read_csv("host_ctriad.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctriad.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctriad_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctriad_features.csv'")

###Conjoint k-spaced triad (KSCTriad)

In [None]:
!python3 iFeature/iFeature.py --file host.fasta --type KSCTriad --out host_ksctriad.tsv


!python3 iFeature/iFeature.py --file pathogen.fasta --type KSCTriad --out pathogen_ksctriad.tsv


In [None]:
host_feat = pd.read_csv("host_ksctriad.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ksctriad.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ksctriad_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ksctriad_features.csv'")

###Combined CTriad and KSCTriad

In [None]:
# Load features
host_ctriad = pd.read_csv("host_ctriad.tsv", sep="\t").set_index("#")
host_ksctriad = pd.read_csv("host_ksctriad.tsv", sep="\t").set_index("#")

patho_ctriad = pd.read_csv("pathogen_ctriad.tsv", sep="\t").set_index("#")
patho_ksctriad = pd.read_csv("pathogen_ksctriad.tsv", sep="\t").set_index("#")

# Combine all features per side; H_/P_ prefixes keep host and pathogen column names distinct
host_feat = pd.concat([host_ctriad, host_ksctriad], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_ctriad, patho_ksctriad], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column
combined_features["label"] = df["label"]

# Save the final combined features to the Features folder on Drive
combined_features.to_csv(file_path + "/combined_ctriad_ksctriad.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_ctriad_ksctriad.csv")


##PSEUDO-AMINO ACID COMPOSITION

###Pseudo-amino acid composition (PAAC)
50 Features

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type PAAC --out host_paac.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type PAAC --out pathogen_paac.tsv


In [None]:
host_feat = pd.read_csv("host_paac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_paac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/paac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'paac_features.csv'")

###Amphiphilic PAAC (APAAC)
80 Features

In [None]:
# For Host
!python3 iFeature/iFeature.py --file host.fasta --type APAAC --out host_apaac.tsv

# For Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type APAAC --out pathogen_apaac.tsv


In [None]:
host_feat = pd.read_csv("host_apaac.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_apaac.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/apaac_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'apaac_features.csv'")

###Combined PAAC and APAAC

In [None]:
# Load features
host_paac = pd.read_csv("host_paac.tsv", sep="\t").set_index("#")
host_apaac = pd.read_csv("host_apaac.tsv", sep="\t").set_index("#")

patho_paac = pd.read_csv("pathogen_paac.tsv", sep="\t").set_index("#")
patho_apaac = pd.read_csv("pathogen_apaac.tsv", sep="\t").set_index("#")

# Combine all features per side; H_/P_ prefixes keep host and pathogen column names distinct
host_feat = pd.concat([host_paac, host_apaac], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_paac, patho_apaac], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column
combined_features["label"] = df["label"]

# Save the final combined features to the Features folder on Drive
combined_features.to_csv(file_path + "/combined_paac_apaac.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_paac_apaac.csv")


##C/T/D

###Composition (CTDC)
39 Features

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDC --out host_ctdc.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDC --out pathogen_ctdc.tsv


In [None]:
host_feat = pd.read_csv("host_ctdc.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdc.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdc_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdc_features.csv'")
combined_features.head()

###Transition (CTDT)
39 Features

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDT --out host_ctdt.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDT --out pathogen_ctdt.tsv


In [None]:
host_feat = pd.read_csv("host_ctdt.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdt.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdt_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdt_features.csv'")
combined_features.head()

###Distribution (CTDD)
195 Features

In [None]:
# Host
!python3 iFeature/iFeature.py --file host.fasta --type CTDD --out host_ctdd.tsv

# Pathogen
!python3 iFeature/iFeature.py --file pathogen.fasta --type CTDD --out pathogen_ctdd.tsv


In [None]:
host_feat = pd.read_csv("host_ctdd.tsv", sep="\t").set_index("#")
patho_feat = pd.read_csv("pathogen_ctdd.tsv", sep="\t").set_index("#")


# Add prefixes to column names
host_feat.columns = ["H_" + col for col in host_feat.columns]
patho_feat.columns = ["P_" + col for col in patho_feat.columns]


# Concatenate by row order
combined_features = pd.concat(
    [host_feat.reset_index(drop=True), patho_feat.reset_index(drop=True)], axis=1
)


#Add the label column
combined_features["label"] = df["label"]


# Save to CSV
combined_features.to_csv(file_path + "/ctdd_features.csv", index=False)

print("Final shape:", combined_features.shape)
print("Saved as 'ctdd_features.csv'")
combined_features.head()

###Combined CTDC, CTDT and CTDD

In [None]:
import pandas as pd

# Load the CTD features for host
host_ctdc = pd.read_csv("host_ctdc.tsv", sep="\t").set_index("#")
host_ctdt = pd.read_csv("host_ctdt.tsv", sep="\t").set_index("#")
host_ctdd = pd.read_csv("host_ctdd.tsv", sep="\t").set_index("#")

# Load the CTD features for pathogen
patho_ctdc = pd.read_csv("pathogen_ctdc.tsv", sep="\t").set_index("#")
patho_ctdt = pd.read_csv("pathogen_ctdt.tsv", sep="\t").set_index("#")
patho_ctdd = pd.read_csv("pathogen_ctdd.tsv", sep="\t").set_index("#")

# Combine host and pathogen features separately; H_/P_ prefixes keep column names distinct
host_feat = pd.concat([host_ctdc, host_ctdt, host_ctdd], axis=1).add_prefix("H_").reset_index(drop=True)
patho_feat = pd.concat([patho_ctdc, patho_ctdt, patho_ctdd], axis=1).add_prefix("P_").reset_index(drop=True)

# Combine host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column from your original dataframe
combined_features["label"] = df["label"]

# Save to CSV
combined_features.to_csv(file_path + "/combined_ctd.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_ctd.csv")


##Combine PAAC + CTriad + CKSAAP

##Combined Geary + GTPC

In [None]:
import pandas as pd

# Load features
host_geary = pd.read_csv("host_geary.tsv", sep="\t").set_index("#")
host_gtpc = pd.read_csv("host_gtpc.tsv", sep="\t").set_index("#")

patho_geary = pd.read_csv("pathogen_geary.tsv", sep="\t").set_index("#")
patho_gtpc = pd.read_csv("pathogen_gtpc.tsv", sep="\t").set_index("#")

# Combine all features per side; H_/P_ prefixes keep host and pathogen column names distinct
host_feat = pd.concat([host_geary, host_gtpc], axis=1).add_prefix("H_")
patho_feat = pd.concat([patho_geary, patho_gtpc], axis=1).add_prefix("P_")

# Reset index to align with original dataframe structure
host_feat = host_feat.reset_index(drop=True)
patho_feat = patho_feat.reset_index(drop=True)

# Merge host + pathogen features
combined_features = pd.concat([host_feat, patho_feat], axis=1)

# Add the label column from original df
combined_features["label"] = df["label"]

# Save the final combined features
combined_features.to_csv(file_path + "/combined_geary_gtpc.csv", index=False)

print("Final shape:", combined_features.shape)
print("Combined feature file saved: combined_geary_gtpc.csv")


**Now that we have extracted all the features, we can proceed to the final notebook.**