<a href="https://colab.research.google.com/github/ItunuIsewon/Machine_learning_for_host_pathogen_protein-protein_interaction_prediction_Tutorial/blob/main/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**DATA PREPROCESSING**

Before training a machine learning model to predict host-pathogen protein-protein interactions (HPPIs), it is important to carry out data preprocessing. This crucial step ensures that the data is clean and properly structured so that the ML models can learn effectively.

###Objectives of this notebook:
In this notebook, we will perform essential data preprocessing so that we proceed with high-quality data for host-pathogen protein-protein interaction prediction. We will carry out the following steps:
1. Filter out invalid sequences: Remove proteins containing non-standard amino acids, along with their corresponding pairs, across all organisms.
2. Merge positive interaction data: Combine the cleaned positive data into a single dataset and label it as 1.
3. Clean the negative interaction data pool: For each organism's negative interaction pool, apply the same sequence validation to remove pairs that contain invalid amino acids.
4. Balance positive and negative samples: For each organism, randomly select a number of valid negative pairs that matches the number of positive pairs.
5. Merge negative interaction data: Combine the selected negative pairs into a single dataset and label it as 0.
6. Combine the positive and negative data and save the result for feature extraction.


**Removing Proteins with non-standard amino acids**

This step filters out host and pathogen protein sequences that contain non-standard amino acids. Sequences containing non-standard residues such as pyrrolysine (O) and selenocysteine (U), or ambiguous residues such as “X” (unknown) and “B” (aspartate or asparagine), are removed to avoid introducing ambiguity and disrupting the calculation of features such as amino acid composition and physicochemical property vectors.



To do this:
* Each protein sequence is checked for the presence of non-standard amino acid residues.
* If either the host or the pathogen sequence in a pair contains such a residue, the entire interaction pair is removed from the dataset.
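
The pair-level rule above can be illustrated on a few toy sequences. This is a minimal sketch, not the notebook's validation function: the helper name `nonstandard_residues` and the made-up sequences are assumptions for illustration, and the minimum-length check applied later in this notebook is left out.

```python
import pandas as pd

# The 20 standard amino acids, as one-letter codes
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def nonstandard_residues(seq):
    """Return the sorted non-standard characters found in a sequence (hypothetical helper)."""
    return sorted(set(str(seq).upper().strip()) - STANDARD_AAS)

# Toy pairs (made-up sequences, for illustration only)
toy = pd.DataFrame({
    "host_sequence": ["MKTAYIAKQR", "MKUAYIAKQR", "MKTAYIAKQR"],
    "pathogen_sequence": ["MSDNELVKLA", "MSDNELVKLA", "MSXNELVKLA"],
})

# A pair is kept only if neither sequence contains a non-standard residue
host_ok = toy["host_sequence"].apply(lambda s: not nonstandard_residues(s))
pathogen_ok = toy["pathogen_sequence"].apply(lambda s: not nonstandard_residues(s))
kept = toy[host_ok & pathogen_ok]

print(nonstandard_residues("MKUAYIAKQR"))
print(len(kept), "of", len(toy), "pairs kept")
```

Listing the offending characters, rather than only returning True or False, can help when checking why a given pair was dropped.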

In [None]:
#Mounting Google Drive to access files
import os
from google.colab import drive
drive.mount ('/content/my_drive')

#import pandas for data manipulation
import pandas as pd


##Sequence Validation for Positive Dataset

###Host-Bacillus

####Step 1
Create a sequence-only Host-Pathogen Interaction pair table. The same step is applied to both the positive and the negative data.

In [None]:

file_path = '/content/my_drive/My Drive/HPI'

#Load in the data
bacillus = file_path + "/host_bacillus_pairs.csv"

df = pd.read_csv(bacillus)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/bacillus_sequence_pairs.csv', index=False)



####Step 2
Define a sequence validation function, which also rejects sequences shorter than 35 residues, and apply it to both the positive and negative datasets.

In [None]:
#SEQUENCE VALIDATION FUNCTION
# Define allowed amino acids (standard 20)
valid_aas = set("ACDEFGHIKLMNPQRSTVWY")

# Sequence Validation function
def is_valid_protein(seq, min_length=35):   # minimum sequence length is set to 35
    if pd.isna(seq) or not isinstance(seq, str):
        return False
    seq = seq.upper().strip()
    if len(seq) < min_length:
        return False
    return all(aa in valid_aas for aa in seq)



In [None]:
#SEQUENCE VALIDATION FOR BACILLUS
# Read your sequence pair DataFrame
bacillus = file_path+"/bacillus_sequence_pairs.csv"
df = pd.read_csv(bacillus)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_positivebacillus.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_positivebacillus.csv'")


Saved 644 valid HPI pairs to 'validated_positivebacillus.csv'


Now repeat steps 1 and 2 for the other bacteria species.
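
Since the same two steps are repeated for every organism, they could also be wrapped in one function. The following is a minimal sketch under stated assumptions: `validate_pairs` and `is_valid_seq` are hypothetical names (not part of this notebook), the input columns are assumed to be `Host Sequence` and `Pathogen Sequence` as in the cells above, and the toy data uses a shorter minimum length so that the example stays small.

```python
import pandas as pd

VALID_AAS = set("ACDEFGHIKLMNPQRSTVWY")  # standard 20 amino acids

def is_valid_seq(seq, min_length=35):
    """Same rule as above: a string, long enough, standard residues only."""
    if not isinstance(seq, str):
        return False
    seq = seq.upper().strip()
    return len(seq) >= min_length and all(aa in VALID_AAS for aa in seq)

def validate_pairs(raw_df, min_length=35):
    """Step 1 (keep the two sequence columns) and step 2 (filter) in one call."""
    pairs = raw_df[["Host Sequence", "Pathogen Sequence"]].copy()
    pairs.columns = ["host_sequence", "pathogen_sequence"]
    keep = pd.Series(True, index=pairs.index)
    for col in pairs.columns:
        keep = keep & pairs[col].apply(lambda s: is_valid_seq(s, min_length))
    valid = pairs[keep].copy()
    for col in valid.columns:
        # Standardize the sequences that are kept
        valid[col] = valid[col].str.upper().str.strip()
    return valid

# Toy input (made-up sequences); min_length lowered only for this example
raw = pd.DataFrame({
    "Host Sequence": [" mktay ", "MKBAY", "MK"],
    "Pathogen Sequence": ["MSDNE", "MSDNE", "MSDNE"],
    "Extra": [1, 2, 3],
})
clean = validate_pairs(raw, min_length=5)
print(clean)

# Possible use on the tutorial files (input stem -> output stem as spelled in this notebook):
# stems = {"bacillus": "bacillus", "ecoli": "ecoli",
#          "francisella": "fransicella", "yersinia": "yersinia"}
# for src, dst in stems.items():
#     raw = pd.read_csv(f"{file_path}/host_{src}_pairs.csv")
#     validate_pairs(raw).to_csv(f"{file_path}/validated_positive{dst}.csv", index=False)
```

The cells below keep the explicit per-organism version so that each intermediate file can be inspected.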

###Host-Ecoli

In [None]:
#Load in the data
ecoli = file_path + "/host_ecoli_pairs.csv"

df = pd.read_csv(ecoli)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/ecoli_sequence_pairs.csv', index=False)



In [None]:
# Read your sequence pair DataFrame
ecoli = file_path+"/ecoli_sequence_pairs.csv"
df = pd.read_csv(ecoli)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_positiveecoli.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_positiveecoli.csv'")


###Host-Francisella

In [None]:
#Load in the data
fransicella = file_path + "/host_francisella_pairs.csv"

df = pd.read_csv(fransicella)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/fransicella_sequence_pairs.csv', index=False)



In [None]:
# Read your sequence pair DataFrame
fransicella = file_path+"/fransicella_sequence_pairs.csv"
df = pd.read_csv(fransicella)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_positivefransicella.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_positivefransicella.csv'")


###Host-Yersinia

In [None]:
#Load in the data
yersinia = file_path + "/host_yersinia_pairs.csv"

df = pd.read_csv(yersinia)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/yersinia_sequence_pairs.csv', index=False)



In [None]:
# Read your sequence pair DataFrame
yersinia = file_path+"/yersinia_sequence_pairs.csv"
df = pd.read_csv(yersinia)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_positiveyersinia.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_positiveyersinia.csv'")


##Merge all positive data

In [None]:
# File paths
file1 = file_path + "/validated_positivebacillus.csv"
file2 = file_path + "/validated_positiveecoli.csv"
file3 = file_path + "/validated_positivefransicella.csv"
file4 = file_path + "/validated_positiveyersinia.csv"

# Read the csv files into DataFrames
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
df4 = pd.read_csv(file4)

concatenated_df = pd.concat([df1, df2, df3, df4], ignore_index=True)


# Save the merged DataFrame to a new csv file
concatenated_df.to_csv(file_path + "/AllpositiveHPI.csv", index=False)

print(f"Files merged successfully: {len(concatenated_df)} positive pairs")



##Sequence Validation for Negative Data Pool

Before extracting the negative class, we will carry out sequence validation on each organism's negative data pool, so that the balanced negative set is drawn from data that has already been preprocessed.

In [None]:
#Sequence Validation Function
# Define allowed amino acids (standard 20)
valid_aas = set("ACDEFGHIKLMNPQRSTVWY")

# Sequence Validation function
def is_valid_protein(seq, min_length=35):
    if pd.isna(seq) or not isinstance(seq, str):
        return False
    seq = seq.upper().strip()
    if len(seq) < min_length:
        return False
    return all(aa in valid_aas for aa in seq)



###Negative Host Bacillus

In [None]:
file_path = '/content/my_drive/My Drive/HPI'
import pandas as pd

In [None]:
#Load in the data
bacillus = file_path + "/BacillusNegativePool.csv"

df = pd.read_csv(bacillus)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/bacillus_neg_pairs.csv', index=False)

In [None]:
# Read your sequence pair DataFrame
bacillus = file_path+"/bacillus_neg_pairs.csv"
df = pd.read_csv(bacillus)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_negativebacillus.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_negativebacillus.csv'")


Saved 2962 valid HPI pairs to 'validated_negativebacillus.csv'


###Negative Host-Ecoli

In [None]:
#Load in the data
ecoli = file_path + "/EcoliNegativePool.csv"

df = pd.read_csv(ecoli)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/ecoli_neg_pairs.csv', index=False)


In [None]:
# Read your sequence pair DataFrame
ecoli = file_path+"/ecoli_neg_pairs.csv"
df = pd.read_csv(ecoli)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_negativeecoli.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_negativeecoli.csv'")


###Negative Host-Francisella

In [None]:
#Load in the data
fransicella = file_path + "/FrancisellaNegativePool.csv"

df = pd.read_csv(fransicella)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/fransicella_neg_pairs.csv', index=False)


In [None]:
# Read your sequence pair DataFrame
fransicella = file_path+"/fransicella_neg_pairs.csv"
df = pd.read_csv(fransicella)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_negativefransicella.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_negativefransicella.csv'")


Saved 5968 valid HPI pairs to 'validated_negativefransicella.csv'


###Negative Host-Yersinia

In [None]:
#Load in the data
yersinia = file_path + "/YersiniaNegativePool.csv"

df = pd.read_csv(yersinia)


##creates Host-Pathogen Interaction (HPI) sequence pairs only
sequence_pairs = list(zip(df['Host Sequence'], df['Pathogen Sequence']))


# Create a new DataFrame with just the sequence pairs
sequence_pairs_df = pd.DataFrame(sequence_pairs, columns=['host_sequence', 'pathogen_sequence'])

# Save to CSV
sequence_pairs_df.to_csv(file_path + '/yersinia_neg_pairs.csv', index=False)


In [None]:
# Read your sequence pair DataFrame
yersinia = file_path+"/yersinia_neg_pairs.csv"
df = pd.read_csv(yersinia)

# Standardize sequences: strip & convert to uppercase
df['host_sequence'] = df['host_sequence'].str.upper().str.strip()
df['pathogen_sequence'] = df['pathogen_sequence'].str.upper().str.strip()

# Validate sequences
df['valid_host'] = df['host_sequence'].apply(is_valid_protein)
df['valid_pathogen'] = df['pathogen_sequence'].apply(is_valid_protein)

# Filter to keep only valid pairs
df_valid = df[df['valid_host'] & df['valid_pathogen']].copy()

# Drop helper columns
df_valid.drop(columns=['valid_host', 'valid_pathogen'], inplace=True)

# Export to CSV
df_valid.to_csv(file_path + "/validated_negativeyersinia.csv", index=False)

print(f"Saved {len(df_valid)} valid HPI pairs to 'validated_negativeyersinia.csv'")


#**Ensuring Class Balance After Sequence Validation**
After sequence validation, the positive and negative datasets no longer contain the same number of interactions (4447 positive pairs and 4425 negative pairs). To restore balance, we will randomly select negative pairs to match the number of positive pairs.


From each organism's validated negative pool, we will randomly select as many pairs as there are positive interactions for that species. The number of selected negative pairs should match the counts below:
* 644 for *Bacillus anthracis*
* 62 for *Escherichia coli*
* 1199 for *Francisella tularensis*
* 2542 for *Yersinia pestis*
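
The counts above are hard-coded in the sampling cells that follow. As an alternative, the sample size could be read from the validated positive data itself. This is a minimal sketch under stated assumptions: `sample_matching_negatives` is a hypothetical helper, the toy rows are placeholders rather than real sequences, and the size check is an added safeguard, not something the notebook performs.

```python
import pandas as pd

def sample_matching_negatives(positive_df, negative_pool_df, seed=42):
    """Randomly draw as many negative pairs as there are positive pairs."""
    n_pos = len(positive_df)
    if len(negative_pool_df) < n_pos:
        raise ValueError(
            f"Negative pool has {len(negative_pool_df)} pairs, "
            f"but {n_pos} are needed to match the positives."
        )
    # A fixed random_state makes the draw reproducible
    return negative_pool_df.sample(n=n_pos, random_state=seed).reset_index(drop=True)

# Toy data (placeholder strings, not real sequences)
pos = pd.DataFrame({
    "host_sequence": ["H1", "H2", "H3"],
    "pathogen_sequence": ["P1", "P2", "P3"],
})
neg_pool = pd.DataFrame({
    "host_sequence": [f"H{i}" for i in range(10, 20)],
    "pathogen_sequence": [f"P{i}" for i in range(10, 20)],
})

neg_subset = sample_matching_negatives(pos, neg_pool)
print(len(pos), len(neg_subset))
```

In the notebook, `positive_df` would correspond to a file such as `validated_positivebacillus.csv` and `negative_pool_df` to `validated_negativebacillus.csv`.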


###Balanced Negative Bacillus

In [None]:
import pandas as pd

# Load the validated negative data pool
negative_pool_df = pd.read_csv(file_path + "/validated_negativebacillus.csv")

# Randomly sample 644 pairs
negative_subset = negative_pool_df.sample(n=644, random_state=42)  # random_state for reproducibility

# Save to new file
negative_subset.to_csv(file_path + "/Negative_bacillus_HPI.csv", index=False)

print("644 negative pairs randomly selected and saved to 'Negative_bacillus_HPI.csv'")


###Balanced Negative Ecoli

In [None]:
# Load the validated negative data pool
negative_pool_df = pd.read_csv(file_path + "/validated_negativeecoli.csv")

# Randomly sample 62 pairs
negative_subset = negative_pool_df.sample(n=62, random_state=42)  # random_state for reproducibility

# Save to new file
negative_subset.to_csv(file_path + "/Negative_ecoli_HPI.csv", index=False)

print("62 negative pairs randomly selected and saved to 'Negative_ecoli_HPI.csv'")


###Balanced Negative Francisella

In [None]:
# Load the validated negative data pool
negative_pool_df = pd.read_csv(file_path + "/validated_negativefransicella.csv")

# Randomly sample 1199 pairs
negative_subset = negative_pool_df.sample(n=1199, random_state=42)  # random_state for reproducibility

# Save to new file
negative_subset.to_csv(file_path + "/Negative_francisella_HPI.csv", index=False)

print("1199 negative pairs randomly selected and saved to 'Negative_francisella_HPI.csv'")


###Balanced Negative Yersinia

In [None]:
# Load the validated negative data pool
negative_pool_df = pd.read_csv(file_path + "/validated_negativeyersinia.csv")

# Randomly sample 2542 pairs
negative_subset = negative_pool_df.sample(n=2542, random_state=42)  # random_state for reproducibility

# Save to new file
negative_subset.to_csv(file_path + "/Negative_yersinia_HPI.csv", index=False)

print("2542 negative pairs randomly selected and saved to 'Negative_yersinia_HPI.csv'")


#**Concatenate the Negative HPI data**
This will form the Negative dataset

In [None]:
# Load the individual negative datasets
bacillus_df = pd.read_csv(file_path + "/Negative_bacillus_HPI.csv")
ecoli_df = pd.read_csv(file_path + "/Negative_ecoli_HPI.csv")
francisella_df = pd.read_csv(file_path + "/Negative_francisella_HPI.csv")
yersinia_df = pd.read_csv(file_path +  "/Negative_yersinia_HPI.csv")

# Concatenate all datasets into a single dataframe
negative_dataset = pd.concat([bacillus_df, ecoli_df, francisella_df, yersinia_df], ignore_index=True)

# Save the final combined dataset
negative_dataset.to_csv(file_path + "/combined_negative_HPI_dataset.csv", index=False)

print("All negative datasets have been concatenated and saved to 'combined_negative_HPI_dataset.csv'.")
len(negative_dataset)

# Merge and Label positive and Negative Data
Merge the positive and negative datasets into a single dataset, labelling positive interactions as 1 and negative interactions as 0.

In [None]:
#Load the datasets
df_pos = pd.read_csv(file_path + "/AllpositiveHPI.csv")
df_neg = pd.read_csv(file_path + "/combined_negative_HPI_dataset.csv")

#Add Labels
df_pos['label'] = 1  # Positive interactions
df_neg['label'] = 0  # Negative interactions

#Merge into one dataset
df_combined = pd.concat([df_pos, df_neg], ignore_index=True)

#Shuffle the dataset
df_combined = df_combined.sample(frac=1, random_state=42).reset_index(drop=True)

#Save
df_combined.to_csv(file_path + "/merged_hpi_dataset.csv", index=False)

print(f"Merged dataset saved: {df_combined.shape[0]} rows")




In [None]:
len(df_combined)
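
Beyond the row count, it may be worth confirming that the two classes are balanced and that no pair carries both labels. The following optional check is a minimal sketch on toy data: `summarize_labels` is a hypothetical helper that is not part of this notebook, and it assumes the column names `host_sequence`, `pathogen_sequence` and `label` used above.

```python
import pandas as pd

def summarize_labels(df, label_col="label"):
    """Return per-class counts and the number of pairs that appear with both labels."""
    counts = df[label_col].value_counts().to_dict()
    labels_per_pair = df.groupby(["host_sequence", "pathogen_sequence"])[label_col].nunique()
    conflicting = int((labels_per_pair > 1).sum())
    return counts, conflicting

# Toy merged table (placeholder strings, not real sequences)
toy_combined = pd.DataFrame({
    "host_sequence": ["H1", "H2", "H3", "H1"],
    "pathogen_sequence": ["P1", "P2", "P3", "P1"],
    "label": [1, 1, 0, 0],
})

counts, conflicting = summarize_labels(toy_combined)
print(counts)
print("pairs with both labels:", conflicting)
```

On the real data, the same call would be `summarize_labels(df_combined)`.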

**Now the dataset is ready for feature extraction in the next notebook.**