## Z-scales encoding

Z-scales is a **physicochemical property descriptor** that represents each amino acid as a five-dimensional vector. It was developed by Hellberg et al. (1987), who collected a set of up to 26 descriptors and condensed them using PCA into z-scales that correspond to properties such as hydrophobicity, steric bulk, and electronic properties.

Reference: Hellberg S, Sjöström M, Skagerberg B, Wold S. Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem. 1987 Jul;30(7):1126-35. doi: 10.1021/jm00390a003. PMID: 3599020.

In [None]:
# Import dependencies
import numpy as np
import pandas as pd

In [None]:
# Define Z-Scale values for 20 standard amino acids (Hellberg et al., 1987)
Z_SCALES = {
    "A": [0.07, -1.73, 0.16, 0.18, -0.11],
    "C": [1.26, -1.57, 0.38, -0.43, -0.21],
    "D": [-0.89, 1.34, -0.30, 0.61, -0.21],
    "E": [-1.68, 1.94, -0.27, 0.37, -0.23],
    "F": [1.52, -1.14, 0.44, -0.99, 1.14],
    "G": [-0.16, -2.46, -0.03, 0.23, 0.15],
    "H": [0.49, 0.88, -0.12, 0.27, 0.23],
    "I": [1.41, -0.84, 0.47, -1.10, 0.31],
    "K": [-1.50, 2.05, 0.30, 1.14, -0.21],
    "L": [1.14, -0.75, 0.40, -1.12, 0.26],
    "M": [0.65, -0.49, 1.30, -0.76, 0.41],
    "N": [-0.75, 1.98, -0.09, 0.14, -0.21],
    "P": [-0.46, 0.27, 0.25, -0.20, 0.14],
    "Q": [-0.73, 1.84, -0.15, 0.11, -0.21],
    "R": [-1.95, 2.44, 0.28, 1.53, -0.21],
    "S": [-0.26, 0.06, -0.11, 0.06, 0.06],
    "T": [-0.30, -0.40, -0.04, -0.32, 0.17],
    "V": [1.13, -0.67, 0.50, -1.09, 0.30],
    "W": [1.85, 0.30, 0.79, -0.71, 2.55],
    "Y": [0.94, 0.65, 0.15, -0.41, 1.61]
}
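
Before encoding, it can help to confirm that a table like this is complete. The check below is only a minimal sketch: `validate_scale_table` and `STANDARD_AMINO_ACIDS` are hypothetical names that do not appear in the original notebook, and the three-entry table is an excerpt of `Z_SCALES` so that the block runs on its own.

```python
# Hypothetical helper (not part of the original notebook): checks that a
# descriptor table covers the expected residues and has a fixed number of
# values per residue.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def validate_scale_table(table, alphabet=STANDARD_AMINO_ACIDS, n_features=5):
    """Return a list of problems; an empty list means the table looks consistent."""
    problems = []
    missing = sorted(set(alphabet) - set(table))
    if missing:
        problems.append("missing residues: " + "".join(missing))
    for aa, values in table.items():
        if len(values) != n_features:
            problems.append(f"{aa} has {len(values)} values, expected {n_features}")
    return problems

# Excerpt of Z_SCALES, copied from the cell above, so the sketch is self-contained
excerpt = {
    "A": [0.07, -1.73, 0.16, 0.18, -0.11],
    "C": [1.26, -1.57, 0.38, -0.43, -0.21],
    "D": [-0.89, 1.34, -0.30, 0.61, -0.21],
}
print(validate_scale_table(excerpt, alphabet="ACD"))  # []
```

In the notebook itself, the same check could be run as `validate_scale_table(Z_SCALES)`.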

In [None]:
# Define function for Z-scales encoding

def encode_sequence_zscale(sequence, zscale_dict):
    """
    Convert a protein sequence into a numerical matrix using Z-Scale encoding.

    Parameters:
        sequence (str): Protein sequence in one-letter code (uppercase).
        zscale_dict (dict): Mapping of amino acids to Z-Scale values.

    Returns:
        np.ndarray: Encoded sequence of shape (len(sequence), 5).
    """
    unknown = [0.0] * 5  # Placeholder for residues missing from zscale_dict (e.g. "X")
    encoding = [zscale_dict.get(aa, unknown) for aa in sequence]
    # reshape keeps the (length, 5) layout even when the sequence is empty
    return np.array(encoding, dtype=float).reshape(-1, 5)
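
As a quick illustration of the output layout, the sketch below encodes a short made-up peptide. The string `"ACXD"` and the name `Z_SCALES_EXCERPT` are assumptions for the example only; the function body is a condensed copy of the one above so that the block runs on its own.

```python
import numpy as np

# Three entries copied from Z_SCALES above, so the sketch is self-contained
Z_SCALES_EXCERPT = {
    "A": [0.07, -1.73, 0.16, 0.18, -0.11],
    "C": [1.26, -1.57, 0.38, -0.43, -0.21],
    "D": [-0.89, 1.34, -0.30, 0.61, -0.21],
}

def encode_sequence_zscale(sequence, zscale_dict):
    # Condensed copy of the notebook function: unknown residues map to zeros
    encoding = [zscale_dict.get(aa, [0, 0, 0, 0, 0]) for aa in sequence]
    return np.array(encoding)

encoded = encode_sequence_zscale("ACXD", Z_SCALES_EXCERPT)
print(encoded.shape)  # (4, 5): one row per residue, one column per z-scale
print(encoded[2])     # [0. 0. 0. 0. 0.] because "X" is not in the table
```

The lookup is case-sensitive, so lowercase input would be treated as unknown unless it is upper-cased first.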


In [None]:
# Load dataset (with 'sequence' column)
protein_sequences_file = 'Example_Data_Swapped_50_55.csv'  # CSV file path
df = pd.read_csv(protein_sequences_file)

# Apply Z-Scale encoding to all sequences
df["encoded_sequence"] = df["sequence"].apply(lambda seq: encode_sequence_zscale(seq, Z_SCALES).tolist())

# Flatten each (length × 5) matrix into a 1-D array of length * 5 values
df["flattened_sequence"] = df["encoded_sequence"].apply(lambda x: np.array(x).flatten())
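
Sequences of different lengths produce flattened vectors of different lengths, which cannot be stacked directly into one feature matrix. One common workaround, shown here only as a sketch and not taken from the original notebook, is to zero-pad (or truncate) every encoding to a fixed length. The helper `pad_encoded` and its `max_len` parameter are hypothetical.

```python
import numpy as np

def pad_encoded(encoded, max_len, n_features=5):
    """Zero-pad or truncate a (length, n_features) encoding to max_len rows, then flatten."""
    padded = np.zeros((max_len, n_features), dtype=float)
    rows = np.asarray(encoded, dtype=float).reshape(-1, n_features)
    n = min(len(rows), max_len)
    padded[:n] = rows[:n]
    return padded.flatten()

# Encodings for "A" and "AC", using values from Z_SCALES above
short = [[0.07, -1.73, 0.16, 0.18, -0.11]]
longer = [[0.07, -1.73, 0.16, 0.18, -0.11],
          [1.26, -1.57, 0.38, -0.43, -0.21]]

# Every row now has max_len * 5 values, so the rows stack into one matrix
X = np.stack([pad_encoded(e, max_len=3) for e in (short, longer)])
print(X.shape)  # (2, 15)
```

Note that zero-padding is indistinguishable from the all-zero vector used for unknown residues, so a separate mask may be needed if that distinction matters.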

In [None]:
# Save the encoded data to a new CSV file
df.to_csv("Example_Data_zscales_encoded.csv", index=False)
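
When array-valued cells are written to CSV they are stored as text, and NumPy's printed form may abbreviate long arrays with `...`, so the numbers can be hard to recover on reload. The sketch below shows one possible workaround, which is an assumption and not part of the original notebook: serialise each vector as a JSON string before saving and parse it back after loading. An in-memory buffer stands in for the CSV file.

```python
import io
import json

import numpy as np
import pandas as pd

# Tiny stand-in for the encoded DataFrame ("AC" flattened to 10 values)
df = pd.DataFrame({
    "sequence": ["AC"],
    "flattened_sequence": [np.array([0.07, -1.73, 0.16, 0.18, -0.11,
                                     1.26, -1.57, 0.38, -0.43, -0.21])],
})

# Store each vector as a JSON string so every value is written out in full
df["flattened_sequence"] = df["flattened_sequence"].apply(
    lambda v: json.dumps(np.asarray(v).tolist())
)

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
df.to_csv(buffer, index=False)
buffer.seek(0)

# Reload and turn the JSON strings back into NumPy arrays
restored = pd.read_csv(buffer)
restored["flattened_sequence"] = restored["flattened_sequence"].apply(
    lambda s: np.array(json.loads(s))
)
print(restored["flattened_sequence"][0].shape)  # (10,)
```

Binary formats such as NumPy's `.npy`/`.npz` are another option when the encodings are stored separately from the table.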