# AAIndex encoding

An amino acid index is a set of 20 numerical values, one per standard amino acid, representing a **physicochemical or biological property**. The AAindex database collects these indices, with up to 566 of them in its AAindex1 section.<br>
Reference: Shuichi Kawashima, Minoru Kanehisa, AAindex: Amino Acid index database, Nucleic Acids Research, Volume 28, Issue 1, 1 January 2000, Page 374, https://doi.org/10.1093/nar/28.1.374 <br>
Python package on GitHub: [github.com/amckenna41/aaindex](https://github.com/amckenna41/aaindex) <p>

Descriptions of all the properties are given in the [AAindex List of Indices](https://www.genome.jp/aaindex/AAindex/list_of_indices). <br>
The Python package documentation is available at [PyPI aaindex](https://pypi.org/project/aaindex/). <p>
To install aaindex: `!pip3 install aaindex --upgrade`


In [None]:
# Import Dependencies

import aaindex
from aaindex import aaindex1
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

### Selecting AAindex Properties

For our study, 8 properties were chosen. Their names and AAindex codes are as follows:<br>
1. Normalized average hydrophobicity scales ['CIDH920105'](https://www.genome.jp/entry/aaindex:CIDH920105),<br>  
2. Hydropathy index ['KYTJ820101'](https://www.genome.jp/entry/aaindex:KYTJ820101),<br> 
3. Normalized frequency of alpha-helix ['CHOP780201'](https://www.genome.jp/entry/aaindex:CHOP780201),<br>  
4. Polarity ['GRAR740102'](https://www.genome.jp/entry/aaindex:GRAR740102),<br> 
5. Hydrophilicity value ['HOPT810101'](https://www.genome.jp/entry/aaindex:HOPT810101),<br>  
6. Isoelectric point ['ZIMJ680104'](https://www.genome.jp/entry/aaindex:ZIMJ680104),<br>  
7. Average weighted atomic number ['KARS160118'](https://www.genome.jp/entry/aaindex:KARS160118),<br>  
8. Spin-spin coupling constants ['BUNA790103'](https://www.genome.jp/entry/aaindex:BUNA790103)<br>  

['CIDH920105', 'KYTJ820101', 'CHOP780201', 'GRAR740102', 'HOPT810101', 'ZIMJ680104', 'KARS160118', 'BUNA790103']

In [None]:
# Load dataset (with 'sequence' column)
protein_sequences_file = 'Example_Data.csv'  # CSV file path
df = pd.read_csv(protein_sequences_file)

# Save sequences as a list
sequences = df['sequence'].tolist()

In [None]:
# Define the selected aaindex properties
selected_properties = ['CIDH920105', 'KYTJ820101', 'CHOP780201', 'GRAR740102', 'HOPT810101', 'ZIMJ680104', 'KARS160118', 'BUNA790103'] 

# Extract the selected properties from AAindex1
aa_properties = {}
for prop in selected_properties:
    if prop in aaindex1.record_codes():
        aa_properties[prop] = aaindex1[prop].values

# Generate feature vectors for each protein sequence:
# one list of per-residue values for each selected property
encoded_features = []
for sequence in sequences:
    sequence_features = []
    for prop, prop_values in aa_properties.items():
        # Residues without an entry in the index (e.g. 'X') are encoded as 0
        prop_vector = [prop_values.get(aa, 0) for aa in sequence]
        sequence_features.append(prop_vector)
    encoded_features.append(sequence_features)
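
To make the encoding step concrete, here is a minimal sketch for a single property and a toy sequence. It assumes a hard-coded copy of the Kyte-Doolittle hydropathy values (`KYTE_DOOLITTLE`, a hypothetical stand-in for `aaindex1['KYTJ820101'].values`) so that it runs without the `aaindex` package; `encode_sequence` is an illustrative helper, not part of the library.

```python
# Hypothetical stand-in for aaindex1['KYTJ820101'].values:
# Kyte-Doolittle hydropathy values, hard-coded so the sketch runs on its own.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}


def encode_sequence(sequence, property_values, default=0):
    """Map each residue to its property value; unknown residues get `default`."""
    return [property_values.get(aa, default) for aa in sequence]


# 'X' is not a standard amino acid, so it falls back to the default of 0
toy_vector = encode_sequence("MKVX", KYTE_DOOLITTLE)
print(toy_vector)  # [1.9, -3.9, 4.2, 0]
```

Each property therefore yields one vector whose length equals the sequence length, so sequences of different lengths produce vectors of different lengths.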

In [None]:
# Save features in a dataframe: one row per sequence, one column per property,
# each cell holding that property's list of per-residue values
feature_columns = list(aa_properties)
encoded_df = pd.DataFrame(encoded_features, columns=feature_columns)

In [None]:
# To combine all properties in an array
import ast

# If encoded_df was reloaded from a CSV file, each cell is the string form of a list;
# convert those strings to real lists. Cells that are already lists are left unchanged.
for col in feature_columns:
    encoded_df[col] = encoded_df[col].apply(
        lambda cell: ast.literal_eval(cell) if isinstance(cell, str) else cell
    )

# Now each cell in these columns is a real Python list, not a string!

# Next, flatten the lists row-wise, property after property.
# encoded_df holds the property columns; .tolist() keeps each cell a plain list
# so that it is written to CSV in full.
encoded_df['encoded_combined'] = encoded_df.apply(
    lambda row: np.concatenate([row[col] for col in feature_columns]).tolist(),
    axis=1,
)
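
The layout of the combined vector can be illustrated with a small sketch. The property names and numbers below are made up, and `lookup` is a hypothetical helper; plain lists and `itertools.chain` are used here only to show the same end-to-end ordering that `np.concatenate` gives for 1-D inputs.

```python
from itertools import chain

# Toy per-property vectors for one 3-residue sequence (made-up numbers)
per_property = {
    "prop_a": [0.1, 0.2, 0.3],
    "prop_b": [1.0, 2.0, 3.0],
}
property_order = ["prop_a", "prop_b"]

# Property-major flattening: all residues of prop_a, then all residues of prop_b
combined = list(chain.from_iterable(per_property[p] for p in property_order))


def lookup(flat_vector, prop_idx, residue_idx, seq_len):
    """Value of property number `prop_idx` at residue `residue_idx`."""
    return flat_vector[prop_idx * seq_len + residue_idx]


print(combined)                   # [0.1, 0.2, 0.3, 1.0, 2.0, 3.0]
print(lookup(combined, 1, 2, 3))  # 3.0
```

With 8 properties, the combined vector for a sequence of length `n` has `8 * n` entries, so its length still depends on the sequence length.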

In [None]:
# To save encoded data in the same file
df_merged = pd.concat([df, encoded_df], axis=1, ignore_index=False, sort=False)
df_merged.to_csv("Example_Data_AAindex_encoded.csv", index=False)

# To save encoded data in a separate file instead, uncomment:
# encoded_df.to_csv("AAindex_encoded.csv", index=False)
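
When the saved file is read back, list-valued cells come back as strings, which is why a conversion with `ast.literal_eval` is needed before the values can be used numerically. The sketch below shows this round trip using only the standard library `csv` module as a stand-in for the pandas calls; the row contents are toy values.

```python
import ast
import csv
import io

# One toy row: a sequence and its per-residue values for one property
row = {"sequence": "MKV", "KYTJ820101": [1.9, -3.9, 4.2]}

# Write the row to an in-memory CSV file; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every cell is now a string
buffer.seek(0)
loaded = next(csv.DictReader(buffer))

raw_cell = loaded["KYTJ820101"]           # the string "[1.9, -3.9, 4.2]"
parsed_cell = ast.literal_eval(raw_cell)  # back to a list of floats

print(type(raw_cell).__name__, type(parsed_cell).__name__)  # str list
```

Cells that hold NumPy arrays are written without commas between the elements, so `ast.literal_eval` cannot parse them; converting arrays to plain lists before saving avoids this.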