### Objective: To organize hydrophobicity scales

In this notebook, I combine hydrophobicity scales from two sources.
The first source is the Qiagen website [here](https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Hydrophobicity_scales.html).
These scales are taken directly from the literature.

The second source is the paper *Comparison of hydrophobicity scales for predicting biophysical properties of antibodies* by Waibl et al., Frontiers in Molecular Biosciences, 2022. In this paper, they take the scales from the literature, shift each one by a constant so that the glycine hydrophobicity is 0, and then rescale it so that the variance is 1, i.e. $z_{aa}=\frac{x_{aa}-x_{\text{gly}}}{\sqrt{\mathbb{V}(X)}}$, where $x_{aa}$ is the hydrophobicity score for a given amino acid and $\mathbb{V}(X)$ is the variance of the scale across the amino acids.
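
As a minimal sketch of the scaling just described: shift by the glycine value, then divide by the standard deviation so the variance becomes 1. This assumes the sample standard deviation (pandas' default, `ddof=1`); the paper's exact convention is not confirmed here. The function name `glycine_standardize` is hypothetical, and the toy series holds four Kyte-Doolittle values purely for illustration.

```python
import pandas as pd


def glycine_standardize(scale, gly="G"):
    """Shift a scale so glycine is 0, then rescale to unit variance.

    Assumption: the spread is the sample standard deviation (ddof=1).
    """
    shifted = scale - scale.loc[gly]
    return shifted / shifted.std()


# Toy input: four Kyte-Doolittle values, indexed by one-letter code.
toy = pd.Series({"A": 1.8, "R": -4.5, "G": -0.4, "I": 4.5})
z = glycine_standardize(toy)
print(z)
```

Shifting by a constant does not change the spread, so dividing the shifted values by their own standard deviation is the same as dividing by the standard deviation of the original scale.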

I took the table directly from the paper as text. Below, I clean it up and join it to the first source. 

In [1]:
import pandas as pd
import numpy as np
from io import StringIO
from pathlib import Path
import developability

#### Code for cleaning the text 

In [2]:
def create_df_from_num_list(nums, header):
    """Build a DataFrame from a flat comma-separated string, three values per row."""
    rows = [header]
    current = []
    for num in nums.split(','):
        current.append(num)
        # Each row of the paper's table holds three values.
        if len(current) == 3:
            rows.append(','.join(current))
            current = []
    return pd.read_csv(StringIO('\n'.join(rows)), sep=",", index_col=False)

def clean_sign(df):
    """Convert columns that use the Unicode minus sign (U+2212) to floats."""
    for col in df.columns:
        # astype(str) keeps this safe for columns already parsed as numbers.
        df[col] = df[col].astype(str).str.replace('\u2212', '-', regex=False).astype(float)
    return df
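
The two cleaning steps above (grouping a flat list into rows of three and converting the Unicode minus sign) can also be sketched in a single pass with NumPy. This is an alternative under stated assumptions, not the code used in this notebook; `parse_flat_values` is a hypothetical helper name and the demo string is just the first six values of `nums1`.

```python
import numpy as np
import pandas as pd


def parse_flat_values(nums, columns):
    """Parse a flat comma-separated string into a float DataFrame.

    The number of values must be a multiple of len(columns).
    """
    # Replace the Unicode minus sign (U+2212) with the ASCII hyphen-minus
    # so that float() can parse negative numbers.
    cleaned = nums.replace("\u2212", "-")
    values = np.array([float(v) for v in cleaned.split(",")])
    # reshape raises ValueError if the values do not fill whole rows.
    return pd.DataFrame(values.reshape(-1, len(columns)), columns=columns)


demo = parse_flat_values(
    "0.76,0.07,0.18,\u22121.41,0.11,\u22120.71", ["KyDo", "Me", "Ro"]
)
print(demo)
```

Because `reshape` fails loudly on a ragged input, a dropped or duplicated value in the pasted text is caught immediately rather than silently shifting every later row.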

In [3]:
text ="""Residue,BaMe,BlMo,Ei
ALA,0.75,0.37,0.15,
ARG,−0.02,−1.52,−3.09,
ASN,−0.16,−0.79,−1.29,
ASP,−0.50,−1.43,−1.42,
CYS,2.60,0.55,−0.19,
GLN,−0.11,−0.76,−1.37,
GLU,−0.54,−1.40,−1.26,
GLY,0.00,0.00,0.00,
HIS,0.57,−1.00,−0.90,
ILE,2.19,1.34,0.92,
LEU,1.97,1.34,0.60,
LYS,−0.90,−0.67,−2.03,
MET,1.22,0.73,0.16,
PHE,1.92,1.52,0.73,
PRO,0.72,0.64,−0.37,
SER,0.11,−0.43,−0.68,
THR,0.47,−0.15,−0.55,
TRP,1.51,1.16,0.34,
TYR,1.36,1.16,−0.23,
VAL,1.88,1.00,0.61"""

In [4]:
df = pd.read_csv(StringIO(text), sep=",", index_col=False) 

In [5]:
header1 = """KyDo,Me,Ro"""
nums1="""0.76,0.07,0.18,−1.41,0.11,−0.71,−1.06,0.11,−0.80,−1.06,−1.08,−0.89,1.00,−0.90,1.69,−1.06,−0.63,−0.89,−1.06,−2.23,−0.89,0.00,0.00,0.00,−0.96,−0.46,0.53,1.68,1.83,1.42,1.44,1.16,1.16,−1.20,0.01,−1.78,0.79,0.63,1.16,1.10,1.74,1.42,−0.41,0.80,−0.71,−0.14,0.16,−0.53,−0.10,0.36,−0.18,−0.17,1.96,1.16,−0.31,0.80,0.36,1.58,0.36,1.25"""
df2 = create_df_from_num_list(nums1, header1)

In [6]:
header2 = "WiWh,Ja,Mi"
nums2 = "−0.20,0.06,0.40,−0.41,−0.32,−0.15,−0.51,0.13,−0.37,−1.53,−0.43,−0.43,0.31,0.55,1.65,−0.71,0.46,−0.29,−2.51,−0.72,−0.40,0.00,0.00,0.00,−0.20,0.03,0.30,0.35,1.54,2.08,0.71,1.54,1.91,−0.59,−1.07,−0.74,0.30,0.49,2.14,1.43,2.48,2.18,−0.55,0.44,−0.29,−0.15,0.07,−0.19,−0.16,0.16,0.00,2.33,2.81,1.52,1.19,1.84,0.68,−0.08,0.97,1.51"
df3 = create_df_from_num_list(nums2, header2)

In [7]:
df4 = (pd.concat([df, df2, df3], axis=1)
       .set_index('Residue')
       )
df4 = clean_sign(df4)


In [8]:
df4.to_csv('output.csv')
df5 = pd.read_csv('output.csv', index_col=0)

In [9]:
df5.dtypes

BaMe    float64
BlMo    float64
Ei      float64
KyDo    float64
Me      float64
Ro      float64
WiWh    float64
Ja      float64
Mi      float64
dtype: object

In [23]:
data_path = Path(developability.__path__[0])
scales = pd.read_csv(data_path/'hydrophobicity_scales.csv', index_col=0).sort_values('amino_acid_name')
scales.head()

aa,amino_acid_name,Kyte_Doolittle,Hopp_Woods,Cornette,Eisenberg,Rose,Janin,Engelman_GES
A,Alanine,1.8,-0.5,0.2,0.62,0.74,0.3,1.6
R,Arginine,-4.5,3.0,1.4,-2.53,0.64,-1.4,-12.3
N,Asparagine,-3.5,0.2,-0.5,-0.78,0.63,-0.5,-4.8
D,Aspartic acid,-3.5,3.0,-3.1,-0.9,0.62,-0.6,-9.2
C,Cysteine,2.5,-1.0,4.1,0.29,0.91,0.9,2.0


#### Check scaling method from paper
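In the paper mentioned above, they scaled the per-residue hydrophobicity scales by adding a constant so that glycine is set to zero. They then standardized the variance. Below, I check that by applying the same transformation to the literature values and subtracting the values from the paper.

The residuals are small but not zero, so my understanding of their method seems roughly correct. Note that the comparison is positional: `scales` is sorted by full amino acid name while `df5` follows the three-letter codes, so the glutamic acid and glutamine rows (positions 6 and 7) are compared against each other's values.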

In [21]:
ei = scales['Eisenberg']
ei2 = (ei- ei.loc['G'])/(ei- ei.loc['G']).std()
Ei = df5['Ei']

np.round((ei2.values - Ei.values),2)

array([-0.01,  0.08,  0.03,  0.04, -0.  ,  0.15, -0.07,  0.  ,  0.02,
       -0.02, -0.02,  0.05,  0.  , -0.02,  0.01,  0.02,  0.02, -0.01,
        0.01, -0.01])

In [12]:
kd = scales['Kyte_Doolittle']
kd2 = (kd- kd.loc['G'])/(kd- kd.loc['G']).std()
Kd = df5['KyDo']

np.round(kd2.values - Kd.values,2)

array([-0.02,  0.04,  0.02,  0.02, -0.03,  0.02,  0.02,  0.  ,  0.02,
       -0.04, -0.03,  0.03, -0.02, -0.03,  0.01,  0.01, -0.  ,  0.  ,
        0.01, -0.04])
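
A positional subtraction like the ones above only works when both tables list the residues in the same order. As a hedged sketch, the comparison can instead be aligned on residue labels by translating the three-letter codes to one-letter codes first. The helper name `residuals_by_label` and the three-residue toy values are illustrative assumptions, not results from this notebook.

```python
import pandas as pd

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residuals_by_label(reference, published):
    """Subtract two scales after matching residues by label.

    reference is indexed by one-letter codes, published by three-letter codes.
    """
    published = published.rename(index=THREE_TO_ONE)
    # Series arithmetic aligns on index labels, not on row position.
    return (reference - published).round(2)


# Toy values: the two inputs list Glu and Gln in opposite orders.
reference = pd.Series({"E": -1.25, "Q": -1.37, "G": 0.0})
published = pd.Series({"GLN": -1.37, "GLU": -1.26, "GLY": 0.0})
resid = residuals_by_label(reference, published)
print(resid)
```

With label alignment, the row order of either table no longer matters, which avoids the Glu/Gln mix-up that arises when one table is sorted by full name and the other by three-letter code.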

In [15]:
three_to_one = dict(ALA='A', ARG='R', ASN='N', ASP='D', CYS='C', GLN='Q', GLU='E',
                    GLY='G', HIS='H', ILE='I', LEU='L', LYS='K', MET='M', PHE='F',
                    PRO='P', SER='S', THR='T', TRP='W', TYR='Y', VAL='V')
# Join on the one-letter code rather than by position: the two tables sort
# glutamic acid (E, GLU) and glutamine (Q, GLN) in opposite orders.
paper = df5.reset_index()
paper['aa'] = paper['Residue'].map(three_to_one)
scales2 = scales.join(paper.set_index('aa'))
order = ['amino_acid_name', 'Residue']
scales2 = scales2[order + [col for col in scales2.columns if col not in order]]
scales2

aa,amino_acid_name,Residue,Kyte_Doolittle,Hopp_Woods,Cornette,Eisenberg,Rose,Janin,Engelman_GES,BaMe,BlMo,Ei,KyDo,Me,Ro,WiWh,Ja,Mi
A,Alanine,ALA,1.8,-0.5,0.2,0.62,0.74,0.3,1.6,0.75,0.37,0.15,0.76,0.07,0.18,-0.2,0.06,0.4
R,Arginine,ARG,-4.5,3.0,1.4,-2.53,0.64,-1.4,-12.3,-0.02,-1.52,-3.09,-1.41,0.11,-0.71,-0.41,-0.32,-0.15
N,Asparagine,ASN,-3.5,0.2,-0.5,-0.78,0.63,-0.5,-4.8,-0.16,-0.79,-1.29,-1.06,0.11,-0.8,-0.51,0.13,-0.37
D,Aspartic acid,ASP,-3.5,3.0,-3.1,-0.9,0.62,-0.6,-9.2,-0.5,-1.43,-1.42,-1.06,-1.08,-0.89,-1.53,-0.43,-0.43
C,Cysteine,CYS,2.5,-1.0,4.1,0.29,0.91,0.9,2.0,2.6,0.55,-0.19,1.0,-0.9,1.69,0.31,0.55,1.65
E,Glutamic acid,GLU,-3.5,3.0,-1.8,-0.74,0.62,-0.7,-8.2,-0.54,-1.4,-1.26,-1.06,-2.23,-0.89,-2.51,-0.72,-0.4
Q,Glutamine,GLN,-3.5,0.2,-2.8,-0.85,0.62,-0.7,-4.1,-0.11,-0.76,-1.37,-1.06,-0.63,-0.89,-0.71,0.46,-0.29
G,Glycine,GLY,-0.4,0.0,0.0,0.48,0.72,0.3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
H,Histidine,HIS,-3.2,-0.5,0.5,-0.4,0.78,-0.1,-3.0,0.57,-1.0,-0.9,-0.96,-0.46,0.53,-0.2,0.03,0.3
I,Isoleucine,ILE,4.5,-1.8,4.8,1.38,0.88,0.7,3.1,2.19,1.34,0.92,1.68,1.83,1.42,0.35,1.54,2.08


#### Save them to csv

In [24]:
# Keep the original table as a backup, then overwrite it with the merged scales.
scales.to_csv(data_path/'hydrophobicity_scales_old.csv')
scales2.to_csv(data_path/'hydrophobicity_scales.csv')