<a href="https://colab.research.google.com/github/tcardlab/optimus_bind_sample/blob/develop/notebooks/3_0_TJC_Cleaning_Code_While_No_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I'm reviewing and cleaning up SKEMPItoPandas and found two quirks.

- `Skempi_df['Temperature'][6665:6669]` is blank in the CSV.
   For whatever reason, these blanks did not default to 298.0 and are NaN instead.
   
- Unfortunately, using df.fillna to patch those 4 values introduces changes in other rows. Apart from those 4 rows, all temperature values test as equal, but the calculated results differ. I am uncertain which output you would consider correct.

**Original values vs df.fillna(298)**

```
ddgMedian
expected (tmp was nan in original):
6665 4G0N_A_B   [3.2996884744981614 vs 2.9170490380241167] 
6666  1C1Y_A_B   [nan vs 2.619152014020517]
6667  1LFD_A_B   [3.8958503047001924 vs 3.3114853574091185] 
6668  1LFD_A_B   [nan vs -0.21121814832220487] 
unexpected(same pdb #'s):
6472  1LFD_A_B   [3.8958503047001924 vs 3.3114853574091185]  310.0
6493 4G0N_A_B   [3.2996884744981614 vs 2.9170490380241167]   308.0
```


Is it possible that filling the NaN values affected the median calculation
```df.groupby(...)['ddG'].transform('median')```,
thus creating the two unexpected outputs for rows with the same PDB IDs?

The worst-case scenario is that we also use df.dropna on temperature. I have tested it, and it produces consistent results.
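
The question above can be probed with a toy frame. This is a minimal sketch, not the SKEMPI data: the column names `key` and `ddG` and all values are made up for illustration. It relies on pandas group medians skipping NaN by default, so giving a previously NaN row a real value changes the median reported for every row that shares its group key.

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 share a key; row 1 has no ddG (as if its temperature were NaN).
before = pd.DataFrame({'key': ['m1', 'm1', 'm2'],
                       'ddG': [3.0, np.nan, 1.0]})

# Same rows once the missing ddG becomes computable (e.g. temperature filled with 298).
after = before.copy()
after.loc[1, 'ddG'] = 2.0

median_before = before.groupby('key')['ddG'].transform('median')
median_after = after.groupby('key')['ddG'].transform('median')

print(median_before.tolist())  # NaN is skipped, so group m1 reports 3.0
print(median_after.tolist())   # group m1 now reports the median of 3.0 and 2.0
```

If rows 6472 and 6493 share a mutation key with the patched rows 6667 and 6665, this mechanism would be consistent with the "unexpected" changes listed above.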

In [0]:
import pandas as pd
import numpy as np
import re

In [4]:
link = 'https://life.bsc.es/pid/skempi2/database/download/skempi_v2.csv'

'''Proper Python retrieval'''
#from urllib.request import urlretrieve
#csv_path, _ = urlretrieve(link,f'skempi_v2.0.csv')

'''Direct import to Pandas'''
#data = pd.read_csv(link, sep=';')
#print(data)

'''OS get'''
!wget $link -O skempi_v2.0.csv #-O to rename

--2019-07-18 09:11:24--  https://life.bsc.es/pid/skempi2/database/download/skempi_v2.csv
Resolving life.bsc.es (life.bsc.es)... 84.88.52.107
Connecting to life.bsc.es (life.bsc.es)|84.88.52.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1602208 (1.5M) [text/csv]
Saving to: ‘skempi_v2.0.csv’


2019-07-18 09:11:28 (493 KB/s) - ‘skempi_v2.0.csv’ saved [1602208/1602208]



#original

In [0]:
def SKEMPItoPandas(SKEMPI_loc):
    '''
    Purpose:
        1. Loads SKEMPI CSV file.
        2. Calculates ddG
        3. For multiple measurements, keeps the median value
        4. Eliminates entries with mutations on both sides of the interface
    Input:
        SKEMPI_loc : Location of SKEMPI CSV file
    Output:
        SKEMPI_df : Pandas dataframe    
    '''
    
    # fix this
    pd.options.mode.chained_assignment = None  # default='warn'

    # Constants
    R = 1.9872036e-3  # Ideal Gas Constant in kcal

    SKEMPI_df = pd.read_csv(SKEMPI_loc, sep=';')

    # Convert non numeric temperature comments to numeric values. Default is 298K 
    ConvertTemp = lambda x: int(re.search(r'\d+', x).group(0) or 298)
    BadTemps = SKEMPI_df.Temperature.str.isnumeric() == 0
    SKEMPI_df['Temperature'].loc[BadTemps] = SKEMPI_df['Temperature'].loc[BadTemps].map(ConvertTemp)
    SKEMPI_df['Temperature'] = pd.to_numeric(SKEMPI_df['Temperature'], errors='coerce')

    # Drop missing values
    #SKEMPI_df.dropna(subset=['Temperature'], inplace=True)
    SKEMPI_df.dropna(subset=['Affinity_wt_parsed'], inplace=True)
    SKEMPI_df.dropna(subset=['Affinity_mut_parsed'], inplace=True)

    # Calculate free energies
    SKEMPI_df['dgWT'] = -R*SKEMPI_df['Temperature']*np.log(SKEMPI_df['Affinity_wt_parsed'])
    SKEMPI_df['dgMut'] = -R*SKEMPI_df['Temperature']*np.log(SKEMPI_df['Affinity_mut_parsed'])
    SKEMPI_df['ddG'] = SKEMPI_df['dgWT']-SKEMPI_df['dgMut']

    # Create a key for unique mutations based on PDB and 
    SKEMPI_df['MutKey'] = SKEMPI_df['#Pdb']+'_'+SKEMPI_df['Mutation(s)_PDB']
    # Replace multiple measurements of the same mutation with the group mean
    # May consider grouping by experimental method as well
    SKEMPI_df['ddgMedian'] = SKEMPI_df.groupby('MutKey')['ddG'].transform('median')        
    SKEMPI_df = SKEMPI_df.drop_duplicates(subset=['MutKey', 'Temperature'], keep='first', inplace=False)

    # Flag multiple mutations in the same protein
    SKEMPI_df['NumMutations'] = SKEMPI_df['Mutation(s)_PDB'].str.count(',')+1 

    # Extract Chains and remove cross chain mutations. Chain is the second position in the mutation code
    SKEMPI_df['Prot1Chain'] = SKEMPI_df['#Pdb'].str.split('_').str[1]
    SKEMPI_df['Prot2Chain'] = SKEMPI_df['#Pdb'].str.split('_').str[2]
    SKEMPI_df['MutSplit'] = SKEMPI_df['Mutation(s)_PDB'].str.split(',')

    def ChainCheck(df):
        if df['NumMutations'] == 1:
            CrossChain = False
            return CrossChain
        else:
            Chain = df['MutSplit'][0][1]
            if Chain in df['Prot1Chain']:
                ChainSet = df['Prot1Chain']
            elif Chain in df['Prot2Chain']:
                ChainSet = df['Prot2Chain']
            for i in range(len(df['MutSplit'])):
                Chain = df['MutSplit'][i][1]
                if Chain in ChainSet:
                    CrossChain = False
                else:
                    CrossChain = True
                    break
        return CrossChain

    SKEMPI_df['CrossChain'] = SKEMPI_df.apply(ChainCheck, axis=1)
    SKEMPI_SingleSided = SKEMPI_df[SKEMPI_df.CrossChain == False]

    NumProteins = SKEMPI_SingleSided['#Pdb'].nunique()
    NumMutations = SKEMPI_SingleSided['#Pdb'].count()
    print("There are %s unique single sided mutations in %s proteins" % (NumMutations, NumProteins))             
    return SKEMPI_SingleSided

In [15]:
og_output = SKEMPItoPandas('skempi_v2.0.csv')

There are 5454 unique single sided mutations in 343 proteins


#Changed

##v1.0


In [0]:
def ChainCheck(df):
        if df['NumMutations'] == 1:
            CrossChain = False
            return CrossChain
        else:
            Chain = df['MutSplit'][0][1]
            if Chain in df['Prot1Chain']:
                ChainSet = df['Prot1Chain']
            elif Chain in df['Prot2Chain']:
                ChainSet = df['Prot2Chain']
            for i in range(len(df['MutSplit'])):
                Chain = df['MutSplit'][i][1]
                if Chain in ChainSet:
                    CrossChain = False
                else:
                    CrossChain = True
                    break
        return CrossChain

def gibbsEq(Kd, tmp):
  R = 1.9872036e-3  # Ideal Gas Constant in kcal
  ΔG = -R * tmp * np.log(Kd) #log is ln in np
  return ΔG

def SKEMPItoPandas1(SKEMPI_loc):
    '''
    Purpose:
        1. Loads SKEMPI CSV file.
        2. Calculates ddG
        3. For multiple measurements, keeps the median value
        4. Eliminates entries with mutations on both sides of the interface
    Input:
        SKEMPI_loc : Location of SKEMPI CSV file
    Output:
        SKEMPI_df : Pandas dataframe    
    '''
    
    SKEMPI_df = pd.read_csv(SKEMPI_loc, sep=';')

    # Convert non numeric temperature comments to numeric values. 
    # Default is 298K 
    SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].str.extract(r'(\d+)')
    SKEMPI_df['Temperature'] = pd.to_numeric(SKEMPI_df['Temperature'], 
                                             errors='coerce')
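    # Assign the result back; inplace fillna on a column selection is unreliable in newer pandas
    SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].fillna(298)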
    
    # Drop missing values
    #SKEMPI_df.dropna(subset=['Temperature'], inplace=True)
    SKEMPI_df.dropna(subset=['Affinity_wt_parsed'], inplace=True)
    SKEMPI_df.dropna(subset=['Affinity_mut_parsed'], inplace=True)

    # Calculate free energies
    SKEMPI_df['dgWT'] = gibbsEq(SKEMPI_df['Affinity_wt_parsed'], 
                                SKEMPI_df['Temperature'])
    SKEMPI_df['dgMut'] = gibbsEq(SKEMPI_df['Affinity_mut_parsed'], 
                                 SKEMPI_df['Temperature'])
    SKEMPI_df['ddG'] = SKEMPI_df['dgWT']-SKEMPI_df['dgMut']

    # Create a key for unique mutations based on PDB ID and mutation
    SKEMPI_df['MutKey'] = SKEMPI_df['#Pdb']+'_'+SKEMPI_df['Mutation(s)_PDB']
    # Replace multiple measurements of the same mutation with the group median
    # May consider grouping by experimental method as well
    SKEMPI_df['ddgMedian'] = SKEMPI_df.groupby('MutKey')['ddG'].transform('median')        
    SKEMPI_df = SKEMPI_df.drop_duplicates(subset=['MutKey', 'Temperature'], 
                                          keep='first', inplace=False)

    # Flag multiple mutations in the same protein
    SKEMPI_df['MutSplit'] = SKEMPI_df['Mutation(s)_PDB'].str.split(',')
    SKEMPI_df['NumMutations'] = SKEMPI_df['MutSplit'].apply(len)
    
    # Extract Chains and remove cross chain mutations. 
    # Chain is the second position in the mutation code
    SKEMPI_df['Prot1Chain'] = SKEMPI_df['#Pdb'].str.split('_').str[1]
    SKEMPI_df['Prot2Chain'] = SKEMPI_df['#Pdb'].str.split('_').str[2]
    
    SKEMPI_df['CrossChain'] = SKEMPI_df.apply(ChainCheck, axis=1)
    SKEMPI_SingleSided = SKEMPI_df[SKEMPI_df.CrossChain == False]

    NumProteins = SKEMPI_SingleSided['#Pdb'].nunique()
    NumMutations = SKEMPI_SingleSided['#Pdb'].count()
    print("There are %s unique single sided mutations in %s proteins" % (NumMutations, NumProteins))             
    return SKEMPI_SingleSided

In [9]:
new_output = SKEMPItoPandas1('skempi_v2.0.csv')

There are 5454 unique single sided mutations in 343 proteins


In [0]:
new_output

##v1.1 as class?

###Base Case

https://www.kaggle.com/vinceniko/custom-pandas-subclass

In [0]:
import pandas as pd


class Pandas_Subclass(pd.DataFrame):
    """
    A way to create a Pandas subclass which initializes from another Pandas object without passing it into the class constructor.
    Allows complete overwriting of parent class constructor.
    Allows custom methods to be added onto Pandas objects (which can be created within the constructer itself).
    Ie. pass in a file_path to the class constructor which then calls pd.read_csv within __init__ which then assigns the returned DataFrame to self.
    Params:
        file_path (str): file_path passed into pd.read_csv().
    """

    def __init__(self, file_path):
        super().__init__(pd.read_csv(file_path))  # initialize subclass from DataFrame instance
        # self.__dict__.update(pd.read_csv(file_path).__dict__)  # the unpythonic way to do it

    def custom_method(self):
        print(self)  # returns .csv as Dataframe
        print(type(self))  # returns <class '__main__.Pandas_Subclass'>


if __name__ == '__main__':
    df = Pandas_Subclass('../input/winemag-data_first150k.csv')
    df.custom_method()

###Initial Attempt

In [210]:
def ChainCheck(df):
    '''
    No idea what is happening here, too many if's to think about atm...
    '''
    if df['NumMutations'] == 1:
      crossChain = False
      return crossChain
    else:
      Chain = df['MutSplit'][0][1]
      if Chain in df['Prot1Chain']:
        ChainSet = df['Prot1Chain']
      elif Chain in df['Prot2Chain']:
        ChainSet = df['Prot2Chain']
      for i in range(len(df['MutSplit'])):
        Chain = df['MutSplit'][i][1]
        if Chain in ChainSet:
          crossChain = False
        else:
          crossChain = True
          break
    return crossChain

class MutantDataSet(pd.DataFrame):
  '''
  Subclassed pandas DataFrame
  
  Not sure what to think yet....
  '''
  #WHY cant I get this to work?
  def __init__(self, data, sep=',', index=None, columns=None, dtype=None, 
               copy=True,):
    '''Initialize subclass from DataFrame instance.'''
    #from csv
    if type(data)==str:
      data=pd.read_csv(data, sep=sep)
    super(MutantDataSet, self).__init__(data=data,
                                        index=index,
                                        columns=columns,
                                        dtype=dtype,
                                        copy=copy)

#  def __init__(self, data, index=None, columns=None, dtype=str, 
#               copy=True, sep=';'):
#        super(MutantDataSet, self).__init__(data=pd.read_csv(data, sep=sep),
#                                            index=index,
#                                            columns=columns,
#                                            dtype=dtype,
#                                            copy=copy)

  def Mutations(self, row):
    '''Returns dictionary of mutation identifiers.'''
    keys = ['initAA', 'chain', 'loc', 'mutAA']  # code key
    mut_codes = self.loc[row]['Mutation(s)_cleaned'].split(',')
    unzip_code = zip(*[re.findall(r'(\d+|.)', mut) for mut in mut_codes])  # raw string avoids the invalid-escape warning
    mut_dct = dict(zip(keys, unzip_code))
    return mut_dct
  
  def to_numeric(self, keys):
    '''
    converts column of single or list of keys to numeric values
    '''
    self[keys] = self[keys].apply(pd.to_numeric, errors='coerce')
    return self[keys]
  
  def to_numeric2(self, keys):
    '''
    converts column of single or list of keys to numeric values
    not as good
    '''
    keys = [keys] if type(keys)==str else keys
    for k in keys:
      self[k] = pd.to_numeric(self[k], errors='coerce')
    return self
  
  def gibbsEq(self, Kd_key, tmp_key='Temperature'): 
    R = 1.9872036e-3  # Ideal Gas Constant in kcal
    ΔG = -R * self[tmp_key] * np.log(self[Kd_key]) #log is ln in np
    return ΔG
  
  def ddG(self, wild, mutant, tmp_key='Temperature'):
    self['dgWT'] = self.gibbsEq(wild, tmp_key)
    self['dgMut'] = self.gibbsEq(mutant, tmp_key)
    self['ddG'] = self['dgWT']-self['dgMut']
    return self
  
  def grouped_avg(self, group_keys, avg_key):
    '''
    DANGEROUS! not sure if median value will be returned to correct indices
    '''
    averaged = self.groupby(group_keys)[avg_key].transform('median')
    return averaged
    
  def find_cross_chains(self):
    self['Prot1Chain'] = self['#Pdb'].str.split('_').str[1]
    self['Prot2Chain'] = self['#Pdb'].str.split('_').str[2]
    crossChain = self.apply(ChainCheck, axis=1)
    return crossChain
    
  @property
  def _constructor(self):
    return MutantDataSet # Class Name


'''
1) clean each dataset to create consistent MutantDataSet's
  1a) store individuals in ~/data/intermediate 
2) combine into uniform MutantDataSet 
  2a) store in ~/data/final  
'''

#1 – clean skempi
def clean_Skempi(path):
  #initialize class
  skempi = MutantDataSet(path, sep=';')  # not working atm...
  #skempi_df = pd.read_csv(path, sep=';')
  #skempi = MutantDataSet(skempi_df)
  
  # Convert non-numeric temperature comments to numeric values. Default is 298K 
  skempi['Temperature'] = skempi['Temperature'].str.extract(r'(\d+)')
  skempi['Temperature'] = skempi.to_numeric('Temperature')
  skempi['Temperature'] = skempi['Temperature'].fillna(298) #6665-6668 blank
  
  # Calculate free energies
  dropna_lst = ['Affinity_wt_parsed','Affinity_mut_parsed']
  skempi.dropna(subset=dropna_lst, inplace=True)
  skempi = skempi.ddG('Affinity_wt_parsed', 'Affinity_mut_parsed')
  
  #Average and duplicate ddG/tmp values
  group_keys = ['#Pdb', 'Mutation(s)_PDB']
  skempi['ddG'] = skempi.grouped_avg(group_keys, 'ddG')
  skempi = skempi.drop_duplicates(subset=[*group_keys,'Temperature'], 
                                  keep='first', inplace=False)
  
  # Flag multiple mutations in the same protein
  skempi['MutSplit'] = skempi['Mutation(s)_PDB'].str.split(',')
  skempi['NumMutations'] = skempi['MutSplit'].apply(len)
  
  # Extract Chains and remove cross chain mutations. 
  skempi['CrossChain'] = skempi.find_cross_chains()
  SKEMPI_SingleSided = skempi[skempi.CrossChain == False]
  return SKEMPI_SingleSided

#import os
#path=os.path.abspath('skempi_v2.0.csv')
skempi_final = clean_Skempi(path)

#skempi_final = clean_Skempi('skempi_v2.0.csv')
NumProteins = skempi_final['#Pdb'].nunique()
NumMutations = skempi_final['#Pdb'].count()
print("There are %s unique single sided mutations in %s proteins" % 
      (NumMutations, NumProteins))  

#1a – store skempi in ~/data/intermediate


#1 – clean Other 
  #other = MutantDataSet('other.csv')
  #1a – store Other in ~/data/intermediate


#2 – combine 
  #2a – store in ~/data/final  


There are 5454 unique single sided mutations in 343 proteins


####tests

In [170]:
#skempi = MutantDataSet('skempi_v2.0.csv', sep=';')
#skempi.to_numeric()

skempi_df = pd.read_csv('skempi_v2.0.csv', sep=';')
skempi = MutantDataSet(skempi_df)

'''To numeric test'''
test1=skempi
skempi=skempi.to_numeric("Temperature")
test1=skempi.to_numeric2("Temperature")
print(skempi.equals(test1))

'''Drop multiple at the same time test.'''
test2=skempi
skempi.dropna(subset=['Affinity_wt_parsed'], inplace=True)
skempi.dropna(subset=['Affinity_mut_parsed'], inplace=True)

test2.dropna(subset=['Affinity_wt_parsed','Affinity_mut_parsed'], inplace=True)
print(skempi.equals(test2))

'''Calculate free energies'''
skempi['dgWT'] = gibbsEq(skempi, 'Affinity_wt_parsed', 'Temperature')
skempi['dgMut'] = skempi.gibbsEq('Affinity_mut_parsed', 'Temperature')
SKEMPI_df['dgWT'] = gibbsEq(SKEMPI_df['Affinity_wt_parsed'], SKEMPI_df['Temperature'])

AttributeError: ignored
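
One likely cause of the AttributeError above, assuming the `MutantDataSet` definition from this section was the one in scope: `to_numeric` returns `self[keys]`, which for a single column name is a plain `pd.Series`. After `skempi=skempi.to_numeric("Temperature")` the name `skempi` no longer refers to the subclass, so `to_numeric2` is not found. Separately, `test1=skempi` binds a second name to the same object rather than copying it, so an `equals` check between the two cannot fail. A minimal sketch with a hypothetical `Demo` class and made-up data:

```python
import pandas as pd

class Demo(pd.DataFrame):
    """Hypothetical stand-in for MutantDataSet, reduced to one method."""

    @property
    def _constructor(self):
        return Demo

    def to_numeric(self, keys):
        self[keys] = self[keys].apply(pd.to_numeric, errors='coerce')
        return self[keys]  # a single key yields a Series, not a Demo

frame = Demo({'Temperature': ['298', '310', 'n/a']})
alias = frame                 # same object under a second name
independent = frame.copy()    # separate object, still a Demo

column = frame.to_numeric('Temperature')
print(type(column).__name__)                  # the returned column is a Series
print(hasattr(column, 'to_numeric'))          # the subclass method is gone
print(alias is frame, independent is frame)   # alias vs copy
```

Returning `self` from such methods (as `to_numeric2` does) or not rebinding the name would keep the subclass methods available.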

###Final

For reasons not yet understood,

```skempi['Temperature'].fillna(value=298, inplace=True)```

introduces new changes outside of the NaN rows (the ddgMedian of rows 6472 and 6493).

It may be reasonable to add **Temperature** to the **dropna_lst**.

In [100]:
def ChainCheck(df):
    '''
    No idea what is happening here, too many if's to think about atm...
    Should be a method, but that is not working... I'll figure it out later.
    '''
    if df['NumMutations'] == 1:
      crossChain = False
      return crossChain
    else:
      Chain = df['MutSplit'][0][1]
      if Chain in df['Prot1Chain']:
        ChainSet = df['Prot1Chain']
      elif Chain in df['Prot2Chain']:
        ChainSet = df['Prot2Chain']
      for i in range(len(df['MutSplit'])):
        Chain = df['MutSplit'][i][1]
        if Chain in ChainSet:
          crossChain = False
        else:
          crossChain = True
          break
    return crossChain

class MutantDataSet(pd.DataFrame):
  '''
  Subclassed pandas DataFrame
  
  Not sure what to think yet....
  '''
  def __init__(self, data, sep=',', index=None, columns=None, dtype=None, 
               copy=True,):
    '''Initialize subclass from DataFrame instance or csv path.'''
    if type(data)==str:
      data=pd.read_csv(data, sep=sep)
    super(MutantDataSet, self).__init__(data=data,
                                        index=index,
                                        columns=columns,
                                        dtype=dtype,
                                        copy=copy)

  def Mutations(self, row):
    '''Returns dictionary of mutation identifiers.'''
    keys = ['initAA', 'chain', 'loc', 'mutAA']  # code key
    mut_codes = self.loc[row]['Mutation(s)_cleaned'].split(',')
    unzip_code = zip(*[re.findall(r'(\d+|.)', mut) for mut in mut_codes])  # raw string avoids the invalid-escape warning
    mut_dct = dict(zip(keys, unzip_code))
    return mut_dct
  
  def to_numeric(self, keys):
    '''
    converts column of single or list of keys to numeric values
    '''
    self[keys] = self[keys].apply(pd.to_numeric, errors='coerce')
    return self[keys]
  
  def gibbsEq(self, Kd_key, tmp_key='Temperature'): 
    R = 1.9872036e-3  # Ideal Gas Constant in kcal
    ΔG = -R * self[tmp_key] * np.log(self[Kd_key]) #log is ln in np
    return ΔG
  
  def solve_ddG(self, wild, mutant, tmp_key='Temperature'):
    self['dgWT'] = self.gibbsEq(wild, tmp_key)
    self['dgMut'] = self.gibbsEq(mutant, tmp_key)
    self['ddG'] = self['dgWT']-self['dgMut']
    return self
  
  def grouped_avg(self, group_keys, avg_key): 
    '''
    rename to grouped_med...
    '''
    averaged = self.groupby(group_keys)[avg_key].transform('median')
    return averaged  # returns series
    
  def find_cross_chains(self):
    self['Prot1Chain'] = self['#Pdb'].str.split('_').str[1]
    self['Prot2Chain'] = self['#Pdb'].str.split('_').str[2]
    crossChain = self.apply(ChainCheck, axis=1)
    return crossChain
    
  @property
  def _constructor(self):
    return MutantDataSet # Class Name


'''
1) clean each dataset to create consistent MutantDataSet's
  1a) store individuals in ~/data/intermediate 
2) combine into uniform MutantDataSet 
  2a) store in ~/data/final  
'''

#1 – clean skempi
def clean_Skempi(path):
  # Initialize class
  skempi = MutantDataSet(path, sep=';')

  # Convert non-numeric temperature comments to numeric values. Default is 298K 
  skempi['Temperature'] = skempi['Temperature'].str.extract(r'(\d+)')
  skempi['Temperature'] = skempi.to_numeric('Temperature')
  # Assigned back rather than filled inplace on a column selection
  skempi['Temperature'] = skempi['Temperature'].fillna(value=298) #6665-6668 blank  ### TOGGLE ME ###
  
  # Calculate free energies
  dropna_lst = ['Affinity_wt_parsed','Affinity_mut_parsed'] #, 'Temperature']
  skempi.dropna(subset=dropna_lst, inplace=True)
  skempi = skempi.solve_ddG('Affinity_wt_parsed', 'Affinity_mut_parsed')

  # Take the median of repeated measurements, then drop duplicate ddG/tmp rows
  group_keys = ['#Pdb', 'Mutation(s)_PDB']
  skempi['ddgMedian'] = skempi.groupby(group_keys)['ddG'].transform('median')
  #skempi['ddgMedian'] = skempi.grouped_avg(group_keys, 'ddG')
  skempi = skempi.drop_duplicates(subset=[*group_keys,'Temperature'], 
                                  keep='first', inplace=False)

  # Flag multiple mutations in the same protein
  skempi['MutSplit'] = skempi['Mutation(s)_PDB'].str.split(',')
  skempi['NumMutations'] = skempi['MutSplit'].apply(len)

  # Extract Chains and remove cross chain mutations. 
  skempi['CrossChain'] = skempi.find_cross_chains()
  SKEMPI_SingleSided = skempi[skempi.CrossChain == False]
  return SKEMPI_SingleSided

skempi_final = clean_Skempi('skempi_v2.0.csv')
NumProteins = skempi_final['#Pdb'].nunique()
NumMutations = skempi_final['#Pdb'].count()
print("There are %s unique single sided mutations in %s proteins" % 
      (NumMutations, NumProteins))  

#1a – store skempi in ~/data/intermediate
##skempi_final.to_csv('~/data/intermediate')

#1 – clean Other 
  #other = MutantDataSet('other.csv')
  #1a – store Other in ~/data/intermediate


#2 – combine 
  #2a – store in ~/data/final  


There are 5454 unique single sided mutations in 343 proteins
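
The `ChainCheck` helper above seems to flag a row as cross-chain when its mutated chains do not all belong to the same binding partner. A set-based sketch of that reading follows. `is_cross_chain` is a hypothetical name, the mutation codes are made up, and the sketch assumes the two partners' chain IDs do not overlap. Unlike the original, it returns True instead of hitting an unbound `ChainSet` when the first mutated chain matches neither partner.

```python
def is_cross_chain(mutations, prot1_chains, prot2_chains):
    """Return True when the mutated chains span both binding partners.

    mutations: list of codes such as 'DA38A' (chain ID is the second character).
    prot1_chains, prot2_chains: chain strings such as 'AB' and 'CD'.
    """
    if len(mutations) == 1:
        # Matches the original: a single mutation is never cross-chain.
        return False
    mutated = {code[1] for code in mutations}
    one_sided = mutated <= set(prot1_chains) or mutated <= set(prot2_chains)
    return not one_sided

print(is_cross_chain(['DA38A'], 'A', 'B'))               # single mutation
print(is_cross_chain(['RA10A', 'YB20F'], 'AB', 'CD'))    # both on partner 1
print(is_cross_chain(['RA10A', 'YC20F'], 'AB', 'CD'))    # one on each partner
```

It could be applied row-wise with something like `df.apply(lambda r: is_cross_chain(r['MutSplit'], r['Prot1Chain'], r['Prot2Chain']), axis=1)`, though I have not checked it against the full dataset.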


In [0]:
#skempi_final[['#Pdb',	'Mutation(s)_PDB',	'Mutation(s)_cleaned']]
#skempi_final.get(['#Pdb',	'Mutation(s)_PDB',	'Mutation(s)_cleaned'])
#skempi_final

#skempi_final['ddG']
skempi_final.ddG

##Alternative

An alternative is to keep the mutation class separate from the initial dataframe, then transfer only the required columns.

I changed my mind. It's better to just index a list of columns from the output.

#Final Comparison

Only difference was the recently discovered bug. All checks out!

(their index is different due to deletions from the original dataframe)

In [101]:
keylst = ['#Pdb',
 'Mutation(s)_PDB',
 'Mutation(s)_cleaned',
 'iMutation_Location(s)',
 'Affinity_mut_parsed',
 'Affinity_wt_parsed',
 'Reference',
 'Protein 1',
 'Protein 2',
 'Temperature', 
 'dgWT',
 'dgMut',
 'ddG',
 'ddgMedian',
 'MutSplit',
 'NumMutations',
 'Prot1Chain',
 'Prot2Chain',
 'CrossChain']
  

path = 'skempi_v2.0.csv'
print('Print Test')
#OG
OG = SKEMPItoPandas(path).get(keylst)

#v1.0
v1 = SKEMPItoPandas1(path).get(keylst)

#v1.1Final
v2 = clean_Skempi(path).get(keylst)
NumProteins = v2['#Pdb'].nunique()
NumMutations = v2['#Pdb'].count()
print("There are %s unique single sided mutations in %s proteins\n" % 
      (NumMutations, NumProteins)) 

print('Equivalency Test')
print('\tToggle df[tmp].fillna(298) in v1.1 to switch equivalence with OG & v1')
print('OG==v1?\n\t',OG.equals(v1), '– failure due to nan temp bug')
print('OG==v1.1?\n\t',OG.equals(v2), '– no mutKey, no nan tmp')
print('v1==v1.1?\n\t',v1.equals(v2), '– may fail as mutkey DNE in v1.1\n')

print('Test index equivalence')
print(OG.index.equals(v1.index))
print(OG.index.equals(v2.index))
print(v1.index.equals(v2.index))
print()

print('Find Differences')
for i in OG.index.values:
  a0, a1, a2 = OG['ddgMedian'][i], v1['ddgMedian'][i], v2['ddgMedian'][i]
  if a0!=a1 or a0!=a2: # or a1!=a2:
    print(i, a0, a1, a2)
    print('\t', OG['#Pdb'][i], v1['#Pdb'][i], v2['#Pdb'][i])
    print('\t', OG.Temperature[i], v1.Temperature[i], v2.Temperature[i])
    #x=OG
    #print('\t', f"-R*{x['Temperature'][i]}*np.log({x['Affinity_wt_parsed'][i]})")
    #print('\t', f"-R*{x['Temperature'][i]}*np.log({x['Affinity_mut_parsed'][i]})")

Print Test
There are 5454 unique single sided mutations in 343 proteins
There are 5454 unique single sided mutations in 343 proteins
There are 5454 unique single sided mutations in 343 proteins

Equivalency Test
	Toggle df[tmp].fillna(298) in v1.1 to switch equivalence with OG & v1
OG==v1?
	 False – failure due to nan temp bug
OG==v1.1?
	 False – no mutKey, no nan tmp
v1==v1.1?
	 True – may fail as mutkey DNE in v1.1

Test index equivalence
True
True
True

Find Differences
6472 3.8958503047001924 3.3114853574091185 3.3114853574091185
	 1LFD_A_B 1LFD_A_B 1LFD_A_B
	 310.0 310.0 310.0
6493 3.2996884744981614 2.9170490380241167 2.9170490380241167
	 4G0N_A_B 4G0N_A_B 4G0N_A_B
	 308.0 308.0 308.0
6665 3.2996884744981614 2.9170490380241167 2.9170490380241167
	 4G0N_A_B 4G0N_A_B 4G0N_A_B
	 nan 298.0 298.0
6666 nan 2.619152014020517 2.619152014020517
	 1C1Y_A_B 1C1Y_A_B 1C1Y_A_B
	 nan 298.0 298.0
6667 3.8958503047001924 3.3114853574091185 3.3114853574091185
	 1LFD_A_B 1LFD_A_B 1LFD_A_B
	 nan 29

In [102]:
R = 1.9872036e-3
a = -R*298.0*np.log(4.4e-09)
b =	-R*298.0*np.log(3e-11)
a-b

-2.9539233176897675
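
The manual check above follows the notebook's definitions, dG = -R*T*ln(Kd) and ddG = dgWT - dgMut, which collapse to ddG = R*T*ln(Kd_mut / Kd_wt) when both measurements share one temperature. A small sketch confirming that the two forms agree for the same numbers; treating 4.4e-09 as the wild-type Kd and 3e-11 as the mutant Kd is my assumption, and `gibbs` and `ddg_direct` are hypothetical helper names.

```python
import math

R = 1.9872036e-3  # gas constant in kcal/(mol*K), same value as in the notebook

def gibbs(kd, temperature):
    """dG as defined in the notebook: -R * T * ln(Kd)."""
    return -R * temperature * math.log(kd)

def ddg_direct(kd_wt, kd_mut, temperature=298.0):
    """dgWT - dgMut collapsed into a single logarithm."""
    return R * temperature * math.log(kd_mut / kd_wt)

two_step = gibbs(4.4e-09, 298.0) - gibbs(3e-11, 298.0)
one_step = ddg_direct(4.4e-09, 3e-11)

print(two_step, one_step)
```

The single-log form also makes it clear that temperature enters ddG linearly, so the same pair of affinities gives a different ddG at 298 K than at 308 K or 310 K.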

In [104]:
#per col,item comparison
init = OG
compare = v2

keys =['#Pdb',
 'Mutation(s)_PDB',
 'Mutation(s)_cleaned',
 'iMutation_Location(s)',
 'Affinity_mut_parsed',
 'Affinity_wt_parsed',
 'Reference',
 'Protein 1',
 'Protein 2',
 'Temperature', 
 'dgWT',
 'dgMut',
 'ddG',
 'ddgMedian',
 'MutSplit',
 'NumMutations',
 'Prot1Chain',
 'Prot2Chain',
 'CrossChain']
  
for col in keys: #list(init) #nan has weird bool behavior, ignoring bad columns
  try:
    print(col)
    for i, val1, val2 in zip(init.index, init[col], compare[col]):
      if val1 != val2:
        print(i, val1, val2, init['#Pdb'][i], compare['#Pdb'][i])
  except:
    pass

#Pdb
Mutation(s)_PDB
Mutation(s)_cleaned
iMutation_Location(s)
Affinity_mut_parsed
Affinity_wt_parsed
Reference
Protein 1
Protein 2
Temperature
6665 nan 298.0 4G0N_A_B 4G0N_A_B
6666 nan 298.0 1C1Y_A_B 1C1Y_A_B
6667 nan 298.0 1LFD_A_B 1LFD_A_B
6668 nan 298.0 1LFD_A_B 1LFD_A_B
dgWT
6665 nan 10.560402211067585 4G0N_A_B 4G0N_A_B
6666 nan 8.073392834256607 1C1Y_A_B 1C1Y_A_B
6667 nan 8.181361230354135 1LFD_A_B 1LFD_A_B
6668 nan 8.181361230354135 1LFD_A_B 1LFD_A_B
dgMut
6665 nan 8.025992609517512 4G0N_A_B 4G0N_A_B
6666 nan 5.45424082023609 1C1Y_A_B 1C1Y_A_B
6667 nan 5.45424082023609 1LFD_A_B 1LFD_A_B
6668 nan 8.39257937867634 1LFD_A_B 1LFD_A_B
ddG
6665 nan 2.534409601550072 4G0N_A_B 4G0N_A_B
6666 nan 2.619152014020517 1C1Y_A_B 1C1Y_A_B
6667 nan 2.7271204101180446 1LFD_A_B 1LFD_A_B
6668 nan -0.21121814832220487 1LFD_A_B 1LFD_A_B
ddgMedian
6472 3.8958503047001924 3.3114853574091185 1LFD_A_B 1LFD_A_B
6493 3.2996884744981614 2.9170490380241167 4G0N_A_B 4G0N_A_B
6665 3.2996884744981614 2.917049038
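
The ddgMedian differences above look consistent with the filled rows simply joining their mutation's group. Assuming each affected group holds exactly two measurements (the earlier row and the newly filled one), the new median is the mean of the two ddG values. A quick check using numbers copied from the outputs in this notebook:

```python
from statistics import median

# 4G0N_A_B: OG ddgMedian of row 6493 (308 K) and the new ddG of row 6665 (filled to 298 K).
median_4g0n = median([3.2996884744981614, 2.534409601550072])

# 1LFD_A_B: OG ddgMedian of row 6472 (310 K) and the new ddG of row 6667 (filled to 298 K).
median_1lfd = median([3.8958503047001924, 2.7271204101180446])

print(median_4g0n)  # compare with 2.9170490380241167 reported after fillna
print(median_1lfd)  # compare with 3.3114853574091185 reported after fillna
```

If that assumption holds, the "unexpected" rows changed only because the original run silently left the NaN-temperature measurements out of the median.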

#Work

##Temp formatting

In [35]:
#initialize dataframes
SKEMPI_df = pd.read_csv('skempi_v2.0.csv', sep=';')
test = SKEMPI_df.copy()

print('following entry has nan tmp')
print(np.array(SKEMPI_df.iloc[[6665]]), '\n')


'''Original method'''
# Convert non numeric temperature comments to numeric values. Default is 298K 
ConvertTemp = lambda x: int(re.search(r'\d+', x)[0] or 298)
BadTemps = SKEMPI_df.Temperature.str.isnumeric() == False


print("nan val tests 'False'? map not applied, thus unaltered")
print(BadTemps.iloc[[6665]], '\n')

SKEMPI_df['Temperature'].loc[BadTemps] = SKEMPI_df['Temperature'].loc[BadTemps].map(ConvertTemp)
#SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].apply(ConvertTemp)
SKEMPI_df['Temperature'] = pd.to_numeric(SKEMPI_df['Temperature'], errors='coerce')

'''
New Method:
  -likely a tad slower as regex is applied to all rows rather than via binary mapping
  -no error 
  -handles the nan issue
'''
test['Temperature'] = test['Temperature'].str.extract(r'(\d+)')
test['Temperature'] = pd.to_numeric(test['Temperature'], errors='coerce')
test['Temperature'] = test['Temperature'].fillna(298)


SKEMPI_df.equals(og)

following entry has nan tmp
[['4G0N_A_B' 'DA38A' 'DA38A' 'COR' nan
  '4G0N_A_B,3KUD_A_B,1LFD_A_B,1GUA_A_B,1C1Y_A_B,1K8R_A_B,1HE8_A_B,1E96_A_B'
  '1.3E-06' 1.3e-06 '1.8E-08' 1.8e-08 '8636102' 'H-Ras1' 'Raf-RBD' nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan 'IAFL' 2]] 

nan val tests 'False'? map not applied, thus unaltered
6665    False
Name: Temperature, dtype: bool 



NameError: ignored

In [266]:
'''both versions are equal but include nan'''
#print(SKEMPI_df['Temperature'].isnull().values.any())

#print(SKEMPI_df[SKEMPI_df['Temperature'].isnull()]) #['Temperature'])
#print(np.array(SKEMPI_df.iloc[[6665]]))

for init, new in zip(SKEMPI_df['Temperature'], test['Temperature']):
  if init!=new:
    print(init, type(init),':' ,new, type(new))

nan <class 'float'> : 298.0 <class 'float'>
nan <class 'float'> : 298.0 <class 'float'>
nan <class 'float'> : 298.0 <class 'float'>
nan <class 'float'> : 298.0 <class 'float'>


In [276]:
'''Strange, I cannot reproduce the issue'''
df = pd.DataFrame('', index=[0,1,2,3], columns=['A']) #str(np.nan)
print('init empty data\n', df)
baddies=df["A"].str.isnumeric() == False
print('\nfind non-numeric', baddies, sep='\n')
print('convert temps', df['A'].loc[baddies].map(ConvertTemp))
print(pd.to_numeric(df["A"], errors='coerce'))

init empty data
   A
0  
1  
2  
3  

find non-numeric
0    True
1    True
2    True
3    True
Name: A, dtype: bool


TypeError: ignored
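
A likely reason the toy frame above does not reproduce the issue: `pd.read_csv` parses blank fields as NaN by default, not as empty strings. For an object column, `.str.isnumeric()` yields NaN on a NaN entry, and `NaN == False` evaluates to False, so those rows are never selected by `BadTemps` and stay NaN. The TypeError comes from a different path: for `''` the regex finds nothing, `re.search` returns None, and None cannot be subscripted. A sketch with made-up values:

```python
import re
import numpy as np
import pandas as pd

# Object column mimicking what read_csv produces: numeric text, a comment, and a blank (NaN).
temps = pd.Series(['298', '298(assumed)', np.nan], dtype=object)

# Original approach: the NaN row is not flagged, so ConvertTemp is never applied to it.
bad_temps = temps.str.isnumeric() == False
print(bad_temps.tolist())

# Empty string: no digits to match, so there is nothing to subscript.
no_match = re.search(r'\d+', '')
print(no_match)

# Extract-based approach: NaN passes through extract and to_numeric, then fillna patches it.
cleaned = pd.to_numeric(temps.str.extract(r'(\d+)', expand=False),
                        errors='coerce').fillna(298)
print(cleaned.tolist())
```

Building the toy frame with `np.nan` instead of `''` should therefore reproduce the original behavior.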

In [278]:
test['Temperature'][6663:6670]

6663    273.0
6664    273.0
6665    298.0
6666    298.0
6667    298.0
6668    298.0
6669    298.0
Name: Temperature, dtype: float64

##other

In [309]:
SKEMPI_df['NumMutations'] = SKEMPI_df['Mutation(s)_PDB'].str.count(',')+1 

largest = [0,0]
for i,str_lst in enumerate(SKEMPI_df['Mutation(s)_PDB']):
  lst=str_lst.split(',')
  split_len = len(lst)
  comma_len = SKEMPI_df['NumMutations'][i]
  if split_len != comma_len:
    print(i,lst, SKEMPI_df['NumMutations'][i], SKEMPI_df['Mutation(s)_PDB'][i], str_lst)
  longest = max(len(el) for el in lst)
  if longest>largest[0]:
    largest=[longest,i]
  
print(largest)
print(SKEMPI_df.loc[largest[1]])

[7, 329]
#Pdb                                    1DVF_AB_CD
Mutation(s)_PDB                            RD100bA
Mutation(s)_cleaned                         RD106A
iMutation_Location(s)                          COR
Hold_out_type                                AB/AG
Hold_out_proteins                            AB/AG
Affinity_mut (M)                          1.08E-05
Affinity_mut_parsed                       1.08e-05
Affinity_wt (M)                           1.08E-08
Affinity_wt_parsed                        1.08e-08
Reference                                  8993317
Protein 1                       IgG1-kappa D1.3 Fv
Protein 2                                  E5.2 Fv
Temperature                                    298
kon_mut (M^(-1)s^(-1))                         NaN
kon_mut_parsed                                 NaN
kon_wt (M^(-1)s^(-1))                          NaN
kon_wt_parsed                                  NaN
koff_mut (s^(-1))                              NaN
koff_mut_parsed       

##why didnt this work???

The OG version may have an error. When the grouping + averaging step is commented out in my version, the two test as equivalent.
```
#skempi['ddG'] = skempi.groupby(group_keys)['ddG'].transform('median')

skempi = skempi.drop_duplicates(subset=['#Pdb', 'Mutation(s)_PDB', 'Temperature'], keep='first', inplace=False)
```
EDIT: I didn't realize the OG version stored its result in ddgMedian. I was comparing against ddG from before the averaging, which is why commenting out my averaging step tested as equal.

**Testing ddgMedian against my overwritten ddG passes. Both are correct.**

In [364]:
def ChainCheck(df):
        if df['NumMutations'] == 1:
            CrossChain = False
            return CrossChain
        else:
            Chain = df['MutSplit'][0][1]
            if Chain in df['Prot1Chain']:
                ChainSet = df['Prot1Chain']
            elif Chain in df['Prot2Chain']:
                ChainSet = df['Prot2Chain']
            for i in range(len(df['MutSplit'])):
                Chain = df['MutSplit'][i][1]
                if Chain in ChainSet:
                    CrossChain = False
                else:
                    CrossChain = True
                    break
        return CrossChain

def gibbsEq(Kd, tmp):
  R = 1.9872036e-3  # Ideal gas constant in kcal/(mol*K)
  ΔG = -R * tmp * np.log(Kd)  # np.log is the natural log (ln)
  return ΔG

#v1.0 function exploded
'''
Purpose:
    1. Loads SKEMPI CSV file.
    2. Calculates ddG
    3. For multiple measurements, keeps the median value
    4. Eliminates entries with mutations on both sides of the interface
Input:
    SKEMPI_loc : Location of SKEMPI CSV file
Output:
    SKEMPI_df : Pandas dataframe    
'''

SKEMPI_df = pd.read_csv('skempi_v2.0.csv', sep=';')


# Convert non numeric temperature comments to numeric values. 
# Default is 298K 
SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].str.extract(r'(\d+)')
SKEMPI_df['Temperature'] = pd.to_numeric(SKEMPI_df['Temperature'], 
                                         errors='coerce')
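# Assign the result back; inplace fillna on a selected column may not update the frame
SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].fillna(298)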

# Drop missing values
SKEMPI_df.dropna(subset=['Affinity_wt_parsed'], inplace=True)
SKEMPI_df.dropna(subset=['Affinity_mut_parsed'], inplace=True)

# Calculate free energies
SKEMPI_df['dgWT'] = gibbsEq(SKEMPI_df['Affinity_wt_parsed'], 
                            SKEMPI_df['Temperature'])
SKEMPI_df['dgMut'] = gibbsEq(SKEMPI_df['Affinity_mut_parsed'], 
                             SKEMPI_df['Temperature'])
SKEMPI_df['ddG'] = SKEMPI_df['dgWT']-SKEMPI_df['dgMut']

################################################################################
# Initialize an independent duplicate (plain assignment would only alias it)
skempi = SKEMPI_df.copy()

# OG version
SKEMPI_df['MutKey'] = SKEMPI_df['#Pdb']+'_'+SKEMPI_df['Mutation(s)_PDB']
SKEMPI_df['ddgMedian'] = SKEMPI_df.groupby('MutKey')['ddG'].transform('median')        
SKEMPI_df = SKEMPI_df.drop_duplicates(subset=['MutKey', 'Temperature'], 
                                      keep='first', inplace=False)   

def grouped_avg(df, group_keys, avg_key):
  '''
  Caution: not yet sure the median values are returned to the correct indices.
  '''
  averaged = df.groupby(group_keys)[avg_key].transform('median')
  return averaged  # returns a Series with one value per input row

#my version
group_keys = ['#Pdb', 'Mutation(s)_PDB']
## explicit version
#skempi['ddG'] = skempi.groupby(group_keys)['ddG'].transform('median')
#skempi = skempi.drop_duplicates(subset=['#Pdb', 'Mutation(s)_PDB', 'Temperature'], keep='first', inplace=False)
## condensed version
skempi['ddG'] = grouped_avg(skempi, group_keys, 'ddG')
skempi = skempi.drop_duplicates(subset=[*group_keys, 'Temperature'], keep='first', inplace=False)

print('Find Differences')
for i, v in enumerate(zip(SKEMPI_df['ddgMedian'], skempi.ddG)): #'ddgMedian' #'ddG'
  if v[0] != v[1]:
    print(i, v[0], v[1])

Find Differences


In [365]:
def ChainCheck(df):
        if df['NumMutations'] == 1:
            CrossChain = False
            return CrossChain
        else:
            Chain = df['MutSplit'][0][1]
            if Chain in df['Prot1Chain']:
                ChainSet = df['Prot1Chain']
            elif Chain in df['Prot2Chain']:
                ChainSet = df['Prot2Chain']
            for i in range(len(df['MutSplit'])):
                Chain = df['MutSplit'][i][1]
                if Chain in ChainSet:
                    CrossChain = False
                else:
                    CrossChain = True
                    break
        return CrossChain

def gibbsEq(Kd, tmp):
  R = 1.9872036e-3  # Ideal gas constant in kcal/(mol*K)
  ΔG = -R * tmp * np.log(Kd)  # np.log is the natural log (ln)
  return ΔG

#v1.0 function exploded
'''
Purpose:
    1. Loads SKEMPI CSV file.
    2. Calculates ddG
    3. For multiple measurements, keeps the median value
    4. Eliminates entries with mutations on both sides of the interface
Input:
    SKEMPI_loc : Location of SKEMPI CSV file
Output:
    SKEMPI_df : Pandas dataframe    
'''

SKEMPI_df = pd.read_csv('skempi_v2.0.csv', sep=';')


# Convert non numeric temperature comments to numeric values. 
# Default is 298K 
SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].str.extract(r'(\d+)')
SKEMPI_df['Temperature'] = pd.to_numeric(SKEMPI_df['Temperature'], 
                                         errors='coerce')
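# Assign the result back; inplace fillna on a selected column may not update the frame
SKEMPI_df['Temperature'] = SKEMPI_df['Temperature'].fillna(298)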

# Drop missing values
SKEMPI_df.dropna(subset=['Affinity_wt_parsed'], inplace=True)
SKEMPI_df.dropna(subset=['Affinity_mut_parsed'], inplace=True)

################################################################################
# Initialize an independent duplicate (plain assignment would only alias it,
# which would make the comparison below compare a column with itself)
skempi = SKEMPI_df.copy()

# OG version
R = 1.9872036e-3  # Ideal gas constant in kcal/(mol*K)
SKEMPI_df['dgWT'] = -R*SKEMPI_df['Temperature']*np.log(SKEMPI_df['Affinity_wt_parsed'])
SKEMPI_df['dgMut'] = -R*SKEMPI_df['Temperature']*np.log(SKEMPI_df['Affinity_mut_parsed'])
SKEMPI_df['ddG'] = SKEMPI_df['dgWT']-SKEMPI_df['dgMut']

# My version, using gibbsEq on the duplicate's own columns
skempi['dgWT'] = gibbsEq(skempi['Affinity_wt_parsed'], 
                         skempi['Temperature'])
skempi['dgMut'] = gibbsEq(skempi['Affinity_mut_parsed'], 
                          skempi['Temperature'])
skempi['ddG'] = skempi['dgWT']-skempi['dgMut']

print('Find Differences')
for i, v in enumerate(zip(SKEMPI_df.ddG, skempi.ddG)):
  if v[0] != v[1]:
    print(i, v[0], v[1])

Find Differences
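
An empty difference list only means something if the two frames being compared are independent objects. In pandas, writing `b = a` binds a second name to the same DataFrame, whereas `a.copy()` creates separate data. A minimal sketch with made-up values (the names `original`, `alias`, and `clone` are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"ddG": [1.0, 2.0]})

alias = original          # second name for the very same object
clone = original.copy()   # independent copy of the data

# Overwrite the column through the first name only.
original["ddG"] = original["ddG"] * 10

print(alias is original)       # the alias is the same object
print(alias["ddG"].tolist())   # so it sees the overwritten column
print(clone["ddG"].tolist())   # the copy keeps the old values
```

When comparing an "OG" and a "new" code path, starting the second path from a `.copy()` keeps one path from silently overwriting the other's columns.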


In [530]:
skempi = MutantDataSet('skempi_v2.0.csv', sep=';')

######### new version #######
# Convert non-numeric temperature comments to numeric values. Default is 298K 
skempi['Temperature'] = skempi['Temperature'].str.extract(r'(\d+)')
skempi['Temperature'] = skempi.to_numeric('Temperature')
skempi['Temperature'].fillna(298, inplace=True) #6665-6668 blank

# Drop missing values
skempi.dropna(subset=['Affinity_wt_parsed'], inplace=True)
skempi.dropna(subset=['Affinity_mut_parsed'], inplace=True)

skempi['dgWT'] = -R*skempi['Temperature']*np.log(skempi['Affinity_wt_parsed'])
skempi['dgMut'] = -R*skempi['Temperature']*np.log(skempi['Affinity_mut_parsed'])
skempi['ddG'] = skempi['dgWT']-skempi['dgMut']


######### og version #######
og_skempi = MutantDataSet('skempi_v2.0.csv', sep=';')
# Convert non-numeric temperature comments to numeric values. Default is 298K 
og_skempi['Temperature'] = og_skempi['Temperature'].str.extract(r'(\d+)')
og_skempi['Temperature'] = og_skempi.to_numeric('Temperature')
#og_skempi['Temperature'].fillna(298, inplace=True) #6665-6668 blank

# Drop missing values
og_skempi.dropna(subset=['Affinity_wt_parsed'], inplace=True)
og_skempi.dropna(subset=['Affinity_mut_parsed'], inplace=True)

og_skempi['dgWT'] = -R*og_skempi['Temperature']*np.log(og_skempi['Affinity_wt_parsed'])
og_skempi['dgMut'] = -R*og_skempi['Temperature']*np.log(og_skempi['Affinity_mut_parsed'])
og_skempi['ddG'] = og_skempi['dgWT']-og_skempi['dgMut']
    
    
print('Find Differences')
for i, v in enumerate(zip(og_skempi.ddG, skempi.ddG)):
  if v[0] != v[1]:  # NaN != NaN is True, so rows with NaN are always reported
    print(i, v[0], v[1])

Find Differences
6392 nan 2.534409601550072
6393 nan 2.619152014020517
6394 nan 2.7271204101180446
6395 nan -0.21121814832220487
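
The four rows above are NaN in the OG version because a NaN temperature propagates through the Gibbs equation. One possible consequence, sketched below with made-up numbers, is that filling the temperature also shifts the group median for other rows that share the same mutation key, since pandas skips NaN when computing a median. The frame, the column names, and the helper `ddg_and_median` are hypothetical; only `R` and the formula are taken from the cells above.

```python
import numpy as np
import pandas as pd

R = 1.9872036e-3  # kcal/(mol*K), same constant as in the cells above

def gibbs_eq(kd, temp):
    # dG = -R * T * ln(Kd)
    return -R * temp * np.log(kd)

# Three made-up measurements of one mutation; the last has no temperature.
toy = pd.DataFrame(
    {
        "key": ["m1", "m1", "m1"],
        "kd_wt": [1e-9, 1e-9, 1e-9],
        "kd_mut": [1e-8, 1e-7, 1e-6],
        "temp": [298.0, 310.0, np.nan],
    }
)

def ddg_and_median(temp):
    # ddG per row, plus the per-key median broadcast back to every row.
    ddg = gibbs_eq(toy["kd_wt"], temp) - gibbs_eq(toy["kd_mut"], temp)
    return ddg, ddg.groupby(toy["key"]).transform("median")

ddg_nan, median_nan = ddg_and_median(toy["temp"])
ddg_filled, median_filled = ddg_and_median(toy["temp"].fillna(298))

print(pd.DataFrame({
    "ddG (NaN temp)": ddg_nan,
    "median (NaN temp)": median_nan,
    "ddG (filled)": ddg_filled,
    "median (filled)": median_filled,
}))
```

In this toy case the unfilled row contributes nothing to the median, so the median is taken over two values; after filling, it is taken over three, and the rows that already had a temperature report a different median as well.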
