In [45]:
import pandas as pd
pd.options.display.max_colwidth = 50
import numpy as np
from pprint import pprint

# Guidelines for the Standardisation of InterPro Entry Names and Short Names:

## Long Names

### Must be no more than 100 characters

### Must be unique

### Must start with upper case (unless initial term represents an accepted and common abbreviation, e.g. tRNA, rRNA, cAMP)

* Only capitalise proper nouns within the name (e.g. Gram)

* Gene product abbreviations, the first and last letter should be capitalised: AbcA; this is irrespective of the genome nomenclature standards for individual organisms.

#### Can contain the following grammatical symbols:

* Slash - / represents 'or' or 'and', e.g. subunits B/E, alpha/beta, 1/2

* Prime - ' and brackets: (),[] can be used in enzyme or chemical terms

* Hyphen, can be used as a modifier, e.g. Abc1-like, archaeal-like

#### Must not contain the following grammatical symbols:

* underscore

* dash

* mathematical symbols +, -, *, \

* colon

* semi-colon

#### Must not contains the following words:

* plurals e.g. proteins, subunits

* with

* within

* on

* an

* to

* in

* involved

* protein - unless it is specifically part of the Name e.g. Uncharacterised conserved protein (UCP), Vacuolar protein sorting (Vps)

* CONJUNCTIONS: for, and, nor, but, or, so

#### Avoid referring to the following, unless this is accurate representation of the protein family name:

* organs

* tissue types

* cell types

#### Generally avoid the use of the following terms:

* precursor

* homolog

* paralog

* ortholog

* gene

#### Positional restraints; names should not begin with:

* Predicted

* Probable

* Putative

#### Modifiers

* type

* related,

* associated

#### These should ONLY be used with a hyphen, attached to the term they are intended to modify, e.g. bacterial-type, Abc1-related, Vps21-associated.

##### Avoid acronyms and abbreviations in names wherever possible

* Exceptions are the use of abbreviations used for protein COMPLEXs, for example:

* DASH complex, subunit Spc19

* MRN complex, subunit Mre11

### Name structure

#### Order the name with the most general classification first, going to the most specific, with taxonomy at the end (taxonomy to be added only if necessary to distinguish one entry from another):

* STEM (class/family/protein name), SUBDIVISION (type, subunit, example gene-product name), POSITION (C/N-terminal), TAXONOMY

* Use the terms site, domain, family and superfamily where these names accurately describe the biology (e.g. SH2 domain describes a known domain)

* Separate parts of names using commas

* Regarding, taxonomy, this should only be used if it is necessary to specify a particular lineage. For example:

* RNA polymerase alpha subunit, C-terminal, archaea

* Phosphoenolpyruvate carboxylase, archaeal-type

* Phosphoenolpyruvate carboxylase, bacterial/plant-type

#### Name conventions for uncharacterised families/domains:

* Protein/Domain of unknown function DUFnnnnn = Pfam (note n= numeric character)

* DUF numbering is provided by Pfam

* Uncharacterised conserved protein UCPnnnnnn = PIRSF

* UCPnnnnnn, numbering is taken from the method accession (PIRSF006287 = UCP006287). Either provided by PIR or added by InterPro where applicable.

* Conserved hypothetical protein CHPnnnnn = TIGRFAMs

* CHPnnnnn, numbering is taken from the method accession (TIGR01620 = CHP01620). Either provided by TIGR or added by InterPro where applicable.

* Uncharacterised protein family UPFnnnn = Swiss-Prot

* UPFnnnn, numbering is provide Swiss-Prot.

* also used by Pfam when they model Swiss-Prot families.

* Where a set of UPF Swiss-Prot entries are deemed to be representative of the InterPro entry the Swiss-Prot naming convention will be used in preference to the member database convention for families of unknown function.

* Uncharacterised protein family <text>, <modifier> = InterPro

* Uncharacterised_<text/modifier>

* For use where a member database family does not use its own format for an entry where the contents have no known function.

In [6]:
!wget ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/latest-release/cath-classification-data/cath-superfamily-list.txt

!tail -n +7 cath-superfamily-list.txt > parsed_list.txt

In [71]:
df = pd.read_csv("./parsed_list.txt", sep='\t',index_col=0)
df = df.dropna()
df['COMMENT'] = np.nan
class DataFrame_parser(object):
    def __init__(self, df):
        self.df = df
        
    ### AUTOMATED REPLACE
    
    def semicolon(self): #replace semicolons with commas
        ret = self.df[self.df['NAME'].str.contains(";")]['NAME'].str.replace(";", ',')
        comment = pd.Series(index=ret.index, name='COMMENT', data="S")
        return ret, comment
    
    def lowercase_start(self): #replace lowercase start with capital
        st_lower = df[df['NAME'].str[0].str.islower()]['NAME']
        st_lower = st_lower.mask(st_lower.str.contains(r'^[m|t|r|ss|ds][R|D]NA|^cAMP', regex=True)).dropna()
        ret = st_lower.str[0].str.upper() + st_lower.str[1:]
        comment = pd.Series(index=ret.index,name='COMMENT', data="L")
        return ret, comment
    
    def trailing_stop(self): #remove trailing dots
        ret = self.df[self.df["NAME"].str.endswith('.')]['NAME'].str[:-1]
        comment = pd.Series(index=ret.index, name='COMMENT', data="T")
        return ret, comment
    
    def other_stop(self): #replace other dots with commas
        s = self.df[self.df['NAME'].str.contains("\.")]['NAME']
        s = s.mask(s.str.contains(r'\d\.\d')).dropna()
        ret = s.str.replace(".", ',')
        comment = pd.Series(index=ret.index, name='COMMENT', data="C")
        return ret, comment
    
    def implement_replacements(self): #combine replacements with 
        ret_df = self.df[['NAME','COMMENT']]
        ret_df['NEW_NAME'] = ret_df['NAME']
        for r, c in [self.semicolon(), self.lowercase_start(), self.trailing_stop(), self.other_stop()]:
            ret_df['NEW_NAME'] = r.combine_first(ret_df['NEW_NAME'])
            ret_df['COMMENT'] = c.combine(ret_df['COMMENT'], lambda c, r:str(c)+str(r))
        ret_df['COMMENT'] = ret_df["COMMENT"].str.replace("nan", '')
        return ret_df.replace('', np.nan, regex=True)
    

    ### S - replace semicolon, L - lowercase start, T - trailing stop, C - other stop      
            
    ### FLAGGING FOR REPLACE
    
    def duplicates(self):
        ret = self.df[self.df.duplicated(subset="NAME", keep=False)].groupby(by='NAME')
        return ret
    
    def implement_duplicates(self):
        r = {}
        for name, group in D.duplicates():
            r[name] = ", ".join(group.index.values.tolist())
        s = pd.Series(data=r)
        return s
    
    def underscore(self):
        ret = self.df[self.df['NAME'].str.contains("_")]['NAME']
        comment = pd.Series(index=ret.index, data="U")
        return ret, comment
    
    def plus(self):
        ret = self.df[self.df['NAME'].str.contains("\+")]['NAME']
        comment = pd.Series(index=ret.index, data="P")
        return ret, comment
    
    def bad_words(self):
        bad_words = ["proteins", "subunits", "with", "within", "on", "an", "to", "in", "involved", "for", "and", "nor", "but", "or", "so"]
        bad_words = [' {0} '.format(elem) for elem in bad_words]
        ret = self.df[self.df['NAME'].str.contains("|".join(bad_words), regex=True)]
        comment = pd.Series(index=ret.index, data="F")
        return ret, comment
    
    def bad_start(self):
        ret = self.df[self.df['NAME'].str.contains("^[P|p]redicted|[P|p]robable|[P|]utative")]
        comment = pd.Series(index=ret.index, data="S")
        return ret, comment
    
    def pref_words(self):
        words = 'precursor', 'homolog', 'paralog', 'ortholog', 'gene'
        words = [' {0} '.format(elem) for elem in words]
        ret = self.df[self.df['NAME'].str.contains("|".join(words), regex=True)]
        comment = pd.Series(index=ret.index, data='W')
        return ret, comment
    
    def compile_flags(self):
        ret_df = self.df[['NAME','COMMENT']]
        for r, c in [self.underscore(), self.plus(), self.bad_words(), self.bad_start(), self.pref_words()]:
            ret_df['COMMENT'] = c.combine(ret_df['COMMENT'], lambda c, r:str(c)+str(r))
        ret_df['COMMENT'] = ret_df["COMMENT"].str.replace("nan", '')
        return ret_df.replace('', np.nan, regex=True)

        
D = DataFrame_parser(df)

In [73]:
D.compile_flags().dropna()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0_level_0,NAME,COMMENT
# CATH_ID,Unnamed: 1_level_1,Unnamed: 2_level_1
1.10.8.300,putative atpase (yp_676785.1),U
1.10.10.710,PSPTO_1197 like,U
1.10.40.50,Probable gtpase engc; domain 3,S
1.10.60.10,"Iron dependent repressor, metal binding and di...",F
1.10.110.10,Plant lipid-transfer and hydrophobic proteins,F
1.10.150.20,"5' to 3' exonuclease, C-terminal subdomain",F
1.10.150.170,"Putative methyltransferase TM0872, insert domain",S
1.10.150.240,Putative phosphatase; domain 2,S
1.10.1200.80,Putative flavin oxidoreducatase; domain 2,S
1.10.3060.10,Helical scaffold and wing domains of SecA,F


#### Must not contains the following words:

* plurals e.g. proteins, subunits

* with

* within

* on

* an

* to

* in

* involved

* protein - unless it is specifically part of the Name e.g. Uncharacterised conserved protein (UCP), Vacuolar protein sorting (Vps)

* CONJUNCTIONS: for, and, nor, but, or, so

In [77]:
D.implement_replacements().dropna().to_csv('./results/renamed_superfamilies.tsv', sep='\t')
D.implement_duplicates().to_csv('./results/duplicates.tsv', sep='\t')
D.compile_flags().dropna().to_csv('./results/flagged.tsv', sep='\t')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is tryin

In [76]:
!mkdir results
