### Creating the seed data set

Starting from complete trEMBL dataset <span style='background:#f7f3f7;padding:0.4em;border-radius:2px; border:solid bgrey 1px'>arwen:/mobi/group/NOX_CH/data/uniprot_trembl.fasta.gz</span> which is a symbolic link for `arwen:/mobi/group/databases/flat/uniprot_trembl_2019_02.fasta.gz`
 *  Split the dataset in small volumes
     * script: <span style="color:green">**split.py**</span>
     * Usage:
     Create and go to the `/mobi/group/NOX_GL/volumes` 
```console
    ROOT_DIR=/mobi/group/NOX_CH
    SCRIPT_DIR=/mobi/group/NOX_CH/nox-analysis/scripts
    $SCRIPT_DIR/split.py $ROOT_DIR/data/uniprot_trembl.fasta.gz
```

 * Run the HMMR and TMHMM annotations
    * script: <span style="color:green">**runHMMR_slurm.sh**</span>
    * Usage:  
  
```console
    mkdir $ROOT_DIR/seedSet
    mkdir $ROOT_DIR/seedSet/work
    $SCRIPT_DIR/runHMMR_slurm.sh $ROOT_DIR/volumes $ROOT_DIR/seedSet/work $ROOT_DIR/data/profiles
```

 * Use this notebook to parse the _work_ folder (see **Parsing all data files** section)

    * Filter-out non eukaryotic entries and dump the corresponding fasta sequence in folder <span style='background:#f7f3f7;padding:0.4em;border-radius:2px; border:solid bgrey 1px'>/mobi/group/NOX_CH/seedSet/NOX_noEukaryota</span> (create directory before)
         


 * Preparing folders/sbatch scripts for pairwise N&W across the set of __NOX_noEukaryota__ fasta sequences
    * script: <span style="color:green">**runEMBOSS_slurm.sh**</span>
    * Usage:
```console
mkdir $ROOT_DIR/seedSet/NOX_noEukaryota_needlePairwise_work
$SCRIPT_DIR/runEMBOSS_slurm.sh $ROOT_DIR/seedSet/NOX_noEukaryota NOX_noEukaryota $ROOT_DIR/seedSet/NOX_noEukaryota_needlePairwise_work
```

 * Concatenate all fasta sequences in a single file, clusters redundant sequences
```console
     cat $ROOT_DIR/seedSet/NOX_noEukaryota/*.fasta > $ROOT_DIR/seedSet/NOX_noEukaryota.mfasta
    mmseqs createdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota.mfasta /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb 
    mmseqs cluster /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100 /Volumes/arwen$ROOT_DIR/seedSet/tmp_NOX_noEukaryota_clust100 -c 1 
    mmseqs createtsv /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb  /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100  /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100.tsv --full-header
    mmseqs result2repseq /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100 /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100_seq 
    mmseqs result2flat /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_mmseqsdb /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100_seq  /Volumes/arwen$ROOT_DIR/seedSet/NOX_noEukaryota_clust100.fasta --use-fasta-header
 ```
 
* Enrich the datacontainer with redundant information, see **Add redundant informations** section   


* Perform full Pfam annotation

```console
     sbatch $SCRIPT_DIR/runHMMSCAN.sbatch /mobi/group/databases/hmmr/Pfam-A.hmm $ROOT_DIR/seedSet/NOX_noEukaryota.mfasta $ROOT_DIR/seedSet/NOX_noEukaryota_hmmscan.out
```


 * Enrich the datacontainer with these new annotation, see **Read in additional PFAM annotations // Erase previous** section


 * Use the [Taxonomy notebook](http://localhost:8888/notebooks/NOX/Taxonomy.ipynb) to output a hierarchal tree
     * link the output json file as $latest.json$


 * Start adhoc http server
   
   Go to `~/work/projects/NOX`
```console
node index.js
``` 

* Visualize w/ D3 at `localhost:9615`
 
### Creating the extended data set


* Perform a psiblast on fasta files present in <span style='background:#f7f3f7;padding:0.4em;border-radius:2px; border:solid bgrey 1px'>arwen:/mobi/group/NOX_GL/seedSet/NOX_noEukaryota</span>

    * Create the `extendedSet` folder

    * script: <span style="color:green">**runPsiBlast_slurm.sh**</span>
    * Usage:
```console
$SCRIPT_DIR/runPsiBlast_slurm.sh $ROOT_DIR/seedSet/NOX_noEukaryota $ROOT_DIR/extendedSet/psiblastWork
```

* Browse all the psiblast workfolder and eliminate strictly identical proteins
    * Go to `$ROOT/extendedSet`
    * script:<span style="color:green">**makePsiBlastNR.py**</span>
    * Usage:
```console
python $SCRIPT_DIR/makePsiBlastNR.py ./psiblastWork ./NOX_noEukaryota_PB_NR.fasta > makePsiBlastNR.log
```
* Perform a full PFAM annotation
```console
hmmscan NOX_noEukaryota_PB_NR.fasta /mobi/group/databases/hmmr/Pfam-A.hmm > NOX_noEukaryota_PB_NR_hmmscan.out
```

In [1]:
%matplotlib inline
import sys, os
sys.path.append("/Users/chilpert/Work/pyproteinsExt/src")
sys.path.append("/Users/chilpert/Work/pyproteins/src")
%load_ext autoreload
%autoreload 2

In [2]:
import gzip, io
import urllib.request

def mFastaParseZip(inputFile):
    data = None
    with io.TextIOWrapper(gzip.open(inputFile, 'r')) as f:
        data = mFastaParseStream(f)
    return data

def mFastaParseUrl(url):
    fp = urllib.request.urlopen(url)
    mybytes = fp.read()
    #mFastaParseStream(fp)
    mystr = mybytes.decode("utf8")
    fp.close()
    data = mFastaParseStream(mystr.split('\n'))
    
#    print(mystr)
    return data

def mFastaParseStream(stream):
    
    data = {}    
    headPtr = ''
    for line in stream:
        #print (line)
        if line == '':
            continue
        s = line.replace('\n','')
        if s.startswith('>'):
            headPtr = s.split()[0][1:]
            
            if headPtr in data:
                raise ValueError('Smtg wrong')
            data[headPtr] = {'header': s, 'sequence' : '' }
            
            continue
        data[headPtr]['sequence'] += s
    return data

#mFastaParseUrl('http://www.uniprot.org/uniprot/S4Z6V5.fasta')
#data = mFastaParse('/Volumes/arwen/home/ygestin/prositetask-backup/alignTrembl/bibl/Trembl_47/Trembl_47.fasta.gz')
#test=None
#with open('/Volumes/arwen/mobi/group/NOX_GL/work/uniprot_trembl_v11/hmmsearch.fasta', 'r') as f:
#    test = mFastaParseStream(f)

In [3]:
import re

def num(s):
    try:
        return int(s)
    except ValueError:
        return float(s)
    
    
reTMH = re.compile('^(\# ){0,1}([\S]+)[\s]+([\S].*)[\s]+([\d\.]+)$')
def loadTMHMM(lDir):
    
    fastaContainer = None
    with open( lDir+ '/hmmsearch.fasta', 'r') as f:
        fastaContainer = mFastaParseStream(f)
    
    file = lDir+ '/tmhmm.out'
    data = {}
    with open(file, 'r') as f:
        for l in f:
            m = reTMH.search(l)
            if m:
                _id = m.groups()[1] 
                if _id not in data:
                    if _id not in fastaContainer:
                        raise ValueError("Misisng fasta for tmhmm prediction")
                    data[_id] = {'hCount':0 ,
                                'helix':[], 'fasta' : fastaContainer[_id],
                                'mask': '-' * len(fastaContainer[_id]['sequence'])
                                }
                
                if not m.groups()[2].startswith('TMHMM2'):
                    data[_id][re.sub('[\s]*:[\s]*$', '',m.groups()[2])] = num(m.groups()[3])
                    continue
                
                
                m2 = m.groups()[2].split('\t')
                if not m2:
                    raise ValueError('could not parse helix line')
                helixCoor =  {'volume' : m2[1], 
                              'start'  : num(m2[2].replace(' ', '')),
                              'stop'   : num(m.groups()[3]) 
                            }
                data[_id]['helix'].append(helixCoor)
                
                
                data[_id]['helix'].append(helixCoor)
                #print (data[_id]['mask']) 
                l_1 = len(data[_id]['mask'])
                buf = list(data[_id]['mask'])
                symbol = None
                if helixCoor['volume'] == 'TMhelix':
                    data[_id]['hCount'] += 1
                    #symbol = 'H'
                    symbol = str(data[_id]['hCount']) if data[_id]['hCount'] < 10 else str(data[_id]['hCount'])[-1]
                elif helixCoor['volume'] == 'inside':
                    symbol = 'i'
                elif helixCoor['volume'] == 'outside':
                    symbol = 'e'
                else :
                    raise ValueError("unknown symbol " + helixCoor['volume'])

                i=helixCoor['start'] - 1
                j=helixCoor['stop']
                #print(i,j,len(buf))
                toAdd = symbol * (j - i)
                buf[i:j] =  list(toAdd)#helixCoor['stop'] - helixCoor['start'] + 1
                data[_id]['mask'] = ''.join(buf)
                if len(data[_id]['mask']) != l_1:
                    print("ERROR ", _id, l_1, len(data[_id]['mask']), '>>', i, j, '<<')
                    print (len(buf[i:j]), len(list(toAdd)), symbol, '-->', toAdd )
                #print(data[_id]['mask'])
    
    #        Hcluster(data)
    return data
#d = loadTMHMM('/Volumes/arwen/home/ygestin/prositetask-backup/alignTrembl/bibl/Trembl_47')
#d = loadTMHMM('/Volumes/arwen/mobi/group/NOX_GL/work_sample/uniprot_trembl_v11')
#d

In [4]:
def HIS_clust(data, min=2, max=7):
    for _id in data:
        data[_id]['Htest'] = {'status' : False, 'data' : [] }

        #Discard unwanted numbe of helices
        if data[_id]['hCount'] < min or data[_id]['hCount'] > max:
            #print('Wrong helices number ', _id, data[_id]['hCount'])
            continue
        
        H_status = []
        iMax = len(data[_id]['mask'])
        # internal error check
        if len(data[_id]['mask']) != len(data[_id]['fasta']['sequence']) :
            print( len(data[_id]['mask']), len(data[_id]['fasta']['sequence']) )
            print(_id, data[_id])
            raise ValueError("")
        # Select only residues that are Histidine within TMH
        for i in range(0, iMax):
            if data[_id]['mask'][i] == "i" or  data[_id]['mask'][i] == "e":
                continue
            if not data[_id]['fasta']['sequence'][i] == "H":
                continue
            H_status.append( [i, data[_id]['mask'][i], False] )
        # Pairwise comparaison between Histidine of the same helix, marking pairs separated by 12 to 14 residues
        for i in range (0, len(H_status) - 1):
            for j in range (i + 1, len(H_status)):
                if H_status[i][1] != H_status[j][1]:
                    continue
                d = H_status[i][0] - H_status[j][0]
                if d >= 12 or d <= 14:
                    H_status[i][2] = True
                    H_status[j][2] = True
        
        #print(H_status)
        # Only keep marked histidine
        H_status = [ x for x in H_status if x[2] ]
        # Create a dicitinary where keys are Helices numbers
        H_groups = {}
        for x in H_status:
            if not x[2]:
                continue
            if x[1] not in H_groups:
                H_groups[x[1]]=[]
            H_groups[x[1]].append(x)
        
        # The test is passed if at least two distinct helices feature at least one correctly spaced histidine pair
        # ie : if the helice dictionary has more than 1 entrie
        #print(H_status)
        #print("-->", H_groups)
        HisTestBool = True if len(H_groups) > 1 else False
        
        data[_id]['Htest']['status'] = HisTestBool
        data[_id]['Htest']['data'] = H_groups
    return data

#m = HIS_clust(d)
#print(len([ m[x] for x in m if m[x]['Htest']['status'] ]), len(m))

In [5]:
import pickle, time
import time

def save(data, tag=None):
    saveDir="/Volumes/arwen/mobi/group/NOX_CH/pickle_saved"
    timestr = time.strftime("%Y%m%d-%H%M%S")
    fTag = "NOX_annotation_" + tag + "_" if tag else "NOX_annotation_"
    fSerialDump = fTag + timestr + ".pickle"
    with open(saveDir + '/' + fSerialDump, 'wb') as f:
        pickle.dump(data, f)
    print('data structure saved to', saveDir + '/' + fSerialDump)

def load(fileName):
    saveDir="/Volumes/arwen/mobi/group/NOX_CH/pickle_saved"
    d = pickle.load( open(saveDir + "/" + fileName, "rb" ) )
    print("restore a annotated container of ", len(d), "elements")
    return d

# Parsing all data files 

### Parsing HMMR data
NB: There are stdout of 3 consecutive hmmr calls

All in a single **data** container

In [6]:
import pyproteinsExt.hmmrContainerFactory as hm
import glob
dataDir=glob.glob('/Volumes/arwen/mobi/group/NOX_CH/seedSet/work/uniprot_trembl_v*')

data = hm.parse(inputFile=dataDir[0] + '/hmmsearch.out')
i=0

for iDir in dataDir[1:]:
    #print(iDir)
    data += hm.parse(inputFile=iDir + '/hmmsearch.out')
    i += 1
    #if i == 1:
     #   break   

   [No individual domains that satisfy reporting thresholds (although complete target did)]




## Loading TMHMM data

In [7]:
dataTMHMM = {}
for lDir in dataDir:
    d = loadTMHMM(lDir)
    if set( dataTMHMM.keys() ) & set( d.keys() ):
        print('doublons')
    dataTMHMM.update(d)

dataTMHMM = HIS_clust(dataTMHMM)

##### Transform a PFAM domain indexed data structure in a protein indexed data structure
Then filter out the protein that feature the 3 domains


In [8]:
T = data.T()
D = {}
fad=0
nad=0
ferric=0
for protein in T:
    if len(T[protein]) == 3:
           D[protein] = T[protein]
    for dom in T[protein]: 
        if dom == "PF08022_full":
            fad+=1
        elif dom == "PF01794_full": 
            ferric+=1
        elif dom == "PF08030_full": 
            nad+=1
        else: 
            print("OOOO")
        #if dom == "PF08022_full":
            
print('Number of proteins entries featuring FAD',fad)
print('Number of proteins entries featuring NAD',nad)
print('Number of proteins entries featuring Ferric reductase',ferric)
print('Size of their intersection',len(D))

Number of proteins entries featuring FAD 77203
Number of proteins entries featuring NAD 121386
Number of proteins entries featuring Ferric reductase 59209
Size of their intersection 18020


## Merge TMHMM & HMMR data

  * Proteins with the 3 domain types
  * Their TMHMM status


In [9]:
merged = {}
for _id in D:
    if _id not in dataTMHMM:
        print('Missing protein ID' + _id)
    if not dataTMHMM[_id]['Htest']['status']:
        continue
    merged[_id] = {
        'hmmr' : D[_id],
        'tmhmm' : dataTMHMM[_id]
    }
    
print('Number of protein entries featuring FAD,NAD and Ferric transferase domains', len(D))
print('Number of protein featuring 2 to 7 TMH and 2 bi-histine', len(dataTMHMM))
print('Size of their intersection', len(merged))

Number of protein entries featuring FAD,NAD and Ferric transferase domains 18020
Number of protein featuring 2 to 7 TMH and 2 bi-histine 178540
Size of their intersection 5972


#### Inspect NCBI Taxonomy

#### Extract TaxonID

In [10]:
def getTaxID(datum):
    reTaxID = re.compile('OX=([\d]+)')
    m = reTaxID.search(datum['tmhmm']['fasta']['header'])
    if not m:
        raise ValueError('Cant parse taxid from', datum['tmhmm']['fasta']['header'])
    datum['taxid'] = m.groups()[0]
    
for _id in merged:
    getTaxID(merged[_id])

###### Flag Non Eukaryota phylum members

In [11]:
from ete3 import NCBITaxa
ncbi=NCBITaxa() 

In [12]:
unclassified=0
archaea=0
bacteria=0
eukaryota=0
not_found=0
for _id in merged: 
    bool=True
    taxid=merged[_id]['taxid']
    #print(taxid)
    try : 
        lineage=ncbi.get_lineage(taxid)
        lineage_rank=ncbi.get_rank(lineage)
        superkingdom=[taxid for taxid in lineage_rank if lineage_rank[taxid]=='superkingdom']
        if superkingdom : 
            name=ncbi.get_taxid_translator(superkingdom)[superkingdom[0]]
            if name == "Eukaryota":
                bool=False
                eukaryota+=1
            elif name == "Bacteria":
                bacteria+=1
            elif name == "Archaea": 
                archaea+=1
            else: 
                print("OOO")
        else: 
            unclassified+=1
        merged[_id]['isNoEukaryota']=bool
            
    except : 
        not_found+=1

print("Eukaryota",eukaryota)
print("Bacteria",bacteria)
print("Archaea",archaea)
print("Unclassified",unclassified)
print("Not found", not_found)

Eukaryota 5116
Bacteria 848
Archaea 3
Unclassified 2
Not found 3


In [None]:
#### Cull for prokaryotic proteins (original 996)

#### Use proteins as seeds for blast ()

#### --> Tree reconstruction

#### Additional PFAM annotation

#### Sequence clustering

#### Profile génétique



##### Just keep non Eukaryota sequences in datacontainer

In [23]:
data=load("NOX_annotation_20190506-144039.pickle")

restore a annotated container of  5972 elements


In [24]:
new_data={}
for k in data:
    if not 'isNoEukaryota' in data[k]:
        continue
    if data[k]['isNoEukaryota']:
        new_data[k] = data[k]

###### Save non Eukaryota sequences in given directory

In [8]:
import re
saveDir="/Volumes/arwen/mobi/group/NOX_CH/seedSet/NOX_noEukaryota"
def mFastaSplitDump(data, saveDir, fileTag='default' ,distinct=True):
    c = 1
    f = None
    if not distinct:
        f = open(saveDir + '/'+ fileTag + '_all.fasta', 'w')
        
    for _id in data:
        if distinct:
            f = open(saveDir + '/'+ fileTag + '_' + str(c) + '.fasta', 'w')
        c += 1
        f.write(data[_id]['tmhmm']['fasta']['header'])
        f.write(re.sub("(.{81})", "\\1\n", data[_id]['tmhmm']['fasta']['sequence'], 0, re.DOTALL))
        if distinct:
            f.close()
    if not distinct:    
        f.close()

In [None]:
mFastaSplitDump(new_data, saveDir, 'NOX_noEukaryota')

##### Save

In [25]:
save(new_data,"noEukaryota")

data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_noEukaryota_20190509-100821.pickle


### Full Pfam annotation

In [7]:
data=load("NOX_annotation_noEukaryota_20190509-100821.pickle")

restore a annotated container of  853 elements


In [30]:
import pyproteinsExt.hmmrContainerFactory as hm
fileName="/Volumes/arwen/mobi/group/NOX_CH/seedSet/NOX_noEukaryota_hmmscan.out"
#fileName="/tmp/hmmscan.out"
hscan = hm.parse(inputFile=fileName)
print( len(hscan.T()), 'proteins to reannotate' )
for e in hscan.T():
    data[e]['hmmr'] = hscan.T()[e]

853 proteins to reannotate


In [31]:
save(data,"fullPfam")

data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_fullPfam_20190509-101105.pickle


### Discard obsolete Uniprot entries, add Uniprot informations

In [6]:
data=load("NOX_annotation_fullPfam_20190509-101105.pickle")

restore a annotated container of  853 elements


In [None]:
import pyproteinsExt.uniprot as uniprot
uColl = uniprot.getUniprotCollection()
uColl.setCache(location="/Users/chilpert/cache/uniprot")
uniprot.getPfamCollection().setCache(location="/Users/chilpert/cache/pfam")
new_data={}
c=0
not_found=[]
for p in data :
    p_id=p.split("|")[1]
    try : 
        obj=uColl.get(p_id)
        new_data[p]=data[p]
        new_data[p]['RefSeq']={}
        new_data[p]['RefSeq']['genome']=obj.Genome.RefSeqRef
        new_data[p]['RefSeq']['protein']=obj.Genome.RefSeqProteinRef
        new_data[p]['EMBL']={}
        new_data[p]['EMBL']['genome']=obj.Genome.EMBLRef 
        new_data[p]['EMBL']['protein']=obj.Genome.EMBLProteinRef
        #new_data[p]['Uniprot_domains']=obj.domains
        
    except : 
        c+=1
        not_found.append(p_id)
        continue

Acknowledged 0 entries (/Users/chilpert)
Changing cache location to /Users/chilpert/cache/uniprot
Reindexing /Users/chilpert/cache/uniprot
Acknowledged 0 entries (/Users/chilpert/cache/uniprot)
Acknowledged 29 entries (/Users/chilpert)
Changing cache location to /Users/chilpert/cache/pfam
Reindexing /Users/chilpert/cache/pfam
Acknowledged 29 entries (/Users/chilpert/cache/pfam)
got to fetch A0A1M7F9I0
got to fetch A0A1M7F9I0
got to fetch D0X6B7
got to fetch D0X6B7
got to fetch A0A1H1LLY0
got to fetch A0A1H1LLY0
got to fetch Q2IMP5
got to fetch Q2IMP5
got to fetch K2MAC9
got to fetch K2MAC9
got to fetch Q87IX8
got to fetch Q87IX8
got to fetch A0A261FVY3
got to fetch A0A261FVY3
got to fetch A0A1C0SBR1
got to fetch A0A1C0SBR1
got to fetch A0A024YN82
got to fetch A0A024YN82
got to fetch A0A3N9TE25
got to fetch A0A3N9TE25
got to fetch A0A349S867
got to fetch A0A349S867
got to fetch R7ZEG3
got to fetch R7ZEG3
got to fetch A0A2X4NMK6
got to fetch A0A2X4NMK6
got to fetch A0A0Q9DA37
got to fetc

got to fetch A0A2K4JCS1
got to fetch A0A1G8ANP9
got to fetch A0A1G8ANP9
got to fetch A0A2N7H8B2
got to fetch A0A2N7H8B2
got to fetch A0A0X3UQG2
got to fetch A0A0X3UQG2
got to fetch A0A1V5HAL9
got to fetch A0A1V5HAL9
got to fetch A0A1N6G2S4
got to fetch A0A1N6G2S4
got to fetch C5R8S4
got to fetch C5R8S4
got to fetch A0A371Y7W6
got to fetch A0A371Y7W6
got to fetch A0A1Q8L5Q3
got to fetch A0A1Q8L5Q3
got to fetch A0A3A4ZND6
got to fetch A0A3A4ZND6
got to fetch A0A0B1XSI4
got to fetch A0A0B1XSI4
got to fetch A0A2K7SPG0
got to fetch A0A2K7SPG0
got to fetch A0A3A2I0H6
got to fetch A0A3A2I0H6
got to fetch A0A0W7Y8U8
got to fetch A0A0W7Y8U8
got to fetch A0A3E1DU62
got to fetch A0A3E1DU62
got to fetch A0A136KU56
got to fetch A0A136KU56
got to fetch A0A0P6XZE1
got to fetch A0A0P6XZE1
got to fetch A0A2I0FGC7
got to fetch A0A2I0FGC7
got to fetch A3TNH7
got to fetch A3TNH7
got to fetch A0A2G2GL45
got to fetch A0A2G2GL45
got to fetch A0A1G5FSB5
got to fetch A0A1G5FSB5
got to fetch A0A285KXT9
got to f

got to fetch A0A2R4JQX8
got to fetch A0A1H0N7N1
got to fetch A0A1H0N7N1
got to fetch B5FDV5
got to fetch B5FDV5
got to fetch A0A0G0MNT3
got to fetch A0A0G0MNT3
got to fetch A0A2T1K3T6
got to fetch A0A2T1K3T6
got to fetch H9ULV8
got to fetch H9ULV8
got to fetch E6U9C6
got to fetch E6U9C6
got to fetch V2RUN2
got to fetch V2RUN2
got to fetch A0A0M8W889
got to fetch A0A0M8W889
got to fetch A4BB68
got to fetch A4BB68
got to fetch A0A1M6BC20
got to fetch A0A1M6BC20
got to fetch A0A345HY22
got to fetch A0A345HY22
got to fetch A0A0L0L909
got to fetch A0A0L0L909
got to fetch A0A0L0KYQ1
got to fetch A0A0L0KYQ1
got to fetch A0A0L0L0F6
got to fetch A0A0L0L0F6
got to fetch A0A345HX20
got to fetch A0A345HX20
got to fetch A0A062VHV2
got to fetch A0A062VHV2
got to fetch A0A1G9VT42
got to fetch A0A1G9VT42
got to fetch A0A383TAJ2
got to fetch A0A383TAJ2
got to fetch A0A2N7MYI4
got to fetch A0A2N7MYI4
got to fetch A0A233HC80
got to fetch A0A233HC80
got to fetch A0A0Q7JWU0
got to fetch A0A0Q7JWU0
got to f

got to fetch A0A1V2DSN3
got to fetch A0A371QPF7
got to fetch A0A371QPF7
got to fetch A0A1A6LML8
got to fetch A0A1A6LML8
got to fetch A0A371QMM9
got to fetch A0A371QMM9
got to fetch A0A2N7KTV7
got to fetch A0A2N7KTV7
got to fetch A0A2N7CCU0
got to fetch A0A2N7CCU0
got to fetch A0A2N7M9G7
got to fetch A0A2N7M9G7
got to fetch A0A2T5F062
got to fetch A0A2T5F062
got to fetch A0A094QEM8
got to fetch A0A094QEM8
got to fetch K1XT93
got to fetch K1XT93
got to fetch A0A2A4HRK5
got to fetch A0A2A4HRK5
got to fetch U2RLK2
got to fetch U2RLK2
got to fetch A0A364VQZ3
got to fetch A0A364VQZ3
got to fetch A0A2E3Q5K5
got to fetch A0A2E3Q5K5
got to fetch A0A2N1WYC7
got to fetch A0A2N1WYC7
got to fetch A0A0N0RDH8
got to fetch A0A0N0RDH8
got to fetch A0A387H7X6
got to fetch A0A387H7X6
got to fetch A0A0K9UUZ2
got to fetch A0A0K9UUZ2
got to fetch A0A2E3EYL9
got to fetch A0A2E3EYL9
got to fetch A0A0W0NTK0
got to fetch A0A0W0NTK0
got to fetch A0A3G7A0T9
got to fetch A0A3G7A0T9
got to fetch A0A0G1J7E2
got to f

got to fetch A0A117RX80
got to fetch A0A0X1L1W6
got to fetch A0A0X1L1W6
got to fetch A0A178K197
got to fetch A0A178K197
got to fetch A0A1G9I1L3
got to fetch A0A1G9I1L3
got to fetch D8F6K2
got to fetch D8F6K2
got to fetch A0A1E5Q259
got to fetch A0A1E5Q259
got to fetch A0A327VK97
got to fetch A0A327VK97
got to fetch A0A380NJC4
got to fetch A0A380NJC4
got to fetch A0A0D0J2R3
got to fetch A0A0D0J2R3
got to fetch A0A2U4FYL2
got to fetch A0A2U4FYL2
got to fetch A0A1Q1PJS1
got to fetch A0A1Q1PJS1
got to fetch A0A3A1PWQ6
got to fetch A0A3A1PWQ6
got to fetch A0A1Q3PZJ9
got to fetch A0A1Q3PZJ9
got to fetch A0A3G2HQS9
got to fetch A0A3G2HQS9
got to fetch A0A2N2V7V6
got to fetch A0A2N2V7V6
got to fetch A0A0G1H906
got to fetch A0A0G1H906
got to fetch A0A328WJ90
got to fetch A0A328WJ90
got to fetch A0A2P2DTW3
got to fetch A0A2P2DTW3
got to fetch A0A2S1I7F1
got to fetch A0A2S1I7F1
got to fetch A0A1H4JZD7
got to fetch A0A1H4JZD7
got to fetch A0A1V0TLI7
got to fetch A0A1V0TLI7
got to fetch A0A1G2XZF2


In [73]:
print(len(new_data),"with Uniprot entry.")  
save(new_data,"fullPfam_noObsolete")

785 with Uniprot entry.
data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_fullPfam_noObsolete_20190509-124441.pickle


### Delete domains with evalue > 1e-3

In [74]:
data=load("NOX_annotation_fullPfam_noObsolete_20190509-124441.pickle")

restore a annotated container of  785 elements


In [76]:
def filter_evalue(data,threshold): 
    new_data={}
    for p in data : 
        keep=False
        new_data[p]=data[p].copy()
        new_data[p]['hmmr']={}
        for d in data[p]['hmmr']:
            deleted_hits=0
            hits=data[p]['hmmr'][d][0].data
            for hit in hits : 
                evalue=hit.iEvalue
                if float(evalue) > threshold : 
                    deleted_hits+=1
            if deleted_hits!=len(hits):
                keep=True
                new_data[p]['hmmr'][d]=data[p]['hmmr'][d]
            if not keep : 
                print("OOOO")
                del new_data[p]  
    return new_data

In [77]:
filtered_data=filter_evalue(data,1e-3)

In [78]:
save(filtered_data,"fullPfam_filteredDomains")

data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_fullPfam_filteredDomains_20190509-124507.pickle


### Other


In [204]:
import re
reMotifNADPH = re.compile('G[ISVL]G[VIAF][TAS][PYTA]')
reMotifFAD = re.compile('H[PSA]F[TS][LIMV]')

NAD_miss = 0
FAD_miss = 0
Both_miss = 0
for p in merged_restore:
    seq = merged_restore[p]['tmhmm']['fasta']['sequence']
    m = reMotifNADPH.search(seq)
    n = reMotifFAD.search(seq)
    merged_restore[p]['NADPH_reg'] = True if m else False
    merged_restore[p]['FAD_reg']   = True if n else False

    if not m:
        NAD_miss += 1
        if not n:
            Both_miss += 1
    if not n:
        FAD_miss += 1

print('Total Number of filtered sequence', len(merged_restore))
print('Number of negative to:')
print('*The NAD pattern',str(NAD_miss), '\n*The FAD pattern', str(FAD_miss), '\n*Both patterns ', Both_miss)

Total Number of filtered sequence 386
Number of negative to:
*The NAD pattern 54 
*The FAD pattern 147 
*Both patterns  16


#### Delete domains with evalue > 1e-3

In [268]:
data3,c=filter_evalue(data,1e-3)
data1,c=filter_evalue(data,1e-1)

0.001
0.1


In [266]:
all_domains=set()
for p in data3 : 
    for d in data3[p]['hmmr']: 
        all_domains.add(d)

In [267]:
print(len(all_domains))

205


In [113]:
print(all_domains)

{'EF-hand_1', 'EF-hand_7', 'EF-hand_5', 'Fer2', 'EF-hand_8', 'DUF4405', 'FAD_binding_8', 'Ferric_reduct', 'NAD_binding_6', 'NAD_binding_1', 'FAD_binding_6', 'SdpI', 'EF-hand_6', 'DUF2339'}


In [269]:
save(data3,"fullPfam_filteredDomains1e-3")
save(data1,"fullPfam_filteredDomains1e-1")

data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_fullPfam_filteredDomains1e-3_20190502-184642.pickle
data structure saved to /Volumes/arwen/mobi/group/NOX_CH/pickle_saved/NOX_annotation_fullPfam_filteredDomains1e-1_20190502-184643.pickle
