TO DO
TO DO
TO DO
import pyproteins.sequence.msa as msaLib
oMsa = msaLib.Msa(fileName="/Users/guillaumelaunay/work/projects/MSA/clustalw.aln")
print(oMsa[0])
will display
{'header': 'sp|Q5SJH5|RIMM_THET8',
'sequence':'------MRLVEIGRFGAPYALKGGLRF--RGEP---VVLHLER----'}
Retrieve a sequence in the msa, by specifying a predicate function or a regular expression. The predicate function will be applied to each record in turn, it will be passed a dictionary with the index and record keys. Records matching the lookup will be returned in a list with sequence gap striped.
def f(d):
return d["index"] < 10
recordList = oMsa.recordLookup(predicate=f)
print(len(recordList))
print(recordList[0])
will display
10
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}
def g(d):
return re.search("THET", d["record"]['header'])
recordList = oMsa.recordLookup(predicate=g)
print(len(recordList))
print(recordList[0])
will display
3
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}
The following methods all return a new Alignment object.
Specify a treshold of gap frequencies to filter out columns, default value is 0.5
oMsa.gapPurge(gapRatio=0.5)
Use any sequence of the alignment to delete all columns where this sequence features a gap. Default master sequence is number 0.
oMsa.maskMaster(self, masterIndex=0)
Sequence within a MSA can be filtered according to their relationships with a master sequence. The predicate function will be applied to all sequence in turn. Predicate will be passed 3 arguments :
- a 3-tuple statistic wrapping (sequence identity, sequence similarity, sequence coverage) of the master with respect to current sequence.
- the master sequence object
- the current sequence object
Returned object is a MSA of at least one sequence (the master).
An optional named masterIndex can be pass to use an alternative sequence as reference.
# Defining predicate, here minimal coverave of 85%
def f(stat, iSeq, jSeq):
return stat[2] > 0.85
bMsa = oMsa.masterFilter(predicate=f)
print(bMsa.fastaDump())
will print,
>sp|Q5SJH5|RIMM_THET8
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD
>tr|H9ZRG5|H9ZRG5_THETH
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGGIHVDPIPGLFD
>tr|E8PJQ1|E8PJQ1_THESS
MGLWHNGLGMRLVEIGRFGAPYALRGGLKFRGEPVVAHLERVYVEGHGWRAVEDLYQVGD
DLVVHLAGVSSRELAEPLVGLRVYAEVEELPPLEEGRYYYFALIGLPVYVGGLKMGEVVD
ILDAGAQDVLVIRGVGERLRDQTERLVPLQAPYVRVEEEGIHVEPIPGLFD
>tr|B7A7I3|B7A7I3_THEAQ
-------MAGRLVEIGRFGAPYALAGGLKFRGEPVVAHLTRIYVEGHGWRAVEDLYQVGE
ELVVHLAGVSTRELAEALVGLRVYAEVADLPPLEEGQYYYFALIGLPVYVEGQKVGEVAD
ILDAGAQDVLVIRGVGERLRDRAERLVPLQAPYVRVEAEGIHVEPIPGLFD
>tr|K7QWL8|K7QWL8_THEOS
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGFRAVEDLYRVGE
VLILHLAGVSTRELAEALVGLRVYAEVEDLPPLEEGQYYYFALVGLPVYVGEEQVGEVAD
ILDAGAQDVLVIRGIGERLRDQRERLVPLQAPYVTVEEGRILVEPIPGLFD
TO DO
- oMsa.shape: [sequenceNumber, columnNumber]
import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="./1syq.pdb")
pdbObj.SEQRES["A"]
A
is chain name.
import pyproteins.sequence.peptide as pep
p1 = {'id' : "SEQRES",
'desc' : 'pdb file fasta translation',
'seq' : pdbObj.SEQRES["A"]
}
pepSeqRes = pep.Entry(p1)
pepCoor = pep.Entry(pdbObj.chain("A").peptideSeed())
import pyproteins.alignment.nw_custom as N
import pyproteins.alignment.scoringFunctions as scoringFunctions
blosum = scoringFunctions.Needle().fScore
nw = N.nw(gapOpen=-10, gapExtend=-0.5, matchScorer=blosum)
aliResObj = nw.align(pepPDB, pepUniProt)
print(aliResObj)
Example with a sequence peptide from to a PDB file. In this illustration, PDB file is 2vkn.
#parsing PDB
import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="path/2ns7.pdb")
pdbObj.SEQRES
It’s possible to create a tuple with AA name and his position ; with command :
import pyproteins.sequence.peptide as pep
AApdb = [(pep.threeToOne(aa.name), int(aa.num)) for aa in pdbObj.byres()]
threeToOne
is a translater of AA name code at 3 letters to AA name code at 2 letters.
You can access the content of any uniprot element. Corresponding XML file we ll be download locally if needed in a user defined cache directory.
import pyproteinsext.uniprot as uniprot
uniprot.proxySetting(https="https://yourproxy:port", http="http://yourproxy:port")
uColl = uniprot.getUniprotCollection()
uColl.setCache(location='/Users/guillaumelaunay/work/data/uniprot')
uniprot.getPfamCollection().setCache(location='/Users/guillaumelaunay/work/data/pfam')
obj=uColl.get("P98160")
print(obj.GO)
print("\n")
print(obj.DI)
print("\n")
print(obj.peptideSeed())
print("\n")
print(obj.fasta)
will print
[GO:0005605:C:basal lamina{ECO:0000501}, ...]
[DI-02288:Schwartz-Jampel syndrome (SJS1) {Rare autosomal recessive disorder
characterized by permanent myotonia (prolonged failure of muscle relaxation)
and skeletal dysplasia, resulting in reduced stature, kyphoscoliosis,
bowing of the diaphyses and irregular epiphyses.},
...]
{'id': 'P98160', 'desc': 'PGBM_HUMAN', 'seq': 'MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPEDIETVTA...}
>P98160 PGBM_HUMAN
MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPED...
You can give one or several file to parse method. For each protein, an entry hmmrObj
is created
import pyproteinsext.hmmrContainer as hm
hmmrContainer = hm.parse('hmmsearch_A.out', 'hmmsearch_B.out')
hmmrObj attributes and properties:
- prot : protein name
- domain : domain name
- hit : dictionnary that contains hmmr hit informations, like score, evalue, alignment positions...
- sequence : give protein sequence that corresponds to domain
- start : give domain start in protein
- end : give domain end in protein
for e in hmmrContainer:
print(e.prot)
print(e.domain)
print(e.hit)
print(e.sequence)
print(e.start)
print(e.end)
will print
tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH
PF08022_full
{'hmmID': 'PF08022_full', 'aliID': 'tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH', 'header': '1 score: 16.0 bits; conditional E-value: 0.00014', 'score': '16.0', 'bias': '0.0', 'cEvalue': '0.00014', 'iEvalue': '0.24', 'hmmFrom': '26', 'hmmTo': '102', 'aliFrom': '37', 'aliTo': '99', 'envFrom': '27', 'envTo': '102', 'acc': '0.89', 'hmmStringLetters': 'fkwkpGqhvylsvpsisklllesHPFtiasapekddelslvirarggwtkrlaelaekseaesksklkvlieGPYGa', 'matchString': ' +++pGq+++++vp + ++ P ++ ++ ++e +vi++ g ++++l e+ ++ GPYG+', 'aliStringLetters': 'HQANPGQFAMVWVPGVDEV-----PMSVLAIHG-KSEAGVVIKKGGPVSTALWEKKVGD--------IFFVRGPYGH', 'hmmSymbolStuff': {'CS': 'XXXXXXXXXXXXXXXXXXX--XXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX--TTSTTSH'}, 'aliSymbolStuff': {'PP': '5789*************88.....******999.******************9976555........69*******6'}}
HQANPGQFAMVWVPGVDEVPMSVLAIHGKSEAGVVIKKGGPVSTALWEKKVGDIFFVRGPYGH
37
99
...
Container that store TMHMM results. You can only give one tmhmm result file to parse (for now).
import pyproteinsext.tmhmmContainerFactory as tmhmm
tmhmmContainer = tmhmm.parse('tmhmm.out')
Container is a collection of TMHMM_Obj objects.
TMHMM_Obj attributes and properties:
- prot : protein name
- prot_length : protein length
- nb_helix : number of predicted helixes
- fragments : List of Fragment object. Each fragment have attributes cellular_location, start and end.
- topology_seq : protein sequence with 'o' for outside loop, 'i' for inside loop and helix number for helixes. WARNING : doesn't work if there are more than 9 helixes (use 2 characters instead of 1)
for e in tmhmmContainer:
print(e.prot)
print(e.prot_length)
print(e.nb_helix)
print(len(e.fragments),"fragments")
for f in e.fragments:
print(f.cellular_location, f.start, f.end)
will print
tr|A0A2E0DFV0|A0A2E0DFV0_9EURY
155
4
9 fragments
outside 1 5
TMhelix 6 25
inside 26 53
TMhelix 54 76
outside 77 90
TMhelix 91 109
inside 110 129
TMhelix 130 152
outside 153 155
...
Descriptions of the Psicquic service and related MITAB format can be found here
import pyproteinsext.psicquic as psq
psqObj = psq.PSICQUIC(offLine=True)
psqObj.read(mitabFile)
psqAllInR6 = psqObj.filter(predicate=f)
A mitabObject stores a set of psicquic.PSQDATA objects. Each psicquic.PSQDATA object handles one mitab record. Managment of the psicquic.PSQDATA set is done through the pyproteins.container.Core.dnTree container model.
In practice, a mitabObject can be used to
mitabTopologyObject.keys()
Obtain a one-level dictionary of all the partnairs of a query protein along with their list of psicquic.PSQDATA
mitabTopologyObject["P38801"]
mitabTopologyObject["P38801"]["P24783"]
[uniprotkb:P24783 uniprotkb:P38801 intact:EBI-5602|uniprotkb:Q05456|uniprotkb:D6W169 intact:EBI-1909|uniprotkb:D3DL32 psi-mi:dbp2_yeast(display_long)|uniprotkb:DBP2(gene name)|psi-mi:DBP2(display_short)|uniprotkb:YNL112W(locus name)|uniprotkb:N1945(orf name)|uniprotkb:DEAD box protein 2(gene name synonym)|uniprotkb:p68-like protein(gene name synonym) psi-mi:lrp1_yeast(display_long)|uniprotkb:LRP1(gene name)|psi-mi:LRP1(display_short)|uniprotkb:Like an rRNA processing protein 1(gene name synonym)|uniprotkb:rRNA processing protein 47(gene name synonym)|uniprotkb:RRP47(gene name synonym)|uniprotkb:YC1D(gene name synonym)|uniprotkb:Yeast C1D domain-containing protein(gene name synonym)|uniprotkb:YHR081W(locus name) psi-mi:"MI:0111"(dihydrofolate reductase reconstruction) Tarassov et al. (2008) pubmed:18467557|mint:MINT-6673767|imex:IM-14275 taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae) taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae) psi-mi:"MI:0915"(physical association) psi-mi:"MI:0471"(MINT) intact:EBI-6319091|imex:IM-14275-472 author score:99.00|intact-miscore:0.37]
The key order does not matter.
A library to build and query large database using the file system.
python uniprotFastaFS.py ~/tmp/vUniprot --cluster uniprot_sprot_2018_11.fasta.gz
Several node_*.gz
multifasta files were created.
Create a node folder foreach node file, uses the file system architecture to index their content. The --nodes
argument is a regexp to specify the node number(s) to index.
python uniprotFastaFS.py ~/tmp/vUniprot --nodes '*'
All indexation processes are performed in parrallel and the resulting
subfolder organisations are independant. By independant folder architecture, we mean, as illustrated in the exemple below, that node subfolders (eg:node_1
and node_2
) may present similar prefix have subfolders (eg:A,B
).
vUniprot
|____node_1
| |____A
| | |____index.txt
| | |____data.gz
| |____B
| | |____index.txt
| | |____data.gz
|____node_2
| |____A
| | |____index.txt
| | |____data.gz
| |____B
| | |____index.txt
| | |____data.gz
You can fecth a fasta entry the following way, optionally dumping it to file.
python uniprotFastaFS.py ~/tmp/vUniprot --get P98160