pyproteinsext

Documentation

notebook test&examples

TO DO

Generic container modules

TO DO

Anotation modules

TO DO

Specific container modules

Multiple Sequence Alignment

Reading a file

import pyproteins.sequence.msa as msaLib
oMsa = msaLib.Msa(fileName="/Users/guillaumelaunay/work/projects/MSA/clustalw.aln")

Accessing sequences

print(oMsa[0])

will display

{'header': 'sp|Q5SJH5|RIMM_THET8',
'sequence':'------MRLVEIGRFGAPYALKGGLRF--RGEP---VVLHLER----'}

Accessing matching sequences

Retrieve a sequence in the msa, by specifying a predicate function or a regular expression. The predicate function will be applied to each record in turn, it will be passed a dictionary with the index and record keys. Records matching the lookup will be returned in a list with sequence gap striped.

indexed-based search

def f(d):
    return d["index"] < 10
recordList = oMsa.recordLookup(predicate=f)
print(len(recordList))
print(recordList[0])

will display

10
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}

header content based search

def g(d):
    return re.search("THET", d["record"]['header'])
recordList = oMsa.recordLookup(predicate=g)
print(len(recordList))
print(recordList[0])

will display

3
{'header': 'sp|Q5SJH5|RIMM_THET8', 'sequence': 'MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGEELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVDILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD'}

Transformations

The following methods all return a new Alignment object.

Column deletions

Purging gap

Specify a treshold of gap frequencies to filter out columns, default value is 0.5

oMsa.gapPurge(gapRatio=0.5)

sequence based masking

Use any sequence of the alignment to delete all columns where this sequence features a gap. Default master sequence is number 0.

oMsa.maskMaster(self, masterIndex=0)

sequence based filtering

Sequence within a MSA can be filtered according to their relationships with a master sequence. The predicate function will be applied to all sequence in turn. Predicate will be passed 3 arguments :

a 3-tuple statistic wrapping (sequence identity, sequence similarity, sequence coverage) of the master with respect to current sequence.
the master sequence object
the current sequence object

Returned object is a MSA of at least one sequence (the master).

An optional named masterIndex can be pass to use an alternative sequence as reference.

# Defining predicate, here minimal coverave of 85%
def f(stat, iSeq, jSeq):
    return stat[2] > 0.85
bMsa = oMsa.masterFilter(predicate=f)
print(bMsa.fastaDump())

will print,

>sp|Q5SJH5|RIMM_THET8
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGSIHVDPIPGLFD
>tr|H9ZRG5|H9ZRG5_THETH
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGWRAIEDLYRVGE
ELVVHLAGVTDRTLAEALVGLRVYAEVADLPPLEEGRYYYFALIGLPVYVEGRQVGEVVD
ILDAGAQDVLIIRGVGERLRDRAERLVPLQAPYVRVEEGGIHVDPIPGLFD
>tr|E8PJQ1|E8PJQ1_THESS
MGLWHNGLGMRLVEIGRFGAPYALRGGLKFRGEPVVAHLERVYVEGHGWRAVEDLYQVGD
DLVVHLAGVSSRELAEPLVGLRVYAEVEELPPLEEGRYYYFALIGLPVYVGGLKMGEVVD
ILDAGAQDVLVIRGVGERLRDQTERLVPLQAPYVRVEEEGIHVEPIPGLFD
>tr|B7A7I3|B7A7I3_THEAQ
-------MAGRLVEIGRFGAPYALAGGLKFRGEPVVAHLTRIYVEGHGWRAVEDLYQVGE
ELVVHLAGVSTRELAEALVGLRVYAEVADLPPLEEGQYYYFALIGLPVYVEGQKVGEVAD
ILDAGAQDVLVIRGVGERLRDRAERLVPLQAPYVRVEAEGIHVEPIPGLFD
>tr|K7QWL8|K7QWL8_THEOS
---------MRLVEIGRFGAPYALKGGLRFRGEPVVLHLERVYVEGHGFRAVEDLYRVGE
VLILHLAGVSTRELAEALVGLRVYAEVEDLPPLEEGQYYYFALVGLPVYVGEEQVGEVAD
ILDAGAQDVLVIRGIGERLRDQRERLVPLQAPYVTVEEGRILVEPIPGLFD

free slicing

TO DO

Meta-data and statistics

oMsa.shape: [sequenceNumber, columnNumber]

PDB container

Protein data container

Load a PDB file

import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="./1syq.pdb")

Display SEQRES

pdbObj.SEQRES["A"]

A is chain name.

Aligning SEQRES and vald ATOM RECORD

Create wrapper peptide object

import pyproteins.sequence.peptide as pep
p1 = {'id' : "SEQRES",
    'desc' : 'pdb file fasta translation',
    'seq' : pdbObj.SEQRES["A"]
}
pepSeqRes = pep.Entry(p1)
pepCoor = pep.Entry(pdbObj.chain("A").peptideSeed())

Align "peptide" sequences

import pyproteins.alignment.nw_custom as N
import pyproteins.alignment.scoringFunctions as scoringFunctions
blosum = scoringFunctions.Needle().fScore
nw = N.nw(gapOpen=-10, gapExtend=-0.5, matchScorer=blosum)
aliResObj = nw.align(pepPDB, pepUniProt)
print(aliResObj)

Example with a sequence peptide from to a PDB file. In this illustration, PDB file is 2vkn.

#parsing PDB
import pyproteinsext.structure.coordinates as PDB
parser = PDB.Parser()
pdbObj = parser.load(file="path/2ns7.pdb")
pdbObj.SEQRES

It’s possible to create a tuple with AA name and his position ; with command :

import pyproteins.sequence.peptide as pep
AApdb = [(pep.threeToOne(aa.name), int(aa.num)) for aa in pdbObj.byres()]

threeToOne is a translater of AA name code at 3 letters to AA name code at 2 letters.

Uniprot container

You can access the content of any uniprot element. Corresponding XML file we ll be download locally if needed in a user defined cache directory.

import pyproteinsext.uniprot as uniprot
uniprot.proxySetting(https="https://yourproxy:port", http="http://yourproxy:port")

uColl = uniprot.getUniprotCollection()
uColl.setCache(location='/Users/guillaumelaunay/work/data/uniprot')
uniprot.getPfamCollection().setCache(location='/Users/guillaumelaunay/work/data/pfam')

obj=uColl.get("P98160")


print(obj.GO)
print("\n")
print(obj.DI)
print("\n")
print(obj.peptideSeed())
print("\n")
print(obj.fasta)

will print

[GO:0005605:C:basal lamina{ECO:0000501}, ...]

[DI-02288:Schwartz-Jampel syndrome (SJS1) {Rare autosomal recessive disorder
characterized by permanent myotonia (prolonged failure of muscle relaxation)
and skeletal dysplasia, resulting in reduced stature, kyphoscoliosis,
bowing of the diaphyses and irregular epiphyses.},
...]

{'id': 'P98160', 'desc': 'PGBM_HUMAN', 'seq': 'MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPEDIETVTA...}

>P98160 PGBM_HUMAN
MGWRAAGALLLALLLHGRLLAVTHGLRAYDGLSLPED...

HMMR results container

You can give one or several file to parse method. For each protein, an entry hmmrObj is created

import pyproteinsext.hmmrContainer as hm
hmmrContainer = hm.parse('hmmsearch_A.out', 'hmmsearch_B.out')

hmmrObj attributes and properties:

prot : protein name
domain : domain name
hit : dictionnary that contains hmmr hit informations, like score, evalue, alignment positions...
sequence : give protein sequence that corresponds to domain
start : give domain start in protein
end : give domain end in protein

for e in hmmrContainer:
    print(e.prot)
    print(e.domain)
    print(e.hit)
    print(e.sequence)
    print(e.start)
    print(e.end)

will print

tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH
PF08022_full
{'hmmID': 'PF08022_full', 'aliID': 'tr|A0A1Q6ZBW6|A0A1Q6ZBW6_9ARCH', 'header': '1  score: 16.0 bits;  conditional E-value: 0.00014', 'score': '16.0', 'bias': '0.0', 'cEvalue': '0.00014', 'iEvalue': '0.24', 'hmmFrom': '26', 'hmmTo': '102', 'aliFrom': '37', 'aliTo': '99', 'envFrom': '27', 'envTo': '102', 'acc': '0.89', 'hmmStringLetters': 'fkwkpGqhvylsvpsisklllesHPFtiasapekddelslvirarggwtkrlaelaekseaesksklkvlieGPYGa', 'matchString': ' +++pGq+++++vp + ++     P ++  ++  ++e  +vi++ g ++++l e+              ++ GPYG+', 'aliStringLetters': 'HQANPGQFAMVWVPGVDEV-----PMSVLAIHG-KSEAGVVIKKGGPVSTALWEKKVGD--------IFFVRGPYGH', 'hmmSymbolStuff': {'CS': 'XXXXXXXXXXXXXXXXXXX--XXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX--TTSTTSH'}, 'aliSymbolStuff': {'PP': '5789*************88.....******999.******************9976555........69*******6'}}
HQANPGQFAMVWVPGVDEVPMSVLAIHGKSEAGVVIKKGGPVSTALWEKKVGDIFFVRGPYGH
37
99
...

TMHMM results container

Container that store TMHMM results. You can only give one tmhmm result file to parse (for now).

import pyproteinsext.tmhmmContainerFactory as tmhmm
tmhmmContainer = tmhmm.parse('tmhmm.out')

Container is a collection of TMHMM_Obj objects.

TMHMM_Obj attributes and properties:

prot : protein name
prot_length : protein length
nb_helix : number of predicted helixes
fragments : List of Fragment object. Each fragment have attributes cellular_location, start and end.
topology_seq : protein sequence with 'o' for outside loop, 'i' for inside loop and helix number for helixes. WARNING : doesn't work if there are more than 9 helixes (use 2 characters instead of 1)

for e in tmhmmContainer:
    print(e.prot)
    print(e.prot_length)
    print(e.nb_helix)
    print(len(e.fragments),"fragments")
    for f in e.fragments:
        print(f.cellular_location, f.start, f.end)

will print

tr|A0A2E0DFV0|A0A2E0DFV0_9EURY
155
4
9 fragments
outside 1 5
TMhelix 6 25
inside 26 53
TMhelix 54 76
outside 77 90
TMhelix 91 109
inside 110 129
TMhelix 130 152
outside 153 155
...

Pfam container

Protein-Protein interaction containers

psicquicData container API

Descriptions of the Psicquic service and related MITAB format can be found here

import pyproteinsext.psicquic as psq
psqObj = psq.PSICQUIC(offLine=True)
psqObj.read(mitabFile)
psqAllInR6 = psqObj.filter(predicate=f)

mitabObject API

A mitabObject stores a set of psicquic.PSQDATA objects. Each psicquic.PSQDATA object handles one mitab record. Managment of the psicquic.PSQDATA set is done through the pyproteins.container.Core.dnTree container model.

In practice, a mitabObject can be used to

List of all interactors

mitabTopologyObject.keys()

Obtain a one-level dictionary of all the partnairs of a query protein along with their list of psicquic.PSQDATA

mitabTopologyObject["P38801"]

Obtain all the psicquic.PSQDATA of a specific pair of proteins

mitabTopologyObject["P38801"]["P24783"]

[uniprotkb:P24783	uniprotkb:P38801	intact:EBI-5602|uniprotkb:Q05456|uniprotkb:D6W169	intact:EBI-1909|uniprotkb:D3DL32	psi-mi:dbp2_yeast(display_long)|uniprotkb:DBP2(gene name)|psi-mi:DBP2(display_short)|uniprotkb:YNL112W(locus name)|uniprotkb:N1945(orf name)|uniprotkb:DEAD box protein 2(gene name synonym)|uniprotkb:p68-like protein(gene name synonym)	psi-mi:lrp1_yeast(display_long)|uniprotkb:LRP1(gene name)|psi-mi:LRP1(display_short)|uniprotkb:Like an rRNA processing protein 1(gene name synonym)|uniprotkb:rRNA processing protein 47(gene name synonym)|uniprotkb:RRP47(gene name synonym)|uniprotkb:YC1D(gene name synonym)|uniprotkb:Yeast C1D domain-containing protein(gene name synonym)|uniprotkb:YHR081W(locus name)	psi-mi:"MI:0111"(dihydrofolate reductase reconstruction)	Tarassov et al. (2008)	pubmed:18467557|mint:MINT-6673767|imex:IM-14275	taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae)	taxid:559292(yeast)|taxid:559292(Saccharomyces cerevisiae)	psi-mi:"MI:0915"(physical association)	psi-mi:"MI:0471"(MINT)	intact:EBI-6319091|imex:IM-14275-472	author score:99.00|intact-miscore:0.37]

The key order does not matter.

Database modules

DB FS

A library to build and query large database using the file system.

Building a database

Fetch desired multifasta gziped file database

Split this file into smaller zip file

python uniprotFastaFS.py ~/tmp/vUniprot --cluster uniprot_sprot_2018_11.fasta.gz

Several node_*.gz multifasta files were created.

Index the node files

Create a node folder foreach node file, uses the file system architecture to index their content. The --nodes argument is a regexp to specify the node number(s) to index.

python uniprotFastaFS.py ~/tmp/vUniprot --nodes '*'

All indexation processes are performed in parrallel and the resulting subfolder organisations are independant. By independant folder architecture, we mean, as illustrated in the exemple below, that node subfolders (eg:node_1 and node_2) may present similar prefix have subfolders (eg:A,B).

vUniprot
    |____node_1
    | |____A
    | | |____index.txt
    | | |____data.gz
    | |____B
    | | |____index.txt
    | | |____data.gz
    |____node_2
    | |____A
    | | |____index.txt
    | | |____data.gz
    | |____B
    | | |____index.txt
    | | |____data.gz

Querying a database

You can fecth a fasta entry the following way, optionally dumping it to file.

python uniprotFastaFS.py ~/tmp/vUniprot --get P98160

Name		Name	Last commit message	Last commit date
Latest commit History 199 Commits
pyproteinsext		pyproteinsext
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

License

MMSB-MOBI/pyproteinsExt

Folders and files

Latest commit

History

Repository files navigation

pyproteinsext

notebook test&examples

Generic container modules

Anotation modules

Specific container modules

Multiple Sequence Alignment

Reading a file

Accessing sequences

Accessing matching sequences

indexed-based search

header content based search

Transformations

Column deletions

Purging gap

sequence based masking

sequence based filtering

free slicing

Meta-data and statistics

PDB container

Protein data container

Load a PDB file

Display SEQRES

Aligning SEQRES and vald ATOM RECORD

Create wrapper peptide object

Align "peptide" sequences

Uniprot container

HMMR results container

TMHMM results container

Pfam container

Protein-Protein interaction containers

psicquicData container API

mitabObject API

List of all interactors

Obtain a one-level dictionary of all the partnairs of a query protein along with their list of psicquic.PSQDATA

Obtain all the psicquic.PSQDATA of a specific pair of proteins

Database modules

DB FS

Building a database

Fetch desired multifasta gziped file database

Split this file into smaller zip file

Index the node files

Querying a database

Please consult CLI help for additional options

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages