# Fetching from the Fuzzle database

First import the ProtLego module.

In [2]:
from protlego.all import *

The Fuzzle web server is a useful resource for finding fragments shared by different folds. It is built on the SCOP95 subset of the SCOP database. For each domain, a hidden Markov model (HMM) was generated, and all-against-all HMM profile comparisons were performed with HHsearch. Structural similarity was measured with TM-align. The final set of more than 10 million hits is stored in Fuzzle.

The web interface is available at https://fuzzle.uni-bayreuth.de/2.0

### 1. Fetching from ID


There are several ways to fetch from the Fuzzle database. Perhaps the easiest is fetching by hit ID:
each hit in Fuzzle has an ID that identifies it.

In [5]:
myhit= fetch_id('4413706')

In [6]:
type(myhit)

protlego.database.data.Hit

You can always get the documentation of any function or object with the built-in function help()

In [8]:
help(myhit)

Help on Hit in module protlego.database.data object:

class Hit(builtins.tuple)
 |  Some of the documentation of this function was
 |  taken from the hhsuite python documentation:
 |  https://github.com/soedinglab/hh-suite/wiki
 |  as the sequence information from the Fuzzle hits
 |  come from HHsearch.
 |  The structural superimpositions were performed with
 |  TMalign:   https://zhanglab.ccmb.med.umich.edu/TM-align/
 |  
 |  
 |  Method resolution order:
 |      Hit
 |      builtins.tuple
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getnewargs__(self)
 |      Return self as a plain tuple.  Used by copy and pickle.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  _asdict(self)
 |      Return a new OrderedDict which maps field names to their values.
 |  
 |  _replace(_self, **kwds)
 |      Return a new Hit object replacing specified fields with new values
 |  
 |  ----------------------------------------------------------------------
 |  Class methods de

As we can see from the help, a hit stores a lot of information, for example the length, the probability, the domain names and SCOP IDs of the parents, and the TM-align score.

In [9]:
print(myhit.cols)
print(myhit.prob)
print(myhit.q_cluster)
print(myhit.s_cluster)
print(myhit.rmsd_tm_pair)
print(myhit.q_scop_id)
print(myhit.s_scop_id)
print(myhit.score_tm)

101
81.7
d2dfda1_1
d1wa5a__2
2.86
c.2.1.5
c.37.1.8
0.54938
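
The help output shows that `Hit` derives from `tuple` and provides `_asdict()`, so a hit can be turned into a dictionary for inspection or export. The sketch below uses a stand-in namedtuple called `DemoHit` (a hypothetical name, holding only the fields printed above) so that it runs without database access. On a real hit the equivalent call would presumably be `myhit._asdict()`.

```python
from collections import namedtuple

# Stand-in for protlego's Hit: only the fields printed above are included.
DemoHit = namedtuple(
    "DemoHit",
    "cols prob q_cluster s_cluster rmsd_tm_pair q_scop_id s_scop_id score_tm",
)

# Values copied from the output of hit 4413706 shown above.
demo_hit = DemoHit(101, 81.7, "d2dfda1_1", "d1wa5a__2", 2.86,
                   "c.2.1.5", "c.37.1.8", 0.54938)

# _asdict() maps every field name to its value, in field order.
record = dict(demo_hit._asdict())
for field, value in record.items():
    print(f"{field:>13}: {value}")
```

A dictionary like `record` is convenient when several hits need to be collected into a table or written to a file.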


### 2. Search all the hits that contain a specific domain

In [10]:
myhits=fetch_by_domain('d1wa5a_')

In [7]:
myhits

Query from Fuzzle with 9 hits belonging to 3 fold(s)

The variable myhits can contain one or more hits (or none), depending on the promiscuity of the domain. In this case the domain appears in 9 hits, which involve 3 different folds overall.

From the variable myhits, we can directly retrieve a few statistical values:

In [11]:
print(myhits.ids)
print(myhits.avg_len)
print(myhits.std_len)
print(myhits.list_folds)

[2290353, 361595, 787549, 1181040, 1258170, 2081517, 4413706, 6818256, 9075069, 7536792, 7536798, 7536807, 7536816, 7536888, 7536911, 7536939, 7536971, 7536978]
66.5555555556
41.6616293251
['c.2' 'c.37' 'c.91']
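
The fold labels in `list_folds` correspond to the first two components of a SCOP identifier (class and fold). As a minimal sketch of that relationship, the helper below (`fold_of`, a hypothetical name and not part of ProtLego) trims full SCOP IDs to fold level and collects the distinct folds. Whether ProtLego computes `list_folds` this way is an assumption.

```python
def fold_of(scop_id: str) -> str:
    """Return the fold-level part (class.fold) of a SCOP identifier."""
    return ".".join(scop_id.split(".")[:2])

# SCOP IDs taken from the hit printed in section 1, repeated to mimic several hits.
scop_ids = ["c.2.1.5", "c.37.1.8", "c.37.1.8", "c.2.1.5"]

# Distinct folds, sorted for a stable display order.
folds = sorted({fold_of(s) for s in scop_ids})
print(folds)
```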


### 3. Fetch by parents (finding a hit between two domains)

One can also fetch by the domain names of the two parents. This type of search can also produce none, one, or several hits, as there can be both query-subject and subject-query combinations, along with different ways to superimpose the structures in each of them.

In [12]:
myhits2 = fetch_by_domains('d1wa5a_','d2dfda1')

In [13]:
myhits2

Query from Fuzzle with 2 hits belonging to 2 fold(s)

In [11]:
help(fetch_by_domains)

Help on function fetch_by_domains in module protlego.database.data:

fetch_by_domains(domain1:str, domain2:str, prob:int=70, rmsd:float=3.0, ca_min:int=10, ca_max:int=200, score_tm_pair:float=0.3, ratio:float=1.25, diff_folds:bool=True)
    Fetch all the hits between two parent domains
    
    :param domain1: The 7 letter code for one of the parents
    :param domain2: The 7 letter code for one of the parents
    :param prob: the minimum allowed HHsearch probability
    :param rmsd: The maximum allowed RMSD (rmsd_tm_pair: "RMSD for the TMalign alignment between the two domains, passing the sequence alignment as seed)
    :param ca_min: The minimum allowed fragment length (for the TMalign alignment)
    :param ca_max: The maximun allowed fragment length (for the TMalign alignment)
    :param score_tm_pair: The minimum allowed TM-score (for the TMalign alignment)
    :param ratio: the maximum ratio for the sequence and structural alignment lengths (cols / ca_tm_pair)
    :param diff_fol
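
The cutoffs listed in the help can be read as a filter applied to every candidate hit. The following sketch mirrors the documented defaults on plain dictionaries. The dictionary keys, the inclusive comparisons, and the function name `passes_cutoffs` are assumptions made for illustration, and the last parameter is left out because its description is truncated above. The demo values are invented.

```python
from typing import Dict

def passes_cutoffs(hit: Dict[str, float], prob: float = 70, rmsd: float = 3.0,
                   ca_min: int = 10, ca_max: int = 200,
                   score_tm_pair: float = 0.3, ratio: float = 1.25) -> bool:
    """Check one hit record against the cutoffs described in the help text."""
    return (
        hit["prob"] >= prob                            # minimum HHsearch probability
        and hit["rmsd_tm_pair"] <= rmsd                # maximum RMSD
        and ca_min <= hit["ca_tm_pair"] <= ca_max      # fragment length window
        and hit["score_tm_pair"] >= score_tm_pair      # minimum TM-score
        and hit["cols"] / hit["ca_tm_pair"] <= ratio   # sequence/structure length ratio
    )

# Invented records, for illustration only.
demo_hits = [
    {"prob": 85.0, "rmsd_tm_pair": 2.5, "ca_tm_pair": 90, "score_tm_pair": 0.50, "cols": 100},
    {"prob": 65.0, "rmsd_tm_pair": 2.0, "ca_tm_pair": 80, "score_tm_pair": 0.60, "cols": 85},
    {"prob": 90.0, "rmsd_tm_pair": 2.0, "ca_tm_pair": 40, "score_tm_pair": 0.40, "cols": 80},
]

kept = [h for h in demo_hits if passes_cutoffs(h)]
print(len(kept), "of", len(demo_hits), "records pass the default cutoffs")
```

In this sketch the second record fails the probability cutoff and the third fails the length ratio (80 / 40 = 2.0), so only the first is kept.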

In [16]:
mynewhit=myhits2[0]

In [17]:
print(mynewhit.cols)
print(mynewhit.prob)
print(mynewhit.q_cluster)
print(mynewhit.s_cluster)
print(mynewhit.q_scop_id)
print(mynewhit.s_scop_id)
print(mynewhit.score_tm)

101
76.2
d1wa5a__2
d2dfda1_1
c.37.1.8
c.2.1.5
0.48058
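
The hit printed here has `d1wa5a__2` as query and `d2dfda1_1` as subject, while the hit fetched by ID in section 1 has the two the other way round. A minimal sketch of how such query-subject and subject-query records can be grouped under one unordered pair is shown below. The dictionary layout and the helper `domain_of` are hypothetical, and the rule that a cluster name starts with the 7-character domain code is an assumption based on the names shown.

```python
from collections import defaultdict

def domain_of(cluster: str) -> str:
    # Assumption: a cluster name is the 7-character domain code plus a suffix.
    return cluster[:7]

# Query/subject clusters and probabilities copied from the two hits printed in this document.
hits = [
    {"label": "hit A", "q_cluster": "d2dfda1_1", "s_cluster": "d1wa5a__2", "prob": 81.7},
    {"label": "hit B", "q_cluster": "d1wa5a__2", "s_cluster": "d2dfda1_1", "prob": 76.2},
]

# A frozenset ignores which domain was the query and which the subject.
by_pair = defaultdict(list)
for hit in hits:
    pair = frozenset((domain_of(hit["q_cluster"]), domain_of(hit["s_cluster"])))
    by_pair[pair].append(hit)

for pair, members in by_pair.items():
    print(sorted(pair), [m["label"] for m in members])
```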


### 4. Fetching between two groups

It is also possible to fetch between two SCOP groups, for example between two families. Other options are searching between two superfamilies or two folds.
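
In SCOP notation the hierarchy level of a group follows from the number of dot-separated components: `c.23` is a fold, `c.23.1` a superfamily, and `c.23.1.1` a family. The helper below (`scop_level`, a hypothetical name that is not part of ProtLego) only illustrates this notation; how `fetch_group` interprets its arguments internally is not shown in this document.

```python
LEVELS = {1: "class", 2: "fold", 3: "superfamily", 4: "family"}

def scop_level(scop_id: str) -> str:
    """Name the SCOP hierarchy level implied by a dotted identifier."""
    parts = scop_id.split(".")
    if len(parts) not in LEVELS or not all(parts):
        raise ValueError(f"not a valid SCOP identifier: {scop_id!r}")
    return LEVELS[len(parts)]

for group in ("c.23", "c.23.1", "c.23.1.1"):
    print(group, "is a", scop_level(group))
```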

#### 4.1 Fetching between two families

In this case we try a different combination: a family from the flavodoxin-like fold (c.23) and a family from the PBP fold (c.93):

In [20]:
myhits3=fetch_group('c.23.1.1','c.93.1.0')
print(myhits3)

Query from Fuzzle with 472 hits belonging to 2 fold(s)


As before, the variable myhits3 contains the hits plus additional information, such as the average length of the hits, its standard deviation, and the folds involved.

In [21]:
print(myhits3.hits[0]) # printing the first hit because why not
print(myhits3.avg_len) # average length between the hits in these two families
print(myhits3.std_len)
print(myhits3.list_folds)

Hit between d2b4aa1 and d4nqra_ with probability 71.4 %

78.2330508475
14.8673772953
['c.23' 'c.93']


#### 4.2 Fetch between two superfamilies

One can also fetch between two superfamilies. The previous section returned 472 hits between two families belonging to these superfamilies, so we can expect many more now:

In [22]:
myhits4=fetch_group('c.23.1','c.93.1') 

In [23]:
myhits4

Query from Fuzzle with 1859 hits belonging to 2 fold(s)

#### 4.3 Fetch between two folds

We can also search for hits between two folds. We can of course impose some criteria, such as a minimum probability, a maximum RMSD, or a minimum fragment length.

In [24]:
myhits_1 = fetch_group('c.23','c.93',prob=70) # searching for hits with probability over 70% 

In [25]:
myhits_1

Query from Fuzzle with 4946 hits belonging to 2 fold(s)

In [26]:
myhits_2 = fetch_group('c.23','c.93',prob=80,rmsd=3) # fetching hits with prob. over 80 and rmsd <3

In [27]:
myhits_2

Query from Fuzzle with 2554 hits belonging to 2 fold(s)

In [28]:
myhits_3 = fetch_group('c.23','c.93',prob=80, rmsd=3, ca_min=50) # as above, but also requiring a minimum fragment length of 50 amino acids

In [29]:
myhits_3

Query from Fuzzle with 2478 hits belonging to 2 fold(s)
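
To see how strongly each additional cutoff narrows the search, the hit counts reported above can be compared against the first query. This is plain arithmetic on the numbers printed in this section; the labels are only descriptive strings.

```python
# Hit counts copied from the three queries between folds c.23 and c.93.
counts = {
    "prob=70": 4946,
    "prob=80, rmsd=3": 2554,
    "prob=80, rmsd=3, ca_min=50": 2478,
}

baseline = counts["prob=70"]
fractions = {label: n / baseline for label, n in counts.items()}

for label, frac in fractions.items():
    print(f"{label:<28} {counts[label]:>5} hits  {frac:.1%} of the first query")
```

Raising the probability cutoff and bounding the RMSD removes roughly half of the hits, while the additional length requirement removes only 76 more.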

### 5. Fetching subspaces

Additionally, there is the possibility to fetch a group or a single query against the rest of the database. 

#### 5.1 All hits that contain a TIM-barrel

With the function fetch_subspace we can obtain sets of hits that fulfil given criteria, for example all hits involving the TIM-barrel fold. Keep in mind that these functions apply some default cutoffs:

In [30]:
myhits5 = fetch_subspace(scop_q='c.1')

In [31]:
myhits5.list_folds

array(['a.1', 'a.100', 'a.101', 'a.102', 'a.108', 'a.114', 'a.118',
       'a.121', 'a.126', 'a.127', 'a.128', 'a.13', 'a.137', 'a.140',
       'a.144', 'a.149', 'a.15', 'a.150', 'a.152', 'a.153', 'a.156',
       'a.157', 'a.159', 'a.16', 'a.168', 'a.174', 'a.177', 'a.178',
       'a.179', 'a.18', 'a.182', 'a.185', 'a.186', 'a.193', 'a.199', 'a.2',
       'a.20', 'a.204', 'a.206', 'a.21', 'a.218', 'a.219', 'a.22', 'a.222',
       'a.229', 'a.23', 'a.237', 'a.24', 'a.244', 'a.247', 'a.248', 'a.25',
       'a.253', 'a.254', 'a.258', 'a.26', 'a.271', 'a.272', 'a.277',
       'a.28', 'a.284', 'a.287', 'a.29', 'a.291', 'a.293', 'a.294',
       'a.297', 'a.298', 'a.3', 'a.30', 'a.300', 'a.301', 'a.31', 'a.32',
       'a.34', 'a.35', 'a.36', 'a.39', 'a.4', 'a.40', 'a.41', 'a.42',
       'a.43', 'a.45', 'a.46', 'a.47', 'a.48', 'a.5', 'a.53', 'a.55',
       'a.56', 'a.58', 'a.59', 'a.6', 'a.60', 'a.61', 'a.64', 'a.65',
       'a.69', 'a.7', 'a.73', 'a.74', 'a.77', 'a.8', 'a.80', 'a.81',
       

#### 5.2 Fetch full universes

We can also fetch the whole universe while setting some cutoffs, such as probability and RMSD. For example, all hits with a probability over 70% and an RMSD below 2.0:

In [32]:
myhits6 = fetch_subspace(prob=70,rmsd=2.0)

In [33]:
myhits6

Query from Fuzzle with 95755 hits belonging to 451 fold(s)