# WRKY_DBD

## Domain Knowledge

* Transcription factors (TFs) are proteins that control the rate of transcription of genetic information from DNA to messenger RNA by binding to specific DNA sequences.

* Transcription factors contain two domains:
    * DNA-binding domain (DBD): attaches to specific sequences of DNA
    * Activation domain (AD): interacts with other proteins, such as transcription coregulators, to regulate transcription

* TFs are classified into families according to their DBDs.

### Motivations

* The DNA-binding patterns of only a small fraction of TFs have been characterized.
    * Fewer than 2% of eukaryotic TFs (M. T. Weirauch et al., 2014)
    * About 57% of Arabidopsis thaliana TFs (data from PlantPAN 3.0 and PlantTFDB v4.0)
* Experimental protein structure determination is hard.

### Problems

* Can we predict **DNA binding sites** from polypeptide sequences or DNA-binding domain sequences?
* Can we determine **the key amino acids** essential for **DNA recognition** from known **TF-DNA pairs**?
* Are there **unknown features** in polypeptide sequences that can be used to explain the interaction between TFs and DNA?

## Dataset description

Column names:
* TF identifier (TF_ID)
* Protein sequence identifier (Pseq_ID)
* Protein primary sequence (Pseq)
* DNA-binding domain sequence (DBD_seq)
* Binding matrix identifier (matrix_ID)

Note: Please don’t use matrix_ID as a feature.
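
The note above can be enforced in code by dropping `matrix_ID` before any feature extraction. The following is a minimal sketch on a hypothetical two-row table; the column names follow the list above, but every value is a made-up placeholder rather than real data.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset description.
toy = pd.DataFrame({
    'TF_ID': ['TF_A', 'TF_B'],
    'Pseq_ID': ['P1', 'P2'],
    'Pseq': ['MSEKEE', 'MDEAKE'],
    'DBD_seq': ['WRKYGQK', 'WRKYGQK'],
    'matrix_ID': ['M1', 'M2'],
})

# Keep matrix_ID out of the feature table, as the note requests.
features = toy.drop(columns=['matrix_ID'])
print(list(features.columns))
```

Whether `TF_ID` and `Pseq_ID` should also be excluded is a modelling decision the note does not cover.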

## Load datasets

In [1]:
import pandas as pd

In [2]:
dfp = pd.read_csv('WRKY_info_20190507/WRKY_info_table_positive.txt', sep='\t')
dfn1 = pd.read_csv('WRKY_info_20190507/WRKY_info_table_negative_one.txt', sep='\t')
dfn2 = pd.read_csv('WRKY_info_20190507/WRKY_info_table_negative_two.txt', sep='\t')
dfn3 = pd.read_csv('WRKY_info_20190507/WRKY_info_table_negative_three.txt', sep='\t')
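
The file names suggest one positive set and three negative sets of TF-matrix pairs. One possible way to combine them into a single labelled table is sketched below. It assumes (this is not stated in the notebook) that positive pairs should get label 1 and negative pairs label 0, and it uses small placeholder frames named `toy_pos` and `toy_negs` in place of `dfp` and `dfn1` to `dfn3`.

```python
import pandas as pd

# Hypothetical stand-ins for dfp and the negative tables (placeholder values only).
cols = ['TF_ID', 'Pseq_ID', 'Pseq', 'DBD_seq', 'matrix_ID']
toy_pos = pd.DataFrame([['TF_A', 'P1', 'MSEK', 'WRKY', 'M1']], columns=cols)
toy_negs = [
    pd.DataFrame([['TF_A', 'P1', 'MSEK', 'WRKY', 'M2']], columns=cols),
    pd.DataFrame([['TF_B', 'P2', 'MDEA', 'WRKY', 'M1']], columns=cols),
]

# Assumption: label 1 marks a positive pair, label 0 a negative pair.
labelled = pd.concat(
    [toy_pos.assign(label=1)] + [df.assign(label=0) for df in toy_negs],
    ignore_index=True,
)
print(labelled[['TF_ID', 'matrix_ID', 'label']])
```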

In [3]:
dfp.head()

Unnamed: 0,TF_ID,Pseq_ID,Pseq,DBD_seq,matrix_ID
0,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TF_motif_seq_0270
1,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TF_motif_seq_0339
2,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TFmatrixID_0449
3,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TFmatrixID_0451
4,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TFmatrixID_0465


### Parse .meme file

In [4]:
inputfile = open('WRKY_info_20190507/All_matrices.meme', 'r')

In [5]:
# Skip the first 9 lines of the file (the header before the first motif).
for i in range(9):
    inputfile.readline()

In [6]:
outputfile = open('WRKY_info_20190507/All_matrices_output.txt', 'w')

In [7]:
outputfile.write('matrix_ID\tA\tC\tG\tT\n')

18

In [8]:
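while True:
    # Read the header line of the next motif block.
    s1 = inputfile.readline().rstrip('\n')

    # Stop at end of file (an empty line here also ends the loop).
    if len(s1) == 0:
        break

    # The second field of the motif header line is the matrix ID.
    t1 = s1.split()
    matrix_id = t1[1]

    # Skip the two lines between the motif header and the matrix rows.
    inputfile.readline()
    inputfile.readline()
    while True:
        s2 = inputfile.readline()
        # A blank line ends the matrix; '' guards against a file
        # that ends without a trailing blank line.
        if s2 in ('', '\n'):
            break
        # Prefix every matrix row with its matrix ID.
        outputfile.write(matrix_id + '\t')
        outputfile.write(s2)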

In [9]:
outputfile.close()
inputfile.close()

In [10]:
dfmeme = pd.read_csv('WRKY_info_20190507/All_matrices_output.txt', sep='\t')
dfmeme.head()

Unnamed: 0,matrix_ID,A,C,G,T
0,TF_motif_seq_0001,1.0,0.0,0.0,0.0
1,TF_motif_seq_0001,1.0,0.0,0.0,0.0
2,TF_motif_seq_0001,0.0,1.0,0.0,0.0
3,TF_motif_seq_0001,0.0,1.0,0.0,0.0
4,TF_motif_seq_0001,0.0,0.0,0.0,1.0
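
The same long-format table can also be built in memory, without an intermediate text file. The sketch below is one way to do that and is not the notebook's method. `parse_meme` is a hypothetical helper, and it assumes the usual MEME layout in which each motif has a `MOTIF <id>` line followed by a `letter-probability matrix:` line and one row of four probabilities per position. The sample text is made up and is not taken from `All_matrices.meme`.

```python
import io
import pandas as pd

def parse_meme(handle):
    """Collect one row per motif position from a MEME-format text stream."""
    rows = []
    motif_id = None
    in_matrix = False
    for line in handle:
        fields = line.split()
        if not fields:                           # blank line ends the current matrix
            in_matrix = False
        elif fields[0] == 'MOTIF':               # start of a new motif block
            motif_id = fields[1]
            in_matrix = False
        elif fields[0] == 'letter-probability':  # matrix rows follow this line
            in_matrix = True
        elif in_matrix and len(fields) == 4:     # one position: A, C, G, T
            rows.append([motif_id] + [float(x) for x in fields])
    return pd.DataFrame(rows, columns=['matrix_ID', 'A', 'C', 'G', 'T'])

# Made-up two-motif example used only to exercise the parser.
sample = """MEME version 4

ALPHABET= ACGT

MOTIF toy_motif_1
letter-probability matrix: alength= 4 w= 2
1.0 0.0 0.0 0.0
0.0 0.5 0.0 0.5

MOTIF toy_motif_2
letter-probability matrix: alength= 4 w= 1
0.0 0.0 1.0 0.0
"""

toy_meme = parse_meme(io.StringIO(sample))
print(toy_meme)
```

Because the parser keys on line content rather than on fixed line counts, it does not depend on the exact number of header lines.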


## Analysis

In [11]:
dfmeme[dfmeme.matrix_ID == 'TF_motif_seq_0339']

Unnamed: 0,matrix_ID,A,C,G,T
11758,TF_motif_seq_0339,0.0,0.0,0.0,1.0
11759,TF_motif_seq_0339,0.0,0.0,0.0,1.0
11760,TF_motif_seq_0339,0.0,0.0,1.0,0.0
11761,TF_motif_seq_0339,1.0,0.0,0.0,0.0
11762,TF_motif_seq_0339,0.0,1.0,0.0,0.0
11763,TF_motif_seq_0339,0.0,0.5,0.0,0.5
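
A probability matrix like the one above is easier to read as a consensus string. The sketch below converts each position to an IUPAC nucleotide code. The `consensus` helper and its `threshold=0.25` cutoff are illustrative assumptions, not part of the original analysis. The probabilities are copied from the six `TF_motif_seq_0339` rows shown above.

```python
import pandas as pd

# IUPAC codes keyed by the alphabetically sorted set of allowed bases.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'AC': 'M', 'AG': 'R', 'AT': 'W', 'CG': 'S', 'CT': 'Y', 'GT': 'K',
    'ACG': 'V', 'ACT': 'H', 'AGT': 'D', 'CGT': 'B', 'ACGT': 'N',
}

def consensus(matrix, threshold=0.25):
    """Return an IUPAC consensus; `threshold` is an arbitrary illustrative cutoff."""
    letters = []
    for _, row in matrix[['A', 'C', 'G', 'T']].iterrows():
        # Keep every base whose probability reaches the cutoff.
        kept = ''.join(base for base in 'ACGT' if row[base] >= threshold)
        letters.append(IUPAC.get(kept, 'N'))
    return ''.join(letters)

# Probabilities copied from the TF_motif_seq_0339 rows shown above.
m0339 = pd.DataFrame(
    [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0],
     [1, 0, 0, 0], [0, 1, 0, 0], [0, 0.5, 0, 0.5]],
    columns=['A', 'C', 'G', 'T'], dtype=float,
)
print(consensus(m0339))  # TTGACY
```

The result, `TTGACY` (Y = C or T), matches the W-box element TTGAC(C/T) commonly reported as the binding site of WRKY factors.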


In [12]:
dfp[dfp.matrix_ID == 'TF_motif_seq_0339']

Unnamed: 0,TF_ID,Pseq_ID,Pseq,DBD_seq,matrix_ID
1,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TF_motif_seq_0339
7,AT1G13960,TFprotseq_12499,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,LDDGYRWRKYGQKVVKGNPYPRSYYKCTTPGCGVRKHVERAATDPK...,TF_motif_seq_0339
13,AT1G13960,TFprotseq_12500,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,ADDGYNWRKYGQKQVKGSEFPRSYYKCTNPGCPVKKKVERSLDGQV...,TF_motif_seq_0339
19,AT1G13960,TFprotseq_12500,MSEKEEAPSTSKSTGAPSRPTLSLPPRPFSEMFFNGGVGFSPGPMT...,LDDGYRWRKYGQKVVKGNPYPRSYYKCTTPGCGVRKHVERAATDPK...,TF_motif_seq_0339
25,AT1G18860,TFprotseq_12501,MDEAKEENRRLKSSLSKIKKDFDILQTQYNQLMAKHNEPTKFQSKG...,MNDGCQWRKYGQKIAKGNPCPRAYYRCTIAASCPVRKQVQRCSEDM...,TF_motif_seq_0339
48,AT1G29280,TFprotseq_12502,MKRGLDMARSYNDHESSQETGPESPNSSTFNGMKALISSHSPKRSR...,PSDSWAWRKYGQKPIKGSPYPRGYYRCSSTKGCPARKQVERSRDDP...,TF_motif_seq_0339
74,AT1G29860,TFprotseq_12503,MDDHVEHNYNTSLEEVHFKSLSDCLQSSLVMDYNSLEKVFKFSPYS...,LEDGYRWRKYGQKAVKNSPYPRSYYRCTTQKCNVKKRVERSFQDPS...,TF_motif_seq_0339
100,AT1G30650,TFprotseq_12504,MCSVSELLDMENFQGDLTDVVRGIGGHVLSPETPPSNIWPLPLSHP...,PSDLWAWRKYGQKPIKGSPFPRGYYRCSSSKGCSARKQVERSRTDP...,TF_motif_seq_0339
104,AT1G55600,TFprotseq_12505,MSDFDENFIEMTSYWAPPSSPSPRTILAMLEQTDNGLNPISEIFPQ...,PNDGYRWRKYGQKVVKGNPNPRSYFKCTNIECRVKKHVERGADNIK...,TF_motif_seq_0339
126,AT1G62300,TFprotseq_12506,MDRGWSGLTLDSSSLDLLNPNRISHKNHRRFSNPLAMSRIDEEDDQ...,ISDGCQWRKYGQKMAKGNPCPRAYYRCTMATGCPVRKQVQRCAEDR...,TF_motif_seq_0339
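
In the output above, `AT1G13960` occupies four rows (two protein sequence IDs, each with two DBD sequences), so the row count for a matrix overstates the number of distinct TFs. A minimal sketch of a per-matrix summary follows. It uses a hypothetical placeholder table with the same shape, and the column names `n_rows`, `n_tfs`, and `n_dbds` are illustrative.

```python
import pandas as pd

# Hypothetical rows mimicking the shape of the output above (placeholder values).
toy = pd.DataFrame({
    'TF_ID':     ['TF_A', 'TF_A', 'TF_A', 'TF_A', 'TF_B'],
    'Pseq_ID':   ['P1', 'P1', 'P2', 'P2', 'P3'],
    'DBD_seq':   ['DBD_x', 'DBD_y', 'DBD_x', 'DBD_y', 'DBD_z'],
    'matrix_ID': ['M1'] * 5,
})

# Count rows, distinct TFs, and distinct DBD sequences for each matrix.
summary = toy.groupby('matrix_ID').agg(
    n_rows=('TF_ID', 'count'),
    n_tfs=('TF_ID', 'nunique'),
    n_dbds=('DBD_seq', 'nunique'),
)
print(summary)
```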
