# Prediction of TF binding based on DNA physical properties
COMP561: Computational Biology Methods and Research

### About:
Creating a heuristic method to predict TF binding sites based on DNA physical properties.
### Why I chose this project
- I chose this project myself; the motivation came from within rather than being imposed from outside.
- There were many things that I didn't know how to do, e.g. ________. To solve them, I explored and found solutions myself, such as a suitable Python package or a way to run a tool efficiently (Galaxy). It was an assignment, but I approached it proactively, and I remember moments when I realized I was having a lot of fun.

### DNA physical properties can affect TF binding
Structural properties of DNA have a significant impact on its affinity for proteins.
- e.g. minor groove width (MGW), Roll, helical twist (HelT), and propeller twist (ProT)


DNA physical properties can be predicted from the local DNA sequence using tools such as DNAshape: https://academic.oup.com/nar/article/43/D1/D103/2439587

### Identification of binding sites for a given TF is important
- e.g. Application: structure-based drug discovery

### ChIP-seq combined with position weight matrix (PWM) can be used to identify TF binding sites
- Issue: some sites match the PWM and look as if they should be bound by the TF, but are not bound in reality (= negative examples)
- Positive examples = sequences to which the TF is actually bound
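
As an illustration of how a PWM flags candidate sites, the sketch below scores every window of a sequence by its log-odds against a uniform background and reports the best one. The 4-position matrix, the `BACKGROUND` value, and the function names are made up for this example; they are not taken from the Factorbook motifs used in this project.

```python
import math

# Hypothetical 4-position PWM (one dict of base probabilities per position).
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background probability for every base


def log_odds(window):
    """Sum of log2(P(base | position) / background) over the window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def best_window(seq):
    """Return (score, start) of the highest-scoring PWM-length window in seq."""
    width = len(PWM)
    scores = [(log_odds(seq[i:i + width]), i) for i in range(len(seq) - width + 1)]
    return max(scores)


score, start = best_window("GGTATACC")
print(start, round(score, 2))
```

A high score only says that the sequence resembles the motif; whether the site is actually bound is what the ChIP-seq evidence adds.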

### Objective: To examine whether positive examples could be distinguished from negative examples based on predicted structural properties of their sequences

### Physical properties used
- Minor groove width (MGW)
- Roll
- Helical twist (HelT)
- Propeller twist (ProT)

### TFs used: TATA box-binding protein (TBP) and UA7
Rationales
- TBP: One of the best-known and most essential TFs. It is required for correct initiation of transcription and is used by all three eukaryotic RNA polymerases
- UA7: Has a very different sample size from TBP => used to assess the effect of the workflow developed and the performance of the machine learning models

### Machine learning models used:
- K-nearest neighbour (KNN)
- Logistic regression

### Project Overview: Steps involved
1. Identifying a set of bound and non-bound DNA sequences for a given TF (based on existing experimental data)
2. Calculating the DNA physical properties of each sequence
3. Training a machine learning classifier to distinguish between bound and unbound sites.
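
To make step 3 concrete, here is a minimal sketch of a K-nearest-neighbour classifier on four features per site. Everything in it is an assumption for illustration: the feature values are synthetic random numbers standing in for MGW, Roll, HelT, and ProT summaries, and the hand-written `knn_predict` is not the implementation used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-ins for per-site shape features (NOT real DNAshape output).
X_pos = rng.normal(loc=0.0, scale=1.0, size=(n, 4))  # "bound" sites
X_neg = rng.normal(loc=3.0, scale=1.0, size=(n, 4))  # "unbound" sites
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n + [0] * n)

# Shuffle, then hold out 25% of the sites for testing.
order = rng.permutation(len(y))
X, y = X[order], y[order]
split = int(0.75 * len(y))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return (votes.mean(axis=1) > 0.5).astype(int)


y_pred = knn_predict(X_train, y_train, X_test)
accuracy = float((y_pred == y_test).mean())
print("Held-out accuracy:", accuracy)
```

The two synthetic classes are deliberately well separated, so the accuracy printed here says nothing about how separable real bound and unbound sites are.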

## 1. Identification of a set of bound and non-bound DNA sequences for a given TF 
Based on existing experimental data

In [None]:
# If pyranges is not installed
#import sys
#!{sys.executable} -m pip install pyranges

In [46]:
# Import libraries
import pandas as pd
import numpy as np
import pyranges as pr
import time

In [47]:
overall_runtime_start = time.time()

In [48]:
# Create a dataframe from the textfile
file_name = "factorbookMotifPos.txt"  # Contains genomic coordinates of positive examples of the target TFs
df = pd.read_csv(file_name, usecols=[1,2,3,4,5,6], 
                 names=['Chom', 'TFStart', 'TFEnd', 'TFName', 'ScoreBS', 'StrandBS'],
                 header=None,
                 sep='\t')

# Get an overview of the dataframe
print(df.shape)
print(df.head(10))

(2366151, 6)
   Chom  TFStart   TFEnd TFName  ScoreBS StrandBS
0  chr1    10461   10476    UA1     2.02        -
1  chr1    10464   10479    UA1     2.26        +
2  chr1    16245   16260   CTCF     1.97        -
3  chr1    89933   89948    NFY     1.95        -
4  chr1    91265   91280   CTCF     1.40        +
5  chr1    91419   91434   CTCF     2.07        +
6  chr1    91421   91436   CTCF     2.80        -
7  chr1    91431   91446   CTCF     1.95        -
8  chr1   104986  105001   CTCF     2.83        +
9  chr1   138972  138987   CTCF     3.43        -


In [49]:
# List out all the transcription factors (TFs)
df_TFs = df["TFName"].value_counts()
print(df_TFs)
df_TFs.to_csv("TFs.csv")  # Create a csv file containing all the TFs
print("\nNumber of types of transcription factors:", df_TFs.shape[0])

AP1       166122
CTCF      158004
UAK42     130536
SP1        83782
ZNF263     75577
           ...  
UA14         149
NR4A1         90
UAK32         21
UAK44          8
UAK53          6
Name: TFName, Length: 133, dtype: int64

Number of types of transcription factors: 133


## Extract relevant rows: the TF and the + strand
The strand (+ or -) of the regions in the bed file is unknown; therefore, only the + strand is used.

In [None]:
# Specify the TF
tf = 'UA7'
#tf = 'AP2'

In [50]:
# Extract rows containing binding site information for the selected TF (forward strand only)
df_TF = df[df['TFName'] == tf]
df_TF_fwd = df_TF[df_TF['StrandBS'] == '+']
print(df_TF_fwd.shape)
print(df_TF_fwd.head(10))

(1238, 6)
       Chom  TFStart    TFEnd TFName  ScoreBS StrandBS
2774   chr1  1260080  1260101    UA7     0.86        +
2980   chr1  1284764  1284785    UA7     1.32        +
4392   chr1  1711959  1711980    UA7     1.21        +
6112   chr1  2245838  2245859    UA7     2.38        +
7299   chr1  2517947  2517968    UA7     0.94        +
11097  chr1  6052803  6052824    UA7     2.50        +
12198  chr1  6453418  6453439    UA7     1.06        +
12995  chr1  6662844  6662865    UA7     1.31        +
15714  chr1  8483883  8483904    UA7     2.15        +
16092  chr1  8763513  8763534    UA7     3.57        +


## Read the bed file

In [54]:
# The bed file: contains active regulatory regions in GM12878 (a lymphoblastoid cell line)
gr = pr.read_bed("wgEncodeRegTfbsClusteredV3.GM12878.merged.bed")
df_bed = gr.df
print(df_bed.shape)
print("Total # of active regulatory regions:", df_bed.shape[0], "\n")

# Observe the top and bottom of the bed file
print(df_bed.head(10))
print(df_bed.tail(10))

(78792, 4)
Total # of active regulatory regions: 78792 

  Chromosome   Start     End     Name
0       chr1  237550  237989   chr1.1
1       chr1  521310  521756   chr1.2
2       chr1  713702  714675   chr1.3
3       chr1  762022  763345   chr1.4
4       chr1  805100  805473   chr1.5
5       chr1  839802  841023   chr1.6
6       chr1  848092  848620   chr1.7
7       chr1  856280  856826   chr1.8
8       chr1  858297  862058   chr1.9
9       chr1  873457  874094  chr1.10
      Chromosome      Start        End       Name
78782       chrX  154764452  154764876  chrX.2507
78783       chrX  154807318  154807679  chrX.2508
78784       chrX  154841965  154843013  chrX.2509
78785       chrX  154880827  154881309  chrX.2510
78786       chrX  154891942  154892376  chrX.2511
78787       chrX  155049813  155050189  chrX.2512
78788       chrX  155087251  155087687  chrX.2513
78789       chrX  155094268  155094730  chrX.2514
78790       chrX  155110578  155111395  chrX.2515
78791       chrX  1551965

In [62]:
# List out all chromosomes
list_choms = list(set(df_bed["Chromosome"].to_list()))
print(sorted(list_choms))
# Sorting the chromosome names taken directly from the bed file gives a
# lexicographic order (chr1, chr10, chr11, ...), which is undesirable.
# After inspecting the chromosomes present in the bed file, the list is
# therefore built explicitly below in the desired order.
print("-------------")
list_chrs = ['chr%s' % s for s in range(1,23)]
list_chrs.append("chrX")
print(list_chrs)

['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrX']
-------------
['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX']


### Extracting negative examples
Extracting genomic coordinates of sequences potentially containing negative examples
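
The core idea of this step (an active regulatory region minus the positions occupied by positive examples) can be sketched for a single region as below. This is a simplified illustration, not the notebook's exact logic: `subtract_sites` is a hypothetical helper, coordinates are treated as inclusive to mirror the `st-1` / `ed+1` arithmetic used in the cell that follows, and overlapping sites are merged rather than skipped.

```python
def subtract_sites(region_start, region_end, sites):
    """Return the parts of [region_start, region_end] not covered by any site.

    `sites` is an iterable of (start, end) pairs with inclusive coordinates.
    """
    gaps = []
    cursor = region_start  # first position not yet known to be covered
    for start, end in sorted(sites):
        if end < region_start or start > region_end:
            continue  # site lies entirely outside the region
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= region_end:
        gaps.append((cursor, region_end))
    return gaps


# Toy region with three sites, two of which overlap each other.
example_gaps = subtract_sites(100, 200, [(120, 130), (125, 140), (180, 190)])
print(example_gaps)
```

Each returned pair is a stretch of the region that contains no positive example and is therefore a candidate location for negative examples.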

In [53]:
## Before the edit on df_TF_negative
start_time_total = time.time()
df_TF_negative_all = pd.DataFrame(columns=['Chom','NStart','NEnd','UnboundTF'])  # An empty dataframe used later

for chom in list_chrs:  # For each chromosome (chr)
    start_time = time.time()
    print(chom)

    # Extract portion of the bed file corresponding to the chromosome
    df_bed_chom = df_bed[df_bed["Chromosome"] == chom]
    print(df_bed_chom.head(10))
    # Extract portion of the genomic coordinate (of +'ve example) df corresponding to the chr
    df_TF_chom = df_TF_fwd[df_TF_fwd["Chom"] == chom]
    print(df_TF_chom.head(10))
    
    # Sorting to ensure the dataframe is sorted
    df_bed_chom_s = df_bed_chom.sort_values("Start")
    df_TF_chom_s = df_TF_chom.sort_values("TFStart")

    
    # For reducing running time
    # Recall: The bed file contains active regulatory region
    min_bed = df_bed_chom_s["Start"].min()  # Get the smallest coordinate in the bed file df for the chr
    max_bed = df_bed_chom_s["End"].max()  # Get the largest coordinate in the bed file df for the chr

    # Recall: the df_TF_chom_s contains coordinates of positive examples (forward strand) for the TF for the chromosome
    min_TF = df_TF_chom_s["TFStart"].min()  # Get the smallest coordinate in the TF positive example df for the chr
    max_TF = df_TF_chom_s["TFEnd"].max()  # Get the largest coordinate in the TF positive example df for the chr

    """
    Idea:
    Negative examples will fall under the active regulatory region.
    => Potential negative examples will be the active regulatory regions - where positive regions are
    i.e. Negative examples can't possibly be anywhere in the active regulatory region (stored in bed file) 
    that overlaps with positive region
    """
    # Extract all the coordinates of active regulatory region that comes before the 
    # smallest coordinate among the positive examples for the chromosome
    df_bed_non_top = df_bed_chom_s[df_bed_chom_s["End"] < min_TF]
    df_bed_non_top = df_bed_non_top.rename(columns = {"Chromosome":"Chom", "Start":"NStart", "End":"NEnd"})
    if not df_bed_non_top.empty:
        df_bed_non_top.loc[:, "UnboundTF"] = tf
    df_bed_non_top = df_bed_non_top.drop(["Name"], axis=1)

    # Extract all the coordinates of active regulatory region that comes after the 
    # largest coordinate among the positive examples for the chromosome
    df_bed_non_bottom = df_bed_chom_s[df_bed_chom_s["Start"] > max_TF] # need to append these
    df_bed_non_bottom = df_bed_non_bottom.rename(columns = {"Chromosome":"Chom", "Start":"NStart", "End":"NEnd"})
    if not df_bed_non_bottom.empty:
        df_bed_non_bottom.loc[:, "UnboundTF"] = tf
    df_bed_non_bottom = df_bed_non_bottom.drop(["Name"], axis=1)
    
    
    # Extract coordinates of active regulatory region that may overlap with coordinates of positive examples
    df_bed_relev = df_bed_chom_s[df_bed_chom_s["Start"] <= max_TF]
    df_bed_relev = df_bed_relev[df_bed_relev["End"] >= min_TF]
    print("dimension of bed_relev", df_bed_relev.shape)
    
    
    list_cols_TF = list(df_TF_chom_s.columns.values)
    df_TF_negative = pd.DataFrame(columns=['Chom','NStart','NEnd','UnboundTF'])  # Create an empty data frame

    count = 0
    for index_b, row_b in df_bed_relev.iterrows(): # For each active regulatory region (shortlisted)
        X = row_b["Start"]
        Y = row_b["End"]

        # Gather any positive examples that fall entirely within the active regulatory region
        # (boolean mask over the sorted positives; row order is preserved)
        in_region = (df_TF_chom_s["TFStart"] >= X) & (df_TF_chom_s["TFEnd"] <= Y)
        df_TF_Inregion = df_TF_chom_s[in_region]

        if not df_TF_Inregion.empty:
            print(count)
            df_TF_Inregion_r = df_TF_Inregion.reset_index()

            beg = X
            ter = Y
            idx_count = df_TF_Inregion_r.shape[0]

            for index_InR, row_InR in df_TF_Inregion_r.iterrows():
                st = row_InR["TFStart"]
                ed = row_InR["TFEnd"]

                if index_InR == 0:  # If it's the first row
                    new_row1 = {'Chom': chom, 'NStart': beg, 'NEnd': st-1, 'UnboundTF': tf}
                    df_TF_negative = pd.concat([df_TF_negative, pd.DataFrame([new_row1])], ignore_index=True)

                if index_InR == idx_count - 1: # If it's the last row
                    new_row2 = {'Chom': chom, 'NStart': ed+1, 'NEnd': ter, 'UnboundTF': tf}
                    df_TF_negative = pd.concat([df_TF_negative, pd.DataFrame([new_row2])], ignore_index=True)
                    break
                else:  # If there are multiple binding sites within a region and if the row is not the last row
                    st_2 = int(df_TF_Inregion_r.loc[index_InR+1, "TFStart"])
                    ed_2 = int(df_TF_Inregion_r.loc[index_InR+1, "TFEnd"])
                    if st_2 > ed: # If the TF binding sites don't overlap
                        new_row = {'Chom': chom, 'NStart': ed+1, 'NEnd': st_2-1, 'UnboundTF': tf}
                        df_TF_negative = pd.concat([df_TF_negative, pd.DataFrame([new_row])], ignore_index=True)

        count += 1    
    
    # Create a dataframe that contains all the negative examples
    all_negative_dfs = [df_bed_non_top, df_TF_negative, df_bed_non_bottom]
    df_negative_all = pd.concat(all_negative_dfs).reset_index(drop=True)
    print(df_negative_all)
    end_time = time.time()
    run_time = end_time - start_time
    print("Running time for {}: {}".format(chom, run_time))
    
    frames = [df_TF_negative_all, df_negative_all]
    df_TF_negative_all = pd.concat(frames).reset_index(drop=True)
    
    print("====================")

end_time_total = time.time()
run_time_total = end_time_total - start_time_total
print(df_TF_negative_all)
print(df_TF_negative_all.shape)
print("Total running time: ", run_time_total)

chr1
  Chromosome   Start     End     Name
0       chr1  237550  237989   chr1.1
1       chr1  521310  521756   chr1.2
2       chr1  713702  714675   chr1.3
3       chr1  762022  763345   chr1.4
4       chr1  805100  805473   chr1.5
5       chr1  839802  841023   chr1.6
6       chr1  848092  848620   chr1.7
7       chr1  856280  856826   chr1.8
8       chr1  858297  862058   chr1.9
9       chr1  873457  874094  chr1.10
       Chom  TFStart    TFEnd TFName  ScoreBS StrandBS
2774   chr1  1260080  1260101    UA7     0.86        +
2980   chr1  1284764  1284785    UA7     1.32        +
4392   chr1  1711959  1711980    UA7     1.21        +
6112   chr1  2245838  2245859    UA7     2.38        +
7299   chr1  2517947  2517968    UA7     0.94        +
11097  chr1  6052803  6052824    UA7     2.50        +
12198  chr1  6453418  6453439    UA7     1.06        +
12995  chr1  6662844  6662865    UA7     1.31        +
15714  chr1  8483883  8483904    UA7     2.15        +
16092  chr1  8763513  87635

53
82
221
635
645
862
1071
1284
1376
1403
1556
1653
1723
1899
1984
2321
2581
2597
2605
3050
3071
3147
3657
3771
     Chom     NStart       NEnd UnboundTF
0    chr5      13091      13703       UA7
1    chr5      21630      22054       UA7
2    chr5      42238      42674       UA7
3    chr5      49471      49910       UA7
4    chr5      57035      57525       UA7
..    ...        ...        ...       ...
231  chr5  180671786  180672245       UA7
232  chr5  180672491  180674405       UA7
233  chr5  180687509  180689387       UA7
234  chr5  180700284  180700747       UA7
235  chr5  180714191  180714761       UA7

[236 rows x 4 columns]
Running time for chr5: 33.1910924911499
chr6
      Chromosome   Start     End     Name
25885       chr6  148248  149088   chr6.1
25886       chr6  157133  157591   chr6.2
25887       chr6  168805  169270   chr6.3
25888       chr6  183657  184081   chr6.4
25889       chr6  188714  190168   chr6.5
25890       chr6  190532  190965   chr6.6
25891       chr6  193

191
200
473
650
799
857
930
1012
1170
1242
1266
1503
1508
1524
1652
1971
2013
2027
2091
2246
2276
2336
2470
2523
3197
3220
3250
      Chom     NStart       NEnd UnboundTF
0    chr10     119987     120366       UA7
1    chr10     122247     122802       UA7
2    chr10     180503     182596       UA7
3    chr10     183007     183438       UA7
4    chr10     224136     224577       UA7
..     ...        ...        ...       ...
345  chr10  135310867  135311347       UA7
346  chr10  135333447  135334337       UA7
347  chr10  135422318  135422750       UA7
348  chr10  135460348  135461035       UA7
349  chr10  135523048  135523574       UA7

[350 rows x 4 columns]
Running time for chr10: 19.696370124816895
chr11
      Chromosome   Start     End      Name
44500      chr11  175748  176263   chr11.1
44501      chr11  180103  180673   chr11.2
44502      chr11  189571  190445   chr11.3
44503      chr11  191712  192557   chr11.4
44504      chr11  200233  201092   chr11.5
44505      chr11  205999 

106
177
233
300
306
440
528
623
828
946
960
1003
1108
1151
1347
1420
1457
1556
1642
1663
1884
1900
1945
2132
2290
2393
      Chom     NStart       NEnd UnboundTF
0    chr15   20398188   20398853       UA7
1    chr15   20561391   20562017       UA7
2    chr15   21183357   21184021       UA7
3    chr15   21927229   21927614       UA7
4    chr15   22387307   22388087       UA7
..     ...        ...        ...       ...
204  chr15  102215983  102216558       UA7
205  chr15  102264074  102264866       UA7
206  chr15  102313914  102314376       UA7
207  chr15  102351126  102351488       UA7
208  chr15  102417810  102418300       UA7

[209 rows x 4 columns]
Running time for chr15: 14.187054634094238
chr16
      Chromosome   Start     End      Name
59614      chr16   60685   61248   chr16.1
59615      chr16   88736   89619   chr16.2
59616      chr16  102799  104328   chr16.3
59617      chr16  115628  116022   chr16.4
59618      chr16  126161  126597   chr16.5
59619      chr16  126661  129523  

109
112
384
439
696
791
896
900
1166
1304
1532
1605
1630
1671
1846
1918
2020
2021
     Chom    NStart      NEnd UnboundTF
0   chr20    118345    118776       UA7
1   chr20    189572    189997       UA7
2   chr20    200139    200589       UA7
3   chr20    216489    216919       UA7
4   chr20    239552    240009       UA7
..    ...       ...       ...       ...
79  chr20  62823725  62824162       UA7
80  chr20  62854747  62855197       UA7
81  chr20  62870108  62870873       UA7
82  chr20  62886583  62887545       UA7
83  chr20  62958750  62959393       UA7

[84 rows x 4 columns]
Running time for chr20: 10.97066593170166
chr21
      Chromosome     Start       End      Name
73740      chr21   9484957   9485598   chr21.1
73741      chr21   9584754   9585228   chr21.2
73742      chr21   9590106   9590558   chr21.3
73743      chr21   9909494   9910106   chr21.4
73744      chr21   9968064   9968900   chr21.5
73745      chr21  10000906  10001330   chr21.6
73746      chr21  11108969  11109406  

In [64]:
df_TF_negative_fwd = df_TF_negative_all
df_TF_negative_all_fwd = df_TF_negative_all.reindex(columns= ['Chom', 'NStart', 'NEnd','UnboundTF', 'Strand'])
df_TF_negative_all_fwd.loc[:, "Strand"] = "+"  # Label every region as forward strand
print(df_TF_negative_all_fwd.head(10))
print(df_TF_negative_all_fwd.shape)

   Chom  NStart    NEnd UnboundTF Strand
0  chr1  237550  237989       UA7      +
1  chr1  521310  521756       UA7      +
2  chr1  713702  714675       UA7      +
3  chr1  762022  763345       UA7      +
4  chr1  805100  805473       UA7      +
5  chr1  839802  841023       UA7      +
6  chr1  848092  848620       UA7      +
7  chr1  856280  856826       UA7      +
8  chr1  858297  862058       UA7      +
9  chr1  873457  874094       UA7      +
(4208, 5)


In [65]:
df_all = df_TF_negative_all

In [66]:
# Save the result (genomic coordinates of potential negative examples on forward strand) in a csv file
fname = tf + "_negative_fwd.csv"
df_TF_negative_all_fwd.to_csv(fname, index=False)

In [67]:
# Save the result in a bed file
print(df_TF_negative_all_fwd.head(10))
df_TF_negative_all_fwd = df_TF_negative_all_fwd.drop(["UnboundTF", "Strand"], axis=1)
print(df_TF_negative_all_fwd.head(10))
fOutName = tf + "_fwd_neg.bed"
df_TF_negative_all_fwd.to_csv(fOutName, index=False, header=False, sep="\t")

   Chom  NStart    NEnd UnboundTF Strand
0  chr1  237550  237989       UA7      +
1  chr1  521310  521756       UA7      +
2  chr1  713702  714675       UA7      +
3  chr1  762022  763345       UA7      +
4  chr1  805100  805473       UA7      +
5  chr1  839802  841023       UA7      +
6  chr1  848092  848620       UA7      +
7  chr1  856280  856826       UA7      +
8  chr1  858297  862058       UA7      +
9  chr1  873457  874094       UA7      +
   Chom  NStart    NEnd
0  chr1  237550  237989
1  chr1  521310  521756
2  chr1  713702  714675
3  chr1  762022  763345
4  chr1  805100  805473
5  chr1  839802  841023
6  chr1  848092  848620
7  chr1  856280  856826
8  chr1  858297  862058
9  chr1  873457  874094


In [68]:
#df_TF_negative_all_fwd_chr1 =   df_TF_negative_all_fwd[df_TF_negative_all_fwd['Chom'] == 'chr1']
#print(df_TF_negative_all_fwd_chr1.shape)

# Check that the bed file was successfully created
gr2 = pr.read_bed(fOutName)
df2_bed = gr2.df
print(df2_bed.shape)
print(df2_bed.head(5))

(4208, 3)
  Chromosome   Start     End
0       chr1  237550  237989
1       chr1  521310  521756
2       chr1  713702  714675
3       chr1  762022  763345
4       chr1  805100  805473


## 2. Obtaining the DNA physical properties of each sequence
GB Shape: http://rohsdb.cmb.usc.edu/
"Predicted structural properties for every human genome positions are available here"

### Reminder: Goal of the project
To determine whether the sites that are bound (positive examples) can be distinguished from those that are not bound (negative examples) based on the predicted structural properties of their sequences.

In [21]:
# Create textfiles that can be directly used as input for DNAshape
# Positive examples
df_TF_fwd_m = df_TF_fwd.drop(["TFName", "ScoreBS", "StrandBS"], axis=1)
print(df_TF_fwd_m.head(10))
fout_name = "factorbookMotifPos_GBshape_" + tf
df_TF_fwd_m.to_csv(fout_name + ".txt", sep="\t", index=False, header=False)
df_TF_fwd_m.to_csv(fout_name + ".bed", sep="\t", index=False, header=False)

       Chom  TFStart    TFEnd
2774   chr1  1260080  1260101
2980   chr1  1284764  1284785
4392   chr1  1711959  1711980
6112   chr1  2245838  2245859
7299   chr1  2517947  2517968
11097  chr1  6052803  6052824
12198  chr1  6453418  6453439
12995  chr1  6662844  6662865
15714  chr1  8483883  8483904
16092  chr1  8763513  8763534


In [22]:
overall_runtime_end = time.time()
overall_runtime = overall_runtime_end - overall_runtime_start
print("Overall running time (s):", overall_runtime)

Overall running time (s): 964.6590223312378


### Remark:
We only need to run DNAshape for positive and potential negative examples whose locations were identified in part 1.

### Task 2.1:
Extract, from hg19, the actual sequences of the bound and unbound sites.

### Extract sequences of positive examples and potential negative examples: Using Galaxy
Workflow
1. For each fasta file of a chromosome (e.g. chr1)
2. Run Galaxy's "Extract Genomic DNA" using the fasta file and "factorbookMotifPos_TBP_fwd_pos.bed" which contains the positions of positive examples (where TFs are actually bound)
3. Download the output
4. Rename the output as "seqs_chr#_TBP_fwd_pos.fasta" (e.g. seqs_chr1_TBP_fwd_pos.fasta)
5. Repeat for all chromosomes
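
For comparison, the same extraction could also be done locally. The sketch below is not the Galaxy tool and was not part of the original workflow; `read_fasta`, `extract_intervals`, and the toy `chrToy` record are hypothetical, and it assumes the standard BED convention of 0-based, half-open coordinates.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {record name: sequence}."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_intervals(records, bed_rows):
    """Slice each (chrom, start, end) row out of its chromosome sequence.

    Assumes BED-style coordinates: start is 0-based, end is exclusive.
    """
    return [(chrom, start, end, records[chrom][start:end])
            for chrom, start, end in bed_rows]


# Toy input standing in for a chromosome FASTA file and a bed file.
toy_fasta = [">chrToy", "ACGTACGTAC", "GGTATAAAAG"]
toy_records = read_fasta(toy_fasta)
toy_seqs = extract_intervals(toy_records, [("chrToy", 12, 19)])
print(toy_seqs)
```

For a real chromosome, the lines would come from iterating over an open FASTA file instead of a list.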

### Task 2.2
1. Calculate the DNA physical properties of each sequence
* Using GB shape and the sequences of the positive and negative examples found in Part 1, get the DNA physical properties of each sequence
### Note:
When I checked whether the extracted sequences (20bp) agree with the PWM, none of them violated it (e.g. no 'G' occurred at a position where Prob(G) = 0).
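
A check of this kind can be sketched as follows. The 3-position `toy_pwm` and the helper `pwm_violations` are hypothetical; the real check would use the TF's own PWM and the extracted sequences.

```python
# Hypothetical 3-position PWM; each column maps base -> probability.
toy_pwm = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.9, "C": 0.0, "G": 0.0, "T": 0.1},
    {"A": 0.2, "C": 0.3, "G": 0.0, "T": 0.5},
]


def pwm_violations(seq, pwm):
    """Return (position, base) pairs where the PWM gives the base probability 0."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return [(i, base)
            for i, (base, col) in enumerate(zip(seq.upper(), pwm))
            if col.get(base, 0.0) == 0.0]


print(pwm_violations("TAT", toy_pwm))  # consistent with the PWM
print(pwm_violations("TGG", toy_pwm))  # 'G' is impossible at positions 1 and 2
```

An empty list means the sequence contains no base that the PWM rules out at its position.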

### Strengths
- Allows prediction (with relatively high accuracy) of TF binding based on physical properties relatively efficiently on a personal computer

### Limitations
- It is a heuristic
- Generalizability to TFs other than TBP and UA7 is unknown