In [15]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 1

# Set up cachedir
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Turn on the watermark
%reload_ext watermark
%watermark -a "Justin M Fear" -u -d -v

# Load ipycache extension
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')


Justin M Fear 
last updated: 2016-10-05 

CPython 3.5.2
IPython 5.1.0


In the last S2 cell RNAi project meeting we reviewed Yijie's network model. One concern I had was the addition of an edge from a transcription factor to a gene solely because that gene is highly correlated with one of the factor's known targets. For example:

```
TFa -> A

A and B are highly correlated...

Add edge TFa ->B
```

This is a reasonable heuristic, but correlation alone does not guarantee that `TFa` regulates `B`. We decided that adding motif information would be important: we would then add `TFa->B` only when `B` has a motif for `TFa`.
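
The motif-gated rule can be sketched as follows. This is only an illustration of the idea, not Yijie's model: the function name, the data structures, and the toy genes are all hypothetical.

```python
def propose_edges(tf_edges, correlated_pairs, motif_hits):
    """Propagate TF -> gene edges to correlated genes, gated by motif evidence.

    tf_edges: set of (tf, gene) edges already in the network.
    correlated_pairs: set of frozenset({gene_a, gene_b}) for highly correlated genes.
    motif_hits: dict mapping gene -> set of TFs with a motif near that gene.
    """
    new_edges = set()
    for tf, gene in tf_edges:
        for pair in correlated_pairs:
            if gene not in pair:
                continue
            for partner in pair - {gene}:
                # Only add the edge when the partner carries a motif for this TF
                has_motif = tf in motif_hits.get(partner, set())
                if has_motif and (tf, partner) not in tf_edges:
                    new_edges.add((tf, partner))
    return new_edges


# Toy data (hypothetical): A correlates with both B and C,
# but only B has a TFa motif.
tf_edges = {('TFa', 'A')}
correlated_pairs = {frozenset({'A', 'B'}), frozenset({'A', 'C'})}
motif_hits = {'B': {'TFa'}, 'C': {'TFb'}}

new_edges = propose_edges(tf_edges, correlated_pairs, motif_hits)
print(new_edges)
```

With this toy input only `TFa -> B` is proposed; `C` is correlated with `A` but has no `TFa` motif, so no edge is added for it.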

Lee had started generating a TF motif list/weights, but Brian wanted him to stop and have me do it. The basic analysis plan is:

1. Download annotated TF motifs from various online sources.
2. Map motifs to the genome and identify motifs within some range of the TSS. 
3. Do the same thing across Drosophila species and calculate a conservation score (see DSX motif paper).
4. Build a weight matrix where each row is a gene and each column is a transcription factor. Values can be either binary or weights indicating whether the TF motif falls within the regulatory region of the gene.
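
Step 4 could look roughly like this in pandas. It is a sketch under assumptions: the long-format `hits` table, its column names, and the scores are made up for illustration and are not real FIMO output.

```python
import pandas as pd

# Hypothetical motif hits: one row per (gene, TF) motif occurrence that falls
# inside the gene's regulatory region. Values are invented for illustration.
hits = pd.DataFrame({
    'gene': ['A', 'A', 'A', 'B'],
    'tf': ['TFa', 'TFa', 'TFb', 'TFa'],
    'score': [12.5, 9.0, 8.0, 10.1],
})

# Binary matrix: 1 if the TF has at least one motif hit for the gene, else 0.
binary = pd.crosstab(hits['gene'], hits['tf']).clip(upper=1)

# Weighted matrix: best hit score per (gene, TF); 0 where there is no hit.
weights = hits.pivot_table(index='gene', columns='tf', values='score',
                           aggfunc='max', fill_value=0)

print(binary)
print(weights)
```

Taking the maximum score per pair is just one possible choice of weight; a conservation score from step 3 could be substituted in the same layout.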

Lee has looked for sources of motifs and concluded that the MEME website is a good source. Here is an email from Lee updating me on this information:

    Justin, 

    My R scripts seem to be not super-helpful.  There are just about how I defined TSS regions (+ first introns).  I assigned a motif to a gene when a motif is within the range between 1kb upstream to min(500bp downstream or first intron end).  Just briefly go over for fun.


    You can download the weight matrices from MEME, but I attach here, too. 

    OnTheFly data, used very weird IDs, which are mixture of Swissprot and others.  It is very nagging to deal with it, so I actually crawled their website (which also in in the Handling_motif_data.R).  I attach the ID conversion matrix.  This will be very useful.


    FIMO results are quite large, and I don’t think that you will use them.  But the links follow.
    https://www.dropbox.com/s/czse5ur5md8wm1u/OnTheFly_2014_p0.0001.txt?dl=0
    https://www.dropbox.com/s/hq812lsv6hijqet/fly_factor_survey_p0.0001.txt?dl=0

    An example of my FIMO code is below
    cd /data/leehang/motif/fly_factor_survey; fimo --qv-thresh --thresh 0.05 ./fly_factor_survey.meme ~/Annotation/Dmel.FB6_06.fa


    Lee
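
The window Lee describes in the email above (1 kb upstream of the TSS to the nearer of 500 bp downstream or the end of the first intron) could be computed along these lines. This is a minimal sketch, not Lee's R code: the function name, the argument names, and the strand handling are my assumptions.

```python
def regulatory_window(tss, strand, first_intron_end=None,
                      upstream=1000, downstream=500):
    """Return (start, end) genomic coordinates of the motif search window.

    tss: coordinate of the transcription start site.
    strand: '+' or '-'.
    first_intron_end: coordinate where the first intron ends in the direction
        of transcription, or None if the gene has no intron.
    Coordinates are not clamped to the chromosome bounds.
    """
    if strand == '+':
        down = tss + downstream
        if first_intron_end is not None:
            # Stop at whichever comes first along the transcript
            down = min(down, first_intron_end)
        return tss - upstream, down
    if strand == '-':
        # On the minus strand, downstream means decreasing coordinates
        down = tss - downstream
        if first_intron_end is not None:
            down = max(down, first_intron_end)
        return down, tss + upstream
    raise ValueError("strand must be '+' or '-'")


print(regulatory_window(5000, '+'))        # no intron: full 500 bp downstream
print(regulatory_window(5000, '+', 5200))  # first intron ends before 500 bp
print(regulatory_window(5000, '-', 4800))  # same gene layout on the minus strand
```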

I think I am going to approach this cleanly so that I know exactly where and how files were downloaded.

In [23]:
# Imports
import os
import re
import tarfile
from tempfile import mkstemp
from time import sleep
from urllib.request import urlretrieve, urlopen
from collections import defaultdict

import pandas as pd

from Bio import motifs

# My library
import meme

In [17]:
# Download MEME motif database
if not os.path.exists('../../data/external/meme/motif_databases/FLY/fly_factor_survey.meme'):
    # Download the archive to a temporary file. mkstemp returns an open
    # file descriptor that must be closed, and the suffix needs its dot.
    fd, tmp_path = mkstemp(suffix='.tgz')
    os.close(fd)
    urlretrieve(url='http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.12.tgz', filename=tmp_path)

    # Make sure output dir is there
    os.makedirs('../../data/external/meme', exist_ok=True)

    # Extract only the Fly data
    def fly(members):
        for tarinfo in members:
            if 'FLY' in tarinfo.name:
                yield tarinfo

    # Open the tar and extract; the context manager closes it even on error
    with tarfile.open(tmp_path) as tar:
        tar.extractall(path='../../data/external/meme', members=fly(tar))

    # Clean up
    os.unlink(tmp_path)

In [18]:
%%cache -s flyfactory.pkl flyFactoryTFS
# Verify Meme downloads
flyFactoryTFS = meme.memeFile('../../data/external/meme/motif_databases/FLY/fly_factor_survey.meme')
# According to the MEME website, the Fly Factor Survey database has 656 motifs; this will error if not
assert flyFactoryTFS.count() == 656
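
As an independent cross-check on the count, each motif record in a MEME-format text file begins with a line starting with `MOTIF`, so records can be counted without a parser. The sketch below assumes that layout; `count_meme_motifs` and the toy records are hypothetical, and it says nothing about how `meme.memeFile` works internally.

```python
def count_meme_motifs(lines):
    """Count motif records in MEME-format text by their 'MOTIF' header lines."""
    return sum(1 for line in lines if line.startswith('MOTIF '))


# Toy MEME-format text with two made-up motifs
example = """MEME version 4

ALPHABET= ACGT

MOTIF example_1 toyA
letter-probability matrix: alength= 4 w= 2 nsites= 10 E= 0
0.7 0.1 0.1 0.1
0.1 0.1 0.1 0.7

MOTIF example_2 toyB
letter-probability matrix: alength= 4 w= 1 nsites= 10 E= 0
0.25 0.25 0.25 0.25
"""

n_motifs = count_meme_motifs(example.splitlines())
print(n_motifs)
```

For a real file, passing an open file handle works the same way, since iterating over it yields lines.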

In [36]:
%%cache -s onthefly.pkl onTheFlyTFS otToFbgn FbgnToOt
# Map OnTheFly to FBgn
FAILED = []

# The OnTheFly motifs are not as straightforward because they use their own identifiers. There is an added step to query their website and get the FBgn value.
URL = 'https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID={0}'
def map_onthefly(tf):
    # Build the query outside the retry loop so `site` is always defined
    # when a failure is recorded below
    query = '_'.join(tf.name.split('_')[:-1])
    site = URL.format(query)
    attempts = 1
    while attempts < 4:
        try:
            with urlopen(site) as fh:
                page = fh.read().decode('utf-8')
                return re.findall(r'FBgn\d+', page)[0]
        except Exception:
            # Network error or no FBgn on the page; wait and retry
            attempts += 1
        sleep(2)

    # FAILED is only appended to, so no `global` declaration is needed
    FAILED.append((tf.id, tf.name, site))
    return None
    
onTheFlyTFS = meme.memeFile('../../data/external/meme/motif_databases/FLY/OnTheFly_2014_Drosophila.meme')
# According to the MEME website, the OnTheFly database has 608 motifs; this will error if not
assert onTheFlyTFS.count() == 608

# Get a list of all the IDs
keys = list(onTheFlyTFS.keys())

otToFbgn = {}
FbgnToOt = defaultdict(list)

for key in keys:
    fbgn = map_onthefly(onTheFlyTFS[key][0])
    if fbgn is not None:
        otToFbgn[key] = fbgn
        FbgnToOt[fbgn].append(key)
    else:
        otToFbgn[key] = None
        FbgnToOt['None'].append(key)
        
# Print Failures
for f in FAILED:
    print(' '.join(f))

OTF0208  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0160  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0223  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0067  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0408  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0415  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0388  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0339  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0515  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0279  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=
OTF0237  https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=


The above IDs had no associated protein and could not be mapped to an FBgn. 

Lee has done this process separately so I want to compare results and make sure things look the same.

In [37]:
# Import Lee's web scraper results
lee = pd.read_csv('../../data/lee/OnTheFly_ID_conversion.txt', sep='\t')
lee.head()

Unnamed: 0,motif,summary,uniprot,url,id
0,OTF0001.1,7UP1_DROME_B1H,7UP1_DROME,https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-b...,FBgn0003651
1,OTF0002.1,A0AQF9_DROME_B1H,A0AQF9_DROME,https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-b...,FBgn0034599
2,OTF0003.1,A0JQ60_DROME_SELEX,A0JQ60_DROME,https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-b...,FBgn0000567
3,OTF0003.2,A0JQ60_DROME_DNaseI,A0JQ60_DROME,https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-b...,FBgn0000567
4,OTF0004.1,A1A6R5_DROME_B1H,A1A6R5_DROME,https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-b...,FBgn0004914


Below is a list of OTF motifs whose FBgn differs between Lee's run and mine. Following Lee's URLs, I can see that OnTheFly has since updated the FBgn, so it no longer matches what Lee recorded.

In [38]:
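# Compare web scraper results
for row in lee.to_records():
    # Lee's motif IDs carry a version suffix (e.g. OTF0001.1); my keys do not.
    # Report any motif missing from my mapping or mapped to a different FBgn.
    if otToFbgn.get(row.motif.split('.')[0]) != row.id:
        print(row.motif, row.id, row.url)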

OTF0026.1 FBgn0262656 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=A8VEM3_DROME
OTF0031.1 FBgn0267033 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=A9YHI4_DROME
OTF0037.1 FBgn0264442 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=ABRU_DROME
OTF0043.1 FBgn0267978 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=APTE_DROME
OTF0043.2 FBgn0267978 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=APTE_DROME
OTF0045.1 FBgn0264075 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=ARNT_DROME
OTF0061.1 FBgn0266411 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=B7Z0S3_DROME
OTF0063.1 FBgn0267337 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=B7Z0U8_DROME
OTF0072.1 FBgn0283451 https://bhapp.c2b2.columbia.edu/OnTheFly/cgi-bin/protein_entry.php?protein_ID=BRC1