## Plotting sequence logos

The code in this tutorial is released under the [MIT License](https://opensource.org/licenses/MIT). All the content in this notebook is under a [CC-BY 4.0 License](https://creativecommons.org/licenses/by/4.0/).

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../RegSeq/')
import utils
#import RegSeq

In [30]:
%matplotlib inline

# logomaker import
import logomaker

In [31]:
# First, load a binding site (energy) matrix, indexed by position.
# sep=r'\s+' splits on any whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas releases.
arraydf = pd.read_csv('../data/example_arrays/aphAAnaerodataset_alldone_with_largeMCMC194_activator',
                      sep=r'\s+', index_col='pos')
# Rename the columns to the single-letter base names that logomaker expects.
arraydf = arraydf.rename(columns={'val_A':'A','val_C':'C','val_G':'G','val_T':'T'})

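We first need to find the scaling factor that converts the energy matrix into an information logo. Empirically, binding sites
carry approximately 1 bit of information per base pair, so we choose the scaling factor for which the total information content of the matrix equals its length in base pairs.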
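
As a minimal sketch of what "information" means here, the block below computes the information of a single position as the relative entropy (in bits) between its base probabilities and the background. The helper name `position_information` is hypothetical and is not part of RegSeq or logomaker; the GC content is the value used later in this notebook.

```python
import math

def position_information(probs, background):
    """Relative entropy (bits) of one position versus the background.

    Hypothetical helper for illustration; bases are ordered A, C, G, T.
    """
    return sum(p * math.log2(p / q)
               for p, q in zip(probs, background) if p > 0)

gc = 0.508  # E. coli GC content, as used in this notebook
toy_background = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]

# A position that looks exactly like the background carries no information.
print(position_information(toy_background, toy_background))

# A position that is always 'A' carries log2(1 / 0.246), about 2.02 bits.
print(position_information([1.0, 0.0, 0.0, 0.0], toy_background))
```

Summing this quantity over every position of a binding site gives the total information that the scaling factor is tuned to match.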

In [32]:
def get_info(df):
    '''Return the total information content (in bits) of a binding site.

    df holds per-position base probabilities with columns ordered A, C, G, T.'''
    # Background base probabilities for E. coli (GC content of 50.8%), in A, C, G, T order.
    gc = .508
    background_array = np.array([(1-gc)/2, gc/2, gc/2, (1-gc)/2])
    # Add a small value so that no probability is exactly zero (log2(0) is undefined).
    df = df + 1e-7
    # Sum p*log2(p/q) over all positions and bases.
    return np.sum(df.values*np.log2(df.values/background_array))

In [33]:
def get_beta_for_effect_df(effect_df, target_info,
                           min_beta=.001, max_beta=100, num_betas=1000):
    '''Find the scaling factor (beta) to use when displaying sequence logos.

    From empirical results, most binding sites have approximately
    1 bit of information per base pair.'''
    # Candidate betas, evenly spaced on a log scale between min_beta and max_beta.
    betas = np.exp(np.linspace(np.log(min_beta),np.log(max_beta),num_betas))
    infos = np.zeros(len(betas))
    for i, beta in enumerate(betas):
        # Scale the matrix by beta, convert it to probabilities, and record its information.
        prob_df = logomaker.transform_matrix(df=beta*effect_df,from_type='weight',to_type='probability')
        infos[i] = get_info(prob_df)
    # Return the beta whose total information is closest to the target.
    i = np.argmin(np.abs(infos-target_info))
    beta = betas[i]
    return beta
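
To illustrate the grid search above without loading any data, here is a small standard-library sketch on a made-up 3-position matrix. It assumes that weights map to probabilities as p proportional to q * 2**(beta * w); that mapping, the toy matrix, and all `toy_` names are assumptions made for illustration, and logomaker's own transformation may differ in detail.

```python
import math

def weights_to_probs(weights, background, beta):
    """Assumed mapping for illustration: p_c proportional to q_c * 2**(beta * w_c)."""
    unnorm = [q * 2 ** (beta * w) for w, q in zip(weights, background)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def total_information(matrix, background, beta):
    """Total relative entropy (bits) of the scaled matrix versus the background."""
    info = 0.0
    for weights in matrix:
        probs = weights_to_probs(weights, background, beta)
        info += sum(p * math.log2(p / q) for p, q in zip(probs, background))
    return info

# Made-up weights for three positions (columns A, C, G, T) and a uniform background.
toy_uniform = [0.25, 0.25, 0.25, 0.25]
toy_matrix = [[1.0, 0.0, -0.5, -0.5],
              [-1.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 2.0, -2.0]]

# Log-spaced grid of candidate betas, mirroring the defaults used above.
lo, hi, n = math.log(0.001), math.log(100), 1000
toy_betas = [math.exp(lo + i * (hi - lo) / (n - 1)) for i in range(n)]

# Target: 1 bit per position, so the target equals the number of positions.
toy_target = len(toy_matrix)
toy_best = min(toy_betas,
               key=lambda b: abs(total_information(toy_matrix, toy_uniform, b) - toy_target))
print(toy_best, total_information(toy_matrix, toy_uniform, toy_best))
```

A very small beta flattens every position toward the background (information near zero), while a large beta sharpens each position toward its highest-weight base, so scanning beta on a log scale covers both extremes.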

In [None]:
# Find the scaling factor. At 1 bit per base pair, the target information
# equals the number of positions in the binding site.
target_info = len(arraydf.index)
beta = get_beta_for_effect_df(arraydf,target_info)

# Use logomaker to convert the scaled energy matrix to an information matrix for plotting.
binding_info = logomaker.transform_matrix(df=beta*arraydf,from_type='weight',to_type='information')

In [None]:
binding_logo = logomaker.Logo(binding_info,
                         font_name='Stencil Std',
                         vpad=.1,
                         width=.8)

# style using Logo methods
binding_logo.style_spines(visible=False)
binding_logo.style_spines(spines=['left', 'bottom'], visible=True)
binding_logo.style_xticks(rotation=90, fmt='%d', anchor=0)

# style using Axes methods
binding_logo.ax.set_ylabel("Information (bits)", labelpad=-1)
binding_logo.ax.xaxis.set_ticks_position('none')
binding_logo.ax.xaxis.set_tick_params(pad=-1)