In [2]:
libs <- c(
    'RColorBrewer',
    'ggplot2',
    'xgboost',
    'glmnet',
    'dplyr',
    'tidyr',
    'pROC',
    'ROCR',
    'stringr',
    'caret',
    'caTools'
)

for (lib in libs) {
    # Install the package if it is missing, then load it
    if (!require(lib, character.only = TRUE, quietly = TRUE)) {
        install.packages(lib, repos='http://cran.us.r-project.org')
        library(lib, character.only = TRUE)
    }
}

(.packages())

source("my_R_functions/utility_functions.R")
source("my_R_functions/stat_functions.R")
source("my_R_functions/plot_functions.R")
source("/ssd/mrichard/github/BDDS/trenadb/src/utils.R")
source("/ssd/mrichard/github/BDDS/footprints/testdb/src/dbFunctions.R")

In [3]:
load("Rdata_files/motif_class_pairs.Rdata")
head(motif.class)

motif,TF,class,family
MA0001.1,AGL3,Other Alpha-Helix,MADS
MA0002.1,RUNX1,Ig-fold,Runt
MA0003.1,TFAP2A,Zipper-Type,Helix-Loop-Helix
MA0004.1,Arnt,Basic helix-loop-helix factors (bHLH),PAS domain factors
MA0005.1,AG,Other Alpha-Helix,MADS
MA0006.1,Ahr::Arnt,Basic helix-loop-helix factors (bHLH),PAS domain factors


Now we load both datasets and combine them. Along the way, we label each one by the seed it came from (16 or 20).

In [16]:
load("/ssd/mrichard/data/all.TF.df.fimo.hint.well.seed16.annotated.9.Rdata")

In [17]:
full.16 <- all.TF.df.fimo.hint.well.annotated
rm(all.TF.df.fimo.hint.well.annotated)

In [18]:
load("/ssd/mrichard/data/all.TF.df.fimo.hint.well.seed20.annotated.9.Rdata")

In [19]:
full.20 <- all.TF.df.fimo.hint.well.annotated
rm(all.TF.df.fimo.hint.well.annotated)
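
The two tables are told apart above only by their object names. If an explicit tag were wanted inside the data, a minimal sketch on toy stand-ins might look like the following. The `seed` column name and the toy values are assumptions made for illustration; they are not part of the original data.

```r
library(dplyr)

# Toy stand-ins for full.16 and full.20 (the real tables have 39 columns)
toy.16 <- data.frame(motifname = c("MA0150.2", "MA0591.1"),
                     start = c(1677938L, 1677939L),
                     stringsAsFactors = FALSE)
toy.20 <- data.frame(motifname = c("MA0150.2", "MA0492.1"),
                     start = c(1677938L, 1828558L),
                     stringsAsFactors = FALSE)

# `seed` is a hypothetical column name used only in this sketch
toy.16 <- mutate(toy.16, seed = 16L)
toy.20 <- mutate(toy.20, seed = 20L)

# Stacking keeps every row, because the tag makes shared hits distinct
tagged <- bind_rows(toy.16, toy.20)
count(tagged, seed)
```

Note that once such a tag column exists, a hit found with both seeds is no longer an exact duplicate, so a set union would keep both copies.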

Both datasets are now loaded and need to be combined. First, a reminder of what they look like:

In [9]:
str(full.16)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	24720556 obs. of  39 variables:
 $ motifname                               : chr  "Mmusculus-jaspar2016-Nfe2l2-MA0150.2" "Mmusculus-jaspar2016-Bach1::Mafk-MA0591.1" "Hsapiens-jaspar2016-JUND(var.2)-MA0492.1" "Hsapiens-jaspar2016-ATF7-MA0834.1" ...
 $ chrom                                   : chr  "1" "1" "1" "1" ...
 $ start                                   : int  1677938 1677939 1828558 2255916 2255789 2255823 2255891 2255917 2255949 2255983 ...
 $ endpos                                  : int  1677952 1677953 1828572 2255929 2255803 2255837 2255905 2255931 2255963 2255997 ...
 $ strand                                  : chr  "+" "+" "-" "-" ...
 $ motifscore                              : num  13.35 11.71 9.73 8.03 13.72 ...
 $ pval                                    : num  1.13e-05 1.37e-05 9.18e-05 6.68e-05 1.48e-05 1.48e-05 1.48e-05 8.28e-06 1.48e-05 1.48e-05 ...
 $ sequence                                : chr  "CACTGTGACTCCGCA" "ACTGTGA

In [20]:
str(full.20)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	24720556 obs. of  39 variables:
 $ motifname                               : chr  "Mmusculus-jaspar2016-Nfe2l2-MA0150.2" "Mmusculus-jaspar2016-Bach1::Mafk-MA0591.1" "Hsapiens-jaspar2016-JUND(var.2)-MA0492.1" "Hsapiens-jaspar2016-ATF7-MA0834.1" ...
 $ chrom                                   : chr  "1" "1" "1" "1" ...
 $ start                                   : int  1677938 1677939 1828558 2255916 2255789 2255823 2255891 2255917 2255949 2255983 ...
 $ endpos                                  : int  1677952 1677953 1828572 2255929 2255803 2255837 2255905 2255931 2255963 2255997 ...
 $ strand                                  : chr  "+" "+" "-" "-" ...
 $ motifscore                              : num  13.35 11.71 9.73 8.03 13.72 ...
 $ pval                                    : num  1.13e-05 1.37e-05 9.18e-05 6.68e-05 1.48e-05 1.48e-05 1.48e-05 8.28e-06 1.48e-05 1.48e-05 ...
 $ sequence                                : chr  "CACTGTGACTCCGCA" "ACTGTGA

## Part 1: Just add them

First, let's try simply stacking the data frames and removing duplicate rows by taking their union.

In [21]:
full.data <- dplyr::union(full.16, full.20)

In [22]:
str(full.data)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	28912150 obs. of  39 variables:
 $ motifname                               : chr  "Hsapiens-jaspar2016-ZNF143-MA0088.2" "Hsapiens-jaspar2016-ZNF143-MA0088.2" "Hsapiens-jaspar2016-ZNF143-MA0088.2" "Hsapiens-jaspar2016-ZNF143-MA0088.2" ...
 $ chrom                                   : chr  "Y" "Y" "Y" "Y" ...
 $ start                                   : int  17944333 3242941 20943555 11099886 15339505 11139731 13387665 15515701 20731514 9209315 ...
 $ endpos                                  : int  17944348 3242956 20943570 11099901 15339520 11139746 13387680 15515716 20731529 9209330 ...
 $ strand                                  : chr  "-" "+" "-" "-" ...
 $ motifscore                              : num  10.56 2.24 7.84 9.65 2.07 ...
 $ pval                                    : num  6.84e-06 9.32e-05 1.71e-05 9.36e-06 9.77e-05 8.42e-05 3.57e-05 7.80e-05 1.19e-05 4.80e-05 ...
 $ sequence                                : chr  "CACCCTCGGTGCACTG" "TT
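
As a sanity check on what the union does, here is a small sketch on toy data frames (the names `a`, `b` and their values are made up for illustration). It shows that `dplyr::union` keeps one copy of each row that appears in both inputs, so the row counts satisfy `nrow(a) + nrow(b) - nrow(shared) == nrow(union)` as long as neither input has internal duplicates.

```r
library(dplyr)

# Two toy tables that share two rows (m2 and m3)
a <- data.frame(motifname = c("m1", "m2", "m3"),
                start = c(10L, 20L, 30L),
                stringsAsFactors = FALSE)
b <- data.frame(motifname = c("m2", "m3", "m4"),
                start = c(20L, 30L, 40L),
                stringsAsFactors = FALSE)

# Union stacks the rows and drops exact duplicates
u <- dplyr::union(a, b)

# Rows present in both inputs
shared <- dplyr::intersect(a, b)

# Inclusion-exclusion on the row counts
nrow(a) + nrow(b) - nrow(shared) == nrow(u)
```

Under the same no-internal-duplicates assumption, the counts reported by `str()` above (24,720,556 rows in each input and 28,912,150 in the union) would imply 20,528,962 rows identical across the two seeds.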

In [23]:
all.TF.df.fimo.hint.well.annotated <- full.data
save(all.TF.df.fimo.hint.well.annotated, file = "/ssd/mrichard/data/stacked.annotated.9.Rdata")

## Part 2: Adding Features

Alternatively, we can treat the seed-16 and seed-20 hits as separate features: we join the two data frames and keep the two sets of Wellington/Hint scores and fractions in separate columns.

In [None]:
# This will require a full outer join, where we put a 0 for any missing values
# We will also have to join on all the other columns
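
A minimal sketch of the join described in the comments above, on toy data. The column names `h_score`, `key.cols`, and the `.16`/`.20` suffixes are assumptions for illustration; the real tables have 39 columns, and the actual score and fraction column names are not shown in this notebook.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins; `h_score` is a hypothetical per-seed score column
x16 <- data.frame(motifname = c("m1", "m2"),
                  start = c(10L, 20L),
                  h_score = c(5, 7),
                  stringsAsFactors = FALSE)
x20 <- data.frame(motifname = c("m2", "m3"),
                  start = c(20L, 30L),
                  h_score = c(8, 9),
                  stringsAsFactors = FALSE)

# Join on every column that is not a per-seed score
key.cols <- c("motifname", "start")

# Full outer join: keep hits found with either seed, suffixing the score columns
joined <- full_join(x16, x20, by = key.cols, suffix = c(".16", ".20"))

# Hits missing from one seed get a 0 in that seed's score column
joined <- replace_na(joined, list(h_score.16 = 0, h_score.20 = 0))
joined
```

Naming the columns to fill explicitly in `replace_na` keeps the zero-filling confined to the score columns, so a genuine `NA` in any other column would not be overwritten.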
