# 11-mer
The goal of this notebook is to reproduce the 11-mer model.
The plotting notebook in the original repository shows that the 11-mer model is the best-performing Markov model. The config file for that model in the results folder (reproduced below) shows that it is a bidirectional Markov model of order 5. Assuming the model conditions on 5 bases on each side of the predicted position, the window spans 5 + 1 + 5 = 11 bases, which would explain the name.

In this notebook the data is not split, i.e. the whole dataset is used for both training and testing.

```
CONFIG
├── datamodule
│   └── _target_: src.datamodules.motif_datamodule.MotifDataModule                                                                         
│       _recursive_: false                                                                                                                 
│       dataset:                                                                                                                           
│         _target_: src.datamodules.dna_datasets.CSVDataset                                                                                
│       data:                                                                                                                              
│         train_file: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half_lif
│         test_file: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half_life
│         seq_position: UTR3_seq                                                                                                           
│       transforms:                                                                                                                        
│         _target_: src.datamodules.sequence_encoders.SequenceDataEncoder                                                                  
│         seq_len: 300                                                                                                                     
│         total_len: 303                                                                                                                   
│         mask_rate: 0.1                                                                                                                   
│       test_transforms:                                                                                                                   
│         _target_: src.datamodules.sequence_encoders.RollingMasker                                                                        
│         mask_stride: 50                                                                                                                  
│         frame: 0                                                                                                                         
│       batched_dataset: true                                                                                                              
│       batch_size: 1                                                                                                                      
│       train_val_test_split:                                                                                                              
│       - 55000                                                                                                                            
│       - 5000                                                                                                                             
│       - 10000                                                                                                                            
│       num_workers: 16                                                                                                                    
│       pin_memory: true                                                                                                                   
│       persistent_workers: true                                                                                                           
│                                                                                                                                          
├── model
│   └── _target_: src.models.baseline.markov_model.MarkovModel                                                                             
│       halflife_df_path: /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/Scer_half
│       markov_matrix_path: /s/project/semi_supervised_multispecies/Downstream/NearestNeighbour/markov_bimatrix_all.npy                    
│       order: 5                                                                                                                           
│       bidirectional: true                                                                                                                
│                                                                                                                                          
├── callbacks
│   └── {}                                                                                                                                 
│                                                                                                                                          
├── trainer
│   └── _target_: pytorch_lightning.Trainer                                                                                                
│       gpus: 1                                                                                                                            
│       min_epochs: 1                                                                                                                      
│       max_epochs: 50                                                                                                                     
│       resume_from_checkpoint: null                                                                                                       
│                                                                                                                                          
├── original_work_dir
│   └── /data/nasif12/home_if12/gankin/motif-modeling                                                                                      
├── data_dir
│   └── /s/project/semi_supervised_multispecies/all_fungi_reference/fungi/Annotation/Sequences/AAA_Concatenated/                           
├── print_config
│   └── True                                                                                                                               
├── ignore_warnings
│   └── True                                                                                                                               
├── seed
│   └── None                                                                                                                               
├── name
│   └── default                                                                                                                            
├── ckpt_path
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-10/12-29-46/motif-training/3vvsocva/checkpoints/epoch=49-s
├── base_ssm
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-07-29/15-54-24/motif-training/3dvk81nk/checkpoints/epoch=49-s
├── base_ssm_frame
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-07-30/19-35-37/motif-training/1iqkna36/checkpoints/epoch=49-s
├── spec_sacc_schizzo_out
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-10-15/14-51-12/motif-training/1yesuk16/checkpoints/epoch=49-s
├── spec_on_all
│   └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-03/22-02-00/motif-training/20p1vu1v/checkpoints/epoch=49-s
└── spec_sacc_out
    └── /s/project/semi_supervised_multispecies/dgbackup/outputs/outputs/2022-11-10/12-29-46/motif-training/3vvsocva/checkpoints/epoch=49-s
```
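The `order: 5` and `bidirectional: true` entries in the config above suggest that each position is predicted from the 5 bases to its left and the 5 bases to its right. A minimal sketch of that windowing follows. The helper name `context_window` and the choice to return `None` near the sequence ends are assumptions for illustration, not the behaviour of the original `MarkovModel`.

```python
def context_window(seq, pos, order=5):
    """Split the (2*order + 1)-mer centred on `pos` into (left, center, right).

    Hypothetical helper: returns None when the window would run past either
    end of the sequence (edge handling in the original code is not shown).
    """
    if pos < order or pos + order >= len(seq):
        return None
    left = seq[pos - order:pos]
    right = seq[pos + 1:pos + 1 + order]
    return left, seq[pos], right


seq = "AAAAACGGGGGT"
left, center, right = context_window(seq, 5)
print(left, center, right)               # the two 5-base contexts and the centre base
print(len(left) + 1 + len(right))        # window length for order 5
```

With `order=5` the window always covers 11 bases, so one table of 11-mer counts is enough to estimate the distribution of the centre base.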

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

import sys, os
sys.path.insert(0, '../..')

import gc
import pickle    # needed for the pickle.load / pickle.dump calls below
import pysam
import pandas as pd
import re
import torch
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt
import numpy as np


import helpers.train_eval as train_eval    #train and evaluation
import helpers.misc as misc                #miscellaneous functions

import encoding_utils.sequence_encoders as sequence_encoders
import encoding_utils.sequence_utils as sequence_utils
from models.spec_dss import DSSResNet, DSSResNetEmb, SpecAdd
from models.baseline.markov_model import *

from Bio import SeqIO

ModuleNotFoundError: No module named 'markov_model'

# Data

In [3]:
# load train split
with open('train_split.pickle', 'rb') as f:
    train_df = pickle.load(f)
train_df

Unnamed: 0,3-UTR
0,ATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATAT...
1,GGTTGCCGGGGGTAGGGGTGGGGCCACACAAATCTCCAGGAGCCAC...
2,GGCAGCCCATCTGGGGGGCCTGTAGGGGCTGCCGGGCTGGTGGCCA...
3,CCCACCTACCACCAGAGGCCTGCAGCCTCCCACATGCCTTAAGGGG...
4,TGGCCGCGGTGAGGTGGGTTCTCAGGACCACCCTCGCCAAGCTCCA...
...,...
12688,GGCTCCCACAGGCACCAGCAAAACAACGGATGAATGTAGCCCTTCC...
12689,AGCATGAAGACTTTCTGAAACCTGCCCTAGAGCTGGGATATTGTTT...
12690,GTTTCTGAGTGGCGGAGTGGCCAAACCCTAGAGCTAGCAGTTCCCA...
12691,ACAGTGTGCCAAACACCAGCTAAACCAAGAGAGAAAGCAAGAAACT...


In [4]:
# load test split
with open('test_split.pickle', 'rb') as f:
    test_df = pickle.load(f)
test_df

Unnamed: 0,3-UTR
0,GAGTGAATAAAATTGGACTTTGTTTAAAATAAGTGAATAAGCGATA...
1,GCTAGACATGGCAGAGATGAGGAGGTTTGGCACAGAAAACATAGCC...
2,GAATCACACAGAGTCTTCTGTAGGGGTATGGTGCGCCGCATGACAT...
3,AATGTGATTCCTTTGAAGAGGAAAATGAATAATACATTGAATTAGA...
4,TTTAAGTGGCTATGGGTATTTCTTTCATACTTTATTAAAGTATCAA...
...,...
5436,AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC...
5437,AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC...
5438,GCCTACTTCATCTCAGGACCCGCCCAAGAGTGGCCGCGGCTTTGGG...
5439,TTGTCAGTCTGTCTGCTCAGGACACAAGAACTAAGGGGCAACAAAT...


In [5]:
# load the full dataset (train and test combined)
with open('full_df.pickle', 'rb') as f:
    full_df = pickle.load(f)
full_df

Unnamed: 0,3-UTR
0,ATCTTATATAACTGTGAGATTAATCTCAGATAATGACACAAAATAT...
1,GGTTGCCGGGGGTAGGGGTGGGGCCACACAAATCTCCAGGAGCCAC...
2,GGCAGCCCATCTGGGGGGCCTGTAGGGGCTGCCGGGCTGGTGGCCA...
3,CCCACCTACCACCAGAGGCCTGCAGCCTCCCACATGCCTTAAGGGG...
4,TGGCCGCGGTGAGGTGGGTTCTCAGGACCACCCTCGCCAAGCTCCA...
...,...
18129,AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC...
18130,AGCAAGCATTGAAAATAATAGTTATTGCATACCAATCCTTGTTTGC...
18131,GCCTACTTCATCTCAGGACCCGCCCAAGAGTGGCCGCGGCTTTGGG...
18132,TTGTCAGTCTGTCTGCTCAGGACACAAGAACTAAGGGGCAACAAAT...


# Training: Full 

In [7]:
# count the frequencies of all k-mers up to length 11 on the full dataset
kmer_all = KmerCount(11,pseudocount=0.1)
kmer_all.compute_counts(full_df['3-UTR'])
kmer_all.kmer_counts_dict

100%|██████████| 18134/18134 [07:59<00:00, 37.78it/s] 


{0: array([0.1]),
 1: array([9013620.1, 6807717.1, 6902860.1, 9775298.1]),
 2: array([2924303.1, 1554187.1, 2207057.1, 2312999.1, 2214433.1, 1878296.1,
         370284.1, 2343464.1, 1851765.1, 1480330.1, 1807635.1, 1762154.1,
        2017858.1, 1891017.1, 2511994.1, 3353585.1]),
 3: array([1136671.1,  431973.1,  611269.1,  737142.1,  569283.1,  381559.1,
          88735.1,  514463.1,  647192.1,  465670.1,  570100.1,  523958.1,
         580620.1,  381657.1,  570937.1,  779674.1,  520142.1,  464987.1,
         674412.1,  551769.1,  596478.1,  528491.1,  113075.1,  640002.1,
          73943.1,   89721.1,  104954.1,  101630.1,  375283.1,  544382.1,
         732703.1,  690962.1,  608999.1,  310876.1,  526084.1,  403753.1,
         461797.1,  445595.1,   88038.1,  484615.1,  508182.1,  414860.1,
         498847.1,  385489.1,  380478.1,  322883.1,  536467.1,  522202.1,
         656935.1,  345397.1,  393584.1,  619292.1,  585750.1,  521239.1,
          80074.1,  703397.1,  621104.1,  508346.1,
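The printed dictionary maps each length k to an array of 4^k counts, each offset by the pseudocount of 0.1. The sketch below shows one way such a table can be built. It is not the source of `KmerCount`: the function names, the use of plain lists, the A, C, G, T = 0..3 ordering with the first base as most significant digit, and the skipping of k-mers that contain other symbols are all assumptions. The ordering is at least consistent with the output above, where index 6 of the 2-mer array (which would be "CG") is by far the smallest entry, as expected from CpG depletion in human sequence.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed ordering


def kmer_index(kmer):
    """Base-4 index of a k-mer (first base most significant); None for non-ACGT symbols."""
    idx = 0
    for base in kmer:
        if base not in BASE_INDEX:
            return None
        idx = idx * 4 + BASE_INDEX[base]
    return idx


def count_kmers(sequences, max_k, pseudocount=0.1):
    """Return {k: list of 4**k counts} for k = 0..max_k, initialised to the pseudocount."""
    counts = {k: [pseudocount] * 4 ** k for k in range(max_k + 1)}
    for seq in sequences:
        for k in range(1, max_k + 1):
            for start in range(len(seq) - k + 1):
                idx = kmer_index(seq[start:start + k])
                if idx is not None:
                    counts[k][idx] += 1
    return counts


counts = count_kmers(["ACGT", "CGCG"], max_k=2)
print(counts[1])       # single-base counts in A, C, G, T order
print(counts[2][6])    # count for "CG" (index 1*4 + 2)
```

This pure-Python loop is far slower than a vectorised implementation and is only meant to make the indexing explicit.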

In [8]:
# save the KmerCount object (including its counts dictionary) as a pickle file
with open('kmer_all.pickle', 'wb') as f:
    pickle.dump(kmer_all, f)

In [9]:
# count the frequencies of all k-mers up to length 11 on the train split
kmer_train = KmerCount(11,pseudocount=0.1)
kmer_train.compute_counts(train_df['3-UTR'])
kmer_train.kmer_counts_dict

100%|██████████| 12693/12693 [04:53<00:00, 43.24it/s] 


{0: array([0.1]),
 1: array([6115943.1, 4807743.1, 4852448.1, 6638040.1]),
 2: array([1955476.1, 1072834.1, 1538437.1, 1538704.1, 1539114.1, 1358977.1,
         267620.1, 1641116.1, 1284252.1, 1056040.1, 1301430.1, 1210020.1,
        1333556.1, 1317061.1, 1740695.1, 2246149.1]),
 3: array([755896.1, 292604.1, 416567.1, 485450.1, 387560.1, 271294.1,
         62856.1, 351028.1, 445262.1, 329380.1, 407538.1, 356158.1,
        378904.1, 261137.1, 385463.1, 513122.1, 353668.1, 329396.1,
        479964.1, 373869.1, 425717.1, 389397.1,  83303.1, 460364.1,
         52870.1,  65583.1,  77115.1,  72023.1, 253526.1, 390591.1,
        523362.1, 473539.1, 413700.1, 219165.1, 374918.1, 274971.1,
        324200.1, 325146.1,  63909.1, 342577.1, 360865.1, 302439.1,
        365814.1, 272121.1, 254147.1, 227224.1, 376002.1, 352560.1,
        431214.1, 230984.1, 265814.1, 403726.1, 400877.1, 372057.1,
         57271.1, 486441.1, 424343.1, 357353.1, 449526.1, 509086.1,
        446621.1, 437695.1, 455110.1,

In [10]:
# save the KmerCount object (including its counts dictionary) as a pickle file
with open('kmer_train_split.pickle', 'wb') as f:
    pickle.dump(kmer_train, f)

In [20]:
mkv_all = MarkovModel(kmer_all, markov_matrix_path="markov_model_all.npy", order=5, bidirectional=True, sequences=sequences)

In [21]:
mkv_all.mkv.compile_from_counts()

  def impute_for_seq(self, seq, order=None):


In [22]:
mkv_all.mkv.markov_matrix

array([[[0.27734646, 0.20947147, 0.21239899, 0.30078307],
        [0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ]],

       [[0.38744431, 0.19404514, 0.2206011 , 0.19790945],
        [0.26009011, 0.22973594, 0.280379  , 0.22979495],
        [0.33202355, 0.04819832, 0.30966175, 0.31011638],
        ...,
        [       nan,        nan,        nan,        nan],
        [       nan,        nan,        nan,        nan],
        [       nan,        nan,        nan,        nan]],

       [[0.48337845, 0.12624776, 0.1831624 , 0.20721139],
        [0.39164257, 0.18935866, 0.20368593, 0.21531284],
        [0.38705476, 0.20699629, 0.23225393, 0.17369502],
        ...,
        [       nan,        nan,        nan,        nan],
        [       nan,        n
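The printed matrix can be checked against the k-mer counts shown earlier. Its first row matches the normalised 1-mer counts, and the first row of the second block (0.3874, ...) equals AAA / (AAA + ACA + AGA + ATA) computed from the printed 3-mer counts. This suggests that block `o` holds P(center | `o` bases left, `o` bases right), with the context row index given by `left * 4**o + right`. The sketch below implements that reading. It is an interpretation of the output, not the source of `compile_from_counts`, and the function name and list-based layout are assumptions. The all-NaN rows appear to be unused padding rows for contexts that do not exist at lower orders, which this sketch does not reproduce.

```python
def bidirectional_probs(counts, order):
    """P(center | left, right) from a flat list of (2*order + 1)-mer counts.

    Assumes base-4 k-mer indexing with A, C, G, T = 0..3 and the first base
    as the most significant digit. Returns one row of 4 probabilities per
    (left, right) context, ordered by left * 4**order + right.
    """
    n_ctx = 4 ** order
    table = []
    for left in range(n_ctx):
        for right in range(n_ctx):
            # counts of left + center + right for the four possible centres
            row = [counts[(left * 4 + center) * n_ctx + right] for center in range(4)]
            total = sum(row)  # never zero when a positive pseudocount is used
            table.append([c / total for c in row])
    return table


# 1-mer counts copied from the KmerCount output above
one_mer_counts = [9013620.1, 6807717.1, 6902860.1, 9775298.1]
print(bidirectional_probs(one_mer_counts, order=0)[0])  # first entry is about 0.2773
```

For `order=5` the input would be the 4^11 counts of all 11-mers, giving 4^10 context rows.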

In [23]:
mkv_all.test()

# Reading sequences and filtering motifs


In [6]:
fasta_fa = "../../Homo_sapiens_3prime_UTR.fa"
species_list = "../../240_species.txt"

seq_df = pd.read_csv(fasta_fa + '.fai', header=None, sep='\t', usecols=[0], names=['seq_name'])
seq_df['species_name'] = seq_df.seq_name.apply(lambda x:x.split(':')[1])
species_encoding = pd.read_csv(species_list, header=None).squeeze().to_dict()
species_encoding = {species:idx for idx,species in species_encoding.items()}
species_encoding['Homo_sapiens'] = species_encoding['Pan_troglodytes']
seq_df['species_label'] = seq_df.species_name.map(species_encoding)
seq_df

Unnamed: 0,seq_name,species_name,species_label
0,ENST00000641515.2_utr3_2_0_chr1_70009_f:Homo_s...,Homo_sapiens,181
1,ENST00000616016.5_utr3_13_0_chr1_944154_f:Homo...,Homo_sapiens,181
2,ENST00000327044.7_utr3_18_0_chr1_944203_r:Homo...,Homo_sapiens,181
3,ENST00000338591.8_utr3_11_0_chr1_965192_f:Homo...,Homo_sapiens,181
4,ENST00000379410.8_utr3_15_0_chr1_974576_f:Homo...,Homo_sapiens,181
...,...,...,...
18129,ENST00000303766.12_utr3_11_0_chrY_22168542_r:H...,Homo_sapiens,181
18130,ENST00000250831.6_utr3_11_0_chrY_22417604_f:Ho...,Homo_sapiens,181
18131,ENST00000303728.5_utr3_4_0_chrY_22514071_f:Hom...,Homo_sapiens,181
18132,ENST00000382407.1_utr3_0_0_chrY_24045793_r:Hom...,Homo_sapiens,181
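The cell above derives the species from the part of each sequence name after the colon and gives `Homo_sapiens` the label of `Pan_troglodytes`, presumably because human is not in the species list. The same logic is sketched below without pandas so that it runs on toy inputs. The function names, the `alias` parameter, and the example names are hypothetical.

```python
def build_species_encoding(species_names, alias=None):
    """Map each species name to its position in the list.

    `alias` maps a species missing from the list to a stand-in species
    whose label it should reuse (hypothetical parameter).
    """
    encoding = {name: idx for idx, name in enumerate(species_names)}
    for missing, stand_in in (alias or {}).items():
        encoding[missing] = encoding[stand_in]
    return encoding


def species_of(seq_name):
    """Species is the part of the FASTA record name after the colon."""
    return seq_name.split(":")[1]


# toy inputs, not taken from the real files
species_names = ["Mus_musculus", "Pan_troglodytes"]
seq_names = [
    "ENST0001_utr3_0_0_chr1_100_f:Homo_sapiens",
    "ENSX0002_utr3_0_0_chr2_200_r:Pan_troglodytes",
    "ENSM0003_utr3_0_0_chr3_300_f:Mus_musculus",
]

encoding = build_species_encoding(species_names, alias={"Homo_sapiens": "Pan_troglodytes"})
labels = [encoding[species_of(name)] for name in seq_names]
print(labels)
```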


In [7]:
motif_overlap = [
    ("EWSR1","GGGGG"),
    ("FUS", "GGGGG"),
    ("TAF15", "GGGGG"),
    ("HNRNPL", "ACACA"),
    ("PABPN1L", "AAAAA"),
    ("TRA2A", "GAAGA"),
    ("PCBP2", "CCCCC"),
    ("RBFOX2", "GCATG"),
    ("TARDBP", "GTATG"),
    ("HNRNPC", "TTTTT"),
    ("TIA1","TTTTT"),
    ("PTBP3", "TTTCT"),
    ("CELF1", "TATGT"),
    ("FUBP3", "TATAT"),
    ("KHSRP", "TGTAT"),
    ("PUM1", "TGTAT"),
    ("KHDRBS2", "ATAAA")
]

motifs = list(set(map(lambda x: x[1], motif_overlap)))
motifs

['ACACA',
 'TATGT',
 'GCATG',
 'TATAT',
 'GAAGA',
 'AAAAA',
 'GTATG',
 'TTTCT',
 'ATAAA',
 'TGTAT',
 'GGGGG',
 'TTTTT',
 'CCCCC']
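How `SeqDataset` uses the `motifs` list is not shown in this notebook. As an illustration of what filtering by motif could involve, the sketch below locates every occurrence of each motif in a sequence, including overlapping ones, using a regex lookahead. The function name and return format are assumptions.

```python
import re


def find_motif_positions(seq, motifs):
    """Return {motif: [start positions]} for motifs that occur in `seq`.

    A zero-width lookahead is used so that overlapping matches
    (e.g. TTTTT inside TTTTTT) are all reported.
    """
    hits = {}
    for motif in motifs:
        pattern = re.compile(f"(?={re.escape(motif)})")
        starts = [m.start() for m in pattern.finditer(seq)]
        if starts:
            hits[motif] = starts
    return hits


print(find_motif_positions("TTTTTTGCATG", ["TTTTT", "GCATG", "CCCCC"]))
```

A plain `str.find` loop or `re.finditer` without the lookahead would report only non-overlapping matches, which undercounts homopolymer motifs such as `TTTTT`.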

In [38]:
seq_len = 5000
total_len = 5000

seq_transform = sequence_encoders.RollingMasker(mask_stride=1)
                       
test_dataset = SeqDataset(fasta_fa, seq_df, transform = seq_transform, motifs=motifs)
test_dataloader = DataLoader(dataset = test_dataset, batch_size = 128, num_workers = 1, collate_fn = None, shuffle = False)



In [39]:
gc.collect()
torch.cuda.empty_cache()
device = torch.device('cuda')
d_model = 128
n_layers = 4
dropout = 0.
learn_rate = 1e-4
weight_decay = 0.
output_dir = "./test/"
get_embeddings = False
save_at = None

species_encoder = SpecAdd(embed = True, encoder = 'label', d_model = 128)

model = DSSResNetEmb(d_input = 5, d_output = 5, d_model = d_model, n_layers = n_layers, 
                     dropout = dropout, embed_before = True, species_encoder = species_encoder)

model = model.to(device) 

model_params = [p for p in model.parameters() if p.requires_grad]

optimizer = torch.optim.Adam(model_params, lr = learn_rate, weight_decay = weight_decay)

last_epoch = 0

In [40]:

model_weight = "../../test/MLM_mammals_species_aware_5000_weights"
model.load_state_dict(torch.load(model_weight))

<All keys matched successfully>

In [41]:

predictions_dir = os.path.join(output_dir, 'predictions') #dir to save predictions
weights_dir = os.path.join(output_dir, 'weights') #dir to save model weights at save_at epochs
if save_at:
    os.makedirs(weights_dir, exist_ok = True)

def metrics_to_str(metrics):
    loss, total_acc, masked_acc = metrics
    return f'loss: {loss:.4}, total acc: {total_acc:.3f}, masked acc: {masked_acc:.3f}'

from helpers.misc import print    #print function that displays time

In [42]:

print(f'EPOCH {last_epoch}: Test/Inference...')

test_metrics, test_embeddings =  train_eval.model_eval(model, optimizer, test_dataloader, device, 
                                                        get_embeddings = get_embeddings, silent = True)



print(f'epoch {last_epoch} - test, {metrics_to_str(test_metrics)}')

if get_embeddings:
    # stack the per-batch embeddings and save them as a single array
    os.makedirs(output_dir, exist_ok = True)
    with open(output_dir + '/embeddings.npy', 'wb') as f:
        test_embeddings = np.vstack(test_embeddings)
        np.save(f, test_embeddings)

[2023/06/08-15:25:43]- EPOCH 0: Test/Inference...


RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 264, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/lukas/.local/lib/anaconda3/envs/ML4RG-mlm/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 161, in collate_tensor_fn
    out = elem.new(storage).resize_(len(batch), *list(elem.size()))
RuntimeError: Trying to resize storage that is not resizable
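This error is raised inside PyTorch's default collate function. A common cause is that the items of a batch are tensors of different shapes while `num_workers > 0`; with `num_workers = 0` the same problem usually surfaces as a clearer "stack expects each tensor to be equal size" message. Whether that is the cause here is not confirmed by the notebook, but 3' UTR sequences of varying length combined with `batch_size = 128` make it plausible. One possible workaround is a custom `collate_fn` that pads to a common length. The sketch below assumes each dataset item is a single tensor whose first dimension is the sequence length, which is not how `SeqDataset` items are necessarily structured.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch, pad_value=0):
    """Pad variable-length tensors to the longest one in the batch and stack them.

    Hypothetical collate function: returns the padded batch and the
    original lengths so that padding can be masked out later.
    """
    lengths = torch.tensor([item.shape[0] for item in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=pad_value)
    return padded, lengths


batch = [torch.ones(3), torch.ones(5)]
padded, lengths = pad_collate(batch)
print(padded.shape, lengths)
```

It would be passed as `DataLoader(..., collate_fn=pad_collate)`. Setting `batch_size = 1`, as in the original config, is another way to avoid stacking tensors of unequal length.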
