# Bacterial Ensemble Language Model

This notebook trains a language model on the ensemble of bacterial genomes assembled in the [Bacterial Ensemble 0 Data Processing](https://github.com/kheyer/Genomic-ULMFiT/blob/master/Bacteria/Bacterial%20Ensemble/Bacterial%20Ensemble%200%20Data%20Processing.ipynb) notebook. The language model is based on the AWD-LSTM architecture. The genomic input is split into 3-mers with a stride of 1 base between consecutive 3-mers, matching the `ngram=3, stride=1` tokenizer settings used below. The model takes a sequence of 3-mers as input and is trained to predict the next 3-mer, which lets it learn the structure of genomic sequence from unlabeled data.
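
The k-mer tokenization and next-token objective described above can be sketched in plain Python. The names `kmer_tokenize` and `next_token_pairs` are hypothetical and are not the `GenomicTokenizer` used later in this notebook. The sketch assumes that trailing bases which do not fill a complete k-mer are dropped.

```python
def kmer_tokenize(seq, k, stride):
    """Split seq into k-mers, advancing `stride` bases between k-mers.

    Assumption: an incomplete k-mer at the end of the sequence is discarded.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def next_token_pairs(tokens):
    """Pair every token with the token that follows it (input, target)."""
    return list(zip(tokens[:-1], tokens[1:]))


seq = "GTATTGGTTA"

# Same settings as the tokenizer cell in this notebook (ngram=3, stride=1).
tokens = kmer_tokenize(seq, k=3, stride=1)
print(tokens)

# Each pair is (current k-mer, k-mer the language model should predict).
pairs = next_token_pairs(tokens)
print(pairs[:3])

# A larger k with a larger stride yields fewer, longer tokens.
print(kmer_tokenize(seq, k=5, stride=2))
```

With a stride smaller than `k`, consecutive tokens overlap, so the model sees every base in several k-mer contexts.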

The base of the language model (token embedding + LSTM layers) will then be used to initialize a classification model.

For more detail on how genomic data is processed and how these language models are trained, see the following notebooks:

[E. coli 1 Naive Model](https://github.com/kheyer/Genomic-ULMFiT/blob/master/Bacteria/E.%20Coli/E.%20coli%201%20Naive%20Model.ipynb)

[E. coli 2 Genomic Pretraining](https://github.com/kheyer/Genomic-ULMFiT/blob/master/Bacteria/E.%20Coli/E.%20coli%202%20Genomic%20Pretraining.ipynb)

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [7]:
from fastai import *
from fastai.text import *
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import FeatureLocation, CompoundLocation
import networkx as nx

In [9]:
sys.path.append("../../..")
from utils import *

In [3]:
path = Path('/home/martin/Downloads/CMSCSP/methylations/Csv')

In [4]:
df = pd.read_csv(path/'methylationscorpusfair4.txt', header=None)

In [5]:
df.head()

Unnamed: 0,0
0,GTATTGGTTAAGGTTTCTATCTTGTAATCCTGCCTCCATCC
1,TGCCAAGAACGCTTGTGCCTCCTTGGCGAAGCGATGGAGTT
2,TTTGTGGCCGCTGTGGAAATCGCTTCGCCCTCACCTCCTCC
3,CGCACATATGTATTTCAAATCGATTTTCATATCTTATCTAT
4,TAGATTCGTTTTTTATAGGACGGATAGAATATAAGTTTGAA


In [10]:
# 10% of the data used for validation
train_df, valid_df = split_data(df, 0.9)

In [11]:
train_df.shape, valid_df.shape

((2720350, 1), (302261, 1))
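
The shapes above are consistent with a sequential split without shuffling, in which the first 90% of rows (plus one) go to training and the rest to validation. This is an inference from the printed sizes, not a confirmed description of `split_data` in `utils`; `split_sequential` below is a hypothetical stand-in.

```python
def split_sequential(rows, pct):
    """Split a sliceable collection into (train, valid) at a fixed cut point.

    Assumption: no shuffling; the first int(len * pct) + 1 rows are training.
    """
    cut = int(len(rows) * pct) + 1
    return rows[:cut], rows[cut:]


# A range stands in for the 3,022,611 rows of the data frame;
# the same slicing works on a pandas DataFrame.
rows = range(3_022_611)
train, valid = split_sequential(rows, 0.9)
print(len(train), len(valid))
```

If the corpus is ordered (for example by genome or by position), a sequential split keeps the validation rows contiguous, which may or may not be desirable.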

In [12]:
tok = Tokenizer(partial(GenomicTokenizer, ngram=3, stride=1), n_cpus=4, pre_rules=[], post_rules=[], special_cases=['xxpad'])

In [13]:
data = GenomicTextLMDataBunch.from_df(path, train_df, valid_df, bs=850, tokenizer=tok, text_cols=0, label_cols=1)

In [13]:
len(data.vocab.itos)

86
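
`data.vocab.itos` is the index-to-token list, so its length is the vocabulary size. The relationship between tokens and integer ids can be sketched as follows. `build_vocab` is a hypothetical helper that assumes frequency ordering with special tokens first; fastai's own vocabulary construction also applies frequency cutoffs and adds further special tokens that are not modeled here.

```python
from collections import Counter


def build_vocab(token_lists, specials=("xxpad",)):
    """Return (itos, stoi): special tokens first, then tokens by frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [t for t, _ in counts.most_common() if t not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


token_lists = [["GTA", "TAT", "ATT"], ["TAT", "ATT", "TAT"]]
itos, stoi = build_vocab(token_lists)

# Numericalize a token sequence, then map the ids back to tokens.
ids = [stoi[t] for t in ["GTA", "TAT"]]
print(itos, ids, [itos[i] for i in ids])
```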

In [14]:
# Save model vocabulary - this will be important later
np.save(path/'methylations_vocab_3m1s.npy', data.vocab.itos)

In [14]:
config = dict(emb_sz=400, n_hid=1150, n_layers=3, pad_token=0, qrnn=False, output_p=0.25,
              hidden_p=0.1, input_p=0.2, embed_p=0.02, weight_p=0.15, tie_weights=True, out_bias=True)
drop_mult = 0.25
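
The effect of `drop_mult` on the dropout settings can be sketched as below. This assumes that `get_model_LM` follows the fastai v1 convention of multiplying every config entry whose key ends in `_p` by `drop_mult`; `scale_dropout` is a hypothetical helper written for illustration.

```python
def scale_dropout(config, drop_mult):
    """Return a copy of config with every '*_p' dropout value scaled."""
    return {k: (v * drop_mult if k.endswith("_p") else v) for k, v in config.items()}


config = dict(emb_sz=400, output_p=0.25, hidden_p=0.1, input_p=0.2,
              embed_p=0.02, weight_p=0.15)
effective = scale_dropout(config, drop_mult=0.25)

# Dropout probabilities shrink by a factor of four; emb_sz is untouched.
for key, value in effective.items():
    print(key, value)
```

Under that assumption, a single `drop_mult` value tunes overall regularization strength while keeping the ratios between the individual dropout types fixed.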

In [15]:
learn = get_model_LM(data, drop_mult, config)

In [16]:
# learn = learn.to_fp16(dynamic=True);

In [17]:
learn.model = learn.model.cuda(0)

In [18]:
learn.lr_find()
learn.recorder.plot()

epoch,train_loss,valid_loss,accuracy,time


LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


RuntimeError: CUDA out of memory. Tried to allocate 1.28 GiB (GPU 0; 3.95 GiB total capacity; 2.76 GiB already allocated; 513.81 MiB free; 2.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
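
The error above shows the GPU (3.95 GiB) running out of memory with `bs=850`. A common remedy is a smaller batch size. The sketch below halves the batch size until a trial step succeeds. `largest_fitting_batch_size` and `fake_step` are hypothetical; in a real run the trial step would rebuild the data bunch and learner with the candidate batch size, and calling `torch.cuda.empty_cache()` between attempts may help release cached memory.

```python
def largest_fitting_batch_size(try_step, start_bs, min_bs=1):
    """Halve the batch size until try_step(bs) stops raising an OOM error."""
    bs = start_bs
    while bs >= min_bs:
        try:
            try_step(bs)
            return bs
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err):
                raise
            bs //= 2
    raise RuntimeError("no batch size at or above min_bs fits in memory")


def fake_step(bs):
    """Stand-in for one training step; pretends only bs <= 200 fits."""
    if bs > 200:
        raise RuntimeError("CUDA out of memory.")


print(largest_fitting_batch_size(fake_step, start_bs=850))
```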

In [None]:
learn.fit_one_cycle(5, 1e-2, moms=(0.8,0.7))
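
`fit_one_cycle` applies the one-cycle policy: the learning rate rises to its peak (`1e-2` here) and then decays, while momentum moves from 0.8 down to 0.7 and back. The sketch below illustrates the shape of such a schedule. The cosine annealing, the warm-up fraction of 0.3, the starting rate of peak/25, and the final rate of peak/(25 * 1e4) are assumptions based on fastai v1 defaults, and `one_cycle` is a hypothetical function.

```python
import math


def anneal_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle(step, total_steps, lr_max=1e-2, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0):
    """Return (learning_rate, momentum) for a given step of the cycle."""
    warm = max(1, int(total_steps * pct_start))
    if step < warm:
        # Warm-up: learning rate rises while momentum falls.
        pct = step / warm
        return (anneal_cos(lr_max / div_factor, lr_max, pct),
                anneal_cos(moms[0], moms[1], pct))
    # Annealing: learning rate decays while momentum recovers.
    pct = (step - warm) / max(1, total_steps - warm)
    return (anneal_cos(lr_max, lr_max / (div_factor * 1e4), pct),
            anneal_cos(moms[1], moms[0], pct))


for step in (0, 30, 65, 100):
    lr, mom = one_cycle(step, total_steps=100)
    print(step, f"{lr:.2e}", f"{mom:.3f}")
```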

In [None]:
learn.save('b1_3m1s')

In [None]:
learn.save_encoder('b1_3m1s_enc')