# Sequence Classification ML Assignment 

#### In this Jupyter notebook, you will modify and run a machine learning model that classifies human DNA sequences as coding or intergenomic. Several functions are already written for you; please do NOT modify any code unless the instructions say to change it.

In [None]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import (
    BatchNormalization,
    Conv1D,
    Dense,
    Dropout,
    GlobalAveragePooling1D,
    MaxPooling1D,
)

# import wandb #uncomment if using weights and biases
# from wandb.integration.keras import WandbMetricsLogger, WandbModelCheckpoint
# import random

from genomic_benchmarks.data_check import is_downloaded, info
from genomic_benchmarks.loc2seq import download_dataset  # called below if the dataset is missing
from genomic_benchmarks.models.tf import vectorize_layer

from genomic_benchmarks.models.tf import get_basic_cnn_model_v0 as get_model

### Importing Dataset

In [None]:
DATASET = "demo_coding_vs_intergenomic_seqs"
VERSION = 0
BATCH_SIZE = 64
EPOCHS = 10

In [None]:
if not is_downloaded(DATASET):
    download_dataset(DATASET)

info(DATASET)

**Does anything strike you about the number of sequences? Why do you think this dataset was created with 100,000 200bp sequences from the human genome?**

*Put your answer here*

### Creating the training dataset

In [None]:
CLASSES = ['coding_seqs', 'intergenomic_seqs']
NUM_CLASSES = len(CLASSES)

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    '/projects/bgmp/shared/Bi625/ML_Assignment/Datasets/demo_coding_vs_intergenomic_seqs/train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

**How are the sequences currently stored? Can you tell whether the sequence below is a coding or an intergenomic sequence?**


*Put your answer here*

In [None]:
list(train_dset)[0][0][0]

### Pre-processing the sequences  

In [None]:
vectorize_layer.adapt(train_dset.map(lambda x, y: x))
vocab_size = len(vectorize_layer.get_vocabulary())
vectorize_layer.get_vocabulary()

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text)-2, label

train_ds = train_dset.map(vectorize_text)

**How did the pre-processing change the sequence?**

*Put your answer here*

In [None]:
list(train_ds)[0][0][4]

In [None]:
test_dset = tf.keras.preprocessing.text_dataset_from_directory(
    '/projects/bgmp/shared/Bi625/ML_Assignment/Datasets/demo_coding_vs_intergenomic_seqs/test',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

test_ds = test_dset.map(vectorize_text)

list(test_ds)[0][0][3]

### Example Recurrent Neural Network

In [None]:
f1 = tfa.metrics.F1Score(num_classes=1, threshold=0.5, average="micro")
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
acc = tf.metrics.BinaryAccuracy(threshold=0.0)
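
The loss above uses `from_logits=True` and the accuracy metric uses `threshold=0.0` because the final `Dense(1)` layer of the example model has no activation, so it outputs logits rather than probabilities. A logit of 0 corresponds to a probability of 0.5. The plain-Python sketch below illustrates that relationship; the helper names `sigmoid` and `bce_from_logit` are illustrative and are not TensorFlow functions.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy for one example, from logit z and label y (0 or 1).

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    predicted_label = int(z > 0.0)  # thresholding the logit at 0.0 ...
    assert predicted_label == int(p > 0.5)  # ... matches thresholding the probability at 0.5
    print(f"logit={z:+.1f}  prob={p:.3f}  predicted={predicted_label}  "
          f"loss_if_label_1={bce_from_logit(z, 1):.3f}")
```

If you add a sigmoid activation to your own output layer, the outputs become probabilities, so the loss would need `from_logits=False` and the accuracy threshold would need to be 0.5.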

In [None]:
## Remove comments if using weights and biases

# Start a run, tracking hyperparameters
# wandb.init(
#     # set the wandb project where this run will be logged
#     project="sequence_classification_assignment",

#     # track hyperparameters and run metadata with wandb.config
#     config={
#         "activation_2": "softmax",
#         "optimizer": "adam",
#         "loss": "binary_crossentropy",
#         "metric": "accuracy",
#         "epoch": 10,
#         "batch_size": 64
#     }
# )

# config = wandb.config

In [None]:
character_split_fn = lambda x: tf.strings.unicode_split(x, "UTF-8")
vectorize_layer = TextVectorization(output_mode="int", split=character_split_fn)
onehot_layer = tf.keras.layers.Lambda(lambda x: tf.one_hot(tf.cast(x, "int64"), vocab_size))

In [None]:
model_rnn = tf.keras.Sequential()
# Instead of one-hot encoding, this example uses an embedding layer
# (code for a one-hot layer is provided above if you want to incorporate it).
# input_dim = vocabulary size, output_dim = embedding dimension, input_length = sequence length
model_rnn.add(tf.keras.layers.Embedding(input_dim=6, output_dim=64, input_length=200))
# LSTM is a type of RNN layer
model_rnn.add(tf.keras.layers.LSTM(64))
model_rnn.add(tf.keras.layers.Dense(40,activation='relu'))
model_rnn.add(tf.keras.layers.Dense(1))
model_rnn.build((None, 200))  # input shape is (batch, sequence_length)
model_rnn.summary()
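
The comment about embeddings versus one-hot encoding can be made concrete. An embedding layer is a lookup table with one trainable row per token, which gives the same result as one-hot encoding the tokens and multiplying by that table. The NumPy sketch below uses random weights and a made-up token sequence as stand-ins; it illustrates the idea and is not the Keras implementation.

```python
import numpy as np

VOCAB_SIZE = 6  # matches input_dim in the Embedding layer above
EMBED_DIM = 4   # kept small for display; the model above uses 64

rng = np.random.default_rng(0)
# Stand-in for the weights an Embedding layer would learn.
embedding_matrix = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

# Hypothetical integer-encoded bases (not taken from the dataset).
tokens = np.array([0, 3, 1, 2, 2])

# Embedding lookup: select one row of the matrix per token.
embedded = embedding_matrix[tokens]

# Equivalent formulation: one-hot encode, then multiply by the same matrix.
one_hot = np.eye(VOCAB_SIZE)[tokens]
embedded_via_one_hot = one_hot @ embedding_matrix

print(embedded.shape)
print(np.allclose(embedded, embedded_via_one_hot))
```

The practical difference is that a one-hot layer has no trainable weights, while the embedding table is learned along with the rest of the model.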

In [None]:
# acc thresholds the logits at 0.0; the string "accuracy" would threshold them at 0.5
model_rnn.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam', metrics=[acc])
              #metrics=[config.metric]) # Use this if using Weights & Biases

In [None]:
# train_ds is already batched (BATCH_SIZE), so batch_size is not passed to fit()
history = model_rnn.fit(
    train_ds,
    epochs=EPOCHS)

## Run this model fit command if using weights and biases
# history = model_rnn.fit(
#     train_ds,
#     epochs=EPOCHS, callbacks=[
#                       WandbMetricsLogger(log_freq=5),
#                       WandbModelCheckpoint("models")])

In [None]:
model_rnn.evaluate(test_ds)

### Code and Explore!

In this exploration, you are **required to create three different neural networks to solve the above problem**. Creating a model can include 1) fundamentally changing the type of layers (e.g., recurrent layers to convolutional layers), 2) adding layers, including pooling and activation layers, or 3) changing the functions (loss, optimizer). Your new models do **not** have to be better than the recurrent model shown above; however, you **must explain what you did and why you decided to try it**. You may also change hyperparameters (batch size, number of epochs), but please make some major structural changes in addition to hyperparameter changes.

Most importantly, have fun and be curious!
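
If you try convolutional and pooling layers, it can help to work out how each layer shortens the 200-base input before you build the model. The sketch below applies the standard output-length formulas for 1D convolution and pooling. The function names and the example layer stack are hypothetical, chosen only for illustration.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding="valid"):
    """Output length of a 1D convolution under 'valid' or 'same' padding."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    return (length - kernel_size) // stride + 1

def pool1d_output_length(length, pool_size, stride=None):
    """Output length of 1D pooling with 'valid' padding; stride defaults to pool_size."""
    stride = stride or pool_size
    return (length - pool_size) // stride + 1

# Hypothetical stack: conv (kernel 8) -> pool (2) -> conv (kernel 8) -> pool (2)
length = 200
for name, fn, size in [
    ("conv", conv1d_output_length, 8),
    ("pool", pool1d_output_length, 2),
    ("conv", conv1d_output_length, 8),
    ("pool", pool1d_output_length, 2),
]:
    length = fn(length, size)
    print(f"{name}(size={size}) -> length {length}")
```

Comparing these numbers with the output shapes reported by `model.summary()` is a quick way to check that your layers are connected the way you intended.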

#### Inspiration: 
https://github.com/Jawwad-Fida/DNA-sequence-classification-by-Deep-Neural-Network

https://colab.research.google.com/github/google/nucleus/blob/master/nucleus/examples/dna_sequencing_error_correction.ipynb

https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

https://www.tensorflow.org/text/tutorials/text_classification_rnn

https://github.com/const-ae/Neural_Network_DNA_Demo/blob/master/nn_for_sequence_data.ipynb 

#### Your Model 1

*Explain your change here (what you did and why you tried that out)*

#### Your Model 2

*Explain your change here (what you did and why you tried that out)*

#### Your Model 3

*Explain your change here (what you did and why you tried that out)*

**Are any of your models more successful than model_rnn? Explain why.**

*Put answer here*