# Neural Network
This notebook walks through building, training, and evaluating the neural networks used in this project.

In [2]:
%pip install pandas numpy~=1.19.2 scikit-learn matplotlib seaborn tensorflow-gpu

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn import metrics
import pandas as pd
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

import matplotlib.pyplot as plt
from itertools import cycle

from sklearn.metrics import f1_score

## Load data
Load the preprocessed training/testing data

In [4]:
ompk35 = pd.read_csv("../data/processed/form_3_ompk35.csv", index_col=0)
ompk36 = pd.read_csv("../data/processed/form_3_ompk36.csv", index_col=0)
ompk37 = pd.read_csv("../data/processed/form_3_ompk37.csv", index_col=0)



labels = pd.read_csv("../data/processed/labels.csv", index_col=0)
set_mics = pd.read_csv("../data/processed/mic_set.csv")

## Rename columns
We will concatenate the columns of all three genes side by side and perform an inner join on the rows (isolates). For that to work cleanly, the columns of each gene need distinct names; otherwise the combined frame would contain duplicate column labels and the genes could no longer be told apart.

In [5]:
def set_columns(df, gene_name):
    df = df.set_axis([i for i in range(len(df.columns))], axis=1)
    df = df.add_prefix(f'{gene_name}_')
    return df

In [6]:
ompk35 = set_columns(ompk35, 'ompk35')
ompk36 = set_columns(ompk36, 'ompk36')
ompk37 = set_columns(ompk37, 'ompk37')

In [7]:
form_3 = pd.concat([ompk35, ompk36, ompk37], axis=1, join='inner')
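
To see why the prefixing matters, here is a minimal sketch on two toy frames (the isolate IDs, column labels, and values are made up for illustration). Both frames start with the same column labels; after `set_columns` each gene has its own names, and the inner join keeps only the isolates present in both frames.

```python
import pandas as pd


def set_columns(df, gene_name):
    # Same idea as the helper above: renumber the columns, then prefix them.
    df = df.set_axis(list(range(len(df.columns))), axis=1)
    return df.add_prefix(f"{gene_name}_")


# Hypothetical toy data: rows are isolates, columns are encoded positions.
gene_a = pd.DataFrame([[1, 0], [0, 1], [1, 1]],
                      index=["iso1", "iso2", "iso3"], columns=["p0", "p1"])
gene_b = pd.DataFrame([[0, 0], [1, 0]],
                      index=["iso2", "iso3"], columns=["p0", "p1"])

merged = pd.concat([set_columns(gene_a, "geneA"), set_columns(gene_b, "geneB")],
                   axis=1, join="inner")

print(list(merged.columns))  # ['geneA_0', 'geneA_1', 'geneB_0', 'geneB_1']
print(list(merged.index))    # ['iso2', 'iso3']
```

`iso1` is dropped because it has no row in `gene_b`, which mirrors how isolates with a missing gene fall out of `form_3`.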

## Setting labels
The labels contain every isolate that has no holes in at least one gene. However, we only want isolates that have no holes in any of the genes. To get there, we shrink the labels down to the isolates that are present in `form_3`.

In [8]:
labels = labels[labels.index.isin(form_3.index)]

## Sorting data and labels
The labels and the data must be in the same order when training. Otherwise, an MIC value could be matched with the wrong data point.

In [9]:
labels = labels.sort_index()
form_3 = form_3.sort_index()
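
Because Keras consumes the features and labels as plain arrays, rows are matched by position, not by index. A quick sanity check after sorting, sketched here on made-up isolates and values, is to confirm that both indexes are identical.

```python
import pandas as pd

# Hypothetical toy data with deliberately different row orders.
features = pd.DataFrame({"ompk35_0": [1, 0, 1]}, index=["iso3", "iso1", "iso2"])
targets = pd.DataFrame({"Antibiotic_1": [4.0, 0.5, 2.0]}, index=["iso2", "iso3", "iso1"])

# Index.equals is True only if both indexes hold the same labels in the same order.
aligned_before = features.index.equals(targets.index)

features = features.sort_index()
targets = targets.sort_index()
aligned_after = features.index.equals(targets.index)

print(aligned_before, aligned_after)  # False True
```

In this notebook the equivalent check would be `assert labels.index.equals(form_3.index)` right after the two `sort_index()` calls.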

## Update 4 from XGBoost required here

In [10]:
def scale_labels(x, classes=()):
    """Map a label to its position in `classes`, giving values in [0, num_classes)."""
    return classes.index(x)

In [11]:
classes = list(labels['Antibiotic_1'].unique())
classes.sort()
y = labels['Antibiotic_1'].apply(scale_labels, classes=classes)
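
The same encoding can be expressed as a dictionary lookup, which avoids rescanning the list for every row. A minimal sketch, using made-up MIC values rather than the real contents of `labels.csv`:

```python
# Hypothetical MIC values; the real ones come from labels['Antibiotic_1'].
mic_values = [4.0, 0.5, 16.0, 4.0, 1.0, 0.5]

classes = sorted(set(mic_values))  # [0.5, 1.0, 4.0, 16.0]
class_to_index = {mic: i for i, mic in enumerate(classes)}

y = [class_to_index[mic] for mic in mic_values]
print(y)  # [2, 0, 3, 2, 1, 0]

# The mapping is invertible, so class indices can be turned back into MIC values.
decoded = [classes[i] for i in y]
```

With pandas, the lookup step could be written as `labels['Antibiotic_1'].map(class_to_index)`.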

# Network
The first network we try is the small, one-hidden-layer NN from:

D. Aytan-Aktug, P. T. L. C. Clausen, V. Bortolaia, F. M. Aarestrup, and O. Lund. "Prediction of Acquired Antimicrobial Resistance for Multiple Bacterial Species Using Neural Networks". American Society for Microbiology Journals, January 5, 2020, e00774-19. [https://doi.org/10.1128/MSYSTEMS.00774-19](https://doi.org/10.1128/MSYSTEMS.00774-19).

It has a hidden layer with 200 neurons.

In [12]:
num_genes = len(form_3.columns)
num_mics = len(classes)
model = keras.Sequential(
                    [
                        layers.Dense(200, activation="relu", name="hidden", input_shape=(num_genes,)),
                        layers.Dense(num_mics, activation="softmax", name="output"),
                    ])

# Summary of model
Let's print the model summary to check that the structure is what we expect.

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hidden (Dense)               (None, 200)               230000    
_________________________________________________________________
output (Dense)               (None, 7)                 1407      
Total params: 231,407
Trainable params: 231,407
Non-trainable params: 0
_________________________________________________________________
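
The parameter counts can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. Working backwards from the summary, the hidden layer's 230,000 parameters imply 1,149 input features. A small sketch (the helper name `dense_params` is ours, not part of Keras):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden_units, num_classes = 200, 7

# Solve (n_in + 1) * 200 = 230000 for n_in.
n_in = 230_000 // hidden_units - 1
print(n_in)  # 1149

hidden = dense_params(n_in, hidden_units)         # 230000
output = dense_params(hidden_units, num_classes)  # 1407
print(hidden + output)  # 231407
```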


# Compile model
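Next, we compile the model with the SGD optimizer and the sparse categorical cross-entropy loss, tracking cross-entropy and accuracy as metrics.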

In [14]:
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['sparse_categorical_crossentropy', 'sparse_categorical_accuracy'])
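
`sparse_categorical_crossentropy` takes integer class labels (which is why `y` was encoded as indices) and averages the negative log-probability assigned to the true class. The plain-Python sketch below reproduces that computation on made-up probabilities; it ignores the small clipping Keras applies for numerical stability. It also gives a useful reference point: a model that spreads its prediction uniformly over seven classes scores ln 7, about 1.946.

```python
import math


def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(p[true class]) over the samples."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)


# Hypothetical predictions for two samples over three classes.
y_true = [0, 2]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
print(round(sparse_categorical_crossentropy(y_true, y_prob), 4))  # 0.4338

# Uniform guessing over 7 classes gives ln(7).
uniform = sparse_categorical_crossentropy([3], [[1 / 7] * 7])
print(round(uniform, 4))  # 1.9459
```

Against that baseline, the validation cross-entropy of 1.6302 visible in the captured training log is only a modest improvement.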

# Train the model
We hold out 20% of the input data for validation, and the training data is shuffled each epoch (`shuffle=True` is the default). Even with a batch size of 100 and 100 epochs, the loss stays high for both training and validation, which suggests the model is underfitting the data. Because of this, the next step is to try the DeepARG architecture.
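
One detail worth knowing about `validation_split`: Keras takes the held-out fraction from the end of the arrays before any shuffling, so with index-sorted data the validation set is the last 20% of isolates in sort order. If that is not desired, the rows can be permuted once before calling `fit`. A minimal sketch of both ideas in plain Python (the helper name and the seed are illustrative, not part of Keras):

```python
import math
import random


def keras_style_split(n_samples, validation_split):
    """Positions used for training and validation when the tail is held out."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return list(range(split_at)), list(range(split_at, n_samples))


train_pos, val_pos = keras_style_split(10, 0.2)
print(val_pos)  # [8, 9]

# Permute the row positions once, reproducibly, before splitting.
rng = random.Random(0)
order = list(range(10))
rng.shuffle(order)

# Rows that end up in the validation tail after the permutation.
val_rows = [order[i] for i in val_pos]
```

In the notebook, a comparable effect could be achieved with `form_3.sample(frac=1, random_state=0)` and reindexing `y` to the shuffled index.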

In [15]:
model.fit(form_3, y, batch_size=100, epochs=100, validation_split=0.2)

... val_sparse_categorical_crossentropy: 1.6302 - val_sparse_categorical_accuracy: 0.5347
Epoch 27/100
...
(log truncated; the remaining epoch lines were captured without metrics)

<tensorflow.python.keras.callbacks.History at 0x7fe476b220d0>

# Building and compiling DeepARG model

In [16]:
def build_model():
    model = keras.Sequential(
                    [
                        layers.Dense(2000, activation="relu", input_shape=(num_genes,)),
                        layers.Dropout(0.5),
                        layers.Dense(1000, activation="relu"),
                        layers.Dropout(0.5),
                        layers.Dense(500, activation="relu"),
                        layers.Dropout(0.5),
                        layers.Dense(100, activation="relu"),
                        layers.Dense(num_mics, activation="softmax", name="output"),
                    ])
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['sparse_categorical_crossentropy', 'sparse_categorical_accuracy'])
    return model

In [17]:
model = build_model()
model.fit(form_3, y, batch_size=100, epochs=300, validation_split=0.2)

... val_sparse_categorical_accuracy: 0.5918
Epoch 227/300
...
(log truncated; the remaining epoch lines were captured without metrics)


<tensorflow.python.keras.callbacks.History at 0x7fe474285d60>

# Cross Validation
A single 80/20 split gives only a rough picture, so the next step is cross-validation to get a more reliable estimate of how the model performs. The code block below is adapted from [this StackOverflow answer](https://stackoverflow.com/a/57775402).

In [18]:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Wrap the Keras model so scikit-learn can build and fit a fresh copy per fold.
est = KerasClassifier(build_fn=build_model, epochs=100, batch_size=100)
# 5-fold cross-validation, repeated 100 times with different splits.
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
cv_results = cross_val_score(est, form_3, y, cv=kfold, scoring="f1_micro")

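NameError: name 'buildModel' is not defined

(This error is left over from an earlier run of the cell; the code shown above passes `build_model`, which is defined.)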
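
A note on the metric: for single-label multi-class problems like this one, micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and the result equals plain accuracy. The sketch below verifies this on made-up class indices. It also shows the cost of the setup above, since 5 splits repeated 100 times means 500 separate model fits.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multi-class predictions."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == cls and p == cls)
            fp += (t != cls and p == cls)
            fn += (t == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical class indices.
y_true = [0, 1, 2, 2, 1, 0, 3]
y_pred = [0, 2, 2, 2, 1, 1, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(micro_f1(y_true, y_pred), 4), round(accuracy, 4))  # 0.7143 0.7143

n_splits, n_repeats = 5, 100
total_fits = n_splits * n_repeats
print(total_fits)  # 500
```

If per-class behaviour matters, for example when rare MIC values are of interest, `scoring="f1_macro"` would weight every class equally instead.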

In [19]:
model.predict(form_3)

array([[0.34739387, 0.14793424, 0.0982163 , ..., 0.06934967, 0.07976612,
        0.17324272],
       [0.41908008, 0.1555149 , 0.08490692, ..., 0.04756605, 0.06280009,
        0.16183753],
       [0.18169662, 0.02881275, 0.0230271 , ..., 0.04126183, 0.08896692,
        0.63037777],
       ...,
       [0.38174483, 0.15214421, 0.09193739, ..., 0.05807449, 0.07138063,
        0.16833109],
       [0.41895375, 0.15550558, 0.08493099, ..., 0.04759885, 0.06282821,
        0.16186136],
       [0.41895375, 0.15550557, 0.08493099, ..., 0.04759885, 0.06282821,
        0.16186136]], dtype=float32)
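
Each row of this output is a probability distribution over the MIC classes. To turn it into a predicted MIC value, take the index of the largest probability and look it up in the sorted `classes` list. A minimal sketch with made-up probabilities and MIC values:

```python
# Hypothetical sorted MIC classes and softmax outputs for three isolates.
classes = [0.5, 1.0, 2.0, 4.0]
probabilities = [
    [0.60, 0.20, 0.15, 0.05],
    [0.10, 0.10, 0.20, 0.60],
    [0.25, 0.40, 0.30, 0.05],
]

# Index of the largest probability in each row (a plain-Python argmax).
predicted_index = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
predicted_mic = [classes[i] for i in predicted_index]

print(predicted_index)  # [0, 3, 1]
print(predicted_mic)    # [0.5, 4.0, 1.0]
```

On the real NumPy output, `model.predict(form_3).argmax(axis=1)` yields the same kind of index array.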