<a href="https://colab.research.google.com/github/OlekanmaVictoria/Data-science-Portfolio/blob/main/Copy_of_Copy_of_TUABtxt_ETL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook demonstrates how to extract, transform, and load the TUABtxt dataset for use with TensorFlow.

First, let's install and import some useful libraries.

In [None]:
# Be sure you're using the stable versions of both tf and tf-text, for binary compatibility.
!pip install -q -U tensorflow==2.7
!pip install -q -U tensorflow-text==2.7


In [None]:
import collections
import pathlib
import re
import string
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

# Download and explore the dataset

First we'll use a handy tool called `gdown` to download the dataset (just the text reports) from where your team have stored them on Google Drive.

In [None]:
!gdown --id 120d8bHo6NxEsUDprDqiuqj1OTfmhgJzR

Downloading...
From: https://drive.google.com/uc?id=120d8bHo6NxEsUDprDqiuqj1OTfmhgJzR
To: /content/TUAB_txt_relabelled.tar
100% 12.3M/12.3M [00:00<00:00, 50.9MB/s]


The dataset is packed inside the archive file `TUAB_txt_relabelled.tar`, so let's extract it (like unzipping a zip file).

In [None]:
import tarfile

# The context manager closes the archive even if extraction fails.
with tarfile.open("TUAB_txt_relabelled.tar") as tar:
  tar.extractall()

Now we've extracted a folder called `TUAB_txt_relabelled`. Let's use the `pathlib` library to explore this directory.

In [None]:
dataset_dir = pathlib.Path('TUAB_txt_relabelled/v2.0.0/edf') # First create a Path variable.
list(dataset_dir.iterdir())           # Then list the folders contained in that directory.

[PosixPath('TUAB_txt_relabelled/v2.0.0/edf/train'),
 PosixPath('TUAB_txt_relabelled/v2.0.0/edf/eval')]

We see above that the dataset has already been split into 'train' and 'eval' subsets. This is common practice in widely used machine learning research datasets, to ensure that everyone uses the same test (eval) set when comparing performance.

### Load the full dataset

Next, we will load the data off disk and prepare it into a format suitable for training. The [text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) utility makes this easy, and creates a `tf.data.Dataset` object with labels ('normal' and 'abnormal') automatically recognised from the folder structure. ([tf.data](https://www.tensorflow.org/guide/data) is a collection of tools for building input pipelines for machine learning).

In [None]:
train_and_val_ds = preprocessing.text_dataset_from_directory(dataset_dir/'train', batch_size=32)
raw_test_ds = preprocessing.text_dataset_from_directory(dataset_dir/'eval', batch_size=32)

Found 2716 files belonging to 2 classes.
Found 276 files belonging to 2 classes.


When running a machine learning experiment, it is best practice to divide your dataset into three splits: [train](https://developers.google.com/machine-learning/glossary#training_set), [validation](https://developers.google.com/machine-learning/glossary#validation_set), and [test](https://developers.google.com/machine-learning/glossary#test-set). There are no strict rules, but usually it's best to put most of your data in the training set, so that there's plenty to learn from. Let's split the training-and-validation data into 80% training and 20% validation.

In [None]:
# Set the size of each subset of data (measured in batches, not individual reports):
n = len(list(train_and_val_ds)) # Number of batches in original 'train' dataset
n_train = int(0.8*n)   # Use about 80% of the batches as training data ...
n_val = n - n_train    # ... and the remaining batches (about 20%) as validation data.

Now we're ready to actually make the split.

In [None]:
# Split the original 'train' data into training and validation sets
# (the test set was already loaded separately from the 'eval' folder):
raw_train_ds = train_and_val_ds.take(n_train)
raw_val_ds = train_and_val_ds.skip(n_train)
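
As a rough check of the split sizes, the batch arithmetic can be worked through in plain Python. This is only a sketch: it assumes the 2716 training files and the batch size of 32 shown above, and the variable names are illustrative rather than part of the notebook.

```python
import math

n_files = 2716      # files found in the 'train' folder (from the output above)
batch_size = 32     # batch size passed to text_dataset_from_directory

# The last batch is only partly full, so the number of batches is rounded up.
n_batches = math.ceil(n_files / batch_size)
n_train_batches = int(0.8 * n_batches)
n_val_batches = n_batches - n_train_batches

# All training batches are full; the single partial batch lands in validation.
n_train_reports = n_train_batches * batch_size
n_val_reports = n_files - n_train_reports

print(n_batches, n_train_batches, n_val_batches)  # 85 68 17
print(n_train_reports, n_val_reports)             # 2176 540
```

Because the split is made on batches rather than on individual reports, the actual proportions are close to, but not exactly, 80% and 20%.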

Let's print out a few examples, to get more of a feel for the data.

In [None]:
for text_batch, label_batch in raw_train_ds.take(1):   # Take a single batch from the dataset.
  for i in range(10):                                  # Iterate through the first 10 examples in that batch.
    print("Report: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Report:  b'CLINICAL HISTORY:  51 year old right handed woman with syncope and headaches, past history of closed head injury in June 2011.\nMEDICATIONS:  HCTZ, Lymphedema, Aspirin, Omeprazole.\nINTRODUCTION:  Digital video EEG is performed in lab using standard 10-20 system of electrode placement with 1 channel EKG.  Hyperventilation and photic stimulation were performed.\nDESCRIPTION OF THE RECORD:  In wakefulness, there is a 9 Hz symmetric reactive alpha rhythm with a small amount of low voltage, frontal central beta.  Central mu activity is noted, but testing is not performed to confirm this. Hyperventilation produces an increase in amplitude of the background and a small amount of slowing. Features of drowsiness include slow rolling eye movements, attenuation of the alpha, and POSTS.\nPhotic stimulation elicits vigorous bilateral driving.\nHR:    72 bpm\nIMPRESSION:  Normal EEG.\nCLINICAL CORRELATION:  No focal nor epileptiform features were observed.  Normal EEG does not exclude a 

The labels are `0` or `1`. To see which of these correspond to which string label, you can check the `class_names` property on the dataset, as below.


In [None]:
for i, label in enumerate(train_and_val_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to abnormal
Label 1 corresponds to normal


### Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the `preprocessing.TextVectorization` layer.
* Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

* Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

* Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the [API doc](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization).

* The default standardization converts text to lowercase and removes punctuation.

* The default tokenizer splits on whitespace.

* The default vectorization mode is `int`. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like `binary`, to build bag-of-words models.


Here we will use the `binary` mode to build a bag-of-words model (essentially a multi-hot encoding that records whether each word in the vocabulary appears in the report). Then we will use the `int` mode (integer encoding of each word in the report, with order preserved) with a 1D ConvNet.
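
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of standardization, tokenization, and vectorization. It is not how `TextVectorization` is implemented: the helper names, the two example reports, and the convention of reserving index 0 for padding and index 1 for unknown words are all assumptions made for illustration.

```python
import collections
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()

def build_vocabulary(texts):
    """Toy convention: index 0 is padding, index 1 is the unknown token."""
    counts = collections.Counter(tok for text in texts for tok in tokenize(text))
    # Most frequent tokens first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ordered

def vectorize_int(text, vocab, length):
    """One integer per token, order preserved, padded or truncated to `length`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))

def vectorize_binary(text, vocab):
    """1.0 where a vocabulary entry occurs at least once in the text, else 0.0."""
    present = set(vectorize_int(text, vocab, length=len(tokenize(text))))
    return [1.0 if i in present else 0.0 for i in range(len(vocab))]

# Made-up mini corpus standing in for the training reports.
vocab = build_vocabulary(["Normal EEG.", "Abnormal EEG, no seizures."])
report = "Normal EEG with seizures"
print(vectorize_int(report, vocab, length=6))  # [5, 2, 1, 6, 0, 0]
print(vectorize_binary(report, vocab))         # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
```

The `int` output keeps word order and is padded with zeros, while the `binary` output has one slot per vocabulary entry and discards order entirely.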

In [None]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For `int` mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which will cause the layer to pad or truncate sequences to exactly `output_sequence_length` values.

In [None]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

Next, you will call `adapt` so that each `TextVectorization` layer builds its vocabulary from the dataset.

Note: it's important to only use your training data when calling adapt (using the test set would leak information).

In [None]:
# To avoid some errors caused by non-standard characters, we create a function
# that does some additional 'cleaning' of the text.
def clean_text(text, labels):
  # Re-encode the bytes as UTF-8, treating the input as US ASCII. The labels
  # are dropped because `adapt` only needs the text.
  cleaned_version_of_text = tf.strings.unicode_transcode(text, "US ASCII", "UTF-8")
  return cleaned_version_of_text

# Now apply our clean_text function to the training set.
train_text = raw_train_ds.map(clean_text)

# Finally, let the vectorize layers adjust themselves to fit the vocabulary of the dataset.
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

See the result of using these layers to preprocess data:

In [None]:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [None]:
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [None]:
# Retrieve a batch (of 32 reports and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_report, first_label = text_batch[0], label_batch[0]
print("Report", first_report)
print("Label", first_label)

Report tf.Tensor(b'CLINICAL HISTORY:  A 28-year-old woman with a 7-year history of epilepsy.  Seizures increased 1 month ago.  Last seizure was 12 days ago.  The patient reports falling back, loss of consciousness, generalized tonic-clonic activity followed by confusion.  The patient had difficulty remembering to take her medications and past history of drug abuse.\nMEDICATIONS:  Keppra.\nINTRODUCTION:  A digital video EEG was performed in the lab using the standard 10-20 system of electrode placement with one channel of EKG.  Hyperventilation and photic stimulation were completed.  This is an awake and asleep record.\nDESCRIPTION OF THE RECORD:  In wakefulness, the background EEG includes a 10.5 Hz alpha rhythm and a background rich in beta activity.  Hyperventilation produces a small amount of theta.  Rhythmic data is noted as the patient becomes drowsy.  She transitions in and out of drowsiness.\nPhotic stimulation elicited bilateral driving.\nHeart rate:  72 BPM.\nIMPRESSION:  This

In [None]:
print("'binary' vectorized report:", 
      binary_vectorize_text(first_report, first_label)[0])

'binary' vectorized report: tf.Tensor([[0. 1. 1. ... 0. 0. 0.]], shape=(1, 6475), dtype=float32)


In [None]:
print("'int' vectorized report:",
      int_vectorize_text(first_report, first_label)[0])

'int' vectorized report: tf.Tensor(
[[  12   18    6 1785  113    7    6 4027   18    3   76   23  444   55
   527  245  197   94   11  417  605  245    2   17  757  599  566  174
     3  213  129  339   38  232   95  241    2   17  177  775 4647   10
  2319  216   34    4  183   18    3 1195  546   34  140   48    6   60
    59    9   11   13    8    2   83   50    2   45   46   44    3   29
    52    7  142   72    3   37   31    4   20   19   26  214   16    5
    25   86    4  164   14   42    3    2   14    8   35    2   41    9
   280    6  513   39   70   32    4    6   41 1247    8   61   38   31
   156    6  137  112    3   80  120 1315    5   79   66    2   17  370
   138  153  357    8    4  173    3   78   20   19  465  159  104  123
   125  246   89   43   16    5    6   36    9    8   35  383   68   98
    22   12   49   16    5    2  479   36    9   54   16   17  184   36
     9  904   55  763    8   16  310    7  320   23    4   98   36  333
   114    9  323    5 1885  

As you can see above, `binary` mode returns an array denoting which tokens exist at least once in the input, while `int` mode replaces each token by an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer.

In [None]:
print("12 ---> ", int_vectorize_layer.get_vocabulary()[12])
print("18 ---> ", int_vectorize_layer.get_vocabulary()[18])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

12 --->  clinical
18 --->  history
Vocabulary size: 6491


You are nearly ready to train your model. As a final preprocessing step, you will apply the `TextVectorization` layers you created earlier to the train, validation, and test datasets.

In [None]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

### Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

- `Dataset.cache` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
- `Dataset.prefetch` overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the *Prefetching* section of the [Better performance with the tf.data API](https://www.tensorflow.org/guide/data_performance) guide.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

### Train the model

It's time to create your neural network.

For the `'binary'` vectorized data, define a simple bag-of-words linear model, then configure and train it:

In [None]:
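# A single Dense layer on the bag-of-words vector is a linear model. It has 4 output
# units although the dataset has only 2 classes (labels 0 and 1), so two of the
# logits never correspond to a target label.
binary_model = tf.keras.Sequential([layers.Dense(4)])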

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Next, you will use the `'int'` vectorized layer to build a 1D ConvNet:

In [None]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

In [None]:
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Compare the two models:

In [None]:
print("Linear model on binary vectorized data:")
binary_model.summary()  # summary() prints the table itself and returns None

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 4)                 25904     
                                                                 
Total params: 25,904
Trainable params: 25,904
Non-trainable params: 0
_________________________________________________________________


In [None]:
print("ConvNet model on int vectorized data:")
int_model.summary()  # summary() prints the table itself and returns None

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
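
The parameter counts in the two summaries can be checked by hand. The sketch below assumes the binary input width of 6475 printed earlier (`shape=(1, 6475)`) and the embedding vocabulary of `VOCAB_SIZE + 1 = 10001`; the helper functions are illustrative and not part of Keras.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index, no bias.
    return vocab_size * embedding_dim

def conv1d_params(kernel_size, in_channels, filters):
    # Each filter spans kernel_size positions across all input channels, plus a bias.
    return kernel_size * in_channels * filters + filters

binary_total = dense_params(6475, 4)
int_total = (embedding_params(10001, 64)
             + conv1d_params(5, 64, 64)
             + dense_params(64, 4))

print(binary_total)  # 25904
print(int_total)     # 660868
```

Almost all of the ConvNet's parameters sit in the embedding table, which is why it is roughly 25 times larger than the linear model.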


Evaluate both models on the test data:

In [None]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 98.55%
Int model accuracy: 99.64%
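
Since the eval set contains only 276 reports, these percentages correspond to a handful of individual errors. The short calculation below is a sketch that assumes every one of the 276 eval reports was scored; the function name is made up for illustration.

```python
n_eval = 276  # number of files found in the 'eval' folder

def accuracy_for_errors(n_errors, n_total=n_eval):
    """Format the accuracy obtained with a given number of wrong predictions."""
    return "{:2.2%}".format((n_total - n_errors) / n_total)

print(accuracy_for_errors(4))  # 98.55%
print(accuracy_for_errors(1))  # 99.64%
```

Under that assumption the two models differ by three reports, so the gap between them should be read with caution.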


Note: This example dataset represents a rather simple classification problem. More complex datasets and problems bring out subtle but significant differences in preprocessing strategies and model architectures. Be sure to try out different hyperparameters and epochs to compare various approaches.

### Export the model

In the code above, you applied `tf.keras.layers.TextVectorization` to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the `TextVectorization` layer inside your model.

To do so, you can create a new model using the weights you have just trained:

In [None]:
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 98.55%


Now, your model can take raw strings as input and predict a score for each label using `Model.predict`. Define a function to find the label with the maximum score:

In [None]:
def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
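  # `class_names` lives on the dataset returned by text_dataset_from_directory,
  # not on the datasets produced by take() or skip().
  predicted_labels = tf.gather(train_and_val_ds.class_names, predicted_int_labels)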
  return predicted_labels
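
The same argmax-then-lookup idea can be illustrated without TensorFlow. In this sketch the score rows are made up and stand in for what `Model.predict` might return; each row has four entries because the models above have four output units, while only indices 0 and 1 have class names.

```python
class_names = ["abnormal", "normal"]  # index 0 and index 1, as printed earlier

def argmax_labels(scores_batch, names):
    """Return the class name with the highest score for each row of scores."""
    labels = []
    for scores in scores_batch:
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        labels.append(names[best_index])
    return labels

# Hypothetical scores for two reports (not real model output).
made_up_scores = [[0.91, 0.12, 0.01, 0.02],
                  [0.08, 0.87, 0.03, 0.01]]
print(argmax_labels(made_up_scores, class_names))  # ['abnormal', 'normal']
```

If index 2 or 3 ever had the highest score, there would be no matching class name, which is one reason to give the output layer exactly as many units as there are classes.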

Including the text preprocessing logic inside your model lets you export a model for production that simplifies deployment and reduces the potential for [training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew).

There is a performance difference to keep in mind when choosing where to apply `tf.keras.layers.TextVectorization`. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the `TextVectorization` layer inside your model when you're ready to prepare for deployment.

Visit the [Save and load models](https://www.tensorflow.org/tutorials/keras/save_and_load) tutorial to learn more about saving models.

# Rule-Based (non-ML) Approach

Looking through the reports, it seems as though it's usually stated quite clearly when the EEG is abnormal. Rather than attempting any machine learning, why don't we just look for that key word (or related words/phrases) in the text? This approach is implemented below.
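
Before running the rule over the dataset, it can be isolated as a small function and tried on a few strings. This is a sketch only: the function name and the three example reports are made up for illustration, and labels follow the convention above (0 is abnormal, 1 is normal).

```python
def rule_based_label(report):
    """Return 0 (abnormal) or 1 (normal) using a simple keyword rule."""
    text = report.lower()
    if "no abnormal" in text:
        return 1
    if "abnormal" in text:
        return 0
    return 1

# Made-up reports paired with the label a human reader would assign.
examples = [
    ("IMPRESSION: Normal EEG.", 1),
    ("IMPRESSION: Abnormal EEG due to focal slowing.", 0),
    ("IMPRESSION: Normal EEG. No focal abnormalities were seen.", 1),
]
for report, true_label in examples:
    print(rule_based_label(report), true_label)
```

The third example is misclassified as abnormal because "abnormalities" contains the keyword while the exact phrase "no abnormal" is absent, which hints at the kind of mistakes a keyword rule can make.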

In [None]:
# First initialise some counters
n = 0
n_correct = 0
n_failed_decode = 0

# Iterate over all batches, taking the text and labels batch-by-batch.
# N.B. take(-1) pulls out all the batches instead of a specific number, as explained in the docs:
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take
for text_batch, label_batch in train_and_val_ds.take(-1):
# for text_batch, label_batch in raw_test_ds.take(-1):

  # Iterate over the report examples in the batch:
  for ind,text in enumerate(text_batch):

    # Get rid of any pesky non-standard characters using the function we created previously.
    cleaned_text = clean_text(text,0)
    # Then convert it from a tensorflow Tensor to a python string so that we can 
    # use some standard python text analysis on it.
    cleaned_and_decoded_text = cleaned_text.numpy().decode("UTF-8")

    # Check if the word 'abnormal' is in the report, and label it accordingly
    # (label 0 is 'abnormal', label 1 is 'normal').
    report_lower = cleaned_and_decoded_text.lower()
    if 'no abnormal' in report_lower:
      predicted_label = 1
    elif 'abnormal' in report_lower:
      predicted_label = 0
    else:
      predicted_label = 1
      
    # If we predicted correctly, add one to our count of correct predictions.
    if predicted_label==label_batch[ind]:
      n_correct = n_correct+1
    else:
      # Print the cases where we were wrong so they can be inspected
      # (comment these lines out to silence the output).
      print("--- Wrong example ---")
      print(text.numpy().decode("UTF-8"))
      print()
      print("---------------------")
      print(f"The above example was classified with label {predicted_label} but its actual label is {label_batch[ind].numpy()}.")
      print("---------------------")

    # Add one to our count of the total number of examples examined.
    n = n+1

print(f"Accuracy = {100*n_correct/n} percent ({n_correct} correct predictions out of {n}).")

Streaming output truncated to the last 5000 lines.
IMPRESSION:  Normal wake and sleep EEG.
CLINICAL CORRELATION: No seizures or epileptiform discharges were clearly seen.





---------------------
The above example was classified with label 0 but its actual label is 1.
---------------------
--- Wrong example ---
REASON FOR STUDY:  Seizures.
CLINICAL HISTORY:  This is a 51-year-old woman with history of epilepsy, syncope, seizure free for 3 years, characterized by left sided shaking with loss of consciousness and generalized convulsions.

MEDICATIONS:  Dilantin.
INTRODUCTION:  A routine EEG was performed using the standard 10-20 electrode placement system with additional anterior temporal and single lead EKG electrode.  The patient was recorded during wakefulness and drowsiness.  Activating procedures included hyperventilation and photic stimulation.
TECHNICAL DIFFICULTIES:  There were some A1 electrode artifacts which was fixed during the recording.
DESCRIPTION OF THE R