# <center><font color='blue'>SkimLit</font></center>

## Table of contents
- [1 - Objectives](#1)
- [2 - Setup](#2)
- [3 - Data Loading and Visualization](#3)
- [4 - Data pre-processing](#4)
    - [4.1. - Formatting our data](#4.1)
    - [4.2. - More visualization](#4.2)
    - [4.3. - Categorical Data](#4.3)
    - [4.4. - Pre-processing for NLP](#4.4)
    - [4.5. - Creating TensorFlow datasets](#4.5)
- [5 - Models](#5)
    - [5.1. - Model 1](#5.1)
    - [5.2. - Model 2](#5.2)
    - [5.3. - Model 3](#5.3)
    - [5.4. - Model 4](#5.4)
    - [5.5. - Model 5](#5.5)
    - [5.6. - Model 6](#5.6)
    - [5.7. - Model 7](#5.7)
- [6 - Comparing the models and choosing the best one](#6)
- [7 - Hyperparameter tuning](#7)
- [8 - Predictions with the final model](#8)
- [9 - Saving the model](#9)
- [10 - Conclusions](#10)

<a name="1"></a>
## <b> <font color='blue'> 1. Objectives </font> </b>
Build an NLP model to make reading medical abstracts easier.

The paper we're replicating (the source of the dataset that we'll be using) is available here: https://arxiv.org/abs/1710.06071



<a name="2"></a>
## <b> <font color='blue'> 2. Setup </font> </b>

The goal is to assign a label (objective, background, result...) to each sentence (a sequence of many words), so this is a many-to-one problem.

### Modules

In [1]:
# suppress TensorFlow info and warning messages
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 


In [2]:
import csv
import random
import re

import numpy as np

import pandas as pd
import seaborn as sns
import string
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import tensorflow as tf
from tensorflow.keras import layers, callbacks, models, Sequential, losses
from tensorflow import keras


In [None]:
# random seed
tf.random.set_seed(42)

<a name="3"></a>
## <b> <font color='blue'> 3.  Data Loading and Visualization </font> </b>

Let's download the data.

We can do so from the authors' GitHub repository: https://github.com/Franck-Dernoncourt/pubmed-rct

In [None]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!dir pubmed-rct #ls

There are two datasets: one with 20,000 abstracts (useful for the initial tests) and another with 200,000 abstracts.

In [None]:
# Check what files are in the PubMed_20K dataset
!dir pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign

In [None]:
# Start our experiments using the 20k dataset with numbers replaced by "@" sign
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [None]:
# Check all of the filenames in the target directory
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

Let's write a function to read in all of the lines of a target text file.

In [None]:
# Create function to read the lines of a document
def get_lines(filename):
  """
  Reads filename (a text filename) and returns the lines of text as a list.

  Args:
    filename: a string containing the target filepath.

  Returns:
    A list of strings with one string per line from the target filename.
  """
  with open(filename, "r") as f:
    return f.readlines()

In [None]:
# Let's read in the training lines and see some of them
train_lines = get_lines(data_dir+"train.txt") # read the lines of the training file

to_show = 15
train_lines[:to_show]

We see that each abstract:

- Starts with an ID line that begins with "###" and ends with a newline character (\n)
- Contains one sentence per line, each starting with its label (for example RESULTS, METHODS...) followed by a tab (\t) and the sentence text
- Ends with a blank line (a lone newline character, \n).

<br>
We need a function to separate the text from the labels and the different abstracts.
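
As a minimal sketch of the layout described above, the snippet below parses a tiny hand-written sample. The abstract ID and both sentences are made up for illustration; only the structure (`###` ID line, `LABEL\ttext` lines, closing blank line) follows the dataset.

```python
# Hypothetical two-sentence abstract written in the PubMed RCT file layout
sample_lines = [
    "###12345678\n",                               # ID line (made-up ID)
    "OBJECTIVE\tTo test a made-up treatment .\n",  # label, tab, sentence
    "RESULTS\tThe made-up treatment worked .\n",
    "\n",                                          # blank line closes the abstract
]

parsed = []
for line in sample_lines:
    if line.startswith("###") or line.isspace():
        continue  # skip the ID line and the closing blank line
    label, text = line.rstrip("\n").split("\t", 1)  # split the label from the sentence
    parsed.append((label, text))

print(parsed)
```

Splitting on the first tab only (`split("\t", 1)`) is a defensive choice in this sketch, in case a sentence ever contained a tab itself.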

<a name="4"></a>
## <b> <font color='blue'> 4.  Data pre-processing </font> </b>


<a name="4.1"></a>
### <b> <font color='#1F618D'> 4.1. Formatting our data </font> </b>

We want our data to look like this:

```
[{'line_number': 0,
  'target': 'BACKGROUND',
  'text': 'emotional eating is associated with overeating and the development of obesity .',
  'total_lines': 11},
 ...]
```

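`total_lines` is the index of the last line in the abstract (the number of lines minus one, since counting starts at 0). Together with `line_number`, it tells us where a sentence sits within the abstract that we want to classify sequentially.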

Let's write a function which turns each of our datasets into the above format so we can continue to prepare our data for modelling.


In [None]:
def preprocess_text_with_line_numbers(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in filename, reads its contents and sorts through each line,
  extracting things like the target label, the text of the sentence,
  how many sentences are in the current abstract and what sentence
  number the target line is.
  """
  input_lines = get_lines(filename) # get all lines from filename
  abstract_lines = "" # create an empty abstract
  abstract_samples = [] # create an empty list of abstracts

  # Loop through each line in the target file
  for line in input_lines:
    if line.startswith("###"): # check to see if this is an ID line
      abstract_id = line
      abstract_lines = "" # reset the abstract string if the line is an ID line

    elif line.isspace(): # check to see if line is a new line
      abstract_line_split = abstract_lines.splitlines() # split abstract into separate lines

      # Iterate through each line in a single abstract and count them at the same time
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {} # create an empty dictionary for each line
        target_text_split = abstract_line.split("\t") # split target label from text 
        line_data["target"] = target_text_split[0] # get target label
        line_data["text"] = target_text_split[1].lower() # get target text and lower it
        line_data["line_number"] = abstract_line_number # what number line does the line appear in the abstract?
        line_data["total_lines"] = len(abstract_line_split) - 1 # how many total lines are there in the target abstract? (start from 0)
        abstract_samples.append(line_data) # add line data to abstract samples list

    else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
      abstract_lines += line
  
  return abstract_samples

          

In [None]:
# Get data from file and preprocess it
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt") # dev is another name for validation dataset
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")

print(len(train_samples), len(val_samples), len(test_samples))

In [None]:
# Check the first abstract of our training data
train_samples[:12]

Let's create dataframes:

In [None]:
train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)

train_df.head(5)

<a name="4.2"></a>
### <b> <font color='#1F618D'> 4.2. More visualization </font> </b>

#### Number of classes and class balance

In [None]:
num_classes = train_df['target'].nunique()
num_classes

In [None]:
# Distribution of labels in training data
train_df.target.value_counts(normalize=True)

#### Total lines distribution

In [None]:
# Let's check the distribution of the number of lines per abstract
train_df.total_lines.plot.hist();


#### Sentences

In [None]:
# Convert abstract text lines into lists
train_sentences = train_df["text"].tolist()
val_sentences = val_df["text"].tolist()
test_sentences = test_df["text"].tolist()
len(train_sentences), len(val_sentences), len(test_sentences)

In [None]:
# View 5 lines of training sentences
train_sentences[:5]

In [None]:
# How long is each sentence on average?
sent_lens = [len(sentence.split()) for sentence in train_sentences]
avg_sent_len = np.mean(sent_lens)
avg_sent_len

In [None]:
# What does the distribution look like?

# Create the histogram
n, bins, patches = plt.hist(sent_lens, bins=5, edgecolor='black')

# Colors for each bar
colors = ['blue', 'cyan', 'green', 'purple', 'orange']

# Assign a color to each bar
for patch, color in zip(patches, colors):
    patch.set_facecolor(color)

# Show the plot
plt.show()



In [None]:
# What sentence length covers 95% of the examples?
output_seq_len = int(np.percentile(sent_lens, 95))
output_seq_len
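
To make the 95% cutoff more concrete, here is a rough pure-Python sketch. It uses a nearest-rank percentile rather than NumPy's default linear interpolation, so the two can differ slightly, and the sentence lengths are hypothetical values, not taken from the dataset.

```python
def nearest_rank_percentile(values, q):
    """Smallest value v such that at least q percent of `values` are <= v."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q * n / 100
    return ordered[rank - 1]

# Hypothetical sentence lengths in tokens, with one very long outlier
toy_lengths = [5, 8, 12, 12, 15, 18, 20, 22, 25, 30,
               31, 33, 35, 38, 40, 42, 45, 50, 55, 300]

cutoff = nearest_rank_percentile(toy_lengths, 95)
covered = sum(length <= cutoff for length in toy_lengths) / len(toy_lengths)
print(cutoff, covered)
```

Choosing the cutoff this way keeps almost every sentence intact while preventing a few very long outliers from forcing every sequence to be padded to the maximum length.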

In [None]:
# Maximum sequence length in the training set
max(sent_lens)

<a name="4.3"></a>
### <b> <font color='#1F618D'> 4.3. Categorical Data </font> </b>

We will use one-hot encoding for our targets, since there is no ordinal relationship between them.
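
As an illustration of what one-hot encoding produces, the sketch below encodes a few labels by hand. The helper `one_hot` is hypothetical and only mimics two behaviours of scikit-learn's `OneHotEncoder`: categories are sorted, and unknown labels become an all-zero row (as with `handle_unknown='ignore'`).

```python
toy_labels = ["OBJECTIVE", "METHODS", "RESULTS", "METHODS"]

# Categories are learned from the training labels only and sorted alphabetically
categories = sorted(set(toy_labels))
index_of = {category: i for i, category in enumerate(categories)}

def one_hot(label):
    """Return a one-hot row for `label`; unseen labels give an all-zero row."""
    row = [0.0] * len(categories)
    if label in index_of:
        row[index_of[label]] = 1.0
    return row

encoded = [one_hot(label) for label in toy_labels]
print(categories)
print(encoded)
```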

In [None]:
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore') # we want non-sparse matrix
train_labels_one_hot = one_hot_encoder.fit_transform(train_df["target"].to_numpy().reshape(-1, 1))

# no fitting here: the encoder is fit on the training data only and reused for validation and test
val_labels_one_hot = one_hot_encoder.transform(val_df["target"].to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df["target"].to_numpy().reshape(-1, 1))

# check what one hot encoded labels look like
train_labels_one_hot

<a name="4.4"></a>
### <b> <font color='#1F618D'> 4.4. Pre-processing for NLP </font> </b>

#### Create text vectorizer layer

In [None]:
# How many words are in our vocab? (taken from table 2 in: https://arxiv.org/pdf/1710.06071.pdf)
max_tokens = 68000

In [None]:
#from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.layers import TextVectorization


# Create text vectorizer
text_vectorizer = TextVectorization(max_tokens=max_tokens, # number of words in vocabulary
                                    output_sequence_length=output_seq_len) # desired output length of vectorized sequences

In [None]:
# Adapt text vectorizer to training sentences
text_vectorizer.adapt(train_sentences)

In [None]:
target_sentence = random.choice(train_sentences)
print(f"Text:\n{target_sentence}")
print(f"\nLength of text: {len(target_sentence.split())}")
print(f"\nVectorized text: {text_vectorizer([target_sentence])}")

Note that it pads with zeros up to the specified output sequence length (`output_seq_len`, which is 55 in this case).
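
The padding behaviour can be sketched in plain Python. The helper `pad_or_truncate` is a hypothetical stand-in for what the vectorizer does after mapping words to integer IDs: shorter sequences are padded with zeros at the end, and longer ones are cut off.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return `token_ids` padded at the end with `pad_id`, or cut, to exactly `length` items."""
    return (token_ids + [pad_id] * length)[:length]

short_sentence = [7, 3, 9]            # hypothetical token IDs
long_sentence = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_sentence, 5))
print(pad_or_truncate(long_sentence, 5))
```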

In [None]:
# How many words in our training vocabulary
rct_20k_text_vocab = text_vectorizer.get_vocabulary()
print(f"Number of words in vocab: {len(rct_20k_text_vocab)}")
print(f"Most common words in the vocab: {rct_20k_text_vocab[:5]}")
print(f"Least common words in the vocab: {rct_20k_text_vocab[-5:]}")

In [None]:
# Get the config of our text vectorizer
text_vectorizer.get_config()

<b> 
We will apply it later as the first layer of the model after obtaining the input. </b>

<a name="4.5"></a>
### <b> <font color='#1F618D'> 4.5. Creating TensorFlow datasets </font> </b>

In [None]:
# Turn our data into TensorFlow Datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
valid_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))


In [None]:
for x, y in train_dataset.take(1):
    print(f"Text: {x}\n")
    print(f"Label: {y}")

In [None]:
# Take the TensorSliceDatasets and turn them into batched, prefetched datasets
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
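
The effect of batching on dataset length can be sketched without TensorFlow. The number of examples below is hypothetical. After `.batch(32)`, `len()` of a dataset counts batches rather than individual examples, which matters for the `steps_per_epoch=int(0.1*len(train_dataset))` argument used when fitting the models. Prefetching does not change the contents; it only prepares the next batch while the current one is being processed.

```python
import math

def batches(items, batch_size):
    """Yield consecutive chunks of `items`; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_examples = 1000   # hypothetical dataset size
batch_size = 32

batch_sizes = [len(batch) for batch in batches(list(range(num_examples)), batch_size)]
num_batches = math.ceil(num_examples / batch_size)
steps_for_10_percent = int(0.1 * num_batches)  # steps needed to see about 10% of the batches

print(num_batches, batch_sizes[-1], steps_for_10_percent)
```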

<a name="5"></a>
## <b> <font color='blue'> 5. Models </font> </b>

In [None]:
# to save results and compare
results = {}

In [None]:
# constants
INPUT_SHAPE=(1,)
BATCH_SIZE = 32
NUM_CLASSES=num_classes

<a name="5.1"></a>
### <b> <font color='#1F618D'> 5.1. Embedding layer </font> </b>

In [None]:
# Create token embedding layer
token_embed = layers.Embedding(input_dim=len(rct_20k_text_vocab), # length of vocabulary
                               output_dim=128, # Note: different embedding sizes result in drastically different
                                               # numbers of parameters to train
                               mask_zero=True, # mask the padding token (0) so layers that support masking can skip padded positions
                               name="token_embedding")

In [None]:
# Show example embedding
print(f"Sentence before vectorization:\n {target_sentence}\n")
vectorized_sentence = text_vectorizer([target_sentence])
print(f"Sentence after vectorization (before embedding):\n {vectorized_sentence}\n")
embedded_sentence = token_embed(vectorized_sentence)
print(f"Sentence after embedding:\n {embedded_sentence}\n")
print(f"Embedded sentence shape: {embedded_sentence.shape}")
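
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary entry. The sketch below uses toy sizes and random values (all hypothetical) to show the lookup, the padding mask that `mask_zero=True` corresponds to, and how the parameter count grows with the embedding size.

```python
import random

random.seed(42)

vocab_size, embed_dim = 6, 4  # toy sizes for illustration
# One row of small random weights per token ID
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

token_ids = [3, 5, 1, 0, 0]                     # a vectorized sentence; 0 is padding
embedded = [table[token_id] for token_id in token_ids]
mask = [token_id != 0 for token_id in token_ids]  # True for real tokens, False for padding

num_params = vocab_size * embed_dim
# Upper bound for the notebook's layer if the vocabulary reached max_tokens
max_word_embedding_params = 68_000 * 128

print(len(embedded), len(embedded[0]), mask, num_params, max_word_embedding_params)
```

Doubling `output_dim` doubles the number of weights, which is why the embedding size has such a large impact on the total parameter count.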

<a name="5.2"></a>
### <b> <font color='#1F618D'> 5.2. Trying different models </font> </b>

<a name="5.2.1"></a>
### <b> <font color='#5499C7'> 5.2.1. Model 1: Conv1D </font> </b>

In [None]:
def build_model_1(name, input_shape = INPUT_SHAPE, num_classes = NUM_CLASSES):
    inputs = layers.Input(shape=input_shape,dtype=tf.string)
    x = text_vectorizer(inputs)
    x = token_embed(x)
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    model = tf.keras.Model(inputs,outputs,name=name)
    return model


model_1 = build_model_1('model_1')

model_1.summary()
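
To illustrate what the `Conv1D` and `GlobalAveragePooling1D` layers compute, here is a simplified pure-Python sketch with a single filter. The helpers and the toy numbers are hypothetical. The real layer applies 64 filters of width 5 to 128-dimensional embeddings, and the pooling then averages each filter's outputs over all positions, giving one value per filter.

```python
def conv1d_same_relu(sequence, kernel, bias=0.0):
    """Single-filter 1D convolution with "same" zero padding and ReLU.

    sequence: list of T vectors (each of length D)
    kernel:   list of K vectors (each of length D)
    Returns a list of T floats, one per position.
    """
    k = len(kernel)
    dim = len(sequence[0])
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    zero = [0.0] * dim
    padded = [zero] * pad_left + sequence + [zero] * pad_right
    outputs = []
    for t in range(len(sequence)):
        window = padded[t:t + k]
        total = bias
        for kernel_vec, input_vec in zip(kernel, window):
            total += sum(w * x for w, x in zip(kernel_vec, input_vec))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def global_average_pool(values):
    """Average a feature map over all positions."""
    return sum(values) / len(values)

toy_sequence = [[1.0], [2.0], [3.0], [4.0]]  # 4 positions, 1 feature each
toy_kernel = [[1.0], [1.0], [1.0]]           # width-3 filter that sums its window

feature_map = conv1d_same_relu(toy_sequence, toy_kernel)
pooled = global_average_pool(feature_map)
print(feature_map, pooled)
```

With "same" padding the feature map keeps the input length, and the pooling step collapses it to a fixed-size vector regardless of the sentence length.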

In [None]:
model_1.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(), # one-hot encoded labels
    metrics=['accuracy', 'Precision', 'Recall']
)


history_model_1 = model_1.fit(
    train_dataset,
    steps_per_epoch=int(0.1*len(train_dataset)),
    epochs=3,
    # batch_size is not passed here: the tf.data.Dataset is already batched
    validation_data=valid_dataset,
    validation_steps=int(0.1 * len(valid_dataset))
)

In [None]:
# evaluate
model_1.evaluate(test_dataset)

In [None]:
# Make predictions (our model predicts prediction probabilities for each class)
model_1_pred_probs = model_1.predict(valid_dataset)
model_1_pred_probs, model_1_pred_probs.shape

In [None]:
# Convert pred probs to classes
model_1_preds = tf.argmax(model_1_pred_probs, axis=1)
model_1_preds
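
The argmax step can be sketched in plain Python, together with the mapping from class index back to label name. The probabilities are made up, and the hard-coded `class_names` list is an assumption (the five PubMed RCT labels in alphabetical order); in the notebook the names could instead be read from `one_hot_encoder.categories_[0]`.

```python
# Assumed label order: alphabetical, as produced by the one-hot encoder
class_names = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]

# Hypothetical prediction probabilities for three sentences (each row sums to 1)
pred_probs = [
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.70, 0.05, 0.05, 0.15, 0.05],
    [0.05, 0.15, 0.20, 0.10, 0.50],
]

# Index of the highest probability in each row (equivalent to argmax over axis 1)
pred_ids = [max(range(len(row)), key=row.__getitem__) for row in pred_probs]
pred_labels = [class_names[i] for i in pred_ids]
print(pred_ids, pred_labels)
```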

<a name="5.2.2"></a>
### <b> <font color='#5499C7'> 5.2.2. Model 2: Feature extraction with pre-trained token embeddings </font> </b>

In [None]:
import tensorflow_hub as hub

In [None]:
tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        trainable=False,
                                        name="universal_sentence_encoder")

In [None]:
# Test out the pretrained embedding on a random sentence 
random_train_sentence = random.choice(train_sentences)
print(f"Random sentence:\n {random_train_sentence}")
use_embedded_sentence = tf_hub_embedding_layer([random_train_sentence])
print(f"Sentence after embedding:\n{use_embedded_sentence[0][:30]}\n")
print(f"Length of sentence embedding: {len(use_embedded_sentence[0])}")

In [None]:
def build_model_2(name, input_shape=(), num_classes=NUM_CLASSES):
    # The Universal Sentence Encoder expects a 1-D batch of raw strings,
    # so each example is a scalar string (shape=()) rather than shape (1,)
    inputs = layers.Input(shape=input_shape, dtype=tf.string)
    x = tf_hub_embedding_layer(inputs)  # call the layer instance directly on the inputs
    x = tf.keras.layers.Dense(128,activation='relu')(x)
    outputs = tf.keras.layers.Dense(num_classes,activation='softmax')(x)
    model = tf.keras.Model(inputs,outputs,name=name)
    return model


model_2 = build_model_2('model_2_USE_feature_extractor')
    
model_2.summary()

In [None]:
model_2.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(), # one-hot encoded labels
    metrics=['accuracy']
)


history_model_2 = model_2.fit(
    train_dataset,
    steps_per_epoch=int(0.1*len(train_dataset)),
    epochs=3,
    validation_data=valid_dataset,
    validation_steps=int(0.1 * len(valid_dataset))
)

In [None]:
# Make predictions with feature extraction model
model_2_pred_probs = model_2.predict(valid_dataset)
model_2_pred_probs

In [None]:
# Convert the prediction probabilities found with feature extraction model to labels
model_2_preds = tf.argmax(model_2_pred_probs, axis=1)
model_2_preds

<a name="5.2.3"></a>
### <b> <font color='#5499C7'> 5.2.3. Model 3: Conv1D with character embeddings </font> </b>

#### Creating a character-level tokenizer

In [None]:
train_sentences[:5]

In [None]:
# Make function to split sentences into characters
def split_chars(text):
  return " ".join(list(text))
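
A quick check of what `split_chars` returns, using a made-up sentence. Python's `str.split()` stands in here for whitespace tokenization: note that the original spaces between words end up as runs of whitespace and do not become tokens of their own.

```python
def split_chars(text):
    """Insert a space between every character of `text`."""
    return " ".join(list(text))

spaced = split_chars("to test @ mg")
tokens = spaced.split()  # whitespace splitting drops the original spaces

print(repr(spaced))
print(tokens)
```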

In [None]:
# Split sequence-level data splits into character-level data splits
train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]
train_chars[:5]

In [None]:
# What's the average character length?
char_lens = [len(sentence) for sentence in train_sentences]
mean_char_len = np.mean(char_lens)
mean_char_len

In [None]:
# Create the histogram
n, bins, patches = plt.hist(char_lens, bins=5, edgecolor='black')

# Colors for each bar
colors = ['blue', 'cyan', 'green', 'purple', 'orange']

# Assign a color to each bar
for patch, color in zip(patches, colors):
    patch.set_facecolor(color)

# Show the plot
plt.show()

In [None]:
# Find what character length covers 95% of sequences
output_seq_char_len = int(np.percentile(char_lens, 95))
output_seq_char_len

In [None]:
# Get all keyboard characters
alphabet = string.ascii_lowercase + string.digits + string.punctuation
alphabet

In [None]:
# Create char-level token vectorizer instance
NUM_CHAR_TOKENS = len(alphabet) + 2 # add 2 for space and OOV token (OOV = out of vocab, '[UNK]')

char_vectorizer = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                                    output_sequence_length=output_seq_char_len,
                                    # standardize=None, # set standardization to "None" if you want to leave punctuation in
                                    name="char_vectorizer")
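
The sketch below works through the token budget and approximates the effect of leaving standardization at its default. `approx_standardize` is a hypothetical stand-in (lowercasing and removing `string.punctuation`), not the layer's actual implementation, and the sample text is made up. It suggests why the learned character vocabulary can end up much smaller than `NUM_CHAR_TOKENS` when punctuation is stripped.

```python
import string

alphabet = string.ascii_lowercase + string.digits + string.punctuation
num_char_tokens = len(alphabet) + 2  # 26 letters + 10 digits + 32 punctuation marks + 2

def approx_standardize(text):
    """Rough stand-in for a lowercase-and-strip-punctuation step."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

sample = "t h e   d o s e   w a s   @ m g / d a y ."  # hypothetical charified text
kept = sorted(set(approx_standardize(sample).split()))

print(num_char_tokens)
print(kept)
```

In this sample the characters `@`, `/` and `.` disappear, so they would never enter the vocabulary unless standardization is disabled.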

In [None]:
# Adapt character vectorizer to training characters
char_vectorizer.adapt(train_chars)

In [None]:
# Check character vocab stats
char_vocab = char_vectorizer.get_vocabulary()
print(f"Number of different characters in character vocab: {len(char_vocab)}")
print(f"5 most common characters: {char_vocab[:5]}")
print(f"5 least common characters: {char_vocab[-5:]}")

In [None]:
# Test out character vectorizer
random_train_chars = random.choice(train_chars)
print(f"Charified text:\n {random_train_chars}")
print(f"\nLength of random_train_chars: {len(random_train_chars.split())}")
vectorized_chars = char_vectorizer([random_train_chars])
print(f"\nVectorized chars:\n {vectorized_chars}")
print(f"\nLength of vectorized chars: {len(vectorized_chars[0])}")

#### Creating a character-level embedding

In [None]:
# Create char embedding layer
char_embed = layers.Embedding(input_dim=len(char_vocab), # number of different characters
                              output_dim=25, # this is the size of the char embedding in the paper: https://arxiv.org/pdf/1612.05251.pdf (Figure 1)
                              mask_zero=True,
                              name="char_embed")

In [None]:
# Test our character embedding layer
print(f"Charified text:\n {random_train_chars}\n")
char_embed_example = char_embed(char_vectorizer([random_train_chars]))
print(f"Embedded chars (after vectorization and embedding):\n {char_embed_example}\n")
print(f"Character embedding shape: {char_embed_example.shape}")

Each sentence is padded or truncated to 290 characters (`output_seq_char_len`), and each character is mapped to a 25-dimensional embedding vector.

In [None]:
# Create char level datasets
train_char_dataset = tf.data.Dataset.from_tensor_slices((train_chars, train_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
val_char_dataset = tf.data.Dataset.from_tensor_slices((val_chars, val_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
test_char_dataset = tf.data.Dataset.from_tensor_slices((test_chars, test_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)

train_char_dataset

In [None]:
def build_model_3(name, input_shape=INPUT_SHAPE, num_classes=NUM_CLASSES):
    inputs = layers.Input(shape=input_shape,dtype=tf.string)
    x = char_vectorizer(inputs)
    x = char_embed(x)
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x) # try MaxPooling!!
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    model = tf.keras.Model(inputs,outputs,name=name)
    return model

    
model_3 = build_model_3('model_3')

model_3.summary()

In [None]:
# Compile
model_3.compile(loss="categorical_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy", 'Precision', 'Recall'])

# Fit the model on chars only
model_3_history = model_3.fit(train_char_dataset,
                              steps_per_epoch=int(0.1*len(train_char_dataset)),
                              epochs=3,
                              validation_data=val_char_dataset,
                              validation_steps=int(0.1*len(val_char_dataset)))

In [None]:
model_3.evaluate(test_char_dataset)

In [None]:
# Make predictions with character model only
model_3_pred_probs = model_3.predict(val_char_dataset)
model_3_pred_probs

In [None]:
# Convert prediction probabilities to class labels
model_3_preds = tf.argmax(model_3_pred_probs, axis=1)
model_3_preds