# SkimLit: Skimming Literature with NLP

This notebook uses natural language processing techniques to analyze and classify sentences from scientific abstracts. The goal is to automate the extraction of important information from scientific papers, which is crucial for researchers and professionals who need to quickly review and understand the literature.

## Required Datasets

**PubMed 20k RCT Dataset**:

This notebook utilizes text files from the `PubMed_20k_RCT_numbers_replaced_with_at_sign` folder, which should be downloaded and stored in a `data` directory accessible by the notebook.

### How to Download Dataset

To access and set up the datasets, please follow these steps:

1. Create a `data` folder in your project directory if it doesn't already exist.
2. Download the text files from the following Kaggle dataset link:
   - [PubMed 20k RCT Dataset](https://www.kaggle.com/datasets/matthewjansen/pubmed-200k-rtc?select=PubMed_200k_RCT_numbers_replaced_with_at_sign) (use the files from the `PubMed_20k_RCT_numbers_replaced_with_at_sign` folder, and ensure you comply with the dataset's usage rules).
3. Place the downloaded text files (`train.txt`, `dev.txt`, and `test.txt`) into the `data` folder so the notebook can read them.
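
Once the files are in place, a quick check can confirm that the notebook will find them. The sketch below is a minimal example, assuming the split files are named `train.txt`, `dev.txt`, and `test.txt` (the names the notebook reads later); the helper name `missing_data_files` is hypothetical.

```python
from pathlib import Path

# File names the notebook reads later from the data directory.
EXPECTED_FILES = ("train.txt", "dev.txt", "test.txt")


def missing_data_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in expected if not (data_path / name).is_file()]


missing = missing_data_files("data")
if missing:
    print(f"Missing from data/: {missing}")
else:
    print("All dataset files found.")
```

An empty result means all three splits are available; otherwise the list names the files that still need to be copied.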

## Contents of the Notebook

- **Introduction**: Overview of the project's aim and importance.
- **Basic Exploratory Data Analysis**: Initial analysis of the data to understand the distribution and nature of the dataset.
- **Text Normalization**: Processing steps to clean and normalize the text data.
- **Model Building**: Implementation of various models to classify sentences in scientific abstracts.
- **Model Evaluation**: Evaluation of the models' performance using appropriate metrics.

## Install Required Packages

The CoreAI environment may not include every library this notebook needs. Follow the steps below to create a virtual environment and install the required libraries from the `requirements.txt` file:

### Create and Activate the Virtual Environment
   
   Open a terminal from within Jupyter: `File -> New -> Terminal`.
   
   Navigate to the project directory where you want to set up the environment.
   
   Run the following commands in a `bash` shell to create and activate the virtual environment, then register it as a Jupyter kernel:
   
   ```bash
   python3 -m venv --system-site-packages myvenv
   source myvenv/bin/activate
   pip3 install ipykernel
   python -m ipykernel install --user --name=myvenv --display-name="Python (myvenv)"
   ```

### Important Note

Switch the notebook to the new "Python (myvenv)" kernel before running any cells. If that kernel is not loaded, the libraries installed into the virtual environment will not be available, and the notebook will not function as expected.
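
One way to confirm which environment a kernel is using is to inspect the interpreter prefix from a notebook cell. The sketch below assumes the virtual environment directory is named `myvenv`, as in the commands above; the helper name `running_in_venv` is hypothetical.

```python
import sys
from pathlib import Path


def running_in_venv(venv_name="myvenv", prefix=None):
    """Return True if the interpreter prefix is a directory named venv_name."""
    prefix = sys.prefix if prefix is None else prefix
    return Path(prefix).name == venv_name


print(f"Interpreter prefix: {sys.prefix}")
print(f"Running inside 'myvenv': {running_in_venv()}")
```

If the second line prints `False`, the notebook is most likely still attached to the default kernel.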

### Install Required Libraries
   
   Before running the following command in the Jupyter notebook, make sure you are in the directory that contains both the notebook and the virtual environment, so that the relative `./` path resolves correctly. Use `cd` to change to your project directory and `pwd` to verify your current directory.
   

In [None]:
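# Activate the virtual environment and install the libraries listed in requirements.txt.
# Confirm that the tf_keras version matches the TF version (to avoid an unnecessary upgrade).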
!. ./myvenv/bin/activate; pip install -r requirements.txt

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as tfhub
import os
import random
import string
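from helper_functions import calculate_results, create_tensorboard_callback
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline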

In [None]:
LOGS_DIR = 'logs/'

In [None]:
data_dir_path = './data/'
train_dir_path = data_dir_path + 'train.txt'
test_dir_path = data_dir_path + 'test.txt'
validation_dir_path = data_dir_path + 'dev.txt'

In [None]:
filenames = [data_dir_path + filename for filename in os.listdir(data_dir_path)]
filenames

In [None]:
def read_doc(filename):
  """
  Reads filename (txt) and returns the lines of text as a list

  Args:

    filename: A string containing the target filepath

  Returns:

    A list of strings with one string per line from the target filename
  """
  with open(filename, 'r') as f:
    return f.readlines()


In [None]:
train_lines = read_doc(train_dir_path)
train_lines[:10]

In [None]:
def preprocess_doc(filename):
  """
  Parses a PubMed RCT text file into one dictionary per sentence.

  Each dictionary holds the sentence's label, its lowercased text, its
  line number within the abstract, and the index of the abstract's last line.

  Args:

    filename: A string containing the path to the file

  Returns:

    A list of dictionaries with preprocessed data from the file
  """
  input_lines = read_doc(filename)
  abstract_lines = ''
  abstract_samples = []

  for line in input_lines:
    if line.startswith('###'):
      # An ID line marks the start of a new abstract
      abstract_id = line
      abstract_lines = ''
    elif line.isspace():
      # A blank line marks the end of the abstract: emit one dictionary per sentence
      abstract_line_split = abstract_lines.splitlines()
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {}
        label_text_split = abstract_line.split('\t')
        line_data['line_number'] = abstract_line_number
        line_data['label'] = label_text_split[0]
        line_data['text'] = label_text_split[1].lower()
        # Zero-based index of the last line in this abstract
        line_data['total_lines'] = len(abstract_line_split) - 1
        abstract_samples.append(line_data)
    else:
      # Any other line is a 'LABEL<TAB>sentence' line of the current abstract
      abstract_lines += line

  return abstract_samples
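
To make the expected input format concrete, the sketch below parses a tiny made-up abstract held in memory. It assumes the layout that `preprocess_doc` relies on: an ID line starting with `###`, one `LABEL<TAB>sentence` line per sentence, and a blank line closing each abstract. The sample text and the helper name `parse_lines` are illustrative and not taken from the dataset.

```python
def parse_lines(input_lines):
    """Parse ID, 'LABEL<TAB>text', and blank-line records into per-sentence dicts."""
    samples, current = [], []
    for line in input_lines:
        if line.startswith("###"):      # new abstract: reset the buffer
            current = []
        elif line.isspace():            # blank line: the abstract is complete
            for number, sentence in enumerate(current):
                label, text = sentence.rstrip("\n").split("\t", 1)
                samples.append({
                    "line_number": number,
                    "label": label,
                    "text": text.lower(),
                    "total_lines": len(current) - 1,
                })
        else:                           # a labelled sentence line
            current.append(line)
    return samples


# Made-up abstract with two sentences (illustrative only).
sample_lines = [
    "###00000001\n",
    "OBJECTIVE\tTo test a made-up treatment .\n",
    "METHODS\tPatients were randomised into @ groups .\n",
    "\n",
]
for row in parse_lines(sample_lines):
    print(row)
```

Each printed dictionary has the same four keys that `preprocess_doc` produces, which is what the DataFrame columns are built from.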

In [None]:
train_samples = preprocess_doc(train_dir_path)
validation_samples = preprocess_doc(validation_dir_path)
test_samples = preprocess_doc(test_dir_path)
len(train_samples), len(validation_samples), len(test_samples)

In [None]:
train_samples[:10]

In [None]:
train_df = pd.DataFrame(train_samples)
validation_df = pd.DataFrame(validation_samples)
test_df = pd.DataFrame(test_samples)

In [None]:
train_df.head(10)

In [None]:
train_df['label'].value_counts()

In [None]:
train_df['total_lines'].plot.hist()

In [None]:
train_sentences = train_df['text'].to_list()
validation_sentences = validation_df['text'].to_list()
test_sentences = test_df['text'].to_list()

In [None]:
train_sentences[:10]

In [None]:
one_hot_encoder = OneHotEncoder(sparse_output = False)
train_labels_one_hot_encoded = one_hot_encoder.fit_transform(train_df['label'].to_numpy().reshape(-1,1))
validation_labels_one_hot_encoded = one_hot_encoder.transform(validation_df['label'].to_numpy().reshape(-1,1))
test_labels_one_hot_encoded = one_hot_encoder.transform(test_df['label'].to_numpy().reshape(-1,1))

In [None]:
label_encoder = LabelEncoder()
train_labels_label_encoded = label_encoder.fit_transform(train_df['label'].to_numpy())
validation_labels_label_encoded = label_encoder.transform(validation_df['label'].to_numpy())
test_labels_label_encoded = label_encoder.transform(test_df['label'].to_numpy())

In [None]:
total_classes = len(label_encoder.classes_)
class_names = label_encoder.classes_
total_classes, class_names
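
Both encoders order the classes alphabetically, so the position of the `1` in a one-hot row equals the integer produced by the label encoder. That is why `argmax` predictions from models trained on one-hot labels can later be scored against the label-encoded targets. The plain-Python sketch below illustrates the idea without scikit-learn; the label list is a made-up sample.

```python
labels = ["METHODS", "RESULTS", "BACKGROUND", "METHODS", "CONCLUSIONS"]

# Sorted unique classes, mirroring how the encoders order their categories.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Integer (label) encoding: one class index per sample.
label_encoded = [class_to_index[name] for name in labels]

# One-hot encoding: a row of zeros with a single 1.0 at the class index.
one_hot_encoded = [
    [1.0 if column == class_to_index[name] else 0.0 for column in range(len(classes))]
    for name in labels
]

# The argmax of each one-hot row recovers the integer label.
recovered = [row.index(max(row)) for row in one_hot_encoded]
print(classes)
print(label_encoded)
print(recovered == label_encoded)
```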

In [None]:
baseline_model = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

baseline_model.fit(X = train_sentences,
                   y = train_labels_label_encoded)

In [None]:
baseline_model.score(X = validation_sentences,
                     y = validation_labels_label_encoded)

In [None]:
baseline_preds = baseline_model.predict(validation_sentences)
baseline_results = calculate_results(y_true = validation_labels_label_encoded,
                                     y_pred = baseline_preds)
baseline_results

In [None]:
sentence_lengths = [len(sentence.split()) for sentence in train_sentences]
average_sentence_length = np.mean(sentence_lengths)
average_sentence_length

In [None]:
plt.hist(sentence_lengths, bins = 20)

In [None]:
output_sentence_length = int(np.percentile(sentence_lengths, 95))
output_sentence_length
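
Choosing the 95th percentile keeps most sentences intact while avoiding long padded sequences driven by a few outliers. The sketch below shows the effect on made-up lengths. It uses a plain-Python percentile with linear interpolation (the default behaviour of `np.percentile`) and a hypothetical `pad_or_truncate` helper that imitates how a fixed output length pads short sequences with zeros and cuts long ones.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def pad_or_truncate(token_ids, length):
    """Cut token_ids to `length`, or right-pad with zeros up to `length`."""
    return (token_ids + [0] * length)[:length]


# Made-up sentence lengths with one long outlier.
sentence_lengths = [5, 8, 12, 20, 22, 25, 26, 30, 31, 120]
output_length = int(percentile(sentence_lengths, 95))
print(output_length)                   # well below the outlier of 120
print(pad_or_truncate([7, 3, 9], 5))   # short sequence is padded with zeros
```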

In [None]:
max_tokens = 68000

In [None]:
token_vectorizer = tf.keras.layers.TextVectorization(max_tokens = max_tokens,
                                                     output_sequence_length = output_sentence_length)

In [None]:
token_vectorizer.adapt(train_sentences)

In [None]:
sample_sentence = random.choice(train_sentences)
print(f'Text:\n{sample_sentence}')
print(f'\nLength of sentence: {len(sample_sentence.split())}')
print(f'\nVectorized text: {token_vectorizer([sample_sentence])}')

In [None]:
token_vocab = token_vectorizer.get_vocabulary()
print(f'Number of words in token_vocab: {len(token_vocab)}')
print(f'Most common words in token_vocab: {(token_vocab[:10])}')
print(f'Least common words in token_vocab: {(token_vocab[-10:])}')

In [None]:
token_vectorizer.get_config()

In [None]:
token_embedder = tf.keras.layers.Embedding(input_dim = len(token_vocab),
                                           output_dim = 128,
                                           mask_zero = True,
                                           name = 'token_embedding_layer')

In [None]:
print(f'Sentence:\n {sample_sentence}\n')
vectorized_sample_sentence = token_vectorizer([sample_sentence])
print(f'Vectorized sentence:\n{vectorized_sample_sentence}\n')
embedded_sample_sentence = token_embedder(vectorized_sample_sentence)
print(f'Embedded sentence:\n{embedded_sample_sentence}\n')

In [None]:
train_token_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot_encoded))
validation_token_data = tf.data.Dataset.from_tensor_slices((validation_sentences, validation_labels_one_hot_encoded))
test_token_data = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot_encoded))

In [None]:
train_token_data = train_token_data.batch(32).prefetch(tf.data.AUTOTUNE)
validation_token_data = validation_token_data.batch(32).prefetch(tf.data.AUTOTUNE)
test_token_data = test_token_data.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
inputs = tf.keras.layers.Input(shape = (1,), dtype = tf.string)
text_vectors = token_vectorizer(inputs)
text_embeddings = token_embedder(text_vectors)
x = tf.keras.layers.Conv1D(64, kernel_size = 5, padding = 'same', activation = 'relu')(text_embeddings)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(total_classes, activation = 'softmax')(x)
model_1 = tf.keras.Model(inputs, outputs, name = 'conv1d_20k')

In [None]:
model_1.compile(loss = tf.keras.losses.categorical_crossentropy,
                optimizer = tf.keras.optimizers.Adam(learning_rate = 0.003),
                metrics = ['accuracy'])

In [None]:
model_1_history = model_1.fit(train_token_data,
                              epochs = 5,
                              steps_per_epoch = int(0.25 * len(train_token_data)),
                              validation_data = validation_token_data,
                              validation_steps = int(0.25 * len(validation_token_data)),
                              callbacks=[create_tensorboard_callback(dir_name = LOGS_DIR, experiment_name = 'conv1d_20k')])
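
Setting `steps_per_epoch` and `validation_steps` to a quarter of the batch count means each epoch uses only about 25% of the available batches, which shortens experiments at the cost of a noisier picture per epoch. The arithmetic can be sketched as follows; the sample count is a made-up figure, and `steps_for_fraction` is a hypothetical helper.

```python
import math


def steps_for_fraction(num_samples, batch_size=32, fraction=0.25):
    """Return (total batches, batches used per epoch) for a given fraction."""
    # A batched dataset keeps the final partial batch, hence the ceiling.
    total_batches = math.ceil(num_samples / batch_size)
    return total_batches, int(fraction * total_batches)


total, per_epoch = steps_for_fraction(num_samples=10_000)
print(total, per_epoch)
```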

In [None]:
model_1.evaluate(validation_token_data)

In [None]:
model_1_preds = tf.argmax(model_1.predict(validation_token_data), axis = 1)
model_1_preds, validation_labels_label_encoded

In [None]:
model_1_results = calculate_results(y_true = validation_labels_label_encoded,
                                    y_pred = model_1_preds)
model_1_results

In [None]:
use_layer = tfhub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4", trainable=False)

In [None]:
print(f'Sentence:\n {sample_sentence}\n')
use_sample_sentence = use_layer([sample_sentence])
print(f'Embedded sentence:\n{use_sample_sentence}\n')

In [None]:
inputs = tf.keras.layers.Input(shape=[], name="Input", dtype=tf.string)
text_use_embedding = tf.keras.layers.Lambda(lambda x: use_layer(x), output_shape=(512,))(inputs)
x = tf.keras.layers.Dense(128, activation = 'relu')(text_use_embedding)
outputs = tf.keras.layers.Dense(total_classes, activation = 'softmax')(x)
model_2 = tf.keras.Model(inputs, outputs, name = 'use_20k')

In [None]:
model_2.compile(loss = tf.keras.losses.categorical_crossentropy,
                optimizer = tf.keras.optimizers.Adam(learning_rate = 0.003),
                metrics = ['accuracy'], jit_compile=True)

In [None]:
model_2_history = model_2.fit(train_token_data,
                              epochs = 5,
                              steps_per_epoch = int(0.25 * len(train_token_data)),
                              validation_data = validation_token_data,
                              validation_steps = int(0.25 * len(validation_token_data)),
                              callbacks = [create_tensorboard_callback(dir_name = LOGS_DIR, experiment_name = 'use_20k')])

In [None]:
model_2.evaluate(validation_token_data)

In [None]:
model_2_preds = tf.argmax(model_2.predict(validation_token_data), axis = 1)

In [None]:
model_2_results = calculate_results(y_true = validation_labels_label_encoded,
                                    y_pred = model_2_preds)
model_2_results

In [None]:
def split_chars(text):
  return ' '.join(list(text))
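
For example, `split_chars` inserts a space between every character, so a whitespace-based tokenizer can treat each character as its own token. A quick sketch of the behaviour (the function is repeated so the snippet runs on its own):

```python
def split_chars(text):
    """Return text with a single space inserted between every character."""
    return ' '.join(list(text))


print(split_chars("rct"))
# An original space is itself a character, so it ends up between two separators.
print(repr(split_chars("a b")))
# Splitting on whitespace afterwards yields only the visible characters.
print(split_chars("a b").split())
```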

In [None]:
train_chars = [split_chars(sentence) for sentence in train_sentences]
validation_chars = [split_chars(sentence) for sentence in validation_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]

In [None]:
character_lengths = [len(sentence) for sentence in train_sentences]
average_character_length = np.mean(character_lengths)
average_character_length

In [None]:
plt.hist(character_lengths, bins = 5)

In [None]:
output_character_length = int(np.percentile(character_lengths, 95))
output_character_length

In [None]:
characters = string.ascii_lowercase + string.digits + string.punctuation
characters

In [None]:
# Add 2 for the padding token and the out-of-vocabulary ([UNK]) token
max_chars = len(characters) + 2
max_chars

In [None]:
char_vectorizer = tf.keras.layers.TextVectorization(max_tokens = max_chars,
                                                    output_sequence_length = output_character_length,
                                                    name = 'character_vectorizer_layer')

In [None]:
char_vectorizer.adapt(train_chars)

In [None]:
char_vocab = char_vectorizer.get_vocabulary()
print(f'Number of characters in char_vocab: {len(char_vocab)}')
print(f'Most common characters in char_vocab: {(char_vocab[:10])}')
print(f'Least common characters in char_vocab: {(char_vocab[-10:])}')

In [None]:
char_vectorizer.get_config()

In [None]:
char_embedder = tf.keras.layers.Embedding(input_dim = len(char_vocab),
                                          output_dim = 25,
                                          mask_zero = True,
                                          name = 'character_embedder_layer')

In [None]:
sample_chars = split_chars(sample_sentence)
print(f'Sentence:\n {sample_chars}\n')
vectorized_sample_chars = char_vectorizer([sample_chars])
print(f'Vectorized sentence:\n{vectorized_sample_chars}\n')
embedded_sample_chars = char_embedder(vectorized_sample_chars)
print(f'Embedded sentence:\n{embedded_sample_chars}\n')

In [None]:
inputs = tf.keras.layers.Input(shape = (1, ), dtype = tf.string)
char_vectors = char_vectorizer(inputs)
char_embeddings = char_embedder(char_vectors)
x = tf.keras.layers.Conv1D(64, kernel_size = 5, padding = 'same', activation = 'relu')(char_embeddings)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(total_classes, activation = 'softmax')(x)
model_3 = tf.keras.Model(inputs, outputs, name = 'conv1d_char_20k')

In [None]:
model_3.compile(loss = tf.keras.losses.categorical_crossentropy,
                optimizer = tf.keras.optimizers.Adam(learning_rate = 0.003),
                metrics = ['accuracy'])

In [None]:
train_char_data = tf.data.Dataset.from_tensor_slices((train_chars, train_labels_one_hot_encoded))
validation_char_data = tf.data.Dataset.from_tensor_slices((validation_chars, validation_labels_one_hot_encoded))
test_char_data = tf.data.Dataset.from_tensor_slices((test_chars, test_labels_one_hot_encoded))

In [None]:
train_char_data = train_char_data.batch(32).prefetch(tf.data.AUTOTUNE)
validation_char_data = validation_char_data.batch(32).prefetch(tf.data.AUTOTUNE)
test_char_data = test_char_data.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
model_3_history = model_3.fit(train_char_data,
                              epochs = 5,
                              steps_per_epoch = int(0.25 * len(train_char_data)),
                              validation_data = validation_char_data,
                              validation_steps = int(0.25 * len(validation_char_data)),
                              callbacks = [create_tensorboard_callback(dir_name = LOGS_DIR, experiment_name = 'conv1d_char_20k')])


In [None]:
model_3.evaluate(validation_char_data)

In [None]:
model_3_preds = tf.argmax(model_3.predict(validation_char_data), axis = 1)

In [None]:
model_3_results = calculate_results(y_true = validation_labels_label_encoded,
                                    y_pred = model_3_preds)
model_3_results

In [None]:
token_inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name='token_input')
token_embeddings = tf.keras.layers.Lambda(lambda x: use_layer(x), output_shape=(512,))(token_inputs)
token_outputs = tf.keras.layers.Dense(256, activation='relu', name='token_output')(token_embeddings)
token_model = tf.keras.Model(token_inputs, token_outputs, name='token_model')


In [None]:
# Define the char input model
char_inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='char_input')
char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embedder(char_vectors)
char_outputs = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(24), name='char_output')(char_embeddings)
char_model = tf.keras.Model(char_inputs, char_outputs, name='char_model')

In [None]:
token_char_inputs = tf.keras.layers.Concatenate(name='token_char_input')([token_model.output, char_model.output])


In [None]:
token_char_dropout_1 = tf.keras.layers.Dropout(0.5, name='token_char_dropout_1')(token_char_inputs)
token_char_dense = tf.keras.layers.Dense(128, activation='relu', name='token_char_dense')(token_char_dropout_1)
token_char_dropout_2 = tf.keras.layers.Dropout(0.5, name='token_char_dropout_2')(token_char_dense)
token_char_outputs = tf.keras.layers.Dense(total_classes, activation='softmax', name='token_char_output')(token_char_dropout_2)


In [None]:
model_4 = tf.keras.Model(inputs=[token_model.input, char_model.input], outputs=token_char_outputs, name='token_char_20k')


In [None]:
tf.keras.utils.plot_model(model_4, show_shapes = True)

In [None]:
train_token_char_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_chars))
train_token_char_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot_encoded)
train_token_char_dataset = tf.data.Dataset.zip((train_token_char_data, train_token_char_labels))

validation_token_char_data = tf.data.Dataset.from_tensor_slices((validation_sentences, validation_chars))
validation_token_char_labels = tf.data.Dataset.from_tensor_slices(validation_labels_one_hot_encoded)
validation_token_char_dataset = tf.data.Dataset.zip((validation_token_char_data, validation_token_char_labels))


In [None]:
train_token_char_dataset = train_token_char_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
validation_token_char_dataset = validation_token_char_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
model_4.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
    metrics=['accuracy']
)

In [None]:
model_4_history = model_4.fit(
    train_token_char_dataset,
    epochs=5,
    steps_per_epoch=int(0.25 * len(train_token_char_dataset)),
    validation_data=validation_token_char_dataset,
    validation_steps=int(0.25 * len(validation_token_char_dataset)),
    callbacks=[create_tensorboard_callback(dir_name=LOGS_DIR, experiment_name='token_char_20k')]
)

In [None]:
model_4.evaluate(validation_token_char_dataset)

In [None]:
model_4_preds = tf.argmax(model_4.predict(validation_token_char_dataset), axis = 1)

In [None]:
model_4_results = calculate_results(y_true = validation_labels_label_encoded,
                                    y_pred = model_4_preds)
model_4_results

In [None]:
train_df['line_number'].plot.hist()

In [None]:
train_line_numbers_one_hot_encoded = tf.one_hot(train_df['line_number'].to_numpy(), depth = 15)
validation_line_numbers_one_hot_encoded = tf.one_hot(validation_df['line_number'].to_numpy(), depth = 15)
test_line_numbers_one_hot_encoded = tf.one_hot(test_df['line_number'].to_numpy(), depth = 15)

In [None]:
train_df['total_lines'].plot.hist()

In [None]:
train_total_lines_one_hot_encoded = tf.one_hot(train_df['total_lines'].to_numpy(), depth = 20)
validation_total_lines_one_hot_encoded = tf.one_hot(validation_df['total_lines'].to_numpy(), depth = 20)
test_total_lines_one_hot_encoded = tf.one_hot(test_df['total_lines'].to_numpy(), depth = 20)
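
The `depth` values cap how many positions get their own column. With `tf.one_hot`, any index at or beyond `depth` becomes an all-zero row, so unusually long abstracts share a single "out of range" encoding instead of widening the input. The plain-Python sketch below mimics that behaviour with a hypothetical `one_hot` helper and a small depth for readability.

```python
def one_hot(index, depth):
    """Return a one-hot row; indices outside [0, depth) give all zeros."""
    return [1.0 if position == index else 0.0 for position in range(depth)]


print(one_hot(2, depth=5))  # in range: a single 1.0 at position 2
print(one_hot(7, depth=5))  # out of range: every entry is 0.0
```

Looking at the histograms before picking a depth helps keep the number of all-zero rows small.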

In [None]:
line_number_inputs = tf.keras.layers.Input(shape = (15, ), dtype = tf.float32, name = 'line_number_input')
line_number_outputs = tf.keras.layers.Dense(32, activation = 'relu', name = 'line_number_output')(line_number_inputs)
line_number_model = tf.keras.Model(line_number_inputs, line_number_outputs)

In [None]:
total_lines_inputs = tf.keras.layers.Input(shape = (20, ), dtype = tf.float32, name = 'total_lines_input')
total_lines_outputs = tf.keras.layers.Dense(32, activation = 'relu', name = 'total_line_output')(total_lines_inputs)
total_lines_model = tf.keras.Model(total_lines_inputs, total_lines_outputs, name = 'total_lines_model')

In [None]:
token_char_dense = tf.keras.layers.Dense(256, activation = 'relu', name = 'token_char_dense')(token_char_inputs)
token_char_dropout = tf.keras.layers.Dropout(0.5, name = 'token_char_dropout')(token_char_dense)
token_char_positional_inputs = tf.keras.layers.Concatenate(name = 'token_char_positional_inputs')([line_number_model.output,
                                                                                                   total_lines_model.output,
                                                                                                   token_char_dropout])
token_char_positional_outputs = tf.keras.layers.Dense(total_classes, activation = 'softmax', name = 'token_char_positional_output')(token_char_positional_inputs)
model_5 = tf.keras.Model([line_number_model.input,
                          total_lines_model.input,
                          token_model.input,
                          char_model.input],
                         token_char_positional_outputs)

In [None]:
tf.keras.utils.plot_model(model_5)

In [None]:
model_5.compile(loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing = 0.2),
                optimizer = tf.keras.optimizers.Adam(learning_rate = 0.003),
                metrics = ['accuracy'])
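
Label smoothing softens the one-hot targets before the loss is computed: each target value becomes `y * (1 - smoothing) + smoothing / num_classes`, which discourages the model from becoming overconfident in a single class. A worked sketch in plain Python, assuming five classes (the helper name `smooth_labels` is hypothetical):

```python
def smooth_labels(one_hot_row, smoothing=0.2):
    """Apply label smoothing to a single one-hot target row."""
    num_classes = len(one_hot_row)
    return [value * (1 - smoothing) + smoothing / num_classes for value in one_hot_row]


smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0, 0.0], smoothing=0.2)
# The true class keeps most of the probability mass; the rest is shared evenly.
print([round(value, 2) for value in smoothed])
```

The smoothed row still sums to 1, so it remains a valid target distribution for categorical cross-entropy.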

In [None]:
train_token_char_positional_data = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot_encoded, train_total_lines_one_hot_encoded, train_sentences, train_chars))
train_token_char_positional_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot_encoded)
train_token_char_positional_dataset = tf.data.Dataset.zip((train_token_char_positional_data, train_token_char_positional_labels))
validation_token_char_positional_data = tf.data.Dataset.from_tensor_slices((validation_line_numbers_one_hot_encoded, validation_total_lines_one_hot_encoded, validation_sentences, validation_chars))
validation_token_char_positional_labels = tf.data.Dataset.from_tensor_slices(validation_labels_one_hot_encoded)
validation_token_char_positional_dataset = tf.data.Dataset.zip((validation_token_char_positional_data, validation_token_char_positional_labels))

In [None]:
train_token_char_positional_dataset = train_token_char_positional_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
validation_token_char_positional_dataset = validation_token_char_positional_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
model_5_history = model_5.fit(train_token_char_positional_dataset,
                              epochs = 5,
                              steps_per_epoch = int(0.25 * len(train_token_char_positional_dataset)),
                              validation_data = validation_token_char_positional_dataset,
                              validation_steps = int(0.25 * len(validation_token_char_positional_dataset)),
                              callbacks = [create_tensorboard_callback(dir_name = LOGS_DIR, experiment_name = 'token_char_positional_20k')])

In [None]:
model_5.evaluate(validation_token_char_positional_dataset)

In [None]:
model_5_preds = tf.argmax(model_5.predict(validation_token_char_positional_dataset), axis = 1)

In [None]:
model_5_results = calculate_results(y_true = validation_labels_label_encoded,
                                    y_pred = model_5_preds)
model_5_results

In [None]:
all_models_results = pd.DataFrame({
    'naive_bayes_model': baseline_results,
    'token_model': model_1_results,
    'use_model': model_2_results,
    'char_model': model_3_results,
    'token_char_model': model_4_results,
    'token_char_positional_model': model_5_results
})

all_models_results = all_models_results.transpose()
all_models_results.reset_index(inplace = True)
all_models_results

In [None]:
all_models_results.plot(kind = 'bar', figsize = (10,7)).legend(bbox_to_anchor = (1.0,1.0))

In [None]:
test_token_char_positional_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot_encoded, test_total_lines_one_hot_encoded, test_sentences, test_chars))
test_token_char_positional_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot_encoded)
test_token_char_positional_dataset = tf.data.Dataset.zip((test_token_char_positional_data, test_token_char_positional_labels))
test_token_char_positional_dataset = test_token_char_positional_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

In [None]:
model_5_test_preds = tf.argmax(model_5.predict(test_token_char_positional_dataset), axis = 1)

In [None]:
model_5_test_results = calculate_results(y_true = test_labels_label_encoded,
                                         y_pred = model_5_test_preds)

In [None]:
model_5_test_results