# Milestone project 2 - SkimLit

## SkimLit sequence problem: Many to one

To investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain, mobility, and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis (OA).

## What we're going to cover

* Downloading a text dataset (PubMed 200k RCT)
* Writing a preprocessing function for our text data
* Setting up multiple modelling experiments with different levels of embeddings
* Building a multimodal model to take in different sources of data
    - Replicating the model powering https://arxiv.org/abs/1710.06071
* Finding the most wrong prediction examples

see: https://www.nltk.org/install.html


In [None]:
!nvidia-smi -L

## Get dataset

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the data its authors used.

We can do so from the authors' GitHub: https://github.com/Franck-Dernoncourt/pubmed-rct


In [1]:
import os
import sys
import shutil
import random
import string
import re
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import sklearn as skl
import numpy as np
import matplotlib.pyplot as plt
import nltk
import multiprocessing

from keras.utils import plot_model
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.phrases import Phrases, Phraser
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer



# add the project root to the path so the helper functions can be imported
sys.path.append(os.path.join('../'))

from helper_functions import calculate_results

print(f'pandas: {pd.__version__}')
print(f'tensorflow: {tf.__version__}')
print(f'sklearn: {skl.__version__}')

pandas: 1.3.5
tensorflow: 2.9.3
sklearn: 1.0.2


In [None]:
nltk_lists = ['tokenizers/punkt', 'stemmers/rslp', 'corpora/stopwords']

for name in nltk_lists:
    try:
        nltk.data.find(name)
    except LookupError:
        nltk.download(name.split('/')[1])

In [3]:
STORAGE = os.path.join('../../', 'storage')
MODEL_PATH = f'{STORAGE}/models'
NLP = f'{STORAGE}/nlp'

In [None]:
# !git clone https://github.com/Franck-Dernoncourt/pubmed-rct.git
# shutil.move('pubmed-rct', f'{NLP}')

In [None]:
# list dir and see your content
os.listdir(f'{NLP}/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign')

In [None]:
# start our experiments using 20k dataset with number replaced by '@' sign
data_dir = f'{NLP}/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/'

In [None]:
# check all of the filenames in the target directory
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

## Preprocess data

Now we've got some text data, it's time to become one with it.

And one of the best ways to become one with the data is to...

> Visualize, Visualize, Visualize

So with that in mind, let's write a function to read in all of the lines of a target text file


In [None]:
# create a function to read the lines of a document
def get_lines(filename):
    """ 
    Reads filename (a txt filename) and returns the lines of a text as a list
    Args:
        filename:  a string  containing the target filepath
    Returns:
        A list of strings with one string per line from the target filename
    """
    with open(filename, 'r') as file:
        return file.readlines()

In [None]:
# Let's read in the training lines

train_lines = get_lines(data_dir + 'train.txt') # read the lines with the training file
train_lines[:15]

Let's think about how we want our data to look.

Each sentence of an abstract is best represented as a dictionary holding its label, its text, its position in the abstract and the index of the abstract's last line:

```json
[
    {
        "target": "BACKGROUND",
        "text": "serum levels of interleukin @ ( il-@ ) , il-@ and high-sensitivity c-reactive protein ( hscrp ) were measured .",
        "number": 0,
        "total_line": 11
    }
]
```

Let's write a function which turns each of our datasets into the above format so we can continue to prepare the data for our models.

In [None]:
def preprocess_text_with_line_numbers(filename):
    """ 
    Takes in filename, reads its contents and sorts through each line, extracting
    things like the target label, the text of the sentence, how many sentences are
    in the current abstract and what sentence number the target line is.

    Args:
        filename: (str) path to the target text file
    Returns:
        A list of dictionaries of abstract line data.
    """
    input_lines = get_lines(filename) # get all lines from filename
    abstract_lines = "" # create an empty abstract
    abstract_sample = [] # create an empty list

    # loop through each line in the target file
    for line in input_lines:
        if line.startswith('###'): # check if line start with ###
            # get id from line
            abstract_id = line

            # reset the abstract string if the line is an ID line
            abstract_lines = "" 
        elif line.isspace(): # check line is a new line
            # split abstract into separate lines
            abstract_line_split = abstract_lines.splitlines()

            # iterate through each line in a single abstract and count them at the same time
            for abstract_line_number, abstract_line in enumerate(abstract_line_split):
                # create an empty dictionary for each line
                line_data = {}
                
                # split target label from text
                target_text_split = abstract_line.split('\t')
                
                # get target label
                line_data['target'] = target_text_split[0]
                
                # get target text and lower it
                line_data['text'] = target_text_split[1].lower()
                
                # what number does the line appear in
                line_data['number'] = abstract_line_number
                
                # how many total lines are there in the target abstract? (start from 0)
                line_data['total_line'] = len(abstract_line_split) - 1
                
                # add line data to abstract sample list
                abstract_sample.append(line_data)
        else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
            abstract_lines += line
            
    return abstract_sample
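
To see the expected file layout and the resulting dictionaries without touching the real dataset, here is a minimal, self-contained sketch. The sample text and the function name `parse_abstracts` are made up for illustration; it follows the same idea as the function above (ID line, labelled lines, blank line closing the abstract) but works on an in-memory string.

```python
# Toy input in the same layout as the PubMed RCT files (contents are made up).
sample = (
    "###12345\n"
    "OBJECTIVE\tTo test a toy hypothesis .\n"
    "METHODS\tWe measured @ patients .\n"
    "\n"
)

def parse_abstracts(raw_text):
    """Parse '###id' / 'LABEL<TAB>sentence' / blank-line blocks into dicts."""
    samples, current = [], []
    for line in raw_text.splitlines():
        if line.startswith("###"):      # ID line: start a new abstract
            current = []
        elif not line.strip():          # blank line: the abstract is complete
            for number, labelled in enumerate(current):
                target, text = labelled.split("\t", 1)
                samples.append({
                    "target": target,
                    "text": text.lower(),
                    "number": number,
                    "total_line": len(current) - 1,  # index of the last line
                })
            current = []
        else:                           # labelled sentence
            current.append(line)
    return samples

toy_samples = parse_abstracts(sample)
print(toy_samples)
```

Note that, as in the notebook's function, an abstract is only emitted once its closing blank line is reached.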
    

In [None]:
# get data from file and preprocess it

train_samples = preprocess_text_with_line_numbers(data_dir + 'train.txt')
val_samples = preprocess_text_with_line_numbers(data_dir + 'dev.txt') # dev is another name for validation data
test_samples = preprocess_text_with_line_numbers(data_dir + 'test.txt')

print(len(train_samples), len(val_samples), len(test_samples))

In [None]:
# check the first abstract of our training data
train_samples[:5]

Use pandas to visualize our train_samples

In [None]:
train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)

In [None]:
train_df.head()

In [None]:
# how are the labels distributed in the training data?
train_df['target'].value_counts()

In [None]:
# let's check the length of different lines
train_df['total_line'].plot.hist()

In [None]:
# convert abstract text line to list
train_sentences = train_df['text'].tolist()
val_sentences = val_df['text'].tolist()
test_sentences = test_df['text'].tolist()

len(train_sentences), len(val_sentences), len(test_sentences)

In [None]:
train_sentences[:5]

## Make numeric labels (ML models require numeric labels)

see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

see: https://en.wikipedia.org/wiki/Sparse_matrix

**WARNING**: TensorFlow is incompatible with sparse matrices here, so pass `sparse=False` to scikit-learn's `OneHotEncoder` to get a dense array. (From scikit-learn 1.2 onwards the parameter is named `sparse_output`.)

In [None]:
# one hot encoder
one_hot_encoder = skl.preprocessing.OneHotEncoder(sparse=False) # we want non-sparse matrix
train_labels_one_hot = one_hot_encoder.fit_transform(train_df['target'].to_numpy().reshape(-1, 1))
val_labels_one_hot = one_hot_encoder.transform(val_df['target'].to_numpy().reshape(-1, 1))
test_labels_one_hot = one_hot_encoder.transform(test_df['target'].to_numpy().reshape(-1, 1))

# check what the training one-hot labels look like
train_labels_one_hot

### Label Encoder

In [None]:
# Extract labels ('target' column) and encode them into integers
label_encoder = skl.preprocessing.LabelEncoder()
train_labels_label_encoded = label_encoder.fit_transform(train_df['target'].to_numpy())
val_labels_label_encoded = label_encoder.transform(val_df['target'].to_numpy())
test_labels_label_encoded = label_encoder.transform(test_df['target'].to_numpy())

# check what the training labels look like
train_labels_label_encoded

In [None]:
# get class names and number of classes from the LabelEncoder instance
num_classes = len(label_encoder.classes_)
class_names =  label_encoder.classes_

num_classes, class_names
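
The two label formats are closely related: taking the argmax of a one-hot row gives back the integer label. The NumPy-only sketch below illustrates this with a made-up label column; it mimics what the encoders produce but is not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical label column; the real one comes from train_df['target'].
targets = np.array(["METHODS", "BACKGROUND", "RESULTS", "METHODS"])

# Integer encoding: classes are sorted, then each label is replaced by its index.
classes, int_labels = np.unique(targets, return_inverse=True)

# Dense one-hot encoding: one row per sample, one column per class.
one_hot = np.eye(len(classes))[int_labels]

print(classes)
print(int_labels)
print(one_hot)

# argmax over the class axis recovers the integer labels
recovered = one_hot.argmax(axis=1)
```

This is why the notebook can train the deep models on one-hot labels and still compare `tf.argmax` of the predictions against the label-encoded targets.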

## Model 0: Baseline

In [None]:
# create model 0 baseline
model_0 = Pipeline([ 
    ('tf-idf', TfidfVectorizer()),  # convert words to number using tf-idf
    ('clf', MultinomialNB()) # model text
])

model_0_history = model_0.fit(train_sentences, train_labels_label_encoded)

In [None]:
model_0.score(val_sentences, val_labels_label_encoded)

In [None]:
# make some predictions using our baseline
model_0_preds = model_0.predict(val_sentences)
model_0_preds

In [None]:
model_0_results = calculate_results(y_true=val_labels_label_encoded, y_pred=model_0_preds)
model_results = pd.DataFrame({'model_0_results': model_0_results})
model_results.transpose()

## Preparing our data (the text) for deep sequence models

Before we start building deeper models, we've got to create vectorization and embedding layers

In [None]:
# how many words there are in our train_sentences: https://arxiv.org/pdf/1710.06071.pdf
sent_lens = [len(sentence.split()) for sentence in train_sentences]
len(sent_lens)

In [None]:
# how long is each sentence on average?
np.mean(sent_lens)

In [None]:
# the same average computed by hand, as a sanity check
sum(sent_lens)/len(train_sentences)

In [None]:
# What does the distribution look like?
ax = plt.hist(sent_lens, bins=20)

In [None]:
# How long of a sentence length covers 95% of examples?
output_seq_len = int(np.percentile(sent_lens, 95))
output_seq_len
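
The idea behind the 95th percentile is to pick a fixed length that covers almost all sentences while ignoring a few very long outliers. A small sketch with made-up lengths (the helper `pad_or_truncate` is hypothetical and only illustrates what a fixed output length implies):

```python
import numpy as np

# Hypothetical sentence lengths in tokens; the real ones live in sent_lens.
toy_lens = np.arange(1, 101)   # 1, 2, ..., 100

cutoff = int(np.percentile(toy_lens, 95))
covered = np.mean(toy_lens <= cutoff)   # fraction of sentences that fit entirely

def pad_or_truncate(token_ids, length):
    """Cut sequences longer than `length`, right-pad shorter ones with 0."""
    return (token_ids[:length] + [0] * length)[:length]

short = pad_or_truncate([5, 6, 7], 5)              # padded with zeros
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)   # tail is cut off

print(cutoff, covered, short, long)
```

Sentences longer than the cutoff lose their tail, which is the trade-off for keeping every batch at a manageable, uniform size.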

In [None]:
# Maximum sequence length in the training set
max(sent_lens)

In [None]:
max_tokens = 68000

# create text vectorizer
text_vectorizer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, # number of words in vocab
                                                    output_sequence_length=output_seq_len)

In [None]:
# adapt text vectorizer to training sentences
text_vectorizer.adapt(train_sentences)

In [None]:
target_sentence = random.choice(train_sentences)
print(f'Text:\n{target_sentence}')
print(f'\nLength of text: {len(target_sentence.split())}')
print(f'\nVectorizer text: {text_vectorizer([target_sentence])}')
print(f'\nLength Vectorizer: {len(text_vectorizer([target_sentence])[0])}')

In [None]:
# how many words in our training vocabulary
rct_20k_text_vocab = text_vectorizer.get_vocabulary()

print(f'Number of words in vocab: {len(rct_20k_text_vocab)}')
print(f'The most common words: {rct_20k_text_vocab[:5]}')
print(f'The least common words: {rct_20k_text_vocab[-5:]}')

In [None]:
# get the config of text_vectorizer
text_vectorizer.get_config()
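
Conceptually, the text vectorizer builds a frequency-ordered vocabulary and maps each token to its index, reserving one index for padding and one for unknown words. The pure-Python toy below sketches that idea; the function names and sentences are made up, and it deliberately skips details of the Keras layer such as punctuation stripping.

```python
from collections import Counter

def build_vocab(sentences, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(sentence, vocab, output_len):
    """Map tokens to vocabulary indices, then pad or truncate to output_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in sentence.lower().split()]
    return (ids[:output_len] + [0] * output_len)[:output_len]

toy_vocab = build_vocab(["the patients were randomized", "the trial ended"],
                        max_tokens=10)
toy_ids = vectorize("the patients recovered", toy_vocab, output_len=5)

print(toy_vocab)
print(toy_ids)
```

Here `recovered` never appeared during vocabulary building, so it maps to the unknown index, and the remaining positions are filled with the padding index.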

In [None]:
# create token embedding layer
token_embed = tf.keras.layers.Embedding(input_dim=len(rct_20k_text_vocab), # length of vocabulary
                                        output_dim=128, # NOTE: different embedding sizes result in different numbers of parameters to train
                                        mask_zero=False, # masking is disabled here; set to True to mask the zero padding of variable-length sequences
                                        name='token_embedding'
                                        )

In [None]:
print(f'Sentence before vectorization:\n {target_sentence}\n')
vectorized_sentence = text_vectorizer([target_sentence])
print(f'Sentence after vectorization (before embedding):\n{vectorized_sentence}\n')
embedded_sentence = token_embed(vectorized_sentence)
print(f'Sentence after embedding:\n{embedded_sentence}\n')
print(f'Sentence after embedding shape:\n{embedded_sentence.shape}')
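
An embedding layer is essentially a lookup table: each integer token index selects one row of a trainable matrix. A NumPy sketch with toy sizes (the notebook's layer uses 128 dimensions and learns its weights; the random matrix here is only a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4               # toy sizes chosen for readability
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([[2, 3, 1, 0, 0]])     # shape (batch, sequence)
embedded = embedding_matrix[token_ids]      # shape (batch, sequence, embed_dim)

print(embedded.shape)
```

The two padding positions share index 0, so they receive the identical vector, which is what masking would tell downstream layers to ignore.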

In [None]:
# Turn our data into Tensorflow Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels_one_hot))
val_dataset = tf.data.Dataset.from_tensor_slices((val_sentences, val_labels_one_hot))
test_dataset = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels_one_hot))

train_dataset

In [None]:
# Take the TensorSliceDatasets and turn them into prefetched datasets
train_dataset = train_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_dataset

## Model 1: Conv1D with token embedding

In [None]:
# create Conv 1D model to process sequences
inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs) # vectorize text input
x = token_embed(x) # create embedding
x = tf.keras.layers.Conv1D(filters=64, 
                           kernel_size=5, 
                           padding='same', 
                           activation='relu')(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x) # condense the output of our feature vector from conv layer
outputs = tf.keras.layers.Dense(num_classes, 
                                activation='softmax')(x) # we have more than two class
model_1 = tf.keras.Model(inputs, 
                         outputs,
                         name='model_1_conv1d_token_embedding')

# compile model 1
model_1.compile(loss=tf.keras.losses.CategoricalCrossentropy(), 
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
# show summary
model_1.summary()

In [None]:
# fit the model with 10% of train_dataset
model_1_history = model_1.fit(train_dataset,
                              steps_per_epoch=int(0.1 * len(train_dataset)), # use only 10%
                              epochs=3,
                              validation_data=val_dataset,
                              validation_steps=int(0.1 * len(val_dataset)) # only validate on 10% of batches
                              )
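
The `steps_per_epoch` arithmetic is easy to check by hand. The sketch below uses a made-up dataset size to show how many batches, and therefore how many samples, one shortened epoch covers:

```python
import math

num_samples, batch_size = 1000, 32                  # hypothetical dataset size
num_batches = math.ceil(num_samples / batch_size)   # length of the batched dataset
steps_per_epoch = int(0.1 * num_batches)            # 10% of the batches
samples_seen = steps_per_epoch * batch_size

print(num_batches, steps_per_epoch, samples_seen)
```

Because `int()` rounds down, the fraction actually used can be slightly below 10%, which matters little for quick experiments like these.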

In [None]:
# evaluate the model
model_1.evaluate(val_dataset)

In [None]:
# make some predictions (our model predicts probabilities for each class)
model_1_pred_prob = model_1.predict(val_dataset)
model_1_pred_prob

In [None]:
# convert prediction probabilities to classes
model_1_preds = tf.argmax(model_1_pred_prob, axis=1)
model_1_preds

In [None]:
model_1_results = calculate_results(y_true=val_labels_label_encoded, y_pred=model_1_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results})
model_results.transpose()

## Model 2: Feature Extraction with Pretrained Model

see: https://keras.io/examples/nlp/pretrained_word_embeddings/

see: http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

see: https://www.kaggle.com/code/chewzy/tutorial-how-to-train-your-custom-word-embedding

In [None]:
# model_use = tf.keras.models.load_model(f'{MODEL_PATH}/tfhub_universal_sentence_encoder')

tfhub_url = 'https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2'

# create a Keras Layer using the pretrained layer from tensorflow hub
sentence_encoder_layer = hub.KerasLayer(tfhub_url, 
                                        input_shape=[], 
                                        dtype=tf.string,
                                        trainable=False, 
                                        name='USE')

In [None]:
# Test out the pretrained embedding on a random sentence
random_train_sentence = random.choice(train_sentences)
print(f'Random sentence:\n{random_train_sentence}')
use_embedding_sentence = sentence_encoder_layer([random_train_sentence])

In [None]:
use_embedding_sentence

In [None]:
# build a feature-extraction model on top of the pretrained TensorFlow Hub layer
inputs = tf.keras.layers.Input(shape=(), 
                               dtype=tf.string)
x = sentence_encoder_layer(inputs) # tokenize text and create embedding of each sequence (512 long vector)
x = tf.keras.layers.Dense(128, 
                          activation='relu')(x)
# you could add more layers here if you wanted to
outputs = tf.keras.layers.Dense(num_classes, 
                                activation='softmax')(x)
model_2 = tf.keras.Model(inputs, 
                         outputs, 
                         name='model_2_use_feature_extractor')
model_2.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

model_2.summary()

In [None]:
# fit the model
model_2_history = model_2.fit(train_dataset,
                              epochs=3,
                              steps_per_epoch=int(0.1 * len(train_dataset)),
                              validation_data=val_dataset,
                              validation_steps=int(0.1 * len(val_dataset)))

In [None]:
model_2.evaluate(val_dataset)

In [None]:
model_2_pred_prob = model_2.predict(val_dataset)

In [None]:
# convert prediction probabilities to classes
model_2_preds = tf.argmax(model_2_pred_prob, axis=1)
model_2_preds

In [None]:
model_2_results = calculate_results(y_true=val_labels_label_encoded, y_pred=model_2_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results,
                              'model_2_results': model_2_results})
model_results.transpose()

## Model 3: Conv1D with character embedding

The paper which we're replicating states they used a combination of token and character-level embeddings.

Previously we've created token-level embeddings, but we'll need to do similar steps for characters if we want to use char-level embeddings.

see: https://medium.com/@WojtekFulmyk/text-tokenization-and-vectorization-in-nlp-ac5e3eb35b85

see: https://www.kaggle.com/code/parulpandey/getting-started-with-nlp-feature-vectors


In [None]:
train_sentences[:5]

In [None]:
# make functions to split sentences into characters
def split_chars(text:str):
    return ' '.join(list(text))
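
A quick usage sketch of `split_chars` on a made-up string (the function is repeated so the snippet runs on its own). It also shows a subtlety: the original space ends up surrounded by the separator spaces, so a plain whitespace split returns only the non-space characters.

```python
def split_chars(text: str) -> str:
    return ' '.join(list(text))

chars = split_chars("knee oa")
tokens = chars.split()   # whitespace split, similar in spirit to the vectorizer's default

print(repr(chars))
print(tokens)
```

With a whitespace-based tokenizer the character model therefore sees letters, digits and punctuation, but no explicit space token.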

In [None]:
def custom_preprocessor(text):
    '''
    Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove special characters and remove words containing numbers.
    '''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # text in square brackets
    # strip links and tags before the generic clean-up, otherwise their delimiters are already gone
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # links
    text = re.sub(r'<.*?>+', '', text)  # HTML tags
    text = re.sub(r'\W', ' ', text)  # remaining special chars become spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # leftover punctuation such as '_'
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # words containing digits

    return text

In [None]:
# test splitting a non-character-level sequence into characters
split_chars(random_train_sentence)

In [None]:
# split sequence-level data splits into character-level data splits
train_chars = [split_chars(sentence) for sentence in train_sentences]
val_chars = [split_chars(sentence) for sentence in val_sentences]
test_chars = [split_chars(sentence) for sentence in test_sentences]

train_chars[:5]

In [None]:
# what's the average character length?
chars_lens = [len(sentence) for sentence in train_sentences]
mean_chars_lens = np.mean(chars_lens)
mean_chars_lens

In [None]:
# check the distribution of our sequences at a character-level
ax = plt.hist(chars_lens, bins=7)

In [None]:
# find what character length covers 95% of sequences
output_seq_char_len = int(np.percentile(chars_lens, 95))
output_seq_char_len

In [None]:
# get all characters
alphabet = string.ascii_lowercase + string.digits + string.punctuation
alphabet

In [None]:
NUM_CHAR_TOKENS = len(alphabet) + 2 # add 2 for space and OOV token (OOV = out of vocab, '[UNK]')
char_vectorizer = tf.keras.layers.TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                                                    output_sequence_length=output_seq_char_len,
                                                    # standardize='lower_and_strip_punctuation', # set to None if want to leave punctuation in
                                                    name='char_vectorizer')
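
The size of the character vocabulary follows directly from the `string` constants. A short check using only the standard library (`num_char_tokens` mirrors `NUM_CHAR_TOKENS` above):

```python
import string

alphabet = string.ascii_lowercase + string.digits + string.punctuation
counts = {
    "letters": len(string.ascii_lowercase),
    "digits": len(string.digits),
    "punctuation": len(string.punctuation),
}
num_char_tokens = len(alphabet) + 2   # two extra slots, as in the cell above

print(counts, num_char_tokens)
```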

In [None]:
tf_idf_vectorizer = TfidfVectorizer(analyzer='char',
                                    ngram_range=(2, 2), 
                                    strip_accents='ascii',
                                    max_features=output_seq_char_len).fit(train_chars)

In [None]:
tf_idf_vectorizer.get_feature_names_out()[:5]

In [None]:
choice = random.choice(train_chars)
x = tf_idf_vectorizer.transform([choice]).toarray()
x

In [None]:
# Word-level unigrams and bigrams
count_vectorizer = CountVectorizer(stop_words=nltk.corpus.stopwords.words('english'),
                                   preprocessor=custom_preprocessor, 
                                   ngram_range=(1,2),
                                   min_df=2,
                                   max_df=0.8)
count_vectorizer.fit(train_sentences)

In [None]:
list(count_vectorizer.vocabulary_)[:5]

In [None]:
# character level bigrams
count_vectorizer = CountVectorizer(preprocessor=custom_preprocessor,
                                   stop_words=nltk.corpus.stopwords.words('english'),
                                   ngram_range=(2,2),
                                   max_features=output_seq_char_len,
                                   strip_accents='ascii',
                                   analyzer='char_wb')

In [None]:
train_vectors = count_vectorizer.fit_transform([random.choice(train_chars)])
train_vectors.toarray()

In [None]:
# adapt character vectorizer to training
char_vectorizer.adapt(train_chars)
# test_vectors = count_vectorizer.transform(train_chars)

In [None]:
# check vocab stats
char_vocab = char_vectorizer.get_vocabulary()
print(f'Number of different characters in character vocab: {len(char_vocab)}')
print(f'5 most common char: {char_vocab[:5]}')
print(f'5 least common chars: {char_vocab[-5:]}')

In [None]:
# note: CountVectorizer.vocabulary_ maps terms to column indices, it is not sorted by frequency
print(f'Number of different character bigrams in CountVectorizer vocab: {len(count_vectorizer.vocabulary_)}')
print(f'First 5 bigrams: {list(count_vectorizer.vocabulary_)[:5]}')
print(f'Last 5 bigrams: {list(count_vectorizer.vocabulary_)[-5:]}')

In [None]:
# Test out character vectorizer
random_train_chars = random.choice(train_chars)
print(f'Charified text:\n{random_train_chars}')
print(f'\nNumber of characters: {len(random_train_chars.split())}')

vectorized_chars = char_vectorizer([random_train_chars])
vectorized_chars_counter = count_vectorizer.fit_transform([random_train_chars])

print(f'\nVectorized Vector: {vectorized_chars}')
# print(f'\nLength of vectorized chars: {len(vectorized_chars[0])}')
print(f'\nShape of vectorized chars: {vectorized_chars.shape}')

In [None]:
print(f'Charified text:\n{random_train_chars}')
print(f'\nLength of character: {len(random_train_chars.split())}')
vectorized_chars_counter = count_vectorizer.fit_transform([random_train_chars])

# https://stackoverflow.com/questions/12668027/good-ways-to-expand-a-numpy-ndarray
vectorized_chars_counter_array = np.hstack((vectorized_chars_counter.toarray()[0], 
                                            np.zeros(output_seq_char_len - len(vectorized_chars_counter.toarray()[0])))).reshape(1, -1)
print(f'\nVectorized Counter: {vectorized_chars_counter_array}')
# print(f'\nLength of vectorized chars: {vectorized_chars_counter.getnnz()}')
print(f'\nShape of vectorized chars: {vectorized_chars_counter_array.shape}')

In [None]:
np.mean([len(char) for char in train_chars])

In [None]:
np.percentile([len(sentence) for sentence in train_sentences], 95.57)

In [None]:
len(char_vocab) # alphabet

In [None]:
char_embed_keras = tf.keras.layers.Embedding(input_dim=len(char_vocab), # size of vocabulary
                                             output_dim=25, # this is the size of the char embedding in the paper
                                             mask_zero=True,
                                             name='char_embed_keras')

In [None]:
print(f'Charified text:\n{random_train_chars}')

print(f'\nNumber of characters: {len(random_train_chars.split())}')
char_embed_keras_sample = char_embed_keras(char_vectorizer([random_train_chars]))

print(f'\nEmbedded chars: {char_embed_keras_sample}')
print(f'\nShape of embedded chars: {char_embed_keras_sample.shape}')

see: https://towardsdatascience.com/word-embedding-techniques-word2vec-and-tf-idf-explained-c5d02e34d08

In [None]:
len(random_train_chars)

## Build a Conv1D model and fit it on character-level data

In [None]:
# Make Conv1D on chars
inputs = tf.keras.layers.Input(shape=(1,), 
                               dtype=tf.string)
x = char_vectorizer(inputs)
x = char_embed_keras(x)
x = tf.keras.layers.Conv1D(filters=64, 
                           kernel_size=5, 
                           padding='same', 
                           activation='relu')(x)
# x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.GlobalMaxPool1D()(x)
x = tf.keras.layers.Dropout(0.1)(x)
outputs = tf.keras.layers.Dense(num_classes, 
                                activation='softmax')(x)
model_3 = tf.keras.Model(inputs=inputs, 
                         outputs=outputs,
                         name='model_3_conv1d_char_embedding')

model_3.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
model_3.summary()
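
Model 3 swaps global average pooling for global max pooling. The difference is easy to see on a tiny, made-up feature map in NumPy: both reduce the sequence axis, but max pooling keeps the strongest activation per filter while average pooling dilutes it over all timesteps.

```python
import numpy as np

# Toy Conv1D output with shape (batch=1, timesteps=4, filters=2)
features = np.array([[[0.0, 1.0],
                      [2.0, 0.0],
                      [0.0, 0.0],
                      [0.0, 3.0]]])

max_pooled = features.max(axis=1)    # strongest activation per filter
avg_pooled = features.mean(axis=1)   # mean activation per filter

print(max_pooled)
print(avg_pooled)
```

Both results have shape `(batch, filters)`, so either can feed the final dense layer.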

In [None]:
train_char_dataset = tf.data.Dataset.from_tensor_slices((train_chars, 
                                                         train_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
val_char_dataset = tf.data.Dataset.from_tensor_slices((val_chars, 
                                                       val_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)
test_char_dataset = tf.data.Dataset.from_tensor_slices((test_chars, 
                                                        test_labels_one_hot)).batch(32).prefetch(tf.data.AUTOTUNE)

train_char_dataset

In [None]:
# fit the model
model_3_history = model_3.fit(train_char_dataset,
                              steps_per_epoch=int(0.1 * len(train_char_dataset)),
                              epochs=3,
                              validation_data=val_char_dataset,
                              validation_steps=int(0.1 * len(val_char_dataset)))

In [None]:
model_3.evaluate(val_char_dataset)

In [None]:
model_3_preds_prob = model_3.predict(val_char_dataset)
model_3_preds = tf.argmax(model_3_preds_prob, 
                          axis=1)

model_3_results = calculate_results(y_true=val_labels_label_encoded, 
                                    y_pred=model_3_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results,
                              'model_2_results': model_2_results,
                              'model_3_results': model_3_results})

model_results.transpose().sort_values(by='accuracy', ascending=False)

## Model 4: Combine pretrained token embeddings + character embeddings (hybrid)

1. Create a token-level embedding model (similar to model 1, here with the pretrained encoder from model 2)
2. Create a character-level embedding model (similar to model 3)
3. Combine 1 & 2 with a concatenate layer (`layers.Concatenate`)
4. Build a series of output layers on top of 3, with the dropout discussed in section 4.2 of https://arxiv.org/abs/1710.06071
5. Construct a model which takes token- and character-level sequences as input and produces sequence label probabilities as output
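
The shape bookkeeping behind these steps can be sketched with NumPy placeholders. The sizes below are the ones this notebook uses for model 4 (a 128-unit dense layer on the token branch and a bidirectional LSTM with 25 units per direction on the character branch); the sketch assumes the bidirectional wrapper concatenates its forward and backward outputs, which is its default behaviour.

```python
import numpy as np

batch = 2
token_features = np.zeros((batch, 128))    # token branch output
forward_state = np.zeros((batch, 25))      # character branch, forward LSTM
backward_state = np.zeros((batch, 25))     # character branch, backward LSTM

# bidirectional output: forward and backward states side by side
char_features = np.concatenate([forward_state, backward_state], axis=-1)

# hybrid representation: token and character features side by side
hybrid = np.concatenate([token_features, char_features], axis=-1)

print(char_features.shape, hybrid.shape)
```

The concatenated vector is what the dropout and dense output layers operate on.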

In [None]:
# 1. Setup token inputs/model
token_inputs =  tf.keras.layers.Input(shape=(), 
                                      dtype=tf.string, 
                                      name='token_input')

token_embeddings = sentence_encoder_layer(token_inputs)

token_output = tf.keras.layers.Dense(units=128, 
                                     activation='relu')(token_embeddings)

token_model = tf.keras.Model(token_inputs, token_output, name='token_model')

# 2. Setup character inputs/model
char_inputs = tf.keras.Input(shape=(1,),
                              dtype=tf.string,
                              name='char_input')

char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embed_keras(char_vectors)
char_bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(25))(char_embeddings) # bi-LSTM shown in figure 1 of https://arxiv.org/abs/1710.06071
char_model = tf.keras.Model(inputs=char_inputs,
                            outputs=char_bi_lstm,
                            name='char_model')

# 3. Setup concatenate token and char inputs/model
token_char_concat = tf.keras.layers.Concatenate(name='token_char_hydrid')([token_model.output, 
                                                                           char_model.output])

# 4. create output layers - adding in Dropout, discussed in section 4.2 of https://arxiv.org/abs/1710.06071
combined_dropout = tf.keras.layers.Dropout(0.5)(token_char_concat)
combined_dense = tf.keras.layers.Dense(128, 
                                       activation='relu')(combined_dropout)
final_dropout = tf.keras.layers.Dropout(0.6)(combined_dense)
output_layer = tf.keras.layers.Dense(num_classes, activation='softmax')(final_dropout)

# 5. Construct model with char and token input/model
model_4 = tf.keras.Model(inputs=[token_model.input, char_model.input], outputs=output_layer, name='model_4_token_and_char_embeddings')

In [None]:
model_4.summary()

In [None]:
# plot the model
plot_model(model_4, show_shapes=True)

In [None]:
# compile the model
# model_4.compile(loss=tf.keras.losses.CategoricalCrossentropy(), optimizer=tf.keras.optimizers.Adam(), metrics=['accuracy'])
model_4.compile(loss=tf.keras.losses.CategoricalCrossentropy(), optimizer=tf.keras.optimizers.SGD(), metrics=['accuracy'])

## Combining token and character data into tf.data dataset

see: https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [None]:
# combine chars and token
train_char_token_data = tf.data.Dataset.from_tensor_slices((train_sentences, 
                                                            train_chars)) # make data
train_char_token_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot) # make labels
train_char_token_dataset = tf.data.Dataset.zip((train_char_token_data, 
                                                train_char_token_labels)) # combine data and labels
train_char_token_dataset = train_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE) # prefetch and batch dataset

val_char_token_data = tf.data.Dataset.from_tensor_slices((val_sentences, 
                                                          val_chars))
val_char_token_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_dataset = tf.data.Dataset.zip((val_char_token_data, 
                                              val_char_token_labels))
val_char_token_dataset = val_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE) # prefetch and batch dataset

test_char_token_data = tf.data.Dataset.from_tensor_slices((test_sentences, 
                                                           test_chars))
test_char_token_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_char_token_dataset = tf.data.Dataset.zip((test_char_token_data, 
                                               test_char_token_labels))
test_char_token_dataset = test_char_token_dataset.batch(32).prefetch(tf.data.AUTOTUNE) # prefetch and batch dataset


train_char_token_dataset
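
The nesting matters here: each element must look like `((token_text, char_text), label)` so that the two inputs line up with the model's two input layers. A pure-Python analogy with made-up data, using the built-in `zip` in place of `tf.data.Dataset.zip`:

```python
sentences = ["a b", "c d"]          # token-level inputs
chars = ["a   b", "c   d"]          # character-level inputs
labels = [[1, 0], [0, 1]]           # one-hot labels

features = list(zip(sentences, chars))   # pairs of (tokens, chars)
dataset = list(zip(features, labels))    # pairs of ((tokens, chars), label)

(first_tokens, first_chars), first_label = dataset[0]
print(first_tokens, repr(first_chars), first_label)
```

The order of the inner pair has to match the order of `inputs=[token_model.input, char_model.input]` when the model is constructed.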

In [None]:
# fit model 4
model_4_history = model_4.fit(train_char_token_dataset,
                              steps_per_epoch=int(0.1 * len(train_char_token_dataset)),
                              epochs=5,
                              validation_data=val_char_token_dataset,
                              validation_steps=int(0.1 * len(val_char_token_dataset)))
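
The `steps_per_epoch` and `validation_steps` arguments above rely on `len()` of a batched dataset returning the number of batches, not the number of samples. The sketch below reproduces that arithmetic in plain Python; `steps_for_fraction` and the sample count are hypothetical and only illustrate the calculation.

```python
import math


def steps_for_fraction(num_samples, batch_size=32, fraction=0.1):
    """Return how many batches cover `fraction` of a batched dataset."""
    # A dataset batched without drop_remainder has ceil(n / batch_size) batches.
    num_batches = math.ceil(num_samples / batch_size)
    # int() truncates, exactly like int(0.1 * len(dataset)) in the notebook.
    return int(fraction * num_batches)


# Hypothetical dataset size, chosen only for illustration.
example_steps = steps_for_fraction(num_samples=10_000)
print(example_steps)
```

With a small fraction the model sees only part of the training data in each epoch, which keeps experiments fast at the cost of noisier metrics.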

In [None]:
model_4_preds_prob = model_4.predict(val_char_token_dataset)
model_4_preds = tf.argmax(model_4_preds_prob, 
                          axis=1)

model_4_results = calculate_results(y_true=val_labels_label_encoded, 
                                    y_pred=model_4_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results,
                              'model_2_results': model_2_results,
                              'model_3_results': model_3_results,
                              'model_4_results': model_4_results})

model_results.transpose().sort_values(by='accuracy', ascending=False)

# Model 5: Pretrained token embeddings + character embeddings + positional embeddings

## Feature Engineering
* Feature engineering means taking non-obvious features from the data and encoding them numerically to help our model learn
* How can we add extra sources of data to our model?

Data augmentation is a form of feature engineering

> **Note**: Any engineered features used to train a model need to be available at test time.  
> In our case, the line number and the total number of lines are available.
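
To make the note concrete, here is a minimal sketch of how both positional features can be derived from nothing but the sentences of an unseen abstract. The function name `positional_features` and the convention used (0-based `number`, `total_line` equal to the sentence count) are assumptions for illustration; the preprocessing earlier in this notebook may count differently.

```python
def positional_features(abstract_lines):
    """Return one dict per sentence with its position and the abstract length."""
    total = len(abstract_lines)
    return [
        {"text": line, "number": index, "total_line": total}
        for index, line in enumerate(abstract_lines)
    ]


# A made-up abstract, split into sentences.
unseen_abstract = [
    "This study tested a new treatment.",
    "Patients were randomly assigned to two groups.",
    "The treatment group improved.",
]

rows = positional_features(unseen_abstract)
for row in rows:
    print(row["number"], row["total_line"], row["text"])
```

Because both values come from the input text alone, they can be computed for any abstract at prediction time.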

## Create positional embeddings

In [None]:
# how many different line numbers are there?
train_df['number'].value_counts()

In [None]:
train_df.head()

In [None]:
# check distribution of number
train_df.number.plot.hist()

In [None]:
# use TensorFlow one-hot encoding to create tensors of our number column
train_line_numbers_one_hot = tf.one_hot(train_df['number'].to_numpy(), depth=15) # depth=15 keeps the vectors small; line numbers >= 15 become all-zero rows
val_line_numbers_one_hot = tf.one_hot(val_df['number'].to_numpy(), depth=15) # same depth as the training data
test_line_numbers_one_hot = tf.one_hot(test_df['number'].to_numpy(), depth=15) # same depth as the training data

train_line_numbers_one_hot[:10], train_line_numbers_one_hot.shape

Now that we've encoded the line number feature, let's do the same for the total number of lines.

In [None]:
train_df['total_line'].value_counts()

In [None]:
# check distribution of total lines

train_df['total_line'].plot.hist()

In [None]:
# check which total line value covers 98% of the samples
np.percentile(train_df.total_line, 98)
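
The percentile check helps pick a one-hot depth that covers most samples without creating very wide vectors. The sketch below shows the idea on made-up values; `total_lines` is hypothetical data, not the contents of `train_df`.

```python
import numpy as np

# Hypothetical total-line values; the real ones come from train_df.total_line.
total_lines = np.array([8, 9, 10, 10, 11, 11, 12, 12, 13, 14, 15, 18, 30])

# Value below which 98% of the samples fall.
cutoff = np.percentile(total_lines, 98)

# Share of samples that a one-hot depth of 20 would represent directly.
coverage = np.mean(total_lines < 20)

print(f"98th percentile: {cutoff:.2f}, coverage at depth 20: {coverage:.2%}")
```

A depth close to the 98th percentile keeps almost all samples representable while bounding the size of the input vector.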

In [None]:
# use TensorFlow one-hot encoding to create tensors of our total_line feature
train_total_line_one_hot = tf.one_hot(train_df.total_line.to_numpy(), depth=20)
val_total_line_one_hot = tf.one_hot(val_df.total_line.to_numpy(), depth=20)
test_total_line_one_hot = tf.one_hot(test_df.total_line.to_numpy(), depth=20)

train_total_line_one_hot.shape, train_total_line_one_hot[:10]
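
One consequence of a fixed `depth` is worth spelling out: `tf.one_hot` encodes any index outside `[0, depth)` as a row of zeros rather than raising an error. The NumPy function below is a hypothetical stand-in that mimics this behaviour so it can be inspected without TensorFlow.

```python
import numpy as np


def one_hot_truncated(indices, depth):
    """Mimic tf.one_hot: indices outside [0, depth) become all-zero rows."""
    indices = np.asarray(indices)
    encoded = np.zeros((len(indices), depth), dtype=np.float32)
    in_range = (indices >= 0) & (indices < depth)
    # Set a single 1.0 per row, but only for rows whose index is in range.
    encoded[np.nonzero(in_range)[0], indices[in_range]] = 1.0
    return encoded


line_numbers = [0, 3, 14, 22]  # 22 does not fit into depth=15
encoded = one_hot_truncated(line_numbers, depth=15)
print(encoded.sum(axis=1))  # [1. 1. 1. 0.]
```

If the long abstracts should keep a signal, one alternative is to clip the values first (for example with `np.clip`) so they share the last bucket.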

## Building a tribrid embedding model

1. create a token-level model
2. create a character-level model
3. create a model for the line number feature
4. create a model for the total line feature
5. combine the outputs of 1 and 2 using tf.keras.layers.Concatenate
6. combine the outputs of 3, 4, 5 using tf.keras.layers.Concatenate
7. create an output layer to accept the tribrid embedding and output label probabilities
8. combine the inputs of 1, 2, 3, 4 and the outputs of 7 into a tf.keras.Model
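
The steps above can be sketched on a toy scale before wiring in the real encoders. In this sketch the token and character branches are plain dense layers over made-up float inputs (stand-ins for the pretrained sentence encoder and the character bi-LSTM), and `NUM_CLASSES` and all layer sizes are assumptions chosen for illustration.

```python
import tensorflow as tf

NUM_CLASSES = 5  # assumed number of labels, for illustration only

# 1. and 2. Stand-ins for the token and character branches.
token_in = tf.keras.layers.Input(shape=(16,), name="token_features")
token_out = tf.keras.layers.Dense(8, activation="relu")(token_in)
char_in = tf.keras.layers.Input(shape=(12,), name="char_features")
char_out = tf.keras.layers.Dense(8, activation="relu")(char_in)

# 3. and 4. Positional branches fed with one-hot vectors.
line_in = tf.keras.layers.Input(shape=(15,), name="line_number")
line_out = tf.keras.layers.Dense(4, activation="relu")(line_in)
total_in = tf.keras.layers.Input(shape=(20,), name="total_lines")
total_out = tf.keras.layers.Dense(4, activation="relu")(total_in)

# 5. Concatenate token and character outputs.
hybrid = tf.keras.layers.Concatenate(name="token_char")([token_out, char_out])
hybrid = tf.keras.layers.Dense(16, activation="relu")(hybrid)

# 6. Concatenate the positional outputs with the hybrid embedding.
tribrid = tf.keras.layers.Concatenate(name="tribrid")([line_out, total_out, hybrid])

# 7. Output layer producing label probabilities.
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(tribrid)

# 8. One model with four inputs; the input order here fixes the data order later.
toy_model = tf.keras.Model(inputs=[line_in, total_in, token_in, char_in],
                           outputs=outputs)

batch = [tf.zeros((2, 15)), tf.zeros((2, 20)), tf.zeros((2, 16)), tf.zeros((2, 12))]
probs = toy_model(batch)
print(probs.shape)
```

The order of the `inputs` list matters: any dataset fed to the model has to supply its features in the same order.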

In [None]:
train_line_numbers_one_hot[0].shape, train_total_line_one_hot[0].shape

In [None]:
# 1. Token inputs
token_inputs = tf.keras.layers.Input(shape=(), 
                                     dtype=tf.string, 
                                     name='token_input')

# 1.1 create token embedding with pretrained model
token_embeddings = sentence_encoder_layer(token_inputs) # transfer learning - pretrained model 

# 1.2 token output with activation relu
token_outputs = tf.keras.layers.Dense(units=128, 
                                      activation='relu')(token_embeddings)

# 1.3 create token model
token_model = tf.keras.Model(inputs=token_inputs,
                             outputs=token_outputs,
                             name='token_model')

# 2. Char inputs
char_inputs = tf.keras.layers.Input(shape=(1,), 
                                    dtype=tf.string, 
                                    name='char_input')

# 2.1 create a character vectorizer
char_vectors = char_vectorizer(char_inputs)

# 2.2 create character embeddings with the Keras Embedding layer
char_embeddings = char_embed_keras(char_vectors)

# 2.3 create bidirectional layer with LSTM
char_bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(24))(char_embeddings) # bi-LSTM shown in figure 1 of https://arxiv.org/abs/1710.06071

# 2.4 create a char model
char_model = tf.keras.Model(inputs=char_inputs,
                            outputs=char_bi_lstm,
                            name='char_model')

# 3. Line number model
line_number_inputs = tf.keras.layers.Input(shape=(15,), # train_line_numbers_one_hot[0].shape
                                           dtype=tf.float32,
                                           name='line_number_input')

# 3.1 create a dense layer with 32 units
line_number_outputs = tf.keras.layers.Dense(units=32,
                                            activation='relu')(line_number_inputs)

# 3.2 create a line model
line_number_model = tf.keras.Model(inputs=line_number_inputs,
                                   outputs=line_number_outputs,
                                   name='line_number_model')

# 4. Total lines model
total_number_inputs = tf.keras.layers.Input(shape=(20,), # train_total_line_one_hot[0].shape
                                            dtype=tf.float32,
                                            name='total_number_input')

# 4.1 create a dense layer with 32 units and activation relu
total_number_outputs = tf.keras.layers.Dense(units=32,
                                             activation='relu')(total_number_inputs)

# 4.2 create a total model
total_number_model = tf.keras.Model(inputs=total_number_inputs,
                                    outputs=total_number_outputs,
                                    name='total_number_model')

# 5. combine token and char embeddings into a hybrid embedding
combined_embeddings = tf.keras.layers.Concatenate(name='char_token_hydrid_embedding')([token_model.output, 
                                                                                       char_model.output])
# 5.1 create a dense layer with 256 units and activation relu
z = tf.keras.layers.Dense(units=256, 
                          activation='relu')(combined_embeddings)

# 5.2 create a Dropout layer with 0.5
z = tf.keras.layers.Dropout(0.5)(z)

# 6. combine positional embeddings with the combined token and char embeddings
tribrid_embeddings = tf.keras.layers.Concatenate(name='char_token_possitional_embedding')([line_number_model.output,
                                                                                          total_number_model.output,
                                                                                          z])

# 7. create output layer 
output_layer = tf.keras.layers.Dense(units=num_classes,
                                     activation='softmax',
                                     name='output_layer')(tribrid_embeddings)

# 8. put together model with all kinds of inputs
model_5 = tf.keras.Model(inputs=[line_number_model.input,
                                 total_number_model.input,
                                 token_model.input,
                                 char_model.input],
                         outputs=output_layer,
                         name='model_5_tribrid')

model_5.summary()

In [None]:
# plot model 5 to visualize its architecture
plot_model(model_5, show_shapes=True)

In [None]:
# compile the model
model_5.compile(loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2), # helps to prevent overfitting
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])
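
Label smoothing replaces the hard 0/1 targets with softer ones before the loss is computed. With smoothing factor `a` and `K` classes, each target becomes `y * (1 - a) + a / K`. The NumPy sketch below applies that formula to a single one-hot label; `smooth_labels` is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np


def smooth_labels(one_hot, label_smoothing):
    """Apply y * (1 - a) + a / K to a one-hot label vector."""
    one_hot = np.asarray(one_hot, dtype=np.float64)
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - label_smoothing) + label_smoothing / num_classes


smoothed = smooth_labels([0, 0, 1, 0, 0], label_smoothing=0.2)
print(smoothed)  # [0.04 0.04 0.84 0.04 0.04]
```

The true class keeps most of the probability mass while the others receive a small share, which discourages the model from becoming overconfident on a single class.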

## Create tribrid embedding datasets using tf.data

In [None]:
# create training and validation datasets with all four kinds of input data; the order must match the order of the model inputs
train_char_token_pos_data = tf.data.Dataset.from_tensor_slices((train_line_numbers_one_hot,
                                                                train_total_line_one_hot,
                                                                train_sentences,
                                                                train_chars))

train_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(train_labels_one_hot)
train_char_token_pos_dataset = tf.data.Dataset.zip((train_char_token_pos_data, train_char_token_pos_labels))
train_char_token_pos_dataset = train_char_token_pos_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# do the same for validation data
val_char_token_pos_data = tf.data.Dataset.from_tensor_slices((val_line_numbers_one_hot,
                                                              val_total_line_one_hot,
                                                              val_sentences,
                                                              val_chars))
val_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(val_labels_one_hot)
val_char_token_pos_dataset = tf.data.Dataset.zip((val_char_token_pos_data, val_char_token_pos_labels))
val_char_token_pos_dataset = val_char_token_pos_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# do the same for test data
test_char_token_pos_data = tf.data.Dataset.from_tensor_slices((test_line_numbers_one_hot,
                                                                test_total_line_one_hot,
                                                                test_sentences,
                                                                test_chars))
test_char_token_pos_labels = tf.data.Dataset.from_tensor_slices(test_labels_one_hot)
test_char_token_pos_dataset = tf.data.Dataset.zip((test_char_token_pos_data, test_char_token_pos_labels))
test_char_token_pos_dataset = test_char_token_pos_dataset.batch(32).prefetch(tf.data.AUTOTUNE)

train_char_token_pos_dataset

In [None]:
# fit the model 5
model_5_history = model_5.fit(train_char_token_pos_dataset,
                              steps_per_epoch=int(0.1 * len(train_char_token_pos_dataset)),
                              epochs=3,
                              validation_data=val_char_token_pos_dataset,
                              validation_steps=int(0.1 * len(val_char_token_pos_dataset)))

In [None]:
model_5_preds_prob = model_5.predict(val_char_token_pos_dataset)
model_5_preds = tf.argmax(model_5_preds_prob, 
                          axis=1)

model_5_results = calculate_results(y_true=val_labels_label_encoded, 
                                    y_pred=model_5_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results,
                              'model_2_results': model_2_results,
                              'model_3_results': model_3_results,
                              'model_4_results': model_4_results,
                              'model_5_results': model_5_results})

model_results.transpose().sort_values(by='accuracy', ascending=False)

In [None]:
# Plot and compare all models results
model_results.plot(kind='bar', figsize=(20, 5)).legend(bbox_to_anchor=(1.0,  1.0))

## Save the model

In [None]:
model_5.save(f'{MODEL_PATH}/skimlist_tribrid_model')

In [4]:
model_loaded = tf.keras.models.load_model(f'{MODEL_PATH}/skimlist_tribrid_model')

In [None]:
model_loaded_preds_prob = model_loaded.predict(val_char_token_pos_dataset)
model_loaded_preds = tf.argmax(model_loaded_preds_prob, 
                               axis=1)

model_loaded_results = calculate_results(y_true=val_labels_label_encoded, 
                                         y_pred=model_loaded_preds)

model_results = pd.DataFrame({'model_0_results': model_0_results,
                              'model_1_results': model_1_results,
                              'model_2_results': model_2_results,
                              'model_3_results': model_3_results,
                              'model_4_results': model_4_results,
                              'model_5_results': model_5_results,
                              'model_5_loaded_results': model_loaded_results})

model_results.transpose().sort_values(by='accuracy', ascending=False)

In [None]:
cores = multiprocessing.cpu_count()
tf_idf_vectorizer = TfidfVectorizer(max_features=25)
vectorized_chars_counter = count_vectorizer.fit_transform([random_train_chars])
vectorized_chars_counter_array = np.hstack((vectorized_chars_counter.toarray()[0], 
                                            np.zeros(output_seq_char_len - len(vectorized_chars_counter.toarray()[0])))).reshape(1, -1)

vectorized_chars_counter_array

In [None]:
tf_idf_vectorizer = TfidfVectorizer(max_features=25,
                                    stop_words=nltk.corpus.stopwords.words('english'),
                                    strip_accents='ascii',
                                    analyzer='char_wb')
char_embed_tf_idf = tf_idf_vectorizer.fit_transform(np.hstack((random_train_chars.split(), 
                                                               np.zeros(output_seq_char_len - len(random_train_chars.split())))))

In [None]:
char_embed_tf_idf

In [None]:
cores = multiprocessing.cpu_count()
vocabs = np.hstack((random_train_chars.split(),
                    np.zeros(output_seq_char_len - len(random_train_chars.split()))))
w2v_model = Word2Vec(min_count=4,
                     window=4,
                     vector_size=len(vocabs),
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     sg = 1,
                     workers=cores-1)
w2v_model.build_vocab(vocabs, progress_per=10000)
w2v_model.train(vocabs, total_examples=w2v_model.corpus_count, epochs=100, report_delay=1)

In [None]:
w2v_model.wv.vectors.shape

In [None]:
# with open('D:/txt/cbow_s1000.txt', 'r', encoding='utf-8') as f:
#     lines = f.readlines()

In [None]:
# for line in lines[1:10]:
#     print(line.split())

In [None]:
# parser = argparse.ArgumentParser()
# parser.add_argument('input', help='Single embedding txt file')
# parser.add_argument('output', help='Output basename without extension')
# args = parser.parse_args()

# embedding_cbow_file = args.ouput + '.npy'
# vocabulary_cbow_file = args.output + '.txt'

# https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455
# https://gist.github.com/erickrf/e54cd0f3d917ec61b3ae758a5e47b883
# https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
# https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
# https://github.com/deeplearning4j
# https://medium.com/@erkajalkumari/step-by-step-guide-to-word2vec-with-gensim-2c4cd9dde01f

# vocabs = []
# words_vector = []
# for line in lines[1:100]:
#     splitlines = line.split()
#     vocabs.append(splitlines[0].strip())
#     words_vector.append(np.fromiter((np.float32(x.replace(',', '.')) for x in splitlines[1:]), dtype=np.float32))

In [None]:
# np.save('D:/txt/cbow_1000.npy', np.array(words_vector))
# with open('D:/txt/cbow_1000.vocab', 'w', encoding='utf-8') as f:
#     for vocab in vocabs:
#         f.write(vocab)
#         f.write('\n')

In [None]:
def convert_to_binary(input_txt, output_vocab, output_binary):
    """
    Convert a text-format embedding file into a vocabulary file and a NumPy binary file.
    :param input_txt: path of the embedding file in text format
    :param output_vocab: path of the vocabulary file to write (one word per line)
    :param output_binary: path of the NumPy .npy file to write
    """
    with open(input_txt, 'r', encoding='utf-8') as read_file:
        lines = read_file.readlines()
    word_vector = []
    with open(output_vocab, 'w', encoding='utf-8') as write_file:
        for line in lines[1:100]: # skip the first line; only the next 99 entries are converted
            splitlines = line.split()
            write_file.write(splitlines[0].strip()) # the file is open in text mode, so write str, not bytes
            write_file.write("\n")
            word_vector.append(np.fromiter((np.float32(x.replace(',', '.')) for x in splitlines[1:]), dtype=np.float32))
    np.save(output_binary, np.array(word_vector))

In [None]:
def load_embeddings_binary():
    """
    Load embeddings that were saved as a NumPy binary file plus a vocabulary file.
    Loading the binary file is faster than parsing the original text file.
    The paths of both files are hard-coded below.
    :return: dict mapping each word to its embedding vector
    """
    model = {}

    with open('D:/txt/cbow_1000.vocab', 'r', encoding='utf-8') as file:
        words = [line.strip() for line in file]
    
    wv = np.load('D:/txt/cbow_1000.npy')
    
    for i, w in enumerate(words):
        model[w] = wv[i]
    
    return model
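
The two functions above form a round trip: text file to vocabulary plus `.npy`, then back to a word-to-vector dictionary. The sketch below runs that round trip on a tiny made-up embedding file in a temporary directory. `save_embeddings` and `load_embeddings` are hypothetical, path-parameterised variants written for this illustration, and the file is assumed to start with a `<count> <dim>` header line.

```python
import os
import tempfile

import numpy as np


def save_embeddings(input_txt, output_vocab, output_binary):
    """Split a text embedding file into a vocab file and a .npy matrix."""
    words, vectors = [], []
    with open(input_txt, "r", encoding="utf-8") as read_file:
        next(read_file)  # skip the assumed "<count> <dim>" header line
        for line in read_file:
            parts = line.split()
            words.append(parts[0])
            vectors.append(np.array([float(x) for x in parts[1:]], dtype=np.float32))
    with open(output_vocab, "w", encoding="utf-8") as write_file:
        write_file.write("\n".join(words) + "\n")
    np.save(output_binary, np.stack(vectors))


def load_embeddings(vocab_path, binary_path):
    """Rebuild the word-to-vector dictionary from the two files."""
    with open(vocab_path, "r", encoding="utf-8") as file:
        words = [line.strip() for line in file]
    matrix = np.load(binary_path)
    return {word: matrix[i] for i, word in enumerate(words)}


with tempfile.TemporaryDirectory() as tmp:
    txt = os.path.join(tmp, "toy.txt")
    vocab = os.path.join(tmp, "toy.vocab")
    npy = os.path.join(tmp, "toy.npy")
    with open(txt, "w", encoding="utf-8") as f:
        f.write("2 3\nhoje 0.1 0.2 0.3\ndia 0.4 0.5 0.6\n")
    save_embeddings(txt, vocab, npy)
    embeddings = load_embeddings(vocab, npy)

print(embeddings["dia"])
```

Passing the paths as arguments instead of hard-coding them makes the same functions usable for other embedding files.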

In [None]:
def build_phrases(sentences):
    phrases = Phrases(sentences,
                      min_count=5,
                      threshold=7,
                      progress_per=1000)
    return Phraser(phrases)

In [None]:
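# Phrases expects tokenized sentences (lists of tokens), not raw strings
result = build_phrases([sentence.split() for sentence in ["hoje é bonito", "amanhã será feito"]])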

In [None]:
# dict_w2v = load_embeddings_binary()
# embedding_df = pd.DataFrame(dict_w2v)
# embedding_df

In [None]:
def get_w2v(sentence, model):
    """
    :param sentence: a single sentence whose word embeddings are to be extracted.
    :param model: dict mapping each word to its embedding vector.
    :return: numpy array containing the embedding of every word in the sentence
             (a zero vector for words missing from the model).
    """
    return np.array([model.get(val, np.zeros(1000)) for val in sentence.split()], dtype=np.float64)
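
A short usage sketch shows the zero-vector fallback in action. `sentence_vectors` is a hypothetical variant of the function above with the embedding size passed in as `dim` (the original hard-codes 1000), and `toy_model` is made-up data.

```python
import numpy as np


def sentence_vectors(sentence, model, dim):
    """Look up each whitespace token; unknown tokens map to a zero vector."""
    return np.array([model.get(token, np.zeros(dim)) for token in sentence.split()],
                    dtype=np.float64)


# Tiny made-up embedding dictionary with 4-dimensional vectors.
toy_model = {"hoje": np.ones(4), "dia": np.full(4, 0.5)}

vectors = sentence_vectors("hoje é dia", toy_model, dim=4)
print(vectors.shape)  # (3, 4)
print(vectors[1])     # "é" is not in toy_model, so this row is all zeros
```

The fallback dimension has to match the dimension of the stored vectors, otherwise NumPy cannot stack the rows into one array.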

In [None]:
w2v_sentences = ["oi me ferrei!"]
model_w2v = Word2Vec(sentences=[sentence.split() for sentence in w2v_sentences], 
                     vector_size=1000,
                     min_count=1)

In [None]:
model_w2v.build_vocab([sentence.split() for sentence in w2v_sentences])

In [None]:
model_w2v.wv.get_normed_vectors()

In [None]:
model_w2v.wv.get_vector('ferrei!')

In [None]:
model_w2v.wv.similar_by_vector(model_w2v.wv.get_vector('oi'))

In [None]:
# w2v = KeyedVectors.load_word2vec_format('D:/txt/cbow_s1000.txt')

In [None]:
# w2v.add_vector("ferrei!", model_w2v.wv.get_vector('ferrei!'))

In [None]:
# [ [w2v.get_vector(w) for w in sentence.split()] for sentence in ["hoje é um bom dia"] ]

In [None]:
# dict_w2v.update({'ferrei!': model_w2v.wv.get_vector('ferrei!')})
# result = get_w2v("eu me ferrei!", dict_w2v)
# result.shape, result[2]

In [None]:
# convert_to_binary('D:/txt/cbow_s1000.txt', 'D:/txt/cbow_1000.vocab', 'D:/txt/cbow_1000.npy')