## Segmentation Model

Here I develop a segmentation model on top of pre-trained deep neural models. To load the GloVe embeddings, I followed [this tutorial](https://keras.io/examples/nlp/pretrained_word_embeddings/).

---
### To Do:
- Find the best way to embed numbers

In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TextVectorization, Embedding, LSTM, Bidirectional

2021-12-29 17:52:10.613515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/oracle/instantclient_19_11:
2021-12-29 17:52:10.613594: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
EMBEDDING_DIM = 100
MIN_TOKENS = 30
MAX_TOKENS = 200
VOCAB_SIZE = 20000
VALIDATION_PCT = 0.15
TEST_PCT = 0.15

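def get_sentence_labels(sentences):
    """
        Build one label per word across a list of sentences: the last word of
        every sentence except the final one is marked as an end of sentence
    """
    
    ORDINARY_TOKEN = 0
    END_OF_SENTENCE_TOKEN = 1
    
    labels = []
    for i, sentence in enumerate(sentences, start=1):
        n_words = len(sentence.split())

        # Skip empty (or whitespace-only) sentences: they have no words to label
        if n_words == 0:
            continue

        partial_labels = [ORDINARY_TOKEN] * n_words

        # Nothing follows the last sentence, so it gets no boundary label
        if i != len(sentences):
            partial_labels[-1] = END_OF_SENTENCE_TOKEN

        labels += partial_labels
    
    return labels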
    

def build_random_sized_sentences(text):
    """
        Slice a single corpus sequentially into chunks of random size, each
        size drawn from [MIN_TOKENS, MAX_TOKENS); the last chunk may be shorter
    """
    
    text_tokens = text.split()
    prev_index = 0
    sentences = []
    while prev_index < len(text_tokens):
        final_index = prev_index + np.random.randint(MIN_TOKENS, MAX_TOKENS)
        sentences.append(' '.join(text_tokens[prev_index:final_index]))
        
        prev_index = final_index
        
    return np.array(sentences)
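
To make the labelling scheme concrete, here is a minimal, dependency-free sketch of the same idea. The helper name `label_tokens` and the choice of splitting on `"."` are assumptions made only for this illustration; the notebook does not show how the list of sentences is produced.

```python
ORDINARY_TOKEN = 0
END_OF_SENTENCE_TOKEN = 1


def label_tokens(sentences):
    """Hypothetical stand-in for the labelling helper: one label per word."""
    labels = []
    for i, sentence in enumerate(sentences, start=1):
        words = sentence.split()
        if not words:
            # Empty pieces (e.g. what follows a trailing period) have no words
            continue
        partial = [ORDINARY_TOKEN] * len(words)
        if i != len(sentences):
            # Every piece except the very last one ends a sentence
            partial[-1] = END_OF_SENTENCE_TOKEN
        labels += partial
    return labels


# Assumption: sentences come from splitting the text on periods
sentences = "this is one. and two.".split(".")
labels = label_tokens(sentences)
print(sentences)  # ['this is one', ' and two', '']
print(labels)     # [0, 0, 1, 0, 1]
```

Because splitting on a trailing period leaves an empty last piece, "two" is still labelled as an end of sentence: the piece that holds it is not the last element of the list.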

---
## Training sentences

The training sentences are built by drawing a random length for each one and slicing the original corpora sequentially into chunks of those lengths.
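
The slicing can be sketched on a toy token list. This is only an illustration: `chunk_tokens` is a hypothetical helper, and it uses the standard-library `random` module (with a fixed seed) in place of `np.random.randint` so that it runs without NumPy.

```python
import random


def chunk_tokens(tokens, min_tokens, max_tokens, rng):
    """Slice `tokens` sequentially into chunks of random size."""
    chunks = []
    start = 0
    while start < len(tokens):
        # Upper bound is excluded, mirroring np.random.randint(low, high)
        size = rng.randrange(min_tokens, max_tokens)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = [f"w{i}" for i in range(20)]
chunks = chunk_tokens(tokens, min_tokens=3, max_tokens=6, rng=rng)

for chunk in chunks:
    print(len(chunk), " ".join(chunk))
```

Every token lands in exactly one chunk and the original order is preserved. The final chunk simply takes whatever is left, so it can be shorter than the minimum size.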

In [3]:
df = pd.read_csv("../data/TED_Talks/02_preprocessed/subtitles_preprocessed.csv")
df.head(3)

| Unnamed: 0.1 | Unnamed: 0 | subtitle | url |
|---|---|---|---|
| 0 | 0 | this is the air jordan 3 black cement. this mi... | https://www.ted.com/talks/josh_luber_why_sneak... |
| 1 | 1 | if you want to buy high quality low price coca... | https://www.ted.com/talks/jamie_bartlett_how_t... |
| 2 | 2 | do you know how many choices you make in a typ... | https://www.ted.com/talks/sheena_iyengar_how_t... |


In [4]:
entire_corpora = df.subtitle.apply(lambda text: build_random_sized_sentences(text))
entire_corpora = np.concatenate(entire_corpora.values)

In [5]:
corpus_size = len(entire_corpora)
train_upper_idx = int(corpus_size * (1 - VALIDATION_PCT - TEST_PCT))
valid_upper_idx = int(corpus_size * (1 - TEST_PCT))

train_data = entire_corpora[:train_upper_idx]
valid_data = entire_corpora[train_upper_idx:valid_upper_idx]
test_data = entire_corpora[valid_upper_idx:]
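
As a sanity check on the index arithmetic, here is a small sketch of the same split applied to 100 stand-in samples (the list of integers is an assumption used only to make the sizes easy to read).

```python
VALIDATION_PCT = 0.15
TEST_PCT = 0.15

# Stand-in for the array of chunks: 100 dummy samples
samples = list(range(100))

corpus_size = len(samples)
train_upper_idx = int(corpus_size * (1 - VALIDATION_PCT - TEST_PCT))
valid_upper_idx = int(corpus_size * (1 - TEST_PCT))

# Contiguous, non-overlapping slices in the original order (no shuffling)
train = samples[:train_upper_idx]
valid = samples[train_upper_idx:valid_upper_idx]
test = samples[valid_upper_idx:]

print(len(train), len(valid), len(test))
```

The three slices cover the data exactly once. Since nothing is shuffled, chunks that came from the same talk stay next to each other, so only the talks sitting on a boundary are shared between two splits.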

---
## Text Vectorization

In [6]:
vectorizer = TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_TOKENS)
text_ds = tf.data.Dataset.from_tensor_slices(train_data).batch(128)

2021-12-29 17:52:21.929487: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-12-29 17:52:21.929597: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (kunumi): /proc/driver/nvidia/version does not exist
2021-12-29 17:52:21.937839: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Learn the most frequent words in `train_data`

In [7]:
vectorizer.adapt(text_ds)

Build a mapping from words to their respective indices

In [8]:
word_index = {word: i for i, word in enumerate(vectorizer.get_vocabulary())}
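
For intuition, the vocabulary learned by `adapt` can be approximated in plain Python. This is a rough sketch, not the Keras implementation: `build_vocabulary` is a hypothetical helper, and it assumes the layer's default behaviour of lowercasing, stripping punctuation, splitting on whitespace, and reserving index 0 for padding (`""`) and index 1 for out-of-vocabulary words (`"[UNK]"`).

```python
import string
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Approximate a frequency-ranked vocabulary with two reserved slots."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punctuation).split())
    # The two reserved tokens count towards max_tokens
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


texts = ["The cat sat.", "the cat ran", "the dog"]
vocabulary = build_vocabulary(texts, max_tokens=4)
toy_word_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)                    # ['', '[UNK]', 'the', 'cat']
print(toy_word_index.get("dog", 1))  # unseen or rare words fall back to [UNK]
```

With `max_tokens=4` only the two most frequent words survive, and everything else maps to the `[UNK]` index.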

---

## Build the Embedding Layer

In [9]:
embeddings = {}
with open(f"../data/embeddings/glove/glove.6B.{EMBEDDING_DIM}d.txt", encoding="utf-8") as file:
    for line in file:
        word, coefs = line.split(maxsplit=1)
        embeddings[word] = np.fromstring(coefs, 'f', sep=' ')
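
Each line of a GloVe text file holds a word followed by its space-separated coefficients. The sketch below parses two made-up lines from an in-memory file; the words and values are invented for illustration, and plain Python lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import io

# Made-up sample in the "word coef coef ..." layout (3 dimensions here)
sample_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "cat 0.4 0.5 -0.6\n"
)

toy_embeddings = {}
for line in sample_file:
    # Split once: the word on the left, all coefficients on the right
    word, coefs = line.split(maxsplit=1)
    toy_embeddings[word] = [float(value) for value in coefs.split()]

print(toy_embeddings["cat"])  # [0.4, 0.5, -0.6]
```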

Build the embedding matrix

In [10]:
hits, misses = 0, 0

# +2 due to the empty string (padding) and the UNK token
# Rows start as all-zeros, so words not found in the embedding index
# (including the representations for "padding" and "OOV") stay all-zeros.
embedding_matrix = np.zeros((VOCAB_SIZE+2, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 11195 words (265 misses)
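
The hit/miss bookkeeping is easier to follow on a toy vocabulary. In this sketch every name and value is made up, and nested lists stand in for the NumPy matrix.

```python
TOY_EMBEDDING_DIM = 3

# Toy vocabulary index and toy pre-trained vectors (all values invented)
toy_word_index = {"": 0, "[UNK]": 1, "the": 2, "cat": 3, "zzyzx": 4}
toy_embeddings = {"the": [0.1, -0.2, 0.3], "cat": [0.4, 0.5, -0.6]}

# One all-zero row per vocabulary entry
toy_matrix = [[0.0] * TOY_EMBEDDING_DIM for _ in range(len(toy_word_index))]

toy_hits, toy_misses = 0, 0
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)  # copy the pre-trained vector into row i
        toy_hits += 1
    else:
        toy_misses += 1               # row i stays all-zeros

print(toy_hits, toy_misses)  # 2 3
```

The padding and `[UNK]` entries are counted as misses too, because neither has a pre-trained vector.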


In [11]:
model = Sequential()

model.add(
    Embedding(
        VOCAB_SIZE+2,
        EMBEDDING_DIM,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )
)

model.add(
    Bidirectional(
        LSTM(
            units=20
        )
    )
)

model.compile()

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         2000200   
                                                                 
 bidirectional (Bidirectiona  (None, 40)               19360     
 l)                                                              
                                                                 
Total params: 2,019,560
Trainable params: 19,360
Non-trainable params: 2,000,200
_________________________________________________________________
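
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias) and doubles it for the two directions; the variable names are mine.

```python
VOCAB_SIZE = 20000
EMBEDDING_DIM = 100
LSTM_UNITS = 20

# Embedding: one EMBEDDING_DIM-sized row per index (frozen, so non-trainable)
embedding_params = (VOCAB_SIZE + 2) * EMBEDDING_DIM

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)

# Bidirectional wraps two independent LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params

print(embedding_params)                         # 2000200
print(bidirectional_params)                     # 19360
print(embedding_params + bidirectional_params)  # 2019560
```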
