# Natural Language Processing

I started by following this excellent tutorial series from [TensorFlow on YouTube](https://www.youtube.com/playlist?list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S).

Before that, let's get some example text from my blog.

In [39]:
import glob
import re
from bs4 import BeautifulSoup

def get_dan_paragraphs():
    for filename in glob.glob('/Users/dan/Development/dantelore/hugo/public/posts/*/*.html'):
        html = None
        with open(filename) as f:
            html = f.read()
        if html:
            soup = BeautifulSoup(html, features='html.parser')

            # Remove code and preformatted blocks
            for x in soup.findAll(['code', 'pre']):
                x.extract()

            # Remove footers
            for x in soup.findAll('footer'):
                x.extract()

            # Remove header/nav
            for x in soup.findAll('nav'):
                x.extract()

            # Remove post list
            for x in soup.findAll('ul', {'id': 'post-list'}):
                x.extract()

            for p in soup.findAll('p'):    
                text = p.get_text()

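                # Collapse runs of whitespace (including newlines) into a single space
                text = re.sub(r'\s+', ' ', text)

                # Remove non-ASCII chars (this also drops curly apostrophes,
                # which is why "wasnt" and "Heres" appear in the output)
                text = re.sub(r'[^\x00-\x7F]+', '', text)

                # Remove URLs
                text = re.sub(r'http(s?)[^\s]+', '', text)

                # Remove any newlines that survived the whitespace collapse above
                text = text.replace('\n', ' ')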

                yield text


def get_dan_sentences():
    for text in get_dan_paragraphs():
        sentences = [x for x in re.split(r'[\.\?\!;:]+\s+', text) if len(x) > 10]
        for s in sentences:
            yield s.strip()


punctuation = {
    r'\.\s*': ' [full_stop] ',
    r'\,\s*': ' [comma] ',
    r'\!\s*': ' [bang] ',
    r'\?\s*': ' [eh] '
}

def get_dan_chunks():
    for text in get_dan_paragraphs():
        chunk = text
        for f, r in punctuation.items():
            chunk = re.sub(f, r, chunk)
            
        yield chunk

sentences = list(get_dan_chunks())

sentences

['The code for this article is available on my github [comma] here: ',
 'Building on the Live Departures Board project from the other day [comma] I decided to try out mapping some departure data [full_stop] The other article shows pretty much all the back-end code [comma] which wasnt changed much [full_stop] ',
 '',
 'The AngularJS app takes the routes of imminent departures from various stations and displays them on a CartoDB [comma] which is free [comma] unlike Mapbox [full_stop] ',
 '',
 'Heres the code-behind for the Angular app:',
 'Heres a session I did for the Venturis Voice user group [comma] at Trainlines HQ in London [full_stop] I talk about some of the challenges that face teams as they migrate to a cloud-based data architecture [full_stop] ',
 'Decided to spend lunchtime tempting fate round an apple tree [full_stop] Great fun [bang] ',
 '',
 'Need to practice the FPV so I can do this in even riskier areas [bang] ',
 'Once Id got Mono up and running [comma] the first little 
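
One thing worth checking before these chunks reach the tokenizer: as far as I can tell, the Keras `Tokenizer` lower-cases text and replaces every character in its default `filters` string with a space, and that string includes `[`, `]` and `_`. If so, `[full_stop]` would arrive as the two words `full` and `stop` rather than as one token. The sketch below is a pure-Python approximation of that splitting step, not the Keras code itself; `keras_like_split` and `PLACEHOLDER_SAFE_FILTERS` are names made up for illustration.

```python
# Approximation of how the Keras Tokenizer splits text (assumption: the default
# filters are the string below, with lower=True and split=' ').
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Hypothetical alternative: the same filter list without '[', ']' and '_',
# so that bracketed placeholders survive as single tokens.
PLACEHOLDER_SAFE_FILTERS = ''.join(c for c in DEFAULT_FILTERS if c not in '[]_')


def keras_like_split(text, filters=DEFAULT_FILTERS):
    """Lower-case, turn every filtered character into a space, then split."""
    table = str.maketrans({c: ' ' for c in filters})
    return text.lower().translate(table).split()


chunk = 'Great fun [bang] Back soon [full_stop] '
default_tokens = keras_like_split(chunk)
safe_tokens = keras_like_split(chunk, PLACEHOLDER_SAFE_FILTERS)

print(default_tokens)
print(safe_tokens)
```

If the real tokenizer behaves the same way, passing a reduced `filters` string to `Tokenizer(...)` should keep each placeholder as one word.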

# Tokenisation

One approach (`get_dan_sentences` above) extracts full sentences from the input data, then converts them into n-grams for training, with all punctuation removed. This is probably not the best way to do it, as it stops the model from learning what a sentence is.

It is better to include key punctuation marks such as the full stop, comma, hyphen and colon as words/tokens in their own right, which is what `get_dan_chunks` does for a few of them. This way the model can learn to add sentence structure, and there is an easier way to decide when to stop reading from the output: rather than taking a fixed number of words, we could stop after the nth full stop, for example.
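
The stopping rule could look roughly like this. It is a sketch only: `predict_next` is a hypothetical stand-in for the trained model (any callable that maps the words so far to the next word), and it assumes the `[full_stop]` placeholder survives tokenisation as a single token.

```python
from itertools import cycle


def generate_until(seed_words, predict_next, stop_token='[full_stop]',
                   stop_count=2, max_words=50):
    """Append predicted words until stop_token has been seen stop_count times.

    max_words is a safety cap in case the model never emits stop_token.
    """
    words = list(seed_words)
    stops_seen = 0
    for _ in range(max_words):
        next_word = predict_next(words)
        words.append(next_word)
        if next_word == stop_token:
            stops_seen += 1
            if stops_seen >= stop_count:
                break
    return words


# Toy predictor that replays a fixed script, standing in for model.predict
script = cycle(['is', 'old', '[full_stop]', 'it', 'leaks', '[full_stop]', 'still'])
result = generate_until(['the', 'landrover'], lambda words: next(script))
print(' '.join(result))
```

The `max_words` cap matters in practice: a model that has not learned the placeholder well may never produce it.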

In [40]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# tokenizer = Tokenizer(num_words=2500, oov_token='<OOV>')
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
total_words = len(word_index) + 1

max_length = 12

input_sequences = []

tokenizer_json = tokenizer.to_json()
with open('./data/nlp_models/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        # Window of at most max_length tokens ending at (and including) token i
        n_gram_sequence = token_list[max(i + 1 - max_length, 0):i + 1]
        input_sequences.append(n_gram_sequence)

# Left-pad every window to max_length; pad_sequences already returns a NumPy array
input_sequences = pad_sequences(input_sequences, maxlen=max_length, padding='pre')

# Everything except the last token is the input; the last token is the label
xs = input_sequences[:,:-1]
labels = input_sequences[:,-1]
# One-hot encode the output, as the words are categorical
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Wrap in a Pandas Dataframe just for nicer display
pd.set_option('display.min_rows', 10)
pd.DataFrame(input_sequences)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0,0,0,0,0,0,0,0,0,0,1,52
1,0,0,0,0,0,0,0,0,0,1,52,13
2,0,0,0,0,0,0,0,0,1,52,13,14
3,0,0,0,0,0,0,0,1,52,13,14,297
4,0,0,0,0,0,0,1,52,13,14,297,11
...,...,...,...,...,...,...,...,...,...,...,...,...
46053,0,0,0,0,0,0,9,114,57,11,50,5393
46054,0,0,0,0,0,9,114,57,11,50,5393,17
46055,0,0,0,0,9,114,57,11,50,5393,17,1
46056,0,0,0,9,114,57,11,50,5393,17,1,2707
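
Each row above is a left-padded window over one chunk: the last column is the word to predict and the columns before it are the context. Here is a minimal pure-Python sketch of the same windowing, using a handful of token ids and a shorter window so the result fits on screen (`make_windows` is an illustrative name, not part of the notebook):

```python
def make_windows(token_ids, max_length):
    """Return one left-padded row per prefix of token_ids, max_length wide."""
    rows = []
    for i in range(1, len(token_ids)):
        # Keep the most recent max_length tokens ending at position i
        window = token_ids[max(i + 1 - max_length, 0):i + 1]
        # Pad on the left with 0, the index the tokenizer never assigns to a word
        rows.append([0] * (max_length - len(window)) + window)
    return rows


rows = make_windows([1, 52, 13, 14, 297], max_length=4)
for row in rows:
    context, label = row[:-1], row[-1]
    print(context, '->', label)
```

Once the chunk is longer than the window, the oldest tokens simply fall off the left-hand side, as in the last rows of the table.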


# Build the model

In [41]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(total_words, 200, input_length=max_length - 1))
model.add(Bidirectional(LSTM(500)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])




# Do some training

In [42]:
from datetime import datetime
import matplotlib.pyplot as plt
from IPython.display import clear_output

history_df = pd.DataFrame(columns=['loss', 'accuracy'])

run_count = 1
epochs_per_run = 10

for run in range(0, run_count):
    print(f"Starting Run {run}/{run_count}")
    # Use verbose=2 here to prevent progress bars locking up Jupyter after a few hours
    history = model.fit(xs, ys, epochs=epochs_per_run, verbose=2)
    
    dt = datetime.now() 
    model_filename = f"data/nlp_models/model_{dt.year}_{dt.month}_{dt.day}_{dt.hour}_{dt.minute}.h5"
    model.save(model_filename)

    history_df = pd.concat([history_df, pd.DataFrame(history.history)], ignore_index=True)

    clear_output(wait=True)
    
    plt.plot(history_df['accuracy'])
    plt.title('Model Training Progress')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.show()

Starting Run 0/1
Epoch 1/10




1440/1440 - 36s - loss: 6.4333 - accuracy: 0.1183 - 36s/epoch - 25ms/step
Epoch 2/10
1440/1440 - 33s - loss: 5.6354 - accuracy: 0.1527 - 33s/epoch - 23ms/step
Epoch 3/10
1440/1440 - 34s - loss: 4.8720 - accuracy: 0.1805 - 34s/epoch - 23ms/step
Epoch 4/10
1440/1440 - 34s - loss: 4.1909 - accuracy: 0.2214 - 34s/epoch - 23ms/step
Epoch 5/10


# See how well the replacement Dan is working

In [None]:
seed_text = "lathe"
next_words = 20

word_lookup = {v: k for k, v in tokenizer.word_index.items()}

seed_texts = [
    "the landrover",
    "a data strategy is important because",
    "it flies really",
    "the diesel engine pumps out black smoke because",
    "this article is about",
    "Supply for indicators comes from aux relay",
    "so we made one out of a chunk of",
    "If a train is delayed or cancelled",
    "Here we use KSQL to create",
    "Looks like there are four",
    "How much you invest in your data engineering capability",
    "I believe"
]
next_words = 30
max_length = 12

seed_texts = [s.lower() for s in seed_texts]
for seed_text in seed_texts:
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_length-1, padding='pre')
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=1)[0])
        # Index 0 is reserved for padding and has no word, so fall back to an empty string
        output_word = word_lookup.get(predicted, '')
        seed_text += ' ' + output_word
    print(seed_text)
    



lathe the new one over one over one over one afternoon about saving so now is managed directly kafka connect into
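
The repeated `one over one over` in that output is typical of always picking the single most likely word (`np.argmax`). One common alternative is to sample from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below uses only the standard library and a made-up probability vector; `sample_with_temperature` is an illustrative helper, not something the notebook defines.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature.

    temperature < 1 favours likely words, temperature > 1 flattens the
    distribution; very small values approach the argmax behaviour.
    """
    # Work in log space; zero probabilities map to -inf and get zero weight
    logits = [math.log(p) / temperature if p > 0 else float('-inf') for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


# Made-up distribution over a five-word vocabulary (index 0 is padding)
probs = [0.0, 0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
picks = [sample_with_temperature(probs, temperature=0.8, rng=rng) for _ in range(200)]
print(sorted(set(picks)))
```

In the generation loop, the sampled index would take the place of the `np.argmax(...)` result, with `probs` coming from `model.predict(token_list, verbose=0)[0]`.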
