# Natural Language Processing

Started by following this excellent tutorial series from [TensorFlow on YouTube](https://www.youtube.com/playlist?list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S).

Before that, let's get some example text from my blog.

Other sources of data could include:

- https://huggingface.co/datasets/bookcorpus
- https://huggingface.co/datasets/openwebtext
- https://huggingface.co/datasets/wikitext

Or you could just use GPT!

- https://huggingface.co/openai-gpt
- https://huggingface.co/gpt2

In [5]:
import glob
import re
from bs4 import BeautifulSoup

def get_dan_paragraphs():
    for filename in glob.glob('/Users/dan/Development/dantelore/hugo/public/posts/*/*.html'):
        html = None
        with open(filename) as f:
            html = f.read()
        if html:
            soup = BeautifulSoup(html, features='html.parser')

            # Remove code and preformatted blocks
            for x in soup.findAll(['code', 'pre']):
                x.extract()

            # Remove footers
            for x in soup.findAll('footer'):
                x.extract()

            # Remove header/nav
            for x in soup.findAll('nav'):
                x.extract()

            # Remove post list
            for x in soup.findAll('ul', {'id': 'post-list'}):
                x.extract()

            for p in soup.findAll('p'):    
                text = p.get_text()

                # Remove long whitespaces
                text = re.sub(r'\s+', ' ', text)

                # Remove non ASCII chars
                text = re.sub(r'[^\x00-\x7F]+', '', text)

                # Remove urls
                text = re.sub(r'http(s?)[^\s]+', '', text)

                # Remove any remaining newlines (the \s+ substitution above already collapses most of them)
                text = text.replace('\n', ' ')

                yield text


def get_dan_sentences():
    for text in get_dan_paragraphs():
        sentences = [x for x in re.split(r'[\.\?\!;:]+\s+', text) if len(x) > 10]
        for s in sentences:
            yield s.strip()


punctuation = {
    r'\.\s*': " ''fullstop'' ",
    r'\,\s*': " ''comma'' ",
    r'\!\s*': " ''bang'' ",
    r'\?\s*': " ''eh'' ",
    r'$': " ''endpara'' "
}

def get_dan_chunks():
    for text in get_dan_paragraphs():
        chunk = text
        for f, r in punctuation.items():
            chunk = re.sub(f, r, chunk)
            
        yield chunk

sentences = list(get_dan_chunks())

sentences

["The code for this article is available on my github ''comma'' here:  ''endpara'' ",
 "Building on the Live Departures Board project from the other day ''comma'' I decided to try out mapping some departure data ''fullstop'' The other article shows pretty much all the back-end code ''comma'' which wasnt changed much ''fullstop''  ''endpara'' ",
 " ''endpara'' ",
 "The AngularJS app takes the routes of imminent departures from various stations and displays them on a CartoDB ''comma'' which is free ''comma'' unlike Mapbox ''fullstop''  ''endpara'' ",
 " ''endpara'' ",
 "Heres the code-behind for the Angular app: ''endpara'' ",
 "Heres a session I did for the Venturis Voice user group ''comma'' at Trainlines HQ in London ''fullstop'' I talk about some of the challenges that face teams as they migrate to a cloud-based data architecture ''fullstop''  ''endpara'' ",
 "Decided to spend lunchtime tempting fate round an apple tree ''fullstop'' Great fun ''bang''  ''endpara'' ",
 " ''endpara'' "

# Tokenisation

The first approach, `get_dan_sentences`, extracts full sentences from the input data and converts them into n-grams for training, with all punctuation removed. That is probably not the best way to do it, because it stops the model learning what a sentence is.

It is better to treat key punctuation marks (full stop, comma, hyphen, colon and so on) as words/tokens in their own right, which is what `get_dan_chunks` does for full stops, commas, exclamation marks and question marks. This way the model learns to add sentence structure, and there is an easier way to decide when to stop reading from the output: rather than taking a fixed number of words, we could stop after the nth full stop, for example.
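As a rough illustration of the "stop after the nth full stop" idea, here is a minimal sketch that needs only the standard library. The token names match the `punctuation` map in the first cell; the helper names `encode`, `take_sentences` and `decode` are made up for this example and are not used elsewhere in the notebook.

```python
import re

# Ordered (pattern, token) pairs, mirroring the punctuation map above
PUNCTUATION = [
    (r'\.\s*', " ''fullstop'' "),
    (r',\s*', " ''comma'' "),
    (r'!\s*', " ''bang'' "),
    (r'\?\s*', " ''eh'' "),
]

# Reverse mapping from token back to punctuation mark
RESTORE = {
    "''fullstop''": ".",
    "''comma''": ",",
    "''bang''": "!",
    "''eh''": "?",
}

# Tokens that end a sentence
SENTENCE_ENDERS = {"''fullstop''", "''bang''", "''eh''"}


def encode(text):
    """Replace punctuation marks with standalone word-like tokens."""
    for pattern, token in PUNCTUATION:
        text = re.sub(pattern, token, text)
    return text.split()


def take_sentences(tokens, n):
    """Keep tokens up to and including the nth sentence-ending token."""
    seen = 0
    kept = []
    for token in tokens:
        kept.append(token)
        if token in SENTENCE_ENDERS:
            seen += 1
            if seen == n:
                break
    return kept


def decode(tokens):
    """Turn punctuation tokens back into marks attached to the previous word."""
    out = ''
    for token in tokens:
        if token in RESTORE:
            out += RESTORE[token]
        else:
            out += (' ' if out else '') + token
    return out


tokens = encode("Great fun! It flies well, mostly. Then it crashed.")
print(decode(take_sentences(tokens, 2)))
```

The same `take_sentences` check could be applied to generated words one at a time, in place of a fixed word count.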

In [6]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import json
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# tokenizer = Tokenizer(num_words=2500, oov_token='<OOV>')
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
total_words = len(word_index) + 1

max_length = 12

input_sequences = []

tokenizer_json = tokenizer.to_json()
with open('./data/nlp_models/tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

for line in sentences:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        # Window of at most max_length tokens, ending at (and including) token i
        n_gram_sequence = token_list[max(i + 1 - max_length, 0):i + 1]
        input_sequences.append(n_gram_sequence)

input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_length, padding='pre'))

xs = input_sequences[:,:-1]
labels = input_sequences[:,-1]
# One-hot encode the output, as the words are categorical
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

# Wrap in a Pandas Dataframe just for nicer display
pd.set_option('display.min_rows', 10)
pd.DataFrame(input_sequences)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0,0,0,0,0,0,0,0,0,0,1,52
1,0,0,0,0,0,0,0,0,0,1,52,13
2,0,0,0,0,0,0,0,0,1,52,13,14
3,0,0,0,0,0,0,0,1,52,13,14,297
4,0,0,0,0,0,0,1,52,13,14,297,11
...,...,...,...,...,...,...,...,...,...,...,...,...
45029,0,0,0,0,0,9,114,57,11,50,5397,17
45030,0,0,0,0,9,114,57,11,50,5397,17,1
45031,0,0,0,9,114,57,11,50,5397,17,1,2710
45032,0,0,9,114,57,11,50,5397,17,1,2710,21
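Each row above is a left-padded window of token ids: the last column is the word to predict and the columns before it are the context. Here is a minimal pure-Python sketch of that windowing, without Keras. The function name `make_training_rows` is made up for this example, and the ids are the first few from the table.

```python
def make_training_rows(token_ids, max_length):
    """Build left-padded rows whose last column is the word to predict."""
    rows = []
    for i in range(1, len(token_ids)):
        # Keep at most max_length tokens, ending at position i
        window = token_ids[max(i + 1 - max_length, 0):i + 1]
        # Left-pad with 0, the index the Keras Tokenizer never assigns to a word
        rows.append([0] * (max_length - len(window)) + window)
    return rows


rows = make_training_rows([1, 52, 13, 14, 297], max_length=6)
for row in rows:
    # Context (the xs) on the left, target word id (the label) on the right
    print(row[:-1], '->', row[-1])
```

Splitting each row into `row[:-1]` and `row[-1]` is the same split as `xs` and `labels` in the cell above.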


# Build the model

In [7]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_length - 1))
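# Note: the summary output below shows a (None, 1000) output for this layer, which
# corresponds to 500 units per direction rather than the 250 used here
model.add(Bidirectional(LSTM(250)))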
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 11, 100)           539800    
                                                                 
 bidirectional_1 (Bidirectio  (None, 1000)             2404000   
 nal)                                                            
                                                                 
 dense_1 (Dense)             (None, 5398)              5403398   
                                                                 
Total params: 8,347,198
Trainable params: 8,347,198
Non-trainable params: 0
_________________________________________________________________
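The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM (four gates, with bias) and dense layers; the helper names are made up for this example. The vocabulary size of 5398 is read from the Dense layer's output shape. The printed figures match 500 LSTM units per direction, so the summary was presumably captured from a run with `LSTM(500)`, not the `LSTM(250)` in the cell as shown.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output
    return inputs * outputs + outputs


vocab_size, embed_dim = 5398, 100

for units in (250, 500):
    embedding = embedding_params(vocab_size, embed_dim)
    # A bidirectional wrapper holds two independent LSTMs
    bidirectional = 2 * lstm_params(embed_dim, units)
    # The dense layer sees the forward and backward outputs concatenated
    dense = dense_params(2 * units, vocab_size)
    print(units, embedding, bidirectional, dense, embedding + bidirectional + dense)
```

With 500 units the total comes to 8,347,198, matching the summary. With 250 units it would be 3,946,198.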




# Do some training

In [8]:
from datetime import datetime
import matplotlib.pyplot as plt
from IPython.display import clear_output

history_df = pd.DataFrame(columns=['loss', 'accuracy'])

run_count = 20
epochs_per_run = 100

for run in range(0, run_count):
    print(f"Starting Run {run}/{run_count}")
    # Use verbose=2 here to prevent progress bars locking up Jupyter after a few hours
    history = model.fit(xs, ys, epochs=epochs_per_run, verbose=2)
    
    dt = datetime.now() 
    model_filename = f"data/nlp_models/model_{dt.year}_{dt.month}_{dt.day}_{dt.hour}_{dt.minute}.h5"
    model.save(model_filename)

    history_df = pd.concat([history_df, pd.DataFrame(history.history)], ignore_index=True)

    clear_output(wait=True)
    
    plt.plot(history_df['accuracy'])
    plt.title(f'Model Training Progress - Run {run} of {run_count}')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.show()

Starting Run 0/20
Epoch 1/100


2023-01-19 20:43:30.146267: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-01-19 20:43:31.440401: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


# See how well the replacement Dan is working

In [None]:
# Map token ids back to words
word_lookup = {v: k for k, v in tokenizer.word_index.items()}

seed_texts = [
    "a data strategy is important because",
    "it flies really",
    "the diesel engine pumps out black smoke because",
    "this article is about",
    "supply for indicators comes from aux relay",
    "so we made one out of a chunk of",
    "if a train is delayed or cancelled",
    "here we use KSQL to create",
    "looks like there are four",
    "how much you invest in your data engineering capability",
    "i believe"
]
next_words = 30
max_length = 12

reverse_punctuation = {
    " ''fullstop''": ".",
    " ''comma''": ",",
    " ''bang''": "!",
    " ''eh''": "?",
    " ''endpara''": ""
}

seed_texts = [s.lower() for s in seed_texts]
for seed_text in seed_texts:
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_length-1, padding='pre')
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=1)[0]
        output_word = word_lookup[predicted]
        seed_text += ' ' + output_word

        if output_word == "''endpara''":
            break

    for f, r in reverse_punctuation.items():
        seed_text = seed_text.replace(f, r)
    print(seed_text)
    

a data strategy is important because i went with the dog end to the lake as id. maybe im using ms4w, there year, this turns out properly either. im not
it flies really hard as well as long as possible. it was very easy as well as well as well as well as much more about back, process as the
the diesel engine pumps out black smoke because the hardest should think its difficult more pretty good tech, child resistant finish, just need to write a bit of the data next etl gets id ep9639 have
this article is about angularjs app with the basics sorting out towards embeddings into the same last weekend before, pliers, its a sad here. started snipping and has ci algorithms
supply for indicators comes from aux relay to be a lot of the data is a pay teams of kafka, the female strangers. the next thing i decided they also out the ability to
so we made one out of a chunk of the data, its pushed to the admin ui to say that os will produce out to cluster api resource as much as set of a single acting,