# Cleaning the crawled texts
## Converting JSON to plain text
First, the JSON output of the spider is converted into plain text. This is done by splitting the file contents into separate lines using `split()` and then combining them again using `join()`.

In [2]:
path_to_json = "webcrawler/biology.json" # where the webcrawler output lives
with open(path_to_json, 'r') as fr:
    pre_ = fr.read() # read JSON file
    lines = pre_.split('\n') # split text into separate lines
    new_filename = path_to_json.split('.')[0]+".txt" # keep the same name except for the extension
    with open(new_filename, "w") as fw: # "w" overwrites, so re-running the cell does not duplicate the text
        fw.write("\n".join(lines)) # join lines together

A plain text file of the same filename is saved in the directory of the JSON file.
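
Note that the cell above copies the file contents verbatim. As an alternative sketch, the standard-library `json` module could parse the file and extract the text fields directly, which would also decode escape sequences such as `\u00f6`. The field names `title` and `contents` are taken from the crawler output shown in the next section; the helper name `json_to_plaintext` and the sample string are hypothetical.

```python
import json

def json_to_plaintext(json_string):
    """Join the title and contents of every crawled item into plain text."""
    # Assumed shape: a list of {"title": str, "contents": [str, ...]} objects
    items = json.loads(json_string)
    lines = []
    for item in items:
        lines.append(item["title"])
        lines.extend(item["contents"])
    return "\n".join(lines)

# Hypothetical sample in the same shape as the spider output
sample = '[{"title": "Entwicklungsbiologie", "contents": ["<p>geh\\u00f6rt</p>"]}]'
plain = json_to_plaintext(sample)
print(plain)
```

With this approach the umlaut would arrive as `ö` right away, at the cost of depending on the exact structure of the spider output.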

## Cleaning the plain text for model building

The steps described here are based on the following tutorials:
* [Text Cleaning for NLP: A Tutorial](https://monkeylearn.com/blog/text-cleaning/)
* [Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – I](https://linux-blog.anracom.com/2021/09/04/pandas-dataframe-german-vocabulary-select-words-by-matching-a-few-3-char-grams-i/)

### Step 1: Text Normalization

Text normalization aims to make the text at hand easier for the computer to process. For instance, we commonly use capitalization and special characters, which might interfere with model building.

If the text is not normalized, our machine would treat "Hello" and "hello" as different words, although the difference does not really matter. On the other hand, especially in German, which we will be dealing with here, missing capitalization might interfere with our understanding of the text. For example, the German noun "das Schreiben" means a particular document, whereas the lowercase verb "schreiben" translates to "to write". Output written completely in lowercase letters would need extensive additional editing.

However, in this iteration texts will be normalized to lowercase to improve model building.
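
As a small illustration of what lowercasing does to German words, the following sketch compares `str.lower()`, which the notebook uses, with `str.casefold()`, which is mentioned here only as an alternative and is not part of the original workflow. The word list is made up for demonstration.

```python
words = ["Hello", "Schreiben", "STRASSE", "Straße"]

for word in words:
    # lower() keeps "ß"; casefold() additionally maps "ß" to "ss"
    print(word, "->", word.lower(), "/", word.casefold())

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]
```

Since the later steps keep *ß* as its own character, `lower()` is the variant that matches the rest of this notebook.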

In [11]:
path_to_rawtext = "webcrawler/biology.txt"
rawtext = open(path_to_rawtext, "r").read()

lowercase_text =  rawtext.lower()
print(lowercase_text[:500])

[
{"title": "entwicklungsbiologie", "contents": ["<div id=\"api-content\">\n                        <div><div></div></div><div><p>findest du es nicht auch immer wieder aufs neue faszinierend, wie aus einer<span> </span><a data-course-subject-id=\"3012649\" data-summary-id=\"21827141\" href=\"/schule/biologie/entwicklungsbiologie/eizelle/\">eizelle</a><span> </span>und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess geh\u00f6rt wohl zu den gr\u00f6\u00dften wundern de


### Step 2: Removing unwanted characters

As you can see from the output above, the crawled text contains HTML tags. We do not want those to interfere with our model building. Therefore, we will now remove the tags along with all characters other than digits, ASCII letters, whitespace and basic sentence punctuation.

In addition, we cannot expect our machine to use punctuation and commas correctly; they just appear too rarely to be interpreted in a useful way. We could also remove all punctuation, but I feel this would be too much. Therefore, we will remove commas and similar characters but keep periods, exclamation marks and question marks.

In [12]:
import re

nonunicode_text = re.sub(r"\\n|<.+?>|(@\[A-Za-z0-9]+)|([^0-9A-Za-z.!? \t])|(\w+:\/\/\S+)|^rt|http.+?|contents|title", "", lowercase_text)
print(nonunicode_text[:500])


 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess gehu00f6rt wohl zu den gru00f6u00dften wundern der natur. diese beeindruckenden vorgu00e4nge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von organismen vom stadium der zygote  befruchtete eizelle bis hin zum erwachsenen lebewe
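
To make the effect of the substitution easier to follow, here is a minimal sketch on a made-up sample string. The pattern is a simplified version of the one used above (it only covers HTML tags and disallowed characters); the names `CLEAN_PATTERN` and `clean` are hypothetical.

```python
import re

# Simplified pattern: drop HTML tags, then every character that is not
# a digit, a lowercase ASCII letter, ".", "!", "?", a space or a tab.
CLEAN_PATTERN = re.compile(r"<.+?>|[^0-9a-z.!? \t]")

def clean(raw):
    """Lowercase the text and remove tags and disallowed characters."""
    return CLEAN_PATTERN.sub("", raw.lower())

sample = "<p>Hallo, Welt! Wie geht's?</p>"
cleaned = clean(sample)
print(cleaned)
```

The comma and the apostrophe disappear together with the tags, while the sentence-ending punctuation survives.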


### Step 3: Replacing hex representations of German *umlaute* with the correct characters

As you can clearly see from the output, we have an issue here. It stems from some special characters in the German language: the *umlaute*. *Umlaute* are the characters *ä, ö, and ü*. In addition, the German language also has the letter *ß*.

Our crawler returned Unicode escape sequences such as `\u00f6` instead of the actual letters, and the previous step stripped the backslash, which leaves fragments like `u00f6` in the text.

An example:
The word `gru00f6u00dften` should actually be `größten`.

So, we need to replace those hex codes with the correct letters. We can choose either the original *umlaute* or their equally valid representations *ae, oe and ue*. For *ß* we can use *ss*. Here is a list of the hex codes and their corresponding characters:
* u00e4 --> *ae* or *ä*
* u00f6 --> *oe* or *ö*
* u00fc --> *ue* or *ü*
* u00df --> *ss* or *ß*

For now, we will try to use their actual characters.

In [32]:
# this section needs streamlining. It is not elegant at all.
noae_text = nonunicode_text.replace('u00e4','ä')
nooe_text = noae_text.replace('u00f6','ö')
noue_text = nooe_text.replace('u00fc','ü')
text = noue_text.replace('u00df', 'ß')

print(text[:500])
textfile = open('clean_text.txt', 'w')
textfile.write(text)
textfile.close()

 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess gehört wohl zu den größten wundern der natur. diese beeindruckenden vorgänge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von organismen vom stadium der zygote  befruchtete eizelle bis hin zum erwachsenen lebewesen untersucht. 
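
The chained `replace()` calls above could be streamlined with a mapping and a loop. This is only a sketch under the assumption that the four codes listed above are the only ones that occur; `UMLAUT_MAP` and `restore_umlauts` are hypothetical names.

```python
# Hex fragments left in the text and the characters they stand for
UMLAUT_MAP = {
    "u00e4": "ä",
    "u00f6": "ö",
    "u00fc": "ü",
    "u00df": "ß",
}

def restore_umlauts(s):
    """Replace every hex fragment in s with its actual character."""
    for code, char in UMLAUT_MAP.items():
        s = s.replace(code, char)
    return s

restored = restore_umlauts("gru00f6u00dften vorgu00e4nge")
print(restored)
```

Switching to the *ae, oe, ue, ss* spelling would then only require changing the values of the mapping.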


# Model Building and Training

## Sources

The following section is based on these tutorials:
* ["Create your First Text Generator with LSTM in few minutes"](https://pub.towardsai.net/create-your-first-text-generator-with-lstm-in-few-minutes-3b59ee139ca0)
* ["NLP using RNN - Can you be the next Shakespeare?"](https://medium.com/analytics-vidhya/nlp-using-rnn-can-you-be-the-next-shakespeare-27abf9af523)
* ["How to Build a Text generator using TensorFlow 2 and Keras in Python"](https://www.thepythoncode.com/article/text-generation-keras-python)

## Prerequisites

Before we start, we need to install the required packages.

In [6]:
%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install tensorflow

Collecting numpy
  Downloading numpy-1.23.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing collected packages: numpy
Successfully installed numpy-1.23.1
Note: you may need to restart the kernel to use updated packages.
Collecting pandas
  Downloading pandas-1.4.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Installing collected packages: pandas
Successfully installed pandas-1.4.3
Note: you may need to restart the kernel to use updated packages.
Collecting matplotlib
  Downloading matplotlib-3.5.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.3 MB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)

Now we load the required libraries.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import os
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding
from tensorflow.keras.losses import sparse_categorical_crossentropy
from string import punctuation

## Step 1: Analyzing some text stats
Before we start going into the depth of neural networks, we'll have a look at the text at hand.

We will check for unique characters - to see if there is something left to be cleaned - and how many characters there are in total.

In [14]:
# print some stats
n_chars = len(text)
vocab = sorted(set(text))
vocab[:10]
print("unique_chars:", vocab)
n_unique_chars = len(vocab)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)

unique_chars: [' ', '!', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ß', 'ä', 'ö', 'ü']
Number of characters: 24869155
Number of unique characters: 44


Looks good so far. There are no weird characters present and we have a good amount of characters to start with.
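
The visual check for leftover characters could also be automated. The sketch below counts every character that is not in an expected set; the set itself is an assumption derived from the cleaning steps above, and `ALLOWED`, `unexpected_chars` and the sample string are hypothetical.

```python
import string
from collections import Counter

# Characters we expect after cleaning (assumption based on the steps above)
ALLOWED = set(string.ascii_lowercase + string.digits + " .!?\täöüß")

def unexpected_chars(s):
    """Return a Counter of the characters in s that are not in ALLOWED."""
    return Counter(c for c in s if c not in ALLOWED)

sample_text = "größten wundern der natur. dna & rna"
leftovers = unexpected_chars(sample_text)
print(leftovers)
```

An empty counter for the real `text` would confirm that nothing is left to clean.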

## Step 2: Vectorize the Strings

Our neural network cannot operate on strings. It needs a vectorized representation of the text. Therefore, we will create two lookup tables: a dictionary mapping each character to an integer and an array mapping each integer back to its character.

In [15]:
# dictionary that converts characters to integers
char2int = {c: i for i, c in enumerate(vocab)}
# array that converts integers to characters (index = integer)
int2char = np.array(vocab)

These lookup tables can be saved using `pickle.dump()`.

In [7]:
# save these dictionaries for later generation
BASENAME = 'elearning_textgen'
pickle.dump(char2int, open(f"{BASENAME}-char2int.pickle", "wb"))
pickle.dump(int2char, open(f"{BASENAME}-int2char.pickle", "wb"))
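
For later text generation the saved mappings have to be read back. A minimal round-trip sketch, using a made-up miniature mapping and a temporary directory instead of the real files, could look like this; `with` blocks make sure the file handles are closed.

```python
import os
import pickle
import tempfile

char2int_demo = {" ": 0, "a": 1, "b": 2}  # hypothetical miniature mapping

# Write to a temporary location so the real pickle files are not touched
path = os.path.join(tempfile.mkdtemp(), "demo-char2int.pickle")

with open(path, "wb") as f:
    pickle.dump(char2int_demo, f)

with open(path, "rb") as f:
    restored_mapping = pickle.load(f)

print(restored_mapping == char2int_demo)
```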

We now need to encode the text. We are using the dictionaries we've just created and convert each character into its corresponding integer.

In [33]:
# convert all text into integers
encoded_text = np.array([char2int[c] for c in text])
print(encoded_text[:20])

[ 0 18 27 33 36 22 16 24 25 34 27 20 32 15 22 28 25 28 20 22]
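
The encoding is lossless: decoding the integers with the reverse lookup gives back the original string. The following sketch shows this on a short made-up sample, using plain Python lists instead of NumPy arrays; all `*_demo` names are hypothetical.

```python
sample = "der natur"

vocab_demo = sorted(set(sample))
char2int_demo = {c: i for i, c in enumerate(vocab_demo)}
int2char_demo = vocab_demo  # a list works as an index -> character lookup

# Encode every character, then decode the integers again
encoded = [char2int_demo[c] for c in sample]
decoded = "".join(int2char_demo[i] for i in encoded)

print(encoded)
print(decoded)
```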



This encoded text will now be used to create a `tf.data.Dataset` object which allows us to scale our code for larger datasets. For this we use the `tf.data` API.

In [17]:
# construct tf.data.Dataset object
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

# print the first 500 characters as one string
print(''.join(int2char[i.numpy()] for i in char_dataset.take(500)))

 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess gehört wohl zu den größten wundern der natur. diese beeindruckenden vorgänge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von organismen vom stadium der zygote  befruchtete eizelle bis hin zum erwachsenen lebewesen untersucht. 


## Step 3: Building Sequences

Our model is supposed to predict the next character based on a sequence of preceding characters. During training, it iterates over the input text and learns, for each position, a probability for every character in the vocabulary to come next.

We can now choose how long this preceding sequence should be. We need to balance between having too little information about textual patterns and taking too long to train. Short sequences, let's say of length 1, would offer no real insight: if the model has to predict which character follows "b", a single character of context does not help much. On the other hand, long sequences slow down the training and increase the risk of overfitting.

Therefore, we choose a length of 180 characters. With one sentence having 75 to 100 characters on average, this covers roughly two sentences, which seems like a good textual unit.

In [22]:
# build sequences by batching
sequence_length = 180
total_num_seq = len(text)//(sequence_length+1)
sequences = char_dataset.batch(sequence_length+1, drop_remainder=True)

# print sequences
for sequence in sequences.take(2):
    print(''.join([int2char[i] for i in sequence.numpy()]))

 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszin
ierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser


Each sequence contains `sequence_length + 1` characters, so that an input and a target of 180 characters each, offset by one character, can be derived from it later. The method `batch` converts the stream of individual characters into sequences of that size. We set `drop_remainder=True`, which causes the last sequence to be dropped if it has fewer than `sequence_length + 1` elements.

In the output above we can see the first two sequences of our dataset.
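
What batching with a dropped remainder does can be sketched in plain Python on a short string. This only illustrates the idea and is not how `tf.data` works internally; the function name `chunk` is hypothetical.

```python
def chunk(text, sequence_length):
    """Split text into pieces of sequence_length + 1 characters.

    A trailing piece that is shorter than that is dropped.
    """
    size = sequence_length + 1
    n_full = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_full)]

chunks = chunk("abcdefghij", sequence_length=3)
print(chunks)
```

With ten characters and a piece size of four, two full pieces remain and the last two characters are discarded.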

Now, we will take each sequence and define the input as all but its last character and the target as the input shifted by one character. Both will then be grouped as a tuple.

In [23]:
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt
    
dataset = sequences.map(create_seq_targets)

for input_txt, target_txt in  dataset.take(1):
    print(input_txt.numpy())
    print(''.join(int2char[input_txt.numpy()]))
    print('\n')
    print(target_txt.numpy())
    # There is an extra whitespace!
    print(''.join(int2char[target_txt.numpy()]))

Here we see what our tuple looks like. The upper text is our input, the lower text is the target which is shifted forward by one character.
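
The same shift can be shown on an ordinary string, which makes the pairing of input and target characters easy to see. The sample word and the name `split_input_target` are chosen for illustration only.

```python
def split_input_target(seq):
    """Return (input, target), where target is input shifted by one."""
    return seq[:-1], seq[1:]

inp, tgt = split_input_target("natur")

# At every position the model sees inp[i] and should predict tgt[i]
for current_char, next_char in zip(inp, tgt):
    print(repr(current_char), "->", repr(next_char))
```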

## Step 4: Generating Batches

In the section above we built the training sequences. To train efficiently, the model processes several sequences at once, so we will group them into batches. For that purpose, we define `BATCH_SIZE`.

In addition, we will shuffle the sequences before batching them so that the model does not overfit to certain text sections.

In [25]:
# Batch size
BATCH_SIZE = 128
BUFFER_SIZE = 10000

# shuffle and batch the dataset
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
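
As a rough plausibility check, the number of sequences and batches per epoch can be estimated from the figures reported earlier. This sketch assumes the character count printed in Step 1 and the settings used above; because both batching steps drop their remainder, integer division applies twice.

```python
n_chars = 24869155       # total number of characters reported in Step 1
sequence_length = 180
BATCH_SIZE = 128

# Each sequence consumes sequence_length + 1 characters
n_sequences = n_chars // (sequence_length + 1)

# Each batch consumes BATCH_SIZE sequences
steps_per_epoch = n_sequences // BATCH_SIZE

print(n_sequences, steps_per_epoch)
```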

## Step 5: Model building

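The model we're using is based on what is called *Long Short-Term Memory*, or LSTM. So, what is this? An LSTM is a type of recurrent neural network whose cells use gates to decide which information to keep, update or forget, which lets it retain context over longer sequences than a plain recurrent network.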

In [26]:
def sparse_cat_loss(y_true,y_pred):
  return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
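
To give an intuition for this loss, here is a plain-Python sketch of sparse categorical cross-entropy for a single prediction, computed from raw logits. It is meant as an illustration of the formula, not as a replacement for the Keras function; the function name is hypothetical.

```python
import math

def sparse_crossentropy_from_logits(true_id, logits):
    """Negative log-probability of the true class under softmax(logits)."""
    # Subtract the maximum for numerical stability (log-sum-exp trick)
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[true_id]

uniform = sparse_crossentropy_from_logits(0, [0.0, 0.0])    # model is undecided
confident = sparse_crossentropy_from_logits(0, [5.0, 0.0])  # model favours the true class
print(uniform, confident)
```

The loss shrinks as the model assigns a higher score to the character that actually comes next.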

In [27]:
def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim,batch_input_shape=[batch_size, None]))
    model.add(LSTM(rnn_neurons,return_sequences=True,stateful=True,recurrent_initializer='glorot_uniform'))
    # Final Dense Layer to Predict
    model.add(Dense(vocab_size))
    model.compile(optimizer='adam', loss=sparse_cat_loss) 
    return model

In [28]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embed_dim = 64
# Number of RNN units
rnn_neurons = 1026

In [29]:
#Create the model
model = create_model(
  vocab_size = vocab_size,
  embed_dim = embed_dim,
  rnn_neurons = rnn_neurons,
  batch_size = BATCH_SIZE)
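
To get a feeling for the size of this model, the number of trainable parameters can be estimated by hand. The sketch assumes the standard LSTM layout with four gates, each having input weights, recurrent weights and a bias, and uses the vocabulary size of 44 reported earlier; `model.summary()` would give the authoritative numbers.

```python
vocab_size, embed_dim, rnn_neurons = 44, 64, 1026

# Embedding: one vector of length embed_dim per character
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (rnn_neurons * (embed_dim + rnn_neurons) + rnn_neurons)

# Dense: weights from every LSTM unit to every character, plus biases
dense_params = rnn_neurons * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all parameters sit in the LSTM layer, so `rnn_neurons` is the setting that dominates model size and training time.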

In [18]:
#Train the model
epochs = 30
model.fit(dataset,epochs=epochs)
from tensorflow.keras.models import load_model
model.save('elearning_gen.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f577af69550>

In [30]:
# Currently our model only expects 128 sequences at a time. For generation we
# create a new model that expects batch_size=1, load the saved weights into it
# and then call .build() on the model.

model = create_model(vocab_size, embed_dim, rnn_neurons, batch_size=1)
model.load_weights('elearning_gen.h5')
model.build(tf.TensorShape([1, None]))


def generate_text(model, start_seed,gen_size=100,temp=1.0):
  '''
  model: Trained model to generate text
  start_seed: Initial seed text in string form
  gen_size: Number of characters to generate
  temp: Sampling temperature (see the comments below)
  Basic idea behind this function is to take in some seed text, format it so
  that it is in the correct shape for our network, then loop the sequence as
  we keep adding our own predicted characters. Similar to our work in the RNN
  time series problems.
  '''
  # Number of characters to generate
  num_generate = gen_size
  # Vectorizing starting seed text
  input_eval = [char2int[s] for s in start_seed]
  # Expand to match batch format shape
  input_eval = tf.expand_dims(input_eval, 0)
  # Empty list to hold resulting generated text
  text_generated = []
  # Temperature affects randomness in our resulting text.
  # The term is derived from entropy/thermodynamics.
  # The logits are divided by the temperature before sampling.
  # Lower temperature == less surprising / more expected
  # Higher temperature == more surprising / less expected
  temperature = temp
  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      # Generate predictions
      predictions = model(input_eval)
      # Remove the batch shape dimension
      predictions = tf.squeeze(predictions, 0)
      # Use a categorical distribution to select the next character
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      # Pass the predicted character as the next input
      input_eval = tf.expand_dims([predicted_id], 0)
      # Transform back to character letter
      text_generated.append(int2char[predicted_id])
  return (start_seed + ''.join(text_generated))


print(generate_text(model,"biologie",gen_size=1000))

biologie  hoch größere komponenten. hat ein mangel an stackshleben schwachster begliedert als die der fundamentalnivation des genetischen material wird in dem pantoffeltierchen genannt.  im normierte sie wuchsel sind zeigen dir wie das therapeutische klonen ausnahmslosen infektionen und pflanzen bzw. den umweltfaktor löst nur noch zur bildauch gestein in der sansition der stechmückerregar denn sie unterscheiden sich dabei jedoch bspw. nahrungsketten die lichtenergie zur abwehr von dna resornäher ausbrochen ist.genauer gesagt wird ucht man zählt einen plasmidvielfachen chromosomen besitzen enthaltenen seiten blutzellen gehen die entsprechende huan auf unseren geispiel zur arms von wildt dem einem fortschritt außerhalb des protoplasten und die dafür bestandliche abschnitte einer ribose der dna und rna unterscheiden sich wasser bellen sondern mehrere lebewesen zeigt sich dass ein fehlende kontinuierliche erreger oder die sterberate sind gut im vergleich zu jahren. allerdings gingst du das
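
The effect of the `temp` parameter in `generate_text` can be illustrated without a trained model. The sketch below applies a softmax to made-up logits at different temperatures; the function name and the logit values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # unchanged logits
hot = softmax_with_temperature(logits, 2.0)      # flatter distribution

for name, probs in [("0.5", cold), ("1.0", neutral), ("2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

A low temperature concentrates the probability on the most likely character, which gives more predictable text, while a high temperature spreads it out and makes the output more varied.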