# Cleaning the crawled texts
## Converting JSON to plain text
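First, the JSON output of the spider is converted into a plain-text file. The file is read as raw text, split into separate lines using `split()`, and the lines are then recombined using `join()`. Note that the JSON is not parsed at this point, so its syntax (brackets, keys, escape sequences) is carried over into the text file and has to be cleaned up in the following steps.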

In [2]:
path_to_json = "webcrawler/biology.json" # Where webcrawler output lives
with open(path_to_json, 'r') as fr:
    pre_ = fr.read() # read JSON file
    lines = pre_.split('\n') # split text into separate lines
    new_filename = path_to_json.split('.')[0]+".txt" # keep the same name, change only the extension
    with open(new_filename, "w") as fw: # "w" overwrites, so re-running the cell does not append a second copy
        fw.write("\n".join(lines)) # join lines together

A plain text file of the same filename is saved in the directory of the JSON file.
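
As an aside, the JSON could also be parsed instead of being copied as raw text. The following is a minimal sketch, not the approach used in this notebook: it assumes the spider output is a list of objects with a `title` string and a `contents` list of strings, as the printed output further down suggests. The sample string `raw_json` is made up for illustration. Parsing with `json.loads()` decodes escape sequences such as `\u00f6` right away.

```python
import json

# Hypothetical miniature of the spider output; the real file is assumed
# to have the same shape (a list of {"title": ..., "contents": [...]} objects).
raw_json = (
    '[{"title": "Entwicklungsbiologie", '
    '"contents": ["<p>Dieser Prozess geh\\u00f6rt zu den gr\\u00f6\\u00dften Wundern.</p>"]}]'
)

records = json.loads(raw_json)  # parsing decodes the \uXXXX escapes

lines = []
for record in records:
    lines.append(record["title"])     # one line for the title
    lines.extend(record["contents"])  # one line per content fragment

plain_text = "\n".join(lines)
print(plain_text)
```

With this variant the JSON keys and brackets never reach the text file, and the umlauts arrive as real characters. The HTML tags are still present and would need the cleaning described below.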

## Cleaning the plain text for model building

The steps described here are based on the following tutorials:
* [Text Cleaning for NLP: A Tutorial](https://monkeylearn.com/blog/text-cleaning/)
* [Pandas dataframe, German vocabulary – select words by matching a few 3-char-grams – I](https://linux-blog.anracom.com/2021/09/04/pandas-dataframe-german-vocabulary-select-words-by-matching-a-few-3-char-grams-i/)

### Step 1: Text Normalization

Text normalization aims at easing the computer's understanding of the text at hand. For instance, we commonly use capitalization and special characters, which might interfere with model building.

If the text is not normalized, our machine would interpret "Hello" differently from "hello", although the distinction rarely matters. On the other hand, capitalization carries meaning, especially in German, the language we are dealing with here. For example, the noun "das Schreiben" means a particular document, whereas the lowercase verb "schreiben" translates to "to write". Output written entirely in lowercase letters would therefore need extensive additional editing.

However, in this iteration texts will be normalized to lowercase to improve model building.

In [8]:
path_to_rawtext = "webcrawler/biology.txt"
with open(path_to_rawtext, "r") as f: # the with-block closes the file after reading
    rawtext = f.read()

lowercase_text = rawtext.lower()
print(lowercase_text[:500])

[
{"title": "entwicklungsbiologie", "contents": ["<div id=\"api-content\">\n                        <div><div></div></div><div><p>findest du es nicht auch immer wieder aufs neue faszinierend, wie aus einer<span> </span><a data-course-subject-id=\"3012649\" data-summary-id=\"21827141\" href=\"/schule/biologie/entwicklungsbiologie/eizelle/\">eizelle</a><span> </span>und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess geh\u00f6rt wohl zu den gr\u00f6\u00dften wundern de
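
The trade-off of lowercasing can be seen on a few words. This is a small illustrative sketch; the word list is made up, and `str.casefold()` is shown only for comparison, since the notebook itself uses `str.lower()`.

```python
# Hypothetical sample words; "Das Schreiben" (noun) and "schreiben" (verb)
# become indistinguishable once the text is lowercased.
words = ["Das Schreiben", "schreiben", "Straße"]

lowered = [word.lower() for word in words]
folded = [word.casefold() for word in words]

for word, low, fold in zip(words, lowered, folded):
    print(f"{word!r}: lower={low!r}, casefold={fold!r}")
```

`lower()` keeps the *ß*, whereas `casefold()` replaces it with *ss*, which matters if the original letter should survive in the vocabulary.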


### Step 2: Removing unwanted characters

As you can see from the output above, the crawled text contains HTML tags and JSON syntax. We do not want those to interfere with our model building. Therefore, we will now remove the tags together with every character that is not a plain letter, a digit, a space, a tab, or one of `.`, `!`, `?`.

In addition, we cannot expect our machine to use punctuation and commas correctly; they appear too rarely to be interpreted in a useful way. We could remove all punctuation, but I feel this would be too much. Therefore, the sentence-ending marks `.`, `!` and `?` are kept, while commas and all other punctuation are removed.

In [23]:
import re

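# Removes, in this order of precedence: escaped newlines, HTML tags, @mentions,
# any character outside [0-9A-Za-z.!? \t], URLs, and the words "contents" and "title" (the JSON keys).
# The @mention pattern was written as @\[A-Za-z0-9]+, where the escaped bracket kept it from matching names.
nonunicode_text = re.sub(r"\\n|<.+?>|(@[A-Za-z0-9]+)|([^0-9A-Za-z.!? \t])|(\w+:\/\/\S+)|^rt|http.+?|contents|title", "", lowercase_text)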
print(nonunicode_text[:500])


 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess gehu00f6rt wohl zu den gru00f6u00dften wundern der natur. diese beeindruckenden vorgu00e4nge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von organismen vom stadium der zygote  befruchtete eizelle bis hin zum erwachsenen lebewe
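
To make the effect of the individual parts of the pattern easier to follow, here is a minimal sketch that applies three of them one after another to a made-up fragment. The names `clean_fragment`, `sample` and the three pattern constants are hypothetical and not part of the notebook; the patterns themselves are taken from the expression above.

```python
import re

ESCAPED_NEWLINE = re.compile(r"\\n")        # a literal backslash followed by "n"
HTML_TAG = re.compile(r"<.+?>")             # shortest possible "<...>" span
DISALLOWED = re.compile(r"[^0-9a-z.!? \t]") # anything except the kept characters

def clean_fragment(fragment):
    """Apply the three cleaning steps in a fixed order."""
    fragment = ESCAPED_NEWLINE.sub("", fragment)
    fragment = HTML_TAG.sub("", fragment)
    return DISALLOWED.sub("", fragment)

# Hypothetical lowercase fragment in the style of the crawled output.
sample = r'<p>findest du es, wie aus einer <a href=\"/x/\">eizelle</a> ein mensch wird?\n</p>'
cleaned = clean_fragment(sample)
print(cleaned)
```

Splitting the expression like this makes it possible to check each step on its own, at the cost of several passes over the text.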


### Step 3: Replacing hex representations of German *umlaute* with the correct characters

As you can clearly see from the output, we have an issue here. It stems from special characters present in the German language: the *umlaute*, i.e. the characters *ä*, *ö*, and *ü*. In addition, the German language also has the letter *ß*.

Our crawler returned these letters as Unicode escape sequences (for example `\u00f6` for *ö*) instead of the actual letters, and the previous step stripped the leading backslash, leaving fragments such as `u00f6` in the text.

An example:
The word `gru00f6u00dften` should actually be `größten`.

So, we need to replace those escape sequences with the correct letters. We can either choose the original *umlaute* or their equally valid representations *ae*, *oe* and *ue*. For *ß* we can use *ss*. Here is a list of the escape sequences and their corresponding characters:
* u00e4 --> *ae* or *ä*
* u00f6 --> *oe* or *ö*
* u00fc --> *ue* or *ü*
* u00df --> *ss* or *ß*

For now, we will try to use their actual characters.

In [38]:
# map each stripped escape sequence to the character it stands for
umlaut_map = {'u00e4': 'ä', 'u00f6': 'ö', 'u00fc': 'ü', 'u00df': 'ß'}
text = nonunicode_text
for escape, char in umlaut_map.items():
    text = text.replace(escape, char)

print(text[:500])

 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? dieser prozess gehört wohl zu den größten wundern der natur. diese beeindruckenden vorgänge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von organismen vom stadium der zygote  befruchtete eizelle bis hin zum erwachsenen lebewesen untersucht. 
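
If the *ae*, *oe*, *ue* and *ss* representations were preferred instead, one possible way is a translation table. This is a small sketch under that assumption; `TRANSLITERATION`, `transliterate` and the sample sentence are illustrative names and data, not part of the notebook.

```python
# Translation table from each special character to its two-letter representation.
TRANSLITERATION = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def transliterate(value):
    """Replace umlaute and ß with their ASCII representations."""
    return value.translate(TRANSLITERATION)

sample = "dieser prozess gehört wohl zu den größten wundern"
transliterated = transliterate(sample)
print(transliterated)
```

This would shrink the vocabulary by four characters, but the generated text would then have to be converted back for readers who expect proper umlaute.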


# Model Building and Training
## Prerequisites

Before we start, we need to install the required packages.

In [36]:
%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


Now we load the needed libraries.

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import os
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from string import punctuation

## Step 1: Analyzing some text stats
Before we start going into the depth of neural networks, we'll have a look at the text at hand.

We will check for unique characters - to see if there is something left to be cleaned - and how many characters there are in total.

In [39]:
# print some stats
n_chars = len(text)
vocab = ''.join(sorted(set(text)))
print("unique_chars:", vocab)
n_unique_chars = len(vocab)
print("Number of characters:", n_chars)
print("Number of unique characters:", n_unique_chars)

unique_chars:  !.0123456789?abcdefghijklmnopqrstuvwxyzßäöü
Number of characters: 4973831
Number of unique characters: 44


Looks good so far. There are no weird characters present and we have a good amount of characters to start with.
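
Beyond the set of unique characters, a frequency count can help spot leftovers that only occur a handful of times. A minimal sketch using `collections.Counter` on a made-up sample sentence (the notebook's `text` variable could be passed in the same way):

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be the cleaned `text`.
sample_text = "dieser prozess gehört wohl zu den größten wundern der natur."

char_counts = Counter(sample_text)

# Show the five most frequent characters with their counts.
for char, count in char_counts.most_common(5):
    print(repr(char), count)
```

Characters at the rare end of the list are the ones most likely to be cleaning artifacts.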

## Step 2: Vectorize the Strings

Our neural network cannot operate on strings. It needs a vectorized representation of the text. Therefore, we will create two dictionaries, mapping each character to an integer and *vice versa*.

In [40]:
# dictionary that converts characters to integers
char2int = {c: i for i, c in enumerate(vocab)}
# dictionary that converts integers to characters
int2char = {i: c for i, c in enumerate(vocab)}

These dictionaries can be saved using `pickle.dump()`.

In [42]:
# save these dictionaries for later generation
BASENAME = 'elearning_textgen'
with open(f"{BASENAME}-char2int.pickle", "wb") as f: # the with-block closes the file after writing
    pickle.dump(char2int, f)
with open(f"{BASENAME}-int2char.pickle", "wb") as f:
    pickle.dump(int2char, f)
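
For later generation the dictionaries have to be read back in. The following sketch shows the round trip with `pickle.dump()` and `pickle.load()`. It uses a temporary directory and a tiny made-up dictionary so that it does not touch the notebook's files; `char2int_demo` and `demo_path` are illustrative names.

```python
import os
import pickle
import tempfile

# Hypothetical small mapping standing in for char2int.
char2int_demo = {c: i for i, c in enumerate(" abc")}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-char2int.pickle")
    with open(demo_path, "wb") as f:
        pickle.dump(char2int_demo, f)   # save
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)       # load

print(restored == char2int_demo)
```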

We now need to encode the text. We are using the dictionaries we've just created and convert each character into its corresponding integer.

In [46]:
# convert all text into integers
encoded_text = np.array([char2int[c] for c in text])
print(encoded_text[:20])

[ 0 18 27 33 36 22 16 24 25 34 27 20 32 15 22 28 25 28 20 22]
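
The two dictionaries are inverse to each other, so encoding followed by decoding should return the original string. A minimal sketch on a made-up sample; all `demo_*` names and the `encode`/`decode` helpers are hypothetical:

```python
demo_text = "hallo welt"
demo_vocab = "".join(sorted(set(demo_text)))

demo_char2int = {c: i for i, c in enumerate(demo_vocab)}
demo_int2char = {i: c for i, c in enumerate(demo_vocab)}

def encode(value):
    """Map each character to its integer id."""
    return [demo_char2int[c] for c in value]

def decode(ids):
    """Map integer ids back to characters."""
    return "".join(demo_int2char[i] for i in ids)

encoded = encode("hallo")
print(encoded)
print(decode(encode(demo_text)))
```

A character that is missing from the vocabulary raises a `KeyError` in `encode`, which is why text used for generation later has to go through the same cleaning steps.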


The encoded text is now used to create a `tf.data.Dataset` object via the `tf.data` API, which allows the code to scale to larger datasets.

In [47]:
# construct tf.data.Dataset object
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

# print first 8 characters
for char in char_dataset.take(8):
    print(char.numpy(), int2char[char.numpy()])




0  
18 e
27 n
33 t
36 w
22 i
16 c
24 k


In [49]:
# build sequences by batching
sequence_length = 100
sequences = char_dataset.batch(2*sequence_length + 1, drop_remainder=True)

# print sequences
for sequence in sequences.take(2):
    print(''.join([int2char[i] for i in sequence.numpy()]))

 entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszinierend wie aus einer eizelle und einem samen ein mensch im bauch einer frau heranwachsen kann? diese
r prozess gehört wohl zu den größten wundern der natur. diese beeindruckenden vorgänge erforschen wissenschaftlerinnen im rahmen der entwicklungsbiologie. dabei wird die ontogenese  entwicklung von org


In [50]:
def split_sample(sample):
    # example :
    # sequence_length is 10
    # sample is "python is a great pro" (21 length)
    # ds will be equal to ('python is ', 'a'), encoded as integers
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(1, (len(sample)-1) // 2):
        # first (input_, target) will be ('ython is a', ' ')
        # second (input_, target) will be ('thon is a ', 'g')
        # third (input_, target) will be ('hon is a g', 'r')
        # and so on
        input_ = sample[i: i+sequence_length]
        target = sample[i+sequence_length]
        # extend the dataset with these samples by concatenate() method
        other_ds = tf.data.Dataset.from_tensors((input_, target))
        ds = ds.concatenate(other_ds)
    return ds

# prepare inputs and targets
dataset = sequences.flat_map(split_sample)
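
The windowing logic of `split_sample` can be reproduced on plain strings, which makes the resulting pairs easy to inspect. This is a pure-Python sketch of the same index arithmetic, not the `tf.data` implementation; `split_sample_py` is a hypothetical helper.

```python
def split_sample_py(sample, sequence_length):
    """Return the (input, target) pairs that split_sample would produce."""
    n_pairs = (len(sample) - 1) // 2  # index 0 plus the loop over range(1, n_pairs)
    return [
        (sample[i: i + sequence_length], sample[i + sequence_length])
        for i in range(n_pairs)
    ]

pairs = split_sample_py("python is a great pro", sequence_length=10)
for input_, target in pairs[:3]:
    print(repr(input_), "->", repr(target))
print(len(pairs))
```

For a window of `2*sequence_length + 1` characters this yields `sequence_length` pairs, so every window of 201 characters contributes 100 training samples.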

In [51]:
def one_hot_samples(input_, target):
    # onehot encode the inputs and the targets
    # Example:
    # if character 'd' is encoded as 3 and n_unique_chars = 5
    # result should be the vector: [0, 0, 0, 1, 0], since 'd' is the 4th character
    return tf.one_hot(input_, n_unique_chars), tf.one_hot(target, n_unique_chars)

dataset = dataset.map(one_hot_samples)
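
What one-hot encoding does to a single integer can be shown without TensorFlow. The helper `one_hot` below is a plain-Python stand-in for illustration only; the notebook uses `tf.one_hot`.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if position == index else 0.0 for position in range(depth)]

# Example from the comment above: 'd' encoded as 3 with 5 unique characters.
vector = one_hot(3, 5)
print(vector)

# A sequence of ids becomes a list of such vectors (one row per character).
matrix = [one_hot(i, 5) for i in [0, 3, 4]]
print(len(matrix), len(matrix[0]))
```

Applied to an input of 100 characters with 44 unique characters, this gives the `(100, 44)` input shape and the `(44,)` target shape.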

In [52]:
# print first 2 samples
for element in dataset.take(2):
    print("Input:", ''.join([int2char[np.argmax(char_vector)] for char_vector in element[0].numpy()]))
    print("Target:", int2char[np.argmax(element[1].numpy())])
    print("Input shape:", element[0].shape)
    print("Target shape:", element[1].shape)
    print("="*50, "\n")

Input:  entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszi
Target: n
Input shape: (100, 44)
Target shape: (44,)

Input: entwicklungsbiologie                          findest du es nicht auch immer wieder aufs neue faszin
Target: i
Input shape: (100, 44)
Target shape: (44,)



In [54]:
# Batch size
BATCH_SIZE = 128

# repeat, shuffle and batch the dataset
ds = dataset.repeat().shuffle(1024).batch(BATCH_SIZE, drop_remainder=True)

In [55]:
model = Sequential([
    LSTM(256, input_shape=(sequence_length, n_unique_chars), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_unique_chars, activation="softmax"),
])

In [56]:
# define the model path
model_weights_path = f"results/{BASENAME}-{sequence_length}.h5"
model.summary()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 100, 256)          308224    
                                                                 
 dropout (Dropout)           (None, 100, 256)          0         
                                                                 
 lstm_1 (LSTM)               (None, 256)               525312    
                                                                 
 dense (Dense)               (None, 44)                11308     
                                                                 
Total params: 844,844
Trainable params: 844,844
Non-trainable params: 0
_________________________________________________________________
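
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias; a dense layer has one weight per input-output pair plus a bias per output. The helper functions below are illustrative, and the layer sizes are taken from the model definition above.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

n_unique_chars = 44

first_lstm = lstm_params(n_unique_chars, 256)  # input is a one-hot vector of size 44
second_lstm = lstm_params(256, 256)            # input is the first LSTM's output
dense = dense_params(256, n_unique_chars)
total = first_lstm + second_lstm + dense

print(first_lstm, second_lstm, dense, total)
```

The dropout layer has no trainable parameters, which is why it shows 0 in the summary.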


In [57]:
EPOCHS = 30
# create the results folder if it does not exist yet
os.makedirs("results", exist_ok=True)
# train the model
model.fit(ds, steps_per_epoch=(len(encoded_text) - sequence_length) // BATCH_SIZE, epochs=EPOCHS)
# save the model
model.save(model_weights_path)

Epoch 1/30
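
To get a feeling for how long one epoch is, the `steps_per_epoch` expression can be worked out from the numbers printed earlier. This is back-of-the-envelope arithmetic only; the variable names below are illustrative, and the number of pairs per window follows from the loop bounds in `split_sample`.

```python
n_chars = 4973831      # "Number of characters" from the stats cell
sequence_length = 100
BATCH_SIZE = 128

# Value passed to model.fit()
steps_per_epoch = (n_chars - sequence_length) // BATCH_SIZE

# Number of (input, target) pairs the pipeline actually generates
window = 2 * sequence_length + 1       # 201 characters per batched sequence
n_windows = n_chars // window          # drop_remainder=True discards the rest
pairs_per_window = (window - 1) // 2   # pairs produced by split_sample per window
n_pairs = n_windows * pairs_per_window
batches_per_pass = n_pairs // BATCH_SIZE

print(steps_per_epoch, n_pairs, batches_per_pass)
```

Because the dataset is repeated indefinitely, training does not run out of data; with these numbers one epoch as configured simply corresponds to roughly two passes over the generated pairs.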