# Adding Romanian Diacritics using Bidirectional LSTM

Diacritics are part of the Romanian identity, but they are often dropped in informal writing because typing without them is faster. In this project I propose a faster alternative to inserting them by hand, using *Bidirectional Long Short-Term Memory* (LSTM) artificial neural networks.

## Understanding the problem

* Romanian has five letters with diacritics: Ș /ʃ/, Ă /ə/, Ț /t͡s/, Â /ɨ/ and Î /ɨ/, plus their lowercase counterparts.
* Comma-below (ș and ț) versus cedilla (ş and ţ): *Many printed and online texts still incorrectly use "s with cedilla" and "t with cedilla".* [[Wikipedia:en@Romanian_alphabet]](https://en.wikipedia.org/wiki/Romanian_alphabet)
* Î versus â: since the 1993 reform, the choice between î and â is based on a rule that is neither strictly etymological nor phonological, but positional and morphological. The sound is always spelled as â, except at the beginning and the end of words, where î is to be used instead. Exceptions include proper nouns, where the usage of the letters is frozen, whichever it may be, and compound words, whose components are each separately subjected to the rule (e.g. ne- + îndemânatic → neîndemânatic "clumsy", not *neândemânatic). [[Wikipedia:en@Romanian_alphabet]](https://en.wikipedia.org/wiki/Romanian_alphabet#%C3%8E_versus_%C3%82)
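
The positional rule for î and â can be sketched in a few lines. This is only an illustration under simplifying assumptions: the helper name `spell_close_central_vowel` is hypothetical, and it deliberately ignores the exceptions listed above (proper nouns and compound words such as neîndemânatic).

```python
def spell_close_central_vowel(word_length, index):
    """Pick the letter for the /ɨ/ sound at `index` in a word of `word_length` letters.

    Hypothetical helper: applies only the positional rule and ignores
    proper nouns and compound words.
    """
    at_edge = index == 0 or index == word_length - 1
    return "î" if at_edge else "â"


# "în": the sound is the first letter; "român": it sits in the middle;
# "coborî": it is the last letter.
print(spell_close_central_vowel(len("în"), 0))      # î
print(spell_close_central_vowel(len("român"), 3))   # â
print(spell_close_central_vowel(len("coborî"), 5))  # î
```

Because of the exceptions, a fixed rule like this is not enough on its own, which is one reason to let a network learn the choice from context.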

### Output Targets

- Ă ă (a with breve)
- Â â (a with circumflex)
- Î î (i with circumflex)
- Ș ș (s with comma)
- Ț ț (t with comma)
- Not Diacritic (Ignore / Discard)

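Now that we understand the problem at hand, we need to decide on the inputs and outputs of our ANN. We don't really need 6 outputs: ă, ș and ț come from different base letters (a, s, t), so a single shared class is enough, because the base letter already tells us which diacritic to add. â needs its own class because it shares the base letter a with ă. We can therefore simplify our targets to the following one-hot format:

| ă or ș or ț 	| î 	| â 	| Not diacritic 	|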
|:-----------:	|:-:	|:-:	|:-------------:	|
|      1      	| 0 	| 0 	|       0       	|
|      0      	| 1 	| 0 	|       0       	|
|      0      	| 0 	| 1 	|       0       	|
|      0      	| 0 	| 0 	|       1       	|
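
The same four-class encoding can be written down with plain NumPy. This is a minimal sketch; `CLASS_NAMES` and `one_hot` are illustrative names that do not appear in the notebook, which uses Keras' `to_categorical` for this step later on.

```python
import numpy as np

# Class indices in the same column order as the table above
CLASS_NAMES = ["ă/ș/ț", "î", "â", "not diacritic"]

def one_hot(labels, num_classes=4):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype=int)[np.asarray(labels)]

# Classes for "î", "not diacritic" and "ă/ș/ț", in that order
print(one_hot([1, 3, 0]).tolist())
# [[0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```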

## Text to Sequence
For an LSTM to work with text, the text first has to be converted to a numeric sequence. If we were working with words, this could easily be done with the `Tokenizer` class. Since we work with characters, I found it simpler to map each character to its Unicode code point and store the results in an array, which also makes it easy to convert them back.

In [1]:
import numpy as np

def textToSequence(text):
    # One Unicode code point per character
    return np.array([ord(c) for c in text])

print(textToSequence("Imi place foarte mult aceasta casa."))

[ 73 109 105  32 112 108  97  99 101  32 102 111  97 114 116 101  32 109
 117 108 116  32  97  99 101  97 115 116  97  32  99  97 115  97  46]
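
Going back from code points to text is just as short. The sketch below is an assumption about how the reverse step could look (`sequenceToText` is not defined in the notebook); it skips the value 0, which the notebook uses as padding further down.

```python
import numpy as np

def sequenceToText(seq):
    """Inverse of textToSequence; skips 0, which is used as padding."""
    return "".join(chr(int(c)) for c in np.ravel(seq) if c != 0)

codes = np.array([ord(c) for c in "Imi place"])
print(sequenceToText(codes))  # Imi place
```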


Now that we have sequences, we need to break them into time steps, because an LSTM expects input of shape *(N, TIMESTEPS, INPUT)*. There are probably easier ways of doing this, but I came up with the following function, which splits a NumPy array into equal-sized chunks and pads the last one with 0 if necessary.

In [2]:
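def breakInto(arr, n = 30):
    # Flatten so that 1-D sequences and (N, 1) label lists are both handled
    arr = np.array(arr, dtype=int).flatten()

    # Pad the tail with zeros so that the length is a multiple of n
    if len(arr) % n:
        padSize = n - (len(arr) % n)
        arr = np.append(arr, np.zeros(padSize, dtype=int))
    
    # One row per chunk of n time steps
    arr = np.reshape(arr, (len(arr) // n, n))

    return arr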

In [3]:
seq = textToSequence("îmi place foarte mult această casă și țin să locuiesc în ea.")
print(breakInto(seq, 8))

print("\n -- Chunks of 128. -- \n")
print(breakInto(seq, 128))

[[238 109 105  32 112 108  97  99]
 [101  32 102 111  97 114 116 101]
 [ 32 109 117 108 116  32  97  99]
 [101  97 115 116 259  32  99  97]
 [115 259  32 537 105  32 539 105]
 [110  32 115 259  32 108 111  99]
 [117 105 101 115  99  32 238 110]
 [ 32 101  97  46   0   0   0   0]]

 -- Chunks of 128. -- 

[[238 109 105  32 112 108  97  99 101  32 102 111  97 114 116 101  32 109
  117 108 116  32  97  99 101  97 115 116 259  32  99  97 115 259  32 537
  105  32 539 105 110  32 115 259  32 108 111  99 117 105 101 115  99  32
  238 110  32 101  97  46   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0]]
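
As for the "easier methods" mentioned above, one option is `np.pad`. The sketch below is an alternative under that assumption, not the function used in the rest of the notebook, and `break_into_padded` is a made-up name.

```python
import numpy as np

def break_into_padded(arr, n=30):
    """Split a 1-D integer array into rows of length n, zero-padding the tail."""
    arr = np.asarray(arr, dtype=int).ravel()
    pad_size = -len(arr) % n  # 0 when len(arr) is already a multiple of n
    arr = np.pad(arr, (0, pad_size), mode="constant")  # pads with zeros
    return arr.reshape(-1, n)

# 60 values in chunks of 8, like the example sentence above
chunks = break_into_padded(np.arange(1, 61), 8)
print(chunks.shape)         # (8, 8)
print(chunks[-1].tolist())  # [57, 58, 59, 60, 0, 0, 0, 0]
```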


## Preparing the Y
Now that we have successfully created X (the inputs), we need to focus on Y, the targets for our ANN. To do this, we turn our text into the machine-friendly format described above.

In [4]:
from keras.utils import to_categorical

OUTPUT_SIZE = 4

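def fixDia(text):
    # Map the legacy cedilla letters to the correct comma-below letters
    transformationTable = {
        "\u015f": "\u0219",  # ş (cedilla) -> ș (comma below)
        "\u0163": "\u021b",  # ţ (cedilla) -> ț (comma below)
        "\u015e": "\u0218",  # Ş -> Ș
        "\u0162": "\u021a",  # Ţ -> Ț
    }
    
    for char in transformationTable.keys():
        text = text.replace(char, transformationTable[char])
    
    return text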

def removeDia(text):
    text = fixDia(text)
    
    transformationTable = {
        "î": "i",
        "ă": "a",
        "ț": "t",
        "â": "a",
        "ș": "s",
        "Î": "I",
        "Ă": "A",
        "Ț": "T",
        "Â": "A",
        "Ș": "S",
    }
    
    for char in transformationTable.keys():
        text = text.replace(char, transformationTable[char])
    
    return text    

print("îmi place foarte mult această casă și țin să locuiesc în ea.")
print(removeDia("îmi place foarte mult această casă și țin să locuiesc în ea."))

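def toTarget(text):
    text = fixDia(text)
    
    # One class index per character: 0 = ă/ș/ț, 1 = î, 2 = â, 3 = not a diacritic
    returnable = []
    for char in text.lower():
        if char in ["ă", "ș", "ț"]:
            returnable.append([0])
        elif char == "î":
            returnable.append([1])
        elif char == "â":
            returnable.append([2])
        else:
            returnable.append([3])

    # breakInto pads with 0, which would be read as class 0 here, so first pad
    # the last chunk with class 3 (30 is breakInto's default chunk size)
    returnable += [[3]] * (-len(returnable) % 30)

    returnable = breakInto(returnable)
    returnable = to_categorical(returnable, OUTPUT_SIZE)
    
    return returnable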

print(toTarget("îmi place foarte mult această casă și țin să locuiesc în ea.")[0])

îmi place foarte mult această casă și țin să locuiesc în ea.
imi place foarte mult aceasta casa si tin sa locuiesc in ea.
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]
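
To check that the stripped text and the targets together carry enough information to recover the original, the encoding can be inverted. This is a minimal sketch, assuming one integer class per character (for example from `np.argmax` over the one-hot rows); `RESTORE` and `applyLabels` are illustrative names, not part of the notebook.

```python
# Which letter each (base letter, class) pair stands for; anything else is kept as-is
RESTORE = {
    ("a", 0): "ă", ("s", 0): "ș", ("t", 0): "ț",
    ("i", 1): "î",
    ("a", 2): "â",
}

def applyLabels(plain, labels):
    """Rebuild text from diacritic-free text and one class index per character."""
    out = []
    for char, label in zip(plain, labels):
        restored = RESTORE.get((char.lower(), label), char)
        # Preserve the capitalisation of the input character
        out.append(restored.upper() if char.isupper() else restored)
    return "".join(out)

print(applyLabels("imi tin casa", [1, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 0]))
# îmi țin casă
```

Pairs that make no sense, such as class 0 on the letter m, fall through to the unchanged input character rather than producing a placeholder.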


## Creating the Neural Network

In [5]:
from keras.models import Model
from keras.layers import Input, LSTM, Dropout, TimeDistributed, Dense, Bidirectional, Embedding

def initNeuralNetwork():
    inputs = Input(shape=(30, 1))
    x = Bidirectional(LSTM(128, return_sequences=True))(inputs)
    x = Dropout(0.25)(x)
    x = TimeDistributed(Dense(OUTPUT_SIZE, activation='softmax'))(x)
    
    model = Model(inputs=inputs, outputs=x)
    
    model.compile('adam', 'categorical_crossentropy', metrics=['acc'])
    
    return model

In [6]:
model = initNeuralNetwork()

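def predict(text):
    X = breakInto(textToSequence(text))
    X = np.reshape(X, X.shape + (1,))
    pred = model.predict(X)
    pred = pred.reshape(-1, pred.shape[-1])
    
    # Class 0 means "add the diacritic that belongs to this base letter"
    class0 = {'a': 'ă', 's': 'ș', 't': 'ț', 'A': 'Ă', 'S': 'Ș', 'T': 'Ț'}
    
    out = []
    # Drop the padded time steps, then take the most likely class per character
    labels = np.argmax(pred[:len(text)], axis=-1)

    for i, label in enumerate(labels):
        if label == 3: out.append(text[i])
        elif label == 2: out.append('â')
        elif label == 1: out.append('î')
        else: out.append(class0.get(text[i], text[i]))
        
    print(text)
    print(''.join(out))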

# Read the corpus as UTF-8 so the diacritics are decoded correctly
with open('dataset', 'r', encoding='utf-8') as f:
    text = ' '.join(f.readlines())


X = breakInto(textToSequence(removeDia(text)))
X = np.reshape(X, X.shape + (1,))

Y = toTarget(text)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)


from keras.callbacks import LambdaCallback, ModelCheckpoint

def test(epoch, logs):
    text = 'Republica moldova este o tara foarte frumoasa si bogata si imi place foarte mult.'
    predict(text)


model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=[
    # Print a sample prediction after every epoch and checkpoint the model
    LambdaCallback(on_epoch_end=test),
    ModelCheckpoint('save', monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', period=1)
])

KeyboardInterrupt: 

In [None]:
text = 'Republica moldova este o tara foarte frumoasa si bogata si imi place foarte mult.'
X = breakInto(textToSequence(text))
X = np.reshape(X, X.shape + (1,))

pred = model.predict(X)

In [None]:
pred = pred.reshape(-1, pred.shape[-1])

out = []
labels = [np.argmax(amax) for amax in pred[:len(text)]]

print(text)
print(labels)

