# Words to Phonemes
## A Machine Learning model that is able to translate written english words to their corresponding Phonemes.

This Machine Learning algorithm has the objective of creating a Sequence-To-Sequence NLP model that is capable of translating a word input to its corresponding phonemes in the *arpabet* format.

For example, it is able to receive the word "car" and return the phonemes "K AA R".

It uses the Keras library to create an LSTM Recurrent Neural Network to be trained with thousands of english words.

In [0]:
%tensorflow_version 1.x

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
import random

### Training Data
The Training Data is gathered from the [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

In [11]:
!wget https://raw.githubusercontent.com/microsoft/CNTK/v2.0/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b.train

#Open Text File
f = open("cmudict-0.7b.train", "r")
dataString = f.read()

# Seperate it into lines
lines = dataString.split('\n')

# Shuffle
random.shuffle(lines)
lines[:4]

--2020-03-30 14:42:53--  https://raw.githubusercontent.com/microsoft/CNTK/v2.0/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2851082 (2.7M) [text/plain]
Saving to: ‘cmudict-0.7b.train.7’


2020-03-30 14:42:54 (33.7 MB/s) - ‘cmudict-0.7b.train.7’ saved [2851082/2851082]



['SYNONYMOUS  S AH N AA N AH M AH S',
 'EDITORIALIZING  EH D AH T AO R IY AH L AY Z IH NG',
 "WINNER'S  W IH N ER Z",
 'DELMONT  D EY L M OW N T']

After downloading, we need to prepare it. 

By iterating over each line of the document, the algorithm fills out two Arrays containing all of the work-phoneme pairs and two Sets containing every used character or phoneme, creating the vocabulary.

In [12]:
#Input and Target Arrays
input_texts = []
target_texts = []

#Vocabulary
input_characters = set()
target_phonemes = set()

#Iterate over the lines
for line in lines:
    if line != "":
      # Get the input word and the Target Phoneme
      input_text, target_text = line.split('  ')

      # TAB is the Start Character
      # '\n' is the End Chatachter
      target_text = '\t ' + target_text + ' \n'

      #Add the data to the Input and Target arrays
      input_texts.append(input_text)
      target_texts.append(target_text)

      #Add new characters and phonemes to the vocabulary
      for char in input_text:
          if char not in input_characters:
              input_characters.add(char)
      for phoneme in target_text.split(" "):
          if phoneme not in target_phonemes:
              target_phonemes.add(phoneme)
            
# Sort the Vocabulary
input_characters = sorted(list(input_characters))
target_phonemes = sorted(list(target_phonemes))

# Usefeull property variables
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_phonemes)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt.split(" ")) for txt in target_texts])
num_samples = len(input_texts)

print('Number of samples:', num_samples)
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 114399
Number of unique input tokens: 27
Number of unique output tokens: 41
Max sequence length for inputs: 22
Max sequence length for outputs: 22


Next, we create a Dictionary that pairs each Carachter and Phoneme to a corresponging number index.

Ex: A --> 1

In [13]:
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(phoneme, i) for i, phoneme in enumerate(target_phonemes)])

input_token_index['A']

1

### The Machine Learning Model
Since this model is "Sequence-to-Sequence", it will receive a large list as input. 

This list, is the size of the entire training data **X** the largest input word's length **X** the size of the Output (Phoneme) Vocabulary.

Therefore, the first step is to initialize this Array with only Zeros.

In [14]:
# Initiate all entrences and outputs in the initial state
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

encoder_input_data[:2]

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)

Next, we need to fill the Array with the correct corresponding values.

The value **1** will be placed at:
- The index of the word inside the training data.
- The index of the Character/Phoneme inside the word.
- The index of the Character/Phoneme inside the Vocabulary.


In [15]:
# Iterate over touples of input and target to gererate the decoder and encoder input data
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):

    # for every character in input text
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1. ## Bag of Letters
    
    # for every phoneme in target text
    for t, phoneme in enumerate(target_text.split(" ")):
        # decoder_target_data is one step ahead of decoder_input_data
        decoder_input_data[i, t, target_token_index[phoneme]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start phoneme.
            decoder_target_data[i, t - 1, target_token_index[phoneme]] = 1.
        
encoder_input_data[0]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],


####  ENCODER
The first part of this sequence-to-sequence Machine Learning Model is an LSTM network used to encode the previously created 3D input matrix into data that will be latter used by the Decoder network.

In [0]:
batch_size = 64  # Batch size for training.
epochs = 30  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.

In [17]:
# Define an input sequence
encoder_inputs = Input(shape=(None, num_encoder_tokens))

# Create the Encoder LSTM
encoder = LSTM(latent_dim, return_state=True)

# Use the LSTM to fill out the encoder output and middle states.
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# For the decoder, only the middle states will be necessary.
encoder_states = [state_h, state_c]






####  DECODER
The next step in the model is creating a Decoding LSTM network that receives the previous state generated by the Encoder to generate the final correct output sequence, translating the characters into phonemes.

In [0]:

# Define an input sequence.
decoder_inputs = Input(shape=(None, num_decoder_tokens))

# Creathe the Decoder LSTM
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

# Use the LSTM to fill out the decoder Output
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

# Create the Dense layer and use it to update the decoder output
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

### Model
Now, we ca create the Model that will convert the input data from the encoder (and its generated decoder input) into the target output.

In [0]:
# Define the Model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

### Training
Now its time to train out Model! It will run with the previously established configurations and input data for 30 epochs (aproximately 30 minutes).

It uses the RMSProf optimizer and utilizes accurary as its metric.

In [20]:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

# Save model
model.save('s2s.h5')



Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Train on 91519 samples, validate on 22880 samples
Epoch 1/30





Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


### Testing
Let's test our trained model! First, we create a sampling encoder and decoder models.

In [26]:
# Create Encoder Sampling Model
encoder_model = Model(encoder_inputs, encoder_states)

# Define decoder inputs
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Generate decoder outputs by using the trained dense layer
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

# Create Decoder Sampling Model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)


ERROR! Session/line number was not unique in database. History logging moved to new session 59


Create a Dictionary to reverse-lookup the index into its respective character or phoneme.


In [0]:
# Character reverse look-up
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())

# Phoneme reverse look-up
reverse_target_phoneme_index = dict(
    (i, phoneme) for phoneme, i in target_token_index.items())

Now, we need a function that receives the input sequence and returns the decoded word.

In [0]:
def decode_sequence(input_seq):
    # Get the encoder model prediction
    states_value = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))

    # Set the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Now, we need to iterate over the sequence until we find the stop character
    stop_condition = False
    decoded_word = ''
    while not stop_condition:
        # Get the decoder model prediction
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Get the sampled phoneme and add it to the decoded word
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_phoneme = reverse_target_phoneme_index[sampled_token_index]
        decoded_word += sampled_phoneme + " "

        # Exit the loop if \n is fount or if the decoded word is larger then the maximum permited size.
        if (sampled_phoneme == '\n' or
           len(decoded_word) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    # Return the decoded word
    return decoded_word

Now, to test on any input value, we create a function that receives a word, converts it into a sequence an then decodes it with our previous function.

In [0]:
def getPhonemes(word):
  #make the input uppercase
  word = word.upper()

  # Initialize the input sequence with only zeros.
  myInput = np.zeros(
      (1, max_encoder_seq_length, num_encoder_tokens),
      dtype='float32')

  # Fill the sequence correctly with the word
  for t, char in enumerate(word):
      myInput[0, t, input_token_index[char]] = 1.

  # Return the decoded value
  return decode_sequence(myInput).replace('\n', '').replace('\t', '').replace('  ', '')

We can now call the function with any word (that was not even in the training dataset) to see its predicted phonetic separation.

In [36]:
getPhonemes("trailer")

'T R EY L ER'

Finally, we can itarete over some values in a test dataset and see its corresponding results and accuracy.

In [37]:
# Download the Test Dataset
!wget https://raw.githubusercontent.com/microsoft/CNTK/v2.0/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b.test

# Separate the data into shuffled lines
f = open("cmudict-0.7b.train", "r")
dataString = f.read()
words_phonemes = dataString.split('\n')
random.shuffle(words_phonemes)

size = 4
correct = 0

for seq_index in range(size):
    # Get an input from the encoder's input.
    word_phoneme = words_phonemes[seq_index].split('  ')
    word = word_phoneme[0]
    correct_phoneme = word_phoneme[1]

    #Have the word be decoded
    decoded_phoneme= getPhonemes(word)

    #Update the number of correct translations
    if decoded_phoneme == correct_phoneme:
      correct += 1

    #Print the results
    print('-')
    print('Input Word:', word)
    print('Decoded Phonemes:', decoded_phoneme)
    print('Actual Phonemes:', correct_phoneme)

print('Correctly answered ', correct, ' out of ', size)

--2020-03-30 15:33:19--  https://raw.githubusercontent.com/microsoft/CNTK/v2.0/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b.test
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 321219 (314K) [text/plain]
Saving to: ‘cmudict-0.7b.test.3’


2020-03-30 15:33:19 (20.3 MB/s) - ‘cmudict-0.7b.test.3’ saved [321219/321219]

-
Input Word: NASHUA
Decoded Phonemes: N AE SH UW AH
Actual Phonemes: N AE SH UW AH
-
Input Word: PRIED
Decoded Phonemes: P R AY D
Actual Phonemes: P R AY D
-
Input Word: WHISKERS
Decoded Phonemes: W IH S K ER Z
Actual Phonemes: W IH S K ER Z
-
Input Word: FERRONICKEL
Decoded Phonemes: F EH R AH N IH K AH L
Actual Phonemes: F EH R AH N IH K AH L
Correctly answered  4  out of  4
