## Sequencing: Turning sentences into data

#### Hemant Thapa

Sequencing is the process of converting text, which could be words, sentences, or even entire documents, into a sequence of numerical values that a model can work with. It involves three steps:

- Tokenization: This is the first step, where the text is split into smaller units called tokens. Tokens can be words, characters, or subwords. For example, the sentence "Hello world" might be broken down into ["Hello", "world"].

- Assigning Numeric Values: Each unique token is assigned a specific numeric value. This process creates a mapping where each word or character is represented by a unique number.

- Creating Sequences: Once each token has a numeric value, the text can be converted into a sequence of numbers. For instance, if "Hello" is assigned the number 1 and "world" the number 2, the sentence "Hello world" would become [1, 2].
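
The three steps above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: the helper names `tokenize`, `build_word_index`, and `to_sequences` are hypothetical, and the choices to lowercase the text, strip punctuation, and number words from 1 by descending frequency are assumptions made to resemble the tokenizer output shown in this notebook.

```python
import string
from collections import Counter

def tokenize(text):
    #lowercase, replace punctuation with spaces, then split on whitespace
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def build_word_index(texts):
    #count every token, then number tokens from 1 by descending frequency
    counts = Counter(token for text in texts for token in tokenize(text))
    #sorted() is stable, so tied tokens keep their first-seen order
    ordered = sorted(counts, key=lambda token: -counts[token])
    return {token: index for index, token in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    #replace each known token with its number; unknown tokens are skipped
    return [[word_index[t] for t in tokenize(text) if t in word_index]
            for text in texts]

demo_index = build_word_index(["Hello world", "Hello again"])
print(demo_index)
print(to_sequences(["Hello world"], demo_index))
```

Here "hello" appears twice, so it receives index 1, and "Hello world" is encoded as [1, 2], as in the example above.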

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
#create a list of strings
sentences = [
    'I love to read books',
    'I love to travel around world',
    'Do you love reading books!',
    'What is your best read ?'
]

In [3]:
#printing sentences 
print(sentences)

['I love to read books', 'I love to travel around world', 'Do you love reading books!', 'What is your best read ?']


In [4]:
#create a tokenizer that keeps at most the 99 (num_words - 1) most frequent words
tokenizer = Tokenizer(num_words=100)

In [5]:
#build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)

In [6]:
#dictionary mapping each word to its index
word_index = tokenizer.word_index

In [7]:
#print dictionary
print(word_index)

{'love': 1, 'i': 2, 'to': 3, 'read': 4, 'books': 5, 'travel': 6, 'around': 7, 'world': 8, 'do': 9, 'you': 10, 'reading': 11, 'what': 12, 'is': 13, 'your': 14, 'best': 15}


In [8]:
#print each index with its word
for word, index in word_index.items():
    print(index, "->", word)

1 -> love
2 -> i
3 -> to
4 -> read
5 -> books
6 -> travel
7 -> around
8 -> world
9 -> do
10 -> you
11 -> reading
12 -> what
13 -> is
14 -> your
15 -> best


In [9]:
#convert each sentence into a list of word indices
sequences = tokenizer.texts_to_sequences(sentences)

In [10]:
#printing sequences
print(sequences)

[[2, 1, 3, 4, 5], [2, 1, 3, 6, 7, 8], [9, 10, 1, 11, 5], [12, 13, 14, 15, 4]]


#### Testing 

In [11]:
#create a list of test sentences that contain words the tokenizer has not seen
test_data = [
    'I really love to read books and always prefer general knowledge over fictions',
    'I love to travel around world and have a dream to visit seven wonder around the world',
    'Do you enjoy reading books!',
    'What is your best read till now and what would you recommend?'
]

In [12]:
test_seq = tokenizer.texts_to_sequences(test_data)

In [13]:
print(test_seq)

[[2, 1, 3, 4, 5], [2, 1, 3, 6, 7, 8, 3, 7, 8], [9, 10, 11, 5], [12, 13, 14, 15, 4, 12, 10]]


In [14]:
def inverse_word_index(word_index):
    #swap keys and values: index -> word
    return {index: word for word, index in word_index.items()}

def decode_text(test_seq, index_to_word):
    decoded_texts = []

    #loop through each sequence in test_seq
    for seq in test_seq:
        #find the word for each index, using ? as a placeholder for missing words
        words = [index_to_word.get(i, '?') for i in seq]

        #join the words with single spaces and store the decoded sentence
        decoded_texts.append(" ".join(words))

    for text in decoded_texts:
        print(text)

In [15]:
inverse_word_index(word_index)

{1: 'love',
 2: 'i',
 3: 'to',
 4: 'read',
 5: 'books',
 6: 'travel',
 7: 'around',
 8: 'world',
 9: 'do',
 10: 'you',
 11: 'reading',
 12: 'what',
 13: 'is',
 14: 'your',
 15: 'best'}

In [16]:
decode_text(test_seq, inverse_word_index(word_index))

i love to read books
i love to travel around world to around world
do you reading books
what is your best read what you


#### Out-of-vocabulary token

When you tokenize text, that is, convert it into a series of tokens (like words or characters), you usually have a fixed vocabulary: the set of tokens the tokenizer learned from the training data. A word in new or unseen data that was not in that vocabulary is called an out-of-vocabulary (OOV) word.

- Placeholder for Unknown Words: The oov_token is a special token that is used as a placeholder for words that are not in the tokenizer's vocabulary. It's a way to handle these unknown words.

- Consistency in Tokenization: Without an oov_token, any word not in the vocabulary is silently dropped during tokenization, so information is lost and the sequence becomes shorter than the original text. With an oov_token, every unknown word still occupies a position, so the sequence keeps the length and word order of the text.

- Common Usage in Tokenizers: Many tokenization tools, including those in popular libraries like Keras, allow you to specify an oov_token. For example, when creating a tokenizer in Keras, you can set oov_token="<OOV>". Then, during tokenization, any word not found in the word index is replaced by this token.
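
The difference between dropping unknown words and replacing them can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Keras code: the function `encode` and the small hand-written `vocab` are hypothetical, with index 1 reserved for the placeholder token.

```python
def encode(tokens, word_index, oov_token=None):
    #map tokens to indices; unknown tokens are dropped unless oov_token is set
    encoded = []
    for token in tokens:
        if token in word_index:
            encoded.append(word_index[token])
        elif oov_token is not None:
            encoded.append(word_index[oov_token])
    return encoded

#small hand-written vocabulary for illustration; index 1 is reserved for <OOV>
vocab = {"<OOV>": 1, "love": 2, "i": 3, "books": 4}
tokens = ["i", "really", "love", "books"]

without_oov = encode(tokens, vocab)
with_oov = encode(tokens, vocab, oov_token="<OOV>")
print(without_oov)  #"really" is dropped, so the sequence is shorter
print(with_oov)     #"really" becomes 1, so the length is preserved
```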

In [17]:
#tokenizer with an out-of-vocabulary token
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

In [18]:
#build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)

In [19]:
#dictionary mapping each word to its index
word_index = tokenizer.word_index

In [20]:
#print each index with its word
for word, index in word_index.items():
    print(index, "->", word)

1 -> <OOV>
2 -> love
3 -> i
4 -> to
5 -> read
6 -> books
7 -> travel
8 -> around
9 -> world
10 -> do
11 -> you
12 -> reading
13 -> what
14 -> is
15 -> your
16 -> best


In [21]:
#text to sequences 
test_seq = tokenizer.texts_to_sequences(test_data)

In [22]:
#printing sequences 
print(test_seq)

[[3, 1, 2, 4, 5, 6, 1, 1, 1, 1, 1, 1, 1], [3, 2, 4, 7, 8, 9, 1, 1, 1, 1, 4, 1, 1, 1, 8, 1, 9], [10, 11, 1, 12, 6], [13, 14, 15, 16, 5, 1, 1, 1, 13, 1, 11, 1]]


In [23]:
inverse_word_index(word_index)

{1: '<OOV>',
 2: 'love',
 3: 'i',
 4: 'to',
 5: 'read',
 6: 'books',
 7: 'travel',
 8: 'around',
 9: 'world',
 10: 'do',
 11: 'you',
 12: 'reading',
 13: 'what',
 14: 'is',
 15: 'your',
 16: 'best'}

In [24]:
decode_text(test_seq, inverse_word_index(word_index))

i <OOV> love to read books <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>
i love to travel around world <OOV> <OOV> <OOV> <OOV> to <OOV> <OOV> <OOV> around <OOV> world
do you <OOV> reading books
what is your best read <OOV> <OOV> <OOV> what <OOV> you <OOV>


#### Padding 

Padding is the technique of standardising the lengths of sequences (like sentences or paragraphs) so that they are all the same size. This is important because models, particularly those in deep learning, require inputs of a consistent size.

- Handling Variable Lengths: Text data often comes in varying lengths (different number of words or characters in different sentences). However, models like neural networks require input data to be of a fixed size.

- Batch Processing: When training models, it's efficient to process data in batches. Padding ensures all sequences in a batch have the same length, allowing for efficient batch processing.

- Adding Extra Values: Padding involves adding extra values to sequences to make them all the same length. The padding value is typically 0 but can be set to other values depending on the context and requirements.
  
- Pre-padding vs. Post-padding: Padding can be added either at the beginning (pre-padding) or the end (post-padding) of the sequences. The choice depends on the model and the nature of the data.

Sentence 1: ["I", "love", "cats"]

Sentence 2: ["Dogs", "are", "great", "pets"]

If converted to sequences with numerical values, you might have:

Sentence 1: [5, 12, 7]

Sentence 2: [8, 3, 9, 10]


If you decide each sequence should have a length of 5 for model input, you'd pad them like this:

Sentence 1 with post-padding: [5, 12, 7, 0, 0]

Sentence 2 with post-padding: [8, 3, 9, 10, 0]
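
The same idea can be written as a short plain-Python sketch. It is only an approximation for illustration: the name `pad` is hypothetical, its parameter names are borrowed from the `pad_sequences` calls used in this notebook, and it returns nested lists rather than the NumPy array that `pad_sequences` produces.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    #default target length is the length of the longest sequence
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        #cut sequences that are too long, from the front or from the back
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        #fill sequences that are too short with the padding value
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

example = [[5, 12, 7], [8, 3, 9, 10]]
post = pad(example, maxlen=5, padding="post")
pre = pad(example, maxlen=5)
print(post)
print(pre)
```

When a sequence is longer than `maxlen`, the `truncating` argument decides which end is cut off, which is a separate choice from where the zeros are placed.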

In [25]:
#by default zeros are added before each sequence, up to the length of the longest one
padded = pad_sequences(test_seq)

In [26]:
print(padded)

[[ 0  0  0  0  3  1  2  4  5  6  1  1  1  1  1  1  1]
 [ 3  2  4  7  8  9  1  1  1  1  4  1  1  1  8  1  9]
 [ 0  0  0  0  0  0  0  0  0  0  0  0 10 11  1 12  6]
 [ 0  0  0  0  0 13 14 15 16  5  1  1  1 13  1 11  1]]


In [27]:
#padding='post' adds the zeros after each sequence
post_padded = pad_sequences(test_seq, padding='post')

In [28]:
print(post_padded)

[[ 3  1  2  4  5  6  1  1  1  1  1  1  1  0  0  0  0]
 [ 3  2  4  7  8  9  1  1  1  1  4  1  1  1  8  1  9]
 [10 11  1 12  6  0  0  0  0  0  0  0  0  0  0  0  0]
 [13 14 15 16  5  1  1  1 13  1 11  1  0  0  0  0  0]]


In [29]:
#padding='post' with maxlen=10; longer sequences are truncated from the front by default
post_padded_max_len = pad_sequences(test_seq, padding='post', maxlen=10)

In [30]:
print(post_padded_max_len) 

[[ 4  5  6  1  1  1  1  1  1  1]
 [ 1  1  1  4  1  1  1  8  1  9]
 [10 11  1 12  6  0  0  0  0  0]
 [15 16  5  1  1  1 13  1 11  1]]


In [31]:
#truncating='post' cuts longer sequences from the end instead of the front
temp_padded = pad_sequences(test_seq, padding='post', truncating='post', maxlen=10)

In [32]:
print(temp_padded)

[[ 3  1  2  4  5  6  1  1  1  1]
 [ 3  2  4  7  8  9  1  1  1  1]
 [10 11  1 12  6  0  0  0  0  0]
 [13 14 15 16  5  1  1  1 13  1]]


In [33]:
#use the inverse_word_index function to get the index -> word dictionary
inverse_index_dict = inverse_word_index(word_index)

#using this dictionary to decode the sequences
for seq in padded:
    words = [inverse_index_dict.get(i, '') for i in seq if i != 0]  # Exclude padding
    sentence = ' '.join(words).strip()
    print(sentence)

i <OOV> love to read books <OOV> <OOV> <OOV> <OOV> <OOV> <OOV> <OOV>
i love to travel around world <OOV> <OOV> <OOV> <OOV> to <OOV> <OOV> <OOV> around <OOV> world
do you <OOV> reading books
what is your best read <OOV> <OOV> <OOV> what <OOV> you <OOV>
