# TensorFlow 2.0 alpha - Unicode Strings
### Unicode - standard encoding system used to represent characters from nearly all languages
* Models that handle Natural Language often work with languages with different character sets
* Each character is encoded - using a unique integer Code Point between 0 and 0x10FFFF
* Unicode String - sequence of 0 or more code points
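
As a quick illustration of the points above, the following pure-Python sketch (standard library only, no TensorFlow) lists the code points of a short string and rebuilds the string from them. The variable names are illustrative only.

```python
# A Unicode string viewed as a sequence of integer code points.
text = u"Thanks 😊"

# ord() maps a one-character string to its integer code point.
code_points = [ord(ch) for ch in text]
print(code_points)

# chr() is the inverse mapping, so the round trip restores the text.
restored = "".join(chr(cp) for cp in code_points)
print(restored == text)

# Every code point lies in the range 0..0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)
print(in_range)
```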

In [1]:
from __future__ import absolute_import, division, unicode_literals, print_function

import tensorflow as tf


## The Data Type - tf.string 
* allows building tensors of byte strings
* unicode strings - utf-8 encoded by default

In [2]:
tf.constant(u'Thanks 😊')

<tf.Tensor: id=0, shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

#### tf.string - can hold byte strings of varying lengths - byte strings treated as atomic units

In [3]:
# string length is Not included in the tensor dimensions

tf.constant([u"You're", u"welcome!"]).shape

TensorShape([2])

## Representing Unicode
### 2 standard ways
* String scalar - sequence of code points encoded using a known character encoding
* int32 vector - each position contains a single code point

### All 3 of the following represent the unicode string - "语言处理"

In [4]:
# unicode string - represented as a UTF-8 encoded string scalar

text_utf8 = tf.constant(u"语言处理")
text_utf8

<tf.Tensor: id=3, shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [5]:
# unicode string - represented as a UTF-16-BE encoded string scalar

text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

<tf.Tensor: id=5, shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [7]:
# unicode string - represented as a vector of unicode code points

text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

<tf.Tensor: id=8, shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

## Converting between Representations
* tf.strings.unicode_decode - converts encoded string scalar - to vector of code points
* tf.strings.unicode_encode - converts vector of code points - to encoded string scalar
* tf.strings.unicode_transcode - converts encoded string scalar - to different encoding

In [8]:
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')

<tf.Tensor: id=13, shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

In [9]:
tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')

<tf.Tensor: id=24, shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [10]:
tf.strings.unicode_transcode(text_utf8, input_encoding='UTF8', output_encoding='UTF-16-BE')

<tf.Tensor: id=26, shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
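
The three conversions can be cross-checked without TensorFlow. This is a minimal sketch using Python's built-in `str.encode` and `bytes.decode`; it mirrors the idea of decode, encode and transcode, but it is not the `tf.strings` implementation.

```python
text = u"语言处理"
utf8_bytes = text.encode("utf-8")

# "decode": encoded string -> list of code points
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# "encode": list of code points -> encoded string
reencoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# "transcode": UTF-8 bytes -> UTF-16-BE bytes, via the decoded text
utf16be_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)
print(reencoded == utf8_bytes)
print(utf16be_bytes)
```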

## Batch Dimensions
#### When decoding multiple strings, of different lengths - result is a tf.RaggedTensor

In [11]:
# batch of UTF-8 encoded unicode strings

batch_utf8 = [s.encode('UTF-8') for s in
             [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]

batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')

for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


#### tf.RaggedTensor can be used directly 
* or use tf.RaggedTensor.to_tensor - to convert it to a dense tf.Tensor with padding
* or use tf.RaggedTensor.to_sparse - to convert it to tf.SparseTensor

In [12]:
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded)

tf.Tensor(
[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]], shape=(4, 28), dtype=int32)
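
Conceptually, padding extends every row to the length of the longest row. The sketch below shows that idea on plain Python lists; `pad_rows` is a hypothetical helper written for this note, not a TensorFlow function.

```python
def pad_rows(rows, default_value=-1):
    """Pad each row on the right so that all rows share the same length."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [default_value] * (width - len(row)) for row in rows]

# Three of the decoded rows from the batch above (the long sentence is omitted).
ragged = [[104, 195, 108, 108, 111],
          [71, 246, 246, 100, 110, 105, 103, 104, 116],
          [128522]]

padded = pad_rows(ragged)
for row in padded:
    print(row)
```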


In [13]:
batch_chars_sparse = batch_chars_ragged.to_sparse()
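
A sparse representation stores only the positions that hold a value. The sketch below builds the three pieces a `tf.SparseTensor` carries (indices, values, dense shape) from plain Python lists. `to_coo` is a hypothetical helper for illustration, assuming row-major ordering of the indices.

```python
def to_coo(rows):
    """Return (indices, values, dense_shape) for a list of variable-length rows."""
    indices = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    values = [value for row in rows for value in row]
    dense_shape = (len(rows), max((len(row) for row in rows), default=0))
    return indices, values, dense_shape

indices, values, dense_shape = to_coo([[99, 97, 116], [128522]])
print(indices)
print(values)
print(dense_shape)
```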

#### tf.Tensor may be used - when encoding multiple strings of the SAME length

In [14]:
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                         output_encoding='UTF-8')

<tf.Tensor: id=130, shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

#### tf.RaggedTensor should be used - when encoding multiple strings with varying lengths

In [16]:
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')

<tf.Tensor: id=132, shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

#### With a tensor with multiple strings in padded, or sparse format:
* convert it to a tf.RaggedTensor - before calling unicode_encode

In [17]:
tf.strings.unicode_encode(tf.RaggedTensor.from_sparse(batch_chars_sparse),
                         output_encoding='UTF-8')

<tf.Tensor: id=215, shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

## Unicode Operations
### Character Length - tf.strings.length operation has a parameter, unit
* unit - defaults to "BYTE" - but can be set to other values, such as "UTF8_CHAR" - to determine number of unicode codepoints in each encoded string

In [18]:
thanks = u'Thanks 😊'.encode('UTF-8')

num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit = 'UTF8_CHAR').numpy()

print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))

# final character takes up 4 bytes

11 bytes; 8 UTF-8 characters
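
The difference between the two units can be reproduced in plain Python. This sketch counts UTF-8 characters by skipping continuation bytes (those of the form `0b10xxxxxx`); it assumes the input is valid UTF-8.

```python
thanks = u"Thanks 😊".encode("utf-8")

# Length in bytes.
num_bytes = len(thanks)

# Length in characters: count every byte that is NOT a continuation byte.
num_chars = sum(1 for byte in thanks if (byte & 0xC0) != 0x80)

print(num_bytes, num_chars)
```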


### Character Substrings - tf.strings.substr operation accepts unit parameter
* unit determines how the offsets in the 'pos' and 'len' parameters are interpreted - as bytes or as characters
* DEFAULT - unit=BYTE

In [19]:
tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy()

b'\xf0\x9f\x98\x8a'
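
To see why the unit matters, compare a character-based slice with a byte-based slice in plain Python. `substr_utf8_chars` is a hypothetical helper for this note; it only imitates the character-offset behaviour and is not the TensorFlow op.

```python
def substr_utf8_chars(data, pos, length):
    """Slice UTF-8 bytes using character offsets instead of byte offsets."""
    return data.decode("utf-8")[pos:pos + length].encode("utf-8")

thanks = u"Thanks 😊".encode("utf-8")

char_slice = substr_utf8_chars(thanks, pos=7, length=1)  # the whole emoji
byte_slice = thanks[7:8]                                 # only its first byte

print(char_slice)
print(byte_slice)
```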

### Split Unicode Strings
* tf.strings.unicode_split - splits unicode strings into substrings of individual characters

In [20]:
tf.strings.unicode_split(thanks, 'UTF-8').numpy()

array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

### Byte Offsets for Characters
* to align the character string (generated by tf.strings.unicode_decode) with the original string - helpful to know the Offset (for where each character begins)
* tf.strings.unicode_decode_with_offsets is similar (to unicode_decode) - but returns a 2nd tensor (containing start offsets of each character)

In [22]:
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))

At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
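
The offsets are running sums of the encoded length of each preceding character. A minimal pure-Python sketch of that bookkeeping follows; `decode_with_offsets` is a hypothetical helper, not the TensorFlow op.

```python
def decode_with_offsets(text, encoding="utf-8"):
    """Return (code_points, byte_offsets) for each character of text."""
    code_points, offsets = [], []
    offset = 0
    for ch in text:
        code_points.append(ord(ch))
        offsets.append(offset)
        # Advance by the number of bytes this character occupies when encoded.
        offset += len(ch.encode(encoding))
    return code_points, offsets

code_points, offsets = decode_with_offsets(u"🎈🎉🎊")
print(code_points)
print(offsets)
```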


## Unicode Scripts
* each unicode code point - belongs to a single collection of codepoints - known as a Script
* character's Script - helpful in determining which language the character may belong to
* tf.strings.unicode_script - determines which script a given codepoint uses

#### Script codes are int32 values - corresponding to International Components for Unicode (ICU) UScriptCode values

In [23]:
uscript = tf.strings.unicode_script([33464, 1041])

print(uscript.numpy())

[17  8]


In [24]:
# [33464, 1041] ['芸', 'Б'] 
# [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
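
Raw script codes are hard to read, so a small lookup table can help when inspecting results. The table below covers only the codes that appear in these notes. The values for HAN, CYRILLIC and COMMON come from the notes themselves; LATIN = 25 and HIRAGANA = 20 are taken from ICU's UScriptCode enum as an assumption and should be checked against the ICU headers. `script_name` is a hypothetical helper.

```python
# Partial mapping of ICU UScriptCode values (only codes seen in these notes).
SCRIPT_NAMES = {
    0: "USCRIPT_COMMON",
    8: "USCRIPT_CYRILLIC",
    17: "USCRIPT_HAN",
    20: "USCRIPT_HIRAGANA",  # assumption: value from ICU's UScriptCode enum
    25: "USCRIPT_LATIN",     # assumption: value from ICU's UScriptCode enum
}

def script_name(code):
    """Return a readable name for a script code, or a placeholder if unlisted."""
    return SCRIPT_NAMES.get(code, "UNLISTED({})".format(code))

labels = [script_name(code) for code in [17, 8]]
print(labels)
```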

#### tf.strings.unicode_script - can be applied to multidimensional tf.Tensors or tf.RaggedTensors (of codepoints)

In [25]:
print(tf.strings.unicode_script(batch_chars_ragged))

<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>


# Example - Simple Segmentation
### Segmentation - task of splitting text into word-like units
* in web text, different languages and scripts are frequently mixed together
* rough segmentation (without ML) can be performed - using changes in script to approximate word boundaries
* This will work for most languages that use Spaces (space characters all classified as USCRIPT_COMMON)

In [26]:
# dtype - string; shape - [num_sentences]

sentence_texts = [u'Hello, world.', u'世界こんにちは']

#### Decode the sentences into character codepoints - Find the script Identifier for each character

In [27]:
# dtype - int32; shape - [num_sentences, (num_chars_per_sentence)]

# sentence_char_codepoint[i, j] - codepoint for jth character - in ith sentence

sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

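# dtype - int32; shape - [num_sentences, (num_chars_per_sentence)]

# sentence_char_script[i, j] - script code for jth character - in ith sentence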

sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)

<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>


#### Use Script Identifiers
* determine where word boundaries should be added
* add a word boundary at beginning of each sentence - and for each character whose script differs from previous character

In [28]:
# dtype - bool; shape - [num_sentences, (num_chars_per_sentence)]

# sentence_char_starts_word[i, j] - True, if jth character in ith sentence, starts a word

sentence_char_starts_word = tf.concat([tf.fill([sentence_char_script.nrows(), 1], True),
                                      tf.not_equal(sentence_char_script[:, 1:], 
                                                   sentence_char_script[:, :-1])],
                                     axis=1)


# dtype - int64; shape - [num_words]

# word_starts[i] - index, of character starting the ith word - in flattened list of all chars

word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
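
The same boundary rule can be written as an explicit loop, which may make the tensor expression easier to follow. This pure-Python sketch takes the script codes printed earlier as input; `word_start_offsets` is a hypothetical helper.

```python
def word_start_offsets(script_rows):
    """Flattened indices of characters that start a word.

    A character starts a word if it is first in its sentence or if its
    script differs from the previous character's script.
    """
    starts = []
    offset = 0  # index of the current sentence's first char in the flat list
    for row in script_rows:
        for j, script in enumerate(row):
            if j == 0 or script != row[j - 1]:
                starts.append(offset + j)
        offset += len(row)
    return starts

sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]
word_starts = word_start_offsets(sentence_char_script)
print(word_starts)
```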


#### Can use these Start Offsets - to build a RaggedTensor - containing list of words from all batches

In [29]:
# dtype - int32; shape - [num_words, (num_chars_per_word)]

# word_char_codepoint[i, j] - codepoint for jth character in ith word

word_char_codepoint = tf.RaggedTensor.from_row_starts(values = sentence_char_codepoint.values,
                                                     row_starts = word_starts)
print(word_char_codepoint)

<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
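
Building rows from start offsets amounts to slicing the flat value list between consecutive starts. A minimal sketch on plain lists follows; `split_by_row_starts` is a hypothetical helper that imitates the idea, not the RaggedTensor implementation.

```python
def split_by_row_starts(values, row_starts):
    """Slice a flat list into rows; row i spans row_starts[i]..row_starts[i+1]."""
    bounds = list(row_starts) + [len(values)]
    return [values[bounds[i]:bounds[i + 1]] for i in range(len(row_starts))]

flat_codepoints = [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46,
                   19990, 30028, 12371, 12435, 12395, 12385, 12399]
words = split_by_row_starts(flat_codepoints, [0, 5, 7, 12, 13, 15])
for word in words:
    print(word)
```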


#### Finally, Segment the word codepoints RaggedTensor - back into sentences

In [31]:
# dtype - int64; shape - [num_sentences]

# sentence_num_words[i] - number of words in ith sentence

sentence_num_words = tf.reduce_sum(tf.cast(sentence_char_starts_word, tf.int64),
                                  axis=1)

# dtype - int32; shape - [num_sentences, (num_words_per_sentence), (num_chars_per_word)]

# sentence_word_char_codepoint[i, j, k] - is codepoint for kth character, in jth word, in 
# ith sentence

sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(values = word_char_codepoint,
                                                               row_lengths = sentence_num_words)
print(sentence_word_char_codepoint)

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
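
Grouping by row lengths is the complementary operation: consume the word list in chunks whose sizes are the per-sentence word counts. A short pure-Python sketch, with `group_by_row_lengths` as a hypothetical helper:

```python
def group_by_row_lengths(values, row_lengths):
    """Split values into consecutive rows with the given lengths."""
    rows, start = [], 0
    for length in row_lengths:
        rows.append(values[start:start + length])
        start += length
    return rows

words = [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46],
         [19990, 30028], [12371, 12435, 12395, 12385, 12399]]

# 4 words in the first sentence, 2 in the second.
sentences = group_by_row_lengths(words, [4, 2])
print([len(sentence) for sentence in sentences])
```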


#### Final Result - can be made easier to read - by Encoding back into UTF-8 strings

In [32]:
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()

[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
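
For comparison, the whole segmentation can be sketched in a few lines of plain Python with `itertools.groupby`, using the script codes printed earlier as input. This is an illustrative equivalent, not the TensorFlow pipeline; `segment_by_script` is a hypothetical helper. It returns text rather than bytes, which is easier to read.

```python
from itertools import groupby

def segment_by_script(text, scripts):
    """Split text wherever the per-character script code changes."""
    words, index = [], 0
    for _, run in groupby(scripts):
        run_length = len(list(run))
        words.append(text[index:index + run_length])
        index += run_length
    return words

sentence_texts = [u"Hello, world.", u"世界こんにちは"]
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]

segmented = [segment_by_script(text, scripts)
             for text, scripts in zip(sentence_texts, sentence_char_script)]
print(segmented)
```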