NLP models often handle different languages with different character sets. Unicode is a standard encoding system that is used to represent characters from almost all languages.
Every Unicode character is encoded using a unique integer code point between 0 and 0x10FFFF.
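
As a quick illustration of the code point range, here is a minimal plain-Python sketch (no TensorFlow needed). It relies only on the built-ins ord and chr; the variable names are made up for this example.

```python
# Each character maps to exactly one integer code point.
for ch in "a\u00e9语😊":
    print(ch, ord(ch), hex(ord(ch)))

# chr() is the inverse of ord().
smiley = chr(0x1F60A)

# 0x10FFFF is the largest valid code point; Python rejects anything above it.
max_code_point = 0x10FFFF
try:
    chr(max_code_point + 1)
    above_max_ok = True
except ValueError:
    above_max_ok = False
```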

In [None]:
import tensorflow as tf
import numpy as np

In [None]:
tf.constant(u"Thanks 😊").shape  # Unicode strings are UTF-8 encoded by default

## Representing Unicode

There are two standard ways to represent a Unicode string in TF:
*   string scalar - where the sequence of code points is encoded using a known character encoding.
*   int32 vector - where each position contains a single code point.



For example, the following three values all represent the Unicode string "语言处理" (which means "language processing" in Chinese):

In [None]:
# Unicode string, represented as a UTF-8 encoded string scalar
text_utf8 = tf.constant(u"语言处理")
text_utf8

In [None]:
# Unicode string, represented as a UTF-16-BE encoded string scalar
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

In [None]:
# Unicode string, represented as a vector of Unicode code points
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

### Converting between representations

TF provides operations to convert between these different representations:
*   tf.strings.unicode_decode: Converts an encoded string scalar to a vector of code points.
*   tf.strings.unicode_encode: Converts a vector of code points to an encoded string scalar.
*   tf.strings.unicode_transcode: Converts an encoded string scalar to a different encoding.
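
For intuition, the three conversions can be mimicked in plain Python. This is only a sketch of the idea using str.encode and bytes.decode, not how the TF ops are implemented; the variable names are illustrative.

```python
text = "语言处理"

# Decode direction: UTF-8 bytes -> list of code points.
utf8_bytes = text.encode("utf-8")
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# Encode direction: list of code points -> UTF-8 bytes.
re_encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# Transcode direction: UTF-8 bytes -> UTF-16-BE bytes.
utf16be_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)
print(len(utf8_bytes), len(utf16be_bytes))
```

The same text takes a different number of bytes in each encoding, while the code point sequence stays the same.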

In [None]:
text_chars_converted = tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
assert tf.reduce_all(tf.equal(text_chars, text_chars_converted))

In [None]:
text_utf8_converted = tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')
assert tf.reduce_all(tf.equal(text_utf8, text_utf8_converted))

In [None]:
text_utf16be_converted = tf.strings.unicode_transcode(text_utf8, input_encoding='UTF-8', output_encoding='UTF-16-BE')
assert tf.reduce_all(tf.equal(text_utf16be, text_utf16be_converted))

### Batch dimensions

When decoding multiple strings, the number of characters in each string may not be equal.
The result is a tf.RaggedTensor, where the length of the innermost dimension varies depending on the number of characters in each string.

Note: A RaggedTensor is a tensor with one or more ragged dimensions, which are dimensions whose slices may have different lengths. For example, the inner (column) dimension of rt=[[3, 1, 4, 1], [], [5, 9, 2], [6], []] is ragged, since the column slices (rt[0, :], ..., rt[4, :]) have different lengths.
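
The rt example can be pictured with ordinary nested lists. The sketch below is plain Python, not the tf.RaggedTensor API; it just shows the differing row lengths and what padding to a rectangular shape looks like. The padding value -1 is an arbitrary choice for illustration.

```python
rt = [[3, 1, 4, 1], [], [5, 9, 2], [6], []]

# Each row has its own length, which is what makes the dimension ragged.
row_lengths = [len(row) for row in rt]

# Padding every row to the longest length gives a rectangular (dense) layout.
width = max(row_lengths)
padded = [row + [-1] * (width - len(row)) for row in rt]

print(row_lengths)
for row in padded:
    print(row)
```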

In [None]:
# A batch of Unicode strings, each represented as a UTF8-encoded string
batch_utf8 = [s.encode('UTF-8') for s in [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')

for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)

In [None]:
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

In [None]:
# A nicer way to represent the ragged tensor
batch_chars_sparse = batch_chars_ragged.to_sparse()

nrows, ncols = batch_chars_sparse.dense_shape.numpy()
elements = [['_' for i in range(ncols)] for j in range(nrows)]
for (row, col), value in zip(batch_chars_sparse.indices.numpy(), batch_chars_sparse.values.numpy()):
  elements[row][col] = str(value)
# max_width = max(len(value) for row in elements for value in row)
value_lengths = []
for row in elements:
  for value in row:
    value_lengths.append(len(value))
max_width = max(value_lengths)
print('[%s]' % '\n '.join(
    '[%s]' % ', '.join(value.rjust(max_width) for value in row)
    for row in elements))

When encoding multiple strings with the same length, use a tf.Tensor as the input.

In [None]:
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                          output_encoding='UTF-8')

When encoding multiple strings with varying lengths, use a tf.RaggedTensor as the input instead.

In [None]:
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
# Here batch_chars_ragged is a tf.RaggedTensor

If you have a tensor with multiple strings in padded or sparse format, convert it into a tf.RaggedTensor first, before calling tf.strings.unicode_encode.

In [None]:
tf.strings.unicode_encode(tf.RaggedTensor.from_sparse(batch_chars_sparse), output_encoding='UTF-8')

In [None]:
tf.strings.unicode_encode(tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1), output_encoding='UTF-8')

## Unicode operations

### Character length

Use the unit parameter of the tf.strings.length op to indicate how character lengths should be computed. unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR", to determine the number of Unicode codepoints in each encoded string.
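
To see why the byte count and the character count can differ, here is a small plain-Python sketch that lists how many UTF-8 bytes each character needs. It does not use tf.strings.length; the variable names are illustrative.

```python
thanks_text = "Thanks 😊"

# Number of UTF-8 bytes used by each individual character.
bytes_per_char = [len(ch.encode("utf-8")) for ch in thanks_text]

byte_count = sum(bytes_per_char)   # comparable to counting in bytes
char_count = len(thanks_text)      # comparable to counting in code points

print(bytes_per_char)
print(byte_count, char_count)
```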

In [None]:
# Note that the final character (emoji) takes up 4 bytes in UTF-8.
thanks = u'Thanks 😊'.encode('UTF-8')
print(thanks)
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))

### Character substrings

The tf.strings.substr op accepts the unit parameter, and uses it to determine what kind of offsets the pos and len parameters contain.
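
The difference between byte offsets and character offsets can be sketched in plain Python by slicing the encoded bytes versus slicing the decoded text. This only mirrors the idea; it is not tf.strings.substr itself.

```python
text = "Thanks 😊"
data = text.encode("utf-8")

# Byte offsets: position 7, length 1 picks out only the first byte of the emoji.
byte_slice = data[7:8]

# Character offsets: position 7, length 1 picks out the whole emoji (4 bytes).
char_slice = text[7:8].encode("utf-8")

print(byte_slice)
print(char_slice)
```

A lone byte cut from the middle of a multi-byte character is generally not valid UTF-8 on its own, which is why character offsets are usually what you want for text.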

In [None]:
# Here, unit='BYTE' (default). Returns a single byte (position 7) with len=1
tf.strings.substr(thanks, pos=7, len=1).numpy()

In [None]:
# Specifying unit='UTF8_CHAR' returns a single 4-byte character (the emoji) in this case
tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy()

### Split Unicode strings

The tf.strings.unicode_split op splits Unicode strings into substrings of individual characters.

In [None]:
tf.strings.unicode_split(thanks, input_encoding='UTF-8').numpy()

### Byte offsets for characters

To align the character tensor generated by tf.strings.unicode_decode with the original string, it is useful to know the offset for where each character begins. The method tf.strings.unicode_decode_with_offsets is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.
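
The start offsets are simply running sums of each character's encoded byte length. Here is a minimal plain-Python sketch of that bookkeeping; decode_with_offsets is a hypothetical helper written for this example, not a TF function.

```python
def decode_with_offsets(data: bytes):
    """Return (code_points, byte_offsets) for UTF-8 encoded data."""
    code_points, offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        code_points.append(ord(ch))
        offsets.append(offset)
        # The next character starts after this character's UTF-8 bytes.
        offset += len(ch.encode("utf-8"))
    return code_points, offsets


balloon_cps, balloon_offsets = decode_with_offsets("🎈🎉🎊".encode("utf-8"))
mixed_cps, mixed_offsets = decode_with_offsets("a\u00e9😊".encode("utf-8"))
print(balloon_cps, balloon_offsets)
print(mixed_cps, mixed_offsets)
```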

In [None]:
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u'🎈🎉🎊', 'UTF-8')

print(codepoints.numpy())
print(offsets.numpy())

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print('At byte offset {}: codepoint {}'.format(offset, codepoint))

## Unicode scripts

Each Unicode code point belongs to a single collection of codepoints known as a script. A character's script is helpful in determining which language the character might be in.
<br>TF provides the tf.strings.unicode_script op to determine which script a given codepoint uses. The script codes are int32 values corresponding to ICU [UScriptCode](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uscript_8h.html) values.

In [None]:
uscript = tf.strings.unicode_script([33464, 1041]) # ['芸', 'Б']
print(uscript.numpy()) # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

The tf.strings.unicode_script op can also be applied to multidimensional tf.Tensors or tf.RaggedTensors of codepoints.

In [None]:
tf.strings.unicode_script(batch_chars_ragged)

## Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).
<br>We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This will work for strings like the "NY株価" example above. It will also work for most languages that use spaces, as the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.

In [None]:
# dtype: string; shape: [num_sentences]

# The sentences to process.  Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']

First, decode the sentences into character codepoints, and find the script identifier for each character.

In [None]:
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, input_encoding='UTF-8')
print(sentence_char_codepoint)

sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)

Use the script identifiers to determine where word boundaries should be added. Add a word boundary at the beginning of each sentence, and for each character whose script differs from the previous character.
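
The boundary rule can be sketched on a single sentence with a plain Python list. The script ids below are made-up sample values chosen for illustration, not output from tf.strings.unicode_script.

```python
# Hypothetical per-character script ids for one sentence.
scripts = [25, 25, 25, 0, 0, 25, 25]

# A character starts a word if it is first, or its script differs from the previous one.
starts_word = [
    i == 0 or scripts[i] != scripts[i - 1]
    for i in range(len(scripts))
]

# Indices of the characters that begin a word.
word_starts = [i for i, flag in enumerate(starts_word) if flag]

print(starts_word)
print(word_starts)
```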

In [None]:
sentence_char_starts_word = tf.concat(
    [
    tf.fill([sentence_char_script.nrows(), 1], True),
    tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])
    ], axis=1
)
print(sentence_char_starts_word)

word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

You can use those start offsets to build a RaggedTensor containing the list of words from all batches.
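
Conceptually, building rows from start offsets means cutting a flat list at each offset. The plain-Python sketch below shows that idea on characters instead of codepoints; the row_starts values are written by hand for this example.

```python
values = list("Hello, world.")
row_starts = [0, 5, 7, 12]  # hand-written start offsets for this example

# Each row runs from its start offset to the next start (or the end of the values).
row_ends = row_starts[1:] + [len(values)]
rows = [values[start:end] for start, end in zip(row_starts, row_ends)]
words = ["".join(row) for row in rows]

print(words)
```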

In [None]:
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values = sentence_char_codepoint.values,
    row_starts = word_starts,
)
print(word_char_codepoint)

To finish, segment the word codepoints RaggedTensor back into sentences and encode them into UTF-8 strings for readability.
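
Grouping by row lengths is the mirror image of grouping by start offsets: walk through the flat list and take the stated number of items for each row. A minimal plain-Python sketch, with hand-written words and counts used purely for illustration:

```python
flat_words = ["Hello", ", ", "world", ".", "世界", "こんにちは"]
sentence_num_words = [4, 2]  # hand-written word counts per sentence

sentences = []
start = 0
for count in sentence_num_words:
    # Take the next `count` words as one sentence.
    sentences.append(flat_words[start:start + count])
    start += count

print(sentences)
```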

In [None]:
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1,
)
print(sentence_num_words)

sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values = word_char_codepoint,
    row_lengths = sentence_num_words,
)
print(sentence_word_char_codepoint)

tf.strings.unicode_encode(sentence_word_char_codepoint, output_encoding='UTF-8').to_list()