## Introduction
**NLP** models often handle different languages with different character sets. **Unicode** is a standard encoding system used to represent characters from almost all languages. Every Unicode character is identified by a unique integer **code point** between `0` and `0x10FFFF`. A Unicode string is a sequence of zero or more code points.
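
As a quick illustration in plain Python (no TensorFlow required), the built-in `ord` and `chr` functions map between characters and code points. The sample characters below are chosen only for illustration.

```python
# Map characters to their Unicode code points and back using built-ins.
samples = ["A", "ö", "语", "😊"]
code_points = [ord(ch) for ch in samples]
print(code_points)                      # [65, 246, 35821, 128522]
print([hex(cp) for cp in code_points])

# Every code point lies in the range 0..0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)

# chr() is the inverse of ord().
round_trip = [chr(cp) for cp in code_points]
print(in_range, round_trip == samples)  # True True
```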

In [1]:
import numpy as np
import tensorflow as tf

The basic TensorFlow `tf.string` dtype allows you to build tensors of byte strings. Unicode strings are **UTF-8** encoded by default.

In [2]:
# Create a constant string tensor; the Unicode text is stored as UTF-8 bytes.
tf.constant(u"Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>
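
The output above shows bytes rather than characters: the emoji occupies four bytes in UTF-8. A small plain-Python sketch, independent of TensorFlow, makes the difference between the number of code points and the number of bytes explicit.

```python
text = "Thanks 😊"
encoded = text.encode("utf-8")  # same byte sequence as in the tensor output above

print(encoded)       # b'Thanks \xf0\x9f\x98\x8a'
print(len(text))     # 8  (code points)
print(len(encoded))  # 11 (bytes: 7 ASCII bytes + 4 bytes for the emoji)
```

In TensorFlow, `tf.strings.length` accepts a `unit` argument (`'BYTE'` or `'UTF8_CHAR'`) to choose between these two counts.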

In [5]:
tf.constant([u"You're", u"welcome!"])

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b"You're", b'welcome!'], dtype=object)>

In [6]:
tf.constant([u"You're", u"welcome!"]).shape

TensorShape([2])

### Representing Unicode
There are two standard ways to represent a Unicode string in TensorFlow:

- **string** scalar — where the sequence of code points is encoded using a known character encoding.
- **int32** vector — where each position contains a single code point.

In [10]:
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [11]:
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16_BE = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16_BE

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [14]:
# Unicode string, represented as a vector of Unicode code points.
text_char = tf.constant([ord(char) for char in u"语言处理"])
text_char

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702])>

## Converting between representations
TensorFlow provides operations to convert between these representations:

- `tf.strings.unicode_decode`: Converts an encoded string scalar to a vector of code points.
- `tf.strings.unicode_encode`: Converts a vector of code points to an encoded string scalar.
- `tf.strings.unicode_transcode`: Converts an encoded string scalar to a different encoding.

In [15]:
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702])>

In [16]:
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF-8',
                             output_encoding='UTF-16-BE')

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [17]:
tf.strings.unicode_encode(text_char, output_encoding='UTF-8')

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
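
The three conversions can be cross-checked with Python's standard library alone. This is only a sketch of the same round trip using `str.encode`, `bytes.decode`, `ord`, and `chr`; it does not show how the TensorFlow ops are implemented.

```python
text = "语言处理"
utf8_bytes = text.encode("utf-8")

# Decode: UTF-8 bytes -> list of code points (compare unicode_decode).
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# Encode: list of code points -> UTF-8 bytes (compare unicode_encode).
reencoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# Transcode: UTF-8 bytes -> UTF-16-BE bytes (compare unicode_transcode).
utf16_be = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)              # [35821, 35328, 22788, 29702]
print(reencoded == utf8_bytes)  # True
print(utf16_be)                 # b'\x8b\xed\x8a\x00Y\x04t\x06'
```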

## Batch Dimension
When decoding multiple strings, the number of characters in each string may not be equal. The result is a `tf.RaggedTensor`, where the length of the innermost dimension varies with the number of characters in each string.

In [18]:
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')

for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


You can use this `tf.RaggedTensor` directly, or convert it to a dense `tf.Tensor` with **padding** using `tf.RaggedTensor.to_tensor`, or to a `tf.sparse.SparseTensor` using `tf.RaggedTensor.to_sparse`.
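
Conceptually, padding extends every row to the length of the longest row. The sketch below shows that idea in plain Python; `pad_rows` is a hypothetical helper written for this illustration and is not part of TensorFlow.

```python
def pad_rows(rows, default_value=0):
    """Hypothetical helper: pad variable-length rows to a rectangular list."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [default_value] * (width - len(row)) for row in rows]

# Escapes are used so the code points are unambiguous:
# "hÃllo", "Göödnight", and the smiling-face emoji.
strings = ["h\u00c3llo", "G\u00f6\u00f6dnight", "\U0001F60A"]
ragged = [[ord(ch) for ch in s] for s in strings]

padded = pad_rows(ragged)
for row in padded:
    print(row)
```

Every padded row has the same length as the longest input row, so the result can be stored as a regular rectangular array.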

In [20]:
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=0)
batch_chars_padded.numpy()

array([[   104,    195,    108,    108,    111,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0],
       [    87,    104,     97,    116,     32,    105,    115,     32,
           116,    104,    101,     32,    119,    101,     97,    116,
           104,    101,    114,     32,    116,    111,    109,    111,
           114,    114,    111,    119],
       [    71,    246,    246,    100,    110,    105,    103,    104,
           116,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0],
       [128522,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,
             