# Segmentation

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Ali-Alameer/NLP/blob/main/segmentation.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Introduction

NLP models often handle different languages with different character sets. *Unicode* is a standard that assigns a number to the characters of almost all languages. Every Unicode character is identified by a unique integer [code point](https://en.wikipedia.org/wiki/Code_point) between `0` and `0x10FFFF`. A *Unicode string* is a sequence of zero or more code points.
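
As a small illustration of the difference between code points and encoded bytes, here is a plain-Python sketch that does not use TensorFlow. The sample string and the variable names are chosen only for this example.

```python
# Plain-Python sketch (no TensorFlow): code points versus UTF-8 bytes.
text = "NY株価"

# ord() maps a character to its integer code point; chr() is the inverse.
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])

# Every code point lies in the range 0..0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)

# UTF-8 is a variable-length encoding: the two ASCII letters take one byte
# each, while the two CJK characters take three bytes each.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))

# Round trip: rebuild the string from its code points.
rebuilt = "".join(chr(cp) for cp in code_points)
```

The string has four code points but eight UTF-8 bytes, which is why the tutorial works with decoded code points rather than raw bytes.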

This tutorial shows how to represent Unicode strings in TensorFlow and manipulate them using Unicode equivalents of standard string ops. It separates Unicode strings into tokens based on script detection.

In [None]:
import tensorflow as tf
import numpy as np

## Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).

We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This will work for strings like the "NY株価" example above. It will also work for most languages that use spaces, as the space characters of various scripts are all classified as `USCRIPT_COMMON`, a special script code that differs from that of any actual text.
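
The idea can be sketched in plain Python before turning to TensorFlow. The standard library has no script property, so this sketch assumes that the first word of a character's Unicode name (for example `LATIN` or `CJK`) is a good enough stand-in for its script. That proxy, and the helper names `rough_script` and `split_on_script_change`, are assumptions made for illustration and are not how TensorFlow detects scripts.

```python
import itertools
import unicodedata


def rough_script(ch):
    # ASSUMPTION: the first word of the Unicode character name is used as a
    # crude script label, e.g. "LATIN CAPITAL LETTER N" -> "LATIN".
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_on_script_change(text):
    # groupby() starts a new group whenever the script label changes,
    # which is exactly the "boundary at each script change" rule.
    return ["".join(group)
            for _, group in itertools.groupby(text, key=rough_script)]


tokens = split_on_script_change("NY株価")
print(tokens)  # ['NY', '株価']
```

The proxy is much coarser than real script data (punctuation and spaces each get their own label, for instance), so it is only meant to show the grouping logic.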

In [None]:
# dtype: string; shape: [num_sentences]
#
# The sentences to process.  Edit this line to try out different inputs!
# The u'' prefix marks a Unicode string literal. It is required on Python 2;
# on Python 3 every str is already Unicode, so the prefix is optional.
sentence_texts = [u'Hello, world.', u'世界こんにちは']
# Another input to try:
# sentence_texts = ['A A']

First, decode the sentences into character code points, and find the script identifier for each character.

In [None]:
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the Unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
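
Both results are ragged: each sentence has its own number of characters. A rough plain-Python model of that layout, assuming the usual description of a ragged tensor as a flat list of values plus row boundaries, might look like the sketch below. The names `rows`, `flat_values` and `row_splits` are local to this example.

```python
# Plain-Python model of a ragged [num_sentences, (num_chars)] structure.
sentence_texts = ["Hello, world.", "世界こんにちは"]

# One list of code points per sentence; rows have different lengths.
rows = [[ord(ch) for ch in sentence] for sentence in sentence_texts]

# Flatten all rows into one list of values ...
flat_values = [cp for row in rows for cp in row]

# ... and record where each row begins and ends in that flat list.
row_splits = [0]
for row in rows:
    row_splits.append(row_splits[-1] + len(row))

print(row_splits)  # [0, 13, 20]
```

The later steps index into the flattened values, so it helps to keep in mind that the second sentence starts at offset 13.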

The next cells show small examples of the TensorFlow ops used in the word-boundary step further down.

tf.concat: Concatenates tensors along one dimension.

In [None]:
t1 = tf.constant([[1, 2, 3], [4, 5, 6]])
t2 = tf.constant([[7, 8, 9], [10, 11, 12]])
tf.concat([t1, t2], axis=0)

tf.fill: Creates a tensor filled with a scalar value.

In [None]:
tf.fill([2, 3], 9)

In [None]:
tf.fill([sentence_char_script.nrows(), 1], True)

tf.not_equal: Returns the truth value of (x != y) element-wise.

In [None]:
sentence_char_script[:, 1:]

In [None]:
sentence_char_script[:, :-1]

In [None]:
tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])

tf.squeeze: Removes dimensions of size 1 from the shape of a tensor.

In [None]:
t = tf.constant([[[2,3]]])

In [None]:
t.shape

In [None]:
tf.shape(t)

In [None]:
tf.shape(tf.squeeze(t)) 

In [None]:
tf.squeeze(t).shape

In [None]:
tf.shape(tf.squeeze(t, axis=1)) 

In [None]:
tf.squeeze(t, axis=[1]).shape

Use the script identifiers to determine where word boundaries should be added. Add a word boundary at the beginning of each sentence, and at each character whose script differs from that of the previous character.

In [None]:
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word. 
sentence_char_starts_word = tf.concat(
    [
        # A column of True marks a word boundary at the start of each sentence.
        tf.fill([sentence_char_script.nrows(), 1], True),
        # True wherever a character's script differs from the previous one's.
        tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1]),
    ],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
#
# tf.where returns the indices of the True elements as a [num_words, 1] tensor;
# tf.squeeze with axis=1 removes that size-1 dimension, leaving [num_words].
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
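
The boundary rule can be traced by hand in plain Python. The sketch below hard-codes one script code per character for the two sample sentences; the integer values are illustrative and assumed to correspond to Latin, Common, Han and Hiragana, so treat them as placeholders rather than authoritative constants.

```python
# ASSUMPTION: illustrative integer script codes, one per character.
LATIN, COMMON, HAN, HIRAGANA = 25, 0, 17, 20

sentence_scripts = [
    [LATIN] * 5 + [COMMON] * 2 + [LATIN] * 5 + [COMMON],  # "Hello, world."
    [HAN] * 2 + [HIRAGANA] * 5,                           # "世界こんにちは"
]

# First character of each sentence starts a word; every later character
# starts a word only if its script differs from the previous character's.
starts_word = [
    [True] + [cur != prev for prev, cur in zip(row[:-1], row[1:])]
    for row in sentence_scripts
]

# Flatten the flags across sentences and keep the indices that are True.
flat_flags = [flag for row in starts_word for flag in row]
word_starts = [i for i, flag in enumerate(flat_flags) if flag]

print(word_starts)  # [0, 5, 7, 12, 13, 15]
```

The offsets 13 and 15 belong to the second sentence, because the indices refer to the flattened list of all characters.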

You can then use those start offsets to build a `RaggedTensor` containing the list of words from all sentences.

In [None]:
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word. RaggedTensor.from_row_starts builds a RaggedTensor whose rows
# begin at the offsets given by row_starts.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
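
Partitioning by row starts amounts to slicing the flat character list from each start offset up to the next one. A minimal plain-Python sketch, assuming the start offsets `[0, 5, 7, 12, 13, 15]` for the two sample sentences:

```python
# Flattened characters of both sample sentences, in order.
flat_chars = list("Hello, world." + "世界こんにちは")

# ASSUMPTION: start offsets of each word in the flattened list.
word_starts = [0, 5, 7, 12, 13, 15]

# Each row ends where the next one starts; the last row runs to the end.
row_ends = word_starts[1:] + [len(flat_chars)]

words = ["".join(flat_chars[start:end])
         for start, end in zip(word_starts, row_ends)]

print(words)  # ['Hello', ', ', 'world', '.', '世界', 'こんにちは']
```

At this point the sentence structure is lost: the result is one flat list of words.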

To finish, segment the word codepoints `RaggedTensor` back into sentences and encode into UTF-8 strings for readability.

In [None]:
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)

tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
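
Regrouping by row lengths is the mirror image of the previous step: consume the flat word list in chunks, one chunk per sentence. The following plain-Python sketch assumes the word list and the per-sentence word counts shown in its comments.

```python
# Flat list of words from all sentences.
words = ["Hello", ", ", "world", ".", "世界", "こんにちは"]

# ASSUMPTION: number of words in each sentence, in order.
sentence_num_words = [4, 2]

sentences = []
offset = 0
for count in sentence_num_words:
    # Take the next `count` words as one sentence.
    sentences.append(words[offset:offset + count])
    offset += count

# The row lengths must account for every word exactly once.
assert offset == len(words)

print(sentences)  # [['Hello', ', ', 'world', '.'], ['世界', 'こんにちは']]
```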