<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/Preprocessing/Tensorflow%20Text%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Setup

In [2]:
!pip install -q tensorflow-text

In [3]:
import tensorflow as tf
import numpy as np
import tensorflow_text as tf_text
import re

In [4]:
text = tf.constant(u"Thanks 😊")

In [5]:
text

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

### RegexSplitter

In [6]:
text_input = [
    "Hi there.\nWhat time is it?\nIt is gametime.",
    "Who let the dogs out?\nWho?\nWho?\nWho?\n\n",
]

In [7]:
# With no arguments, RegexSplitter splits each string on newlines.
splitter = tf_text.RegexSplitter()
print(splitter.split(text_input))

<tf.RaggedTensor [[b'Hi there.', b'What time is it?', b'It is gametime.'], [b'Who let the dogs out?', b'Who?', b'Who?', b'Who?']]>


In [8]:
# Splitting on tabs instead: the inputs contain no tabs, so each string stays whole.
splitter = tf_text.RegexSplitter(split_regex='\t')
print(splitter.split(text_input))

<tf.RaggedTensor [[b'Hi there.\nWhat time is it?\nIt is gametime.'], [b'Who let the dogs out?\nWho?\nWho?\nWho?\n\n']]>


### WhitespaceTokenizer

In [9]:
[i.numpy() for i in tf_text.WhitespaceTokenizer().tokenize("I am Satwik Ram")]

[b'I', b'am', b'Satwik', b'Ram']

### WordShape

In [10]:
tf_text.WordShape.BEGINS_WITH_OPEN_QUOTE

<WordShape.BEGINS_WITH_OPEN_QUOTE: '``.*|["\'`＇＂‘‚‛“«„‟‹「『〝⹂｢﹁﹃][^"\'`＇＂‘‚‛“«„‟‹「『〝⹂｢﹁﹃]*'>

In [11]:
tf_text.WordShape.HAS_MATH_SYMBOL

<WordShape.HAS_MATH_SYMBOL: '.*\\p{Sm}.*'>

In [12]:
tf_text.WordShape.HAS_NO_DIGITS

<WordShape.HAS_NO_DIGITS: '\\P{Nd}*'>

In [13]:
# Named `sentence` rather than `input` to avoid shadowing the Python builtin.
sentence = "1. This is first index"
tf_text.wordshape(sentence.split(), tf_text.WordShape.HAS_NO_DIGITS)

<tf.Tensor: shape=(5,), dtype=bool, numpy=array([False,  True,  True,  True,  True])>

In [14]:
tf_text.wordshape(sentence.split(), tf_text.WordShape.HAS_NO_PUNCT_OR_SYMBOL)

<tf.Tensor: shape=(5,), dtype=bool, numpy=array([False,  True,  True,  True,  True])>

`WordShape` defines many more patterns that are useful during text cleaning. The full list is in the API reference: https://www.tensorflow.org/text/api_docs/python/text/WordShape

### Case Fold Utf8

In [19]:
tf_text.case_fold_utf8(['The  Quick-Brown',
                        'CAT jumped over',
                        'the lazy dog  !!  '])

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'the  quick-brown', b'cat jumped over', b'the lazy dog  !!  '],
      dtype=object)>

`case_fold_utf8` accepts a Tensor or RaggedTensor of any shape and returns an output of the same shape. Note that NFKC normalization is implicitly applied to the strings. Whitespace and punctuation are left as they are, so the repeated spaces in the output above survive.

### Ngrams

In [22]:
input_data = tf.ragged.constant([["Satwik", "Ram", "K"], ["Hi", "Bye"]])

tf_text.ngrams(
  input_data,
  width=2,
  axis=-1,
  reduction_type=tf_text.Reduction.STRING_JOIN,
  string_separator="|")


<tf.RaggedTensor [[b'Satwik|Ram', b'Ram|K'], [b'Hi|Bye']]>

In [23]:
# With width=3, rows that have fewer than three tokens produce no n-grams.
tf_text.ngrams(
  input_data,
  width=3,
  axis=-1,
  reduction_type=tf_text.Reduction.STRING_JOIN,
  string_separator="|")

<tf.RaggedTensor [[b'Satwik|Ram|K'], []]>

### Normalize Utf8

In [26]:
tf_text.normalize_utf8(["株式会社", "ＫＡＤＯＫＡＷＡ"], normalization_form='NFKD')

<tf.Tensor: shape=(2,), dtype=string, numpy=
array([b'\xe6\xa0\xaa\xe5\xbc\x8f\xe4\xbc\x9a\xe7\xa4\xbe', b'KADOKAWA'],
      dtype=object)>