# Tokenization with NLU 

Tokenization is the process of splitting input text into segments that correspond to words.

For example, 'He was hungry' consists of the tokens [He, was, hungry].
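To make the idea concrete before bringing in NLU, here is a minimal sketch in plain Python that splits on whitespace only. The helper name `whitespace_tokenize` is made up for this illustration; it is not part of NLU and does far less than a real tokenizer (no punctuation handling, no exceptions).

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (illustrative helper, not an NLU function)."""
    return text.split()

print(whitespace_tokenize('He was hungry'))
# ['He', 'was', 'hungry']
```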





# 1. Install Java and NLU

In [1]:

import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu  > /dev/null    

# 2. Load model and tokenize a sample string

In [2]:
import nlu
pipe = nlu.load('tokenize')
pipe.predict('He was suprised by the diversity of NLU')

origin_index,sentence
0,He was suprised by the diversity of NLU


# 3. Get one row per token by setting output_level to 'token'

In [3]:
pipe.predict('He was suprised by the diversity of NLU', output_level='token')

origin_index,token
0,He
0,was
0,suprised
0,by
0,the
0,diversity
0,of
0,NLU
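Every row above carries `origin_index` 0 because all tokens come from the same input string. The sketch below imitates that layout in plain Python for several inputs. It is an assumption-laden illustration, not how NLU builds its DataFrame: `rows_per_token` is a hypothetical helper and it tokenizes with a simple whitespace split.

```python
def rows_per_token(documents):
    """Return (origin_index, token) pairs, one per token (illustrative only)."""
    rows = []
    for origin_index, document in enumerate(documents):
        # Each token keeps the index of the document it came from.
        for token in document.split():
            rows.append((origin_index, token))
    return rows

for row in rows_per_token(['He was hungry', 'NLU tokenizes text']):
    print(row)
```

Tokens from the second string would carry index 1, which is what lets you map token rows back to their source text.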


# 4. Check out possible configurations for the Tokenizer

In [4]:
pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['default_tokenizer'] has settable params:
pipe['default_tokenizer'].setTargetPattern('\S+')    | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['default_tokenizer'].setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"])  | Info: character list used to separate from token boundaries | Currently set to : ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]
pipe['default_tokenizer'].setCaseSensitiveExceptions(True)  | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
pipe['default_tokenizer'].setMinLength(0)            | Info: Set the minimum allowed legth for each token | Currently set to : 0
pipe['default_tokenizer'].setMaxLength(99999)        | Info: Set the maximum allowed legth for each token | Currently set to : 99999
>>> pipe['sentence_detector'] has settable pa
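To get a feel for what the target pattern and the length limits do, here is a rough emulation in plain Python. It is a sketch under assumptions, not the tokenizer's actual implementation: `candidate_tokens` is a hypothetical helper, its keyword arguments merely mirror the setter names listed above, and treating both length bounds as inclusive is an assumption.

```python
import re

def candidate_tokens(text, target_pattern=r'\S+', min_length=0, max_length=99999):
    """Grab candidates with a regex, then filter by length (illustrative only)."""
    candidates = re.findall(target_pattern, text)
    # Assumption: both bounds are inclusive.
    return [t for t in candidates if min_length <= len(t) <= max_length]

print(candidate_tokens('He was suprised by the diversity of NLU', min_length=3))
# ['was', 'suprised', 'the', 'diversity', 'NLU']
```

With the defaults shown in the parameter listing (`\S+`, 0, 99999), every whitespace-delimited chunk is kept.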

# 4.1 Configure context chars
Context chars are the characters the tokenizer separates from token boundaries. By defining custom context chars, we get extra tokens from suffixes made up of those characters.


In [5]:
pipe['default_tokenizer'].setContextChars([',','!','o','d'])
pipe.predict('Hello, world!', output_level='token')

origin_index,token
0,Hell
0,"o,"
0,worl
0,d!
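The result above can be reproduced with a small plain-Python sketch that splits the trailing run of context chars off each whitespace-delimited candidate. This is only a simplified emulation built to match the output shown here; it is not the tokenizer's real algorithm, it ignores prefixes, and `split_suffix_context_chars` is a hypothetical name.

```python
def split_suffix_context_chars(text, context_chars):
    """Split each candidate into a head and a trailing run of context chars (illustrative only)."""
    chars = set(context_chars)
    tokens = []
    for candidate in text.split():
        # Walk back from the end while the characters are context chars.
        cut = len(candidate)
        while cut > 0 and candidate[cut - 1] in chars:
            cut -= 1
        head, tail = candidate[:cut], candidate[cut:]
        if head:
            tokens.append(head)
        if tail:
            tokens.append(tail)
    return tokens

print(split_suffix_context_chars('Hello, world!', [',', '!', 'o', 'd']))
# ['Hell', 'o,', 'worl', 'd!']
```

Because 'o' and 'd' are now context chars, they are pulled off the word endings together with the punctuation, which is why 'Hello,' becomes 'Hell' and 'o,'.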
