![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/collab/Text_Pre_Processing_and_Cleaning/NLU_Normalizer_example.ipynb)
# Normalizing with NLU

The Normalizer cleans dirty characters out of text data and removes punctuation. It can also lowercase the text (see `setLowercase` below).

### Removes all dirty characters from text following a regex pattern.
- Dirty characters are things like !@#$%^&*()?>< etc.
- Useful for reducing the dimension/variance of your data, since fewer distinct symbols will occur
- Useful for cleaning tweets
- Matches slang
- Language independent
- You can use a regex pattern to specify which characters are removed, and therefore which will *not* be removed.

For example, the pattern [a-z] matches all lowercase characters from a to z. With the following setting, the Normalizer will throw away every character that matches:
```
pipe['normalizer'].setCleanupPatterns(['[a-z]'])
```
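To make the effect of a cleanup pattern concrete, here is a minimal pure-Python sketch using the standard `re` module. It is not NLU's implementation: the helper name `remove_matches` is hypothetical, and the sketch only assumes that every match of the pattern is deleted from each token and that tokens left empty are dropped.

```python
import re

def remove_matches(tokens, pattern):
    """Delete every match of `pattern` from each token; drop tokens left empty.

    Hypothetical helper for illustration only, not part of NLU.
    """
    cleaned = (re.sub(pattern, "", token) for token in tokens)
    return [token for token in cleaned if token]

tokens = ["Hello", "W0rld", "!!!"]

# [a-z] matches lowercase letters, so those are the characters that get removed
without_lowercase = remove_matches(tokens, "[a-z]")
print(without_lowercase)  # ['H', 'W0', '!!!']

# The negated class [^a-z] removes everything *except* lowercase letters
only_lowercase = remove_matches(tokens, "[^a-z]")
print(only_lowercase)  # ['ello', 'rld']
```

A negated character class is therefore the usual way to express "keep only these characters".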


# 1. Install Java and NLU

In [None]:

import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu > /dev/null    

## 2. Load Model and normalize sample string

In [None]:
import nlu 


nlu.load('norm').predict('@CKL_IT says: that #normalizers are pretty useful to clean #structured_strings in #NLU like tweets')

| origin_index | sentence | normalized |
|---|---|---|
| 0 | @CKL_IT says: that #normalizers are pretty use... | [CKLIT, says, that, normalizers, are, pretty, ... |
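
The output drops `@`, `#`, `_` and `:` because the default cleanup pattern removes characters that are not letters. The sketch below approximates that behaviour in plain Python as an illustration. It assumes simple whitespace tokenization and uses `str.isalpha()` as a stand-in for the letter class; `keep_letters` is a hypothetical helper, not an NLU function.

```python
def keep_letters(text):
    """Split on whitespace, keep only alphabetic characters, drop empty tokens.

    Hypothetical stand-in for the tokenizer + normalizer, for illustration only.
    """
    cleaned = (
        "".join(ch for ch in token if ch.isalpha())
        for token in text.split()
    )
    return [token for token in cleaned if token]

sample = (
    "@CKL_IT says: that #normalizers are pretty useful to clean "
    "#structured_strings in #NLU like tweets"
)
normalized = keep_letters(sample)
print(normalized[:6])  # ['CKLIT', 'says', 'that', 'normalizers', 'are', 'pretty']
```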


## 2. Configure the normalizer with custom parameters
Use `pipe.print_info()` to see every configurable parameter, with a short description, for each NLU component in the pipeline.
Even though only 'norm' is loaded, many NLU component dependencies are automatically loaded into the pipeline and are also configurable.


With the default settings loaded here, the normalizer leaves the casing of tokens unchanged, as the output of the next cell shows.
Let's change that.

In [None]:
pipe = nlu.load('norm')
pipe.predict('LOWERCASE BY DEFAULT')

| origin_index | sentence | normalized |
|---|---|---|
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE, BY, DEFAULT] |


### 2.1 Print all parameters for all NLU components in the pipeline 


In [None]:
pipe.print_info()


The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['normalizer'] has settable params:
pipe['normalizer'].setCleanupPatterns(['[^\\pL+]'])  | Info: normalization regex patterns which match will be removed from token | Currently set to : ['[^\\pL+]']
pipe['normalizer'].setLowercase(False)               | Info: whether to convert strings to lowercase | Currently set to : False
pipe['normalizer'].setSlangMatchCase(False)          | Info: whether or not to be case sensitive to match slangs. Defaults to false. | Currently set to : False
>>> pipe['default_tokenizer'] has settable params:
pipe['default_tokenizer'].setTargetPattern('\S+')    | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['default_tokenizer'].setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"])  | Info: character list used to separate from token boundaries | Currently set to : ['.', ',', ';', ':',

### 2.2 Configure the Normalizer to lowercase text

In [None]:
pipe['normalizer'].setLowercase(True)      
pipe.predict('LOWERCASE BY DEFAULT')

| origin_index | sentence | normalized |
|---|---|---|
| 0 | LOWERCASE BY DEFAULT | [lowercase, by, default] |


### 2.3 Configure the normalizer to remove strings based on a regex pattern.
Let's remove all occurrences of the lowercase letters x to z with the pattern [x-z].

In [None]:
# Configure the Normalizer 
pipe['normalizer'].setCleanupPatterns(['[x-z]']) 
pipe.predict('From the x to the y to the z')

| origin_index | sentence | normalized |
|---|---|---|
| 0 | From the x to the y to the z | [from, the, to, the, to, the] |


#### NOTE: The regex pattern is applied **BEFORE** lowercasing.
This is why the X, Y and Z tokens are kept in the following example.


In [None]:
# Configure the Normalizer 
pipe['normalizer'].setCleanupPatterns(['[x-z]']) 
pipe.predict('From the X to the Y to the Z')

| origin_index | sentence | normalized |
|---|---|---|
| 0 | From the X to the Y to the Z | [from, the, x, to, the, y, to, the, z] |
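
The order of operations can be reproduced with a small pure-Python sketch. It is only an illustration under the assumption stated in the note (cleanup patterns first, lowercasing second) and is not NLU's actual code; the function `normalize` and its parameters are hypothetical names.

```python
import re

def normalize(tokens, cleanup_patterns, lowercase):
    """Apply cleanup patterns first, then lowercase; drop tokens left empty.

    Hypothetical sketch of the ordering described above, not NLU's code.
    """
    result = []
    for token in tokens:
        # Step 1: remove every match of every cleanup pattern
        for pattern in cleanup_patterns:
            token = re.sub(pattern, "", token)
        # Step 2: lowercase only after the patterns have been applied
        if lowercase:
            token = token.lower()
        if token:
            result.append(token)
    return result

lower_input = normalize("From the x to the y to the z".split(), ["[x-z]"], lowercase=True)
upper_input = normalize("From the X to the Y to the Z".split(), ["[x-z]"], lowercase=True)

print(lower_input)  # ['from', 'the', 'to', 'the', 'to', 'the']
print(upper_input)  # ['from', 'the', 'x', 'to', 'the', 'y', 'to', 'the', 'z']
```

Because `[x-z]` is case sensitive, the uppercase X, Y and Z never match the pattern and are only lowercased afterwards. A pattern such as `[x-zX-Z]` would remove both cases.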


# 3. Get one row per normalized token by setting output_level to token.
This lets us compare each original token with what it was normalized to.

In [None]:
pipe.predict('From the X to the Y to the Z', output_level='token')

| origin_index | token | normalized |
|---|---|---|
| 0 | From | [from, the, x, to, the, y, to, the, z] |
| 0 | the | [from, the, x, to, the, y, to, the, z] |
| 0 | X | [from, the, x, to, the, y, to, the, z] |
| 0 | to | [from, the, x, to, the, y, to, the, z] |
| 0 | the | [from, the, x, to, the, y, to, the, z] |
| 0 | Y | [from, the, x, to, the, y, to, the, z] |
| 0 | to | [from, the, x, to, the, y, to, the, z] |
| 0 | the | [from, the, x, to, the, y, to, the, z] |
| 0 | Z | [from, the, x, to, the, y, to, the, z] |
