![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/text_pre_processing_and_cleaning/NLU_normalizer_example.ipynb)

# Normalizing with NLU

The Normalizer cleans dirty characters out of text data, removes punctuation, and can optionally lowercase the text.

### Removes all dirty characters from text following a regex pattern
- Dirty characters are things like `!@#$%^&*()?><` etc.
- Useful for reducing the dimension/variance of your data, since fewer distinct symbols will occur
- Useful for cleaning tweets
- Matches slang
- Language independent
- You can use a regex pattern to specify which characters will be removed from each token.

E.g. the pattern `[a-z]` matches all characters from a, b, c ... to x, y, z. Every match is removed from the token:
```
pipe['normalizer'].setCleanupPatterns(['[a-z]'])
```
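
The effect of a cleanup pattern can be sketched in plain Python. This is only an illustration built on the standard `re` module, not the implementation used by NLU or Spark NLP. The helper name `clean_token` is made up for this example, and the character class `[^A-Za-z+]` is an ASCII-only stand-in for the `[^\pL+]` pattern that `print_info()` reports further down, since `re` has no `\pL` class.

```python
import re

# Assumption: ASCII-only approximation of the '[^\pL+]' cleanup pattern.
# It matches every character that is neither a letter nor '+'.
CLEANUP_PATTERN = re.compile(r"[^A-Za-z+]")


def clean_token(token: str) -> str:
    """Remove every character that matches the cleanup pattern."""
    return CLEANUP_PATTERN.sub("", token)


tokens = ["@CKL_IT", "says", ":", "#normalizers", "#structured_strings"]

# Clean each token and drop the ones that end up empty (here ':').
cleaned = [c for c in (clean_token(t) for t in tokens) if c]
print(cleaned)  # ['CKLIT', 'says', 'normalizers', 'structuredstrings']
```

Tokens that consist only of matching characters, such as `:`, disappear entirely in this sketch because nothing is left after the substitution.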


# 1. Install Java and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
  

import nlu

--2021-05-01 23:18:29--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671 (1.6K) [text/plain]
Saving to: ‘STDOUT’

Installing NLU 3.0.0 with PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...

2021-05-01 23:18:29 (1000 KB/s) - written to stdout [1671/1671]

Building wheel for pyspark (setup.py) ... done


## 2. Load Model and normalize sample string

In [None]:
import nlu 


nlu.load('norm').predict('@CKL_IT says: that #normalizers are pretty useful to clean #structured_strings in #NLU like tweets')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| index | document | sentence | token | norm |
|---|---|---|---|---|
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | CKLIT |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | says |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | that |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | normalizers |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | are |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | pretty |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | useful |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | to |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | clean |
| 0 | @CKL_IT says: that #normalizers are pretty use... | [@CKL_IT says: that #normalizers are pretty us... | [@CKL_IT, says, :, that, #normalizers, are, pr... | structuredstrings |


## 2. Configure the normalizer with custom parameters
Use `pipe.print_info()` to see all configurable parameters, along with information about each one, for every NLU component in the pipeline.     
Even though only 'norm' is loaded, many NLU component dependencies are automatically loaded into the pipeline and are also configurable. 


Whether tokens are lowercased is controlled by the normalizer's `setLowercase` parameter.     
Let's first check what the freshly loaded pipeline does with an uppercase string.

In [None]:
pipe = nlu.load('norm')
pipe.predict('LOWERCASE BY DEFAULT')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| index | document | sentence | token | norm |
|---|---|---|---|---|
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | LOWERCASE |
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | BY |
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | DEFAULT |


### 2.1 Print all parameters for all NLU components in the pipeline 


In [None]:
pipe.print_info()


The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['normalizer'] has settable params:
pipe['normalizer'].setCleanupPatterns(['[^\\pL+]'])  | Info: normalization regex patterns which match will be removed from token | Currently set to : ['[^\\pL+]']
pipe['normalizer'].setLowercase(False)               | Info: whether to convert strings to lowercase | Currently set to : False
pipe['normalizer'].setSlangMatchCase(False)          | Info: whether or not to be case sensitive to match slangs. Defaults to false. | Currently set to : False
pipe['normalizer'].setMinLength(0)                   | Info: Set the minimum allowed legth for each token | Currently set to : 0
>>> pipe['default_tokenizer'] has settable params:
pipe['default_tokenizer'].setTargetPattern('\S+')    | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['default_tokenizer'].setContextChars(['.', ',', ';', ':', '!', '?', '*', '

### 2.2 Configure the Normalizer not to lowercase text 

In [None]:
pipe['normalizer'].setLowercase(False)
pipe.predict('LOWERCASE BY DEFAULT')

| index | document | sentence | token | norm |
|---|---|---|---|---|
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | LOWERCASE |
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | BY |
| 0 | LOWERCASE BY DEFAULT | [LOWERCASE BY DEFAULT] | [LOWERCASE, BY, DEFAULT] | DEFAULT |


### 2.3 Configure the normalizer to remove strings based on a regex pattern
Let's remove all occurrences of the lowercase letters x to z with the pattern `[x-z]`.

In [None]:
# Configure the Normalizer 
pipe['normalizer'].setCleanupPatterns(['[x-z]']) 
pipe.predict('From the x to the y to the z')

| index | document | sentence | token | norm |
|---|---|---|---|---|
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | From |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | the |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | x |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | to |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | the |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | y |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | to |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | the |
| 0 | From the x to the y to the z | [From the x to the y to the z] | [From, the, x, to, the, y, to, the, z] | z |


#### NOTE: The regex pattern is applied **BEFORE** lowercasing.    
This is why the X, Y, Z tokens are kept in the following example.


In [None]:
# Configure the Normalizer 
pipe['normalizer'].setCleanupPatterns(['[x-z]']) 
pipe.predict('From the X to the Y to the Z')

| index | document | sentence | token | norm |
|---|---|---|---|---|
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | From |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | the |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | X |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | to |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | the |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | Y |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | to |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | the |
| 0 | From the X to the Y to the Z | [From the X to the Y to the Z] | [From, the, X, to, the, Y, to, the, Z] | Z |
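
Why the order of the two steps matters can be shown with a small plain-Python sketch. It only illustrates the ordering described in the note and does not reproduce NLU's output or internals; both helper functions are hypothetical names introduced for this example.

```python
import re


def cleanup_then_lowercase(token: str, pattern: str) -> str:
    """Strip characters matching `pattern`, then lowercase what is left."""
    return re.sub(pattern, "", token).lower()


def lowercase_then_cleanup(token: str, pattern: str) -> str:
    """Lowercase first, then strip characters matching `pattern`."""
    return re.sub(pattern, "", token.lower())


pattern = "[x-z]"
for token in ["X", "Yes"]:
    # Print the token and the result of each ordering side by side.
    print(
        repr(token),
        repr(cleanup_then_lowercase(token, pattern)),
        repr(lowercase_then_cleanup(token, pattern)),
    )
# 'X' 'x' ''
# 'Yes' 'yes' 'es'
```

With cleanup first, the pattern `[x-z]` never sees a lowercase letter in `X`, so the token survives. With lowercasing first, `X` becomes `x` and is then removed.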


# 3. Get one row per normalized token by setting `output_level` to token
This lets us compare what the original token was and what it was normalized to. 

In [None]:
pipe.predict('From the X to the Y to the Z', output_level='token')

| index | token | norm |
|---|---|---|
| 0 | From | [From, the, X, to, the, Y, to, the, Z] |
| 0 | the | [From, the, X, to, the, Y, to, the, Z] |
| 0 | X | [From, the, X, to, the, Y, to, the, Z] |
| 0 | to | [From, the, X, to, the, Y, to, the, Z] |
| 0 | the | [From, the, X, to, the, Y, to, the, Z] |
| 0 | Y | [From, the, X, to, the, Y, to, the, Z] |
| 0 | to | [From, the, X, to, the, Y, to, the, Z] |
| 0 | the | [From, the, X, to, the, Y, to, the, Z] |
| 0 | Z | [From, the, X, to, the, Y, to, the, Z] |
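
The idea of a token-level view, one row per token with the original next to its normalized form, can be sketched without NLU. The function `explode_tokens` is a hypothetical helper for illustration, and `str.lower` is only a placeholder for whatever normalization is configured.

```python
from typing import Callable, Dict, List


def explode_tokens(
    tokens: List[str], normalize: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Return one row per token, pairing each token with its normalized form."""
    return [{"token": t, "norm": normalize(t)} for t in tokens]


# Placeholder normalization: lowercase only.
rows = explode_tokens("From the X to the Y to the Z".split(), str.lower)
for row in rows:
    print(row["token"], "->", row["norm"])
```

Each row carries exactly one original token and one normalized value, which makes it easy to spot the tokens that changed.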
