![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/component_examples/text_pre_processing_and_cleaning/NLU_lemmatization.ipynb)

# Lemmatization with NLU 

Lemmatizing returns the base form, the so-called lemma, of every token in the input data.

For example, 'He was hungry' becomes 'He be hungry'.

The Lemmatizer works from a dictionary and takes context into account. This lets it derive different base forms for the same word in two different contexts, depending on the Part of Speech tags.



This is the main difference from stemming, which solves the same problem by applying a heuristic process that removes the ends of words.
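To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not how NLU or Spark NLP implement lemmatization: the dictionary `LEMMA_DICT`, the helpers `toy_lemmatize` and `toy_stem`, and the suffix list are all hypothetical names and data invented for this illustration.

```python
# Toy illustration only: a tiny hand-written dictionary, not the data NLU uses.
LEMMA_DICT = {
    ("was", "VERB"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(token, pos):
    """Look up (token, POS tag) in the dictionary; fall back to the token itself."""
    return LEMMA_DICT.get((token.lower(), pos), token)


def toy_stem(token):
    """Strip the first matching suffix, ignoring context entirely."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


print(toy_lemmatize("meeting", "VERB"))  # meet
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("was", "VERB"))      # be
print(toy_stem("meeting"))               # meet
print(toy_stem("was"))                   # was
```

The dictionary lookup can return 'meet' or 'meeting' depending on the tag, and it maps the irregular 'was' to 'be'. The suffix-stripping heuristic always produces 'meet' and leaves 'was' unchanged.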


# 1. Install Java and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
  

import nlu

--2021-05-01 23:18:27--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671 (1.6K) [text/plain]
Saving to: ‘STDOUT’


2021-05-01 23:18:27 (35.6 MB/s) - written to stdout [1671/1671]

Installing  NLU 3.0.0 with  PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...
  Building wheel for pyspark (setup.py) ... done


# 2. Load model and lemmatize a sample string

In [None]:
import nlu
pipe = nlu.load('en.lemma')
pipe.predict('He was suprised by the diversity of NLU')

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


|   | document | sentence | token | lem |
|---|----------|----------|-------|-----|
| 0 | He was suprised by the diversity of NLU | [He was suprised by the diversity of NLU] | [He, was, suprised, by, the, diversity, of, NLU] | [He, be, suprise, by, the, diversity, of, NLU] |


# 3. Get one row per lemmatized token by setting output_level to token
This lets us compare each original token with its lemma.

In [None]:
pipe.predict('He was suprised by the diversity of NLU', output_level='token')

|   | token | lem |
|---|-------|-----|
| 0 | He | He |
| 0 | was | be |
| 0 | suprised | suprise |
| 0 | by | by |
| 0 | the | the |
| 0 | diversity | diversity |
| 0 | of | of |
| 0 | NLU | NLU |
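As a small follow-up, the token-level rows can be filtered to show only the tokens the Lemmatizer actually changed. The sketch below uses plain Python with the token/lemma pairs copied by hand from the output above, so it runs without NLU installed. The names `rows` and `changed` are made up for this example.

```python
# Token/lemma pairs copied from the token-level output above.
rows = [
    ("He", "He"),
    ("was", "be"),
    ("suprised", "suprise"),
    ("by", "by"),
    ("the", "the"),
    ("diversity", "diversity"),
    ("of", "of"),
    ("NLU", "NLU"),
]

# Keep only the rows where the lemma differs from the original token.
changed = [(token, lemma) for token, lemma in rows if token != lemma]

print(changed)  # [('was', 'be'), ('suprised', 'suprise')]
```

Only two of the eight tokens differ from their lemma in this sentence.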


# 4. Check out the lemma models NLU offers for languages other than English

In [None]:
nlu.print_all_model_kinds_for_action('lemma')

For language <nl> NLU provides the following Models : 
nlu.load('nl.lemma') returns Spark NLP model lemma
For language <en> NLU provides the following Models : 
nlu.load('en.lemma') returns Spark NLP model lemma_antbnc
nlu.load('en.lemma.antbnc') returns Spark NLP model lemma_antbnc
For language <fr> NLU provides the following Models : 
nlu.load('fr.lemma') returns Spark NLP model lemma
For language <de> NLU provides the following Models : 
nlu.load('de.lemma') returns Spark NLP model lemma
For language <it> NLU provides the following Models : 
nlu.load('it.lemma') returns Spark NLP model lemma_dxc
nlu.load('it.lemma.dxc') returns Spark NLP model lemma_dxc
For language <nb> NLU provides the following Models : 
nlu.load('nb.lemma') returns Spark NLP model lemma
For language <pl> NLU provides the following Models : 
nlu.load('pl.lemma') returns Spark NLP model lemma
For language <pt> NLU provides the following Models : 
nlu.load('pt.lemma') returns Spark NLP model lemma
For language <ru>
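If you want to work with this listing programmatically, one option is to parse the printed lines yourself. The sketch below is an assumption-laden illustration: it uses a regular expression that matches the line format shown above, applied to a few lines copied from that output. The names `listing`, `pattern`, and `models_by_lang` are hypothetical, and this is not an NLU API.

```python
import re

# A few lines copied from the listing above (not the full output).
listing = """\
nlu.load('nl.lemma') returns Spark NLP model lemma
nlu.load('en.lemma') returns Spark NLP model lemma_antbnc
nlu.load('en.lemma.antbnc') returns Spark NLP model lemma_antbnc
nlu.load('it.lemma') returns Spark NLP model lemma_dxc
"""

# Capture the NLU reference and the Spark NLP model name from each line.
pattern = re.compile(r"nlu\.load\('([^']+)'\) returns Spark NLP model (\S+)")

models_by_lang = {}
for ref, model in pattern.findall(listing):
    lang = ref.split(".")[0]  # the language code is the first part of the reference
    models_by_lang.setdefault(lang, []).append((ref, model))

for lang, entries in sorted(models_by_lang.items()):
    print(lang, entries)
```

Grouping by the leading language code makes it easy to see that several references, such as 'en.lemma' and 'en.lemma.antbnc', resolve to the same Spark NLP model.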

## 4.1 Let's try German lemmatization!

In [None]:
nlu.load('de.lemma').predict("Er war von der Vielfältigkeit des NLU Packets begeistert", output_level='token')

lemma download started this may take some time.
Approximate size to download 4 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


|   | token | lem |
|---|-------|-----|
| 0 | Er | Er |
| 0 | war | sein |
| 0 | von | von |
| 0 | der | der |
| 0 | Vielfältigkeit | Vielfältigkeit |
| 0 | des | der |
| 0 | NLU | NLU |
| 0 | Packets | Packets |
| 0 | begeistert | begeistern |
