![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/collab/Text_Pre_Processing_and_Cleaning/NLU_Lemmatization.ipynb)

# Lemmatization with NLU 

Lemmatization returns the base form, the so-called lemma, of every token in the input data.

For example, 'He was hungry' becomes 'He be hungry'.

The Lemmatizer works from a dictionary and takes context into account. This lets it derive different base forms for the same word in two different contexts, depending on the Part of Speech tags.



This is the main difference from stemming, which tackles the same problem with a heuristic process that strips the endings of words.
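To make the contrast concrete, here is a minimal pure-Python sketch. It is not how NLU or Spark NLP implement lemmatization; the lookup table `LEMMA_DICT`, the suffix list `SUFFIXES`, and the helpers `toy_lemmatize` and `toy_stem` are all hypothetical names invented for this illustration.

```python
# Toy lookup table: (surface form, coarse POS tag) -> lemma.
# Illustrative only; a real lemmatizer dictionary is far larger.
LEMMA_DICT = {
    ("was", "VERB"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

# Suffixes the toy stemmer tries, in order.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_lemmatize(token, pos):
    """Look up the token together with its POS tag; fall back to the token."""
    return LEMMA_DICT.get((token.lower(), pos), token)


def toy_stem(token):
    """Strip the first matching suffix, ignoring context entirely."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(toy_lemmatize("leaves", "NOUN"))  # leaf
print(toy_lemmatize("leaves", "VERB"))  # leave
print(toy_stem("leaves"))               # leav
print(toy_stem("was"))                  # was
```

The dictionary lookup yields a real word that depends on the tag, while the suffix stripper returns the same truncated string regardless of context and cannot handle irregular forms such as 'was'.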


# 1. Install Java and NLU

In [None]:

import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu > /dev/null    

# 2. Load the model and lemmatize a sample string

In [None]:
import nlu
pipe = nlu.load('en.lemma')
pipe.predict('He was suprised by the diversity of NLU')

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


| origin_index | en_lemma | document |
|---|---|---|
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | He was suprised by the diversity of NLU |


# 3. Get one row per lemmatized token by setting `output_level` to token
This lets us compare each original token with what it was lemmatized to.

In [None]:
pipe.predict('He was suprised by the diversity of NLU', output_level='token')

| origin_index | en_lemma | token |
|---|---|---|
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | He |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | was |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | suprised |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | by |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | the |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | diversity |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | of |
| 0 | [He, be, suprise, by, the, diversity, of, NLU] | NLU |
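In the output above every row repeats the full lemma list, so the token-to-lemma comparison still has to be done by eye. The sketch below pairs them up in plain Python. It assumes one lemma per token in the same order, which holds for this sample but is not guaranteed in general; `pair_tokens_with_lemmas` is a hypothetical helper, not part of NLU, and the two lists are copied by hand from the output.

```python
# Values copied by hand from the output above.
tokens = ["He", "was", "suprised", "by", "the", "diversity", "of", "NLU"]
lemmas = ["He", "be", "suprise", "by", "the", "diversity", "of", "NLU"]


def pair_tokens_with_lemmas(tokens, lemmas):
    """Align tokens and lemmas by position (assumes a 1:1 mapping)."""
    if len(tokens) != len(lemmas):
        raise ValueError("expected exactly one lemma per token")
    return list(zip(tokens, lemmas))


pairs = pair_tokens_with_lemmas(tokens, lemmas)

# Keep only the tokens the lemmatizer actually changed.
changed = [(token, lemma) for token, lemma in pairs if token != lemma]

for token, lemma in pairs:
    print(f"{token:<10} -> {lemma}")

print(changed)  # [('was', 'be'), ('suprised', 'suprise')]
```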


# 4. Check out the lemma models NLU offers for languages other than English!

In [None]:
nlu.print_all_model_kinds_for_action('lemma')

For language <nl> NLU provides the following Models : 
nlu.load('nl.lemma') returns Spark NLP model lemma
For language <en> NLU provides the following Models : 
nlu.load('en.lemma') returns Spark NLP model lemma_antbnc
nlu.load('en.lemma.antbnc') returns Spark NLP model lemma_antbnc
For language <fr> NLU provides the following Models : 
nlu.load('fr.lemma') returns Spark NLP model lemma
For language <de> NLU provides the following Models : 
nlu.load('de.lemma') returns Spark NLP model lemma
For language <it> NLU provides the following Models : 
nlu.load('it.lemma') returns Spark NLP model lemma_dxc
nlu.load('it.lemma.dxc') returns Spark NLP model lemma_dxc
For language <nb> NLU provides the following Models : 
nlu.load('nb.lemma') returns Spark NLP model lemma
For language <pl> NLU provides the following Models : 
nlu.load('pl.lemma') returns Spark NLP model lemma
For language <pt> NLU provides the following Models : 
nlu.load('pt.lemma') returns Spark NLP model lemma
For language <ru>
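The printed listing follows a regular pattern: a language header line followed by one line per model reference. If you want to work with it programmatically, a sketch like the following could turn such text into a dictionary. It assumes the listing has been captured as a string in exactly the format shown above; `parse_model_listing` and `SAMPLE_LISTING` are hypothetical names, not NLU functions, and the sample is an excerpt of the output above.

```python
import re

# Excerpt of the listing above, captured as plain text.
SAMPLE_LISTING = """\
For language <nl> NLU provides the following Models :
nlu.load('nl.lemma') returns Spark NLP model lemma
For language <en> NLU provides the following Models :
nlu.load('en.lemma') returns Spark NLP model lemma_antbnc
nlu.load('en.lemma.antbnc') returns Spark NLP model lemma_antbnc
For language <it> NLU provides the following Models :
nlu.load('it.lemma') returns Spark NLP model lemma_dxc
nlu.load('it.lemma.dxc') returns Spark NLP model lemma_dxc
"""

LANG_RE = re.compile(r"^For language <(\w+)>")
MODEL_RE = re.compile(r"^nlu\.load\('([^']+)'\) returns Spark NLP model (\S+)$")


def parse_model_listing(text):
    """Map each language code to a list of (NLU reference, Spark NLP model)."""
    models = {}
    current_lang = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        lang_match = LANG_RE.match(line)
        if lang_match:
            current_lang = lang_match.group(1)
            models.setdefault(current_lang, [])
            continue
        model_match = MODEL_RE.match(line)
        if model_match and current_lang is not None:
            models[current_lang].append(model_match.groups())
    return models


parsed = parse_model_listing(SAMPLE_LISTING)
for lang, entries in parsed.items():
    print(lang, entries)
```

A language header with no model lines after it simply maps to an empty list.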

## 4.1 Let's try German lemmatization!

In [None]:
nlu.load('de.lemma').predict("Er war von der Vielfältigkeit des NLU Packets begeistert",output_level='token')

lemma download started this may take some time.
Approximate size to download 4 MB
[OK!]


| origin_index | de_lemma | token |
|---|---|---|
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | Er |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | war |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | von |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | der |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | Vielfältigkeit |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | des |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | NLU |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | Packets |
| 0 | [Er, sein, von, der, Vielfältigkeit, der, NLU,... | begeistert |
