**Lemmatization**

- Lemmatization is the process of converting a word to its base (dictionary) form, the lemma
- Lemmatization considers the context and converts the word to its meaningful base form, whereas stemming simply strips characters from the end of the word, often producing incorrect meanings and misspelled non-words

'Caring' -> Lemmatization -> 'Care'   
'Caring' -> Stemming -> 'Car'
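
The contrast can be illustrated with a toy sketch. Both functions below are hypothetical stand-ins written for this example, not the algorithms used by any of the libraries listed next: `toy_stem` blindly strips a suffix, while `toy_lemmatize` looks the word up in a tiny hand-written table.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list
LEMMA_TABLE = {"caring": "care", "feet": "foot", "are": "be"}  # hand-written lookup

def toy_stem(word):
    """Strip the first matching suffix, with no regard for meaning."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if it is in the table, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

print(toy_stem("Caring"), toy_lemmatize("Caring"))
```

The stemmer yields `car` and the lookup yields `care`, mirroring the two lines above.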


1. Wordnet Lemmatizer
2. spaCy Lemmatizer
3. TextBlob
4. CLiPS Pattern
5. Stanford CoreNLP
6. Gensim Lemmatizer
7. TreeTagger

In [92]:
# Enabling print for all lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Checking the working directory
import os
os.getcwd()

import time

'C:\\Users\\kalya\\Python\\NLP'

**`Wordnet Lemmatizer`**

In [30]:
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kalya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kalya\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [20]:
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Word
print(lemmatizer.lemmatize("bats"), lemmatizer.lemmatize("are"), lemmatizer.lemmatize("feet"))

sentence = "The striped bats are hanging on their feet for best"
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(sentence)])
print(lemmatized_output)

# The output is not great: 'are' and 'hanging' were not converted,
# because lemmatize() treats every word as a noun unless a POS is given

# Lemmatization with POS for better output
lemmatized_output_1 = ' '.join([lemmatizer.lemmatize(w, pos = 'v') for w in nltk.word_tokenize(sentence)])
print(lemmatized_output_1)

# But we do not know in advance which POS each word should be given

# The same word can have multiple lemmas depending on the meaning/context
print(lemmatizer.lemmatize("stripes", 'v'), lemmatizer.lemmatize("stripes", 'n')) 

bat are foot
The striped bat are hanging on their foot for best
The strip bat be hang on their feet for best
strip stripe


In [47]:
# Attaching the appropriate POS tag to every word
nltk.pos_tag(nltk.word_tokenize(sentence))

# Map NLTK’s POS tags to the format wordnet lemmatizer would accept
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Returns the mapped POS if found, else noun
    return tag_dict.get(tag, wordnet.NOUN)

# Sample
lemmatizer.lemmatize('feet', get_wordnet_pos('feet'))

# Lemmatize a Sentence with the appropriate POS tag
print(' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)]))

[('The', 'DT'),
 ('striped', 'JJ'),
 ('bats', 'NNS'),
 ('are', 'VBP'),
 ('hanging', 'VBG'),
 ('on', 'IN'),
 ('their', 'PRP$'),
 ('feet', 'NNS'),
 ('for', 'IN'),
 ('best', 'JJS')]

'foot'

The strip bat be hang on their foot for best
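
Note that `get_wordnet_pos` tags each word on its own, so the tagger never sees the surrounding words. A possible alternative, sketched below, is to tag the whole sentence once and map every Penn Treebank tag afterwards. The helper name `penn_to_wordnet` is hypothetical; the tagged list is copied from the `nltk.pos_tag` output above, and the one-letter codes are the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`.

```python
# Penn Treebank tag prefix -> WordNet POS code
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS code."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper(), default)

# Tagged sentence copied from the pos_tag output above
tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"), ("are", "VBP"),
          ("hanging", "VBG"), ("on", "IN"), ("their", "PRP$"),
          ("feet", "NNS"), ("for", "IN"), ("best", "JJS")]

# Pair every word with the POS code that lemmatize() accepts
wordnet_tagged = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tagged)
```

Each `(word, pos)` pair could then be passed to `lemmatizer.lemmatize(word, pos)`.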


**`spaCy Lemmatization`**

- spaCy determines the part-of-speech tag by default and assigns the corresponding lemma
- It comes with several prebuilt models; `en_core_web_sm`, which older releases also exposed through the `en` shortcut, is one of the standard ones for English

In [53]:
import spacy
# !python -m spacy download en_core_web_sm

# Load the small English model. The older 'en' shortcut call is kept for reference;
# disabling the parser and NER there keeps only what lemmatization needs.
# nlp = spacy.load('en', disable=['parser', 'ner'])
nlp = spacy.load('en_core_web_sm')

# Parse the sentence using the loaded model object `nlp`
doc = nlp(sentence)
" ".join([token.lemma_ for token in doc])

'the stripe bat be hang on -PRON- foot for good'
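
The `-PRON-` in this output is the placeholder that this spaCy version returns as the lemma of a pronoun. If the original pronoun is preferred, a small post-processing step can put it back. The sketch below works on `(text, lemma)` pairs hard-coded from the output above; in the notebook they would come from `token.text` and `token.lemma_`. The name `restore_pronouns` is hypothetical.

```python
PRON_PLACEHOLDER = "-PRON-"

def restore_pronouns(pairs):
    """Replace the pronoun placeholder with the lower-cased original token."""
    return [text.lower() if lemma == PRON_PLACEHOLDER else lemma
            for text, lemma in pairs]

# (token text, lemma) pairs copied from the sentence and the output above
pairs = [("The", "the"), ("striped", "stripe"), ("bats", "bat"), ("are", "be"),
         ("hanging", "hang"), ("on", "on"), ("their", "-PRON-"),
         ("feet", "foot"), ("for", "for"), ("best", "good")]

print(" ".join(restore_pronouns(pairs)))
```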

**`TextBlob Lemmatization`**

TextBlob is a powerful, fast and convenient NLP package as well. Using the Word and TextBlob objects, it is quite straightforward to parse and lemmatize words and sentences respectively

In [56]:
# !pip install textblob
from textblob import TextBlob, Word

sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])

# Like NLTK, TextBlob uses WordNet internally, so the POS tag has to be supplied separately during lemmatization

# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)


'The striped bat are hanging on their foot for best'

'The striped bat be hang on their foot for best'

**`Pattern Lemmatization`**

In [90]:
import pattern.en
help(pattern.en.parse)

Help on function parse in module pattern.en:

parse(s, *args, **kwargs)
    Returns a tagged Unicode string.



In [96]:
# !pip install pattern
import pattern
from pattern.en import lemma, lexeme

# Lemmatizer
" ".join([lemma(wd, parse=True) for wd in sentence.split()])

# Possible lexemes for each word
[lexeme(wd) for wd in sentence.split()]

from pattern.en import parse
parse(sentence, lemmata=True, tags=False, chunks=False)

'the stripe bat be hang on their feet for best'

[['the', 'thes', 'thing', 'thed'],
 ['stripe', 'stripes', 'striping', 'striped'],
 ['bat', 'bats', 'batting', 'batted'],
 ['be',
  'am',
  'are',
  'is',
  'being',
  'was',
  'were',
  'been',
  'am not',
  "aren't",
  "isn't",
  "wasn't",
  "weren't"],
 ['hang', 'hangs', 'hanging', 'hung'],
 ['on', 'ons', 'oning', 'oned'],
 ['their', 'theirs', 'theiring', 'theired'],
 ['feet', 'feets', 'feeting', 'feeted'],
 ['for', 'fors', 'forring', 'forred'],
 ['best', 'bests', 'besting', 'bested']]

'The/DT/the striped/JJ/striped bats/NNS/bat are/VBP/be hanging/VBG/hang on/IN/on their/PRP$/their feet/NNS/foot for/IN/for best/JJS/best'
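
In the `parse` output above every token appears as `word/TAG/lemma`, so the lemmas can be recovered with plain string handling. A minimal sketch, assuming that three-field layout holds for every token; the helper name `lemmas_from_tagged_string` is hypothetical and the input is copied from the output above.

```python
def lemmas_from_tagged_string(tagged):
    """Return the last slash-separated field of every whitespace-separated token."""
    return [token.rsplit("/", 1)[-1] for token in tagged.split()]

# String copied from the parse() output above
parsed = ("The/DT/the striped/JJ/striped bats/NNS/bat are/VBP/be "
          "hanging/VBG/hang on/IN/on their/PRP$/their feet/NNS/foot "
          "for/IN/for best/JJS/best")

print(" ".join(lemmas_from_tagged_string(parsed)))
```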

**`Stanford CoreNLP Lemmatization`**

- Stanford CoreNLP is a popular NLP tool originally implemented in Java. There are many Python wrappers written around it
- The lemma is embedded in the output of the annotate() method of the StanfordCoreNLP connection object

In [93]:
# !pip install stanfordcorenlp
# !pip install pycorenlp
# from pycorenlp import StanfordCoreNLP
from stanfordcorenlp import StanfordCoreNLP
import json

corenlp_dir = "C:\\Users\\kalya\\Downloads\\stanford-corenlp-latest"
corenlp = StanfordCoreNLP(corenlp_dir)
# Assumption: this wrapper exposes parse() rather than raw_parse()
corenlp.parse("Several women told me I have lying eyes.")

import requests
print(requests.post('http://[::]:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}',
                    data={'data': 'The quick brown fox jumped over the lazy dog.'}).text)

# nlp_wrapper = StanfordCoreNLP('http://localhost:9000')


# # Connect to the CoreNLP server we just started
# nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=120)

# # Define properties needed to get lemma
# props = {'annotators': 'pos,lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
# parsed_str = nlp.annotate(sentence, properties=props)

# # Output of nlp.annotate() was converted to a dict using json.loads
# parsed_dict = json.loads(parsed_str)
# parsed_dict

# # lemma values from dictionary
# lemma_list = " ".join([v for d in parsed_dict['sentences'][0]['tokens'] for k,v in d.items() if k == 'lemma'])

FileNotFoundError: [WinError 2] The system cannot find the file specified

In [95]:
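import requests
# Note: '[::]' is the unspecified (wildcard) address, which is not a valid target to
# connect to here (see the WinError 10049 below); 'localhost' is the usual choice.
print(requests.post('http://[::]:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}',
                    data={'data': 'The quick brown fox jumped over the lazy dog.'}).text)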


from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

text = ('Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.')
output = nlp.annotate(text, properties={'annotators':'tokenize, ssplit, pos, depparse, parse', 'outputFormat':'json'})
print(output['sentences'][0]['parse'])

ConnectionError: HTTPConnectionPool(host='::', port=9000): Max retries exceeded with url: /?properties=%7B%22annotators%22:%22tokenize,ssplit,pos%22,%22outputFormat%22:%22json%22%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000021DA0A6A588>: Failed to establish a new connection: [WinError 10049] The requested address is not valid in its context'))
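
Both failures above come down to nothing answering at the address being contacted. A cheap pre-flight check, sketched here with only the standard library, is to try opening a TCP connection before calling any wrapper. The function name `server_is_reachable` is hypothetical, and the defaults assume a CoreNLP server was started separately on `localhost:9000`, the port used in the cells above.

```python
import socket

def server_is_reachable(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors
        return False

if not server_is_reachable():
    print("Nothing is answering on localhost:9000; start the CoreNLP server first.")
```

A successful TCP connection only shows that something is listening on the port, not that it is CoreNLP.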

In [97]:
from stanfordcorenlp import StanfordCoreNLP
import json, string

def lemmatize_corenlp(conn_nlp, sentence):
    props = {'annotators': 'pos,lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}
    
    # Tokenize into words
    sents = conn_nlp.word_tokenize(sentence)
    # Remove punctuation from the tokenized list
    sents_no_punct = [s for s in sents if s not in string.punctuation]

    # Form sentence
    sentence2 = " ".join(sents_no_punct)
    # Annotate to get lemma
    parsed_str = conn_nlp.annotate(sentence2, properties=props)
    parsed_dict = json.loads(parsed_str)

    # Extract the lemma for each word
    lemma_list = [v for d in parsed_dict['sentences'][0]['tokens'] for k,v in d.items() if k == 'lemma']
    # Form sentence and return it
    return " ".join(lemma_list)

# Make the connection and call `lemmatize_corenlp`
nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=120)
lemmatize_corenlp(conn_nlp=nlp, sentence=sentence)

KeyboardInterrupt: 
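
`lemmatize_corenlp` reads only `parsed_dict['sentences'][0]`, so for input containing several sentences the later ones would be dropped. A sketch of an extraction step that walks every sentence follows; it assumes the same `sentences` / `tokens` / `lemma` layout that the function above indexes. The helper name and the sample payload are hand-written for illustration and are not real server output.

```python
import json

def lemmas_from_corenlp_json(parsed_str):
    """Collect the 'lemma' field of every token in every sentence."""
    parsed = json.loads(parsed_str)
    return [token["lemma"]
            for sent in parsed.get("sentences", [])
            for token in sent.get("tokens", [])
            if "lemma" in token]

# Hand-written sample shaped like the response, not real server output
sample = json.dumps({"sentences": [
    {"tokens": [{"word": "bats", "lemma": "bat"}, {"word": "are", "lemma": "be"}]},
    {"tokens": [{"word": "feet", "lemma": "foot"}]},
]})

print(" ".join(lemmas_from_corenlp_json(sample)))
```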

**`Gensim Lemmatization`**

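- Gensim provides lemmatization facilities based on the pattern package (in versions before 4.0; `gensim.utils.lemmatize` was removed in Gensim 4.0)
- It can be implemented using the lemmatize() method in the utils module
- By default lemmatize() allows only the ‘JJ’, ‘VB’, ‘NN’ and ‘RB’ tags; the call below narrows this to ‘NN’, ‘JJ’ and ‘RB’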

In [69]:
from gensim.utils import lemmatize
import re
lemmatized_out = ' '.join([wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence, allowed_tags=re.compile('(NN|JJ|RB)'))])
lemmatized_out

'striped bat foot best'
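
The cell above decodes each returned item and splits it on `/`, which suggests items of the form `b'lemma/TAG'`. The sketch below imitates the effect of the `allowed_tags` pattern on such items. It illustrates the filtering idea only and is not Gensim's implementation; the function name and the sample list are hypothetical.

```python
import re

def filter_tagged_lemmas(tagged_lemmas, allowed_tags=re.compile("(NN|VB|JJ|RB)")):
    """Keep lemmas whose tag matches allowed_tags and drop the tag."""
    kept = []
    for item in tagged_lemmas:
        # Split on the last slash: everything before is the lemma, after is the tag
        lemma, _, tag = item.decode("utf-8").rpartition("/")
        if allowed_tags.match(tag):
            kept.append(lemma)
    return kept

# Hand-written sample in the assumed b'lemma/TAG' shape
sample = [b"striped/JJ", b"bat/NN", b"be/VB", b"hang/VB", b"foot/NN", b"best/JJ"]

print(" ".join(filter_tagged_lemmas(sample, re.compile("(NN|JJ|RB)"))))
```

With `VB` left out of the pattern the verbs disappear, which matches the four-word result above.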

**`TreeTagger Lemmatization`**

- TreeTagger is a part-of-speech tagger for many languages
- It also provides the lemma of each word

In [77]:
# !pip install treetaggerwrapper
import treetaggerwrapper as ttpw

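# TAGDIR must point at the TreeTagger installation itself (the directory holding bin/ and lib/),
# not at the Python wrapper's package folder as in the second call, which is why this cell fails.
# tagger = ttpw.TreeTagger(TAGLANG='en', TAGDIR='/Users/ecom-selva.p/Documents/MLPlus/11_Lemmatization/treetagger')
tagger = ttpw.TreeTagger(TAGLANG='en', TAGDIR='C:/Users/kalya/Anaconda3/Lib/site-packages/treetaggerwrapper-2.3.dist-info')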
tags = tagger.tag_text(sentence)
lemmas = [t.split('\t')[-1] for t in tags]

ERROR:TreeTagger:TreeTagger binary invalid: C:\Users\kalya\Anaconda3\Lib\site-packages\treetaggerwrapper-2.3.dist-info\bin\tree-tagger.exe


TreeTaggerError: TreeTagger binary invalid: C:\Users\kalya\Anaconda3\Lib\site-packages\treetaggerwrapper-2.3.dist-info\bin\tree-tagger.exe
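
Once the binary is found, the cell above takes the last tab-separated field of each tagged line as the lemma. A slightly more defensive sketch of that parsing step is shown here. It assumes each line has the form `word<TAB>pos<TAB>lemma`; the sample lines and their POS labels are hand-written for illustration and are not TreeTagger output, and `parse_tag_lines` is a hypothetical helper.

```python
from collections import namedtuple

TaggedToken = namedtuple("TaggedToken", ["word", "pos", "lemma"])

def parse_tag_lines(lines):
    """Turn 'word<TAB>pos<TAB>lemma' strings into TaggedToken tuples."""
    tokens = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 3:  # skip lines that do not have exactly three fields
            tokens.append(TaggedToken(*fields))
    return tokens

# Hand-written sample lines in the assumed format
sample = ["bats\tNNS\tbat", "are\tVBP\tbe", "hanging\tVVG\thang", "malformed line"]

print([token.lemma for token in parse_tag_lines(sample)])
```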

In [101]:
import os
!pip install treetagger
# os.environ['TREETAGGER'] = "/opt/treetagger/cmd" # Or wherever you installed TreeTagger
# os.environ['TREETAGGER'] = "C:/Users/kalya/Downloads/tree-tagger-windows-3.2.3/TreeTagger/cmd"
from treetagger import TreeTagger

tt = TreeTagger(path_to_treetagger='C:/Users/kalya/Downloads/tree-tagger-windows-3.2.3/TreeTagger/')
tt.get_installed_lang()

tt_en = TreeTagger(encoding='utf-8', language='english')
tt_en.tag('Does this thing even work?')

tt_fr = TreeTagger(encoding='utf-8', language='french')
tt_fr.tag(u'Mon Dieu, faites que ça marche!')

ERROR: Could not find a version that satisfies the requirement treetagger (from versions: none)
ERROR: No matching distribution found for treetagger


ModuleNotFoundError: No module named 'treetagger'

In [102]:
from github_com.kennethreitz import requests

ModuleNotFoundError: No module named 'github_com'

In [104]:
# ! pip install git+https://github.com/[repo owner]/[repo]@[branch name]
!pip install git+https://github.com/miotto/treetagger-python.git

Collecting git+https://github.com/miotto/treetagger-python.git
  Cloning https://github.com/miotto/treetagger-python.git to c:\users\kalya\appdata\local\temp\pip-req-build-xsq59o6e


  Running command git clone -q https://github.com/miotto/treetagger-python.git 'C:\Users\kalya\AppData\Local\Temp\pip-req-build-xsq59o6e'
  ERROR: Error [WinError 2] The system cannot find the file specified while executing command git clone -q https://github.com/miotto/treetagger-python.git 'C:\Users\kalya\AppData\Local\Temp\pip-req-build-xsq59o6e'
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
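
The last error says that pip could not find a `git` executable. Before retrying an install from a Git URL, the presence of `git` on `PATH` can be checked from Python with the standard library. A minimal sketch follows; the function name `git_available` is hypothetical.

```python
import shutil

def git_available():
    """Return True if a 'git' executable can be found on PATH."""
    return shutil.which("git") is not None

if git_available():
    print("git found; pip can clone repositories given as git+https URLs.")
else:
    print("git not found on PATH; install Git or add it to PATH first.")
```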
