# **Lemmatization with NLTK, SpaCy, and GENSIM**

---


*   [NLTK](https://colab.research.google.com/drive/1xrl29HG6OdyNoXy8p_aEHDl0qKNtX2qI?authuser=1#scrollTo=n1llCOhUHJZE&line=1&uniqifier=1)
*   [SpaCy](https://colab.research.google.com/drive/1xrl29HG6OdyNoXy8p_aEHDl0qKNtX2qI?authuser=1#scrollTo=1hC60Km1CdsS&line=1&uniqifier=1)
*   [GENSIM](https://colab.research.google.com/drive/1xrl29HG6OdyNoXy8p_aEHDl0qKNtX2qI?authuser=1#scrollTo=fYM3BsunEi_3&line=2&uniqifier=1)


---


# **NLTK works best with separate words, not whole sentences**

In [11]:

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = " Humans (Homo sapiens) are highly intelligent primates that have become the dominant species on Earth. They are the only extant members of the subtribe Hominina and together with chimpanzees, gorillas, and orangutans,  they are part of the family Hominidae (the great apes, or hominids). Humans are terrestrial animals,  characterized by their erect posture and bipedal locomotion; high manual dexterity and heavy tool use compared to  other animals; open-ended and complex language use compared to other animal communications;  larger, more complex brains than other primates; and highly advanced and organized societies. "


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [12]:

punctuations = set("?:!.,;()")
sentence_words = nltk.word_tokenize(sentence)
# Build a new list instead of removing items while iterating, which can skip tokens
sentence_words = [word for word in sentence_words if word not in punctuations]

sentence_words[:20]


['Humans',
 'Homo',
 'sapiens',
 'are',
 'highly',
 'intelligent',
 'primates',
 'that',
 'have',
 'become',
 'the',
 'dominant',
 'species',
 'on',
 'Earth',
 'They',
 'are',
 'the',
 'only',
 'extant']
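
A note on filtering tokens: removing items from a list while looping over it can skip the element that follows each removal, so building a new list is the safer pattern. The sketch below uses a made-up token list, and the helper names `remove_in_place` and `filter_copy` are hypothetical, chosen only for this illustration.

```python
PUNCTUATION = set("?:!.,;()")

def remove_in_place(tokens):
    """Risky pattern: mutate the list while iterating over it."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for tok in tokens:
        if tok in PUNCTUATION:
            tokens.remove(tok)  # shifts later items left; the next one is skipped
    return tokens

def filter_copy(tokens):
    """Safer pattern: build a new list with a comprehension."""
    return [tok for tok in tokens if tok not in PUNCTUATION]

# Two punctuation marks in a row trigger the skip
tokens = ["hominids", ")", ".", "Humans"]
print(remove_in_place(tokens))  # ['hominids', '.', 'Humans']
print(filter_copy(tokens))      # ['hominids', 'Humans']
```

The stray `.` survives in the first result because the iterator moves past it right after `)` is removed.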

# **Splitting into separate words is effective for NLTK**

In [None]:
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    # lemmatize() treats each word as a noun unless a pos argument is given
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

Another way: split the sentence on whitespace and lemmatize each piece.

In [27]:
result = " ".join([wordnet_lemmatizer.lemmatize(wd) for wd in sentence.split()])
print(result)

Humans (Homo sapiens) are highly intelligent primate that have become the dominant specie on Earth. They are the only extant member of the subtribe Hominina and together with chimpanzees, gorillas, and orangutans, they are part of the family Hominidae (the great apes, or hominids). Humans are terrestrial animals, characterized by their erect posture and bipedal locomotion; high manual dexterity and heavy tool use compared to other animals; open-ended and complex language use compared to other animal communications; larger, more complex brain than other primates; and highly advanced and organized societies.
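
In the whitespace-split output above, pieces such as `animals,` and `chimpanzees,` keep their trailing punctuation and come back unchanged. One possible workaround is to peel the punctuation off, lemmatize the core, and glue the punctuation back on. This is only a sketch: `lemmatize_token` is a hypothetical helper, and `toy_lemmatize` is a stand-in dictionary lookup so the example runs without any downloads.

```python
import string

def lemmatize_token(token, lemmatize):
    """Lemmatize the core of a token, keeping leading/trailing punctuation."""
    lead = len(token) - len(token.lstrip(string.punctuation))
    core = token.strip(string.punctuation)
    if not core:  # token is punctuation only
        return token
    tail = token[lead + len(core):]
    return token[:lead] + lemmatize(core) + tail

# Stand-in for a real lemmatizer (assumption: any str -> str callable works)
TOY = {"animals": "animal", "gorillas": "gorilla"}

def toy_lemmatize(word):
    return TOY.get(word, word)

words = "(the gorillas, animals; open-ended".split()
print(" ".join(lemmatize_token(w, toy_lemmatize) for w in words))
# (the gorilla, animal; open-ended
```

With NLTK, `wordnet_lemmatizer.lemmatize` could be passed in place of `toy_lemmatize`.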


# **Lemmatizing a whole sentence at once is ineffective with NLTK**

In [17]:
lemma = wordnet_lemmatizer.lemmatize(sentence)
print(lemma[0:200])

 Humans (Homo sapiens) are highly intelligent primates that have become the dominant species on Earth. They are the only extant members of the subtribe Hominina and together with chimpanzees, gorillas


See! `primates` remained `primates`; it should have become `primate`. The lemmatizer treats the whole string as a single word, so nothing changes.
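
Why the sentence comes back untouched can be illustrated with a toy lookup. This is not NLTK's implementation, only a sketch of the idea that the lemmatizer looks up the string it is given as one word; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names.

```python
# Toy lemma table standing in for a real dictionary (illustrative only)
TOY_LEMMAS = {"primates": "primate", "members": "member", "brains": "brain"}

def toy_lemmatize(word):
    """Return the known lemma, or the input unchanged if it is not in the table."""
    return TOY_LEMMAS.get(word, word)

text = "primates are members"

# Whole string as one "word": no entry matches, so it is returned as is
whole = toy_lemmatize(text)

# One lookup per word: each known plural is reduced
per_word = " ".join(toy_lemmatize(w) for w in text.split())

print(whole)     # primates are members
print(per_word)  # primate are member
```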

# **SpaCy lemmatizer is better than NLTK. It can make "good" from "best"**

In [19]:
!pip install spacy



In [20]:
import spacy

In [22]:
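# 'en' is a shortcut that only older spaCy releases accept; 'en_core_web_sm' is the full model name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])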

In [24]:
lemma = nlp(sentence)

In [26]:
result = " ".join([token.lemma_ for token in lemma])
print(sentence)
print(result)

 Humans (Homo sapiens) are highly intelligent primates that have become the dominant species on Earth. They are the only extant members of the subtribe Hominina and together with chimpanzees, gorillas, and orangutans,  they are part of the family Hominidae (the great apes, or hominids). Humans are terrestrial animals,  characterized by their erect posture and bipedal locomotion; high manual dexterity and heavy tool use compared to  other animals; open-ended and complex language use compared to other animal communications;  larger, more complex brains than other primates; and highly advanced and organized societies. 
  human ( homo sapiens ) be highly intelligent primate that have become the dominant specie on Earth . -PRON- be the only extant member of the subtribe Hominina and together with chimpanzee , gorilla , and orangutans ,   -PRON- be part of the family Hominidae ( the great ape , or hominid ) . human be terrestrial animal ,   characterize by -PRON- erect posture and bipedal lo
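
In the output above, pronouns show up as the placeholder `-PRON-`, and the runs of extra spaces appear to come from whitespace tokens. A small post-processing step can tidy both. The sketch below works on hand-written `(text, lemma)` pairs rather than a real spaCy `Doc`, and `clean_lemmas` is a hypothetical helper name.

```python
PRON_PLACEHOLDER = "-PRON-"

def clean_lemmas(pairs):
    """Drop whitespace tokens and swap the pronoun placeholder for the lowercased text."""
    cleaned = []
    for text, lemma in pairs:
        if not text.strip():
            continue  # whitespace-only token
        if lemma == PRON_PLACEHOLDER:
            lemma = text.lower()
        cleaned.append(lemma)
    return cleaned

# Hand-written sample shaped like the (token text, lemma) pairs in the output above
pairs = [("They", "-PRON-"), ("are", "be"), (" ", " "), ("members", "member")]
print(" ".join(clean_lemmas(pairs)))  # they be member
```

With the objects from the cells above, the pairs could be built as `[(token.text, token.lemma_) for token in lemma]`.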

# **Gensim Lemmatization**
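`gensim.utils.lemmatize` keeps only tokens tagged `JJ` (adjective), `VB` (verb), `NN` (noun), and `RB` (adverb); everything else is dropped. It needs the `pattern` package and is only available in Gensim releases before 4.0.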

In [32]:
!pip install gensim
from gensim.utils import lemmatize




In [33]:
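# Install pattern over HTTPS; GitHub has turned off the unauthenticated git:// protocol that produced the log below
!pip install git+https://github.com/pattern3/pattern.git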

Collecting git+git://github.com/pattern3/pattern.git
  Cloning git://github.com/pattern3/pattern.git to /tmp/pip-req-build-gglrehkm
  Running command git clone -q git://github.com/pattern3/pattern.git /tmp/pip-req-build-gglrehkm
Collecting cherrypy
  Downloading https://files.pythonhosted.org/packages/a8/f9/e11f893dcabe6bc222a1442bf5e14f0322a2d363c92910ed41947078a35a/CherryPy-18.6.0-py2.py3-none-any.whl (419kB)
Collecting docx
  Downloading https://files.pythonhosted.org/packages/4a/8e/5a01644697b03016de339ef444cfff28367f92984dc74eddaab1ed60eada/docx-0.2.4.tar.gz (54kB)
Collecting feedparser
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (192kB)
Collecting pdfminer3k
  Downloa

In [34]:
result  = lemmatize(sentence)
print(result)

[b'human/NN', b'homo/JJ', b'sapiens/JJ', b'be/VB', b'highly/RB', b'intelligent/JJ', b'primate/NN', b'have/VB', b'become/VB', b'dominant/JJ', b'species/NN', b'earth/NN', b'be/VB', b'only/JJ', b'extant/JJ', b'member/NN', b'subtribe/NN', b'hominina/NN', b'together/RB', b'chimpanzee/NN', b'gorilla/NN', b'orangutan/NN', b'be/VB', b'part/NN', b'family/NN', b'hominidae/VB', b'great/JJ', b'apes/NN', b'hominid/NN', b'human/NN', b'be/VB', b'terrestrial/JJ', b'animal/NN', b'characterize/VB', b'erect/VB', b'posture/NN', b'bipedal/NN', b'locomotion/NN', b'high/JJ', b'manual/JJ', b'dexterity/NN', b'heavy/JJ', b'tool/NN', b'use/NN', b'compare/VB', b'other/JJ', b'animal/NN', b'open/JJ', b'end/VB', b'complex/JJ', b'language/NN', b'use/NN', b'compare/VB', b'other/JJ', b'animal/JJ', b'communication/NN', b'larger/JJ', b'more/RB', b'complex/JJ', b'brain/NN', b'other/JJ', b'primate/NN', b'highly/RB', b'advanced/JJ', b'organized/JJ', b'society/NN']


In [35]:
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
print(lemmatized_out)

['human', 'homo', 'sapiens', 'be', 'highly', 'intelligent', 'primate', 'have', 'become', 'dominant', 'species', 'earth', 'be', 'only', 'extant', 'member', 'subtribe', 'hominina', 'together', 'chimpanzee', 'gorilla', 'orangutan', 'be', 'part', 'family', 'hominidae', 'great', 'apes', 'hominid', 'human', 'be', 'terrestrial', 'animal', 'characterize', 'erect', 'posture', 'bipedal', 'locomotion', 'high', 'manual', 'dexterity', 'heavy', 'tool', 'use', 'compare', 'other', 'animal', 'open', 'end', 'complex', 'language', 'use', 'compare', 'other', 'animal', 'communication', 'larger', 'more', 'complex', 'brain', 'other', 'primate', 'highly', 'advanced', 'organized', 'society']
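
The byte strings returned above pack a lemma and its tag together as `lemma/TAG`. Instead of discarding the tag, it can be kept and used to group or filter the lemmas. A minimal sketch on a few items copied from the output; `parse_tagged` and `group_by_tag` are hypothetical helper names.

```python
from collections import defaultdict

def parse_tagged(items):
    """Turn byte strings like b'human/NN' into (lemma, tag) tuples."""
    pairs = []
    for raw in items:
        # rpartition splits on the last '/', so the tag is always the final piece
        lemma, _, tag = raw.decode("utf-8").rpartition("/")
        pairs.append((lemma, tag))
    return pairs

def group_by_tag(pairs):
    """Collect lemmas under their tag, preserving order of appearance."""
    groups = defaultdict(list)
    for lemma, tag in pairs:
        groups[tag].append(lemma)
    return dict(groups)

sample = [b'human/NN', b'be/VB', b'highly/RB', b'intelligent/JJ', b'primate/NN']
pairs = parse_tagged(sample)
print(pairs[:2])                 # [('human', 'NN'), ('be', 'VB')]
print(group_by_tag(pairs)["NN"]) # ['human', 'primate']
```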


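Not so good: `apes` and `larger` are left unlemmatized, `hominidae` is tagged as a verb, and function words such as `the`, `that`, and `they` are dropped along with the punctuation.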

# ***SpaCy works better, I guess.***