# Natural Language Toolkit (NLTK)

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries and an active discussion forum.

## Installation

`pip install nltk`

## NLTK data

NLTK comes with many corpora, toy grammars, trained models, and other resources. A complete list is posted at: [http://nltk.org/nltk_data/](http://nltk.org/nltk_data/)



In [1]:
import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Calling `nltk.download()` with no arguments launches the NLTK downloader GUI, from which you can download any corpus.

![](images/wordnet.png)

## Let's get started



In [2]:
import nltk
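# Tokenizer models used by nltk.word_tokenize below.
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')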

In [3]:
text = """Monticello wasn't designated as UNESCO World Heritage Site until 1987 """

## Better word tokenizers

In [5]:
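import regex  # third-party package (pip install regex); the built-in re module also works here
# Split on any whitespace character, period, or comma (raw string avoids invalid-escape warnings).
regex.split(r"[\s.,]", text)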

['Monticello',
 "wasn't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '']
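The empty string at the end of that list appears because `text` ends with a space, so the split produces one last, empty piece. Here is a minimal sketch of one way to avoid it. It uses the standard-library `re` module instead of `regex` (an assumption made only to keep the example dependency-free) and a hypothetical helper named `simple_tokenize`:

```python
import re

text = """Monticello wasn't designated as UNESCO World Heritage Site until 1987 """


def simple_tokenize(s):
    """Split on runs of whitespace, periods, or commas and drop empty pieces."""
    # "+" merges consecutive separators (e.g. ", ") into a single split point.
    pieces = re.split(r"[\s.,]+", s)
    # Leading or trailing separators still leave "" at the ends, so filter them out.
    return [p for p in pieces if p]


tokens = simple_tokenize(text)
print(tokens)
```

This still keeps `wasn't` as a single token; it only removes the empty pieces.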

In [8]:
nltk.word_tokenize(text)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987']
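`nltk.word_tokenize` relies on the downloaded Punkt data. As a related sketch, `TreebankWordTokenizer` from `nltk.tokenize` is purely rule-based and should work without any downloaded data. It is shown here as a comparable tokenizer, not as a description of what `word_tokenize` does internally:

```python
from nltk.tokenize import TreebankWordTokenizer

text = """Monticello wasn't designated as UNESCO World Heritage Site until 1987 """

# Rule-based tokenizer that follows Penn Treebank conventions.
treebank = TreebankWordTokenizer()
tokens = treebank.tokenize(text)

# Contractions are split into two tokens, e.g. "wasn't" into "was" and "n't".
print(tokens)
```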

## Stemming

NLTK ships several stemmers, including Porter and Snowball.

### Porter stemmer

The Porter stemmer applies a fixed set of suffix-stripping rules to each word, so the result is not always a dictionary word.
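To give a feel for what rule-based suffix stripping looks like, here is a toy sketch. The rule table and the `toy_stem` helper are made up for illustration; this is not the real Porter algorithm, which has many more rules and conditions:

```python
# Toy suffix rules, tried in order; the first match wins.
# These are illustrative only and NOT the actual Porter rule set.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word):
    """Strip or replace the first matching suffix from RULES."""
    for suffix, replacement in RULES:
        # Keep at least two characters of the word before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "flies", "meeting", "owned", "mules"]:
    print(f"{w} >>> {toy_stem(w)}")
```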

In [9]:
from nltk import PorterStemmer

stemmer = PorterStemmer()

plurals = ['carresses', 'flies', 'dies', 'mules', 'denied', 'died',
          'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 
          'siezing', 'itemization', 'sensational', 'traditional', 'reference',
          'colonizer', 'plotted']

for word in plurals:
    print(f"{word} >>> {stemmer.stem(word)}")

carresses >>> carress
flies >>> fli
dies >>> die
mules >>> mule
denied >>> deni
died >>> die
agreed >>> agre
owned >>> own
humbled >>> humbl
sized >>> size
meeting >>> meet
stating >>> state
siezing >>> siez
itemization >>> item
sensational >>> sensat
traditional >>> tradit
reference >>> refer
colonizer >>> colon
plotted >>> plot


### Snowball stemmer

In [10]:
from nltk.stem.snowball import SnowballStemmer

SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')
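Any name in `SnowballStemmer.languages` can be passed to the constructor. A small sketch, assuming German as the example language (the word `Autobahnen` is only an illustration and does not come from the original notebook):

```python
from nltk.stem.snowball import SnowballStemmer

language = "german"  # assumed example; any entry of SnowballStemmer.languages works

# Guard against typos in the language name before building the stemmer.
if language not in SnowballStemmer.languages:
    raise ValueError(f"unsupported language: {language}")

de_stemmer = SnowballStemmer(language)
stem = de_stemmer.stem("Autobahnen")
print(stem)
```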

In [11]:
sn_stemmer = SnowballStemmer('english')

The two stemmers give similar results on most words, but the `english` Snowball stemmer produces a more recognizable stem than the Porter stemmer in some cases.

In [12]:
sn_stemmer.stem("generously")

'generous'

In [13]:
stemmer.stem("generously")

'gener'
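To compare the two stemmers on more than one word, a side-by-side loop can help. This is a minimal sketch; the word list is an arbitrary choice reusing words from the examples above:

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words reused from the earlier examples.
words = ["generously", "flies", "meeting", "agreed"]

# Collect (word, porter stem, snowball stem) so differences are easy to spot.
rows = [(w, porter.stem(w), snowball.stem(w)) for w in words]
for word, p, s in rows:
    marker = "" if p == s else "  <-- differs"
    print(f"{word:12} porter={p:10} snowball={s}{marker}")
```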

## Lemmatization

Lemmatization retrieves the lemma (dictionary form) of a word. For example, the Spanish word "niñas" is the feminine plural form of the lemma "niño".

In [14]:
from nltk.stem import WordNetLemmatizer

# Needs the WordNet corpus; run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

In [15]:
for word in plurals:
    print(f"{word} >>> {lemmatizer.lemmatize(word)}")

carresses >>> carresses
flies >>> fly
dies >>> dy
mules >>> mule
denied >>> denied
died >>> died
agreed >>> agreed
owned >>> owned
humbled >>> humbled
sized >>> sized
meeting >>> meeting
stating >>> stating
siezing >>> siezing
itemization >>> itemization
sensational >>> sensational
traditional >>> traditional
reference >>> reference
colonizer >>> colonizer
plotted >>> plotted
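Most words in the output above come back unchanged because `lemmatize` treats its input as a noun unless told otherwise (its `pos` argument defaults to `"n"`). Here is a hedged sketch of passing `pos="v"` for verbs. The helper name `lemmatize_all` is hypothetical, and the WordNet corpus is assumed to be installed already (`nltk.download('wordnet')`):

```python
from nltk.stem import WordNetLemmatizer


def lemmatize_all(words, lemmatize, pos="v"):
    """Apply a lemmatize(word, pos=...) callable to every word."""
    return {word: lemmatize(word, pos=pos) for word in words}


verbs = ["denied", "agreed", "meeting", "plotted"]

try:
    lemmatizer = WordNetLemmatizer()
    # pos="v" asks WordNet to look the words up as verbs instead of nouns.
    for word, lemma in lemmatize_all(verbs, lemmatizer.lemmatize).items():
        print(f"{word} >>> {lemma}")
except LookupError:
    # Raised by NLTK when the WordNet corpus has not been downloaded.
    print("WordNet data not found; run nltk.download('wordnet') first.")
```

In a fuller pipeline the part of speech would usually come from a tagger rather than being fixed by hand.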
