### What is NLTK?
NLTK (Natural Language Toolkit) is a powerful Python package that provides a diverse set of natural language processing algorithms.
It is free, open source, easy to use, well documented, and backed by a large community.
NLTK includes the most common algorithms, such as
    * tokenizing,
    * part-of-speech tagging,
    * stemming,
    * sentiment analysis,
    * topic segmentation, and
    * named entity recognition.


### The Basics of NLP for Text

In this guide we cover some topics you need to familiarize yourself with when working with NLP:

1. Sentence Tokenization
2. Word Tokenization
3. Text Lemmatization and Stemming
4. Stop Words
5. Part-of-Speech (POS) Tagging

### Getting Started
Import the nltk library:

In [42]:
import nltk
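The cells in this guide rely on data packages that NLTK does not ship by default. The following is a minimal sketch of a one-time setup step. The helper name `download_resources` is hypothetical, and the resource names in the list are assumptions that depend on the installed NLTK version: some newer releases may ask for differently named packages, and the `LookupError` that NLTK raises names the resource it is missing.

```python
# Resource names are assumptions; adjust them to whatever your NLTK version asks for.
REQUIRED_RESOURCES = [
    "punkt",                       # models used by sent_tokenize / word_tokenize
    "wordnet",                     # dictionary used by WordNetLemmatizer
    "stopwords",                   # stop word lists
    "averaged_perceptron_tagger",  # model used by nltk.pos_tag
]


def download_resources(resources=REQUIRED_RESOURCES):
    """Download each named NLTK data package (hypothetical helper)."""
    import nltk  # imported here so the list above can be inspected on its own

    for name in resources:
        nltk.download(name, quiet=True)
```

Calling `download_resources()` once per environment should be enough; it needs network access.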

# 1. TOKENIZATION

* Tokenization is the first step in text analytics.
* Tokenization is the process of breaking a paragraph of text into smaller chunks such as **words** or **sentences**.
* A token is a single entity that serves as a building block of a sentence or paragraph.

* We can perform two types of tokenization:
    1. Sentence Tokenization
    2. Word Tokenization

#### Sentence Tokenization
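* A sentence tokenizer breaks a paragraph of text into sentences.
* This is also called sentence segmentation.
* The idea looks simple: in English and some other languages, we can split the text wherever we see sentence-ending punctuation such as a period, question mark, or exclamation mark. In practice, abbreviations such as "Mr." also contain a period, so a good tokenizer has to tell the two cases apart.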
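To see why splitting on punctuation alone is not enough, here is a minimal sketch of the naive rule using only the standard library. The function name `naive_sent_split` is hypothetical and this is not how NLTK works internally; it only illustrates the problem a real sentence tokenizer has to solve.

```python
import re


def naive_sent_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows (naive rule)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


sample = "Hello Mr. Smith, how are you doing today? The weather is great."
print(naive_sent_split(sample))
# The period in "Mr." is treated as the end of a sentence,
# so "Hello Mr." wrongly becomes a sentence of its own.
```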

In [43]:


from nltk.tokenize import sent_tokenize

text_one = "Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"

# Split the paragraph into a list of sentences
tokenized_text = sent_tokenize(text_one)

# output
tokenized_text

# check the type of tokenized_text
# type(tokenized_text)

['Hello Mr. Smith, how are you doing today?',
 'The weather is great, and city is awesome.The sky is pinkish-blue.',
 "You shouldn't eat cardboard"]

In [44]:
text_two = """Backgammon is one of the oldest known board games. 
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. 
It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll
of two dice.
"""

tokenized_text_two = sent_tokenize(text_two)

tokenized_text_two

# loop over the list 
for i in tokenized_text_two:
    print(i)

Backgammon is one of the oldest known board games.
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.
It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll
of two dice.


#### Word Tokenization
* A word tokenizer breaks a paragraph of text into words.
* This is also called word segmentation.
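As a rough illustration of what word segmentation involves, the sketch below uses a single regular expression from the standard library. The function name `naive_word_tokenize` and the pattern are assumptions made for this example; NLTK applies more elaborate rules and, for instance, splits contractions differently.

```python
import re


def naive_word_tokenize(text):
    """Return words (keeping inner hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


print(naive_word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard"))
```

Punctuation marks come out as separate tokens, while hyphenated words and contractions stay in one piece under this pattern.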
 

In [45]:
from nltk.tokenize import word_tokenize

tokenized_word = word_tokenize(text_one)
tokenized_word

['Hello',
 'Mr.',
 'Smith',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'weather',
 'is',
 'great',
 ',',
 'and',
 'city',
 'is',
 'awesome.The',
 'sky',
 'is',
 'pinkish-blue',
 '.',
 'You',
 'should',
 "n't",
 'eat',
 'cardboard']

In [46]:
tokenized_word_two = word_tokenize(text_two)

for i in tokenized_word_two:
    print(i)

Backgammon
is
one
of
the
oldest
known
board
games
.
Its
history
can
be
traced
back
nearly
5,000
years
to
archeological
discoveries
in
the
Middle
East
.
It
is
a
two
player
game
where
each
player
has
fifteen
checkers
which
move
between
twenty-four
points
according
to
the
roll
of
two
dice
.


# 2. TEXT LEMMATIZATION AND STEMMING

For grammatical reasons, documents can contain different forms of a word such as drive, drives, driving. Also, sometimes we have related words with a similar meaning, such as nation, national, nationality.

#### Why do we perform text lemmatization and stemming?
The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. (Inflection is a change in the form of a word to express a grammatical function or attribute.)

Examples:
    * am, are, is => be
    * dog, dogs, dog’s, dogs’ => dog
    
Applying this mapping to a text gives something like:

    * the boy’s dogs are different sizes => the boy dog be differ size
    
Stemming and lemmatization are special cases of **normalization**. However, they differ from each other.

## Stemming
* Stemming is the process of obtaining the root of a given word.
* Using efficient and well-generalized rules, tokens are cut down to their root, also known as the *stem*. The stem is not always a dictionary word.
* Stemming is a purely rule-based process through which we group together variations of a token.

A basic rule-based stemmer that removes suffixes such as -s/-es, -ing, or -ed can already reach an accuracy of more than 70 percent.

Example 1

The word *sit* can have different variations like 
1. sitting
2. sat

Example 2

The word *connect* can have different variations like
1. connection
2. connected
3. connecting
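The suffix-removal idea can be sketched in a few lines. This is a toy illustration, not the Porter algorithm: the function name `naive_stem`, the minimum-length check, and the extra `-ion` suffix (added so that "connection" is covered) are all assumptions made for this example.

```python
def naive_stem(word):
    """Strip one common suffix; a toy sketch, not the Porter algorithm."""
    for suffix in ("ing", "ion", "ed", "es", "s"):
        # Keep at least three characters so short words such as "is" or "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["connection", "connected", "connecting", "sitting", "sat"]:
    print(w, "->", naive_stem(w))
```

The three forms of *connect* collapse to the same stem, but "sitting" becomes "sitt" and the irregular form "sat" is left untouched, which shows the limits of purely rule-based suffix stripping.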


**NB: A stemmer operates without knowledge of the context and therefore cannot distinguish between words whose meaning depends on their part of speech.**

In [47]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

word1 = "meeting"
word2 = "connecting"
word3 = "goes"
word4 = "registered"
word5 = "better"

stem_word1 = ps.stem(word1)
stem_word2 = ps.stem(word2)
stem_word3 = ps.stem(word3)
stem_word4 = ps.stem(word4)
stem_word5 = ps.stem(word5)

print(stem_word1)
print(stem_word2)
print(stem_word3)
print(stem_word4)
print(stem_word5)

meet
connect
goe
regist
better


## Lemmatization

* Lemmatization usually refers to doing things properly, using a vocabulary and morphological analysis of words.
* Normally the aim is to remove inflectional endings only and to return the base or dictionary form of a word.
* This form is known as the **lemma**.
* Lemmatization uses the context and the POS tag of an inflected word to determine its base form; different normalization rules are applied for each POS tag to get the root word (lemma).

Examples:

The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.

The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
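The role of the dictionary look-up and the part of speech can be illustrated with a toy table. The names `LEMMA_TABLE` and `toy_lemmatize` and the four entries are hypothetical, chosen to mirror the examples above; a real lemmatizer relies on a full lexicon such as WordNet.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("better", "a"): "good",      # adjective
    ("meeting", "v"): "meet",     # verb: "We are meeting again tomorrow"
    ("meeting", "n"): "meeting",  # noun: "in our last meeting"
    ("playing", "v"): "play",
}


def toy_lemmatize(word, pos):
    """Look up the (word, pos) pair; fall back to the word itself when unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("meeting", "n"))
print(toy_lemmatize("meeting", "v"))
print(toy_lemmatize("better", "a"))
```

The same surface form "meeting" maps to different lemmas depending on the part of speech passed in, which is exactly the information a stemmer does not have.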

In [48]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

word1 = "flying"
word2 = "drove"
word3 = "better"
word4 = "registered"

# The second argument is the WordNet part of speech: "v" = verb, "a" = adjective
lem_word1 = lem.lemmatize(word1, "v")
lem_word2 = lem.lemmatize(word2, "v")
lem_word3 = lem.lemmatize(word3, "a")
lem_word4 = lem.lemmatize(word4, "v")

print(lem_word1)
print(lem_word2)
print(lem_word3)
print(lem_word4)


fly
drive
good
register


# 3. STOPWORDS

* Stop words are the most common words in a language, such as “and”, “the”, and “a”; there is no single universal list of stop words.
* Text may contain stop words such as is, am, are, this, a, an, the, etc.
* **Stop words** act as bridges: their job is to make sentences grammatically correct, and they carry little meaning of their own.
* Stop-word removal, then, is the process of removing these commonly occurring words from the corpus.

Stop words are filtered out before or after the text is processed. When applying machine learning to text, these words can add a lot of noise, which is why we want to remove them.

NLTK has a predefined list of stop words covering the most common words. The first time you use it, you need to download the list with this code: nltk.download("stopwords").

Once the download completes, we can import the stopwords corpus from nltk.corpus and use it to load the stop words.

In [49]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [52]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
# Keep only the tokens that are not in the stop word set
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)


['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
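One detail worth noting is that the entries in the English stop word list printed above are all lowercase, so an exact comparison would keep a capitalized token such as "The". The sketch below shows a case-insensitive filter using only the standard library; the small `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical stand-ins for illustration, not NLTK's list or API.

```python
# A tiny hand-written stop list for illustration; not NLTK's list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "sky", "is", "pinkish-blue", "and", "the", "city", "is", "awesome"]
print(remove_stop_words(tokens))
```

The same function could be given set(stopwords.words("english")) as its second argument once the NLTK list has been downloaded.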


# 4. PART-OF-SPEECH (POS) TAGGING

* The primary goal of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context.
* POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.



In [54]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens = nltk.word_tokenize(sent)
print(tokens)

nltk.pos_tag(tokens)

['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']


[('Albert', 'NNP'),
 ('Einstein', 'NNP'),
 ('was', 'VBD'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('Ulm', 'NNP'),
 (',', ','),
 ('Germany', 'NNP'),
 ('in', 'IN'),
 ('1879', 'CD'),
 ('.', '.')]
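The tags in this output follow the Penn Treebank convention: NNP is a proper noun, VBD a past-tense verb, VBN a past participle, IN a preposition, and CD a cardinal number. Since the lemmatizer shown earlier expects WordNet-style letters such as "v" or "a", a small conversion step is usually needed when combining the two. The sketch below is one common way to do it; the function name `penn_to_wordnet` and the choice to default to noun are assumptions made for this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter; default to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBN, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else


tagged = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Each resulting letter can then be passed as the second argument to lem.lemmatize.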