# NLTK: Getting started

* ref: 
    - [https://pythonprogramming.net](https://pythonprogramming.net)
    - [www.geeksforgeeks.org](https://www.geeksforgeeks.org/python-stemming-words-with-nltk/)

* The __NLTK module__ (Natural Language Toolkit) is a large Python toolkit that supports the whole Natural Language Processing (NLP) workflow. NLTK helps with:
    - splitting paragraphs into sentences,
    - splitting sentences into words,
    - recognizing the part of speech of those words,
    - highlighting the main subjects,
    - and helping your machine work out what the text is about.

* Quick vocabulary:
    
    - __Corpus__ - Body of text, singular. __Corpora__ is the plural of this. Example: *A collection of medical journals*.
    - __Lexicon__ - Words and their meanings. *Example: English dictionary*. Consider, however, that various fields will have different lexicons. For example: *To a financial investor, the first meaning for the word "Bull" is someone who is confident about the market, as compared to the common English lexicon, where the first meaning for the word "Bull" is an animal*. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.
    - __Token__ - Each "entity" produced when text is split up according to a set of rules. For example, *each word is a token when a sentence is "tokenized" into words*. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.
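
The three terms can be related in a few lines of plain Python. This is only an illustration, not NLTK code: the two-document corpus is made up, `str.split` stands in for a real tokenizer, and the "lexicon" is reduced to a mapping from each distinct word to its count, with no meanings attached.

```python
from collections import Counter

# A made-up corpus: a body of text consisting of two short documents.
corpus = [
    "the bull market lifted every stock",
    "the bull chased the farmer",
]

# Tokens: the pieces obtained by splitting each document on whitespace.
tokens = [word for document in corpus for word in document.split()]

# A stand-in lexicon: every distinct word in the corpus with its frequency.
lexicon = Counter(tokens)

print(len(tokens))      # number of tokens in the whole corpus
print(lexicon["bull"])  # how often "bull" occurs
print(sorted(lexicon))  # the distinct words
```

Note that the word "bull" appears in both documents with different meanings, which is exactly the situation where a field-specific lexicon matters.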


In [5]:
# We need NLTK 3. The easiest way to install the NLTK module is with pip.

!pip install nltk

# Next, install some of the data components that NLTK relies on.
import nltk
nltk.download()

# In the downloader, choose "all" to download every package.
# nltk.download('all') does the same headlessly, without the interactive prompt.

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
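
Called with no arguments, `nltk.download()` opens an interactive downloader. For scripts, a minimal sketch like the one below fetches only named packages instead. The resource ids are assumptions about what the rest of this notebook needs: `punkt` for the tokenizers and `stopwords` for the stop word list. Some newer NLTK releases ask for `punkt_tab` instead of `punkt`, so it is listed as well. `download_resources` is a hypothetical helper name, not part of NLTK.

```python
import nltk

# Assumed resource ids; adjust them to whatever your NLTK version asks for.
RESOURCES = ("punkt", "punkt_tab", "stopwords")


def download_resources(resources=RESOURCES):
    """Download each resource quietly and report success per resource id."""
    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_resources()` once before the first tokenizing cell should be enough. A `False` entry in the returned dictionary points at a resource id that was not found or could not be fetched.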

## 01. Tokenizing Words and Sentences with NLTK

In [14]:
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = """Hello Mr. Smith, how are you doing today? 
                The weather is great, and Python is awesome. 
                The sky is pinkish-blue. You shouldn't eat cardboard."""

# Print the text split up into a list of sentences
print('Sentences:\n',sent_tokenize(EXAMPLE_TEXT),'\n')

# Those tokens were sentences. Now tokenize by word instead
print('Tokenized by words:\n',word_tokenize(EXAMPLE_TEXT))

Sentences:
 ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."] 

Tokenized by words:
 ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


* There are a few **things to note** here.
    - First, punctuation is treated as a separate token.
    - Also, the word `"shouldn't"` is split into `"should"` and `"n't"`.
    - Finally, `"pinkish-blue"` is kept as the single word it was meant to be.
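
How contractions and hyphens come out depends on the tokenizer's rules. As a small sketch, the snippet below runs one sentence through two other tokenizers that ship with NLTK and need no downloaded data: `TreebankWordTokenizer`, which follows Penn Treebank conventions, and `wordpunct_tokenize`, a purely regex-based splitter. The sample sentence is assembled from the example text above for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

text = "You shouldn't eat pinkish-blue cardboard."

# Treebank-style rules: contractions are split, hyphenated words stay whole.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# Regex splitting into runs of word characters and runs of punctuation.
wordpunct_tokens = wordpunct_tokenize(text)

print(treebank_tokens)
print(wordpunct_tokens)
```

The regex-based tokenizer breaks `shouldn't` at the apostrophe and `pinkish-blue` at the hyphen, while the Treebank rules keep the hyphenated word together and split off `n't`. Which behaviour is preferable depends on what the tokens are used for later.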

## 02. Stop words with NLTK

We would not want filler words such as "umm" or "uhh" taking up space in our database or using up valuable processing time. We call such words "stop words": they carry little meaning for the analysis, so we filter them out and do nothing with them. A more literal reading of the term "stop words" is *words we stop on*.
![App Screenshot](../images/Stop-word-removal-using-NLTK.png)

In [15]:
from nltk.corpus import stopwords

# Here is the list:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

#### Remove the stop words from your text

In [18]:
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print('Tokenized by words:\n',word_tokens)
print('\nFiltered sentence:\n',filtered_sentence)

Tokenized by words:
 ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']

Filtered sentence:
 ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
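
The filtered output above still contains the punctuation tokens `,` and `.`, and fillers such as "umm" may not be part of a general-purpose stop word list. A minimal sketch of one way to handle both is shown below. `remove_stop_words` is a hypothetical helper, and the small hand-written stop word set is a stand-in so the example runs without the NLTK corpus; in the notebook, `set(stopwords.words('english'))` would be passed instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


# Tokens copied from the output above.
word_tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
               'off', 'the', 'stop', 'words', 'filtration', '.']

# Hand-written stand-in for the NLTK list (assumption, for illustration only).
demo_stop_words = {'this', 'is', 'a', 'off', 'the'}

# Project-specific fillers added on top of the base list.
custom_stop_words = demo_stop_words | {'umm', 'uhh'}

print(remove_stop_words(word_tokens, custom_stop_words))
# ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```

The `str.isalpha()` check is a blunt instrument: it also drops tokens such as `n't` or `pinkish-blue`, so it suits cases where only plain words matter.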


## 03. Stemming words with NLTK

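A stemming algorithm reduces related word forms to a common stem by stripping suffixes. The aim is that
- `chocolates` and `chocolatey` reduce toward the root word `chocolate`
and
- `retrieval`, `retrieved`, `retrieves` reduce to one shared stem.

The stem is whatever the suffix-stripping rules leave behind, so it need not be a dictionary word: NLTK's Porter stemmer maps `retrieval`, `retrieved` and `retrieves` to `retriev` rather than `retrieve`. An abbreviation such as `choco` is not expanded to `chocolate`, because a stemmer only removes endings.

One of the most popular stemming algorithms is the `Porter stemmer`, which has been around since 1979.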

In [None]:
# import these modules
from nltk.stem import PorterStemmer
  
ps = PorterStemmer()
 
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]

# stemming words
for w in words:
    print(w, " : ", ps.stem(w))

In [None]:
# Stemming words from sentences
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
  
for w in words:
    print(w, " : ", ps.stem(w))
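
A common use of stemming is to treat all words with the same stem as one term. The sketch below groups a word list by Porter stem; `group_by_stem` is a hypothetical helper written for this example, and the word list combines the `program` family used above with the `retrieve` family from the introduction.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer=None):
    """Map each stem to the list of input words that reduce to it."""
    stemmer = stemmer or PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


words = ["program", "programs", "programmer", "programming", "programmers",
         "retrieval", "retrieved", "retrieves"]

for stem, members in group_by_stem(words).items():
    print(stem, ":", members)
```

Words that a human would place in one family do not always end up in the same group. With the Porter rules, `programmer` and `programmers` may get a different stem from `program`, which is worth checking before relying on stems as keys.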