#Importing NLTK

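NLTK ships its corpora and models separately from the library itself, so you have to download the data packages you need. Every package can be downloaded at once with nltk.download('all'), or you can download them individually as shown below.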

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('book')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloa

True

---
#Using NLTK Text objects

Each of the nine built-in texts is an NLTK Text object. Documentation can be found [here](https://www.nltk.org/_modules/nltk/text.html). Text objects provide a number of methods for analyzing their contents.

Here, we print the first 20 tokens of text1.

In [4]:
from nltk.book import text1

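# text1.tokens holds the text's tokens as a sequence of strings
tokens = text1.tokens

# Print the first 20 tokens, one per line
for token in tokens[:20]:
    print(token)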

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
[
Moby
Dick
by
Herman
Melville
1851
]
ETYMOLOGY
.
(
Supplied
by
a
Late
Consumptive
Usher
to
a
Grammar


##The tokens attribute and Text objects
A couple of things I've learned:
1. Text objects can show where each occurrence of a word falls in the text. The dispersion_plot() method uses these positions to draw a lexical dispersion plot of where and how often the word appears.
2. Text objects also support counting how many times a given word appears in the text, using the count() method.
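
The idea behind a lexical dispersion plot can be sketched without NLTK: record the index of every token that matches the target word. The token list and the helper name word_offsets below are made up for illustration. With NLTK you would work from text1.tokens instead and let text1.dispersion_plot() draw the chart.

```python
def word_offsets(tokens, word):
    """Return the positions at which `word` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == word]


# Hypothetical stand-in for text1.tokens
sample_tokens = ["the", "sea", "was", "calm", ",", "but", "the", "sea", "can", "turn"]

sea_offsets = word_offsets(sample_tokens, "sea")
print(sea_offsets)       # the positions a dispersion plot would mark
print(len(sea_offsets))  # how often the word appears
```

Each offset becomes one tick on the plot, so clusters of ticks show the stretches of the text where the word is used most.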

###Searching for certain words

The concordance() method shows each occurrence of a word together with its surrounding context. The lines parameter limits how many matches are displayed, and matching ignores case, as the 'Sea' hit in the output below shows.

In [5]:
text1.concordance("sea", lines=5)

Displaying 5 of 455 matches:
 shall slay the dragon that is in the sea ." -- ISAIAH " And what thing soever 
 S PLUTARCH ' S MORALS . " The Indian Sea breedeth the most and the biggest fis
cely had we proceeded two days on the sea , when about sunrise a great many Wha
many Whales and other monsters of the sea , appeared . Among the former , one w
 waves on all sides , and beating the sea before him into a foam ." -- TOOKE ' 
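
To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not how NLTK implements concordance(). The function name simple_concordance, its width parameter, and the sample sentence are all assumptions made for this example.

```python
def simple_concordance(tokens, word, width=3, lines=5):
    """Return up to `lines` context windows around case-insensitive matches of `word`."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{token}] {right}".strip())
            if len(hits) == lines:
                break
    return hits


# Hypothetical sample sentence, split on whitespace for simplicity
sample_tokens = "They sailed on the sea for two days and the Sea was calm".split()

for line in simple_concordance(sample_tokens, "sea", width=2):
    print(line)
```

The match itself is wrapped in brackets, with up to width tokens of context on each side.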


##NLTK's and Python's count() methods
1. Python's list.count() returns the number of times an object appears in a **list**.
2. NLTK's Text.count() returns the number of times a word appears in a **Text object**.
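
The two behave alike because, as far as I can tell from the source linked above, Text.count() hands the work to the underlying token list. The sketch below uses a made-up token list (not text1) to show that the count is an exact, case-sensitive match, and uses collections.Counter from the standard library when you want the counts for every word at once.

```python
from collections import Counter

# Hypothetical stand-in for text1.tokens
sample_tokens = ["The", "sea", "and", "the", "Sea", "and", "the", "sea"]

# list.count() only counts exact matches, so case matters
print(sample_tokens.count("sea"))
print(sample_tokens.count("Sea"))

# Lowercase first for a case-insensitive count
lowered = [token.lower() for token in sample_tokens]
print(lowered.count("sea"))

# Counter tallies every distinct token in a single pass
freq = Counter(lowered)
print(freq["the"], freq["and"])
```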

##Tokenizing Text

I will be using an excerpt from *What We Know About Acquisition of Adult Literacy* as an example.

Copy this text if you'd like:
"*Imagine for a moment that among humans some people can fly. Government staff come and tell you that you can take a course that will teach you how. This sounds great, and one hears of emotional accounts of what it is like to soar in the sky. But you have no personal experience of what flying feels like. To learn it you must go for six to nine months daily to school. You do exercises like flapping your arms but you never really take off. And you do not often need to fly anywhere. Whenever you do, you can either take the plane or send a relative who can fly to do what is needed. So, is the benefit worth the effort?*"

I use NLTK's word tokenizer, word_tokenize(), to split the text into tokens. Here, I print the first 10 tokens of the excerpt.


In [6]:
from nltk.tokenize import word_tokenize

raw_text = "Imagine for a moment that among humans some people can fly. Government staff come and tell you that you can take a course that will teach you how. This sounds great, and one hears of emotional accounts of what it is like to soar in the sky. But you have no personal experience of what flying feels like. To learn it you must go for six to nine months daily to school. You do exercises like flapping your arms but you never really take off. And you do not often need to fly anywhere. Whenever you do, you can either take the plane or send a relative who can fly to do what is needed. So, is the benefit worth the effort?"
tokens = word_tokenize(raw_text)

# Print the first 10 tokens, one per line
for token in tokens[:10]:
    print(token)


Imagine
for
a
moment
that
among
humans
some
people
can
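
One reason to use a tokenizer rather than Python's str.split() is punctuation: split() leaves commas and periods attached to the neighboring word, while word_tokenize() returns them as separate tokens. The regular expression below is a rough approximation I wrote to show the difference. It is not the algorithm word_tokenize() uses, and rough_word_tokenize is a hypothetical name.

```python
import re


def rough_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Whenever you do, you can either take the plane."

print(sentence.split())               # punctuation stays attached: 'do,' and 'plane.'
print(rough_word_tokenize(sentence))  # punctuation becomes its own token
```

A real tokenizer also has to deal with contractions, abbreviations, and quotes, which this sketch ignores.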


NLTK's sentence tokenizer, sent_tokenize(), performs sentence segmentation. Like word_tokenize(), it splits the text, but into sentences rather than words. Here, I simply display all the sentences.

In [7]:
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(raw_text)

for sentence in sents:
    print(sentence)

Imagine for a moment that among humans some people can fly.
Government staff come and tell you that you can take a course that will teach you how.
This sounds great, and one hears of emotional accounts of what it is like to soar in the sky.
But you have no personal experience of what flying feels like.
To learn it you must go for six to nine months daily to school.
You do exercises like flapping your arms but you never really take off.
And you do not often need to fly anywhere.
Whenever you do, you can either take the plane or send a relative who can fly to do what is needed.
So, is the benefit worth the effort?
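
Sentence segmentation looks simple for an excerpt like this one, where every sentence ends in a period or question mark followed by a space. The naive splitter below (my own sketch, with the hypothetical name naive_sent_split) handles that case but breaks on abbreviations. Handling those cases is the kind of problem a trained tokenizer such as Punkt is designed for.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Works for plain sentences like the ones in the excerpt
print(naive_sent_split("You never really take off. So, is the benefit worth the effort?"))

# Splits wrongly after the abbreviation "Dr."
print(naive_sent_split("Dr. Smith can fly. Can you?"))
```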


#Stemming and Lemmatization
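English has a huge number of distinct word forms, and treating every inflected form (fly, flies, flying) as a separate word makes text harder for a program to process efficiently. That's why we 'stem' or 'lemmatize' words: both reduce related forms to a common base, which shrinks the number of distinct words a program has to analyze.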

Below, we use NLTK's PorterStemmer() to stem the text and display the list.

In [8]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = word_tokenize(raw_text)
# Stem every token; note that the stems in the output are lowercased
stems = [stemmer.stem(token) for token in tokens]
print(stems)

['imagin', 'for', 'a', 'moment', 'that', 'among', 'human', 'some', 'peopl', 'can', 'fli', '.', 'govern', 'staff', 'come', 'and', 'tell', 'you', 'that', 'you', 'can', 'take', 'a', 'cours', 'that', 'will', 'teach', 'you', 'how', '.', 'thi', 'sound', 'great', ',', 'and', 'one', 'hear', 'of', 'emot', 'account', 'of', 'what', 'it', 'is', 'like', 'to', 'soar', 'in', 'the', 'sky', '.', 'but', 'you', 'have', 'no', 'person', 'experi', 'of', 'what', 'fli', 'feel', 'like', '.', 'to', 'learn', 'it', 'you', 'must', 'go', 'for', 'six', 'to', 'nine', 'month', 'daili', 'to', 'school', '.', 'you', 'do', 'exercis', 'like', 'flap', 'your', 'arm', 'but', 'you', 'never', 'realli', 'take', 'off', '.', 'and', 'you', 'do', 'not', 'often', 'need', 'to', 'fli', 'anywher', '.', 'whenev', 'you', 'do', ',', 'you', 'can', 'either', 'take', 'the', 'plane', 'or', 'send', 'a', 'rel', 'who', 'can', 'fli', 'to', 'do', 'what', 'is', 'need', '.', 'so', ',', 'is', 'the', 'benefit', 'worth', 'the', 'effort', '?']


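Using NLTK's WordNetLemmatizer(), we can lemmatize the excerpt from before. By default, lemmatize() treats every token as a noun (pos='n'), which is why plural nouns such as 'humans' are reduced to 'human' while verb forms such as 'hears' and 'flying' come back unchanged.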

In [9]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(raw_text)
lemms = [lemmatizer.lemmatize(token) for token in tokens]
print(lemms)

['Imagine', 'for', 'a', 'moment', 'that', 'among', 'human', 'some', 'people', 'can', 'fly', '.', 'Government', 'staff', 'come', 'and', 'tell', 'you', 'that', 'you', 'can', 'take', 'a', 'course', 'that', 'will', 'teach', 'you', 'how', '.', 'This', 'sound', 'great', ',', 'and', 'one', 'hears', 'of', 'emotional', 'account', 'of', 'what', 'it', 'is', 'like', 'to', 'soar', 'in', 'the', 'sky', '.', 'But', 'you', 'have', 'no', 'personal', 'experience', 'of', 'what', 'flying', 'feel', 'like', '.', 'To', 'learn', 'it', 'you', 'must', 'go', 'for', 'six', 'to', 'nine', 'month', 'daily', 'to', 'school', '.', 'You', 'do', 'exercise', 'like', 'flapping', 'your', 'arm', 'but', 'you', 'never', 'really', 'take', 'off', '.', 'And', 'you', 'do', 'not', 'often', 'need', 'to', 'fly', 'anywhere', '.', 'Whenever', 'you', 'do', ',', 'you', 'can', 'either', 'take', 'the', 'plane', 'or', 'send', 'a', 'relative', 'who', 'can', 'fly', 'to', 'do', 'what', 'is', 'needed', '.', 'So', ',', 'is', 'the', 'benefit', 'wo

##Differences between stemming and lemmatization
| Stem | Lemma |
|---|---|
| not always a word | always an actual word |
| faster | slower |
| does not use a corpus | uses a corpus |
| truncates words | reduces words using morphology |
| crude reduction | proper reduction |
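
The contrast between truncating and looking up can be sketched with two toy functions. Neither is the Porter algorithm or the WordNet lemmatizer: toy_stem just strips a few hard-coded suffixes, and toy_lemmatize consults a tiny hand-made dictionary. All names and data here are assumptions chosen to mirror a few words from the outputs above.

```python
# Suffixes are tried in this order; a real stemmer applies far more careful rules
SUFFIXES = ("ing", "ly", "es", "s", "e")


def toy_stem(word):
    """Crudely chop off the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-made lookup table standing in for a real lexical resource
TOY_LEMMAS = {"humans": "human", "months": "month", "arms": "arm", "accounts": "account"}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ["humans", "months", "imagine", "people"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer turns 'imagine' and 'people' into non-words because it only looks at the spelling, while the lookup either returns a real word or leaves the input alone.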


---
#My thoughts on NLTK

1. Functionality
  - Skimming the documentation (linked above), I can see a great amount of functionality in the API. NLTK has dozens of modules covering tasks such as testing, parsing, classifying, tokenizing, and analyzing text.
2. Code Quality
  - The documentation is very thorough and detailed. The purpose of every class and method is explained, and the parameters are intuitive to use.
3. Potential
  - There are many interesting ways to use NLTK and its many modules. A few examples are smart assistants, service automation, and predictive text messaging.