## Part 1
### Basic Tokenization Process 
1. Extract Text
2. Generate tokens (sentences --> words)
3. Remove stop words, non-alphabetic tokens, and punctuation; convert to lowercase
4. Check stem words
5. Check frequencies
6. Check word combinations (parts of speech)
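
The steps above can be sketched end to end with only the standard library. This is a rough illustration, not the NLTK workflow used in the rest of this notebook: the regular-expression tokenizer, the tiny `STOP_WORDS` set, and the `preprocess` helper are all assumptions made for this example, and stemming (step 4) is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
STOP_WORDS = {"i", "a", "the", "is", "it", "in", "have", "so", "we", "of", "and"}

def preprocess(text):
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude word tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

sample = "I still have a dream. It is a dream deeply rooted in the American dream."
words = preprocess(sample)            # steps 1-3
counts = Counter(words)               # step 5: frequencies
pairs = list(zip(words, words[1:]))   # step 6: adjacent word pairs

print(words)
print(counts.most_common(1))
print(pairs[:2])
```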

### Install NLTK and necessary resources
<font color=blue> **names:**</font> list of common English names <br>
<font color=blue> **stopwords:**</font> list of common words such as articles, pronouns, prepositions, and conjunctions <br>
<font color=blue> **vader_lexicon:**</font> list of words and jargon referenced during sentiment analysis <br>
<font color=blue> **averaged_perceptron_tagger:**</font> data model used to categorize words into parts of speech <br>
<font color=blue> **punkt:**</font> pretrained model that splits full texts into sentences; `word_tokenize` relies on it before splitting the sentences into words

In [1]:
!pip install nltk
import nltk
nltk.download([
"names",
"stopwords",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt",
])



[nltk_data] Downloading package names to /Users/BGBlanco/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/BGBlanco/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Martin Luther King's "I have a dream" speech
### Step 1: Load and Extract Text

In [2]:
# load text
filename = 'I have a dream excerpt.txt'
with open(filename, 'rt') as file:  # the with block closes the file automatically
    text = file.read()

### Step 2: Generate tokens - sentences and words

In [3]:
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])  # print 1st sentence with [0] 

So even though we face the difficulties of today and tomorrow, I still have a dream.


In [4]:
print(sentences[2])  # print 3rd sentence with [2]

I have a dream that one day this nation will rise up and live out the true meaning of its creed: We hold these truths to be self-evident, that all men are created equal.


In [5]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:65]) # print the first 65 words 

['So', 'even', 'though', 'we', 'face', 'the', 'difficulties', 'of', 'today', 'and', 'tomorrow', ',', 'I', 'still', 'have', 'a', 'dream', '.', 'It', 'is', 'a', 'dream', 'deeply', 'rooted', 'in', 'the', 'American', 'dream', '.', 'I', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the', 'true', 'meaning', 'of', 'its', 'creed', ':', 'We', 'hold', 'these', 'truths', 'to', 'be', 'self-evident', ',', 'that', 'all', 'men', 'are', 'created', 'equal']
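
To see why a trained model is used for sentence splitting instead of a simple rule, here is a naive splitter that breaks after every period, question mark, or exclamation mark followed by whitespace. The `naive_sent_split` helper is hypothetical and exists only to show the failure mode on abbreviations, which `sent_tokenize` is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

good = naive_sent_split("I have a dream. It is a dream deeply rooted in the American dream.")
bad = naive_sent_split("Mr. Speaker, I have a dream.")

print(good)  # two sentences, as intended
print(bad)   # wrongly split after the abbreviation "Mr."
```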


### Step 3: Remove stop words, non-alphabetic tokens, and punctuation; convert to lowercase

In [6]:
# extract alpha only
words = [word for word in tokens if word.isalpha()]
print(words[:11])

['So', 'even', 'though', 'we', 'face', 'the', 'difficulties', 'of', 'today', 'and', 'tomorrow']


In [7]:
# use stopwords resource
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [8]:
# convert to lower case
tokens = [w.lower() for w in tokens]
print(tokens[:65])

['so', 'even', 'though', 'we', 'face', 'the', 'difficulties', 'of', 'today', 'and', 'tomorrow', ',', 'i', 'still', 'have', 'a', 'dream', '.', 'it', 'is', 'a', 'dream', 'deeply', 'rooted', 'in', 'the', 'american', 'dream', '.', 'i', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the', 'true', 'meaning', 'of', 'its', 'creed', ':', 'we', 'hold', 'these', 'truths', 'to', 'be', 'self-evident', ',', 'that', 'all', 'men', 'are', 'created', 'equal']


In [9]:
# check which characters count as punctuation
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [10]:
# remove punctuation from the lowercased tokens
table = str.maketrans('', '', string.punctuation)  # third argument lists the characters to delete
stripped = [w.translate(table) for w in tokens]    # apply the deletion table to every token
print(stripped[:12])

['so', 'even', 'though', 'we', 'face', 'the', 'difficulties', 'of', 'today', 'and', 'tomorrow', '']


In [11]:
# remove non-alpha tokens after removing punctuations
words = [word for word in stripped if word.isalpha()]
print(words[:60]) # the first 65 tokens shrink to 60 once punctuation is dropped

['so', 'even', 'though', 'we', 'face', 'the', 'difficulties', 'of', 'today', 'and', 'tomorrow', 'i', 'still', 'have', 'a', 'dream', 'it', 'is', 'a', 'dream', 'deeply', 'rooted', 'in', 'the', 'american', 'dream', 'i', 'have', 'a', 'dream', 'that', 'one', 'day', 'this', 'nation', 'will', 'rise', 'up', 'and', 'live', 'out', 'the', 'true', 'meaning', 'of', 'its', 'creed', 'we', 'hold', 'these', 'truths', 'to', 'be', 'selfevident', 'that', 'all', 'men', 'are', 'created', 'equal']
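
A small, self-contained illustration of the two cells above: `str.maketrans('', '', chars)` builds a table that deletes every character in `chars`, so punctuation-only tokens become empty strings and hyphenated tokens are fused into one word. The sample token list is made up for this example.

```python
import string

sample_tokens = ["tomorrow", ",", "self-evident", "creed", ":", "1963"]

# The third argument of maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation)
stripped = [t.translate(table) for t in sample_tokens]

# isalpha() is False for empty strings and for digits, so both are dropped.
alpha_only = [t for t in stripped if t.isalpha()]

print(stripped)    # ['tomorrow', '', 'selfevident', 'creed', '', '1963']
print(alpha_only)  # ['tomorrow', 'selfevident', 'creed']
```

This is why `self-evident` appears as `selfevident` in the output above.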


In [12]:
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words_filtered = [w for w in words if w not in stop_words] # new text without stop words
print(words_filtered[:28])

['even', 'though', 'face', 'difficulties', 'today', 'tomorrow', 'still', 'dream', 'dream', 'deeply', 'rooted', 'american', 'dream', 'dream', 'one', 'day', 'nation', 'rise', 'live', 'true', 'meaning', 'creed', 'hold', 'truths', 'selfevident', 'men', 'created', 'equal']
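
The order of the steps matters here: the stop word list printed earlier is all lowercase, so filtering before lowercasing lets capitalized stop words slip through. A minimal sketch with a hypothetical three-word stop set:

```python
# Hypothetical stop set for illustration; all entries are lowercase.
stop_words = {"so", "we", "the"}
tokens = ["So", "even", "though", "we", "face", "the", "difficulties"]

# Filtering first misses 'So' because 'So' != 'so'.
filtered_first = [t for t in tokens if t not in stop_words]

# Lowercasing first removes it as expected.
lowered_first = [t.lower() for t in tokens if t.lower() not in stop_words]

print(filtered_first)  # ['So', 'even', 'though', 'face', 'difficulties']
print(lowered_first)   # ['even', 'though', 'face', 'difficulties']
```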


### Step 4: Check stem words

In [13]:
# stemming words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words_filtered]
print(stemmed[:20])

['even', 'though', 'face', 'difficulti', 'today', 'tomorrow', 'still', 'dream', 'dream', 'deepli', 'root', 'american', 'dream', 'dream', 'one', 'day', 'nation', 'rise', 'live', 'true']
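
Stemming strips suffixes by rule, which is why the output contains non-words such as `difficulti` and `deepli`. The hypothetical `toy_stem` below hard-codes three simplified rules that happen to reproduce those stems. It is only an illustration and is far cruder than `PorterStemmer`, which applies many more rules and conditions.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Apply three simplified suffix rules; this is NOT the Porter algorithm."""
    if word.endswith("ies"):
        return word[:-2]                      # difficulties -> difficulti
    if word.endswith("ed") and VOWELS & set(word[:-2]):
        return word[:-2]                      # rooted -> root
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"                # deeply -> deepli
    return word

for w in ["difficulties", "rooted", "deeply", "dream", "day"]:
    print(w, "->", toy_stem(w))
```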


### Step 5: Check frequency distribution

In [14]:
# Frequency Distribution WITHOUT stop words, non-alpha, punctuations
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
fdist = FreqDist(words_filtered)
fdist.most_common(10)
fdist.tabulate(10)

 freedom     ring    dream      let      day    every      one     able together   nation 
      13       12       11       11       10        9        8        8        7        4 


In [15]:
# Alternative to fdist = FreqDist(words_filtered)
speech_text = nltk.Text(words_filtered)  # separate name so the raw string in `text` is not overwritten
fd = speech_text.vocab()  # returns a FreqDist of the tokens
fd.tabulate(10)

 freedom     ring    dream      let      day    every      one     able together   nation 
      13       12       11       11       10        9        8        8        7        4 


In [17]:
# Frequency Distribution WITH stop words, non-alpha, punctuations
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
# word_tokenize expects a plain string; passing an nltk.Text object raises
# "TypeError: expected string or bytes-like object", so re-read the raw speech
with open('I have a dream excerpt.txt', 'rt') as file:
    raw_text = file.read()
fdistfull = FreqDist(word_tokenize(raw_text))
fdistfull.tabulate(10)

In [18]:
fdist['dream']

11

In [19]:
fdist['freedom']

13
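
`FreqDist` subclasses Python's `collections.Counter`, so the counting and lookup behavior shown above can be previewed without NLTK. A minimal sketch on a made-up word list:

```python
from collections import Counter

sample_words = ["let", "freedom", "ring", "let", "freedom", "ring", "dream"]
counts = Counter(sample_words)

print(counts.most_common(2))  # the two most frequent words with their counts
print(counts["dream"])        # 1
print(counts["nation"])       # 0: missing words count as zero, no KeyError
```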

### Step 6: Check word combinations (parts of speech)
<font color=blue> **Concordance:**</font> collection of word locations with their context <br>
<font color=blue> **Collocation:**</font> series of words that frequently go together <br>

In [20]:
# Concordance
with open('I have a dream excerpt.txt', 'rt') as file:  # closes the file automatically
    raw = file.read()
tokens = nltk.word_tokenize(raw)
Text_Con = nltk.Text(tokens)
Text_Con.concordance('dream', lines=5)

Displaying 5 of 11 matches:
 today and tomorrow , I still have a dream . It is a dream deeply rooted in the
row , I still have a dream . It is a dream deeply rooted in the American dream 
 dream deeply rooted in the American dream . I have a dream that one day this n
ted in the American dream . I have a dream that one day this nation will rise u
all men are created equal . I have a dream that one day on the red hills of Geo


In [21]:
Text_Con.concordance('nation')

Displaying 4 of 4 matches:
 . I have a dream that one day this nation will rise up and live out the true 
tle children will one day live in a nation where they will not be judged by th
nsform the jangling discords of our nation into a beautiful symphony of brothe
g . And if America is to be a great nation , this must become true . And so le


In [22]:
concordance_list = Text_Con.concordance_list("dream", lines=5)
for entry in concordance_list:
    print(entry.line)

 today and tomorrow , I still have a dream . It is a dream deeply rooted in the
row , I still have a dream . It is a dream deeply rooted in the American dream 
 dream deeply rooted in the American dream . I have a dream that one day this n
ted in the American dream . I have a dream that one day this nation will rise u
all men are created equal . I have a dream that one day on the red hills of Geo
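
The idea behind a concordance (each match shown with its surrounding context) can be sketched in a few lines. The `kwic` helper below is hypothetical and uses a window of whole tokens for simplicity, whereas the NLTK output above is aligned to a fixed character width.

```python
def kwic(tokens, keyword, width=3):
    """Return each occurrence of keyword with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "I still have a dream . It is a dream deeply rooted".split()
for line in kwic(sample, "dream"):
    print(line)
```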


In [23]:
# Collocation: most frequent bigrams (2 adjacent words)
finder = nltk.collocations.BigramCollocationFinder.from_words(words_filtered)
finder.ngram_fd.most_common(5)

[(('freedom', 'ring'), 11),
 (('let', 'freedom'), 10),
 (('one', 'day'), 8),
 (('dream', 'one'), 5),
 (('faith', 'able'), 3)]

In [24]:
# Collocation: most frequent trigrams (3 adjacent words)
finder = nltk.collocations.TrigramCollocationFinder.from_words(words_filtered)
finder.ngram_fd.most_common(5)

[(('let', 'freedom', 'ring'), 10),
 (('dream', 'one', 'day'), 5),
 (('dream', 'today', 'dream'), 2),
 (('today', 'dream', 'one'), 2),
 (('able', 'join', 'hands'), 2)]

In [25]:
# Collocation: most frequent quadgrams (4 adjacent words)
finder = nltk.collocations.QuadgramCollocationFinder.from_words(words_filtered)
finder.ngram_fd.most_common(5)

[(('dream', 'today', 'dream', 'one'), 2),
 (('today', 'dream', 'one', 'day'), 2),
 (('every', 'mountainside', 'let', 'freedom'), 2),
 (('mountainside', 'let', 'freedom', 'ring'), 2),
 (('even', 'though', 'face', 'difficulties'), 1)]
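
The counts in `finder.ngram_fd` are raw frequencies of adjacent word runs, which can be reproduced with `zip` and `Counter`. The `ngram_counts` helper and the sample list are assumptions for illustration only.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n adjacent words."""
    return Counter(zip(*(words[i:] for i in range(n))))

sample = ["let", "freedom", "ring", "let", "freedom", "ring", "every", "hill"]

print(ngram_counts(sample, 2).most_common(1))
print(ngram_counts(sample, 3)[("let", "freedom", "ring")])  # 2
```

Because the finders above run on `words_filtered`, words counted as adjacent may have been separated by stop words in the original speech, as with `('dream', 'one')` from "dream that one".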

## Part 2 
### Putting it all together (State of the Union, JFK 1961)

In [26]:
import nltk
nltk.download([
"names",
"stopwords",
"state_union",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt",
])

[nltk_data] Downloading package names to /Users/BGBlanco/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/BGBlanco/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/BGBlanco/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
import nltk
from nltk.corpus import state_union
state_union.fileids()

#load text
Kennedy = state_union.raw('1961-Kennedy.txt')

# split into sentences
from nltk import sent_tokenize
Kennedy_sentences = sent_tokenize(Kennedy)
print('--------- Print Sentences ---------')
print(Kennedy_sentences[0])  # print 1st sentence with [0] 

# split into words
from nltk.tokenize import word_tokenize
Kennedy_tokens = word_tokenize(Kennedy)

# convert to lower case
Kennedy_tokens = [w.lower() for w in Kennedy_tokens]

print('--------- Print lowercase word tokens ---------')
print(Kennedy_tokens[:40])

# remove punctuations
import string
table = str.maketrans('','',string.punctuation)
Kennedy_no_punct = [w.translate(table) for w in Kennedy_tokens]
print('--------- Remove Punctuations ---------')
print(Kennedy_no_punct[:40])

# remove non-alpha words
Kennedy_words = [word for word in Kennedy_no_punct if word.isalpha()]
print('--------- Remove Non-alpha characters ---------')
print(Kennedy_words[:40])

# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name so the imported stopwords corpus is not shadowed
Kennedy_Final = [w for w in Kennedy_words if w not in stop_words]
print('--------- Print without stopwords ---------')
print(Kennedy_Final[:40])

# Frequency Distribution WITHOUT stop words, non-alpha, punctuations
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
fdist = FreqDist(Kennedy_Final)
print('--------- Most Common Words ---------')
fdist.most_common(5)
fdist.tabulate(5)

# Concordance
tokens = nltk.word_tokenize(Kennedy)
Kennedy_Con = nltk.Text(tokens)
print('--------- Concordance ---------')
Kennedy_Con.concordance('freedom', lines=5)

# Collocation: most frequent trigrams (Kennedy_words still includes stop words)
finder = nltk.collocations.TrigramCollocationFinder.from_words(Kennedy_words)
print('--------- Collocation ---------')
finder.ngram_fd.most_common(5)


--------- Print Sentences ---------
PRESIDENT JOHN F. KENNEDY'S SPECIAL MESSAGE TO THE CONGRESS ON URGENT NATIONAL NEEDS
 
May 25, 1961

Mr. Speaker, Mr. Vice President, my copartners in Government, gentlemen-and ladies:
The Constitution imposes upon me the obligation to "from time to time give to the Congress information of the State of the Union."
--------- Print lowercase word tokens ---------
['president', 'john', 'f.', 'kennedy', "'s", 'special', 'message', 'to', 'the', 'congress', 'on', 'urgent', 'national', 'needs', 'may', '25', ',', '1961', 'mr.', 'speaker', ',', 'mr.', 'vice', 'president', ',', 'my', 'copartners', 'in', 'government', ',', 'gentlemen-and', 'ladies', ':', 'the', 'constitution', 'imposes', 'upon', 'me', 'the', 'obligation']
--------- Remove Punctuations ---------
['president', 'john', 'f', 'kennedy', 's', 'special', 'message', 'to', 'the', 'congress', 'on', 'urgent', 'national', 'needs', 'may', '25', '', '1961', 'mr', 'speaker', '', 'mr', 'vice', 'president', '',

[(('to', 'the', 'congress'), 7),
 (('we', 'can', 'not'), 6),
 (('as', 'well', 'as'), 5),
 (('of', 'the', 'congress'), 5),
 (('an', 'additional', 'million'), 5)]