# Module 18 Office hour


Introduction
====

**Natural Language Processing** (NLP) is the task of making computers understand and produce human languages. 

It always starts with a **corpus**, i.e. *a body of text*.



What is a Corpus?
====

There are many corpora (*plural of corpus*) available in NLTK; let's start with an English one called the **Brown corpus**.

When using a new corpus in NLTK for the first time, download the corpus with the `nltk.download()` function, e.g.

```python
import nltk
nltk.download('brown')
```

After it's downloaded, you can import it as such:

In [1]:
import nltk
#downloading the brown corpus
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Plamen\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [2]:
from nltk.corpus import brown

In [3]:
brown.words() # Returns a list of strings

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [4]:
len(brown.words()) # No. of words in the corpus

1161192

In [5]:
brown.sents() # Returns a list of lists of strings

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [6]:
brown.sents(fileids='ca01') # You can access a specific file with the `fileids` argument.

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

The actual `brown` corpus data is **packaged as raw text files**.  And you can find their IDs with: 

In [7]:
len(brown.fileids()) # 500 sources, each file is a source.

500

In [8]:
print(brown.fileids()[:10]) # First 10 sources.

['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']


You can access the raw files with:

In [9]:
print(brown.raw('cb01').strip()[:1000]) # First 1000 characters.

Assembly/nn-hl session/nn-hl brought/vbd-hl much/ap-hl good/nn-hl 
The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.
It/pps was/bedz faced/vbn immediately/rb with/in a/at showdown/nn on/in the/at schools/nns ,/, an/at issue/nn which/wdt was/bedz met/vbn squarely/rb in/in conjunction/nn with/in the/at governor/nn with/in a/at decision/nn not/* to/to risk/vb abandoning/vbg public/nn education/nn ./.


	There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns information/nn they/ppss need/vb ./.



<br>

You will see that **each word comes with a slash and a label** and, unlike normal text, **punctuation marks are separated from the words that come before them**, e.g.

> The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.

<br>

And we also see that **each sentence is separated by a newline**:

> There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
> 
> The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns information/nn they/ppss need/vb ./.

<br>
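
To make the `word/label` format concrete, here is a minimal sketch of how one line of the raw text could be split back into words and labels. `parse_tagged_line` is a hypothetical helper written for illustration, and it assumes every token contains at least one slash; NLTK already does this parsing for you (e.g. via `brown.tagged_words()`).

```python
def parse_tagged_line(line):
    """Split one line of 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash only, so tokens such as './.' or ',/,'
        # still yield the punctuation mark as the word.
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs

sample = "The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ./."
tagged = parse_tagged_line(sample)
words = [word for word, tag in tagged]

print(tagged[:3])
print(words)
```

Dropping the labels gives back the same kind of token list that `brown.words()` returns.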

That brings us to the next point on **sentence tokenization** and **word tokenization**.

Tokenization
====

**Sentence tokenization** is the process of *splitting up strings into “sentences”*.

**Word tokenization** is the process of *splitting up “sentences” into “words”*.
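
To get a feel for the two steps, here is a deliberately naive sketch using only regular expressions. The function names `naive_sent_tokenize` and `naive_word_tokenize` are made up for illustration; this is not how NLTK's tokenizers work internally, and it will mis-split abbreviations such as "Dr." or "e.g.".

```python
import re

def naive_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    # A "word" is a run of word characters, optionally joined by / ' or -
    # (so 'r/ship' stays whole); any other non-space character is its own token.
    return re.findall(r"\w+(?:[/'-]\w+)*|[^\w\s]", sentence)

text = "Tokenization sounds easy. Is it really? Try a r/ship ad!"
sentences = naive_sent_tokenize(text)
tokens = [naive_word_tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

Real text breaks these rules quickly, which is why a trained tokenizer is usually the better choice.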

Let's play around with some interesting texts: `singles.txt` from the `webtext` corpus. <br>
They are **singles ads** from http://search.classifieds.news.com.au/

First, download the data with `nltk.download()`:

```python
nltk.download('webtext')
```

Then you can import with:

In [10]:
nltk.download('webtext')
from nltk.corpus import webtext

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\Plamen\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.


In [11]:
webtext.fileids()

['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

In [12]:
# Each line is one advertisement.
for i, line in enumerate(webtext.raw('singles.txt').split('\n')):
    if i >= 10: # Let's take a look at the first 10 ads.
        break
    print(str(i) + ':\t' + line)

0:	25 SEXY MALE, seeks attrac older single lady, for discreet encounters.
1:	35YO Security Guard, seeking lady in uniform for fun times.
2:	40 yo SINGLE DAD, sincere friendly DTE seeks r/ship with fem age open S/E
3:	44yo tall seeks working single mum or lady below 45 fship rship. Nat Open
4:	6.2 35 yr old OUTGOING M seeks fem 28-35 for o/door sports - w/e away
5:	A professional business male, late 40s, 6 feet tall, slim build, well groomed, great personality, home owner, interests include the arts travel and all things good, Ringwood area, is seeking a genuine female of similar age or older, in same area or surrounds, for a meaningful long term rship. Looking forward to hearing from you all.
6:	ABLE young man seeks, sexy older women. Phone for fun ready to play
7:	AFFECTIONATE LADY Sought by generous guy, 40s, mutual fulfillment
8:	ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and lo

# Let's zoom in on candidate no. 8

In [13]:
single_no8 = webtext.raw('singles.txt').split('\n')[8]
print(single_no8)

ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed.


# Sentence Tokenization
<br>

In NLTK, `sent_tokenize()` is the default tokenizer function that you can use to split strings into "*sentences*".

<br>

It uses the [**Punkt algorithm** from Kiss and Strunk (2006)](http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485).

In [14]:
from nltk import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Plamen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
sent_tokenize(single_no8)

['ARE YOU ALONE or lost in a r/ship too, with no hope in sight?',
 'Maybe we could explore new beginnings together?',
 'Im 45 Slim/Med build, GSOH, high needs and looking for someone similar.',
 'You WONT be disappointed.']

In [16]:
for sent in sent_tokenize(single_no8):
    print(word_tokenize(sent))

['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['You', 'WONT', 'be', 'disappointed', '.']


# Lowercasing

The CAPS in the texts are RATHER irritating, although we KNOW the guy is trying to EMPHASIZE something ;P

We can simply **lowercase them after we do `sent_tokenize()` and `word_tokenize()`**. <br>
The tokenizers use capitalization as a cue for where to split, so lowercasing the text before calling them would be sub-optimal.
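
As a rough illustration of why the order matters, here is a toy splitter that only ends a sentence when the next word starts with a capital letter. `split_on_capital_cue` is a hypothetical function written for this example; the real Punkt tokenizer is far more sophisticated, so treat this purely as a sketch of the idea.

```python
import re

def split_on_capital_cue(text):
    # Toy rule: a sentence ends at '.', '!' or '?' only if the next
    # non-space character is an uppercase ASCII letter.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

text = "The price is approx. 5 dollars. Maybe less."

original_split = split_on_capital_cue(text)
lowered_split = split_on_capital_cue(text.lower())

print(original_split)  # two sentences
print(lowered_split)   # one long "sentence": the cue is gone
```

With the capitals intact the toy rule finds the boundary before "Maybe"; once everything is lowercased it finds none at all.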

In [17]:
sent_tokenize(single_no8)

['ARE YOU ALONE or lost in a r/ship too, with no hope in sight?',
 'Maybe we could explore new beginnings together?',
 'Im 45 Slim/Med build, GSOH, high needs and looking for someone similar.',
 'You WONT be disappointed.']

In [18]:
for sent in sent_tokenize(single_no8):
    # It's a little inefficient to loop through each word afterwards,
    # but sometimes it helps to get better tokens.
    print([word.lower() for word in word_tokenize(sent)])
    # Alternatively:
    #print(list(map(str.lower, word_tokenize(sent))))

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['you', 'wont', 'be', 'disappointed', '.']


In [19]:
print(word_tokenize(single_no8))  # Treats the whole line as one document.

['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'You', 'WONT', 'be', 'disappointed', '.']


Stopwords
====

**Stopwords** are non-content words that primarily have only a grammatical function.

In NLTK, you can access them as follows:

In [20]:
from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# Often we want to remove stopwords when we want to keep the "gist" of the document/sentence.

For instance, let's go back to our `single_no8`:

In [21]:
# Treat the multiple sentences as one document (no need to sent_tokenize)
# Tokenize and lowercase
single_no8_tokenized_lowered = list(map(str.lower, word_tokenize(single_no8)))
print(single_no8_tokenized_lowered)

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'you', 'wont', 'be', 'disappointed', '.']


# Let's try to remove the stopwords using the English stopwords list in NLTK

In [22]:
stopwords_en = set(stopwords.words('english')) # Membership checks are faster on a set than on a list.

# List comprehension.
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en])

['alone', 'lost', 'r/ship', ',', 'hope', 'sight', '?', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'looking', 'someone', 'similar', '.', 'wont', 'disappointed', '.']


# Often, we want to remove the punctuation from the documents too.

Since Python comes with "batteries included", we have `string.punctuation`:

In [23]:
from string import punctuation
# It's a string, so we have to convert it into a set type.
print('From string.punctuation:', type(punctuation), punctuation)

From string.punctuation: <class 'str'> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


# Combining the punctuation with the stopwords from NLTK.

In [24]:
stopwords_en_withpunct = stopwords_en.union(set(punctuation))
print(stopwords_en_withpunct)

{'~', 'while', 'yours', 'they', 'themselves', 'yourselves', 'were', "don't", 'such', "she's", 'i', 's', 'both', 'himself', 'did', 'ma', 've', 'through', 'do', 'of', 'there', '`', '@', 'aren', 'by', 'for', 'what', 'can', '*', 'own', 'then', 'you', 'whom', '=', "that'll", 'o', 'my', 'hers', '/', 'those', '#', 'nor', '>', 'or', 'any', 'that', 'same', 'not', 'here', 'who', "mustn't", "you'll", 'it', 'under', 'no', '<', 'and', 'between', 'again', 'to', 'few', 'should', 'our', 'had', 'm', 'ourselves', 'most', "needn't", "shan't", 'being', 'down', 'needn', '^', 'some', 'does', 'against', '-', 'over', 'into', 'above', 'an', 'more', 'ours', 'this', 'until', 'didn', "you've", 'with', 'mustn', 'off', 'all', 'at', 'from', '}', ']', 'because', "hasn't", "you're", 'below', '+', 'out', 'up', "doesn't", 'now', ',', "won't", 'if', 'ain', "it's", 'where', 'other', 'wouldn', 'her', 'shan', 'won', 'be', 'after', 'only', "couldn't", "isn't", 'further', 'haven', 'wasn', '%', 'doesn', 'as', 'about', 'how', '

# Removing stopwords and punctuation from Single no. 8

In [25]:
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en_withpunct])

['alone', 'lost', 'r/ship', 'hope', 'sight', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'needs', 'looking', 'someone', 'similar', 'wont', 'disappointed']
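
One caveat: `set(punctuation)` only contains single characters, so multi-character punctuation tokens, such as the double-backtick and double-apostrophe quote tokens seen in the Brown output earlier, would slip through. Below is a minimal sketch of one way to handle that. `is_punct_token`, `remove_stopwords` and `tiny_stoplist` are hypothetical names made up for this example, not part of NLTK.

```python
from string import punctuation

PUNCT_CHARS = set(punctuation)

def is_punct_token(token):
    # True when every character of the token is a punctuation character.
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

def remove_stopwords(tokens, stoplist):
    # Drop tokens that are in the stoplist or consist only of punctuation.
    return [t for t in tokens if t not in stoplist and not is_punct_token(t)]

# A tiny stoplist for illustration only; in practice use the NLTK list.
tiny_stoplist = {'the', 'of', 'that', 'no', 'any'}
tokens = ['produced', '``', 'no', 'evidence', "''", 'that', 'any',
          'irregularities', 'took', 'place', '.']

filtered = remove_stopwords(tokens, tiny_stoplist)
print(filtered)
```

Tokens that merely contain punctuation, like 'r/ship', are kept because not all of their characters are punctuation.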


# Using a stronger/longer list of stopwords

From the previous output, we still have dangling modal verbs (e.g. 'could', 'wont').

We can combine the stopwords we have in NLTK with other stopwords list we find online.

Personally, I like to use `stopwords-json` because it has stopwords in 50 languages =) <br>
https://github.com/6/stopwords-json

In [26]:
# Stopwords from stopwords-json
stopwords_json = {"en":["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into",
"inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try",
"trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]}
stopwords_json_en = set(stopwords_json['en'])
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)
# Combine the stopwords. It's a lot longer so I'm not printing it out...
stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)

# Remove the stopwords from `single_no8`.
print('With combined stopwords:')
print([word for word in single_no8_tokenized_lowered if word not in stoplist_combined])

With combined stopwords:
['lost', 'r/ship', 'hope', 'sight', 'explore', 'beginnings', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'similar', 'wont', 'disappointed']


# Stemming and Lemmatization

Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".

The stemming and lemmatization processes use hand-written regex rules to find the root word.

 - **Stemming**: Trying to shorten a word with simple regex rules

 - **Lemmatization**: Trying to find the root word with linguistics rules (with the use of regexes)

(See also: [Stemmers vs Lemmatizers](https://stackoverflow.com/q/17317418/610569) question on StackOverflow)
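
To illustrate what "shortening a word with simple regex rules" can look like, here is a toy suffix stripper. The rules in `SUFFIX_RULES` and the function `toy_stem` are invented for this example and are much cruder than the rules of any real stemmer such as Porter's.

```python
import re

# Tried in order; the first rule that applies wins.
# These are NOT Porter's rules, just three made-up ones.
SUFFIX_RULES = [
    (re.compile(r'ing$'), ''),
    (re.compile(r'ed$'), ''),
    (re.compile(r's$'), ''),
]

def toy_stem(word):
    for pattern, replacement in SUFFIX_RULES:
        stem = pattern.sub(replacement, word)
        # Only accept the rule if it changed the word and
        # left a stem of at least 3 characters.
        if stem != word and len(stem) >= 3:
            return stem
    return word

stems = [toy_stem(w) for w in ['walking', 'walks', 'walked']]
print(stems)
print(toy_stem('sing'), toy_stem('running'))
```

The last line shows the limits of the approach: 'sing' is protected only by the length check, and 'running' becomes 'runn', which is not a word. Stems do not have to be real words, which is one of the main differences from lemmas.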

There are various stemmers and one lemmatizer in NLTK, the most common being:

 - **Porter Stemmer** from [Porter (1980)](https://tartarus.org/martin/PorterStemmer/index.html)
 - **Wordnet Lemmatizer** (a port of Morphy: https://wordnet.princeton.edu/man/morphy.7WN.html)

In [27]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['walking', 'walks', 'walked']:
    print(porter.stem(word))

walk
walk
walk


In [28]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wnl = WordNetLemmatizer()

for word in ['walking', 'walks', 'walked']:
    print(wnl.lemmatize(word))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Plamen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


walking
walk
walked

The lemmatizer leaves 'walking' and 'walked' untouched because `lemmatize()` treats its input as a noun by default; passing `pos='v'` tells it to treat the word as a verb, and both then reduce to 'walk'.
