# NLTK - Introduction

### Hipergator

On Hipergator, NLTK is already installed (just use `import nltk`; the module name is lowercase).

### Installation Windows

On the command line ('cmd'), type: `pip install nltk`
    
Then, type: `python`. At the Python prompt (running on the command line), type `import nltk` followed by `nltk.download()`.

This opens a window where you can select the components to install. By default, everything is selected, which is fine.

### Installation Mac

See https://stackoverflow.com/questions/16598830/nltk-download-hangs-on-os-x

If `nltk.download()` hangs, try running `nltk.download_shell()` instead. The problem is most likely in displaying the downloader UI, and `download_shell()` bypasses it.

In [None]:
# this won't work, see above
!pip install nltk

In [None]:
import nltk

In [None]:
# hipergator only
nltk.download('punkt')
nltk.download('stopwords')

## Tokenizing

Tokenizers divide strings into lists of substrings. For example, a sentence tokenizer splits a text into a list of sentences, and a word tokenizer splits a string into a list of words.


In [None]:
from nltk.tokenize import word_tokenize
word_tokenize('Tokenizing this sentence will result in a list with the different elements. Very exciting indeed!')
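
To make the idea of a sentence list and a word list concrete, here is a rough sketch that uses only regular expressions. It is not NLTK's algorithm: `sent_tokenize` relies on a trained Punkt model and copes with abbreviations, which this naive split does not. The sample text and the two patterns are illustrative assumptions.

```python
import re

text = "Tokenizing this sentence will result in a list. Very exciting indeed!"

# Naive sentence split: break after '.', '!' or '?' when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

As with `word_tokenize`, punctuation marks come out as separate tokens.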

## Stop words

Text may contain stop words like 'the', 'is', 'are'. Stop words can be filtered out of the text before it is processed. There is no universal list of stop words in NLP research, but NLTK ships with a list for English (and several other languages).

In [None]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
stopWords

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
words

In [None]:
# list comprehension
[ w for w in word_tokenize( data ) if w not in stopWords ]

In [None]:
# equivalent to the following 
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
print(wordsFiltered)
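
One detail worth checking: NLTK's English stop words are all lowercase, so a capitalized token such as 'All' passes a case-sensitive test. The sketch below uses a small hand-written stop-word set (a hypothetical subset, not the NLTK list) so it runs without any downloads.

```python
# Hypothetical mini stop-word set; NLTK's list is longer but also lowercase.
stop_words = {"all", "and", "no", "a"}
tokens = ["All", "work", "and", "no", "play", "makes", "jack", "a", "dull", "boy", "."]

# Case-sensitive test: 'All' is not in the set, so it survives.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing each token before the lookup catches it.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```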

## Punctuation

As the examples above show, punctuation marks appear as tokens in the output, and the stop-word filter does not remove them.

In [None]:
import string
print(string.punctuation)

In [None]:
# adding it to the above using list comprehension
# lowercase before the stop-word test, because the stop-word list is lowercase
print([w.lower() for w in words if w.lower() not in stopWords and w not in string.punctuation])
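
Note that `w not in string.punctuation` is a substring test against one long string. It works for single characters, but multi-character tokens such as `...`, `--`, or the quote tokens that `word_tokenize` emits for double quotes are not substrings and slip through. A minimal sketch of a per-character test follows; the helper name `is_punctuation` and the token list are made up for illustration.

```python
import string

def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["``", "Wait", "...", "''", "she", "said", "--", "really", "?"]

# Substring test: only the single-character '?' is removed.
substring_test = [t for t in tokens if t not in string.punctuation]

# Per-character test: all punctuation-only tokens are removed.
per_char_test = [t for t in tokens if not is_punctuation(t)]

print(substring_test)
print(per_char_test)
```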

## Part-of-speech

We won't really need this in this course, but it is interesting to see how NLTK 'knows' the different parts of sentences (verbs, nouns, etc.).

In [None]:
text = nltk.word_tokenize("Tokenizing this sentence will result in a list with the different elements. Very exciting indeed!")
nltk.pos_tag(text)

In [None]:
nltk.help.upenn_tagset('VBG')

In [None]:
nltk.help.upenn_tagset('DT')

In [None]:
nltk.help.upenn_tagset('JJ')

## Simple statistics for Sample Business Section

In [None]:
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# read Apple 2022 annual report business section (item 1 of 10-K)
with open('apple_business_section.txt', 'r', encoding='utf-8') as myfile:
    business = myfile.read()

# set of English stop words (punctuation is handled separately below)
stopWords = set(stopwords.words('english'))

In [None]:
business[0:100]

In [None]:
import re
# remove numbers, see https://stackoverflow.com/questions/57030670/how-to-remove-punctuation-and-numbers-during-tweettokenizer-step-in-nlp
business = re.sub(r'\d+', '', business)

In [None]:
business[0:100]
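
The pattern `\d+` deletes digits only, so decimal points and thousands separators stay behind as stray punctuation tokens. The sketch below shows the effect on a made-up sentence (not taken from the 10-K) and one possible alternative pattern that also consumes separators sitting between digits. Whether that is preferable depends on the later steps, since stray punctuation is filtered out anyway.

```python
import re

# Made-up sample sentence, for illustration only.
sample = "Revenue grew 12.5% to 1,234 units in 2022."

# Digits only: separators inside numbers are left behind.
digits_only = re.sub(r"\d+", "", sample)

# Digits plus '.' or ',' when they sit between digits.
whole_numbers = re.sub(r"\d+(?:[.,]\d+)*", "", sample)

print(digits_only)
print(whole_numbers)
```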

In [None]:
import string
string.punctuation

In [None]:
# tokens excluding stopwords and punctuation
# also exclude the typographic apostrophe '’', which is not in string.punctuation
business_tokens = [x for x in word_tokenize(business) if x.lower() not in stopWords and x not in string.punctuation + '’']
len(business_tokens)

In [None]:
business_tokens[0:25]

In [None]:
# set will give unique words
V = set(business_tokens)
len(V)

In [None]:
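# long words (more than 15 characters)
V = set(business_tokens)
long_words = [w for w in V if len(w) > 15]
# sort first, then slice: set order is arbitrary, so slicing first would give a random sample
sorted(long_words)[0:15]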

### Convert it to NLTK text

In [None]:
# convert it to nltk text
text = nltk.Text(business_tokens)
# now we can use nltk functions on the text
fdist2 = FreqDist(text)
print(fdist2)

In [None]:
fdist2.most_common(15)   
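
`FreqDist` is a subclass of Python's `collections.Counter`, which is why `most_common` is available. The sketch below reproduces the same counting with a plain `Counter` on a made-up token list; NLTK adds conveniences on top, such as `fdist2.N()` for the total count and `fdist2.freq(word)` for the relative frequency.

```python
from collections import Counter

# Made-up tokens standing in for business_tokens.
tokens = ["products", "services", "Company", "products", "services", "products"]

counts = Counter(tokens)
top_two = counts.most_common(2)

# Relative frequency of one word: its count divided by the total number of tokens.
share = counts["products"] / sum(counts.values())

print(top_two)
print(share)
```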

## Collocations: bigrams that occur more often than we would expect 

In [None]:
# bigrams -- pairs of adjacent words, in the order in which they appear
from nltk import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))

In [None]:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# build the finder from the business-section tokens
finder = BigramCollocationFinder.from_words(business_tokens)
# list of (bigram, score) pairs, scored by relative frequency
scored = finder.score_ngrams(bigram_measures.raw_freq)

# the 20 most frequent bigrams, shown in alphabetical order
sorted(finder.nbest(bigram_measures.raw_freq, 20))
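
Ranking by `raw_freq` shows the most frequent bigrams, not necessarily those that occur more often than expected. One common measure for the latter is pointwise mutual information (PMI), which compares the observed pair count with the count expected if the two words were independent. NLTK exposes it as `bigram_measures.pmi`, usually combined with `finder.apply_freq_filter(n)` to drop rare pairs. The following is a plain-Python sketch of the idea on a made-up token list, not NLTK's exact implementation; the function name `pmi_scores` and the `min_count` parameter are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in pairs.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        # observed pair frequency relative to what independence would predict
        scores[(w1, w2)] = math.log2(count * n / (unigrams[w1] * unigrams[w2]))
    return scores

# Made-up tokens for illustration.
tokens = ["net", "sales", "the", "company", "net", "sales", "the", "the", "market", "the"]
scores = pmi_scores(tokens)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Both surviving pairs occur twice, yet ('net', 'sales') scores higher than ('sales', 'the') because 'the' is frequent on its own.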

## Stemming 

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

In [None]:
ps.stem("shopping")

In [None]:
print([ps.stem(w) for w in ["game", "gaming", "gamed", "games"]])