#### NLTK Tutorial, taken from [here](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

First, load the packages we need. Note that NLTK must be installed, and the corpora and models used below must already be downloaded.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
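
The downloads mentioned above can also be done from Python. The following is a minimal sketch, assuming the resource identifiers below match your NLTK version (newer releases may ask for differently named packages, e.g. `punkt_tab`). `NLTK_RESOURCES` and `download_resources` are hypothetical names used only for this example.

```python
import nltk

# Resource identifiers assumed to cover the sections of this notebook
NLTK_RESOURCES = [
    "punkt",                       # sentence and word tokenizers
    "stopwords",                   # stop word lists
    "averaged_perceptron_tagger",  # part of speech tagger
    "tagsets",                     # tag documentation for nltk.help
    "maxent_ne_chunker",           # named entity chunker
    "words",                       # word list used by the chunker
    "wordnet",                     # lemmatizer and synsets
    "state_union",                 # State of the Union speeches
    "gutenberg",                   # Project Gutenberg sample texts
]


def download_resources(resources=NLTK_RESOURCES):
    """Try to download each resource and return the names that failed."""
    failed = []
    for name in resources:
        # nltk.download returns a boolean success flag
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_resources()` once per environment should be enough; an empty list back means every download reported success.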

## Tokenizing

Let's perform some sentiment analysis on the tweets. Some terms first:

 - Tokenizing - splitting text up by word and by sentence
 - Corpora - bodies of text, anything in the English language
 - Lexicon - words and their meanings, like an English dictionary; may be somewhat context specific

Let's use the tokenizers to split the text up by sentence and by word.

In [None]:
example_text = 'Hello there Mr. Jenkins, how are you doing today? The weather is warm. Or is it...'

In [None]:
# Tokenize by sentence
print(sent_tokenize(example_text))

In [None]:
print(word_tokenize(example_text))
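
To see why a trained sentence tokenizer is useful, compare it with a naive split. This is a small sketch using only the standard library; `naive_sentences` is a hypothetical name, and the regular expression is just one simple rule (split after `.`, `?` or `!` followed by whitespace), not what NLTK does internally.

```python
import re

example_text = 'Hello there Mr. Jenkins, how are you doing today? The weather is warm. Or is it...'

# Naive rule: a sentence ends at ., ? or ! followed by whitespace
naive_sentences = re.split(r"(?<=[.?!])\s+", example_text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule cuts the text after "Mr.", because it cannot tell an abbreviation from the end of a sentence. `sent_tokenize` is designed to handle abbreviations like this.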

## Stopwords

 - Stop words are common words (e.g. "is", "an", "of") which do not add much meaning, so they are not useful for sentiment analysis etc. and are usually filtered out.

In [None]:
from nltk.corpus import stopwords

In [None]:
example_sentence = 'This sentence is an example of showing off stop word filtration jaguar, cat this, end.'

In [None]:
stop_words = set(stopwords.words("english"))

In [None]:
print(stop_words)

In [None]:
words = word_tokenize(example_sentence)

In [None]:
filtered_sentence = [w for w in words if w not in stop_words]

In [None]:
print(filtered_sentence)
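
One detail of the filter above: NLTK's English stop words are lowercase, so a capitalised token such as "This" is not removed. The sketch below illustrates the difference with a tiny stand-in stop list (`demo_stop_words` is an assumption for illustration; the real list is much longer) and a hand-written token list.

```python
# Tiny stand-in for stopwords.words("english"), for illustration only
demo_stop_words = {"this", "is", "an", "of", "off"}

# Hand-written tokens standing in for word_tokenize output
tokens = ["This", "sentence", "is", "an", "example", "of",
          "showing", "off", "stop", "word", "filtration", "."]

# Case-sensitive membership test, as in the cell above
case_sensitive = [w for w in tokens if w not in demo_stop_words]

# Lowercasing each token before the test also removes "This"
case_insensitive = [w for w in tokens if w.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first depends on the task; for sentiment analysis it is usually harmless.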

## Stemming words

Stemming reduces related forms of a word to a common stem, e.g. riding and ride have a similar meaning.

In [None]:
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

In [None]:
example_words=["python", "pythoning", "king", "pythonly"]

In [None]:
stemmed_words = [ps.stem(w) for w in example_words]

In [None]:
stemmed_words

In [None]:
new_text = "It is very important to be pythonly when pythoning with python. All python users are pythonly at least once"

In [None]:
words = [ps.stem(w) for w in word_tokenize(new_text)]

In [None]:
words

Note that the stems are not necessarily real words, e.g. once becomes onc. Also, words ending in ly end in li after stemming.
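
The riding/ride case from the start of this section can be checked directly. A short sketch, assuming the default `PorterStemmer` settings; `related_forms` and `share_one_stem` are names made up for this example.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Different surface forms of the same word
related_forms = ["ride", "riding", "rides"]
stems = {w: ps.stem(w) for w in related_forms}
print(stems)

# True when every form was reduced to the same stem
share_one_stem = len(set(stems.values())) == 1
print(share_one_stem)

# Stems are not always dictionary words
print(ps.stem("once"), ps.stem("pythonly"))
```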

## Part of speech tagging with NLTK

In [None]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [None]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [None]:
# Train the Punkt tokenizer on the 2005 speech, then apply it to the 2006 speech below
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [None]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [None]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

In [None]:
process_content()

This creates a tuple of each word and its part of speech tag.
So what do IN, NN, VBZ, NNP, VB etc. mean?

You can check using `nltk.help.upenn_tagset('RB')`



In [None]:
nltk.help.upenn_tagset('RB')
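
Once the text is tagged, the (word, tag) tuples are easy to work with in plain Python. The sketch below uses hand-written pairs standing in for `nltk.pos_tag` output (they are illustrative, not real tagger output) and a small lookup table of Penn Treebank tags; `tagged_demo`, `TAG_MEANINGS` and `nouns` are names made up for this example.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag output
tagged_demo = [("The", "DT"), ("weather", "NN"), ("is", "VBZ"),
               ("really", "RB"), ("warm", "JJ"), ("in", "IN"),
               ("Texas", "NNP")]

# Short descriptions of a few Penn Treebank tags
TAG_MEANINGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, common, singular or mass",
    "NNP": "noun, proper, singular",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBZ": "verb, present tense, 3rd person singular",
}

for word, tag in tagged_demo:
    print(f"{word:10} {tag:4} {TAG_MEANINGS.get(tag, 'unknown')}")

# Keep only the nouns: every noun tag starts with "NN"
nouns = [word for word, tag in tagged_demo if tag.startswith("NN")]
print(nouns)
```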

## Chunking with NLTK

 - Chunking tries to find the named entity: who or what is a sentence talking about?
 - The next step is finding the words that modify or affect the named entity.

In [None]:
def process_content():
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)

        chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>*<NN>?}"""
        chunkParser = nltk.RegexpParser(chunkGram)
        chunked = chunkParser.parse(tagged)
        chunked.draw()

In [None]:
nltk.help.upenn_tagset('NN')

 - `RB.?` = adverb (RB, RBR, RBS)
 - `VB.?` = verb (VB is the base form; VBD, VBG, VBN, VBP, VBZ are the other forms)
 - `NNP` = noun, proper, singular
 - `NN` = noun, common, singular

In [None]:
process_content()

In [None]:
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)

    chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    chunked.draw()
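
The same chunk grammar can be tried on a single hand-tagged sentence, which avoids opening one drawing window per sentence. This is a sketch: `tagged_demo` is a made-up sentence with tags written by hand rather than produced by `nltk.pos_tag`.

```python
from nltk import RegexpParser

# Hand-tagged sentence standing in for nltk.pos_tag output
tagged_demo = [
    ("President", "NNP"), ("George", "NNP"), ("Bush", "NNP"),
    ("quickly", "RB"), ("addressed", "VBD"), ("Congress", "NNP"),
    ("today", "NN"),
]

# Optional adverbs and verbs, at least one proper noun, optional common noun
chunk_parser = RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")
tree = chunk_parser.parse(tagged_demo)

# Collect the words of every subtree labelled "Chunk"
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree.subtrees()
          if subtree.label() == "Chunk"]
print(chunks)
```

Printing the chunks as lists is a text-only alternative to `chunked.draw()`.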

## Chinking

Chinking removes parts of a chunk. In the example below, we first chunk everything and then remove verbs (VB), prepositions (IN), and determiners (DT).

In [None]:
for i in tokenized[5:]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)

    chunkGram = r"""Chunk: {<.*>+}
                            }<VB.?|IN|DT>+{"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    chunked.draw()
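
A small hand-tagged sentence makes the effect of the chink rule easier to see. A sketch, assuming the same grammar as above; `tagged_demo` is a made-up example with tags written by hand.

```python
from nltk import RegexpParser

# Hand-tagged sentence standing in for nltk.pos_tag output
tagged_demo = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
               ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# First chunk everything, then chink (remove) verbs, prepositions and determiners
chink_parser = RegexpParser(r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT>+{""")
tree = chink_parser.parse(tagged_demo)

# What is left inside chunks after chinking
remaining = [subtree.leaves() for subtree in tree.subtrees()
             if subtree.label() == "Chunk"]
print(remaining)
```

Removing "sat on the" from the middle splits the original single chunk into two.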

## Named Entity recognition

Note this creates a lot of false positives.

In [None]:
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)

    namedEnt = nltk.ne_chunk(tagged, binary=False)
    namedEnt.draw()

The above code will find named entities, such as:

 - Organization
 - Person
 - Location
 - Date
 - Money
 - Facility
 - Percent
 
If we set `binary=True`, it will not distinguish between named entity types.
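
The result of `ne_chunk` is a tree, so the entities can be collected in code instead of being drawn. The sketch below builds a small tree by hand to stand in for real `ne_chunk` output (the sentence and labels are made up for illustration); `extract_entities` is a hypothetical helper.

```python
from nltk import Tree

# Hand-built tree standing in for nltk.ne_chunk(tagged, binary=False) output
ne_tree_demo = Tree("S", [
    Tree("PERSON", [("George", "NNP"), ("Bush", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Texas", "NNP")]),
])


def extract_entities(tree):
    """Return (label, text) pairs for each entity subtree directly under the root."""
    entities = []
    for node in tree:
        # Entities are subtrees; ordinary tokens are plain (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


print(extract_entities(ne_tree_demo))
```

With `binary=True` every entity subtree would carry the same label, so only the text part would differ between entries.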

## Lemmatization

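Lemmatization is similar to stemming, but the result is a real word, which may be a different word with the same meaning rather than a truncated form.
The lemmatizer treats the word as a noun by default, so if we have something that is not a noun we have to pass the part of speech (pos) tag, e.g. pos='v' for a verb.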

In [14]:
from nltk.stem import WordNetLemmatizer

In [17]:
lemmatizer = WordNetLemmatizer()

In [31]:
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", pos='v'))

better
better


## Looking at the corpora

In [34]:
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

In [35]:
sample = gutenberg.raw("bible-kjv.txt")
tok = sent_tokenize(sample)

In [36]:
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and

## Wordnet

In [37]:
from nltk.corpus import wordnet

In [38]:
syns = wordnet.synsets("program")
print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [39]:
syns[0]

Synset('plan.n.01')

In [41]:
print(syns[0].lemmas()[0].name())

plan


In [42]:
syns[0].definition()

'a series of steps to be carried out or goals to be accomplished'

In [44]:
# Get the definition of every synset
for syn in syns:
    print(syn.definition())

a series of steps to be carried out or goals to be accomplished
a system of projects or services intended to meet a public need
a radio or television show
a document stating the aims and principles of a political party
an announcement of the events that will occur as part of a theatrical or sporting event
an integrated course of academic studies
(computer science) a sequence of instructions that a computer can interpret and execute
a performance (or series of performances) at a public presentation
arrange a program of or for
write a computer program


In [47]:
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [48]:
synonyms = []
antonyms = []

In [60]:
synonyms = [i.lemmas()[0].name() for i in wordnet.synsets("good")]

In [63]:
synonyms

['good',
 'good',
 'good',
 'commodity',
 'good',
 'full',
 'good',
 'estimable',
 'beneficial',
 'good',
 'good',
 'adept',
 'good',
 'dear',
 'dependable',
 'good',
 'good',
 'effective',
 'good',
 'good',
 'good',
 'good',
 'good',
 'good',
 'good',
 'well',
 'thoroughly']

## Semantic similarities

In [64]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

In [65]:
# wup stands for Wu and Palmer, who wrote a paper on semantic similarity
print(w1.wup_similarity(w2))

0.9090909090909091


So these two score about 0.91 on a scale where 1 means identical, i.e. they are roughly 91% similar.

In [78]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("poison.n.01")

In [79]:
print(w1.wup_similarity(w2))

0.25
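
To get a feel for where these numbers come from, here is a sketch of the Wu-Palmer formula, `2 * depth(lcs) / (depth(a) + depth(b))`, on a toy taxonomy, where `lcs` is the deepest ancestor the two nodes share. The taxonomy below is made up for illustration and is much shallower than WordNet, so the scores will not match the ones above; `PARENT`, `path_to_root`, `depth` and `wup` are hypothetical names.

```python
# Toy taxonomy mapping each node to its parent ("entity" is the root)
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "substance": "entity",
    "poison": "substance",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer similarity of two nodes in the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node shared with b is the deepest common ancestor
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("ship", "boat"))
print(wup("ship", "poison"))
```

Nodes that share a deep ancestor (ship and boat under vessel) score higher than nodes that only meet at the root (ship and poison).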


What would we use synsets for?
1. To rewrite things