# NLTK library

Natural language processing (NLP) is a way of transforming text into a representation of that text that a computer
can work with more easily.

The Natural Language Toolkit (NLTK) is one of the most popular Python libraries for this.

## Installation
Install it with Anaconda or with pip:
```
conda install nltk
# or
pip install nltk
```
Then, to download the NLTK data packages you need, uncomment the `nltk.download('all')` line and run the following code:

In [1]:
import matplotlib.pyplot as plt
import nltk
# nltk.download('all')

example_text = 'There is nothing bigger or older than the universe. The questions I would like to talk about are: one, where did we come from? How did the universe come into being? Are we alone in the universe? Is there alien life out there? What is the future of the human race?' \
               'Up until the 1920s, everyone thought the universe was essentially static and unchanging in time. Then it was discovered that the universe was expanding. Distant galaxies were moving away from us. This meant they must have been closer together in the past. If we extrapolate back, we find they must have all been on top of each other about 15 billion years ago. This was the Big Bang, the beginning of the universe.' \
               'But was there anything before the Big Bang? If not, what created the universe? Why did the universe emerge from the Big Bang the way it did? We used to think that the theory of the universe could be divided into two parts. First, there were the laws like Maxwell’s equations and general relativity that determined the evolution of the universe, given its state over all of space at one time. And second, there was no question of the initial state of the universe.' \
               'We have made good progress on the first part, and now have the knowledge of the laws of evolution in all but the most extreme conditions. But until recently, we have had little idea about the initial conditions for the universe. However, this division into laws of evolution and initial conditions depends on time and space being separate and distinct. Under extreme conditions, general relativity and quantum theory allow time to behave like another dimension of space. This removes the distinction between time and space, and means the laws of evolution can also determine the initial state. The universe can spontaneously create itself out of nothing.'

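The commented-out `nltk.download('all')` call downloads every NLTK data package directly. Calling `nltk.download()` without arguments instead opens the interactive downloader, which lets you install all packages you need for sentiment analysis that aren't installed already.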
Select all and install them.
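
Downloading everything fetches considerably more data than this notebook uses. As a rough sketch, the snippet below checks which data packages are missing and downloads only those. The mapping of lookup paths to package names is an assumption that fits the examples in this notebook; the exact names vary between NLTK versions (newer releases use `punkt_tab`, for example), so adjust it to whatever a `LookupError` message asks for.

```python
import nltk

# Assumed mapping: lookup path inside nltk_data -> package name for nltk.download().
# Names differ between NLTK versions, so treat these as placeholders.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
    "corpora/wordnet": "wordnet",
    "taggers/averaged_perceptron_tagger": "averaged_perceptron_tagger",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose data cannot be found locally."""
    missing = []
    for path, package in required.items():
        try:
            nltk.data.find(path)  # raises LookupError if the resource is absent
        except LookupError:
            missing.append(package)
    return missing


def download_missing(required=REQUIRED):
    """Download only the packages that are not installed yet (needs network access)."""
    for package in missing_packages(required):
        nltk.download(package)


print(missing_packages())
```

Calling `download_missing()` afterwards fetches just the packages listed by `missing_packages()`.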

## Preprocess Text
Terminology used in NLP:
* Corpus (plural: corpora): a body of text (`nltk.corpus` provides sample texts, e.g. to train a tokenizer on)
* Lexicon: like a dictionary, describes the meaning of words (for example their sentiment, dialect variants, or words used differently in other fields)
* Tokens: the elements a larger unit is split into (the words of a sentence, or the sentences of a text)

### Tokenize Text
First we need to convert the text into "tokens". This means breaking up the text into a list of:
 * sentences and/or
 * words.

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess_input(text):
    """
    :param text: raw text as a single string
    :return: list of sentences, each one a list of word tokens
    """
    sentences = sent_tokenize(text)
    tokens = [word_tokenize(sentence) for sentence in sentences]
    return tokens

tokenized = preprocess_input(example_text)
print(tokenized[0])

['There', 'is', 'nothing', 'bigger', 'or', 'older', 'than', 'the', 'universe', '.']
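
To see what word-level tokenization does with punctuation and contractions, here is a small sketch using `TreebankWordTokenizer`, a rule-based tokenizer that ships with NLTK and needs no downloaded data. The example sentence is not from the text above; `word_tokenize` applies similar rules after splitting the text into sentences.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Contractions are split off and the final period becomes its own token
tokens = tokenizer.tokenize("They'll save and invest more.")
print(tokens)
# ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
```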


It is possible to train the tokenizer yourself. This is handy if the text you want to analyse has a specific 'style'
that the default tokenizer can't pick up correctly.

In [3]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized_custom = custom_sent_tokenizer.tokenize(sample_text)
tokenized_custom_word = [nltk.word_tokenize(i) for i in tokenized_custom]
print(tokenized_custom_word[0])

['PRESIDENT', 'GEORGE', 'W.', 'BUSH', "'S", 'ADDRESS', 'BEFORE', 'A', 'JOINT', 'SESSION', 'OF', 'THE', 'CONGRESS', 'ON', 'THE', 'STATE', 'OF', 'THE', 'UNION', 'January', '31', ',', '2006', 'THE', 'PRESIDENT', ':', 'Thank', 'you', 'all', '.']


### Stop Words
Stop words are words that don't help, or even mislead, your data analysis, so you want to filter them out before moving on.

In [4]:
from nltk.corpus import stopwords

# Use a separate name so the imported `stopwords` corpus reader is not overwritten
stop_words = set(stopwords.words("english"))

words = word_tokenize(example_text)
filtered_sentences = [w for w in words if w not in stop_words]
print(filtered_sentences)

['There', 'nothing', 'bigger', 'older', 'universe', '.', 'The', 'questions', 'I', 'would', 'like', 'talk', ':', 'one', ',', 'come', '?', 'How', 'universe', 'come', '?', 'Are', 'alone', 'universe', '?', 'Is', 'alien', 'life', '?', 'What', 'future', 'human', 'race', '?', 'Up', '1920s', ',', 'everyone', 'thought', 'universe', 'essentially', 'static', 'unchanging', 'time', '.', 'Then', 'discovered', 'universe', 'expanding', '.', 'Distant', 'galaxies', 'moving', 'away', 'us', '.', 'This', 'meant', 'must', 'closer', 'together', 'past', '.', 'If', 'extrapolate', 'back', ',', 'find', 'must', 'top', '15', 'billion', 'years', 'ago', '.', 'This', 'Big', 'Bang', ',', 'beginning', 'universe.But', 'anything', 'Big', 'Bang', '?', 'If', ',', 'created', 'universe', '?', 'Why', 'universe', 'emerge', 'Big', 'Bang', 'way', '?', 'We', 'used', 'think', 'theory', 'universe', 'could', 'divided', 'two', 'parts', '.', 'First', ',', 'laws', 'like', 'Maxwell', '’', 'equations', 'general', 'relativity', 'determine
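
The output above still contains words such as 'There', 'The' and 'This': the stop word list is lower-case and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter that also drops pure punctuation tokens follows. The tiny `stop_words` set is a hand-written stand-in so the snippet runs without the corpus; in practice you would use `set(stopwords.words("english"))`.

```python
# Hand-written stand-in for set(stopwords.words("english")) (assumption for this sketch)
stop_words = {"there", "is", "or", "than", "the"}

tokens = ["There", "is", "nothing", "bigger", "or", "older", "than", "the", "universe", "."]

filtered = [
    t for t in tokens
    if t.lower() not in stop_words   # compare in lower case
    and any(c.isalnum() for c in t)  # skip pure punctuation tokens
]
print(filtered)
# ['nothing', 'bigger', 'older', 'universe']
```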

### Stemming
Stemming cuts every word down to its stem, so that different forms of the same word are counted as one.
For example, "walking", "walk" and "walked" are all reduced to "walk".
This makes it easier to cluster words together and get an overview of the text.
#### IMPORTANT: NLTK does this automatically for you if you are using WordNet

In [5]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = word_tokenize(example_text)
stem_words = [ps.stem(w) for w in words]
print(stem_words[:10])

['there', 'is', 'noth', 'bigger', 'or', 'older', 'than', 'the', 'univers', '.']
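
As a small sketch of the clustering idea, the snippet below groups a made-up word list by Porter stem. `PorterStemmer` needs no downloaded data. Note that the stem is not always a real word ("univers"), and that related but different words can end up in the same group.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Group words by their stem; the word list is made up for illustration
words = ["walking", "walk", "walked", "walks", "universe", "universal"]
groups = defaultdict(list)
for word in words:
    groups[ps.stem(word)].append(word)

print(dict(groups))
# {'walk': ['walking', 'walk', 'walked', 'walks'], 'univers': ['universe', 'universal']}
```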


### Part of Speech Tagging
This labels every word in a sentence or text with its part of speech.
It labels words as nouns, adjectives, verbs, etc., and also distinguishes verb tenses.

[POS tag list:](https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/)

* CC	coordinating conjunction
* CD	cardinal digit
* DT	determiner
* EX	existential there (like: "there is" ... think of it like "there exists")
* FW	foreign word
* IN	preposition/subordinating conjunction
* JJ	adjective	'big'
* JJR	adjective, comparative	'bigger'
* JJS	adjective, superlative	'biggest'
* LS	list marker	1)
* MD	modal	could, will
* NN	noun, singular 'desk'
* NNS	noun plural	'desks'
* NNP	proper noun, singular	'Harrison'
* NNPS	proper noun, plural	'Americans'
* PDT	predeterminer	'all the kids'
* POS	possessive ending	parent's
* PRP	personal pronoun	I, he, she
* PRP$	possessive pronoun	my, his, hers
* RB	adverb	very, silently
* RBR	adverb, comparative	better
* RBS	adverb, superlative	best
* RP	particle	give up
* TO	to	go 'to' the store.
* UH	interjection	errrrrrrrm
* VB	verb, base form	take
* VBD	verb, past tense	took
* VBG	verb, gerund/present participle	taking
* VBN	verb, past participle	taken
* VBP	verb, sing. present, non-3rd person	take
* VBZ	verb, 3rd person sing. present	takes
* WDT	wh-determiner	which
* WP	wh-pronoun	who, what
* WP$	possessive wh-pronoun	whose
* WRB	wh-adverb	where, when

In [6]:
def part_of_speech_gen(tokenized_text):
    """
    Convert tokenized text into lists of (word, part-of-speech tag) pairs.
    :param tokenized_text: list of sentences, each one already word tokenized
    :return: one list of (word, tag) pairs per sentence, or -1 if something went wrong
    """
    try:
        return [nltk.pos_tag(sentence)
                for sentence in tokenized_text]

    except Exception:
        # e.g. LookupError when the tagger data has not been downloaded
        return -1

part_of_speech = part_of_speech_gen(tokenized_custom_word)
print(part_of_speech[0])

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
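
The tagger output is a plain list of `(word, tag)` tuples, so it can be filtered and counted with ordinary Python. The sketch below works on an excerpt copied from the output above, so it runs without the tagger data. Note that upper-case words such as 'THE' were tagged `NNP` there, so any tag-based filter inherits the tagger's mistakes.

```python
from collections import Counter

# Excerpt of the tagger output shown above
tagged = [("January", "NNP"), ("31", "CD"), (",", ","), ("2006", "CD"),
          ("THE", "NNP"), ("PRESIDENT", "NNP"), (":", ":"), ("Thank", "NNP"),
          ("you", "PRP"), ("all", "DT"), (".", ".")]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Pick out words by tag
proper_nouns = [word for word, tag in tagged if tag == "NNP"]
numbers = [word for word, tag in tagged if tag == "CD"]

print(tag_counts.most_common(2))
print(proper_nouns)
print(numbers)
```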


### Chunking
After labeling every word with its part of speech, we can use the tags to group words together into chunks.
The main idea is to group words into "noun phrases": a noun together with the verbs, adverbs and other words
that describe it.

To do this we combine regular expressions with part-of-speech tags.

In [7]:
def chunking_text(tokenized_text):
    """
    Chunk every sentence of a text and collect the chunk trees in a list.
    :param tokenized_text: list of sentences, each one already word tokenized
    :return: list of nltk.Tree objects, or -1 if something went wrong
    """
    try:
        tags = part_of_speech_gen(tokenized_text)
        # chunking examples
        # chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?} """
        # PRP$ is the Penn Treebank tag for possessive pronouns (PP$ would never match)
        chunkGram = r"""
          NP: {<DT|PRP\$>?<JJ>*<NN>}    # chunk determiner/possessive, adjectives and noun
          {<NNP>+}                      # chunk sequences of proper nouns
          """
        chunkParser = nltk.RegexpParser(chunkGram)
        return [chunkParser.parse(tag) for tag in tags]
    except:
        return -1
chunked = chunking_text(tokenized)
print(chunked[0])
chunked[5].draw()

(S
  There/EX
  is/VBZ
  (NP nothing/NN)
  bigger/JJR
  or/CC
  older/JJR
  than/IN
  (NP the/DT universe/NN)
  ./.)
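
Instead of printing or drawing the tree, the noun phrases can also be collected as strings. The sketch below reuses the tagged first sentence from the output above (typed in by hand, so no tagger data is needed) and a simplified one-rule grammar; the variable names are only illustrative.

```python
import nltk

# Tagged sentence copied from the output above
tagged = [("There", "EX"), ("is", "VBZ"), ("nothing", "NN"), ("bigger", "JJR"),
          ("or", "CC"), ("older", "JJR"), ("than", "IN"), ("the", "DT"),
          ("universe", "NN"), (".", ".")]

# Optional determiner, any number of adjectives, then a noun
parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Walk the tree and join the words of every NP subtree
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
# ['nothing', 'the universe']
```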


### Chinking
Chinking is the removal of elements from a chunk, so it's part of the chunking process.
A chink, specifically, is an element we want to remove from the chunk.

#### Chinking and chunking are performed similarly; the chunk grammar defines which one is applied:
* {} = Chunk
* }{ = Chink

In [8]:
def chink_text(tokenized_text):
    """
    Chunk every sentence of a text, then remove (chink) parts of the chunks.
    :param tokenized_text: list of sentences, each one already word tokenized
    :return: list of nltk.Tree objects, or -1 if something went wrong
    """
    try:
        tags = part_of_speech_gen(tokenized_text)
        # Chinking examples
        chunkGram = r"""
          Chunk: {<.*>+}     # chunk all
          }<VB.?|IN|DT>+{    # chink
          """
        chunkParser = nltk.RegexpParser(chunkGram)
        return [chunkParser.parse(tag) for tag in tags]
    except:
        return -1
chink = chink_text(tokenized)
print(chink[0])
chink[5].draw()

(S
  (Chunk There/EX)
  is/VBZ
  (Chunk nothing/NN bigger/JJR or/CC older/JJR)
  than/IN
  the/DT
  (Chunk universe/NN ./.))
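
The same grammar can be tried on a single hand-tagged sentence without the tagger data. This sketch copies the tags from the output above and lists the words that remain in each chunk after the chink rule has removed verbs, prepositions and determiners.

```python
import nltk

# Tagged sentence copied from the output above
tagged = [("There", "EX"), ("is", "VBZ"), ("nothing", "NN"), ("bigger", "JJR"),
          ("or", "CC"), ("older", "JJR"), ("than", "IN"), ("the", "DT"),
          ("universe", "NN"), (".", ".")]

grammar = r"""
  Chunk: {<.*>+}          # chunk everything
         }<VB.?|IN|DT>+{  # then chink verbs, prepositions and determiners
"""
tree = nltk.RegexpParser(grammar).parse(tagged)

# Collect the words of every remaining chunk
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
# [['There'], ['nothing', 'bigger', 'or', 'older'], ['universe', '.']]
```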


### Named Entity Recognition
Named entity recognition finds specific types of things in a text, for [example](https://pythonprogramming.net/named-entity-recognition-nltk-tutorial/):

* ORGANIZATION - Georgia-Pacific Corp., WHO
* PERSON - Eddy Bonte, President Obama
* LOCATION - Murray River, Mount Everest
* DATE - June, 2008-06-29
* TIME - two fifty a m, 1:30 p.m.
* MONEY - 175 million Canadian Dollars, GBP 10.40
* PERCENT - twenty pct, 18.75 %
* FACILITY - Washington Monument, Stonehenge
* GPE - South East Asia, Midlothian

The problem with this approach is that it is prone to false positives.

In [9]:
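# binary=True only marks chunks as "NE" without distinguishing the entity type
named_entity = [nltk.ne_chunk(sen, binary=True) for sen in part_of_speech]
named_entity[2].draw()

# short workflow; binary=False labels each chunk with its type (PERSON, GPE, ...)
test = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(example_text)), binary=False)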
test.draw()
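
`draw()` opens a window, which is not always practical. Since the result of `ne_chunk` is an `nltk.Tree`, the entities can also be collected in a list. The sketch below uses a hand-built tree as a stand-in for real `ne_chunk` output (an assumption, so that it runs without the NER models); the sentence in it is made up.

```python
from nltk import Tree

# Hand-built stand-in for the kind of tree ne_chunk(..., binary=False) returns
# (assumption for this sketch; real output depends on the text and the model)
tree = Tree("S", [
    Tree("PERSON", [("George", "NNP"), ("Bush", "NNP")]),
    ("spoke", "VBD"),
    ("in", "IN"),
    Tree("GPE", [("Washington", "NNP")]),
    (".", "."),
])

# Every subtree except the sentence root is one named entity
entities = [
    (subtree.label(), " ".join(word for word, tag in subtree.leaves()))
    for subtree in tree.subtrees(filter=lambda t: t.label() != "S")
]
print(entities)
# [('PERSON', 'George Bush'), ('GPE', 'Washington')]
```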

### Lemmatizing
Lemmatizing is like stemming, but it reduces words to their dictionary form (lemma), which is an actual word, instead of the sometimes arbitrary stem that stemming produces.

In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(token, pos="a") for token in tokenized[8]]
print(tokenized[8], "\n", lemmatized_words)

['This', 'meant', 'they', 'must', 'have', 'been', 'closer', 'together', 'in', 'the', 'past', '.'] 
 ['This', 'meant', 'they', 'must', 'have', 'been', 'close', 'together', 'in', 'the', 'past', '.']
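
The example above passes `pos="a"` (adjective) for every token, which is why only "closer" changes. To lemmatize verbs and nouns as well, the lemmatizer needs the matching part of speech for each word. A minimal sketch of a mapping from Penn Treebank tags to the one-letter codes the lemmatizer accepts follows; `penn_to_wordnet` is a hypothetical helper, and the tags in the example are typed in by hand rather than produced by the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter POS code WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun (the lemmatizer's default)


# Hand-tagged excerpt of the sentence above (tags are an assumption, not tagger output)
tagged = [("meant", "VBD"), ("been", "VBN"), ("closer", "JJR"), ("past", "NN")]
pos_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_codes)
# [('meant', 'v'), ('been', 'v'), ('closer', 'a'), ('past', 'n')]

# With the WordNet data installed, each pair can be passed on like this:
# lemmatizer.lemmatize(word, pos=code)
```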


## NLTK Corpora
Corpora are datasets to train your NLP algorithms with.
First, find where the nltk module is installed:

In [11]:
print(nltk.__file__)

C:\tools\Anaconda3\envs\ML\lib\site-packages\nltk\__init__.py


Now we can open the data.py file in that folder and find the paths where the NLTK datasets are stored.
Under Windows this is "appdata/nltk_data".
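
As an alternative to reading data.py, the search paths can be printed directly. This short sketch relies on `nltk.data.path`, the list of directories NLTK searches for corpora and models; which directories appear depends on your operating system and installation.

```python
import nltk

# nltk.data.path is the list of directories NLTK searches for corpora and models
for directory in nltk.data.path:
    print(directory)
```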

This folder contains many different datasets that cover a wide range of topics (from chat logs to movie ratings).
To import one into your Python project:

In [12]:
from nltk.corpus import gutenberg

sample = gutenberg.raw("shakespeare-caesar.txt")
print(sample[:54])

[The Tragedie of Julius Caesar by William Shakespeare 


### WordNet

WordNet is a corpus that we can use in NLTK. In contrast to the classic datasets provided by other corpora, WordNet
is more like a lexicon and lets us look up **words**, their meanings, synonyms and antonyms.

To work with WordNet we look a word up to get its synsets (sets of synonyms that share one meaning), which let us inspect aspects like the definition and synonyms.

In [14]:
from nltk.corpus import wordnet

# Find synsets of "Bank"
syns = wordnet.synsets("Bank")

# Get the synset
print(syns[0].name())

# Just the word
print(syns[0].lemmas()[0].name())

# Definition
print(syns[0].definition())

# Examples
print(syns[0].examples())

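# Synonyms and antonyms
# (the original line stops after "synonyms ="; collecting lemma names is an assumed completion)
synonyms = {lemma.name() for syn in syns for lemma in syn.lemmas()}
antonyms = {ant.name() for syn in syns for lemma in syn.lemmas() for ant in lemma.antonyms()}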

bank.n.01
bank
sloping land (especially the slope beside a body of water)
['they pulled the canoe up on the bank', 'he sat on the bank of the river and watched the currents']


## Getting insight


```
but_and_usage = sum(
    (count_word_usage(tokens, ["but", "and"]) for tokens in sentences)
)
```

## Creating features
With the tokenized text we can pick out certain elements of the text, for example to filter out words that are not relevant for us or to count them as a feature.\
For example, count how often "but" and "and" are used:
```
but_and_usage = sum(
    (count_word_usage(tokens, ["but", "and"]) for tokens in sentences)
)
```
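
`count_word_usage` and `sentences` are not defined anywhere in this notebook. A minimal sketch of what such a helper could look like follows, assuming it counts how many tokens of one sentence appear in a given word list (ignoring case) and that `sentences` is a list of word-tokenized sentences like the `tokenized` variable above. The two sentences are made up.

```python
def count_word_usage(tokens, word_list):
    """Hypothetical helper: count the tokens that appear in word_list, ignoring case."""
    wanted = {w.lower() for w in word_list}
    return sum(1 for token in tokens if token.lower() in wanted)


# Two made-up, already tokenized sentences
sentences = [
    ["But", "was", "there", "anything", "before", "the", "Big", "Bang", "?"],
    ["We", "have", "made", "progress", ",", "and", "now", "know", "more", ",",
     "but", "not", "all", "."],
]

but_and_usage = sum(count_word_usage(tokens, ["but", "and"]) for tokens in sentences)
print(but_and_usage)
# 3
```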

# UNDER CONSTRUCTION



# References
* https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL