# NLTK: part 3
## 01. Named Entity Recognition with NLTK

* ref: 
    - [https://pythonprogramming.net](https://pythonprogramming.net)
    - [https://nanonets.com](https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/#what-is-named-entity-recognition)

There are two major options with NLTK's named entity recognition: 
- either recognize `all named entities`, 
- recognize named entities as their respective type, like `people`, `places`, `locations`, etc.

In [1]:
# Step One: Import nltk and download necessary packages
 
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import state_union


[nltk_data] Downloading package punkt to /home/pi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/pi/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/pi/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
# Step Two: train the model
train_text = state_union.raw("2005-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [6]:
# Step three: Named Entity Recognition
 
sentence = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

tokenized = custom_sent_tokenizer.tokenize(sentence)
# Step Three: Tokenise, find parts of speech and chunk words 
def process_content():
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            for chunk in nltk.ne_chunk(tagged):
                if hasattr(chunk, 'label'):
                    print(chunk.label(), ' '.join(c[0] for c in chunk))
    except Exception as e:
        print(str(e))

process_content()

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn


## 02. Text Classification with NLTK

In [18]:
import random
from nltk.corpus import movie_reviews

# In each category (we have pos or neg), take all of the file IDs (each review has its own ID), 
# then store the word_tokenized version (a list of words) for the file ID, 
# followed by the positive or negative label in one big list. 
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

#  random to shuffle our documents. This is because we're going to be training and testing. 
random.shuffle(documents)
# sample word_tokenized version of a review
# where the first element is a list the words, and the 2nd element is the "pos" or "neg" label.
print('Sample {} review words: \n{}'.format(documents[1][1],documents[1][0]),'\n')

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print('The most common words and their counts: \n{}'.format(all_words.most_common(15)))


Sample pos review words: 
['you', "'", 've', 'probably', 'heard', 'the', 'one', 'about', 'the', 'priest', 'and', 'the', 'rabbi', ',', 'but', 'never', 'with', 'the', 'same', 'dosage', 'of', 'featherweight', 'charm', 'that', 'is', 'sprinkled', 'over', '`', 'keeping', 'the', 'faith', "'", '.', 'it', "'", 's', 'a', 'fluffy', 'comedy', ',', 'thoroughly', 'glazed', 'with', 'a', 'sense', 'of', 'innocuous', 'innocence', 'and', 'good', 'cheer', ',', 'regarding', 'two', 'moral', 'topics', '--', 'love', 'and', 'religion', '--', 'and', 'how', 'a', 'romantic', 'triangle', 'causes', 'the', 'two', 'to', 'collide', 'head', '-', 'on', '.', 'as', 'youngsters', ',', 'brian', 'finn', ',', 'jacob', 'schramm', 'and', 'anna', 'reilly', 'were', 'an', 'inseparable', 'trio', '.', 'while', 'their', 'friendship', 'progressed', ',', 'anna', 'always', 'had', 'the', 'compassion', 'to', 'shower', 'them', 'both', 'with', 'the', 'same', 'love', 'and', 'support', ',', 'so', 'neither', 'would', 'feel', 'excluded', '.', '

## 03. Converting words to Features with NLTK
we're going to be building off the previous step and `compiling feature lists` of words from `positive reviews` and words from the `negative reviews` to hopefully `see trends in specific types of words` in positive or negative reviews.

In [21]:
# word_features: contains the top 3,000 most common words.
word_features = list(all_words.keys())[:3000]

# find these top 3,000 words in our positive and negative documents,
# marking their presence as either positive or negative
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features




In [24]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]




```sh
Now that we have our `features and labels`, what is next? Typically the next step is to go ahead and train a `Naive Bayes classifier` !
```

## 04. Naive Bayes Classifier with NLTK


In [30]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]

# Next, we can define, and train our classifier like:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [31]:
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

Classifier accuracy percent: 85.0


In [32]:
# what the most valuable words are when it comes to positive or negative reviews:

classifier.show_most_informative_features(15)

Most Informative Features
                   sucks = True              neg : pos    =     10.5 : 1.0
                  annual = True              pos : neg    =      9.8 : 1.0
                 frances = True              pos : neg    =      9.1 : 1.0
                  justin = True              neg : pos    =      8.9 : 1.0
                 saddled = True              neg : pos    =      8.2 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
                 idiotic = True              neg : pos    =      7.1 : 1.0
              schumacher = True              neg : pos    =      6.9 : 1.0
                  shoddy = True              neg : pos    =      6.9 : 1.0
             silverstone = True              neg : pos    =      6.9 : 1.0
                  regard = True              pos : neg    =      6.7 : 1.0
                  alicia = True              neg : pos    =      6.5 : 1.0
               atrocious = True              neg : pos    =      6.5 : 1.0