# Modern Data Science
**(Module 07: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/mds](https://github.com/tulip-lab/mds/issues)

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---

# Session I - Text Classification

## Contents

1 [Supervised Classification](#Supervised)
* Gender Identification
* Choosing the Right Features
* Error Analysis
* Document Classification
* Part-of-Speech Tagging
* Exploiting Context 
* Sequence Classification
* Other Methods for Sequence Classification



2 [Further Examples of Supervised Classification](#Further)
* Sentence Segmentation
* Identifying Dialogue Act Types
* Recognizing Textual Entailment
* Scaling Up to Large Datasets


3 [Evaluation](#Evaluation)
* The Test Set / Accuracy
* Precision and Recall
* F-Measure
* Confusion Matrices


4 [Decision Trees](#Decision)


5 [Naive Bayes Classifiers](#Naive)


<a id = "Supervised"></a>

## <span style="color:#0b486b">1. Supervised Classification</span>

Classification is the task of choosing the correct class label for a given input.<BR> A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

<img src="http://www.nltk.org/images/supervised-classification.png" width="700"><BR>
<center>(Figure 1) Supervised Classification</center>
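
The train/predict flow in Figure 1 can be sketched without NLTK. This is only an illustration under stated assumptions, not how NLTK's classifiers work: the toy corpus and the names `extract_features`, `train`, and `predict` are hypothetical, and the "model" simply remembers the most frequent label for each feature value.

```python
from collections import Counter, defaultdict

def extract_features(name):
    """Feature extractor: map a raw input to a feature value (its last letter)."""
    return name[-1].lower()

def train(labeled_inputs):
    """Training: count labels per feature value and keep the most frequent one."""
    counts = defaultdict(Counter)
    for text, label in labeled_inputs:
        counts[extract_features(text)][label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}

def predict(model, text, default="unknown"):
    """Prediction: apply the same feature extractor, then look up the label."""
    return model.get(extract_features(text), default)

# Hypothetical toy training corpus (not the NLTK names corpus).
toy_corpus = [("Anna", "female"), ("Maria", "female"), ("Frank", "male"),
              ("Erik", "male"), ("Julia", "female"), ("Mark", "male")]
model = train(toy_corpus)
print(predict(model, "Olivia"), predict(model, "Jack"), predict(model, "Bob"))
```

The key point is that the same feature extractor is applied during training and during prediction.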

### Gender Identification

In [1]:
from __future__ import print_function, unicode_literals
from pprint import pprint
import nltk
from nltk.corpus import names as name2gender
import random
import sys

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [9]:
print("gender_features('Shrek'):", gender_features('Shrek'))
print("names ended with 'k':")

gender_features('Shrek'): {'last_letter': 'k'}
names ended with 'k':


In [10]:
from nltk.corpus import names

In [11]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

print('len(labeled_names):', len(labeled_names))
pprint(labeled_names[:10])

len(labeled_names): 7944
[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]


In [6]:
import random

In [12]:
random.shuffle(labeled_names)

In [13]:
featuresets = [(gender_features(name), gender) for (name,gender) in labeled_names]
train_set = featuresets[500:]
test_set = featuresets[:500] 
print('len(train_set):', len(train_set))
print('len(test_set):', len(test_set))
pprint(test_set[:10])

len(train_set): 7444
len(test_set): 500
[({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'h'}, 'female'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 's'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'l'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'n'}, 'male')]


In [14]:
classifier=nltk.NaiveBayesClassifier.train(train_set)
print()




In [15]:
print("classifier.classify(gender_features('Neo')):", classifier.classify(gender_features('Neo'))) 
print("classifier.classify(gender_features('Trinity')):", classifier.classify(gender_features('Trinity')))
print("classifier.classify(gender_features('Tony')):", classifier.classify(gender_features('Tony')))

classifier.classify(gender_features('Neo')): male
classifier.classify(gender_features('Trinity')): female
classifier.classify(gender_features('Tony')): female


In [16]:
print('accuracy:', nltk.classify.accuracy(classifier, test_set))

accuracy: 0.768


In [17]:
classifier.show_most_informative_features(10)

Most Informative Features
             last_letter = 'a'            female : male   =     35.6 : 1.0
             last_letter = 'k'              male : female =     32.0 : 1.0
             last_letter = 'f'              male : female =     15.4 : 1.0
             last_letter = 'p'              male : female =     11.3 : 1.0
             last_letter = 'd'              male : female =     10.7 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
             last_letter = 'o'              male : female =      8.9 : 1.0
             last_letter = 'm'              male : female =      8.3 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0
             last_letter = 'w'              male : female =      5.1 : 1.0


In [18]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

In [19]:
from __future__ import print_function, unicode_literals
from pprint import pprint
from nltk.classify import apply_features 
import sys

In [20]:
names = ([(name, 'male') for name in name2gender.words('male.txt')] + \
[(name,'female') for name in name2gender.words('female.txt')])
random.shuffle(names)

In [21]:
print('len(names):', len(names))
pprint(names[:10])

len(names): 7944
[('Norma', 'female'),
 ('Kerri', 'female'),
 ('Elna', 'female'),
 ('Dyna', 'female'),
 ('Jerrylee', 'female'),
 ('Vanni', 'female'),
 ('Shaina', 'female'),
 ('Cherrita', 'female'),
 ('Aamir', 'male'),
 ('Poul', 'male')]


In [22]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [23]:
print("gender_features('Shrek'):", gender_features('Shrek'))
pprint([(name, gender) for (name,gender) in names if gender_features(name)['last_letter']=='k'][:10])

gender_features('Shrek'): {'last_letter': 'k'}
[('Erik', 'male'),
 ('Brook', 'female'),
 ('Aleck', 'male'),
 ('Jack', 'male'),
 ('Izaak', 'male'),
 ('Hendrick', 'male'),
 ('Frank', 'male'),
 ('Isaak', 'male'),
 ('Hank', 'male'),
 ('Chuck', 'male')]


In [24]:
%reload_ext memory_profiler

In [25]:
%memit train_set = [(gender_features(name), gender) for (name,gender) in names][500:]
print("train_set:", type(train_set), sys.getsizeof(train_set), 'bytes')
%memit classifier=nltk.NaiveBayesClassifier.train(train_set)

peak memory: 106.22 MiB, increment: 3.75 MiB
train_set: <class 'list'> 59616 bytes
peak memory: 106.22 MiB, increment: 0.00 MiB


In [None]:
%memit train_set2 = apply_features(gender_features, names[500:])
print("train_set2:", type(train_set2), sys.getsizeof(train_set2), 'bytes')
%memit classifier=nltk.NaiveBayesClassifier.train(train_set2)

In [None]:
def list_from_last_letter(names, letter): 
    li = []
    for name, gender in names:
        if name.endswith(letter):
            li.append((name, gender))
    return li

In [None]:
print("ends with 'k'")
pprint(list_from_last_letter(names, 'k')[:10])

### Choosing the Right Features

In [None]:
from __future__ import print_function, unicode_literals
from pprint import pprint
import nltk
from nltk.corpus import names as name2gender
import random
import sys

In [None]:
names = ([(name, 'male') for name in name2gender.words('male.txt')] + \
[(name,'female') for name in name2gender.words('female.txt')])
random.shuffle(names)

In [None]:
print('len(names):', len(names))
pprint(names[:10])

In [None]:
def gender_features2(name):
    features={}
    features['firstletter']=name[0].lower()
    features['lastletter']=name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)'%letter]=name.lower().count(letter)
        features['has(%s)'%letter]=(letter in name.lower())
    return features

In [None]:
print("gender_features2('Shrek'):")
pprint(gender_features2('Shrek'))

In [None]:
featuresets=[(gender_features2(name),gender) for (name, gender) in names]
train_set=featuresets[500:]
test_set=featuresets[:500]

In [None]:
classifier=nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print('accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(100)

### Error Analysis

<img src="http://www.nltk.org/images/corpus-org.png" width="700">

The corpus is divided into a training set, a dev-test set, and a test set. Error analysis is performed on the dev-test set, so the test set stays unseen until the final evaluation.
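
The three-way split can be sketched as a small helper. This is a minimal sketch assuming Python 3; `three_way_split` and its parameters are hypothetical names, and the default sizes (500 test items, 1000 dev-test items) mirror the slices used in this section.

```python
import random

def three_way_split(items, n_test=500, n_devtest=1000, seed=0):
    """Shuffle a copy of items and split it into train / dev-test / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# 7944 is the number of labeled names reported earlier in this notebook.
train, devtest, test = three_way_split(range(7944))
print(len(train), len(devtest), len(test))
```

Because the three slices never overlap, errors inspected on the dev-test set do not leak into the final test score.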

In [None]:
from __future__ import print_function, unicode_literals
from pprint import pprint
import nltk
from nltk.corpus import names as name2gender
import random
import sys

In [None]:
names = ([(name, 'male') for name in name2gender.words('male.txt')] + \
[(name,'female') for name in name2gender.words('female.txt')])
random.shuffle(names)

In [None]:
def gender_features2(name):
    features={}
    features['firstletter']=name[0].lower()
    features['lastletter']=name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)'%letter]=name.lower().count(letter)
        features['has(%s)'%letter]=(letter in name.lower())
    return features

In [None]:
train_names=names[1500:] 
devtest_names=names[500:1500] 
# test_names=names[:500]  

In [None]:
print ("train_names: ", train_names)
print ("devtest_names: ", devtest_names)

In [None]:
train_set = [(gender_features2(n), g) for (n,g) in train_names]
devtest_set = [(gender_features2(n), g) for (n,g) in devtest_names]
# test_set = [(gender_features2(n), g) for (n,g) in test_names]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
errors = []
for (name, tag) in devtest_names:
    # use the same feature extractor the classifier was trained with
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [None]:
print("error analysis (names ending with 'n')")
for (tag, guess, name) in sorted(errors):
    if name.endswith('n'):
        print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)) # 'correct' is the gold label from the input data

### Document Classification

Document classification: sentiment analysis of movie reviews.

In [1]:
from __future__ import print_function, unicode_literals
from pprint import pprint
from nltk.corpus import movie_reviews
import random
import nltk

In [3]:
print("movie_reviews.categories():", movie_reviews.categories())

movie_reviews.categories(): []


In [4]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)  # shuffle so the test slice is not drawn from a single category

In [5]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words()) # frequency of each word in the corpus
print("len(all_words):", len(all_words))

len(all_words): 39768


In [6]:
word_features = list(all_words)[:2000]

In [7]:
print("nltk.__version__:", nltk.__version__)
if nltk.__version__.startswith('3.'):
    word_features = [k for (k,v) in all_words.most_common(2000)] # a list of common words (for nltk 3.x)
else:
    word_features = all_words.keys()[:2000] # a list of common words (for nltk 2.x)

nltk.__version__: 3.2.1


In [8]:
print("word_features:", word_features[:10], "...")

word_features: [',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in'] ...


In [9]:
# feature extractor: (document) -> (whether it contains each of the common words)
def document_features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words) 
    return features
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
print()

{'contains(smith)': False, 'contains(within)': False, 'contains(feels)': False, 'contains(thing)': False, 'contains(little)': True, 'contains(written)': False, 'contains(game)': False, 'contains(need)': False, 'contains(result)': False, 'contains(followed)': False, 'contains(dad)': False, 'contains(soundtrack)': False, 'contains(number)': False, 'contains(acts)': False, 'contains(vampires)': False, 'contains(war)': False, 'contains(impressive)': False, 'contains(indeed)': False, 'contains(fi)': False, 'contains(picture)': False, 'contains(slowly)': False, 'contains(quick)': True, 'contains(grows)': False, 'contains(want)': False, 'contains(thankfully)': False, 'contains(sitting)': False, 'contains(somehow)': False, 'contains(hard)': False, 'contains(production)': False, 'contains(william)': False, 'contains(out)': True, 'contains(music)': False, 'contains(toward)': False, 'contains(style)': False, 'contains(20)': False, 'contains(humorous)': False, 'contains(co)': False, 'contains(atte

In [10]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

In [11]:
# train the classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

ValueError: A ELE probability distribution must have at least one bin.

In [None]:
# evaluation
print('accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)  # prints its own output and returns None

In [None]:
import gc; gc.collect() # release memory.

Document classification (movie reviews), part 2: instead of all words, how can we use only (actor) names as features?

In [None]:
from __future__ import print_function, unicode_literals
from pprint import pprint
import nltk
from nltk.corpus import movie_reviews
from nltk.corpus import names as name2gender
import random

In [None]:
# input data: documents labeled positive / negative
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [None]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# print("len(all_words):", len(all_words))

In [None]:
_names = set([name.lower() for name in name2gender.words('male.txt')] +
             [name.lower() for name in name2gender.words('female.txt')])  # all known first names, lowercased

In [None]:
if nltk.__version__.startswith('3.'):
    # names that occur in the movie reviews, most frequent first (nltk 3.x)
    actor_names = [name.lower() for (name, v) in all_words.most_common() if name in _names]
else:
    # nltk 2.x: keys() yields plain strings, already sorted by decreasing frequency
    actor_names = [name.lower() for name in all_words.keys() if name in _names]

In [None]:
actor_names = actor_names[:2000] # use at most 2000 names as features
print("len(actor_names):", len(actor_names), actor_names[:100], "...")
print('jolie in actor_names:', 'jolie' in actor_names)
print()

In [None]:
# feature extractor: (document) -> (which actor names it contains)
def document_features2(document): 
    document_words = set(document)
    features = {}
    for word in actor_names:
        features['contains(%s)' % word] = (word in document_words) 
    return features

In [None]:
featuresets = [(document_features2(doc), category) for (doc, category) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
print("featuresets[0]:", list(featuresets[0][0].items())[:20], "...", featuresets[0][1])  # list() is needed in Python 3
print()

In [None]:
# classification
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
# evaluation
print('accuracy:', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)  # prints its own output and returns None

In [None]:
import gc; gc.collect() # release memory.

### Part-of-Speech Tagging


Brown Corpus POS tags: http://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

In [None]:
from __future__ import print_function, unicode_literals
from pprint import pprint
import nltk
from nltk.corpus import brown

In [None]:
# feature extraction: (word) -> (suffix); first collect the most common suffixes
suffix_fdist = nltk.FreqDist()
print("len(brown.words()):",  len(brown.words()))
for word in brown.words()[:100000]: # only the first 100,000 words, to limit memory use and run time
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
print("nltk.__version__:", nltk.__version__)
if nltk.__version__.startswith('3.'): 
    common_suffixes = [k for (k,v) in suffix_fdist.most_common(100)] # for nltk 3.x 
else:
    common_suffixes = suffix_fdist.keys()[:100] # for nltk 2.x
suffix_fdist=None
print("common_suffixes:", common_suffixes)    
print()

In [None]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

In [None]:
# Test
def pos_features_print(word): # print only the features whose value is True
    print("pos_features('"+word+"'):", [(k, v) for (k, v) in pos_features(word).items() if v is True])

In [None]:
pos_features_print('studied')      
print()

In [None]:
tagged_words = brown.tagged_words(categories='news')
print("len(tagged_words):", len(tagged_words))
tagged_words = tagged_words[:10000] # keep only the first 10,000 tagged words
print("tagged_words:", tagged_words[:10], "...")
print()

In [None]:
featuresets = [(pos_features(word), tag) for (word, tag) in tagged_words]
size = int(len(featuresets) * 0.1) # test set size
train_set, test_set = featuresets[size:], featuresets[:size]
tagged_words = None
print("featuresets:")
pprint(featuresets[0])
print()

In [None]:
classifier = nltk.DecisionTreeClassifier.train(train_set)

In [None]:
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
print()

In [None]:
print("classifier.classify(pos_features('cats')):", classifier.classify(pos_features('cats'))) # NNS = plural noun
print()

In [None]:
print(classifier.pseudocode(depth=4))
print(classifier.pretty_format(depth=4))  # called pp() in nltk 2.x

In [None]:
import gc; gc.collect() # release memory.

### Exploiting Context 

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk
from nltk.corpus import brown

In [None]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [None]:
print("brown.sents()[0][7]:", brown.sents()[0][7])
print("brown.sents()[0][8]:", brown.sents()[0][8])
print("pos_features(brown.sents()[0], 8):", pos_features(brown.sents()[0], 8))
print()

In [None]:
tagged_sents = brown.tagged_sents(categories='news')
print("tagged_sents[0]:", tagged_sents[0])
print("nltk.tag.untag(tagged_sents[0]):", nltk.tag.untag(tagged_sents[0]))
print()

In [None]:
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
print("train_set[0]:", train_set[0])

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("featuresets[0]:", featuresets[0])
print()

In [None]:
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
print()

In [None]:
?nltk.NaiveBayesClassifier

### Sequence Classification

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk
from nltk.corpus import brown

In [None]:
def pos_features(sentence, i, history):
     features = {"suffix(1)": sentence[i][-1:],
                 "suffix(2)": sentence[i][-2:],
                 "suffix(3)": sentence[i][-3:]}
     if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
     else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
     return features

In [None]:
# tagger definition
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))  # a list, since zip() is lazy in Python 3

In [None]:
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)

In [None]:
print("accuracy:", tagger.evaluate(test_sents))

In [None]:
?nltk.TaggerI

### Other Methods for Sequence Classification

Hidden Markov Model (HMM) <BR> 
Maximum Entropy Markov Model (MEMM) <BR> 
Linear-Chain Conditional Random Field Model (CRF) <BR>

<a id = "Further"></a>

## <span style="color:#0b486b">2. Further Examples of Supervised Classification</span>

### Sentence Segmentation

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk

In [None]:
# build the input data: (token list, sentence-boundary positions)
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()  # index of the last token of each sentence (0-based)
offset = 0
for sent in sents:    
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [None]:
print("len(sents):", len(sents), sents[0:3], "...")
print()
print("len(tokens):", len(tokens), tokens[0:30], "...")
print()
print("len(boundaries):", len(boundaries), sorted(list(boundaries))[0:10], "...")
print()

In [None]:
# feature extractor: (token list, index of a punctuation token) -> features of the surrounding tokens
def punct_features(tokens, i):
    try:
        return {'next-word-capitalized': tokens[i+1][0].isupper(),
                'prevword': tokens[i-1].lower(),
                'punct': tokens[i],
                'prev-word-is-one-char': len(tokens[i-1]) == 1}
    except IndexError:  # no next token at the end of the text
        return {'next-word-capitalized': False,
                'prevword': '',
                'punct': tokens[i],
                'prev-word-is-one-char': False}

In [None]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']
print("featuresets:", featuresets[0])
print()

In [None]:
# split the feature sets into a training set and a test set
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
print("train_set[0]:", train_set[0])
print()

In [None]:
# classification
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
# evaluation
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
print()

In [None]:
# sentence segmenter: split a token list into sentences using the trained classifier
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

In [None]:
# test the segmenter on the first 10 sentences
sents = nltk.corpus.treebank_raw.sents()[:10]
words=[]
for s in sents:
    words.extend(s)
# print("words:", words)
# print()
print("correct:\n", '\n'.join([' '.join(s) for s in sents ]))
print()
print("guess:\n", '\n'.join([' '.join(s) for s in segment_sentences(words)]))
print()

### Identifying Dialogue Act Types  

Each post is labeled with one of 15 dialogue act types, such as "Statement", "Emotion", "ynQuestion", and "Continuer". The full list:

Accept, Bye, Clarify, Continuer, Emotion, Emphasis, Greet, No Answer, Other, Reject, Statement, System, Wh-Question, Yes Answer, Yes/No Question.

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk

In [None]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
print("posts[0]:", posts[0].text)
print()

In [None]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

In [None]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("featuresets[0]:", featuresets[0])
print()

In [None]:
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(dialogue_act_features("My name is Hyewoong")))
print(classifier.classify(dialogue_act_features("What a beautiful girl")))
print(classifier.classify(dialogue_act_features("Do you want my love")))


### Recognizing Textual Entailment

Challenge 3, Pair 34 (True) <BR> <BR> T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.<BR> <BR> H: China is a member of SCO.<BR> <BR> <BR> <BR> Challenge 3, Pair 81 (False)<BR> <BR> T: According to NC Articles of Organization, the members of LLC company are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.<BR> <BR> H: Jennie Beavers Stewart is a share-holder of Carolina Analytical Laboratory.<BR>

In [None]:
from __future__ import print_function, unicode_literals  
from pprint import pprint
import nltk

In [None]:
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

In [None]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
print("rtepair:", rtepair.__dict__)
print()
print("text:", rtepair.text)
print()
print("hypothesis(=keyword) :", rtepair.hyp)
print()

In [None]:
extractor = nltk.RTEFeatureExtractor(rtepair)
print("text_words:", extractor.text_words) 
print("overlap('word'):", extractor.overlap('word'))
print("overlap('ne')", extractor.overlap('ne'))
print("hyp_words:", extractor.hyp_words)
print("hyp_extra('word'):", extractor.hyp_extra('word'))

In [None]:
help(extractor.overlap)    # help() prints the documentation itself and returns None
help(extractor.hyp_extra)

### Scaling Up to Large Datasets

We recommend that you explore NLTK's facilities for interfacing with external machine learning packages <BR> ... to train classifier models significantly faster than the pure-Python classifier implementation.

<a id = "Evaluation"></a>

## <span style="color:#0b486b">3. Evaluation</span>

### The Test Set / Accuracy

The test set must be distinct from the training corpus; otherwise a model that simply memorized its input would score misleadingly well. <BR> It is common to err on the side of safety by using 10% of the overall data for evaluation.
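
Accuracy itself is just the fraction of test items whose predicted label matches the gold label. A minimal sketch, assuming Python 3; `accuracy` here is a hypothetical helper for illustration, separate from `nltk.classify.accuracy`, and the tag lists are made up.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

# Made-up gold and predicted tags for four tokens.
toy_gold_tags = ["NN", "VB", "NN", "DT"]
toy_predicted_tags = ["NN", "NN", "NN", "DT"]
print(accuracy(toy_gold_tags, toy_predicted_tags))
```

Three of the four predictions match, so the score is 0.75.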

In [None]:
from __future__ import print_function, unicode_literals  
from pprint import pprint
import nltk
import random
from nltk.corpus import brown

In [None]:
# feature extractor: (sentence, index, tag history) -> (suffixes, previous word, previous tag)
def pos_features(sentence, i, history):
     features = {"suffix(1)": sentence[i][-1:],
                 "suffix(2)": sentence[i][-2:],
                 "suffix(3)": sentence[i][-3:]}
     if i == 0:
         features["prev-word"] = "<START>"
         features["prev-tag"] = "<START>"
     else:
         features["prev-word"] = sentence[i-1]
         features["prev-tag"] = history[i-1]
     return features

In [None]:
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))  # a list, since zip() is lazy in Python 3


In [None]:
# Approach 1 of 3 (not recommended): random split at the sentence level.
# random.shuffle() puts sentences from the same document into both the training
# and the test set, so the two sets are very similar and the accuracy is hard to interpret.
tagged_sents = list(brown.tagged_sents(categories='news'))
print("tagged_sents[0]:", tagged_sents[0])
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size] 
tagger = ConsecutivePosTagger(train_sents)
print('Accuracy: %4.2f' % tagger.evaluate(test_sents))
print()


In [None]:
train_sents = brown.tagged_sents(categories='news')
test_sents = brown.tagged_sents(categories='fiction')
tagger = ConsecutivePosTagger(train_sents)
print('Accuracy: %4.2f' % tagger.evaluate(test_sents))
print()

In [None]:
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_sents = brown.tagged_sents(file_ids[size:])
test_sents = brown.tagged_sents(file_ids[:size])
tagger = ConsecutivePosTagger(train_sents)
print('Accuracy: %4.2f' % tagger.evaluate(test_sents))
print()

### Precision and Recall 

<img src="http://www.nltk.org/images/precision-recall.png" width="700">
<img src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png" width="700">
<img src="https://fbcdn-sphotos-c-a.akamaihd.net/hphotos-ak-xpa1/v/t1.0-9/10991051_844288612293942_8690474408857494396_n.jpg?oh=f4a68cc3875ebea360d2e2fbb1db68f8&oe=554DA29E&__gda__=1434765606_73492ef515b8cf34ddc9a82af0aff2d4" width="700">
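
Precision and recall can be computed directly from two sets: the items that are truly relevant and the items the system retrieved. A minimal sketch, assuming Python 3; `precision_recall` and the document ids are hypothetical and used only for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set of retrieved items."""
    relevant, retrieved = set(relevant), set(retrieved)
    true_positives = len(relevant & retrieved)
    # precision: how many retrieved items are relevant
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # recall: how many relevant items were retrieved
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

toy_relevant = {"d1", "d2", "d3", "d4"}
toy_retrieved = {"d1", "d2", "d5"}
p, r = precision_recall(toy_relevant, toy_retrieved)
print("precision:", p, "recall:", r)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).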

### F-Measure (F-Score, F1 score)

http://en.wikipedia.org/wiki/F1_score <BR> <img src="http://upload.wikimedia.org/math/9/9/1/991d55cc29b4867c88c6c22d438265f9.png">
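
The F1 score is the harmonic mean of precision $P$ and recall $R$: $F_1 = 2PR / (P + R)$. A minimal sketch, assuming Python 3; `f_measure` is a hypothetical helper, and the `beta` parameter (the general F-beta form, with `beta=1` giving F1) is an addition for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(2.0 / 3.0, 0.5))
```

With precision 2/3 and recall 1/2, F1 is 4/7, which lies between the two values but closer to the smaller one.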

### Confusion Matrices

In [None]:
?nltk.UnigramTagger
?nltk.BigramTagger

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk
from nltk.corpus import brown

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

In [None]:
gold = tag_list(brown.tagged_sents(categories='editorial')) # gold-standard tags from the 'editorial' category

In [None]:
t0 = nltk.DefaultTagger('NN')
test = tag_list(apply_tagger(t0, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print("nltk.DefaultTagger('NN'):")
print(cm)
# print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
print()

In [None]:
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
test = tag_list(apply_tagger(t1, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print("nltk.UnigramTagger(train_sents):")
print(cm)
# print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
print()

In [None]:
t2 = nltk.BigramTagger(train_sents, backoff=t1)
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print("nltk.BigramTagger(train_sents):")
print(cm)
# print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
print()
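
A confusion matrix is a table of counts indexed by (gold tag, predicted tag). The idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not `nltk.ConfusionMatrix`: `confusion_counts` and the toy tag lists are hypothetical.

```python
from collections import Counter

def confusion_counts(gold_tags, predicted_tags):
    """Count how often each (gold, predicted) pair of tags occurs."""
    return Counter(zip(gold_tags, predicted_tags))

toy_gold = ["NN", "NN", "VB", "DT", "NN", "VB"]
toy_pred = ["NN", "VB", "VB", "DT", "NN", "NN"]
toy_cm = confusion_counts(toy_gold, toy_pred)

# Off-diagonal cells (gold != predicted) are the tagging errors.
toy_errors = [(pair, n) for pair, n in toy_cm.most_common() if pair[0] != pair[1]]
for (gold_tag, predicted_tag), n in toy_errors:
    print("gold=%-3s predicted=%-3s count=%d" % (gold_tag, predicted_tag, n))
```

Diagonal cells such as `("NN", "NN")` count correct decisions; sorting the off-diagonal cells by count shows which confusions matter most.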

<a id = "Decision"></a>

## <span style="color:#0b486b">4. Decision Trees</span>

<img src="http://www.nltk.org/images/decision-tree.png" width="700">

### Entropy and Information Gain

$$H = -\sum_{l \in \text{labels}} P(l) \times \log_2 P(l)$$

<img src="http://www.nltk.org/images/Binary_entropy_plot.png" width="500"> <BR>

In [None]:
from __future__ import print_function, unicode_literals 
from pprint import pprint
import nltk
import math

In [None]:
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]  # relative frequency of each label
    return -sum(p * math.log(p, 2) for p in probs)

In [None]:
print("entropy(['male', 'male', 'male', 'male']):", entropy(['male', 'male', 'male', 'male']))
print("entropy(['male', 'female', 'male', 'male']):", entropy(['male', 'female', 'male', 'male']))
print("entropy(['female', 'male', 'female', 'male']):", entropy(['female', 'male', 'female', 'male']))
print("entropy(['female', 'female', 'male', 'female']):", entropy(['female', 'female', 'male', 'female']))
print("entropy(['female', 'female', 'female', 'female']):", entropy(['female', 'female', 'female', 'female']))
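
Information gain measures how much the label entropy drops when the data is split on a feature: the entropy of all labels minus the size-weighted average entropy of the groups produced by the split. A minimal sketch, assuming Python 3; `label_entropy` and `information_gain` are hypothetical standard-library helpers, and the toy labels and feature values are made up.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * label_entropy(g) for g in groups.values())
    return label_entropy(labels) - remainder

toy_labels = ["male", "male", "female", "female"]
print(information_gain(toy_labels, ["k", "k", "a", "a"]))  # feature separates the labels
print(information_gain(toy_labels, ["k", "a", "k", "a"]))  # feature tells us nothing
```

A decision tree learner would pick the feature with the highest gain at each node; in this toy data the first feature separates the labels perfectly and the second does not help at all.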

<a id = "Naive"></a>

## <span style="color:#0b486b">5. Naive Bayes Classifiers</span>

<img src="http://www.nltk.org/images/naive-bayes-triangle.png" width="700">

<img src="http://www.nltk.org/images/naive_bayes_bargraph.png" width="700">

### Underlying Probabilistic Model

<img src="http://www.nltk.org/images/naive_bayes_graph.png" width="700">
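
In this model a label is chosen by maximizing $P(label) \times \prod_f P(f \mid label)$, treating the features as independent given the label. A minimal from-scratch sketch, assuming Python 3; `TinyNaiveBayes` and the toy data are hypothetical, and the add-one smoothing used here is a simplification that differs from the estimator NLTK uses.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Illustrative naive Bayes over dict-style feature sets (not NLTK's class)."""

    def __init__(self, labeled_featuresets):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
        self.values_seen = defaultdict(set)       # feature name -> values seen in training
        for features, label in labeled_featuresets:
            self.label_counts[label] += 1
            for fname, fval in features.items():
                self.value_counts[(label, fname)][fval] += 1
                self.values_seen[fname].add(fval)

    def log_scores(self, features):
        """log P(label) + sum of log P(feature value | label), per label."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)
            for fname, fval in features.items():
                count = self.value_counts[(label, fname)][fval]
                n_values = len(self.values_seen[fname]) or 1
                score += math.log((count + 1) / (n_label + n_values))  # add-one smoothing
            scores[label] = score
        return scores

    def classify(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)

toy_train = [({"last_letter": "a"}, "female"), ({"last_letter": "a"}, "female"),
             ({"last_letter": "e"}, "female"), ({"last_letter": "k"}, "male"),
             ({"last_letter": "k"}, "male"), ({"last_letter": "o"}, "male")]
toy_model = TinyNaiveBayes(toy_train)
print(toy_model.classify({"last_letter": "a"}), toy_model.classify({"last_letter": "k"}))
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied together.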