# Text classification
- Bag of words feature extraction
- Training a Naive Bayes classifier
- Training a decision tree classifier
- Training a maximum entropy classifier
- Training scikit-learn classifiers
- Measuring precision and recall of a classifier
- Calculating high information words
- Combining classifiers with voting
- Classifying with multiple binary classifiers
- Training a classifier with NLTK-Trainer

## Bag of words feature extraction

- The `bag of words` model is the simplest method
- It constructs a word presence feature set from all the words of an instance
- This method doesn't care about the order of the words
- This method doesn't care about how many times a word occurs

In [2]:
def bag_of_words(words):
    # Map each distinct word to True: a word-presence feature set
    return {word: True for word in words}
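
To make the last two points concrete, here is a minimal sketch using a self-contained copy of `bag_of_words`. The token lists are made up for illustration; reordering or repeating words leaves the feature set unchanged.

```python
def bag_of_words(words):
    # Map every distinct word to True; order and counts are discarded.
    return {word: True for word in words}

# Hypothetical token lists, for illustration only.
first = bag_of_words(["the", "quick", "brown", "fox"])
second = bag_of_words(["fox", "the", "fox", "brown", "quick", "the"])

# Same words present, so the two feature sets compare equal.
print(first == second)  # True
```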

### How to improve the bag of words method? We can...

- Construct a set of words that we want to exclude
 - For example, filter out stopwords
- Include significant bigrams
 - In addition to single words, it often helps to include significant bigrams

In [5]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

def bag_of_non_stopwords(words, stopfile = 'english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)


bag_of_non_stopwords(word_tokenize("the quick brown fox"))

{'brown': True, 'quick': True, 'fox': True}

In [8]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_bigrams_words(words, score_fn = BigramAssocMeasures.chi_sq, n = 200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)

bag_of_bigrams_words(word_tokenize("the quick brown fox"))

{'the': True,
 'quick': True,
 'brown': True,
 'fox': True,
 ('brown', 'fox'): True,
 ('quick', 'brown'): True,
 ('the', 'quick'): True}

## Training a Naive Bayes classifier

- Use `Bayes' theorem` to predict the probability that a given feature set belongs to a particular label.
 - The formula is:  
   P(label | features) = P(label) * P(features | label) / P(features)  
   > P(label):  
       The prior probability of the label occurring, which is the likelihood that a random feature set will have the label.  
     P(features | label):  
       The likelihood of observing the given feature set when the label is known.  
     P(features):  
       The overall probability of the given feature set occurring, regardless of label.  
     P(label | features):  
       The posterior probability that the given features should have that label.
- Note: the `split_label_feats` function should be modified.
- During training, the `NaiveBayesClassifier` class constructs probability distributions for each feature using an `estimator` parameter.  
 - Default estimator: nltk.probability.ELEProbDist  
  - ELE stands for Expected Likelihood Estimate
  - The formula for calculating the label probabilities for a given feature is (`c` + 0.5) / (`N` + `B` / 2)  
   > `c`: count of times a single feature occurs  
     `N`: the total number of feature outcomes observed  
     `B`: the number of bins or unique features in the feature set  
 
 - Other choices  
  - We can pass any estimator class as the `estimator` parameter, and there are quite a few to choose from.
  - The only constraints are that it must inherit from `nltk.probability.ProbDistI` and its constructor must take a `bins` keyword argument.
  - For example:
   - Use the `LaplaceProbDist` class instead, which uses the formula (`c` + 1) / (`N` + `B`)  
   ```
   from nltk.probability import LaplaceProbDist  
   nb_classifier = NaiveBayesClassifier.train(train_feats, estimator = LaplaceProbDist)  
   accuracy(nb_classifier, test_feats)
   ```
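
The two estimator formulas above can be compared with plain arithmetic. The sketch below does not call NLTK: `ele_prob` and `laplace_prob` are hypothetical helper names, and the counts are made up for illustration.

```python
def ele_prob(c, n, b):
    """Expected Likelihood Estimate: (c + 0.5) / (N + B / 2)."""
    return (c + 0.5) / (n + b / 2)

def laplace_prob(c, n, b):
    """Laplace estimate: (c + 1) / (N + B)."""
    return (c + 1) / (n + b)

# Hypothetical counts: a feature value seen 3 times in 10 outcomes, with 2 bins.
c, n, b = 3, 10, 2
print(ele_prob(c, n, b))      # (3 + 0.5) / (10 + 1)
print(laplace_prob(c, n, b))  # (3 + 1) / (10 + 2)

# A value that was never observed (c = 0) still gets a non-zero probability.
print(ele_prob(0, n, b), laplace_prob(0, n, b))
```

Both estimators smooth the raw counts. Laplace adds a full count per bin, so it pulls the estimate further toward a uniform distribution than ELE does.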

In [16]:
import collections

def label_feats_from_corpus(corp, feature_detector = bag_of_words):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories = [label]):
            feats = feature_detector(corp.words(fileids = [fileid]))
            label_feats[label].append(feats)
    return label_feats

def split_label_feats(lfeats, split = 0.75):
    train_feats = list()
    test_feats = list()
    for label, feats in lfeats.items():
        cutoff = int(len(feats)* split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats

In [20]:
from nltk.corpus import movie_reviews

from nltk.tokenize import word_tokenize

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

movie_reviews.categories()
lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()
train_feats, test_feats = split_label_feats(lfeats, split = 0.75)
nb_classifier = NaiveBayesClassifier.train(train_feats)
print('This Naive Bayes classifier has labels: ', nb_classifier.labels())

comments = ["the plot was ludicrous", "kate winslet is accessible"]
for comment in comments:
    print('comment: ', comment)
    print('predicted category: ', nb_classifier.classify(bag_of_words(word_tokenize(comment))))

# test the accuracy
accuracy(nb_classifier, test_feats)

This Naive Bayes classifier has labels:  ['neg', 'pos']
comment:  the plot was ludicrous
predicted category:  neg
comment:  kate winslet is accessible
predicted category:  pos


0.728

### Classification probability

In [28]:
probs = nb_classifier.prob_classify(test_feats[0][0])
probs.samples()
for key in probs.samples():
    print('prob[{}] = {:.10f}'.format(key, probs.prob(key)))

prob[neg] = 0.0000000354
prob[pos] = 0.9999999646
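
To see how probabilities like these follow from P(label | features) = P(label) * P(features | label) / P(features), here is a toy calculation in plain Python. The priors and likelihoods are invented numbers for a single hypothetical feature, not values taken from the trained classifier.

```python
# Hypothetical numbers for one feature; not taken from nb_classifier.
prior = {"pos": 0.5, "neg": 0.5}         # P(label)
likelihood = {"pos": 0.01, "neg": 0.12}  # P(features | label)

# P(features) = sum over labels of P(label) * P(features | label)
evidence = sum(prior[label] * likelihood[label] for label in prior)

# P(label | features) = P(label) * P(features | label) / P(features)
posterior = {
    label: prior[label] * likelihood[label] / evidence
    for label in prior
}

for label, p in posterior.items():
    print('prob[{}] = {:.4f}'.format(label, p))
```

Dividing by P(features) makes the posteriors sum to 1, which is why the two probabilities printed by `prob_classify()` are complementary.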


### Most informative features

- most_informative_features()
- show_most_informative_features()

In [30]:
nb_classifier.most_informative_features(n = 10)
nb_classifier.show_most_informative_features(n = 10)

Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0


### Alternative: Manual training

In [31]:
from nltk.probability import DictionaryProbDist

label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5})
true_probdist = DictionaryProbDist({True: 1})
feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist}
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

## Training a decision tree classifier

In [32]:
from nltk.classify import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier.train(train_feats,
                                             binary = True,
                                             entropy_cutoff = 0.8,
                                             depth_cutoff = 5,
                                             support_cutoff = 30)
accuracy(dt_classifier, test_feats)

0.678

### How to use the `DecisionTreeClassifier.train()` class method?
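#### Telling the classifier whether the features are binary
- `binary = True` fits the example above because every word feature is binary: the word is either present or it is not
- This should not be confused with a binary classifier, which is a classifier that only chooses between two labels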
- If we have multivalued features, we will want to stick to the default `binary = False`

#### Controlling uncertainty with `entropy_cutoff`
- Entropy is the uncertainty of the outcome.  
 - As entropy approaches 1.0, uncertainty increases  
 - As entropy approaches 0.0, uncertainty decreases

- The entropy_cutoff value is used during the tree refinement process  
- If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff value, then the tree is refined further by creating more branches.  
- If the entropy is lower than the entropy_cutoff value, then tree refinement is halted.  
- Entropy is calculated by passing `nltk.probability.entropy()` an `MLEProbDist` created from a `FreqDist` of label counts.
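
The entropy of a set of label counts can be sketched without NLTK. The helper below, `label_entropy`, is a hypothetical name; it assumes base-2 logarithms and a maximum-likelihood distribution (count / total) over the labels, and the counts are made up.

```python
import math

def label_entropy(label_counts):
    """Shannon entropy (base 2) of the maximum-likelihood label distribution."""
    total = sum(label_counts.values())
    # Labels with a zero count contribute nothing and are skipped.
    probs = [count / total for count in label_counts.values() if count > 0]
    return sum(p * math.log2(1 / p) for p in probs)

print(label_entropy({"pos": 50, "neg": 50}))  # evenly split: highest uncertainty for two labels
print(label_entropy({"pos": 90, "neg": 10}))  # mostly one label: lower entropy
print(label_entropy({"pos": 100, "neg": 0}))  # a single label: no uncertainty
```

With `entropy_cutoff = 0.8`, only the first of these three hypothetical nodes would be uncertain enough to refine further.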

#### Controlling tree depth with `depth_cutoff`
- The final decision tree will never be deeper than the `depth_cutoff` value  
 - Default value is 100

#### Controlling decisions with `support_cutoff`
- The `support_cutoff` value controls how many labeled feature sets are required to refine the tree
- Labeled feature sets are eliminated once they no longer provide value to the training process
- When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree
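
The three cutoffs can be read together as one stopping rule. The function below is a simplified sketch of that rule, not NLTK's implementation: `should_refine` and its parameter names are hypothetical, and the defaults simply reuse the values passed to `DecisionTreeClassifier.train()` above.

```python
def should_refine(node_entropy, num_featuresets, depth,
                  entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30):
    """Return True if a tree node should be split further (simplified sketch)."""
    if num_featuresets <= support_cutoff:
        # Too few labeled feature sets reach this node.
        return False
    if depth >= depth_cutoff:
        # The tree is already as deep as allowed.
        return False
    # Refine only while the label distribution is still uncertain.
    return node_entropy > entropy_cutoff

print(should_refine(0.95, num_featuresets=500, depth=0))  # uncertain, well supported, shallow
print(should_refine(0.95, num_featuresets=30, depth=0))   # stopped by support_cutoff
print(should_refine(0.50, num_featuresets=500, depth=0))  # stopped by entropy_cutoff
print(should_refine(0.95, num_featuresets=500, depth=5))  # stopped by depth_cutoff
```

Lowering `entropy_cutoff` or `support_cutoff`, or raising `depth_cutoff`, lets the tree grow larger, which generally costs more training time.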

## Training a maximum entropy classifier

## Training scikit-learn classifiers

## Measuring precision and recall of a classifier

## Calculating high information words

## Combining classifiers with voting

## Classifying with multiple binary classifiers

## Training a classifier with NLTK-Trainer