# Learning to Classify Text

### Supervised Classification
Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

>Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."

>Deciding whether a given occurrence of the word "bank" is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-class classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

In [1]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random
from nltk.corpus import brown, treebank_raw, movie_reviews, nps_chat

##### Gender Identification
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('shrek')

{'last_letter': 'k'}

The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature, as in the example 'last_letter'. Feature values are values with simple types, such as booleans, numbers, and strings.

>Most classification methods require that features be encoded using simple value types, such as booleans, numbers, and strings. But note that just because a feature has a simple type, this does not necessarily mean that the feature's value is simple to express or compute. Indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.

Here, gender_features is the feature extractor. To train a classifier with it, we first build a list of labeled names:

In [3]:
labeled_names = [(name, 'male') for name in names.words('male.txt')
                ] + [(name, 'female') for name in names.words('female.txt')]

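We then use the feature extractor to process the names data, divide the resulting list of feature sets into a training set and a test set, and train a naive Bayes classifier on the training set. Note that labeled_names is not shuffled here, so the first 500 entries, which form the test set, are all male names; shuffling the list with random.shuffle before splitting gives a more representative test set.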

In [4]:
features_set = [(gender_features(n), gender) for n, gender in labeled_names]
train_set, test_set = features_set[500:], features_set[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [5]:
train_set[:10]

[({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'e'}, 'male')]

In [6]:
classifier.classify(gender_features('Neo'))

'male'

In [7]:
classifier.classify(gender_features('Mohit'))

'male'

In [8]:
classifier.classify(gender_features('Swaraj'))

'female'

In [9]:
classifier.classify(gender_features('Siddharth'))

'female'

In [10]:
nltk.classify.accuracy(classifier, test_set)

0.602

Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [11]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0


This listing shows that the names in the training set that end in "a" are female about 35 times more often than they are male, but names that end in "k" are male about 34 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
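As a rough illustration of where such ratios come from, the sketch below computes P(feature value | label A) / P(feature value | label B) from raw counts on a tiny hand-made dataset. The data and the helper name `likelihood_ratio` are assumptions made for illustration, and the estimate uses unsmoothed relative frequencies, so it will not reproduce the smoothed numbers that NLTK prints.

```python
from collections import Counter

# Toy (last_letter, label) observations; illustrative only, not the names corpus.
toy_data = ([('a', 'female')] * 6 + [('a', 'male')] * 1 +
            [('k', 'male')] * 4 + [('e', 'female')] * 4 + [('e', 'male')] * 5)

def likelihood_ratio(data, value, label_a, label_b):
    """Return P(value | label_a) / P(value | label_b) from raw relative frequencies."""
    label_counts = Counter(label for _, label in data)
    pair_counts = Counter(data)
    p_a = pair_counts[(value, label_a)] / label_counts[label_a]
    p_b = pair_counts[(value, label_b)] / label_counts[label_b]
    return p_a / p_b

ratio = likelihood_ratio(toy_data, 'a', 'female', 'male')
print('last_letter = a   female : male = {:.1f} : 1.0'.format(ratio))
```

With unsmoothed counts the ratio is undefined when a value never occurs with the second label, which is one reason practical classifiers smooth their estimates.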

Process to make a simple classifier:
1. Write a feature extractor that maps an input to a dictionary of feature names and feature values.
2. If you have labeled data, apply the feature extractor to each input and store the resulting feature set together with its label as a pair (tuple).
3. Use this list of (feature set, label) pairs to train a model.

In [12]:
def g_features(word):
    return {'length':len(word), 'first_letter':word[0], 'last_letter':word[-1]}

In [13]:
label_data = [(g_features(n), gender) for n, gender in labeled_names]

In [14]:
# Shuffle so that both genders appear in the training and test portions,
# then use the first 90% for training and the remaining 10% for testing.
random.shuffle(label_data)
split = int(len(label_data) * 0.9)
training_set, testing_set = label_data[:split], label_data[split:]
cf = nltk.NaiveBayesClassifier.train(training_set)

In [None]:
print(training_set[-5:])

In [None]:
cf.classify(g_features('Mohit'))

In [None]:
cf.classify(g_features('Swaraj'))

In [None]:
cf.classify(g_features('Neo'))

In [None]:
cf.classify(g_features('Bunty'))

In [None]:
cf.classify(g_features('Soumya'))

In [None]:
nltk.classify.accuracy(cf, testing_set)

In [None]:
cf.show_most_informative_features(5)

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:
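The idea of a list-like object that computes feature sets on demand can be sketched in a few lines. The class below, `LazyFeatureList`, is a hypothetical illustration and not NLTK's implementation; it only supports `len()`, integer indexing, and iteration.

```python
class LazyFeatureList:
    """Hypothetical list-like view that computes (features, label) pairs on demand."""

    def __init__(self, extractor, labeled_items):
        self._extractor = extractor
        self._items = labeled_items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Only the requested item is processed; nothing is cached.
        item, label = self._items[index]
        return (self._extractor(item), label)

def last_letter(word):
    return {'last_letter': word[-1]}

lazy = LazyFeatureList(last_letter, [('Anna', 'female'), ('Mark', 'male')])
first = lazy[0]
print(first)
```

The trade-off is time for memory: each access recomputes the features, so a training algorithm that makes several passes over the data repeats the extraction work.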

In [None]:
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

##### Choosing The Right Features

In [None]:
def gender_features2(name):
    features={}
    features['first_letter'] = name[0].lower()
    features['last_letter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count({})'.format(letter)] = name.lower().count(letter)
        features['has({})'.format(letter)] = (letter in name.lower())
    return features

In [None]:
print(gender_features2('Mohit'))

There are usually limits, however, to the number of features that you should use with a given learning algorithm: if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets. For example, if we train a naive Bayes classifier using the feature extractor shown above, it will overfit the relatively small training set, resulting in a system whose accuracy is lower than the accuracy of a classifier that only pays attention to the final letter of each name:

In [None]:
features = [(gender_features2(n), gender) for (n, gender) in labeled_names]

# Hold out the first 500 instances for testing and train on the rest.
train_set, test_set = features[500:], features[:500]
cf2 = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(cf2, test_set)

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.

In [None]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. It is important that we employ a separate dev-test set for error analysis, rather than just using the test set.
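One way to organize such a split is sketched below: the test set is held out once, and the remaining development data can be re-split into training and dev-test portions with a different seed whenever needed. The helper names `hold_out_test` and `split_development`, the seeds, and the toy data are assumptions for illustration.

```python
import random

def hold_out_test(items, test_size, seed=0):
    """Shuffle a copy of items once and return (development, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

def split_development(development, devtest_size, seed):
    """Return (train, devtest) from the development data for a given seed."""
    shuffled = list(development)
    random.Random(seed).shuffle(shuffled)
    return shuffled[devtest_size:], shuffled[:devtest_size]

# Toy stand-in for a list of (name, gender) pairs.
toy_names = [('name{}'.format(i), 'female' if i % 2 else 'male') for i in range(20)]

development, test = hold_out_test(toy_names, test_size=4)
train, devtest = split_development(development, devtest_size=6, seed=1)
print(len(train), len(devtest), len(test))
```

Because the test items are removed before any development split is made, changing the seed passed to `split_development` never moves an instance into or out of the test set.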

Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it on the dev-test set:

In [None]:
train_set=[(gender_features(n), gender) for n, gender in train_names]
devtest_set = [(gender_features(n), gender) for n, gender in devtest_names]
test_set = [(gender_features(n), gender) for n, gender in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [None]:
errors = []
for name,  tag in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.

In [None]:
for (tag, guess, name) in sorted(errors):
    print('correct = {:<8} guess = {:<8s} name = {:<30}'.format(tag, guess, name))

In [None]:
len(errors)

In [None]:
def gender_features(word):
    # Slices give the final one and two letters and do not raise
    # IndexError on one-letter names.
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

In [None]:
train_set=[(gender_features(n), gender) for n, gender in train_names]
devtest_set = [(gender_features(n), gender) for n, gender in devtest_names]
test_set = [(gender_features(n), gender) for n, gender in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

The performance has improved. This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, **we should select a different dev-test/training split**, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.

But once we've used the dev-test set to help us develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. It is therefore important to keep the test set separate, and unused, until our model development is complete. At that point, we can use the test set to evaluate how well our model will perform on new input values.

In [None]:
nltk.classify.accuracy(classifier, test_set)

##### Document Classification
Using a corpus in which documents have been labeled with categories, such as the movie reviews corpus, we can build classifiers that will automatically tag new documents with appropriate category labels.

In [None]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [None]:
print(documents[0])

In [None]:
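all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Keep the 2,000 most frequent words; most_common() returns them in frequency order.
word_features = [word for (word, count) in all_words.most_common(2000)]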

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

>The reason that we compute the set of all words in a document inside document_features, rather than just checking `word in document`, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
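A quick way to see the difference is to time both membership tests on synthetic data. The sketch below uses the standard library's `timeit`; the word list, the probe word, and the repetition count are arbitrary choices, and the absolute timings depend on the machine.

```python
import timeit

# Synthetic vocabulary; the probe is the last element, the worst case for a list scan.
words = ['w{}'.format(i) for i in range(20000)]
word_set = set(words)
probe = 'w19999'

list_time = timeit.timeit(lambda: probe in words, number=200)
set_time = timeit.timeit(lambda: probe in word_set, number=200)
print('list: {:.5f}s  set: {:.5f}s'.format(list_time, set_time))
```

A list test compares the probe against elements one by one, while a set test hashes the probe and looks it up directly, so the gap grows with document length.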

In [None]:
features = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = features[100:], features[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(5)

##### Part of Speech Tagging

In [None]:
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1    

In [None]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

Next, we'll define a feature extractor function that checks a given word for these suffixes:

In [None]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In [None]:
tagged_words = brown.tagged_words(categories = 'news')
featuresets = [(pos_features(n), g) for n, g in tagged_words]

In [None]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
%%time
classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.classify(pos_features('cats'))

In [None]:
# We can even print pseudocode for the decision tree
print(classifier.pseudocode(depth=4))

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

##### Exploiting Context

By augmenting the feature extraction function, we could modify this part-of-speech tagger to leverage a variety of other word-internal features, such as the length of the word, the number of syllables it contains, or its prefix. However, as long as the feature extractor just looks at the target word, we have no way to add features that depend on the context that the word appears in. But contextual features often provide powerful clues about the correct tag — for example, when tagging the word "fly," knowing that the previous word is "a" will allow us to determine that it is functioning as a noun, not a verb.

In [3]:
def pos_features(sentence, i):
    features={'suffix(1)': sentence[i][-1:],
             'suffix(2)': sentence[i][-2:],
             'suffix(3)': sentence[i][-3:]}
    if i == 0:
        features['previous_word']='<START>'
    else:
        features['previous_word'] = sentence[i-1]
    return features

In [4]:
pos_features(brown.sents()[0],8)

{'previous_word': 'an',
 'suffix(1)': 'n',
 'suffix(2)': 'on',
 'suffix(3)': 'ion'}

In [5]:
tagged_sents = brown.tagged_sents(categories = 'news')
feature_sets = []
for sent in tagged_sents:
    untagged_sent = nltk.tag.untag(sent)
    for i, (word, tag) in enumerate(sent):
        # Pass the untagged sentence: extracting features from the tagged
        # sentence would leak the gold tag into the features.
        feature_sets.append((pos_features(untagged_sent, i), tag))

In [6]:
size = int(len(feature_sets)*0.1)
train_set, test_set = feature_sets[size:],  feature_sets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

In [7]:
classifier.most_informative_features(5)

##### Sequence Classification

In [None]:
def pos_features(sentence, i, history):
    features = {'suffix(1)': sentence[i][-1:],
                'suffix(2)': sentence[i][-2:],
                'suffix(3)': sentence[i][-3:]}
    if i == 0:
        features['previous_word'] = '<START>'
        features['previous_tag'] = '<START>'
    else:
        features['previous_word'] = sentence[i-1]
        # history holds the tags already assigned to the preceding words
        features['previous_tag'] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i , history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i , history)
            tag =  self.classifier.classify(featureset)
            history.append(tag)
        # Return a list rather than a one-shot zip iterator
        return list(zip(sentence, history))

In [None]:
tagged_sents = brown.tagged_sents(categories = 'news')
size = int(len(tagged_sents)*0.1)
train_sets, test_sets = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sets)
print(tagger.evaluate(test_sets))

###### Other Methods for Sequence Classification

One shortcoming of this approach is that we commit to every decision that we make.
For example, if we decide to label a word as a noun, but later find evidence that it should
have been a verb, there’s no way to go back and fix our mistake. One solution to this
problem is to adopt a transformational strategy instead. Transformational joint classifiers
work by creating an initial assignment of labels for the inputs, and then iteratively
refining that assignment in an attempt to repair inconsistencies between related inputs.
The Brill tagger is a good example of this strategy.

Another solution is to assign scores to all of the possible sequences of part-of-speech
tags, and to choose the sequence whose overall score is highest. This is the approach
taken by Hidden Markov Models. Hidden Markov Models are similar to consecutive
classifiers in that they look at both the inputs and the history of predicted tags. However,
rather than simply finding the single best tag for a given word, they generate a
probability distribution over tags. These probabilities are then combined to calculate
probability scores for tag sequences, and the tag sequence with the highest probability
is chosen. Unfortunately, the number of possible tag sequences is quite large. Given a
tag set with 30 tags, there are about 600 trillion (30^10) ways to label a 10-word sentence.
In order to avoid considering all these possible sequences separately, Hidden Markov
Models require that the feature extractor only look at the most recent tag (or the most
recent n tags, where n is fairly small). Given that restriction, it is possible to use dynamic
programming (Section 4.7) to efficiently find the most likely tag sequence. In particular,
for each consecutive word index i, a score is computed for each possible current and
previous tag. This same basic approach is taken by two more advanced models, called
Maximum Entropy Markov Models and Linear-Chain Conditional Random
Field Models; but different algorithms are used to find scores for tag sequences.
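The dynamic-programming idea can be sketched with a small Viterbi search over a first-order Hidden Markov Model. Everything in the toy model below (the three tags and all probabilities) is invented for illustration, and the function `viterbi` is a minimal sketch rather than the implementation used by any particular library. The first line also checks the sequence count quoted above.

```python
print(30 ** 10)  # number of tag sequences for 10 words with 30 tags

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = emit_p[t].get(words[i], 0.0)
            # Score every (previous tag, current tag) pair and keep the best.
            column[t] = max((best[i - 1][p][0] * trans_p[p][t] * emit, p) for p in tags)
        best.append(column)
    # Trace back from the highest-scoring final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy model with made-up probabilities.
tags = ['DET', 'NOUN', 'VERB']
start_p = {'DET': 0.6, 'NOUN': 0.3, 'VERB': 0.1}
trans_p = {'DET': {'DET': 0.05, 'NOUN': 0.9, 'VERB': 0.05},
           'NOUN': {'DET': 0.1, 'NOUN': 0.3, 'VERB': 0.6},
           'VERB': {'DET': 0.5, 'NOUN': 0.3, 'VERB': 0.2}}
emit_p = {'DET': {'a': 0.5, 'the': 0.5},
          'NOUN': {'fly': 0.4, 'dog': 0.4, 'flies': 0.2},
          'VERB': {'fly': 0.5, 'flies': 0.5}}

print(viterbi(['a', 'fly'], tags, start_p, trans_p, emit_p))
print(viterbi(['the', 'dog', 'flies'], tags, start_p, trans_p, emit_p))
```

For a sentence of n words and T tags, this search scores n × T × T tag pairs instead of enumerating all T^n sequences. A practical implementation would typically work with log probabilities to avoid numerical underflow on long sentences.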

### Further Examples of Supervised Classification

##### Sentence Segmentation
Sentence segmentation can be viewed as a classification task for punctuation: whenever
we encounter a symbol that could possibly end a sentence, such as a period or a question
mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences
and convert it into a form that is suitable for extracting features:

In [None]:
sents = treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset +=len(sent)
    boundaries.add(offset-1)

Here, tokens is a merged list of tokens from the individual sentences, and boundaries
is a set containing the indexes of all sentence-boundary tokens. Next, we need to specify
the features of the data that will be used in order to decide whether punctuation indicates
a sentence boundary.
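To make the bookkeeping concrete, here is the same merging loop applied to three hand-written sentences. The toy sentences are an assumption for illustration; the loop itself mirrors the one used on the corpus.

```python
# Three tiny pre-segmented sentences standing in for corpus data.
toy_sents = [['Hello', 'world', '.'],
             ['Is', 'it', 'raining', '?'],
             ['Yes', '.']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)
    offset += len(sent)
    # The last token of each sentence sits at index offset - 1 in the merged list.
    toy_boundaries.add(offset - 1)

print(toy_tokens)
print(sorted(toy_boundaries))
```

Each index in `toy_boundaries` points at the token that ends a sentence, so `toy_tokens[i]` for those indexes is the sentence-final punctuation.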

In [None]:
tokens

In [None]:
sorted(boundaries)

In [None]:
def punct_features(tokens, i):
    return {'next_word_capitalized': tokens[i+1][0].isupper(),
            'prev_word': tokens[i-1],
            'punct': tokens[i],
            'prev_word_one_char': len(tokens[i-1]) == 1}

In [None]:
feature_set = [(punct_features(tokens,i), (i in boundaries)) for i in range(1, len(tokens)-1) if tokens[i] in '.?!']

In [None]:
size = int(len(feature_set) * 0.9)
train_set, test_set = feature_set[:size], feature_set[size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
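def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Classify only candidate punctuation that has a following token,
        # since punct_features looks at words[i+1].
        if (word in '.?!' and i + 1 < len(words)
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents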

What is the general procedure?

1. Write a feature extractor function that extracts properties of your input (in our case, words), such as the suffix of the word or the word before the current word.
2. Pair each feature set with its known label. For example, a word whose suffix is 'ed' is most likely a past-tense verb.
3. Use these (feature set, label) pairs to train a classifier with a learning algorithm of your choice. Train on only a subset of the pairs and keep the remainder for testing.

##### Identifying Dialogue Act Types

In [None]:
posts = nps_chat.xml_posts()[:10000]

In [None]:
[post.text for post in posts[:10]]

In [None]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

In [None]:
feature_set = [(dialogue_act_features(post.text), post.get('class'))
              for post in posts]
size = int(len(feature_set) * 0.9)
train_set, test_set = feature_set[:size], feature_set[size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

##### Recognizing Textual Entailment
Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis" (H). Two example pairs are shown below.

Challenge 3, Pair 34 (True)

T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.

H: China is a member of SCO.

Challenge 3, Pair 81 (False)

T: According to NC Articles of Organization, the members of LLC company are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.

H: Jennie Beavers Stewart is a share-holder of Carolina Analytical Laboratory.

It should be emphasized that the relationship between text and hypothesis is not intended to be logical entailment, but rather whether a human would conclude that the text provides reasonable evidence for taking the hypothesis to be true.

We can treat RTE as a classification task, in which we try to predict the True/False label for each pair. Although it seems likely that successful approaches to this task will involve a combination of parsing, semantics and real world knowledge, many early attempts at RTE achieved reasonably good results with shallow analysis, based on similarity between the text and hypothesis at the word level. In the ideal case, we would expect that if there is an entailment, then all the information expressed by the hypothesis should also be present in the text. Conversely, if there is information found in the hypothesis that is absent from the text, then there will be no entailment.

In our RTE feature detector, we let words (i.e., word types) serve as proxies for information, and our features count the degree of word overlap, and the degree to which there are words in the hypothesis but not in the text (captured by the method hyp_extra()). Not all words are equally important: Named Entity mentions such as the names of people, organizations and places are likely to be more significant, which motivates us to extract distinct information for words and nes (Named Entities). In addition, some high frequency function words are filtered out as "stopwords."

In [None]:
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    # Counts of plain words and named entities shared by text and hypothesis,
    # and of those found only in the hypothesis.
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

>The RTEFeatureExtractor class builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
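The overlap and difference computation can be sketched with plain Python sets. This is a simplified stand-in, not the RTEFeatureExtractor implementation: the stopword list, the whitespace tokenization, and the helper names `bag_of_words` and `overlap_features` are assumptions, and named entities are not treated separately.

```python
# A tiny illustrative stopword list.
STOPWORDS = {'a', 'the', 'is', 'of', 'at', 'to', 'and', 'that', 'was'}

def bag_of_words(sentence):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    tokens = (tok.strip('.,()').lower() for tok in sentence.split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}

def overlap_features(text, hypothesis):
    text_words = bag_of_words(text)
    hyp_words = bag_of_words(hypothesis)
    return {'word_overlap': len(hyp_words & text_words),
            'word_hyp_extra': len(hyp_words - text_words)}

text = 'The association binds Russia, China and four former Soviet republics together.'
hypothesis = 'China is a member of SCO.'
feats = overlap_features(text, hypothesis)
print(feats)
```

Words that appear only in the hypothesis count against entailment under this heuristic, which is why the two features are kept separate.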

In [None]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print(extractor.text_words)

In [None]:
print(extractor.hyp_words)

In [None]:
print(extractor.overlap('word'))

In [None]:
extractor.overlap('ne')

In [None]:
extractor.hyp_extra('word')

### Evaluation
In order to decide whether a classification model is accurately capturing a pattern, we must evaluate that model. The result of this evaluation is important for deciding how trustworthy the model is, and for what purposes we can use it.

##### The Test Set
When building the test set, there is often a trade-off between the amount of data available for testing and the amount available for training. For classification tasks that have a small number of well-balanced labels and a diverse test set, a meaningful evaluation can be performed with as few as 100 evaluation instances. But if a classification task has a large number of labels or includes very infrequent labels, then the size of the test set should be chosen to ensure that the least frequent label occurs at least 50 times. Additionally, if the test set contains many closely related instances—such as instances drawn from a single document—then the size of the test set should be increased to ensure that this lack of diversity does not skew the evaluation results. When large amounts of annotated data are available, it is common to err on the side of safety by using 10% of the overall data for evaluation. Another consideration when choosing the test set is the degree of similarity between instances in the test set and those in the development set. The more similar these two datasets are, the less confident we can be that evaluation results will generalize to other datasets.
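The rule of thumb about infrequent labels is easy to check mechanically. The sketch below counts labels in a candidate test set and reports those that occur fewer than 50 times; the helper name `rare_labels` and the toy label list are assumptions for illustration.

```python
from collections import Counter

def rare_labels(labels, min_count=50):
    """Return {label: count} for labels that occur fewer than min_count times."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}

# Toy test-set labels with one under-represented class.
toy_labels = ['statement'] * 120 + ['question'] * 60 + ['greeting'] * 12
too_rare = rare_labels(toy_labels)
print(too_rare)
```

If the result is non-empty, the test set is probably too small to say much about the classifier's behavior on those labels.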

##### Accuracy

The simplest metric that can be used to evaluate a classifier, accuracy, measures the percentage of inputs in the test set that the classifier correctly labeled. For example, a name gender classifier that predicts the correct gender 60 times in a test set containing 80 names would have an accuracy of 60/80 = 75%. The function nltk.classify.accuracy() will calculate the accuracy of a classifier model on a given test set.

When interpreting the accuracy score of a classifier, it is important to consider the frequencies of the individual class labels in the test set. For example, consider a classifier that determines the correct word sense for each occurrence of the word bank. If we evaluate this classifier on financial newswire text, then we may find that the financial-institution sense appears 19 times out of 20. In that case, an accuracy of 95% would hardly be impressive, since we could achieve that accuracy with a model that always returns the financial-institution sense. However, if we instead evaluate the classifier on a more balanced corpus, where the most frequent word sense has a frequency of 40%, then a 95% accuracy score would be a much more positive result.
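A simple way to put an accuracy score in context is to compare it with the score of a model that always predicts the most frequent label. The sketch below does this for the 19-out-of-20 scenario; the helper names `accuracy` and `majority_baseline` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)

gold = ['financial'] * 19 + ['river']
predicted = ['financial'] * 20   # a model that always returns one sense

acc = accuracy(gold, predicted)
base = majority_baseline(gold)
print(acc, base)
```

Here the classifier's accuracy equals the baseline, so the score says nothing about whether the model has learned to distinguish the senses.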

##### Precision and Recall
Another instance where accuracy scores can be misleading is in “search” tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%. It is therefore conventional to use a different set of measures for search tasks, based on the number of items in each of four categories:

* **True positives** are relevant items that we correctly identified as relevant.
* **True negatives** are irrelevant items that we correctly identified as irrelevant. 
* **False positives** (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
* **False negatives** (or Type II errors) are relevant items that we incorrectly identified as irrelevant.



Given these four numbers, we can define the following metrics:

* **Precision**, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
* **Recall**, which indicates how many of the relevant items we identified, is TP/(TP+FN).
* **The F-Measure** (or F-Score), which combines the precision and recall to give a single score, is defined to be the harmonic mean of the precision and recall: (2 × Precision × Recall)/(Precision+Recall).
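These definitions translate directly into code when the relevant and retrieved items are represented as sets. The function name `precision_recall_f` and the toy document IDs below are assumptions for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall, and F-measure from two sets of item IDs."""
    tp = len(relevant & retrieved)   # relevant items we identified
    fp = len(retrieved - relevant)   # irrelevant items we identified
    fn = len(relevant - retrieved)   # relevant items we missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

relevant = {'d1', 'd2', 'd3', 'd4'}
retrieved = {'d1', 'd2', 'd5'}
precision, recall, f_measure = precision_recall_f(relevant, retrieved)
print(precision, recall, f_measure)
```

True negatives do not appear in any of the three formulas, which is why these measures stay informative when irrelevant items vastly outnumber relevant ones.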