# Learning to Classify Text

### Supervised Classification
Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

>Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."

>Deciding whether a given occurrence of the word "bank" is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-class classification, each instance may be assigned multiple labels (a setting now more commonly called multi-label classification); in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

In [137]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random
from nltk.corpus import movie_reviews

##### Gender Identification
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [127]:
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('shrek')

{'last_letter': 'k'}

The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature, as in the example 'last_letter'. Feature values are values with simple types, such as booleans, numbers, and strings.

>Most classification methods require that features be encoded using simple value types, such as booleans, numbers, and strings. But note that just because a feature has a simple type, this does not necessarily mean that the feature's value is simple to express or compute. Indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.
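To illustrate the simple value types mentioned above, the following sketch defines a hypothetical extractor, simple_name_features (not part of the original example), whose feature set mixes a string, a number, and a boolean:

```python
def simple_name_features(name):
    """Hypothetical extractor mixing string, numeric and boolean feature values."""
    lowered = name.lower()
    return {
        'last_letter': lowered[-1],                   # string value
        'length': len(name),                          # numeric value
        'starts_with_vowel': lowered[0] in 'aeiou',   # boolean value
    }

print(simple_name_features('Shrek'))
# {'last_letter': 'k', 'length': 5, 'starts_with_vowel': False}
```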

Here, gender_features is the feature extractor. Next, we build a list of names paired with their gender labels:

In [120]:
labeled_names = [(name, 'male') for name in names.words('male.txt')
                ] + [(name, 'female') for name in names.words('female.txt')]

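Next, we use the feature extractor to process the names data, then divide the resulting list of feature sets into a training set and a test set. Note that labeled_names lists all male names before all female names and is not shuffled here, so the first 500 entries, which become the test set, are all male. Calling random.shuffle(labeled_names) before splitting would give a more representative test set.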

In [41]:
features_set = [(gender_features(n), gender) for n, gender in labeled_names]
train_set, test_set = features_set[500:], features_set[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
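Because the split above slices the list in corpus order, it can be worth checking which labels end up in each part. The following is a minimal sketch using toy data in place of labeled_names; label_distribution is a hypothetical helper, not an NLTK function:

```python
from collections import Counter

def label_distribution(labeled_items):
    """Count how often each label occurs in a list of (item, label) pairs."""
    return Counter(label for _, label in labeled_items)

# Toy stand-in for labeled_names: all 'male' entries first, then all 'female'.
toy = [('m%d' % i, 'male') for i in range(6)] + [('f%d' % i, 'female') for i in range(10)]
train, test = toy[4:], toy[:4]

print(label_distribution(test))    # Counter({'male': 4})
print(label_distribution(train))   # Counter({'female': 10, 'male': 2})
```

A test set that contains only one label, as in this toy case, says little about how the classifier handles the other label.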

In [42]:
train_set

[({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'd'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ...

In [11]:
classifier.classify(gender_features('Neo'))

'male'

In [12]:
classifier.classify(gender_features('Mohit'))

'male'

In [19]:
classifier.classify(gender_features('Swaraj'))

'female'

In [20]:
classifier.classify(gender_features('Siddharth'))

'female'

In [21]:
nltk.classify.accuracy(classifier, test_set)

0.602
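The accuracy reported here is the fraction of test items whose predicted label matches the correct label. The sketch below mirrors that idea in plain Python; it is an illustration rather than NLTK's implementation, and both accuracy and guess_by_last_letter are hypothetical helpers operating on toy data:

```python
def accuracy(classify_fn, labeled_items):
    """Fraction of (item, gold_label) pairs for which classify_fn(item) == gold_label."""
    correct = sum(1 for item, gold in labeled_items if classify_fn(item) == gold)
    return correct / len(labeled_items)

def guess_by_last_letter(name):
    # Toy rule for illustration only: names ending in 'a' are guessed female.
    return 'female' if name.lower().endswith('a') else 'male'

toy_test = [('Anna', 'female'), ('Mark', 'male'), ('Joshua', 'male'), ('Ruth', 'female')]
print(accuracy(guess_by_last_letter, toy_test))  # 0.5
```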

Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [23]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0


This listing shows that the names in the training set that end in "a" are female about 35 times more often than they are male, but names that end in "k" are male about 34 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
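A likelihood ratio of this kind compares how probable a feature value is under one label with how probable it is under another. The following sketch computes such a ratio from raw counts on toy data; likelihood_ratio is a hypothetical helper, and unlike NLTK's classifier it applies no smoothing, so it fails when the value never occurs with the second label:

```python
from collections import Counter

def likelihood_ratio(labeled_featuresets, fname, fval, label_a, label_b):
    """Ratio P(fname=fval | label_a) / P(fname=fval | label_b), from raw counts."""
    totals = Counter(label for _, label in labeled_featuresets)
    hits = Counter(label for feats, label in labeled_featuresets
                   if feats.get(fname) == fval)
    # (hits_a / total_a) / (hits_b / total_b), rearranged into a single division.
    return (hits[label_a] * totals[label_b]) / (hits[label_b] * totals[label_a])

# Toy data: 6 of 10 female names end in 'a', but only 1 of 10 male names does.
toy = ([({'last_letter': 'a'}, 'female')] * 6 + [({'last_letter': 'e'}, 'female')] * 4
       + [({'last_letter': 'a'}, 'male')] * 1 + [({'last_letter': 'k'}, 'male')] * 9)
print(likelihood_ratio(toy, 'last_letter', 'a', 'female', 'male'))  # 6.0
```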

Process to make a simple classifier:
1. Write a feature extractor that maps an input to a feature set (a dictionary from feature names to feature values).
2. If you have labelled data, apply the feature extractor to each input and store the resulting feature set together with its label as a pair (tuple).
3. Use this list of (feature set, label) pairs to train a model.
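To make the third step concrete, here is a minimal pure-Python sketch of what training on (feature set, label) pairs can look like. It is a simplified naive Bayes with add-one smoothing, not NLTK's implementation; train_naive_bayes and classify_naive_bayes are hypothetical names, and the data is a toy example:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and, per (label, feature name), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> Counter of values
    values_seen = defaultdict(set)        # feature name -> set of observed values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify_naive_bayes(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        for fname, fval in features.items():
            count = value_counts[(label, fname)][fval]
            # Add-one smoothing; the extra +1 reserves probability mass for unseen values.
            n_values = len(values_seen[fname]) + 1
            score += math.log((count + 1) / (n + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_pairs = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'e'}, 'female'), ({'last_letter': 'k'}, 'male'),
             ({'last_letter': 'n'}, 'male'), ({'last_letter': 'k'}, 'male')]
model = train_naive_bayes(toy_pairs)
print(classify_naive_bayes(model, {'last_letter': 'a'}))  # female
print(classify_naive_bayes(model, {'last_letter': 'k'}))  # male
```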

In [101]:
def g_features(word):
    return {'length':len(word), 'first_letter':word[0], 'last_letter':word[-1]}

In [102]:
label_data = [(g_features(n), gender) for n, gender in labeled_names]

In [103]:
# Train on the first 90% of label_data.
# Caution: the test slice below starts near the 10% mark, so it overlaps the training data;
# label_data[n_train:] would give a disjoint 10% test set instead.
n_train = int(len(label_data) * 0.9)
training_set, testing_set = label_data[:n_train], label_data[int(len(label_data) - (len(label_data) * 0.9)):]
cf = nltk.NaiveBayesClassifier.train(training_set)

In [104]:
print(training_set[-5:])

[({'length': 7, 'first_letter': 'R', 'last_letter': 'a'}, 'female'), ({'length': 7, 'first_letter': 'R', 'last_letter': 'e'}, 'female'), ({'length': 7, 'first_letter': 'R', 'last_letter': 'o'}, 'female'), ({'length': 4, 'first_letter': 'R', 'last_letter': 'e'}, 'female'), ({'length': 7, 'first_letter': 'R', 'last_letter': 'n'}, 'female')]

In [105]:
cf.classify(g_features('Mohit'))

'male'

In [106]:
cf.classify(g_features('Swaraj'))

'male'

In [107]:
cf.classify(g_features('Neo'))

'male'

In [108]:
cf.classify(g_features('Bunty'))

'female'

In [109]:
cf.classify(g_features('Soumya'))

'male'

In [110]:
nltk.classify.accuracy(cf, testing_set)

0.7495104895104895

In [111]:
cf.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.6 : 1.0
             last_letter = 'k'              male : female =     28.3 : 1.0
             last_letter = 'f'              male : female =     24.3 : 1.0
             last_letter = 'v'              male : female =     15.7 : 1.0
             last_letter = 'm'              male : female =     10.6 : 1.0


When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:

In [112]:
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
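The idea behind such a lazy list can be sketched in a few lines: store only the raw labeled inputs and run the feature extractor when an element is requested. The class below is a hypothetical illustration of that idea, not the object that apply_features actually returns:

```python
class LazyFeatureList:
    """List-like view that computes (featureset, label) pairs on demand."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        # Only integer indexing is handled in this sketch (no slices).
        token, label = self._tokens[index]
        return (self._func(token), label)

def last_letter_features(word):
    return {'last_letter': word[-1]}

lazy = LazyFeatureList(last_letter_features, [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy), lazy[1])  # 2 ({'last_letter': 'y'}, 'female')
```

Only the raw (name, label) pairs are kept in memory; each feature set is rebuilt every time it is accessed, which trades some computation for a smaller memory footprint.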

##### Choosing The Right Features

In [98]:
def gender_features2(name):
    features={}
    features['first_letter'] = name[0].lower()
    features['last_letter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count({})'.format(letter)] = name.lower().count(letter)
        features['has({})'.format(letter)] = (letter in name.lower())
    return features

In [118]:
print(gender_features2('Mohit'))

{'first_letter': 'm', 'last_letter': 't', 'count(a)': 0, 'has(a)': False, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 1, 'has(h)': True, 'count(i)': 1, 'has(i)': True, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 1, 'has(m)': True, 'count(n)': 0, 'has(n)': False, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 1, 'has(t)': True, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}


There are usually limits to the number of features that you should use with a given learning algorithm: if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets. For example, if we train a naive Bayes classifier using the gender_features2 extractor shown above, it will overfit the relatively small training set, resulting in a system whose accuracy is lower than the accuracy of a classifier that only pays attention to the final letter of each name:

In [121]:
features = [(gender_features2(n), gender) for (n, gender) in labeled_names]

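# Train on the first 500 feature sets and test on the rest. Because labeled_names is
# unshuffled, these 500 training names are all male.
train_set, test_set = features[:500], features[500:]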
cf2 = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(cf2, test_set)

0.32818377216550243

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.

In [128]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. It is important that we employ a separate dev-test set for error analysis, rather than just using the test set.

Having divided the corpus into appropriate datasets, we train a model using the training set, and then run it on the dev-test set:

In [129]:
train_set = [(gender_features(n), gender) for n, gender in train_names]
devtest_set = [(gender_features(n), gender) for n, gender in devtest_names]
test_set = [(gender_features(n), gender) for n, gender in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.347


Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders:

In [130]:
errors = []
for name, tag in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.

In [131]:
for (tag, guess, name) in sorted(errors):
    print('correct = {:<8} guess = {:<8s} name = {:<30}'.format(tag, guess, name))

correct = male     guess = female   name = Clinton                       
correct = male     guess = female   name = Clive                         
correct = male     guess = female   name = Clyde                         
correct = male     guess = female   name = Cob                           
correct = male     guess = female   name = Cobb                          
correct = male     guess = female   name = Cobbie                        
correct = male     guess = female   name = Cobby                         
correct = male     guess = female   name = Cody                          
correct = male     guess = female   name = Colbert                       
correct = male     guess = female   name = Cole                          
correct = male     guess = female   name = Coleman                       
correct = male     guess = female   name = Colin                         
correct = male     guess = female   name = Collin                        
...

In [132]:
len(errors)

653
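With several hundred errors, reading the list line by line is slow. One way to look for patterns is to tally the errors by name ending. The sketch below assumes the same (tag, guess, name) layout as the errors list; errors_by_suffix is a hypothetical helper and the error list is toy data:

```python
from collections import Counter

def errors_by_suffix(errors, n=2):
    """Count (suffix, correct_label, guessed_label) over (tag, guess, name) triples."""
    counts = Counter()
    for tag, guess, name in errors:
        counts[(name[-n:].lower(), tag, guess)] += 1
    return counts

# Toy error list in the same (tag, guess, name) layout as `errors` above.
toy_errors = [('male', 'female', 'Clive'), ('male', 'female', 'Steve'),
              ('male', 'female', 'Cody'), ('female', 'male', 'Carmen')]
for key, count in errors_by_suffix(toy_errors).most_common(1):
    print(key, count)  # ('ve', 'male', 'female') 2
```

Endings that account for many errors are candidates for new features.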

In [133]:
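def gender_features(word):
    # Redefine the extractor: 'suffix1' is the final letter and 'suffix2' is the
    # second-to-last letter (word[-2]); word[-2:] would give the two-letter suffix.
    return {'suffix1': word[-1], 'suffix2': word[-2]}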

In [134]:
train_set = [(gender_features(n), gender) for n, gender in train_names]
devtest_set = [(gender_features(n), gender) for n, gender in devtest_names]
test_set = [(gender_features(n), gender) for n, gender in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.474


The performance has improved. This error analysis procedure can then be repeated, checking for patterns in the errors that are made by the newly improved classifier. Each time the error analysis procedure is repeated, **we should select a different dev-test/training split**, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.
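One possible way to draw a different dev-test/training split in each round, while keeping the final test set fixed, is sketched below. It uses toy data in place of labeled_names; split_off is a hypothetical helper, and the seeds are arbitrary:

```python
import random

def split_off(items, size, seed):
    """Shuffle a copy of items with the given seed and split off `size` of them."""
    pool = list(items)
    random.Random(seed).shuffle(pool)
    return pool[size:], pool[:size]   # (remainder, split-off part)

# Toy stand-in for labeled_names.
toy_names = [('name%d' % i, 'male' if i % 2 else 'female') for i in range(30)]

# Hold out the test set once, with a fixed seed, and leave it unused during development.
dev_pool, test_names = split_off(toy_names, size=5, seed=0)

# Draw a fresh training/dev-test split from the remaining data in each round.
for round_number in range(1, 4):
    train_names, devtest_names = split_off(dev_pool, size=10, seed=round_number)
    print(round_number, len(train_names), len(devtest_names), len(test_names))
```

Each round prints 15 training items, 10 dev-test items, and the same 5 held-out test items.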

But once we've used the dev-test set to help us develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. It is therefore important to keep the test set separate, and unused, until our model development is complete. At that point, we can use the test set to evaluate how well our model will perform on new input values.

In [135]:
nltk.classify.accuracy(classifier, test_set)

0.488

##### Document Classification
Using corpora in which documents have been labeled with categories, such as the Movie Reviews corpus, we can build classifiers that will automatically tag new documents with appropriate category labels.

In [141]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [144]:
print(documents[0])

(['mars', 'attacks', '!', '(', '1996', ')', '-', 'c', ':', 'jack', 'nicholson', ',', 'glenn', 'close', ',', 'annette', 'bening', ',', 'martin', 'short', ',', 'danny', 'devito', ',', 'rod', 'steiger', ',', 'pierce', 'brosnan', ',', 'sarah', 'jessica', 'parker', ',', 'michael', 'j', '.', 'fox', ',', 'jim', 'brown', ',', 'pam', 'grier', ',', 'joe', 'don', 'baker', ',', 'natalie', 'portman', ',', 'christina', 'applegate', ',', 'lisa', 'marie', ',', 'tom', 'jones', '.', 'this', 'is', 'director', 'tim', 'burton', "'", 's', 'finest', 'film', 'to', 'date', '.', 'many', 'will', 'compare', 'this', 'tale', 'of', 'martians', 'who', 'invade', 'earth', 'to', 'independence', 'day', ',', 'but', 'even', 'though', 'the', 'stories', 'are', 'similar', ',', 'they', 'really', 'are', 'two', 'distinctly', 'different', 'films', '.', 'however', 'as', 'a', 'whole', ',', 'mars', 'attacks', 'is', 'much', 'more', 'entertaining', 'than', 'id4', ',', 'and', 'i', 'loved', 'id4', '.', 'you', 'really', 'have', 'to', ...

In [148]:
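all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# list(all_words) yields words in first-seen order, so this keeps the first 2,000 distinct
# words encountered; all_words.most_common(2000) would select the 2,000 most frequent ones.
word_features = list(all_words)[:2000]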

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
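nltk.FreqDist is a subclass of Python's collections.Counter, so the difference between taking the first keys and taking the most frequent keys can be shown with a plain Counter. A small sketch on toy tokens:

```python
from collections import Counter

tokens = ['plot', 'the', 'film', 'the', 'a', 'the', 'film', 'film', 'the']
counts = Counter(tokens)

first_seen = list(counts)[:2]                           # first two distinct tokens encountered
most_frequent = [w for w, _ in counts.most_common(2)]   # two tokens with the highest counts

print(first_seen)      # ['plot', 'the']
print(most_frequent)   # ['the', 'film']
```

Slicing the key list depends on the order in which words were first seen, whereas most_common ranks them by count.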

In [149]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'contains(plot)': True, 'contains(:)': True, 'contains(two)': True, 'contains(teen)': False, 'contains(couples)': False, 'contains(go)': False, 'contains(to)': True, 'contains(a)': True, 'contains(church)': False, 'contains(party)': False, 'contains(,)': True, 'contains(drink)': False, 'contains(and)': True, 'contains(then)': True, 'contains(drive)': False, 'contains(.)': True, 'contains(they)': True, 'contains(get)': True, 'contains(into)': True, 'contains(an)': True, 'contains(accident)': False, 'contains(one)': True, 'contains(of)': True, 'contains(the)': True, 'contains(guys)': False, 'contains(dies)': False, 'contains(but)': True, 'contains(his)': True, 'contains(girlfriend)': True, 'contains(continues)': False, 'contains(see)': False, 'contains(him)': True, 'contains(in)': True, 'contains(her)': False, 'contains(life)': False, 'contains(has)': True, 'contains(nightmares)': False, 'contains(what)': True, "contains(')": True, 'contains(s)': True, 'contains(deal)': False, ...

>The reason that we compute the set of all words in a document in document_features, rather than just checking if word in document, is that checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
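The speed difference can be observed with a rough timing sketch such as the one below. It uses a toy token list rather than a movie review, and the exact timings will vary by machine, so no expected output is shown:

```python
import timeit

document = ['word%d' % i for i in range(5000)]   # toy token list
document_words = set(document)                   # same tokens, stored as a set
probe = 'word4999'                               # worst case for the list: the last element

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in document, number=200)
set_time = timeit.timeit(lambda: probe in document_words, number=200)
print('list: {:.6f}s  set: {:.6f}s'.format(list_time, set_time))
```

A list is scanned element by element, while a set uses hashing, so the gap widens as documents get longer and as more feature words are checked.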

In [151]:
features = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = features[100:], features[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [152]:
nltk.classify.accuracy(classifier, test_set)

0.82

In [154]:
classifier.show_most_informative_features(5)

Most Informative Features
 contains(unimaginative) = True              neg : pos    =      8.4 : 1.0
    contains(schumacher) = True              neg : pos    =      7.5 : 1.0
          contains(mena) = True              neg : pos    =      7.1 : 1.0
        contains(suvari) = True              neg : pos    =      7.1 : 1.0
     contains(atrocious) = True              neg : pos    =      7.1 : 1.0
