# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

The three classifiers from Chapter 6 are Naive Bayes, Decision Tree, and Maximum Entropy.

## Setup

In [1]:
import random
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

(Description of approach)

## Data Import & Transformation

(Explanation)

In [2]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)
labeled_names[:10]

[('Howard', 'male'),
 ('Kore', 'female'),
 ('Townie', 'male'),
 ('Jimbo', 'male'),
 ('Marney', 'female'),
 ('Alisun', 'female'),
 ('Lemuel', 'male'),
 ('Lilah', 'female'),
 ('Cris', 'female'),
 ('Sherrie', 'female')]

### Train Test Split

(Explanation)

In [34]:
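# Same slices as used inside test_classifier(); kept at module level because errors() reads devtest_names.
# NOTE: this trains on the first 500 names and tests on everything after the first 1000,
# whereas the assignment calls for 500 test, 500 dev-test, and the remainder for training.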
train_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[1000:]

In [35]:
# Same feature application as inside test_classifier(); gender_features (Example 1) must be defined before this cell runs.
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
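The split described in the assignment (500 names for test, 500 for dev-test, the remainder for training) could be sketched as below. The helper name `split_names`, its parameters, and the toy corpus are assumptions for illustration and are not part of the notebook; the cells above slice the shuffled list differently.

```python
import random

def split_names(labeled, n_test=500, n_devtest=500, seed=None):
    """Shuffle a copy of `labeled` and return (train, devtest, test) lists."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]  # everything left over is training data
    return train, devtest, test

# Toy stand-in for labeled_names (hypothetical data, 7900 entries).
toy_names = [("name%d" % i, "male" if i % 2 else "female") for i in range(7900)]
toy_train, toy_devtest, toy_test = split_names(toy_names, seed=0)
print(len(toy_train), len(toy_devtest), len(toy_test))
```

Passing a fixed `seed` keeps the split reproducible between runs, which makes dev-test comparisons across feature sets easier to interpret.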

### Test Classifier

(Explanation)

In [5]:
def test_classifier(names_corpus, gender_features_function, classifier_type):
    """Train classifier_type on names_corpus; print dev-test accuracy, then test accuracy."""
    # Split: first 500 names for training, next 500 for dev-test, remainder for test
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]

    # Apply features
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

    # Train, then print dev-test accuracy followed by test accuracy
    classifier = classifier_type.train(train_set)
    print(nltk.classify.accuracy(classifier, devtest_set))
    print(nltk.classify.accuracy(classifier, test_set))

    return classifier

### Errors

(Explanation)

In [6]:
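def errors(classifier):
    # Collect the dev-test names that the classifier labels incorrectly.
    # Relies on the module-level devtest_names and gender_features, so it only
    # suits classifiers trained with gender_features (Example 1).
    mistakes = []
    for (name, tag) in devtest_names:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            mistakes.append((tag, guess, name))

    # Print the misclassified names sorted by correct label, then guess, then name
    for (tag, guess, name) in sorted(mistakes):
        print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))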

## Feature Engineering

(Explanation)

### Example 1

(Explanation)

In [7]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [8]:
gender_features('John')

{'last_letter': 'n'}

### Example 2

(Explanation)

In [11]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [12]:
#gender_features2('John')

### Example 3

(Explanation)

In [13]:
def gender_features3(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

In [14]:
gender_features3('Cristina')

{'suffix1': 'a', 'suffix2': 'na'}

### Example 4

(Explanation)

In [15]:
def gender_features4(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features['suffix1'] = name[-1:]
    features['suffix2'] = name[-2:]
    features['suffix3'] = name[-3:]
    # features['length'] = len(name)  # doesn't add much
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

## Naive Bayes

(Explanation)

In [16]:
mod1 = test_classifier(labeled_names, gender_features, nltk.NaiveBayesClassifier)

0.774
0.7442396313364056


(Explanation)

In [17]:
mod1.show_most_informative_features(5)
print(mod1.classify(gender_features('Neo')))
print(mod1.classify(gender_features('Trinity')))

Most Informative Features
             last_letter = 'a'            female : male   =     27.2 : 1.0
             last_letter = 'r'              male : female =     13.5 : 1.0
             last_letter = 'd'              male : female =      9.4 : 1.0
             last_letter = 't'              male : female =      8.1 : 1.0
             last_letter = 'o'              male : female =      7.5 : 1.0
male
female


(Explanation)

In [18]:
mod2 = test_classifier(labeled_names, gender_features2, nltk.NaiveBayesClassifier)
mod2.show_most_informative_features(5)

0.772
0.7482718894009217
Most Informative Features
              lastletter = 'a'            female : male   =     27.2 : 1.0
              lastletter = 'r'              male : female =     13.5 : 1.0
              lastletter = 'd'              male : female =      9.4 : 1.0
                count(o) = 2                male : female =      8.3 : 1.0
              lastletter = 't'              male : female =      8.1 : 1.0


(Explanation)

In [19]:
mod3 = test_classifier(labeled_names, gender_features3, nltk.NaiveBayesClassifier)
mod3.show_most_informative_features(5)

0.806
0.7783698156682027
Most Informative Features
                 suffix1 = 'a'            female : male   =     27.2 : 1.0
                 suffix1 = 'r'              male : female =     13.5 : 1.0
                 suffix2 = 'er'             male : female =      9.4 : 1.0
                 suffix1 = 'd'              male : female =      9.4 : 1.0
                 suffix1 = 't'              male : female =      8.1 : 1.0


(Explanation)

In [20]:
mod4 = test_classifier(labeled_names, gender_features4, nltk.NaiveBayesClassifier)
mod4.show_most_informative_features(10)

0.796
0.7874423963133641
Most Informative Features
              lastletter = 'a'            female : male   =     27.2 : 1.0
                 suffix1 = 'a'            female : male   =     27.2 : 1.0
                 suffix1 = 'r'              male : female =     13.5 : 1.0
              lastletter = 'r'              male : female =     13.5 : 1.0
                 suffix2 = 'er'             male : female =      9.4 : 1.0
              lastletter = 'd'              male : female =      9.4 : 1.0
                 suffix1 = 'd'              male : female =      9.4 : 1.0
                count(o) = 2                male : female =      8.3 : 1.0
                 suffix1 = 't'              male : female =      8.1 : 1.0
              lastletter = 't'              male : female =      8.1 : 1.0


## Decision Trees

(Explanation)

In [21]:
dt_mod1 = test_classifier(labeled_names, gender_features4, nltk.DecisionTreeClassifier)

0.702
0.6768433179723502


(Explanation)

In [22]:
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p,2) for p in probs)
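The function above computes the Shannon entropy of a list of labels, H = -Σ p(l) log2 p(l). A dependency-free sketch of the same calculation, with a few sample inputs, might look like this; `label_entropy` is a hypothetical name, and `collections.Counter` is swapped in for `nltk.FreqDist` so the snippet runs without NLTK.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels; Counter stands in for nltk.FreqDist."""
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

pure = label_entropy(['male', 'male', 'male', 'male'])      # all one label
mixed = label_entropy(['male', 'female', 'male', 'male'])   # 3:1 split
even = label_entropy(['male', 'female', 'male', 'female'])  # 1:1 split
print(pure, mixed, even)
```

A list with a single label has zero entropy and an even two-way split has one bit. Decision trees use this quantity to pick the feature whose split reduces label entropy the most (information gain).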

## Max Entropy

Instead of using probabilities to set model parameters as the Naive Bayes classifier does, the Maximum Entropy model (MaxEnt) searches for the set of parameters that maximizes model performance, specifically the total likelihood of the training corpus. Under the maximum entropy principle, the distribution stays as uniform as possible wherever empirical evidence does not constrain it.

The intuition is that classifiers with lower entropy introduce biases that are not justified by the data.

Importantly, MaxEnt does not assume independence of features (as Naive Bayes does) and so is not negatively impacted when features depend on one another, which is often the case. As MaxEnt captures the structure of the training data, the more features it uses, the stronger the constraint of empirical consistency becomes ([reference](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/classifying/nochunks.html#maxent)).

For each joint feature (a combination of a feature value and a label that receives its own parameter), MaxEnt algorithms calculate the empirical frequency and...(complete!).

NLTK's built-in trainers include two iterative scaling algorithms, Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS); the latter converges faster. According to Wikipedia (link!) and the literature ([A comparison of algorithms for maximum entropy parameter estimation](http://luthuli.cs.uiuc.edu/~daf/courses/Opt-2017/Papers/p18-malouf.pdf)), these algorithms have been substantially outperformed by gradient-based methods such as coordinate descent, limited-memory BFGS (L-BFGS), and limited-memory variable metric (LMVM). Most notably, iterative optimization can be time consuming, particularly as the training set and the number of features grow.

Additionally, MaxEnt is a conditional classifier, meaning it can be used to determine the most likely label for a given input or, conversely, how likely a given label is for that input. A generative classifier like Naive Bayes can additionally estimate the most likely input value, how likely a given input value is, and both of these given a label.

https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf (cite!) 
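To make the idea of a conditional classifier concrete, the sketch below scores each label by summing the weights of the joint features that fire for an input, then exponentiates and normalizes to obtain P(label | input). The weights, the `LABELS` tuple, and the function name `conditional_probs` are hypothetical and chosen for illustration only; they are not taken from the trained models in this notebook, and the base of the exponent is a convention (this sketch uses e).

```python
import math

# Hypothetical weights, one per joint feature (feature name, feature value, label).
# Illustrative numbers only; not taken from a trained model.
weights = {
    ("last_letter", "a", "female"): 1.5,
    ("last_letter", "a", "male"): -1.5,
    ("last_letter", "k", "male"): 2.0,
    ("last_letter", "k", "female"): -2.0,
}
LABELS = ("female", "male")

def conditional_probs(featureset):
    """Return P(label | featureset) by exponentiating and normalizing summed weights."""
    scores = {}
    for label in LABELS:
        # Joint features without a weight contribute nothing to the score.
        total = sum(weights.get((fname, fval, label), 0.0)
                    for fname, fval in featureset.items())
        scores[label] = math.exp(total)
    z = sum(scores.values())  # normalizing constant over all labels
    return {label: score / z for label, score in scores.items()}

probs_a = conditional_probs({"last_letter": "a"})
probs_unseen = conditional_probs({"last_letter": "q"})
print(probs_a)
print(probs_unseen)
```

When no weighted joint feature applies, as with the unseen letter here, every label gets the same score and the distribution is uniform, which matches the maximum entropy intuition described above.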


Rapid computation; accuracy peaks after a single iteration.

(Add chart)

In [23]:
me_mod1 = test_classifier(labeled_names, gender_features, nltk.ConditionalExponentialClassifier)  # consider changing max_iter param to 20, how to add kwarg?

# https://stackoverflow.com/questions/39391280/how-to-change-number-of-iterations-in-maxent-classifier-for-pos-tagging-in-nltk

me_mod1.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.376
             2          -0.36621        0.776
             3          -0.36189        0.776
             4          -0.35931        0.776
             5          -0.35759        0.776
             6          -0.35637        0.776
             7          -0.35545        0.776
             8          -0.35474        0.776
             9          -0.35417        0.776
            10          -0.35370        0.776
            11          -0.35331        0.776
            12          -0.35298        0.776
            13          -0.35270        0.776
            14          -0.35246        0.776
            15          -0.35225        0.776
            16          -0.35206        0.776
            17          -0.35189        0.776
            18          -0.35174        0.776
            19          -0.35161        0.776
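On the question of how to pass `max_iter`: one option is to have the helper forward extra keyword arguments to `train()`. The sketch below shows that pattern with a stand-in classifier so it runs without NLTK. The names `train_and_score` and `RecordingClassifier`, and the toy corpus, are hypothetical; only the forwarding via `**train_kwargs` is the point.

```python
def train_and_score(names_corpus, feature_fn, classifier_type, **train_kwargs):
    """Like test_classifier, but forwards extra keyword arguments to train()."""
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    train_set = [(feature_fn(n), g) for (n, g) in train_names]
    devtest_set = [(feature_fn(n), g) for (n, g) in devtest_names]
    classifier = classifier_type.train(train_set, **train_kwargs)
    correct = sum(classifier.classify(fs) == g for (fs, g) in devtest_set)
    return classifier, correct / len(devtest_set)

class RecordingClassifier:
    """Stand-in with the train()/classify() shape of an NLTK classifier (hypothetical)."""
    def __init__(self, kwargs):
        self.kwargs = kwargs  # remember what train() was given

    @classmethod
    def train(cls, train_set, **kwargs):
        return cls(kwargs)

    def classify(self, featureset):
        return 'female' if featureset['last_letter'] == 'a' else 'male'

# Toy corpus of 1000 labeled names (hypothetical data).
toy_corpus = [('Anna', 'female'), ('Bob', 'male')] * 500
clf, devtest_acc = train_and_score(
    toy_corpus, lambda n: {'last_letter': n[-1]}, RecordingClassifier,
    max_iter=20, trace=0)
print(clf.kwargs, devtest_acc)
```

With NLTK, the equivalent call would be `train_and_score(labeled_names, gender_features, nltk.MaxentClassifier, max_iter=20, trace=0)`. `max_iter` is one of the cutoff keyword arguments accepted by `MaxentClassifier.train`, and `trace=0` suppresses the per-iteration log.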
 

In [36]:
me_mod1.explain(test_set)

AttributeError: 'list' object has no attribute 'items'
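The `AttributeError` above is consistent with `explain()` receiving the whole list of (featureset, label) pairs rather than a single featureset dictionary. A minimal sketch of the difference, using a hypothetical two-item stand-in for `test_set`:

```python
# Hypothetical stand-in for test_set: a list of (featureset, label) pairs.
sample_test_set = [({'last_letter': 'n'}, 'male'), ({'last_letter': 'a'}, 'female')]

# The list itself has no .items(), which is what the traceback complains about.
list_has_items = hasattr(sample_test_set, 'items')

# A single featureset is a dict, which is the shape explain() expects.
featureset, label = sample_test_set[0]
dict_has_items = hasattr(featureset, 'items')

print(list_has_items, dict_has_items)
# With the trained model from the cell above, the call would become:
# me_mod1.explain(featureset)
```

Explaining one name at a time also fits how the method is meant to be read, since it reports the weights contributing to each label for that single input.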

In [25]:
# type(me_mod1.show_most_informative_features(10))

   6.644 last_letter=='k' and label is 'male'
   6.644 last_letter=='p' and label is 'male'
   6.644 last_letter=='c' and label is 'male'
   6.644 last_letter=='w' and label is 'male'
   6.644 last_letter=='j' and label is 'male'
   6.644 last_letter=='v' and label is 'male'
   6.644 last_letter=='f' and label is 'male'
  -4.807 last_letter=='a' and label is 'male'
  -2.700 last_letter=='r' and label is 'female'
  -2.000 last_letter=='d' and label is 'female'


NoneType

(Explanation)

(Add chart)

In [28]:
me_mod2 = test_classifier(labeled_names, gender_features2, nltk.ConditionalExponentialClassifier)
me_mod2.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.376
             2          -0.61376        0.624
             3          -0.59683        0.624
             4          -0.58097        0.628
             5          -0.56614        0.648
             6          -0.55229        0.672
             7          -0.53937        0.696
             8          -0.52732        0.708
             9          -0.51606        0.720
            10          -0.50555        0.732
            11          -0.49572        0.744
            12          -0.48652        0.760
            13          -0.47789        0.772
            14          -0.46979        0.788
            15          -0.46218        0.802
            16          -0.45501        0.800
            17          -0.44826        0.800
            18          -0.44188        0.800
            19          -0.43585        0.806
 

Slow computation; accuracy peaks after six iterations.

(Add chart)

In [27]:
me_mod3 = test_classifier(labeled_names, gender_features3, nltk.ConditionalExponentialClassifier)
me_mod3.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.376
             2          -0.33924        0.824
             3          -0.31200        0.840
             4          -0.29466        0.844
             5          -0.28280        0.850
             6          -0.27427        0.850
             7          -0.26785        0.850
             8          -0.26283        0.850
             9          -0.25879        0.850
            10          -0.25545        0.850
            11          -0.25265        0.850
            12          -0.25025        0.850
            13          -0.24818        0.850
            14          -0.24637        0.850
            15          -0.24478        0.850
            16          -0.24336        0.850
            17          -0.24210        0.850
            18          -0.24096        0.850
            19          -0.23993        0.850
 

Rapid computation; unclear whether an optimum has been reached, as accuracy continues to improve.

(Add chart)

In [26]:
me_mod4 = test_classifier(labeled_names, gender_features4, nltk.ConditionalExponentialClassifier)
me_mod4.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.396
             2          -0.61468        0.604
             3          -0.58329        0.622
             4          -0.55521        0.700
             5          -0.53014        0.752
             6          -0.50773        0.780
             7          -0.48767        0.800
             8          -0.46966        0.818
             9          -0.45343        0.834
            10          -0.43876        0.850
            11          -0.42544        0.858
            12          -0.41331        0.866
            13          -0.40221        0.866
            14          -0.39202        0.870
            15          -0.38263        0.870
            16          -0.37394        0.874
            17          -0.36589        0.878
            18          -0.35839        0.878
            19          -0.35139        0.878
 

## Conclusion

## YouTube