Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [4]:
import random, math, string
from collections import Counter, defaultdict
import nltk
import numpy as np, pandas as pd
random.seed(6502415)

Let's take a look at our data.


In [5]:
from nltk.corpus import names

# nltk.download('names') # If needed
labeled_names = [(name, "male") for name in names.words("male.txt")] + [
    (name, "female") for name in names.words("female.txt")
]
random.shuffle(labeled_names)
print(labeled_names[:10])
print(len(labeled_names))
df = pd.DataFrame(labeled_names, columns=["name", "label"])
print(df["label"].value_counts())

[('Gordon', 'male'), ('Charlott', 'female'), ('Adela', 'female'), ('Daveta', 'female'), ('Torr', 'male'), ('Aube', 'male'), ('Brandise', 'female'), ('Margery', 'female'), ('Darryl', 'male'), ('Neall', 'male')]
7944
label
female    5001
male      2943
Name: count, dtype: int64
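Because the corpus is imbalanced, it helps to know what accuracy a classifier would reach by always guessing the majority label. The short sketch below works this out from the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"female": 5001, "male": 2943}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.3f}")
```

Any classifier worth keeping should beat this figure, roughly 0.63 over the full corpus, by a clear margin.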


Define the feature extraction function.


In [None]:
def gender_features2(name):
    name = name.lower()

    features = {
        # General structure
        "first_letter": name[0],
        "last_letter": name[-1],
        "last_two": name[-2:],
        "last_three": name[-3:],
        "length": len(name),
        "ends_with_vowel": name[-1] in "aeiouy",  # treat "y" as a vowel
    }

    # --- Domain-informed patterns ---
    male_endings = ["or", "son", "ric", "us", "an", "ton", "bert", "ich", "on", "er"]
    female_endings = ["a", "ia", "ie", "ine", "elle", "ette", "ina", "na", "ly"]

    # Flags for specific suffix groups
    features["male_like"] = any(name.endswith(suf) for suf in male_endings)
    features["female_like"] = any(name.endswith(suf) for suf in female_endings)

    return features

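Now that we have our labeled dataset, we can split it into a test set (500 names), a dev-test set (1,000 names), and a training set (the remaining 6,444 names). Note that the dev-test set here holds 1,000 names rather than the 500 the exercise suggests.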


In [21]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
train_set = [(gender_features2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features2(n), gender) for (n, gender) in test_names]

Now we can train our models. We are fitting a Naive Bayes, a Decision Tree, and a Maximum Entropy classifier.

Maximum Entropy Implementation: The Maximum Entropy (MaxEnt) model is a probabilistic framework that selects the model with the highest entropy, that is, the greatest uncertainty or uniformity, subject to the constraints imposed by the training data, so that no assumptions are made beyond the evidence. Generalized Iterative Scaling (GIS) fits the weights by updating all of them together at each iteration using a single shared step size, while Improved Iterative Scaling (IIS) solves for a separate update for each weight, which typically converges in fewer iterations.
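To make the role of the weights concrete, here is a minimal sketch of how a conditional MaxEnt model turns summed weights into class probabilities. The weights below are hypothetical values chosen for illustration, not the ones learned by the trained model, and the helper `maxent_probs` is not part of NLTK; a library may also use a different logarithm base internally.

```python
import math

# Hypothetical weights for (feature name, feature value, label) triples.
# These are NOT taken from the trained classifier.
weights = {
    ("last_letter", "a", "female"): 1.5,
    ("last_letter", "a", "male"): -1.5,
    ("ends_with_vowel", True, "female"): 0.4,
    ("ends_with_vowel", True, "male"): -0.4,
}
labels = ["female", "male"]


def maxent_probs(features, weights, labels):
    """Return P(label | features) as a normalized exponential of summed weights."""
    scores = {
        label: sum(weights.get((name, value, label), 0.0) for name, value in features.items())
        for label in labels
    }
    z = sum(math.exp(score) for score in scores.values())  # normalizing constant
    return {label: math.exp(score) / z for label, score in scores.items()}


probs = maxent_probs({"last_letter": "a", "ends_with_vowel": True}, weights, labels)
print(probs)
```

Training algorithms such as GIS and IIS differ only in how they search for the weights; the prediction step stays the same.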

Naive Bayes Classifier: The Naive Bayes model is a probabilistic classifier based on Bayes’ Theorem, which predicts the most likely class for a given input by combining prior probabilities with the likelihood of the observed features. It assumes that all features are conditionally independent of one another given the class (the “naive” assumption), which simplifies computation and often works surprisingly well even when this assumption isn’t perfectly true.
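The calculation described above can be sketched on a tiny, made-up dataset. Everything here is an assumption for illustration: the six toy examples, the helper `nb_posterior`, and the add-one smoothing, which may differ from the smoothing NLTK applies.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data in the same (features, label) shape as train_set.
toy_train = [
    ({"last_letter": "a", "ends_with_vowel": True}, "female"),
    ({"last_letter": "e", "ends_with_vowel": True}, "female"),
    ({"last_letter": "a", "ends_with_vowel": True}, "female"),
    ({"last_letter": "n", "ends_with_vowel": False}, "male"),
    ({"last_letter": "k", "ends_with_vowel": False}, "male"),
    ({"last_letter": "o", "ends_with_vowel": True}, "male"),
]

label_counts = Counter(label for _, label in toy_train)
value_counts = defaultdict(Counter)  # (label, feature name) -> counts of each value
seen_values = defaultdict(set)       # feature name -> all values observed in training
for features, label in toy_train:
    for name, value in features.items():
        value_counts[(label, name)][value] += 1
        seen_values[name].add(value)


def nb_posterior(features, alpha=1.0):
    """P(label | features) using add-alpha smoothing (a simplifying assumption)."""
    log_scores = {}
    for label, n_label in label_counts.items():
        log_p = math.log(n_label / len(toy_train))  # prior P(label)
        for name, value in features.items():
            numerator = value_counts[(label, name)][value] + alpha
            denominator = n_label + alpha * len(seen_values[name])
            log_p += math.log(numerator / denominator)  # one likelihood term per feature
        log_scores[label] = log_p
    z = sum(math.exp(score) for score in log_scores.values())
    return {label: math.exp(score) / z for label, score in log_scores.items()}


posterior = nb_posterior({"last_letter": "a", "ends_with_vowel": True})
print(posterior)
```

Each feature contributes its own multiplicative term, which is exactly where the independence assumption enters.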

Decision Tree Classifier: A Decision Tree classifier predicts outcomes by recursively splitting the data into branches based on feature values that best separate the target classes. Each internal node represents a decision rule, and each leaf node represents a class label; this structure makes the model highly interpretable, though it can sometimes overfit if the tree grows too deep or captures noise in the data.
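One common way to decide which feature "best separates" the classes is information gain, the drop in label entropy after a split. The sketch below computes it on a hypothetical four-name toy set; the helper names are illustrative, and NLTK's own decision tree may choose its splits by a different criterion.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, feature_name):
    """Entropy reduction from splitting (features, label) examples on one feature."""
    all_labels = [label for _, label in examples]
    groups = {}
    for features, label in examples:
        groups.setdefault(features[feature_name], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy(all_labels) - remainder


# Hypothetical toy examples, not drawn from the Names Corpus.
toy = [
    ({"ends_with_vowel": True, "length": 5}, "female"),
    ({"ends_with_vowel": True, "length": 4}, "female"),
    ({"ends_with_vowel": False, "length": 5}, "male"),
    ({"ends_with_vowel": False, "length": 4}, "male"),
]

gain_vowel = information_gain(toy, "ends_with_vowel")
gain_length = information_gain(toy, "length")
print(gain_vowel, gain_length)
```

In this toy set, splitting on `ends_with_vowel` separates the labels perfectly while `length` tells us nothing, so a tree would split on the former first.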

In [22]:
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
maxent_classifier = nltk.MaxentClassifier.train(
    train_set,
    algorithm="IIS",  # "GIS" or "IIS"
    trace=0,  # set to 0 to hide iteration output
    max_iter=20,  # number of training iterations
)

In [23]:
print("Accuracy on dev-test set")
print("Decision Tree Accuracy:", nltk.classify.accuracy(dt_classifier, devtest_set))

print("Naive Bayes Accuracy:", nltk.classify.accuracy(nb_classifier, devtest_set))

print("MaxEnt Accuracy:", nltk.classify.accuracy(maxent_classifier, devtest_set))

Accuracy on dev-test set
Decision Tree Accuracy: 0.749
Naive Bayes Accuracy: 0.811
MaxEnt Accuracy: 0.816


In [24]:
print("accuracy on test set")
print("Decision Tree Accuracy:", nltk.classify.accuracy(dt_classifier, test_set))

print("Naive Bayes Accuracy:", nltk.classify.accuracy(nb_classifier, test_set))

print("MaxEnt Accuracy:", nltk.classify.accuracy(maxent_classifier, test_set))

accuracy on test set
Decision Tree Accuracy: 0.718
Naive Bayes Accuracy: 0.788
MaxEnt Accuracy: 0.79


As we can see, the Maximum Entropy model, which iteratively adjusts feature weights during training, achieved the best results: an accuracy of 0.816 on the dev-test set and 0.79 on the test set, improving on the baseline accuracy of approximately 0.76 reported in the book. Naive Bayes was close behind, with 0.811 on the dev-test set (after the letter "y" was added to the ends_with_vowel feature) and 0.788 on the test set. The decision tree was outperformed by both, at 0.749 and 0.718. All three models scored lower on the test set than on the dev-test set, which is expected because the features were tuned against the dev-test set. Using the IIS algorithm gave a slight improvement in test-set accuracy.
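To judge whether the drop from dev-test to test is larger than sampling noise alone would produce, a rough binomial standard error is a useful check. This is a back-of-the-envelope sketch that assumes independent examples; the helper `accuracy_std_error` is illustrative, and the accuracies and set sizes are the MaxEnt figures from the output cells above.

```python
import math


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


# MaxEnt accuracies and set sizes from the cells above
# (dev-test: 1000 names, test: 500 names).
devtest_acc, devtest_n = 0.816, 1000
test_acc, test_n = 0.79, 500

# Standard error of the difference between two independent accuracy estimates.
se_gap = math.sqrt(
    accuracy_std_error(devtest_acc, devtest_n) ** 2
    + accuracy_std_error(test_acc, test_n) ** 2
)
gap = devtest_acc - test_acc

print(f"gap={gap:.3f}, standard error of the gap={se_gap:.3f}")
```

Under these assumptions the gap is only a little larger than one standard error, so part of the dip may simply reflect the small test set rather than overfitting to the dev-test set.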

In [11]:
test_truth = [label for (features, label) in test_set]
test_pred = [maxent_classifier.classify(features) for (features, label) in test_set]

print(nltk.ConfusionMatrix(test_truth, test_pred))

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<270> 37 |
  male |  68<125>|
-------+---------+
(row = reference; col = test)
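Accuracy alone hides how unevenly the errors are spread across the two classes. The sketch below derives per-class precision and recall from the counts in the confusion matrix above; the dictionary layout and variable names are illustrative choices.

```python
# Counts copied from the confusion matrix above: confusion[reference][predicted].
confusion = {
    "female": {"female": 270, "male": 37},
    "male": {"female": 68, "male": 125},
}
labels = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[label][label] for label in labels) / total

metrics = {}
for label in labels:
    true_positives = confusion[label][label]
    # Recall: share of names with this reference label that were predicted correctly.
    recall = true_positives / sum(confusion[label].values())
    # Precision: share of predictions of this label that were correct.
    precision = true_positives / sum(confusion[ref][label] for ref in labels)
    metrics[label] = {"precision": precision, "recall": recall}
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")

print(f"accuracy={accuracy:.3f}")
```

Recall for male names (about 0.65) is well below recall for female names (about 0.88), so most of the remaining errors are male names predicted as female.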



Taking a look at the weights.


In [12]:
type(maxent_classifier.weights())


numpy.ndarray

In [13]:
maxent_classifier.show_most_informative_features(20)


   6.462 last_three=='zra' and label is 'male'
   4.131 last_three=='eza' and label is 'male'
   4.033 last_three=='ido' and label is 'female'
   3.780 last_three=='tya' and label is 'male'
   3.725 last_three=='tim' and label is 'female'
  -3.650 last_letter=='a' and label is 'male'
   3.633 last_three=='ild' and label is 'female'
   3.469 last_three=='ark' and label is 'female'
   3.384 last_three=='bev' and label is 'female'
  -3.376 last_two=='ko' and label is 'male'
   3.354 last_three=='em' and label is 'female'
   3.240 last_three=='pam' and label is 'female'
  -3.234 last_letter=='k' and label is 'female'
   3.219 last_two=='aa' and label is 'male'
   3.219 last_three=='laa' and label is 'male'
   3.158 last_two=='ua' and label is 'male'
   3.158 last_three=='hua' and label is 'male'
  -3.135 last_three=='nne' and label is 'male'
   2.999 last_three=='nch' and label is 'female'
   2.972 last_three=='kye' and label is 'male'


In [14]:
nb_classifier.show_most_informative_features(20)

Most Informative Features
                last_two = 'la'           female : male   =     67.0 : 1.0
                last_two = 'ia'           female : male   =     48.2 : 1.0
             last_letter = 'a'            female : male   =     43.6 : 1.0
                last_two = 'ta'           female : male   =     40.9 : 1.0
                last_two = 'ra'           female : male   =     33.2 : 1.0
                last_two = 'us'             male : female =     32.8 : 1.0
                last_two = 'sa'           female : male   =     30.4 : 1.0
                last_two = 'rt'             male : female =     27.5 : 1.0
             last_letter = 'k'              male : female =     25.3 : 1.0
                last_two = 'io'             male : female =     23.7 : 1.0
                last_two = 'do'             male : female =     22.6 : 1.0
                last_two = 'ld'             male : female =     22.2 : 1.0
                last_two = 'rd'             male : female =     21.2 : 1.0