# Project 3:

Goal: Build a classifier that predicts gender from a first name, using features from NLTK Ch. 6.  
Plan: start from the textbook baseline and add features step by step:
- Textbook baseline (last letter)
- Last two letters (`suffix2`)
- First letter
- Name length
- Vowel count

# Data Description

We will be using the NLTK Names Corpus, which has two lists: `male.txt` and `female.txt`.  
We are going to combine them into labeled examples of `(name, gender)`.  

1 - Loading Data

In [2]:
import nltk
from nltk.corpus import names
import random

In [3]:
nltk.download('names')

#loading names, labeling and shuffle
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)



In [4]:
#splitting into groups
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]

In [5]:

print(f"Training set: {len(train_names)} names")
print(f"Dev-test set: {len(devtest_names)} names")
print(f"Test set: {len(test_names)} names")

print(train_names[:5])

Training set: 6944 names
Dev-test set: 500 names
Test set: 500 names
[('Jaine', 'female'), ('Sonia', 'female'), ('Gwyneth', 'female'), ('Louisette', 'female'), ('Dionis', 'male')]
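
Because `random.shuffle` is called without a seed, the split (and every accuracy reported below) changes from run to run. A minimal sketch of a seeded variant follows. The helper name `split_names`, the seed value, and the toy list are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_names(labeled, seed=0, n_test=500, n_dev=500):
    """Shuffle a copy of `labeled` deterministically and cut it into three parts."""
    shuffled = list(labeled)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, global random state is unaffected
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, devtest, test

# Toy stand-in for labeled_names (hypothetical data)
toy = [(f"name{i}", "male" if i % 2 else "female") for i in range(20)]
train, devtest, test = split_names(toy, seed=0, n_test=4, n_dev=4)
print(len(train), len(devtest), len(test))
```

With the real corpus, `split_names(labeled_names)` would give the same 6944 / 500 / 500 sizes as above, but the same names in each group on every run.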


2 - Textbook example: last letter as feature

The first model uses only the last letter of each name as the feature.
It reached a dev-test accuracy of 0.746, which is similar to the performance reported in the textbook.

The most informative features confirm known linguistic patterns: a final “a” strongly indicates a female name, while a final “k”, “p”, “f”, or “d” indicates a male name.

In [None]:
#feature extractor
def gender_features(name):
    return {'last_letter': name[-1].lower()}

#test
print(gender_features("Neo"))
print(gender_features("Trinity"))


{'last_letter': 'o'}
{'last_letter': 'y'}


In [10]:
#feature sets for each group
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

#train Naive Bayes classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

acc_dev_v1 = nltk.classify.accuracy(classifier, devtest_set)
print(f"accuracy: {acc_dev_v1:.3f}")

#what are the most predictive features ?
classifier.show_most_informative_features(5)

accuracy: 0.746
Most Informative Features
             last_letter = 'k'              male : female =     42.1 : 1.0
             last_letter = 'a'            female : male   =     40.8 : 1.0
             last_letter = 'p'              male : female =     19.6 : 1.0
             last_letter = 'f'              male : female =     14.5 : 1.0
             last_letter = 'd'              male : female =     10.2 : 1.0
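
The ratios in this table compare how likely a feature value is under one label versus the other. The sketch below shows the idea on a tiny hypothetical name list. The function name `likelihood_ratio` and the add-0.5 smoothing are assumptions chosen for illustration, so the numbers will not necessarily match what NLTK prints.

```python
from collections import Counter

def likelihood_ratio(labeled_names, letter, label_a, label_b, smoothing=0.5):
    """Return P(last_letter=letter | label_a) / P(last_letter=letter | label_b)."""
    counts = {label_a: Counter(), label_b: Counter()}
    for name, label in labeled_names:
        if label in counts:
            counts[label][name[-1].lower()] += 1
    # Every distinct last letter seen under either label counts as one bin.
    bins = len(set(counts[label_a]) | set(counts[label_b]))

    def prob(label):
        total = sum(counts[label].values())
        # Smoothed relative frequency (assumed smoothing, see lead-in).
        return (counts[label][letter] + smoothing) / (total + smoothing * bins)

    return prob(label_a) / prob(label_b)

# Hypothetical toy data, not taken from the Names Corpus split above
toy = [("Anna", "female"), ("Maria", "female"), ("Ella", "female"), ("Ruth", "female"),
       ("Mark", "male"), ("Jack", "male"), ("Luca", "male"), ("Fred", "male")]
ratio = likelihood_ratio(toy, "a", "female", "male")
print(f"last_letter = 'a'   female : male = {ratio:.2f} : 1.0")
```

A ratio far from 1.0 means the feature value is much more common under one label, which is why rare but one-sided endings such as “k” rank so high.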


3 - Adding last two letters as feature

Next we extend the feature set with the last two letters of the name, to capture more specific endings such as -na, -ia, and -la.

The dev-test accuracy increased slightly from 0.746 to 0.750.

This improvement is small (2 more names out of 500) but positive. Some multi-letter endings carry gender information that the single-letter model could not capture, such as “-na,” “-la,” “-ra,” and “-ia” for female names, and “-rk” or “-ld” for male names.

We will keep this feature.

In [12]:
#extractor
def gender_features_v2(name):
    name = name.lower()
    return {
        'suffix1': name[-1:],
        'suffix2': name[-2:]
    }

#feature for groups
train_set_v2 = [(gender_features_v2(n), g) for (n, g) in train_names]
devtest_set_v2 = [(gender_features_v2(n), g) for (n, g) in devtest_names]

#new model
clf_v2 = nltk.NaiveBayesClassifier.train(train_set_v2)

acc_dev_v2 = nltk.classify.accuracy(clf_v2, devtest_set_v2)

print(f"suffix2 accuracy:      {acc_dev_v2:.3f}")
print(f"improvement:                        {acc_dev_v2 - acc_dev_v1:+.3f}")

clf_v2.show_most_informative_features(5)

suffix2 accuracy:      0.750
improvement:                        +0.004
Most Informative Features
                 suffix2 = 'na'           female : male   =     93.3 : 1.0
                 suffix2 = 'la'           female : male   =     70.0 : 1.0
                 suffix2 = 'ra'           female : male   =     58.7 : 1.0
                 suffix2 = 'ia'           female : male   =     52.6 : 1.0
                 suffix1 = 'k'              male : female =     42.1 : 1.0
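
One way to see which endings a model still gets wrong is to list its dev-test errors, in the spirit of the error analysis in NLTK Ch. 6. The sketch below is a hedged illustration: `collect_errors`, `toy_classify`, and the toy list are hypothetical names introduced here so the snippet runs without a trained model.

```python
def collect_errors(classify, extractor, labeled_names):
    """Return sorted (true_label, guess, name) triples that the classifier gets wrong."""
    errors = []
    for name, label in labeled_names:
        guess = classify(extractor(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Hypothetical stand-ins for a trained classifier and its feature extractor
def last_letter_features(name):
    return {'suffix1': name[-1:].lower()}

def toy_classify(features):
    return 'female' if features['suffix1'] in {'a', 'e', 'i', 'y'} else 'male'

toy_devtest = [('Anna', 'female'), ('Mark', 'male'), ('Luca', 'male'), ('Ruth', 'female')]
errors = collect_errors(toy_classify, last_letter_features, toy_devtest)
for label, guess, name in errors:
    print(f"correct={label:<8} guess={guess:<8} name={name}")
```

In this notebook the same helper could be called as `collect_errors(clf_v2.classify, gender_features_v2, devtest_names)`. Errors should only be inspected on the dev-test set, so that the test set stays unseen.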


4 - Adding first letter and name length as features

Next we added the first letter and the length of the name as features.

The model’s dev-test accuracy improved from 0.750 to 0.764 (+0.014), a larger gain than in the previous step.

The top predictors are still the suffix features, but the new features may help in cases where the suffix alone is ambiguous, such as Ben and Jen, or Kim and Tim.

In [13]:
def gender_features_v3(name):
    name = name.lower()
    return {
        'suffix1': name[-1:],
        'suffix2': name[-2:],
        'first_letter': name[0],
        'name_length': len(name)
    }

train_set_v3 = [(gender_features_v3(n), g) for (n, g) in train_names]
devtest_set_v3 = [(gender_features_v3(n), g) for (n, g) in devtest_names]

clf_v3 = nltk.NaiveBayesClassifier.train(train_set_v3)
acc_dev_v3 = nltk.classify.accuracy(clf_v3, devtest_set_v3)

print(f"dev-test accuracy: {acc_dev_v3:.3f}")
print(f"improvement: {acc_dev_v3 - acc_dev_v2:+.3f}")
clf_v3.show_most_informative_features(5)


dev-test accuracy: 0.764
improvement: +0.014
Most Informative Features
                 suffix2 = 'na'           female : male   =     93.3 : 1.0
                 suffix2 = 'la'           female : male   =     70.0 : 1.0
                 suffix2 = 'ra'           female : male   =     58.7 : 1.0
                 suffix2 = 'ia'           female : male   =     52.6 : 1.0
                 suffix1 = 'k'              male : female =     42.1 : 1.0
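
To make the Ben/Jen point concrete, the sketch below compares the v3 feature dictionaries of two names and reports which keys differ. The extractor is repeated from the cell above so the snippet runs on its own. `differing_keys` is a hypothetical helper added for illustration.

```python
def gender_features_v3(name):
    # Same extractor as in the cell above, repeated for self-containment
    name = name.lower()
    return {
        'suffix1': name[-1:],
        'suffix2': name[-2:],
        'first_letter': name[0],
        'name_length': len(name)
    }

def differing_keys(name_a, name_b):
    """Return the feature names whose values differ between the two names."""
    fa, fb = gender_features_v3(name_a), gender_features_v3(name_b)
    return sorted(key for key in fa if fa[key] != fb[key])

print(differing_keys("Ben", "Jen"))
print(differing_keys("Kim", "Tim"))
```

For both pairs the suffixes and the length are identical, so `first_letter` is the only feature that can separate them.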


5 - Adding vowel count as a feature

Female names tend to contain more vowels (such as Anna, Olivia), while male names often end in consonants (Mark, Scott). To capture this, we count the vowels in each name and add the count as a feature.

This produced a small improvement, from 0.764 to 0.772 (+0.008).

We are now at about 77% dev-test accuracy, which matches the plateau shown in the textbook. We will stop adding features here to avoid chasing tiny, unstable gains that would most likely overfit the dev-test set.

In [17]:
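# Assumed definition: the cell that originally defined VOWELS is not shown here
VOWELS = set("aeiou")

def gender_features_v4b(name):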
    name = name.lower()
    return {
        'suffix1': name[-1:],
        'suffix2': name[-2:],
        'first_letter': name[0],
        'name_length': len(name),
        'vowel_count': sum(ch in VOWELS for ch in name)
    }

train_set_v4b = [(gender_features_v4b(n), g) for (n,g) in train_names]
devtest_set_v4b = [(gender_features_v4b(n), g) for (n,g) in devtest_names]
clf_v4b = nltk.NaiveBayesClassifier.train(train_set_v4b)
acc_dev_v4b = nltk.classify.accuracy(clf_v4b, devtest_set_v4b)
print(f"v4b (vowel_count) dev-test:   {acc_dev_v4b:.3f}  Δvs v3: {acc_dev_v4b - acc_dev_v3:+.3f}")


v4b (vowel_count) dev-test:   0.772  Δvs v3: +0.008
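
The vowel count depends on which letters are treated as vowels, and that choice is not visible in the cell above. The sketch below counts vowels for the example names under two assumed vowel sets, with and without “y”. `vowel_count` and its `vowels` parameter are hypothetical names used only for this illustration.

```python
def vowel_count(name, vowels="aeiou"):
    """Count the characters of `name` that appear in `vowels` (case-insensitive)."""
    vowel_set = set(vowels)
    return sum(ch in vowel_set for ch in name.lower())

for name in ["Anna", "Olivia", "Mark", "Scott", "Trinity"]:
    without_y = vowel_count(name)
    with_y = vowel_count(name, vowels="aeiouy")
    print(f"{name:<8} aeiou={without_y}  aeiouy={with_y}")
```

Names ending in “y”, such as Trinity, are the ones affected by this choice, so the two definitions could lead to slightly different accuracies.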


6 - Using final model and evaluating it

The model achieved a dev-test accuracy of 0.772 and a test accuracy of 0.766, showing a small −0.006 generalization gap.

A small drop on the unseen test set is expected, because the features were selected using the dev-test data. The gap here amounts to 3 names out of 500, so the two scores are effectively the same. This suggests the model generalizes well and is not overfitting the dev-test set.

The most predictive features are still the two-letter suffixes “-na”, “-la”, “-ra”, and “-ia”.

In [18]:
#building test set features
test_set_v4b = [(gender_features_v4b(n), g) for (n, g) in test_names]

# Evaluate on dev-test and test
print(f"dev-test accuracy: {acc_dev_v4b:.3f}")

test_acc_v4b = nltk.classify.accuracy(clf_v4b, test_set_v4b)
print(f"test accuracy:     {test_acc_v4b:.3f}")
print(f"test - dev: {test_acc_v4b - acc_dev_v4b:+.3f}")

clf_v4b.show_most_informative_features(5)


dev-test accuracy: 0.772
test accuracy:     0.766
test - dev: -0.006
Most Informative Features
                 suffix2 = 'na'           female : male   =     93.3 : 1.0
                 suffix2 = 'la'           female : male   =     70.0 : 1.0
                 suffix2 = 'ra'           female : male   =     58.7 : 1.0
                 suffix2 = 'ia'           female : male   =     52.6 : 1.0
                 suffix1 = 'k'              male : female =     42.1 : 1.0
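
To judge whether a 0.006 gap is meaningful on 500 names, a rough sampling interval for the test accuracy helps. The sketch below uses the normal approximation to a binomial proportion, which is an assumption made here and not something the notebook computes. `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(acc * (1 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error

n = 500                      # size of the dev-test and test sets
dev_acc, test_acc = 0.772, 0.766

low, high = accuracy_interval(test_acc, n)
gap_in_names = round((dev_acc - test_acc) * n)

print(f"test accuracy interval: {low:.3f} to {high:.3f}")
print(f"dev-test minus test, in names: {gap_in_names}")
```

Under this approximation the dev-test score falls well inside the interval around the test score, which is consistent with reading the gap as sampling noise.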
