### Instructions 

### Using any of the three classifiers (Decision Tree, Naive Bayes Classifier, Maximum Entropy Classifiers) described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

#### Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set (used to perform error analysis), and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

### How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect? 

### Source: Natural Language Processing with Python, exercise 6.10.2.

In [1]:
# Import library
import nltk
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Ron\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [3]:
import random
from nltk.corpus import names

# Load and label names
male_names = [(name, 'male') for name in names.words('male.txt')]
female_names = [(name, 'female') for name in names.words('female.txt')]

# Combine and shuffle
all_names = male_names + female_names
random.shuffle(all_names)

# Split into subsets
test_names = all_names[:500]
devtest_names = all_names[500:1000]
train_names = all_names[1000:]   

# Check counts
print(f"Training set: {len(train_names)}")
print(f"Dev-test set: {len(devtest_names)}")
print(f"Test set: {len(test_names)}")

# Glimpse training set
print("Example training samples:", train_names[:25])

Training set: 6944
Dev-test set: 500
Test set: 500
Example training samples: [('Carina', 'female'), ('Reena', 'female'), ('Connie', 'male'), ('Ev', 'male'), ('Madelin', 'female'), ('Alvina', 'female'), ('Jacynth', 'female'), ('Tobin', 'male'), ('Tessy', 'female'), ('Xymenes', 'male'), ('Onida', 'female'), ('Marti', 'female'), ('Adriena', 'female'), ('Julio', 'male'), ('Joseph', 'male'), ('Berk', 'male'), ('Deeanne', 'female'), ('Schuyler', 'male'), ('Kaylee', 'female'), ('Ariela', 'female'), ('Kali', 'female'), ('Corina', 'female'), ('Ajay', 'female'), ('Casandra', 'female'), ('Shena', 'female')]
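
### Because random.shuffle is unseeded above, each run of the notebook produces a different split and slightly different scores. A minimal sketch of a reproducible three-way split follows. The helper name split_names, the seed value, and the toy data are my own assumptions for illustration, not part of the exercise.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=42):
    """Shuffle a copy of the data with a fixed seed and split it three ways."""
    shuffled = list(labeled_names)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # private RNG; the seed value is arbitrary
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the Names Corpus (hypothetical data, not the real corpus)
toy = [(f"name{i}", "female" if i % 3 else "male") for i in range(2000)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

### Calling split_names again with the same seed returns the same three subsets, so later feature experiments are scored on the same names.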


### With the NLTK library loaded and the Names Corpus split into training, dev-test, and test subsets, I will start with a simple classifier and improve it gradually.

In [18]:
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Define simple feature extractor 
def gender_features(word):
    return {'last_letter': word[-1].lower()}

# Create subsets
train_set = [(gender_features(name), gender) for (name, gender) in train_names]
devtest_set = [(gender_features(name), gender) for (name, gender) in devtest_names]
test_set = [(gender_features(name), gender) for (name, gender) in test_names]

# Train Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Evaluate 
devtest_accuracy = accuracy(classifier, devtest_set)
test_accuracy = accuracy(classifier, test_set)

# See metrics
print(f"Development test accuracy: {devtest_accuracy *100:.1f}%")
print(f"Final test accuracy: {test_accuracy * 100:.1f}%")

# Most important letter features 
classifier.show_most_informative_features(10)

Development test accuracy: 75.2%
Final test accuracy: 78.6%
Most Informative Features
             last_letter = 'a'            female : male   =     40.4 : 1.0
             last_letter = 'k'              male : female =     25.9 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'd'              male : female =      9.7 : 1.0
             last_letter = 'm'              male : female =      8.7 : 1.0
             last_letter = 'o'              male : female =      8.2 : 1.0
             last_letter = 'v'              male : female =      7.8 : 1.0
             last_letter = 'r'              male : female =      7.1 : 1.0
             last_letter = 'g'              male : female =      5.5 : 1.0
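
### Each ratio in the output above compares how likely a feature value is under one label versus the other. The sketch below computes that kind of ratio from raw counts on a tiny made-up sample. The function name likelihood_ratio and the toy names are assumptions, and NLTK smooths its estimates, so its numbers will not match an unsmoothed calculation like this one.

```python
from collections import Counter

# Hypothetical toy sample (not the real corpus)
toy_names = [
    ("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
    ("Emma", "female"), ("Ruth", "female"),
    ("Mark", "male"), ("Luca", "male"), ("Erik", "male"),
    ("Jack", "male"), ("Joshua", "male"),
]

def likelihood_ratio(samples, letter, label_a, label_b):
    """P(last letter == letter | label_a) / P(last letter == letter | label_b)."""
    label_counts = Counter(label for _, label in samples)
    hits = Counter(label for name, label in samples
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / label_counts[label_a]
    p_b = hits[label_b] / label_counts[label_b]
    return p_a / p_b  # unsmoothed: fails if the letter never occurs under label_b

ratio_a = likelihood_ratio(toy_names, "a", "female", "male")
print(f"last_letter = 'a'   female : male = {ratio_a:.1f} : 1.0")
```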


### This initial classifier reaches 75.2% accuracy on the dev-test set and 78.6% on the test set, which gives a solid baseline. The dev-test accuracy measures how well the model performs on the data used for error analysis and tuning, and the final accuracy is measured on the held-out test data. The most informative features show how much more likely a given last letter is under one gender than the other; for example, a final 'a' is about 40 times more likely in a female name than in a male name. Let's see where the classifier makes mistakes.

In [8]:
# See name by name on Dev-Test set, and which it classified wrong
errors = []

for (name, true_label) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != true_label:
        errors.append((true_label, guess, name))

# Sort errors alphabetically
errors = sorted(errors, key=lambda x: x[2])

print(f"Number of errors: {len(errors)} out of {len(devtest_names)}")
print("\nSample Name Misclassifications:\n")

for (true_label, guess, name) in errors[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")

Number of errors: 124 out of 500

Sample Name Misclassifications:

Abagail         - predicted: male   | actual: female
Adel            - predicted: male   | actual: female
Allah           - predicted: female | actual: male
Amabel          - predicted: male   | actual: female
Aurel           - predicted: male   | actual: female
Bab             - predicted: male   | actual: female
Barnabe         - predicted: female | actual: male
Bell            - predicted: male   | actual: female
Benjy           - predicted: female | actual: male
Benny           - predicted: female | actual: male
Berkley         - predicted: female | actual: male
Bren            - predicted: male   | actual: female
Brice           - predicted: female | actual: male
Brinkley        - predicted: female | actual: male
Candis          - predicted: male   | actual: female
Carmel          - predicted: male   | actual: female
Charleen        - predicted: male   | actual: female
Chauncey        - predicted: female | actual: 
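
### One way to look for patterns in these misclassifications is to group them by the true label and by whether the name ends in a vowel. The sketch below does this for the complete rows of the listing above (the last row is cut off, so it is left out). Treating a final 'y' as a vowel is my own assumption, as are the helper names.

```python
from collections import Counter

# Dev-test errors copied from the listing above: (name, predicted, actual)
errors_sample = [
    ("Abagail", "male", "female"), ("Adel", "male", "female"),
    ("Allah", "female", "male"), ("Amabel", "male", "female"),
    ("Aurel", "male", "female"), ("Bab", "male", "female"),
    ("Barnabe", "female", "male"), ("Bell", "male", "female"),
    ("Benjy", "female", "male"), ("Benny", "female", "male"),
    ("Berkley", "female", "male"), ("Bren", "male", "female"),
    ("Brice", "female", "male"), ("Brinkley", "female", "male"),
    ("Candis", "male", "female"), ("Carmel", "male", "female"),
    ("Charleen", "male", "female"),
]

VOWEL_LIKE = set("aeiouy")  # assumption: a final 'y' counts as a vowel sound

def ending_type(name):
    """Label a name by whether its last letter is vowel-like."""
    return "vowel" if name[-1].lower() in VOWEL_LIKE else "consonant"

# Count errors by (true label, ending type)
pattern_counts = Counter(
    (actual, ending_type(name)) for name, _predicted, actual in errors_sample
)

for (actual, ending), count in pattern_counts.most_common():
    print(f"actual {actual:6} | ends in {ending:9} | {count}")
```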

### The classifier struggled with female names ending in consonants, such as Abagail, Bell, Charleen, and Christal. Conversely, it struggled with male names ending in vowels or vowel sounds, such as Benjy, Benny, Brice, Daryle, and Davide. Some names are ambiguous or rare, such as Sam, Alex, Dallas, Darby, and Allah, so their last letter says little about gender, either because the pattern is uncommon or because the name is used for both genders. The last-letter rule doesn't always hold up, so I need to broaden the features. Chapter 6, Section 1.2 of the book extends the classifier with the first letter and other letter-based features to capture prefix/suffix patterns. However, as the book notes, this introduces many more features and can lead to overfitting: training accuracy goes up, but accuracy on the dev-test and test sets goes down. So I will add only a few features at a time.

In [10]:
# Modified version of gender_features2 in Chapter 6.1.2
def gender_features3(name):
    features = {
        'first_letter': name[0].lower(),
        'last_letter': name[-1].lower(),
        'last_two': name[-2:].lower(),   
        'name_length': len(name)
    }
    return features


# Create sets using the new features
train_set3 = [(gender_features3(name), gender) for (name, gender) in train_names]
devtest_set3 = [(gender_features3(name), gender) for (name, gender) in devtest_names]
test_set3 = [(gender_features3(name), gender) for (name, gender) in test_names]

# Train Naive Bayes classifier
classifier3 = NaiveBayesClassifier.train(train_set3)

# Evaluate 
devtest_accuracy3 = accuracy(classifier3, devtest_set3)
test_accuracy3 = accuracy(classifier3, test_set3)

# See metrics
print(f"Development test accuracy: {devtest_accuracy3 * 100:.1f}%")
print(f"Final test accuracy: {test_accuracy3 * 100:.1f}%")


# Most important letter features 
classifier3.show_most_informative_features(10)

Development test accuracy: 78.2%
Final test accuracy: 82.2%
Most Informative Features
                last_two = 'na'           female : male   =     96.1 : 1.0
                last_two = 'ia'           female : male   =     86.4 : 1.0
             last_letter = 'a'            female : male   =     40.4 : 1.0
                last_two = 'us'             male : female =     37.6 : 1.0
                last_two = 'ra'           female : male   =     36.7 : 1.0
                last_two = 'sa'           female : male   =     31.1 : 1.0
                last_two = 'ta'           female : male   =     30.0 : 1.0
                last_two = 'do'             male : female =     26.1 : 1.0
             last_letter = 'k'              male : female =     25.9 : 1.0
                last_two = 'ld'             male : female =     23.0 : 1.0


### With this enhancement, the classifier treats the last two letters of a name as a separate feature, alongside the first letter, the last letter, and the name length. Dev-test accuracy rose from 75.2% to 78.2%, and test accuracy rose from 78.6% to 82.2%. The number of dev-test errors fell from 124/500 with the prior features to 109/500, as shown below.

In [11]:
errors3 = []

for (name, true_label) in devtest_names:
    guess = classifier3.classify(gender_features3(name))
    if guess != true_label:
        errors3.append((true_label, guess, name))

# Sort errors alphabetically
errors3 = sorted(errors3, key=lambda x: x[2])

print(f"Number of errors: {len(errors3)} out of {len(devtest_names)}")
print("\nSample Name Misclassifications:\n")

for (true_label, guess, name) in errors3[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")

Number of errors: 109 out of 500

Sample Name Misclassifications:

Abagail         - predicted: male   | actual: female
Adel            - predicted: male   | actual: female
Allah           - predicted: female | actual: male
Amabel          - predicted: male   | actual: female
Aurel           - predicted: male   | actual: female
Bab             - predicted: male   | actual: female
Barbey          - predicted: male   | actual: female
Barnabe         - predicted: female | actual: male
Bell            - predicted: male   | actual: female
Benny           - predicted: female | actual: male
Bren            - predicted: male   | actual: female
Brice           - predicted: female | actual: male
Bryn            - predicted: female | actual: male
Candis          - predicted: male   | actual: female
Carmel          - predicted: male   | actual: female
Chauncey        - predicted: female | actual: male
Cortese         - predicted: female | actual: male
Dallas          - predicted: male   | actual: 

### The remaining tricky names are the following edge cases: 

    - Female names ending with consonants: Abagail, Amabel, Bell, Carmel, Candis
    - Male names ending with vowels: Benny, Davide, Eddy, Dwayne
    - Rare or ambiguous names: Allah, Barnabe, Cortese, Darby

### So adding the last 2 letters as a suffix feature helped the classifier, but it still has issues with the above categories. I will add the first 2 letters as a feature as well.

In [12]:
# Modified version of gender_features3, adding first two letters
def gender_features4(name):
    features = {
        'first_letter': name[0].lower(),
        'first_two': name[:2].lower(),      
        'last_letter': name[-1].lower(),
        'last_two': name[-2:].lower(),      
        'name_length': len(name)
    }
    return features


# Create sets using the new features
train_set4 = [(gender_features4(name), gender) for (name, gender) in train_names]
devtest_set4 = [(gender_features4(name), gender) for (name, gender) in devtest_names]
test_set4 = [(gender_features4(name), gender) for (name, gender) in test_names]

# Train Naive Bayes classifier
classifier4 = NaiveBayesClassifier.train(train_set4)

# Evaluate 
devtest_accuracy4 = accuracy(classifier4, devtest_set4)
test_accuracy4 = accuracy(classifier4, test_set4)

# See metrics
print(f"Development test accuracy: {devtest_accuracy4 * 100:.1f}%")
print(f"Final test accuracy: {test_accuracy4 * 100:.1f}%")

# Most important letter features 
classifier4.show_most_informative_features(10)

Development test accuracy: 79.2%
Final test accuracy: 82.2%
Most Informative Features
                last_two = 'na'           female : male   =     96.1 : 1.0
                last_two = 'ia'           female : male   =     86.4 : 1.0
             last_letter = 'a'            female : male   =     40.4 : 1.0
                last_two = 'us'             male : female =     37.6 : 1.0
                last_two = 'ra'           female : male   =     36.7 : 1.0
                last_two = 'sa'           female : male   =     31.1 : 1.0
                last_two = 'ta'           female : male   =     30.0 : 1.0
                last_two = 'do'             male : female =     26.1 : 1.0
             last_letter = 'k'              male : female =     25.9 : 1.0
                last_two = 'ld'             male : female =     23.0 : 1.0


In [13]:
# Check Errors
errors4 = []

for (name, true_label) in devtest_names:
    guess = classifier4.classify(gender_features4(name))
    if guess != true_label:
        errors4.append((true_label, guess, name))

# Sort errors alphabetically
errors4 = sorted(errors4, key=lambda x: x[2])

print(f"Number of errors: {len(errors4)} out of {len(devtest_names)}")
print("\nSample Name Misclassifications:\n")

for (true_label, guess, name) in errors4[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")

Number of errors: 104 out of 500

Sample Name Misclassifications:

Abagail         - predicted: male   | actual: female
Adel            - predicted: male   | actual: female
Allah           - predicted: female | actual: male
Bab             - predicted: male   | actual: female
Barbe           - predicted: male   | actual: female
Barbey          - predicted: male   | actual: female
Bell            - predicted: male   | actual: female
Benjy           - predicted: female | actual: male
Benny           - predicted: female | actual: male
Berkley         - predicted: female | actual: male
Bren            - predicted: male   | actual: female
Brice           - predicted: female | actual: male
Bryn            - predicted: female | actual: male
Cal             - predicted: female | actual: male
Chauncey        - predicted: female | actual: male
Cortese         - predicted: female | actual: male
Dallas          - predicted: male   | actual: female
Darby           - predicted: female | actual: male

### Adding the first 2 letters improved the dev-test accuracy by 1 percentage point (78.2% to 79.2%), and the test accuracy stayed the same at 82.2%. The number of errors went from 109 to 104, and many of the remaining ones are rare or ambiguous names (Barbe, Bab, Brice, Cal, Cortese, Daryle, Gael). For additional improvement, I will check whether vowel/consonant patterns add anything to the current features (length, prefix, and suffix).

In [14]:
# Adding vowel/consonant features
def gender_features5(name):
    vowels = 'aeiou'
    name_lower = name.lower()
    num_vowels = sum(1 for letter in name_lower if letter in vowels)
    num_consonants = len(name_lower) - num_vowels
    
    features = {
        'first_letter': name_lower[0],
        'first_two': name_lower[:2],
        'last_letter': name_lower[-1],
        'last_two': name_lower[-2:],
        'name_length': len(name_lower),
        'num_vowels': num_vowels,
        'num_consonants': num_consonants,
        'ends_with_vowel': (name_lower[-1] in vowels),
        'starts_with_vowel': (name_lower[0] in vowels)
    }
    return features


# Create sets using the new features
train_set5 = [(gender_features5(name), gender) for (name, gender) in train_names]
devtest_set5 = [(gender_features5(name), gender) for (name, gender) in devtest_names]
test_set5 = [(gender_features5(name), gender) for (name, gender) in test_names]

# Train Naive Bayes classifier
classifier5 = NaiveBayesClassifier.train(train_set5)

# Evaluate 
devtest_accuracy5 = accuracy(classifier5, devtest_set5)
test_accuracy5 = accuracy(classifier5, test_set5)

# See metrics
print(f"Development test accuracy: {devtest_accuracy5 * 100:.1f}%")
print(f"Final test accuracy: {test_accuracy5 * 100:.1f}%")

# Most important letter features 
classifier5.show_most_informative_features(10)

Development test accuracy: 77.2%
Final test accuracy: 80.4%
Most Informative Features
                last_two = 'na'           female : male   =     96.1 : 1.0
                last_two = 'ia'           female : male   =     86.4 : 1.0
             last_letter = 'a'            female : male   =     40.4 : 1.0
                last_two = 'us'             male : female =     37.6 : 1.0
                last_two = 'ra'           female : male   =     36.7 : 1.0
                last_two = 'sa'           female : male   =     31.1 : 1.0
                last_two = 'ta'           female : male   =     30.0 : 1.0
                last_two = 'do'             male : female =     26.1 : 1.0
             last_letter = 'k'              male : female =     25.9 : 1.0
                last_two = 'ld'             male : female =     23.0 : 1.0


### Adding the consonant/vowel features lowered the dev-test accuracy from 79.2% to 77.2% and the final test accuracy from 82.2% to 80.4%. This is likely because several of the features are highly correlated, such as num_vowels, ends_with_vowel, and last_letter. Naive Bayes assumes the features are independent, so correlated features are effectively counted more than once, and the classifier likely gives that overlapping evidence too much weight. Simple, strong features, such as the suffix, prefix, and first/last letters, can therefore outperform a larger set of overlapping features in Naive Bayes.
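
### The effect of correlated features on Naive Bayes can be illustrated with a small calculation. In the sketch below, the same piece of evidence (a likelihood ratio of 3 in favor of female) is counted once and then twice, as happens when two features carry the same information. The function name and all numbers are made up for illustration.

```python
def posterior_female(prior_odds, likelihood_ratios):
    """Multiply the prior odds by each feature's likelihood ratio,
    which is what the Naive Bayes independence assumption amounts to,
    then convert the resulting odds to a probability."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)

# Hypothetical numbers: even prior, one feature favoring female 3 : 1
single = posterior_female(1.0, [3.0])
# The same evidence entered twice through two perfectly correlated features
doubled = posterior_female(1.0, [3.0, 3.0])

print(f"counted once:  P(female) = {single:.2f}")
print(f"counted twice: P(female) = {doubled:.2f}")
```

### The evidence has not changed, but the model's confidence rises from 0.75 to 0.90, which can tip borderline names the wrong way.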

In [15]:
errors5 = []

for (name, true_label) in devtest_names:
    guess = classifier5.classify(gender_features5(name))
    if guess != true_label:
        errors5.append((true_label, guess, name))

# Sort errors alphabetically 
errors5 = sorted(errors5, key=lambda x: x[2])

print(f"Number of errors: {len(errors5)} out of {len(devtest_names)}")
print("\nSample Name Misclassifications:\n")

for (true_label, guess, name) in errors5[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")


Number of errors: 114 out of 500

Sample Name Misclassifications:

Abagail         - predicted: male   | actual: female
Adel            - predicted: male   | actual: female
Amabel          - predicted: male   | actual: female
Bab             - predicted: male   | actual: female
Barbey          - predicted: male   | actual: female
Barnabe         - predicted: female | actual: male
Beau            - predicted: female | actual: male
Bell            - predicted: male   | actual: female
Bren            - predicted: male   | actual: female
Brice           - predicted: female | actual: male
Candis          - predicted: male   | actual: female
Carmel          - predicted: male   | actual: female
Charleen        - predicted: male   | actual: female
Chelsey         - predicted: male   | actual: female
Christal        - predicted: male   | actual: female
Corey           - predicted: male   | actual: female
Cortese         - predicted: female | actual: male
Dallas          - predicted: male   | ac

### Since those features are highly correlated, I will remove the num_vowels and num_consonants features, and check if keeping the starts_with and ends_with features helps accuracy.

In [16]:
def gender_features6(name):
    vowels = 'aeiou'
    name_lower = name.lower()
    
    features = {
        'first_letter': name_lower[0],
        'first_two': name_lower[:2],
        'last_letter': name_lower[-1],
        'last_two': name_lower[-2:],
        'name_length': len(name_lower),
        'starts_with_vowel': (name_lower[0] in vowels),
        'ends_with_vowel': (name_lower[-1] in vowels)
    }
    return features


# Create sets using the new features
train_set6 = [(gender_features6(name), gender) for (name, gender) in train_names]
devtest_set6 = [(gender_features6(name), gender) for (name, gender) in devtest_names]
test_set6 = [(gender_features6(name), gender) for (name, gender) in test_names]

# Train Naive Bayes classifier
classifier6 = NaiveBayesClassifier.train(train_set6)

# Evaluate 
devtest_accuracy6 = accuracy(classifier6, devtest_set6)
test_accuracy6 = accuracy(classifier6, test_set6)

# See metrics
print(f"Development test accuracy: {devtest_accuracy6 * 100:.1f}%")
print(f"Final test accuracy: {test_accuracy6 * 100:.1f}%")

# Most important letter features 
classifier6.show_most_informative_features(10)

Development test accuracy: 77.2%
Final test accuracy: 82.0%
Most Informative Features
                last_two = 'na'           female : male   =     96.1 : 1.0
                last_two = 'ia'           female : male   =     86.4 : 1.0
             last_letter = 'a'            female : male   =     40.4 : 1.0
                last_two = 'us'             male : female =     37.6 : 1.0
                last_two = 'ra'           female : male   =     36.7 : 1.0
                last_two = 'sa'           female : male   =     31.1 : 1.0
                last_two = 'ta'           female : male   =     30.0 : 1.0
                last_two = 'do'             male : female =     26.1 : 1.0
             last_letter = 'k'              male : female =     25.9 : 1.0
                last_two = 'ld'             male : female =     23.0 : 1.0


### The dev-test accuracy stayed the same, and the final test accuracy rose from 80.4% to 82.0%. The most informative features are still dominated by the two-letter suffixes and the last letter. This classifier makes the same number of dev-test errors as the gender_features5 classifier (114), although not exactly the same ones: Allah and Aurel, for example, are misclassified here but not by gender_features5. The edge cases persist.

In [17]:
errors6 = []

for (name, true_label) in devtest_names:
    guess = classifier6.classify(gender_features6(name))
    if guess != true_label:
        errors6.append((true_label, guess, name))

# Sort errors alphabetically
errors6 = sorted(errors6, key=lambda x: x[2])

print(f"Number of errors: {len(errors6)} out of {len(devtest_names)}")
print("\nSample Name Misclassifications:\n")

for (true_label, guess, name) in errors6[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")

Number of errors: 114 out of 500

Sample Name Misclassifications:

Abagail         - predicted: male   | actual: female
Adel            - predicted: male   | actual: female
Allah           - predicted: female | actual: male
Amabel          - predicted: male   | actual: female
Aurel           - predicted: male   | actual: female
Bab             - predicted: male   | actual: female
Barbey          - predicted: male   | actual: female
Barnabe         - predicted: female | actual: male
Beau            - predicted: female | actual: male
Bell            - predicted: male   | actual: female
Beulah          - predicted: male   | actual: female
Bren            - predicted: male   | actual: female
Brice           - predicted: female | actual: male
Bryn            - predicted: female | actual: male
Candis          - predicted: male   | actual: female
Charleen        - predicted: male   | actual: female
Chelsey         - predicted: male   | actual: female
Christal        - predicted: male   | actu
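
### Two classifiers with the same error count do not necessarily misclassify the same names. A quick way to check is to compare the error lists as sets. The sketch below uses only the alphabetically first entries (through "Bren") of the gender_features5 and gender_features6 listings shown above; the variable names are my own.

```python
# Leading entries of each sorted error listing, copied from the outputs above
errors5_head = ["Abagail", "Adel", "Amabel", "Bab", "Barbey",
                "Barnabe", "Beau", "Bell", "Bren"]
errors6_head = ["Abagail", "Adel", "Allah", "Amabel", "Aurel", "Bab",
                "Barbey", "Barnabe", "Beau", "Bell", "Beulah", "Bren"]

set5, set6 = set(errors5_head), set(errors6_head)

shared = sorted(set5 & set6)        # misclassified by both
only_in_5 = sorted(set5 - set6)     # fixed by gender_features6
only_in_6 = sorted(set6 - set5)     # newly misclassified by gender_features6

print("Both:", shared)
print("Only gender_features5:", only_in_5)
print("Only gender_features6:", only_in_6)
```

### With the full errors5 and errors6 lists from the notebook, the same set operations would give the complete picture.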

In [27]:
# See metrics for each Naive Bayes Classifier

import pandas as pd

# Print all columns together
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.precision', 1)

# Original Classifier info
original_classifiers = [
    'gender_features (Last letter only)',
    'gender_features3 (First letter, last 2 letters, name length)',
    'gender_features4 (First 2 letters, last 2 letters, name length)',
    'gender_features5 (Same as #4, with Vowel & Consonant counts + ending & starting vowels)',
    'gender_features6 (Same as #5, without Vowel & Consonant counts)'
]

# Split into Classifier and Description
split_classifiers = [c.split('(', 1) for c in original_classifiers]
classifier_names = [c[0].strip() for c in split_classifiers]
descriptions = [c[1].strip(") ").strip() if len(c) > 1 else '' for c in split_classifiers]

# Create summary table
classifier_summary = pd.DataFrame({
    'Classifier': classifier_names,
    'Description': descriptions,
    'Dev-Test Accuracy (%)': [
        devtest_accuracy * 100,
        devtest_accuracy3 * 100,
        devtest_accuracy4 * 100,
        devtest_accuracy5 * 100,
        devtest_accuracy6 * 100
    ],
    'Final Test Accuracy (%)': [
        test_accuracy * 100,
        test_accuracy3 * 100,
        test_accuracy4 * 100,
        test_accuracy5 * 100,
        test_accuracy6 * 100
    ],
    '# Errors': [
        len(errors),
        len(errors3),
        len(errors4),
        len(errors5),
        len(errors6)
    ]
})

# Display table
print(classifier_summary.to_string(index=False))

      Classifier                                                          Description  Dev-Test Accuracy (%)  Final Test Accuracy (%)  # Errors
 gender_features                                                     Last letter only                   75.2                     78.6       124
gender_features3                            First letter, last 2 letters, name length                   78.2                     82.2       109
gender_features4                         First 2 letters, last 2 letters, name length                   79.2                     82.2       104
gender_features5 Same as #4, with Vowel & Consonant counts + ending & starting vowels                   77.2                     80.4       114
gender_features6                         Same as #5, without Vowel & Consonant counts                   77.2                     82.0       114


### Overall, classifier 4 (gender_features4) has the best performance among the Naive Bayes variants. Naive Bayes assumes the features are conditionally independent given the class (male or female name). It counts how often each feature value (like a letter in a name) occurs in each class and turns those counts into probabilities. For a new name, it multiplies the class prior by the probability of each of the name's feature values under that class, and the class with the highest resulting score is the prediction. I want to see how it compares with the Decision Tree and Maximum Entropy classifiers.
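
### Before moving on, the Naive Bayes computation described above can be sketched from scratch in a few lines. This is a simplified illustration, not NLTK's implementation: the function names, the toy data, and the add-one smoothing are all my own assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Count labels and, per (label, feature name), the feature values seen."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)   # (label, feature name) -> value counts
    values = defaultdict(set)               # feature name -> distinct values seen
    for features, label in samples:
        label_counts[label] += 1
        for fname, fval in features.items():
            feature_counts[(label, fname)][fval] += 1
            values[fname].add(fval)
    return label_counts, feature_counts, values

def classify_nb(model, features):
    """Return the label with the highest log prior + summed log likelihoods."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)                      # log prior
        for fname, fval in features.items():
            seen = feature_counts[(label, fname)][fval]
            # add-one smoothing (an assumption) so unseen values are not zero
            score += math.log((seen + 1) / (count + len(values[fname]) + 1))
        scores[label] = score
    return max(scores, key=scores.get)

def last_letter(name):
    return {"last_letter": name[-1].lower()}

# Hypothetical toy training data (not the real corpus)
toy = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
       ("Mark", "male"), ("Erik", "male"), ("Jack", "male")]
model = train_nb([(last_letter(n), g) for n, g in toy])

print(classify_nb(model, last_letter("Sofia")))
print(classify_nb(model, last_letter("Frank")))
```

### Summing log probabilities is equivalent to multiplying the probabilities, and it avoids numerical underflow when there are many features.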

In [35]:
from nltk import classify, NaiveBayesClassifier, DecisionTreeClassifier, MaxentClassifier

# Best performing NB feature extractor
def gender_features4(name):
    features = {
        'first_letter': name[0].lower(),
        'first_two': name[:2].lower(),      
        'last_letter': name[-1].lower(),
        'last_two': name[-2:].lower(),      
        'name_length': len(name)
    }
    return features

# Prepare feature sets 
train_set = [(gender_features4(n), g) for (n, g) in train_names]
devtest_set = [(gender_features4(n), g) for (n, g) in devtest_names]
test_set = [(gender_features4(n), g) for (n, g) in test_names]

# Train Classifiers 
nb_classifier = NaiveBayesClassifier.train(train_set)
dt_classifier = DecisionTreeClassifier.train(train_set)
me_classifier = MaxentClassifier.train(train_set, max_iter = 25)

# Compute Accuracies 
nb_devtest_acc = classify.accuracy(nb_classifier, devtest_set)
nb_test_acc = classify.accuracy(nb_classifier, test_set)

dt_devtest_acc = classify.accuracy(dt_classifier, devtest_set)
dt_test_acc = classify.accuracy(dt_classifier, test_set)

me_devtest_acc = classify.accuracy(me_classifier, devtest_set)
me_test_acc = classify.accuracy(me_classifier, test_set)

# Count Misclassifications 
def count_errors(classifier, dataset):
    # Count the feature sets whose predicted label differs from the true label
    return sum(1 for (features, true) in dataset
               if classifier.classify(features) != true)

nb_errors = count_errors(nb_classifier, devtest_set)
dt_errors = count_errors(dt_classifier, devtest_set)
me_errors = count_errors(me_classifier, devtest_set)

print("_______________________________________________________________________________")
# Display results
results_df = pd.DataFrame({
    "Classifier": ["Naive Bayes", "Decision Tree", "Max Entropy"],
    "Dev-Test Accuracy": [f"{nb_devtest_acc*100:.2f}%", f"{dt_devtest_acc*100:.2f}%", f"{me_devtest_acc*100:.2f}%"],
    "Final Test Accuracy": [f"{nb_test_acc*100:.2f}%", f"{dt_test_acc*100:.2f}%", f"{me_test_acc*100:.2f}%"],
    "# Misclassified (Dev-Test)": [nb_errors, dt_errors, me_errors]
})

print(results_df.to_string(index=False))

  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.43768        0.782
             3          -0.37358        0.817
             4          -0.34217        0.821
             5          -0.32391        0.825
             6          -0.31203        0.828
             7          -0.30369        0.829
             8          -0.29752        0.832
             9          -0.29277        0.833
            10          -0.28897        0.834
            11          -0.28588        0.836
            12          -0.28330        0.836
            13          -0.28111        0.836
            14          -0.27922        0.836
            15          -0.27758        0.836
            16          -0.27614        0.837
            17          -0.27486        0.837
            18          -0.27372        0.836
            19          -0.27270        0.837
  

### The Decision Tree approach builds a tree of rules that split the data on one feature at a time, with the most informative features tested nearest the root. The Maximum Entropy approach assigns a weight to each feature based on how strongly it predicts an observation's label. Unlike Naive Bayes, it does not assume the features are independent, so it can handle overlapping or correlated features better. Looking at the results above, the Naive Bayes and Max Entropy models performed about the same. Overall, these results are quite good and what I'd expect.
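
### To make the idea of feature weights concrete, the sketch below shows the general form of a maximum entropy (multinomial logistic) model: each label's score is the sum of the weights of the active features, and the scores are normalized into probabilities. The weights are invented for illustration and are not taken from the trained me_classifier; the function name is also my own.

```python
import math

def maxent_probs(weights, active_features, labels):
    """P(label | features) is proportional to
    exp(sum of the weights for that label's active features)."""
    scores = {
        label: sum(weights.get((label, f), 0.0) for f in active_features)
        for label in labels
    }
    z = sum(math.exp(s) for s in scores.values())   # normalizing constant
    return {label: math.exp(s) / z for label, s in scores.items()}

# Hypothetical weights keyed by (label, feature); missing pairs default to 0
weights = {
    ("female", "last_letter=a"): 1.5,
    ("male", "last_letter=a"): -1.5,
    ("female", "last_two=na"): 0.8,
}

probs = maxent_probs(weights, ["last_letter=a", "last_two=na"], ["female", "male"])
for label, p in probs.items():
    print(f"P({label}) = {p:.3f}")
```

### Because the weights are fitted jointly during training, two overlapping features such as last_letter and last_two can share the credit instead of each being counted in full.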

### Out of curiosity, I want to see if character n-grams can capture morphological patterns that the prior features miss, like -ail, -bel, -die, and -ud, along with other common gendered substrings such as -ann, -elle, -ette, -ine, and -son. Names like Annette or Madeline should then lean female, while names like Jackson and Nathaniel should lean male.
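
### As a quick illustration of what character n-grams are, the sketch below lists every substring of length 2 to 4 in a name. The helper name char_ngrams is my own and is separate from the feature extractor in the next cell.

```python
def char_ngrams(name, min_n=2, max_n=4):
    """Return all character n-grams of the lowercased name, shortest first."""
    name = name.lower()
    return [
        name[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(name) - n + 1)
    ]

print(char_ngrams("Anne"))
# ['an', 'nn', 'ne', 'ann', 'nne', 'anne']
```

### Note that these n-grams carry no position information, so "ann" at the start of a name and "ann" at the end produce the same feature.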

In [36]:
import re

# Feature extractor 
def gender_features_ngrams_substrings(name):
    name = name.lower()
    features = {}
    
    # Basic structure features
    features['first_letter'] = name[0]
    features['last_letter'] = name[-1]
    features['name_length'] = len(name)

    # Vowel/consonant features
    vowels = set('aeiou')
    vowel_count = sum(1 for c in name if c in vowels)
    consonant_count = len(name) - vowel_count
    features['vowel_ratio'] = round(vowel_count / max(1, len(name)), 2)
    features['ends_with_vowel'] = name[-1] in vowels
    features['vowel_pattern'] = re.sub(r'[^aeiou]', 'C', re.sub(r'[aeiou]', 'V', name))

    # Character n-grams up to length 4
    for n in range(2, 5):
        for i in range(len(name) - n + 1):
            gram = name[i:i+n]
            features[f'ngram_{n}_{gram}'] = True
    
    # Gendered substrings
    female_substrings = ['ann', 'elle', 'ette', 'ine']
    male_substrings = ['son', 'ton', 'ian', 'el']
    
    for sub in female_substrings:
        features[f'has_female_sub_{sub}'] = sub in name
    for sub in male_substrings:
        features[f'has_male_sub_{sub}'] = sub in name

    return features
    
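########################
# Load and split the Names corpus again.
# Note: re-shuffling draws a new random split, so names from the earlier
# test and dev-test sets can land in this training set, and the scores
# below are not measured on the same names as in the earlier cells.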
male_names = [(name, 'male') for name in names.words('male.txt')]
female_names = [(name, 'female') for name in names.words('female.txt')]

all_names = male_names + female_names
random.shuffle(all_names)

test_names = all_names[:500]
devtest_names = all_names[500:1000]
train_names = all_names[1000:]

# Prepare feature sets
train_set = [(gender_features_ngrams_substrings(name), gender) for (name, gender) in train_names]
devtest_set = [(gender_features_ngrams_substrings(name), gender) for (name, gender) in devtest_names]
test_set = [(gender_features_ngrams_substrings(name), gender) for (name, gender) in test_names]

# Train Max Entropy Classifier
classifier_me = MaxentClassifier.train(train_set, max_iter=25)

# Evaluate
devtest_accuracy = accuracy(classifier_me, devtest_set)
test_accuracy = accuracy(classifier_me, test_set)

# Count errors on dev-test
errors = [(true, guess, name) for (name, true) in devtest_names
          if (guess := classifier_me.classify(gender_features_ngrams_substrings(name))) != true]


# Display results
print(f"Development test accuracy: {devtest_accuracy*100:.2f}%")
print(f"Final test accuracy: {test_accuracy*100:.2f}%")
print(f"Number of misclassified names (Dev-Test): {len(errors)}\n")

print("Sample misclassified names (up to 25):")
for (true_label, guess, name) in errors[:25]:
    print(f"{name:15} - predicted: {guess:6} | actual: {true_label}")


  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.51219        0.690
             3          -0.44231        0.819
             4          -0.39384        0.865
             5          -0.35847        0.880
             6          -0.33144        0.889
             7          -0.31002        0.895
             8          -0.29254        0.901
             9          -0.27795        0.906
            10          -0.26555        0.908
            11          -0.25483        0.912
            12          -0.24547        0.914
            13          -0.23719        0.917
            14          -0.22981        0.920
            15          -0.22317        0.922
            16          -0.21716        0.923
            17          -0.21168        0.924
            18          -0.20667        0.925
            19          -0.20205        0.927
  

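### This feature extractor improves on the prior one: dev-test accuracy rose from 79.2% to 82.6% and final test accuracy from 82.8% to 85.6%, with fewer misclassified dev-test names (87 vs. 104). Because this cell re-shuffled the corpus, the two sets of scores come from different random splits, so the comparison is indicative rather than exact. In every run above, the test accuracy came out a few points higher than the dev-test accuracy. Since the features were tuned against the dev-test set, I would normally expect the opposite, so the gap most likely reflects which names happened to land in each 500-name subset.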
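
### When comparing accuracies measured on 500 names, it helps to know how much they can move by chance alone. The sketch below uses the standard binomial approximation for the standard error of an accuracy estimate. The function name is my own, and the approximation assumes the names in the evaluation set are independent samples.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Dev-test accuracy of the gender_features4 classifier on 500 names
se = accuracy_standard_error(0.792, 500)
print(f"Standard error: {se * 100:.1f} percentage points")
```

### At roughly 1.8 percentage points per standard error, differences of one or two points between feature sets, or between the dev-test and test sets, are within the range that sampling noise could explain.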