<a href="https://colab.research.google.com/github/Lfirenzeg/msds620/blob/main/Project%203/620_Name_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data 620
## Project 3
## By Luis Munoz Grass

Using any of the three classifiers (decision trees, naive Bayes classifiers, and maximum entropy classifiers) described in Chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set?
Is this what you'd expect?

## Solution
We will begin by installing and importing NLTK, then download just the Names corpus.

In [1]:
!pip install nltk
import nltk
nltk.download('names')




[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

We can now take a look at the data set. In particular, we want to see how often each first and last letter occurs in male and female names.

In [8]:
import collections
import pandas as pd

from nltk.corpus import names
import random

# load and label
labeled_names = ([(name, 'male')   for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)


The assignment calls for:

500 names for the test set

500 names for the dev-test set

the remaining 6900 for training

In [9]:
# roughly 7,900 names in total: 500 test, 500 dev-test, the rest for training
test_set    = labeled_names[:500]
devtest_set = labeled_names[500:1000]
train_set   = labeled_names[1000:]
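
Because `random.shuffle` is called without a seed, the three subsets (and every accuracy figure that follows) can change from run to run. Below is a minimal sketch of a reproducible split. The helper name `split_names` and the seed value 620 are illustrative assumptions, not part of the notebook, and a toy list stands in for the real corpus:

```python
import random

def split_names(labeled, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of `labeled` with a fixed seed and cut it into
    (test, dev-test, train). Helper name and seed are illustrative."""
    shuffled = list(labeled)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, global state unaffected
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Toy stand-in for the real corpus: 2,000 fake (name, label) pairs.
toy = [(f"name{i}", "male" if i % 3 == 0 else "female") for i in range(2000)]
test, devtest, train = split_names(toy)
print(len(test), len(devtest), len(train))  # 500 500 1000
```

With the real data, `split_names(labeled_names)` would replace the three slices above and give the same subsets on every run.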


In [10]:
# pull raw lists
male_names   = names.words('male.txt')
female_names = names.words('female.txt')

# count first and last letters
male_first  = collections.Counter(n[0].lower() for n in male_names)
male_last   = collections.Counter(n[-1].lower() for n in male_names)
fem_first   = collections.Counter(n[0].lower() for n in female_names)
fem_last    = collections.Counter(n[-1].lower() for n in female_names)

# build a table of the top 10 in each category
def top_df(counter, gender, position, n=10):
    rows = [(letter, count, gender, position)
            for letter, count in counter.most_common(n)]
    return pd.DataFrame(rows, columns=['letter','count','gender','position'])

df = pd.concat([
    top_df(male_first,   'male',   'first',  10),
    top_df(fem_first,    'female', 'first',  10),
    top_df(male_last,    'male',   'last',   10),
    top_df(fem_last,     'female', 'last',   10),
]).reset_index(drop=True)

# show df
df

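|   | letter | count | gender | position |
|---|--------|-------|--------|----------|
| 0 | s | 238 | male | first |
| 1 | a | 213 | male | first |
| 2 | m | 200 | male | first |
| 3 | r | 200 | male | first |
| 4 | t | 188 | male | first |
| 5 | b | 173 | male | first |
| 6 | c | 166 | male | first |
| 7 | h | 163 | male | first |
| 8 | g | 156 | male | first |
| 9 | w | 151 | male | first |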


We see a few clear patterns:

### Last letter differences

Female names mostly end in "a" or "e"

a: 1773 female vs virtually 0 male

e: 1432 female vs 468 male

We can anticipate that features such as "ends_with_a" and "ends_with_e" will be strong female signals, with "ends_with_a" the stronger of the two.

Male names tend to end in consonants like "n", "y", "s", or "d".

n: 478 male vs 386 female

y: 332 male vs 461 female (more female names in absolute terms, but a larger share of male names, since the corpus has far fewer male names)

s: 230 male vs 93 female


Features like "ends_with_n", "ends_with_s", or even "ends_with_consonant" may help flag males.
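
Since the corpus contains far more female than male names, raw counts can mislead. The sketch below converts the counts quoted above into per-gender rates. The class sizes (5001 female, 2943 male) are an assumption about the corpus that should be verified with `len(names.words(...))`, and `ending_rates` is a hypothetical helper:

```python
# Counts quoted in the text above: (female, male) names ending in each letter.
last_letter_counts = {
    "e": (1432, 468),
    "n": (386, 478),
    "y": (461, 332),
    "s": (93, 230),
}

# Assumed class sizes; verify with len(names.words('female.txt')) etc.
N_FEMALE, N_MALE = 5001, 2943

def ending_rates(female_count, male_count):
    """Return the share of each gender's names that end in the letter."""
    return female_count / N_FEMALE, male_count / N_MALE

for letter, (f, m) in last_letter_counts.items():
    f_rate, m_rate = ending_rates(f, m)
    leaning = "female" if f_rate > m_rate else "male"
    print(f"{letter}: {f_rate:.1%} of female names, "
          f"{m_rate:.1%} of male names -> leans {leaning}")
```

Under these assumed totals, "y" leans male even though more female names end in it, which is the sense in which it is proportionally stronger for male names.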

---

### First letter differences

Female names most often start with m, c, or a (484, 469, 443)

Male names most often start with s, a, or m (238, 213, 200)

While there's overlap on 'a' and 'm', seeing a name start with c or j is a stronger female hint; s or t is a stronger male hint.

---

### Vowel vs consonant patterns

Female names end in "a" or "e" nearly two thirds of the time (1773 + 1432 = 3205 of roughly 5,000 female names)

Male names end in "e" or "y" only about a quarter of the time (468 + 332 = 800 of roughly 2,900 male names)

An ends_with_vowel flag seems like a good feature.


The names are already loaded, labeled, and shuffled, so we can move on to defining features.

## Defining Gender Features

We'll start with a handful of simple features (last letter, first letter, two-letter suffix, name length, and vowel count) and then iterate. A basic extractor may look like:

In [11]:
def gender_features(name):
    name = name.lower()
    return {
        'last_letter' : name[-1],
        'first_letter': name[0],
        'suffix_2'    : name[-2:],
        'name_length' : len(name),
        'vowel_count' : sum(ch in 'aeiou' for ch in name)
    }
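
To see what the classifier actually receives, the extractor can be applied to a single name. The snippet repeats the definition above so that it runs on its own; the example name is arbitrary:

```python
def gender_features(name):
    name = name.lower()
    return {
        'last_letter' : name[-1],
        'first_letter': name[0],
        'suffix_2'    : name[-2:],
        'name_length' : len(name),
        'vowel_count' : sum(ch in 'aeiou' for ch in name),
    }

example = gender_features("Erica")
print(example)
# {'last_letter': 'a', 'first_letter': 'e', 'suffix_2': 'ca',
#  'name_length': 5, 'vowel_count': 3}
```

As far as we understand NLTK's classifiers, numeric values such as `name_length` are treated as discrete categories rather than as quantities, so a length of 5 and a length of 6 are simply two different feature values.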


### Baseline: Naïve Bayes

NLTK's NaiveBayesClassifier is simple to train and should give a reasonably strong baseline:

In [12]:
from nltk.classify import NaiveBayesClassifier
from nltk.classify import accuracy

# Vectorize
train_feats    = [(gender_features(n), g) for (n, g) in train_set]
devtest_feats  = [(gender_features(n), g) for (n, g) in devtest_set]
test_feats     = [(gender_features(n), g) for (n, g) in test_set]

# Train
nb_classifier = NaiveBayesClassifier.train(train_feats)

# Evaluate on dev‑test
print("Dev‑test accuracy:", accuracy(nb_classifier, devtest_feats))


Dev‑test accuracy: 0.794


Around 0.794 from the naive Bayes classifier seems like a good start. Let's see what we can get with the other classifiers.
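
For intuition, naive Bayes scores each label as a log prior plus a sum of per-feature log likelihoods and picks the highest score. The sketch below is a simplified stand-in written from scratch, not NLTK's implementation: it uses add-one smoothing (NLTK's default estimator differs), and `train_nb`, `classify_nb`, and the toy data are all hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_dict, label). Collect the counts needed for scoring."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    feature_values = defaultdict(set)     # feature -> set of values seen
    for feats, label in examples:
        for fname, fval in feats.items():
            value_counts[(label, fname)][fval] += 1
            feature_values[fname].add(fval)
    return label_counts, value_counts, feature_values

def classify_nb(model, feats):
    """Return the label with the highest log prior + summed log likelihoods."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)                 # log prior
        for fname, fval in feats.items():
            seen = value_counts[(label, fname)][fval]
            n_values = len(feature_values[fname])
            # add-one smoothing keeps unseen values from zeroing the score
            score += math.log((seen + 1) / (n_label + n_values))
        scores[label] = score
    return max(scores, key=scores.get)

toy_train = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "n"}, "male"),
    ({"last_letter": "s"}, "male"),
    ({"last_letter": "e"}, "male"),
]
model = train_nb(toy_train)
print(classify_nb(model, {"last_letter": "a"}))  # female
print(classify_nb(model, {"last_letter": "s"}))  # male
```

The "naive" part is the sum over features: each feature contributes independently, which is why correlated features such as `last_letter` and `suffix_2` can be double counted.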

### Maximum Entropy (Logistic)
NLTK's MaxentClassifier often edges out naive Bayes when the features are richer, but it is slower to train:

In [13]:
from nltk.classify import MaxentClassifier

me_classifier = MaxentClassifier.train(train_feats,
                                       algorithm='gis',
                                       max_iter=10)

print("Dev‑test accuracy (MaxEnt):", accuracy(me_classifier, devtest_feats))


  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.47725        0.777
             3          -0.40970        0.799
             4          -0.37604        0.805
             5          -0.35610        0.807
             6          -0.34305        0.808
             7          -0.33394        0.810
             8          -0.32724        0.811
             9          -0.32214        0.814
         Final          -0.31814        0.814
Dev‑test accuracy (MaxEnt): 0.822


A dev-test accuracy of 0.822 (with training accuracy around 0.814) is a pretty good baseline for maximum entropy. If the next classifier does not top that performance, we can iterate on this model.

### Decision Tree
We can also try DecisionTreeClassifier. It gives interpretable rules but can overfit:

In [14]:
from nltk.classify import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier.train(train_feats, entropy_cutoff=0.1, depth_cutoff=20)
print("Dev‑test accuracy (DT):", accuracy(dt_classifier, devtest_feats))


Dev‑test accuracy (DT): 0.78


At 0.78, the decision tree is slightly below the naive Bayes baseline (0.794) and well below maximum entropy (0.822), so we can start taking a closer look at where our best classifier goes wrong.
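
The `entropy_cutoff` argument used above is tied to the entropy of the labels at a tree node: as far as we can tell from NLTK's documentation, a node whose label entropy is already below the cutoff is not split further. A small sketch of that quantity, with a hypothetical helper `label_entropy` and made-up label lists:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = ["male"] * 50 + ["female"] * 50   # maximally mixed node
skewed   = ["male"] * 1 + ["female"] * 99    # nearly pure node

print(round(label_entropy(balanced), 3))  # 1.0
print(round(label_entropy(skewed), 3))    # 0.081
```

A 99-to-1 node falls under the 0.1 cutoff, so a cutoff of this size only halts splitting once a node is almost pure, which is one reason the tree can end up memorizing training names.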

## Iterating on features

### Examining Results

We can compare the predictions of ME on the dev-test set against the gold labels and collect the mismatches.

In [15]:
# collecting errors on the dev‑test set
errors = []
for name, actual in devtest_set:
    feats = gender_features(name)
    pred  = me_classifier.classify(feats)
    if pred != actual:
        errors.append((name, actual, pred))

print(f"Total errors on dev‑test: {len(errors)} / {len(devtest_set)}")

# first 20 mistakes
for name, actual, pred in errors[:20]:
    print(f"{name:15s}  →  actual: {actual:6s}   pred: {pred:6s}")


Total errors on dev‑test: 89 / 500
Lust             →  actual: female   pred: male  
Micheil          →  actual: male     pred: female
Meryl            →  actual: male     pred: female
Bartie           →  actual: male     pred: female
Brice            →  actual: male     pred: female
Luis             →  actual: male     pred: female
Sukey            →  actual: female   pred: male  
Sandy            →  actual: male     pred: female
Jayme            →  actual: female   pred: male  
Ted              →  actual: female   pred: male  
Lucian           →  actual: male     pred: female
Fallon           →  actual: female   pred: male  
Lawrence         →  actual: male     pred: female
Rosalind         →  actual: female   pred: male  
Arne             →  actual: male     pred: female
Lennie           →  actual: male     pred: female
Larry            →  actual: male     pred: female
Bess             →  actual: female   pred: male  
Millicent        →  actual: female   pred: male  
Oliy           

In [16]:
df_errors = pd.DataFrame(errors, columns=['name','actual','predicted'])
df_errors.head(20)


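|   | name | actual | predicted |
|---|------|--------|-----------|
| 0 | Lust | female | male |
| 1 | Micheil | male | female |
| 2 | Meryl | male | female |
| 3 | Bartie | male | female |
| 4 | Brice | male | female |
| 5 | Luis | male | female |
| 6 | Sukey | female | male |
| 7 | Sandy | male | female |
| 8 | Jayme | female | male |
| 9 | Ted | female | male |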
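
A quick way to turn an error list into feature ideas is to tally the mistakes by ending. The sketch below uses only the 19 errors fully visible in the printed output above, so the counts are illustrative rather than a complete analysis; `errors_sample` and `by_ending` are names introduced here:

```python
from collections import Counter

# The dev-test errors printed above: (name, actual, predicted).
errors_sample = [
    ("Lust", "female", "male"), ("Micheil", "male", "female"),
    ("Meryl", "male", "female"), ("Bartie", "male", "female"),
    ("Brice", "male", "female"), ("Luis", "male", "female"),
    ("Sukey", "female", "male"), ("Sandy", "male", "female"),
    ("Jayme", "female", "male"), ("Ted", "female", "male"),
    ("Lucian", "male", "female"), ("Fallon", "female", "male"),
    ("Lawrence", "male", "female"), ("Rosalind", "female", "male"),
    ("Arne", "male", "female"), ("Lennie", "male", "female"),
    ("Larry", "male", "female"), ("Bess", "female", "male"),
    ("Millicent", "female", "male"),
]

# Tally errors by (last letter, actual gender) to spot systematic misses.
by_ending = Counter((name[-1].lower(), actual) for name, actual, _ in errors_sample)
for (letter, actual), count in by_ending.most_common(5):
    print(f"ends in '{letter}', actually {actual}: {count} errors")
```

In this sample the largest group is male names ending in "e" (Bartie, Brice, Lawrence, Arne, Lennie), which hints that longer suffixes such as "-ie" or "-ce" may carry more information than the last letter alone.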


In [17]:
# list the ones we got right, at least the first 20
correct = [(n,a,me_classifier.classify(gender_features(n)))
           for n,a in devtest_set
           if me_classifier.classify(gender_features(n)) == a]
print("Some correct classifications:")
for name, actual, pred in correct[:20]:
    print(f"{name:15s}  →  {pred}")


Some correct classifications:
Waneta           →  female
Flossie          →  female
Astra            →  female
Shoshanna        →  female
Auguste          →  female
Frans            →  male
Marlin           →  male
Cyrillus         →  male
Terrye           →  female
Juliet           →  female
Wilma            →  female
Wylma            →  female
Whitaker         →  male
Hailey           →  male
Rubina           →  female
Lauren           →  female
Jillayne         →  female
Jocelyne         →  female
Erin             →  male
Goldy            →  female


### Improving Gender Features



In [18]:
def gender_features(name):
    name = name.lower()
    feats = {}

    # basic features
    feats['first_letter']   = name[0]
    feats['last_letter']    = name[-1]
    feats['name_length']    = len(name)

    # take a look at slightly longer prefixes and suffixes
    feats['prefix_2']       = name[:2]
    feats['prefix_3']       = name[:3]
    feats['suffix_2']       = name[-2:]
    feats['suffix_3']       = name[-3:]
    feats['suffix_4']       = name[-4:]

    # also compare count of vowels vs consonants
    vowels = set('aeiou')
    vowel_count = sum(ch in vowels for ch in name)
    feats['vowel_count']    = vowel_count
    feats['distinct_vowels']= len(set(name) & vowels)
    feats['vowel_ratio']    = vowel_count / len(name)
    feats['ends_with_vowel']= name[-1] in vowels
    feats['starts_with_vowel']= name[0] in vowels

    # look for double letters, consonant count
    feats['double_letter']  = any(name[i]==name[i+1] for i in range(len(name)-1))
    feats['consonant_count']= sum(ch.isalpha() and ch not in vowels for ch in name)

    # character bigrams at both ends (note: as written, these duplicate prefix_2 and suffix_2)
    feats['bigram_start']   = name[:2]
    feats['bigram_end']     = name[-2:]

    return feats
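
With this many hand-written features it is easy to add the same signal twice. Below is a minimal sketch of a check for features that carry identical values on every example. `duplicate_feature_pairs` and the cut-down `mini_features` extractor are hypothetical helpers for the demo, not part of the notebook:

```python
def duplicate_feature_pairs(feature_dicts):
    """Return pairs of feature names whose values are identical on every example."""
    names = sorted(feature_dicts[0])
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if all(d[a] == d[b] for d in feature_dicts):
                pairs.append((a, b))
    return pairs

# Cut-down extractor mirroring a few of the features above (demo only).
def mini_features(name):
    name = name.lower()
    return {
        "prefix_2": name[:2],
        "bigram_start": name[:2],
        "suffix_2": name[-2:],
        "bigram_end": name[-2:],
        "last_letter": name[-1],
    }

sample = [mini_features(n) for n in ["Erica", "Steven", "Maribell", "Luis"]]
print(duplicate_feature_pairs(sample))
# [('bigram_end', 'suffix_2'), ('bigram_start', 'prefix_2')]
```

Exact duplicates add no information; in a naive Bayes model they count the same evidence twice, and in a maximum entropy model they just split the weight between two names.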

In [19]:
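# Re-vectorize all three splits with the new extractor
# (test_feats must be rebuilt too, or the final test score uses the old features)
train_feats   = [(gender_features(n), g) for n,g in train_set]
devtest_feats = [(gender_features(n), g) for n,g in devtest_set]
test_feats    = [(gender_features(n), g) for n,g in test_set]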

### Re-Running Maximum Entropy


In [20]:
# re-train and evaluate
me = MaxentClassifier.train(train_feats,
                            algorithm='gis',
                            max_iter=15,
                            gaussian_prior_sigma=1.0)
print("Dev‑test acc:", accuracy(me, devtest_feats))

  ==> Training (15 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.46179        0.816
             3          -0.37634        0.868
             4          -0.32744        0.885
             5          -0.29538        0.894
             6          -0.27245        0.901
             7          -0.25503        0.907
             8          -0.24119        0.911
             9          -0.22982        0.916
            10          -0.22024        0.919
            11          -0.21200        0.921
            12          -0.20480        0.924
            13          -0.19844        0.925
            14          -0.19275        0.926
         Final          -0.18761        0.928
Dev‑test acc: 0.83


With the stronger feature set, training accuracy is much higher.

The original ME model (10 iterations) reached about 81.4% training accuracy and about 82.2% on dev-test.

The revamped ME model (15 iterations) reaches about 92.8% training accuracy and about 83.0% on dev-test.

That jump in training accuracy tells us the extra prefixes, suffixes, vowel counts, etc. gave the model a lot more capacity to memorize the roughly 6,900 training names.

Dev-test accuracy, meanwhile, only moved from 82.2% to 83.0%.

That is an improvement, but a modest one. It points to diminishing returns from each new feature once the big signals (last letter, vowel count, basic suffixes) are captured, and the widening gap between training and dev-test accuracy suggests the model is now memorizing idiosyncrasies of the training names that don't generalize.
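
One way to judge whether a gain from 82.2% to 83.0% is meaningful is to compare it with the sampling noise of a 500-name dev-test set. Here is a rough sketch using the binomial standard error, which treats each name as an independent trial (an approximation); `accuracy_std_error` is a hypothetical helper:

```python
import math

def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

n_devtest = 500
before, after = 0.822, 0.830   # dev-test accuracies reported above

se = accuracy_std_error(after, n_devtest)
print(f"standard error at {after:.3f}: {se:.4f}")
print(f"observed gain: {after - before:.3f}")
```

With these numbers the gain is smaller than one standard error (about 0.017), so it could plausibly come from which 500 names happened to land in the dev-test set.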

### Further Refining

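At this point we can track dev-test accuracy for each iteration count and stop where it peaks, rather than always running the full 15. We may find dev-test accuracy peaking around iteration 10 or so, with training beyond that only deepening the overfit. Note that the loop below retrains from scratch for every iteration count, which is simple but slow.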

In [28]:
best_acc    = 0.0
best_iter   = 0
best_clf    = None
sigma       = 1.0  # our current regularization
max_iters   = 15  # select how many passes to try

for i in range(1, max_iters + 1):
    clf = MaxentClassifier.train(
        train_feats,
        algorithm='gis',
        max_iter=i,
        gaussian_prior_sigma=sigma
    )
    acc = accuracy(clf, devtest_feats)
    print(f"Iter {i:2d} → dev‑test acc {acc:.3f}")
    if acc > best_acc:
        best_acc, best_iter, best_clf = acc, i, clf

print(f"\n Best dev‑test accuracy {best_acc:.3f} at iteration {best_iter}\n")
# lock in the best‑so‑far classifier
me = best_clf

# finally, see how it does on the held‑out test set
print("Test accuracy:", accuracy(me, test_feats))

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
         Final          -0.46179        0.816
Iter  1 → dev‑test acc 0.764
  ==> Training (2 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
         Final          -0.46179        0.816
Iter  2 → dev‑test acc 0.764
  ==> Training (3 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.46179        0.816
         Final          -0.37634        0.868
Iter  3 → dev‑test acc 0.804
  ==> Training (4 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.46179        0.816
       

By iteration 8 dev-test accuracy reaches about 0.834. Beyond that, extra passes only raise training accuracy, so the model is already as good as it gets on unseen dev-test names after eight passes; any further fitting just memorizes the training set.
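
The selection rule itself can be separated from training. Below is a minimal sketch of early stopping with a patience window, applied to a list of dev-test accuracies. The `best_iteration` helper and the accuracy curve are hypothetical, not the notebook's actual numbers:

```python
def best_iteration(dev_accuracies, patience=3):
    """Return (1-based iteration, accuracy) of the best dev score, stopping
    once `patience` consecutive iterations fail to improve on it."""
    best_acc, best_iter, stale = float("-inf"), 0, 0
    for i, acc in enumerate(dev_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_iter, stale = acc, i, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iter, best_acc

# Hypothetical dev-test curve, NOT the notebook's actual numbers.
curve = [0.76, 0.76, 0.80, 0.81, 0.82, 0.83, 0.83, 0.834, 0.83, 0.83, 0.832, 0.83]
print(best_iteration(curve))  # (8, 0.834)
```

Because the iteration count is chosen to maximize dev-test accuracy, the dev-test figure becomes a slightly optimistic estimate, which is one reason to expect the held-out test score to come in a little lower.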

Finally, we can vary gaussian_prior_sigma to see whether regularization dampens the high-variance features.

In [29]:
best_sigma = None
best_sigma_acc = 0.0
best_sigma_clf = None
iters = best_iter  # from above early stop

for sigma in [0.1, 0.5, 1.0, 2.0, 5.0]:
    clf = MaxentClassifier.train(
        train_feats,
        algorithm='gis',
        max_iter=iters,
        gaussian_prior_sigma=sigma
    )
    acc = accuracy(clf, devtest_feats)
    print(f"sigma={sigma:4.1f} → dev‑test acc {acc:.3f}")
    if acc > best_sigma_acc:
        best_sigma_acc, best_sigma, best_sigma_clf = acc, sigma, clf

print(f"\n Best Sigma={best_sigma} with dev‑test acc {best_sigma_acc:.3f}\n")
# lock in our final, regularized classifier
me = best_sigma_clf

# final check on test set
print("Final test accuracy:", accuracy(me, test_feats))


  ==> Training (8 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.46179        0.816
             3          -0.37634        0.868
             4          -0.32744        0.885
             5          -0.29538        0.894
             6          -0.27245        0.901
             7          -0.25503        0.907
         Final          -0.24119        0.911
sigma= 0.1 → dev‑test acc 0.834
  ==> Training (8 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.46179        0.816
             3          -0.37634        0.868
             4          -0.32744        0.885
             5          -0.29538        0.894
             6          -0.27245        0.901
             7          -0.25503        0.907
         Final          -0.24119

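So there is no change across sigma from 0.1 to 5: dev-test accuracy is 0.834 every time, and the training logs are identical as well.

The most likely explanation is that the setting never reaches the trainer. NLTK's documentation describes gaussian_prior_sigma as supported by the megam backend and ignored by other algorithms, so with algorithm='gis' every run in the sweep trains the same model. Testing regularization properly would require a backend that honors the parameter.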
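
For reference, the usual way a Gaussian prior enters a maximum entropy objective is as a quadratic penalty on the weights. The form below is the textbook one, stated as an assumption about what `gaussian_prior_sigma` is meant to control; the symbols $w_j$ (feature weights), $N$ (number of training names), and $\sigma$ (prior standard deviation) are introduced here and do not come from the notebook:

```latex
\mathcal{L}(w) \;=\; \sum_{i=1}^{N} \log p_w(y_i \mid x_i) \;-\; \sum_{j} \frac{w_j^{2}}{2\sigma^{2}}
```

Under this form a smaller $\sigma$ pulls the weights more strongly toward zero, while a very large $\sigma$ approaches the unregularized model, so a sweep that truly applied the prior would be expected to show at least some movement between 0.1 and 5.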

## Creating our own name identifier feature

We can write a helper that calls the classifier's prob_classify method, which returns a probability distribution over the two labels, 'male' and 'female'.

We will use .max() to pick the label with the highest probability, and .prob(label) to get that probability.

A loop around Python's built-in input() lets us type any name at the => prompt and immediately see the predicted label and its confidence percentage.

In [26]:
def predict_name_with_confidence(name, classifier):
    feats = gender_features(name)
    prob_dist = classifier.prob_classify(feats)
    guess = prob_dist.max()  # the label with highest prob
    confidence = prob_dist.prob(guess)  # P(guess | features of the name)
    return guess, confidence

def interactive_gender_prompt(classifier):
    print("Enter a name to classify (or just hit Enter to quit):")
    while True:
        user_input = input("=> ").strip()
        if not user_input:
            print("Goodbye!")
            break
        gender, conf = predict_name_with_confidence(user_input, classifier)
        print(f"{user_input} ⇒ {gender} ({conf*100:.1f}% confidence)\n")


In [27]:
# To launch:
interactive_gender_prompt(me)

Enter a name to classify (or just hit Enter to quit):
=> Gabrielle
Gabrielle ⇒ female (84.3% confidence)

=> Maribell
Maribell ⇒ female (72.2% confidence)

=> Steven
Steven ⇒ male (75.1% confidence)

=> Erick
Erick ⇒ male (84.6% confidence)

=> Erica
Erica ⇒ female (92.1% confidence)

=> 
Goodbye!


## To Conclude

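When defining the features to identify gender, we started with the last letter, first letter, a two-letter suffix, name length, and vowel count, then added longer prefixes and suffixes (up to 4 letters), vowel and consonant counts, a double-letter flag, and bigrams at the word ends.

We also added early stopping to limit overfitting; dev-test accuracy peaked at 8 iterations.

The sweep over sigma left dev-test accuracy unchanged at 0.834. Since the GIS trainer most likely ignores that parameter, the sweep gives no real evidence about regularization either way.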

### Limitations and next steps

Names from other cultures or very unusual spellings may fall outside patterns learned from the NLTK corpus.

We could incorporate external name databases for even higher accuracy.

A simple vote among Naive Bayes, Decision Trees, and MaxEnt might push us past the mid-80s.
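
A minimal sketch of the voting idea, assuming each classifier exposes a `.classify(feats)` method as NLTK's classifiers do. `majority_vote` and the `FixedRule` stand-ins are hypothetical and exist only so the example runs without trained models:

```python
from collections import Counter

def majority_vote(classifiers, feats):
    """Return the label predicted by most classifiers.
    Each classifier only needs a .classify(feats) method."""
    votes = Counter(clf.classify(feats) for clf in classifiers)
    return votes.most_common(1)[0][0]

class FixedRule:
    """Stand-in classifier for the demo; not an NLTK class."""
    def __init__(self, rule):
        self.rule = rule
    def classify(self, feats):
        return self.rule(feats)

stubs = [
    FixedRule(lambda f: "female" if f["last_letter"] in "ae" else "male"),
    FixedRule(lambda f: "female" if f["vowel_count"] >= 3 else "male"),
    FixedRule(lambda f: "male"),
]
print(majority_vote(stubs, {"last_letter": "a", "vowel_count": 3}))  # female
print(majority_vote(stubs, {"last_letter": "n", "vowel_count": 2}))  # male
```

With three voters and two labels there are no ties. In the notebook, the naive Bayes and decision tree models would first need retraining on the improved feature set, since they were fit on the original extractor.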