#Data620:Project03

Mahmud Hasan Al Raji

#Project Details

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

#Import Libraries and Get Required Datasets

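In this part, I imported the NLTK libraries and downloaded the Names Corpus. The corpus has two files: one with male names and one with female names. I labeled each name as "male" or "female", combined the two lists, and shuffled the result so that the split is random. Finally, I divided the data into three parts: 500 names for the test set, 500 names for the dev-test set, and the remaining 6944 names for training. The corpus holds 7944 names in total, so the training set is slightly larger than the 6900 mentioned in the assignment text. The training set is used to fit the model, the dev-test set to check progress while choosing features, and the test set to measure how well the final model works on new data. Because the shuffle is not seeded, the exact scores below change from run to run.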

In [16]:
import nltk
import random
from nltk.corpus import names
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Download dataset
nltk.download("names")

# Prepare and shuffle the data
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')]
)
random.shuffle(labeled_names)

# Split into train, dev-test, and test
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]

print(f"Train: {len(train_names)}, Dev-test: {len(devtest_names)}, Test: {len(test_names)}")


Train: 6944, Dev-test: 500, Test: 500


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
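Since the shuffle above is unseeded, rerunning the notebook produces a different split and slightly different scores. The following is a minimal sketch of a reproducible split, not the code used for the results in this notebook. The helper name `split_names`, the seed value 42, and the toy stand-in data are all assumptions made for illustration.

```python
import random

def split_names(labeled, n_test=500, n_devtest=500, seed=42):
    """Shuffle a copy of `labeled` with a fixed seed and split it three ways.

    `split_names` and the default seed are hypothetical, for illustration only.
    """
    rng = random.Random(seed)   # local generator, global random state is untouched
    shuffled = list(labeled)    # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in with the same size as the Names Corpus, so the sketch runs without NLTK
toy = [(f"name{i}", "female" if i % 3 else "male") for i in range(7944)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

In the notebook, `labeled_names` would be passed in place of `toy`; the same seed then yields the same three subsets on every run.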


#Step 1: Define Incremental Feature Extractors

In this step, I created three simple functions that define different ways to extract features from a name for gender prediction. The first function, features_last_letter, uses only the last letter of a name since it often indicates gender (for example, names ending in "a" are usually female). The second one, features_first_last, uses both the first and last letters to capture more information from the name. The third function, features_suffix, focuses on the last one, two, and three letters (suffixes), which often show strong gender patterns like "-ie" or "-son." I improved the classifier gradually using these different feature sets and tested each version only on the dev-test set to track progress and find which combination predicts gender most accurately.



In [17]:
#Baseline: last letter only
def features_last_letter(name):
    return {'last_letter': name[-1].lower()}

#Add first letter
def features_first_last(name):
    name = name.lower()
    return {'first_letter': name[0], 'last_letter': name[-1]}

#Add suffix features
def features_suffix(name):
    name = name.lower()
    return {
        'suffix1': name[-1:],
        'suffix2': name[-2:],
        'suffix3': name[-3:]
    }
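To make the three feature sets concrete, the sketch below repeats the extractors and prints what each one returns. The sample names "Jackson" and "Al" are chosen only for illustration.

```python
def features_last_letter(name):
    return {'last_letter': name[-1].lower()}

def features_first_last(name):
    name = name.lower()
    return {'first_letter': name[0], 'last_letter': name[-1]}

def features_suffix(name):
    name = name.lower()
    return {'suffix1': name[-1:], 'suffix2': name[-2:], 'suffix3': name[-3:]}

# Show the feature dictionary each extractor builds for one sample name
for extractor in (features_last_letter, features_first_last, features_suffix):
    print(extractor.__name__, extractor("Jackson"))

# Slicing never raises on short names: a two-letter name simply
# gets the same value for suffix2 and suffix3
print(features_suffix("Al"))
```

Note that `suffix1` carries the same information as `last_letter`, so the suffix extractor extends the baseline rather than replacing it.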


#Step 2: Train and Evaluate Incrementally on Dev-Test

In this step, I defined a function called evaluate_devtest to train and test the classifier using different feature sets. The function first creates a training set and a dev-test set by applying the selected feature extractor to the names. It then trains a Naive Bayes classifier on the training set and checks its accuracy on the dev-test set. The function prints the feature set name and its accuracy score so that I can compare different models easily. I used this function to test each version of the classifier on the dev-test set and observe which feature extractor gives better performance before moving to the final test evaluation.

In [18]:
def evaluate_devtest(feature_func, train, devtest):
    train_set = [(feature_func(n), gender) for (n, gender) in train]
    devtest_set = [(feature_func(n), gender) for (n, gender) in devtest]

    classifier = NaiveBayesClassifier.train(train_set)
    dev_acc = accuracy(classifier, devtest_set)

    print(f"Feature set: {feature_func.__name__}")
    print(f"Dev-test Accuracy: {dev_acc:.3f}")
    return classifier, dev_acc


In [19]:
#Test each version on dev-test only:
classifiers = {}
classifiers['last_letter'], acc1 = evaluate_devtest(features_last_letter, train_names, devtest_names)
classifiers['first_last'], acc2 = evaluate_devtest(features_first_last, train_names, devtest_names)
classifiers['suffix'], acc3 = evaluate_devtest(features_suffix, train_names, devtest_names)


Feature set: features_last_letter
Dev-test Accuracy: 0.770
Feature set: features_first_last
Dev-test Accuracy: 0.774
Feature set: features_suffix
Dev-test Accuracy: 0.776
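Accuracy alone does not say which names a classifier gets wrong, and looking at the errors is a common way to decide which feature to try next. The sketch below shows one way to list misclassified names. The helper `collect_errors`, the rule-based `toy_classify`, and the five sample names are assumptions made so that the example runs without NLTK; they are not part of this notebook's results.

```python
from collections import Counter

def features_last_letter(name):
    return {'last_letter': name[-1].lower()}

def collect_errors(classify, feature_func, labeled_names):
    """Return sorted (gold, guess, name) triples for every misclassified name."""
    errors = []
    for name, gold in labeled_names:
        guess = classify(feature_func(name))
        if guess != gold:
            errors.append((gold, guess, name))
    return sorted(errors)

def toy_classify(features):
    # Hypothetical stand-in for a trained model: vowel ending means female
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

toy_devtest = [('Anna', 'female'), ('Luke', 'male'), ('Carmen', 'female'),
               ('John', 'male'), ('Joshua', 'male')]

errors = collect_errors(toy_classify, features_last_letter, toy_devtest)
for gold, guess, name in errors:
    print(f"correct={gold:<8} guess={guess:<8} name={name}")

# Tally the errors by last letter to see which endings cause trouble
print(Counter(name[-1].lower() for _, _, name in errors))
```

With the objects defined in this notebook, the equivalent call would be `collect_errors(classifiers['suffix'].classify, features_suffix, devtest_names)`.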


#Step 3: Pick the Best One

In this step, I compared the accuracy of all three feature sets and selected the one that performed the best on the dev-test set. The code checks the accuracy values and automatically picks the feature model with the highest score. Then it prints the name of the best model and its dev-test accuracy. This step helps me decide which classifier to move forward with for the final evaluation on the unseen test set.

In [25]:
best_feature = max([('last_letter', acc1), ('first_last', acc2), ('suffix', acc3)], key=lambda x: x[1])
print(f"\n Best performing model on dev-test: {best_feature[0]} (Accuracy: {best_feature[1]:.3f})")



 Best performing model on dev-test: suffix (Accuracy: 0.776)
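The winning margin is small, so it helps to estimate how much an accuracy measured on 500 names can vary by chance. The sketch below uses the normal approximation to the binomial, which is an assumption about the sampling model rather than something computed by NLTK; the helper name `accuracy_std_error` is hypothetical.

```python
import math

def accuracy_std_error(acc, n):
    """Standard error of an accuracy estimated from n independent examples
    (normal approximation to the binomial)."""
    return math.sqrt(acc * (1 - acc) / n)

# Dev-test accuracies reported above, each measured on 500 names
dev_scores = {'last_letter': 0.770, 'first_last': 0.774, 'suffix': 0.776}
n_dev = 500

for name, acc in dev_scores.items():
    half_width = 1.96 * accuracy_std_error(acc, n_dev)  # approximate 95% interval
    print(f"{name}: {acc:.3f} +/- {half_width:.3f}")
```

Each interval is roughly plus or minus 0.037 wide, far larger than the 0.002 to 0.006 gaps between the three feature sets, so the ranking could change with a different random split.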


#Step 4: Final Evaluation on the Test Set

In this step, I trained the final classifier using the suffix features because they performed best on the dev-test set. It is trained on the same 6944 training names as before, so it is equivalent to the suffix classifier from Step 2. Then I tested it on the unseen test set to get the final accuracy score. Finally, I displayed the most informative features that the classifier used to tell male and female names apart.

In [21]:
# Prepare test set
test_set = [(features_suffix(n), gender) for (n, gender) in test_names]

# Train final classifier using suffix features and ALL training data
final_classifier = NaiveBayesClassifier.train(
    [(features_suffix(n), gender) for (n, gender) in train_names]
)

test_acc = accuracy(final_classifier, test_set)
print(f"\n Final Test Accuracy (unseen data): {test_acc:.3f}")

final_classifier.show_most_informative_features(5)



 Final Test Accuracy (unseen data): 0.784
Most Informative Features
                 suffix2 = 'na'           female : male   =     93.9 : 1.0
                 suffix2 = 'la'           female : male   =     72.5 : 1.0
                 suffix2 = 'ia'           female : male   =     52.7 : 1.0
                 suffix2 = 'ld'             male : female =     36.9 : 1.0
                 suffix1 = 'a'            female : male   =     33.4 : 1.0


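The final test accuracy of my model on unseen data was 0.784, well above the 0.648 that always guessing "female" would score on this test set (324 of its 500 names are female). The five most informative features show strong patterns: the endings "na", "la", and "ia" are overwhelmingly female (for example, "na" is about 94 times more likely in a female name than in a male name), the ending "ld" is about 37 times more likely in a male name, and a final "a" on its own is about 33 times more likely in a female name. This confirms that suffix patterns are very helpful in predicting the gender of names.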

I was satisfied with the suffix-based classifier since it achieved the highest accuracy of the three versions on the dev-test set, although its margin over the other two (0.776 versus 0.774 and 0.770) amounts to only one to three names out of 500. On the unseen test set, the accuracy was 0.784 compared with 0.776 on the dev-test set, a gap of four names (392 versus 388 correct). The two results are very close, which is what I would expect because both sets are random samples of the same size from the same corpus. The agreement suggests that the classifier performs consistently on new, unseen names and that choosing the features on the dev-test set did not overfit to it.

#Performance Comparison (Dev-test vs. Test)

In [24]:
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd

# Dev-test Performance

actual_dev = [gender for (name, gender) in devtest_names]
pred_dev = [final_classifier.classify(features_suffix(name)) for (name, gender) in devtest_names]

print("\n Dev-test set performance (Classification Report):\n")
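# Pass labels explicitly: otherwise scikit-learn sorts the classes alphabetically
# ('female', 'male') and target_names would be attached to the wrong rows
print(classification_report(actual_dev, pred_dev, labels=['male', 'female'], target_names=['male', 'female']))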

cm_dev = confusion_matrix(actual_dev, pred_dev, labels=['male', 'female'])
cm_dev_df = pd.DataFrame(cm_dev, index=['Actual_Male', 'Actual_Female'],
                                   columns=['Pred_Male', 'Pred_Female'])

print("\nDev-test Confusion Matrix:")
display(cm_dev_df)


# Test Performance

actual_test = [gender for (name, gender) in test_names]
pred_test = [final_classifier.classify(features_suffix(name)) for (name, gender) in test_names]

print("\n Test set Performance (Classification Report):\n")
# Same explicit label order as the confusion matrix below
print(classification_report(actual_test, pred_test, labels=['male', 'female'], target_names=['male', 'female']))

cm_test = confusion_matrix(actual_test, pred_test, labels=['male', 'female'])
cm_test_df = pd.DataFrame(cm_test, index=['Actual_Male', 'Actual_Female'],
                                     columns=['Pred_Male', 'Pred_Female'])

print("\nTest Confusion Matrix:")
display(cm_test_df)



 Dev-test set performance (Classification Report):

              precision    recall  f1-score   support

        male       0.66      0.79      0.72       180
      female       0.87      0.77      0.81       320

    accuracy                           0.78       500
   macro avg       0.76      0.78      0.77       500
weighted avg       0.79      0.78      0.78       500


Dev-test Confusion Matrix:


               Pred_Male  Pred_Female
Actual_Male          142           38
Actual_Female         74          246



 Test set Performance (Classification Report):

              precision    recall  f1-score   support

        male       0.68      0.73      0.70       176
      female       0.85      0.81      0.83       324

    accuracy                           0.78       500
   macro avg       0.76      0.77      0.77       500
weighted avg       0.79      0.78      0.79       500


Test Confusion Matrix:


               Pred_Male  Pred_Female
Actual_Male          129           47
Actual_Female         61          263


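Overall performance on the test set is slightly better than on the dev-test set: accuracy is 0.784 versus 0.776, a difference of four names out of 500. The per-class picture is mixed rather than uniformly better. For female names (the majority class, with 320 dev-test and 324 test examples), the F1-score rose from 0.81 on dev-test to 0.83 on test, and the number of female names misclassified as male fell from 74 to 61. For male names, the F1-score fell from 0.72 to 0.70, and the number of male names misclassified as female rose from 38 to 47. In both sets the classifier is noticeably weaker on male names, with a precision of only 0.66 to 0.68. Because the feature set was chosen using the dev-test set, the dev-test score would normally be the slightly optimistic one, so a marginally higher test score is best read as sampling variation between two random 500-name samples from the same corpus. The close agreement suggests that the classifier generalizes to unseen names and that selecting features on the dev-test set did not overfit to it.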
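The per-class figures can be cross-checked directly from the two confusion matrices above. The sketch below recomputes precision, recall, and F1 from the raw counts. The helper `per_class_metrics` is a hypothetical name introduced for illustration, and the matrices are copied from the output above with rows as actual classes and columns as predicted classes.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] = number of names with actual labels[i] predicted as labels[j]."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        predicted = sum(row[k] for row in cm)   # column total: all predicted as label
        actual = sum(cm[k])                     # row total: all truly label
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

labels = ['male', 'female']
cm_dev = [[142, 38], [74, 246]]    # dev-test counts from the output above
cm_test = [[129, 47], [61, 263]]   # test counts from the output above

for title, cm in (('dev-test', cm_dev), ('test', cm_test)):
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(labels)))
    print(f"{title}: accuracy = {correct / total:.3f}")
    for label, (p, r, f1) in per_class_metrics(cm, labels).items():
        print(f"  {label:<6} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The row totals also give the class balance: 180 male and 320 female names in the dev-test set, and 176 male and 324 female names in the test set.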