In [10]:
import nltk

In [11]:
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

Defines gender_features Function
This cell defines gender_features, which maps a word to a feature dictionary. In this basic example the only feature is the word's last letter, a common starting point for illustrating feature extraction in text classification.

In [12]:
def gender_features(word):
    """
    Extracts the last letter of a word as a feature.
    """
    return {"last_letter": word[-1]}

print(f"Features for 'obama': {gender_features('obama')}")

Features for 'obama': {'last_letter': 'a'}
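
One caveat with indexing word[-1] directly is that an empty string raises IndexError, and mixed-case input can produce separate feature values for 'A' and 'a'. The sketch below shows one possible defensive variant. The name safe_gender_features and the choice to return an empty string for empty input are assumptions made for illustration and are not part of the original notebook.

```python
def safe_gender_features(word):
    """Hypothetical variant of gender_features.

    Strips surrounding whitespace, lowercases the last letter, and
    returns an empty string as the feature value when the input is
    empty, instead of raising IndexError.
    """
    cleaned = word.strip()
    last_letter = cleaned[-1].lower() if cleaned else ""
    return {"last_letter": last_letter}


# Mixed-case, padded, and empty inputs all yield a usable feature dict.
for sample in ["Obama", "MARIA ", ""]:
    print(f"{sample!r}: {safe_gender_features(sample)}")
```

Whether to normalize case is a design choice: for the names corpus, where entries are capitalized only at the start, it mostly matters for user-supplied input at prediction time.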


Loads All Names from Corpus
This cell imports the names corpus, loads every name into the all_names variable, and prints how many names the corpus contains.

In [13]:
from nltk.corpus import names

# Get all names from the NLTK corpus.
all_names = names.words()

# Print the total count of names.
print(f"Total number of names in the corpus: {len(all_names)}")

Total number of names in the corpus: 7944


Labels Male and Female Names
In this cell, names are retrieved from 'male.txt' and 'female.txt' within the NLTK corpus. Each name is then paired with its corresponding gender label ('male' or 'female') to create a dataset of labeled_names for supervised learning.

In [15]:
from nltk.corpus import names

# Label male and female names.
male_names = [(name, 'male') for name in names.words('male.txt')]
female_names = [(name, 'female') for name in names.words('female.txt')]

# Combine both lists and print the total count.
labeled_names = male_names + female_names
print(f"Total labeled names: {len(labeled_names)}")

Total labeled names: 7944
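
Beyond the total, it can be useful to check how many examples carry each label, since an uneven split between classes affects how an accuracy figure should be read. The following is a minimal sketch of such a check. The helper label_counts and the toy list are hypothetical stand-ins; in the notebook the same function could be called on labeled_names.

```python
from collections import Counter


def label_counts(labeled_items):
    """Count how many (item, label) pairs carry each label."""
    return Counter(label for _, label in labeled_items)


# Hypothetical toy data with the same (name, label) shape as labeled_names.
toy_labeled_names = [("Ada", "female"), ("Alan", "male"), ("Grace", "female")]

counts = label_counts(toy_labeled_names)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```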


Shuffles, Trains, and Evaluates Classifier
This is the core cell for the classification task. Because labeled_names lists every male name before every female name, the cell first shuffles it so that the train/test split does not separate the data by gender. It then creates feature sets using gender_features, holds out the first 2,000 examples as the test set, and trains a Naive Bayes classifier on the rest. Finally, it demonstrates a single prediction and evaluates the classifier's accuracy on the test set.

In [16]:
import random
import nltk

# Shuffle so the train/test split is not ordered by gender.
random.shuffle(labeled_names)

# Create feature sets and print an example.
featuresets = [(gender_features(name), gender) for (name, gender) in labeled_names]
print(f"Example feature set: {featuresets[0]}")

# Split into training and testing sets.
test_set_size = 2000
train_set = featuresets[test_set_size:]
test_set = featuresets[:test_set_size]

# Train the Naive Bayes Classifier.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Make predictions and evaluate accuracy.
name1 = "David"
predicted_gender1 = classifier.classify(gender_features(name1))
print(f"Predicted gender for '{name1}': {predicted_gender1}")
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Classifier Accuracy: {accuracy:.4f}")

Example feature set: ({'last_letter': 'e'}, 'female')
Predicted gender for 'David': male
Classifier Accuracy: 0.7650
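
Because random.shuffle is called without a fixed seed, the example feature set and the accuracy printed above can change from run to run. The sketch below shows one way to make the split repeatable and to compute a majority-class baseline to compare the accuracy against. The helpers split_dataset and majority_baseline_accuracy, the seed value, and the toy data are all assumptions for illustration; they are not part of the original notebook or of NLTK.

```python
import random
from collections import Counter


def split_dataset(items, test_size, seed=0):
    """Shuffle a copy of items with a seeded RNG and split it.

    Returns (train_set, test_set); the first test_size shuffled items
    form the test set, mirroring the slicing used in the cell above.
    """
    rng = random.Random(seed)
    shuffled = list(items)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def majority_baseline_accuracy(train_set, test_set):
    """Accuracy of always predicting the most common training label."""
    if not train_set or not test_set:
        raise ValueError("train_set and test_set must be non-empty")
    label_freq = Counter(label for _, label in train_set)
    majority_label, _ = label_freq.most_common(1)[0]
    hits = sum(1 for _, label in test_set if label == majority_label)
    return hits / len(test_set)


# Hypothetical toy feature sets with the same shape as featuresets.
toy_featuresets = (
    [({"last_letter": "a"}, "female")] * 6
    + [({"last_letter": "d"}, "male")] * 4
)

train_toy, test_toy = split_dataset(toy_featuresets, test_size=3, seed=42)
print(f"Train size: {len(train_toy)}, test size: {len(test_toy)}")
print(f"Majority baseline: {majority_baseline_accuracy(train_toy, test_toy):.4f}")
```

A classifier is only informative to the extent that it beats this baseline, so reporting both numbers side by side gives the accuracy figure some context.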
