<a href="https://colab.research.google.com/github/GitableGabe/DATA_620_Collab/blob/main/Project3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 3 / Data 620

**Team members:** Heleine, Gabriel, Kossi, Victor.


# Instructions:
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set?
Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

# 1. Split the corpus/data
We start by splitting the corpus into training, dev-test, and test sets.


In [18]:
import nltk
from nltk.corpus import names
import random

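# Download the Names Corpus (a no-op if it is already present)
nltk.download('names')

# Label every name; a separate variable keeps the `names` corpus reader from being shadowed
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Shuffle so that each subset contains a mix of male and female names
random.shuffle(labeled_names)

# Split the corpus: 500 test, 500 dev-test, remainder for training
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]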


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
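
The shuffle above is unseeded, so every run produces different subsets and slightly different accuracies. A minimal sketch of a reproducible split follows. The helper name `split_corpus`, the seed value `620`, and the synthetic stand-in data are assumptions made for illustration and are not part of the original notebook.

```python
import random

def split_corpus(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of labeled_names with a fixed seed and split it three ways."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, so the global state is unaffected
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Synthetic stand-in data so the sketch runs without downloading the corpus
demo = [(f"name{i}", "male" if i % 2 else "female") for i in range(2000)]
train, devtest, test = split_corpus(demo)
print(len(train), len(devtest), len(test))
```

In the notebook, the same function could be called on the list of (name, gender) pairs built above; the three subsets would then be identical from run to run.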


# 2. Feature Extraction
As a second step, we define a feature extractor, starting with a few simple features that we then improve incrementally.

In [28]:
def gender_features(name):
    # Lowercase once so that capitalized vowels (e.g. the 'A' in 'Anna') are counted
    name = name.lower()
    return {
        'last_letter': name[-1],
        'last_two_letters': name[-2:],
        'first_letter': name[0],
        'first_two_letters': name[:2],
        'name_length': len(name),
        'vowel_count': sum(1 for char in name if char in 'aeiou')
    }

# Example usage
print(gender_features('Shrek'))


{'last_letter': 'k', 'last_two_letters': 'ek', 'first_letter': 's', 'first_two_letters': 'sh', 'name_length': 5, 'vowel_count': 1}


# 3. Train the Classifiers
As a third step, we train three classifiers in turn: Naive Bayes, Decision Tree, and Maxent.

In [49]:
import nltk
from nltk.classify import apply_features

# Prepare the training, dev-test, and test sets
# (apply_features builds the feature sets lazily to save memory)
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)
test_set = apply_features(gender_features, test_names)

# Train Naive Bayes classifier
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
# Train Decision Tree classifier
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
# Train Maxent classifier
me_classifier = nltk.MaxentClassifier.train(train_set, max_iter=10)


  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.45241        0.785
             3          -0.37517        0.849
             4          -0.33137        0.865
             5          -0.30320        0.874
             6          -0.28335        0.879
             7          -0.26843        0.884
             8          -0.25668        0.889
             9          -0.24708        0.892
         Final          -0.23904        0.895


# 4. Initial Evaluation

In [38]:


# Train and evaluate Naive Bayes classifier
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes Classifier accuracy on dev-test set:", nltk.classify.accuracy(nb_classifier, devtest_set))
nb_classifier.show_most_informative_features(10)
print("Naive Bayes Classifier accuracy on test set:", nltk.classify.accuracy(nb_classifier, test_set))

# Train and evaluate Decision Tree classifier
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
print("Decision Tree Classifier accuracy on dev-test set:", nltk.classify.accuracy(dt_classifier, devtest_set))
print("Decision Tree Classifier accuracy on test set:", nltk.classify.accuracy(dt_classifier, test_set))

# Train and evaluate Maxent classifier
me_classifier = nltk.MaxentClassifier.train(train_set, max_iter=10)
print("Maxent Classifier accuracy on dev-test set:", nltk.classify.accuracy(me_classifier, devtest_set))
me_classifier.show_most_informative_features(10)
print("Maxent Classifier accuracy on test set:", nltk.classify.accuracy(me_classifier, test_set))


Naive Bayes Classifier accuracy on dev-test set: 0.776
Most Informative Features
        last_two_letters = 'na'           female : male   =     95.1 : 1.0
        last_two_letters = 'la'           female : male   =     71.6 : 1.0
        last_two_letters = 'us'             male : female =     63.3 : 1.0
             last_letter = 'k'              male : female =     40.7 : 1.0
        last_two_letters = 'ia'           female : male   =     36.0 : 1.0
        last_two_letters = 'sa'           female : male   =     35.1 : 1.0
             last_letter = 'a'            female : male   =     33.4 : 1.0
        last_two_letters = 'ta'           female : male   =     31.0 : 1.0
        last_two_letters = 'do'             male : female =     25.2 : 1.0
        last_two_letters = 'io'             male : female =     25.2 : 1.0
Naive Bayes Classifier accuracy on test set: 0.8
Decision Tree Classifier accuracy on dev-test set: 0.736
Decision Tree Classifier accuracy on test set: 0.726
  ==> Trai

# 5. Incremental Improvements:
Based on the initial evaluation, we can make several incremental improvements to the feature extraction function:

Add features for the first three letters;

Add features for the last three letters;

Include the number of consonants;

Include the ratio of vowels to consonants.
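
Improvements like the ones listed above are easier to choose after looking at which dev-test names the classifier gets wrong. The sketch below collects those errors. The helper `collect_errors` and the toy classifier and data are hypothetical stand-ins so that the example runs without NLTK.

```python
def collect_errors(classify, featurize, labeled_names):
    """Return (true_label, guess, name) for every misclassified name, sorted."""
    errors = []
    for name, label in labeled_names:
        guess = classify(featurize(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Toy stand-ins (assumptions for illustration only)
def toy_features(name):
    return {"last_letter": name[-1].lower()}

def toy_classify(features):
    return "female" if features["last_letter"] in "aei" else "male"

toy_devtest = [("Anna", "female"), ("Mark", "male"),
               ("Luca", "male"), ("Karen", "female")]

for label, guess, name in collect_errors(toy_classify, toy_features, toy_devtest):
    print(f"correct={label:<8} guess={guess:<8} name={name}")
```

With the notebook's objects, the call would presumably be `collect_errors(nb_classifier.classify, gender_features, devtest_names)`. Only the dev-test set should be inspected this way, so that the test set stays unseen until the final evaluation.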

# 6. Final evaluation on test set (after improvements)

In [53]:
def gender_features(name):
    # Lowercase once so capitalized vowels are counted as vowels, not consonants
    name = name.lower()
    vowel_count = sum(1 for char in name if char in 'aeiou')
    # Any character that is neither a vowel nor a space counts as a consonant here
    consonant_count = sum(1 for char in name if char not in 'aeiou ')
    return {
        'last_letter': name[-1],
        'last_two_letters': name[-2:],
        'first_letter': name[0],
        'first_two_letters': name[:2],
        'name_length': len(name),
        'vowel_count': vowel_count,
        'first_three_letters': name[:3],
        'last_three_letters': name[-3:],
        'consonant_count': consonant_count,
        # Adding 1 to the denominator avoids division by zero
        'vowel_to_consonant_ratio': vowel_count / (consonant_count + 1)
    }

# Re-prepare the datasets
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

# Train and evaluate Naive Bayes classifier
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Improved Naive Bayes Classifier accuracy on dev-test set:", nltk.classify.accuracy(nb_classifier, devtest_set))
nb_classifier.show_most_informative_features(10)
print("Improved Naive Bayes Classifier accuracy on test set:", nltk.classify.accuracy(nb_classifier, test_set))

# Train and evaluate Decision Tree classifier
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
print("Improved Decision Tree Classifier accuracy on dev-test set:", nltk.classify.accuracy(dt_classifier, devtest_set))
print("Improved Decision Tree Classifier accuracy on test set:", nltk.classify.accuracy(dt_classifier, test_set))

# Train and evaluate Maxent classifier
me_classifier = nltk.MaxentClassifier.train(train_set, max_iter=10)
print("Improved Maxent Classifier accuracy on dev-test set:", nltk.classify.accuracy(me_classifier, devtest_set))
me_classifier.show_most_informative_features(10)
print("Improved Maxent Classifier accuracy on test set:", nltk.classify.accuracy(me_classifier, test_set))


Improved Naive Bayes Classifier accuracy on dev-test set: 0.808
Most Informative Features
        last_two_letters = 'na'           female : male   =     94.7 : 1.0
        last_two_letters = 'la'           female : male   =     71.0 : 1.0
             last_letter = 'k'              male : female =     42.6 : 1.0
        last_two_letters = 'ia'           female : male   =     38.4 : 1.0
        last_two_letters = 'sa'           female : male   =     34.6 : 1.0
             last_letter = 'a'            female : male   =     33.9 : 1.0
        last_two_letters = 'us'             male : female =     28.0 : 1.0
        last_two_letters = 'ra'           female : male   =     25.6 : 1.0
        last_two_letters = 'ta'           female : male   =     24.9 : 1.0
        last_two_letters = 'rd'             male : female =     24.0 : 1.0
Improved Naive Bayes Classifier accuracy on test set: 0.834
Improved Decision Tree Classifier accuracy on dev-test set: 0.724
Improved Decision Tree Classifier 

A few noticeable changes. The figures below appear to come from a separate run of the notebook; because the corpus is shuffled without a fixed seed, they differ slightly from the cell output above.

**Improved Naive Bayes Classifier:**

Dev-test set accuracy: 0.810
Test set accuracy: 0.828

Adding features such as the last three letters and the vowel and consonant counts improved performance.

**Improved Decision Tree Classifier:**

Dev-test set accuracy: 0.738
Test set accuracy: 0.754

**Slight improvement in accuracy with enhanced feature extraction.**

**Improved Maxent Classifier:**

Dev-test set accuracy: 0.808
Test set accuracy: 0.836

**Improvement of about four percentage points on both sets with the additional features.**


# 7. Analysis and comparison:

**Naive Bayes Classifier:**

Initial dev-test accuracy: 0.776 → Improved dev-test accuracy: 0.810
Initial test accuracy: 0.800 → Improved test accuracy: 0.828

The Naive Bayes classifier showed a notable improvement in both the dev-test and test set accuracies after adding more features. This indicates that the additional features provided more discriminative power for the classifier.

**Decision Tree Classifier:**

Initial dev-test accuracy: 0.736 → Improved dev-test accuracy: 0.738
Initial test accuracy: 0.726 → Improved test accuracy: 0.754

The Decision Tree classifier was essentially unchanged on the dev-test set (+0.002, a single name out of 500) and improved modestly on the test set (+0.028). Decision trees are prone to overfitting, so the additional features may have provided some benefit without the tree being able to exploit them fully.

**Maxent Classifier:**

Initial dev-test accuracy: 0.770 → Improved dev-test accuracy: 0.808
Initial test accuracy: 0.796 → Improved test accuracy: 0.836

The Maxent classifier showed significant improvement, indicating that it could leverage the additional features effectively to enhance prediction performance.
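
The gains discussed above can be tabulated directly from the reported accuracies. The sketch below uses only the figures quoted in this section; the dictionary layout and the helper name `gains` are illustrative choices.

```python
# Accuracies as reported in this section: (initial, improved)
reported = {
    "Naive Bayes":   {"dev-test": (0.776, 0.810), "test": (0.800, 0.828)},
    "Decision Tree": {"dev-test": (0.736, 0.738), "test": (0.726, 0.754)},
    "Maxent":        {"dev-test": (0.770, 0.808), "test": (0.796, 0.836)},
}

def gains(results):
    """Return improved-minus-initial accuracy per classifier and split."""
    return {
        clf: {split: round(after - before, 3)
              for split, (before, after) in splits.items()}
        for clf, splits in results.items()
    }

for clf, by_split in gains(reported).items():
    print(f"{clf:<14} dev-test {by_split['dev-test']:+.3f}   test {by_split['test']:+.3f}")
```

Laid out this way, the Decision Tree's dev-test gain stands out as much smaller than every other entry.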

**Expected vs. Actual Performance**

Performance on Dev-test vs. Test Set:

The improvements were consistent across both the dev-test and test sets, suggesting that the feature enhancements generalized well to unseen data.
Because the features were tuned against the dev-test set, we would expect dev-test accuracy to be slightly higher than test accuracy. In our results the opposite holds: all three improved classifiers scored higher on the test set (0.828 vs. 0.810, 0.754 vs. 0.738, and 0.836 vs. 0.808). With only 500 names in each set, gaps of two to three percentage points are within what sampling variation alone can produce, so the most we can conclude is that the tuning did not noticeably overfit the dev-test set.
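
One rough way to judge whether a dev-test versus test gap is meaningful is a confidence interval for accuracy measured on 500 names. The sketch below uses the normal approximation to the binomial and assumes the names are independent draws; both are simplifying assumptions, and the helper `accuracy_interval` is not part of the original notebook.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on n items."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Maxent test accuracy reported above, measured on 500 names
low, high = accuracy_interval(0.836, 500)
print(f"{low:.3f} to {high:.3f}")

# Does the Maxent dev-test accuracy fall inside that interval?
print(low < 0.808 < high)
```

Under these assumptions the interval spans several percentage points, which is why small differences between the two 500-name sets should not be over-interpreted.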

**Most informative features after the improvements**

In the Naive Bayes output shown above, all ten of the most informative features are values of the last two letters or the last letter of a name, confirming that name endings are the strongest predictors of gender. None of the newly added features appears in that top-ten list, so their contribution shows up in the accuracy gains rather than in the ranking. Overall, the results demonstrate the effectiveness of feature engineering in enhancing the performance of natural language classifiers.

# 8. Conclusion and Discussion
This exercise demonstrates the importance of having separate dev-test and test sets for tuning and final evaluation.

The enhancements made to the feature extraction process improved the Naive Bayes and Maxent classifiers by roughly three to four percentage points. The Decision Tree classifier improved less, which suggests that it may need different tuning or regularization to make full use of the new features. The results show that careful feature engineering can have a substantial impact on the effectiveness of classifiers in natural language processing tasks.