<a href="https://colab.research.google.com/github/Heleinef/Data-Science-Master_Heleine/blob/main/Project3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions:
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

# Split the data


In [1]:
import nltk
from nltk.corpus import names
import random

nltk.download('names')

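# Load the names corpus as (name, label) pairs.
# A separate variable keeps the imported `names` corpus reader intact,
# so this cell can be re-run without an AttributeError.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Shuffle so each subset contains a mix of both labels
random.shuffle(labeled_names)

# Split the data: 500 for test, 500 for dev-test, the remainder for training
train_set = labeled_names[1000:]
dev_test_set = labeled_names[500:1000]
test_set = labeled_names[:500]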


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
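
Because the shuffle above is unseeded, every run produces a different split and slightly different accuracies. The following is a minimal sketch of a reproducible three-way split, not the notebook's own method: the helper name `three_way_split`, the seed value, and the placeholder items are all assumptions made for illustration, and the item count of 7900 simply matches the 500 + 500 + 6900 figures quoted in the instructions.

```python
import random

def three_way_split(items, test_size=500, dev_size=500, seed=0):
    """Shuffle a copy of items with a fixed seed, then split it three ways."""
    rng = random.Random(seed)   # private RNG, so the global random state is untouched
    shuffled = list(items)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:test_size]
    dev_test = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev_test, test

# Placeholder data standing in for the labeled names (hypothetical)
items = [f"name{i}" for i in range(7900)]
train, dev_test, test = three_way_split(items)
print(len(train), len(dev_test), len(test))  # 6900 500 500
```

Using a private `random.Random` instance rather than `random.seed` keeps the split stable even if other cells draw random numbers in between.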


# Feature Extraction

In [2]:
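def gender_features(word):
    """Map a name to the feature dictionary used by the classifiers."""
    return {
        'last_letter': word[-1],
        'last_two_letters': word[-2:],
        'first_letter': word[0],
        'length': len(word),
        # Lowercase first so a capitalized initial vowel (the 'A' in 'Anna') is counted
        'vowel_count': sum(1 for letter in word.lower() if letter in 'aeiou')
    }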
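
The exercise asks for incremental improvements checked against the dev-test set. One possible next step is sketched below, under the assumption that longer suffixes and a vowel-ending flag carry useful signal. The function name `gender_features_v2` and the added feature names are hypothetical, and nothing here has been evaluated against the corpus.

```python
VOWELS = set('aeiou')

def gender_features_v2(word):
    """Hypothetical extended feature extractor; feature names are illustrative."""
    lower = word.lower()  # normalize case so 'A' and 'a' map to the same feature value
    return {
        'last_letter': lower[-1],
        'last_two_letters': lower[-2:],
        'last_three_letters': lower[-3:],
        'first_letter': lower[0],
        'length': len(lower),
        'vowel_count': sum(1 for ch in lower if ch in VOWELS),
        'ends_in_vowel': lower[-1] in VOWELS,
    }

print(gender_features_v2('Anna'))
```

Whether such a change helps is an empirical question: the intended workflow is to retrain, compare dev-test accuracy with the previous feature set, and keep the change only if it improves.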


# Model Training

In [3]:
from nltk.classify import apply_features
from nltk.classify import NaiveBayesClassifier
from nltk.classify import DecisionTreeClassifier
from nltk.classify import MaxentClassifier

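# apply_features builds the (features, label) pairs lazily instead of
# storing every feature dictionary in memory.
# These names are rebound to featurized data, so re-run the split cell
# before re-running this one.
train_set = apply_features(gender_features, train_set)
dev_test_set = apply_features(gender_features, dev_test_set)
test_set = apply_features(gender_features, test_set)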

# Naive Bayes Classifier
nb_classifier = NaiveBayesClassifier.train(train_set)

# Decision Tree Classifier
dt_classifier = DecisionTreeClassifier.train(train_set)

# Maximum Entropy Classifier
me_classifier = MaxentClassifier.train(train_set, max_iter=10)


  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.44557        0.779
             3          -0.38528        0.804
             4          -0.35610        0.808
             5          -0.33946        0.809
             6          -0.32886        0.811
             7          -0.32158        0.813
             8          -0.31631        0.814
             9          -0.31231        0.815
         Final          -0.30918        0.817
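
The first row of the training log can be checked by hand. A log likelihood of -0.69315 equals ln(0.5), which is what a two-label model produces if it starts by assigning probability 0.5 to each label. That starting point is an assumption about the trainer's initial weights, not something the log itself states.

```python
import math

# Log likelihood of a model that gives each of two labels probability 0.5
baseline = math.log(0.5)
print(round(baseline, 5))  # -0.69315
```

Later iterations move toward zero as the model assigns more probability to the correct labels.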


# Evaluation

In [None]:
from nltk.classify.util import accuracy

print("Naive Bayes Accuracy on Dev-Test Set:", accuracy(nb_classifier, dev_test_set))
print("Decision Tree Accuracy on Dev-Test Set:", accuracy(dt_classifier, dev_test_set))
print("Max Entropy Accuracy on Dev-Test Set:", accuracy(me_classifier, dev_test_set))
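
Accuracy alone does not show which features to change next. A common way to use the dev-test set is to list the names the classifier gets wrong and look for patterns. The sketch below is a rough illustration of that idea: `collect_errors` and the toy feature and classifier functions are hypothetical stand-ins so the block runs without NLTK.

```python
def collect_errors(classify, feature_func, labeled_names):
    """Return sorted (true_label, guess, name) tuples for misclassified names."""
    errors = []
    for name, label in labeled_names:
        guess = classify(feature_func(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Toy stand-ins (hypothetical), used only to make the sketch self-contained
def toy_features(name):
    return {'last_letter': name[-1].lower()}

def toy_classify(features):
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

toy_dev_names = [('Anna', 'female'), ('John', 'male'),
                 ('Luca', 'male'), ('Megan', 'female')]

for label, guess, name in collect_errors(toy_classify, toy_features, toy_dev_names):
    print(f"correct={label:<8} guess={guess:<8} name={name}")
```

With the notebook's objects, `nb_classifier.classify` and `gender_features` could be passed in place of the toy functions, provided the raw `(name, label)` pairs for the dev-test set are kept in a separate variable before `apply_features` is called.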


# Final evaluation on test set

In [5]:
from nltk.classify.util import accuracy # Import the accuracy function

print("Naive Bayes Accuracy on Test Set:", accuracy(nb_classifier, test_set))
print("Decision Tree Accuracy on Test Set:", accuracy(dt_classifier, test_set))
print("Max Entropy Accuracy on Test Set:", accuracy(me_classifier, test_set))

Naive Bayes Accuracy on Test Set: 0.782
Decision Tree Accuracy on Test Set: 0.746
Max Entropy Accuracy on Test Set: 0.794


# Analysis

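**Performance Comparison:** On the test set, the Maximum Entropy classifier scored highest (0.794), followed by Naive Bayes (0.782) and the Decision Tree (0.746). The dev-test cell has no recorded output in this run, so the dev-test figures needed for a side-by-side comparison are not shown here.

**Expectation:** When features are tuned repeatedly against the dev-test set, test-set accuracy is typically slightly lower than dev-test accuracy, because the tuning choices partly overfit the dev-test sample. As shown, this notebook uses a single feature set with no tuning rounds, so any gap between the two sets should mostly reflect sampling variation between two 500-name samples.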
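
To judge whether a gap between dev-test and test accuracy is meaningful, it helps to estimate how much an accuracy measured on 500 names can vary by chance. The sketch below uses the normal approximation to the binomial, a simplifying assumption. The helper name `accuracy_interval` is hypothetical, and the accuracies are the test-set values reported above.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n items."""
    se = math.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc - z * se, acc + z * se

reported = [('Naive Bayes', 0.782), ('Decision Tree', 0.746), ('Max Entropy', 0.794)]
for label, acc in reported:
    low, high = accuracy_interval(acc, 500)
    print(f"{label}: {acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

At these accuracies and with 500 names, the interval half-width is roughly 0.035 to 0.04, so differences of a point or two between the dev-test and test sets are within the range expected from sampling alone.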

# Discussion
Test-set accuracy should be close to dev-test accuracy. A slightly lower test score is expected when the classifier's features have been tuned against the dev-test set, since those choices partly fit that particular sample. This is why the exercise keeps separate dev-test and test sets: the dev-test set guides tuning, and the untouched test set gives an unbiased final estimate.