<a href="https://colab.research.google.com/github/IgnatiusEzeani/NLP-Lecture/blob/main/Week_18_Lecture_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Gender Identification

**Credit**: The example code below was taken from [Chapter 6 of the NLTK book](https://www.nltk.org/book/ch06.html).

NLTK has a wordlist corpus, `Names`, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

###**Import `nltk` and download the `names` corpus**

In [None]:
import nltk
import random
nltk.download('names')
names = nltk.corpus.names 

###**Names in both the male and female lists**

In [None]:
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')
female_set = set(female_names)  # set membership is much faster than scanning a list
male_female = [w for w in male_names if w in female_set]
print(len(male_female))
for name in male_female[:20]:
  print(name)


###**Distribution of last letters**
The [NLTK book](https://www.nltk.org/book/ch02.html#sec-lexical-resources) suggests that male and female names have some distinctive characteristics. Names ending in `a`, `e` and `i` are likely to be female, while names ending in `k`, `o`, `r`, `s` and `t` are likely to be male. Let's see...

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()
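
The plot can also be checked numerically. The sketch below counts final letters with `collections.Counter` and prints each letter's share per gender. The two short name lists are made-up samples standing in for `names.words('male.txt')` and `names.words('female.txt')`, and `last_letter_shares` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# Made-up samples; in the notebook, pass names.words('male.txt') etc. instead.
sample_male = ["Mark", "Otto", "Peter", "Chris", "Robert", "Jack"]
sample_female = ["Anna", "Marie", "Naomi", "Julia", "Sophie", "Carol"]

def last_letter_shares(name_list):
    """Return {letter: share of names ending in that letter}."""
    counts = Counter(name[-1].lower() for name in name_list)
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}

male_shares = last_letter_shares(sample_male)
female_shares = last_letter_shares(sample_female)

# One row per letter: share among male names, then share among female names.
for letter in sorted(set(male_shares) | set(female_shares)):
    print(letter,
          round(male_shares.get(letter, 0.0), 2),
          round(female_shares.get(letter, 0.0), 2))
```

Shares rather than raw counts are used because the corpus contains more female than male names, so raw counts alone would overstate female endings.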

###**Feature extractor functions**
Let's build a classifier to model these differences more precisely. The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name.

Each of the following feature extractor functions builds a dictionary containing relevant information about a given name.

In [None]:
# feature extractor 1
def gender_features(word):
  return {'last_letter': word[-1]}

# feature extractor 2
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

# feature extractor 3
def gender_features3(word):
  return {'suffix1': word[-1:], 'suffix2': word[-2:]}
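
To see what a classifier actually receives, it helps to call an extractor on a single name. The snippet below restates `gender_features` and `gender_features3` from the cell above so that it runs on its own; `'Shrek'` is just an illustrative input.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def gender_features3(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

print(gender_features('Shrek'))   # {'last_letter': 'k'}
print(gender_features3('Shrek'))  # {'suffix1': 'k', 'suffix2': 'ek'}
```

Each key is a feature name and each value is that feature's value for the given name. By the same logic, `gender_features2` returns 54 entries: the first letter, the last letter, and a count and a presence flag for each of the 26 letters.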

###**Compiling the training instances**

In [None]:
# Building the training instances
labeled_names = ([(name, 'male') for name in names.words('male.txt')] 
                 + [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
# len(labeled_names)

###**Train-DevTest-Test Split**

In [None]:
# train-devtest-test split
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]
print(len(train_names), len(devtest_names), len(test_names))
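
The slicing above carves three non-overlapping pieces out of one shuffled list: the first 500 items for the final test, the next 1,000 for development testing, and the rest for training. A minimal sketch of the same idea on a toy list, assuming a hypothetical helper `three_way_split` (not an NLTK function), makes it easy to confirm that the pieces are disjoint and cover every item:

```python
def three_way_split(items, n_test, n_devtest):
    """Split an already-shuffled list into (train, devtest, test) by position."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

toy_items = list(range(20))  # stand-in for labeled_names
train, devtest, test = three_way_split(toy_items, n_test=3, n_devtest=5)

print(len(train), len(devtest), len(test))  # 12 5 3
# The three pieces together contain every item exactly once.
assert sorted(train + devtest + test) == toy_items
```

Keeping the test portion untouched until the very end means that decisions made while inspecting dev-test errors cannot inflate the final accuracy figure.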

###**Extracting the features**

In [None]:
# Extracting the features
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

###**Training and Testing the Classifier**

In [None]:
# Training the classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Apply the classifier to the development test (dev-test) set
print(nltk.classify.accuracy(classifier, devtest_set))
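
`nltk.classify.accuracy` reports the fraction of labelled examples for which the classifier's guess matches the gold label. The sketch below reproduces that calculation in plain Python. Here `rule_classifier` is a hypothetical stand-in for `classifier.classify`, `accuracy` is a local re-implementation for illustration, and the four labelled examples are made up.

```python
def rule_classifier(features):
    """Toy stand-in for classifier.classify: guess from the last letter only."""
    return 'female' if features['last_letter'] in 'aei' else 'male'

def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs where classify(features) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

toy_devtest = [
    ({'last_letter': 'a'}, 'female'),  # guessed female: correct
    ({'last_letter': 'k'}, 'male'),    # guessed male: correct
    ({'last_letter': 'n'}, 'female'),  # guessed male: wrong
    ({'last_letter': 'e'}, 'male'),    # guessed female: wrong
]
print(accuracy(rule_classifier, toy_devtest))  # 0.5
```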

###**Building the Error List**

In [None]:
# error analysis
errors = []
for (name, tag) in devtest_names:
  guess = classifier.classify(gender_features(name))
  if guess != tag:
    errors.append((tag, guess, name))

###**Show errors**

In [None]:
# Error list
print("Errors:", len(errors))
for (tag, guess, name) in sorted(errors[:20]):
  print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
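
Scanning a long error list by eye is slow, so one option is to tally the misclassified names by their final two letters and look for suffixes that recur. The sketch below assumes a list of `(tag, guess, name)` tuples in the same shape as `errors`; the entries in `toy_errors` are invented for illustration and are not real classifier output.

```python
from collections import Counter

# Invented examples in the same (tag, guess, name) shape as `errors`.
toy_errors = [
    ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Kathryn'),
    ('female', 'male', 'Jocelyn'),
    ('male', 'female', 'Mitch'),
    ('male', 'female', 'Butch'),
]

# Count errors per (correct tag, two-letter suffix).
suffix_counts = Counter(
    (tag, name[-2:].lower()) for tag, _guess, name in toy_errors
)

for (tag, suffix), count in suffix_counts.most_common():
    print('correct={:<8} suffix={:<4} errors={}'.format(tag, suffix, count))
```

A suffix that accounts for many errors is a candidate for a new feature, which is the idea behind the `suffix2` entry in `gender_features3`.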

###**Most informative features**

In [None]:
# Most informative features
classifier.show_most_informative_features(10)

###**Classifying other names**

In [None]:
print(classifier.classify(gender_features('Neo')))
# Expected output: male
print(classifier.classify(gender_features('Trinity')))
# Expected output: female

###**Classifying your name**

In [None]:
## Uncomment and modify the line below to classify your name with your best classifier
# print(classifier.classify(gender_features('<your name>')))  # replace <your name> with your own name

###**Using other extractors**

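You can also try the other two feature extractor functions, `gender_features2()` and `gender_features3()`: re-extract the feature sets with the new function, retrain the classifier, and compare accuracy on the dev-test set.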

1. Which performed better and why?

2. Can you think of any other way to modify the feature extractor function? Apply it and test your result. 

In [None]:
## Your code here