# Supervised Machine Learning: Showcase \#1

## Name Gender Classifier

▪ Goal: build a classifier that automatically labels a given name as either male or female.

In [None]:
# Requires the NLTK "names" corpus; run nltk.download("names") once if it is missing
from nltk.corpus import names

male_names = names.words("male.txt")
female_names = names.words("female.txt")

<span style = "color:red">
    
**Exercise \#1: Write Python code that reads all names from \'female.txt\' into a list named \'female_names\' using readlines().** 

</span>

In [3]:
f1 = open("female.txt", "r")
female_names = f1.readlines()

<span style = "color:red">
    
**Exercise \#2: Write Python code that prints the first 10 female names in \'female_names\'.** 

</span>

In [4]:
print(female_names[:10])

['Abagael\n', 'Abagail\n', 'Abbe\n', 'Abbey\n', 'Abbi\n', 'Abbie\n', 'Abby\n', 'Abigael\n', 'Abigail\n', 'Abigale\n']


In [5]:
f1.tell()

35576

<span style = "color:red">
    
**Exercise \#3: Rewrite your answer to Exercise \#1 so that it reads all names from \'female.txt\' using read() and splitlines().** 

</span>
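
▪ A minimal sketch of one possible approach: `read()` returns the whole file as a single string, and `splitlines()` breaks it at line boundaries without keeping the `\n` characters that appear in the `readlines()` output. The helper name `read_names` and the three sample names are assumptions made for illustration; the sketch writes a small temporary file so that it runs without \'female.txt\'.

```python
import os
import tempfile


def read_names(path):
    """Return the names in a text file, one per line, without trailing newlines."""
    # The with-block closes the file automatically once reading is done.
    with open(path, "r") as f:
        return f.read().splitlines()


# Stand-in for 'female.txt' so the sketch is self-contained (hypothetical sample).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Abagael\nAbagail\nAbbe\n")
    sample_path = tmp.name

sample_names = read_names(sample_path)
os.remove(sample_path)

print(sample_names)
```

Calling the same helper with a different path should work for any one-name-per-line file.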

<span style = "color:red">
    
**Exercise \#4: Write Python code that reads all names from \'male.txt\' into a list named \'male_names\'.** 

</span>

<span style = "color:red">
    
**Exercise \#5: Write Python code that creates a labelled dataset named \'names_list\' by combining \'female_names\' and \'male_names\', using (name, gender) tuples and list comprehensions.** 

</span>
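
▪ A minimal sketch of the labelling step, assuming the labels are the strings `"female"` and `"male"` (the exercise does not fix them). The sample lists and the name `labelled_sample` are hypothetical stand-ins for the full name lists.

```python
# Tiny stand-in lists (hypothetical samples, not the full corpus).
female_sample = ["Abagael", "Abbey"]
male_sample = ["Aaron", "Abbott"]

# Each item becomes a (name, label) 2-tuple; the two comprehensions are concatenated.
labelled_sample = ([(name, "female") for name in female_sample]
                   + [(name, "male") for name in male_sample])

print(labelled_sample)
```

Keeping each item as a tuple matters later, because the feature extraction step unpacks every entry as `(name, gender)`.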

<span style = "color:red">
    
**Exercise \#6: Write Python code that randomly shuffles all names in \'names_list\'.** 

</span>
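
▪ A minimal sketch using `random.shuffle()` from the standard library on a hypothetical four-item list. Shuffling matters when the labelled list holds one gender followed by the other, because a later slice-based train/test split would otherwise put mostly one gender in the test set. The fixed seed is an assumption added only to make the run repeatable.

```python
import random

# Hypothetical labelled sample in (name, gender) form.
labelled_sample = [("Abagael", "female"), ("Abbey", "female"),
                   ("Aaron", "male"), ("Abbott", "male")]

# Fixing the seed is optional; it only makes the shuffle repeatable between runs.
random.seed(42)

# random.shuffle() reorders the list in place and returns None,
# so a copy is shuffled here to keep the original order for comparison.
shuffled = list(labelled_sample)
random.shuffle(shuffled)

print(shuffled)
```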

### Feature Extraction

▪ **extract_gender_features()** maps a name to a dictionary of prefix and suffix features for use in supervised machine learning.

In [None]:
def extract_gender_features(name):
    
    # Strip surrounding whitespace (e.g. the '\n' kept by readlines()) and convert to lowercase
    name = name.strip().lower()
    
    # Create an empty dictionary
    features = {}
    
    # Extract suffixes of length 1 to 3; names too short for a suffix fall back to their first letter
    features["suffix1"] = name[-1:]
    features["suffix2"] = name[-2:] if len(name) > 1 else name[0]
    features["suffix3"] = name[-3:] if len(name) > 2 else name[0]
    #features["suffix4"] = name[-4:] if len(name) > 3 else name[0]
    #features["suffix5"] = name[-5:] if len(name) > 4 else name[0]
    #features["suffix6"] = name[-6:] if len(name) > 5 else name[0]
    
    # Extract prefixes of length 1 to 3; names too short for a prefix fall back to their first letter
    features["prefix1"] = name[:1]
    features["prefix2"] = name[:2] if len(name) > 1 else name[0]
    features["prefix3"] = name[:3] if len(name) > 2 else name[0]
    #features["prefix4"] = name[:4] if len(name) > 3 else name[0]
    #features["prefix5"] = name[:5] if len(name) > 4 else name[0]
    #features["wordLen"] = len(name)
   
    return features

data = [(extract_gender_features(name), gender) for (name, gender) in names_list]

In [None]:
print(names_list[0])
print(data[0])

### Training and Testing Data

▪ Split names into 80% training data and 20% testing data.

In [None]:
# Compute the index at which to split the data: the first 80% is used for training
train_count = int(.8 * len(data))
train_count

In [None]:
# Use the first 80% of the dataset (up to train_count) as the training data
train_data = data[:train_count]

# Make the remaining dataset as the test data
test_data = data[train_count:]
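
▪ The slicing above can be checked on a toy list. This is a minimal sketch with hypothetical names (`toy_data`, `toy_train`, `toy_test`); it only illustrates that the two slices do not overlap and together cover the whole dataset.

```python
# Toy stand-in for `data`: ten items, so the 80/20 split is easy to check by eye.
toy_data = list(range(10))

# int() truncates, so the training share is rounded down when the product is not whole.
toy_train_count = int(.8 * len(toy_data))

# The same index ends one slice and starts the other, so no item is lost or duplicated.
toy_train = toy_data[:toy_train_count]
toy_test = toy_data[toy_train_count:]

print(len(toy_train), len(toy_test))
```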

### Model Fitting with Naive Bayes

In [None]:
import nltk

# Train Naive Bayes classifier
bayes = nltk.NaiveBayesClassifier.train(train_data)

### Model Evaluation with Testing Data

In [None]:
# Use classify() to predict each test name's gender, then record whether it matches the label
predicted = [bayes.classify(features) for features, _ in test_data]
prediction = [(pred, pred == label) for pred, (_, label) in zip(predicted, test_data)]

In [None]:
# Evaluate the performance in terms of accuracy
print("Test data accuracy =", nltk.classify.accuracy(bayes, test_data))
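
▪ Accuracy here is the fraction of test items whose predicted label equals the true label. A minimal sketch of that calculation on hypothetical pairs shaped like the entries of `prediction`:

```python
# Hypothetical (predicted_label, is_correct) pairs in the same shape as `prediction`.
toy_prediction = [("female", True), ("male", False), ("male", True), ("female", True)]

# True counts as 1 and False as 0, so sum() gives the number of correct predictions.
correct = sum(is_correct for _, is_correct in toy_prediction)
toy_accuracy = correct / len(toy_prediction)

print("Toy accuracy =", toy_accuracy)
```

Applied to the real `prediction` list, the same calculation should agree with the value reported for the test data.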

### Prediction vs. Actual Gender

In [None]:
names_test = names_list[train_count:]

# Concatenate each (name, gender) tuple with its (prediction, true/false) tuple
result = [name_gender + pred for name_gender, pred in zip(names_test, prediction)]

In [None]:
import pandas as pd

df = pd.DataFrame(result, columns = ['Name', 'Gender', 'Prediction', 'T/F'])
df[:20]

<span style = "color:red">
    
**Exercise \#7: Write Python code that prints the first 20 wrong predictions.** 

</span>
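
▪ A minimal sketch of one way to do this in plain Python, using hypothetical rows shaped like the entries of `result`: (name, gender, prediction, true/false). The names and labels below are made up for illustration.

```python
# Hypothetical rows in the same (name, gender, prediction, is_correct) shape as `result`.
toy_result = [
    ("Abbey", "female", "female", True),
    ("Aaron", "male", "female", False),
    ("Robin", "female", "male", False),
    ("Abbott", "male", "male", True),
]

# Keep rows whose last field is False, then take at most the first 20 of them.
wrong = [row for row in toy_result if not row[3]][:20]

for row in wrong:
    print(row)
```

With the DataFrame, a boolean mask such as `df[~df['T/F']].head(20)` should select the same rows, assuming the `T/F` column holds booleans.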

### Most Informative Features

In [None]:
# Show the 25 most informative features that our model used
bayes.show_most_informative_features(25)

<span style = "color:red">
    
**Exercise \#8: Write Python code that demonstrates the use of the predictive model.** 

</span>

<span style = "color:red">
    
**Exercise \#9: Use \'extract_gender_features()\' to experiment with the effect of different types of features.** 

</span>