
## Project 3:
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

In [157]:
import nltk
import pandas as pd
import random
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

In [158]:
nltk.download('names')
from nltk.corpus import names

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


In [159]:
# Filtering Name Set to Remove Unisex Names
males = [(name.lower().strip(), 'male') for name in names.words('male.txt')]
females = [(name.lower().strip(), 'female') for name in names.words('female.txt')]

unisex = set(male_name for male_name, _ in males) & set(female_name for female_name, _ in females)
males = [(name, gender) for name, gender in males if name not in unisex]
females = [(name, gender) for name, gender in females if name not in unisex]

all_names = males + females

In [160]:
all_names_df = pd.DataFrame(all_names, columns=['Name', 'Gender'])
all_names_df

         Name  Gender
0       aamir    male
1       aaron    male
2       abbot    male
3      abbott    male
4       abdel    male
...       ...     ...
7208   zorine  female
7209  zsa zsa  female
7210   zsazsa  female
7211   zulema  female
7212   zuzana  female

[7213 rows x 2 columns]

#### Feature Engineering:

In [161]:
def gender_features(name):
    """Build the feature dictionary for one name (expects a non-empty string)."""
    lowered = name.lower()
    features = {}
    features["firstletter"] = lowered[0]
    features["lastletter"] = lowered[-1]
    features["last_is_vowel"] = lowered[-1] in 'aeiouy'

    # Per-letter features: count, presence, and index of first occurrence
    # (str.find returns -1 when the letter is absent).
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features[f"count({letter})"] = lowered.count(letter)
        features[f"has({letter})"] = letter in lowered
        features[f"first({letter})"] = lowered.find(letter)

    # "suffix2" and "last2" hold the same two-letter suffix.
    features["suffix2"] = lowered[-2:]
    features["last2"] = lowered[-2:]

    # Three-letter suffix; shorter names are left-padded with a space.
    if len(name) >= 3:
        features["last3"] = lowered[-3:]
    else:
        features["last3"] = " " + lowered[-2:]

    features["length"] = len(name)

    return features
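
To make the three per-letter features concrete, here is a minimal sketch that computes them for a single letter of a single name. The helper `letter_features` is a hypothetical name introduced only for illustration; it mirrors one iteration of the loop in `gender_features`. Because `str.find` returns -1 when the letter is absent, `first(a)` is -1 for a name such as "brock".

```python
def letter_features(name, letter):
    """Return the three per-letter features for one letter (illustrative helper)."""
    lowered = name.lower()
    return {
        f"count({letter})": lowered.count(letter),   # how many times the letter occurs
        f"has({letter})": letter in lowered,         # whether the letter occurs at all
        f"first({letter})": lowered.find(letter),    # index of first occurrence, -1 if absent
    }


print(letter_features("tyler", "y"))
print(letter_features("brock", "a"))
```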


In [162]:
all_names_df = all_names_df.join(
    all_names_df['Name'].apply(lambda x: pd.Series(gender_features(x)))
)
all_names_df

         Name  Gender firstletter lastletter  last_is_vowel  count(a)  has(a)  \
0       aamir    male           a          r          False         2    True   
1       aaron    male           a          n          False         2    True   
2       abbot    male           a          t          False         1    True   
3      abbott    male           a          t          False         1    True   
4       abdel    male           a          l          False         1    True   
...       ...     ...         ...        ...            ...       ...     ...   
7208   zorine  female           z          e           True         0   False   
7209  zsa zsa  female           z          a           True         2    True   
7210   zsazsa  female           z          a           True         2    True   
7211   zulema  female           z          a           True         1    True   
7212   zuzana  female           z          a           True         2    True  

In [169]:
# Randomly select indices for test, dev-test, and training sets
test_indices = all_names_df.sample(n=500, random_state=42).index
dev_test_indices = all_names_df.drop(test_indices).sample(n=500, random_state=123).index
train_indices = all_names_df.drop(test_indices).drop(dev_test_indices).index

test_set = all_names_df.loc[test_indices]
dev_test_set = all_names_df.loc[dev_test_indices]
train_set = all_names_df.loc[train_indices]
test_set.head(5)

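| | Name | Gender | firstletter | lastletter | last_is_vowel | count(a) | has(a) | first(a) | count(b) | has(b) | ... | count(y) | has(y) | first(y) | count(z) | has(z) | first(z) | suffix2 | last2 | last3 | length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 308 | brock | male | b | k | False | 0 | False | -1 | 1 | True | ... | 0 | False | -1 | 0 | False | -1 | ck | ck | ock | 5 |
| 381 | chelton | male | c | n | False | 0 | False | -1 | 0 | False | ... | 0 | False | -1 | 0 | False | -1 | on | on | ton | 7 |
| 5716 | marleen | female | m | n | False | 1 | True | 1 | 0 | False | ... | 0 | False | -1 | 0 | False | -1 | en | en | een | 7 |
| 2312 | tyler | male | t | r | False | 0 | False | -1 | 0 | False | ... | 1 | True | 1 | 0 | False | -1 | er | er | ler | 5 |
| 251 | benjamin | male | b | n | False | 1 | True | 4 | 1 | True | ... | 0 | False | -1 | 0 | False | -1 | in | in | min | 8 |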
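
The same three-way split can be sketched without pandas. This is a minimal illustration on placeholder data, not the notebook's code: the `split_names` helper and the synthetic `name_i` entries are assumptions, and the sizes mirror the 500/500/remainder scheme used above.

```python
import random


def split_names(items, n_test=500, n_dev=500, seed=42):
    """Shuffle a copy of items and cut it into test, dev-test, and training lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return test, dev_test, train


# Placeholder data with the same size as the filtered corpus (7213 names).
placeholder = [f"name_{i}" for i in range(7213)]
test, dev_test, train = split_names(placeholder)
print(len(test), len(dev_test), len(train))
```

With 7,213 names left after removing the unisex names, the training set holds 6,213 names rather than the 6,900 mentioned in the assignment.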


#### Method:
As the results below show, the Naive Bayes classifier reaches a higher test accuracy than the Decision Tree: 87.6% versus 78.2%. In brief, I imported the Names Corpus from nltk.corpus and removed the unisex names (those that appear in both the male and the female list), so that no name is counted twice and every remaining name is labeled strictly male or female. This leaves 7,213 names. The feature engineering step then takes a name as input and returns features such as the first letter, the last letter, the length of the name, the two- and three-letter suffixes, and, for each letter of the alphabet, its count, whether the name contains it, and the position of its first occurrence.

In [166]:
test_features = [(gender_features(n), g) for n, g in zip(test_set['Name'], test_set['Gender'])]
dev_test_features = [(gender_features(n), g) for n, g in zip(dev_test_set['Name'], dev_test_set['Gender'])]
train_features = [(gender_features(n), g) for n, g in zip(train_set['Name'], train_set['Gender'])]

# Classify using Naive Bayes and Decision Tree
classifier_NB = nltk.NaiveBayesClassifier.train(train_features)
classifier_DT = nltk.DecisionTreeClassifier.train(train_features)

# Training accuracy
print("Training Accuracy For Naive Bayes:", nltk.classify.accuracy(classifier_NB, train_features))
print("Training Accuracy For Decision Tree:", nltk.classify.accuracy(classifier_DT, train_features))

# Dev-Test Accuracy
dev_test_actual = dev_test_set['Gender'].tolist()
dev_test_NB_predicted = [classifier_NB.classify(gender_features(n)) for n in dev_test_set['Name']]
dev_test_DT_predicted = [classifier_DT.classify(gender_features(n)) for n in dev_test_set['Name']]

print("Dev-Test Accuracy For Naive Bayes:", accuracy_score(dev_test_actual, dev_test_NB_predicted))
print("Dev-Test Accuracy For Decision Tree:", accuracy_score(dev_test_actual, dev_test_DT_predicted))

# Test Accuracy
test_actual = test_set['Gender'].tolist()
test_NB_predicted = [classifier_NB.classify(gender_features(n)) for n in test_set['Name']]
test_DT_predicted = [classifier_DT.classify(gender_features(n)) for n in test_set['Name']]

print("Test Accuracy For Naive Bayes:", accuracy_score(test_actual, test_NB_predicted))
print("Test Accuracy For Decision Tree:", accuracy_score(test_actual, test_DT_predicted))

Training Accuracy For Naive Bayes: 0.859005311443747
Training Accuracy For Decision Tree: 0.9946885562530179
Dev-Test Accuracy For Naive Bayes: 0.842
Dev-Test Accuracy For Decision Tree: 0.792
Test Accuracy For Naive Bayes: 0.876
Test Accuracy For Decision Tree: 0.782
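
One way to judge whether the gap between the dev-test and test accuracies is meaningful is to estimate the sampling noise of an accuracy measured on 500 names. The sketch below uses the normal approximation to the binomial and treats the two sets as independent samples; both are simplifying assumptions, and the helper names are hypothetical.

```python
import math


def accuracy_std_error(accuracy, n):
    """Standard error of an accuracy estimated from n independent predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def difference_z_score(acc_a, acc_b, n_a, n_b):
    """z-score of the difference acc_b - acc_a between two independent samples."""
    se = math.sqrt(accuracy_std_error(acc_a, n_a) ** 2
                   + accuracy_std_error(acc_b, n_b) ** 2)
    return (acc_b - acc_a) / se


# Dev-test and test accuracies taken from the output above; each set has 500 names.
z_nb = difference_z_score(0.842, 0.876, 500, 500)
z_dt = difference_z_score(0.792, 0.782, 500, 500)
print(f"Naive Bayes z = {z_nb:.2f}")
print(f"Decision Tree z = {z_dt:.2f}")
```

Under these assumptions both z-scores fall inside ±1.96, so the dev-test and test accuracies are consistent with ordinary sampling variation on sets of this size.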


#### Alternative Method:

Below I try stacking the classifiers: a Naive Bayes model, then a Decision Tree, then a final Naive Bayes model. The hope was to beat the test accuracies of 87.6% and 78.2% from the previous section. The results show a test accuracy of 81.4% for the first Naive Bayes model, 54.2% for the Decision Tree, and 75.8% for the second Naive Bayes model, so stacking the classifiers is no more accurate than using a single classifier. One caveat: the Decision Tree figure is computed from its predictions on the dev-test features scored against the test labels, so it sits near chance and should not be read as a true test accuracy. I also tried sklearn's `StackingClassifier`, as in the snippet below, but ran into issues and ultimately scrapped it. A likely cause is that the snippet passes NLTK classifiers and NLTK-style feature sets, whereas `StackingClassifier` expects scikit-learn estimators and array-like inputs. I am still curious whether `StackingClassifier` from `sklearn.ensemble` would make the stacked classifiers work better.

```
stacking_classifier = StackingClassifier(
    estimators=[('nb', classifier_NB), ('dt', classifier_DT)],
    final_estimator=DecisionTreeClassifier()
)
stacking_classifier.fit(train_features, train_labels)
```
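
For comparison, here is a minimal sketch of how `StackingClassifier` can be wired up with scikit-learn estimators only. It is not the notebook's method: the toy names, the `char_ngrams` helper, the choice of `LogisticRegression` as final estimator, and `cv=3` are all assumptions made so the example runs on a handful of names.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative sample (hypothetical; the notebook uses the full Names Corpus).
toy_names = ["anna", "maria", "julia", "emma", "sophia", "olivia",
             "john", "peter", "mark", "david", "robert", "frank"]
toy_labels = ["female"] * 6 + ["male"] * 6


def char_ngrams():
    """Fresh character 2- and 3-gram vectorizer for each base pipeline."""
    return CountVectorizer(analyzer="char", ngram_range=(2, 3))


# Each base estimator is a full pipeline, so the stack can take raw names as input.
stack = StackingClassifier(
    estimators=[
        ("nb", make_pipeline(char_ngrams(), MultinomialNB())),
        ("dt", make_pipeline(char_ngrams(), DecisionTreeClassifier(random_state=0))),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # small fold count only because the toy sample is tiny
)
stack.fit(toy_names, toy_labels)
predictions = stack.predict(["laura", "james"])
print(list(predictions))
```

The final estimator is trained on cross-validated predictions of the base models, which avoids feeding it predictions made on the same rows the base models were fit on.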



In [167]:
# Creating featuresets
train_names = train_set['Name'].tolist()
dev_test_names = dev_test_set['Name'].tolist()
test_names = test_set['Name'].tolist()

train_labels = train_set['Gender'].tolist()
dev_test_labels = dev_test_set['Gender'].tolist()
test_labels = test_set['Gender'].tolist()

# Label Encoding
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_labels)
dev_test_labels_encoded = label_encoder.transform(dev_test_labels)
test_labels_encoded = label_encoder.transform(test_labels)

# Convert to matrix representation
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 3))
train_features_matrix = vectorizer.fit_transform(train_names)
dev_test_features_matrix = vectorizer.transform(dev_test_names)
test_features_matrix = vectorizer.transform(test_names)


In [168]:
# Initial Naive Bayes model
classifier_NB_1 = MultinomialNB()
classifier_NB_1.fit(train_features_matrix, train_labels_encoded)

# Using predictions(NB1) as features
nb1_predictions = classifier_NB_1.predict(train_features_matrix)
train_features_stacked = np.hstack((train_features_matrix.toarray(), np.array(nb1_predictions).reshape(-1, 1)))

# Decision Tree model
classifier_DT = DecisionTreeClassifier()
classifier_DT.fit(train_features_stacked, train_labels_encoded)

# Stack the NB1 predictions onto the dev-test features
nb1_predictions_dev = classifier_NB_1.predict(dev_test_features_matrix)
dev_features_stacked = np.hstack((dev_test_features_matrix.toarray(), np.array(nb1_predictions_dev).reshape(-1, 1)))

# Second Naive Bayes model
classifier_NB_2 = MultinomialNB()
classifier_NB_2.fit(dev_features_stacked, dev_test_labels_encoded)
nb1_predictions_test = classifier_NB_1.predict(test_features_matrix)
test_features_stacked = np.hstack((test_features_matrix.toarray(), np.array(nb1_predictions_test).reshape(-1, 1)))
nb2_predictions = classifier_NB_2.predict(test_features_stacked)

# Accuracy Test
test_labels_decoded = label_encoder.inverse_transform(test_labels_encoded)
nb1_predictions_decoded = label_encoder.inverse_transform(nb1_predictions_test)
print("Test Accuracy For Naive Bayes 1:", accuracy_score(test_labels_decoded, nb1_predictions_decoded))

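# NOTE: the Decision Tree predicts on the dev-test features here, but the score
# below compares those predictions with the test labels, so the printed figure
# is not a true test accuracy. Predicting on test_features_stacked would align them.
dt_predictions = classifier_DT.predict(dev_features_stacked)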
dt_predictions_decoded = label_encoder.inverse_transform(dt_predictions)
print("Test Accuracy For Decision Tree:", accuracy_score(test_labels_decoded, dt_predictions_decoded))

nb2_predictions_decoded = label_encoder.inverse_transform(nb2_predictions)
print("Test Accuracy For Naive Bayes 2:", accuracy_score(test_labels_decoded, nb2_predictions_decoded))


Test Accuracy For Naive Bayes 1: 0.814
Test Accuracy For Decision Tree: 0.542
Test Accuracy For Naive Bayes 2: 0.758
