## Project 3

Haig Bedros, Nori Selina, Julia Ferris, Matthew Roland

Project 3 - Your project should be submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their code and findings in our meetup. The ability to be an effective member of a virtual team is highly valued in the data science job market.  

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set?

Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

In [63]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random

## Defining a function to extract gender features

This function returns the last character of a name and its second-to-last character as two separate features. Our model uses these suffix characters to classify whether a name is masculine or feminine.

In [64]:
def gender_features(word):
    # suffix1: last character; suffix2: second-to-last character
    return {'suffix1': word[-1],
            'suffix2': word[-2]}
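As a quick illustration of what the extractor produces, the sketch below applies it to a single name. The second function, `two_char_suffix_features`, is a hypothetical variant that is not part of the model in this notebook; it treats the final two characters as one combined feature.

```python
def gender_features(word):
    # Same logic as the extractor above
    return {'suffix1': word[-1],
            'suffix2': word[-2]}

def two_char_suffix_features(word):
    # Hypothetical variant: slice the last two characters as a single feature
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

print(gender_features('Allison'))           # {'suffix1': 'n', 'suffix2': 'o'}
print(two_char_suffix_features('Allison'))  # {'suffix1': 'n', 'suffix2': 'on'}
```

The slice form also avoids an `IndexError` on one-letter strings, which may matter if the extractor is reused on other data.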

## Loading the Names Corpus

In [65]:
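# nltk.download('names')  # uncomment on first run to fetch the Names Corpus
# Label every name with its gender, then shuffle with a fixed seed for reproducibility.
# Note: this rebinds `names` from the corpus reader to a list of (name, gender) pairs.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.seed(12345)
random.shuffle(names)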

## Creating Train, Test, and Development Sets

In [66]:
# Turn every (name, gender) pair into a (feature dict, gender) pair
featuresets = [(gender_features(n), g) for (n, g) in names]

# First 501 names for training, the next 501 for dev-test, the remainder for test
train_set, dev_test, test_set = featuresets[0:501], featuresets[501:1002], featuresets[1002:]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print("Accuracy of test set:", nltk.classify.accuracy(classifier, test_set))
print('Accuracy of dev test set:', nltk.classify.accuracy(classifier, dev_test))

Accuracy of test set: 0.7605877268798618
Accuracy of dev test set: 0.7305389221556886


As we can see, this classifier reaches an accuracy of roughly 0.73 to 0.76 on the two held-out sets, which is a reasonable baseline but can certainly be improved upon.
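The exercise statement calls for 500 names each in the test and dev-test sets, with the remainder used for training. A minimal sketch of such a split is shown here for reference. `split_three_way` and its parameters are hypothetical names, and placeholder integers stand in for the feature sets. The accuracies reported above come from the slices in the preceding cell, not from this helper.

```python
import random

def split_three_way(items, n_test=500, n_devtest=500, seed=12345):
    """Shuffle a copy of items and return (train, devtest, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Stand-in data: 7,900 placeholder items, matching the sizes in the exercise
train, devtest, test = split_three_way(range(7900))
print(len(train), len(devtest), len(test))  # 6900 500 500
```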

In [67]:
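# Same slice as dev_test above, but keeping the raw names for inspection
dev_test_names = names[501:1002]

# Collect every dev-test name whose predicted gender differs from its label
errors = []
for (name, tag) in dev_test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8} name={:<30}'.format(tag, guess, name))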

correct=female   guess=male     name=Aeriel                        
correct=female   guess=male     name=Aileen                        
correct=female   guess=male     name=Alisun                        
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Ariel                         
correct=female   guess=male     name=Brandais                      
correct=female   guess=male     name=Brittan                       
correct=female   guess=male     name=Caitrin                       
correct=female   guess=male     name=Carmel                        
correct=female   guess=male     name=Caroljean                     
correct=female   guess=male     name=Cary                          
correct=female   guess=male     name=Charmain                      
correct=female   guess=male     name=Christel                      
correct=female   guess=male     name=Christian  

The list of errors gives us hints about how to modify the model. For instance, the portion of the list shown contains female names that the model guessed as male, many of them ending in letters such as n or l, which suggests that the two trailing characters alone are not enough to separate these cases. Incorporating additional features, such as the first character of the name, may ameliorate this issue.
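One way to make this kind of inspection more systematic is to tally the errors by direction and by final letter. The sketch below does so for a handful of names copied from the listing above; `tally_errors` is a hypothetical helper, and the five-name sample is only there to keep the example self-contained.

```python
from collections import Counter

# A few (correct, guess, name) triples copied from the error listing above
sample_errors = [
    ('female', 'male', 'Aeriel'),
    ('female', 'male', 'Aileen'),
    ('female', 'male', 'Allison'),
    ('female', 'male', 'Amber'),
    ('female', 'male', 'Cary'),
]

def tally_errors(errors):
    """Count errors by (correct, guess) direction and by final letter."""
    by_direction = Counter((tag, guess) for tag, guess, _ in errors)
    by_last_letter = Counter(name[-1].lower() for _, _, name in errors)
    return by_direction, by_last_letter

by_direction, by_last_letter = tally_errors(sample_errors)
print(by_direction)
print(by_last_letter.most_common())
```

Run on the full `errors` list, the same tally would show which endings account for most of the mistakes.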

## Refining the Classification Model

Several new features are added below with the goal of improving the model. Alongside the two suffix characters used before, the model now receives the first and second characters of the name, the name's length, its vowel and consonant counts, and a `suffix_vowel` flag. These changes were motivated by the errors of the previous attempt.

In [68]:
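def gender_features_2(word):
    return {'suffix1': word[-1],     # last character
            'suffix2': word[-2],     # second-to-last character
            'prefix1': word[0],      # first character
            'prefix2': word[1],      # second character
            'length': len(word),
            'vowels': sum(1 for v in word.lower() if v in 'aeiou'),
            # counts every non-vowel character, including punctuation such as hyphens
            'consonants': sum(1 for c in word.lower() if c not in 'aeiou'),
            # as written, tests the first character against lowercase vowels
            'suffix_vowel': word[0] in 'aeiou'}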
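The reported accuracies come from the function exactly as written above. Two details are worth noting if it is revised: the `suffix_vowel` flag looks at `word[0]` without lowercasing, so for capitalized names it presumably never fires, and the consonant count includes any non-letter characters. A hedged sketch of a revised extractor follows; `gender_features_revised` is a hypothetical name, and its effect on accuracy has not been measured here.

```python
VOWELS = set('aeiou')

def gender_features_revised(word):
    """Hypothetical revision: lowercase first, count letters only."""
    lowered = word.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    n_vowels = sum(ch in VOWELS for ch in letters)
    return {
        'suffix1': lowered[-1],            # last character
        'suffix2': lowered[-2:],           # last two characters
        'prefix1': lowered[0],             # first character
        'prefix2': lowered[:2],            # first two characters
        'length': len(word),
        'vowels': n_vowels,
        'consonants': len(letters) - n_vowels,
        'suffix_vowel': lowered[-1] in VOWELS,  # does the name end in a vowel?
    }

print(gender_features_revised('Ann-Marie'))
```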

In [69]:
# Rebuild the feature sets with the extended extractor
featuresets = [(gender_features_2(n), g) for (n, g) in names]

# Same slices as before: 501 for training, 501 for dev-test, the remainder for test
train_set_2, dev_test_2, test_set_2 = featuresets[0:501], featuresets[501:1002], featuresets[1002:]

classifier_2 = nltk.NaiveBayesClassifier.train(train_set_2)

print("Accuracy of test set:", nltk.classify.accuracy(classifier_2, test_set_2))
print('Accuracy of dev test set:', nltk.classify.accuracy(classifier_2, dev_test_2))

Accuracy of test set: 0.7623163353500432
Accuracy of dev test set: 0.7345309381237525


## Conclusions

How does the performance on the test set compare to the performance on the dev-test set?

- For the first classifier, the accuracy on the test set (0.761) was close to the accuracy on the dev-test set (0.731), a gap of about three percentage points. This suggests the classifier was reasonably consistent in its ability to determine gender. An accuracy above 70% shows that the model was fairly good at determining whether a name was male or female, but that it could be improved upon.
- For the second classifier, trained with more features, the accuracy on the test set (0.762) was again close to the accuracy on the dev-test set (0.735). The gap, about 2.8 percentage points, was slightly smaller than for the first classifier. This shows the second classifier was also consistent in its ability to determine gender. It improved only marginally over the original classifier, by less than half a percentage point on either set.

Is this what you'd expect?

- This was expected based on the results from the textbook. The accuracies were very similar in the example shown.
- Setting the textbook aside, we expected the dev-test set and the test set to have similar accuracies in the first classification model because neither set was used to train the model, and a model should perform consistently across held-out data drawn from the same corpus. For the second model, we expected the dev-test set to be slightly more accurate than the test set, because the errors made on the dev-test set guided the choice of added features. In the results, the test-set accuracy was in fact slightly higher than the dev-test accuracy for both models, so that second expectation was not borne out, although the two accuracies stayed close, as expected.
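To judge whether a gap of about three percentage points is meaningful, it can help to estimate the sampling noise of an accuracy measured on a set of this size. The sketch below uses the binomial standard error, treating each prediction as an independent trial (an assumption); `accuracy_standard_error` is a hypothetical helper, and 501 is the dev-test size implied by the slice `[501:1002]`.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Dev-test accuracy of the first classifier, measured on 501 names
se_dev = accuracy_standard_error(0.7305389221556886, 501)
print(round(se_dev, 4))  # about 0.0198
```

With a standard error near two percentage points on the dev-test set, a three-point difference between the two sets is within the range that sampling variation alone could plausibly produce.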

# Using Scikit-Learn for Classification

Now, we will compare the performance of classifiers from scikit-learn. Specifically, we will build a logistic regression model and a random forest model to see how they compare to NLTK's Naive Bayes classifier.

To start, we will create a new feature extraction function that acts similarly to the one created previously, using the length of each name as well as its first two and last two letters for classification.

In [70]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
#from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

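# Build a dataframe with one row per (name, gender) pair
names_df = pd.DataFrame(names)

names_df.rename(columns={0: 'name', 1: 'gender'}, inplace=True)

def gender_features_sci(word):
    # `word` is a Series of names; return one row of features per name
    return pd.DataFrame({
        'length': [len(l) for l in word],
        'prefix1': [l[0] for l in word],    # first letter
        'prefix2': [l[1] for l in word],    # second letter
        'suffix1': [l[-1] for l in word],   # last letter
        'suffix2': [l[-2] for l in word]    # second-to-last letter
    })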

Next, we apply this feature extraction function to the dataframe and partition the data into predictor and outcome dataframes. Because scikit-learn estimators expect numeric inputs, we one-hot encode the categorical letter features as dummy variables. We then fit a label encoder to the binary male/female outcome.

In [71]:
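X = gender_features_sci(names_df['name'])
y = names_df['gender']

# One-hot encode the four letter features; 'length' stays numeric
X = pd.get_dummies(X, columns = ['suffix1', 'suffix2', 'prefix1', 'prefix2'])

# Fit the encoder on the outcome; label.classes_ is reused to name the classes in the reports
label = LabelEncoder()
label.fit_transform(y)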


array([0, 0, 0, ..., 1, 0, 0])
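To make the dummy coding step concrete, the sketch below runs `pd.get_dummies` on a two-row toy frame. The frame and its values are made up for illustration; only the column handling mirrors the cell above.

```python
import pandas as pd

# Toy frame: one numeric column and one categorical letter column
toy = pd.DataFrame({'length': [4, 3], 'suffix1': ['a', 'n']})

# Each distinct letter becomes its own indicator column; 'length' is left as-is
encoded = pd.get_dummies(toy, columns=['suffix1'])
print(list(encoded.columns))  # ['length', 'suffix1_a', 'suffix1_n']
```

Applied to the full data, each of the four letter features expands into one indicator column per distinct letter observed.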

Finally, we will build the actual logistic regression model by first splitting our data, followed by fitting the training data.

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)
reg_model = LogisticRegression(max_iter = 300)
reg_model.fit(X_train, y_train)

y_pred = reg_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', accuracy)
print(classification_report(y_test, y_pred, target_names=label.classes_))

Accuracy: 0.8003355704697986
              precision    recall  f1-score   support

      female       0.83      0.86      0.84      1499
        male       0.74      0.71      0.72       885

    accuracy                           0.80      2384
   macro avg       0.79      0.78      0.78      2384
weighted avg       0.80      0.80      0.80      2384



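As we can see, the model performed rather well, with an accuracy of about 80%, a few points above the NLTK Naive Bayes classifier (about 76% on its test set). The two figures are not strictly comparable, since the scikit-learn model was trained on 70% of the corpus and evaluated on a different split, whereas the Naive Bayes classifier was trained on 501 names. Furthermore, the classification report shows that the model classifies female names better than male names (F1 of 0.84 versus 0.72).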
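The F1 values in the report are the harmonic mean of precision and recall. As a quick check, the sketch below recomputes them from the rounded precision and recall printed above; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_from(0.83, 0.86), 2))  # female: 0.84
print(round(f1_from(0.74, 0.71), 2))  # male: 0.72
```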

Now, we will compare the performance of our logistic regression to that of a random forest model.

In [73]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_forest = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_forest)

print('Accuracy', accuracy)
print(classification_report(y_test, y_pred_forest, target_names=label.classes_))

Accuracy 0.7831375838926175
              precision    recall  f1-score   support

      female       0.82      0.83      0.83      1499
        male       0.71      0.70      0.71       885

    accuracy                           0.78      2384
   macro avg       0.77      0.77      0.77      2384
weighted avg       0.78      0.78      0.78      2384



Our logistic regression model performs marginally better than our random forest model, with a difference of about 1.7 percentage points in accuracy (0.800 versus 0.783). Notably, both models predict female names more reliably than male names, and the gap is of similar size in each (F1 of 0.84 versus 0.72 for logistic regression, 0.83 versus 0.71 for the random forest). The disparity may partly reflect the class imbalance in the data (1,499 female versus 885 male names in the test split), or features that were not explored in these models, such as vowel and consonant composition. Richer feature sets and more careful model tuning would be needed to build a stronger, more generalizable classifier.
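For context on these accuracies, a useful reference point is the score of a classifier that always predicts the most frequent class. The sketch below computes that baseline from the support counts in the classification reports above; `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Support counts taken from the classification reports above
support = {'female': 1499, 'male': 885}
print(round(majority_baseline(support), 3))  # 0.629
```

Both scikit-learn models therefore sit roughly 15 to 17 percentage points above the trivial baseline on this split.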