# Mystery Friend Writing Classifier

You’ve received an anonymous postcard from a friend who you haven’t seen in years. Your friend did not leave a name, but the card is definitely addressed to you. So far, you’ve narrowed your search down to three friends, based on handwriting:

- Emma Goldman
- Matthew Henson
- TingFang Wu

But which one sent you the card?

Today, we'll build a writing classifier to distinguish one friend's writing from another's. We will use scikit-learn's bag-of-words vectorizer (`CountVectorizer`) and a Naive Bayes classifier (`MultinomialNB`) to get the job done.
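
Before using the library, it may help to see the bag-of-words idea in plain Python. The sketch below is an illustration only, not scikit-learn's implementation: the helper names `tokenize`, `build_vocabulary`, and `vectorize` and the toy sentences are assumptions made for this example. The tokenizer lowercases text and keeps words of two or more characters, which roughly mirrors `CountVectorizer`'s default behavior.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    """Map each unique token in the training docs to a vector index."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Count each vocabulary word in doc; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(doc))
    # The vocabulary dict was built in index order, so iterating it yields index order.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical toy training documents (not taken from the friends' writing samples).
toy_docs = ["the ice surrounded the ship", "peace among the nations"]
toy_vocab = build_vocabulary(toy_docs)
toy_vector = vectorize("the ship met the ice and the storm", toy_vocab)

print(toy_vocab)
print(toy_vector)  # [0, 1, 0, 0, 1, 0, 3]
```

Note that "met", "and", and "storm" do not appear in the toy training documents, so they contribute nothing to the vector. The same thing happens to any postcard word that the friends' writing samples never used.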

In [2]:
from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

# import sklearn modules:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Setting up the combined list of friends' writing samples
friends_docs = goldman_docs + henson_docs + wu_docs
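# Setting up labels for the three friends (1 = Goldman, 2 = Henson, 3 = Wu).
# Deriving the counts from the lists keeps the labels aligned with the documents:
friends_labels = [1] * len(goldman_docs) + [2] * len(henson_docs) + [3] * len(wu_docs)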

# Print one document from each friend as a sanity check:
print("This is Goldman:")
print(goldman_docs[120])
print("\n")

print("This is Henson:")
print(henson_docs[120])
print("\n")

print("This is Wu:")
print(wu_docs[120])
print("\n")

This is Goldman:
 Nor will the stereotyped
Philistine argument that the laxity of divorce laws and the growing
looseness of woman account for the fact that: first, every twelfth
marriage ends in divorce; second, that since 1870 divorces have
increased from 28 to 73 for every hundred thousand population; third,
that adultery, since 1867, as ground for divorce, has increased 270.8
per cent.; fourth, that desertion increased 369.8 per cent.

Added to these startling figures is a vast amount of material,
dramatic and literary, further elucidating this subject


This is Henson:
M.

I was ashore on Duck Island in 1891, on my first voyage north, and I
remember distinctly the cairn the party built and the money they
deposited in it


This is Wu:
 America is known to have a large number of such men and
women, men and women who devote their time and money to preaching peace
among the nations




In [8]:
# This will be the test text:
mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freeing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""

# Create bow_vectorizer, which constructs the bag-of-words (BoW) representation automatically:
bow_vectorizer = CountVectorizer()

# Define friends_vectors: fit the BoW dictionary on the training data and return the vectorized documents.
# The classifier learns from these vectors, whose word indices come from the BoW dictionary.
# Note: each bag-of-words vector has the same length as the BoW dictionary (the features dictionary),
# which maps every unique word token in the training data to a vector index:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)

# Define mystery_vector: vectorize the test text using the word indices of the fitted BoW dictionary:
mystery_vector = bow_vectorizer.transform([mystery_postcard])

# Define friends_classifier (Naive Bayes):
friends_classifier = MultinomialNB()

# Train the classifier on the training data and labels:
friends_classifier.fit(friends_vectors, friends_labels)

# Make the prediction using the trained model on the test text:
predictions = friends_classifier.predict(mystery_vector)

# See how confident the model is in its classification by printing the estimated probabilities:
predictions_proba = friends_classifier.predict_proba(mystery_vector)
print("The estimated probabilities are: [Goldman Henson Wu]")
print(predictions_proba)
print("\n")

# Map the numeric label back to a name; fall back to "someone else" for an unexpected label:
friend_names = {1: "Goldman", 2: "Henson", 3: "Wu"}
writer = friend_names.get(predictions[0], "someone else")

# Reveal who the Mystery Writer was:
print("The postcard was from {}!".format(writer))

The estimated probabilities are: [Goldman Henson Wu]
[[1.10199321e-02 9.88977727e-01 2.34054697e-06]]


The postcard was from Henson!
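
To see roughly where probabilities like these come from, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not scikit-learn's code: the function names `train_nb` and `predict_proba_nb`, the whitespace tokenization, and the toy documents are all made up for this example. The smoothing value `alpha=1.0` matches the default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect per-class word counts, class counts, and the shared vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab, alpha


def predict_proba_nb(model, doc):
    """Return a dict of class -> estimated probability for one document."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one smoothed log likelihood per known word.
        score = math.log(class_counts[label] / total_docs)
        for word in doc.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no information
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        log_scores[label] = score
    # Normalize in a numerically stable way (subtract the maximum before exponentiating).
    max_score = max(log_scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Hypothetical toy training data, not the real writing samples.
toy_docs = ["ice ship storm ice", "ship north ice", "peace nations peace", "nations trade peace"]
toy_labels = ["Henson", "Henson", "Wu", "Wu"]

toy_model = train_nb(toy_docs, toy_labels)
toy_probs = predict_proba_nb(toy_model, "storm ice ship")
print(toy_probs)
```

In this toy run the "Henson" class receives most of the probability mass because every word in the test sentence appears in that class's training text. Smoothing keeps the "Wu" probability small but nonzero, which is also why the real output above shows tiny values rather than exact zeros.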
