# Mystery Friend

You've received an anonymous postcard from a friend you haven't seen in years. Your friend did not leave a name, but the card is definitely addressed to you. So far, based on the handwriting, you've narrowed your search down to three friends:
- Emma Goldman
- Matthew Henson
- TingFang Wu

But which one sent you the card?

Just as a spam filter classifies a message as spam or not spam, a friend writing classifier can classify a piece of writing as belonging to one friend or another. You have past writing from all three friends stored in the variable `friends_docs`, which means you can use scikit-learn's bag-of-words vectorizer and Naive Bayes classifier to determine who the mystery friend is!

Ready?

## Feature Vectors Are in the Bag with Scikit-Learn

## Import Libraries

In [34]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Import messages to process
import import_ipynb
from messages.goldman_emma_raw import goldman_docs
from messages.henson_matthew_raw import henson_docs
from messages.wu_tingfang_raw import wu_docs

## Text Exploration

In [80]:
# Inspect lines from the messages of each friend
print("Here's an excerpt from Emma Goldman's message:\n", goldman_docs[49], "\n")
print("Here's an excerpt from Matthew Henson's message:\n", henson_docs[49], "\n")
print("Here's an excerpt from TingFang Wu's message:\n", wu_docs[49], "\n")


Here's an excerpt from Emma Goldman's message:
  What he gives to the world is only gray and hideous
things, reflecting a dull and hideous existence,--too weak to live,
too cowardly to die 

Here's an excerpt from Matthew Henson's message:
 Miss Marie Ahnighito Peary, aged about ten months, who
first saw the light of day at Anniversary Lodge on the 12th of the
previous September, was taken by her mother to her kinfolks in the
South 

Here's an excerpt from TingFang Wu's message:
  Let us, for instance, compare England with the United
States 



## Feature Extraction (Vectorization) using Bag-of-Words

In [2]:
# Create bow_vectorizer:
bow_vectorizer = CountVectorizer()

# Combine messages to one document
friends_docs = goldman_docs + henson_docs + wu_docs

# Define friends_vectors:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)
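
To make the bag-of-words step concrete, here is a minimal sketch of what `fit_transform` produces. The two-sentence `toy_docs` corpus and the `toy_` names are hypothetical stand-ins for `friends_docs`, not part of the project data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for friends_docs
toy_docs = [
    "the ship broke the ice",
    "the state and the individual",
]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

# Words learned during fit, listed in column order of the count matrix
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)

# Dense view of the sparse matrix: one row per document, one column per word
counts = toy_vectors.toarray()
print(counts)
```

Each row is the feature vector for one document, and each entry counts how often the corresponding vocabulary word appears in it. Word order within a document is discarded.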

In [78]:
# Create a new mystery message we want to classify
mystery_message = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freeing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""

# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_message])
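
Note that the mystery message goes through `transform`, not `fit_transform`, so it is counted against the vocabulary learned from the training documents. The sketch below, using a hypothetical one-sentence training corpus, illustrates that words never seen during fitting simply contribute nothing to the new vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a hypothetical one-document corpus
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["the ship broke the ice"])

# "tempest", "raged", and "and" were not seen during fit, so they are ignored
new_vector = toy_vectorizer.transform(["the tempest raged and the ice broke"])

known_words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
new_counts = new_vector.toarray()[0].tolist()
print(dict(zip(known_words, new_counts)))
```

The input is wrapped in a list because the vectorizer expects an iterable of documents, which is also why the cell above passes `[mystery_message]`.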


## This Mystery Friend Gets Classified

You've vectorized and prepared all the documents. To get a sense of how each friend writes, take a look at their writing samples.

Print one document from each friend's writing, trying any index between `0` and `140`. (Your friends' documents are stored in `goldman_docs`, `henson_docs`, and `wu_docs`.)

## Naive Bayes Classifier

In [66]:
# Define friends_classifier
friends_classifier = MultinomialNB()

# Define friends_labels: one label per document, in the same order as friends_docs
friends_labels = ["Emma"] * len(goldman_docs) + ["Matthew"] * len(henson_docs) + ["Tingfang"] * len(wu_docs)

# Train the classifier
friends_classifier.fit(friends_vectors, friends_labels)

# Predict mystery_vector
predictions = friends_classifier.predict(mystery_vector)
mystery_friend = predictions[0] if predictions[0] else "someone else"
print("The postcard was from {}!".format(mystery_friend))

The postcard was from Emma!
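
The training step relies on the labels lining up with the documents one to one. As a rough illustration, the sketch below repeats the same pattern on a tiny hypothetical corpus (`toy_goldman` and `toy_henson` are invented stand-ins, not the real writing samples).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for goldman_docs and henson_docs
toy_goldman = ["the state crushes the individual", "liberty menaces authority"]
toy_henson = ["the ship broke through the ice", "a storm froze the ship"]

toy_docs = toy_goldman + toy_henson
# One label per document, repeated to match the order of toy_docs
toy_labels = ["Emma"] * len(toy_goldman) + ["Matthew"] * len(toy_henson)

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# Classify a new sentence using the vocabulary learned above
toy_prediction = toy_classifier.predict(
    toy_vectorizer.transform(["ice closed around the ship"])
)[0]
print(toy_prediction)
```

If the documents were concatenated in a different order than the labels, the classifier would still train without complaint but would learn the wrong author for each sample, so the ordering is worth double-checking.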


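Taking a random excerpt from Emma Goldman's "The place of the individual in society" (source: https://www.gutenberg.org/cache/epub/71418/pg71418.txt) allows us to test how the classifier holds up! Keep in mind that the `In [..]` execution counts show the cells were run out of document order, so each printed output reflects whichever `mystery_vector` was current when its cell ran: the `In [66]` prediction followed the `In [60]` Goldman excerpt, and the `In [79]` probabilities followed the `In [78]` postcard.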

In [60]:
mystery_message = """
The interests of the State and those of the individual differ
fundamentally and are antagonistic. The State and the political and
economic institutions it supports can exist only by fashioning the
individual to their particular purpose; training him to respect “law and
order;” teaching him obedience, submission and unquestioning faith in
the wisdom and justice of government; above all, loyal service and
complete self-sacrifice when the State commands it, as in war. The State
puts itself and its interests even above the claims of religion and of
God. It punishes religious or conscientious scruples against
individuality because there is no individuality without liberty, and
liberty is the greatest menace to authority.
"""

# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_message])

In [79]:
# Predict mystery_vector
probabilities = friends_classifier.predict_proba(mystery_vector)

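# Create list of authors; the order must match friends_classifier.classes_,
# which sorts the labels alphabetically ("Emma", "Matthew", "Tingfang")
author_names = ["Emma Goldman", "Matthew Henson", "TingFang Wu"]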

# Extract probabilities for each author
author_probabilities = probabilities[0]

# Initialize a dictionary to store the likelihood for each author
likelihoods = {}

for i, author in enumerate(author_names):
    likelihoods[author] = author_probabilities[i]

# Output the likelihood for each author
for author, likelihood in likelihoods.items():
    print(f"The likelihood of the mystery message being from {author} is {likelihood:.2%}")


# Find the index with the highest probability
index_max_probability = np.argmax(author_probabilities)

# Output the result
print("")
print("Conclusion:")
print(f"Based on these probabilities, the mystery message is most likely from {author_names[index_max_probability]}.")


The likelihood of the mystery message being from Emma Goldman is 1.10%
The likelihood of the mystery message being from Matthew Henson is 98.90%
The likelihood of the mystery message being from TingFang Wu is 0.00%

Conclusion:
Based on these probabilities, the mystery message is most likely from Matthew Henson.
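
The hard-coded `author_names` list works only because it happens to follow the order of the classifier's classes. A slightly more robust variant, sketched here on a hypothetical three-document corpus, reads the order from the fitted classifier's `classes_` attribute instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical one-document-per-author corpus
toy_docs = [
    "liberty menaces authority",
    "the ship broke the ice",
    "compare england with the united states",
]
toy_labels = ["Emma", "Matthew", "Tingfang"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

toy_probabilities = toy_classifier.predict_proba(
    toy_vectorizer.transform(["the ship met the ice"])
)[0]

# classes_ lists the labels in the same order as the predict_proba columns
likelihoods = dict(zip(toy_classifier.classes_, toy_probabilities))
for label, likelihood in likelihoods.items():
    print(f"{label}: {likelihood:.2%}")

most_likely = max(likelihoods, key=likelihoods.get)
print("Most likely:", most_likely)
```

Pairing probabilities with `classes_` keeps the output correct even if the label strings or their order change later.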
