# Mystery Friend

You've received an anonymous postcard from a friend you haven't seen in years. Your friend did not leave a name, but the card is definitely addressed to you. Based on the handwriting, you've narrowed your search down to three friends:
- Emma Goldman
- Matthew Henson
- TingFang Wu

But which one sent you the card?

Just as a spam filter classifies a message as spam or not spam, a friend-writing classifier can attribute a piece of writing to one friend or another. You have past writing from all three friends stored in the variable `friends_docs`, which means you can use scikit-learn's bag-of-words vectorizer and Naive Bayes classifier to determine who the mystery friend is!

Ready?

## Feature Vectors Are in the Bag with Scikit-Learn

1. In the code block below, import `CountVectorizer` from `sklearn.feature_extraction.text`. Below it, import `MultinomialNB` from `sklearn.naive_bayes`.

In [7]:
# import sklearn modules here:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

2. Define `bow_vectorizer` as an implementation of `CountVectorizer`.

In [8]:
# Create bow_vectorizer:
bow_vectorizer = CountVectorizer()
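
To make concrete what a bag-of-words vectorizer counts, here is a minimal standard-library sketch. It is not scikit-learn's implementation: `tokenize` and `bag_of_words` are hypothetical helper names, and the regular expression is assumed to mirror the documented `CountVectorizer` defaults (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumed to match CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(text):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokenize(text))


sample = "The ice held the ship; the ship held firm."
counts = bag_of_words(sample)
print(counts)
```

Word order and punctuation are discarded; only the per-word counts survive, which is all the classifier later sees.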

In [6]:
# A lightweight way to install new packages. The `import_ipynb` package is needed to import the friends' notebooks.
%pip install import_ipynb

Note: you may need to restart the kernel to use updated packages.


3. Use your newly minted `bow_vectorizer` to both `fit` (train) and `transform` (vectorize) all your friends' writing (stored in the variable `friends_docs`). Save the resulting vector object as `friends_vectors`.

In [9]:
import import_ipynb

from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

friends_docs = goldman_docs + henson_docs + wu_docs

# Define friends_vectors:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)

importing Jupyter notebook from goldman_emma_raw.ipynb
importing Jupyter notebook from henson_matthew_raw.ipynb
importing Jupyter notebook from wu_tingfang_raw.ipynb
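
The two halves of `fit_transform` can be sketched by hand on a toy corpus. This is an illustration under simplifying assumptions (whitespace tokenization, dense Python lists), not how scikit-learn stores its sparse matrix; `fit_vocabulary`, `transform`, and `toy_docs` are hypothetical names.

```python
def fit_vocabulary(docs):
    """'Fit': map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """'Transform': one count vector per document; unknown tokens are skipped."""
    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        vectors.append(row)
    return vectors


toy_docs = ["ice and more ice", "the ship and the storm"]
vocab = fit_vocabulary(toy_docs)
toy_vectors = transform(toy_docs, vocab)

print(vocab)
print(toy_vectors)
```

Each row has one column per vocabulary word, so every document becomes a vector of the same length regardless of how long the text is.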


4. Create a new variable `mystery_vector`. Assign to it the vectorized form of `[mystery_postcard]` using the vectorizer's `.transform()` method.

   (`mystery_postcard` is a string, while the vectorizer expects a list as an argument.)

In [10]:
mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freeing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""

# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_postcard])
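
The reason for wrapping the postcard in a list is that the vectorizer treats its argument as a collection of documents. A bare string is itself iterable, one character at a time, so it cannot stand in for a one-document collection; scikit-learn is expected to reject it with a `ValueError`. The sketch below imitates that check with a hypothetical helper, `count_documents`, which is not part of any library.

```python
def count_documents(raw_documents):
    """Hypothetical helper: count the items in an iterable of documents."""
    if isinstance(raw_documents, str):
        raise ValueError("expected an iterable of documents, got a single string")
    return sum(1 for _ in raw_documents)


postcard = "Hope to see you soon."

# Iterating over the bare string yields characters, not documents.
print(len(list(postcard)))

# Wrapped in a list, the postcard is a single document.
print(count_documents([postcard]))
```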


## This Mystery Friend Gets Classified

5. You've vectorized and prepared all the documents. Let's take a look at your friends' writing samples to get a sense of how they write.

   Print one document from each friend. Try any index between `0` and `140`. (Your friends' documents are stored in `goldman_docs`, `henson_docs`, and `wu_docs`.)

In [11]:
# Print out a document from each friend:
print(goldman_docs[49])
print(henson_docs[49])
print(wu_docs[49])

 What he gives to the world is only gray and hideous
things, reflecting a dull and hideous existence,--too weak to live,
too cowardly to die
Miss Marie Ahnighito Peary, aged about ten months, who
first saw the light of day at Anniversary Lodge on the 12th of the
previous September, was taken by her mother to her kinfolks in the
South
 Let us, for instance, compare England with the United
States


6. Have an inkling about which friend wrote the mystery card? We can use a classifier to confirm those suspicions...

   Implement a Naive Bayes classifier using `MultinomialNB`. Save the result to `friends_classifier`.

In [7]:
# Define friends_classifier:
friends_classifier = MultinomialNB()

7. Train `friends_classifier` on `friends_vectors` and `friends_labels` using the classifier's `.fit()` method.

In [8]:
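# One label per document (154 Emma, 141 Matthew, 166 Tingfang in this dataset):
friends_labels = ["Emma"] * len(goldman_docs) + ["Matthew"] * len(henson_docs) + ["Tingfang"] * len(wu_docs)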

# Train the classifier:
friends_classifier.fit(friends_vectors, friends_labels)

MultinomialNB()
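
Part of what `.fit()` learns is a prior for each friend: with the default settings, `MultinomialNB` estimates it from how often each label appears. The sketch below reproduces that arithmetic with the standard library, using this exercise's document counts (154, 141, and 166); `log_priors` is a hypothetical name, and the classifier's own stored values are not shown here.

```python
import math
from collections import Counter

# Label counts taken from this exercise.
labels = ["Emma"] * 154 + ["Matthew"] * 141 + ["Tingfang"] * 166

label_counts = Counter(labels)
total = len(labels)

# Empirical log prior for each class: log(count / total).
log_priors = {name: math.log(n / total) for name, n in label_counts.items()}

for name, value in log_priors.items():
    print(name, round(math.exp(value), 3))
```

Because the three friends contributed similar numbers of documents, the priors are close to one third each, so the words on the postcard do most of the work.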

8. Change the value of `predictions` from `["None Yet"]` to the classifier's prediction about which friend wrote the postcard. You can do this by calling the classifier's `.predict()` method on `mystery_vector`.

In [9]:
predictions = friends_classifier.predict(mystery_vector)
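
What `.predict()` computes can be sketched from scratch: for each friend, add the log prior to the log likelihood of every word in the text, then pick the highest total. The code below is a simplified illustration with add-one (Laplace) smoothing, which corresponds to `MultinomialNB`'s default `alpha=1.0`. The `train` and `predict` functions, the toy documents, and the two-class setup are all assumptions made for brevity.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_counts.items():
        # Denominator: all words seen in this class plus smoothing mass.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((word_counts[label][tok] + alpha) / total)
                for tok in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the best label and the per-class log scores."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in doc.split():
            # Words outside the training vocabulary are ignored.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get), scores


toy_docs = ["ice ship storm ice", "liberty society liberty", "ship ice north"]
toy_labels = ["Matthew", "Emma", "Matthew"]

model = train(toy_docs, toy_labels)
best, scores = predict(model, "storm ice ship")
print(best, scores)
```

Smoothing keeps a word that never appeared in one friend's writing from driving that friend's score to negative infinity.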


## Mystery Revealed!

9. Uncomment the final print statement and run the code block below to see who your mystery friend was all along!

In [10]:
mystery_friend = predictions[0] if predictions[0] else "someone else"

# Uncomment the print statement:
print("The postcard was from {}!".format(mystery_friend))

The postcard was from Matthew!


10. But does it really work? Find some lines by Emma Goldman, Matthew Henson, and TingFang Wu on <a href="http://www.gutenberg.org" target="_blank">gutenberg.org</a> and save them to `mystery_postcard` to see how the classifier holds up!

    Try using the `.predict_proba()` method instead of `.predict()` and print out `predictions` to see the estimated probabilities that the `mystery_postcard` was written by each person.
   
    What happens when you add in a recent email or text instead?
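
Conceptually, `.predict_proba()` reports the same per-class scores that `.predict()` ranks, rescaled so they sum to 1. A minimal sketch of that rescaling, assuming the scores are log values and using hypothetical numbers rather than output from the notebook's classifier:

```python
import math


def normalize_log_scores(log_scores):
    """Convert per-class log scores to probabilities that sum to 1."""
    peak = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    exps = {label: math.exp(s - peak) for label, s in log_scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical log scores, chosen only for illustration.
toy_scores = {"Emma": -42.0, "Matthew": -35.0, "Tingfang": -40.0}
probabilities = normalize_log_scores(toy_scores)
print(probabilities)
```

Small gaps in log score turn into large gaps in probability, which is why Naive Bayes probabilities are often very close to 0 or 1 for longer texts.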

In [3]:
# Import necessary modules from scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Also import the notebook loader and the friends' documents
import import_ipynb
from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

importing Jupyter notebook from goldman_emma_raw.ipynb
importing Jupyter notebook from henson_matthew_raw.ipynb
importing Jupyter notebook from wu_tingfang_raw.ipynb


### Define and Initialize the Vectorizer and Classifier

After importing the necessary modules, define and initialize the `CountVectorizer` and `MultinomialNB` objects:

In [4]:
# Define vectorizer and classifier
bow_vectorizer = CountVectorizer()
friends_classifier = MultinomialNB()


### Prepare and Vectorize Data

Combine and vectorize the documents from each friend, then train the classifier:

In [5]:
# Combine documents from each friend
friends_docs = goldman_docs + henson_docs + wu_docs

# Vectorize friend documents
friends_vectors = bow_vectorizer.fit_transform(friends_docs)

# Define labels
friends_labels = ["Emma"] * len(goldman_docs) + ["Matthew"] * len(henson_docs) + ["Tingfang"] * len(wu_docs)

# Train the classifier
friends_classifier.fit(friends_vectors, friends_labels)


### Test with Sample Lines

Define sample lines, vectorize them, and print the predictions and probabilities:

In [6]:
# Placeholder lines, one per friend (replace these with actual lines from gutenberg.org)
emma_sample = """
The usual winter fever of America is in this month of January, and...
"""
matthew_sample = """
My expedition was a long and arduous one, but we made significant progress...
"""
tingfang_sample = """
The landscape was beautiful and serene, the perfect backdrop for...
"""

# Vectorize these samples
emma_vector = bow_vectorizer.transform([emma_sample])
matthew_vector = bow_vectorizer.transform([matthew_sample])
tingfang_vector = bow_vectorizer.transform([tingfang_sample])

# Predict
print("Emma Sample Prediction:", friends_classifier.predict(emma_vector))
print("Matthew Sample Prediction:", friends_classifier.predict(matthew_vector))
print("TingFang Sample Prediction:", friends_classifier.predict(tingfang_vector))

# Print probabilities
print("Emma Sample Probabilities:", friends_classifier.predict_proba(emma_vector))
print("Matthew Sample Probabilities:", friends_classifier.predict_proba(matthew_vector))
print("TingFang Sample Probabilities:", friends_classifier.predict_proba(tingfang_vector))


Emma Sample Prediction: ['Tingfang']
Matthew Sample Prediction: ['Matthew']
TingFang Sample Prediction: ['Emma']
Emma Sample Probabilities: [[0.06435879 0.15218932 0.7834519 ]]
Matthew Sample Probabilities: [[1.84761455e-05 9.99945592e-01 3.59317853e-05]]
TingFang Sample Probabilities: [[0.6297412  0.29042627 0.07983254]]
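
The probability rows above are easier to read once each column is paired with a name. The sketch below assumes the columns follow the sorted label order (Emma, Matthew, Tingfang), which is how scikit-learn is understood to populate `friends_classifier.classes_`. The numbers are copied from the output above; `label_probabilities` and `top_label` are hypothetical helpers.

```python
# Assumed column order: the sorted unique labels, as in classes_.
class_names = sorted({"Emma", "Matthew", "Tingfang"})

# Probabilities copied from the notebook output above.
outputs = {
    "Emma sample": [0.06435879, 0.15218932, 0.7834519],
    "Matthew sample": [1.84761455e-05, 9.99945592e-01, 3.59317853e-05],
    "TingFang sample": [0.6297412, 0.29042627, 0.07983254],
}


def label_probabilities(row):
    """Pair each probability with its class name."""
    return dict(zip(class_names, row))


def top_label(row):
    """Return the class name with the highest probability."""
    paired = label_probabilities(row)
    return max(paired, key=paired.get)


for name, row in outputs.items():
    print(name, "->", top_label(row), label_probabilities(row))
```

Read this way, the highest-probability label in each row matches the printed prediction. Only the Matthew placeholder is attributed to its intended friend, which is unsurprising given that the sample lines are placeholders rather than real passages.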


### Test with a Recent Email or Text

Vectorize and classify a recent text sample:

In [7]:
# Recent email or text
recent_text = """
Hi there, I hope you're doing well. Just wanted to catch up and see...
"""

# Vectorize the recent text
recent_vector = bow_vectorizer.transform([recent_text])

# Predict
recent_prediction = friends_classifier.predict(recent_vector)
print("Recent Text Prediction:", recent_prediction[0])


Recent Text Prediction: Matthew
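
The classifier can only answer with one of the three friends it was trained on, so a modern message still gets a label even when few of its words were ever seen in training. One rough way to gauge how much evidence a prediction rests on is to measure vocabulary coverage. The sketch below uses a tiny hypothetical vocabulary; with the real vectorizer, the fitted `bow_vectorizer.vocabulary_` dictionary could presumably be passed instead. `vocabulary_coverage` is a hypothetical helper.

```python
import re


def vocabulary_coverage(text, vocabulary):
    """Fraction of the text's tokens that appear in the vocabulary."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in vocabulary)
    return known / len(tokens)


# Hypothetical vocabulary, standing in for the fitted vectorizer's.
toy_vocabulary = {"ice", "ship", "storm", "hope", "see", "you", "well"}

recent_text = "Hi there, I hope you're doing well. Just wanted to catch up and see..."
coverage = vocabulary_coverage(recent_text, toy_vocabulary)
print(round(coverage, 2))
```

A low coverage value suggests the prediction leans mostly on the class priors and a handful of common words, so it should be treated with caution.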
