
# Mystery Friend

You've received an anonymous postcard from a friend you haven't seen in years. Your friend did not leave a name, but the card is definitely addressed to you. Based on the handwriting, you've narrowed your search down to three friends:
- Emma Goldman
- Matthew Henson
- TingFang Wu

But which one sent you the card?

Just as a spam filter classifies a message as spam or not spam, you can classify a piece of writing as belonging to one friend or another by building a friend-writing classifier. You have past writing from all three friends stored in the variable `friends_docs`, which means you can use scikit-learn's bag-of-words vectorizer and Naive Bayes classifier to determine who the mystery friend is!

Ready?

## Feature Vectors Are in the Bag with Scikit-Learn

1. In the code block below, import `CountVectorizer` from `sklearn.feature_extraction.text`. Below it, import `MultinomialNB` from `sklearn.naive_bayes`.

In [1]:
# import sklearn modules here:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

2. Define `bow_vectorizer` as an instance of `CountVectorizer`.

In [2]:
# Create bow_vectorizer:
bow_vectorizer = CountVectorizer()

3. Use your newly minted `bow_vectorizer` to both `fit` (train) and `transform` (vectorize) all your friends' writing (stored in the variable `friends_docs`). Save the resulting vector object as `friends_vectors`.

In [5]:
!pip install import-ipynb

Collecting import-ipynb
  Downloading import_ipynb-0.1.4-py3-none-any.whl (4.1 kB)
Collecting jedi>=0.16 (from IPython->import-ipynb)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 17.6 MB/s eta 0:00:00
Installing collected packages: jedi, import-ipynb
Successfully installed import-ipynb-0.1.4 jedi-0.19.1


In [7]:
import import_ipynb

from goldman_emma_raw import goldman_docs
from henson_matthew_raw import henson_docs
from wu_tingfang_raw import wu_docs

friends_docs = goldman_docs + henson_docs + wu_docs

# Define friends_vectors:
friends_vectors = bow_vectorizer.fit_transform(friends_docs)

4. Create a new variable `mystery_vector`. Assign to it the vectorized form of `[mystery_postcard]` using the vectorizer's `.transform()` method.

   (`mystery_postcard` is a string, while the vectorizer expects a list as an argument.)

In [28]:
mystery_postcard = """
My friend,
From the 10th of July to the 13th, a fierce storm raged, clouds of
freeing spray broke over the ship, incasing her in a coat of icy mail,
and the tempest forced all of the ice out of the lower end of the
channel and beyond as far as the eye could see, but the _Roosevelt_
still remained surrounded by ice.
Hope to see you soon.
"""

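# Define mystery_vector:
mystery_vector = bow_vectorizer.transform([mystery_postcard])
# Later cells refer to this vector by the misspelled name `mistery_vector`
mistery_vector = mystery_vector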
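
The note about wrapping the string in a list can be checked on a small example. This is a sketch with hypothetical names (`toy_vectorizer`, `postcard`) and made-up text; it also shows that words the vectorizer never saw during `fit` are simply dropped by `.transform()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["the ice broke over the ship"])  # hypothetical training text

postcard = "the ship met a zeppelin"  # hypothetical message

# A bare string is rejected: the vectorizer expects an iterable of documents
try:
    toy_vectorizer.transform(postcard)
    raised = False
except ValueError:
    raised = True
print(raised)

# Wrapping the string in a list yields one row for the one document
postcard_vector = toy_vectorizer.transform([postcard])
print(postcard_vector.shape)
# Only "ship" and "the" are counted; "met" and "zeppelin" are not in the vocabulary
print(postcard_vector.toarray())
```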


## This Mystery Friend Gets Classified

5. You've vectorized and prepared all the documents. Let's take a look at your friends' writing samples to get a sense of how they write.

   Print out one document from each friend's writing; try any index between `0` and `140`. (Your friends' documents are stored in `goldman_docs`, `henson_docs`, and `wu_docs`.)

In [29]:
# Print out a document from each friend:
print(goldman_docs[41])

 Poor America, of what
avail is all her wealth, if the individuals comprising the nation are
wretchedly poor?  If they live in squalor, in filth, in crime, with
hope and joy gone, a homeless, soilless army of human prey.

It is generally conceded that unless the returns of any business
venture exceed the cost, bankruptcy is inevitable


In [14]:
print(henson_docs[125])

Captain Bartlett was forward,
astraddle of the bow with the boat-hook in his hands to fend off the
blocks of ice, and knew perfectly well where he wanted to land, but the
group of excited Esquimos were in his way and though he ordered them
back, they continued running about and getting in his way


In [24]:
print(wu_docs[9])

 Was I to be blamed for
wondering if the elevator would be my coffin?  On another occasion I
met a man whose name was "Death", and as soon as I heard his name I
felt inclined to run away, for I did not wish to die


6. Have an inkling about which friend wrote the mystery card? We can use a classifier to confirm those suspicions...

   Implement a Naive Bayes classifier using `MultinomialNB`. Save the result to `friends_classifier`.

In [30]:
# Define friends_classifier:
friends_classifier = MultinomialNB()

7. Train `friends_classifier` on `friends_vectors` and `friends_labels` using the classifier's `.fit()` method.

In [31]:
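# One label per document, in the same order the documents were concatenated
friends_labels = (
    ["Emma"] * len(goldman_docs)
    + ["Matthew"] * len(henson_docs)
    + ["Tingfang"] * len(wu_docs)
)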

# Train the classifier:
friends_classifier.fit(friends_vectors, friends_labels)
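
Training only works if there is exactly one label per row of the vectorized documents, listed in the same order in which the documents were concatenated. A minimal sketch of that alignment check, using made-up lists (`emma_toy`, `matthew_toy`, `wu_toy`) as hypothetical stand-ins for the three friends' documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for goldman_docs, henson_docs, and wu_docs
emma_toy = ["liberty and labor", "the nation and its wealth"]
matthew_toy = ["ice around the ship", "sledges on the ice", "the ship sailed north"]
wu_toy = ["customs of america"]

# Documents and labels are built in the same order
toy_docs = emma_toy + matthew_toy + wu_toy
toy_labels = (
    ["Emma"] * len(emma_toy)
    + ["Matthew"] * len(matthew_toy)
    + ["Tingfang"] * len(wu_toy)
)

toy_vectors = CountVectorizer().fit_transform(toy_docs)

# One label per row; a mismatch here would make .fit() fail
assert toy_vectors.shape[0] == len(toy_labels)
print(toy_labels)
```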

8. Change the value of `predictions` from `["None Yet"]` to the classifier's prediction about which friend wrote the postcard. You can do this by calling the classifier's `.predict()` method on `mystery_vector`.

In [33]:
# Change predictions:
predictions = friends_classifier.predict(mistery_vector)
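
`.predict()` returns an array with one label per input row, which is why a later cell reads `predictions[0]`. A small end-to-end sketch on a made-up corpus (all `toy_*` names and sentences are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-author corpus
toy_docs = [
    "ice ship storm",
    "ship ice north",
    "liberty labor wealth",
    "labor liberty nation",
]
toy_labels = ["Matthew", "Matthew", "Emma", "Emma"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# One document in, so one label out
toy_message = ["the storm trapped the ship in ice"]
toy_predictions = toy_classifier.predict(toy_vectorizer.transform(toy_message))

print(toy_predictions.shape)
print(toy_predictions[0])
```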


## Mystery Revealed!

9. Uncomment the final print statement and run the code block below to see who your mystery friend was all along!

In [35]:
mystery_friend = predictions[0] if predictions[0] else "someone else"

# Uncomment the print statement:
print("The postcard was from {}!".format(mystery_friend))

The postcard was from Matthew!


10. But does it really work? Find some lines by Emma Goldman, Matthew Henson, and TingFang Wu on <a href="http://www.gutenberg.org" target="_blank">gutenberg.org</a> and save them to `mystery_postcard` to see how the classifier holds up!

    Try using the `.predict_proba()` method instead of `.predict()` and print out `predictions` to see the estimated probabilities that the `mystery_postcard` was written by each person.
   
    What happens when you add in a recent email or text instead?
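
As a hint for the last question, here is a sketch of what happens when a message shares no words with the training vocabulary. The corpus and the `modern_text` string are made up. Because the vectorizer ignores unseen words, the message becomes an all-zero vector, and `MultinomialNB` falls back on the class priors, that is, each author's share of the training documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical corpus: three documents by one author, one by another
toy_docs = ["ice ship storm", "ship ice north", "sledges ice", "liberty labor wealth"]
toy_labels = ["Matthew", "Matthew", "Matthew", "Emma"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

# A modern message with no words from the training vocabulary
modern_text = "lol running late, text you soon"
modern_vector = toy_vectorizer.transform([modern_text])
print(modern_vector.sum())  # number of known words counted in the message

# With no evidence from the words, the probabilities equal the class priors
modern_proba = toy_classifier.predict_proba(modern_vector)
print(toy_classifier.classes_)
print(modern_proba)
print(np.exp(toy_classifier.class_log_prior_))
```

In this sketch the classifier still names an author, but only because that author wrote more of the training documents, not because of anything in the message.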

In [36]:
predictions_proba = friends_classifier.predict_proba(mistery_vector)

In [37]:
print(predictions_proba)

[[1.10199321e-02 9.88977727e-01 2.34054697e-06]]
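
The columns of the `predict_proba()` output follow the order of the classifier's `classes_` attribute, which scikit-learn stores sorted. With the labels used here that order is Emma, Matthew, Tingfang, so the second value above (about 0.989) is the probability for Matthew. A minimal sketch of pairing labels with probabilities, on a made-up corpus with hypothetical `toy_*` names:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical three-author corpus
toy_docs = [
    "ice ship storm",
    "ship ice north",
    "liberty labor wealth",
    "america customs travel",
]
toy_labels = ["Matthew", "Matthew", "Emma", "Tingfang"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

toy_proba = toy_classifier.predict_proba(toy_vectorizer.transform(["ice around the ship"]))

# classes_ gives the label for each probability column
for label, probability in zip(toy_classifier.classes_, toy_proba[0]):
    print(f"{label}: {probability:.3f}")
```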
