In [1]:
import pandas as pd
data = pd.read_parquet("data/training.parquet")

In [2]:
data

Unnamed: 0,label,text
0,legitimate,You must write to me. Catherine sighed. And th...
1,legitimate,Who would have thought Mr. Crawford sure of he...
2,legitimate,He had only himself to please in his choice: h...
3,legitimate,Oh! One accompaniment to her song took her agr...
4,legitimate,"As soon as breakfast was over, she went to her..."
5,legitimate,Mrs Clay's selfishness was not so great as to ...
6,legitimate,"But self, though it would intrude, could not e..."
7,legitimate,"Elizabeth, though she did not wish to slight. ..."
8,legitimate,Edmund had descended from that moral elevation...
9,legitimate,I read up on the morrow the Crawfords were eng...


In [3]:
import spacy
# "en" is a spaCy 2.x shortcut link; spaCy 3+ needs a full model name such as "en_core_web_sm"
english = spacy.load("en")

In [4]:
data["text"].iloc[0]

"You must write to me. Catherine sighed. And there are other circumstances which I am now satisfied that I never brewed it. They will read together. Her praise had been given her at different times, but _this_ is the true one. So surrounded, so caressed, she was even positively civil; but it was not directed to me--it was to Mrs. Weston. And besides the operation of a sensible, intelligent man like Mr. Allen. I see that more than a little proud-looking woman of uncordial address, who met her husband's sisters without any affection, and almost without beauty. I walked over the the vending machine so I am very sorry--extremely sorry--But, Miss Smith, indeed!--Oh! Could she but have given Harriet her feelings about it all!"

In [5]:
doc = english(data["text"].iloc[0])

In [6]:
type(doc)

spacy.tokens.doc.Doc

We can use spaCy to identify parts of speech.

In [7]:
for token in doc:
    print("%s is a %s" % (token.text, token.pos_))

You is a PRON
must is a VERB
write is a VERB
to is a ADP
me is a PRON
. is a PUNCT
Catherine is a PROPN
sighed is a VERB
. is a PUNCT
And is a CCONJ
there is a ADV
are is a VERB
other is a ADJ
circumstances is a NOUN
which is a ADJ
I is a PRON
am is a VERB
now is a ADV
satisfied is a ADJ
that is a ADP
I is a PRON
never is a ADV
brewed is a VERB
it is a PRON
. is a PUNCT
They is a PRON
will is a VERB
read is a VERB
together is a ADV
. is a PUNCT
Her is a ADJ
praise is a NOUN
had is a VERB
been is a VERB
given is a VERB
her is a PRON
at is a ADP
different is a ADJ
times is a NOUN
, is a PUNCT
but is a CCONJ
_ is a VERB
this is a DET
_ is a NOUN
is is a VERB
the is a DET
true is a ADJ
one is a NOUN
. is a PUNCT
So is a ADV
surrounded is a VERB
, is a PUNCT
so is a ADV
caressed is a ADJ
, is a PUNCT
she is a PRON
was is a VERB
even is a ADV
positively is a ADV
civil is a ADJ
; is a PUNCT
but is a CCONJ
it is a PRON
was is a VERB
not is a ADV
directed is a VERB
to is a ADP
me is a PRON
-- i

We can also use spaCy to identify the base forms of words -- it does this with a combination of part-of-speech-specific rules and a dictionary of exceptions.  The spaCy component that does this is called a [_lemmatizer_](https://en.wikipedia.org/wiki/Lemma_%28morphology%29).
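
To make the "rules plus a dictionary of exceptions" idea concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the tables `EXCEPTIONS`, `RULES`, and `KNOWN` and the function `toy_lemma` are hypothetical names invented for this illustration, and the tables are far smaller than anything a real lemmatizer would use.

```python
# Toy, hypothetical tables -- NOT spaCy's data, just enough to show the idea.
EXCEPTIONS = {"VERB": {"are": "be", "am": "be", "was": "be"}}
RULES = {
    # (suffix to strip, replacement) pairs, tried in order
    "VERB": [("ed", ""), ("ed", "e"), ("s", "")],
    "NOUN": [("s", "")],
}
KNOWN = {
    "VERB": {"sigh", "brew", "write"},
    "NOUN": {"circumstance", "time"},
}


def toy_lemma(word, pos):
    """Return a base form for `word`, given its part-of-speech tag."""
    word = word.lower()
    # 1. Irregular forms are looked up directly.
    exceptions = EXCEPTIONS.get(pos, {})
    if word in exceptions:
        return exceptions[word]
    # 2. Otherwise try each suffix rule and accept the first candidate
    #    that appears in the list of known base forms.
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in KNOWN.get(pos, set()):
                return candidate
    # 3. Fall back to the lowercased word itself.
    return word


for word, pos in [("sighed", "VERB"), ("are", "VERB"), ("circumstances", "NOUN")]:
    print("%s has a base form of %s" % (word, toy_lemma(word, pos)))
```

Because the rules depend on the part of speech, the same surface word can get different base forms under different tags, which is why the lemmatizer is given `token.pos_` as well as `token.text`.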

In [8]:
# spaCy 2.x-era API: these names are not importable from the same paths in spaCy 3
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

In [9]:
for token in doc:
    print("%s has a base form of %s" % (token.text, lemmatizer(token.text, token.pos_)))

You has a base form of ['you']
must has a base form of ['must']
write has a base form of ['write']
to has a base form of ['to']
me has a base form of ['me']
. has a base form of ['.']
Catherine has a base form of ['catherine']
sighed has a base form of ['sigh']
. has a base form of ['.']
And has a base form of ['and']
there has a base form of ['there']
are has a base form of ['be']
other has a base form of ['othe', 'oth']
circumstances has a base form of ['circumstance']
which has a base form of ['which']
I has a base form of ['i']
am has a base form of ['be']
now has a base form of ['now']
satisfied has a base form of ['satisfied']
that has a base form of ['that']
I has a base form of ['i']
never has a base form of ['never']
brewed has a base form of ['brew']
it has a base form of ['it']
. has a base form of ['.']
They has a base form of ['they']
will has a base form of ['will']
read has a base form of ['read']
together has a base form of ['together']
. has a base form of ['.']
Her ha

We can apply this process to the entire data frame if we'd like, but it might take a while.

In [10]:
def lemmas(s):
    # Skip sentence punctuation and keep the first candidate lemma for each token.
    punctuation = {".", "!", "?", ",", ";"}
    return " ".join(lemmatizer(token.text, token.pos_)[0] for token in english(s) if token.text not in punctuation)

data["lemmas"] = data["text"].apply(lemmas)

Having words in more-or-less canonical forms is useful, but we'll want a different representation to identify structure in our data.  Remember, our goal is to learn a function that can distinguish documents that are likely to be legitimate messages (i.e., prose in the style of Jane Austen) from documents that are likely to be spam messages (i.e., prose in the style of food-product reviews).  

_Feature engineering_ is the name for the process of turning real-world data into a form that a machine learning algorithm can take advantage of.  You'll learn more about this process in the next notebook; here, we'll just take a very basic approach that will let us visualize our data.  Logically, here's what we'll do:

1.  We'll collect word counts for each example, showing us how many times each word appears in a given document;
2.  We'll then turn those raw counts into frequencies (i.e., for a given word, what fraction of the words in a given document are that word?), giving us a mapping from words to frequencies for each document;
3.  Finally, we'll encode these mappings as fixed-size vectors in a space-efficient way, by using a hash function to determine which vector element should get a given frequency.  Hashing has a few advantages, but for our purposes the most important advantage is that we don't need to know all of the words we might see in advance. 

(That's what we'll _logically_ do -- we'll _actually_ do these steps a bit out of order because it will make our code simpler and more efficient without changing the results.)
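
As a minimal sketch of the three logical steps in the order listed above, the following uses only the standard library. The names `stable_hash` and `logical_steps` are hypothetical, and using `zlib.crc32` as the hash function is an assumption made so the example is repeatable; the notebook itself passes Python's built-in `hash`.

```python
from collections import Counter
import zlib


def stable_hash(term):
    # Assumption: crc32 stands in for the hash function so results repeat across runs.
    return zlib.crc32(term.encode("utf-8"))


def logical_steps(words, vecsize):
    # Step 1: raw counts for each word in the document.
    counts = Counter(words)
    # Step 2: counts become frequencies (fractions of the document's words).
    total = sum(counts.values())
    freqs = {word: count / total for word, count in counts.items()}
    # Step 3: each frequency is added to the vector element chosen by the hash.
    vec = [0.0] * vecsize
    for word, freq in freqs.items():
        vec[stable_hash(word) % vecsize] += freq
    return vec


vec = logical_steps("she be very very civil".split(" "), 16)
print(len(vec), sum(vec))
```

Every word contributes its frequency to exactly one element, so the elements of the vector sum to 1 (up to floating-point rounding) no matter how many distinct words the document contains.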

In [11]:
import numpy as np

def hashing_frequency(vecsize, h):
    """ 
    returns a function that will collect term frequencies 
    into a vector with _vecsize_ elements and will use 
    the hash function _h_ to choose which vector element 
    to update for a given term
    """
    
    def hf(words):
        if isinstance(words, str):
            # handle both lists of words and space-delimited strings
            words = words.split(" ")

        result = np.zeros(vecsize)
        for term in words:
            result[h(term) % vecsize] += 1.0
        result = result / result.sum()
        return result
    
    return hf

In [12]:
# hash() is salted per process for strings; set PYTHONHASHSEED for repeatable buckets
data["vectors"] = data["lemmas"].apply(hashing_frequency(2048, hash))

So now instead of having documents (which we had from the raw data) or lists of word lemmas, we have vectors representing word lemma frequencies.  Because we've hashed lemmas into these vectors, and different lemmas can land in the same bucket, we can't in general reconstruct the list of words from a vector.  We _do_ know, though, that if the same lemma appears in two documents, both vectors will have a nonzero value in the same bucket.
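
Here is a small, self-contained illustration of both properties. It is a sketch rather than the notebook's code: `bucket` and `term_frequencies` are hypothetical helpers, and `zlib.crc32` is assumed as the hash function so that the bucket numbers repeat across runs.

```python
import zlib


def bucket(term, vecsize):
    # Assumption: crc32 replaces the built-in hash to keep buckets repeatable.
    return zlib.crc32(term.encode("utf-8")) % vecsize


def term_frequencies(words, vecsize):
    vec = [0.0] * vecsize
    for word in words:
        vec[bucket(word, vecsize)] += 1.0
    total = sum(vec)
    return [v / total for v in vec]


doc_a = "she write a letter".split(" ")
doc_b = "the letter be long".split(" ")
a = term_frequencies(doc_a, 2048)
b = term_frequencies(doc_b, 2048)

# The shared lemma "letter" lands in the same bucket in both vectors.
shared = bucket("letter", 2048)
print(a[shared] > 0 and b[shared] > 0)

# With fewer buckets than distinct words, at least two words must collide,
# so the original words cannot be recovered from the vector alone.
words = ["she", "write", "a", "letter", "long"]
tiny_buckets = {bucket(word, 4) for word in words}
print(len(tiny_buckets) < len(words))
```

The second check is just the pigeonhole principle: five distinct words cannot occupy five different buckets when only four exist. With 2,048 buckets collisions are rarer, but they remain possible.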

However, each of these vectors has 2,048 elements.  Recall that our ultimate goal is to place documents in space so that we can identify a way to separate legitimate documents from spam documents.  A 2,048-element vector is a point in a space, but it's a point in a space that most of our geometric intuitions don't apply to (some of us have enough trouble navigating the three dimensions of the physical world).  Let's use a very basic technique to project these vectors into a much smaller space that we can visualize.

In [13]:
from sklearn import random_projection
DIMENSIONS = 2

rp = random_projection.SparseRandomProjection(n_components=DIMENSIONS)
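
The cell above only constructs the projection object. As a rough illustration of what a random projection does, here is a pure-Python sketch that multiplies a vector by a fixed matrix of random Gaussian entries. This is an assumption-laden stand-in, not scikit-learn's algorithm: `SparseRandomProjection` uses a sparse random matrix, and `make_projection` and `project` are hypothetical names.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    # One row per output dimension; entries drawn from N(0, 1/out_dim).
    rng = random.Random(seed)
    scale = 1.0 / (out_dim ** 0.5)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    # Each output coordinate is the dot product of one random row with the input.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


projection = make_projection(2048, 2)

# A made-up frequency vector with two occupied buckets.
vec = [0.0] * 2048
vec[3] = 0.5
vec[100] = 0.5

point = project(projection, vec)
print(len(point))
```

The same matrix has to be used for every document so that all of the resulting two-dimensional points live in the same coordinate system and can be plotted together.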
