At the beginning of this class, you identified emails by their authors using a number of supervised classification algorithms. In those projects, we handled the preprocessing for you, transforming the input emails into a TfIdf representation so they could be fed into the algorithms. Now you will construct your own version of that preprocessing step, so that you are going directly from raw data to processed features.

You will be given two text files: one contains the locations of all the emails from Sara, the other the locations of all the emails from Chris. You will also have access to the parseOutText() function, which accepts an opened email as an argument and returns a string containing all the (stemmed) words in the email.

You’ll start with a warmup exercise to get acquainted with parseOutText(). Go to the tools directory and run parse_out_email_text.py, which contains parseOutText() and a test email to run this function over.

https://github.com/mudspringhiker/ud120-projects/blob/master/tools/parse_out_email_text.py

parseOutText() takes the opened email and returns only the text part, stripping away any metadata that may occur at the beginning of the email, so what's left is the text of the message. We currently have this script set up so that it will print the text of the email to the screen. What is the text that you get when you run parseOutText()?
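As a rough illustration of the metadata-stripping step described above, the sketch below applies the same split on `X-FileName:` to a small in-memory message. The helper name `strip_metadata` and the sample message are made up for this example, and the punctuation removal uses the Python 3 form of `str.translate` (the notebook itself runs on Python 2), so treat it as a sketch rather than the project's code.

```python
import io
import string

def strip_metadata(f):
    """Return the punctuation-free text that follows the X-FileName: header."""
    f.seek(0)  # go back to the beginning of the file
    parts = f.read().split("X-FileName:")
    if len(parts) < 2:
        return ""  # no metadata marker found, so nothing is returned
    # Python 3 way to delete every punctuation character
    return parts[1].translate(str.maketrans("", "", string.punctuation))

# Hypothetical email, held in memory instead of on disk
sample = io.StringIO(
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-FileName: inbox.pst\n"
    "\n"
    "Hi Everyone, this is the body!\n"
)

body = strip_metadata(sample)
print(body.split())
# ['inboxpst', 'Hi', 'Everyone', 'this', 'is', 'the', 'body']
```

Anything on the same line as the `X-FileName:` header survives the split, which is why the processed emails later in this notebook start with tokens such as `nonprivilegedpst`.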


In [1]:
from nltk.stem.snowball import SnowballStemmer
import string

In [2]:
def parseOutText(f):
    """ 
    given an opened email file f, parse out all text below the
    metadata block at the top
    (in Part 2, you will also add stemming capabilities)
    and return a string that contains all the words in the 
    email (space-separated)
    
    example use case:
    f = open("email_file_name.txt", "r")
    text = parseOutText(f)
    
    """
    f.seek(0)  # go back to beginning of file
    all_text = f.read()
    
    # split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        # remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)
        
        # project part 2: comment out the line below
        words = text_string
        
        # split the text string into individual words, stem each word,
        # and append the stemmed word to words (make sure there's a single
        # space between each stemmed word)
        
    return words

In [3]:
def main():
    # open the test email, parse it, and let the with-block close the file
    with open("../ud120-projects/text_learning/test_email.txt", "r") as ff:
        text = parseOutText(ff)
    print text

In [4]:
if __name__ == '__main__':
    main()



Hi Everyone  If you can read this message youre properly using parseOutText  Please proceed to the next part of the project



### Deploying Stemming

In parseOutText(), comment out the following line: 
```
words = text_string 
```
Augment parseOutText() so that the string it returns has all the words stemmed using a SnowballStemmer (use the nltk package; some examples that I found helpful can be found here: http://www.nltk.org/howto/stem.html ). Rerun parse_out_email_text.py, which will use your updated parseOutText() function--what’s your output now?

Hint: you'll need to break the string down into individual words, stem each word, then recombine all the words into one string.
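One way to picture the hint above is the split/stem/join pattern below. To keep the sketch free of dependencies, `toy_stem` is a made-up stand-in that only lower-cases a word and strips one trailing "s"; in the exercise you would pass `SnowballStemmer("english").stem` from nltk instead. The function name `stem_words` is also just an assumption for illustration.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: lower-case and strip one trailing 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word

def stem_words(text, stem=toy_stem):
    """Split text on whitespace, stem each word, rejoin with single spaces."""
    return " ".join(stem(word) for word in text.split())

print(stem_words("Hi  Everyone   please read these messages"))
# hi everyone please read these message
```

Unlike the loop used in this notebook, `" ".join(...)` leaves no trailing space, so its output would differ from the notebook's by that final character.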


In [5]:
def parseOutText(f):
    """ 
    given an opened email file f, parse out all text below the
    metadata block at the top
    (in Part 2, you will also add stemming capabilities)
    and return a string that contains all the words in the 
    email (space-separated)
    
    example use case:
    f = open("email_file_name.txt", "r")
    text = parseOutText(f)
    
    """
    f.seek(0)  # go back to beginning of file
    all_text = f.read()
    
    # split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        # remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)
        
        # project part 2: comment out the line below
        # words = text_string
        
        # split the text string into individual words, stem each word,
        # and append the stemmed word to words (make sure there's a single
        # space between each stemmed word)
        
        
        stemmer = SnowballStemmer("english")
        for word in text_string.split():
            # stem each word and append it, followed by a single space
            words = words + stemmer.stem(word) + " "
        
        
    return words

In [6]:
if __name__ == '__main__':
    main()

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project 


### Clean Away "Signature Words" 

https://github.com/mudspringhiker/ud120-projects/blob/master/text_learning/vectorize_text.py

In vectorize_text.py, you will iterate through all the emails from Chris and from Sara. For each email, feed the opened email to parseOutText() and return the stemmed text string. Then do two things:

- remove signature words (“sara”, “shackleton”, “chris”, “germani”--bonus points if you can figure out why it's "germani" and not "germany")
- append the updated text string to word_data -- if the email is from Sara, append 0 (zero) to from_data, or append a 1 if Chris wrote the email.

Once this step is complete, you should have two lists: one contains the stemmed text of each email, and the second should contain the labels that encode (via a 0 or 1) who the author of that email is.
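The two steps above can be sketched on in-memory strings. The stemmed sentences and the helper names are invented for illustration. The first helper follows the `str.replace()` approach the exercise asks for; the second is an optional whole-word variant using a regular expression, shown only to highlight that `str.replace()` also removes the names when they occur inside longer tokens. Switching to the second variant would change the quiz answers, so it is a comparison rather than a recommendation.

```python
import re

SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]

def remove_with_replace(text):
    """Remove every occurrence of each signature word, even inside other words."""
    for word in SIGNATURE_WORDS:
        text = text.replace(word, "")
    return text

def remove_whole_words(text):
    """Optional variant: remove signature words only when they stand alone."""
    pattern = r"\b(?:" + "|".join(SIGNATURE_WORDS) + r")\b"
    return re.sub(pattern, "", text)

# Hypothetical stemmed emails, tagged with their author
emails = [
    ("sara", "thank sara shackleton pleas call christin "),
    ("chris", "chris germani will send the gas schedul "),
]

word_data = []
from_data = []
for author, text in emails:
    word_data.append(remove_with_replace(text))
    from_data.append(1 if author == "chris" else 0)  # Sara -> 0, Chris -> 1

print(word_data[0].split())
# ['thank', 'pleas', 'call', 'tin']
print(remove_whole_words(emails[0][1]).split())
# ['thank', 'pleas', 'call', 'christin']
print(from_data)
# [0, 1]
```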

Running over all the emails can take a little while (5 minutes or more), so we've added a temp_counter to cut things off after the first 200 emails. Of course, once everything is working, you'd want to run over the full dataset.

In the box below, put the string that you get for word_data[152].

In [7]:
import os
import pickle
import re

In [8]:
"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.
    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)
    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.
    The data is stored in lists and packed away in pickle files at the end.
"""
from_sara = open("../ud120-projects/text_learning/from_sara.txt", "r")
from_chris = open("../ud120-projects/text_learning/from_chris.txt", "r")

from_data = []
word_data = []

temp_counter is a way to speed up development--there are thousands of emails from Sara and Chris, so running over all of them can take a long time. temp_counter lets you look at only the first 200 emails in the list, so you can iterate on your modifications more quickly.
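To see what the cut-off does in practice, here is a small sketch on a made-up list of paths (the list contents are placeholders, not real dataset paths). It compares the increment-then-check pattern used in the next cell with `itertools.islice`, a standard-library way to cap an iteration. It is written for Python 3.

```python
from itertools import islice

# Placeholder paths standing in for the lines of from_sara.txt
paths = ["maildir/example/%d." % i for i in range(500)]

# Counter pattern as used in this notebook: increment first, then check
temp_counter = 0
processed = []
for path in paths:
    temp_counter += 1
    if temp_counter < 200:
        processed.append(path)

# Alternative: take exactly the first 200 items and stop iterating
capped = list(islice(paths, 200))

print(len(processed), len(capped))
# 199 200
```

With the `< 200` check the counter pattern keeps 199 items, which matches the `len(from_data)` of 199 reported further down. Because the counter is shared across both files, the capped run only ever reaches Sara's emails, so every label in `from_data` is 0 until the counter is removed.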

In [9]:
temp_counter = 0

In [10]:
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        # only look at first 200 emails when developing
        # once everything is working, remove this line to run over full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('../ud120-projects', path[:-1])
            print path
            email = open(path, "r")
            
            # use parseOutText() to extract the text from the opened email
            email_text = parseOutText(email)
            
            # use str.replace() to remove any instances of the words 
            # ["sara", "shackleton", "chris", "germani"]
            word_list = ["sara", "shackleton", "chris", "germani"]
            for word in word_list:
                email_text = email_text.replace(word, "")
                
            # append the text to word_data
            word_data.append(email_text)
            
            # append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == from_chris:
                from_data.append(1)
            else:
                from_data.append(0)
            email.close()

../ud120-projects/maildir/bailey-s/deleted_items/101.
../ud120-projects/maildir/bailey-s/deleted_items/106.
../ud120-projects/maildir/bailey-s/deleted_items/132.
../ud120-projects/maildir/bailey-s/deleted_items/185.
../ud120-projects/maildir/bailey-s/deleted_items/186.
../ud120-projects/maildir/bailey-s/deleted_items/187.
../ud120-projects/maildir/bailey-s/deleted_items/193.
../ud120-projects/maildir/bailey-s/deleted_items/195.
../ud120-projects/maildir/bailey-s/deleted_items/214.
../ud120-projects/maildir/bailey-s/deleted_items/215.
../ud120-projects/maildir/bailey-s/deleted_items/233.
../ud120-projects/maildir/bailey-s/deleted_items/242.
../ud120-projects/maildir/bailey-s/deleted_items/243.
../ud120-projects/maildir/bailey-s/deleted_items/244.
../ud120-projects/maildir/bailey-s/deleted_items/246.
../ud120-projects/maildir/bailey-s/deleted_items/247.
../ud120-projects/maildir/bailey-s/deleted_items/254.
../ud120-projects/maildir/bailey-s/deleted_items/259.
../ud120-projects/maildir/ba

In [11]:
print from_data

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [12]:
len(from_data)

199

In [13]:
from_sara.close()

In [14]:
from_chris.close()

In [15]:
word_data

[u'sbaile2 nonprivilegedpst susan pleas send the forego list to richard thank   enron wholesal servic 1400 smith street eb3801a houston tx 77002 ph 713 8535620 fax 713 6463490 ',
 u'sbaile2 nonprivilegedpst 1 txu energi trade compani 2 bp capit energi fund lp may be subject to mutual termin 2 nobl gas market inc 3 puget sound energi inc 4 virginia power energi market inc 5 t boon picken may be subject to mutual termin 5 neumin product co 6 sodra skogsagarna ek for probabl an ectric counterparti 6 texaco natur gas inc may be book incorrect for texaco inc financi trade 7 ace capit re oversea ltd 8 nevada power compani 9 prior energi corpor 10 select energi inc origin messag from tweed sheila sent thursday januari 31 2002 310 pm to   subject pleas send me the name of the 10 counterparti that we are evalu thank ',
 u'sbaile2 nonprivilegedpst all here the second tier of counterparti to add to the data retriev list 11 medianew group inc 12 macromedia incorpor 13 british airway plc 14 merc ir

In [16]:
# binary mode plus a with-block so the file is flushed and closed
with open("your_word_data.pkl", "wb") as f:
    pickle.dump(word_data, f)

In [17]:
# binary mode plus a with-block so the file is flushed and closed
with open("your_email_authors.pkl", "wb") as f:
    pickle.dump(from_data, f)
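A quick way to check that the two pickle files are usable is to load them back and compare. The sketch below does this in a temporary directory with tiny placeholder lists; the file names mirror the ones above, but the data is invented. It opens the files in binary mode, which Python 3 requires for pickle and which also works on Python 2.

```python
import os
import pickle
import tempfile

# Placeholder data standing in for the real lists
word_data = ["energi trade compani ", "pleas send list "]
from_data = [0, 1]

tmp_dir = tempfile.mkdtemp()
word_path = os.path.join(tmp_dir, "your_word_data.pkl")
author_path = os.path.join(tmp_dir, "your_email_authors.pkl")

# Write both lists; the with-blocks close the files afterwards
with open(word_path, "wb") as f:
    pickle.dump(word_data, f)
with open(author_path, "wb") as f:
    pickle.dump(from_data, f)

# Read them back and confirm nothing changed
with open(word_path, "rb") as f:
    loaded_words = pickle.load(f)
with open(author_path, "rb") as f:
    loaded_authors = pickle.load(f)

print(loaded_words == word_data, loaded_authors == from_data)
# True True
```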

#### Quiz: String you get for word_data[152]?

In [18]:
word_data[152]

u'tjonesnsf stephani and sam need nymex calendar '

### TfIdf It

Transform the word_data into a tf-idf matrix using the sklearn TfIdf transformation. Remove English stop words.

You can access the mapping between words and feature numbers using get_feature_names(), which returns a list of all the words in the vocabulary. How many different words are there?

Be sure to use the tf-idf Vectorizer class to transform the word data.

Don't forget to remove English stop words when you set up the vectorizer, using sklearn's stop word list (not NLTK).

http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
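To make the weighting less of a black box, here is a dependency-free sketch of the tf-idf formula that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The function name `tfidf` and the three toy documents are assumptions for illustration, tokenisation is a plain whitespace split, and no stop words are removed, so it will not reproduce `TfidfVectorizer` output on real emails.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed, L2-normalised tf-idf for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted(set(tok for toks in tokenised for tok in toks))
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # scale the row to unit length (guard against an empty document)
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(["energi trade compani", "energi market", "pleas send list"])
print(len(vocab))
# 7
print(rows[0][vocab.index("energi")] < rows[0][vocab.index("trade")])
# True
```

In the first document, "energi" gets a lower weight than "trade" because it also appears in the second document, which is the effect the idf term is meant to produce.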

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
vectorizer = TfidfVectorizer(stop_words='english')

In [21]:
vectorizer

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [22]:
vectorizer.fit_transform(word_data)

<199x3078 sparse matrix of type '<type 'numpy.float64'>'
	with 13199 stored elements in Compressed Sparse Row format>

In [23]:
len(vectorizer.get_feature_names())

3078

Removing the counter:

In [None]:
from_sara = open("../ud120-projects/text_learning/from_sara.txt", "r")
from_chris = open("../ud120-projects/text_learning/from_chris.txt", "r")

from_data = []
word_data = []

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        # only look at first 200 emails when developing
        # once everything is working, remove this line to run over full dataset
        #temp_counter += 1
        #if temp_counter < 200:
            path = os.path.join('../ud120-projects', path[:-1])
            print path
            email = open(path, "r")
            
            # use parseOutText() to extract the text from the opened email
            email_text = parseOutText(email)
            
            # use str.replace() to remove any instances of the words 
            # ["sara", "shackleton", "chris", "germani"]
            word_list = ["sara", "shackleton", "chris", "germani"]
            for word in word_list:
                email_text = email_text.replace(word, "")
                
            # append the text to word_data
            word_data.append(email_text)
            
            # append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == from_chris:
                from_data.append(1)
            else:
                from_data.append(0)
            email.close()

../ud120-projects/maildir/bailey-s/deleted_items/101.
../ud120-projects/maildir/bailey-s/deleted_items/106.
../ud120-projects/maildir/bailey-s/deleted_items/132.
../ud120-projects/maildir/bailey-s/deleted_items/185.
../ud120-projects/maildir/bailey-s/deleted_items/186.
../ud120-projects/maildir/bailey-s/deleted_items/187.
../ud120-projects/maildir/bailey-s/deleted_items/193.
../ud120-projects/maildir/bailey-s/deleted_items/195.
../ud120-projects/maildir/bailey-s/deleted_items/214.
../ud120-projects/maildir/bailey-s/deleted_items/215.
../ud120-projects/maildir/bailey-s/deleted_items/233.
../ud120-projects/maildir/bailey-s/deleted_items/242.
../ud120-projects/maildir/bailey-s/deleted_items/243.
../ud120-projects/maildir/bailey-s/deleted_items/244.
../ud120-projects/maildir/bailey-s/deleted_items/246.
../ud120-projects/maildir/bailey-s/deleted_items/247.
../ud120-projects/maildir/bailey-s/deleted_items/254.
../ud120-projects/maildir/bailey-s/deleted_items/259.
../ud120-projects/maildir/ba

In [25]:
vectorizer = TfidfVectorizer(stop_words='english')

In [26]:
vectorizer.fit_transform(word_data)

<17578x38757 sparse matrix of type '<type 'numpy.float64'>'
	with 1078821 stored elements in Compressed Sparse Row format>

How many unique words are in your TfIdf?

In [27]:
len(vectorizer.get_feature_names())

38757

### Accessing TfIdf Features

What is word number 34597 in your TfIdf?

(Just to be clear--if the question were "what is word number 100," we would be looking for the word corresponding to vocab_list[100]. Zero-indexed arrays are so confusing to talk about sometimes.)
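Here is a small sketch of the lookup on toy documents, assuming scikit-learn is installed. `get_feature_names()` is what this notebook uses; it was removed in scikit-learn 1.2 in favour of `get_feature_names_out()`, so the sketch tries the newer method first. The documents and variable names are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents; "the" is on sklearn's English stop word list
docs = ["the energi trade compani", "the energi market", "pleas send the list"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# Newer releases expose get_feature_names_out(); older ones get_feature_names()
if hasattr(vectorizer, "get_feature_names_out"):
    vocab_list = [str(word) for word in vectorizer.get_feature_names_out()]
else:
    vocab_list = vectorizer.get_feature_names()

# vocabulary_ maps each word to its column index in the matrix
index = vectorizer.vocabulary_["energi"]
print(vocab_list[index])
# energi
```

The list is sorted alphabetically and zero-indexed, so `vocab_list[100]` is the 101st word; `vectorizer.vocabulary_` gives the reverse mapping from word to column index.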

In [32]:
list_of_words = vectorizer.get_feature_names()

In [33]:
list_of_words[34597]

u'stephaniethank'