# Mini-Project - Feature Selection

##### Student Tags

Author: Anderson Hitoshi Uyekita    
Mini-Project: Feature Selection  
Course: Data Science - Foundations II  
COD: ND111  
Date: 23/01/2019    

***

## Table of Contents
- [Introduction](#intro)
- [Exercise 1](#part_i_1)
- [Exercise 2](#part_i_2)
- [Exercise 3](#part_i_3)
- [Exercise 4](#part_i_4)
- [Exercise 5](#part_i_5)
- [Exercise 6](#part_i_6)
- [Exercise 7](#part_i_7)
- [Exercise 8](#part_i_8)
- [Exercise 9](#part_i_9)

***

In [1]:
# Importing Libraries.
import numpy as np
import pandas as pd

## General Information

This Jupyter Notebook (written in Python 2) is a reproducible record of the Feature Selection mini-project.

## Introduction <a id='intro'></a>

Katie explained in a video a problem that arose in preparing Chris and Sara’s email for the author identification project; it had to do with a feature that was a little too powerful (effectively acting like a signature, which gives an arguably unfair advantage to an algorithm). You’ll work through that discovery process here.

## Exercise 1 - Overfitting a Decision Tree 1 <a id='part_i_1'></a>

This bug was found when Katie was trying to make an overfit decision tree to use as an example in the decision tree mini-project. A decision tree is classically an algorithm that can be easy to overfit; one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.

>**If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low?**

Low

## Exercise 2 - Overfitting a Decision Tree 2 <a id='part_i_2'></a>

>**If a decision tree is overfit, would you expect high or low accuracy on the training set?**

High.

Exactly. The accuracy would be very high on the training set, but would plummet once it was actually tested.
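The gap between training and test accuracy can be sketched without scikit-learn. The example below is only an illustration under stated assumptions: it uses a 1-nearest-neighbour memorizer as a stand-in for an overfit decision tree, and the data is synthetic (30 training points, 200 features, of which only the first carries any signal). All function names are hypothetical.

```python
import random


def nearest_neighbour_predict(train_x, train_y, point):
    """Return the label of the training point closest to `point`."""
    best_label, best_dist = None, float("inf")
    for x, y in zip(train_x, train_y):
        dist = sum((a - b) ** 2 for a, b in zip(x, point))
        if dist < best_dist:
            best_label, best_dist = y, dist
    return best_label


def memorizer_accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the memorizer labels correctly."""
    hits = sum(1 for x, y in zip(xs, ys)
               if nearest_neighbour_predict(train_x, train_y, x) == y)
    return hits / float(len(ys))


def make_data(rng, n_points, n_features):
    """Synthetic data: the label depends only on the first feature."""
    xs = [[rng.random() for _ in range(n_features)] for _ in range(n_points)]
    ys = [1 if x[0] > 0.5 else 0 for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = make_data(rng, 30, 200)    # few points, many features
test_x, test_y = make_data(rng, 200, 200)

train_acc = memorizer_accuracy(train_x, train_y, train_x, train_y)
test_acc = memorizer_accuracy(train_x, train_y, test_x, test_y)
print("train accuracy: %.2f" % train_acc)
print("test accuracy: %.2f" % test_acc)
```

Every training point is its own nearest neighbour, so the training accuracy is perfect, while the 199 noise features swamp the single informative one on unseen points.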

## Exercise 3 - Number of Features and Overfitting <a id='part_i_3'></a>

A classic way to overfit an algorithm is by using lots of features and not a lot of training data. You can find the starter code in feature_selection/find_signature.py. Get a decision tree up and training on the training data, and print out the accuracy. 

In [2]:
#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl" 
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import model_selection
features_train, features_test, labels_train, labels_test = model_selection.train_test_split(word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()


### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

### your code goes here

# Importing the decision tree module from scikit-learn.
from sklearn import tree

# Importing accuracy_score from scikit-learn to calculate the accuracy.
from sklearn.metrics import accuracy_score

# Importing time to compute the time elapsed.
from time import time

# Creating the decision tree classifier; a node is split only if it holds at least 50 samples.
clf = tree.DecisionTreeClassifier(min_samples_split = 50)

# Saving the start time to compute the elapsed time of the fitting process.
t0 = time()

# Fitting/Training clf on the training arrays.
clf.fit(features_train, labels_train)

# Printing the elapsed time of the fit.
print "training time:", round(time()-t0, 3), "s"

# Saving the start time to compute the elapsed time of the predicting process.
t1 = time()

# Storing the predictions for features_test in pred.
pred = clf.predict(features_test)

# Printing the elapsed time of the prediction.
print "predict time:", round(time()-t1, 3), "s"

# Calculating the accuracy (true labels first, predictions second) and storing it in acc.
acc = accuracy_score(labels_test, pred)

training time: 0.173 s
predict time: 0.443 s


In [3]:
# Number of training points.
features_train.shape[0]

150L

>**How many training points are there, according to the starter code?**

150

## Exercise 4 - Accuracy of Your Overfit Decision Tree <a id='part_i_4'></a>

>**What’s the accuracy of the decision tree you just made?** (Remember, we're setting up our decision tree to overfit -- ideally, we want to see the test accuracy as relatively low.)



In [4]:
# Printing the acc.
print "Accuracy:", round(acc,4)

Accuracy: 0.9477


## Exercise 5 - Identify the Most Powerful Features <a id='part_i_5'></a>

Take your (overfit) decision tree and use the feature_importances_ attribute to get a list of the relative importance of all the features being used. We suggest iterating through this list (it’s long, since this is text data) and only printing out the feature importance if it’s above some threshold (say, 0.2--remember, if all words were equally important, each one would give an importance of far less than 0.01).
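The thresholding step described above can be sketched as a small helper. This is a minimal illustration, not the course's reference solution: the function name `important_features` and the toy importance values are invented here, and with a fitted tree you would pass `clf.feature_importances_` instead.

```python
def important_features(importances, threshold=0.2):
    """Return (index, importance) pairs whose importance exceeds threshold."""
    return [(index, value)
            for index, value in enumerate(importances)
            if value > threshold]


# Toy importances, invented for illustration only.
toy_importances = [0.0, 0.05, 0.75, 0.0, 0.2]

for index, value in important_features(toy_importances):
    print("feature %d has importance %.2f" % (index, value))
```

The comparison is strict, so a feature sitting exactly at the threshold is not reported.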

>**What’s the importance of the most important feature? What is the number of this feature?**

In [5]:
# Importance of the most important feature.
max(clf.feature_importances_)

0.7647058823529412

In [6]:
# Importing NumPy.
import numpy as np

# Finding the index of the feature with the highest importance (0.7647058823529412).
np.argmax(clf.feature_importances_)

33614

## Exercise 6 - Use TfIdf to Get the Most Important Word <a id='part_i_6'></a>

In order to figure out what words are causing the problem, you need to go back to the TfIdf and use the feature numbers that you obtained in the previous part of the mini-project to get the associated words. You can return a list of all the words in the TfIdf by calling get_feature_names() on it; pull out the word that’s causing most of the discrimination of the decision tree. What is it? Does it make sense as a word that’s uniquely tied to either Chris Germany or Sara Shackleton, a signature of sorts?
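The lookup from feature number to word works because the vectorizer keeps its vocabulary in a fixed order, with one column per word. The sketch below imitates that with plain Python, assuming simple whitespace tokenization and an alphabetically sorted vocabulary; the real `TfidfVectorizer` also applies its own tokenizer, stop-word removal and `max_df` filtering. `build_vocabulary` and the two toy documents are hypothetical.

```python
def build_vocabulary(documents):
    """Return the alphabetically sorted list of distinct tokens."""
    tokens = set()
    for document in documents:
        tokens.update(document.lower().split())
    return sorted(tokens)


# Toy documents, invented for illustration only.
toy_documents = ["meeting sshacklensf", "gas cgermannsf meeting"]

vocab = build_vocabulary(toy_documents)

# Map each word to its column index, and look a word up by index.
word_index = dict((word, index) for index, word in enumerate(vocab))

print(vocab)
print(vocab[3])
print(word_index["meeting"])
```

In the notebook itself, the same lookup is done by indexing the list returned by the vectorizer with the feature number found in the previous exercise.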

In [7]:
# Which word has index 33614?
vectorizer.get_feature_names()[33614].encode('utf-8')

'sshacklensf'

>**What is the most powerful word when your decision tree is making its classification decisions?**

sshacklensf

## Exercise 7 - Remove, Repeat <a id='part_i_7'></a>

This word seems like an outlier in a certain sense, so let’s remove it and refit. Go back to text_learning/vectorize_text.py, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc. Rerun vectorize_text.py, and once that finishes, rerun find_signature.py. Any other outliers pop up? What word is it? Seem like a signature-type word? (Define an outlier as a feature with importance >0.2, as before).
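The removal step can be written as a loop over a list of signature words instead of a chain of calls. The sketch below is an illustration only: both helper names and the sample string are hypothetical. It also shows a side effect worth knowing: `str.replace` deletes substrings, so removing "chris" also alters longer words that contain it. The whole-word variant is an alternative, not what the starter code asks for.

```python
def remove_substrings(text, words):
    """Delete every occurrence of each word, even inside longer words."""
    for word in words:
        text = text.replace(word, "")
    return text


def remove_whole_words(text, words):
    """Drop only tokens that match a word exactly; rejoin with single spaces."""
    banned = set(words)
    return " ".join(token for token in text.split() if token not in banned)


signature_words = ["sara", "shackleton", "chris", "germani", "sshacklensf"]

# Toy string, invented for illustration only.
sample = "sshacklensf pleas call chris befor christma"

print(repr(remove_substrings(sample, signature_words)))
print(repr(remove_whole_words(sample, signature_words)))
```

Adding a newly discovered signature word then only requires appending it to `signature_words` before rerunning the preprocessing.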

In [8]:
#!/usr/bin/python

from nltk.stem.snowball import SnowballStemmer
import string

def parseOutText_v2(f):
    """ given an opened email file f, parse out all text below the
        metadata block at the top
        (in Part 2, you will also add stemming capabilities)
        and return a string that contains all the words
        in the email (space-separated) 
        
        example use case:
        f = open("email_file_name.txt", "r")
        text = parseOutText_v2(f)
        
    """

    f.seek(0)  ### go back to beginning of file (annoying)
    all_text = f.read()

    ### split off metadata
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        ### remove punctuation
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)

        # Creating the stemmer using English as the language.
        stemmer = SnowballStemmer("english")

        # Splitting on any whitespace (newlines, tabs, repeated spaces), which
        # also drops leading and trailing whitespace, then stemming each word.
        stemmed_words = [stemmer.stem(word) for word in text_string.split()]

        # Joining the stemmed words back into a single space-separated string.
        words = ' '.join(stemmed_words)

    # Converting to UTF-8
    return words.encode("utf-8")

def main():
    ff = open("../text_learning/test_email.txt", "r")
    text = parseOutText_v2(ff)
    print text

if __name__ == '__main__':
    main()

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project


In [9]:
#!/usr/bin/python

import os
import pickle
import re
import sys
import time


"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""

from_sara  = open("../text_learning/from_sara.txt", "r")
from_chris = open("../text_learning/from_chris.txt", "r")

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0

# Start
start = time.time()

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### temp_counter could limit the run to the first 200 emails when developing;
        ### the condition below is always true, so the full dataset is processed
        temp_counter += 1
        if temp_counter > -1:
            path = os.path.join('..', path[:-1])
            #print path
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            parsed_email = parseOutText_v2(email)
                
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            parsed_email_clean = parsed_email.replace('sara','').replace('chris','').replace('shackleton','').replace('germani','')
            
            parsed_email_clean = parsed_email_clean.replace('sshacklensf','')
            
            ### append the text to word_data
            word_data.append(parsed_email_clean)
            
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == 'sara':
                from_data.append(0)
            elif name == 'chris':
                from_data.append(1)
            else:
                print "ERROR!!"
            
            email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data_fs.pkl", "w") )
pickle.dump( from_data, open("your_email_authors_fs.pkl", "w") )

# Stop
stop = time.time()

print "Processing time: ", round((stop - start)/60,2), "minutes"

emails processed
Processing time:  1.85 minutes


In [10]:
### in Part 4, do TfIdf vectorization here

# Importing the TfIdf package.
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating the vectorizer using english stopwords.
vectorizer = TfidfVectorizer(stop_words = 'english', lowercase=True)

# Fitting and Transforming.
vectorizer.fit_transform(word_data)

# Creating the Vocabulary List.
vocab_list = vectorizer.get_feature_names()

# Length of the Vocabulary.
len(vectorizer.get_feature_names())

38756

In [11]:
#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "your_word_data_fs.pkl" 
authors_file = "your_email_authors_fs.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import model_selection
features_train, features_test, labels_train, labels_test = model_selection.train_test_split(word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()


### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

### your code goes here

# Importing the decision tree module from scikit-learn.
from sklearn import tree

# Importing accuracy_score from scikit-learn to calculate the accuracy.
from sklearn.metrics import accuracy_score

# Importing time to compute the time elapsed.
from time import time

# Creating the decision tree classifier; a node is split only if it holds at least 50 samples.
clf = tree.DecisionTreeClassifier(min_samples_split = 50)

# Saving the start time to compute the elapsed time of the fitting process.
t0 = time()

# Fitting/Training clf on the training arrays.
clf.fit(features_train, labels_train)

# Printing the elapsed time of the fit.
print "training time:", round(time()-t0, 3), "s"

# Saving the start time to compute the elapsed time of the predicting process.
t1 = time()

# Storing the predictions for features_test in pred.
pred = clf.predict(features_test)

# Printing the elapsed time of the prediction.
print "predict time:", round(time()-t1, 3), "s"

# Calculating the accuracy (true labels first, predictions second) and storing it in acc.
acc = accuracy_score(labels_test, pred)

training time: 0.139 s
predict time: 0.48 s


In [12]:
# Converting the array into a DataFrame.
importance = pd.DataFrame(clf.feature_importances_, columns = ['imp'])

# Keeping only the features with importance greater than 0.2.
importance[importance.imp > 0.2]

| | imp |
|---|---|
| 14343 | 0.666667 |


In [13]:
# Finding the word behind the dominant feature.
vectorizer.get_feature_names()[14343].encode('utf-8')

'cgermannsf'

>**Does another highly powerful word arise after you get rid of the first "signature word"?** (Hint: the answer is yes)
>
>**What is this word?**

cgermannsf

## Exercise 8 - Checking Important Features Again <a id='part_i_8'></a>

Update vectorize_text.py one more time, and rerun. Then run find_signature.py again. Do any other important features (importance > 0.2) arise? How many? Do any of them look like “signature words”, or are they more “email content” words that look like they legitimately come from the text of the messages?
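To answer "how many" and "which words" in one pass, the importances can be paired with the feature names and ranked. This is a hedged sketch: `rank_important_words`, the toy names and the toy importances are all invented for illustration; in the notebook the inputs would be `clf.feature_importances_` and the vectorizer's list of feature names.

```python
def rank_important_words(importances, feature_names, threshold=0.2):
    """Return (word, importance) pairs above threshold, most important first."""
    pairs = [(feature_names[index], value)
             for index, value in enumerate(importances)
             if value > threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy inputs, invented for illustration only.
toy_names = ["alpha", "beta", "gamma", "delta"]
toy_importances = [0.10, 0.36, 0.0, 0.54]

outliers = rank_important_words(toy_importances, toy_names)
print("%d features above the threshold" % len(outliers))
for word, value in outliers:
    print("%s: %.2f" % (word, value))
```

Reading the ranked words side by side makes it easier to judge whether they look like signatures or like ordinary email content.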



In [None]:
#!/usr/bin/python

import os
import pickle
import re
import sys
import time


"""
    Starter code to process the emails from Sara and Chris to extract
    the features and get the documents ready for classification.

    The list of all the emails from Sara are in the from_sara list
    likewise for emails from Chris (from_chris)

    The actual documents are in the Enron email dataset, which
    you downloaded/unpacked in Part 0 of the first mini-project. If you have
    not obtained the Enron email corpus, run startup.py in the tools folder.

    The data is stored in lists and packed away in pickle files at the end.
"""

from_sara  = open("../text_learning/from_sara.txt", "r")
from_chris = open("../text_learning/from_chris.txt", "r")

from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0

# Start
start = time.time()

for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### temp_counter could limit the run to the first 200 emails when developing;
        ### the condition below is always true, so the full dataset is processed
        temp_counter += 1
        if temp_counter > -1:
            path = os.path.join('..', path[:-1])
            #print path
            email = open(path, "r")

            ### use parseOutText to extract the text from the opened email
            parsed_email = parseOutText_v2(email)
                
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            parsed_email_clean = parsed_email.replace('sara','').replace('chris','').replace('shackleton','').replace('germani','')
            
            parsed_email_clean = parsed_email_clean.replace('sshacklensf','').replace('cgermannsf','')
            
            ### append the text to word_data
            word_data.append(parsed_email_clean)
            
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == 'sara':
                from_data.append(0)
            elif name == 'chris':
                from_data.append(1)
            else:
                print "ERROR!!"
            
            email.close()

print "emails processed"
from_sara.close()
from_chris.close()

pickle.dump( word_data, open("your_word_data_fs_2.pkl", "w") )
pickle.dump( from_data, open("your_email_authors_fs_2.pkl", "w") )

# Stop
stop = time.time()

print "Processing time: ", round((stop - start)/60,2), "minutes"

In [None]:
### in Part 4, do TfIdf vectorization here

# Importing the TfIdf package.
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating the vectorizer using english stopwords.
vectorizer = TfidfVectorizer(stop_words = 'english', lowercase=True)

# Fitting and Transforming.
vectorizer.fit_transform(word_data)

# Creating the Vocabulary List.
vocab_list = vectorizer.get_feature_names()

# Length of the Vocabulary.
len(vectorizer.get_feature_names())

In [None]:
#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "your_word_data_fs_2.pkl" 
authors_file = "your_email_authors_fs_2.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import model_selection
features_train, features_test, labels_train, labels_test = model_selection.train_test_split(word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()


### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

### your code goes here

# Importing the decision tree module from scikit-learn.
from sklearn import tree

# Importing accuracy_score from scikit-learn to calculate the accuracy.
from sklearn.metrics import accuracy_score

# Importing time to compute the time elapsed.
from time import time

# Creating the decision tree classifier; a node is split only if it holds at least 50 samples.
clf = tree.DecisionTreeClassifier(min_samples_split = 50)

# Saving the start time to compute the elapsed time of the fitting process.
t0 = time()

# Fitting/Training clf on the training arrays.
clf.fit(features_train, labels_train)

# Printing the elapsed time of the fit.
print "training time:", round(time()-t0, 3), "s"

# Saving the start time to compute the elapsed time of the predicting process.
t1 = time()

# Storing the predictions for features_test in pred.
pred = clf.predict(features_test)

# Printing the elapsed time of the prediction.
print "predict time:", round(time()-t1, 3), "s"

# Calculating the accuracy (true labels first, predictions second) and storing it in acc.
acc = accuracy_score(labels_test, pred)

In [None]:
# Converting the array into a DataFrame.
importance = pd.DataFrame(clf.feature_importances_, columns = ['imp'])

# Keeping only the features with importance greater than 0.2.
importance[importance.imp > 0.2]

In [None]:
# Finding the words behind the dominant features.
vectorizer.get_feature_names()[21323].encode('utf-8'), vectorizer.get_feature_names()[18849].encode('utf-8')

>**Once you've removed the signature words and reprocessed the emails, do any "new important features" (importance > 0.2) arise? How many?**

houectect

## Exercise 9 - Accuracy of the Overfit Tree <a id='part_i_9'></a>

What’s the accuracy of the decision tree now? We've removed two "signature words", so it will be more difficult for the algorithm to fit to our limited training set without overfitting. Remember, the whole point was to see if we could get the algorithm to overfit--a sensible result is one where the accuracy isn't that great!
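Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand; the function name `label_accuracy` and the toy labels are hypothetical and used only for illustration.

```python
def label_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(1 for truth, guess in zip(true_labels, predicted_labels)
                  if truth == guess)
    return matches / float(len(true_labels))


# Toy labels, invented for illustration only (0 = Sara, 1 = Chris).
toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]

toy_acc = label_accuracy(toy_true, toy_pred)
print("accuracy: %.2f%%" % (100 * toy_acc))
```

Applying the same calculation to predictions on `features_train` as well as `features_test` would show the training/test gap that characterizes an overfit tree.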



In [None]:
# Printing the accuracy.
acc

>**What’s the accuracy of the decision tree now?**

81.63%

#### Copying Files

In [None]:
# Importing shutil to deal with copy
from shutil import copyfile

# File name
filename = 'find_signature.ipynb'

# Lesson
lesson = '12-Lesson_12'

# Directory to make a copy
dir_copy = '../../' + lesson + '/00-Mini Project/' + filename

# Copying file.
copyfile(filename, dir_copy)