# Feature Selection Mini-project

In a video, Katie explained a problem that arose while preparing Chris and Sara’s emails for the author identification project: one feature was a little too powerful, effectively acting like a signature and giving the algorithm an arguably unfair advantage. You’ll work through that discovery process here.

## Overfitting a Decision Tree

Katie found this bug while trying to build an overfit decision tree to use as an example in the decision tree mini-project. Decision trees are classically easy to overfit; one of the easiest ways to get an overfit decision tree is to use a small training set and lots of features.

If a decision tree is overfit, would you expect the accuracy on a test set to be very high or pretty low?

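- Pretty low: an overfit tree has fit quirks of the training set, so it generalizes poorly to the test set.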

If a decision tree is overfit, would you expect high or low accuracy on the training set?

- High: an overfit tree fits the training set almost perfectly.
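
The gap described by these two questions can be sketched with the standard library alone. This is not a decision tree: a 1-nearest-neighbour classifier stands in as a model that memorizes its training set, and the synthetic data, the 30% label noise, and every name below are illustrative assumptions rather than part of the mini-project. The code uses Python 3 syntax, unlike the Python 2 notebook cells.

```python
import random

def make_data(n, rng, noise=0.3):
    """Return n (point, label) pairs; the true rule is x[0] > 0.5, with noisy labels."""
    data = []
    for _ in range(n):
        point = tuple(rng.random() for _ in range(5))
        label = int(point[0] > 0.5)
        if rng.random() < noise:  # flip some labels so there is noise to memorize
            label = 1 - label
        data.append((point, label))
    return data

def predict_1nn(train, point):
    """Return the label of the training point closest to `point`."""
    def sq_dist(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], point))
    return min(train, key=sq_dist)[1]

def accuracy(train, data):
    """Fraction of `data` whose label matches the 1-NN prediction."""
    hits = sum(predict_1nn(train, p) == y for p, y in data)
    return hits / len(data)

rng = random.Random(42)
train = make_data(30, rng)   # small training set
test = make_data(300, rng)   # held-out points from the same process

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(train, test)    # memorized noise does not generalize
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The training accuracy is perfect by construction, because the model has simply stored every training label, noise included. The test accuracy is lower, which is the pattern to expect from an overfit tree as well.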

## Number of Features and Overfitting

A classic way to overfit an algorithm is by using lots of features and not a lot of training data. You can find the starter code in feature_selection/find_signature.py. 

https://github.com/mudspringhiker/ud120-projects/blob/master/feature_selection/find_signature.py

Get a decision tree up and training on the training data, and print out the accuracy. How many training points are there, according to the starter code?

In [1]:
import pickle
import numpy as np
np.random.seed(42)

In [2]:
### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../10_Text_Learning/your_word_data.pkl" 
authors_file = "../10_Text_Learning/your_email_authors.pkl"
with open(words_file, "rb") as f:
    word_data = pickle.load(f)
with open(authors_file, "rb") as f:
    authors = pickle.load(f)

In [3]:
### test_size is the fraction of events assigned to the test set (the
### remainder go into training); 0.1 means 10%
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)

In [4]:
features_train

[u'sshacklensf kay ill give you my comment to the templat and i ask that you pleas add these to the clean document thank   enron north america corp 1400 smith street eb 3801a houston texa 77002 7138535620 phone 7136463490 fax enroncom forward by  houect on 04262001 0924 am carol st clair 04252001 0145 pm to  houectect frank sayreenrondevelopmentenrondevelop paul radousenronenronxg brent hendrynaenronenron cc subject hedg fund templat enclos is a form of hedg fund isda templat i have blacklin it against the union spring schedul carol st clair eb 3889 7138533989 phone 7136463393 fax carolstclairenroncom ',
 u'cgermannsf forward by  germanyhouect on 05042000 0713 am from enron north america general announc 05042000 0648 am to enron virus alert cc subject virus alert attent virusworm alert i love you virus may42000 if you receiv an email messag which specifi the subject as i love you or iloveyou or contain a file attach titl lovelettervb or loveletterforyoutxt or ani similar file name imme

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

In [6]:
features_train

<15820x37863 sparse matrix of type '<type 'numpy.float64'>'
	with 950025 stored elements in Compressed Sparse Row format>

In [7]:
features_test

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [8]:
### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150]#.toarray()
labels_train   = labels_train[:150]

In [17]:
### your code goes here
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
accuracy_score(pred, labels_test)

0.95904436860068254

In [12]:
len(labels_test)

1758

In [13]:
len(labels_train)

150

In [14]:
len(features_test)

1758

### Identify the Most Powerful Feature

Take your (overfit) decision tree and use the feature_importances_ attribute to get a list of the relative importance of all the features being used. We suggest iterating through this list (it’s long, since this is text data) and only printing out the feature importance if it’s above some threshold (say, 0.2; remember, if all words were equally important, each one would give an importance of far less than 0.01). What’s the importance of the most important feature? What is the number of this feature?

In [21]:
clf.feature_importances_

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [24]:
for i in clf.feature_importances_:
    if i > 0:
        print i

0.0263157894737
0.0749500333111
0.764705882353
0.134028294862


In [25]:
for i in clf.feature_importances_:
    if i > 0.2:
        print i

0.764705882353


In [26]:
for i, value in enumerate(clf.feature_importances_):
    if value > 0.2:
        print i, value

33614 0.764705882353
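
The threshold loop above can be wrapped in a small helper. The sketch below is a minimal Python 3 version run on a made-up importance vector; the function name and the toy values are hypothetical, and only the feature count 37863 is taken from the sparse matrix shape printed earlier.

```python
def features_above(importances, threshold=0.2):
    """Return (index, importance) pairs whose importance exceeds `threshold`."""
    return [(i, v) for i, v in enumerate(importances) if v > threshold]

# Toy importance vector (hypothetical values): mostly zeros, three used features.
importances = [0.0] * 10
importances[2] = 0.05
importances[5] = 0.75
importances[7] = 0.20

outliers = features_above(importances)
print(outliers)  # [(5, 0.75)]

# Baseline from the text: if importance were spread evenly over all
# 37863 features, each one would get this share.
n_features = 37863
equal_share = 1.0 / n_features
print(equal_share)
```

The comparison is strict, so a feature sitting exactly at 0.2 is not reported. Against an even share of roughly one part in 37863, a single feature carrying 0.76 of the total importance stands out immediately.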


### Use TfIdf to Get the Most Important Word

In order to figure out what words are causing the problem, you need to go back to the TfIdf and use the feature numbers that you obtained in the previous part of the mini-project to get the associated words. You can return a list of all the words in the TfIdf by calling get_feature_names() on it; pull out the word that’s causing most of the discrimination of the decision tree. What is it? Does it make sense as a word that’s uniquely tied to either Chris Germany or Sara Shackleton, a signature of sorts?

In [31]:
vectorizer.get_feature_names()[33614]

u'sshacklensf'
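
When several indices need to be looked up, it helps to fetch the name list once and index into it, rather than calling the vectorizer method for every feature. The sketch below uses a plain list as a stand-in for the list the vectorizer returns; the toy vocabulary and the helper name are hypothetical. Note that newer scikit-learn releases provide get_feature_names_out() in place of get_feature_names().

```python
def words_for(indices, feature_names):
    """Map feature indices to their words using one pre-fetched name list."""
    return {i: feature_names[i] for i in indices}

# Stand-in for the vectorizer's name list (hypothetical toy vocabulary;
# the real list here has 37863 entries).
feature_names = ["attach", "deal", "fax", "isda", "pleas", "sshacklensf"]

# Fetch the list once, then index it as often as needed.
lookup = words_for([5, 1], feature_names)
print(lookup)  # {5: 'sshacklensf', 1: 'deal'}
```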

### Remove, Repeat

This word seems like an outlier in a certain sense, so let’s remove it and refit. Go back to text_learning/vectorize_text.py, and remove this word from the emails using the same method you used to remove “sara”, “chris”, etc. Rerun vectorize_text.py, and once that finishes, rerun find_signature.py. Any other outliers pop up? What word is it? Seem like a signature-type word? (Define an outlier as a feature with importance >0.2, as before).

See:

http://localhost:8888/notebooks/udacity/dand/intro_to_machine_learning/10_Text_Learning/text_learning_vectorize_text.ipynb
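
A minimal sketch of the removal step, assuming the words are stripped with plain substring replacement. The helper name, the word list, and the whitespace cleanup are illustrative assumptions, not the contents of vectorize_text.py. It is written in Python 3 syntax.

```python
def remove_words(text, words):
    """Delete every occurrence of each word from `text`, then tidy whitespace."""
    for word in words:
        text = text.replace(word, "")
    # Collapse the gaps left behind by the removed words.
    return " ".join(text.split())

# Illustrative list: the earlier names plus the newly found signature word.
signature_words = ["sara", "chris", "sshacklensf"]

sample = "sshacklensf kay ill give you my comment"
cleaned = remove_words(sample, signature_words)
print(cleaned)  # kay ill give you my comment
```

Substring replacement is blunt: it also cuts the pattern out of longer words, so `remove_words("christma parti", ["chris"])` returns `"tma parti"`. That is worth keeping in mind when choosing which words to add to the list.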

In [32]:
words_file = "../10_Text_Learning/your_word_data_rnd2.pkl" 
authors_file = "../10_Text_Learning/your_email_authors_rnd2.pkl"
with open(words_file, "rb") as f:
    word_data = pickle.load(f)
with open(authors_file, "rb") as f:
    authors = pickle.load(f)

In [33]:
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)

In [34]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

In [35]:
features_train = features_train[:150]#.toarray()
labels_train   = labels_train[:150]

In [36]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

In [37]:
accuracy_score(pred, labels_test)

0.96928327645051193

In [38]:
for i, value in enumerate(clf.feature_importances_):
    if value > 0.2:
        print i, value

14343 0.666666666667


In [39]:
vectorizer.get_feature_names()[14343]

u'cgermannsf'

Just curious what the other features with nonzero importance are:

In [40]:
for i, value in enumerate(clf.feature_importances_):
    if value > 0:
        print i, value

8674 0.162601626016
14337 0.0506072874494
14343 0.666666666667
16268 0.093808630394
18249 0.0263157894737


In [41]:
vectorizer.get_feature_names()[8674]

u'62502pst'

In [42]:
vectorizer.get_feature_names()[16268]

u'deal'

In [43]:
vectorizer.get_feature_names()[14337]

u'cgerman'

In [44]:
vectorizer.get_feature_names()[18249]

u'eol'

### Checking Important Features Again

Update vectorize_text.py one more time, and rerun. Then run find_signature.py again. Any other important features (importance>0.2) arise? How many? Do any of them look like “signature words”, or are they more “email content” words, that look like they legitimately come from the text of the messages?

In [46]:
# vectorize_text done
words_file = "../10_Text_Learning/your_word_data_rnd3.pkl" 
authors_file = "../10_Text_Learning/your_email_authors_rnd3.pkl"
with open(words_file, "rb") as f:
    word_data = pickle.load(f)
with open(authors_file, "rb") as f:
    authors = pickle.load(f)

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

features_train = features_train[:150]#.toarray()
labels_train   = labels_train[:150]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

accuracy_score(pred, labels_test)


0.81342434584755408

In [49]:
for i, value in enumerate(clf.feature_importances_):
    if value >= 0.2:
        print i, value

21323 0.363636363636


In [50]:
vectorizer.get_feature_names()[21323]

u'houectect'

In [52]:
for i, value in enumerate(clf.feature_importances_):
    if value > 0:
        print i, value, vectorizer.get_feature_names()[i]
        

11924 0.0248101945003 assoc
11975 0.105378579003 attach
12575 0.0262801932367 befor
14328 0.0177777777778 cgas
15212 0.0255293305728 cone
16267 0.0474074074074 deal
17248 0.0426666666667 drive
18849 0.186927243449 fax
21323 0.363636363636 houectect
22546 0.0840692099229 isda
29690 0.0755170338269 pleas


Note: 'houectect' will not be removed. It does not look like a signature word, and it is not very common.
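
To compare the remaining features at a glance, the nonzero importances can be ranked. The sketch below copies the (index, importance, word) rows from the output above into a plain list and sorts them; the variable names are illustrative. It also checks that the importances add up to about 1, which is expected because scikit-learn normalizes tree feature importances.

```python
# (index, importance, word) rows copied from the output above.
rows = [
    (11924, 0.0248101945003, "assoc"),
    (11975, 0.105378579003, "attach"),
    (12575, 0.0262801932367, "befor"),
    (14328, 0.0177777777778, "cgas"),
    (15212, 0.0255293305728, "cone"),
    (16267, 0.0474074074074, "deal"),
    (17248, 0.0426666666667, "drive"),
    (18849, 0.186927243449, "fax"),
    (21323, 0.363636363636, "houectect"),
    (22546, 0.0840692099229, "isda"),
    (29690, 0.0755170338269, "pleas"),
]

# Rank by importance, largest first.
ranked = sorted(rows, key=lambda r: r[1], reverse=True)
for index, importance, word in ranked[:3]:
    print(word, index, importance)

# The nonzero importances should account for the whole total.
total = sum(importance for _, importance, _ in rows)
print(round(total, 6))  # 1.0
```

The top three are 'houectect', 'fax', and 'attach'. Only the first clears the 0.2 threshold, and the importance is now spread over more words than in the earlier rounds.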

### Accuracy of the Overfit Tree

What’s the accuracy of the decision tree now? We've removed two "signature words", so it will be more difficult for the algorithm to fit to our limited training set without overfitting. Remember, the whole point was to see if we could get the algorithm to overfit; a sensible result is one where the accuracy isn't that great!

0.81342434584755408

"Excellent work! Now that we've removed the outlier 'signature words', the training data is starting to overfit to the words that remain."