In [2]:
# CIS 545

# Homework 5: Spam Classification in SciKit-Learn and TensorFlow

This assignment uses data from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Data processing was inspired by https://www.kaggle.com/overflow012/d/uciml/sms-spam-collection-dataset/text-preprocessing-classification

The code below gives you a helper function and reads in the data.

In [3]:
import pandas as pd

# This function returns the k most frequently appearing keywords in the dataframe
def top_k(data_df, vec, k):
    X = vec.fit_transform(data_df['sms'].values)
    labels = vec.get_feature_names()    
    return pd.DataFrame(columns = labels, data = X.toarray()).sum().sort_values(ascending = False)[:k]

sms_df = pd.read_csv('spam.csv', encoding='latin-1')
sms_df.columns = ['class', 'sms', 'a', 'b', 'c']

In [4]:
sms_df

Unnamed: 0,class,sms,a,b,c
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,
6,ham,Even my brother is not like to speak with me. ...,,,
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,
8,spam,WINNER!! As a valued network customer you have...,,,
9,spam,Had your mobile 11 months or more? U R entitle...,,,


## Step 1.1 Data Wrangling and Inspection

Clean up `sms_df`.  Delete 'a', 'b', 'c', lowercase the sms text, add a column 'length'.

In [5]:
# TODO: Data wrangling / cleaning
# Take an explicit copy so the column assignment below does not trigger
# pandas' SettingWithCopyWarning.
sms_df = sms_df[['class','sms']].copy()

# The text casing is left as-is here; CountVectorizer lowercases by default.
sms_df['length'] = sms_df['sms'].apply(lambda x: len(str(x)))
sms_df


Unnamed: 0,class,sms,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154
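As a rough sketch of the full cleanup that Step 1.1 asks for (drop the extra columns, lowercase the text, add a length column), the snippet below works on a tiny made-up frame. The names `raw` and `clean` and the sample rows are assumptions for illustration only, not part of the assignment data.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (illustrative rows only).
raw = pd.DataFrame({
    'class': ['ham', 'spam'],
    'sms': ['Ok lar... Joking wif u oni...', 'WINNER!! Claim your prize'],
    'a': [None, None], 'b': [None, None], 'c': [None, None],
})

# .copy() makes the column subset an independent DataFrame, so the
# assignments below do not trigger SettingWithCopyWarning.
clean = raw[['class', 'sms']].copy()
clean['sms'] = clean['sms'].str.lower()    # lowercase the message text
clean['length'] = clean['sms'].str.len()   # number of characters per message
print(clean)
```

The `.str` accessor applies the string operation to every row at once, which avoids a Python-level `apply` with a lambda.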


After you’ve done this, step through these cells to see how the input data divides between ‘spam’ texts and ‘ham’ (non-spam) texts.  You should note that the spam has certain terms, e.g., ‘winner’, that don’t appear as frequently in “ham.”

In [6]:
display(sms_df)

Unnamed: 0,class,sms,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154


In [7]:
display(sms_df.groupby('class').describe())

Summary statistics of `length`, by class:
class,count,mean,std,min,25%,50%,75%,max
ham,4825.0,71.023627,58.016023,2.0,33.0,52.0,92.0,910.0
spam,747.0,138.866131,29.183082,13.0,132.5,149.0,157.0,224.0


## Step 1.2. Vectorizing the Text

`scikit-learn` has a handy `CountVectorizer` that builds a (sparse) matrix of word counts and can drop stop words.  It does not stem words on its own, although a custom `tokenizer` can add stemming.  See:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

The code below builds document vectors automatically.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(decode_error = 'ignore', stop_words = 'english')
X = vec.fit_transform(sms_df['sms'].values)
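To make the shape of the result concrete, here is a small sketch on three made-up messages. The names `docs`, `toy_vec`, and `counts` are assumptions used only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages (not from the SMS dataset).
docs = ["Free entry to win a prize", "Ok, see you at home", "Win a free phone"]

toy_vec = CountVectorizer(stop_words='english')
counts = toy_vec.fit_transform(docs)   # sparse matrix: one row per document

# vocabulary_ maps each kept term to its column index in the matrix.
free_col = toy_vec.vocabulary_['free']
print(counts.shape)
print(counts[:, free_col].toarray().ravel())   # count of "free" in each document
```

Each row is a document and each column is a term, so selecting one column gives that term's count across all messages.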

### Step 1.2.1 What are the frequent words? What are frequent patterns?

Run the next couple of cells to see the top-30 words in spam and in ham.  If you look, you should see numbers and parts of URLs (“www”, “com”) in the spam.  Perhaps we need to get rid of these altogether.  However, the URLs and numbers are highly varied, so we would like to map all URLs to one feature, and all numbers to one feature. 

In the cell below store the index of "www" as `www` and the index of "com" as `com`.

In [9]:
# vocabulary_ maps each term to its column index in X
www = vec.vocabulary_['www']
com = vec.vocabulary_['com']
print(www)
print(com)

8274
2089


In [10]:
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)
display(top_spam)

free          224
txt           163
ur            144
mobile        127
text          125
stop          121
claim         113
reply         104
www            98
prize          93
just           78
cash           76
won            76
uk             74
150p           71
send           70
new            69
nokia          67
win            64
urgent         63
tone           60
week           60
50             57
contact        56
service        56
msg            54
com            54
18             51
16             51
guaranteed     50
dtype: int64

In [11]:
top_ham = top_k(sms_df[sms_df['class'] == 'ham'], vec, 30)
display(top_ham)

gt       318
lt       316
just     293
ok       287
ll       265
ur       241
know     236
good     233
got      232
like     232
come     227
day      209
time     201
love     199
going    169
home     165
want     164
lor      162
need     158
sorry    157
don      151
da       150
today    139
later    135
dont     132
did      129
send     129
think    128
pls      123
hi       122
dtype: int64

### Step 1.2.2 Regularize URLs and Numbers

Let’s replace all URL patterns, and all numbers, with a single text token each (“_url_” and “_num_”).  To do this, simply pass the appropriate column of your DataFrame to `regularize_urls` and `regularize_numbers`. Replace the SMS text with the results of regularizing.

In [12]:
# TODO: Regularize/tokenize URLs and numbers
from regularize import regularize_urls, regularize_numbers
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
sms_df = sms_df.assign(sms=regularize_urls(sms_df['sms']))
sms_df = sms_df.assign(sms=regularize_numbers(sms_df['sms']))
sms_df


Unnamed: 0,class,sms,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in _num_ a wkly comp to win FA Cu...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been _num_ we...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile _num_ months or more? U R en...,154
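The `regularize` module itself is not shown in this notebook. As a hedged sketch of what such helpers might do, the functions below use two simple regular expressions; the names `regularize_urls_sketch` and `regularize_numbers_sketch` and both patterns are assumptions, and the real helpers may match URLs and numbers differently.

```python
import re

# Hypothetical patterns: anything starting with http(s):// or www. counts as
# a URL, and any run of digits counts as a number.
URL_RE = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)
NUM_RE = re.compile(r'\d+')

def regularize_urls_sketch(texts):
    """Replace every URL-like substring with the token _url_."""
    return [URL_RE.sub(' _url_ ', t) for t in texts]

def regularize_numbers_sketch(texts):
    """Replace every run of digits with the token _num_."""
    return [NUM_RE.sub(' _num_ ', t) for t in texts]

# URLs are replaced first so that digits inside a URL are not tokenized twice.
sample = ["Win 100 at www.example.com now"]
result = regularize_numbers_sketch(regularize_urls_sketch(sample))
print(result[0].split())
```

Because underscores count as word characters, `CountVectorizer`'s default token pattern keeps `_url_` and `_num_` as single features.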


### Step 1.2.2 Results

Re-run the CountVectorizer, re-create vector `X`, and re-compute the top 30 spam terms.  Store the top 30 spam terms in `top_spam`.

In [13]:
# TODO: Top-30 spam terms
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)
display(top_spam)

_num_         3289
free           228
txt            165
ur             144
_url_          141
mobile         129
stop           126
text           125
claim          113
reply          104
prize           92
just            78
won             76
cash            76
nokia           71
send            70
win             70
new             69
urgent          63
week            60
tone            59
box             57
msg             56
service         56
contact         56
guaranteed      50
ppm             49
customer        49
mins            47
phone           46
dtype: int64

In [14]:
top_ham = top_k(sms_df[sms_df['class'] == 'ham'], vec, 30)
display(top_ham)

_num_    1227
gt        318
lt        316
just      293
ok        287
ll        265
ur        241
know      236
good      233
got       233
like      232
come      228
day       214
time      201
love      199
going     169
home      165
want      165
lor       162
need      158
sorry     157
don       151
da        150
today     138
later     135
dont      132
send      129
did       129
think     128
tell      123
dtype: int64

In [15]:
display(top_spam)

_num_         3289
free           228
txt            165
ur             144
_url_          141
mobile         129
stop           126
text           125
claim          113
reply          104
prize           92
just            78
won             76
cash            76
nokia           71
send            70
win             70
new             69
urgent          63
week            60
tone            59
box             57
msg             56
service         56
contact         56
guaranteed      50
ppm             49
customer        49
mins            47
phone           46
dtype: int64

In [16]:
if "_num_" not in top_spam:
    raise ValueError

In [17]:
if "_url_" not in top_spam:
    raise ValueError

## Step 1.3 Creating Features

Currently we have a very large number of features, namely all of the words that aren’t stop words.  Let’s do dimensionality reduction, by only looking for the words that frequently occur in either spam or ham.  Recall that we just recomputed the top-30 spam words.

Compute the top-30 ham words, then create a list of “vocabulary” words from the combination of the top spam + ham words.  Create a new feature set called `relevant_vec` using `CountVectorizer` with just the top spam + ham words. 

Hint: See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.

In [18]:
vec.vocabulary_

{'reaction': 4474,
 'dizzamn': 1510,
 'easy': 1649,
 'delay': 1370,
 'seventeen': 4888,
 'greatly': 2307,
 'narcotics': 3652,
 'eva': 1782,
 'lotz': 3238,
 'payed': 4041,
 'atural': 362,
 'ruining': 4707,
 'mth': 3590,
 'edge': 1658,
 'meal': 3394,
 'solihull': 5125,
 'uncles': 5872,
 'freek': 2086,
 'placed': 4146,
 'arent': 276,
 'oh': 3844,
 'wearing': 6113,
 'south': 5173,
 'switch': 5466,
 '_url_': 2,
 'mid': 3453,
 'onlyfound': 3875,
 'whenevr': 6170,
 'miiiiiiissssssssss': 3457,
 'sachin': 4727,
 'hopefully': 2554,
 'networking': 3702,
 'jumpers': 2912,
 'eve': 1785,
 'karnan': 2941,
 'albi': 138,
 'activities': 54,
 'men': 3429,
 'impatient': 2677,
 'face': 1857,
 'classmates': 992,
 'journey': 2896,
 'foot': 2034,
 'nattil': 3656,
 'definitely': 1362,
 'payment': 4044,
 'genius': 2191,
 'cthen': 1247,
 'jaklin': 2828,
 'strongly': 5344,
 'differences': 1454,
 'que': 4410,
 'perfume': 4071,
 'disclose': 1488,
 'darlings': 1313,
 'ran': 4449,
 'hugs': 2597,
 'missionary': 3494,


In [19]:
# Stack the ham and spam term counts (ham first), keep each term once,
# and renumber so that every term gets a consecutive column index.
merge = pd.concat([top_ham.to_frame(), top_spam.to_frame()])
merge = merge.reset_index().rename(columns = {'index':'name'})[['name']]
merge = merge.drop_duplicates().reset_index(drop = True)



In [20]:
mapping = dict(zip(merge.name, merge.index))
mapping

{'_num_': 0,
 '_url_': 32,
 'box': 47,
 'cash': 40,
 'claim': 36,
 'come': 11,
 'contact': 50,
 'customer': 53,
 'da': 22,
 'day': 12,
 'did': 27,
 'don': 21,
 'dont': 25,
 'free': 30,
 'going': 15,
 'good': 8,
 'got': 9,
 'gt': 1,
 'guaranteed': 51,
 'home': 16,
 'just': 3,
 'know': 7,
 'later': 24,
 'like': 10,
 'll': 5,
 'lor': 18,
 'love': 14,
 'lt': 2,
 'mins': 54,
 'mobile': 33,
 'msg': 48,
 'need': 19,
 'new': 43,
 'nokia': 41,
 'ok': 4,
 'phone': 55,
 'ppm': 52,
 'prize': 38,
 'reply': 37,
 'send': 26,
 'service': 49,
 'sorry': 20,
 'stop': 34,
 'tell': 29,
 'text': 35,
 'think': 28,
 'time': 13,
 'today': 23,
 'tone': 46,
 'txt': 31,
 'ur': 6,
 'urgent': 44,
 'want': 17,
 'week': 45,
 'win': 42,
 'won': 39}

In [None]:
# TODO: Vector of 'important' words

relevant_vec = CountVectorizer(decode_error = 'ignore', vocabulary = mapping, stop_words = 'english')
relevant_vec.get_feature_names()
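A minimal sketch of how a fixed vocabulary restricts the features: when `CountVectorizer` is given a `vocabulary` mapping, it counts only those terms and places each one in the column named by the mapping. The three-word `toy_vocab` and the two messages are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary: term -> column index.
toy_vocab = {'free': 0, 'win': 1, 'ok': 2}
fixed_vec = CountVectorizer(vocabulary=toy_vocab)

messages = ["Free entry, win win now", "ok see you later"]
fixed_counts = fixed_vec.fit_transform(messages).toarray()
print(fixed_counts)   # one column per vocabulary term, all other words ignored
```

Words outside the vocabulary ("entry", "now", "later") simply do not contribute a column, which is what reduces the dimensionality.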

This cell adds an sms length feature (normalized by the maximum length) and creates training and test sets for you.

In [22]:
import sklearn.model_selection as ms
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

# X is the feature array, based off relevant words
X = relevant_vec.fit_transform(sms_df['sms'].values).toarray()

# Compute the length of each sms message, normalized by max length
Xlen = np.zeros((X.shape[0],1))
inx = 0
for v in sms_df['sms'].values:
        Xlen[inx,0] = len(v)
        inx += 1
Xlen = Xlen / max(Xlen)
# Add the length as another feature
X = np.hstack((X, Xlen))

#y = sms_df['class'].values
y = np.array((sms_df['class'] == 'spam').astype(int))

# Now we split...
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state = 42)
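The length feature can also be built without an explicit index loop. The sketch below shows one vectorized way to do it on made-up data; `texts`, `word_counts`, and `features` are illustrative names, not variables from the cell above.

```python
import numpy as np

texts = np.array(["ok", "free entry now", "call me"])
word_counts = np.array([[1, 0], [0, 1], [0, 0]])   # made-up word-count features

# Column vector of message lengths, scaled into [0, 1] by the longest message.
lengths = np.array([len(t) for t in texts], dtype=float).reshape(-1, 1)
lengths /= lengths.max()

# Append the normalized length as one extra feature column.
features = np.hstack((word_counts, lengths))
print(features)
```

Dividing by the maximum keeps the length on a scale comparable to small word counts, so it does not dominate distance-based or gradient-based models.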


## Step 1.4 Classifier Evaluation

Write a function `spam_classify(model)` that takes a constructed model as input, `fit`s the model on the training set, `predict`s on the test set, and returns the `score` on the test outputs. The monospaced words are hints for which functions you should use. Note that you may need to re-run the cell below as you write the constructor calls so that you do not get duplicates in your results table.

In [23]:
def get_accuracy(y_predict, y_actual):
    cnt = 0
    for i in range(len(y_predict)):
        y_p = y_predict[i]
        y_a = y_actual[i]
        if y_p == y_a:
            cnt += 1
    return cnt/len(y_predict)

In [24]:
# TODO: Write your spam_classify function here.

def spam_classify(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return get_accuracy(y_pred, y_test) 

# Results, as a list of dictionaries
classifier_results = []
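For classifiers, scikit-learn's `score` method returns mean accuracy, so it can stand in for a hand-written accuracy loop. The sketch below assumes a helper named `classify_with_score` that takes the data explicitly; that name, its signature, and the toy data are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

def classify_with_score(model, X_tr, y_tr, X_te, y_te):
    """Fit on the training split and return mean accuracy on the test split."""
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Tiny separable toy problem: the label equals the single feature.
toy_X_train = [[0], [1], [0], [1]]
toy_y_train = [0, 1, 0, 1]
toy_X_test = [[0], [1]]
toy_y_test = [0, 1]

toy_acc = classify_with_score(DecisionTreeClassifier(random_state=42),
                              toy_X_train, toy_y_train, toy_X_test, toy_y_test)
print(toy_acc)
```

Passing the splits as arguments, rather than reading globals, makes the helper reusable with other train/test splits.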

### Step 1.4.1 Decision Trees

Construct 6 Decision Trees **with random state 42**. Use maximum depths of 1, 2, 3, 4, 5, and the default. Store them in variables `dt1` to `dt6`.

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import sklearn.model_selection as ms

# TODO: Construct your 6 decision trees here.
dt1 = tree.DecisionTreeClassifier(max_depth=1,criterion='entropy', random_state=42)
dt2 = tree.DecisionTreeClassifier(max_depth=2,criterion='entropy', random_state=42)
dt3 = tree.DecisionTreeClassifier(max_depth=3,criterion='entropy', random_state=42)
dt4 = tree.DecisionTreeClassifier(max_depth=4,criterion='entropy', random_state=42)
dt5 = tree.DecisionTreeClassifier(max_depth=5,criterion='entropy', random_state=42)
dt6 = tree.DecisionTreeClassifier(criterion='entropy', random_state=42)

classifier_results.append({'Classifier': 'DecTree', 'Param': 'Depth=1', 'Score': spam_classify(dt1)})
classifier_results.append({'Classifier': 'DecTree', 'Param': 'Depth=2', 'Score': spam_classify(dt2)})
classifier_results.append({'Classifier': 'DecTree', 'Param': 'Depth=3', 'Score': spam_classify(dt3)})
classifier_results.append({'Classifier': 'DecTree', 'Param': 'Depth=4', 'Score': spam_classify(dt4)})
classifier_results.append({'Classifier': 'DecTree', 'Param': 'Depth=5', 'Score': spam_classify(dt5)})
classifier_results.append({'Classifier': 'DecTree', 'Param': '', 'Score': spam_classify(dt6)})

In [26]:
classifier_results

[{'Classifier': 'DecTree', 'Param': 'Depth=1', 'Score': 0.9399103139013453},
 {'Classifier': 'DecTree', 'Param': 'Depth=2', 'Score': 0.9399103139013453},
 {'Classifier': 'DecTree', 'Param': 'Depth=3', 'Score': 0.9497757847533632},
 {'Classifier': 'DecTree', 'Param': 'Depth=4', 'Score': 0.9560538116591928},
 {'Classifier': 'DecTree', 'Param': 'Depth=5', 'Score': 0.9587443946188341},
 {'Classifier': 'DecTree', 'Param': '', 'Score': 0.9713004484304932}]
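The six constructions differ only in `max_depth`, so they can also be written as a loop. This sketch uses synthetic data from `make_classification` rather than the SMS features, and all names in it (`Xd`, `depth_results`, and so on) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the SMS features.
Xd, yd = make_classification(n_samples=200, n_features=8, random_state=42)
Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(Xd, yd, test_size=0.2, random_state=42)

depth_results = []
for depth in [1, 2, 3, 4, 5, None]:   # None means no depth limit (the default)
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(Xd_tr, yd_tr)
    depth_results.append({'Param': 'Depth={}'.format(depth),
                          'Score': clf.score(Xd_te, yd_te),
                          'ActualDepth': clf.get_depth()})

for row in depth_results:
    print(row)
```

Recording the fitted tree's actual depth shows how far the unrestricted tree grows compared with the capped ones.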

### Step 1.4.2 Logistic Regression

Construct 2 `liblinear` Logistic Regression classifiers **with random state 42**. Use `l1` and `l2` regularization penalties. Store them in variables `lr1` and `lr2`.

In [27]:
from sklearn.linear_model import LogisticRegression

# TODO: Construct your 2 logistic regression classifiers here.
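# solver='liblinear' is set explicitly: it supports both penalties, and the
# default solver in newer scikit-learn releases rejects penalty='l1'.
lr1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)
lr2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=42)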
classifier_results.append({'Classifier': 'LogReg',  'Param': 'Reg=l1',  'Score': spam_classify(lr1)})
classifier_results.append({'Classifier': 'LogReg',  'Param': 'Reg=l2',  'Score': spam_classify(lr2)})



In [28]:
classifier_results

[{'Classifier': 'DecTree', 'Param': 'Depth=1', 'Score': 0.9399103139013453},
 {'Classifier': 'DecTree', 'Param': 'Depth=2', 'Score': 0.9399103139013453},
 {'Classifier': 'DecTree', 'Param': 'Depth=3', 'Score': 0.9497757847533632},
 {'Classifier': 'DecTree', 'Param': 'Depth=4', 'Score': 0.9560538116591928},
 {'Classifier': 'DecTree', 'Param': 'Depth=5', 'Score': 0.9587443946188341},
 {'Classifier': 'DecTree', 'Param': '', 'Score': 0.9713004484304932},
 {'Classifier': 'LogReg', 'Param': 'Reg=l1', 'Score': 0.9713004484304932},
 {'Classifier': 'LogReg', 'Param': 'Reg=l2', 'Score': 0.9704035874439462}]
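One practical difference between the two penalties is that `l1` tends to push some coefficients to exactly zero, while `l2` only shrinks them. The sketch below counts zero coefficients for each penalty on synthetic data; the data and the names `Xl`, `yl`, and `zero_counts` are assumptions, and the counts will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only a few informative features.
Xl, yl = make_classification(n_samples=200, n_features=20, n_informative=3,
                             random_state=42)

zero_counts = {}
for penalty in ('l1', 'l2'):
    clf = LogisticRegression(penalty=penalty, solver='liblinear', random_state=42)
    clf.fit(Xl, yl)
    # Number of coefficients that are exactly zero under this penalty.
    zero_counts[penalty] = int((clf.coef_ == 0).sum())

print(zero_counts)
```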

### Step 1.4.3 Support Vector Machines

Construct 1 Support Vector Machines classifier with the kernel coefficient set by `gamma="auto"` and **with random state 42**. Ensure the model computes probabilities for the 2 classes by setting the probability flag to True. Store the model in the variable `svm`.

In [29]:
from sklearn.svm import SVC

# TODO: Construct your 1 support vector machines classifier here.
svm = SVC(random_state = 42,probability=True, gamma = 'auto')

classifier_results.append({'Classifier': 'SVM', 'Param': '', 'Score': spam_classify(svm)})

In [30]:
display(pd.DataFrame(classifier_results))

Unnamed: 0,Classifier,Param,Score
0,DecTree,Depth=1,0.93991
1,DecTree,Depth=2,0.93991
2,DecTree,Depth=3,0.949776
3,DecTree,Depth=4,0.956054
4,DecTree,Depth=5,0.958744
5,DecTree,,0.9713
6,LogReg,Reg=l1,0.9713
7,LogReg,Reg=l2,0.970404
8,SVM,,0.9713
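With `probability=True`, an `SVC` exposes `predict_proba`, which returns one probability per class for each sample. A small sketch on synthetic data follows; `Xs`, `ys`, and `toy_svc` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

Xs, ys = make_classification(n_samples=120, n_features=6, random_state=42)

toy_svc = SVC(gamma='auto', probability=True, random_state=42)
toy_svc.fit(Xs, ys)

# One row per sample, one column per class; each row sums to 1.
proba = toy_svc.predict_proba(Xs[:5])
print(np.round(proba, 3))
```

Enabling probabilities adds an internal calibration step during `fit`, which is one reason SVM training is slower with this flag set.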


In [31]:
# Don't delete this cell.

In [32]:
# Don't delete this cell.

In [33]:
# Don't delete this cell.

## Step 2.0 Ensembles

We are going to use your `spam_classify` function again. No changes needed. Note that you may need to re-run the cell below as you write the constructor calls so that you do not get duplicates in your results table.

In [34]:
# Results, as a list of dictionaries
classifier_results = []

## Step 2.1 Random Forest

Construct 1 random forest classifier with 31 estimators and **random state 314**. Store it in the variable `rfor`.

In [35]:
from sklearn.ensemble import RandomForestClassifier

# TODO: Construct your random forest classifier here.
rfor = RandomForestClassifier(random_state = 314, n_estimators = 31)

classifier_results.append({'Classifier': 'RFo', 'Param': 'Count=31', 'Score': spam_classify(rfor)})

In [36]:
classifier_results

[{'Classifier': 'RFo', 'Param': 'Count=31', 'Score': 0.9829596412556054}]

## Step 2.2 Bagging

Construct 4 bagging classifiers with 31 estimators and **random state 314**. The base classifiers should be `dt6`, `lr1`, `lr2`, and `svm`. Store them in variables `bag1` to `bag4`.

In [37]:
from sklearn.ensemble import BaggingClassifier

# TODO: Construct your bagging classifier here.
bag1 = BaggingClassifier(base_estimator=dt6,n_estimators=31, random_state=314)
bag2 = BaggingClassifier(base_estimator=lr1,n_estimators=31, random_state=314)
bag3 = BaggingClassifier(base_estimator=lr2,n_estimators=31, random_state=314)
bag4 = BaggingClassifier(base_estimator=svm,n_estimators=31, random_state=314)
classifier_results.append({'Classifier': 'Bag', 'Param': 'Base=dt6', 'Score': spam_classify(bag1)})
classifier_results.append({'Classifier': 'Bag', 'Param': 'Base=lr1', 'Score': spam_classify(bag2)})
classifier_results.append({'Classifier': 'Bag', 'Param': 'Base=lr2', 'Score': spam_classify(bag3)})
classifier_results.append({'Classifier': 'Bag', 'Param': 'Base=svm', 'Score': spam_classify(bag4)})





In [38]:
classifier_results

[{'Classifier': 'RFo', 'Param': 'Count=31', 'Score': 0.9829596412556054},
 {'Classifier': 'Bag', 'Param': 'Base=dt6', 'Score': 0.9820627802690582},
 {'Classifier': 'Bag', 'Param': 'Base=lr1', 'Score': 0.9739910313901345},
 {'Classifier': 'Bag', 'Param': 'Base=lr2', 'Score': 0.9713004484304932},
 {'Classifier': 'Bag', 'Param': 'Base=svm', 'Score': 0.9730941704035875}]
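A portability note, offered as a sketch: the `base_estimator` keyword was renamed to `estimator` in scikit-learn 1.2, so keyword-based construction depends on the installed version. Passing the base model as the first positional argument should work under either name. The synthetic data and variable names below are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xb, yb = make_classification(n_samples=300, n_features=10, random_state=0)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(Xb, yb, test_size=0.2, random_state=0)

# The base model is passed positionally, so the keyword name does not matter.
toy_bag = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                            n_estimators=31, random_state=314)
toy_bag.fit(Xb_tr, yb_tr)
bag_score = toy_bag.score(Xb_te, yb_te)
print(len(toy_bag.estimators_), bag_score)
```

Each of the 31 fitted estimators is trained on a bootstrap sample of the training set, and the ensemble predicts by aggregating their outputs.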

## Step 2.3 AdaBoost

Construct 4 AdaBoost classifiers with 31 estimators and **random state 314**. The base classifiers should be `dt6`, `lr1`, `lr2`, and `svm`. Store them in variables `ada1` to `ada4`. Training these models could take a long time.

In [39]:
from sklearn.ensemble import AdaBoostClassifier

# TODO: Construct your AdaBoost Classifier here.
ada1 = AdaBoostClassifier(base_estimator=dt6,n_estimators=31, random_state=314)
ada2 = AdaBoostClassifier(base_estimator=lr1,n_estimators=31, random_state=314)
ada3 = AdaBoostClassifier(base_estimator=lr2,n_estimators=31, random_state=314)
ada4 = AdaBoostClassifier(base_estimator=svm,n_estimators=31, random_state=314)

classifier_results.append({'Classifier': 'Ada', 'Param': 'Base=dt6', 'Score': spam_classify(ada1)})
classifier_results.append({'Classifier': 'Ada', 'Param': 'Base=lr1', 'Score': spam_classify(ada2)})
classifier_results.append({'Classifier': 'Ada', 'Param': 'Base=lr2', 'Score': spam_classify(ada3)})
classifier_results.append({'Classifier': 'Ada', 'Param': 'Base=svm', 'Score': spam_classify(ada4)})





KeyboardInterrupt: 

In [40]:
display(pd.DataFrame(classifier_results))

Unnamed: 0,Classifier,Param,Score
0,RFo,Count=31,0.98296
1,Bag,Base=dt6,0.982063
2,Bag,Base=lr1,0.973991
3,Bag,Base=lr2,0.9713
4,Bag,Base=svm,0.973094
5,Ada,Base=dt6,0.970404
6,Ada,Base=lr1,0.865471
7,Ada,Base=lr2,0.950673


In [41]:
if min(pd.DataFrame(classifier_results)["Score"][0:5]) < 0.95:
    raise ValueError("Something went wrong.")

In [42]:
# Don't delete this cell.

## Step 3.0 Neural Networks

Let’s continue building upon our spam classifier, this time using neural networks -- both Perceptrons and feed-forward networks. We are going to use your `spam_classify` function again. No changes needed. Note that you may need to re-run the cell below as you write the constructor calls so that you do not get duplicates in your results table.

In [43]:
# Results, as a list of dictionaries
classifier_results = []

## Step 3.1 Perceptron

Construct 1 perceptron classifier with a maximum number of iterations of 1000, a tolerance of 1e-3, and **random state 42**. Store it in the variable `perc`.

In [44]:
from sklearn.linear_model import Perceptron

# TODO: Construct your perceptron classifier here.
perc = Perceptron(max_iter=1000, tol=1e-3, random_state=42)

classifier_results.append({'Classifier': 'Perceptron', 'Param': '', 'Score': spam_classify(perc)})

## Step 3.2 Multi-layer Perceptrons

### Step 3.2.1 Small Network

Construct 1 MLP classifier with 3 hidden nodes in 1 layer and **random state 42**. Store it in the variable `mlp1`.

In [45]:
from sklearn.neural_network import MLPClassifier

# TODO: Construct your small MLP classifier here.
mlp1 = MLPClassifier(hidden_layer_sizes=(3,),random_state=42)

classifier_results.append({'Classifier': 'NN', 'Param': 'Hidden=(3)', 'Score': spam_classify(mlp1)})

### Step 3.2.2 Medium Network

Construct 1 MLP classifier with 10 hidden nodes in 1 layer and **random state 42**. Store it in the variable `mlp2`.

In [46]:
# TODO: Construct your medium MLP classifier here.
# YOUR CODE HERE
mlp2 = MLPClassifier(hidden_layer_sizes=(10,),random_state=42)

classifier_results.append({'Classifier': 'NN', 'Param': 'Hidden=(10)', 'Score': spam_classify(mlp2)})

### Step 3.2.3 Large Network

Construct 1 MLP classifier with 10 hidden nodes in each of 3 layers and **random state 1**. Store it in the variable `mlp3`.

In [47]:
# TODO: Construct your large MLP classifier here.
# YOUR CODE HERE
mlp3 = MLPClassifier(hidden_layer_sizes=(10,10,10),random_state=1)


classifier_results.append({'Classifier': 'NN', 'Param': 'Hidden=(10,10,10)', 'Score': spam_classify(mlp3)})



In [48]:
display(pd.DataFrame(classifier_results))

Unnamed: 0,Classifier,Param,Score
0,Perceptron,,0.973094
1,NN,Hidden=(3),0.976682
2,NN,Hidden=(10),0.974888
3,NN,"Hidden=(10,10,10)",0.973094
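To see what `hidden_layer_sizes` controls, one can inspect the weight matrices of a fitted network. In this sketch on synthetic data with 12 input features, a `(10, 10, 10)` network has four weight matrices: input to first hidden layer, two hidden-to-hidden, and last hidden layer to the single output unit used for binary classification. The names `Xm`, `ym`, and `toy_net` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

Xm, ym = make_classification(n_samples=200, n_features=12, random_state=42)

toy_net = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=300,
                        random_state=1)
toy_net.fit(Xm, ym)

# coefs_[i] holds the weights between layer i and layer i + 1.
layer_shapes = [w.shape for w in toy_net.coefs_]
print(layer_shapes)
```

A single-layer network such as `hidden_layer_sizes=(3,)` would have only two weight matrices, which is why it trains faster and has far fewer parameters.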


In [49]:
# Don't delete this cell.

In [50]:
# Don't delete this cell.

In [51]:
# Don't delete this cell.

# TensorFlow

## Step 4.1. Loading Data

The first cell imports TensorFlow. For this part of the homework, you will use `X_train` and `y_train` created in Step 1.3.

Define TensorFlow **columns** (features) for each of the words in the vocabulary from Step 1.3.  Also add an additional column for the length. Store these columns in a list.

In [52]:
import tensorflow as tf

# TODO: Define TensorFlow columns
columns = relevant_vec.get_feature_names()
columns.append('length')
columns

['_num_',
 'gt',
 'lt',
 'just',
 'ok',
 'll',
 'ur',
 'know',
 'good',
 'got',
 'like',
 'come',
 'day',
 'time',
 'love',
 'going',
 'home',
 'want',
 'lor',
 'need',
 'sorry',
 'don',
 'da',
 'today',
 'later',
 'dont',
 'send',
 'did',
 'think',
 'tell',
 'free',
 'txt',
 '_url_',
 'mobile',
 'stop',
 'text',
 'claim',
 'reply',
 'prize',
 'won',
 'cash',
 'nokia',
 'win',
 'new',
 'urgent',
 'week',
 'tone',
 'box',
 'msg',
 'service',
 'contact',
 'guaranteed',
 'ppm',
 'customer',
 'mins',
 'phone',
 'length']

In [53]:
my_feature_columns = []
for key in columns:
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
my_feature_columns

[_NumericColumn(key='_num_', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='gt', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='lt', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='just', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='ok', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='ll', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='ur', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='know', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='good', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='got', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key=

## Step 4.2. Setting up Features

Create a function `input_fn` that takes parameters `x` (2D numpy array of features) and `y` (1D numpy array of labels).  This should create a tensor for each **column** of the 2D array `x`. You can think of this as creating a tensor for each feature. This function should return a tuple of the dictionary of the tensors created from the columns and a tensor created from the second input `y`.  

Create a function `test_input_fn` that takes no arguments, but returns the output of passing in the test set and labels to `input_fn`. Create a similar function `train_input_fn` that does the same thing except passes in the training set and labels.

In [62]:
def input_fn(features, labels):
    # One tensor per feature column, keyed by the column's name
    features_dict = {}
    for i, key in enumerate(columns):
        features_dict[key] = tf.constant(features[:, i])

    return features_dict, tf.constant(labels)


In [55]:
def train_input_fn():
    return input_fn(X_train, y_train)
def test_input_fn():
    return input_fn(X_test, y_test)
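The core of `input_fn` is splitting a 2D array into one named column per feature. The sketch below shows that step with NumPy only, leaving out the `tf.constant` wrapping; the helper name `columns_to_dict` and the demo array are assumptions for illustration.

```python
import numpy as np

def columns_to_dict(x, names):
    """Map each feature name to the matching column of the 2D array x."""
    if x.shape[1] != len(names):
        raise ValueError("expected one name per column of x")
    return {name: x[:, i] for i, name in enumerate(names)}

# Two samples with three features each (made-up values).
x_demo = np.array([[1.0, 0.0, 0.2],
                   [0.0, 2.0, 0.9]])
feature_dict = columns_to_dict(x_demo, ['free', 'ok', 'length'])
print(feature_dict)
```

The shape check guards against a mismatch between the feature array and the list of column names, which would otherwise silently mislabel features.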

## Step 4.3 Construct, Train, Evaluate, Results in TensorFlow

### Step 4.3.1

Construct a DNNClassifier with two hidden layers of 5 units each and store it as `dnn_55`. For reference, [here](https://www.tensorflow.org/get_started/premade_estimators) is an example of using a DNNClassifier.

In [56]:
# TODO: Create DNNClassifier
tf.set_random_seed(42)
dnn_55 = tf.estimator.DNNClassifier(feature_columns=my_feature_columns,hidden_units=[5, 5],n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_keep_checkpoint_max': 5, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc8b497f550>, '_model_dir': '/tmp/tmp09p6tgqp', '_log_step_count_steps': 100, '_protocol': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_train_distribute': None, '_save_checkpoints_secs': 600, '_master': '', '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_tf_random_seed': None, '_experimental_distribute': None, '_task_id': 0, '_device_fn': None, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_num_ps_replicas': 0, '_eval_distribute': None, '_service': None, '_global_id_in_cluster': 0}


### Step 4.3.2

Construct a LinearClassifier here and store it as `tf_lin`.

In [57]:
# TODO: Construct your LinearClassifier here.
tf.set_random_seed(42)
tf_lin = tf.estimator.LinearClassifier(feature_columns=my_feature_columns)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_keep_checkpoint_max': 5, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc8b497f4a8>, '_model_dir': '/tmp/tmpla66n0ej', '_log_step_count_steps': 100, '_protocol': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_train_distribute': None, '_save_checkpoints_secs': 600, '_master': '', '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_tf_random_seed': None, '_experimental_distribute': None, '_task_id': 0, '_device_fn': None, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_num_ps_replicas': 0, '_eval_distribute': None, '_service': None, '_global_id_in_cluster': 0}


### Step 4.3.3

Write a function `train_evaluate(m, num_steps)` that takes the model and a number of steps as arguments, trains the model over the training data, evaluates on the test data, sorts the results of the evaluate operation by key, prints the keys and their values, and returns the accuracy. 

In [60]:
# TODO: Write your train_evaluate(m, num_steps) function here.
def train_evaluate(m, num_steps):
    m.train(input_fn=train_input_fn, steps=num_steps)
    eval_result = m.evaluate(input_fn=test_input_fn, steps=num_steps)
    # Print every metric returned by evaluate, sorted by key
    for key in sorted(eval_result):
        print('{}: {}'.format(key, eval_result[key]))
    return eval_result['accuracy']
    

In [63]:
# Results, as a list of dictionaries
classifier_results = []

classifier_results.append({'Classifier': 'DNN', 'Param': 'Hidden=(5,5)', 'Score': train_evaluate(dnn_55, 1000)})
classifier_results.append({'Classifier': 'Lin', 'Param': '',             'Score': train_evaluate(tf_lin, 1000)})

display(pd.DataFrame(classifier_results))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp09p6tgqp/model.ckpt.
INFO:tensorflow:loss = 3074.1306, step = 1
INFO:tensorflow:global_step/sec: 277.24
INFO:tensorflow:loss = 254.33302, step = 101 (0.362 sec)
INFO:tensorflow:global_step/sec: 359.802
INFO:tensorflow:loss = 219.14915, step = 201 (0.278 sec)
INFO:tensorflow:global_step/sec: 333.959
INFO:tensorflow:loss = 203.02982, step = 301 (0.301 sec)
INFO:tensorflow:global_step/sec: 373.967
INFO:tensorflow:loss = 193.5601, step = 401 (0.266 sec)
INFO:tensorflow:global_step/sec: 326.762
INFO:tensorflow:loss = 186.94994, step = 501 (0.310 sec)
INFO:tensorflow:global_step/sec: 312.074
INFO:tensorflow:loss = 182.31215, step = 601 (0.318 sec)
INFO:tensorflow:global_step/sec: 343.787
INFO:tensorflow:lo

Unnamed: 0,Classifier,Param,Score
0,DNN,"Hidden=(5,5)",0.973094
1,Lin,,0.974888


In [59]:
# Don't delete this cell.