 ## Machine Learning Online Class

 Exercise 6 | Spam Classification with SVMs

 Instructions
 ------------
 
  This file contains code that helps you get started on the
  exercise. You will need to complete the following functions:

     gaussianKernel.m
     dataset3Params.m
     processEmail.m
     emailFeatures.m

  For this exercise, you will not need to change any code in this file,
  or any other files other than those mentioned above.

In [1]:
import sys
from scipy.io import loadmat
from sklearn.svm import SVC, LinearSVC
import numpy as np
sys.path.append('../')
from ex6.processEmail import *
from ex6.emailFeatures import emailFeatures

In [2]:
# ==================== Part 1: Email Preprocessing ====================
#  To use an SVM to classify emails into Spam vs. Non-Spam, you first need
#  to convert each email into a vector of features. In this part, you will
#  implement the preprocessing steps for each email. You should
#  complete the code in processEmail.m to produce a word indices vector
#  for a given email.

print('Preprocessing sample email (emailSample1.txt)')

# Extract Features
with open('data/emailSample1.txt', 'r') as f:
    file_contents = f.read()

word_indices = processEmail(file_contents)

print("""
=========================
Word Indices: 
""")
print(word_indices)

Preprocessing sample email (emailSample1.txt)
==== Processed Email ====

anyon know how much it cost to host a web portal well it depend on how mani visitor you re expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr

Word Indices: 

[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1893, 1364, 592, 1676, 238, 162, 89, 688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676, 992, 961, 1477, 71, 530, 1699, 531]
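
The normalizations visible in the processed email above (lower-casing, `httpaddr`, `emailaddr`, `number`, `dollar`) can be sketched with regular expressions. This is a minimal sketch, not the contents of `processEmail`: the function name `normalize_email` and the exact patterns are assumptions, and the word stemming seen in the output (for example `anyon`, `mani`) is left out.

```python
import re

def normalize_email(text):
    """Hypothetical helper: normalize raw email text into a list of tokens."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # strip HTML tags
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # email addresses
    text = re.sub(r'[0-9]+', 'number', text)                   # numbers
    text = re.sub(r'[$]+', 'dollar', text)                     # dollar signs
    # Split on anything that is not a lowercase letter or digit; drop empties.
    return [tok for tok in re.split(r'[^a-z0-9]+', text) if tok]

tokens = normalize_email("Visit http://example.com for $10, or mail me@example.com!")
print(tokens)
# ['visit', 'httpaddr', 'for', 'dollarnumber', 'or', 'mail', 'emailaddr']
```

Each remaining token would then be stemmed and looked up in the vocabulary to obtain the word indices; tokens that are not in the vocabulary are simply skipped.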


In [3]:
## ==================== Part 2: Feature Extraction ====================
#  Now, you will convert each email into a vector of features in R^n.
#  You should complete the code in emailFeatures.m to produce a feature
#  vector for a given email.

print('Extracting features from sample email (emailSample1.txt)')

# Extract Features
with open('data/emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = processEmail(file_contents)
features = emailFeatures(word_indices)

# Print Stats
print('Length of feature vector: %d' % len(features))
print('Number of non-zero entries: %d' % sum(features > 0))

print("""
Expected output:
Length of feature vector: 1899
Number of non-zero entries: 45
""")

Extracting features from sample email (emailSample1.txt)
==== Processed Email ====

anyon know how much it cost to host a web portal well it depend on how mani visitor you re expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr
Length of feature vector: 1899
Number of non-zero entries: 45

Expected output:
Length of feature vector: 1899
Number of non-zero entries: 45
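
The mapping from word indices to a feature vector can be sketched as a binary indicator vector: entry i is 1 if vocabulary word i occurs in the email and 0 otherwise. This is a sketch, not the contents of `emailFeatures`; the name `email_features_sketch` is hypothetical, and it assumes the indices are 1-based positions in a 1899-word vocabulary. Fed the word indices printed in Part 1, it reproduces the counts shown above.

```python
def email_features_sketch(word_indices, n_vocab=1899):
    """Hypothetical helper: binary bag-of-words vector (assumes 1-based indices)."""
    features = [0] * n_vocab
    for idx in word_indices:
        features[idx - 1] = 1  # a repeated word still just sets the entry to 1
    return features

# Word indices printed for emailSample1.txt in Part 1
word_indices = [86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431,
                1171, 794, 1002, 1893, 1364, 592, 1676, 238, 162, 89, 688, 945,
                1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182,
                1237, 810, 1895, 1440, 1547, 181, 1699, 1758, 1896, 688, 1676,
                992, 961, 1477, 71, 530, 1699, 531]

features = email_features_sketch(word_indices)
print(len(features))  # 1899
print(sum(features))  # 45 (53 indices, 8 of them repeats)
```

In the notebook itself a NumPy array (for example from `np.zeros`) would be the more natural container, since the classifier expects array input.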



In [4]:
# =========== Part 3: Train Linear SVM for Spam Classification ========
#  In this section, you will train a linear classifier to determine if an
#  email is Spam or Not-Spam.

# Load the Spam Email dataset
# You will have X, y in your environment
data = loadmat('./data/spamTrain.mat')

X, y = data['X'], data['y']

print('\nTraining Linear SVM (Spam Classification)')
print('(this may take 1 to 2 minutes) ...')

C = 0.1
model = SVC(C=C, kernel='linear')  # linear kernel
model.fit(X, y.ravel())  # fit on the training data
p = model.score(X, y.ravel())
print('Training Accuracy: %f' % p)


Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes) ...
Training Accuracy: 0.998250
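
A linear classifier of this kind labels an email by the sign of a weighted sum of its features plus a bias term. The sketch below illustrates that rule in plain Python; the name `linear_predict` and the toy weights are made up for illustration and are not taken from the trained model.

```python
def linear_predict(weights, bias, x):
    """Hypothetical helper: 1 (spam) if w.x + b > 0, else 0 (not spam)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

# Toy 3-word vocabulary with made-up weights
weights = [1.5, -2.0, 0.5]
bias = -0.25

print(linear_predict(weights, bias, [1, 0, 1]))  # 1.5 + 0.5 - 0.25 = 1.75 -> 1
print(linear_predict(weights, bias, [0, 1, 0]))  # -2.0 - 0.25 = -2.25 -> 0
```

Words with large positive weights push the score toward spam, which is why inspecting the weights is informative for a linear model.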


In [5]:
# =================== Part 4: Test Spam Classification ================
#  After training the classifier, we can evaluate it on a test set. We have
#  included a test set in spamTest.mat

# Load the test dataset
# You will have Xtest, ytest in your environment
data = loadmat('./data/spamTest.mat')
Xtest, ytest = data['Xtest'], data['ytest']

print('Evaluating the trained Linear SVM on a test set ...')

p = model.score(Xtest, ytest.ravel())

print('Test Accuracy: %f' % p)

Evaluating the trained Linear SVM on a test set ...
Test Accuracy: 0.989000
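
For a scikit-learn classifier, `model.score` reports mean accuracy: the fraction of examples whose predicted label matches the true label. A minimal sketch of that computation, using toy labels rather than the spam data (the function name `accuracy` is illustrative):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

# Toy labels: three of the four predictions are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```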


## ================= Part 5: Top Predictors of Spam ====================

Since the model we are training is a linear SVM, we can inspect the
weights learned by the model to understand better how it is determining
whether an email is spam or not. The following code finds the words with
the highest weights in the classifier. Informally, the classifier 'thinks' that these words are the most likely indicators of spam.

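This part lists the vocabulary words ranked by their learned weights. The cell was left unimplemented in this notebook, but a scikit-learn `SVC` trained with `kernel='linear'` does expose the weights through its `coef_` attribute, so the ranking can be computed from the trained model.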
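
One way to produce such a ranking is to pair each vocabulary word with its weight and sort in descending order. The sketch below uses a toy vocabulary and made-up weights; the helper name `top_predictors` is hypothetical and nothing here is real model output.

```python
def top_predictors(weights, vocab, k=15):
    """Hypothetical helper: the k (word, weight) pairs with the largest weights."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and weights, for illustration only
vocab = ['click', 'meeting', 'free', 'thanks', 'guarantee']
weights = [0.42, -0.31, 0.50, -0.12, 0.38]

for word, weight in top_predictors(weights, vocab, k=3):
    print('%-10s %f' % (word, weight))
# free       0.500000
# click      0.420000
# guarantee  0.380000
```

With the model trained in Part 3, `weights` would be `model.coef_[0]`, and `vocab` would be the exercise's vocabulary list in index order; how that list is loaded depends on the `processEmail` module and is not shown here.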

In [6]:
# =================== Part 6: Try Your Own Emails =====================
#  Now that you've trained the spam classifier, you can use it on your own
#  emails! In the starter code, we have included spamSample1.txt,
#  spamSample2.txt, emailSample1.txt and emailSample2.txt as examples.
#  The following code reads in one of these emails and then uses your
#  learned SVM classifier to determine whether the email is Spam or
#  Not Spam

# Set the file to be read in (change this to spamSample2.txt,
# emailSample1.txt or emailSample2.txt to see different predictions on
# different email types). Try your own emails as well!
# NOTE: this part has a problem.
filename = './data/emailSample2.txt'  

# Read and predict
with open(filename, 'r') as f:
    file_contents = f.read()
word_indices = processEmail(file_contents)
x = emailFeatures(word_indices)
p = model.predict(x.reshape(1, -1))

print('\nProcessed {}\n\nSpam Classification: {}'.format(filename, p))
print('(1 indicates spam, 0 indicates not spam)')

==== Processed Email ====

folk my first time post have a bit of unix experi but am new to linux just got a new pc at home dell box with window xp ad a second hard diskfor linux partit the disk and have instal suse number number from cd which wentfin except it didn t pick up my monitor i have a dell brand enumberfpp number lcd flat panel monitor and a nvidia geforcenumbertinumb video card both of which are probabl too new to featur in suse s defaultset i download a driver from the nvidia websit and instal it use rpm then i ran saxnumb as wa recommend in some post i found on the net butit still doesn t featur my video card in the avail list what next anoth problem i have a dell brand keyboard and if i hit cap lock twice the whole machin crash in linux not window even the on off switch isinact leav me to reach for the power cabl instead if anyon can help me in ani way with these prob i d be realli grate i ve search the net but have run out of idea or should i be go for a differ version o