# Build a classifier that automatically categorizes each SMS as spam or ham

This exercise uses SMSSpamCollection.tsv, which contains 5,570 SMS messages labeled as spam or ham (not spam). The labels are stored in the first column and the messages in the second column. The columns are separated by tabs.


### Step 0: Click File -> Save a copy in Drive

### Step 1: Download the dataset to your local machine

Download the dataset: [SMSSpamCollection.tsv](https://drive.google.com/file/d/18EFPF0G_Jgjnc3FO7EvGD85eM3vwIO5W/view?usp=sharing)

### Step 2: Read in text


In [1]:
import io
import nltk
import pandas as pd
import re
import string

pd.set_option('display.max_colwidth', 100)

**Option A: Run in Google Colab**

In [0]:
from google.colab import files
uploaded = files.upload()
data = pd.read_csv(io.StringIO(uploaded["SMSSpamCollection.tsv"].decode('utf-8')), sep='\t')     #Google Drive Colab

Saving SMSSpamCollection.tsv to SMSSpamCollection (1).tsv


* Click the 'Choose Files' button
* Select 'SMSSpamCollection.tsv'

**Option B: Run locally (Jupyter Notebook)**


*   Upload SMSSpamCollection.tsv to your Jupyter Notebook working directory




In [0]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t') #Read Local File
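One detail worth checking when reading the file: `pd.read_csv` treats the first line as column names unless told otherwise. The sketch below uses a made-up two-line sample (not the real dataset) to show the difference. Whether SMSSpamCollection.tsv has a header row is an assumption to verify against the file itself.

```python
import io

import pandas as pd

# Made-up sample in the same label<TAB>message layout (not real data).
sample = "ham\tfirst message\nspam\tsecond message\n"

# Default behaviour: the first line becomes the column names.
with_default = pd.read_csv(io.StringIO(sample), sep="\t")

# header=None keeps every line as data; names= supplies the column names.
with_names = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["label", "body_text"],
)

print(len(with_default), len(with_names))
```

If the real file turns out to have no header row, the second form keeps the first message in the data.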

### Step 3: Explore Dataset

In [0]:


data.columns = ['label', 'body_text']
data.head()

|   | label | body_text |
|---|-------|-----------|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 1 | ham | Nah I don't think he goes to usf, he lives around here though |
| 2 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 4 | ham | As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call... |


### Step 4: Construct a baseline model using message length as a feature

In [0]:
def base_features(text):
    # Baseline: the message length in characters is the only feature.
    return {'len': len(text)}

In [0]:
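# itertuples() yields (index, label, body_text): row[1] is the label, row[2] the message.
featuresets = [(base_features(row[2]), row[1]) for row in data.itertuples()]
# Hold out the first 500 messages for testing and train on the rest.
train_set, test_set = featuresets[500:], featuresets[:500]
print(featuresets[:5])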



[({'len': 155}, 'spam'), ({'len': 61}, 'ham'), ({'len': 77}, 'ham'), ({'len': 35}, 'ham'), ({'len': 160}, 'ham')]


In [0]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

### Step 5: Evaluate the baseline model 



In [0]:
print(nltk.classify.accuracy(classifier, test_set))

0.888


### Step 6: Show most informative features

In [0]:
classifier.show_most_informative_features(10)

Most Informative Features
                     len = 161              spam : ham    =     35.9 : 1.0
                     len = 160              spam : ham    =     12.1 : 1.0
                     len = 143              spam : ham    =     11.5 : 1.0
                     len = 156              spam : ham    =      9.5 : 1.0
                     len = 181              spam : ham    =      9.3 : 1.0
                     len = 170              spam : ham    =      9.3 : 1.0
                     len = 173              spam : ham    =      9.3 : 1.0
                     len = 163              spam : ham    =      9.0 : 1.0
                     len = 25                ham : spam   =      8.8 : 1.0
                     len = 157              spam : ham    =      8.6 : 1.0


# Now it's your turn

The baseline reaches 88.8% accuracy on the 500-message test set using only message length.
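The Naive Bayes classifier treats every distinct feature value as its own category (note how `len = 161` and `len = 160` appear as separate entries in the most informative features output above), so bucketed or boolean features may generalize better than raw counts. The following is only a sketch of that idea: `richer_features`, its feature names, and its thresholds are hypothetical choices for illustration, not part of the assignment.

```python
import re
import string


def richer_features(text):
    """Hypothetical feature extractor using coarse, discrete values."""
    n_chars = len(text)
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    n_upper = sum(1 for ch in text if ch.isupper())
    return {
        # Length in buckets of 20 characters, capped at bucket 10.
        'len_bucket': min(n_chars // 20, 10),
        # A run of 4+ digits (e.g. a short code or phone number).
        'has_digit_run': bool(re.search(r'\d{4,}', text)),
        # Share of punctuation characters, rounded to one decimal place.
        'punct_ratio': round(n_punct / n_chars, 1) if n_chars else 0.0,
        # More than half of all characters are upper case.
        'mostly_upper': n_chars > 0 and n_upper / n_chars > 0.5,
    }


print(richer_features("Text FA to 87121 now!!"))
```

Such a function can replace `base_features` when building `featuresets`. Whether it actually beats the baseline has to be measured on the test set.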

## **How to Submit**

*   Send an e-mail and confirm that it was received.
*   **To**: **kantinee@gmail.com** and **ochaowalit@gmail.com**
*   **Subject**: (517 432 NLP Exam) - *YourStudentID* - *YourName*
*   **Attach 2 files**:
    1.   Your source code in *YourStudentID-code.ipynb*
    2.   An image of the accuracy evaluation, saved as *YourStudentID-eva.jpg*
*   **Details**: Write your name and student ID.

## **Scoring**

*   2 points if the student changes the classifier technique.
*   4 points if the student applies stop-word removal and lemmatization.
*   5 points if the student designs new features.
*   6 points if the student designs new features that run without errors.
*   7 points if the student designs new features that run without errors and shows an error analysis.
*   8 points if the student designs new features that run without errors and gets better performance than the baseline.
*  10 points if the student designs new features that run without errors and gets better performance using cross-validation.
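For the cross-validation item, one possible shape is a plain k-fold loop. This is a minimal sketch, not a required implementation: `k_fold_indices` and `cross_validate` are hypothetical helper names, and the training and scoring functions are passed in as arguments so the loop works with any classifier.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_items))
        yield train_idx, test_idx
        start = stop


def cross_validate(featuresets, k, train_fn, accuracy_fn):
    """Return the mean accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(featuresets), k):
        train = [featuresets[i] for i in train_idx]
        test = [featuresets[i] for i in test_idx]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)
```

In this notebook the call could look like `cross_validate(featuresets, 5, nltk.NaiveBayesClassifier.train, nltk.classify.accuracy)`. Shuffling `featuresets` with a fixed random seed beforehand is a sensible precaution in case the file is ordered in some way.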








In [0]:
%%html
<marquee style='width: 30%; color: blue;'><b>Good Luck !!</b></marquee>