
# Week 3 Spam Classification


# SMS SPAM Collection

The SMS SPAM Collection is a corpus of real text messages (SMS messages) that have been classified as either SPAM or HAM (i.e. not SPAM). The corpus contains 5,574 documents, 747 of which are SPAM and 4,827 of which are HAM. You can find the readme for the corpus [here](https://storage.googleapis.com/wd13/SMSSpamCollectionReadme). 

The following code downloads a copy of the SMS SPAM Corpus and saves it in a variable `sms_corpus`. 

In [None]:
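import urllib.request
sms_corpus = []
# Each line of the file has the form "label<TAB>text"
with urllib.request.urlopen("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url:
  for line in url.readlines():
    # decode the bytes to a string, then split into [label, text]
    sms_corpus.append(line.decode().split('\t'))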

`sms_corpus` is a list. Each element is itself a two-item list of the form `[label, text]`: the label (`ham` or `spam`) followed by the message text.
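As a small illustration of that layout, the sketch below parses a single line the same way the download loop does. The raw line is reconstructed from the output shown for document 16, so treat it as an assumed input rather than a byte-for-byte copy of the file.

```python
# Assumed raw line, rebuilt from the notebook's output for document 16.
# urlopen yields bytes, so the sample is a bytes literal.
raw_line = b"ham\tOh k...i'm watching here:)\n"

# Same steps as the download loop: decode to str, split on the tab.
record = raw_line.decode().split('\t')
label, text = record

print(record)             # the [label, text] pair, newline still attached
print(label)              # the label on its own
print(text.rstrip('\n'))  # the text with the trailing newline removed
```

Note that the trailing newline stays attached to the text unless it is stripped explicitly.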

In [None]:
# print the text and label of document 16
docid = 16
print(sms_corpus[docid])

['ham', "Oh k...i'm watching here:)\n"]


In [None]:
# print the label of document 16
docid = 16
print(sms_corpus[docid][0])

ham


In [None]:
# print the text of document 16
docid = 16
print(sms_corpus[docid][1])

Oh k...i'm watching here:)



# Create a tokenizer

Write a function `tokenize` that takes a string and returns a list of tokens. 

In [None]:
import urllib.request, json 
# Download a JSON dictionary that holds a stopword list and lemmas
with urllib.request.urlopen("https://storage.googleapis.com/wd13/stopwords%20and%20lemmas.json") as url:
  data = json.load(url)          # parse the response body into a Python dictionary
  stopwords = data['stopwords']  # the 'stopwords' key holds the list of words to exclude


import re
def tokenize(doc):
  emoti_list = [':)','(:',':(','):',':D','D:',':P','P:',':V','V:',':/','/:',':\\','\\:',':|','|:',
                ';)','(;',';(',');',';D','D;',';P','P;',';V','V;',';/','/;',';\\','\\;',';|','|;',
                ':-)','(-:',':-(',')-:',':-D','D-:',':-P','P-:',':-V','V-:',':-/','/-:',':-\\',
                '\\-:',':-|','|-:',';-)','(-;',';-(',')-;',';-D','D-;',';-P','P-;',';-V','V-;',
                ';-/','/-;',';-\\','\\-;',';-|','|-;']
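  # Alternatives are tried in order: emoticons, then words, then runs of dots
  tokenizer_pattern = re.compile('|'.join([
      '|'.join([re.escape(e) for e in emoti_list]),
      # a word, optionally followed by one joining character and more letters;
      # inside the class, '-_ is a range (from ' to _), so it also covers digits
      # and most punctuation, not only the apostrophe, hyphen and underscore
      r"[A-Za-z]+(?:['-_\.][A-Za-z]+)?",
      r'\.\.+'
      ]))
  tokens = tokenizer_pattern.findall(doc)
  for i in range(len(tokens)):
    if re.match(r'\.\.+', tokens[i]):
      tokens[i] = '..+'               # collapse any run of dots into one token
    else:
      tokens[i] = tokens[i].lower()   # lowercase everything else
  return tokens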
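To see what this kind of tokenizer produces, here is a minimal self-contained sketch. It is a simplified stand-in, not the function above: the names `simple_tokenize`, `EMOTICONS` and `PATTERN` are hypothetical, the emoticon list is cut down to four entries, and the word pattern only allows an apostrophe or hyphen as a joining character.

```python
import re

# Hypothetical, shortened emoticon list (the notebook's list is much longer).
EMOTICONS = [':)', ':(', ';)', ':D']

# Emoticons first, then words with one optional internal ' or -, then dot runs.
PATTERN = re.compile('|'.join([
    '|'.join(re.escape(e) for e in EMOTICONS),
    r"[A-Za-z]+(?:['-][A-Za-z]+)?",
    r'\.\.+',
]))

def simple_tokenize(doc):
    tokens = []
    for tok in PATTERN.findall(doc):
        # Runs of two or more dots become one placeholder token;
        # everything else is lowercased.
        tokens.append('..+' if tok.startswith('..') else tok.lower())
    return tokens

print(simple_tokenize("Oh k...i'm watching here:)"))
# ['oh', 'k', '..+', "i'm", 'watching', 'here', ':)']
```

Putting the emoticon alternatives first matters because `re` tries alternatives left to right, so `:)` is captured as a unit before any other branch can consume part of it.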

# Import log function

In [None]:
from math import log
log(1)

0.0

# Calculate token scores

Calculate scores for every token in the corpus, using the method discussed in class. Store these scores in a dictionary called `token_score`. 
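Before running this on the full corpus, it can help to check the arithmetic on a corpus small enough to verify by hand. The sketch below applies the same rule the next cell computes, as far as it can be read from the code: a token's score is the log of its document frequency in SPAM divided by its document frequency in HAM. The toy corpus and the names `toy_corpus`, `doc_count`, `token_doc_count` and `toy_scores` are invented for illustration, and whether this matches the method from class exactly is an assumption.

```python
from math import log

# Hypothetical toy corpus of (label, tokens) pairs, invented for illustration.
toy_corpus = [
    ('spam', ['win', 'prize', 'now']),
    ('spam', ['win', 'cash']),
    ('ham',  ['see', 'you', 'now']),
    ('ham',  ['win', 'or', 'lose']),
    ('ham',  ['call', 'me', 'now']),
    ('ham',  ['ok']),
]

doc_count = {'spam': 0, 'ham': 0}          # documents per label
token_doc_count = {'spam': {}, 'ham': {}}  # per label: token -> documents containing it

for label, tokens in toy_corpus:
    doc_count[label] += 1
    for token in set(tokens):              # count each token once per document
        counts = token_doc_count[label]
        counts[token] = counts.get(token, 0) + 1

toy_scores = {}
for token in token_doc_count['spam']:
    if token in token_doc_count['ham']:    # skip tokens seen in only one class
        p_spam = token_doc_count['spam'][token] / doc_count['spam']
        p_ham = token_doc_count['ham'][token] / doc_count['ham']
        toy_scores[token] = log(p_spam / p_ham)

print(toy_scores)
```

Here `win` appears in 2 of 2 SPAM documents and 1 of 4 HAM documents, so its score is log(1 / 0.25) = log(4), about 1.386. `now` appears in half of each class, so its score is log(1) = 0. All other tokens occur in only one class and receive no score.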

In [None]:
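# Initialise the counters and containers

corpus_ham_count = 0          # number of HAM documents
corpus_spam_count = 0         # number of SPAM documents
token_ham_count = {}          # token -> number of HAM documents containing it
token_spam_count = {}         # token -> number of SPAM documents containing it
unique_token_count = set()    # every distinct token seen in the corpus
token_score = {}              # token -> score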



for doc in sms_corpus:        # each doc has the form [label, text]
  label = doc[0]              # label is 'ham' or 'spam'
  tokens = tokenize(doc[1])   # tokenize the message text

  # Count the HAM and SPAM documents in the corpus
  if label == 'ham':
    corpus_ham_count += 1
  else:
    corpus_spam_count += 1    
  
  # Count, per label, the number of documents that contain each token
  for token in set(tokens):               # use a set so each token counts once per document
    unique_token_count.add(token)         # record the token in the corpus vocabulary
    if label == 'ham':
      if token not in token_ham_count:
        token_ham_count[token] =1
      else:
        token_ham_count[token] +=1
    else:
      if token not in token_spam_count:
        token_spam_count[token] =1
      else:
        token_spam_count[token] +=1

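# Score each token as the log-ratio of its document frequency in SPAM vs. HAM.
# Tokens seen in only one class are skipped: the ratio would be zero or a division by zero.
for token in unique_token_count:
  if token not in token_spam_count or token not in token_ham_count:
    continue
  token_score[token] = log((token_spam_count[token]/corpus_spam_count)/(token_ham_count[token]/corpus_ham_count))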


# Prior log-odds of SPAM vs. HAM: log(P(spam) / P(ham))
total_count = corpus_ham_count + corpus_spam_count
prob_ham = corpus_ham_count/total_count
prob_spam = corpus_spam_count/total_count

score = log(prob_spam/prob_ham)




In [None]:
score

-1.8659152505276757

In [None]:
token_score['go']

-0.09289830336354553

# Create a score message function

Write a function `score_message` that takes an SMS message `doc` and returns a SPAM score, using the method discussed in class. 

In [None]:
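def score_message(doc):
  # start from the prior log-odds of SPAM vs. HAM
  score = log(corpus_spam_count/corpus_ham_count)
  tokens = tokenize(doc)
  for token in set(tokens):     # each distinct token contributes once
    if token in token_score:    # tokens without a score are ignored
      score = score + token_score[token]
  return score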

# Score some messages

In [None]:
score_message('go hello')

-2.3954833963575908
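This result can be broken down by hand, since `score_message` adds the prior log-odds and one score per distinct token. The sketch below uses only numbers shown in this notebook (the corpus counts 747 and 4,827, the value of `token_score['go']`, and the total above) and backs out the contribution of `hello`. The variable names are hypothetical, and the calculation assumes both tokens have an entry in `token_score`.

```python
from math import log

# Values copied from this notebook's text and outputs.
prior = log(747 / 4827)            # log(corpus_spam_count / corpus_ham_count)
go_score = -0.09289830336354553    # token_score['go']
total = -2.3954833963575908        # score_message('go hello')

# What remains after the prior and 'go' is the contribution of 'hello'.
hello_score = total - prior - go_score

print(prior)
print(hello_score)
```

The prior alone is about -1.866, which matches the `score` value printed earlier, and the implied score for `hello` is about -0.437. Both tokens are negative, so each pushes the message further toward HAM.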

# Discussion

What tokens are most predictive of a message being SPAM?

In [None]:
# List the 20 tokens with the highest scores (most indicative of SPAM)

for token,score in sorted(token_score.items(),key=lambda item: -item[1])[0:20]:
  print(token,score)

code 5.267112632189831
p 5.267112632189831
uk 5.0439690808756215
urgent 4.894437346904658
award 4.861647524081667
delivery 4.861647524081667
await 4.810354229694116
private 4.756287008423841
nokia 4.737594875411688
services 4.699128594583891
club 4.638503972767457
landline 4.638503972767457
statement 4.638503972767457
voucher 4.573965451629886
apply 4.573965451629886
games 4.573965451629886
mths 4.573965451629886
rate 4.504972580142934
congratulations 4.504972580142934
service 4.468604935972059


What tokens are most predictive of a message being HAM?

In [None]:
for token,score in sorted(token_score.items(),key=lambda item: item[1])[0:20]:
  print(token,score)

da -3.0244338776940785
oh -2.825432631701468
come -2.811575597040042
much -2.7975238435843917
too -2.76881373770196
way -2.6984329409401604
wat -2.677379531742328
already -2.6114215639505307
say -2.599992868126908
happy -2.564901548315638
yeah -2.564901548315638
really -2.5408039967365776
home -2.509841771132611
..+ -2.3246303601545515
his -2.261219134517416
thing -2.2449586136456356
i've -2.2449586136456356
but -2.201889756459813
did -2.1414179347047955
my -2.1347844112091616


How many documents are misclassified by the model?

In [None]:
ham_correct_count = 0
spam_correct_count = 0
ham_mis_count = 0
spam_mis_count = 0

for doc in sms_corpus:
  label = doc[0]
  score = score_message(doc[1])
  if label == 'ham':
    if score >0:
      ham_mis_count += 1
    else:
      ham_correct_count +=1
  else:
    if score <0:
      spam_mis_count +=1
    else:
      spam_correct_count +=1
print(f'Number of Ham misclassified: {ham_mis_count}')
print(f'Number of Ham correctly classified: {ham_correct_count}')
print(f'Number of Spam misclassified: {spam_mis_count}')
print(f'Number of Spam correctly classified: {spam_correct_count}')

Number of Ham misclassified: 448
Number of Ham correctly classified: 4379
Number of Spam misclassified: 13
Number of Spam correctly classified: 734
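
The four counts above can be condensed into the usual summary metrics. The sketch below is one way to do that, with SPAM treated as the positive class; the variable names are hypothetical and the numbers are copied from the output. Note that these counts were measured on the same messages used to compute the token scores.

```python
# Counts copied from the output above.
ham_mis, ham_ok = 448, 4379     # HAM scored as SPAM / HAM scored as HAM
spam_mis, spam_ok = 13, 734     # SPAM scored as HAM / SPAM scored as SPAM

total = ham_mis + ham_ok + spam_mis + spam_ok

accuracy = (ham_ok + spam_ok) / total            # share of all messages classified correctly
spam_precision = spam_ok / (spam_ok + ham_mis)   # share of SPAM predictions that are SPAM
spam_recall = spam_ok / (spam_ok + spam_mis)     # share of SPAM messages that were caught

print(f'total={total}')
print(f'accuracy={accuracy:.3f} precision={spam_precision:.3f} recall={spam_recall:.3f}')
```

With these counts, accuracy is about 0.917, SPAM precision about 0.621 and SPAM recall about 0.983: the model misses very little SPAM, but a noticeable share of what it flags is actually HAM.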
