## The biggest thing I learned from this project is that I have to be careful when counting words in a string, because taking the length of the string counts every character, whitespace included. A better way is to convert the string into a list of individual words with str.split(), which splits on whitespace by default, and then take the length of that list
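
A small illustration of this point, using one of the messages from the dataset. `len()` on a string counts characters (spaces included), while `len()` on the result of `str.split()` counts words. The variable names are only for illustration.

```python
# len() on a string counts characters, including the spaces
message = "Welp apparently he retired"
n_characters = len(message)

# str.split() with no arguments splits on runs of whitespace,
# so the length of the resulting list is the number of words
words = message.split()
n_words = len(words)

print(n_characters, n_words)
```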

## We want to build a filter that classifies messages as either spam or non-spam. To this end, we will teach the computer which words are associated with spam messages, and then let it analyze new messages based on the words they contain and assign spam/non-spam probabilities

In [1]:
import pandas as pd
import re

In [2]:
df = pd.read_csv("SMSSpamCollection",delimiter="\t",header=None)
df.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df.columns = ["Label","SMS"]
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df["Label"].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## We see that there are more non-spam messages than spam: about 86.6% ham versus 13.4% spam

In [5]:
df = df.sample(frac=1, random_state=1)
df.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [6]:
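split_index = round(len(df) * 0.80)   # one shared cut-off so no row is skipped or duplicated
train_data = df.iloc[:split_index]    # first 80% of the shuffled rows go to the training set
test_data = df.iloc[split_index:]     # remaining 20% go to the test set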

In [7]:
print(train_data["Label"].value_counts(normalize=True)*100)
print(test_data["Label"].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


## We note that the percentages of spam and non-spam messages in the training and test sets are similar to those of the original dataset

## Next we clean the SMS column by replacing non-word characters (e.g. punctuation) with spaces and converting everything to lower case. We then apply str.split() to the column, which turns each message into a list of words and discards the whitespace
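
To make the cleaning step concrete, here is a minimal sketch on a single string using the standard-library `re` module. It should behave like the pandas chain does for one message. The helper name `clean` is hypothetical.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lowercase, split into words."""
    # \W matches any character that is not a letter, digit or underscore
    message = re.sub(r"\W", " ", message)
    # split() with no arguments also discards the extra whitespace
    return message.lower().split()

print(clean("There's a card, OK?"))
```

Note that the apostrophe is treated as punctuation, so "There's" becomes the two tokens "there" and "s".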

In [8]:
train_data["SMS"] = train_data["SMS"].str.replace(r"\W"," ",regex=True).str.lower().str.split()  # regex=True is required in newer pandas
train_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Label,SMS
1078,ham,"[yep, by, the, pretty, sculpture]"
4028,ham,"[yes, princess, are, you, going, to, make, me,..."
958,ham,"[welp, apparently, he, retired]"
4642,ham,[havent]
4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."
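
The `SettingWithCopyWarning` shown above appears because `train_data` was created by slicing `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a toy frame, is to take an explicit `.copy()` of the slice before modifying it. The names `toy` and `toy_train` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Label": ["ham", "spam", "ham"],
                    "SMS": ["Ok lar", "Free entry!!", "See u there"]})

# An explicit copy makes toy_train an independent DataFrame
toy_train = toy.iloc[:2].copy()
toy_train["SMS"] = (toy_train["SMS"]
                    .str.replace(r"\W", " ", regex=True)
                    .str.lower()
                    .str.split())

print(toy_train["SMS"].tolist())  # cleaned word lists
print(toy["SMS"].tolist())        # the original frame is untouched
```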


In [9]:
vocabulary = list()
def select(row):
    # Append each word the first time it is seen, preserving first-seen order
    for elt in row["SMS"]:
        if elt not in vocabulary:
            vocabulary.append(elt)
    return None

In [13]:
train_data.apply(select,axis="columns")

1078    None
4028    None
958     None
4642    None
4674    None
        ... 
1982    None
5180    None
4020    None
371     None
3482    None
Length: 4458, dtype: object

In [14]:
len(vocabulary)

7783

In [15]:
## This is DataQuest's approach. Same as mine above
#vocab = list()
#train_data["SMS"] = train_data["SMS"].str.split()
#for i in train_data["SMS"]:
#    for j in i:
#        vocab.append(j)
#vocab = set(vocab)
#vocab = list(vocab)

In [16]:
word_counts_per_sms = {}
for elt in vocabulary:
    newlist = list()
    for row in train_data["SMS"]:
        counts = row.count(elt)
        newlist.append(counts)
    word_counts_per_sms[elt] = newlist
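
The loop above scans every message once per vocabulary word. An alternative sketch that visits each message only once is shown below on toy data. It should produce the same dictionary of per-message counts. Here `messages` and `vocab` are hypothetical stand-ins for `train_data["SMS"]` and `vocabulary`.

```python
messages = [["yep", "by", "the"], ["the", "the", "card"], []]
vocab = ["yep", "by", "the", "card"]

# Start every word with a zero count for every message
counts = {word: [0] * len(messages) for word in vocab}

# Visit each message once and bump the count of each word it contains
for index, sms in enumerate(messages):
    for word in sms:
        counts[word][index] += 1

print(counts)
```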

In [17]:
len(word_counts_per_sms)

7783

In [18]:
word_count = pd.DataFrame(word_counts_per_sms)

In [19]:
word_count.head()

Unnamed: 0,yep,by,the,pretty,sculpture,yes,princess,are,you,going,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,1,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
train_data.reset_index(drop=True,inplace=True)
train_data.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [21]:
train_data_clean = pd.concat([train_data,word_count],axis="columns")

In [22]:
train_data_clean.head()

Unnamed: 0,Label,SMS,yep,by,the,pretty,sculpture,yes,princess,are,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
p_ham = train_data_clean["Label"].value_counts(normalize=True)["ham"]
p_spam = 1 - p_ham
print(p_ham,p_spam)

0.8654104979811574 0.13458950201884257


In [24]:
ham_data = train_data_clean[train_data_clean["Label"]=="ham"]
spam_data = train_data_clean[train_data_clean["Label"]=="spam"]

In [25]:
n_spam = 0
n_ham = 0

for row in ham_data["SMS"]:
    n_ham += len(row)
for row in spam_data["SMS"]:
    n_spam += len(row)

n_vocabulary = len(vocabulary)
alpha = 1
print(n_spam,n_ham,n_vocabulary)

15190 57237 7783


In [26]:
spam_dict = {unique_word:0 for unique_word in vocabulary}
ham_dict = {unique_word:0 for unique_word in vocabulary}
for word in vocabulary:
    n_word_given_spam = spam_data[word].sum()
    n_word_given_ham = ham_data[word].sum()
    spam_dict[word] = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    ham_dict[word] = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
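
The smoothed estimate computed in the loop can be checked by hand. The sketch below plugs in the totals printed earlier (`n_spam = 15190`, `n_vocabulary = 7783`, `alpha = 1`) for a word that never appears in a spam message. The helper name `smoothed_probability` is hypothetical.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# A word with zero occurrences in spam still gets a small non-zero probability
p_unseen = smoothed_probability(0, n_class=15190, n_vocabulary=7783)
print(p_unseen)
```

The result is 1/22973, the value stored in `spam_dict` for words such as 'yep' that occur only in ham messages. Without smoothing the estimate would be zero, which would wipe out the whole product for any message containing that word.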

In [27]:
spam_dict

{'yep': 4.3529360553693465e-05,
 'by': 0.0015235276193792714,
 'the': 0.006877638967483567,
 'pretty': 4.3529360553693465e-05,
 'sculpture': 4.3529360553693465e-05,
 'yes': 0.0007399991294127889,
 'princess': 4.3529360553693465e-05,
 'are': 0.0028294084359900755,
 'you': 0.011143516301745527,
 'going': 0.00017411744221477386,
 'to': 0.023810560222870324,
 'make': 0.00047882296609062813,
 'me': 0.0010882340138423366,
 'moan': 4.3529360553693465e-05,
 'welp': 4.3529360553693465e-05,
 'apparently': 4.3529360553693465e-05,
 'he': 4.3529360553693465e-05,
 'retired': 4.3529360553693465e-05,
 'havent': 4.3529360553693465e-05,
 'i': 0.002219997388238367,
 'forgot': 4.3529360553693465e-05,
 '2': 0.007225873851913115,
 'ask': 4.3529360553693465e-05,
 'ü': 4.3529360553693465e-05,
 'all': 0.001218822095503417,
 'smth': 4.3529360553693465e-05,
 'there': 0.0006094110477517085,
 's': 0.0027858790754363818,
 'a': 0.013407043050537588,
 'card': 0.00017411744221477386,
 'on': 0.004918817742567362,
 'da'

In [28]:
def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= spam_dict[word]
            p_ham_given_message *= ham_dict[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
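
One caveat with multiplying many small probabilities is that the product can underflow to 0.0 for long messages, in which case both scores tie. A common remedy, not used in this notebook, is to add logarithms instead of multiplying. The sketch below uses made-up probability tables, and `classify_log` and its arguments are hypothetical names.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log-scores instead of raw products to avoid underflow."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in spam_probs:  # words outside the vocabulary are ignored
            log_spam += math.log(spam_probs[word])
            log_ham += math.log(ham_probs[word])
    if log_ham > log_spam:
        return "ham"
    if log_ham < log_spam:
        return "spam"
    return "needs human classification"

# Toy tables, made up for illustration only
toy_spam = {"win": 0.05, "money": 0.04, "see": 0.001}
toy_ham = {"win": 0.002, "money": 0.003, "see": 0.02}

print(classify_log("WIN money!!", 0.13, 0.87, toy_spam, toy_ham))
print(classify_log("See u there", 0.13, 0.87, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing the summed logs gives the same decision as comparing the products whenever the products are representable.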

In [29]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300844e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [30]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888126e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [31]:
classify("Yep, by the pretty sculpture")

P(Spam|message): 1.1631822187809433e-19
P(Ham|message): 1.9794443217005706e-17
Label: Ham


In [32]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= spam_dict[word]
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return "ham"
    elif p_ham_given_message < p_spam_given_message:
        return "spam"
    else:
        return "needs human classification"

In [33]:
test_data['predicted'] = test_data['SMS'].apply(classify_test_set)
test_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Label,SMS,predicted
2131,ham,Later i guess. I needa do mcat study too.,ham
3418,ham,But i haf enuff space got like 4 mb...,ham
3424,spam,Had your mobile 10 mths? Update to latest Oran...,spam
1538,ham,All sounds good. Fingers . Makes it difficult ...,ham
5393,ham,"All done, all handed in. Don't know if mega sh...",ham


In [34]:
correct = 0
total = len(test_data)
    
for row in test_data.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
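
The same tally can be written without an explicit loop. A sketch on a toy frame (the data and the name `results` are made up for illustration) compares the two columns element-wise and averages the resulting booleans.

```python
import pandas as pd

results = pd.DataFrame({
    "Label":     ["ham", "spam", "ham", "spam"],
    "predicted": ["ham", "spam", "spam", "spam"],
})

is_correct = results["Label"] == results["predicted"]  # boolean Series
correct = int(is_correct.sum())                        # True counts as 1
accuracy = is_correct.mean()                           # fraction of matches
misclassified = results[~is_correct]                   # rows the filter got wrong

print(correct, accuracy)
print(misclassified)
```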


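## The filter classifies about 98.7% of the test messages correctly (1100 of 1114), well above the roughly 86.8% that always guessing ham would score on this test set. I am very surprised by this. Let's isolate the messages that were incorrectly predicted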

In [35]:
for row in test_data.iterrows():
    row = row[1]
    if row["Label"] != row["predicted"]:
        print(row["predicted"],"|||",row["SMS"])
        print("-"*80)

ham ||| Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
--------------------------------------------------------------------------------
ham ||| More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
--------------------------------------------------------------------------------
spam ||| Unlimited texts. Limited minutes.
--------------------------------------------------------------------------------
spam ||| 26th OF JULY
--------------------------------------------------------------------------------
spam ||| Nokia phone is lovly..
--------------------------------------------------------------------------------
needs human classification ||| A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a t