# Tokenizing Text Message Data to Identify and Predict Spam

## Introduction 

In this project I used basic machine learning and text vectorization to predict whether a text message is spam. For the big data aspect of the project, I analyzed a data set collected by the spam archive. The site stated that it was a list of text messages that many "average" citizens had received over the past year. I classified the messages in this data set in order to estimate the percentage of messages an average user receives that are spam. The purpose was to find out how frequently companies send advertisements to phone users.

I was interested in this project because text messages are relevant to me (and everyone). I also noticed that many people (including me) complain about receiving "too many" spam text messages. This project estimates the actual percentage of the messages an average user receives that are classified as spam.

In [3]:
import statistics
import pandas as pd
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


I imported the SMS spam CSV file into a pandas DataFrame using read_csv. In this table, spam messages are labeled spam and non-spam messages are labeled ham. This organization makes it much easier to split the data into training and test sets along with their labels. The data comes from the UCI Machine Learning Repository, where it is called the SMS Spam Collection Data Set.
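As a small illustration of the table layout described above, the sketch below builds a toy DataFrame shaped like the first rows of spam.csv and tidies it. The toy rows and the column names label and text are hypothetical choices for readability; the notebook itself keeps the original v1 and v2 columns.

```python
import pandas as pd

# Toy frame mimicking the layout of spam.csv (hypothetical rows)
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Ok lar...", "Free entry in 2 a wkly comp", "See you at 5"],
    "Unnamed: 2": [None, None, None],
})

# Keep only the label and text columns and give them descriptive names
tidy = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})

# Count how many messages fall into each class
class_counts = tidy["label"].value_counts()
print(class_counts)
```

Checking the class counts early is useful because it shows how unbalanced the spam and ham classes are before any model is trained.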

In [4]:
# split into train and test
# (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
from sklearn.model_selection import train_test_split
data_train, data_test, labels_train, labels_test = train_test_split(
    df.v2,
    df.v1,
    test_size=0.1,
    random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
data_train_transformed = vectorizer.fit_transform(data_train)
data_test_transformed  = vectorizer.transform(data_test)

print (data_train_transformed[:10])


# slim the data for training and testing
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(data_train_transformed, labels_train)
data_train_transformed = selector.transform(data_train_transformed).toarray()
data_test_transformed  = selector.transform(data_test_transformed).toarray()


  (0, 472)	0.271851684175
  (0, 4806)	0.271851684175
  (0, 3863)	0.090064750337
  (0, 6363)	0.133268350836
  (0, 5110)	0.110581844957
  (0, 7936)	0.160644082652
  (0, 1425)	0.182835174112
  (0, 2538)	0.115644909871
  (0, 1743)	0.11141201918
  (0, 7245)	0.146013424599
  (0, 478)	0.271851684175
  (0, 1851)	0.21934953621
  (0, 1383)	0.148716702655
  (0, 7682)	0.15530772968
  (0, 8150)	0.118013534293
  (0, 3224)	0.122300733507
  (0, 1970)	0.271851684175
  (0, 7705)	0.324568003175
  (0, 5174)	0.102426140378
  (0, 5819)	0.250504415332
  (0, 8156)	0.0996053050366
  (0, 3137)	0.0984238824146
  (0, 7200)	0.163248247172
  (0, 4447)	0.460284912592
  (1, 1790)	0.369929821608
  :	:
  (8, 3449)	0.172818514477
  (8, 3572)	0.116564662538
  (8, 5176)	0.190803658853
  (8, 7265)	0.200793180274
  (8, 1028)	0.215127904721
  (8, 4780)	0.223246772394
  (8, 1117)	0.122793433021
  (8, 4202)	0.143928572963
  (8, 4332)	0.181598012047
  (8, 4110)	0.129529092589
  (8, 4590)	0.200793180274
  (8, 4285)	0.30651169741

In [6]:
print(data_test)

3245    Funny fact Nobody teaches volcanoes 2 erupt, t...
944     I sent my scores to sophas and i had to do sec...
1044    We know someone who you know that fancies you....
2484    Only if you promise your getting out as SOON a...
812     Congratulations ur awarded either å£500 of CD ...
2973           I'll text carlos and let you know, hang on
2991            K.i did't see you.:)k:)where are you now?
2942               No message..no responce..what happend?
230     Get down in gandhipuram and walk to cross cut ...
1181                           You flippin your shit yet?
1912    For real tho this sucks. I can't even cook my ...
1992    Free tones Hope you enjoyed your new content. ...
5435                    I'm wif him now buying tix lar...
4805          Call me when u finish then i come n pick u.
401               Dear how is chechi. Did you talk to her
1859            What's up. Do you want me to come online?
1344                     Were somewhere on Fredericksburg
2952    URGENT

In this section, I separated my data into four groups: training messages, training labels, test messages, and test labels. The train_test_split function shuffles the rows before splitting them, and the fixed random_state makes that shuffle reproducible. Ten percent of the messages are held out for testing.
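The split can be tried on a tiny example. This is a minimal sketch with made-up messages and labels (not the real data set), using the same test_size and random_state values as the notebook.

```python
from sklearn.model_selection import train_test_split

# Ten made-up messages; every fifth one is labeled spam (hypothetical data)
messages = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

# Hold out 10% of the rows for testing; rows are shuffled before the split
x_train, x_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.1, random_state=42)

# Repeating the call with the same random_state gives the same split
x_train2, x_test2, _, _ = train_test_split(
    messages, labels, test_size=0.1, random_state=42)

print(len(x_train), len(x_test))
print(x_test == x_test2)
```

Fixing random_state matters when comparing models, since otherwise each run would be scored on a different test set.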

After creating the separate lists, I used functions from sklearn to vectorize the text. Vectorization is the process of converting text into numbers. In this case, TfidfVectorizer gives each word in a message a weight based on how often it appears in that message and how rare it is across all messages; with max_df=0.5, words that appear in more than half of the messages are ignored. The classifier later learns from the labels which words are commonly associated with spam messages. If those words appear in a message, the program will most likely classify the message as spam.
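To make the weighting concrete, here is a minimal sketch on a three-message toy corpus (the messages are invented for illustration). It uses TfidfVectorizer with its default settings rather than the sublinear_tf and max_df options from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented messages: "free" appears in one of them, "call" in two
corpus = ["free prize call now", "call me later", "see you later"]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)      # sparse matrix, one row per message
vocab = vec.vocabulary_            # maps each word to its column index

# Weights for the first message as a plain array
row0 = X[0].toarray().ravel()

# The rarer word gets the larger weight within the same message
print("free:", row0[vocab["free"]])
print("call:", row0[vocab["call"]])
```

Both words occur once in the first message, so the difference in weight comes only from how many messages contain each word.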

The next step is slimming the data. To do this, I imported SelectPercentile from sklearn and kept only the 10% of features that score highest on the ANOVA F-test (f_classif), which measures how well each feature separates spam from ham. The reduced data is then converted with toarray. The vectorizer returns a sparse matrix, which stores only the nonzero values and treats every other entry as an implicit zero. Calling toarray turns it into a regular dense array in which the zeros are stored explicitly. No information is lost either way, but the GaussianNB classifier used below expects dense input. After the vectorized text has been properly fitted and reduced, it can be used to predict whether or not a given message is spam.
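The relationship between the sparse and dense forms can be checked directly. The sketch below uses a small hand-made matrix (the values are arbitrary) and scipy's csr_matrix, which is the sparse format sklearn's vectorizers return.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense matrix with mostly zero entries (arbitrary values)
dense = np.array([[0.0, 0.5, 0.0],
                  [0.3, 0.0, 0.0]])

# The sparse form stores only the nonzero entries
sparse = csr_matrix(dense)
stored = sparse.nnz

# toarray() rebuilds the full matrix, zeros included
restored = sparse.toarray()

print("stored values:", stored)
print("same as original:", np.array_equal(restored, dense))
```

The sparse form saves memory when most entries are zero, which is typical for text features; the dense form is only needed here because of the classifier's input requirements.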

I found help for this code online. The online tutorial gave me an explanation along with examples of how to use sklearn for text vectorization. The online example used a count vectorizer instead of a TF-IDF vectorizer. A count vectorizer represents each message by how many times each word occurs in it, while a TF-IDF vectorizer additionally down-weights words that occur in many messages. This is a link to the tutorial --> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
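The difference between the two vectorizers shows up on a two-message toy example (the messages are invented). This is a sketch for comparison only; the TF-IDF side uses sublinear_tf=True as in the notebook but leaves max_df at its default.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["win win win now", "call me now"]

# CountVectorizer: raw number of occurrences of each word per message
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()
win_col = count_vec.vocabulary_["win"]
print("count of 'win' in message 0:", counts[0, win_col])

# TfidfVectorizer: damped, reweighted, and length-normalized values
tfidf_vec = TfidfVectorizer(sublinear_tf=True)
tfidf = tfidf_vec.fit_transform(docs).toarray()
print("tf-idf of 'win' in message 0:", tfidf[0, tfidf_vec.vocabulary_["win"]])
```

With the default settings each TF-IDF row is scaled to unit length, so long and short messages end up on a comparable scale.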

In [5]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()
clf.fit(data_train_transformed, labels_train)
predictions = clf.predict(data_test_transformed)

print(accuracy_score(labels_test, predictions))

0.97311827957


After training the classifier, I tested the accuracy of its predictions on the held-out test data. To do this, I imported GaussianNB and accuracy_score from sklearn. GaussianNB is a Naive Bayes classifier. Naive Bayes is a family of supervised learning algorithms that apply Bayes' theorem under the assumption that the features are independent of each other given the class. Here, I fit the classifier on the transformed training data and predict whether each transformed test message is spam or ham. The accuracy score is the fraction of test messages that were classified correctly. In this case, the accuracy is above 97%, which means the model predicts whether a text message is spam or not very accurately on this test set.
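Accuracy is simple enough to compute by hand. The sketch below uses made-up labels and a hypothetical helper named accuracy; it also computes the score of always predicting ham, which is a useful baseline when most messages are ham.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Made-up labels for five messages
true_labels = ["ham", "ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "spam"]

score = accuracy(true_labels, predicted)

# Baseline: label every message as ham, regardless of content
baseline = accuracy(true_labels, ["ham"] * len(true_labels))

print("model accuracy:", score)
print("always-ham baseline:", baseline)
```

A model is only useful to the extent that it beats this kind of baseline, so it is worth reporting both numbers together.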

In [7]:
# Classify a single message typed in by the user,
# reusing the fitted vectorizer, selector, and classifier
get_input = input("Enter a text message: ")
text = [get_input]
text_transformed = vectorizer.transform(text)
text_test_transformed = selector.transform(text_transformed).toarray()

prediction = clf.predict(text_test_transformed)
print("This text message is", prediction)

Enter a text message: It's Freddie from DoSomething! Today we are offering a chance to win a $3,000 scholarship for members who stand against bullying. Wanna learn how? Yes or No
This text message is ['spam']


With this model, I can predict whether a new message should be classified as spam or not. In the example above, I entered a spam message that I received from a college website. My program correctly classified the message as spam.
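The same steps can be wrapped in a reusable function. The following is a minimal sketch trained on four invented messages so that it runs on its own; the names toy_vectorizer, toy_clf, and classify_message are hypothetical, and unlike the notebook it skips the SelectPercentile step because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Four invented training messages (not from the real data set)
train_texts = [
    "win a free prize now",
    "free entry call now",
    "are you coming home",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Fit the vectorizer and classifier once, then reuse them
toy_vectorizer = TfidfVectorizer()
toy_clf = GaussianNB()
toy_clf.fit(toy_vectorizer.fit_transform(train_texts).toarray(), train_labels)

def classify_message(message):
    """Return the predicted label for a single message string."""
    features = toy_vectorizer.transform([message]).toarray()
    return toy_clf.predict(features)[0]

label = classify_message("free prize, call now")
print("This text message is", label)
```

Returning the first element of the prediction array gives a plain label such as spam instead of a one-element array.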

In [9]:
# Read one message per line, ignoring bytes that cannot be decoded
with open("csvfile2.csv", "r", errors="ignore") as sc:
    data_list = [line for line in sc]

spam_list = []
ham_list = []

for message in data_list:
    # Vectorize and reduce each message with the already fitted objects
    text_transformed = vectorizer.transform([message])
    text_test_transformed = selector.transform(text_transformed).toarray()

    prediction = clf.predict(text_test_transformed)[0]

    if prediction == 'ham':
        ham_list.append(message)
    elif prediction == 'spam':
        spam_list.append(message)

spam_len = len(spam_list)
ham_len = len(ham_list)
print("There are a total of", len(data_list), "messages in this dataset.")

# Note: this divides the spam count by the ham count, so the figure is the
# spam-to-ham ratio; dividing by len(data_list) instead would give the
# share of all messages.
average_spam = spam_len / ham_len * 100

print("Around", str(average_spam) + "% of the messages in this dataset are spam.")

There are a total of 5574 messages in this dataset.
Around 14.549938347718866% of the messages in this dataset are spam.


In this final step, I used messages from the spam archive to estimate how much of what a user receives is spam. I did this by sorting the messages into separate lists based on their predicted classification. After running this test, I found a figure of around 14%. Note that the code divides the number of spam messages by the number of ham messages, so this is the ratio of spam to ham rather than the share of all messages; the share of all messages is somewhat lower. This still seemed like a pretty significant percentage. Given the capitalist, business-based lifestyle of America, it makes sense that people would receive a large number of advertisements encouraging them to buy or show interest in certain products.

It would also have been interesting to run this test on other sources such as emails and phone calls. With this information, I could look into where people receive the most spam. Are people receiving large amounts of spam via text message, phone call, email, or all three?

In conclusion, I used machine learning and text vectorization to create a program that can predict (with high accuracy) whether a message a user receives is spam. After analyzing a large group of texts, I found about 14 predicted spam messages for every 100 ham messages, which suggests that a bit less than 14% of the messages the average phone user receives are spam.
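The two ways of expressing the result can be compared with a little arithmetic. The counts below are not reported in the notebook; they are inferred from the printed output under the assumption that every one of the 5574 messages was labeled either spam or ham, in which case 708 spam and 4866 ham are the only whole numbers that reproduce the printed figure.

```python
# Counts inferred from the printed output (an assumption, see above)
spam_len = 708
ham_len = 4866
total = spam_len + ham_len

# What the notebook computes: spam relative to ham
spam_to_ham_ratio = spam_len / ham_len * 100

# Share of all messages that were classified as spam
spam_share = spam_len / total * 100

print(f"spam-to-ham ratio: {spam_to_ham_ratio:.2f}%")
print(f"share of all messages: {spam_share:.2f}%")
```

Under this assumption the share of all messages comes out at about 12.7%, a little below the spam-to-ham ratio.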