# Spam or Ham?

## Lab Assignment Two: Exploring Text Data 

### Justin Ledford, Luke Wood, Traian Pop 
___

## Business Understanding

### Data Background
SMS messages play a huge role in a person's life, and the confidentiality and integrity of said messages are of the highest priority to mobile carriers around the world. Due to this fact, many unlawful individuals and groups try and take advantange of the average consumer by flooding their inbox with spam, and while the majority of people successfully avoid it, there are people out there affected negatively by falling for false messages.  

The data we selected is a compilation of 5574 SMS messages acquired from a variety of different sources, broken down in the following way: 452 of the messages came from the Grumbletext Web Site, 3375 of the messages were taken from the NUS SMS Corpus (database with legitimate message from the University of Singapore), 450 messages collected from Caroline Tag's PhD Thesis, and the last 1324 messages were from the SMS Spam Corpus v.0.1 Big. 

Overall there were 4827 "ham" messages and 747 "spam" messages, and about 92,000 words.

### Purpose
This data was collected initially for studies on deciphering the differences between a spam or ham (legitimate) messages. Uses for this research can involve advanced spam filtering technology or improved data sets for machine learning programs. However, a slight problem with this data set, as with most localized language-based data sets, is that due to the relatively small area of sampling, there are a lot of regional data points (such as slang, acronyms, etc) that can be considering "useless" data if a much more generalized data set is wanted. For our specific project however, we are keeping all this data in order for us to analyze it and get a better understanding of our data.
___

## Data Encoding

### Extracting the Data

In [2]:
import pandas as pd
import numpy as np
import requests
import re
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

descriptors_url = 'https://raw.githubusercontent.com/LukeWoodSMU/TextAnalysis/master/data/SMSSpamCollection'
descriptors = requests.get(descriptors_url).text
texts = []


for line in descriptors.splitlines():
    texts.append(line.rstrip().split("\t"))

After the first look at the data we noticed a lot of phone numbers. Since almost every number was unique we concluded that the numbers were irrelevant to consider as words. We considered grouping all number tokens into one "word" and analyze the presence of words, but we decided to first start by just removing the numbers.

In [42]:
# Remove numbers
texts = list(zip([a for a,b in texts], [re.sub('[0-9-]3+.*', ' ', b) for a,b in texts]))
texts[:10]

[('ham',
  'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 ('ham', 'Ok lar... Joking wif u oni...'),
 ('spam', 'Free entry in NUMBER_TOKEN'),
 ('ham', 'U dun say so early hor... U c already then say...'),
 ('ham', "Nah I don't think he goes to usf, he lives around here though"),
 ('spam', "FreeMsg Hey there darling it's been NUMBER_TOKEN"),
 ('ham',
  'Even my brother is not like to speak with me. They treat me like aids patent.'),
 ('ham',
  "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *NUMBER_TOKEN"),
 ('spam',
  'WINNER!! As a valued network customer you have been selected to receivea £NUMBER_TOKEN'),
 ('spam', 'Had your mobile NUMBER_TOKEN')]

Converting the data from raw text into a sparse encoded bag-of-words representation. 

In [43]:
# Create bag of words
count_vect = CountVectorizer(stop_words='english')
bag_words = count_vect.fit_transform([t[1] for t in texts])
count_vect_ham = CountVectorizer(stop_words='english')
bag_words_ham = count_vect_ham.fit_transform([t[1] for t in texts if t[0] == 'ham'])
count_vect_spam = CountVectorizer(stop_words='english')
bag_words_spam = count_vect_spam.fit_transform([t[1] for t in texts if t[0] == 'spam'])

In [5]:
# Words counts per row in pandas dataframe
df = pd.DataFrame(data=bag_words.toarray(),columns=count_vect.get_feature_names())

In [6]:
# print out 10 most common words in our data
df.sum().sort_values()[-10:]

good    243
like    244
know    262
free    268
ll      269
ok      292
lt      316
gt      318
just    368
ur      382
dtype: int64

It is interesting to see that 40% of the top words of our data set are "slang" words, as one would assume in a texting database there would be a higher percentage of abbreviations.

Converting the data into a sparse encoded tf-idf representation.

In [41]:
# Get tfidf
tfidf_vect = TfidfVectorizer(stop_words='english')
tfidf_mat = tfidf_vect.fit_transform([t[1] for t in texts])
tfidf_df = pd.DataFrame(data=tfidf_mat.toarray(), columns=tfidf_vect.get_feature_names())
tfidf_df.head(n=5)

Unnamed: 0,00,000,000pes,008704050406,0089,0121,0125698789,012number_token,02,0207,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
#print out 10 most common words in our data
tfidf_df.sum().sort_values()[-10:]

time     56.945668
know     59.017660
good     59.766063
lt       64.356000
ur       64.382773
gt       64.679294
come     67.380358
just     72.266374
ll       79.736271
ok      103.308348
dtype: float64

When analyzing the top 10 most common words in our data with the TF-IDF representation, our top winners change. Instead of the previous "ur" being #1, it seems that it drops all the way down to #7, while "ok" jumps from #5 to #1. This means that "ok" is a more more relevant word throughout the documents compared to "ur" while "ur" itself just happens to have more individual hits. 

Using tf-idf we can see which words are most relevant for ham and spam messages.

In [9]:
tfidf_vect_ham = TfidfVectorizer(stop_words='english')
tfidf_mat_ham = tfidf_vect_ham.fit_transform([t[1] for t in texts if t[0] == 'ham'])
tfidf_df_ham = pd.DataFrame(data=tfidf_mat_ham.toarray(), columns=tfidf_vect_ham.get_feature_names())
tfidf_df_ham.sum().sort_values()[-10:]

home      54.427173
know      54.526118
sorry     55.010803
good      57.433290
lt        62.619921
gt        62.980016
just      63.966590
come      65.946440
ll        77.769828
ok       100.941869
dtype: float64

In [10]:
tfidf_vect_spam = TfidfVectorizer(stop_words='english')
tfidf_mat_spam = tfidf_vect_spam.fit_transform([t[1] for t in texts if t[0] == 'spam'])
tfidf_df_spam = pd.DataFrame(data=tfidf_mat_spam.toarray(), columns=tfidf_vect_spam.get_feature_names())
tfidf_df_spam.sum().sort_values()[-10:]

reply     14.216356
stop      14.292372
claim     15.725948
urgent    15.902412
prize     16.668448
text      17.226986
txt       18.288373
ur        18.304256
mobile    18.412027
free      26.184182
dtype: float64