# Spam or Ham?

## Lab Assignment Two: Exploring Text Data 

### Justin Ledford, Luke Wood, Traian Pop 
___

## Business Understanding

### Data Background
SMS messages play a huge role in a person's life, and the confidentiality and integrity of these messages are a high priority for mobile carriers around the world. Because of this, many unlawful individuals and groups try to take advantage of the average consumer by flooding their inbox with spam. While the majority of people successfully avoid it, some are harmed by falling for fraudulent messages.

The data we selected is a compilation of 5574 SMS messages acquired from a variety of sources: 425 messages came from the Grumbletext Web site, 3375 were taken from the NUS SMS Corpus (a database of legitimate messages from the National University of Singapore), 450 were collected from Caroline Tagg's PhD thesis, and the last 1324 came from the SMS Spam Corpus v.0.1 Big.

Overall there were 4827 "ham" messages and 747 "spam" messages, and about 92,000 words.
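
As a quick sanity check on these totals, the sketch below counts labels in a few tab-separated lines and computes the class balance from the reported figures. The `sample_lines` list and the `count_labels` helper are hypothetical names used only for illustration, and the sample messages are abbreviated.

```python
from collections import Counter

# Hypothetical stand-in for a few lines of the tab-separated corpus file
# (label, then a tab, then the message text).
sample_lines = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts",
    "ham\tU dun say so early hor... U c already then say...",
]


def count_labels(lines):
    """Count how often each label (the text before the first tab) occurs."""
    return Counter(line.split("\t", 1)[0] for line in lines)


label_counts = count_labels(sample_lines)
print(label_counts)

# Class balance computed from the totals reported in the text.
ham_total, spam_total = 4827, 747
spam_share = spam_total / (ham_total + spam_total)
print(f"{ham_total + spam_total} messages, {spam_share:.1%} spam")
```

With roughly one spam message for every six or seven ham messages, the classes are noticeably imbalanced, which is worth keeping in mind when reading any later accuracy figures.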

### Purpose
This data was initially collected for studies on distinguishing spam from ham (legitimate) messages. Uses for this research include advanced spam filtering technology and improved data sets for machine learning programs. However, a slight problem with this data set, as with most localized language-based data sets, is that the relatively small sampling area produces many regional data points (such as slang and acronyms) that can be considered "useless" if a more generalized data set is wanted. For our specific project, however, we keep all of this data so that we can analyze it and get a better understanding of the corpus.
___

## Data Encoding

### Extracting the Data

In [1]:
import pandas as pd
import numpy as np
import requests
import re
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

descriptors_url = 'https://raw.githubusercontent.com/LukeWoodSMU/TextAnalysis/master/data/SMSSpamCollection'
descriptors = requests.get(descriptors_url).text
texts = []


for line in descriptors.splitlines():
    texts.append(line.rstrip().split("\t"))

After a first look at the data we noticed a lot of phone numbers. Since almost every number was unique, we concluded that the numbers were not useful as words. We considered grouping all number tokens into one "word" and analyzing its presence, but we decided to start by simply removing the numbers.
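
The two options described above can be sketched as follows. This is a minimal illustration, not the lab's own cell: the `DIGIT_RUN` pattern, the `NUMBERTOKEN` placeholder, and both helper functions are assumptions introduced here.

```python
import re

# Hypothetical placeholder word; the name is an assumption, not from the lab.
NUM_TOKEN = "NUMBERTOKEN"

# One or more digits, optionally continued by hyphen-separated digit groups
# (so "555-1234" is treated as a single number).
DIGIT_RUN = re.compile(r"\d+(?:-\d+)*")


def strip_numbers(text):
    """Option 1: delete every digit run, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", DIGIT_RUN.sub(" ", text)).strip()


def group_numbers(text):
    """Option 2: replace every digit run with one shared placeholder word."""
    return DIGIT_RUN.sub(NUM_TOKEN, text)


message = "Text FA to 87121 to receive entry"
print(strip_numbers(message))   # Text FA to to receive entry
print(group_numbers(message))   # Text FA to NUMBERTOKEN to receive entry
```

Grouping keeps the information that a number was present, which may itself be a useful signal, while stripping discards it entirely.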

In [2]:
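# Remove numbers.
# Note: this pattern matches a digit or hyphen followed by one or more "3"
# characters and then the rest of the message, so most digits are left in
# place (see the output below).
texts = [(label, re.sub('[0-9-]3+.*', ' ', text)) for label, text in texts]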
texts[:10]

[('ham',
  'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 ('ham', 'Ok lar... Joking wif u oni...'),
 ('spam',
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 ('ham', 'U dun say so early hor... U c already then say...'),
 ('ham', "Nah I don't think he goes to usf, he lives around here though"),
 ('spam',
  "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"),
 ('ham',
  'Even my brother is not like to speak with me. They treat me like aids patent.'),
 ('ham',
  "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"),
 ('spam',
  'WINNER!! As a valued network customer you have been selected to receivea £

In [3]:
import numpy as np
from keras.preprocessing import sequence

Using TensorFlow backend.


In [4]:
X = [x[1] for x in texts]
y = [x[0] for x in texts]
X = np.array(X)
print(type(X))

<class 'numpy.ndarray'>


In [5]:
from nltk.tokenize import word_tokenize
X = [word_tokenize(x) for x in X]

In [6]:
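encoder = {}  # token -> integer ID, shared across all messages
counter = 0   # next unused ID
def encode_sentence(seq):
    """Replace each token in seq with its integer ID, assigning a new ID on first sight."""
    global encoder, counter
    fseq = []
    for x in seq:
        if x not in encoder:
            encoder[x] = counter
            counter += 1
        fseq.append(encoder[x])
    return fseq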

X = [encode_sentence(x) for x in X]
X

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 16],
 [22, 23, 16, 24, 25, 26, 27, 16],
 [28, 29, 8, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 36, 34, 45, 34, 46, 29, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 56],
 [59, 60, 61, 62, 63, 64, 16, 59, 65, 66, 67, 61, 16],
 [68, 69, 70, 71, 72, 73, 74, 34, 75, 4, 73, 76, 77, 78, 79],
 [80, 81, 18, 82, 83, 56, 84, 85, 86, 56, 87, 88, 89, 90, 91, 92, 69, 93, 94, 95, 96, 97, 98, 99, 83, 100, 101, 102, 103, 92, 104, 49, 105, 34, 106, 4, 107, 34, 108],
 [109, 110, 111, 112, 113, 94, 34, 114, 115, 116, 43, 117, 118, 116, 94, 119, 120, 43],
 [121, 122, 123, 124, 125, 126, 48, 127, 128, 129, 130, 52, 131, 132, 84, 133, 134, 123, 135,

In [7]:
from keras.preprocessing.sequence import pad_sequences
X = pad_sequences(X, maxlen=None)
X

array([[    0,     0,     0, ...,    20,    21,    16],
       [    0,     0,     0, ...,    26,    27,    16],
       [    0,     0,     0, ...,    57,    58,    56],
       ..., 
       [    0,     0,     0, ...,  1615, 11472,   101],
       [    0,     0,     0, ...,   407,    99,   724],
       [    0,     0,     0, ...,    34,   296,   282]], dtype=int32)
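
One thing to watch in the padded array above: the encoder assigns ID 0 to the first token it sees, and `pad_sequences` also fills with 0 by default, so that token and the padding become indistinguishable. A minimal sketch of one way to avoid this is to start token IDs at 1 and reserve 0 for padding. The `build_vocabulary` and `encode_and_pad` helpers are hypothetical, written in plain Python so they run without Keras, and they left-pad in the same direction as the output above.

```python
def build_vocabulary(tokenized_messages):
    """Map each distinct token to an integer ID starting at 1 (0 is padding)."""
    vocabulary = {}
    for tokens in tokenized_messages:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary


def encode_and_pad(tokenized_messages, vocabulary, pad_id=0):
    """Encode tokens as IDs and left-pad each sequence to the longest length."""
    encoded = [[vocabulary[t] for t in tokens] for tokens in tokenized_messages]
    width = max(len(seq) for seq in encoded)
    return [[pad_id] * (width - len(seq)) + seq for seq in encoded]


# Two short, already-tokenized example messages.
messages = [
    ["Ok", "lar", "..."],
    ["U", "dun", "say", "so", "early", "hor", "..."],
]
vocab = build_vocabulary(messages)
padded = encode_and_pad(messages, vocab)
for row in padded:
    print(row)
```

With 0 reserved, a downstream layer that masks padding would not accidentally mask a real word.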

In [12]:
import keras
y = [1 if y_ == "spam" else 0 for y_ in y]
y_ohe = keras.utils.to_categorical(y)
y_ohe

array([[ 1.],
       [ 1.],
       [ 1.],
       ..., 
       [ 1.],
       [ 1.],
       [ 1.]])
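
The single all-ones column above is what one-hot encoding produces when every label is 0. One possible cause (an assumption about this run, suggested by the cell counter jumping from 7 to 12) is that the cell was re-run: once `y` has been overwritten with integers, `y_ == "spam"` is never true, so every label becomes 0. A minimal sketch of a re-runnable alternative keeps the raw string labels untouched and writes the encoding to a new variable. The `one_hot_labels` helper is hypothetical and uses plain Python lists rather than Keras.

```python
def one_hot_labels(labels, classes=("ham", "spam")):
    """Return one row per label with a 1.0 in the column of its class."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        rows.append(row)
    return rows


# The raw labels are never overwritten, so this can be run repeatedly.
raw_labels = ["ham", "ham", "spam"]
y_onehot = one_hot_labels(raw_labels)
y_onehot_again = one_hot_labels(raw_labels)
for row in y_onehot:
    print(row)
```

Fixing the class order up front also guarantees two columns even if a batch happens to contain only one class.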