# Spam or Ham (Working Title)
Lab Assignment Two: Exploring Text Data

**_Jake Oien, Seung Ki Lee, Jenn Le_**

## Business Understanding

In [154]:
import pandas as pd
import numpy as np

# Here, we'll import the data, remove unwanted columns (this data has 3 empty columns for some reason),
# and rename the columns to be more descriptive
data = pd.read_csv("./spam.csv", encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})

data.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Data Encoding

### Read in Data as String

In [155]:
#requests for handling url
import requests
import warnings

#for testing
TEST = False

warnings.filterwarnings("ignore")
%matplotlib inline

# read in the whole DataFrame (index and label column included) as one string
ds = data.to_string(index=True)

if TEST:
    print(type(ds))

### Verify Data Quality

To clean up the data set, we analyzed which words carry no meaning as part of a message. The first major filler we noticed was markup tags (HTML character entities). We concluded that tokens such as `&lt;#&gt;` are used for formatting and are not pertinent to the meaning of the text. We also did not come across any token that started with '&' and ended with ';' that was not such a tag, so the probability of removing important data seems very low. Beyond this, the generally accepted English stop words built into scikit-learn were removed, since they provide context but not meaning.

We also noticed that callback numbers, website URLs, and email addresses are usually unique or sparse in appearance, yet they are strong indicators of spam. For further analysis we therefore did not remove these tokens; instead, we replaced them with collective placeholders such as "replaced_callback_number" or "replaced_url".
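
The replacement idea described above can be sketched as follows. This is a minimal illustration, not the set of patterns used in this notebook: the regexes `ENTITY`, `EMAIL`, `URL`, and `PHONE` and the helper `normalize` are hypothetical and deliberately simplified, and they are left unanchored so that they can match inside a longer message.

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only).
ENTITY = re.compile(r"&\w+;")                        # e.g. &lt; or &gt;
EMAIL = re.compile(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+")   # rough email shape
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE = re.compile(r"\b\d{10,11}\b")                 # bare 10-11 digit numbers

def normalize(text):
    """Replace sparse but indicative tokens with collective placeholders."""
    text = ENTITY.sub(" ", text)
    text = EMAIL.sub("replaced_email", text)
    text = URL.sub("replaced_url", text)
    text = PHONE.sub("replaced_callback_number", text)
    return text

sample = "WIN &lt;#&gt; cash! Call 08001234567 or visit www.example.com now"
cleaned = normalize(sample)
print(cleaned)
```

The entity pattern here is kept narrow on purpose: a greedy pattern of the form `&.+;` would remove everything between the first `&` and the last `;` on a line, not just the entities themselves.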

In [156]:
#verify data quality
import re

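#remove irrelevant words : markup tags (HTML entities)
# NOTE: .+ is greedy, so on a line with several entities this removes
# everything from the first '&' to the last ';' on that line
markdown_tag = r"([\&].+[\;])"
ds = re.sub(markdown_tag,' ',ds)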

# email_address = r"^([\w|-]+\@[\w]+\.[\w]+)$"
# ds = re.sub(email_address,"replaced_email",ds)

# #cited from https://www.regexpal.com/93652
# website_url = r"^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$"
# ds = re.sub(website_url,"replaced_url",ds)

# #replace sparse but indicative words : callback numbers, email addresses, urls
# #cited from https://www.regextester.com/17
# callback_number = r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$" 
# ds = re.sub(callback_number, "replaced_phone_number", ds)

if TEST:
    print(ds)


### Convert the Data into a Sparse Encoded Bag-of-Words Representation

In [157]:
from sklearn.feature_extraction.text import CountVectorizer
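#create bag of words (the whole dataset string is passed as a single document)
count_vector = CountVectorizer(stop_words='english')
bag_of_words = count_vector.fit_transform([ds])

#put word counts in pd.DataFrame
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(data=bag_of_words.toarray(), columns=count_vector.get_feature_names_out())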

df.sum().sort_values()[-12:]

hi        130
sorry     136
come      145
know      152
good      162
got       165
ll        184
ur        203
just      218
ok        246
spam      748
ham      4825
dtype: int64

Aside from "ham" and "spam", which are the class labels in the dataset (they appear here because the label column is part of the string being vectorized), the top 10 words are typical responses in a casual texting environment, such as "ok," "good," "hi," and "sorry."

On the other hand, we also observed the causative verb "got" and "ur," a shorthand for "your" or "you are." The adverb "just" ranked second among all words once the labels are excluded, which we attribute to its wide range of uses.
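
A possible alternative, sketched here under the assumption that each message should be its own document and that the label column should stay out of the vocabulary, is to pass the `text` column to `CountVectorizer` directly. The three `messages` below are made-up stand-ins for `data["text"]`, and `term_totals` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for data["text"]; one string per message.
messages = [
    "Ok lar, see you soon",
    "Free entry, win a free prize now",
    "Ok, I will call you later",
]

vectorizer = CountVectorizer(stop_words="english")
bag = vectorizer.fit_transform(messages)  # sparse matrix: one row per message

# Total count of each term across all messages.
totals = np.asarray(bag.sum(axis=0)).ravel()
term_totals = {term: int(totals[col]) for term, col in vectorizer.vocabulary_.items()}

print(bag.shape[0])          # number of rows equals number of messages
print(term_totals["free"])   # corpus-wide count for one term
```

With one row per message, the same matrix can later be split by label, which the single-string approach does not allow.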

### Convert Data into Sparse Encoded TF-IDF Representation

In [158]:
from sklearn.feature_extraction.text import TfidfVectorizer

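#Create tfidf (again, the whole dataset string is passed as a single document)
tfidf_vector = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vector.fit_transform([ds])

#build pd.DataFrame
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(data=tfidf_matrix.toarray(), columns=tfidf_vector.get_feature_names_out())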

#tfidf_df.head(n=10)
tfidf_df.sum().sort_values()[-12:]

hi       0.026140
sorry    0.027346
come     0.029156
know     0.030563
good     0.032574
got      0.033177
ll       0.036997
ur       0.040818
just     0.043834
ok       0.049464
spam     0.150403
ham      0.970178
dtype: float64

Under the TF-IDF representation, the set of top words stayed the same, and so did their relative ordering.
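
One way to see why the ordering is unchanged: when the corpus is a single document, every term has document frequency 1, so the inverse document frequency is the same constant for all terms and TF-IDF reduces to L2-normalized counts. The sketch below reproduces that arithmetic with the standard library, assuming scikit-learn's default smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. The `counts` are four values taken from the bag-of-words output above, so the weights will not match the full-vocabulary values exactly.

```python
import math

# A few counts from the bag-of-words output; the rest of the vocabulary is
# omitted, so the norm (and therefore each weight) differs from the notebook.
counts = {"ham": 4825, "spam": 748, "ok": 246, "just": 218}

n_docs = 1  # the whole dataset was passed as one document
df = 1      # so every term appears in exactly one document

# Assumed: scikit-learn's default smoothed IDF.
idf = math.log((1 + n_docs) / (1 + df)) + 1

# Weight each count by the (constant) IDF, then L2-normalize.
weighted = {term: tf * idf for term, tf in counts.items()}
norm = math.sqrt(sum(w * w for w in weighted.values()))
tfidf = {term: w / norm for term, w in weighted.items()}

ranking_by_count = sorted(counts, key=counts.get)
ranking_by_tfidf = sorted(tfidf, key=tfidf.get)
print(idf, ranking_by_count == ranking_by_tfidf)
```

TF-IDF only starts to reorder terms once there are several documents with differing document frequencies, for example one document per message.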

## Data Visualization