# Spam or Ham (Working Title)
Lab Assignment Two: Exploring Text Data

**_Jake Oien, Seung Ki Lee, Jenn Le_**

## Business Understanding

In [6]:
import pandas as pd
import numpy as np

# Import the data, drop the three empty columns that come with the CSV,
# and rename the remaining columns to be more descriptive
data = pd.read_csv("./spam.csv", encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})

data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Data Encoding

### Read in Data as string

In [35]:
import warnings

#for testing
TEST = False

warnings.filterwarnings("ignore")
%matplotlib inline

if TEST:
    print(data)

### Verify Data Quality

To clean up the data set, we looked for tokens that carry no meaning within a message. The first major filler we noticed was leftover markup: sequences such as `&lt;#&gt;` are HTML character entities used for formatting and contribute nothing to the meaning of the text. We also did not come across any token that started with '&' and ended with ';' and was not one of these entities, so the probability of removing important data seems very low. Beyond that, we removed the generally accepted English stop words built into sklearn, since they provide context but little meaning.

In [36]:
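# Verify data quality

# Remove irrelevant tokens: HTML entities such as &lt; and &gt;.
# The bounded pattern &\w+; matches one entity at a time; a greedy ".+"
# would also delete any text between the first "&" and the last ";".
data["text"] = data["text"].str.replace(r"&\w+;", " ", regex=True)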

if TEST:
    print (data)
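
The choice of quantifier in an entity-stripping pattern matters when a message contains more than one entity. The sketch below compares a greedy `.+` with a bounded `\w+` on a made-up message (the sample string is an assumption, not a row from the dataset) using only the standard library.

```python
import re

# Hypothetical sample message containing two HTML entity pairs.
sample = "call me in &lt;#&gt; minutes or &lt;#&gt; hours"

# Greedy: ".+" runs from the first "&" to the last ";" in the message,
# so the words between the two entity pairs are deleted as well.
greedy = re.sub(r"&.+;", " ", sample)

# Bounded: "\w+" matches only the entity name between "&" and ";",
# so each entity is removed on its own and the other words survive.
bounded = re.sub(r"&\w+;", " ", sample)

print(greedy.split())
print(bounded.split())
```

The stray `#` left behind by the bounded pattern is harmless for a word-level vectorizer that only keeps alphanumeric tokens.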


### Convert the Data into a Sparse Encoded Bag-of-Words Representation

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
#create bag of words
count_vector = CountVectorizer(stop_words='english')
bag_of_words = count_vector.fit_transform(data['text'])

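# Put word counts in a pd.DataFrame
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed)
df = pd.DataFrame(data=bag_of_words.toarray(), columns=count_vector.get_feature_names_out())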

df.sum().sort_values()[-10:]

come    229
got     239
good    240
like    243
know    260
ll      269
free    282
ok      292
just    371
ur      384
dtype: int64

Several of the top 10 words in the dataset are typical responses in casual texting, such as "ok" and "good." Words that describe the texter's state, such as "free," "come," and "know," also appear.

The most frequent term is "ur," shorthand for "your" or "you are."

We also observed high-frequency words with flexible and varied usage: the causative verb "got," the filler "like" (the "new comma"), and the adverb "just."
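
To make the counting step concrete, here is a minimal pure-Python sketch of a bag-of-words tally. It assumes the same tokenization rule as `CountVectorizer`'s default (lowercase, then keep runs of two or more word characters); the messages and the tiny stop-word set are hypothetical stand-ins, not the dataset or sklearn's full list.

```python
import re
from collections import Counter

# Hypothetical messages, not taken from the dataset.
messages = [
    "Ok, I'll come over just after lunch",
    "ur free entry is waiting, just reply OK",
    "I'll call u later, ok?",
]

# Tiny illustrative stop-word subset (assumption; sklearn's list is much longer).
STOP_WORDS = {"is", "after", "over"}

def tokenize(text):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Sum token counts over all messages.
counts = Counter()
for message in messages:
    counts.update(tokenize(message))

print(counts.most_common(3))
```

Under this rule a contraction such as "I'll" is split at the apostrophe and the one-character piece is dropped, which is one plausible source of a token like "ll"; single letters such as "u" are discarded for the same reason.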

### Convert the Data into a Sparse Encoded TF-IDF Representation

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Create tfidf
tfidf_vector = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vector.fit_transform(data['text'])

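# Build pd.DataFrame
# (get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed)
tfidf_df = pd.DataFrame(data=tfidf_matrix.toarray(), columns=tfidf_vector.get_feature_names_out())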

#tfidf_df.head(n=10)
tfidf_df.sum().sort_values()[-10:]

got       55.759683
like      56.039836
sorry     56.432024
know      59.201013
good      60.014388
ur        64.058322
come      66.751932
just      72.067140
ll        80.344564
ok       103.128889
dtype: float64

With the TF-IDF representation, the pool of top words changes: "sorry" replaces "free" in the list. The relative ranking shifts as well: "ok" now has the highest total instead of "ur," which falls to the middle of the top 10.
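
One possible reason a word can rank higher under summed TF-IDF than under raw counts is row normalization. With `TfidfVectorizer`'s defaults (smoothed idf and L2-normalized rows), a word that makes up a whole short message receives a weight of 1.0, while the same word in a longer message shares the weight with other terms. The sketch below reimplements that weighting in plain Python on hypothetical tokenized messages; it illustrates the mechanism and is not a verified explanation for this dataset.

```python
import math
from collections import Counter

# Hypothetical tokenized messages (stop words assumed already removed).
docs = [
    ["ok"],
    ["ok", "come", "home"],
    ["ur", "free", "prize", "claim", "ur", "reward"],
]

n_docs = len(docs)
# Document frequency: number of messages that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one message."""
    term_freq = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(doc) for doc in docs]

# Column sums, analogous to tfidf_df.sum() in the cell above.
totals = Counter()
for row in rows:
    totals.update(row)

print(round(totals["ok"], 3), round(totals["ur"], 3))
```

Here "ok" and "ur" both occur twice in raw counts, yet "ok" ends up with the larger TF-IDF total because one of its occurrences is an entire message.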

## Data Visualization