# Lab Assignment Two: Exploring Text Data

## 1. Business Understanding

Short Message Service (SMS) is one of the largest communication standards in the world. SMS is platform independent and is used by most mobile telephone companies worldwide. Even though SMS reached its peak of 3.5 billion active users at the end of 2010, its legacy has not diminished. SMS shaped and inspired a wave of text messaging web and mobile apps in the years since; today's most popular messaging clients, such as iMessage, WhatsApp, Facebook Messenger, and WeChat, all drew inspiration from it.

In 2008, over 70 billion SMS texts were sent *per month* in the US alone. Unfortunately, a large portion of these texts are unwanted, unsolicited advertisements. These messages are called **spam**. Spam can be dangerous: although most tech-savvy individuals can tell a genuine message from a malicious advertisement, many people cannot, especially the elderly and those less familiar with technology. It is therefore important to accurately automate the filtering of spam SMS from a user's phone.

The dataset we chose is the SMS Spam Collection Dataset, in which each of the 5,574 messages is tagged as either **spam** or **ham** (legitimate). The data was originally collected to identify which words are commonly associated with spam messages. The collection contains over 90,000 words. Once we begin modelling, our prediction algorithm would need to classify more than 50% of messages correctly in order to beat random guessing. However, in an ideal world an automated system would filter out every spam message, so our team's personal goal is an accuracy of at least 90%, to ensure a clean messaging client for all.
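
To make the two targets above concrete, here is a minimal sketch of how accuracy could be measured and compared against a simple baseline. The function names and the toy labels are illustrative assumptions; they are not drawn from the dataset or from our later modelling code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Label and accuracy obtained by always predicting the most frequent label."""
    label, count = Counter(y_true).most_common(1)[0]
    return label, count / len(y_true)


# Toy labels for illustration only (not taken from the real dataset).
toy_true = ["ham", "ham", "spam", "ham", "spam", "ham", "ham", "ham"]
toy_pred = ["ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]

toy_accuracy = accuracy(toy_true, toy_pred)
baseline_label, baseline_accuracy = majority_baseline(toy_true)

print(f"accuracy = {toy_accuracy:.3f}")
print(f"always predicting '{baseline_label}' gives {baseline_accuracy:.3f}")
```

When one class dominates the labels, the majority baseline can sit well above 50%, so it is worth comparing a model against it as well as against random guessing.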

---

Link to Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/kernels

--- 

## 2. Data Encoding As String

In [1]:
import pandas as pd
import numpy as np

In [2]:
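# Load the raw CSV; the file is read as Latin-1.
df = pd.read_csv('data/spam.csv', encoding='latin1')
# Give the label column (v1) and the text column (v2) descriptive names.
df = df.rename(columns={"v1": "classification", "v2": "message"})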

In [3]:
df.head()

| | classification | message | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


### 2.1 Data Input as Strings

### 2.2 Data Quality Verification

### 2.3 Bag of Words Representation

In [4]:
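from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lowercase the text and keep tokens of two or more word characters.
vect = CountVectorizer()

# Learn the vocabulary and build the sparse document-term matrix.
simple_train_dtm = vect.fit_transform(df['message'])

# Densify for display. get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2.
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())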

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,ó_,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
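
To show what each cell of the matrix above encodes, here is a minimal pure-Python sketch of a bag-of-words count on two made-up messages. It only approximates `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters) and is not its actual implementation; `tokenize`, `bag_of_words`, and the toy messages are illustrative names.

```python
import re
from collections import Counter

# Approximation of the default token rule: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(messages):
    """Return a sorted vocabulary and one row of term counts per message."""
    counts = [Counter(tokenize(message)) for message in messages]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Toy messages for illustration only.
toy_messages = ["Free entry to win a free prize", "Ok see you at home"]
vocabulary, matrix = bag_of_words(toy_messages)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one message and each column is one vocabulary term, which is why the real matrix is mostly zeros: any single message uses only a small part of the full vocabulary.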


### 2.4 tf-idf Representation

In [5]:
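from sklearn.feature_extraction.text import TfidfVectorizer

# Drop English stop words, terms found in more than 1% of messages (max_df),
# and terms found in fewer than 4 messages (min_df).
tfidf_vect = TfidfVectorizer(stop_words='english',
                             max_df=0.01,
                             min_df=4)
tfidf_mat = tfidf_vect.fit_transform(df['message'])
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
df_tfidf = pd.DataFrame(data=tfidf_mat.toarray(), columns=tfidf_vect.get_feature_names_out())
print(df_tfidf)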

       00       000   02   03   04   05   06  0800  08000839402  08000930705  \
0     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
1     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
2     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
3     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
4     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
5     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
6     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
7     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
8     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
9     0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
10    0.0  0.000000  0.0  0.0  0.0  0.0  0.0   0.0          0.0          0.0   
11    0.0  0.298721  0.0  0.0  0.0  0.0 
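
The weighting used in the cell above can be sketched in plain Python. This is a simplified approximation of `TfidfVectorizer`'s default settings (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization), not the library's implementation; the `tfidf` function and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(tokenized_docs, min_df=1, max_df=1.0):
    """Return a sorted vocabulary and L2-normalized tf-idf rows."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Integer thresholds are absolute counts; floats are proportions of documents.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    vocabulary = sorted(t for t, df in doc_freq.items() if low <= df <= high)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Toy tokenized documents for illustration only.
toy_docs = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "call", "now"],
    ["see", "you", "later"],
]
# Keep terms that appear in at least 2 documents and at most 75% of them.
vocabulary, rows = tfidf(toy_docs, min_df=2, max_df=0.75)

print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

In the toy output, "free" receives a higher weight than "call" in the first document because it appears in fewer documents, which is the same effect that makes most entries in the real matrix zero and the remaining ones fractional.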

## 3. Data Visualization

### 3.1 Statistical Summaries Visualized

### 3.2 Most Common Relevant Words

## 4. Word Cloud