### Coding Challenge #1: Natural Language Processing

In this Coding Challenge, you will work through the steps needed to organize text data for modelling. You will cover a range of NLP concepts: **a)** tokenization, **b)** stopwords, **c)** stemming/lemmatization, and **d)** vectorization.

Walking through this challenge will equip you with the necessary knowledge to work through the first part of the Project Assignment.

**Dataset**: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection







**Step 1**: Explore the dataset to ascertain the following:

**a)** Determine whether there are any missing values. If any are found, treat them.

**b)** Ascertain the breakdown of messages by class: 1) how many "spam" messages are there, and 2) how many "ham" messages are there?

In [0]:
# Step 1
# Get the data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head SMSSpamCollection

--2018-06-11 15:03:55--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/zip]
Saving to: ‘smsspamcollection.zip’


2018-06-11 15:03:55 (1.36 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives arou

In [0]:
# Read with pandas
import pandas as pd
sms_data = pd.read_table('./SMSSpamCollection', header=None,
                         names=['category', 'content'])
sms_data.head()

| | category | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [0]:
# Answer a), b) here

print(sms_data.isnull().sum())
print(sms_data['category'].value_counts())

category    0
content     0
dtype: int64
ham     4825
spam     747
Name: category, dtype: int64
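
In the output above no values are missing, and 747 of the 5,572 messages (about 13.4%) are spam. Had missing values been present, one possible treatment is sketched below. The toy frame and its `None` entry are invented for illustration, and dropping rows is only one option among several.

```python
import pandas as pd

# Toy frame standing in for sms_data; the None entry is invented for illustration.
toy = pd.DataFrame({
    "category": ["ham", "spam", "ham", "ham"],
    "content": ["Ok lar...", "Free entry now", None, "See you soon"],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# One possible treatment: drop rows whose message text is missing.
toy_clean = toy.dropna(subset=["content"]).reset_index(drop=True)

# Class breakdown as raw counts and as proportions.
counts = toy_clean["category"].value_counts()
shares = toy_clean["category"].value_counts(normalize=True)

print(missing_per_column)
print(counts)
print(shares.round(3))
```

Reporting proportions alongside counts makes the class imbalance easier to see before any modelling.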


**Step 2:** Massage/pre-process the dataset:

**a)** Eliminate punctuation

**b)** Remove stopwords

**c)** Tokenize the text

**d)** Stem or lemmatize the text
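
To make the four steps concrete before bringing in NLTK, here is a minimal standard-library sketch. `TOY_STOPWORDS` and `naive_stem` are hypothetical stand-ins: the stopword set is a tiny made-up list, and the suffix stripper is a crude illustration, not the Porter algorithm. Tokens are lowercased before the stopword check so that capitalized words can match a lowercase stopword list.

```python
import string

# Hypothetical mini stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"has", "our", "and", "your", "to", "a", "the", "is", "it"}

def strip_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def naive_stem(token):
    """Crude suffix stripping for illustration only; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    no_punct = strip_punctuation(text)                    # a) punctuation
    tokens = no_punct.lower().split()                     # c) whitespace tokenization
    kept = [t for t in tokens if t not in TOY_STOPWORDS]  # b) stopwords
    return [naive_stem(t) for t in kept]                  # d) stemming

print(preprocess("Someone has contacted our dating service and entered your phone!"))
```

Note that the steps run in a slightly different order than they are listed: stopword removal operates on tokens, so tokenization has to come first.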

In [0]:
!pip install -U nltk

import nltk

# Downloads every NLTK data package (large). The steps below use the
# tokenizer models and the stopwords corpus.
nltk.download('all')

In [0]:
# Step 2

import string

import nltk
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize

# a) Remove punctuation: drop every character listed in string.punctuation
sms_data['clean'] = sms_data['content'].apply(
    lambda x: ''.join(ch for ch in x if ch not in string.punctuation))

# c) Tokenize each cleaned message into a list of words
sms_data['tokens'] = sms_data['clean'].apply(word_tokenize)

# b) Remove stopwords. NLTK's list is lowercase and this comparison is
# case-sensitive, so capitalized stopwords such as "To" are kept.
en_stopwords = set(nltk.corpus.stopwords.words('english'))
sms_data['nontokens'] = sms_data['tokens'].apply(
    lambda x: ' '.join(t for t in x if t not in en_stopwords))

# d) Stem every token with the Porter stemmer. The stems are built from
# 'tokens', so stopwords are still present in 'porter_Stems'.
pStemmer = PorterStemmer()
sms_data['porter_Stems'] = sms_data['tokens'].apply(
    lambda x: ' '.join(pStemmer.stem(t) for t in x))

# Compare the raw, cleaned, and stemmed versions of one message
n = 422
print(sms_data['content'][n])
print(sms_data['clean'][n])
print(sms_data['porter_Stems'][n])

Someone has contacted our dating service and entered your phone because they fancy you! To find out who it is call from a landline 09111032124 . PoBox12n146tf150p
Someone has contacted our dating service and entered your phone because they fancy you To find out who it is call from a landline 09111032124  PoBox12n146tf150p
someon ha contact our date servic and enter your phone becaus they fanci you To find out who it is call from a landlin 09111032124 pobox12n146tf150p


**Step 3:** Perform vectorization. You will apply 3 different vectorization techniques. Each technique generates a document-term matrix in which each row represents a text message and each column represents a word or a combination of words. The main difference between the techniques is the value stored in each cell of the matrix.

**1)** Create a document-term matrix based on the count of the words in each document. You may want to restrict the number of features (columns) to the most frequent terms across the corpus.

**2)** Create a trigram vector from sequences of three adjacent words (n=3).

**3)** Create a TF-IDF vector in which the cells of the matrix contain values (i.e. weights) that depict how important a word is to an individual SMS message.
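
To show what the cell values mean under each technique, the sketch below computes them by hand on a three-message toy corpus using only the standard library. The corpus is invented, and the TF-IDF formula used here is the plain textbook form (term frequency times log of inverse document frequency). scikit-learn's `TfidfVectorizer` applies smoothing and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Invented toy corpus, one string per "message".
docs = [
    "free entry to win",
    "win free tickets now",
    "see you at home",
]
tokenized = [d.split() for d in docs]

# 1) Count matrix: one row per message, one column per vocabulary term.
vocab = sorted({t for doc in tokenized for t in doc})
count_matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# 2) Word trigrams: every run of three adjacent tokens.
def word_ngrams(tokens, n=3):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = [word_ngrams(doc) for doc in tokenized]

# 3) Textbook TF-IDF: (term count / message length) * log(N / document frequency).
n_docs = len(tokenized)
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

print(vocab)
print(count_matrix[0])
print(trigrams[0])
print(round(tfidf("entry", tokenized[0]), 4), round(tfidf("free", tokenized[0]), 4))
```

In the first message, "entry" receives a higher weight than "free" because "free" also appears in another message, which lowers its inverse document frequency.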




In [0]:
# Step 3

from sklearn.feature_extraction.text import CountVectorizer

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(sms_data['clean'])

# summarize
# print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(sms_data['clean'])

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(5572, 9544)
<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
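
The cell above builds the count-based matrix over the full vocabulary. A possible way to cover all three parts of Step 3 with scikit-learn is sketched below on an invented toy corpus. The `max_features=5` cap is an arbitrary illustrative choice; on the real data the same calls could be applied to `sms_data['clean']`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy corpus standing in for the cleaned SMS messages.
corpus = [
    "free entry to win tickets",
    "win free tickets now",
    "see you at home tonight",
]

# 1) Word counts, keeping only the 5 most frequent terms across the corpus.
top_counts = CountVectorizer(max_features=5)
count_dtm = top_counts.fit_transform(corpus)

# 2) Trigrams only: ngram_range=(3, 3) makes every column a run of three words.
trigram_vec = CountVectorizer(ngram_range=(3, 3))
trigram_dtm = trigram_vec.fit_transform(corpus)

# 3) TF-IDF weights instead of raw counts.
tfidf_vec = TfidfVectorizer()
tfidf_dtm = tfidf_vec.fit_transform(corpus)

print(count_dtm.shape)
print(sorted(trigram_vec.vocabulary_))
print(tfidf_dtm.shape)
```

All three results are sparse matrices with one row per message; only the columns and the cell values change between techniques.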
