## Vectorizing Raw Data: N-Grams

### N-Grams

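N-gram vectorizing creates a document-term matrix in which each cell still holds a count, but each column represents a sequence of n adjacent words from the text rather than a single term. With n = 1 the columns are single words (unigrams), with n = 2 they are adjacent word pairs (bigrams), with n = 3 they are triples (trigrams), and so on.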

<img src='images/N-grams.png'>
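
To make the idea concrete, here is a minimal pure-Python sketch that builds a bigram count matrix by hand. The toy corpus, the `ngrams` helper, and the whitespace tokenization are assumptions made for illustration; `CountVectorizer` uses its own tokenizer and will not produce identical columns on real text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus, not taken from the SMS data set.
docs = ["nlp is an interesting topic", "nlp is fun"]

# Naive whitespace tokenization (an assumption; real vectorizers differ).
bigrams_per_doc = [ngrams(doc.split(), 2) for doc in docs]

# Columns of the document-term matrix: the sorted set of all bigrams seen.
vocabulary = sorted({gram for grams in bigrams_per_doc for gram in grams})

# One row per document, one count per bigram column.
matrix = []
for grams in bigrams_per_doc:
    counts = Counter(grams)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per bigram in the vocabulary, and most entries are 0 because any single document contains only a few of the possible bigrams.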

### Read in text

In [1]:
import pandas as pd
import re
import string
from nltk import PorterStemmer
from nltk.corpus import stopwords

In [2]:
ps = PorterStemmer()
stopword = stopwords.words('english')

In [3]:
data = pd.read_csv('data/SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

| | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [4]:
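def clean_text(text):
    # Drop punctuation, split on non-word characters, remove stopwords, then stem
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    # Note: the stopword check is case-sensitive, so words such as 'I' or 'ON' pass through
    text = " ".join([ps.stem(word) for word in tokens if word not in stopword])
    return text

data['cleaned_text'] = data['body_text'].apply(clean_text)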
data.head()

| | label | body_text | cleaned_text |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | ive search right word thank breather I promis ... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win FA cup final tkt 21... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | nah I dont think goe usf live around though |
| 3 | ham | Even my brother is not like to speak with me. ... | even brother like speak they treat like aid pa... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I have A date ON sunday with will |


### Apply CountVectorizer (w/ N-Grams)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

ngrams_vect = CountVectorizer(ngram_range=(1, 2))
X_counts = ngrams_vect.fit_transform(data['cleaned_text'])

print(X_counts.shape)
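# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (added in 1.0) replaces it
print(ngrams_vect.get_feature_names_out().tolist())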

(5568, 42281)
['008704050406', '008704050406 sp', '0089mi', '0089mi last', '0121', '0121 2025050', '01223585236', '01223585236 xx', '01223585334', '01223585334 cum', '0125698789', '0125698789 ring', '02', '02 user', '020603', '020603 2nd', '020603 thi', '0207', '0207 153', '02070836089', '02072069400', '02072069400 bx', '02073162414', '02073162414 cost', '02085076972', '02085076972 repli', '020903', '020903 thi', '021', '021 3680', '021 3680offer', '050703', '050703 tcsbcm4235wc1n3xx', '0578', '06', '06 good', '060505', '061104', '07008009200', '07046744435', '07046744435 arrang', '07090201529', '07090298926', '07090298926 reschedul', '07099833605', '07099833605 reschedul', '071104', '07123456789', '07123456789 87077', '0721072', '0721072 find', '07732584351', '07732584351 rodger', '07734396839', '07734396839 ibh', '07742676969', '07742676969 show', '07753741225', '07753741225 show', '0776xxxxxxx', '0776xxxxxxx uve', '07786200117', '077xxx', '077xxx won', '078', '07801543489', '0780154

In [6]:
data_sample = data[:20]

ngrams_vect_sample = CountVectorizer(ngram_range=(2, 2))
X_counts_sample = ngrams_vect_sample.fit_transform(data_sample['cleaned_text'])

print(X_counts_sample.shape)
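# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (added in 1.0) replaces it
print(ngrams_vect_sample.get_feature_names_out().tolist())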

(20, 229)
['09061701461 claim', '100 20000', '100000 prize', '11 month', '12 hour', '150pday 6day', '16 tsandc', '20000 pound', '2005 text', '21st may', '4txtú120 poboxox36504w45wq', '6day 16', '81010 tc', '87077 eg', '87077 trywal', '87121 receiv', '87575 cost', '900 prize', 'aft finish', 'aid patent', 'anymor tonight', 'appli 08452810075over18', 'appli repli', 'ard smth', 'around though', 'as per', 'as valu', 'bless time', 'breather promis', 'brother like', 'call 09061701461', 'call the', 'caller press', 'callertun caller', 'camera free', 'cash from', 'chanc win', 'claim call', 'claim code', 'claim no', 'click httpwap', 'click wap', 'co free', 'code kl341', 'colour mobil', 'comp win', 'copi friend', 'cost 150pday', 'credit click', 'cri enough', 'csh11 send', 'cup final', 'custom select', 'da stock', 'date on', 'dont miss', 'dont think', 'dont want', 'eg england', 'eh rememb', 'england 87077', 'england macedonia', 'enough today', 'entitl updat', 'entri questionstd', 'entri wkli', 'eve

In [7]:
X_counts_sample

<20x229 sparse matrix of type '<class 'numpy.int64'>'
	with 230 stored elements in Compressed Sparse Row format>

#### Vectorizers output sparse matrices

**Sparse Matrix:** A matrix in which most entries are 0. For efficient storage, a sparse matrix keeps only the locations and values of its non-zero elements. The 20x229 sample above, for instance, has 4,580 cells but only 230 stored elements.
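
The storage idea can be sketched in a few lines of plain Python. This is an illustration only: it uses a simple dictionary keyed by (row, column), whereas the matrix returned by `CountVectorizer` reports Compressed Sparse Row format, which packs the same information differently. The values and the `to_dense` helper are hypothetical.

```python
# A small dense matrix that is mostly zeros (hypothetical values).
dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]

# Keep only the non-zero cells, keyed by their (row, column) location.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full matrix, filling every unstored cell with 0."""
    return [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

print(len(sparse), "stored values out of", len(dense) * len(dense[0]), "cells")
```

Expanding back to the full grid, as `to_dense` does here, is the role `.toarray()` plays for the vectorizer output before it is loaded into a DataFrame.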

In [10]:
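# Expand the sparse matrix to a dense array and label columns with the bigram vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_counts_df = pd.DataFrame(X_counts_sample.toarray(), columns=ngrams_vect_sample.get_feature_names_out())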

X_counts_df.head()

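| | 09061701461 claim | 100 20000 | 100000 prize | 11 month | 12 hour | 150pday 6day | 16 tsandc | 20000 pound | 2005 text | 21st may | ... | wkli comp | wonder bless | wont take | word claim | word thank | wwwdbuknet lccltd | xxxmobilemovieclub to | ye he | you week | you wonder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |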
