# Vectorizing Raw Data: N-Grams

### N-Grams 

N-gram vectorization creates a document-term matrix in which each cell still holds a count, but each column represents a sequence of n adjacent words (an n-gram) from the text rather than a single term.

"NLP is an interesting topic"

| n | Name      | Tokens                                                         |
|---|-----------|----------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"]      |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]    |
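
The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` is hypothetical, and it assumes whitespace tokenization and lowercasing, which is simpler than what a real vectorizer does.

```python
def ngrams(text, n):
    """Return the n-grams of adjacent words in text as space-joined strings."""
    tokens = text.lower().split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "NLP is an interesting topic"
for n in (2, 3, 4):
    print(n, ngrams(sentence, n))
```

A sentence with k words yields k - n + 1 n-grams, so longer n-grams produce fewer tokens per document.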

### Read in text

In [2]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']
data.head()

Unnamed: 0,label,body_text
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
1,ham,"Nah I don't think he goes to usf, he lives around here though"
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...


### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [3]:
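def clean_text(text):
    # Drop punctuation characters and lowercase everything else
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (the raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, stem, and rejoin into one string, since CountVectorizer expects strings
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(clean_text)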
data.head()


Unnamed: 0,label,body_text,cleaned_text
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
1,ham,"Nah I don't think he goes to usf, he lives around here though",nah dont think goe usf live around though
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.,even brother like speak treat like aid patent
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,date sunday
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend...


### Apply CountVectorizer (w/ N-Grams)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vect = CountVectorizer(ngram_range=(2,2))  # ngram_range=(min_n, max_n); (2,2) extracts bigrams only
x_counts = ngram_vect.fit_transform(data['cleaned_text'])
print(x_counts.shape)
print(ngram_vect.get_feature_names_out())

(5567, 31260)
['008704050406 sp' '0089mi last' '0121 2025050' ... 'üll submit'
 'üll take' '〨ud even']
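
To make the shape of that output concrete, here is a rough pure-Python sketch of what a bigram count matrix contains. It is an illustration, not scikit-learn's implementation: the `docs` list is made-up toy data, the `bigrams` helper is hypothetical, and tokens are split on whitespace only, whereas `CountVectorizer` by default lowercases the text and keeps only tokens of two or more word characters.

```python
from collections import Counter

def bigrams(text):
    """Return the adjacent word pairs in text as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Toy, already-cleaned documents (hypothetical, not taken from the dataset)
docs = ["free entri win cup", "win cup final", "date sunday"]

# Vocabulary: every distinct bigram in the corpus, sorted alphabetically
vocab = sorted({bg for doc in docs for bg in bigrams(doc)})

# One row per document, one column per bigram, each cell holding a count
matrix = []
for doc in docs:
    counts = Counter(bigrams(doc))
    matrix.append([counts[bg] for bg in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Even in this three-document toy corpus most cells are zero, because each bigram tends to appear in only one or two documents. The same effect explains why 5,567 short messages produce 31,260 bigram columns.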


### Apply CountVectorizer (w/ N-Grams) to smaller sample

In [12]:
data_sample = data[0:20]

ngram_vect_sample = CountVectorizer(ngram_range=(3,3))  # ngram_range=(min_n, max_n); (3,3) extracts trigrams only
x_counts_sample = ngram_vect_sample.fit_transform(data_sample['cleaned_text'])
print(x_counts_sample.shape)
print(ngram_vect_sample.get_feature_names_out())


(20, 179)
['09061701461 claim code' '100 20000 pound' '100000 prize jackpot'
 '11 month entitl' '150pday 6day 16' '16 tsandc appli' '20000 pound txt'
 '2005 text fa' '21st may 2005' '4txtú120 poboxox36504w45wq 16'
 '6day 16 tsandc' '81010 tc wwwdbuknet' '87077 eg england'
 '87077 trywal scotland' '87121 receiv entri' '87575 cost 150pday'
 '900 prize reward' 'aft finish lunch' 'alright way meet'
 'anymor tonight ive' 'appli repli hl' 'ard smth lor' 'brother like speak'
 'call 09061701461 claim' 'call mobil updat' 'caller press copi'
 'callertun caller press' 'camera free call' 'cash 100 20000'
 'chanc win cash' 'claim 81010 tc' 'claim call 09061701461'
 'claim code kl341' 'click httpwap xxxmobilemovieclubcomnqjkgighjjgcbl'
 'click wap link' 'co free 08002986030' 'code kl341 valid'
 'colour mobil camera' 'comp win fa' 'copi friend callertun'
 'cost 150pday 6day' 'credit click wap' 'cri enough today'
 'csh11 send 87575' 'cup final tkt' 'custom select receivea'
 'da stock comin' 'dont miss

### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. For efficient storage, a sparse matrix is stored by recording only the locations and values of its non-zero elements._
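
The idea can be sketched in plain Python by keeping only (row, column, value) triples for the non-zero cells. This is a simplified coordinate-style layout chosen for illustration. The vectorizer returns a SciPy sparse matrix, whose internal format differs from this sketch, and the `dense` data and `to_dense` helper below are hypothetical.

```python
# A small dense matrix in which most entries are zero (toy data)
dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 1],
]

# Keep only the non-zero cells as (row, column, value) triples
triples = [
    (r, c, v)
    for r, row in enumerate(dense)
    for c, v in enumerate(row)
    if v != 0
]

def to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix from the stored triples."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        out[r][c] = v
    return out

print(triples)
print(to_dense(triples, 3, 4) == dense)
```

Here 12 cells shrink to 3 stored triples. Calling `.toarray()` on a real sparse matrix performs the reverse expansion, which is why it is only practical on a small sample.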

In [13]:
# Expand the sparse matrix into a dense array so it can be displayed as a DataFrame
xdf = pd.DataFrame(x_counts_sample.toarray())
xdf.columns = ngram_vect_sample.get_feature_names_out()
xdf

Unnamed: 0,09061701461 claim code,100 20000 pound,100000 prize jackpot,11 month entitl,150pday 6day 16,16 tsandc appli,20000 pound txt,2005 text fa,21st may 2005,4txtú120 poboxox36504w45wq 16,...,way meet sooner,week free membership,win cash 100,win fa cup,winner valu network,wkli comp win,word claim 81010,wwwdbuknet lccltd pobox,xxxmobilemovieclub use credit,ye naughti make
0,0,0,0,0,0,0,0,1,1,0,...,0,0,0,1,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,0,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,1,0,0
