### Working with Text Data and Naive Bayes in scikit-learn

    Problem: we need to classify whether an SMS is spam or not.

### Agenda

**Working with text data**

- Representing text as data
- Reading SMS data
- Vectorizing SMS data
- Examining the tokens and their counts
- Bonus: Calculating the "spamminess" of each token

**Naive Bayes classification**

- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression

### Part 1: Representing text as data

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [2]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [3]:
simple_train

['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [4]:
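# learn the 'vocabulary' of the training data
# (unigrams and bigrams; binary=True records presence/absence instead of counts)
vect = CountVectorizer(ngram_range=(1, 2), binary=True)
vect.fit(simple_train)
# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
vect.get_feature_names()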

['cab',
 'call',
 'call me',
 'call you',
 'me',
 'me cab',
 'me please',
 'please',
 'please call',
 'tonight',
 'you',
 'you tonight']

In [5]:
vect

CountVectorizer(analyzer='word', binary=True, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [6]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [7]:
# print the sparse matrix
print(simple_train_dtm)

  (0, 1)	1
  (0, 3)	1
  (0, 9)	1
  (0, 10)	1
  (0, 11)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 5)	1
  (2, 1)	1
  (2, 2)	1
  (2, 4)	1
  (2, 6)	1
  (2, 7)	1
  (2, 8)	1


In [8]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
       [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0]], dtype=int64)

In [9]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

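| | cab | call | call me | call you | me | me cab | me please | please | please call | tonight | you | you tonight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |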


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
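
The last sentence of that quote can be checked with a minimal pure-Python sketch. This is not scikit-learn's implementation: the helper name `bag_of_words` and the two example sentences are assumptions made for illustration, and only the token pattern is taken from the `CountVectorizer` repr shown earlier.

```python
import re
from collections import Counter

# Same token pattern that CountVectorizer reports in its repr:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(text):
    """Hypothetical helper: lowercase, tokenize, and count unigrams."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


a = bag_of_words("please call me")
b = bag_of_words("call me, please")

print(a)
print(a == b)  # word order is discarded, so the two bags are equal
```

Both sentences map to the same counts, which is exactly the information a bag-of-words model gives up.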

In [10]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

In [11]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | call me | call you | me | me cab | me please | please | please call | tonight | you | you tonight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
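
The three summary points can be sketched without scikit-learn. The class below is a hypothetical stand-in, not the library's code: `TinyVectorizer` and `ngrams` are names invented here, and the sketch only covers the settings used in this part (unigrams plus bigrams, binary presence, the default token pattern).

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngrams(text, max_n=2):
    """Lowercase, tokenize, and return all 1..max_n word n-grams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class TinyVectorizer:
    """Hypothetical stand-in for CountVectorizer(ngram_range=(1, 2), binary=True)."""

    def fit(self, docs):
        # learn the vocabulary: every distinct n-gram, sorted alphabetically
        self.vocabulary_ = sorted({g for doc in docs for g in ngrams(doc)})
        return self

    def transform(self, docs):
        # one row per document, one column per known n-gram;
        # n-grams outside the fitted vocabulary are silently dropped
        rows = []
        for doc in docs:
            present = set(ngrams(doc))
            rows.append([int(term in present) for term in self.vocabulary_])
        return rows


simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

tiny = TinyVectorizer().fit(simple_train)
train_rows = tiny.transform(simple_train)
test_rows = tiny.transform(simple_test)

print(tiny.vocabulary_)
print(test_rows)
```

The test message contains `don`, `please don`, and `don call`, none of which were seen during `fit`, so they leave no trace in the test row.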

### Part 2: Reading SMS data

In [12]:
sms = pd.read_csv('C:/Users/om/Downloads/Text mining & NLP files/sms case study/sms.csv')

In [14]:
sms.shape

sms.head(20)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [15]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
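
These class counts imply a baseline worth keeping in mind: a model that labels every message as ham would already be right most of the time. A short sketch of that arithmetic, using only the two counts printed above (the variable name `null_accuracy` is an illustrative choice):

```python
# class counts copied from the value_counts() output
ham_count = 4825
spam_count = 747

total = ham_count + spam_count
# accuracy of a trivial model that predicts "ham" for every message
null_accuracy = ham_count / total

print(total)
print(round(null_accuracy, 4))
```

Any classifier built on this data should be judged against that baseline rather than against 50%.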

In [16]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [17]:
# define X and y
X = sms.message
y = sms.label

In [18]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_train.shape)

(4179,)
(1393,)
(1393,)
(4179,)


### Part 3: Vectorizing SMS data

In [19]:
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
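vect = CountVectorizer(strip_accents='unicode', stop_words='english', max_df=0.9, min_df=0.001)

# max_df and min_df limit the vocabulary by document frequency: a float is a
# proportion of documents, so this drops tokens found in more than 90% or in
# fewer than 0.1% of the training messages
# max_features=2000 would keep only the 2000 most frequent tokens
# if max_df, min_df and max_features (e.g. 1000) are all set, max_df and min_df
# are applied first, then max_features is applied to the tokens that remain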
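
A minimal sketch of that filtering order on a toy corpus, in plain Python. The function `prune_vocabulary` and the four tokenized documents are made up for illustration, and the alphabetical tie-break in step 2 is an assumption of this sketch, not something the notebook or the library promises.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Hypothetical helper: apply min_df/max_df first, then max_features."""
    n_docs = len(tokenized_docs)
    # a float is a proportion of documents, an int is an absolute count
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df

    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
        term_freq.update(tokens)

    # step 1: document-frequency thresholds
    kept = [t for t in doc_freq if min_count <= doc_freq[t] <= max_count]
    # step 2: keep only the most frequent surviving terms
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)


docs = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "call", "free"],
    ["call", "later"],
]

print(prune_vocabulary(docs, min_df=2, max_df=0.9))
print(prune_vocabulary(docs, min_df=2, max_df=0.9, max_features=1))
```

Here `call` is dropped for appearing in every document, `prize` and `me` for appearing in only one, and `max_features=1` then picks `free` over `later` because it occurs more often overall.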

In [20]:
CountVectorizer? 

In [24]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4179x1279 sparse matrix of type '<class 'numpy.int64'>'
	with 23463 stored elements in Compressed Sparse Row format>

In [25]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<4179x1279 sparse matrix of type '<class 'numpy.int64'>'
	with 23463 stored elements in Compressed Sparse Row format>

In [26]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x1279 sparse matrix of type '<class 'numpy.int64'>'
	with 7734 stored elements in Compressed Sparse Row format>

### Part 4: Examining the tokens and their counts

In [27]:
# store token names
X_train_tokens = vect.get_feature_names()

In [28]:
# first 50 tokens
print(X_train_tokens[:50])

['00', '000', '03', '04', '0800', '08000839402', '08000930705', '0870', '08707509020', '08712300220', '08712460324', '10', '100', '1000', '10am', '10p', '11', '11mths', '12', '12hrs', '1327', '150', '150p', '150pm', '150ppm', '16', '18', '1st', '20', '200', '2000', '2003', '20p', '21', '25', '250', '25p', '2day', '2lands', '2nd', '2nite', '30', '3030', '350', '36504', '3g', '40gb', '4th', '4u', '50']


In [29]:
# last 50 tokens
print(X_train_tokens[-50:])

['wins', 'wish', 'wishes', 'wishing', 'wit', 'wiv', 'wk', 'wkly', 'woke', 'won', 'wonder', 'wonderful', 'wondering', 'wont', 'word', 'words', 'work', 'working', 'works', 'world', 'worried', 'worries', 'worry', 'worse', 'worth', 'wot', 'wow', 'write', 'wrong', 'www', 'xmas', 'xx', 'xxx', 'xy', 'ya', 'yar', 'yay', 'yeah', 'year', 'years', 'yep', 'yes', 'yest', 'yesterday', 'ym', 'yo', 'yr', 'yrs', 'yup', 'zed']


In [30]:
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [31]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts

array([ 5, 23,  6, ...,  6, 33,  6], dtype=int64)

In [32]:
X_train_counts.shape

(1279,)

In [33]:
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts})

| | token | count |
|---|---|---|
| 0 | 00 | 5 |
| 1 | 000 | 23 |
| 2 | 03 | 6 |
| 3 | 04 | 9 |
| 4 | 0800 | 10 |
| ... | ... | ... |
| 1274 | yo | 23 |
| 1275 | yr | 11 |
| 1276 | yrs | 6 |
| 1277 | yup | 33 |


### Bonus: Calculating the "spamminess" of each token

In [34]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0]
sms_spam = sms[sms.label==1]

In [36]:
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()
all_tokens

['00',
 '000',
 '02',
 '03',
 '04',
 '06',
 '0800',
 '08000839402',
 '08000930705',
 '0870',
 '08707509020',
 '08712300220',
 '08712460324',
 '08718720201',
 '09050090044',
 '10',
 '100',
 '1000',
 '10am',
 '10p',
 '11',
 '11mths',
 '12',
 '12hrs',
 '1327',
 '150',
 '150p',
 '150pm',
 '150ppm',
 '16',
 '18',
 '1st',
 '20',
 '200',
 '2000',
 '2003',
 '2004',
 '20p',
 '25',
 '250',
 '25p',
 '28',
 '2day',
 '2lands',
 '2nd',
 '2nite',
 '30',
 '3030',
 '350',
 '36504',
 '3g',
 '40gb',
 '4th',
 '4u',
 '50',
 '500',
 '5000',
 '50p',
 '5wb',
 '5we',
 '62468',
 '750',
 '7pm',
 '800',
 '8007',
 '82277',
 '85023',
 '86021',
 '86688',
 '87066',
 '8th',
 '900',
 'aathi',
 'abiola',
 'able',
 'abt',
 'ac',
 'accept',
 'access',
 'account',
 'activate',
 'actually',
 'add',
 'added',
 'address',
 'admirer',
 'advance',
 'aft',
 'afternoon',
 'aftr',
 'age',
 'age16',
 'ago',
 'ah',
 'aha',
 'ahead',
 'ahmad',
 'aight',
 'air',
 'aiyah',
 'aiyo',
 'al',
 'alex',
 'alright',
 'alrite',
 'amp',
 'angry

In [37]:
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)

In [38]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)

In [39]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)

In [40]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})

In [41]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1

In [42]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort_values('spam_ratio')

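| | token | ham | spam | spam_ratio |
|---|---|---|---|---|
| 506 | gt | 319 | 1 | 0.003135 |
| 709 | lt | 317 | 1 | 0.003155 |
| 695 | lor | 163 | 1 | 0.006135 |
| 307 | da | 151 | 1 | 0.006623 |
| 648 | later | 136 | 1 | 0.007353 |
| ... | ... | ... | ... | ... |
| 30 | 18 | 1 | 52 | 52.000000 |
| 1193 | tone | 1 | 61 | 61.000000 |
| 26 | 150p | 1 | 72 | 72.000000 |
| 926 | prize | 1 | 94 | 94.000000 |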
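
The ratio is simple enough to reproduce by hand. In the sketch below, the raw counts for `prize` and `gt` are the table values minus the one that was added; the `hello` row and the helper name `spam_ratio` are made up for illustration.

```python
# raw per-token counts before the +1 adjustment
raw_counts = {
    "prize": {"ham": 0, "spam": 93},   # implied by the table (1 and 94 after +1)
    "gt": {"ham": 318, "spam": 0},     # implied by the table (319 and 1 after +1)
    "hello": {"ham": 9, "spam": 9},    # hypothetical token, not from the data
}


def spam_ratio(ham, spam):
    """Add one to both counts, then divide spam by ham."""
    return (spam + 1) / (ham + 1)


ratios = {tok: spam_ratio(**c) for tok, c in raw_counts.items()}
for tok, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{tok:>6} {r:.6f}")
```

Because the ratio uses raw counts and the data holds far more ham than spam messages, a ratio of 1.0 does not mean a token is neutral; it is best read as a ranking rather than a probability.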


### Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
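
To show what "multinomial" means for word counts, here is a minimal from-scratch sketch with add-one smoothing (`alpha=1.0`, the same default value that appears in the `MultinomialNB` repr in this notebook). It is not scikit-learn's implementation; the function names and the four-message training set are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # smoothed denominator: all tokens in the class plus alpha per vocabulary entry
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# toy training set (made up for illustration): 0 = ham, 1 = spam
train_docs = [
    ["call", "me", "later"],
    ["see", "you", "later"],
    ["free", "prize", "call"],
    ["win", "free", "prize"],
]
train_labels = [0, 0, 1, 1]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict(["free", "prize", "now"], log_prior, log_likelihood))
print(predict(["see", "you", "later"], log_prior, log_likelihood))
```

The "naive" part is the sum inside `predict`: each token contributes independently of the others, given the class.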

In [43]:
X_train_dtm

<4179x1279 sparse matrix of type '<class 'numpy.int64'>'
	with 23463 stored elements in Compressed Sparse Row format>

In [44]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [45]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [46]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.9834888729361091


In [47]:
# confusion matrix
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1198   10]
 [  13  172]]
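
The accuracy printed earlier can be recovered from this matrix, along with two rates that accuracy hides. The sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, with 0 = ham first); the variable names are illustrative.

```python
# confusion matrix copied from the output above
confusion = [[1198, 10],
             [13, 172]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of spam that was caught
specificity = tn / (tn + fp)   # share of ham that was left alone

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The model is noticeably better at leaving ham alone than at catching spam, which a single accuracy figure does not reveal.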


In [48]:
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([7.71879642e-03, 3.90291031e-04, 1.34481933e-01, ...,
       1.57609186e-05, 1.00000000e+00, 2.05392162e-07])

In [49]:
# calculate AUC
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.9925496688741722
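
One way to read this number: AUC is the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it that way on four made-up scores; `pairwise_auc` is a hypothetical helper, not the routine `roc_auc_score` uses internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy labels and scores, made up for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))

# any order-preserving rescaling of the scores leaves AUC unchanged,
# which is why poorly calibrated probabilities can still give a high AUC
print(pairwise_auc(y_true, [s ** 3 for s in scores]))
```

Since only the ranking matters, the high AUC here says nothing about whether the predicted probabilities themselves are trustworthy.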


In [50]:
# print message text for the false positives
X_test[y_test < y_pred_class]

4773    Hi, Mobile no.  &lt;#&gt;  has added you in th...
4419                           When you get free, call me
2340    Cheers for the message Zogtorius. Ive been st...
2903    Bill, as in: Are there any letters for me. i’m...
45                       No calls..messages..missed calls
3589    If you were/are free i can give. Otherwise nal...
3120                             Stop knowing me so well!
3415                              No pic. Please re-send.
1160    Yun buying... But school got offer 2000 plus o...
1988                     No calls..messages..missed calls
Name: message, dtype: object

In [51]:
# print message text for the false negatives
X_test[y_test > y_pred_class]

1217    You have 1 new voicemail. Please call 08719181...
2295     You have 1 new message. Please call 08718738034.
420     Send a logo 2 ur lover - 2 names joined by a h...
5110      You have 1 new message. Please call 08715205273
3530    Xmas & New Years Eve tickets are now on sale f...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
3991    (Bank of Granite issues Strong-Buy) EXPLOSIVE ...
2941     You have 1 new message. Please call 08712400200.
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
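
The two comparisons used to pull out these messages work because the labels are coded as 0 (ham) and 1 (spam). A small plain-Python sketch with made-up labels and messages shows the logic:

```python
# toy data, made up for illustration: 0 = ham, 1 = spam
actual = [0, 0, 1, 1, 0]
predicted = [0, 1, 1, 0, 0]
messages = ["ok", "call me when free", "win a prize", "you have 1 new message", "later"]

# actual < predicted only when actual is 0 and predicted is 1: a false positive
false_positives = [m for m, a, p in zip(messages, actual, predicted) if a < p]
# actual > predicted only when actual is 1 and predicted is 0: a false negative
false_negatives = [m for m, a, p in zip(messages, actual, predicted) if a > p]

print(false_positives)
print(false_negatives)
```

With pandas Series, the same comparison produces a boolean mask, which is what `X_test[...]` uses to select the rows.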

In [52]:
# what do you notice about the false negatives?
X_test[3316]

'FREE MESSAGE Activate your 500 FREE Text Messages by replying to this message with the word FREE For terms & conditions, visit www.07781482378.com'