<img src="Images/PU.png" width="100%">

### Course Name : Machine Learning for Professionals (ML 701)    
##### Notebook compiled by : Bhushan Garware, Project Lead at Learning and Development  
**Important!** For internal circulation only

# Extracting features from the Text
In many machine learning applications, such as sentiment analysis, text data is used as the explanatory variable. The text must first be converted into a feature vector, a numerical representation that captures as much of its information as possible.
<img src="Images/Text_Data.png" width="80%">


# The bag-of-words representation

Let’s assume that we are working on a document classification problem. The collection of all the documents is called the corpus.

In [1]:
# These __future__ imports give Python 2 the Python 3 behaviour for imports, division, and print()
from __future__ import absolute_import, division, print_function

In [2]:
X = ["Orbit program is important for all of us",
     "Orbit program is very interesting"]

In [3]:
len(X)

2

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
vectorizer.vocabulary_

{'all': 0,
 'for': 1,
 'important': 2,
 'interesting': 3,
 'is': 4,
 'of': 5,
 'orbit': 6,
 'program': 7,
 'us': 8,
 'very': 9}

In [6]:
X_bag_of_words = vectorizer.transform(X)
X_bag_of_words

<2x10 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [7]:
X_bag_of_words.shape

(2, 10)

In [8]:
X_bag_of_words.toarray()

array([[1, 1, 1, 0, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
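
To make the counts above concrete, here is a minimal sketch that rebuilds the same vocabulary and matrix with only the standard library. It assumes the default settings shown in the `CountVectorizer` repr (lowercasing and the `\b\w\w+\b` token pattern); the helper names `tokenize` and `to_counts` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

docs = ["Orbit program is important for all of us",
        "Orbit program is very interesting"]

# Same token pattern as in the repr: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then split into tokens (mirrors lowercase=True).
    return TOKEN_PATTERN.findall(text.lower())

# Vocabulary: every distinct token, sorted alphabetically and numbered from 0.
all_tokens = {tok for doc in docs for tok in tokenize(doc)}
vocabulary = {tok: idx for idx, tok in enumerate(sorted(all_tokens))}

def to_counts(text):
    # One row of the document-term matrix: a count per vocabulary entry.
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
print(matrix)
# [[1, 1, 1, 0, 1, 1, 1, 1, 1, 0], [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]]
```

Each column corresponds to one vocabulary word, so a document is described only by which words it contains and how often, not by their order.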

### Adding stop words

In [9]:
my_list=['is','of']

In [10]:

vectorizer = CountVectorizer(stop_words=my_list)
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=['is', 'of'],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
vectorizer.vocabulary_

{'all': 0,
 'for': 1,
 'important': 2,
 'interesting': 3,
 'orbit': 4,
 'program': 5,
 'us': 6,
 'very': 7}

In [12]:
X_bag_of_words = vectorizer.transform(X)
print(X_bag_of_words.shape)
X_bag_of_words.toarray()

(2, 8)


array([[1, 1, 1, 0, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 1]], dtype=int64)

# Finding Important Words in Text Using TF-IDF
TF-IDF stands for "Term Frequency, Inverse Document Frequency". It scores the importance of a word (or "term") in a document by weighing how often the word appears in that document against how many documents in the corpus contain it.

+ If a word appears frequently in a document, it's important. Give the word a high score.
+ But if a word appears in many documents, it's not a unique identifier. Give the word a low score.

The scikit-learn documentation covers the math in more detail [here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
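
As a rough worked example of the two rules above, the sketch below scores the two Orbit sentences by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The helper names `idf` and `tfidf_row` are illustrative only.

```python
import math

# The two example sentences, already lowercased and tokenized.
doc1 = "orbit program is important for all of us".split()
doc2 = "orbit program is very interesting".split()
docs = [doc1, doc2]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Term frequency times IDF, then scale the row to unit L2 length.
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row1 = tfidf_row(doc1)
row2 = tfidf_row(doc2)
print(round(row1["orbit"], 2), round(row1["important"], 2))    # 0.28 0.39
print(round(row2["orbit"], 2), round(row2["interesting"], 2))  # 0.38 0.53
```

"orbit" appears in both documents, so it scores lower than "important" or "interesting", which each appear in only one.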

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

[[0.39 0.39 0.39 0.   0.28 0.39 0.28 0.28 0.39 0.  ]
 [0.   0.   0.   0.53 0.38 0.   0.38 0.38 0.   0.53]]


# N-Grams
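An n-gram is a sequence of n consecutive tokens. Setting `ngram_range=(2, 3)` makes the vectorizer count word pairs and word triples instead of single words.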

In [16]:
Ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
Ngram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [17]:
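# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
Ngram_vectorizer.get_feature_names()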

['all of',
 'all of us',
 'for all',
 'for all of',
 'important for',
 'important for all',
 'is important',
 'is important for',
 'is very',
 'is very interesting',
 'of us',
 'orbit program',
 'orbit program is',
 'program is',
 'program is important',
 'program is very',
 'very interesting']

In [18]:
Ngram_vectorizer.transform(X).toarray()

array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]], dtype=int64)
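
The n-gram features above can be reproduced with a short sliding-window loop. This is a minimal sketch, not scikit-learn's implementation; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous token sequences of length min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "orbit program is very interesting".split()
features = sorted(word_ngrams(tokens, 2, 3))
print(features)
# ['is very', 'is very interesting', 'orbit program', 'orbit program is',
#  'program is', 'program is very', 'very interesting']
```

These seven features are exactly the columns set to 1 in the second row of the matrix above.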

# SMS Spam Collection Data Set


The dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). It is a collection of more than **5 thousand SMS phone messages**.
<img src="Images/spam.jpg" width="80%">

In [19]:
import pandas as pd
import numpy as np
import seaborn as sns

In [20]:
import matplotlib.pyplot as plt
%matplotlib inline

In [21]:
sms = pd.read_csv('./Datasets/SMSSpamCollection', sep='\t', names=["label", "message"])
sms.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [22]:
# examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [23]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [24]:
# check that the conversion worked
sms.head(10)

| | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |


In [25]:
X = sms.message
y = sms.label_num

In [26]:
# split X and y into training and testing sets
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Vectorizing our dataset

In [27]:
# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [28]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [29]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>
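
Both matrices have 7456 columns because the vocabulary is learned from the training data only; words that appear only in the test messages are dropped at transform time. The sketch below illustrates that behaviour on the two Orbit sentences with plain Python. The `tokenize` and `transform` helpers and the new sentence are hypothetical stand-ins, not the scikit-learn code.

```python
import re

train_docs = ["Orbit program is important for all of us",
              "Orbit program is very interesting"]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": the vocabulary comes from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(text):
    # "transform": count only the words that are already in the vocabulary.
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]

new_doc = "Orbit session is awesome"
row = transform(new_doc)
unseen = [tok for tok in tokenize(new_doc) if tok not in vocabulary]
print(row)     # [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
print(unseen)  # ['session', 'awesome']
```

Fitting on the training split alone keeps information from the test messages out of the features, which is why `vect.fit` is never called on `X_test`.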

# Machine Learning 

In [30]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [31]:
# train the model using X_train_dtm 
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
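
For intuition about what the fitted model computes, here is a small from-scratch sketch of multinomial Naive Bayes with add-alpha (Laplace) smoothing. The four training messages are made up for illustration, and the helper names are hypothetical; only `alpha = 1.0` is taken from the repr above.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label) with 0 = ham, 1 = spam.
train = [
    ("free prize call now".split(), 1),
    ("win free entry".split(), 1),
    ("see you at lunch".split(), 0),
    ("call me after lunch".split(), 0),
]
alpha = 1.0  # same smoothing value as the MultinomialNB default shown above

vocabulary = sorted({tok for tokens, _ in train for tok in tokens})
classes = sorted({label for _, label in train})

# Per-class word counts and class frequencies.
word_counts = {c: Counter() for c in classes}
class_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    class_counts[label] += 1

def log_posterior(tokens, c):
    # log P(c) + sum of log P(token | c), with add-alpha smoothing.
    total = sum(word_counts[c].values())
    score = math.log(class_counts[c] / len(train))
    for tok in tokens:
        if tok in vocabulary:  # words outside the vocabulary are ignored
            score += math.log((word_counts[c][tok] + alpha)
                              / (total + alpha * len(vocabulary)))
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalised) log posterior.
    return max(classes, key=lambda c: log_posterior(tokens, c))

print(predict("free entry".split()))   # 1 (spam)
print(predict("after lunch".split()))  # 0 (ham)
```

Smoothing matters here: without `alpha`, a word never seen in one class would give that class a probability of zero regardless of the other words.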

In [32]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [33]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9885139985642498

In [34]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1203,    5],
       [  11,  174]], dtype=int64)
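
The confusion matrix carries more information than the single accuracy figure. As a small worked example, the counts above (rows are actual classes, columns are predicted classes, so the layout is `[[TN, FP], [FN, TP]]`) give the following precision and recall for the spam class, along with the accuracy of a baseline that always predicts ham.

```python
# Counts copied from the confusion matrix above.
tn, fp = 1203, 5
fn, tp = 11, 174

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)     # of messages flagged as spam, the share that were spam
recall = tp / (tp + fn)        # of actual spam messages, the share that were caught
baseline = (tn + fp) / total   # accuracy of always predicting ham

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(baseline, 4))
# 0.9885 0.9721 0.9405 0.8672
```

Because ham dominates the test set, the baseline already reaches about 87% accuracy, so precision and recall give a clearer picture of how well spam is detected.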

In [35]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [36]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [37]:
# example false negative
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

# Time for Testing 

In [41]:
# example text for model testing
simple_test = ["Free entry to Awesome orbit session"]

In [42]:
X_temp = vect.transform(simple_test)
X_temp.toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [43]:
nb.predict(X_temp)

array([1], dtype=int64)