# A Notebook for Text Classification #  

This notebook shows you how to classify text data. Before a classifier can run, the text has to be converted into features. This step is called featurization: each message becomes a vector with one entry for every word in the vocabulary, holding a 1 if the word appears in the message and a 0 if it does not. The example data are SMS messages labeled as spam or not spam (ham). The task is to classify a new text message as spam or not spam.

YIBO: If you can create an initial example with a very small subset of the data I think that would be useful.  That way when you show featurization the matrix will be just a few words.  So walk them through a smaller dataset, then with the large dataset.

YIBO: After you describe featurization, you should include classification.  Without the classification part, this notebook will not be as helpful to the students.

YIBO: When I enter the test message, the output is a feature matrix.  That is fine, but I would also like for it to show me the classification of my message as spam or no spam.  Can you add classification at the end of this notebook?  

The following cell contains some predefined functions to implement text featurization and classification. Please make sure you have run this cell before you run other cells in this notebook.

In [63]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

def SampleData(dataset):
    import pandas as pd
    df=pd.read_csv(dataset,sep='\t')  # pass the separator by keyword; recent pandas versions reject it positionally
    with pd.option_context('max_colwidth',160):
        display(df.head())
    return df.head()
    

def featDataset(dataset):
    output=dataset[:-4]+'_Vectorized.txt'
    with open(output,"w") as w:
        with open(dataset) as f:
            data=f.readlines()
            text=[entry.split('\t')[1].rstrip() for entry in data[1:]]
            labels=[entry.split('\t')[0] for entry in data[1:]]
            parsedText=list(map(textParse,text))
            vocabList=createVocabList(parsedText)
            for word in vocabList:
                w.write(word+',')
            w.write('class\n')  # end the header row so the first instance starts on its own line
            for i in range(len(labels)):
                returnVec=setOfWords2Vec(vocabList,parsedText[i])
                for num in returnVec:
                    w.write(str(num)+',')
                w.write(labels[i]+"\n")
            return vocabList

def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return list(vocabSet)
        
def textParse(bigString):
    import re
    #listOfTokens=re.split(r'\W*',bigString)
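    # Use + rather than *: a pattern that can match the empty string triggers a warning
    # on older Python versions and splits between every character on Python 3.7+.
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)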
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else: print('the word: %s is not in my Vocabulary' % word)
    return returnVec

def loadDataSet(dataset): 
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        for i in range(len(instances[0])):
            try:
                dataArray.append([float(instance[i]) for instance in instances])
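            except ValueError:
                # Non-numeric column. Note: encode() is not defined in this notebook and
                # must be supplied before this branch can run.
                encodedData,codeBook=encode([instance[i] for instance in instances])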
                dataArray.append(encodedData)
                print(attributes[i],': ',list(codeBook.items()))
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return instances,labels

def chooseClassifier(choice,instances,labels):
    clf=[]
    choice=[c.strip() for c in choice.split(',')]  # tolerate spaces, e.g. "1, 3"
    if "1" in choice:
        clf_B = BernoulliNB()
        clf_B.fit(instances, labels)
        print('Bernoulli Naive Bayes is used.')
        clf.append(clf_B)
    if "2" in choice:
        clf_G = GaussianNB()
        clf_G.fit(instances, labels)
        print("Gaussian Naive Bayes is used.")
        clf.append(clf_G)
    if "3" in choice:
        clf_M = MultinomialNB()
        clf_M.fit(instances, labels)
        print("Multinomial Naive Bayes is used.")
        clf.append(clf_M)
    if not clf:  # none of "1", "2", "3" was entered
        print("Please choose a correct classifier.")
    return clf
    
def evaluateClf(clf,instances,labels,n_foldCV):
    for item in clf:
        if type(item).__name__=="BernoulliNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======BernoulliNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif type(item).__name__=="GaussianNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======GaussianNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif type(item).__name__=="MultinomialNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======MultinomialNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
            
def predict(clf,testset):
    for item in clf:
        if type(item).__name__=="BernoulliNB":
            prediction=item.predict(testset)
            print("BernoulliNB: ",prediction)
        elif type(item).__name__=="GaussianNB":
            prediction=item.predict(testset)
            print("GaussianNB: ",prediction) 
        elif type(item).__name__=="MultinomialNB":
            prediction=item.predict(testset)
            print("MultinomialNB:",prediction) 

## Explore the data
The following cell shows an excerpt (the first five rows) of the SMS message dataset. The vocabulary list and the text vector of each instance are produced in the sections that follow.

In [31]:
dataset=input('Please Enter Your Data Set:')
sample=SampleData(dataset)

Please Enter Your Data Set:./Dataset/SMSSpamCollection.txt


| | class | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


## Preprocess the text
Before we featurize the text, it has to be parsed into tokens and compiled into a vocabulary list. The parsing rule here is to split each message on anything that is not an English letter, convert the pieces to lowercase, and keep only the tokens that are longer than two characters. The following cell displays the parsed text of the excerpt and its vocabulary list.

In [48]:
text=[instance.rstrip() for instance in sample.iloc[:,1]]
parsedText=list(map(textParse,text))
vocabList=createVocabList(parsedText)
print('Parsed Text: ')
for instance in parsedText:
    print(instance)
print('Vocabulary List: \n',vocabList)

Parsed Text: 
['until', 'jurong', 'point', 'crazy', 'available', 'only', 'bugis', 'great', 'world', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']
['lar', 'joking', 'wif', 'oni']
['free', 'entry', 'wkly', 'comp', 'win', 'cup', 'final', 'tkts', 'may', 'text', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'apply', 'over']
['dun', 'say', 'early', 'hor', 'already', 'then', 'say']
['nah', 'don', 'think', 'goes', 'usf', 'lives', 'around', 'here', 'though']
Vocabulary List: 
 ['dun', 'cine', 'wif', 'entry', 'comp', 'may', 'here', 'bugis', 'point', 'buffet', 'txt', 'got', 'available', 'early', 'amore', 'say', 'question', 'usf', 'don', 'oni', 'joking', 'win', 'receive', 'rate', 'jurong', 'until', 'lar', 'std', 'nah', 'great', 'already', 'crazy', 'world', 'free', 'text', 'over', 'around', 'there', 'wkly', 'only', 'tkts', 'apply', 'final', 'then', 'lives', 'cup', 'hor', 'wat', 'goes', 'though', 'think']
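
The parsing rule described above can also be sketched on its own. This is a minimal illustration rather than the notebook's `textParse`: the function name `tokenize` is hypothetical, and it collects runs of letters with `re.findall` instead of splitting on non-letters, which should give the same tokens for ordinary input.

```python
import re

def tokenize(message):
    """Return lowercase alphabetic tokens that are longer than two characters."""
    # Collect maximal runs of ASCII letters; digits and punctuation act as separators.
    runs = re.findall(r'[A-Za-z]+', message)
    return [run.lower() for run in runs if len(run) > 2]

print(tokenize("Ok lar... Joking wif u oni..."))
# ['lar', 'joking', 'wif', 'oni']
```

Short tokens such as "Ok" and "u" are dropped by the length filter, which matches the second row of the parsed text shown above.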




## Vectorize your text
"Vectorize" here means that the algorithm creates a vocabulary list of all distinct words that appear in the training set and converts each instance into a vector that records which vocabulary words it contains. For instance, if the training set contained two instances, "I am happy" and "I like you", then the vocabulary list would be ["I","am","happy","like","you"] and the two vectorized instances would be [1,1,1,0,0] and [1,0,0,1,1]. A 0 means the word is absent and a 1 means it is present. The following cell outputs the vectors corresponding to the parsed text you got from the last step.

In [49]:
for instance in parsedText:
    print(setOfWords2Vec(vocabList,instance))

[0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
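
The two-sentence example from the description above ("I am happy" and "I like you") can be checked with a few lines of code. This is a sketch under stated assumptions: `build_vocab` and `to_vector` are hypothetical names, the vocabulary keeps first-seen order so that it matches the prose (the notebook's `createVocabList` uses a set, so its order is arbitrary), and the length filter from the parsing step is skipped so that "I" and "am" survive.

```python
def build_vocab(documents):
    """Collect the distinct words of all documents in first-seen order."""
    vocab = []
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab.append(word)
    return vocab

def to_vector(vocab, doc):
    """Return a 0/1 list: 1 where the vocabulary word occurs in doc."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]

docs = [["I", "am", "happy"], ["I", "like", "you"]]
vocab = build_vocab(docs)
vectors = [to_vector(vocab, doc) for doc in docs]

print(vocab)    # ['I', 'am', 'happy', 'like', 'you']
for vector in vectors:
    print(vector)
# [1, 1, 1, 0, 0]
# [1, 0, 0, 1, 1]
```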


## Put them together
You have now seen how SMS messages are converted into featurized vectors. In the following cell we apply this method to the whole dataset, which generates a vectorized text file for spam classification.

**Training set vectorization**

In [52]:
dataset=input('Please Enter Your Data Set:')
vocabList=featDataset(dataset)
print('Text featurization is done!')

Please Enter Your Data Set:./Dataset/SMSSpamCollection.txt




Text featurization is done!
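
The vectorized file is a comma-separated table: a header row listing the vocabulary words followed by `class`, then one row of 0/1 values and a label per message. The sketch below reproduces that layout in memory; the three-word vocabulary and the two rows are made up for illustration, and it uses the standard `csv` module rather than the notebook's manual `write` calls.

```python
import csv
import io

# Hypothetical miniature vocabulary and parsed messages.
vocab = ["free", "win", "say"]
rows = [(["free", "win"], "spam"), (["say"], "ham")]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Header: one column per vocabulary word, then the label column.
writer.writerow(vocab + ["class"])

# One line per message: 0/1 flags in vocabulary order, then the label.
for tokens, label in rows:
    present = set(tokens)
    writer.writerow([1 if word in present else 0 for word in vocab] + [label])

print(buffer.getvalue())
# free,win,say,class
# 1,1,0,spam
# 0,0,1,ham
```

Each line, including the header, ends with a newline, which is what a line-based reader such as `loadDataSet` relies on to tell the header apart from the first instance.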


**Test set vectorization**  
Copy one SMS message at a time from the test set file and paste it at the prompt; the cell prints the vectorized message.

In [54]:
testset=input('Please Enter Your Text Message:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

Please Enter Your Text Message:PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0



## Train a Naïve Bayes classifier
The following cell trains a Naive Bayes classifier on the featurized dataset. Three Naive Bayes classifiers are provided. They are based on different mathematical foundations and may perform differently on different datasets.

To use Bernoulli Naive Bayes, enter **1**; for Gaussian Naive Bayes, enter **2**; for Multinomial Naive Bayes, enter **3**. You can choose several classifiers at once by entering their numbers separated by commas.

In [64]:
choice=input("Please Choose Classifiers:")
instances,labels=loadDataSet(dataset[:-4]+'_Vectorized.txt')
clf=chooseClassifier(choice,instances,labels)

Please Choose Classifiers:2,3,1
Bernoulli Naive Bayes is used.
Gaussian Naive Bayes is used.
Multinomial Naive Bayes is used.
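
The helper cell also defines `evaluateClf`, which wraps scikit-learn's `cross_val_score`, but no cell in this notebook calls it. The sketch below shows what such an evaluation could look like. It is an assumption-laden illustration: the 0/1 matrix and labels are made up, and two folds are used only because the toy data is tiny.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Made-up 0/1 features (columns could stand for "free", "win", "say", "think").
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
])
y = np.array(["spam"] * 4 + ["ham"] * 4)

# For a classifier, an integer cv means stratified k-fold:
# each fold keeps roughly the same spam/ham ratio.
scores = cross_val_score(BernoulliNB(), X, y, cv=2)

print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

On the real dataset the equivalent call would be `evaluateClf(clf, instances, labels, n_foldCV)` with a fold count of your choice.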


## Predict unseen examples
The following cell uses the trained classifiers to predict the class of the featurized test message from the test set vectorization step above. Run the cell to see the result from each classifier you chose.

In [66]:
#testset=input('Please Enter Your Unseen Instance:')
#testset=list(map(float,testset.split(',')))
testset=np.array(returnVec).reshape(1, -1)
predict(clf,testset)

BernoulliNB:  ['spam']
GaussianNB:  ['spam']
MultinomialNB: ['spam']
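
The whole path from raw message to label can also be walked through on a very small dataset. The sketch below is self-contained and only illustrative: the six training messages are hand-made rather than taken from the SMS dataset, the helper names (`tokenize`, `vectorize`, `classify`) are hypothetical, and only Bernoulli Naive Bayes is used.

```python
import re

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hand-made messages for illustration only.
train_messages = [
    ("spam", "Free entry win a prize now"),
    ("spam", "Win free cash prize"),
    ("spam", "Claim your free prize"),
    ("ham", "Are you coming home later"),
    ("ham", "See you at home tonight"),
    ("ham", "Call me later tonight"),
]

def tokenize(message):
    """Lowercase alphabetic tokens that are longer than two characters."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", message) if len(t) > 2]

def vectorize(vocab, tokens):
    """0/1 vector over vocab; words outside the vocabulary are ignored."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

labels = [label for label, _ in train_messages]
parsed = [tokenize(text) for _, text in train_messages]

# Sorting gives a reproducible column order (a plain set has arbitrary order).
vocab = sorted({word for tokens in parsed for word in tokens})
X = np.array([vectorize(vocab, tokens) for tokens in parsed])

model = BernoulliNB().fit(X, labels)

def classify(message):
    """Featurize one raw message and return the predicted label."""
    features = np.array(vectorize(vocab, tokenize(message))).reshape(1, -1)
    return model.predict(features)[0]

print(classify("Free prize! Claim now"))  # spam
print(classify("See you later"))          # ham
```

With only six training messages the model simply reflects which words were seen with which label, so the results say nothing about accuracy on real SMS data.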


Now you can print this notebook as a PDF file and turn it in.