# A Notebook for Text Classification #  

This notebook shows how to classify text data. Before a classifier can run, the text must be converted into numeric features, a step called featurization. Here we build a vector with one entry per word in the vocabulary and set that entry to 1 if the word appears in the instance and 0 if it does not. The example data are SMS messages, each labeled as spam or not spam. The task is to classify a new text message as spam or not spam.

YIBO: If you can create an initial example with a very small subset of the data I think that would be useful.  That way when you show featurization the matrix will be just a few words.  So walk them through a smaller dataset, then with the large dataset.

YIBO: After you describe featurization, you should include classification.  Without the classification part, this notebook will not be as helpful to the students.

YIBO: When I enter the test message, the output is a feature matrix.  That is fine, but I would also like for it to show me the classification of my message as spam or no spam.  Can you add classification at the end of this notebook?  

The following cell defines the functions used for text featurization. Run this cell before running any other cell in this notebook.

In [3]:
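def loadDataSet(dataset):
    """Vectorize a tab-separated 'label<TAB>text' file and write the result as CSV."""
    # The output name assumes a 4-character extension such as '.txt'
    output=dataset[:-4]+'_Vectorized.txt'
    with open(dataset) as f:
        # Split each line once into [label, text]
        rows=[entry.rstrip('\n').split('\t',1) for entry in f]
    labels=[row[0] for row in rows]
    parsedText=[textParse(row[1]) for row in rows]
    vocabList=createVocabList(parsedText)
    with open(output,"w") as w:
        # Header row: one column per vocabulary word, then the class label.
        # The trailing newline keeps the first data row off the header line.
        w.write(','.join(vocabList+['class'])+'\n')
        for label,tokens in zip(labels,parsedText):
            returnVec=setOfWords2Vec(vocabList,tokens)
            w.write(','.join([str(num) for num in returnVec]+[label])+'\n')
    return vocabList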

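def createVocabList(dataSet):
    """Return the sorted list of distinct tokens across all documents."""
    vocabSet=set()
    for document in dataSet:
        vocabSet|=set(document)
    # Sort so the column order is the same on every run
    return sorted(vocabSet)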
        
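def textParse(bigString):
    """Return the lowercased alphabetic tokens that are longer than two characters."""
    import re
    # Split on runs of non-letters. '+' (not '*') avoids empty-string matches,
    # which Python 3.7+ would split on between every character.
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]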

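def setOfWords2Vec(vocabList,inputSet):
    """Return a 0/1 vector marking which vocabulary words occur in inputSet."""
    # Map each word to its column once instead of scanning the list per word
    index=dict((word,i) for i,word in enumerate(vocabList))
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in index:
            returnVec[index[word]]=1
        else:
            # Parenthesized form prints the same text on Python 2 and Python 3
            print('the word: %s is not in my Vocabulary' % word)
    return returnVec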
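
To see what the parsing step does on a small input, the sketch below runs it on three made-up messages instead of a file. The sample messages, the `ham`/`spam` label strings, and the helper name `tokenize` are assumptions for illustration; `tokenize` is meant to mirror `textParse` (letters only, lowercase, tokens longer than two characters).

```python
import re

# Three made-up lines in the "label<TAB>message" layout that loadDataSet reads.
# The label strings 'ham' and 'spam' are assumed here for illustration.
sample_lines = [
    "ham\tAre we still meeting for lunch today?\n",
    "spam\tWIN a FREE prize now! Call 12345\n",
    "ham\tI can't even\n",
]

def tokenize(message):
    """Split on runs of non-letters, lowercase, drop tokens of one or two characters."""
    return [tok.lower() for tok in re.split(r'[^A-Za-z]+', message) if len(tok) > 2]

labels, token_lists = [], []
for line in sample_lines:
    # Split once on the first tab: label on the left, message on the right
    label, message = line.rstrip("\n").split("\t", 1)
    labels.append(label)
    token_lists.append(tokenize(message))

for label, tokens in zip(labels, token_lists):
    print(label, tokens)
```

Note that short words such as "I" and "we", digits, and the "t" left over from "can't" are all dropped, so "I can't even" is reduced to `['can', 'even']`.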

## Vectorize your text ##
"Vectorize" here means that the algorithm builds a vocabulary list of all distinct words that appear in the training set and converts each instance into a vector recording which vocabulary words it contains. For instance, if the training set contained two instances, "I am happy" and "I like you", then the vocabulary list would be ["I","am","happy","like","you"] and the two vectorized instances would be [1,1,1,0,0] and [1,0,0,1,1], where 1 means the word is present and 0 means it is absent. (This example is conceptual: the `textParse` function above also lowercases each word and discards words of one or two letters, such as "I" and "am".)
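
The two-sentence example above can be reproduced with a few lines of Python. This is a minimal sketch, not the notebook's own code: the helper names `build_vocab` and `to_binary_vector` are hypothetical, words are kept exactly as written (no lowercasing or length filter), and the vocabulary is kept in first-seen order so that the columns match the example.

```python
def build_vocab(documents):
    """Collect distinct words in first-seen order so the columns are reproducible."""
    vocab, seen = [], set()
    for doc in documents:
        for word in doc:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

def to_binary_vector(vocab, doc):
    """Return 1 for each vocabulary word present in doc, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]

# The two training instances from the text, already split into words
documents = [["I", "am", "happy"], ["I", "like", "you"]]
vocab = build_vocab(documents)
vectors = [to_binary_vector(vocab, doc) for doc in documents]

print(vocab)    # ['I', 'am', 'happy', 'like', 'you']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```

Each row of `vectors` has one column per vocabulary word, which is the same layout `loadDataSet` writes to the vectorized file, followed there by the class label.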

**Training set vectorization**

In [4]:
# raw_input is the Python 2 name; on Python 3 use input() instead
dataset=raw_input('Please Enter Your Data Set:')
vocabList=loadDataSet(dataset)
print('Text featurization is done!')

Please Enter Your Data Set:SMSSpamCollection.txt
Text featurization is done!


**Test set vectorization**  
Copy one SMS message at a time from the test set file and paste it at the prompt; the cell prints the vectorized message.

In [7]:
# raw_input is the Python 2 name; on Python 3 use input() instead
testset=raw_input('Please Enter Your Text Message:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

Please Enter Your Text Message:I can't even
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
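
A feature vector like the one printed above can then be handed to a classifier to obtain a spam or not-spam label. The notebook does not specify a classifier, so the sketch below assumes a Bernoulli naive Bayes model with Laplace smoothing, written in plain Python. The function names `train_bernoulli_nb` and `classify`, the six-word vocabulary, the four training vectors, and the `spam`/`ham` label strings are all made up for illustration; with the real data, the training vectors and labels would come from the vectorized file.

```python
import math

def train_bernoulli_nb(vectors, labels):
    """Estimate a log-prior and smoothed word-presence probabilities per class."""
    n_words = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        # P(word present | class), with +1/+2 smoothing so no value is exactly 0 or 1
        probs = [(sum(r[j] for r in rows) + 1.0) / (len(rows) + 2.0)
                 for j in range(n_words)]
        model[c] = (math.log(len(rows) / float(len(labels))), probs)
    return model

def classify(model, vector):
    """Return the class with the highest log-posterior for a 0/1 vector."""
    best_class, best_score = None, float("-inf")
    for c, (log_prior, probs) in sorted(model.items()):
        score = log_prior
        for x, p in zip(vector, probs):
            score += math.log(p) if x else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training data (hypothetical): columns follow this vocabulary order
vocab = ["call", "free", "lunch", "meeting", "prize", "win"]
train_vectors = [
    [1, 1, 0, 0, 1, 1],  # spam
    [0, 1, 0, 0, 1, 0],  # spam
    [0, 0, 1, 1, 0, 0],  # ham
    [1, 0, 1, 0, 0, 0],  # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_bernoulli_nb(train_vectors, train_labels)
print(classify(model, [0, 1, 0, 0, 1, 1]))  # contains "free", "prize", "win" -> spam
print(classify(model, [0, 0, 1, 1, 0, 0]))  # contains "lunch", "meeting" -> ham
```

Working in log space avoids numerical underflow when the vocabulary has thousands of words, as it does for the full SMS data set.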

Now you can print this notebook as a PDF file and turn it in.