# Text Featurization
**Note: The following cell defines the helper functions used for text featurization. Run it before running any other cell in this notebook.**

In [25]:
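def loadDataSet(dataset):
    # Write the vectorized data next to the input, e.g. data.txt -> data_Vectorized.txt
    output=dataset[:-4]+'_Vectorized.txt'
    with open(output,"w") as w:
        with open(dataset) as f:
            data=f.readlines()
            # Each input line is "<label>\t<text>"
            text=[entry.split('\t')[1].rstrip() for entry in data]
            labels=[entry.split('\t')[0] for entry in data]
            # Build a real list (not a lazy map) because it is iterated and indexed below
            parsedText=[textParse(t) for t in text]
            vocabList=createVocabList(parsedText)
            # Header row: one column per vocabulary word, then the class label
            for word in vocabList:
                w.write(word+',')
            w.write('class\n')
            # One row per instance: 0/1 features followed by the label
            for i in range(len(labels)):
                returnVec=setOfWords2Vec(vocabList,parsedText[i])
                for num in returnVec:
                    w.write(str(num)+',')
                w.write(labels[i]+"\n")
            return vocabList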

def createVocabList(dataSet):
    # Union of all tokens across documents; the resulting order is arbitrary
    vocabSet=set()
    for document in dataSet:
        vocabSet|=set(document)
    return list(vocabSet)
        
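def textParse(bigString):
    import re
    # Split on runs of non-letters. Use '+' rather than '*': a pattern that can
    # match the empty string splits between every character on Python 3.7+.
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)
    # Lowercase and drop tokens of one or two characters
    return [tok.lower() for tok in listOfTokens if len(tok)>2]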

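def setOfWords2Vec(vocabList,inputSet):
    # Binary presence vector: 1 if the vocabulary word occurs in inputSet, else 0
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else:
            # Words not seen in the training set are reported and skipped
            print('the word: %s is not in my Vocabulary' % word)
    return returnVec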

## Vectorize your text
Here, "vectorize" means that the algorithm builds a vocabulary list of all distinct words that appear in the training set and converts each instance into a vector recording which vocabulary words it contains. For instance, if the training set contained two instances, "I am happy" and "I like you", then the vocabulary list would be ["I","am","happy","like","you"] and the two vectorized instances would be [1,1,1,0,0] and [1,0,0,1,1], where 1 means the word appears in the instance and 0 means it does not. (The example is simplified: `textParse` also lowercases tokens and drops those of one or two characters, and the order of the vocabulary list is not fixed.)

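The two-instance example can be reproduced with a minimal sketch. The names `build_vocab` and `to_binary_vector` are illustrative and not part of this notebook; the sketch splits on whitespace and keeps words in first-seen order so that the result matches the example, whereas the notebook's own functions also lowercase, filter short tokens, and return the vocabulary in arbitrary order.

```python
def build_vocab(documents):
    """Collect distinct tokens in first-seen order (an assumption made for readability)."""
    vocab = []
    seen = set()
    for doc in documents:
        for token in doc:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

def to_binary_vector(vocab, tokens):
    """Return 1 for each vocabulary word present in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

# Whitespace tokenization only; no lowercasing or length filter here.
docs = [sentence.split() for sentence in ["I am happy", "I like you"]]
vocab = build_vocab(docs)
vectors = [to_binary_vector(vocab, doc) for doc in docs]

print(vocab)    # ['I', 'am', 'happy', 'like', 'you']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```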
**Training set vectorization**

In [26]:
dataset=raw_input('Please Enter Your Data Set:')
vocabList=loadDataSet(dataset)
print('Text featurization is done!')

Please Enter Your Data Set:SMSSpamCollection.txt
Text featurization is done!


**Test set vectorization**  
Paste one SMS message from the test set file at a time to get its vectorized form.
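
To make the lookup concrete, the sketch below traces what happens to one pasted message: it is tokenized in the same spirit as `textParse`, and each vocabulary word is then marked 1 or 0. The six-word `vocab` is a hypothetical stand-in for the real training vocabulary, and `tokenize` and `active` are illustrative names that do not exist in this notebook.

```python
import re

def tokenize(message):
    # Mirrors textParse: split on runs of non-letters, lowercase,
    # and drop tokens of one or two characters.
    tokens = re.split(r'[^A-Za-z]+', message)
    return [tok.lower() for tok in tokens if len(tok) > 2]

# Hypothetical vocabulary; the real one is built from the training set.
vocab = ['anymore', 'free', 'nothing', 'please', 'prize', 'text']

message = "Please don't text me anymore. I have nothing else to say."
tokens = tokenize(message)
present = set(tokens)

# 1 where the vocabulary word occurs in the message, 0 elsewhere.
vec = [1 if word in present else 0 for word in vocab]

# Listing the matched words is easier to read than a long run of zeros.
active = [word for word, bit in zip(vocab, vec) if bit]

print(tokens)  # ['please', 'don', 'text', 'anymore', 'have', 'nothing', 'else', 'say']
print(vec)     # [1, 0, 1, 1, 0, 1]
print(active)  # ['anymore', 'nothing', 'please', 'text']
```

With a large training vocabulary the vector is mostly zeros, so inspecting the matched words in `active` is a quick sanity check on the result.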

In [29]:
testset=raw_input('Please Enter Your Test Set:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

Please Enter Your Test Set:Please don't text me anymore. I have nothing else to say.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,