# A Notebook for Text Featurization #  

This notebook shows how to prepare text data for classification. Before a classifier can run, the text has to be converted into features, a step called featurization. Here we build a vector with one position per vocabulary word and set that position to 1 if the word occurs in the instance and to 0 if it does not. The example data are SMS messages labeled as spam or not spam (the dataset uses the label `ham` for non-spam messages). The task is to classify a new text message as spam or not spam.

The following cell contains some predefined functions to implement text featurization and classification. Please make sure you have run this cell before you run other cells in this notebook.

In [13]:
def loadDataSet(dataset):
    output=dataset[:-4]+'_Vectorized.txt'
    with open(output,"w") as w:
        with open(dataset) as f:
            data=f.readlines()
            text=[entry.split('\t')[1].rstrip() for entry in data]
            labels=[entry.split('\t')[0] for entry in data]
            parsedText=list(map(textParse,text))
            vocabList=createVocabList(parsedText)
            for word in vocabList:
                w.write(word+',')
            w.write('class\n')  # end the header row so it is not fused with the first data row
            for i in range(len(labels)):
                returnVec=setOfWords2Vec(vocabList,parsedText[i])
                for num in returnVec:
                    w.write(str(num)+',')
                w.write(labels[i]+"\n")
            return vocabList

def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return list(vocabSet)
        
def textParse(bigString):
    import re
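    # Split on runs of non-letters. Use + rather than *: a pattern that can match the
    # empty string raises a FutureWarning on older Python 3 releases and, from
    # Python 3.7 on, splits between every character, so no token would survive the length filter.
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)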
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else:
            print('the word: %s is not in my Vocabulary' % word)
    return returnVec

def loadHead(dataset):
    with open(dataset) as f:
        lines=f.readlines()
        for line in lines[:5]:
            print(line.rstrip())

## Explore the dataset
Run the following cell to print the first five lines of the dataset. Each line holds the label of an SMS message (`ham` or `spam`) on the left and, after a tab, the message text on the right.
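
As a minimal sketch of how one such line separates into its two parts, the snippet below splits at the first tab only, so a tab inside the message text would stay in the text. The helper name `split_line` and the two sample strings (the second is shortened from the dataset) are assumptions made for illustration; they are not part of the notebook's predefined functions.

```python
# Two sample lines in the same "label<TAB>text" layout as the dataset.
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n",
]

def split_line(line):
    """Split one raw line into (label, text) at the first tab only."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text

pairs = [split_line(line) for line in sample_lines]
for label, text in pairs:
    print(label, "|", text)
```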

In [14]:
dataset=input('Please Enter Your Data Set:')
loadHead(dataset)

Please Enter Your Data Set:./Dataset/SMSSpamCollection.txt
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though


## Vectorize your text ##
"Vectorize" here means that the algorithm creates a vocabulary list of all distinct words that appear in the training set and converts each instance into a vector recording which vocabulary words it contains. For instance, if the training set contained two instances, "I am happy" and "I like you", then the vocabulary list would be ["I","am","happy","like","you"] and the two vectorized instances would be [1,1,1,0,0] and [1,0,0,1,1], where 1 means the word is present and 0 means it is absent. (The example is simplified: the `textParse` function in this notebook also lowercases each token and drops tokens shorter than three characters.)
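
The two-sentence example can be reproduced with a short, self-contained sketch. The helpers `build_vocab` and `to_vector` are hypothetical names used only here. Unlike the notebook's `createVocabList`, which goes through a set and therefore has no guaranteed word order, this sketch keeps words in first-seen order so the vectors line up with the example. It also splits on whitespace instead of calling `textParse`, so that short words such as "I" and "am" are kept.

```python
def build_vocab(documents):
    """Collect distinct words in first-seen order."""
    vocab = []
    seen = set()
    for doc in documents:
        for word in doc:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

def to_vector(vocab, doc):
    """Return a 0/1 list: 1 where the vocabulary word occurs in doc."""
    words = set(doc)
    return [1 if word in words else 0 for word in vocab]

docs = [s.split() for s in ["I am happy", "I like you"]]
vocab = build_vocab(docs)
vectors = [to_vector(vocab, d) for d in docs]
print(vocab)    # ['I', 'am', 'happy', 'like', 'you']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```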

**Training set vectorization**

In [6]:
dataset=input('Please Enter Your Data Set:')
vocabList=loadDataSet(dataset)
print('Text featurization is done!')

Please Enter Your Data Set:./Dataset/SMSSpamCollection.txt


Text featurization is done!


**Test set vectorization**  
Copy one SMS message at a time from the test set file and paste it at the prompt; the cell prints the message's vector over the training vocabulary.
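
A test message can contain words that never appeared in the training set, and those words have no position in the vector. The sketch below illustrates one way to handle this, assuming a tiny made-up vocabulary and token list: known words set their position to 1 and unknown words are collected instead of printed. The function `vectorize` is a hypothetical variant of `setOfWords2Vec`, and it uses a dictionary lookup in place of `vocabList.index`, which is an alternative design rather than the notebook's method.

```python
def vectorize(vocab, tokens):
    """Return (vector, unknown_tokens) for tokens against a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> vector position
    vec = [0] * len(vocab)
    unknown = []
    for tok in tokens:
        pos = index.get(tok)
        if pos is None:
            unknown.append(tok)  # not seen during training, so no position exists
        else:
            vec[pos] = 1
    return vec, unknown

# Made-up vocabulary and tokens, for illustration only.
vocab = ["free", "call", "claim", "lunch"]
tokens = ["claim", "call", "bonus", "call"]
vec, unknown = vectorize(vocab, tokens)
print(vec)      # [0, 1, 1, 0]
print(unknown)  # ['bonus']
```

The vector length always equals the vocabulary size, which is why the output for a single message against the full training vocabulary is long and mostly zeros.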

In [8]:
testset=input('Please Enter Your Text Message:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

Please Enter Your Text Message:PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0


Now you can print this notebook as a PDF file and turn it in.