## Data Preprocessing Using NLTK
The training data contains about 8,000 comments, each with a star rating from 1 to 5. We want to train a model that predicts the star rating from the comment text. In this homework, we implement the data preprocessing with NLTK. The steps are listed below.
<br>
<li> Import the packages you need and read the CSV file.
<li> Turn each comment into a bag of words. The bag should contain only verbs and adjectives; stop words and punctuation are excluded.
<li> Turn each bag of words into numbers. Each row represents a sample (one comment) and each column represents a word. The assignment calls this one-hot encoding, although the entries are word counts rather than 0/1 indicators.
<li> Finally, use the train_test_split function in sklearn.model_selection to split the data into a training set and a testing set, then fit the model on the training set. (This part of the code is provided.)

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
import string
from nltk.corpus import stopwords
from nltk import pos_tag, word_tokenize

df = pd.read_csv("training_data.csv")
df.head()

Unnamed: 0,review_id,business_id,user_id,text,date,stars
0,3223,2055,2533,"Sometimes things happen, and when they do this...",2010/12/30,5
1,9938,4165,6371,I know Kerrie through my networking and we ben...,2011/4/26,5
2,7123,869,4929,Love their pizza!!!\r\nVery fresh. Their canno...,2012/9/28,5
3,3601,1603,2789,Being from NJ I am always on the prowl for my ...,2009/6/7,4
4,3948,2347,1245,We have tried this spot a few times and each v...,2011/2/20,4


<li>Write a function that turns all the comments into bags of words, keeping verbs and adjectives only.
<br>1. Take the "text" column of df (i.e. df.text) as input and tokenize each comment (nltk.word_tokenize()).
<br>2. Remove all stop words and punctuation (string.punctuation and nltk.corpus.stopwords.words('english')).
<br>3. POS-tag the remaining words and keep only the lemmatized verbs and lemmatized adjectives (nltk.pos_tag() and wnl.lemmatize()).
<br>4. Return a list of dictionaries, one dictionary per comment, e.g.
<br>[{'happen': 1, 'want': 1, 'take': 1, 'best': 1, 'nice': 1, 'find': 1},
 {'know': 1,
  'kerrie': 2,
  'benefit': 1,
  'need': 3,
  'plan': 1,
  'remind': 1,
1},
 {'love': 1, 'fresh': 1, 'good': 1, 'seem': 1, 'great': 1},
 {'hometown': 1,
  'italian': 1,
  'best': 1,
  'pizza': 1,
  'big': 1,}]
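
Step 3 relies on the Penn Treebank tags that nltk.pos_tag() returns: adjective tags start with "J" (JJ, JJR, JJS) and verb tags start with "V" (VB, VBD, VBG, ...). The sketch below isolates that decision so it can be tried without downloading any NLTK data. The helper name `penn_to_wordnet` and the hand-written tags are illustrative assumptions, not output from the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS letter expected by
    WordNetLemmatizer.lemmatize(), or None if the word should be dropped."""
    if tag.startswith("J"):   # adjectives
        return "a"
    if tag.startswith("V"):   # verbs
        return "v"
    return None               # nouns, pronouns, etc. are not kept

# Hand-written (word, tag) pairs standing in for pos_tag() output.
tagged = [("love", "VBP"), ("their", "PRP$"), ("fresh", "JJ"),
          ("pizza", "NN"), ("seemed", "VBD")]

# Keep only verbs and adjectives, paired with the POS letter for lemmatizing.
kept = [(word, penn_to_wordnet(tag)) for word, tag in tagged
        if penn_to_wordnet(tag) is not None]
print(kept)
```

In the real function, each kept pair would then be passed to wnl.lemmatize(word, pos).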

In [3]:
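wnl = WordNetLemmatizer()

def tokenize_document(lis_text):
    """Return one {word: count} dict per comment, keeping lemmatized verbs and adjectives."""
    # Build the stop-word set once instead of once per comment.
    stopword = set(stopwords.words('english'))
    output = list()
    for sentence in lis_text:
        # Replace every punctuation mark with " . " so it becomes a token we can drop.
        for punc in string.punctuation:
            sentence = sentence.replace(punc, " . ")
        words = nltk.word_tokenize(sentence)
        wordstosave = list()
        # Tag the full token sequence so the tagger sees each word in context.
        for word, pos in pos_tag(words):
            if "." not in word and word not in stopword:
                if pos.startswith("J"):    # adjective tags: JJ, JJR, JJS
                    wordstosave.append(wnl.lemmatize(word, "a"))
                elif pos.startswith("V"):  # verb tags: VB, VBD, VBG, ...
                    wordstosave.append(wnl.lemmatize(word, "v"))
        # Count how often each kept word occurs in this comment.
        wordbag = dict()
        for word in wordstosave:
            wordbag[word] = wordbag.get(word, 0) + 1
        output.append(wordbag)
    return output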
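
The filtering and counting part of tokenize_document can be checked on its own with the standard library. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny made-up list standing in for NLTK's English stop words, `bag_of_words` is a hypothetical helper, and POS tagging and lemmatization are skipped entirely.

```python
from collections import Counter
import string

# Hypothetical stand-in for stopwords.words('english').
STOP_WORDS = {"the", "is", "and", "it"}
PUNCTUATION = set(string.punctuation)

def bag_of_words(tokens):
    """Drop stop words and punctuation tokens, then count what is left."""
    kept = [t for t in tokens
            if t not in STOP_WORDS and t not in PUNCTUATION]
    return dict(Counter(kept))

tokens = ["need", "plan", ",", "and", "need", "remind", "!"]
bag = bag_of_words(tokens)
print(bag)
```

Counter does the same job as the manual dictionary update in the cell above; which one to use is a matter of taste.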

<li>Write a function that turns the bags of words into a numeric numpy array using the one-hot encoding method.
<br>1. Take the list returned by the function above as input.
<br>2. Create a Python set, called "features", containing every word that appears in any comment. ex: {"I", "have", "a", "dog", "cat"}
<br>3. Create a nested list, called mat, containing the count of each word in each comment.
<br>ex: [{"I":1, "have":1, "a":1, "dog":1}, {"I":1, "have":1, "a":1, "cat":1}] --> [[1,1,1,1,0], [1,1,1,0,1]]
<br>4. Pass the nested list to numpy.array() and return the "features" set and the array as the function results.

In [4]:
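def vectorize_mat(token_lis):
    # Collect the vocabulary: every word that appears in any comment.
    features = set()
    for lines in token_lis:
        features.update(lines)
    # Sort so that the column order is deterministic.
    features = sorted(features)
    # Map each word to its column once; list.index() would rescan the list on every lookup.
    col = {word: i for i, word in enumerate(features)}
    mat = list()
    for lines in token_lis:
        count = [0] * len(features)
        for word in lines:
            count[col[word]] = lines[word]
        mat.append(count)
    # Columns of the array follow sorted(features); the returned set itself is unordered.
    return set(features), np.array(mat)

features, one_hot_mat = vectorize_mat(tokenize_document(df.text))
one_hot_mat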

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
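
To see what the matrix looks like on something small, here is the two-comment example from the instructions worked through in plain Python. It is a sketch, not the notebook's function: `vectorize` and `vocab` are illustrative names, and lists are used instead of numpy. Because the vocabulary is sorted (uppercase letters sort before lowercase), the column order differs from the order shown in the instructions.

```python
def vectorize(bags):
    """Turn a list of {word: count} dicts into (sorted vocabulary, count rows)."""
    vocab = sorted({word for bag in bags for word in bag})
    column = {word: i for i, word in enumerate(vocab)}  # word -> column index
    rows = []
    for bag in bags:
        row = [0] * len(vocab)
        for word, count in bag.items():
            row[column[word]] = count
        rows.append(row)
    return vocab, rows

bags = [{"I": 1, "have": 1, "a": 1, "dog": 1},
        {"I": 1, "have": 1, "a": 1, "cat": 1}]
vocab, rows = vectorize(bags)
print(vocab)  # ['I', 'a', 'cat', 'dog', 'have']
print(rows)   # [[1, 1, 0, 1, 1], [1, 1, 1, 0, 1]]
```

Returning the vocabulary as a sorted list, rather than a set, keeps the link between each column and its word.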

Here, we choose the Multinomial Naive Bayes classifier as the model.
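
The model in the training cell is created with alpha = 0.5. As far as the scikit-learn documentation describes it, alpha is an additive smoothing term: each per-class word count gets alpha added to it, so a word never seen in a class still receives a small nonzero probability. The sketch below shows that arithmetic for a single class with made-up counts; `smoothed_word_probs` is a hypothetical helper, not part of scikit-learn.

```python
def smoothed_word_probs(word_counts, alpha=0.5):
    """Additive smoothing for one class:
    P(word | class) = (count + alpha) / (total + alpha * n_words)."""
    total = sum(word_counts)
    n_words = len(word_counts)
    return [(c + alpha) / (total + alpha * n_words) for c in word_counts]

# Made-up counts of three words within one star class.
probs = smoothed_word_probs([3, 1, 0], alpha=0.5)
print(probs)
```

The third word has a count of zero but still gets a positive probability, and the probabilities still sum to one.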

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [7]:
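# Split 80% / 20%; csr_matrix stores the mostly-zero count matrix compactly.
train_data, test_data, train_lab, test_lab = train_test_split(
    csr_matrix(one_hot_mat), df.stars, train_size = 0.8, random_state = 123)
MNB_model = MultinomialNB(alpha = 0.5)
MNB_model.fit(train_data, train_lab)
# Note: this is the standard deviation of the prediction errors (np.std),
# reported under the label "MSE"; it is not the mean squared error.
MSE = np.std(MNB_model.predict(test_data) - test_lab)
ACC = MNB_model.score(test_data, test_lab)
print(one_hot_mat.shape)
print("Under MNB model, MSE and accuracy are %.3f and %.3f, respectively." % (MSE, ACC) )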

(7997, 10173)
Under MNB model, MSE and accuracy are 1.057 and 0.475, respectively.
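
The value reported as MSE above is computed with np.std, which is the standard deviation of the prediction errors. That equals the root of the mean squared error only when the errors average to zero. The sketch below contrasts the two quantities on made-up predictions (not taken from the dataset), using statistics.pstdev as a stand-in for np.std with its default settings.

```python
from statistics import pstdev

def error_std(pred, true):
    """Population standard deviation of the errors (what np.std computes by default)."""
    return pstdev([p - t for p, t in zip(pred, true)])

def mean_squared_error(pred, true):
    """Average of the squared errors."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

# Every prediction is one star too high: the errors are all +1.
biased_pred, biased_true = [5, 5, 4, 3], [4, 4, 3, 2]
print(error_std(biased_pred, biased_true))           # 0.0
print(mean_squared_error(biased_pred, biased_true))  # 1.0

# Errors of +1, -1, 0, 0 average to zero.
pred, true = [5, 3, 4, 4], [4, 4, 4, 4]
print(mean_squared_error(pred, true))                # 0.5
```

A constant bias is invisible to the standard deviation but not to the mean squared error, so the two numbers should not be read interchangeably.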
