# Week 08: Phrase Classification
This week's assignment asks you to distinguish good phrases from bad phrases containing the word "**earn**" (e.g., earn money). You will use word2vec, the method introduced today, in the process.

The following data are provided for this assignment:
* train.tsv: Phrases with labels, used to train and validate the classification model. There are only two labels: 1 means *good*; 0 means *bad*.
* test.tsv: Same format as train.tsv. It is used to test your model.
* GoogleNews-vectors-negative300.bin.gz: a pre-trained word2vec model trained by Google ([source](https://code.google.com/archive/p/word2vec/))

## Requirements
* pandas
* tensorflow
* scikit-learn (`sklearn`)
* gensim
* imbalanced-learn (`imblearn`)

# Download word2vec data and training/testing data

In [23]:
!pip install googledrivedownloader
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1ekUZ1zGSs6UjM_jfmAOuzNCgcOyWdsyR',
                                    dest_path='./GoogleNews-vectors-negative300.bin.gz')



In [24]:
gdd.download_file_from_google_drive(file_id='1VAxq0DOAekM9DFVIvcig9bdwl4GEA6KD',
                                    dest_path='./test.tsv')
gdd.download_file_from_google_drive(file_id='1f6b8hmcfUFztzOhdKgnoCDpPewio1odl',
                                    dest_path='./train.tsv')

In [25]:
import gzip
import shutil
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in:
    with open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

## Read Data
We store the data in pandas DataFrames.

In [26]:
import pandas as pd

def loadData(path):
    ngram = []
    _class = []
    with open(path) as f:
        for line in f.readlines():
            line = line.strip("\n").split("\t")
            ngram.append(line[0])
            _class.append(int(line[1]))
    return pd.DataFrame({"phrase":ngram,"class":_class})
train = loadData("train.tsv")
print(train.head())
test = loadData("test.tsv")    
print(test.head())

                         phrase  class
0      earn a strong reputation      1
1  Marty will surely earn every      0
2             to earn between $      0
3          to earn some college      0
4        that earn rave reviews      0
                   phrase  class
0  degree earn 62 percent      0
1     earn maybe 30 or 50      0
2  earn the kind of money      1
3      earn his 14th save      1
4   earn a smaller amount      1
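
As an alternative sketch, the same `phrase<TAB>label` layout can be parsed with the standard `csv` module. The helper name `load_rows` and the two sample lines are hypothetical; `quoting=csv.QUOTE_NONE` is used on the assumption that quote characters inside phrases should be kept literally rather than treated as field delimiters.

```python
import csv
import io

def load_rows(handle):
    """Parse tab-separated `phrase<TAB>label` lines into (phrase, label) pairs."""
    rows = []
    # QUOTE_NONE keeps stray quote characters in a phrase as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        # Fail early on malformed lines instead of silently mislabeling them.
        if len(fields) != 2 or fields[1] not in {"0", "1"}:
            raise ValueError(f"line {line_no}: expected 'phrase<TAB>0|1', got {fields!r}")
        rows.append((fields[0], int(fields[1])))
    return rows

# Hypothetical two-line sample standing in for train.tsv.
sample = io.StringIO('earn a strong reputation\t1\nsay "earn" twice\t0\n')
rows = load_rows(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=["phrase", "class"])` would give the same column layout as `loadData`.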


## load word2vec model
<font color="red">**[ TODO ]**</font> Please load the [GoogleNews-vectors-negative300.bin.gz](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) model and check the embedding of the word `language`.

* The `gensim` package is a good choice.

In [27]:
from gensim.models import KeyedVectors

# Load the decompressed binary word2vec file (this is slow and needs several GB of RAM).
w2v_model = KeyedVectors.load_word2vec_format("./GoogleNews-vectors-negative300.bin", binary=True)

In [28]:
print(w2v_model['language'])

[ 2.30712891e-02  1.68457031e-02  1.54296875e-01  1.27929688e-01
 -2.67578125e-01  3.51562500e-02  1.19140625e-01  2.48046875e-01
  1.93359375e-01 -7.95898438e-02  1.46484375e-01 -1.43554688e-01
 -3.04687500e-01  3.46679688e-02 -1.85546875e-02  1.06933594e-01
 -1.52343750e-01  2.89062500e-01  2.35595703e-02 -3.80859375e-01
  1.09863281e-01  4.41406250e-01  3.75976562e-02 -1.22680664e-02
  1.62353516e-02 -2.24609375e-01  7.61718750e-02 -3.12500000e-02
 -2.16064453e-02  1.49414062e-01 -4.02832031e-02 -4.46777344e-02
 -1.72851562e-01  3.32031250e-02  1.50390625e-01 -5.05371094e-02
  2.72216797e-02  3.00781250e-01 -1.33789062e-01 -7.56835938e-02
  1.93359375e-01 -1.98242188e-01 -1.27563477e-02  4.19921875e-01
 -2.19726562e-01  1.44531250e-01 -3.93066406e-02  1.94335938e-01
 -3.12500000e-01  1.84570312e-01  1.48773193e-04 -1.67968750e-01
 -7.37304688e-02 -3.12500000e-02  1.57226562e-01  3.30078125e-01
 -1.42578125e-01 -3.16406250e-01 -7.32421875e-02 -5.76171875e-02
  1.02050781e-01 -1.08886

<font color="green">Expected output: </font>

>  <font face='monospace' size=3>\[&nbsp;2.30712891e-02&nbsp;&nbsp;1.68457031e-02&nbsp;&nbsp;1.54296875e-01&nbsp; 1.27929688e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.67578125e-01&nbsp;&nbsp;3.51562500e-02&nbsp;&nbsp;1.19140625e-01&nbsp; 2.48046875e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.93359375e-01&nbsp;-7.95898438e-02&nbsp;&nbsp;1.46484375e-01&nbsp;-1.43554688e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-3.04687500e-01&nbsp;&nbsp;3.46679688e-02&nbsp;-1.85546875e-02&nbsp; 1.06933594e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.52343750e-01&nbsp;&nbsp;2.89062500e-01&nbsp;&nbsp;2.35595703e-02&nbsp;-3.80859375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.09863281e-01&nbsp;&nbsp;4.41406250e-01&nbsp;&nbsp;3.75976562e-02&nbsp;-1.22680664e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.62353516e-02&nbsp;-2.24609375e-01&nbsp;&nbsp;7.61718750e-02&nbsp;-3.12500000e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.16064453e-02&nbsp;&nbsp;1.49414062e-01&nbsp;-4.02832031e-02&nbsp;-4.46777344e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.72851562e-01&nbsp;&nbsp;3.32031250e-02&nbsp;&nbsp;1.50390625e-01&nbsp;-5.05371094e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.72216797e-02&nbsp;&nbsp;3.00781250e-01&nbsp;-1.33789062e-01&nbsp;-7.56835938e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.93359375e-01&nbsp;-1.98242188e-01&nbsp;-1.27563477e-02&nbsp; 4.19921875e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.19726562e-01&nbsp;&nbsp;1.44531250e-01&nbsp;-3.93066406e-02&nbsp; 1.94335938e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-3.12500000e-01&nbsp;&nbsp;1.84570312e-01&nbsp;&nbsp;1.48773193e-04&nbsp;-1.67968750e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-7.37304688e-02&nbsp;-3.12500000e-02&nbsp;&nbsp;1.57226562e-01&nbsp; 3.30078125e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.42578125e-01&nbsp;-3.16406250e-01&nbsp;-7.32421875e-02&nbsp;-5.76171875e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.02050781e-01&nbsp;-1.08886719e-01&nbsp;&nbsp;1.24023438e-01&nbsp;-2.50244141e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.49023438e-01&nbsp;&nbsp;1.25976562e-01&nbsp;-1.79687500e-01&nbsp; 3.32031250e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;7.14111328e-03&nbsp;&nbsp;2.51953125e-01&nbsp;&nbsp;4.34570312e-02&nbsp;-4.34570312e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-3.90625000e-01&nbsp;&nbsp;1.76757812e-01&nbsp;-1.13525391e-02&nbsp;-1.97753906e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.79296875e-01&nbsp;&nbsp;2.36328125e-01&nbsp;&nbsp;1.19140625e-01&nbsp; 5.59082031e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.73828125e-01&nbsp;-1.10839844e-01&nbsp;-4.95605469e-02&nbsp; 2.13867188e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;6.17675781e-02&nbsp;&nbsp;1.38671875e-01&nbsp;-4.45556641e-03&nbsp; 2.55859375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.80664062e-01&nbsp;&nbsp;5.88378906e-02&nbsp;-6.59179688e-02&nbsp;-2.08007812e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.19140625e-01&nbsp;-1.57226562e-01&nbsp;&nbsp;5.02929688e-02&nbsp;-6.29882812e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;5.00488281e-02&nbsp;-7.27539062e-02&nbsp;&nbsp;1.74560547e-02&nbsp;-3.56445312e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.93359375e-01&nbsp;&nbsp;3.93066406e-02&nbsp;-3.36914062e-02&nbsp;-1.07421875e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;5.78613281e-02&nbsp;-8.20312500e-02&nbsp;&nbsp;1.74560547e-02&nbsp;-1.65039062e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.46484375e-01&nbsp;-3.08837891e-02&nbsp;-3.86718750e-01&nbsp; 2.49023438e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;8.74023438e-02&nbsp;-2.15820312e-01&nbsp;-4.10156250e-02&nbsp; 1.60156250e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.85546875e-01&nbsp;-2.27050781e-02&nbsp;-3.73535156e-02&nbsp; 7.86132812e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.46484375e-01&nbsp;&nbsp;6.78710938e-02&nbsp;&nbsp;1.26953125e-01&nbsp; 3.30078125e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.11328125e-01&nbsp;&nbsp;9.27734375e-02&nbsp;-3.45703125e-01&nbsp;-1.41601562e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-5.29785156e-02&nbsp;-1.50390625e-01&nbsp;-7.81250000e-02&nbsp;-1.27929688e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-4.02343750e-01&nbsp;-1.41601562e-01&nbsp;&nbsp;8.44726562e-02&nbsp; 1.08398438e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-4.44335938e-02&nbsp;&nbsp;3.73535156e-02&nbsp;&nbsp;5.61523438e-02&nbsp;-1.91406250e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.54296875e-01&nbsp;-5.12695312e-02&nbsp;-6.49414062e-02&nbsp;-8.30078125e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;7.17773438e-02&nbsp;-1.33789062e-01&nbsp;&nbsp;1.05468750e-01&nbsp; 3.33984375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.08398438e-01&nbsp;&nbsp;1.91650391e-02&nbsp;&nbsp;2.14843750e-01&nbsp; 2.15820312e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.05468750e-01&nbsp;-1.44531250e-01&nbsp;&nbsp;4.32128906e-02&nbsp;-2.71484375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-3.78906250e-01&nbsp;&nbsp;1.09863281e-01&nbsp;-8.15429688e-02&nbsp;-6.12792969e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.33789062e-01&nbsp;&nbsp;9.71679688e-02&nbsp;-1.04370117e-02&nbsp;-1.21093750e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.44140625e-01&nbsp;&nbsp;1.02050781e-01&nbsp;&nbsp;1.10839844e-01&nbsp;-1.00585938e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.71875000e-01&nbsp;-3.61328125e-02&nbsp;-4.39453125e-02&nbsp; 2.83203125e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-8.93554688e-02&nbsp;-1.70898438e-01&nbsp;&nbsp;2.46093750e-01&nbsp; 1.16699219e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;8.39843750e-02&nbsp;-1.32812500e-01&nbsp;-1.61132812e-01&nbsp;-1.39648438e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-8.59375000e-02&nbsp;-1.37695312e-01&nbsp;-9.32617188e-02&nbsp;-1.33789062e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.65039062e-01&nbsp;&nbsp;4.93164062e-02&nbsp;-1.21093750e-01&nbsp;-2.11914062e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;1.61132812e-01&nbsp;-1.07421875e-01&nbsp;-3.97949219e-02&nbsp;-3.51562500e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-5.02929688e-02&nbsp;&nbsp;1.46484375e-01&nbsp;-4.68750000e-02&nbsp; 4.17480469e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.27929688e-01&nbsp;-9.76562500e-02&nbsp;-2.46093750e-01&nbsp; 6.78710938e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.30468750e-01&nbsp;&nbsp;1.80664062e-02&nbsp;&nbsp;3.54003906e-02&nbsp; 7.32421875e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.23632812e-01&nbsp;-1.25976562e-01&nbsp;&nbsp;2.12890625e-01&nbsp;-3.93066406e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.41699219e-02&nbsp;-9.61914062e-02&nbsp;&nbsp;7.51953125e-02&nbsp;-1.46484375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.49414062e-01&nbsp;-8.83789062e-02&nbsp;-4.88281250e-02&nbsp; 2.32421875e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;3.30078125e-01&nbsp;&nbsp;1.59179688e-01&nbsp;-2.35351562e-01&nbsp;-1.25976562e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.68554688e-02&nbsp;-5.29785156e-02&nbsp;-6.59179688e-02&nbsp;-2.17773438e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-6.37817383e-03&nbsp;-2.53906250e-01&nbsp;&nbsp;2.28515625e-01&nbsp; 4.93164062e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;3.54003906e-02&nbsp;&nbsp;1.66992188e-01&nbsp;-7.27539062e-02&nbsp;-2.53906250e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.34765625e-01&nbsp;&nbsp;3.69140625e-01&nbsp;&nbsp;1.83593750e-01&nbsp;-1.64062500e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.26562500e-01&nbsp;-8.88671875e-02&nbsp;&nbsp;3.69140625e-01&nbsp; 5.54199219e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-3.63769531e-02&nbsp;-1.48437500e-01&nbsp;&nbsp;9.13085938e-02&nbsp; 2.47955322e-04<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.67578125e-01&nbsp;-1.63085938e-01&nbsp;&nbsp;1.19628906e-01&nbsp; 2.77343750e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.49414062e-01&nbsp;&nbsp;1.33789062e-01&nbsp;-8.25195312e-02&nbsp;-1.74804688e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.77734375e-01&nbsp;&nbsp;2.06054688e-01&nbsp;&nbsp;5.07812500e-02&nbsp;-2.08007812e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.74804688e-01&nbsp;&nbsp;9.66796875e-02&nbsp;&nbsp;6.98242188e-02&nbsp;-5.79833984e-04<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;9.22851562e-02&nbsp;&nbsp;7.95898438e-02&nbsp;&nbsp;1.41601562e-01&nbsp; 8.72802734e-03<br> </font>
>  <font face='monospace' size=3>&nbsp;-8.05664062e-02&nbsp;&nbsp;4.80957031e-02&nbsp;&nbsp;2.49023438e-01&nbsp;-1.64062500e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-4.66308594e-02&nbsp;-2.81250000e-01&nbsp;-1.66015625e-01&nbsp;-2.22656250e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-2.32421875e-01&nbsp;&nbsp;1.32812500e-01&nbsp;&nbsp;4.15039062e-02&nbsp; 1.15234375e-01<br> </font>
>  <font face='monospace' size=3>&nbsp;-7.66601562e-02&nbsp;-1.10839844e-01&nbsp;-1.97265625e-01&nbsp; 3.06396484e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.03515625e-01&nbsp;&nbsp;2.49023438e-02&nbsp;-2.52685547e-02&nbsp; 3.39355469e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;4.29687500e-02&nbsp;-1.44531250e-01&nbsp;&nbsp;2.12402344e-02&nbsp; 2.28271484e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.88476562e-01&nbsp;&nbsp;3.22265625e-01&nbsp;-1.13281250e-01&nbsp;-7.61718750e-02<br> </font>
>  <font face='monospace' size=3>&nbsp;&nbsp;2.94921875e-01&nbsp;-1.33789062e-01&nbsp;-1.80664062e-02&nbsp;-6.25610352e-03<br> </font>
>  <font face='monospace' size=3>&nbsp;-1.62353516e-02&nbsp;&nbsp;5.98144531e-02&nbsp;&nbsp;1.21582031e-01&nbsp; 4.17480469e-02\] </font>

## Preprocessing
Preprocess the two TSV files here.

#### adjust the ratio of the two classes of training data
In the training data, the ratio of good phrases to bad phrases is about 1 to 32 (6,105 good versus 193,493 bad). Such an imbalance tends to hurt classifier training, so we need to adjust the ratio. Removing bad phrases (undersampling) and adding good phrases (oversampling) are both common approaches.

<font color="red">**[ TODO ]**</font> Please adjust the ratio of good phrases to bad phrases in whatever way you think is best, and output the number of phrases in each class for the demo.

You need to explain why you chose this ratio and how you achieved it.

In [31]:
from collections import Counter
print(f"Training target statistics: {Counter(train['class'])}")

Training target statistics: Counter({0: 193493, 1: 6105})
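
One way to reduce the number of bad phrases is random undersampling. The following is a minimal sketch under stated assumptions, not the required solution: the function name `undersample`, the `bad_per_good` parameter, and the toy labels are all hypothetical.

```python
import random
from collections import Counter

def undersample(labels, bad_per_good, seed=42):
    """Return sorted row indices that keep every good row (label 1) and
    at most `bad_per_good` bad rows (label 0) per good row."""
    good = [i for i, y in enumerate(labels) if y == 1]
    bad = [i for i, y in enumerate(labels) if y == 0]
    n_bad = min(len(bad), bad_per_good * len(good))
    rng = random.Random(seed)           # fixed seed for a reproducible subset
    kept_bad = rng.sample(bad, n_bad)   # sample bad rows without replacement
    return sorted(good + kept_bad)

# Toy labels (hypothetical): 30 bad phrases and 2 good phrases.
labels = [0] * 30 + [1] * 2
idx = undersample(labels, bad_per_good=3)
print(Counter(labels[i] for i in idx))
```

With the real data, `train.iloc[idx]` would select the kept rows. Undersampling discards training phrases, so it trades data volume for balance; oversampling makes the opposite trade.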


#### number the words
Assign each word a unique number.

In [32]:
from tensorflow.keras.preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(pd.concat([train,test],ignore_index=True)['phrase'])
vocab_size = len(tok.word_index) + 1

#### convert phrases into numbers
Models cannot read words directly, so we convert each phrase into a sequence of numbers.

The numbers must come from the same tokenizer that was fitted in the previous step.

In [33]:
train_encoded_phrase = tok.texts_to_sequences(train['phrase'])
test_encoded_phrase = tok.texts_to_sequences(test['phrase'])
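
To make the two steps above concrete, here is a simplified pure-Python sketch of the idea behind a frequency-ranked word index and the phrase-to-sequence conversion. It is an approximation, not the Keras implementation: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation by default. The function names and sample phrases are hypothetical.

```python
from collections import Counter

def build_word_index(phrases):
    """Map each word to an integer, most frequent word first, starting at 1.
    Index 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for p in phrases for word in p.lower().split())
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(phrases, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in p.lower().split() if w in word_index]
            for p in phrases]

phrases = ["to earn money", "earn a living", "to earn respect"]
word_index = build_word_index(phrases)
sequences = to_sequences(phrases, word_index)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0
print(word_index)
print(sequences)
```

Because "earn" appears in every phrase it receives index 1, which is consistent with the `1` that shows up in every row of the padded training output.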

#### padding
Make all phrases the same length. The longest phrases in the two TSV files have five tokens, so phrases with fewer than five tokens are padded to length five by appending 0.

In [34]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_ngram = 5
X_train= pad_sequences(train_encoded_phrase, maxlen=max_ngram, padding='post')
X_test= pad_sequences(test_encoded_phrase, maxlen=max_ngram, padding='post')
print(X_train[:5])

[[   1    3 1970  253    0]
 [7468   10 1971    1  122]
 [   2    1  119    0    0]
 [   2    1   43   64    0]
 [  23    1 1760 1507    0]]
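
The effect of `padding='post'` can be sketched in plain Python. The helper `pad_post` is hypothetical; it mirrors the call above under the assumption that the default `truncating='pre'` is in effect, so a sequence longer than `maxlen` loses tokens from its front.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` until each sequence has `maxlen` items.
    Longer sequences keep only their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # truncate from the front
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return padded

# The first two rows reuse encoded phrases from the output above;
# the third row is a made-up sequence that is too long.
padded = pad_post([[1, 3, 1970, 253], [2, 1, 119], [9, 8, 7, 6, 5, 4]], maxlen=5)
print(padded)
```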


#### one-hot encode the labels

In [35]:
from tensorflow.keras.utils import to_categorical
y_train=to_categorical(train['class'])
y_test=to_categorical(test['class'])
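
As a small illustration of what one-hot encoding does to the labels, the sketch below builds the same kind of array with NumPy and converts it back with `argmax`. The sample labels are hypothetical. Which form the model should receive depends on the loss: `categorical_crossentropy` expects one-hot rows, while `sparse_categorical_crossentropy` expects integer class labels.

```python
import numpy as np

labels = np.array([1, 0, 0, 1])               # hypothetical integer labels
one_hot = np.eye(2, dtype="float32")[labels]  # row k of the identity is the code for class k
recovered = one_hot.argmax(axis=1)            # back to integer labels

print(one_hot)
print(recovered)
```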

In [36]:
from imblearn.over_sampling import RandomOverSampler
import numpy
over_sampler = RandomOverSampler(random_state=42)
x_res, y_res = over_sampler.fit_resample(X_train, y_train )
unique, counts = numpy.unique(y_res, return_counts=True)
print(*zip(unique, counts))

(0, 193493) (1, 193493)




#### split training data into train and validation

In [37]:
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(x_res,y_res,test_size=0.20,random_state=42)
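
One design point to consider: when oversampling happens before the split, copies of the same good phrase can land in both the training and the validation set, which may make validation scores look better than they are. The sketch below shows the alternative order (split first, then oversample only the training part) using NumPy alone. The function `split_then_oversample` and the toy arrays are hypothetical, and `y` is assumed to hold integer labels rather than one-hot rows.

```python
import numpy as np

def split_then_oversample(X, y, val_fraction=0.2, seed=42):
    """Hold out a validation set, then balance only the training rows
    by resampling smaller classes with replacement."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(round(val_fraction * len(y)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    X_tr, y_tr = X[train_idx], y[train_idx]

    classes, counts = np.unique(y_tr, return_counts=True)
    target = counts.max()
    parts = []
    for cls, n in zip(classes, counts):
        members = np.flatnonzero(y_tr == cls)
        # Draw extra rows with replacement until the class reaches `target`.
        extra = rng.choice(members, size=target - n, replace=True)
        parts.append(np.concatenate([members, extra]))
    keep = rng.permutation(np.concatenate(parts))
    return X_tr[keep], y_tr[keep], X[val_idx], y[val_idx]

# Toy data: 90 bad rows and 10 good rows, each row with a unique first column.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
X_tr, y_tr, X_val, y_val = split_then_oversample(X, y)
print(np.unique(y_tr, return_counts=True))
```

The validation set keeps the original class ratio here, so its accuracy is closer to what the untouched test set would show.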

#### creating the embedding matrix
The embedding matrix is used by the classification model. It should be a two-dimensional array (a list of lists) in which row *i* is the embedding vector of the word whose tokenizer index is *i*, so the row order must match the *tokenizer*. The tokenizer stores its word-to-index mapping in a dictionary, which you can inspect with `tok.word_index.items()`.

<font color="red">**[ TODO ]**</font> Build the embedding matrix. If your classification model does not need it, you can skip this step. We won't check it during the demo.

In [53]:
num_tokens = vocab_size
embedding_dim = 300
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = numpy.zeros((num_tokens, embedding_dim))
for word, i in tok.word_index.items():
    if word in w2v_model:
        embedding_matrix[i] = w2v_model[word]
        hits+=1
    else:
        misses+=1 
print(hits)
print(misses)     
embedding_matrix

7446
1702


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.02416992, -0.12695312, -0.359375  , ..., -0.203125  ,
         0.23828125, -0.15332031],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.16796875, -0.07373047, -0.24707031, ...,  0.03613281,
        -0.02661133,  0.0324707 ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])

## Classification

#### build model
<font color="red">**[ TODO ]**</font> Please build your classification model with ***keras*** here. 

You **must** use the pre-trained word2vec model to represent the words of phrases.

In [57]:
import tensorflow as tf
from tensorflow import keras  # use the Keras bundled with TensorFlow, matching the imports below
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding


In [58]:
model = Sequential()
# Frozen embedding layer initialized with the pre-trained word2vec vectors.
model.add(Embedding(vocab_size, embedding_dim , embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                    trainable=False, input_length=X_train.shape[1]))
model.add(LSTM(32))
model.add(Dense(2, activation='softmax'))
# sparse_categorical_crossentropy expects integer class labels (0 or 1), not one-hot vectors.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


In [59]:
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 5, 300)            2744700   
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                42624     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 66        
Total params: 2,787,390
Trainable params: 42,690
Non-trainable params: 2,744,700
_________________________________________________________________
None
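
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types; the helper names are hypothetical, and `vocab_size` is reconstructed from the hit and miss counts printed earlier plus the reserved index 0.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs

vocab_size = 7446 + 1702 + 1   # hits + misses + padding index 0
emb = embedding_params(vocab_size, 300)
lstm = lstm_params(300, 32)
dense = dense_params(32, 2)
print(emb, lstm, dense, emb + lstm + dense)
```

Only the LSTM and Dense weights are trainable because the embedding layer was created with `trainable=False`.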


#### train
Train the classification model here.

<font color="red">**[ TODO ]**</font> Adjust the hyperparameters to optimize the validation accuracy and validation loss.

* The higher the validation accuracy, the better; the lower the validation loss, the better.
* The **number of epochs** and the **batch size** are the most important.
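
To get a feel for how the batch size changes the amount of work per epoch, the arithmetic below counts weight updates (steps) per epoch for a few batch sizes. It assumes the oversampled set of 2 × 193,493 rows shown earlier and a 20% validation split whose size is rounded up; the candidate batch sizes other than 32 are arbitrary examples.

```python
import math

n_resampled = 2 * 193493                  # both classes after oversampling
n_val = math.ceil(0.20 * n_resampled)     # rows held out for validation
n_train = n_resampled - n_val

steps_per_epoch = {}
for batch_size in (32, 128, 512):
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch[batch_size] = math.ceil(n_train / batch_size)

print(n_train, n_val)
print(steps_per_epoch)
```

Larger batches mean fewer updates per epoch, so they usually run faster per epoch but may need more epochs to reach the same validation loss.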

In [60]:
model.fit(X_train, y_train, validation_data=(X_val,y_val), epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3c5ba83490>

#### test

<font color="red">**[ TODO ]**</font> Test your model on test.tsv and output the accuracy. Your accuracy needs to beat the baseline: **0.97**.

In [61]:
accuracy = model.evaluate(X_test, numpy.argmax(y_test,axis=1))
print(accuracy[1])

0.9850000143051147
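
Because bad phrases heavily outnumber good ones, accuracy alone can hide poor performance on the good class. The sketch below computes accuracy together with precision and recall for label 1 in plain Python. The function `binary_report` and the toy labels are hypothetical and are not results from this notebook.

```python
def binary_report(y_true, y_pred):
    """Accuracy plus precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy example: a model that always predicts "bad" still scores 0.8 accuracy.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
report = binary_report(y_true, y_pred)
print(report)
```

With the real model, `numpy.argmax(model.predict(X_test), axis=1)` would supply `y_pred` and `numpy.argmax(y_test, axis=1)` would supply `y_true`.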


## Show wrong prediction results
Inspecting the wrong predictions may help you improve your model.

<font color="red">**[ TODO ]**</font> Show the wrong predictions like this: 

<img src="https://imgur.com/BOTMyZH.jpg" width=30%><br>

In [105]:
# Columns are printed in this order: phrase, predicted class, true label.
print("{:<30} {:<5} {:<5}".format("ngram","pred","label"))
# Predict the whole test set in one batched call instead of one call per phrase.
predictions = numpy.argmax(model.predict(X_test), axis=1)
labels = numpy.argmax(y_test, axis=1)
for phrase, predicted, label in zip(test['phrase'], predictions, labels):
    if predicted != label:
        print("{:<30} {:<5} {:<5}".format(phrase, predicted, label))

ngram                          pred  label
earn a playoff berth           0     1    
- earn - money -               1     0    
earn the money ?               1     0    
earn your Masters Degree       1     0    
earn their money ;             1     0    
earn 1 comp point              0     1    
earn a roster spot             0     1    
earn a very good livelihood    1     0    
earn your commission !         1     0    
earn victory                   0     1    
- earn affiliate income        1     0    
more you earn . "              1     0    
earn commission                0     1    
] " earn money                 1     0    
earn a standard diploma        0     1    
what they can earn ,           1     0    
earn residual income           0     1    
, earn on average              1     0    
Family Meeting can earn        0     1    
, earn or get money            1     0    
earn an NCAA berth             0     1    
what I earn ,                  1     0    
earn an i

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=807282025) to reserve a demo time.  
The score is given only after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.

## Learning Resources
[Deep Learning with Python](https://tanthiamhuat.files.wordpress.com/2018/03/deeplearningwithpython.pdf)

[Classification on IMDB](https://keras.io/examples/nlp/bidirectional_lstm_imdb/)