# Assignment 2: CNN

#### Option 1: Paper Reading
- Pick one of the following papers, understand its key ideas, and replicate the experiments using the datasets described in the paper or your own dataset.
    - Convolutional Neural Network Architectures for Matching Natural Language Sentences, https://arxiv.org/pdf/1503.03244.pdf
    - Matching Networks for One Shot Learning , https://arxiv.org/pdf/1606.04080v2.pdf
    - Prototypical Networks for Few-shot Learning, https://arxiv.org/pdf/1703.05175v2.pdf
- Requirements:
    - Write a report summarizing the key idea, your experiment results, and your analysis and critique
    - Give a 5-10 min oral presentation in class

#### Option 2: Siamese CNN 

In this Assignment, let's use CNN to detect duplicate sentences. Two datasets have been prepared for you: train.csv and test.csv. Both files are in the following format

|question1 | question2 |is_duplicate|
|------|------|-------|
|How do you take a screenshot on a Mac laptop?|  How do I take a screenshot on my MacBook Pro? ...|   1 |
|Is the US election rigged?|  Was the US election rigged?|   1 |
|How scary is it to drive on the road to Hana g...|  Do I need a four-wheel-drive car to drive all ...	|  0  |
|...|...| ...|

Follow the instructions below to clasify the sentence pairs in the training dataset and then test the model using test.csv.

#### 1. Define a **class** called "text_processor" to preporcess text:
- first create **"\_\_init\_\_"** function:
    - set the *class attributes*: MAX_SEN_LEN (max sentence length) and MAX_WORDS (max number of words in corpus). You'll need to explore the dataset to set these two parameters properly
    - initialize a tokenizer with parameter num_words = MAX_WORDS and set the tokenizer object as a *class object*
    - fit the tokenizer using the training sentence pairs (i.e. method "*fit_on_texts*")
- create a function **"generate_seq"** which does the following:
    - take a list of sentences as an input 
    - generates padded sequences from the sentences using the class tokenizer object define above
    - retrun the padding sequences

In [137]:
# Add import
from keras.layers import Embedding, Dense, Conv1D, MaxPooling1D, Dropout, Activation, Input, Flatten, Concatenate
import pandas as pd
import nltk,string
from gensim import corpora
from keras.models import Model


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
 

# fix random number
from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(231)

In [138]:
# Define text_preprocessor class
class text_preprocessor(object):
    
    # define __init__ function    
    def __init__(self, max_sen_len, max_words, docs):
        
        # add your code here
        # set sentence/document length
        self.max_sen_len = max_sen_len
        
        # set the maximum number of words to be used
        self.max_words = max_words
        
        #set data to self.doc
        self.docs = docs
        #print(self.docs.head()) #success
        
        
        
    # define generate_seq function 
    def generate_seq(self, docs):
        
        sequences = None
        
        # add your code here
        # convert each document to a list of word index as a sequence
        # get a Keras tokenizer
        # https://keras.io/preprocessing/text/
        
        tokenizer = Tokenizer(num_words=self.max_words)
        #Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. 
        #So if you give it something like, "The cat sat on the mat." It will create a dictionary s.t. word_index["the"] = 1; word_index["cat"] = 2 
        #it is word -> index dictionary so every word gets a unique integer value. 0 is reserved for padding. 
        #So lower integer means more frequent word (often the first few are stop words because they appear a lot).
        
        tokenizer.fit_on_texts(self.docs)
        #Transforms each text in texts to a sequence of integers. So it basically takes each word in the text 
        #and replaces it with its corresponding integer value from the word_index dictionary. 
        #Nothing more, nothing less, certainly no magic involved.
        
        sequences = tokenizer.texts_to_sequences(self.docs)
        
        #this is final result, it's a 
        padded_sequences = pad_sequences(sequences, \
                                 maxlen=self.max_sen_len, \
                                 padding='post', \
                                 truncating='post')

        word_index = tokenizer.word_index
        #print('Number of Unique Tokens',len(word_index))
        #print(padded_sequences[24])
        #print(padded_sequences.shape)
        #print(type(padded_sequences))
        
        return padded_sequences
    

105
0          Android phone is best up to range of 15000?
1    I forgot my Gmail username and have no access ...
2    What were the major effects of the cambodia ea...
3    What are the best ways to clean a Jansport bac...
4                   How can I download Arrow season 5?
Name: question1, dtype: object
Number of Unique Tokens 12011
[  13  500   40   11  178    1  266    9 5694 3801 2924    6   32   82
 2388 2925  281    1 1999    9   82  433    6   32  130 2925    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0]


#### 2. Create a function "cnn_model" to define a CNN model as follows:
- take parameters: FILTER_SIZERS (a list of Conv1D filter sizes), NUM_FILTERS (the number of filters), MAX_WORDS, MAX_SEN_LEN, and EMBEDDING_DIM (dimision of word vectors)
- define a CNN model with **Conv1D** using the specifified FILTER_SIZERS and NUM_FILTERS. For example, if FILTER_SIZERS=[1,2,3] and NUM_FILTERS=64, your model may look like the figure below.
- return this CNN model

<img src='04_Images/12_cnn_model.png' width='50%'>

In [236]:
# define CNN model
def cnn_model(EMBEDDING_DIM, \
              # word vector dimension
              FILTER_SIZES,\
              # filter sizes as a list
              MAX_WORDS, \
              # total number of words
              MAX_SEN_LEN, \
              # max words in a doc
              #NAME = 'cnn',\
              NUM_FILTERS\
              #add input layer from outside
              #outside_input
              ):            
    
    model = None
    
    # add your code here
    # define input layer, where a sentence represented as
    # 1 dimension array with integers
    main_input = Input(shape=(MAX_SEN_LEN,), dtype='int32', name='main_input')
    
    # define the embedding layer
    # input_dim is the size of all words +1,because we will use wide convolution, we add zero before first input layer
    # where 1 is for the padding symbol
    # output_dim is the word vector dimension
    # input_length is the max. length of a document
    # input to embedding layer is the "main_input" layer
    embed_1 = Embedding(input_dim=MAX_WORDS+1, \
                    output_dim=EMBEDDING_DIM, \
                    input_length=MAX_SEN_LEN,\
                    name='embedding')(main_input)  
    

    # define 1D convolution layer
    # 64 filters are used
    # a filter slides through each word (kernel_size=1)
    # input to this layer is the embedding layer
    conv1d_1= Conv1D(filters=NUM_FILTERS, kernel_size=FILTER_SIZES[0], \
                     name='conv_unigram',\
                     activation='relu')(embed_1)

    # define a 1-dimension MaxPooling 
    # to take the output of the previous convolution layer
    # the convolution layer produce 
    # MAX_SEN_LEN-1+1 values as ouput (???)
    pool_1 = MaxPooling1D(MAX_SEN_LEN-1+1, \
                          name='pool_unigram')(conv1d_1)

    # The pooling layer creates output 
    # in the size of (# of sample, 1, 64)  
    # remove one dimension since the size is 1
    flat_1 = Flatten(name='flat_unigram')(pool_1)

    
    #***********************************************************************************#
    
    
    # following the same logic to define 
    # filters for bigram
    conv1d_2= Conv1D(filters=NUM_FILTERS, kernel_size=FILTER_SIZES[1], \
                     name='conv_bigram',\
                     activation='relu')(embed_1)
    pool_2 = MaxPooling1D(MAX_SEN_LEN-2+1, name='pool_bigram')(conv1d_2)
    flat_2 = Flatten(name='flat_bigram')(pool_2)

    
    
    #***********************************************************************************#
        
    # filters for trigram
    conv1d_3= Conv1D(filters=NUM_FILTERS, kernel_size=FILTER_SIZES[2], \
                     name='conv_trigram',activation='relu')(embed_1)
    pool_3 = MaxPooling1D(MAX_SEN_LEN-3+1, name='pool_trigram')(conv1d_3)
    flat_3 = Flatten(name='flat_trigram')(pool_3)

    # Concatenate flattened output
    z=Concatenate(name='concate')([flat_1, flat_2, flat_3])

    # create the model with input layer
    # and the output layer
    model = Model(inputs=main_input, outputs=z)
    #model.summary()
    
    return model

#### 3. Define three architecutres as described below. You may add appropriate regularizers in each model:
- Model A: Use one CNN described above to process each question **without any parameter sharing**. Concatenate features extracted from both CNNs and then use a dense layer to predict the output
- Model B: Use a **shared** CNN to process both questions. Concatenate features extracted from the CNN and then use a dense layer to predict the output
- Model C: Use a shared CNN to process both questions. Then take the **absolute difference** between the feature vectors extracted from the CNN (hint, use keras "Lambda" layer), and connect the difference to a dense layer to predict the output
- Model D (**Bonus**): You can come up with your own architecture as long as it can outperform the above models


##### Reference architectures

| Model A | Model B   | Model C |
|:------:|:------:|:---------:|
|   <img src="04_Images/09_model_a.png"/>| <img src="04_Images/10_model_b.png" />| <img src="04_Images/11_model_c.png"/> |



In [257]:
#test for model

FILTER_SIZES = [1, 2, 3]

#how manny feature maps to extract
NUM_FILTERS = 64

#I think this parameter is saame as MAX_SEN_LEN
#MAX_DOC_LEN = 500

#for these two column sentence, the lengthist is 105 words in single sentence
#MAX_SEN_LEN = len(df['question1'].max())
MAX_SEN_LEN = 105

#this is definnation of tokenzier max token value, smaller and less feature, and greater value will get more features
#typical we setup 10000 to adapat requirement of word feature extract
MAX_WORDS = 10000

#the output dimention of embedding input layers 
EMBEDDING_DIM = 200

BATCH_SIZE = 64

NUM_EPOCHES = 10
a = model_A(EMBEDDING_DIM, FILTER_SIZES,NUM_FILTERS,MAX_WORDS,MAX_SEN_LEN)

a.fit()

<class 'tensorflow.python.framework.ops.Tensor'>
<class 'tensorflow.python.framework.ops.Tensor'>
Model: "model_58"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
q1_input (InputLayer)           (None, 105)          0                                            
__________________________________________________________________________________________________
q2_input (InputLayer)           (None, 105)          0                                            
__________________________________________________________________________________________________
model_56 (Model)                (None, 192)          2077192     q1_input[0][0]                   
__________________________________________________________________________________________________
model_57 (Model)                (None, 192)          2077192     q2_input[0][0]             

<keras.engine.training.Model at 0x25a65c66668>

In [259]:
# define Model A
def model_A(EMBEDDING_DIM, FILTER_SIZES,NUM_FILTERS,MAX_WORDS,MAX_SEN_LEN):
    
    
    model = None
    
    # add your code here
    q1_input = Input(shape=(MAX_SEN_LEN,), dtype='int32', name='q1_input')
    q2_input = Input(shape=(MAX_SEN_LEN,), dtype='int32', name='q2_input')
    
    
    left_cnn = cnn_model(EMBEDDING_DIM, FILTER_SIZES,MAX_WORDS, MAX_SEN_LEN, NUM_FILTERS)(q1_input)
    print(type(left_cnn))
    right_cnn = cnn_model(EMBEDDING_DIM, FILTER_SIZES,MAX_WORDS, MAX_SEN_LEN, NUM_FILTERS)(q2_input)
    print(type(right_cnn))
    con_layer=Concatenate(name='concate')([left_cnn, right_cnn])
    
    # Create a dropout layer
    # In each iteration only 50% units are turned on
    drop_1=Dropout(rate=0.5, name='dropout')(con_layer)

    # Create a dense layer
    dense_1 = Dense(192, activation='relu', name='dense')(drop_1)
    # Create the output layer
    preds = Dense(1, activation='sigmoid', name='output')(dense_1)
    model = Model(inputs=[q1_input,q2_input], outputs=preds)
    model.summary()
    
    
    return model

In [163]:
# define Model B

def model_B(FILTER_SIZES,NUM_FILTERS, MAX_SEN_LEN, MAX_WORDS, EMBEDDING_DIM):
    
    model = None
    
    # add your code here
    q1_input = Input(shape=(MAX_SEN_LEN,), dtype='int32', name='q1_input')
    q2_input = Input(shape=(MAX_SEN_LEN,), dtype='int32', name='q2_input')
    
    
    left_cnn = cnn_model(FILTER_SIZES,\
              # filter sizes as a list
              MAX_WORDS, \
              # total number of words
              MAX_SEN_LEN, \
              # max words in a doc
              #NAME = 'cnn',\
              EMBEDDING_DIM, \
              # word vector dimension
              NUM_FILTERS,\
              #add input layer from outside
              q1_input)
    
    right_cnn = cnn_model(FILTER_SIZES,\
              # filter sizes as a list
              MAX_WORDS, \
              # total number of words
              MAX_SEN_LEN, \
              # max words in a doc
              #NAME = 'cnn',\
              EMBEDDING_DIM, \
              # word vector dimension
              NUM_FILTERS,\
              #add input layer from outside
              q2_input)
    
    con_layer=Concatenate(name='concate')([left_cnn, right_cnn])
    
    # Create a dropout layer
    # In each iteration only 50% units are turned on
    drop_1=Dropout(rate=0.5, name='dropout')(con_layter)

    # Create a dense layer
    dense_1 = Dense(192, activation='relu', name='dense')(drop_1)
    # Create the output layer
    
    return model



In [164]:
# define Model C

def model_C(FILTER_SIZES,NUM_FILTERS, MAX_SEN_LEN, MAX_WORDS, EMBEDDING_DIM):
    
    model = None
    
    # add your code here
    
    return model

#### 4. Define a function "train_model" to:
- Train a model provided as an input parameter
- Use appropriate techniques to ensure you don't overfit the model
- Plot training history to make sure your model is reasonable good
- Using the testing dataset, calculate precision, recall, and F-1 score of each class (assuming 0.5 probabbility threshould), and also report AUC score.


In [258]:
# define train_model function

def train_model(model, \
                q1_train, q2_train, y1_train, y2_train # training subset
                #q1_val, q2_val,y_val, # evaluation subset
                #q1_test, q2_test,y_test # evaluation subset
                #BATCH_SIZE
                #NUM_EPOCHES
               ):
    
    # compile and train model
    # process test dataset
    model.compile(loss="binary_crossentropy", \
              optimizer="adam", \
              metrics=["accuracy"])
    
    BATCH_SIZE = 64
    NUM_EPOCHES = 10

    # fit the model and save fitting history to "training"
    training=model.fit(q1_train, y1_train, \
                   batch_size=BATCH_SIZE, \
                   epochs=NUM_EPOCHES,\
                   validation_data=[q1_val, y1_val], \
                   verbose=2)
    
    # plot training history

    
    # predict testing samples
    
    
    
    # print classfication report
    
    
    
    # calculate ROC AUC
    
    
    # add your code here
    
    

#### 5. Call the function "train_model" to get the performance of each model. Show your analysis as markdowns on the following:
- How did you choose the hyperparaters: FILTER_SIZES,NUM_FILTERS, MAX_SEN_LEN, MAX_WORDS, EMBEDDING_DIM
- Analyze each architecture to understand its pros and cons
- Which architecture is the most effective and why is it effective for this classification task?
- What regularizers did you use and why did it work?
- What features do you think CNN can successfully extract? What kind of useful features could be missed by CNN?

In [253]:
# Train and test each model
if __name__ == "__main__":  
    
    #read data
    df_train = pd.read_csv('03_data/21_train.csv')
    df_test = pd.read_csv('03_data/22_test.csv')
    
    # Set hyper parameters
    #maybe means, first is unigram(kernel_size=1), second is bigram(kernel_size=2), third is trigram(kernel_size=3)
    FILTER_SIZES = [1, 2, 3]
    
    #how manny feature maps to extract
    NUM_FILTERS = 64
    
    #I think this parameter is saame as MAX_SEN_LEN
    #MAX_DOC_LEN = 500
    
    #for these two column sentence, the lengthist is 105 words in single sentence
    MAX_SEN_LEN = len(df['question1'].max())
    
    #this is definnation of tokenzier max token value, smaller and less feature, and greater value will get more features
    #typical we setup 10000 to adapat requirement of word feature extract
    MAX_WORDS = 10000
    
    #the output dimention of embedding input layers 
    EMBEDDING_DIM = 200
    
    BATCH_SIZE = 64
    
    NUM_EPOCHES = 10
    
    #try to find the max length of these two questions set
    print(len(df_train['question1'].max()))

    # process training dataset
    seq_1 = text_preprocessor(MAX_SEN_LEN, MAX_WORDS, df_train['question1'])
    data_1= seq_1.generate_seq(df_train['question1'])
    #print(type(data_1))
    #print(data_1.shape)
    
    seq_2 = text_preprocessor(MAX_SEN_LEN, MAX_WORDS, df['question2'])
    data_2 = seq_2.generate_seq(df['question2'])
    
   
    #split into train and validation set
    q1_train, q1_val, y1_train, y1_val = train_test_split(\
                        data_1, df_train['is_duplicate'],\
                        test_size=0.3, random_state=1)
    q2_train, q2_val, y2_train, y2_val = train_test_split(\
                        data_2, df_train['is_duplicate'],\
                        test_size=0.3, random_state=1)
    
    
    q1_test = df_test['question1']
    q2_test = df_test['question2']
    y_test = df_test['is_duplicate']
    

    # train and test model A/B/C
    model_A= model_A(EMBEDDING_DIM, FILTER_SIZES,NUM_FILTERS,MAX_WORDS,MAX_SEN_LEN)
    #model_A.summary
    

    train_model(model_A, \
                q1_train, q2_train,y1_train, y2_train # training subset
                #q1_val, q2_val,y_val, # evaluation subset
                #q1_test, q2_test,y_test # evaluation subset
               )

    
    # add your code here
    
    

105
<class 'tensorflow.python.framework.ops.Tensor'>
<class 'tensorflow.python.framework.ops.Tensor'>
Model: "model_55"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
q1_input (InputLayer)           (None, 105)          0                                            
__________________________________________________________________________________________________
q2_input (InputLayer)           (None, 105)          0                                            
__________________________________________________________________________________________________
model_53 (Model)                (None, 192)          2077192     q1_input[0][0]                   
__________________________________________________________________________________________________
model_54 (Model)                (None, 192)          2077192     q2_input[0][0]         

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[   2,   11,    1, ...,    0,    0,    0],
       [   2,    3,    1, ...,    0,    0,    0],
       [  24,  249,  120, ...,    0,    0,    0],
       ...,
       [  38,   10,   15, ...,    0, ...