## [COM4513-6513] Assignment 2: Text Classification with a Feedforward Network


In [7]:
import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
from time import localtime, strftime
from scipy.stats import spearmanr,pearsonr
import zipfile
import gc

# fixing random seed for reproducibility
random.seed(123)
np.random.seed(123)


## Transform Raw texts into training and development data

First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can use Pandas dataframes).

#### 1. load data

In [8]:
train_data=pd.read_csv('./data_topic/train.csv',header=None,names=["label","text"])
dev_data=pd.read_csv('./data_topic/dev.csv',header=None,names=["label","text"])
test_data=pd.read_csv('./data_topic/test.csv',header=None,names=["label","text"])
train_data.head()

|  | label | text |
|:-:|:-:|:-:|
| 0 | 1 | Reuters - Venezuelans turned out early\and in ... |
| 1 | 1 | Reuters - South Korean police used water canno... |
| 2 | 1 | Reuters - Thousands of Palestinian\prisoners i... |
| 3 | 1 | AFP - Sporadic gunfire and shelling took place... |
| 4 | 1 | AP - Dozens of Rwandan soldiers flew into Suda... |


#### 2. Convert the raw texts into lists and their corresponding labels into NumPy arrays:

In [9]:
def creat_list_array(data_text,data_label):
    x=data_text.tolist()
    y=np.array(data_label)
    return x,y
def lower_F(data):
    lower_list=[]
    for i in range(len(data)):
        lower_data=str.lower(data[i])
        lower_list.append(lower_data)
    return lower_list

# Transform train data
data_train_x_raw,data_train_y=creat_list_array(train_data['text'],train_data['label']) #len 2400
# Transform validation data
data_dev_x_raw,data_dev_y=creat_list_array(dev_data['text'],dev_data['label']) #len 150
# Transform test data
data_test_x_raw,data_test_y=creat_list_array(test_data['text'],test_data['label']) #len 900

# lower data
data_train_x=lower_F(data_train_x_raw)
data_dev_x=lower_F(data_dev_x_raw)
data_test_x=lower_F(data_test_x_raw)


# Create input representations


To train your Feedforward network, you first need to obtain input representations given a vocabulary. One-hot encoding requires large memory capacity. Therefore, we will instead represent documents as lists of vocabulary indices (each word corresponds to a vocabulary index). 


## Text Pre-Processing Pipeline

To obtain a vocabulary of words, you should:
- tokenise all texts into a list of unigrams (tip: you can re-use the functions from Assignment 1)
- remove stop words (using the list provided or one of your preference)
- remove unigrams appearing in fewer than K documents
- use the remaining unigrams to create a vocabulary of the top-N most frequent unigrams in the entire corpus.


### Unigram extraction from a document

You first need to implement the `extract_ngrams` function. It takes as input:
- `x_raw`: a string corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `vocab`: a given vocabulary. It should be used to extract specific features.

and returns:

- a list of all extracted features.
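
As a minimal sketch of the behaviour described above (not the notebook's own implementation), n-grams of every size in a range can be built by zipping shifted copies of the token list. The name `ngrams_sketch` and the whitespace token pattern are assumptions made for illustration only.

```python
import re

def ngrams_sketch(x_raw, ngram_range=(1, 2), token_pattern=r'\S+', stop_words=()):
    """Return unigrams as strings and longer n-grams as tuples of tokens."""
    stop = set(stop_words)
    # tokenise, lowercase and drop stop words
    tokens = [t for t in re.findall(token_pattern, x_raw.lower()) if t not in stop]
    features = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        if n == 1:
            features.extend(tokens)
        else:
            # zip over n shifted views of the token list yields all n-grams
            features.extend(zip(*[tokens[i:] for i in range(n)]))
    return features

print(ngrams_sketch("the cat sat on the mat", ngram_range=(1, 2), stop_words=["the", "on"]))
```

Representing bigrams and trigrams as tuples keeps them hashable, so they can be used directly as dictionary keys when counting frequencies.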


In [10]:
stop_words = ['a','in','on','at','and','or', 
              'to', 'the', 'of', 'an', 'by', 
              'as', 'is', 'was', 'were', 'been', 'be', 
              'are','for', 'this', 'that', 'these', 'those', 'you', 'i', 'if',
             'it', 'he', 'she', 'we', 'they', 'will', 'have', 'has',
              'do', 'did', 'can', 'could', 'who', 'which', 'what',
              'but', 'not', 'there', 'no', 'does', 'not', 'so', 've', 'their',
             'his', 'her', 'they', 'them', 'from', 'with', 'its']

# tokenise, create unigrams, using stop-words
def tokenise(data,token_pattern,stop_words):
    token_data=[]
    token_list=re.findall(token_pattern,data)
    for word in token_list:
        if word not in stop_words:
            token_data.append(word)
    return token_data

# based on the tokenised data(unigrams), create bigrams or trigrams
def ngrams_generate(data,n):
    result_list=[]
    ngrams = zip(*[data[i:] for i in range(n)])
    for ngram in ngrams:
        result_list.append((ngram))
    return result_list

# extract ngrams function
def extract_ngrams(x_raw,ngram_range=(1,3),token_pattern=r'\b[A-Za-z][A-Za-z]+\b',stop_words=[],vocab=set()):
    # tokenise data
    token_data=tokenise(x_raw,token_pattern=token_pattern,stop_words=stop_words)
    # create ngrams list which save ngrams result
    result_ngrams=[]
    result_vocab=[]
    # Extract ngrams based on the ngram_range
    if ngram_range == 1:
        result_ngrams = token_data
    elif ngram_range[0]==1:
        result_ngrams=token_data
        for i in range(ngram_range[0],ngram_range[1]):
            ngrams=ngrams_generate(token_data,i+1)
            result_ngrams=result_ngrams+ngrams
    else:
        result_ngrams=ngrams_generate(token_data,ngram_range[0])
        for i in range(ngram_range[0],ngram_range[1]):
            # generate (i+1)-grams so that every size up to ngram_range[1] is covered
            ngrams=ngrams_generate(token_data,i+1)
            result_ngrams=result_ngrams+ngrams
    # Extract specific vocab based on the vocab set()
    if len(vocab)==0:
        return result_ngrams
    else:
        for word in vocab:
            if word in result_ngrams:
                result_vocab.append(word)
        return result_vocab
    
# Extract ngrams on the complete data set, for dev and test sets
def extract_ngrams_for_test(X_data,ngram_range,stop_words=stop_words):
    # Extract ngrams from raw data
    ngrams_list_without_Ded=[]
    for i in range(len(X_data)):
        ngrams_data=extract_ngrams(X_data[i],ngram_range=ngram_range,stop_words=stop_words)
        ngrams_list_without_Ded.append(ngrams_data)
    return ngrams_list_without_Ded

### Create a vocabulary of n-grams

Then the `get_vocab` function will be used to (1) create a vocabulary of ngrams; (2) count the document frequencies of ngrams; and (3) count their raw frequencies. It takes as input:
- `X_raw`: a list of strings each corresponding to the raw text of a document
- `ngram_range`: a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes extracting unigrams and bigrams.
- `token_pattern`: a string to be used within a regular expression to extract all tokens. Note that data is already tokenised so you could opt for a simple white space tokenisation.
- `stop_words`: a list of stop words
- `min_df`: keep ngrams with a minimum document frequency.
- `keep_topN`: keep only the top-N most frequent ngrams.

and returns:

- `vocab`: a set of the n-grams that will be used as features.
- `df`: a Counter (or dict) that contains ngrams as keys and their corresponding document frequency as values.
- `ngram_counts`: counts of each ngram in vocab
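
The three return values described above can be sketched with `collections.Counter`. This is an illustrative version under stated assumptions, not the notebook's implementation: the name `vocab_sketch` is hypothetical, the input is assumed to be already tokenised, and the top-N cut is taken by raw frequency.

```python
from collections import Counter

def vocab_sketch(docs_tokens, min_df=0, keep_topN=0):
    """docs_tokens: list of token lists, one per document (assumed pre-tokenised)."""
    df = Counter()            # number of documents each ngram appears in
    ngram_counts = Counter()  # total occurrences of each ngram in the corpus
    for tokens in docs_tokens:
        df.update(set(tokens))       # count each ngram once per document
        ngram_counts.update(tokens)  # count every occurrence
    # drop rare ngrams, then rank by raw frequency (ties broken alphabetically)
    kept = [w for w in df if df[w] >= min_df]
    kept.sort(key=lambda w: (-ngram_counts[w], str(w)))
    if keep_topN > 0:
        kept = kept[:keep_topN]
    vocab = set(kept)
    return vocab, df, Counter({w: ngram_counts[w] for w in vocab})

docs = [["a", "b", "a"], ["b", "c"], ["b"]]
vocab_demo, df_demo, counts_demo = vocab_sketch(docs, min_df=1, keep_topN=2)
print(vocab_demo, df_demo["b"], counts_demo["a"])
```

Document frequency and raw frequency differ whenever a word repeats within one document, which is why the two counters are updated separately.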


In [11]:
def get_vocab(X_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', min_df=0, keep_topN=0, stop_words=[]):
    ngrams_list = []
    ngrams_list_without_Ded = []
    for i in range(len(X_raw)):
        ngrams_data=extract_ngrams(X_raw[i],ngram_range=ngram_range,token_pattern=token_pattern,stop_words=stop_words)
        ngrams_list_without_Ded.append(ngrams_data)
        # Deduplication
        ngrams_data_Ded=sorted(set(ngrams_data),key=ngrams_data.index)
        ngrams_list.append(ngrams_data_Ded)
    
    # create vocab dictionary
    vocab_dict = {}
    for i in range(len(ngrams_list)):
        for word in ngrams_list[i]:
            if word in vocab_dict:
                vocab_dict[word]+=1
            else:
                vocab_dict[word]=1
                
    # keep ngrams with a minimum df
    for word in list(vocab_dict.keys()):
        if vocab_dict[word] < min_df:
            del vocab_dict[word]
    
    # sorted then keep only topN
    vocab_sorted=sorted(vocab_dict.items(),key=lambda item:item[1],reverse=True)
    if keep_topN == 0:
        vocab_topN = vocab_sorted
    else:
        vocab_topN = vocab_sorted[:keep_topN]
        
    vocab = []
    for i in range(len(vocab_topN)):
        vocab.append(vocab_topN[i][0])
        
    return vocab,vocab_topN,ngrams_list_without_Ded

Now you should use `get_vocab` to create your vocabulary and get document and raw frequencies of unigrams:

In [12]:
# extract the vocabulary, document frequencies and tokenised documents for the training set
vocab,df_tr,ngrams_without_Ded_tr=get_vocab(data_train_x,ngram_range=(1),  
                                            min_df=0,keep_topN=0, stop_words=stop_words)
# extract dev token
ngrams_without_Ded_dev=extract_ngrams_for_test(data_dev_x,ngram_range=(1),stop_words=stop_words)
# extract test token
ngrams_without_Ded_test=extract_ngrams_for_test(data_test_x,ngram_range=(1),stop_words=stop_words)

In [13]:
print(len(vocab))
print()
print(random.sample(vocab,100))
print()
print(df_tr[:10])

8931

['stores', 'jan', 'renewed', 'coached', 'quarterfinals', 'scheduled', 'refugees', 'la', 'jns', 'tourists', 'advisers', 'call', 'embarrassment', 'mate', 'valuable', 'benazir', 'relinquish', 'revenues', 'one', 'scholarship', 'senate', 'equities', 'mistakes', 'left', 'erupts', 'marco', 'throw', 'opened', 'disciplinary', 'lance', 'surprise', 'career', 'downturn', 'conjunctivitis', 'usatoday', 'instruction', 'deflated', 'euro', 'gravely', 'occupation', 'warehouses', 'blows', 'alcoa', 'protecting', 'grip', 'co', 'nicol', 'woes', 'tilt', 'des', 'penned', 'casting', 'denmark', 'older', 'strengthen', 'injury', 'brussels', 'declare', 'categories', 'glorious', 'earl', 'dom', 'charts', 'arsenal', 'debut', 'vulnerable', 'smudged', 'seals', 'murder', 'computers', 'shrugged', 'slammed', 'through', 'papal', 'lethal', 'luck', 'slapped', 'evaluation', 'wife', 'prv', 'cbi', 'organisers', 'breakingviews', 'astorga', 'counter', 'argentina', 'treading', 'squares', 'robbing', 'shenzhen', 'increase', 'c

Then, you need to create vocabulary id -> word and word -> id dictionaries for reference:

In [14]:
def create_dict(vocab):
    id_word_dict = {}
    word_id_dict = {}
    for i in range(len(vocab)):
        id_word_dict[i] = vocab[i]
        word_id_dict[vocab[i]] = i
    return id_word_dict,word_id_dict
id_word_dict,word_id_dict = create_dict(vocab)

In [15]:
word_id_dict

{'reuters': 0,
 'said': 1,
 'tuesday': 2,
 'wednesday': 3,
 'new': 4,
 'after': 5,
 'ap': 6,
 'athens': 7,
 'monday': 8,
 'first': 9,
 'two': 10,
 'york': 11,
 'over': 12,
 'us': 13,
 'olympic': 14,
 'inc': 15,
 'more': 16,
 'year': 17,
 'oil': 18,
 'prices': 19,
 'company': 20,
 'world': 21,
 'than': 22,
 'aug': 23,
 'about': 24,
 'had': 25,
 'united': 26,
 'one': 27,
 'out': 28,
 'sunday': 29,
 'into': 30,
 'against': 31,
 'up': 32,
 'second': 33,
 'last': 34,
 'president': 35,
 'stocks': 36,
 'gold': 37,
 'team': 38,
 'when': 39,
 'three': 40,
 'night': 41,
 'time': 42,
 'yesterday': 43,
 'games': 44,
 'olympics': 45,
 'states': 46,
 'greece': 47,
 'off': 48,
 'iraq': 49,
 'washington': 50,
 'percent': 51,
 'home': 52,
 'day': 53,
 'google': 54,
 'public': 55,
 'record': 56,
 'week': 57,
 'men': 58,
 'government': 59,
 'win': 60,
 'american': 61,
 'won': 62,
 'years': 63,
 'all': 64,
 'billion': 65,
 'shares': 66,
 'city': 67,
 'offering': 68,
 'officials': 69,
 'would': 70,
 'today

### Convert the list of unigrams  into a list of vocabulary indices

Storing actual one-hot vectors in memory for all words in the entire data set is prohibitive. Instead, we will store word indices in the vocabulary and look up the corresponding rows of the weight matrix. This is equivalent to computing a dot product between a one-hot vector and the weight matrix.
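
The equivalence between a row lookup and a one-hot dot product can be checked numerically. The sketch below uses a small random matrix as a stand-in for the embedding matrix; the sizes and variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 5, 3
W_emb = rng.uniform(-0.1, 0.1, size=(vocab_size, emb_dim))  # stand-in embedding matrix

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

by_product = one_hot @ W_emb   # dot product with a one-hot vector
by_lookup = W_emb[word_index]  # direct row lookup
print(np.allclose(by_product, by_lookup))  # True

# the same holds for a whole document: averaging the looked-up rows
doc = [2, 1]
doc_one_hot = np.eye(vocab_size)[doc]  # shape (len(doc), vocab_size)
avg_by_product = doc_one_hot.mean(axis=0) @ W_emb
avg_by_lookup = W_emb[doc].mean(axis=0)
```

The lookup touches only `len(doc)` rows, whereas the one-hot product multiplies through the full vocabulary dimension for every word.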

First, represent documents in train, dev and test sets as lists of words in the vocabulary:

In [16]:
def create_index(data,word_id_dict):
    X_uni_tr = data
    X_tr = []
    for i in range(len(X_uni_tr)):
        list_a = []
        for word in X_uni_tr[i]:
            # keep only words that are in the vocabulary; unknown words are skipped
            if word in word_id_dict:
                list_a.append(word_id_dict[word])
        X_tr.append(list_a)
    return X_uni_tr,X_tr
# represent train set
X_uni_tr,X_tr = create_index(ngrams_without_Ded_tr,word_id_dict)

In [17]:
X_uni_tr[0]

['reuters',
 'venezuelans',
 'turned',
 'out',
 'early',
 'large',
 'numbers',
 'sunday',
 'vote',
 'historic',
 'referendum',
 'either',
 'remove',
 'left',
 'wing',
 'president',
 'hugo',
 'chavez',
 'office',
 'give',
 'him',
 'new',
 'mandate',
 'govern',
 'next',
 'two',
 'years']

In [18]:
X_tr[0]

[0,
 1011,
 758,
 28,
 208,
 1103,
 1367,
 29,
 308,
 816,
 262,
 1586,
 2704,
 108,
 759,
 35,
 172,
 175,
 493,
 701,
 97,
 4,
 1221,
 2203,
 173,
 10,
 63]

In [19]:
X_uni_dev,X_dev = create_index(ngrams_without_Ded_dev,word_id_dict)
X_uni_test,X_test = create_index(ngrams_without_Ded_test,word_id_dict)

# Network Architecture

Your network should map each word index to its corresponding embedding by looking it up in the embedding matrix, and then compute the first hidden layer $\mathbf{h}_1$:

$$\mathbf{h}_1 = \frac{1}{|x|}\sum_i W^e_i, i \in x$$

where $|x|$ is the number of words in the document and $W^e$ is an embedding matrix $|V|\times d$, $|V|$ is the size of the vocabulary and $d$ the embedding size.

Then $\mathbf{h}_1$ should be passed through a ReLU activation function:

$$\mathbf{a}_1 = relu(\mathbf{h}_1)$$

Finally the hidden layer is passed to the output layer:


$$\mathbf{y} = \text{softmax}(\mathbf{a}_1W^T) $$ 
where $W$ is a matrix $d \times |{\cal Y}|$, $|{\cal Y}|$ is the number of classes.

During training, $\mathbf{a}_1$ should be multiplied with a dropout mask vector (elementwise) for regularisation before it is passed to the output layer.

You can extend to a deeper architecture by passing a hidden layer to another one:

$$\mathbf{h_i} = \mathbf{a}_{i-1}W_i^T $$

$$\mathbf{a_i} = relu(\mathbf{h_i}) $$



# Network Training

First we need to define the parameters of our network by initialising the weight matrices. For that purpose, you should implement the `network_weights` function that takes as input:

- `vocab_size`: the size of the vocabulary
- `embedding_dim`: the size of the word embeddings
- `hidden_dim`: a list of the sizes of any subsequent hidden layers (for the Bonus). Empty if there are no hidden layers between the average embedding and the output layer 
- `num_classes`: the number of classes for the output layer

and returns:

- `W`: a dictionary mapping from layer index (e.g. 0 for the embedding matrix) to the corresponding weight matrix initialised with small random numbers (hint: use numpy.random.uniform with values from -0.1 to 0.1)

See the examples below for expected outputs. Make sure that the dimensionality of each weight matrix is compatible with the previous and next weight matrix, otherwise you won't be able to perform forward and backward passes. Consider also using np.float32 precision to save memory.

In [20]:
def network_weights(vocab_size=1000, embedding_dim=300, 
                    hidden_dim=[], num_classes=3, init_val = 0.5):
    # build the list of layer sizes without mutating the caller's hidden_dim list
    dict_num_list = [vocab_size, embedding_dim] + list(hidden_dim) + [num_classes]
    W={}
    for i in range(len(dict_num_list)-1):
        # note: the seed is reset for every layer, so each matrix is drawn from the same random stream
        np.random.seed(2020)
        W[i] = np.random.uniform(-init_val,init_val,(dict_num_list[i],dict_num_list[i+1])).astype('float32')
    return W
    

In [21]:
W = network_weights(vocab_size=5,embedding_dim=10,hidden_dim=[], num_classes=2)

print('W_emb:', W[0].shape)
print('W_out:', W[1].shape)

W_emb: (5, 10)
W_out: (10, 2)


In [22]:
W = network_weights(vocab_size=3,embedding_dim=4,hidden_dim=[2], num_classes=2)

In [23]:
print('W_emb:', W[0].shape)
print('W_h1:', W[1].shape)
print('W_out:', W[2].shape)

W_emb: (3, 4)
W_h1: (4, 2)
W_out: (2, 2)


In [25]:
W[0]

array([[ 0.48627684,  0.37339196,  0.00974553, -0.22816429],
       [-0.16308127, -0.28304574, -0.22352286, -0.15668441],
       [ 0.36215892, -0.34330034, -0.35911277,  0.2570803 ]],
      dtype=float32)

Then you need to develop a `softmax` function (same as in Assignment 1) to be used in the output layer. It takes as input:

- `z`: array of real numbers 

and returns:

- `sig`: the softmax of `z`
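
A direct ratio of exponentials can overflow for large inputs. One common remedy, sketched below as an assumption rather than a requirement of the assignment, is to subtract the row-wise maximum before exponentiating, which leaves the result mathematically unchanged. The name `softmax_stable` is hypothetical.

```python
import numpy as np

def softmax_stable(z):
    """Row-wise softmax that stays finite for large inputs."""
    z = np.atleast_2d(np.asarray(z, dtype=np.float64))
    shifted = z - z.max(axis=1, keepdims=True)  # largest entry in each row becomes 0
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

probs = softmax_stable([[-2.1, 1.0, 0.9, -1.3, 1.5]])
big = softmax_stable([[1000.0, 1001.0]])  # a naive np.exp(1000) would overflow to inf
print(probs.sum(), np.isfinite(big).all())
```

Because softmax is invariant to adding a constant to every score, the shift changes only the numerical behaviour, not the probabilities.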

In [26]:
def softmax(z):
    sig = (np.exp(z).T/np.sum(np.exp(z),axis=1)).T
    return sig

Now you need to implement the categorical cross entropy loss by slightly modifying the function from Assignment 1 to depend only on the true label `y` and the class probabilities vector `y_preds`:

In [27]:
def categorical_loss(y, y_preds):
    loss = -np.log(y_preds[y])
    #l2_regularization = (alpha/2)*(np.sum(np.square(weights)))
    return loss

In [28]:
# example for 5 classes

y = 2 #true label
y_preds = softmax(np.array([[-2.1,1.,0.9,-1.3,1.5]]))[0]

print('y_preds: ',y_preds)
print('loss:', categorical_loss(y, y_preds))

y_preds:  [0.01217919 0.27035308 0.24462558 0.02710529 0.44573687]
loss: 1.40802648485675
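
As a quick check of the example above: for a single document with true class $y$ and predicted distribution $\hat{\mathbf{y}}$, the categorical cross-entropy reduces to the negative log-probability assigned to the true class. With $y = 2$ (counting from zero) and $\hat{y}_2 = 0.24462558$:

$$L = -\log \hat{y}_{y} = -\log(0.24462558) \approx 1.40803$$

which agrees with the printed loss.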


Then, implement the `relu` function to introduce non-linearity after each hidden layer of your network (during the forward pass): 

$$relu(z_i)= max(z_i,0)$$

and the `relu_derivative` function to compute its derivative (used in the backward pass):

\begin{equation}
  \text{relu_derivative}(z_i)=\begin{cases}
    0, & \text{if $z_i \leq 0$}.\\
    1, & \text{otherwise}.
  \end{cases}
\end{equation}

Note that both functions take a vector $z$ as input.

Hint: use `.copy()` to avoid in-place changes to the array `z`.

In [75]:
#def relu(z):
#    return z*(z>0)

def relu(z):
    return np.maximum(z, 0)

def relu_derivative(z):
    return (z>0)*1

During training you should also apply a dropout mask element-wise after the activation function (i.e. vector of ones with a random percentage set to zero). The `dropout_mask` function takes as input:

- `size`: the size of the vector that we want to apply dropout
- `dropout_rate`: the percentage of elements that will be randomly set to zeros

and returns:

- `dropout_vec`: a vector with binary values (0 or 1)

In [30]:
def dropout_mask(size, dropout_rate):
    dropout_vec = np.ones(size)
    num = int(size*dropout_rate)
    dropout_vec[:num] = 0
    np.random.shuffle(dropout_vec)
    return dropout_vec
    
    

In [31]:
print(dropout_mask(10, 0.2))
print(dropout_mask(10, 0.3))

[1. 1. 0. 1. 1. 1. 0. 1. 1. 1.]
[0. 1. 1. 1. 0. 1. 1. 1. 0. 1.]
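
When a mask like the ones above is applied, a common convention is "inverted dropout": the surviving activations are divided by the keep probability `1 - dropout_rate`, so that their expected value matches the no-dropout case used at test time. The assignment text only asks for the element-wise mask, so the rescaling in this sketch is an assumption, and `apply_inverted_dropout` is a hypothetical helper name.

```python
import numpy as np

def apply_inverted_dropout(a, mask, dropout_rate):
    """Zero out masked units and rescale the rest by 1 / (1 - dropout_rate)."""
    keep_prob = 1.0 - dropout_rate
    return a * mask / keep_prob

a = np.ones(10)
mask = np.array([1., 1., 0., 1., 1., 1., 0., 1., 1., 1.])  # 20% of entries are zero
out = apply_inverted_dropout(a, mask, dropout_rate=0.2)
print(out.mean())  # close to the mean of the original activations
```

With `dropout_rate=0.0` and an all-ones mask the function returns its input unchanged, so the same code path can be used for evaluation.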


Now you need to implement the `forward_pass` function, which passes the input `x` through the network up to the output layer and computes the probability of each class using the weight matrices in `W`. The ReLU activation function should be applied to each hidden layer. It takes as input:

- `x`: a list of vocabulary indices each corresponding to a word in the document (input)
- `W`: a list of weight matrices connecting each part of the network, e.g. for a network with a hidden and an output layer: W[0] is the weight matrix that connects the input to the first hidden layer, W[1] is the weight matrix that connects the hidden layer to the output layer.
- `dropout_rate`: the dropout rate that is used to generate a random dropout mask vector applied after each hidden layer for regularisation.

and returns:

- `out_vals`: a dictionary of output values from each layer: h (the vector before the activation function), a (the resulting vector after passing h from the activation function), its dropout mask vector; and the prediction vector (probability for each class) from the output layer.

In [260]:
def forward_pass(x, W, dropout_rate=0.2):
    out_vals = {}
    h_vecs = []
    a_vecs = []
    dropout_vecs = []
    
    # inverted dropout: surviving units are rescaled by 1/(1-dropout_rate),
    # which also avoids a division by zero when dropout_rate is 0
    keep_prob = 1.0 - dropout_rate
    
    # h0: average of the embeddings of the words in x, shape (d,) -> (1,d)
    x_vecs = [W[0][x_num] for x_num in x]
    h = np.expand_dims(np.sum(x_vecs,axis = 0)/len(x),axis = 0)
    a = relu(h)
    d = dropout_mask(a.shape[1],dropout_rate)
    output = (a*d)/keep_prob
    
    # add h, a, dropout array to list
    h_vecs.append(h.squeeze())
    a_vecs.append(a.squeeze())
    dropout_vecs.append(d.squeeze())
    
    # any further hidden layers use W[1] ... W[len(W)-2]
    for i in range(1,len(W)-1):
        h = output@W[i]
        a = relu(h)
        d = dropout_mask(a.shape[1],dropout_rate)
        output = (a*d)/keep_prob
        h_vecs.append(h.squeeze())
        a_vecs.append(a.squeeze())
        dropout_vecs.append(d.squeeze())
    
    # the output layer always uses the last weight matrix
    y = softmax(output@W[len(W)-1])
    
    # output result to dictionary
    out_vals['h'] = h_vecs
    out_vals['a'] = a_vecs
    out_vals['dropout_vecs'] = dropout_vecs
    out_vals['y'] = y.squeeze()
   
    return out_vals
    

In [264]:
W = network_weights(vocab_size=3,embedding_dim=4,hidden_dim=[5], num_classes=2)
 
for i in range(len(W)):
    print('Shape W'+str(i), W[i].shape)

print()
print(forward_pass([2,1], W, dropout_rate=0.5))

Shape W0 (3, 4)
Shape W1 (4, 5)
Shape W2 (5, 2)

{'h': [array([ 0.09953883, -0.31317306, -0.29131782,  0.05019794], dtype=float32), array([ 0.01674634, -0.02840193,  0.00616702, -0.0377309 , -0.01809771])], 'a': [array([0.09953883, 0.        , 0.        , 0.05019794], dtype=float32), array([0.01674634, 0.        , 0.00616702, 0.        , 0.        ])], 'dropout_vecs': [array([0., 0., 1., 1.]), array([0., 1., 1., 1., 0.])], 'y': array([0.50036991, 0.49963009])}


In [202]:
'''
x = [2,1]
dropout_rate = 0.5
list_vecs = [W[0][i] for i in x]
h0 = np.expand_dims(1/len(x)*np.sum(list_vecs,axis = 0).T,axis = 0) # (,4) -> (1,4)
a0 = relu(h0)
dropout_0 = dropout_mask(a0.shape[1],dropout_rate)
output_0 = a0*dropout_0
h1 = output_0@W[1] # (1,4)@(4,5) = (1,5)
a1 = relu(h1)
dropout_1 = dropout_mask(a1.shape[1],dropout_rate)
output_1 = a1*dropout_1
y = softmax(output_1@W[2]) # (1,5)@(5,2) = (1,2)
'''

In [26]:
W = network_weights(vocab_size=3,embedding_dim=4,hidden_dim=[5], num_classes=2)
 
for i in range(len(W)):
    print('Shape W'+str(i), W[i].shape)

print()
print(forward_pass([2,1], W, dropout_rate=0.5))

Shape W0 (3, 4)
Shape W1 (4, 5)
Shape W2 (5, 2)

{'h': [array([-0.04668263, -0.12518334,  0.17532286, -0.32932055], dtype=float32), array([0., 0., 0., 0., 0.])], 'a': [array([0.        , 0.        , 0.17532286, 0.        ], dtype=float32), array([0., 0., 0., 0., 0.])], 'dropout_vec': [array([1., 0., 0., 1.]), array([0., 0., 1., 1., 1.])], 'y': array([0.5, 0.5])}


The `backward_pass` function computes the gradients and updates the weights of each matrix in the network, from the output layer back to the input. It takes as input:

- `x`: a list of vocabulary indices each corresponding to a word in the document (input)
- `y`: the true label
- `W`: a list of weight matrices connecting each part of the network, e.g. for a network with a hidden and an output layer: W[0] is the weight matrix that connects the input to the first hidden layer, W[1] is the weight matrix that connects the hidden layer to the output layer.
- `out_vals`: a dictionary of output values from a forward pass.
- `lr`: the learning rate for updating the weights.
- `freeze_emb`: boolean value indicating whether the embedding weights will be updated.

and returns:

- `W`: the updated weights of the network.

Hint: the gradients on the output layer are similar to the multiclass logistic regression.
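
Following the hint, a possible backward pass for the simplest network (averaged embedding, ReLU, output layer) is sketched below. It is an illustration under stated assumptions, not the assignment's reference solution: `forward_sketch`, `backward_pass_sketch` and the `keep_prob` argument are hypothetical, the forward helper uses no dropout, and deeper networks are not handled.

```python
import numpy as np

def forward_sketch(x, W):
    """Dropout-free forward pass for an embedding -> output network."""
    h0 = W[0][x].mean(axis=0)   # average embedding, shape (d,)
    a0 = np.maximum(h0, 0)      # ReLU
    d0 = np.ones_like(a0)       # all-ones mask, i.e. no dropout
    z = a0 @ W[1]               # class scores, shape (num_classes,)
    exp_z = np.exp(z - z.max())
    return {'h': [h0], 'a': [a0], 'dropout_vecs': [d0], 'y': exp_z / exp_z.sum()}

def backward_pass_sketch(x, y, W, out_vals, lr=0.001, freeze_emb=False, keep_prob=1.0):
    h0, a0, d0 = out_vals['h'][0], out_vals['a'][0], out_vals['dropout_vecs'][0]
    layer_in = a0 * d0 / keep_prob  # what the output layer actually received

    # softmax + cross-entropy: gradient w.r.t. the scores is (y_pred - one_hot(y))
    grad_z = out_vals['y'].astype(np.float64)  # astype returns a copy
    grad_z[y] -= 1.0

    grad_W_out = np.outer(layer_in, grad_z)  # shape (d, num_classes)
    grad_layer_in = W[1] @ grad_z            # computed before W[1] is updated
    W[1] -= (lr * grad_W_out).astype(W[1].dtype)

    if not freeze_emb:
        # back through the dropout scaling and the ReLU, then through the average
        grad_h0 = grad_layer_in * d0 / keep_prob * (h0 > 0)
        update = (lr * grad_h0 / len(x)).astype(W[0].dtype)
        # subtract.at accumulates correctly when a word index repeats in x
        np.subtract.at(W[0], x, update)
    return W

W_demo = {0: np.array([[0.1, 0.2, -0.3, 0.4], [0.5, -0.1, 0.2, 0.3], [0.2, 0.3, 0.1, -0.2]]),
          1: np.array([[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2], [0.2, 0.4]])}
loss_before = -np.log(forward_sketch([2, 1], W_demo)['y'][0])
W_demo = backward_pass_sketch([2, 1], 0, W_demo, forward_sketch([2, 1], W_demo), lr=0.1)
loss_after = -np.log(forward_sketch([2, 1], W_demo)['y'][0])
print(loss_before, loss_after)
```

Only the embedding rows of words that occur in the document receive an update, which is what makes the index-based representation cheap to train.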

In [3]:
import numpy as np
a = np.array([1,2,3,4,5])
b = np.array([1,2,3,4,5])


In [4]:
a*b

array([ 1,  4,  9, 16, 25])

In [27]:
def backward_pass(x, y, W, out_vals, lr=0.001, freeze_emb=False):
    
    
    return W




In [265]:
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

In [266]:
x.shape

(4, 2)

Finally you need to modify SGD to support back-propagation by using the `forward_pass` and `backward_pass` functions.

The `SGD` function takes as input:

- `X_tr`: array of training data (vectors)
- `Y_tr`: labels of `X_tr`
- `W`: the weights of the network (dictionary)
- `X_dev`: array of development (i.e. validation) data (vectors)
- `Y_dev`: labels of `X_dev`
- `lr`: learning rate
- `dropout`: regularisation strength
- `epochs`: number of full passes over the training data
- `tolerance`: stop training if the difference between the current and previous validation loss is smaller than a threshold
- `freeze_emb`: boolean value indicating whether the embedding weights will be updated (to be used by the backward pass function).
- `print_progress`: flag for printing the training progress (train/validation loss)


and returns:

- `weights`: the weights learned
- `training_loss_history`: an array with the average losses of the whole training set after each epoch
- `validation_loss_history`: an array with the average losses of the whole development set after each epoch
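
The training loop described above could be organised as in the following sketch. It is not the assignment's solution: `sgd_sketch` is a hypothetical name, the forward and backward functions are passed in as arguments (an assumption made so that the block is self-contained), and losses are computed with dropout switched off.

```python
import numpy as np

def sgd_sketch(X_tr, Y_tr, W, forward, backward, X_dev, Y_dev,
               lr=0.001, dropout=0.2, epochs=5, tolerance=0.001, seed=123):
    """forward(x, W, dropout_rate) -> dict with key 'y';
    backward(x, y, W, out_vals, lr) -> updated W (both are assumed signatures)."""
    rng = np.random.default_rng(seed)
    training_loss_history, validation_loss_history = [], []

    def mean_loss(X, Y):
        # average cross-entropy with dropout disabled
        return float(np.mean([-np.log(forward(x, W, 0.0)['y'][y]) for x, y in zip(X, Y)]))

    for epoch in range(epochs):
        # visit the training examples in a new random order every epoch
        for i in rng.permutation(len(X_tr)):
            out_vals = forward(X_tr[i], W, dropout)
            W = backward(X_tr[i], Y_tr[i], W, out_vals, lr)
        training_loss_history.append(mean_loss(X_tr, Y_tr))
        validation_loss_history.append(mean_loss(X_dev, Y_dev))
        # stop early once the validation loss has stopped changing
        if epoch > 0 and abs(validation_loss_history[-2] - validation_loss_history[-1]) < tolerance:
            break
    return W, training_loss_history, validation_loss_history
```

Comparing consecutive validation losses, rather than training losses, ties the stopping rule to generalisation instead of to fit on the training set.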

In [7]:
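def SGD(X_tr, Y_tr, W, X_dev=[], Y_dev=[], lr=0.001, 
        dropout=0.2, epochs=5, tolerance=0.001, freeze_emb=False, print_progress=True):
    # the loss histories are initialised here so that the return statement is valid;
    # the training loop itself is not implemented in this notebook
    training_loss_history = []
    validation_loss_history = []
    
    return W, training_loss_history, validation_loss_history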

Now you are ready to train and evaluate your neural net. First, you need to define your network using the `network_weights` function followed by SGD with backprop:

In [9]:
W = network_weights(vocab_size=len(vocab),embedding_dim=300,hidden_dim=[], num_classes=3)

for i in range(len(W)):
    print('Shape W'+str(i), W[i].shape)

W, loss_tr, dev_loss = SGD(X_tr, Y_tr,
                            W,
                            X_dev=X_dev, 
                            Y_dev=Y_dev,
                            lr=0.001, 
                            dropout=0.2,
                            freeze_emb=False,
                            tolerance=0.01,
                            epochs=100)


Plot the learning process:

Compute accuracy, precision, recall and F1-Score:

In [10]:
preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

### Discuss how you chose the model hyperparameters

# Use Pre-trained Embeddings

Now re-train the network using GloVe pre-trained embeddings. You need to modify the `backward_pass` function above to stop computing gradients and updating weights of the embedding matrix.

Use the function below to obtain the embedding matrix for your vocabulary.

In [32]:
def get_glove_embeddings(f_zip, f_txt, word2id, emb_size=300):
    
    w_emb = np.zeros((len(word2id), emb_size))
    
    with zipfile.ZipFile(f_zip) as z:
        with z.open(f_txt) as f:
            for line in f:
                line = line.decode('utf-8')
                word = line.split()[0]
                     
                if word in word2id:
                    emb = np.array(line.strip('\n').split()[1:]).astype(np.float32)
                    w_emb[word2id[word]] +=emb
    return w_emb

In [33]:
w_glove = get_glove_embeddings("glove.840B.300d.zip","glove.840B.300d.txt",word_id_dict)

First, initialise the weights of your network using the `network_weights` function. Second, replace the weights of the embedding matrix with `w_glove`. Finally, train the network by freezing the embedding weights: 
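
The replacement step might look like the sketch below. A random matrix stands in for the real GloVe matrix, and `use_pretrained_embeddings` is a hypothetical helper; the only point illustrated is that the pre-trained matrix must match the shape of the embedding layer and is best cast to the same dtype as the other weights.

```python
import numpy as np

def use_pretrained_embeddings(W, w_pretrained):
    """Replace the embedding matrix W[0] with pre-trained vectors of the same shape."""
    if w_pretrained.shape != W[0].shape:
        raise ValueError("pre-trained matrix must have shape (vocab_size, embedding_dim)")
    W[0] = w_pretrained.astype(W[0].dtype)  # keep the network's precision
    return W

rng = np.random.default_rng(0)
vocab_size, emb_dim, num_classes = 6, 4, 3
W_net = {0: rng.uniform(-0.1, 0.1, (vocab_size, emb_dim)).astype(np.float32),
         1: rng.uniform(-0.1, 0.1, (emb_dim, num_classes)).astype(np.float32)}
pretrained = rng.normal(size=(vocab_size, emb_dim))  # stand-in for the GloVe matrix

W_net = use_pretrained_embeddings(W_net, pretrained)
print(W_net[0].shape, W_net[0].dtype)
```

Training would then proceed with `freeze_emb=True`, so that only the output layer (and any hidden layers) are updated.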

In [14]:
preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

### Discuss how you chose the model hyperparameters

# Extend to support deeper architectures (Bonus)

Extend the network to support back-propagation for more hidden layers. You need to modify the `backward_pass` function above to compute gradients and update the weights between intermediate hidden layers. Finally, train and evaluate a network with a deeper architecture. 

In [13]:
preds_te = [np.argmax(forward_pass(x, W, dropout_rate=0.0)['y']) for x,y in zip(X_te,Y_te)]
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))

## Full Results

Add your final results here:

| Model | Precision  | Recall  | F1-Score  | Accuracy |
|:-:|:-:|:-:|:-:|:-:|
| Average Embedding  |   |   |   |   |
| Average Embedding (Pre-trained)  |   |   |   |   |
| Average Embedding (Pre-trained) + X hidden layers (BONUS)   |   |   |   |   |
