# CIS421 Fall 2023
Final Project - classification with CNNs

In this notebook we implement the approach described in this [paper](https://arxiv.org/pdf/1408.5882.pdf) for classifying sentences using Convolutional Neural Networks. In particular, we will classify sentences as "subjective" or "objective".

## Subjectivity Dataset

The subjectivity dataset has 5000 subjective and 5000 objective processed sentences. To get the data, write a function

```python
def unpack_dataset(): ...
```
that

- downloads the archive from https://cis335.guihang.org/data/rotten_imdb.tar.gz
- makes a folder `data`
- unpacks the archive into the `data` folder with the bash command:

```bash
 tar -xvf rotten_imdb.tar.gz -C data
```

## <font color="red">Your code here:</font>

In [None]:
import os
from urllib.request import urlretrieve
import tarfile

def unpack_dataset():
  # Create the data folder if it does not exist yet
  newpath = r'/content/data'
  os.makedirs(newpath, exist_ok=True)

  # Download the archive into the data folder
  url = 'https://cis335.guihang.org/data/rotten_imdb.tar.gz'
  filename = os.path.join(newpath, 'rotten_imdb.tar.gz')
  urlretrieve(url, filename)

  # Equivalent to: tar -xvf /content/data/rotten_imdb.tar.gz -C /content/data
  with tarfile.open(filename, "r") as tf:
      tf.extractall(path=newpath)
  print("All files extracted")

In [None]:
unpack_dataset()

All files extracted


In [None]:
from pathlib import Path
PATH = Path("data")
list(PATH.iterdir())

[PosixPath('data/glove.6B.200d.txt'),
 PosixPath('data/glove.6B.50d.txt'),
 PosixPath('data/rotten_imdb.tar.gz'),
 PosixPath('data/quote.tok.gt9.5000'),
 PosixPath('data/glove.6B.100d.txt'),
 PosixPath('data/subjdata.README.1.0'),
 PosixPath('data/plot.tok.gt9.5000'),
 PosixPath('data/glove.6B.300d.txt')]

Read `subjdata.README.1.0` file:
- we have one file containing 5000 subjective sentences (or snippets)
- another file contains 5000 objective sentences

In [None]:
! head data/plot.tok.gt9.5000

the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . 
emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . 
spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . 
amitabh can't believe the board of directors and his mind is filled with revenge and what better revenge than robbing the bank himself , ironic as it may sound . 
she , among others excentricities , talks to a small rock , gertrude , like if she was alive . 
this gives the girls a fair chance of pulling the wool over their eyes using their sexiness to poach any last vestige of common sense the dons might have had . 
styled after vh1's " behin

## String cleaning functions

In [None]:
import numpy as np
from collections import defaultdict
import re

## <font color="red">Your code here:</font>

In [None]:
def read_file(inputFile, encoding='utf-8'):
    """Read a text file and return its lines as a 1-D numpy array of strings."""
    # Read the file and split it into lines (each line keeps its trailing newline)
    with open(inputFile, 'r', encoding=encoding) as f:
        content = f.readlines()

    return np.array(content)  # 1-D numpy array of the text lines in inputFile

In [None]:
def get_vocab(list_of_doc):
    """
    Input: a list of documents. Each list item is a document.

    Computes a dictionary of word counts.
    Dict keys: each individual word found in any document of the input list.
    Dict values: the number of documents that contain the word.
    Returns the dict.
    """
    # your code
    vocab = {}
    for doc in list_of_doc:
      doc_words = doc.split()
      for i in doc_words:
        vocab[i] = vocab.get(i, 0) + 1
    return vocab # vocab is a dict
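
The docstring above asks for the number of documents that contain each word, whereas the loop counts every occurrence, so a word repeated within one sentence is counted more than once. A minimal sketch of the document-frequency variant, assuming whitespace tokenization; `get_doc_freq` and the two toy documents are hypothetical and not part of the assignment:

```python
from collections import Counter

def get_doc_freq(list_of_doc):
    """Count, for each word, the number of documents that contain it."""
    doc_freq = Counter()
    for doc in list_of_doc:
        # set() removes repeats within a document, so each document adds at most 1
        doc_freq.update(set(doc.split()))
    return dict(doc_freq)

docs = ["will god let her fall or give her a new path ?", "her new path"]
counts = get_doc_freq(docs)
print(counts["her"], counts["god"])  # "her" occurs 3 times but in only 2 documents
```

Which of the two counts is wanted matters mostly for the `min_df` threshold used later when rare words are removed.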

In [None]:
temp = read_file('/content/data/plot.tok.gt9.5000')
temp

array(['the movie begins in the past where a young boy named sam attempts to save celebi from a hunter . \n',
       'emerging from the human psyche and showing characteristics of abstract expressionism , minimalism and russian constructivism , graffiti removal has secured its place in the history of modern art while being created by artists who are unconscious of their artistic achievements . \n',
       "spurning her mother's insistence that she get on with her life , mary is thrown out of the house , rejected by joe , and expelled from school as she grows larger with child . \n",
       ...,
       'enter the beautiful and mysterious secret agent petra schmitt . \n',
       'after listening to a missionary from china speak , a christian man ( josh gaffga ) becomes very convinced by what he hears . \n',
       'looking for a short cut to fame , glass concocted sources , quotes and even entire stories , but his deception did not go unnoticed forever , and eventually , his world came c

In [None]:
get_vocab(temp)

{'the': 6311,
 'movie': 60,
 'begins': 79,
 'in': 2144,
 'past': 71,
 'where': 169,
 'a': 4106,
 'young': 250,
 'boy': 72,
 'named': 64,
 'sam': 33,
 'attempts': 28,
 'to': 3307,
 'save': 55,
 'celebi': 6,
 'from': 516,
 'hunter': 16,
 '.': 5387,
 'emerging': 1,
 'human': 33,
 'psyche': 3,
 'and': 3571,
 'showing': 7,
 'characteristics': 3,
 'of': 3117,
 'abstract': 1,
 'expressionism': 1,
 ',': 6554,
 'minimalism': 1,
 'russian': 14,
 'constructivism': 1,
 'graffiti': 4,
 'removal': 3,
 'has': 515,
 'secured': 1,
 'its': 73,
 'place': 63,
 'history': 34,
 'modern': 24,
 'art': 31,
 'while': 146,
 'being': 104,
 'created': 18,
 'by': 581,
 'artists': 9,
 'who': 714,
 'are': 440,
 'unconscious': 3,
 'their': 646,
 'artistic': 11,
 'achievements': 1,
 'spurning': 1,
 'her': 990,
 "mother's": 16,
 'insistence': 2,
 'that': 852,
 'she': 453,
 'get': 162,
 'on': 806,
 'with': 1108,
 'life': 362,
 'mary': 15,
 'is': 1756,
 'thrown': 11,
 'out': 352,
 'house': 70,
 'rejected': 5,
 'joe': 23,


## Split train and test

## <font color='red'>Read code below and answer questions</font>

- What is the purpose of `sub_y` and `obj_y`? What do they represent here?

`sub_content` holds the subjective sentences and `obj_content` holds the objective sentences.  
`sub_y` is a vector of zeros, one per subjective sentence, and `obj_y` is a vector of ones, one per objective sentence. The response variable is therefore binary: 1 if the sentence is objective, 0 if it is subjective. The sentences and their labels are then concatenated, in the same order, into `X` and `y` for training.  

In [None]:
sub_content = read_file(PATH/"quote.tok.gt9.5000", encoding="ISO-8859-1")
obj_content = read_file(PATH/"plot.tok.gt9.5000")
sub_content = np.array([line.strip() for line in sub_content])
obj_content = np.array([line.strip() for line in obj_content])
sub_y = np.zeros(len(sub_content))
obj_y = np.ones(len(obj_content))
X = np.append(sub_content, obj_content)
y = np.append(sub_y, obj_y)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train[:5], y_train[:5]

(array(['will god let her fall or give her a new path ?',
        "the director's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship ( most notably wretched sound design ) .",
        "welles groupie/scholar peter bogdanovich took a long time to do it , but he's finally provided his own broadside at publishing giant william randolph hearst .",
        'based on the 1997 john king novel of the same name with a rather odd synopsis : " a first novel about a seasoned chelsea football club hooligan who represents a disaffected society operating by brutal rules .',
        'yet , beneath an upbeat appearance , she is struggling desperately with the emotional and physical scars left by the attack .'],
       dtype='<U691'),
 array([1., 0., 0., 1., 1.]))

In [None]:
X_train.shape

(8000,)

In [None]:
X_train[:5]

array(['will god let her fall or give her a new path ?',
       "the director's twitchy sketchbook style and adroit perspective shifts grow wearisome amid leaden pacing and indifferent craftsmanship ( most notably wretched sound design ) .",
       "welles groupie/scholar peter bogdanovich took a long time to do it , but he's finally provided his own broadside at publishing giant william randolph hearst .",
       'based on the 1997 john king novel of the same name with a rather odd synopsis : " a first novel about a seasoned chelsea football club hooligan who represents a disaffected society operating by brutal rules .',
       'yet , beneath an upbeat appearance , she is struggling desperately with the emotional and physical scars left by the attack .'],
      dtype='<U691')

In [None]:
# getting vocab from training sets
data_vocab = get_vocab(X_train)

##  <font color='red'>Validate your function get_vocab: (you should see expected result)</font>


In [None]:
sampletext = X_train[:10]

In [None]:

data_vocab0 = get_vocab(sampletext)
# test
stop = 0
for k,v in data_vocab0.items():
    print(f"Key: '{k}'", f"Counts: {v}")
    stop += 1
    if stop >=5: break

Key: 'will' Counts: 1
Key: 'god' Counts: 1
Key: 'let' Counts: 1
Key: 'her' Counts: 3
Key: 'fall' Counts: 1


## Embedding Layer

- <font color='red'>Note</font> `Embedding` maps each word (given as an integer index) to a dense vector whose values are learned by the neural network. You are encouraged to learn as much as possible about embeddings from the PyTorch documentation and other tutorials for this project, but you are not required to write your own code for it. It's fine to just use the given code here.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# an Embedding module containing 10 (words) tensors of size 3
embed = nn.Embedding(10, 3)
a = torch.LongTensor([[1,2,4,5,1]])
embed(a)

tensor([[[-0.3628, -0.7968,  0.4457],
         [ 0.1522,  1.0493,  0.8779],
         [ 1.5134, -0.6654, -0.0792],
         [-0.9242, -0.4839, -0.1281],
         [-0.3628, -0.7968,  0.4457]]], grad_fn=<EmbeddingBackward0>)

In [None]:
## here are the randomly initialized embeddings
embed.weight.data

tensor([[ 0.5324,  0.7288, -0.6350],
        [-0.3628, -0.7968,  0.4457],
        [ 0.1522,  1.0493,  0.8779],
        [ 0.7712, -0.3209, -0.8640],
        [ 1.5134, -0.6654, -0.0792],
        [-0.9242, -0.4839, -0.1281],
        [ 1.3983, -0.5136, -1.5807],
        [-2.3788,  0.0872, -0.9536],
        [ 0.7670, -0.5708, -0.3470],
        [ 0.6567,  1.5418,  0.2362]])
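
The lookup performed by `nn.Embedding` is plain row indexing into its weight matrix: index 1 appears twice in `a`, which is why the first and last output rows above are identical. A small sketch that checks this (the seed is an arbitrary choice, used only for reproducibility):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
embed = nn.Embedding(10, 3)            # 10 rows (words), 3 values per row
a = torch.LongTensor([[1, 2, 4, 5, 1]])

out = embed(a)                         # shape (1, 5, 3)
same_as_indexing = torch.equal(out, embed.weight[a])
repeated_rows_match = torch.equal(out[0, 0], out[0, 4])
print(tuple(out.shape), same_as_indexing, repeated_rows_match)
```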

### Initializing embedding layer with Glove embeddings

To get glove pre-trained embeddings:
    `wget http://nlp.stanford.edu/data/glove.6B.zip`

- <font color='red'>Note</font> `glove` is a pre-trained `Embedding` with all weights already filled in. You are encouraged to learn as much as possible about GloVe embeddings from your own research for this project, but you are not required to write your own code for it. It's fine to just use the given code here.

- <font color='red'>Complete the function below</font>


In [None]:
import zipfile

def unpack_glove():
    # download from  http://nlp.stanford.edu/data/glove.6B.zip
    url = ('http://nlp.stanford.edu/data/glove.6B.zip')
    zip_filepath = os.path.join('/content','glove.6B.zip')
    urlretrieve(url, zip_filepath)
    # unzip the downloaded file
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
      # move files into folder `data`
      zip_ref.extractall('/content/data')
    return



In [None]:
unpack_glove()

In this section we are keeping the whole set of Glove embeddings. You can decide to keep only the words in your training set.


### <font color='red'>You must run below:

</font>


In [None]:
! head -2 data/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392


We would like to initialize the embeddings of our model with the pre-trained Glove embeddings. After initializing, we should "freeze" the embeddings, at least initially. The rationale is that we first want the network to learn weights for the other parameters that were randomly initialized. After that phase we could fine-tune the embeddings to our task.

`embed.weight.requires_grad = False` <font color='red'>freezes the embedding parameters (so it cannot be updated during training).</font>
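
As a sketch of the freeze and unfreeze steps, PyTorch's `nn.Embedding.from_pretrained` builds the layer from a weight matrix and freezes it in one call. The tiny random matrix below is a stand-in for the real pretrained matrix, an assumption made to keep the example self-contained:

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the (V, D) GloVe matrix; the values are random for illustration
toy_weight = np.random.uniform(-0.25, 0.25, (5, 3)).astype("float32")
toy_weight[0] = 0.0  # row 0 is the padding vector

emb = nn.Embedding.from_pretrained(
    torch.from_numpy(toy_weight), freeze=True, padding_idx=0
)
frozen = not emb.weight.requires_grad   # True: the optimizer would skip this layer

emb.weight.requires_grad = True         # unfreeze later for fine-tuning
trainable = emb.weight.requires_grad
print(frozen, trainable)
```

Copying the weights with `copy_` and setting `requires_grad = False` by hand, as this notebook does, has the same effect.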

The following code initializes the embedding. Here `V` is the vocabulary size and `D` is the embedding size. `pretrained_weight` is a numpy matrix of shape `(V, D)`. Each row is a vector representing each of the V words after embedding.

In [None]:
def loadGloveModel(gloveFile=PATH/"glove.6B.300d.txt"):
    """ Loads word vectors into a dictionary."""
    word_vecs = {}
    # GloVe files are UTF-8 text: one word per line followed by its vector
    with open(gloveFile, 'r', encoding='utf-8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            word_vecs[word] = np.array([float(val) for val in splitLine[1:]])
    return word_vecs

In [None]:
word_vecs = loadGloveModel()

In [None]:
print(len(word_vecs.keys()), len(data_vocab.keys()))

400000 21416


- <font color='red'>Complete the function here</font>  

In [None]:
def delete_rare_words(word_vecs, data_vocab, min_df=2):
    """ Deletes rare words from data_vocab

    Deletes words from data_vocab if they are not in word_vecs
    and don't have at least min_df occurrences in data_vocab.
    """
    # Work on a copy so that the input dictionary is not modified
    shrink_data_vocab = data_vocab.copy()
    for word, word_count in data_vocab.items():
        # Drop a word only if it has no GloVe vector AND it is rare
        if word not in word_vecs and word_count < min_df:
            del shrink_data_vocab[word]

    return shrink_data_vocab # returns the shrunk data_vocab

In [None]:
len(data_vocab.keys())

21416

In [None]:
# clean up issues here
data_vocab = delete_rare_words(word_vecs, data_vocab, min_df=2)

In [None]:
len(data_vocab.keys())

18767
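
The drop from 21416 to 18767 words comes from a rule with two conditions: a word is removed only if it is both missing from the GloVe vectors and rarer than `min_df`. A toy run with made-up counts and a made-up set of GloVe words (all names and values here are illustrative):

```python
def drop_rare_unknown(glove_words, vocab_counts, min_df=2):
    """Keep a word if it has a GloVe vector or occurs at least min_df times."""
    return {w: c for w, c in vocab_counts.items()
            if w in glove_words or c >= min_df}

glove_words = {"movie", "celebi"}          # words assumed to have a pretrained vector
vocab_counts = {
    "movie": 60,
    "celebi": 1,           # has a vector: kept even though it is rare
    "excentricities": 1,   # no vector and rare: dropped
    "twitchy": 3,          # no vector but frequent: kept
}
kept = drop_rare_unknown(glove_words, vocab_counts)
print(sorted(kept))
```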

In [None]:
def create_embedding_matrix(word_vecs, data_vocab, min_df=2, D=300):
    """Creates embedding matrix from word vectors. """
    data_vocab = delete_rare_words(word_vecs, data_vocab, min_df)
    V = len(data_vocab.keys()) + 2
    vocab2index = {}
    W = np.zeros((V, D), dtype="float32")
    vocab = ["", "UNK"]
    # adding a vector for padding
    W[0] = np.zeros(D, dtype='float32')
    # adding a vector for rare words
    W[1] = np.random.uniform(-0.25, 0.25, D)
    vocab2index["UNK"] = 1
    i = 2
    for word in data_vocab:
        if word in word_vecs:
            W[i] = word_vecs[word]
            vocab2index[word] = i
            vocab.append(word)
            i += 1
        else:
            W[i] = np.random.uniform(-0.25,0.25,D)
            vocab2index[word] = i
            vocab.append(word)
            i += 1
    return W, np.array(vocab), vocab2index

In [None]:
pretrained_weight, vocab, vocab2index = create_embedding_matrix(word_vecs, data_vocab)

In [None]:
len(pretrained_weight) # note that index 0 is for padding

18769

In [None]:
D = 300
V = len(pretrained_weight)
emb = nn.Embedding(V, D)
emb.weight.data.copy_(torch.from_numpy(pretrained_weight))

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0986,  0.0508, -0.0988,  ..., -0.2125, -0.1108,  0.1963],
        [-0.3457,  0.2848, -0.4848,  ..., -0.4811, -0.3120, -0.0681],
        ...,
        [-0.1369, -0.0570, -0.1921,  ..., -0.2036, -0.4955, -0.2766],
        [-0.3170, -0.4958,  0.3020,  ..., -0.1990,  0.0607, -0.1257],
        [ 0.2028, -0.3397, -0.1055,  ...,  0.5841, -0.4893,  0.0245]])

## <font color='red'>Note: </font>

So far, `data_vocab` is a dictionary whose values are word frequencies in the training set, while `word_vecs` is a dictionary whose values are vectors of floats. These vectors were trained on a generic corpus and have nothing to do with our training set.

### <font color='red'>Questions:

- Briefly explain what the code above does. In particular, why is 'UNK' introduced here?
- What does it mean?
- How many parameters do we have in this embedding matrix?

</font>



## Your answer

### <font color='red'>Answers:

* `create_embedding_matrix` code walkthrough

1. Delete the words that are both rare (fewer than `min_df` occurrences) and missing from the GloVe vectors (`word_vecs`).
2. Set V to the size of the reduced vocabulary plus 2, to make room for the padding and "UNK" rows.
3. Initialize the dictionary `vocab2index`, which maps each word to its row index in the W matrix. This is convenient when we want to look up the embedding of a particular word.
4. Create the W matrix of float weights with shape V by D (vocabulary size + 2 by embedding size).
5. Create the list `vocab` with two initial entries: `""` for padding and **"UNK", which stands for unknown words**. These correspond to the 1st and 2nd rows of the W matrix, as seen in the next steps.
6. Assign the 1st row of W as the padding vector (all zeros).
7. Assign the 2nd row of W as the vector for unknown words (random initialization between -0.25 and 0.25).
8. Add the key "UNK" to `vocab2index` with a value of 1.
9. Initialize `i` to 2 to skip the first two rows of W, which are already assigned.
10. Loop over each word in the reduced `data_vocab` dictionary:
    - If the word is in the `word_vecs` dictionary, row `i` of W is set to its pretrained vector from `word_vecs`.
    - If the word is not in `word_vecs`, there is no pretrained embedding for it, so row `i` of W is set to a randomly initialized vector of the same size.
    - In both cases, the row index `i` is stored in `vocab2index` under the word, the word is appended to the `vocab` list, and the counter `i` is increased by 1.
11. Return the weight matrix `W`, the vocab list as a numpy array (including "" and "UNK"), and the dictionary `vocab2index` that maps each word to its row index in W.

* UNK stands for unknown words. Any word that is not a key of `vocab2index`, such as a rare word deleted in step 1 or a word that appears only in the validation set, is mapped to index 1. All of these words share the single randomly initialized "UNK" vector, so the model can still encode sentences that contain words it has never seen.

* The total number of parameters is V x D, where V already includes the 2 extra rows for padding and "UNK". Here that is 18769 x 300 = 5,630,700.


## Encoding training and validation sets

We will be using 1D Convolutional neural networks as our model. CNNs assume a fixed input size so we need to assume a fixed size and truncate or pad the sentences as needed. Let's find a good value to set our sequence length to.

In [None]:
x_len = np.array([len(x.split()) for x in X_train])

In [None]:
np.percentile(x_len, 95) # let's set the max sequence length to N=40

43.0

In [None]:
X_train[0]

'will god let her fall or give her a new path ?'

In [None]:
# returns the index of the word or the index of "UNK" otherwise
vocab2index.get("will", vocab2index["UNK"])

2

In [None]:
np.array([vocab2index.get(w, vocab2index["UNK"]) for w in X_train[0].split()])

array([ 2,  3,  4,  5,  6,  7,  8,  5,  9, 10, 11, 12])

In [None]:
def encode_sentence(s, N=40):
    enc = np.zeros(N, dtype=np.int32)
    enc1 = np.array([vocab2index.get(w, vocab2index["UNK"]) for w in s.split()])
    l = min(N, len(enc1))
    enc[:l] = enc1[:l]
    return enc

In [None]:
encode_sentence(X_train[0])

array([ 2,  3,  4,  5,  6,  7,  8,  5,  9, 10, 11, 12,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0], dtype=int32)
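
The padding and truncation logic can be checked on a toy vocabulary. In this sketch the vocabulary, the sentences, and `N=5` are made up; index 0 is padding and index 1 is "UNK", as in `vocab2index` above:

```python
def encode(sentence, word2idx, N=5):
    """Map words to indices, truncate to N tokens, pad with 0 on the right."""
    ids = [word2idx.get(w, word2idx["UNK"]) for w in sentence.split()]
    ids = ids[:N]                        # truncate long sentences
    return ids + [0] * (N - len(ids))    # pad short ones with the padding index

toy_vocab = {"UNK": 1, "will": 2, "god": 3, "let": 4}
short = encode("will god", toy_vocab)                   # padded to length 5
long_ = encode("will god let her fall or", toy_vocab)   # truncated, unknown words -> 1
print(short, long_)
```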

In [None]:
x_train = np.vstack([encode_sentence(x) for x in X_train])
x_train.shape

(8000, 40)

In [None]:
x_val = np.vstack([encode_sentence(x) for x in X_val])
x_val.shape

(2000, 40)

## Playing and debugging CNN layers

## <font color='red'>Note: </font>

Carefully read the code below and prepare to answer questions at the end of this section


In [None]:
V = len(pretrained_weight)
D = 300
N = 40

In [None]:
emb = nn.Embedding(V, D)
emb.weight.data.copy_(torch.from_numpy(pretrained_weight))

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0986,  0.0508, -0.0988,  ..., -0.2125, -0.1108,  0.1963],
        [-0.3457,  0.2848, -0.4848,  ..., -0.4811, -0.3120, -0.0681],
        ...,
        [-0.1369, -0.0570, -0.1921,  ..., -0.2036, -0.4955, -0.2766],
        [-0.3170, -0.4958,  0.3020,  ..., -0.1990,  0.0607, -0.1257],
        [ 0.2028, -0.3397, -0.1055,  ...,  0.5841, -0.4893,  0.0245]])

In [None]:
x = x_train[:2]
x.shape

(2, 40)

In [None]:
x = torch.LongTensor(x)
x

tensor([[ 2,  3,  4,  5,  6,  7,  8,  5,  9, 10, 11, 12,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0],
        [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 18, 27, 28, 29,
         30, 31, 32, 33, 34, 35, 36,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0]])

In [None]:
x1 = emb(x)
x1.shape

torch.Size([2, 40, 300])

In [None]:
x1 = x1.transpose(1,2)  # needs to convert x to (batch, embedding_dim, sentence_len)
x1.size()

torch.Size([2, 300, 40])

In [None]:
conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)

In [None]:
x3 = conv_3(x1)

In [None]:
x3.size()

torch.Size([2, 100, 38])

In [None]:
conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)

In [None]:
x4 = conv_4(x1)
x5 = conv_5(x1)
print(x4.size(), x5.size())

torch.Size([2, 100, 37]) torch.Size([2, 100, 36])
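
These lengths follow from the output-size formula of a 1D convolution. With stride 1 and no padding (the `nn.Conv1d` defaults) the output length is `N - k + 1` for kernel size `k`. A quick arithmetic check for `N = 40`; the helper name is hypothetical:

```python
def conv1d_out_len(n, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution with dilation 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

N = 40
lengths = {k: conv1d_out_len(N, k) for k in (3, 4, 5)}
print(lengths)  # {3: 38, 4: 37, 5: 36}
```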


Note that the convolutions all apply to the same `x1`. How do we now combine the results of the convolutions?

In [None]:
# 100 3-gram detectors
x3 = nn.ReLU()(x3)
x3 = nn.MaxPool1d(kernel_size = 38)(x3)
x3.size()

torch.Size([2, 100, 1])

In [None]:
# 100 4-gram detectors
x4 = nn.ReLU()(x4)
x4 = nn.MaxPool1d(kernel_size = 37)(x4)
x4.size()

torch.Size([2, 100, 1])

In [None]:
# 100 5-gram detectors
x5 = nn.ReLU()(x5)
x5 = nn.MaxPool1d(kernel_size = 36)(x5)
x5.size()

torch.Size([2, 100, 1])

In [None]:
# concatenate x3, x4, x5
out = torch.cat([x3, x4, x5], 2)
out.size()

torch.Size([2, 100, 3])

In [None]:
out = out.view(out.size(0), -1)
out.size()

torch.Size([2, 300])

After this we have a fully connected network. Let's write a network that implements this.

## 1D CNN model for sentence classification

Notation:
* V -- vocabulary size
* D -- embedding size
* N -- MAX Sentence length

In [None]:
class SentenceCNN(nn.Module):

    def __init__(self, V, D, glove_weights):
        super(SentenceCNN, self).__init__()
        self.glove_weights = glove_weights
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(self.glove_weights))
        self.embedding.weight.requires_grad = False ## freeze embeddings

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)

        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(300, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        out = torch.cat([x3, x4, x5], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)
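
The pooling kernel sizes 38, 37 and 36 in `forward` are tied to `N = 40`. One way to avoid hardcoding them, shown here as a sketch rather than as part of the assignment, is to take the maximum over the time dimension directly. The tensor sizes below are small made-up values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, D, N = 2, 8, 12                      # toy sizes, not the notebook's
x = torch.randn(batch, D, N)                # (batch, embedding_dim, sentence_len)

pooled = []
for k in (3, 4, 5):
    conv = nn.Conv1d(in_channels=D, out_channels=4, kernel_size=k)
    h = F.relu(conv(x))                     # (batch, 4, N - k + 1)
    pooled.append(h.max(dim=2).values)      # max over time -> (batch, 4)

features = torch.cat(pooled, dim=1)         # (batch, 3 * 4)
print(tuple(features.shape))
```

The features come out in a different order than with `torch.cat(..., 2)` followed by `view`, which makes no difference to a linear layer trained from scratch.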

In [None]:
V = len(pretrained_weight)
D = 300
N = 40
model = SentenceCNN(V, D, glove_weights=pretrained_weight)

In [None]:
# testing the model
x = x_train[:10]
print(x.shape)
x = torch.LongTensor(x)

(10, 40)


In [None]:
y_hat = model(x)
y_hat.size()

torch.Size([10, 1])

In [None]:
test = model.forward(x)
print(test)
print(test.shape)

tensor([[ 0.2956],
        [ 0.0108],
        [ 0.1466],
        [ 0.0215],
        [ 0.4906],
        [ 0.3931],
        [-0.0763],
        [ 0.1265],
        [-0.0939],
        [ 0.2707]], grad_fn=<AddmmBackward0>)
torch.Size([10, 1])



### <font color='red'>Questions:

- What is the output dimension of `.forward()` What does it mean ?
- What are parameters to be LEARNED in the model?
- in `.forward()`, how are x3, x4, x5 connected ? i.e., are they in a pipeline or in parallel ?
- Briefly explain what has been the effect for each of the CNN output x3, x4, x5 ?
- How do x3, x4, x5 contribute the prediction ?

</font>


### <font color='red'>Answers:

* The output dimension of `.forward()` is B by 1, where B is the number of encoded sentences in the batch (B = 10 for the subset above). Each value is a logit (a raw, unnormalized score) for one sentence. Applying a sigmoid converts it to the probability that the sentence is objective (label 1), so a positive logit predicts "objective" and a negative one predicts "subjective".
* The parameters to be LEARNED in the model are the weights and biases of `conv_3`, `conv_4`, and `conv_5`, plus the weight and bias of the fully connected layer `fc`. (The embedding weights are frozen via `requires_grad = False`.)
* In `.forward()`, x3, x4, and x5 are computed in parallel: each convolution is applied to the same embedded input. Their pooled outputs are then concatenated along the last dimension with `torch.cat([x3, x4, x5], 2)` and flattened into a single vector of 300 features per sentence.
* Each of x3, x4, x5 holds 100 feature maps. Every filter slides over windows of 3, 4, or 5 consecutive words and responds strongly when a window matches the pattern it has learned, so the three branches extract features at different n-gram lengths. The ReLU discards negative responses, and max pooling over the whole sentence keeps only the strongest response of each filter, regardless of where in the sentence it occurs. For example, a strongly opinionated word such as "hate" can dominate a feature after pooling.
* x3 acts as a set of 3-gram detectors, x4 as 4-gram detectors, and x5 as 5-gram detectors. After pooling, their 3 x 100 = 300 features pass through dropout and the linear layer `fc`, which combines them into the single logit used for the prediction.

## Training

Note that we are not bothering with mini-batches since our dataset is small.

In [None]:
model = SentenceCNN(V, D, glove_weights=pretrained_weight) #.cuda()

In [None]:
def val_metrics(m):
    m.eval()  # evaluate the model that was passed in; this disables dropout
    x = torch.LongTensor(x_val) #.cuda()
    y = torch.Tensor(y_val).unsqueeze(1) #.cuda()
    with torch.no_grad():  # no gradients are needed for validation
        y_hat = m(x)
        loss = F.binary_cross_entropy_with_logits(y_hat, y)
    y_pred = y_hat > 0  # logit > 0 means predicted class 1
    correct = (y_pred.float() == y).float().sum()
    accuracy = correct/y_pred.shape[0]
    return loss.item(), accuracy.item()

In [None]:
# accuracy of a random model should be around 0.5
val_metrics(model)

(0.6948997974395752, 0.5059999823570251)
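
`val_metrics` thresholds the raw outputs at 0 instead of applying a sigmoid first. The two are equivalent because `sigmoid(z) > 0.5` exactly when `z > 0`. A plain-Python sketch with made-up logits, which also evaluates the per-example loss that `binary_cross_entropy_with_logits` averages:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logit(z, y):
    """Binary cross-entropy for one logit z and one label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [0.2956, -0.0763, 2.0, -3.0]       # made-up scores
by_logit = [z > 0 for z in logits]
by_prob = [sigmoid(z) > 0.5 for z in logits]
print(by_logit == by_prob, round(bce_with_logit(0.0, 1), 4))
```

A logit of 0 gives a loss of ln 2, about 0.693, which is close to the loss of the untrained model reported above.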

In [None]:
# this filters parameters with p.requires_grad=True
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.01)


### <font color='red'>Questions:

- What is the effect of the `parameters` filter above?


</font>


### <font color='red'>Answers:

The `filter` call above keeps only the parameters with `p.requires_grad=True`, so the optimizer updates the parameters of the CNN and the linear layer but not the embedding weights, which we want to keep fixed as mentioned earlier.

In [None]:
def train_epocs(model, epochs=10, lr=0.01):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    # Full-batch training: build the tensors once, outside the loop
    x = torch.LongTensor(x_train)  #.cuda()
    y = torch.Tensor(y_train).unsqueeze(1)
    for i in range(epochs):
        model.train()  # val_metrics switches to eval mode, so switch back each epoch
        y_hat = model(x)
        loss = F.binary_cross_entropy_with_logits(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        val_loss, accuracy = val_metrics(model)
        # "test loss" below is the loss on the validation set
        print("train loss %.3f test loss %.3f and accuracy %.3f" %
              (loss.item(), val_loss, accuracy))

In [None]:
model = SentenceCNN(V, D, glove_weights=pretrained_weight)

In [None]:
train_epocs(model, epochs=10, lr=0.005)

train loss 0.697 test loss 0.813 and accuracy 0.504
train loss 0.776 test loss 0.544 and accuracy 0.681
train loss 0.543 test loss 0.533 and accuracy 0.705
train loss 0.534 test loss 0.378 and accuracy 0.855
train loss 0.368 test loss 0.370 and accuracy 0.849
train loss 0.342 test loss 0.419 and accuracy 0.798
train loss 0.381 test loss 0.416 and accuracy 0.800
train loss 0.373 test loss 0.368 and accuracy 0.839
train loss 0.329 test loss 0.327 and accuracy 0.867
train loss 0.293 test loss 0.317 and accuracy 0.871


In [None]:
# how to figure out the parameters
parameters = filter(lambda p: p.requires_grad, model.parameters())
print([p.size() for p in parameters])

[torch.Size([100, 300, 3]), torch.Size([100]), torch.Size([100, 300, 4]), torch.Size([100]), torch.Size([100, 300, 5]), torch.Size([100]), torch.Size([1, 300]), torch.Size([1])]


### Unfreezing the embeddings

In [None]:
# unfreezing the embeddings
model.embedding.weight.requires_grad = True

In [None]:
parameters = filter(lambda p: p.requires_grad, model.parameters())
print([p.size() for p in parameters])

[torch.Size([18769, 300]), torch.Size([100, 300, 3]), torch.Size([100]), torch.Size([100, 300, 4]), torch.Size([100]), torch.Size([100, 300, 5]), torch.Size([100]), torch.Size([1, 300]), torch.Size([1])]
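
The parameter counts can be read off these shapes. The sketch below only multiplies out the sizes printed above:

```python
import math

shapes = [
    (18769, 300),             # embedding
    (100, 300, 3), (100,),    # conv_3 weight and bias
    (100, 300, 4), (100,),    # conv_4 weight and bias
    (100, 300, 5), (100,),    # conv_5 weight and bias
    (1, 300), (1,),           # fully connected layer weight and bias
]
counts = [math.prod(s) for s in shapes]
embedding_params = counts[0]
other_params = sum(counts[1:])
print(embedding_params, other_params, embedding_params + other_params)
```

With the embeddings frozen, only the 360,601 convolution and linear parameters are updated. Unfreezing adds the 5,630,700 embedding parameters, for 5,991,301 in total.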


- Train again with the embeddings unfrozen.

In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.289 test loss 0.321 and accuracy 0.867
train loss 0.276 test loss 0.312 and accuracy 0.870
train loss 0.262 test loss 0.295 and accuracy 0.872
train loss 0.246 test loss 0.284 and accuracy 0.877
train loss 0.233 test loss 0.277 and accuracy 0.883
train loss 0.224 test loss 0.270 and accuracy 0.881
train loss 0.212 test loss 0.265 and accuracy 0.885
train loss 0.198 test loss 0.263 and accuracy 0.881
train loss 0.191 test loss 0.260 and accuracy 0.887
train loss 0.182 test loss 0.255 and accuracy 0.887


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.172 test loss 0.252 and accuracy 0.900
train loss 0.174 test loss 0.242 and accuracy 0.897
train loss 0.154 test loss 0.251 and accuracy 0.895
train loss 0.149 test loss 0.254 and accuracy 0.892
train loss 0.145 test loss 0.243 and accuracy 0.899
train loss 0.134 test loss 0.234 and accuracy 0.904
train loss 0.123 test loss 0.230 and accuracy 0.906
train loss 0.120 test loss 0.229 and accuracy 0.906
train loss 0.116 test loss 0.226 and accuracy 0.909
train loss 0.106 test loss 0.225 and accuracy 0.910



### <font color='red'>Questions:

- In the above we trained before and after unfreezing the embeddings. What is the major difference between the two stages?

</font>


### <font color='red'>Answers:

In the 1st stage, we train only the parameters of the CNN and linear layers, which learn from the fixed pretrained word embeddings. Keeping the embeddings fixed avoids chasing a moving target: updates driven by the randomly initialized layers could otherwise distort the pretrained vectors.
In the 2nd stage, both the CNN parameters and the word embeddings are learned, which adapts the embeddings to our task and increases overall accuracy (from about 0.87 to 0.91 on the validation set above).

I quote the above description to help explain:

`The rationale is that we first want the network to learn weights for the other parameters that were randomly initialized. After that phase we could finetune the embeddings to our task.`
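
The two stages can be sketched on a toy model. This is only an illustration: `TinyModel` and `trainable_parameters` are hypothetical names, and since the body of `train_epocs` is not shown here, rebuilding the optimizer from the parameters with `requires_grad=True` is an assumption about how the training loop picks up the unfrozen embeddings.

```python
import torch
import torch.nn as nn

# Hypothetical toy model: an embedding followed by a linear layer.
class TinyModel(nn.Module):
    def __init__(self, V=10, D=4):
        super().__init__()
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.fc = nn.Linear(D, 1)

    def forward(self, x):
        # Average the word vectors of each sentence, then classify.
        return self.fc(self.embedding(x).mean(dim=1))

def trainable_parameters(model):
    # Only parameters with requires_grad=True are handed to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

model = TinyModel()

# Stage 1: freeze the embeddings and train everything else.
model.embedding.weight.requires_grad = False
stage1 = trainable_parameters(model)
optimizer = torch.optim.Adam(stage1, lr=0.01)

# Stage 2: unfreeze the embeddings and rebuild the optimizer so it sees them.
model.embedding.weight.requires_grad = True
stage2 = trainable_parameters(model)
optimizer = torch.optim.Adam(stage2, lr=0.001)

print(len(stage1), len(stage2))
```

In stage 1 the optimizer only receives the linear layer's weight and bias; in stage 2 the embedding matrix joins the list, which matches the extra `[V, D]` entry in the printed parameter sizes after unfreezing.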

## Without pretrained embeddings


### <font color='red'>Your turn: </font>


- Complete the `__init__` method of `SentenceCNN2` below, but unlike the class `SentenceCNN` -- you <font color='red'> do not </font> fill in pretrained weights for the embedding. Then proceed to the next stage for training.



In [None]:
class SentenceCNN2(nn.Module):

    def __init__(self, V, D):
        # your code here
        super(SentenceCNN2, self).__init__()

        # Randomly initialize the (V, D) embedding matrix, uniform in [-0.25, 0.25)
        self.random_weights = np.random.uniform(-0.25, 0.25, (V, D)).astype("float32")

        self.embedding = nn.Embedding(V, D, padding_idx=0)
        # Note: copying the full matrix also overwrites row 0, so the padding vector starts non-zero
        self.embedding.weight.data.copy_(torch.from_numpy(self.random_weights))
        self.embedding.weight.requires_grad = True ## unfreeze embeddings

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)

        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(300, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        out = torch.cat([x3, x4, x5], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)
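
The pooling sizes 38, 37 and 36 in `forward` follow from the convolution output length. A small sketch of the arithmetic, assuming sentences are padded to length 40 (that value is inferred from the pool sizes, it is not stated in this section); `conv_output_length` is a hypothetical helper:

```python
def conv_output_length(seq_len, kernel_size, stride=1, padding=0):
    # Output length of a 1-D convolution with dilation 1.
    return (seq_len + 2 * padding - kernel_size) // stride + 1

SEQ_LEN = 40  # assumed padded sentence length, inferred from the pool sizes

for k in (3, 4, 5):
    length = conv_output_length(SEQ_LEN, k)
    print(f"kernel_size={k}: conv output length {length}")

# Three branches, 100 channels each, one value per channel after pooling.
fc_in_features = 3 * 100
print(fc_in_features)
```

Because each `MaxPool1d` window spans the whole convolution output, every branch contributes exactly one value per channel, which is why the linear layer takes 300 inputs even when `D=100`.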

In [None]:
V = len(pretrained_weight)
model = SentenceCNN2(V, D=100)

In [None]:
train_epocs(model, epochs=10, lr=0.01)

train loss 0.696 test loss 0.669 and accuracy 0.527
train loss 0.663 test loss 0.725 and accuracy 0.532
train loss 0.717 test loss 0.573 and accuracy 0.791
train loss 0.550 test loss 0.578 and accuracy 0.648
train loss 0.535 test loss 0.505 and accuracy 0.753
train loss 0.442 test loss 0.400 and accuracy 0.860
train loss 0.323 test loss 0.356 and accuracy 0.850
train loss 0.256 test loss 0.329 and accuracy 0.857
train loss 0.202 test loss 0.281 and accuracy 0.882
train loss 0.137 test loss 0.265 and accuracy 0.883


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.096 test loss 0.259 and accuracy 0.892
train loss 0.090 test loss 0.257 and accuracy 0.892
train loss 0.085 test loss 0.255 and accuracy 0.891
train loss 0.079 test loss 0.253 and accuracy 0.893
train loss 0.073 test loss 0.252 and accuracy 0.891
train loss 0.069 test loss 0.252 and accuracy 0.892
train loss 0.064 test loss 0.251 and accuracy 0.894
train loss 0.059 test loss 0.251 and accuracy 0.895
train loss 0.056 test loss 0.250 and accuracy 0.896
train loss 0.052 test loss 0.250 and accuracy 0.897


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.048 test loss 0.250 and accuracy 0.896
train loss 0.044 test loss 0.251 and accuracy 0.897
train loss 0.040 test loss 0.252 and accuracy 0.897
train loss 0.038 test loss 0.253 and accuracy 0.899
train loss 0.034 test loss 0.255 and accuracy 0.897
train loss 0.031 test loss 0.256 and accuracy 0.897
train loss 0.028 test loss 0.259 and accuracy 0.895
train loss 0.026 test loss 0.261 and accuracy 0.895
train loss 0.023 test loss 0.263 and accuracy 0.893
train loss 0.021 test loss 0.266 and accuracy 0.896


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.020 test loss 0.268 and accuracy 0.897
train loss 0.017 test loss 0.271 and accuracy 0.896
train loss 0.016 test loss 0.274 and accuracy 0.896
train loss 0.014 test loss 0.277 and accuracy 0.896
train loss 0.013 test loss 0.281 and accuracy 0.895
train loss 0.011 test loss 0.284 and accuracy 0.896
train loss 0.010 test loss 0.288 and accuracy 0.896
train loss 0.009 test loss 0.292 and accuracy 0.896
train loss 0.008 test loss 0.297 and accuracy 0.896
train loss 0.007 test loss 0.301 and accuracy 0.896


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.007 test loss 0.305 and accuracy 0.895
train loss 0.005 test loss 0.310 and accuracy 0.895
train loss 0.005 test loss 0.315 and accuracy 0.895
train loss 0.004 test loss 0.321 and accuracy 0.895
train loss 0.004 test loss 0.326 and accuracy 0.894
train loss 0.003 test loss 0.331 and accuracy 0.892
train loss 0.003 test loss 0.337 and accuracy 0.894
train loss 0.003 test loss 0.342 and accuracy 0.894
train loss 0.002 test loss 0.347 and accuracy 0.892
train loss 0.002 test loss 0.353 and accuracy 0.892



### <font color='red'>Questions:

- In the above we trained the model without pretrained weights for embedding. How does the model performance compare to the previous one with pretrained weights?
- Briefly explain why.
</font>


### <font color='red'>Answers:

* By the end of the first 10 epochs, this model performs much better than the one with pretrained weights for embedding (last testing loss is 0.265 instead of 0.428), although the learning rates differ: 0.005 for the first 10 epochs with pretrained weights and 0.01 without. The lowest testing loss for the model with pretrained weights for embedding is 0.225, while the lowest testing loss for the model above without pretrained weights is 0.250, which is worse. In fact, over the last 30 epochs of `SentenceCNN2` the testing loss rose from 0.250 to 0.353 while the training loss kept falling, indicating overfitting.
* My explanation is that randomly initialized embeddings are noisy, so it is hard to find the patterns unless we use some type of regularization to prevent overfitting. The GloVe embeddings are general but good enough to at least give a rough sense of the real features. It is also not an apples-to-apples comparison because we used different learning rates and different numbers of epochs (I didn't change the code above because the instructions said not to).
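
One simple guard against the overfitting seen above is early stopping on the testing loss. The sketch below is not part of the assignment code: `best_epoch` and `patience` are hypothetical names, and the losses are copied from the second and third training blocks of `SentenceCNN2` (epochs 11 to 30).

```python
def best_epoch(test_losses, patience=5):
    """Return (index, loss) of the best epoch, stopping once
    `patience` epochs pass without a new lowest loss."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(test_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break
    return best_idx, best_loss

# Test losses for epochs 11 to 30 of SentenceCNN2, copied from the logs above.
losses = [0.259, 0.257, 0.255, 0.253, 0.252, 0.252, 0.251, 0.251, 0.250, 0.250,
          0.250, 0.251, 0.252, 0.253, 0.255, 0.256, 0.259, 0.261, 0.263, 0.266]

idx, loss = best_epoch(losses)
print(idx + 11, loss)  # epoch number (1-based over the whole run) and its test loss
```

With a patience of 5, this picks epoch 19 (testing loss 0.250) and stops five epochs later, before the steady climb in the remaining blocks.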


### <font color='red'>Your turn:

- Try to write a third model with pre-trained weights for embedding, but improve performance by using more CNN layer(s). Demonstrate your code and show the model performance. For the comparison, please use the same number of epochs and the same learning rates.
</font>


## 6 Layer CNN (kernel sizes 3 to 8) with Frozen GloVe Embeddings

In [None]:
class SentenceCNN3_frozen(nn.Module):

    def __init__(self, V, D, glove_weights):
        # your code here
        super(SentenceCNN3_frozen, self).__init__()

        self.glove_weights = glove_weights
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(self.glove_weights))
        self.embedding.weight.requires_grad = False ## freeze embeddings

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)
        self.conv_6 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=6)
        self.conv_7 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=7)
        self.conv_8 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=8)

        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(600, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x6 = F.relu(self.conv_6(x))
        x7 = F.relu(self.conv_7(x))
        x8 = F.relu(self.conv_8(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        x6 = nn.MaxPool1d(kernel_size = 35)(x6)
        x7 = nn.MaxPool1d(kernel_size = 34)(x7)
        x8 = nn.MaxPool1d(kernel_size = 33)(x8)
        out = torch.cat([x3, x4, x5, x6, x7, x8], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)

In [None]:
V = len(pretrained_weight)
model = SentenceCNN3_frozen(V, D=300, glove_weights=pretrained_weight)

In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.700 test loss 0.657 and accuracy 0.506
train loss 0.643 test loss 0.538 and accuracy 0.850
train loss 0.528 test loss 0.516 and accuracy 0.755
train loss 0.506 test loss 0.451 and accuracy 0.850
train loss 0.437 test loss 0.414 and accuracy 0.871
train loss 0.392 test loss 0.405 and accuracy 0.840
train loss 0.376 test loss 0.382 and accuracy 0.849
train loss 0.352 test loss 0.348 and accuracy 0.874
train loss 0.321 test loss 0.328 and accuracy 0.877
train loss 0.306 test loss 0.320 and accuracy 0.878


In [None]:
train_epocs(model, epochs=10, lr=0.01)

train loss 0.300 test loss 8.015 and accuracy 0.494
train loss 7.887 test loss 1.976 and accuracy 0.494
train loss 1.909 test loss 0.519 and accuracy 0.692
train loss 0.501 test loss 0.717 and accuracy 0.508
train loss 0.736 test loss 0.666 and accuracy 0.507
train loss 0.682 test loss 0.605 and accuracy 0.510
train loss 0.603 test loss 0.587 and accuracy 0.515
train loss 0.575 test loss 0.588 and accuracy 0.527
train loss 0.573 test loss 0.590 and accuracy 0.561
train loss 0.573 test loss 0.582 and accuracy 0.585


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.561 test loss 0.570 and accuracy 0.596
train loss 0.546 test loss 0.560 and accuracy 0.609
train loss 0.532 test loss 0.550 and accuracy 0.619
train loss 0.520 test loss 0.541 and accuracy 0.632
train loss 0.511 test loss 0.533 and accuracy 0.647
train loss 0.503 test loss 0.527 and accuracy 0.660
train loss 0.495 test loss 0.521 and accuracy 0.677
train loss 0.488 test loss 0.514 and accuracy 0.695
train loss 0.481 test loss 0.507 and accuracy 0.715
train loss 0.472 test loss 0.499 and accuracy 0.737


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.462 test loss 0.490 and accuracy 0.769
train loss 0.451 test loss 0.483 and accuracy 0.789
train loss 0.441 test loss 0.474 and accuracy 0.836
train loss 0.432 test loss 0.465 and accuracy 0.844
train loss 0.422 test loss 0.454 and accuracy 0.851
train loss 0.410 test loss 0.444 and accuracy 0.855
train loss 0.398 test loss 0.433 and accuracy 0.858
train loss 0.386 test loss 0.423 and accuracy 0.856
train loss 0.375 test loss 0.412 and accuracy 0.858
train loss 0.364 test loss 0.403 and accuracy 0.861


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.351 test loss 0.394 and accuracy 0.866
train loss 0.342 test loss 0.386 and accuracy 0.864
train loss 0.332 test loss 0.376 and accuracy 0.863
train loss 0.319 test loss 0.367 and accuracy 0.863
train loss 0.308 test loss 0.358 and accuracy 0.864
train loss 0.298 test loss 0.350 and accuracy 0.865
train loss 0.288 test loss 0.343 and accuracy 0.867
train loss 0.278 test loss 0.337 and accuracy 0.868
train loss 0.269 test loss 0.331 and accuracy 0.868
train loss 0.260 test loss 0.327 and accuracy 0.869


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.254 test loss 0.322 and accuracy 0.873
train loss 0.247 test loss 0.318 and accuracy 0.875
train loss 0.240 test loss 0.314 and accuracy 0.876
train loss 0.233 test loss 0.312 and accuracy 0.877
train loss 0.227 test loss 0.308 and accuracy 0.877
train loss 0.220 test loss 0.305 and accuracy 0.877
train loss 0.215 test loss 0.303 and accuracy 0.878
train loss 0.209 test loss 0.301 and accuracy 0.877
train loss 0.204 test loss 0.299 and accuracy 0.878
train loss 0.198 test loss 0.297 and accuracy 0.879


### <font color='red'>Answers:

The lowest testing loss is worse than the 3-layer NN without pretrained embeddings (0.297 > 0.250) and worse than the 3-layer NN with pretrained embeddings (0.297 > 0.225). Part of the gap comes from the second block at `lr=0.01`, where the testing loss jumped to 8.015 and the model had to recover over the following epochs. I also suspect that 6 layers might be too deep, so more iterations and epochs are required to find the true patterns. I don't think it is overfitting yet, because the testing loss is still decreasing slowly, though some overfitting may set in if I train for many more epochs.

## 6 Layer CNN (kernel sizes 3 to 8) with non-Frozen GloVe Embeddings

In [None]:
class SentenceCNN3(nn.Module):

    def __init__(self, V, D, glove_weights):
        # your code here
        super(SentenceCNN3, self).__init__()

        self.glove_weights = glove_weights
        self.embedding = nn.Embedding(V, D, padding_idx=0)
        self.embedding.weight.data.copy_(torch.from_numpy(self.glove_weights))
        self.embedding.weight.requires_grad = True ## don't freeze embeddings

        self.conv_3 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=3)
        self.conv_4 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=4)
        self.conv_5 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=5)
        self.conv_6 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=6)
        self.conv_7 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=7)
        self.conv_8 = nn.Conv1d(in_channels=D, out_channels=100, kernel_size=8)

        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(600, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = x.transpose(1,2)
        x3 = F.relu(self.conv_3(x))
        x4 = F.relu(self.conv_4(x))
        x5 = F.relu(self.conv_5(x))
        x6 = F.relu(self.conv_6(x))
        x7 = F.relu(self.conv_7(x))
        x8 = F.relu(self.conv_8(x))
        x3 = nn.MaxPool1d(kernel_size = 38)(x3)
        x4 = nn.MaxPool1d(kernel_size = 37)(x4)
        x5 = nn.MaxPool1d(kernel_size = 36)(x5)
        x6 = nn.MaxPool1d(kernel_size = 35)(x6)
        x7 = nn.MaxPool1d(kernel_size = 34)(x7)
        x8 = nn.MaxPool1d(kernel_size = 33)(x8)
        out = torch.cat([x3, x4, x5, x6, x7, x8], 2)
        out = out.view(out.size(0), -1)
        out = self.dropout(out)
        return self.fc(out)

In [None]:
V = len(pretrained_weight)
model = SentenceCNN3(V, D=300, glove_weights=pretrained_weight)

In [None]:
train_epocs(model, epochs=10, lr=0.01)

train loss 0.703 test loss 3.721 and accuracy 0.506
train loss 3.766 test loss 0.621 and accuracy 0.711
train loss 0.540 test loss 0.740 and accuracy 0.632
train loss 0.644 test loss 0.435 and accuracy 0.804
train loss 0.366 test loss 0.389 and accuracy 0.876
train loss 0.323 test loss 0.410 and accuracy 0.839
train loss 0.336 test loss 0.421 and accuracy 0.816
train loss 0.334 test loss 0.405 and accuracy 0.825
train loss 0.301 test loss 0.365 and accuracy 0.851
train loss 0.245 test loss 0.317 and accuracy 0.876


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.182 test loss 0.297 and accuracy 0.885
train loss 0.159 test loss 0.282 and accuracy 0.896
train loss 0.139 test loss 0.271 and accuracy 0.899
train loss 0.122 test loss 0.264 and accuracy 0.900
train loss 0.107 test loss 0.260 and accuracy 0.905
train loss 0.098 test loss 0.259 and accuracy 0.903
train loss 0.087 test loss 0.260 and accuracy 0.902
train loss 0.081 test loss 0.262 and accuracy 0.904
train loss 0.074 test loss 0.265 and accuracy 0.905
train loss 0.069 test loss 0.267 and accuracy 0.905


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.063 test loss 0.264 and accuracy 0.908
train loss 0.058 test loss 0.264 and accuracy 0.908
train loss 0.052 test loss 0.265 and accuracy 0.908
train loss 0.047 test loss 0.268 and accuracy 0.908
train loss 0.043 test loss 0.270 and accuracy 0.906
train loss 0.038 test loss 0.272 and accuracy 0.908
train loss 0.034 test loss 0.275 and accuracy 0.908
train loss 0.030 test loss 0.278 and accuracy 0.907
train loss 0.027 test loss 0.282 and accuracy 0.908
train loss 0.024 test loss 0.286 and accuracy 0.908


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.021 test loss 0.293 and accuracy 0.909
train loss 0.018 test loss 0.298 and accuracy 0.908
train loss 0.016 test loss 0.304 and accuracy 0.910
train loss 0.014 test loss 0.310 and accuracy 0.911
train loss 0.012 test loss 0.317 and accuracy 0.911
train loss 0.010 test loss 0.325 and accuracy 0.913
train loss 0.009 test loss 0.334 and accuracy 0.914
train loss 0.007 test loss 0.341 and accuracy 0.915
train loss 0.006 test loss 0.349 and accuracy 0.915
train loss 0.005 test loss 0.357 and accuracy 0.914


In [None]:
train_epocs(model, epochs=10, lr=0.001)

train loss 0.004 test loss 0.371 and accuracy 0.914
train loss 0.004 test loss 0.376 and accuracy 0.914
train loss 0.003 test loss 0.385 and accuracy 0.914
train loss 0.002 test loss 0.395 and accuracy 0.915
train loss 0.002 test loss 0.405 and accuracy 0.914
train loss 0.002 test loss 0.416 and accuracy 0.914
train loss 0.001 test loss 0.426 and accuracy 0.914
train loss 0.001 test loss 0.435 and accuracy 0.914
train loss 0.001 test loss 0.443 and accuracy 0.914
train loss 0.001 test loss 0.452 and accuracy 0.914


### <font color='red'>Answers:

In general the performance is worse than the one with frozen GloVe embeddings: by the end of training the testing loss is 0.452, compared with 0.297 for the frozen model. Comparing their lowest testing losses, however, gives 0.259 < 0.297, so it may seem like using non-frozen embeddings is a good choice. The problem is that after that minimum the testing loss climbs steadily while the training loss drops to 0.001, which means overfitting is happening (accuracy stays around 0.914 in the meantime).

I think it's because there are too many parameters to learn, so if we don't fix some of them the NN will either need a lot more epochs and iterations to converge to a low testing loss or need regularization to prevent overfitting.

My tentative conclusion is that when there are lots of parameters to be learned, it is good to freeze some of them with pretrained embeddings. However, when there are only a few parameters to be learned, it is better to unfreeze all of them and start from scratch.
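
To put rough numbers on "too many parameters", the counts can be worked out by hand from the layer shapes. This is a back-of-the-envelope sketch, with `conv1d_params` as a hypothetical helper; `V` and `D` are taken from the parameter sizes printed earlier (`[18769, 300]`).

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    # Weights plus one bias per output channel.
    return out_channels * in_channels * kernel_size + out_channels

V, D, C = 18769, 300, 100  # vocabulary size, embedding dim, channels per branch

embedding = V * D
conv_3_branches = sum(conv1d_params(D, C, k) for k in (3, 4, 5))
conv_6_branches = sum(conv1d_params(D, C, k) for k in range(3, 9))

print(f"embedding:         {embedding:>9,}")
print(f"conv, 3 branches:  {conv_3_branches:>9,}")
print(f"conv, 6 branches:  {conv_6_branches:>9,}")
```

Under these shapes the embedding matrix alone holds about 5.6 million weights, several times more than all six convolution branches together, so unfreezing it changes the number of trainable parameters far more than adding branches does.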