<a href="https://colab.research.google.com/github/Rt247/Not_NLP_CW/blob/sentence-level-word-embeddings/sentence_level_word_embeddings_with_ff.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Download datasets:

In [2]:
from os.path import exists

if not exists('enzh_data.zip'):
    !wget -O enzh_data.zip https://competitions.codalab.org/my/datasets/download/03e23bd7-8084-4542-997b-6a1ca6dd8a5f
    !unzip enzh_data.zip

--2020-02-19 20:47:18--  https://competitions.codalab.org/my/datasets/download/03e23bd7-8084-4542-997b-6a1ca6dd8a5f
Resolving competitions.codalab.org (competitions.codalab.org)... 129.175.22.230
Connecting to competitions.codalab.org (competitions.codalab.org)|129.175.22.230|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://newcodalab.lri.fr/prod-private/dataset_data_file/None/630ec/en-zh.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=39f30aabc3f4b3c50b4db6d373d3c4ccbad62150c90fd0e3d1e23c1c941cd8f6&X-Amz-Date=20200219T204718Z&X-Amz-Credential=AZIAIOSAODNN7EX123LE%2F20200219%2Fnewcodalab%2Fs3%2Faws4_request [following]
--2020-02-19 20:47:19--  https://newcodalab.lri.fr/prod-private/dataset_data_file/None/630ec/en-zh.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=39f30aabc3f4b3c50b4db6d373d3c4ccbad62150c90fd0e3d1e23c1c941cd8f6&X-Amz-Date=20200219T204

Check data downloaded successfully:

In [3]:
with open("./train.enzh.src", "r") as enzh_src:
  print("Source: ",enzh_src.readline())
with open("./train.enzh.mt", "r") as enzh_mt:
  print("Translation: ",enzh_mt.readline())
with open("./train.enzh.scores", "r") as enzh_scores:
  print("Score: ",enzh_scores.readline())

Source:  The last conquistador then rides on with his sword drawn.

Translation:  最后的征服者骑着他的剑继续前进.

Score:  -1.5284005772625449



### English Models Setup

Download English models:

In [4]:
!spacy download en_core_web_md
!spacy link en_core_web_md en300

Collecting en_core_web_md==2.1.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.1.0/en_core_web_md-2.1.0.tar.gz (95.4MB)
[K     |████████████████████████████████| 95.4MB 1.6MB/s 
[?25hBuilding wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-md: filename=en_core_web_md-2.1.0-cp36-none-any.whl size=97126236 sha256=f17908a069b90da5980199c9e1207523c2215ee5f66f5a499b75f832eef21d52
  Stored in directory: /tmp/pip-ephem-wheel-cache-ze03tpr_/wheels/c1/2c/5f/fd7f3ec336bf97b0809c86264d2831c5dfb00fc2e239d1bb01
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.6/dist-packages/en_core_web_md -->
/usr/local/lib/py

Load a GloVe English model with dim 100.

Some Chinese models only have **dim 100**, so we will need to **tokenize with spaCy, then embed with GloVe**.

In [5]:
import torchtext
import spacy

# Embedding for English when dim 100
glove = torchtext.vocab.GloVe(name='6B', dim=100)

# Tokenizer for English when dim 100, Tokenizer and Embedding when dim 300
nlp_en = spacy.load('en300')


.vector_cache/glove.6B.zip: 862MB [06:27, 2.22MB/s]                           
100%|█████████▉| 398452/400000 [00:31<00:00, 25209.71it/s]

Functions for processing English dataset:

In [6]:
import numpy as np
import torch
from nltk import download
from nltk.corpus import stopwords

#downloading stopwords from the nltk package
download('stopwords') #stopwords dictionary, run once
stop_words_en = set(stopwords.words('english'))


def preprocess_en(sentence, nlp):
    text = sentence.lower()
    doc = [token.lemma_ for token in  nlp.tokenizer(text)]
    doc = [word for word in doc if word not in stop_words_en]
    doc = [word for word in doc if word.isalpha()] #restricts string to alphabetic characters only
    return doc

def get_word_vector_en(embeddings, word):
    try:
      vec = embeddings.vectors[embeddings.stoi[word]]
      return vec
    except KeyError:
      #print(f"Word {word} does not exist")
      pass
      

def get_sentence_emb_en(line, nlp):
  text = line.lower()
  l = [token.lemma_ for token in nlp.tokenizer(text)]
  l = ' '.join([word for word in l if word not in stop_words_en])

  sen = nlp(l)
  return sen.vector


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Chinese Models Setup

Download Chinese stopwords:

In [7]:
!wget -c https://github.com/Tony607/Chinese_sentiment_analysis/blob/master/data/chinese_stop_words.txt

--2020-02-19 20:55:32--  https://github.com/Tony607/Chinese_sentiment_analysis/blob/master/data/chinese_stop_words.txt
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘chinese_stop_words.txt’

chinese_stop_words.     [  <=>               ] 416.74K  1.59MB/s    in 0.3s    

2020-02-19 20:55:33 (1.59 MB/s) - ‘chinese_stop_words.txt’ saved [426741]



Download and load Chinese model with **dim 100** (University of Oslo):

In [8]:
if not exists('zh_100.zip'):
  !wget -O zh_100.zip http://vectors.nlpl.eu/repository/20/35.zip
  !unzip zh_100.zip -d ./zh_100

from gensim.models import KeyedVectors

wv_from_bin_100 = KeyedVectors.load_word2vec_format("./zh_100/model.bin", binary=True) 

--2020-02-19 20:55:34--  http://vectors.nlpl.eu/repository/20/35.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1458485917 (1.4G) [application/zip]
Saving to: ‘zh_100.zip’


2020-02-19 20:57:01 (16.2 MB/s) - ‘zh_100.zip’ saved [1458485917/1458485917]

Archive:  zh_100.zip
  inflating: ./zh_100/LIST           
  inflating: ./zh_100/meta.json      
  inflating: ./zh_100/model.bin      
  inflating: ./zh_100/model.txt      
  inflating: ./zh_100/README         


  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


Functions for processing Chinese dataset:

In [0]:
import string
import jieba
import gensim 
import spacy
import numpy as np

stop_words = [ line.rstrip() for line in open('./chinese_stop_words.txt',"r", encoding="utf-8") ]

def processing_zh(sentence):
  seg_list = jieba.lcut(sentence,cut_all=True)
  doc = [word for word in seg_list if word not in stop_words]
  docs = [e for e in doc if e.isalnum()]
  return docs


## Process Scores

In [0]:
import spacy
import torchtext
from torchtext import data

f_train_scores = open("./train.enzh.scores", 'r')
zh_train_scores = f_train_scores.readlines()

f_val_scores = open("./dev.enzh.scores", 'r')
zh_val_scores = f_val_scores.readlines()

train_scores = np.array(zh_train_scores).astype(float)
y_train_zh = train_scores

val_scores = np.array(zh_val_scores).astype(float)
y_val_zh = val_scores

## Support Vector Machines


In [0]:
# Setup
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

from sklearn.svm import SVR
from scipy.stats.stats import pearsonr


### Using Average Word Embedding Vectors



In [0]:
def get_avg_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.mean(vectors, axis=0).tolist()
  return [0] * 100

def get_avg_embeddings_zh(f, word_vectors):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_avg_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_avg_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.mean(torch.stack(vectors), dim=0).tolist()
  return [0] * 100

# assume dim 100
def get_avg_embeddings_en(f, embeddings, nlp):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_avg_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except:
      sentences_vectors.append(0)

  return sentences_vectors

In [13]:
zh_train_mt_100_a = get_avg_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_a = get_avg_embeddings_en("./train.enzh.src", glove, nlp_en)

zh_val_mt_100_a = get_avg_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_a = get_avg_embeddings_en("./dev.enzh.src", glove, nlp_en)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.838 seconds.
Prefix dict has been built successfully.


In [0]:
X_train_100_a = [x + y for x, y in zip(zh_train_src_100_a, zh_train_mt_100_a)]
X_train_zh_100_a = np.array(X_train_100_a)

X_val_100_a = [x + y for x, y in zip(zh_val_src_100_a, zh_val_mt_100_a)]
X_val_zh_100_a = np.array(X_val_100_a)


In [0]:
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_a, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_a)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
linear
RMSE: 0.9044962563186333 Pearson 0.3017781690203462

poly
RMSE: 0.8990697909416231 Pearson 0.3032902746054339

rbf
RMSE: 0.8900985622788053 Pearson 0.3403404558003603

sigmoid
RMSE: 7.152607007355879 Pearson -0.03977439348067312
"""

linear
RMSE: 0.9044962563186333 Pearson 0.3017781690203462

poly
RMSE: 0.8990697909416231 Pearson 0.3032902746054339

rbf
RMSE: 0.8900985622788053 Pearson 0.3403404558003603

sigmoid
RMSE: 7.152607007355879 Pearson -0.03977439348067312



### Using Sum of Word Embedding Vectors

In [0]:
def get_sum_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.sum(vectors, axis=0).tolist()
  return [0] * 100

def get_sum_embeddings_zh(f, word_vectors):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_sum_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_sum_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.sum(torch.stack(vectors), dim=0).tolist()
  return [0] * 100

# assume dim 100
def get_sum_embeddings_en(f, embeddings, nlp):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_sum_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except:
      sentences_vectors.append(0)

  return sentences_vectors

In [0]:
zh_train_mt_100_s = get_sum_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_s = get_sum_embeddings_en("./train.enzh.src", glove, nlp_en)

zh_val_mt_100_s = get_sum_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_s = get_sum_embeddings_en("./dev.enzh.src", glove, nlp_en)

In [0]:
X_train_100_s = [x + y for x, y in zip(zh_train_src_100_s, zh_train_mt_100_s)]
X_train_zh_100_s = np.array(X_train_100_s)

X_val_100_s = [x + y for x, y in zip(zh_val_src_100_s, zh_val_mt_100_s)]
X_val_zh_100_s = np.array(X_val_100_s)

In [0]:
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_s, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_s)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
linear
RMSE: 0.9203004562821944 Pearson 0.2525785904226626

poly
RMSE: 0.9499024011434094 Pearson 0.18792023737965777

rbf
RMSE: 0.905470160037738 Pearson 0.29378234780721474

sigmoid
RMSE: 34.73144893811673 Pearson -0.004944951711832229
"""

linear
RMSE: 0.9203004562821944 Pearson 0.2525785904226626

poly
RMSE: 0.9499024011434094 Pearson 0.18792023737965777

rbf
RMSE: 0.905470160037738 Pearson 0.29378234780721474

sigmoid
RMSE: 34.73144893811673 Pearson -0.004944951711832229



### Using Min/Max of Word Embedding Vectors

In [0]:
def get_min_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.amin(vectors, axis=0).tolist()
  return [0] * 100

def get_min_embeddings_zh(f, word_vectors):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_min_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_min_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.min(torch.stack(vectors), dim=0)[0].tolist()
  return [0] * 100

# assume dim 100
def get_min_embeddings_en(f, embeddings, nlp):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_min_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except:
      sentences_vectors.append(0)

  return sentences_vectors


def get_max_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.amax(vectors, axis=0).tolist()
  return [0] * 100

def get_max_embeddings_zh(f, word_vectors):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_max_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_max_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.max(torch.stack(vectors), dim=0)[0].tolist()
  return [0] * 100

# assume dim 100
def get_max_embeddings_en(f, embeddings, nlp):
  file = open(f) 
  lines = file.readlines() 
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_max_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except:
      sentences_vectors.append(0)

  return sentences_vectors

In [0]:
zh_train_mt_100_min = get_min_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_min = get_min_embeddings_en("./train.enzh.src", glove, nlp_en)
zh_val_mt_100_min = get_min_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_min = get_min_embeddings_en("./dev.enzh.src", glove, nlp_en)

zh_train_mt_100_max = get_max_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_max = get_max_embeddings_en("./train.enzh.src", glove, nlp_en)
zh_val_mt_100_max = get_max_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_max = get_max_embeddings_en("./dev.enzh.src", glove, nlp_en)

In [106]:
X_train_100_min = [x + y for x, y in zip(zh_train_src_100_min, zh_train_mt_100_min)]
X_train_zh_100_min = np.array(X_train_100_min)
X_val_100_min = [x + y for x, y in zip(zh_val_src_100_min, zh_val_mt_100_min)]
X_val_zh_100_min = np.array(X_val_100_min)

X_train_100_max = [x + y for x, y in zip(zh_train_src_100_max, zh_train_mt_100_max)]
X_train_zh_100_max = np.array(X_train_100_max)
X_val_100_max = [x + y for x, y in zip(zh_val_src_100_max, zh_val_mt_100_max)]
X_val_zh_100_max = np.array(X_val_100_max)

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 230, in _feed
    close()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor


In [0]:
print("min")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_min, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_min)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_max, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_max)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
min
linear
RMSE: 0.9174155670157896 Pearson 0.26925288419123733

poly
RMSE: 0.9721279726125316 Pearson 0.23475167649201809

rbf
RMSE: 0.9049414520907609 Pearson 0.2920329610610498

sigmoid
RMSE: 26.865545536417198 Pearson 0.0076818897644271664

max
linear
RMSE: 0.9288259124772581 Pearson 0.2342204449314311

poly
RMSE: 1.0188986421056523 Pearson 0.18458085384794573

rbf
RMSE: 0.9074016100647716 Pearson 0.28092258260753544

sigmoid
RMSE: 26.529780761913454 Pearson -0.007926148030142141
"""


min
linear
RMSE: 0.9174155670157896 Pearson 0.26925288419123733

poly
RMSE: 0.9721279726125316 Pearson 0.23475167649201809

rbf
RMSE: 0.9049414520907609 Pearson 0.2920329610610498

sigmoid
RMSE: 26.865545536417198 Pearson 0.0076818897644271664

max
linear
RMSE: 0.9288259124772581 Pearson 0.2342204449314311

poly
RMSE: 1.0188986421056523 Pearson 0.18458085384794573

rbf
RMSE: 0.9074016100647716 Pearson 0.28092258260753544

sigmoid
RMSE: 26.529780761913454 Pearson -0.007926148030142141



### Combinations

In [0]:
# min + max
X_train_100_mm = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_max, zh_train_mt_100_min, zh_train_mt_100_max)]
X_train_zh_100_mm = np.array(X_train_100_mm)
X_val_100_mm = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_max, zh_val_mt_100_min, zh_val_mt_100_max)]
X_val_zh_100_mm = np.array(X_val_100_mm)

# min + avg + max
X_train_100_mam = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_a, zh_train_src_100_max, zh_train_mt_100_min, zh_train_mt_100_a, zh_train_mt_100_max)]
X_train_zh_100_mam = np.array(X_train_100_mam)
X_val_100_mam = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_a, zh_val_src_100_max, zh_val_mt_100_min, zh_val_mt_100_a, zh_val_mt_100_max)]
X_val_zh_100_mam = np.array(X_val_100_mam)

# avg + sum
X_train_100_as = [sum(t, []) for t in zip(zh_train_src_100_a, zh_train_src_100_s, zh_train_mt_100_a, zh_train_mt_100_s)]
X_train_zh_100_as = np.array(X_train_100_mam)
X_val_100_as = [sum(t, []) for t in zip(zh_val_src_100_a, zh_val_src_100_s, zh_val_mt_100_a, zh_val_mt_100_s)]
X_val_zh_100_as = np.array(X_val_100_mam)

# min + avg + max + sum
X_train_100_mams = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_a, zh_train_src_100_max, zh_train_src_100_s, zh_train_mt_100_min, zh_train_mt_100_a, zh_train_mt_100_max, zh_train_src_100_s)]
X_train_zh_100_mams = np.array(X_train_100_mam)
X_val_100_mams = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_a, zh_val_src_100_max, zh_val_src_100_s, zh_val_mt_100_min, zh_val_mt_100_a, zh_val_mt_100_max, zh_val_mt_100_s)]
X_val_zh_100_mams = np.array(X_val_100_mam)

In [0]:
print("min + max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mm, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mm)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("min + avg + max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mam, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mam)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("avg + sum")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_as, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_as)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("min + avg + max + sum")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mams, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mams)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
min + max
linear
RMSE: 0.9165262212864648 Pearson 0.26668369681069803

poly
RMSE: 0.9081786905379852 Pearson 0.3058041114768354

rbf
RMSE: 0.9084429289156349 Pearson 0.30464644770878846

sigmoid
RMSE: 3.5248833190245237 Pearson -0.006792767476776504

min + avg + max
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

avg + sum
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

min + avg + max + sum
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135
"""

min + max
linear
RMSE: 0.9165262212864648 Pearson 0.26668369681069803

poly
RMSE: 0.9081786905379852 Pearson 0.3058041114768354

rbf
RMSE: 0.9084429289156349 Pearson 0.30464644770878846

sigmoid
RMSE: 3.5248833190245237 Pearson -0.006792767476776504

min + avg + max
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

avg + sum
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

min + avg + max + sum
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.0141423

## FFNN

### Setup Environment

In [15]:
import torch
from torch import nn
import time
import math

###############
# Torch setup #
###############
print('Torch version: {}, CUDA: {}'.format(torch.__version__, torch.version.cuda))
cuda_available = torch.cuda.is_available()
if not torch.cuda.is_available():
  print('WARNING: You may want to change the runtime to GPU for Neural LM experiments!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'


Torch version: 1.4.0, CUDA: 10.1


In [0]:
import torch.nn.functional as F
import torch.utils.data as Data
from torch.autograd import Variable
from sklearn.metrics import mean_squared_error


def ffln(train, valid, hidden_sizes=[64], batch_size=64, epochs=100, verbose=2, early_stop=True):
  torch.manual_seed(42)

  sizes = [train[0].size] + hidden_sizes
  prev_s = None
  layers = []
  for s in sizes:
    if prev_s:
      layers.append(nn.Linear(prev_s, s).cuda())
      layers.append(nn.LeakyReLU().cuda())
    prev_s = s
  layers.append(nn.Linear(prev_s, 1).cuda())

  net = nn.Sequential(*layers)
  
  optimizer = torch.optim.Adam(net.parameters(), lr=0.0001)
  loss_func = nn.MSELoss()

  net_X = Variable(torch.from_numpy(train))
  net_y = Variable(torch.from_numpy(y_train_zh))
  torch_dataset = Data.TensorDataset(net_X, net_y)
  
  loader = Data.DataLoader(
      dataset=torch_dataset,
      batch_size=batch_size,
      shuffle=True,
      num_workers=2,
  )
  
  net = net.float()

  final_epoch = 0
  last_pearson = None
  for epoch in range(epochs):
    training_loss = 0
    for step, (batch_x, batch_y) in enumerate(loader):
      b_x = Variable(batch_x.float().to(DEVICE))
      b_y = Variable(batch_y.float().to(DEVICE))
      prediction = torch.flatten(net(b_x))
      loss = loss_func(prediction, b_y)
      training_loss += mean_squared_error(b_y.cpu().detach().numpy(), prediction.cpu().detach().numpy())
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

    if epoch % 10 == 0:
      training_loss = training_loss / step
      with torch.no_grad():
        net.eval()
        net_val = Variable(torch.from_numpy(valid).float().to(DEVICE))
        pred = torch.flatten(net.forward(net_val)).cpu().detach().numpy()
        net.train()
        pearson = pearsonr(y_val_zh, pred)
        if last_pearson and early_stop and last_pearson > pearson[0]:
          final_epoch = epoch
          break
        else:
          last_pearson = pearson[0]
        if verbose >= 2:
          print(f"Epoch {epoch} Training Loss: {training_loss}, Pearson Score: {pearson[0]}, MSE: {mean_squared_error(y_val_zh, pred)}")
    final_epoch = epoch
  
  with torch.no_grad():
    net.eval()
    net_val = Variable(torch.from_numpy(valid).float().to(DEVICE))
    pred = torch.flatten(net.forward(net_val)).cpu().detach().numpy()
    net.train()
    pearson = pearsonr(y_val_zh, pred)
    if verbose >= 1:
      print(f"Final Validation Pearson Score: {pearson[0]}")
    return pearson[0], final_epoch


In [187]:
hidden_sizes = [16, 32, 64, 128, 256]
batch_sizes = [128, 256, 512, 1024, 2048]


print("| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |")
print("|---|---|---|---|")
for h in hidden_sizes:
  for b in batch_sizes:
    p, e = ffln(X_train_zh_100_a, X_val_zh_100_a, hidden_sizes=[h], batch_size=b, epochs=500, verbose=0)
    print(f"| {h} | {b} | {e} | {p} |")



| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |
|---|---|---|---|
| 16 | 128 | 110 | 0.3405388743143982 |
| 16 | 256 | 170 | 0.3397500889705397 |
| 16 | 512 | 240 | 0.3397802382015175 |
| 16 | 1024 | 360 | 0.33749458056571224 |
| 16 | 2048 | 450 | 0.33555132728957543 |
| 32 | 128 | 150 | 0.35763019712700345 |
| 32 | 256 | 200 | 0.35874265186421816 |
| 32 | 512 | 280 | 0.3589732991591593 |
| 32 | 1024 | 480 | 0.36208552479635675 |
| 32 | 2048 | 480 | 0.35664063839674903 |
| 64 | 128 | 110 | 0.3551239486036262 |
| 64 | 256 | 150 | 0.35456388642073894 |
| 64 | 512 | 200 | 0.35440647538707865 |
| 64 | 1024 | 300 | 0.35546352390181984 |
| 64 | 2048 | 340 | 0.3511669538054668 |
| 128 | 128 | 80 | 0.35767577699197606 |
| 128 | 256 | 100 | 0.358928734182707 |
| 128 | 512 | 190 | 0.36029386571754496 |
| 128 | 1024 | 260 | 0.3604532449340413 |
| 128 | 2048 | 260 | 0.35858468110668523 |
| 256 | 128 | 70 | 0.35963275593845023 |
| 256 | 256 | 70 | 0.35865215871534833 |
| 256 | 512

In [19]:
import itertools

sizes = [8, 16, 32, 64, 128, 256]
batch_sizes = [128, 256, 512, 1024]
sizes = list(list(t) for t in itertools.product(sizes, sizes))
search = list(itertools.product(sizes, batch_sizes))

print("| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |")
print("|---|---|---|---|")
for s, b in search:
  p, e = ffln(X_train_zh_100_a, X_val_zh_100_a, hidden_sizes=s, batch_size=b, epochs=500, verbose=0)
  print(f"| {s} | {b} | {e} | {p} |")

| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |
|---|---|---|---|
| [8, 8] | 128 | 100 | 0.31607020961324017 |
| [8, 8] | 256 | 190 | 0.325676763726344 |
| [8, 8] | 512 | 300 | 0.3324969548357744 |
| [8, 8] | 1024 | 360 | 0.3281781902107646 |
| [8, 16] | 128 | 100 | 0.3209777141863661 |
| [8, 16] | 256 | 130 | 0.3221153224067193 |
| [8, 16] | 512 | 190 | 0.3231566499936152 |
| [8, 16] | 1024 | 280 | 0.3230498553232637 |
| [8, 32] | 128 | 120 | 0.3346557475599783 |
| [8, 32] | 256 | 180 | 0.33722424679038093 |
| [8, 32] | 512 | 260 | 0.3366840707272191 |
| [8, 32] | 1024 | 380 | 0.33902853923076354 |
| [8, 64] | 128 | 70 | 0.32322565055032326 |
| [8, 64] | 256 | 100 | 0.32140369828739085 |
| [8, 64] | 512 | 140 | 0.32344015250985336 |
| [8, 64] | 1024 | 210 | 0.32310159143028583 |
| [8, 128] | 128 | 110 | 0.32803432994968884 |
| [8, 128] | 256 | 100 | 0.3225687835232611 |
| [8, 128] | 512 | 150 | 0.324132143439027 |
| [8, 128] | 1024 | 250 | 0.33223561787958855 |
| [8,

In [180]:
ffln(X_train_zh_100_a, X_val_zh_100_a, hidden_sizes=[16], batch_size=256, epochs=500, verbose=0)

0.3397500889705397

| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |
|---|---|---|---|
| 16 | 128 | 110 | 0.3405388743143982 |
| 16 | 256 | 170 | 0.3397500889705397 |
| 16 | 512 | 240 | 0.3397802382015175 |
| 16 | 1024 | 360 | 0.33749458056571224 |
| 16 | 2048 | 450 | 0.33555132728957543 |
| 32 | 128 | 150 | 0.35763019712700345 |
| 32 | 256 | 200 | 0.35874265186421816 |
| 32 | 512 | 280 | 0.3589732991591593 |
| 32 | 1024 | 480 | **0.36208552479635675** |
| 32 | 2048 | 480 | 0.35664063839674903 |
| 64 | 128 | 110 | 0.3551239486036262 |
| 64 | 256 | 150 | 0.35456388642073894 |
| 64 | 512 | 200 | 0.35440647538707865 |
| 64 | 1024 | 300 | 0.35546352390181984 |
| 64 | 2048 | 340 | 0.3511669538054668 |
| 128 | 128 | 80 | 0.35767577699197606 |
| 128 | 256 | 100 | 0.358928734182707 |
| 128 | 512 | 190 | 0.36029386571754496 |
| 128 | 1024 | 260 | 0.3604532449340413 |
| 128 | 2048 | 260 | 0.35858468110668523 |
| 256 | 128 | 70 | 0.35963275593845023 |
| 256 | 256 | 70 | 0.35865215871534833 |
| 256 | 512 | 110 | 0.35839268431540866 |
| 256 | 1024 | 210 | 0.3596300201177217 |
| 256 | 2048 | 230 | 0.3589089199131525 |



| Hidden Sizes | Batch Size | Final Epoch | Final Pearson Score |
|---|---|---|---|
| [8, 8] | 128 | 100 | 0.31607020961324017 |
| [8, 8] | 256 | 190 | 0.325676763726344 |
| [8, 8] | 512 | 300 | 0.3324969548357744 |
| [8, 8] | 1024 | 360 | 0.3281781902107646 |
| [8, 16] | 128 | 100 | 0.3209777141863661 |
| [8, 16] | 256 | 130 | 0.3221153224067193 |
| [8, 16] | 512 | 190 | 0.3231566499936152 |
| [8, 16] | 1024 | 280 | 0.3230498553232637 |
| [8, 32] | 128 | 120 | 0.3346557475599783 |
| [8, 32] | 256 | 180 | 0.33722424679038093 |
| [8, 32] | 512 | 260 | 0.3366840707272191 |
| [8, 32] | 1024 | 380 | 0.33902853923076354 |
| [8, 64] | 128 | 70 | 0.32322565055032326 |
| [8, 64] | 256 | 100 | 0.32140369828739085 |
| [8, 64] | 512 | 140 | 0.32344015250985336 |
| [8, 64] | 1024 | 210 | 0.32310159143028583 |
| [8, 128] | 128 | 110 | 0.32803432994968884 |
| [8, 128] | 256 | 100 | 0.3225687835232611 |
| [8, 128] | 512 | 150 | 0.324132143439027 |
| [8, 128] | 1024 | 250 | 0.33223561787958855 |
| [8, 256] | 128 | 70 | 0.32121855472735966 |
| [8, 256] | 256 | 100 | 0.3237546131325545 |
| [8, 256] | 512 | 140 | 0.3329252849480922 |
| [8, 256] | 1024 | 210 | 0.3336608951953127 |
| [16, 8] | 128 | 110 | 0.3338218750701204 |
| [16, 8] | 256 | 160 | 0.33367430200846443 |
| [16, 8] | 512 | 260 | 0.33783487037907806 |
| [16, 8] | 1024 | 380 | 0.3343602634895123 |
| [16, 16] | 128 | 90 | 0.32751355162369783 |
| [16, 16] | 256 | 130 | 0.32544911716819336 |
| [16, 16] | 512 | 180 | 0.32611950256584715 |
| [16, 16] | 1024 | 300 | 0.3252353755969807 |
| [16, 32] | 128 | 90 | 0.33678115996980906 |
| [16, 32] | 256 | 120 | 0.33694877569799886 |
| [16, 32] | 512 | 150 | 0.3379517207724726 |
| [16, 32] | 1024 | 240 | 0.33864793192955467 |
| [16, 64] | 128 | 90 | 0.33546713766353936 |
| [16, 64] | 256 | 110 | 0.3351268630755747 |
| [16, 64] | 512 | 170 | 0.334831960971154 |
| [16, 64] | 1024 | 240 | 0.3345301173314305 |
| [16, 128] | 128 | 50 | 0.32197850700097486 |
| [16, 128] | 256 | 70 | 0.321114292249135 |
| [16, 128] | 512 | 90 | 0.32243263757947094 |
| [16, 128] | 1024 | 150 | 0.3231851621602055 |
| [16, 256] | 128 | 50 | 0.3319655433287403 |
| [16, 256] | 256 | 70 | 0.3318783247706317 |
| [16, 256] | 512 | 100 | 0.32936809161762937 |
| [16, 256] | 1024 | 150 | 0.3309949833319247 |
| [32, 8] | 128 | 100 | 0.3545390609497202 |
| [32, 8] | 256 | 160 | 0.35443749508226174 |
| [32, 8] | 512 | 200 | 0.35279404732501746 |
| [32, 8] | 1024 | 300 | 0.35189691560051056 |
| [32, 16] | 128 | 80 | 0.3450570858370463 |
| [32, 16] | 256 | 120 | 0.3470644324236091 |
| [32, 16] | 512 | 150 | 0.34258651905040516 |
| [32, 16] | 1024 | 250 | 0.3433293927002053 |
| [32, 32] | 128 | 60 | 0.3461671087335401 |
| [32, 32] | 256 | 80 | 0.3440974263749527 |
| [32, 32] | 512 | 120 | 0.3445699352918116 |
| [32, 32] | 1024 | 220 | 0.345383131162951 |
| [32, 64] | 128 | 60 | 0.3502308183702131 |
| [32, 64] | 256 | 80 | 0.34882030623007176 |
| [32, 64] | 512 | 120 | 0.35037675287933817 |
| [32, 64] | 1024 | 160 | 0.3508613704841181 |
| [32, 128] | 128 | 40 | 0.34007629655976485 |
| [32, 128] | 256 | 60 | 0.34000653142576936 |
| [32, 128] | 512 | 90 | 0.3375105359908257 |
| [32, 128] | 1024 | 130 | 0.33929379357353306 |
| [32, 256] | 128 | 40 | 0.3462243601725155 |
| [32, 256] | 256 | 60 | 0.3434485621830905 |
| [32, 256] | 512 | 80 | 0.3450122144543767 |
| [32, 256] | 1024 | 120 | 0.34611127468897557 |
| [64, 8] | 128 | 80 | 0.33932535776382566 |
| [64, 8] | 256 | 100 | 0.34147346129887296 |
| [64, 8] | 512 | 140 | 0.3437998245201962 |
| [64, 8] | 1024 | 220 | 0.34299150676415635 |
| [64, 16] | 128 | 70 | 0.33874336234115937 |
| [64, 16] | 256 | 90 | 0.3437671822812732 |
| [64, 16] | 512 | 120 | 0.343263556245599 |
| [64, 16] | 1024 | 190 | 0.33977573867193467 |
| [64, 32] | 128 | 50 | 0.3442887197349294 |
| [64, 32] | 256 | 60 | 0.3453337603357219 |
| [64, 32] | 512 | 80 | 0.3474228567118349 |
| [64, 32] | 1024 | 120 | 0.3446385026618806 |
| [64, 64] | 128 | 50 | 0.34922983283257997 |
| [64, 64] | 256 | 70 | 0.3517278731666611 |
| [64, 64] | 512 | 90 | 0.3581329379744402 |
| [64, 64] | 1024 | 130 | 0.3564654015874357 |
| [64, 128] | 128 | 40 | 0.3360028177223755 |
| [64, 128] | 256 | 50 | 0.34156212595622587 |
| [64, 128] | 512 | 70 | 0.34200163122214994 |
| [64, 128] | 1024 | 100 | 0.34459876433987324 |
| [64, 256] | 128 | 40 | 0.34210065236058884 |
| [64, 256] | 256 | 50 | 0.3458023122705053 |
| [64, 256] | 512 | 70 | 0.3446082731532024 |
| [64, 256] | 1024 | 100 | 0.3467760543524617 |
| [128, 8] | 128 | 80 | 0.347945865626375 |
| [128, 8] | 256 | 100 | 0.3446894641672712 |
| [128, 8] | 512 | 140 | 0.3412687219325876 |
| [128, 8] | 1024 | 210 | 0.34220621977988247 |
| [128, 16] | 128 | 50 | 0.35183135764958157 |
| [128, 16] | 256 | 70 | 0.35388721228149544 |
| [128, 16] | 512 | 100 | 0.35624014377777374 |
| [128, 16] | 1024 | 150 | 0.3534698964396123 |
| [128, 32] | 128 | 50 | 0.33228526224288907 |
| [128, 32] | 256 | 70 | 0.33563942459467067 |
| [128, 32] | 512 | 90 | 0.3378259833179989 |
| [128, 32] | 1024 | 130 | 0.3390340362227212 |
| [128, 64] | 128 | 40 | 0.3405378288090267 |
| [128, 64] | 256 | 50 | 0.3458943126862943 |
| [128, 64] | 512 | 70 | 0.3477402042030332 |
| [128, 64] | 1024 | 100 | 0.3494080282981141 |
| [128, 128] | 128 | 30 | 0.3473700109493069 |
| [128, 128] | 256 | 40 | 0.3490906563275689 |
| [128, 128] | 512 | 60 | 0.34718029092073815 |
| [128, 128] | 1024 | 80 | 0.35043725615687976 |
| [128, 256] | 128 | 30 | 0.3284333649221471 |
| [128, 256] | 256 | 40 | 0.3343714298023622 |
| [128, 256] | 512 | 50 | 0.34002675577802416 |
| [128, 256] | 1024 | 70 | 0.34164066750370364 |
| [256, 8] | 128 | 50 | 0.35175351547264366 |
| [256, 8] | 256 | 60 | 0.3535180692365392 |
| [256, 8] | 512 | 90 | 0.3539472427754941 |
| [256, 8] | 1024 | 130 | 0.35419315646426636 |
| [256, 16] | 128 | 40 | 0.35841958055775397 |
| [256, 16] | 256 | 60 | 0.36049100401781636 |
| [256, 16] | 512 | 80 | 0.36290937059677664 |
| [256, 16] | 1024 | 110 | 0.3618579321062537 |
| [256, 32] | 128 | 40 | 0.3551907086727445 |
| [256, 32] | 256 | 50 | 0.35375870065567455 |
| [256, 32] | 512 | 70 | 0.3576068849944134 |
| [256, 32] | 1024 | 100 | 0.36105843960104067 |
| [256, 64] | 128 | 30 | 0.3537321580153696 |
| [256, 64] | 256 | 40 | 0.35352727459453176 |
| [256, 64] | 512 | 60 | 0.35321548914203643 |
| [256, 64] | 1024 | 80 | 0.3569290871829409 |
| [256, 128] | 128 | 30 | 0.3332589457261402 |
| [256, 128] | 256 | 40 | 0.3419634143990482 |
| [256, 128] | 512 | 50 | 0.3503789298387748 |
| [256, 128] | 1024 | 70 | 0.3533004091096382 |
| [256, 256] | 128 | 30 | 0.328592704747179 |
| [256, 256] | 256 | 30 | 0.34040318926962315 |
| [256, 256] | 512 | 40 | 0.34238660783044944 |
| [256, 256] | 1024 | 60 | 0.3419457472489287 |



## Results

(Haven't tested the function yet...)

In [0]:
import os
from google.colab import files
from zipfile import ZipFile

def writeScores(scores):
    fn = "predictions.txt"
    print("")
    with open(fn, 'w') as output_file:
        for idx,x in enumerate(scores):
            #out =  metrics[idx]+":"+str("{0:.2f}".format(x))+"\n"
            #print(out)
            output_file.write(f"{x}\n")


def downloadScores(method_name, scores):
  writeScores(scores)
  with ZipFile(f"en-zh_{method_name}.zip", "w") as newzip:
    newzip.write("predictions.txt")
  
  files.download(f"en-zh_{method_name}.zip")