
## Setup

Download datasets:

In [2]:
from os.path import exists

if not exists('enzh_data.zip'):
    !wget -O enzh_data.zip https://competitions.codalab.org/my/datasets/download/03e23bd7-8084-4542-997b-6a1ca6dd8a5f
    !unzip enzh_data.zip

--2020-02-19 14:27:03--  https://competitions.codalab.org/my/datasets/download/03e23bd7-8084-4542-997b-6a1ca6dd8a5f
Resolving competitions.codalab.org (competitions.codalab.org)... 129.175.22.230
Connecting to competitions.codalab.org (competitions.codalab.org)|129.175.22.230|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://newcodalab.lri.fr/prod-private/dataset_data_file/None/630ec/en-zh.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=0c04ce13836af89df7bdde0ae9a04943355b13151112fada78ca89b9199ab942&X-Amz-Date=20200219T142704Z&X-Amz-Credential=AZIAIOSAODNN7EX123LE%2F20200219%2Fnewcodalab%2Fs3%2Faws4_request [following]
--2020-02-19 14:27:04--  https://newcodalab.lri.fr/prod-private/dataset_data_file/None/630ec/en-zh.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=0c04ce13836af89df7bdde0ae9a04943355b13151112fada78ca89b9199ab942&X-Amz-Date=20200219T142

Check that the data downloaded successfully:

In [0]:
with open("./train.enzh.src", "r") as enzh_src:
  print("Source: ",enzh_src.readline())
with open("./train.enzh.mt", "r") as enzh_mt:
  print("Translation: ",enzh_mt.readline())
with open("./train.enzh.scores", "r") as enzh_scores:
  print("Score: ",enzh_scores.readline())

### English Models Setup

Download English models:

In [0]:
!spacy download en_core_web_md
!spacy link en_core_web_md en300

Load a 100-dimensional GloVe English model.

Some Chinese models only have **dim 100**, so to match that dimension on the English side we **tokenize with spaCy, then embed with GloVe**.

In [0]:
import torchtext
import spacy

# Embedding for English when dim 100
glove = torchtext.vocab.GloVe(name='6B', dim=100)

# Tokenizer for English when dim 100, Tokenizer and Embedding when dim 300
nlp_en = spacy.load('en300')


Functions for processing English dataset:

In [0]:
import numpy as np
import torch
from nltk import download
from nltk.corpus import stopwords

#downloading stopwords from the nltk package
download('stopwords') #stopwords dictionary, run once
stop_words_en = set(stopwords.words('english'))


def preprocess_en(sentence, nlp):
    text = sentence.lower()
    doc = [token.lemma_ for token in  nlp.tokenizer(text)]
    doc = [word for word in doc if word not in stop_words_en]
    doc = [word for word in doc if word.isalpha()] #restricts string to alphabetic characters only
    return doc

def get_word_vector_en(embeddings, word):
    # Return the embedding for `word`, or None if it is out of vocabulary.
    try:
        return embeddings.vectors[embeddings.stoi[word]]
    except KeyError:
        return None

def get_sentence_emb_en(line, nlp):
  text = line.lower()
  l = [token.lemma_ for token in nlp.tokenizer(text)]
  l = ' '.join([word for word in l if word not in stop_words_en])

  sen = nlp(l)
  return sen.vector


### Chinese Models Setup

Download Chinese stopwords:

In [0]:
!wget -c https://raw.githubusercontent.com/Tony607/Chinese_sentiment_analysis/master/data/chinese_stop_words.txt

Download and load Chinese model with **dim 100** (University of Oslo):

In [0]:
if not exists("zh_100.zip"):
  !wget -O zh_100.zip http://vectors.nlpl.eu/repository/20/35.zip
  !unzip zh_100.zip -d ./zh_100

from gensim.models import KeyedVectors

wv_from_bin_100 = KeyedVectors.load_word2vec_format("./zh_100/model.bin", binary=True) 

Functions for processing Chinese dataset:

In [0]:
import string
import jieba
import gensim 
import spacy
import numpy as np

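# Use a set for fast membership tests, and close the file after reading.
with open('./chinese_stop_words.txt', "r", encoding="utf-8") as stop_words_file:
  stop_words = {line.rstrip() for line in stop_words_file}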

def processing_zh(sentence):
  seg_list = jieba.lcut(sentence,cut_all=True)
  doc = [word for word in seg_list if word not in stop_words]
  docs = [e for e in doc if e.isalnum()]
  return docs


### BERT embedding Setup


In [4]:
!pip install transformers
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel, AdamW, BertConfig

Collecting transformers

In [10]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [7]:
# Load pre-trained model tokenizer (vocabulary)
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Read the source sentences and their machine translations.
with open("./train.enzh.src") as f_src:
  original_texts = f_src.readlines()
with open("./train.enzh.mt") as f_mt:
  translated_texts = f_mt.readlines()

text_pairs = list(zip(original_texts, translated_texts))
MAX_LENGTH = 128

# Encode each (source, translation) pair as one BERT input, padded to MAX_LENGTH.
inputs = [
    tokenizer.encode_plus(original, text_pair=translated, add_special_tokens=True,
                          max_length=MAX_LENGTH, pad_to_max_length=True)
    for original, translated in text_pairs
]
input_ids = [d['input_ids'] for d in inputs]

input_attention_masks = [d['attention_mask'] for d in inputs]

# Scores
with open("./train.enzh.scores", 'r') as f_train_scores:
  zh_train_scores = f_train_scores.readlines()
labels = np.array(zh_train_scores).astype(float)

Loading BERT tokenizer...
{'input_ids': [101, 10117, 12469, 25735, 11849, 11059, 48543, 10107, 10135, 10169, 10226, 79400, 34788, 119, 102, 4458, 2775, 5718, 3763, 4463, 6457, 8575, 5778, 2196, 5718, 2570, 6352, 6356, 2568, 7701, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [0]:
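model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.to(device)  # follows the device chosen above, so this also works without a GPU
model.eval()      # make sure dropout is disabled for feature extraction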

In [0]:
def get_BERT_embedding(batch_start, batch_end, input_tokens, attention_masks):
  input_tensors = torch.tensor(input_tokens[batch_start:batch_end]).to(device)
  attention_mask_tensors = torch.tensor(attention_masks[batch_start:batch_end]).to(device)

  with torch.no_grad():
    outputs = model(input_tensors, attention_mask=attention_mask_tensors)
  # outputs[0] is the last hidden state; keep the [CLS] token (position 0) of each sequence.
  return outputs[0][:,0,:].cpu().numpy()
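
The function above embeds a single slice of the dataset per call. A minimal sketch of covering a whole dataset in fixed-size batches follows. `embed_in_batches`, `embed_fn`, `fake_embed`, and the batch size are hypothetical names and values, not part of the notebook; `embed_fn(start, end)` stands in for a call such as `get_BERT_embedding(start, end, input_ids, input_attention_masks)`.

```python
import numpy as np

def embed_in_batches(n_items, embed_fn, batch_size=32):
    """Call embed_fn(start, end) over consecutive slices and stack the results.

    embed_fn is assumed to return a 2-D array of shape (end - start, hidden_dim).
    """
    chunks = []
    for start in range(0, n_items, batch_size):
        end = min(start + batch_size, n_items)
        chunks.append(embed_fn(start, end))
    return np.vstack(chunks)

# Stand-in for the BERT call so the sketch runs without a model:
# each "embedding" is just the row index repeated 4 times.
def fake_embed(start, end):
    return np.repeat(np.arange(start, end)[:, None], 4, axis=1).astype(float)

features_all = embed_in_batches(10, fake_embed, batch_size=3)
print(features_all.shape)  # (10, 4)
```

Small batches keep GPU memory use bounded; the right batch size depends on the hardware and on `MAX_LENGTH`.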

In [0]:
features = get_BERT_embedding(1000, 2000, input_ids, input_attention_masks)
print(features)

[[ 0.25943062 -0.00346926  0.12020143 ...  0.04750847  0.07468242
  -0.07719439]
 [-0.09621606  0.10552414 -0.14640783 ...  0.2795364   0.17046465
   0.15454468]
 [-0.0317821   0.05737093  0.11677047 ...  0.1405592  -0.03532625
  -0.04134069]
 ...
 [ 0.02738766 -0.11463747 -0.12570816 ...  0.08636823 -0.10662194
  -0.0699797 ]
 [-0.19610456 -0.03125958  0.11257638 ...  0.03555025  0.03250408
  -0.23782566]
 [-0.04782587 -0.18706174 -0.05254837 ...  0.29173648  0.09936409
  -0.08668579]]


## Process Scores

In [0]:
import spacy
import torchtext
from torchtext import data

with open("./train.enzh.scores", 'r') as f_train_scores:
  zh_train_scores = f_train_scores.readlines()

with open("./dev.enzh.scores", 'r') as f_val_scores:
  zh_val_scores = f_val_scores.readlines()

train_scores = np.array(zh_train_scores).astype(float)
y_train_zh = train_scores

val_scores = np.array(zh_val_scores).astype(float)
y_val_zh = val_scores

## Support Vector Machines


In [0]:
# Setup
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

from sklearn.svm import SVR
from scipy.stats import pearsonr
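
The two metrics used here answer different questions: RMSE measures absolute error, while Pearson correlation only measures linear association. The toy sketch below (made-up numbers, with a hypothetical `pearson_r` helper built on `np.corrcoef` rather than `scipy`) shows predictions that are perfectly correlated with the targets yet have a large RMSE.

```python
import numpy as np

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

def pearson_r(x, y):
    # Pearson correlation is the off-diagonal entry of the 2x2 correlation matrix.
    return np.corrcoef(x, y)[0, 1]

targets = np.array([1.0, 2.0, 3.0, 4.0])
shifted = targets + 2.0  # perfectly correlated, but offset by a constant

print(rmse(shifted, targets))       # 2.0
print(pearson_r(shifted, targets))  # very close to 1.0
```

This is why a model can show a very large RMSE while its Pearson score stays near zero or near one: the two numbers should be read together.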


### Using Average Word Embedding Vectors



In [0]:
def get_avg_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except KeyError:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.mean(vectors, axis=0).tolist()
  return [0] * 100

def get_avg_embeddings_zh(f, word_vectors):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_avg_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_avg_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.mean(torch.stack(vectors), dim=0).tolist()
  return [0] * 100

# assume dim 100
def get_avg_embeddings_en(f, embeddings, nlp):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_avg_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except Exception:
      # Keep the row 100-dimensional so the feature matrix stays rectangular.
      sentences_vectors.append([0] * 100)

  return sentences_vectors
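
As a toy illustration of the pooling logic above, the sketch below averages the vectors of in-vocabulary tokens and falls back to a zero vector when none are found. The dictionary `toy_vectors`, the helper `pool_sentence`, and the 3-dimensional vectors are assumptions made for the example; the notebook itself uses 100-dimensional GloVe and word2vec models.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 100

toy_vectors = {
    "cat": np.array([1.0, 0.0, 2.0]),
    "sat": np.array([3.0, 4.0, 0.0]),
}

def pool_sentence(tokens, vectors, reduce=np.mean, dim=DIM):
    """Pool the vectors of in-vocabulary tokens; return zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return reduce(found, axis=0).tolist()

print(pool_sentence(["cat", "sat", "zzz"], toy_vectors))  # [2.0, 2.0, 1.0]
print(pool_sentence(["zzz"], toy_vectors))                # [0.0, 0.0, 0.0]
```

Passing `np.sum`, `np.amin`, or `np.amax` as `reduce` gives the other pooling variants with the same out-of-vocabulary handling.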

In [0]:
zh_train_mt_100_a = get_avg_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_a = get_avg_embeddings_en("./train.enzh.src", glove, nlp_en)

zh_val_mt_100_a = get_avg_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_a = get_avg_embeddings_en("./dev.enzh.src", glove, nlp_en)

In [0]:
X_train_100_a = [x + y for x, y in zip(zh_train_src_100_a, zh_train_mt_100_a)]
X_train_zh_100_a = np.array(X_train_100_a)

X_val_100_a = [x + y for x, y in zip(zh_val_src_100_a, zh_val_mt_100_a)]
X_val_zh_100_a = np.array(X_val_100_a)
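
Note that `x + y` in the cell above operates on Python lists, so it concatenates the source and translation vectors into one longer feature vector rather than adding them element-wise. A toy sketch with made-up 2-dimensional vectors:

```python
src_vec = [1.0, 2.0]    # toy "source" sentence vector
mt_vec = [10.0, 20.0]   # toy "translation" sentence vector

# List concatenation: the result has len(src_vec) + len(mt_vec) features.
combined = src_vec + mt_vec

# Element-wise addition, which is what `+` would do on NumPy arrays instead.
elementwise = [a + b for a, b in zip(src_vec, mt_vec)]

print(combined)     # [1.0, 2.0, 10.0, 20.0]
print(elementwise)  # [11.0, 22.0]
```

With 100-dimensional inputs on each side, the concatenated rows have 200 features, which only works if every sentence vector is a list of the same length.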


In [0]:
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_a, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_a)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
linear
RMSE: 0.9044962563186333 Pearson 0.3017781690203462

poly
RMSE: 0.8990697909416231 Pearson 0.3032902746054339

rbf
RMSE: 0.8900985622788053 Pearson 0.3403404558003603

sigmoid
RMSE: 7.152607007355879 Pearson -0.03977439348067312
"""

### Using Sum of Word Embedding Vectors

In [0]:
def get_sum_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except KeyError:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.sum(vectors, axis=0).tolist()
  return [0] * 100

def get_sum_embeddings_zh(f, word_vectors):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_sum_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_sum_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.sum(torch.stack(vectors), dim=0).tolist()
  return [0] * 100

# assume dim 100
def get_sum_embeddings_en(f, embeddings, nlp):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_sum_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except Exception:
      # Keep the row 100-dimensional so the feature matrix stays rectangular.
      sentences_vectors.append([0] * 100)

  return sentences_vectors

In [0]:
zh_train_mt_100_s = get_sum_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_s = get_sum_embeddings_en("./train.enzh.src", glove, nlp_en)

zh_val_mt_100_s = get_sum_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_s = get_sum_embeddings_en("./dev.enzh.src", glove, nlp_en)

In [0]:
X_train_100_s = [x + y for x, y in zip(zh_train_src_100_s, zh_train_mt_100_s)]
X_train_zh_100_s = np.array(X_train_100_s)

X_val_100_s = [x + y for x, y in zip(zh_val_src_100_s, zh_val_mt_100_s)]
X_val_zh_100_s = np.array(X_val_100_s)

In [0]:
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_s, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_s)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
linear
RMSE: 0.9203004562821944 Pearson 0.2525785904226626

poly
RMSE: 0.9499024011434094 Pearson 0.18792023737965777

rbf
RMSE: 0.905470160037738 Pearson 0.29378234780721474

sigmoid
RMSE: 34.73144893811673 Pearson -0.004944951711832229
"""

### Using Min/Max of Word Embedding Vectors

In [0]:
def get_min_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except KeyError:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.amin(vectors, axis=0).tolist()
  return [0] * 100

def get_min_embeddings_zh(f, word_vectors):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_min_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_min_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.min(torch.stack(vectors), dim=0)[0].tolist()
  return [0] * 100

# assume dim 100
def get_min_embeddings_en(f, embeddings, nlp):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_min_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except Exception:
      # Keep the row 100-dimensional so the feature matrix stays rectangular.
      sentences_vectors.append([0] * 100)

  return sentences_vectors


def get_max_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      emb = word_vectors[w]
      vectors.append(emb)
    except KeyError:
      pass #Do not add if the word is out of vocabulary
  if vectors:
    return np.amax(vectors, axis=0).tolist()
  return [0] * 100

def get_max_embeddings_zh(f, word_vectors):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]
  for l in lines:
    sent  = processing_zh(l)
    vec = get_max_sentence_vector_zh(sent, word_vectors)

    if vec is not None:
      sentences_vectors.append(vec)
    else:
      print(l)
  return sentences_vectors


def get_max_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    #do not add if the word is out of vocabulary
    if emb is not None:
      vectors.append(emb)
  if vectors:
    return torch.max(torch.stack(vectors), dim=0)[0].tolist()
  return [0] * 100

# assume dim 100
def get_max_embeddings_en(f, embeddings, nlp):
  with open(f) as file:
    lines = file.readlines()
  sentences_vectors =[]

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_max_sentence_vector_en(embeddings, sentence)
      sentences_vectors.append(vec)
    except Exception:
      # Keep the row 100-dimensional so the feature matrix stays rectangular.
      sentences_vectors.append([0] * 100)

  return sentences_vectors

In [0]:
zh_train_mt_100_min = get_min_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_min = get_min_embeddings_en("./train.enzh.src", glove, nlp_en)
zh_val_mt_100_min = get_min_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_min = get_min_embeddings_en("./dev.enzh.src", glove, nlp_en)

zh_train_mt_100_max = get_max_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_src_100_max = get_max_embeddings_en("./train.enzh.src", glove, nlp_en)
zh_val_mt_100_max = get_max_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_src_100_max = get_max_embeddings_en("./dev.enzh.src", glove, nlp_en)

In [0]:
X_train_100_min = [x + y for x, y in zip(zh_train_src_100_min, zh_train_mt_100_min)]
X_train_zh_100_min = np.array(X_train_100_min)
X_val_100_min = [x + y for x, y in zip(zh_val_src_100_min, zh_val_mt_100_min)]
X_val_zh_100_min = np.array(X_val_100_min)

X_train_100_max = [x + y for x, y in zip(zh_train_src_100_max, zh_train_mt_100_max)]
X_train_zh_100_max = np.array(X_train_100_max)
X_val_100_max = [x + y for x, y in zip(zh_val_src_100_max, zh_val_mt_100_max)]
X_val_zh_100_max = np.array(X_val_100_max)

In [0]:
print("min")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_min, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_min)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_max, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_max)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
min
linear
RMSE: 0.9174155670157896 Pearson 0.26925288419123733

poly
RMSE: 0.9721279726125316 Pearson 0.23475167649201809

rbf
RMSE: 0.9049414520907609 Pearson 0.2920329610610498

sigmoid
RMSE: 26.865545536417198 Pearson 0.0076818897644271664

max
linear
RMSE: 0.9288259124772581 Pearson 0.2342204449314311

poly
RMSE: 1.0188986421056523 Pearson 0.18458085384794573

rbf
RMSE: 0.9074016100647716 Pearson 0.28092258260753544

sigmoid
RMSE: 26.529780761913454 Pearson -0.007926148030142141
"""


### Combinations

In [0]:
# min + max
X_train_100_mm = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_max, zh_train_mt_100_min, zh_train_mt_100_max)]
X_train_zh_100_mm = np.array(X_train_100_mm)
X_val_100_mm = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_max, zh_val_mt_100_min, zh_val_mt_100_max)]
X_val_zh_100_mm = np.array(X_val_100_mm)

# min + avg + max
X_train_100_mam = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_a, zh_train_src_100_max, zh_train_mt_100_min, zh_train_mt_100_a, zh_train_mt_100_max)]
X_train_zh_100_mam = np.array(X_train_100_mam)
X_val_100_mam = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_a, zh_val_src_100_max, zh_val_mt_100_min, zh_val_mt_100_a, zh_val_mt_100_max)]
X_val_zh_100_mam = np.array(X_val_100_mam)

# avg + sum
X_train_100_as = [sum(t, []) for t in zip(zh_train_src_100_a, zh_train_src_100_s, zh_train_mt_100_a, zh_train_mt_100_s)]
X_train_zh_100_as = np.array(X_train_100_as)
X_val_100_as = [sum(t, []) for t in zip(zh_val_src_100_a, zh_val_src_100_s, zh_val_mt_100_a, zh_val_mt_100_s)]
X_val_zh_100_as = np.array(X_val_100_as)

# min + avg + max + sum
X_train_100_mams = [sum(t, []) for t in zip(zh_train_src_100_min, zh_train_src_100_a, zh_train_src_100_max, zh_train_src_100_s, zh_train_mt_100_min, zh_train_mt_100_a, zh_train_mt_100_max, zh_train_mt_100_s)]
X_train_zh_100_mams = np.array(X_train_100_mams)
X_val_100_mams = [sum(t, []) for t in zip(zh_val_src_100_min, zh_val_src_100_a, zh_val_src_100_max, zh_val_src_100_s, zh_val_mt_100_min, zh_val_mt_100_a, zh_val_mt_100_max, zh_val_mt_100_s)]
X_val_zh_100_mams = np.array(X_val_100_mams)
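
Because each combination above is spelled out by hand, it is easy to mix up variable names. One alternative, sketched here under the assumption that the pooled features are kept in a dictionary of equally sized matrices, is to build combinations by name with `np.hstack`. The helper `combine_features`, the dictionary `parts`, and its keys are hypothetical and use toy 3-dimensional data.

```python
import numpy as np

def combine_features(parts, names):
    """Concatenate the named feature matrices column-wise, in the given order."""
    return np.hstack([np.asarray(parts[name]) for name in names])

# Toy stand-ins: 2 sentences, 3-dimensional vectors per pooling method.
parts = {
    "src_min": np.zeros((2, 3)),
    "src_avg": np.ones((2, 3)),
    "mt_min": np.full((2, 3), 2.0),
    "mt_avg": np.full((2, 3), 3.0),
}

X_min_avg = combine_features(parts, ["src_min", "src_avg", "mt_min", "mt_avg"])
print(X_min_avg.shape)  # (2, 12)
```

Each combination is then a single list of names, so the training and validation matrices can be built from the same list.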

In [0]:
print("min + max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mm, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mm)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("min + avg + max")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mam, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mam)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("avg + sum")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_as, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_as)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

print("min + avg + max + sum")
for k in ['linear','poly','rbf','sigmoid']:
    clf_t = SVR(kernel=k)
    clf_t.fit(X_train_zh_100_mams, y_train_zh)
    print(k)
    predictions = clf_t.predict(X_val_zh_100_mams)
    pearson = pearsonr(y_val_zh, predictions)
    print(f'RMSE: {rmse(predictions,y_val_zh)} Pearson {pearson[0]}')
    print()

"""
min + max
linear
RMSE: 0.9165262212864648 Pearson 0.26668369681069803

poly
RMSE: 0.9081786905379852 Pearson 0.3058041114768354

rbf
RMSE: 0.9084429289156349 Pearson 0.30464644770878846

sigmoid
RMSE: 3.5248833190245237 Pearson -0.006792767476776504

min + avg + max
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

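avg + sum (same numbers as min + avg + max: recorded while X_train_zh_100_as and X_val_zh_100_as were built from the min + avg + max lists)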
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135

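min + avg + max + sum (same numbers as min + avg + max: recorded while X_train_zh_100_mams and X_val_zh_100_mams were built from the min + avg + max lists)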
linear
RMSE: 0.9081244309157551 Pearson 0.28851234087420935

poly
RMSE: 0.8982943971746072 Pearson 0.3296975280939544

rbf
RMSE: 0.8992479975159785 Pearson 0.3292233306043897

sigmoid
RMSE: 2.600000956750181 Pearson 0.014142349847365135
"""

## Results

(The helper functions below have not been tested yet.)

In [0]:
import os
from google.colab import files
from zipfile import ZipFile

def writeScores(scores):
    # Write one prediction per line to predictions.txt.
    fn = "predictions.txt"
    with open(fn, 'w') as output_file:
        for x in scores:
            output_file.write(f"{x}\n")


def downloadScores(method_name, scores):
  writeScores(scores)
  with ZipFile(f"en-zh_{method_name}.zip", "w") as newzip:
    newzip.write("predictions.txt")
  
  files.download(f"en-zh_{method_name}.zip")
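
Since `files.download` only works inside Colab, the write-and-zip step can be checked locally first. The sketch below is one way to do that, not the notebook's own code: `write_scores`, `zip_scores`, the `out_dir` parameter, and the method name `svr_rbf` are hypothetical, and the scores are made up.

```python
import os
import tempfile
from zipfile import ZipFile

def write_scores(scores, path):
    """Write one score per line."""
    with open(path, "w") as output_file:
        for score in scores:
            output_file.write(f"{score}\n")

def zip_scores(method_name, scores, out_dir):
    """Write predictions.txt into out_dir and zip it; returns the zip path."""
    pred_path = os.path.join(out_dir, "predictions.txt")
    write_scores(scores, pred_path)
    zip_path = os.path.join(out_dir, f"en-zh_{method_name}.zip")
    with ZipFile(zip_path, "w") as newzip:
        # arcname keeps the file at the archive root, without the directory prefix.
        newzip.write(pred_path, arcname="predictions.txt")
    return zip_path

# Round trip in a temporary directory: write, zip, then read back.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = zip_scores("svr_rbf", [0.5, -1.25], tmp)
    with ZipFile(zip_path) as z:
        names = z.namelist()
        content = z.read("predictions.txt").decode()

print(names)  # ['predictions.txt']
```

Reading the archive back confirms that it contains a single `predictions.txt` with one score per line before anything is uploaded.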