<a href="https://colab.research.google.com/github/Rt247/Not_NLP_CW/blob/master/NLP_CW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Download datasets:

In [1]:
from os.path import exists

if not exists('enzh_data.zip'):
    !wget -O enzh_data.zip https://competitions.codalab.org/my/datasets/download/03e23bd7-8084-4542-997b-6a1ca6dd8a5f
    !unzip enzh_data.zip


Check that the data downloaded successfully:

In [2]:
with open("./train.enzh.src", "r") as enzh_src:
  print("Source: ",enzh_src.readline())
with open("./train.enzh.mt", "r") as enzh_mt:
  print("Translation: ",enzh_mt.readline())
with open("./train.enzh.scores", "r") as enzh_scores:
  print("Score: ",enzh_scores.readline())

Source:  The last conquistador then rides on with his sword drawn.

Translation:  最后的征服者骑着他的剑继续前进.

Score:  -1.5284005772625449
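Printing the first line of each file confirms that the files exist, but not that they are aligned. A minimal sketch of a stricter check follows, assuming the source, translation and score files should all have the same number of lines. The helper names `count_lines` and `check_parallel` are hypothetical and are not used elsewhere in this notebook.

```python
def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(paths):
    """Return per-file line counts, raising if the files are not aligned."""
    counts = {str(p): count_lines(p) for p in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Line counts differ: {counts}")
    return counts
```

It could be called as `check_parallel(["./train.enzh.src", "./train.enzh.mt", "./train.enzh.scores"])` once the archive has been unzipped.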



### English Models Setup

Download English models:

In [3]:
!spacy download en_core_web_md
# `spacy link` only exists in spaCy v2; under v3, load 'en_core_web_md' directly instead
!spacy link en_core_web_md en300


Load a GloVe English model with dim 100, and the spaCy English model (`en_core_web_md`) with dim 300.

Some Chinese models only have **dim 100**, so to match them on the English side we will need to **tokenize with spaCy, then embed with GloVe**.

Other Chinese models have **dim 300**, so for those we can **just use spaCy** for both tokenizing and embedding.

In [4]:
import torchtext
import spacy

# Embedding for English when dim 100
glove = torchtext.vocab.GloVe(name='6B', dim=100)

# Tokenizer for English when dim 100, Tokenizer and Embedding when dim 300
nlp_en = spacy.load('en300')



Functions for processing the English dataset:

In [5]:
import numpy as np
import torch
from nltk import download
from nltk.corpus import stopwords

#downloading stopwords from the nltk package
download('stopwords') #stopwords dictionary, run once
stop_words_en = set(stopwords.words('english'))


def preprocess_en(sentence, nlp):
    text = sentence.lower()
    doc = [token.lemma_ for token in  nlp.tokenizer(text)]
    doc = [word for word in doc if word not in stop_words_en]
    doc = [word for word in doc if word.isalpha()] #restricts string to alphabetic characters only
    return doc

def get_word_vector_en(embeddings, word):
    # Return the embedding for `word`, or None if it is out of vocabulary
    try:
        return embeddings.vectors[embeddings.stoi[word]]
    except KeyError:
        return None

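def get_sentence_vector_en(embeddings, line):
  vectors = []
  for w in line:
    emb = get_word_vector_en(embeddings, w)
    # Do not add the word if it is out of vocabulary
    if emb is not None:
      vectors.append(emb)

  # torch.mean without `dim` averages over every element, so each sentence is
  # reduced to a single scalar. torch.stack raises if `vectors` is empty.
  return torch.mean(torch.stack(vectors))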

def get_sentence_emb_en(line, nlp):
  text = line.lower()
  l = [token.lemma_ for token in nlp.tokenizer(text)]
  l = ' '.join([word for word in l if word not in stop_words_en])

  sen = nlp(l)
  return sen.vector


# By default we assume the Chinese model has dim 100, so the English
# sentences are embedded with the dim-100 GloVe vectors.
# If using a Chinese model with dim 300, set dim=300.
def get_embeddings_en(f, embeddings, nlp, dim=100):
  # Read all sentences; the context manager closes the file afterwards
  with open(f, "r", encoding="utf-8") as file:
    lines = file.readlines()
  sentences_vectors = []

  if dim == 300:
    for l in lines:
      vec = get_sentence_emb_en(l, nlp)
      if vec is not None:
        # Collapse the spaCy sentence vector to its scalar mean
        sentences_vectors.append(np.mean(vec))
      else:
        sentences_vectors.append(0)
    return sentences_vectors

  for l in lines:
    sentence = preprocess_en(l, nlp)
    try:
      vec = get_sentence_vector_en(embeddings, sentence)
      # .item() converts the 0-d tensor to a plain Python float
      sentences_vectors.append(vec.item())
    except RuntimeError:
      # torch.stack fails when no token of the sentence is in the vocabulary
      sentences_vectors.append(0)

  return sentences_vectors


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
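Note that the English helpers above call the mean without an axis, so every sentence is reduced to one scalar rather than to a 100- or 300-dimensional vector. The sketch below uses NumPy as a stand-in for `torch.mean` (the two behave the same way here) and two made-up 3-dimensional word vectors to show the difference between the two reductions.

```python
import numpy as np

# Two hypothetical 3-dimensional word vectors for one sentence
word_vectors = np.array([[1.0, 2.0, 3.0],
                         [3.0, 4.0, 5.0]])

# Mean over every element: one scalar per sentence (what the helpers compute)
scalar_feature = np.mean(word_vectors)

# Mean over the word axis only: one vector per sentence
vector_feature = np.mean(word_vectors, axis=0)

print(scalar_feature)   # 3.0
print(vector_feature)   # [2. 3. 4.]
```

The scalar form is what the rest of this notebook expects, since each sentence pair later becomes a two-column feature row. Switching to the per-dimension mean would require changing how the feature matrices are built.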


### Chinese Models Setup

Download Chinese stopwords:

In [9]:
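# Use the raw file URL: the github.com/.../blob/... address returns an HTML page, not the word list
!wget -c https://raw.githubusercontent.com/Tony607/Chinese_sentiment_analysis/master/data/chinese_stop_words.txt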

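A GitHub `blob` URL serves an HTML page rather than the raw file, so it is worth checking that the downloaded stop-word list is really plain text before using it. The following is a heuristic sketch only; `looks_like_html` is a hypothetical helper, and the 512-byte probe size is an arbitrary assumption.

```python
def looks_like_html(path, probe_bytes=512):
    """Heuristically detect an HTML page saved in place of a plain-text file."""
    with open(path, "rb") as handle:
        # Ignore leading whitespace and compare case-insensitively
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

If `looks_like_html("chinese_stop_words.txt")` returns `True`, the file should be downloaded again from a raw-content URL.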

Download and load Chinese model with **dim 100** (University of Oslo):

In [0]:
!wget -O zh_100.zip http://vectors.nlpl.eu/repository/20/35.zip

!unzip zh_100.zip -d ./zh_100

from gensim.models import KeyedVectors

wv_from_bin_100 = KeyedVectors.load_word2vec_format("./zh_100/model.bin", binary=True) 

Download and load Chinese model with **dim 300** (Kyubyong):

In [13]:
!pip install gdown

!gdown -O zh_300.zip https://drive.google.com/uc?id=0B0ZXk88koS2KNER5UHNDY19pbzQ

!unzip zh_300.zip -d ./zh_300

from gensim.models import Word2Vec

# Keep only the word vectors (`.wv`) so lookups work the same way as for the dim-100 model
wv_from_bin_300 = Word2Vec.load("./zh_300/zh.bin").wv

Downloading...
From: https://drive.google.com/uc?id=0B0ZXk88koS2KNER5UHNDY19pbzQ
To: /content/zh_300.zip
203MB [00:04, 42.8MB/s]
Archive:  zh_300.zip
  inflating: ./zh_300/zh.bin         
  inflating: ./zh_300/zh.tsv         
  inflating: ./zh_300/zh.bin.syn1neg.npy  
  inflating: ./zh_300/zh.bin.syn0.npy  


Functions for processing the Chinese dataset:

In [0]:
import jieba
import numpy as np

# Load the stop words into a set for fast membership tests
with open('./chinese_stop_words.txt', "r", encoding="utf-8") as f_stop:
  stop_words = {line.rstrip() for line in f_stop}


def get_sentence_vector_zh(line, word_vectors):
  vectors = []
  for w in line:
    try:
      vectors.append(word_vectors[w])
    except KeyError:
      pass  # Do not add the word if it is out of vocabulary
  if vectors:
    # np.mean without `axis` averages over every element: one scalar per sentence
    return np.mean(np.array(vectors))
  else:
    return 0


def processing_zh(sentence):
  seg_list = jieba.lcut(sentence,cut_all=True)
  doc = [word for word in seg_list if word not in stop_words]
  docs = [e for e in doc if e.isalnum()]
  return docs


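def get_embeddings_zh(f, word_vectors):
  # Read all sentences; the context manager closes the file afterwards
  with open(f, "r", encoding="utf-8") as file:
    lines = file.readlines()
  sentences_vectors = []
  for l in lines:
    sent = processing_zh(l)
    # get_sentence_vector_zh returns 0 (never None) when no token is in the
    # vocabulary, so every line contributes exactly one value
    sentences_vectors.append(get_sentence_vector_zh(sent, word_vectors))
  return sentences_vectors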
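The filtering step in `processing_zh` relies on `str.isalnum()`, which is `True` for Chinese characters and digits but `False` for punctuation, including full-width punctuation. The sketch below illustrates this on hand-segmented tokens that stand in for jieba output. The function name `filter_tokens` and the one-word stop list are assumptions made for the example.

```python
def filter_tokens(tokens, stop_words):
    """Drop stop words, then drop tokens that are not purely alphanumeric."""
    kept = [tok for tok in tokens if tok not in stop_words]
    return [tok for tok in kept if tok.isalnum()]


# Hypothetical, hand-segmented tokens standing in for jieba output
tokens = ["最后", "的", "征服者", "，", "2020", "."]
filtered = filter_tokens(tokens, stop_words={"的"})
print(filtered)  # ['最后', '征服者', '2020']
```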

## Embedding

In [24]:
import spacy
import torchtext
from torchtext import data

zh_train_mt_100 = get_embeddings_zh("./train.enzh.mt", wv_from_bin_100)
zh_train_mt_300 = get_embeddings_zh("./train.enzh.mt", wv_from_bin_300)
zh_train_src_100 = get_embeddings_en("./train.enzh.src", glove, nlp_en, dim=100)
zh_train_src_300 = get_embeddings_en("./train.enzh.src", glove, nlp_en, dim=300)
with open("./train.enzh.scores", 'r') as f_train_scores:
  zh_train_scores = f_train_scores.readlines()

zh_val_mt_100 = get_embeddings_zh("./dev.enzh.mt", wv_from_bin_100)
zh_val_mt_300 = get_embeddings_zh("./dev.enzh.mt", wv_from_bin_300)
zh_val_src_100 = get_embeddings_en("./dev.enzh.src", glove, nlp_en, dim=100)
zh_val_src_300 = get_embeddings_en("./dev.enzh.src", glove, nlp_en, dim=300)
with open("./dev.enzh.scores", 'r') as f_val_scores:
  zh_val_scores = f_val_scores.readlines()

  


Check that every sentence was embedded:

In [25]:
print(f"Training mt (100): {len(zh_train_mt_100)} Training mt (300): {len(zh_train_mt_300)} Training src (100): {len(zh_train_src_100)} Training src (300): {len(zh_train_src_300)}")
print()
print(f"Validation mt (100): {len(zh_val_mt_100)} Validation mt (300): {len(zh_val_mt_300)} Validation src (100): {len(zh_val_src_100)} Validation src (300): {len(zh_val_src_300)}")

Training mt (100): 7000 Training mt (300): 7000 Training src (100): 7000 Training src (300): 7000

Validation mt (100): 1000 Validation mt (300): 1000 Validation src (100): 1000 Validation src (300): 1000


Set up the input features and target scores:

In [0]:
import numpy as np

X_train_100 = [np.array(zh_train_src_100), np.array(zh_train_mt_100)]
X_train_zh_100 = np.array(X_train_100).transpose()

X_val_100 = [np.array(zh_val_src_100),np.array(zh_val_mt_100)]
X_val_zh_100 = np.array(X_val_100).transpose()

X_train_300 = [np.array(zh_train_src_300), np.array(zh_train_mt_300)]
X_train_zh_300 = np.array(X_train_300).transpose()

X_val_300 = [np.array(zh_val_src_300),np.array(zh_val_mt_300)]
X_val_zh_300 = np.array(X_val_300).transpose()

#Scores
train_scores = np.array(zh_train_scores).astype(float)
y_train_zh = train_scores

val_scores = np.array(zh_val_scores).astype(float)
y_val_zh = val_scores
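Because each sentence has been reduced to one scalar, stacking the source and translation lists and transposing gives a feature matrix with one row per sentence pair and two columns. A small sketch with placeholder values (not taken from the dataset) shows the resulting shape and an equivalent one-step construction with `np.column_stack`.

```python
import numpy as np

# Placeholder per-sentence scalars (hypothetical values, not from the dataset)
src = [0.10, 0.20, 0.30]
mt = [0.40, 0.50, 0.60]

# Same construction as in the cell above: stack the two lists, then transpose
X_transposed = np.array([np.array(src), np.array(mt)]).transpose()

# Equivalent one-step alternative
X_stacked = np.column_stack([src, mt])

print(X_transposed.shape)                       # (3, 2)
print(np.array_equal(X_transposed, X_stacked))  # True
```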

## Methods

**TODO:** add regression methods, e.g. SVM, random forest, etc.
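As a placeholder until the methods are chosen, here is one possible sketch of a support vector regressor from scikit-learn. It is fitted on synthetic data with the same two-column layout as `X_train_zh_100`. The random data, the seed and the hyperparameters (`kernel`, `C`, `epsilon`) are all assumptions for illustration and say nothing about how well the method works on this task.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-ins with the same layout as X_train_zh_100 / y_train_zh
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = rng.normal(size=200)
X_val = rng.normal(size=(50, 2))

# Untuned, illustrative hyperparameters
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X_train, y_train)

# One predicted score per validation row
predictions = model.predict(X_val)
print(predictions.shape)
```

In the notebook itself the synthetic arrays would be replaced by `X_train_zh_100`, `y_train_zh` and `X_val_zh_100` (or the dim-300 equivalents).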

## Results

(The functions below have not been tested yet.)

In [0]:
import os
from google.colab import files
from zipfile import ZipFile

def writeScores(scores):
    # Write one prediction per line to predictions.txt
    fn = "predictions.txt"
    with open(fn, 'w') as output_file:
        for x in scores:
            output_file.write(f"{x}\n")


def downloadScores(method_name, scores):
  writeScores(scores)
  with ZipFile(f"en-zh_{method_name}.zip", "w") as newzip:
    newzip.write("predictions.txt")
  
  files.download(f"en-zh_{method_name}.zip")
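Since `files.download` only works inside Colab, the write-and-zip logic can be exercised separately. The sketch below is a Colab-free variant that writes into a temporary directory and reads the archive back. The function `zip_predictions` and its `out_dir` parameter are hypothetical and are not part of the notebook's helpers.

```python
import os
import tempfile
from zipfile import ZipFile


def zip_predictions(scores, method_name, out_dir):
    """Write scores to predictions.txt and zip it; return the archive path."""
    pred_path = os.path.join(out_dir, "predictions.txt")
    with open(pred_path, "w") as handle:
        for score in scores:
            handle.write(f"{score}\n")

    zip_path = os.path.join(out_dir, f"en-zh_{method_name}.zip")
    with ZipFile(zip_path, "w") as archive:
        # arcname keeps predictions.txt at the top level of the archive
        archive.write(pred_path, arcname="predictions.txt")
    return zip_path


# Round trip: write two made-up scores, then read them back from the archive
with tempfile.TemporaryDirectory() as tmp:
    path = zip_predictions([0.5, -1.25], "demo", tmp)
    with ZipFile(path) as archive:
        names = archive.namelist()
        content = archive.read("predictions.txt").decode()

print(names)                 # ['predictions.txt']
print(content.splitlines())  # ['0.5', '-1.25']
```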