<a href="https://colab.research.google.com/github/SebastianJia/nlp_research_conceptor/blob/master/Re_implement_CN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Re-implementation of Conceptor Negation


*   Word-similarity task
*   STS (Semantic Textual Similarity) tasks



# Data
For both tasks, I used small subsets of the GloVe and word2vec word vectors, as well as the fastText English 1M word vectors. \
Small GloVe: https://drive.google.com/uc?id=1U_UGB2vyTuTIcbV_oeDtJCtAtlFMvXOM \
Small word2vec: https://drive.google.com/uc?id=1j_b4TRpL3f0HQ8mV17_CtOXp862YjxxB \
fastText 1M: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip

In [0]:
import numpy as np
import scipy, requests, codecs, os, re, nltk, itertools, csv
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering, KMeans
import tensorflow as tf
from scipy.stats import spearmanr
import pandas as pd
import functools as ft
import os
import io
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
!ls

EN-MEN-TR-3k.txt		     fasttext.bin
EN-MTurk-287.txt		     sample_data
EN-RG-65.txt			     SIF
EN-RW-STANFORD.txt		     small_glove.txt
EN-SIMLEX-999.txt		     small_word2vec.txt
EN-SimVerb-3500.txt		     wiki-news-300d-1M.vec
enwiki-20150602-words-frequency.txt  wiki-news-300d-1M.vec.zip
EN-WS-353-ALL.txt


In [0]:
!pip install -q gdown
!gdown https://drive.google.com/uc?id=1U_UGB2vyTuTIcbV_oeDtJCtAtlFMvXOM # download a small subset of glove
!gdown https://drive.google.com/uc?id=1j_b4TRpL3f0HQ8mV17_CtOXp862YjxxB # download a small subset of word2vec
!ls

Downloading...
From: https://drive.google.com/uc?id=1U_UGB2vyTuTIcbV_oeDtJCtAtlFMvXOM
To: /content/small_glove.txt
333MB [00:02, 165MB/s]
Downloading...
From: https://drive.google.com/uc?id=1j_b4TRpL3f0HQ8mV17_CtOXp862YjxxB
To: /content/small_word2vec.txt
267MB [00:01, 159MB/s]
sample_data  small_glove.txt  small_word2vec.txt


In [0]:
!gdown https://drive.google.com/uc?id=1Zl6a75Ybf8do9uupmrJWKQMnvqqme4fh

Downloading...
From: https://drive.google.com/uc?id=1Zl6a75Ybf8do9uupmrJWKQMnvqqme4fh
To: /content/fasttext.bin
2.42GB [00:55, 43.8MB/s]


In [0]:
!wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip
!unzip wiki-news-300d-1M.vec.zip
!ls

--2019-01-30 14:33:18--  https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.24.185
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.24.185|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2019-01-30 14:33:29 (63.5 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   
sample_data	 small_word2vec.txt	wiki-news-300d-1M.vec.zip
small_glove.txt  wiki-news-300d-1M.vec


# Load Fasttext, small GloVe and small word2vec data

In [0]:
import gensim

from gensim.models.keyedvectors import KeyedVectors



In [0]:
fasttext2 = KeyedVectors.load_word2vec_format('/content/' + 'fasttext.bin', binary=True)
print('The fasttext embedding has been loaded!')

The fasttext embedding has been loaded!


In [0]:
fasttext = KeyedVectors.load_word2vec_format('/content/' + 'wiki-news-300d-1M.vec')

In [0]:
!python -m gensim.scripts.glove2word2vec -i small_glove.txt -o small_glove_w2v.txt
!python -m gensim.scripts.glove2word2vec -i small_word2vec.txt -o small_w2v_w2v.txt

2019-01-30 14:55:36,843 - glove2word2vec - INFO - running /usr/local/lib/python3.6/dist-packages/gensim/scripts/glove2word2vec.py -i small_glove.txt -o small_glove_w2v.txt
2019-01-30 14:55:37,191 - glove2word2vec - INFO - converting 128607 vectors from small_glove.txt to small_glove_w2v.txt
2019-01-30 14:55:38,871 - glove2word2vec - INFO - Converted model with 128607 vectors and 300 dimensions
2019-01-30 14:55:40,147 - glove2word2vec - INFO - running /usr/local/lib/python3.6/dist-packages/gensim/scripts/glove2word2vec.py -i small_word2vec.txt -o small_w2v_w2v.txt
2019-01-30 14:55:40,359 - glove2word2vec - INFO - converting 76078 vectors from small_word2vec.txt to small_w2v_w2v.txt
2019-01-30 14:55:41,663 - glove2word2vec - INFO - Converted model with 76078 vectors and 300 dimensions


In [0]:
glove = KeyedVectors.load_word2vec_format('/content/' + 'small_glove_w2v.txt')
w2v = KeyedVectors.load_word2vec_format('/content/' + 'small_w2v_w2v.txt')

# Post-processing with CN
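
As a rough sketch of what the post-processing in this section computes: with the word vectors as columns of `X`, the correlation matrix is `R = X X^T / n`, the conceptor is `C = R (R + alpha^-2 I)^-1`, and every vector `x` is replaced by `(I - C) x`. The example below runs these steps on random toy data. The function name `conceptor_negation` and the toy matrix are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def conceptor_negation(vectors, alpha=2.0):
    """Toy version of the CN step. vectors has shape (n_words, dim)."""
    X = np.asarray(vectors, dtype=float).T           # columns are word vectors
    dim, n_words = X.shape
    R = X @ X.T / n_words                            # correlation matrix
    C = R @ np.linalg.inv(R + alpha ** (-2) * np.eye(dim))  # conceptor
    processed = ((np.eye(dim) - C) @ X).T            # apply (I - C) to every vector
    return processed, C

rng = np.random.default_rng(0)
# Hypothetical data: 1000 "words" in 5 dimensions sharing one dominant direction
shared = rng.normal(size=(1000, 1)) * 5.0 * np.ones((1, 5))
toy = shared + rng.normal(size=(1000, 5))

processed, C = conceptor_negation(toy, alpha=2.0)
print(processed.shape)
print(np.linalg.norm(toy), np.linalg.norm(processed))
```

Because the eigenvalues of `C` are close to 1 along high-variance directions and smaller elsewhere, `I - C` shrinks the dominant shared direction most strongly.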

In [0]:
# Use this function when the vocabulary has at most about 1M words
def cn_mat(pre_cn_f_name, alpha):
  # pre_cn_f_name is the name of a loaded KeyedVectors variable, e.g. 'glove'
  pre_cn_data = eval(pre_cn_f_name)
  # put the word vectors in columns: shape (dimension, number of words)
  word_vec = np.array(pre_cn_data.vectors, dtype = float).T
  num_word = word_vec.shape[1]
  num_vec = word_vec.shape[0]
  print(num_word, num_vec)
  # correlation matrix R = X X^T / n
  corr_mat = word_vec.dot(word_vec.T) /num_word
  print('got corr_mat')
  # conceptor C = R (R + alpha^-2 I)^-1
  concept_mat = corr_mat @ np.linalg.inv(corr_mat + alpha ** (-2) * np.eye(num_vec))
  print('got concep_mat')
  # apply the negated conceptor (I - C) to every vector; result has one word per row
  new_mat = ((np.eye(num_vec)-concept_mat)@word_vec).T
  print('got new_mat')
  return new_mat
  

In [0]:
cn_fasttext_mat = cn_mat('fasttext', alpha = 2)
print('CN preprocess done for fasttext data')
cn_glove_mat = cn_mat('glove', alpha = 2)
print('CN preprocess done for glove data')
cn_w2v_mat = cn_mat('w2v', alpha =2)
print('CN preprocess done for w2v data')

999994 300
got corr_mat
got concep_mat
got new_mat
CN preprocess done for fasttext data
128607 300
got corr_mat
got concep_mat
got new_mat
CN preprocess done for glove data
76078 300
got corr_mat
got concep_mat
got new_mat
CN preprocess done for w2v data


In [0]:
!wget https://raw.githubusercontent.com/IlyaSemenov/wikipedia-word-frequency/master/results/enwiki-20150602-words-frequency.txt

--2019-01-27 15:01:16--  https://raw.githubusercontent.com/IlyaSemenov/wikipedia-word-frequency/master/results/enwiki-20150602-words-frequency.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23741395 (23M) [text/plain]
Saving to: ‘enwiki-20150602-words-frequency.txt’


2019-01-27 15:01:16 (130 MB/s) - ‘enwiki-20150602-words-frequency.txt’ saved [23741395/23741395]



In [0]:
wikiWordsPath = '/content/' + 'enwiki-20150602-words-frequency.txt'
wikiWords = []
with open(wikiWordsPath, "r") as f_in:
    for line in f_in:
        one_line = line.split(' ')
        # keep words that occur more than 200 times
        if int(one_line[1]) > 200:
            wikiWords.append(one_line[0])

In [0]:
!git clone https://github.com/PrincetonML/SIF

Cloning into 'SIF'...
remote: Enumerating objects: 128, done.[K
remote: Total 128 (delta 0), reused 0 (delta 0), pack-reused 128[K
Receiving objects: 100% (128/128), 2.80 MiB | 12.46 MiB/s, done.
Resolving deltas: 100% (55/55), done.


In [0]:
wikiWordsPath = '/content/' + 'SIF/auxiliary_data/enwiki_vocab_min200.txt' # https://github.com/PrincetonML/SIF/blob/master/auxiliary_data/enwiki_vocab_min200.txt
wikiWords = []

with open(wikiWordsPath, "r") as f_in:
    for line in f_in:
        wikiWords.append(line.split(' ')[0])

In [0]:
print(len(wikiWords))

188033


In [0]:
from numpy.linalg import norm, inv, eig

In [0]:
def reduced_cn_mat(wordVecModel_str, alpha = 1):
    # compute the conceptor from the words shared with the Wikipedia vocabulary (alpha = 1 by default)
    wordVecModel = eval(wordVecModel_str)
    word_in_wiki_and_model = set(wordVecModel.vocab).intersection(wikiWords)

    # row indices of the shared words in wordVecModel.vectors
    x_collector_indices = [wordVecModel.vocab[word].index for word in word_in_wiki_and_model]

    # put the word vectors in columns
    x_collector = wordVecModel.vectors[x_collector_indices,:].T

    nrWords = x_collector.shape[1] # number of words used
    print(nrWords)

    R = x_collector.dot(x_collector.T) / nrWords # calculate the correlation matrix

    # calculate the conceptor matrix; the identity matches the vector dimension (300 here)
    concept_mat = R @ inv(R + alpha ** (-2) * np.eye(x_collector.shape[0]))

    return concept_mat
  

In [0]:
fasttext2_concept_mat = reduced_cn_mat('fasttext2', alpha = 1)
print('CN preprocess done for fasttext2 data')

119127
CN preprocess done for fasttext2 data


In [0]:
print(len(fasttext2_concept_mat))

300


# Experiment 1: Word similarity evaluation
I re-implemented the word similarity task by evaluating CN post-processed word vectors on 7 standard word similarity datasets: RG65 (Rubenstein and Goodenough, 1965), WordSim-353 (WS) (Finkelstein et al., 2002), Rare Words (RW) (Luong, Socher, and Manning, 2013), MEN (Bruni, Tran, and Baroni, 2014), MTurk (Radinsky et al., 2011), SimLex-999 (SimLex) (Hill, Reichart, and Korhonen, 2015), and SimVerb-3500 (Gerz et al., 2016). Each reported score is the Spearman correlation between the model's similarity ranking and the human ranking.
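
The evaluation idea can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own routine: the helper `evaluate_word_similarity`, the toy vectors, and the toy human scores are all assumptions, and it calls `scipy.stats.spearmanr` directly on the scores, which may treat ties differently from a rank-position comparison.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_word_similarity(pairs, vectors):
    """pairs: list of (word1, word2, human_score); vectors: dict word -> 1-D array."""
    human, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            human.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    rho, _ = spearmanr(human, predicted)
    return rho

# Hypothetical 2-D "embeddings" and human ratings
toy_vectors = {
    'cat': np.array([1.0, 0.0]),
    'dog': np.array([0.9, 0.1]),
    'car': np.array([0.0, 1.0]),
    'truck': np.array([0.2, 0.8]),
}
toy_pairs = [
    ('cat', 'dog', 9.0),
    ('car', 'truck', 8.5),
    ('cat', 'truck', 2.0),
    ('cat', 'car', 1.0),
    ('cat', 'unicorn', 5.0),   # dropped: 'unicorn' has no vector
]

rho = evaluate_word_similarity(toy_pairs, toy_vectors)
print(rho)
```

In this toy case the model ranks the four in-vocabulary pairs in the same order as the human scores, so the correlation is at its maximum.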


# Load word similarity text data

In [0]:
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-MEN-TR-3k.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-MTurk-287.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-RG-65.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-RW-STANFORD.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-SIMLEX-999.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-SimVerb-3500.txt
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-WS-353-ALL.txt
!ls

--2019-01-27 14:53:40--  https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-MEN-TR-3k.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53593 (52K) [text/plain]
Saving to: ‘EN-MEN-TR-3k.txt’


2019-01-27 14:53:40 (2.53 MB/s) - ‘EN-MEN-TR-3k.txt’ saved [53593/53593]

--2019-01-27 14:53:41--  https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/wordSimData/EN-MTurk-287.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7218 (7.0K) [text/plain]
Saving to: ‘EN-MTurk

In [0]:
!pwd

/content


# Compare word similarity scores and calculate Spearman correlation

In [0]:
def get_sim(data_f_name, cn_f_name, cn_mat, alpha):
  # cn_f_name: name of a loaded KeyedVectors variable, e.g. 'glove'
  # cn_mat: CN post-processed vectors, one row per word in the model's vocabulary order
  # alpha is not used here; it is kept so that existing calls still work
  cn_data = eval(cn_f_name)
  dataset = []
  with io.open(data_f_name, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
    for line in fin:
      tokens = line.rstrip().split()
      # keep only pairs where both words are in the vocabulary
      if tokens[0] in cn_data.vocab and tokens[1] in cn_data.vocab:
        dataset.append(((tokens[0], tokens[1]), float(tokens[2])))
  dataset.sort(key = lambda score: -score[1]) # sort by human score, most similar first

  cn_dataset = {}
  cn_dataset_list = []
  for ((word1, word2), score) in dataset:
    id1 = cn_data.vocab[word1].index
    id2 = cn_data.vocab[word2].index
    # cosine distance, so an ascending sort also puts the most similar pair first
    sim_score = 1 - cosine_similarity(cn_mat[id1].reshape(1,-1), cn_mat[id2].reshape(1,-1))[0][0]
    cn_dataset[(word1, word2)] = sim_score
    cn_dataset_list.append(((word1, word2),sim_score))
  cn_dataset_list.sort(key = lambda score: score[1])

  # Spearman correlation between each pair's position in the two orderings
  spearman_list1=[]
  spearman_list2=[]
  for pos_1, (pair, score_1) in enumerate(dataset):
    score_2 = cn_dataset[pair]
    pos_2 = cn_dataset_list.index((pair, score_2))
    spearman_list1.append(pos_1)
    spearman_list2.append(pos_2)
  rho = spearmanr(spearman_list1, spearman_list2)
  return rho[0]

In [0]:
def get_sim_large_data(data_f_name, cn_f_name, C, alpha):
  # C: conceptor matrix from reduced_cn_mat; each vector x is post-processed on the fly as x - C x
  # alpha is not used here; it is kept so that existing calls still work
  cn_data = eval(cn_f_name)
  dataset = []
  with io.open(data_f_name, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
    for line in fin:
      tokens = line.rstrip().split()
      # keep only pairs where both words are in the vocabulary
      if tokens[0] in cn_data.vocab and tokens[1] in cn_data.vocab:
        dataset.append(((tokens[0], tokens[1]), float(tokens[2])))
  dataset.sort(key = lambda score: -score[1]) # sort by human score, most similar first

  cn_dataset = {}
  cn_dataset_list = []
  for ((word1, word2), score) in dataset:
    vec1 = cn_data[word1] - C @ cn_data[word1]
    vec2 = cn_data[word2] - C @ cn_data[word2]
    # cosine distance, so an ascending sort also puts the most similar pair first
    sim_score = 1 - cosine_similarity(vec1.reshape(1,-1), vec2.reshape(1,-1))[0][0]
    cn_dataset[(word1, word2)] = sim_score
    cn_dataset_list.append(((word1, word2),sim_score))
  cn_dataset_list.sort(key = lambda score: score[1])

  # Spearman correlation between each pair's position in the two orderings
  spearman_list1=[]
  spearman_list2=[]
  for pos_1, (pair, score_1) in enumerate(dataset):
    score_2 = cn_dataset[pair]
    pos_2 = cn_dataset_list.index((pair, score_2))
    spearman_list1.append(pos_1)
    spearman_list2.append(pos_2)
  rho = spearmanr(spearman_list1, spearman_list2)
  return rho[0]



In [0]:
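# Note: get_sim_no_cn is defined in the "Without CN post-processing" section below; run that cell first.
dataSets = ['EN-RG-65.txt', 'EN-WS-353-ALL.txt', 'EN-RW-STANFORD.txt', 'EN-MEN-TR-3k.txt', 'EN-MTurk-287.txt', 'EN-SIMLEX-999.txt', 'EN-SimVerb-3500.txt']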
for dataset in dataSets:
    dataSetAddress = '/content/'+  dataset
    print('evaluating the data set', dataSetAddress)
    print('        Fasttext2')
    print('With CN ', "%.4f" % get_sim_large_data(dataSetAddress, 'fasttext2', fasttext2_concept_mat, 1))
    print('No   CN ', "%.4f" % get_sim_no_cn(dataSetAddress, 'fasttext2'))

evaluating the data set /content/EN-RG-65.txt
        Fasttext2
With CN  0.8755
No   CN  0.8526
evaluating the data set /content/EN-WS-353-ALL.txt
        Fasttext2
With CN  0.7904
No   CN  0.7921
evaluating the data set /content/EN-RW-STANFORD.txt
        Fasttext2
With CN  0.6135
No   CN  0.5949
evaluating the data set /content/EN-MEN-TR-3k.txt
        Fasttext2
With CN  0.8466
No   CN  0.8362
evaluating the data set /content/EN-MTurk-287.txt
        Fasttext2
With CN  0.7323
No   CN  0.7254
evaluating the data set /content/EN-SIMLEX-999.txt
        Fasttext2
With CN  0.5168
No   CN  0.5051
evaluating the data set /content/EN-SimVerb-3500.txt
        Fasttext2
With CN  0.4348
No   CN  0.4304


# Without CN post-processing

In [0]:
def get_sim_no_cn(data_f_name, f_name):
  # f_name: name of a loaded KeyedVectors variable, e.g. 'glove'; vectors are used without CN
  model = eval(f_name)
  data = []
  with io.open(data_f_name, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
    for line in fin:
      tokens = line.rstrip().split()
      # keep only pairs where both words are in the vocabulary
      if tokens[0] in model.vocab and tokens[1] in model.vocab:
        data.append(((tokens[0], tokens[1]), float(tokens[2])))
  data.sort(key = lambda score: -score[1]) # sort by human score, most similar first

  dataset = {}
  dataset_list = []
  for ((word1, word2), score) in data:
    # cosine distance, so an ascending sort also puts the most similar pair first
    sim_score = 1 - cosine_similarity(model[word1].reshape(1,-1), model[word2].reshape(1,-1))[0][0]
    dataset[(word1, word2)] = sim_score
    dataset_list.append(((word1, word2),sim_score))
  dataset_list.sort(key = lambda score: score[1])

  # Spearman correlation between each pair's position in the two orderings
  spearman_list1=[]
  spearman_list2=[]
  for pos_1, (pair, score_1) in enumerate(data):
    score_2 = dataset[pair]
    pos_2 = dataset_list.index((pair, score_2))
    spearman_list1.append(pos_1)
    spearman_list2.append(pos_2)
  rho = spearmanr(spearman_list1, spearman_list2)
  return rho[0]


In [0]:
dataSets = ['EN-RG-65.txt', 'EN-WS-353-ALL.txt', 'EN-RW-STANFORD.txt', 'EN-MEN-TR-3k.txt', 'EN-MTurk-287.txt', 'EN-SIMLEX-999.txt', 'EN-SimVerb-3500.txt']
for dataset in dataSets:
    dataSetAddress = '/content/'+  dataset
    print('evaluating the data set', dataSetAddress)
    print('Fasttext ', 'GloVe ', 'w2v ')
    print("%.4f" % get_sim_no_cn(dataSetAddress, 'fasttext'), "%.4f" % get_sim_no_cn(dataSetAddress, 'glove'), "%.4f" % get_sim_no_cn(dataSetAddress, 'w2v'))
    

evaluating the data set /content/EN-RG-65.txt
Fasttext  GloVe  w2v 
0.8457 0.7603 0.7494
evaluating the data set /content/EN-WS-353-ALL.txt
Fasttext  GloVe  w2v 
0.7334 0.7379 0.6934
evaluating the data set /content/EN-RW-STANFORD.txt
Fasttext  GloVe  w2v 
0.5231 0.5101 0.5578
evaluating the data set /content/EN-MEN-TR-3k.txt
Fasttext  GloVe  w2v 
0.7904 0.8013 0.7707
evaluating the data set /content/EN-MTurk-287.txt
Fasttext  GloVe  w2v 
0.7051 0.6916 0.6831
evaluating the data set /content/EN-SIMLEX-999.txt
Fasttext  GloVe  w2v 
0.4503 0.4076 0.4427
evaluating the data set /content/EN-SimVerb-3500.txt
Fasttext  GloVe  w2v 
0.3605 0.2842 0.3654


In [0]:
dataSets = ['EN-RG-65.txt', 'EN-WS-353-ALL.txt', 'EN-RW-STANFORD.txt', 'EN-MEN-TR-3k.txt', 'EN-MTurk-287.txt', 'EN-SIMLEX-999.txt', 'EN-SimVerb-3500.txt']
for dataset in dataSets:
    dataSetAddress = '/content/'+  dataset
    print('evaluating the data set', dataSetAddress)
    print('Fasttext2')
    print("%.4f" % get_sim_no_cn(dataSetAddress, 'fasttext2'))

evaluating the data set /content/EN-RG-65.txt
Fasttext2
0.8587
evaluating the data set /content/EN-WS-353-ALL.txt
Fasttext2
0.7915
evaluating the data set /content/EN-RW-STANFORD.txt
Fasttext2
0.5948
evaluating the data set /content/EN-MEN-TR-3k.txt
Fasttext2
0.8364
evaluating the data set /content/EN-MTurk-287.txt
Fasttext2
0.7262
evaluating the data set /content/EN-SIMLEX-999.txt
Fasttext2
0.5038
evaluating the data set /content/EN-SimVerb-3500.txt
Fasttext2
0.4304


# Results

In [0]:
dataSets = ['EN-RG-65.txt', 'EN-WS-353-ALL.txt', 'EN-RW-STANFORD.txt', 'EN-MEN-TR-3k.txt', 'EN-MTurk-287.txt', 'EN-SIMLEX-999.txt', 'EN-SimVerb-3500.txt']
for dataset in dataSets:
    dataSetAddress = '/content/'+  dataset
    print('evaluating the data set', dataSetAddress)
    print('         Fasttext ', 'GloVe ', 'w2v ')
    print('With CN',"%.4f" % get_sim(dataSetAddress, 'fasttext',cn_fasttext_mat, alpha =2), "%.4f" % get_sim(dataSetAddress, 'glove', cn_glove_mat, alpha =2), "%.4f" % get_sim(dataSetAddress, 'w2v', cn_w2v_mat, alpha =2))
    print('NO   CN',"%.4f" % get_sim_no_cn(dataSetAddress, 'fasttext'), "%.4f" % get_sim_no_cn(dataSetAddress, 'glove'), "%.4f" % get_sim_no_cn(dataSetAddress, 'w2v'))

evaluating the data set /content/EN-RG-65.txt
         Fasttext  GloVe  w2v 
With CN 0.8621 0.7840 0.7892
NO   CN 0.8400 0.7510 0.7391
evaluating the data set /content/EN-WS-353-ALL.txt
         Fasttext  GloVe  w2v 
With CN 0.7336 0.7908 0.6930
NO   CN 0.7334 0.7385 0.6935
evaluating the data set /content/EN-RW-STANFORD.txt
         Fasttext  GloVe  w2v 
With CN 0.5369 0.5898 0.5804
NO   CN 0.5231 0.5101 0.5578
evaluating the data set /content/EN-MEN-TR-3k.txt
         Fasttext  GloVe  w2v 
With CN 0.8062 0.8338 0.7867
NO   CN 0.7902 0.8011 0.7705
evaluating the data set /content/EN-MTurk-287.txt
         Fasttext  GloVe  w2v 
With CN 0.7141 0.7107 0.6681
NO   CN 0.7072 0.6908 0.6831
evaluating the data set /content/EN-SIMLEX-999.txt
         Fasttext  GloVe  w2v 
With CN 0.4584 0.4853 0.4682
NO   CN 0.4521 0.4073 0.4419
evaluating the data set /content/EN-SimVerb-3500.txt
         Fasttext  GloVe  w2v 
With CN 0.3652 0.3636 0.3830
NO   CN 0.3603 0.2843 0.3654


# Experiment 2: STS Benchmark
I re-implemented the STS evaluation by testing CN post-processed word vectors on the 2012-2017 SemEval STS tasks.
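
The scoring step for this experiment can be sketched as a per-task Pearson correlation between predicted and human similarity scores, scaled by 100. The example below is a minimal illustration under stated assumptions: the task labels and numbers are made up, and `np.corrcoef` stands in for `scipy.stats.pearsonr`.

```python
import numpy as np

def pearson_percent(predicted, gold):
    """Pearson correlation of two score lists, scaled by 100 and rounded to 2 decimals."""
    r = np.corrcoef(predicted, gold)[0, 1]
    return round(float(r) * 100, 2)

# Hypothetical (task label, predicted similarity, human score) rows
toy_rows = [
    ('toy-A', 0.9, 5.0), ('toy-A', 0.5, 3.0), ('toy-A', 0.1, 1.0),
    ('toy-B', 0.2, 4.0), ('toy-B', 0.8, 1.0), ('toy-B', 0.5, 2.5),
]

# Group the rows by task, then score each task separately
by_task = {}
for task, pred, gold in toy_rows:
    preds, golds = by_task.setdefault(task, ([], []))
    preds.append(pred)
    golds.append(gold)

scores = {task: pearson_percent(p, g) for task, (p, g) in by_task.items()}
print(scores)
```

In the toy data, predictions for `toy-A` rise linearly with the human scores while those for `toy-B` fall linearly, so the two tasks land at opposite ends of the scale.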

# Load STS datasets

In [0]:
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-dev.csv
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-mt.csv
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-other.csv
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-test.csv
!wget https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-train.csv

--2019-01-19 13:54:22--  https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-dev.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255680 (250K) [text/plain]
Saving to: ‘sts-dev.csv’


2019-01-19 13:54:22 (5.94 MB/s) - ‘sts-dev.csv’ saved [255680/255680]

--2019-01-19 13:54:23--  https://raw.githubusercontent.com/liutianlin0121/Conceptor-Negation-WV/master/data/stsbenchmark/sts-mt.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 513141 (501K) [text/plain]
Saving to: ‘sts-mt.csv’


2019-01

In [0]:
!pwd
!ls

/content
EN-MEN-TR-3k.txt     sample_data	  sts-other.csv
EN-MTurk-287.txt     small_glove.txt	  sts-test.csv
EN-RG-65.txt	     small_glove_w2v.txt  sts-train.csv
EN-RW-STANFORD.txt   small_w2v_w2v.txt	  wiki-news-300d-1M.vec
EN-SIMLEX-999.txt    small_word2vec.txt   wiki-news-300d-1M.vec.zip
EN-SimVerb-3500.txt  sts-dev.csv
EN-WS-353-ALL.txt    sts-mt.csv


In [0]:
import io
def load_sts_dataset(fname):
    # For an STS dataset, load the relevant information: the sentences and their human-rated similarity score.
    sent_pairs = []
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        for line in fin:
            items = line.rstrip().split('\t')
            if len(items) == 7 or len(items) == 9:
                # year_task label: digits of items[2] + '-' + items[1]
                sent_pairs.append((re.sub("[^0-9]", "", items[2]) + '-' + items[1] , items[5], items[6], float(items[4])))
            elif len(items) == 6 or len(items) == 8:
                # same fields, with every index shifted down by one
                sent_pairs.append((re.sub("[^0-9]", "", items[1]) + '-' + items[0] , items[4], items[5], float(items[3])))
            else:
                print('data format is wrong!!!')
    return pd.DataFrame(sent_pairs, columns=["year_task", "sent_1", "sent_2", "sim"])






def load_all_sts_dataset():
    # Loads all of the STS datasets 
    resourceFile = '/content/'
    sts_train = load_sts_dataset(resourceFile + 'sts-train.csv') 
    sts_dev = load_sts_dataset(resourceFile + "sts-dev.csv")
    sts_test = load_sts_dataset(resourceFile + "sts-test.csv")
    sts_other = load_sts_dataset(resourceFile + "sts-other.csv")
    sts_mt = load_sts_dataset(resourceFile +"sts-mt.csv")
    
    sts_all = pd.concat([sts_train, sts_dev, sts_test, sts_other, sts_mt ])
    
    return sts_all

sts_all = load_all_sts_dataset()


# Load dataset by year-task

In [0]:
def load_by_task_year(sts_all):
  # map each year-task label (e.g. '2012-MSRvid') to its rows of sts_all
  sts_task_year = {}
  for task in sts_all['year_task'].unique():
    sts_task_year[task] = sts_all[sts_all['year_task'] == task]
  return sts_task_year
sts_year_task = load_by_task_year(sts_all)
print(sts_year_task.keys())
print(sts_year_task['2012-MSRvid'][0:5])

dict_keys(['2012-MSRvid', '2014-images', '2015-images', '2014-deft-forum', '2012-MSRpar', '2014-deft-news', '2013-headlines', '2014-headlines', '2015-headlines', '2016-headlines', '2017-track5.en-en', '2015-answers-forums', '2016-answer-answer', '2012-surprise.OnWN', '2013-FNWN', '2013-OnWN', '2014-OnWN', '2014-tweet-news', '2015-belief', '2016-plagiarism', '2016-question-question', '2012-SMTeuroparl', '2012-surprise.SMTnews', '2016-postediting'])
     year_task                                         sent_1  \
0  2012-MSRvid                         A plane is taking off.   
1  2012-MSRvid                A man is playing a large flute.   
2  2012-MSRvid  A man is spreading shreded cheese on a pizza.   
3  2012-MSRvid                   Three men are playing chess.   
4  2012-MSRvid                    A man is playing the cello.   

                                              sent_2   sim  
0                        An air plane is taking off.  5.00  
1                          A man is

# Load dataset by year

In [0]:
def load_by_year(sts_all):
  # map each year to all rows whose year_task label starts with that year
  sts_year = {}
  for year in ['2012', '2013', '2014', '2015', '2016', '2017']:
    sts_year[year] = sts_all[sts_all['year_task'].str.startswith(year)]
  return sts_year
sts_year = load_by_year(sts_all)
print(len(sts_year.keys()))
print(sts_year['2016'][:5])

6
           year_task                                             sent_1  \
5552  2016-headlines  Driver backs into stroller with child, drives off   
5553  2016-headlines   Spain Princess Testifies in Historic Fraud Probe   
5554  2016-headlines  Senate confirms Obama nominee to key appeals c...   
5555  2016-headlines  U.N. rights chief presses Egypt on Mursi deten...   
5556  2016-headlines  US Senate confirms Janet Yellen as US Federal ...   

                                                 sent_2  sim  
5552  Driver backs into mom, stroller with child the...  4.0  
5553   Spain princess testifies in historic fraud probe  5.0  
5554  Senate approves Obama nominee to key appeals c...  5.0  
5555   UN Rights Chief Presses Egypt on Morsi Detention  5.0  
5556  Senate confirms Janet Yellen as next Federal R...  5.0  


# Preparation for STS Evaluation


*   Define Sentence class, which has raw data and tokenized data
*   Get similarity scores based on embeddings
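
The two steps above can be sketched on toy data: a sentence is represented by the average of its in-vocabulary word vectors, and two sentences are compared by cosine similarity. This is a minimal illustration under stated assumptions: the 3-D vectors are made up, and whitespace splitting stands in for the `nltk.word_tokenize` call used in this notebook.

```python
import numpy as np

def sentence_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

# Hypothetical 3-D word vectors
toy_vectors = {
    'man': np.array([1.0, 0.0, 0.0]),
    'woman': np.array([0.9, 0.1, 0.0]),
    'cello': np.array([0.0, 1.0, 0.0]),
    'flute': np.array([0.0, 0.9, 0.1]),
    'chess': np.array([0.0, 0.0, 1.0]),
}

sent_a = sentence_embedding('a man plays cello'.lower().split(), toy_vectors, 3)
sent_b = sentence_embedding('a woman plays flute'.lower().split(), toy_vectors, 3)
sent_c = sentence_embedding('chess'.lower().split(), toy_vectors, 3)
sent_empty = sentence_embedding('?? !!'.split(), toy_vectors, 3)

sim_ab = cosine(sent_a, sent_b)
sim_ac = cosine(sent_a, sent_c)
sim_empty = cosine(sent_a, sent_empty)
print(sim_ab, sim_ac, sim_empty)
```

Handling the no-known-token case explicitly avoids averaging an empty list, which would otherwise produce `nan` and runtime warnings.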



In [0]:
class Sentence:
  def __init__(self, sentence):
    self.raw = sentence
    normalized = sentence.replace("‘", "'").replace("’", "'")
    self.tokens = [token.lower() for token in nltk.word_tokenize(normalized)]

def sen_sim(sentences1, sentences2, cn_fname, cn_mat):
  # cn_fname: name of a loaded KeyedVectors variable; cn_mat: its CN post-processed vectors (same row order)
  model = eval(cn_fname)
  sim_score = []
  for sent_1, sent_2 in zip(sentences1, sentences2):
    # keep in-vocabulary tokens with at least one letter (islower() drops numbers and punctuation)
    tokens1 = [token for token in sent_1.tokens if token in model.vocab and token.islower()]
    tokens2 = [token for token in sent_2.tokens if token in model.vocab and token.islower()]
    if not tokens1 or not tokens2:
      # nothing to average for one sentence: score the pair as 0, as two zero vectors would
      sim_score.append(0.0)
      continue
    # sentence embedding = average of the CN post-processed word vectors
    embedding1 = np.average([cn_mat[model.vocab[token].index] for token in tokens1], axis = 0)
    embedding2 = np.average([cn_mat[model.vocab[token].index] for token in tokens2], axis = 0)
    sim_score.append(cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))[0][0])
  return sim_score
        
  
  


In [0]:
def no_cn_sen_sim(sentences1, sentences2, fname):
  # fname: name of a loaded KeyedVectors variable; vectors are used without CN
  model = eval(fname)
  sim_score = []
  for sent_1, sent_2 in zip(sentences1, sentences2):
    # keep in-vocabulary tokens with at least one letter (islower() drops numbers and punctuation)
    tokens1 = [token for token in sent_1.tokens if token in model.vocab and token.islower()]
    tokens2 = [token for token in sent_2.tokens if token in model.vocab and token.islower()]
    if not tokens1 or not tokens2:
      # nothing to average for one sentence: score the pair as 0, as two zero vectors would
      sim_score.append(0.0)
      continue
    # sentence embedding = average of the original word vectors
    embedding1 = np.average([model[token] for token in tokens1], axis = 0)
    embedding2 = np.average([model[token] for token in tokens2], axis = 0)
    sim_score.append(cosine_similarity(embedding1.reshape(1, -1), embedding2.reshape(1, -1))[0][0])
  return sim_score

# Results

In [0]:
model_list = ['glove', 'w2v', 'fasttext']
pearson_cors = {}
pearson_cors_no_cn = {}
mat = []
for year_task in sts_all['year_task'].unique():
  for model in model_list:
    if model == 'glove':
      mat = cn_glove_mat
    elif model == 'w2v':
      mat = cn_w2v_mat
    elif model == 'fasttext':
      mat = cn_fasttext_mat
        
    sentences1=[Sentence(sent1) for sent1 in sts_year_task[year_task]['sent_1']]
    sentences2=[Sentence(sent2) for sent2 in sts_year_task[year_task]['sent_2']]
    sim = sen_sim(sentences1, sentences2, model, mat)
    pearson_correlation = round(scipy.stats.pearsonr(sim, sts_year_task[year_task]['sim'])[0] * 100,2)
    pearson_cors[(model, year_task)] = pearson_correlation
    sim2 = no_cn_sen_sim(sentences1, sentences2, model)
    pearson_correlation_no_cn = round(scipy.stats.pearsonr(sim2, sts_year_task[year_task]['sim'])[0] * 100,2)
    pearson_cors_no_cn[(model, year_task)] = pearson_correlation_no_cn
count = 0
for (i,j) in pearson_cors.keys():
  if count % 3 ==0:
    print('')
  count +=1
  print('With CN',i, j, pearson_cors[(i,j)])
  print('NO   CN',i, j, pearson_cors_no_cn[(i,j)])
    



With CN glove 2012-MSRvid 62.5
NO   CN glove 2012-MSRvid 65.85
With CN w2v 2012-MSRvid 75.22
NO   CN w2v 2012-MSRvid 76.06
With CN fasttext 2012-MSRvid 66.44
NO   CN fasttext 2012-MSRvid 66.35

With CN glove 2014-images 65.81
NO   CN glove 2014-images 61.89
With CN w2v 2014-images 78.24
NO   CN w2v 2014-images 77.43
With CN fasttext 2014-images 63.41
NO   CN fasttext 2014-images 61.6

With CN glove 2015-images 71.43
NO   CN glove 2015-images 69.14
With CN w2v 2015-images 80.48
NO   CN w2v 2015-images 79.97
With CN fasttext 2015-images 71.13
NO   CN fasttext 2015-images 70.42

With CN glove 2014-deft-forum 37.57
NO   CN glove 2014-deft-forum 28.82
With CN w2v 2014-deft-forum 42.8
NO   CN w2v 2014-deft-forum 41.33
With CN fasttext 2014-deft-forum 40.18
NO   CN fasttext 2014-deft-forum 38.65

With CN glove 2012-MSRpar 41.19
NO   CN glove 2012-MSRpar 42.01
With CN w2v 2012-MSRpar 40.3
NO   CN w2v 2012-MSRpar 42.2
With CN fasttext 2012-MSRpar 45.03
NO   CN fasttext 2012-MSRpar 44.98

With 

# References
1. https://github.com/liutianlin0121/Conceptor-Negation-WV
2. Tianlin Liu, João Sedoc, and Lyle Ungar. Unsupervised Post-processing of Word Vectors via Conceptor Negation. AAAI 2019.
