# Testing 

### Amazon-Google: reduced

In [70]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/amazon-google ../data/converted/amazon-google-reduced -c name description -s --max_df=0.1

Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	name
	description
Cleaning text. Fixed 402 unique tokens
Creating map file. 17263 unique tokens detected.
Building embedding matrix. 2309 unknown tokens assigned random Gaussian.
Converting text data to index vectors.


### Amazon-Google: normal

In [2]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/amazon-google ../data/converted/amazon-google -c name description

Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	name
	description
Cleaning text. Fixed 402 unique tokens
Creating map file. 22339 unique tokens detected.
Building embedding matrix. 5110 unknown tokens assigned random Gaussian.
Converting text data to index vectors.
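
The reduced runs pass `-s`, which builds the vocabulary with `CountVectorizer(max_df=...)` instead of the plain whitespace tokenizer used here. The sketch below illustrates only the document-frequency part of that idea on a toy corpus. `filter_by_max_df` and `toy_docs` are hypothetical names, and whitespace splitting is a simplification: `CountVectorizer` also lowercases and, by default, keeps only tokens of two or more word characters, which may account for part of the difference in token counts between the reduced and normal runs.

```python
from collections import Counter

def filter_by_max_df(docs, max_df):
    """Return the sorted vocabulary after dropping tokens whose document
    frequency exceeds max_df * len(docs). Illustrative only."""
    doc_freq = Counter()
    for doc in docs:
        # count each token at most once per document
        doc_freq.update(set(doc.lower().split()))
    max_doc_count = max_df * len(docs)
    return sorted(tok for tok, df in doc_freq.items() if df <= max_doc_count)

# toy corpus (made up for illustration, not taken from the datasets)
toy_docs = [
    'adobe photoshop for windows',
    'adobe acrobat for mac',
    'microsoft office for windows',
    'norton antivirus for mac',
]
full_vocab = filter_by_max_df(toy_docs, max_df=1.0)     # nothing is dropped
reduced_vocab = filter_by_max_df(toy_docs, max_df=0.5)  # 'for' appears in every document
print(len(full_vocab), len(reduced_vocab))
```

With `max_df=1.0` no token can exceed the threshold, so in that setting any reduction comes from tokenization rather than from frequency filtering.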


### Amazon-Walmart: reduced

In [23]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/amazon-walmart ../data/converted/amazon-walmart-reduced -c brand groupname title shelfdescr shortdescr longdescr -d prod_id1 prod_id2 imageurl modelno -s --max_df=1.0

Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	brand
	groupname
	title
	shelfdescr
	shortdescr
	longdescr
Columns to drop:
	prod_id1
	prod_id2
	imageurl
	modelno
Cleaning text. Fixed 3719 unique tokens
Creating map file. 69244 unique tokens detected.
Building embedding matrix. 33265 unknown tokens assigned random Gaussian.
Converting text data to index vectors.


### Amazon-Walmart: normal

In [17]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/amazon-walmart ../data/converted/amazon-walmart -c brand groupname title shelfdescr shortdescr longdescr -d prod_id1 prod_id2 imageurl modelno

Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	brand
	groupname
	title
	shelfdescr
	shortdescr
	longdescr
Columns to drop:
	prod_id1
	prod_id2
	imageurl
	modelno
Cleaning text. Fixed 3719 unique tokens
Creating map file. 124127 unique tokens detected.
Building embedding matrix. 53741 unknown tokens assigned random Gaussian.
Converting text data to index vectors.


### DBLP-Scholar: reduced

In [2]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/dblp-scholar ../data/converted/dblp-scholar-reduced -c title authors venue -s --max_df=1.0


Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	title
	authors
	venue
Cleaning text. Fixed 4128 unique tokens
Creating map file. 92475 unique tokens detected.
Building embedding matrix. 45412 unknown tokens assigned random Gaussian.
Converting text data to index vectors.


### DBLP-Scholar: normal

In [3]:
! source activate entity-resolution && python create-embedding-data-by-set.py glove ../data/embeddings/glove-300.gensim ../data/raw/dblp-scholar ../data/converted/dblp-scholar -c title authors venue


Loading Gensim model...
Check id column names are valid: passed
Check datasets have same column names: passed
Check listed columns are valid: passed
Columns to convert:
	title
	authors
	venue
Cleaning text. Fixed 4128 unique tokens
Creating map file. 123171 unique tokens detected.
Building embedding matrix. 37557 unknown tokens assigned random Gaussian.
Converting text data to index vectors.


# Script 

In [22]:
%%writefile create-embedding-data-by-set.py

import os
import shutil
import re
import string
import html

import numpy as np
import pandas as pd
import pickle as pkl
import argparse as ap

from gensim.models import KeyedVectors
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# parse command-line arguments
parser = ap.ArgumentParser()
parser.add_argument('embedding_type',
                    help = 'choose embedding type (word2vec or glove)')
parser.add_argument('embedding_file',
                    help = 'file path of downloaded embedding data')
parser.add_argument('data_dir',
                    help = 'directory containing dataset and match files')
parser.add_argument('dest_dir',
                    help = 'directory path to generate new files')
parser.add_argument('--columns', '-c', nargs='+', required=True,
                    help = 'names of columns to be converted')
parser.add_argument('--drop', '-d', nargs='+', required=False,
                    help = 'names of columns to be dropped')
parser.add_argument('--set1', '-s1', default='set1.csv',
                    help='filename of first dataset csv')
parser.add_argument('--set2', '-s2', default='set2.csv',
                    help='filename of second dataset csv')
parser.add_argument('--matches', '-m', default='matches.csv',
                    help='filename of positive matches csv')
parser.add_argument('--sklearn', '-s', action='store_true',
                    help='whether to use sklearn\'s CountVectorizer')
parser.add_argument('--max_df', default=0.1, type=float,
                    help='proportion of documents above which token will be excluded')

args = parser.parse_args()
embedding_type = args.embedding_type
embedding_file = args.embedding_file
dest_dir = args.dest_dir
data_dir = args.data_dir
set1 = args.set1
set2 = args.set2
matches = args.matches
columns = args.columns
drop = args.drop
use_sklearn = args.sklearn
max_df = args.max_df

print('Loading Gensim model...')
if embedding_type in ['glove', 'word2vec']:
    model = KeyedVectors.load(embedding_file)      
elif embedding_type == 'debug':
    print(embedding_type)
    print(embedding_file)
    print(dest_dir)
    quit()
else:
    raise ValueError('Not a valid embedding type')
# note: model.vocab exists in gensim < 4.0; later versions expose
# ...key_to_index instead
gensim_vocab = defaultdict(int, model.vocab)

df1 = pd.read_csv(os.path.join(data_dir, set1), encoding = "latin1")
df2 = pd.read_csv(os.path.join(data_dir, set2), encoding = "latin1")

# check input data meets requirements
df1_column_check = list(df1.columns)
df2_column_check = list(df2.columns)

print('Check id column names are valid: ', end='')
assert('id1' in df1_column_check)
assert('id2' in df2_column_check)
print('passed')

df1_column_check.remove('id1')
df2_column_check.remove('id2')

print('Check datasets have same column names: ', end='')
assert(df1_column_check == df2_column_check)
print('passed')

print('Check listed columns are valid: ', end='')
for column in columns:
    assert(column in df1_column_check)
if drop:
    for column in drop:
        assert(column in df1_column_check)
print('passed')

print('Columns to convert:')
for column in columns:
    print('\t' + column)

if drop:
    print('Columns to drop:')
    for column in drop:
        print('\t' + column)
        df1 = df1.drop(column, axis='columns')
        df2 = df2.drop(column, axis='columns')

if not os.path.isdir(dest_dir):
    os.mkdir(dest_dir)

# no need to do anything to matches.csv, so just copy it to destination
matches_source_path = os.path.join(data_dir, matches)
matches_dest_path = os.path.join(dest_dir, matches)
shutil.copyfile(matches_source_path, matches_dest_path)

# pre-process all text data into tokens compatible with Glove/Word2Vec
def clean_text(x):
    "formats a single string"
    if not isinstance(x, str):
        return 'NaN'
    
    # separate possessives with spaces
    x = x.replace('\'s', ' \'s')
    
    # convert html escape characters to regular characters
    x = html.unescape(x)
    
    # separate punctuations with spaces
    def pad(m):
        # m is a match object; m[0] is the single matched punctuation character
        return ' ' + m[0] + ' '
    rx = r'\(|\)|/|!|#|\$|%|&|\\|\*|\+|,|:|;|<|=|>|\?|@|\[|\]|\^|_|{|}|\||'
    rx += r'`|~'
    x = re.sub(rx, pad, x)
    
    # remove decimal parts of version numbers
    def v_int(x):
        return re.sub(r'\.\d+', '', x[0])
    x = re.sub(r'v\d+\.\d+', v_int, x)
    
    return x

print('Cleaning text. ', end = '')
df1.loc[:, columns] = df1.loc[:, columns].applymap(clean_text)
df2.loc[:, columns] = df2.loc[:, columns].applymap(clean_text)

# for any tokens not in model vocabulary, try a few capitalization variants
# (lower, capitalized, upper); a token that matches none is left upper-cased
fixed = list()
def check_tokens(x):
    global fixed
    x = x.split()
    new_string = ''
    for token_orig in x:
        token = token_orig
        if not bool(gensim_vocab[token]):
            token = token.lower()
            if bool(gensim_vocab[token]):
                fixed.append(token_orig)
        if not bool(gensim_vocab[token]):
            token = string.capwords(token)
            if bool(gensim_vocab[token]):
                fixed.append(token_orig)
        if not bool(gensim_vocab[token]):
            token = token.upper()
            if bool(gensim_vocab[token]):
                fixed.append(token_orig)
        new_string = new_string + ' ' + token
    return new_string
df1.loc[:, columns] = df1.loc[:, columns].applymap(check_tokens)
df2.loc[:, columns] = df2.loc[:, columns].applymap(check_tokens)
print('Fixed {} unique tokens'.format(pd.Series(fixed).nunique()))
        
# map each token to an index and convert text fields accordingly

print('Creating map file. ', end='')
# collapse all text columns in both datasets to a single list of strings
corpus = list()
for df in [df1, df2]:
    for column in columns:
        corpus.extend(list(df[column]))

# map each token to a unique non-zero index
word2idx = defaultdict(int)
idx2word = defaultdict(str)
if not use_sklearn:
    i = 1
    # missing = ['nan', 'NAN', 'Nan', 'NaN']
    for instance in corpus:
        for token in instance.split():
            if not word2idx[token]:
                word2idx[token] = i
                i += 1

    # create a reverse mapping from index to token
    for key, value in word2idx.items():
        idx2word[value] = key
else:
    cv = CountVectorizer(max_df=max_df)
    cv.fit(corpus)
    sk_dict = cv.vocabulary_
    # offset indices assigned by sklearn so 0 index is free
    for word, index in sk_dict.items():
        word2idx[word] = index + 1
        idx2word[index+1] = word
    
print('{} unique tokens detected.'.format(len(word2idx)))
    
print('Building embedding matrix. ', end='')
# an extra row of zeros at top of matrix is needed for Keras zero padding
embedding_matrix = np.zeros([len(word2idx) + 1, 300])
n_unknowns = 0
for word, index in word2idx.items():
    # if word has no vector embedding, fill its row with a small random
    # ...Gaussian vector (standard normal scaled by 1/300)
    if bool(gensim_vocab[word]):
        embedding_vector = model.get_vector(word)      
    else:
        embedding_vector = np.random.randn(300) / 300
        n_unknowns += 1
    embedding_matrix[index, :] = embedding_vector
print('{} unknown tokens assigned random Gaussian.'.format(n_unknowns))
    
print('Converting text data to index vectors.')
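def record2idx(x):
    if not use_sklearn:
        x = x.split()
    else:
        # CountVectorizer lowercases tokens, so lowercase before lookup
        x = x.lower().split()
    # tokens excluded from the vocabulary map to the padding index 0;
    # .get avoids inserting them into the word2idx defaultdict
    return [word2idx.get(word, 0) for word in x]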

df1.loc[:, columns] = df1.loc[:, columns].applymap(record2idx)
df2.loc[:, columns] = df2.loc[:, columns].applymap(record2idx)

# save files
df1.to_csv(os.path.join(dest_dir, set1), index=False)
df2.to_csv(os.path.join(dest_dir, set2), index=False)
np.save(arr=embedding_matrix,
        file=os.path.join(dest_dir, '{}-300.matrix'.format(embedding_type)))

# save both word2idx and idx2word mappings into a double dictionary
# (named token_map to avoid shadowing the built-in map)
token_map = dict(word2idx = word2idx, idx2word = idx2word)
with open(os.path.join(dest_dir, embedding_type + '-300.map'), 'wb') as f:
    pkl.dump(token_map, f)

Overwriting create-embedding-data-by-set.py
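
The embedding-matrix step of the script can be tried in isolation without loading the Gensim model. This is a minimal sketch under stated assumptions: `toy_vectors` is a hypothetical stand-in for the model, the dimension is 4 rather than 300, and a seeded NumPy generator replaces `np.random.randn` so the result is repeatable.

```python
import numpy as np

dim = 4  # toy dimension; the script hard-codes 300

# hypothetical stand-in for the Gensim model: word -> vector
toy_vectors = {'adobe': np.ones(dim), 'windows': np.full(dim, 2.0)}

# indices start at 1 so that row 0 stays free for zero padding
word2idx = {'adobe': 1, 'windows': 2, 'clickart': 3}

rng = np.random.default_rng(0)
embedding_matrix = np.zeros((len(word2idx) + 1, dim))
n_unknowns = 0
for word, index in word2idx.items():
    if word in toy_vectors:
        embedding_matrix[index] = toy_vectors[word]
    else:
        # unknown token: small random Gaussian, scaled by 1/dim
        # (mirrors the script's division by 300)
        embedding_matrix[index] = rng.standard_normal(dim) / dim
        n_unknowns += 1

print(embedding_matrix.shape, n_unknowns)
```

Row 0 stays all zeros, which is what lets index 0 act as the padding token downstream.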


# Scratch pad 

In [1]:
import os
import shutil
import re
import string
import html

import pandas as pd
import numpy as np
import pickle as pkl

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from collections import defaultdict

In [5]:
def clean_text(x):
    "formats a single string"
    if not isinstance(x, str):
        return 'NaN'
    
    # separate possessives with spaces
    x = x.replace('\'s', ' \'s')
    
    # convert html escape characters to regular characters
    x = html.unescape(x)
    
    # separate punctuations with spaces
    def pad(x):
        match = re.findall(r'.', x[0])[0]
        match_clean = ' ' + match + ' '
        return match_clean
    rx = r'\(|\)|/|!|#|\$|%|&|\\|\*|\+|,|:|;|<|=|>|\?|@|\[|\]|\^|_|{|}|\||'
    rx += r'`|~'
    x = re.sub(rx, pad, x)
    
    # remove decimal parts of version numbers
    def v_int(x):
        return re.sub(r'\.\d+', '', x[0])
    x = re.sub(r'v\d+\.\d+', v_int, x)
    
    return x

In [2]:
model = KeyedVectors.load('../data/embeddings/glove-300.gensim')

In [6]:
df1 = pd.read_csv('../data/raw/amazon-google/set1.csv', encoding='latin1')
df2 = pd.read_csv('../data/raw/amazon-google/set2.csv', encoding='latin1')
corpus = list(df1['name']) + list(df1['description']) + list(df2['name']) + list(df2['description'])  
corpus = [clean_text(x) for x in corpus]

for index, record in enumerate(corpus):
    if not isinstance(corpus[index], str):
        corpus[index] = 'NaN'

In [57]:
cv = CountVectorizer(min_df=0.05)
cv.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.05,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [58]:
cv.vocabulary_

{'adobe': 0,
 'all': 1,
 'an': 2,
 'and': 3,
 'are': 4,
 'as': 5,
 'by': 6,
 'can': 7,
 'cd': 8,
 'complete': 9,
 'create': 10,
 'dvd': 11,
 'easy': 12,
 'edition': 13,
 'encore': 14,
 'features': 15,
 'for': 16,
 'from': 17,
 'home': 18,
 'in': 19,
 'is': 20,
 'it': 21,
 'mac': 22,
 'microsoft': 23,
 'more': 24,
 'new': 25,
 'of': 26,
 'on': 27,
 'or': 28,
 'pc': 29,
 'pro': 30,
 'professional': 31,
 'software': 32,
 'system': 33,
 'that': 34,
 'the': 35,
 'this': 36,
 'time': 37,
 'to': 38,
 'tools': 39,
 'use': 40,
 'user': 41,
 'will': 42,
 'win': 43,
 'windows': 44,
 'with': 45,
 'xp': 46,
 'you': 47,
 'your': 48}

In [33]:
len(corpus) * 0.0002

1.8356000000000001
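
The product above is the document-count threshold that a fractional `min_df` of 0.0002 would translate to for this corpus. A small sketch of the arithmetic follows; `n_docs` is inferred from the printed value (1.8356 / 0.0002), and `min_docs_needed` is a hypothetical name.

```python
import math

n_docs = 9178      # inferred from the output above: 1.8356 / 0.0002
min_df = 0.0002

# a fractional min_df is read as a proportion of documents
threshold = min_df * n_docs

# document frequencies are integers, so a token must appear in at least
# ceil(threshold) documents to survive the filter
min_docs_needed = math.ceil(threshold)
print(threshold, min_docs_needed)
```

In other words, `min_df=0.0002` on this corpus would drop only tokens that occur in a single document.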

In [234]:
word2idx = defaultdict(int)
i = 1
missing = ['NaN','nan', 'NAN', 'Nan']
for sentence in corpus:
    for token in sentence.split():
        if token not in missing and not word2idx[token]:
            word2idx[token] = i
            i += 1
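
The same index-assignment idea on a toy corpus, extended with the reverse mapping and a lookup step. This is a sketch only: `toy_corpus` and `to_indices` are hypothetical, and unlike the cell above it looks tokens up with `in` and `.get`, so unseen tokens are never inserted into the `defaultdict`.

```python
from collections import defaultdict

# toy corpus (made up for illustration)
toy_corpus = ['adobe photoshop NaN', 'adobe acrobat', 'NaN']
missing = {'NaN', 'nan', 'NAN', 'Nan'}

# assign each new token the next free non-zero index
word2idx = defaultdict(int)
for sentence in toy_corpus:
    for token in sentence.split():
        if token not in missing and token not in word2idx:
            word2idx[token] = len(word2idx) + 1

# reverse mapping from index back to token
idx2word = {index: word for word, index in word2idx.items()}

def to_indices(text):
    """Convert a string to a list of indices; unseen tokens map to 0."""
    return [word2idx.get(token, 0) for token in text.split()]

encoded = to_indices('adobe illustrator NaN')
print(dict(word2idx), encoded)
```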

In [339]:
with open('../data/converted/amazon-google/glove-300.map', 'rb') as f:
    map = pkl.load(f)
    
embedding_matrix = np.load('../data/converted/amazon-google/glove-300.matrix.npy')

In [340]:
test_idx = 8
np.all(embedding_matrix[test_idx,:] == model.get_vector(map['idx2word'][test_idx]))

True

In [301]:
def check_tokens(x):
    x = x.split()
    new_string = ''
    for token_orig in x:
        token = token_orig
        if not bool(gensim_vocab[token]):
            token = token.lower()
        if not bool(gensim_vocab[token]):
            token = string.capwords(token)
        if not bool(gensim_vocab[token]):
            token = token.upper()
        new_string = new_string + ' ' + token
    return new_string
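
A self-contained variant of the capitalization fallback, runnable without the Gensim vocabulary. `toy_vocab` and `match_case` are hypothetical names. One deliberate difference from the cell above: a token that matches no variant is returned unchanged here, whereas `check_tokens` leaves it upper-cased (visible as `CLICKART` in the output below).

```python
import string

# hypothetical stand-in for the Gensim vocabulary
toy_vocab = {'adobe', 'Kutoka', 'DVD'}

def match_case(token, vocab):
    """Return the first capitalization variant found in vocab,
    or the original token if none matches."""
    variants = (token, token.lower(), string.capwords(token), token.upper())
    for variant in variants:
        if variant in vocab:
            return variant
    return token

tokens = 'ADOBE kutoka dvd Clickart'.split()
normalized = [match_case(t, toy_vocab) for t in tokens]
print(normalized)
```

Whether to restore unmatched tokens or leave them upper-cased is a design choice: upper-casing merges case variants of unknown words into one index, while restoring keeps them distinct.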

In [302]:
df1[columns].applymap(clean_text).applymap(check_tokens)

Unnamed: 0,name,description,manufacturer
0,CLICKART 950 000 - premier image pack ( dvd-r...,,broderbund
1,ca international - arcserve lap / desktop oem...,oem arcserve backup v11 win 30u for laptops a...,computer associates
2,noah 's ark activity center ( jewel case ages...,,victory multimedia
3,peachtree by sage premium accounting for nonp...,peachtree premium accounting for nonprofits 2...,sage software
4,singing coach unlimited,singing coach unlimited - electronic learning...,CARRY-A-TUNE technologies
5,emc retrospect 7.5 disk to disk windows,emc retrospect 7.5 disk to DISKCROMWINDOWS,Dantz
6,adobe after effects professional 6.5 upgrade ...,upgrade only ; installation of after effects ...,adobe
7,acad upgrade dragon naturallyspeaking pro sol...,- marketing information : dragon NATURALLYSPE...,nuance academic
8,mia 's math adventure : just in time,in mia 's math adventure : just in time child...,Kutoka
9,disney 's 1st & 2nd grade bundle ( pixar 1st ...,disney 's 1st & 2nd grade bundle will help yo...,disney


In [91]:
with open('../data/converted/amazon-google-reduced/glove-300.map', 'rb') as f:
    map = pkl.load(f)
    
embedding_matrix = np.load('../data/converted/amazon-google-reduced/glove-300.matrix.npy')

word = 'school'
idx = map['word2idx'][word]
word_vector = embedding_matrix[idx,:]
print(np.all(word_vector == model.get_vector(word)))
print(word == map['idx2word'][idx])

True
True
