### In-domain embedding generator
In this application we use the Amazon review dataset (2018), specifically the Home and Kitchen subset, since we aim to extract aspects from refrigerator reviews. Of the available subsets, it was the closest to our target domain that still had a significant number of reviews (~6.9M). The dataset was obtained from Jianmo Ni's [personal webpage](https://nijianmo.github.io/amazon/index.html).

---

**This code sample doesn't need to be executed for the main application to work; it was only needed to create the input data structures that our model will use.**

In [1]:
import re
import json
import nltk
import ijson
import string
import codecs
import gensim
from nltk.stem.wordnet import WordNetLemmatizer



First, let's read the JSON file and extract the review texts. We'll use the ijson library for this task because it creates an iterator rather than reading the whole file into memory. The ijson parser yields triples of prefix, event and value. Their meaning varies across the JSON structure, but for a top-level field the prefix contains its name, the event its type, and the value its content.

In [2]:
reviews = []
with open('input/home_and_kitchen.json', 'rb') as f:        # binary mode: ijson works on bytes input
    parser = ijson.parse(f, multiple_values=True)           # using ijson since we are working with a large JSON file

    for prefix, event, value in parser:
        if prefix == 'reviewText' and event == 'string':    # we are only interested in the reviewText field
            v = re.sub(r'[\r\t\n]+', ' ', value).lower()    # replace newlines and tabs with a space so words don't fuse
            v = re.sub(r'[^\x00-\x7F]', '', v)              # remove non-ASCII characters
            reviews.append(v)
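
For comparison, here is a minimal stdlib-only sketch of a similar extraction. It assumes the file stores one JSON object per line, which would explain why `multiple_values=True` is needed above. The function name `iter_review_texts` and the in-memory sample are hypothetical and not part of the original pipeline.

```python
import io
import json
import re

def iter_review_texts(lines):
    """Yield the cleaned reviewText of each JSON object, one object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                                  # skip blank lines
        record = json.loads(line)                     # only one record is held in memory at a time
        text = record.get('reviewText')
        if isinstance(text, str):                     # some records may have no review text
            text = re.sub(r'[\r\t\n]+', ' ', text).lower()   # newlines/tabs become a space here
            yield re.sub(r'[^\x00-\x7F]', '', text)          # remove non-ASCII characters

# hypothetical two-record sample standing in for input/home_and_kitchen.json
sample = io.StringIO(
    '{"overall": 5.0, "reviewText": "Great FRIDGE,\\nvery quiet"}\n'
    '{"overall": 2.0, "summary": "no review text here"}\n'
)
sample_reviews = list(iter_review_texts(sample))
```

With a real file, `iter_review_texts(open(path, encoding='utf-8'))` would stream it the same way. ijson remains the more general choice when the objects are not separated line by line.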

In [3]:
# Retrieve only the reviews that mention some domain words
domain = ['fridge', 'fridges', 'refrigerator', 'freezer', 'freezers', 'cooler', 'frig', 'icebox', 'icemaker', 'ice machine', 'minibar', 'refrigeration', 'refrigerate', 'cupboard', 'cupboards', 'defrost', 'microwave', 'stove', 'oven']
reviews = [s for s in reviews if any(w in s for w in domain)]
joined_reviews = ' '.join(reviews)
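
One caveat: `w in s` is a substring test, so 'oven' also matches 'woven' and 'frig' matches 'frigid'. If that lets in too many off-domain reviews, a stricter whole-word filter could look like the sketch below. The shortened `domain_words` list, the `domain_re` pattern and the sample reviews are illustrative assumptions, not the filter used above.

```python
import re

domain_words = ['fridge', 'freezer', 'ice machine', 'oven', 'frig']   # shortened, illustrative list

# \b anchors each alternative at word boundaries, so 'oven' no longer matches inside 'woven'
domain_re = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in domain_words) + r')\b')

sample_reviews = [
    'the fridge keeps everything cold',
    'a nicely woven basket',
    'our ice machine broke after a month',
    'frigid weather outside',
]

# substring test, as in the cell above
substring_hits = [s for s in sample_reviews if any(w in s for w in domain_words)]
# whole-word test
whole_word_hits = [s for s in sample_reviews if domain_re.search(s)]
```

The trade-off is that whole-word matching no longer catches inflected forms implicitly, so plurals such as 'fridges' have to be listed explicitly.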

The next step is to tokenize the sentences for the Gensim word2vec function.

In [4]:
lmtzr = WordNetLemmatizer()
stop_words = nltk.corpus.stopwords.words('english')
stop_words = {w.replace("'", '') for w in stop_words}          # remove ' from stop words; a set keeps lookups fast

punct = '[' + re.escape(string.punctuation.replace('-', '')) + ']'   # character class (all punctuation except '-') for re.sub

tokenized_reviews = []
sentences = nltk.tokenize.sent_tokenize(joined_reviews)                         # tokenize reviews into sentences
with codecs.open('data/tech_domain.txt', 'w', 'utf-8') as out:                  # the with block closes the file when done
    for sent in sentences:
        sent = re.sub(punct, '', sent)                                          # remove punctuation
        tokens = nltk.tokenize.word_tokenize(sent)                              # then, tokenize the sentences into words
        tokens = [lmtzr.lemmatize(w) for w in tokens if w not in stop_words]    # remove stop words and apply lemmatization
        if len(tokens) > 0:
            tokenized_reviews.append(tokens)
            out.write(' '.join(tokens) + '\n')
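
To see what the two cleaning rules do, here is a small stdlib-only sketch. It uses `str.split` instead of NLTK's tokenizer, skips lemmatization, and relies on a tiny hand-written stop-word list. All three are simplifying assumptions made only to keep the example self-contained.

```python
import re
import string

# same idea as above: every punctuation mark except the hyphen
punct = '[' + re.escape(string.punctuation.replace('-', '')) + ']'

# hypothetical mini stop-word list; apostrophes are stripped so that
# "doesn't" matches the cleaned token "doesnt"
stop_words = {w.replace("'", '') for w in ["it's", 'a', 'and', 'the', "doesn't"]}

sent = "it's a frost-free fridge, and the ice-maker doesn't leak!"
cleaned = re.sub(punct, '', sent)                              # hyphenated words survive intact
tokens = [w for w in cleaned.split() if w not in stop_words]   # drop stop words
```

Keeping the hyphen preserves compound terms such as 'frost-free' as single tokens, which matters when the goal is to extract product aspects.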

We ended up with **more than 1.7 million** tokenized sentences. That is a small number compared to larger implementations, but more than enough for our purposes, and all of them come from reviews that mention our domain. Now we can create our embedding. It is generated with the *CBoW approach* with *negative sampling (5)*, a *window length* of 10 context words and a *word frequency threshold* of 5. Although most papers working with attention-based models and in-domain embeddings use 200 dimensions, we selected an embedding size of **300**, because we expect a larger embedding to fit our test set better, given that our dataset is very small.

In [5]:
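# CBoW (sg=0) with 5 negative samples, matching the settings described above.
# `size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`.
emb = gensim.models.Word2Vec(tokenized_reviews, sg=0, negative=5, window=10, size=300, min_count=5, workers=4)

emb.save('embeddings/refrigerator_emb')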
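
To make the *window* and *word frequency threshold* settings more concrete, the sketch below lists the (context, target) pairs a CBoW model would train on. It is a simplified illustration, not gensim's implementation: gensim additionally shrinks the window at random for each target and can subsample frequent words. The function name `cbow_pairs` and the toy sentences are hypothetical.

```python
from collections import Counter

def cbow_pairs(sentences, window=10, min_count=5):
    """Return (context_words, target_word) pairs for a fixed symmetric window."""
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        kept = [t for t in sent if counts[t] >= min_count]   # drop rare words first
        for i, target in enumerate(kept):
            # up to `window` words on each side of the target
            context = kept[max(0, i - window):i] + kept[i + 1:i + 1 + window]
            if context:
                pairs.append((context, target))
    return pairs

toy = [['ice', 'maker', 'stopped', 'working'], ['ice', 'maker', 'leak']]
pairs = cbow_pairs(toy, window=1, min_count=2)
```

In this toy corpus only 'ice' and 'maker' occur at least twice, so every other word is discarded before the windows are built. That is the effect the frequency threshold has on the real vocabulary.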



### Citation
[Justifying recommendations using distantly-labeled reviews and fine-grained aspects](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf) <br>
Jianmo Ni, Jiacheng Li, Julian McAuley <br>
Empirical Methods in Natural Language Processing (EMNLP), 2019