# NLP-based Recommenders

Compare models for the NLP-based recommender of the Engine

(Refactored from `Computing Similarity using Latent Semantic Analysis on a personal corpus` NB by Liad)

Imports

In [1]:
import pandas as pd
import numpy as np
import scipy.stats # explicit submodule import; scipy.stats.mstats.gmean is used further down
import matplotlib.pyplot as plt

import os, re, glob, ujson, urllib.request
from pathlib import Path

import nltk
nltk.download("punkt", download_dir="nltk/")
nltk.data.path.append("nltk/")
from nltk.corpus import stopwords

from gensim.corpora import Dictionary, MmCorpus
from gensim import models, utils, similarities

%matplotlib inline

[nltk_data] Downloading package punkt to nltk/...
[nltk_data]   Package punkt is already up-to-date!


# Preprocessing
Using papers from `Reasoning Corpus/` directory

In [2]:
class Corpus(object):
    """
    Memory efficient representation of corpus
    """
    def __iter__(self):
        for file in glob.glob("data/Reasoning_Corpus/*.txt"):
            print(file)
            paper = Path(file).read_text(encoding="utf8")
            yield paper
    
    @property
    def titles(self): # TBD
        titles = []
        for file in glob.glob("data/Reasoning_Corpus/*.txt"):
            f_name = os.path.split(file)[-1]
            title = os.path.splitext(f_name)[0]
            titles.append(title)
        return titles
            
corpus_memory_friendly = Corpus()
papers = list(corpus_memory_friendly)
papers_titles = corpus_memory_friendly.titles

data/Reasoning_Corpus/Aspect-augmented Adversarial Networks for Domain Adaptation.txt
data/Reasoning_Corpus/Rationalizing Neural Predictions.txt
data/Reasoning_Corpus/Explaining the Predictions of Any Classifier.txt
data/Reasoning_Corpus/Representation Learning for Grounded Spatial Reasoning.txt


## Tokenization

In [3]:
SPECIAL_CHARS = "[^A-Za-z0-9 ]+" # TBD
stopword_list = stopwords.words("english")
stopword_list.extend(["et", "al"])

def tokenize(text):
    """
    Tokenize input text
    :param text: Text sequence
    :return: [tokens]
    """
    tokens = [re.sub(SPECIAL_CHARS, "", word.lower()) for word in nltk.word_tokenize(text)] # remove special chars
    tokens = [re.sub(r"^arxiv.*", "", token) for token in tokens] # remove arxiv refs
    tokens = [re.sub(r"\b[0-9][0-9.,-]*\b", "UNIFIED-NUMBER-TOKEN", token) for token in tokens] # replace numbers with special token; TBD: Add yk, yth, yx etc
    tokens = [word for word in tokens if word not in stopword_list]
    tokens = list(filter(None, tokens))
    
    return tokens
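
The regex steps in `tokenize` can be tried in isolation. The following is a minimal sketch, not the notebook's function: it assumes plain whitespace splitting instead of `nltk.word_tokenize` and a tiny stand-in stopword set, and the name `tokenize_sketch` and the sample sentence are made up for illustration.

```python
import re

SPECIAL_CHARS = "[^A-Za-z0-9 ]+"
NUMBER_TOKEN = "UNIFIED-NUMBER-TOKEN"
# Assumption: a tiny stand-in stopword set; the notebook uses NLTK's English list.
STOPWORDS = {"the", "of", "in", "et", "al", "we", "on"}

def tokenize_sketch(text):
    """Apply the notebook's regex steps to whitespace-split words."""
    # Lowercase and strip special characters, e.g. "50,000" -> "50000"
    tokens = [re.sub(SPECIAL_CHARS, "", word.lower()) for word in text.split()]
    # Blank out arXiv references
    tokens = [re.sub(r"^arxiv.*", "", token) for token in tokens]
    # Replace pure numbers with one shared token
    tokens = [re.sub(r"\b[0-9][0-9.,-]*\b", NUMBER_TOKEN, token) for token in tokens]
    # Drop empty strings and stopwords
    return [token for token in tokens if token and token not in STOPWORDS]

sample = "We train on 50,000 images (arXiv:1705.00101) in the 2nd stage."
print(tokenize_sketch(sample))
```

Mixed tokens such as `2nd` pass through unchanged because the number pattern needs a word boundary directly after the digits, which is the case the `TBD` comment on `yk`, `yth`, `yx` refers to.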

In [4]:
doc_iterator = [list(tokenize(paper)) for paper in papers]

dictionary = Dictionary(doc_iterator) # gensim.corpora.Dictionary
len(dictionary)

4059

In [5]:
# See Dict
#dictionary.token2id

## Vectorization

BoW

In [6]:
corpus = [dictionary.doc2bow(doc) for doc in doc_iterator]
MmCorpus.serialize("data/training/reasoning_corpus.mm", corpus)
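
For readers unfamiliar with the bag-of-words format, the sketch below shows the idea behind `doc2bow` in plain Python: each document becomes a list of `(token_id, count)` pairs. The helper names `build_vocab` and `doc_to_bow` are hypothetical, and the id assignment order is an assumption that need not match gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Assign a consecutive integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are dropped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["model", "learning", "model"], ["domain", "model"]]
token2id = build_vocab(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]
print(bows)
# A new document only keeps the tokens that the vocabulary already knows
print(doc_to_bow(["model", "unseen"], token2id))
```

The second call matters for the recommender part of this notebook: abstracts are mapped with the dictionary built from the corpus, so words that never occur in the corpus do not contribute to their vectors.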

TF-IDF

In [7]:
tfidf = models.TfidfModel(corpus)
tfidf_corpus = tfidf[corpus]
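
The weighting itself can be sketched in a few lines. This version assumes raw counts as term frequency, `log2(N / df)` as inverse document frequency, and scaling of each document to unit length; that is an assumption about the `TfidfModel` defaults, not something this notebook states. The function name `tfidf_weights` is hypothetical.

```python
import math

def tfidf_weights(bows):
    """Weight (token_id, count) pairs by log2(N / df) and scale each document to unit length."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for bow in bows:
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]
        # Tokens that occur in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

bows = [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
weighted = tfidf_weights(bows)
print(weighted)
```

Token `0` appears in both toy documents, so it carries no weight after the transformation, which is the effect that keeps very common tokens from dominating the TF-IDF based topics.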

# Models

[Latent Semantic Indexing](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)

In [8]:
lsi_bow = models.LsiModel(corpus, id2word=dictionary, num_topics=20)
lsi_bow.show_topics()

[(0,
  '0.915*"UNIFIED-NUMBER-TOKEN" + 0.147*"model" + 0.091*"x" + 0.078*"learning" + 0.060*"explanations" + 0.058*"classifier" + 0.057*"dataset" + 0.054*"training" + 0.052*"set" + 0.052*"figure"'),
 (1,
  '0.332*"explanations" + 0.177*"predictions" + -0.175*"domain" + 0.165*"features" + 0.162*"lime" + -0.159*"aspect" + 0.151*"trust" + 0.150*"explanation" + 0.141*"interpretable" + 0.136*"users"'),
 (2,
  '0.254*"instructions" + 0.209*"goal" + -0.206*"domain" + -0.177*"classifier" + 0.165*"value" + 0.163*"language" + 0.147*"map" + 0.144*"global" + 0.139*"spatial" + 0.136*"instruction"'),
 (3,
  '0.263*"rationales" + 0.217*"rationale" + 0.189*"generator" + -0.170*"instructions" + 0.167*"z" + 0.167*"encoder" + 0.165*"neural" + -0.137*"model" + -0.137*"goal" + 0.127*"recurrent"')]
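
LSI topics are directions obtained from a truncated singular value decomposition of the document-term matrix, which helps explain why the most frequent token dominates topic 0 on raw counts. As a rough illustration only, the sketch below finds the leading direction with power iteration on a made-up count matrix; it does not reproduce how `LsiModel` computes its decomposition, and `leading_topic` is a hypothetical helper.

```python
import math

def leading_topic(matrix, n_iter=100):
    """Power iteration on A^T A: returns a unit-length estimate of the leading right singular vector."""
    n_terms = len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(n_iter):
        # u = A v (one value per document)
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        # w = A^T u (one value per term)
        w = [sum(matrix[d][t] * u[d] for d in range(len(matrix))) for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy document-term counts (rows: documents, columns: terms); the values are made up.
terms = ["number", "model", "domain"]
counts = [
    [9.0, 2.0, 0.0],
    [8.0, 1.0, 1.0],
    [9.0, 2.0, 2.0],
]
topic = leading_topic(counts)
for term, weight in sorted(zip(terms, topic), key=lambda pair: -abs(pair[1])):
    print(f'{weight:.3f}*"{term}"')
```

The term with the largest counts receives the largest weight, mirroring the `UNIFIED-NUMBER-TOKEN` entry in the BoW topics above.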

In [9]:
lsi_tfidf = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=10)
lsi_tfidf.show_topics()

[(0,
  '0.246*"rationale" + 0.215*"generator" + 0.208*"explanations" + 0.200*"lime" + 0.156*"rationales" + 0.155*"instructions" + 0.154*"z" + 0.154*"adversarial" + 0.150*"relevance" + 0.131*"pathology"'),
 (1,
  '0.512*"instructions" + 0.274*"instruction" + 0.256*"policy" + 0.247*"environment" + 0.155*"global" + 0.137*"uvfa" + 0.110*"locations" + 0.110*"environments" + 0.110*"agent" + -0.105*"adversarial"'),
 (2,
  '0.256*"adversarial" + 0.250*"relevance" + -0.249*"explanations" + -0.243*"lime" + 0.218*"pathology" + 0.205*"transfer" + 0.131*"source" + -0.117*"interpretable" + 0.116*"document" + -0.114*"explanation"'),
 (3,
  '0.330*"rationale" + 0.288*"generator" + -0.223*"explanations" + -0.222*"lime" + 0.200*"rationales" + 0.144*"gen" + 0.127*"zx" + 0.126*"encoder" + -0.122*"adversarial" + -0.119*"relevance"')]

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [10]:
lda_bow = models.LdaMulticore(corpus, id2word=dictionary, num_topics=10) # sounds faster than models.LdaModel ;)
lda_bow.show_topics()

[(0,
  '0.065*"UNIFIED-NUMBER-TOKEN" + 0.011*"model" + 0.007*"x" + 0.005*"learning" + 0.005*"explanations" + 0.004*"classifier" + 0.004*"set" + 0.004*"dataset" + 0.004*"figure" + 0.004*"use"'),
 (1,
  '0.045*"UNIFIED-NUMBER-TOKEN" + 0.011*"model" + 0.006*"x" + 0.005*"learning" + 0.004*"explanations" + 0.004*"classifier" + 0.004*"figure" + 0.003*"use" + 0.003*"dataset" + 0.003*"set"'),
 (2,
  '0.058*"UNIFIED-NUMBER-TOKEN" + 0.010*"model" + 0.006*"learning" + 0.006*"domain" + 0.005*"x" + 0.005*"classifier" + 0.005*"dataset" + 0.004*"set" + 0.004*"explanations" + 0.004*"training"'),
 (3,
  '0.039*"UNIFIED-NUMBER-TOKEN" + 0.007*"model" + 0.005*"x" + 0.004*"learning" + 0.003*"instructions" + 0.003*"figure" + 0.003*"set" + 0.003*"use" + 0.003*"using" + 0.003*"goal"'),
 (4,
  '0.059*"UNIFIED-NUMBER-TOKEN" + 0.010*"model" + 0.007*"learning" + 0.005*"x" + 0.004*"models" + 0.004*"training" + 0.004*"figure" + 0.004*"classifier" + 0.004*"dataset" + 0.004*"explanations"'),
 (5,
  '0.063*"UNIFIED-NU

## Evaluation

In [11]:
#TODO

# Recommenders

Fetch documents from the Crawler microservice. Currently limited to the 100 most recent papers. #TODO change to fetch from the DB when MongoDB is set up

In [12]:
with urllib.request.urlopen('https://keepcurrent-crawler.herokuapp.com/arxiv') as url:
    response = url.read()
        
crawled_docs = ujson.loads(response)
crawled_docs[:1]

[{'publish_date': '2018-06-21T17:59:09Z',
  'authors': ['Deepak Pathak',
   'Yide Shentu',
   'Dian Chen',
   'Pulkit Agrawal',
   'Trevor Darrell',
   'Sergey Levine',
   'Jitendra Malik'],
  'title': 'Learning Instance Segmentation by Interaction',
  'abstract': "We present an approach for building an active agent that learns to segment\nits visual observations into individual objects by interacting with its\nenvironment in a completely self-supervised manner. The agent uses its current\nsegmentation model to infer pixels that constitute objects and refines the\nsegmentation model by interacting with these pixels. The model learned from\nover 50K interactions generalizes to novel objects and backgrounds. To deal\nwith noisy training signal for segmenting objects obtained by self-supervised\ninteractions, we propose robust set loss. A dataset of robot's interactions\nalong-with a few human labeled examples is provided as a benchmark for future\nresearch. We test the utility of the lea

TBD: Uses abstracts only. #TODO change to the parsed document when PDF extraction is ready

In [13]:
abstracts = [doc["abstract"] for doc in crawled_docs]
titles = [doc["title"] for doc in crawled_docs]
authors = [doc["authors"] for doc in crawled_docs]
links = [doc["link"] for doc in crawled_docs]
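
Indexing with `doc["link"]` raises a `KeyError` as soon as one record lacks that field. A more tolerant variant could look like the sketch below; the payload is hypothetical and only shaped like the crawler response, and `extract_fields` is a made-up helper name.

```python
import json

def extract_fields(docs, fields=("abstract", "title", "authors", "link")):
    """Return one list per field, with None where a record lacks that field."""
    return {field: [doc.get(field) for doc in docs] for field in fields}

# Hypothetical payload; the second record has no "link" entry.
payload = (
    '[{"title": "Paper A", "abstract": "Text A", "authors": ["X"], '
    '"link": "https://example.org/a"}, '
    '{"title": "Paper B", "abstract": "Text B", "authors": ["Y", "Z"]}]'
)
columns = extract_fields(json.loads(payload))
print(columns["title"])
print(columns["link"])
```

All lists keep the same length and order, so positional indices used later for the recommendations still line up.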

Preprocess abstracts

In [14]:
abstract_tokens = [tokenize(abstract) for abstract in abstracts]
abstract_tokens_bow = [dictionary.doc2bow(abstract) for abstract in abstract_tokens]
abstract_tokens_tfidf = tfidf[abstract_tokens_bow]

vec_lsi_bow = [lsi_bow[bow_vector] for bow_vector in abstract_tokens_bow]
vec_lsi_tfidf = [lsi_tfidf[tfidf_vector] for tfidf_vector in abstract_tokens_tfidf]
vec_lda_bow = [lda_bow[bow_vector] for bow_vector in abstract_tokens_bow]

## Similarity matrices

Cosine similarity

In [15]:
lsi_bow_index = similarities.MatrixSimilarity(lsi_bow[corpus])
lsi_bow_index.save("models/indices/rationality_lsi_bow_mat_sim.index")
sims_lsi_bow = lsi_bow_index[vec_lsi_bow]

lsi_tfidf_index = similarities.MatrixSimilarity(lsi_tfidf[tfidf_corpus]) # index the TF-IDF corpus, since lsi_tfidf was trained on it
lsi_tfidf_index.save("models/indices/rationality_lsi_tfidf_mat_sim.index")
sims_lsi_tfidf = lsi_tfidf_index[vec_lsi_tfidf]

lda_bow_index = similarities.MatrixSimilarity(lda_bow[corpus])
lda_bow_index.save("models/indices/rationality_lda_bow_mat_sim.index")
sims_lda_bow = lda_bow_index[vec_lda_bow]

# Load index: index = similarities.MatrixSimilarity.load("models/indices/rationality_lsi_bow_mat_sim.index")
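
Each index lookup above yields one row per input abstract and one column per corpus document. The sketch below shows the underlying cosine computation on dense toy vectors; it is a plain-Python illustration under that assumption, not gensim's implementation, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(queries, index_docs):
    """Rows: query vectors (input abstracts); columns: indexed vectors (corpus papers)."""
    return [[cosine(q, d) for d in index_docs] for q in queries]

corpus_vecs = [[1.0, 0.0], [0.0, 2.0]]   # made-up topic-space vectors
query_vecs = [[3.0, 0.0], [1.0, 1.0]]
sims = similarity_matrix(query_vecs, corpus_vecs)
for row in sims:
    print([round(score, 3) for score in row])
```

Vector length does not matter, only direction: the first query is three times as long as the first corpus vector and still scores 1.0 against it.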

## Recommendations

In [16]:
TOP_N = 5

sims_df = pd.DataFrame(sims_lsi_bow)
sims_df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 0.763705 | 0.187163 | 0.117203 | 0.630858 | 0.812657 | 0.917121 | 0.981435 |
| 1 | 100.0 | 0.679451 | 0.205645 | 0.169724 | 0.545178 | 0.69505 | 0.866757 | 0.991578 |
| 2 | 100.0 | 0.817545 | 0.119416 | 0.347124 | 0.750919 | 0.826029 | 0.911356 | 0.984627 |
| 3 | 100.0 | 0.727028 | 0.18002 | 0.111811 | 0.640128 | 0.753224 | 0.876731 | 0.971581 |


### n-Docs most similar to corpus

E.g. the corpus represents your favourite docs, and you search the input docs for the n docs most similar to it.

Simple heuristic, based on similarity average:

In [17]:
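avg_sims_df = sims_df.copy() # copy, so the mean columns are not added to sims_df itself
avg_sims_df["a_mean"] = avg_sims_df.mean(axis=1)
avg_sims_df["g_mean"] = scipy.stats.mstats.gmean(avg_sims_df.iloc[:,:-1], axis=1) # geometric mean over the similarity columns (a_mean excluded), chosen because the similarities are skewed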

avg_sims_df_sorted = avg_sims_df.sort_values(by="g_mean", ascending=False)
avg_sims_df_sorted.head()

| | 0 | 1 | 2 | 3 | a_mean | g_mean |
|---|---|---|---|---|---|---|
| 7 | 0.939831 | 0.966941 | 0.91643 | 0.926384 | 0.937397 | 0.937206 |
| 35 | 0.922805 | 0.976067 | 0.922328 | 0.918191 | 0.934848 | 0.934549 |
| 8 | 0.967506 | 0.931213 | 0.925871 | 0.912378 | 0.934242 | 0.934022 |
| 1 | 0.962615 | 0.960259 | 0.915 | 0.899151 | 0.934256 | 0.933843 |
| 23 | 0.913568 | 0.919333 | 0.957195 | 0.939108 | 0.932301 | 0.932143 |
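
To see why ranking by the geometric mean can differ from ranking by the arithmetic mean, consider the toy comparison below. The similarity values are made up, and the helpers assume strictly positive inputs, which cosine similarities in LSI space do not guarantee.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean via logarithms; only defined for strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up similarities of two input docs to four corpus papers
rows = {
    "balanced": [0.90, 0.90, 0.90, 0.90],
    "uneven": [0.99, 0.99, 0.99, 0.63],
}
for name, sims in rows.items():
    print(name, round(arithmetic_mean(sims), 4), round(geometric_mean(sims), 4))
```

Both rows share the same arithmetic mean, but the geometric mean ranks the uneven row lower because a single weak match pulls it down more strongly.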


In [18]:
print("Most similar to corpus:\n\n")

for abstract_ix in avg_sims_df_sorted.index.values.tolist()[:TOP_N]:
    print("Index of document:", abstract_ix)
    print("Document:", titles[abstract_ix], "by", authors[abstract_ix])
    print("\nAbstract:", abstracts[abstract_ix])
    print("\n\n")

Most similar to corpus:


Index of document: 7
Document: Fashion-Gen: The Generative Fashion Dataset and Challenge by ['Negar Rostamzadeh', 'Seyedarian Hosseini', 'Thomas Boquet', 'Wojciech Stokowiec', 'Ying Zhang', 'Christian Jauvin', 'Chris Pal']

Abstract: We introduce a new dataset of 293,008 high definition (1360 x 1360 pixels)
fashion images paired with item descriptions provided by professional stylists.
Each item is photographed from a variety of angles. We provide baseline results
on 1) high-resolution image generation, and 2) image generation conditioned on
the given text descriptions. We invite the community to improve upon these
baselines. In this paper, we also outline the details of a challenge that we
are launching based upon this dataset.



Index of document: 35
Document: DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural
  Architectures by ['Jin-Dong Dong', 'An-Chieh Cheng', 'Da-Cheng Juan', 'Wei Wei', 'Min Sun']

Abstract: Recent breakthroughs in Neu

### n-Docs most similar pairs

E.g. the corpus consists of all processed papers, and you search it for the n docs most similar to the input docs.

In [19]:
top_n_pairs = sims_df.stack().nlargest(TOP_N)
top_n_pairs

32  1    0.991578
53  2    0.984627
86  0    0.981435
25  1    0.979233
43  0    0.978500
dtype: float32
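
The same top-n selection can be written without pandas, which may make the (input document, corpus document) pairing more explicit. This is a minimal sketch on a made-up similarity matrix; the function name `top_n_pairs_sketch` is hypothetical.

```python
import heapq

def top_n_pairs_sketch(sims, n):
    """Return the n largest (score, query_ix, corpus_ix) triples from a 2-D list of similarities."""
    triples = (
        (score, query_ix, corpus_ix)
        for query_ix, row in enumerate(sims)
        for corpus_ix, score in enumerate(row)
    )
    return heapq.nlargest(n, triples)

# Rows: input documents, columns: corpus documents (made-up values)
sims = [[0.20, 0.90], [0.95, 0.10], [0.50, 0.60]]
best = top_n_pairs_sketch(sims, 2)
for score, query_ix, corpus_ix in best:
    print(query_ix, corpus_ix, score)
```

`heapq.nlargest` avoids sorting the full matrix, which only starts to matter once the corpus grows well beyond the four papers used here.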

In [20]:
print("Most similar pairs in descending order:\n\n")

# Each index entry of the stacked Series is an (abstract_ix, corpus_ix) tuple
for abstract_ix, corpus_ix in top_n_pairs.index:
    print("Pair:", abstract_ix, "/", corpus_ix)
    print("Input document:", titles[abstract_ix], "by", authors[abstract_ix])
    print("Corpus document:", papers_titles[corpus_ix])
    print("")

Most similar pairs in descending order:


Pair: 32 / 1
Input document: Synaptic partner prediction from point annotations in insect brains by ['Julia Buhmann', 'Renate Krause', 'Rodrigo Ceballos Lentini', 'Nils Eckstein', 'Matthew Cook', 'Srinivas Turaga', 'Jan Funke']
Corpus document: Rationalizing Neural Predictions

Pair: 53 / 2
Input document: Reservoir Computing Hardware with Cellular Automata by ['Alejandro Morán', 'Christiam F. Frasser', 'Josep L. Rosselló']
Corpus document: Explaining the Predictions of Any Classifier

Pair: 86 / 0
Input document: Lifted Neural Networks by ['Armin Askari', 'Geoffrey Negiar', 'Rajiv Sambharya', 'Laurent El Ghaoui']
Corpus document: Aspect-augmented Adversarial Networks for Domain Adaptation

Pair: 25 / 1
Input document: Emotional Conversation Generation Orientated Syntactically Constrained
  Bidirectional-Asynchronous Framework by ['Xiao Sun', 'Jingyuan Li', 'Jianhua Tao']
Corpus document: Rationalizing Neural Predictions

Pair: 43 / 0
Input do