### Loading the Dataframe

In [31]:
import pandas as pd
df = pd.read_pickle("arxiv_data_cs_all.pickle.bz2")
print(len(df))

180644


In [209]:
!pip show urllib3

Name: urllib3
Version: 1.24.1
Summary: HTTP library with thread-safe connection pooling, file post, and more.
Home-page: https://urllib3.readthedocs.io/
Author: Andrey Petrov
Author-email: andrey.petrov@shazow.net
License: MIT
Location: /home/rclaret/anaconda3/envs/py36/lib/python3.6/site-packages
Requires: 
Required-by: requests, botocore


### Sanity Checks for the Dataframe

In [32]:
print(df.iloc[11,4])
print(df.iloc[11,5][0])

Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
Andy Coenen


### Loading the CountVectorizer

In [33]:
df_min = 2 #100
df_max = 0.1 #0.2

# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer
from gensim.matutils import Sparse2Corpus

cvec = CountVectorizer(min_df=df_min, max_df=df_max, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')

trans = cvec.fit_transform(df.summary.tolist())

corpus = Sparse2Corpus(trans, documents_columns=False)

id_map = dict((v, k) for k, v in cvec.vocabulary_.items())
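
The `token_pattern` above keeps only tokens made of at least three word characters, and `id_map` inverts the vocabulary so that column indices map back to words. The following is a minimal sketch of both ideas using only the standard library. The sample sentence and `toy_vocabulary` are made up for illustration, and the explicit `lower()` call stands in for the lowercasing that `CountVectorizer` applies by default; stop-word removal is a separate step not shown here.

```python
import re

# Same regular expression as the token_pattern passed to CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w\w+\b")

# Illustrative sentence (not taken from the dataset).
sample = "We fit an LDA model on 180k arXiv CS abstracts"
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)

# Hypothetical vocabulary with the same word -> column layout as cvec.vocabulary_.
toy_vocabulary = {"lda": 0, "model": 1, "arxiv": 2}

# Invert it to column -> word, as id_map does for the real vocabulary.
toy_id_map = {column: word for word, column in toy_vocabulary.items()}
print(toy_id_map[0])
```

Two-character tokens such as "we", "an", "on" and "cs" are dropped by the pattern itself, before any stop-word list is consulted.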

### Sanity Checks for the LDA Count Vectorizer

In [34]:
print(len(list(id_map)))
print(list(cvec.vocabulary_.keys())[11])
print("lda" in cvec.vocabulary_.keys())

66987
ignoring
True


### Loading the LDA Model

In [35]:
from gensim.models.ldamodel import LdaModel
n_topics = 10
n_passes = 50

model_path = "lda/arxiv_data_cs_all_ntopics_"+str(n_topics)+"_npasses_"+str(n_passes)+"_nvoc_"+str(len(list(id_map)))+"_dfmin_"+str(df_min)+"_dfmax_"+str(int(df_max*100))+".model"
ldamodel = LdaModel.load(model_path)
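
The model file name encodes the hyperparameters and the vocabulary size, and the same pattern is reused later for the papers matrix. As a sketch, the concatenation can be wrapped in a small helper; `build_artifact_path` is a hypothetical name introduced here, not something defined in this notebook, and the values passed in are the ones printed above.

```python
def build_artifact_path(prefix, suffix, n_topics, n_passes, n_vocab, df_min, df_max):
    """Hypothetical helper mirroring the file-name pattern used in this notebook."""
    return (
        f"{prefix}arxiv_data_cs_all_ntopics_{n_topics}_npasses_{n_passes}"
        f"_nvoc_{n_vocab}_dfmin_{df_min}_dfmax_{int(df_max * 100)}{suffix}"
    )

# Values taken from the cells above: 10 topics, 50 passes, 66987 vocabulary entries.
model_path = build_artifact_path("lda/", ".model", 10, 50, 66987, 2, 0.1)
print(model_path)
```

Because `int()` truncates, some `df_max` values could map to an unexpected percentage through floating-point rounding, so the result is worth checking when the threshold changes.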

### Sanity Checks for the LDA Model

In [36]:
ldamodel.print_topics(num_topics=n_topics, num_words=5)

[(0,
  '0.010*"distribution" + 0.009*"probability" + 0.009*"random" + 0.009*"function" + 0.007*"error"'),
 (1,
  '0.014*"channel" + 0.011*"power" + 0.011*"energy" + 0.009*"rate" + 0.008*"scheme"'),
 (2,
  '0.015*"matrix" + 0.010*"sparse" + 0.008*"signal" + 0.008*"dimensional" + 0.007*"space"'),
 (3,
  '0.011*"logic" + 0.010*"quantum" + 0.009*"theory" + 0.006*"properties" + 0.006*"complexity"'),
 (4,
  '0.015*"classification" + 0.014*"training" + 0.012*"features" + 0.010*"neural" + 0.009*"feature"'),
 (5,
  '0.022*"graph" + 0.014*"graphs" + 0.013*"bound" + 0.013*"codes" + 0.010*"bounds"'),
 (6,
  '0.011*"social" + 0.009*"research" + 0.008*"users" + 0.007*"user" + 0.005*"web"'),
 (7,
  '0.012*"software" + 0.008*"search" + 0.008*"computing" + 0.007*"implementation" + 0.007*"design"'),
 (8,
  '0.020*"control" + 0.012*"game" + 0.010*"agents" + 0.009*"games" + 0.008*"agent"'),
 (9,
  '0.030*"image" + 0.018*"images" + 0.013*"detection" + 0.011*"video" + 0.010*"object"')]

### Abstract to get similarities from

In [65]:
text_abstract = "We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models."

#text_abstract = "Argus exploits a Multi-Agent Reinforcement Learning (MARL) framework to create a 3D mapping of the disaster scene using agents present around the incident zone to facilitate the rescue operations. The agents can be both human bystanders at the disaster scene as well as drones or robots that can assist the humans. The agents are involved in capturing the images of the scene using their smartphones (or on-board cameras in case of drones) as directed by the MARL algorithm. These images are used to build real time a 3D map of the disaster scene. Via both simulations and real experiments, an evaluation of the framework in terms of effectiveness in tracking random dynamicity of the environment is presented."

### Get related topics

In [66]:
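from gensim.matutils import Sparse2Corpus
import numpy as np

def get_similar_topics_distribution(abst_to_match):
    # Dense vector of topic weights; topics not returned by the model stay at 0.
    topics_array = np.zeros(n_topics)

    # Vectorize the single abstract with the already fitted CountVectorizer.
    trans = cvec.transform([abst_to_match])
    corpus = Sparse2Corpus(trans, documents_columns=False)
    # One document goes in, so take the first (and only) list of (topic_id, probability) pairs.
    results = list(ldamodel.get_document_topics(bow=corpus))[0]

    for topic_id, probability in results:
        topics_array[topic_id] = probability
    return topics_array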
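
gensim's `get_document_topics` returns a sparse list of `(topic_id, probability)` pairs and leaves out topics whose probability falls below the model's `minimum_probability` threshold, which is why the function above starts from a vector of zeros. A minimal sketch of that densifying step in plain Python; `to_dense` is a hypothetical name, and the pairs are the rounded weights of the three strongest topics printed for the example abstract.

```python
def to_dense(sparse_topics, n_topics):
    """Expand (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * n_topics
    for topic_id, probability in sparse_topics:
        dense[topic_id] = probability
    return dense

# Rounded weights of the three strongest topics for the example abstract.
example_pairs = [(4, 0.27), (0, 0.26), (2, 0.18)]
dense = to_dense(example_pairs, n_topics=10)
print(dense)
```

Topics missing from the sparse list end up with weight 0.0, so two such vectors can be compared position by position.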

In [67]:
results = []
lda_topics_with_5_words = ldamodel.print_topics(num_topics=n_topics, num_words=5)
#lda_topics_with_5_words = ldamodel.get_topic_terms(6, topn=10)

for i, item in enumerate(get_similar_topics_distribution(text_abstract)):
    if item > 0:
        results.append([i,item])

results.sort(key=lambda x: x[1], reverse=True)

#print(results)

for r in results:
    print(round(r[1],2), lda_topics_with_5_words[r[0]],"\n")

0.27 (4, '0.015*"classification" + 0.014*"training" + 0.012*"features" + 0.010*"neural" + 0.009*"feature"') 

0.26 (0, '0.010*"distribution" + 0.009*"probability" + 0.009*"random" + 0.009*"function" + 0.007*"error"') 

0.18 (2, '0.015*"matrix" + 0.010*"sparse" + 0.008*"signal" + 0.008*"dimensional" + 0.007*"space"') 

0.14 (7, '0.012*"software" + 0.008*"search" + 0.008*"computing" + 0.007*"implementation" + 0.007*"design"') 

0.1 (5, '0.022*"graph" + 0.014*"graphs" + 0.013*"bound" + 0.013*"codes" + 0.010*"bounds"') 

0.04 (1, '0.014*"channel" + 0.011*"power" + 0.011*"energy" + 0.009*"rate" + 0.008*"scheme"') 



### Load the Papers Matrix Dataframe

In [98]:
df_papers_path = "pickles/arxiv_data_cs_all_ntopics_"+str(n_topics)+"_npasses_"+str(n_passes)+"_nvoc_"+str(len(list(id_map)))+"_dfmin_"+str(df_min)+"_dfmax_"+str(int(df_max*100))+"_paperdf.pickle.bz2"
df_papers = pd.read_pickle(df_papers_path)

### Sanity Checks Papers Matrix

In [99]:
df_papers.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 | 180644.0 |
| mean | 0.106808 | 0.117729 | 0.086715 | 0.096656 | 0.102931 | 0.116628 | 0.119399 | 0.105897 | 0.068867 | 0.069535 |
| std | 0.153645 | 0.209588 | 0.145671 | 0.176655 | 0.181113 | 0.197873 | 0.185187 | 0.153533 | 0.121295 | 0.145008 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.035985 | 0.0 | 0.015821 | 0.0 | 0.0 | 0.018142 | 0.026214 | 0.033596 | 0.0 | 0.0 |
| 75% | 0.160989 | 0.129646 | 0.112872 | 0.105934 | 0.127626 | 0.141316 | 0.165402 | 0.158722 | 0.084873 | 0.061561 |
| max | 0.989652 | 0.992241 | 0.983632 | 0.986153 | 0.990321 | 0.990623 | 0.98269 | 0.969994 | 0.963993 | 0.988155 |


In [100]:
df_papers.head(2)

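| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.080399 | 0.0 | 0.24942 | 0.063365 | 0.0 | 0.0 | 0.0 | 0.599737 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.612595 | 0.023915 | 0.011652 | 0.341914 | 0.0 | 0.0 |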


## Get similarity

### Important features

In [80]:
results_topics = []
for r in results:
    results_topics.append(r[0])
print(results_topics)

[4, 0, 2, 7, 5, 1]


### Filter out unrelated rows

In [145]:
df_filtered_papers = df_papers.copy()
print("Original dataset size",len(df_filtered_papers))

# Keep only papers with a non-zero weight for every topic found in the query abstract.
for r in results:
    df_filtered_papers = df_filtered_papers.where(df_filtered_papers[r[0]]>0).dropna()
print("Filtered dataset size",len(df_filtered_papers))

Original dataset size 180644
Filtered dataset size 2708
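
The loop above keeps a paper only if its weight is strictly positive for every topic detected in the query abstract. As a sketch, the same selection can be written as a single boolean mask. The small frame below is made up for illustration (its values are not from the dataset), `query_topics` is a hypothetical stand-in for the topic ids in `results`, and integer column labels are assumed, as in `df_papers`.

```python
import pandas as pd

# Toy topic matrix: rows are papers, columns are topic ids.
toy_papers = pd.DataFrame(
    {0: [0.5, 0.0, 0.2], 1: [0.3, 0.6, 0.0], 2: [0.2, 0.4, 0.8]}
)
query_topics = [0, 2]  # hypothetical topics present in the query

# Loop form, as in the notebook: mask non-matching rows to NaN, then drop them.
looped = toy_papers.copy()
for topic in query_topics:
    looped = looped.where(looped[topic] > 0).dropna()

# Single-mask form: a row survives if all query topics have a positive weight.
mask = (toy_papers[query_topics] > 0).all(axis=1)
masked = toy_papers[mask]

print(looped.index.tolist())
print(masked.index.tolist())
```

One difference worth keeping in mind: `dropna()` in the loop form would also discard any row that already contains a missing value, whereas the mask form looks only at the query topics.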


### Weighting the related rows

In [151]:
df_weighted_papers_filtered = df_filtered_papers.copy()
print("Filtered dataset size",len(df_weighted_papers_filtered))
for i in range(len(results)-1):
    df_weighted_papers_filtered = df_weighted_papers_filtered.where(df_weighted_papers_filtered[results[i][0]]>df_weighted_papers_filtered[results[i+1][0]]).dropna()

print("Weighted dataset size",len(df_weighted_papers_filtered))

Filtered dataset size 2708
Weighted dataset size 15
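
The weighting step keeps a paper only when its topic weights are strictly decreasing along the query's topic ranking (`[4, 0, 2, 7, 5, 1]` for the example abstract). A minimal sketch of that criterion in plain Python; `follows_ranking`, the shortened ranking, and the two toy papers are all made up for illustration.

```python
def follows_ranking(topic_weights, ranked_topics):
    """Return True if the weights strictly decrease along ranked_topics."""
    return all(
        topic_weights[a] > topic_weights[b]
        for a, b in zip(ranked_topics, ranked_topics[1:])
    )

ranked_topics = [4, 0, 2]  # shortened, hypothetical ranking

paper_a = {0: 0.3, 2: 0.1, 4: 0.6}  # same order as the query: kept
paper_b = {0: 0.5, 2: 0.1, 4: 0.4}  # topic 0 outweighs topic 4: dropped

print(follows_ranking(paper_a, ranked_topics))
print(follows_ranking(paper_b, ranked_topics))
```

This is a strict criterion: a single pair of adjacent topics in the wrong order removes the paper, which is consistent with the drop from 2708 to 15 rows above.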


### Retrieve related articles

In [149]:
#df_weighted_papers_filtered.head(5)

In [152]:
similar_papers_index = df_weighted_papers_filtered.index.values.astype(int)
similar_papers_index

array([  5245,   8723,  10651,  12922,  26187,  36022,  38129,  47580,
        73065,  73689,  73767,  90315, 102916, 110867, 141588])

In [210]:
similar_papers_by_index = []
for i in similar_papers_index:
    similar_papers_by_index.append([i,df.iloc[i,0]])
similar_papers_by_index

[[5245, 'http://arxiv.org/abs/1901.03706v5'],
 [8723, 'http://arxiv.org/abs/1904.12933v1'],
 [10651, 'http://arxiv.org/abs/1904.08688v1'],
 [12922, 'http://arxiv.org/abs/1811.09341v2'],
 [26187, 'http://arxiv.org/abs/1806.07366v4'],
 [36022, 'http://arxiv.org/abs/1811.00894v1'],
 [38129, 'http://arxiv.org/abs/1810.07491v1'],
 [47580, 'http://arxiv.org/abs/0911.4983v1'],
 [73065, 'http://arxiv.org/abs/1708.05464v1'],
 [73689, 'http://arxiv.org/abs/1708.03052v1'],
 [73767, 'http://arxiv.org/abs/1704.02124v2'],
 [90315, 'http://arxiv.org/abs/1612.08882v2'],
 [102916, 'http://arxiv.org/abs/1606.07369v1'],
 [110867, 'http://arxiv.org/abs/1511.05078v2'],
 [141588, 'http://arxiv.org/abs/1507.01826v2']]

In [211]:
similar_papers_by_index[9]

[73689, 'http://arxiv.org/abs/1708.03052v1']

### Rank retrieved articles

In [154]:
def get_score(abstract_1, abstract_2):
    return np.dot(get_similar_topics_distribution(abstract_1),
                  get_similar_topics_distribution(abstract_2))

In [212]:
similar_papers_summaries = []
for i in similar_papers_index:
    similar_papers_summaries.append(df.iloc[i,4])

In [213]:
similar_papers_ranked_by_id = []
for i, abstract in enumerate(similar_papers_summaries):
    if text_abstract == abstract:
        print("me at:",i)
        continue
    similar_papers_ranked_by_id.append([i, get_score(text_abstract, abstract)])

similar_papers_ranked_by_id = sorted(similar_papers_ranked_by_id, key=lambda x: x[1], reverse=True)
print(similar_papers_ranked_by_id)

me at: 4
[[0, 0.2545948518225215], [9, 0.2156067581129495], [5, 0.20817100388195325], [3, 0.20246628537173106], [10, 0.19847078589184258], [6, 0.19455812937375738], [1, 0.17897905241003836], [11, 0.1786961457787899], [2, 0.17338374635551135], [12, 0.16507263437871061], [14, 0.1537525284028094], [8, 0.10652350212058317], [7, 0.06271363104310562], [13, 0.04201339097344092]]


### Display the ranked similar articles

In [214]:
ranked_papers = []
for r in similar_papers_ranked_by_id:
    ranked_papers.append(similar_papers_by_index[r[0]])
ranked_papers

[[5245, 'http://arxiv.org/abs/1901.03706v5'],
 [73689, 'http://arxiv.org/abs/1708.03052v1'],
 [36022, 'http://arxiv.org/abs/1811.00894v1'],
 [12922, 'http://arxiv.org/abs/1811.09341v2'],
 [73767, 'http://arxiv.org/abs/1704.02124v2'],
 [38129, 'http://arxiv.org/abs/1810.07491v1'],
 [8723, 'http://arxiv.org/abs/1904.12933v1'],
 [90315, 'http://arxiv.org/abs/1612.08882v2'],
 [10651, 'http://arxiv.org/abs/1904.08688v1'],
 [102916, 'http://arxiv.org/abs/1606.07369v1'],
 [141588, 'http://arxiv.org/abs/1507.01826v2'],
 [73065, 'http://arxiv.org/abs/1708.05464v1'],
 [47580, 'http://arxiv.org/abs/0911.4983v1'],
 [110867, 'http://arxiv.org/abs/1511.05078v2']]
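
The first element of each ranked pair is a position within `similar_papers_index`, not a row of `df`, so it has to be mapped back before looking anything up in `df`. A minimal sketch of that mapping, using the row labels printed earlier and the first three ranked pairs with their scores rounded for brevity; `ranked_positions` and `ranked_rows` are names introduced here for illustration.

```python
# Row labels of the 15 candidate papers (copied from the similar_papers_index output).
similar_papers_index = [5245, 8723, 10651, 12922, 26187, 36022, 38129, 47580,
                        73065, 73689, 73767, 90315, 102916, 110867, 141588]

# First three [position, score] pairs from the ranking output, scores rounded.
ranked_positions = [[0, 0.2546], [9, 0.2156], [5, 0.2082]]

# Translate each position into the corresponding row label of df.
ranked_rows = [similar_papers_index[position] for position, _ in ranked_positions]
print(ranked_rows)
```

The resulting row labels match the first three entries of `ranked_papers` above.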

In [215]:
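# Note: similar_papers_ranked_by_id[0][0] is a position within similar_papers_index (here 0),
# so this shows df.summary[0] rather than the top-ranked paper;
# ranked_papers[0][0] holds the df row of the top-ranked paper.
df.summary[similar_papers_ranked_by_id[0][0]]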

"Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world images and produces a triangle mesh giving the full 3D shape of each detected object. Our system, called Mesh R-CNN, augments Mask R-CNN with a mesh prediction branch that outputs meshes with varying topological structure by first predicting coarse voxel representations which are converted to meshes and refined with a graph convolution network operating over the mesh's vertices and edges. We validate our mesh prediction branch on ShapeNet, where we outperform prior work on single-image shape prediction. We then deploy our full Mesh R-CNN system on Pix3D, where we jointly detect objects and predi