<div align='center' ><font size='70'>Data Challenge</font></div>



<center>
YANG Yining & JIN Zhongwei

April 10, 2022
</center>



In [1]:
import csv
import networkx as nx
import numpy as np
import pandas as pd
import pickle

In [2]:
# packages for NLP
import nltk
for path in nltk.data.path:
    print(path)

/Users/yangyining/nltk_data
/Users/yangyining/opt/anaconda3/envs/pytorch/nltk_data
/Users/yangyining/opt/anaconda3/envs/pytorch/share/nltk_data
/Users/yangyining/opt/anaconda3/envs/pytorch/lib/nltk_data
/usr/share/nltk_data
/usr/local/share/nltk_data
/usr/lib/nltk_data
/usr/local/lib/nltk_data


In [3]:
from collections import OrderedDict
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from spacy.matcher import Matcher

In [None]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

## Part 1 -- Task Description

Link prediction is the problem of predicting whether a link exists between two entities of a network. Here we are given a citation network of 138499 papers, together with their abstracts and authors, and a file listing 1091955 existing edges. This edge list is incomplete: for each test edge, we are asked to predict whether it exists and the probability of its existence.

Here, we divide our workflow into several <u>tasks</u>:
- Feature engineering: exploring and constructing the edge features
- Model construction & training
- Evaluations & test results



In [12]:
# print the file tree
import os
for dirname, _, filenames in os.walk('../'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../.DS_Store
../submission.csv
../Docs/.DS_Store
../Docs/data_challenge_handout.pdf
../data/edgelist.txt
../data/abstracts.txt
../data/test.txt
../data/authors.txt
../data/generated_data/abs_w2v_model_withmincount1.pkl
../data/generated_data/athrs_w2v_model.pkl
../data/generated_data/authors.pkl
../data/generated_data/abstract_w.pkl
../data/generated_data/tr4w.pkl
../data/generated_data/abs_w2v_model.pkl
../src/text_baseline.py
../src/Text_authors_preprocessing.py
../src/graph_baseline.py
../src/Text_abstracts_preprocessing.py
../src/Datachallenge_Y&J.ipynb
../src/__pycache__/Text_abstracts_preprocessing.cpython-37.pyc
../src/__pycache__/Text_authors_preprocessing.cpython-37.pyc


## Part 2 -- Feature Engineering

In this part, our aim is to extract information from the different dimensions of the data.

There are two main sources of edge features:
- From ``graph`` structure: degrees, ranks,...
- Converting the paper (<u>node</u> in the graph) attributes including ``authors`` and ``abstracts`` into citation (<u>edge</u> in the graph) features: similarities, ...

The first kind of feature is available as soon as the graph is built: we read the ``graph`` from the edge list and use its intrinsic properties. The second kind requires some natural language preprocessing first.
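
As an illustration of what graph-based edge features can look like, here is a minimal sketch on a toy graph. The helper `pair_graph_features` and the feature names are hypothetical and are not part of the notebook; degree and neighbourhood overlap are only examples of structural features, chosen here for simplicity.

```python
def pair_graph_features(adj, u, v):
    """Return simple structural features for the candidate edge (u, v).

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "deg_sum": len(nu) + len(nv),
        "deg_abs_diff": abs(len(nu) - len(nv)),
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }

# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

feats = pair_graph_features(adj, 0, 3)
print(feats)
```

On the real data, the same quantities could presumably be read from the `networkx` graph instead of a hand-built adjacency dictionary.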

In [5]:
G = nx.read_edgelist('../data/edgelist.txt', delimiter=',',
                     create_using=nx.Graph(), nodetype=int)  ## import the graph from the edgelist file.
nodes = list(G.nodes())
n = G.number_of_nodes()
m = G.number_of_edges()
print('Number of nodes:', n)
print('Number of edges:', m)

Number of nodes: 138499
Number of edges: 1091955


In [12]:
max(G.nodes())

138498

For the language processing, we apply different techniques to the ``abstracts`` and ``authors`` texts.

> ### abstracts.txt
>1. Use the ``TextRank`` algorithm to extract keywords from the sentences of each abstract. (https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0)
>2. Train a ``Word2Vec`` representation of the abstract words (dimension 100).
>3. Build article features from the ``keyword vectors`` (10 keywords).

> ### authors.txt
>1. Train a ``Word2Vec`` representation of the authors (dimension 100).
>2. Build article features from the ``author vectors``.
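
The idea behind the TextRank step can be sketched in a few lines: candidate words that co-occur within a small window are linked, and a PageRank-style iteration scores each word. This is a simplified stand-in, not the `TextRankForKeyword` class used in this notebook; the function `textrank_keywords`, its parameters and the toy token list are assumptions made for illustration, and the tokens are assumed to be already filtered (for example by part of speech).

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, steps=30, top_k=2):
    """Rank tokens by a PageRank-style score on their co-occurrence graph."""
    # Link every pair of distinct tokens that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbours.
    score = {w: 1.0 for w in neighbors}
    for _ in range(steps):
        score = {
            w: (1 - damping)
            + damping * sum(score[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_k]

# Toy, already-filtered token sequence
tokens = ["image", "sensor", "image", "camera", "image", "pixel", "camera"]
keywords = textrank_keywords(tokens)
print(keywords)
```

In this toy sequence, "image" is linked to every other word, so it receives the highest score.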

In [6]:
# TextRankForKeyword, our implementation of the TextRank algorithm, is defined in Text_abstracts_preprocessing.py
#import Text_abstracts_preprocessing 
from Text_abstracts_preprocessing import TextRankForKeyword


In [7]:
from Text_abstracts_preprocessing import abstracts

In [33]:
# load the pickled TextRank results
with open('../data/generated_data/tr4w.pkl','rb') as abstracts_read:
  tr4w_input_abstracts = pickle.load(abstracts_read)

In [34]:
tr4w_abstracts = dict()
for node in tr4w_input_abstracts:
  tr4w_abstracts[node] = pickle.loads(tr4w_input_abstracts[node])

In [18]:
nlp = spacy.load('en_core_web_sm')
# for node in abstracts:
#   print(tr4w_abstracts[node].sentence_segment(nlp(abstracts[node]),candidate_pos=['NOUN', 'PROPN', 'ADJ'],lower =True))
#   if node==3:
#     break

### Word extraction for the Word2Vec embedding.
### This step takes a long time; loading the result from the pickle file instead is recommended.

abstract_data = [
    tr4w_abstracts[node].sentence_segment(nlp(abstracts[node]), candidate_pos=['NOUN', 'PROPN', 'ADJ'], lower=True)
    for node in G.nodes()
]

# # store the abstract_data into .pkl file
# with open('../data/generated_data/abstract_w.pkl','wb') as abstract_data_file:
#   pickle.dump(abstract_data,abstract_data_file)
#   abstract_data_file.close()




In [114]:
# load the precomputed word lists (the with-statement closes the file automatically)
with open('../data/generated_data/abstract_w.pkl','rb') as abstract_data_file:
  abstract_data = pickle.load(abstract_data_file)


In [29]:
# flatten the per-abstract sentence lists into a single list of sentences (each a list of words)
abs_w2v_model_data = list()
for node in abstracts:
  for sent in abstract_data[node]:
    abs_w2v_model_data.append(sent)


# with open('../data/generated_data/abstract_w2v_model_input.pkl','wb') as abstract_inputdata_file:
#   pickle.dump(abstract_data,abstract_inputdata_file)
#   abstract_inputdata_file.close()

In [35]:
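# CBOW model (sg=0) with 100-dimensional vectors; `size` is the gensim 3.x parameter name (`vector_size` in gensim 4.x)
abs_w2v_model = Word2Vec(size=100, window=5, min_count=1, sg=0, workers=8)
abs_w2v_model.build_vocab(abs_w2v_model_data)
abs_w2v_model.train(abs_w2v_model_data, total_examples=abs_w2v_model.corpus_count, epochs=5)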


(41580969, 45747750)

In [113]:
# store the model into .pkl file
# (the model is pickled to bytes first, so loading needs pickle.load followed by pickle.loads)
with open('../data/generated_data/abs_w2v_model_withmincount1.pkl','wb') as abs_w2v_model_file:
  model_bytes = pickle.dumps(abs_w2v_model)
  pickle.dump(model_bytes,abs_w2v_model_file)

# with open('../data/generated_data/abs_w2v_model_withmincount1.pkl','rb') as abs_w2v_model_file:
#   abs_w2v_model = pickle.load(abs_w2v_model_file)
#   abs_w2v_model = pickle.loads(abs_w2v_model)
#   abs_w2v_model_file.close()

In [105]:
abstracts[0]

'The development of an automated system for the quality assessment of aerodrome ground lighting (AGL), in accordance with associated standards and recommendations, is presented. The system is composed of an image sensor, placed inside the cockpit of an aircraft to record images of the AGL during a normal descent to an aerodrome. A model-based methodology is used to ascertain the optimum match between a template of the AGL and the actual image data in order to calculate the position and orientation of the camera at the instant the image was acquired. The camera position and orientation data are used along with the pixel grey level for each imaged luminaire, to estimate a value for the luminous intensity of a given luminaire. This can then be compared with the expected brightness for that luminaire to ensure it is operating to the required standards. As such, a metric for the quality of the AGL pattern is determined. Experiments on real image data is presented to demonstrate the applicat

In [25]:
abs_w2v_model.wv.get_vector('agl')==abs_w2v_model.wv.word_vec('agl')

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

In [40]:
ws1 = list([i.lower() for i in OrderedDict(sorted(tr4w_abstracts[0].node_weight.items(), key=lambda t: t[1], reverse=True))])
ws2 = list([i.lower() for i in OrderedDict(sorted(tr4w_abstracts[100].node_weight.items(), key=lambda t: t[1], reverse=True))])

abs_w2v_model.wv.n_similarity(ws1[:10],ws2[:9])

0.47216004
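
To our understanding, `n_similarity` averages the word vectors of each set and returns the cosine similarity of the two mean vectors. The sketch below reproduces that computation with NumPy on random stand-in vectors; `mean_vector_cosine` is a hypothetical helper, not a gensim function.

```python
import numpy as np

def mean_vector_cosine(vecs_a, vecs_b):
    """Cosine similarity between the mean vectors of two sets of word vectors."""
    mean_a = np.mean(vecs_a, axis=0)
    mean_b = np.mean(vecs_b, axis=0)
    return float(mean_a @ mean_b / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b)))

# Random stand-ins for word vectors (not real embeddings): 10 and 9 "keywords"
rng = np.random.default_rng(0)
set_a = rng.normal(size=(10, 3))
set_b = rng.normal(size=(9, 3))

sim = mean_vector_cosine(set_a, set_b)
print(sim)
```

The two sets do not need the same number of words, which is why comparing 10 keywords against 9 works in the cell above.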

In [41]:
np.shape(np.array([abs_w2v_model.wv.get_vector(i).reshape((10,10)) for i in ws1[:8]]))

(8, 10, 10)

In [212]:
np.shape(np.array(np.broadcast_to(np.mean([abs_w2v_model.wv.get_vector(i).reshape((10,10)) for i in ws1[:8]],axis =0),(3,10,10))))

(3, 10, 10)

In [43]:
def node_abstract_feature(G,Word2Vec=abs_w2v_model,key_generator =tr4w_abstracts, f_reshape = (10,10),num_keyword=10):
  ## one block of shape f_reshape + (num_keyword,) per node, indexed by node id
  feature = np.zeros((len(G.nodes),f_reshape[0],f_reshape[1],num_keyword))
  for n in G.nodes():
    ## list of the node's keywords, sorted by decreasing TextRank weight
    keyws = [i.lower() for i, _ in sorted(key_generator[n].node_weight.items(), key=lambda t: t[1], reverse=True)]

    ## pick num_keyword vectors to build an array of shape f_reshape + (num_keyword,): typically (10,10,10)
    if len(keyws)<num_keyword:
      if len(keyws)>0:
        vecs = np.array([Word2Vec.wv.get_vector(i).reshape(f_reshape) for i in keyws])
        origi_feature = np.zeros((num_keyword,f_reshape[0],f_reshape[1]))
        origi_feature[:len(keyws)]=vecs
        ## pad the missing slots with the mean of the available keyword vectors
        origi_feature[len(keyws):]=vecs.mean(axis=0)
        feature[n]=origi_feature.transpose((1, 2, 0))
      else:
        ## nodes without any keyword get an all-zero block
        feature[n]=np.zeros((f_reshape[0],f_reshape[1],num_keyword))
    else:
      feature[n] = np.array([Word2Vec.wv.get_vector(i).reshape(f_reshape) for i in keyws[:num_keyword]]).transpose((1, 2, 0))

  return feature
    

In [45]:
G_abs_feature = node_abstract_feature(G,num_keyword=8)

In [46]:
G_abs_feature.shape

(138499, 10, 10, 8)
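
The padding logic used for nodes with fewer keywords than slots can be checked on toy data. The sketch below uses constant vectors in place of Word2Vec embeddings; `stack_keyword_vectors` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def stack_keyword_vectors(vectors, num_keyword, f_reshape=(10, 10)):
    """Stack up to `num_keyword` vectors into an array of shape f_reshape + (num_keyword,)."""
    out = np.zeros((num_keyword,) + f_reshape)
    if len(vectors) > 0:
        kept = np.asarray(vectors[:num_keyword]).reshape((-1,) + f_reshape)
        out[:len(kept)] = kept
        # Pad the missing slots with the mean of the available vectors.
        out[len(kept):] = kept.mean(axis=0)
    # Move the keyword axis last: (num_keyword, 10, 10) -> (10, 10, num_keyword)
    return out.transpose((1, 2, 0))

# Three stand-in 100-dimensional keyword vectors, padded up to 8 slots
vectors = [np.full(100, value) for value in (1.0, 2.0, 3.0)]
block = stack_keyword_vectors(vectors, num_keyword=8)
print(block.shape)
```

Slots 0 to 2 hold the three vectors and slots 3 to 7 all hold their mean, so every node ends up with a block of the same shape regardless of how many keywords it has.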

In [144]:
abs_w2v_model.wv.n_similarity?

Signature: abs_w2v_model.wv.n_similarity(ws1, ws2)
Docstring:
Compute cosine similarity between two sets of words.

Parameters
----------
ws1 : list of str
    Sequence of words.
ws2: list of str
    Sequence of words.

Returns
-------
numpy.ndarray
    Similarities between `ws1` and `ws2`.
File:      ~/opt/anaconda3/envs/pytorch/lib/python3.7/site-packages/gensim/models/keyedvectors.py
Type:      method


In [115]:
tr4w_abstracts[0].node_weight

{'development': 0.6090520682324654,
 'automated': 0.7086640311498161,
 'system': 1.5679570113392165,
 'quality': 1.3171029580107905,
 'assessment': 1.0329315988838061,
 'aerodrome': 1.428328959770789,
 'ground': 1.0717865363074939,
 'lighting': 1.0710650930750285,
 'AGL': 2.84224636207708,
 'accordance': 0.8263990018051872,
 'standard': 0.9720223325738512,
 'recommendation': 0.6346649158719633,
 'image': 2.883652050757183,
 'sensor': 0.7287399050369255,
 'cockpit': 0.8164182291826758,
 'aircraft': 0.89771077463189,
 'normal': 0.7176374172831043,
 'descent': 0.639638563678444,
 'model': 0.6301721618431856,
 'methodology': 0.7577006521962233,
 'optimum': 0.8455715132389162,
 'match': 0.9517206957597248,
 'template': 1.0520358132679348,
 'actual': 0.9577995187961232,
 'datum': 2.0465024726228793,
 'order': 0.8585879852028008,
 'position': 1.1875646813677716,
 'orientation': 1.199074135114456,
 'camera': 1.0010472345501968,
 'instant': 0.5636257638358058,
 'pixel': 0.908579658383817,
 'gre

In [66]:
abs_w2v_model.wv.most_similar('graph')

[('graphs', 0.8177399635314941),
 ('hypergraph', 0.7595082521438599),
 ('subgraph', 0.694742739200592),
 ('vertex', 0.677985668182373),
 ('bipartite', 0.6227419376373291),
 ('undirected', 0.6220051050186157),
 ('clique', 0.6205345988273621),
 ('dag', 0.6048083901405334),
 ('hyperedge', 0.5968008637428284),
 ('node', 0.5712962746620178)]

Next we are going to deal with the author data.

In [47]:
from Text_authors_preprocessing import authors

In [48]:
sum([len(authors[node]) for node in range(len(authors))])

456810

In [18]:
# collect the distinct author names over all papers
distin_authors = set()
for node in range(len(authors)):
  distin_authors.update(authors[node])

# with open("../data/generated_data/authors.pkl",'wb') as authorfile:
#   pickle.dump(distin_authors,authorfile)
#   authorfile.close()

In [54]:
# load the set of distinct authors
with open("../data/generated_data/authors.pkl",'rb') as authorfile:
  distin_authors=pickle.load(authorfile)

In [None]:
print('There are %d distinct authors occurring in these articles'%len(distin_authors))

There are 149683 distinct authors occurring in these articles


In [67]:
authors_word_data = [list(authors[node]) for node in authors]

In [85]:
# min_count is set to 1 because most authors appear only rarely; each paper's author list is treated as one sentence
athrs_w2v_model = Word2Vec(size=100, window=5, min_count=1, sg=0, workers=8)
athrs_w2v_model.build_vocab(authors_word_data)
athrs_w2v_model.train(authors_word_data, total_examples=athrs_w2v_model.corpus_count, epochs=5)

(2284050, 2284050)

In [116]:
# store the model into .pkl file (pickled to bytes first, then dumped)
with open('../data/generated_data/athrs_w2v_model.pkl','wb') as athrs_w2v_model_file:
  atrs_model_bytes = pickle.dumps(athrs_w2v_model)
  pickle.dump(atrs_model_bytes,athrs_w2v_model_file)

In [70]:
# load the model from the .pkl file (pickle.load returns bytes, pickle.loads rebuilds the model)
with open('../data/generated_data/athrs_w2v_model.pkl', 'rb') as athrs_w2v_model_file:
    athrs_w2v_model = pickle.load(athrs_w2v_model_file)
    athrs_w2v_model = pickle.loads(athrs_w2v_model)

In [118]:
authors[0]

{'George W. Irwin', 'James H. Niblock', 'Jian-Xun Peng', 'Karen R. McMenemy'}

In [122]:
athrs_w2v_model.wv.similarity('Jian-Xun Peng','James H. Niblock')

0.069632575

In [103]:
def node_author_feature(G,Word2Vec=athrs_w2v_model,author_data =authors_word_data, f_reshape = (10,10),num_author=2):
  ## one block of shape f_reshape + (num_author,) per node, indexed by node id
  feature = np.zeros((len(G.nodes),f_reshape[0],f_reshape[1],num_author))
  for n in G.nodes():

    ## pick num_author vectors to build an array of shape f_reshape + (num_author,): typically (10,10,2)
    if len(author_data[n])<num_author:
      if len(author_data[n])>0:
        vecs = np.array([Word2Vec.wv.get_vector(i).reshape(f_reshape) for i in author_data[n]])
        origi_feature = np.zeros((num_author,f_reshape[0],f_reshape[1]))
        origi_feature[:len(vecs)]=vecs
        ## pad the missing slots with the mean author vector (the vector itself when there is a single author)
        origi_feature[len(vecs):]=vecs.mean(axis=0)
        feature[n]=origi_feature.transpose((1, 2, 0))
      else:
        ## papers without any listed author get an all-zero block
        feature[n]=np.zeros((f_reshape[0],f_reshape[1],num_author))
    else:
      feature[n] = np.array([Word2Vec.wv.get_vector(i).reshape(f_reshape) for i in author_data[n][:num_author]]).transpose((1, 2, 0))

  return feature

In [104]:
G_athrs_feature = node_author_feature(G,Word2Vec=athrs_w2v_model,num_author=2)

In [105]:
print('The combined feature dimensions:',np.shape(np.concatenate((G_abs_feature,G_athrs_feature),axis=3)))

The combined feature dimensions: (138499, 10, 10, 10)


In [106]:
def node_feature(features = (G_abs_feature,G_athrs_feature),ax=3):
  return np.concatenate(features,axis=ax)

In [107]:
X = node_feature()

In [108]:
np.save('../data/generated_data/X.npy',X)

In [111]:
np.shape(X)

(138499, 10, 10, 10)
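
Since the goal stated above is to turn node attributes into edge features such as similarities, one possible way to use an array like `X` is sketched below. This is an assumption for illustration, not the method used in the rest of the notebook: `edge_features`, the toy array and the choice of cosine similarity plus Euclidean distance are all hypothetical.

```python
import numpy as np

def edge_features(X, pairs):
    """Build one feature row [cosine, l2_distance] per (u, v) pair from node features X."""
    flat = X.reshape(len(X), -1)              # (n_nodes, d): flatten each node's block
    u, v = np.asarray(pairs).T
    fu, fv = flat[u], flat[v]
    dots = np.sum(fu * fv, axis=1)
    norms = np.linalg.norm(fu, axis=1) * np.linalg.norm(fv, axis=1)
    # Nodes with an all-zero block have norm 0; their cosine is set to 0.
    cosine = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    l2 = np.linalg.norm(fu - fv, axis=1)
    return np.column_stack([cosine, l2])

# Toy stand-in for X: 5 nodes with blocks of shape (10, 10, 10); node 4 has no features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 10, 10, 10))
X_toy[4] = 0.0
pairs = [(0, 1), (2, 2), (3, 4)]

E = edge_features(X_toy, pairs)
print(E.shape)
```

Each row of `E` could then be combined with graph-based features before being passed to a classifier.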