Group Members:
- Didier Bakoue Ngatcha 
- Paul Laudrup Fotso Kaptue
- Wilfried Djomeni Djiela

# Data Challenge: Prediction of authors' h-index

## Description

### Description of the subject

The goal of this challenge is to study and apply machine learning / artificial intelligence methods to a real-world regression problem. Each data point corresponds to an author, and the task is to predict that author's h-index. The h-index measures an author's productivity and impact in their research field. It is defined as the maximum value h such that the author has published h papers that have each been cited at least h times. To build the model, we have:
- a graph that describes the collaborations between researchers
- the abstracts of the authors' papers
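
To make the definition of the h-index concrete, here is a minimal sketch that computes it from a list of citation counts. The function name `h_index` and the example counts are illustrative assumptions; the challenge data provides the h-index directly and never exposes per-paper citation counts.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the `rank` most cited papers all have >= rank citations
        else:
            break
    return h


# Hypothetical author with five papers
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations))  # 4: four papers have at least 4 citations each
```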


### Description of the data 
We have the following files:
- coauthorship.edgelist: a graph where nodes correspond to authors and an edge indicates that two authors have co-authored at least one research paper. This graph contains 217801 vertices (authors) and 1718164 edges.
- author_papers.txt: a list of authors and the IDs of their most cited papers.
- abstracts.txt: for each paper, the ID of the paper and the inverted index of its abstract.
- train.csv: 174242 authors and their h-index. Each line contains the author ID and the author's h-index.
- test.csv: 43561 author IDs whose h-index we want to predict.
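
An inverted index maps each word to the list of positions where it occurs, so the original text can be rebuilt by placing every word back at its positions. The sketch below illustrates this on a made-up record; the keys `IndexLength` and `InvertedIndex` are the ones used later in this notebook, while the record content and the helper name `rebuild_text` are assumptions for illustration.

```python
import json

# Hypothetical record in the IndexLength / InvertedIndex layout
raw = ('{"IndexLength": 6, "InvertedIndex": {"graph": [0, 4], "neural": [1], '
       '"networks": [2], "on": [3], "data": [5]}}')


def rebuild_text(record):
    """Place each word back at the positions listed in the inverted index."""
    words = [""] * record["IndexLength"]
    for word, positions in record["InvertedIndex"].items():
        for pos in positions:
            words[pos] = word
    return " ".join(words)


text = rebuild_text(json.loads(raw))
print(text)  # graph neural networks on graph data
```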

### Simple Lasso Regression
This first submission, a Lasso regression on two graph features (degree and core number), produced an MSE of 129.0. The rest of the notebook tries to improve on this baseline as much as possible.
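
As a reminder of what the score means, the sketch below computes the mean squared error on a few made-up values. The helper name `mse` and the toy numbers are assumptions for illustration only. Taking the square root gives an error in h-index units: an MSE of 129.0 corresponds to an RMSE of roughly 11.4.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Hypothetical h-indices and predictions
y_true = [7, 1, 27, 28]
y_pred = [10, 3, 20, 30]
toy_mse = mse(y_true, y_pred)    # (9 + 4 + 49 + 4) / 4 = 16.5
toy_rmse = toy_mse ** 0.5
print(toy_mse, toy_rmse)
```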

In [1]:
import os
import pandas as pd
import numpy as np
import networkx as nx
from sklearn.linear_model import Lasso
from sklearn.feature_extraction.text import TfidfVectorizer


# read training data
df_train = pd.read_csv('train.csv', dtype={'author': np.int64, 'hindex': np.float32})
n_train = df_train.shape[0]

# read test data
df_test = pd.read_csv('test.csv', dtype={'author': np.int64})
n_test = df_test.shape[0]

# load the graph    
G = nx.read_edgelist('coauthorship.edgelist', delimiter=' ', nodetype=int)
n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
print('Number of nodes:', n_nodes)
print('Number of edges:', n_edges)


# computes structural features for each node
core_number = nx.core_number(G)

# create the training matrix. each node is represented as a vector of 2 features:
# (1) its degree, (2) its core number
X_train = np.zeros((n_train, 2))
y_train = np.zeros(n_train)
nodes_train = np.zeros((n_train, 1))
for i,row in df_train.iterrows():
    node = row['author']
    X_train[i,0] = G.degree(node)
    X_train[i,1] = core_number[node]
    y_train[i] = row['hindex']
    nodes_train[i, 0] = node
    

# create the test matrix. each node is represented as a vector of 2 features:
# (1) its degree, (2) its core number
X_test = np.zeros((n_test, 2))
for i,row in df_test.iterrows():
    node = row['author']
    X_test[i,0] = G.degree(node)
    X_test[i,1] = core_number[node]
    
# train a regression model and make predictions
reg = Lasso(alpha=0.1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# write the predictions to file
df_test['hindex'] = pd.Series(np.round(y_pred, decimals=3))


df_test.loc[:,["author","hindex"]].to_csv('submission.csv', index=False)

Number of nodes: 217801
Number of edges: 1718164


## Feature extraction
Recall that we have the abstracts of the authors' papers and the collaboration graph.
The goal of this first part is to extract, from these two sources, vectors that characterize the authors.

### Abstract feature extraction
#### Data import

In [2]:
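# load the abstracts and the author papers into dataframes
# note: a multi-character separator requires the python parsing engine;
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
# and skips the lines whose abstract itself contains "----"
abstract_dict = pd.read_csv('abstracts.txt', sep="----",
                            engine='python',
                            on_bad_lines='warn',
                            header=None)
author_paper = pd.read_csv('author_papers.txt', sep=':')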

Skipping line 63243: Expected 2 fields in line 63243, saw 19. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 82978: Expected 2 fields in line 82978, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 123264: Expected 2 fields in line 123264, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 150525: Expected 2 fields in line 150525, saw 4. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 198059: Expected 2 fields in line 198059, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skipping line 297016: Expected 2 fields in line 297016, saw 4. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
Skippi

In [3]:
import json
import string 

# Put an abstract into a form that can be used for training.
# text_abstract (a string) is used to fit the TF-IDF vectorizer;
# train_abstract (a list of tokens) is used to train our word2vec model.
def build_text_abstract(dict_abstract: dict):
    # rebuild the abstract from its inverted index: word -> list of positions
    words_matrix = ["" for _ in range(dict_abstract['IndexLength'])]
    for word, positions in dict_abstract['InvertedIndex'].items():
        for position in positions:
            words_matrix[position] = word
    text_abstract = ' '.join(words_matrix)
    # replace punctuation with spaces, drop digits, and lowercase
    for punc in string.punctuation:
        text_abstract = text_abstract.replace(punc, ' ')
    for num in range(10):
        text_abstract = text_abstract.replace(str(num), '')
    text_abstract = text_abstract.lower()
    train_abstract = text_abstract.split()
    return text_abstract, train_abstract

# Two generators over the abstracts: one for TF-IDF and the other for word2vec training.
abstracts = (build_text_abstract(json.loads(abstract_dict[1][i]))[0] for i in range(len(abstract_dict)))
trainings_abstracts = (build_text_abstract(json.loads(abstract_dict[1][i]))[1] for i in range(len(abstract_dict)))

In [5]:
# next(trainings_abstracts)
# len(list(abstracts))

624168

#### Strategy: TF-IDF weighted word2vec
To represent authors from the abstracts of their papers, we chose to train a word2vec model on our data. We then build a vector for each abstract by weighting the word2vec vectors of its words by their TF-IDF scores.

In [4]:
# build the TF-IDF matrix for all the abstracts
vectorizer = TfidfVectorizer()
tfidf_abs = vectorizer.fit_transform(abstracts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_name = list(vectorizer.get_feature_names_out())
vectors_gen = (tfidf_abs[i] for i in range(tfidf_abs.shape[0]))



In [5]:
len(feature_name)

427456

**The goal of the following cell is to train our own word2vec model. This training took a long time (about 7 hours), so there is no need to rerun it: the model has already been saved, and the cell after this one loads it.**

In [None]:
# Training of word2vec. There is no need to train the model again:
# it has already been trained and only has to be loaded (see the next cell).
import time
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    # callback to print the loss and save a checkpoint after each epoch
    def __init__(self):
        self.epoch = 0
        self.loss_previous_step = 0.0

    def on_epoch_end(self, model):
        # gensim reports a cumulative loss, so the per-epoch loss is the difference
        loss = model.get_latest_training_loss()
        model.save('abstractsw2vecIt.model')
        print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss


# materialize the tokenized abstracts once: build_vocab and train each need a full pass
sentences = list(trainings_abstracts)

# create the model without a corpus so that training only happens in train() below
models_abstracts = Word2Vec(vector_size=300,
                            min_count=1,
                            workers=4,
                            window=5)

models_abstracts.build_vocab(sentences)

# train the word2vec model on our abstracts
start = time.time()
models_abstracts.train(sentences,
                       total_examples=models_abstracts.corpus_count,
                       epochs=50,
                       report_delay=1,
                       compute_loss=True,
                       callbacks=[callback()])
end = time.time()
print('training time (s):', end - start)

models_abstracts.save('abstractsw2vec.model')

In [6]:
# loading the pretrained word2vec model 
import gensim
from gensim import models
from gensim.models import Word2Vec, KeyedVectors

reload_model = Word2Vec.load('abstractsw2vecIt.model')
# models_abstracts.wv.vocab
words = list(reload_model.wv.index_to_key)
print('vocabulary size :', len(words))

reload_model.wv.most_similar(positive='company', topn=10)

vocabulary size : 489681


[('companies', 0.6483214497566223),
 ('customers', 0.6270943880081177),
 ('customer', 0.601658046245575),
 ('taikang', 0.5920459628105164),
 ('company’s', 0.5890681743621826),
 ('utopics', 0.5839164853096008),
 ('organizations', 0.5789316892623901),
 ('employees', 0.5730001926422119),
 ('financial', 0.5720323324203491),
 ('bangchak', 0.5690685510635376)]

In [8]:
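# compute the abstract vectors
# words known to TF-IDF but missing from word2vec get a zero vector
# (replace=False keeps the trained vectors of the words that already exist)
reload_model.wv.add_vectors(feature_name, np.zeros((len(feature_name), 300)), replace=False)
vocab_vects = reload_model.wv[feature_name]

# word2vec weighted by TF-IDF: each abstract vector is the
# TF-IDF-weighted sum of the vectors of its words
abs_vects_list = tfidf_abs @ vocab_vects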

In [None]:
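# compute the vector that represents each author:
# the mean of the abstract vectors of the author's papers.
# Assumption: `df` is not defined in this notebook. We assume it paired each paper Id
# (abstract_dict[0]) with its row of abs_vects_list, and use a dict lookup instead.
paper_to_row = {int(paper_id): i for i, paper_id in enumerate(abstract_dict[0])}

vects_authors = []
nbre_total = 0
for papers in author_paper['paperID']:
    list_papers = papers.split('-')
    vec_author = np.zeros(300)
    nb_paper = 0
    nbre_total += len(list_papers)
    for paper in list_papers:
        row = paper_to_row.get(int(paper))
        if row is not None:
            vec_author += abs_vects_list[row]
            nb_paper += 1
    # authors without any available abstract keep a zero vector
    if nb_paper != 0:
        vec_author /= nb_paper
    vects_authors.append(vec_author)
print(nbre_total)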

### Graph features

We use two strategies to compute graph features:
- Graph metrics: features computed *by hand* that describe a structural property of each node.
- Node2Vec: an embedding method for the nodes of a graph.
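
To give an idea of what the hand-computed metrics look like, here is a small sketch on a toy graph using NetworkX. The graph `G_toy` and its node names are assumptions for illustration; the metrics shown (degree, core number, average neighbor degree, degree centrality) are among those computed on the real co-authorship graph below.

```python
import networkx as nx

# Hypothetical co-authorship graph: a triangle a-b-c plus a pendant node d
G_toy = nx.Graph()
G_toy.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

degree = dict(G_toy.degree())                      # number of co-authors
core = nx.core_number(G_toy)                       # largest k such that the node is in the k-core
avg_nbr_degree = nx.average_neighbor_degree(G_toy) # mean degree of the co-authors
centrality = nx.degree_centrality(G_toy)           # degree divided by (n_nodes - 1)

for node in G_toy.nodes():
    print(node, degree[node], core[node], avg_nbr_degree[node], centrality[node])
```

In this toy graph, `c` has the highest degree (3), the triangle nodes have core number 2, and the pendant node `d` has core number 1.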

In [None]:
# compute graph metrics: there is no need to compute the graph features again.
# they have already been computed and stored in the csv with the abstract vectors

def compute_features(g, node):
    # Features of the ego network of `node`, i.e. the subgraph induced by its neighbors:
    # (1) number of neighbors,
    # (2) number of connected groups of at least 2 neighbors,
    # (3) number of isolated neighbors (connected to no other neighbor).
    X = np.zeros((1, 3))

    neighb = list(g.neighbors(node))
    # the subgraph excludes `node` itself, so its components are the groups of
    # co-authors that remain connected without going through `node`
    g1 = g.subgraph(neighb)
    res = list(nx.connected_components(g1))

    nb_comp = 0
    nb_isolates = 0
    for clus in res:
        if len(clus) >= 2:
            nb_comp += 1
        else:
            nb_isolates += 1

    X[0, 0] = len(neighb)
    X[0, 1] = nb_comp
    X[0, 2] = nb_isolates
    return X

def neighbor_av_degree(g, node):
    
    d = 0
    for neighb in g.neighbors(node):
        d+=g.degree[neighb]
    if g.degree[node] == 0:
        return 0
    else:
        return d/g.degree[node]

dictionary_d_centrality =nx.algorithms.centrality.degree_centrality(G)
def degree_centrality(node):
    return dictionary_d_centrality[node]

dictionary_of_page_rank = nx.pagerank(G)
def page_rank(node):
    return dictionary_of_page_rank[node]

dictionary_of_core = nx.algorithms.core.core_number(G)
def core_number(node):
    return dictionary_of_core[node]

def neighbor_av_and_max_h(g, node, author_labelled):
    h_indices = []
    for neighb in g.neighbors(node):
#         k = np.argwhere(nodes_train[:,0]==neighb).flatten()
        if not np.isnan(author_labelled['hindex'][neighb]):
            h_indices.append(author_labelled['hindex'][neighb])
    if len(h_indices)!=0:
        return np.mean(h_indices), np.max(h_indices)
    else:
        return 0, 0
    
dictionary_of_b_centrality = nx.betweenness_centrality(G)
def b_centrality(node):
    return dictionary_of_b_centrality[node]

dictionary_of_EVC = nx.eigenvector_centrality(G)
def h_of_mostEVC(g, node, author_labelled):
    evc = []
    evc_node = []
    for neighb in g.neighbors(node):
        if neighb in dictionary_of_EVC.keys():
            evc.append(dictionary_of_EVC[neighb])
            evc_node.append(neighb)
    if len(evc)!=0:
        max_node = evc_node[np.argmax(evc)]
#         k = np.argwhere(nodes_train[:,0]==max_node).flatten()
        if not np.isnan(author_labelled['hindex'][max_node]) : 
            return author_labelled['hindex'][max_node]
        else:
            return 0
    else:
        return 0

In [20]:
#node2vec embedding of nodes
import os.path as osp

import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import Node2Vec


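# note: `edges` is the sparse adjacency tensor built in the GCN data-preparation cell (In [21])
modelvect = Node2Vec(edges.coalesce().indices(), embedding_dim=100, walk_length=3000,
                     context_size=10, walks_per_node=1000,
                     num_negative_samples=1, p=0.6, q=0.8, sparse=True).cuda()

loader = modelvect.loader(shuffle=True, num_workers=6)
optimizer = torch.optim.SparseAdam(list(modelvect.parameters()), lr=0.01)
# switch the model to training mode; this call alone does not run any optimization step
modelvect.train()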

Node2Vec(1036333, 100)

In [21]:
import torch
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv 

# read the file that contains the labelled authors and the graph features
author_labelled = pd.read_csv('author_lbl_graphfeature2.csv')
n_authors = len(author_labelled)

# relabel the nodes with their row position and build the edge indices for the GCN
mapping = {author_labelled['index'][i]: i for i in range(n_authors)}
G = nx.relabel_nodes(G, mapping)
# store each undirected edge in both directions
edges_0 = []
edges_1 = []
for u, v in G.edges():
    edges_0.extend([u, v])
    edges_1.extend([v, u])

# sparse adjacency with unit weights (torch.sparse.LongTensor is deprecated)
edge_index = torch.tensor([edges_0, edges_1], dtype=torch.long)
adjacency_tensor = torch.sparse_coo_tensor(edge_index,
                                           torch.ones(edge_index.shape[1], dtype=torch.long))

edges = adjacency_tensor.cuda()

X_train = author_labelled.drop(['index', 'hindex', 'paperID', 'mask', 'Unnamed: 0', 'out', 'Unnamed: 0.1'], axis=1)
y_train = author_labelled['hindex']
mask = author_labelled['mask'].astype('bool')

X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy().reshape(-1, 1)
mask_np = mask.to_numpy()

# keep only the labelled authors for the models that do not use the graph
X_trp = X_train_np[mask_np]
y_trp = y_train_np[mask_np]

X_train_tensor = torch.tensor(X_train_np)
y_train_tensor = torch.tensor(y_train_np)
mask_tensor = torch.tensor(mask.values)

**In the end, this is the data frame which contains all our features.**

The hindex is NaN for the authors of the test data.

In [16]:
author_labelled

| Unnamed: 0.2 | Unnamed: 0 | index | Unnamed: 0.1 | hindex | paperID | 0 | 1 | 2 | 3 | 4 | ... | page_ranks | degree_centralities | neighbor_av_degrees | nbr_connexions | nbr_isolates | nbr_comp | out | h_of_mostEVC | neighbor_av | neighbor_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1101850 | 0 | NaN | 133459021-179719743-2111787673-2126488676-3183... | -0.516787 | 1.636447 | -0.062156 | -0.188258 | -0.998227 | ... | 0.006489 | 0.001349 | 0.040722 | 0.001349 | 0.000000 | 0.000000 | 3.635898 | 0.096774 | 0.073171 | 0.096257 |
| 1 | 1 | 1336878 | 1 | NaN | 2122092249-2132109814-2100271871-2065672539-20... | -0.412267 | 1.622659 | 0.097246 | -0.548176 | -0.497817 | ... | 0.340766 | 0.068780 | 0.020598 | 0.068780 | 0.277778 | 0.083333 | 43.550365 | 0.252688 | 0.051974 | 0.390374 |
| 2 | 2 | 1515524 | 2 | 7.0 | 2141827797-2127085795-2013547785-2138529788-19... | -0.324699 | 1.140772 | 0.147425 | 0.294866 | -0.846317 | ... | 0.015702 | 0.002023 | 0.010653 | 0.002023 | 0.000000 | 0.083333 | 11.592803 | 0.166667 | 0.113821 | 0.165775 |
| 3 | 3 | 1606427 | 3 | 1.0 | 1907724546 | -0.228558 | 1.052226 | 0.286160 | 0.778895 | -0.780682 | ... | 0.026008 | 0.001349 | 0.002577 | 0.001349 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.018293 | 0.016043 |
| 4 | 4 | 2728936 | 4 | 27.0 | 2114261446-2042751882-1912205781-2059913822-19... | -0.246059 | 1.145682 | 0.077173 | 0.257870 | -0.600490 | ... | 0.071081 | 0.004046 | 0.004639 | 0.004046 | 0.111111 | 0.000000 | 24.399586 | 0.000000 | 0.064024 | 0.144385 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 217795 | 217795 | 2908277686 | 217795 | 28.0 | 1964777539-2051142510-2092148526-2036760475-20... | -0.279820 | 1.364193 | 0.986596 | -0.521121 | -1.115616 | ... | 0.025557 | 0.111261 | 0.331109 | 0.111261 | 0.000000 | 0.083333 | 24.768595 | 0.419355 | 0.353481 | 0.529412 |
| 217796 | 217796 | 2908387141 | 217796 | NaN | 2540479521 | -0.165670 | 1.739194 | 0.239138 | -0.301512 | -0.595904 | ... | 0.021564 | 0.001349 | 0.003093 | 0.001349 | 0.000000 | 0.000000 | 2.162876 | 0.021505 | 0.018293 | 0.021390 |
| 217797 | 217797 | 2908425732 | 217797 | 1.0 | 2553344037 | -0.608998 | 1.042386 | -0.358419 | -0.518114 | -0.351695 | ... | 0.019159 | 0.002697 | 0.021649 | 0.002697 | 0.000000 | 0.083333 | 0.000000 | 0.037634 | 0.079268 | 0.090909 |
| 217798 | 217798 | 2908436250 | 217798 | 1.0 | 2907086791 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.025378 | 0.006743 | 0.034227 | 0.006743 | 0.000000 | 0.083333 | 1.397518 | 0.182796 | 0.105691 | 0.219251 |
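
Since test authors are the rows with a missing h-index, a boolean mask can separate labelled rows from rows to predict. The sketch below shows one way to do this on a toy frame. The frame `toy`, its single `degree` feature, and deriving the mask with `notna()` are assumptions for illustration; the notebook itself reads a precomputed `mask` column from the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the feature table: NaN h-index marks a test author
toy = pd.DataFrame({
    "index": [11, 12, 13, 14],
    "hindex": [7.0, np.nan, 1.0, np.nan],
    "degree": [3, 1, 2, 5],
})

mask = toy["hindex"].notna()                       # True for labelled (training) authors
X_labelled = toy.loc[mask, ["degree"]].to_numpy()
y_labelled = toy.loc[mask, "hindex"].to_numpy()
X_to_predict = toy.loc[~mask, ["degree"]].to_numpy()

print(mask.tolist())
```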


## Model Tuning

In [18]:
#build GCN structure
import torch
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv #GATConv

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(42)
        
        #Initialize the layers
        self.conv1 = GCNConv(130, hidden_channels)
        self.out = Linear(hidden_channels, 1)
    
    def forward(self, x, edge_index):
        #First Message passing Layer (Transformation)
        x = self.conv1(x, edge_index)
        x = x.relu()
        x= F.dropout(x, p=0.1, training=self.training)
        
        #output layer
        x= self.out(x)
        return x
model1 = GCN(hidden_channels=70)
print(model1)

GCN(
  (conv1): GCNConv(130, 70)
  (out): Linear(in_features=70, out_features=1, bias=True)
)
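
To show what a single graph convolution does, here is a minimal NumPy sketch of the propagation rule introduced by Kipf and Welling, which `GCNConv` implements: add self-loops, normalize the adjacency symmetrically by node degree, then mix neighbor features through a weight matrix. The 3-node graph, the feature values, and the random weights `W` are assumptions for illustration, and the bias term is omitted.

```python
import numpy as np

# Toy undirected graph with 3 nodes: edges 0-1 and 1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # 2 input features per node

rng = np.random.default_rng(42)
W = rng.normal(size=(2, 4))             # hypothetical weights: 2 inputs -> 4 hidden channels

A_hat = A + np.eye(3)                   # add self-loops
deg = A_hat.sum(axis=1)                 # degrees including self-loops
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization

H = np.maximum(A_norm @ X @ W, 0.0)     # aggregate neighbors, transform, apply ReLU
print(H.shape)
```

Each row of `H` mixes the features of a node with those of its neighbors, which is why the model above can use the co-authorship graph in addition to the per-author features.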


In [None]:
# initialize the GCN model and send it to the GPU for fast computation
model1 = GCN(hidden_channels=70)
model1 = model1.cuda()
y_train_tensor = y_train_tensor.cuda()
X_train_tensor = X_train_tensor.cuda()
edges = adjacency_tensor.cuda()
# the features are float64 tensors, so the model weights must be float64 too
model1.to(torch.float64)
# initialize the optimizer
learning_rate = 0.01
decay = 5e-4
optimizer = torch.optim.Adam(model1.parameters(),
                            lr=learning_rate,
                            weight_decay=decay)

In [None]:
# define the loss function (mean squared error for a regression problem)

criterion = torch.nn.MSELoss()
def train():
    model1.train()
    optimizer.zero_grad()
    # use all nodes as input, because all nodes have node features
    out = model1(X_train_tensor, edges.coalesce().indices())
    # only use the nodes with labels available for the loss calculation --> mask
    loss = criterion(out[mask_tensor], y_train_tensor.reshape(-1, 1)[mask_tensor])
    loss.backward()
    optimizer.step()
    # return a plain float so that the stored losses do not keep the computation graph alive
    return loss.item()

def test():
    # evaluate the MSE on the labelled nodes, with dropout disabled and no gradients
    # (no separate validation mask is defined here, so this is a training-set MSE)
    model1.eval()
    with torch.no_grad():
        out = model1(X_train_tensor, edges.coalesce().indices())
        mse = criterion(out[mask_tensor], y_train_tensor.reshape(-1, 1)[mask_tensor])
    return mse.item()

losses = []
for epoch in range(0, 4000):
    loss = train()
    losses.append(loss)
    if epoch % 10 == 0:
        print(f'epoch: {epoch:03d}, Loss: {loss:.4f}')

In [None]:
# compute the predictions
model1.eval()
with torch.no_grad():
    out = model1(X_train_tensor, edges.coalesce().indices())

In [23]:
# MODEL MLPRegressor

from sklearn.model_selection import train_test_split
import tensorflow as tf 
from tensorflow import keras

X_trp_val, X_test_val, y_trp_val, y_test_val = train_test_split(X_trp, y_trp, test_size=0.2, random_state=5)

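model1 = keras.Sequential([
    # the input size is the number of features (310 in the original run)
    keras.layers.Dense(100, input_shape=(X_trp_val.shape[1],), activation='relu', kernel_initializer='glorot_normal'),
    keras.layers.Dropout(0.1),
    # ReLU on the output keeps the predicted h-index non-negative
    keras.layers.Dense(1, activation='relu'),
])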

opt = tf.keras.optimizers.Adam(
    learning_rate=0.001
)

model1.compile(optimizer=opt,
             loss='MeanSquaredError')
model1.fit(X_trp_val, y_trp_val, epochs=15, validation_data=(X_test_val, y_test_val))


In [None]:
# XGBoost regressor

import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from tqdm import tqdm

# search for the number of trees that minimizes the validation MSE
mse_train = []
mse_test = []
mse_min = np.inf
optimal_n = -1
for n_estimator in tqdm(range(10, 1000, 50)):
    xgbr = XGBRegressor(n_estimators=n_estimator, verbosity=0)
    xgbr.fit(X_trp_val, y_trp_val)

    ypred_train = xgbr.predict(X_trp_val)
    ypred_test = xgbr.predict(X_test_val)
    mse_test_curr = mean_squared_error(y_test_val, ypred_test)
    mse_train_curr = mean_squared_error(y_trp_val, ypred_train)
    # keep the setting with the lowest validation MSE seen so far
    if mse_test_curr < mse_min:
        optimal_n = n_estimator
        mse_min = mse_test_curr
    mse_train.append(mse_train_curr)
    mse_test.append(mse_test_curr)

n_list = np.arange(10, 1000, 50)
plt.plot(n_list, mse_train, label='Boost train')
plt.plot(n_list, mse_test, label='Boost test')
plt.xlabel('n_estimators')
plt.ylabel('MSE')
plt.legend()
plt.show()

In [None]:
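# refit with the best number of trees found above
xgbr = XGBRegressor(n_estimators=optimal_n, verbosity=0)
xgbr.fit(X_trp_val, y_trp_val)
score = xgbr.score(X_trp_val, y_trp_val)
print('training score (R^2):', score)
# cross-validate the XGBoost model (the default regressor score is R^2)
cv_score = cross_val_score(xgbr, X_trp_val, y_trp_val, cv=10)
print("CV mean score (R^2):", cv_score.mean())
ypred = xgbr.predict(X_test_val)
mse = mean_squared_error(y_test_val, ypred)
print("MSE:", mse)
print("RMSE:", mse**(1/2.0))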