
FrappaN/graph-eqev

Data, scripts and notebooks from "Value is in the Eye of the Beholder: A Framework for an Equitable Graph Data Evaluation"

Data and notebooks from the work on the datasets Cora, PubMed and Pokec-z.

Running the code

The two scripts, script.py and sample_script.py, share the same set of arguments. First, the 'no_cuda' and 'device' arguments select whether to use a GPU (and which one, if several are available). The 'source' argument selects one of the three "graph sources" used in the article: Cora, PubMed or Pokec (Cora by default). With 'no_GNN' you can disable the training of GNNs and use only a label propagation algorithm. The 'test' argument selects the testing procedure described in the article: 'shared', 'ind' (for 'individual'), or 'double' to use both (default is 'shared'). Finally, 'players' sets the number of datasets in each artificial coalition (default is 4, as in the paper) and 'num_sims' sets the number of artificial coalitions to extract.

The results were produced by running script.py on Cora and PubMed with 1000 artificial coalitions and on Pokec with 200 artificial coalitions, in both cases with double testing. sample_script.py was instead run on all sources with only 30 artificial coalitions and shared testing. In all cases, both GNNs and label propagation were used.

Organization of files in "results"

There are three types of .csv files: "acc_..." for the accuracies of all possible coalitions in each run; "par_..." for the statistical parity of the same coalitions; and "stats_..." for data regarding each subset in each run. The "acc_..." and "par_..." files are further divided by model, label propagation (LP) or GNN, and by testing modality ('individual' or 'shared' test set).
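For readers unfamiliar with the metric stored in the "par_..." files, the sketch below shows one common definition of statistical parity: the absolute gap in positive-prediction rates between two groups. It assumes binary predictions and a binary sensitive attribute; the function name and the toy data are hypothetical, and the exact definition used in the paper may differ.

```python
def statistical_parity_difference(y_pred, sensitive):
    """Absolute gap in positive-prediction rates between groups 0 and 1.

    Assumes binary predictions (0/1) and a binary sensitive attribute (0/1).
    """
    group_0 = [p for p, s in zip(y_pred, sensitive) if s == 0]
    group_1 = [p for p, s in zip(y_pred, sensitive) if s == 1]
    rate_0 = sum(group_0) / len(group_0)
    rate_1 = sum(group_1) / len(group_1)
    return abs(rate_0 - rate_1)


# Toy data (made up for illustration only).
preds = [1, 0, 1, 1, 0, 0, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
gap = statistical_parity_difference(preds, group)
```

A value of 0 means both groups receive positive predictions at the same rate; larger values indicate a larger disparity.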

Columns in the "acc_..."/"par_..." .csv's

Each column represents a coalition of subsets. The subsets are numbered from 0 to 3, and a coalition is given by an ordered list of the players it contains. The number of each subset in each run corresponds to the order in which it appears in the corresponding "stats_..." file.

The "sample_exp" folder also contains the "sample_acc_..." files, which have some extra columns. First, there are columns with the values obtained by taking one of the subsets in full and only a sample of the others (e.g., "2F+[0,3]" corresponds to the values obtained when training a model over the union of the full subset 2 and the samples of subsets 0 and 3). Second, each row indicates the sampling modality ("u": uniform; "r": unbiased random walk; "high_res", "low_res", "mix_res": sampling the edges with the highest or lowest effective resistance, or a mix of the two; "t": random spanning tree) and the fraction of the subset that is sampled (except in the case of random spanning trees).
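The column labels can be turned into player tuples with a small helper. The sketch below assumes plain coalition columns are labelled like "[0,3]" (this exact format is an assumption, inferred from the "2F+[0,3]" example); the function name parse_coalition is hypothetical.

```python
import ast


def parse_coalition(label):
    """Split a column label into (full_subset, other_subsets).

    full_subset is None for a plain coalition such as "[0,3]" and an
    integer for labels such as "2F+[0,3]" (subset 2 taken in full).
    """
    label = label.strip()
    full_subset = None
    if "F+" in label:
        head, label = label.split("F+", 1)
        full_subset = int(head)
    # The bracketed part is a Python-style list literal of subset indices.
    others = tuple(ast.literal_eval(label))
    return full_subset, others


plain = parse_coalition("[0,3]")
mixed = parse_coalition("2F+[0,3]")
```

Here `plain` describes the coalition of subsets 0 and 3, and `mixed` describes the full subset 2 combined with samples of subsets 0 and 3.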

Columns in the "stats_..." .csv's

-sampling: the kind of sampling used for the subsets; "q=x" for a biased random walk with in-out parameter x, "rw" for an unbiased random walk (equivalent to q=1), and "unif" for uniform sampling over the edges

-num_edges: number of the edges in the subset, including edges connecting nodes in the test set

-size: fraction of the complete graph edges covered

-size_gt: fraction of sampled, labelled nodes which keep their labels

-diameter: diameter of the subset

-avg_cc: average clustering coefficient of the subset

-avg_degree: average degree of the subset

-homophily: edge homophily ratio of the subgraph

-assortativity: assortativity mixing of the subgraph

-lab_dist: "L1 norm" between the vector representing the distribution of target labels in the full dataset and the vector of their distribution in the non-zero degree nodes in the subset

-tot_edge: total number of available edges when combining all subsets in the coalition

-edge_contribution: fraction of tot_edge given by the subgraph

-overlapping_edges: fraction of num_edges shared with other subsets in the coalition

-gain_acc: accuracy of the model after training over the subset; LP for label propagation model, GNN for the graph neural network model

-shap_acc: Shapley value assigned on the basis of the accuracy; same names as above for label propagation and GNN

-Run: each run corresponds to a different, independent sampling of the subsets
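Some of the columns above follow standard formulas, sketched below for illustration: the edge homophily ratio (homophily), the L1 distance between label distributions (lab_dist), and exact Shapley values computed from per-coalition scores (shap_acc). This is not the repository's code. All function names are hypothetical, the toy numbers are made up, and the score of the empty coalition is an assumption (set to 0 here), since the README does not state which baseline is used.

```python
from collections import Counter
from itertools import combinations
from math import factorial


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def label_l1_distance(full_labels, subset_labels, classes):
    """L1 distance between two empirical label distributions."""
    def distribution(label_list):
        counts = Counter(label_list)
        return [counts[c] / len(label_list) for c in classes]

    full = distribution(full_labels)
    sub = distribution(subset_labels)
    return sum(abs(a - b) for a, b in zip(full, sub))


def shapley_values(value, players):
    """Exact Shapley values.

    `value` maps a sorted tuple of players (including the empty tuple)
    to the score of that coalition, e.g. its accuracy.
    """
    players = sorted(players)
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that player p joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_p = tuple(sorted(subset + (p,)))
                total += weight * (value[with_p] - value[subset])
        phi[p] = total
    return phi


# Toy example with two players (all numbers made up).
toy_value = {(): 0.0, (0,): 0.6, (1,): 0.5, (0, 1): 0.8}
phi = shapley_values(toy_value, [0, 1])

h = edge_homophily([(0, 1), (1, 2), (2, 3)], {0: "a", 1: "a", 2: "b", 3: "b"})
d = label_l1_distance([0, 0, 1, 1], [0, 0, 0, 1], classes=[0, 1])
```

The Shapley values always sum to the score of the full coalition minus the score of the empty one, which is a quick sanity check when recomputing them from the "acc_..." columns.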
