## Sustainability optimisation task: classifying graphs

The [onion decomposition](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.core.onion_layers.html#networkx.algorithms.core.onion_layers) of a graph G is based on the k-core algorithm explained earlier, but records the vertices removed in each step.
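To make the peeling idea concrete, here is a minimal pure-Python sketch of an onion decomposition. It is an illustration only, not the networkx implementation: the function name `onion_layers_sketch` and the adjacency-dict input format are assumptions made for this example. In each pass, every vertex whose current degree does not exceed the current core index is removed together, and those vertices form one layer.

```python
def onion_layers_sketch(adj):
    """Assign each vertex of an undirected simple graph to an onion layer.

    `adj` maps each vertex to the set of its neighbours (hypothetical
    input format chosen for this sketch).
    """
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    layers = {}
    core = 0   # index k of the k-core currently being peeled
    layer = 0
    while remaining:
        layer += 1
        # The core index never decreases; raise it only when every
        # remaining vertex has a larger degree.
        core = max(core, min(len(nbrs) for nbrs in remaining.values()))
        # Decide the whole layer before removing anything.
        peel = [v for v, nbrs in remaining.items() if len(nbrs) <= core]
        for v in peel:
            layers[v] = layer
        # Remove the layer and update the degrees of the surviving vertices.
        for v in peel:
            for u in remaining.pop(v):
                if u in remaining:
                    remaining[u].discard(v)
    return layers


# A path on five vertices: 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(onion_layers_sketch(path))  # {0: 1, 4: 1, 1: 2, 3: 2, 2: 3}
```

Every vertex of the path belongs to the 1-core, yet the decomposition splits it into three layers: the two end points, then their neighbours, then the middle vertex. This is the extra information that the layers record on top of the core numbers.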

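```python
from collections import Counter

import networkx as nx

# Load a graph from an adjacency list, reading vertex labels as integers.
G = nx.read_adjlist("demo_graph.txt", nodetype=int)

print(f"original graph has {G.number_of_nodes():d} vertices and {G.number_of_edges():d} edges")

# onion_layers returns a dict mapping each vertex to its layer number (starting at 1).
od = nx.onion_layers(G)

# Count the vertices in each layer in a single pass over the dict.
layer_sizes = Counter(od.values())

for i in range(1, 21):
    print(f"layer {i:d} contains {layer_sizes[i]:d} vertices")
```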


```
original graph has 50000 vertices and 125005 edges
layer 1 contains 326 vertices
layer 2 contains 1693 vertices
layer 3 contains 68 vertices
layer 4 contains 4284 vertices
layer 5 contains 754 vertices
layer 6 contains 153 vertices
layer 7 contains 37 vertices
layer 8 contains 9 vertices
layer 9 contains 2 vertices
layer 10 contains 7732 vertices
layer 11 contains 4155 vertices
layer 12 contains 2604 vertices
layer 13 contains 1854 vertices
layer 14 contains 1480 vertices
layer 15 contains 1299 vertices
layer 16 contains 1183 vertices
layer 17 contains 1101 vertices
layer 18 contains 1065 vertices
layer 19 contains 1085 vertices
layer 20 contains 1185 vertices
```


The sequence of layer sizes carries a lot of information about the local structure of the graph, so it might make a useful set of features for distinguishing between graphs of different types.
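As a rough sketch of how the layer sizes could become a fixed-length feature vector, the helper below counts the vertices per layer in one pass. The function name `layer_size_features`, the default of 20 layers (matching the loop above) and the optional normalisation by the number of vertices are assumptions made for this example, not part of the task specification.

```python
from collections import Counter


def layer_size_features(layers, n_layers=20, normalise=True):
    """Turn a vertex -> layer mapping into a fixed-length list of layer sizes.

    `layers` is a dict such as the one returned by nx.onion_layers(G).
    Layers beyond `n_layers` are ignored; missing layers count as 0.
    """
    counts = Counter(layers.values())  # single pass over all vertices
    total = len(layers)
    sizes = [counts[i] for i in range(1, n_layers + 1)]
    if normalise and total > 0:
        # Assumption: dividing by the vertex count makes graphs of
        # different sizes more comparable.
        return [s / total for s in sizes]
    return sizes


# Hypothetical layer assignment for a five-vertex path graph.
example = {0: 1, 4: 1, 1: 2, 3: 2, 2: 3}
print(layer_size_features(example, n_layers=5, normalise=False))  # [2, 2, 1, 0, 0]
```

With networkx this could be called as `layer_size_features(nx.onion_layers(G))`. Whether normalised or raw counts work better is something to check on the data.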

## The task

You are given the directory `data`, whose subdirectories contain examples of graphs built with two different generator functions.

All the graphs have different numbers of vertices and are built using different parameter values.

Your job is to build a workflow that will:

1. calculate the sizes of the first 50 onion layers of each graph provided;
2. use these sizes as features to build a machine learning classifier that distinguishes between the two types of graph.
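One deliberately cheap way to approach the second step is a nearest-centroid classifier, sketched below in pure Python. This is an illustrative choice, not the expected solution: the function names, the class labels and the toy three-element feature vectors are all hypothetical, and a library model (for example from scikit-learn) may well be more accurate at a higher cost.

```python
import math


def fit_centroids(features, labels):
    """Average the feature vectors of each class (the whole 'training' step)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Toy data: hypothetical normalised layer sizes for two graph types.
train_x = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
train_y = ["type_a", "type_a", "type_b", "type_b"]

model = fit_centroids(train_x, train_y)
print(predict(model, [0.55, 0.35, 0.10]))  # type_a
```

Training here is a single pass over the data and prediction is one distance per class, which keeps both the training and usage costs small. Whether the accuracy is good enough has to be measured.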

We are interested in the *total sustainability* of the solution (i.e. the development, training and usage of the resulting model).

### development cost
You will make an estimate of the development cost, including any use of GenAI for coding.

### training cost
Your workflow will be applied to an UNSEEN training dataset (containing different types of graph to the ones you have been given!). 

We will record the mean training cost over 100 repetitions.
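The text does not say how the cost is measured. As a rough local stand-in, the sketch below averages wall-clock time over repeated runs. Runtime is only a proxy for energy or carbon cost, and the names `mean_runtime` and `dummy_training` are hypothetical; a dedicated tool such as CodeCarbon would be needed for an actual emissions estimate.

```python
import time
from statistics import mean


def mean_runtime(fn, repetitions=100):
    """Return the mean wall-clock time in seconds of calling fn() `repetitions` times."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def dummy_training():
    # Hypothetical stand-in for the real training workflow.
    sum(i * i for i in range(10_000))


print(f"mean training time: {mean_runtime(dummy_training):.6f} s")
```

Replacing `dummy_training` with the real training function gives a quick way to compare candidate workflows before submitting them.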

### usage cost
The resulting model will be used to classify a balanced testing set of 100 graphs, and the total cost of testing will be recorded.

### performance
The accuracy on the test dataset will be recorded.

Solutions will be ranked by their accuracy and their total carbon cost.



Good luck!
