# Hackathon task: classifying graphs

There are a number of simple recipes for creating random graphs, known as [graph generators](https://networkx.org/documentation/stable/reference/generators.html).

The properties of graphs created using different recipes are somewhat distinctive, so in principle we can use machine learning to predict the source generator for a given graph.

You are given a sample training dataset (in `data/train`) containing graphs created using five different generators. The number of vertices and the other generator parameters may differ from graph to graph.

These graphs are provided in the [adjacency list](https://networkx.org/documentation/stable/reference/readwrite/adjlist.html) format.
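To make the format concrete, here is a minimal sketch of what an adjacency list file encodes: each line starts with a source vertex, followed by the vertices it is connected to, and anything after `#` is a comment. The function name `parse_adjlist_lines` is hypothetical and the sketch keeps vertex labels as strings; in practice `nx.read_adjlist` does this work for you.

```python
def parse_adjlist_lines(lines):
    """Build {vertex: set_of_neighbours} from adjacency-list lines (undirected)."""
    adj = {}
    for line in lines:
        # drop comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # first label is the source vertex, the rest are its neighbours
        source, *targets = line.split()
        adj.setdefault(source, set())
        for target in targets:
            adj[source].add(target)
            adj.setdefault(target, set()).add(source)
    return adj


# hypothetical example input, not taken from the training data
example_lines = ["# a small graph", "a b c", "b c", "d"]
example_adj = parse_adjlist_lines(example_lines)
```

Note that a vertex listed alone on a line (`d` above) is an isolated vertex with no edges.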

Your task is to produce a classifier that has
- high accuracy in distinguishing between these five generators, and
- low total CO<sub>2</sub> emissions (considering the three stages of model development, training and testing).

You are free to use any approach and any third-party packages/resources you choose, in any programming language. 

If you are using Python, you will probably want to use the [scikit-learn](https://scikit-learn.org/stable/) package to build your classifier.

One important decision will be how to extract informative features from the graphs.

## One possible approach

Given a <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a> G, we can apply the following procedure:

1) remove all vertices with degree 0 (i.e. unconnected vertices)
2) remove all vertices with degree 1 (i.e. those connected to only one other vertex)
3) keep repeating (2) until there are no more vertices with degree 1
4) output the resulting graph, known as the <a href="https://en.wikipedia.org/wiki/Degeneracy_(graph_theory)">2-core</a> of the original graph G.

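We can continue the procedure with progressively higher degree thresholds to generate the 3-core, 4-core, and so on. In general, the k-core is what remains after repeatedly removing every vertex of degree less than k.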
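The peeling procedure can be sketched in a few lines of plain Python. This is an illustrative version only, assuming an undirected graph without self-loops stored as a dictionary mapping each vertex to the set of its neighbours; the function name `k_core` and the example graph are made up for this sketch, and NetworkX provides its own `nx.k_core` for real use.

```python
def k_core(adj, k):
    """Return the k-core of a graph given as {vertex: set_of_neighbours}."""
    # work on a copy so the caller's graph is left untouched
    core = {v: set(nbrs) for v, nbrs in adj.items()}
    # vertices whose degree is already below k are queued for removal
    queue = [v for v, nbrs in core.items() if len(nbrs) < k]
    while queue:
        v = queue.pop()
        if v not in core:
            continue  # already removed via an earlier queue entry
        for u in core.pop(v):
            core[u].discard(v)
            # removing v may push a neighbour below the threshold
            if len(core[u]) < k:
                queue.append(u)
    return core


# hypothetical example: a triangle (0, 1, 2) with a tail 2-3-4 and an isolated vertex 5
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}, 5: set()}
two_core = k_core(example, 2)    # only the triangle survives
three_core = k_core(example, 3)  # nothing survives
```

In the example, vertex 5 is dropped for having degree 0, vertex 4 for having degree 1, and vertex 3 only becomes removable once vertex 4 is gone, which is why the removal step has to be repeated.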

The [onion decomposition](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.core.onion_layers.html#networkx.algorithms.core.onion_layers) of a graph G is based on the k-core procedure described above, but it also records the step (the "layer") at which each vertex is removed.


### Example: onion decomposition of the Facebook dataset

In [None]:
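from collections import Counter

import networkx as nx

# load a graph
G = nx.read_adjlist("data/facebook.txt", nodetype=int)

print(f"original graph has {len(G.nodes):d} vertices and {len(G.edges):d} edges")

# onion_layers maps each vertex to the index of the layer in which it is removed
od = nx.onion_layers(G)

# count the vertices in each layer in a single pass, then report the first 20 layers
layer_sizes = Counter(od.values())
for i in range(1, 21):
    print(f"layer {i:d} contains {layer_sizes[i]:d} vertices")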


The sequence of layer sizes contains a lot of information about the local structure of the graph. 

It might make a useful set of features for distinguishing between graphs of different types...
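One way to turn the layer sizes into classifier input is sketched below. It takes a vertex-to-layer dictionary of the kind returned by `nx.onion_layers` and produces a fixed-length list. Two choices here are assumptions rather than part of the task: the sizes are divided by the number of vertices (because the training graphs differ in size), and only the first `n_layers` layers are kept. The function name `layer_size_features` is hypothetical.

```python
from collections import Counter


def layer_size_features(layers, n_layers=20):
    """Turn {vertex: layer_index} into a fixed-length list of layer fractions."""
    counts = Counter(layers.values())
    n_vertices = max(len(layers), 1)  # guard against an empty graph
    # layer indices start at 1; layers beyond n_layers are ignored
    return [counts[i] / n_vertices for i in range(1, n_layers + 1)]


# hypothetical example: four vertices spread over three layers
example_layers = {"a": 1, "b": 1, "c": 2, "d": 3}
example_features = layer_size_features(example_layers, n_layers=4)
```

With NetworkX available, the call would look like `layer_size_features(nx.onion_layers(G))`, giving one row of the feature matrix per graph.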

## Tracking emissions

Please use the CodeCarbon tracker to estimate the total CO<sub>2</sub> emissions produced during the development of your model.

If working in a notebook, an easy way to do this is to run 

In [None]:
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()

once at the beginning of your coding, and 

In [None]:
emissions = tracker.stop()
print(emissions)

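when you have finished development and are ready to do the final model training. The value returned by `tracker.stop()` is the estimated emissions in kg of CO<sub>2</sub>-equivalent.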

## A note on the use of GenAI

You may want to make use of Generative AI to help you with your coding.
This is perfectly acceptable, but there is an associated environmental cost that we will need to take into account.

It is currently difficult to estimate accurately the emissions associated with tools such as GitHub Copilot (although some projects aim to provide CodeCarbon-like functionality for LLM APIs, e.g. https://ecologits.ai/).

However, we do have [some general estimates](https://piktochart.com/blog/carbon-footprint-of-chatgpt/) for the CO<sub>2</sub> emissions that can be assigned to a single LLM query.

If you want to use GenAI for this hackathon task, please keep a record of the approximate number of queries you make to GenAI tools during your coding. You will be asked to enter this number when you submit your results.


## Results submission

Towards the end of the hackathon, we will provide a link to the final evaluation dataset, which will be a zipped directory `test.zip`.

Train your final model using the full `train` dataset and record the associated CO<sub>2</sub> emissions.

Use your model to predict the generator of each graph in the `test` dataset and record the associated CO<sub>2</sub> emissions.

Calculate the overall accuracy of your model predictions on this dataset, as a decimal between 0 and 1.

(Each of the five graph generators has the same number of test graphs, so the expected accuracy of a random predictor is `0.2`.)
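As a small illustration of the accuracy calculation, the sketch below computes the fraction of correct predictions in plain Python. The label names and the function name `accuracy` are hypothetical; if you are using scikit-learn, `sklearn.metrics.accuracy_score` computes the same quantity.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels, between 0 and 1."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# hypothetical balanced test set with one graph per generator
true_labels = ["gen_a", "gen_b", "gen_c", "gen_d", "gen_e"]
# a predictor that always answers "gen_a" is right one time in five
constant_guess = ["gen_a"] * len(true_labels)
baseline = accuracy(true_labels, constant_guess)
```

On a balanced test set, always predicting the same generator scores the same as guessing at random, so any useful classifier should score clearly above this baseline.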

Submit your results using [this form](https://forms.office.com/e/2FgQFkth4L).


## Competition Scoring

This competition will have a winner. We will rank all entries by accuracy (highest first) and by CO<sub>2</sub> emissions (lowest first). For each entry, we will sum its ranks on these two criteria, and the entry with the lowest total rank will be declared the winner.
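The rank-sum rule can be illustrated with a short sketch. The team names and numbers are invented, the function name `total_ranks` is hypothetical, and the treatment of ties (broken by input order here) is an assumption, since the rules above do not specify it.

```python
def total_ranks(entries):
    """entries: {name: (accuracy, total_emissions_kg)} -> {name: summed rank}."""
    # rank 1 goes to the highest accuracy ...
    by_accuracy = sorted(entries, key=lambda name: -entries[name][0])
    # ... and to the lowest emissions
    by_emissions = sorted(entries, key=lambda name: entries[name][1])
    return {
        name: (by_accuracy.index(name) + 1) + (by_emissions.index(name) + 1)
        for name in entries
    }


# hypothetical entries: (accuracy, total emissions in kg CO2)
entries = {
    "team_a": (0.95, 0.30),
    "team_b": (0.90, 0.10),
    "team_c": (0.80, 0.20),
}
ranks = total_ranks(entries)
winner = min(ranks, key=ranks.get)
```

In this made-up example, `team_a` has the best accuracy but the worst emissions, so `team_b` (second on accuracy, first on emissions) ends up with the lowest total rank.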

For determining the CO<sub>2</sub> emissions, we will use the following formula:

$$
\begin{aligned}
\text{total emissions (kg CO}_2\text{)} ={}& 0.004\,\text{kg CO}_2 \times \text{number of GenAI queries} \\
&+ \text{emissions from model development (kg CO}_2\text{)} \\
&+ 10{,}000 \times \left(\text{emissions from model training (kg CO}_2\text{)} + \text{emissions from model testing (kg CO}_2\text{)}\right)
\end{aligned}
$$

The factor of 10,000 accounts for the fact that this problem is short and small, and that training and testing are only run once. For real research projects, the code would likely be more complicated and would be run many times, so the emissions would be much higher.
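The formula translates directly into a small helper, sketched below so you can estimate your own total before submitting. The function and constant names are hypothetical, and the numbers in the example are invented purely to show the arithmetic.

```python
GENAI_KG_PER_QUERY = 0.004  # kg CO2 assigned to each GenAI query
RUN_SCALING = 10_000        # scaling factor applied to training + testing


def total_emissions(n_queries, development_kg, training_kg, testing_kg):
    """Total emissions in kg CO2, following the scoring formula above."""
    return (
        GENAI_KG_PER_QUERY * n_queries
        + development_kg
        + RUN_SCALING * (training_kg + testing_kg)
    )


# hypothetical numbers: 25 queries, 0.002 kg for development,
# 1e-6 kg for training and 5e-7 kg for testing
example_total = total_emissions(25, 0.002, 1e-6, 5e-7)  # roughly 0.117 kg
```

In this invented example the GenAI queries contribute 0.1 kg, development 0.002 kg, and the scaled training and testing runs 0.015 kg, so the number of queries dominates the total.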

## Good luck!

---