# Graph Neural Networks (GNN)

Graph-based deep learning is currently one of the hottest topics in Machine Learning research. At the NeurIPS 2020 conference, GNNs constituted the most prominent topic, as can be seen in this [list of conference papers](https://github.com/naganandy/graph-based-deep-learning-literature/blob/master/conference-publications/folders/publications_neurips20/README.md).

However, GNNs are not only a subject of research. They have already found their way into a wide range of [applications](https://medium.com/criteo-engineering/top-applications-of-graph-neural-networks-2021-c06ec82bfc18).

Graph Neural Networks are suitable for Machine Learning tasks on data with structural relations between the individual data points. Examples are social and communication network analysis, traffic prediction, fraud detection and the classification tasks in this exercise. [Graph Representation Learning](https://www.cs.mcgill.ca/~wlh/grl_book/) aims to build and train models for graph datasets to be used for a variety of ML tasks.

The goals of this lecture are:
* Understand the basic concepts of Graph Neural Networks and their application categories
* Understand how to create a custom dataset for Graph Neural Networks
* Learn to implement a Graph Convolutional Network (GCN) for Node Classification by using the [Spektral](https://graphneural.network/)-framework
* Apply a Graph Convolutional Network to predict the genre of a song. For this, a comprehensive set of playlists and tracks is accessed from **Spotify** via the Python API [spotipy](https://spotipy.readthedocs.io/en/2.18.0/).

## Course of Action

* Please write all executable python code in ```Code```-Cells (```Cell```->```Cell Type```->```Code```) and all Text as [Markdown](http://commonmark.org/help/) in ```Markdown```-Cells
* Describe your thinking and your decisions (where appropriate) in an extra Markdown Cell or via Python comments
* In general: discuss all your results and comment on them (are they good/bad/unexpected, could they be improved, how?, etc.). Furthermore, visualise your data (input and output).
* Write a short general conclusion at the end of the notebook
* Further experiments are encouraged. However, don't forget to comment on your reasoning.
* Use a scientific approach for all experiments (i.e. develop a hypothesis or concrete question, make observations, evaluate results)

## Submission

Upload your complete Notebook to [Ilias](https://learn.mi.hdm-stuttgart.de/ilias/ilias.php?ref_id=21049&cmdClass=ilrepositorygui&cmdNode=3k&baseClass=ilrepositorygui) by the defined deadline. One Notebook per group is sufficient. Edit the team member table below.

**Important**: Also attach an HTML version of your notebook (```File```->```Download as```->```HTML```) in addition to the ```.ipynb```-File.

| Teammember |                    |
|------------|--------------------|
| 1.         | Geoffrey Hinton    |
| 2.         | Yoshua Bengio      |
| 3.         | Yann LeCun         |
| 4.         | Jürgen Schmidhuber |

# Background 
Refresh the theory about graph neural networks from the lecture: https://maucher.pages.mi.hdm-stuttgart.de/mlbook/neuralnetworks/GraphNeuralNetworks.html. In this notebook, we will also use GNNs for Node Classification, but create a custom graph from Spotify data.

Questions:
1. What are the different prediction tasks that can be solved with GNNs?
2. Explain in a few sentences how the Graph Convolution layer works.
3. Why do you need shortcut connections? 
4. Try to explain in which applications in general a GNN may be beneficial, compared to other neural network architecture-types. 

# Predicting the genre of music tracks
In this notebook we implement our own application of a Graph Neural Network for node classification. Instead of the subject of a paper, the GNN shall now predict the genre of a music track. In this setting:

* our nodes are music-tracks, which are described initially only by their audio-features (instead of papers, which are described initially by their BoW). 
* we define two nodes to be connected, if the music-tracks are close together within a spotify-playlist (instead of nodes that are connected, if the paper of the one node has a reference to the paper of the other node). 

You do not have to implement your own GCN-layer as we will use the [Spektral Library](https://graphneural.network/).  However, the challenge now is to
* access and preprocess the spotify-data
* create a custom Spektral Dataset which can be used to train different GNNs with Spektral

## Spotify Data
Since the Spotify API has a rate limit and we also had issues with API changes last semester, we are making the JSON responses available offline for further use. To access the spotify data, the `SpotifyMock` class is provided, which returns the same JSON responses as the API. The data was collected on June 22.

In [1]:
import json
import os 

class SpotifyMock:

    def __init__(self, root_path="./spotify_data"):
        self.root_path = root_path

    def load_json_for_path(self, path):
        full_path = os.path.join(self.root_path, path)
        with open(full_path, "r") as f:
            data = json.load(f)
        return data
        
    def categories(self):
        return self.load_json_for_path('categories.json')

    def category_playlists(self, category_id):
        return self.load_json_for_path(f'categories/{category_id}.json')

    def playlist_tracks(self, playlist_id):
        return self.load_json_for_path(f'playlists/{playlist_id}.json')

    def audio_features(self, track_id):
        return self.load_json_for_path(f'tracks/{track_id}.json')

spotify = SpotifyMock()


## Access data from spotify
If you want to use the official Spotify API for further experiments with other data, you need to apply for a [Spotify developer account](https://developer.spotify.com/documentation/general/guides/app-settings/). Successful registration provides you with a `client-ID` and a `client-secret`. To access data from Spotify, we use the [spotipy Python package](https://spotipy.readthedocs.io/en/2.18.0/#). You can use your `client-ID` and `client-secret` to connect to Spotify. Once this connection has been established, all methods of [spotipy](https://spotipy.readthedocs.io/en/2.18.0/#) can be applied via this connector, e.g. `spotify.categories()`. The next two cells are of the type `raw`, so that they are not executed unintentionally.
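
As a rough orientation, a connection via spotipy's client-credentials flow could look like the sketch below. It assumes that `spotipy` is installed and that the credentials are stored in the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`; where you keep your credentials is up to you. The helper name `connect_to_spotify` is made up for this sketch.

```python
import os


def connect_to_spotify():
    """Return a spotipy client authenticated via the client-credentials flow."""
    # Imported lazily so that the notebook still runs offline with SpotifyMock.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    # Assumption: the credentials were exported as environment variables.
    auth_manager = SpotifyClientCredentials(
        client_id=os.environ["SPOTIPY_CLIENT_ID"],
        client_secret=os.environ["SPOTIPY_CLIENT_SECRET"],
    )
    return spotipy.Spotify(auth_manager=auth_manager)


# Uncomment to replace the offline mock by the live API (needs network access):
# spotify = connect_to_spotify()
# spotify.categories(limit=50)
```

Note that assigning the result to `spotify` replaces the `SpotifyMock` instance created above, so all subsequent calls would hit the rate-limited live API.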

## Collect Data

1. Apply the method `categories()` in order to receive the categories (genres) distinguished in Spotify. **(Only for the official API: note that in this and several other methods the value of the `limit` argument must be increased in order to get more than the first few elements.)**
2. The categories relevant in this exercise are

```CATEGORIES=["edm_dance","romance","metal","classical","jazz","latin","hiphop"] ```

3. Apply the method `category_playlists(category_id)` in order to get for each of the relevant categories a set of playlists (at least 60 playlists in sum).
4. Apply the method `playlist_tracks(playlist_id)` in order to get for each of the playlists the music-tracks of the playlist. In this way you get for each track a list of features, such as artist, title, popularity, etc. 
5. Not contained in the set of features, returned by `playlist_tracks(playlist_id)` are the audio-features. However, these audio-features can be obtained by applying method `audio_features(track_id)`. 
6. Combine all the methods described above in order to create a pandas Dataframe which contains at least 4500 music tracks of the relevant categories. In the dataframe each track constitutes a row, and each row is described by the following features (columns):
    
    * unique-id of the track
    * name of the playlist, that contains the track
    * genre (category) of this playlist
    * name of the track
    * artist of the track
    * position of the track within the playlist
    * all the track's audio-features, which are returned by the method `audio_features(track_id)`.
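
The steps above could be combined roughly as in the following sketch. The nesting of the JSON responses (`["playlists"]["items"]`, `["items"]`, `["track"]`, `["artists"][0]["name"]`) is an assumption based on the usual shape of Spotify Web API responses and should be checked against the files in `./spotify_data`. The helper `collect_tracks`, its parameter `max_playlists_per_category`, the column names and the stand-in `TinyClient` are invented for illustration.

```python
import pandas as pd


def collect_tracks(client, categories, max_playlists_per_category=10):
    """Flatten categories -> playlists -> tracks into one row per track."""
    rows = []
    for category_id in categories:
        playlists = client.category_playlists(category_id)["playlists"]["items"]
        for playlist in playlists[:max_playlists_per_category]:
            items = client.playlist_tracks(playlist["id"])["items"]
            for position, item in enumerate(items):
                track = item.get("track")
                if not track or not track.get("id"):
                    continue  # skip entries without a usable track id
                features = client.audio_features(track["id"])
                if isinstance(features, list):  # some clients return a list
                    features = features[0]
                if not features:
                    continue  # no audio features available for this track
                rows.append({
                    "track_id": track["id"],
                    "playlist": playlist["name"],
                    "genre": category_id,
                    "name": track["name"],
                    "artist": track["artists"][0]["name"],
                    "position": position,
                    **{k: v for k, v in features.items() if k != "id"},
                })
    return pd.DataFrame(rows)


class TinyClient:
    """Hand-made stand-in with the same method names as SpotifyMock."""

    def category_playlists(self, category_id):
        return {"playlists": {"items": [{"id": "p1", "name": "Demo list"}]}}

    def playlist_tracks(self, playlist_id):
        return {"items": [
            {"track": {"id": "t1", "name": "Song A", "artists": [{"name": "X"}]}},
            {"track": None},
            {"track": {"id": "t2", "name": "Song B", "artists": [{"name": "Y"}]}},
        ]}

    def audio_features(self, track_id):
        return [{"id": track_id, "tempo": 120.0, "energy": 0.5}]


df = collect_tracks(TinyClient(), ["jazz"])
print(df)
```

With the real data you would pass `spotify` and `CATEGORIES` instead of the toy client. A track may appear in several playlists, so you still have to decide how to handle duplicate track ids.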
      

## Create Node Features and Edges

To create the `graph_info` as in [the introductory example](#graphinfo) from the dataframe created in the previous task:
1. Take the *audio-features* as `node_features` of the music tracks
2. Create the `edges` array by connecting each music track to all tracks, which appear either in the N previous tracks or in the N successive tracks of the same playlist. (Set e.g. N=3).
3. Initially you can set the weights of all edges to be 1. But maybe there is a better option?

The `genre`-column of the dataframe, created in the previous task, constitutes the node-labels.
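
One possible way to build the edge list is sketched below on a toy dataframe. Because the edges are treated as undirected here, connecting every track to its N successors already covers the N predecessors. The column names `playlist` and `position` are assumptions about your dataframe; `intId`, `source` and `target` follow the names used in the dataset skeleton further down. The helper `build_edges` is made up for this sketch.

```python
import pandas as pd


def build_edges(df, n=3):
    """Connect each track to the n following tracks of the same playlist."""
    edges = set()
    for _, group in df.sort_values("position").groupby("playlist"):
        ids = group["intId"].tolist()
        for i, source in enumerate(ids):
            for target in ids[i + 1 : i + 1 + n]:
                if source != target:
                    # store each undirected edge once, smaller id first
                    edges.add((min(source, target), max(source, target)))
    return pd.DataFrame(sorted(edges), columns=["source", "target"])


toy = pd.DataFrame({
    "intId": [0, 1, 2, 3, 4],
    "playlist": ["a", "a", "a", "a", "b"],
    "position": [0, 1, 2, 3, 0],
})
edges = build_edges(toy, n=2)
print(edges)
```

In the toy example, tracks 0 and 3 are three positions apart and therefore stay unconnected for `n=2`, and the single track of playlist `b` gets no edge at all.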

## Preprocessing Node Features
Inspect the `node_features` (audio-features of tracks). Is there a need for some kind of preprocessing? If so, implement it.
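
To illustrate why preprocessing may matter, the sketch below rescales a few toy features with very different value ranges to the interval [0, 1] using scikit-learn's `MinMaxScaler`. The numbers are invented, and min-max scaling is only one option (standardisation is another); which one works best for the Spotify data is for you to find out.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values; note the very different ranges of the columns.
toy = pd.DataFrame({
    "tempo": [90.0, 120.0, 150.0],
    "loudness": [-20.0, -10.0, -5.0],
    "duration_ms": [180000, 240000, 300000],
    "energy": [0.2, 0.5, 0.9],
})

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled)
```

For the baseline model it is common practice to fit the scaler on the training rows only and then apply it to the test rows, so that no information from the test set leaks into training.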

## Save data to csv files

After preprocessing the data, store the node features in a csv file and the edges in another csv file. These files can then be used for the custom dataset for the GNN, so we don't always have to create the graphs from the raw data.
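
A minimal sketch of the round trip is shown below. The file names are placeholders, and the toy frames only contain the columns that the dataset skeleton further down reads by name (`intId`, `genre`, `source`, `target`) plus one example feature.

```python
import os
import tempfile

import pandas as pd

nodes = pd.DataFrame({
    "intId": [0, 1, 2],
    "genre": ["jazz", "metal", "jazz"],
    "tempo": [0.1, 0.9, 0.4],
})
edges = pd.DataFrame({"source": [0, 1], "target": [1, 2]})

# A temporary directory keeps the sketch self-contained; use your own path.
out_dir = tempfile.mkdtemp()
nodes_path = os.path.join(out_dir, "spotify_nodes.csv")
edges_path = os.path.join(out_dir, "spotify_edges.csv")

# index=False avoids an extra unnamed index column when reading the files back.
nodes.to_csv(nodes_path, index=False)
edges.to_csv(edges_path, index=False)

nodes_back = pd.read_csv(nodes_path)
edges_back = pd.read_csv(edges_path)
```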

For reference, your processed node-feature dataframe should look like this:

<img src="pics/spotify-data.png" width=1800 />

And the edges dataframe should look like this:

<img src="pics/spotify-edges.png" width=150 />

## Train and evaluate the Baseline Model
As in the [GNN lecture](https://maucher.pages.mi.hdm-stuttgart.de/mlbook/neuralnetworks/GraphNeuralNetworks.html#build-a-baseline-neural-network-model), train and evaluate the baseline model. Adapt the hyperparameters of the baseline model to the current task. The necessary methods are already copied from the tutorial in the cell below.

Create the training data only from the spotify data without the information about the edges, since we can not use this information for a standard neural network. 

`SparseCategoricalCrossentropy` can be used if the labels are in integer format. If the labels are one-hot encoded, use `CategoricalCrossentropy` instead. The same distinction applies to the metrics `SparseCategoricalAccuracy` and `CategoricalAccuracy`.
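
The difference between the two label formats can be illustrated without Keras. The following NumPy sketch (not the Keras implementation) encodes a few invented genre labels as integers and as one-hot vectors and computes the cross-entropy from logits in both ways; the two results coincide.

```python
import numpy as np

genres = np.array(["jazz", "metal", "jazz", "latin"])

# Integer labels: index into the sorted list of distinct classes.
classes, y_int = np.unique(genres, return_inverse=True)
# One-hot labels: one row of the identity matrix per sample.
y_onehot = np.eye(len(classes))[y_int]

# Invented logits for 4 samples and 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.0, 0.0],
    [-0.5, 0.4, 2.2],
])
# Log-softmax over the class axis.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# "Sparse" variant: pick the log-probability of the true class by index.
sparse_ce = -log_probs[np.arange(len(y_int)), y_int].mean()
# "Categorical" variant: weight the log-probabilities with the one-hot rows.
categorical_ce = -(y_onehot * log_probs).sum(axis=1).mean()

print(classes, y_int)
print(sparse_ce, categorical_ce)
```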

In [None]:
import os
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


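def run_experiment(model, x_train, y_train):
    # Relies on the globals learning_rate, num_epochs and batch_size,
    # which must be defined before this function is called.
    # Compile the model.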
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )
    # Create an early stopping callback.
    early_stopping = keras.callbacks.EarlyStopping(
        monitor="val_acc", patience=50, restore_best_weights=True
    )
    # Fit the model.
    history = model.fit(
        x=x_train,
        y=y_train,
        epochs=num_epochs,
        batch_size=batch_size,
        validation_split=0.3,
        callbacks=[early_stopping],
        verbose=True,
    )

    return history


def display_learning_curves(history):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

    ax1.plot(history.history["loss"])
    ax1.plot(history.history["val_loss"])
    ax1.legend(["train", "validation"], loc="upper right")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")

    ax2.plot(history.history["acc"])
    ax2.plot(history.history["val_acc"])
    ax2.legend(["train", "validation"], loc="upper right")
    ax2.set_xlabel("Epochs")
    ax2.set_ylabel("Accuracy")
    plt.show()


def create_ffn(hidden_units, dropout_rate, name=None):
    fnn_layers = []

    for units in hidden_units:
        fnn_layers.append(layers.BatchNormalization())
        fnn_layers.append(layers.Dropout(dropout_rate))
        fnn_layers.append(layers.Dense(units, activation=tf.nn.gelu))

    return keras.Sequential(fnn_layers, name=name)


def create_baseline_model(hidden_units, num_classes, dropout_rate=0.2):
    inputs = layers.Input(shape=(num_features,), name="input_features")
    x = create_ffn(hidden_units, dropout_rate, name="ffn_block1")(inputs)
    for block_idx in range(4):
        # Create an FFN block.
        x1 = create_ffn(hidden_units, dropout_rate, name=f"ffn_block{block_idx + 2}")(x)
        # Add skip connection.
        x = layers.Add(name=f"skip_connection{block_idx + 2}")([x, x1])
    # Compute logits.
    logits = layers.Dense(num_classes, name="logits")(x)
    # Create the model.
    return keras.Model(inputs=inputs, outputs=logits, name="baseline")
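
The helper functions above read several global variables (`learning_rate`, `num_epochs`, `batch_size`, `num_features`) and the following cells expect `hidden_units`, `num_classes`, `dropout_rate`, `x_train`, `y_train`, `x_test` and `y_test`. A possible setup is sketched below on random toy data with the same shape as the task (11 audio features, 7 genres). The hyperparameter values are arbitrary starting points, not tuned recommendations, and the stratified split via scikit-learn is just one option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real data: 100 tracks, 11 audio features, 7 genres.
rng = np.random.default_rng(0)
features = rng.random((100, 11)).astype("float32")
labels = np.arange(100) % 7  # integer labels, every genre occurs 14 or 15 times

# Stratify so that every genre is represented in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

# Globals read by the helper functions; values are only starting points.
num_features = x_train.shape[1]
num_classes = len(np.unique(labels))
hidden_units = [32, 32]
learning_rate = 0.01
dropout_rate = 0.5
num_epochs = 300
batch_size = 256
```

The integer labels match the `SparseCategoricalCrossentropy` loss used in `run_experiment`.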

In [None]:
baseline_model = create_baseline_model(hidden_units, num_classes, dropout_rate)
baseline_model.summary()

In [None]:
history = run_experiment(baseline_model, x_train, y_train)
_, test_accuracy = baseline_model.evaluate(x=x_test, y=y_test, verbose=0)
print(f"Test accuracy: {round(test_accuracy * 100, 2)}%")

## Train and evaluate a Graph Convolutional Network
Now we train and evaluate a Graph Convolutional Network (GCN) with the help of the [Spektral Library](https://graphneural.network/). Let's get started by first reading the tutorial to understand the basic concepts of the framework: [Tutorial](https://graphneural.network/getting-started/) 

First we need to create a custom dataset; then we can try out different architectures. The cell below provides a skeleton of the dataset class, in which some methods still have to be implemented. Since we represent songs as nodes and want to predict the genre of each song, we have one single big graph whose nodes we classify. We therefore need a dataset that can be used in single mode.

- Creating a Custom Dataset: [Tutorial](https://graphneural.network/creating-dataset/)
    - Fill out the skeleton of `create_masks_for_splits`, where the indices of each data split are stored in the class attributes `mask_tr`, `mask_va` and `mask_te`, which are used later in the data loader. The data loader can then sample batches of the data only from the given indices (A reference implementation of the Cora Citation Dataset for Spektral is given [here](https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/citation.py)).
    - Fill out the skeleton of `get_y`, where the labels are transformed into a one-hot encoded representation.
    - Are more preprocessing steps necessary so that the GNN converges?
- Different Data modes: [Tutorial](https://graphneural.network/data-modes/)
    - Explain in a few sentences, why we have different data modes for GNNs.
- Train a GCN for predicting the music genre for the nodes in the graph
    - Since our task is node prediction, similar to classifying the nodes of the Cora dataset, we can review the Spektral example for node prediction: https://github.com/danielegrattarola/spektral/blob/master/examples/node_prediction/citation_gcn.py
    - If you want to try different architectures, you can take a look at the examples of node-level predictions, which include some references for them: https://graphneural.network/examples/
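
As background for the preprocessing question above, the following NumPy sketch shows the symmetric normalisation of the adjacency matrix used by the standard GCN formulation, $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$, followed by a single propagation step $H = \mathrm{ReLU}(\hat{A} X W)$. The toy graph, features and random weights are invented. This only illustrates the math; in Spektral such a preprocessing step is usually attached to the dataset as a transform, as far as we can tell from the linked citation example.

```python
import numpy as np

# Toy graph: 4 nodes on a path 0-1-2-3, two features per node.
a = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

a_tilde = a + np.eye(len(a))                     # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))  # diagonal of D^{-1/2}
a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))         # layer weights (random, untrained)
h = np.maximum(a_hat @ x @ w, 0.0)  # one graph convolution with ReLU

print(np.round(a_hat, 3))
print(h.shape)
```

Each row of `h` mixes the features of a node with those of its direct neighbours, weighted by the node degrees.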

In [None]:
!pip install spektral

In [None]:
import os
from spektral.data import Dataset
from spektral.data.graph import Graph
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
import scipy.sparse as sparse
from sklearn.preprocessing import MinMaxScaler


def _idx_to_mask(idx, l):
    mask = np.zeros(l)
    mask[idx] = 1
    return np.array(mask, dtype=bool)

class SpotifyDataset(Dataset):
    """
    A single-graph dataset: nodes are Spotify tracks, edges connect tracks
    that are close together within a playlist.
    """
    def __init__(self, data_path, edges_path, dtype=np.float32, **kwargs):
        self.data_path = data_path
        self.edges_path = edges_path
        self.dtype = dtype

        self.relevant_cols = ['intId', 'genre','danceability', 'tempo', 'acousticness',
                             'duration_ms', 'energy', 'instrumentalness', 'liveness', 
                             'loudness','speechiness', 'time_signature', 'valence']
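        # Keep the column order of relevant_cols so that the feature order is deterministic.
        self.feature_names = [c for c in self.relevant_cols if c not in ("intId", "genre")]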
        
        super().__init__(**kwargs)
        
    def download(self):
        # Create the directory
        os.makedirs(self.path, exist_ok=True)

    def create_masks_for_splits(self, audio_data, train_split=0.9):
        
        # index_tr = a list of indices of the training data in the whole dataset
        # index_va = a list of indices of the validation data in the whole dataset
        # index_te = a list of indices of the test data in the whole dataset
        
        # Train/valid/test masks
        size_dataset = len(audio_data)
        self.mask_tr = _idx_to_mask(index_tr, size_dataset)
        self.mask_va = _idx_to_mask(index_va, size_dataset)
        self.mask_te = _idx_to_mask(index_te, size_dataset)

    def get_y(self, audio_data):
        # return one-hot encoded labels of the dataset
        return
        
    def read(self):
        df = pd.read_csv(self.data_path)

        # the node features 
        x = df[self.feature_names].to_numpy()

        # read the edges, we set the default edge weight to 1 for all edges
        edges = pd.read_csv(self.edges_path)[["source", "target"]]
        edges['weight'] = 1
        a = edges.to_numpy()

        # create one-hot encoded labels
        y = self.get_y(df)
        
        self.create_masks_for_splits(df)

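        # create the adjacency matrix from the edge list
        # NOTE: nx.adjacency_matrix orders rows/columns by G.nodes(), i.e. by first
        # appearance in the edge list; pass nodelist=... if this order has to match
        # the row order of x.
        G = nx.from_pandas_edgelist(edges)
        a = nx.adjacency_matrix(G)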
        a.setdiag(0)
        a.eliminate_zeros()

        # Are more preprocessing steps necessary? 


        
        return [
            Graph(
                x=x.astype(self.dtype),
                a=a.astype(self.dtype),
                y=y.astype(self.dtype),
            )
        ]
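
The two methods left open in the skeleton could be approached along the lines of the following standalone sketch. It shuffles the node indices, cuts them into training, validation and test parts, converts the index lists into boolean masks, and one-hot encodes a few invented genre labels. The helper names, the parameter `val_fraction_of_train` and the split sizes are assumptions made for this illustration, not part of the skeleton.

```python
import numpy as np


def idx_to_mask(idx, length):
    """Boolean mask of the given length that is True at the given indices."""
    mask = np.zeros(length, dtype=bool)
    mask[idx] = True
    return mask


def split_indices(n_nodes, train_split=0.9, val_fraction_of_train=0.1, seed=0):
    """Shuffle node indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_nodes)
    n_train_total = int(train_split * n_nodes)
    n_val = int(val_fraction_of_train * n_train_total)
    index_va = perm[:n_val]
    index_tr = perm[n_val:n_train_total]
    index_te = perm[n_train_total:]
    return index_tr, index_va, index_te


index_tr, index_va, index_te = split_indices(20)
masks = [idx_to_mask(idx, 20) for idx in (index_tr, index_va, index_te)]

# One-hot encoding of string labels via their integer class index.
genres = np.array(["jazz", "metal", "jazz"])
classes, y_int = np.unique(genres, return_inverse=True)
y_onehot = np.eye(len(classes), dtype=np.float32)[y_int]
```

Every node ends up in exactly one of the three masks. A purely random split ignores the class distribution, so a stratified split may be worth considering if the genres are imbalanced.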


## Comparison and Discussion

* Compare and discuss the results of the baseline- and the GCN-model. 
* Propose at least 3 other applications for which Graph Neural Networks may be suitable. For each of your proposals describe what the `node_features` and the `edges` may be.