# Graph Convolutional Networks for Clickbait Detection

Authors: Stephen Gelinas, Kazuma Yamamoto, Ethan Zhou\
Credits: Parker Erickson https://colab.research.google.com/drive/11tcL4KXXwY__TmUUTjOf6InFQMC-VsG6#scrollTo=_BNqh7fz0486 \
PyTorch implementations of GCN: https://github.com/iworldtong/text_gcn.pytorch, https://github.com/codeKgu/Text-GCN

## 1.1 Install Queries on TigerGraph Server - UPDATE

This notebook walks through a basic example of using a graph convolutional network (GCN) for text classification. The data is collected from a TigerGraph database using the Python package [pyTigerGraph](https://github.com/tigergraph/pyTigerGraph). The collected data is then passed through a GCN to output predictions about a headline.

## 1.2 Installing Packages

The core packages that need to be installed are PyTorch, DGL, and pyTigerGraph. PyTorch and DGL are used for creating and training the GCN, while pyTigerGraph is used for connecting to the TigerGraph database. We also install networkx, which converts the list of edges from TigerGraph into a graph that DGL can work with.

In [1]:
!pip install pyTigerGraph
!pip install torch torchvision
!pip install dgl
!pip install networkx

Collecting pyTigerGraph
  Downloading pyTigerGraph-1.2.5-py3-none-any.whl (170 kB)
Collecting validators
  Downloading validators-0.20.0.tar.gz (30 kB)
Collecting pyTigerDriver
  Downloading pyTigerDriver-1.0.15-py3-none-any.whl (12 kB)
Building wheels for collected packages: validators
  Building wheel for validators (setup.py) ... done
  Created wheel for validators: filename=validators-0.20.0-py3-none-any.whl size=19567 sha256=869049b08848f8f9de21e8f9420ae9ae76d0391b56db930717bb929224cb7467
  Stored in directory: /Users/stephengelinas/Library/Caches/pip/wheels/5f/55/ab/36a76989f7f88d9ca7b1f68da6d94252bb6a8d6ad4f18e04e9
Successfully built validators
Installing collected packages: validators, pyTigerDriver, pyTigerGraph
Successfully installed pyTigerDriver-1.0.15 pyTigerGraph-1.2.5 validators-0.20.0
Collecting torchvision
  Downloading torchvision-0.14.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.4 MB)


## 1.3 Importing Packages

We now import the packages we just installed.

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pyTigerGraph as tg
import dgl
import networkx as nx
from heapq import nlargest, nsmallest

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


## 1.4 Configuration

Here we define the training hyperparameters: the number of epochs (30 or fewer is usually enough for a 2-layer GCN) and the learning rate (0.01 seems to work well). Both values still need to be optimized.

In [5]:
numEpochs = 25
learningRate = 0.01

## 1.5 Creating the Graph Convolutional Network

The block below defines the classes for the GCN. The main one to look at is GraphConvolution, the individual building block that the GCN class is made of. The GCN class defines the structure of our neural network, and MLP is a plain two-layer feed-forward network.

In [7]:
class MLP(nn.Module):
    def __init__(self, input_dim, dropout_rate=0., num_classes=10):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, 200)
        self.fc2 = nn.Linear(200, num_classes)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(dropout_rate)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        return out

In [8]:
class GraphConvolution(nn.Module):
    def __init__(self, input_dim, \
                       output_dim, \
                       support, \
                       act_func = None, \
                       featureless = False, \
                       dropout_rate = 0., \
                       bias=False):
        super(GraphConvolution, self).__init__()
        self.support = support
        self.featureless = featureless
        for i in range(len(self.support)):
            setattr(self, 'W{}'.format(i), nn.Parameter(torch.randn(input_dim, output_dim)))
        if bias:
            self.b = nn.Parameter(torch.zeros(1, output_dim))
        self.act_func = act_func
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        x = self.dropout(x)
        for i in range(len(self.support)):
            if self.featureless:
                pre_sup = getattr(self, 'W{}'.format(i))
            else:
                pre_sup = x.mm(getattr(self, 'W{}'.format(i)))
            if i == 0:
                out = self.support[i].mm(pre_sup)
            else:
                out = out + self.support[i].mm(pre_sup)
        if hasattr(self, 'b'):
            out = out + self.b  # the bias only exists when bias=True was passed
        if self.act_func is not None:
            out = self.act_func(out)
        self.embedding = out
        return out


class GCN(nn.Module):
    def __init__(self, input_dim, \
                       support,\
                       dropout_rate=0., \
                       num_classes=10):
        super(GCN, self).__init__()
        # GraphConvolution
        self.layer1 = GraphConvolution(input_dim, 200, support, act_func=nn.ReLU(), featureless=True, dropout_rate=dropout_rate)
        self.layer2 = GraphConvolution(200, num_classes, support, dropout_rate=dropout_rate)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        return out
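
The classes above take a `support` argument but the notebook never shows how it is built. In the GCN paper linked in section 3.2, the propagation matrix is the normalized adjacency with self-loops, D^-1/2 (A + I) D^-1/2. The following is a minimal sketch under that assumption, using a made-up four-node edge list; `normalized_adjacency` is a hypothetical helper name and is not taken from either credited repository.

```python
import torch

def normalized_adjacency(edges, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 as a dense float tensor (hypothetical helper)."""
    adj = torch.zeros(num_nodes, num_nodes)
    for src, dst in edges:
        adj[src, dst] = 1.0
        adj[dst, src] = 1.0          # treat every edge as undirected
    adj.fill_diagonal_(1.0)          # self-loops, so each node keeps its own signal
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    # Scale row i by d_i^-1/2 and column j by d_j^-1/2
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

toy_edges = [(0, 1), (1, 2), (2, 3)]  # made-up path graph
a_hat = normalized_adjacency(toy_edges, num_nodes=4)
support = [a_hat]                     # the classes above iterate over a list of supports

# One featureless propagation step, mirroring GraphConvolution with featureless=True:
# the weight matrix itself plays the role of X @ W.
torch.manual_seed(0)
weight = torch.randn(4, 3)
hidden = torch.relu(support[0].mm(weight))

print(a_hat)
print(hidden.shape)
```

A dense matrix keeps the sketch short; for a graph with many vertices a sparse tensor would likely be a better fit.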

## 2.1 Creating Database Connection and Creating Edge List - UPDATE

This section instantiates a connection to the TigerGraph database and creates a list of tuples, each a directed edge of the form (from, to). This is done with two dictionaries: one maps each article name to a unique numerical ID, which the GCN needs to process the graph, and the other maps the ID back to the name.


#### **Assumption Alert:** We oversimplify the graph here. The query returns pairs of movies that share the same term (genre). In the real world, most people like a variety of genres, so their tastes are more nuanced than a graph whose edges only connect movies of the same genre. This hurts accuracy (a lot). Better link-creation factors might be actors, directors, etc., but we don't have those in this dataset. TigerGraph's contribution is the ease of data extraction: no JOIN operations are needed to create these links between movies.
* Note: It is possible to create a GCN that has multiple types of vertices (known as a Relational Graph Convolutional Network), but it is more complex. A good way to get started is to simplify until you only have relations between vertices of the same type.


In [None]:
graph = tg.TigerGraphConnection(
    ipAddress="https://graphml.i.tgcloud.io", 
    graphname="Recommender", 
    apiToken="bekr9ls24mlh4kbkd7g28stq8vpj67vi") # Really not the best idea to have your API key out in the open, but for the sake of the demo, here it is

movieToNum = {} # translation dictionary for movie name to number (for dgl)
numToMovie = {} # translation dictionary for number to movie name
i = 0
def createEdgeList(result): # returns tuple of number version of edge
    global i
    if result["src"] in movieToNum:
        fromKey = movieToNum[result["src"]]
    else:
        movieToNum[result["src"]] = i
        numToMovie[i] = result["src"]
        fromKey = i
        i+=1
    if result["dest"] in movieToNum:
        toKey = movieToNum[result["dest"]]
    else:
        movieToNum[result["dest"]] = i
        numToMovie[i] = result["dest"]
        toKey = i
        i+=1
    return (fromKey, toKey)
    
edges = [createEdgeList(thing) for thing in graph.runInstalledQuery("movieLinks", {}, sizeLimit=128000000)["results"][0]["@@tupleRecords"]] # creates list of edges
print(len(edges))
print(edges[:5])
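
The name-to-number translation can be tried without a database connection. The sketch below applies the same idea to made-up records that have the `src` and `dest` keys used above; `build_edge_list` is a hypothetical helper that keeps the two dictionaries local instead of relying on a global counter.

```python
def build_edge_list(records):
    """Map string endpoints to consecutive integer IDs and return integer edges."""
    name_to_id = {}
    id_to_name = {}
    edges = []
    for record in records:
        pair = []
        for name in (record["src"], record["dest"]):
            if name not in name_to_id:
                new_id = len(name_to_id)   # next unused ID
                name_to_id[name] = new_id
                id_to_name[new_id] = name
            pair.append(name_to_id[name])
        edges.append(tuple(pair))
    return edges, name_to_id, id_to_name

# Made-up records; the real ones come from the installed query
toy_records = [
    {"src": "A", "dest": "B"},
    {"src": "A", "dest": "C"},
    {"src": "B", "dest": "C"},
]
toy_edges, name_to_id, id_to_name = build_edge_list(toy_records)
print(toy_edges)   # [(0, 1), (0, 2), (1, 2)]
print(id_to_name)  # {0: 'A', 1: 'B', 2: 'C'}
```

Because IDs are handed out in order of first appearance, they always form the contiguous range 0..n-1, which is what the one-hot feature matrix later relies on.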

## 2.2 Initializing Graph

This section converts the list of edges into a graph that DGL can process in the GCN.

In [None]:
g = nx.Graph()
g.add_edges_from(edges)


G = dgl.from_networkx(g)  # replaces the deprecated dgl.DGLGraph(g) constructor call

## 2.3 Adding Features to Graph - UPDATE

We one-hot encode the features of the vertices in the graph. Feature assignment can be done in many different ways; this is just the fastest and easiest, especially given the lack of attribute information in the dataset.

If you had a graph of documents, for example, you could run doc2vec on those documents to create a feature vector for each one, and build the feature matrix by stacking those vectors.

Another possibility is that you have a graph of songs, artists, albums, etc., and you could use tempo, maximum volume, minimum volume, length, and other numerical descriptions of the song to create the feature matrix.

In [None]:
G.ndata["feat"] = torch.eye(G.number_of_nodes())

print(G.nodes[2].data['feat'])
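
One consequence of one-hot features is worth spelling out: the feature matrix is the identity, so multiplying it by a weight matrix returns the weight matrix unchanged. That is why the first GraphConvolution layer in section 1.5 can be created with featureless=True and skip the multiplication. A small sketch with made-up sizes:

```python
import torch

num_nodes, hidden_dim = 5, 3              # made-up sizes for illustration
features = torch.eye(num_nodes)           # one one-hot row per node
weight = torch.randn(num_nodes, hidden_dim)

# Identity times W is W, so row k of `weight` acts as a learned embedding of node k.
projected = features.mm(weight)
print(torch.allclose(projected, weight))  # True
```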

## 3.1 Get User Data - UPDATE OR MAYBE DELETE

In this section, we get a specific user's movie preferences. There is a lot of list comprehension going on, but just know that we are getting the user's 3 highest and lowest reviewed movies for a total of 6 labelled datapoints to feed the GCN. The remainder of the user's data is then processed and saved to test the accuracy of the GCN.

In [None]:
ratings = graph.runInstalledQuery("userRatings", {"user":"217"})["results"][0]["S1"]
print("Total Number of Reviews by User: "+str(len(ratings)))
top3Movies = [thing["attributes"]["movieTitle"] for thing in nlargest(3, ratings, key=lambda item: item["attributes"]["userRating"])] # getting the 3 highest rated movies by the user
bottom3Movies = [thing["attributes"]["movieTitle"] for thing in nsmallest(3, ratings, key=lambda item: item["attributes"]["userRating"])] # getting the 3 lowest rated movies by the user
unclassifiedMovies = [thing for thing in ratings if not((thing["attributes"]["movieTitle"] in top3Movies) or (thing["attributes"]["movieTitle"] in bottom3Movies))]

# Split the movies that stay unlabelled by the sign of the user's rating.
# Both lists are drawn from unclassifiedMovies, so the six labelled movies are excluded.
negativeRating = [thing["attributes"]["movieTitle"] for thing in unclassifiedMovies if thing["attributes"]["userRating"] < 0]
positiveRating = [thing["attributes"]["movieTitle"] for thing in unclassifiedMovies if thing["attributes"]["userRating"] >= 0]
print("Number of movies whose rating is unknown to the GCN: "+str(len(unclassifiedMovies)))
print("Number of unknown movies with a negative rating: "+str(len(negativeRating)))
print("Number of unknown movies with a positive rating: "+str(len(positiveRating)))
print(top3Movies)
print(bottom3Movies)
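
The selection logic is easier to follow on a small example. The sketch below uses made-up ratings in the same nested shape the cell above indexes into, picks two movies per end instead of three, and holds out the rest; `rating_of` and `title_of` are hypothetical helpers.

```python
from heapq import nlargest, nsmallest

# Made-up ratings; real ones come from the userRatings query
toy_ratings = [
    {"attributes": {"movieTitle": "A", "userRating": 2.0}},
    {"attributes": {"movieTitle": "B", "userRating": -1.5}},
    {"attributes": {"movieTitle": "C", "userRating": 0.5}},
    {"attributes": {"movieTitle": "D", "userRating": -0.5}},
    {"attributes": {"movieTitle": "E", "userRating": 1.0}},
]

def rating_of(item):
    return item["attributes"]["userRating"]

def title_of(item):
    return item["attributes"]["movieTitle"]

top = [title_of(r) for r in nlargest(2, toy_ratings, key=rating_of)]
bottom = [title_of(r) for r in nsmallest(2, toy_ratings, key=rating_of)]

# Everything not used as a label is held out for testing
labelled = set(top) | set(bottom)
held_out = [title_of(r) for r in toy_ratings if title_of(r) not in labelled]

print(top)       # ['A', 'E']
print(bottom)    # ['B', 'D']
print(held_out)  # ['C']
```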

## 3.2 Creating Neural Network and Labelling Relevant Verticies - UPDATE

Here, we create the GCN. A two-layer GCN appears to work better than deeper networks, which is consistent with [this](https://arxiv.org/abs/1609.02907) paper, which also used a two-layer model. We also label the wanted and unwanted vertices and set up the optimizer. Since the GCN is a semi-supervised algorithm, we do not label all of the nodes with their correct classes before training; a handful of labelled nodes per class is enough, and here we label three per class.

In [None]:
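# Assumption: support = D^-1/2 (A + I) D^-1/2, the normalized adjacency with self-loops
n = G.number_of_nodes()
adj = torch.tensor(nx.to_numpy_array(g, nodelist=list(range(n))), dtype=torch.float32).fill_diagonal_(1.0)
degInvSqrt = adj.sum(1).pow(-0.5)
support = [degInvSqrt.unsqueeze(1) * adj * degInvSqrt.unsqueeze(0)]
net = GCN(n, support, num_classes=2) # Two-layer GCN with two output classes
inputs = G.ndata["feat"]
labeled_nodes = torch.tensor([movieToNum[top3Movies[0]], movieToNum[top3Movies[1]], movieToNum[top3Movies[2]], 
                              movieToNum[bottom3Movies[0]], movieToNum[bottom3Movies[1]], movieToNum[bottom3Movies[2]]])  # only the liked movies and the disliked movies are labelled
labels = torch.tensor([0, 0, 0, 1, 1, 1])  # 0 = liked, 1 = disliked
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)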

## 3.3 Training Loop

Below is the training loop that trains the GCN. Unlike many traditional deep learning architectures, GCNs don't always need as much training or as large a dataset, because they exploit the *structure* of the data as well as its features.
* Note: because the weights are initialized randomly, a model sometimes trains poorly, or its loss gets stuck at a relatively large value. Aim for a loss below about 0.7, which is roughly the loss of random guessing between two classes (ln 2 ≈ 0.693). If that happens, stop and restart the training process (also rerun the cell above to reset the weights). Alternatively, you can run more epochs in the hope of eventually getting out of the rut.

In [None]:
all_logits = []
for epoch in range(numEpochs):
    logits = net(inputs)  # GCN.forward takes only the features; the graph enters through support
    # we save the logits for visualization later
    all_logits.append(logits.detach())
    logp = F.log_softmax(logits, 1)
    # we only compute loss for labeled nodes
    loss = F.nll_loss(logp[labeled_nodes], labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('Epoch %d | Loss: %6.3e' % (epoch, loss.item()))
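
The loss in the loop is computed only on the labelled rows of the output. The sketch below isolates that step on made-up logits, without a GCN: with untrained, all-zero logits the loss equals ln 2, the value of random guessing between two classes, and the gradient is zero for every unlabelled row. In the real model, unlabelled nodes still shape the output through the graph structure, just not through the loss directly.

```python
import math
import torch
import torch.nn.functional as F

num_nodes, num_classes = 6, 2
# Stand-in for the network output: no preference for either class
logits = torch.zeros(num_nodes, num_classes, requires_grad=True)
labeled_nodes = torch.tensor([0, 5])   # only two nodes carry labels here
labels = torch.tensor([0, 1])

logp = F.log_softmax(logits, dim=1)
loss = F.nll_loss(logp[labeled_nodes], labels)   # rows 1-4 are never indexed
loss.backward()

print(loss.item(), math.log(2))  # both about 0.693
print(logits.grad)               # only rows 0 and 5 are non-zero
```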

## 3.4 Testing Accuracy - UPDATE

Here is the code that processes the GCN's results and calculates the accuracy on the vertices that the user has reviewed but that were not labelled in the graph for the GCN to use. While this accuracy is pretty mediocre, the GCN only makes predictions based on movies sharing the same genre, so with better data there could be (and almost certainly would be) an improvement in accuracy.

In [None]:
predictions = list(all_logits[numEpochs-1])

positivePrediction = []
negativePrediction = []
a = 0
for movie in predictions:
    if movie[0] >= movie[1]:
        positivePrediction.append(numToMovie[a])
    else:
        negativePrediction.append(numToMovie[a])
    a+=1

totalPredictions = len(unclassifiedMovies)
totalRight = 0

for movie in unclassifiedMovies:
    if (movie["attributes"]["movieTitle"] in negativePrediction) and (movie["attributes"]["movieTitle"] in negativeRating):
        totalRight += 1
    if (movie["attributes"]["movieTitle"] in positivePrediction) and (movie["attributes"]["movieTitle"] in positiveRating):
        totalRight += 1
    
print("Number of movies whose rating is unknown to the GCN: "+str(len(unclassifiedMovies)))
print("Total number of correct classifications: "+str(totalRight))
print("Accuracy: "+str(totalRight/totalPredictions))
print("Some movies that the user might like (In no particular order): "+str(positivePrediction[:10]))
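
The same accuracy calculation can be written against node IDs rather than title lists, which avoids repeated list membership checks. This is a sketch on made-up numbers, not the notebook's results; `accuracy_from_logits` is a hypothetical helper, and class 0 is assumed to mean "liked" as in the labels above.

```python
def accuracy_from_logits(logits, true_labels):
    """Fraction of held-out nodes whose larger logit matches the true class.

    logits: dict of node_id -> (score_class_0, score_class_1)
    true_labels: dict of node_id -> 0 or 1, for held-out nodes only
    """
    if not true_labels:
        raise ValueError("no held-out nodes to score")
    correct = 0
    for node_id, label in true_labels.items():
        score0, score1 = logits[node_id]
        predicted = 0 if score0 >= score1 else 1  # ties go to class 0, as in the cell above
        correct += int(predicted == label)
    return correct / len(true_labels)

# Made-up final-epoch logits for four nodes; node 0 was a training label, so it is not scored
toy_logits = {0: (1.2, -0.3), 1: (-0.5, 0.9), 2: (0.1, 0.1), 3: (-1.0, 2.0)}
toy_truth = {1: 1, 2: 1, 3: 1}
toy_accuracy = accuracy_from_logits(toy_logits, toy_truth)
print(toy_accuracy)  # 2 of 3 correct: the tie on node 2 is assigned to class 0
```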