Questions:
- Read data ✓
- Check nulls ✓
- How many users? ✓
- How many categories? ✓
    - How many records does each one have? ✓
- How many posts? ✓
- How many interactions? ✓
- What is the average number of interactions per post? ✓
- Remove "insignificant" connections. ✓
- How representative was the reduction? ✓
- Create a reasonable visualization from the graph (e.g. Gephi)
- Create a mapping from all names to indices (e.g. LabelEncoder). How to get all names? ✓
- Create a mapping from all labels to indices. ✓
- Create a mapping from all nodes to a label index. ✓
- Create a toy model (e.g. the GCN example provided in the documentation).
- Check if the data object was created correctly.
- Define the embedding dimension.
- Create and save a Node2Vec model.
- Create and save a GCN model.
- Create and save a GAT model.
- Create and save a GraphSAGE model.
- Create and save a GIN model.
- Use a t-SNE and plot the classes with different colors.
- How consistent are the embeddings? 
- Do they group well together?
- Which models produce the best embeddings?
- Which metric will be optimized by the learning models?
- Which model should be used to classify the nodes?
    - If a neural model:
        - Which learning rate? Is it adaptive? 
        - How many epochs? 
        - Which architecture?
        - Present a training error vs. test error chart.
- Which categories reach the greatest performance? 
    - Why?
    - Is there a plausible reason, or a characteristic of a method or family of methods, that helps it perform better in our case? If so, what is it?
   
Resources:
- https://graphreason.github.io/papers/39.pdf (Must Read)
- https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8
- https://pytorch-geometric.readthedocs.io/

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import missingno as msno
import ast
import torch
import numpy as np
import plotly.graph_objects as go


from sklearn.preprocessing import LabelEncoder
from torch_geometric.data import Data
from collections import defaultdict

In [None]:
torch.manual_seed(0)
np.random.seed(0)

In [None]:
profiles = pd.read_csv("../data/profiles.csv", usecols=["profile_username", "category_1"]).drop_duplicates()
profiles.tail(1)

In [None]:
profiles.isna().sum()

In [None]:
comments = pd.read_csv("../data/comments.csv", usecols=["media_short_code", "media_author", "commenter"])
comments = comments.drop_duplicates()
comments.tail(1)

In [None]:
comments.isna().sum()

In [None]:
results = profiles.category_1.value_counts()

figure = go.Figure(
    data=[go.Pie(labels=results.index.values, values=results.values)],
    layout_title_text="Percentage of Each Category"
)

figure.show()

In [None]:
category_to_index = {category: index for index, category in enumerate(results.index)}
profiles.category_1 = profiles.category_1.map(lambda key: category_to_index[key])
profiles.head(3)

In [None]:
known_users = profiles.profile_username.unique().tolist()
followers = comments.commenter.unique().tolist()
prev_number_of_users = len(set(known_users + followers))

print("There are originally {} users".format(prev_number_of_users))

In [None]:
all_posts = len(comments.media_short_code.unique())
all_interactions = len(comments)
print("There are {} distinct posts and {} interactions. An average of {} interactions per post"\
          .format(all_posts, all_interactions, np.round(all_interactions/all_posts, 2)))

In [None]:
RELEVANCE_THRESHOLD = 5
# Keep only commenters with more than RELEVANCE_THRESHOLD comments.
comment_counts = comments.commenter.value_counts()
subgraph = comment_counts[comment_counts > RELEVANCE_THRESHOLD].index

comments = comments[comments.commenter.isin(subgraph)]

In [None]:
known_users = profiles.profile_username.unique().tolist()
followers = comments.commenter.unique().tolist()
all_users = set(known_users + followers)
cur_number_of_users = len(all_users)

print("The new graph drawn from relevance threshold {} has {} users and {} interactions"\
          .format(RELEVANCE_THRESHOLD, cur_number_of_users, len(comments)))

In [None]:
# Scale to a percentage before rounding to avoid float artifacts such as 56.99999999999999.
print("The number of users was reduced by ~ {}%"\
          .format(np.round((1 - cur_number_of_users / prev_number_of_users) * 100, 2)))
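
To judge how representative the reduction was, it may help to look at the share of interactions that survive the filter as well as the share of users. The following is a minimal sketch on hypothetical toy data. The helper `reduction_summary` is not part of the notebook, and it counts commenters only, not the profile owners included in `known_users`.

```python
from collections import Counter


def reduction_summary(commenters, threshold):
    """Share of commenters and of interactions kept by a strict '>' threshold."""
    counts = Counter(commenters)
    kept = {user: n for user, n in counts.items() if n > threshold}
    return {
        "users_kept": len(kept) / len(counts),
        "interactions_kept": sum(kept.values()) / sum(counts.values()),
    }


# Hypothetical toy data: "a" comments 6 times, "b" twice, "c" once.
toy_commenters = ["a"] * 6 + ["b"] * 2 + ["c"]
summary = reduction_summary(toy_commenters, threshold=5)
print(summary)
```

On the real data, the same function could be called with `comments.commenter.tolist()` taken before the filter is applied.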

In [None]:
username_to_index = {name: index for index, name in enumerate(all_users)}

In [None]:
user_to_label = {user: category for user, category in profiles[["profile_username", "category_1"]].values}

In [None]:
interactions = [(username_to_index[author], username_to_index[commenter]) 
                    for author, commenter in comments[['media_author', 'commenter']].drop_duplicates().values]

print("The final graph has {} interactions".format(len(interactions)))

In [None]:
graph = nx.Graph()
graph.add_edges_from(interactions)
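
For the Gephi visualization listed in the questions, one option is to export the graph to GEXF, a format Gephi can open. The sketch below uses a hypothetical three-node toy graph and a temporary file. The names and the `label` attribute are assumptions for illustration; in the notebook the labels could come from inverting `username_to_index`.

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy graph standing in for the interaction graph.
toy_graph = nx.Graph()
toy_graph.add_edges_from([(0, 1), (1, 2)])

# Attach a readable label to each integer node id.
toy_names = {0: "alice", 1: "bob", 2: "carol"}
nx.set_node_attributes(toy_graph, toy_names, "label")

# Write the graph to a GEXF file, then read it back as a quick check.
output_path = os.path.join(tempfile.mkdtemp(), "toy_graph.gexf")
nx.write_gexf(toy_graph, output_path)
reloaded = nx.read_gexf(output_path)
print(reloaded.number_of_nodes(), reloaded.number_of_edges())
```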

In [None]:
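num_nodes = len(username_to_index)
# One scalar feature per node (its own index), ordered like y; this also covers isolated users.
x = torch.arange(num_nodes, dtype=torch.float).view(-1, 1)
# Class labels must be integer (long) tensors; 4 is the fallback label for users without a profile.
y = torch.tensor([user_to_label.get(user, 4) for user in all_users], dtype=torch.long)
# edge_index must be a long tensor of shape [2, E]; store both directions of each undirected edge.
edges = torch.tensor(list(graph.edges), dtype=torch.long).t().contiguous()
edge_index = torch.cat([edges, edges.flip(0)], dim=1)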

In [None]:
data = Data(x=x, edge_index=edge_index, y=y)
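
To check whether the data object was created correctly, a few structural checks on the underlying arrays already catch the most common mistakes: wrong `edge_index` shape or dtype, node indices out of range, and a label vector whose length differs from the node count. The following is a minimal sketch using NumPy only. The helper `check_graph_arrays` is hypothetical and not part of PyTorch Geometric.

```python
import numpy as np


def check_graph_arrays(num_nodes, edge_index, labels):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    edge_index = np.asarray(edge_index)
    labels = np.asarray(labels)
    if edge_index.ndim != 2 or edge_index.shape[0] != 2:
        problems.append("edge_index must have shape [2, E]")
    elif edge_index.size and (edge_index.min() < 0 or edge_index.max() >= num_nodes):
        problems.append("edge_index refers to a node outside 0..num_nodes-1")
    if not np.issubdtype(edge_index.dtype, np.integer):
        problems.append("edge_index must hold integers")
    if labels.shape[0] != num_nodes:
        problems.append("labels must have one entry per node")
    return problems


# Toy inputs: the first call is consistent, the second breaks three checks.
good = check_graph_arrays(3, [[0, 1], [1, 2]], [0, 1, 4])
bad = check_graph_arrays(2, [[0.0, 5.0], [1.0, 0.0]], [0])
print(good)
print(bad)
```

With the notebook's tensors, the inputs could be passed as `edge_index.numpy()` and `y.numpy()`.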

In [None]:
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import remove_self_loops, add_self_loops


class SAGEConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(SAGEConv, self).__init__(aggr='max') #  "Max" aggregation.
        self.lin = torch.nn.Linear(in_channels, out_channels)
        self.act = torch.nn.ReLU()
        self.update_lin = torch.nn.Linear(in_channels + out_channels, in_channels, bias=False)
        self.update_act = torch.nn.ReLU()
        
    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]

        # Make sure every node has exactly one self-loop, so it also receives its own message.
        edge_index, _ = remove_self_loops(edge_index)
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j):
        # x_j has shape [E, in_channels]

        x_j = self.lin(x_j)
        x_j = self.act(x_j)
        
        return x_j

    def update(self, aggr_out, x):
        # aggr_out has shape [N, out_channels]
        # The result has shape [N, in_channels], because update_lin projects back to in_channels.

        new_embedding = torch.cat([aggr_out, x], dim=1)

        new_embedding = self.update_lin(new_embedding)
        new_embedding = self.update_act(new_embedding)

        return new_embedding
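
To make the three steps of the layer above concrete (message, max aggregation, update), here is a minimal NumPy sketch of the same computation on a hypothetical three-node graph. The function `sage_max_layer` and the hand-picked weights are assumptions for illustration only. It assumes `edge_index` rows are (source, target), which is the default flow direction of `MessagePassing`.

```python
import numpy as np


def relu(values):
    return np.maximum(values, 0.0)


def sage_max_layer(x, edge_index, w_msg, b_msg, w_upd):
    """One max-aggregation layer: message, aggregate, update.

    x: [N, in], edge_index: [2, E] with (source, target) rows,
    w_msg: [in, out], b_msg: [out], w_upd: [out + in, in].
    """
    num_nodes = x.shape[0]
    src, dst = edge_index
    # Drop existing self-loops, then add exactly one per node.
    keep = src != dst
    loops = np.arange(num_nodes)
    src = np.concatenate([src[keep], loops])
    dst = np.concatenate([dst[keep], loops])
    # Message: transform the features of each source node.
    messages = relu(x[src] @ w_msg + b_msg)
    # Aggregate: element-wise max over the messages arriving at each target.
    aggregated = np.full((num_nodes, w_msg.shape[1]), -np.inf)
    np.maximum.at(aggregated, dst, messages)
    # Update: concatenate with the node's own features and project (no bias).
    return relu(np.concatenate([aggregated, x], axis=1) @ w_upd)


# Toy graph: nodes 1 and 2 both send a message to node 0.
toy_x = np.array([[1.0], [2.0], [3.0]])
toy_edges = np.array([[1, 2], [0, 0]])
toy_out = sage_max_layer(
    toy_x, toy_edges,
    w_msg=np.array([[1.0]]), b_msg=np.zeros(1), w_upd=np.array([[1.0], [1.0]]),
)
print(toy_out)  # [[4.], [4.], [6.]]
```

Node 0 receives messages 2 and 3 from its neighbours plus 1 from its own self-loop, so its aggregated value is 3, and the update adds its own feature 1 to give 4.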

In [None]:
model = SAGEConv(data.x.shape[1], 1)

In [None]:
# Call the module itself, not .forward(), so that PyTorch hooks run.
model(data.x, data.edge_index)