Questions:
- Set a seed ✓
- Read data ✓
- Check nulls ✓
- How many users? ✓
- How many categories? ✓
    - How many records does each one have? ✓
- How many posts? ✓
- How many interactions? ✓
- What is the average number of interactions per post? ✓
- Remove "insignificant" connections. ✓
- How representative was the reduction? ✓
- Create a reasonable visualization from the graph (e.g. Gephi)
- Create a mapping from all names to indices (e.g. LabelEncoder). How to get all names? ✓
- Create a mapping from all labels to indices. ✓
- Create a mapping from all nodes to a label index. ✓
- Create a toy model (e.g. the GCN example provided in the documentation). ✓
- Check if the data object was created correctly. 
- Define the embedding dimension.
- Create and save a [Node2Vec](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.models.Node2Vec) model. ✓
- Create and save a [GCN](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv) model. ✓
- Create and save a [GAT](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GATConv) model. ✓
- Create and save a [SAGE](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.SAGEConv) model.
- Create and save a [GIN](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GINConv) model.
- Project the embeddings with t-SNE and plot the classes in different colors.
- How consistent are the embeddings?
- Do they group well together?
- Which models produce the best embeddings?
- Which metric will be optimized by the learning models?
- Which model should be used to classify the nodes?
    - If a neural model:
        - Which learning rate? Is it adaptive? 
        - How many epochs? 
        - Which architecture?
        - Present a chart of training error vs. test error.
- Which categories reach the greatest performance? 
    - Why?
    - Is there any plausible reason, or some characteristic of a method or family of methods, that helps it perform better in our case? If so, what is it?
   
Resources:
- https://graphreason.github.io/papers/39.pdf (Must Read)
- https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8
- https://pytorch-geometric.readthedocs.io/
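
The name-to-index mapping asked for in the question list can be sketched with NumPy alone. This is a minimal illustration on made-up usernames; it is not the notebook's `utils.get_users_indices` or `utils.get_edge_index`, whose implementations are not shown here, and the toy `authors` and `commenters` arrays are assumptions.

```python
import numpy as np

# Toy interactions (hypothetical usernames): one entry per comment.
authors = np.array(["alice", "alice", "bob", "carol"])
commenters = np.array(["bob", "carol", "carol", "dave"])

# "All names" = every user that appears as an author or as a commenter.
# np.unique returns the distinct names in sorted order.
all_names = np.unique(np.concatenate([authors, commenters])).tolist()
name_to_index = {name: idx for idx, name in enumerate(all_names)}

# Edge list in a 2 x E layout (row 0: commenter, row 1: post author).
edge_index = np.array([
    [name_to_index[c] for c in commenters],
    [name_to_index[a] for a in authors],
])

print(name_to_index)
print(edge_index)
```

Converting the array with `torch.tensor(edge_index, dtype=torch.long)` would give the `[2, num_edges]` layout that PyTorch Geometric's `Data` expects.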

In [None]:
# How many comments did a user make? - OK
# In how many categories did the commenter post? - OK
# How many users does the commenter mention, on average? - OK
# How many tags does the commenter use, on average? - OK
# At what hour of the day did the user make the most comments?
# On which day of the week did the user make the most comments?
# What is the average message length? - OK

In [None]:
import torch
import torch.nn.functional as F
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import plotly.graph_objects as go
import os

from sklearn.model_selection import StratifiedKFold
from torch_geometric.data import Data
from torch.utils.data import DataLoader
from torch_geometric.nn.models import Node2Vec
from torch_geometric.nn import GCNConv, GATConv, SAGEConv, GINConv
from torch.optim import Adam
from torch.nn import NLLLoss
from collections import defaultdict
from dateutil import parser

import ast
import preprocessing
import utils

torch.manual_seed(0)
np.random.seed(0)

profiles = pd.read_csv("../data/profiles.csv", usecols=["profile_username", "profile_followed_by", "profile_follow", 
                                                        "medias_nb", "comments_nb", "comments_commenters_nb", 
                                                        "comments_self_nb", "category_1"])
comments = pd.read_csv("../data/comments.csv")

In [None]:
preprocessing.filter_by_relevance(comments, profiles, minimum_freq=75)

In [None]:
final_df = comments[["media_author", "commenter"]]\
                .groupby("commenter", as_index=False)\
                .agg("count")\
                .sort_values("commenter")
final_df.columns = ["commenter", "comments_nb"]
# Copy the filtered frame so that later column assignments do not write to a view
final_df = final_df[final_df.comments_nb >= 80].copy()

In [None]:
print("The number of commenters: {}".format(len(final_df)))
comments = comments[comments.commenter.isin(final_df.commenter.values)]
print("The current number of interactions: {}".format(len(comments)))

In [None]:
final_df

In [None]:
tmp = comments[["category_1", "commenter"]]\
            .groupby("commenter", as_index=False)\
            .agg({"category_1": lambda col: col.nunique()})\
            .sort_values("commenter")

to_categories_nb = {commenter: categories_nb for commenter, categories_nb in tmp.values}
final_df["categories_nb"] = final_df.commenter.apply(lambda commenter: to_categories_nb[commenter])

del tmp

In [None]:
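# Work on a copy to avoid pandas' SettingWithCopyWarning
tmp = comments[["commenter", "comment_tags"]].copy()
# comment_tags holds a stringified list; count its elements
tmp["comment_tags"] = tmp.comment_tags.apply(lambda x: len(ast.literal_eval(x)))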
tmp = tmp.groupby("commenter", as_index=False)\
            .agg("mean")\
            .sort_values("commenter")

to_avg_tags_nb = {commenter: avg_tags_nb for commenter, avg_tags_nb in tmp.values}
final_df["avg_tags_nb"] = final_df.commenter.apply(lambda commenter: to_avg_tags_nb[commenter])

del tmp

In [None]:
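# Work on a copy to avoid pandas' SettingWithCopyWarning
tmp = comments[["commenter", "comment_mentioned_usernames"]].copy()
# comment_mentioned_usernames holds a stringified list; count its elements
tmp["comment_mentioned_usernames"] = tmp.comment_mentioned_usernames.apply(lambda x: len(ast.literal_eval(x)))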
tmp = tmp.groupby("commenter", as_index=False)\
            .agg("mean")\
            .sort_values("commenter")

to_avg_cited_users = {commenter: avg_cited_users for commenter, avg_cited_users in tmp.values}
final_df["avg_cited_users"] = final_df.commenter.apply(lambda commenter: to_avg_cited_users[commenter])

del tmp

In [None]:
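# Work on a copy to avoid pandas' SettingWithCopyWarning
tmp = comments[["commenter", "comment_text"]].copy()
# Message length in characters (missing texts yield NaN instead of raising)
tmp["comment_text"] = tmp.comment_text.str.len()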
tmp = tmp.groupby("commenter", as_index=False)\
            .agg("mean")\
            .sort_values("commenter")

to_avg_msg_len = {commenter: avg_msg_len for commenter, avg_msg_len in tmp.values}
final_df["avg_msg_len"] = final_df.commenter.apply(lambda commenter: to_avg_msg_len[commenter])

del tmp
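
The four per-commenter features above are each built with a separate `groupby` and a lookup dictionary. As a hedged alternative, the sketch below computes them in one pass with pandas named aggregation. The `toy` frame and its values are made up for illustration; only the column names are taken from the notebook.

```python
import ast
import pandas as pd

# Toy stand-in for the comments table (values are made up).
toy = pd.DataFrame({
    "commenter": ["ana", "ana", "bia"],
    "category_1": ["food", "travel", "food"],
    "comment_tags": ["['a', 'b']", "[]", "['c']"],
    "comment_mentioned_usernames": ["['x']", "['x', 'y']", "[]"],
    "comment_text": ["hello", "hi", "hey!"],
})

# Derive the per-comment quantities first, then aggregate once.
per_comment = toy.assign(
    tags_nb=toy.comment_tags.apply(lambda s: len(ast.literal_eval(s))),
    cited_nb=toy.comment_mentioned_usernames.apply(lambda s: len(ast.literal_eval(s))),
    msg_len=toy.comment_text.str.len(),
)

features = per_comment.groupby("commenter", as_index=False).agg(
    comments_nb=("comment_text", "count"),
    categories_nb=("category_1", "nunique"),
    avg_tags_nb=("tags_nb", "mean"),
    avg_cited_users=("cited_nb", "mean"),
    avg_msg_len=("msg_len", "mean"),
)

print(features)
```

A single aggregation keeps all features aligned on the same `commenter` key, so no intermediate dictionaries or `del tmp` steps are needed.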

In [None]:
def update_dict(to_weekday, commenter, day, freq):
    if to_weekday[commenter][1] < freq:
        to_weekday[commenter] = (day, freq)

In [None]:
# Work on a copy to avoid pandas' SettingWithCopyWarning
tmp = comments[["commenter", "comment_created_time_str"]].copy()
tmp["day"] = tmp.comment_created_time_str.apply(lambda item: parser.parse(item).weekday())
# After counting, the columns are: commenter, day, number of comments
tmp = tmp.groupby(["commenter", "day"], as_index=False).agg("count")

to_weekday = {}
for commenter, day, freq in tmp.values:
    if commenter not in to_weekday:
        to_weekday[commenter] = (0, 0)
    
    update_dict(to_weekday, commenter, day, freq)
    
final_df["weekday"] = final_df.commenter.apply(lambda commenter: to_weekday[commenter][0])

del tmp

In [None]:
# Work on a copy to avoid pandas' SettingWithCopyWarning
tmp = comments[["commenter", "comment_created_time_str"]].copy()
tmp["hour"] = tmp.comment_created_time_str.apply(lambda item: parser.parse(item).hour)
# After counting, the columns are: commenter, hour, number of comments
tmp = tmp.groupby(["commenter", "hour"], as_index=False).agg("count")

to_hour = {}
for commenter, hour, freq in tmp.values:
    if commenter not in to_hour:
        to_hour[commenter] = (0, 0)
    
    update_dict(to_hour, commenter, hour, freq)
    
final_df["hour"] = final_df.commenter.apply(lambda commenter: to_hour[commenter][0])

del tmp
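
The two cells above find each commenter's busiest weekday and hour with an explicit loop. A minimal sketch of the same idea using `pd.to_datetime` and `Series.mode` follows. It swaps in pandas' own parser for `dateutil`, and the timestamps are invented for illustration. On ties, `mode` returns the smallest value first, which should match the loop's strict `<` comparison over sorted groups.

```python
import pandas as pd

# Toy timestamps (made up); weekday() uses Monday=0 ... Sunday=6.
toy = pd.DataFrame({
    "commenter": ["ana", "ana", "ana", "bia"],
    "comment_created_time_str": [
        "2019-03-04 10:15:00",  # Monday
        "2019-03-11 10:45:00",  # Monday
        "2019-03-06 22:00:00",  # Wednesday
        "2019-03-09 08:30:00",  # Saturday
    ],
})

timestamps = pd.to_datetime(toy.comment_created_time_str)
toy = toy.assign(day=timestamps.dt.weekday, hour=timestamps.dt.hour)


def most_frequent(values):
    # mode() returns the most frequent values in ascending order; take the first
    return values.mode().iloc[0]


peaks = toy.groupby("commenter").agg(
    weekday=("day", most_frequent),
    hour=("hour", most_frequent),
)

print(peaks)
```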

In [None]:
final_df["not_tracked_user"] = 1

In [None]:
final_df.head(5)

In [None]:
results = profiles.category_1.value_counts()
print(results)

figure = go.Figure(
    data=[go.Pie(labels=results.index.values, values=results.values)],
    layout_title_text="Percentage of Each Category"
)

figure.show()

In [None]:
known_users = profiles.profile_username.unique().tolist()
followers = comments.commenter.unique().tolist()
all_users = set(known_users + followers)
prev_number_of_users = len(all_users)
print("There are originally {} users".format(prev_number_of_users))

In [None]:
all_posts = len(comments.media_short_code.unique())
all_interactions = len(comments)
print("There are {} distinct posts and {} interactions. An average of {} interactions per post"\
          .format(all_posts, all_interactions, np.round(all_interactions/all_posts, 2)))

In [None]:
profiles = preprocessing.categorical_to_numerical(profiles, col="category_1")
comments = comments.drop_duplicates()
comments = preprocessing.filter_by_relevance(comments, profiles, minimum_freq=50)

In [None]:
known_users = profiles.profile_username.unique().tolist()
followers = comments.commenter.unique().tolist()
all_users = set(known_users + followers)
cur_number_of_users = len(all_users)
# Report the threshold actually passed to filter_by_relevance above (minimum_freq=50)
print("The new graph drawn from relevance threshold {} has {} users and {} interactions"\
          .format(50, cur_number_of_users, len(comments)))

In [None]:
print("The number of users was reduced by ~ {}%"\
          .format(np.round((1-cur_number_of_users/prev_number_of_users)*100, 2)))

In [None]:
names = profiles.profile_username.values
data = profiles[["profile_followed_by", "profile_follow", "medias_nb", 
                "comments_nb", "comments_commenters_nb", "comments_self_nb"]].values
name_to_record = {name: record for name, record in zip(names, data)}

input_dim, output_dim = data.shape[1], len(profiles.category_1.unique()) + 1
user_to_label = {user: category for user, category in profiles[["profile_username", "category_1"]].values}

In [None]:
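def update_histories(models_histories, new_histories):
    """Accumulate per-epoch histories across folds by element-wise summation."""
    for model, history in new_histories.items():
        history = np.asarray(history, dtype=float)
        # Test emptiness by length: the truth value of a multi-element array is ambiguous
        if len(models_histories.get(model, [])) == 0:
            models_histories[model] = history
        else:
            models_histories[model] = models_histories[model] + history
    
    return models_histories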
    

def calculate_statistics(models_metrics):
    return {model: {metric: (np.mean(values), np.std(values)) for metric, values in metrics.items()} 
                for model, metrics in models_metrics.items()}
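
`update_histories` keeps a running sum that is later divided by `K`. As a hedged alternative, the sketch below stores every fold's trace and reduces them with NumPy at the end, which also gives the per-epoch spread across folds. The loss values and the `fold_histories` name are hypothetical.

```python
import numpy as np

# Hypothetical per-fold training losses: 3 folds, 4 epochs each.
fold_histories = {
    "GCN": [
        [1.0, 0.8, 0.6, 0.5],
        [1.2, 0.9, 0.7, 0.6],
        [1.1, 0.7, 0.5, 0.4],
    ],
}

# Stack folds into a (n_folds, n_epochs) array and reduce over the fold axis.
mean_traces = {model: np.stack(runs).mean(axis=0) for model, runs in fold_histories.items()}
std_traces = {model: np.stack(runs).std(axis=0) for model, runs in fold_histories.items()}

print(mean_traces["GCN"])
print(std_traces["GCN"])
```

Keeping the individual traces costs a little memory, but it allows plotting an error band around the mean curve.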

In [None]:
K = 5
skf = StratifiedKFold(n_splits=K)

n_hidden_units = 64
print("Preprocessing done")
models_metrics = defaultdict(dict)
models_histories = defaultdict(list)
for train_idx, test_idx in skf.split(profiles.profile_username.values, profiles.category_1.values):
    train_authors, test_authors = utils.get_authors(profiles, all_users, train_idx, test_idx)

    print("Got authors")

    username_to_index = utils.get_users_indices(train_authors)
    print("Got indices")
    train_interactions = utils.get_interactions(comments[(comments.media_author.isin(train_authors))
                                                    & (comments.commenter.isin(train_authors))], username_to_index)
    print("Got interactions")
    x_train, y_train = utils.get_x(train_authors, name_to_record, input_dim=input_dim), utils.get_y(user_to_label, train_authors)
    print("Got x and y")
    assert len(x_train)==len(y_train), "Train Input and Output tensor do not have the same dimensions"


    edge_index = utils.get_edge_index(train_interactions)
    print("Got edge indices")
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    data = Data(x=x_train, y=y_train, edge_index=edge_index).to(device)
    print("Created Data")

    models = utils.get_models(data.num_nodes, input_dim, output_dim, n_hidden_units, device=device, lr=0.005)
    print("Created models")

    histories = utils.train(data, models, epochs=1)
    models_histories = update_histories(models_histories, histories)

    username_to_index = utils.get_users_indices(test_authors)
    test_interactions = utils.get_interactions(comments[(comments.media_author.isin(test_authors)) 
                                                    & (comments.commenter.isin(test_authors))], username_to_index)
    x_test, y_test = utils.get_x(test_authors, name_to_record, input_dim=input_dim), utils.get_y(user_to_label, test_authors)
    assert len(x_test)==len(y_test), "Test Input and Output tensor do not have the same dimensions"

    edge_index = utils.get_edge_index(test_interactions)
    data = Data(x=x_test, y=y_test, edge_index=edge_index).to(device)
    current_metrics = utils.test(data, models)
    utils.update_metrics_dict(models_metrics, current_metrics)
    
models_histories = {model: list(history/K) for model, history in models_histories.items()} # Get mean traces

In [None]:
models_metrics = calculate_statistics(models_metrics)

In [None]:
import matplotlib.pyplot as plt
import json

# Use a context manager so the file handle is closed after loading
with open("../pipeline/results/one_layer_models/models_histories_1_256.json") as f:
    models_histories = json.load(f)

In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = (15, 5)

fig, ax = plt.subplots(1, 3, sharey=True)
models = ["GCN", "GAT", "GraphSAGE"]
for i, (model, history) in enumerate(models_histories.items()):
    ax[i].plot(range(len(history)), history)
    ax[i].set_title(models[i])

plt.suptitle("History of Training Error (200 Epochs)")
plt.show()

In [None]:
import matplotlib.pyplot as plt
import json

# This file stores metrics, not histories, so name the variable accordingly
with open("../pipeline/results/one_layer_models/models_metrics_1_64.json") as f:
    models_metrics = json.load(f)
models_metrics