# Classifying paper topics

In this notebook we take a database of papers and classify them by topic. We compare two approaches: one that uses only the words in each paper, and another that uses graph neural networks (GNNs) to also exploit which papers cite which others when making the prediction.

## Data: Cora cites

In [1]:
import pandas as pd
import numpy as np

This dataset contains two files. The first is a table of papers that cite other papers, one citation per row.

In [4]:
citas = pd.read_csv('cora/cora.cites',sep="\t",
    header=None,
    names=["target", "source"])

citas

Unnamed: 0,target,source
0,35,1033
1,35,103482
2,35,103515
3,35,1050679
4,35,1103960
...,...,...
5424,853116,19621
5425,853116,853155
5426,853118,1140289
5427,853155,853118
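
Each row of this table is one directed edge of the citation graph. As a minimal sketch on made-up IDs (a toy frame, not the Cora data), and assuming that `target` holds the cited paper and `source` the paper citing it, the number of citations per paper can be tallied like this:

```python
import pandas as pd

# Toy citation table with hypothetical paper IDs (not taken from Cora).
# Assumption: `target` is the cited paper, `source` is the paper citing it.
toy_citations = pd.DataFrame(
    {"target": [10, 10, 20, 30], "source": [20, 30, 30, 40]}
)

# Number of times each paper is cited (in-degree of the node).
times_cited = toy_citations["target"].value_counts().sort_index()

# Number of references each paper makes (out-degree of the node).
references_made = toy_citations["source"].value_counts().sort_index()

print(times_cited.to_dict())
print(references_made.to_dict())
```

These per-node counts are the simplest form of the graph information that a word-only classifier never sees.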


The second file contains information on many words (terms) associated with each paper, plus a category (the paper's subject).

In [5]:
# Paper information: the words each paper mentions, together with its paper ID and general subject

column_names = ["paper_id"] + [f"word_{idx}" for idx in range(1433)] + ["subject"]
papers = pd.read_csv(
    'cora/cora.content', sep="\t", names=column_names,
)
papers

Unnamed: 0,paper_id,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,...,word_1424,word_1425,word_1426,word_1427,word_1428,word_1429,word_1430,word_1431,word_1432,subject
0,31336,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,Neural_Networks
1,1061127,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,Rule_Learning
2,1106406,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
3,13195,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Reinforcement_Learning
4,37879,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Probabilistic_Methods
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703,1128975,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2704,1128977,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2705,1128978,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Genetic_Algorithms
2706,117328,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Case_Based


In [6]:
print(papers.subject.value_counts())

subject
Neural_Networks           818
Probabilistic_Methods     426
Genetic_Algorithms        418
Theory                    351
Case_Based                298
Reinforcement_Learning    217
Rule_Learning             180
Name: count, dtype: int64
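
The counts above show that the classes are imbalanced, which matters when reading accuracy numbers later. A minimal sketch of the majority-class baseline, using only the counts printed above (the variable names are illustrative, not part of the notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Neural_Networks": 818,
    "Probabilistic_Methods": 426,
    "Genetic_Algorithms": 418,
    "Theory": 351,
    "Case_Based": 298,
    "Reinforcement_Learning": 217,
    "Rule_Learning": 180,
}

total_papers = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the most frequent subject.
majority_baseline = class_counts[majority_class] / total_papers

print(total_papers, majority_class, round(majority_baseline, 3))
```

Any model worth keeping should beat this baseline of roughly 30% accuracy.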


### 1. Preprocesamiento

We convert the subjects to integers and remap the paper_id values so that they are consecutive.

In [8]:
# Map each subject name and each paper ID to a consecutive integer index
class_values = sorted(papers["subject"].unique())
class_idx = {name: idx for idx, name in enumerate(class_values)}
paper_idx = {name: idx for idx, name in enumerate(sorted(papers["paper_id"].unique()))}

# Apply the same paper mapping to the node table and to both ends of every citation
papers["paper_id"] = papers["paper_id"].apply(lambda name: paper_idx[name])
citas["source"] = citas["source"].apply(lambda name: paper_idx[name])
citas["target"] = citas["target"].apply(lambda name: paper_idx[name])
papers["subject"] = papers["subject"].apply(lambda value: class_idx[value])
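
To make the re-indexing concrete, here is a minimal sketch of the same idea on a three-paper toy example. The IDs, labels and `toy_*` names are hypothetical, and `Series.map` is used in place of `apply` with a lambda:

```python
import pandas as pd

# Toy data with hypothetical IDs and labels (not taken from Cora).
toy_papers = pd.DataFrame(
    {"paper_id": [503, 17, 88], "subject": ["Theory", "Case_Based", "Theory"]}
)
toy_edges = pd.DataFrame({"target": [17, 17], "source": [503, 88]})

# Sorted unique values get consecutive integer codes starting at 0.
toy_paper_idx = {pid: i for i, pid in enumerate(sorted(toy_papers["paper_id"].unique()))}
toy_class_idx = {name: i for i, name in enumerate(sorted(toy_papers["subject"].unique()))}

# The same paper mapping must be applied to the node table and to both edge columns.
toy_papers["paper_id"] = toy_papers["paper_id"].map(toy_paper_idx)
toy_papers["subject"] = toy_papers["subject"].map(toy_class_idx)
toy_edges["source"] = toy_edges["source"].map(toy_paper_idx)
toy_edges["target"] = toy_edges["target"].map(toy_paper_idx)

print(toy_papers)
print(toy_edges)
```

After the mapping, a paper's index can be used directly as a row position in a feature matrix or as a node number in an edge list.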

In [9]:
citas

Unnamed: 0,target,source
0,0,21
1,0,905
2,0,906
3,0,1909
4,0,1940
...,...,...
5424,1873,328
5425,1873,1876
5426,1874,2586
5427,1876,1874


In [10]:
print(papers.subject.value_counts())

subject
2    818
3    426
1    418
6    351
0    298
4    217
5    180
Name: count, dtype: int64


### 2. Training and test sets
One last step before we start: split the dataset in two, trying to keep the subjects well balanced across both halves.

In [12]:
# First, randomly assign about 50% of the nodes of each subject to training
train_data, test_data = [], []

for nombres, datos_agrupados in papers.groupby("subject"):
    random_selection = np.random.rand(len(datos_agrupados.index)) <= 0.5
    train_data.append(datos_agrupados[random_selection])
    test_data.append(datos_agrupados[~random_selection])
train_data = pd.concat(train_data)
test_data = pd.concat(test_data)
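
The split above draws a fresh random mask for each subject, so the two halves are only approximately 50% each and change on every run. A minimal sketch of a reproducible variant follows. It swaps in a different technique (an exact half per subject via a seeded permutation), and the seed value, toy frame and `toy_*` names are illustrative assumptions rather than part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy labels: 40 papers in class 0 and 10 in class 1 (made-up data).
toy = pd.DataFrame({"paper_id": range(50), "subject": [0] * 40 + [1] * 10})

rng = np.random.default_rng(0)  # fixed seed so the split is repeatable
train_parts, test_parts = [], []

for _, group in toy.groupby("subject"):
    # Shuffle the row positions of this subject and send the first half to training.
    shuffled = rng.permutation(len(group))
    half = len(group) // 2
    train_parts.append(group.iloc[shuffled[:half]])
    test_parts.append(group.iloc[shuffled[half:]])

toy_train = pd.concat(train_parts)
toy_test = pd.concat(test_parts)

print(toy_train["subject"].value_counts().to_dict())
print(toy_test["subject"].value_counts().to_dict())
```

Because each subject is halved separately, both sets keep the class proportions of the full data.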

In [13]:
train_data

Unnamed: 0,paper_id,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,...,word_1424,word_1425,word_1426,word_1427,word_1428,word_1429,word_1430,word_1431,word_1432,subject
13,1233,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30,771,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32,888,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47,1237,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50,2660,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2653,406,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6
2675,593,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6
2685,900,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6
2694,1857,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,6


Our task is to predict the "subject" column (the paper's topic), for now based only on the vector of word occurrences. To that end, we split the training and test sets each into a feature matrix x and a vector y of subjects.

In [51]:
# A few quantities we will need later
# Keep the original column order; a set difference would not preserve the order of the word columns
feature_names = [col for col in papers.columns if col not in ("paper_id", "subject")]
num_features = len(feature_names)
num_classes = len(class_idx)

# Torch works well with NumPy arrays
x_train = train_data[feature_names].to_numpy()
x_test = test_data[feature_names].to_numpy()

y_train = train_data["subject"].to_numpy()
y_test = test_data["subject"].to_numpy()

In [52]:
x_train, y_train

(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]),
 array([0, 0, 0, ..., 6, 6, 6]))

## First classifier: a simple neural network

We stack a few multilayer perceptron (MLP) blocks and see how well they learn the subjects of the papers.

In [34]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader


### Data for training in batches

The first step is to create a Dataset and a DataLoader so that we can train in batches.

In [57]:
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        sample = torch.tensor(self.X[idx], dtype=torch.float32)
        label = torch.tensor(self.y[idx], dtype=torch.long)
        return sample, label

In [65]:
train_dataset = CustomDataset(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True)
test_dataset = CustomDataset(x_test, y_test)
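# Evaluation data: wrap the test set (not the training set), keep every sample, and skip shuffling
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, drop_last=False)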
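
To see what a loader yields and what `drop_last` does, here is a minimal sketch on random toy tensors. `TensorDataset` stands in for `CustomDataset`, and all shapes and names are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the word-occurrence matrix: 100 samples, 8 binary features.
features = torch.randint(0, 2, (100, 8)).float()
labels = torch.randint(0, 7, (100,))
toy_dataset = TensorDataset(features, labels)

# drop_last=True discards the final incomplete batch (100 = 3 * 32 + 4).
dropping_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True, drop_last=True)
keeping_loader = DataLoader(toy_dataset, batch_size=32, shuffle=False, drop_last=False)

batch_x, batch_y = next(iter(dropping_loader))
print(batch_x.shape, batch_y.shape)
print(len(dropping_loader), len(keeping_loader))
```

Dropping the last batch is convenient during training, since `BatchNorm1d` in training mode cannot handle a batch with a single sample. For evaluation it is usually preferable to keep every sample so that no test paper is skipped.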


### Defining the neural network

In [29]:
def create_MLP(hidden_layers, dropout_rate, name=None):
    """Stack BatchNorm -> Dropout -> Linear -> GELU for each (in, out) pair in hidden_layers.

    The `name` argument is currently unused.
    """
    model = nn.Sequential()

    for num_layer, (neurons_in, neurons_out) in enumerate(hidden_layers, start=1):
        model.add_module(f"batchnorm{num_layer}", nn.BatchNorm1d(neurons_in))
        model.add_module(f"dropout{num_layer}", nn.Dropout(dropout_rate))
        model.add_module(f"dense{num_layer}", nn.Linear(neurons_in, neurons_out))
        model.add_module(f"activacion{num_layer}", nn.GELU())

    return model


In [23]:
class BaseClassifier(nn.Module):
    def __init__(self, num_classes, input_features, hidden_layer_neurons=32, dropout_rate=0.5, normalize=True):
        super(BaseClassifier, self).__init__()

        self.preprocessor = create_MLP([[input_features,hidden_layer_neurons],[hidden_layer_neurons,hidden_layer_neurons]], dropout_rate)
        self.layer1 = create_MLP([[hidden_layer_neurons,hidden_layer_neurons],[hidden_layer_neurons,hidden_layer_neurons]], dropout_rate)
        self.layer2 = create_MLP([[hidden_layer_neurons,hidden_layer_neurons],[hidden_layer_neurons,hidden_layer_neurons]], dropout_rate)
        self.layer3 = create_MLP([[hidden_layer_neurons,hidden_layer_neurons],[hidden_layer_neurons,hidden_layer_neurons]], dropout_rate)
        self.classifier = nn.Linear(hidden_layer_neurons, num_classes)

    def forward(self, input_features):
        i = self.preprocessor(input_features)
        
        c1 = self.layer1(i)
        s1 = c1 + i  # Skip connection

        c2 = self.layer2(s1)
        s2 = c2 + s1  # Skip connection

        c3 = self.layer3(s2)
        s3 = c3 + s2  # Skip connection

        return self.classifier(s3)
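
The skip connections in `forward` only work because every block after the preprocessor maps 32 features to 32 features, so a block's output can be added to its input. A minimal sketch of one such residual step, with a hypothetical standalone `block` and a toy batch:

```python
import torch
import torch.nn as nn

hidden = 32  # same width as the default hidden_layer_neurons

# A block whose input and output widths match, so its output can be added to its input.
block = nn.Sequential(
    nn.BatchNorm1d(hidden),
    nn.Dropout(0.5),
    nn.Linear(hidden, hidden),
    nn.GELU(),
)

x = torch.randn(16, hidden)  # toy batch of 16 hidden vectors
block.eval()  # disable dropout and use running statistics for a deterministic check
with torch.no_grad():
    residual_out = block(x) + x  # skip connection: element-wise sum

print(residual_out.shape)
```

If a block changed the width, the sum would fail with a shape mismatch, which is why the preprocessor first projects the 1433 word features down to the hidden width.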



### Training

In [73]:
BASE = BaseClassifier(
    num_classes=num_classes,
    input_features = num_features
)

print(BASE)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(BASE.parameters(), lr=0.01)
num_epochs = 300

def calculate_accuracy(output, target):
    _, predicted = torch.max(output, 1)
    correct = torch.sum(predicted == target).item() 
    return correct / len(target)


BaseClassifier(
  (preprocessor): Sequential(
    (batchnorm1): BatchNorm1d(1433, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout1): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=1433, out_features=32, bias=True)
    (activacion1): GELU(approximate='none')
    (batchnorm2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout2): Dropout(p=0.5, inplace=False)
    (dense2): Linear(in_features=32, out_features=32, bias=True)
    (activacion2): GELU(approximate='none')
  )
  (layer1): Sequential(
    (batchnorm1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout1): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=32, out_features=32, bias=True)
    (activacion1): GELU(approximate='none')
    (batchnorm2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (dropout2): Dropout(p=0.5, inplace=False)
    (dense2): Linea
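
As a worked example of what `calculate_accuracy` computes, the sketch below picks the class with the largest logit in each row and compares it with the target. The logits and targets are made-up numbers:

```python
import torch

# Toy logits for 4 samples and 3 classes (made-up numbers).
logits = torch.tensor(
    [[2.0, 0.1, 0.3],
     [0.2, 1.5, 0.1],
     [0.1, 0.2, 3.0],
     [1.0, 0.9, 0.8]]
)
targets = torch.tensor([0, 1, 1, 0])

# The predicted class is the index of the largest logit in each row.
predicted = logits.argmax(dim=1)
accuracy = (predicted == targets).float().mean().item()

print(predicted.tolist(), accuracy)
```

Three of the four predictions match their targets, so the accuracy is 0.75.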

In [80]:


def train(model, criterion, optimizer, dataloader, num_epochs):
    
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        running_accuracy = 0.0
        
        for inputs, labels in dataloader:
            optimizer.zero_grad()  # Zero the gradients

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            # Compute loss and accuracy
            running_loss += loss.item()
            running_accuracy += calculate_accuracy(outputs, labels)

        epoch_loss = running_loss / len(dataloader)
        epoch_accuracy = running_accuracy / len(dataloader)

        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}')


In [81]:
train(BASE, criterion, optimizer, train_loader, num_epochs)


Epoch [1/300], Loss: 1.9116, Accuracy: 0.2703
Epoch [2/300], Loss: 1.7060, Accuracy: 0.3779
Epoch [3/300], Loss: 1.4533, Accuracy: 0.4767
Epoch [4/300], Loss: 1.1819, Accuracy: 0.5908
Epoch [5/300], Loss: 1.1055, Accuracy: 0.6076
Epoch [6/300], Loss: 0.9578, Accuracy: 0.6512
Epoch [7/300], Loss: 0.8918, Accuracy: 0.7006
Epoch [8/300], Loss: 0.8592, Accuracy: 0.6984
Epoch [9/300], Loss: 0.8026, Accuracy: 0.7376
Epoch [10/300], Loss: 0.7829, Accuracy: 0.7340
Epoch [11/300], Loss: 0.7865, Accuracy: 0.7406
Epoch [12/300], Loss: 0.7527, Accuracy: 0.7384
Epoch [13/300], Loss: 0.7056, Accuracy: 0.7689
Epoch [14/300], Loss: 0.6875, Accuracy: 0.7573
Epoch [15/300], Loss: 0.6875, Accuracy: 0.7703
Epoch [16/300], Loss: 0.6542, Accuracy: 0.7674
Epoch [17/300], Loss: 0.6820, Accuracy: 0.7776
Epoch [18/300], Loss: 0.6798, Accuracy: 0.7485
Epoch [19/300], Loss: 0.6421, Accuracy: 0.7849
Epoch [20/300], Loss: 0.6460, Accuracy: 0.7696
Epoch [21/300], Loss: 0.6422, Accuracy: 0.7849
Epoch [22/300], Loss: 

### Validation

In [82]:
def test(model, criterion, X, y):
    model.eval()  # Evaluation mode: dropout is disabled and BatchNorm uses running statistics

    with torch.no_grad():
        outputs = model(X)
        t_loss = criterion(outputs, y)
        t_accuracy = calculate_accuracy(outputs, y)

    print(f'Test Loss: {t_loss:.4f}, Test Accuracy: {t_accuracy:.4f}')
    return t_loss, t_accuracy

In [83]:
test(BASE, criterion, torch.tensor(x_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.long))

Test Loss: 2.8371, Test Accuracy: 0.6003


(tensor(2.8371), 0.6003005259203607)
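
The training log above reaches about 0.78 accuracy by epoch 21, while the test accuracy is 0.60, and a single number does not show which subjects are being confused. A minimal sketch of a confusion matrix and per-class recall in NumPy, on made-up labels and predictions (all names and values here are illustrative):

```python
import numpy as np

# Toy true labels and predictions for 3 classes (made-up values).
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
n_classes = 3

# confusion[i, j] counts samples of true class i predicted as class j.
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Per-class recall: correct predictions divided by samples of that class.
per_class_recall = confusion.diagonal() / confusion.sum(axis=1)

print(confusion)
print(per_class_recall)
```

With the real model, `y_pred` could be obtained from the argmax of the model outputs on `x_test`, and `y_true` would be `y_test`.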