# GENERAL INFORMATION

## Dataset description

**Project datasets**
To carry out this classification you have the following information about the projects:

| Field Name          | Description                                                     |
|---------------------|-----------------------------------------------------------------|
| projectID           | Unique ID associated with each project                          |
| startDate           | Date of the beginning of the project                             |
| endDate             | Date of the ending of the project                                |
| totalCost           | Total cost of the project declared by the applicants             |
| ecMaxContribution   | Maximum expenses that will be covered by the grant               |
| frameworkProgramme  | Multiannual EU programme associated with the project: FP7, H2020, or HE (the data previews below show the value `HORIZON` for the latter). This is a categorical variable, so we advise one-hot encoding it |
| Number of papers    | Number of published papers that acknowledge funding from the project |
| Number of patents   | Number of patents resulting from the project                     |
| TFIDF               | TF-IDF vectorization of the title and objective of the project   |
| title               | Title of the project                                             |
| Objective           | Summary of the project                                           |
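
As a minimal sketch of the one-hot encoding advised for `frameworkProgramme`, the snippet below applies `pandas.get_dummies` to a toy frame. The frame `toy` and the `fp` prefix are illustrative assumptions; the three programme values are the ones visible in the data previews of this notebook.

```python
import pandas as pd

# Toy frame (assumption): three projects, one per framework programme value.
toy = pd.DataFrame({
    "projectID": [305282, 836869, 101095619],
    "frameworkProgramme": ["FP7", "H2020", "HORIZON"],
})

# One binary column per programme; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(toy["frameworkProgramme"], prefix="fp", dtype=int)

# Replace the categorical column with its one-hot columns.
toy_encoded = pd.concat([toy.drop(columns="frameworkProgramme"), one_hot], axis=1)
print(toy_encoded)
```

The same call could be applied to the `frameworkProgramme` column of the real train and test frames, provided both are encoded with the same set of columns.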


**Available files**
* **categories.csv** - a file describing the different categories. The categories form a tree (hierarchy), so this file also describes the position of each category in that hierarchy.
* **train_set.csv** - the training set. Note that the columns category and label have the same target information but in different formats.
* **test_set.csv** - the test set, where only the input features characterizing each project are included.
* **data_text_train.pickle** - Title and objective associated with the projects in the training set.
* **data_text_test.pickle** - Title and objective associated with the projects in the test set.
* **publis_title_train.pickle** - Titles of publications associated with the projects in the training set.
* **publis_title_test.pickle** - Titles of publications associated with the projects in the test set.
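
The `category` and `label` columns of the training set appear to relate as an index list and a multi-hot vector: in the training preview further down, category `[1, 6]` goes with a label vector whose positions 1 and 6 are set to 1. A small sketch of that conversion follows, assuming 34 categories (the `num_labels` value used later in this notebook). Both helper names are hypothetical.

```python
def categories_to_multi_hot(category_ids, num_labels=34):
    """Turn a list of category indices into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for category_id in category_ids:
        vector[category_id] = 1
    return vector


def multi_hot_to_categories(vector):
    """Inverse operation: indices of the positions set to 1."""
    return [index for index, value in enumerate(vector) if value == 1]


label_vector = categories_to_multi_hot([1, 6])
print(label_vector[:8])
print(multi_hot_to_categories(label_vector))
```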

# Requirements

In [None]:
!pip install datasets



In [None]:
!pip install transformers[torch]


# Data_preprocessing

## Dataset Loading


In [None]:
from google.colab import drive
import pandas as pd
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt



# Mount Google Drive so the dataset files are reachable (Colab only)
drive.mount("/content/drive")

file_path = "Add your file path"
data_test_path= file_path+"/data_text_test.pickle"
data_train_path=file_path+"/data_text_train.pickle"

publis_test_path= file_path+"/publis_title_test.pickle"
publis_train_path=file_path+"/publis_title_train.pickle"

data_v2_test_path= file_path+"/archive" + "/data_v2_test.pickle"
data_v2_train_path=file_path+"/archive" + "/data_v2_train.pickle"

categories_path=file_path+"/archive" + "/categories.csv"

with open(data_test_path, 'rb') as file1:
    data_test = pickle.load(file1)
    df_test= pd.DataFrame(data=data_test)
    array_test=pd.DataFrame.to_numpy(df_test)
with open(data_train_path, 'rb') as file2:
    data_train = pickle.load(file2)
    df_train=pd.DataFrame(data=data_train)
    array_train=pd.DataFrame.to_numpy(df_train)
with open(publis_test_path, 'rb') as file3:
    data_publis_test = pickle.load(file3)
    df_publis_test= pd.DataFrame(data=data_publis_test)
    array_publis_test=pd.DataFrame.to_numpy(df_publis_test)
with open(publis_train_path, 'rb') as file4:  # training publications (previously opened the test file by mistake)
    data_publis_train = pickle.load(file4)
    df_publis_train= pd.DataFrame(data=data_publis_train)
    array_publis_train=pd.DataFrame.to_numpy(df_publis_train)
with open(data_v2_test_path, 'rb') as file5:
    data_v2_test = pickle.load(file5)
    df_v2_test= pd.DataFrame(data=data_v2_test)
    array_v2_test=pd.DataFrame.to_numpy(df_v2_test)
with open(data_v2_train_path, 'rb') as file6:
    data_v2_train = pickle.load(file6)
    df_v2_train= pd.DataFrame(data=data_v2_train)
    array_v2_train=pd.DataFrame.to_numpy(df_v2_train)

df_categories = pd.read_csv(categories_path)


df_train.drop(10595, inplace=True)
df_publis_train.drop(10595, inplace=True)
df_v2_train.drop(10595, inplace=True)

df_train.reset_index(drop=True, inplace=True)
array_train=pd.DataFrame.to_numpy(df_train)
df_publis_train.reset_index(drop=True, inplace=True)
array_publis_train=pd.DataFrame.to_numpy(df_publis_train)
df_v2_train.reset_index(drop=True, inplace=True)
array_v2_train=pd.DataFrame.to_numpy(df_v2_train)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Data visualization

In [None]:
df_test.head(3)

Unnamed: 0,projectID,title,objective
0,101095619,Efficient and rapidly SCAlable EU-wide evidenc...,Pandemics have the potential to disrupt our da...
1,836869,Unique approach to improving neurological func...,The aim of this project is to develop a busine...
2,301764,The origin and function of CD20 positive T-cel...,"""The expression of CD20 is generally assumed t..."


In [None]:
df_train.head()

Unnamed: 0,projectID,title,objective
0,305282,A Multi-Stage Malaria Vaccine,A highly effective malaria vaccine is a major ...
1,318997,NEUREN - Neuroscience Research Exchange Networ...,"""The NEUREN project is based on an interdiscip..."
2,101075873,Pulsed Laser Light and Nano-encapsulated Ocula...,Ocular diseases affect the quality of life of ...
3,957468,LUCERO - Smart Optofluidic Micromanipulation o...,The goal of Lucero is to create autonomous mic...
4,948561,Depression in diverse populations: Unravelling...,Depression affects 300 million people and repr...


In [None]:
df_publis_test.head(3)

Unnamed: 0,projectID,title,id,SSID
3,281359,MyLabStocks: a web-application to manage molec...,5C9898F483DFBBF2E013B667F1006BF999E5F4D6,12494575.0
5,281359,Genetic Modifiers of Chromatin Acetylation Ant...,C482F767409EB909B14080C17322C8A686DDC45E,6409412.0
6,281359,CPF-Associated Phosphatase Activity Opposes Co...,A2DB88BC1E1D83A907A15E703C3B1F114B052830,15882734.0


In [None]:
df_publis_train.head(3)

Unnamed: 0,projectID,title,id,SSID
0,281359,MyLabStocks: a web-application to manage molec...,5C9898F483DFBBF2E013B667F1006BF999E5F4D6,12494575.0
1,281359,Genetic Modifiers of Chromatin Acetylation Ant...,C482F767409EB909B14080C17322C8A686DDC45E,6409412.0
2,281359,CPF-Associated Phosphatase Activity Opposes Co...,A2DB88BC1E1D83A907A15E703C3B1F114B052830,15882734.0


In [None]:
df_v2_test.head(3)

Unnamed: 0,projectID,startDate,endDate,totalCost,ecMaxContribution,frameworkProgramme,num_papers,num_patents,TFIDF
2,101095619,2022-12-31 23:00:00,2026-12-30 23:00:00,2420930.0,2420929.75,HORIZON,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,836869,2018-12-31 23:00:00,2019-06-29 22:00:00,71429.0,50000.0,H2020,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,301764,NaT,NaT,104516.7,104516.7,FP7,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [None]:
df_v2_train.head(3)

Unnamed: 0,projectID,startDate,endDate,totalCost,ecMaxContribution,frameworkProgramme,num_papers,num_patents,category,label,TFIDF
0,305282,2012-09-30 22:00:00,2017-03-30 22:00:00,8055788.47,6000000.0,FP7,1,0,"[1, 6]","[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.04742882392942083, 0.0, 0.0, 0.0, 0.0,..."
1,318997,2013-08-31 22:00:00,2017-08-30 22:00:00,304200.0,304200.0,FP7,19,0,"[4, 7, 21]","[0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,101075873,2023-08-31 22:00:00,2028-08-30 22:00:00,1499351.0,1499351.0,HORIZON,0,0,"[6, 12, 20]","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [None]:
df_categories.head(df_categories.shape[0])

Unnamed: 0,Category,Name,Tree
0,0,nutrition,medical and health sciences/health sciences/nu...
1,1,infectious diseases,medical and health sciences/health sciences/in...
2,2,public health,medical and health sciences/health sciences/pu...
3,3,pathology,medical and health sciences/basic medicine/pat...
4,4,neurology,medical and health sciences/basic medicine/neu...
5,5,immunology,medical and health sciences/basic medicine/imm...
6,6,pharmacology and pharmacy,medical and health sciences/basic medicine/pha...
7,7,physiology,medical and health sciences/basic medicine/phy...
8,8,cells technologies,medical and health sciences/medical biotechnol...
9,9,genetic engineering,medical and health sciences/medical biotechnol...


In [None]:
# Find blank documents, i.e. rows whose objective (column 2) is NaN

import math

for idx, text in enumerate(array_train[:,2]):
    if isinstance(text, float) and math.isnan(text):
        print(f"Found NaN: {text}\n Id: {idx}")
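
The same check can be written without an explicit loop. The sketch below uses `Series.isna()` on a toy frame; `toy_train` and its contents are made up for illustration, and it is an assumption that the real frame exposes the text under an `objective` column as in the previews above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (assumption): the second objective is missing.
toy_train = pd.DataFrame({
    "projectID": [1, 2, 3],
    "title": ["A", "B", "C"],
    "objective": ["first objective", np.nan, "third objective"],
})

# Index labels of the rows whose objective is NaN.
missing_rows = toy_train.index[toy_train["objective"].isna()].tolist()
print(missing_rows)
```

On the real data, the resulting index labels could then be passed to `DataFrame.drop` instead of a hard-coded row number.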


## One-hot encoder creation for train-test labels for each of the categories

In [None]:
#Encode labels
labels = df_categories['Name'].tolist()
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

In [None]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")
# Splitting df_train into train and validation sets
train_df, validation_df = train_test_split(df_train, test_size=0.1, random_state=42)

# Concatenating "title" and "objective" columns as "title + objective"
train_df['text'] = train_df['title'] + " " + train_df['objective']
validation_df['text'] = validation_df['title'] + " " + validation_df['objective']
df_test['text'] = df_test['title'] + " " + df_test['objective']

# Drop the "title" and "objective" columns if necessary
train_df.drop(columns=['title', 'objective'], inplace=True)
validation_df.drop(columns=['title', 'objective'], inplace=True)
df_test.drop(columns=['title', 'objective'], inplace=True)

# Merging labels with train and validation sets based on projectID
train_df = pd.merge(train_df, df_v2_train[['projectID', 'label']], on='projectID')
validation_df = pd.merge(validation_df, df_v2_train[['projectID', 'label']], on='projectID')

# Encode labels as one-hot columns
labels = df_categories['Name'].tolist()
for i,label in enumerate(labels):
    train_df[label] = train_df['label'].apply(lambda x: x[i])
    validation_df[label] = validation_df['label'].apply(lambda x: x[i])

# Drop the original 'label' column
train_df.drop(columns=['label'], inplace=True)
validation_df.drop(columns=['label'], inplace=True)

# Create a DatasetDict
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df),
#    'test': Dataset.from_pandas(df_test),
    'val': Dataset.from_pandas(validation_df)
})

# Example usage
print(dataset_dict)


DatasetDict({
    train: Dataset({
        features: ['projectID', 'text', 'nutrition', 'infectious diseases', 'public health', 'pathology', 'neurology', 'immunology', 'pharmacology and pharmacy', 'physiology', 'cells technologies', 'genetic engineering', 'endocrinology', 'cardiology', 'surgery', 'oncology', 'psychiatry', 'optics', 'artificial intelligence', 'data science', 'software', 'cell biology', 'biochemistry', 'neurobiology', 'genetics', 'microbiology', 'zoology', 'inorganic chemistry', 'nano-materials', 'electronic engineering', 'economics', 'business and management', 'demography', 'implants', 'agriculture', 'personalized medicine'],
        num_rows: 10103
    })
    val: Dataset({
        features: ['projectID', 'text', 'nutrition', 'infectious diseases', 'public health', 'pathology', 'neurology', 'immunology', 'pharmacology and pharmacy', 'physiology', 'cells technologies', 'genetic engineering', 'endocrinology', 'cardiology', 'surgery', 'oncology', 'psychiatry', 'optics', '
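
One detail of the `title + " " + objective` concatenation used above is worth keeping in mind: in pandas, adding a string to a missing value yields a missing value, so a row with a NaN title or objective ends up with a NaN `text`. A minimal sketch of the effect and of one possible guard (filling with empty strings) follows; the toy frame is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): the second row has no objective.
toy = pd.DataFrame({
    "title": ["Title A", "Title B"],
    "objective": ["Objective A", np.nan],
})

# Naive concatenation: the missing objective makes the whole text missing.
naive_text = toy["title"] + " " + toy["objective"]

# Guarded concatenation: fill gaps first, then trim the stray separator.
safe_text = (toy["title"].fillna("") + " " + toy["objective"].fillna("")).str.strip()

print(naive_text.isna().tolist())
print(safe_text.tolist())
```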

## Batch Text Encoding and Label Preparation

In [None]:
from transformers import DebertaV2Tokenizer
import numpy as np

tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

def preprocess_data(examples):
  # take a batch of texts
  text = examples["text"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=512)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding

tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.45M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/633 [00:00<?, ?B/s]
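
To make the label handling in `preprocess_data` easier to follow, here is a small NumPy-only sketch of the same idea: a batch arrives as one list per column, and the per-category columns are stacked into a `(batch_size, num_labels)` matrix. The toy batch and the reduced list of three category names are assumptions for illustration.

```python
import numpy as np

# Only the first three category names, to keep the example short.
label_names = ["nutrition", "infectious diseases", "public health"]

# A batched example as the map function receives it: one list per column.
batch = {
    "text": ["doc one", "doc two"],
    "nutrition": [0, 1],
    "infectious diseases": [1, 0],
    "public health": [0, 1],
}

# Rows are documents, columns are categories.
labels_matrix = np.zeros((len(batch["text"]), len(label_names)))
for column, name in enumerate(label_names):
    labels_matrix[:, column] = batch[name]

print(labels_matrix.tolist())
```

The matrix holds floats rather than integers, which matches what a binary cross-entropy loss expects as targets.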

In [None]:
encoded_dataset = dataset_dict.map(preprocess_data, batched=True, remove_columns=dataset_dict['train'].column_names)

Map:   0%|          | 0/10103 [00:00<?, ? examples/s]

Map:   0%|          | 0/1123 [00:00<?, ? examples/s]

In [None]:
encoded_dataset.set_format("torch")

## Pretrained LLM Bert Model Initialization

Among the available models, DeBERTaV2 was the best-performing one in this case.

In [None]:
from transformers import DebertaV2ForSequenceClassification
import torch
#pretrained_path="C:/Users/Usuario/Desktop/pretrained_model"
pretrained_path="microsoft/deberta-v2-xlarge"
model = DebertaV2ForSequenceClassification.from_pretrained(pretrained_path,
                                                           problem_type="multi_label_classification",
                                                           num_labels=34,
                                                           id2label=id2label,
                                                           label2id=label2id)

# Move model to appropriate device (GPU if available)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

pytorch_model.bin:   0%|          | 0.00/1.78G [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v2-xlarge and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1536, padding_idx=0)
      (LayerNorm): LayerNorm((1536,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1536, out_features=1536, bias=True)
              (key_proj): Linear(in_features=1536, out_features=1536, bias=True)
              (value_proj): Linear(in_features=1536, out_features=1536, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1536, out_features=1536, bias=True)
              (LayerNorm): LayerNorm((1536,), eps=1e-07, element
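
With `problem_type="multi_label_classification"`, the sequence-classification heads in `transformers` use, to our knowledge, a binary cross-entropy loss on the logits (`BCEWithLogitsLoss`), treating each of the 34 categories as an independent yes/no decision. The NumPy sketch below reproduces that loss on toy values; the function name `bce_with_logits` and the inputs are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_element = (
        np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_element.mean())


targets = [[1, 0, 1]]
loss_uninformed = bce_with_logits([[0.0, 0.0, 0.0]], targets)   # every probability is 0.5
loss_confident = bce_with_logits([[8.0, -8.0, 8.0]], targets)   # confident and correct
print(loss_uninformed, loss_confident)
```

A freshly initialized classification head should give a loss near the uninformed value, which is a quick sanity check before training.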

# Model_training

First we define the arguments for the trainer (Training Hyperparameters)

In [None]:
batch_size = 4
metric_name = "roc_auc"

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    pretrained_path,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)
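
As a rough size estimate for this configuration, the number of optimizer steps can be derived from the train split size reported earlier (10103 rows). The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept, which are the library defaults as far as we know.

```python
import math

num_train_examples = 10103   # rows in the train split printed earlier
per_device_batch_size = 4    # batch_size defined above
num_epochs = 3               # num_train_epochs in TrainingArguments

# Assumption: one device, gradient_accumulation_steps = 1, last partial batch kept.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```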

We define the metric functions used to measure the performance of the classification model

In [None]:
from sklearn.metrics import roc_auc_score
from transformers import EvalPrediction
import torch

def multi_label_metrics(predictions, labels, threshold=0.5):
    # we apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # we use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # we compute metrics
    y_true = labels
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    # we return them as dictionary
    metrics = {'roc_auc': roc_auc}

    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result
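
Note that `multi_label_metrics` thresholds the probabilities before computing ROC AUC. Since ROC AUC is a ranking metric, it is commonly computed on the continuous scores instead, and thresholding can lower the value by introducing ties. The sketch below illustrates the difference on toy data with a hand-written micro-averaged AUC (all sample/label pairs flattened, as the `micro` average does). The helper `micro_roc_auc` and the toy arrays are assumptions for illustration, not the notebook's metric.

```python
import numpy as np


def micro_roc_auc(y_true, y_score):
    """Micro-averaged ROC AUC via pairwise comparison.

    Flattens all (sample, label) entries, then counts how often a positive
    entry is scored above a negative one; ties count as one half.
    """
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    positives = y_score[y_true == 1]
    negatives = y_score[y_true == 0]
    diff = positives[:, None] - negatives[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
probs = np.array([
    [0.90, 0.20, 0.10],
    [0.40, 0.45, 0.30],
    [0.20, 0.10, 0.60],
    [0.70, 0.48, 0.05],
])
hard = (probs >= 0.5).astype(int)   # same thresholding as in multi_label_metrics

auc_soft = micro_roc_auc(y_true, probs)
auc_hard = micro_roc_auc(y_true, hard)
print(auc_soft, auc_hard)
```

In this toy case every positive is ranked above every negative, yet two positives fall below the 0.5 threshold, so the thresholded score is lower than the score on probabilities.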

In [None]:
encoded_dataset['train']['input_ids'][0]

tensor([     1,   1154,    324,   2867,     18,   1891,    360,     91,     34,
          7608,   2236,      8,   3290,   2608,   2867,   6448,   4628,   2246,
             7,    364,     18,  63819,    540,  53673,      5,   3200,     34,
            41,   2247,   2520,    119,     11,    764,   2867,    119,      8,
            10,   7444,   2520,    365,   2229,      4,   1274,   2867,  70661,
            30,   7315,   2120,    510,      7,   2655,    960,      6,      5,
           366,     14,    116,    360,     13,    188,    117,    271,      4,
            69,     13,    588,    521,      8,      5,    437,     15,   1304,
           229,   2867,  70661,     91,     11,    764,     51,  12396,   6235,
            91,     71,     40,    683,   2655,   6719, 108530,     10,    682,
         64489,  11359,      7,   1650,    772,    308,    162,    360,      4,
           231,   2426,     10,   1514,    366,    490,      6,      5,    446,
             9,   2867,  70661,     19, 

In [None]:
# Forward pass on a single training example as a sanity check
example = encoded_dataset['train'][0]

# Move input data to the appropriate device
input_ids = example['input_ids'].unsqueeze(0).to(device)
attention_mask = example['attention_mask'].unsqueeze(0).to(device)
# Use a separate name so the global `labels` list used by preprocess_data is not overwritten
example_labels = example['labels'].unsqueeze(0).to(device)

# Forward pass (the attention mask tells the model which positions are padding)
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=example_labels)


We define the trainer with the model, the arguments, and the datasets

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

After trainer definition, we train the model and analyse its performance over the validation dataset

In [None]:
trainer.train()

In [None]:
#model_saver_path="C:/Users/Usuario/Desktop"+"/pretrained_model_DeBERTA"
model_saver_path="/pretrained_model_DeBERTA"
trainer.save_model(model_saver_path)

In [None]:
trainer.evaluate()

In [None]:
df_test.head()

In [None]:
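# Example input: the first test document (any string would do here)
text = df_test['text'].iloc[0]
encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)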
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)

# Model_inference

In this code cell, we perform model inference on the test dataset and save the model's soft output predictions, together with the corresponding ID of each text, to a CSV file

In [None]:
import csv
import numpy as np

# Define the projectIDs and texts
projectIDs = df_test['projectID'].tolist()
texts = df_test['text'].tolist()

# Define batch size
batch_size = 8

# Get the total number of samples
total_samples = len(projectIDs)

# Initialize lists to store results
all_probs = []

# Process data in batches
for i in range(0, total_samples, batch_size):
    # Get batch inputs
    batch_texts = texts[i:i+batch_size]
    # Truncate to the same maximum length that was used for training
    batch_encodings = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    batch_encodings = {k: v.to(trainer.model.device) for k, v in batch_encodings.items()}

    # Get model predictions for the batch
    with torch.no_grad():
        outputs = trainer.model(**batch_encodings)

    logits = outputs.logits
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(logits.cpu())

    # Append batch probabilities to the list
    all_probs.append(probs.numpy())

# Concatenate batch probabilities into a single array
all_probs = np.concatenate(all_probs, axis=0)

# Write the probabilities to a CSV file
category_names = [f'cat_{i}' for i in range(all_probs.shape[1])]

with open('soft_predictions_DeBERTA.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)

    # Write the header row with projectID and category names
    header = ['projectID'] + category_names
    writer.writerow(header)

    # Write the soft predictions for each observation
    for i in range(len(projectIDs)):
        row = [projectIDs[i]] + [all_probs[i, j] for j in range(all_probs.shape[1])]
        writer.writerow(row)

print("CSV file 'soft_predictions_DeBERTA.csv' has been created.")
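
If hard labels are needed afterwards, the soft predictions can be read back and thresholded. The sketch below does this on an in-memory toy CSV with two categories instead of 34; the probability values are invented for illustration, and the 0.5 threshold simply mirrors the default used in `multi_label_metrics`.

```python
import csv
import io

# Toy stand-in for the soft-predictions CSV (assumption: invented probabilities).
toy_csv = io.StringIO(
    "projectID,cat_0,cat_1\n"
    "101095619,0.91,0.12\n"
    "836869,0.30,0.55\n"
)

threshold = 0.5
hard_predictions = {}
for row in csv.DictReader(toy_csv):
    project_id = int(row.pop("projectID"))
    # Keep the categories whose probability reaches the threshold.
    hard_predictions[project_id] = [
        name for name, prob in row.items() if float(prob) >= threshold
    ]

print(hard_predictions)
```

For the real file, `io.StringIO(...)` would be replaced by `open('soft_predictions_DeBERTA.csv', newline='')`.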
