# 6. Submission

This notebook prepares the submission for the Kaggle challenge.

The data is first preprocessed by removing stop words, URLs, and punctuation.

In [7]:
import pandas as pd
import numpy as np

df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test_nolabel.csv')

In [8]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [9]:
def remove_stop_words(text):
    """Drop stop words, punctuation, and URL-like tokens from a text."""
    doc = nlp(text)
    # Keep only the remaining tokens, joined by single spaces.
    kept = [token.text for token in doc
            if not token.is_stop and not token.is_punct and not token.like_url]
    return ' '.join(kept)

df_train['text_cleaned'] = df_train['text'].apply(remove_stop_words)
df_test['text_cleaned'] = df_test['text'].apply(remove_stop_words)


In [10]:
x_train = df_train['text_cleaned']
y_train = df_train['label']

x_test = df_test['text_cleaned']
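
The notebook trains on every labeled row and later reuses the training set for evaluation. In case a separate validation set is wanted, the following is a minimal, dependency-free sketch of a stratified hold-out split. The helper name `stratified_split`, the validation fraction, and the seed are assumptions for illustration and are not part of the original notebook; `sklearn.model_selection.train_test_split` with `stratify=` offers similar behavior.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split parallel lists into train/validation parts, keeping label proportions.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = random.Random(seed)
    # Group row indices by label so each class is split separately.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    # Restore the original row order inside each part.
    train_idx.sort()
    val_idx.sort()

    def pick(idxs, seq):
        return [seq[i] for i in idxs]

    return (pick(train_idx, texts), pick(train_idx, labels),
            pick(val_idx, texts), pick(val_idx, labels))


# Toy data standing in for x_train.tolist() and y_train.tolist().
toy_texts = [f"text {i}" for i in range(10)]
toy_labels = [0] * 6 + [1] * 4
tr_x, tr_y, va_x, va_y = stratified_split(toy_texts, toy_labels, val_fraction=0.5)
```

The validation part could then be tokenized like the training data and passed as `eval_dataset`, so that the reported validation loss reflects rows the model has not been trained on.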

BERT is used to classify offensive language.

In [11]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
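class Dataset(torch.utils.data.Dataset):
    """Wrap tokenizer output (a dict of equal-length lists) for the Trainer."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Build one example: every field (input_ids, attention_mask, labels, ...) at position idx.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])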

In [13]:
train_encodings = tokenizer(x_train.tolist(), truncation=True, padding='max_length', max_length=128)
train_encodings['labels'] = y_train.tolist()
train_DS = Dataset(train_encodings)

# Encodings for the unlabeled test set (no 'labels' entry is added).
test_encodings = tokenizer(x_test.tolist(), truncation=True, padding='max_length', max_length=128)
test_DS = Dataset(test_encodings)
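
To make the effect of `truncation=True`, `padding='max_length'` and `max_length=128` concrete, here is a simplified sketch of what happens to each sequence of token ids. It is an illustration only, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids are illustrative, and the padding id of 0 is assumed to match the `[PAD]` token of `bert-base-uncased`.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Simplified illustration: the real tokenizer also keeps the [CLS]/[SEP]
    special tokens in place when truncating, which is not modeled here.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut to max_length.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
```

Because every example ends up with the same length, the `Dataset` class can turn each field into a tensor without a custom collate function, at the cost of computing over padding for short texts.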

In [14]:
args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # newer transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_DS,
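    # NOTE: evaluation runs on the training data, so the reported
    # "Validation Loss" is not a held-out estimate.
    eval_dataset=train_DS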
)

trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.4992        | 0.392376        |


TrainOutput(global_step=1019, training_loss=0.5317784469424801, metrics={'train_runtime': 2492.7221, 'train_samples_per_second': 3.269, 'train_steps_per_second': 0.409, 'total_flos': 535957219768320.0, 'train_loss': 0.5317784469424801, 'epoch': 1.0})
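
The `global_step=1019` in this output can be related to the batch size. The sketch below assumes a single device and no gradient accumulation (neither is stated in the notebook), in which case one epoch takes `ceil(N / 8)` optimizer steps for `N` training rows. The helper `steps_per_epoch` is a hypothetical name used only for this arithmetic.

```python
import math

batch_size = 8        # per_device_train_batch_size from the TrainingArguments
global_step = 1019    # reported in TrainOutput after one epoch


def steps_per_epoch(n_rows, batch_size):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)


# Smallest and largest row counts that give exactly `global_step` steps.
min_rows = (global_step - 1) * batch_size + 1
max_rows = global_step * batch_size
```

Under these assumptions the training set has between 8145 and 8152 rows, which agrees with the runtime multiplied by the reported throughput (2492.7 s × 3.269 samples/s ≈ 8149).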

The predictions are computed and written to results.csv.

In [15]:
predictions = trainer.predict(test_DS)
predicted_labels = predictions.predictions.argmax(-1)
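
`predictions.predictions` holds raw logits, one row per text and one column per class, and `argmax(-1)` picks the column with the larger value. If class probabilities are also wanted, a softmax can be applied to each row. The sketch below uses plain Python lists and made-up logits rather than model output; with the NumPy array returned by the trainer, a vectorized version would be the natural choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three texts and two classes (not model output).
toy_logits = [[1.2, -0.3], [-0.8, 0.5], [0.1, 0.1]]
toy_probs = [softmax(row) for row in toy_logits]

# Index of the largest logit per row; ties resolve to the first index.
toy_pred = [row.index(max(row)) for row in toy_logits]
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities yields the same labels as taking it on the logits directly.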

In [16]:
result_df = pd.DataFrame({'id': df_test['id'], 'label': predicted_labels})

result_df.to_csv('results.csv', index=False)
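
Before uploading, it may be worth checking the file that was written. The sketch below is a hypothetical helper, `check_submission`, that verifies the column names, the row count, and the label values. The expected columns `id` and `label` and the binary label set are taken from the code above; the format actually required by the competition is not stated in this notebook and is an assumption here.

```python
import csv
import os
import tempfile


def check_submission(path, expected_rows, allowed_labels=(0, 1)):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline='') as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != ['id', 'label']:
            problems.append(f'unexpected columns: {reader.fieldnames}')
            return problems
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')
    # CSV values are read as strings, so compare against string labels.
    allowed = {str(v) for v in allowed_labels}
    bad = [r['id'] for r in rows if r['label'] not in allowed]
    if bad:
        problems.append(f'labels outside {sorted(allowed)} for ids: {bad}')
    return problems


# Demo on a throwaway file; with the notebook's data the call might be
# check_submission('results.csv', expected_rows=len(df_test)).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', newline='') as fh:
        fh.write('id,label\n1,0\n2,1\n3,2\n')
    demo_problems = check_submission(demo_path, expected_rows=3)
```

In the demo, the third row carries the label `2`, so the helper reports a single problem naming that id.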