# Executive Summary
Reviewing the internship reports of the DISC (Departamento de Ingeniería de Sistemas y Computación) takes a considerable amount of time, and the process has not yet been automated. This leads to long working hours and an additional workload for the academics, who could spend that time on other tasks. As a team, we therefore agreed on the need to analyze and develop a model that classifies the reports into the categories defined in the current rubric: insatisfactorio (unsatisfactory), regular (fair), bueno (good) and excelente (excellent).
Since the pandemic, reports have been submitted in digital format, which has produced a set of approximately 100 available reports. This digitization is a significant advantage for training the model, because both the inputs and the concrete outcomes are available (report, rubric and grade).


In [102]:
import fitz
import pandas as pd
import numpy as np

# Data loading
The dataset is loaded and rows with missing grades are dropped.

In [103]:
dataset = pd.read_excel("calificaciones.xlsx", decimal=',')
grades_columns = dataset.columns.difference(["id", "periodo", "Unnamed: 9"]) #["estructura", "escritura", "contenido", "conclusiones", "conocimiento", "relevancia", "total"]
rubric_columns = grades_columns.difference(["total"]) #, "escritura", "estructura"
dataset = dataset.dropna(subset=grades_columns)

# Document extraction and cleaning
In this section the PDF documents are loaded, their text is extracted and cleaned, and the result is added to the dataset.

In [4]:
documents = []

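for doc_id in dataset['id']:
    # The context manager closes the PDF once its text has been read.
    with fitz.open(f"dataset/{doc_id}.pdf") as pdf_file:
        # Join the pages with a form feed (chr(12)) as page separator.
        document_text = chr(12).join([page.get_text() for page in pdf_file])
    documents.append(document_text)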

dataset.insert(loc=2, column="documents", value=documents)
dataset

Unnamed: 0,id,periodo,documents,estructura,escritura,contenido,conclusiones,conocimiento,relevancia,total,Unnamed: 9
0,20908397-1,2023-1,\n \nUNIVERSIDAD CATÓLICA DEL NORTE \nFACUL...,6.2,5.1,6.0,5.5,4.4,6.0,5.3,
1,18971994-1,2023-1,\nAntofagasta \n \n Abril de 2023 \...,6.9,6.8,6.8,6.8,7.0,6.9,6.9,
2,19445943-1,2023-1,\n1 \n \n \nUNIVERSIDAD CATÓLICA DEL NORTE \n...,6.7,6.9,6.5,6.4,6.8,7.0,6.7,
5,19463712-1,2023-1,\n \n \n \n ...,4.4,4.9,4.9,5.8,4.0,6.0,4.9,
6,20218430-1,2023-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,6.1,5.8,5.5,5.0,4.5,5.8,5.2,
...,...,...,...,...,...,...,...,...,...,...,...
177,19928371-1,2021-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,6.8,6.2,6.0,5.7,6.3,6.0,6.1,
178,19952605-1,2021-1,\n \n \n \nUNIVERSIDAD CATÓLICA DEL NORTE \nF...,7.0,6.8,6.8,7.0,7.0,7.0,6.9,
179,19957163-1,2021-1,\n \nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTA...,6.5,4.5,5.8,5.5,6.4,6.3,5.8,
180,20180533-1,2021-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,7.0,6.0,6.7,6.8,6.8,7.0,6.7,


# Label preprocessing
Based on the grades of each entry, the documents are classified into the categories defined in the rubric.
That is, the grade of each component of each report is replaced by a value representing the corresponding category: either the category name as text, or a number from 0 for "insatisfactorio" to 3 for "excelente".
The text form is used specifically for the Classy Classification model, while the numeric form is used for the remaining models.

In [101]:
def get_classification(grade, number=False):
  grade = round(grade, 1)
  if(grade < 4):
    return "insatisfactorio" if not number else 0
  elif (4 <= grade < 5.5):
    return "regular" if not number else 1
  elif (5.5 <= grade < 6.5):
    return "bueno" if not number else 2
  elif (6.5 <= grade <= 7):
    return "excelente" if not number else 3
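The thresholds above can also be written as a sorted list of cut points. The following is a minimal sketch rather than the notebook's implementation: `classify_grade`, `CUT_POINTS` and `CATEGORIES` are hypothetical names, and unlike `get_classification` it does not return `None` for grades above 7.

```python
from bisect import bisect_right

# Lower bounds of "regular", "bueno" and "excelente", taken from the thresholds above.
CUT_POINTS = [4.0, 5.5, 6.5]
CATEGORIES = ["insatisfactorio", "regular", "bueno", "excelente"]

def classify_grade(grade, number=False):
    """Map a grade to a category index (0-3) or to the category name."""
    grade = round(grade, 1)
    # bisect_right counts how many cut points are <= grade,
    # which is exactly the category index.
    index = bisect_right(CUT_POINTS, grade)
    return index if number else CATEGORIES[index]

# Check the behavior on both sides of every threshold.
for g in (3.9, 4.0, 5.4, 5.5, 6.4, 6.5, 7.0):
    print(g, classify_grade(g), classify_grade(g, number=True))
```

Keeping the cut points in one list makes it easy to confirm that each boundary grade (4.0, 5.5, 6.5) falls into the higher category, as in the original function.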

In [5]:
text_labeled_dataset = dataset.copy()
text_labeled_dataset.loc[:, grades_columns] = text_labeled_dataset.loc[:, grades_columns].apply(lambda s: s.apply(get_classification))
dataset.loc[:, grades_columns] = dataset.loc[:, grades_columns].apply(lambda s: s.apply(lambda x: get_classification(grade=x, number=True)))

# Document classification with traditional methods
scikit-learn is used to analyze the documents with traditional NLP methods, specifically TF-IDF. Several machine learning models are used to classify the documents, such as linear and logistic regression, SVM, decision trees and Naïve Bayes.
As a first step, only the classification of the final grade (`total`) is tested.

In [94]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier, LogisticRegression, LinearRegression
from sklearn.naive_bayes import MultinomialNB, CategoricalNB
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from numpy import floor

## TF-IDF preprocessing
The traditional models work on a TF-IDF representation. The data is split into training and test sets, and the vectorizer is then fitted on the training set and applied to both.

In [7]:
Xn = dataset['documents']
yn = dataset[grades_columns]
X_train, X_test, y_train, y_test = train_test_split(Xn, yn, random_state=3, test_size=0.3)

In [8]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.9, min_df=0.2)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
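To make the `min_df` and `max_df` settings concrete: when given as floats, scikit-learn treats them as proportions of documents and drops terms whose document frequency is below `min_df` or above `max_df`. The sketch below reproduces only that filtering rule in plain Python on a toy corpus. `filter_by_document_frequency` and the corpus are hypothetical, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.2, max_df=0.9):
    """Return the terms kept when min_df/max_df are proportions of documents."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often it appears).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    low, high = min_df * n_docs, max_df * n_docs
    return {term for term, df in doc_freq.items() if low <= df <= high}

# Toy corpus: "informe" is in every document, "unico" in only one.
docs = [f"informe {'practica' if i % 2 == 0 else 'empresa'}" for i in range(10)]
docs[0] += " unico"

kept = filter_by_document_frequency(docs)
print(sorted(kept))
```

Terms shared by nearly every report and terms that occur in very few reports are both discarded, so only terms that can discriminate between documents remain in the vocabulary.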

## Final grade regressors
As an experiment toward a model that predicts the final grade from predictions of all rubric elements, several models are trained to determine the final category from the rubric elements. In addition, the final grade formula is applied using only the rubric categories.

In [95]:
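def get_final_grade(rubric_grades):
    # Weighted sum of the rubric category indices (0-3), floored to an integer.
    # NOTE: the weights below sum to 1.45, not 1.0. They are kept as run,
    # because the accuracy printed below was obtained with these values.
    result = rubric_grades['estructura']*0.5
    result += rubric_grades['escritura']*0.15
    result += rubric_grades['contenido']*0.25
    result += rubric_grades['conclusiones']*0.15
    result += rubric_grades['conocimiento']*0.30
    result += rubric_grades['relevancia']*0.10
    return floor(result)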

In [96]:
y_calc = get_final_grade(yn[rubric_columns])
print(accuracy_score(y_calc, yn['total']))

0.3390804597701149
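
One possible contributor to the low agreement above is that the weights in `get_final_grade` sum to 1.45 rather than 1, so the weighted sum of category indices can exceed the largest class index, 3. The sketch below only checks that arithmetic. The alternative weight of 0.05 for `estructura` is an assumption (it is the value that would make the weights sum to 1), not something confirmed by the rubric.

```python
import math

# Weights as written in get_final_grade.
WEIGHTS = {
    "estructura": 0.5, "escritura": 0.15, "contenido": 0.25,
    "conclusiones": 0.15, "conocimiento": 0.30, "relevancia": 0.10,
}

def weighted_category(categories, weights):
    """Weighted sum of category indices (0-3), before flooring."""
    return sum(weights[name] * categories[name] for name in weights)

# A report rated "excelente" (index 3) on every rubric element.
all_excellent = {name: 3 for name in WEIGHTS}
total_weight = sum(WEIGHTS.values())
score = weighted_category(all_excellent, WEIGHTS)
print(round(total_weight, 2), math.floor(score))  # the floored score is outside 0-3

# Hypothetical correction (assumption): estructura weighted 0.05.
ASSUMED_WEIGHTS = {**WEIGHTS, "estructura": 0.05}
assumed_score = weighted_category(all_excellent, ASSUMED_WEIGHTS)
# Round before flooring so that floating-point error cannot push 3.0 down to 2.
print(math.floor(round(assumed_score, 6)))
```

Whatever the intended weights are, rounding before flooring guards against floating-point sums that land just below an integer.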


In [87]:
final_grade_rf = RandomForestClassifier()
final_grade_rf.fit(y_train[rubric_columns], y_train['total'])
grade_pred_rf = final_grade_rf.predict(y_test[rubric_columns])
acc = accuracy_score(y_test['total'], grade_pred_rf)
f1 = f1_score(y_test['total'], grade_pred_rf, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')
{'Característica': rubric_columns, 'Importancia': final_grade_rf.feature_importances_}

Accuracy: 0.8113207547169812, F1: 0.8131925726265348


{'Característica': Index(['conclusiones', 'conocimiento', 'contenido', 'escritura', 'estructura',
        'relevancia'],
       dtype='object'),
 'Importancia': array([0.17987443, 0.2466199 , 0.24701639, 0.20052356, 0.06050739,
        0.06545833])}

In [98]:
final_grade_svc_n = SVC(C=10)
final_grade_svc_n.fit(y_train[rubric_columns], y_train['total'])
grade_pred = final_grade_svc_n.predict(y_test[rubric_columns])
acc = accuracy_score(y_test['total'], grade_pred)
f1 = f1_score(y_test['total'], grade_pred, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')

Accuracy: 0.8301886792452831, F1: 0.8401467505241089


In [88]:
final_grade_log = LogisticRegression(max_iter=1000)
final_grade_log.fit(y_train[rubric_columns], y_train['total'])
grade_pred_log = final_grade_log.predict(y_test[rubric_columns])
acc = accuracy_score(y_test['total'], grade_pred_log)
f1 = f1_score(y_test['total'], grade_pred_log, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')

Accuracy: 0.8679245283018868, F1: 0.8687345149609299


In [99]:
final_grade_nb = CategoricalNB()
final_grade_nb.fit(y_train[rubric_columns], y_train['total'])
grade_pred_nb = final_grade_nb.predict(y_test[rubric_columns])
acc = accuracy_score(y_test['total'], grade_pred_nb)
f1 = f1_score(y_test['total'], grade_pred_nb, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')

Accuracy: 0.8490566037735849, F1: 0.8503294399520814


In [100]:
final_grade_xgb = XGBClassifier()
final_grade_xgb.fit(y_train[rubric_columns], y_train['total'])
grade_pred_xgb = final_grade_xgb.predict(y_test[rubric_columns])
acc = accuracy_score(y_test['total'], grade_pred_xgb)
f1 = f1_score(y_test['total'], grade_pred_xgb, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')

Accuracy: 0.8113207547169812, F1: 0.811572327044025


## Document classification with SVM

In [70]:
svc_n = SVC(C=10)
svc_n.fit(X_train_bow, y_train['total'])
y_pred = svc_n.predict(X_test_bow)
acc = accuracy_score(y_test['total'], y_pred)
f1 = f1_score(y_test['total'], y_pred, average='weighted')
print(f'Accuracy: {acc}, F1: {f1}')

Accuracy: 0.5849056603773585, F1: 0.5516265369466873


In [72]:
svc_n_mo = MultiOutputClassifier(svc_n)
svc_n_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_mo = svc_n_mo.predict(X_test_bow)
acc_mo = svc_n_mo.score(X_test_bow, y_test[rubric_columns])
print(f'Accuracy: {acc_mo}')

Accuracy: 0.07547169811320754
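
`MultiOutputClassifier.score` reports exact-match accuracy: a document counts as correct only if all six rubric columns are predicted correctly, which is much stricter than the single-column accuracy reported for `total`. A minimal sketch of the difference, using made-up labels (the lists are illustrative and are not results from this dataset):

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of rows where every column is predicted correctly."""
    hits = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_column_accuracy(y_true, y_pred):
    """Accuracy of each output column taken on its own."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    return [
        sum(t[c] == p[c] for t, p in zip(y_true, y_pred)) / n_rows
        for c in range(n_cols)
    ]

# Made-up labels for 4 documents and 3 rubric columns.
y_true = [[3, 2, 2], [2, 2, 1], [3, 3, 3], [1, 2, 2]]
y_pred = [[3, 2, 1], [2, 2, 1], [3, 2, 3], [2, 2, 2]]

print(exact_match_accuracy(y_true, y_pred))
print(per_column_accuracy(y_true, y_pred))
```

In this toy case every column is right for three of four documents, yet only one document matches on all columns. Reporting per-column accuracy next to the exact-match score gives a fairer picture of a multi-output model.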


## Document classification with Ridge regression

In [73]:
ridge = RidgeClassifier()
ridge.fit(X_train_bow, y_train['total'])
y_pred_r = ridge.predict(X_test_bow)
acc_r = accuracy_score(y_test['total'], y_pred_r)
f1_r = f1_score(y_test['total'], y_pred_r, average='weighted')
print(f'Accuracy: {acc_r}, F1: {f1_r}')

Accuracy: 0.5660377358490566, F1: 0.5275254892769249


In [75]:
ridge_mo = MultiOutputClassifier(ridge)
ridge_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_r_mo = ridge_mo.predict(X_test_bow)
acc_r_mo = ridge_mo.score(X_test_bow, y_test[rubric_columns])
print(f'Accuracy: {acc_r_mo}')

Accuracy: 0.03773584905660377


## Document classification with logistic regression

In [76]:
log = LogisticRegression()
log.fit(X_train_bow, y_train['total'])
y_pred_l = log.predict(X_test_bow)
acc_l = accuracy_score(y_test['total'], y_pred_l)
f1_l = f1_score(y_test['total'], y_pred_l, average='weighted')
print(f'Accuracy: {acc_l}, F1: {f1_l}')

Accuracy: 0.6037735849056604, F1: 0.5322023148882195


## Document classification with Random Forest

In [77]:
rf = RandomForestClassifier()
rf.fit(X_train_bow, y_train['total'])
y_pred_rf = rf.predict(X_test_bow)
acc_rf = accuracy_score(y_test['total'], y_pred_rf)
f1_rf = f1_score(y_test['total'], y_pred_rf, average='weighted')
print(f'Accuracy: {acc_rf}, F1: {f1_rf}')

Accuracy: 0.5471698113207547, F1: 0.4873391291143085


In [78]:
rf_mo = MultiOutputClassifier(rf)
rf_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_rf_mo = rf_mo.predict(X_test_bow)
acc_rf_mo = rf_mo.score(X_test_bow, y_test[rubric_columns])
print(f'Accuracy: {acc_rf_mo}')

y_pred_rf_mo

Accuracy: 0.0


array([[2, 2, 2, 2, 2, 3],
       [3, 3, 2, 1, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [3, 1, 2, 1, 3, 2],
       [3, 3, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 2],
       [3, 3, 2, 1, 3, 3],
       [2, 1, 2, 2, 3, 2],
       [3, 3, 2, 2, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [3, 2, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [2, 2, 2, 2, 2, 3],
       [3, 1, 3, 1, 3, 3],
       [2, 3, 2, 2, 3, 3],
       [3, 3, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 1, 2, 1, 3, 3],
       [2, 1, 2, 1, 2, 2],
       [3, 1, 2, 1, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [3, 3, 2, 2, 3, 2],
       [2, 1, 2, 1, 2, 2],
       [3, 3, 3, 1, 3, 3],
       [2, 1, 2, 1, 2, 3],
       [3, 3, 2, 2, 3, 2],
       [3, 3, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [2, 1, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 1, 3, 3],
       [2, 2, 2, 1, 3, 2],
       [3, 1, 2, 2, 3, 3],
       [3, 3, 2, 2, 3, 3],
 

## Document classification with Naïve Bayes

In [79]:
nb = MultinomialNB()
nb.fit(X_train_bow, y_train['total'])
y_pred_nb = nb.predict(X_test_bow)
acc_nb = accuracy_score(y_test['total'], y_pred_nb)
f1_nb = f1_score(y_test['total'], y_pred_nb, average='weighted')
print(f'Accuracy: {acc_nb}, F1: {f1_nb}')

Accuracy: 0.5660377358490566, F1: 0.41417395306028526


## Document classification with XGBoost

In [80]:
xgb = XGBClassifier()
xgb.fit(X_train_bow, y_train['total'])
y_pred_xgb = xgb.predict(X_test_bow)
acc_xgb = accuracy_score(y_test['total'], y_pred_xgb)
f1_xgb = f1_score(y_test['total'], y_pred_xgb, average='weighted')
print(f'Accuracy: {acc_xgb}, F1: {f1_xgb}')

Accuracy: 0.4339622641509434, F1: 0.41675727534378


# Document classification with Deep Learning

## Document classification with spaCy

In [None]:
Xn_text_labeled = text_labeled_dataset['documents']
yn_text_labeled = text_labeled_dataset['total']
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(Xn_text_labeled, yn_text_labeled, random_state=3, test_size=0.3)

In [None]:
training_data_total = {
  "insatisfactorio": [],
  "regular": [],
  "bueno": [],
  "excelente": []
}
for index, document in X_text_train.items():
  training_data_total[y_text_train[index]].append(document)
training_data_total

https://spacy.io/universe/project/classyclassification

In [None]:
from classy_classification import ClassyClassifier
classifier = ClassyClassifier(data=training_data_total)
classifier.set_embedding_model(model="paraphrase-multilingual-mpnet-base-v2")
y_text_pred = classifier.pipe(X_text_test.tolist())

In [None]:
y_text_pred

In [None]:
y_text_test

## Document classification with TensorFlow
An LSTM model is tested to classify documents from a vector representation of the document body.
Because of technical limitations, and to avoid losing information, each document is split into a set of blocks; for now only the block division provided by the PDF format itself is used.
The input to the LSTM model is therefore a batch of samples (documents), each made up of a sequence of blocks, where each block is a vector whose dimensionality is determined by the representation model in use.
Since the LSTM model requires fixed-size inputs, the input size was chosen so that every document can be transformed and accepted by the model; the remaining positions of each document are filled with the representation of an empty string.
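
The padding step described above can be sketched with NumPy as follows. This is an illustration under assumptions: `pad_blocks` is a hypothetical helper, the random arrays stand in for real block embeddings, and the zero vector stands in for the embedding of an empty string.

```python
import numpy as np

def pad_blocks(block_embeddings, pad_vector, target_len):
    """Pad a (n_blocks, dim) array to (target_len, dim) by repeating pad_vector."""
    n_blocks, dim = block_embeddings.shape
    if n_blocks > target_len:
        raise ValueError(f"document has {n_blocks} blocks, more than {target_len}")
    # np.repeat with 0 repeats yields an empty (0, dim) array, so a document
    # that already has target_len blocks is handled without a special case.
    padding = np.repeat(pad_vector.reshape(1, dim), target_len - n_blocks, axis=0)
    return np.concatenate([block_embeddings, padding], axis=0)

rng = np.random.default_rng(0)
dim, target_len = 8, 16
pad_vector = np.zeros(dim)  # stand-in for the empty-string embedding
documents = [rng.normal(size=(n, dim)) for n in (3, 16, 10)]

# Stack the padded documents into one (documents, blocks, dim) array.
batch = np.stack([pad_blocks(d, pad_vector, target_len) for d in documents])
print(batch.shape)
```

The resulting array has one row per document, one step per block and one column per embedding dimension, which is the layout a recurrent layer consumes.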

The search for representation models prioritized input length and support for Spanish text. To bound the work, three models were selected for testing:
- Universal Sentence Encoder - Multilingual Large: supports Spanish and accepts input of arbitrary length (at the cost of possible information loss).
- Longformer Spanish: a BERT-based model extended to support inputs of up to 4096 tokens and trained specifically on Spanish.
- Tulio BERT: a BERT-based model trained on a Chilean dataset.

In [13]:
documents = []
max_length = -1
for doc_id in dataset['id']:
    document_text = []
    with fitz.open(f"dataset/{doc_id}.pdf") as pdf_file:
        for page in pdf_file:
            # Each block is a tuple; index 4 holds the block's text.
            blocks = page.get_text("blocks")
            document_text += [block[4] for block in blocks]
    documents.append(document_text)
    # Track the longest block, measured in characters.
    for block in document_text:
        if len(block) > max_length:
            max_length = len(block)
print(max_length)            
dataset.insert(loc=2, column="block_documents", value=documents)

4438


The number of paragraphs corresponds to the largest number of blocks found in a document, rounded up to the next power of two.
The batch size corresponds to the number of documents to be processed.
In addition, to make classification easier for the model, the categories are converted to a binary (one-hot) list representation.
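
Both steps can be illustrated in plain Python. The sketch below is an illustration rather than the notebook's code: `next_power_of_two` and `one_hot` are hypothetical helpers, and the block counts are made up.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def one_hot(label, num_classes=4):
    """Binary list with a single 1 at the position of the category index."""
    return [1 if i == label else 0 for i in range(num_classes)]

block_counts = [120, 1500, 731]  # made-up number of blocks per document
paragraph_qty = next_power_of_two(max(block_counts))
print(paragraph_qty)

# Category indices 0 ("insatisfactorio"), 2 ("bueno") and 3 ("excelente").
print([one_hot(label) for label in (0, 2, 3)])
```

The one-hot layout matches the 4-unit softmax output used by the models in this section, with one position per rubric category.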

In [25]:
import time
from keras.layers import Embedding, LSTM, Bidirectional, Dense
from keras.models import Sequential, clone_model
import tensorflow as tf
import tensorflow_text
import tensorflow_hub as hub
from keras import utils
import torch
PARAGRAPH_QTY = 2048
BATCH_SIZE= 171

In [32]:
#["estructura", "escritura", "contenido", "conclusiones", "conocimiento", "relevancia", "total"]
labels_total = utils.to_categorical(dataset['total'], num_classes=4)
labels_estructura = utils.to_categorical(dataset['estructura'], num_classes=4)
labels_escritura = utils.to_categorical(dataset['escritura'], num_classes=4)
labels_contenido = utils.to_categorical(dataset['contenido'], num_classes=4)
labels_conclusiones = utils.to_categorical(dataset['conclusiones'], num_classes=4)
labels_conocimiento = utils.to_categorical(dataset['conocimiento'], num_classes=4)
labels_relevancia = utils.to_categorical(dataset['relevancia'], num_classes=4)

### Document classification with Universal Sentence Encoder - Multilingual Large

In [34]:
use_embedder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

In [35]:
use_pad = use_embedder("")

In [36]:
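# Time the embedding step; time_start and time_end were not defined before.
time_start = time.time()
Xn_text_embed = [use_embedder(document) for document in dataset['block_documents']]
time_end = time.time()
print(time_end-time_start)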

2023-11-26 11:50:49.636775: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 1776424960 exceeds 10% of free system memory.


1619.3680374929681


In [41]:
padded_documents = []
for document in Xn_text_embed:
    pad = np.concatenate(np.repeat([use_pad], PARAGRAPH_QTY-document.shape[0], axis=0), axis=0)
    padded_documents.append(tf.concat([document, pad], 0))

In [42]:
Xn_tensor = tf.convert_to_tensor(padded_documents)

In [43]:
serialized_Xn_use = tf.io.serialize_tensor(Xn_tensor)
with open('use_padded.tensor', 'wb') as file:
    file.write(serialized_Xn_use.numpy())

In [48]:
model_use = Sequential()
model_use.add(Bidirectional(LSTM(512, return_sequences=False, input_shape=(PARAGRAPH_QTY, 512))))
model_use.add(Dense(256, activation='relu'))
model_use.add(Dense(4, activation='softmax'))
model_use.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])

In [49]:
model_use_total = clone_model(model_use)
model_use_total.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_use_total.fit(Xn_tensor, labels_total, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_use_total.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_6 (Bidirectio  (None, 1024)             4198400   
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 256)               262400    
                                                                 
 dense_9 (Dense)             (None, 4)                 1028      
                                                                 
Total params: 4,461,828
Trainable params: 4,461,828
Non-trainable params: 0
_________________________________________________________________


In [106]:
model_use_contenido = clone_model(model_use)
model_use_contenido.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_use_contenido.fit(Xn_tensor, labels_contenido, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_use_contenido.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_6 (Bidirectio  (None, 1024)             4198400   
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 256)               262400    
                                                                 
 dense_9 (Dense)             (None, 4)                 1028      
                                                                 
Total params: 4,461,828
Trainable params: 4,461,828
Non-trainable params: 0
_________________________________________________________________


### Document classification with Longformer - Spanish

In [15]:
from transformers import RobertaTokenizer, RobertaModel
long_tokenizer = RobertaTokenizer.from_pretrained("mrm8488/longformer-base-4096-spanish",)
long_model = RobertaModel.from_pretrained("mrm8488/longformer-base-4096-spanish")

Some weights of the model checkpoint at mrm8488/longformer-base-4096-spanish were not used when initializing RobertaModel: ['roberta.encoder.layer.11.attention.self.value_global.weight', 'roberta.encoder.layer.11.attention.self.query_global.weight', 'lm_head.bias', 'roberta.encoder.layer.11.attention.self.value_global.bias', 'lm_head.layer_norm.weight', 'roberta.encoder.layer.2.attention.self.query_global.weight', 'roberta.encoder.layer.0.attention.self.query_global.bias', 'roberta.encoder.layer.8.attention.self.query_global.weight', 'roberta.encoder.layer.5.attention.self.value_global.bias', 'roberta.encoder.layer.2.attention.self.value_global.weight', 'roberta.encoder.layer.1.attention.self.query_global.bias', 'roberta.encoder.layer.1.attention.self.key_global.bias', 'roberta.encoder.layer.4.attention.self.query_global.bias', 'roberta.encoder.layer.5.attention.self.key_global.weight', 'roberta.encoder.layer.4.attention.self.key_global.weight', 'roberta.encoder.layer.3.attention.self.

In [16]:
inputs = long_tokenizer("", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = long_model(**inputs)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
long_pad = tf.convert_to_tensor(cls_embeddings)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

In [17]:
def long_embedder(document):
    # Embed every text block of a document and stack the per-block vectors.
    embeddings = []
    for block in document:
        inputs = long_tokenizer(block, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = long_model(**inputs)
        # The first token of the last hidden state is used as the block representation.
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings)
    print("done document")
    return tf.convert_to_tensor(torch.cat(embeddings))

In [18]:
Xn_long_embed = [long_embedder(document) for document in dataset['block_documents']]

done document

In [19]:
padded_long_documents = []
for document in Xn_long_embed:
    pad = np.concatenate(np.repeat([long_pad], PARAGRAPH_QTY-document.shape[0], axis=0), axis=0)
    padded_long_documents.append(tf.concat([document, pad], 0))

In [20]:
Xn_long_tensor = tf.convert_to_tensor(padded_long_documents)

In [105]:
serialized_Xn_long = tf.io.serialize_tensor(Xn_long_tensor)
with open('long_padded.tensor', 'wb') as file:
    file.write(serialized_Xn_long.numpy())

In [29]:
model_long = Sequential()
model_long.add(Bidirectional(LSTM(512, return_sequences=False, input_shape=(PARAGRAPH_QTY, 768))))
model_long.add(Dense(256, activation='relu'))
model_long.add(Dense(4, activation='softmax'))

In [30]:
model_long_total = clone_model(model_long)
model_long_total.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_long_total.fit(Xn_long_tensor, labels_total, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_long_total.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_1 (Bidirectio  (None, 1024)             5246976   
 nal)                                                            
                                                                 
 dense_2 (Dense)             (None, 256)               262400    
                                                                 
 dense_3 (Dense)             (None, 4)                 1028      
                                                                 
Total params: 5,510,404
Trainable params: 5,510,404
Non-trainable params: 0
_________________________________________________________________


In [107]:
model_long_contenido = clone_model(model_long)
model_long_contenido.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_long_contenido.fit(Xn_long_tensor, labels_contenido, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_long_contenido.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_1 (Bidirectio  (None, 1024)             5246976   
 nal)                                                            
                                                                 
 dense_2 (Dense)             (None, 256)               262400    
                                                                 
 dense_3 (Dense)             (None, 4)                 1028      
                                                                 
Total params: 5,510,404
Trainable params: 5,510,404
Non-trainable params: 0
_________________________________________________________________


### Document classification with Tulio BERT

In [50]:
from transformers import AutoTokenizer, AutoModel
tulio_tokenizer = AutoTokenizer.from_pretrained("dccuchile/tulio-chilean-spanish-bert")
tulio_model = AutoModel.from_pretrained("dccuchile/tulio-chilean-spanish-bert")

Some weights of the model checkpoint at dccuchile/tulio-chilean-spanish-bert were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/tulio-chilean-spanish-bert and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this mode

In [51]:
inputs = tulio_tokenizer("", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = tulio_model(**inputs)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
tulio_pad = tf.convert_to_tensor(cls_embeddings)

In [52]:
def tulio_embedder(document):
    # Embed every text block of a document and stack the per-block [CLS] vectors.
    embeddings = []
    for block in document:
        inputs = tulio_tokenizer(block, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = tulio_model(**inputs)
        # The [CLS] token of the last hidden state is used as the block representation.
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings)
    print("done document")
    return tf.convert_to_tensor(torch.cat(embeddings))

In [53]:
Xn_tulio_embed = [tulio_embedder(document) for document in dataset['block_documents']]

done document

In [62]:
padded_tulio_documents = []
for document in Xn_tulio_embed:
    pad = np.concatenate(np.repeat([tulio_pad], PARAGRAPH_QTY-document.shape[0], axis=0), axis=0)
    padded_tulio_documents.append(tf.concat([document, pad], 0))

In [63]:
Xn_tulio_tensor = tf.convert_to_tensor(padded_tulio_documents)

In [104]:
serialized_Xn_tulio = tf.io.serialize_tensor(Xn_tulio_tensor)
with open('tulio_padded.tensor', 'wb') as file:
    file.write(serialized_Xn_tulio.numpy())

In [57]:
model_tulio = Sequential()
model_tulio.add(Bidirectional(LSTM(512, return_sequences=True, input_shape=(PARAGRAPH_QTY, 768))))
model_tulio.add(Bidirectional(LSTM(units=512, return_sequences=False)))
model_tulio.add(Dense(256, activation='relu'))
model_tulio.add(Dense(4, activation='softmax'))

In [65]:
model_tulio_total = clone_model(model_tulio)
model_tulio_total.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_tulio_total.fit(Xn_tulio_tensor, labels_total, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_tulio_total.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_7 (Bidirectio  (None, 2048, 1024)       5246976   
 nal)                                                            
                                                                 
 bidirectional_8 (Bidirectio  (None, 1024)             6295552   
 nal)                                                            
                                                                 
 dense_10 (Dense)            (None, 256)               262400    
                                                                 
 dense_11 (Dense)            (None, 4)                 1028      
                                                                 
Total params: 11,805,956
Trainable params: 11,805,956
Non-trainable params: 0
_________________________________________________________________


In [66]:
model_tulio_contenido = clone_model(model_tulio)
model_tulio_contenido.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])
model_tulio_contenido.fit(Xn_tulio_tensor, labels_contenido, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)
model_tulio_contenido.summary()

Epoch 1/3
Epoch 2/3
Epoch 3/3
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_7 (Bidirectio  (None, 2048, 1024)       5246976   
 nal)                                                            
                                                                 
 bidirectional_8 (Bidirectio  (None, 1024)             6295552   
 nal)                                                            
                                                                 
 dense_10 (Dense)            (None, 256)               262400    
                                                                 
 dense_11 (Dense)            (None, 4)                 1028      
                                                                 
Total params: 11,805,956
Trainable params: 11,805,956
Non-trainable params: 0
_________________________________________________________________
