# Executive Summary
Reviewing the internship reports of the DISC (Departamento de Ingeniería de Sistemas y Computación, the Department of Systems and Computing Engineering) takes a considerable amount of time and has not been automated to date. This means long working hours and extra load for the academics, who could spend that time on other tasks. As a team, we therefore agreed on the need to analyze and develop a model that classifies the reports into the categories defined in the current rubric: insatisfactorio (unsatisfactory), regular (fair), bueno (good), and excelente (excellent).
Since the start of the pandemic, reports have been submitted in digital form, which has produced a set of approximately 100 available reports. This digitization is a significant advantage for training the model, because both the inputs and the concrete outcomes are available (report, rubric, and grade).


In [1]:
import fitz
import pandas as pd
import numpy as np
#pd.set_option("mode.copy_on_write", True) #not on Python 3.9

# Data loading


In [2]:
def get_classification(grade, number=False):
  # Map a grade on the 1.0-7.0 scale to its rubric category.
  # Returns the text label, or the integer code (0-3) when number=True.
  grade = round(grade, 1)
  if grade < 4:
    return 0 if number else "insatisfactorio"
  elif grade < 5.5:
    return 1 if number else "regular"
  elif grade < 6.5:
    return 2 if number else "bueno"
  elif grade <= 7:
    return 3 if number else "excelente"
  # Grades above 7 (or NaN) match no branch and return None, as before.

In [3]:
dataset = pd.read_excel("calificaciones.xlsx", decimal=',')
grades_columns = dataset.columns.difference(["id", "periodo", "Unnamed: 9"]) #["estructura", "escritura", "contenido", "conclusiones", "conocimiento", "relevancia", "total"]
rubric_columns = grades_columns.difference(["total"]) #, "escritura", "estructura"
dataset = dataset.dropna(subset=grades_columns)

# Document extraction and cleaning
In this section, the PDF documents are loaded, their text is extracted and cleaned, and the result is added to the dataset.

In [4]:
documents = []

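for doc_id in dataset['id']:
    # Open each PDF as a context manager so the file handle is released,
    # and join its pages with a form feed character, chr(12).
    with fitz.open(f"dataset/{doc_id}.pdf") as pdf_file:
        document_text = chr(12).join([page.get_text() for page in pdf_file])
    documents.append(document_text)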

dataset.insert(loc=2, column="documents", value=documents)
dataset

Unnamed: 0,id,periodo,documents,estructura,escritura,contenido,conclusiones,conocimiento,relevancia,total,Unnamed: 9
0,20908397-1,2023-1,\n \nUNIVERSIDAD CATÓLICA DEL NORTE \nFACUL...,6.2,5.1,6.0,5.5,4.4,6.0,5.3,
1,18971994-1,2023-1,\nAntofagasta \n \n Abril de 2023 \...,6.9,6.8,6.8,6.8,7.0,6.9,6.9,
2,19445943-1,2023-1,\n1 \n \n \nUNIVERSIDAD CATÓLICA DEL NORTE \n...,6.7,6.9,6.5,6.4,6.8,7.0,6.7,
5,19463712-1,2023-1,\n \n \n \n ...,4.4,4.9,4.9,5.8,4.0,6.0,4.9,
6,20218430-1,2023-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,6.1,5.8,5.5,5.0,4.5,5.8,5.2,
...,...,...,...,...,...,...,...,...,...,...,...
177,19928371-1,2021-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,6.8,6.2,6.0,5.7,6.3,6.0,6.1,
178,19952605-1,2021-1,\n \n \n \nUNIVERSIDAD CATÓLICA DEL NORTE \nF...,7.0,6.8,6.8,7.0,7.0,7.0,6.9,
179,19957163-1,2021-1,\n \nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTA...,6.5,4.5,5.8,5.5,6.4,6.3,5.8,
180,20180533-1,2021-1,\nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD D...,7.0,6.0,6.7,6.8,6.8,7.0,6.7,
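
The section text mentions cleaning, while the cell above only extracts the raw text, which is full of `\n \n` runs as the table shows. As an illustration of what a cleaning step could look like (an assumption, not the authors' procedure), the hypothetical helper `clean_text` below normalizes whitespace and page separators using only the standard library.

```python
import re

def clean_text(text):
    """Normalize whitespace in text extracted from a PDF (illustrative sketch)."""
    # Treat form feeds (used above as page separators) as line breaks.
    text = text.replace("\f", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that surround line breaks, then collapse empty lines.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()

sample = "\n \nUNIVERSIDAD CATÓLICA DEL NORTE \nFACULTAD"
print(repr(clean_text(sample)))
```

Whether page boundaries should be kept (for example, to analyze the cover page separately) is a design choice; this sketch discards them.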


# Label preprocessing
Based on the grades of each entry, the documents are classified into the categories defined in the rubric.
Two kinds of labels are considered: text and numeric. The former are used to train the Deep Learning models, the latter to train the traditional models.

In [5]:
text_labeled_dataset = dataset.copy()
text_labeled_dataset.loc[:, grades_columns] = text_labeled_dataset.loc[:, grades_columns].apply(lambda s: s.apply(get_classification))
dataset.loc[:, grades_columns] = dataset.loc[:, grades_columns].apply(lambda s: s.apply(lambda x: get_classification(grade=x, number=True)))
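
The threshold logic of `get_classification` can also be written as a lookup over the category boundaries. The following is a minimal sketch, not the notebook's implementation: `THRESHOLDS`, `LABELS`, and `classify` are hypothetical names, and unlike the original function it maps any grade above 7 to the top category instead of returning `None`.

```python
from bisect import bisect_right

# Lower bounds of "regular", "bueno" and "excelente";
# anything below 4.0 is "insatisfactorio".
THRESHOLDS = [4.0, 5.5, 6.5]
LABELS = ["insatisfactorio", "regular", "bueno", "excelente"]

def classify(grade, number=False):
    """Return the rubric category of a grade as an integer code or a text label."""
    # bisect_right counts how many lower bounds the rounded grade has reached.
    code = bisect_right(THRESHOLDS, round(grade, 1))
    return code if number else LABELS[code]

grades = (3.9, 4.0, 5.4, 5.5, 6.5, 7.0)
print([classify(g) for g in grades])
print([classify(g, number=True) for g in grades])
```

Keeping the boundaries in one list makes it easy to check that the text labels and the numeric codes stay aligned.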

# Document classification with traditional methods
SciKit-Learn is used to analyze the documents with traditional NLP methods such as TF-IDF. Logistic regression and SVM classification models are used to classify the documents.
As a first step, only the final classification is tested.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import RidgeClassifier, LogisticRegression, LinearRegression
from sklearn.naive_bayes import MultinomialNB, CategoricalNB
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score

## TF-IDF preprocessing

In [7]:
Xn = dataset['documents']
yn = dataset[grades_columns]
X_train, X_test, y_train, y_test = train_test_split(Xn, yn, random_state=3, test_size=0.3)

In [8]:
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.9, min_df=0.2)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

## Final-grade regressors

In [66]:
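def get_final_grade(rubric_grades):
    # NOTE: these weights add up to 1.45 rather than 1.0 (a weight of 0.05 for
    # 'estructura' would make them sum to 1.0). They are kept as recorded so
    # that the output below still matches the code.
    result = rubric_grades['estructura']*0.5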
    result += rubric_grades['escritura']*0.15
    result += rubric_grades['contenido']*0.25
    result += rubric_grades['conclusiones']*0.15
    result += rubric_grades['conocimiento']*0.30
    result += rubric_grades['relevancia']*0.10
    return round(result)

In [73]:
y_calc = get_final_grade(yn[rubric_columns])
print(accuracy_score(y_calc, yn['total']))
y_calc

0.10344827586206896


0      2.0
1      4.0
2      4.0
5      2.0
6      2.0
      ... 
177    3.0
178    4.0
179    3.0
180    4.0
181    4.0
Name: estructura, Length: 174, dtype: float64
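
The weights in `get_final_grade` add up to 1.45, which is why the computed values above reach 4.0 even though the category codes only go up to 3. The sketch below shows one way to keep a weighted score inside the input range by dividing by the total weight. The normalization is an assumption made for illustration, not the rubric's official formula, and `WEIGHTS` and `weighted_grade` are hypothetical names.

```python
# Weights copied from get_final_grade above; they sum to 1.45.
WEIGHTS = {
    "estructura": 0.5,
    "escritura": 0.15,
    "contenido": 0.25,
    "conclusiones": 0.15,
    "conocimiento": 0.30,
    "relevancia": 0.10,
}

def weighted_grade(rubric_grades, weights=WEIGHTS):
    """Weighted mean of the rubric items, normalized by the total weight."""
    total_weight = sum(weights.values())
    weighted_sum = sum(rubric_grades[name] * w for name, w in weights.items())
    return weighted_sum / total_weight

# A report rated 3 on every item: raw weighted sum versus normalized mean.
all_threes = {name: 3 for name in WEIGHTS}
raw_sum = sum(all_threes[name] * w for name, w in WEIGHTS.items())
print(raw_sum, weighted_grade(all_threes))
```

Because the function only indexes by column name, it also accepts a pandas DataFrame with these columns and then returns a Series.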

In [59]:
final_grade_rf = RandomForestClassifier()
final_grade_rf.fit(y_train[rubric_columns], y_train['total'])
grade_pred_rf = final_grade_rf.predict(y_test[rubric_columns])
print(accuracy_score(y_test['total'], grade_pred_rf))
{'Característica': rubric_columns, 'Importancia': final_grade_rf.feature_importances_}

0.8113207547169812


{'Característica': Index(['conclusiones', 'conocimiento', 'contenido', 'escritura', 'estructura',
        'relevancia'],
       dtype='object'),
 'Importancia': array([0.16857424, 0.2603969 , 0.26508912, 0.19325956, 0.05499339,
        0.05768678])}
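
The importances are easier to read when paired with their column names and sorted. This small sketch does that with the values printed above (rounded as shown in the output); `ranked` is a hypothetical name.

```python
# Column names and importances copied from the output above.
features = ["conclusiones", "conocimiento", "contenido",
            "escritura", "estructura", "relevancia"]
importances = [0.16857424, 0.2603969, 0.26508912,
               0.19325956, 0.05499339, 0.05768678]

# Sort the (name, importance) pairs from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<13} {importance:.3f}")
```

In this run, `contenido` and `conocimiento` carry the most weight, while `estructura` and `relevancia` contribute the least.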

In [60]:
final_grade_log = LogisticRegression(max_iter=1000)
final_grade_log.fit(y_train[rubric_columns], y_train['total'])
grade_pred_log = final_grade_log.predict(y_test[rubric_columns])
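print(accuracy_score(y_test['total'], grade_pred_log))
# One row of coefficients per class, one column per rubric item.
{'Característica': rubric_columns, 'Importancia': final_grade_log.coef_}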

0.8679245283018868


{'Característica': Index(['conclusiones', 'conocimiento', 'contenido', 'escritura', 'estructura',
        'relevancia'],
       dtype='object'),
 'Importancia': array([[-1.34695751, -1.35070132, -0.92570698, -1.02865765, -0.25962928,
         -1.29684428],
        [-0.59156168, -1.05596707, -0.89826997, -1.19481917, -0.23909476,
         -0.17719915],
        [ 0.46610424,  0.36809972,  0.03083842,  0.40622851, -0.19707045,
          0.37408921],
        [ 1.47241495,  2.03856867,  1.79313853,  1.8172483 ,  0.69579448,
          1.09995422]])}

In [1]:
final_grade_nb = CategoricalNB()
final_grade_nb.fit(y_train[rubric_columns], y_train['total'])
grade_pred_nb = final_grade_nb.predict(y_test[rubric_columns])
print(accuracy_score(y_test['total'], grade_pred_nb))
print(f1_score(y_test['total'], grade_pred_nb, average='weighted'))

NameError: name 'CategoricalNB' is not defined

## Document classification with SVM

In [85]:
svc_n = SVC(C=10)
svc_n.fit(X_train_bow, y_train['total'])
y_pred = svc_n.predict(X_test_bow)
print(accuracy_score(y_test['total'], y_pred))
print(f1_score(y_test['total'], y_pred, average='weighted'))

0.5849056603773585
0.5516265369466873


In [80]:
svc_n_mo = MultiOutputClassifier(svc_n)
svc_n_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_cont = svc_n_mo.predict(X_test_bow)
print(svc_n_mo.score(X_test_bow, y_test[rubric_columns]))
print(f1_score(y_test[rubric_columns], y_pred_cont, average='weighted'))
y_pred_cont

0.07547169811320754


ValueError: multiclass-multioutput is not supported
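
The `ValueError` above is raised because `f1_score` does not accept multiclass-multioutput targets. One workaround, sketched here under the assumption that a separate score per rubric item is acceptable, is to call `f1_score` once per output column. `per_column_f1` is a hypothetical helper, and the small arrays are invented for the example.

```python
import numpy as np
from sklearn.metrics import f1_score

def per_column_f1(y_true, y_pred, average="weighted"):
    """Compute one F1 score per output column of a multi-output prediction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return [
        f1_score(y_true[:, col], y_pred[:, col], average=average)
        for col in range(y_true.shape[1])
    ]

# Invented labels for two rubric items and three documents.
toy_true = [[0, 1], [1, 2], [2, 3]]
toy_pred = [[0, 1], [1, 2], [2, 3]]
print(per_column_f1(toy_true, toy_pred))
```

In the notebook, this would be called as `per_column_f1(y_test[rubric_columns], y_pred_cont)`.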

## Document classification with Ridge regression

In [84]:
ridge = RidgeClassifier()
ridge.fit(X_train_bow, y_train['total'])
y_pred_r = ridge.predict(X_test_bow)
print(accuracy_score(y_test['total'], y_pred_r))
print(f1_score(y_test['total'], y_pred_r, average='weighted'))

0.5660377358490566
0.5275254892769249


In [40]:
ridge_mo = MultiOutputClassifier(ridge)
ridge_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_cont = ridge_mo.predict(X_test_bow)
print(ridge_mo.score(X_test_bow, y_test[rubric_columns]))
y_pred_cont

0.1320754716981132


array([[2, 1, 3, 2],
       [3, 2, 2, 3],
       [2, 2, 2, 2],
       [3, 1, 2, 3],
       [3, 3, 2, 3],
       [3, 2, 2, 2],
       [3, 3, 2, 3],
       [2, 2, 2, 3],
       [3, 3, 3, 3],
       [3, 2, 2, 3],
       [2, 1, 1, 3],
       [2, 1, 2, 2],
       [2, 3, 2, 3],
       [2, 2, 2, 3],
       [2, 2, 2, 3],
       [3, 3, 1, 3],
       [3, 1, 2, 2],
       [2, 3, 2, 3],
       [3, 2, 2, 3],
       [3, 1, 1, 3],
       [1, 2, 2, 2],
       [1, 1, 1, 3],
       [3, 2, 2, 3],
       [3, 2, 2, 3],
       [2, 1, 1, 2],
       [3, 3, 3, 3],
       [3, 3, 2, 3],
       [3, 2, 2, 3],
       [3, 3, 3, 3],
       [2, 3, 2, 3],
       [1, 2, 2, 2],
       [3, 3, 2, 3],
       [3, 3, 2, 2],
       [3, 2, 2, 3],
       [2, 2, 2, 2],
       [3, 3, 2, 3],
       [2, 3, 2, 3],
       [3, 1, 2, 2],
       [2, 2, 2, 3],
       [3, 1, 2, 2],
       [3, 3, 2, 3],
       [3, 2, 2, 3],
       [1, 2, 2, 2],
       [3, 3, 2, 3],
       [2, 3, 2, 3],
       [2, 2, 2, 2],
       [3, 2, 2, 3],
       [2, 3,
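
The low values returned by `MultiOutputClassifier.score` are partly explained by the metric itself: for multi-output targets it reports subset accuracy, so a document counts as correct only if every rubric item is predicted correctly. The sketch below contrasts that with per-column accuracy on invented data; both helper names are hypothetical.

```python
import numpy as np

def subset_accuracy(y_true, y_pred):
    """Fraction of rows in which every output column is predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def per_column_accuracy(y_true, y_pred):
    """Accuracy of each output column taken on its own."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred, axis=0).tolist()

# Invented labels: four documents, two rubric items.
toy_true = [[2, 1], [3, 2], [2, 2], [3, 3]]
toy_pred = [[2, 1], [3, 1], [2, 3], [1, 3]]
print(subset_accuracy(toy_true, toy_pred))
print(per_column_accuracy(toy_true, toy_pred))
```

With six rubric items, even a model that is reasonably accurate on each column can score close to zero on the exact-match metric.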

## Document classification with Logistic Regression

In [83]:
log = LogisticRegression()
log.fit(X_train_bow, y_train['total'])
y_pred_l = log.predict(X_test_bow)
print(accuracy_score(y_test['total'], y_pred_l))
print(f1_score(y_test['total'], y_pred_l, average='weighted'))

0.6037735849056604
0.5322023148882195


## Document classification with Random Forest

In [13]:
rf = RandomForestClassifier()
rf.fit(X_train_bow, y_train['total'])
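# NOTE: the accuracy recorded below was produced when this cell scored the
# logistic-regression predictions (y_pred_l); it is not a random-forest result.
y_pred_rf = rf.predict(X_test_bow)
print(accuracy_score(y_test['total'], y_pred_rf))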

0.6037735849056604


In [75]:
rf_mo = MultiOutputClassifier(rf)
rf_mo.fit(X_train_bow, y_train[rubric_columns])
y_pred_cont = rf_mo.predict(X_test_bow)
print(rf_mo.score(X_test_bow, y_test[rubric_columns]))
y_pred_cont

0.018867924528301886


array([[2, 1, 2, 2, 2, 3],
       [3, 1, 2, 1, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [3, 1, 2, 2, 2, 3],
       [3, 3, 2, 1, 3, 3],
       [3, 3, 3, 2, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [2, 2, 2, 2, 2, 3],
       [2, 3, 2, 1, 3, 3],
       [3, 1, 2, 1, 3, 3],
       [2, 1, 2, 2, 3, 3],
       [2, 3, 2, 2, 3, 3],
       [3, 3, 2, 1, 3, 3],
       [3, 2, 2, 2, 3, 3],
       [2, 2, 2, 2, 3, 3],
       [2, 1, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 1, 3, 3],
       [3, 2, 2, 1, 3, 3],
       [2, 1, 2, 1, 3, 3],
       [1, 1, 2, 1, 2, 2],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [2, 3, 2, 2, 3, 2],
       [2, 1, 2, 2, 3, 2],
       [2, 2, 2, 1, 3, 3],
       [3, 1, 2, 1, 3, 3],
       [2, 3, 2, 2, 3, 3],
       [2, 2, 3, 1, 3, 3],
       [3, 3, 2, 2, 3, 3],
       [3, 2, 2, 1, 3, 2],
       [3, 2, 2, 1, 3, 3],
       [3, 3, 2, 2, 3, 2],
       [3, 2, 2, 1, 3, 3],
       [2, 1, 1, 1, 3, 2],
       [3, 3, 2, 2, 3, 3],
       [3, 3, 2, 1, 3, 3],
 

## Document classification with Naïve Bayes

In [16]:
nb = MultinomialNB()
nb.fit(X_train_bow, y_train['total'])
y_pred_nb = nb.predict(X_test_bow)
print(accuracy_score(y_test['total'], y_pred_nb))

0.5660377358490566


## Document classification with XGBoost

# Document classification with Deep Learning

## Document classification with spaCy

In [None]:
Xn_text_labeled = text_labeled_dataset['documents']
yn_text_labeled = text_labeled_dataset['total']
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(Xn_text_labeled, yn_text_labeled, random_state=3, test_size=0.3)

In [None]:
training_data_total = {
  "insatisfactorio": [],
  "regular": [],
  "bueno": [],
  "excelente": []
}
for index, document in X_text_train.items():
  training_data_total[y_text_train[index]].append(document)
training_data_total
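
The grouping step above builds a mapping from each label to its list of training documents. A compact sketch of the same idea with `collections.defaultdict` follows; `group_by_label` is a hypothetical helper, and unlike the cell above it creates keys only for labels that actually occur in the data.

```python
from collections import defaultdict

def group_by_label(documents, labels):
    """Group documents into a {label: [documents]} mapping."""
    grouped = defaultdict(list)
    for document, label in zip(documents, labels):
        grouped[label].append(document)
    return dict(grouped)

# Invented placeholder texts and labels.
toy_documents = ["text a", "text b", "text c"]
toy_labels = ["bueno", "regular", "bueno"]
print(group_by_label(toy_documents, toy_labels))
```

If the downstream classifier expects every category to be present, the explicit dictionary used in the cell above is the safer choice.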

https://spacy.io/universe/project/classyclassification

In [None]:
from classy_classification import ClassyClassifier
classifier = ClassyClassifier(data=training_data_total)
classifier.set_embedding_model(model="paraphrase-multilingual-mpnet-base-v2")
y_text_pred = classifier.pipe(X_text_test.tolist())

In [None]:
y_text_pred

In [None]:
y_text_test

## Document classification with TensorFlow

In [8]:
documents = []
max_length = -1
for id in dataset['id']:
    pdf_file = fitz.open(f"dataset/{id}.pdf")
    document_text = [] 
    for page in pdf_file:
        blocks = page.get_text("blocks")
        document_text += [block[4] for block in blocks]
    documents.append(document_text)
    for block in document_text:
        if len(block) > max_length:
            max_length = len(block)
print(max_length)            
dataset.insert(loc=2, column="block_documents", value=documents)

4438


In [9]:
import time
from keras.layers import Embedding, LSTM, Bidirectional, Dense
from keras.models import Sequential
import tensorflow as tf
import tensorflow_text
import tensorflow_hub as hub
from keras import utils
import torch
PARAGRAPH_QTY = 2048
BATCH_SIZE= 171
labels = utils.to_categorical(dataset['total'], num_classes=4)



### Document classification with Universal Sentence Encoder - Multilingual Large
The input is structured as follows:
Each document is stored as a list of paragraphs.
The input to the LSTM model would therefore be a set of batches, corresponding to ...
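
The structure described above, one list of paragraph embeddings per document padded to a common length, can be sketched with NumPy alone. All values here are invented, `pad_document` is a hypothetical helper, and the zero pad vector is only a stand-in for the embedding of the empty string that the notebook uses as padding.

```python
import numpy as np

def pad_document(paragraph_embeddings, target_len, pad_vector):
    """Pad a (n_paragraphs, dim) array with copies of pad_vector up to target_len rows."""
    paragraph_embeddings = np.asarray(paragraph_embeddings, dtype=np.float32)
    missing = target_len - paragraph_embeddings.shape[0]
    if missing < 0:
        raise ValueError("document has more paragraphs than target_len")
    # np.repeat with zero repeats yields an empty (0, dim) array, so documents
    # that already have target_len rows pass through unchanged.
    padding = np.repeat(np.reshape(pad_vector, (1, -1)), missing, axis=0)
    return np.concatenate([paragraph_embeddings, padding.astype(np.float32)], axis=0)

# Two fake documents with 3 and 5 paragraphs and an embedding dimension of 4.
rng = np.random.default_rng(0)
fake_documents = [rng.normal(size=(3, 4)), rng.normal(size=(5, 4))]
pad_vector = np.zeros(4, dtype=np.float32)

# Stack into the (documents, paragraphs, embedding dimension) shape an LSTM expects.
batch = np.stack([pad_document(doc, 6, pad_vector) for doc in fake_documents])
print(batch.shape)
```

In the notebook the corresponding sizes are `PARAGRAPH_QTY` (2048) paragraphs and 512 dimensions for the Universal Sentence Encoder.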

In [8]:
use_embedder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")



In [36]:
use_pad = use_embedder("")

<tf.Tensor: shape=(1, 512), dtype=float32, numpy=
array([[ 9.67169367e-03, -2.54486930e-02, -3.17078345e-02,
        -2.97019519e-02, -1.05562985e-01,  1.94779057e-02,
         5.57217561e-03, -6.65287301e-02, -9.96011775e-03,
        -2.87357699e-02, -4.46397811e-02, -3.81327048e-03,
        -2.46620625e-02,  1.41643267e-02,  1.62112806e-02,
         3.12007181e-02,  2.16443427e-02, -1.47441644e-02,
        -3.72272567e-03, -1.33331269e-02,  9.22790088e-04,
         6.21030889e-02,  4.95002009e-02, -7.31829880e-03,
         9.38552152e-03, -6.95153605e-03,  1.25779901e-02,
         1.24983238e-02, -7.82670360e-03, -4.84213866e-02,
        -1.85677093e-02, -1.64676979e-02, -7.40418630e-03,
        -5.79094924e-02, -1.73449591e-02, -6.26746789e-02,
        -4.70428318e-02, -2.61945203e-02, -9.74262424e-04,
        -8.06684420e-02,  3.92258726e-02, -1.62350815e-02,
        -2.77670939e-02, -6.16709851e-02,  1.24697713e-02,
        -3.87554914e-02, -1.99186374e-02,  5.62777333e-02,
      

In [15]:
time_start = time.perf_counter()
Xn_text_embed = [use_embedder(document) for document in dataset['block_documents']]
time_end = time.perf_counter()
print(time_end-time_start)

2214.054824677296


In [40]:
padded_documents = []
for document in Xn_text_embed:
    # tf.repeat also covers documents that already have PARAGRAPH_QTY rows (zero repeats).
    pad = tf.repeat(use_pad, repeats=PARAGRAPH_QTY-document.shape[0], axis=0)
    padded_documents.append(tf.concat([document, pad], 0))

In [42]:
Xn_tensor = tf.convert_to_tensor(padded_documents)

In [76]:
model_use = Sequential()
model_use.add(Bidirectional(LSTM(512, return_sequences=True, input_shape=(PARAGRAPH_QTY, 512))))
model_use.add(Bidirectional(LSTM(units=512, return_sequences=False)))
model_use.add(Dense(256, activation='relu'))
model_use.add(Dense(4, activation='softmax'))
model_use.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])

In [86]:
model_use.fit(Xn_tensor, labels, validation_split=0.3, epochs=5, batch_size=BATCH_SIZE)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc4fd7f0940>

In [87]:
model_use.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_9 (Bidirectio  (None, 2048, 1024)       4198400   
 nal)                                                            
                                                                 
 bidirectional_10 (Bidirecti  (None, 1024)             6295552   
 onal)                                                           
                                                                 
 dense_14 (Dense)            (None, 256)               262400    
                                                                 
 dense_15 (Dense)            (None, 4)                 1028      
                                                                 
Total params: 10,757,380
Trainable params: 10,757,380
Non-trainable params: 0
_________________________________________________________________


### Document classification with Longformer - Spanish

In [10]:
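from transformers import RobertaTokenizer, RobertaModel
# The Longformer checkpoint is loaded with the RoBERTa classes, so its
# global-attention weights are not used (see the warning below).
long_tokenizer = RobertaTokenizer.from_pretrained("mrm8488/longformer-base-4096-spanish")
long_model = RobertaModel.from_pretrained("mrm8488/longformer-base-4096-spanish")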

Some weights of the model checkpoint at mrm8488/longformer-base-4096-spanish were not used when initializing RobertaModel: ['roberta.encoder.layer.6.attention.self.value_global.bias', 'roberta.encoder.layer.4.attention.self.value_global.weight', 'roberta.encoder.layer.1.attention.self.value_global.weight', 'roberta.encoder.layer.11.attention.self.value_global.bias', 'lm_head.layer_norm.weight', 'roberta.encoder.layer.6.attention.self.value_global.weight', 'roberta.encoder.layer.0.attention.self.value_global.weight', 'roberta.encoder.layer.6.attention.self.query_global.bias', 'roberta.encoder.layer.10.attention.self.value_global.weight', 'roberta.encoder.layer.6.attention.self.key_global.weight', 'roberta.encoder.layer.7.attention.self.query_global.bias', 'roberta.encoder.layer.5.attention.self.value_global.weight', 'roberta.encoder.layer.8.attention.self.query_global.weight', 'roberta.encoder.layer.4.attention.self.query_global.weight', 'roberta.encoder.layer.7.attention.self.key_globa

In [11]:
inputs = long_tokenizer("", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = long_model(**inputs)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
long_pad = tf.convert_to_tensor(cls_embeddings)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

In [24]:
def long_embedder(document):
    embeddings = []
    for page in document:
        inputs = long_tokenizer(page, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = long_model(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings)
    print("done document")
    return tf.convert_to_tensor(torch.cat(embeddings))

In [None]:
Xn_long_embed = [long_embedder(document) for document in dataset['block_documents']]

done document
[... repeated for each embedded document, output truncated]

In [None]:
padded_long_documents = []
for document in Xn_long_embed:
    # tf.repeat also covers documents that already have PARAGRAPH_QTY rows (zero repeats).
    pad = tf.repeat(long_pad, repeats=PARAGRAPH_QTY-document.shape[0], axis=0)
    padded_long_documents.append(tf.concat([document, pad], 0))

In [None]:
Xn_long_tensor = tf.convert_to_tensor(padded_long_documents)

In [None]:
model_long = Sequential()
model_long.add(Bidirectional(LSTM(512, return_sequences=True, input_shape=(PARAGRAPH_QTY, 768))))
model_long.add(Bidirectional(LSTM(units=512, return_sequences=False)))
model_long.add(Dense(256, activation='relu'))
model_long.add(Dense(4, activation='softmax'))
model_long.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])

In [None]:
model_long.fit(Xn_long_tensor, labels, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)

In [None]:
model_long.summary()

### Document classification with Tulio BERT

In [32]:
from transformers import AutoTokenizer, AutoModel
tulio_tokenizer = AutoTokenizer.from_pretrained("dccuchile/tulio-chilean-spanish-bert")
tulio_model = AutoModel.from_pretrained("dccuchile/tulio-chilean-spanish-bert")

Some weights of the model checkpoint at dccuchile/tulio-chilean-spanish-bert were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at dccuchile/tulio-chilean-spanish-bert and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this mode

In [None]:
inputs = tulio_tokenizer("", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = tulio_model(**inputs)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
tulio_pad = tf.convert_to_tensor(cls_embeddings)

In [None]:
def tulio_embedder(document):
    embeddings = []
    for page in document:
        inputs = tulio_tokenizer(page, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = tulio_model(**inputs)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings)
    print("done document")
    return tf.convert_to_tensor(torch.cat(embeddings))

In [33]:
Xn_tulio_embed = [tulio_embedder(document) for document in dataset['block_documents']]

KeyboardInterrupt: 

In [None]:
padded_tulio_documents = []
for document in Xn_tulio_embed:
    # tf.repeat also covers documents that already have PARAGRAPH_QTY rows (zero repeats).
    pad = tf.repeat(tulio_pad, repeats=PARAGRAPH_QTY-document.shape[0], axis=0)
    padded_tulio_documents.append(tf.concat([document, pad], 0))

In [None]:
Xn_tulio_tensor = tf.convert_to_tensor(padded_tulio_documents)

In [None]:
model_tulio = Sequential()
model_tulio.add(Bidirectional(LSTM(512, return_sequences=True, input_shape=(PARAGRAPH_QTY, 768))))
model_tulio.add(Bidirectional(LSTM(units=512, return_sequences=False)))
model_tulio.add(Dense(256, activation='relu'))
model_tulio.add(Dense(4, activation='softmax'))
model_tulio.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy", "AUC"])

In [None]:
model_tulio.fit(Xn_tulio_tensor, labels, validation_split=0.3, epochs=3, batch_size=BATCH_SIZE)

In [31]:
model_tulio.summary()

NameError: name 'model_tulio' is not defined

In [12]:
embeddings = []
time_start = time.perf_counter()
for page in dataset['block_documents'][0]:
    inputs = tulio_tokenizer(page, return_tensors="pt", padding=True, truncation=True)
    print('done tokenizing')
    with torch.no_grad():
        outputs = tulio_model(**inputs)
    print('done encoding')
    cls_embeddings = outputs.last_hidden_state[:, 0, :]
    embeddings.append(cls_embeddings)
#Xn_long_text_embed = long_pipe(dataset['block_documents'][0])
#Xn_long_text_embed = [long_pipe(document) for document in dataset['block_documents']]
time_end = time.perf_counter()
print(time_end-time_start)

done tokenizing
done encoding
[... the two lines above repeat for each block, output truncated]

In [None]:
embeddings[0].shape

In [None]:
Xn_tulio_embed = [tulio_embedder(document) for document in dataset['block_documents']]

done document


In [16]:
Xn_tensor = tf.convert_to_tensor(torch.cat(embeddings))
Xn_tensor.shape

TensorShape([1415, 768])

In [None]:
padded_documents = []
for document in embeddings:
    pad = np.concatenate(np.repeat([long_pad], PARAGRAPH_QTY-document.shape[0], axis=0), axis=0)
    padded_documents.append(tf.concat([document, pad], 0)) 

In [26]:
time_start = time.perf_counter()
inputs = tulio_tokenizer(dataset['documents'][50], return_tensors="pt", padding=True, truncation=True)
print('done tokenizing')
outputs = tulio_model(**inputs)
print('done encoding')
cls_embeddings = outputs.last_hidden_state[:, 0, :]
#Xn_long_text_embed = long_pipe(dataset['block_documents'][0])
#Xn_long_text_embed = [long_pipe(document) for document in dataset['block_documents']]
time_end = time.perf_counter()
print(time_end-time_start)
cls_embeddings.shape

done tokenizing


AttributeError: EagerTensor object has no attribute 'size'. 
        If you are looking for numpy-related methods, please run the following:
        from tensorflow.python.ops.numpy_ops import np_config
        np_config.enable_numpy_behavior()
      

In [30]:
cls_embeddings.shape

torch.Size([1, 768])