<div style="background:#FFFFE0;padding:20px;color:#000000;margin-top:10px;">
Imports necesarios para la ejecución de los módulos instalados con pip:

• pandas → import pandas as pd  
• numpy → import numpy as np  
• matplotlib → import matplotlib.pyplot as plt  
• seaborn → import seaborn as sns  
• scikit-learn → from sklearn.model_selection import train_test_split  
                    from sklearn.metrics import classification_report, confusion_matrix  
• torch → import torch  
• transformers → from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding  
• datasets → from datasets import Dataset  
</div>


In [12]:
#pip install transformers datasets scikit-learn pandas matplotlib seaborn torch
#!pip install "transformers[torch]" --upgrade


In [None]:
#!pip install "transformers[torch]" --upgrade
#!pip install --upgrade transformers


Collecting transformers
  Downloading transformers-4.54.1-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.54.1-py3-none-any.whl (11.2 MB)
   ---------------------------------------- 0.0/11.2 MB ? eta -:--:--
   ------------------- -------------------- 5.5/11.2 MB 37.2 MB/s eta 0:00:01
   ---------------------------------------- 11.2/11.2 MB 41.0 MB/s eta 0:00:00
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.54.0
    Uninstalling transformers-4.54.0:
      Successfully uninstalled transformers-4.54.0
Successfully installed transformers-4.54.1


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy-transformers 1.3.9 requires transformers<4.50.0,>=3.4.0, but you have transformers 4.54.1 which is incompatible.


In [14]:
#!pip install --upgrade transformers accelerate


<div style="background:#FFFFE0;padding:20px;color:#000000;margin-top:10px;">
Este bloque de código verifica si PyTorch puede usar la GPU (usualmente con CUDA) y cuál GPU está disponible. Es útil para asegurarse de que el entrenamiento del modelo se pueda hacer con aceleración por hardware, lo que reduce significativamente el tiempo.</div>


In [15]:
import torch

print("CUDA disponible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

CUDA disponible: True
GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU


# Clasificacion automatica de poemas segun su forma poetica(usando la carpeta forms)

In [9]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from datasets import Dataset
import torch

In [13]:
import os
import pandas as pd

ruta_base = "archive/forms"

# Inicializamos listas vacías
textos = []
etiquetas = []

# Recorrer cada carpeta (que es una clase)
for nombre_carpeta in os.listdir(ruta_base):
    ruta_carpeta = os.path.join(ruta_base, nombre_carpeta)
    if os.path.isdir(ruta_carpeta):
        for archivo in os.listdir(ruta_carpeta):
            ruta_archivo = os.path.join(ruta_carpeta, archivo)
            try:
                with open(ruta_archivo, 'r', encoding='utf-8') as f:
                    contenido = f.read().strip()
                    textos.append(contenido)
                    etiquetas.append(nombre_carpeta)
            except:
                continue

# Crear el DataFrame
df = pd.DataFrame({'text': textos, 'label': etiquetas})

# Ver los primeros datos
print(df.head())
print(df['label'].value_counts().to_string())


                                                text label
0  2 ABC of H.k. and China revised vision.\nBarre...   abc
1  Apparently life without love, is no life at al...   abc
2  A abc angles on angels flaws (poem)\nMix with ...   abc
3  A abc Brazil dance (poem)\nJack of crack in po...   abc
4  ABC... I can't go on\n123... what's the next o...   abc
label
acrostic                       100
allegory                       100
free-verse                     100
cinquain                       100
cavatina                       100
ballad                         100
ballade                        100
tetractys                      100
triolet                        100
villanelle                     100
stanza                         100
syllabic-verse                 100
epigram                        100
dirge                          100
clerihew                       100
epitaph                        100
elegy                          100
epistle                        100
verse     

In [14]:
from sklearn.preprocessing import LabelEncoder

# Etiquetas que sí quieres mantener con su nombre
clases_deseadas = ['haiku', 'sonnet']

# Reasignamos todo lo que no es haiku ni sonnet como "otros"
df['label'] = df['label'].apply(lambda x: x if x in clases_deseadas else 'otros')

# Ahora sí codificamos las tres clases
le = LabelEncoder()
df['label_id'] = le.fit_transform(df['label'])

# Imprimir para verificar
print(df['label'].value_counts())
print(df['label_id'].value_counts())
print(le.classes_)  # Te dirá cuál clase es cuál número


label
otros     6140
haiku       99
sonnet      79
Name: count, dtype: int64
label_id
1    6140
0      99
2      79
Name: count, dtype: int64
['haiku' 'otros' 'sonnet']


In [16]:
from transformers import BertTokenizer

# Cargar el tokenizer de BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

dataset = Dataset.from_pandas(df[['text', 'label_id']])
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("label_id", "labels")



Map: 100%|██████████| 6318/6318 [00:22<00:00, 276.68 examples/s]


In [21]:
# Dividir en entrenamiento y prueba (80% - 20%)
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)

# Asignar a variables por claridad
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

# Confirmar tamaños
print(f"Entrenamiento: {len(train_dataset)} ejemplos")
print(f"Evaluación: {len(eval_dataset)} ejemplos")



Entrenamiento: 5054 ejemplos
Evaluación: 1264 ejemplos


In [22]:
from transformers import BertForSequenceClassification

# Indicar la cantidad de clases (por ejemplo: 3 si tienes haiku, sonnet y otros)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


con 3 epecas hay confuncion en varios poemas

In [44]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

# Preparar colador de datos
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Argumentos básicos de entrenamiento compatibles
training_args = TrainingArguments(
    output_dir="./resultados",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    logging_dir="./logs"
)

# Definir Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Entrenar
trainer.train()


  trainer = Trainer(


Step,Training Loss
500,0.0618
1000,0.0223
1500,0.0144
2000,0.0088
2500,0.0047
3000,0.0031


  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)
  return forward_call(*args, **kwargs)


TrainOutput(global_step=3160, training_loss=0.018526437275017363, metrics={'train_runtime': 712.8334, 'train_samples_per_second': 35.45, 'train_steps_per_second': 4.433, 'total_flos': 1662219016496640.0, 'train_loss': 0.018526437275017363, 'epoch': 5.0})

In [47]:
model.save_pretrained("./modelo_poemasv2")
tokenizer.save_pretrained("./modelo_poemasv2")


('./modelo_poemasv2\\tokenizer_config.json',
 './modelo_poemasv2\\special_tokens_map.json',
 './modelo_poemasv2\\vocab.txt',
 './modelo_poemasv2\\added_tokens.json')

In [48]:
from transformers import BertTokenizer, BertForSequenceClassification

# Cargar modelo entrenado
modelo_entrenado = BertForSequenceClassification.from_pretrained("./modelo_poemasv2")
tokenizer_entrenado = BertTokenizer.from_pretrained("./modelo_poemasv2")


In [49]:
import torch

poema = """One of the four great masters of Japanese haiku, Matsuo Bashō is known for his simplistic yet thought-provoking haikus. “The Old Pond”, arguably his most famous piece, stays true to his style of couching observations of human nature within natural imagery. One interpretation is that by metaphorically using the ‘pond’ to symbolize the mind, Bashō brings to light the impact of external stimuli (embodied by the frog, a traditional subject of Japanese poetry) on the human mind. 
"""

# Preparar input
inputs = tokenizer_entrenado(poema, return_tensors="pt", padding='max_length', truncation=True, max_length=128)
inputs = {k: v.to(modelo_entrenado.device) for k, v in inputs.items()}

# Predecir
modelo_entrenado.eval()
with torch.no_grad():
    outputs = modelo_entrenado(**inputs)
    logits = outputs.logits
    predicted_class_id = logits.argmax().item()

print("Predicción:", le.classes_[predicted_class_id])  


Predicción: otros


  return forward_call(*args, **kwargs)


In [50]:
import os
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForSequenceClassification
from pathlib import Path

# Cargar modelo y tokenizer
model_path = "./modelo_poemasv2"
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Clases en el mismo orden del entrenamiento
labels = ["haiku", "sonnet", "otros"]

# Dispositivo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Carpeta con los archivos .txt
carpeta_poemas = "./Poemasvar"

# Umbral de confianza para decir "desconocido"
umbral_confianza = 0.65  # puedes ajustar entre 0.5 y 0.7 según tu preferencia

# Recorrer todos los archivos .txt
for archivo in Path(carpeta_poemas).glob("*.txt"):
    with open(archivo, "r", encoding="utf-8") as f:
        texto = f.read().strip()

    # Tokenizar
    inputs = tokenizer(texto, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Predicción
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = F.softmax(logits, dim=-1)
        confidence, pred_id = torch.max(probs, dim=1)

    # Evaluar confianza
    if confidence.item() < umbral_confianza:
        clase_predicha = "desconocido"
    else:
        clase_predicha = labels[pred_id.item()]

    print(f"{archivo.name}: {clase_predicha} (confianza: {confidence.item():.2f})")


haiku10_masaoka_shiki_cold.txt: sonnet (confianza: 1.00)
haiku11_modern_sea.txt: sonnet (confianza: 1.00)
haiku12_modern_lanterns.txt: sonnet (confianza: 1.00)
haiku13_modern_train.txt: sonnet (confianza: 1.00)
haiku14_modern_rain.txt: sonnet (confianza: 1.00)
haiku15_modern_moon.txt: haiku (confianza: 1.00)
haiku16_modern_street.txt: haiku (confianza: 1.00)
haiku17_modern_coffee.txt: sonnet (confianza: 1.00)
haiku18_modern_tree.txt: sonnet (confianza: 1.00)
haiku19_modern_beach.txt: sonnet (confianza: 1.00)
haiku1_matsuo_basho_frog.txt: haiku (confianza: 1.00)
haiku20_modern_fireflies.txt: sonnet (confianza: 1.00)
haiku2_matsuo_basho_autumn.txt: haiku (confianza: 1.00)
haiku3_matsuo_basho_summer.txt: sonnet (confianza: 1.00)
haiku4_yosa_buson_butterfly.txt: sonnet (confianza: 1.00)
haiku5_yosa_buson_moon.txt: sonnet (confianza: 1.00)
haiku6_kobayashi_issa_snail.txt: sonnet (confianza: 1.00)
haiku7_kobayashi_issa_dewdrop.txt: sonnet (confianza: 1.00)
haiku8_kobayashi_issa_child.txt: so