<a href="https://colab.research.google.com/github/NeskaCleo/MonicaGlez/blob/main/PLN/Trasformer_libros.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Genre/Category Classification
## Objective:
Use the title, description, and author fields to classify books into their respective categories or genres.
## How:
Train a transformer model to identify patterns in the text that indicate a specific genre. For example, it could learn that certain words or styles are more common in science fiction than in historical literature.
## Benefit:
This can help online bookstores categorize new books automatically and improve recommendations to users.
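
The cells in this notebook feed only the `Description` column to the model. One possible way to use all three fields named in the objective is to join them into a single input string. The helper name `build_input_text` and the `" | "` separator are assumptions made for illustration and are not part of the notebook.

```python
def build_input_text(title, author, description, sep=" | "):
    """Join the available fields into one string, skipping missing ones."""
    parts = []
    for value in (title, author, description):
        # Treat None and NaN (NaN is the only value not equal to itself) as missing.
        if value is None or value != value:
            continue
        text = str(value).strip()
        if text:  # also skip blank strings
            parts.append(text)
    return sep.join(parts)

print(build_input_text("Goat Brothers", "By Colton, Larry", None))
# → Goat Brothers | By Colton, Larry
```

With pandas, a helper like this could be applied row by row to build a new text column before tokenization.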

In [None]:
!pip install transformers torch

In [2]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

In [3]:
# Load the data from a CSV file
data = pd.read_csv("/content/drive/MyDrive/BooksDataset.csv")
print(data.head())

                                               Title  \
0                                      Goat Brothers   
1                                 The Missing Person   
2                  Don't Eat Your Heart Out Cookbook   
3  When Your Corporate Umbrella Begins to Leak: A...   
4    Amy Spangler's Breastfeeding : A Parent's Guide   

                    Authors Description              Category  \
0          By Colton, Larry         NaN     History , General   
1        By Grumbach, Doris         NaN     Fiction , General   
2  By Piscatella, Joseph C.         NaN   Cooking , Reference   
3         By Davis, Paul D.         NaN                   NaN   
4          By Spangler, Amy         NaN                   NaN   

          Publisher                 Publish Date                    Price  
0         Doubleday      Friday, January 1, 1993  Price Starting at $8.79  
1  Putnam Pub Group        Sunday, March 1, 1981  Price Starting at $4.99  
2    Workman Pub Co  Thursday, September 1, 

In [4]:
# Exploratory analysis examples
print(data.describe())
print(data['Category'].value_counts())

                 Title Authors  \
count           103082  103082   
unique           97818   63580   
top     The Nutcracker      By   
freq                12    1043   

                                              Description            Category  \
count                                               70213               76912   
unique                                              68831                3106   
top     For Ingest Only - Data needs to be cleaned up ...   Fiction , General   
freq                                                   30                2549   

               Publisher               Publish Date                    Price  
count             103074                     103082                   103082  
unique             13029                        956                     1387  
top     Simon & Schuster  Thursday, January 1, 2004  Price Starting at $5.29  
freq                1521                        868                    41876  
 Fiction , General          

In [5]:
# Drop rows where 'Description' or 'Category' is null
data = data.dropna(subset=['Description', 'Category'])

# Then drop rows whose 'Description' contains the placeholder "For Ingest Only"
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data = data[~data['Description'].str.contains("For Ingest Only", na=False)].copy()

# Normalize the text: lowercase and strip punctuation (raw strings keep the regex escapes valid)
data['Description'] = data['Description'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
data['Title'] = data['Title'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
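
To see what the normalization rule does, here is a small sketch using only the standard `re` module. Deleting punctuation outright fuses hyphenated words together. The second function, `punctuation_to_space`, is a hypothetical alternative that the notebook does not use; it is shown only for comparison.

```python
import re

def strip_punctuation(text):
    """Same rule as the notebook: lowercase, then delete every non-word, non-space character."""
    return re.sub(r"[^\w\s]", "", text.lower())

def punctuation_to_space(text):
    """Hypothetical variant: replace punctuation with a space, then collapse repeated whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()

sample = "An eleven-year-old's poems."
print(strip_punctuation(sample))     # → an elevenyearolds poems
print(punctuation_to_space(sample))  # → an eleven year old s poems
```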

In [6]:
data.describe()

| | Title | Authors | Description | Category | Publisher | Publish Date | Price |
|---|---|---|---|---|---|---|---|
| count | 65280 | 65280 | 65280 | 65280 | 65280 | 65280 | 65280 |
| unique | 61672 | 39268 | 64010 | 2983 | 4747 | 798 | 951 |
| top | the night before christmas | By Roberts, Nora | complemented by easyto use reliable maps helpf... | Fiction , General | Simon & Schuster | Monday, September 1, 2003 | Price Starting at $5.29 |
| freq | 7 | 182 | 16 | 2274 | 1311 | 448 | 28652 |


In [7]:
# Get the unique categories
unique_categories = data['Category'].unique()
print("Unique categories:", unique_categories)

# Count how many books fall into each category
category_counts = data['Category'].value_counts()
print("Book count per category:", category_counts)


Unique categories: [' Poetry , General' ' Biography & Autobiography , General'
 ' Health & Fitness , Diet & Nutrition , Diets' ...
 ' Young Adult Fiction , Performing Arts , Music'
 ' Computers , Mathematical & Statistical Software'
 ' Young Adult Nonfiction , Biography & Autobiography , Science & Technology']
Book count per category:  Fiction , General                                                            2274
 Fiction , Literary                                                           1647
 Fiction , Mystery & Detective , General                                      1555
 Fiction , Thrillers , General                                                1105
 Fiction , Thrillers , Suspense                                               1042
                                                                              ... 
 Travel , Cruises                                                                1
 Mathematics , Mathematical Analysis                                       
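
The counts show a long tail: after cleaning there are 2,983 distinct category strings, and some of them contain a single book. One possible way to shrink the label space, which this notebook does not do, is to keep only the first component of each category string. The helper name `top_level_genre` is an assumption for illustration.

```python
from collections import Counter

def top_level_genre(category):
    """Return the first component of a ' A , B , C' category string."""
    return category.split(",")[0].strip()

# A few category strings copied from the output above.
categories = [
    " Fiction , General",
    " Fiction , Literary",
    " Fiction , Mystery & Detective , General",
    " Travel , Cruises",
    " Mathematics , Mathematical Analysis",
]
genre_counts = Counter(top_level_genre(c) for c in categories)
print(genre_counts.most_common())
# → [('Fiction', 3), ('Travel', 1), ('Mathematics', 1)]
```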

In [9]:
# Load BERT's pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [10]:
# Tokenize one example text
example_text = data.iloc[0]['Description']  # Change the row index or column to inspect another example
tokens = tokenizer.tokenize(example_text)

print("Original text:", example_text)
print("Tokens:", tokens)


Original text: collects poems written by the elevenyearold muscular dystrophy patient sharing his feelings and thoughts about his life the deaths of his siblings nature faith and hope
Tokens: ['collects', 'poems', 'written', 'by', 'the', 'eleven', '##year', '##old', 'muscular', 'd', '##yst', '##rop', '##hy', 'patient', 'sharing', 'his', 'feelings', 'and', 'thoughts', 'about', 'his', 'life', 'the', 'deaths', 'of', 'his', 'siblings', 'nature', 'faith', 'and', 'hope']


In [11]:
# Convert the tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Token IDs:", token_ids)


Token IDs: [17427, 5878, 2517, 2011, 1996, 5408, 29100, 11614, 13472, 1040, 27268, 18981, 10536, 5776, 6631, 2010, 5346, 1998, 4301, 2055, 2010, 2166, 1996, 6677, 1997, 2010, 9504, 3267, 4752, 1998, 3246]
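
The `##` pieces in the token list come from WordPiece, which splits a word that is not in the vocabulary into the longest sub-word pieces that are. The following is a minimal sketch of that greedy longest-match-first idea, not the tokenizer's actual implementation. The function name `wordpiece_tokenize` and the seven-entry `toy_vocab` are assumptions chosen to reproduce two of the splits shown above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"eleven", "##year", "##old", "d", "##yst", "##rop", "##hy"}
print(wordpiece_tokenize("elevenyearold", toy_vocab))  # → ['eleven', '##year', '##old']
print(wordpiece_tokenize("dystrophy", toy_vocab))      # → ['d', '##yst', '##rop', '##hy']
```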


In [12]:
def preparar_datos_para_bert(texto, tokenizer, max_len=128):  # BERT accepts up to 512 tokens, but 128 is used here because RAM is limited
    # Tokenize and truncate first, leaving room for [CLS] and [SEP] so the closing [SEP] is never cut off
    tokens = ['[CLS]'] + tokenizer.tokenize(texto)[:max_len - 2] + ['[SEP]']

    # Convert the tokens to IDs and pad with the [PAD] ID up to max_len
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    n_padding = max_len - len(token_ids)
    input_ids = token_ids + [tokenizer.pad_token_id] * n_padding

    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(token_ids) + [0] * n_padding

    return input_ids, attention_mask
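
As a worked example of the padding and truncation step, the sketch below runs the same logic on plain lists with a small `max_len`. It assumes the special-token IDs of the `bert-base-uncased` vocabulary (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102). The function name `pad_or_truncate` is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in the bert-base-uncased vocabulary

def pad_or_truncate(token_ids, max_len):
    """Wrap token IDs in [CLS]/[SEP], truncate or pad to max_len, and build the mask."""
    body = token_ids[:max_len - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    n_padding = max_len - len(ids)
    input_ids = ids + [PAD_ID] * n_padding
    attention_mask = [1] * len(ids) + [0] * n_padding
    return input_ids, attention_mask

# IDs of 'collects', 'poems', 'written', taken from the token ID output above.
print(pad_or_truncate([17427, 5878, 2517], max_len=8))
# → ([101, 17427, 5878, 2517, 102, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0])
print(pad_or_truncate([17427, 5878, 2517], max_len=4))
# → ([101, 17427, 5878, 102], [1, 1, 1, 1])
```

In the second call the text is too long for `max_len=4`, so the last token is dropped while `[SEP]` is kept at the end.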


In [14]:
# Apply the function to the 'Description' column
data['bert_input_ids'], data['bert_attention_mask'] = zip(*data['Description'].apply(
    lambda x: preparar_datos_para_bert(x, tokenizer, max_len=128)))
X_input_ids = list(data['bert_input_ids'])
X_attention_masks = list(data['bert_attention_mask'])
y = list(data['Category'])  # 'Category' is the label column


In [15]:
# Encode the labels
label_encoder = LabelEncoder()
data['encoded_category'] = label_encoder.fit_transform(data['Category'])
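
Conceptually, the label encoding maps each distinct category string to an integer, with the classes kept in sorted order. The sketch below reproduces that idea in plain Python. The helper `fit_label_mapping` is hypothetical and stands in for scikit-learn's `LabelEncoder` only for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    classes = sorted(set(labels))
    to_id = {label: index for index, label in enumerate(classes)}
    return classes, to_id

labels = [" Poetry , General", " Fiction , General", " Poetry , General"]
classes, to_id = fit_label_mapping(labels)
encoded = [to_id[label] for label in labels]    # strings -> integers
decoded = [classes[index] for index in encoded]  # integers -> strings
print(encoded)            # → [1, 0, 1]
print(decoded == labels)  # → True
```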

The data now has to be converted into a format PyTorch can use, which generally means creating DataLoaders.

Creating tensors and DataLoaders:

PyTorch converts the data into tensors, which are then wrapped in DataLoaders. A DataLoader serves the data in batches during training.

The categorical labels were already converted to numeric form above with scikit-learn's LabelEncoder; the next step is to split the data:

In [16]:
# Split into training and validation sets
train_data, validation_data = train_test_split(data, test_size=0.1)

In [17]:
# Create DataLoaders for training and validation
def create_dataloader(df, batch_size=32, shuffle=True):
    input_ids = torch.tensor(list(df['bert_input_ids']))
    attention_masks = torch.tensor(list(df['bert_attention_mask']))
    labels = torch.tensor(df['encoded_category'].values)
    dataset = TensorDataset(input_ids, attention_masks, labels)
    # Shuffle only for training; a sequential sampler keeps validation batches in DataFrame order
    sampler = RandomSampler(dataset) if shuffle else SequentialSampler(dataset)
    return DataLoader(dataset, sampler=sampler, batch_size=batch_size)

batch_size = 16  # Smaller batch size, lower RAM usage
train_dataloader = create_dataloader(train_data, batch_size=batch_size)
validation_dataloader = create_dataloader(validation_data, batch_size=batch_size, shuffle=False)
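
What a DataLoader does with a sampler can be sketched in plain Python: pick an order over the sample indices, then cut that order into batches. The function `iter_batches` is a hypothetical stand-in, not PyTorch's implementation. It also shows why sampling order matters: predictions collected from shuffled batches no longer line up with labels read in the original order.

```python
import random

def iter_batches(items, batch_size, shuffle, seed=0):
    """Yield lists of at most batch_size items, optionally in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)  # seeded so the example is repeatable
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]

samples = list(range(10))
sequential = [x for batch in iter_batches(samples, 4, shuffle=False) for x in batch]
shuffled = [x for batch in iter_batches(samples, 4, shuffle=True) for x in batch]
print(sequential == samples)        # → True (original order preserved)
print(sorted(shuffled) == samples)  # → True (same items, order not guaranteed)
```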



In [18]:
# Step 9: Configure the BERT model for sequence classification
num_labels = len(label_encoder.classes_)
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=num_labels,
    output_attentions=False,
    output_hidden_states=False,
)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Step 10: Configure the optimizer
# torch.optim.AdamW replaces transformers.AdamW, which is deprecated
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)



In [20]:
# Step 11: Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
epochs = 2   # More epochs would be better, but there is not enough RAM

for epoch in range(epochs):
    total_train_loss = 0
    model.train()

    for batch in train_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_attention_mask, b_labels = batch
        model.zero_grad()

        outputs = model(b_input_ids, attention_mask=b_attention_mask, labels=b_labels)
        loss = outputs.loss
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}/{epochs}, Training Loss: {total_train_loss/len(train_dataloader)}")


Epoch 1/2, Training Loss: 6.614867546070116
Epoch 2/2, Training Loss: 6.514895183718023
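
As a reference point for the losses printed above, a classifier that assigns equal probability to each of K classes has a cross-entropy of ln(K). The sketch below computes that value, assuming K is the 2,983 distinct categories reported by `data.describe()` after cleaning.

```python
import math

num_classes = 2983  # distinct categories after cleaning, from data.describe() above
uniform_loss = math.log(num_classes)  # cross-entropy of a uniform guess, in nats
print(round(uniform_loss, 2))  # → 8.0
```

The reported losses of about 6.6 and 6.5 are below this uniform baseline but still far from zero, which suggests that two epochs have only started to fit such a large label set.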


In [21]:
# Evaluate the model
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

model.eval()
total_eval_accuracy = 0

with torch.no_grad():
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_attention_mask, b_labels = batch

        outputs = model(b_input_ids, attention_mask=b_attention_mask, labels=b_labels)
        logits = outputs.logits
        total_eval_accuracy += flat_accuracy(logits.detach().cpu().numpy(), b_labels.to('cpu').numpy())

print("Validation Accuracy:", total_eval_accuracy/len(validation_dataloader))

Validation Accuracy: 0.02022058823529412
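
To put the validation accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent category. The sketch below uses the counts reported earlier for the full cleaned dataset. The proportion in the random validation split may differ slightly, so this is an approximation.

```python
most_common_count = 2274  # books in ' Fiction , General' after cleaning
total_books = 65280       # rows remaining after cleaning
majority_baseline = most_common_count / total_books
print(round(majority_baseline, 4))  # → 0.0348
```

An accuracy of about 0.020 is below this rough baseline of about 0.035, so the model as trained here does not yet beat always guessing the largest class.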


In [23]:
y_true = validation_data['encoded_category'].values  # True labels of the validation set
unique_labels = np.unique(y_true)  # Unique labels present in the validation data
label_names = label_encoder.classes_  # Class names stored in the encoder
print("Unique labels:", unique_labels)
print("Encoder class names:", label_names)


Unique labels: [   3    4   11 ... 2974 2975 2976]
Encoder class names: [' Antiques & Collectibles , Americana' ' Antiques & Collectibles , Art'
 ' Antiques & Collectibles , Autographs' ...
 ' Young Adult Nonfiction , Sports & Recreation , Winter Sports'
 ' Young Adult Nonfiction , Study Aids , General'
 ' Young Adult Nonfiction , Technology , General']


In [26]:
# Generate a classification report (optional)
y_true = []
y_pred = []

model.eval()
with torch.no_grad():
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_attention_mask, b_labels = batch
        outputs = model(b_input_ids, attention_mask=b_attention_mask)
        logits = outputs.logits
        # Collect labels and predictions from the same batch so they stay aligned
        # even when the DataLoader shuffles
        y_true.extend(b_labels.cpu().numpy())
        y_pred.extend(np.argmax(logits.cpu().numpy(), axis=1))

# Report only the labels that occur in the validation data, named by their encoded ID
unique_labels = np.unique(y_true)
class_names = [str(label) for label in unique_labels]
print(classification_report(y_true, y_pred, labels=unique_labels,
                            target_names=class_names, zero_division=0))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         3
           4       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         1
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         4
          25       0.00      0.00      0.00         1
          29       0.00      0.00      0.00         1
          30       0.00      0.00      0.00         1
          35       0.00      0.00      0.00         4
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         1
          41       0.00      0.00      0.00         2
          42       0.00      0.00      0.00         2
          47       0.00      0.00      0.00         1
          49       0.00      0.00      0.00         1
          50       0.00      0.00      0.00         1
          52       0.00      0.00      0.00         1
          53       0.00    

In [27]:
# Try the model on a new input
nueva_descripcion = "The Little Prince is a timeless tale by Antoine de Saint-Exupéry. It narrates the adventures of a young prince who travels between planets, exploring themes of loneliness, friendship, and love. This classic story is cherished for its deep philosophical insights."

# Prepare the input for BERT (the same way as the training data)
input_ids, attention_mask = preparar_datos_para_bert(nueva_descripcion, tokenizer, max_len=128)

# Convert to tensors and create a DataLoader
input_ids = torch.tensor([input_ids])
attention_mask = torch.tensor([attention_mask])
dataset = TensorDataset(input_ids, attention_mask)
dataloader = DataLoader(dataset, batch_size=1)

# Predict with the model
model.eval()
with torch.no_grad():
    for batch in dataloader:
        b_input_ids, b_attention_mask = batch
        b_input_ids = b_input_ids.to(device)
        b_attention_mask = b_attention_mask.to(device)
        outputs = model(b_input_ids, attention_mask=b_attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().numpy()[0]

# Convert the encoded label back to the original category
predicted_category = label_encoder.inverse_transform([predicted_label])[0]
print("Predicted category:", predicted_category)


Predicted category:  Fiction , Mystery & Detective , General
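
The prediction above keeps only the single highest logit. A softmax over the logits turns them into probabilities, which makes it possible to list the top few candidates and see how confident the model is. This is a minimal sketch in plain Python. The function name `top_k_probabilities`, the three class names, and the logit values are all hypothetical and are not output from the trained model.

```python
import math

def top_k_probabilities(logits, class_names, k=3):
    """Softmax the logits and return the k most probable (name, probability) pairs."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for three classes, for illustration only.
names = ["Fiction", "Poetry", "Cooking"]
for name, prob in top_k_probabilities([2.0, 1.0, 0.0], names, k=2):
    print(f"{name}: {prob:.3f}")
# → Fiction: 0.665
# → Poetry: 0.245
```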
