# Project: Summary Generation System

## Description
This project develops an automated system for generating abstractive summaries of scientific articles.  
The main goal is to shorten long academic texts while organizing the information into key sections, so that users can reach the most relevant points quickly and spend less time reading and analyzing technical and scientific texts.

## Authors
- Oscar Alberto Sánchez Martinez
- Octavio Ortega Hernández
- De La Fuente Cuamatzi Jesus
- Becerra Tapia Alberto

## Text extraction 1 using OCR


In [None]:
# Libraries required for Tesseract
!sudo apt-get install tesseract-ocr
!pip install pytesseract pdf2image opencv-python-headless pillow
!sudo apt-get install poppler-utils



In [None]:
import os
import numpy as np
from pdf2image import convert_from_path
import cv2
import pytesseract
from PIL import Image
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Preprocess a page image before OCR
def preprocesar_imagen(imagen):
    img = cv2.cvtColor(np.array(imagen), cv2.COLOR_RGB2GRAY)  # Convert to grayscale
    img = cv2.GaussianBlur(img, (5, 5), 0)  # Gaussian blur
    # Binarize; with THRESH_OTSU the threshold is computed automatically, so the value 128 is ignored
    _, img_binaria = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    img_denoised = cv2.fastNlMeansDenoising(img_binaria, h=30)  # Reduce noise
    return img_denoised

# Extract text from an image
def extraer_texto_de_imagen(imagen):
    texto = pytesseract.image_to_string(imagen, lang='eng')  # Use the English language model
    return texto

# Main pipeline: PDF -> page images -> preprocessed images -> text per page
def procesar_pdf_a_texto(ruta_pdf):
    print("Convirtiendo PDF a imágenes en memoria...")
    imagenes = convert_from_path(ruta_pdf, dpi=300, fmt='png')
    texto_extraido = []

    for i, imagen in enumerate(imagenes):
        print(f"Preprocesando imagen de página {i+1}...")
        imagen_procesada = preprocesar_imagen(imagen)
        texto = extraer_texto_de_imagen(imagen_procesada)
        texto_extraido.append(texto)

    return texto_extraido


ruta_pdf = '/content/articulo_3.pdf'

# Process the PDF and get the text of each page
todo_el_texto = procesar_pdf_a_texto(ruta_pdf)

# Output path for saving the text file in Google Drive
dir_salida = '/content/drive/My Drive/imagenes'
os.makedirs(dir_salida, exist_ok=True)
output_path = os.path.join(dir_salida, "output_text_prueba4.txt")

# Save the extracted text to a file
with open(output_path, "w", encoding="utf-8") as f:
    for num_pagina, texto_pagina in enumerate(todo_el_texto, start=1):
        f.write(f"--- Página {num_pagina} ---\n")
        f.write(texto_pagina + "\n")
print("Texto extraído y guardado en Drive")


Mounted at /content/drive
Convirtiendo PDF a imágenes en memoria...
Preprocesando imagen de página 1...
Preprocesando imagen de página 2...
Preprocesando imagen de página 3...
Preprocesando imagen de página 4...
Preprocesando imagen de página 5...
Preprocesando imagen de página 6...
Preprocesando imagen de página 7...
Preprocesando imagen de página 8...
Texto extraído y guardado en Drive
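
The file saved above separates pages with `--- Página N ---` marker lines. As a minimal sketch, the saved text could be split back into per-page strings as shown below. It assumes the marker format is exactly the one written by the cell above; `split_pages` is a hypothetical helper that is not part of the original notebook.

```python
import re

# Marker format assumed to match the f-string used when saving: "--- Página N ---"
PAGE_MARKER = re.compile(r"^--- Página (\d+) ---$", re.MULTILINE)

def split_pages(saved_text):
    """Return {page_number: page_text} from the saved OCR output."""
    parts = PAGE_MARKER.split(saved_text)
    # parts = [text_before_first_marker, num1, text1, num2, text2, ...]
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages

sample = "--- Página 1 ---\nFirst page text\n--- Página 2 ---\nSecond page text\n"
pages = split_pages(sample)
print(pages)
```

Keeping the page number as the dictionary key makes it possible to summarize or inspect a single page without re-running OCR.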


# Text extraction 2 using NLP

In [None]:
!pip install PyPDF2

# Load libraries
import PyPDF2
import re
import os

# Step 1: Extract the text of a PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Guard against pages with no extractable text
            text += (page.extract_text() or "") + "\n"
    return text

# Step 2: Clean the text
def clean_text(text):
    text = re.sub(r'\$.*?\$', '', text)  # Remove inline math written as $...$
    text = re.sub(r'\\\[.*?\\\]', '', text)  # Remove display math written as \[...\]
    text = re.sub(r'[^\w\s.,;:()\-]', '', text)  # Remove special characters (this also drops '@' in e-mail addresses)
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    return text.strip()

# Step 3: Main function to process the PDF
def process_pdf(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    return clean_text(text)

# Step 4: Simplified main execution

if __name__ == "__main__":
    pdf = "/content/articulo_3.pdf"
    print("Texto limpio:\n")
    print(process_pdf(pdf))


Texto limpio:

Using Chinese Glyphs for Named Entity Recognition Arijit Sehanobishy Yale University arijit.sehanobishyale.eduChan Hee Songy University of Notre Dame csong1nd.edu Abstract Most Named Entity Recognition (NER) systems use addi- tional features like part-of-speech (POS) tags, shallow pars- ing, gazetteers, etc. Adding these external features to NER systems have been shown to have a positive impact. How- ever, creating gazetteers or taggers can take a lot of time and may require extensive data cleaning. In this work instead of using these traditional features we use lexicographic features of Chinese characters. Chinese characters are composed of grap
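
The extracted text above still contains words broken by line-end hyphenation ("addi- tional", "pars- ing", "How- ever"). A rough sketch of a regex-based repair follows. It assumes that a hyphen followed by whitespace and a lowercase letter is always a line-break hyphen, which will wrongly join genuine cases such as "pre- and post-processing". `join_hyphenated` is a hypothetical helper, not part of the original notebook.

```python
import re

def join_hyphenated(text):
    """Join words split as 'addi- tional' into 'additional'."""
    # A word character, a hyphen, whitespace, then a lowercase letter
    return re.sub(r"(\w)-\s+([a-z])", r"\1\2", text)

sample = "use addi- tional features like shallow pars- ing. How- ever, part-of-speech tags stay."
print(join_hyphenated(sample))
```

Hyphenated compounds without a following space, such as "part-of-speech", are left untouched because the pattern requires whitespace after the hyphen.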

In [None]:
!pip install datasets

# Importing libraries and initial model configuration

In [None]:
from datasets import (load_dataset, DownloadConfig)
from transformers import (
    LEDTokenizer,
    LEDForConditionalGeneration,
    Trainer,
    TrainingArguments
)
import os
import torch

# Disable WandB (optional)
os.environ["WANDB_DISABLED"] = "true"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Parameters
MODEL_NAME = "allenai/led-base-16384"
MAX_INPUT_LENGTH = 4096  # For long texts
MAX_TARGET_LENGTH = 300  # Longer summaries (150-300 tokens; these limits count tokens, not words)
BATCH_SIZE = 2
EPOCHS = 2
LEARNING_RATE = 2e-5

# Load the tokenizer
tokenizer = LEDTokenizer.from_pretrained(MODEL_NAME)


# Loading and selecting the training dataset


In [None]:
# Load in streaming mode
dataset = load_dataset("scientific_papers", "pubmed", split="train", streaming=True, trust_remote_code=True)
dataset_test = load_dataset("scientific_papers", "pubmed", split="test", streaming=True, trust_remote_code=True)

# Select the first 100 examples (keeping the dataset format)
dataset = dataset.take(100)
dataset_test = dataset_test.take(100)

# Dataset preprocessing

In [None]:
# Preprocess the data
def preprocess_function(examples):
    inputs = examples["article"]
    targets = examples["abstract"]

    # Convert to strings in case they are not already
    inputs = [str(i) for i in inputs]
    targets = [str(t) for t in targets]

    # Tokenize the inputs and the labels
    model_inputs = tokenizer(
        inputs, max_length=MAX_INPUT_LENGTH, truncation=True, padding="max_length"
    )
    labels = tokenizer(
        targets, max_length=MAX_TARGET_LENGTH, truncation=True, padding="max_length"
    )

    # Attach the target token ids as "labels"
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True)
tokenized_test_datasets = dataset_test.map(preprocess_function, batched=True)
"""
# Convertir a listas
tokenized_datasets  = list(tokenized_datasets.take(100))
ttokenized_test_datasets  = list(tokenized_test_datasets.take(100))"""

'\n# Convertir a listas\ntokenized_datasets  = list(tokenized_datasets.take(100))\nttokenized_test_datasets  = list(tokenized_test_datasets.take(100))'
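
The preprocessing above pads the labels to `MAX_TARGET_LENGTH` with the tokenizer's padding id. A common refinement, not applied in the original cell, is to replace those padding ids with `-100` so that the cross-entropy loss skips them. The sketch below shows the idea on plain Python lists; `mask_label_padding` is a hypothetical helper, and the pad id `1` is only an assumption for the toy batch (in the notebook it would be `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def mask_label_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy batch of already padded label sequences (pad id 1 assumed for illustration)
batch = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
masked = mask_label_padding(batch, pad_token_id=1)
print(masked)
```

Inside `preprocess_function` this would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`.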

# Model configuration and training


In [None]:
# Load the model
model = LEDForConditionalGeneration.from_pretrained(MODEL_NAME)
model.to(device)
# Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=1,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    fp16=False,
    max_steps=1,  # Fixed number of training steps (overrides num_train_epochs)
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_test_datasets,
    tokenizer=tokenizer,
)

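from tqdm import tqdm

# Force training to stop after a fixed number of steps
MAX_STEPS = 1
# Trainer.training_step only runs the forward and backward pass, so create an
# optimizer and step it explicitly; otherwise the weights are never updated
trainer.create_optimizer()
# Train with manual control over the number of steps
for step, batch in enumerate(tqdm(trainer.get_train_dataloader())):
    if step >= MAX_STEPS:
        break
    trainer.training_step(model, batch)
    trainer.optimizer.step()
    trainer.optimizer.zero_grad()
# Save the model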
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")



Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(
1it [00:07,  7.32s/it]


('./trained_model/tokenizer_config.json',
 './trained_model/special_tokens_map.json',
 './trained_model/vocab.json',
 './trained_model/merges.txt',
 './trained_model/added_tokens.json')
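
The training cell above limits the run to a single step, so it helps to see how little of the 100-example subset is actually used. The arithmetic can be sketched as follows; `steps_per_epoch` and its `grad_accum_steps` parameter are hypothetical names introduced for illustration, and a single device is assumed.

```python
import math

def steps_per_epoch(num_examples, batch_size, grad_accum_steps=1):
    """Number of optimizer steps needed to see every example once."""
    batches = math.ceil(num_examples / batch_size)
    return math.ceil(batches / grad_accum_steps)

NUM_EXAMPLES = 100  # dataset.take(100) in the notebook
BATCH_SIZE = 2      # same value as in the notebook
MAX_STEPS = 1       # same value as in the notebook

total = steps_per_epoch(NUM_EXAMPLES, BATCH_SIZE)
seen = min(MAX_STEPS, total) * BATCH_SIZE
print(f"steps per epoch: {total}, examples seen with MAX_STEPS={MAX_STEPS}: {seen}")
```

With these values a full pass takes 50 steps, while the run above stops after processing one batch of 2 examples, so the saved model is essentially a smoke test of the pipeline and not a fine-tuned model.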

# Function for summarizing texts

In [None]:
def summarize_text(input_text, model, tokenizer):
    # Split the text into chunks if it is too long
    def split_text(text, max_length=MAX_INPUT_LENGTH):
        tokens = tokenizer.encode(text, truncation=False)
        chunks = [tokens[i:i+max_length] for i in range(0, len(tokens), max_length)]
        return [tokenizer.decode(chunk, skip_special_tokens=True) for chunk in chunks]

    # Tokenize and summarize each chunk
    chunks = split_text(input_text)
    summaries = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt", max_length=MAX_INPUT_LENGTH, truncation=True, padding=True)
        inputs = inputs.to(model.device)  # Keep the inputs on the same device as the model
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            num_beams=6,
            max_length=MAX_TARGET_LENGTH,
            min_length=150,
            length_penalty=2.0,
            repetition_penalty=2.5,
            early_stopping=True
        )
        summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

    # Combine the partial summaries
    return " ".join(summaries)


# Summarization model

In [None]:
# The input here should be the text extracted from the PDF.
# Comment out the following assignment to test with the extracted text instead.
input_text = """
"Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect.", "Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms.", "In this paper, we provide a necessary and sufficient characterization of the analytical forms for the critical points (as well as global minimizers) of the square loss functions for linear neural networks.", "We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve global minimum.", "Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks.", "One particular conclusion is that: While the loss function of linear networks has no spurious local minimum, the loss function of one-hidden-layer nonlinear networks with ReLU activation function does have local minimum that is not global minimum.", "In the past decade, deep neural networks BID8 have become a popular tool that has successfully solved many challenging tasks in a variety of areas such as machine learning, artificial intelligence, computer vision, and natural language processing, etc.", "As the understandings of deep neural networks from different aspects are mostly based on empirical studies, there is a rising need and interest to develop understandings of neural networks from theoretical aspects such as generalization error, representation power, and landscape (also referred to as geometry) properties, etc.", "In particular, the landscape properties of loss functions (that are typically nonconex for neural networks) play a central role to determine the iteration path and convergence 
performance of optimization algorithms.One major landscape property is the nature of critical points, which can possibly be global minima, local minima, saddle points.", "There have been intensive efforts in the past into understanding such an issue for various neural networks.", "For example, it has been shown that every local minimum of the loss function is also a global minimum for shallow linear networks under the autoencoder setting and invertibility assumptions BID1 and for deep linear networks BID11 ; BID14 ; Yun et al. (2017) respectively under different assumptions.", "The conditions on the equivalence between local minimum or critical point and global minimum has also been established for various nonlinear neural networks Yu & Chen (1995) ; BID9 ; BID15 ; BID17 ; BID6 under respective assumptions.However, most previous studies did not provide characterization of analytical forms for critical points of loss functions for neural networks with only very few exceptions.", "In BID1 , the authors provided an analytical form for the critical points of the square loss function of shallow linear networks under certain conditions.", "Such an analytical form further helps to establish the landscape properties around the critical points.", "Further in BID13 , the authors characterized certain sufficient form of critical points for the square loss function of matrix factorization problems and deep linear networks.The focus of this paper is on characterizing the sufficient and necessary forms of critical points for broader scenarios, i.e., shallow and deep linear networks with no assumptions on data matrices and network dimensions, and shallow ReLU networks over certain parameter space.", "In particular, such analytical forms of critical points capture the corresponding loss function values and the necessary and sufficient conditions to achieve global minimum.", "This further enables us to establish new landscape properties around these critical points for the loss 
function of these networks under general settings, and provides alternative (yet simpler and more intuitive) proofs for existing understanding of the landscape properties.OUR CONTRIBUTION", "1) For the square loss function of linear networks with one hidden layer, we provide a full (necessary and sufficient) characterization of the analytical forms for its critical points and global minimizers.", "These results generalize the characterization in BID1 to arbitrary network parameter dimensions and any data matrices.", "Such a generalization further enables us to establish the landscape property, i.e., every local minimum is also a global minimum and all other critical points are saddle points, under no assumptions on parameter dimensions and data matrices.", "From a technical standpoint, we exploit the analytical forms of critical points to provide a new proof for characterizing the landscape around the critical points under full relaxation of assumptions, where the corresponding approaches in BID1 are not applicable.", "As a special case of linear networks, the matrix factorization problem satisfies all these landscape properties.2) For the square loss function of deep linear networks, we establish a full (necessary and sufficient) characterization of the analytical forms for its critical points and global minimizers.", "Such characterizations are new and have not been established in the existing art.", "Furthermore, such analytical form divides the set of non-global-minimum critical points into different categories.", "We identify the directions along which the loss function value decreases for two categories of the critical points, for which our result directly implies the equivalence between the local minimum and the global minimum.", "For these cases, our proof generalizes the result in BID11 under no assumptions on the network parameter dimensions and data matrices.3) For the square loss function of one-hidden-layer nonlinear neural networks with ReLU 
activation function, we provide a full characterization of both the existence and the analytical forms of the critical points in certain types of regions in the parameter space.", "Particularly, in the case where there is one hidden unit, our results fully characterize the existence and the analytical forms of the critical points in the entire parameter space.", "Such characterization were not provided in previous work on nonlinear neural networks.", "Moreover, we apply our results to a concrete example to demonstrate that both local minimum that is not a global minimum and local maximum do exist in such a case.", "In this paper, we provide full characterization of the analytical forms of the critical points for the square loss function of three types of neural networks, namely, shallow linear networks, deep linear networks, and shallow ReLU nonlinear networks.", "We show that such analytical forms of the critical points have direct implications on the values of the corresponding loss functions, achievement of global minimum, and various landscape properties around these critical points.", "As a consequence, the loss function for linear networks has no spurious local minimum, while such point does exist for nonlinear networks with ReLU activation.", "In the future, it is interesting to further explore nonlinear neural networks.", "In particular, we wish to characterize the analytical form of critical points for deep nonlinear networks and over the full parameter space.", "Such results will further facilitate the understanding of the landscape properties around these critical points."
"""
summary = summarize_text(input_text, model, tokenizer)
print("Resumen generado:\n", summary)

# Improving grammar

In [None]:
from transformers import pipeline
import language_tool_python

# Load the grammar checker
tool = language_tool_python.LanguageTool('en-US')

# Correct grammatical errors
def correct_grammar(text):
    matches = tool.check(text)
    corrected_text = language_tool_python.utils.correct(text, matches)
    return corrected_text

# Paraphrase the text (to improve coherence and clarity)
def paraphrase_text(text):
    # Load the paraphrasing model
    paraphraser = pipeline("text2text-generation", model="t5-small", tokenizer="t5-small")

    # Paraphrase the text
    paraphrased = paraphraser(f"paraphrase: {text}", max_length=200, num_return_sequences=1)[0]['generated_text']
    return paraphrased

# Improve the generated summary
def improve_summary(summary):
    # Step 1: Correct grammatical errors
    corrected_summary = correct_grammar(summary)

    # Step 2: Paraphrase for better coherence and clarity
    improved_summary = paraphrase_text(corrected_summary)

    return improved_summary

# Example of a summary generated by the model
generated_summary = "Due to the success of deep learning to solving a variety of challenging machine learning tasks, there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect. Particularly, the properties of critical points and the landscape around them are of importance to determine the convergence performance of optimization algorithms."

# Improve the summary
improved_summary = improve_summary(generated_summary)

# Show the improved summary
print("Resumen Mejorado:")
print(improved_summary)


# Second model, using T5-small



In [None]:
import os
import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer

# Disable WandB
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DISABLED"] = "true"

# 1. Load the dataset
dataset = load_dataset("scientific_papers", "pubmed")

# Use a small subset for a quick test
train_data = dataset["train"].shuffle(seed=42).select(range(2000))
val_data = dataset["validation"].shuffle(seed=42).select(range(400))

# 2. Load the model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# 3. Preprocess the data with a structure prompt
def preprocess_data_with_structure(batch):
    inputs = ["summarize with structure: introduction, main topics, conclusion: " + article for article in batch["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # text_target tokenizes the labels; it replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=batch["abstract"], max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_data = train_data.map(preprocess_data_with_structure, batched=True)
val_data = val_data.map(preprocess_data_with_structure, batched=True)

# 4. Training parameters
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=100,
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=5e-5,
    fp16=True
)

# 5. Set up the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
)

# 6. Free cached GPU memory and train the model
torch.cuda.empty_cache()
trainer.train()

# 7. Read the .txt file to be summarized (this helper only reads it; the summary is generated in step 8)
def summarize_text_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

# Mount Google Drive to access the file
from google.colab import drive
drive.mount('/content/drive')

# Path to the .txt file in Google Drive
txt_file_path = "/content/drive/My Drive/imagenes/output_text_prueba1.txt"

# Read the file contents
text_to_summarize = summarize_text_from_file(txt_file_path)

# 8. Generate the structured summary
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

inputs = tokenizer("summarize with structure: introduction, main topics, conclusion: " + text_to_summarize,
                   max_length=512, truncation=True, return_tensors="pt").to(device)

summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    min_length=40,
    length_penalty=1.0,
    num_beams=4,
    early_stopping=True
)

# Print the structured summary
print("Resumen Estructurado:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 2.841456        |


Resumen Estructurado: "there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect," he says. "we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks," he says. "there has been intensive efforts in the past into understanding such an issue for various neural networks," he says.
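
The structured summary above shows two recurring artifacts: every sentence is wrapped in quotes with a spurious "he says" attribution, and sentences start in lowercase. A purely heuristic cleanup, tailored to the output shown here and not a general solution, could look like the sketch below. `clean_generated_summary` is a hypothetical helper.

```python
import re

def clean_generated_summary(text):
    """Heuristic cleanup of quote-style artifacts in a generated summary."""
    # Drop the attribution appended after each quoted sentence
    text = re.sub(r',?"?\s+he says\b', "", text)
    # Remove leftover double quotes
    text = text.replace('"', "")
    # Capitalize the first letter of each sentence
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s[:1].upper() + s[1:] for s in sentences if s)

raw = '"there is a rising interest," he says. "we exploit the forms," he says.'
print(clean_generated_summary(raw))
```

Because the rule removes every "he says", it would also delete legitimate attributions, so it should only be applied when the source text is known not to contain quoted speech.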


# Text translation model


In [None]:
!pip install transformers

from transformers import MarianTokenizer, MarianMTModel

# 1. Load the pretrained model and the tokenizer
model_name = "Helsinki-NLP/opus-mt-en-es"  # English-to-Spanish model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# 2. Translate text with beam search and light post-processing
def translate_text(text, model, tokenizer):

    # Tokenize the input text (inputs longer than the model limit are truncated)
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Generate the translation
    translated = model.generate(
        **inputs,
        num_beams=10,  # Explore more candidate translations
        early_stopping=True
    )

    # Decode the translation
    translation = tokenizer.decode(translated[0], skip_special_tokens=True)

    # Remove spaces before commas and periods
    translation = translation.replace(" ,", ",").replace(" .", ".")
    return translation

# 3. Test the translation
sample_text = (
    "there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect, he says. "
    "we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks, he says. "
    "there has been intensive efforts in the past into understanding such an issue for various neural networks, he says"
)

translated_text = translate_text(sample_text, model, tokenizer)

print("Texto original:", sample_text)
print("Traducción:", translated_text)




Texto original: there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect, he says. we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks, he says. there has been intensive efforts in the past into understanding such an issue for various neural networks, he says
Traducción: Hay un interés creciente en entender las funciones de pérdida para el entrenamiento de redes neuronales desde un aspecto teórico, dice. explotamos las formas analíticas de los puntos críticos para caracterizar las propiedades del paisaje para las funciones de pérdida de redes neuronales lineales y redes ReLU poco profundas, dice. ha habido esfuerzos intensivos en el pasado para entender tal problema para varias redes neuronales, dice.
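
`translate_text` tokenizes the whole input with `truncation=True`, so anything beyond the model's maximum input length is silently dropped. One way to handle longer summaries is to translate sentence by sentence and join the results. The sketch below uses a naive regex splitter and a stand-in translation function so that it runs on its own; `split_sentences`, `translate_by_sentence`, and `fake_translate` are hypothetical names. In the notebook the callable would be something like `lambda s: translate_text(s, model, tokenizer)`.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_by_sentence(text, translate_fn):
    """Translate each sentence separately and join the results."""
    return " ".join(translate_fn(s) for s in split_sentences(text))

# Stand-in for the real model call (assumption, for illustration only)
def fake_translate(sentence):
    return f"<es>{sentence}</es>"

result = translate_by_sentence("First sentence. Second one? Third!", fake_translate)
print(result)
```

The splitter also breaks after abbreviations such as "et al.", which are common in scientific text, so a dedicated sentence tokenizer would be a safer choice for real inputs.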


# Translation model 2, using a Transformer


In [None]:
import pandas as pd
import numpy as np
from keras_transformer import get_model, decode
from pickle import load

# Functions
def build_token_dict(token_list):
    token_dict = {'<PAD>': 0, '<START>': 1, '<END>': 2}
    for tokens in token_list:
        for token in tokens:
            if token not in token_dict:
                token_dict[token] = len(token_dict)
    return token_dict

def translate(sentence, model, source_token_dict, target_token_dict, target_token_dict_inv):
    sentence_tokens = [['<START>'] + sentence.split(' ') + ['<END>']]
    tr_input = [list(map(lambda x: source_token_dict.get(x, source_token_dict['<PAD>']), tokens)) for tokens in sentence_tokens][0]

    decoded = decode(
        model,
        tr_input,
        start_token=target_token_dict['<START>'],
        end_token=target_token_dict['<END>'],
        pad_token=target_token_dict['<PAD>']
    )

    print('Frase original:', sentence)
    print('Traducción:', ' '.join(map(lambda x: target_token_dict_inv[x], decoded[1:-1])))

# Initial configuration
filename = 'B:/UPIIT/I.A. UPIIT/6to_Semestre/LenguajeNatural/Parcial 3/Proyectos/traductor/codigos ocupados/dataset_Completo.csv'
np.random.seed(0)

if __name__ == '__main__':
    # Load the dataset
    dataset = pd.read_csv(filename)
    print(dataset.iloc[383015, 0])
    print(dataset.iloc[383015, 1])

    # Tokenization
    source_tokens = [sentence.split(' ') for sentence in dataset['english']]
    target_tokens = [sentence.split(' ') for sentence in dataset['spanish']]
    print(source_tokens[383015])
    print(target_tokens[383015])

    # Build the token dictionaries
    source_token_dict = build_token_dict(source_tokens)
    target_token_dict = build_token_dict(target_tokens)
    target_token_dict_inv = {v: k for k, v in target_token_dict.items()}

    # Prepare the data for the model
    encoder_tokens = [['<START>'] + tokens + ['<END>'] for tokens in source_tokens]
    decoder_tokens = [['<START>'] + tokens + ['<END>'] for tokens in target_tokens]
    output_tokens = [tokens + ['<END>'] for tokens in target_tokens]

    source_max_len = max(map(len, encoder_tokens))
    target_max_len = max(map(len, decoder_tokens))

    encoder_tokens = [tokens + ['<PAD>'] * (source_max_len - len(tokens)) for tokens in encoder_tokens]
    decoder_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in decoder_tokens]
    output_tokens = [tokens + ['<PAD>'] * (target_max_len - len(tokens)) for tokens in output_tokens]

    encoder_input = [list(map(lambda x: source_token_dict[x], tokens)) for tokens in encoder_tokens]
    decoder_input = [list(map(lambda x: target_token_dict[x], tokens)) for tokens in decoder_tokens]
    output_decoded = [list(map(lambda x: [target_token_dict[x]], tokens)) for tokens in output_tokens]

    # Create and train the model
    model = get_model(
        token_num=max(len(source_token_dict), len(target_token_dict)),
        embed_dim=32,
        encoder_num=2,
        decoder_num=2,
        head_num=4,
        hidden_dim=128,
        dropout_rate=0.1,
        use_same_embed=False,
    )
    model.compile('adam', 'sparse_categorical_crossentropy')
    model.summary()

    x = [np.array(encoder_input), np.array(decoder_input)]
    y = np.array(output_decoded)

    # Uncomment the next three lines only to train the model:
    #history = model.fit(x, y, epochs=100, batch_size=32)
    #print(history.history)
    #model.save('translator_trained.h5')
    # Load the previously trained weights (separate variable so the dataset path is not overwritten)
    weights_path = '/content/drive/My Drive/translator_trained.h5'
    model.load_weights(weights_path)

    # Translation
    translate('Yes I am.', model, source_token_dict, target_token_dict, target_token_dict_inv)


In [None]:

translate('there is a rising interest in understanding loss functions for training neural networks from a theoretical aspect, he says we exploit the analytical forms of the critical points to characterize the landscape properties for the loss functions of linear neural networks and shallow ReLU networks, he says. there has been intensive efforts in the past into understanding such an issue for various neural networks, he says', model, source_token_dict, target_token_dict, target_token_dict_inv)
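
The `translate` function above maps any word missing from `source_token_dict` to `<PAD>`, and because sentences are split on single spaces, punctuation stays attached to words ("am." and "am" are different tokens). Before translating a long sentence it can be useful to check which tokens the vocabulary does not cover. The sketch below reuses the notebook's `build_token_dict` on a toy corpus; `out_of_vocabulary` is a hypothetical helper.

```python
def build_token_dict(token_list):
    """Same construction as in the notebook: ids assigned in order of first appearance."""
    token_dict = {'<PAD>': 0, '<START>': 1, '<END>': 2}
    for tokens in token_list:
        for token in tokens:
            if token not in token_dict:
                token_dict[token] = len(token_dict)
    return token_dict

def out_of_vocabulary(sentence, token_dict):
    """Return the tokens of `sentence` that the dictionary does not contain."""
    return [tok for tok in sentence.split(' ') if tok not in token_dict]

# Toy corpus (assumption, for illustration only)
corpus = [s.split(' ') for s in ["Yes I am.", "I am here."]]
source_token_dict = build_token_dict(corpus)
print(source_token_dict)
print(out_of_vocabulary("Yes I am here", source_token_dict))
```

Here "here" is reported as unknown because the corpus only contains "here." with the trailing period, which illustrates why whitespace tokenization produces many near-duplicate vocabulary entries.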


# Model using texts with mathematical characters



In [None]:
!pip install --upgrade fsspec gcsfs
!pip install transformers datasets

import os
import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

os.environ["WANDB_MODE"] = "offline"  # disable
os.environ["WANDB_DISABLED"] = "true"

# Set PYTORCH_CUDA_ALLOC_CONF to reduce CUDA memory fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# 1. Load the arXiv dataset (papers that contain formulas and numbers)
dataset = load_dataset("scientific_papers", "arxiv")

# Use a small subset for a quick test
train_data = dataset["train"].shuffle(seed=42).select(range(500))
val_data = dataset["validation"].shuffle(seed=42).select(range(200))

# 2. Load the model and tokenizer
model_name = "t5-small"  # t5-small keeps memory usage low
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

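# 3. Preprocess the data
def preprocess_data(batch):
    inputs = ["summarize: " + article for article in batch["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # text_target tokenizes the abstracts as labels (replaces the deprecated as_target_tokenizer)
    labels = tokenizer(text_target=batch["abstract"], max_length=128, truncation=True, padding="max_length")
    # Replace padding token IDs with -100 so the loss ignores padded positions
    model_inputs["labels"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs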

train_data = train_data.map(preprocess_data, batched=True)
val_data = val_data.map(preprocess_data, batched=True)

# 4. Training parameters
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=4,  # Small batch size to limit memory usage
    per_device_eval_batch_size=4,  # Small batch size to limit memory usage
    num_train_epochs=2,
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=5e-5,
    fp16=torch.cuda.is_available(),  # Mixed precision requires a GPU
    report_to="none",  # Disable external logging integrations such as W&B
)

# 5. Configure the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
)

# 6. Free cached GPU memory and train the model
torch.cuda.empty_cache()
trainer.train()

# 7. Read a .txt file from Google Drive to summarize
# Note: despite its name, this helper only reads and returns the file contents.
def summarize_text_from_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

# Path to the .txt file in Google Drive
txt_file_path = "/content/drive/My Drive/imagenes/output_text_prueba2.txt"

# Read the file contents
text_to_summarize = summarize_text_from_file(txt_file_path)

# 8. Generate the structured summary
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Inputs longer than 512 tokens are truncated, so only the start of the file is summarized
inputs = tokenizer("summarize: " + text_to_summarize, max_length=512, truncation=True, return_tensors="pt").to(device)
summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=128, min_length=30, length_penalty=2.0, num_beams=4)

print("Resumen:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))


Collecting fsspec
  Using cached fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.12.0-py3-none-any.whl (183 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.9.0
    Uninstalling fsspec-2024.9.0:
      Successfully uninstalled fsspec-2024.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.2.0 requires fsspec[http]<=2024.9.0,>=2023.1.0, but you have fsspec 2024.12.0 which is incompatible.
Successfully installed fsspec-2024.12.0
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.9.0-py3-none-any.whl (179 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fssp

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Epoch,Training Loss,Validation Loss
1,No log,3.271149
2,No log,3.198559
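
The "No log" entries in the Training Loss column are most likely explained by the number of optimizer steps: the run is shorter than the default logging interval. A quick check of the arithmetic, assuming a single device, no gradient accumulation, and the default `logging_steps` of 500:

```python
import math

num_train_examples = 500   # from select(range(500))
batch_size = 4             # per_device_train_batch_size
num_epochs = 2             # num_train_epochs
logging_steps = 500        # default of TrainingArguments.logging_steps (assumed unchanged)

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)   # 125 250
print(total_steps < logging_steps)    # True: training ends before the first log
```

Under these assumptions, passing a smaller `logging_steps` value (for example 25) to `TrainingArguments` should make the training loss appear in the table.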


Resumen: Derivada de una constante por una funcién f(x) =u+tv f(x)=ultv' Derivada de una constante por una funcién f(x)=keu f(x)=keu' Derivada de una raiz cuadrada f(x) = k Kigk-1 Ejemplos de derivadas F(x) =-2 f(x
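
Because the generation step truncates its input to 512 tokens, only the beginning of a long file contributes to the summary. One possible workaround, sketched here under stated assumptions, is to split the text into chunks and summarize each chunk separately. The helper `chunk_words` and the 300-word limit are hypothetical; word counts only approximate token counts, so the limit would need tuning.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Synthetic 700-word text used only to demonstrate the chunk sizes.
sample = ' '.join(f'w{i}' for i in range(700))
chunks = chunk_words(sample, max_words=300)

print([len(chunk.split()) for chunk in chunks])  # [300, 300, 100]
```

Each chunk could then be prefixed with `"summarize: "` and passed through the same `tokenizer` and `model.generate` calls used in step 8, with the partial summaries concatenated afterwards.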
