Environment: `nlp_mlops`

## Directory

In [1]:
import os
os.chdir("../")
os.getcwd()

'c:\\Users\\Marina\\Desktop\\ML Operations\\0 - KrishNaik Course\\21_end_to_end_nlp_project_with_huggingface_and_transformers\\my_project'

## 1. Config.yaml

We do this by adding the `model_trainer` section to `config.yaml`.
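As a rough sketch, the relevant part of `config.yaml` could look like the fragment below. The values are taken from the configuration printed later in this notebook; the exact layout of the file and the sections belonging to earlier stages are assumptions and are left out here.

```yaml
# Assumed layout; only the keys used by the model trainer stage are shown.
artifacts_root: artifacts

model_trainer:
  root_dir: artifacts/model_trainer
  data_path: artifacts/data_transformation/samsum_dataset
  model_ckpt: google/pegasus-cnn_dailymail
```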

## 2. Params.yaml

Created to hold the hyperparameters used during training.
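A possible `params.yaml`, reconstructed from the parameters printed later in this notebook. The top-level key `TrainingArguments` is the one the Configuration Manager reads; the formatting of the file itself is an assumption.

```yaml
# Assumed layout, with values copied from the printed parameters.
TrainingArguments:
  num_train_epochs: 1
  warmup_steps: 500
  per_device_train_batch_size: 1
  weight_decay: 0.01
  logging_steps: 10
  evaluation_strategy: steps
  eval_steps: 500
  save_steps: 1000000
  gradient_accumulation_steps: 16
```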

## 3. Config entity

In [4]:
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModelTrainerConfig():
  root_dir: Path
  data_path: Path 
  model_ckpt: str


@dataclass
class ModelTrainerParams():
  num_train_epochs: int
  warmup_steps: int
  per_device_train_batch_size: int
  weight_decay: float
  logging_steps: int
  evaluation_strategy: str
  eval_steps: int
  save_steps: int
  gradient_accumulation_steps: int

## 4. Configuration Manager

Let's create some constants.
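A minimal sketch of what `src/textSummarizer/constants/__init__.py` might contain. The two paths are assumptions inferred from the log lines printed further down (`config\config.yaml` and `params.yaml`) and are relative to the project root.

```python
from pathlib import Path

# Assumed locations, relative to the project root
# (the notebook calls os.chdir("../") before reading them).
CONFIG_FILE_PATH = Path("config/config.yaml")
PARAMS_FILE_PATH = Path("params.yaml")
```

Using `Path` objects keeps the separators portable between Windows and Linux.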

In [5]:
from src.textSummarizer.constants import CONFIG_FILE_PATH, PARAMS_FILE_PATH
from src.textSummarizer.utils.common import read_yaml, create_directories
from typing import Tuple

class ConfigurationManager:
    def __init__(self,
                config_path= CONFIG_FILE_PATH,
                params_path= PARAMS_FILE_PATH ):
        
        self.configurations = read_yaml(config_path)
        self.params = read_yaml(params_path)

        create_directories([self.configurations.artifacts_root])  # creates /artifacts


    def get_model_trainer_config(self)-> Tuple[ModelTrainerConfig, ModelTrainerParams]:
        
        model_trainer_config = self.configurations.model_trainer
        model_trainer_params = self.params["TrainingArguments"]
    
        
        create_directories([model_trainer_config.root_dir])  # creates /artifacts/model_trainer

        return model_trainer_config, model_trainer_params


In [6]:
config = ConfigurationManager()
model_trainer_config, model_trainer_params = config.get_model_trainer_config()
print(model_trainer_config)
print(model_trainer_params)


[ 2024-11-13 11:05:20,684 ] - 28 summarizerlogger - INFO - yaml file: config\config.yaml loaded successfully
[ 2024-11-13 11:05:20,701 ] - 28 summarizerlogger - INFO - yaml file: params.yaml loaded successfully
[ 2024-11-13 11:05:20,705 ] - 46 summarizerlogger - INFO - created directory at: artifacts
[ 2024-11-13 11:05:20,708 ] - 46 summarizerlogger - INFO - created directory at: artifacts/model_trainer
{'root_dir': 'artifacts/model_trainer', 'data_path': 'artifacts/data_transformation/samsum_dataset', 'model_ckpt': 'google/pegasus-cnn_dailymail'}
{'num_train_epochs': 1, 'warmup_steps': 500, 'per_device_train_batch_size': 1, 'weight_decay': 0.01, 'logging_steps': 10, 'evaluation_strategy': 'steps', 'eval_steps': 500, 'save_steps': 1000000, 'gradient_accumulation_steps': 16}
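
The two printed objects are plain mappings read from the YAML files, not instances of the dataclasses from section 3, even though the return annotation names them. The sketch below shows one way to build the dataclasses explicitly, assuming the YAML sections behave like dictionaries. `build_model_trainer_entities` is a hypothetical helper, not part of the project.

```python
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class ModelTrainerConfig:
    root_dir: Path
    data_path: Path
    model_ckpt: str


@dataclass
class ModelTrainerParams:
    num_train_epochs: int
    warmup_steps: int
    per_device_train_batch_size: int
    weight_decay: float
    logging_steps: int
    evaluation_strategy: str
    eval_steps: int
    save_steps: int
    gradient_accumulation_steps: int


def build_model_trainer_entities(config_section, params_section):
    """Hypothetical helper: turn the raw YAML sections into typed entities."""
    config = ModelTrainerConfig(
        root_dir=Path(config_section["root_dir"]),
        data_path=Path(config_section["data_path"]),
        model_ckpt=config_section["model_ckpt"],
    )
    # Raises TypeError if a key is missing or unexpected.
    params = ModelTrainerParams(**params_section)
    return config, params


# Values copied from the output printed above.
raw_config = {
    "root_dir": "artifacts/model_trainer",
    "data_path": "artifacts/data_transformation/samsum_dataset",
    "model_ckpt": "google/pegasus-cnn_dailymail",
}
raw_params = {
    "num_train_epochs": 1, "warmup_steps": 500,
    "per_device_train_batch_size": 1, "weight_decay": 0.01,
    "logging_steps": 10, "evaluation_strategy": "steps",
    "eval_steps": 500, "save_steps": 1000000,
    "gradient_accumulation_steps": 16,
}

config, params = build_model_trainer_entities(raw_config, raw_params)
training_kwargs = asdict(params)  # plain dict, suitable for ** unpacking
```

If `ModelTrainer` received these dataclasses, `**self.params` would need to become `**asdict(self.params)`, since a dataclass instance cannot be unpacked with `**`.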


## 5. Update the components: Data Ingestion, Data Transformation, Model Trainer

In [8]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch    
# from datasets import load_from_disk

#keras==2.13.1

[ 2024-11-13 11:13:55,072 ] - 54 datasets - INFO - PyTorch version 2.1.2 available.
[ 2024-11-13 11:13:55,077 ] - 112 datasets - INFO - TensorFlow version 2.13.0 available.


In [9]:
# ! pip install datasets
from datasets import load_from_disk


In [10]:
from src.textSummarizer.logging import logger

class ModelTrainer:
    def __init__(self,
                 model_trainer_config: ModelTrainerConfig, 
                 model_trainer_params: ModelTrainerParams):
        """
        Initializes the ModelTrainer class with configuration details.

        Args:   
            model_trainer_config (ModelTrainerConfig): Configuration object containing the needed paths to perform the model training.
            model_trainer_params (ModelTrainerParams): Params object containing the needed params to perform the model training.
        """
        self.config = model_trainer_config
        self.params = model_trainer_params
        logger.info("ModelTrainer initialized with configuration and parameters.")

    def train(self):
        # Determine the device to use
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Using device: {device}")

        # Load tokenizer and model
        logger.info("Loading tokenizer and model.")
        tokenizer = AutoTokenizer.from_pretrained(self.config.model_ckpt)
        model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(self.config.model_ckpt).to(device)
        seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)
        logger.info("Tokenizer and model loaded successfully.")

        # Load the dataset
        logger.info(f"Loading dataset from {self.config.data_path}.")
        dataset_samsum_pt = load_from_disk(self.config.data_path)
        logger.info("Dataset loaded successfully.")
        

        # Set up training arguments
        logger.info("Setting up training arguments.")
        trainer_args = TrainingArguments(
            output_dir=self.config.root_dir,
            **self.params
        )
        logger.info("Training arguments set.")

        # Initialize the trainer
        logger.info("Initializing Trainer.")
        trainer = Trainer(
            model=model_pegasus,
            args=trainer_args,
            tokenizer=tokenizer,
            data_collator=seq2seq_data_collator,
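            # Trains on the small "test" split, which keeps the run short; use "train" for a full fine-tuning run
            train_dataset=dataset_samsum_pt["test"],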
            eval_dataset=dataset_samsum_pt["validation"]
        )

        # Start training
        logger.info("Starting training process.")
        trainer.train()
        logger.info("Training completed.")

        # Save model and tokenizer
        logger.info("Saving the trained model and tokenizer.")
        model_pegasus.save_pretrained(os.path.join(self.config.root_dir, "pegasus-samsum-model"))
        tokenizer.save_pretrained(os.path.join(self.config.root_dir, "tokenizer"))
        logger.info("Model and tokenizer saved successfully.")


In [None]:
configuration_manager_obj = ConfigurationManager()
model_trainer_config, model_trainer_params = configuration_manager_obj.get_model_trainer_config()

model_trainer_obj = ModelTrainer(model_trainer_config, model_trainer_params)
model_trainer_obj.train()


[ 2024-11-13 11:14:05,091 ] - 28 summarizerlogger - INFO - yaml file: config\config.yaml loaded successfully
[ 2024-11-13 11:14:05,142 ] - 28 summarizerlogger - INFO - yaml file: params.yaml loaded successfully
[ 2024-11-13 11:14:05,147 ] - 46 summarizerlogger - INFO - created directory at: artifacts
[ 2024-11-13 11:14:05,190 ] - 46 summarizerlogger - INFO - created directory at: artifacts/model_trainer
[ 2024-11-13 11:14:05,191 ] - 16 summarizerlogger - INFO - ModelTrainer initialized with configuration and parameters.
[ 2024-11-13 11:14:05,193 ] - 21 summarizerlogger - INFO - Using device: cpu
[ 2024-11-13 11:14:05,196 ] - 24 summarizerlogger - INFO - Loading tokenizer and model.


  return self.fget.__get__(instance, owner)()
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[ 2024-11-13 11:15:21,421 ] - 28 summarizerlogger - INFO - Tokenizer and model loaded successfully.
[ 2024-11-13 11:15:21,456 ] - 31 summarizerlogger - INFO - Loading dataset from artifacts/data_transformation/samsum_dataset.
[ 2024-11-13 11:15:23,304 ] - 33 summarizerlogger - INFO - Dataset loaded successfully.
[ 2024-11-13 11:15:23,312 ] - 37 summarizerlogger - INFO - Setting up training arguments.
[ 2024-11-13 11:15:23,760 ] - 42 summarizerlogger - INFO - Training arguments set.
[ 2024-11-13 11:15:23,761 ] - 45 summarizerlogger - INFO - Initializing Trainer.
[ 2024-11-13 11:15:52,063 ] - 56 summarizerlogger - INFO - Starting training process.


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  0%|          | 0/51 [00:00<?, ?it/s]

I ran the training on Google Colab instead and downloaded the resulting model and tokenizer.
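
The `0/51` in the progress bar can be reproduced from the training parameters. This is a minimal sketch under three assumptions: the split used for training holds 819 examples (the size of the SAMSum test split, which is not printed anywhere in this notebook), training runs on a single device, and the number of optimizer steps per epoch is the number of batches floor-divided by `gradient_accumulation_steps`. The function name is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs=1, num_devices=1):
    """Estimate the total number of optimizer updates for a training run."""
    # Batches produced by the dataloader in one epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Gradients are accumulated over several batches before each update.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


# 819 examples is an assumption; the other values come from params.yaml.
total_steps = optimizer_steps(num_examples=819,
                              per_device_batch_size=1,
                              grad_accum_steps=16)
print(total_steps)
```

Under these assumptions the run has 51 optimizer steps in total, so `warmup_steps: 500` and `eval_steps: 500` are never reached: the learning rate would still be warming up when training ends, and no intermediate evaluation would run.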

## 6. Modularize the Code

Section `3.` goes to `src\textSummarizer\entity\__init__.py`

Section `4.` goes to `src\textSummarizer\config\configuration.py`

Section `5.` goes to `src\textSummarizer\components\data_transformation.py`

We modularize by creating a pipeline (a class) in `stage_3_model_trainer_pipeline.py` that contains the code we used to run the training.

Finally, call the pipeline from `main.py`.
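
A minimal sketch of what the pipeline class might look like. The class name `ModelTrainerPipeline` and its `main` method are assumptions, and the two stand-in classes exist only so the snippet runs on its own. In the project they would be imported from the modules listed above.

```python
# Stand-ins for the real classes (assumed to be imported in the project).
class ConfigurationManager:
    def get_model_trainer_config(self):
        config = {"root_dir": "artifacts/model_trainer"}
        params = {"num_train_epochs": 1}
        return config, params


class ModelTrainer:
    def __init__(self, model_trainer_config, model_trainer_params):
        self.config = model_trainer_config
        self.params = model_trainer_params
        self.trained = False

    def train(self):
        # The real method loads the model and runs Trainer.train().
        self.trained = True


class ModelTrainerPipeline:  # hypothetical name
    """Wraps the steps from the last notebook cell into one entry point."""

    def main(self):
        configuration_manager = ConfigurationManager()
        config, params = configuration_manager.get_model_trainer_config()
        model_trainer = ModelTrainer(config, params)
        model_trainer.train()
        return model_trainer


if __name__ == "__main__":
    ModelTrainerPipeline().main()
```

`main.py` would then only need to instantiate the pipeline and call `main()`, which keeps each stage runnable on its own.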