<a href="https://colab.research.google.com/github/Caiofiama/Caiofiama/blob/main/Conhe%C3%A7a_o_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Nova seção

# Task
Implement the training pipeline for the `TechnicalCorpusTrainer` model, including data loading, model configuration, training loop, and model saving, based on the provided notebook content.

## Coleta/preparação de dados

### Subtask:
Adquirir ou gerar os dados de treinamento necessários para o modelo.


**Reasoning**:
Since there's no information about the required data format or available data sources in the notebook, I need to generate some synthetic data that is likely to be used for training a technical corpus trainer model. I will create a pandas DataFrame with a column named 'text' containing some technical text examples.



In [1]:
import pandas as pd

data = {'text': ["This is a technical document about machine learning.",
                 "The process involves data preprocessing and model training.",
                 "We use Python and libraries like scikit-learn and TensorFlow.",
                 "Evaluating the model's performance is crucial.",
                 "Technical terms include 'algorithm', 'hyperparameter', and 'overfitting'."]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,text
0,This is a technical document about machine lea...
1,The process involves data preprocessing and mo...
2,We use Python and libraries like scikit-learn ...
3,Evaluating the model's performance is crucial.
4,"Technical terms include 'algorithm', 'hyperpar..."


## Implementação do dataloader

### Subtask:
Criar um mecanismo para carregar e pré-processar os dados de forma eficiente para o treinamento.


**Reasoning**:
Implement the custom Dataset and DataLoader for efficient data loading and preprocessing.



In [2]:
import torch
from torch.utils.data import Dataset, DataLoader

class TechnicalTextDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].tolist()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

dataset = TechnicalTextDataset(df)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

## Implementação do pipeline de treinamento

### Subtask:
Desenvolver a lógica principal do loop de treinamento, incluindo forward pass, cálculo de loss, backpropagation e otimização.


**Reasoning**:
Implement the core training loop logic, including model definition, loss function, optimizer, forward pass, loss calculation, backpropagation, and weight updates.



In [3]:
import torch.nn as nn
import torch.optim as optim

# 1. Define a simple model architecture
class TechnicalCorpusModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(TechnicalCorpusModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size) # Simple output layer for next word prediction or similar

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out[:, -1, :]) # Using the last hidden state
        return output

# Dummy values for model parameters (replace with actual based on data/tokenizer)
vocab_size = 1000 # Example vocabulary size
embedding_dim = 128
hidden_dim = 256

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)

# 2. Choose an appropriate loss function
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction

# 3. Select an optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Implement the training loop (conceptual - requires tokenized data and labels)
# This is a placeholder and needs actual data processing and labels to run.
# Assuming 'dataloader' provides batches of input_ids and target_ids

# Example training loop structure (will not run without proper data)
# num_epochs = 10
#
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         # Assuming batch contains input_ids and target_ids
#         # input_ids, target_ids = batch # This needs to be adapted based on dataloader output
#
#         optimizer.zero_grad() # a. Clear gradients
#
#         # i. Perform a forward pass
#         # outputs = model(input_ids) # This needs to be adapted
#
#         # ii. Calculate the loss
#         # loss = criterion(outputs, target_ids) # This needs to be adapted
#
#         # iii. Perform backpropagation
#         # loss.backward()
#
#         # iv. Update the model's weights
#         # optimizer.step() # c. Zero gradients are handled by optimizer.zero_grad() for the next iteration
#
#         # d. Monitor training progress (example)
#         # print(f'Epoch: {epoch+1}, Loss: {loss.item():.4f}')

print("Model, criterion, and optimizer defined. Training loop structure outlined.")

Model, criterion, and optimizer defined. Training loop structure outlined.


## Configuração e treinamento do modelo

### Subtask:
Configurar o modelo com hiperparâmetros apropriados e executar o processo de treinamento.


**Reasoning**:
Define training parameters, implement the training loop with tokenization and numericalization within the loop, calculate and print the loss, perform backpropagation and optimization, and print a completion message.



In [4]:
import torch
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
tokenizer = get_tokenizer('basic_english')

# Build vocabulary from the entire dataset
counter = Counter()
for text in dataset.texts:
    counter.update(tokenizer(text))

vocab = Vocab(counter, min_freq=1)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
# optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well

# Function to numericalize text
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long)

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        # For simplicity in this example, let's assume a simple task like predicting the next word
        # This requires creating input and target sequences from the batch_texts
        # A common approach is to use sliding windows or similar techniques.
        # Since the prompt doesn't specify the exact task (e.g., next word prediction,
        # sentence classification), we will implement a simplified version assuming
        # we are training a language model where the model predicts the next token
        # given the previous tokens in a sequence. This requires creating input-target pairs.

        # Example: Create input-target pairs from each text in the batch
        # This is a simplified approach and might not be suitable for all tasks.
        # For a real language model, you would pad sequences and handle batching carefully.
        sequences = [numericalize_text(text, vocab, tokenizer) for text in batch_texts]

        # Create input and target tensors (simplified example)
        # This part is highly dependent on the specific task and model architecture.
        # Assuming we want to predict the next token for each sequence in the batch
        # We'll create input sequences (all tokens except the last) and target sequences (all tokens except the first)
        # This requires padding or careful handling of varying sequence lengths.
        # For this example, let's process each sequence in the batch individually
        # and accumulate the loss. In a real scenario, you'd pad and process as a batch.

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_seq = seq[1:]

                # Ensure input_seq and target_seq are tensors
                input_seq = torch.tensor(input_seq, dtype=torch.long).unsqueeze(0) # Add batch dimension
                target_seq = torch.tensor(target_seq, dtype=torch.long)

                # Ensure the model output and target have compatible shapes for CrossEntropyLoss
                # CrossEntropyLoss expects input (N, C) and target (N) where N is batch size
                # and C is number of classes (vocab size).
                # Our model outputs (Batch_size, Vocab_size) for the last time step.
                # If we are processing sequences, the model should ideally output
                # (Batch_size, Sequence_length, Vocab_size) or we need to adjust the loss calculation.

                # Let's adapt the model to output for the entire sequence for simplicity in this example
                # (This might require modifying the model definition if the original one only used the last hidden state)
                # Assuming the original model was defined to output for the last time step:
                # If the task is sequence-to-sequence or similar, the model architecture might need adjustment.
                # For this example, let's assume the task is still related to predicting something
                # based on the sequence, and we'll adapt the loss calculation.

                # Re-thinking: The original model definition used the last hidden state for output.
                # This is typical for classification based on a sequence or predicting the next token
                # after processing the entire sequence.
                # Let's assume the task is to predict the next token given the entire sequence so far.
                # This means the input to the model would be the sequence up to the second to last token,
                # and the target would be the last token.

                if len(seq) > 1:
                    input_seq = seq[:-1].unsqueeze(0) # Input is sequence except the last token
                    target_token = seq[-1].unsqueeze(0) # Target is the last token

                    optimizer.zero_grad() # a. Clear gradients

                    # i. Perform a forward pass
                    # The model expects input_ids which are numerical representations of tokens.
                    outputs = model(input_seq)

                    # ii. Calculate the loss
                    # CrossEntropyLoss expects input (N, C) and target (N)
                    # outputs shape: (Batch_size, Vocab_size)
                    # target_token shape: (Batch_size)
                    loss = criterion(outputs, target_token)

                    # iii. Perform backpropagation
                    loss.backward()

                    # iv. Update the model's weights
                    optimizer.step()

                    batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")


ModuleNotFoundError: No module named 'torchtext'

In [11]:
import torch
# from torchtext.data.utils import get_tokenizer # Removed torchtext
from collections import Counter
# from torchtext.vocab import Vocab # Removed torchtext
import torch.nn as nn
import torch.optim as optim

# Define a basic tokenizer function
def basic_tokenizer(text):
    return text.lower().split()

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
# tokenizer = get_tokenizer('basic_english') # Removed torchtext tokenizer

# Build vocabulary from the entire dataset using basic_tokenizer
counter = Counter()
for text in dataset.texts:
    counter.update(basic_tokenizer(text))

# Create a vocabulary dictionary
# Assign an index to each unique token
vocab = {token: idx for idx, (token, count) in enumerate(counter.most_common())}
# Add an unknown token for words not in the vocabulary
vocab['<unk>'] = len(vocab)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
embedding_dim = 128 # Assuming these were defined previously
hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text using the vocabulary dictionary
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab.get(token, vocab['<unk>']) for token in tokenizer(text)], dtype=torch.long) # Use .get with <unk> fallback

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        sequences = [numericalize_text(text, vocab, basic_tokenizer) for text in batch_texts] # Use basic_tokenizer

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = input_seq.unsqueeze(0) # Add batch dimension
                target_token = target_token.unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                outputs = model(input_seq)

                # ii. Calculate the loss
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

NameError: name 'TechnicalCorpusModel' is not defined

In [13]:
import torch
# from torchtext.data.utils import get_tokenizer # Removed torchtext
from collections import Counter
# from torchtext.vocab import Vocab # Removed torchtext
import torch.nn as nn
import torch.optim as optim

# Define a basic tokenizer function
def basic_tokenizer(text):
    return text.lower().split()

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
# tokenizer = get_tokenizer('basic_english') # Removed torchtext tokenizer

# Build vocabulary from the entire dataset using basic_tokenizer
counter = Counter()
for text in dataset.texts:
    counter.update(basic_tokenizer(text))

# Create a vocabulary dictionary
# Assign an index to each unique token
vocab = {token: idx for idx, (token, count) in enumerate(counter.most_common())}
# Add an unknown token for words not in the vocabulary
vocab['<unk>'] = len(vocab)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
embedding_dim = 128 # Assuming these were defined previously
hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text using the vocabulary dictionary
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab.get(token, vocab['<unk>']) for token in tokenizer(text)], dtype=torch.long) # Use .get with <unk> fallback

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        sequences = [numericalize_text(text, vocab, basic_tokenizer) for text in batch_texts] # Use basic_tokenizer

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = input_seq.unsqueeze(0) # Add batch dimension
                target_token = target_token.unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                outputs = model(input_seq)

                # ii. Calculate the loss
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

Starting training...
Epoch 1/10, Loss: 10.5463
Epoch 2/10, Loss: 8.3325
Epoch 3/10, Loss: 6.5504
Epoch 4/10, Loss: 4.2553
Epoch 5/10, Loss: 2.4471
Epoch 6/10, Loss: 0.8621
Epoch 7/10, Loss: 0.2435
Epoch 8/10, Loss: 0.0948
Epoch 9/10, Loss: 0.0434
Epoch 10/10, Loss: 0.0271
Training finished.


In [12]:
import torch.nn as nn
import torch.optim as optim

# 1. Define a simple model architecture
class TechnicalCorpusModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(TechnicalCorpusModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size) # Simple output layer for next word prediction or similar

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out[:, -1, :]) # Using the last hidden state
        return output

# Dummy values for model parameters (replace with actual based on data/tokenizer)
# vocab_size = 1000 # Example vocabulary size - This will be set dynamically later
embedding_dim = 128
hidden_dim = 256

# model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim) # Model instantiation moved to training loop

# 2. Choose an appropriate loss function
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction

# 3. Select an optimizer
# optimizer = optim.Adam(model.parameters(), lr=0.001) # Optimizer instantiation moved to training loop

print("Model architecture, criterion, and optimizer defined.")

Model architecture, criterion, and optimizer defined.


In [10]:
import torch
from torch.utils.data import Dataset, DataLoader

class TechnicalTextDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].tolist()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

dataset = TechnicalTextDataset(df)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [9]:
import pandas as pd

data = {'text': ["This is a technical document about machine learning.",
                 "The process involves data preprocessing and model training.",
                 "We use Python and libraries like scikit-learn and TensorFlow.",
                 "Evaluating the model's performance is crucial.",
                 "Technical terms include 'algorithm', 'hyperparameter', and 'overfitting'."]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,text
0,This is a technical document about machine lea...
1,The process involves data preprocessing and mo...
2,We use Python and libraries like scikit-learn ...
3,Evaluating the model's performance is crucial.
4,"Technical terms include 'algorithm', 'hyperpar..."


In [8]:
import torch
from torch.utils.data import Dataset, DataLoader

class TechnicalTextDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].tolist()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

dataset = TechnicalTextDataset(df)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

NameError: name 'df' is not defined

In [16]:
import torch.nn as nn
import torch.optim as optim

# 1. Define a simple model architecture
class TechnicalCorpusModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(TechnicalCorpusModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size) # Simple output layer for next word prediction or similar

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out[:, -1, :]) # Using the last hidden state
        return output

# Dummy values for model parameters (replace with actual based on data/tokenizer)
# vocab_size = 1000 # Example vocabulary size - This will be set dynamically later
embedding_dim = 128
hidden_dim = 256

# model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim) # Model instantiation moved to training loop

# 2. Choose an appropriate loss function
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction

# 3. Select an optimizer
# optimizer = optim.Adam(model.parameters(), lr=0.001) # Optimizer instantiation moved to training loop

print("Model architecture, criterion, and optimizer defined.")

Model architecture, criterion, and optimizer defined.


In [15]:
import torch
from torch.utils.data import Dataset, DataLoader

class TechnicalTextDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].tolist()

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

dataset = TechnicalTextDataset(df)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [14]:
import pandas as pd

data = {'text': ["This is a technical document about machine learning.",
                 "The process involves data preprocessing and model training.",
                 "We use Python and libraries like scikit-learn and TensorFlow.",
                 "Evaluating the model's performance is crucial.",
                 "Technical terms include 'algorithm', 'hyperparameter', and 'overfitting'."]}
df = pd.DataFrame(data)
display(df)

Unnamed: 0,text
0,This is a technical document about machine lea...
1,The process involves data preprocessing and mo...
2,We use Python and libraries like scikit-learn ...
3,Evaluating the model's performance is crucial.
4,"Technical terms include 'algorithm', 'hyperpar..."


In [7]:
import torch
# from torchtext.data.utils import get_tokenizer # Removed torchtext
from collections import Counter
# from torchtext.vocab import Vocab # Removed torchtext
import torch.nn as nn
import torch.optim as optim

# Define a basic tokenizer function
def basic_tokenizer(text):
    return text.lower().split()

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
# tokenizer = get_tokenizer('basic_english') # Removed torchtext tokenizer

# Build vocabulary from the entire dataset using basic_tokenizer
counter = Counter()
for text in dataset.texts:
    counter.update(basic_tokenizer(text))

# Create a vocabulary dictionary
# Assign an index to each unique token
vocab = {token: idx for idx, (token, count) in enumerate(counter.most_common())}
# Add an unknown token for words not in the vocabulary
vocab['<unk>'] = len(vocab)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
embedding_dim = 128 # Assuming these were defined previously
hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text using the vocabulary dictionary
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab.get(token, vocab['<unk>']) for token in tokenizer(text)], dtype=torch.long) # Use .get with <unk> fallback

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        sequences = [numericalize_text(text, vocab, basic_tokenizer) for text in batch_texts] # Use basic_tokenizer

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = input_seq.unsqueeze(0) # Add batch dimension
                target_token = target_token.unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                outputs = model(input_seq)

                # ii. Calculate the loss
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

NameError: name 'dataset' is not defined

**Reasoning**:
The previous command failed because the `torchtext` library was not installed. Install `torchtext` and then re-run the training code.



In [5]:
# Re-run the training cell to define the 'model' object
%exec_run -i e4ede58d

UsageError: Line magic function `%exec_run` not found.


## Salvamento do Modelo

### Subtask:
Salvar o modelo treinado para uso futuro.

**Reasoning**:
Save the trained model's state dictionary.

In [4]:
import torch

# 1. Save the trained model
# Define the path where you want to save the model
model_save_path = "technical_corpus_model.pth"

# Save the model's state dictionary
torch.save(model.state_dict(), model_save_path)

print(f"Model saved successfully to {model_save_path}")

# 2. Optional: Load the model later
# To load the model later, you would first instantiate the model with the same architecture
# loaded_model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
# Load the saved state dictionary
# loaded_model.load_state_dict(torch.load(model_save_path))
# Set the model to evaluation mode
# loaded_model.eval()
# print(f"Model loaded successfully from {model_save_path}")

NameError: name 'model' is not defined

In [20]:
import torch
# from torchtext.data.utils import get_tokenizer # Removed torchtext
from collections import Counter
# from torchtext.vocab import Vocab # Removed torchtext
import torch.nn as nn
import torch.optim as optim

# Define a basic tokenizer function
def basic_tokenizer(text):
    return text.lower().split()

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
# tokenizer = get_tokenizer('basic_english') # Removed torchtext tokenizer

# Build vocabulary from the entire dataset using basic_tokenizer
counter = Counter()
for text in dataset.texts:
    counter.update(basic_tokenizer(text))

# Create a vocabulary dictionary
# Assign an index to each unique token
vocab = {token: idx for idx, (token, count) in enumerate(counter.most_common())}
# Add an unknown token for words not in the vocabulary
vocab['<unk>'] = len(vocab)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
embedding_dim = 128 # Assuming these were defined previously
hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text using the vocabulary dictionary
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab.get(token, vocab['<unk>']) for token in tokenizer(text)], dtype=torch.long) # Use .get with <unk> fallback

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        sequences = [numericalize_text(text, vocab, basic_tokenizer) for text in batch_texts] # Use basic_tokenizer

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = input_seq.unsqueeze(0) # Add batch dimension
                target_token = target_token.unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                outputs = model(input_seq)

                # ii. Calculate the loss
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

Starting training...
Epoch 1/10, Loss: 10.4886
Epoch 2/10, Loss: 8.5108
Epoch 3/10, Loss: 6.5334
Epoch 4/10, Loss: 4.2441
Epoch 5/10, Loss: 1.9660
Epoch 6/10, Loss: 0.6347
Epoch 7/10, Loss: 0.2073
Epoch 8/10, Loss: 0.0866
Epoch 9/10, Loss: 0.0473
Epoch 10/10, Loss: 0.0290
Training finished.


In [18]:
import torch

# 1. Save the trained model
# Define the path where you want to save the model
model_save_path = "technical_corpus_model.pth"

# Save the model's state dictionary
torch.save(model.state_dict(), model_save_path)

print(f"Model saved successfully to {model_save_path}")

# 2. Optional: Load the model later
# To load the model later, you would first instantiate the model with the same architecture
# loaded_model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
# Load the saved state dictionary
# loaded_model.load_state_dict(torch.load(model_save_path))
# Set the model to evaluation mode
# loaded_model.eval()
# print(f"Model loaded successfully from {model_save_path}")

Model saved successfully to technical_corpus_model.pth


In [17]:
import torch
# from torchtext.data.utils import get_tokenizer # Removed torchtext
from collections import Counter
# from torchtext.vocab import Vocab # Removed torchtext
import torch.nn as nn
import torch.optim as optim

# Define a basic tokenizer function
def basic_tokenizer(text):
    return text.lower().split()

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
# tokenizer = get_tokenizer('basic_english') # Removed torchtext tokenizer

# Build vocabulary from the entire dataset using basic_tokenizer
counter = Counter()
for text in dataset.texts:
    counter.update(basic_tokenizer(text))

# Create a vocabulary dictionary
# Assign an index to each unique token
vocab = {token: idx for idx, (token, count) in enumerate(counter.most_common())}
# Add an unknown token for words not in the vocabulary
vocab['<unk>'] = len(vocab)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
embedding_dim = 128 # Assuming these were defined previously
hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text using the vocabulary dictionary
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab.get(token, vocab['<unk>']) for token in tokenizer(text)], dtype=torch.long) # Use .get with <unk> fallback

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        sequences = [numericalize_text(text, vocab, basic_tokenizer) for text in batch_texts] # Use basic_tokenizer

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = input_seq.unsqueeze(0) # Add batch dimension
                target_token = target_token.unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                outputs = model(input_seq)

                # ii. Calculate the loss
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

Starting training...
Epoch 1/10, Loss: 10.3760
Epoch 2/10, Loss: 8.0251
Epoch 3/10, Loss: 6.1678
Epoch 4/10, Loss: 3.8039
Epoch 5/10, Loss: 1.7677
Epoch 6/10, Loss: 0.6374
Epoch 7/10, Loss: 0.1889
Epoch 8/10, Loss: 0.0890
Epoch 9/10, Loss: 0.0507
Epoch 10/10, Loss: 0.0341
Training finished.


## Validação e Avaliação do Modelo

### Subtask:
Avaliar o desempenho do modelo usando métricas de sucesso e realizar validação.

**Reasoning**:
Evaluate the trained model using relevant metrics to assess its performance against the defined targets. Since this is a simplified example, we will demonstrate a basic evaluation concept. For a real-world technical corpus trainer, more sophisticated evaluation metrics and a dedicated validation dataset would be necessary.

In [2]:
# 1. Implement model evaluation using appropriate metrics
# This is a placeholder for evaluation logic.
# For a real task (e.g., next word prediction), you would calculate perplexity, accuracy, etc.
# For this simplified example, we'll just print a message indicating where evaluation would go.

print("Model evaluation would be implemented here.")
print("Metrics to consider: Accuracy, Loss on a validation set, Perplexity (for language models).")

# Example: (Conceptual) Calculate loss on a validation set (requires a validation dataloader)
# model.eval() # Set the model to evaluation mode
# total_validation_loss = 0
# with torch.no_grad():
#     for batch_texts_val in validation_dataloader:
#         # Process validation batch (tokenize, numericalize, create inputs/targets)
#         # ...
#         # outputs_val = model(input_ids_val)
#         # loss_val = criterion(outputs_val, target_ids_val)
#         # total_validation_loss += loss_val.item()
#
# average_validation_loss = total_validation_loss / len(validation_dataloader)
# print(f'Average Validation Loss: {average_validation_loss:.4f}')

# 2. Compare results against the defined success metrics (Conceptual)
# This would involve comparing calculated metrics (e.g., accuracy, loss) with the targets
# specified in the "Métricas de Sucesso" section of the guide.
# For this placeholder, we'll just print a message.
print("Compare evaluation results against success metrics (e.g., Accuracy > 80%, Loss < 0.5).")

# 3. Perform validation (Conceptual)
# Validation would typically involve using a separate validation dataset to tune hyperparameters
# and prevent overfitting. This is a placeholder.
print("Validation process (e.g., using a validation set for hyperparameter tuning) would be done here.")

Model evaluation would be implemented here.
Metrics to consider: Accuracy, Loss on a validation set, Perplexity (for language models).
Compare evaluation results against success metrics (e.g., Accuracy > 80%, Loss < 0.5).
Validation process (e.g., using a validation set for hyperparameter tuning) would be done here.


In [5]:
!pip install torchtext

Collecting torchtext
  Downloading torchtext-0.18.0-cp312-cp312-manylinux1_x86_64.whl.metadata (7.9 kB)
Downloading torchtext-0.18.0-cp312-cp312-manylinux1_x86_64.whl (2.0 MB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.0 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m2.0/2.0 MB[0m [31m72.2 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m49.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: torchtext
Successfully installed torchtext-0.18.0


**Reasoning**:
Now that `torchtext` is installed, re-run the code to configure the model and execute the training loop as originally intended.



In [6]:
import torch
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
import torch.nn as nn
import torch.optim as optim

# 1. Define appropriate values for training parameters
num_epochs = 10
learning_rate = 0.001
batch_size = 2 # Already defined with dataloader

# Tokenization and Vocabulary Building (Moved inside the script for simplicity)
tokenizer = get_tokenizer('basic_english')

# Build vocabulary from the entire dataset
counter = Counter()
for text in dataset.texts:
    counter.update(tokenizer(text))

vocab = Vocab(counter, min_freq=1)

# Update vocab_size based on the built vocabulary
vocab_size = len(vocab)

# Re-instantiate the model with the correct vocab_size if necessary
# Assuming the model was already defined in a previous step with a placeholder vocab_size
# If the model needs to be re-instantiated, do it here:
# Dummy values for model parameters (replace with actual based on data/tokenizer)
# embedding_dim = 128 # Assuming these were defined previously
# hidden_dim = 256 # Assuming these were defined previously

model = TechnicalCorpusModel(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss() # Suitable for classification tasks like next word prediction
optimizer = optim.Adam(model.parameters(), lr=learning_rate) # Re-initialize optimizer as well


# Function to numericalize text
def numericalize_text(text, vocab, tokenizer):
    return torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long)

# 2. Implement the complete training loop
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    for batch_texts in dataloader:
        # 3. Handle data from dataloader: tokenize and numericalize
        # For simplicity in this example, let's assume a simple task like predicting the next word
        # This requires creating input and target sequences from the batch_texts
        # A common approach is to use sliding windows or similar techniques.
        # Since the prompt doesn't specify the exact task (e.g., next word prediction,
        # sentence classification), we will implement a simplified version assuming
        # we are training a language model where the model predicts the next token
        # given the previous tokens in a sequence. This requires creating input-target pairs.

        # Example: Create input-target pairs from each text in the batch
        # This is a simplified approach and might not be suitable for all tasks.
        # For a real language model, you would pad sequences and handle batching carefully.
        sequences = [numericalize_text(text, vocab, tokenizer) for text in batch_texts]

        # Create input and target tensors (simplified example)
        # This part is highly dependent on the specific task and model architecture.
        # Assuming we want to predict the next token for each sequence in the batch
        # We'll create input sequences (all tokens except the last) and target sequences (all tokens except the first)
        # This requires padding or careful handling of varying sequence lengths.
        # For this example, let's process each sequence in the batch individually
        # and accumulate the loss. In a real scenario, you'd pad and process as a batch.

        batch_loss = 0
        for seq in sequences:
            if len(seq) > 1: # Need at least two tokens for input and target
                input_seq = seq[:-1]
                target_token = seq[-1]

                # Ensure input_seq and target_seq are tensors
                input_seq = torch.tensor(input_seq, dtype=torch.long).unsqueeze(0) # Add batch dimension
                target_token = torch.tensor(target_token, dtype=torch.long).unsqueeze(0) # Add batch dimension for CrossEntropyLoss target


                optimizer.zero_grad() # a. Clear gradients

                # i. Perform a forward pass
                # The model expects input_ids which are numerical representations of tokens.
                outputs = model(input_seq)

                # ii. Calculate the loss
                # CrossEntropyLoss expects input (N, C) and target (N)
                # outputs shape: (Batch_size, Vocab_size)
                # target_token shape: (Batch_size)
                loss = criterion(outputs, target_token)

                # iii. Perform backpropagation
                loss.backward()

                # iv. Update the model's weights
                optimizer.step()

                batch_loss += loss.item()

        total_loss += batch_loss / len(batch_texts) if len(batch_texts) > 0 else 0 # Average loss per text in batch

    # 4. Print the loss periodically
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}')

# 6. After the training loop completes, print a message
print("Training finished.")

# 7. Consider adding basic error handling or checks (Optional for this step)
# Basic check: ensure vocab_size is set correctly before training.
if 'vocab_size' not in locals() or vocab_size <= 0:
    print("Error: Vocabulary size not set correctly.")

OSError: /usr/local/lib/python3.12/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_

**Reasoning**:
The `OSError` suggests a compatibility issue between the installed `torchtext` and `torch` versions or the environment. A common solution is to uninstall and reinstall both `torchtext` and `torch` to ensure compatibility.



In [1]:
!pip uninstall torch torchtext -y
!pip install torch==2.8.0 torchtext==0.18.0

Found existing installation: torch 2.9.0
Uninstalling torch-2.9.0:
  Successfully uninstalled torch-2.9.0
Found existing installation: torchtext 0.18.0
Uninstalling torchtext-0.18.0:
  Successfully uninstalled torchtext-0.18.0
Collecting torch==2.8.0
  Downloading torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting torchtext==0.18.0
  Using cached torchtext-0.18.0-cp312-cp312-manylinux1_x86_64.whl.metadata (7.9 kB)
Collecting nvidia-nccl-cu12==2.27.3 (from torch==2.8.0)
  Downloading nvidia_nccl_cu12-2.27.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
Collecting triton==3.4.0 (from torch==2.8.0)
  Downloading triton-3.4.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Downloading torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl (887.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m887.9/887.9 MB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[?25hUsing cached torchtext-0.18.0-cp312-cp

**Reasoning**:
Load the dataset into a pandas DataFrame and display the first few rows and the columns and their data types to understand the data structure.

