# Import necessary libraries and modules

- `import torch`: Importing the PyTorch library, which provides support for deep learning and neural networks.

- `import torch.nn as nn`: Importing the neural network module from PyTorch to define and work with neural network layers.

- `import torch.optim as optim`: Importing the optimization module from PyTorch to use various optimization algorithms during training.

- `import torch.nn.functional as F`: Importing the functional module from PyTorch, which contains various functions used in neural network layers, loss functions, etc.

- `import numpy as np`: Importing NumPy, a library for numerical computing in Python. It's commonly used for handling arrays and matrices.

- `import re`: Importing the regular expression module for working with text data. Useful for text preprocessing.

- `from nltk.corpus import stopwords`: Importing NLTK's stopwords corpus, a list of common words that can be removed from text data during text preprocessing.

- `from collections import Counter`: Importing Python's Counter class, which is used for counting elements in a collection, such as counting word frequencies.

- `from torch.utils.data import DataLoader, Dataset`: Importing PyTorch's data handling modules for creating custom datasets and data loaders for efficient training.

- `from torch.nn.utils.rnn import pad_sequence`: Importing a function from PyTorch for padding sequences in batches, often used in natural language processing tasks.

- `import pandas as pd`: Importing the pandas library for data manipulation and analysis, especially useful for working with tabular data.

- `from functools import partial`: Importing `partial` from Python's functools module, which creates a new function from an existing one with some arguments fixed in advance.

- `import nltk`: Importing the Natural Language Toolkit (NLTK) library for natural language processing tasks, such as tokenization and stemming.

- `from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score`: Importing several metrics from scikit-learn, a machine learning library. These metrics are commonly used for evaluating classification models.


In [1]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import re
from nltk.corpus import stopwords
from collections import Counter
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
from functools import partial
import nltk
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


### MultiHeadAttention Class ###

The `MultiHeadAttention` class is a PyTorch module designed for implementing multi-head self-attention, a pivotal element found in various transformer-based neural network architectures. These architectures are widely used in natural language processing and other sequence-to-sequence tasks.

#### Constructor Method (`__init__`) ####

- `__init__(self, d_model, num_heads)`: This method serves as the constructor for the `MultiHeadAttention` class and is responsible for initializing the multi-head attention layer.

  - `d_model`: This parameter represents the input dimensionality, often referred to as the model dimension. It signifies the dimension of the input embeddings.
  
  - `num_heads`: The number of attention heads determines how many distinct attention-weighted combinations of the input data are simultaneously computed in parallel.
  
  - `self.num_heads`: An instance attribute that stores the number of attention heads.
  
  - `self.head_dim`: An instance attribute that stores the dimension of each attention head, computed as `d_model // num_heads`.
  
  - `self.d_model`: An instance attribute that stores the input dimensionality.

  - `self.wq`, `self.wk`, `self.wv`: These are linear layers utilized to project the input data into query, key, and value spaces for each attention head.
  
  - `self.fc_out`: This linear layer is employed to combine the outputs originating from all attention heads.

#### Forward Method (`forward`) ####

- `forward(self, query, key, value, mask)`: The `forward` method carries out the forward pass within the multi-head attention layer.

  - `query`, `key`, `value`: These parameters are input tensors that represent sequences of queries, keys, and values. Typically, these tensors correspond to embeddings of the input data.

  - `mask`: A mask tensor (or `None`) that hides specific positions during the attention computation, for example padding tokens. The code adds `mask * -1e9` to the attention scores, so a value of 1 marks a position to ignore and a value of 0 marks a position to keep.

  - In this method, the input tensors are split into `self.num_heads` different heads, and their shapes are adjusted for parallel processing.

  - Scaled dot-product attention scores are computed for each attention head.

  - If a mask is provided, it is applied to the attention scores to prevent specific elements from contributing to the final output.

  - The attention scores are normalized via softmax to yield attention weights.

  - These attention weights are then employed to weigh the values, leading to the creation of attended values.

  - The attended values from all attention heads are concatenated and subsequently passed through the linear layer `self.fc_out` to produce the final output.

  - The method returns the output tensor along with the attention weights.



In [39]:


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()

        # Initialize the MultiHeadAttention module with the specified dimensions
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.d_model = d_model

        # Linear transformations for Query (Q), Key (K), and Value (V)
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)

        # Linear transformation for the output
        self.fc_out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask):
        # Step 1: Project the inputs into query, key, and value spaces
        query = self.wq(query)
        key = self.wk(key)
        value = self.wv(value)

        # Split the projected tensors into multiple heads
        query = query.view(query.shape[0], -1, self.num_heads, self.head_dim)
        key = key.view(key.shape[0], -1, self.num_heads, self.head_dim)
        value = value.view(value.shape[0], -1, self.num_heads, self.head_dim)

        # Step 2: Transpose dimensions to prepare for matrix multiplication
        query = query.permute(0, 2, 1, 3)  # Batch x num_heads x seq_len x head_dim
        key = key.permute(0, 2, 1, 3)
        value = value.permute(0, 2, 1, 3)

        # Step 3: Calculate the scaled dot-product attention scores
        scaled_attention_logits = torch.matmul(query, key.permute(0, 1, 3, 2)) / torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32))
        
        # Step 4: Apply the mask to the attention scores
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Step 5: Compute attention weights using softmax
        attention_weights = F.softmax(scaled_attention_logits, dim=-1)

        # Step 6: Apply the attention weights to the value
        output = torch.matmul(attention_weights, value)

        # Step 7: Rearrange and concatenate the heads
        output = output.permute(0, 2, 1, 3).contiguous().view(query.shape[0], -1, self.d_model)

        # Step 8: Apply a linear transformation to the concatenated outputs
        output = self.fc_out(output)

        # Return the output and attention weights
        return output, attention_weights


After passing through multi-head self-attention, the input becomes an output tensor that reflects the relationships and dependencies between elements of the input, as determined by the attention mechanism. This output can then be used in subsequent layers of a transformer-based model for natural language processing or sequence-to-sequence tasks.
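
The core computation in steps 3 to 6 can be sketched on its own. The following is a minimal, standalone illustration rather than the notebook's exact code: the function name `scaled_dot_product_attention` and the tensor sizes are assumptions chosen for the example, and the mask follows the same convention as the class above (1 marks a position to hide).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    head_dim = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / head_dim ** 0.5
    if mask is not None:
        # A large negative score becomes a weight of (almost) zero after softmax
        scores = scores + mask * -1e9
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = k = v = torch.randn(2, 4, 5, 8)   # batch=2, heads=4, seq_len=5, head_dim=8 (illustrative)
pad_mask = torch.zeros(2, 1, 1, 5)
pad_mask[:, :, :, -1] = 1.0           # hide the last key position
out, attn = scaled_dot_product_attention(q, k, v, pad_mask)
print(out.shape, attn.shape)
```

Each row of `attn` sums to 1, and the column for the hidden position receives essentially no weight.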

### FeedForward Class ###

The `FeedForward` class is a PyTorch module designed to implement a feedforward neural network layer. This layer is commonly utilized within transformer-based models for various tasks, including natural language processing.

#### Constructor Method (`__init__`) ####

- `__init__(self, d_model, d_ff)`: The constructor method for the `FeedForward` class initializes the feedforward layer with the following parameters:

  - `d_model`: Represents the dimension of the input data or embeddings.

  - `d_ff`: Denotes the dimension of the intermediate hidden representation within the feedforward layer.

  - `self.fc1`: This is a linear layer that transforms the input data from dimension `d_model` to `d_ff`.

  - `self.fc2`: Another linear layer that further transforms the intermediate representation from dimension `d_ff` back to `d_model`.

#### Forward Method (`forward`) ####

- `forward(self, x)`: The `forward` method is responsible for executing the forward pass within the feedforward layer:

  - `x`: This is the input tensor that undergoes processing within the feedforward layer.

  - The input tensor `x` is first passed through the initial linear layer (`self.fc1`). Subsequently, the Rectified Linear Unit (ReLU) activation function is applied to introduce non-linearity.

  - Following the activation, the output is further transformed through the second linear layer (`self.fc2`) to yield the final output tensor.

  - The method ultimately returns this final output tensor.



In [40]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()

        # Initialize the FeedForward module with the specified dimensions
        self.fc1 = nn.Linear(d_model, d_ff)  # Fully connected layer 1
        self.fc2 = nn.Linear(d_ff, d_model)  # Fully connected layer 2

    def forward(self, x):
        # Step 1: Apply the first linear transformation followed by ReLU activation
        x = F.relu(self.fc1(x))
        
        # Step 2: Apply the second linear transformation
        x = self.fc2(x)

        # Return the output
        return x


The output of the `FeedForward` class is a tensor with the same shape as its input: each position is expanded to `d_ff` dimensions, passed through ReLU, and projected back to `d_model` dimensions. The exact values depend on the input data and the weights of the two linear layers.
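
As a quick check of the shape behaviour, here is a small sketch that builds an equivalent two-layer network with `nn.Sequential`. The sizes (`d_model = 16`, `d_ff = 64`) are arbitrary values picked for illustration. It also shows that the layer acts on each sequence position independently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 16, 64  # illustrative sizes, not taken from the notebook

# Linear -> ReLU -> Linear, the same structure as the FeedForward class
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)   # (batch, seq_len, d_model)
y = ffn(x)

# Feeding a single position on its own gives the same result for that position
y_single = ffn(x[:, 2:3, :])
print(y.shape)
print(torch.allclose(y[:, 2:3, :], y_single, atol=1e-5))
```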

### PositionalEncoding Class ###

The `PositionalEncoding` class is a PyTorch module designed to integrate positional information into input data embeddings. This is particularly essential for tasks involving sequences, such as NLP, and is commonly employed in transformer-based models.

#### Constructor Method (`__init__`) ####

- `__init__(self, d_model, max_len=5000)`: This constructor method initializes the positional encoding layer with the following parameters:

  - `d_model`: It represents the dimensionality of the model's input embeddings.

  - `max_len`: This parameter sets the maximum sequence length for which positional encodings will be generated. By default, it is set to 5000, but it can be customized as needed.

  - `self.dropout`: To enhance model robustness and mitigate overfitting, a dropout layer with a dropout rate of 0.1 is included.

  - Subsequently, the method calculates the positional encodings for the specified maximum sequence length.

  - `position`: A tensor containing positions ranging from 0 to `max_len - 1`.

  - `div_term`: Another tensor containing exponential terms necessary for the computation of sine and cosine components of the positional encodings.

  - `pe`: A tensor initialized with zeros is used to store the positional encodings.

  - The positional encodings are determined and stored in `pe`, with sine and cosine components interleaved to capture the positional relationships.

  - The computed positional encodings are registered as a buffer using `self.register_buffer`. This ensures that they are not treated as learnable parameters during model training.

#### Forward Method (`forward`) ####

- `forward(self, x)`: The `forward` method applies the positional encodings to the input data:

  - `x`: This is the input tensor representing the data.

  - The method augments `x` by adding the positional encodings stored in `self.pe`. These positional encodings are added to their respective positions along the sequence dimension of `x`.

  - A dropout operation with a 0.1 dropout rate is employed for regularization.

  - Finally, the augmented tensor is returned as the output.


In [41]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(0.1)
        
        # Step 1: Create positional encodings
        
        # Generate positions from 0 to max_len - 1
        position = torch.arange(0, max_len).unsqueeze(1)
        
        # Calculate the div_term used for positional encoding
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        
        # Initialize a tensor for positional encodings
        pe = torch.zeros(1, max_len, d_model)
        
        # Apply sine and cosine functions to create positional encodings
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        
        # Register the positional encodings as a buffer
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # Step 2: Add positional encodings to the input
        
        # Add positional encodings to the input tensor
        x = x + self.pe[:, :x.size(1)]
        
        # Apply dropout for regularization
        x = self.dropout(x)
        
        return x


The output of the `PositionalEncoding` class is the input tensor with positional information added. Self-attention on its own is insensitive to the order of its inputs, so this step is what lets a transformer take the order of elements into account, which is essential for sequential data such as natural language.
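
To make the sine/cosine interleaving concrete, the sketch below builds a small encoding table outside of any module. The helper name `sinusoidal_table` and the sizes are assumptions for illustration, and like the class above it requires an even `d_model`.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    # One row per position, one column per embedding dimension
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe

pe = sinusoidal_table(max_len=50, d_model=8)
print(pe.shape)
print(pe[0])  # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```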



### EncoderLayer Class ###

The `EncoderLayer` class is a component within a transformer-based neural network's encoder. Its role encompasses two key elements: multi-head self-attention and a feedforward neural network. Together, these elements process input data to enable the model to capture intricate patterns.

#### Constructor Method (`__init__`) ####

- `__init__(self, d_model, num_heads, d_ff)`: The constructor method initializes the `EncoderLayer` class with these parameters:

  - `d_model`: Denoting the input dimensionality, it represents the dimension of the input data or embeddings.

  - `num_heads`: Signifying the number of attention heads utilized in the multi-head self-attention mechanism.

  - `d_ff`: Representing the dimension of the intermediate hidden representation within the feedforward layer.

  - `self.multihead_attn`: An instance of the `MultiHeadAttention` class, responsible for executing the multi-head self-attention.

  - `self.feed_forward`: An instance of the `FeedForward` class, responsible for handling the feedforward neural network component.

  - `self.norm1` and `self.norm2`: These are layer normalization modules applied after the multi-head self-attention and feedforward components, respectively. They serve to stabilize training.

#### Forward Method (`forward`) ####

- `forward(self, x, mask)`: The `forward` method conducts the forward pass within the `EncoderLayer`.

  - `x`: A tensor representing the input data undergoing processing within the encoder layer.

  - `mask`: An optional mask tensor employed to mask specific elements during attention computation, often used to handle padding tokens.

  - The process begins with the input tensor `x` passing through the multi-head self-attention mechanism (`self.multihead_attn`). This mechanism computes an attention-based output, which is added back to the original input tensor (`x`). Subsequently, layer normalization (`self.norm1`) is applied to stabilize the output.

  - Following the attention operation, the normalized output proceeds through the feedforward neural network (`self.feed_forward`). The result is added to the feedforward layer's input (the output of `self.norm1`, not the original `x`) as a second residual connection, followed by another layer normalization step (`self.norm2`).

  - The method then delivers the final output tensor, which represents the processed data with attention and feedforward transformations applied.

The `EncoderLayer` class is a critical component within the encoder of a transformer-based model. It plays a central role in processing input data, capturing contextual information through self-attention, and introducing non-linearity through feedforward layers. Layer normalization is utilized to enhance training stability and contributes to the model's ability to discern intricate patterns within the data.


In [43]:

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(EncoderLayer, self).__init__()
        
        # Step 1: Initialize sub-layers
        
        # Multi-Head Self-Attention Layer
        self.multihead_attn = MultiHeadAttention(d_model, num_heads)
        
        # Feed-Forward Neural Network Layer
        self.feed_forward = FeedForward(d_model, d_ff)
        
        # Layer Normalization for the first sub-layer
        self.norm1 = nn.LayerNorm(d_model)
        
        # Layer Normalization for the second sub-layer
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x, mask):
        # Step 2: Multi-Head Self-Attention Sub-Layer
        
        # Calculate attention output and attention weights
        attn_output, _ = self.multihead_attn(x, x, x, mask)
        
        # Add the attention output to the input (residual connection)
        x = x + attn_output
        
        # Apply layer normalization to the result
        x = self.norm1(x)
        
        # Step 3: Feed-Forward Neural Network Sub-Layer
        
        # Apply the feed-forward neural network to the output
        ff_output = self.feed_forward(x)
        
        # Add the feed-forward output to the previous result (residual connection)
        x = x + ff_output
        
        # Apply layer normalization to the final result
        x = self.norm2(x)
        
        # Return the output of the encoder layer
        return x


The output of the encoder layer is a tensor with the same shape as its input, in which each position's representation incorporates context from the other positions in the sequence. The exact values depend on the input data and the learned weights of the attention and feedforward components.
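
The attention code adds `mask * -1e9` to scores of shape `(batch, num_heads, query_len, key_len)`, so a padding mask needs to broadcast against that shape and hold 1 where a token is padding. One possible way to build such a mask is sketched below. The padding index `PAD_IDX = 0` and the helper name `make_padding_mask` are assumptions, since the notebook does not define them in this section.

```python
import torch

PAD_IDX = 0  # hypothetical padding index, chosen for illustration

def make_padding_mask(token_ids, pad_idx=PAD_IDX):
    # (batch, seq_len) -> (batch, 1, 1, seq_len), with 1.0 at padding positions
    return (token_ids == pad_idx).float().unsqueeze(1).unsqueeze(2)

batch = torch.tensor([[5, 7, 9, 0, 0],
                      [3, 4, 0, 0, 0]])
mask = make_padding_mask(batch)
print(mask.shape)
print(mask[0, 0, 0])
```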



### DecoderLayer Class ###

The `DecoderLayer` class is a PyTorch module that represents an essential building block within the decoder component of a transformer-based neural network. This class is fundamental in tasks that involve processing sequences, such as machine translation.

#### Constructor Method (`__init__`) ####

- `__init__(self, d_model, num_heads, d_ff)`: The constructor method for the `DecoderLayer` class initializes a single decoder layer. It accepts the following parameters:

  - `d_model`: This denotes the dimensionality of the data processed by the layer, often referred to as the model dimension. It defines the dimension of the input and output data.
  
  - `num_heads`: The number of attention heads employed in the multi-head attention mechanisms within this layer.
  
  - `d_ff`: The dimensionality of the intermediate hidden layer within the feedforward neural network.

  - `self.masked_multihead_attn`: An instance of the `MultiHeadAttention` class, responsible for performing masked multi-head self-attention. It allows the layer to attend to previous positions in the decoder sequence while avoiding information leakage.
  
  - `self.multihead_attn`: Another instance of the `MultiHeadAttention` class, used for multi-head attention between the decoder input and the encoder output. This enables the decoder to consider pertinent information from the encoder.
  
  - `self.feed_forward`: An instance of the `FeedForward` class, which represents a feedforward neural network used for further data transformations.
  
  - `self.norm1`, `self.norm2`, `self.norm3`: Layer normalization layers applied after each sub-layer in the decoder.

#### Forward Method (`forward`) ####

- `forward(self, x, enc_output, tgt_mask, src_mask)`: The `forward` method carries out the forward pass within the decoder layer. It takes the following input parameters:

  - `x`: The input tensor for the decoder layer: the embedded target sequence (or the output of the previous decoder layer), with shape `(batch, tgt_len, d_model)`.
  
  - `enc_output`: The encoder output, containing relevant source sequence information. It is utilized for attending to pertinent source details during the decoding process.
  
  - `tgt_mask`: A mask applied to the target sequence, typically used to prevent attending to future positions.
  
  - `src_mask`: A mask applied to the source sequence, often used to mask padding tokens in the encoder output.

  - The method performs the following operations:

    1. Masked Multi-Head Self-Attention: `x` attends to itself with masking to prevent information from future positions. The result is added to `x`, followed by layer normalization.
    
    2. Multi-Head Attention: `x` attends to the encoder output `enc_output`, allowing the decoder to consider relevant source information. Once again, the result is added to `x`, followed by layer normalization.
    
    3. Feedforward Transformation: `x` undergoes transformation through the feedforward neural network, and the result is added to `x`, followed by layer normalization.

  - The method returns the final output tensor, which signifies the processed decoder input.




In [44]:


class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(DecoderLayer, self).__init__()
        
        # Step 1: Initialize sub-layers
        
        # Masked Multi-Head Self-Attention Layer for the target sequence
        self.masked_multihead_attn = MultiHeadAttention(d_model, num_heads)
        
        # Multi-Head Attention Layer for attending to the encoder output (cross-attention)
        self.multihead_attn = MultiHeadAttention(d_model, num_heads)
        
        # Feed-Forward Neural Network Layer
        self.feed_forward = FeedForward(d_model, d_ff)
        
        # Layer Normalization for the first sub-layer
        self.norm1 = nn.LayerNorm(d_model)
        
        # Layer Normalization for the second sub-layer
        self.norm2 = nn.LayerNorm(d_model)
        
        # Layer Normalization for the third sub-layer
        self.norm3 = nn.LayerNorm(d_model)
    
    def forward(self, x, enc_output, tgt_mask, src_mask):
        # Step 2: Masked Multi-Head Self-Attention Sub-Layer
        
        # Calculate masked attention output and attention weights for the target sequence
        attn_output, _ = self.masked_multihead_attn(x, x, x, tgt_mask)
        
        # Add the masked attention output to the input (residual connection)
        x = x + attn_output
        
        # Apply layer normalization to the result
        x = self.norm1(x)
        
        # Step 3: Multi-Head Attention Sub-Layer (Encoder-Decoder Attention)
        
        # Calculate attention output and attention weights between decoder and encoder outputs
        attn_output, _ = self.multihead_attn(x, enc_output, enc_output, src_mask)
        
        # Add the attention output to the previous result (residual connection)
        x = x + attn_output
        
        # Apply layer normalization to the result
        x = self.norm2(x)
        
        # Step 4: Feed-Forward Neural Network Sub-Layer
        
        # Apply the feed-forward neural network to the output
        ff_output = self.feed_forward(x)
        
        # Add the feed-forward output to the previous result (residual connection)
        x = x + ff_output
        
        # Apply layer normalization to the final result
        x = self.norm3(x)
        
        # Return the output of the decoder layer
        return x


The `DecoderLayer` class combines masked self-attention, attention over the encoder output, and a feedforward layer to transform its input within a single decoder layer. The output tensor holds, for every position of the target sequence, a representation that accounts for both the earlier target positions and the source information. A full decoder stacks several such layers, each consuming the output of the previous one.
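
The `tgt_mask` that blocks attention to future positions can be built from an upper-triangular matrix. The sketch below assumes the same convention as the attention code (1 marks a position to hide); the helper name `make_causal_mask` is hypothetical.

```python
import torch

def make_causal_mask(seq_len):
    # 1.0 strictly above the diagonal (future positions), 0.0 on and below it
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

mask = make_causal_mask(4)
print(mask)
```

Row `i` of the mask leaves positions `0..i` visible and hides everything after them. The `(seq_len, seq_len)` shape broadcasts across the batch and head dimensions of the attention scores.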


### Transformer Class ###

The `Transformer` class is a PyTorch module designed to implement the Transformer neural network architecture, a versatile model frequently used in natural language processing tasks like machine translation and text generation.

#### Constructor Method (`__init__`) ####

- `__init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff)`: The constructor method initializes the Transformer model, configuring various critical components:

  - `src_vocab_size`: The size of the source vocabulary, representing the number of unique tokens in the source language.

  - `tgt_vocab_size`: The size of the target vocabulary, representing the number of unique tokens in the target language.

  - `d_model`: The model dimension, which defines the dimensionality of embeddings and internal representations.

  - `num_heads`: The number of attention heads employed in multi-head self-attention mechanisms.

  - `num_layers`: Specifies the number of encoder and decoder layers within the Transformer model.

  - `d_ff`: Denotes the dimension of the feedforward network within each encoder and decoder layer.

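  - The constructor creates a single embedding layer (`self.embedding`, sized by `src_vocab_size`) that converts token indices into dense vectors. The `forward` method uses it for both the source and the target sequence, so the code as written assumes the two share a vocabulary (target indices must be smaller than `src_vocab_size`).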

  - It also sets up a positional encoding layer responsible for infusing positional information into the input embeddings.

  - Lists of encoder and decoder layers are prepared to handle the source and target language information.

  - Lastly, a linear layer (`self.fc_out`) is configured for generating the final output in the target language vocabulary size.

#### Forward Method (`forward`) ####

- `forward(self, src, tgt, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask)`: The `forward` method orchestrates the forward pass through the Transformer model.

  - `src`: Represents the source input tensor, typically embodying the source language sequence.

  - `tgt`: Corresponds to the target input tensor, typically serving as the target language sequence during training (utilizing teacher forcing) or for generating output during inference.

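  - `src_mask`, `tgt_mask`, `src_padding_mask`, `tgt_padding_mask`: Masks that guide the model's attention. As written, `forward` uses only `src_padding_mask` (in the encoder and in the decoder's attention over the encoder output) and `tgt_mask` (in the decoder's masked self-attention); `src_mask` and `tgt_padding_mask` are accepted but not used.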

  - The method begins by embedding the source and target inputs and enhancing them with positional information.

  - Subsequently, the source input undergoes encoding through a sequence of encoder layers, followed by the target input passing through decoder layers.

  - The decoder's output is passed through a linear layer (`self.fc_out`) to produce, for each target position, a vector of unnormalized scores (logits) over the target vocabulary.

  - The method ultimately returns the model's output.




In [45]:

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff):
        super(Transformer, self).__init__()
        
        # Step 1: Initialize components
        
        # Embedding layer (sized by src_vocab_size; forward() also applies it to the target sequence)
        self.embedding = nn.Embedding(src_vocab_size, d_model)
        
        # Positional encoding layer
        self.pos_encoder = PositionalEncoding(d_model)
        
        # Create a stack of encoder layers
        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])
        
        # Create a stack of decoder layers
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])
        
        # Output layer to generate target sequence
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
    
    def forward(self, src, tgt, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask):
        # Step 2: Encode the source sequence
        
        # Embed the source sequence
        src = self.embedding(src)
        
        # Apply positional encoding to the source sequence
        src = self.pos_encoder(src)
        
        # Pass the source sequence through the stack of encoder layers
        for layer in self.encoder_layers:
            src = layer(src, src_padding_mask)
        
        # Step 3: Decode the target sequence
        
        # Embed the target sequence
        tgt = self.embedding(tgt)
        
        # Apply positional encoding to the target sequence
        tgt = self.pos_encoder(tgt)
        
        # Pass the target sequence through the stack of decoder layers
        for layer in self.decoder_layers:
            tgt = layer(tgt, src, tgt_mask, src_padding_mask)
        
        # Step 4: Generate the output sequence
        
        # Apply a linear transformation to generate the output
        output = self.fc_out(tgt)
        
        # Return the output sequence
        return output


The model's output is a tensor of logits with shape `(batch, tgt_len, tgt_vocab_size)`: for each target position, one score per token in the target vocabulary. Applying softmax or argmax to these scores yields the predicted tokens. Models of this kind are used for tasks such as machine translation, text summarization, and text generation.
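
The sketch below illustrates how an output of this shape is typically turned into a training loss and into predicted token ids. It uses a random tensor as a stand-in for the model output, and the sizes are arbitrary values for illustration; the notebook's own training loop may differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, tgt_len, tgt_vocab_size = 2, 6, 11               # illustrative sizes
logits = torch.randn(batch, tgt_len, tgt_vocab_size)    # stand-in for the model output
targets = torch.randint(0, tgt_vocab_size, (batch, tgt_len))

# Cross-entropy expects (N, num_classes) logits and (N,) class indices
loss = F.cross_entropy(logits.reshape(-1, tgt_vocab_size), targets.reshape(-1))

# Most likely token at each position, and the full probability distribution
predicted_ids = logits.argmax(dim=-1)
probs = F.softmax(logits, dim=-1)
print(loss.item(), predicted_ids.shape)
```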


### Download Stopwords ###

The code snippet `nltk.download('stopwords')` is a Python command used to download a set of stopwords from the Natural Language Toolkit (NLTK) library. NLTK is a widely used library in natural language processing (NLP) and provides various tools and resources for working with human language data.

#### `nltk.download()` Method ####

- `nltk.download('stopwords')`: This command invokes the `download()` method from the NLTK library, specifically requesting the download of a stopwords dataset.


In [8]:
# Download stopwords 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\i\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Dataset Loading and Vocabulary Generation ###

This code snippet serves two primary purposes: loading a dataset from a CSV file and creating a vocabulary by tokenizing text.

#### Dataset Loading ####

- `data = pd.read_csv('train.csv')`: This line of code employs the Pandas library to read data from the 'train.csv' file, storing it as a DataFrame. DataFrames are versatile tabular structures used for data handling and analysis in Python.

#### Function for Tokenization and Vocabulary Building ####

- `def create_vocab(data)`: This function is designed to tokenize text and construct a vocabulary based on a provided dataset.

  - `data`: The function expects a dataset as input, which should include a 'comment_text' column containing textual data.

  - Inside the function, an empty list named 'tokens' is initialized to hold individual word tokens extracted from the text.

  - The function then iterates through each row in the 'comment_text' column, splits the text into words on whitespace, and appends these word tokens to the 'tokens' list. Because no other cleaning is applied, tokens keep their original case and any attached punctuation.

  - After processing all the text data, the function employs the `Counter` class from the collections module to create a vocabulary. This class counts the frequency of each unique word token in the 'tokens' list.

  - The outcome is a vocabulary that pairs words with their respective frequencies.

  - Finally, the function returns the constructed vocabulary as its output.

#### Purpose ####

- The primary objective of dataset loading is to acquire a collection of data records for subsequent analysis or application in machine learning tasks. In this instance, the data is stored in the 'train.csv' file and is read into a Pandas DataFrame for further processing.

- The `create_vocab()` function is intended for text data preprocessing. It tokenizes text by breaking it into individual words and proceeds to generate a vocabulary by counting how frequently each word appears in the text. This vocabulary can prove invaluable in numerous applications, such as text analysis, natural language processing, and feature engineering.

- The creation of a vocabulary is an essential step when working with text data, as it provides insights into word distribution within the dataset and can facilitate various text-related tasks and machine learning endeavors.


In [9]:
# Load the dataset
data = pd.read_csv('train.csv')

# Define a function to tokenize text and create a vocabulary
def create_vocab(data):
    # Initialize an empty list to store tokens
    tokens = []
    
    # Iterate through each comment_text in the dataset
    for text in data['comment_text']:
        # Split the text into tokens using whitespace as the separator
        tokens.extend(text.split())
    
    # Create a vocabulary by counting the frequency of each token
    vocab = Counter(tokens)
    
    # Return the vocabulary
    return vocab


### Vocabulary Creation from Dataset ###
The call `vocab = create_vocab(data)` builds a vocabulary that holds each word in the dataset together with its frequency. The later preprocessing and modeling steps rely on this vocabulary.


In [10]:
# Create a vocabulary from the dataset
vocab = create_vocab(data)

The result, `vocab`, is a `Counter` (a subclass of Python's `dict`) whose keys are the unique words in the dataset and whose values are how often each word occurs. In other words, it summarizes which words appear in the dataset and how frequently.
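As a small illustration of what `create_vocab` returns, the sketch below applies the same split-and-count approach to three made-up comments. The sample strings are assumptions for illustration, not rows from `train.csv`.

```python
from collections import Counter

# Hypothetical sample comments, standing in for data['comment_text']
sample_comments = ["you are great", "you are not great", "Great job"]

tokens = []
for text in sample_comments:
    # Same tokenization as create_vocab: split on whitespace only
    tokens.extend(text.split())

toy_vocab = Counter(tokens)

print(toy_vocab["you"])    # 2
print(toy_vocab["great"])  # 2
print(toy_vocab["Great"])  # 1 (counted separately: there is no lowercasing)
print(len(toy_vocab))      # 6 unique tokens
```

Because the text is split on whitespace only, differences in casing and attached punctuation produce distinct tokens, which likely contributes to the large vocabulary size reported later.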

### Function for Converting Text to Numeric Tokens ###

The provided code introduces a Python function designed to transform text data into a sequence of numerical tokens. This operation is fundamental in natural language processing (NLP) and machine learning when dealing with text-based data.

#### Function Definition ####

- `def text_to_tensor(text, vocab)`: This code defines a function named `text_to_tensor` with two input parameters:

  - `text`: Represents the input text data that needs to be converted into numerical tokens.

  - `vocab`: The vocabulary used to look up each word. Note that the `vocab` returned by `create_vocab` is a `Counter`, so its values are word frequencies rather than unique word indices. Each word is therefore represented by its frequency, and words that occur equally often receive the same number.

#### Tokenization and Mapping to Numerical Tokens ####

- Within the function:

  - `tokens = [vocab[token] for token in text.split() if token in vocab]`: This comprehension is a compact equivalent of the loop in the code cell. It splits the input `text` on whitespace (`text.split()`), keeps only the tokens found in `vocab`, and replaces each with the value stored for it in `vocab`. Tokens missing from the vocabulary are silently dropped.

  - `torch.LongTensor(tokens)`: Lastly, the `tokens` list, containing the numerical representations of the words within the input text, is converted into a PyTorch LongTensor. A LongTensor is frequently employed to represent sequences of integers.

#### Purpose ####

- The primary objective of the `text_to_tensor` function is to simplify the conversion of unprocessed textual data into a format compatible with machine learning models. By translating text into numerical tokens, it enables the model to process and analyze textual information effectively, making it suitable for various NLP tasks, such as text classification, sentiment analysis, or language modeling.

- The function relies on a predetermined vocabulary (`vocab`) to establish a mapping from words to their corresponding numerical tokens. This mapping ensures that the model can work seamlessly with numerical data, simplifying training and data processing.




In [46]:
# Define a function to convert text to numerical tokens
def text_to_tensor(text, vocab):
    # Initialize an empty list called 'tokens' to store numerical representations of tokens
    tokens = []
    
    # Split the input 'text' into tokens using whitespace as the separator
    text_tokens = text.split()
    
    # Iterate through each token in the 'text_tokens'
    for token in text_tokens:
        # Check if the token exists in the provided 'vocab' dictionary
        if token in vocab:
            # If the token is in the vocabulary, add its numerical representation to the 'tokens' list
            tokens.append(vocab[token])
    
    # Convert the list of numerical tokens to a PyTorch LongTensor
    tensor_representation = torch.LongTensor(tokens)
    
    # Return the tensor representation of the text
    return tensor_representation


The output of `text_to_tensor` is a `LongTensor` containing one integer per in-vocabulary token of the input text, a form that downstream PyTorch layers can consume.
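Since `create_vocab` returns frequency counts, `vocab[token]` yields a count rather than a unique ID. A common alternative is to give every word its own index. The following is a minimal sketch of that idea, not the notebook's method; `build_word_index`, `encode`, `PAD_IDX`, and `UNK_IDX` are hypothetical names introduced here.

```python
from collections import Counter

PAD_IDX = 0  # hypothetical: reserved for padding
UNK_IDX = 1  # hypothetical: reserved for out-of-vocabulary tokens

def build_word_index(counts):
    """Assign each token a unique integer ID, most frequent first."""
    word_to_idx = {}
    for offset, (token, _) in enumerate(counts.most_common()):
        word_to_idx[token] = offset + 2  # IDs 0 and 1 are reserved
    return word_to_idx

def encode(text, word_to_idx):
    """Map whitespace-split tokens to IDs, using UNK_IDX for unseen tokens."""
    return [word_to_idx.get(token, UNK_IDX) for token in text.split()]

counts = Counter("a b a c a b".split())  # a: 3, b: 2, c: 1
word_to_idx = build_word_index(counts)

print(word_to_idx)                     # {'a': 2, 'b': 3, 'c': 4}
print(encode("a c zzz", word_to_idx))  # [2, 4, 1]
```

With a mapping like this, the embedding table would need `len(word_to_idx) + 2` rows, and the encoded list can be wrapped in `torch.LongTensor` exactly as before.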

### Custom Data Processing and Text Classification Model ###

The given code snippet encompasses two critical components: a custom collate function (`custom_collate_fn`) and the definition of a straightforward text classification model (`TextClassifier`).

#### Custom Collate Function (`custom_collate_fn`) ####

- `def custom_collate_fn(batch, vocab)`: This code introduces a customized collate function responsible for processing a batch of data samples.

  - `batch`: Represents a collection of data samples, with each sample usually comprising a tuple containing a comment (text) and its corresponding label.

  - `vocab`: A vocabulary used to map words in the text to numerical tokens.

  - Within the function, comments and labels are extracted from the batch using `zip(*batch)`. The comments are then tokenized and translated into numerical tensors through the `text_to_tensor` function (defined previously). These tensors are subsequently adjusted to have uniform lengths by utilizing `pad_sequence`.

  - The labels are converted into PyTorch FloatTensors.

  - The function yields the processed comments and labels as its output.
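To make the padding step concrete, here is a small sketch of what `pad_sequence` does with sequences of different lengths. The token values are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three token sequences of different lengths (made-up values)
sequences = [
    torch.LongTensor([5, 8, 2]),
    torch.LongTensor([7]),
    torch.LongTensor([3, 9]),
]

# batch_first=True gives shape (batch, max_len); shorter rows are right-padded with 0
padded = pad_sequence(sequences, batch_first=True)

print(padded.shape)     # torch.Size([3, 3])
print(padded.tolist())  # [[5, 8, 2], [7, 0, 0], [3, 9, 0]]
```

The default padding value is 0, so padded positions cannot be told apart from a genuine token whose numerical value is 0 unless that value is reserved for padding.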

#### Text Classification Model (`TextClassifier`) ####

- `class TextClassifier(nn.Module)`: This segment of code defines a simple text classification model built with PyTorch.

  - `__init__(self, vocab_size, embedding_dim, hidden_dim, output_dim)`: The constructor initializes the model with specific parameters:

    - `vocab_size`: Reflects the vocabulary's size, indicating the count of unique words within the text data.

    - `embedding_dim`: Specifies the dimensionality of word embeddings, which defines how words are represented in a continuous vector space.

    - `hidden_dim`: Denotes the dimensionality of the hidden layer within the neural network.

    - `output_dim`: Represents the dimensionality of the output layer, corresponding to the number of classes or categories in the text classification task.

  - Within the constructor, the model comprises an embedding layer (`self.embedding`) used for converting numerical tokens into dense word embeddings. These embeddings are subsequently passed through a fully connected neural network (`self.fc`) equipped with ReLU activation functions to capture patterns and relationships in the data.

  - `forward(self, text)`: The `forward` method defines the data flow through the model. It accepts a batch of text data as input and returns the model's output.

    - The input text undergoes embedding using the `embedding` layer.

    - The embeddings are averaged across the sequence dimension (`embedded.mean(1)`), producing a single vector (`pooled`) per comment that summarizes the entire input sequence. Positions added by padding are included in this average.

    - This pooled vector is then fed through the fully connected neural network (`self.fc`) to yield the final output of the model.

  - The model is crafted for text classification tasks, enabling it to be trained for predicting the category or label of text data.




In [47]:
# Define a custom collate function with vocab as an argument
def custom_collate_fn(batch, vocab):
    # Unzip the batch into 'comments' and 'labels'
    comments, labels = zip(*batch)
    
    # Convert text comments to numerical tensors using the provided vocabulary
    padded_comments = [text_to_tensor(comment, vocab) for comment in comments]
    
    # Pad the numerical comment tensors to have the same length
    padded_comments = pad_sequence(padded_comments, batch_first=True)
    
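    # Stack the per-sample label arrays into one array, then convert to a float tensor
    # (building a tensor directly from a tuple of NumPy arrays is slow and triggers a warning)
    labels = torch.tensor(np.stack(labels), dtype=torch.float32)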
    
    # Return the padded comments and labels
    return padded_comments, labels

# Define a simple text classification model
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(TextClassifier, self).__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # Fully connected layers for classification
        self.fc = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),  # Input to hidden layer
            nn.ReLU(),  # ReLU activation function
            nn.Linear(hidden_dim, output_dim)  # Hidden layer to output
        )

    def forward(self, text):
        # Embed the input text
        embedded = self.embedding(text)
        
        # Perform average pooling over the sequence dimension
        pooled = embedded.mean(1)  # Average pooling over the sequence dimension
        
        # Pass the pooled representation through the fully connected layers
        output = self.fc(pooled)
        
        # Return the output
        return output


Together, the collate function turns a batch of (comment, label) pairs into a padded tensor of numerical tokens and a float tensor of labels, and the model maps that padded tensor to one raw score (logit) per output dimension for each comment. How those scores are used depends on the surrounding training or inference code.
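The pooling step `embedded.mean(1)` averages over every position, including those added by `pad_sequence`. One possible refinement is a masked mean that ignores padding. The sketch below assumes ID 0 is reserved for padding, which is not how the notebook's vocabulary is set up; `masked_mean` and `PAD_IDX` are hypothetical names.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumption: ID 0 is used only for padding

def masked_mean(embedded, token_ids, pad_idx=PAD_IDX):
    """Average embeddings over non-padding positions only."""
    mask = (token_ids != pad_idx).unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)                # (batch, embedding_dim)
    lengths = mask.sum(dim=1).clamp(min=1.0)             # avoid division by zero
    return summed / lengths

torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=PAD_IDX)
token_ids = torch.LongTensor([[5, 8, 2], [7, 0, 0]])

embedded = embedding(token_ids)            # shape (2, 3, 4)
pooled = masked_mean(embedded, token_ids)  # shape (2, 4)

print(pooled.shape)  # torch.Size([2, 4])
# For the second row, the masked mean equals the embedding of token 7 alone
print(torch.allclose(pooled[1], embedding(torch.LongTensor([7]))[0]))  # True
```

Whether this helps in practice depends on how much comment lengths vary within a batch; it is offered as an option rather than a required change.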

### Calculating Vocabulary Size and Printing ###

The provided code snippet involves calculating the size of a vocabulary and printing out the result.

#### Vocabulary Size Calculation ####

- `vocab_size = len(vocab)`: In this line of code, the variable `vocab_size` is assigned the value of the length of a previously constructed vocabulary, which is stored in the `vocab` variable. The vocabulary represents a collection of unique words from a dataset and their corresponding frequency counts.


In [29]:
vocab_size = len(vocab)
print("Vocabulary size:", vocab_size)

Vocabulary size: 532299


### Hyperparameters, Model Creation, Loss Function, and Optimization ###

The provided code snippet encompasses several significant steps, including the establishment of hyperparameters, the creation of a text classification model, and the configuration of a loss function and optimizer.

#### Definition of Hyperparameters ####

- `vocab_size = 532299`: The variable `vocab_size` is assigned `532299`, the vocabulary size printed by the previous cell. It is the number of unique whitespace-separated tokens in the dataset and sets the number of rows in the embedding table.

- `embedding_dim = 100`: Here, `embedding_dim` is set to `100`, specifying the dimensionality of word embeddings. These embeddings serve as continuous vectors representing words, capturing their semantic relationships.

- `hidden_dim = 128`: The `hidden_dim` variable is assigned the value `128`, which represents the dimensionality of the hidden layer within the neural network. This dimension influences the complexity of the model's internal representations.

- `output_dim = 6`: `output_dim` is set to `6`, the dimensionality of the model's output. Here it matches the six toxicity labels (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`), each of which is predicted independently.

#### Model Instantiation ####

- `model = TextClassifier(vocab_size, embedding_dim, hidden_dim, output_dim)`: In this line, an instance of the `TextClassifier` model is generated by providing the previously defined hyperparameters. This instantiation prepares the model for performing text classification tasks.

#### Loss Function and Optimizer Configuration ####

- `criterion = nn.BCEWithLogitsLoss()`: Here, a loss function is defined. The `nn.BCEWithLogitsLoss()` function is commonly used for tasks involving binary or multilabel classification. It calculates the loss based on the model's logits (unnormalized scores) and the target labels.

- `optimizer = optim.Adam(model.parameters(), lr=0.001)`: An optimizer is configured using the Adam optimization algorithm. Its role is to adjust the model's parameters during training to minimize the specified loss. The learning rate (`lr`) is set to `0.001`, controlling the step size for parameter updates.

#### Purpose ####

- The hyperparameters determine the model's configuration and the training process. The values used here are starting points that can be tuned to the specific task and dataset.

- Model instantiation creates an instance of the text classification model with the defined hyperparameters, making it ready for training and inference.

- The choice of the loss function (`BCEWithLogitsLoss`) and optimizer (Adam) is pivotal for effective model training. The loss function quantifies the error between model predictions and target labels, while the optimizer fine-tunes model parameters to minimize this error during the training process.



In [48]:
# Define hyperparameters and instantiate the model
vocab_size = 532299  # The size of the vocabulary
embedding_dim = 100  # Dimension of word embeddings
hidden_dim = 128  # Dimension of the hidden layer in the classifier
output_dim = 6  # Number of output labels (multi-label task: six independent toxicity categories)

# Create an instance of the TextClassifier model
model = TextClassifier(vocab_size, embedding_dim, hidden_dim, output_dim)

# Define loss function and optimizer
criterion = nn.BCEWithLogitsLoss()  # Binary Cross-Entropy loss with logits
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer with a learning rate of 0.001


This cell sets the hyperparameters, instantiates the model, and defines the loss function and optimizer needed to train the text classifier.
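To show what a binary cross-entropy loss on logits computes for each label, the sketch below writes out the formula in plain Python. It illustrates the mathematics only and is not PyTorch's implementation; `bce_with_logits` is a hypothetical helper, and the scores and labels are made up.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw score, in a numerically stable form.

    Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# One comment with six raw scores and six independent 0/1 labels (made-up values)
logits = [2.0, -1.0, 0.0, -3.0, 0.5, -2.0]
labels = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

per_label = [bce_with_logits(x, t) for x, t in zip(logits, labels)]
mean_loss = sum(per_label) / len(per_label)  # average over all labels

print(round(bce_with_logits(0.0, 0.0), 4))  # 0.6931, i.e. log(2) for an undecided score
print(round(mean_loss, 4))
```

Each label contributes its own term, which is why this loss suits multilabel problems where a comment can belong to several categories at once.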

### Custom Dataset Handling Class ###

The provided code snippet introduces a specialized Dataset class called `CustomDataset`. Its primary purpose is to facilitate data management for machine learning tasks, particularly text classification.

#### Constructor Method (`__init__`) ####

- `__init__(self, data, vocab)`: This serves as the class's constructor method. It initializes the dataset and accepts two key parameters:

  - `data`: Represents the dataset containing text comments and their respective labels. Typically, this dataset is structured as a pandas DataFrame or a similar data structure.

  - `vocab`: Refers to a vocabulary used for translating words in the text comments into numerical tokens. The assumption is that this vocabulary has been pre-constructed.

#### `__len__` Method ####

- `def __len__(self)`: This method is implemented to determine the dataset's length, which corresponds to the total number of samples it contains. It accomplishes this by returning the length of the `data` attribute, which signifies the number of rows or examples in the dataset.

#### `__getitem__` Method ####

- `def __getitem__(self, idx)`: This method is responsible for retrieving a specific data sample from the dataset based on its index (`idx`).

  - Within this method, both the text comment and its associated label are extracted from the dataset using the provided index.

  - `comment` holds the text content of the comment, typically stored as a string.

  - `label` stores the labels linked to the comment. These labels are usually represented as an array of binary values, where each element corresponds to a specific category (e.g., toxic, obscene). These values are converted to floats for consistency.

  - As its output, the method returns a tuple that includes the text comment and its corresponding label.

#### Purpose ####

- The primary purpose of the `CustomDataset` class is to simplify the organization and access of text data and labels for machine learning tasks. It encapsulates the data and offers methods for determining the dataset's size and retrieving individual samples by their indices.

- By tailoring the dataset class to meet specific data formats and needs, users can ensure compatibility with various machine learning libraries and frameworks.


In [49]:
# Define a custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, data, vocab):
        self.data = data  # The dataset containing comments and labels
        self.vocab = vocab  # The vocabulary used to convert text to numerical tokens

    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.data)

    def __getitem__(self, idx):
        # Get a single sample from the dataset at the specified index 'idx'

        # Extract the comment text from the dataset at the given index
        comment = self.data.iloc[idx]['comment_text']

        # Extract the corresponding labels for toxicity categories
        label = self.data.iloc[idx][['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].values.astype(float)

        # Return a tuple containing the comment text (as a string) and its labels (as a float array)
        return comment, label


This cell defines a custom dataset class with methods for reporting its size and retrieving individual samples.

### Creating a CustomDataset Instance for Training and Testing ###

The provided code snippet creates a single instance of the `CustomDataset` class that wraps the full dataset. This instance is divided into training and testing subsets in the next step.

#### CustomDataset Instance Creation ####

- `train_dataset = CustomDataset(data, vocab)`: In this line of code, an instance of the `CustomDataset` class is created and assigned to the variable `train_dataset`.

  - `data`: This parameter refers to the dataset that contains text comments and their corresponding labels. It is assumed to be structured as a pandas DataFrame or a similar data structure.

  - `vocab`: Represents a pre-defined vocabulary used to map words in the text comments to numerical tokens. This vocabulary is crucial for converting text data into a format that can be processed by machine learning models.

#### Purpose ####

- The primary purpose of creating instances of the `CustomDataset` class is to organize and structure the data for machine learning tasks, such as text classification.

- By instantiating the `CustomDataset` class, you prepare the data for use in machine learning models, ensuring that it can be efficiently accessed and processed during training and evaluation.

- The instance is later split into training and testing sets, allowing you to evaluate the model's performance on unseen data.



In [25]:
# Create a CustomDataset instance over the full dataset (split into train and test sets later)
train_dataset = CustomDataset(data, vocab)

### Splitting the Dataset into Train and Test Sets ###

The provided code snippet focuses on dividing the dataset into two separate sets: a training set and a testing set. This is a common practice in machine learning to assess a model's performance on unseen data.

#### Splitting the Dataset ####

- `train_size = int(0.6 * len(train_dataset))`: In this line of code, the variable `train_size` is determined by calculating 60% of the total length of the `train_dataset`. This value is the size of the training set. The 60/40 ratio is one choice among several; splits such as 80/20 are also common.

- `test_size = len(train_dataset) - train_size`: The variable `test_size` is computed as the remaining portion of the dataset after allocating the training set size. This corresponds to the size of the testing set.

- `train_dataset, test_dataset = torch.utils.data.random_split(train_dataset, [train_size, test_size])`: This line of code employs the `random_split` function from the PyTorch `torch.utils.data` module to perform the actual dataset split.

  - `train_dataset` is assigned the portion of the original dataset designated for training, which is of size `train_size`.

  - `test_dataset` is assigned the remaining portion of the dataset intended for testing, which has a size of `test_size`.

#### Purpose ####

- The primary purpose of splitting the dataset into training and testing sets is to evaluate a machine learning model's performance on unseen data. The training set is used to train the model, while the testing set is reserved for evaluating how well the model generalizes to new, unseen examples.

- The split ratio, in this case, is set to 60% for training and 40% for testing. However, the specific ratio can be adjusted based on the dataset's size and the requirements of the machine learning task.
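As written, the split differs on every run. If a reproducible split is wanted, `random_split` accepts a seeded generator. The sketch below uses a plain list as a stand-in for the `CustomDataset` instance, and the seed value is an arbitrary assumption.

```python
import torch
from torch.utils.data import random_split

# A plain list stands in for the CustomDataset instance (assumption for illustration)
toy_dataset = list(range(10))

train_size = int(0.6 * len(toy_dataset))
test_size = len(toy_dataset) - train_size

# Passing a seeded generator makes the split reproducible across runs
generator = torch.Generator().manual_seed(42)
train_subset, test_subset = random_split(
    toy_dataset, [train_size, test_size], generator=generator
)

print(len(train_subset), len(test_subset))  # 6 4
# The two subsets share no indices
print(set(train_subset.indices).isdisjoint(test_subset.indices))  # True
```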



In [50]:
# Split the dataset into train and test
train_size = int(0.6 * len(train_dataset))  # Define the size of the training set (60% of the dataset)
test_size = len(train_dataset) - train_size  # Calculate the size of the test set (remaining 40%)

# Use random_split to split the 'train_dataset' into 'train_dataset' and 'test_dataset'
train_dataset, test_dataset = torch.utils.data.random_split(train_dataset, [train_size, test_size])


### Setting Up DataLoader Instances with Custom Data Handling ###

The provided code snippet focuses on the creation of DataLoader instances, a crucial component in machine learning workflows for efficient data handling. These DataLoader instances are designed to streamline both the training and testing phases.

#### Creation of DataLoader Instances ####

- `train_loader = DataLoader(train_dataset, batch_size=64, collate_fn=partial(custom_collate_fn, vocab=vocab), shuffle=True)`: This line of code initializes a DataLoader instance named `train_loader` specifically for the training dataset.

  - `train_dataset`: This parameter specifies the dataset that the DataLoader will work with during training, which, in this case, is the training dataset itself.

  - `batch_size=64`: The `batch_size` parameter dictates how many data samples are processed simultaneously during each training iteration. It ensures efficient training while managing memory constraints.

  - `collate_fn=partial(custom_collate_fn, vocab=vocab)`: The `collate_fn` parameter is set to a custom data collation function. Using `partial`, additional arguments, such as the `vocab`, are supplied to the `custom_collate_fn` function. This ensures that the custom collate function has access to the vocabulary.

  - `shuffle=True`: With `shuffle` set to `True`, the training data is randomly reorganized during each epoch. This randomness prevents the model from learning any potential sequential patterns in the data order, promoting better model generalization.

- `test_loader = DataLoader(test_dataset, batch_size=64, collate_fn=partial(custom_collate_fn, vocab=vocab), shuffle=False)`: This line of code initializes a DataLoader instance named `test_loader` for the testing dataset.

  - `test_dataset`: Similar to the training DataLoader, this parameter specifies the dataset for testing purposes, which is the testing dataset.

  - `batch_size=64`: The batch size remains consistent between training and testing DataLoader instances, ensuring uniform data processing.

  - `collate_fn=partial(custom_collate_fn, vocab=vocab)`: The custom data collation function is also applied to the testing DataLoader to ensure consistent data handling.

  - `shuffle=False`: For the testing DataLoader, `shuffle` is set to `False`. Shuffling is unnecessary for evaluation, and a fixed order makes results easier to reproduce and to match back to individual samples.

#### Purpose ####

- DataLoader instances play a pivotal role in efficiently managing data during model training and evaluation. They enable batch processing, improving both training speed and memory utilization.

- The custom collate function (`custom_collate_fn`) allows for tailored data processing, including text tokenization and padding, ensuring that the data is well-suited for machine learning models.

- Shuffling the training data (`shuffle=True`) is crucial to prevent the model from learning any potential sequence-related patterns, thereby enhancing its ability to generalize to new data.



In [51]:
# Create DataLoader instances for training and testing with the custom collate_fn

# For the training dataset:
train_loader = DataLoader(
    train_dataset,            # The training dataset to be loaded
    batch_size=64,            # Batch size (number of samples per batch)
    collate_fn=partial(custom_collate_fn, vocab=vocab),  # Custom collate function with the provided vocabulary
    shuffle=True              # Shuffle the data during each epoch (for training)
)

# For the testing dataset:
test_loader = DataLoader(
    test_dataset,             # The testing dataset to be loaded
    batch_size=64,            # Batch size (number of samples per batch)
    collate_fn=partial(custom_collate_fn, vocab=vocab),  # Custom collate function with the provided vocabulary
    shuffle=False             # Do not shuffle the data (for testing/validation)
)


This cell creates `DataLoader` instances for training and testing that share the same custom collate function and batch size. The training loader reshuffles the data each epoch, while the test loader keeps a fixed order.
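The role of `partial` can be seen in isolation with a toy example. Here `toy_collate` and `toy_vocab` are hypothetical stand-ins for `custom_collate_fn` and the real vocabulary.

```python
from functools import partial

def toy_collate(batch, vocab):
    """Stand-in for custom_collate_fn: look up each token's value in vocab."""
    return [[vocab[tok] for tok in text.split() if tok in vocab] for text in batch]

toy_vocab = {"hello": 1, "world": 2}

# partial fixes the vocab keyword, leaving a one-argument callable
collate = partial(toy_collate, vocab=toy_vocab)

# DataLoader invokes collate_fn with a single argument: the list of samples
print(collate(["hello world", "hello there"]))  # [[1, 2], [1]]
```

This is why `partial` is needed: `DataLoader` passes only the batch to `collate_fn`, so any extra argument has to be bound in advance.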

### Training Loop ###

The provided code snippet defines a training loop, which is a fundamental part of training machine learning models. This loop iterates through batches of data, computes model predictions, calculates loss, and updates model parameters through optimization.

#### `train` Function ####

- `def train(model, iterator, optimizer, criterion)`: This is the training function that takes several key parameters:

  - `model`: Represents the machine learning model that will be trained. This should be an instance of a PyTorch neural network model.

  - `iterator`: Refers to the DataLoader iterator, which provides batches of training data. The training loop iterates through these batches.

  - `optimizer`: Specifies the optimization algorithm used for updating the model's parameters during training. It should be an instance of a PyTorch optimizer.

  - `criterion`: Denotes the loss function that quantifies the difference between model predictions and actual labels. It should be a PyTorch loss function.

- `model.train()`: This line sets the model's mode to "training." It's essential for certain layers (e.g., dropout and batch normalization) that behave differently during training and evaluation.

- `epoch_loss = 0`: Initializes a variable `epoch_loss` to keep track of the total loss during the entire training epoch.

- Loop through the training data batches:

  - `for batch in iterator:`: This loop iterates over batches of training data provided by the `iterator`.

    - `comments, labels = batch`: Each batch is unpacked into `comments` (representing the text comments) and `labels` (representing the corresponding ground truth labels).

    - `optimizer.zero_grad()`: The optimizer's gradient information is reset to zero at the beginning of each batch to prevent accumulation of gradients from previous batches.

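    - `predictions = model(comments).squeeze(1)`: The model makes predictions on the input `comments`. `.squeeze(1)` removes dimension 1 only when that dimension has size 1. With `output_dim = 6` the output has shape `(batch_size, 6)`, so the call leaves it unchanged; it would matter only for a model with a single output.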

    - `loss = criterion(predictions, labels)`: The loss between model predictions and actual labels is calculated using the specified loss function (`criterion`).

    - `loss.backward()`: Gradients are computed for all model parameters with respect to the calculated loss.

    - `optimizer.step()`: The optimizer updates the model's parameters based on the computed gradients to minimize the loss.

    - `epoch_loss += loss.item()`: The current batch's loss is added to `epoch_loss`, which accumulates the total loss for the entire epoch.

- `return epoch_loss / len(iterator)`: The training function returns the average loss over all batches in the epoch as a measure of how well the model is performing during training.

#### Purpose ####

- The training loop is a core component of training machine learning models. It iterates through training data, calculates loss, and updates model parameters to optimize the model's performance.

- Setting the model mode to "training" (`model.train()`) ensures that layers such as dropout behave correctly during training.

- Resetting the optimizer's gradients (`optimizer.zero_grad()`) before each batch prevents gradient accumulation issues.

- Calculating the loss and performing backpropagation (`loss.backward()`) are essential steps for training the model to minimize prediction errors.

- The average loss over all batches (`epoch_loss / len(iterator)`) is a crucial metric for monitoring the model's training progress and assessing its performance.
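The need to clear gradients before each batch can be demonstrated on a single scalar parameter. This is a minimal sketch, separate from the notebook's model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3*w)/dw = 3
(3 * w).backward()
print(w.grad.item())  # 3.0

# Second backward pass without clearing: the new gradient is added to the old one
(3 * w).backward()
print(w.grad.item())  # 6.0

# Clearing the gradient gives a clean slate, which is what optimizer.zero_grad()
# does for every parameter the optimizer manages
w.grad.zero_()
(3 * w).backward()
print(w.grad.item())  # 3.0
```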



In [52]:
# Training loop
def train(model, iterator, optimizer, criterion):
    # Set the model in training mode
    model.train()
    
    # Initialize epoch loss to 0
    epoch_loss = 0

    # Iterate over the data batches provided by the 'iterator'
    for batch in iterator:
        comments, labels = batch
        
        # Zero the gradients to clear any previous gradient calculations
        optimizer.zero_grad()
        
        # Forward pass: Compute predictions by passing 'comments' through the model
        predictions = model(comments).squeeze(1)  # squeeze(1) only drops dim 1 when it has size 1 (no effect for output_dim = 6)
        
        # Compute the loss between 'predictions' and 'labels'
        loss = criterion(predictions, labels)
        
        # Backpropagate the gradients
        loss.backward()
        
        # Update the model's parameters using the optimizer
        optimizer.step()
        
        # Accumulate the loss for the epoch
        epoch_loss += loss.item()

    # Return the average loss for the epoch
    return epoch_loss / len(iterator)


### Training Loop and Model Saving ###

The provided code snippet showcases a training loop that iteratively trains a machine learning model over a specified number of epochs. After training, the trained model is saved to a file for later use.

#### Training Loop ####

- `N_EPOCHS = 10`: This variable defines the number of training epochs, which determines how many times the entire training dataset will be processed by the model.

- Loop through the specified number of epochs:

  - `for epoch in range(N_EPOCHS):`: This loop iterates from 0 to `N_EPOCHS - 1`, representing each training epoch.

    - `train_loss = train(model, train_loader, optimizer, criterion)`: The `train` function (explained earlier) is called to train the model using the training data provided by `train_loader`. The training loss for the current epoch is computed and stored in `train_loss`.

    - `print(f'Epoch: {epoch+1:02}')`: This line prints the current epoch number, formatted with leading zeros for readability.

    - `print(f'\tTrain Loss: {train_loss:.3f}')`: The training loss for the current epoch is printed with three decimal places to monitor the training progress.

#### Model Saving ####

- `torch.save(model.state_dict(), 'text_classifier_model.pth')`: After completing all training epochs, this line saves the trained model's state dictionary to a file named 'text_classifier_model.pth'.

#### Purpose ####

- The training loop (`for epoch in range(N_EPOCHS)`) repeats the training process for the specified number of epochs, allowing the model to learn from the data over multiple iterations.

- Printing the epoch number and training loss provides insights into the training progress and helps in identifying potential issues.

- Saving the trained model to a file ('text_classifier_model.pth') allows for easy reuse of the trained model for inference, evaluation, or further training.
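To illustrate the reuse mentioned above, here is a minimal sketch of saving a state dictionary and loading it back. The small `build_model` network and the temporary file name are hypothetical stand-ins, not the notebook's `TextClassifier` or its `'text_classifier_model.pth'` file. The key assumption is that the same architecture is rebuilt before `load_state_dict` is called.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical stand-in for the notebook's TextClassifier (much smaller).
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 6))


trained = build_model()

# Save only the parameters (the state dict), as the notebook does.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pth")
torch.save(trained.state_dict(), path)

# Later: rebuild the same architecture, then load the saved parameters.
restored = build_model()
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to evaluation mode before inference

# Both models should now produce the same outputs for the same input.
x = torch.randn(2, 4)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
print(same_output)
```

Saving the state dictionary rather than the whole model object keeps the file independent of the class definition's location, at the cost of having to rebuild the model before loading.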



In [30]:
# Training loop
N_EPOCHS = 10  # Number of training epochs

# Iterate over the specified number of epochs
for epoch in range(N_EPOCHS):
    # Call the 'train' function to train the model for one epoch and compute the training loss
    train_loss = train(model, train_loader, optimizer, criterion)
    
    # Print training progress for the current epoch
    print(f'Epoch: {epoch+1:02}')  # Display the epoch number
    print(f'\tTrain Loss: {train_loss:.3f}')  # Display the training loss for the epoch

# Save the trained model state dictionary to a file
torch.save(model.state_dict(), 'text_classifier_model.pth')




Epoch: 01
	Train Loss: 0.138
Epoch: 02
	Train Loss: 0.114
Epoch: 03
	Train Loss: 0.101
Epoch: 04
	Train Loss: 0.093
Epoch: 05
	Train Loss: 0.088
Epoch: 06
	Train Loss: 0.084
Epoch: 07
	Train Loss: 0.082
Epoch: 08
	Train Loss: 0.080
Epoch: 09
	Train Loss: 0.079
Epoch: 10
	Train Loss: 0.077


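The training loss measures how well the model fits the training data. Here it falls from 0.138 in the first epoch to 0.077 in the tenth, with smaller gains in the later epochs, which indicates that the model is still learning but that the improvement is leveling off. A falling training loss does not by itself show that the model generalizes to unseen data.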
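As a small worked example, the epoch-to-epoch improvement can be computed from the losses printed above. The variable names are illustrative, and the values are copied by hand from the output rather than read from the training run.

```python
# Training losses copied from the output above (epochs 1 to 10).
train_losses = [0.138, 0.114, 0.101, 0.093, 0.088,
                0.084, 0.082, 0.080, 0.079, 0.077]

# Absolute improvement from one epoch to the next.
improvements = [round(prev - curr, 3)
                for prev, curr in zip(train_losses, train_losses[1:])]

for epoch, gain in enumerate(improvements, start=2):
    print(f"Epoch {epoch:02}: loss fell by {gain:.3f}")

# True when the loss never increases from one epoch to the next.
is_non_increasing = all(gain >= 0 for gain in improvements)
print(is_non_increasing)
```

The gains shrink from 0.024 between the first two epochs to about 0.002 at the end, which is the flattening visible in the printed losses.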

### Setting the Model to Evaluation Mode ###

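The code below switches the model to evaluation mode. The mode changes how certain layers behave: dropout is disabled, and batch normalization uses its running statistics instead of the statistics of the current batch. As the printed structure shows, this `TextClassifier` has no dropout or batch normalization layers, so the call is unlikely to change its outputs here, but calling `model.eval()` before inference remains good practice.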

#### Setting the Model to Evaluation Mode ####

- `model.eval()`: Sets `model` to evaluation mode. The method returns the model itself, which is why the notebook prints the model's structure as the cell's output.



In [32]:
# Set the model to evaluation mode
model.eval()

TextClassifier(
  (embedding): Embedding(532299, 100)
  (fc): Sequential(
    (0): Linear(in_features=100, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=6, bias=True)
  )
)
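The effect of evaluation mode is easiest to see on a layer that actually depends on it. The sketch below uses a standalone `nn.Dropout` layer, which is a hypothetical addition for illustration only, since the model printed above does not contain one.

```python
import torch
import torch.nn as nn

# Hypothetical layer for illustration; the notebook's model has no dropout.
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()            # training mode: elements are randomly zeroed
train_out = dropout(x)     # surviving elements are scaled by 1 / (1 - p) = 2

dropout.eval()             # evaluation mode: dropout passes input through
eval_out = dropout(x)

print(train_out)           # a random mix of 0.0 and 2.0
print(eval_out)            # identical to the input
print(dropout.training)    # the flag that .train() and .eval() toggle
```

Calling `model.eval()` on a full model sets the same `training` flag on every submodule, so layers like this one switch behavior together.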

### Initializing Lists for Storing Predictions and Actual Labels ###

The code below creates two empty lists that will hold the model's predictions and the corresponding true labels.

#### List Initialization ####

- `all_predictions = []`: Creates an empty list that will hold the model's predictions.

- `all_labels = []`: Creates an empty list that will hold the true labels for the same samples.



In [33]:
# Initialize lists to store predictions and true labels
all_predictions = []
all_labels = []

### Disabling Gradient Calculation for Evaluation and Prediction Generation ###

The code below runs the model over the test set with gradient computation turned off. Gradients are not needed during inference, so disabling them saves memory and avoids unnecessary computation.

#### Disabling Gradient Calculation ####

- `with torch.no_grad():`: This context manager is employed to temporarily turn off gradient calculations for tensor operations within the following code block. During this context, gradients are neither computed nor stored.

- Iteration through batches in the testing DataLoader:

  - `for batch in test_loader:`: This loop sequentially processes batches of data obtained from the testing DataLoader, enabling model evaluation on the testing dataset.

    - `comments, labels = batch`: Each batch is unpacked into `comments` (representing text comments) and `labels` (representing the actual labels).

    - `predictions = model(comments).squeeze(1)`: The model produces one raw score (logit) per label for each comment. `.squeeze(1)` removes dimension 1 only when its size is 1, so for this model's six-label output of shape `[batch_size, 6]` it leaves the shape unchanged.

    - `sigmoid_predictions = torch.sigmoid(predictions)`: The sigmoid function maps each logit to a probability between 0 and 1. It is applied to each of the six labels independently, which is the usual choice for multi-label classification.

    - `binary_predictions = (sigmoid_predictions > 0.5).float()`: Each probability is converted into a binary prediction (0 or 1) with a threshold of 0.5. Values greater than 0.5 become 1, while values less than or equal to 0.5 become 0.

    - `all_predictions.extend(binary_predictions.tolist())`: Binary predictions for the current batch are appended to the `all_predictions` list, extending its contents.

    - `all_labels.extend(labels.tolist())`: Similarly, actual labels for the current batch are appended to the `all_labels` list.

#### Purpose ####

- During the evaluation or inference phase, gradient calculation becomes redundant because model parameters are not updated. By temporarily disabling gradient computation with `torch.no_grad()`, memory usage and computational overhead are reduced.

- The iteration through batches from the testing DataLoader allows the model to generate predictions for the testing dataset, a crucial step in evaluating its performance.

- Converting raw predictions into probabilities with the sigmoid function and then into binary predictions with a 0.5 threshold is a standard approach for binary classification. It is applied here to each label separately because the task is multi-label.

- Storing both model predictions (`all_predictions`) and true labels (`all_labels`) facilitates subsequent evaluation and analysis of the model's performance on the testing dataset.
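The steps above can be sketched on a tiny example. The `nn.Linear` model and the random `comments` tensor are hypothetical stand-ins for the notebook's `TextClassifier` and its encoded comments. Only the output width of six labels is taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in: 4 input features, 6 independent labels.
model = nn.Linear(4, 6)
model.eval()
comments = torch.randn(3, 4)  # pretend batch of 3 already-encoded comments

with torch.no_grad():
    logits = model(comments)                   # shape [3, 6]
    probabilities = torch.sigmoid(logits)      # one probability per label
    binary_predictions = (probabilities > 0.5).float()

print(logits.requires_grad)       # False: no computation graph was recorded
print(logits.squeeze(1).shape)    # unchanged, because dimension 1 has size 6
print(binary_predictions.shape)   # torch.Size([3, 6])
```

Because the sigmoid of 0 is 0.5, thresholding the probabilities at 0.5 corresponds to thresholding the raw logits at 0, apart from floating-point rounding for logits extremely close to 0.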



In [53]:
# Disable gradient computation for evaluation using 'torch.no_grad()'
with torch.no_grad():
    # Initialize empty lists to collect predictions and labels
    all_predictions = []
    all_labels = []

    # Iterate over batches in the 'test_loader' for evaluation
    for batch in test_loader:
        comments, labels = batch
        
        # Forward pass through the model to compute predictions
        predictions = model(comments).squeeze(1)
        
        # Apply the sigmoid function to convert logits to probabilities
        sigmoid_predictions = torch.sigmoid(predictions)
        
        # Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
        binary_predictions = (sigmoid_predictions > 0.5).float()
        
        # Extend the 'all_predictions' and 'all_labels' lists with the batch's predictions and labels
        all_predictions.extend(binary_predictions.tolist())
        all_labels.extend(labels.tolist())


### Converting Lists to Tensors for Compatibility with Scikit-Learn Metrics ###

The code below converts the Python lists into PyTorch tensors, so that the predictions and labels are stored as uniform numeric arrays before they are passed to scikit-learn's metric functions.

#### List to Tensor Conversion ####

- `predictions_tensor = torch.FloatTensor(all_predictions)`: This line of code converts the list `all_predictions` (which contains model predictions) into a PyTorch tensor of type `FloatTensor`. This tensor will be used for further evaluation.

- `labels_tensor = torch.FloatTensor(all_labels)`: Similarly, this line converts the list `all_labels` (which contains actual labels) into a PyTorch tensor of type `FloatTensor`. This tensor is used as ground truth labels for evaluation.

#### Purpose ####

- scikit-learn's metric functions accept array-like inputs such as Python lists and NumPy arrays, so `all_predictions` and `all_labels` could also be passed to them directly or converted with `np.array`.

- Converting to tensors gives both collections a uniform numeric type and a rectangular shape of (number of samples, number of labels), which is the layout scikit-learn expects for multi-label targets.

- The resulting tensors are then passed to scikit-learn functions that calculate accuracy, precision, recall, F1-score, and other metrics to assess the model's effectiveness.



In [35]:
# Convert lists to tensors for compatibility with scikit-learn metrics
predictions_tensor = torch.FloatTensor(all_predictions)
labels_tensor = torch.FloatTensor(all_labels)

These tensors (`predictions_tensor` and `labels_tensor`) are used in the following cells to calculate evaluation metrics such as accuracy, precision, recall, and F1-score.
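As an alternative sketch, the same lists can be turned into NumPy arrays, which scikit-learn metric functions also accept. The list contents below are made up for illustration, and the names `predictions_array` and `labels_array` are hypothetical, not taken from the notebook.

```python
import numpy as np

# Hypothetical collected results: 3 comments, 6 labels each.
all_predictions = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]]
all_labels = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]]

# Integer 0/1 arrays of shape (n_samples, n_labels).
predictions_array = np.asarray(all_predictions, dtype=int)
labels_array = np.asarray(all_labels, dtype=int)

print(predictions_array.shape)  # (3, 6)
print(labels_array.shape)       # (3, 6)
```

Either representation carries the same information. What matters for the metrics is that both arrays have one row per sample and one column per label.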

### Calculation of Evaluation Metrics ###

The provided code snippet focuses on the calculation of various evaluation metrics to assess the performance of a machine learning model. These metrics provide insights into how well the model's predictions align with the actual labels.

#### Evaluation Metrics Calculation ####

- `accuracy = accuracy_score(labels_tensor, predictions_tensor)`: This line computes the accuracy score by comparing the true labels (`labels_tensor`) with the model's predictions (`predictions_tensor`). For multi-label inputs such as these, scikit-learn computes subset accuracy: a sample counts as correct only if all of its labels are predicted correctly.

- `precision = precision_score(labels_tensor, predictions_tensor, average='micro')`: Precision measures the ratio of correctly predicted positive instances to all instances predicted as positive. With `average='micro'`, true positives and false positives are pooled across all labels before the ratio is computed, so frequent labels carry more weight than rare ones.

- `recall = recall_score(labels_tensor, predictions_tensor, average='micro')`: Recall measures the ratio of correctly predicted positive instances to all actual positive instances. As with precision, `average='micro'` pools the counts across all labels.

- `f1 = f1_score(labels_tensor, predictions_tensor, average='micro')`: F1-score is determined by the `f1_score` function. It combines precision and recall into a single metric, providing a balanced assessment of a model's performance. The 'micro' averaging method is applied.

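- `roc_auc = roc_auc_score(labels_tensor, predictions_tensor, average='micro')`: This line calculates the ROC AUC (Receiver Operating Characteristic, Area Under the Curve) score, which evaluates the model's ability to distinguish between positive and negative instances, micro-averaged across all labels. Note that the function receives the thresholded 0/1 predictions here rather than the sigmoid probabilities. ROC AUC is designed for continuous scores, so passing the probabilities would generally give a more informative value.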

#### Purpose ####

- These evaluation metrics provide quantitative measures of the model's performance on the testing dataset, offering a comprehensive view of its effectiveness in tasks like multi-label classification.

- Accuracy assesses overall correctness, while precision, recall, and F1-score focus on the positive labels: precision penalizes false positives, recall penalizes false negatives, and F1-score balances the two.

- ROC AUC is especially useful for binary classification tasks but can be adapted for multi-label scenarios as well. It assesses the model's ability to rank positive instances higher than negative ones.

- These metrics collectively aid in understanding how well the model generalizes to unseen data and in comparing different models to select the most suitable one for a specific task.
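To make the definitions above concrete, the sketch below computes the micro-averaged metrics by hand on a small made-up example, following the standard definition of micro averaging. The arrays `y_true` and `y_pred` are hypothetical toy data with three labels, not the notebook's results.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 3 labels.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [1, 0, 1]])

# Micro averaging pools the counts over every (sample, label) cell.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # 3
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # 1
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # 2

precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60
f1 = 2 * precision * recall / (precision + recall)  # about 0.667

# Subset accuracy: a sample counts only if all of its labels match.
subset_accuracy = float(np.mean(np.all(y_pred == y_true, axis=1)))  # 0.5

# Per-label accuracy: fraction of individual cells that match.
cell_accuracy = float(np.mean(y_pred == y_true))                    # 0.75

print(precision, recall, round(f1, 3), subset_accuracy, cell_accuracy)
```

The gap between `subset_accuracy` and `cell_accuracy` shows why the definition of accuracy matters for multi-label data: a single wrong label makes the whole sample count as incorrect under subset accuracy.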



In [54]:
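# Calculate evaluation metrics
# (imported here in case the metric functions were not imported earlier in the notebook)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Calculate accuracy (for multi-label targets this is subset accuracy: all labels of a sample must match)
accuracy = accuracy_score(labels_tensor, predictions_tensor)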

# Calculate precision (micro-averaged)
# Precision measures how many of the predicted positive instances are actually positive
precision = precision_score(labels_tensor, predictions_tensor, average='micro')

# Calculate recall (micro-averaged)
# Recall measures how many of the actual positive instances were correctly predicted
recall = recall_score(labels_tensor, predictions_tensor, average='micro')

# Calculate F1 score (micro-averaged)
# F1 score is the harmonic mean of precision and recall and provides a balanced measure
f1 = f1_score(labels_tensor, predictions_tensor, average='micro')

# Calculate ROC AUC score (micro-averaged)
# ROC AUC (Receiver Operating Characteristic Area Under the Curve) measures the model's ability to distinguish between classes
# Note: 'predictions_tensor' holds thresholded 0/1 values; ROC AUC is normally computed from probabilities
roc_auc = roc_auc_score(labels_tensor, predictions_tensor, average='micro')


In [55]:
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')
print(f'ROC AUC: {roc_auc:.4f}')

Accuracy: 0.9060
Precision: 0.7961
Recall: 0.3664
F1-score: 0.5018
ROC AUC: 0.6814


### Interpretation of Evaluation Metrics ###

The provided evaluation metrics offer insights into the performance of a machine learning model on a testing dataset. Here's the meaning of each metric:

- **Accuracy: 0.9060**
  - For multi-label inputs, `accuracy_score` reports subset accuracy, so an accuracy of 0.9060 means that for approximately 90.60% of the test comments the model predicted all six labels correctly.

- **Precision: 0.7961**
  - A precision of 0.7961 indicates that about 79.61% of the instances predicted as positive were true positives, with the remainder being false positives.

- **Recall: 0.3664**
  - A recall of 0.3664 means that the model correctly identified approximately 36.64% of all positive instances in the dataset.

- **F1-score: 0.5018**
  - An F1-score of 0.5018 is the harmonic mean of precision and recall. The harmonic mean is pulled toward the lower of the two values, so here the low recall keeps the F1-score well below the precision.

- **ROC AUC: 0.6814**
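  - An ROC AUC score of 0.6814 can be read as the probability that a randomly chosen positive label receives a higher score than a randomly chosen negative one, with ties counted as one half. Because the score was computed from thresholded 0/1 predictions, ties are frequent, and the value is likely lower than it would be with the sigmoid probabilities.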

Taken together, the metrics describe a conservative model: most of its positive predictions are correct (high precision), but it misses roughly two thirds of the true positive labels (low recall). Which of these properties matters most depends on the goals of the task.
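The reported numbers can be checked against each other with a little arithmetic. The sketch below assumes that all values are micro-averaged over the same label matrix and that the ROC AUC was computed from thresholded 0/1 predictions, as in the metrics cell. The inputs are the rounded values printed above, so the results are approximate.

```python
# Values copied from the printed results above.
precision = 0.7961
recall = 0.3664
roc_auc = 0.6814

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from precision and recall: {f1:.4f}")

# With hard 0/1 predictions the ROC curve has a single interior point, so
# AUC = (TPR + 1 - FPR) / 2. Solving for the false positive rate:
tpr = recall
fpr = 1 + tpr - 2 * roc_auc
print(f"Implied false positive rate: {fpr:.4f}")
```

The recomputed F1 agrees with the reported 0.5018, and the implied false positive rate is well below 1%, which is consistent with a model that rarely predicts a positive label and is usually right when it does.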
