### Detailed Description of the Problem (Topic Detection in Reviews)

#### **Objective**
The goal of this task is to identify the topics of customer reviews related to products sold on the Digikala platform. These reviews are categorized based on various product features, such as price, quality, warranty, and more. The challenge is to determine whether each review is associated with specific topics or attributes of the product.

---

#### **Data Structure**
The provided dataset contains various features and annotations for each review, which can be utilized for analysis and topic extraction. Each review includes the following fields:

1. **`id`**: A unique identifier for the review.
2. **`comment`**: The text of the review submitted by the customer.
3. **`product_id`**: A unique identifier for the product related to the review.
4. **`product_title_fa`**: The product title in Persian.
5. **`category_id`**: A unique identifier for the category of the product.
6. **`category_title_fa`**: The category title in Persian.
7. **`is_buyer`**: Indicates whether the review was written by a verified buyer of the product.

---

#### **Additional Features**
The dataset also includes binary labels (0 or 1) for seven product attributes. Each label indicates whether the review addresses that topic, and together they form the prediction targets:

1. **`price_value`**: Whether the review discusses the value or fairness of the product's price.
2. **`fake_originality`**: Whether the review discusses concerns about the product's originality or authenticity.
3. **`warranty`**: Whether the review mentions after-sales services or warranty-related issues.
4. **`size`**: Whether the review discusses the size or dimensions of the product.
5. **`discrepancy`**: Whether the review mentions any discrepancies between the product description and the actual product received.
6. **`flavor_odor`**: Whether the review discusses the flavor or odor of the product (applicable to consumables or similar items).
7. **`expiration_date`**: Whether the review mentions the expiration date of the product (applicable to perishable goods).

---

#### **Additional Information**
- These labels are intended to facilitate the analysis and extraction of topics from customer reviews.
- Each topic (such as price or warranty) is represented as a binary value (`0` or `1`):
  - **`1`**: Indicates that the review is associated with the specific topic.
  - **`0`**: Indicates that the review does not mention the specific topic.

The task is essentially a **multi-label classification problem**, where a single review can be associated with multiple topics simultaneously. For instance, a review might discuss both the **price** and **warranty** of a product, but not its **size** or **expiration date**.
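
As a minimal sketch of this multi-label setup, the snippet below encodes one review as a multi-hot vector over the seven label columns and decodes per-label probabilities with an independent 0.5 threshold. The helper names (`encode_topics`, `decode_topics`) and the probability values are illustrative assumptions, not part of the dataset or the notebook code.

```python
# The seven label columns, in the order used in this notebook.
LABELS = ['price_value', 'fake_originality', 'warranty', 'size',
          'discrepancy', 'flavor_odor', 'expiration_date']

def encode_topics(topics):
    """Turn a set of topic names into a multi-hot vector (hypothetical helper)."""
    return [1 if label in topics else 0 for label in LABELS]

def decode_topics(probabilities, threshold=0.5):
    """Threshold each label independently; any number of labels may fire."""
    return [label for label, p in zip(LABELS, probabilities) if p > threshold]

# A review that discusses price and warranty, but not size or expiration date.
target = encode_topics({'price_value', 'warranty'})
print(target)  # [1, 0, 1, 0, 0, 0, 0]

# Made-up per-label probabilities, e.g. sigmoid outputs of a classifier.
probs = [0.91, 0.03, 0.77, 0.10, 0.42, 0.01, 0.02]
print(decode_topics(probs))  # ['price_value', 'warranty']
```

Because each label is thresholded on its own, the outputs do not need to sum to one, which is what distinguishes this from single-label (softmax) classification.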

---

#### **Objective Summary**
The main objective is to train a model that can:
1. Process the text of the reviews (along with other features if necessary).
2. Identify which topics (labels) are relevant for each review.

This is useful for:
- Automating the analysis of customer feedback.
- Providing insights into common customer concerns about products.
- Enhancing product recommendations and quality control processes.


Based on this description, train an RNN model that solves this problem well.

Please train SimpleRNN, GRU and LSTM models and compare their performances.
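
One way to put the comparison in context is to count the recurrent-layer parameters of each variant. The sketch below assumes PyTorch's single-layer parameter layout (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per gate block) and uses the embedding size (128) and hidden size (64) set in this notebook's code. `rnn_param_count` is a hypothetical helper, not a library function.

```python
# Number of gate blocks per recurrent cell type.
GATES = {'SimpleRNN': 1, 'GRU': 3, 'LSTM': 4}

def rnn_param_count(rnn_type, input_size, hidden_size, bidirectional=True):
    """Parameters of one recurrent layer under the assumed PyTorch layout."""
    # Per gate block: W_ih (hidden x input) + W_hh (hidden x hidden) + two biases.
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    directions = 2 if bidirectional else 1
    return GATES[rnn_type] * per_gate * directions

for name in GATES:
    print(name, rnn_param_count(name, input_size=128, hidden_size=64))
# SimpleRNN 24832
# GRU 74496
# LSTM 99328
```

With equal sizes, the GRU and LSTM layers carry three and four times the recurrent parameters of the SimpleRNN, so a fair comparison may need to account for capacity as well as cell type.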

Please tune the hyperparameters through your own experiments to reach the best performance. (Don't rely on default values.)
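
A simple way to organize such experiments is to enumerate a small grid of candidate settings. The sketch below is only an illustration: the search space values are placeholders, and `iter_configs` is a hypothetical helper.

```python
from itertools import product

# Hypothetical search space; the values are placeholders to adapt after experiments.
search_space = {
    'embed_size': [64, 128],
    'hidden_size': [64, 128],
    'lr': [1e-3, 3e-4],
    'dropout': [0.3, 0.5],
}

def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 16
print(configs[0])    # {'embed_size': 64, 'hidden_size': 64, 'lr': 0.001, 'dropout': 0.3}
```

Each configuration would then be passed to a training run and ranked by validation macro F1. Note that the training code in this notebook hard-codes the learning rate and dropout rate, so they would first need to become function arguments.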

Please fill in the '...' placeholders in the following code.

Download the dataset from this [link](https://drive.google.com/file/d/1QOcw01rxMIkJyl2oDEL1mlJYnIEAtDTr/view?usp=sharing).

In [1]:
!pip install hazm



In [28]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from hazm import Normalizer, word_tokenize, Stemmer, Lemmatizer, stopwords_list
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
import re

# Preprocessing for Persian Text
normalizer = Normalizer()
stemmer = Stemmer()
lemmatizer = Lemmatizer()
stopwords = set(stopwords_list())

def preprocess_text(text):
    '''
    Do the pre-process steps as needed.
    For your choices, do experiments.
    '''
    text = normalizer.normalize(text)  # Normalize
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenize
    clean_tokens = []  # Drop stopwords, then stem and lemmatize the remaining tokens
    for token in tokens:
        if token not in stopwords:
            clean_tokens.append(lemmatizer.lemmatize(stemmer.stem(token)))
    return ' '.join(clean_tokens)

train_data = pd.read_csv('train.csv')

# Handle missing and non-string comments
train_data['comment'] = train_data['comment'].fillna('').astype(str)

# Preprocess the comments
train_data['cleaned_comment'] = train_data['comment'].apply(preprocess_text)

# Tokenization
from collections import Counter

def build_vocab(texts, max_vocab_size=20000):
    '''
    The function builds a vocabulary (a mapping of words to unique integer indices) from a list of text samples.
    It includes the most frequent words up to a specified maximum vocabulary size.
    '''
    # Count word frequencies
    word_counter = Counter()
    for text in texts:
        for word in text.split():
            word_counter[word] += 1
    # Create the vocabulary: indices start at 1 because 0 is reserved for <PAD>
    most_common_words = word_counter.most_common(max_vocab_size - 1)
    vocab = {word: idx + 1 for idx, (word, _) in enumerate(most_common_words)}
    vocab['<PAD>'] = 0
    return vocab

vocab = build_vocab(train_data['cleaned_comment'])
max_len = 100

def text_to_sequence(text, vocab, max_len):
    '''
    The function converts a single piece of text into a numerical sequence of fixed length.
    Each word in the text is mapped to its corresponding integer index from a vocabulary (vocab).
    Words not in the vocabulary are assigned a default value (0).
    '''
    tokens = text.split()
    sequence = [vocab.get(word, 0) for word in tokens]  # Word-to-index mapping; unknown words map to 0
    # Truncate long sequences and right-pad short ones with 0 (<PAD>)
    if len(sequence) > max_len:
        sequence = sequence[:max_len]
    else:
        sequence.extend([0] * (max_len - len(sequence)))

    return sequence

train_data['sequence'] = train_data['cleaned_comment'].apply(lambda x: text_to_sequence(x, vocab, max_len))

# Targets
X = np.array(train_data['sequence'].tolist())
label_columns = ['price_value', 'fake_originality', 'warranty', 'size',
                 'discrepancy', 'flavor_odor', 'expiration_date']
y = train_data[label_columns].values  # Select the seven topic labels by name
y = torch.tensor(y, dtype=torch.float32)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Custom Dataset
class TextDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = TextDataset(X_train, y_train)
val_dataset = TextDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)

# Model Architecture
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size, rnn_type='SimpleRNN'):
        '''
        if needed, change the arguments
        '''
        super(RNNModel, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embed_size) # Define embedding layer

        bidirectional = True
        if rnn_type == 'SimpleRNN':

            self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True, bidirectional=bidirectional) # Define your own RNN layer

        elif rnn_type == 'GRU':

            self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True, bidirectional=bidirectional) # Define your own GRU layer

        elif rnn_type == 'LSTM':

            self.rnn = nn.LSTM(embed_size, hidden_size, batch_first=True, bidirectional=bidirectional) # Define your own LSTM layer

        else:
            raise ValueError(f"Unknown rnn_type: {rnn_type}")

        self.fc = nn.Linear(hidden_size * (2 if bidirectional else 1), output_size) # Define your own Linear layer
        self.dropout = nn.Dropout(0.3) # Define your own Dropout layer

    def forward(self, x):
        '''
        embed -> rnn -> dropout -> fc
        '''
        x = self.embedding(x)
        if isinstance(self.rnn, nn.LSTM):
            _, (hidden, _) = self.rnn(x)  # LSTM also returns a cell state
        else:
            _, hidden = self.rnn(x)

        # Concatenate the final forward and backward hidden states of the last layer
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1) if self.rnn.bidirectional else hidden[-1]
        x = self.dropout(hidden)
        x = self.fc(x)  # Raw logits; BCEWithLogitsLoss applies the sigmoid
        return x

from sklearn.metrics import f1_score
from sklearn.preprocessing import Binarizer

# Function to compute F1 score
def compute_f1_score(y_true, y_pred):
    # Binarize predictions using a threshold of 0.5
    binarizer = Binarizer(threshold=0.5)
    y_pred_binarized = binarizer.transform(y_pred)
    return f1_score(y_true, y_pred_binarized, average='macro')

# Training function with F1 score and NaN handling
def train_model_with_f1(rnn_type, vocab_size, embed_size, hidden_size, output_size, num_epochs=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = RNNModel(vocab_size, embed_size, hidden_size, output_size, rnn_type=rnn_type).to(device) # Define your model
    criterion = nn.BCEWithLogitsLoss() # Define your own criterion w.r.t. problem
    optimizer = optim.Adam(model.parameters(), lr=0.001) # Define your own optimizer based on your choice

    best_val_loss = float('inf')
    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        all_train_preds = []
        all_train_targets = []
        for X_batch, y_batch in tqdm(train_loader):
            X_batch, y_batch = X_batch.to(device), y_batch.to(device) # transfer to the device

            # Handle NaN in y_batch
            y_batch = torch.nan_to_num(y_batch, nan=0.0)

            # Zero the gradients
            optimizer.zero_grad()
            # Feedforward pass
            outputs = model(X_batch)
            # compute loss
            loss = criterion(outputs, y_batch)
            # Do back propagation
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            # Update parameters
            optimizer.step()
            train_loss += loss.item()

            # Collect predictions and targets for F1 score
            all_train_preds.append(outputs.detach().cpu().numpy())
            all_train_targets.append(y_batch.cpu().numpy())

        # Compute training F1 score
        train_preds = np.vstack(all_train_preds)
        train_targets = np.vstack(all_train_targets)
        train_f1 = compute_f1_score(train_targets, torch.sigmoid(torch.tensor(train_preds)).numpy())

        # Validation
        model.eval()
        val_loss = 0
        all_val_preds = []
        all_val_targets = []
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)

                # Handle NaN in y_batch
                y_batch = torch.nan_to_num(y_batch, nan=0.0)

                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                val_loss += loss.item()

                # Collect predictions and targets for F1 score
                all_val_preds.append(outputs.cpu().numpy())
                all_val_targets.append(y_batch.cpu().numpy())

        # Compute validation F1 score
        val_preds = np.vstack(all_val_preds)
        val_targets = np.vstack(all_val_targets)
        val_f1 = compute_f1_score(val_targets, torch.sigmoid(torch.tensor(val_preds)).numpy())

        print(f"Epoch {epoch + 1}/{num_epochs}, Train Loss: {train_loss / len(train_loader)}, Train F1: {train_f1}")
        print(f"Epoch {epoch + 1}/{num_epochs}, Val Loss: {val_loss / len(val_loader)}, Val F1: {val_f1}")

        # Save the best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), f"best_{rnn_type}_model.pth")

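    # Note: this returns the final-epoch weights; the best ones (by validation loss) are in the saved checkpoint
    return model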



# Compare Models
vocab_size = len(vocab)
embed_size = 128
hidden_size = 64
output_size = y.shape[1]

for rnn_type in ['SimpleRNN', 'GRU', 'LSTM']:
    print()
    print(f"Training {rnn_type}...")
    train_model_with_f1(rnn_type, vocab_size, embed_size, hidden_size, output_size)



Training SimpleRNN...


100%|██████████| 2802/2802 [00:12<00:00, 220.73it/s]


Epoch 1/10, Train Loss: 0.09661210470066421, Train F1: 0.37141615424365887
Epoch 1/10, Val Loss: 0.07439315161493723, Val F1: 0.4610352055886427


100%|██████████| 2802/2802 [00:12<00:00, 222.20it/s]


Epoch 2/10, Train Loss: 0.07305584182972082, Train F1: 0.4880067431145138
Epoch 2/10, Val Loss: 0.06736496037746462, Val F1: 0.5065407925755172


100%|██████████| 2802/2802 [00:12<00:00, 221.88it/s]


Epoch 3/10, Train Loss: 0.06682406344998104, Train F1: 0.54600091491648
Epoch 3/10, Val Loss: 0.06474280892771678, Val F1: 0.5205499527654932


100%|██████████| 2802/2802 [00:12<00:00, 220.59it/s]


Epoch 4/10, Train Loss: 0.06283673863235774, Train F1: 0.6092341146437168
Epoch 4/10, Val Loss: 0.06303244833429257, Val F1: 0.5886910348158076


100%|██████████| 2802/2802 [00:12<00:00, 221.60it/s]


Epoch 5/10, Train Loss: 0.059995291502661016, Train F1: 0.6418802931192139
Epoch 5/10, Val Loss: 0.062121736305565195, Val F1: 0.6462513593620295


100%|██████████| 2802/2802 [00:13<00:00, 212.21it/s]


Epoch 6/10, Train Loss: 0.058513152214308285, Train F1: 0.6653598754104156
Epoch 6/10, Val Loss: 0.06180254706897681, Val F1: 0.6272738866970925


100%|██████████| 2802/2802 [00:13<00:00, 214.76it/s]


Epoch 7/10, Train Loss: 0.05608997342783657, Train F1: 0.6762852794217135
Epoch 7/10, Val Loss: 0.061606788699556346, Val F1: 0.6119955603751678


100%|██████████| 2802/2802 [00:12<00:00, 219.87it/s]


Epoch 8/10, Train Loss: 0.054773784102306974, Train F1: 0.7028168750959859
Epoch 8/10, Val Loss: 0.060918421071346915, Val F1: 0.6596929681584783


100%|██████████| 2802/2802 [00:12<00:00, 219.59it/s]


Epoch 9/10, Train Loss: 0.053452545401376325, Train F1: 0.7213841483237264
Epoch 9/10, Val Loss: 0.06276448994088445, Val F1: 0.673005299366294


100%|██████████| 2802/2802 [00:12<00:00, 222.20it/s]


Epoch 10/10, Train Loss: 0.052219075438627, Train F1: 0.7355769043093471
Epoch 10/10, Val Loss: 0.06250936375896363, Val F1: 0.690785486788952

Training GRU...


100%|██████████| 2802/2802 [00:13<00:00, 206.74it/s]


Epoch 1/10, Train Loss: 0.0802555855484593, Train F1: 0.5087159277633225
Epoch 1/10, Val Loss: 0.05913869043368416, Val F1: 0.6674925360968371


100%|██████████| 2802/2802 [00:13<00:00, 202.08it/s]


Epoch 2/10, Train Loss: 0.05826961445348931, Train F1: 0.7286282546252687
Epoch 2/10, Val Loss: 0.055165813635296046, Val F1: 0.7338188228606713


100%|██████████| 2802/2802 [00:13<00:00, 205.86it/s]


Epoch 3/10, Train Loss: 0.053187685938542714, Train F1: 0.7660574392660094
Epoch 3/10, Val Loss: 0.05475560350665097, Val F1: 0.7604820623991068


100%|██████████| 2802/2802 [00:13<00:00, 207.02it/s]


Epoch 4/10, Train Loss: 0.049481756500913084, Train F1: 0.7915278655043556
Epoch 4/10, Val Loss: 0.055463121964804796, Val F1: 0.7560710725406433


100%|██████████| 2802/2802 [00:13<00:00, 203.22it/s]


Epoch 5/10, Train Loss: 0.04621113394470161, Train F1: 0.815012758250875
Epoch 5/10, Val Loss: 0.057059924565946496, Val F1: 0.7656721853406022


100%|██████████| 2802/2802 [00:13<00:00, 208.47it/s]


Epoch 6/10, Train Loss: 0.0430829816841541, Train F1: 0.8328699004253816
Epoch 6/10, Val Loss: 0.058244712953688244, Val F1: 0.749586237623993


100%|██████████| 2802/2802 [00:13<00:00, 205.41it/s]


Epoch 7/10, Train Loss: 0.04010744272052529, Train F1: 0.8495238756870073
Epoch 7/10, Val Loss: 0.06009359280718291, Val F1: 0.7533841402370911


100%|██████████| 2802/2802 [00:13<00:00, 204.94it/s]


Epoch 8/10, Train Loss: 0.037150966392751575, Train F1: 0.8689026752877889
Epoch 8/10, Val Loss: 0.06280235960526916, Val F1: 0.7600853466909664


100%|██████████| 2802/2802 [00:13<00:00, 206.56it/s]


Epoch 9/10, Train Loss: 0.03457561682651802, Train F1: 0.8822301383403511
Epoch 9/10, Val Loss: 0.06522025748524023, Val F1: 0.7586554159977084


100%|██████████| 2802/2802 [00:13<00:00, 207.98it/s]


Epoch 10/10, Train Loss: 0.03235156361800057, Train F1: 0.8917873137037917
Epoch 10/10, Val Loss: 0.06744814779046131, Val F1: 0.7500603031125992

Training LSTM...


100%|██████████| 2802/2802 [00:13<00:00, 200.19it/s]


Epoch 1/10, Train Loss: 0.08777200605329405, Train F1: 0.4120829244991367
Epoch 1/10, Val Loss: 0.06280120749414478, Val F1: 0.5041659468408394


100%|██████████| 2802/2802 [00:14<00:00, 198.70it/s]


Epoch 2/10, Train Loss: 0.06049640016094499, Train F1: 0.6357658635633039
Epoch 2/10, Val Loss: 0.05718766626889968, Val F1: 0.6847698991603394


100%|██████████| 2802/2802 [00:13<00:00, 200.60it/s]


Epoch 3/10, Train Loss: 0.05448540471837212, Train F1: 0.731822219111005
Epoch 3/10, Val Loss: 0.05654359110943261, Val F1: 0.7284021518478238


100%|██████████| 2802/2802 [00:14<00:00, 196.27it/s]


Epoch 4/10, Train Loss: 0.05072663937376907, Train F1: 0.7747754373806292
Epoch 4/10, Val Loss: 0.055801178842846744, Val F1: 0.7376260308944961


100%|██████████| 2802/2802 [00:14<00:00, 199.64it/s]


Epoch 5/10, Train Loss: 0.04740983147545053, Train F1: 0.7959453679534958
Epoch 5/10, Val Loss: 0.05595968348823073, Val F1: 0.759336111954721


100%|██████████| 2802/2802 [00:13<00:00, 201.59it/s]


Epoch 6/10, Train Loss: 0.04438560457666173, Train F1: 0.8157288512402187
Epoch 6/10, Val Loss: 0.05779246886934389, Val F1: 0.7582640071702335


100%|██████████| 2802/2802 [00:13<00:00, 200.44it/s]


Epoch 7/10, Train Loss: 0.04186112995323945, Train F1: 0.8248381695321614
Epoch 7/10, Val Loss: 0.058813617082469474, Val F1: 0.7415537601889873


100%|██████████| 2802/2802 [00:13<00:00, 201.13it/s]


Epoch 8/10, Train Loss: 0.03949050213623098, Train F1: 0.8442827325372935
Epoch 8/10, Val Loss: 0.060437245073910274, Val F1: 0.7534200790620821


100%|██████████| 2802/2802 [00:14<00:00, 197.65it/s]


Epoch 9/10, Train Loss: 0.037327788458428456, Train F1: 0.8560997782223991
Epoch 9/10, Val Loss: 0.06138204489588823, Val F1: 0.7580518799450505


100%|██████████| 2802/2802 [00:14<00:00, 196.82it/s]


Epoch 10/10, Train Loss: 0.03541036680683781, Train F1: 0.8646629758022378
Epoch 10/10, Val Loss: 0.06336758097104951, Val F1: 0.7557772547266212
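
In the logs above, validation loss reaches its minimum early (epoch 3 for GRU, epoch 4 for LSTM) while training loss keeps falling, a pattern that usually points to overfitting. The sketch below parses validation lines in the printed format and finds the best epoch under each criterion. The sample text is the first five GRU validation lines copied from the log; `parse_val_log` is a hypothetical helper.

```python
import re

# First five validation lines of the GRU run, copied from the log above.
log = """\
Epoch 1/10, Val Loss: 0.05913869043368416, Val F1: 0.6674925360968371
Epoch 2/10, Val Loss: 0.055165813635296046, Val F1: 0.7338188228606713
Epoch 3/10, Val Loss: 0.05475560350665097, Val F1: 0.7604820623991068
Epoch 4/10, Val Loss: 0.055463121964804796, Val F1: 0.7560710725406433
Epoch 5/10, Val Loss: 0.057059924565946496, Val F1: 0.7656721853406022
"""

PATTERN = re.compile(r"Epoch (\d+)/\d+, Val Loss: ([\d.]+), Val F1: ([\d.]+)")

def parse_val_log(text):
    """Return a list of (epoch, val_loss, val_f1) tuples."""
    return [(int(e), float(loss), float(f1)) for e, loss, f1 in PATTERN.findall(text)]

rows = parse_val_log(log)
best_by_loss = min(rows, key=lambda r: r[1])
best_by_f1 = max(rows, key=lambda r: r[2])
print(best_by_loss[0])  # 3
print(best_by_f1[0])    # 5
```

The two criteria pick different epochs here, so saving the checkpoint by validation loss does not necessarily keep the weights with the highest validation F1. If macro F1 is the metric of interest, selecting the checkpoint on F1 may be worth testing.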
