# Practical machine learning and deep learning. Lab 3

# Deep Learning in Natural Language Processing

# [Competition](https://www.kaggle.com/t/f330221b5e8044d8aedf5d209dcaeeb1)

## Goal

Your goal is to implement a neural network that classifies Amazon product reviews by product category.

## Submission

The submission format is described on the competition page.

## Data preprocessing

Data preprocessing is an essential step in building a machine learning model, and the quality of the model depends on how well the data has been preprocessed.

In NLP, text preprocessing is the first step in the process of building a model.

The various text preprocessing steps are:

* Tokenization
* Lower casing
* Stop words removal
* Stemming

These steps shrink the set of distinct tokens, which is why they are widely used for dimensionality reduction.
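
To make the dimensionality claim concrete, here is a small standard-library sketch that counts distinct tokens before and after lower casing and stop words removal. The sample sentence and the tiny stop word list are made up for illustration, and stemming is left out to keep it short.

```python
# Toy illustration: fewer distinct tokens means a smaller vocabulary.
# The sentence and the stop word list below are hypothetical examples.
text = "The puzzle is great and THE kids love the Puzzle"
stop_words = {"the", "is", "and"}

raw_tokens = text.split()
lowered_tokens = [t.lower() for t in raw_tokens]
filtered_tokens = [t for t in lowered_tokens if t not in stop_words]

# Distinct tokens at each stage
vocab_raw = set(raw_tokens)
vocab_lowered = set(lowered_tokens)
vocab_filtered = set(filtered_tokens)

print(len(vocab_raw), len(vocab_lowered), len(vocab_filtered))  # 10 7 4
```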

First, let's read the input data and then perform the preprocessing steps.

In [1]:
import pandas as pd

train_dataframe = pd.read_csv('/kaggle/input/pmldl-week-4-helpfulness-of-amazon-reviews/train.csv')
test_dataframe = pd.read_csv('/kaggle/input/pmldl-week-4-helpfulness-of-amazon-reviews/test.csv')

train_dataframe.head()

Unnamed: 0,Title,Helpfulness,Score,Text,Category
0,Golden Valley Natural Buffalo Jerky,0/0,3.0,The description and photo on this product need...,grocery gourmet food
1,Westing Game,0/0,5.0,This was a great book!!!! It is well thought t...,toys games
2,Westing Game,0/0,5.0,"I am a first year teacher, teaching 5th grade....",toys games
3,Westing Game,0/0,5.0,I got the book at my bookfair at school lookin...,toys games
4,I SPY A is For Jigsaw Puzzle 63pc,2/4,5.0,Hi! I'm Martine Redman and I created this puzz...,toys games


In the training data we have `4` features (`Title`, `Helpfulness`, `Score` and `Text`) and the target column (`Category`). The test data has the same features but no target column.

First, let's write functions for preprocessing the helpfulness and score features in case we need them.

In [2]:
def preprocess_score_inplace(df):
    """
    Normalizes score to make it from 0 to 1.
    
    For now it is from 1.0 to 5.0, so natural choice
    is to normalize by (f - 1.0)/4.0
    """
    df['Score'] = (df['Score'] - 1.0) / 4.0
    return df

def preprocess_helpfulness_inplace(df):
    """
    Splits feature by '/' and normalize helpfulness to make it from 0 to 1
    
    The total number of assessments can be 0, so let's substitute it
    with 1. The resulting helpfulness still will be zero but we
    remove the possibility of division by zero exception.

    Return value should be float
    """
    # Write your code here
    
    df[['HelpfulVotes', 'TotalVotes']] = df['Helpfulness'].str.split('/', expand=True).astype(int)
    
    # Replace total votes of 0 with 1 to avoid division by zero
    df['TotalVotes'] = df['TotalVotes'].replace(0, 1)
    
    # Calculate the helpfulness ratio and replace the original 'Helpfulness' column
    df['Helpfulness'] = df['HelpfulVotes'] / df['TotalVotes']
    
    # Drop the extra columns created
    df.drop(['HelpfulVotes', 'TotalVotes'], axis=1, inplace=True)
    
    return df  
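
As a quick check of the zero-division handling described in the docstring, the same rule can be written for a single `helpful/total` string. This is a plain-Python sketch; `helpfulness_ratio` is a hypothetical helper and not part of the lab code.

```python
def helpfulness_ratio(raw: str) -> float:
    """Convert a 'helpful/total' string such as '2/4' into a ratio in [0, 1]."""
    helpful, total = (int(part) for part in raw.split("/"))
    # A review with no votes gets total = 1, so the ratio is 0.0 instead of
    # raising ZeroDivisionError.
    return helpful / max(total, 1)

print(helpfulness_ratio("2/4"))  # 0.5
print(helpfulness_ratio("0/0"))  # 0.0
```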

The two other features are both text. For simplicity, let's concatenate them so that we have one full text feature. The resulting code is also a function.

In [3]:
def concat_title_text_inplace(df):
    """
    Concatenates Title and Text columns together
    """
    df['Text'] = df['Title'] + " " + df['Text']
    df.drop('Title', axis=1, inplace=True)
    return df

Also, encode the target categories so that each label becomes an index.

In [4]:
# define categories indices
cat2idx = {
    'toys games': 0,
    'health personal care': 1,
    'beauty': 2,
    'baby products': 3,
    'pet supplies': 4,
    'grocery gourmet food': 5,
}
# define reverse mapping
idx2cat = {
    v:k for k,v in cat2idx.items()
}

In [5]:
def encode_categories(df):
    df['Category'] = df['Category'].apply(lambda x: cat2idx[x])
    return df

Let's visualize our first stage of preprocessing.

In [6]:
train_copy = train_dataframe.head().copy()

encode_categories(preprocess_score_inplace(preprocess_helpfulness_inplace(concat_title_text_inplace(train_copy))))

Unnamed: 0,Helpfulness,Score,Text,Category
0,0.0,0.5,Golden Valley Natural Buffalo Jerky The descri...,5
1,0.0,1.0,Westing Game This was a great book!!!! It is w...,0
2,0.0,1.0,"Westing Game I am a first year teacher, teachi...",0
3,0.0,1.0,Westing Game I got the book at my bookfair at ...,0
4,0.5,1.0,I SPY A is For Jigsaw Puzzle 63pc Hi! I'm Mart...,0


### Text cleaning

For text cleaning, you can use lower casing, punctuation removal, number removal, tokenization, stop words removal and stemming. Together these strip most of the noise from the text.

In [7]:
import re

def lower_text(text: str):
    return text.lower()

def remove_numbers(text: str):
    """
    Substitute all numbers with a space in case of
    "there is5dogs".
    
    If subs with '' -> "there isdogs"
    With ' ' -> "there is dogs"
    """
    text_nonum = re.sub(r'\d+', ' ', text)
    return text_nonum

def remove_punctuation(text: str):
    """
    Substitute all punctuations with space in case of
    "hello!nice to meet you"
    
    If subs with '' -> "hellonice to meet you"
    With ' ' -> "hello nice to meet you"
    """
    text_nopunct = re.sub(r'[^\w\s]', ' ', text)
    return text_nopunct

def remove_multiple_spaces(text: str):
    """
    Replace multiple spaces with a single space.
    """
    text_no_doublespace = re.sub(r'\s+', ' ', text).strip()
    return text_no_doublespace

This will give us clean text.

In [8]:
sample_text = train_copy['Text'][4]

_lowered = lower_text(sample_text)
_without_numbers = remove_numbers(_lowered)
_without_punct = remove_punctuation(_without_numbers)
_single_spaced = remove_multiple_spaces(_without_punct)

print(sample_text)
print('-'*10)
print(_lowered)
print('-'*10)
print(_without_numbers)
print('-'*10)
print(_without_punct)
print('-'*10)
print(_single_spaced)

I SPY A is For Jigsaw Puzzle 63pc Hi! I'm Martine Redman and I created this puzzle for Briarpatch using a great photo from Jean Marzollo and Walter Wick's terrific book, I Spy School Days. Kids need lots of practice to master the ABC's, and this puzzle provides an enjoyable reinforcing tool. Its visual richness helps non-readers and readers alike to remember word associations, and the wealth of cleverly chosen objects surrounding each letter promote language development. The riddle included multiplies the fun of assembling this colorful puzzle. For another great Briarpatch puzzle, check out I Spy Blocks. END
----------
i spy a is for jigsaw puzzle 63pc hi! i'm martine redman and i created this puzzle for briarpatch using a great photo from jean marzollo and walter wick's terrific book, i spy school days. kids need lots of practice to master the abc's, and this puzzle provides an enjoyable reinforcing tool. its visual richness helps non-readers and readers alike to remember word associa

Now for the harder preprocessing: tokenization, stop words removal and stemming.
For that you can use several packages, but we encourage you to use `nltk` (Natural Language Toolkit) as well as `torchtext`.


Take a look at:
* `nltk.tokenize.word_tokenize` or `torchtext.data.utils.get_tokenizer` for tokenization
* `nltk.corpus.stopwords` for stop words removal; the `punkt` resource is the tokenizer model that `word_tokenize` relies on and is fetched with `nltk.download('punkt')`
* `nltk.stem.PorterStemmer` for stemming

In [9]:
# Necessary imports
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# from torchtext.data.utils import get_tokenizer

# Download the necessary nltk resources
nltk.download('punkt')
nltk.download('stopwords')

# Tokenization using NLTK or torchtext
def tokenize_text(text: str) -> list[str]:
    """
    Tokenize the text into individual words using nltk or torchtext.
    """
    # Option 1: Using nltk
    tokens = word_tokenize(text)
    
    # Option 2 (alternative): Using torchtext (uncomment if you prefer torchtext)
    # tokenizer = get_tokenizer("basic_english")
    # tokens = tokenizer(text)
    
    return tokens

# Remove stop words
# Build the stop word set once; rebuilding it for every review is slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(tokenized_text: list[str]) -> list[str]:
    """
    Remove common stop words from tokenized text using nltk.
    """
    filtered_tokens = [word for word in tokenized_text if word.lower() not in STOP_WORDS]
    return filtered_tokens

# Stemming using NLTK's PorterStemmer
def stem_words(tokenized_text: list[str]) -> list[str]:
    """
    Apply stemming to the tokenized text using nltk's PorterStemmer.
    """
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in tokenized_text]
    return stemmed_words


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
_tokenized = tokenize_text(_single_spaced)
_without_sw = remove_stop_words(_tokenized)
_stemmed = stem_words(_without_sw)

print(_single_spaced)
print('-'*10)
print(_tokenized)
print('-'*10)
print(_without_sw)
print('-'*10)
print(_stemmed)

i spy a is for jigsaw puzzle pc hi i m martine redman and i created this puzzle for briarpatch using a great photo from jean marzollo and walter wick s terrific book i spy school days kids need lots of practice to master the abc s and this puzzle provides an enjoyable reinforcing tool its visual richness helps non readers and readers alike to remember word associations and the wealth of cleverly chosen objects surrounding each letter promote language development the riddle included multiplies the fun of assembling this colorful puzzle for another great briarpatch puzzle check out i spy blocks end
----------
['i', 'spy', 'a', 'is', 'for', 'jigsaw', 'puzzle', 'pc', 'hi', 'i', 'm', 'martine', 'redman', 'and', 'i', 'created', 'this', 'puzzle', 'for', 'briarpatch', 'using', 'a', 'great', 'photo', 'from', 'jean', 'marzollo', 'and', 'walter', 'wick', 's', 'terrific', 'book', 'i', 'spy', 'school', 'days', 'kids', 'need', 'lots', 'of', 'practice', 'to', 'master', 'the', 'abc', 's', 'and', 'this

As you can see, many words have been removed and the remaining ones have been reduced to their stems. Now we are able to construct the full cleaning preprocessing stage.

In [11]:
def preprocessing_stage(text):
    _lowered = lower_text(text)
    _without_numbers = remove_numbers(_lowered)
    _without_punct = remove_punctuation(_without_numbers)
    _single_spaced = remove_multiple_spaces(_without_punct)
    _tokenized = tokenize_text(_single_spaced)
    _without_sw = remove_stop_words(_tokenized)
    _stemmed = stem_words(_without_sw)
    
    return _stemmed

def clean_text_inplace(df):
    df['Text'] = df['Text'].apply(preprocessing_stage)
    return df

def preprocess(df):
    df.fillna(" ", inplace=True)
    _preprocess_score = preprocess_score_inplace(df)
    _preprocess_helpfulness = preprocess_helpfulness_inplace(_preprocess_score)
    _concatted = concat_title_text_inplace(_preprocess_helpfulness)

    if 'Category' in df.columns:
        _encoded = encode_categories(_concatted)
        _cleaned = clean_text_inplace(_encoded)
    else:
        _cleaned = clean_text_inplace(_concatted)
    return _cleaned
    

And now let's apply it to our train and test dataframes.

In [12]:
train_preprocessed = preprocess(train_dataframe)
test_preprocessed = preprocess(test_dataframe)

train_preprocessed.head()

Unnamed: 0,Helpfulness,Score,Text,Category
0,0.0,0.5,"[golden, valley, natur, buffalo, jerki, descri...",5
1,0.0,1.0,"[west, game, great, book, well, thought, easil...",0
2,0.0,1.0,"[west, game, first, year, teacher, teach, th, ...",0
3,0.0,1.0,"[west, game, got, book, bookfair, school, look...",0
4,0.5,1.0,"[spi, jigsaw, puzzl, pc, hi, martin, redman, c...",0


Now, let's split our original train dataset into train and val sets.

In [13]:
from sklearn.model_selection import train_test_split

ratio = 0.2
train, val = train_test_split(
    train_preprocessed, stratify=train_preprocessed['Category'], test_size=ratio, random_state=420
)

And now, for the best result, let's get rid of pandas so that nothing stops us from working with torchtext. For that, let's create an iterator that yields samples for us.

# Creating dataloaders

First, you should generate the vocab from the train set.

For that, use `torchtext.vocab.build_vocab_from_iterator`.

In [14]:
!pip install torchtext==0.13.0

Collecting torchtext==0.13.0
  Downloading torchtext-0.13.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.9 kB)
Collecting torch==1.12.0 (from torchtext==0.13.0)
  Downloading torch-1.12.0-cp310-cp310-manylinux1_x86_64.whl.metadata (22 kB)
Downloading torchtext-0.13.0-cp310-cp310-manylinux1_x86_64.whl (1.9 MB)
Downloading torch-1.12.0-cp310-cp310-manylinux1_x86_64.whl (776.3 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 2.4.0+cpu
    Uninstalling torch-2.4.0+cpu:
      Successfully uninstalled torch-2.4.0+cpu
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the 

In [15]:
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(df):
    # Yield the token list of every review in the dataframe that was passed in
    for tokens in df['Text']:
        yield tokens


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
# Make sure the tokens are in order of their indices to properly insert them in vocab
vocab = build_vocab_from_iterator(yield_tokens(train), 
                                      specials=special_symbols, 
                                      special_first=True)
vocab.set_default_index(UNK_IDX)  # Set <unk> to be the default for unknown tokens

And then use our vocab to encode the tokenized sequence.

In [16]:
sample = train['Text'][2]
print(sample)
encoded = vocab(sample)
print(encoded)

['west', 'game', 'first', 'year', 'teacher', 'teach', 'th', 'grade', 'special', 'read', 'class', 'high', 'comprehens', 'level', 'read', 'book', 'one', 'best', 'thing', 'taught', 'year', 'expand', 'mind', 'allow', 'put', 'charact', 'place', 'easi', 'student', 'make', 'mind', 'movi', 'even', 'use', 'whole', 'read', 'class', 'time', 'order', 'finish', 'book', 'student', 'wait', 'hear', 'end', 'excel', 'book', 'read', 'everi', 'year', 'student']
[2556, 43, 33, 14, 2751, 807, 860, 1724, 728, 131, 1895, 191, 6980, 583, 131, 515, 5, 59, 46, 3505, 14, 2954, 528, 450, 40, 1125, 165, 50, 1924, 22, 528, 945, 30, 4, 271, 131, 1895, 13, 68, 623, 515, 1924, 426, 600, 180, 311, 515, 131, 85, 14, 1924]
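
For intuition, the lookup above behaves roughly like a dictionary with a fallback index. The following plain-Python sketch is an approximation and not torchtext's implementation: `build_toy_vocab` and `encode` are hypothetical helpers, the special symbols are placed first, and the remaining tokens are assumed to be ordered by descending frequency.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials):
    """Map each token to an index: specials first, then by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (high to low), breaking ties alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def encode(tokens, stoi, unk_idx=0):
    """Replace every token with its index; unseen tokens map to unk_idx."""
    return [stoi.get(tok, unk_idx) for tok in tokens]

toy_vocab = build_toy_vocab(
    [["west", "game", "book"], ["book", "great", "book"]],
    specials=["<unk>", "<pad>", "<bos>", "<eos>"],
)
# 'dinosaur' was never seen, so it falls back to the <unk> index 0.
print(encode(["book", "game", "dinosaur"], toy_vocab))  # [4, 5, 0]
```

The fallback plays the same role as `vocab.set_default_index(UNK_IDX)`: tokens that appear only in the validation or test set do not raise an error.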


Now we can define our collate function and create dataloaders

In [17]:
import torch
from torch.utils.data import DataLoader

torch.manual_seed(420)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, score_list, helpfulness_list, offsets = [], [], [], [], [0]
    for _helpfulnes, _score, _text, _label in batch:
        label_list.append(_label)
        processed_text = torch.tensor(vocab(_text), dtype=torch.int64)
        text_list.append(processed_text)
        score_list.append(_score)
        helpfulness_list.append(_helpfulnes)
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    score_list = torch.tensor(score_list, dtype=torch.float64)
    helpfulness_list = torch.tensor(helpfulness_list, dtype=torch.float64)
    return label_list.to(device), text_list.to(device), offsets.to(device), score_list.to(device), helpfulness_list.to(device)

train_dataloader = DataLoader(
    train.to_numpy(), batch_size=128, shuffle=True, collate_fn=collate_batch
)

val_dataloader = DataLoader(
    val.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)
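
The `offsets` tensor returned by `collate_batch` tells `nn.EmbeddingBag` where each review starts inside the concatenated `text_list`. Here is a standard-library sketch of the same bookkeeping, using made-up token ids:

```python
from itertools import accumulate

# Three hypothetical encoded reviews of different lengths.
batch = [[12, 7, 43], [5, 9], [8, 8, 3, 21]]

# Concatenate all token ids into one flat list, as torch.cat does for tensors.
flat = [tok for review in batch for tok in review]

# Start position of each review: cumulative sum of the lengths, shifted by one.
lengths = [len(review) for review in batch]
offsets = [0] + list(accumulate(lengths))[:-1]

# Slicing the flat list at the offsets recovers the original reviews.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(offsets)  # [0, 3, 5]
```

Because the boundaries are stored separately, no padding is needed and reviews of any length can share a batch.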

# Defining Network


For writing a network you can use `torch.nn.Embedding` or `torch.nn.EmbeddingBag`. This will allow your network to learn an embedding vector for each of your tokens.

As for the other modules in your network, consider these options:
* Simple Linear layers, activations, basic stuff that goes into the network
* It is possible not to use the offsets (start indices of the sequences) and to use a predefined sequence length instead (the maximum length, some fixed value, etc.). If this is an option for you, change the `collate_batch` function according to your architecture.
* You could use all this recurrent stuff (RNN, GRU, LSTM, even Transformer, it is all up to you), but remember about the dimensions and hidden states
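
If you go for a fixed sequence length instead of offsets, each encoded review has to be truncated or padded before batching. This is a minimal sketch that assumes the `<pad>` index of 1 defined earlier; `pad_or_truncate` and `max_len` are hypothetical names.

```python
PAD_IDX = 1  # index of '<pad>' in the special symbols list

def pad_or_truncate(token_ids, max_len, pad_idx=PAD_IDX):
    """Return a list of exactly max_len ids: cut long inputs, pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_idx] * (max_len - len(clipped))

batch = [[12, 7, 43], [5, 9, 8, 8, 3, 21]]
padded = [pad_or_truncate(ids, max_len=4) for ids in batch]
print(padded)  # [[12, 7, 43, 1], [5, 9, 8, 8]]
```

Equal-length lists can then be stacked into a `(batch_size, max_len)` tensor and passed to `nn.Embedding`, where `padding_idx=PAD_IDX` keeps the padding vector at zero.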

In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, use_lstm=False, hidden_dim=None):
        super(TextClassificationModel, self).__init__()

        # Embedding layer, assuming vocab_size is the size of the vocabulary
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        
        # Recurrent layers (LSTM or GRU) - optional
        self.use_lstm = use_lstm
        if self.use_lstm:
            # hidden_dim should be defined when using LSTM or GRU
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)
        else:
            # A simple linear layer for classification
            self.fc = nn.Linear(embed_dim, num_classes)
        
        # You can add other activation functions or layers as needed
    
    def forward(self, text, offsets=None):
        # Get the embedded representation of the text
        embedded = self.embedding(text, offsets)  # Use nn.EmbeddingBag for efficient embeddings
        
        # If using LSTM/GRU, pass through the recurrent layer
        if self.use_lstm:
            # EmbeddingBag has already pooled each review into one vector,
            # so the LSTM sees a sequence of length 1 here
            output, (hn, cn) = self.lstm(embedded.unsqueeze(1))
            embedded = hn[-1]  # Take the hidden state of the last layer

        # Pass the output of the embedding or recurrent layer through the linear classifier
        return self.fc(embedded)

In [19]:
import torch
import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, use_lstm=False, hidden_dim=None):
        super(TextClassificationModel, self).__init__()
        
        # Embedding layer (or EmbeddingBag)
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        
        self.use_lstm = use_lstm
        if self.use_lstm:
            # hidden_dim should be defined when using LSTM or GRU
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)
        else:
            # A simple linear layer for classification
            self.fc = nn.Linear(embed_dim, num_classes)
        
        # self.fc is already defined above with the right input size for each
        # branch; redefining it here would break the LSTM path when
        # hidden_dim != embed_dim
        
        # Optional: Activation function
        self.activation = nn.ReLU()
        
    def forward(self, text, offsets):
        # Pass text and offsets through the embedding layer
        embedded = self.embedding(text, offsets)
        
        if self.use_lstm:
            # EmbeddingBag has already pooled each review into one vector,
            # so the LSTM sees a sequence of length 1 here
            output, (hn, cn) = self.lstm(embedded.unsqueeze(1))
            embedded = hn[-1]  # Take the hidden state of the last layer
        
        # You can add an activation here if desired
        output = self.fc(self.activation(embedded))
        
        return output

In [20]:
from tqdm.autonotebook import tqdm

def train_one_epoch(
    model,
    loader,
    optimizer,
    loss_fn,
    epoch_num=-1
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: train",
        leave=True,
    )
    model.train()
    train_loss = 0.0
    for i, batch in loop:
        labels, texts, offsets, scores, helpfulness = batch
        optimizer.zero_grad()  # Clear gradients
    
        # Forward pass
        outputs = model(texts, offsets)

        # Loss calculation
        loss = loss_fn(outputs, labels)

        # Backward pass
        loss.backward()

        # Optimizer step (update weights)
        optimizer.step()

        # Accumulate loss for tracking
        train_loss += loss.item() * len(labels)

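        # Update the progress bar with the running mean loss per sample
        # (the last batch may be smaller than the others)
        seen = min(i * loader.batch_size, len(loader.dataset))
        loop.set_postfix({"loss": train_loss / seen})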

def val_one_epoch(
    model,
    loader,
    loss_fn,
    epoch_num=-1,
    best_so_far=0.0,
    ckpt_path='best.pt'
):
    
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: val",
        leave=True,
    )
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            labels, texts, offsets, scores, helpfulness = batch
            
            # Forward pass
            outputs = model(texts, offsets)

            # Loss calculation
            loss = loss_fn(outputs, labels)

            # Prediction and accuracy
            _, predicted = torch.max(outputs, 1)  # Index of the largest logit is the predicted class
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            # Accumulate validation loss
            val_loss += loss.item() * len(labels)

            # Update the progress bar
            loop.set_postfix({"loss": val_loss / total, "acc": correct / total})

        if correct / total > best_so_far:
            torch.save(model.state_dict(), ckpt_path)
            return correct / total

    return best_so_far

In [21]:
import torch
import torch.optim as optim
import torch.nn as nn

vocab_size = len(vocab)  # Size of your vocabulary
embed_dim = 128  # Embedding dimension
num_classes = 6  # Number of classes in your classification task

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = TextClassificationModel(vocab_size, embed_dim, num_classes, use_lstm=False, hidden_dim=128).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Using Adam optimizer
loss_fn = nn.CrossEntropyLoss()  # Appropriate loss function for classification tasks
epochs = 10



cpu


In [22]:
best = -float('inf')
for epoch in range(epochs):
    train_one_epoch(model, train_dataloader, optimizer, loss_fn, epoch_num=epoch)
    best = val_one_epoch(model, val_dataloader, loss_fn, epoch, best_so_far=best)

Epoch 0: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 0: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 1: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 1: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 2: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 2: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 3: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 3: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 4: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 4: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 5: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 5: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 6: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 6: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 7: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 7: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 8: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 8: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 9: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 9: val:   0%|          | 0/63 [00:00<?, ?it/s]

# Predictions

In [23]:
def collate_batch(batch):
    label_list, text_list, score_list, helpfulness_list, offsets = [], [], [], [], [0]
    for _helpfulnes, _score, _text, ids in batch:
        processed_text = torch.tensor(vocab(_text), dtype=torch.int64)
        text_list.append(processed_text)
        score_list.append(_score)
        helpfulness_list.append(_helpfulnes)
        offsets.append(processed_text.size(0))

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    score_list = torch.tensor(score_list, dtype=torch.float64)
    helpfulness_list = torch.tensor(helpfulness_list, dtype=torch.float64)
    return text_list.to(device), offsets.to(device), score_list.to(device), helpfulness_list.to(device)

test_dataloader = DataLoader(
    test_preprocessed.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)

In [24]:
def predict(
    model,
    loader,
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc="Predictions:",
        leave=True,
    )
    predictions = []
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            texts, offsets, scores, helpfulness = batch

            # forward pass (no loss here: the test set has no labels)
            outputs = model(texts, offsets)
            
            _, predicted = torch.max(outputs, 1)
            predictions += predicted.detach().cpu().tolist()

    return predictions

In [25]:
ckpt = torch.load("best.pt", map_location=device)
model.load_state_dict(ckpt)

predictions = predict(model, test_dataloader)
predictions[:10]

Predictions::   0%|          | 0/16 [00:00<?, ?it/s]

[1, 1, 2, 1, 3, 1, 3, 2, 2, 4]

In [26]:
submission_df = test_dataframe.copy()

submission_df.drop(['Helpfulness','Score','Text'],axis=1,inplace=True)
submission_df['Category'] = [idx2cat[x] for x in predictions]

submission_df.to_csv('submission.csv', index=False)