# Practical machine learning and deep learning. Lab 3

# Deep Learning in Natural Language Processing

# [Competition](https://www.kaggle.com/t/f330221b5e8044d8aedf5d209dcaeeb1)

## Goal

Your goal is to implement a neural network that classifies Amazon product reviews.

## Submission

The submission format is described on the competition page.

## Data preprocessing

Data preprocessing is an essential step in building a machine learning model: the quality of the results depends on how well the data has been preprocessed.

In NLP, text preprocessing is the first step in the process of building a model.

The various text preprocessing steps are:

* Tokenization
* Lower casing
* Stop words removal
* Stemming

These steps are widely used for dimensionality reduction: they shrink the vocabulary the model has to learn.
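To see why these steps reduce dimensionality, here is a minimal sketch in plain Python that counts distinct tokens before and after lowercasing and stop words removal. The stop-word set `TOY_STOP_WORDS` and the helper `vocab_sizes` are hypothetical names made up for this illustration; the lab itself uses the `nltk` list shown later.

```python
# Tiny hypothetical stop-word list, for illustration only
TOY_STOP_WORDS = {"a", "the", "is", "this", "and"}


def vocab_sizes(text, stop_words=TOY_STOP_WORDS):
    """Return the number of distinct tokens at each preprocessing stage."""
    raw = text.split()                                    # naive whitespace tokenization
    lowered = [token.lower() for token in raw]            # lowercasing
    filtered = [t for t in lowered if t not in stop_words]  # stop words removal
    return len(set(raw)), len(set(lowered)), len(set(filtered))


sizes = vocab_sizes("This puzzle is great and the Puzzle is a GREAT gift")
print(sizes)  # (10, 8, 3)
```

Every distinct token later becomes one row in the embedding table, so a smaller vocabulary means fewer parameters to learn.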

First, let's read the input data and then perform the preprocessing steps.

In [1]:
import pandas as pd
import os

train_dataframe = pd.read_csv(os.pardir+'/data/train.csv')
test_dataframe = pd.read_csv(os.pardir+'/data/test.csv')

train_dataframe.head()

Unnamed: 0,Title,Helpfulness,Score,Text,Category
0,Golden Valley Natural Buffalo Jerky,0/0,3.0,The description and photo on this product need...,grocery gourmet food
1,Westing Game,0/0,5.0,This was a great book!!!! It is well thought t...,toys games
2,Westing Game,0/0,5.0,"I am a first year teacher, teaching 5th grade....",toys games
3,Westing Game,0/0,5.0,I got the book at my bookfair at school lookin...,toys games
4,I SPY A is For Jigsaw Puzzle 63pc,2/4,5.0,Hi! I'm Martine Redman and I created this puzz...,toys games


In the training data we have `4` features (`Title`, `Helpfulness`, `Score` and `Text`) and the target column (`Category`). The test data has the same features but no target column.

First, let's write functions for preprocessing the helpfulness and score features in case we need them.

In [2]:
def preprocess_score_inplace(df):
    """
    Normalizes score to make it from 0 to 1.
    
    For now it is from 1.0 to 5.0, so natural choice
    is to normalize by (f - 1.0)/4.0
    """
    df['Score'] = (df['Score'] - 1.0) / 4.0
    return df

def preprocess_helpfulness_inplace(df):
    """
    Splits feature by '/' and normalize helpfulness to make it from 0 to 1
    
    The total number of assessments can be 0, so let's substitute it
    with 1. The resulting helpfulness still will be zero but we
    remove the possibility of division by zero exception.

    Return value should be float
    """
    helpf_df = df['Helpfulness'].str.split("/", expand=True).astype(int)
    helpf_df.columns = ["Helpful", "Total"]
    helpf_df["Total"] = helpf_df["Total"].replace(0,1)
    
    df["Helpfulness"] = helpf_df["Helpful"] / helpf_df["Total"]
    
    return df    
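As a quick sanity check of the two normalizations above, here is a scalar sketch of the same arithmetic without pandas. The function names `normalize_score` and `helpfulness_ratio` are hypothetical and exist only for this example.

```python
def normalize_score(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Min-max scaling of a 1.0..5.0 score into the 0..1 range."""
    return (score - low) / (high - low)


def helpfulness_ratio(raw: str) -> float:
    """Turn a 'helpful/total' string into a ratio, treating a total of 0 as 1."""
    helpful, total = (int(part) for part in raw.split("/"))
    return helpful / max(total, 1)


print(normalize_score(3.0))      # 0.5
print(helpfulness_ratio("2/4"))  # 0.5
print(helpfulness_ratio("0/0"))  # 0.0
```

These values agree with the first rows of the preprocessed table: a score of `3.0` maps to `0.5`, and `2/4` maps to `0.5`.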

The two other features are both text. For simplicity, let's concatenate them so that we have one full text feature. The resulting code is also a function.

In [3]:
def concat_title_text_inplace(df):
    """
    Concatenates Title and Text columns together
    """
    df['Text'] = df['Title'] + " " + df['Text']
    df.drop('Title', axis=1, inplace=True)
    return df

Also, encode the target categories so that each label becomes an integer index.

In [4]:
# define categories indices
cat2idx = {
    'toys games': 0,
    'health personal care': 1,
    'beauty': 2,
    'baby products': 3,
    'pet supplies': 4,
    'grocery gourmet food': 5,
}
# define reverse mapping
idx2cat = {
    v:k for k,v in cat2idx.items()
}

In [5]:
def encode_categories(df):
    df['Category'] = df['Category'].apply(lambda x: cat2idx[x])
    return df

Let's visualize our first stage of preprocessing.

In [6]:
train_copy = train_dataframe.head().copy()

encode_categories(preprocess_score_inplace(preprocess_helpfulness_inplace(concat_title_text_inplace(train_copy))))

Unnamed: 0,Helpfulness,Score,Text,Category
0,0.0,0.5,Golden Valley Natural Buffalo Jerky The descri...,5
1,0.0,1.0,Westing Game This was a great book!!!! It is w...,0
2,0.0,1.0,"Westing Game I am a first year teacher, teachi...",0
3,0.0,1.0,Westing Game I got the book at my bookfair at ...,0
4,0.5,1.0,I SPY A is For Jigsaw Puzzle 63pc Hi! I'm Mart...,0


### Text cleaning

For text cleaning, you can use lowercasing, punctuation removal, number removal, tokenization, stop words removal, and stemming. Together these strip most of the noise from the text.

In [7]:
import re

def lower_text(text: str):
    return text.lower()

def remove_numbers(text: str):
    """
    Substitute all numbers with space in case of
    "there is5dogs".
    
    If subs with '' -> "there isdogs"
    With ' ' -> "there is dogs"
    """
    text_nonum = re.sub(r'\d+', ' ', text)
    return text_nonum

def remove_punctuation(text: str):
    """
    Substitute all punctuation with space in case of
    "hello!nice to meet you"
    
    If subs with '' -> "hellonice to meet you"
    With ' ' -> "hello nice to meet you"
    """
    text_nopunct = re.sub(r'\W+', ' ', text)
    return text_nopunct

def remove_multiple_spaces(text: str):
    text_no_doublespace = re.sub(r'\s+', ' ', text)
    return text_no_doublespace

This will give us clean text.
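To check the order of operations on a short input first, here is a compact sketch that chains the same regular expressions into one function. The name `clean` and the final `strip()` are additions for this example only and are not part of the cell above.

```python
import re


def clean(text: str) -> str:
    """Lowercase, then replace digits, non-word characters and repeated spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # numbers -> space
    text = re.sub(r"\W+", " ", text)  # punctuation (and any whitespace run) -> space
    text = re.sub(r"\s+", " ", text)  # collapse whatever whitespace is left
    return text.strip()               # assumption: trim the edges for readability


print(clean("Hello!Nice to meet you, there is5dogs."))
# hello nice to meet you there is dogs
```

Note that `\W` does not match the underscore, so a token such as `snake_case` survives this cleaning unchanged.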

In [8]:
sample_text = train_copy['Text'][4]

_lowered = lower_text(sample_text)
_without_numbers = remove_numbers(_lowered)
_without_punct = remove_punctuation(_without_numbers)
_single_spaced = remove_multiple_spaces(_without_punct)

print(sample_text)
print('-'*10)
print(_lowered)
print('-'*10)
print(_without_numbers)
print('-'*10)
print(_without_punct)
print('-'*10)
print(_single_spaced)

I SPY A is For Jigsaw Puzzle 63pc Hi! I'm Martine Redman and I created this puzzle for Briarpatch using a great photo from Jean Marzollo and Walter Wick's terrific book, I Spy School Days. Kids need lots of practice to master the ABC's, and this puzzle provides an enjoyable reinforcing tool. Its visual richness helps non-readers and readers alike to remember word associations, and the wealth of cleverly chosen objects surrounding each letter promote language development. The riddle included multiplies the fun of assembling this colorful puzzle. For another great Briarpatch puzzle, check out I Spy Blocks. END
----------
i spy a is for jigsaw puzzle 63pc hi! i'm martine redman and i created this puzzle for briarpatch using a great photo from jean marzollo and walter wick's terrific book, i spy school days. kids need lots of practice to master the abc's, and this puzzle provides an enjoyable reinforcing tool. its visual richness helps non-readers and readers alike to remember word associa

Now for the harder preprocessing: tokenization, stop words removal and stemming.
Several packages can do this, but we encourage you to use `nltk` (Natural Language Toolkit) as well as `torchtext`.


Take a look at:
* `nltk.tokenize.word_tokenize` or `torchtext.data.utils.get_tokenizer` for tokenization
* `nltk.corpus.stopwords` for stop words removal (the `punkt` resources downloaded below are tokenizer models used by `word_tokenize`, not a punctuation list)
* `nltk.stem.PorterStemmer` for stemming

In [9]:
import nltk
from nltk.tokenize import word_tokenize, punkt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('punkt_tab')
nltk.download('stopwords')
print(stopwords.words('english'))

[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', '

In [10]:
def tokenize_text(text: str) -> list[str]:
    return word_tokenize(text)

def remove_stop_words(tokenized_text: list[str]) -> list[str]:
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [w for w in tokenized_text if w.lower() not in stop_words]
    return filtered_sentence

def stem_words(tokenized_text: list[str]) -> list[str]:
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokenized_text]

In [11]:
_tokenized = tokenize_text(_single_spaced)
_without_sw = remove_stop_words(_tokenized)
_stemmed = stem_words(_without_sw)

print(_single_spaced)
print('-'*10)
print(_tokenized)
print('-'*10)
print(_without_sw)
print('-'*10)
print(_stemmed)

i spy a is for jigsaw puzzle pc hi i m martine redman and i created this puzzle for briarpatch using a great photo from jean marzollo and walter wick s terrific book i spy school days kids need lots of practice to master the abc s and this puzzle provides an enjoyable reinforcing tool its visual richness helps non readers and readers alike to remember word associations and the wealth of cleverly chosen objects surrounding each letter promote language development the riddle included multiplies the fun of assembling this colorful puzzle for another great briarpatch puzzle check out i spy blocks end
----------
['i', 'spy', 'a', 'is', 'for', 'jigsaw', 'puzzle', 'pc', 'hi', 'i', 'm', 'martine', 'redman', 'and', 'i', 'created', 'this', 'puzzle', 'for', 'briarpatch', 'using', 'a', 'great', 'photo', 'from', 'jean', 'marzollo', 'and', 'walter', 'wick', 's', 'terrific', 'book', 'i', 'spy', 'school', 'days', 'kids', 'need', 'lots', 'of', 'practice', 'to', 'master', 'the', 'abc', 's', 'and', 'this

As you can see, a lot of words were removed, along with the unnecessary word endings (I mean stemming, c'mon). Now we can build the full text-cleaning stage.

In [12]:
def preprocessing_stage(text):
    _lowered = lower_text(text)
    _without_numbers = remove_numbers(_lowered)
    _without_punct = remove_punctuation(_without_numbers)
    _single_spaced = remove_multiple_spaces(_without_punct)
    _tokenized = tokenize_text(_single_spaced)
    _without_sw = remove_stop_words(_tokenized)
    _stemmed = stem_words(_without_sw)
    
    return _stemmed

def clean_text_inplace(df):
    df['Text'] = df['Text'].apply(preprocessing_stage)
    return df

def preprocess(df):
    df.fillna(" ", inplace=True)
    _preprocess_score = preprocess_score_inplace(df)
    _preprocess_helpfulness = preprocess_helpfulness_inplace(_preprocess_score)
    _concatted = concat_title_text_inplace(_preprocess_helpfulness)

    if 'Category' in df.columns:
        _encoded = encode_categories(_concatted)
        _cleaned = clean_text_inplace(_encoded)
    else:
        _cleaned = clean_text_inplace(_concatted)
    return _cleaned
    

And now let's apply it on our train and test dataframes.

In [13]:
train_preprocessed = preprocess(train_dataframe)
test_preprocessed = preprocess(test_dataframe)

train_preprocessed.head()

Unnamed: 0,Helpfulness,Score,Text,Category
0,0.0,0.5,"[golden, valley, natur, buffalo, jerki, descri...",5
1,0.0,1.0,"[west, game, great, book, well, thought, easil...",0
2,0.0,1.0,"[west, game, first, year, teacher, teach, th, ...",0
3,0.0,1.0,"[west, game, got, book, bookfair, school, look...",0
4,0.5,1.0,"[spi, jigsaw, puzzl, pc, hi, martin, redman, c...",0


Now, let's split our original train dataset into train and val sets.

In [14]:
from sklearn.model_selection import train_test_split

ratio = 0.2
train, val = train_test_split(
    train_preprocessed, stratify=train_preprocessed['Category'], test_size=ratio, random_state=420
)
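To illustrate what `stratify` buys us, here is a rough plain-Python sketch of a stratified split: each class is shuffled and split separately, so both parts keep roughly the same class proportions. This is not how `sklearn` implements it; `stratified_split` is a hypothetical helper written only for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=420):
    """Return (train_indices, val_indices), splitting every class on its own."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class validation share
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


toy_labels = [0] * 10 + [1] * 5
toy_train, toy_val = stratified_split(toy_labels)
print(len(toy_train), len(toy_val))  # 12 3
```

With 10 samples of class `0` and 5 of class `1`, the validation part receives 2 and 1 samples respectively, which preserves the 2:1 class ratio.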

And now, for the best result, let's get rid of pandas so that nothing stops us from working with torchtext. For that, let's create an iterator that yields samples for us.

# Creating dataloaders

First, generate the vocab from the train set.

For that, use `torchtext.vocab.build_vocab_from_iterator`.

In [15]:
!pip install torchtext==0.13.0

Collecting torchtext==0.13.0
  Downloading torchtext-0.13.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.9 kB)
Collecting torch==1.12.0 (from torchtext==0.13.0)
  Downloading torch-1.12.0-cp310-cp310-manylinux1_x86_64.whl.metadata (22 kB)
Downloading torchtext-0.13.0-cp310-cp310-manylinux1_x86_64.whl (1.9 MB)
Downloading torch-1.12.0-cp310-cp310-manylinux1_x86_64.whl (776.3 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 2.4.0
    Uninstalling torch-2.4.0:
      Successfully uninstalled torch-2.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following de

In [16]:
from torchtext.vocab import build_vocab_from_iterator

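def yield_tokens(df):
    # Yield the token list of every row of the dataframe that was passed in
    # (the original body iterated over the global `train` and ignored `df`)
    for tokens in df['Text']:
        yield tokens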


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

vocab = build_vocab_from_iterator(yield_tokens(train), specials=special_symbols)
vocab.set_default_index(UNK_IDX)

Then use the vocab to encode a tokenized sequence.

In [17]:
sample = train['Text'][2]
print(sample)
encoded = vocab(sample)
print(encoded)

['west', 'game', 'first', 'year', 'teacher', 'teach', 'th', 'grade', 'special', 'read', 'class', 'high', 'comprehens', 'level', 'read', 'book', 'one', 'best', 'thing', 'taught', 'year', 'expand', 'mind', 'allow', 'put', 'charact', 'place', 'easi', 'student', 'make', 'mind', 'movi', 'even', 'use', 'whole', 'read', 'class', 'time', 'order', 'finish', 'book', 'student', 'wait', 'hear', 'end', 'excel', 'book', 'read', 'everi', 'year', 'student']
[2556, 43, 33, 14, 2751, 807, 860, 1724, 728, 131, 1895, 191, 6980, 583, 131, 515, 5, 59, 46, 3505, 14, 2954, 528, 450, 40, 1125, 165, 50, 1924, 22, 528, 945, 30, 4, 271, 131, 1895, 13, 68, 623, 515, 1924, 426, 600, 180, 311, 515, 131, 85, 14, 1924]
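To make the encoding step concrete, here is a plain-Python sketch of what a vocabulary lookup does conceptually: special symbols take the first indices, the remaining tokens are numbered by frequency, and anything unseen maps to the `<unk>` index. This is an illustration under simplifying assumptions, not torchtext's implementation; `build_toy_vocab` and `encode` are hypothetical names.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]
UNK = 0


def build_toy_vocab(token_lists, specials=SPECIALS):
    """Map tokens to indices: specials first, then by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Ties are broken alphabetically so the result is deterministic
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(list(specials) + ordered)}


def encode(tokens, stoi, unk_idx=UNK):
    """Replace each token by its index, falling back to the <unk> index."""
    return [stoi.get(token, unk_idx) for token in tokens]


toy_corpus = [["west", "game", "great", "book"], ["great", "puzzl", "book", "book"]]
stoi = build_toy_vocab(toy_corpus)
print(encode(["great", "book", "jerki"], stoi))  # [5, 4, 0]
```

The token `jerki` never occurs in the toy corpus, so it is encoded as `0`, which plays the same role as `vocab.set_default_index(UNK_IDX)`.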


Now we can define our collate function and create the dataloaders.

In [18]:
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(420)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def collate_batch(batch):
    label_list, text_list, score_list, helpfulness_list, lengths = [], [], [], [], []
    for _helpfulness, _score, _text, _label in batch:
        label_list.append(_label)
        processed_text = torch.tensor(vocab(_text), dtype=torch.int64)
        text_list.append(processed_text)
        score_list.append(_score)
        helpfulness_list.append(_helpfulness)
        lengths.append(processed_text.size(0))  # Store the length of each sequence

    # Pad the sequences to the longest one in the batch, using the <pad> index
    # (pad_sequence pads with 0 by default, which is the <unk> index here)
    padded_texts = pad_sequence(text_list, batch_first=True, padding_value=PAD_IDX)

    # pad_sequence pads up to the longest sequence, so every stored length
    # is already <= padded_texts.size(1) and needs no clamping

    label_list = torch.tensor(label_list, dtype=torch.int64)
    score_list = torch.tensor(score_list, dtype=torch.float64)
    helpfulness_list = torch.tensor(helpfulness_list, dtype=torch.float64)

    return label_list.to(device), padded_texts.to(device), torch.tensor(lengths).to(device), score_list.to(device), helpfulness_list.to(device)

train_dataloader = DataLoader(
    train.to_numpy(), batch_size=128, shuffle=True, collate_fn=collate_batch
)

val_dataloader = DataLoader(
    val.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)
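What the collate function does to the texts can be sketched without torch: every sequence is extended to the length of the longest one in the batch, and the true lengths are kept so that the model can later ignore the padding. `pad_batch` is a hypothetical helper for this illustration, and `PAD_IDX = 1` is assumed to match the position of `<pad>` in `special_symbols`.

```python
PAD_IDX = 1  # assumed to match the <pad> position in the specials list


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Pad lists of token ids to a common length and return the true lengths."""
    lengths = [len(sequence) for sequence in sequences]
    max_len = max(lengths, default=0)
    padded = [
        list(sequence) + [pad_idx] * (max_len - len(sequence))
        for sequence in sequences
    ]
    return padded, lengths


padded, lengths = pad_batch([[5, 4], [7], [8, 6, 5]])
print(padded)   # [[5, 4, 1], [7, 1, 1], [8, 6, 5]]
print(lengths)  # [2, 1, 3]
```

The `lengths` list corresponds to the third element returned by `collate_batch`, which the training loop unpacks as `offsets`.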

# Defining Network


To write the network you can use `torch.nn.Embedding` or `torch.nn.EmbeddingBag`. This allows your network to learn an embedding vector for each token.

As for the other modules in your network, consider these options:
* Simple linear layers, activations, and the other basic building blocks of a network
* You may skip the offsets (indices of sequences) in the batch format and use a predefined sequence length instead (the maximum length, some fixed value, etc.). If you choose this option, change the `collate_batch` function to match your architecture.
* You could use recurrent layers (RNN, GRU, LSTM, even a Transformer, it is all up to you), but remember about the dimensions and hidden states
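The offsets format mentioned in the list above is the one `torch.nn.EmbeddingBag` expects: all sequences of a batch are concatenated into one flat list of token ids, and the offsets are the start index of each sequence. Here is a plain-Python sketch of that layout; the helpers `to_flat_with_offsets` and `split_by_offsets` are hypothetical names for this example.

```python
def to_flat_with_offsets(sequences):
    """Concatenate sequences and record where each one starts."""
    flat, offsets, position = [], [], 0
    for sequence in sequences:
        offsets.append(position)
        flat.extend(sequence)
        position += len(sequence)
    return flat, offsets


def split_by_offsets(flat, offsets):
    """Inverse operation: cut the flat list back into sequences."""
    bounds = list(offsets) + [len(flat)]
    return [flat[start:end] for start, end in zip(bounds, bounds[1:])]


flat, offsets = to_flat_with_offsets([[5, 4], [7], [8, 6, 5]])
print(flat)     # [5, 4, 7, 8, 6, 5]
print(offsets)  # [0, 2, 3]
```

No padding is needed in this layout, which is why a model built on it does not need the padded tensor produced by `collate_batch`.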

In [19]:
import torch.nn.utils.rnn as rnn_utils
import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, num_classes, input_size):
        super(TextClassificationModel, self).__init__()
        # Embedding and GRU layers
        self.embed = nn.Embedding(input_size + 1, 200)
        self.gru = nn.GRU(200, 512, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.2)

        # Ensure the input to the linear layer matches the GRU's output
        self.linear = nn.Sequential(
            nn.Linear(512, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, num_classes)
        )

    def forward(self, text, offsets):
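        # `offsets` holds the true length of every sequence. Clamp instead of
        # filtering: pack_padded_sequence needs one length >= 1 per sequence,
        # and dropping entries would no longer match the batch dimension
        offsets = offsets.clamp(min=1)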
        x = self.embed(text)

        # Pack padded sequences to deal with varying lengths
        packed_input = rnn_utils.pack_padded_sequence(x, offsets.to('cpu'), batch_first=True, enforce_sorted=False)

        # Pass through the GRU
        packed_output, hidden = self.gru(packed_input)

        # hidden has shape [num_layers * num_directions, batch_size, hidden_size];
        # hidden[-1] is the final state of the backward direction, shape [batch_size, hidden_size]
        hidden = hidden[-1]

        # Apply dropout and pass through the linear layers
        x = self.drop(hidden)
        x = self.linear(x)
        
        return x


In [20]:
from tqdm.autonotebook import tqdm

def train_one_epoch(
    model,
    loader,
    optimizer,
    loss_fn,
    epoch_num=-1
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: train",
        leave=True,
    )
    model.train()
    train_loss = 0.0
    for i, batch in loop:
        labels, texts, offsets, scores, helpfulness = batch
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(texts, offsets)
        # loss calculation
        loss = loss_fn(outputs, labels.long())
        
        # backward pass
        loss.backward()

        # optimizer run
        optimizer.step()

        train_loss += loss.item()
        # loss.item() is already the mean over the batch, so average over batches
        loop.set_postfix({"loss": train_loss / i})

def val_one_epoch(
    model,
    loader,
    loss_fn,
    epoch_num=-1,
    best_so_far=0.0,
    ckpt_path='best.pt'
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: val",
        leave=True,
    )
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            labels, texts, offsets, scores, helpfulness = batch

            # forward pass
            outputs = model(texts, offsets)

            # loss calculation
            loss = loss_fn(outputs, labels.long()).item() 

            # prediction and accuracy calculation
            predicted = outputs.argmax(dim=1, keepdim=True)
            c = predicted.eq(labels.view_as(predicted)).sum().item()
            correct += c

            # increment total by the actual batch size
            t = len(labels)
            total += t

            val_loss += loss
            # val_loss sums per-batch mean losses, so average over batches
            loop.set_postfix({"loss": val_loss / i, "acc": c / t})

        if correct / total > best_so_far:
            torch.save(model.state_dict(), ckpt_path)
            return correct / total

    return best_so_far
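A note on reporting metrics in loops like these: `nn.CrossEntropyLoss` returns the mean loss over the batch by default, so a running average has to weight each batch by its size, especially because the last batch is usually smaller. Here is a small plain-Python sketch of such bookkeeping; the class `RunningMetrics` is hypothetical and is not used by the cells above.

```python
class RunningMetrics:
    """Accumulate a sample-weighted mean loss and accuracy across batches."""

    def __init__(self):
        self.loss_sum = 0.0
        self.correct = 0
        self.total = 0

    def update(self, batch_mean_loss, n_correct, batch_size):
        # Turn the batch mean back into a sum before accumulating
        self.loss_sum += batch_mean_loss * batch_size
        self.correct += n_correct
        self.total += batch_size

    @property
    def loss(self):
        return self.loss_sum / self.total if self.total else 0.0

    @property
    def accuracy(self):
        return self.correct / self.total if self.total else 0.0


metrics = RunningMetrics()
metrics.update(batch_mean_loss=1.0, n_correct=64, batch_size=128)
metrics.update(batch_mean_loss=0.5, n_correct=48, batch_size=64)
print(round(metrics.loss, 4), round(metrics.accuracy, 4))  # 0.8333 0.5833
```

Inside a loop it could be called as `metrics.update(loss.item(), c, len(labels))` and passed to `loop.set_postfix`.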



In [21]:
from torch.optim import SGD
epochs = 10
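# Derive the sizes from the data instead of hard-coding them, so the embedding
# table always covers every index the vocab can produce
model = TextClassificationModel(num_classes=len(cat2idx), input_size=len(vocab)).to(device)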
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

In [22]:
best = -float('inf')
for epoch in range(epochs):
    train_one_epoch(model, train_dataloader, optimizer, loss_fn, epoch_num=epoch)
    best = val_one_epoch(model, val_dataloader, loss_fn, epoch, best_so_far=best)

Epoch 0: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 0: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 1: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 1: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 2: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 2: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 3: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 3: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 4: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 4: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 5: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 5: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 6: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 6: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 7: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 7: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 8: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 8: val:   0%|          | 0/63 [00:00<?, ?it/s]

Epoch 9: train:   0%|          | 0/250 [00:00<?, ?it/s]

Epoch 9: val:   0%|          | 0/63 [00:00<?, ?it/s]

# Predictions

In [23]:
def collate_batch(batch):
    label_list, text_list, score_list, helpfulness_list, lengths = [], [], [], [], []
    for _helpfulness, _score, _text, _label in batch:
        label_list.append(_label)
        processed_text = torch.tensor(vocab(_text), dtype=torch.int64)
        text_list.append(processed_text)
        score_list.append(_score)
        helpfulness_list.append(_helpfulness)
        lengths.append(processed_text.size(0))  # Store the length of each sequence

    # Pad the sequences to the longest one in the batch, using the <pad> index
    # (pad_sequence pads with 0 by default, which is the <unk> index here)
    padded_texts = pad_sequence(text_list, batch_first=True, padding_value=PAD_IDX)

    # pad_sequence pads up to the longest sequence, so every stored length
    # is already <= padded_texts.size(1) and needs no clamping

    label_list = torch.tensor(label_list, dtype=torch.int64)
    score_list = torch.tensor(score_list, dtype=torch.float64)
    helpfulness_list = torch.tensor(helpfulness_list, dtype=torch.float64)

    return label_list.to(device), padded_texts.to(device), torch.tensor(lengths).to(device), score_list.to(device), helpfulness_list.to(device)

test_dataloader = DataLoader(
    test_preprocessed.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)


In [24]:
def predict(
    model,
    loader,
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc="Predictions:",
        leave=True,
    )
    predictions = []
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            labels, texts, offsets, scores, helpfulness = batch

            # Forward pass
            outputs = model(texts, offsets)
            
            # Get the predicted class (index of the largest logit)
            _predicted = outputs.argmax(dim=1)
            
            # Append predictions to list
            predictions += _predicted.detach().cpu().tolist()

    return predictions


In [25]:
# map_location lets a checkpoint saved on GPU load on a CPU-only machine too
ckpt = torch.load("best.pt", map_location=device)
model.load_state_dict(ckpt)

predictions = predict(model, test_dataloader)
predictions[:10]

Predictions::   0%|          | 0/16 [00:00<?, ?it/s]

[1, 1, 2, 1, 3, 1, 3, 2, 2, 4]

In [26]:
submission_df = test_dataframe.copy()

submission_df.drop(['Helpfulness','Score','Text'],axis=1,inplace=True)
submission_df['Category'] = [idx2cat[x] for x in predictions]

submission_df.to_csv('submission.csv', index=False)
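Before uploading, it can help to check that every test ID received exactly one prediction and that indices map back to category names. Here is a plain-Python sketch of that mapping using only the standard library. `submission_csv` is a hypothetical helper, and the two-column `ID,Category` layout is an assumption taken from the dataframe shown in this notebook; the competition page remains the authority on the format.

```python
import csv
import io

idx2cat = {
    0: "toys games",
    1: "health personal care",
    2: "beauty",
    3: "baby products",
    4: "pet supplies",
    5: "grocery gourmet food",
}


def submission_csv(ids, predictions):
    """Render ID/Category rows as CSV text, failing loudly on a length mismatch."""
    if len(ids) != len(predictions):
        raise ValueError("every ID needs exactly one prediction")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "Category"])
    for sample_id, class_index in zip(ids, predictions):
        writer.writerow([sample_id, idx2cat[class_index]])
    return buffer.getvalue()


print(submission_csv([4601, 2554, 6181], [1, 1, 2]))
```

The three IDs and class indices are the first rows of the predictions shown in this notebook, so the printed categories should match the first rows of `submission_df`.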

In [27]:
submission_df

Unnamed: 0,ID,Category
0,4601,health personal care
1,2554,health personal care
2,6181,beauty
3,4937,health personal care
4,2044,baby products
...,...,...
1995,8721,toys games
1996,4786,beauty
1997,3377,baby products
1998,1474,health personal care
