# Dataset Description
### What files do I need?
You will need train.csv, test.csv, sample_submission.csv and classes.csv.

### What am I predicting?
You are predicting the class of each product name in the context of the consumer price index (CPI).

### Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format
- classes.csv - the class ids and their names

### Columns
- id - a unique identifier for each product name
- name - name of each product
- target - in train.csv only; the class of the product name. Class details are in classes.csv
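
To read `target` values as class names, the labels can be joined against classes.csv. This is a minimal sketch, assuming that classes.csv has the `id` and `szpt` columns and that `target` refers to that `id`. The two small frames are made-up stand-ins for the real files, and `class_name` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-ins for train.csv and classes.csv (rows are illustrative only).
train = pd.DataFrame({
    "id": [0, 1, 2],
    "name": ["Соус для пасты Pomi базилик 400 г,", 'Хлеб "Жулдыз", 400 гр,', "Рис"],
    "target": [-1, 1, 4],
})
classes = pd.DataFrame({
    "id": [1, 4],
    "szpt": ["Хлеб пшеничный из муки первого сорта", "Рис"],
})

# Assumption: `target` in train.csv matches `id` in classes.csv.
labeled = train.merge(
    classes.rename(columns={"id": "target", "szpt": "class_name"}),
    on="target",
    how="left",  # keep rows with target == -1, which have no class
)
labeled["class_name"] = labeled["class_name"].fillna("NOT CLASSIFIED")
print(labeled[["name", "target", "class_name"]])
```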

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/natural-language-processing-with-e-commerce-data/sample_submission.csv
/kaggle/input/natural-language-processing-with-e-commerce-data/classes.csv
/kaggle/input/natural-language-processing-with-e-commerce-data/train.csv
/kaggle/input/natural-language-processing-with-e-commerce-data/test.csv


In [2]:
# Disable Weights & Biases (wandb) logging to avoid errors
os.environ["WANDB_DISABLED"] = "true"

# EDA

In [3]:
classes_df = pd.read_csv('/kaggle/input/natural-language-processing-with-e-commerce-data/classes.csv')
classes_df

Unnamed: 0,id,szpt
0,0,Рожки
1,1,Хлеб пшеничный из муки первого сорта
2,2,Творог
3,3,Крупа гречневая
4,4,Рис
5,5,Масло сливочное несоленое
6,6,Мука пшеничная первого сорта
7,7,Масло подсолнечное
8,8,"Кефир 2,5%"
9,9,"Молоко пастеризованное 2,5%"


In [4]:
train_df = pd.read_csv('/kaggle/input/natural-language-processing-with-e-commerce-data/train.csv')
print(train_df.shape[0])
train_df.head()

37360


Unnamed: 0,id,name,target
0,0,"Соус для пасты Pomi базилик 400 г,",-1
1,1,"Прокладки ежедневные гигиенические KOTЕХ Lux, ...",-1
2,2,"Хлеб ""Жулдыз"", 400 гр,",1
3,3,Паста для волос Syoss Моделирующая легкий конт...,-1
4,4,"Набор для завтрака детский ""Муми-Троли"", 3 пре...",-1


In [5]:
train_df[train_df['target']==11]

Unnamed: 0,id,name,target
399,399,Bruto Крафт Картошка жареная Томат и укроп 70 гр,11
2632,2632,КАРТОФЕЛЬ Эссель,11
2880,2880,Ов Картофель пакистан,11
3489,3489,"Картофель Павлодар вес,",11
3585,3585,картофель вес,11
3869,3869,Крaхмал Приправыч картофельный сорт экстра 200...,11
6332,6332,Картофель Южный,11
7089,7089,Bruto Крафт Картошка рифленная Розовый перец 7...,11
7325,7325,Ов Картофель,11
7647,7647,Картофель,11


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37360 entries, 0 to 37359
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      37360 non-null  int64 
 1   name    37359 non-null  object
 2   target  37360 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 875.8+ KB


In [7]:
train_df['name'] = train_df['name'].fillna('')
train_df['name'] = train_df['name'].astype(str)

In [8]:
test_df = pd.read_csv('/kaggle/input/natural-language-processing-with-e-commerce-data/test.csv')
print(test_df.shape[0])
test_df.head()

9341


Unnamed: 0,id,name
0,0,Корм Whiskas для котят рагу с ягненком 1-12 ме...
1,1,Напиток чайный TASSAY Ice tea зеленый Лимон пл...
2,2,Семечки Мартин отборные жареные с морской соль...
3,3,Дезодорант AXE Таинственное искушение Спрей ба...
4,4,Тампоны KOTEX Active Super Гигиенические карт/...


In [9]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9341 entries, 0 to 9340
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      9341 non-null   int64 
 1   name    9341 non-null   object
dtypes: int64(1), object(1)
memory usage: 146.1+ KB


In [10]:
df_submission = pd.read_csv('/kaggle/input/natural-language-processing-with-e-commerce-data/sample_submission.csv')
df_submission.head()

Unnamed: 0,id,target
0,0,9
1,1,14
2,2,2
3,3,13
4,4,5


### Conclusion
The training set has 37360 rows, and we have to make 9341 predictions. Not every product has a corresponding class; products without one are labeled NOT CLASSIFIED (-1).
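
How common the unclassified label is can be checked directly from the target column. This is a minimal sketch with made-up values standing in for `train_df['target']`; the variable names are illustrative.

```python
import pandas as pd

# Toy target column standing in for train_df["target"] (values are made up).
target = pd.Series([-1, -1, -1, 1, 11, -1, 4, -1], name="target")

counts = target.value_counts()              # rows per class, most frequent first
unclassified_share = (target == -1).mean()  # fraction of NOT CLASSIFIED rows

print(counts)
print(f"share of -1: {unclassified_share:.2%}")
```

A strongly imbalanced distribution would be a reason to look at per-class metrics rather than the loss alone.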

# Define and train the model

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

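# Encode targets as 0..n_classes-1; the unclassified label -1 becomes class 0
label_encoder = LabelEncoder()
train_df['target'] = label_encoder.fit_transform(train_df['target'])

# Hold out 10% of the training data for validation
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)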

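# Tokenize the product names; padding=True pads to the longest sequence in each call
# Note: bert-base-uncased is an English checkpoint, while most product names are Russian
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_df['name'].tolist(), truncation=True, padding=True)
val_encodings = tokenizer(val_df['name'].tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_df['name'].tolist(), truncation=True, padding=True)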

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TextDataset(train_encodings, train_df['target'].tolist())
val_dataset = TextDataset(val_encodings, val_df['target'].tolist())

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()
results = trainer.evaluate()
print(results)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Epoch,Training Loss,Validation Loss
1,0.0174,0.08559
2,0.0013,0.042518
3,0.0021,0.035846


{'eval_loss': 0.035846248269081116, 'eval_runtime': 13.8166, 'eval_samples_per_second': 270.399, 'eval_steps_per_second': 4.27, 'epoch': 3.0}
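
The evaluation output above reports only the loss. To also get a classification metric on the validation split, `Trainer` accepts a `compute_metrics` callable. This is a minimal sketch, assuming accuracy is an acceptable metric here (the competition metric is not stated in this notebook); the logits and labels are made up.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair as passed by Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # most likely class per sample
    return {"accuracy": float((preds == labels).mean())}

# Made-up logits for 4 samples and 3 classes, to show the call shape.
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.2, 0.1],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 1, 2, 2])
metrics = compute_metrics((logits, labels))
print(metrics)
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` also reports the value under the key `eval_accuracy`.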


# Make predictions and create submission file

In [12]:
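# TextDataset requires labels, so pass dummy zeros; they do not affect the predictions
test_dataset = TextDataset(test_encodings, [0] * len(test_encodings['input_ids']))
predictions = trainer.predict(test_dataset)
# Pick the class with the highest logit for each product name
predicted_labels = np.argmax(predictions.predictions, axis=-1)

# Map encoded classes back to the original target values (including -1)
predicted_labels = label_encoder.inverse_transform(predicted_labels)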

submission_df = pd.DataFrame({'id': test_df['id'], 'target': predicted_labels})
submission_df.to_csv('submission.csv', index=False)
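
Before uploading, the submission can be compared against the layout of sample_submission.csv. This is a minimal sketch, assuming the sample file lists the test ids in the same order as test.csv; `check_submission` is a hypothetical helper, and the two frames are toy stand-ins for the real files.

```python
import pandas as pd

def check_submission(submission, sample):
    """Raise AssertionError if `submission` does not match the layout of `sample`."""
    assert list(submission.columns) == list(sample.columns), "column names differ"
    assert len(submission) == len(sample), "row count differs"
    assert submission["id"].tolist() == sample["id"].tolist(), "ids differ or are reordered"
    assert submission["target"].notna().all(), "missing predictions"

# Toy frames standing in for sample_submission.csv and submission.csv.
sample = pd.DataFrame({"id": [0, 1, 2], "target": [9, 14, 2]})
submission = pd.DataFrame({"id": [0, 1, 2], "target": [-1, 1, -1]})

check_submission(submission, sample)
print("submission layout OK")
```

With the real data, the call would be `check_submission(submission_df, df_submission)`.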