# Fake Review Detection with ParsBERT on fake-review-dataset by Joni Salminen


## BERT Overview

BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer (in many cases) to create state-of-the-art models. It can be used for a wide range of NLP tasks, such as question answering and language inference, without substantial task-specific architecture modifications.


Natural Language Processing (NLP) tasks include sentence-level and token-level tasks:

- **Sentence-Level:** Tasks such as Natural Language Inference (NLI) aim to predict the relationships between sentences by analyzing them holistically.
- **Token-Level:** In tasks such as Named Entity Recognition (NER) and Question Answering (QA), the model makes predictions at the level of individual tokens.

There are two primary strategies for applying pre-trained language representations to downstream NLP tasks:

- Feature-based: Uses task-specific architectures that include the pre-trained representations as additional features, as with Word2vec or ELMo.
- Fine-tuning: Introduces minimal task-specific parameters and trains on the downstream task by simply fine-tuning the pre-trained parameters, as with GPT.
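
The practical difference between the two strategies is which parameters receive updates during downstream training. The bookkeeping sketch below is only an illustration: `Param`, `trainable_count`, and the encoder size are hypothetical, while the head size follows from a 768-dimensional hidden state and two output labels.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Param:
    name: str
    size: int
    trainable: bool = True


def trainable_count(params):
    """Number of scalar weights that an optimizer would update."""
    return sum(p.size for p in params if p.trainable)


HIDDEN, NUM_LABELS = 768, 2
encoder = Param("encoder", 1_000_000)  # placeholder size, not a real model's count
head = Param("classifier", HIDDEN * NUM_LABELS + NUM_LABELS)  # weights + biases

# Feature-based: the pre-trained encoder is frozen, only the new head is trained.
feature_based = [replace(encoder, trainable=False), head]
# Fine-tuning: encoder and head are both updated on the downstream task.
fine_tuning = [encoder, head]

print(trainable_count(feature_based))
print(trainable_count(fine_tuning))
```

In the feature-based setup only the small classifier is learned, whereas fine-tuning also adapts every encoder weight to the task.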

Before going further into the code, let me introduce **ParsBERT**.

## ParsBERT

ParsBERT is a monolingual language model based on Google's BERT architecture. It is pre-trained on large Persian corpora with various writing styles and numerous subjects (e.g., scientific, novels, news, ...), comprising more than **3.9M** documents, **73M** sentences, and **1.3B** words. For more information about ParsBERT, please check out the article: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

In [None]:
!nvidia-smi

Mon Aug 29 18:10:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install required packages
!pip install -q pyyaml==5.4.1

!pip install -q transformers
!pip install -q hazm
!pip install -q clean-text[gpl]



In [None]:
# Import required packages

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.utils import shuffle

import hazm
from cleantext import clean

import plotly.express as px
import plotly.graph_objects as go

from tqdm.notebook import tqdm

import os
import re
import json
import copy
import collections

## Dataset

The [fake-review-dataset](https://osf.io/3vds7) by Joni Salminen is in English, so we translated it to Persian using Google Translate.

Let's look at the dataset and obtain some intuitions about the data, distribution, and any further operation regarding this particular case.

In [None]:
!gdown "https://drive.google.com/u/0/uc?id=1ZKYXua5UALFjogm6mtR3u1dqd3JHMGHF&export=download"

Downloading...
From: https://drive.google.com/u/0/uc?id=1ZKYXua5UALFjogm6mtR3u1dqd3JHMGHF
To: /content/merged_fake_dataset.csv
100% 23.8M/23.8M [00:00<00:00, 82.5MB/s]


### Load the data using Pandas

In [None]:
data = pd.read_csv('merged_fake_dataset.csv', encoding='utf-8')
data.columns = ['comment','rate']

### Fixing Conflicts

The dataset can have structural problems, which the cells below check for.

For simplicity, we remove rows whose `rate` value is missing. We also remove rows with a missing comment and rows with a duplicated comment.
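
The clean-up described here amounts to three filters. A plain-Python sketch on hypothetical toy rows may make the order of operations clearer (the notebook itself does this with pandas; `drop_bad_rows` is an illustrative name):

```python
rows = [
    ("great product", "OR"),
    ("great product", "OR"),   # duplicated comment
    (None, "CG"),              # missing comment
    ("works fine", None),      # missing rate
    ("love it", "CG"),
]


def drop_bad_rows(rows):
    """Drop rows with a missing field, then keep the first row per comment."""
    seen = set()
    kept = []
    for comment, rate in rows:
        if comment is None or rate is None:
            continue
        if comment in seen:
            continue
        seen.add(comment)
        kept.append((comment, rate))
    return kept


print(drop_bad_rows(rows))
```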

In [None]:
# print data information
print('data information')
print(data.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['rate'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40432 entries, 0 to 40431
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  40432 non-null  object
 1   rate     40432 non-null  object
dtypes: object(2)
memory usage: 631.9+ KB
None 

missing values stats
comment    0
rate       0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 



In [None]:
# handle conflicts in the dataset structure
# a more careful solution is possible; for the sake of simplicity
# we just remove the problematic rows!

data = data.dropna(subset=['rate'])
data = data.dropna(subset=['comment'])
data = data.drop_duplicates(subset=['comment'], keep='first')
data = data.reset_index(drop=True)

# previous information after solving the conflicts

# print data information
print('data information')
print(data.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['rate'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40401 entries, 0 to 40400
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  40401 non-null  object
 1   rate     40401 non-null  object
dtypes: object(2)
memory usage: 631.4+ KB
None 

missing values stats
comment    0
rate       0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 



### Normalization / Preprocessing

Comments vary in length, measured in words. Finding the range that covers most comments helps us choose the maximum sequence length for the preprocessing step. We also assume that a comment must be longer than 3 words to form a meaningful phrase for our learning process.
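
One way to look for such a range is to check a high percentile of the word counts. The sketch below is an assumption-laden illustration: it splits on whitespace instead of using hazm, the sample comments are made up, and `length_percentile` is a hypothetical helper.

```python
import statistics


def length_percentile(lengths, pct):
    """Return the pct-th percentile (1..99) of a list of word counts."""
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return cut_points[pct - 1]


# Hypothetical comments, tokenized by whitespace rather than hazm.
comments = ["a b c", "a b c d e", "a b", "a b c d e f g h", "a b c d"]
lengths = [len(c.split()) for c in comments]

print(lengths)
print(length_percentile(lengths, 95))
```

A maximum length near a high percentile keeps most comments intact while bounding the cost of the longest ones.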

In [None]:
# calculate the length of comments based on their words
data['comment_len_by_words'] = data['comment'].apply(lambda t: len(hazm.word_tokenize(t)))

In [None]:
min_max_len = data["comment_len_by_words"].min(), data["comment_len_by_words"].max()
print(f'Min: {min_max_len[0]} \tMax: {min_max_len[1]}')

Min: 2 	Max: 522


In [None]:
def data_gl_than(data, less_than=100.0, greater_than=0.0, col='comment_len_by_words'):
    data_length = data[col].values

    data_glt = sum([1 for length in data_length if greater_than < length <= less_than])

    data_glt_rate = (data_glt / len(data_length)) * 100

    print(f'Texts with word length of greater than {greater_than} and less than {less_than} includes {data_glt_rate:.2f}% of the whole!')

In [None]:
data_gl_than(data, 256, 3)

Texts with word length of greater than 3 and less than 256 includes 94.65% of the whole!


In [None]:
minlim, maxlim = 3, 256

In [None]:
# keep only comments with more than minlim and at most maxlim words
data['comment_len_by_words'] = data['comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
data = data.dropna(subset=['comment_len_by_words'])
data = data.reset_index(drop=True)

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=data['comment_len_by_words']
))

fig.update_layout(
    title_text='Distribution of word counts within comments',
    xaxis_title_text='Word Count',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
unique_rates = list(sorted(data['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #2: ['CG', 'OR']


In [None]:
fig = go.Figure()

groupby_rate = data.groupby('rate')['rate'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_rate.index)),
    y=groupby_rate.tolist(),
    text=groupby_rate.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of rate within comments',
    xaxis_title_text='Rate',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

We have two labels: `CG` and `OR`.

`OR` marks an original (real) review and `CG` marks a computer-generated (fake) review.

In [None]:
def rate_to_label_fake_review(rate):
    if rate == 'CG':
        return 'CG'
    elif rate == 'OR':
        return 'OR'

data['label'] = data['rate'].apply(lambda t: rate_to_label_fake_review(t))
labels = list(sorted(data['label'].unique()))
data.head()

Unnamed: 0,comment,rate,comment_len_by_words,label
0,این را دوست دارم خوش ساخت، محکم و بسیار راحت. ...,CG,20.0,CG
1,آن را دوست دارم، یک ارتقای عالی نسبت به نسخه ا...,CG,22.0,CG
2,این بالش کمرم را نجات داد. من ظاهر و احساس این...,CG,17.0,CG
3,اطلاعاتی در مورد نحوه استفاده از آن وجود ندارد...,CG,19.0,CG
4,ست بسیار زیبا کیفیت خوب. دو ماه است که این مجم...,CG,16.0,CG


Cleaning is the final step in this section. Our cleaning method includes these steps:

- fixing Unicode issues
- removing special items such as phone numbers, emails, URLs, and line breaks
- stripping HTML tags
- normalizing
- removing emojis

We also use the [Hazm](https://github.com/sobhe/hazm) library.

Using `InformalNormalizer` instead of `Normalizer` may yield a better model, but `InformalNormalizer` takes much longer to run.

To use `InformalNormalizer`, the `norm_words` function must be applied to its output as well.
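
Two of the steps listed above, stripping HTML tags and collapsing whitespace, can be illustrated in isolation with the standard library. This is a minimal sketch on a made-up string, and `strip_tags_and_spaces` is a hypothetical name:

```python
import re

TAG_RE = re.compile(r"<.*?>")    # non-greedy, so each tag is matched separately
SPACE_RE = re.compile(r"\s+")    # any run of spaces, tabs, or line breaks


def strip_tags_and_spaces(text):
    """Remove HTML tags, then collapse whitespace runs to a single space."""
    text = TAG_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


print(strip_tags_and_spaces("<p>Nice   <b>pillow</b></p>\n\nworth it"))
```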

In [None]:
def norm_words(words):
    string = ''
    for i in range(len(words[0]) - 1):
        string = string + words[0][i][0] + ' '
    string = string + words[0][len(words[0]) - 1][0]

    return string

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext


def cleaning(text):
    text = text.strip()
    
    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    # cleaning htmls
    text = cleanhtml(text)
    
    # normalizing
    #normalizer = hazm.InformalNormalizer(seperation_flag=True)
    normalizer = hazm.Normalizer()
    text = normalizer.normalize(text)

    #text = norm_words(text)
    
    # removing weird patterns
    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        # u"\u200c"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)
    
    text = wierd_pattern.sub(r'', text)
    
    # removing hashtags and extra spaces
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)
    
    return text

In [None]:
# cleaning comments
data['cleaned_comment'] = data['comment'].apply(cleaning)


# calculate the length of comments based on their words
data['cleaned_comment_len_by_words'] = data['cleaned_comment'].apply(lambda t: len(hazm.word_tokenize(t)))

# drop comments whose cleaned length falls outside (minlim, maxlim]
data['cleaned_comment_len_by_words'] = data['cleaned_comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
data = data.dropna(subset=['cleaned_comment_len_by_words'])
data = data.reset_index(drop=True)

data.head()

Unnamed: 0,comment,rate,comment_len_by_words,label,cleaned_comment,cleaned_comment_len_by_words
0,این را دوست دارم خوش ساخت، محکم و بسیار راحت. ...,CG,20.0,CG,این را دوست دارم خوش ساخت، محکم و بسیار راحت. ...,20
1,آن را دوست دارم، یک ارتقای عالی نسبت به نسخه ا...,CG,22.0,CG,آن را دوست دارم، یک ارتقای عالی نسبت به نسخه ا...,22
2,این بالش کمرم را نجات داد. من ظاهر و احساس این...,CG,17.0,CG,این بالش کمرم را نجات داد. من ظاهر و احساس این...,17
3,اطلاعاتی در مورد نحوه استفاده از آن وجود ندارد...,CG,19.0,CG,اطلاعاتی در مورد نحوه استفاده از آن وجود ندارد...,19
4,ست بسیار زیبا کیفیت خوب. دو ماه است که این مجم...,CG,16.0,CG,ست بسیار زیبا کیفیت خوب. دو ماه است که این مجم...,16


In [None]:
data = data[['cleaned_comment', 'label']]
data.columns = ['comment', 'label']
data.head()

Unnamed: 0,comment,label
0,این را دوست دارم خوش ساخت، محکم و بسیار راحت. ...,CG
1,آن را دوست دارم، یک ارتقای عالی نسبت به نسخه ا...,CG
2,این بالش کمرم را نجات داد. من ظاهر و احساس این...,CG
3,اطلاعاتی در مورد نحوه استفاده از آن وجود ندارد...,CG
4,ست بسیار زیبا کیفیت خوب. دو ماه است که این مجم...,CG


In [None]:
print(f'We have #{len(labels)} labels: {labels}')

We have #2 labels: ['CG', 'OR']


### Handling Unbalanced Data

In [None]:
fig = go.Figure()

groupby_label = data.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

**Undersampling**

Again, to keep things simple, we randomly cut the dataset down to the size of the smaller class, the OR class.
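
The idea of random undersampling can be sketched without pandas. The rows and the `undersample` helper below are hypothetical, and the fixed seed is only there to make the sketch repeatable:

```python
import random
from collections import Counter, defaultdict


def undersample(rows, seed=0):
    """Randomly trim every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for comment, label in rows:
        by_label[label].append((comment, label))

    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)  # mix the classes so they are not stored in blocks
    return balanced


rows = [(f"cg {i}", "CG") for i in range(7)] + [(f"or {i}", "OR") for i in range(4)]
balanced = undersample(rows)
print(Counter(label for _, label in balanced))
```

Undersampling discards data from the larger class, which is acceptable here because the two classes are already close in size.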

In [None]:
negative_data = data[data['label'] == 'CG']
positive_data = data[data['label'] == 'OR']

cutting_point = min(len(negative_data), len(positive_data))

if cutting_point <= len(negative_data):
    negative_data = negative_data.sample(n=cutting_point).reset_index(drop=True)

if cutting_point <= len(positive_data):
    positive_data = positive_data.sample(n=cutting_point).reset_index(drop=True)

new_data = pd.concat([negative_data, positive_data])
new_data = new_data.sample(frac=1).reset_index(drop=True)
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37282 entries, 0 to 37281
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  37282 non-null  object
 1   label    37282 non-null  object
dtypes: object(2)
memory usage: 582.7+ KB


In [None]:
fig = go.Figure()

groupby_label = new_data.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [NEW DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

## Train, Validation, Test Split

To obtain a model that generalizes well, we split the cleaned dataset into train, validation, and test sets. We use a rate of **0.1** for both the *valid* and *test* splits; because the test split is taken from the data that remains after the validation split, it amounts to roughly 9% of the whole. For splitting, we use `train_test_split` from scikit-learn, stratifying on the label to preserve the class balance.
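
A quick arithmetic check of the two-stage split, assuming each hold-out share is rounded up to a whole row (an assumption about the splitter's rounding that is consistent with the shapes printed below). `two_stage_split` is a hypothetical helper, and 37282 is the row count reported for the balanced data:

```python
import math


def two_stage_split(n_total, valid_frac=0.1, test_frac=0.1):
    """Sizes of train/valid/test when test is cut from what remains after valid."""
    n_valid = math.ceil(n_total * valid_frac)
    remaining = n_total - n_valid
    n_test = math.ceil(remaining * test_frac)
    n_train = remaining - n_test
    return n_train, n_valid, n_test


n_train, n_valid, n_test = two_stage_split(37282)
print(n_train, n_valid, n_test)
print(round(n_train / 37282, 3), round(n_valid / 37282, 3), round(n_test / 37282, 3))
```

The second line prints the effective shares, which makes the roughly 81/10/9 split explicit.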

In [None]:
new_data['label_id'] = new_data['label'].apply(lambda t: labels.index(t))

train, valid = train_test_split(new_data, test_size=0.1, random_state=1, stratify=new_data['label'])
train, test = train_test_split(train, test_size=0.1, random_state=1, stratify=train['label'])

train = train.reset_index(drop=True)
valid = valid.reset_index(drop=True)
test = test.reset_index(drop=True)

x_train, y_train = train['comment'].values.tolist(), train['label_id'].values.tolist()
x_valid, y_valid = valid['comment'].values.tolist(), valid['label_id'].values.tolist()
x_test, y_test = test['comment'].values.tolist(), test['label_id'].values.tolist()

print(train.shape)
print(valid.shape)
print(test.shape)

(30197, 3)
(3729, 3)
(3356, 3)


### Saving train, valid and test sets to drive

In [None]:
train.to_csv('train.csv')
valid.to_csv('valid.csv')
test.to_csv('test.csv')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import shutil

shutil.copy("train.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/train.csv")
shutil.copy("valid.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/valid.csv")
shutil.copy("test.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/test.csv")

'/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/test.csv'

## Implement model with PyTorch

We implement the model using *PyTorch*.

![BERT INPUTS](https://res.cloudinary.com/m3hrdadfi/image/upload/v1595158991/kaggle/bert_inputs_w8rith.png)

The BERT input representation is the sum of three embeddings:
- Token embeddings: vectors looked up from the WordPiece token vocabulary (WordPiece is a subword segmentation algorithm, similar to BPE).
- Segment embeddings: for sentence pairs [A-B], $E_A$ or $E_B$ marks whether a token belongs to the first sentence or the second.
- Position embeddings: encode the position of each token in the sequence.
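
The three lookups and their sum can be sketched with toy tables in plain Python. Everything here is an assumption for illustration: the table sizes, the 4-dimensional vectors, and the `embed` helper are hypothetical, and the normalization and dropout that a full BERT implementation applies afterwards are left out.

```python
import random

rng = random.Random(0)
DIM = 4  # toy size; the configuration printed later uses hidden_size 768


def table(rows):
    """Create a random lookup table with one DIM-sized vector per row."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(rows)]


token_emb = table(10)     # one row per vocabulary id (toy vocabulary of 10)
segment_emb = table(2)    # sentence A / sentence B
position_emb = table(8)   # one row per position (toy maximum length of 8)


def embed(input_ids, token_type_ids):
    """Sum token, segment, and position vectors for each input position."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids))
    ]


vectors = embed(input_ids=[2, 7, 5, 4], token_type_ids=[0, 0, 0, 0])
print(len(vectors), len(vectors[0]))
```

Each input position ends up with a single vector, so the same token id gets a different representation at a different position or in a different segment.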

In [None]:
from transformers import BertConfig, BertTokenizer
from transformers import BertModel

from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

import torch
import torch.nn as nn
import torch.nn.functional as F

### Configuration

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'device: {device}')

train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

device: cuda:0
CUDA is available!  Training on GPU ...


In [None]:
# general config
MAX_LEN = 128
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16

EPOCHS = 10
EEVERY_EPOCH = 1000
LEARNING_RATE = 2e-5
CLIP = 0.0

MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-base-uncased'
OUTPUT_PATH = '/content/bert-fa-base-uncased-fake-review-detection/pytorch_model.bin'

os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

In [None]:
labels = ['CG', 'OR']
print(f'We have #{len(labels)} labels: {labels}')

We have #2 labels: ['CG', 'OR']


In [None]:
# build lookup tables from label to id and from id to label

label2id = {label: i for i, label in enumerate(labels)}
id2label = {v: k for k, v in label2id.items()}

print(f'label2id: {label2id}')
print(f'id2label: {id2label}')

label2id: {'CG': 0, 'OR': 1}
id2label: {0: 'CG', 1: 'OR'}


In [None]:
# setup the tokenizer and configuration

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
config = BertConfig.from_pretrained(
    MODEL_NAME_OR_PATH, **{
        'label2id': label2id,
        'id2label': id2label,
    })

print(config.to_json_string())

Downloading:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440 [00:00<?, ?B/s]

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "CG",
    "1": "OR"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "CG": 0,
    "OR": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 100000
}



### Input Embeddings

In [None]:
idx = np.random.randint(0, len(train))
sample_comment = train.iloc[idx]['comment']
sample_label = train.iloc[idx]['label']

print(f'Sample: \n{sample_comment}\n{sample_label}')

Sample: 
سگ‌های من (و هر سگ دیگری) آنها را دوست دارند. من طعم‌های دیگر را امتحان کردم، اما این طعم مورد علاقه است. پلاستیک جامد است و آسان است
CG


In [None]:
tokens = tokenizer.tokenize(sample_comment)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f'  Comment: {sample_comment}')
print(f'   Tokens: {tokenizer.convert_tokens_to_string(tokens)}')
print(f'Token IDs: {token_ids}')

  Comment: سگ‌های من (و هر سگ دیگری) آنها را دوست دارند. من طعم‌های دیگر را امتحان کردم، اما این طعم مورد علاقه است. پلاستیک جامد است و آسان است
   Tokens: سگهای من ( و هر سگ دیگری ) انها را دوست دارند . من طعمهای دیگر را امتحان کردم ، اما این طعم مورد علاقه است . پلاستیک جامد است و اسان است
Token IDs: [20093, 2842, 1006, 1379, 2937, 8267, 3574, 1007, 2950, 2803, 4219, 3188, 1012, 2842, 28452, 2972, 2803, 7216, 5501, 1348, 2949, 2802, 7773, 3050, 5351, 2806, 1012, 10980, 9957, 2806, 1379, 6699, 2806]


In [None]:
encoding = tokenizer.encode_plus(
    sample_comment,
    max_length=32,
    truncation=True,
    add_special_tokens=True, # Add '[CLS]' and '[SEP]'
    return_token_type_ids=True,
    return_attention_mask=True,
    padding='max_length',
    return_tensors='pt',  # Return PyTorch tensors
)

print(f'Keys: {encoding.keys()}\n')
for k in encoding.keys():
    print(f'{k}:\n{encoding[k]}')

Keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

input_ids:
tensor([[    2, 20093,  2842,  1006,  1379,  2937,  8267,  3574,  1007,  2950,
          2803,  4219,  3188,  1012,  2842, 28452,  2972,  2803,  7216,  5501,
          1348,  2949,  2802,  7773,  3050,  5351,  2806,  1012, 10980,  9957,
          2806,     4]])
token_type_ids:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]])
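
To see what truncation, special tokens, padding, and the attention mask do to a list of token ids, here is a plain-Python sketch. It is not the tokenizer's implementation: `build_inputs` is hypothetical, the `[CLS]` and `[SEP]` ids are read off the first and last `input_ids` printed above, and the padding id comes from `pad_token_id` in the printed configuration.

```python
CLS_ID, SEP_ID, PAD_ID = 2, 4, 0  # inferred from the outputs above, not queried


def build_inputs(token_ids, max_length):
    """Truncate, wrap in [CLS]/[SEP], pad, and build the attention mask."""
    body = token_ids[: max_length - 2]          # leave room for two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token

    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding             # 0 = padding, ignored by attention

    token_type_ids = [0] * max_length           # single sentence: all segment A
    return input_ids, attention_mask, token_type_ids


ids, mask, types = build_inputs([20093, 2842, 1006], max_length=6)
print(ids)
print(mask)
```

In the sample above the comment has 33 tokens and `max_length` is 32, so the sequence is truncated and no padding appears, which is why the attention mask is all ones.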


### Dataset

In [None]:
class Dataset(torch.utils.data.Dataset):
    """ Create a PyTorch dataset. """

    def __init__(self, tokenizer, comments, targets=None, label_list=None, max_len=128):
        self.comments = comments
        self.targets = targets
        self.has_target = isinstance(targets, list) or isinstance(targets, np.ndarray)

        self.tokenizer = tokenizer
        self.max_len = max_len

        
        self.label_map = {label: i for i, label in enumerate(label_list)} if isinstance(label_list, list) else {}
    
    def __len__(self):
        return len(self.comments)

    def __getitem__(self, item):
        comment = str(self.comments[item])

        if self.has_target:
            target = self.label_map.get(str(self.targets[item]), str(self.targets[item]))

        encoding = self.tokenizer.encode_plus(
            comment,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt')
        
        inputs = {
            'comment': comment,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
        }

        if self.has_target:
            inputs['targets'] = torch.tensor(target, dtype=torch.long)
        
        return inputs


def create_data_loader(x, y, tokenizer, max_len, batch_size, label_list):
    dataset = Dataset(
        comments=x,
        targets=y,
        tokenizer=tokenizer,
        max_len=max_len, 
        label_list=label_list)
    
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size)

In [None]:
label_list = ['CG', 'OR']
train_data_loader = create_data_loader(train['comment'].to_numpy(), train['label'].to_numpy(), tokenizer, MAX_LEN, TRAIN_BATCH_SIZE, label_list)
valid_data_loader = create_data_loader(valid['comment'].to_numpy(), valid['label'].to_numpy(), tokenizer, MAX_LEN, VALID_BATCH_SIZE, label_list)
test_data_loader = create_data_loader(test['comment'].to_numpy(), None, tokenizer, MAX_LEN, TEST_BATCH_SIZE, label_list)

In [None]:
sample_data = next(iter(train_data_loader))

print(sample_data.keys())

print(sample_data['comment'])
print(sample_data['input_ids'].shape)
print(sample_data['input_ids'][0, :])
print(sample_data['attention_mask'].shape)
print(sample_data['attention_mask'][0, :])
print(sample_data['token_type_ids'].shape)
print(sample_data['token_type_ids'][0, :])
print(sample_data['targets'].shape)
print(sample_data['targets'][0])

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids', 'targets'])
['این یک خواندن شیرین و تند بود. این یک خواندن سریع بود که ربطی به داستان نداشت. من دوست داشتم که شخصیت\u200cها هم قوی و هم مصمم بودند. این یک کتاب خوب برای خواندن در مدت زمان کوتاه بود. من این کتاب را برای خواهرزاده\u200cام و', 'این محصول بسیار خوبی است تا زمانی که سگ شما خوش اخلاق باشد. ضرب و شتم فوق العاده سختی نخواهد داشت، بنابراین سگ\u200cهای خود را آموزش دهید و چند روز آنها را تماشا کنید تا مطمئن شوید که در تلاش برای کشف آن نیستند. اوه، فکر می\u200cکنم می\u200cدانی به کجا می\u200cروم. سگ من که «می داند» چگونه از درب سگ استفاده کند می\u200cخواست در خانه باشد اما لغزنده در پشت صفحه بسته بود. او برای ورود به داخل آنقدر تلاش کرد که در سگ را از روی صفحه بیرون آورد. خوشبختانه من توانستم در را دوباره داخل کنم، اما حدود ۳۰ ثانیه دیگر و همه آن سطل زباله می\u200cشد.', 'من چیزی جز خوش شانسی با گارمین نداشتم. من گارمین خود را در این یکی نگه خواهم داشت. من این را برای یکی از دوستان خریدم و او از خرید آن بسیار ر

In [None]:
sample_test = next(iter(test_data_loader))
print(sample_test.keys())

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids'])


### Model

While implementing the model, you may sometimes run into the error shown below, which says that all of the CUDA memory has been used. There are many ways to solve this, but the simplest is to clear the CUDA cache.

![Cuda-Error](https://res.cloudinary.com/m3hrdadfi/image/upload/v1599979552/kaggle/cuda-error_iyqh4o.png)


**Simple Solution**
```python
import torch, gc

gc.collect()
torch.cuda.empty_cache()

!nvidia-smi
```

In [None]:
class FakeReviewDetectionModel(nn.Module):

    def __init__(self, config):
        super(FakeReviewDetectionModel, self).__init__()

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
    
    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled_output = self.bert(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids,
            return_dict=False)
        
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits 

In [None]:
import torch, gc

# drop the reference to any previous model first so its memory can be reclaimed
pt_model = None
gc.collect()
torch.cuda.empty_cache()

!nvidia-smi

Sun Jul 17 14:06:43 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    27W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
pt_model = FakeReviewDetectionModel(config=config)
pt_model = pt_model.to(device)

print('pt_model', type(pt_model))

In [None]:
# sample data output

sample_data_comment = sample_data['comment']
sample_data_input_ids = sample_data['input_ids']
sample_data_attention_mask = sample_data['attention_mask']
sample_data_token_type_ids = sample_data['token_type_ids']
sample_data_targets = sample_data['targets']

# available for using in GPU
sample_data_input_ids = sample_data_input_ids.to(device)
sample_data_attention_mask = sample_data_attention_mask.to(device)
sample_data_token_type_ids = sample_data_token_type_ids.to(device)
sample_data_targets = sample_data_targets.to(device)


# outputs = F.softmax(
#     pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids), 
#     dim=1)

outputs = pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids)
_, preds = torch.max(outputs, dim=1)

print(outputs[:5, :])
print(preds[:5])

tensor([[ 0.3520, -0.6875],
        [ 0.4874, -0.4267],
        [ 0.2629, -0.2463],
        [ 0.1488, -0.4677],
        [ 0.1473, -0.6805]], device='cuda:0', grad_fn=<SliceBackward0>)
tensor([0, 0, 0, 0, 0], device='cuda:0')
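
The model returns raw logits, not probabilities. If probabilities are needed, a softmax over each row does the conversion; the sketch below applies a hand-written `softmax` (a hypothetical helper, standing in for `F.softmax`) to the first row of logits printed above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.3520, -0.6875])  # first row of the logits printed above
print([round(p, 3) for p in probs])
```

Since the classifier head has not been trained yet at this point, these values only show the mechanics; the predicted class is the same whether `argmax` is taken over logits or probabilities.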


### Training

This section covers the training and optimization process.

During optimization, we save the best model to Google Drive.

The best model is the one with the minimum validation loss seen during optimization.
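
The "keep the best model" rule can be sketched independently of PyTorch. In this minimal illustration the loss values are made up and `track_best` is a hypothetical helper; in the real loop, the model would be saved at each point where a new minimum is recorded.

```python
import math


def track_best(eval_losses):
    """Return (evaluation index, loss) pairs where a new minimum loss occurs."""
    best = math.inf
    checkpoints = []
    for step, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            checkpoints.append((step, loss))  # the model would be saved here
    return checkpoints


print(track_best([0.62, 0.48, 0.51, 0.44, 0.47]))
```

Only improvements trigger a save, so the file on disk always corresponds to the lowest validation loss observed so far.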

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset

finetuned_parsbert_fake_reveiws_dataset.pt


In [None]:
def simple_accuracy(y_true, y_pred):
    return (y_true == y_pred).mean()

def acc_and_f1(y_true, y_pred, average='weighted'):
    acc = simple_accuracy(y_true, y_pred)
    f1 = f1_score(y_true=y_true, y_pred=y_pred, average=average)
    return {
        "acc": acc,
        "f1": f1,
    }

def y_loss(y_true, y_pred, losses):
    y_true = torch.stack(y_true).cpu().detach().numpy()
    y_pred = torch.stack(y_pred).cpu().detach().numpy()
    y = [y_true, y_pred]
    loss = np.mean(losses)

    return y, loss


def eval_op(model, data_loader, loss_fn):
    model.eval()

    losses = []
    y_pred = []
    y_true = []

    with torch.no_grad():
        for dl in tqdm(data_loader, total=len(data_loader), desc="Evaluation... "):
            
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']
            targets = dl['targets']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            targets = targets.to(device)

            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # take the argmax over the class logits as the predicted class
            _, preds = torch.max(outputs, dim=1)

            # calculate the batch loss
            loss = loss_fn(outputs, targets)

            # accumulate all the losses
            losses.append(loss.item())

            y_pred.extend(preds)
            y_true.extend(targets)
    
    eval_y, eval_loss = y_loss(y_true, y_pred, losses)
    return eval_y, eval_loss


def train_op(model, 
             data_loader, 
             loss_fn, 
             optimizer, 
             scheduler, 
             step=0, 
             print_every_step=100, 
             eval=False,
             eval_cb=None,
             eval_loss_min=np.inf,
             eval_data_loader=None, 
             clip=0.0):
    
    model.train()

    losses = []
    y_pred = []
    y_true = []

    for dl in tqdm(data_loader, total=len(data_loader), desc="Training... "):
        step += 1

        input_ids = dl['input_ids']
        attention_mask = dl['attention_mask']
        token_type_ids = dl['token_type_ids']
        targets = dl['targets']

        # move tensors to GPU if CUDA is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        targets = targets.to(device)

        # clear the gradients of all optimized variables
        optimizer.zero_grad()

        # compute predicted outputs by passing inputs to the model
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids)
        
        # take the argmax over the class logits as the predicted class
        _, preds = torch.max(outputs, dim=1)

        # calculate the batch loss
        loss = loss_fn(outputs, targets)

        # accumulate all the losses
        losses.append(loss.item())

        # compute gradient of the loss with respect to model parameters
        loss.backward()

        # clip the global gradient norm to guard against exploding gradients
        if clip > 0.0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)

        # perform optimization step
        optimizer.step()

        # perform scheduler step
        scheduler.step()

        y_pred.extend(preds)
        y_true.extend(targets)

        if eval and step % print_every_step == 0:
            # compute the running train metrics only when they are reported
            train_y, train_loss = y_loss(y_true, y_pred, losses)
            train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')

            eval_y, eval_loss = eval_op(model, eval_data_loader, loss_fn)
            eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')

            # eval_op switches the model to eval mode; restore train mode
            model.train()

            if callable(eval_cb):
                eval_loss_min = eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min)

    train_y, train_loss = y_loss(y_true, y_pred, losses)

    return train_y, train_loss, step, eval_loss_min
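
The training cell that follows passes `get_linear_schedule_with_warmup` to `train_op` with `num_warmup_steps=0`. As a rough sketch of the multiplier such a schedule applies to the base learning rate, here is a plain-Python reimplementation (not the library's code). The names `linear_schedule_factor` and `base_lr` are illustrative, and the value of `base_lr` is an assumption, since `LEARNING_RATE` is defined elsewhere in the notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# 1888 training batches per epoch and 10 epochs, as shown by the progress bars in the log.
total_steps = 1888 * 10
base_lr = 2e-5  # assumed value for illustration only

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With no warmup the rate starts at its full value and decays linearly to zero at the last step, which is why `scheduler.step()` is called once per batch rather than once per epoch.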

In [None]:
optimizer = AdamW(pt_model.parameters(), lr=LEARNING_RATE, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss()

step = 0
eval_loss_min = np.inf
history = collections.defaultdict(list)


def eval_callback(epoch, epochs, output_path):
    def eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min):
        statement = ''
        statement += 'Epoch: {}/{}...'.format(epoch, epochs)
        statement += 'Step: {}...'.format(step)
        
        statement += 'Train Loss: {:.6f}...'.format(train_loss)
        statement += 'Train Acc: {:.3f}...'.format(train_score['acc'])

        statement += 'Valid Loss: {:.6f}...'.format(eval_loss)
        statement += 'Valid Acc: {:.3f}...'.format(eval_score['acc'])

        print(statement)

        if eval_loss <= eval_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                eval_loss_min,
                eval_loss))
            
            path = "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/finetuned_parsbert_fake_reveiws_dataset.pt"
            torch.save(model.state_dict(), path)
            
            torch.save(model.state_dict(), output_path)
            eval_loss_min = eval_loss
        
        return eval_loss_min


    return eval_cb


for epoch in tqdm(range(1, EPOCHS + 1), desc="Epochs... "):
    train_y, train_loss, step, eval_loss_min = train_op(
        model=pt_model, 
        data_loader=train_data_loader, 
        loss_fn=loss_fn, 
        optimizer=optimizer, 
        scheduler=scheduler, 
        step=step, 
        print_every_step=EEVERY_EPOCH, 
        eval=True,
        eval_cb=eval_callback(epoch, EPOCHS, OUTPUT_PATH),
        eval_loss_min=eval_loss_min,
        eval_data_loader=valid_data_loader, 
        clip=CLIP)
    
    train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')
    
    eval_y, eval_loss = eval_op(
        model=pt_model, 
        data_loader=valid_data_loader, 
        loss_fn=loss_fn)
    
    eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')
    
    history['train_acc'].append(train_score['acc'])
    history['train_loss'].append(train_loss)
    history['val_acc'].append(eval_score['acc'])
    history['val_loss'].append(eval_loss)





Epochs... :   0%|          | 0/10 [00:00<?, ?it/s]

Training... :   0%|          | 0/1888 [00:00<?, ?it/s]

Evaluation... :   0%|          | 0/234 [00:00<?, ?it/s]

Epoch: 1/10...Step: 1000...Train Loss: 0.227556...Train Acc: 0.904...Valid Loss: 0.151042...Valid Acc: 0.938...
Validation loss decreased (inf --> 0.151042).  Saving model ...


Epoch: 2/10...Step: 2000...Train Loss: 0.075328...Train Acc: 0.969...Valid Loss: 0.134903...Valid Acc: 0.949...
Validation loss decreased (0.151042 --> 0.134903).  Saving model ...
Epoch: 2/10...Step: 3000...Train Loss: 0.058088...Train Acc: 0.978...Valid Loss: 0.187809...Valid Acc: 0.943...
Epoch: 3/10...Step: 4000...Train Loss: 0.027061...Train Acc: 0.991...Valid Loss: 0.237495...Valid Acc: 0.937...
Epoch: 3/10...Step: 5000...Train Loss: 0.020244...Train Acc: 0.993...Valid Loss: 0.191769...Valid Acc: 0.950...
Epoch: 4/10...Step: 6000...Train Loss: 0.019685...Train Acc: 0.993...Valid Loss: 0.199786...Valid Acc: 0.950...
Epoch: 4/10...Step: 7000...Train Loss: 0.013888...Train Acc: 0.995...Valid Loss: 0.212854...Valid Acc: 0.952...
Epoch: 5/10...Step: 8000...Train Loss: 0.013213...Train Acc: 0.995...Valid Loss: 0.237277...Valid Acc: 0.940...
Epoch: 5/10...Step: 9000...Train Loss: 0.007358...Train Acc: 0.997...Valid Loss: 0.241362...Valid Acc: 0.949...
Epoch: 6/10...Step: 10000...Train Loss: 0.008191...Train Acc: 0.997...Valid Loss: 0.265411...Valid Acc: 0.942...
Epoch: 6/10...Step: 11000...Train Loss: 0.004425...Train Acc: 0.998...Valid Loss: 0.256184...Valid Acc: 0.948...
Epoch: 7/10...Step: 12000...Train Loss: 0.004057...Train Acc: 0.999...Valid Loss: 0.271513...Valid Acc: 0.944...
Epoch: 7/10...Step: 13000...Train Loss: 0.002438...Train Acc: 0.999...Valid Loss: 0.250968...Valid Acc: 0.951...
Epoch: 8/10...Step: 14000...Train Loss: 0.002175...Train Acc: 0.999...Valid Loss: 0.316917...Valid Acc: 0.944...
Epoch: 8/10...Step: 15000...Train Loss: 0.001041...Train Acc: 1.000...Valid Loss: 0.289519...Valid Acc: 0.951...
Epoch: 9/10...Step: 16000...Train Loss: 0.000394...Train Acc: 1.000...Valid Loss: 0.300765...Valid Acc: 0.953...
Epoch: 10/10...Step: 17000...Train Loss: 0.000064...Train Acc: 1.000...Valid Loss: 0.305905...Valid Acc: 0.952...
Epoch: 10/10...Step: 18000...Train Loss: 0.000011...Train Acc: 1.000...Valid Loss: 0.308920...Valid Acc: 0.952...

### Prediction

In [None]:
def predict(model, comments, tokenizer, max_len=128, batch_size=32):
    data_loader = create_data_loader(comments, None, tokenizer, max_len, batch_size, None)
    
    predictions = []
    prediction_probs = []

    
    model.eval()
    with torch.no_grad():
        for dl in tqdm(data_loader, position=0):
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            
            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # take the argmax over the class logits as the predicted class
            _, preds = torch.max(outputs, dim=1)

            predictions.extend(preds)
            prediction_probs.extend(F.softmax(outputs, dim=1))

    predictions = torch.stack(predictions).cpu().detach().numpy()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()

    return predictions, prediction_probs

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(pt_model, test_comments, tokenizer, max_len=128)

print(preds.shape, probs.shape)

  0%|          | 0/105 [00:00<?, ?it/s]

(3356,) (3356, 2)


In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

F1: 0.95738929191844

              precision    recall  f1-score   support

          CG       0.96      0.95      0.96      1678
          OR       0.95      0.96      0.96      1678

    accuracy                           0.96      3356
   macro avg       0.96      0.96      0.96      3356
weighted avg       0.96      0.96      0.96      3356
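
The report above lists both a macro and a weighted average. As a small sketch of how the two differ, the plain-Python functions below (hypothetical helpers, not scikit-learn's implementation) compute per-class F1 and then average it either uniformly or by class support. The labels are toy values rather than the notebook's data, and the sketch assumes every class appears at least once in `y_true`.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 weights each class by its support."""
    support = Counter(y_true)
    scores = {label: per_class_f1(y_true, y_pred, label) for label in support}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in support) / len(y_true)
    return macro, weighted

# Toy, deliberately imbalanced labels (4 samples of class 0, 2 of class 1).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(macro, weighted)
```

On the test set above both classes have the same support (1678 each), so the macro and weighted averages coincide; they diverge only when the classes are imbalanced, as in the toy example.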



### Load the trained model from Drive and use it for prediction

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset

finetuned_parsbert_fake_reveiws_dataset.pt  test.csv  train.csv  valid.csv


In [None]:
tmp_model = FakeReviewDetectionModel(config=config)
tmp_model = tmp_model.to(device)

print('tmp_model', type(tmp_model))

In [None]:
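path = "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_dataset/finetuned_parsbert_fake_reveiws_dataset.pt"
# map_location lets the checkpoint load even if it was saved on a different device
tmp_model.load_state_dict(torch.load(path, map_location=device))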

<All keys matched successfully>

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=128)

print(preds.shape, probs.shape)

In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

### Custom input test

In [None]:
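# English gloss: "Why did you put pickles in the kashk-e bademjan sandwich this time?! It didn't turn out well at all."
xtmp_test = ['چرا تو ساندویچ کشک بادمجون خیارشور گذاشته بودین این دفه؟! اصن خوب نشده بود.']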
test_comments = np.array(xtmp_test)
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=128)

preds

  0%|          | 0/1 [00:00<?, ?it/s]

array([1])

In [None]:
# English gloss: "It wasn't good"
xtmp_test = ['خوب نبود']
test_comments = np.array(xtmp_test)
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=128)

preds

  0%|          | 0/1 [00:00<?, ?it/s]

array([1])
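
`predict` returns class indices, so `array([1])` refers to position 1 of `label_list`. The following is a minimal sketch of mapping indices back to label names together with the probability of the predicted class. The helper `decode_predictions` is hypothetical, the `preds` and `probs` values are made up for illustration, and the label order is an assumption taken from the row order of the classification report (`CG`, then `OR`).

```python
def decode_predictions(preds, probs, label_list):
    """Pair each predicted index with its label name and that class's probability."""
    return [(label_list[int(i)], float(p[int(i)])) for i, p in zip(preds, probs)]

# Assumed label order, taken from the row order of the classification report.
label_list = ["CG", "OR"]

# Made-up outputs shaped like those of predict(): one index and one probability row per input.
preds = [1, 0]
probs = [[0.2, 0.8], [0.9, 0.1]]
decoded = decode_predictions(preds, probs, label_list)
print(decoded)
```

The same helper should also accept the NumPy arrays returned by `predict`, since it only indexes rows and converts scalars with `int` and `float`.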