# Fake Review Detection with ParsBERT on Digikala Dataset


## BERT Overview

BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. In many cases, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models. It can be used for a wide range of NLP tasks, such as question answering and language inference, without substantial task-specific architecture modifications.


Natural Language Processing (NLP) tasks can be grouped into sentence-level and token-level tasks:

- **Sentence-Level:** Tasks such as Natural Language Inference (NLI) aim to predict the relationships between sentences by analyzing them holistically.
- **Token-Level:** In tasks such as Named Entity Recognition (NER) and Question Answering (QA), the model makes predictions on a token-by-token basis.

There are two primary strategies for applying pre-trained language representations to downstream NLP tasks:

- Feature-based: uses task-specific architectures that include the pre-trained representations as additional features, as with Word2vec or ELMo.
- Fine-tuning: introduces minimal task-specific parameters and trains on the downstream task by simply fine-tuning the pre-trained parameters, as with GPT.

Before going further into the code, let me introduce **ParsBERT**.

## ParsBERT

ParsBERT is a monolingual language model based on Google's BERT architecture. It is pre-trained on large Persian corpora with various writing styles from numerous subjects (e.g., scientific, novels, news, ...) with more than **3.9M** documents, **73M** sentences, and **1.3B** words. For more information about ParsBERT, please check out the article: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

In [None]:
!nvidia-smi

Fri Sep 16 02:27:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install required packages
!pip install -q pyyaml==5.4.1

!pip install -q transformers
!pip install -q hazm
!pip install -q clean-text[gpl]

In [None]:
# Import required packages

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.utils import shuffle

import hazm
from cleantext import clean

import plotly.express as px
import plotly.graph_objects as go

from tqdm.notebook import tqdm

import os
import re
import json
import copy
import collections

## Dataset

Digikala Dataset.

Let's look at the dataset and obtain some intuitions about the data, distribution, and any further operation regarding this particular case.

In [None]:
!gdown "https://drive.google.com/u/0/uc?id=1JX2cF9QcFL_X-3tqtnwJie8DM4Veggru&export=download"

Downloading...
From: https://drive.google.com/u/0/uc?id=1JX2cF9QcFL_X-3tqtnwJie8DM4Veggru
To: /content/train_users.csv
100% 68.7M/68.7M [00:00<00:00, 106MB/s]


In [None]:
!mkdir Data
!mv 'train_users.csv' 'Data/digikala.csv'

### Load the data using Pandas

In [None]:
data = pd.read_csv('Data/digikala.csv')
data = data[['comment', 'verification_status', 'rate']]
data.head()

Unnamed: 0,comment,verification_status,rate
0,مثل بقیه محصولات الکل دار پوست رو خشک نمیکنه,1,100.0
1,با این مبلغ اگه امکانات و ارزش خرید واستون مهم...,1,80.0
2,خوبه فقط کج و کوله بدستم رسید ولی پسرم خیلی خو...,1,100.0
3,در کل خوب بود ولی متاسفانه درب محصول شکسته بود...,1,70.0
4,من که خیلی باهاش حال کردم فقط من که همیشه L می...,1,72.0


### Fixing Conflicts

The dataset has some structural problems, as shown below.

For simplicity, we fix them by removing rows whose `verification_status` is missing (`None`). The dataset also contains duplicated rows and missing values in the `comment` column.

In [None]:
# print data information
print('data information')
print(data.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['verification_status'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264399 entries, 0 to 264398
Data columns (total 3 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   comment              262732 non-null  object 
 1   verification_status  264399 non-null  int64  
 2   rate                 264399 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.1+ MB
None 

missing values stats
comment                1667
verification_status       0
rate                      0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, verification_status, rate]
Index: [] 



In [None]:
# handle conflicts in the dataset structure
# a more careful solution is possible; for the sake of simplicity,
# we just remove the problematic rows

data = data.dropna(subset=['verification_status'])
data = data.dropna(subset=['comment'])
data = data.dropna(subset=['rate'])
data = data.drop_duplicates(subset=['comment'], keep='first')
data = data.reset_index(drop=True)

# previous information after solving the conflicts

# print data information
print('data information')
print(data.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['verification_status'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218479 entries, 0 to 218478
Data columns (total 3 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   comment              218479 non-null  object 
 1   verification_status  218479 non-null  int64  
 2   rate                 218479 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.0+ MB
None 

missing values stats
comment                0
verification_status    0
rate                   0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, verification_status, rate]
Index: [] 



### Normalization / Preprocessing

Comments vary widely in length (measured in words). Finding the typical range helps us choose the maximum sequence length for the preprocessing step. We also assume that a comment needs more than 2 words to form a meaningful phrase for the learning process.

In [None]:
# calculate the length of comments based on their words
data['comment_len_by_words'] = data['comment'].apply(lambda t: len(hazm.word_tokenize(t)))

In [None]:
min_max_len = data["comment_len_by_words"].min(), data["comment_len_by_words"].max()
print(f'Min: {min_max_len[0]} \tMax: {min_max_len[1]}')

Min: 1 	Max: 2064


In [None]:
def data_gl_than(data, less_than=100.0, greater_than=0.0, col='comment_len_by_words'):
    data_length = data[col].values

    data_glt = sum([1 for length in data_length if greater_than < length <= less_than])

    data_glt_rate = (data_glt / len(data_length)) * 100

    print(f'Texts with word length of greater than {greater_than} and less than {less_than} includes {data_glt_rate:.2f}% of the whole!')

In [None]:
data_gl_than(data, 512, 2)

Texts with word length of greater than 2 and less than 512 includes 97.95% of the whole!


In [None]:
minlim, maxlim = 2, 512

In [None]:
# keep only comments with more than `minlim` and at most `maxlim` words
data['comment_len_by_words'] = data['comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
data = data.dropna(subset=['comment_len_by_words'])
data = data.reset_index(drop=True)

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=data['comment_len_by_words']
))

fig.update_layout(
    title_text='Distribution of word counts within comments',
    xaxis_title_text='Word Count',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
unique_rates = list(sorted(data['verification_status'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #2: [0, 1]


In [None]:
fig = go.Figure()

groupby_rate = data.groupby('verification_status')['verification_status'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_rate.index)),
    y=groupby_rate.tolist(),
    text=groupby_rate.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of verification status within comments',
    xaxis_title_text='Verification status',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

- `verification_status` = 1 means `verified`
- `verification_status` = 0 means `rejected`

In [None]:
def verification_to_label(verification_status):
    if verification_status == 1:
        return 'verified'
    else:
        return 'rejected'

data['label'] = data['verification_status'].apply(lambda t: verification_to_label(t))
labels = list(sorted(data['label'].unique()))
data.head()

Unnamed: 0,comment,verification_status,rate,comment_len_by_words,label
0,مثل بقیه محصولات الکل دار پوست رو خشک نمیکنه,1,100.0,9.0,verified
1,با این مبلغ اگه امکانات و ارزش خرید واستون مهم...,1,80.0,19.0,verified
2,خوبه فقط کج و کوله بدستم رسید ولی پسرم خیلی خو...,1,100.0,19.0,verified
3,در کل خوب بود ولی متاسفانه درب محصول شکسته بود...,1,70.0,17.0,verified
4,من که خیلی باهاش حال کردم فقط من که همیشه L می...,1,72.0,21.0,verified


Cleaning is the final step in this section. Our cleaning method includes these steps:

- fixing Unicode issues
- removing special items such as phone numbers, emails, URLs, and line breaks
- stripping HTML tags
- normalizing
- removing emojis

We also use the [Hazm](https://github.com/sobhe/hazm) library for normalization.

`InformalNormalizer` can be used instead of `Normalizer` and may give a better model, but it takes much longer to process the data.

When using `InformalNormalizer`, the `norm_words` function must be applied to its output as well.

In [None]:
def norm_words(words):
    string = ''
    for i in range(len(words[0]) - 1):
        string = string + words[0][i][0] + ' '
    string = string + words[0][len(words[0]) - 1][0]

    return string

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext


def cleaning(text):
    text = text.strip()
    
    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    # cleaning htmls
    text = cleanhtml(text)
    
    # normalizing
    #normalizer = hazm.InformalNormalizer(seperation_flag=True)
    normalizer = hazm.Normalizer()
    text = normalizer.normalize(text)

    #text = norm_words(text)
    
    # removing weird patterns (emojis, pictographs, directional marks)
    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        # u"\u200c"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)
    
    text = wierd_pattern.sub(r'', text)
    
    # removing extra spaces, hashtags
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)
    
    return text
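
As a quick sanity check of a few of the steps above, the following dependency-free sketch applies only the HTML-tag regex, the hashtag removal, and the whitespace cleanup to a made-up string. The helper name `strip_tags_and_spaces` and the sample text are illustrative assumptions; it is not a replacement for `cleaning`, which also runs `clean` and the Hazm normalizer.

```python
import re

TAG_RE = re.compile(r'<.*?>')  # same non-greedy tag pattern as cleanhtml

def strip_tags_and_spaces(text):
    """Illustrative subset of the cleaning steps: tags, '#', extra whitespace."""
    text = TAG_RE.sub('', text)        # drop HTML tags
    text = text.replace('#', '')       # drop the hashtag marker, keep the word
    text = re.sub(r'\s+', ' ', text)   # collapse runs of whitespace (incl. newlines)
    return text.strip()

sample = '<b>good   product</b>\n#recommended'
print(strip_tags_and_spaces(sample))
```

Running a helper like this on a handful of raw comments is a cheap way to confirm that each regex does what its comment claims before applying the full pipeline to the whole column.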

In [None]:
# cleaning comments
data['cleaned_comment'] = data['comment'].apply(cleaning)


# calculate the length of comments based on their words
data['cleaned_comment_len_by_words'] = data['cleaned_comment'].apply(lambda t: len(hazm.word_tokenize(t)))

# keep only cleaned comments with more than `minlim` and at most `maxlim` words
cleaned_len = data['cleaned_comment_len_by_words']
data = data[(cleaned_len > minlim) & (cleaned_len <= maxlim)]
data = data.reset_index(drop=True)

data.head()

Unnamed: 0,comment,verification_status,rate,comment_len_by_words,label,cleaned_comment,cleaned_comment_len_by_words
0,مثل بقیه محصولات الکل دار پوست رو خشک نمیکنه,1,100.0,9.0,verified,مثل بقیه محصولات الکل دار پوست رو خشک نمیکنه,9
1,با این مبلغ اگه امکانات و ارزش خرید واستون مهم...,1,80.0,19.0,verified,با این مبلغ اگه امکانات و ارزش خرید واستون مهم...,19
2,خوبه فقط کج و کوله بدستم رسید ولی پسرم خیلی خو...,1,100.0,19.0,verified,خوبه فقط کج و کوله بدستم رسید ولی پسرم خیلی خو...,19
3,در کل خوب بود ولی متاسفانه درب محصول شکسته بود...,1,70.0,17.0,verified,در کل خوب بود ولی متاسفانه درب محصول شکسته بود...,17
4,من که خیلی باهاش حال کردم فقط من که همیشه L می...,1,72.0,21.0,verified,من که خیلی باهاش حال کردم فقط من که همیشه l می...,21


In [None]:
data = data[['cleaned_comment', 'label', 'rate']]
data.columns = ['comment', 'label', 'rate']
data.head()

Unnamed: 0,comment,label,rate
0,مثل بقیه محصولات الکل دار پوست رو خشک نمیکنه,verified,100.0
1,با این مبلغ اگه امکانات و ارزش خرید واستون مهم...,verified,80.0
2,خوبه فقط کج و کوله بدستم رسید ولی پسرم خیلی خو...,verified,100.0
3,در کل خوب بود ولی متاسفانه درب محصول شکسته بود...,verified,70.0
4,من که خیلی باهاش حال کردم فقط من که همیشه l می...,verified,72.0


In [None]:
print(f'We have #{len(labels)} labels: {labels}')

We have #2 labels: ['rejected', 'verified']


### Handling Unbalanced Data

In [None]:
fig = go.Figure()

groupby_label = data.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

Undersampling

Again, to keep things simple, we keep every comment of the minority class (`rejected`) and randomly sample the majority class down to at most twice that size.

In [None]:
rejected_data = data[data['label'] == 'rejected']
verified_data = data[data['label'] == 'verified']

cutting_point = min(len(rejected_data), len(verified_data)) * 2

if cutting_point <= len(rejected_data):
    rejected_data = rejected_data.sample(n=cutting_point).reset_index(drop=True)

if cutting_point <= len(verified_data):
    verified_data = verified_data.sample(n=cutting_point).reset_index(drop=True)

new_data = pd.concat([rejected_data, verified_data])
new_data = new_data.sample(frac=1).reset_index(drop=True)
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22173 entries, 0 to 22172
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   comment  22173 non-null  object 
 1   label    22173 non-null  object 
 2   rate     22173 non-null  float64
dtypes: float64(1), object(2)
memory usage: 519.8+ KB
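
The undersampling rule used in this section can be sketched without pandas as follows. The function name `undersample`, the `ratio` and `seed` parameters, and the toy class sizes are assumptions made for illustration. Note that 22173 = 3 × 7391, which is what one would expect if 7391 rejected comments were kept and the verified class was cut to twice that number; the per-class counts are not printed above, so treat this as an inference.

```python
import random

def undersample(majority, minority, ratio=2, seed=0):
    """Keep every minority item and at most ratio * len(minority) majority items."""
    rng = random.Random(seed)
    cap = ratio * len(minority)
    majority = list(majority)
    kept = majority if len(majority) <= cap else rng.sample(majority, cap)
    combined = kept + list(minority)
    rng.shuffle(combined)              # mix the two classes
    return combined

# Toy data: 10 "rejected" rows against 100 "verified" rows (made-up sizes).
rejected = [('rejected', i) for i in range(10)]
verified = [('verified', i) for i in range(100)]

balanced = undersample(verified, rejected, ratio=2)
counts = {name: sum(1 for label, _ in balanced if label == name)
          for name in ('rejected', 'verified')}
print(counts)
```

Keeping a 2:1 ratio rather than a strict 1:1 balance retains more of the majority class at the cost of a mild imbalance.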


In [None]:
fig = go.Figure()

groupby_label = new_data.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [NEW DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

## Train, Validation, Test Split

To obtain a model that generalizes well, we split the cleaned dataset into train, validation, and test sets. We use a rate of **0.1** for each held-out set: the validation set is 10% of the data, and the test set is 10% of the remaining 90%. For splitting, we use `train_test_split` from scikit-learn, stratified on the label to preserve the class distribution.

In [None]:
new_data['label_id'] = new_data['label'].apply(lambda t: labels.index(t))

train, valid = train_test_split(new_data, test_size=0.1, random_state=1, stratify=new_data['label'])
train, test = train_test_split(train, test_size=0.1, random_state=1, stratify=train['label'])

train = train.reset_index(drop=True)
valid = valid.reset_index(drop=True)
test = test.reset_index(drop=True)

x_train, y_train = train['comment'].values.tolist(), train['label_id'].values.tolist()
x_valid, y_valid = valid['comment'].values.tolist(), valid['label_id'].values.tolist()
x_test, y_test = test['comment'].values.tolist(), test['label_id'].values.tolist()

print(train.shape)
print(valid.shape)
print(test.shape)

(17959, 4)
(2218, 4)
(1996, 4)
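
The three shapes can be reproduced with simple arithmetic. The sketch below assumes that the held-out size of each split is rounded up, which matches the numbers printed above; the function name `nested_split_sizes` is hypothetical.

```python
import math

def nested_split_sizes(n_samples, valid_frac=0.1, test_frac=0.1):
    """Sizes produced by holding out valid first, then test from the remainder."""
    n_valid = math.ceil(valid_frac * n_samples)   # assumption: held-out size is rounded up
    remainder = n_samples - n_valid
    n_test = math.ceil(test_frac * remainder)     # 10% of what is left, not of the whole
    n_train = remainder - n_test
    return n_train, n_valid, n_test

n_train, n_valid, n_test = nested_split_sizes(22173)
print(n_train, n_valid, n_test)
print(f'test share of the whole dataset: {n_test / 22173:.3f}')
```

Because the second split is taken from the remaining 90%, the test set is about 9% of the full dataset rather than 10%, and the training set is about 81%.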


### Saving train, valid and test sets to drive

In [None]:
train.to_csv('train.csv')
valid.to_csv('valid.csv')
test.to_csv('test.csv')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import shutil

shutil.copy("train.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/train.csv")
shutil.copy("valid.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/valid.csv")
shutil.copy("test.csv", "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/test.csv")

'/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/test.csv'

## Implement model with PyTorch

We will implement the model using *PyTorch*.

![BERT INPUTS](https://res.cloudinary.com/m3hrdadfi/image/upload/v1595158991/kaggle/bert_inputs_w8rith.png)

The BERT model input is the sum of three embeddings:
- Token embeddings: vectors looked up from the WordPiece token vocabulary (WordPiece is a subword segmentation algorithm, similar to BPE)
- Segment embeddings: for sentence pairs [A-B], $E_A$ or $E_B$ marks whether a token belongs to the first sentence or the second one
- Position embeddings: specify the position of each token in the sequence
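
To make the combination concrete, here is a minimal, dependency-free sketch that looks up one vector per token, segment, and position and adds them element-wise. The table sizes, the hidden size of 4, and the toy ids are assumptions chosen for readability; in the real model the tables are learned, the hidden size is 768, and layer normalization and dropout are applied after the sum.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size

def make_table(rows, dim=HIDDEN):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(10)     # toy vocabulary of 10 tokens
segment_table = make_table(2)    # sentence A / sentence B
position_table = make_table(8)   # toy maximum sequence length of 8

def embed(input_ids, token_type_ids):
    """Sum the token, segment, and position vectors at each position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vec = [t + s + p for t, s, p in
               zip(token_table[tok], segment_table[seg], position_table[pos])]
        vectors.append(vec)
    return vectors

vectors = embed(input_ids=[2, 5, 7, 3], token_type_ids=[0, 0, 0, 0])
print(len(vectors), len(vectors[0]))
```

For single-sentence classification, as in this notebook, every token belongs to segment A, so the segment ids are all zero.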

In [None]:
from transformers import BertConfig, BertTokenizer
from transformers import BertModel

from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

import torch
import torch.nn as nn
import torch.nn.functional as F

### Configuration

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'device: {device}')

train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

device: cuda:0
CUDA is available!  Training on GPU ...


In [None]:
# general config
MAX_LEN = 512
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16

EPOCHS = 3
EEVERY_EPOCH = 500
LEARNING_RATE = 2e-5
CLIP = 0.0

MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-zwnj-base'
OUTPUT_PATH = '/content/bert-fa-zwnj-base-fake-review-detection/pytorch_model.bin'

os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

In [None]:
labels = ['rejected', 'verified']
print(f'We have #{len(labels)} labels: {labels}')

We have #2 labels: ['rejected', 'verified']


In [None]:
# build lookups from label to id and from id to label

label2id = {label: i for i, label in enumerate(labels)}
id2label = {v: k for k, v in label2id.items()}

print(f'label2id: {label2id}')
print(f'id2label: {id2label}')

label2id: {'rejected': 0, 'verified': 1}
id2label: {0: 'rejected', 1: 'verified'}


In [None]:
# setup the tokenizer and configuration

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
config = BertConfig.from_pretrained(
    MODEL_NAME_OR_PATH, **{
        'label2id': label2id,
        'id2label': id2label,
    })

print(config.to_json_string())

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "rejected",
    "1": "verified"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "rejected": 0,
    "verified": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 42000
}



### Input Embeddings

In [None]:
idx = np.random.randint(0, len(train))
sample = train.iloc[idx]['comment']
sample_label = train.iloc[idx]['label']

print(f'Sample: \n{sample}\n{sample_label}')

Sample: 
جنسش خوبه نمیچسبه
verified


In [None]:
tokens = tokenizer.tokenize(sample)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f'   Sample: {sample}')
print(f'   Tokens: {tokenizer.convert_tokens_to_string(tokens)}')
print(f'Token IDs: {token_ids}')

   Sample: جنسش خوبه نمیچسبه
   Tokens: جنسش خوبه نمیچسبه
Token IDs: [4965, 1121, 12228, 2204, 6402, 1123]


In [None]:
encoding = tokenizer.encode_plus(
    sample,
    max_length=32,
    truncation=True,
    add_special_tokens=True, # Add '[CLS]' and '[SEP]'
    return_token_type_ids=True,
    return_attention_mask=True,
    padding='max_length',
    return_tensors='pt',  # Return PyTorch tensors
)

print(f'Keys: {encoding.keys()}\n')
for k in encoding.keys():
    print(f'{k}:\n{encoding[k]}')

Keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

input_ids:
tensor([[    2,  4965,  1121, 12228,  2204,  6402,  1123,     3,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
token_type_ids:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
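
The structure of this output can be reproduced by hand. The sketch below is a simplified stand-in for what the tokenizer does with `padding='max_length'` and `truncation=True`; the function name `pad_and_mask` is hypothetical, and the special-token ids (2 for `[CLS]`, 3 for `[SEP]`, 0 for padding) are read off the output above rather than taken from the tokenizer's vocabulary.

```python
def pad_and_mask(token_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Wrap ids in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    body = token_ids[:max_length - 2]        # leave room for the two special tokens
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 marks padding

input_ids, attention_mask = pad_and_mask(
    [4965, 1121, 12228, 2204, 6402, 1123], max_length=32)
print(input_ids[:10])
print(sum(attention_mask))
```

The attention mask tells the model which positions to ignore, so padded positions do not influence the representation of the real tokens.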


### Dataset

In [None]:
class Dataset(torch.utils.data.Dataset):
    """ Create a PyTorch dataset. """

    def __init__(self, tokenizer, comments, targets=None, label_list=None, max_len=128):
        self.comments = comments
        self.targets = targets
        self.has_target = isinstance(targets, list) or isinstance(targets, np.ndarray)

        self.tokenizer = tokenizer
        self.max_len = max_len

        
        self.label_map = {label: i for i, label in enumerate(label_list)} if isinstance(label_list, list) else {}
    
    def __len__(self):
        return len(self.comments)

    def __getitem__(self, item):
        comment = str(self.comments[item])

        if self.has_target:
            target = self.label_map.get(str(self.targets[item]), str(self.targets[item]))

        encoding = self.tokenizer.encode_plus(
            comment,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt')
        
        inputs = {
            'comment': comment,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
        }

        if self.has_target:
            inputs['targets'] = torch.tensor(target, dtype=torch.long)
        
        return inputs


def create_data_loader(x, y, tokenizer, max_len, batch_size, label_list):
    dataset = Dataset(
        comments=x,
        targets=y,
        tokenizer=tokenizer,
        max_len=max_len, 
        label_list=label_list)
    
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size)

In [None]:
label_list = ['rejected', 'verified']
train_data_loader = create_data_loader(train['comment'].to_numpy(), train['label'].to_numpy(), tokenizer, MAX_LEN, TRAIN_BATCH_SIZE, label_list)
valid_data_loader = create_data_loader(valid['comment'].to_numpy(), valid['label'].to_numpy(), tokenizer, MAX_LEN, VALID_BATCH_SIZE, label_list)
test_data_loader = create_data_loader(test['comment'].to_numpy(), None, tokenizer, MAX_LEN, TEST_BATCH_SIZE, label_list)

In [None]:
sample_data = next(iter(train_data_loader))

print(sample_data.keys())

print(sample_data['comment'])
print(sample_data['input_ids'].shape)
print(sample_data['input_ids'][0, :])
print(sample_data['attention_mask'].shape)
print(sample_data['attention_mask'][0, :])
print(sample_data['token_type_ids'].shape)
print(sample_data['token_type_ids'][0, :])
print(sample_data['targets'].shape)
print(sample_data['targets'][0])

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids', 'targets'])
['آقا خواهشا یا از این طرح تخفیفا نزارین یا اگ میزارید حداقل دیگه تو این مورد پارتی بازی نکنید. بابا من پول قرض کردم ریختم تو کارت ولی هربار سرساعت ک میام ثبت نام کنم نمیشه. یا میزنه تمام شد. یا میزنه نامعتبر اس. اخه این چ کاریه خدایی؟', 'کالای مورد نظر با تخفیف سفارش دادم پولش را هم واریز کردم ولی نیامد که نیامد وتو پروفایلم پریروز زده بود انصراف مامور ارسال امروز زده تحویل به مشتری چرا', 'بنظرمن خوب نبود. هم نازکه خیلی هم دوختش بده', 'سلام کسی میدونه نهایت قدش چقدر میشه؟', 'نسبت به آنچه که انتظار داشتم بهتر بود بسیار راحت و شیک است', 'لپ تاپخوبیه میخوام بخرم اگه cpu قوی\u200cتر میخواید یه مدل از همین هست ۲۲ ملیون رایزن ۷ ولی چیز خوبیه', 'سایز مناسب برای ۲۰۶ و کارراه انداز', 'به نسبت قیمت کیفیت قابل قبولی دارد باز هم خرید می\u200cکنم', 'اولأاین محصول رازده\u200cاید آلمانی، درصورتی که چینی است اون درجه۳، دومأقیمت اصلی رازده\u200cاید۱۲۸۰۰۰۰تومان، درصورتی که قیمت واقعیش توبازار۴۸۰۰۰۰هزارتومان است. آخربه چه

In [None]:
sample_test = next(iter(test_data_loader))
print(sample_test.keys())

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids'])


### Model

While implementing the model, you may sometimes run into the error shown below, which says that all of the CUDA memory has been used. There are several ways to solve it, but the simplest is to clear the CUDA cache.

![Cuda-Error](https://res.cloudinary.com/m3hrdadfi/image/upload/v1599979552/kaggle/cuda-error_iyqh4o.png)


**Simple Solution**
```python
import torch, gc

gc.collect()
torch.cuda.empty_cache()

!nvidia-smi
```

In [None]:
class FakeReviewDetectionModel(nn.Module):

    def __init__(self, config):
        super(FakeReviewDetectionModel, self).__init__()

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
    
    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled_output = self.bert(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids,
            return_dict=False)
        
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits 

In [None]:
import torch, gc

# drop the reference to any previous model first so its memory can be reclaimed
pt_model = None
gc.collect()
torch.cuda.empty_cache()

!nvidia-smi

Fri Sep 16 02:30:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
pt_model = FakeReviewDetectionModel(config=config)
pt_model = pt_model.to(device)

print('pt_model', type(pt_model))

In [None]:
# sample data output

sample_data_comment = sample_data['comment']
sample_data_input_ids = sample_data['input_ids']
sample_data_attention_mask = sample_data['attention_mask']
sample_data_token_type_ids = sample_data['token_type_ids']
sample_data_targets = sample_data['targets']

# move the tensors to the selected device (GPU if available)
sample_data_input_ids = sample_data_input_ids.to(device)
sample_data_attention_mask = sample_data_attention_mask.to(device)
sample_data_token_type_ids = sample_data_token_type_ids.to(device)
sample_data_targets = sample_data_targets.to(device)


# outputs = F.softmax(
#     pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids), 
#     dim=1)

outputs = pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids)
_, preds = torch.max(outputs, dim=1)

print(outputs[:5, :])
print(preds[:5])

tensor([[-0.5167, -0.1706],
        [ 0.0664, -0.1128],
        [ 0.0763,  0.0484],
        [-0.1061, -0.0687],
        [-0.0326, -0.1090]], device='cuda:0', grad_fn=<SliceBackward0>)
tensor([1, 0, 0, 1, 0], device='cuda:0')
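
The values printed above are raw logits, not probabilities. The following sketch, written in plain Python so it runs without a GPU, applies a softmax to the five rows shown above and confirms that the predicted class is the same either way, since softmax preserves the ordering within a row. The helper `softmax` is written here for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above.
logits = [
    [-0.5167, -0.1706],
    [0.0664, -0.1128],
    [0.0763, 0.0484],
    [-0.1061, -0.0687],
    [-0.0326, -0.1090],
]

probs = [softmax(row) for row in logits]
preds = [row.index(max(row)) for row in probs]
print(preds)
```

At this point the classifier layer is untrained, so the probabilities are all close to 0.5 and the predictions carry no meaning yet.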


### Training

The training and optimization process.

During optimization, we save the best model to Drive.

Best model: the model with the minimum validation loss during optimization.
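
The "keep the best model" rule can be sketched independently of the training loop. In the snippet below, the function name `best_checkpoint` and the loss values are made up for illustration; in a real loop, the branch that updates the best loss is where the model weights would be written to disk (for example with `torch.save`, which is an assumption about how the save is done).

```python
def best_checkpoint(valid_losses):
    """Return (index, loss) of the lowest validation loss in the sequence."""
    best_idx, best_loss = None, float('inf')
    for idx, loss in enumerate(valid_losses):
        if loss < best_loss:           # strictly better: this is where a save would happen
            best_idx, best_loss = idx, loss
    return best_idx, best_loss

# Made-up validation losses from successive evaluations.
history = [0.62, 0.48, 0.51, 0.45, 0.47]
best_idx, best_loss = best_checkpoint(history)
print(best_idx, best_loss)
```

Selecting on validation loss rather than training loss guards against keeping a checkpoint that has started to overfit.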

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala

test.csv  train.csv  valid.csv


In [None]:
def simple_accuracy(y_true, y_pred):
    return (y_true == y_pred).mean()

def acc_and_f1(y_true, y_pred, average='weighted'):
    acc = simple_accuracy(y_true, y_pred)
    f1 = f1_score(y_true=y_true, y_pred=y_pred, average=average)
    return {
        "acc": acc,
        "f1": f1,
    }

def y_loss(y_true, y_pred, losses):
    y_true = torch.stack(y_true).cpu().detach().numpy()
    y_pred = torch.stack(y_pred).cpu().detach().numpy()
    y = [y_true, y_pred]
    loss = np.mean(losses)

    return y, loss


def eval_op(model, data_loader, loss_fn):
    model.eval()

    losses = []
    y_pred = []
    y_true = []

    with torch.no_grad():
        for dl in tqdm(data_loader, total=len(data_loader), desc="Evaluation... "):
            
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']
            targets = dl['targets']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            targets = targets.to(device)

            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # convert output logits to the predicted class (argmax over classes)
            _, preds = torch.max(outputs, dim=1)

            # calculate the batch loss
            loss = loss_fn(outputs, targets)

            # accumulate all the losses
            losses.append(loss.item())

            y_pred.extend(preds)
            y_true.extend(targets)
    
    eval_y, eval_loss = y_loss(y_true, y_pred, losses)
    return eval_y, eval_loss


def train_op(model, 
             data_loader, 
             loss_fn, 
             optimizer, 
             scheduler, 
             step=0, 
             print_every_step=100, 
             eval=False,
             eval_cb=None,
             eval_loss_min=np.inf,
             eval_data_loader=None, 
             clip=0.0):
    
    model.train()

    losses = []
    y_pred = []
    y_true = []

    for dl in tqdm(data_loader, total=len(data_loader), desc="Training... "):
        step += 1

        input_ids = dl['input_ids']
        attention_mask = dl['attention_mask']
        token_type_ids = dl['token_type_ids']
        targets = dl['targets']

        # move tensors to GPU if CUDA is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        targets = targets.to(device)

        # clear the gradients of all optimized variables
        optimizer.zero_grad()

        # compute predicted outputs by passing inputs to the model
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids)
        
        # convert output logits to the predicted class (argmax over classes)
        _, preds = torch.max(outputs, dim=1)

        # calculate the batch loss
        loss = loss_fn(outputs, targets)

        # accumulate all the losses
        losses.append(loss.item())

        # compute gradient of the loss with respect to model parameters
        loss.backward()

        # `clip_grad_norm_` rescales the gradients so that their global norm
        # does not exceed `clip`, which guards against exploding gradients.
        if clip > 0.0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)

        # perform optimization step
        optimizer.step()

        # perform scheduler step
        scheduler.step()

        y_pred.extend(preds)
        y_true.extend(targets)

        if eval and step % print_every_step == 0:
            # compute the running training metrics only when they are reported
            train_y, train_loss = y_loss(y_true, y_pred, losses)
            train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')

            eval_y, eval_loss = eval_op(model, eval_data_loader, loss_fn)
            eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')

            if callable(eval_cb):
                eval_loss_min = eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min)

            # eval_op put the model in eval mode; switch back before the next batch
            model.train()

    train_y, train_loss = y_loss(y_true, y_pred, losses)

    return train_y, train_loss, step, eval_loss_min
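
When `clip` is positive, `train_op` calls `nn.utils.clip_grad_norm_` before the optimizer step. The idea behind norm-based clipping is sketched below on plain Python lists: compute one L2 norm over all gradients and, if it exceeds the limit, scale every gradient by the same factor. The function name `clip_by_global_norm` and the epsilon value are assumptions made for this illustration, not the library's code.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Two toy "parameter gradients"; the global norm is sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

# Gradients that are already small enough are left unchanged.
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)
```

Scaling all gradients by one shared factor keeps the update direction intact and only limits its length.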

In [None]:
optimizer = AdamW(pt_model.parameters(), lr=LEARNING_RATE, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss()

step = 0
eval_loss_min = np.inf
history = collections.defaultdict(list)


def eval_callback(epoch, epochs, output_path):
    def eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min):
        statement = ''
        statement += 'Epoch: {}/{}...'.format(epoch, epochs)
        statement += 'Step: {}...'.format(step)
        
        statement += 'Train Loss: {:.6f}...'.format(train_loss)
        statement += 'Train Acc: {:.3f}...'.format(train_score['acc'])

        statement += 'Valid Loss: {:.6f}...'.format(eval_loss)
        statement += 'Valid Acc: {:.3f}...'.format(eval_score['acc'])

        print(statement)

        if eval_loss <= eval_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                eval_loss_min,
                eval_loss))
            
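            # keep a copy on Google Drive so the checkpoint survives the Colab session
            path = "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala/finetuned_parsbert_fake_reveiws_digikala.pt"
            torch.save(model.state_dict(), path)

            # also save to `output_path`
            torch.save(model.state_dict(), output_path)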
            eval_loss_min = eval_loss
        
        return eval_loss_min


    return eval_cb


for epoch in tqdm(range(1, EPOCHS + 1), desc="Epochs... "):
    train_y, train_loss, step, eval_loss_min = train_op(
        model=pt_model, 
        data_loader=train_data_loader, 
        loss_fn=loss_fn, 
        optimizer=optimizer, 
        scheduler=scheduler, 
        step=step, 
        print_every_step=EEVERY_EPOCH, 
        eval=True,
        eval_cb=eval_callback(epoch, EPOCHS, OUTPUT_PATH),
        eval_loss_min=eval_loss_min,
        eval_data_loader=valid_data_loader, 
        clip=CLIP)
    
    train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')
    
    eval_y, eval_loss = eval_op(
        model=pt_model, 
        data_loader=valid_data_loader, 
        loss_fn=loss_fn)
    
    eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')
    
    history['train_acc'].append(train_score['acc'])
    history['train_loss'].append(train_loss)
    history['val_acc'].append(eval_score['acc'])
    history['val_loss'].append(eval_loss)





Epochs... :   0%|          | 0/3 [00:00<?, ?it/s]

Training... :   0%|          | 0/1123 [00:00<?, ?it/s]

Evaluation... :   0%|          | 0/139 [00:00<?, ?it/s]

Epoch: 1/3...Step: 500...Train Loss: 0.328657...Train Acc: 0.869...Valid Loss: 0.300020...Valid Acc: 0.891...
Validation loss decreased (inf --> 0.300020).  Saving model ...


Epoch: 1/3...Step: 1000...Train Loss: 0.294539...Train Acc: 0.884...Valid Loss: 0.274278...Valid Acc: 0.892...
Validation loss decreased (0.300020 --> 0.274278).  Saving model ...

Epoch: 2/3...Step: 1500...Train Loss: 0.201193...Train Acc: 0.932...Valid Loss: 0.302608...Valid Acc: 0.890...

Epoch: 2/3...Step: 2000...Train Loss: 0.174089...Train Acc: 0.941...Valid Loss: 0.284939...Valid Acc: 0.895...

Epoch: 3/3...Step: 2500...Train Loss: 0.138232...Train Acc: 0.958...Valid Loss: 0.301594...Valid Acc: 0.895...

Epoch: 3/3...Step: 3000...Train Loss: 0.108147...Train Acc: 0.967...Valid Loss: 0.342291...Valid Acc: 0.893...
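
The scheduler created with `get_linear_schedule_with_warmup` multiplies the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero at `num_training_steps`. Below is a plain-Python sketch of that factor, assuming the 1123 training batches per epoch and the 3 epochs shown in the progress bars. `linear_schedule_factor` is an illustrative name, and `base_lr` is a placeholder because `LEARNING_RATE` is defined elsewhere in the notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

batches_per_epoch = 1123  # from the "Training..." progress bar
epochs = 3                # from the "Epochs..." progress bar
total_steps = batches_per_epoch * epochs

base_lr = 2e-5  # placeholder value, not taken from the notebook

for step in (0, 1000, 2000, 3000, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, base_lr * factor)
```

With `num_warmup_steps=0`, as in the cell above, the factor starts at 1 and falls steadily, so the later evaluation steps run with a much smaller learning rate than the first ones.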

### Prediction

In [None]:
def predict(model, comments, tokenizer, max_len=128, batch_size=16):
    data_loader = create_data_loader(comments, None, tokenizer, max_len, batch_size, None)
    
    predictions = []
    prediction_probs = []

    
    model.eval()
    with torch.no_grad():
        for dl in tqdm(data_loader, position=0):
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            
            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # convert output logits to the predicted class (argmax over classes)
            _, preds = torch.max(outputs, dim=1)

            predictions.extend(preds)
            prediction_probs.extend(F.softmax(outputs, dim=1))

    predictions = torch.stack(predictions).cpu().detach().numpy()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()

    return predictions, prediction_probs

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(pt_model, test_comments, tokenizer, max_len=512)

print(preds.shape, probs.shape)

  0%|          | 0/125 [00:00<?, ?it/s]

(1996,) (1996, 2)
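
The progress bar total of 125 follows from the test-set size and the default `batch_size=16` of `predict`. The short sketch below reproduces that count with placeholder data. It assumes the data loader keeps the final, smaller batch, which is consistent with all 1996 predictions being returned; `batched` is an illustrative helper, not part of the notebook.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_test_comments = 1996  # size of the test set, from the printed shapes
batch_size = 16           # default batch_size of predict()

comments = [f"comment {i}" for i in range(num_test_comments)]  # placeholder data
batch_sizes = [len(batch) for batch in batched(comments, batch_size)]

print(len(batch_sizes), math.ceil(num_test_comments / batch_size), batch_sizes[-1])
```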


In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

F1: 0.9006883758503387

              precision    recall  f1-score   support

    rejected       0.85      0.85      0.85       665
    verified       0.92      0.93      0.93      1331

    accuracy                           0.90      1996
   macro avg       0.89      0.89      0.89      1996
weighted avg       0.90      0.90      0.90      1996
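
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows of the report. This sketch uses the rounded F1 scores and supports printed above, so it only approximates the unrounded weighted F1 shown before the report; the dictionary layout is an assumption made for the example.

```python
# Per-class F1 and support, copied from the classification report above.
report = {
    "rejected": {"f1": 0.85, "support": 665},
    "verified": {"f1": 0.93, "support": 1331},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

Since `verified` has about twice as many examples as `rejected`, the weighted average sits closer to the `verified` score than the macro average does.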



### Saving the last epoch model to drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
import shutil

shutil.copy('/content/bert-fa-zwnj-base-fake-review-detection/pytorch_model.bin', "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid.bin")

'/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid.bin'

### Loading the trained model from Drive and using it for prediction

In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid

finetuned_parsbert_fake_reveiws_digikala_2Xverifeid.bin  train.csv
test.csv						 valid.csv


In [None]:
tmp_model = FakeReviewDetectionModel(config=config)
tmp_model = tmp_model.to(device)

print('tmp_model', type(tmp_model))

In [None]:
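path = "/content/gdrive/MyDrive/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid/finetuned_parsbert_fake_reveiws_digikala_2Xverifeid.bin"
# map_location lets the checkpoint load on whichever device is currently in use
tmp_model.load_state_dict(torch.load(path, map_location=device))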

<All keys matched successfully>

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=512)

print(preds.shape, probs.shape)

  0%|          | 0/125 [00:00<?, ?it/s]

(1996,) (1996, 2)


In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

F1: 0.9059858651362502

              precision    recall  f1-score   support

    rejected       0.84      0.89      0.86       665
    verified       0.94      0.91      0.93      1331

    accuracy                           0.91      1996
   macro avg       0.89      0.90      0.90      1996
weighted avg       0.91      0.91      0.91      1996



### Custom input test

In [None]:
# Persian for "It was really great."
xtmp_test = ['خیلی عالی بود.']
test_comments = np.array(xtmp_test)
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=512)

labels[preds[0]]

  0%|          | 0/1 [00:00<?, ?it/s]

'verified'
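
`predict` also returns the softmax probabilities, so a custom input can be reported together with a confidence value. The sketch below assumes the label order `['rejected', 'verified']`, inferred from the classification reports above, and uses a made-up probability row; `describe` is a hypothetical helper.

```python
label_list = ["rejected", "verified"]  # order assumed from the classification report

def describe(prob_row, labels=label_list):
    """Return the most likely label and its probability for one example."""
    idx = max(range(len(prob_row)), key=prob_row.__getitem__)
    return labels[idx], prob_row[idx]

prob_row = [0.12, 0.88]  # hypothetical probabilities, not produced by the model
label, confidence = describe(prob_row)
print(f"{label} ({confidence:.0%})")
```

In the notebook, `probs[0]` from the `predict` call above would take the place of `prob_row`.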