# Sentiment Analysis with ParsBERT on SentiPers Dataset


## BERT Overview

BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can be fine-tuned with just one additional output layer (in many cases) to create state-of-the-art models. It can be used for a wide range of NLP tasks, such as question answering and language inference, without substantial task-specific architecture modifications.


Natural Language Processing (NLP) tasks include sentence-level tasks and token-level tasks:

- **Sentence-Level:** Tasks such as Natural Language Inference (NLI) aim to predict the relationships between sentences by analyzing them holistically.
- **Token-Level:** In tasks such as Named Entity Recognition (NER) and Question Answering (QA), the model makes predictions on a token-by-token basis.

There are two primary strategies for applying pre-trained language representations to downstream NLP tasks:

- Feature-based: uses task-specific architectures that include the pre-trained representations as additional features, as with Word2vec and ELMo.
- Fine-tuning: introduces minimal task-specific parameters and trains on the downstream task by simply fine-tuning the pre-trained parameters, as with GPT.

Before going further into the code, let me introduce **ParsBERT**.

## ParsBERT

ParsBERT is a monolingual language model based on Google's BERT architecture. It is pre-trained on large Persian corpora covering various writing styles and numerous subjects (e.g., scientific, novels, news), with more than **3.9M** documents, **73M** sentences, and **1.3B** words. For more information about ParsBERT, please check out the article: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515).

In [None]:
!nvidia-smi

Sun Jul 17 17:03:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    49W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install required packages
!pip install pyyaml==5.4.1

!pip install -q transformers
!pip install -q hazm
!pip install -q clean-text[gpl]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyyaml==5.4.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Installing collected packages: pyyaml
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed pyyaml-5.4.1
  Building wheel for nltk (setup.py) ... done
  Building wheel for libwapiti (setup.py) ... done

In [None]:
# Import required packages

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.utils import shuffle

import hazm
from cleantext import clean

import plotly.express as px
import plotly.graph_objects as go

from tqdm.notebook import tqdm

import os
import re
import json
import copy
import collections

## Dataset

This notebook uses the SentiPers dataset. We merged all the dataset files except the test set, and kept the test set for evaluating the final results.

Let's look at the dataset and obtain some intuitions about the data, distribution, and any further operation regarding this particular case.

In [None]:
# quote the URLs so the shell does not treat `&` as a background operator
!gdown "https://drive.google.com/u/0/uc?id=1IDfgl6yKwX9yo_NG2AnQCqr71i-0OzNR&export=download"
!gdown "https://drive.google.com/u/0/uc?id=1537HAv-BSnNvRXlnZuhQxKWziXlcKpha&export=download"

Downloading...
From: https://drive.google.com/u/0/uc?id=1IDfgl6yKwX9yo_NG2AnQCqr71i-0OzNR
To: /content/merged.csv
100% 7.72M/7.72M [00:00<00:00, 59.9MB/s]
Downloading...
From: https://drive.google.com/u/0/uc?id=1537HAv-BSnNvRXlnZuhQxKWziXlcKpha
To: /content/test.csv
100% 366k/366k [00:00<00:00, 127MB/s]


### Load the data using Pandas

In [None]:
data = pd.read_csv('merged.csv', encoding='utf-8')
data_test = pd.read_csv('test.csv', encoding='utf-8')
data = data[['comment', 'rate']]
data_test.columns = ['comment','rate']
data.tail()

Unnamed: 0,comment,rate
42304,البته نمی‌توان گفت که سیستم خنک کننده کاملا بی...,1
42305,باتری با وجود تمام سخت افزارهای فوق‌العاده به ...,1
42306,نرم افزار به طور پیش فرض Retina MacBook Pro از...,1
42307,متاسفانه سایر برنامه‌ها بر روی این صفحه نمایش ...,-1
42308,جمع بندی همواره MacBook Pro جزو سری خاصی از لپ...,0


### Fixing Conflicts

The dataset may have some structural problems, which the checks below look for.

For simplicity, we handle them by removing rows whose `rate` or `comment` is missing and by dropping duplicated comments, keeping the first occurrence.

In [None]:
# print data information
print('data information')
print(data.info(), '\n')

# print data information
print('data information')
print(data_test.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

print('missing values stats')
print(data_test.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['rate'].isnull()].iloc[:5], '\n')

print('some missing values')
print(data_test[data_test['rate'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42309 entries, 0 to 42308
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  42309 non-null  object
 1   rate     42309 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 661.2+ KB
None 

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1853 entries, 0 to 1852
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  1853 non-null   object
 1   rate     1853 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 29.1+ KB
None 

missing values stats
comment    0
rate       0
dtype: int64 

missing values stats
comment    0
rate       0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 



In [None]:
# handle some conflicts in the dataset structure
# a more careful solution is possible; for the sake of simplicity,
# we just remove the problematic rows

data = data.dropna(subset=['rate'])
data = data.dropna(subset=['comment'])
data = data.drop_duplicates(subset=['comment'], keep='first')
data = data.reset_index(drop=True)

# previous information after solving the conflicts

# print data information
print('data information')
print(data.info(), '\n')

# print missing values information
print('missing values stats')
print(data.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data[data['rate'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24595 entries, 0 to 24594
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  24595 non-null  object
 1   rate     24595 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 384.4+ KB
None 

missing values stats
comment    0
rate       0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 



In [None]:
# handle some conflicts in the dataset structure
# a more careful solution is possible; for the sake of simplicity,
# we just remove the problematic rows

data_test = data_test.dropna(subset=['rate'])
data_test = data_test.dropna(subset=['comment'])
data_test = data_test.drop_duplicates(subset=['comment'], keep='first')
data_test = data_test.reset_index(drop=True)

# previous information after solving the conflicts

# print data information
print('data information')
print(data_test.info(), '\n')

# print missing values information
print('missing values stats')
print(data_test.isnull().sum(), '\n')

# print some missing values
print('some missing values')
print(data_test[data_test['rate'].isnull()].iloc[:5], '\n')

data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1834 entries, 0 to 1833
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   comment  1834 non-null   object
 1   rate     1834 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 28.8+ KB
None 

missing values stats
comment    0
rate       0
dtype: int64 

some missing values
Empty DataFrame
Columns: [comment, rate]
Index: [] 



### Normalization / Preprocessing

Comments vary in length when measured in words. Finding the range that covers most comments helps us choose the maximum sequence length for the preprocessing step. We also assume a lower bound of 3 words for a phrase to be meaningful for training; the filter below keeps comments with more than 3 and at most 256 words.
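The length window applied in the cells that follow can be sketched in plain Python. The helper name `in_length_window` and the sample word counts are illustrative assumptions; the bounds mirror the notebook's `minlim < length <= maxlim` comparison, so the lower bound is exclusive.

```python
def in_length_window(n_words, minlim=3, maxlim=256):
    """Return True when minlim < n_words <= maxlim (lower bound is exclusive)."""
    return minlim < n_words <= maxlim


# Hypothetical word counts, for illustration only.
sample_lengths = [1, 3, 4, 120, 256, 257, 339]
kept = [n for n in sample_lengths if in_length_window(n)]
coverage = 100 * len(kept) / len(sample_lengths)

print(kept)
print(f"{coverage:.2f}% of the sample falls inside the window")
```

A comment with exactly 3 words is dropped under this comparison, while one with exactly 256 words is kept.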

In [None]:
# calculate the length of comments based on their words
data['comment_len_by_words'] = data['comment'].apply(lambda t: len(hazm.word_tokenize(t)))

In [None]:
# calculate the length of comments based on their words
data_test['comment_len_by_words'] = data_test['comment'].apply(lambda t: len(hazm.word_tokenize(t)))

In [None]:
min_max_len = data["comment_len_by_words"].min(), data["comment_len_by_words"].max()
print(f'Min: {min_max_len[0]} \tMax: {min_max_len[1]}')

Min: 1 	Max: 339


In [None]:
min_max_len = data_test["comment_len_by_words"].min(), data_test["comment_len_by_words"].max()
print(f'Min: {min_max_len[0]} \tMax: {min_max_len[1]}')

Min: 1 	Max: 338


In [None]:
def data_gl_than(data, less_than=100.0, greater_than=0.0, col='comment_len_by_words'):
    data_length = data[col].values

    data_glt = sum([1 for length in data_length if greater_than < length <= less_than])

    data_glt_rate = (data_glt / len(data_length)) * 100

    print(f'Texts with word length of greater than {greater_than} and less than {less_than} includes {data_glt_rate:.2f}% of the whole!')

In [None]:
data_gl_than(data, 256, 3)

Texts with word length of greater than 3 and less than 256 includes 97.67% of the whole!


In [None]:
data_gl_than(data_test, 256, 3)

Texts with word length of greater than 3 and less than 256 includes 97.55% of the whole!


In [None]:
minlim, maxlim = 3, 256

In [None]:
# keep comments with more than `minlim` (3) and at most `maxlim` (256) words; drop the rest
data['comment_len_by_words'] = data['comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
data = data.dropna(subset=['comment_len_by_words'])
data = data.reset_index(drop=True)

In [None]:
# keep comments with more than `minlim` (3) and at most `maxlim` (256) words; drop the rest
data_test['comment_len_by_words'] = data_test['comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
data_test = data_test.dropna(subset=['comment_len_by_words'])
data_test = data_test.reset_index(drop=True)

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=data['comment_len_by_words']
))

fig.update_layout(
    title_text='Distribution of word counts within comments',
    xaxis_title_text='Word Count',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=data_test['comment_len_by_words']
))

fig.update_layout(
    title_text='Distribution of word counts within comments',
    xaxis_title_text='Word Count',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
unique_rates = list(sorted(data['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #5: [-2, -1, 0, 1, 2]


In [None]:
unique_rates = list(sorted(data_test['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #5: [-2, -1, 0, 1, 2]


In [None]:
fig = go.Figure()

groupby_rate = data.groupby('rate')['rate'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_rate.index)),
    y=groupby_rate.tolist(),
    text=groupby_rate.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of rate within comments',
    xaxis_title_text='Rate',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
fig = go.Figure()

groupby_rate = data_test.groupby('rate')['rate'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_rate.index)),
    y=groupby_rate.tolist(),
    text=groupby_rate.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of rate within comments',
    xaxis_title_text='Rate',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

- `rate` = -2 means `furious`
- `rate` = -1 means `angry`
- `rate` = 0 means `neutral`
- `rate` = 1 means `happy`
- `rate` = 2 means `delighted`

In [None]:
def rate_to_label(rate):
    if rate == -2:
        return 'furious'
    elif rate == -1:
        return 'angry'
    elif rate == 0:
        return 'neutral'
    elif rate == 1:
        return 'happy'
    elif rate == 2:
        return 'delighted'

data['label'] = data['rate'].apply(lambda t: rate_to_label(t))
labels = list(sorted(data['label'].unique()))
data.head()

Unnamed: 0,comment,rate,comment_len_by_words,label
0,سلام خيلي خوبه بخرين.,2,5.0,delighted
1,از جمله قابلیت‌های ارتباطی HTC Desire SV می‌تو...,0,29.0,neutral
2,نهایتا، یک دوربین VGA نیز برای انجام مکالمات ...,0,18.0,neutral
3,من حدوداً ۱ ماهي‌ که مي‌شه اين گوشي رو دارم، ر...,1,91.0,happy
4,اندازه نسبتاً مناسب و وزن خوب 4.,1,7.0,happy


In [None]:
data_test['label'] = data_test['rate'].apply(lambda t: rate_to_label(t))
labels = list(sorted(data_test['label'].unique()))
data_test.head()

Unnamed: 0,comment,rate,comment_len_by_words,label
0,با اين چيزا نميتونه از Galaxy S III بهتر باشه,-1,10.0,angry
1,سرعت اجرا بسيار بالا است و مصرف باتري نيز مناس...,2,12.0,delighted
2,از حساسیت 400 مقداری نویز در عکس ها مشاهده می ...,1,21.0,happy
3,در کل، با اینکه عکاسی با تبلت را همواره جزو م...,1,42.0,happy
4,به هر صورت دیدن یک نمایشگری لمسی بر روی دوربین...,2,17.0,delighted


Cleaning is the final step in this section. Our cleaning method includes these steps:

- fixing Unicode issues
- removing special items such as phone numbers, emails, URLs, and new lines
- stripping HTML tags
- normalizing
- removing emojis

We also use the [Hazm](https://github.com/sobhe/hazm) library.

Furthermore, `InformalNormalizer` could be used instead of `Normalizer` and may give a better model, but it takes much longer to process.

Using `InformalNormalizer` also requires the `norm_words` function.
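Two of the steps listed above, stripping HTML tags and collapsing whitespace, can be sketched with the standard `re` module alone. The helper names `strip_html` and `collapse_whitespace`, the sample string, and the trailing `.strip()` are assumptions made for this illustration; the notebook's own `cleaning` function below also relies on `cleantext` and Hazm, which are not reproduced here.

```python
import re


def strip_html(raw_text):
    """Remove anything that looks like an HTML tag (non-greedy match)."""
    return re.sub(r"<.*?>", "", raw_text)


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical input, for illustration only.
sample = "<p>Great   phone!</p>\n<br/> #recommended"
cleaned = collapse_whitespace(strip_html(sample).replace("#", ""))
print(cleaned)
```

The non-greedy `<.*?>` pattern stops at the first `>`, so each tag is removed separately rather than everything between the first `<` and the last `>`.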

In [None]:
def norm_words(words):
    string = ''
    for i in range(len(words[0]) - 1):
        string = string + words[0][i][0] + ' '
    string = string + words[0][len(words[0]) - 1][0]

    return string

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext


def cleaning(text):
    text = text.strip()
    
    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    # cleaning htmls
    text = cleanhtml(text)
    
    # normalizing
    # normalizer = hazm.InformalNormalizer(seperation_flag=True)
    normalizer = hazm.Normalizer()
    text = normalizer.normalize(text)

    #text = norm_words(text)
    
    # removing weird patterns (emojis, pictographs, directional marks)
    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        # u"\u200c"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)
    
    text = wierd_pattern.sub(r'', text)
    
    # removing hashtags and extra spaces
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)
    
    return text

In [None]:
# cleaning comments
data['cleaned_comment'] = data['comment'].apply(cleaning)


# calculate the length of comments based on their words
data['cleaned_comment_len_by_words'] = data['cleaned_comment'].apply(lambda t: len(hazm.word_tokenize(t)))

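# intended to re-apply the length filter after cleaning; as written, both branches
# of the lambda return len_t, so no rows are dropped here (use `else None` to filter)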
data['cleaned_comment_len_by_words'] = data['cleaned_comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else len_t)
data = data.dropna(subset=['cleaned_comment_len_by_words'])
data = data.reset_index(drop=True)

data.head()

Unnamed: 0,comment,rate,comment_len_by_words,label,cleaned_comment,cleaned_comment_len_by_words
0,سلام خيلي خوبه بخرين.,2,5.0,delighted,سلام خیلی خوبه بخرین.,5
1,از جمله قابلیت‌های ارتباطی HTC Desire SV می‌تو...,0,29.0,neutral,از جمله قابلیت‌های ارتباطی htc desire sv می‌تو...,29
2,نهایتا، یک دوربین VGA نیز برای انجام مکالمات ...,0,18.0,neutral,نهایتا، یک دوربین vga نیز برای انجام مکالمات ت...,18
3,من حدوداً ۱ ماهي‌ که مي‌شه اين گوشي رو دارم، ر...,1,91.0,happy,من حدودا ۱ ماهی‌ که می‌شه این گوشی رو دارم، را...,89
4,اندازه نسبتاً مناسب و وزن خوب 4.,1,7.0,happy,اندازه نسبتا مناسب و وزن خوب ۴.,7


In [None]:
# cleaning comments
data_test['cleaned_comment'] = data_test['comment'].apply(cleaning)


# calculate the length of comments based on their words
data_test['cleaned_comment_len_by_words'] = data_test['cleaned_comment'].apply(lambda t: len(hazm.word_tokenize(t)))

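# intended to re-apply the length filter after cleaning; as written, both branches
# of the lambda return len_t, so no rows are dropped here (use `else None` to filter)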
data_test['cleaned_comment_len_by_words'] = data_test['cleaned_comment_len_by_words'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else len_t)
data_test = data_test.dropna(subset=['cleaned_comment_len_by_words'])
data_test = data_test.reset_index(drop=True)

data_test.head()

Unnamed: 0,comment,rate,comment_len_by_words,label,cleaned_comment,cleaned_comment_len_by_words
0,با اين چيزا نميتونه از Galaxy S III بهتر باشه,-1,10.0,angry,با این چیزا نمیتونه از galaxy s iii بهتر باشه,10
1,سرعت اجرا بسيار بالا است و مصرف باتري نيز مناس...,2,12.0,delighted,سرعت اجرا بسیار بالا است و مصرف باتری نیز مناس...,12
2,از حساسیت 400 مقداری نویز در عکس ها مشاهده می ...,1,21.0,happy,از حساسیت ۴۰۰ مقداری نویز در عکس‌ها مشاهده می‌...,18
3,در کل، با اینکه عکاسی با تبلت را همواره جزو م...,1,42.0,happy,در کل، با اینکه عکاسی با تبلت را همواره جزو مو...,42
4,به هر صورت دیدن یک نمایشگری لمسی بر روی دوربین...,2,17.0,delighted,به هر صورت دیدن یک نمایشگری لمسی بر روی دوربین...,17


In [None]:
data = data[['cleaned_comment', 'label']]
data.columns = ['comment', 'label']
data.head()

Unnamed: 0,comment,label
0,سلام خیلی خوبه بخرین.,delighted
1,از جمله قابلیت‌های ارتباطی htc desire sv می‌تو...,neutral
2,نهایتا، یک دوربین vga نیز برای انجام مکالمات ت...,neutral
3,من حدودا ۱ ماهی‌ که می‌شه این گوشی رو دارم، را...,happy
4,اندازه نسبتا مناسب و وزن خوب ۴.,happy


In [None]:
data_test = data_test[['cleaned_comment', 'label']]
data_test.columns = ['comment', 'label']
data_test.head()

Unnamed: 0,comment,label
0,با این چیزا نمیتونه از galaxy s iii بهتر باشه,angry
1,سرعت اجرا بسیار بالا است و مصرف باتری نیز مناس...,delighted
2,از حساسیت ۴۰۰ مقداری نویز در عکس‌ها مشاهده می‌...,happy
3,در کل، با اینکه عکاسی با تبلت را همواره جزو مو...,happy
4,به هر صورت دیدن یک نمایشگری لمسی بر روی دوربین...,delighted


In [None]:
print(f'We have #{len(labels)} labels: {labels}')

We have #5 labels: ['angry', 'delighted', 'furious', 'happy', 'neutral']


### Review Data

In [None]:
fig = go.Figure()

groupby_label = data.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [None]:
fig = go.Figure()

groupby_label = data_test.groupby('label')['label'].count()

fig.add_trace(go.Bar(
    x=list(sorted(groupby_label.index)),
    y=groupby_label.tolist(),
    text=groupby_label.tolist(),
    textposition='auto'
))

fig.update_layout(
    title_text='Distribution of label within comments [DATA]',
    xaxis_title_text='Label',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

## Train, Validation, Test Split

To obtain a model that generalizes well, we split the cleaned dataset into train and validation sets. We use a ratio of **0.1** for the *valid* set. For splitting, we use `train_test_split` from scikit-learn, stratifying on the label to preserve the label distribution.

The test set is separate from the train and valid sets and comes from its own file.
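To make the idea of stratification concrete, here is a minimal pure-Python sketch that splits each label group separately so both parts keep roughly the same label shares. The function `stratified_split` and the toy label list are assumptions for illustration; this is not scikit-learn's implementation, and the notebook itself uses `train_test_split(..., stratify=...)`.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_ratio=0.1, seed=1):
    """Split indices so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_ratio)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# Hypothetical, imbalanced label list for illustration.
toy_labels = ["happy"] * 70 + ["neutral"] * 20 + ["angry"] * 10
train_idx, valid_idx = stratified_split(toy_labels)
print(len(train_idx), len(valid_idx))
```

Without stratification, a random 10% sample of an imbalanced dataset could end up with very few examples of a rare class in the validation set.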

In [None]:
data['label_id'] = data['label'].apply(lambda t: labels.index(t))
data_test['label_id'] = data_test['label'].apply(lambda t: labels.index(t))

train, valid = train_test_split(data, test_size=0.1, random_state=1, stratify=data['label'])
test = data_test

train = train.reset_index(drop=True)
valid = valid.reset_index(drop=True)
test = test.reset_index(drop=True)

x_train, y_train = train['comment'].values.tolist(), train['label_id'].values.tolist()
x_valid, y_valid = valid['comment'].values.tolist(), valid['label_id'].values.tolist()
x_test, y_test = test['comment'].values.tolist(), test['label_id'].values.tolist()

print(train.shape)
print(valid.shape)
print(test.shape)

(21619, 3)
(2403, 3)
(1789, 3)


### Saving train, valid and test sets to drive

In [None]:
train.to_csv('train.csv')
valid.to_csv('valid.csv')
test.to_csv('test.csv')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import shutil

shutil.copy("train.csv", "/content/gdrive/MyDrive/finetuned_parsbert_sentipers_10/train.csv")
shutil.copy("valid.csv", "/content/gdrive/MyDrive/finetuned_parsbert_sentipers_10/valid.csv")
shutil.copy("test.csv", "/content/gdrive/MyDrive/finetuned_parsbert_sentipers_10/test.csv")

'/content/gdrive/MyDrive/finetuned_parsbert_sentipers_10/test.csv'

## Implement model with PyTorch

We will implement the model using *PyTorch*.

![BERT INPUTS](https://res.cloudinary.com/m3hrdadfi/image/upload/v1595158991/kaggle/bert_inputs_w8rith.png)

The BERT model input is the sum of three embeddings:
- Token embeddings: drawn from the WordPiece token vocabulary (WordPiece is a subword segmentation algorithm, similar to BPE)
- Segment embeddings: for sentence pairs [A-B], $E_A$ or $E_B$ marks whether a token belongs to the first or the second sentence
- Position embeddings: specify the position of each token in the sequence
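The way the three embeddings are combined can be sketched with toy lookup tables in plain Python. Everything here (table sizes, random values, the `embed` helper) is a hypothetical illustration; the real model uses learned tables and follows the sum with layer normalization and dropout, which this sketch omits.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy hidden size; BERT-base uses 768


def make_table(n_rows):
    """Create a toy embedding table with random values."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(n_rows)]


token_table = make_table(10)    # one row per vocabulary entry
segment_table = make_table(2)   # sentence A or sentence B
position_table = make_table(8)  # one row per position


def embed(token_ids, segment_ids):
    """Sum token, segment, and position embeddings for each input position."""
    output = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = (token_table[tok], segment_table[seg], position_table[position])
        output.append([sum(values) for values in zip(*rows)])
    return output


vectors = embed(token_ids=[2, 7, 5, 4], segment_ids=[0, 0, 0, 0])
print(len(vectors), len(vectors[0]))
```

For single-sentence inputs such as the comments in this notebook, every segment id is 0, so the segment embedding adds the same vector to every position.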

In [None]:
from transformers import BertConfig, BertTokenizer
from transformers import BertModel

from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

import torch
import torch.nn as nn
import torch.nn.functional as F

### Configuration

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'device: {device}')

train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

device: cuda:0
CUDA is available!  Training on GPU ...


In [None]:
# general config
MAX_LEN = 128
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16

EPOCHS = 5
EEVERY_EPOCH = 1000
LEARNING_RATE = 2e-5
CLIP = 0.0

MODEL_NAME_OR_PATH = 'HooshvareLab/bert-fa-base-uncased'
OUTPUT_PATH = '/content/bert-fa-base-uncased-sentiment/pytorch_model.bin'

os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

In [None]:
# build label-to-id and id-to-label lookups

label2id = {label: i for i, label in enumerate(labels)}
id2label = {v: k for k, v in label2id.items()}

print(f'label2id: {label2id}')
print(f'id2label: {id2label}')

label2id: {'angry': 0, 'delighted': 1, 'furious': 2, 'happy': 3, 'neutral': 4}
id2label: {0: 'angry', 1: 'delighted', 2: 'furious', 3: 'happy', 4: 'neutral'}


In [None]:
# setup the tokenizer and configuration

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME_OR_PATH)
config = BertConfig.from_pretrained(
    MODEL_NAME_OR_PATH, **{
        'label2id': label2id,
        'id2label': id2label,
    })

print(config.to_json_string())

Downloading:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440 [00:00<?, ?B/s]

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "angry",
    "1": "delighted",
    "2": "furious",
    "3": "happy",
    "4": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "angry": 0,
    "delighted": 1,
    "furious": 2,
    "happy": 3,
    "neutral": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 100000
}



### Input Embeddings

In [None]:
idx = np.random.randint(0, len(train))
sample_comment = train.iloc[idx]['comment']
sample_label = train.iloc[idx]['label']

print(f'Sample: \n{sample_comment}\n{sample_label}')

Sample: 
گوشی بسیار زیبا کارائی است من خریدم والان این نظرو با اون نوشتم وسند کردم.
happy


In [None]:
tokens = tokenizer.tokenize(sample_comment)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f'  Comment: {sample_comment}')
print(f'   Tokens: {tokenizer.convert_tokens_to_string(tokens)}')
print(f'Token IDs: {token_ids}')

  Comment: گوشی بسیار زیبا کارائی است من خریدم والان این نظرو با اون نوشتم وسند کردم.
   Tokens: گوشی بسیار زیبا کارايی است من خریدم والان این نظرو با اون نوشتم وسند کردم .
Token IDs: [4013, 3177, 5170, 45729, 2806, 2842, 38993, 60358, 2802, 3138, 2005, 2799, 5536, 21825, 4236, 2790, 5501, 1012]


In [None]:
encoding = tokenizer.encode_plus(
    sample_comment,
    max_length=32,
    truncation=True,
    add_special_tokens=True, # Add '[CLS]' and '[SEP]'
    return_token_type_ids=True,
    return_attention_mask=True,
    padding='max_length',
    return_tensors='pt',  # Return PyTorch tensors
)

print(f'Keys: {encoding.keys()}\n')
for k in encoding.keys():
    print(f'{k}:\n{encoding[k]}')

Keys: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

input_ids:
tensor([[    2,  4013,  3177,  5170, 45729,  2806,  2842, 38993, 60358,  2802,
          3138,  2005,  2799,  5536, 21825,  4236,  2790,  5501,  1012,     4,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
token_type_ids:
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
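The padding and attention mask in the output above can be reproduced by hand. This sketch assumes the special-token ids visible in that output (2 at the start, 4 after the last token, 0 for padding); the helper `pad_and_mask` is hypothetical and only mimics what the tokenizer does for a single sentence.

```python
def pad_and_mask(token_ids, max_length, cls_id=2, sep_id=4, pad_id=0):
    """Add special tokens, truncate to max_length, then pad and build the mask."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# Token ids copied from the sample comment printed earlier.
token_ids = [4013, 3177, 5170, 45729, 2806, 2842, 38993, 60358, 2802,
             3138, 2005, 2799, 5536, 21825, 4236, 2790, 5501, 1012]
input_ids, attention_mask = pad_and_mask(token_ids, max_length=32)
print(sum(attention_mask), len(input_ids))
```

The mask holds 1 for real tokens (including the two special tokens) and 0 for padding, so the model can ignore padded positions during attention.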


### Dataset

In [None]:
class Dataset(torch.utils.data.Dataset):
    """ Create a PyTorch dataset. """

    def __init__(self, tokenizer, comments, targets=None, label_list=None, max_len=128):
        self.comments = comments
        self.targets = targets
        self.has_target = isinstance(targets, list) or isinstance(targets, np.ndarray)

        self.tokenizer = tokenizer
        self.max_len = max_len

        
        self.label_map = {label: i for i, label in enumerate(label_list)} if isinstance(label_list, list) else {}
    
    def __len__(self):
        return len(self.comments)

    def __getitem__(self, item):
        comment = str(self.comments[item])

        if self.has_target:
            target = self.label_map.get(str(self.targets[item]), str(self.targets[item]))

        encoding = self.tokenizer.encode_plus(
            comment,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt')
        
        inputs = {
            'comment': comment,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
        }

        if self.has_target:
            inputs['targets'] = torch.tensor(target, dtype=torch.long)
        
        return inputs


def create_data_loader(x, y, tokenizer, max_len, batch_size, label_list):
    dataset = Dataset(
        comments=x,
        targets=y,
        tokenizer=tokenizer,
        max_len=max_len, 
        label_list=label_list)
    
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size)

In [None]:
label_list = ['furious', 'angry', 'neutral', 'happy', 'delighted']
train_data_loader = create_data_loader(train['comment'].to_numpy(), train['label'].to_numpy(), tokenizer, MAX_LEN, TRAIN_BATCH_SIZE, label_list)
valid_data_loader = create_data_loader(valid['comment'].to_numpy(), valid['label'].to_numpy(), tokenizer, MAX_LEN, VALID_BATCH_SIZE, label_list)
test_data_loader = create_data_loader(test['comment'].to_numpy(), None, tokenizer, MAX_LEN, TEST_BATCH_SIZE, label_list)

In [None]:
sample_data = next(iter(train_data_loader))

print(sample_data.keys())

print(sample_data['comment'])
print(sample_data['input_ids'].shape)
print(sample_data['input_ids'][0, :])
print(sample_data['attention_mask'].shape)
print(sample_data['attention_mask'][0, :])
print(sample_data['token_type_ids'].shape)
print(sample_data['token_type_ids'][0, :])
print(sample_data['targets'].shape)
print(sample_data['targets'][0])

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids', 'targets'])
['کارایی سایر سخت افزارها: در حد بسیار خوب', 'برجسته\u200cترین ویژگی eprnt است.', 'دوستی که میگی دوربینش بعد از آپدیت ۱۲ مگا شده ۱۲ مگا رو حالت hdr هست، بذار رو حالت نرمال میشه ۱۳ لطفا اول مطمئن شو بعد به دیگران توصیه کن که آپدیت نکنن', 'با فشار دادن دکمه direct backup، دوربین فیلم\u200cها را از طریق کابل usb به کامپیوتر منتقل می\u200cکند تا حافظه دوربین پاک شود.', 'دوما: نورگیری ان بسیار کم است ودر شرایط نور کم تصویر بسیار بی کیفیت میشود.', 'پس مطمئننا می\u200cشه ۳g رو هم به همین روش و به راحتی روش داشت.', 'تو این چند مدتی که دارمش فقط باتریش اذیتم کرده که شاید به خاطر عکس و فیلم\u200cهای عالیه که باهاش میگیرم و واقعا میشه گفت که با امکاناتی که اپراتورها تو ایران ارائه میدن چیزی کم نداره.', 'مود av درست بر عکس tv است و اختیار تنظیم سرعت شاتر را از شما می\u200cگیرد تا دیافراگم اختیار اصلی شما باشد.', 'اما در کل، کار با این تبلت که سرعت کار بسیار خوبی را در کنار زمان تاخیر ایجاد می\u200cکند، بسیار لذت بخش

In [None]:
sample_test = next(iter(test_data_loader))
print(sample_test.keys())

dict_keys(['comment', 'input_ids', 'attention_mask', 'token_type_ids'])


### Model

While building the model, you may occasionally hit the error shown below, which reports that all CUDA memory is in use. There are several ways to resolve it; the simplest is to clear the CUDA cache.

![Cuda-Error](https://res.cloudinary.com/m3hrdadfi/image/upload/v1599979552/kaggle/cuda-error_iyqh4o.png)


**Simple Solution**
```python
import torch, gc

gc.collect()
torch.cuda.empty_cache()

!nvidia-smi
```

In [None]:
class SentimentModel(nn.Module):

    def __init__(self, config):
        super(SentimentModel, self).__init__()

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
    
    def forward(self, input_ids, attention_mask, token_type_ids):
        _, pooled_output = self.bert(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            token_type_ids=token_type_ids,
            return_dict=False)
        
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits 

In [None]:
import torch, gc

# drop the reference to any previous model first so its memory can be reclaimed
pt_model = None
gc.collect()
torch.cuda.empty_cache()

!nvidia-smi

Mon Jul 18 10:00:26 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
pt_model = SentimentModel(config=config)
pt_model = pt_model.to(device)

print('pt_model', type(pt_model))

Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


pt_model <class '__main__.SentimentModel'>


In [None]:
# sample data output

sample_data_comment = sample_data['comment']
sample_data_input_ids = sample_data['input_ids']
sample_data_attention_mask = sample_data['attention_mask']
sample_data_token_type_ids = sample_data['token_type_ids']
sample_data_targets = sample_data['targets']

# move the tensors to the selected device (GPU if available)
sample_data_input_ids = sample_data_input_ids.to(device)
sample_data_attention_mask = sample_data_attention_mask.to(device)
sample_data_token_type_ids = sample_data_token_type_ids.to(device)
sample_data_targets = sample_data_targets.to(device)


# outputs = F.softmax(
#     pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids), 
#     dim=1)

outputs = pt_model(sample_data_input_ids, sample_data_attention_mask, sample_data_token_type_ids)
_, preds = torch.max(outputs, dim=1)

print(outputs[:5, :])
print(preds[:5])

tensor([[ 0.1237,  0.0751,  0.1817,  0.2848, -0.3968],
        [-0.2387,  0.1895, -0.0182,  0.5969, -0.1014],
        [ 0.0417, -0.1325,  0.2056,  0.2076, -0.1469],
        [ 0.0884,  0.0290, -0.0884,  0.3500, -0.0504],
        [-0.1372,  0.1787, -0.1543,  0.6033, -0.2221]], device='cuda:0',
       grad_fn=<SliceBackward0>)
tensor([3, 3, 3, 3, 3], device='cuda:0')
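
The model returns raw logits rather than probabilities, so `torch.max(outputs, dim=1)` simply picks the class with the largest logit. Because softmax preserves ordering, the same index wins after normalization. Below is a minimal pure-Python sketch using the first row of the output above; the `softmax` helper is illustrative and not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of the logits printed above.
row = [0.1237, 0.0751, 0.1817, 0.2848, -0.3968]
probs = softmax(row)

# Index of the largest value, before and after normalization.
pred_from_logits = max(range(len(row)), key=lambda i: row[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(pred_from_logits, pred_from_probs)
```

Both indices agree, which is why the commented-out `F.softmax` call is not needed for picking a class; it only matters when calibrated-looking probabilities are wanted.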


### Training

This section covers the training and optimization process.

During optimization, the best model is saved to Google Drive.

Best model: the model with the minimum validation loss seen during optimization.
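
The "keep the best checkpoint" rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `track_best` and `save_fn` are hypothetical names, and the (step, loss) pairs are made-up values, not results from this notebook.

```python
def track_best(eval_losses, save_fn):
    """Call save_fn(step, loss) whenever the validation loss
    is less than or equal to the best value seen so far."""
    best = float("inf")
    for step, loss in eval_losses:
        if loss <= best:
            save_fn(step, loss)
            best = loss
    return best

saved_steps = []
# Hypothetical (step, validation loss) pairs for illustration only.
losses = [(100, 0.90), (200, 0.75), (300, 0.80), (400, 0.70)]
best_loss = track_best(losses, lambda step, loss: saved_steps.append(step))

print(saved_steps, best_loss)
```

A checkpoint is written only at steps where the loss matches or improves on the running minimum, so later, worse evaluations never overwrite the saved weights.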

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_sentipers

finetuned_parsbert_sentipers.pt  test.csv  train.csv  valid.csv


In [None]:
def simple_accuracy(y_true, y_pred):
    return (y_true == y_pred).mean()

def acc_and_f1(y_true, y_pred, average='weighted'):
    acc = simple_accuracy(y_true, y_pred)
    f1 = f1_score(y_true=y_true, y_pred=y_pred, average=average)
    return {
        "acc": acc,
        "f1": f1,
    }

def y_loss(y_true, y_pred, losses):
    y_true = torch.stack(y_true).cpu().detach().numpy()
    y_pred = torch.stack(y_pred).cpu().detach().numpy()
    y = [y_true, y_pred]
    loss = np.mean(losses)

    return y, loss


def eval_op(model, data_loader, loss_fn):
    model.eval()

    losses = []
    y_pred = []
    y_true = []

    with torch.no_grad():
        for dl in tqdm(data_loader, total=len(data_loader), desc="Evaluation... "):
            
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']
            targets = dl['targets']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            targets = targets.to(device)

            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # pick the class with the highest logit as the prediction
            _, preds = torch.max(outputs, dim=1)

            # calculate the batch loss
            loss = loss_fn(outputs, targets)

            # accumulate all the losses
            losses.append(loss.item())

            y_pred.extend(preds)
            y_true.extend(targets)
    
    eval_y, eval_loss = y_loss(y_true, y_pred, losses)
    return eval_y, eval_loss


def train_op(model, 
             data_loader, 
             loss_fn, 
             optimizer, 
             scheduler, 
             step=0, 
             print_every_step=100, 
             eval=False,
             eval_cb=None,
             eval_loss_min=np.inf,
             eval_data_loader=None, 
             clip=0.0):
    
    model.train()

    losses = []
    y_pred = []
    y_true = []

    for dl in tqdm(data_loader, total=len(data_loader), desc="Training... "):
        step += 1

        input_ids = dl['input_ids']
        attention_mask = dl['attention_mask']
        token_type_ids = dl['token_type_ids']
        targets = dl['targets']

        # move tensors to GPU if CUDA is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        targets = targets.to(device)

        # clear the gradients of all optimized variables
        optimizer.zero_grad()

        # compute predicted outputs by passing inputs to the model
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids)
        
        # pick the class with the highest logit as the prediction
        _, preds = torch.max(outputs, dim=1)

        # calculate the batch loss
        loss = loss_fn(outputs, targets)

        # accumulate all the losses
        losses.append(loss.item())

        # compute gradient of the loss with respect to model parameters
        loss.backward()

        # `clip_grad_norm_` rescales the gradients to guard against exploding gradients.
        if clip > 0.0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)

        # perform optimization step
        optimizer.step()

        # perform scheduler step
        scheduler.step()

        y_pred.extend(preds)
        y_true.extend(targets)

        if eval:
            train_y, train_loss = y_loss(y_true, y_pred, losses)
            train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')

            if step % print_every_step == 0:
                eval_y, eval_loss = eval_op(model, eval_data_loader, loss_fn)
                eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')

                if hasattr(eval_cb, '__call__'):
                    eval_loss_min = eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min)

    train_y, train_loss = y_loss(y_true, y_pred, losses)

    return train_y, train_loss, step, eval_loss_min
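
When `clip > 0.0`, `nn.utils.clip_grad_norm_` rescales all gradients together whenever their combined L2 norm exceeds `max_norm`. The following pure-Python sketch shows that idea on plain lists of numbers. The helper name `clip_by_global_norm` and the sample gradients are assumptions made for illustration; the real function works in place on parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scale factor is capped at 1.0, so small gradients are left unchanged.
    coef = min(1.0, max_norm / (total_norm + eps))
    return [[g * coef for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # joint L2 norm is 5.0

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# A threshold above the current norm leaves the gradients as they are.
unchanged, _ = clip_by_global_norm(grads, max_norm=10.0)

print(norm_before, round(norm_after, 3))
```

The direction of the update is preserved; only its magnitude is reduced.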

In [None]:
optimizer = AdamW(pt_model.parameters(), lr=LEARNING_RATE, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

loss_fn = nn.CrossEntropyLoss()

step = 0
eval_loss_min = np.inf
history = collections.defaultdict(list)


def eval_callback(epoch, epochs, output_path):
    def eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min):
        statement = ''
        statement += 'Epoch: {}/{}...'.format(epoch, epochs)
        statement += 'Step: {}...'.format(step)
        
        statement += 'Train Loss: {:.6f}...'.format(train_loss)
        statement += 'Train Acc: {:.3f}...'.format(train_score['acc'])

        statement += 'Valid Loss: {:.6f}...'.format(eval_loss)
        statement += 'Valid Acc: {:.3f}...'.format(eval_score['acc'])

        print(statement)

        if eval_loss <= eval_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                eval_loss_min,
                eval_loss))
            
            # keep a copy of the best weights on Google Drive
            path = "/content/gdrive/MyDrive/finetuned_parsbert_sentipers/finetuned_parsbert_sentipers.pt"
            torch.save(model.state_dict(), path)

            # also save the best weights to the local output path
            torch.save(model.state_dict(), output_path)
            eval_loss_min = eval_loss
        
        return eval_loss_min


    return eval_cb


for epoch in tqdm(range(1, EPOCHS + 1), desc="Epochs... "):
    train_y, train_loss, step, eval_loss_min = train_op(
        model=pt_model, 
        data_loader=train_data_loader, 
        loss_fn=loss_fn, 
        optimizer=optimizer, 
        scheduler=scheduler, 
        step=step, 
        print_every_step=EEVERY_EPOCH, 
        eval=True,
        eval_cb=eval_callback(epoch, EPOCHS, OUTPUT_PATH),
        eval_loss_min=eval_loss_min,
        eval_data_loader=valid_data_loader, 
        clip=CLIP)
    
    train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')
    
    eval_y, eval_loss = eval_op(
        model=pt_model, 
        data_loader=valid_data_loader, 
        loss_fn=loss_fn)
    
    eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')
    
    history['train_acc'].append(train_score['acc'])
    history['train_loss'].append(train_loss)
    history['val_acc'].append(eval_score['acc'])
    history['val_loss'].append(eval_loss)





Epochs... :   0%|          | 0/10 [00:00<?, ?it/s]

Training... :   0%|          | 0/1352 [00:00<?, ?it/s]

Evaluation... :   0%|          | 0/151 [00:00<?, ?it/s]

Epoch: 2/10...Step: 1500...Train Loss: 0.551802...Train Acc: 0.785...Valid Loss: 0.709230...Valid Acc: 0.722...
Validation loss decreased (inf --> 0.709230).  Saving model ...
Epoch: 3/10...Step: 3000...Train Loss: 0.319323...Train Acc: 0.888...Valid Loss: 0.760522...Valid Acc: 0.759...
Epoch: 4/10...Step: 4500...Train Loss: 0.190965...Train Acc: 0.936...Valid Loss: 0.856389...Valid Acc: 0.762...
Epoch: 5/10...Step: 6000...Train Loss: 0.119861...Train Acc: 0.959...Valid Loss: 0.925032...Valid Acc: 0.782...
Epoch: 6/10...Step: 7500...Train Loss: 0.092378...Train Acc: 0.970...Valid Loss: 0.963946...Valid Acc: 0.783...
Epoch: 7/10...Step: 9000...Train Loss: 0.067033...Train Acc: 0.977...Valid Loss: 0.976236...Valid Acc: 0.780...
Epoch: 8/10...Step: 10500...Train Loss: 0.050957...Train Acc: 0.981...Valid Loss: 1.047276...Valid Acc: 0.787...
Epoch: 9/10...Step: 12000...Train Loss: 0.034808...Train Acc: 0.987...Valid Loss: 1.086653...Valid Acc: 0.788...
Epoch: 10/10...Step: 13500...Train Loss: 0.029874...Train Acc: 0.988...Valid Loss: 1.116085...Valid Acc: 0.789...

Evaluation... :   0%|          | 0/151 [00:00<?, ?it/s]
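
With `num_warmup_steps=0`, `get_linear_schedule_with_warmup` decays the learning rate linearly from its initial value to zero over `total_steps`. Below is a sketch of the multiplier it applies to the base learning rate, assuming 1352 training batches per epoch and 10 epochs, as in the progress bars above. The `base_lr` value is a placeholder, since `LEARNING_RATE` is defined elsewhere in the notebook.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total_steps = 1352 * 10   # batches per epoch * epochs
base_lr = 2e-5            # placeholder value, not taken from the notebook

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

Because `scheduler.step()` is called once per batch in `train_op`, the learning rate shrinks a little after every optimizer update rather than once per epoch.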

### Prediction

In [None]:
def predict(model, comments, tokenizer, max_len=128, batch_size=32):
    data_loader = create_data_loader(comments, None, tokenizer, max_len, batch_size, None)
    
    predictions = []
    prediction_probs = []

    
    model.eval()
    with torch.no_grad():
        for dl in tqdm(data_loader, position=0):
            input_ids = dl['input_ids']
            attention_mask = dl['attention_mask']
            token_type_ids = dl['token_type_ids']

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            
            # compute predicted outputs by passing inputs to the model
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids)
            
            # pick the class with the highest logit as the prediction
            _, preds = torch.max(outputs, dim=1)

            predictions.extend(preds)
            prediction_probs.extend(F.softmax(outputs, dim=1))

    predictions = torch.stack(predictions).cpu().detach().numpy()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()

    return predictions, prediction_probs

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(pt_model, test_comments, tokenizer, max_len=128)

print(preds.shape, probs.shape)

  0%|          | 0/56 [00:00<?, ?it/s]

(1789,) (1789, 5)


In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

F1: 0.9491487299204161

              precision    recall  f1-score   support

     furious       0.77      0.83      0.80        12
       angry       0.94      0.96      0.95       183
     neutral       0.97      0.97      0.97       712
       happy       0.94      0.92      0.93       545
   delighted       0.93      0.95      0.94       337

    accuracy                           0.95      1789
   macro avg       0.91      0.93      0.92      1789
weighted avg       0.95      0.95      0.95      1789
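
The difference between the `macro avg` and `weighted avg` rows can be reproduced from the per-class numbers in the report. This is a small sketch using the rounded F1 scores and supports copied from the table above; the variable names are illustrative.

```python
# (f1-score, support) per class, copied from the report above.
report = {
    "furious":   (0.80, 12),
    "angry":     (0.95, 183),
    "neutral":   (0.97, 712),
    "happy":     (0.93, 545),
    "delighted": (0.94, 337),
}

total_support = sum(n for _, n in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The small `furious` class (12 samples) pulls the macro average down, while it barely affects the weighted average that `f1_score(..., average="weighted")` reports.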



### Saving the last-epoch model to Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
path = "/content/gdrive/MyDrive/finetuned_parsbert_sentipers_10/finetuned_parsbert_sentipers.pt"
torch.save(pt_model.state_dict(), path)

### Loading the trained model from Drive and using it for prediction

In [None]:
!ls /content/gdrive/MyDrive/finetuned_parsbert_sentipers

finetuned_parsbert_sentipers.pt  test.csv  train.csv  valid.csv


In [None]:
tmp_model = SentimentModel(config=config)
tmp_model = tmp_model.to(device)

print('tmp_model', type(tmp_model))

Downloading:   0%|          | 0.00/624M [00:00<?, ?B/s]

Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tmp_model <class '__main__.SentimentModel'>


In [None]:
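path = "/content/gdrive/MyDrive/finetuned_parsbert_sentipers_5/finetuned_parsbert_sentipers.pt"
# map_location lets the checkpoint load on whichever device is currently available
tmp_model.load_state_dict(torch.load(path, map_location=device))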

<All keys matched successfully>

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=128)

print(preds.shape, probs.shape)

  0%|          | 0/56 [00:00<?, ?it/s]

(1789,) (1789, 5)


In [None]:
y_test, y_pred = [label_list.index(label) for label in test['label'].values], preds

print(f'F1: {f1_score(y_test, y_pred, average="weighted")}')
print()
print(classification_report(y_test, y_pred, target_names=label_list))

F1: 0.9450947250691565

              precision    recall  f1-score   support

     furious       0.85      0.92      0.88        12
       angry       0.96      0.96      0.96       183
     neutral       0.96      0.97      0.96       712
       happy       0.94      0.91      0.92       545
   delighted       0.93      0.94      0.94       337

    accuracy                           0.95      1789
   macro avg       0.93      0.94      0.93      1789
weighted avg       0.95      0.95      0.95      1789



### Custom input test

In [None]:
xtmp_test = ['خوب بود ولی می‌تونست بهتر باشه .']
test_comments = np.array(xtmp_test)
preds, probs = predict(pt_model, test_comments, tokenizer, max_len=128)

preds

  0%|          | 0/1 [00:00<?, ?it/s]

array([2])

In [None]:
xtmp_test = ['خوب نبود']
test_comments = np.array(xtmp_test)
preds, probs = predict(tmp_model, test_comments, tokenizer, max_len=128)

preds

  0%|          | 0/1 [00:00<?, ?it/s]

array([1])
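
The predictions are indices into `label_list`, so they can be mapped back to readable labels. Here is a minimal sketch; `decode` is a hypothetical helper, not part of the notebook.

```python
label_list = ['furious', 'angry', 'neutral', 'happy', 'delighted']

def decode(pred_indices, labels=label_list):
    """Map predicted class indices back to their label names."""
    return [labels[int(i)] for i in pred_indices]

first = decode([2])   # prediction for the first custom comment
second = decode([1])  # prediction for the second custom comment

print(first, second)
```

Under this mapping, the first comment is classified as `neutral` and the second as `angry`.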