# Project for "WikiShop" (using BERT)

The online store "WikiShop" is launching a new service. Now users can edit and enhance product descriptions, similar to wiki communities. In other words, customers can suggest their edits and comment on the changes made by others. The store requires a tool that can identify toxic comments and send them for moderation.

The objective is to train a model that uses BERT (Bidirectional Encoder Representations from Transformers) to classify comments as toxic or non-toxic. A labeled dataset indicating the toxicity of edits is available.

The target is a model with an F1 score of at least 0.75.
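
For reference, F1 is the harmonic mean of precision and recall on the positive (here, toxic) class. The sketch below computes it from raw counts; the function name and the example counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall), with the toxic class as positive."""
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 toxic comments missed.
score = f1_from_counts(tp=80, fp=20, fn=30)
print(round(score, 3))
```

Because both precision and recall enter the formula, a model cannot reach a high F1 by simply flagging everything or nothing.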

## Data preprocessing

Import the libraries:

In [1]:
import pandas as pd
import numpy as np

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import torch
import transformers

import re
from tqdm import notebook

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

import lightgbm as lgb

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ivanshurgalin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ivanshurgalin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ivanshurgalin/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Let's load the dataset into a dataframe and take a look at it:

In [2]:
df = pd.read_csv('toxic_comments (1).csv')

In [3]:
df.head()

   Unnamed: 0                                               text  toxic
0           0  Explanation\nWhy the edits made under my usern...      0
1           1  D'aww! He matches this background colour I'm s...      0
2           2  Hey man, I'm really not trying to edit war. It...      0
3           3  "\nMore\nI can't make any real suggestions on ...      0
4           4  You, sir, are my hero. Any chance you remember...      0

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64
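
The counts above show a strong class imbalance, which is why F1 rather than accuracy is the target metric. As a quick illustration (a sketch that uses only the counts printed above; the variable names are ours), a classifier that always predicts "not toxic" would look accurate while finding no toxic comments at all:

```python
# Class counts copied from the value_counts() output above.
n_clean, n_toxic = 143106, 16186
n_total = n_clean + n_toxic

toxic_share = n_toxic / n_total

# A baseline that labels every comment as non-toxic (class 0):
baseline_accuracy = n_clean / n_total   # every clean comment counts as correct
baseline_recall = 0 / n_toxic           # no toxic comment is ever flagged
baseline_f1 = 0.0                       # with zero true positives, F1 is zero

print(f"toxic share:       {toxic_share:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline F1:       {baseline_f1:.1f}")
```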

First, we convert the texts to lowercase and remove unnecessary symbols, which helps normalize the text and remove inconsistencies or noise that might affect the model's performance.

In [6]:
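def pre_edit(text):
    # replace line breaks with spaces
    text = re.sub(r"(?:\n|\r)", " ", text)
    # drop everything except Latin letters and spaces (digits, punctuation, apostrophes)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    text = text.lower()
    return text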

from tqdm import tqdm
tqdm.pandas()
df['text'] = df['text'].progress_apply(pre_edit)
df['text'][0]

100%|████████████████████████████████| 159292/159292 [00:01<00:00, 80100.85it/s]


'explanation why the edits made under my username hardcore metallica fan were reverted they werent vandalisms just closure on some gas after i voted at new york dolls fac and please dont remove the template from the talk page since im retired now'
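
To make the effect of the two regular expressions concrete, here is a small self-contained sketch that applies the same cleaning steps to a made-up comment (the sample string is hypothetical). Note that apostrophes are removed without leaving a space, so contractions such as "don't" become "dont", as seen in the output above.

```python
import re


def pre_edit(text: str) -> str:
    """Same cleaning steps as above: newlines to spaces, keep only letters and spaces, lowercase."""
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text.lower()


sample = "Hey!\nDon't revert my 2 edits, OK?"   # hypothetical comment
cleaned = pre_edit(sample)
print(repr(cleaned))
```

The removed digit leaves a double space behind; this is harmless here because the text is later split on whitespace.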

Tokenization and lemmatization:

In [8]:
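w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize every token as a verb (pos='v'), e.g. "made" -> "make"; nouns such as "dolls" stay unchanged
    return ' '.join([lemmatizer.lemmatize(w, 'v') for w in w_tokenizer.tokenize(text)])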

In [9]:
df['text_final'] = df.text.apply(lemmatize_text)
df['text_final'][0]

'explanation why the edit make under my username hardcore metallica fan be revert they werent vandalisms just closure on some gas after i vote at new york dolls fac and please dont remove the template from the talk page since im retire now'

In [10]:
df.head()

| | Unnamed: 0 | text | toxic | text_final |
|---|---|---|---|---|
| 0 | 0 | explanation why the edits made under my userna... | 0 | explanation why the edit make under my usernam... |
| 1 | 1 | daww he matches this background colour im seem... | 0 | daww he match this background colour im seemin... |
| 2 | 2 | hey man im really not trying to edit war its j... | 0 | hey man im really not try to edit war its just... |
| 3 | 3 | more i cant make any real suggestions on impro... | 0 | more i cant make any real suggestions on impro... |
| 4 | 4 | you sir are my hero any chance you remember wh... | 0 | you sir be my hero any chance you remember wha... |


Now let's split the data into training and test samples:

In [12]:
x_comms = df['text_final']
y_comms = df['toxic']

x_train_comms, x_test_comms, y_train_comms, y_test_comms = train_test_split(x_comms, y_comms, random_state=0, 
                                                                            stratify=y_comms)
x_train_comms.shape, x_test_comms.shape, y_train_comms.shape, y_test_comms.shape

((119469,), (39823,), (119469,), (39823,))

In [13]:
print('Class Distribution in the Training Dataset:')
print(y_train_comms.value_counts()[0] / y_train_comms.value_counts().sum())
print(y_train_comms.value_counts()[1] / y_train_comms.value_counts().sum())
print()
print('Class Distribution in the Test Dataset:')
print(y_test_comms.value_counts()[0] / y_test_comms.value_counts().sum())
print(y_test_comms.value_counts()[1] / y_test_comms.value_counts().sum())

Class Distribution in the Training Dataset:
0.8983836811222995
0.10161631887770049

Class Distribution in the Test Dataset:
0.8984004218667604
0.10159957813323958
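
The near-identical class shares in both samples come from the `stratify=y_comms` argument. The following is a rough pure-Python sketch of the idea, not scikit-learn's actual implementation: indices are grouped by label and the same fraction of each group goes to the test set. The function name and the toy labels are illustrative assumptions; the 25% test fraction matches the default of `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with roughly equal class shares in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 10% positive share, similar to the dataset.
toy_labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(toy_labels)

train_share = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_share = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```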


Now, let's create a corpus and add stop words:

In [14]:
x_train_comms_corpus = x_train_comms.values
x_test_comms_corpus = x_test_comms.values
 
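nltk.download('stopwords')
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a set of words,
# so re-running this cell without restarting the kernel raises an AttributeError.
stopwords = set(stopwords.words('english'))
len(stopwords)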

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ivanshurgalin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


179

To store the results, let's create a separate dataframe:

In [15]:
results = pd.DataFrame({
    'Preprocessing model' : [], 'Learning model' : [], 'Train f1 score' : [], 'Test f1 score' : []
}) 

In [16]:
models = [
    LogisticRegression(max_iter=1000),
    lgb.LGBMClassifier(n_estimators = 1000, learning_rate = 0.1)
]

We will train the models using a separate function, `learn_models`:

In [17]:
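def learn_models(models_list, x_train, y_train, x_test, y_test, prepr_model: str):
    """Fit each model, score it on train and test, and add a row to the global `results`."""
    global results
    for model in models_list:
        # an empty parameter grid means GridSearchCV only cross-validates and refits the model as-is
        clf_gs = GridSearchCV(model, {}, cv=5, scoring='f1')
        clf_gs.fit(x_train, y_train)

        train_f1_score = f1_score(y_train, clf_gs.predict(x_train))
        test_f1_score = f1_score(y_test, clf_gs.predict(x_test))

        name = type(model).__name__

        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        new_row = pd.DataFrame([{
            'Preprocessing model': prepr_model, 'Learning model': name,
            'Train f1 score': round(train_f1_score, 2), 'Test f1 score': round(test_f1_score, 2)}])
        results = pd.concat([results, new_row], ignore_index=True)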

## Model training

Vectorization with `CountVectorizer`:

In [18]:
# recent scikit-learn versions expect stop_words to be a list, so the set is converted
vectorizer = CountVectorizer(stop_words=list(stopwords), dtype=np.float32)

x_train_comms_vectorized = vectorizer.fit_transform(x_train_comms_corpus)
x_test_comms_vectorized = vectorizer.transform(x_test_comms_corpus)

print(x_train_comms_vectorized.shape)
print(x_train_comms_vectorized[:5].toarray())

(119469, 173887)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
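
`CountVectorizer` turns each comment into a vector of token counts over a vocabulary learned from the training corpus. A minimal pure-Python sketch of that idea follows. It is a simplification (whitespace tokens, no minimum token length, a two-word stop list), and the helper names and toy sentences are hypothetical.

```python
from collections import Counter


def fit_vocabulary(corpus, stop_words=frozenset()):
    """Map each non-stop-word token seen in the corpus to a column index (sorted alphabetically)."""
    tokens = {tok for doc in corpus for tok in doc.split() if tok not in stop_words}
    return {tok: col for col, tok in enumerate(sorted(tokens))}


def transform(corpus, vocabulary):
    """Build one row of token counts per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in corpus:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_docs = ["you be my hero", "stop the edit war stop"]
test_docs = ["edit war hero unknownword"]

vocab = fit_vocabulary(train_docs, stop_words={"the", "be"})
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)
print(vocab)
print(train_rows)
print(test_rows)
```

As in the notebook, the vocabulary is fitted on the training texts only, and the test texts are transformed with that fixed vocabulary. With 173887 columns and only a handful of distinct tokens per comment, almost every entry is zero, which is why the printed rows above show only zeros.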


In [19]:
%%time
learn_models(models, x_train_comms_vectorized, y_train_comms, x_test_comms_vectorized, y_test_comms,
             prepr_model = 'CountVectorizer')

CPU times: user 17min 4s, sys: 2min 21s, total: 19min 26s
Wall time: 3min 30s

In [20]:
results

| | Preprocessing model | Learning model | Train f1 score | Test f1 score |
|---|---|---|---|---|
| 0 | CountVectorizer | LogisticRegression | 0.89 | 0.76 |
| 1 | CountVectorizer | LGBMClassifier | 0.92 | 0.78 |


TF-IDF training:

In [21]:
# recent scikit-learn versions expect stop_words to be a list, so the set is converted
tf_idf = TfidfVectorizer(stop_words=list(stopwords))

x_train_comms_tf_idf = tf_idf.fit_transform(x_train_comms)
x_test_comms_tf_idf = tf_idf.transform(x_test_comms)

In [22]:
print(x_train_comms_tf_idf.shape)
print(x_train_comms_tf_idf[:5].toarray())

(119469, 173887)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Let's train the models and measure the time it takes.

In [23]:
%%time
learn_models(models, x_train_comms_tf_idf, y_train_comms, x_test_comms_tf_idf, y_test_comms,
             prepr_model = 'TfidfVectorizer')

CPU times: user 42min 56s, sys: 5min 13s, total: 48min 9s
Wall time: 6min 47s

Check the results:

In [24]:
results

| | Preprocessing model | Learning model | Train f1 score | Test f1 score |
|---|---|---|---|---|
| 0 | CountVectorizer | LogisticRegression | 0.89 | 0.76 |
| 1 | CountVectorizer | LGBMClassifier | 0.92 | 0.78 |
| 2 | TfidfVectorizer | LogisticRegression | 0.76 | 0.74 |
| 3 | TfidfVectorizer | LGBMClassifier | 0.95 | 0.78 |


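DistilBERT: the pretrained model is used as a feature extractor, and the same classifiers are trained on its embeddings.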

In [25]:
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertModel, transformers.DistilBertTokenizer, 
                                                    'distilbert-base-uncased')

In [26]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Tokenization

In [27]:
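from tqdm.notebook import tqdm
tqdm.pandas()
# x[:512] keeps the first 512 characters; truncation/max_length additionally cap the result at 512 tokens
tokenized = df['text_final'][:5000].progress_apply(
    lambda x: tokenizer.encode(x[:512], add_special_tokens=True, truncation=True, max_length=512))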

  0%|          | 0/5000 [00:00<?, ?it/s]

Padding

In [28]:
len(tokenized[0])

53

In [29]:
padded = np.array([i + [0]*(512-len(i)) for i in tokenized.values])
len(padded[0])

512

Masking

In [30]:
attention_mask = np.where(padded != 0, 1, 0)
padded.shape, attention_mask.shape

((5000, 512), (5000, 512))
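
As a small illustration of the two steps above, the sketch below pads two token-id lists to a common length and derives the attention mask. It uses plain lists, a shorter maximum length, and made-up token ids for readability. Padding with 0 relies on 0 being the `[PAD]` id in the `distilbert-base-uncased` vocabulary; other tokenizers may use a different id (available as `tokenizer.pad_token_id`).

```python
MAX_LEN = 8   # the notebook uses 512; a small value keeps the example readable
PAD_ID = 0    # assumed padding id, see the note above

# Made-up token ids; 101 and 102 stand for the [CLS] and [SEP] special tokens.
tokenized = [
    [101, 7592, 2088, 102],
    [101, 2129, 2024, 2017, 2651, 102],
]

# Right-pad every sequence with PAD_ID up to MAX_LEN.
padded = [ids + [PAD_ID] * (MAX_LEN - len(ids)) for ids in tokenized]
# 1 marks a real token, 0 marks padding that the model should ignore.
attention_mask = [[1 if tok != PAD_ID else 0 for tok in row] for row in padded]

for row, mask in zip(padded, attention_mask):
    print(row)
    print(mask)
```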

In [31]:
# batch size
batch_size = 10
# embeddings
embeddings = []

In [32]:
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # inference only: no gradients are needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    # keep the hidden state of the first ([CLS]) token as the comment embedding
    embeddings.append(batch_embeddings[0][:,0,:].numpy())

  0%|          | 0/500 [00:00<?, ?it/s]
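
The loop above iterates over `padded.shape[0] // batch_size` batches, which covers every row here because 5000 is divisible by 10. If the row count were not a multiple of the batch size, the trailing rows would be skipped and the embeddings would no longer line up with the labels. A hedged sketch of a batching helper that keeps the remainder (the function name is hypothetical):

```python
def batch_slices(n_rows: int, batch_size: int):
    """Yield (start, stop) index pairs covering all rows, including a shorter final batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# With the notebook's numbers every batch is full.
full = list(batch_slices(5000, 10))
print(len(full), full[0], full[-1])

# With a size that does not divide evenly, the last batch is shorter but not dropped.
uneven = list(batch_slices(25, 10))
print(uneven)
```

In the embedding loop, each pair would be used as `padded[start:stop]` and `attention_mask[start:stop]`.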

To keep the DistilBERT embedding step fast, only the first 5000 rows of the main dataframe were used.

In [33]:
x_bert = np.concatenate(embeddings)
y_bert = df['toxic'][:5000]
x_bert.shape, y_bert.shape

((5000, 768), (5000,))

In [34]:
x_train_bert, x_test_bert, y_train_bert, y_test_bert = train_test_split(x_bert, y_bert, random_state=0, stratify=y_bert)


Model training

In [35]:
%%time
learn_models(models, x_train_bert, y_train_bert, x_test_bert, y_test_bert,
             prepr_model = 'DistilBert')

CPU times: user 4min 21s, sys: 33.5 s, total: 4min 54s
Wall time: 43.5 s

Updated results:

In [36]:
results

| | Preprocessing model | Learning model | Train f1 score | Test f1 score |
|---|---|---|---|---|
| 0 | CountVectorizer | LogisticRegression | 0.89 | 0.76 |
| 1 | CountVectorizer | LGBMClassifier | 0.92 | 0.78 |
| 2 | TfidfVectorizer | LogisticRegression | 0.76 | 0.74 |
| 3 | TfidfVectorizer | LGBMClassifier | 0.95 | 0.78 |
| 4 | DistilBert | LogisticRegression | 0.81 | 0.66 |
| 5 | DistilBert | LGBMClassifier | 1.0 | 0.65 |


## Conclusion

Sorted table:

In [37]:
results.sort_values('Test f1 score', ascending=False)

| | Preprocessing model | Learning model | Train f1 score | Test f1 score |
|---|---|---|---|---|
| 1 | CountVectorizer | LGBMClassifier | 0.92 | 0.78 |
| 3 | TfidfVectorizer | LGBMClassifier | 0.95 | 0.78 |
| 0 | CountVectorizer | LogisticRegression | 0.89 | 0.76 |
| 2 | TfidfVectorizer | LogisticRegression | 0.76 | 0.74 |
| 4 | DistilBert | LogisticRegression | 0.81 | 0.66 |
| 5 | DistilBert | LGBMClassifier | 1.0 | 0.65 |


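The best test F1 score (0.78) was achieved by the LGBMClassifier on text preprocessed with CountVectorizer, with the LGBMClassifier on TF-IDF features reaching the same rounded score; both exceed the 0.75 target. The DistilBERT-based models performed worst (test F1 of 0.65 to 0.66). DistilBERT was used only as a fixed feature extractor on a 5000-row subset, so the classifiers trained on its embeddings likely had too little data to learn complex patterns. The gap between train and test F1 (1.0 vs. 0.65 for the LGBMClassifier) also points to overfitting.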