# Text Classification with BERT

An online store is launching a new service: users can now edit and supplement product descriptions, as in wiki communities. That is, customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as toxic or non-toxic. A dataset labeled for the toxicity of edits is available.

The model must reach an *F1* score of at least 0.75.
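
For reference, *F1* is the harmonic mean of precision and recall. Below is a minimal sketch of the calculation; the helper name `f1_from_counts` and the confusion counts are made up for illustration and are not taken from this dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, from confusion counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))
```

With these counts precision is 0.8 and recall is about 0.73, so a model of that quality would only just clear the 0.75 threshold.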


**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target attribute.

## Preparation

In [1]:
# pip install transformers
# pip install torch

In [2]:
import numpy as np
import pandas as pd
import torch
import transformers
from tqdm import notebook
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

RAND = 2007

In [3]:
# A forward slash avoids the invalid '\y' escape sequence and works on any OS
df_tweets = pd.read_csv('datasets/yandex_13_toxic_comments.csv')

In [4]:
df_tweets.head(10)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


In [5]:
# check NaN
df_tweets.isna().sum()

text     0
toxic    0
dtype: int64

In [6]:
# check target column
df_tweets['toxic'].describe()

count    159571.000000
mean          0.101679
std           0.302226
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: toxic, dtype: float64

The classes are imbalanced: only about 10% of the comments are toxic (the mean of `toxic` is 0.10). We will build a balanced sample for training and testing, especially since processing all 159 thousand texts on a CPU would take too long.

In [7]:
# Let's check the length of the texts (number of characters)
ln = np.array([len(x) for x in df_tweets['text']])
print(f'min length: {ln.min()}, max length: {ln.max()}')

min length: 6, max length: 5000


In [8]:
# Forming a balanced sample: 5000 comments per class
df_0 = df_tweets.query("toxic == 0").sample(n=5000, random_state=RAND)
df_1 = df_tweets.query("toxic == 1").sample(n=5000, random_state=RAND)
df_sample = pd.concat([df_0, df_1]).reset_index(drop=True)
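
A quick way to confirm that a sample built this way is balanced is to count the labels. The sketch below repeats the same idea on a small synthetic frame; `toy`, `n_per_class`, and the row counts are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Synthetic stand-in for df_tweets: 90 non-toxic rows and 10 toxic rows
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(100)],
    'toxic': [0] * 90 + [1] * 10,
})

n_per_class = 5  # assumed per-class size for the toy example
parts = [
    toy[toy['toxic'] == label].sample(n=n_per_class, random_state=2007)
    for label in (0, 1)
]
toy_sample = pd.concat(parts).reset_index(drop=True)

# Each class should appear exactly n_per_class times
class_counts = toy_sample['toxic'].value_counts().to_dict()
print(class_counts)
```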

## Training

In [9]:
# Initializing the tokenizer as a class object BertTokenizer()
tokenizer = transformers.BertTokenizer(vocab_file='bert\\vocab.txt')

In [10]:
vectors = []
max_n = 0
y = []
for i in notebook.tqdm(range(df_sample['text'].shape[0])):
    txt = df_sample['text'][i]
    # Convert the text to token ids from the vocabulary.
    # add_special_tokens=True adds the start token (101)
    # and the end token (102) to every converted text.
    v = tokenizer.encode(txt, add_special_tokens=True)
    # If the text is too long for the model, we discard it
    # (otherwise we would have to split the text into pieces and merge the results back).
    if len(v) < 512:
        vectors.append(v)
        y.append(df_sample['toxic'][i])
        if len(v) > max_n:
            max_n = len(v)


In [11]:
# After tokenization, pad every sequence with zeros so that all of them have the same length (max_n)
padded = np.array([i + [0]*(max_n - len(i)) for i in vectors])

In [12]:
# Tell the model that the padding zeros carry no meaningful information.
attention_mask = np.where(padded != 0, 1, 0)
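
To make the padding and masking steps concrete, here is a small sketch on two made-up token sequences. The ids are illustrative, and the sketch assumes, as the notebook does, that id 0 is used only for padding.

```python
import numpy as np

# Toy token-id sequences; 101 and 102 mimic the start and end tokens
toy_vectors = [[101, 7, 8, 102], [101, 9, 102]]
toy_max_n = max(len(v) for v in toy_vectors)

# Right-pad every sequence with zeros up to the longest length
toy_padded = np.array([v + [0] * (toy_max_n - len(v)) for v in toy_vectors])
# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
```

The second sequence gets one trailing zero, and the mask has a 0 in exactly that position.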

In [13]:
# Initializing the BertConfig configuration
config = transformers.BertConfig.from_json_file('bert\\config.json')
# Initializing the BertModel model itself
model = transformers.BertModel.from_pretrained('bert\\pytorch_model.bin', config=config)

Some weights of the model checkpoint at bert\pytorch_model.bin were not used when initializing BertModel: ['fit_denses.3.bias', 'fit_denses.0.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.2.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'fit_denses.0.weight', 'fit_denses.1.weight', 'fit_denses.4.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'fit_denses.3.weight', 'cls.seq_relationship.bias', 'fit_denses.2.bias', 'fit_denses.4.weight', 'fit_denses.1.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequ

In [14]:
# let's check the size of the corpus and mask
attention_mask.shape

(9796, 509)

In [15]:
# Check the size of the target
len(y)

9796

The sizes match, so we can start computing the embeddings.

In [16]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    # convert the token ids of the current batch to a tensor
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    # convert the matching slice of the attention mask
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    with torch.no_grad():
        # compute the embeddings for the batch
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    # batch_embeddings[0] is the last hidden state; [:, 0, :] keeps the vector
    # of the first (start) token of each text, which is added to the list
    embeddings.append(batch_embeddings[0][:, 0, :].numpy())

    # --- ADDED to save memory
    del batch
    del attention_mask_batch
    del batch_embeddings
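
The indexing `batch_embeddings[0][:,0,:]` keeps one vector per text: the hidden state at the first position (the start token, id 101). Here is a minimal NumPy sketch of the same slicing on a made-up array; the shapes are illustrative and much smaller than the real model output.

```python
import numpy as np

# Stand-in for the model's last hidden state: [batch, tokens, hidden]
batch_n, seq_len, hidden = 2, 4, 3
last_hidden = np.arange(batch_n * seq_len * hidden).reshape(batch_n, seq_len, hidden)

# Keep only the vector at position 0 (the start token) for every text
first_token = last_hidden[:, 0, :]
print(first_token.shape)
```

The result has one row per text and one column per hidden unit, which is the shape a classic classifier expects as a feature matrix.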


In [17]:
# generate features
features = np.concatenate(embeddings)

In [18]:
features.shape

(9700, 312)

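The integer division in the loop drops the final incomplete batch, so embeddings were computed for 9,700 of the 9,796 texts. The target is truncated to the same length (`y[:len(features)]`) in the split below, so the features and the target still line up.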
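
The embedding loop runs `padded.shape[0] // batch_size` batches, which skips any rows left over after the last full batch. One possible way to keep them is to round the batch count up. The sketch below shows only the index arithmetic, with the row count taken from the output above; the variable names are illustrative.

```python
import math

n_rows = 9796      # number of tokenized texts reported above
batch_size = 100

# Floor division: only full batches are processed
full_batches = n_rows // batch_size
rows_covered_floor = full_batches * batch_size

# Ceiling division: one extra, shorter batch picks up the remainder
all_batches = math.ceil(n_rows / batch_size)
slices = [(batch_size * i, min(batch_size * (i + 1), n_rows)) for i in range(all_batches)]
rows_covered_ceil = sum(stop - start for start, stop in slices)

print(full_batches, rows_covered_floor)
print(all_batches, rows_covered_ceil)
```

With the rounded-up count, every text would get an embedding and the target would not need to be truncated.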

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    features, 
    y[:len(features)],
    test_size = 0.1,
    random_state=2007
)

pd.DataFrame(
    data={
        'train': [X_train.shape, len(y_train)], 
        'test': [X_test.shape, len(y_test)] },
    index=['X', 'y']
)

![image.png](attachment:image.png)

In [20]:
# Training a logistic regression on the embeddings

linreg = LogisticRegression(
    penalty='l2',
    dual=False, 
    tol=0.0001, 
    C=1.0, 
    fit_intercept=True, 
    intercept_scaling=1, 
    class_weight=None, 
    random_state=2007, 
    solver='lbfgs', 
    max_iter=100, 
    multi_class='auto', 
    verbose=0, 
    warm_start=False, 
    n_jobs=-1, 
    l1_ratio=None)



lr_fitted = linreg.fit(X_train, y_train)
predictions = lr_fitted.predict(X_test)


# calc score
f1_score(y_test, predictions)

0.8599137931034484

### LightGBM

Perform the classification with LGBMClassifier

In [22]:
from lightgbm import LGBMClassifier

params = {
    'max_depth': 5,
    'n_estimators': 60,
    'colsample_bytree': 0.85,
    'reg_alpha': 5.5,
    'random_state': RAND
}

lgb_booster = LGBMClassifier(**params)


model_fit = lgb_booster.fit(X_train, y_train)
# make prediction
prediction = model_fit.predict(X_test)

# calc score
f1_score(y_test, prediction)

0.8451612903225806

## Conclusions
During the project, a balanced sample of 10,000 records (out of 159,571) was processed, and embeddings were computed with a BERT neural network for the 9,700 texts that remained after filtering out long texts and batching. Logistic regression and LightGBM classifiers were then trained to predict comment toxicity. Logistic regression reached an F1 of 0.86 on the test set, above the required 0.75, while LightGBM reached 0.85. It is surprising that LightGBM performed worse than logistic regression.

## Recommendations
The notebook takes about 2-3 hours to run. With more computing resources for the embeddings, in particular a GPU, it would probably be possible to train a model on a larger training sample.

You can also compare different classification models using cross-validation and tune their hyperparameters.
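
As a rough illustration of that idea, the sketch below scores a logistic regression with 5-fold cross-validation on synthetic data. `X_toy` and `y_toy` are generated stand-ins for the embedding matrix and labels, and the settings are assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the embedding matrix and the toxicity labels
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=2007)

candidate = LogisticRegression(max_iter=1000)
# 5-fold cross-validated F1 instead of a single train/test split
cv_scores = cross_val_score(candidate, X_toy, y_toy, cv=5, scoring='f1')
print(cv_scores.mean())
```

The same call can be repeated for each candidate model, and the mean scores compared before touching the held-out test set.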

You can also use a full-size BERT model with more layers; it would probably add about another 5 percent to the score.

Other approaches are also used to work with texts. Transformers (BERT and its relatives) are now actively used, alongside other Sesame Street namesakes such as ELMo, which is LSTM-based rather than a transformer. BUT! They are not a panacea and are not always needed, since TF-IDF or Word2Vec combined with classic ML models can also cope.   
BERT is heavy; there are many variations of it for different tasks, ready-made models, and add-ons built on top of the transformers library. If you train or run BERT on a GPU (you can use Google Colab or Kaggle), it should be faster.   
https://huggingface.co/transformers/model_doc/bert.html   
https://t.me/renat_alimbekov   
https://web.stanford.edu/~jurafsky/slp3/10.pdf - about encoder-decoder models and attention   
https://pytorch.org/tutorials/beginner/transformer_tutorial.html - the official guide to the transformer from the creators of PyTorch   
https://transformer.huggingface.co/ - chat with a transformer   
Libraries: allennlp, fairseq, transformers, tensorflow-text - many implemented NLP methods, including transformers   
Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html
