<a href="https://colab.research.google.com/github/Seiilaa/Yandex-Praktikum-Data-Science-program/blob/main/bert_tweet_sentiment_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tweet sentiment recognition with BERT

Build an ML model that will determine whether the tweet is toxic or not. 

## Data description

All data is stored in `toxic_comments.csv`.
- "text" contains the text of a tweet
- "toxic" is the target variable

# Preparation

## Get libraries, enable GPU

In [10]:
random_state = 321

In [2]:
!pip install transformers


In [45]:
import pandas as pd
import numpy as np

import torch
import tensorflow as tf

import transformers as ppb
from tqdm import notebook 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from xgboost import XGBClassifier
from xgboost import plot_importance


In [4]:
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [5]:
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [6]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

In [11]:
print('Dataset size:', df.shape)

Dataset size: (159571, 2)


In [8]:
df.head()

|   | text | toxic |
|---|------|-------|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [13]:
df['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

In this dataset, 0 marks a non-toxic tweet and 1 a toxic one.

The classes are imbalanced: only about 10% of the tweets are toxic. Accuracy would therefore be misleading, so I will use the F1 score to compare the models.
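
To illustrate why accuracy is a poor fit here, the sketch below scores a hypothetical baseline that labels every tweet as non-toxic. The class counts come from the `value_counts()` output above; the baseline and the helper `f1_from_counts` are only illustrative and are not used elsewhere in this notebook.

```python
# Class counts taken from the value_counts() output above.
n_non_toxic = 143346
n_toxic = 16225


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical baseline: predict class 0 (non-toxic) for every tweet.
# No tweet is predicted toxic, so TP = FP = 0 and every toxic tweet is a false negative.
tp, fp, fn, tn = 0, 0, n_toxic, n_non_toxic

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_from_counts(tp, fp, fn)

print(f"Baseline accuracy: {accuracy:.3f}")          # 0.898
print(f"Baseline F1 (toxic class): {f1:.3f}")        # 0.000
```

A model that never detects a toxic tweet still reaches roughly 90% accuracy, while its F1 score for the toxic class is zero.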

# Data preprocessing

## Tokenizing the data

In [14]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

In [15]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
#make sure that the model uses gpu
model.cuda()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [18]:
tokenized = df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

The tokenizer warning produced by the cell above shows that some tweets exceed 512 tokens after tokenization. The pretrained DistilBERT model that I use can't accept longer inputs.
I will look at how many of these long tweets there are and decide what to do with them.

In [19]:
# Count the tweets that exceed the 512-token limit and track the longest one.
max_len = 0
big_tweets_count = 0
for tokens in tokenized.values:
    if len(tokens) > 512:
        big_tweets_count += 1
        max_len = max(max_len, len(tokens))

print('The maximum length of a tokenized tweet:', max_len)
print('The number of tweets with 512+ tokens:', big_tweets_count)
print("It's {} from total number of tweets".format(round(big_tweets_count/tokenized.shape[0], ndigits=2)))

The maximum length of a tokenized tweet: 4950
The number of tweets with 512+ tokens: 3523
It's 0.02 from total number of tweets


Since only about 2% of the tweets exceed 512 tokens, I will truncate them to the maximum length during tokenization.
This assumes that the toxicity of these tweets isn't concentrated only in their last few words.
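
To make the truncation step, and the padding and attention mask built in the following cells, more concrete, here is a toy sketch on made-up token IDs. It assumes the usual special-token IDs of the uncased BERT vocabulary (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) and mimics, as far as I understand it, how the tokenizer keeps the special tokens when it truncates. The helper names and the tiny `toy_max_len` are hypothetical stand-ins for the real tokenizer and the 512-token limit.

```python
# Special-token IDs assumed from the uncased BERT vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def truncate_with_special_tokens(content_ids, max_len):
    """Keep the first max_len - 2 content tokens, then wrap them in [CLS] ... [SEP]."""
    return [CLS_ID] + content_ids[: max_len - 2] + [SEP_ID]


def pad_and_mask(ids, max_len):
    """Right-pad with PAD_ID and build a mask: 1 for real tokens, 0 for padding."""
    padded = ids + [PAD_ID] * (max_len - len(ids))
    mask = [1 if token != PAD_ID else 0 for token in padded]
    return padded, mask


toy_max_len = 6  # stands in for 512
short_tweet = [2023, 2003]                                # made-up content token IDs
long_tweet = [2023, 2003, 1037, 2146, 3793, 999, 2062]    # longer than the limit

for content in (short_tweet, long_tweet):
    ids = truncate_with_special_tokens(content, toy_max_len)
    padded, mask = pad_and_mask(ids, toy_max_len)
    print(padded, mask)
# [101, 2023, 2003, 102, 0, 0] [1, 1, 1, 1, 0, 0]
# [101, 2023, 2003, 1037, 2146, 102] [1, 1, 1, 1, 1, 1]
```

The long tweet loses its tail but keeps the closing `[SEP]`, and the short tweet is padded with zeros that the attention mask tells the model to ignore.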

In [20]:
tokenized = df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))

In [21]:
tokenized.shape

(159571,)

In [22]:
max_len = 512
#padding the tweets to 512 length
padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

In [23]:
padded.shape

(159571, 512)

In [24]:
attention_mask = np.where(padded != 0, 1, 0)

In [None]:
#input_ids = torch.tensor(padded)  
#attention_mask = torch.tensor(attention_mask)

In [25]:
model = model_class.from_pretrained(pretrained_weights)
model.cuda()


In [26]:
batch_size = 100
embeddings = []
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
        batch = torch.tensor(padded[batch_size*i:batch_size*(i+1)]).to(device) 
        attention_mask_batch = torch.tensor(attention_mask[batch_size*i:batch_size*(i+1)]).to(device)
        
        with torch.no_grad():
            batch_embeddings = model(batch, attention_mask=attention_mask_batch)
        
        embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())

  0%|          | 0/1595 [00:00<?, ?it/s]

In [27]:
# The loop above uses floor division, so the final partial batch (rows 159500 onward) was skipped. Process it manually.
last_batch = torch.tensor(padded[batch_size*1595:]).to(device)
attention_mask_last_batch = torch.tensor(attention_mask[batch_size*1595:]).to(device)

with torch.no_grad():
  batch_embeddings = model(last_batch, attention_mask=attention_mask_last_batch)

embeddings.append(batch_embeddings[0][:,0,:].cpu().numpy())

In [28]:
features = np.concatenate(embeddings)
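
The separate last-batch step can be avoided by generating slice bounds that also cover the remainder. The sketch below is one possible way to do it; `batch_slices` is a hypothetical helper, not part of the notebook above, and only the dataset size and batch size are taken from it.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering all rows, including a final partial batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


n_rows, batch_size = 159571, 100  # dataset size and batch size used above
slices = list(batch_slices(n_rows, batch_size))

print(len(slices))   # 1596 batches: 1595 full ones plus one partial
print(slices[-1])    # (159500, 159571), i.e. the last 71 rows
```

In the embedding loop, each pair would be used as `padded[start:stop]` and `attention_mask[start:stop]`, so every row is processed in a single pass.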

The embeddings are ready, so let's train the models.

# Model building

## Train and Evaluate 70-30 split

In [29]:
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    df['toxic'], 
                                                    test_size=0.3, 
                                                    stratify = df['toxic'], 
                                                    random_state=random_state)

print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)

Train set size:  (111699, 768)
Test set size:  (47872, 768)


In [47]:
%%time
f1_results = {}

#Logistic regression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['Logistic Regression'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.75
CPU times: user 54.8 s, sys: 230 ms, total: 55.1 s
Wall time: 55.4 s


In [48]:
%%time

#Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state = random_state)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['Random Forest'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.61
CPU times: user 8min 50s, sys: 857 ms, total: 8min 51s
Wall time: 8min 48s


In [49]:
%%time

#XGboost Classifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)

y_pred = xgb.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['XG boost'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.673


## Compare the algorithms

In [52]:
pd.DataFrame.from_dict(f1_results, orient='index', columns=['F1 score'])

|   | F1 score |
|---|----------|
| Logistic Regression | 0.75 |
| Random Forest | 0.61 |
| XG boost | 0.67 |


Logistic regression shows the best result. Tuning the hyperparameters of Random Forest and XGBoost might help them catch up. First, though, I want to train on a larger share of the data: given the class imbalance, the models may not have seen enough toxic examples.

## Train and Evaluate 80-20 split

In [53]:
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    df['toxic'], 
                                                    test_size=0.2, 
                                                    stratify = df['toxic'], 
                                                    random_state=random_state)

print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)

Train set size:  (127656, 768)
Test set size:  (31915, 768)


In [54]:
%%time
f1_results = {}

#Logistic regression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['Logistic Regression'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.759
CPU times: user 1min 6s, sys: 1.18 s, total: 1min 8s
Wall time: 1min 9s


In [55]:
%%time

#Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state = random_state)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['Random Forest'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.605
CPU times: user 11min 13s, sys: 1.08 s, total: 11min 14s
Wall time: 11min 16s


In [56]:
%%time

#XGboost Classifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)

y_pred = xgb.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
f1_results['XG boost'] = round(f1_result, 3)
print('F1 score:', round(f1_result, 3))

F1 score: 0.674
CPU times: user 12min 57s, sys: 1.54 s, total: 12min 59s
Wall time: 12min 55s


In [58]:
pd.DataFrame.from_dict(f1_results, orient='index', columns=['F1 score'])

|   | F1 score |
|---|----------|
| Logistic Regression | 0.759 |
| Random Forest | 0.605 |
| XG boost | 0.674 |


The F1 score improved slightly for Logistic Regression and XGBoost, but Logistic Regression still leads by a clear margin.

I will tune the XGBoost hyperparameters to see whether it can beat Logistic Regression.

In [60]:
params = {}
params['n_estimators'] = 200
params['max_depth'] = 3
params['learning_rate'] = 0.08

#XGboost Classifier
xgb = XGBClassifier(**params)
xgb.fit(X_train,y_train)

y_pred = xgb.predict(X_test)

#Evaluating predictions
f1_result = f1_score(y_test, y_pred)
print('F1 score:', round(f1_result, 3))

F1 score: 0.693


Tuning raised the F1 score by about 0.02 (from 0.674 to 0.693), which is a noticeable gain. This suggests that, with more time and compute, XGBoost might match Logistic Regression. A thorough search would be slow, though, since a single training run takes 10 minutes or more.
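
To give a feel for that cost, the sketch below expands a small hypothetical grid around the values tried above and estimates the total training time. The grid values are my own illustration, not a search that was actually run, and the per-fit time is a rough figure assumed to be constant, although in practice it depends on `n_estimators` and `max_depth`.

```python
from itertools import product

# Hypothetical search space around the values tried above; not an actual experiment.
param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.08, 0.1],
}

# Expand the grid into one keyword-argument dict per combination.
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

minutes_per_fit = 13  # rough wall time of one XGBoost fit; assumed constant per fit
total_minutes = len(combinations) * minutes_per_fit

print(len(combinations), "fits, about", total_minutes, "minutes in total")
# 27 fits, about 351 minutes in total
```

Even this modest grid comes to close to six hours with a single train/test split, and cross-validation would multiply that by the number of folds. Each dict in `combinations` could be passed as `XGBClassifier(**params)`.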

# Results

In this project I used a pretrained DistilBERT model to get embeddings. This let me turn about 160,000 tweets of varying length into fixed-size vectors that the models could be trained on.

The DistilBERT model is relatively light (although creating the embeddings still took quite a while), but it accepts at most 512 tokens per input. For this reason I truncated the ~2% of tweets that had more than 512 tokens and padded the shorter ones.

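Next I experimented with the train/test ratio. Moving from a 70-30 to an 80-20 split slightly raised the F1 score for Logistic Regression (0.750 to 0.759) and XGBoost (0.673 to 0.674), while Random Forest dropped slightly (0.610 to 0.605).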

I trained and compared three models: Logistic regression, Random Forest and XGBoost. 

Logistic regression clearly beat the other models with an F1 score of 0.75-0.76.
XGBoost came second (0.67, rising to 0.69 after a first round of tuning). I believe further tuning could bring it closer to Logistic Regression.

Even if the scores were equal, Logistic Regression would still be the more practical choice: it trains in about a minute, roughly 10 times faster than Random Forest and XGBoost.
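
The "roughly 10 times" figure can be checked against the wall times that `%%time` reported for the 80-20 split. This is a small sketch of that arithmetic only; the timings are single measurements on one machine, so the ratios are indicative at best.

```python
# Wall times reported by %%time for the 80-20 split, converted to seconds.
wall_seconds = {
    "Logistic Regression": 1 * 60 + 9,    # 1min 9s
    "Random Forest": 11 * 60 + 16,        # 11min 16s
    "XGBoost": 12 * 60 + 55,              # 12min 55s
}

# Express every training time relative to Logistic Regression.
baseline = wall_seconds["Logistic Regression"]
slowdown = {name: secs / baseline for name, secs in wall_seconds.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.1f}x")
# Logistic Regression: 1.0x
# Random Forest: 9.8x
# XGBoost: 11.2x
```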