<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> 3.3 Feature Extraction (PyTorch HuggingFace Transformer)</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Table of Contents</h2>
</div>

* [Required Libraries and Modules](#Required-Libraries-and-Modules)
* [Import Clean Text Data](#Import-Clean-Text-Data)
* [BERT and its variants Word Embeddings PyTorch HuggingFace Transformers](#BERT-and-its-variants-Word-Embeddings-PyTorch-HuggingFace-Transformers)
* [Combination of Features](#Combination-of-Features)

**Notes:**

**How can I combine different features?**

Where possible, keep your feature matrices sparse for as long as you can, because a sparse matrix stores only the non-zero entries and so saves a lot of memory. Even if your classifier requires dense input, keep the TF-IDF features sparse, add the other features to them in sparse format, and convert the combined matrix to dense only at the very end.

To do that, use `scipy.sparse.hstack`, which joins sparse matrices column-wise; `scipy.sparse.vstack` does the same row-wise. The dense counterparts are `numpy.hstack` and `numpy.vstack`; the `scipy.hstack` and `scipy.vstack` names were only deprecated aliases for them.
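
As a minimal sketch of the note above, the snippet below stacks a sparse matrix with a dense numeric column. The matrix values and the `word_count` column are made-up stand-ins for real TF-IDF output and a real hand-crafted feature, not data from this notebook.

```python
import numpy as np
from scipy import sparse

# Stand-in for TF-IDF output: a mostly-zero matrix stored in CSR format (values made up)
X_tfidf = sparse.csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]))

# Stand-in for a dense per-document feature such as a word count
word_count = np.array([[4], [4], [3]])

# Wrap the dense column as sparse, then stack column-wise
X_combined = sparse.hstack([X_tfidf, sparse.csr_matrix(word_count)], format="csr")
print(X_combined.shape)

# Convert to dense only at the point where a model requires it
X_dense = X_combined.toarray()
```

Passing `format="csr"` keeps the result in a row-sliceable format, which is convenient for a later train/test split.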

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Required Libraries and Modules</h2>
</div>

In [1]:
# Import Dependencies
%matplotlib inline

# Begin Python Imports
import datetime, warnings, scipy
warnings.filterwarnings("ignore")
import pickle
import gc

# Data Manipulation
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import hstack
pd.set_option('display.max_columns', None)

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
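# The 'seaborn-whitegrid' matplotlib style name was deprecated in matplotlib 3.6;
# setting the style through seaborn works across versions
sns.set_style('whitegrid')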

# Progress bar
from tqdm.notebook import tqdm_notebook
from tqdm import tqdm
tqdm_notebook.pandas()

# Feature Extraction -  Textual Features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score, 
    accuracy_score, 
    confusion_matrix, 
    classification_report, 
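    # plot_confusion_matrix / plot_precision_recall_curve were removed in scikit-learn 1.2;
    # the Display classes below are their replacements
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay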
)

# Transformer models (warnings are already imported and silenced above)
import torch
import transformers as ppb

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import Clean Text Data</h2>
</div>

In [2]:
###############################################################
# Note: Change the name of data set used for feature creation
###############################################################
task = 'bully_binary_classification'
data_set='bully_data_clean_no_stopword_all'
    
    
###################
# Import Data Set #
###################
bully_data_cleaned = pd.read_csv(data_set+'.csv', encoding='utf8')
bully_data_cleaned = bully_data_cleaned.drop(['ner','pos','Unnamed: 0'],axis=1)
bully_data_cleaned = bully_data_cleaned[~bully_data_cleaned['text_check'].isna()]
bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['text_check'] != ""]
# bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['role']!='None']
bully_data_cleaned = bully_data_cleaned.reset_index(drop=True)

In [3]:
bully_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107101 entries, 0 to 107100
Data columns (total 64 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tag                      107101 non-null  object 
 1   text                     107101 non-null  object 
 2   label                    107101 non-null  object 
 3   role                     107101 non-null  object 
 4   harmfulness_score        107101 non-null  int64  
 5   oth_language             107101 non-null  int64  
 6   file_index               107101 non-null  object 
 7   word_count               107101 non-null  int64  
 8   char_count               107101 non-null  int64  
 9   avg_word_len             107101 non-null  float64
 10  stopword_count           107101 non-null  int64  
 11  hashtag_count            107101 non-null  int64  
 12  mention_count            107101 non-null  int64  
 13  digit_counts             107101 non-null  int64  
 14  uppe

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>BERT and its variants Word Embeddings PyTorch HuggingFace Transformers</h2>
</div>

In [5]:
##########################################################
# Word Embedding (BERT and its variants) HuggingFace Hub #
##########################################################

def get_bert_features_huggingface(df=bully_data_cleaned,
                                  type= "distilbert",
                                  chunk_size=50):
    
    '''
    -------------
     Description 
    -------------
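    Umbrella function to extract embedding features from BERT and its variants
    using pretrained models from the HuggingFace Hub (PyTorch).
    The hidden state of the first token of each text is used as its embedding.
    
    ------------
     Parameters
    ------------

    df: data frame containing a 'text_check' column with the cleaned text
    type: specify the type of bert or its variants
    - mobilebert
    - roberta
    - distilbert
    
    chunk_size: number of rows passed to the model in one forward pass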
    
    '''
    
    # Selection of model, tokenizer, and pretrained model
    if type == "distilbert":
        dimension=768
        model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

    elif type == "roberta":
        dimension=768
        model_class, tokenizer_class, pretrained_weights = (ppb.RobertaModel, ppb.RobertaTokenizer, 'roberta-base')

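    elif type == "mobilebert":
        dimension=512
        model_class, tokenizer_class, pretrained_weights = (ppb.MobileBertModel, ppb.MobileBertTokenizer, 'google/mobilebert-uncased')

    else:
        # Fail early instead of hitting a NameError further down
        raise ValueError("type must be 'distilbert', 'roberta' or 'mobilebert', got: " + str(type))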

             
    # Load pretrained model/tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights).to('cuda:0')


    # Preprocess
    tokenized = df['text_check'].progress_apply(lambda x: tokenizer.encode(x, truncation=True,max_length=512,add_special_tokens=True))


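    # Check maximum length of all inputs
    max_len = max(len(ids) for ids in tokenized.values)
    print(max_len)
    
    
    # Padding - use the tokenizer's own pad token id
    # (0 for DistilBERT and MobileBERT, but 1 for RoBERTa, where id 0 is the <s> start token)
    pad_id = tokenizer.pad_token_id
    padded = np.array([ids + [pad_id]*(max_len-len(ids)) for ids in tokenized.values], dtype=np.int64)


    # Attention mask - 1 for real tokens, 0 for padding.
    # Built from the sequence lengths so that a real token whose id happens to be 0 is never masked.
    lengths = np.array([len(ids) for ids in tokenized.values])
    attention_mask = (np.arange(max_len) < lengths[:, None]).astype(np.int64)

    # Move the inputs to the same device as the model
    input_ids = torch.tensor(padded).to(model.device)
    attention_mask = torch.tensor(attention_mask).to(model.device)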

    
    
    ##################
    # Loop by chunks #
    ##################
    
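    chunks = []
    for i in tqdm(np.arange(0,df.shape[0],chunk_size)):
        with torch.no_grad():
            last_hidden_states = model(input_ids[i:i+chunk_size], attention_mask=attention_mask[i:i+chunk_size])

        # Keep only the hidden state of the first token ([CLS] or <s>) of each text
        features = last_hidden_states[0][:,0,:].cpu().numpy()
        chunks.append(features)

    # Stack once at the end rather than re-copying the growing array on every iteration
    if chunks:
        bert_embedding = np.vstack(chunks).astype(float)
    else:
        bert_embedding = np.empty((0, dimension), float)
        
    # The embeddings are dense; CSR is used only so they can be hstack-ed with sparse features
    bert_embedding_sparse = sparse.csr_matrix(bert_embedding)
    return bert_embedding_sparse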
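
The padding and attention-mask step inside the function can be illustrated in isolation. The sketch below uses NumPy only; `pad_and_mask` is a hypothetical helper, and the token ids in `batch` are made up, apart from the RoBERTa special-token ids (0 for `<s>`, 1 for `<pad>`, 2 for `</s>`). It shows why a mask derived from `padded != 0` is only safe when 0 really is the padding id.

```python
import numpy as np

def pad_and_mask(token_ids, pad_id):
    """Right-pad variable-length id lists and build the matching attention mask."""
    lengths = np.array([len(ids) for ids in token_ids])
    max_len = int(lengths.max())
    padded = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(token_ids):
        padded[row, :len(ids)] = ids
    # 1 for real tokens, 0 for padding; derived from lengths, not from token values
    mask = (np.arange(max_len) < lengths[:, None]).astype(np.int64)
    return padded, mask

# RoBERTa-style sequences: 0 = <s>, 2 = </s>, 1 = <pad>; the other ids are made up
batch = [[0, 713, 16, 2], [0, 31414, 2]]
padded, mask = pad_and_mask(batch, pad_id=1)
print(padded)
print(mask)

# A mask derived from `padded != 0` hides the leading <s> token
# and treats the padding (id 1) as a real token
naive_mask = np.where(padded != 0, 1, 0)
print(naive_mask)
```

Because the first token's hidden state is what the function keeps as the embedding, masking that token out would directly affect the extracted features.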

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Combination of Features</h2>
</div>

In [None]:
##############
# DistilBert #
##############
print("Generating DistilBertEmbedding features")
# About 1.5 hours
X_DistilBertEmbedding=get_bert_features_huggingface(df=bully_data_cleaned,
                                  type="distilbert",
                                  chunk_size=50)

# Forward slashes work on Windows as well as Linux/macOS
with open(task+"/"+data_set+"/features/X_DistilBertEmbedding.pkl",'wb') as f:
    pickle.dump(X_DistilBertEmbedding, f)
    
print("Shape: "+str(X_DistilBertEmbedding.shape)) # check shape   

del X_DistilBertEmbedding # to free up memory
gc.collect()   

In [None]:
###########
# RoBerta #
###########
print()
print("Generating RoBertaEmbedding features")
# About 2 hours
X_RoBertaEmbedding=get_bert_features_huggingface(df=bully_data_cleaned,
                                  type="roberta",
                                  chunk_size=20)

# Forward slashes work on Windows as well as Linux/macOS
with open(task+"/"+data_set+"/features/X_RoBertaEmbedding.pkl",'wb') as f:
    pickle.dump(X_RoBertaEmbedding, f)
    
print("Shape: "+str(X_RoBertaEmbedding.shape)) # check shape   
    
del X_RoBertaEmbedding # to free up memory
gc.collect()   

In [None]:
##############
# mobileBert #
##############
print()
print("Generating mobileBertEmbedding features")
# About 35 minutes
X_mobileBertEmbedding=get_bert_features_huggingface(df=bully_data_cleaned,
                                  type="mobilebert",
                                  chunk_size=40)

# Forward slashes work on Windows as well as Linux/macOS
with open(task+"/"+data_set+"/features/X_mobileBertEmbedding.pkl",'wb') as f:
    pickle.dump(X_mobileBertEmbedding, f)
    
print("Shape: "+str(X_mobileBertEmbedding.shape)) # check shape   
    
del X_mobileBertEmbedding # to free up memory
gc.collect()   
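
Once the feature sets are on disk, they can be reloaded and combined later without recomputing the embeddings. The following is a minimal sketch of that round trip, assuming each file holds one pickled sparse matrix with the same number of rows. The file names `X_A.pkl` and `X_B.pkl`, the temporary directory, and the random values are hypothetical stand-ins for the real feature files.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse

# Stand-ins for two saved feature sets (random values, illustration only)
rng = np.random.default_rng(0)
features = {
    "X_A.pkl": sparse.csr_matrix(rng.normal(size=(4, 8))),
    "X_B.pkl": sparse.csr_matrix(rng.normal(size=(4, 5))),
}

with tempfile.TemporaryDirectory() as feature_dir:
    # Save each matrix with pickle, as the cells above do
    for name, matrix in features.items():
        with open(os.path.join(feature_dir, name), "wb") as f:
            pickle.dump(matrix, f)

    # Later: reload every matrix in the same order
    loaded = []
    for name in features:
        with open(os.path.join(feature_dir, name), "rb") as f:
            loaded.append(pickle.load(f))

# Stack column-wise; every matrix must have the same number of rows
X_all = sparse.hstack(loaded, format="csr")
print(X_all.shape)  # (4, 13)
```

The rows of each saved matrix follow the row order of the data frame at extraction time, so the same filtered and re-indexed data frame should be used when pairing features with labels.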