<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> 3.4 Feature Extraction (TensorFlow)</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Table of Contents</h2>
</div>

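* [Required Libraries and Modules](#Required-Libraries-and-Modules)
* [Import Clean Text Data](#Import-Clean-Text-Data)
* [BERT and its variants Word Embeddings TensorFlow](#BERT-and-its-variants-Word-Embeddings-TensorFlow)
* [ELMO Word Embeddings](#ELMO-Word-Embeddings)
* [nnlm Word Embeddings](#nnlm-Word-Embeddings)
* [Combination of Features](#Combination-of-Features)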

**Notes:**

**How can I combine different features?**

Keep the feature matrix sparse for as long as possible, because sparse storage saves a lot of memory. Even if your classifier requires dense input, keep the TF-IDF features sparse, add the other features to them in sparse format, and only then convert the combined matrix to a dense one.

To do that, use `scipy.sparse.hstack`, which concatenates sparse matrices column-wise; `scipy.sparse.vstack` does the same row-wise. The dense counterparts are `numpy.hstack` and `numpy.vstack`.
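
As a minimal sketch of that advice, the snippet below stacks two small sparse matrices column-wise and converts to dense only at the end. The matrices `tfidf` and `extra` are hypothetical stand-ins, not data from this notebook.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins: 3 documents, 4 TF-IDF columns and 2 extra numeric features
tfidf = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                                    [0.3, 0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.0, 0.9]]))
extra = sparse.csr_matrix(np.array([[1.0, 0.0],
                                    [0.0, 2.0],
                                    [3.0, 0.0]]))

# Column-wise concatenation; the row counts must match
combined = sparse.hstack([tfidf, extra], format="csr")
print(combined.shape)  # (3, 6)

# Densify only at the point where a model requires a dense array
dense = combined.toarray()
```

Only the six non-zero values are stored until `toarray()` is called, which is where the memory saving comes from.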

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Required Libraries and Modules</h2>
</div>

In [None]:
# Import Dependencies
%matplotlib inline

# Begin Python Imports
import datetime, warnings, scipy
warnings.filterwarnings("ignore")
import pickle
import gc

# Data Manipulation
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import hstack
pd.set_option('display.max_columns', None)

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
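# 'seaborn-whitegrid' is no longer a valid style name in recent matplotlib releases;
# seaborn's own whitegrid style is used instead
sns.set_style('whitegrid')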

# Progress bar
from tqdm.notebook import tqdm_notebook  # tqdm._tqdm_notebook is the deprecated location
from tqdm import tqdm
tqdm_notebook.pandas()

# Feature Extraction -  Textual Features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
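from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    accuracy_score,
    confusion_matrix,
    classification_report,
    # plot_confusion_matrix and plot_precision_recall_curve were removed in
    # scikit-learn 1.2; the Display classes below replace them
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay
)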

# PyTorch and Hugging Face Transformers
# (warnings are already imported and silenced at the top of this cell)
import torch
import transformers as ppb

In [None]:
# TensorFlow
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops required by the TF Hub preprocessing models

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import Clean Text Data</h2>
</div>

In [None]:
###############################################################
# Note: Change the name of data set used for feature creation
###############################################################
task = 'bully_binary_classification'
data_set='bully_data_clean_no_stopword_pronoun'
    
    
###################
# Import Data Set #
###################
bully_data_cleaned = pd.read_csv(data_set+'.csv', encoding='utf8')
bully_data_cleaned = bully_data_cleaned.drop(['ner','pos','Unnamed: 0'],axis=1)
bully_data_cleaned = bully_data_cleaned[~bully_data_cleaned['text_check'].isna()]
bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['text_check'] != ""]
bully_data_cleaned = bully_data_cleaned.reset_index(drop=True)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>BERT and its variants Word Embeddings TensorFlow</h2>
</div>

- bert
- electra
- albert
- tnbert
- ggelubert
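
Embeddings for these models are extracted chunk by chunk so that only a limited number of texts is encoded at a time. The sketch below illustrates that pattern with NumPy only. `fake_encoder` and `encode_in_chunks` are hypothetical names, and the encoder is a stand-in that returns one vector per sentence rather than a real TensorFlow Hub model.

```python
import numpy as np

def fake_encoder(sentences, dim=4):
    """Hypothetical stand-in for a TensorFlow Hub encoder: one vector per sentence."""
    return np.array([[float(len(s))] * dim for s in sentences])

def encode_in_chunks(sentences, chunk_size, dim=4):
    # Start with an empty (0, dim) block so an empty input still yields a 2-D array
    chunks = [np.empty((0, dim), float)]
    for start in range(0, len(sentences), chunk_size):
        batch = sentences[start:start + chunk_size]
        chunks.append(fake_encoder(batch, dim))
    # Stack once at the end instead of growing the array inside the loop
    return np.vstack(chunks)

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
embeddings = encode_in_chunks(texts, chunk_size=2)
print(embeddings.shape)  # (5, 4)
```

A smaller `chunk_size` lowers peak memory use at the cost of more encoder calls, which is presumably why the larger models in this notebook are run with smaller chunks.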


In [None]:
#########################################################
# Word Embedding (BERT and its variants) Tensorflow Hub #
#########################################################

def get_bert_features(df=bully_data_cleaned,
                  type="bert",
                  chunk_size=100,
                  preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
                  encoder_url="https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"):
    
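    '''
    -------------
     Description 
    -------------
    Umbrella function to extract sentence-level embeddings (the encoder's
    pooled output) for BERT and its variants from TensorFlow Hub
    
    ------------
     Parameters
    ------------

    df: data frame with the cleaned text in a 'text_check' column
    type: type of BERT or its variants; only used to set the embedding dimension
    - bert, electra, albert, tnbert, ggelubert (768, the default for any other value)
    - mobilebert (512)
    - lambert, roberta (1024)
    
    chunk_size: number of texts encoded per batch
    preprocess_url: TensorFlow Hub URL of the preprocessing model that matches the encoder
    encoder_url: TensorFlow Hub URL of the encoder model
    
    Returns a scipy.sparse CSR matrix with one row per text
    
    '''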

    def get_sentence_embedding(sentences):
        # Tokenize the raw strings, then keep one pooled vector per sentence
        preprocessed_text = bert_preprocess(sentences)
        return bert_encoder(preprocessed_text)['pooled_output']

    
    # Embedding dimension of each BERT variant
    if type == "mobilebert":
        dim = 512
    elif type in ("lambert", "roberta"):
        dim = 1024
    else:
        dim = 768

        
    bert_preprocess = hub.KerasLayer(preprocess_url)
    bert_encoder = hub.KerasLayer(encoder_url)

    # Collect the chunks in a list and stack them once at the end;
    # np.vstack inside the loop would re-copy the whole array on every iteration
    chunks = [np.empty((0, dim), float)]
    
    for i in tqdm(np.arange(0, df.shape[0], chunk_size)):
        temp_list = df['text_check'][i:i+chunk_size].to_list()
        chunks.append(get_sentence_embedding(temp_list).numpy())
        
    bert_embedding = np.vstack(chunks)
    
    # The embeddings are dense; CSR format keeps them compatible with scipy.sparse.hstack
    bert_embedding_sparse = sparse.csr_matrix(bert_embedding)
    return bert_embedding_sparse

In [None]:
# About 55 minutes
# X_bert_features=get_bert_features(df=bully_data_cleaned,
#                                   type="bert",
#                                   chunk_size=100,
#                                      preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                      encoder_url="https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [None]:
# About 1 hour
# X_electra_features=get_bert_features(df=bully_data_cleaned,
#                                      type="electra",
#                                      chunk_size=100,
#                                          preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                          encoder_url="https://tfhub.dev/google/electra_base/2")

In [None]:
# About 50 minutes
# X_albert_features=get_bert_features(df=bully_data_cleaned,
#                                     type="albert",
#                                     chunk_size=100,
#                                         preprocess_url="http://tfhub.dev/tensorflow/albert_en_preprocess/3",
#                                          encoder_url="https://tfhub.dev/tensorflow/albert_en_base/3")


In [None]:
# About 1 hour
# X_tnbert_features=get_bert_features(df=bully_data_cleaned,
#                                     type="tnbert",
#                                     chunk_size=100,
#                                         preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                         encoder_url="https://tfhub.dev/google/tn_bert/1")

In [None]:
# About 1 hour
# X_ggelubert_features=get_bert_features(df=bully_data_cleaned,
#                                        type="ggelubert",
#                                        chunk_size=100,
#                                         preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                         encoder_url="https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/2")

In [None]:
# X_lambert_features=get_bert_features(df=bully_data_cleaned,
#                                      type="lambert",
#                                      chunk_size=10,
#                                         preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                         encoder_url="https://tfhub.dev/tensorflow/lambert_en_uncased_L-24_H-1024_A-16/2")

In [None]:
# Change to Hugging Face
# X_distilbert_features=get_bert_features(df=bully_data_cleaned,
#                                         type="distilbert",
#                                         chunk_size=10,
#                                         preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                         encoder_url="https://tfhub.dev/jeongukjae/distilbert_en_uncased_L-6_H-768_A-12/1")

In [None]:
# Change to Hugging Face
# X_mobilebert_features=get_bert_features(df=bully_data_cleaned,
#                                         type="mobilebert",
#                                         chunk_size=10,
#                                             preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
#                                             encoder_url="https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT/1")

In [None]:
# X_roberta_features=get_bert_features(df=bully_data_cleaned,
#                                         type="roberta",
#                                         chunk_size=1,
#                                         preprocess_url="https://tfhub.dev/jeongukjae/roberta_en_cased_preprocess/1",
#                                         encoder_url="https://tfhub.dev/jeongukjae/roberta_en_cased_L-24_H-1024_A-16/1")

### ELMO Word Embeddings
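
ELMo produces one vector per token, so a sentence-level feature needs a pooling step; this notebook averages over the token axis. The NumPy sketch below shows that plain mean on a hypothetical array. It also shows a length-aware mean, which is an assumption added for comparison and not something the notebook does. Whether padded positions affect the plain mean depends on what the model emits for them, which is not verified here.

```python
import numpy as np

# Hypothetical token embeddings: 2 sentences, up to 3 tokens each, 4 dimensions
token_embeddings = np.arange(24, dtype=float).reshape(2, 3, 4)

# Plain mean over the token axis, the NumPy equivalent of tf.reduce_mean(embed, 1)
sentence_mean = token_embeddings.mean(axis=1)

# Length-aware variant (an assumption, not part of the notebook):
# average only over the real tokens of each sentence
lengths = np.array([3, 2])  # the second sentence has one padded position
positions = np.arange(token_embeddings.shape[1])[None, :]
mask = (positions < lengths[:, None]).astype(float)
masked_mean = (token_embeddings * mask[:, :, None]).sum(axis=1) / lengths[:, None]

print(sentence_mean.shape, masked_mean.shape)  # (2, 4) (2, 4)
```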

In [None]:
def get_elmo_features(df=bully_data_cleaned):
    
    '''
    -------------
    Description 
    -------------
    Umbrella function to extract ELMo word embedding features,
    averaged over tokens to give one vector per text

    ------------
    Parameters
    ------------

    df: data frame with the cleaned text in a 'text_check' column

    '''
    elmo_model = hub.load("https://tfhub.dev/google/elmo/3")

    chunk_size = 5
    dim = 1024  # ELMo vector dimension

    # Convert the column to a list once instead of on every iteration
    texts = df['text_check'].to_list()
    chunks = [np.empty((0, dim), float)]
   
    for i in tqdm(np.arange(0, len(texts), chunk_size)):
        temp_list = tf.constant(texts[i:i+chunk_size], dtype=tf.string)
        embed = elmo_model.signatures['default'](temp_list)['elmo']
        
        # Average the token-level ELMo vectors into one vector per text
        chunks.append(tf.reduce_mean(embed, 1).numpy())
        
    # Stack all chunks once at the end
    elmo_embedding = np.vstack(chunks)
    elmo_embedding_sparse = sparse.csr_matrix(elmo_embedding)
    return elmo_embedding_sparse

In [None]:
# About 45 minutes
# X_elmo_features = get_elmo_features(df=bully_data_cleaned)

### nnlm Word Embeddings
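```python
# Basic usage of an nnlm layer (this example uses the 50-dimensional model)
# train_examples is not defined in this notebook; it stands for any batch of strings
model = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(model, input_shape=[], dtype=tf.string, trainable=True)
hub_layer(train_examples[:3])
```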

In [None]:
def get_nnlm_features(df=bully_data_cleaned):
    '''
    Extract 128-dimensional nnlm embeddings for df['text_check']
    and return them as a scipy.sparse CSR matrix
    '''

    nnlm_model = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")
    dimension = 128

    def transform_features(X, dimension):
        # Stack the per-row (1, dimension) tensors once instead of appending in a loop
        rows = [np.empty((0, dimension), float)]
        rows.extend(x.numpy() for x in tqdm(X))
        return np.vstack(rows)

    # Embed one text at a time; each result has shape (1, dimension)
    X = df['text_check'].progress_apply(lambda x: nnlm_model([x]))
    X_nnlm_features = transform_features(X=X, dimension=dimension)
    X_nnlm_features_sparse = sparse.csr_matrix(X_nnlm_features)
    
    return X_nnlm_features_sparse

In [None]:
# Within 30 minutes
# X_nnlm_features = get_nnlm_features(df=bully_data_cleaned)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Combination of Features</h2>
</div>
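
The cells in this section build their output paths with hard-coded backslashes and assume the target folder already exists. A portable alternative could look like the sketch below. `save_feature` and `load_feature` are hypothetical helpers, not functions used elsewhere in this notebook, and the demonstration writes a small stand-in matrix to a temporary folder.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy import sparse

def save_feature(matrix, out_dir, name):
    """Pickle a feature matrix to <out_dir>/<name>.pkl, creating the folder if needed."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(matrix, f)
    return path

def load_feature(path):
    """Load a pickled feature matrix back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demonstration on a temporary folder with a small stand-in matrix
demo = sparse.csr_matrix(np.eye(3))
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "task", "data_set", "features")
    saved_path = save_feature(demo, out_dir, "X_demo")
    restored = load_feature(saved_path)

print(restored.shape)  # (3, 3)
```

With the notebook's variables, the call would take the form `save_feature(X_BertEmbedding, os.path.join(task, data_set, "features"), "X_BertEmbedding")`.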

In [None]:
########
# Bert #
########
print("Generating bert features")
# About 1 hour
X_BertEmbedding=get_bert_features(df=bully_data_cleaned,
                                  type="bert",
                                  chunk_size=100,
                                  preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
                                  encoder_url="https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

print("Shape: "+str(X_BertEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_BertEmbedding.pkl",'wb') as f:
    pickle.dump(X_BertEmbedding, f)
    
del X_BertEmbedding # to free up memory
gc.collect()    
    

In [None]:

##########
# albert #
##########
print()
print("Generating albert features")
# About 1 hour
X_AlbertEmbedding=get_bert_features(df=bully_data_cleaned,
                                    type="albert",
                                    chunk_size=100,
                                    preprocess_url="https://tfhub.dev/tensorflow/albert_en_preprocess/3",
                                    encoder_url="https://tfhub.dev/tensorflow/albert_en_base/3")

print("Shape: "+str(X_AlbertEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_AlbertEmbedding.pkl",'wb') as f:
    pickle.dump(X_AlbertEmbedding, f)
    
del X_AlbertEmbedding # to free up memory
gc.collect()

In [None]:
###########
# electra #
###########
print()
print("Generating electra features")
# About 46 minutes
X_ElectraEmbedding=get_bert_features(df=bully_data_cleaned,
                                     type="electra",
                                     chunk_size=100,
                                     preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
                                     encoder_url="https://tfhub.dev/google/electra_base/2")

print("Shape: "+str(X_ElectraEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_ElectraEmbedding.pkl",'wb') as f:
    pickle.dump(X_ElectraEmbedding, f)
    
del X_ElectraEmbedding # to free up memory
gc.collect()

In [None]:
##########
# tnbert #
##########
print()
print("Generating tnbert features")
# About 1 hour
X_tnBertEmbedding=get_bert_features(df=bully_data_cleaned,
                                    type="tnbert",
                                    chunk_size=80,
                                    preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
                                    encoder_url="https://tfhub.dev/google/tn_bert/1")

print("Shape: "+ str(X_tnBertEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_tnBertEmbedding.pkl",'wb') as f:
    pickle.dump(X_tnBertEmbedding, f)
    
del X_tnBertEmbedding # to free up memory
gc.collect()    

In [None]:
#############
# ggelubert #
#############
print()
print("Generating ggelubert features")
# About 1 hour
X_ggeluBertEmbedding=get_bert_features(df=bully_data_cleaned,
                                       type="ggelubert",
                                       chunk_size=100,
                                       preprocess_url="https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
                                       encoder_url="https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/2")

print("Shape: "+str(X_ggeluBertEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_ggeluBertEmbedding.pkl",'wb') as f:
    pickle.dump(X_ggeluBertEmbedding, f)
    
del X_ggeluBertEmbedding # to free up memory
gc.collect()    
    

In [None]:
   
########
# elmo #
########
print()
print("Generating elmo features")
# About 45 minutes
X_elmoEmbedding = get_elmo_features(df=bully_data_cleaned)

print("Shape: "+str(X_elmoEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_elmoEmbedding.pkl",'wb') as f:
    pickle.dump(X_elmoEmbedding, f)
    
del X_elmoEmbedding # to free up memory
gc.collect()    


In [None]:
########
# nnlm #
########
print()
print("Generating nnlm features")
# Within 30 minutes
X_nnlmEmbedding = get_nnlm_features(df=bully_data_cleaned)

print("Shape: "+str(X_nnlmEmbedding.shape)) # check shape

with open(task+"\\"+data_set+"\\features\\X_nnlmEmbedding.pkl",'wb') as f:
    pickle.dump(X_nnlmEmbedding, f)
    
del X_nnlmEmbedding # to free up memory
gc.collect()

In [None]:
##########################
# Output as pickle files #
##########################

# Superseded: each feature set is pickled and then deleted in its own cell above,
# so the variables referenced below no longer exist when this cell runs.

# Feature sets #
# feature_set = { 
#                 'X_BertEmbedding': X_BertEmbedding,
#                 'X_ElectraEmbedding': X_ElectraEmbedding,
#                 'X_AlbertEmbedding': X_AlbertEmbedding,
#                 'X_tnBertEmbedding': X_tnBertEmbedding,
#                 'X_ggeluBertEmbedding': X_ggeluBertEmbedding,
#                 'X_elmoEmbedding': X_elmoEmbedding,
#                 'X_nnlmEmbedding': X_nnlmEmbedding
#              }

# for fname, fset in feature_set.items():
#     with open(task+"\\"+data_set+"\\features\\"+ fname + ".pkl",'wb') as f:
#         pickle.dump(fset, f)
