<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> 3.2 Feature Extraction (Toxicity)</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Table of Contents</h2>
</div>

* [Required Libraries and Modules](#Required-Libraries-and-Modules)
* [Import Clean Text Data](#Import-Clean-Text-Data)
* [Download Transformers Model - Toxic BERT](#Download-Transformers-Model---Toxic-BERT)
* [Toxicity Features](#Toxicity-Features)
* [Output Pickle Files](#Output-Pickle-Files)

**Notes:**

**How can I combine different features?**

Where possible, keep your matrices sparse for as long as you can, because this saves a lot of memory. Even if your classifier requires dense input, keep the TF-IDF features sparse, add the other features to them in sparse format, and only then convert the combined matrix to dense.

To do that, use `scipy.sparse.hstack`, which stacks sparse matrices column-wise (side by side). `scipy.sparse.vstack` does the same row-wise. The dense equivalents are `numpy.hstack` and `numpy.vstack`.
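The note above can be illustrated with a small sketch. The matrices here are made-up stand-ins (`X_tfidf` for a TF-IDF matrix, `X_extra` for a block of dense numeric features); they are not part of this notebook's data.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import hstack

# Hypothetical stand-in for a TF-IDF matrix: 4 documents x 5 terms, mostly zeros
X_tfidf = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0, 0.5],
    [0.7, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.8],
]))

# Hypothetical dense features: two numeric scores per document
X_extra = np.array([
    [0.9, 0.1],
    [0.2, 0.0],
    [0.0, 0.0],
    [0.6, 0.4],
])

# Convert the dense block to sparse first, then stack column-wise
X_combined = hstack([X_tfidf, sparse.csr_matrix(X_extra)], format="csr")
print(X_combined.shape)  # (4, 7)

# Densify only at the very end, and only if the estimator needs it
X_dense = X_combined.toarray()
```

Both inputs must have the same number of rows, in the same row order, for the stacked matrix to line up with the labels.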

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Required Libraries and Modules</h2>
</div>

In [11]:
# Import Dependencies
%matplotlib inline

# Begin Python Imports
import datetime, warnings, scipy
warnings.filterwarnings("ignore")
import pickle

# Data Manipulation
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import hstack
pd.set_option('display.max_columns', None)

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
# 'seaborn-whitegrid' was renamed in newer matplotlib releases; seaborn's own setter works across versions
sns.set_style('whitegrid')

# Progress bar
from tqdm.notebook import tqdm_notebook
from tqdm import tqdm
tqdm_notebook.pandas()

# Modelling
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, AutoModelForSequenceClassification
from transformers import pipeline

import transformers

In [12]:
# Tensorflow
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import Clean Text Data</h2>
</div>

In [15]:
###############################################################
# Note: Change the name of data set used for feature creation
###############################################################
data_set='bully_data_clean_with_stopword'
    
    
###################
# Import Data Set #
###################
bully_data_cleaned = pd.read_csv(data_set+'.csv', encoding='utf8')
bully_data_cleaned = bully_data_cleaned.drop(['ner','pos','Unnamed: 0'],axis=1)
# Drop uninformative columns
bully_data_cleaned = bully_data_cleaned.drop(['emails_count',
                                              'emoji_counts',
                                              'hashtag_count',
                                              'mention_count',
                                              'urls_count',
                                              'ner_EVENT_counts',
                                              'ner_FAC_counts', 
                                              'ner_LANGUAGE_counts',
                                              'ner_LAW_counts', 
                                              'ner_LOC_counts', 
                                              'ner_MONEY_counts',
                                              'ner_NORP_counts',
                                              'ner_ORDINAL_counts', 
                                              'ner_PERCENT_counts', 
                                              'ner_PRODUCT_counts',
                                              'ner_QUANTITY_counts', 
                                              'ner_TIME_counts', 
                                              'ner_WORK_OF_ART_counts'],axis=1)
                                              
bully_data_cleaned = bully_data_cleaned[~bully_data_cleaned['text_check'].isna()]
bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['text_check'] != ""]
#bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['role']!='None']
bully_data_cleaned = bully_data_cleaned.reset_index(drop=True)



In [17]:
bully_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112247 entries, 0 to 112246
Data columns (total 46 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tag                      112247 non-null  object 
 1   text                     112247 non-null  object 
 2   label                    112247 non-null  object 
 3   role                     112247 non-null  object 
 4   harmfulness_score        112247 non-null  int64  
 5   oth_language             112247 non-null  int64  
 6   file_index               112247 non-null  object 
 7   word_count               112247 non-null  int64  
 8   char_count               112247 non-null  int64  
 9   avg_word_len             112247 non-null  float64
 10  stopword_count           112247 non-null  int64  
 11  digit_counts             112247 non-null  int64  
 12  uppercase_count          112247 non-null  int64  
 13  punc_count               112247 non-null  int64  
 14  excl

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Download Transformers Model - Toxic BERT</h2>
</div>

To download the model and tokenizer selected previously, call the `from_pretrained` method of
`TFAutoModelForSequenceClassification` and `AutoTokenizer` (feel free to replace `model_name` with
any other sequence classification model from the model hub):

In [20]:
model_name = "unitary/toxic-bert"

# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, return_all_scores=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Toxicity Features</h2>
</div>

In [21]:
bully_data_cleaned['toxic_dict'] = bully_data_cleaned['text_check'].progress_apply(lambda x: classifier(x)[0])

  0%|          | 0/112247 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (565 > 512). Running this sequence through the model will result in indexing errors
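The warning above means that at least one comment tokenizes to more than the 512 tokens the model accepts. One common way to avoid this is to ask the tokenizer to truncate long inputs. The following is a minimal sketch, assuming the installed `transformers` version forwards `truncation` and `max_length` from the pipeline call to the tokenizer. `score_text` is a hypothetical helper, and `fake_classifier` is a stand-in that only mimics the output shape so the snippet runs without downloading the model.

```python
def score_text(classifier, text, max_length=512):
    """Return the list of label/score dicts for one text, truncating long inputs.

    ASSUMPTION: `classifier` accepts `truncation` and `max_length` keyword
    arguments and returns a list whose first element holds the scores,
    which is how the pipeline output is indexed in this notebook.
    """
    return classifier(text, truncation=True, max_length=max_length)[0]


def fake_classifier(text, truncation=False, max_length=None):
    """Hypothetical stand-in for the real pipeline (placeholder score, no model)."""
    n_tokens = len(text.split())
    if truncation and max_length is not None:
        n_tokens = min(n_tokens, max_length)
    # 'n_tokens' is added only to make the effect of truncation visible here
    return [[{"label": "toxic", "score": 0.0, "n_tokens": n_tokens}]]


# A 600-word input is cut down to the 512-token limit
result = score_text(fake_classifier, "word " * 600)
```

With the real pipeline, the call would be `bully_data_cleaned['text_check'].progress_apply(lambda x: score_text(classifier, x))`. Truncation discards everything after the limit, so toxic content that appears late in a very long comment is not scored.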


In [23]:
bully_data_cleaned.to_csv(data_set+'_toxic.csv')

In [27]:
#####################
# Toxicity Features #
#####################

def combine_toxicity_feature(df=bully_data_cleaned, type=None):
    
    '''
    -------------
     Description 
    -------------
    Umbrella function that combines the six Toxic BERT scores into one feature matrix
    
    ------------
     Parameters
    ------------

    df: data frame with a 'toxic_dict' column (one list of label/score dicts per row)
    type: 'role' keeps only rows with an assigned role; any other value keeps all rows (binary model)

    ---------
     Returns
    ---------
    scipy.sparse CSR matrix with one row per comment and one column per toxicity label

    '''
    
    # Initialize empty data frame
    X_toxic_comb_df = pd.DataFrame()
    df_filter = df.copy()
    
    # The role model only uses rows that have a role assigned
    if type == 'role':
        df_filter = df_filter[df_filter['role']!='None']
        
    # Scores are read by position, so this order must match the classifier's label order
    toxic_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for i, label in enumerate(toxic_labels):
        X_toxic_comb_df[label] = df_filter['toxic_dict'].progress_apply(lambda x, i=i: x[i]['score'])
            
    # Convert to a sparse matrix that can be fed into an sklearn model
    print("Consolidating all toxic List Feature. Done")
    X_toxic_comb = sparse.csr_matrix(X_toxic_comb_df.values) 
    
    return X_toxic_comb
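The function above reads each score by its position in the list, which quietly depends on the classifier returning labels in a fixed order. A name-based lookup is one way to remove that dependency. This is a sketch only: `scores_by_label` is a hypothetical helper, the label names are assumed to match the column names used above (check them against the model's configuration), and the sample scores are made up.

```python
import pandas as pd
from scipy import sparse

# ASSUMPTION: these names match the labels emitted by the classifier
TOXIC_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def scores_by_label(toxic_dicts, labels=TOXIC_LABELS):
    """Turn a Series of [{'label': ..., 'score': ...}, ...] lists into a DataFrame.

    Columns follow `labels` regardless of the order inside each list;
    a missing label raises a KeyError instead of shifting columns silently.
    """
    rows = [{d["label"]: d["score"] for d in entry} for entry in toxic_dicts]
    return pd.DataFrame(rows, index=toxic_dicts.index)[list(labels)]


# Made-up sample: the first row lists the labels in reverse order on purpose
sample = pd.Series([
    [{"label": l, "score": s}
     for l, s in zip(reversed(TOXIC_LABELS), [0.6, 0.5, 0.4, 0.3, 0.2, 0.1])],
    [{"label": l, "score": s}
     for l, s in zip(TOXIC_LABELS, [0.9, 0.0, 0.8, 0.0, 0.7, 0.0])],
])

features = scores_by_label(sample)
X = sparse.csr_matrix(features.values)
print(X.shape)  # (2, 6)
```

Building the whole frame in one pass also avoids six separate passes over the `toxic_dict` column.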

In [28]:
#####################
# Toxicity Features #
#####################
print()
print("Generating Toxicity features for Role Classification Model")
X_ToxicityRole = combine_toxicity_feature(df=bully_data_cleaned,type='role')

print()
print("Generating Toxicity features for Binary Classification Model")
X_Toxicity = combine_toxicity_feature(df=bully_data_cleaned,type=None)


Generating Toxicity features for Role Classification Model


  0%|          | 0/5375 [00:00<?, ?it/s]

Consolidating all toxic List Feature. Done

Generating Toxicity features for Binary Classification Model


  0%|          | 0/112247 [00:00<?, ?it/s]

Consolidating all toxic List Feature. Done


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Output Pickle Files</h2>
</div>

In [30]:
##########################
# Output as pickle files #
##########################
import os

# Build the paths with os.path.join so they also work outside Windows
with open(os.path.join("bully_binary_classification", data_set, "features", "selected", "X_Toxicity.pkl"),'wb') as f:
    pickle.dump(X_Toxicity, f)
    
with open(os.path.join("bully_role_classification", data_set, "features", "selected", "X_ToxicityRole.pkl"),'wb') as f:
    pickle.dump(X_ToxicityRole, f)