<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'> 3.1 Feature Extraction (Base)</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> Table of Contents</h2>
</div>

* [Required Libraries and Modules](#Required-Libraries-and-Modules)
* [Textual Features](#Textual-Features)
* [Sentiment Features](#Sentiment-Features)
* [Word Embeddings](#Word-Embeddings)
* [Personality Traits](#Personality-Traits)
* [Psycholinguistics](#Psycholinguistics)
* [Term Lists](#Term-Lists)
* [Combination of Features](#Combination-of-Features)
* [Target Classes](#Target-Classes)

**Notes:**

**How can I combine different features?**

Where possible, keep the feature matrix sparse for as long as you can, because sparse storage saves a lot of memory. Even if your classifier requires dense input, keep the TF-IDF features sparse, add the other features to them in sparse format, and only then convert the combined matrix to dense.

To do that, use `scipy.sparse.hstack`, which joins sparse matrices column-wise. `scipy.sparse.vstack` does the same row-wise. For dense arrays, the counterparts are `numpy.hstack` and `numpy.vstack`.
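
A minimal sketch of that workflow, using two toy matrices rather than the notebook's data (the names `X_tfidf_toy` and `X_extra_toy` are made up for illustration):

```python
import numpy as np
from scipy import sparse

# Toy stand-ins: a sparse TF-IDF-like block and a small dense block of extra features
X_tfidf_toy = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                          [0.7, 0.0, 0.0]]))
X_extra_toy = np.array([[3.0, 1.0],
                        [0.0, 2.0]])

# Convert the dense block to sparse, then stack the two blocks column-wise
X_combined = sparse.hstack([X_tfidf_toy, sparse.csr_matrix(X_extra_toy)], format="csr")
print(X_combined.shape)  # (2, 5)

# Densify only at the very end, and only if the model requires it
X_dense = X_combined.toarray()
```

Both blocks must have the same number of rows, in the same row order, for the stacked columns to line up.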

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Required Libraries and Modules</h2>
</div>

In [1]:
# Import Dependencies
%matplotlib inline

# Begin Python Imports
import datetime, warnings, scipy
warnings.filterwarnings("ignore")
import pickle

# Data Manipulation
import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import hstack
pd.set_option('display.max_columns', None)

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

# Progress bar
from tqdm.notebook import tqdm_notebook  # public module; tqdm._tqdm_notebook is deprecated
from tqdm import tqdm
tqdm_notebook.pandas()

# Feature Extraction -  Textual Features
import textstat
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score, 
    accuracy_score, 
    confusion_matrix, 
    classification_report, 
    plot_confusion_matrix,
    plot_precision_recall_curve
)

from sklearn.preprocessing import MaxAbsScaler

import torch
import transformers as ppb

In [2]:
scaler = MaxAbsScaler()

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import Clean Text Data</h2>
</div>

In [3]:
###############################################################
# Note: Change the name of data set used for feature creation
###############################################################
task = 'bully_binary_classification'
data_set='bully_data_clean_with_stopword'
    
    
###################
# Import Data Set #
###################
bully_data_cleaned = pd.read_csv(data_set+'.csv', encoding='utf8')
bully_data_cleaned = bully_data_cleaned.drop(['ner','pos','Unnamed: 0'],axis=1)
# Drop uninformative columns
bully_data_cleaned = bully_data_cleaned.drop(['emails_count',
                                              'emoji_counts',
                                              'hashtag_count',
                                              'mention_count',
                                              'urls_count',
                                              'ner_EVENT_counts',
                                              'ner_FAC_counts', 
                                              'ner_LANGUAGE_counts',
                                              'ner_LAW_counts', 
                                              'ner_LOC_counts', 
                                              'ner_MONEY_counts',
                                              'ner_NORP_counts',
                                              'ner_ORDINAL_counts', 
                                              'ner_PERCENT_counts', 
                                              'ner_PRODUCT_counts',
                                              'ner_QUANTITY_counts', 
                                              'ner_TIME_counts', 
                                              'ner_WORK_OF_ART_counts'],axis=1)
                                              
bully_data_cleaned = bully_data_cleaned[~bully_data_cleaned['text_check'].isna()]
bully_data_cleaned = bully_data_cleaned[bully_data_cleaned['text_check'] != ""]
bully_data_cleaned['role'] = bully_data_cleaned['role'].progress_apply(lambda x: 'Harasser' if x == "Bystander_assistant" else x)
bully_data_cleaned = bully_data_cleaned.reset_index(drop=True)



<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Textual Features</h2>
</div>

- Textual statistics
- TF-IDF
- N-grams (unigrams, bigrams, trigrams, quadgrams)
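
As a rough illustration of the TF-IDF and n-gram features listed above, the sketch below runs scikit-learn's vectorizers on a two-sentence toy corpus. The corpus and the chosen `ngram_range` values are assumptions for the example, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["you are so mean", "you are so kind"]  # toy corpus, not the notebook's data

# Word unigrams and bigrams, weighted by TF-IDF
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

# Raw counts of 2- and 3-character n-grams
char_counts = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X_char = char_counts.fit_transform(docs)

print(X_tfidf.shape, X_char.shape)
print(sorted(tfidf.vocabulary_))
```

`ngram_range=(min_n, max_n)` is always a tuple: `(1, 1)` gives unigrams only, `(1, 2)` adds bigrams, and so on. Both vectorizers return sparse matrices.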

In [5]:
####################
# Textual Features #
####################

def combine_feature_textual(df=bully_data_cleaned,
                            textual_stats=True,
                            pos_stats=True,
                            ner_stats=True,
                            tfidf=True,
                            tfidf_ngram=(1,1),
                            count_vectorizer_word=True,
                            word_ngram=(1,1),
                            count_vectorizer_char=True,
                            char_ngram=(1,1)
                           ):
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all Textual related features
    
    ------------
     Parameters
    ------------

    df: data frame name
    textual_stats: boolean
    pos_stats: boolean
    ner_stats: boolean
    tfidf: boolean
    tfidf_ngram: tuple (1,1)
    count_vectorizer_word: boolean
    count_vectorizer_char: boolean
    word_ngram: tuple (1,1)
    char_ngram: tuple (1,1)
    '''
    
    # Initialize empty data frame
    X_textual_other_df = pd.DataFrame()
    dummy = list(range(len(df)))
    X_textual_other_df['del'] = dummy
    
    
    # TF-IDF
    def feature_text_tfidf(df,n=(1,1)):
        tfidf = TfidfVectorizer(ngram_range=n)
        X_tfidf = tfidf.fit_transform(df['text_check'])
        return X_tfidf
    
    
    # CountVectorizer - Word
    def feature_text_ngram_word(df,n=(1,1)):
        countvec_word = CountVectorizer(ngram_range=n,analyzer="word")
        X_countvec_word = countvec_word.fit_transform(df['text_check'])
        return X_countvec_word


    # CountVectorizer - Char
    def feature_text_ngram_char(df,n=(1,1)):
        countvec_char = CountVectorizer(ngram_range=n,analyzer="char")
        X_countvec_char = countvec_char.fit_transform(df['text_check'])
        return X_countvec_char
    
    
    # Text statistics
    def feature_textual_stats(df):
        for x in df.columns[7:16]:
            X_textual_other_df[x] = df[x]

#         for x in df.columns[21:23]:
#             X_textual_other_df[x] = df[x]
            
        for x in df.columns[df.columns.isin(['emoticon_counts'])]:
            X_textual_other_df[x] = df[x]

            
    # POS and NER count by type
    def feature_pos_ner(df,type='ner'):
        if type == 'ner':
            ner_col = df.columns[df.columns.str.contains('ner')]
            for x in ner_col:
                X_textual_other_df[x] = df[x]

        if type == 'pos':
            pos_col = df.columns[df.columns.str.contains('pos')]
            for x in pos_col:
                X_textual_other_df[x] = df[x]    
    
    
    # Compile textual features
    if textual_stats:
        print("Developing Textual Feature: Text Statistics")
        feature_textual_stats(df=df)
        
    if pos_stats:
        print("Developing Textual Feature: POS")
        feature_pos_ner(df=df,type='pos')
        
    if ner_stats:
        print("Developing Textual Feature: NER")
        feature_pos_ner(df=df,type='ner')
        
    if tfidf:
        print("Developing Textual Feature: TFIDF")
        X_tfidf = feature_text_tfidf(df=df,n=tfidf_ngram)

        
    if count_vectorizer_word:
        print("Developing Textual Feature: NGram Word")
        X_countvec_word = feature_text_ngram_word(df=df,n=word_ngram) 
        
    if count_vectorizer_char:
        print("Developing Textual Feature: NGram Char")
        X_countvec_char = feature_text_ngram_char(df=df,n=char_ngram) 
       
    
    # Convert to matrix form that can be fed into an sklearn model
    print("Consolidating all Textual Features. Done")

    # Collect every selected feature block, in a fixed column order
    blocks = []
    if textual_stats or pos_stats or ner_stats:
        X_textual_other_df.drop(['del'],axis=1,inplace=True)
        blocks.append(sparse.csr_matrix(X_textual_other_df.values))
    if tfidf:
        blocks.append(X_tfidf)
    if count_vectorizer_word:
        blocks.append(X_countvec_word)
    if count_vectorizer_char:
        blocks.append(X_countvec_char)

    if not blocks:
        raise ValueError("Select at least one textual feature")

    # hstack accepts any number of blocks, so every flag combination is covered
    X_textual_comb = scipy.sparse.hstack(blocks,format='csr')

    # Scale each column by its maximum absolute value; this keeps the matrix sparse
    X_textual_comb = scaler.fit_transform(X_textual_comb)
    return X_textual_comb
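
The last step of `combine_feature_textual` passes the stacked matrix through `MaxAbsScaler`. The sketch below, on a made-up toy matrix, illustrates why that scaler suits sparse input: it divides each column by its largest absolute value, so zeros stay zero and the matrix stays sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse matrix (values chosen for illustration only)
X_toy = sparse.csr_matrix(np.array([[0.0, 10.0, -2.0],
                                    [5.0,  0.0,  4.0]]))

X_toy_scaled = MaxAbsScaler().fit_transform(X_toy)

print(sparse.issparse(X_toy_scaled))  # True
print(X_toy_scaled.toarray())
```

Every column now lies in the range [-1, 1], and the number of stored non-zero entries is unchanged.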

In [6]:
#####################################
# Textual Features (Textstatistics) #
#####################################

def combine_feature_textstatistics(df=bully_data_cleaned):
    
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all readability and text-statistics
    features from the textstat package
    
    ------------
     Parameters
    ------------

    df: data frame name
   
    '''
    
    # Initialize empty data frame
    X_textstatistics_df = pd.DataFrame()
    dummy = list(range(len(df)))
    X_textstatistics_df['del'] = dummy
    
    # TextStatistics attributes
    X_textstatistics_df['ts_automated_readability_index'] = df['text_check'].progress_apply(lambda x: textstat.automated_readability_index(x))
    X_textstatistics_df['ts_avg_character_per_word'] = df['text_check'].progress_apply(lambda x: textstat.avg_character_per_word(x))
    X_textstatistics_df['ts_avg_letter_per_word'] = df['text_check'].progress_apply(lambda x: textstat.avg_letter_per_word(x))
    X_textstatistics_df['ts_avg_syllables_per_word'] = df['text_check'].progress_apply(lambda x: textstat.avg_syllables_per_word(x))
    X_textstatistics_df['ts_coleman_liau_index'] = df['text_check'].progress_apply(lambda x: textstat.coleman_liau_index(x))
    X_textstatistics_df['ts_dale_chall'] = df['text_check'].progress_apply(lambda x: textstat.dale_chall_readability_score(x))
    X_textstatistics_df['ts_dale_chall2'] = df['text_check'].progress_apply(lambda x: textstat.dale_chall_readability_score_v2(x))
    X_textstatistics_df['ts_difficult_words'] = df['text_check'].progress_apply(lambda x: textstat.difficult_words(x))
    X_textstatistics_df['ts_flesch_kincaid_grade'] = df['text_check'].progress_apply(lambda x: textstat.flesch_kincaid_grade(x))
    X_textstatistics_df['ts_flesch_reading_ease'] = df['text_check'].progress_apply(lambda x: textstat.flesch_reading_ease(x))
    X_textstatistics_df['ts_gunning_fog'] = df['text_check'].progress_apply(lambda x: textstat.gunning_fog(x))
    X_textstatistics_df['ts_letter_count'] = df['text_check'].progress_apply(lambda x: textstat.letter_count(x))
    X_textstatistics_df['ts_lexicon_count'] = df['text_check'].progress_apply(lambda x: textstat.lexicon_count(x))
    X_textstatistics_df['ts_linsear_write_formula'] = df['text_check'].progress_apply(lambda x: textstat.linsear_write_formula(x))
    X_textstatistics_df['ts_lix'] = df['text_check'].progress_apply(lambda x: textstat.lix(x))
    X_textstatistics_df['ts_mcalpine_eflaw'] = df['text_check'].progress_apply(lambda x: textstat.mcalpine_eflaw(x))
    X_textstatistics_df['ts_miniword_count'] = df['text_check'].progress_apply(lambda x: textstat.miniword_count(x))
    X_textstatistics_df['ts_monosyllabcount'] = df['text_check'].progress_apply(lambda x: textstat.monosyllabcount(x))
    X_textstatistics_df['ts_polysyllabcount'] = df['text_check'].progress_apply(lambda x: textstat.polysyllabcount(x))
    X_textstatistics_df['ts_srix'] = df['text_check'].progress_apply(lambda x: textstat.rix(x))
    X_textstatistics_df['ts_long_word_count'] = df['text_check'].progress_apply(lambda x: textstat.long_word_count(x))
    X_textstatistics_df['ts_spache_readability'] = df['text_check'].progress_apply(lambda x: textstat.spache_readability(x))
    X_textstatistics_df['ts_text_standard'] = df['text_check'].progress_apply(lambda x: textstat.text_standard(x,float_output=True))
   
    
    # Convert to matrix form that can be fed into an sklearn model
    print("Consolidating all Textual Feature. Done")
    
    X_textstatistics_df.drop(['del'],axis=1,inplace=True)
    X_textstatistics = sparse.csr_matrix(X_textstatistics_df.values)
    #X_textstatistics=scaler.fit_transform(X_textstatistics)
    
    return X_textstatistics

In [14]:
X_textstatistics = combine_feature_textstatistics(df=bully_data_cleaned)

Consolidating all Textual Feature. Done


In [12]:
# Note: n-gram ranges must be (min_n, max_n) tuples, matching the function signature
# X_textual_comb = combine_feature_textual(df=bully_data_cleaned,
#                                          textual_stats=True,
#                                          pos_stats=True,
#                                          ner_stats=True,
#                                          tfidf=False,
#                                          tfidf_ngram=(1,3),
#                                          count_vectorizer_word=True,
#                                          word_ngram=(1,3))

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Sentiment Features</h2>
</div>

In [13]:
######################
# Sentiment Features #
######################

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pysentiment2 as ps
from afinn import Afinn
from nrclex import NRCLex

def combine_feature_sentiment(df=bully_data_cleaned,
                              textblob=True,
                              vadersenti=True,
                              pysenti=True,
                              afinn_senti=True,
                              nrclex=True
                           ):
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all Sentiment related features
    
    ------------
     Parameters
    ------------

    df: data frame name
    textblob: Boolean
    vadersenti: Boolean
    pysenti: Boolean
    afinn_senti: Boolean
    nrclex: Boolean
    
    '''
    
    # Initialize empty data frame
    X_senti_comb_df = pd.DataFrame()
    dummy = list(range(len(df)))
    X_senti_comb_df['del'] = dummy
    
    
    # TextBlob: Sentiment Polarity
    def feature_senti_polarity(x):
        return TextBlob(x).polarity

    # TextBlob: Subjectivity
    def feature_senti_subjectivity(x):
        return TextBlob(x).subjectivity
    
    # Vader Sentiment
    # - negative score
    # - neutral score
    # - positive score
    # - compound score
    analyzer = SentimentIntensityAnalyzer()
    def feature_vader_senti(x):
        return list(analyzer.polarity_scores(x).values())
        
    # pysentiment2
    # Using dictionary from General Inquirer
    # - positive score
    # - negative score
    # - polarity score
    # - subjectivity score
    hiv4 = ps.HIV4()
    def feature_inquirer_senti(x):
        tokens = hiv4.tokenize(x)  # text can be tokenized by other ways
                                   # however, dict in HIV4 is preprocessed
                                   # by the default tokenizer in the library
        score = hiv4.get_score(tokens)
        return list(score.values())
    
    # Afinn
    # sentiment score
    afinn = Afinn()
    def feature_afinn_senti(x):
        return afinn.score(x)
    
    # NRCLEX
    # - fear score
    # - anger score
    # - anticipation score
    # - trust score
    # - surprise score
    # - positive score
    # - negative score
    # - sadness score
    # - disgust score
    # - joy score
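    def feature_nrclex_senti(x,label):
        emotion = NRCLex(x)
        freqs = emotion.affect_frequencies
        # Some NRCLex releases keep 'anticip' as a zero placeholder and report
        # the actual value under 'anticipation', so check the long key first
        if label == 'anticip':
            return freqs.get('anticipation', freqs.get('anticip', 0.0))
        return freqs.get(label, 0.0)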

    
    # Compile sentiment features
    if textblob:
        print("Developing Sentiment Feature: TextBlob")
        X_senti_comb_df['tb_senti_pol'] = df['text_check'].progress_apply(lambda x: feature_senti_polarity(x))
        X_senti_comb_df['tb_senti_sub'] = df['text_check'].progress_apply(lambda x: feature_senti_subjectivity(x))
        
    if vadersenti:
        print("Developing Sentiment Feature: Vader")
        X_senti_comb_df['vader_senti_all'] = df['text_check'].progress_apply(lambda x: feature_vader_senti(x))
        X_senti_comb_df['vader_senti_neg'] = X_senti_comb_df['vader_senti_all'].progress_apply(lambda x: x[0])
        X_senti_comb_df['vader_senti_neu'] = X_senti_comb_df['vader_senti_all'].progress_apply(lambda x: x[1])
        X_senti_comb_df['vader_senti_pos'] = X_senti_comb_df['vader_senti_all'].progress_apply(lambda x: x[2])
        X_senti_comb_df['vader_senti_comp'] = X_senti_comb_df['vader_senti_all'].progress_apply(lambda x: x[3])
        X_senti_comb_df.drop(['vader_senti_all'],axis=1,inplace=True)
        
    if pysenti:
        print("Developing Sentiment Feature: General Inquirer")
        X_senti_comb_df['inquirer_senti_all'] = df['text_check'].progress_apply(lambda x: feature_inquirer_senti(x))
        X_senti_comb_df['inquirer_senti_pos'] = X_senti_comb_df['inquirer_senti_all'].progress_apply(lambda x: x[0])
        X_senti_comb_df['inquirer_senti_neg'] = X_senti_comb_df['inquirer_senti_all'].progress_apply(lambda x: x[1])
        X_senti_comb_df['inquirer_senti_pol'] = X_senti_comb_df['inquirer_senti_all'].progress_apply(lambda x: x[2])
        X_senti_comb_df['inquirer_senti_sub'] = X_senti_comb_df['inquirer_senti_all'].progress_apply(lambda x: x[3])
        X_senti_comb_df.drop(['inquirer_senti_all'],axis=1,inplace=True)
        
    if afinn_senti:
        print("Developing Sentiment Feature: AFINN")
        X_senti_comb_df['afinn_senti'] = df['text_check'].progress_apply(lambda x: feature_afinn_senti(x))
        
    if nrclex:
        print("Developing Sentiment Feature: NRCLEX")
        X_senti_comb_df['nrclex_senti_fear'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='fear')) 
        X_senti_comb_df['nrclex_senti_anger'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='anger'))
        X_senti_comb_df['nrclex_senti_anticip'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='anticip'))
        X_senti_comb_df['nrclex_senti_trust'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='trust'))
        X_senti_comb_df['nrclex_senti_surprise'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='surprise'))
        X_senti_comb_df['nrclex_senti_positive'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='positive'))
        X_senti_comb_df['nrclex_senti_negative'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='negative'))
        X_senti_comb_df['nrclex_senti_sadness'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='sadness'))
        X_senti_comb_df['nrclex_senti_disgust'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='disgust'))
        X_senti_comb_df['nrclex_senti_joy'] = df['text_check'].progress_apply(lambda x: feature_nrclex_senti(x,label='joy'))

    
    # Convert to matrix form that can be fed into an sklearn model
    print("Consolidating all Sentiment Feature. Done")
    X_senti_comb_df.drop(['del'],axis=1,inplace=True)
    X_senti_comb = sparse.csr_matrix(X_senti_comb_df.values)
    X_senti_comb=scaler.fit_transform(X_senti_comb)
    
    return X_senti_comb
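
The Vader and General Inquirer branches above store a list per row and then pull out each position with a separate `progress_apply`. One possible alternative, sketched below, is to expand the per-row results into columns in a single step. `toy_scores` is a hypothetical stand-in for a real analyzer, and the `senti_` prefix is likewise made up for the example.

```python
import pandas as pd

# Hypothetical scorer standing in for a real analyzer; returns a dict with fixed keys
def toy_scores(text):
    n_words = len(text.split())
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.1 * n_words}

texts = pd.Series(["hello there", "go away now"])

# Score each text once, then expand the dicts into one column per key
scores = pd.DataFrame(texts.apply(toy_scores).tolist(), index=texts.index)
scores = scores.add_prefix("senti_")

print(scores.columns.tolist())
```

Building the frame from the dicts keeps the column names tied to the analyzer's keys instead of relying on list positions.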

In [11]:
# X_senti_comb = combine_feature_sentiment(df=bully_data_cleaned,
#                                           textblob=True,
#                                           vadersenti=False,
#                                           pysenti=False,
#                                           afinn_senti=False,
#                                           nrclex=False
#                                        )

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Word Embeddings</h2>
</div>

- GloVe
- Word2Vec
- fastText
- Contextual word embeddings (BERT and its variants, ELMo, NNLM)

**Note:** 
1. Reshape your data using array.reshape(1, -1) if it contains a single sample
2. Reshape your data using array.reshape(-1, 1) if your data has a single feature
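
A small sketch of the two reshape calls from the note, using a made-up three-element vector:

```python
import numpy as np

vec = np.array([0.1, 0.2, 0.3])

one_sample = vec.reshape(1, -1)   # one row, three features
one_feature = vec.reshape(-1, 1)  # three rows, one feature

print(one_sample.shape, one_feature.shape)  # (1, 3) (3, 1)
```

The `-1` tells NumPy to infer that dimension from the array's length.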

In [12]:
##############################################
# Word Embedding (Glove, Word2Vec, fastText) #
##############################################

import spacy
nlp = spacy.load('en_core_web_lg')
from gensim.models import KeyedVectors

def combine_feature_embedding(df=bully_data_cleaned,
                              glove=False,
                              glove_corpus='glove_wikipedia',
                              glove_word='6B',
                              glove_dimension=100,
                              word2vec=False,
                              word2vec_dimension=300,
                              fasttext=False,
                              fasttext_dimension=300):
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all word embedding features
    
    ------------
     Parameters
    ------------

    df: data frame name
    glove: Boolean
    glove_corpus: Specify 'glove_wikipedia','glove_twitter', 'glove_common'
    glove_word: Specify respective number of words in dictionary
    glove_dimension: Specify respective dimension
    word2vec: Boolean
    word2vec_dimension: Specify respective dimension, 300 by default
    fasttext: Boolean
    fasttext_dimension: Specify respective dimension, 300 by default
    
    '''
    
    ######################
    # 1. Glove Embedding #
    ######################
    # Source: https://github.com/stanfordnlp/GloVe
    def get_glove_features(df=df,
                           corpus=glove_corpus,
                           word=glove_word,
                           dim=glove_dimension):

        # Selection of Glove Corpus
        path ='glove/'+ corpus + '/glove.' + word + '.' + str(dim) + 'd.txt'
        embedding_col= corpus +'_'+ str(dim) +'_'+ 'vectors'

        # Load glove vector file
        def load_glove_vector(path):
            glove_vectors = dict()

            # The context manager closes the file even if parsing fails
            with open(path, encoding='utf-8') as file:
                for line in file:
                    values = line.split()

                    word = values[0]
                    vectors = np.asarray(values[1:])
                    glove_vectors[word] = vectors

            return glove_vectors

        # Get Glove Vector
        def get_glove_vec(x, glove_vectors, dim):
            arr = np.zeros(dim)
            text = str(x).split()

            for t in text:
                try:
                    vec = glove_vectors.get(t).astype(float)
                    arr = arr + vec
                except:
                    pass

            arr = arr.reshape(1, -1)[0]
            return arr/len(text)

        # Form Glove embeddings
        def feature_glove_embedding(embedding_col,dim):
            X = df[embedding_col].to_numpy()
            X = X.reshape(-1, 1)
            X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, dim)
            X = sparse.csr_matrix(X)
            return X

        glove_vectors = load_glove_vector(path)
        df[embedding_col] = df['text_check'].progress_apply(lambda x: get_glove_vec(x, glove_vectors, dim))
        return feature_glove_embedding(embedding_col,dim)
    
    
    #########################
    # 2. Word2Vec Embedding #
    #########################
    # `spacy` Package
    def get_word2vec_features(df=df,
                              dim=word2vec_dimension,
                              embedding_col='word2vec_vectors'):

        def get_word2vec_vectors(x):
            doc = nlp(x)
            word2vec_vectors = doc.vector
            return word2vec_vectors

        def feature_word2vec_embedding(embedding_col):
            X = df[embedding_col].to_numpy()
            X = X.reshape(-1, 1)
            X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, dim) #300 by default
            X = sparse.csr_matrix(X)
            return X

        df[embedding_col] = df['text_check'].progress_apply(lambda x: get_word2vec_vectors(x))
        return feature_word2vec_embedding(embedding_col)

    
    #########################
    # 3. FastText Embedding #
    #########################
    # Source: `gensim` Package
    # https://fasttext.cc/docs/en/english-vectors.html
    def get_fasttext_features(df=df,
                              dim=fasttext_dimension,
                              embedding_col='fasttext_vectors'):

        model = KeyedVectors.load_word2vec_format('fasttext/wiki-news-300d-1M.vec')

        # For each input text
        def get_fasttext_vectors(sent, model):
            sent_vec = np.zeros(dim)
            numw = 0

            # Sum the embeddings of every in-vocabulary word in the sentence
            # (split into words first; iterating over the raw string yields characters)
            for w in str(sent).split():
                try:
                    sent_vec = np.add(sent_vec, model[w])
                    numw += 1
                except KeyError:
                    # Word not in the fastText vocabulary
                    pass

            # Return the mean embedding, or a zero vector if no word was found
            if numw == 0:
                return sent_vec
            return sent_vec / numw

        def feature_fasttext_embedding(embedding_col):
            X = df[embedding_col].to_numpy()
            X = X.reshape(-1, 1)
            X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, dim)
            X = sparse.csr_matrix(X)
            return X

        df[embedding_col] = df['text_check'].progress_apply(lambda x: get_fasttext_vectors(x, model))
        return feature_fasttext_embedding(embedding_col)

    
    # Developing Embedding Vector
    if glove:
        print('Developing Embedding Vectors: Glove')
        X_glove = get_glove_features(corpus=glove_corpus,word=glove_word,dim=glove_dimension)
    
    if word2vec:
        print('Developing Embedding Vectors: Word2Vec')
        X_word2vec = get_word2vec_features(embedding_col='word2vec_vectors')
        
    if fasttext:
        print('Developing Embedding Vectors: FastText')
        X_fasttext = get_fasttext_features(embedding_col='fasttext_vectors')

        
    # Combine word embeddings
    selected = []
    if glove:
        selected.append(('Glove', X_glove))
    if word2vec:
        selected.append(('Word2Vec', X_word2vec))
    if fasttext:
        selected.append(('FastText', X_fasttext))

    if not selected:
        raise ValueError('Select at least one embedding: glove, word2vec or fasttext')

    print('Combining Embedding Vectors: ' + ', '.join(name for name, _ in selected))
    X_embedding_comb = scipy.sparse.hstack([X for _, X in selected],format='csr')

    return X_embedding_comb
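
All three helpers above turn a text into one fixed-length vector by averaging its word vectors. The sketch below shows that idea on a hypothetical three-dimensional vocabulary (`toy_vectors` and `mean_embedding` are illustrative names, not part of the notebook). Unlike the GloVe helper above, which divides by the total token count, this version averages only over the tokens it finds.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; real pretrained files have far more dimensions
toy_vectors = {
    "you": np.array([1.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0]),
    "mean": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(text, vectors, dim):
    """Average the vectors of the tokens found in `vectors`; zeros if none are found."""
    found = [vectors[tok] for tok in text.split() if tok in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Stack one row per text to get a (n_texts, dim) matrix
X_toy_emb = np.vstack([mean_embedding(t, toy_vectors, 3) for t in ["you are mean", "xyz"]])
print(X_toy_emb.shape)  # (2, 3)
```

Returning a zero vector for texts with no known tokens keeps every row the same length, so the rows can be stacked without special cases.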

In [13]:
# X_embedding_comb=combine_feature_embedding(df=bully_data_cleaned,
#                               glove=True,
#                               glove_corpus='glove_wikipedia',
#                               glove_word='6B',
#                               glove_dimension=100,
#                               word2vec=True,
#                               word2vec_dimension=300,
#                               fasttext=True,
#                               fasttext_dimension=300)

In [14]:
# X_embedding_glove = combine_feature_embedding(df=bully_data_cleaned,
#                               glove=True,
#                               glove_corpus='glove_wikipedia',
#                               glove_word='6B',
#                               glove_dimension=100
#                               )

In [15]:
# X_embedding_word2vec = combine_feature_embedding(df=bully_data_cleaned,
#                                                   word2vec=True,
#                                                   word2vec_dimension=300
#                                                   )

In [16]:
# X_embedding_fasttext = combine_feature_embedding(df=bully_data_cleaned,
#                               fasttext=True,
#                               fasttext_dimension=300)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Psycholinguistics</h2>
</div>

- LIWC
- Empath

In [29]:
##################################
# Import External LIWC 22 Result #
##################################
bully_data_liwc22 = pd.read_csv("LIWC22_"+data_set+'.csv')
bully_data_liwc22 = bully_data_liwc22[~bully_data_liwc22['text_check'].isna()]
bully_data_liwc22 = bully_data_liwc22[bully_data_liwc22['text_check'] != ""]
#bully_data_liwc = bully_data_liwc[bully_data_liwc['role']!='None']
bully_data_liwc22 = bully_data_liwc22.drop(['Unnamed: 0','tag', 'text', 'label', 'role', 'harmfulness_score', 'oth_language',
                                        'file_index', 'text_check', 'Segment'],axis=1)
bully_data_liwc22 = bully_data_liwc22.reset_index(drop=True)
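
The `reset_index(drop=True)` call matters because pandas aligns on index labels, not on row position, when a column from one data frame is assigned to another. A minimal sketch with made-up values (the frames and column names below are hypothetical, not the real LIWC data):

```python
import pandas as pd

# Hypothetical LIWC-style scores after one row was filtered out (index label 1 is missing)
liwc_toy = pd.DataFrame({"WC": [5, 7, 3]}, index=[0, 2, 3])

# Target frame with a fresh 0..n-1 index
target = pd.DataFrame({"row": range(3)})

# Assignment aligns on index labels: label 1 has no match, label 3 is silently dropped
target["misaligned"] = liwc_toy["WC"]

# After resetting the index, rows line up by position again
target["aligned"] = liwc_toy.reset_index(drop=True)["WC"]

print(target)
```

Without the reset, the mismatch shows up as `NaN` values rather than as an error, which is easy to miss.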

In [30]:
bully_data_liwc22.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112247 entries, 0 to 112246
Columns: 117 entries, WC to OtherP
dtypes: float64(110), int64(7)
memory usage: 100.2 MB


In [31]:
bully_data_liwc22.columns

Index(['WC', 'Analytic', 'Clout', 'Authentic', 'Tone', 'WPS', 'BigWords',
       'Dic', 'Linguistic', 'function',
       ...
       'assent', 'nonflu', 'filler', 'AllPunc', 'Period', 'Comma', 'QMark',
       'Exclam', 'Apostro', 'OtherP'],
      dtype='object', length=117)

In [32]:
##################################
# Import External LIWC 15 Result #
##################################
bully_data_liwc15 = pd.read_csv("LIWC15_"+data_set+'.csv')
bully_data_liwc15 = bully_data_liwc15[~bully_data_liwc15['text_check'].isna()]
bully_data_liwc15 = bully_data_liwc15[bully_data_liwc15['text_check'] != ""]
#bully_data_liwc15 = bully_data_liwc15[bully_data_liwc15['role']!='None']
bully_data_liwc15 = bully_data_liwc15.drop(['Unnamed: 0','tag', 'text', 'label', 'role', 'harmfulness_score', 'oth_language',
                                        'file_index', 'text_check', 'Segment'],axis=1)
bully_data_liwc15 = bully_data_liwc15.reset_index(drop=True)

In [33]:
bully_data_liwc15.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112247 entries, 0 to 112246
Data columns (total 93 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   WC            112247 non-null  int64  
 1   Analytic      112247 non-null  float64
 2   Clout         112247 non-null  float64
 3   Authentic     112247 non-null  float64
 4   Tone          112247 non-null  float64
 5   WPS           112247 non-null  int64  
 6   Sixltr        112247 non-null  float64
 7   Dic           112247 non-null  float64
 8   function      112247 non-null  float64
 9   pronoun       112247 non-null  float64
 10  ppron         112247 non-null  float64
 11  i             112247 non-null  float64
 12  we            112247 non-null  float64
 13  you           112247 non-null  float64
 14  shehe         112247 non-null  float64
 15  they          112247 non-null  float64
 16  ipron         112247 non-null  float64
 17  article       112247 non-null  float64
 18  prep

In [34]:
bully_data_liwc15.columns

Index(['WC', 'Analytic', 'Clout', 'Authentic', 'Tone', 'WPS', 'Sixltr', 'Dic',
       'function', 'pronoun', 'ppron', 'i', 'we', 'you', 'shehe', 'they',
       'ipron', 'article', 'prep', 'auxverb', 'adverb', 'conj', 'negate',
       'verb', 'adj', 'compare', 'interrog', 'number', 'quant', 'affect',
       'posemo', 'negemo', 'anx', 'anger', 'sad', 'social', 'family', 'friend',
       'female', 'male', 'cogproc', 'insight', 'cause', 'discrep', 'tentat',
       'certain', 'differ', 'percept', 'see', 'hear', 'feel', 'bio', 'body',
       'health', 'sexual', 'ingest', 'drives', 'affiliation', 'achieve',
       'power', 'reward', 'risk', 'focuspast', 'focuspresent', 'focusfuture',
       'relativ', 'motion', 'space', 'time', 'work', 'leisure', 'home',
       'money', 'relig', 'death', 'informal', 'swear', 'netspeak', 'assent',
       'nonflu', 'filler', 'AllPunc', 'Period', 'Comma', 'Colon', 'SemiC',
       'QMark', 'Exclam', 'Dash', 'Quote', 'Apostro', 'Parenth', 'OtherP'],
      dtype='o

In [35]:
##############################
# Psycholinguistics Features #
##############################
from empath import Empath

def combine_feature_psycholinguistics(df=bully_data_cleaned,
                                      df_liwc=bully_data_liwc22,
                                      liwc=True,
                                      empath=True
                           ):
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all Psycholinguistics related features
    
    ------------
     Parameters
    ------------

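    df: data frame holding the cleaned text in column 'text_check'
    df_liwc: data frame of externally computed LIWC scores, row-aligned with df
    liwc: Boolean, include the LIWC features
    empath: Boolean, include the Empath features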
    
    '''
    
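    # Initialize the data frame with a placeholder column ('del', dropped later)
    # so that it has one row per text before feature columns are added
    X_psycho_comb_df = pd.DataFrame()
    X_psycho_comb_df['del'] = list(range(len(df)))
    
    
    # LIWC: copy every column of df_liwc with a 'liwc_' prefix
    def feature_liwc(df_liwc):
        for x in df_liwc.columns:
            X_psycho_comb_df['liwc_'+x] = df_liwc[x]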
            

    # Empath
    lexicon = Empath()    
    def feature_empath(x):
        EMPATH_CAT = lexicon.analyze(x, normalize=True)
        return EMPATH_CAT

    
    # Compile all psycholinguistics features
    if liwc:
        print("Developing pyscholinguistics feature from liwc 2022 tools")
        feature_liwc(df_liwc)
    
    
    if empath:
        print("Developing pyscholinguistics feature from empath package")
        empath_analysis=df['text_check'].progress_apply(lambda x: feature_empath(x)) #saved as dictionary
        X_empath_all = pd.DataFrame(empath_analysis.tolist())
        X_psycho_comb_df = X_psycho_comb_df.join(X_empath_all)

   
    
    # Convert to a sparse matrix that can be fed into a scikit-learn model
    print("Consolidating all Psycholinguistics Feature. Done")
    X_psycho_comb_df.drop(['del'],axis=1,inplace=True)
    X_psycho_comb = sparse.csr_matrix(X_psycho_comb_df.values)
    # `scaler` is not created in this function; it must already exist in the notebook scope
    X_psycho_comb = scaler.fit_transform(X_psycho_comb)
    
    return X_psycho_comb

In [17]:
# X_psycho_comb = combine_feature_psycholinguistics(df=bully_data_cleaned,
#                                                   liwc=True,
#                                                   empath=True)
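
The Empath branch stores one dictionary per text and then expands the dictionaries into columns. The sketch below shows that expansion step in isolation; the category names and scores are invented for illustration and do not come from the Empath lexicon.

```python
import pandas as pd
from scipy import sparse

# Hypothetical per-text category scores, one dictionary per document
empath_rows = [
    {"violence": 0.25, "friends": 0.0},
    {"violence": 0.0, "friends": 0.5},
]

# A list of dictionaries becomes one column per key and one row per document
empath_df = pd.DataFrame(empath_rows)

# Most scores are zero, so the sparse form stores only the non-zero entries
X_empath_toy = sparse.csr_matrix(empath_df.values)

print(empath_df.columns.tolist(), X_empath_toy.shape, X_empath_toy.nnz)
```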

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Term Lists</h2>
</div>

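Note: Each term list is encoded as a binary feature by default (1: at least one term present, 0: none present); a ratio per 100 words or the raw count can be used instead.
- Proper names *(already done in the NER part)*
- 'Allness' terms
- Absolute terms
- Diminishers
- Intensifiers
- Negation words
- Profane terms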
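
As a rough illustration of the three encodings, the sketch below applies each of them to a toy data frame. The column names follow the ones used in this notebook, but the values are made up.

```python
import pandas as pd

# Toy data: number of profane terms and total words per document (made-up values)
toy = pd.DataFrame({"term_badword_counts": [0, 2, 1],
                    "word_count": [4, 8, 3]})

# Binary: 1 if at least one listed term occurs, else 0
binary = (toy["term_badword_counts"] > 0).astype(int)

# Ratio: listed terms per 100 words, rounded to two decimals
ratio = (toy["term_badword_counts"] / toy["word_count"] * 100).round(2)

# Count: the raw number of listed terms
count = toy["term_badword_counts"]

print(binary.tolist(), ratio.tolist(), count.tolist())
```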

In [7]:
######################
# Term list Features #
######################

def combine_feature_termlist(df=bully_data_cleaned,
                             absolute=False,
                             allness=False,
                             badword=False,
                             negation=False,
                             diminisher=False,
                             intensifier=False,
                             convert_form="binary"
                           ):
    '''
    -------------
     Description 
    -------------
    Umbrella Function to combine all Term list feature by category
    
    ------------
     Parameters
    ------------

    df: data frame name
    absolute: boolean
    allness: boolean
    badword: boolean
    negation: boolean
    diminisher: boolean
    intensifier: boolean
    convert_form: "binary" to flag whether any listed term occurs (1/0),
                  "ratio" to convert the counts to terms per 100 words, or
                  "count" to keep the raw term counts
    '''
    
    # Initialize empty data frame
    X_term_comb_df = pd.DataFrame()

    # Develop Term List Features
    if convert_form == "binary":
        if absolute:
            print('Developing Term List Feature: Absolute term')
            X_term_comb_df['term_absolute_counts']=df['term_absolute_counts'].progress_apply(lambda x: 1 if x>0 else 0)

        if allness:
            print('Developing Term List Feature: Allness term')
            X_term_comb_df['term_allness_counts']=df['term_allness_counts'].progress_apply(lambda x: 1 if x>0 else 0)

        if badword:
            print('Developing Term List Feature: Badword term')
            X_term_comb_df['term_badword_counts']=df['term_badword_counts'].progress_apply(lambda x: 1 if x>0 else 0)

        if negation:
            print('Developing Term List Feature: Negation term')
            X_term_comb_df['term_negation_counts']=df['term_negation_counts'].progress_apply(lambda x: 1 if x>0 else 0)

        if diminisher:
            print('Developing Term List Feature: Diminisher term')
            X_term_comb_df['term_diminisher_counts']=df['term_diminisher_counts'].progress_apply(lambda x: 1 if x>0 else 0)

        if intensifier:
            print('Developing Term List Feature: Intensifier term')
            X_term_comb_df['term_intensifier_counts']=df['term_intensifier_counts'].progress_apply(lambda x: 1 if x>0 else 0)

            
    elif convert_form == "ratio":
        if absolute:
            print('Developing Term List Feature: Absolute term')
            X_term_comb_df['term_absolute_ratio']=(df['term_absolute_counts']/df['word_count']*100).round(2)

        if allness:
            print('Developing Term List Feature: Allness term')
            X_term_comb_df['term_allness_ratio']=(df['term_allness_counts']/df['word_count']*100).round(2)

        if badword:
            print('Developing Term List Feature: Badword term')
            X_term_comb_df['term_badword_ratio']=(df['term_badword_counts']/df['word_count']*100).round(2)

        if negation:
            print('Developing Term List Feature: Negation term')
            X_term_comb_df['term_negation_ratio']=(df['term_negation_counts']/df['word_count']*100).round(2)

        if diminisher:
            print('Developing Term List Feature: Diminisher term')
            X_term_comb_df['term_diminisher_ratio']=(df['term_diminisher_counts']/df['word_count']*100).round(2)

        if intensifier:
            print('Developing Term List Feature: Intensifier term')
            X_term_comb_df['term_intensifier_ratio']=(df['term_intensifier_counts']/df['word_count']*100).round(2)

    elif convert_form == "count":
        if absolute:
            print('Developing Term List Feature: Absolute term')
            X_term_comb_df['term_absolute_counts']=df['term_absolute_counts']

        if allness:
            print('Developing Term List Feature: Allness term')
            X_term_comb_df['term_allness_counts']=df['term_allness_counts']

        if badword:
            print('Developing Term List Feature: Badword term')
            X_term_comb_df['term_badword_counts']=df['term_badword_counts']

        if negation:
            print('Developing Term List Feature: Negation term')
            X_term_comb_df['term_negation_counts']=df['term_negation_counts']

        if diminisher:
            print('Developing Term List Feature: Diminisher term')
            X_term_comb_df['term_diminisher_counts']=df['term_diminisher_counts']

        if intensifier:
            print('Developing Term List Feature: Intensifier term')
            X_term_comb_df['term_intensifier_counts']=df['term_intensifier_counts']

            
    # Convert to a sparse matrix that can be fed into a scikit-learn model
    print("Consolidating all Term List Feature. Done")
    X_term_comb = sparse.csr_matrix(X_term_comb_df.values) 
    # `scaler` is not created in this function; it must already exist in the notebook scope
    X_term_comb = scaler.fit_transform(X_term_comb)
    
    return X_term_comb
   

In [26]:
# X_term_comb = combine_feature_termlist(df=bully_data_cleaned,
#                              absolute=True,
#                              allness=True,
#                              badword=True,
#                              negation=True,
#                              diminisher=True,
#                              intensifier=True
#                            )
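
One caveat of the "ratio" form is the division by `word_count`: a document with zero words produces `NaN` or `inf`. Whether such rows exist in `bully_data_cleaned` is not shown here, so the guard below is only a sketch of one possible safeguard, using made-up values.

```python
import numpy as np
import pandas as pd

# Made-up counts; the last two documents have zero words
counts = pd.Series([1, 0, 2])
words = pd.Series([4, 0, 0])

# Unguarded ratio: 0/0 gives NaN and 2/0 gives inf
raw_ratio = counts / words * 100

# Guarded ratio: treat zero-length documents as having a ratio of 0
safe_ratio = (counts / words.replace(0, np.nan) * 100).fillna(0).round(2)

print(raw_ratio.tolist(), safe_ratio.tolist())
```

Replacing the zero denominators before dividing keeps the feature finite, which matters because the scaling step that follows generally expects finite input.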

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Combination of Features</h2>
</div>

In [19]:
################
# Textual Only #
################
# print("Generating CountVecWord with unigram+bigram+trigram features")
# X_CountVecWordAll = combine_feature_textual(df=bully_data_cleaned,
#                                                     textual_stats=False,
#                                                     pos_stats=False,
#                                                     ner_stats=False,
#                                                     tfidf=False,
#                                                     tfidf_ngram=(1,1),
#                                                     count_vectorizer_word=True,
#                                                     word_ngram=(1,3),
#                                                     count_vectorizer_char=False,
#                                                     char_ngram=(1,3)
#                                                    )


# print("Generating CountVecChar with unigram+bigram+trigram features")
# X_CountVecCharAll = combine_feature_textual(df=bully_data_cleaned,
#                                                     textual_stats=False,
#                                                     pos_stats=False,
#                                                     ner_stats=False,
#                                                     tfidf=False,
#                                                     tfidf_ngram=(1,1),
#                                                     count_vectorizer_word=False,
#                                                     word_ngram=(1,3),
#                                                     count_vectorizer_char=True,
#                                                     char_ngram=(1,3)
#                                                    )


# print("Generating CountVecWordChar with unigram+bigram+trigram features")
# X_CountVecWordCharAll = combine_feature_textual(df=bully_data_cleaned,
#                                                     textual_stats=False,
#                                                     pos_stats=False,
#                                                     ner_stats=False,
#                                                     tfidf=False,
#                                                     tfidf_ngram=(1,1),
#                                                     count_vectorizer_word=True,
#                                                     word_ngram=(1,3),
#                                                     count_vectorizer_char=True,
#                                                     char_ngram=(1,3)
#                                                    )


print("Generating TextStat+CountVecWordChar with unigram+bigram+trigram features")
X_CountVecWordCharAllTextStat = combine_feature_textual(df=bully_data_cleaned,
                                                            textual_stats=True,
                                                            pos_stats=True,
                                                            ner_stats=True,
                                                            tfidf=False,
                                                            tfidf_ngram=(1,1),
                                                            count_vectorizer_word=True,
                                                            word_ngram=(1,3),
                                                            count_vectorizer_char=True,
                                                            char_ngram=(1,3)
                                                           )


##################
# Sentiment Only #
##################
print()
print("Generating SentimentAll features")
X_SentimentAll = combine_feature_sentiment(df=bully_data_cleaned,
                                              textblob=True,
                                              vadersenti=True,
                                              pysenti=True,
                                              afinn_senti=True,
                                              nrclex=True)


#######################
# Word Embedding Only #
#######################
print()
print("Generating GloveEmbedding features")
X_GloveEmbedding = combine_feature_embedding(df=bully_data_cleaned,
                                                glove=True,
                                                glove_corpus='glove_wikipedia',
                                                glove_word='6B',
                                                glove_dimension=100
                                                )


print()
print("Generating Word2VecEmbedding features")
X_Word2VecEmbedding = combine_feature_embedding(df=bully_data_cleaned,
                                                  word2vec=True,
                                                  word2vec_dimension=300
                                                  )


print()
print("Generating FastTextEmbedding features")
X_FastTextEmbedding = combine_feature_embedding(df=bully_data_cleaned,
                                                  fasttext=True,
                                                  fasttext_dimension=300)



##########################
# Psycholinguistics Only #
##########################
# print()
# print("Generating PycholinguisticEmpath features")
# X_PycholinguisticEmpath = combine_feature_psycholinguistics(df=bully_data_cleaned,
#                                                             df_liwc=bully_data_liwc22,
#                                                                 liwc=False,
#                                                                 empath=True)


# print()
# print("Generating PycholinguisticLIWC22 features")
# X_PycholinguisticLIWC22 = combine_feature_psycholinguistics(df=bully_data_cleaned,
#                                                           df_liwc=bully_data_liwc22,
#                                                               liwc=True,
#                                                               empath=False)

# print()
# print("Generating PycholinguisticLIWC15 features")
# X_PycholinguisticLIWC15 = combine_feature_psycholinguistics(df=bully_data_cleaned,
#                                                           df_liwc=bully_data_liwc15,
#                                                               liwc=True,
#                                                               empath=False)


print()
print("Generating PycholinguisticLIWC22Empath features")
X_PycholinguisticLIWC22Empath = combine_feature_psycholinguistics(df=bully_data_cleaned,
                                                      df_liwc=bully_data_liwc22,
                                                          liwc=True,
                                                          empath=True)


##################
# Term List only #
##################
print()
print("Generating TermListsRatio features")
X_TermListsRatio = combine_feature_termlist(df=bully_data_cleaned,
                                         absolute=True,
                                         allness=True,
                                         badword=True,
                                         negation=True,
                                         diminisher=True,
                                         intensifier=True,
                                         convert_form = "ratio"
                                       )


Generating TextStat+CountVecWordChar with unigram+bigram+trigram features
Developing Textual Feature: Text Statistics
Developing Textual Feature: POS
Developing Textual Feature: NER
Developing Textual Feature: NGram Word
Developing Textual Feature: NGram Char
Consolidating all Textual Feature. Done

Generating SentimentAll features
Developing Sentiment Feature: TextBlob
Developing Sentiment Feature: Vader
Developing Sentiment Feature: General Inquirer
Developing Sentiment Feature: AFINN
Developing Sentiment Feature: NRCLEX
Consolidating all Sentiment Feature. Done

Generating PycholinguisticLIWC22Empath features
Developing pyscholinguistics feature from liwc 2022 tools
Developing pyscholinguistics feature from empath package
Consolidating all Psycholinguistics Feature. Done

Generating TermListsRatio features
Developing Term List Feature: Absolute term
Developing Term List Feature: Allness term
Developing Term List Feature: Badword term
Developing Term List Feature: Negation term
Developing Term List Feature: Diminisher term
Developing Term List Feature: Intensifier term
Consolidating all Term List Feature. Done


In [11]:
##########################
# Output as pickle files #
##########################

# Feature sets #
feature_set = {
               'X_CountVecWordCharAllTextStat': X_CountVecWordCharAllTextStat,
               'X_SentimentAll': X_SentimentAll,
               'X_GloveEmbedding': X_GloveEmbedding,
               'X_Word2VecEmbedding': X_Word2VecEmbedding,
               'X_FastTextEmbedding': X_FastTextEmbedding,
               'X_PycholinguisticLIWC22Empath': X_PycholinguisticLIWC22Empath,
               'X_TermListsRatio': X_TermListsRatio
              }

for fname, fset in feature_set.items():
    with open(task+"\\"+data_set+"\\features\\selected_scale\\"+ fname + ".pkl",'wb') as f:
        pickle.dump(fset, f)
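
To confirm that a pickled feature matrix can be read back unchanged, a round trip like the one below can help. It is a sketch only: it writes a toy matrix to a temporary directory rather than to the project folders, and the file name `X_toy.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

from scipy import sparse

# Toy sparse matrix standing in for one feature set
X_toy = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0]])

# Write to a temporary directory so no project files are touched
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "X_toy.pkl")

with open(path, "wb") as f:
    pickle.dump(X_toy, f)

with open(path, "rb") as f:
    X_loaded = pickle.load(f)

# Two sparse matrices are equal when their element-wise difference has no non-zero entries
round_trip_ok = (X_toy != X_loaded).nnz == 0
print(round_trip_ok)
```

Building the path with `os.path.join` instead of hard-coded backslashes also keeps the same code usable outside Windows.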


In [None]:
print(X_CountVecWordCharAllTextStat.shape)
print(X_SentimentAll.shape)
print(X_GloveEmbedding.shape)
print(X_Word2VecEmbedding.shape)
print(X_FastTextEmbedding.shape)
print(X_PycholinguisticLIWC22Empath.shape)
print(X_TermListsRatio.shape)
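
Because the feature sets are later stacked side by side, they all need the same number of rows. The helper below is a hypothetical convenience function, not part of the notebook; it sketches how the printed shapes could be checked automatically, using toy matrices.

```python
from scipy import sparse

def row_count_mismatches(feature_set, expected_rows):
    """Return {name: n_rows} for every matrix whose row count differs from expected_rows."""
    return {name: X.shape[0]
            for name, X in feature_set.items()
            if X.shape[0] != expected_rows}

# Toy feature sets: one with the expected 4 rows, one that is a row short
toy_feature_set = {
    "X_ok": sparse.csr_matrix((4, 3)),
    "X_short": sparse.csr_matrix((3, 5)),
}

mismatches = row_count_mismatches(toy_feature_set, expected_rows=4)
print(mismatches)
```

In the notebook the expected row count would be `len(bully_data_cleaned)`.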

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Target Classes</h2>
</div>

In [29]:
# Target Variable - Cyberbullying vs Non-Cyberbullying #
with open(task+"\\"+data_set+"\\target_class\\Y_cyberbullying.pkl",'wb') as f:
    pickle.dump(bully_data_cleaned['label'].values, f)

In [19]:
def convert_role_label(x):
    if x == "None":
        return 0
    elif x == "Harasser":
        return 1
    elif x == "Victim":
        return 2
    elif x == "Bystander_defender":
        return 3
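
Note that `convert_role_label` returns `None` for any label outside the four listed ones. A dictionary lookup makes such cases easy to spot. The sketch below uses a toy series; the label "Bystander_assistant" is a hypothetical unmapped value added only for illustration.

```python
import pandas as pd

# Same mapping as convert_role_label, written as a dictionary
ROLE_TO_ID = {"None": 0, "Harasser": 1, "Victim": 2, "Bystander_defender": 3}

# Toy roles; the last label is a hypothetical value missing from the mapping
roles = pd.Series(["None", "Victim", "Bystander_assistant"])

# Labels without a mapping become NaN instead of being silently encoded
role_id = roles.map(ROLE_TO_ID)

# List the labels that could not be converted
unmapped = roles[role_id.isna()].unique().tolist()
print(unmapped)
```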

In [20]:
bully_data_cleaned['role_id'] = bully_data_cleaned['role'].progress_apply(lambda x: convert_role_label(x))


In [21]:
bully_data_cleaned['role_id'].value_counts()

0    106872
1      3596
2      1354
3       425
Name: role_id, dtype: int64

In [22]:
# Target Variable - Participant Role (None, Harasser, Victim, Bystander_defender) #
with open(task+"\\"+data_set+"\\target_class\\Y_role.pkl",'wb') as f:
    pickle.dump(bully_data_cleaned['role_id'].values, f)
