# Fake user classification using BERT

In this notebook, we test the hypothesis that semantic information from a user's tweets (captured with BERT sentence encodings) improves fake user classification accuracy.

We build two deep neural networks:

1.   A network whose input is the profile features only (geolocation, profile picture, basic information, etc.)
2.   A network whose input combines the profile features with the BERT sentence encodings of the user's tweets


In [0]:
import numpy as np
import pandas as pd

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from keras.layers import Dense, Input, concatenate
from keras.models import Model

We used bert-as-service (https://github.com/hanxiao/bert-as-service) to compute the BERT encodings of the user tweets. The encodings are saved in the file bert.csv.

The cell below reads the data file, min-max normalizes the BERT features, and creates the training/validation sets. After the five non-feature columns (Unnamed: 0, label, id, tweet, verified) are dropped, the resulting dataframe contains 800 features (32 profile features + 768 columns of BERT encodings).
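As a small illustration of the min-max step, the sketch below rescales one toy column to the range [0, 1]. The helper name `min_max_normalize` and the sample values are made up for this example and are not part of the notebook. Note that a constant column would make the denominator zero, which neither this sketch nor the notebook code guards against.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy column standing in for one bert_<i> column (hypothetical values)
column = [-0.5, 0.0, 0.25, 1.5]
normalized = min_max_normalize(column)
print(normalized)
```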

In [0]:
bert_embedding_size = 768

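def remap_fields(df):
    # Convert object (string) columns to binary flags: 1 if a value is present, 0 otherwise.
    # Note: NaN is truthy in Python, so missing values in these columns also map to 1.
    for name, dtype in zip(list(df), df.dtypes):
        if dtype == 'object':
            df[name] = df[name].map(lambda x: 1 if x else 0)
    # Replace the remaining missing values with 0
    df.fillna(0, inplace=True)
    return df

# Load the profile data together with the BERT embeddings
data = pd.read_csv('../../data/bert.csv')
data = remap_fields(data)
print(data.sample(1))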

def get_dataset():
    # Min-max normalize each BERT column to the range [0, 1]
    normalized_data = data.copy()
    for i in range(bert_embedding_size):
        col_name = 'bert_' + str(i)
        max_col_val = data[col_name].max()
        min_col_val = data[col_name].min()

        normalized_data[col_name] = (data[col_name] - min_col_val) / (max_col_val - min_col_val)

    # Create a stratified train/validation split (a new random split on every call)
    train_x, test_x, _, _ = train_test_split(normalized_data, normalized_data.label, stratify=normalized_data.label)
    train_y = train_x.label
    test_y = test_x.label

    # Drop the non-feature columns; returning new frames instead of dropping
    # in place avoids pandas' SettingWithCopyWarning
    non_feature_cols = ['Unnamed: 0', 'label', 'id', 'tweet', 'verified']
    train_x = train_x.drop(non_feature_cols, axis=1)
    test_x = test_x.drop(non_feature_cols, axis=1)

    return train_x, train_y, test_x, test_y


     Unnamed: 0         id  name  ...  bert_765  bert_766  bert_767
353         353  525965404     1  ... -0.421459  0.065046 -0.149392

[1 rows x 805 columns]


In [0]:
# Define recall, precision, and F1 score as custom Keras metrics
from keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
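To make the metric definitions above concrete, here is a plain-Python sketch that mirrors the same clip, round, and sum steps on a toy batch. The function name `precision_recall_f1`, the sample labels, and the value `eps=1e-7` (standing in for `K.epsilon()`) are assumptions for illustration only. Keep in mind that Keras evaluates custom metrics batch by batch and averages them, so the F1 reported during training is an approximation of the F1 over the whole set.

```python
def precision_recall_f1(y_true, y_pred, eps=1e-7):
    """Mirror the Keras metric functions on plain Python lists."""
    # Clip to [0, 1] and round, so a probability becomes a hard 0/1 decision
    def clip_round(v):
        return round(min(max(v, 0.0), 1.0))

    true_positives = sum(clip_round(t * p) for t, p in zip(y_true, y_pred))
    possible_positives = sum(clip_round(t) for t in y_true)
    predicted_positives = sum(clip_round(p) for p in y_pred)

    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    f1 = 2 * (precision * recall) / (precision + recall + eps)
    return precision, recall, f1

# Toy batch: three fake users (label 1) and two genuine users (label 0)
y_true = [1, 1, 1, 0, 0]
y_pred = [0.9, 0.8, 0.2, 0.7, 0.1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all come out close to 2/3.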

# Deep neural network - Profile features

Here, we create a neural network with one input layer, two hidden layers, and one output layer. The input is the profile features only (dimension = 32).

We use the Adam optimizer and the binary cross-entropy loss. We evaluate the model over five runs, each with a fresh stratified random train/validation split (referred to as five-fold cross validation in the code), and report the results averaged over the runs:

**Accuracy: 0.90343137, F1-score: 0.91886296**

In [0]:
def build_base_model(profile_dim=32):
    '''
        input for profile network
    '''
    profile_input = Input(shape=(profile_dim,))

    output = Dense(32, activation='relu')(profile_input)
    output = Dense(16, activation='relu')(output)
    output = Dense(1, activation='sigmoid')(output)

    model = Model(inputs=[profile_input], outputs=[output])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc', f1_m])
    return model
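As a quick sanity check on the layer sizes in `build_base_model`, the sketch below recomputes the trainable parameter counts by hand. A dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` is a hypothetical name introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input dimension followed by the size of each Dense layer
layer_sizes = [32, 32, 16, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

These values should agree with the `Param #` column printed by `model.summary()`.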

In [0]:
cross_val_split = 5
cross_val_metrics = []

for i in range(cross_val_split):
    # Get a fresh train/validation split
    train_x, train_y, test_x, test_y = get_dataset()

    # Build the neural network model
    model = build_base_model()
    model.summary()

    # Keep only the profile features (first 32 columns) of the train/validation data
    train_split = np.hsplit(train_x, np.array([32, 800]))[0]
    test_split = np.hsplit(test_x, np.array([32, 800]))[0]
    model.fit(x=train_split, y=train_y, batch_size=32, shuffle=True, epochs=100)

    val_res = model.evaluate(test_split, test_y)
    cross_val_metrics.append(np.array(val_res))

print("(Loss, Accuracy, F1-score) after 5-fold cross validation", np.sum(cross_val_metrics, axis=0)/cross_val_split)
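The loop above draws a new stratified random split on every iteration, so the validation sets of different runs can overlap. If strictly disjoint folds are preferred, scikit-learn's `StratifiedKFold` is one option. The following is a minimal sketch on toy data, not the procedure used to produce the numbers in this notebook. The arrays `features` and `labels` and the `random_state` value are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 samples with 2 features, 10 per class (placeholder values)
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 10 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
fold_class_counts = []
for train_idx, val_idx in skf.split(features, labels):
    # Each sample lands in exactly one validation fold
    validation_indices.extend(val_idx.tolist())
    # Class balance of the validation fold: [count of label 0, count of label 1]
    fold_class_counts.append(np.bincount(labels[val_idx], minlength=2).tolist())

print(sorted(validation_indices))
print(fold_class_counts)
```

In the notebook, the model would be built and fitted inside this loop using `train_idx` and `val_idx` to index the feature and label frames.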

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_15 (InputLayer)        (None, 32)                0         
_________________________________________________________________
dense_50 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_51 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_52 (Dense)             (None, 1)                 17        
Total params: 1,601
Trainable params: 1,601
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoc

# Deep neural network - Profile features + BERT sentence encoding

In addition to the profile features, we add the semantic information contained in the user tweets. We use two parallel branches:

1.   A branch that takes the BERT sentence encoding as its input
2.   A branch that takes the profile features as its input

The first branch passes the 768-dimensional encoding through a series of dense layers and outputs a 32-dimensional vector that captures the semantic information. The second branch feeds the 32 profile features through unchanged. The two outputs are concatenated into a 64-dimensional vector and passed through a feed-forward network that classifies fake users.

**Accuracy: 0.93801743, F1-score: 0.9483967**

In [0]:
def build_bert_model(bert_dim=768, profile_dim=32):
    '''
        bert network
    '''
    bert_input = Input(shape=(bert_dim,))
    bert_output = Dense(256, activation='relu')(bert_input)
    bert_output = Dense(256, activation='relu')(bert_output)
    bert_output = Dense(256, activation='relu')(bert_output)
    bert_output = Dense(32, activation='relu')(bert_output)

    '''
        input for profile network
    '''
    profile_input = Input(shape=(profile_dim,))

    '''
        model for combined features
    '''
    x = concatenate([profile_input, bert_output])
    output = Dense(32, activation='relu')(x)
    output = Dense(16, activation='relu')(output)
    output = Dense(1, activation='sigmoid')(output)

    model = Model(inputs=[profile_input, bert_input], outputs=[output])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc', f1_m])
    return model

In [0]:
cross_val_split = 5
cross_val_metrics = []

for i in range(cross_val_split):
    # Get a fresh train/validation split, including the BERT embeddings
    train_x, train_y, test_x, test_y = get_dataset()

    # Build the neural network model
    model = build_bert_model()
    model.summary()

    # Split the columns into profile features (first 32) and BERT encodings (next 768)
    train_split = np.hsplit(train_x, np.array([32, 800]))[:2]
    test_split = np.hsplit(test_x, np.array([32, 800]))[:2]
    model.fit(x=train_split, y=train_y, batch_size=32, shuffle=True, epochs=100)

    val_res = model.evaluate(test_split, test_y)
    cross_val_metrics.append(np.array(val_res))

print("(Loss, Accuracy, F1-score) after 5-fold cross validation", np.sum(cross_val_metrics, axis=0)/cross_val_split)
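The `np.hsplit` calls in both training loops rely on how split indices are interpreted, which the toy example below makes explicit. Passing the indices `[32, 800]` cuts the columns at positions 32 and 800, giving three pieces: columns 0 to 31, columns 32 to 799, and an empty remainder. Taking `[0]` keeps the profile block only, while `[:2]` keeps the profile and BERT blocks in the order the two-input model expects. The array `toy` is a placeholder with the same column layout as the feature frame.

```python
import numpy as np

# Placeholder matrix with 2 rows and 800 columns (32 profile + 768 BERT)
toy = np.arange(2 * 800).reshape(2, 800)

parts = np.hsplit(toy, np.array([32, 800]))
shapes = [p.shape for p in parts]
print(shapes)

# [:2] drops the empty trailing piece and keeps (profile, bert)
profile_block, bert_block = parts[:2]
```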

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 768)          0                                            
__________________________________________________________________________________________________
dense_15 (Dense)                (None, 256)          196864      input_5[0][0]                    
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 256)          65792       dense_15[0][0]                   
__________________________________________________________________________________________________
dense_17 (Dense)                (None, 256)          65792       dense_16[0][0]                   
____________________________________________________________________________________________

As the results show, the network that uses the semantic information performs noticeably better: accuracy improves by about 3.5 percentage points and F1-score by about 3 percentage points.
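For reference, the size of the improvement can be recomputed directly from the averaged scores reported in the two sections above. The variable names are illustrative only.

```python
# Averaged scores reported for the two models
profile_only = {"accuracy": 0.90343137, "f1": 0.91886296}
profile_plus_bert = {"accuracy": 0.93801743, "f1": 0.9483967}

# Gain in percentage points for each metric
gains = {
    name: round((profile_plus_bert[name] - profile_only[name]) * 100, 2)
    for name in profile_only
}
print(gains)
```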