1. Data Preprocessing:
Clean and preprocess your text data as needed.
Tokenize the text into words or subwords.

2. BERT Embeddings:
Use Hugging Face's Transformers library to load a pre-trained BERT model.
Obtain contextualized word embeddings for each word in your text data.

3. Topic Modeling:
Apply a topic modeling technique (e.g., LDA or NMF) to extract latent topics from the preprocessed text data.

4. Feature Extraction:
Combine the BERT embeddings and topic distributions for each document. You can concatenate or merge these features.

5. Model Training:
Choose a classification model (e.g., logistic regression, random forest, or a neural network) for each dichotomy.
Train each model on the combined feature matrix, using the corresponding labels.

6. Evaluation:
Evaluate the performance of each model on a testing set using appropriate metrics.

7. Hyperparameter Tuning:
Fine-tune the hyperparameters of each model to optimize performance.

8. Interpretation:
Analyze the importance of different features, including BERT embeddings and topic distributions, for each dichotomy to gain insights into how they contribute to personality type prediction.
By building 
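
The cells in this section train on BERT embeddings alone, so steps 3 and 4 of the outline are not shown in code. A minimal sketch of how the two feature sets could be combined, assuming scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`; the toy posts, the random stand-in for the embedding matrix, and `n_topics` are hypothetical:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins (hypothetical): in the notebook these would be
# df_all_cleaned['processed_post'] and np.vstack(df_all_cleaned['bert_embeddings']).
processed_posts = [
    "wear band shirt public long sleeve",
    "accurate characterization users read theory",
    "people home decorations save money",
    "technological singularity possibility future",
]
rng = np.random.default_rng(0)
bert_matrix = rng.normal(size=(len(processed_posts), 768))  # 768 = bert-base hidden size

n_topics = 3  # assumed value; tune on real data

# Bag-of-words counts are the usual input for LDA.
counts = CountVectorizer().fit_transform(processed_posts)

# Each row of topic_dist is one document's distribution over topics.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topic_dist = lda.fit_transform(counts)

# Step 4: concatenate the two feature sets column-wise.
X_combined = np.hstack([bert_matrix, topic_dist])
print(X_combined.shape)  # (4, 771)
```

The topic columns lie in [0, 1] while the embedding values are unbounded, so scaling the combined matrix before training may help.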

In [3]:
import pandas as pd
import string
import re
import nltk
import torch
from transformers import BertModel, BertTokenizer
import sklearn
import numpy as np
import tensorflow
import scikeras
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, make_scorer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.constraints import MaxNorm
from scikeras.wrappers import KerasClassifier

In [4]:
# load dataset

df_all_cleaned = pd.read_csv(r'C:\Users\bella\Downloads\Y3Q2 langai\df_all_cleaned.csv')

In [5]:
# drop columns 'post_feeling', 'post_judging', 'post_sensing' and rename 'post_extrovert' to 'post'

df_all_cleaned = df_all_cleaned.drop(['post_feeling', 'post_judging', 'post_sensing'], axis=1)
df_all_cleaned = df_all_cleaned.rename(columns={'post_extrovert': 'post'})
df_all_cleaned

Unnamed: 0,auhtor_ID,post,extrovert,feeling,judging,sensing
0,t2_12bhu7,I wear a Lorna shore shirt out alot in public ...,1.0,1.0,0.0,0.0
1,t2_12jbpd,I'd say this is a very accurate characterizati...,1.0,0.0,0.0,0.0
2,t2_12uwr5,Ya know like most people with home decorations...,0.0,0.0,1.0,0.0
3,t2_12zm15,It's true tho. They're kinda more interesting ...,0.0,1.0,0.0,0.0
4,t2_13cjjl,"Yeah, but that's one of the things that make m...",0.0,0.0,0.0,1.0
...,...,...,...,...,...,...
150,t2_vfp8y,so change profession then. this would be inadm...,0.0,0.0,1.0,0.0
151,t2_w0842,The technological singularity. And the possibi...,0.0,0.0,1.0,0.0
152,t2_w6rgl,Dear God man. Chill. I'm not Einstein or Hawki...,0.0,0.0,1.0,0.0
153,t2_wilcwvo,That's what a fake lib would say [Human blood ...,1.0,0.0,0.0,0.0


In [6]:
# preprocess sentences

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Join the tokens back into a single string
    return ' '.join(tokens)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bella\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bella\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
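
Because the regex deletes apostrophes before tokenization, contractions collapse into tokens such as `id`, `im`, `theyre` and `thats`, which then slip past the stop-word filter and remain in `processed_post`. A minimal sketch of one alternative, which splits contractions at the apostrophe first; it assumes a tiny hypothetical stop-word set and a whitespace split in place of NLTK so that it runs without downloads:

```python
import re

# Hypothetical, tiny stop-word set standing in for stopwords.words('english').
STOP_WORDS = {"i", "m", "d", "s", "re", "not", "or", "they", "that", "a", "is", "this"}

def preprocess_split_contractions(text):
    """Lowercase, split contractions at the apostrophe, then drop stop words."""
    text = text.lower()
    # Normalize the typographic apostrophe, then turn apostrophes into spaces
    # so "i'm" becomes "i m" instead of "im".
    text = text.replace("\u2019", "'").replace("'", " ")
    # Remove everything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(preprocess_split_contractions("I'm not Einstein or Hawking."))
# einstein hawking
```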


In [7]:
# apply preprocessing 

df_all_cleaned['processed_post'] = df_all_cleaned['post'].apply(preprocess_text)
df_all_cleaned

Unnamed: 0,auhtor_ID,post,extrovert,feeling,judging,sensing,processed_post
0,t2_12bhu7,I wear a Lorna shore shirt out alot in public ...,1.0,1.0,0.0,0.0,wear lorna shore shirt alot public lewd long s...
1,t2_12jbpd,I'd say this is a very accurate characterizati...,1.0,0.0,0.0,0.0,id say accurate characterization ni users read...
2,t2_12uwr5,Ya know like most people with home decorations...,0.0,0.0,1.0,0.0,ya know like people home decorations could sav...
3,t2_12zm15,It's true tho. They're kinda more interesting ...,0.0,1.0,0.0,0.0,true tho theyre kinda interesting buuuut issue...
4,t2_13cjjl,"Yeah, but that's one of the things that make m...",0.0,0.0,0.0,1.0,yeah thats one things make better objectively ...
...,...,...,...,...,...,...,...
150,t2_vfp8y,so change profession then. this would be inadm...,0.0,0.0,1.0,0.0,change profession would inadmissible country p...
151,t2_w0842,The technological singularity. And the possibi...,0.0,0.0,1.0,0.0,technological singularity possibility contribu...
152,t2_w6rgl,Dear God man. Chill. I'm not Einstein or Hawki...,0.0,0.0,1.0,0.0,dear god man chill im einstein hawking serious...
153,t2_wilcwvo,That's what a fake lib would say [Human blood ...,1.0,0.0,0.0,0.0,thats fake lib would say human blood water url...


In [8]:
# export this dataset, it will be used later.

# df_all_cleaned.to_csv('df_all_cleaned_preprocessed.csv', sep=',', index=False, encoding='utf-8')

In [9]:
# Get BERT embeddings. Note: they are computed on the raw 'post' column, not on 'processed_post'.

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Function to obtain BERT embeddings for a text
def get_bert_embeddings(text):
    # Tokenize input text and convert to tensor
    tokens = tokenizer.encode(text, add_special_tokens=True, return_tensors='pt', max_length=512, truncation=True)

    # Get BERT embeddings
    with torch.no_grad():
        outputs = bert_model(tokens)
        embeddings = outputs.last_hidden_state

    # Average the embeddings across tokens (you can modify this based on your needs)
    avg_embedding = torch.mean(embeddings, dim=1).squeeze().numpy()

    return avg_embedding

df_all_cleaned['bert_embeddings'] = df_all_cleaned['post'].apply(get_bert_embeddings)
print(df_all_cleaned)

      auhtor_ID                                               post  extrovert  \
0     t2_12bhu7  I wear a Lorna shore shirt out alot in public ...        1.0   
1     t2_12jbpd  I'd say this is a very accurate characterizati...        1.0   
2     t2_12uwr5  Ya know like most people with home decorations...        0.0   
3     t2_12zm15  It's true tho. They're kinda more interesting ...        0.0   
4     t2_13cjjl  Yeah, but that's one of the things that make m...        0.0   
..          ...                                                ...        ...   
150    t2_vfp8y  so change profession then. this would be inadm...        0.0   
151    t2_w0842  The technological singularity. And the possibi...        0.0   
152    t2_w6rgl  Dear God man. Chill. I'm not Einstein or Hawki...        0.0   
153  t2_wilcwvo  That's what a fake lib would say [Human blood ...        1.0   
154   t2_zq7gkv  My biggest problem is asking for it. I don’t n...        1.0   

     feeling  judging  sens
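
`get_bert_embeddings` encodes one post at a time, so there are no padding tokens and a plain mean over the token axis is safe. If the posts were batched with padding to speed this up, the mean would need to skip the padded positions. A minimal NumPy sketch of that masked mean, with a hypothetical `masked_mean_pool` helper and toy arrays in place of real BERT outputs:

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where mask == 0.

    hidden_states: (batch, seq_len, hidden) array
    attention_mask: (batch, seq_len) array of 0/1
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
# [[2. 3.]
#  [2. 2.]]
```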

In [10]:
# make a separate dataframe for the y labels.

df_labels = df_all_cleaned[['extrovert', 'feeling', 'judging', 'sensing']].copy()

df_labels

Unnamed: 0,extrovert,feeling,judging,sensing
0,1.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
150,0.0,0.0,1.0,0.0
151,0.0,0.0,1.0,0.0
152,0.0,0.0,1.0,0.0
153,1.0,0.0,0.0,0.0


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# features (X) and labels (y)
X = np.vstack(df_all_cleaned['bert_embeddings'])
y = df_labels

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a simple neural network model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(64, activation='relu'),
    Dense(df_labels.shape[1], activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Predictions
nn_predictions = model.predict(X_test)



Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
# Evaluate each dichotomy in df_labels
for i, column in enumerate(df_labels.columns):  # iteritems() was removed in pandas 2.0
    threshold = 0.5  # Adjust the threshold based on your task
    binary_predictions = (nn_predictions[:, i] > threshold).astype(int)
    accuracy = accuracy_score(y_test[column], binary_predictions)
    print(f"Accuracy for {column} (Neural Network): {accuracy}")

    # Classification report for each dichotomy
    print(f"Classification Report for {column} (Neural Network):")
    print(classification_report(y_test[column], binary_predictions))

Accuracy for extrovert (Neural Network): 0.7741935483870968
Classification Report for extrovert (Neural Network):
              precision    recall  f1-score   support

         0.0       0.82      0.92      0.87        25
         1.0       0.33      0.17      0.22         6

    accuracy                           0.77        31
   macro avg       0.58      0.54      0.55        31
weighted avg       0.73      0.77      0.74        31

Accuracy for feeling (Neural Network): 0.7419354838709677
Classification Report for feeling (Neural Network):
              precision    recall  f1-score   support

         0.0       0.74      1.00      0.85        23
         1.0       0.00      0.00      0.00         8

    accuracy                           0.74        31
   macro avg       0.37      0.50      0.43        31
weighted avg       0.55      0.74      0.63        31

Accuracy for judging (Neural Network): 0.41935483870967744
Classification Report for judging (Neural Network):
           

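
Accuracy alone is hard to read here because the labels are imbalanced. In the extrovert report, 25 of the 31 test posts belong to class 0, so always predicting 0 would already score 25/31 ≈ 0.81, above the 0.77 the network reached. A small sketch of that majority-class baseline, using a hypothetical `majority_baseline_accuracy` helper and label counts taken from the support columns of the reports:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Label counts from the support columns of the classification reports.
extrovert_test = [0] * 25 + [1] * 6
feeling_test = [0] * 23 + [1] * 8

print(round(majority_baseline_accuracy(extrovert_test), 4))  # 0.8065
print(round(majority_baseline_accuracy(feeling_test), 4))    # 0.7419
```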


In [14]:
# batch size and epochs grid search

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dense(64, activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, verbose=0)

# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.412118 using {'batch_size': 100, 'epochs': 100}
0.355014 (0.025132) with: {'batch_size': 10, 'epochs': 10}
0.387727 (0.056188) with: {'batch_size': 10, 'epochs': 50}
0.371467 (0.044918) with: {'batch_size': 10, 'epochs': 100}
0.371274 (0.048022) with: {'batch_size': 20, 'epochs': 10}
0.396051 (0.078653) with: {'batch_size': 20, 'epochs': 50}
0.355788 (0.083221) with: {'batch_size': 20, 'epochs': 100}
0.363144 (0.036560) with: {'batch_size': 40, 'epochs': 10}
0.363144 (0.023313) with: {'batch_size': 40, 'epochs': 50}
0.387340 (0.023560) with: {'batch_size': 40, 'epochs': 100}
0.371274 (0.048022) with: {'batch_size': 60, 'epochs': 10}
0.331010 (0.043797) with: {'batch_size': 60, 'epochs': 50}
0.379791 (0.067263) with: {'batch_size': 60, 'epochs': 100}
0.371274 (0.048022) with: {'batch_size': 80, 'epochs': 10}
0.322687 (0.031288) with: {'batch_size': 80, 'epochs': 50}
0.363144 (0.036560) with: {'batch_size': 80, 'epochs': 100}
0.371274 (0.048022) with: {'batch_size': 100, 'epochs'
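
No `scoring` argument is given, so `GridSearchCV` uses the estimator's own score, which for `KerasClassifier` should be plain accuracy. On a multi-label target that is subset accuracy: a post counts as correct only if all four labels match. This would explain why the scores in these searches sit between roughly 0.32 and 0.42. A sketch of a per-label alternative that could be passed as `scoring=` (the notebook imports `make_scorer` but does not use it); `mean_label_accuracy` is a hypothetical helper and the arrays are toy data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, make_scorer

def mean_label_accuracy(y_true, y_pred):
    """Fraction of individual label cells predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Could be passed as GridSearchCV(..., scoring=label_scorer).
label_scorer = make_scorer(mean_label_accuracy)

y_true = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]])
y_pred = np.array([[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]])

subset_acc = accuracy_score(y_true, y_pred)        # only the second row matches fully
label_acc = mean_label_accuracy(y_true, y_pred)    # 8 of 12 cells match
print(subset_acc, label_acc)
```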

In [35]:
# optimizer grid search

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dense(64, activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Not compiled here; KerasClassifier compiles the model with the loss and optimizer passed to it
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, loss="binary_crossentropy", epochs= 100, batch_size =100, verbose=0)

# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.388115 using {'optimizer': 'Nadam'}
0.371274 (0.048022) with: {'optimizer': 'SGD'}
0.371274 (0.065491) with: {'optimizer': 'RMSprop'}
0.154278 (0.201581) with: {'optimizer': 'Adagrad'}
0.079365 (0.112239) with: {'optimizer': 'Adadelta'}
0.364111 (0.107798) with: {'optimizer': 'Adam'}
0.355207 (0.033803) with: {'optimizer': 'Adamax'}
0.388115 (0.089798) with: {'optimizer': 'Nadam'}


In [36]:
# grid search on learning rate and momentum

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model():
    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dense(64, activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Not compiled here; KerasClassifier compiles the model with the loss and optimizer passed to it
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, loss="binary_crossentropy", optimizer="SGD", epochs=100, batch_size=10, verbose=0)

# define the grid search parameters
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
param_grid = dict(optimizer__learning_rate=learn_rate, optimizer__momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.388115 using {'optimizer__learning_rate': 0.1, 'optimizer__momentum': 0.6}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.0}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.2}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.4}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.6}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.8}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.001, 'optimizer__momentum': 0.9}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.0}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.2}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.4}
0.371274 (0.048022) with: {'optimizer__learning_rate': 0.01, 'optimizer__momentum': 0.6}
0.355014 (0.025132) w
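
The two values searched here control the SGD update step. As a reminder of what they do, a minimal pure-Python sketch of the classical momentum update on a one-dimensional quadratic; `sgd_momentum` is a hypothetical helper written for illustration, not the Keras implementation:

```python
def sgd_momentum(grad_fn, w0, learning_rate, momentum, steps):
    """Run SGD with classical momentum and return the final parameter."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradients.
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w

# Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0.
def grad(w):
    return 2.0 * w

for lr in (0.001, 0.1):
    w_final = sgd_momentum(grad, w0=1.0, learning_rate=lr, momentum=0.6, steps=100)
    print(lr, abs(w_final))
```

In this toy problem the smallest learning rate barely moves the weight in 100 steps, which may be one reason the scores for 0.001 and 0.01 are identical across momentum values.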

In [37]:
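# grid search on weight initialization

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model(init_mode='uniform'):
    model = Sequential()
    # Pass init_mode to every layer so the grid search actually varies the initializer
    model.add(Dense(128, kernel_initializer=init_mode, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dense(64, kernel_initializer=init_mode, activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], kernel_initializer=init_mode, activation='sigmoid'))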
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=60, verbose=0)

# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
param_grid = dict(model__init_mode=init_mode)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.403600 using {'model__init_mode': 'glorot_normal'}
0.371854 (0.078381) with: {'model__init_mode': 'uniform'}
0.395858 (0.061047) with: {'model__init_mode': 'lecun_uniform'}
0.387727 (0.056188) with: {'model__init_mode': 'normal'}
0.379597 (0.049549) with: {'model__init_mode': 'zero'}
0.403600 (0.044401) with: {'model__init_mode': 'glorot_normal'}
0.396051 (0.078653) with: {'model__init_mode': 'glorot_uniform'}
0.387727 (0.056188) with: {'model__init_mode': 'he_normal'}
0.363724 (0.074701) with: {'model__init_mode': 'he_uniform'}
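
The initializers compared here differ mainly in the scale of the starting weights. A small sketch of the sampling bounds for the Glorot and He uniform schemes, assuming a 768-dimensional input (the hidden size of `bert-base-uncased`) feeding the 128-unit first layer; the helper names are hypothetical:

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Weights are drawn from U(-limit, limit) with limit = sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

fan_in, fan_out = 768, 128
print(round(glorot_uniform_limit(fan_in, fan_out), 4))  # 0.0818
print(round(he_uniform_limit(fan_in), 4))               # 0.0884
```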


In [38]:
# grid search activation

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model(activation='relu'):
    model = Sequential()
    model.add(Dense(128, kernel_initializer='he_uniform', activation=activation, input_shape=(X_train.shape[1],)))
    model.add(Dense(64, kernel_initializer='he_uniform', activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=60, verbose=0)

# define the grid search parameters
activation = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
param_grid = dict(model__activation=activation)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.403794 using {'model__activation': 'softplus'}
0.363144 (0.036560) with: {'model__activation': 'softmax'}
0.403794 (0.049823) with: {'model__activation': 'softplus'}
0.387534 (0.038326) with: {'model__activation': 'softsign'}
0.388115 (0.089798) with: {'model__activation': 'relu'}
0.355594 (0.066993) with: {'model__activation': 'tanh'}
0.371661 (0.063953) with: {'model__activation': 'sigmoid'}
0.371661 (0.063953) with: {'model__activation': 'hard_sigmoid'}
0.387727 (0.056188) with: {'model__activation': 'linear'}


In [39]:
# grid search weight constraint, drop rate

# Assuming X_train, X_test, y_train, y_test are already defined

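def create_model(dropout_rate, weight_constraint):
    model = Sequential()
    # Apply the searched max-norm constraint to the first layer's weights
    model.add(Dense(128, kernel_initializer='he_uniform', activation='relu', kernel_constraint=MaxNorm(weight_constraint), input_shape=(X_train.shape[1],)))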
    model.add(Dropout(dropout_rate))
    model.add(Dense(64, kernel_initializer='he_uniform', activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=60, verbose=0)

# define the grid search parameters
weight_constraint = [1.0, 2.0, 3.0, 4.0, 5.0]
dropout_rate = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
param_grid = dict(model__dropout_rate=dropout_rate, model__weight_constraint=weight_constraint)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))



Best: 0.420054 using {'model__dropout_rate': 0.3, 'model__weight_constraint': 1.0}
0.395858 (0.064213) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 1.0}
0.395858 (0.064213) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 2.0}
0.387727 (0.056188) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 3.0}
0.403794 (0.053656) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 4.0}
0.395664 (0.045185) with: {'model__dropout_rate': 0.0, 'model__weight_constraint': 5.0}
0.379985 (0.085837) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 1.0}
0.396051 (0.078653) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 2.0}
0.347658 (0.083032) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 3.0}
0.355788 (0.083221) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 4.0}
0.396051 (0.078653) with: {'model__dropout_rate': 0.1, 'model__weight_constraint': 5.0}
0.379597 (0.049549) with: {'model__dr
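
`weight_constraint` in this grid is meant as a max-norm limit on each unit's incoming weight vector. A NumPy sketch of what such a constraint does after a weight update, using a hypothetical `apply_max_norm` helper rather than Keras's `MaxNorm` class:

```python
import numpy as np

def apply_max_norm(weights, max_value, axis=0):
    """Rescale weight vectors whose L2 norm along `axis` exceeds max_value."""
    norms = np.sqrt((weights ** 2).sum(axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return weights * (desired / (norms + 1e-7))

# Two units (columns): the first has norm 5, the second norm 0.5.
w = np.array([[3.0, 0.3],
              [4.0, 0.4]])
constrained = apply_max_norm(w, max_value=1.0)
constrained_norms = np.linalg.norm(constrained, axis=0)
print(np.round(constrained_norms, 3))  # approximately [1.0, 0.5]
```

Only the first column is shrunk; the second already satisfies the limit and is left essentially unchanged.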

In [40]:
# grid search neurons

# Assuming X_train, X_test, y_train, y_test are already defined

def create_model(neurons):
    model = Sequential()
    model.add(Dense(neurons, input_shape=(X_train.shape[1],), kernel_initializer='he_uniform', activation='relu', kernel_constraint=MaxNorm(1, axis=0)))
    model.add(Dropout(0.0))
    model.add(Dense(64, kernel_initializer='he_uniform', activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model = KerasClassifier(model=create_model, epochs=100, batch_size=60, verbose=0)

# define the grid search parameters
neurons = [1, 5, 10, 15, 20, 25, 30]
param_grid = dict(model__neurons=neurons)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train, validation_data=(X_test, y_test))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.387534 using {'model__neurons': 25}
0.363144 (0.054065) with: {'model__neurons': 1}
0.371274 (0.048022) with: {'model__neurons': 5}
0.371274 (0.048022) with: {'model__neurons': 10}
0.355014 (0.025132) with: {'model__neurons': 15}
0.330623 (0.049823) with: {'model__neurons': 20}
0.387534 (0.043191) with: {'model__neurons': 25}
0.371467 (0.044918) with: {'model__neurons': 30}


In [41]:
# final model

# X = np.vstack(df_all_cleaned['bert_embeddings'])
# y = df_labels

# # Train-test split
# X_train, X_test, y_train, y_test = train_test_split(X, df_labels, test_size=0.2, random_state=42)

# Build a simple neural network model
def create_model():
    model = Sequential()
    model.add(Dense(units=1, input_shape=(X_train.shape[1],), kernel_initializer='he_uniform', activation='relu', kernel_constraint=MaxNorm(1, axis=0)))
    model.add(Dropout(0.0))
    model.add(Dense(64, kernel_initializer='he_uniform', activation='sigmoid'))
    model.add(Dense(df_labels.shape[1], activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# fix random seed for reproducibility
seed = 7
tf.random.set_seed(seed)

# create model
model1 = KerasClassifier(model=create_model, epochs=100, batch_size=100, verbose=0)

# Train the model
model1.fit(X_train, y_train, epochs=100, batch_size=100, validation_data=(X_test, y_test))
# 
# Predictions
predictions = model1.predict(X_test)

In [42]:
predictions

array([[0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0]])

Tuning the hyperparameters does not change the predictions: the final model outputs the same label vector, [0, 0, 1, 0], for all 31 test posts.
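
A quick way to confirm that a model has collapsed to a single output is to count the distinct prediction rows. A minimal sketch, with a hypothetical `count_distinct_rows` helper and a toy list shaped like the `predictions` array:

```python
from collections import Counter

def count_distinct_rows(rows):
    """Return a Counter mapping each distinct row (as a tuple) to its frequency."""
    return Counter(tuple(int(v) for v in row) for row in rows)

# Toy stand-in for `predictions`: 31 identical rows.
toy_predictions = [[0, 0, 1, 0]] * 31

distinct = count_distinct_rows(toy_predictions)
print(distinct)  # Counter({(0, 0, 1, 0): 31})
```

The helper iterates row by row, so it should also accept a 2-D NumPy array such as the one returned by `model1.predict(X_test)`.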

In [43]:
# Create a 'personality' column by concatenating the extrovert, feeling, judging and sensing labels as strings (joined digit by digit, not summed); the float labels are cast to int first
df_all_cleaned['extrovert'] = df_all_cleaned['extrovert'].astype(int)
df_all_cleaned['feeling'] = df_all_cleaned['feeling'].astype(int)
df_all_cleaned['judging'] = df_all_cleaned['judging'].astype(int)
df_all_cleaned['sensing'] = df_all_cleaned['sensing'].astype(int)
df_all_cleaned['personality'] = df_all_cleaned['extrovert'].astype(str) + df_all_cleaned['feeling'].astype(str) + df_all_cleaned['judging'].astype(str) + df_all_cleaned['sensing'].astype(str)
df_all_cleaned

Unnamed: 0,auhtor_ID,post,extrovert,feeling,judging,sensing,processed_post,bert_embeddings,personality
0,t2_12bhu7,I wear a Lorna shore shirt out alot in public ...,1,1,0,0,wear lorna shore shirt alot public lewd long s...,"[0.03740044, 0.03744348, 0.40402788, -0.154586...",1100
1,t2_12jbpd,I'd say this is a very accurate characterizati...,1,0,0,0,id say accurate characterization ni users read...,"[-0.122634806, 0.06978733, 0.23516777, -0.1674...",1000
2,t2_12uwr5,Ya know like most people with home decorations...,0,0,1,0,ya know like people home decorations could sav...,"[0.10358047, -0.079817355, 0.4862674, 0.006832...",0010
3,t2_12zm15,It's true tho. They're kinda more interesting ...,0,1,0,0,true tho theyre kinda interesting buuuut issue...,"[-0.11131706, 0.070213296, 0.5168624, 0.017204...",0100
4,t2_13cjjl,"Yeah, but that's one of the things that make m...",0,0,0,1,yeah thats one things make better objectively ...,"[0.21926472, 0.11031033, 0.28619415, 0.1073027...",0001
...,...,...,...,...,...,...,...,...,...
150,t2_vfp8y,so change profession then. this would be inadm...,0,0,1,0,change profession would inadmissible country p...,"[-0.17524596, 0.1841626, 0.44777465, -0.118973...",0010
151,t2_w0842,The technological singularity. And the possibi...,0,0,1,0,technological singularity possibility contribu...,"[-0.025105778, -0.08084041, 0.3463775, -0.0189...",0010
152,t2_w6rgl,Dear God man. Chill. I'm not Einstein or Hawki...,0,0,1,0,dear god man chill im einstein hawking serious...,"[0.088415004, 0.22571912, 0.38455746, -0.03194...",0010
153,t2_wilcwvo,That's what a fake lib would say [Human blood ...,1,0,0,0,thats fake lib would say human blood water url...,"[0.05957357, 0.10176626, 0.3839616, -0.0742901...",1000
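
The `personality` string keeps the column order extrovert, feeling, judging, sensing, which differs from the usual MBTI letter order (E/I, S/N, T/F, J/P). A small sketch of a decoder, assuming that a 0 in each column stands for the opposite pole (introvert, thinking, perceiving, intuitive); `decode_personality` is a hypothetical helper:

```python
def decode_personality(bits):
    """Map a 4-character bit string (extrovert, feeling, judging, sensing) to MBTI letters."""
    if len(bits) != 4 or set(bits) - {"0", "1"}:
        raise ValueError(f"expected four binary digits, got {bits!r}")
    extrovert, feeling, judging, sensing = (b == "1" for b in bits)
    return (
        ("E" if extrovert else "I")
        + ("S" if sensing else "N")
        + ("F" if feeling else "T")
        + ("J" if judging else "P")
    )

print(decode_personality("1100"))  # ENFP
print(decode_personality("0010"))  # INTJ
```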


In [44]:
# Combine predictions for each dichotomy
combined_predictions = np.hstack([predictions[:, i].reshape(-1, 1) for i in range(predictions.shape[1])])

# Binarize the combined predictions
binary_combined_predictions = (combined_predictions > 0.5).astype(int)

# Subset accuracy: a post counts as correct only if all four dichotomies are predicted correctly
combined_accuracy = accuracy_score(y_test, binary_combined_predictions)
print(f"Accuracy for Combined MBTI Prediction: {combined_accuracy}")

Accuracy for Combined MBTI Prediction: 0.3225806451612903


In [45]:
# Evaluate each dichotomy in df_labels
for i, column in enumerate(df_labels.columns):  # iteritems() was removed in pandas 2.0
    threshold = 0.5  # Adjust the threshold based on your task
    binary_predictions = (predictions[:, i] > threshold).astype(int)
    accuracy = accuracy_score(y_test[column], binary_predictions)
    print(f"Accuracy for {column} (Neural Network): {accuracy}")

    # Classification report for each dichotomy
    print(f"Classification Report for {column} (Neural Network):")
    print(classification_report(y_test[column], binary_predictions))

Accuracy for extrovert (Neural Network): 0.8064516129032258
Classification Report for extrovert (Neural Network):
              precision    recall  f1-score   support

         0.0       0.81      1.00      0.89        25
         1.0       0.00      0.00      0.00         6

    accuracy                           0.81        31
   macro avg       0.40      0.50      0.45        31
weighted avg       0.65      0.81      0.72        31

Accuracy for feeling (Neural Network): 0.7419354838709677
Classification Report for feeling (Neural Network):
              precision    recall  f1-score   support

         0.0       0.74      1.00      0.85        23
         1.0       0.00      0.00      0.00         8

    accuracy                           0.74        31
   macro avg       0.37      0.50      0.43        31
weighted avg       0.55      0.74      0.63        31

Accuracy for judging (Neural Network): 0.4838709677419355
Classification Report for judging (Neural Network):
            

