# Myers-Briggs Personality Type Classifier
### The Project:
Create a model that predicts someone's MBTI personality type using their online posts as input.

### The Data:
This dataset is from Kaggle.com and contains text data in the form of the 50 most recent online posts to PersonalityCafe.com by over 8,000 users. Since Personality Cafe is a website that focuses on personality type models (especially MBTI), there are multiple instances in posts where the user's personality type is mentioned. In my previous version of this notebook I left all mentions of personality in the text. However, in this version posts such as "That's so INTJ, bro!" have been filtered to "That's so bro!" (Since we are using a bag-of-words model here it won't matter that the revised sentence doesn't make much sense).
Here is the full dataset url: https://www.kaggle.com/datasnaek/mbti-type/data
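
The filtering described above can be illustrated with a minimal sketch. It assumes a small hypothetical `personality_words` set (the notebook loads the real list from a file) and uses a simple regex tokenizer as a stand-in for NLTK's `word_tokenize`, so the exact tokens may differ from what the notebook produces:

```python
import re

# Hypothetical stand-in for the word list the notebook loads from disk.
personality_words = {"intj", "intp", "enfp", "infj"}

def filter_personality_words(text, blocked=personality_words):
    # Lowercase, split into word tokens, and drop any blocked token.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in blocked]

filtered = filter_personality_words("That's so INTJ, bro!")
print(filtered)  # ["that's", 'so', 'bro']
```

Because a bag-of-words model only looks at which tokens are present, the missing word does not affect how the rest of the sentence is represented.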

### The Model
Earlier, I trained a classifier to predict which of the 16 possible MBTI personality types a person belongs to, based on their last 50 online posts. The accuracy of that holistic model with all 16 labels was about 27%. This sounds like pretty bad accuracy until you consider that picking one personality type at random as a guess would give us only 6.25% accuracy, so we are doing quite a bit better than that.
However, what I want to do here is create four separate classifiers and output a prediction for each one. Breaking this up should provide some interesting insights.

In this notebook I still include the 16-way classifier as I had in the previous notebook. However, here I also break the single 16-way classifier into 4 separate two-way classifiers. So we are training five classifiers in all:
* The all-in classifier that predicts which of the 16 personality types the text belongs to
* Dichotomy 1, E or I. Favorite world dichotomy: is this person extroverted or introverted?
* Dichotomy 2, N or S. Information dichotomy: does this person attach meaning to the information they consume (iNtuitive) or do they take information as is (Sensing)?
* Dichotomy 3, F or T. Decisions dichotomy: does this person favor the feelings of others and themselves when making decisions, or do they operate more logically and consistently?
* Dichotomy 4, J or P. Structure dichotomy: does this person have things figured out already or are they open to new information?

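Breaking the problem down into dichotomies gives much higher accuracy per classifier: the best two-way classifiers reach roughly 65% to 84% test accuracy, compared with about 38% for the 16-way classifier. Keep in mind that a two-way task has a 50% random-guess baseline rather than 6.25%. See the results in the Vectorize Train and Evaluate section for more.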
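
As a small illustration of how one 16-way label maps onto four two-way labels and back, here is a sketch in plain Python. The names `DICHOTOMIES`, `split_type`, and `combine_predictions` are hypothetical helpers for this example and are not part of the notebook:

```python
# Letter pairs for the four dichotomies, in MBTI position order.
DICHOTOMIES = [("E", "I"), ("N", "S"), ("F", "T"), ("J", "P")]

def split_type(mbti_type):
    # "ENTP" -> ["E", "N", "T", "P"], one label per two-way classifier.
    letters = list(mbti_type.upper())
    if len(letters) != len(DICHOTOMIES):
        raise ValueError(f"expected 4 letters, got {mbti_type!r}")
    for letter, pair in zip(letters, DICHOTOMIES):
        if letter not in pair:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
    return letters

def combine_predictions(letters):
    # Join four per-dichotomy predictions back into one 16-way type.
    return "".join(letters)

labels = split_type("ENTP")
print(labels)                       # ['E', 'N', 'T', 'P']
print(combine_predictions(labels))  # ENTP
```

Recombining the four predictions this way is one option for getting a full type out of the two-way classifiers; the notebook itself evaluates each dichotomy separately.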

In [1]:
import pandas as pd
import numpy as np
import sklearn
from nltk import word_tokenize
import re

# Classifiers to use in this notebook
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import Perceptron
from sklearn import tree
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer(sparse=True)

## Set Parameters for this notebook

In [2]:
# Classifiers to be fitted and evaluated
classifiers = [MultinomialNB(), 
               LogisticRegression(), 
               # svm.SVC(kernel='rbf'), 
               Perceptron(), 
               tree.DecisionTreeClassifier()]



In [3]:
# ratio of train/test data
split_ratio = 0.85

# seed for random split of train/test data
seed = 1

In [4]:
# Location of data
file = './data/mbti_1.csv'

# Our data is the MBTI-Type dataset from Kaggle.com
# https://www.kaggle.com/datasnaek/mbti-type/data

In [5]:
# Load set of words to filter from the text
with open('./data/personality_words.txt', 'r') as reader:
    # a set gives fast membership tests when filtering tokens
    personality_words = set(reader.read().split('\n'))

## Define Our Functions

Preprocessing

In [6]:
def load_and_extract(filepath, split_ratio=0.85, seed=1):
    loaded = pd.read_csv(filepath)
    # print('Dataset size', loaded.shape, loaded.info)
    # split into training, testing
    train=loaded.sample(frac=split_ratio, random_state=seed)
    test=loaded.drop(train.index)
    return train, test

In [7]:
def split_into_dichotomies(df):
    # split ENTP or other personality type into four columns
    df['dicho1'] = df.type.str[0]
    df['dicho2'] = df.type.str[1]
    df['dicho3'] = df.type.str[2]
    df['dicho4'] = df.type.str[3]
    return df

In [8]:
def remove_double_period(clean_str):
    return clean_str.replace('..', '. ')

In [9]:
def remove_delim(clean_str):
    return clean_str.replace('|||', '. ')

In [10]:
def replace_links(clean_str):
    urlpattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(urlpattern, '*LINK*', clean_str)
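
As a quick check of how the three string helpers compose, here is a sketch that applies them in the same order `clean_posts` does. The sample post is made up, and the helpers are repeated so the snippet runs on its own:

```python
import re

def remove_double_period(clean_str):
    return clean_str.replace('..', '. ')

def remove_delim(clean_str):
    return clean_str.replace('|||', '. ')

def replace_links(clean_str):
    urlpattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(urlpattern, '*LINK*', clean_str)

# Made-up raw text: two posts separated by the dataset's '|||' delimiter.
raw = "https://example.com/watch?v=1|||I agree..totally"

step1 = remove_double_period(raw)  # '..' becomes '. '
step2 = remove_delim(step1)        # '|||' becomes '. '
step3 = replace_links(step2)       # the URL becomes '*LINK*'
print(step3)  # *LINK* I agree. totally
```

Note that the period inserted in place of the delimiter is absorbed into the URL match, because `.` is one of the characters the pattern accepts.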

In [11]:
def clean_posts(posts_raw):
    if posts_raw.startswith('"') or posts_raw.startswith("'"):
        # remove extra string quotes if they exist
        clean = posts_raw[1:-1]
    else:
        clean = posts_raw
    clean = remove_double_period(clean)
    clean = remove_delim(clean)
    clean = replace_links(clean)
    tokens = [t.lower() for t in word_tokenize(clean)]
    return [t for t in tokens if t not in personality_words]  # filter personality words
    # return tokens  # not filtering personality words

Vectorization

In [12]:
def get_vocab(df):
    # get the vocabulary for a dataset, i.e. all possible features
    vocab = set()
    for row in df.itertuples():
        vocab.update(row.post_list)
    return vocab

In [13]:
def vectorize_one(x, vocab, dichotomy=False):
    # take one training example (dataframe row) and return its label and a sparse feature dict

    # I know this part is really not elegant code. Rewriting these functions from
    # scratch would be the next step I take.
    if dichotomy == 1:    # first letter: E or I
        label = x.dicho1
    elif dichotomy == 2:  # second letter: N or S
        label = x.dicho2
    elif dichotomy == 3:  # third letter: F or T
        label = x.dicho3
    elif dichotomy == 4:  # fourth letter: J or P
        label = x.dicho4
    else:                 # personality type is not split, e.g. INTJ
        label = x.type

    feature_dict = {}
    for fx in x.post_list:
        if fx in vocab:
            feature_dict[fx] = 1  # binary presence feature: 1 if the token occurs at all
    return label, feature_dict  # label and feature dict for a single sample

In [14]:
def get_feature_dict(df, vocab, dichotomy=False):
    # Create the labels and feature dictionaries for a dataset
    X = []
    y = []
    for row in df.itertuples():
        # dichotomy=False falls through to the full 16-way label in vectorize_one
        label, feature_dict = vectorize_one(row, vocab, dichotomy)
        y.append(label)
        X.append(feature_dict)
    return y, X

In [15]:
def vectorize_all(train, test, dichotomy=False):
    # the vocabulary comes from the training data only
    vocab = get_vocab(train)

    # training data: fit the vectorizer and transform
    y_train, X_train_dict = get_feature_dict(train, vocab, dichotomy)
    X_train = vectorizer.fit_transform(X_train_dict)

    # testing data: reuse the vectorizer fitted on the training data
    y_test, X_test_dict = get_feature_dict(test, vocab, dichotomy)
    X_test = vectorizer.transform(X_test_dict)

    return X_train, y_train, X_test, y_test
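
To make the vectorization step more concrete, here is a plain-Python sketch of the idea behind it: build a column index from the training tokens, then mark each post with 1 for every known token it contains. This is only a conceptual illustration with hypothetical helper names and toy posts. The notebook itself relies on scikit-learn's `DictVectorizer`, which produces a sparse matrix rather than Python lists:

```python
def build_vocab_index(token_lists):
    # Map every token seen in the training posts to a fixed column index.
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def to_binary_vector(tokens, vocab_index):
    # 1 if the token occurs at least once in the post, else 0.
    vec = [0] * len(vocab_index)
    for tok in tokens:
        col = vocab_index.get(tok)
        if col is not None:  # tokens unseen in training are ignored
            vec[col] = 1
    return vec

# Toy data, not taken from the dataset.
train_posts = [["i", "love", "books"], ["love", "parties"]]
test_post = ["i", "love", "hiking"]

vocab_index = build_vocab_index(train_posts)
print(vocab_index)                               # {'books': 0, 'i': 1, 'love': 2, 'parties': 3}
print(to_binary_vector(test_post, vocab_index))  # [0, 1, 1, 0]
```

The word "hiking" never appears in the training posts, so it has no column and is dropped from the test vector. The same thing happens to test-set tokens that are missing from the training vocabulary in `vectorize_all`.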

Train and Evaluate

In [16]:
def fit_and_evaluate_one(classifier, X_train, y_train, X_test, y_test):
    # fitting
    classifier.fit(X_train, y_train)
    
    # accuracy
    results = {}
    y_hat_train = classifier.predict(X_train)
    accur_train = sum(y_hat_train == y_train) / len(y_train)  # train accuracy
    y_hat_test = classifier.predict(X_test)
    accur_test = sum(y_hat_test == y_test) / len(y_test)  # test accuracy
    results['accur_train'] = accur_train
    results['accur_test'] = accur_test

    return results

In [17]:
def fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test):
    results = {}
    for c in classifiers:
        cname = str(c).split('(')[0]
        print('Training '+cname+ '...')
        results[cname] = fit_and_evaluate_one(c, X_train, y_train, X_test, y_test)
    return results

## Load and Preprocess the Data

In [18]:
train, test = load_and_extract(file, split_ratio, seed=seed)

All Personality Types, i.e. all possible labels in our dataset

In [19]:
# Personality_dichotomies = {1:{'E':0, 'I':0},
#                            2:{'N':0, 'S':0},
#                            3:{'F':0, 'T':0},
#                            4:{'J':0, 'P':0}}

Split type into four dichotomies

In [20]:
train = split_into_dichotomies(train)
test = split_into_dichotomies(test)

Get rid of extra string quotes and tokenize into posts for each row in posts

In [21]:
train['post_list'] = train['posts'].apply(clean_posts)
test['post_list'] = test['posts'].apply(clean_posts)

## Vectorize Train and Evaluate
I put all of these steps into one so that each model would run one at a time and overwrite the previous feature vectors. This should prevent using too much memory.

### Holistic Model<br>
16 types

In [22]:
X_train, y_train, X_test, y_test = vectorize_all(train, test)

In [23]:
holistic_result_dict = fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test)
holistic_result = pd.DataFrame.from_dict(holistic_result_dict).transpose().sort_values('accur_test', ascending=False)
holistic_result

Training MultinomialNB...
Training LogisticRegression...
Training Perceptron...
Training DecisionTreeClassifier...


| classifier | accur_test | accur_train |
|---|---|---|
| LogisticRegression | 0.382014 | 1.0 |
| Perceptron | 0.363566 | 0.985625 |
| MultinomialNB | 0.264412 | 0.614999 |
| DecisionTreeClassifier | 0.167563 | 1.0 |


The best accuracy is about 38% for the holistic model with all 16 labels (using Logistic Regression and filtering out personality type terms). This sounds like pretty bad accuracy until you consider that picking one personality type at random as a guess would give us only 6.25% accuracy, so we are doing quite a bit better than that.
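
The comparison against random guessing can be worked through in a few lines. The sketch below also includes a hypothetical `majority_baseline` helper with toy labels: uniform guessing is the weakest possible baseline, and if the personality types are unevenly represented, always predicting the most common type would score higher, so that number is worth checking as well.

```python
from collections import Counter

n_classes = 16
chance_accuracy = 1 / n_classes  # uniform random guessing
best_test_accuracy = 0.382014    # Logistic Regression, from the table above

lift = best_test_accuracy / chance_accuracy
print(f"chance: {chance_accuracy:.4f}")  # chance: 0.0625
print(f"lift over chance: {lift:.1f}x")  # lift over chance: 6.1x

def majority_baseline(labels):
    # Accuracy of always predicting the most frequent label.
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels for illustration only, not the real class distribution.
toy_labels = ["INFP", "INFP", "INTJ", "ENTP"]
print(majority_baseline(toy_labels))  # 0.5
```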

### Dichotomized Model<br>
First Dichotomy: E or I

In [24]:
X_train, y_train, X_test, y_test = vectorize_all(train, test, dichotomy=1)
dicho1_result_dict = fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test)
dicho1_result = pd.DataFrame.from_dict(dicho1_result_dict).transpose().sort_values('accur_test', ascending=False)
dicho1_result

Training MultinomialNB...
Training LogisticRegression...
Training Perceptron...
Training DecisionTreeClassifier...


| classifier | accur_test | accur_train |
|---|---|---|
| LogisticRegression | 0.790161 | 1.0 |
| MultinomialNB | 0.790161 | 0.895172 |
| Perceptron | 0.770177 | 0.981557 |
| DecisionTreeClassifier | 0.68947 | 1.0 |


Second Dichotomy: N or S

In [25]:
X_train, y_train, X_test, y_test = vectorize_all(train, test, dichotomy=2)
dicho2_result_dict = fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test)
dicho2_result = pd.DataFrame.from_dict(dicho2_result_dict).transpose().sort_values('accur_test', ascending=False)
dicho2_result

Training MultinomialNB...
Training LogisticRegression...
Training Perceptron...
Training DecisionTreeClassifier...


| classifier | accur_test | accur_train |
|---|---|---|
| MultinomialNB | 0.840123 | 0.88907 |
| LogisticRegression | 0.833974 | 1.0 |
| Perceptron | 0.787855 | 0.968131 |
| DecisionTreeClassifier | 0.786318 | 1.0 |


Third Dichotomy: F or T

In [26]:
X_train, y_train, X_test, y_test = vectorize_all(train, test, dichotomy=3)
dicho3_result_dict = fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test)
dicho3_result = pd.DataFrame.from_dict(dicho3_result_dict).transpose().sort_values('accur_test', ascending=False)
dicho3_result

Training MultinomialNB...
Training LogisticRegression...
Training Perceptron...
Training DecisionTreeClassifier...


| classifier | accur_test | accur_train |
|---|---|---|
| MultinomialNB | 0.772483 | 0.970979 |
| LogisticRegression | 0.770177 | 1.0 |
| Perceptron | 0.721752 | 0.90914 |
| DecisionTreeClassifier | 0.592621 | 1.0 |


Fourth Dichotomy: J or P

In [27]:
X_train, y_train, X_test, y_test = vectorize_all(train, test, dichotomy=4)
dicho4_result_dict = fit_and_evaluate_all(classifiers, X_train, y_train, X_test, y_test)
dicho4_result = pd.DataFrame.from_dict(dicho4_result_dict).transpose().sort_values('accur_test', ascending=False)
dicho4_result

Training MultinomialNB...
Training LogisticRegression...
Training Perceptron...
Training DecisionTreeClassifier...


| classifier | accur_test | accur_train |
|---|---|---|
| LogisticRegression | 0.654112 | 1.0 |
| MultinomialNB | 0.652575 | 0.979251 |
| Perceptron | 0.650269 | 0.941009 |
| DecisionTreeClassifier | 0.556495 | 1.0 |


## Performance Assessment
The best model overall so far is Logistic Regression. It has the highest test accuracy for the 16-way model and for the E/I dichotomy (tied with Multinomial Naive Bayes) and the J/P dichotomy, and it is within one percentage point of Multinomial Naive Bayes on N/S and F/T. I would highly recommend against using the kernelized SVM for this project, since it was the worst-performing algorithm for 3 out of 5 models and it makes this notebook take about 25 minutes to run instead of the usual 2-3 minutes. For this reason, I omit the SVM from the results above. However, it will run if you uncomment the SVM in the "Set Parameters for this notebook" section near the beginning.

### Confusion Matrix for Logistic Regression Model
Here is the confusion matrix for the logistic regression model (personality words filtered out):

In [28]:
# confusion matrix published publicly
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTUfeQEuInf0tpeOPeYa3onyEzou_tqtCWu7UpKOaSHPyNqcjPBgVKPe8OO'+ \
      'PaqzdFy7CuHxzm_0c8YN/pub?gid=0&single=true&output=csv'
pd.read_csv(url, index_col='Unnamed: 0')

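| Unnamed: 0 | E | I | N | S | F | T | J | P |
|---|---|---|---|---|---|---|---|---|
| E | 79 | 21 | 0 | 0 | 0 | 0 | 0 | 0 |
| I | 21 | 79 | 0 | 0 | 0 | 0 | 0 | 0 |
| N | 0 | 0 | 83 | 17 | 0 | 0 | 0 | 0 |
| S | 0 | 0 | 17 | 83 | 0 | 0 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 77 | 23 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 23 | 77 | 0 | 0 |
| J | 0 | 0 | 0 | 0 | 0 | 0 | 65 | 35 |
| P | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 65 |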


### Confusion Matrix for Logistic Regression Model
Here is the confusion matrix for the logistic regression model (personality words not filtered out):

In [29]:
# confusion matrix published publicly
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRP1tGYOS4z5U6KX1szj3HY_Q2y-u2iX-4ljhq9sFSvTmSMjp0LatjDp9E0'+ \
      'bBcw_ptpYXST2YYuAcOu/pub?gid=0&single=true&output=csv'
pd.read_csv(url, index_col='Unnamed: 0')

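| Unnamed: 0 | E | I | N | S | F | T | J | P |
|---|---|---|---|---|---|---|---|---|
| E | 83 | 17 | 0 | 0 | 0 | 0 | 0 | 0 |
| I | 17 | 83 | 0 | 0 | 0 | 0 | 0 | 0 |
| N | 0 | 0 | 86 | 14 | 0 | 0 | 0 | 0 |
| S | 0 | 0 | 14 | 86 | 0 | 0 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 81 | 19 | 0 | 0 |
| T | 0 | 0 | 0 | 0 | 19 | 81 | 0 | 0 |
| J | 0 | 0 | 0 | 0 | 0 | 0 | 71 | 29 |
| P | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 71 |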


We do best on predicting the second dichotomy, N or S, for a given user. We do worst on predicting the final dichotomy, J or P.<br>
You may also notice that filtering out personality words and acronyms has a negative effect on accuracy. However, I hypothesize that it is best to keep filtering these words out, because doing so will likely yield a model that is better suited to more diverse discourse than just that on Personality Cafe.
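
The matrices above are read from a published spreadsheet, so the notebook does not show how they were computed. One way a per-dichotomy matrix of this kind could be derived from true and predicted labels is sketched below. The function name, the toy labels, and the choice of row-normalized percentages are all assumptions made for illustration:

```python
from collections import Counter

def confusion_percent(y_true, y_pred, labels):
    # Rows are true labels, columns are predicted labels,
    # and each value is a percentage of that row's total.
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (100 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

# Toy labels for illustration only, not the notebook's data.
y_true = ["J", "J", "J", "J", "P", "P", "P", "P"]
y_pred = ["J", "J", "J", "P", "P", "P", "J", "J"]

cm = confusion_percent(y_true, y_pred, ["J", "P"])
print(cm["J"])  # {'J': 75.0, 'P': 25.0}
print(cm["P"])  # {'J': 50.0, 'P': 50.0}
```

Computing the rows separately like this shows whether one side of a dichotomy is predicted more reliably than the other, which a single accuracy number hides.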

## Discussion and Next Steps

### Takeaways
I was actually surprised to see how well these models do, especially that the 16-way classifier achieves 38% accuracy even after filtering out personality-type-related words and acronyms. I was not surprised, however, to see that there is a huge lift in per-classifier accuracy from using the four separate two-way classifiers instead of the single 16-way one. It was interesting to note that leaving the personality type terms in the feature vector yielded better accuracy than leaving them out. Still, I believe that, unless we were building this model to predict personality type from posts on somewhere like Personality Cafe alone, it would be best to leave them out, since this should help prevent the model from overfitting to this specific type of discourse.
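
One rough way to put the two-way and 16-way numbers on the same scale is to estimate how often all four two-way predictions would be right at once. The sketch below multiplies the best test accuracy for each dichotomy from the result tables above. It assumes the four classifiers make errors independently of one another, which is a simplification and may not hold in practice:

```python
# Best test accuracy per dichotomy, taken from the result tables above.
best_per_dichotomy = {
    "E/I": 0.790161,
    "N/S": 0.840123,
    "F/T": 0.772483,
    "J/P": 0.654112,
}

# Under the independence assumption, the chance that all four letters
# are correct is the product of the individual accuracies.
combined = 1.0
for acc in best_per_dichotomy.values():
    combined *= acc

print(f"estimated 4-letter accuracy: {combined:.3f}")  # estimated 4-letter accuracy: 0.335
```

Under that assumption the estimate lands in the same range as the 16-way classifier's 38%. Measuring the combined accuracy directly on the test set would give the real figure.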

### Next Steps
I have to thank Leo for suggesting a great way to mine for new data outside of Personality Cafe. Leo suggested that, since there are numerous personality quizzes on social media, it would be possible to find users who have taken an MBTI quiz through social media and then extract their previous n posts for more data. This would be my next step, since it could provide data from a much more well-rounded field of discourse. This would hopefully lead to a model that could better predict personality type based on other types of discourse.<br>
Another next step would be to pass the result through a softmax layer to produce a probability. I think this would be a great way to make the model more interesting and useful to users. I imagine that you could have a "Let ML predict your personality type, did it get it right?" kind of application where users would submit their previous 50 posts or some other text and the model would return a prediction for each dichotomy. You could then ask the user to grade the model's performance, giving us new data in the process!
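
For reference, the softmax idea mentioned above can be sketched in a few lines of plain Python. The raw scores here are hypothetical values for the two classes of a single dichotomy, and the function is a generic softmax rather than anything taken from the notebook:

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the two classes of one dichotomy (E, I).
probs = softmax([1.2, 0.3])
print([round(p, 3) for p in probs])  # [0.711, 0.289]
```

With scikit-learn, `LogisticRegression` and `MultinomialNB` also expose a `predict_proba` method, which may be a simpler route to per-class probabilities for those two classifiers.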

In [30]:
def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def save_y_array(filename, array):
    array = [str(a) for a in array]
    with open(filename, 'w') as writer:
        writer.write('\n'.join(array))

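# X_train, y_train, X_test and y_test hold the output of the most recent
# vectorize_all call, i.e. the fourth dichotomy (J or P)
save_sparse_csr('./data/X_train', X_train)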
save_sparse_csr('./data/X_test', X_test)
# save_sparse_csr('X_dev', X_dev)

save_y_array('./data/y_train.txt', y_train)
save_y_array('./data/y_test.txt', y_test)
# save_y_array('y_dev.txt', y_dev)