## Myers–Briggs personality type predictor

Using the dataset from: [https://www.kaggle.com/datasnaek/mbti-type](https://www.kaggle.com/datasnaek/mbti-type)

In [1]:
import gc

import numpy as np
import pandas as pd

Loading the dataset

In [2]:
df = pd.read_csv('../data/mbti_1.csv')

Inspecting

In [3]:
df.head()

,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [4]:
df.isnull().any()

type     False
posts    False
dtype: bool

Each data point is unprocessed text in which the individual social media posts are separated by the `|||` delimiter.

Next, we split each `xi` at the `|||` delimiter and assign the class `yi` of the `i`-th data point to every resulting post.

In [5]:
X, y = [], []

for xi, yi in zip(df['posts'].values, df['type'].values):
    # split the row into individual posts at the separator
    posts = xi.split('|||')
    X.extend(posts)

    # repeat the row's class once for each of its posts
    y.extend([yi] * len(posts))

del df
gc.collect()

0
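To make the label propagation concrete, here is a minimal sketch on two made-up rows. The helper name `split_posts` and the sample texts are assumptions for illustration only; the logic mirrors the loop above.

```python
def split_posts(rows, sep='|||'):
    """Split each (text, label) row into posts and repeat the label per post."""
    X, y = [], []
    for text, label in rows:
        posts = text.split(sep)
        X.extend(posts)
        y.extend([label] * len(posts))
    return X, y

# two made-up rows, not taken from the dataset
toy_rows = [('first post|||second post', 'INFJ'),
            ('only post', 'ENTP')]

X_toy, y_toy = split_posts(toy_rows)
print(X_toy)  # ['first post', 'second post', 'only post']
print(y_toy)  # ['INFJ', 'INFJ', 'ENTP']
```

After this step `X` and `y` always have the same length, so each post can be looked up together with its label by index.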

In [6]:
def print_class_dist(y):
    classes = {}

    for c in y:
        # the first occurrence of a class counts as 1, not 0
        classes[c] = classes.get(c, 0) + 1

    print('The dataset has')

    for c in classes:
        print(classes[c], '\t',
              c + 's', '\t',
              round(100 * classes[c] / len(y), 2), '%')
    return classes

classes = print_class_dist(y)

The dataset has
72105 	 INFJs 	 17.05 %
33761 	 ENTPs 	 7.98 %
63359 	 INTPs 	 14.98 %
52471 	 INTJs 	 12.41 %
11273 	 ENTJs 	 2.67 %
9288 	 ENFJs 	 2.2 %
89796 	 INFPs 	 21.24 %
32769 	 ENFPs 	 7.75 %
13000 	 ISFPs 	 3.07 %
16498 	 ISTPs 	 3.9 %
8121 	 ISFJs 	 1.92 %
9913 	 ISTJs 	 2.34 %
4337 	 ESTPs 	 1.03 %
2215 	 ESFPs 	 0.52 %
1921 	 ESTJs 	 0.45 %
2018 	 ESFJs 	 0.48 %


The classes are heavily imbalanced, so we downsample every class to the size of the smallest one (ESTJ).

In [7]:
# keep at most "class_limit" posts from each class

X_new = []
y_new = []
class_limit = min(classes.values())  # 1921, the size of the smallest class (ESTJ)
class_counter = {k: 0 for k in classes}

for xi, yi in zip(X, y):
    # skip the post once its class has reached the limit
    if class_counter[yi] >= class_limit:
        continue

    class_counter[yi] += 1

    X_new.append(xi)
    y_new.append(yi)


del X, y
X, y = X_new, y_new
del X_new, y_new
gc.collect()

0

In [8]:
print_class_dist(y);

The dataset has
1921 	 INFJs 	 6.25 %
1921 	 ENTPs 	 6.25 %
1921 	 INTPs 	 6.25 %
1921 	 INTJs 	 6.25 %
1921 	 ENTJs 	 6.25 %
1921 	 ENFJs 	 6.25 %
1921 	 INFPs 	 6.25 %
1921 	 ENFPs 	 6.25 %
1921 	 ISFPs 	 6.25 %
1921 	 ISTPs 	 6.25 %
1921 	 ISFJs 	 6.25 %
1921 	 ISTJs 	 6.25 %
1921 	 ESTPs 	 6.25 %
1921 	 ESFPs 	 6.25 %
1921 	 ESTJs 	 6.25 %
1921 	 ESFJs 	 6.25 %

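The loop above keeps the first posts it encounters for each class. A possible alternative, sketched here under the assumption that a random subset is preferable, samples the same number of posts per class at random. The helper `downsample` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def downsample(X, y, limit, seed=0):
    """Keep at most `limit` randomly chosen items per class (hypothetical helper)."""
    # collect the positions of every class
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    rng = random.Random(seed)
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, min(limit, len(indices))))
    keep.sort()  # preserve the original order of the kept items

    return [X[i] for i in keep], [y[i] for i in keep]

# made-up, imbalanced toy labels
y_toy = ['INFP'] * 5 + ['ESTJ'] * 2 + ['INTP'] * 3
X_toy = ['post %d' % i for i in range(len(y_toy))]

limit = min(Counter(y_toy).values())  # size of the smallest class
X_bal, y_bal = downsample(X_toy, y_toy, limit)
print(Counter(y_bal))  # every class ends up with 2 items
```

Fixing the seed keeps the subset reproducible between runs.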

One-hot encode the target (`y`) values

In [9]:
from sklearn.preprocessing import OneHotEncoder
class_encoder = OneHotEncoder()

y_encoded = class_encoder.fit_transform(np.array(y).reshape(-1, 1)).toarray()

del y
gc.collect()

0
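As a dependency-free illustration of what one-hot encoding and its inverse do, consider the sketch below. It is not how scikit-learn implements `OneHotEncoder`; the helper names are made up, and the sorted category order is an assumption chosen to resemble the encoder's default behaviour.

```python
def one_hot_fit(labels):
    """Return the sorted distinct labels; a label's position is its column index."""
    return sorted(set(labels))

def one_hot_encode(labels, categories):
    """Turn each label into a row with a single 1.0 in its category's column."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[label] == j else 0.0 for j in range(len(categories))]
            for label in labels]

def one_hot_decode(rows, categories):
    """Map each row back to the category of its largest entry."""
    return [categories[row.index(max(row))] for row in rows]

labels = ['INFP', 'ESTJ', 'INFP', 'INTP']   # made-up sample
categories = one_hot_fit(labels)            # ['ESTJ', 'INFP', 'INTP']
encoded = one_hot_encode(labels, categories)

print(encoded[0])                           # [0.0, 1.0, 0.0]
print(one_hot_decode(encoded, categories))  # ['INFP', 'ESTJ', 'INFP', 'INTP']
```

With the 16 personality types, each encoded row has 16 columns and exactly one non-zero entry.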

Split the data into train and test sets

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), y_encoded, test_size=.1, random_state=0)

print('Size of train set:', X_train.shape)
print('Size of test set:', X_test.shape)

del X, y_encoded
gc.collect()

Size of train set: (27662,)
Size of test set: (3074,)


10
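The sizes printed above can be reproduced with a little arithmetic: 27662 + 3074 = 30736 posts in total, of which 10% are held out. The sketch below is a simplified stand-in for a shuffled hold-out split, not scikit-learn's implementation. `holdout_split` is a hypothetical helper, and rounding the test size up is an assumption that matches the printed sizes.

```python
import math
import random

def holdout_split(items, test_fraction, seed=0):
    """Shuffle a copy of `items` and split off a test portion (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # round the test size up, so 10% of 30736 becomes 3074
    n_test = math.ceil(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(30736), 0.1)
print(len(train), len(test))  # 27662 3074
```

Because the split is a plain shuffle, the class proportions in the two sets are only approximately equal.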

In [11]:
print(X_train[-1])
print(20 * '-')
print(class_encoder.inverse_transform([y_train[-1]])[0][0])

You're a perfectionist yet a procrastinator :tongue:
--------------------
INFP


Preprocessing of the text input