## Myers–Briggs personality type predictor

Using the dataset from: [https://www.kaggle.com/datasnaek/mbti-type](https://www.kaggle.com/datasnaek/mbti-type)

### Importing the necessary modules

In [1]:
import sys
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

### Creating utilities

In [2]:
def calc_distance(sentence1, sentence2):
    """
    Euclidean distance between two preprocessed sentence vectors,
    after scaling each vector to unit length.
    """
    s1_normalized = sentence1 / np.linalg.norm(sentence1)
    s2_normalized = sentence2 / np.linalg.norm(sentence2)
    return np.linalg.norm(s1_normalized - s2_normalized)


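class StemmedTfidfVectorizer(TfidfVectorizer):
    """TfidfVectorizer that stems every token with NLTK's Snowball stemmer."""

    def build_analyzer(self, stemmer=None):
        # Reuse the parent's analyzer (lowercasing, tokenizing, stop words).
        analyzer = super().build_analyzer()

        if stemmer is None:
            stemmer = SnowballStemmer('english')

        # Stem each token the parent analyzer yields.
        return lambda text: (stemmer.stem(w) for w in analyzer(text))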
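
As a quick sanity check of `calc_distance`: because both vectors are scaled to unit length first, the result depends only on the angle between them and equals `sqrt(2 - 2 * cos)`, where `cos` is their cosine similarity. The sketch below uses made-up toy vectors, not rows from the dataset.

```python
import numpy as np


def calc_distance(sentence1, sentence2):
    """Euclidean distance between two vectors after L2 normalization."""
    s1_normalized = sentence1 / np.linalg.norm(sentence1)
    s2_normalized = sentence2 / np.linalg.norm(sentence2)
    return np.linalg.norm(s1_normalized - s2_normalized)


# Toy vectors (illustrative only).
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

same_direction = calc_distance(a, b)   # close to 0
orthogonal = calc_distance(a, c)       # close to sqrt(2)

# Relation to cosine similarity: distance**2 == 2 - 2 * cosine
cosine = a @ c / (np.linalg.norm(a) * np.linalg.norm(c))
```

An all-zero vector (for example, a text with no known vocabulary) would lead to a division by zero here, so callers may want to guard against that case.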

### Loading the dataset

In [3]:
N = 1500  # number of rows to keep from the dataset, to limit RAM usage
df = shuffle(pd.read_csv('../data/mbti-myers-briggs-personality-types.csv'))[:N]

### Preprocessing

In [4]:
df.head()

      type                                              posts
7057  ISTJ  'At my work, passive-aggressive behavior is wh...
3261  INFJ  '@Macrosapien I agree with the victim part. W...
760   ENTJ  '6w7 http://youtu.be/jSWIUEV5sPQ|||sx/sp 5w4 ...
7346  ISFP  'I'm still here when you mention me by name! L...
8522  ISFP  I've spent years trying to learn how to be pro...


In [5]:
# One-hot encode the type labels: one column per distinct MBTI type.
type_encoder = OneHotEncoder()
y = type_encoder.fit_transform(df['type'].values.reshape(-1, 1)).toarray()
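
To make the shape of `y` concrete, here is a small NumPy-only sketch of what one-hot encoding does to a column of type labels. The four labels are made up for illustration; `OneHotEncoder` does the equivalent work and, with its default settings, orders the categories by sorted value.

```python
import numpy as np

labels = np.array(["INFJ", "ENTP", "INFJ", "ISTJ"])  # made-up example labels

# Sorted unique categories and, for each label, the index of its category.
categories, codes = np.unique(labels, return_inverse=True)

# One row per sample, one column per category, a single 1 per row.
one_hot = np.zeros((len(labels), len(categories)))
one_hot[np.arange(len(labels)), codes] = 1.0

# Decoding picks the column with the largest value in each row, so it also
# works on probability vectors whose entries are not exactly 0 or 1.
decoded = categories[one_hot.argmax(axis=1)]
```

Since only the first `N` shuffled rows are used, `y` has one column per type that actually occurs in that sample, which is not guaranteed to be all 16.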

In [6]:
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['posts'].values).toarray()
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
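
The matrix above is mostly zeros because each row only carries non-zero weights for the words that occur in that user's posts. The following sketch recomputes TF-IDF by hand for three made-up documents, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It skips stemming and stop-word removal, so it illustrates the weighting only, not the full `StemmedTfidfVectorizer` pipeline.

```python
import math
from collections import Counter

docs = ["cats like naps", "dogs like walks", "cats chase dogs"]  # toy corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]
```

Words that appear in fewer documents (here `naps`) get a larger IDF than words shared across documents (here `cats`), which is what lets rarer vocabulary stand out in each row.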

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

### Building the neural network

In [8]:
from keras.models import Sequential
from keras.layers import Dense


# Two small hidden layers, then a softmax over the 16 MBTI types.
model = Sequential()
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='softmax'))
# The loss is the mean squared error between softmax outputs and one-hot targets.
model.compile(loss='mean_squared_error', optimizer='adagrad')

history = model.fit(x=X_train, y=y_train, verbose=1, epochs=22, shuffle=True)

train_score = model.evaluate(X_train, y_train, verbose=0)
print('Train score', train_score)
test_score = model.evaluate(X_test, y_test, verbose=0)
print('Test score', test_score)

Using TensorFlow backend.


Epoch 1/22
Epoch 2/22
Epoch 3/22
Epoch 4/22
Epoch 5/22
Epoch 6/22
Epoch 7/22
Epoch 8/22
Epoch 9/22
Epoch 10/22
Epoch 11/22
Epoch 12/22
Epoch 13/22
Epoch 14/22
Epoch 15/22
Epoch 16/22
Epoch 17/22
Epoch 18/22
Epoch 19/22
Epoch 20/22
Epoch 21/22
Epoch 22/22
Train score 0.010545929707586766
Test score 0.04558929497996966
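
The two scores above are mean squared errors between the softmax outputs and the one-hot targets, not accuracies, so they say little about how often the predicted type is right. A rough way to get an accuracy figure is to compare the arg-max of each prediction with the arg-max of its target. The sketch below uses small made-up arrays in place of `model.predict(X_test)` and `y_test`; the helper name `onehot_accuracy` is hypothetical.

```python
import numpy as np


def onehot_accuracy(y_true, y_prob):
    """Fraction of rows where the most probable class matches the target."""
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))


# Made-up stand-ins for y_test and model.predict(X_test), three classes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],   # correct
    [0.4, 0.5, 0.1],   # correct
    [0.6, 0.3, 0.1],   # wrong: predicts class 0, target is class 2
    [0.1, 0.8, 0.1],   # correct
])

accuracy = onehot_accuracy(y_true, y_prob)
mse = float(np.mean((y_true - y_prob) ** 2))
```

With the notebook's variables, the call would look like `onehot_accuracy(y_test, model.predict(X_test))`.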


In [41]:
def predict(X):
    """
    Predict the personality type for an unprocessed string.
    """
    # Vectorize the raw text with the fitted TF-IDF vectorizer.
    transformed_input = vectorizer.transform([X]).toarray()

    # Softmax probabilities over the encoded types.
    prediction = model.predict(transformed_input)

    # inverse_transform picks the highest-scoring type for each row.
    return type_encoder.inverse_transform(prediction)[0][0]


# print(' '.join(list(vectorizer.inverse_transform(X[:1])[0])))
# print(type_encoder.inverse_transform(y[:1])[0])

sample = df['posts'].iloc[-1].split('|||')[-1]
print('Target is', df['type'].iloc[-1])
print('Predicted is', predict(sample))
print('Sample:\n', sample)

Target is ESFP
Predicted is ENFP
Sample:
 i can totally relate to this post. my personality, i've been told, is sweet and vivacious but...i'm secretly shy as hell...particularly when it comes to dating. in any case, it depends on who i'm...'
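
The `posts` field packs a user's posts into a single string separated by `|||`, which is why the cell above takes the last element of `split('|||')` as its sample. A minimal illustration with a made-up string:

```python
# Made-up stand-in for one value of df['posts'].
raw_posts = "first post about work|||a link http://example.com|||last post here"

posts = raw_posts.split("|||")   # one list entry per individual post
last_post = posts[-1]            # what the notebook uses as its sample
```

Because the sample is drawn from `df`, it may also belong to the training split, so the prediction above is not necessarily an out-of-sample test.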


In [45]:
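# Saving the fitted vectorizer, encoder, and model (uncomment to run).
# joblib.dump(vectorizer, '../trained/vectorizer.pkl')
# joblib.dump(type_encoder, '../trained/type_encoder.pkl')
# model.save() writes Keras's own model format, not a pickle, despite the name.
# model.save('../trained/model.pkl')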