# MBTI Prediction Using Text

This MBTI personality prediction model is built on text posts collected from social media platforms. The dataset (MBTI 500) comes already preprocessed and cleaned.

As a safeguard, the text is lowercased and tokenized again, and stopwords and punctuation are removed.

The cleaned text data is then transformed into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer.
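To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a hypothetical three-document corpus. It assumes the smoothed IDF formula and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default; the corpus and the `tfidf_vectors` helper are illustrative only and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights.

    Assumes smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical toy corpus, not taken from the MBTI dataset
docs = ["love abstract idea", "love long nap", "abstract idea debate idea"]
vocab, rows = tfidf_vectors(docs)
for term, weight in zip(vocab, rows[2]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in fewer documents get a larger IDF, and each row is scaled to unit length so long and short posts are comparable.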

A logistic regression classifier is trained on these features, with each post labeled with one of the 16 personality types defined by the Myers-Briggs Type Indicator (MBTI).

The model analyzes input text and predicts the corresponding personality type by recognizing patterns in language that correlate with MBTI traits.

This approach provides an automated baseline for classifying personality types from written content.
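The whole flow described above can be sketched in a few lines. This is a minimal illustration on hypothetical toy data, assuming scikit-learn's `make_pipeline` to chain the vectorizer and the classifier; the notebook itself keeps the two steps separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the (posts, type) columns
posts = [
    "enjoy debate abstract theory idea",
    "theory idea argue concept debate",
    "feel dream quiet poem emotion",
    "emotion poem feel daydream quiet",
]
types = ["ENTP", "ENTP", "INFP", "INFP"]

# TF-IDF features feed directly into the logistic regression classifier
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(posts, types)

prediction = pipeline.predict(["quiet poem about emotion"])
print(prediction[0])
```

Bundling both steps in one object means the same vocabulary and IDF weights are reused automatically at prediction time.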

# Import the Necessary Libraries and Load the Dataset

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
data = pd.read_csv('/content/drive/MyDrive/Data Science/CSV files/MBTI 500.csv')

# Explore and Preprocess

In [3]:
data.head()

|   | posts | type |
|---|---|---|
| 0 | know intj tool use interaction people excuse a... | INTJ |
| 1 | rap music ehh opp yeah know valid well know fa... | INTJ |
| 2 | preferably p hd low except wew lad video p min... | INTJ |
| 3 | drink like wish could drink red wine give head... | INTJ |
| 4 | space program ah bad deal meing freelance max ... | INTJ |


In [4]:
data.tail()

|   | posts | type |
|---|---|---|
| 106062 | stay frustrate world life want take long nap w... | INFP |
| 106063 | fizzle around time mention sure mistake thing ... | INFP |
| 106064 | schedule modify hey w intp strong wing underst... | INFP |
| 106065 | enfj since january busy schedule able spend li... | INFP |
| 106066 | feel like men good problem tell parent want te... | INFP |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106067 entries, 0 to 106066
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   posts   106067 non-null  object
 1   type    106067 non-null  object
dtypes: object(2)
memory usage: 1.6+ MB


In [6]:
data.describe()

|   | posts | type |
|---|---|---|
| count | 106067 | 106067 |
| unique | 106067 | 16 |
| top | know intj tool use interaction people excuse a... | INTP |
| freq | 1 | 24961 |


In [7]:
data = data.dropna()
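The `describe()` output shows that INTP alone accounts for 24,961 of the 106,067 posts (about 23.5%), so the classes are imbalanced. A quick way to see the full distribution is `value_counts`. The sketch below uses a hypothetical miniature frame named `toy`; on the real data the same call would be applied to `data['type']`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real `data` frame
toy = pd.DataFrame({
    "posts": ["a", "b", "c", "d", "e", "f"],
    "type": ["INTP", "INTP", "INTP", "INTJ", "INTJ", "ESFJ"],
})

# Relative frequency of each personality type, most common first
class_share = toy["type"].value_counts(normalize=True)
print(class_share)
```

Knowing the class shares up front helps interpret per-class recall later, since rare types contribute few training examples.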

## Text Cleaning

Use nltk to lowercase the text and remove stop words and punctuation.
As our dataset is already cleaned, this won't really make a difference, but I still did the cleaning just in case.

In [8]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove stopwords and punctuation
    cleaned = [word for word in tokens if word not in stop_words and word not in string.punctuation]
    return " ".join(cleaned)

data['cleaned_text'] = data['posts'].apply(clean_text)
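One way to check the claim that cleaning changes little is to count how many posts differ before and after. The sketch below is self-contained and uses a hypothetical `simple_clean` stand-in with a tiny stopword set instead of nltk; on the real frame the equivalent check would be `(data['cleaned_text'] != data['posts']).mean()`.

```python
import string

# Small hypothetical stopword list; the notebook uses nltk's English list instead
STOP_WORDS = {"i", "a", "the", "and", "to", "is"}

def simple_clean(text):
    """Stand-in for clean_text: lowercase, split on whitespace,
    strip surrounding punctuation, and drop stopwords."""
    tokens = text.lower().split()
    stripped = [t.strip(string.punctuation) for t in tokens]
    return " ".join(t for t in stripped if t and t not in STOP_WORDS)

posts = [
    "know intj tool use interaction people",       # already clean
    "I enjoy exploring new ideas, and debating!",  # raw text
]
cleaned = [simple_clean(p) for p in posts]
changed = [before != after for before, after in zip(posts, cleaned)]
print(f"{sum(changed)} of {len(posts)} posts changed")
```

If almost no rows change on the real data, the extra cleaning pass is harmless but also unnecessary.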

## Split the Data for Training and Testing

In [10]:
x = data['posts']  # trains on the original 'posts' column; the 'cleaned_text' column created above is not used
y = data['type']

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
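Because the personality types are unevenly represented, a stratified split is worth considering so that the train and test sets keep the same class proportions. The sketch below is an optional variation on the split above, shown on hypothetical labels; it assumes the `stratify` parameter of `train_test_split`, which the notebook does not use.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 common-class and 20 rare-class examples
texts = [f"post {i}" for i in range(100)]
labels = ["INTP"] * 80 + ["ESFJ"] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The 80/20 class ratio is preserved in the 20-item test set
print(Counter(y_te))
```

Without stratification, a rare class can end up with very few test examples purely by chance, which makes its per-class metrics noisy.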

# Feature Extraction

In [12]:
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(x_train)
X_test_tfidf = tfidf.transform(x_test)

# Training a Model

In [13]:
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
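The warning above means the solver stopped at its iteration limit before converging. One common remedy is to raise `max_iter` (the scikit-learn default is 100). The sketch below shows the idea on hypothetical toy data; the value 1000 is an assumption, and the right setting for the full dataset would need to be found by experiment.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data
docs = [
    "debate theory idea", "argue concept theory", "idea concept debate",
    "quiet poem emotion", "dream feel poem", "emotion feel quiet",
]
labels = ["ENTP", "ENTP", "ENTP", "INFP", "INFP", "INFP"]
X = TfidfVectorizer().fit_transform(docs)

clf = LogisticRegression(max_iter=1000)  # assumed limit, higher than the default 100
with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", ConvergenceWarning)
    clf.fit(X, labels)

print(clf.n_iter_)  # iterations actually used by the solver
```

Comparing `n_iter_` with `max_iter` after fitting is a quick way to confirm that the solver finished on its own.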


# Evaluate the model

In [14]:
y_pred = model.predict(X_test_tfidf)

In [15]:
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

        ENFJ       0.81      0.50      0.62       319
        ENFP       0.85      0.74      0.79      1249
        ENTJ       0.88      0.73      0.80       577
        ENTP       0.82      0.80      0.81      2324
        ESFJ       0.78      0.21      0.33        33
        ESFP       0.87      0.35      0.50        75
        ESTJ       0.97      0.68      0.80       105
        ESTP       0.96      0.82      0.89       398
        INFJ       0.80      0.83      0.82      2954
        INFP       0.78      0.82      0.80      2391
        INTJ       0.81      0.87      0.84      4531
        INTP       0.82      0.88      0.85      5033
        ISFJ       0.86      0.48      0.62       132
        ISFP       0.77      0.50      0.61       161
        ISTJ       0.85      0.55      0.67       253
        ISTP       0.89      0.73      0.80       679

    accuracy                           0.82     21214
   macro avg       0.84   
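The per-class rows of the report can be cross-checked with a little arithmetic. The sketch below copies precision, recall, and support from the rows above and recomputes F1, the total support, and two recall averages. Because the inputs are already rounded to two decimals, the results are approximations only.

```python
# (precision, recall, support) copied from the classification report above
report = {
    "ENFJ": (0.81, 0.50, 319),  "ENFP": (0.85, 0.74, 1249),
    "ENTJ": (0.88, 0.73, 577),  "ENTP": (0.82, 0.80, 2324),
    "ESFJ": (0.78, 0.21, 33),   "ESFP": (0.87, 0.35, 75),
    "ESTJ": (0.97, 0.68, 105),  "ESTP": (0.96, 0.82, 398),
    "INFJ": (0.80, 0.83, 2954), "INFP": (0.78, 0.82, 2391),
    "INTJ": (0.81, 0.87, 4531), "INTP": (0.82, 0.88, 5033),
    "ISFJ": (0.86, 0.48, 132),  "ISFP": (0.77, 0.50, 161),
    "ISTJ": (0.85, 0.55, 253),  "ISTP": (0.89, 0.73, 679),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total = sum(s for _, _, s in report.values())
# Unweighted mean: every class counts equally, however rare
macro_recall = sum(r for _, r, _ in report.values()) / len(report)
# Support-weighted mean: equals overall accuracy up to rounding
weighted_recall = sum(r * s for _, r, s in report.values()) / total

print(f"ESFJ F1: {f1(*report['ESFJ'][:2]):.2f}")
print(f"total support: {total}")
print(f"macro recall: {macro_recall:.3f}")
print(f"support-weighted recall: {weighted_recall:.3f}")
```

The gap between the unweighted and the support-weighted recall shows how much the overall accuracy is driven by the largest classes.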

# Testing our model with some sample data

Let's take a sample from the dataset itself to see how our model works.

In [16]:
text = 'stay frustrate world life want take long nap'  # start of a post labeled INFP in the dataset

In [17]:
sample = tfidf.transform([text])

In [18]:
model.predict(sample)  # reuse the vector computed in the previous cell

array(['INFP'], dtype=object)

Our model gave an accurate prediction, although this text comes from the dataset and may have been part of the training split. So now let's give it an unseen sample and see how it handles it.

In [19]:
sample_text = "I enjoy exploring new ideas and discussing abstract concepts."  # It is an ENTP

# Use the same cleaning function used earlier
sample_text_cleaned = clean_text(sample_text)

In [20]:
sample_text_tfidf = tfidf.transform([sample_text_cleaned])  # Transform as a list

In [21]:
predicted_personality = model.predict(sample_text_tfidf)

print(f"Predicted Personality Type: {predicted_personality[0]}")

Predicted Personality Type: ENTP


In [22]:
prediction_probs = model.predict_proba(sample_text_tfidf)

# Display the probabilities for each class
for label, prob in zip(model.classes_, prediction_probs[0]):
    print(f"{label}: {prob:.4f}")

ENFJ: 0.0155
ENFP: 0.0222
ENTJ: 0.0283
ENTP: 0.3771
ESFJ: 0.0009
ESFP: 0.0019
ESTJ: 0.0075
ESTP: 0.0516
INFJ: 0.0200
INFP: 0.0130
INTJ: 0.1175
INTP: 0.2737
ISFJ: 0.0041
ISFP: 0.0112
ISTJ: 0.0190
ISTP: 0.0365


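The model assigns its highest probability to ENTP (0.3771), followed by INTP (0.2737) and INTJ (0.1175), so the prediction matches the intended type, though with moderate confidence.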
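To judge how confident the prediction is, the probabilities printed above can be ranked and the gap between the top two candidates inspected. The sketch below copies the values from the output and uses plain Python; with the live model, the same ranking could be built from `model.classes_` and `prediction_probs[0]`.

```python
# Probabilities copied from the predict_proba output above
probs = {
    "ENFJ": 0.0155, "ENFP": 0.0222, "ENTJ": 0.0283, "ENTP": 0.3771,
    "ESFJ": 0.0009, "ESFP": 0.0019, "ESTJ": 0.0075, "ESTP": 0.0516,
    "INFJ": 0.0200, "INFP": 0.0130, "INTJ": 0.1175, "INTP": 0.2737,
    "ISFJ": 0.0041, "ISFP": 0.0112, "ISTJ": 0.0190, "ISTP": 0.0365,
}

# Three most likely types, highest probability first
top3 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]
# Gap between the best and second-best candidate
margin = top3[0][1] - top3[1][1]

for label, p in top3:
    print(f"{label}: {p:.4f}")
print(f"margin over runner-up: {margin:.4f}")
```

A small margin suggests the text is ambiguous between neighboring types, which is useful context when reporting a single predicted label.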

# Conclusion

The MBTI personality prediction model achieved an accuracy of approximately 81.8% on the held-out test set of 21,214 posts. The model processes cleaned social media posts, applies natural language processing techniques such as tokenization and stopword removal, and uses a TF-IDF vectorizer to convert text into numerical features. A logistic regression classifier was trained on this data. Performance is uneven across classes: frequent types such as INTP and INTJ reach recall of 0.88 and 0.87, while rare types such as ESFJ (33 test posts) and ESFP (75 test posts) reach only 0.21 and 0.35. There is still room for optimization, such as letting the solver converge (the training run stopped at the iteration limit), tuning hyperparameters, or exploring alternative machine learning algorithms. Overall, this model is a promising baseline for classifying personality types from textual content.
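As one possible starting point for the tuning mentioned above, the sketch below runs a small grid search over the regularization strength `C` and the `class_weight` option of logistic regression, scored by macro-averaged F1 so that rare types count as much as common ones. The toy data, the grid values, and the choice of scoring metric are all assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the (posts, type) columns
posts = [
    "debate theory idea", "argue concept theory",
    "idea concept debate", "theory argue idea",
    "quiet poem emotion", "dream feel poem",
    "emotion feel quiet", "poem dream emotion",
]
types = ["ENTP"] * 4 + ["INFP"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space; real ranges would depend on the full dataset
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(posts, types)
print(search.best_params_)
print(f"best macro F1: {search.best_score_:.3f}")
```

Putting the vectorizer inside the pipeline ensures it is refitted on each training fold, so the cross-validation scores are not inflated by vocabulary learned from the validation fold.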

