We use the [MBTI Kaggle Dataset](https://www.kaggle.com/datasnaek/mbti-type).

## Importing Required Libraries
In this section, we import the key Python libraries used for data loading, preprocessing, modeling, and visualization.


In [19]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Loading and Exploration

We begin by loading the MBTI dataset and exploring its structure, including sample entries, class distribution, and basic statistics.

In [20]:
data_set = pd.read_csv('../mbti_1.csv') 
data_set.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...
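
The class distribution and basic statistics mentioned above could be inspected along the following lines. This is only a sketch: `sample` is a hypothetical stand-in with the same column names as `data_set`, and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset: same column names, made-up rows.
sample = pd.DataFrame({
    "type": ["INFJ", "ENTP", "INTP", "INFJ", "INTJ", "INFJ"],
    "posts": ["a|||b", "c", "d|||e|||f", "g", "h", "i|||j"],
})

# Number of rows per personality type, most frequent first.
class_counts = sample["type"].value_counts()

# The same distribution expressed as a fraction of all rows.
class_share = sample["type"].value_counts(normalize=True)

# Rough length statistics of the raw text, measured in characters.
length_stats = sample["posts"].str.len().describe()

print(class_counts)
print(class_share.round(2))
print(length_stats)
```

On the real data, replacing `sample` with `data_set` should give the per-type counts, which is a quick way to see how imbalanced the 16 classes are before modeling.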


## Data Preprocessing

Here, we apply light text preprocessing to clean the posts and prepare them for modeling.  
The goal is not heavy preprocessing but getting the text into a usable form: removing URLs and punctuation, lowercasing, and dropping very short or very long words and MBTI type names.

In [21]:
def preprocess_text(df, remove_special=True):
    # Work on a copy so the DataFrame passed in is left unchanged
    df = df.copy()

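    # Remove links: "|" separators become spaces first; each match then runs
    # from the scheme through the first whitespace or "+" character
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'https?:\/\/.*?[\s+]', '', x.replace("|"," ") + " "))

    # Replace sentence-ending punctuation with placeholder tokens so it survives the stripping below
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\.', ' EOSTokenDot ', x + " "))
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\?', ' EOSTokenQuest ', x + " "))
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'!', ' EOSTokenExs ', x + " "))

    # Map "." and "+" to "."; full stops are already tokenized above, so only "+" is affected
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'[\.+]', ".",x))

    # Strip punctuation (anything that is not a word character or whitespace)
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'[^\w\s]','',x))

    # Remove everything except ASCII letters and whitespace (digits, underscores, accented characters)
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'[^a-zA-Z\s]','',x))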

    #Convert posts to lowercase
    df["posts"] = df["posts"].apply(lambda x: x.lower())

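    # Remove runs of 3+ identical letters. The text is now only letters and whitespace,
    # so [\s|\w]* also consumes everything after the run, up to the end of the post
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'([a-z])\1{2,}[\s|\w]*','',x))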

    # Remove very short words (1 to 3 characters) and very long words (30 to 1000 characters)
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\b\w{1,3}\b','',x))
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\b\w{30,1000}\b','',x))

    # Remove MBTI type names so the label does not leak into the features;
    # crucial for a valid estimate of model accuracy on unseen data
    if remove_special:
        pers_types = ['INFP','INFJ','INTP','INTJ','ENTP','ENFP','ISTP','ISFP',
                      'ENTJ','ISTJ','ENFJ','ISFJ','ESTP','ESFP','ESFJ','ESTJ']
        # Posts are already lowercase, so match the lowercased type names as whole words
        p = re.compile(r'\b(' + "|".join([t.lower() for t in pers_types]) + r')\b')
        # Remove the matches from the posts
        df["posts"] = df["posts"].apply(lambda x: p.sub("", x))
    return df

# Preprocess the loaded dataset
new_df = preprocess_text(data_set, remove_special=True)
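
The repeated-letter step in `preprocess_text` is worth checking on a toy string, because the trailing `[\s|\w]*` matches letters and whitespace all the way to the end of the text. The sketch below compares that pattern with a word-level alternative. The alternative is an assumption about the intended behaviour ("drop only the offending word"), not the notebook's own code, and the sample sentence is made up.

```python
import re

# Made-up, already lowercased and punctuation-free sample text.
text = "that was sooo good and then we left"

# Pattern used in preprocess_text: once the run "ooo" is found,
# [\s|\w]* keeps consuming letters and whitespace to the end of the string.
greedy = re.sub(r'([a-z])\1{2,}[\s|\w]*', '', text)

# Hypothetical word-level variant: remove only the word containing the run.
word_level = re.sub(r'\b[a-z]*([a-z])\1{2,}[a-z]*\b', '', text)
word_level = " ".join(word_level.split())  # collapse the leftover double space

print(repr(greedy))      # 'that was s'
print(repr(word_level))  # 'that was good and then we left'
```

If the word-level behaviour is what is wanted, swapping the pattern would change the cleaned text and therefore the counts reported further down, so the later cells would need to be re-run.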


In [22]:
new_df.head()

Unnamed: 0,type,posts
0,INFJ,moments sportscenter plays prank...
1,ENTP,finding lack these posts very alarming eo...
2,INTP,good course which know thats bles...
3,INTJ,dear enjoyed conversation other eostoke...
4,ENTJ,youre fired eostokendot thats another silly...
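
Since the type names are removed to prevent label leakage, it can be useful to check what the whole-word pattern leaves behind. The sketch below rebuilds the pattern from `preprocess_text` and compares it with a hypothetical variant, `with_suffix`, that also matches trailing letters such as a plural "s". The sample sentence is made up, and the variant is an illustration rather than part of the notebook.

```python
import re

pers_types = ['INFP', 'INFJ', 'INTP', 'INTJ', 'ENTP', 'ENFP', 'ISTP', 'ISFP',
              'ENTJ', 'ISTJ', 'ENFJ', 'ISFJ', 'ESTP', 'ESFP', 'ESFJ', 'ESTJ']
alternatives = "|".join(t.lower() for t in pers_types)

# Whole-word pattern, as built in preprocess_text.
exact = re.compile(r'\b(' + alternatives + r')\b')

# Hypothetical variant: a type name followed by any further letters ("intps").
with_suffix = re.compile(r'\b(?:' + alternatives + r')[a-z]*\b')

# Made-up, already lowercased sample text.
text = "most intps and every infj here agree"

after_exact = " ".join(exact.sub("", text).split())
after_suffix = " ".join(with_suffix.sub("", text).split())

print(after_exact)   # most intps and every here agree
print(after_suffix)  # most and every here agree
```

The whole-word pattern leaves the plural form in place, so counting `with_suffix` matches in the cleaned posts is one way to estimate how much label information remains.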


### Filtering Short Messages

To improve the quality of the text data, we keep only posts that contain enough words to be informative for the model.

Very short messages lack context and say little about the user's personality.

Setting a minimum word threshold removes these samples, so the remaining data is rich enough in content for the model to learn useful patterns. This step is intended to reduce noise in the training data.


In [23]:
# Remove posts with fewer than min_words words
min_words = 15
print("Before : Number of posts", len(new_df))
new_df["no. of. words"] = new_df["posts"].apply(lambda x: len(re.findall(r'\w+', x)))
new_df = new_df[new_df["no. of. words"] >= min_words]

print("After : Number of posts", len(new_df))

Before : Number of posts 8675
After : Number of posts 8462
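
The same filter can be written with pandas string methods, which also makes it easy to look at the word-count distribution before settling on a threshold. This is a minimal sketch: `filter_by_word_count` and the `sample` frame are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def filter_by_word_count(df, min_words, text_col="posts"):
    """Return the rows with at least `min_words` words, plus the per-row counts."""
    # Count non-overlapping \w+ matches, i.e. words, in each row.
    word_counts = df[text_col].str.count(r'\w+')
    return df[word_counts >= min_words].copy(), word_counts

# Made-up rows with the same column names as the cleaned dataset.
sample = pd.DataFrame({
    "type": ["INFJ", "ENTP", "INTP"],
    "posts": ["one two three four five", "one two", "one two three"],
})

kept, counts = filter_by_word_count(sample, min_words=3)

print(counts.tolist())               # [5, 2, 3]
print(len(sample), "->", len(kept))  # 3 -> 2
```

Calling `counts.describe()` on the real data shows the minimum, quartiles, and maximum of the word counts, which gives some basis for a value such as `min_words = 15`.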
