We use the [MBTI Kaggle Dataset](https://www.kaggle.com/datasnaek/mbti-type).

## Importing Required Libraries
In this section, we import the key Python libraries used for data loading, preprocessing, modeling, and visualization.


In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Loading and Exploration

We begin by loading the MBTI dataset and exploring its structure — including sample entries, class distribution, and basic statistics.

In [2]:
data_set = pd.read_csv('../mbti_1.csv') 
data_set.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...
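
Sample rows alone say little about class balance. As a minimal sketch, the class distribution and the number of posts per row could be inspected as below. The frame `toy` and its contents are made up for illustration; on the real data the same calls would be applied to `data_set`.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as the MBTI dataset
toy = pd.DataFrame({
    "type": ["INFJ", "ENTP", "INFJ", "INTP"],
    "posts": ["a|||b", "c", "d|||e|||f", "g"],
})

# Absolute and relative class frequencies
counts = toy["type"].value_counts()
shares = toy["type"].value_counts(normalize=True)

# Posts within a row are separated by '|||', so separators + 1 = number of posts
posts_per_row = toy["posts"].str.count(r"\|\|\|") + 1

print(counts)
print(shares.round(2))
print(posts_per_row.describe())
```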


## Data Preprocessing

Here, we perform minimal text preprocessing to clean and prepare the posts for modeling.  
The focus is not on heavy preprocessing but on ensuring the text is in a usable form (e.g., lowercasing, removing URLs and punctuation).

In [3]:
def preprocess_text(df, remove_special=True):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    #Remove links 
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'https?:\/\/.*?[\s+]', '', x.replace("|"," ") + " "))
    
    #Keep the End Of Sentence characters
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\.', ' EOSTokenDot ', x + " "))
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\?', ' EOSTokenQuest ', x + " "))
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'!', ' EOSTokenExs ', x + " "))
    
    #Collapse runs of full stops (no effect here: dots were already replaced by EOS tokens)
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'\.+', ".",x))

    #Strip punctuation
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'[^\w\s]','',x))

    #Remove Non-words
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'[^a-zA-Z\s]','',x))

    #Convert posts to lowercase
    df["posts"] = df["posts"].apply(lambda x: x.lower())

    #Remove runs of three or more identical letters; note that the trailing [\s|\w]* also consumes the text that follows the run
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'([a-z])\1{2,}[\s|\w]*','',x)) 

    #Remove very short (up to 3 letters) and very long (30 or more letters) words
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'(\b\w{0,3})?\b','',x)) 
    df["posts"] = df["posts"].apply(lambda x: re.sub(r'(\b\w{30,1000})?\b','',x))

    #Remove MBTI personality type names, which would leak the label; crucial for a valid estimate of model accuracy on unseen data
    if remove_special:
        pers_types = ['INFP','INFJ','INTP','INTJ','ENTP','ENFP','ISTP','ISFP',
                  'ENTJ','ISTJ','ENFJ','ISFJ','ESTP','ESFP','ESFJ','ESTJ']
        # build a whole-word pattern; lowercase is sufficient because the posts were lowercased above
        p = re.compile(r'\b(' + "|".join([t.lower() for t in pers_types]) + r')\b')
        # remove the type names from the posts
        df["posts"] = df["posts"].apply(lambda x: p.sub("", x))
    return df

#Preprocess the loaded dataset
new_df = preprocess_text(data_set, remove_special=True)


In [4]:
new_df.head()

Unnamed: 0,type,posts
0,INFJ,moments sportscenter plays prank...
1,ENTP,finding lack these posts very alarming eo...
2,INTP,good course which know thats bles...
3,INTJ,dear enjoyed conversation other eostoke...
4,ENTJ,youre fired eostokendot thats another silly...
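
The behaviour of the repeated-letter step is easiest to see on a toy string. The sketch below is not part of the original pipeline: the sample text is made up, and `word_pattern` is a hypothetical word-bounded variant shown only for comparison.

```python
import re

# Made-up sample in the state the pipeline has reached at this step:
# lowercase letters and whitespace only
text = "that was sooo funny and then we left"

# Pattern from the pipeline: once a letter run is found, [\s|\w]* keeps
# matching letters and whitespace, so the match extends to the end of the string
pipeline_result = re.sub(r'([a-z])\1{2,}[\s|\w]*', '', text)

# Hypothetical word-bounded variant: drops only the word that contains the run
word_pattern = re.compile(r'\b\w*([a-z])\1{2,}\w*\b')
word_result = " ".join(word_pattern.sub('', text).split())

print(repr(pipeline_result))
print(repr(word_result))
```

Swapping in the word-bounded variant would change the word counts and vocabulary size reported in this notebook, so it is shown only as an option.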


### Filtering Short Messages

To improve the quality of the text data, we focus only on posts that contain enough words to be informative for the model. 

Very short messages often lack context and provide little meaningful information about the user's personality. 

By setting a minimum word threshold, we remove posts that are too brief, ensuring that the dataset contains samples rich enough in content to help the model learn useful patterns. This step helps reduce noise and improves the overall reliability of the model's predictions.


In [5]:
# Remove posts with fewer than min_words words
import re

min_words = 15
print("Before : Number of posts", len(new_df)) 

# Count words
new_df["no. of. words"] = new_df["posts"].apply(lambda x: len(re.findall(r'\w+', x)))

# Filter posts and make a copy
new_df = new_df[new_df["no. of. words"] >= min_words].copy()

print("After : Number of posts", len(new_df))


Before : Number of posts 8675
After : Number of posts 8462


In [6]:
new_df.head()

Unnamed: 0,type,posts,no. of. words
0,INFJ,moments sportscenter plays prank...,422
1,ENTP,finding lack these posts very alarming eo...,793
2,INTP,good course which know thats bles...,252
3,INTJ,dear enjoyed conversation other eostoke...,766
4,ENTJ,youre fired eostokendot thats another silly...,399


## Feature Engineering

### What is it?

Feature engineering refers to the process of transforming raw data into meaningful numerical representations that can be used by machine learning models.
Most algorithms cannot directly interpret raw text, images, or categorical labels, so this step is essential.

In text-based machine learning tasks, feature engineering typically includes:
* Cleaning and preprocessing the text (lowercasing, removing stopwords, etc.)
* Converting text into numerical vectors (for example using TF-IDF, Bag-of-Words, or embeddings)
* Extracting additional features such as sentiment scores or text length
* Selecting or reducing the number of features to improve model performance

The preprocessing was already done in the previous section. What remains is to turn the input features (X) and the target variable (Y) into a numerical form the model can work with, i.e.
* label encoding for the target classes 
* tokenization and count vectorization for the input features 


### Target Variable (Y)

The target variable in this project is the MBTI personality type, which can take one of sixteen categorical values. Because this is a multiclass classification task, the target is represented using __*label encoding*__, where each MBTI type is mapped to an integer from 0 to 15. This keeps Y as a single, compact column and aligns with how common machine learning classifiers interpret labels internally—as class indices rather than as numerical values.

__*One-hot encoding*__, in contrast, expands a categorical variable into one binary column per category. This is valuable for input features (X), where preventing the model from assuming an ordinal relationship between categories is important. However, using one-hot encoding for the target variable would unnecessarily expand Y into sixteen separate columns. 

This expansion contributes nothing to learning and __only increases dimensionality__, which is undesirable. Importantly, __*the curse of dimensionality applies to the feature space (X), not the target labels.*__ The model learns from X, so excessively high-dimensional input features can harm performance. Y does not participate in the feature space, so how Y is encoded has no impact on this issue. For these reasons, label encoding is the correct and efficient choice for representing the MBTI target.

#### TL;DR

The curse of dimensionality affects X, not Y. Use label encoding for the MBTI target to keep Y simple as class indices; use one-hot encoding only for categorical input features where dimensionality is justified.

The Kaggle code that this notebook draws on stated this incorrectly.
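
To make the difference concrete, here is a small sketch that encodes the same toy target both ways. The series `types` is made up, and `pd.get_dummies` stands in for one-hot encoding purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up target column with three distinct MBTI types
types = pd.Series(["INFJ", "ENTP", "INTP", "INFJ"])

# Label encoding: one integer per sample, classes sorted alphabetically
enc_demo = LabelEncoder()
y_label = enc_demo.fit_transform(types)

# One-hot encoding: one binary column per class
y_onehot = pd.get_dummies(types)

print(y_label.shape)    # a single column of class indices
print(y_onehot.shape)   # one column per distinct type
print(list(enc_demo.classes_))
```

With all sixteen types present, the one-hot version would have sixteen columns, while the label-encoded target stays a single column.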


In [7]:
from sklearn.preprocessing import LabelEncoder

# Initialize the encoder
enc = LabelEncoder()

# Fit the encoder on the MBTI types and transform them into integer labels
new_df['type_encoded'] = enc.fit_transform(new_df['type'])

# Define the target variable
target = new_df['type_encoded']


In [8]:
new_df.head(10)

Unnamed: 0,type,posts,no. of. words,type_encoded
0,INFJ,moments sportscenter plays prank...,422,8
1,ENTP,finding lack these posts very alarming eo...,793,3
2,INTP,good course which know thats bles...,252,11
3,INTJ,dear enjoyed conversation other eostoke...,766,10
4,ENTJ,youre fired eostokendot thats another silly...,399,2
5,INTJ,eostokendot science perfect eostokendo...,239,10
6,INFJ,cant draw nails haha eostokendot those w...,964,8
7,INTJ,tend build collection things desktop th...,139,10
8,INFJ,sure thats good question eostokendot dist...,522,8
9,INTP,this position where have actually pe...,130,11


### Input Features (X)

#### Text preprocessing and Vectorization

Before we can feed text data into a machine learning model, we need to convert it into a numerical form. A few steps are important here:

1. Removing Stop Words: \
Stop words are very common English words like “the”, “and”, or “is” that appear frequently but carry little meaning. Removing them helps the model focus on more informative words. Libraries like NLTK provide standard stop-word lists; this notebook uses the English list built into scikit-learn.

2. Converting Text to Numbers with CountVectorizer: \
`CountVectorizer` transforms a collection of text documents into a numerical representation. It builds a vocabulary of all words in the dataset and counts how often each word appears in each document. Setting `stop_words='english'` ensures that common words are ignored. This step is crucial because machine learning models cannot process raw text; they need numeric input.

3. Resulting Features: \
After vectorization, each post is represented as a high-dimensional vector where each dimension corresponds to a word in the vocabulary. For example, after filtering, our dataset has 8,462 posts and 98,532 unique words (features). Each row represents a post, and each column indicates how many times a particular word appears in that post.

In [9]:
# Vectorize the posts for the model and filter out stop words
vect = CountVectorizer(stop_words='english') 

# Convert the posts (the X features) into a sparse matrix of token counts
train = vect.fit_transform(new_df["posts"])

In [10]:
# Get feature names
feature_names = vect.get_feature_names_out()

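# Convert only the first 5 rows to a dense DataFrame; densifying the full
# matrix (8462 x 98532 int64 values) would need several gigabytes of memory
X_df = pd.DataFrame(train[:5].toarray(), columns=feature_names)

# Show the first 5 rows
print(X_df.head())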


   aabbeeoorryy  aabye  aafak  aages  aain  aaliyah  aalto  aalways  aamiin  \
0             0      0      0      0     0        0      0        0       0   
1             0      0      0      0     0        0      0        0       0   
2             0      0      0      0     0        0      0        0       0   
3             0      0      0      0     0        0      0        0       0   
4             0      0      0      0     0        0      0        0       0   

   aand  ...  zygomaticus  zygomorphic  zygon  zygote  zygotes  zylinder  \
0     0  ...            0            0      0       0        0         0   
1     0  ...            0            0      0       0        0         0   
2     0  ...            0            0      0       0        0         0   
3     0  ...            0            0      0       0        0         0   
4     0  ...            0            0      0       0        0         0   

   zylinders  zynga  zynthax  zynthaxx  
0          0      0        

In [11]:
# Show the tokens with non-zero counts for the first post
first_post_vector = X_df.iloc[0]
print(first_post_vector[first_post_vector > 0])


ages           1
alternative    1
appears        1
area           1
artist         1
              ..
worry          1
xbox           1
years          1
youre          1
youve          1
Name: 0, Length: 209, dtype: int64


In [13]:
import joblib
import os

# Create processed data directory if it doesn't exist
os.makedirs('../data/processed', exist_ok=True)

# Define our features and labels with proper variable names
X = train  # The vectorized features (sparse matrix)
Y = target  # The encoded labels

# Save the vectorized features (sparse matrix)
joblib.dump(X, '../data/processed/X_vectorized.pkl')

# Save the labels
joblib.dump(Y, '../data/processed/Y_labels.pkl')

# Save the fitted vectorizer and encoder (needed for transforming new data)
joblib.dump(vect, '../data/processed/vectorizer.pkl')
joblib.dump(enc, '../data/processed/label_encoder.pkl')

# Save the preprocessed dataframe as well for reference
new_df.to_csv('../data/processed/preprocessed_data.csv', index=False)

print("Data saved successfully!")
print(f"Features shape: {X.shape}")
print(f"Labels shape: {Y.shape}")
print(f"Number of unique labels: {len(enc.classes_)}")
print(f"Label classes: {enc.classes_}")

Data saved successfully!
Features shape: (8462, 98532)
Labels shape: (8462,)
Number of unique labels: 16
Label classes: ['ENFJ' 'ENFP' 'ENTJ' 'ENTP' 'ESFJ' 'ESFP' 'ESTJ' 'ESTP' 'INFJ' 'INFP'
 'INTJ' 'INTP' 'ISFJ' 'ISFP' 'ISTJ' 'ISTP']
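
The saved vectorizer and encoder are what allow new text to be mapped into the same feature space later. The sketch below shows the round trip under simplified assumptions: the fitted objects are tiny made-up stand-ins, and the files are written to a temporary directory rather than `../data/processed`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-ins for the fitted vectorizer and label encoder
    vect_demo = CountVectorizer().fit(["quiet evening reading", "loud party dancing"])
    enc_demo = LabelEncoder().fit(["INFJ", "ENTP"])

    # Save and reload, mirroring the file names used in the notebook
    joblib.dump(vect_demo, os.path.join(tmp, "vectorizer.pkl"))
    joblib.dump(enc_demo, os.path.join(tmp, "label_encoder.pkl"))
    loaded_vect = joblib.load(os.path.join(tmp, "vectorizer.pkl"))
    loaded_enc = joblib.load(os.path.join(tmp, "label_encoder.pkl"))

# New text is transformed (not fitted) so the columns match the saved vocabulary
new_X = loaded_vect.transform(["reading at a party"])

# Integer predictions can be mapped back to type names with the encoder
decoded = loaded_enc.inverse_transform([0, 1])

print(new_X.toarray())
print(decoded)
```

Words that are not in the saved vocabulary are silently ignored by `transform`, so new posts should go through the same preprocessing as the training data first.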
