# MBTI Classifier and Sentiment Analysis

This Jupyter notebook is part of a personal data science and machine learning engineering project. It builds a Myers–Briggs Type Indicator (MBTI) classifier and runs a sentiment analysis on posts that individuals wrote on web blogs. The classifier is built with three well-known machine learning models: Random Forest, XGBoost, and a Support Vector Machine (SVM) classifier.

The purposes of this project are:
1. To analyse the sentiment of the posts made by individuals of each personality type.
2. To create my own personality classifier based on the MBTI personality system.
3. To compare the performance of a standard tree-based model, a gradient-boosted tree model, and a support vector machine on this classification task.
4. To understand what kind of posts individuals of each personality type tend to publish online.

The project is carried out step by step in the sections below:

## Personality Classifier

### 1. Import all the necessary libraries

In [1]:
import nltk #text-processing library
nltk.download(['stopwords','vader_lexicon','punkt','wordnet'])
import pandas as pd #data manipulation
import matplotlib.pyplot as plt #data visualisation
import re #regular expressions for text identification and extraction
from xgboost import XGBClassifier #XGBoost model
from sklearn.ensemble import RandomForestClassifier #Random Forest model
from sklearn.model_selection import train_test_split #Split the dataset into train and test sets for model training
from imblearn.over_sampling import SMOTE #Oversample the minority classes to counter the imbalance problem
from sklearn.svm import SVC #Support vector machine classifier model
import seaborn as sns #data visualisation
from nltk.corpus import stopwords #Stop-word list, used to remove stop words from the posts
from nltk.tokenize import word_tokenize #Tokenise each post into words
from nltk.stem import WordNetLemmatizer #Reduce each word to its base (dictionary) form
from sklearn.preprocessing import LabelEncoder #Encode the dataset labels, here turning the
#personality type into a numeric value
from sklearn.feature_extraction.text import TfidfVectorizer #Turn each post into a vector so it can be used for model training
from sklearn.metrics import accuracy_score, classification_report #Report the classification results after training
from nltk.sentiment.vader import SentimentIntensityAnalyzer #Run sentiment analysis on the text
import plotly.express as px #Interactive visualisation

import warnings
warnings.filterwarnings('ignore')



In [2]:
import plotly.offline as py
py.init_notebook_mode(connected=True)

### 2. Explore the file

After importing the necessary libraries, the file `mbti_1.csv`, which provides the MBTI types and each individual's online blog posts, is read with pandas and explored as follows:

In [3]:
#Load the file and look at its first 5 rows
mbti_data = pd.read_csv('mbti_1.csv')
mbti_data.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [4]:
#Use describe(include='O') to count the posts and the unique personality types
mbti_data.describe(include='O')

Unnamed: 0,type,posts
count,8675,8675
unique,16,8675
top,INFP,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
freq,1832,1


In [5]:
#Count the types in the dataset
mbti_data['type'].value_counts()

type
INFP    1832
INFJ    1470
INTP    1304
INTJ    1091
ENTP     685
ENFP     675
ISTP     337
ISFP     271
ENTJ     231
ISTJ     205
ENFJ     190
ISFJ     166
ESTP      89
ESFP      48
ESFJ      42
ESTJ      39
Name: count, dtype: int64

In [6]:
#Create an interactive pie chart showing the proportion of each personality type
px.pie(mbti_data,names='type',title='Personality type',hole=0.3)

In [7]:
#Identify whether there are nulls in the dataset
mbti_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    8675 non-null   object
 1   posts   8675 non-null   object
dtypes: object(2)
memory usage: 135.7+ KB


In [8]:
#looking at the sample of the posts on this dataset
mbti_data.posts[0]

"'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks|||What has been the most life-changing experience in your life?|||http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.|||May the PerC Experience immerse you.|||The last thing my INFJ friend posted on his facebook before committing suicide the next day. Rest in peace~   http://vimeo.com/22842206|||Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...|||84389  84390  http://wallpaperpassion.com/upload/23700/friendship-boy-and-girl-wallpaper.jpg  http://assets.dornob.com/wp-content/uploads/2010/04/round-home-design.jpg ...

Running the code in this section shows that the dataset has no null values, that it is imbalanced (INFP has 1,832 rows while ESTJ has only 39), and that each individual's posts contain links, meaningless tokens and misplaced punctuation, with individual posts separated by `|||`. Because of these issues, text pre-processing is required to produce features suitable for machine learning model training.
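To make these observations concrete, the sketch below splits one raw `posts` string on the `|||` separator and counts the links in each post. The sample string is a shortened excerpt of the first row shown above. `split_posts` and `count_links` are hypothetical helper names, and the URL pattern is a simple approximation rather than a full URL parser.

```python
import re

# Shortened excerpt of the first row of the dataset (posts are separated by '|||').
sample = (
    "'http://www.youtube.com/watch?v=qsXHcwe3krw|||"
    "enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  "
    "sportscenter not top ten plays|||"
    "What has been the most life-changing experience in your life?"
)

# Simple URL pattern (an approximation, not a complete URL grammar).
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

def split_posts(raw):
    """Split one user's concatenated text into individual posts."""
    return [post.strip() for post in raw.split('|||') if post.strip()]

def count_links(post):
    """Count URL-like substrings in a single post."""
    return len(URL_PATTERN.findall(post))

posts = split_posts(sample)
links_per_post = [count_links(post) for post in posts]
print(len(posts))        # 3
print(links_per_post)    # [1, 1, 0]
```

Splitting before counting keeps each link attached to the post it came from, which is useful when checking how much of a user's text is links rather than words.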

### 3. Text pre-processing & Feature Generation

Once the file has been explored, the next step is data pre-processing and feature generation. Since all the data in this dataset is text from internet blog posts, it must be vectorised before a classification model can be trained on it.

First, unnecessary text such as links and irrelevant symbols is removed, because it adds no value to the model.

In [9]:
def remove_links_symbol(text):
    """
    Remove links and irrelevant symbols from the posts.
    """
    # Replace any link (http://, https:// or www.) with a space
    sentence = re.sub(r'https?://[^\s<>"]+|www\.[^\s<>"]+', ' ', text)

    # Replace every character except letters, digits, apostrophes, commas and periods with a space
    sentence = re.sub(r"[^0-9a-zA-Z',.]", ' ', sentence)
    return sentence

mbti_data['cleaned_posts'] = mbti_data['posts'].apply(remove_links_symbol)
mbti_data

Unnamed: 0,type,posts,cleaned_posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,' and intj moments sportscenter not top t...
1,ENTP,'I'm finding the lack of me in these posts ver...,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...,"'Good one course, to which I say I ..."
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...","'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...,'You're fired. That's another silly misconce...
...,...,...,...
8670,ISFP,'https://www.youtube.com/watch?v=t8edHB_h908||...,' just because I always think of cats as Fi d...
8671,ENFP,'So...if this thread already exists someplace ...,'So...if this thread already exists someplace ...
8672,INTP,'So many questions when i do these things. I ...,'So many questions when i do these things. I ...
8673,INFP,'I am very conflicted right now when it comes ...,'I am very conflicted right now when it comes ...
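In row 0 above, the word `enfp` that follows the second link in the raw text is missing from `cleaned_posts`. A likely cause is that the link pattern does not stop at `|`, so a match runs through the `|||` separator and into the first word of the next post. The sketch below reproduces this on a short excerpt and compares it with a hypothetical variant of the pattern that also stops at `|`. `strip_links` is an illustrative helper, not the notebook's function, and it collapses repeated spaces only to make the printed result easier to read.

```python
import re

# Pattern as used in remove_links_symbol: '|' is not excluded from the match.
ORIGINAL = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')
# Hypothetical variant that also stops at '|' (assumes links never contain '|').
STOP_AT_PIPE = re.compile(r'https?://[^\s<>"|]+|www\.[^\s<>"|]+')

sample = (
    "http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg"
    "|||enfp and intj moments"
)

def strip_links(text, pattern):
    """Remove links with the given pattern, then apply the same symbol filter."""
    without_links = pattern.sub(' ', text)
    cleaned = re.sub(r"[^0-9a-zA-Z',.]", ' ', without_links)
    return ' '.join(cleaned.split())  # collapse spaces for display only

print(strip_links(sample, ORIGINAL))      # and intj moments
print(strip_links(sample, STOP_AT_PIPE))  # enfp and intj moments
```

An alternative with the same effect would be to split each row on `|||` first and clean every post separately.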


After links and symbols have been removed, each post is normalised with the following steps:
1. Replace any remaining non-alphanumeric characters with spaces and convert the text to lowercase
2. Tokenise the text and lemmatise each word to its base noun form
3. Remove stop words
4. Join the remaining words back into a single string


In [10]:
def text_processing(data):
    """
    Normalise a cleaned post: replace every non-alphanumeric character
    with a space, lowercase and tokenise the text, lemmatise each token
    to its base noun form, drop English stop words, and join the
    remaining words into a single string.
    """
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', data)
    word_token = word_tokenize(sentence.lower())
    stopwords_list = stopwords.words('english')
    stopwords_set = set(stopwords_list)
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in word_token]
    words = [word for word in lemmatized_words if word not in stopwords_set]
    return ' '.join(words)

mbti_data['process_post'] = mbti_data['cleaned_posts'].apply(text_processing)

mbti_data

Unnamed: 0,type,posts,cleaned_posts,process_post
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,' and intj moments sportscenter not top t...,intj moment sportscenter top ten play prank ha...
1,ENTP,'I'm finding the lack of me in these posts ver...,'I'm finding the lack of me in these posts ver...,finding lack post alarming sex boring position...
2,INTP,'Good one _____ https://www.youtube.com/wat...,"'Good one course, to which I say I ...",good one course say know blessing curse doe ab...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...","'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,'You're fired. That's another silly misconce...,fired another silly misconception approaching ...
...,...,...,...,...
8670,ISFP,'https://www.youtube.com/watch?v=t8edHB_h908||...,' just because I always think of cats as Fi d...,always think cat fi doms reason website become...
8671,ENFP,'So...if this thread already exists someplace ...,'So...if this thread already exists someplace ...,thread already exists someplace else doe heck ...
8672,INTP,'So many questions when i do these things. I ...,'So many questions when i do these things. I ...,many question thing would take purple pill pic...
8673,INFP,'I am very conflicted right now when it comes ...,'I am very conflicted right now when it comes ...,conflicted right come wanting child honestly m...
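The `process_post` column contains tokens such as `ha` (row 0) and `doe` (rows 2 and 8671). A likely explanation is the order of operations in `text_processing`: the noun lemmatizer shortens `has` to `ha` and `does` to `doe` before the stop-word filter runs, so the filter no longer recognises them. The sketch below illustrates the effect of the ordering. It does not call NLTK: `STOP_WORDS` is a tiny hand-picked set and `toy_lemmatize` is a toy suffix stripper standing in for `WordNetLemmatizer.lemmatize`, both assumptions made so the example runs without downloaded corpora.

```python
import re

# Tiny stand-in for the NLTK English stop-word list (assumption).
STOP_WORDS = {"it", "has", "the", "and", "does"}

def toy_lemmatize(word):
    """Toy stand-in for a noun lemmatizer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

def lemmatize_then_filter(text):
    """Order used in text_processing: lemmatize first, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    lemmas = [toy_lemmatize(token) for token in tokens]
    return " ".join(word for word in lemmas if word not in STOP_WORDS)

def filter_then_lemmatize(text):
    """Alternative order: drop stop words first, then lemmatize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(toy_lemmatize(token) for token in kept)

text = "It has the posts and does the things"
print(lemmatize_then_filter(text))  # ha post doe thing
print(filter_then_lemmatize(text))  # post thing
```

With the real NLTK objects, the same change would amount to filtering `word_token` against `stopwords_set` before calling `lemmatizer.lemmatize`. Doing so would alter the features, so the scores reported later in this notebook would need to be regenerated.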


### 4. Sentiment Analysis

After the text pre-processing is finished, the cleaned posts are scored with the VADER sentiment analyser to identify which personality types tend to post negative, neutral or positive content online:

In [15]:
analyzer = SentimentIntensityAnalyzer()

# Define a function to get sentiment scores
def get_sentiment_scores(text):
    sentiment = analyzer.polarity_scores(text)
    return sentiment

# Apply the sentiment analyzer to each post
mbti_data['sentiment_scores'] = mbti_data['cleaned_posts'].apply(get_sentiment_scores)

# Extract compound scores as a measure of overall sentiment
mbti_data['compound_sentiment'] = mbti_data['sentiment_scores'].apply(lambda x: x['compound'])

In [16]:
def classify_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Classify sentiment based on compound scores
mbti_data['sentiment'] = mbti_data['compound_sentiment'].apply(classify_sentiment)

In [18]:
sentiment_mbti = mbti_data[['type','sentiment']].value_counts()
sentiment_mbti

type  sentiment
INFP  positive     1754
INFJ  positive     1432
INTP  positive     1194
INTJ  positive     1011
ENFP  positive      664
ENTP  positive      641
ISTP  positive      306
ISFP  positive      262
ENTJ  positive      218
ISTJ  positive      196
ENFJ  positive      185
ISFJ  positive      159
INTP  negative      108
ESTP  positive       86
INTJ  negative       80
INFP  negative       77
ESFP  positive       45
ENTP  negative       44
ESFJ  positive       42
INFJ  negative       36
ESTJ  positive       35
ISTP  negative       31
ENFP  negative       11
ENTJ  negative       11
ISTJ  negative        8
ISFP  negative        8
ISFJ  negative        6
ENFJ  negative        5
ESTJ  negative        3
ESFP  negative        3
ESTP  negative        3
INTP  neutral         2
ENTJ  neutral         2
INFJ  neutral         2
ISFJ  neutral         1
ESTJ  neutral         1
ISFP  neutral         1
ISTJ  neutral         1
INFP  neutral         1
Name: count, dtype: int64

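By raw count, INFP users are most often classed as positive (1,754) and INTP users most often as negative (108), but these counts largely follow class size. As a share of each type, ISTP has the highest negative rate (31 of 337, about 9.2%), slightly ahead of INTP (108 of 1,304, about 8.3%), and roughly 90% or more of the users of every type are classed as positive. Each row concatenates many posts by one user, so the label describes that user's overall tone rather than individual posts.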
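Raw counts are hard to compare across types of very different size. The sketch below normalises the negative counts by the number of users of each type, using the figures copied from the two `value_counts` outputs above. The dictionary names are illustrative.

```python
# Users per type, copied from mbti_data['type'].value_counts() above.
totals = {
    "INFP": 1832, "INFJ": 1470, "INTP": 1304, "INTJ": 1091,
    "ENTP": 685, "ENFP": 675, "ISTP": 337, "ISFP": 271,
    "ENTJ": 231, "ISTJ": 205, "ENFJ": 190, "ISFJ": 166,
    "ESTP": 89, "ESFP": 48, "ESFJ": 42, "ESTJ": 39,
}

# Users classed as negative, copied from the sentiment counts above.
# ESFJ has no negative row, so it is treated as zero.
negatives = {
    "INTP": 108, "INTJ": 80, "INFP": 77, "ENTP": 44, "INFJ": 36,
    "ISTP": 31, "ENFP": 11, "ENTJ": 11, "ISTJ": 8, "ISFP": 8,
    "ISFJ": 6, "ENFJ": 5, "ESTJ": 3, "ESFP": 3, "ESTP": 3,
}

negative_share = {
    mbti_type: negatives.get(mbti_type, 0) / total
    for mbti_type, total in totals.items()
}

# Rank types by the share of users classed as negative.
ranked = sorted(negative_share.items(), key=lambda item: item[1], reverse=True)
for mbti_type, share in ranked[:3]:
    print(f"{mbti_type}: {share:.1%}")
# ISTP: 9.2%
# INTP: 8.3%
# ESTJ: 7.7%
```

For the smallest types the shares rest on only a handful of users (3 of 39 for ESTJ), so they should be read with caution.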

### 5. Model training and prediction

In [11]:
#Encode the personality type as a numeric label
label_encoder = LabelEncoder()
mbti_data['label'] = label_encoder.fit_transform(mbti_data['type'])
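`LabelEncoder` assigns integer codes in sorted order of the class names, so the numeric labels in the classification reports can be mapped back to personality types. The sketch below reproduces that ordering with plain `sorted` instead of calling scikit-learn. With the fitted encoder, `label_encoder.classes_` should give the same list.

```python
# The 16 types present in the dataset (from the value_counts output).
mbti_types = [
    "INFP", "INFJ", "INTP", "INTJ", "ENTP", "ENFP", "ISTP", "ISFP",
    "ENTJ", "ISTJ", "ENFJ", "ISFJ", "ESTP", "ESFP", "ESFJ", "ESTJ",
]

# Sorted class names correspond to codes 0..15.
code_to_type = dict(enumerate(sorted(mbti_types)))

for code, mbti_type in code_to_type.items():
    print(code, mbti_type)
```

Under this ordering, label 9 is INFP and label 4 is ESFJ, which is consistent with their supports of 344 and 8 in the test reports.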

In [12]:
#Split the data into train and test sets in an 80-20 proportion
X_train, X_test, y_train, y_test=train_test_split(mbti_data['process_post'],
                                                  mbti_data['label'],test_size=0.2,random_state=100)

In [13]:
# Vectorize posts
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = tfidf_vectorizer.fit_transform(X_train)
X_test_vec = tfidf_vectorizer.transform(X_test)
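The vectoriser is fitted on the training posts only and then reused on the test posts, which keeps test-set vocabulary from leaking into the features. The toy example below shows the consequence: a word that appears only in the test text is ignored. The documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents in the same style as process_post.
train_docs = [
    "intj moment top ten play",
    "thread already exists someplace else",
]
test_docs = ["top ten thread unseenword"]

vectorizer = TfidfVectorizer(max_features=5000)
train_vec = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vec = vectorizer.transform(test_docs)        # reuses them without refitting

print(train_vec.shape)  # (2, 10): ten distinct training words
print(test_vec.shape)   # (1, 10): same columns as the training matrix
print("unseenword" in vectorizer.vocabulary_)  # False
```

Both matrices share the same columns, which is what lets a model trained on `X_train_vec` score `X_test_vec`.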

In [14]:
#The dataset is imbalanced, so oversample the training set with SMOTE
smote = SMOTE(random_state=100)
X_train_smote, y_train_smote = smote.fit_resample(X_train_vec, y_train)
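SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in plain Python. It skips the neighbour search and is not the `imblearn` implementation. `interpolate` and the two sample vectors are hypothetical.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(100)

# Two made-up TF-IDF-like vectors from the same minority class.
minority_a = [0.0, 0.2, 0.0]
minority_b = [0.4, 0.0, 0.1]

synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)
```

Only the training set is resampled in the notebook, so the test set keeps the original class distribution and the reported scores reflect real, not synthetic, posts.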

#### Random Forest 

In [15]:
clf = RandomForestClassifier()
clf.fit(X_train_smote, y_train_smote)

In [16]:
y_pred = clf.predict(X_test_vec)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.5855907780979827
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.22      0.34        51
           1       0.66      0.50      0.57       148
           2       0.65      0.38      0.48        34
           3       0.57      0.53      0.55       117
           4       0.00      0.00      0.00         8
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         6
           7       1.00      0.20      0.33        20
           8       0.64      0.65      0.65       304
           9       0.50      0.80      0.61       344
          10       0.60      0.59      0.59       217
          11       0.61      0.66      0.63       277
          12       0.79      0.38      0.52        39
          13       0.62      0.30      0.40        44
          14       0.73      0.20      0.31        55
          15       0.60      0.49      0.54        61

    accuracy               


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
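This warning appears because some classes (labels 4, 5 and 6 in the report above) are never predicted, so their precision has a zero denominator. Passing `zero_division=0` to `classification_report` keeps the 0.0 value and silences the warning, and `target_names` replaces the numeric labels with readable names. The sketch below uses a made-up three-class example. In the notebook, `label_encoder.classes_` would be a natural choice for `target_names`.

```python
from sklearn.metrics import classification_report

# Made-up labels: class 2 is present in y_true but never predicted.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 0]
names = ["ENFJ", "ENFP", "ENTJ"]

report = classification_report(
    y_true, y_pred, target_names=names, zero_division=0
)
print(report)

# The same report as a dictionary, convenient for further analysis.
report_dict = classification_report(
    y_true, y_pred, target_names=names, zero_division=0, output_dict=True
)
print(report_dict["ENTJ"]["precision"])  # 0.0
```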





#### XGBoost Classifier

In [17]:
xgb_classifier = XGBClassifier(learning_rate = 0.25,objective='multi:softmax',num_class=16)
xgb_classifier.fit(X_train_smote, y_train_smote)





In [18]:
y_pred = xgb_classifier.predict(X_test_vec)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.6536023054755044
Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.33      0.45        51
           1       0.69      0.60      0.64       148
           2       0.58      0.44      0.50        34
           3       0.66      0.59      0.62       117
           4       0.00      0.00      0.00         8
           5       1.00      0.10      0.18        10
           6       0.50      0.17      0.25         6
           7       0.50      0.50      0.50        20
           8       0.68      0.73      0.70       304
           9       0.63      0.76      0.69       344
          10       0.67      0.65      0.66       217
          11       0.66      0.74      0.69       277
          12       0.71      0.51      0.60        39
          13       0.58      0.43      0.49        44
          14       0.74      0.53      0.62        55
          15       0.58      0.61      0.59        61

    accuracy               

#### Support Vector Machine

In [27]:
model_svc=SVC(kernel='linear', C=1.0, class_weight='balanced', random_state=100)
model_svc.fit(X_train_smote, y_train_smote)

In [28]:
y_pred = model_svc.predict(X_test_vec)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.6570605187319885
Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.43      0.53        51
           1       0.67      0.66      0.67       148
           2       0.43      0.47      0.45        34
           3       0.61      0.61      0.61       117
           4       0.00      0.00      0.00         8
           5       0.33      0.10      0.15        10
           6       0.50      0.33      0.40         6
           7       0.53      0.45      0.49        20
           8       0.67      0.68      0.68       304
           9       0.67      0.74      0.70       344
          10       0.66      0.71      0.69       217
          11       0.70      0.74      0.72       277
          12       0.61      0.49      0.54        39
          13       0.59      0.43      0.50        44
          14       0.74      0.42      0.53        55
          15       0.62      0.64      0.63        61

    accuracy               

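Based on test accuracy, the SVM (0.657) and XGBoost (0.654) perform almost identically, and both clearly outperform Random Forest (0.586). SVM is marginally the best of the three models for predicting personality type from individuals' online posts, but a gap of about 0.3 percentage points on 1,735 test rows is too small to call it decisively better. All three models struggle on the rarest types: label 4 has an F1-score of 0.00 in every report.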
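Accuracy is dominated by the four largest types, so it helps to compare the models on macro-averaged F1 as well, which weights all 16 types equally. The sketch below computes it from the per-class F1-scores printed in the three reports above. Because those values are rounded to two decimals, the results are approximate.

```python
# Per-class F1-scores (labels 0..15) copied from the reports above.
f1_scores = {
    "Random Forest": [0.34, 0.57, 0.48, 0.55, 0.00, 0.00, 0.00, 0.33,
                      0.65, 0.61, 0.59, 0.63, 0.52, 0.40, 0.31, 0.54],
    "XGBoost":       [0.45, 0.64, 0.50, 0.62, 0.00, 0.18, 0.25, 0.50,
                      0.70, 0.69, 0.66, 0.69, 0.60, 0.49, 0.62, 0.59],
    "SVM":           [0.53, 0.67, 0.45, 0.61, 0.00, 0.15, 0.40, 0.49,
                      0.68, 0.70, 0.69, 0.72, 0.54, 0.50, 0.53, 0.63],
}

# Test accuracies copied from the cell outputs above.
accuracy = {
    "Random Forest": 0.5855907780979827,
    "XGBoost": 0.6536023054755044,
    "SVM": 0.6570605187319885,
}

# Macro F1 is the unweighted mean of the per-class F1-scores.
macro_f1 = {name: sum(scores) / len(scores) for name, scores in f1_scores.items()}

for name in f1_scores:
    print(f"{name}: accuracy={accuracy[name]:.3f}, macro F1={macro_f1[name]:.2f}")
# Random Forest: accuracy=0.586, macro F1=0.41
# XGBoost: accuracy=0.654, macro F1=0.51
# SVM: accuracy=0.657, macro F1=0.52
```

On this measure the ranking is unchanged, and every model scores well below its accuracy, which reflects the weak results on the rare types.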