# PERSONALITY CLASSIFICATION USING NLP
In this project we use the (MBTI) Myers-Briggs Personality Type Dataset from Kaggle (https://www.kaggle.com/datasnaek/mbti-type) to predict the personality type of a new user from a collection of the user's previous text messages. The model is a linear support vector classifier, LinearSVC from sklearn.svm. We chose it because it trains quickly on a data set of this size (8675 samples, 8674 after preprocessing).

The Myers-Briggs Type Indicator (MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across four axes:

1. Introversion (I) – Extroversion (E)
2. Intuition (N) – Sensing (S)
3. Thinking (T) – Feeling (F)
4. Judging (J) – Perceiving (P)

For example, someone who prefers introversion, intuition, thinking and perceiving is labelled INTP in the MBTI system, and many personality-based components model or describe this person’s preferences or behaviour based on that label.
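Since each label is just one letter per axis, it can be decoded mechanically. The sketch below is illustrative only: `AXES` and `decode_mbti` are names made up for this example and are not used elsewhere in the notebook.

```python
# Illustrative helper (not part of the original notebook): decode a four-letter
# MBTI label into the preference chosen on each of the four axes.
AXES = [
    {"I": "Introversion", "E": "Extroversion"},
    {"N": "Intuition", "S": "Sensing"},
    {"T": "Thinking", "F": "Feeling"},
    {"J": "Judging", "P": "Perceiving"},
]

def decode_mbti(label):
    """Return the four preferences named by an MBTI label such as 'INTP'."""
    label = label.upper()
    if len(label) != 4:
        raise ValueError("an MBTI label has exactly four letters")
    preferences = []
    for letter, axis in zip(label, AXES):
        if letter not in axis:
            raise ValueError(f"unexpected letter {letter!r} in {label!r}")
        preferences.append(axis[letter])
    return preferences

print(decode_mbti("INTP"))
# ['Introversion', 'Intuition', 'Thinking', 'Perceiving']
```

Two choices on each of four axes give 2 × 2 × 2 × 2 = 16 possible labels, which is where the 16 types come from.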

For more information about the data set, see the Kaggle link above.

The project will be split into two parts:
* Part 1: Using NLP techniques from the nltk library to clean and organize the data.
* Part 2: Using TF-IDF and LinearSVC to perform ML and obtain predictions.

### PART 1

In [7]:
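# import required modules
# note: word_tokenize, stopwords and WordNetLemmatizer need NLTK data that must be fetched once with nltk.download()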
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer


# import data
data = pd.read_csv("/Users/krishnan/desktop/mbti_1.csv")
# explore data
print(data.info())
print(data.isnull().sum())
print(data.head(2))

# there are no null values, but the posts contain links, which we will remove

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    8675 non-null   object
 1   posts   8675 non-null   object
dtypes: object(2)
memory usage: 135.7+ KB
None
type     0
posts    0
dtype: int64
   type                                              posts
0  INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1  ENTP  'I'm finding the lack of me in these posts ver...


In [8]:
# we will now create some functions that apply NLP techniques and cleaning to the 'posts' column

# this function replaces the '|||' post separator and the punctuation marks ',' , '.' , '?' , '!'
# with a space, then replaces every remaining non-letter character with a space and lowercases the text.
def cleaning(sentence):
    new = sentence.replace('|||', ' ')
    new = new.replace('.', ' ')
    new = new.replace(',', ' ')
    new = new.replace('?',' ')
    new = new.replace('!', ' ')
    new = re.sub('[^a-zA-Z]', ' ', new)
    new = new.lower()
    return new

# this function tokenizes the sentences in the 'posts' column 
def tokenize(text):
    return nltk.word_tokenize(text,'english')

# for this project we will use lemmatization instead of stemming because stemming gave some inaccurate results
lemma_inst = WordNetLemmatizer()

# lemmatize every token in a list of tokens and return a new list of tokens
def lemma(tokens):
    return [lemma_inst.lemmatize(token) for token in tokens]

# this function removes all the English stop words defined by nltk and returns a new list
def stopwordremoval(tokens):
    stop_words = set(stopwords.words("english"))
    return [token for token in tokens if token not in stop_words]


In [9]:
# we will now apply these functions to the data set and save the result to a new .csv file named "main2"

# remove urls (regex=True is required by recent pandas versions when case=False is used)
data['posts'] = data['posts'].str.replace(r'http\S+|www\.\S+', ' ', case=False, regex=True)

# applying the functions
data['posts'] = data['posts'].apply(cleaning)
data['posts'] = data['posts'].apply(tokenize)
data['posts'] = data['posts'].apply(lemma)
data['posts'] = data['posts'].apply(stopwordremoval)

# convert the list representation into a string
data['posts']=data['posts'].apply(lambda x: " ".join(x))

# save to a csv (uncomment the next line to write the file that Part 2 reads)
#data.to_csv(r'/Users/krishnan/desktop/main2.csv')

In [12]:
print(data.sample(2))

      type                                              posts
6678  INTJ  ey nice proportion sketch ey nice draw chitin ...
1036  INTJ  great day ruined someone unexpectedly come doo...
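To make the effect of the cleaning steps concrete, here is a minimal standalone sketch that uses only `re`. It is not the notebook's code: `URL_PATTERN` and `clean_post` are hypothetical names, the sample string is invented, and tokenization, lemmatization and stop-word removal are left out. Unlike the cells above, the sketch replaces the `|||` separator before stripping links, on the assumption that a link glued to the next post by `|||` would otherwise take that post's first word with it.

```python
import re

# Matches links that start with "http" or "www." and run up to the next whitespace.
URL_PATTERN = re.compile(r"http\S+|www\.\S+", flags=re.IGNORECASE)

def clean_post(raw):
    """Split on '|||', strip links, keep letters only, lowercase, collapse spaces."""
    separated = raw.replace("|||", " ")
    no_links = URL_PATTERN.sub(" ", separated)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_links)
    return " ".join(letters_only.lower().split())

sample = ("http://www.youtube.com/watch?v=abc123|||I'm finding this VERY "
          "alarming!|||See www.example.com now")
print(clean_post(sample))
# i m finding this very alarming see now
```

The apostrophe in "I'm" becomes a space, which is why stray single letters such as "m" show up in the cleaned text.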


### PART 2

In [13]:
# import the preprocessed data
data_prep=pd.read_csv("/Users/krishnan/desktop/main2.csv")

# explore data for any issues
print(data_prep.info())
print(data_prep.isnull().sum())
print(data_prep.sample(2))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  8675 non-null   int64 
 1   type        8675 non-null   object
 2   posts       8674 non-null   object
dtypes: int64(1), object(2)
memory usage: 203.4+ KB
None
Unnamed: 0    0
type          0
posts         1
dtype: int64
      Unnamed: 0  type                                              posts
5278        5278  ENFP  funny dj hmmm sound right wa wondering thought...
2352        2352  INFP  great see dreamed dog loved happy one day play...


In [14]:
# one entry in 'posts' is null, presumably a post that the cleaning in Part 1 reduced to an
# empty string (read_csv parses empty fields as NaN)
data_prep[data_prep['posts'].isnull()].index.tolist()

[3559]

In [15]:
# inspect
print(data_prep.iloc[3559])

Unnamed: 0    3559
type          INFP
posts          NaN
Name: 3559, dtype: object


In [16]:
# we will drop it and continue with training and testing
data_prep=data_prep.drop(3559, axis=0)
data_prep[data_prep['posts'].isnull()].index.tolist()

[]
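The next cell turns each post into a TF-IDF vector. As a rough illustration of what that weighting does, the toy sketch below computes TF-IDF in pure Python. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which is our understanding of the `TfidfVectorizer` defaults. The function name `tfidf_vectors` and the three sample documents are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Toy TF-IDF: raw term counts times a smoothed IDF, then L2-normalised."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["party tonight party", "exam tonight", "exam teacher dean"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in sorted(vec.items())})
```

In the third document, "teacher" and "dean" occur in only one document while "exam" occurs in two, so the rarer terms receive the larger weights.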

In [18]:
# for this project we convert each entry of 'posts' into a tf-idf vector. TF-IDF down-weights tokens
# that appear in many posts, so very common tokens do not dominate the representation.
tfidf = TfidfVectorizer()

# randomize data
randomized_data = data_prep.sample(frac = 1, random_state = 10)

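# train/test split (75/25)
# NOTE: both samples are drawn from the same frame with the same random_state,
# so the test rows overlap the train rows; this is not a disjoint hold-out split
train = randomized_data.sample(6506, random_state = 10)
test = randomized_data.sample(2168, random_state = 10)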

# using train to generate tf-idf vectors
train_tfidf = tfidf.fit_transform(train['posts'])

# converting test into tf-idf vectors generated by train
test_tfidf = tfidf.transform(test['posts'])

# train classification labels
trainlabels = train['type']

# instantiate LinearSVC model and use the train_tfidf to fit the model with trainlabels as the target column
lsvc = LinearSVC()
lsvc.fit(train_tfidf,trainlabels)

# testing
test['predictions'] = lsvc.predict(test_tfidf)

# check the accuracy of our model with the test data
correct = int((test['type'] == test['predictions']).sum())
total = test.shape[0]

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct / total)

Correct: 2156
Incorrect: 12
Accuracy: 0.9944649446494465


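We obtained an accuracy of 99.4% on the test data. This figure should be read with caution: train and test were both sampled from the same frame with the same random_state, so the test rows overlap the training rows, and the score mostly reflects how well the model fits posts it has already seen rather than how well it generalizes to new users.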
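One way to obtain a disjoint 75/25 split is to shuffle the row positions once and cut the shuffled list in two. The sketch below uses only the standard library. `split_positions` is a hypothetical helper, not part of the notebook, and the row counts are the ones used above.

```python
import random

def split_positions(n_rows, n_train, seed=10):
    """Shuffle row positions once and cut them into disjoint train/test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    return positions[:n_train], positions[n_train:]

train_pos, test_pos = split_positions(8674, 6506)
print(len(train_pos), len(test_pos))          # 6506 2168
print(set(train_pos).isdisjoint(test_pos))    # True
```

With pandas, the two lists could be passed to `data_prep.iloc[train_pos]` and `data_prep.iloc[test_pos]`; `sklearn.model_selection.train_test_split` gives the same guarantee in a single call. The accuracy would have to be measured again on such a split.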

In [25]:
# this function takes a message and classifies its author into one of the predefined categories,
# reusing the preprocessing functions that were applied to the training data
def conversion(sentence):
    tokens = tokenize(cleaning(sentence))
    tokens = stopwordremoval(lemma(tokens))
    sentence_refined = [" ".join(tokens)]
    tfidfed = tfidf.transform(sentence_refined)
    category = lsvc.predict(tfidfed)
    return "You are classed as {}".format(category[0])

In [36]:
print(conversion(""" yo bro are you coming to my party tonight. it will be super funs. loads of 
                    people are coming through. let me know if you want in"""))

You are classed as ENFP


In [30]:
print(conversion("""Sorry man , i can't make it tonight. i need to concentrate on my exam tommorrow, need to
                    pass my exam otherwise my teacher will report me to the dean"""))

You are classed as INTJ
