# MBTI Project - Data Wrangling

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

## Data Collection

Unfortunately, the dataset I am using is too large to store comfortably on GitHub, which recommends against uploading files over 50 MB and blocks anything over 100 MB ([link](https://help.github.com/en/github/managing-large-files/conditions-for-large-files)). For this reason the CSV file is not shared in the repository, but it can be downloaded directly from [Kaggle](https://www.kaggle.com/datasnaek/mbti-type).

In [2]:
df = pd.read_csv(r'/Users/diego/Google Drive/2. Business Intelligence/Data/MBTI/mbti_1.csv')

## Data Definition

In [3]:
df.shape

(8675, 2)

In [4]:
df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    8675 non-null   object
 1   posts   8675 non-null   object
dtypes: object(2)
memory usage: 135.7+ KB


<br>
As we can see, the dataset is complete: it has no missing values. However, it only has 2 columns.
<br>
<br>

In [6]:
# We generate a report through profile_report for a more elegant presentation of this information

report = df.profile_report(sort='None', html={'style':{'full_width':True}}, progress_bar=False)
#report # uncomment this line to see it here 

In [7]:
# This saves our report as an HTML file in our directory

report.to_file('mbti_report_raw.html')

## Data Cleaning

In [8]:
# Let's check if there are any duplicates

duplicate_rows = df.duplicated()
duplicate_rows.value_counts()

False    8675
dtype: int64

<br>
Since there are no missing values and no duplicate rows, there is little to do in this data cleaning section. We can, however, do a lot to create new variables.
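
For a dataset that does contain gaps or repeated rows, the same two checks can be condensed into a few lines. This is a minimal sketch on a made-up three-row frame; `toy` and its contents are assumptions for illustration and are not part of the MBTI file:

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the second one
toy = pd.DataFrame({
    'type': ['INFJ', 'ENTP', 'ENTP'],
    'posts': ['a|||b', 'c', 'c'],
})

# Missing values per column (all zeros means the data is complete)
missing_per_column = toy.isna().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Dropping the repeats keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```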

## Identifying and Creating Variables

This is actually part of the EDA section, but I do it here.

### Individual Traits

We can extract the individual traits and later evaluate each one separately. Every trait is a pair of opposite concepts:
<ul><li>Introversion (I) – Extroversion (E)</li>
<li>Intuition (N) – Sensing (S)</li>
<li>Thinking (T) – Feeling (F)</li>
<li>Judging (J) – Perceiving (P)</li></ul>

In [9]:
df_dummies = df['type'].str.get_dummies('')

Since every category is one end of a pair (I - E) (N - S) (T - F) (J - P), one column per pair is enough: a 0 in the column we keep implies the opposite letter.

In [10]:
df_dummies = df_dummies.drop(['E','S','F','P'], axis=1)

In [11]:
df = df.join(df_dummies)
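
The empty-separator call above depends on how pandas splits a string on an empty pattern, which may differ between versions. A more explicit alternative, sketched here under the assumption that every type code has exactly four letters in the usual order, compares each letter position with a reference letter. The `toy` frame and `reference_letters` are names introduced only for this illustration:

```python
import pandas as pd

# Hypothetical sample of type codes
toy = pd.DataFrame({'type': ['INFJ', 'ENTP', 'ISTP']})

# One reference letter per pair, keyed by its position in the type code
reference_letters = {0: 'I', 1: 'N', 2: 'T', 3: 'J'}

for position, letter in reference_letters.items():
    # 1 when the code carries the reference letter, 0 for its opposite
    toy[letter] = (toy['type'].str[position] == letter).astype(int)
```

This produces the same four columns (I, N, T, J) without needing the intermediate drop.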

### Splitting and Counting

In [12]:
# First I check the total length of each person's posts. This will probably provide little insight since we do not know how the data was gathered

df['posts_len'] = df['posts'].str.len()

In [13]:
# Let's split one observation to see what it contains. When we checked the head of the data we saw that posts are separated by "|||"

df.iloc[0,1].split('|||')

["'http://www.youtube.com/watch?v=qsXHcwe3krw",
 'http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg',
 'enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks',
 'What has been the most life-changing experience in your life?',
 'http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.',
 'May the PerC Experience immerse you.',
 'The last thing my INFJ friend posted on his facebook before committing suicide the next day. Rest in peace~   http://vimeo.com/22842206',
 "Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...",
 '84389  84390  http://wallpaperpassion.com/upload/23700/friendship-boy-and-girl-wallpaper.jpg  http://assets.dornob.com/wp-content/uploads/2010/04/round-ho

In [14]:
# Now that we have seen how one observation splits, we convert every observation into a list of its posts

df['posts_separated'] = df['posts'].apply(lambda x: x.split('|||'))

In [15]:
# We count how many posts were recorded for each person

df['count_posts'] = df['posts_separated'].str.len()

In [16]:
# Let's check the results from the previous line of code. It does not seem a very reliable feature, since almost every person has exactly 50 posts

df.count_posts.value_counts()

50    7587
47      82
48      79
42      61
49      60
      ... 
5        1
77       1
14       1
78       1
75       1
Name: count_posts, Length: 77, dtype: int64

In [17]:
# Maybe the average number of characters used in each post will provide us with more information

from statistics import mean
df['avg_num_char_x_post'] = [mean([len(i) for i in x]) for x in df['posts_separated']]

In [18]:
# What about the number of links a person uses?

df['num_of_links'] = [sum([post.count('http') for post in x]) for x in df['posts_separated']]
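
The three per-person features from this section can also be computed together from one raw string. The following is a small sketch in plain Python; `post_features` is a hypothetical helper written for illustration, not something used elsewhere in the notebook:

```python
from statistics import mean


def post_features(raw, separator='|||'):
    """Split one raw observation and derive the counting features."""
    posts = raw.split(separator)
    return {
        'count_posts': len(posts),
        'avg_num_char_x_post': mean(len(post) for post in posts),
        # 'https' also contains 'http', so each URL is counted once
        'num_of_links': sum(post.count('http') for post in posts),
    }


example = post_features('see http://x.y and https://z.w|||no links here')
```

Counting the substring `http` is only an approximation of the number of links: a post that merely mentions the word would be counted as well.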

### Keirsey Temperament

In [19]:
# Let's check how many unique types of MBTI profiles exist
df['type'].unique()

array(['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP',
       'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ'],
      dtype=object)

<br>
This is correct: there are 16 different personality types according to MBTI. In the EDA section we will check how many observations of each type we have, but for now we can expect that 16 classes may be too many to predict reliably. Consequently, we can try to narrow them down. We have two options: predict each pair of traits separately (e.g. Introversion vs. Extroversion) or group the types through Keirsey's Temperaments.

We need to be careful, however: Keirsey's Temperaments are "closely associated with the Myers–Briggs Type Indicator (MBTI); however, there are significant practical and theoretical differences between the two personality questionnaires and their associated different descriptions" ([Wikipedia](https://en.wikipedia.org/wiki/Keirsey_Temperament_Sorter)). If we use them, we are assuming that the content of the posts assigned to each MBTI profile can be translated into KT. We will do this, but the assumption needs to stay on record.

<img src="Figures/Keirsey_Temperament_Sorter.png" alt="Drawing" style="width: 500px;"/>

In [20]:
# We extract the Keirsey Temperaments into new columns

NF = df['type'].str.contains('NF')
df.insert(6, 'NF', NF)

NT = df['type'].str.contains('NT')
df.insert(7, 'NT', NT)

SP = df['type'].str.contains('S.P', regex=True)
df.insert(8, 'SP', SP)

SJ = df['type'].str.contains('S.J', regex=True)
df.insert(9, 'SJ', SJ)
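
If a single categorical column is more convenient than four boolean ones, the same grouping can be written as a small function. This is a sketch of the rule encoded by the four patterns above (N types are grouped by their third letter, S types by their fourth); `keirsey_temperament` is a hypothetical helper name:

```python
def keirsey_temperament(mbti_type):
    """Map a four-letter MBTI code to NF, NT, SJ or SP."""
    letters = mbti_type.upper()
    if letters[1] == 'N':
        # Intuitive types split on Thinking/Feeling: NT or NF
        return 'N' + letters[2]
    # Sensing types split on Judging/Perceiving: SJ or SP
    return 'S' + letters[3]


temperaments = [keirsey_temperament(t) for t in ['INFJ', 'ENTP', 'ISTP', 'ESFJ']]
```

With a DataFrame, the function could be applied to the `type` column to obtain one label per person.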

In [21]:
# This is another method to extract the pair of letters into one column

#IE = df.apply(lambda x:x[0][0], axis=1)
#df.insert(5,'IE',IE)
#NS = df.apply(lambda x:x[0][1], axis=1)
#df.insert(6,'NS',NS)
#FT = df.apply(lambda x:x[0][2], axis=1)
#df.insert(7,'FT',FT)
#JP = df.apply(lambda x:x[0][3], axis=1)
#df.insert(8,'JP',JP)

### Mentions of other groups

In [22]:
types = df['type'].unique()

In [23]:
# Number of self-references and mentions of other groups
# (the loop variable is not called `type`, which would shadow the Python built-in)

for mbti_type in types:
    df[mbti_type + '_mentions'] = [sum([x.casefold().count(mbti_type.casefold()) for x in post]) for post in df['posts_separated']]

In [24]:
# Select the mention columns by name rather than by position
df['Total_Mentions'] = df[[t + '_mentions' for t in types]].sum(axis=1)
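
Note that the count above is a case-insensitive substring match, so "INTPs" or a username such as "ENFJ7" also count as mentions. The sketch below contrasts that behaviour with a whole-word match built on a regular expression. `count_mentions` and its `whole_word` flag are hypothetical names, and the whole-word variant is an alternative shown for comparison, not what the notebook uses:

```python
import re


def count_mentions(posts, mbti_type, whole_word=False):
    """Count case-insensitive mentions of one type in a list of posts."""
    if whole_word:
        # \b requires a word boundary on both sides of the type code
        pattern = re.compile(r'\b' + re.escape(mbti_type) + r'\b', re.IGNORECASE)
        return sum(len(pattern.findall(post)) for post in posts)
    needle = mbti_type.casefold()
    return sum(post.casefold().count(needle) for post in posts)


sample_posts = ['Dear INTP, I enjoyed it', 'my intp friend and the INTPs']
substring_hits = count_mentions(sample_posts, 'INTP')
whole_word_hits = count_mentions(sample_posts, 'INTP', whole_word=True)
```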

### Use of Emojis

In [25]:
emojis = [':D', ';D', ':)',';)',':(','xD','XD']

In [26]:
# This counts the number of emojis used by each user

for emoji in emojis:
    df[emoji + '_count'] = [sum([x.count(emoji) for x in post]) for post in df['posts_separated']]

In [27]:
# Select the emoji count columns by name rather than by position
df['Total_Emojis'] = df[[emoji + '_count' for emoji in emojis]].sum(axis=1)

## Preparing Text

### Removing punctuation & URLs from posts

Reference for the cleaning steps below: [link](https://machinelearningmastery.com/clean-text-machine-learning-python/)

In [28]:
# We import the regular expressions package
import re

In [29]:
# Let's check how many posts we have

sum(df['count_posts'])

422845

In [30]:
# We review what our posts look like before cleaning

df['posts_separated'].head()

0    ['http://www.youtube.com/watch?v=qsXHcwe3krw, ...
1    ['I'm finding the lack of me in these posts ve...
2    ['Good one  _____   https://www.youtube.com/wa...
3    ['Dear INTP,   I enjoyed our conversation the ...
4    ['You're fired., That's another silly misconce...
Name: posts_separated, dtype: object

In [31]:
# This removes all URLs 

posts = [[re.sub(r'http\S+', '', word) for word in post] for post in df['posts_separated']] 

In [32]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [33]:
# str.maketrans works in conjunction with translate: the table maps every punctuation character above to nothing, so translate deletes it

table = str.maketrans('', '', string.punctuation)

In [34]:
stripped = [[word.translate(table) for word in post] for post in posts]

In [35]:
# Here we convert all values to lowercase
words = [[word.lower() for word in alist] for alist in stripped]

In [36]:
# We remove the posts that became empty strings after cleaning

posts = [[post for post in user_posts if post != ''] for user_posts in words]

In [37]:
# Some repeated spaces were left inside each post, so we collapse them and trim the ends

posts = [[re.sub(' +', ' ', sentence).strip() for sentence in post] for post in posts]
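
The cleaning steps of this section (URL removal, punctuation removal, lowercasing, whitespace clean-up) can be bundled into one function that works on a single post. The following is a minimal sketch that reuses the same regular expressions and translation table; `clean_post` and `PUNCT_TABLE` are hypothetical names introduced here:

```python
import re
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_post(text):
    """Apply the cleaning steps of this section to one post."""
    text = re.sub(r'http\S+', '', text)    # drop URLs
    text = text.translate(PUNCT_TABLE)      # drop punctuation
    text = text.lower()                     # normalise case
    return re.sub(' +', ' ', text).strip()  # collapse and trim spaces


cleaned = [clean_post(p) for p in ["It's only natural!", 'http://vimeo.com/22842206']]
# Posts that consisted only of a URL come out as empty strings
non_empty = [p for p in cleaned if p != '']
```

The order matters: URLs have to be removed before punctuation, otherwise the `http\S+` pattern no longer matches a full address.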

In [38]:
# With the previous steps we lost 11,350 posts (422,845 - 411,495). This could be due to posts that contained only URLs or punctuation

count = 0
for post in posts:
    count += len(post)

print(count)
print(len(posts))

411495
8675


### Tokenization & Removing Stopwords

Stopwords are usually removed because they appear in almost every text and therefore have little predictive power.

In [39]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

In [40]:
# Here we check what kind of stopwords NLTK lists for the English language
# (the corpus name is lowercase; 'English' only works on case-insensitive file systems)

stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [41]:
stop_words = set(stopwords.words('english'))

In [46]:
# this will divide each sentence into separate words

word_tokens = [[word_tokenize(words) for words in post] for post in posts]

In [50]:
# This removes the stopwords from our lists. We reuse the stop_words set built above,
# which avoids reloading the stopword list for every single word

words = [[[word for word in post if word not in stop_words] for post in user_posts] for user_posts in word_tokens]
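
The filtering step itself is a membership test against a set. The sketch below shows it in isolation so that it runs without the NLTK data: `stop_words_sample` is a small hand-picked subset assumed for illustration (not the full NLTK list), and `str.split` stands in for `word_tokenize`:

```python
# Hypothetical subset of English stopwords, for illustration only
stop_words_sample = {'i', 'me', 'my', 'the', 'of', 'in', 'a', 'on'}


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Whitespace splitting is enough here because the text is already cleaned
tokens = 'the last thing my friend posted'.split()
filtered = remove_stopwords(tokens, stop_words_sample)
```

Using a set rather than a list keeps each lookup fast, which matters when the test is repeated for every word of several hundred thousand posts.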

In [54]:
# Let's check a bit to see if it worked well

words[0][0:5]

[['enfp', 'intj', 'moments', 'sportscenter', 'top', 'ten', 'plays', 'pranks'],
 ['lifechanging', 'experience', 'life'],
 ['repeat', 'today'],
 ['may', 'perc', 'experience', 'immerse'],
 ['last',
  'thing',
  'infj',
  'friend',
  'posted',
  'facebook',
  'committing',
  'suicide',
  'next',
  'day',
  'rest',
  'peace']]

In [56]:
count = 0
for word in words:
    count += len(word)

print(count)
print(len(words))

411495
8675


### Lemmatizing

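Reduce each word to its dictionary form (lemma), e.g. "moments" becomes "moment". WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, so verb forms are generally left unchanged.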

In [57]:
from nltk.stem import WordNetLemmatizer

In [130]:
# Sample words for trying the lemmatizer by hand (not used in the cells below)
hi = ['Foot','feet','nights','waking']

In [58]:
lemmatizer = WordNetLemmatizer()

In [59]:
words_lemmatized = [[[lemmatizer.lemmatize(i) for i in elements] for elements in post] for post in words]

In [60]:
# We check if it worked well

words_lemmatized[0][0:1]

[['enfp', 'intj', 'moment', 'sportscenter', 'top', 'ten', 'play', 'prank']]

In [61]:
# We assign the new list of words to a column of our dataset

df['posts_clean'] = words_lemmatized

In [62]:
# Let's check if it has worked. Comparing both columns, the cleaned version seems quite faithful to what we were looking for.

df[['posts_separated', 'posts_clean']].head()

Unnamed: 0,posts_separated,posts_clean
0,"['http://www.youtube.com/watch?v=qsXHcwe3krw, ...","[[enfp, intj, moment, sportscenter, top, ten, ..."
1,['I'm finding the lack of me in these posts ve...,"[[im, finding, lack, post, alarming], [sex, bo..."
2,['Good one _____ https://www.youtube.com/wa...,"[[good, one], [course, say, know, thats, bless..."
3,"['Dear INTP, I enjoyed our conversation the ...","[[dear, intp, enjoyed, conversation, day, esot..."
4,"['You're fired., That's another silly misconce...","[[youre, fired], [thats, another, silly, misco..."


## Sentiment Analysis

"Sentiment analysis is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes." B. Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2012. [link](https://www.researchgate.net/post/What_is_the_best_way_to_do_a_sentiment_analysis)

In [63]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

In [64]:
def analyze_sentiment(tokens):
    """Count the tokens of one post by TextBlob polarity.

    Returns [positive, negative, neutral].
    """
    pos_counter = 0
    neg_counter = 0
    neu_counter = 0

    for token in tokens:
        polarity = TextBlob(token).sentiment.polarity

        # Only scores beyond +/-0.6 count as positive or negative
        if polarity >= 0.6:
            pos_counter += 1
        elif polarity <= -0.6:
            neg_counter += 1
        else:
            neu_counter += 1

    return [pos_counter, neg_counter, neu_counter]

In [68]:
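# Score every post of the first user. Each post is a list of tokens, so the words are scored one by one
sa = [analyze_sentiment(post_tokens) for post_tokens in df['posts_clean'][0]]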
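
The bucketing rule is independent of TextBlob and can be illustrated with plain numbers. In this sketch the polarity values are made up for illustration (they are not TextBlob output), and `bucket` is a hypothetical helper that mirrors the ±0.6 thresholds used above:

```python
from collections import Counter


def bucket(polarity, threshold=0.6):
    """Label a polarity score in [-1, 1] as 'pos', 'neg' or 'neu'."""
    if polarity >= threshold:
        return 'pos'
    if polarity <= -threshold:
        return 'neg'
    return 'neu'


# Hypothetical polarity scores, one per token
scores = [0.7, 0.0, -0.2, 0.7, -0.8]
counts = Counter(bucket(score) for score in scores)
```

With a threshold this strict, only strongly polarised words leave the neutral bucket, which is worth keeping in mind when reading the counts.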

<br>
I take one of the results to see how each word was classified.

In [76]:
sa[11]

[2, 0, 16]

In [77]:
df['posts_clean'][0][11]

['thing',
 'moderation',
 'sims',
 'indeed',
 'video',
 'game',
 'good',
 'one',
 'note',
 'good',
 'one',
 'somewhat',
 'subjective',
 'completely',
 'promoting',
 'death',
 'given',
 'sim']

It scored both occurrences of "good" as positive but left "death" as neutral. I would say it did not do badly, but it is not very reliable. I decided to test the same sample with the NaiveBayesAnalyzer.

In [83]:
test = df['posts_clean'][0][11]

In [85]:
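# Build the analyzer once: a new NaiveBayesAnalyzer on every call would be retrained each time
nb_analyzer = NaiveBayesAnalyzer()

def analyze_sentiment_NBA(text):
    result = TextBlob(text, analyzer=nb_analyzer)
    classification, pos, neg = result.sentiment
    print(classification)
    print(pos)
    print(neg)
    print('-----')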

In [87]:
for i in test:
    print(i)
    analyze_sentiment_NBA(i)

thing
neg
0.462648556876061
0.5373514431239389
-----
moderation
neg
0.16666666666666646
0.833333333333333
-----
sims
pos
0.6499999999999997
0.35000000000000003
-----
indeed
pos
0.5576923076923077
0.44230769230769224
-----
video
neg
0.39539748953974896
0.6046025104602512
-----
game
pos
0.5530303030303029
0.4469696969696968
-----
good
pos
0.5042265426880812
0.4957734573119189
-----
one
pos
0.5061902082160945
0.4938097917839054
-----
note
pos
0.5232558139534884
0.4767441860465117
-----
good
pos
0.5042265426880812
0.4957734573119189
-----
one
pos
0.5061902082160945
0.4938097917839054
-----
somewhat
pos
0.6272727272727273
0.3727272727272728
-----
subjective
pos
0.6499999999999997
0.35000000000000003
-----
completely
neg
0.4661971830985916
0.5338028169014085
-----
promoting
pos
0.5
0.5
-----
death
pos
0.5542763157894737
0.44572368421052644
-----
given
neg
0.4542079207920791
0.5457920792079208
-----
sim
pos
0.5
0.5
-----


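As we can see here, the NaiveBayesAnalyzer did even worse: most probabilities sit close to 0.5, and "death" is classified as positive. This is not too surprising, since its classifier is trained on movie reviews and here it only sees isolated words.

**I am not completely convinced by these results, so for now I will not add them as a column. Maybe later we can use other models or libraries to try again.**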

## Checking Interim Dataset

In [88]:
# Let's check the current dataframe

df.head()

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,...,Total_Mentions,:D_count,;D_count,:)_count,;)_count,:(_count,xD_count,XD_count,Total_Emojis,posts_clean
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,True,False,False,False,...,8,0,0,0,0,0,0,0,0,"[[enfp, intj, moment, sportscenter, top, ten, ..."
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,False,True,False,False,...,19,9,3,5,0,0,0,0,17,"[[im, finding, lack, post, alarming], [sex, bo..."
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,False,True,False,False,...,4,2,0,7,0,0,0,0,9,"[[good, one], [course, say, know, thats, bless..."
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1,False,True,False,False,...,12,0,0,0,0,0,0,0,0,"[[dear, intp, enjoyed, conversation, day, esot..."
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1,False,True,False,False,...,5,0,0,0,1,0,0,1,2,"[[youre, fired], [thats, another, silly, misco..."


In [89]:
df.shape

(8675, 41)

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   type                 8675 non-null   object 
 1   posts                8675 non-null   object 
 2   I                    8675 non-null   int64  
 3   J                    8675 non-null   int64  
 4   N                    8675 non-null   int64  
 5   T                    8675 non-null   int64  
 6   NF                   8675 non-null   bool   
 7   NT                   8675 non-null   bool   
 8   SP                   8675 non-null   bool   
 9   SJ                   8675 non-null   bool   
 10  posts_len            8675 non-null   int64  
 11  posts_separated      8675 non-null   object 
 12  count_posts          8675 non-null   int64  
 13  avg_num_char_x_post  8675 non-null   float64
 14  num_of_links         8675 non-null   int64  
 15  INFJ_mentions        8675 non-null   i

## Creating an interim report with Pandas Profiling

In [91]:
report = df.profile_report(sort='None', html={'style':{'full_width':True}}, progress_bar=False)
report.to_file('mbti_report_interim.html')

## Saving the dataset

Unfortunately, the dataset is too large to upload to GitHub at this point. I have used **find ./* -size +100M | cat >> .gitignore** to append every file over 100 MB to the .gitignore file and avoid uploading it.
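
The same check can be done from Python before committing. This is a rough sketch using only the standard library; `files_over` is a hypothetical helper, and its byte threshold only approximates what `find -size +100M` matches, since `find` rounds sizes up to whole units:

```python
import os


def files_over(root, limit_bytes=100 * 1024 * 1024):
    """Return paths (relative to root) of files larger than limit_bytes."""
    large = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > limit_bytes:
                large.append(os.path.relpath(path, root))
    return sorted(large)
```

The returned paths could then be reviewed and added to .gitignore by hand, which avoids appending the same entries twice when the check is repeated.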

In [92]:
df.to_csv('../../data/mbti_interim.csv')