# MBTI Project - Data Wrangling

***
<br>


<div class="span5 alert alert-info">
<h3>Introduction</h3>
    <p>This is the first section after the <b>problem identification</b> step. In this notebook I do the data wrangling: taking the raw data and preparing it for processing and analysis. The dataset needed little cleaning, since it has only 2 columns and no empty or duplicate values. The main work in this notebook is therefore creating new features. <b>Note</b>: to practice Object-Oriented Programming (OOP), I created a .py file called "feature_extraction" that performs some of the steps in this notebook. This is not optimal, because I will not be creating multiple instances of the class, but it served as practice nonetheless.</p>
</div>

<br>

### Table of Contents

- [Importing Libraries](#importing)
- [Data Collection](#collection)
- [Data Definition](#definition)
- [Data Cleaning](#cleaning)
- [Identifying and Creating Variables](#variables)
    - [Individual Traits (I/E, N/S, F/T, J/P)](#traits)
    - [Length of Gathered Posts](#len_posts)
    - [Splitting Posts into Individual Posts](#split)
    - [Average Number of Characters per Post](#avg_num_char)
    - [Number of Hyperlinks per Post](#links)
    - [Keirsey Temperaments](#keirsey)
    - [Use of Emoticons](#emojis)
    - [Mentions of other types and same type](#mentions)
    - [Word Extraction](#words)
    - [Positive and Negative word count](#pos_neg)    
- [Pandas Profile Report](#interim)
- [Saving the dataset](#saving)

<a id='importing'></a>

## Importing libraries

***

In [1]:
import pandas as pd
import numpy as np
import re
import string
import feature_extraction # this is the .py document I created for the purpose of this notebook

<a id='collection'></a>

## Data Collection

***

Unfortunately, the dataset I am using is too large for [GitHub's file size limits](https://help.github.com/en/github/managing-large-files/conditions-for-large-files). For this reason, the csv file is not shared in the repository, but it can be downloaded directly from [Kaggle](https://www.kaggle.com/datasnaek/mbti-type). The dataset was put together by M. Jolly.

In [2]:
# We load the dataset
df = pd.read_csv(r'/Users/diego/Google Drive/2. Business Intelligence/Data/MBTI/mbti_1.csv')

<a id='definition'></a>

## Data Definition
***

In [3]:
# Check the shape
df.shape

(8675, 2)

In [4]:
# Look at how the data is organized
df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [5]:
# We check the data types and see if there are null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    8675 non-null   object
 1   posts   8675 non-null   object
dtypes: object(2)
memory usage: 135.7+ KB


<br>
As we can see, the dataset is complete: it has no missing values. However, it only has 2 columns.
<br>
<br>

<a id='cleaning'></a>

## Data Cleaning
***

In [6]:
# Let's check if there are any duplicates
duplicate_rows = df.duplicated()
duplicate_rows.value_counts()

False    8675
dtype: int64


Since there are no missing values and no duplicate rows, there is little to do in terms of data cleaning. There is, however, a lot we can do to create new variables.
<br>

<a id='variables'></a>

## Identifying and Creating Variables
***

**Note**: From here onward I use the feature_extraction.py file mentioned before. Most of the steps here do not require it, but it was a good opportunity to practice OOP.

In [7]:
# we create an instance of the feature extraction class
fe = feature_extraction.Feature_Extraction(df)
#fe = Feature_Extraction(df)

In [8]:
print(repr(fe))
print()
print(str(fe))

      type                                              posts
0     INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   https://www.youtube.com/wat...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
...    ...                                                ...
8670  ISFP  'https://www.youtube.com/watch?v=t8edHB_h908||...
8671  ENFP  'So...if this thread already exists someplace ...
8672  INTP  'So many questions when i do these things.  I ...
8673  INFP  'I am very conflicted right now when it comes ...
8674  INFP  'It has been too long since I have been on per...

[8675 rows x 2 columns]

'The object created is the dataframe: 'df', the target column: 'type', and the text input column:'posts'


<br>
<br>

**Note**: the feature_extraction document has a method called `.fit()` which will do the first actions in this notebook, all at once. These are:
- dummies_types() which creates dummies for the "type" column
- posts_len() which will generate a column with the length of the "posts" column
- split_posts() which will separate the "posts" column into individual posts
- avg_num_char() which will provide the average number of characters per post
- num_links() which will return the number of hyperlinks a user made

I will not use this method here. Instead, I will go through the steps one by one to show how each works individually.
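
To illustrate what "all at once" means, here is a minimal sketch of a class whose `fit()` simply calls its individual steps in order. `MiniFeatureExtraction` is a hypothetical stand-in written for this example; it is not the code in feature_extraction.py, and it only includes two of the steps listed above.

```python
import pandas as pd


class MiniFeatureExtraction:
    """Hypothetical, stripped-down stand-in for the notebook's helper class."""

    def __init__(self, df):
        self.df = df

    def posts_len(self, column="posts", new_col_name="posts_len"):
        # Length in characters of the raw, unsplit text
        self.df[new_col_name] = self.df[column].str.len()
        return self.df

    def split_posts(self, column="posts", separator="|||",
                    new_col_name="posts_separated"):
        # Plain Python split, so the separator is always treated literally
        self.df[new_col_name] = self.df[column].map(
            lambda text: text.split(separator)
        )
        return self.df

    def fit(self):
        # Run every step in order, so a single call prepares the frame
        self.posts_len()
        self.split_posts()
        return self.df


toy = pd.DataFrame({"type": ["INFJ"], "posts": ["hello|||world"]})
result = MiniFeatureExtraction(toy).fit()
print(result.columns.tolist())
```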

<a id='traits'></a>

### Individual Traits
***

We can extract the four individual traits and later evaluate each one separately. Each trait is a binary category made up of two opposing preferences:
<ul><li>Introversion (I) – Extroversion (E)</li>
<li>Intuition (N) – Sensing (S)</li>
<li>Thinking (T) – Feeling (F)</li>
<li>Judging (J) – Perceiving (P)</li></ul>

In [9]:
# This line of code creates dummy variables for the "type" column
df = fe.dummies_types(column='type', drop_first=True)

In [10]:
df

Unnamed: 0,type,posts,I,J,N,T
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1
...,...,...,...,...,...,...
8670,ISFP,'https://www.youtube.com/watch?v=t8edHB_h908||...,1,0,0,0
8671,ENFP,'So...if this thread already exists someplace ...,0,0,1,0
8672,INTP,'So many questions when i do these things. I ...,1,0,1,1
8673,INFP,'I am very conflicted right now when it comes ...,1,0,1,0
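
As a rough illustration of what these four columns encode, the sketch below builds them directly from the letters of the type code. This is an assumption about the result, not the implementation of `dummies_types()`; the toy frame and the `letters` list are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"type": ["INFJ", "ENTP", "ISFP"]})

# (position in the four-letter code, letter that is encoded as 1)
letters = [(0, "I"), (3, "J"), (1, "N"), (2, "T")]
for position, letter in letters:
    # 1 if the type has this letter at this position, otherwise 0
    toy[letter] = (toy["type"].str[position] == letter).astype(int)

print(toy)
```

With this encoding a 0 stands for the opposite preference: for example, `I == 0` means Extroversion and `T == 0` means Feeling.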


<a id='len_posts'></a>
<br>

### Counting Length of Gathered Posts
***

In [11]:
# First we create a new feature with the total length of each user's posts. This will probably provide little insight, since we do not know how the data was gathered
fe.posts_len(df,'posts','posts_len')
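
A minimal sketch of the same idea in plain pandas, assuming that `posts_len()` counts the characters of the raw text (separators included). The toy data is made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"posts": ["first post|||second post", "only one"]})

# Character count of the whole string, including the "|||" separators
toy["posts_len"] = toy["posts"].str.len()

print(toy)
```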

<a id='split'></a>
<br>

### Splitting Posts into Individual Posts
***

In [12]:
# Let's split one observation to see what it contains. When we looked at the first rows we saw that individual posts are separated by "|||"
df.iloc[0,1].split('|||')[0:10]

["'http://www.youtube.com/watch?v=qsXHcwe3krw",
 'http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg',
 'enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks',
 'What has been the most life-changing experience in your life?',
 'http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.',
 'May the PerC Experience immerse you.',
 'The last thing my INFJ friend posted on his facebook before committing suicide the next day. Rest in peace~   http://vimeo.com/22842206',
 "Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...",
 '84389  84390  http://wallpaperpassion.com/upload/23700/friendship-boy-and-girl-wallpaper.jpg  http://assets.dornob.com/wp-content/uploads/2010/04/round-ho

In [13]:
# This generates a column with the individual posts as a list. If we set count_posts to True, it also generates a column with the total number of posts gathered for that person
fe.split_posts(column='posts', separator='|||', new_col_name='posts_separated', count_posts=True)

Unnamed: 0,type,posts,I,J,N,T,posts_len,posts_separated,count_posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,4652,"['http://www.youtube.com/watch?v=qsXHcwe3krw, ...",50
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,7053,['I'm finding the lack of me in these posts ve...,50
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,5265,['Good one _____ https://www.youtube.com/wa...,50
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1,6271,"['Dear INTP, I enjoyed our conversation the ...",50
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1,6111,"['You're fired., That's another silly misconce...",50
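
The split itself can be sketched in a few lines. This is an assumption about what `split_posts()` does, using made-up data. Python's own `str.split` is used on purpose: pandas' `Series.str.split` may interpret a multi-character separator as a regular expression, in which `|` means alternation.

```python
import pandas as pd

toy = pd.DataFrame({"posts": ["a|||b|||c", "single post"]})

# One list of individual posts per user
toy["posts_separated"] = toy["posts"].map(lambda text: text.split("|||"))

# Number of individual posts per user
toy["count_posts"] = toy["posts_separated"].map(len)

print(toy)
```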


In [14]:
# Let's check the results from the previous line of code. This does not seem to be a very informative feature, since it is heavily skewed towards 50
df.count_posts.value_counts().head(10)

50    7587
47      82
48      79
42      61
49      60
46      54
44      52
39      39
40      37
37      35
Name: count_posts, dtype: int64

<a id='avg_num_char'></a>
<br>

### Average Number of Characters per Post
***

In [16]:
# Maybe the average number of characters per post will give us more information, since the number of posts was not very useful
fe.avg_num_char(column='posts_separated', new_col_name='avg_num_char_x_post')
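
A minimal sketch of this feature, assuming it is the mean length of the individual posts in `posts_separated`. The outputs above are consistent with this reading: for the first user, 50 posts averaging 90.1 characters plus 49 separators of 3 characters give 4652, the value in `posts_len`. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({"posts_separated": [["abcd", "ab"], ["abc"]]})

# Mean number of characters over the individual posts of each user
toy["avg_num_char_x_post"] = toy["posts_separated"].map(
    lambda posts: sum(len(p) for p in posts) / len(posts)
)

print(toy)
```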

<a id='links'></a>
<br>

### Number of Hyperlinks per Post
***

In [17]:
# What about the number of links a person uses?
fe.num_links(column='posts_separated', new_col_name='num_of_links')
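
One way to count links is a simple regular expression over each post. The pattern below (`https?://` followed by non-whitespace characters) is an assumption for illustration; `num_links()` may define a link differently.

```python
import re

import pandas as pd

# Hypothetical pattern: anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")

toy = pd.DataFrame({"posts_separated": [
    ["http://example.com/a see this", "no link here"],
    ["https://example.com/b  https://example.com/c"],
]})

# Total number of links across all posts of each user
toy["num_of_links"] = toy["posts_separated"].map(
    lambda posts: sum(len(URL_PATTERN.findall(p)) for p in posts)
)

print(toy)
```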

<a id='keirsey'></a>
<br>

### Keirsey Temperament
***

In [18]:
# Let's check how many unique types of MBTI profiles exist
types = df['type'].unique()
types

array(['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP',
       'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ'],
      dtype=object)

<br>
This is correct: there are 16 different personality types according to MBTI. In the EDA section we will check how many observations of each type we have, but for now we can expect that 16 classes may be too many to predict. Consequently, we can try to narrow them down. We have two options: predict each dichotomy separately (e.g. Introversion vs. Extroversion), or group the types through Keirsey's Temperaments.

We need to be careful, however. Keirsey's Temperaments (KT) are "closely associated with the Myers–Briggs Type Indicator (MBTI); however, there are significant practical and theoretical differences between the two personality questionnaires and their associated different descriptions" ([Wikipedia](https://en.wikipedia.org/wiki/Keirsey_Temperament_Sorter)). If we use them, we are assuming that the content of the posts assigned to each MBTI profile can be translated into KT. We need to keep this in mind.

In [19]:
# We extract the Keirsey Temperaments into new columns
for num, kt in zip(np.arange(6,10), ['NF','NT','S.P','S.J']):
    fe.keisey_temp('type', kt, col_index=int(num))
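
The patterns `'S.P'` and `'S.J'` read naturally as regular expressions, where `.` matches any single letter (so `'S.P'` covers both ISFP and ESTP). Under that assumption, the grouping can be sketched in plain pandas as follows; the toy frame is made up and the column placement done by `col_index` is left out.

```python
import pandas as pd

toy = pd.DataFrame({"type": ["INFJ", "ENTP", "ISFP", "ESTJ"]})

for pattern in ["NF", "NT", "S.P", "S.J"]:
    # Column name without the regex wildcard, e.g. "S.P" -> "SP"
    col_name = pattern.replace(".", "")
    # 1 if the type code matches the temperament pattern, otherwise 0
    toy[col_name] = toy["type"].str.contains(pattern, regex=True).astype(int)

print(toy)
```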

In [20]:
df.head(3)

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,posts_len,posts_separated,count_posts,avg_num_char_x_post,num_of_links
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,1,0,0,0,4652,"['http://www.youtube.com/watch?v=qsXHcwe3krw, ...",50,90.1,24
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,0,1,0,0,7053,['I'm finding the lack of me in these posts ve...,50,138.12,10
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,0,1,0,0,5265,['Good one _____ https://www.youtube.com/wa...,50,102.36,5


<a id='emojis'></a>
<br>

### Use of Emoticons
***

In [21]:
emoticons = [':D', ';D', ':)',';)',':(','xD','XD']

In [22]:
fe.get_emoticons('posts_separated', emoticons, total=True, average=True)

In [23]:
df.head()

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,...,num_of_links,:D_count,;D_count,:)_count,;)_count,:(_count,xD_count,XD_count,total_emoticons,avg_emoticons_per_post
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,1,0,0,0,...,24,0,0,0,0,0,0,0,0,0.0
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,0,1,0,0,...,10,9,3,5,0,0,0,0,17,0.34
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,0,1,0,0,...,5,2,0,7,0,0,0,0,9,0.18
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1,0,1,0,0,...,2,0,0,0,0,0,0,0,0,0.0
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1,0,1,0,0,...,6,0,0,0,1,0,0,1,2,0.04
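
A minimal sketch of how such counts can be produced, assuming plain substring counting per post. This is not the code of `get_emoticons()`; the toy posts are made up, and only the column naming follows the output above.

```python
import pandas as pd

emoticons = [":D", ";D", ":)", ";)", ":(", "xD", "XD"]
toy = pd.DataFrame({"posts_separated": [["hi :)", "lol xD :)"], ["plain text"]]})

for emo in emoticons:
    # emo=emo binds the current emoticon to the lambda
    toy[f"{emo}_count"] = toy["posts_separated"].map(
        lambda posts, emo=emo: sum(p.count(emo) for p in posts)
    )

count_cols = [f"{emo}_count" for emo in emoticons]
toy["total_emoticons"] = toy[count_cols].sum(axis=1)
toy["avg_emoticons_per_post"] = (
    toy["total_emoticons"] / toy["posts_separated"].map(len)
)

print(toy[["total_emoticons", "avg_emoticons_per_post"]])
```

The average in the real output follows the same arithmetic: 17 emoticons over 50 posts gives 0.34.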


<a id='mentions'></a>
<br>

### Mentions of same MBTI types and other MBTI types
***

In [23]:
fe.get_mentions('posts_separated', types, total=True, average=True)

In [24]:
df.head()

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,...,ISFP_mentions,ISTP_mentions,ISFJ_mentions,ISTJ_mentions,ESTP_mentions,ESFP_mentions,ESTJ_mentions,ESFJ_mentions,total_mentions,avg_mentions_per_post
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,8,0.16
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,0,1,0,0,...,0,0,0,0,0,0,2,0,19,0.38
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,0,1,0,0,...,0,1,0,0,0,0,0,0,4,0.08
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1,0,1,0,0,...,0,0,0,0,0,0,0,3,12,0.24
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,5,0.1
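
How mentions are matched is a design choice. The sketch below is one possible approach, not the code of `get_mentions()`: it assumes case-insensitive matching on word boundaries and also accepts a plural "s" (e.g. "INTPs"). The toy data and the shortened `types` list are made up.

```python
import re

import pandas as pd

types = ["INFJ", "ENTP", "INTP"]
toy = pd.DataFrame({"posts_separated": [
    ["Dear INTP, hello", "my infj friend and another INTP"],
]})

for mbti in types:
    # Whole-word match, optional plural "s", ignoring case
    pattern = re.compile(rf"\b{mbti}s?\b", flags=re.IGNORECASE)
    toy[f"{mbti}_mentions"] = toy["posts_separated"].map(
        lambda posts, pattern=pattern: sum(len(pattern.findall(p)) for p in posts)
    )

toy["total_mentions"] = toy[[f"{t}_mentions" for t in types]].sum(axis=1)

print(toy.drop(columns="posts_separated"))
```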


<a id='words'></a>
<br>

### Word Extraction
***

In [25]:
# Leave a space before and after each word so that words merely containing these letters are not counted. Here we search for 1st person pronouns.
word_list = [' I ', ' me ', ' my ', ' mine ', ' myself ', ' We ', ' us ', ' our ',' ourselves ']
fe.extract_words(word_list, total=True, total_col_name='total_first_person', average=True, avg_col_name='avg_first_person')

In [26]:
# with this list we search for 2nd person pronouns.
word_list = [' you ', ' your ', ' yours ', ' yourself ', ' yourselves ']
fe.extract_words(word_list, total=True, total_col_name='total_second_person', average=True, avg_col_name='avg_second_person')

In [27]:
# with this list we search for 3rd person pronouns.
word_list = [' he ', ' him ', ' his ', ' himself ', ' she ', ' her ', ' hers ', ' herself ', ' they ', ' them ', ' their ', ' theirs ', ' themselves '] # it, its, itself, left out
fe.extract_words(word_list, total=True, total_col_name='total_third_person', average=True, avg_col_name='avg_third_person')

In [28]:
df.head()

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,...,her_count,hers_count,herself_count,they_count,them_count,their_count,theirs_count,themselves_count,total_third_person,avg_third_person
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,1,0,0,0,...,0,0,1,1,1,2,0,0,7,0.14
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,0,1,0,0,...,0,0,0,3,5,4,0,1,20,0.4
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,0,1,0,0,...,0,0,0,0,2,0,0,0,3,0.06
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",1,1,1,1,0,1,0,0,...,5,0,1,1,2,2,0,0,15,0.3
4,ENTJ,'You're fired.|||That's another silly misconce...,0,1,1,1,0,1,0,0,...,3,0,0,9,2,6,0,1,37,0.74
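
Padding words with spaces is simple, but it misses words at the very start of the text or next to punctuation. The sketch below compares it with regular-expression word boundaries (`\b`) on a made-up sentence. It is an illustration of the trade-off, not a description of how `extract_words()` works.

```python
import re

post = "I think my plan is mine, not yours. Give it to me!"

# Substring matching with padded spaces, as in the word lists above
padded_hits = sum(post.count(word) for word in [" I ", " me ", " my ", " mine "])

# Word-boundary matching also catches "I" at the start, "mine," and "me!"
pattern = re.compile(r"\b(?:I|me|my|mine)\b")
boundary_hits = len(pattern.findall(post))

print(padded_hits, boundary_hits)
```

Here only " my " is surrounded by spaces on both sides, so the padded approach finds one pronoun where the boundary approach finds four.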


<a id='pos_neg'></a>
<br>

### Positive and Negative word count
***

**Note:** the two word lists used below come from:<br>
Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.<br>

The text files can be found in [mkulakowski2's GitHub account](https://gist.github.com/mkulakowski2)

The idea is to use these lists of positive and negative terms to count how many times each person uses them. This is not an exact science, and the approach has shortcomings. In particular, it does not consider negating terms: the program will not recognize "not happy" as negative, it will simply count "happy" as a positive word.
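
The negation problem is easy to see on a tiny example. The helper `count_lexicon_hits` and the two short word lists below are hypothetical and only mirror the space-padded, case-insensitive counting used in this notebook.

```python
def count_lexicon_hits(text, lexicon):
    """Count how often any space-padded lexicon entry occurs in the text."""
    lowered = text.casefold()
    return sum(lowered.count(entry.casefold()) for entry in lexicon)


# Made-up mini lexicons, padded with spaces like the real lists
positive = [" happy ", " great "]
negative = [" sad ", " awful "]

post = " I am not happy about this "
pos = count_lexicon_hits(post, positive)
neg = count_lexicon_hits(post, negative)

print(pos, neg)
```

The sentence is counted as one positive word and no negative words, even though its meaning is negative.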

In [29]:
# Read the negative word list, padding each word with spaces so that only whole words match
with open('words/negative-words.txt', mode='r') as file:
    negative_list = []
    for row in file:
        negative_list.append(' ' + row + ' ')

In [30]:
# Remove the newline character that each line read from the file still carries
neg_list_clean = []
for i in negative_list:
    i = i.replace('\n','')
    neg_list_clean.append(i)

In [31]:
with open('words/positive-words.txt', mode='r') as file:
    positive_list = []
    for row in file:
        positive_list.append(str(' '+row+' '))

In [32]:
pos_list_clean = []
for i in positive_list:
    i = i.replace('\n','')
    pos_list_clean.append(i)

In [33]:
print(len(neg_list_clean))
print(len(pos_list_clean))

4783
2006


In [34]:
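# Count, for each user, how many times any positive word appears in their posts
positive_counter = []
for post in df.posts:
    post_folded = post.casefold()  # case-fold once per user instead of once per word
    inner_count = 0
    for item in pos_list_clean:
        inner_count += post_folded.count(item.casefold())

    positive_counter.append(inner_count)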

In [35]:
df['positive_words'] = positive_counter

In [36]:
df['avg_positive_words'] = df.positive_words / df.count_posts

In [37]:
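# Count, for each user, how many times any negative word appears in their posts
negative_counter = []
for post in df.posts:
    post_folded = post.casefold()  # case-fold once per user instead of once per word
    inner_count = 0
    for item in neg_list_clean:
        inner_count += post_folded.count(item.casefold())

    negative_counter.append(inner_count)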

In [38]:
df['negative_words'] = negative_counter

In [39]:
df['avg_negative_words'] = df.negative_words / df.count_posts

<a id='interim'></a>
<br>

## Creating an interim report with Pandas Profiling
***

In [40]:
# Let's check the final dataframe
df.head(3)

Unnamed: 0,type,posts,I,J,N,T,NF,NT,SP,SJ,...,them_count,their_count,theirs_count,themselves_count,total_third_person,avg_third_person,positive_words,avg_positive_words,negative_words,avg_negative_words
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,1,1,1,0,1,0,0,0,...,1,2,0,0,7,0.14,19,0.38,9,0.18
1,ENTP,'I'm finding the lack of me in these posts ver...,0,0,1,1,0,1,0,0,...,5,4,0,1,20,0.4,38,0.76,24,0.48
2,INTP,'Good one _____ https://www.youtube.com/wat...,1,0,1,1,0,1,0,0,...,2,0,0,0,3,0.06,35,0.7,18,0.36


In [41]:
# As we can see, we have created 77 new columns which, added to the 2 original ones, give a total of 79
df.shape

(8675, 79)

In [42]:
# We check the types and make sure there are no null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8675 entries, 0 to 8674
Data columns (total 79 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   type                    8675 non-null   object 
 1   posts                   8675 non-null   object 
 2   I                       8675 non-null   int64  
 3   J                       8675 non-null   int64  
 4   N                       8675 non-null   int64  
 5   T                       8675 non-null   int64  
 6   NF                      8675 non-null   int64  
 7   NT                      8675 non-null   int64  
 8   SP                      8675 non-null   int64  
 9   SJ                      8675 non-null   int64  
 10  posts_len               8675 non-null   int64  
 11  posts_separated         8675 non-null   object 
 12  count_posts             8675 non-null   int64  
 13  avg_num_char_x_post     8675 non-null   float64
 14  num_of_links            8675 non-null   

In [43]:
# We generate the interim pandas report.
#from pandas_profiling import ProfileReport
#from pandas_profiling.utils.cache import cache_file
#report = df.profile_report(sort='None', html={'style':{'full_width':True}}, progress_bar=True)
#report.to_file('pandas_profiling/mbti_report_interim.html')


<a id='saving'></a>


## Saving the dataset
***

Unfortunately, the dataset is too large to upload to GitHub at this point. To avoid uploading it, I used the command `find ./* -size +100M | cat >> .gitignore`, which appends the path of every file larger than 100 MB to the .gitignore file.

In [44]:
# finally, we save the new dataframe as a csv file.
df.to_csv('../../data/mbti_interim.csv')
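
Two details are worth keeping in mind when this file is loaded again. With the default settings, `to_csv` also writes the index, which shows up as an extra "Unnamed: 0" column on reload. In addition, a column of Python lists such as `posts_separated` is stored as text. The sketch below shows both points on a made-up frame, writing to an in-memory buffer instead of a real file.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"type": ["INFJ"], "posts_separated": [["a", "b"]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids the extra index column
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# The list column comes back as its string representation...
raw_value = reloaded.loc[0, "posts_separated"]
print(type(raw_value))

# ...so it has to be parsed to get Python lists again
reloaded["posts_separated"] = reloaded["posts_separated"].map(ast.literal_eval)
print(reloaded.loc[0, "posts_separated"])
```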