Myers-Briggs Social Media Personality Classifier
----

Aspects to note:
----
- The dataset is predominantly Introverted (roughly 77%).
- Feeling vs. Thinking is the most evenly split feature (roughly 54/46).
- Judging vs. Perceiving is roughly a 40/60 split, with Perceiving the more common.
- There are very few Sensing individuals (roughly 14%); most are Intuitive.

Goals:
----
- Remove links from the dataset.
- Analyze the sentiment associated with each text.
- Analyze the similarity between texts of the same personality type.
- Mine underlying features in the text by applying known feature extraction techniques.
- Build a sequence model from the bigram and trigram structure.
- (Potential) Score the academic level of the speaker's language.
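
As a rough illustration of the bigram and trigram goal, the sketch below counts word n-grams in a single post using only the standard library. The helper name `ngrams`, the sample post, and the simple lowercase tokenization are assumptions made for this example; the notebook itself does not yet define any of them.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the word n-grams (as tuples) found in `text`, in order."""
    # Assumption: a token is a run of lowercase letters, digits or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample post, used only to exercise the helper.
post = "Welcome and stuff. Welcome and enjoy."
bigram_counts = Counter(ngrams(post, 2))
trigram_counts = Counter(ngrams(post, 3))

print(bigram_counts.most_common(1))
```

Counts like these could later be turned into per-individual feature vectors, though how best to normalize them is left open here.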

Things to try:
----
- Train a classifier with the data split by single letter ("E" vs. "I" vs. "S", etc.), i.e. unaries.
- Train a classifier with the data split by letter pair ("ES" vs. "SF" vs. "TP", etc.), i.e. binaries.
- Train a classifier with the data split by full type, e.g. "ENTJ".
- Combine those three classifiers, trained on the mined features, into an aggregate classification vote.
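
To make the three label schemes concrete, here is a minimal sketch of how the unary and pair labels could be derived from a full type, and of one possible way to aggregate votes. The function names are hypothetical, and the per-letter majority vote is only an assumption about what "aggregate vote" might mean; it also assumes each classifier's output has already been turned into a full four-letter guess.

```python
from collections import Counter

def unary_labels(mbti_type):
    """Split a full type such as 'ENTJ' into its four single-letter labels."""
    return list(mbti_type)

def pair_labels(mbti_type):
    """Return the adjacent two-letter labels, e.g. 'ENTJ' -> ['EN', 'NT', 'TJ']."""
    return [mbti_type[i:i + 2] for i in range(len(mbti_type) - 1)]

def aggregate_vote(predictions):
    """Majority vote per letter position over several predicted full types.

    Assumption: every prediction is a four-letter string. With an odd number
    of voters and two letters per position, no ties can occur.
    """
    voted = ""
    for position in range(4):
        letters = Counter(p[position] for p in predictions)
        voted += letters.most_common(1)[0][0]
    return voted

print(unary_labels("ENTJ"))
print(pair_labels("ENTJ"))
print(aggregate_vote(["ENTJ", "INTJ", "ENTP"]))
```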

Metrics to explain:
----
- Mechanics behind feature engineering.
- Data split, i.e. assumptions and bias introduced by the example counts.
- Principal component analysis of the features.
- Clustering analysis of personality types.
- Support Vector Machine model on the features.
- F1 score of our model.
- Precision/recall tradeoff.
- ROC curve.
- AUC metric to compare different classifiers' performance on the data.
- Confusion matrix to show where the data was misclassified.
- Future features to be mined from those insights.
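
Because the classes are so imbalanced, precision, recall and F1 are worth spelling out. The sketch below computes them by hand for one class; the function name and the toy label lists are invented for illustration and are not results from this dataset. It is written for Python 3, where `/` is true division.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy labels (hypothetical): mostly "I", mirroring the imbalance noted earlier.
y_true = ["I", "I", "I", "E", "E", "I"]
y_pred = ["I", "I", "E", "E", "I", "I"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="E")
print(precision, recall, f1)
```

On data this skewed, a model that always predicts "I" would score high accuracy but zero recall for "E", which is why accuracy alone is not listed above.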

Future:
----
- Apply this text analysis to Russian Twitter bots.
- Apply this text analysis to political discussion forums.
- Apply this text analysis to a mutual-interest group forum.
- Apply this text analysis to a social media group setting.
- Analyze the variance and see whether any insight can be gained.

First, I'll import the necessary libraries.
----

In [2]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
import nltk

Next, import the dataset.
----

In [3]:
data      = pd.read_csv("datasets/mbti_1.csv")  # Reading in the data
safe_data = data.copy()                         # Independent copy in case something goes wrong.

data.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


Split each individual's text into separate posts on the "|||" delimiter.
----

In [34]:
data = pd.read_csv("datasets/mbti_1.csv")  # Reading in the data

# Split each individual's text into the actual tweets ("|||" is the delimiter)
clean_tweets = [person.split("|||") for person in data['posts']]

# Put those clean tweets back into the dataset
data['posts'] = clean_tweets

# Print some dimensions
print("There are " + str(len(clean_tweets)) + " individuals represented in the dataset.")
# Note: this reports the count for the first individual only.
print("There are " + str(len(clean_tweets[0])) + " tweets from each individual represented.")

There are 8675 individuals represented in the dataset.
There are 50 tweets from each individual represented.


Data explanation at this point:
----
The data is now a list of tweets for each of the 8675 individuals. The first individual has 50; the check above does not confirm that every individual does.
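
One way to check whether every individual really has the same number of posts is to tabulate the list lengths. This is a small sketch using a made-up stand-in for `clean_tweets`; the function name is hypothetical.

```python
from collections import Counter

def post_count_distribution(posts_per_person):
    """Map 'number of posts' to 'how many individuals have that many'."""
    return Counter(len(posts) for posts in posts_per_person)

# Hypothetical stand-in for clean_tweets: three individuals, unequal lengths.
sample = [["a", "b", "c"], ["d", "e", "f"], ["g"]]
distribution = post_count_distribution(sample)

print(distribution)
```

Calling `post_count_distribution(clean_tweets)` on the real data would show at a glance whether 50 is the only length present.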

Let's check out how the data is distributed over the types.

In [5]:
from collections import Counter

# Get a count for each of the data types.
type_count = list(Counter(data['type']).items())

# E/I count
E_raw   = [x for x in type_count if 'E' in x[0]]
E_count = sum([x[1] for x in E_raw])
print("There are " + str(E_count) + " individuals (" +
      str(float(int(E_count)/8675.)*100.) + "%) classified as Extroverted.")
print("There are " + str(int(8675 - E_count)) + " individuals (" +
      str(float(int(8675 - E_count)/8675.)*100.) + "%) classified as Introverted.")
print("")

# S/N count
S_raw   = [x for x in type_count if 'S' in x[0]]
S_count = sum([x[1] for x in S_raw])
print("There are " + str(S_count) + " individuals (" +
      str(float(int(S_count)/8675.)*100.) + "%) classified as Sensing.")
print("There are " + str(int(8675 - S_count)) + " individuals (" +
      str(float(int(8675 - S_count)/8675.)*100.) + "%) classified as Intuitive.")
print("")

# F/T count
F_raw   = [x for x in type_count if 'F' in x[0]]
F_count = sum([x[1] for x in F_raw])
print("There are " + str(F_count) + " individuals (" +
      str(float(int(F_count)/8675.)*100.) + "%) classified as Feeling.")
print("There are " + str(int(8675 - F_count)) + " individuals (" +
      str(float(int(8675 - F_count)/8675.)*100.) + "%) classified as Thinking.")
print("")

# J/P count
J_raw   = [x for x in type_count if 'J' in x[0]]
J_count = sum([x[1] for x in J_raw])
print("There are " + str(J_count) + " individuals (" +
      str(float(int(J_count)/8675.)*100.) + "%) classified as Judging.")
print("There are " + str(int(8675 - J_count)) + " individuals (" +
      str(float(int(8675 - J_count)/8675.)*100.) + "%) classified as Perceiving.")
print("")

There are 1999 individuals (23.0432276657%) classified as Extroverted.
There are 6676 individuals (76.9567723343%) classified as Introverted.

There are 1197 individuals (13.7982708934%) classified as Sensing.
There are 7478 individuals (86.2017291066%) classified as Intuitive.

There are 4694 individuals (54.1095100865%) classified as Feeling.
There are 3981 individuals (45.8904899135%) classified as Thinking.

There are 3434 individuals (39.5850144092%) classified as Judging.
There are 5241 individuals (60.4149855908%) classified as Perceiving.



And then let's see how it's spread out by actual classification:
----

In [6]:
# Cleanly print the class, count tuples.
for a, b in type_count:
    print(str(a) + " " + str(b))

ENFJ 190
ESFP 48
INFJ 1470
ESTJ 39
ISTJ 205
ENTJ 231
ISFP 271
INTJ 1091
ISTP 337
ENTP 685
ISFJ 166
INTP 1304
ESFJ 42
ESTP 89
ENFP 675
INFP 1832
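
The per-axis totals printed earlier can be cross-checked against these per-type counts. The sketch below copies the sixteen counts from the output above into a dictionary and sums them by letter; the helper name `letter_totals` is an assumption for this example.

```python
from collections import Counter

# Per-type counts copied from the notebook output above.
type_counts = {
    "ENFJ": 190, "ESFP": 48, "INFJ": 1470, "ESTJ": 39,
    "ISTJ": 205, "ENTJ": 231, "ISFP": 271, "INTJ": 1091,
    "ISTP": 337, "ENTP": 685, "ISFJ": 166, "INTP": 1304,
    "ESFJ": 42, "ESTP": 89, "ENFP": 675, "INFP": 1832,
}

def letter_totals(counts):
    """Sum the individuals behind each of the eight letters."""
    totals = Counter()
    for mbti_type, count in counts.items():
        for letter in mbti_type:
            totals[letter] += count
    return totals

totals = letter_totals(type_counts)
n_individuals = sum(type_counts.values())

for letter in "EISNFTJP":
    share = 100.0 * totals[letter] / n_individuals
    print(letter, totals[letter], round(share, 1))
```

Deriving the total from the counts also avoids hard-coding 8675 in several places.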


Now we'll remove the link data
----
Some posts link to related media. Setting those posts aside keeps the text clean for now and leaves the links available for deeper analysis later.

In [49]:
# Take the dataset and remove the links, separate into links and texts.
links = [] # Hold all link data
texts = [] # Hold all text data

# Start parsing the dataset by individual
for individual in data['posts']:
    text_posts = []  # Posts without a link
    link_posts = []  # Posts containing a link

    # For each individual, split into text vs. link
    for post in individual:
        if "http" in post:
            link_posts.append(post)
        else:
            text_posts.append(post)

    # Append the data to the data pool.
    links.append(link_posts)
    texts.append(text_posts)

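# Confirm that the data is still represented correctly.
# Note: both lists hold one entry per individual (possibly an empty list),
# so these counts always equal the number of individuals in the dataset.
print("Number of individuals with links in their posts: " + str(len(links)))
print("Number of individuals with texts in their posts: " + str(len(texts)))
print("")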

# Count how many posts were set aside as link data
removed = 0
for individual in links:
    removed += len(individual)
print("In total, " + str(removed) + " posts were removed as link data.")
print("There were a total of " + str(8675 * 50) + " posts in the data.")
print(str(float(removed) / float(8675 * 50) * 100.) + " percent of the data contains a link.\n")

print("Individual 0 first 10 texts:\n")
print(str(texts[0][:10]) + "\n")
print("Individual 0 first 10 links:\n")
print(str(links[0][:10]) + "\n")
print("Individual 0's personality type: " + str(data['type'][0]))


Number of individuals with links in their posts: 8675
Number of individuals with texts in their posts: 8675

In total, 25231 posts were removed as link data.
There were a total of 433750 posts in the data.
5.81694524496 percent of the data contains a link.

Individual 0 first 10 texts:

['What has been the most life-changing experience in your life?', 'May the PerC Experience immerse you.', "Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...", 'Welcome and stuff.', "Prozac, wellbrutin, at least thirty minutes of moving your legs (and I don't mean moving them while sitting in your same desk chair), weed in moderation (maybe try edibles as a healthier alternative...", "Basically come up with three items you've determined that each type (or whichever types you want to do) would more than likely use, given each types' cognitive functions an

Luckily, link posts make up only about 5.8% of the dataset, so we've not lost much information
----
The text and link lists each hold one entry per individual, in the original order, so they line up with the type labels and can be placed back into a DataFrame.
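
The split above sets aside any post containing "http" as a whole, so words written around a link go with it. If that text turns out to matter, one alternative is to strip only the URL. This is a sketch, not what the notebook does: the pattern, the function name and the example post are all assumptions.

```python
import re

# Assumption: a link is "http://" or "https://" followed by non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def split_text_and_links(post):
    """Return (post text with URLs removed, list of URLs found)."""
    urls = URL_PATTERN.findall(post)
    text = URL_PATTERN.sub("", post)
    # Collapse the extra whitespace left behind by the removed URLs.
    text = " ".join(text.split())
    return text, urls

# Hypothetical post, used only to exercise the helper.
text, urls = split_text_and_links("Good one https://example.com/watch?v=abc and more")
print(text)
print(urls)
```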

Let's return some structure.

In [45]:
# Extract the wanted variables
labels = data['type']

split_db = pd.DataFrame({'Type'      : labels,
                         'Text'      : texts,
                         'Link'      : links},
                       columns=['Type','Text','Link'])

split_db.head()

Unnamed: 0,Type,Text,Link
0,INFJ,[What has been the most life-changing experien...,"['http://www.youtube.com/watch?v=qsXHcwe3krw, ..."
1,ENTP,['I'm finding the lack of me in these posts ve...,[http://img188.imageshack.us/img188/6422/6020d...
2,INTP,"[Of course, to which I say I know; that's my b...",['Good one _____ https://www.youtube.com/wa...
3,INTJ,"['Dear INTP, I enjoyed our conversation the ...",[Sx as hell... https://www.youtube.com/watch...
4,ENTJ,"['You're fired., That's another silly misconce...",[Sometimes I just really like impoverished rap...


Let's start mining some features within the text
----

In [56]:
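from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize

# Note: sent_tokenize needs NLTK's 'punkt' resource, which must be
# downloaded once (e.g. nltk.download('punkt')) before this cell will run.
sentences = []  # Per individual: a list of posts, each split into sentences
for individual in texts:
    sentences.append([tokenize.sent_tokenize(post) for post in individual])

print(sentences[0])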

LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - 'C:\\Users\\Carson/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\Carson\\Anaconda2\\nltk_data'
    - 'C:\\Users\\Carson\\Anaconda2\\lib\\nltk_data'
    - 'C:\\Users\\Carson\\AppData\\Roaming\\nltk_data'
    - u''
**********************************************************************
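
The error above means the sentence tokenizer's data files are not installed. A possible fix, sketched under assumptions, is to check for the resource and download it on first use, then collect the tokenized sentences in a new list. The helper names are hypothetical. `nltk` is imported inside `ensure_punkt` so that the rest of the sketch runs without it, and the demonstration uses a naive stand-in splitter rather than the real tokenizer.

```python
def ensure_punkt():
    """Download the 'punkt' tokenizer data if NLTK cannot find it locally."""
    import nltk  # imported here so the rest of the sketch works without NLTK
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")

def tokenize_posts(texts, sent_tokenize):
    """Return, per individual, a list of posts, each split into sentences."""
    return [[sent_tokenize(post) for post in individual] for individual in texts]

# Stand-in tokenizer for demonstration only; with NLTK installed, pass
# nltk.tokenize.sent_tokenize instead after calling ensure_punkt().
def naive_split(post):
    return [part.strip() for part in post.split(".") if part.strip()]

demo = tokenize_posts([["Welcome and stuff. Enjoy."]], naive_split)
print(demo)
```

The error message names the resource to fetch; recent NLTK releases may ask for a differently named resource than `punkt`, in which case that name should be passed to `nltk.download` instead.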