# Natural Language Processing

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import nltk 
from nltk.corpus import stopwords 

In [2]:
pip install emoji


The following command must be run outside of the IPython shell:

    $ pip install emoji

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more information on how to install packages:

    https://docs.python.org/3/installing/


We got our data from Kaggle InClass Prediction Competition, a Personality Profile Prediction.

In [3]:
# use Pandas to read in the csv files. The pd.read_csv() method creates a DataFrame from a csv file
dataset = pd.read_csv('mbti_1.csv')

In [4]:
#print(dataset.head())
dataset.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


#  Data Cleaning

# 1.1 Introduction
<br>

We'll be walking through:
- Where we got our data - in this case, we'll get our data from kaggle competition MBTI
- Cleaning the data - we will walk through popular text pre-processing techniques
- Organizing the data - we will organize the cleaned data into a way that is easy to input into other algorithms
<br>

The output of this notebook will be clean, organized data in two standard text formats:
<br>

1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

# Problem Statement
<br>
As a reminder, our goal is to look at posts from different sites and be able to identify personality type for each post.

# Getting The Data
<br>
Luckily, there are sites like kaggle that runs exciting competitions.

# Cleaning The Data
<br>
When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there some common data cleaning techniques which are also known as text-processing techniques. 
With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimium viable product)approuch - start simple and iterate. Here are a bunchs of things you can do to clean your data. We're going to excute the common cleaning steps here and the rest can be done at a later point to improve our results. 
<br>
#### Basic Text Pre-processing of text data:
- Make text all lower case
- Remove punctuation
- Stopwords removal
- Remove numerical values
- Frequent words removal
- Remove common non-sensical text(/n)
- Rare words removal
- Tokenize text
- Stemming
- Lemmatication
- Spelling correction
<br>
#### Advance Text Processing:
- N-grams
- Term frequency
- Inverse Document Frequency
- Term Frequency-Inverse Document Frequency(TF-IDF)
- Bag of words
- Sentiment Analysis
- Word Embedding

In [5]:
# Let's take a look at our data
next(iter(dataset.keys()))

'type'

In [6]:
datasets = dataset
datasets.set_index('type',inplace=True)
datasets.head()

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
ENTP,'I'm finding the lack of me in these posts ver...
INTP,'Good one _____ https://www.youtube.com/wat...
INTJ,"'Dear INTP, I enjoyed our conversation the o..."
ENTJ,'You're fired.|||That's another silly misconce...


In [7]:
# Let's take a look at a post for INFJ
dataset.posts.loc['INFJ']

type
INFJ    'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
INFJ    'No, I can't draw on my own nails (haha). Thos...
INFJ    I'm not sure, that's a good question. The dist...
INFJ    'One time my parents were fighting over my dad...
INFJ    'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
INFJ    'some of these both excite and calm me:  BUTTS...
INFJ    'I fully believe in the power of being a prote...
INFJ    'It is very annoying to be misinterpreted. Esp...
INFJ    'I think that that can absolutely be true of i...
INFJ    it could be pyroluria.. you know.. it is an on...
INFJ    'Sometimes I wonder that too.. the reason bein...
INFJ    http://www.youtube.com/watch?v=ipUdoUcNmKI  ht...
INFJ    'Trying not to feel totally worthless...  Why ...
INFJ    'Me: INFJ Mom: ISTJ Dad: ENFJ Sister: ISTJ|||I...
INFJ    'I would strongly recommend not taking shortcu...
INFJ    'than you may be an ambivert, somewhere in the...
INFJ    'Yeah i'm an a-hole too depending on who you a...
INFJ    '

In [8]:
# Apply a first round of text cleaning techniques
import re
import string

def cleaning_data(text):
    '''Remove web url'''
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', str(text), flags=re.MULTILINE)
    '''Make text lowercase'''
    text = text.lower()
    '''remove text in square brackets'''
    text = re.sub('\[.*?\]', '', text)
    '''remove punctuations'''
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    '''remove digits'''
    text = re.sub('\w*\d\w*', '', text)
    '''remove stop words'''
    STOPWORDS = set(stopwords.words('english'))
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

data_round1 = lambda x: cleaning_data(x)

In [9]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(dataset.posts.apply(data_round1))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


In [10]:
def cleaning_data2(text):
    '''Get rid of some additional punctuations '''
    text = re.sub('\[''""...]', '', text)
    '''Get rid of non-sensical'''
    text = re.sub('\n', '', text)
    '''Remove single characters from the start'''
    text = re.sub(r'\^[a-zA-Z]\s+', ' ', text)
    '''Removing prefixed 'b'''
    text = re.sub(r'^b\s+', '', text)
    '''Correcting typos'''
    text = text.correct()
    '''Remove rare words'''
    freq = pd.Series(' '.join(data_cleaning['posts']).split()).value_counts()[-500:]
    # let's remove these words as their presence will be of any use
    freq = list(freq.index)
    text = data_cleaning['posts'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
    return text

data_round2 = lambda x: cleaning_data(x)

In [11]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(data_cleaning.posts.apply(data_round2))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


In [12]:
def cleaning_data3(text):
    '''Get rid of all single characters'''
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', str(text))
    '''Substituting multiple spaces with single space'''
    text = re.sub(r'\s+', ' ', text, flags=re.I)
    '''Remove all the special characters'''
    text = re.sub(r'\W', ' ', str(text))
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', str(text))
    '''Remove Frequent words'''
    freq = pd.Series(' '.join(data_cleaning['posts']).split()).value_counts()[:500]
    # let's remove these words as their presence will be of any use
    freq = list(freq.index)
    text = data_cleaning['posts'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
    return text
                  
data_round3 = lambda x: cleaning_data(x) 

In [13]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(data_cleaning.posts.apply(data_round3))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


In [14]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    ''' Stemming - removing and replacing suffixes'''
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
        stems =ste
    return stems
data_round4 = lambda x: cleaning_data(x)

In [15]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(data_cleaning.posts.apply(data_round4))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


In [16]:
def Lemmatization(text):
    ''' Lemmatization - returns the dictionary form of a word '''
    text = text.split()

    text = [stemmer.lemmatize(word) for word in text]
    text = ' '.join(text)

    texts.append(text)
    return text

data_round5 = lambda x: cleaning_data(x) 

In [17]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(data_cleaning.posts.apply(data_round5))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


In [18]:
def remove_emoji(text):
    '''Remove all sorts of emojis'''
    emoji_pattern = emoji.get_emoji_regexp().sub(u'', text)
    return emoji_pattern
data_round6 = lambda x: cleaning_data(x) 

In [19]:
# Lets take a look at the updated text
data_cleaning = pd.DataFrame(data_cleaning.posts.apply(data_round6))
data_cleaning

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,intj moments sportscenter top ten plays pranks...
ENTP,im finding lack posts alarmingsex boring posit...
INTP,good one httpswwwyoutubecomwatchvfhigbolffgwof...
INTJ,dear intp enjoyed conversation day esoteric ga...
ENTJ,youre firedthats another silly misconception a...
INTJ,science perfect scientist claims scientific in...
INFJ,cant draw nails haha done professionals nails ...
INTJ,tend build collection things desktop use frequ...
INFJ,im sure thats good question distinction two de...
INTP,position actually let go person due various re...


# 1.5 Organizing The Data
<br>
The output of this notebook will be clean, organized data in two standard text formats:

1. Corpus - a collection of text
<br>
2. Document-Term Matrix - words counts in matrix format

### 1.5.1 Corpus

In [20]:
dataset.head()

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
ENTP,'I'm finding the lack of me in these posts ver...
INTP,'Good one _____ https://www.youtube.com/wat...
INTJ,"'Dear INTP, I enjoyed our conversation the o..."
ENTJ,'You're fired.|||That's another silly misconce...


In [21]:
# Let's pickle it for later use
dataset.to_pickle("corpus.pkl")

### 1.5.2 Document-Term Matrix
<br>
For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.
<br>

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a','the',etc.

In [23]:
# Sentence Tokenizatin
'''Sentence tokenizer breaks text paragraph into sentences'''
from nltk import sent_tokenize
tokenized_sent=sent_tokenize(str(data_cleaning.posts))

'''Word tokenizer breaks text paragraph into words'''
from nltk import word_tokenize
tokens = word_tokenize(str(data_cleaning.posts))
tokens

['type',
 'INFJ',
 'intj',
 'moments',
 'sportscenter',
 'top',
 'ten',
 'plays',
 'pranks',
 '...',
 'ENTP',
 'im',
 'finding',
 'lack',
 'posts',
 'alarmingsex',
 'boring',
 'posit',
 '...',
 'INTP',
 'good',
 'one',
 'httpswwwyoutubecomwatchvfhigbolffgwof',
 '...',
 'INTJ',
 'dear',
 'intp',
 'enjoyed',
 'conversation',
 'day',
 'esoteric',
 'ga',
 '...',
 'ENTJ',
 'youre',
 'firedthats',
 'another',
 'silly',
 'misconception',
 'a',
 '...',
 'INTJ',
 'science',
 'perfect',
 'scientist',
 'claims',
 'scientific',
 'in',
 '...',
 'INFJ',
 'cant',
 'draw',
 'nails',
 'haha',
 'done',
 'professionals',
 'nails',
 '...',
 'INTJ',
 'tend',
 'build',
 'collection',
 'things',
 'desktop',
 'use',
 'frequ',
 '...',
 'INFJ',
 'im',
 'sure',
 'thats',
 'good',
 'question',
 'distinction',
 'two',
 'de',
 '...',
 'INTP',
 'position',
 'actually',
 'let',
 'go',
 'person',
 'due',
 'various',
 're',
 '...',
 'INFJ',
 'one',
 'time',
 'parents',
 'fighting',
 'dads',
 'affair',
 'dad',
 'push',


In [24]:
# We are going to create a document-term matrix using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer  
filtered_sent = ['being', "haven't", 'they', 'but', 'my', 'through', 'up', 'once', "wasn't", 'over', 'his', 'all', 'the', 'further', 'doing', 
 'am', 'd', 'until', 'when', 'it', 'shan', 'on', 'him', 'she', 'yourselves', 'themselves', 'theirs', 'as', 'while', 'more', 's', 
 'have', 'been', 'just', "doesn't", 'aren', "hasn't", 'will', 'were', 'your', 'ain', 'doesn', 'this', 'these', 'with', 'o', 'here', 
 're', 'same', 'isn', 'had', 'above', 'whom', 'nor', 'by', 'herself', 'such', 'ourselves', 'where', 'any', 'mightn', 'what', 'because', 
 'are', 'you', 'its', 'won', 'yourself', 'needn', 'why', "didn't", 'ma', 'no', 'against', 'don', "she's", 'has', 'be', 'ours', 'only', 
 'yours', 'm', 'hadn', 'those', 'during', 'into', 'and', "that'll", 'is', "should've", "mustn't", 'under', 'mustn', 'them', 'in', 'some', 
 'a', 'was', 'off', 'me', 'wasn', 'after', 'i', 'who', 'than', 'both', "you're", 'to', 'not', 'himself', 'he', 'again', 'now', 'how', 'so', 
 'if', 'that', "hadn't", 'which', 'too', "you'll", "aren't", "it's", 'below', 'y', 'or', 'then', 'their', 'wouldn', 'should', 've', 'can', 
 "you've", "couldn't", 'there', 'hasn', 'having', 'most', "won't", 'each', 'hers', 'did', "shouldn't", 'an', 't', 'very', "weren't", 'between', 
 'out', 'down', 'own', 'do', 'itself', 'from', "don't", 'll', 'haven', 'her', "needn't", 'couldn', "you'd", 'myself', "mightn't", 'about', 'didn', 
 'for', 'few', 'other', 'does', 'before', "wouldn't", 'we', "isn't", 'shouldn', "shan't", 'of', 'at', 'our', 'weren'
 'im','like', 'think', 'people', 'dont', 'know', 'really', 'would', 'one', 'get',
 'feel', 'love', 'time', 'ive', 'much', 'say', 'something', 'good',
 'things', 'want', 'see', 'way', 'someone', 'also', 'well', 'friends',
 'always', 'type', 'lot', 'could', 'make', 'go', 'thing', 'even', 'person', 'need',
 'find', 'right', 'never', 'youre', 'thats', 'going', 'life', 'friend',
 'pretty', 'though', 'sure', 'said', 'cant', 'first', 'actually', 'still',
 'best', 'many', 'take', 'others', 'work', 'read', 'sometimes', 'got',
 'around', 'thought', 'try', 'back', 'makes', 'better', 'trying', 'didnt',
 'agree', 'kind', 'mean', 'tell', 'post', 'two', 'probably', 'talk',
 'anything', 'since', 'maybe', 'understand', 'seems', 'ill', 'id', 'little',
 'doesnt', 'thread', 'new', 'long', 'ever', 'years', 'hard', 'might',
 'types', 'us', 'everyone','different', 'look', 'usually', 'may', 'day', 'give',
 'come', 'personality', 'guess', 'mind', 'relationship', 'bit', 'quite',
 'great', 'made', 'thinking', 'everything', 'school', 'seem', 'bad', 'every',
 'help', 'yes', 'definitely', 'believe', 'point', 'used', 'infp', 'guys', 'tend','hes', 'use', 'intj', 
 'often', 'getting', 'interesting', 'last', 'talking', 'infj', 'times',
 'another', 'mbti', 'enfp', 'world','question','part', 'theres',
 'feeling', 'fun', 'intp', 'enough', 'isnt', 'else', 'hate', 'lol', 'keep',
 'anyone', 'nice', 'idea', 'sense','least','enfj', 'entj', 'entp', 'esfj', 'esfp', 'estj', 'estp',
 'isfj', 'isfp', 'istj', 'istp','sound','thank']
vectorizer = CountVectorizer(max_features=1500, min_df=1, max_df=1.0, stop_words=filtered_sent)  
X = vectorizer.fit_transform(data_cleaning.posts)
data_x = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names()) 
data_x.index = data_cleaning.index
del data_x.index.name
data_x

Unnamed: 0,ability,able,absolute,absolutely,abstract,accept,according,account,accurate,across,...,year,yesterday,yet,youd,youi,youll,young,younger,youtube,youve
INFJ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
ENTP,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
INTP,2,1,0,2,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
INTJ,0,2,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,2
ENTJ,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
INTJ,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
INFJ,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
INTJ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
INFJ,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
INTP,0,0,0,2,0,0,0,0,0,0,...,4,0,1,0,0,0,0,0,0,0


In [25]:
# Let's pickle it for later use
data_x.to_pickle("xdata.pkl")

In [26]:
# Let's also pickle the cleaned data (before we put it in documnet-term matrix)
import pickle

data_cleaning.to_pickle('data_cleaning.pkl')
pickle.dump(X, open("vectorizer.pkl", "wb"))