## Introduction
Text processing has a long history and goes by several names, including Computational Linguistics and Natural Language Processing. In this notebook we will look at some of the text processing options available in the following packages:

* nltk
* gensim
* scikit-learn

### nltk - The Natural Language Toolkit
nltk is a popular text processing toolkit, used mostly for Western languages. It can be found at the [nltk site](http://www.nltk.org/).



In [2]:
import os

import nltk
import pandas as pd
from nltk.corpus import stopwords

# Fetch the stopword list and the Punkt tokenizer models (skipped if already up to date)
nltk.download('stopwords')
nltk.download('punkt')

# Location of the training data
BASE_DIR = '../data'
TEXT_DATA_DIR = os.path.join(BASE_DIR, 'SpookyData')

# Read the training data
df = pd.read_csv(os.path.join(TEXT_DATA_DIR, 'train.csv'))
print(df.shape)
df.head()


[nltk_data] Downloading package stopwords to /home/joe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/joe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
(19579, 3)


| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [8]:
# Read the text of the training examples  
corpus = df['text'].tolist()
print(corpus[0])
unique_labels = df['author'].unique().tolist()

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.


In [4]:
from nltk.tokenize import word_tokenize

print(word_tokenize(corpus[0]))

['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'I', 'might', 'make', 'its', 'circuit', ',', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out', ',', 'without', 'being', 'aware', 'of', 'the', 'fact', ';', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall', '.']


In [9]:
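# Remove English stopwords from the tokenized passages.
# Note: the NLTK stopword list is lowercase and the comparison below is
# case-sensitive, so capitalized tokens such as 'This' and 'I' are kept.
stops = set(stopwords.words('english'))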
modified_corpus = []
for sent in corpus:
    modified_sent = []
    for term in word_tokenize(sent):
        if term not in stops:
            modified_sent.append(term)
    modified_corpus.append(modified_sent)
print(modified_corpus[0])

labels = df['author'].tolist()
print(unique_labels)

labeled_corpus = list(zip(modified_corpus, labels))
print(labeled_corpus[0])

['This', 'process', ',', 'however', ',', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', ';', 'I', 'might', 'make', 'circuit', ',', 'return', 'point', 'whence', 'I', 'set', ',', 'without', 'aware', 'fact', ';', 'perfectly', 'uniform', 'seemed', 'wall', '.']
['EAP', 'HPL', 'MWS']
(['This', 'process', ',', 'however', ',', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', ';', 'I', 'might', 'make', 'circuit', ',', 'return', 'point', 'whence', 'I', 'set', ',', 'without', 'aware', 'fact', ';', 'perfectly', 'uniform', 'seemed', 'wall', '.'], 'EAP')
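The output above still contains capitalized stopwords ('This', 'I') and punctuation tokens, because the filter compares tokens to a lowercase list as-is. If that is not wanted, one option is to lowercase each token for the comparison and skip pure-punctuation tokens. The sketch below is only an illustration: `filter_tokens` and `demo_stops` are hypothetical names, and the tiny stopword set stands in for `stopwords.words('english')` so the example runs without any downloads.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# in the notebook the real set comes from stopwords.words('english').
demo_stops = {"this", "i", "the", "of", "my", "no", "me"}


def filter_tokens(tokens, stops, drop_punct=True):
    """Drop stopwords case-insensitively and, optionally, punctuation-only tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in stops:
            continue  # stopword, regardless of capitalization
        if drop_punct and all(ch in string.punctuation for ch in tok):
            continue  # token consists only of punctuation characters
        kept.append(tok)
    return kept


tokens = ['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means']
filtered = filter_tokens(tokens, demo_stops)
print(filtered)
```

Whether to drop these tokens is a modeling choice: for author identification, capitalization and punctuation habits may themselves carry signal, so it can be worth comparing both variants.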


In [11]:
# Create a labeled set of training features.
# First collect the vocabulary: every distinct token left after stopword removal.
all_words = set()
for passage in labeled_corpus:
    for word in passage[0]:
        all_words.add(word)

len(all_words)  # not displayed: a cell only echoes its last expression
# another way to build the same set:
# all_words = {word for tokens, _ in labeled_corpus for word in tokens}

# Represent each passage as a dict of {token: True} presence features
training = []
for passage in labeled_corpus:
    d = {}
    for term in passage[0]:
        d[term] = True
    training.append((d, passage[1]))

# TODO: add a random selection of domain terms to each sample as negative features

print(training[0])

({',': True, 'ascertaining': True, ';': True, 'might': True, 'whence': True, 'however': True, 'process': True, '.': True, 'seemed': True, 'perfectly': True, 'afforded': True, 'set': True, 'This': True, 'without': True, 'circuit': True, 'fact': True, 'means': True, 'uniform': True, 'return': True, 'point': True, 'wall': True, 'dimensions': True, 'dungeon': True, 'make': True, 'aware': True, 'I': True}, 'EAP')
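The feature dictionaries above record only whether a token is present, not how often it occurs. As a small sketch of the same idea, the loop can be wrapped in a helper so that training and test passages are converted identically. The name `presence_features` and the toy passages are hypothetical and not part of the notebook.

```python
def presence_features(tokens):
    """Map a list of tokens to {token: True}; repeated tokens collapse to one key."""
    return {token: True for token in tokens}


# Hypothetical toy data in the same (tokens, label) shape as labeled_corpus
toy_labeled = [
    (['dungeon', 'wall', 'dungeon'], 'EAP'),
    (['spring', 'lovely'], 'MWS'),
]

toy_training = [(presence_features(tokens), label) for tokens, label in toy_labeled]
print(toy_training[0])
```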


In [13]:
# Train a Naive Bayes classifier on the labeled feature dicts
classifier = nltk.NaiveBayesClassifier.train(training)

In [14]:
classifier.show_most_informative_features()

Most Informative Features
                 Raymond = True              MWS : HPL    =     99.0 : 1.0
                     Old = True              HPL : EAP    =     50.9 : 1.0
             endeavoured = True              MWS : EAP    =     44.9 : 1.0
                sinister = True              HPL : EAP    =     44.4 : 1.0
               Elizabeth = True              MWS : HPL    =     42.6 : 1.0
                    West = True              HPL : EAP    =     41.9 : 1.0
                      'd = True              HPL : MWS    =     41.8 : 1.0
                  sister = True              MWS : EAP    =     41.0 : 1.0
                  wholly = True              HPL : EAP    =     40.7 : 1.0
                 despite = True              HPL : EAP    =     39.7 : 1.0
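Each row above compares two authors: the ratio says how many times more likely the feature value is under the first author than under the second, according to the classifier's estimated conditional probabilities. The sketch below illustrates that kind of ratio with made-up counts. The counts, the helper `smoothed_prob`, and the smoothing settings (add 0.5 per outcome, two outcomes) are assumptions chosen for illustration and are not read from the trained classifier.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Additive smoothing: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)


# Hypothetical counts: passages per author containing some word, out of 100 each
passages_with_word = {"MWS": 20, "HPL": 0, "EAP": 2}
passages_total = {"MWS": 100, "HPL": 100, "EAP": 100}

probs = {
    author: smoothed_prob(passages_with_word[author], passages_total[author])
    for author in passages_total
}

# Compare the author with the highest estimate to the one with the lowest
most = max(probs, key=probs.get)
least = min(probs, key=probs.get)
ratio = probs[most] / probs[least]
print(f"{most} : {least} = {ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite even when a word never appears for one author, which is why names such as Raymond can show very large but not infinite ratios.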


In [15]:
classifier.labels()

['HPL', 'MWS', 'EAP']

In [16]:
# read the test data
dftest = pd.read_csv(os.path.join(TEXT_DATA_DIR, 'test.csv'))
print(dftest.shape)
dftest.head()

(8392, 2)


| | id | text |
|---|---|---|
| 0 | id02310 | Still, as I urged our leaving Ireland with suc... |
| 1 | id24541 | If a fire wanted fanning, it could readily be ... |
| 2 | id00134 | And when they had broken down the frail door t... |
| 3 | id27757 | While I was thinking how I should possibly man... |
| 4 | id04081 | I am not sure to what limit his knowledge may ... |
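The test set has no `author` column, so a natural next step is to ask a trained classifier for per-author probabilities on unseen text. The sketch below shows the shape of that step on a toy classifier rather than the one trained above, so that it runs on its own. The training examples, the sentence, and the use of `str.split` instead of `word_tokenize` are all assumptions made for illustration.

```python
import nltk

# Hypothetical toy training set in the same (features, label) shape as `training`
toy_training = [
    ({'raven': True, 'dungeon': True}, 'EAP'),
    ({'raven': True, 'wall': True}, 'EAP'),
    ({'sister': True, 'spring': True}, 'MWS'),
    ({'sister': True, 'lovely': True}, 'MWS'),
]
toy_classifier = nltk.NaiveBayesClassifier.train(toy_training)

# Convert an unseen sentence to presence features, exactly as for training data.
# Plain whitespace splitting is used here to avoid needing tokenizer models.
unseen_tokens = "the raven sat upon the wall".split()
unseen_features = {token: True for token in unseen_tokens}

# prob_classify returns a probability distribution over the labels
dist = toy_classifier.prob_classify(unseen_features)
for label in sorted(dist.samples()):
    print(label, round(dist.prob(label), 3))
print("predicted:", dist.max())
```

With the real data, the same conversion would be applied to each row of `dftest['text']`, using the same tokenization and stopword filtering as for the training passages so that feature names match.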
