## Project description:
Excerpts from the works of three horror authors (Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft) are provided, and the goal is to build a neural network model with TensorFlow that classifies each excerpt by its author with acceptable accuracy.

The dataset was downloaded from Kaggle.

## Workflow:
To begin with, I will audit the data to see what it looks like and to check its size and dimensionality. Then, I will preprocess the text, removing punctuation and stop words. Next, I will use a count vectorizer to convert the text into features (the neural network learns its own lower-dimensional representation in its hidden layers, so a "fancy" vectorizer is not needed here). Finally, I will tune the hyperparameters and build a neural network model for the classification task.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import xgboost as xgb
from tqdm import tqdm
import sklearn
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import lightgbm as lgb 
from sklearn.feature_selection import SelectKBest, chi2, f_regression 
from sklearn.model_selection import RandomizedSearchCV, KFold
nltk.download('averaged_perceptron_tagger')
stop_words = stopwords.words('english')
import tensorflow as tf
from tensorflow import keras
from keras.wrappers.scikit_learn import KerasClassifier
import warnings
warnings.filterwarnings("ignore")
from keras.models import Sequential
from keras.layers import Dense

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\30523\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Read and audit the data

In [81]:
train = pd.read_csv('train.csv')

In [82]:
train.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [83]:
train.shape

(19579, 3)

### Preprocessing the text data
1. Lowercase
2. Remove Punctuation
3. Tokenize
4. Stopwords filtering
5. Lemmatization

In [84]:
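def preprocess(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove punctuation
    text_j = "".join([char for char in text if char not in string.punctuation])
    
    # 3. Tokenize
    words = word_tokenize(text_j)
    
    # 4. Filter out English stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    
    # 5. Lemmatize (with no POS tag given, WordNet treats every word as a noun)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]
    
    return lemmatized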
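
To make the effect of the first four steps concrete, here is a minimal sketch that runs without any NLTK data. It assumes a tiny hand-written stop-word list (`TOY_STOP_WORDS`, a hypothetical stand-in for NLTK's English list), splits on whitespace instead of calling `word_tokenize`, skips lemmatization, and uses a made-up sample sentence.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "a", "of", "to", "me", "it", "that", "no", "this"}


def toy_preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = text.split()
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]


# Made-up sentence, not taken from the dataset.
sample = "The raven, however, said nothing to me!"
toy_tokens = toy_preprocess(sample)
print(toy_tokens)
```

The real `preprocess` above differs mainly in the size of the stop-word list and in the final lemmatization pass.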

In [85]:
def listToString(s):  
    
    # separator placed between tokens: a single space
    str1 = " " 
    
    # join the tokens back into one string
    return (str1.join(s)) 

In [86]:
train['text']=train['text'].apply(preprocess).apply(listToString)

In [87]:
train.head()

|   | id | text | author |
|---|---|---|---|
| 0 | id26305 | process however afforded mean ascertaining dim... | EAP |
| 1 | id17569 | never occurred fumbling might mere mistake | HPL |
| 2 | id11008 | left hand gold snuff box capered hill cutting ... | EAP |
| 3 | id27763 | lovely spring looked windsor terrace sixteen f... | MWS |
| 4 | id12958 | finding nothing else even gold superintendent ... | HPL |


### Split the training set into a training set and a validation set

In [88]:
X=train.text.values
y=pd.get_dummies(train[['author']])

In [89]:
pd.get_dummies(train[['author']]).shape

(19579, 3)

In [90]:
xtrain, xvalid, ytrain, yvalid = train_test_split(X, y, random_state=7, 
                                                  test_size=0.2)

### Count Vectorizer

In [91]:
ct_v = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

In [93]:
ct_v.fit(xtrain)
xtrain_ct_v =  ct_v.transform(xtrain) 
xvalid_tf_v = ct_v.transform(xvalid)

In [94]:
xtrain_ct_v.shape

(15663, 317772)
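
The column count is this large because `ngram_range=(1, 3)` turns every distinct one-, two-, and three-word sequence into its own feature. The following is a minimal pure-Python sketch of that counting, not scikit-learn's implementation; `ngram_counts` is a hypothetical helper, and the tokens come from the second preprocessed row shown earlier.

```python
from collections import Counter


def ngram_counts(tokens, n_min=1, n_max=3):
    """Count every contiguous n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = "never occurred fumbling might mere mistake".split()
counts = ngram_counts(tokens)
print(len(tokens), "tokens ->", len(counts), "distinct n-gram features")
```

Six tokens already give 6 + 5 + 4 = 15 features, and most bigrams and trigrams are unique to one excerpt, which is why the vocabulary grows so quickly across the whole training set.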

With 317,772 features, the input is still too large to train on comfortably, so we reduce the dimensionality by keeping only the 10,000 features with the highest chi-squared scores.

In [95]:
selector = SelectKBest(chi2, k = 10000)
xtrain_ct_v=selector.fit_transform(xtrain_ct_v, ytrain)
xvalid_tf_v=selector.transform(xvalid_tf_v)
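
As an illustration of what the selection step ranks on, here is a plain-Python sketch of a chi-squared score computed from class-wise feature counts, followed by a top-k pick. It is an assumption-laden toy rather than scikit-learn's actual code: `chi2_scores` and `select_k_best` are hypothetical helpers, the count matrix is invented, and every feature is assumed to occur at least once (otherwise the expected count would be zero).

```python
def chi2_scores(X, y, n_classes):
    """Chi-squared score per feature: sum over classes of (observed - expected)^2 / expected."""
    n_samples, n_features = len(X), len(X[0])
    class_prob = [sum(1 for label in y if label == c) / n_samples for c in range(n_classes)]
    feature_count = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in range(n_classes):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * feature_count[j]
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(scores, k):
    """Return the (sorted) indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Invented term counts: feature 0 is equally common in all classes,
# features 1-3 each occur in a single class only.
X = [
    [1, 2, 0, 0],
    [1, 3, 0, 0],
    [1, 0, 2, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
y = [0, 0, 1, 1, 2, 2]

scores = chi2_scores(X, y, n_classes=3)
selected = select_k_best(scores, k=2)
print(scores, selected)
```

The evenly spread feature scores about zero and is dropped first, while features concentrated in one class score highest.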

### Build the neural network model
As the sample size is large, we can use mini-batch gradient descent to optimize the parameters batch by batch.
The hyperparameters to tune are:
1. learning rate, which is the most important hyperparameter
2. number of hidden units
3. batch size (this can only be tuned manually)
4. number of layers
5. L2 regularization strength and learning-rate decay

The remaining hyperparameters are significantly less important.

For the optimizer, I will use "Adam", as it is robust in most scenarios.
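
As a rough illustration of what "batch by batch" means, the sketch below slices the 15,663 training rows into mini-batches of 32 (the batch size passed to `KerasClassifier` in this notebook) and counts the parameter updates in one epoch. The helper `iter_minibatches` is a hypothetical name, and unlike Keras, which shuffles by default, it walks the rows in order.

```python
import math


def iter_minibatches(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover every sample exactly once."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


n_samples, batch_size = 15663, 32
batches = list(iter_minibatches(n_samples, batch_size))

updates_per_epoch = len(batches)                  # one gradient step per batch
last_batch_size = batches[-1][1] - batches[-1][0]  # the final batch is usually smaller
print(updates_per_epoch, last_batch_size)
```

During the 4-fold search each fit sees only about three quarters of these rows, so the per-epoch count is correspondingly smaller.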

In [96]:
input_shape=xtrain_ct_v.shape[1]
output_shape = ytrain.shape[1]

In [97]:
# reference URL:https://www.kaggle.com/arrogantlymodest/randomised-cv-search-over-keras-neural-network
def create_model( nl1=1, nl2=1,  nl3=1, 
                 nn1=1000, nn2=500, nn3 = 200, lr=0.01, l2=0.01,
                act = 'relu',decay = 0.001):
    
    opt = keras.optimizers.Adam(lr=lr, beta_1=0.9, beta_2=0.999,  decay=decay)
    reg = keras.regularizers.l2( l2=l2)
                                                     
    model = Sequential()
    
    # for the first layer we need to specify the input dimensions
    first=True
    
    # three groups of hidden layers: nl1 layers of nn1 units, then nl2 of nn2, then nl3 of nn3
    for n_layers, n_units in [(nl1, nn1), (nl2, nn2), (nl3, nn3)]:
        for i in range(n_layers):
            if first:
                model.add(Dense(n_units, input_dim=input_shape, activation=act, kernel_regularizer=reg))
                first=False
            else: 
                model.add(Dense(n_units, activation=act, kernel_regularizer=reg))
        
    # output layer: one softmax unit per author
    model.add(Dense(output_shape, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=opt,metrics=['accuracy'])
    return model

In [98]:
# learning algorithm parameters
lr=[0.1,0.05,0.2,0.01]
decay=[0.0001,0.00005,0.001]
# activation
activation=['relu']

# numbers of layers
nl1 = [0,1]
nl2 = [0,1,2]
nl3 = [0,1,2]

# neurons in each layer
nn1=[256]
nn2=[64,128]
nn3=[16,32]

# l2 regularisation

l2 = [0.01, 0.02, 0.1]

# dictionary summary
param_grid = dict(
                    nl1=nl1, nl2=nl2, nl3=nl3, nn1=nn1, nn2=nn2, nn3=nn3,
                    act=activation, l2=l2, lr=lr, decay=decay
                 )

In [102]:
modelCV = KerasClassifier(build_fn=create_model, verbose=1,batch_size = 32,epochs=2)

In [103]:
grid = RandomizedSearchCV(estimator=modelCV, cv=KFold(4), param_distributions=param_grid, 
                          verbose=4,  n_iter=10, n_jobs=1)
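
To put `n_iter=10` in perspective, this short sketch counts how many combinations the parameter grid defines and how many model fits the randomized search performs. The candidate lists are copied from the grid in this notebook; the variable names `n_combinations`, `n_fits`, and `coverage` are only illustrative.

```python
# Candidate values copied from the parameter grid in this notebook.
param_grid = {
    "lr": [0.1, 0.05, 0.2, 0.01],
    "decay": [0.0001, 0.00005, 0.001],
    "act": ["relu"],
    "nl1": [0, 1],
    "nl2": [0, 1, 2],
    "nl3": [0, 1, 2],
    "nn1": [256],
    "nn2": [64, 128],
    "nn3": [16, 32],
    "l2": [0.01, 0.02, 0.1],
}

# Size of the full grid: product of the number of candidates per parameter.
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)

n_iter = 10    # candidates sampled by RandomizedSearchCV
cv_folds = 4   # KFold(4)
n_fits = n_iter * cv_folds
coverage = n_iter / n_combinations
print(f"{n_combinations} combinations, {n_fits} fits, {coverage:.2%} of the grid sampled")
```

Some of these combinations build identical networks (for example, `nn2` has no effect when `nl2` is 0), so the number of distinct models is smaller than the raw product, but the search still samples only a small fraction of the grid.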

In [116]:
grid_result=grid.fit(xtrain_ct_v,np.array(ytrain))

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu, score=0.404, total=   2.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s


[CV] nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu, score=0.409, total=   2.4s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.4s remaining:    0.0s


[CV] nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu, score=0.399, total=   2.0s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.5s remaining:    0.0s


[CV] nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=1, nl2=0, nl1=0, lr=0.2, l2=0.1, decay=5e-05, act=relu, score=0.402, total=   2.0s
[CV] nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu, score=0.612, total=  24.7s
[CV] nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu, score=0.540, total=  24.3s
[CV] nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu 
Epoch 1/2
Epoch 2/2
[CV]  nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5e-05, act=relu, score=0.555, total=  23.0s
[CV] nn3=16, nn2=64, nn1=256, nl3=0, nl2=2, nl1=1, lr=0.01, l2=0.01, decay=5

[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 12.0min finished


Epoch 1/2
Epoch 2/2


In [119]:
grid_result.best_score_

0.702676072716713

In [120]:
grid_result.best_params_

{'nn3': 16,
 'nn2': 64,
 'nn1': 256,
 'nl3': 0,
 'nl2': 1,
 'nl1': 1,
 'lr': 0.01,
 'l2': 0.01,
 'decay': 5e-05,
 'act': 'relu'}
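
Reading these values back through `create_model` gives one hidden layer of 256 units followed by one of 64, then the 3-unit softmax output. The sketch below derives that layout and a trainable-parameter count for fully connected layers, assuming the 10,000 selected input features; `layer_widths` and `dense_param_count` are hypothetical helpers, not part of Keras.

```python
def layer_widths(nl1, nl2, nl3, nn1, nn2, nn3):
    """Hidden-layer widths in the order create_model stacks them."""
    return [nn1] * nl1 + [nn2] * nl2 + [nn3] * nl3


def dense_param_count(input_dim, hidden, n_classes):
    """Weights plus biases for a stack of fully connected layers."""
    sizes = [input_dim] + hidden + [n_classes]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


best = {"nl1": 1, "nl2": 1, "nl3": 0, "nn1": 256, "nn2": 64, "nn3": 16}
hidden = layer_widths(**best)
n_params = dense_param_count(input_dim=10000, hidden=hidden, n_classes=3)
print(hidden, n_params)
```

Almost all of the parameters sit in the first layer, which connects the 10,000 inputs to 256 units.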

Given the search result, the mean cross-validated accuracy of the best estimator is 70.3%. This is encouraging given that each candidate was trained on only three of the four folds for two epochs, and that uniform random guessing among three classes would be right only about a third of the time.

Let's see how well it performs on the held-out validation set.

In [122]:
# First let's fit the whole training set
al = grid_result.best_estimator_

In [123]:
al.fit(xtrain_ct_v,np.array(ytrain))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x257eb9ebdd8>

In [126]:
# Predicted class indices: 0 = EAP, 1 = HPL, 2 = MWS (the column order of the one-hot targets)
al.predict(xvalid_tf_v)



array([0, 0, 2, ..., 0, 0, 1])

In [129]:
# Convert the one-hot validation targets into the same class-index format as the predictions (EAP=0, HPL=1, MWS=2)
yvalid['result']=yvalid['author_HPL']+2*yvalid['author_MWS']
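
The arithmetic in this cell works because exactly one of the three one-hot columns is 1 in each row. A minimal sketch with made-up labels and predictions shows the mapping and the accuracy calculation that follows; `one_hot`, `class_index`, and `accuracy` are hypothetical helpers written in plain Python rather than pandas or scikit-learn.

```python
AUTHORS = ["EAP", "HPL", "MWS"]  # alphabetical, matching the one-hot column order


def one_hot(author):
    """One-hot row in the order author_EAP, author_HPL, author_MWS."""
    return [1 if author == name else 0 for name in AUTHORS]


def class_index(row):
    """Same arithmetic as the notebook: author_HPL + 2 * author_MWS."""
    return row[1] + 2 * row[2]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


true_authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]  # made-up labels
predicted = [0, 0, 0, 2, 1]                          # made-up predictions

y_true = [class_index(one_hot(a)) for a in true_authors]
acc = accuracy(y_true, predicted)
print(y_true, acc)
```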

In [146]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
'The accuracy is '+str(round(100*accuracy_score(yvalid['result'],al.predict(xvalid_tf_v)),1))+'%.'



'The accuracy is 69.4%.'

The accuracy on the held-out validation set is 69.4%, in line with the cross-validation estimate, which I feel is very good for a three-class problem.