
# <center> Author Classification </center>

## Introduction

Using NLP and techniques to classify author from texts from Gutenberg project.
1. Pre-process data using Spacy and other methods.
2. Perform data exploration
3. Using Bag of Word, apply supervised models such as Naive Bayes,  Decision Tree, Random Forest, and Gradient Boosting.
4. Similar to 3., but using TF-IDF.
5. Similar to 3., but using word2vec.
6. Using unsupervised technique for clustering authors. <font color='red'>**(ADDED)**</font>
7. Using LSA and LDA, print out top ten words (with their highest loading) for each topic modeling.<font color='red'>**(ADDED)**</font>

****
<font color= 'red'> **Fixed Version**

**=> Lower all words and remove author in each samples**


**=> Tuning all model when using BoW technique to imporve performance**

**=> Apply the tuned hyperparameters in BoW to TF-IDF, Word2Vec to imrpove the performance**

**=> Tuning hyperparmeters in Word2Vec techniques to gain higher accuaracy**

**=> Clustering using Word2Vec**

**=> Apply LSA and LDA for topic modeling**
    

</font>


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Explore-Data" data-toc-modified-id="Explore-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore Data</a></span></li><li><span><a href="#Prepare-Data" data-toc-modified-id="Prepare-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Prepare Data</a></span></li><li><span><a href="#Bag-of-words" data-toc-modified-id="Bag-of-words-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Bag of words</a></span></li><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#Word2vec" data-toc-modified-id="Word2vec-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Word2vec</a></span></li><li><span><a href="#Clustering" data-toc-modified-id="Clustering-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Clustering</a></span></li><li><span><a href="#LSA" data-toc-modified-id="LSA-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>LSA</a></span></li><li><span><a href="#LDA" data-toc-modified-id="LDA-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>LDA</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Explore Data

In [1]:
import nltk
from nltk.corpus import gutenberg
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from itertools import chain
from sklearn.model_selection import GridSearchCV


nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-3-more-fun-with-word-vectors

In [2]:
Novels = gutenberg.fileids()
Novels

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The data is name of author followed title of the book

In [0]:
numNovels = len(gutenberg.fileids())

There are 18 book in this project

In [4]:
Authors = []
for i in range(numNovels):
  author = Novels[i].split('-')[0]
  if  (author in Authors ):
    continue
  Authors.append(Novels[i].split('-')[0])
print(len(Authors))
Authors

12


['austen',
 'bible',
 'blake',
 'bryant',
 'burgess',
 'carroll',
 'chesterton',
 'edgeworth',
 'melville',
 'milton',
 'shakespeare',
 'whitman']

There are 12 authors who wrote 18 books above

In [5]:
for i in Novels:
  print(i.split('.')[0] + " has " + str(len(gutenberg.words(i))) + ' words'  )


austen-emma has 192427 words
austen-persuasion has 98171 words
austen-sense has 141576 words
bible-kjv has 1010654 words
blake-poems has 8354 words
bryant-stories has 55563 words
burgess-busterbrown has 18963 words
carroll-alice has 34110 words
chesterton-ball has 96996 words
chesterton-brown has 86063 words
chesterton-thursday has 69213 words
edgeworth-parents has 210663 words
melville-moby_dick has 260819 words
milton-paradise has 96825 words
shakespeare-caesar has 25833 words
shakespeare-hamlet has 37360 words
shakespeare-macbeth has 23140 words
whitman-leaves has 154883 words


Results above show total of words in each book.

They will be transformed to dataframe for easier to read, and this data frame sumarize all information about words, senteces and vocalbulary

In [0]:
num_word = []
num_sent = []
num_vocab = []
for fileid in gutenberg.fileids():
    num_word.append(len(gutenberg.words(fileid)) )
    num_sent.append(len(gutenberg.sents(fileid)) )
    num_vocab.append(len(set(gutenberg.words(fileid))) )



In [0]:
suma = pd.DataFrame( index= Novels, columns = ['Words','Sentences','Vocabulary'], data = np.array([num_word, num_sent,num_vocab]).T )  


In [8]:
suma


Unnamed: 0,Words,Sentences,Vocabulary
austen-emma.txt,192427,7752,7811
austen-persuasion.txt,98171,3747,6132
austen-sense.txt,141576,4999,6833
bible-kjv.txt,1010654,30103,13769
blake-poems.txt,8354,438,1820
bryant-stories.txt,55563,2863,4420
burgess-busterbrown.txt,18963,1054,1764
carroll-alice.txt,34110,1703,3016
chesterton-ball.txt,96996,4779,8947
chesterton-brown.txt,86063,3806,8299


**bible-kjv is the book which has largest amount of words than the others. while blake-poems is the least. It can understand that poems is less words than novels.**

Now extract an random book to show its content

**=> it is raw data beccasue it has a lot symbol like \n, ...**

In [9]:
gutenberg.paras('austen-emma.txt')[:2]

[[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']], [['VOLUME', 'I']]]

**=>Because the number of sentences are too large, this project focuse on "paras" which consider as set of sentences. to reduce the number of samples**

In [10]:
s = 0
for i in Novels:
  print(i.split('.')[0] + " has " +  str(len(gutenberg.paras(i)))  + " paragraphs")
  s = s + len(gutenberg.paras(i))
print(s)


austen-emma has 2371 paragraphs
austen-persuasion has 1032 paragraphs
austen-sense has 1862 paragraphs
bible-kjv has 24608 paragraphs
blake-poems has 284 paragraphs
bryant-stories has 1194 paragraphs
burgess-busterbrown has 266 paragraphs
carroll-alice has 817 paragraphs
chesterton-ball has 1606 paragraphs
chesterton-brown has 1161 paragraphs
chesterton-thursday has 1288 paragraphs
edgeworth-parents has 3726 paragraphs
melville-moby_dick has 2793 paragraphs
milton-paradise has 29 paragraphs
shakespeare-caesar has 744 paragraphs
shakespeare-hamlet has 950 paragraphs
shakespeare-macbeth has 678 paragraphs
whitman-leaves has 2478 paragraphs
47887


**Each sample will have 500 paras to reduce the number of samples and process data faster**

In [11]:
for i in Novels:
  if (len(gutenberg.paras(i)) < 500):
    print(i.split('.')[0] + " has " +  str(len(gutenberg.paras(i)))  + " paragraphs")

blake-poems has 284 paragraphs
burgess-busterbrown has 266 paragraphs
milton-paradise has 29 paragraphs


## Prepare Data

Generate data from the books which has 3 features titles, paras and authors

In [12]:
# Titles, Sentences, Authors
Titles = []
Paras = []
AuthorsLS = []
import time
tick = time.time()
# get the data
from itertools import chain

for fileid in gutenberg.fileids():
    author = fileid.split('-')[0] 
    kk = gutenberg.paras(fileid) 
    title = fileid.split('-')[1].split('.')[0] 
    for para in kk:
        AuthorsLS.append(author)
        Titles.append(title)
        para = list(chain.from_iterable(para)) 
        Paras.append(para)
    
print(time.time() - tick)
  

6.439829587936401


In [13]:
dataOrig = pd.DataFrame({ 'Titles' : Titles,
                      'Paras':    Paras,
                      'Authors': AuthorsLS})
dataOrig

Unnamed: 0,Titles,Paras,Authors
0,emma,"[[, Emma, by, Jane, Austen, 1816, ]]",austen
1,emma,"[VOLUME, I]",austen
2,emma,"[CHAPTER, I]",austen
3,emma,"[Emma, Woodhouse, ,, handsome, ,, clever, ,, a...",austen
4,emma,"[She, was, the, youngest, of, the, two, daught...",austen
...,...,...,...
47882,leaves,"[}, Good, -, Bye, My, Fancy, !]",whitman
47883,leaves,"[Good, -, bye, my, Fancy, !, Farewell, dear, m...",whitman
47884,leaves,"[Now, for, my, last, --, let, me, look, back, ...",whitman
47885,leaves,"[Long, have, we, lived, ,, joy, ', d, ,, cares...",whitman


Using stop word in english to filter data

<font color= 'red'>**=> lower all char and remove author in each samples**</cetner>

In [0]:
import string
data = dataOrig.copy()
stop_words = set(stopwords.words('english') + Authors + list(string.punctuation))
for i in range(data.shape[0]):
  words = ''
  for w in data["Paras"][i]:
    if not w.lower() in stop_words:
        words = words + " " + w.lower() 
  data["Paras"][i] = words


In [15]:
data.head()

Unnamed: 0,Titles,Paras,Authors
0,emma,emma jane 1816,austen
1,emma,volume,austen
2,emma,chapter,austen
3,emma,emma woodhouse handsome clever rich comfortab...,austen
4,emma,youngest two daughters affectionate indulgent...,austen


**=> after filtering the data is more cleaner**

In [16]:
data['Authors'].value_counts()

bible          24608
austen          5265
chesterton      4055
edgeworth       3726
melville        2793
whitman         2478
shakespeare     2372
bryant          1194
carroll          817
blake            284
burgess          266
milton            29
Name: Authors, dtype: int64

**Total number of paras for each author => the data is imbalace (milton only 29 paras)**

split data to 20% test and 80% training

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Paras'], data['Authors'], test_size=0.2, random_state=12)

In [18]:
print("training shape: {}{}".format(X_train.shape,y_train.shape))
print("testing shape : {}{}".format(X_test.shape,y_test.shape))

training shape: (38309,)(38309,)
testing shape : (9578,)(9578,)


## Bag of words

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
count_vect = CountVectorizer(max_features = 5000)
count_vect.fit(data['Paras'])
X_train_counts = count_vect.transform(X_train)
X_test_counts = count_vect.transform(X_test)
X_train_counts.shape

(38309, 5000)

In [20]:
print("training shape: {}{}".format(X_train_counts.shape,y_train.shape))
print("testing shape : {}{}".format(X_test_counts.shape,y_test.shape))

training shape: (38309, 5000)(38309,)
testing shape : (9578, 5000)(9578,)


5000 Words in bag

In [0]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train_le = le.transform(y_train)

le1 = preprocessing.LabelEncoder()
le1.fit(y_test)
y_test_le = le1.transform(y_test)


In [22]:
model = RandomForestClassifier(n_estimators=20, random_state=1)
print(model)
model.fit(X_train_counts,y_train_le)
pr = model.predict(X_test_counts)
print(classification_report(y_test_le, pr))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)
              precision    recall  f1-score   support

           0       0.71      0.82      0.76      1003
           1       0.93      0.98      0.96      4939
           2       0.43      0.17      0.24        53
           3       0.60      0.47      0.53       246
           4       0.97      0.66      0.78        58
           5       0.95      0.69      0.80       169
           6       0.75      0.73      0.74       803
           7       0.68      0.59      0.64       758
           8       0.78   

In [23]:

param_grid = {'n_estimators': [80, 100, 120],
              'max_depth' : [200, 250, 300],
              'class_weight' : [{0:0.1,1:0.9}, {0:0.2,1:0.8},{0:0.3,1:0.7}]}
RDF_grid = GridSearchCV(estimator=RandomForestClassifier(),
                          param_grid = param_grid,
                          scoring="f1_micro",
                          cv=3,
                          n_jobs = 5)
tick = time.time()
RDF_grid.fit(X_train_counts, y_train_le)
tock = time.time()
RDF_grid_best = RDF_grid.best_estimator_ #best estimator
print(RDF_grid_best)



RandomForestClassifier(bootstrap=True, class_weight={0: 0.3, 1: 0.7},
                       criterion='gini', max_depth=300, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=120, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)


In [24]:
RDF_grid_best.fit(X_train_counts,y_train_le)
pr = RDF_grid_best.predict(X_test_counts)
print(classification_report(y_test_le, pr))


              precision    recall  f1-score   support

           0       0.81      0.77      0.79      1003
           1       0.93      0.98      0.96      4939
           2       0.54      0.13      0.21        53
           3       0.68      0.47      0.55       246
           4       1.00      0.74      0.85        58
           5       0.97      0.73      0.83       169
           6       0.76      0.78      0.77       803
           7       0.74      0.68      0.71       758
           8       0.81      0.68      0.74       554
           9       0.67      0.67      0.67         3
          10       0.90      0.82      0.86       496
          11       0.47      0.61      0.53       496

    accuracy                           0.85      9578
   macro avg       0.77      0.67      0.71      9578
weighted avg       0.85      0.85      0.85      9578



<font color= 'red'>**=> The accuaracy imrpoved from 83% to 85% for previous version (max_depth and n_estimators tuned 20 to 300 and 120 respectively, and the best result is 300 and 120. It means that the model has a chance to continuosly incrase accuaracy)**</font>

In [25]:
model = DecisionTreeClassifier(random_state=2)
print(model)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=2, splitter='best')
              precision    recall  f1-score   support

      austen       0.72      0.74      0.73      1003
       bible       0.95      0.94      0.95      4939
       blake       0.21      0.13      0.16        53
      bryant       0.44      0.51      0.47       246
     burgess       0.89      0.72      0.80        58
     carroll       0.84      0.69      0.76       169
  chesterton       0.66      0.66      0.66       803
   edgeworth       0.59      0.59      0.59       758
    melville       0.65      0.58      0.61       554
      milton       0.25      0.33      0.29         3
 sh

In [26]:

param_grid = {'criterion' :['gini', 'entropy'],
              'max_features': ['auto', 'log2'],
              'max_depth' : [1000, 1200, 1400],
              'class_weight' : [{0:0.1,1:0.9}, {0:0.2,1:0.8},{0:0.3,1:0.7}]
             }
DT_grid = GridSearchCV(estimator=DecisionTreeClassifier(),
                          param_grid = param_grid,
                          scoring="f1_micro",
                          cv=3,
                          n_jobs = 15)
tick = time.time()
DT_grid.fit(X_train_counts, y_train_le)
tock = time.time()
DT_grid_best = DT_grid.best_estimator_ #best estimator
DT_grid_best

DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='gini',
                       max_depth=1000, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [27]:
DT_grid_best.fit(X_train_counts,y_train_le)
pr = DT_grid_best.predict(X_test_counts)
print(classification_report(y_test_le, pr))


              precision    recall  f1-score   support

           0       0.61      0.59      0.60      1003
           1       0.91      0.92      0.91      4939
           2       0.25      0.19      0.22        53
           3       0.33      0.38      0.35       246
           4       0.68      0.40      0.50        58
           5       0.57      0.59      0.58       169
           6       0.56      0.55      0.56       803
           7       0.48      0.49      0.48       758
           8       0.48      0.45      0.47       554
           9       0.25      0.33      0.29         3
          10       0.58      0.77      0.66       496
          11       0.42      0.31      0.36       496

    accuracy                           0.73      9578
   macro avg       0.51      0.50      0.50      9578
weighted avg       0.72      0.73      0.72      9578



<font color= 'red'>**=> The accuaracy go down from 79% to 73%.**</font>

In [28]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
print(model)
model.fit(X_train_counts.toarray(), y_train)
pr = model.predict(X_test_counts.toarray())
print(classification_report(y_test, pr))

GaussianNB(priors=None, var_smoothing=1e-09)
              precision    recall  f1-score   support

      austen       0.90      0.78      0.84      1003
       bible       0.99      0.91      0.95      4939
       blake       0.09      0.40      0.15        53
      bryant       0.39      0.50      0.44       246
     burgess       0.29      0.71      0.41        58
     carroll       0.41      0.65      0.50       169
  chesterton       0.82      0.64      0.72       803
   edgeworth       0.67      0.79      0.73       758
    melville       0.77      0.67      0.72       554
      milton       0.06      0.67      0.12         3
 shakespeare       0.82      0.88      0.85       496
     whitman       0.44      0.59      0.50       496

    accuracy                           0.81      9578
   macro avg       0.56      0.68      0.58      9578
weighted avg       0.86      0.81      0.83      9578



In [29]:
tock - tick

24.383885622024536

In [30]:
param_grid = {'var_smoothing' :[0.00000000001,0.00000000003,0.0000000001,0.0000000003,
                                0.000000001,0.000000003,0.00000001]}
GNB_grid = GridSearchCV(estimator=GaussianNB(),
                          param_grid = param_grid,
                          scoring="f1_micro",
                          cv=3,
                          n_jobs = -1)
tick = time.time()
GNB_grid.fit(X_train_counts.toarray(), y_train_le)
tock = time.time()
GNB_grid_best = GNB_grid.best_estimator_ #best estimator
GNB_grid_best



GaussianNB(priors=None, var_smoothing=1e-08)

In [31]:
GNB_grid_best.fit(X_train_counts.toarray(),y_train_le)
pr = GNB_grid_best.predict(X_test_counts.toarray())
print(classification_report(y_test_le, pr))

              precision    recall  f1-score   support

           0       0.91      0.78      0.84      1003
           1       0.99      0.93      0.96      4939
           2       0.09      0.40      0.15        53
           3       0.39      0.50      0.44       246
           4       0.29      0.71      0.41        58
           5       0.41      0.65      0.50       169
           6       0.82      0.65      0.72       803
           7       0.69      0.79      0.73       758
           8       0.77      0.68      0.72       554
           9       0.07      0.67      0.13         3
          10       0.82      0.88      0.85       496
          11       0.49      0.59      0.53       496

    accuracy                           0.82      9578
   macro avg       0.56      0.68      0.58      9578
weighted avg       0.86      0.82      0.84      9578



<font color= 'red'>**=> The accuaracy does not change from previous version but it is better than default model up to 1%.**</font>

In [32]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(time.time() - tick)
print(classification_report(y_test, pr))

113.09358930587769
              precision    recall  f1-score   support

      austen       0.90      0.72      0.80      1003
       bible       0.77      1.00      0.87      4939
       blake       0.39      0.13      0.20        53
      bryant       0.86      0.49      0.63       246
     burgess       0.96      0.78      0.86        58
     carroll       0.94      0.74      0.83       169
  chesterton       0.93      0.61      0.74       803
   edgeworth       0.96      0.62      0.76       758
    melville       0.90      0.69      0.78       554
      milton       0.40      0.67      0.50         3
 shakespeare       0.98      0.73      0.84       496
     whitman       0.84      0.30      0.44       496

    accuracy                           0.82      9578
   macro avg       0.82      0.62      0.69      9578
weighted avg       0.84      0.82      0.80      9578



<font color= 'red'>**=> The accuaracy does not change%.**</font>

<font color= 'red'>**=> In previous version the best one is gradient bossting 82%. However, the best on in this version is decision tree which is up to 85%**</font>

some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

## TF-IDF

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF.

Term Frequency: This summarizes how often a given word appears within a document.

Inverse Document Frequency: This downscales words that appear a lot across documents.

=> TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
# create the transform
vectorizer = TfidfVectorizer(max_features= 5000)
# tokenize and build vocab
vectorizer.fit(data['Paras'])
# summarize
#print(vectorizer.vocabulary_)
#print(vectorizer.idf_)
# encode document
X_train_counts = vectorizer.transform(X_train)
X_test_counts = vectorizer.transform(X_test)
# summarize encoded vector
#print(vector.shape)
#print(vector.toarray())
X_train_counts.shape

(38309, 5000)

In [34]:
print("training shape: {}{}".format(X_train_counts.shape,y_train.shape))
print("testing shape : {}{}".format(X_test_counts.shape,y_test.shape))

training shape: (38309, 5000)(38309,)
testing shape : (9578, 5000)(9578,)


In [35]:
model = RandomForestClassifier(n_estimators=120,max_depth=300 ,random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.73      0.83      0.78      1003
       bible       0.92      0.99      0.95      4939
       blake       1.00      0.02      0.04        53
      bryant       0.77      0.42      0.54       246
     burgess       1.00      0.74      0.85        58
     carroll       0.97      0.67      0.79       169
  chesterton       0.76      0.75      0.76       803
   edgeworth       0.78      0.63      0.70       758
    melville       0.90      0.63      0.74       554
      milton       0.67      0.67      0.67         3
 shakespeare       0.95      0.80      0.87       496
     whitman       0.49      0.61      0.54       496

    accuracy                           0.85      9578
   macro avg       0.83      0.65      0.69      9578
weighted avg       0.85      0.85      0.84      9578



<font color= 'red'>**=> The accuaracy increases from 83% to 85%.**</font>

In [36]:
model = DecisionTreeClassifier( max_depth=1200,random_state=2)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
classification_report(y_test, pr)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.72      0.72      0.72      1003
       bible       0.94      0.95      0.95      4939
       blake       0.18      0.09      0.12        53
      bryant       0.48      0.46      0.47       246
     burgess       0.71      0.69      0.70        58
     carroll       0.86      0.70      0.77       169
  chesterton       0.67      0.68      0.67       803
   edgeworth       0.63      0.59      0.61       758
    melville       0.69      0.59      0.64       554
      milton       0.40      0.67      0.50         3
 shakespeare       0.66      0.78      0.71       496
     whitman       0.46      0.44      0.45       496

    accuracy                           0.80      9578
   macro avg       0.62      0.61      0.61      9578
weighted avg       0.80      0.80      0.80      9578



<font color= 'red'>**=> The accuaracy increase from 79% to 80%.**</font>

In [37]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB(var_smoothing=1e-08)
model.fit(X_train_counts.toarray(), y_train)
pr = model.predict(X_test_counts.toarray())
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.85      0.82      0.84      1003
       bible       0.99      0.92      0.96      4939
       blake       0.16      0.34      0.22        53
      bryant       0.37      0.49      0.42       246
     burgess       0.17      0.69      0.27        58
     carroll       0.41      0.62      0.49       169
  chesterton       0.81      0.71      0.76       803
   edgeworth       0.79      0.75      0.77       758
    melville       0.68      0.72      0.70       554
      milton       0.04      0.67      0.07         3
 shakespeare       0.87      0.87      0.87       496
     whitman       0.54      0.56      0.55       496

    accuracy                           0.83      9578
   macro avg       0.56      0.68      0.58      9578
weighted avg       0.86      0.83      0.84      9578



<font color= 'red'>**=> The accuaracy does not change.**</font>

In [38]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(time.time() - tick)
print(classification_report(y_test, pr))

272.351704120636
              precision    recall  f1-score   support

      austen       0.88      0.74      0.80      1003
       bible       0.77      1.00      0.87      4939
       blake       0.25      0.11      0.16        53
      bryant       0.80      0.56      0.66       246
     burgess       0.98      0.79      0.88        58
     carroll       0.88      0.73      0.80       169
  chesterton       0.94      0.62      0.75       803
   edgeworth       0.93      0.61      0.74       758
    melville       0.95      0.65      0.77       554
      milton       0.50      0.67      0.57         3
 shakespeare       0.99      0.73      0.84       496
     whitman       0.84      0.27      0.41       496

    accuracy                           0.81      9578
   macro avg       0.81      0.62      0.69      9578
weighted avg       0.83      0.81      0.80      9578



<font color= 'red'>**=> The accuaracy descrase from 83% to 81%.**</font>

<font color= 'red'>**=> The highest accuaracy 83% for Naive's baye instead of gradient boosting of previous version.**</font>

## Word2vec

In [0]:
from gensim.models.word2vec import Word2Vec
from string import punctuation
punc = set(punctuation)

In [40]:
dataWV = dataOrig['Paras'].copy()
dataWV

0                     [[, Emma, by, Jane, Austen, 1816, ]]
1                                              [VOLUME, I]
2                                             [CHAPTER, I]
3        [Emma, Woodhouse, ,, handsome, ,, clever, ,, a...
4        [She, was, the, youngest, of, the, two, daught...
                               ...                        
47882                      [}, Good, -, Bye, My, Fancy, !]
47883    [Good, -, bye, my, Fancy, !, Farewell, dear, m...
47884    [Now, for, my, last, --, let, me, look, back, ...
47885    [Long, have, we, lived, ,, joy, ', d, ,, cares...
47886    [Yet, let, me, not, be, too, hasty, ,, Long, i...
Name: Paras, Length: 47887, dtype: object

In [41]:
stop_words = set(stopwords.words('english') + Authors + list(string.punctuation))
for i in range(len(dataWV)):
  words = []
  for w in dataWV[i]:
    if not w.lower() in stop_words :
      words.append(w.lower()) 
  dataWV[i] = words
dataWV


0                                       [emma, jane, 1816]
1                                                 [volume]
2                                                [chapter]
3        [emma, woodhouse, handsome, clever, rich, comf...
4        [youngest, two, daughters, affectionate, indul...
                               ...                        
47882                                   [good, bye, fancy]
47883    [good, bye, fancy, farewell, dear, mate, dear,...
47884    [last, --, let, look, back, moment, slower, fa...
47885    [long, lived, joy, caress, together, delightfu...
47886    [yet, let, hasty, long, indeed, lived, slept, ...
Name: Paras, Length: 47887, dtype: object

In [0]:
sz = 300 # 500
modelW2V = Word2Vec(dataWV, size=sz, window=600, min_count=10, workers=10, iter=10)

<font color= 'red'>**=> size, window, mincount and iter are already tuning after many tries**</font>

In [0]:
#model.wv.most_similar(positive="girl", topn =3)
#len(model.wv.vocab)

In [0]:
from sklearn.model_selection import train_test_split
data_train, data_test, y_train, y_test = train_test_split(dataWV, dataOrig['Authors'], test_size=0.2, random_state=12)

In [0]:
import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocaublary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    # 
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0.
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews
    for review in reviews:
       #
       # Print a status message every 1000th review
       if counter%10000. == 0.:
           print ("Review {} of {}" .format(counter, len(reviews)))

       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, num_features)
       #
       # Increment the counter
       counter = counter + 1.
    return reviewFeatureVecs

In [46]:
X_train = getAvgFeatureVecs(data_train, modelW2V, sz)
X_test = getAvgFeatureVecs(data_test, modelW2V, sz)

Review 0.0 of 38309




Review 10000.0 of 38309
Review 20000.0 of 38309
Review 30000.0 of 38309
Review 0.0 of 9578


In [0]:
X_train = np.nan_to_num(X_train) 
X_test = np.nan_to_num(X_test) 

In [48]:
tick = time.time()
#n_estimators=120,max_depth=300
model = RandomForestClassifier(n_estimators=120,max_depth=300 , random_state=1)
model.fit(X_train,y_train)
pr = model.predict(X_test)
print(classification_report(y_test, pr))
time.time() - tick

              precision    recall  f1-score   support

      austen       0.80      0.87      0.84      1003
       bible       0.96      1.00      0.98      4939
       blake       0.83      0.09      0.17        53
      bryant       0.78      0.35      0.48       246
     burgess       0.98      0.69      0.81        58
     carroll       0.94      0.79      0.86       169
  chesterton       0.77      0.86      0.81       803
   edgeworth       0.67      0.68      0.68       758
    melville       0.83      0.71      0.77       554
      milton       0.17      0.33      0.22         3
 shakespeare       0.96      0.84      0.90       496
     whitman       0.72      0.71      0.71       496

    accuracy                           0.88      9578
   macro avg       0.79      0.66      0.69      9578
weighted avg       0.88      0.88      0.88      9578



148.56853532791138

<font color= 'red'>**=> The accuaracy increase from 79% to 88% compared to previous version.**</font>

In [49]:
model = DecisionTreeClassifier(max_depth=1200,random_state=2)
model.fit(X_train,y_train)
pr = model.predict(X_test)
classification_report(y_test, pr)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.69      0.73      0.71      1003
       bible       0.97      0.96      0.96      4939
       blake       0.15      0.17      0.16        53
      bryant       0.32      0.34      0.33       246
     burgess       0.60      0.62      0.61        58
     carroll       0.67      0.59      0.63       169
  chesterton       0.62      0.65      0.63       803
   edgeworth       0.50      0.48      0.49       758
    melville       0.58      0.60      0.59       554
      milton       0.27      1.00      0.43         3
 shakespeare       0.78      0.75      0.76       496
     whitman       0.47      0.46      0.47       496

    accuracy                           0.78      9578
   macro avg       0.55      0.61      0.56      9578
weighted avg       0.79      0.78      0.79      9578



<font color= 'red'>**=> The accuaracy increase from 72% to 78% compared to previous version.**</font>

In [50]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB(var_smoothing=1e-08)
pr = model.fit(X_train, y_train)
pr = model.predict(X_test)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.78      0.76      0.77      1003
       bible       0.99      0.95      0.97      4939
       blake       0.09      0.47      0.15        53
      bryant       0.36      0.39      0.37       246
     burgess       0.36      0.84      0.51        58
     carroll       0.87      0.83      0.85       169
  chesterton       0.68      0.75      0.71       803
   edgeworth       0.58      0.54      0.56       758
    melville       0.73      0.65      0.69       554
      milton       0.19      1.00      0.32         3
 shakespeare       0.94      0.81      0.87       496
     whitman       0.62      0.63      0.62       496

    accuracy                           0.82      9578
   macro avg       0.60      0.72      0.62      9578
weighted avg       0.84      0.82      0.83      9578



<font color= 'red'>**=> The accuaracy increase from 65% to 82% compared to previous version.**</font>

In [51]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train,y_train)
pr = model.predict(X_test)
print(time.time() - tick)
print(classification_report(y_test, pr))

1948.2940249443054
              precision    recall  f1-score   support

      austen       0.82      0.85      0.83      1003
       bible       0.98      0.99      0.99      4939
       blake       0.35      0.21      0.26        53
      bryant       0.67      0.50      0.57       246
     burgess       0.81      0.67      0.74        58
     carroll       0.89      0.84      0.86       169
  chesterton       0.79      0.83      0.81       803
   edgeworth       0.69      0.68      0.68       758
    melville       0.79      0.75      0.77       554
      milton       0.14      0.33      0.20         3
 shakespeare       0.94      0.85      0.89       496
     whitman       0.69      0.74      0.72       496

    accuracy                           0.88      9578
   macro avg       0.71      0.69      0.69      9578
weighted avg       0.88      0.88      0.88      9578



<font color= 'red'>**=> The accuaracy increase from 80% to 88% compared to previous version.**</font>

## Clustering

In [52]:
from gensim.models import Word2Vec
 
from nltk.cluster import KMeansClusterer
 
from sklearn import cluster
from sklearn import metrics
 

tick = time.time() 
# get vector data
X = modelW2V[modelW2V.wv.vocab]
print (X)

 
 
NUM_CLUSTERS = 12
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
 
print ("Cluster id labels for inputted data")
print (labels)
print ("Centroids data")
print (centroids)
 
print ("Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):")
print (kmeans.score(X))
 
silhouette_score = metrics.silhouette_score(X, labels, metric='euclidean')
 
print ("Silhouette_score: ")
print (silhouette_score)
time.time() - tick

[[ 1.6149249   1.2944665   0.9105315  ...  3.7473667   1.7258674
  -0.26087084]
 [ 1.9661744  -1.0254343   3.1355164  ...  1.4374299  -1.9938434
   0.04409772]
 [-0.7695444   0.11796419 -0.59558165 ... -0.13437365 -0.32220805
  -0.0491445 ]
 ...
 [ 0.38510218 -0.00832579  0.18448576 ...  0.06716139  0.17125045
  -0.00822563]
 [-0.28247282  0.15865894 -0.83941036 ... -0.06695636  0.21947138
   0.05091806]
 [ 0.23323424  0.4981548  -0.74384993 ... -0.75298643 -0.38904727
   0.8003735 ]]


  # This is added back by InteractiveShellApp.init_path()


Cluster id labels for inputted data
[0 3 1 ... 1 1 1]
Centroids data
[[-0.6647718   0.64872146 -0.7813641  ...  0.31891087  1.1888773
  -0.28204998]
 [-0.18681288  0.06621408 -0.33785132 ... -0.05359645  0.0179157
   0.05084041]
 [-0.20481342  0.05053772 -0.20976686 ...  0.07560717  0.320293
  -0.12872763]
 ...
 [-0.72423553  0.9641554  -3.0884938  ...  0.02187857  0.4182794
   0.42999017]
 [-2.9566374   0.8212727  -3.3931105  ... -0.68058574  3.9126797
  -0.24639897]
 [-1.1135857   0.11213609 -1.7273328  ...  0.10066728  0.95346
  -0.0821972 ]]
Score (Opposite of the value of X on the K-means objective which is Sum of distances of samples to their closest cluster center):
-521983.94
Silhouette_score: 
-0.013519149


10.319035530090332

In [53]:
words = list(modelW2V.wv.vocab.keys())
#labels
print(len(words))
print(len(labels))
AuthorClass  = {"words": words,
                "labels": labels}
AuthorClass_df = pd.DataFrame(AuthorClass)    
AuthorClass_df            


10103
10103


Unnamed: 0,words,labels
0,emma,0
1,jane,3
2,volume,1
3,chapter,2
4,woodhouse,3
...,...,...
10098,brooklyn,1
10099,growths,1
10100,bugles,1
10101,myriad,1


In [54]:
AuthorClass_df[labels == 7]

Unnamed: 0,words,labels
31,father,7
188,seven,7
374,shall,7
607,year,7
610,son,7
...,...,...
6952,idols,7
7273,jerusalem,7
7385,david,7
7680,babylon,7


<font color= 'red'>**=>Clustering the most used words of 12 authors**</font>



## LSA

In [0]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Paras'], data['Authors'], test_size=0.2, random_state=12)


The challenge is that the matrix is very sparse (or high dimension) and noisy (or include lots of low frequency word). So truncated SVD is adopted to reduce dimension.



The idea of SVD is finding the most valuable information and using lower dimension t to represent same thing. [ref](https://github.com/makcedward/nlp/blob/master/sample/nlp-lsa_lda.ipynb)

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def build_lsa(x_train, x_test, dim=10):
    tfidf_vec = TfidfVectorizer(use_idf=True, norm='l2')
    svd = TruncatedSVD(n_components=dim)
    
    transformed_x_train = tfidf_vec.fit_transform(x_train)
    transformed_x_test = tfidf_vec.transform(x_test)
    
    print('TF-IDF output shape:', transformed_x_train.shape)
    
    x_train_svd = svd.fit_transform(transformed_x_train)
    x_test_svd = svd.transform(transformed_x_test)
    
    print('LSA output shape:', x_train_svd.shape)
    
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Sum of explained variance ratio: %d%%" % (int(explained_variance * 100)))
    
    return tfidf_vec, svd, tfidf_vec.get_feature_names() , x_train_svd, x_test_svd

def display_word_distribution_lsa(model, feature_names, n_word):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)

tfidf_vec, svd, feature_names, x_train_lsa, x_test_lsa = build_lsa(X_train, X_test)

TF-IDF output shape: (38309, 38604)
LSA output shape: (38309, 10)
Sum of explained variance ratio: 3%


In [57]:
display_word_distribution_lsa(model=svd, feature_names=feature_names, n_word=10)

Topic 0:
['shall', 'unto', 'lord', 'thou', 'thy', 'said', 'thee', 'god', 'ye', 'man']
Topic 1:
['thou', 'thy', 'thee', 'shalt', 'hast', 'thine', 'unto', 'art', 'god', 'lord']
Topic 2:
['said', 'would', 'could', 'one', 'little', 'mr', 'man', 'well', 'know', 'mrs']
Topic 3:
['unto', 'lord', 'ye', 'israel', 'god', 'king', 'children', 'moses', 'came', 'hath']
Topic 4:
['said', 'ye', 'unto', 'thou', 'shall', 'go', 'thee', 'jesus', 'answered', 'say']
Topic 5:
['ye', 'god', 'thou', 'thy', 'know', 'would', 'us', 'christ', 'good', 'things']
Topic 6:
['chapter', 'thou', 'ye', 'shalt', '13', '22', 'came', '21', 'son', 'went']
Topic 7:
['lord', 'said', 'thy', 'chapter', 'god', 'thee', 'hath', 'let', 'us', 'know']
Topic 8:
['thy', 'thee', 'ye', 'king', 'unto', '11', 'come', 'son', 'came', '119']
Topic 9:
['king', 'said', 'ye', 'thy', 'son', 'thou', 'david', 'israel', 'house', 'judah']



<font color= 'red'>
    
**=> Topic 5: maybe somethhing about Jesus topic**
    
**=> Topic 3: about kingdoom topic**</font>



## LDA

In [58]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def build_lda(x_train, x_test, num_of_topic=10):
    vec = CountVectorizer()
    transformed_x_train = vec.fit_transform(x_train)
    transformed_x_test = vec.transform(x_test)
    feature_names = vec.get_feature_names()

    lda = LatentDirichletAllocation(
        n_components=num_of_topic, max_iter=5, 
        learning_method='online', random_state=0)
    x_train_lda = lda.fit_transform(transformed_x_train)
    x_test_lda =  lda.fit(transformed_x_test)

    return lda, vec, feature_names, x_train_lda, x_test_lda

def display_word_distribution(model, feature_names, n_word):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)

lda_model, vec, feature_names, x_train_lda,  x_test_lda= build_lda(X_train,X_test)
display_word_distribution(
    model=lda_model, feature_names=feature_names, 
    n_word=10)

Topic 0:
['blood', 'offering', 'without', 'congregation', 'burnt', 'kill', 'master', 'chief', 'wait', 'receive']
Topic 1:
['could', 'must', 'mrs', 'would', 'said', 'might', 'much', 'one', 'miss', 'without']
Topic 2:
['sake', 'enter', 'enemy', 'grass', '44', 'built', 'prepared', 'edge', 'moab', 'sick']
Topic 3:
['whale', 'hundred', 'ahab', 'twenty', 'three', 'boat', 'thousand', 'sea', 'whales', 'length']
Topic 4:
['shall', 'unto', 'lord', 'thou', 'thy', 'god', 'thee', 'ye', 'said', 'king']
Topic 5:
['34', 'glory', 'chapter', '40', 'judgment', 'places', 'disciples', 'song', 'vain', 'mountain']
Topic 6:
['fast', 'appearance', 'arthur', 'joab', 'armed', 'perceive', 'perceived', 'wounded', 'fir', 'exeunt']
Topic 7:
['said', 'like', 'one', 'mr', 'see', 'little', 'would', 'well', 'old', 'man']
Topic 8:
['book', 'ham', 'haue', 'laid', 'worship', 'st', 'grey', 'names', 'ha', 'bones']
Topic 9:
['stones', 'silent', 'devil', 'sons', '119', 'box', 'justice', 'jonathan', 'forgotten', 'row']


<font color= 'red'>

**=> Topic 0: violent topic**
    
**=> Topic 3: above age or somthing related to years, era**
</font>


## Conclusion

**=>BagofWords: Gradient boosting and Randomforest give the best result over 80% of accuracy**


**=>TF-IDF: The accuracy of the best is still 83%, but accuaracy of Naive Bayes and Decision tree are improved when comparing to Bagofword**

**=>Word2Vec: The performance worse than the others NLP techniques, the best case is 80% while the worse one is 65%**

**Still not implemnet LDA to show the top 10 words**


**=> Wait for next week to reference project of other peoples in class to improve and revise my project**

<font color= 'red'> Bag of words

**=> In previous version the best one is gradient bossting 82%. However, the best on in this version is decision tree which is up to 85%**</font>

<font color= 'red'> TF-IDF
    
**=> In previous version the best one is gradient bossting 82%. However, the best on in this version is decision tree which is up to 85%**</font>

<font color= 'red'> Word2Vec

**=> In previous version the best one is gradient bossting 80%. However, the best on in this version are decision tree and gradient boosting which is up to 88%**

**=> This is also give the best result for all technique as we expect from the theory**
</font>

<font color= 'red'>**=>Clustering the most used words of 12 authors**</font>


<font color= 'red'> For LSA
    
**=> Topic 5: maybe somethhing about Jesus topic**
    
**=> Topic 3: about kingdoom topic**</font>

<font color= 'red'> For LDA

**=> Topic 0: violent topic**
    
**=> Topic 3: above age or somthing related to years, era**
</font>