
# <center> Author Classification </center>

## Introduction

This project uses NLP techniques to identify the author of texts from the Project Gutenberg corpus shipped with NLTK.

1. Pre-process the data with NLTK (tokenized paragraphs, stop-word removal).
2. Explore the data.
3. Using bag of words, apply supervised models: Naive Bayes, Decision Tree, Random Forest, and Gradient Boosting.
4. Same as step 3, but with TF-IDF features.
5. Same as step 3, but with word2vec features.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Explore-Data" data-toc-modified-id="Explore-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore Data</a></span></li><li><span><a href="#Prepare-Data" data-toc-modified-id="Prepare-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Prepare Data</a></span></li><li><span><a href="#Bag-of-words" data-toc-modified-id="Bag-of-words-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Bag of words</a></span></li><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>TF-IDF</a></span></li><li><span><a href="#Word2vec" data-toc-modified-id="Word2vec-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Word2vec</a></span></li><li><span><a href="#LDA" data-toc-modified-id="LDA-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>LDA</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Explore Data

In [276]:
import nltk
from nltk.corpus import gutenberg
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from itertools import chain

nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-3-more-fun-with-word-vectors

In [277]:
Novels = gutenberg.fileids()
Novels

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Each file ID consists of the author's name followed by the title of the book.

In [0]:
numNovels = len(gutenberg.fileids())

There are 18 books in this corpus.

In [279]:
Authors = []
for fileid in Novels:
  # the author is the part of the file ID before the first hyphen
  author = fileid.split('-')[0]
  # keep each author only once, in order of first appearance
  if author not in Authors:
    Authors.append(author)
print(len(Authors))
Authors

12


['austen',
 'bible',
 'blake',
 'bryant',
 'burgess',
 'carroll',
 'chesterton',
 'edgeworth',
 'melville',
 'milton',
 'shakespeare',
 'whitman']

The 18 books above come from 12 authors (`bible` is treated as an author label).

In [280]:
for i in Novels:
  print(i.split('.')[0] + " has " + str(len(gutenberg.words(i))) + ' words'  )


austen-emma has 192427 words
austen-persuasion has 98171 words
austen-sense has 141576 words
bible-kjv has 1010654 words
blake-poems has 8354 words
bryant-stories has 55563 words
burgess-busterbrown has 18963 words
carroll-alice has 34110 words
chesterton-ball has 96996 words
chesterton-brown has 86063 words
chesterton-thursday has 69213 words
edgeworth-parents has 210663 words
melville-moby_dick has 260819 words
milton-paradise has 96825 words
shakespeare-caesar has 25833 words
shakespeare-hamlet has 37360 words
shakespeare-macbeth has 23140 words
whitman-leaves has 154883 words


The results above show the total number of words in each book (punctuation marks count as tokens).

The counts are collected in a data frame, which is easier to read and summarizes the number of words, sentences, and vocabulary entries per book.

In [0]:
num_word = []
num_sent = []
num_vocab = []
for fileid in gutenberg.fileids():
    num_word.append(len(gutenberg.words(fileid)) )
    num_sent.append(len(gutenberg.sents(fileid)) )
    num_vocab.append(len(set(gutenberg.words(fileid))) )



In [0]:
suma = pd.DataFrame( index= Novels, columns = ['Words','Sentences','Vocabulary'], data = np.array([num_word, num_sent,num_vocab]).T )  


In [283]:
suma


| | Words | Sentences | Vocabulary |
|---|---|---|---|
| austen-emma.txt | 192427 | 7752 | 7811 |
| austen-persuasion.txt | 98171 | 3747 | 6132 |
| austen-sense.txt | 141576 | 4999 | 6833 |
| bible-kjv.txt | 1010654 | 30103 | 13769 |
| blake-poems.txt | 8354 | 438 | 1820 |
| bryant-stories.txt | 55563 | 2863 | 4420 |
| burgess-busterbrown.txt | 18963 | 1054 | 1764 |
| carroll-alice.txt | 34110 | 1703 | 3016 |
| chesterton-ball.txt | 96996 | 4779 | 8947 |
| chesterton-brown.txt | 86063 | 3806 | 8299 |


**`bible-kjv` has by far the most words, while `blake-poems` has the fewest. This is expected, since a poetry collection is shorter than a novel.**

Now pick one book to show its content.

**=> The text is raw data: it contains many symbols such as `\n`.**

In [285]:
gutenberg.paras('austen-emma.txt')[:2]

[[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']], [['VOLUME', 'I']]]

**=> Because the number of sentences is very large, this project works with paragraphs (`paras`), each of which is a group of sentences, to reduce the number of samples.**
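
To make the structure concrete: `gutenberg.paras` returns each paragraph as a list of sentences, and each sentence as a list of tokens. The sketch below uses a hand-written paragraph (the tokens are made up for illustration, not taken from the corpus) to show how such a paragraph can be flattened into a single token list.

```python
from itertools import chain

# A hypothetical paragraph in the same nested shape as gutenberg.paras():
# a list of sentences, each sentence a list of tokens.
para = [['It', 'was', 'late', '.'], ['She', 'went', 'home', '.']]

# Flatten the sentences into one list of tokens for the whole paragraph.
tokens = list(chain.from_iterable(para))

print(len(para), "sentences,", len(tokens), "tokens")
print(tokens)
```

Working at paragraph level means one sample per outer list, regardless of how many sentences it contains.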

In [286]:
s = 0
for i in Novels:
  print(i.split('.')[0] + " has " +  str(len(gutenberg.paras(i)))  + " paragraphs")
  s = s + len(gutenberg.paras(i))
print(s)


austen-emma has 2371 paragraphs
austen-persuasion has 1032 paragraphs
austen-sense has 1862 paragraphs
bible-kjv has 24608 paragraphs
blake-poems has 284 paragraphs
bryant-stories has 1194 paragraphs
burgess-busterbrown has 266 paragraphs
carroll-alice has 817 paragraphs
chesterton-ball has 1606 paragraphs
chesterton-brown has 1161 paragraphs
chesterton-thursday has 1288 paragraphs
edgeworth-parents has 3726 paragraphs
melville-moby_dick has 2793 paragraphs
milton-paradise has 29 paragraphs
shakespeare-caesar has 744 paragraphs
shakespeare-hamlet has 950 paragraphs
shakespeare-macbeth has 678 paragraphs
whitman-leaves has 2478 paragraphs
47887


**Each sample will have 500 paragraphs, to reduce the number of samples and process the data faster. The check below lists the books that have fewer than 500 paragraphs.**

In [287]:
for i in Novels:
  if (len(gutenberg.paras(i)) < 500):
    print(i.split('.')[0] + " has " +  str(len(gutenberg.paras(i)))  + " paragraphs")

blake-poems has 284 paragraphs
burgess-busterbrown has 266 paragraphs
milton-paradise has 29 paragraphs


## Prepare Data

Build a dataset from the books with three columns: titles, paragraphs, and authors.

In [288]:
import time
from itertools import chain

# Titles, Paras, Authors: one entry per paragraph
Titles = []
Paras = []
Authors = []
tick = time.time()

for fileid in gutenberg.fileids():
    # file IDs look like 'austen-emma.txt'
    author = fileid.split('-')[0]
    title = fileid.split('-')[1].split('.')[0]
    for para in gutenberg.paras(fileid):
        Authors.append(author)
        Titles.append(title)
        # a paragraph is a list of sentences; flatten it into one list of tokens
        Paras.append(list(chain.from_iterable(para)))

print(time.time() - tick)
  

4.847790479660034


In [289]:
dataOrig = pd.DataFrame({ 'Titles' : Titles,
                      'Paras':    Paras,
                      'Authors': Authors})
dataOrig

| | Titles | Paras | Authors |
|---|---|---|---|
| 0 | emma | `[[, Emma, by, Jane, Austen, 1816, ]]` | austen |
| 1 | emma | `[VOLUME, I]` | austen |
| 2 | emma | `[CHAPTER, I]` | austen |
| 3 | emma | `[Emma, Woodhouse, ,, handsome, ,, clever, ,, a...` | austen |
| 4 | emma | `[She, was, the, youngest, of, the, two, daught...` | austen |
| ... | ... | ... | ... |
| 47882 | leaves | `[}, Good, -, Bye, My, Fancy, !]` | whitman |
| 47883 | leaves | `[Good, -, bye, my, Fancy, !, Farewell, dear, m...` | whitman |
| 47884 | leaves | `[Now, for, my, last, --, let, me, look, back, ...` | whitman |
| 47885 | leaves | `[Long, have, we, lived, ,, joy, ', d, ,, cares...` | whitman |


Remove English stop words from each paragraph.

In [0]:
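data = dataOrig.copy()
stop_words = set(stopwords.words('english'))

# Join the tokens of each paragraph into one string, dropping stop words.
# The comparison is case-sensitive, so capitalized words such as "She" are kept.
# Assigning the whole column avoids chained indexing (data["Paras"][i] = ...).
data["Paras"] = data["Paras"].apply(
    lambda tokens: " ".join(w for w in tokens if w not in stop_words))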


In [291]:
data.head()

| | Titles | Paras | Authors |
|---|---|---|---|
| 0 | emma | `[ Emma Jane Austen 1816 ]` | austen |
| 1 | emma | `VOLUME I` | austen |
| 2 | emma | `CHAPTER I` | austen |
| 3 | emma | `Emma Woodhouse , handsome , clever , rich , c...` | austen |
| 4 | emma | `She youngest two daughters affectionate , ind...` | austen |


**=> After filtering, the data is cleaner.**
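
The filter compares each token with the stop-word list exactly as written, and NLTK's English list is lower-case, so capitalized stop words such as "She" survive (see row 4 above). The sketch below contrasts the two behaviours. The tiny stop-word set is a hypothetical stand-in for `stopwords.words('english')`, and whether a case-insensitive filter would help classification here has not been tested.

```python
# Hypothetical, tiny stop-word set standing in for stopwords.words('english').
stop_words = {'she', 'was', 'the', 'of'}
tokens = ['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters']

# Case-sensitive filter, as in the cell above: 'She' is kept.
case_sensitive = [w for w in tokens if w not in stop_words]

# Case-insensitive variant: compare the lower-cased token instead.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['She', 'youngest', 'two', 'daughters']
print(case_insensitive)  # ['youngest', 'two', 'daughters']
```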

In [292]:
data['Authors'].value_counts()

bible          24608
austen          5265
chesterton      4055
edgeworth       3726
melville        2793
whitman         2478
shakespeare     2372
bryant          1194
carroll          817
blake            284
burgess          266
milton            29
Name: Authors, dtype: int64

**Number of paragraphs per author => the data is imbalanced (milton has only 29 paragraphs, while bible has 24608).**

Split the data into 80% training and 20% test sets.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['Paras'], data['Authors'], test_size=0.2, random_state=12)

In [294]:
print("training shape: {}{}".format(X_train.shape,y_train.shape))
print("testing shape : {}{}".format(X_test.shape,y_test.shape))

training shape: (38309,)(38309,)
testing shape : (9578,)(9578,)
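
Because some authors have very few paragraphs, a plain random split can leave a class with only a handful of test samples. One option, not used in this notebook, is a stratified split, which keeps the class proportions the same in both sets. A minimal sketch on made-up labels (the toy data is hypothetical):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 'a', 10 of class 'b'.
texts = ["doc {}".format(i) for i in range(100)]
labels = ['a'] * 90 + ['b'] * 10

# stratify=labels keeps the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=12, stratify=labels)

print(Counter(y_tr))
print(Counter(y_te))
```

With stratification the rare class is represented in the test set in the same proportion as in the full data, which makes its precision and recall less dependent on the luck of the split.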


## Bag of words

In [295]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
count_vect = CountVectorizer(max_features = 5000)
count_vect.fit(data['Paras'])
X_train_counts = count_vect.transform(X_train)
X_test_counts = count_vect.transform(X_test)
X_train_counts.shape

(38309, 5000)

In [296]:
print("training shape: {}{}".format(X_train_counts.shape,y_train.shape))
print("testing shape : {}{}".format(X_test_counts.shape,y_test.shape))

training shape: (38309, 5000)(38309,)
testing shape : (9578, 5000)(9578,)


The bag of words is limited to 5000 terms (`max_features = 5000`).

In [297]:
model = RandomForestClassifier(n_estimators=20, random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.69      0.83      0.75      1003
       bible       0.93      0.99      0.96      4939
       blake       0.36      0.09      0.15        53
      bryant       0.56      0.43      0.48       246
     burgess       0.97      0.64      0.77        58
     carroll       0.92      0.63      0.75       169
  chesterton       0.73      0.73      0.73       803
   edgeworth       0.71      0.57      0.63       758
    melville       0.73      0.61      0.67       554
      milton       0.50      0.67      0.57         3
 shakespeare       0.75      0.82      0.78       496
     whitman       0.63      0.44      0.51       496

    accuracy                           0.83      9578
   macro avg       0.71      0.62      0.65      9578
weighted avg       0.82      0.83      0.82      9578



In [298]:
model = DecisionTreeClassifier(random_state=2)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.66      0.74      0.69      1003
       bible       0.94      0.94      0.94      4939
       blake       0.24      0.15      0.18        53
      bryant       0.38      0.45      0.41       246
     burgess       0.84      0.66      0.74        58
     carroll       0.83      0.66      0.73       169
  chesterton       0.64      0.63      0.64       803
   edgeworth       0.64      0.55      0.59       758
    melville       0.60      0.58      0.59       554
      milton       0.33      0.33      0.33         3
 shakespeare       0.60      0.73      0.66       496
     whitman       0.48      0.43      0.45       496

    accuracy                           0.78      9578
   macro avg       0.60      0.57      0.58      9578
weighted avg       0.78      0.78      0.78      9578



In [299]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# GaussianNB needs dense input, so convert the sparse count matrix
model.fit(X_train_counts.toarray(), y_train)
pr = model.predict(X_test_counts.toarray())
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.91      0.78      0.84      1003
       bible       0.99      0.92      0.95      4939
       blake       0.10      0.40      0.16        53
      bryant       0.38      0.50      0.43       246
     burgess       0.31      0.69      0.43        58
     carroll       0.40      0.66      0.50       169
  chesterton       0.81      0.64      0.72       803
   edgeworth       0.68      0.78      0.72       758
    melville       0.76      0.67      0.71       554
      milton       0.07      0.67      0.12         3
 shakespeare       0.81      0.88      0.85       496
     whitman       0.46      0.59      0.52       496

    accuracy                           0.82      9578
   macro avg       0.56      0.68      0.58      9578
weighted avg       0.86      0.82      0.83      9578



In [300]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(time.time() - tick)
print(classification_report(y_test, pr))

83.10484266281128
              precision    recall  f1-score   support

      austen       0.88      0.74      0.81      1003
       bible       0.79      1.00      0.88      4939
       blake       0.45      0.17      0.25        53
      bryant       0.85      0.52      0.64       246
     burgess       0.98      0.88      0.93        58
     carroll       0.93      0.74      0.82       169
  chesterton       0.91      0.67      0.77       803
   edgeworth       0.95      0.64      0.77       758
    melville       0.90      0.68      0.78       554
      milton       0.40      0.67      0.50         3
 shakespeare       0.98      0.74      0.84       496
     whitman       0.80      0.34      0.48       496

    accuracy                           0.83      9578
   macro avg       0.82      0.65      0.70      9578
weighted avg       0.84      0.83      0.82      9578



**=> Gradient Boosting and Random Forest give the best result, with 83% accuracy. Naive Bayes reaches 82% and Decision Tree 78%.**
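
Two details of the setup above deserve a hedged note. The vectorizer is fitted on the full dataset, so its vocabulary also reflects the test paragraphs, and `GaussianNB` needs a dense array, whereas `MultinomialNB` is designed for count features and accepts sparse input. The sketch below shows one possible alternative, a scikit-learn `Pipeline` fitted on the training split only. The toy sentences and labels are invented, and this variant was not measured on the Gutenberg data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only.
train_texts = ["whale ship sea captain", "sea whale harpoon",
               "ball dance letter", "letter visit dance"]
train_labels = ["melville", "melville", "austen", "austen"]
test_texts = ["captain harpoon sea", "dance visit"]

# The pipeline learns the vocabulary from the training texts only, then
# feeds the sparse count matrix straight into MultinomialNB.
clf = Pipeline([
    ("counts", CountVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)
print(pred)
```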


Some words, such as “the”, appear many times, and their large counts are not very meaningful in the encoded vectors.

## TF-IDF

An alternative to raw counts is to weight word frequencies; by far the most popular method is TF-IDF.

Term Frequency: how often a given word appears within a document.

Inverse Document Frequency: downscales words that appear in many documents.

=> TF-IDF scores highlight words that are more informative, i.e. frequent within a document but rare across documents.
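
As a worked illustration, the sketch below computes TF-IDF weights by hand for a toy corpus. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The three documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
docs = [["whale", "sea"], ["sea", "ship"], ["sea", "dance"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(docs[0])
print(idf)
print(weights)
```

Here "sea" occurs in every document, so its idf takes the minimum value of 1, and in the first document "whale" ends up with a higher weight than "sea" even though both occur once.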

In [301]:
from sklearn.feature_extraction.text import TfidfVectorizer
# create the transform
vectorizer = TfidfVectorizer(max_features= 5000)
# tokenize and build vocab
vectorizer.fit(data['Paras'])
# summarize
#print(vectorizer.vocabulary_)
#print(vectorizer.idf_)
# encode document
X_train_counts = vectorizer.transform(X_train)
X_test_counts = vectorizer.transform(X_test)
# summarize encoded vector
#print(vector.shape)
#print(vector.toarray())
X_train_counts.shape

(38309, 5000)

In [302]:
print("training shape: {}{}".format(X_train_counts.shape,y_train.shape))
print("testing shape : {}{}".format(X_test_counts.shape,y_test.shape))

training shape: (38309, 5000)(38309,)
testing shape : (9578, 5000)(9578,)


In [303]:
model = RandomForestClassifier(n_estimators=20, random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.70      0.84      0.76      1003
       bible       0.92      0.99      0.95      4939
       blake       0.78      0.13      0.23        53
      bryant       0.66      0.38      0.48       246
     burgess       0.97      0.67      0.80        58
     carroll       0.96      0.62      0.76       169
  chesterton       0.74      0.72      0.73       803
   edgeworth       0.73      0.58      0.65       758
    melville       0.83      0.63      0.72       554
      milton       0.50      0.67      0.57         3
 shakespeare       0.79      0.80      0.79       496
     whitman       0.61      0.48      0.54       496

    accuracy                           0.83      9578
   macro avg       0.76      0.63      0.66      9578
weighted avg       0.83      0.83      0.83      9578



In [304]:
model = DecisionTreeClassifier(random_state=2)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.70      0.73      0.72      1003
       bible       0.93      0.95      0.94      4939
       blake       0.13      0.09      0.11        53
      bryant       0.47      0.45      0.46       246
     burgess       0.90      0.64      0.75        58
     carroll       0.79      0.67      0.72       169
  chesterton       0.64      0.64      0.64       803
   edgeworth       0.63      0.59      0.61       758
    melville       0.65      0.61      0.63       554
      milton       0.25      0.33      0.29         3
 shakespeare       0.65      0.68      0.66       496
     whitman       0.44      0.43      0.44       496

    accuracy                           0.79      9578
   macro avg       0.60      0.57      0.58      9578
weighted avg       0.78      0.79      0.79      9578



In [305]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# GaussianNB needs dense input, so convert the sparse TF-IDF matrix
model.fit(X_train_counts.toarray(), y_train)
pr = model.predict(X_test_counts.toarray())
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.86      0.81      0.83      1003
       bible       0.99      0.92      0.96      4939
       blake       0.17      0.38      0.23        53
      bryant       0.35      0.48      0.41       246
     burgess       0.19      0.67      0.29        58
     carroll       0.41      0.63      0.49       169
  chesterton       0.81      0.71      0.75       803
   edgeworth       0.77      0.74      0.75       758
    melville       0.69      0.72      0.70       554
      milton       0.05      0.67      0.10         3
 shakespeare       0.86      0.86      0.86       496
     whitman       0.52      0.56      0.54       496

    accuracy                           0.83      9578
   macro avg       0.55      0.68      0.58      9578
weighted avg       0.86      0.83      0.84      9578



In [306]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train_counts,y_train)
pr = model.predict(X_test_counts)
print(time.time() - tick)
print(classification_report(y_test, pr))

259.0457031726837
              precision    recall  f1-score   support

      austen       0.87      0.77      0.82      1003
       bible       0.79      1.00      0.88      4939
       blake       0.27      0.11      0.16        53
      bryant       0.81      0.52      0.64       246
     burgess       1.00      0.79      0.88        58
     carroll       0.88      0.73      0.80       169
  chesterton       0.91      0.67      0.77       803
   edgeworth       0.94      0.65      0.77       758
    melville       0.92      0.67      0.77       554
      milton       0.33      0.33      0.33         3
 shakespeare       0.98      0.73      0.84       496
     whitman       0.82      0.34      0.49       496

    accuracy                           0.83      9578
   macro avg       0.79      0.61      0.68      9578
weighted avg       0.84      0.83      0.82      9578




**=> The best accuracy is still 83%, but the accuracy of Naive Bayes (82% to 83%) and Decision Tree (78% to 79%) improves slightly compared with bag of words.**

## Word2vec

In [0]:
from gensim.models.word2vec import Word2Vec
from string import punctuation
punc = set(punctuation)

In [308]:
dataWV = dataOrig['Paras'].copy()
dataWV

0                     [[, Emma, by, Jane, Austen, 1816, ]]
1                                              [VOLUME, I]
2                                             [CHAPTER, I]
3        [Emma, Woodhouse, ,, handsome, ,, clever, ,, a...
4        [She, was, the, youngest, of, the, two, daught...
                               ...                        
47882                      [}, Good, -, Bye, My, Fancy, !]
47883    [Good, -, bye, my, Fancy, !, Farewell, dear, m...
47884    [Now, for, my, last, --, let, me, look, back, ...
47885    [Long, have, we, lived, ,, joy, ', d, ,, cares...
47886    [Yet, let, me, not, be, too, hasty, ,, Long, i...
Name: Paras, Length: 47887, dtype: object

In [309]:
stop_words = set(stopwords.words('english')) 
for i in range(len(dataWV)):
  words = []
  for w in dataWV[i]:
    if not( ( w in stop_words) or (w in punc )) :
      words.append(w) 
  dataWV[i] = words
dataWV


0                               [Emma, Jane, Austen, 1816]
1                                              [VOLUME, I]
2                                             [CHAPTER, I]
3        [Emma, Woodhouse, handsome, clever, rich, comf...
4        [She, youngest, two, daughters, affectionate, ...
                               ...                        
47882                               [Good, Bye, My, Fancy]
47883    [Good, bye, Fancy, Farewell, dear, mate, dear,...
47884    [Now, last, --, let, look, back, moment, The, ...
47885    [Long, lived, joy, caress, together, Delightfu...
47886    [Yet, let, hasty, Long, indeed, lived, slept, ...
Name: Paras, Length: 47887, dtype: object

In [0]:
sz = 200
# gensim 3.x argument names; gensim 4 renamed size -> vector_size and iter -> epochs
model = Word2Vec(dataWV, size=sz, window=5, min_count=1, workers=4, iter=5)

In [0]:
#model.wv.most_similar(positive="girl", topn =3)
#len(model.wv.vocab)

In [0]:
from sklearn.model_selection import train_test_split
data_train, data_test, y_train, y_test = train_test_split(dataWV, dataOrig['Authors'], test_size=0.2, random_state=12)

In [0]:
import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # index2word is a list that contains the names of the words in 
    # the model's vocabulary (gensim 3.x; renamed index_to_key in gensim 4).
    # Convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the paragraph and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model.wv[word])
    # 
    # Divide the result by the number of words to get the average.
    # An empty paragraph gives 0/0 = NaN, which is replaced by zeros later
    featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(paragraphs, model, num_features):
    # Given a set of paragraphs (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0.
    # 
    # Preallocate a 2D numpy array, for speed
    paraFeatureVecs = np.zeros((len(paragraphs),num_features),dtype="float32")
    # 
    # Loop through the paragraphs
    for paragraph in paragraphs:
       #
       # Print a status message every 10000th paragraph
       if counter%10000. == 0.:
           print ("Review {} of {}" .format(counter, len(paragraphs)))

       # Call the function (defined above) that makes average feature vectors
       paraFeatureVecs[int(counter)] = makeFeatureVec(paragraph, model, num_features)
       #
       # Increment the counter
       counter = counter + 1.
    return paraFeatureVecs

In [314]:
X_train = getAvgFeatureVecs(data_train, model, sz)
X_test = getAvgFeatureVecs(data_test, model, sz)

Review 0.0 of 38309




Review 10000.0 of 38309
Review 20000.0 of 38309
Review 30000.0 of 38309
Review 0.0 of 9578


In [0]:
X_train = np.nan_to_num(X_train) 
X_test = np.nan_to_num(X_test) 
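
The averaging step, and the reason `np.nan_to_num` is needed afterwards, can be seen on a small example. The sketch below uses a hypothetical three-word embedding table in place of the trained Word2Vec model. As a design variation, a paragraph with no known words is mapped to a zero vector directly instead of producing NaN.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for model.wv.
embeddings = {
    "sea": np.array([1.0, 0.0], dtype="float32"),
    "whale": np.array([0.0, 1.0], dtype="float32"),
    "ship": np.array([1.0, 1.0], dtype="float32"),
}

def average_vector(words, embeddings, num_features):
    """Average the vectors of the known words; zeros if none are known."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

print(average_vector(["sea", "whale"], embeddings, 2))  # [0.5 0.5]
print(average_vector([], embeddings, 2))                # [0. 0.]
```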

In [316]:
tick = time.time()
model = RandomForestClassifier(n_estimators=20, random_state=1)
model.fit(X_train,y_train)
pr = model.predict(X_test)
print(classification_report(y_test, pr))
time.time() - tick

              precision    recall  f1-score   support

      austen       0.67      0.79      0.72      1003
       bible       0.98      1.00      0.99      4939
       blake       0.41      0.17      0.24        53
      bryant       0.47      0.21      0.29       246
     burgess       0.50      0.09      0.15        58
     carroll       0.64      0.28      0.39       169
  chesterton       0.50      0.61      0.55       803
   edgeworth       0.41      0.37      0.39       758
    melville       0.57      0.48      0.52       554
      milton       1.00      0.67      0.80         3
 shakespeare       0.76      0.79      0.77       496
     whitman       0.61      0.64      0.62       496

    accuracy                           0.79      9578
   macro avg       0.63      0.51      0.54      9578
weighted avg       0.78      0.79      0.78      9578



17.42493462562561

In [317]:
model = DecisionTreeClassifier(random_state=2)
model.fit(X_train,y_train)
pr = model.predict(X_test)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.64      0.64      0.64      1003
       bible       0.98      0.98      0.98      4939
       blake       0.12      0.13      0.13        53
      bryant       0.19      0.21      0.20       246
     burgess       0.14      0.16      0.15        58
     carroll       0.25      0.26      0.26       169
  chesterton       0.40      0.38      0.39       803
   edgeworth       0.30      0.31      0.31       758
    melville       0.40      0.38      0.39       554
      milton       0.33      1.00      0.50         3
 shakespeare       0.67      0.66      0.66       496
     whitman       0.51      0.52      0.52       496

    accuracy                           0.72      9578
   macro avg       0.41      0.47      0.43      9578
weighted avg       0.73      0.72      0.72      9578



In [318]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
pr = model.predict(X_test)
print(classification_report(y_test, pr))

              precision    recall  f1-score   support

      austen       0.60      0.68      0.64      1003
       bible       0.98      0.92      0.95      4939
       blake       0.10      0.38      0.16        53
      bryant       0.18      0.20      0.19       246
     burgess       0.08      0.38      0.13        58
     carroll       0.28      0.44      0.34       169
  chesterton       0.48      0.13      0.21       803
   edgeworth       0.32      0.22      0.26       758
    melville       0.21      0.13      0.16       554
      milton       0.01      1.00      0.02         3
 shakespeare       0.58      0.45      0.51       496
     whitman       0.26      0.50      0.34       496

    accuracy                           0.65      9578
   macro avg       0.34      0.45      0.33      9578
weighted avg       0.70      0.65      0.66      9578



In [324]:
tick = time.time()
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train,y_train)
pr = model.predict(X_test)
print(time.time() - tick)
print(classification_report(y_test, pr))

972.6748280525208
              precision    recall  f1-score   support

      austen       0.71      0.79      0.75      1003
       bible       0.99      0.99      0.99      4939
       blake       0.35      0.21      0.26        53
      bryant       0.49      0.25      0.33       246
     burgess       0.52      0.19      0.28        58
     carroll       0.56      0.38      0.45       169
  chesterton       0.55      0.61      0.58       803
   edgeworth       0.45      0.42      0.43       758
    melville       0.59      0.53      0.56       554
      milton       0.00      0.00      0.00         3
 shakespeare       0.74      0.79      0.76       496
     whitman       0.60      0.70      0.65       496

    accuracy                           0.80      9578
   macro avg       0.55      0.49      0.50      9578
weighted avg       0.80      0.80      0.80      9578



**=> Word2Vec performs worse than the other techniques: the best model (Gradient Boosting) reaches 80% accuracy, while the worst (Naive Bayes) reaches only 65%.**

## LDA

In [0]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(dataWV)

# Converting the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in dataWV]

In [322]:
len(dictionary)

50957

In [323]:
tick = time.time()
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=7, id2word = dictionary, passes=50)
time.time() - tick

**Showing the top 10 words of each LDA topic is not implemented yet.**



## Conclusion

**=> Bag of words: Gradient Boosting and Random Forest give the best result, with 83% accuracy.**


**=> TF-IDF: The best accuracy is still 83%, but the accuracy of Naive Bayes and Decision Tree improves slightly compared with bag of words.**

**=> Word2Vec: Performance is worse than with the other techniques; the best model reaches 80% accuracy, while the worst reaches only 65%.**

**Showing the top 10 words of each LDA topic is not implemented yet.**


**=> Next step: review classmates' projects next week to improve and revise this project.**