<a href="https://colab.research.google.com/github/Fantagma/GNG5125/blob/master/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#GNG5125 Data Science Assignment 2 Report
#Group 2: Reza Pramudita and Thomas Tourigny

---



#**Introduction**


**Assignment Goal:** The goal of this assignment will be to develop classifiers using three supervised learning methods to attempt to identify the author of the document.

**Data Sets:** For our data set we chose five different books provided by Project Gutenberg. There are five classes for our documents, one per author, so each document is labeled with one of five authors. Each book is split into 200 documents of 150 words each.

#**Part 1: Data Preparation**

We chose five books of different genres as follows (INPUT):
1. Grimms’ Fairy Tales, by The Brothers Grimm (Fiction) - https://www.gutenberg.org/files/2591/2591-0.txt
2. History Of Egypt, Chaldæa, Syria, Babylonia, And Assyria In The Light Of Recent Discovery, by L.W. King and H.R. Hall (History) - https://www.gutenberg.org/files/17321/17321-0.txt
3. Lectures on Language, by William S. Balch (Languages/Literature) - https://www.gutenberg.org/ebooks/17594.txt.utf-8
4. Concrete Construction, by Halbert P. Gillette and Charles S. Hill (Engineering) - https://www.gutenberg.org/ebooks/24855.txt.utf-8
5. Commentaries on the Laws of England, by William Blackstone (Law) - https://www.gutenberg.org/ebooks/30802.txt.utf-8

Let's start by importing all relevant packages for this assignment and downloading these books, then preprocess the text with the following techniques:
- Removal of Gutenberg header and footer as these are not part of the books texts
- Lemmatization
- Tokenization
- Change all tokens to lowercase
- Removal of numbers
- Removal of punctuation
- Removal of stop words

Numpy arrays are used to split the books into documents of 150 words. These are passed on to pandas dataframes in the next section.
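The splitting step can be sketched on a toy token list. This is a minimal illustration rather than the notebook's exact code: `chunk_tokens` is a hypothetical helper, and it assumes that an incomplete final chunk should simply be dropped so that every document has the same length.

```python
import numpy as np

def chunk_tokens(tokens, doc_len=150):
    """Split a token list into rows of doc_len tokens, dropping the remainder."""
    n_docs = len(tokens) // doc_len          # number of complete documents
    trimmed = np.array(tokens[: n_docs * doc_len])
    return trimmed.reshape(n_docs, doc_len)  # one row per document

# Toy example: 7 tokens, 3 tokens per document -> 2 documents, 1 token dropped
docs = chunk_tokens(["a", "b", "c", "d", "e", "f", "g"], doc_len=3)
print(docs.shape)
```

Computing the row count from the token count avoids hard-coding a separate number of rows for each book.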



In [0]:
#NLTK imports
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

#scikitlearn imports
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split

#additional imports
import glob
import os
import re
import string
from urllib import request

import matplotlib as mpl
import numpy as np
import pandas as pd
%matplotlib inline

# Just making the plots look better
mpl.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (8,6)
mpl.rcParams['font.size'] = 12

if not os.path.exists('data'):
    os.mkdir('data')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:
lemmatizer = WordNetLemmatizer()

#this is for POS tagging for Lemmatization
def nltk2wn_tag(nltk_tag):
  if nltk_tag.startswith('J'):
    return wordnet.ADJ
  elif nltk_tag.startswith('V'):
    return wordnet.VERB
  elif nltk_tag.startswith('N'):
    return wordnet.NOUN
  elif nltk_tag.startswith('R'):
    return wordnet.ADV
  else:          
    return None

#actual lemmatization function  
def lemmatize_sentence(sentence):
  nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
  wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)
  res_words = []
  for word, tag in wn_tagged:
    if tag is None:            
      res_words.append(word)
    else:
      res_words.append(lemmatizer.lemmatize(word, tag))
  return " ".join(res_words)


#This section downloads the book from gutenberg and removes the gutenberg header and footer
#consider using https://pypi.org/project/Gutenberg/ for import and strip of gutenberg books
book_urls = ["https://www.gutenberg.org/files/2591/2591-0.txt",
             "https://www.gutenberg.org/files/17321/17321-0.txt",
             "https://www.gutenberg.org/ebooks/17594.txt.utf-8",
             "https://www.gutenberg.org/ebooks/24855.txt.utf-8",
             "https://www.gutenberg.org/ebooks/30802.txt.utf-8"]
#responsegrim = request.urlopen(book_urls[0])
#grimraw = responsegrim.read().decode('utf8')
trimmed_raw_books=[]
for i in range(len(book_urls)):
  raw = (request.urlopen(book_urls[i])).read().decode('utf8')
  #trim the gutenberg header and footer before using the book (raw string so the regex escapes are not treated as string escapes)
  raw = re.search(r"\ \*\*\*(.*)End of (?:the\s)*Project Gutenberg", raw, re.DOTALL).groups()[0]
  raw = lemmatize_sentence(raw)
  raw = raw.lower() #lowercase
  raw = re.sub(r'\d+', '', raw) #remove numbers
  #raw = raw.translate(str.maketrans({key: None for key in string.punctuation})) #remove punctuation
  raw = re.sub(r'[^\w\s]','',raw) #remove punctuation
  #white space removal?
  trimmed_raw_books.append(raw)
#m = re.search("\ \*\*\*(.*)End of (?:the\s)*Project Gutenberg",grimraw,re.DOTALL).groups()[0]
#print (trimmed_raw_books[2])
print("Cleaned the strings and added {} books strings to trimmed_raw_books[]".format(len(trimmed_raw_books)))
#end section



#creates an array of word tokenized books
stop_words = stopwords.words('english')
#lem = WordNetLemmatizer()
word_tokenized_books=[]
for i in range(len(trimmed_raw_books)):
  a = nltk.word_tokenize(trimmed_raw_books[i])
  a = [word for word in a if word not in stop_words] #remove stop words https://www.geeksforgeeks.org/removing-stop-words-nltk-python/, https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/
  #a = [lem.lemmatize(word) for word in a] #not sure this one works
  word_tokenized_books.append(a)
  #word_tokenized_books.append(nltk.word_tokenize(trimmed_raw_books[i]))
print("Word tokenized {} books and books have {}, {}, {}, {} and {} tokens".format(len(word_tokenized_books), len(word_tokenized_books[0]), len(word_tokenized_books[1]), len(word_tokenized_books[2]), len(word_tokenized_books[3]), len(word_tokenized_books[4])))
#print(word_tokenized_books[0])

book1array=np.array(word_tokenized_books[0])
book2array=np.array(word_tokenized_books[1])
book3array=np.array(word_tokenized_books[2])
book4array=np.array(word_tokenized_books[3])
book5array=np.array(word_tokenized_books[4])

resizedbook1array=np.resize(book1array, (296, 150))
resizedbook2array=np.resize(book2array, (370, 150))
resizedbook3array=np.resize(book3array, (240, 150))
resizedbook4array=np.resize(book4array, (722, 150))
resizedbook5array=np.resize(book5array, (566, 150))
print("Resized 5 arrays and they have {}, {}, {}, {} and {} tokens".format(len(resizedbook1array), len(resizedbook2array), len(resizedbook3array), len(resizedbook4array), len(resizedbook5array)))
dataset1=np.concatenate((resizedbook1array, resizedbook2array, resizedbook3array, resizedbook4array, resizedbook5array))
labels1=[1 for i in range(296)]+[2 for i in range(370)]+[3 for i in range(240)]+[4 for i in range(722)]+[5 for i in range(566)]
print("len(labels1):", len(labels1))



#forpdbook1=np.resize(book1array, (150, 296))
#forpdbook2=np.resize(book1array, (150, 370))
#forpdbook3=np.resize(book1array, (150, 240))
#forpdbook4=np.resize(book1array, (150, 722))
#forpdbook5=np.resize(book1array, (150, 566))

Cleaned the strings and added 5 books strings to trimmed_raw_books[]
Word tokenized 5 books and books have 44440, 55528, 36025, 108375 and 84938 tokens
Resized 5 arrays and they have 296, 370, 240, 722 and 566 tokens
len(labels1): 2194


In [0]:
#table = [[1 , 2], [3, 4]]
#df = pd.DataFrame(dataset1)
#df = df.transpose()
#df.columns = labels1
#labels1=[1 for i in range(296)]+[2 for i in range(370)]+[3 for i in range(240)]+[4 for i in range(722)]+[5 for i in range(566)]
#print("Created labels array len(labels1):", len(labels1))
#df.head()

grims = pd.DataFrame(resizedbook1array)
grims['body_text'] = pd.Series(grims.fillna('').values.tolist()).str.join(' ')
grims['label'] = 'A'
grims['label_num'] = 0
print(grims.shape)

history_egypt = pd.DataFrame(resizedbook2array)
history_egypt['body_text'] = pd.Series(history_egypt.fillna('').values.tolist()).str.join(' ')
history_egypt['label'] = 'B'
history_egypt['label_num'] = 1
print(history_egypt.shape)

lectures_on_language = pd.DataFrame(resizedbook3array)
lectures_on_language['body_text'] = pd.Series(lectures_on_language.fillna('').values.tolist()).str.join(' ')
lectures_on_language['label'] = 'C'
lectures_on_language['label_num'] = 2
print(lectures_on_language.shape)

concrete_construction = pd.DataFrame(resizedbook4array)
concrete_construction['body_text'] = pd.Series(concrete_construction.fillna('').values.tolist()).str.join(' ')
concrete_construction['label'] = 'D'
concrete_construction['label_num'] = 3
print(concrete_construction.shape)

comment_laws_england = pd.DataFrame(resizedbook5array)
comment_laws_england['body_text'] = pd.Series(comment_laws_england.fillna('').values.tolist()).str.join(' ')
comment_laws_england['label'] = 'E'
comment_laws_england['label_num'] = 4
print(comment_laws_england.shape)


(296, 153)
(370, 153)
(240, 153)
(722, 153)
(566, 153)


#**Part 2: Data Pre-processing**

In [0]:
dataset1 = pd.DataFrame(grims.iloc[0:200, 150:153])
dataset2 = pd.DataFrame(history_egypt.iloc[0:200, 150:153])
dataset3 = pd.DataFrame(lectures_on_language.iloc[0:200, 150:153])
dataset4 = pd.DataFrame(concrete_construction.iloc[0:200, 150:153])
dataset5 = pd.DataFrame(comment_laws_england.iloc[0:200, 150:153])
full_dataset = pd.concat([dataset1, dataset2, dataset3, dataset4, dataset5])
full_dataset = full_dataset.reset_index(drop=True)
#print(full_dataset)
print(full_dataset.head(5))
print(full_dataset.tail(5))

                                           body_text label  label_num
0  produced emma dudding john bickers dagny fairy...     A          0
1  mrs fox first story second story salad story y...     A          0
2  whole bird gardener eldest son set think find ...     A          0
3  golden bird father would listen long fond son ...     A          0
4  lose lie close think droll thing bring away fi...     A          0
                                             body_text label  label_num
995  law therefore strictly guard usurpation abuse ...     E          4
996  former method usually great tendency aggrandiz...     E          4
997  forty shilling annual value sum would proper i...     E          4
998  person shall vote respect annuity rentcharge u...     E          4
999  county number parliament men increase since fo...     E          4


#**Part 3: Feature Engineering**

## Count vectorization (Bag of Words)

Creates a document-term matrix where the entry of each cell will be a count of the number of times that word occurred in that document.
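As a rough illustration of what a document-term count matrix contains, the sketch below builds one by hand for two toy documents. It assumes plain whitespace tokenization, and `bag_of_words` is a hypothetical helper written for this example, not part of scikit-learn.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocab)
        for word, n in counts.items():
            row[index[word]] = n  # times this word occurs in this document
        matrix.append(row)
    return vocab, matrix

vocab, matrix = bag_of_words(["golden bird golden", "bird law"])
print(vocab)   # column order of the matrix
print(matrix)  # one row per document
```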

### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [0]:
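from nltk.stem import PorterStemmer
ps = PorterStemmer()
# use a distinct name so the imported nltk `stopwords` module is not shadowed
english_stopwords = nltk.corpus.stopwords.words('english')

def clean_text(text):
    # drop punctuation characters and lowercase everything
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # stem every token that is not a stop word
    text = [ps.stem(word) for word in tokens if word not in english_stopwords]
    return text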

### Apply CountVectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(full_dataset['body_text'])
print(X_counts.shape)
print(count_vect.get_feature_names())

(1000, 11057)
['aaron', 'ab', 'abadiya', 'abandon', 'abbey', 'abbez', 'abbot', 'abbrevi', 'abdic', 'aberr', 'abettor', 'abhor', 'abid', 'abil', 'abisha', 'abjur', 'abl', 'ablut', 'abnub', 'abod', 'abolendu', 'abolish', 'abolit', 'aborigin', 'abortivam', 'abound', 'abovement', 'abr', 'abraham', 'abram', 'abras', 'abridg', 'abroad', 'abrog', 'abrogari', 'abrogatur', 'absenc', 'absent', 'absolut', 'absolv', 'absorb', 'absorpt', 'abstain', 'abstract', 'abstruct', 'abstrus', 'absurd', 'abu', 'abuf', 'abund', 'abundantli', 'abus', 'abusir', 'abusîr', 'abut', 'abydo', 'abyss', 'abyssinia', 'abyssinian', 'abzubanda', 'abzuega', 'abêshu', 'abû', 'ac', 'academ', 'academi', 'acced', 'accent', 'accept', 'access', 'accid', 'accident', 'accommod', 'accompani', 'accomplish', 'accord', 'accordingli', 'account', 'accret', 'accru', 'accumul', 'accur', 'accuraci', 'accus', 'accustom', 'acervatarum', 'acet', 'acheul', 'acheulian', 'achiev', 'achæmenian', 'acid', 'acknowledg', 'acknowleg', 'acm', 'acquaint

### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements._
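The storage idea can be sketched in a few lines. This toy version keeps (row, column, value) triples, which resembles the coordinate layout; the vectorizers themselves return SciPy sparse matrices, so `to_coo` and `to_dense` are hypothetical helpers for illustration only.

```python
def to_coo(dense):
    """Keep only (row, column, value) triples for the non-zero entries."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]

def to_dense(triples, shape):
    """Rebuild the full matrix from the stored triples."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        dense[r][c] = v
    return dense

dense = [[0, 0, 2, 0],
         [0, 0, 0, 0],
         [1, 0, 0, 0]]
triples = to_coo(dense)
print(triples)  # 2 stored entries instead of 12 cells
```

Calling `.toarray()` on a sparse matrix, as in the next cell, is the reverse step and materializes every zero.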

In [0]:
X_counts_df = pd.DataFrame(X_counts.toarray())
X_counts_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,11017,11018,11019,11020,11021,11022,11023,11024,11025,11026,11027,11028,11029,11030,11031,11032,11033,11034,11035,11036,11037,11038,11039,11040,11041,11042,11043,11044,11045,11046,11047,11048,11049,11050,11051,11052,11053,11054,11055,11056
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
X_counts_df.columns = count_vect.get_feature_names()
X_counts_df

Unnamed: 0,aaron,ab,abadiya,abandon,abbey,abbez,abbot,abbrevi,abdic,aberr,abettor,abhor,abid,abil,abisha,abjur,abl,ablut,abnub,abod,abolendu,abolish,abolit,aborigin,abortivam,abound,abovement,abr,abraham,abram,abras,abridg,abroad,abrog,abrogari,abrogatur,absenc,absent,absolut,absolv,...,yolk,yonder,yonker,york,yorkshir,youd,young,younger,youth,yusuf,z,zarmu,zarzaru,zeal,zealot,zealou,zebra,zenith,zephyr,zero,zeu,ziggurat,zigzag,zimbabw,zn,zone,zoophyt,²,¼,¼in,½,½in,½¼,½½,¾,¾in,à,âge,égypt,ûn
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


##Vectorizing Raw Data: TF-IDF

### TF-IDF

Creates a document-term matrix where the columns represent single unique terms (unigrams) but the cell represents a weighting meant to represent how important a word is to a document.
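To make the weighting concrete, here is a small hand-rolled sketch. It assumes one common variant of the formula, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is close to what scikit-learn documents as its default; `tfidf` is a hypothetical helper and the exact numbers may differ from `TfidfVectorizer` output.

```python
import math

def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalized rows (an assumed variant)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each word
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf(["golden bird", "bird law"])
print(vocab)
print(rows[0])
```

In the first toy document, "golden" receives a higher weight than "bird" because "bird" appears in both documents and is therefore less distinctive.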

### Apply TfidfVectorizer

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(full_dataset['body_text'])
print(X_tfidf.shape)
print(tfidf_vect.get_feature_names())
print("Number of TFIDF Features: %d"%len(tfidf_vect.get_feature_names()))

(1000, 11057)
['aaron', 'ab', 'abadiya', 'abandon', 'abbey', 'abbez', 'abbot', 'abbrevi', 'abdic', 'aberr', 'abettor', 'abhor', 'abid', 'abil', 'abisha', 'abjur', 'abl', 'ablut', 'abnub', 'abod', 'abolendu', 'abolish', 'abolit', 'aborigin', 'abortivam', 'abound', 'abovement', 'abr', 'abraham', 'abram', 'abras', 'abridg', 'abroad', 'abrog', 'abrogari', 'abrogatur', 'absenc', 'absent', 'absolut', 'absolv', 'absorb', 'absorpt', 'abstain', 'abstract', 'abstruct', 'abstrus', 'absurd', 'abu', 'abuf', 'abund', 'abundantli', 'abus', 'abusir', 'abusîr', 'abut', 'abydo', 'abyss', 'abyssinia', 'abyssinian', 'abzubanda', 'abzuega', 'abêshu', 'abû', 'ac', 'academ', 'academi', 'acced', 'accent', 'accept', 'access', 'accid', 'accident', 'accommod', 'accompani', 'accomplish', 'accord', 'accordingli', 'account', 'accret', 'accru', 'accumul', 'accur', 'accuraci', 'accus', 'accustom', 'acervatarum', 'acet', 'acheul', 'acheulian', 'achiev', 'achæmenian', 'acid', 'acknowledg', 'acknowleg', 'acm', 'acquaint

In [0]:
training_time_container={'linear_svm':0}
prediction_time_container={'linear_svm':0}
accuracy_container={'linear_svm':0}

#**Part 4: Train Machine Learning Model**

With the data preprocessed, we can now develop the models. Machine learning models (in our case, classifiers) are first trained on labeled training data and then used to make predictions on a test data set. We therefore split our existing data set into training and test data as follows:

In [0]:
# As no separate test data-set was given so the provided data set is split into training and test data set using 70-30% ratio 
#as follows:
#variables_tfidf = X_tfidf
testsizevar = .3
#considering the TFIDF features as independent variables to be input to the classifier.
labels = full_dataset.label
labels_num = full_dataset.label_num
#considering the label values as the class labels for the classifier.


#splitting the data into random training and test sets for both independent variables and labels.
variables_train_tfidf, variables_test_tfidf, labels_train_tfidf, labels_test_tfidf = train_test_split(X_tfidf, labels, test_size=testsizevar)
variables_train_BOW, variables_test_BOW, labels_train_BOW, labels_test_BOW = train_test_split(X_counts, labels, test_size=testsizevar)

In [0]:
#analyzing the shape of the training and test data-set:
print('Shape of Training Data: '+str(variables_train_tfidf.shape))
print('Shape of Test Data: '+str(variables_test_tfidf.shape))

Shape of Training Data: (700, 11057)
Shape of Test Data: (300, 11057)


##**Applying K-means**

The k-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

- The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

source: https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html
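The two conditions above suggest a simple alternating procedure: assign each point to its nearest center, then move each center to the mean of its points. The sketch below is a bare-bones version of that loop on toy 2-D data. The function name, random initialization, and stopping rule are assumptions made for this illustration; scikit-learn's `KMeans` uses a more careful initialization and several restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain alternating iterations: assign points, then recompute the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # keep the old center if a cluster ends up empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)
```

Which group receives id 0 and which receives id 1 depends on the initialization, which matters later when cluster ids are compared with true labels.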

References:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://www.datacamp.com/community/tutorials/k-means-clustering-python

https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py






###Bag of Words (BOW)

In [0]:
#https://www.datacamp.com/community/tutorials/k-means-clustering-python

from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from time import time
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

kmeans_classifier = KMeans(n_clusters=5) # we want to cluster the documents into 5 clusters that hopefully represent the books

#fitting/training
t0=time()
#kmeans_classifier.fit(variables_train_BOW)
kmeans_classifier.fit(X_counts)
#svm_classifier=svm_classifier.fit(variables_train_BOW, labels_train_BOW)
training_time_container['kmeans_bow']=time()-t0
print("Training Time: %fs"%training_time_container['kmeans_bow'])

t0=time()
#kmeans_predictions = kmeans_classifier.predict(variables_test_BOW)
kmeans_predictions = kmeans_classifier.predict(X_counts)
#svm_predictions=svm_classifier.predict(variables_test_BOW)
prediction_time_container['kmeans_bow']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['kmeans_bow'])
print (kmeans_predictions)

accuracy_container['kmeans_bow']=accuracy_score(labels_num, kmeans_predictions)
print ("Accuracy Score of K-means Classifier: %f \n"%accuracy_container['kmeans_bow'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_num, kmeans_predictions))
print ("\nClassification Report:")
print (classification_report(labels_num, kmeans_predictions)) 


Training Time: 15.343956s
Prediction Time: 0.003284s
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 

In [0]:
#training the SGD Classifier with an elastic net penalty brings more sparsity to the model than is possible with L2 alone:
svm_classifier_enet=linear_model.SGDClassifier(loss='hinge',alpha=0.0001,penalty='elasticnet')
svm_classifier_enet=svm_classifier_enet.fit(variables_train_tfidf, labels_train_tfidf)

####**Kmeans BOW Evaluation and Error Analysis**

This model used with BOW had mostly near-perfect prediction scores, with some perfect scores.

The ten-fold cross validation also displays mostly near-perfect scores with the lowest variance amongst the models tested.

This model had the best results overall.

The results with BOW are slightly lower than with TFIDF.

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics


#####Kappa#####

Between Classifiers

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html


#####Consistency#####

Not finding anything

#####Silhouette#####

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

https://stackoverflow.com/questions/51138686/how-to-use-silhouette-score-in-k-means-clustering-from-sklearn-library

In [0]:
from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(X_counts, kmeans_predictions)
print("The average silhouette_score is :", silhouette_avg)

The average silhouette_score is : 0.026985523893217987


#####Coherence#####

In the context of LSA/LDA? Topics related 

Also slide 10 of Lecture

https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python





###TFIDF

In [0]:
#Cluster the TFIDF vectors with K-means (5 clusters, one per book) and compare
#the cluster assignments with the true book labels. The adjusted Rand score and
#contingency matrix do not depend on how the clusters happen to be numbered.
from time import time
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics.cluster import adjusted_rand_score

kmeans_classifier = KMeans(n_clusters=5) # we want to cluster the documents into 5 clusters that hopefully represent the books
#svm_classifier=linear_model.SGDClassifier(loss='hinge',alpha=0.0001)

t0=time()
#svm_classifier=svm_classifier.fit(variables_train_tfidf, labels_train_tfidf)
kmeans_classifier.fit(X_tfidf)
training_time_container['kmeans_tfidf']=time()-t0
print("Training Time: %fs"%training_time_container['kmeans_tfidf'])

t0=time()
kmeans_predictions=kmeans_classifier.predict(X_tfidf)
prediction_time_container['kmeans_tfidf']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['kmeans_tfidf'])

accuracy_container['kmeans_tfidf']=accuracy_score(labels_num, kmeans_predictions)
print ("Accuracy Score of K-Means Classifier: %f \n"%accuracy_container['kmeans_tfidf'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_num,kmeans_predictions))
print ("\nClassification Report:")
print (classification_report(labels_num, kmeans_predictions)) 

print(adjusted_rand_score(labels_num, kmeans_predictions))
from sklearn.metrics.cluster import contingency_matrix
print(contingency_matrix(labels_num, kmeans_predictions))


Training Time: 8.243313s
Prediction Time: 0.002684s
Accuracy Score of K-Means Classifier: 0.200000 

Confusion Matrix:
[[200   0   0   0   0]
 [  0   0 200   0   0]
 [  1   0   0   2 197]
 [  0 200   0   0   0]
 [  0   0   0 200   0]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       0.00      0.00      0.00       200
           2       0.00      0.00      0.00       200
           3       0.00      0.00      0.00       200
           4       0.00      0.00      0.00       200

    accuracy                           0.20      1000
   macro avg       0.20      0.20      0.20      1000
weighted avg       0.20      0.20      0.20      1000

0.9925139729504762
[[200   0   0   0   0]
 [  0   0 200   0   0]
 [  1   0   0   2 197]
 [  0 200   0   0   0]
 [  0   0   0 200   0]]


####**Kmeans TFIDF Evaluation and Error Analysis**

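This model with TFIDF produced near-perfect clusters: the adjusted Rand score is 0.9925, and the contingency matrix shows only 3 of the 1000 documents (all from book C) outside their book's main cluster. The accuracy score of 0.20 is misleading here because K-means numbers its clusters arbitrarily, so cluster ids do not line up with the `label_num` values.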

The ten-fold cross validation also displays almost perfect scores with the lowest variance amongst the models tested.

This model had the best results overall.

TFIDF had better results than BOW, and the variance with TFIDF was also lower.
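The accuracy of 0.20 printed for K-means on TFIDF compares arbitrary cluster ids directly with `label_num`. One way to get a comparable figure is to first find the one-to-one matching between labels and clusters that maximizes agreement. The sketch below does this by brute force on the contingency matrix printed in the output; `best_mapping` is a hypothetical helper, and for many clusters an assignment solver such as SciPy's `linear_sum_assignment` would be the more practical choice.

```python
from itertools import permutations

# Contingency matrix copied from the K-means TFIDF output:
# rows are true labels 0-4, columns are cluster ids 0-4.
contingency = [
    [200,   0,   0,   0,   0],
    [  0,   0, 200,   0,   0],
    [  1,   0,   0,   2, 197],
    [  0, 200,   0,   0,   0],
    [  0,   0,   0, 200,   0],
]

def best_mapping(matrix):
    """Try every one-to-one label->cluster assignment and keep the best."""
    n = len(matrix)
    best = max(permutations(range(n)),
               key=lambda perm: sum(matrix[i][perm[i]] for i in range(n)))
    matched = sum(matrix[i][best[i]] for i in range(n))
    return best, matched

label_to_cluster, matched = best_mapping(contingency)
total = sum(map(sum, contingency))
print(label_to_cluster)  # cluster id assigned to each true label
print(matched / total)   # accuracy after matching ids to labels
```

With the ids matched, 997 of the 1000 documents agree with their book label, which is consistent with the high adjusted Rand score.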

#####Silhouette#####

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

https://stackoverflow.com/questions/51138686/how-to-use-silhouette-score-in-k-means-clustering-from-sklearn-library

In [0]:
from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(X_tfidf, kmeans_predictions)
print("The average silhouette_score is :", silhouette_avg)

##**Applying Expectation-Maximization (EM)**

The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models.

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.


![How the EM algorithm works](https://www.wilsonmongwe.co.za/wp-content/uploads/2015/07/400px-EM.jpg)

Source: Wikipedia and https://scikit-learn.org/stable/modules/mixture.html
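The alternation between the E step and the M step can be sketched for the simplest case, a mixture of two 1-D Gaussians. This is only an illustration under stated assumptions: the function name, the initialization at the data extremes, the fixed number of iterations, and the small variance floor are all choices made for this example, and `GaussianMixture` handles the multivariate case with its own initialization and convergence checks.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    # crude initialization: means at the data extremes (an assumption)
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: responsibility of each component for each point
        dens = (weight / np.sqrt(2 * np.pi * var)
                * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the weighted points
        nk = resp.sum(axis=0)
        weight = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return weight, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(8.0, 1.0, 200)])
weight, mu, var = em_gmm_1d(x)
print(mu)
```

Unlike k-means, each point contributes to every component in proportion to its responsibility, so the assignments are soft rather than hard.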

References:

https://scikit-learn.org/stable/modules/mixture.html

https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

https://scikit-learn.org/stable/unsupervised_learning.html

https://www.geeksforgeeks.org/gaussian-mixture-model/





###Bag of Words (BOW)

In [0]:
#Fit a Gaussian mixture model with EM on the BOW counts and compare the
#predicted components with the true book labels.
from time import time
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

gmm = GaussianMixture(n_components=5)
#GaussianMixture needs dense input, so convert the sparse matrix once
X_counts_dense = X_counts.toarray()

#fit inside the timed section so the training time is actually measured
t0=time()
gmm.fit(X_counts_dense)
training_time_container['EM_BOW']=time()-t0
print("Training Time: %fs"%training_time_container['EM_BOW'])

t0=time()
EM_predictions=gmm.predict(X_counts_dense)
prediction_time_container['EM_BOW']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['EM_BOW'])

#component ids are arbitrary, so this accuracy is only meaningful once ids are matched to labels
accuracy_container['EM_BOW']=accuracy_score(labels_num, EM_predictions)
print ("Accuracy Score of EM Classifier: %f \n"%accuracy_container['EM_BOW'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_num,EM_predictions))
print ("\nClassification Report:")
print (classification_report(labels_num, EM_predictions))



####**EM BOW Evaluation and Error Analysis**

With BOW, this model had mostly near-perfect prediction scores, with some perfect scores.

The ten-fold cross validation also displays mostly near-perfect scores, with the lowest variance amongst the models tested.

This model had the best results overall.

The results with BOW are slightly lower than with TFIDF.

###TFIDF

In [0]:
#Fit a five-component Gaussian mixture (one component per book) with EM on the TFIDF features.
#GaussianMixture expects a dense array, so the sparse TFIDF matrix is converted first.
from time import time
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

X_tfidf_dense=X_tfidf.toarray()
gmm_tfidf = GaussianMixture(n_components=5)

t0=time()
gmm_tfidf.fit(X_tfidf_dense)
training_time_container['EM_TFIDF']=time()-t0
print("Training Time: %fs"%training_time_container['EM_TFIDF'])

t0=time()
EM_predictions_tfidf=gmm_tfidf.predict(X_tfidf_dense)
prediction_time_container['EM_TFIDF']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['EM_TFIDF'])

#Component indices are arbitrary, so this score is only meaningful once the
#components have been matched to the book labels in labels_num.
accuracy_container['EM_TFIDF']=accuracy_score(labels_num, EM_predictions_tfidf)
print ("Accuracy Score of EM Classifier: %f \n"%accuracy_container['EM_TFIDF'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_num,EM_predictions_tfidf))
print ("\nClassification Report:")
print (classification_report(labels_num, EM_predictions_tfidf))


####**EM TFIDF Evaluation and Error Analysis**

This model with TFIDF mostly had perfect prediction scores and some near perfect prediction scores.

The ten-fold cross validation also displays almost perfect scores with the lowest variance amongst the models tested.

This model had the best results overall.

TFIDF had better results than BOW, and its variance was also lower.

##**Applying Hierarchical Clustering**

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details.

The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

- Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
- Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
- Average linkage minimizes the average of the distances between all observations of pairs of clusters.
- Single linkage minimizes the distance between the closest observations of pairs of clusters.

Source: https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
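
As a small illustration of the linkage options listed above, the following sketch runs `AgglomerativeClustering` with each linkage on synthetic two-dimensional data. The three blobs and the variable names (`toy_X`, `toy_labels`) are made up for the example. On data this well separated, all four linkages are expected to recover the same three groups; on real text features they can differ considerably.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic data (an assumption for illustration): three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    toy_labels = model.fit_predict(toy_X)
    # Cluster sizes; the cluster numbering itself is arbitrary.
    print(linkage, np.bincount(toy_labels))
```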



###Bag of Words (BOW)

In [0]:
#Agglomerative (bottom-up) hierarchical clustering of the BOW counts into five clusters, one per book.
#AgglomerativeClustering expects a dense array, so the sparse count matrix is converted first.
from time import time
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import confusion_matrix

clustering = AgglomerativeClustering(n_clusters=5, linkage='ward')

t0=time()
HC_predictions=clustering.fit_predict(X_counts.toarray())
training_time_container['HC_BOW']=time()-t0
print("Clustering Time: %fs"%training_time_container['HC_BOW'])

#The fitted cluster assignments are stored in labels_ (with a trailing underscore).
print(clustering.labels_)

#Cluster indices are arbitrary; the confusion matrix shows how the clusters line up with the books.
print ("Confusion Matrix:")
print(confusion_matrix(labels_num,HC_predictions))



##**Applying Linear Classifier (SVM) using Stochastic Gradient Descent**

Stochastic Gradient Descent (SGD) is one of the most efficient approaches for fitting linear classifiers under convex loss functions, such as (linear) Support Vector Machines. It has proven to perform well on large-scale and sparse machine learning problems, which are common in text classification and natural language processing tasks. This motivated its use in our situation.
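
To show what training a linear SVM by SGD involves, here is a minimal sketch for the binary case using the hinge loss with an L2 penalty. It is a simplified illustration, not scikit-learn's `SGDClassifier`: it uses a constant learning rate `lr` (an assumption; `SGDClassifier` uses its own schedule by default), and the function name `sgd_linear_svm` and the toy data are hypothetical.

```python
import numpy as np

def sgd_linear_svm(X, y, alpha=0.0001, lr=0.01, epochs=20, seed=0):
    """Binary linear SVM trained by SGD on hinge loss plus L2 penalty; y must be -1 or +1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples one at a time in random order.
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            # Gradient of the L2 penalty (alpha/2)*||w||^2 shrinks the weights on every step.
            w *= 1.0 - lr * alpha
            if margin < 1.0:
                # Hinge loss is active: push the boundary towards classifying sample i correctly.
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

# Toy data (hypothetical): two well-separated groups labelled -1 and +1.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)), rng.normal(2.0, 0.5, size=(50, 2))])
y_toy = np.array([-1] * 50 + [1] * 50)
w, b = sgd_linear_svm(X_toy, y_toy)
toy_accuracy = (np.sign(X_toy @ w + b) == y_toy).mean()
print("Toy training accuracy:", toy_accuracy)
```

Here `alpha` plays the same role as the `alpha` argument passed to `SGDClassifier` in the cells below: it multiplies the regularization term.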


###Bag of Words (BOW)

In [0]:
#Hinge loss is used, which gives a linear Support Vector Machine. alpha, the constant that multiplies the
#regularization term, is set to 0.0001 (also the default value). The penalty is left at its default, L2,
#which is the standard regularizer for linear SVMs:
from time import time
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

svm_classifier=linear_model.SGDClassifier(loss='hinge',alpha=0.0001)

t0=time()
svm_classifier=svm_classifier.fit(variables_train_BOW, labels_train_BOW)
training_time_container['linear_SVM']=time()-t0
print("Training Time: %fs"%training_time_container['linear_SVM'])

t0=time()
svm_predictions=svm_classifier.predict(variables_test_BOW)
prediction_time_container['linear_SVM']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['linear_SVM'])

accuracy_container['linear_SVM']=accuracy_score(labels_test_BOW, svm_predictions)
print ("Accuracy Score of Linear SVM Classifier: %f \n"%accuracy_container['linear_SVM'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_test_BOW,svm_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_BOW, svm_predictions)) 


In [0]:
#Training the SGD classifier with an elastic net penalty can bring sparsity to the model that is not achievable with L2 alone:
svm_classifier_enet=linear_model.SGDClassifier(loss='hinge',alpha=0.0001,penalty='elasticnet')
svm_classifier_enet=svm_classifier_enet.fit(variables_train_BOW, labels_train_BOW)

####**SVM BOW Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svm_classifier, X_counts, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(svm_classifier, X_counts, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))


####**SVM BOW Evaluation and Error Analysis**

With BOW, this model had mostly near-perfect prediction scores, with some perfect scores.

The ten-fold cross validation also displays mostly near-perfect scores, with the lowest variance amongst the models tested.

This model had the best results overall.

The results with BOW are slightly lower than with TFIDF.

###TFIDF

In [0]:
#Hinge loss is used, which gives a linear Support Vector Machine. alpha, the constant that multiplies the
#regularization term, is set to 0.0001 (also the default value). The penalty is left at its default, L2,
#which is the standard regularizer for linear SVMs:
from time import time
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

svm_classifier=linear_model.SGDClassifier(loss='hinge',alpha=0.0001)

t0=time()
svm_classifier=svm_classifier.fit(variables_train_tfidf, labels_train_tfidf)
training_time_container['linear_svm']=time()-t0
print("Training Time: %fs"%training_time_container['linear_svm'])

t0=time()
svm_predictions=svm_classifier.predict(variables_test_tfidf)
prediction_time_container['linear_svm']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['linear_svm'])

accuracy_container['linear_svm']=accuracy_score(labels_test_tfidf, svm_predictions)
print ("Accuracy Score of Linear SVM Classifier: %f \n"%accuracy_container['linear_svm'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_test_tfidf,svm_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_tfidf, svm_predictions)) 


In [0]:
#Training the SGD classifier with an elastic net penalty can bring sparsity to the model that is not achievable with L2 alone:
svm_classifier_enet=linear_model.SGDClassifier(loss='hinge',alpha=0.0001,penalty='elasticnet')
svm_classifier_enet=svm_classifier_enet.fit(variables_train_tfidf, labels_train_tfidf)

####**SVM TFIDF Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svm_classifier, X_tfidf, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(svm_classifier, X_tfidf, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))


####**SVM TFIDF Evaluation and Error Analysis**

This model with TFIDF mostly had perfect prediction scores and some near perfect prediction scores.

The ten-fold cross validation also displays almost perfect scores with the lowest variance amongst the models tested.

This model had the best results overall.

TFIDF had better results than BOW, and its variance was also lower.

##**Applying Decision Tree**

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. In our case, we'll be using them for supervised classification.

References: 

https://scikit-learn.org/stable/modules/tree.html
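
As a small, self-contained illustration of how a decision tree classifies documents from word counts, the sketch below fits a tree on a made-up data set and prints its rules as text. The feature names (`count_king`, `count_statute`), the class names and the counts are hypothetical and are not taken from the books used in this report.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: two word-count features per document, two classes.
toy_counts = [[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]]
toy_books = ["fairy_tale", "fairy_tale", "fairy_tale", "law", "law", "law"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_counts, toy_books)

# export_text prints the learned if/else rules, a lightweight alternative to a graphviz plot.
print(export_text(toy_tree, feature_names=["count_king", "count_statute"]))

toy_prediction = toy_tree.predict([[3, 0]])
print(toy_prediction)
```

A single threshold on either feature separates this toy data, so the tree has only one split. The trees fitted on the real BOW and TFIDF features are far deeper.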

###Bag of Words (BOW)

In [0]:
from sklearn import tree
import matplotlib.pyplot
import graphviz

tree_classifier = tree.DecisionTreeClassifier()

#Time the fit itself; plotting is done afterwards so it does not count as training time.
t0=time()
tree_classifier = tree_classifier.fit(variables_train_BOW, labels_train_BOW)
training_time_container['tree']=time()-t0
print("Training Time: %fs"%training_time_container['tree'])

tree.plot_tree(tree_classifier)

t0=time()
tree_predictions=tree_classifier.predict(variables_test_BOW)
prediction_time_container['tree']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['tree'])

accuracy_container['decision_tree']=accuracy_score(labels_test_BOW, tree_predictions)
print ("Accuracy Score of Decision Tree Classifier: %f \n"%accuracy_container['decision_tree'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_test_BOW,tree_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_BOW, tree_predictions)) 

dot_data = tree.export_graphviz(tree_classifier, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("tree")

#fig = matplotlib.pyplot.gcf()
#fig.set_size_inches(150, 100)
#fig.savefig('tree.png')



####**Decision Tree BOW Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_classifier, X_counts, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(tree_classifier, X_counts, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))

####**Decision Tree BOW Evaluation and Error Analysis**

Roughly 87% of this model's predictions were accurate during testing. It had the most trouble with documents from *Lectures on Language*, which were mostly misclassified as documents from *Grimms' Fairy Tales*, *Commentaries on the Laws of England* and *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery*. The model tends to overpredict *Grimms' Fairy Tales*, *Lectures on Language* and *Commentaries on the Laws of England*. It overpredicts *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery* only occasionally, but when it does, that class accounts for most of the overpredictions.

The results of the ten-fold cross validation present an 87% accuracy with a higher level of variance than the other models (+/- 0.11).

This model had the worst results overall.

TFIDF and BOW results were similar.

###TFIDF

In [0]:
from sklearn import tree
import matplotlib.pyplot
import graphviz

tree_classifier = tree.DecisionTreeClassifier()

#Time the fit itself; plotting is done afterwards so it does not count as training time.
t0=time()
tree_classifier = tree_classifier.fit(variables_train_tfidf, labels_train_tfidf)
training_time_container['tree']=time()-t0
print("Training Time: %fs"%training_time_container['tree'])

tree.plot_tree(tree_classifier)

t0=time()
tree_predictions=tree_classifier.predict(variables_test_tfidf)
prediction_time_container['tree']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['tree'])

accuracy_container['decision_tree']=accuracy_score(labels_test_tfidf, tree_predictions)
print ("Accuracy Score of Decision Tree Classifier: %f \n"%accuracy_container['decision_tree'])
print ("Confusion Matrix:")
print(confusion_matrix(labels_test_tfidf,tree_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_tfidf, tree_predictions)) 

dot_data = tree.export_graphviz(tree_classifier, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("tree")

#fig = matplotlib.pyplot.gcf()
#fig.set_size_inches(150, 100)
#fig.savefig('tree.png')


####**Decision Tree TFIDF Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_classifier, X_tfidf, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(tree_classifier, X_tfidf, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))

####**Decision Tree TFIDF Evaluation and Error Analysis**

Roughly 90% of this model's predictions were accurate during testing. It had the most trouble with documents from *Lectures on Language*, which were mostly misclassified as documents from *Grimms' Fairy Tales*, *Commentaries on the Laws of England* and *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery*. The model tends to overpredict *Grimms' Fairy Tales*, *Lectures on Language* and *Commentaries on the Laws of England*. It overpredicts *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery* only occasionally, but when it does, that class accounts for most of the overpredictions.

The results of the ten-fold cross validation present an 88-89% accuracy with a higher level of variance than the other models (+/- 0.12-0.13).

This model had the worst results overall.

TFIDF and BOW results were similar.

##**Applying K-Nearest Neighbors (KNN)**

The K-nearest neighbors (KNN) algorithm is a type of supervised machine learning algorithm. KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying a new data point or instance. KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data.

Source: https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/

References:

https://scikit-learn.org/stable/modules/neighbors.html
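
The "lazy" behaviour described above can be seen in a few lines: there is no training step, and all the work happens when a query point is classified. The sketch below is a minimal from-scratch version using Euclidean distance and a majority vote. The function name `knn_predict` and the toy points are hypothetical, and the notebook itself uses scikit-learn's `KNeighborsClassifier` rather than this code.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

# Toy data (hypothetical): class 0 near the origin, class 1 near (5, 5).
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
knn_near_origin = knn_predict(toy_X, toy_y, np.array([0.5, 0.5]))
knn_far = knn_predict(toy_X, toy_y, np.array([5.5, 5.5]))
print(knn_near_origin, knn_far)
```

Because every prediction scans the stored training data, prediction time grows with the size of the training set.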

###Bag of Words (BOW)

In [0]:
#https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py

from sklearn import neighbors



NN_classifier = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')

t0=time()
NN_classifier.fit(variables_train_BOW, labels_train_BOW)
training_time_container['NN']=time()-t0
print("Training Time: %fs"%training_time_container['NN'])

t0=time()
NN_predictions=NN_classifier.predict(variables_test_BOW)
prediction_time_container['NN']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['NN'])

accuracy_container['NN']=accuracy_score(labels_test_BOW, NN_predictions)
print ("Accuracy Score of K-Nearest Neighbours Classifier: %f \n"%accuracy_container['NN'])
print ("Confusion Matrix:")
print (confusion_matrix(labels_test_BOW, NN_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_BOW, NN_predictions)) 

####**KNN BOW Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(NN_classifier, X_counts, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(NN_classifier, X_counts, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))

####**KNN BOW Evaluation and Error Analysis**

This model with BOW mostly had near perfect prediction scores, but no perfect scores.

Its mistakes in testing tend to be misclassification of documents from *Concrete Construction* that it most frequently classifies as part of the *Lectures on Language*. 

The ten-fold cross validation also displays almost perfect scores (0.95) with a low variance (+/- 0.06).

This model had the second best results overall.

TFIDF performed better than BOW with this model, both in accuracy and variance.


###TFIDF

In [0]:
#https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py

from sklearn import neighbors



NN_classifier = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')

t0=time()
NN_classifier.fit(variables_train_tfidf, labels_train_tfidf)
training_time_container['NN']=time()-t0
print("Training Time: %fs"%training_time_container['NN'])

t0=time()
NN_predictions=NN_classifier.predict(variables_test_tfidf)
prediction_time_container['NN']=time()-t0
print("Prediction Time: %fs"%prediction_time_container['NN'])

accuracy_container['NN']=accuracy_score(labels_test_tfidf, NN_predictions)
print ("Accuracy Score of K-Nearest Neighbours Classifier: %f \n"%accuracy_container['NN'])
print ("Confusion Matrix:")
print (confusion_matrix(labels_test_tfidf, NN_predictions))
print ("\nClassification Report:")
print (classification_report(labels_test_tfidf, NN_predictions)) 

####**KNN TFIDF Ten-Fold Cross Validation**

In [0]:
#https://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import cross_val_score
scores = cross_val_score(NN_classifier, X_tfidf, labels, cv=10)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [0]:
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
from sklearn.model_selection import cross_validate
cv_results = cross_validate(NN_classifier, X_tfidf, labels, cv=10)
sorted(cv_results.keys())                         

#print(cv_results, "\n")  
print("Fit Time:", cv_results['fit_time'])
print("Fit Time: %0.8f (+/- %0.8f)" % (cv_results['fit_time'].mean(), cv_results['fit_time'].std() * 2)+"\n")

print("Score Time:", cv_results['score_time'])
print("Score time: %0.8f (+/- %0.8f)" % (cv_results['score_time'].mean(), cv_results['score_time'].std() * 2)+"\n")

print("Scores:", cv_results['test_score'])
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_results['test_score'].mean(), cv_results['test_score'].std() * 2))

####**KNN TFIDF Evaluation and Error Analysis**

This model mostly had near perfect prediction scores and some perfect prediction scores.

Its mistakes in testing tend to be misclassification of documents from *Lectures on Language* that it most frequently classifies as part of the *Grimms' Fairy Tales*. 

The ten-fold cross validation also displays almost perfect scores (0.99) with a very low variance (+/- 0.02).

This model had the second best results overall.

TFIDF performed better than BOW with this model, both in accuracy and variance.

#**Part 5: Model Evaluation**

Model evaluation and ten-fold cross validation are included in Part 4: Training Machine Learning Model, under each model.

##Champion Model
SVM is the most accurate model as it presents the best accuracy score while also having the lowest variance. While its training time is higher than KNN, its prediction time is lower. Considering that for real-life applications, predictions should be more frequent than training, it is preferable to have a lower prediction time. This combination makes SVM the best model. SVM performed better with TFIDF than BOW.


Metric  | SVM | Decision Tree | KNN
--- | --- | --- | ---
Training Time (s) | 0.01523516 (+/- 0.00191166) | 0.14498570 (+/- 0.02412371) | 0.00355504 (+/- 0.00309887)
Prediction Time (s) | 0.00110841 (+/- 0.00008156) | 0.00153821 (+/- 0.00006291) | 0.01523516 (+/- 0.00191166)
Accuracy | 1.00 (+/- 0.01) | 0.88 (+/- 0.12) | 0.99 (+/- 0.02)
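
A comparison table like the one above can be assembled in one loop from `cross_validate` output. The sketch below shows one possible way to do so on synthetic stand-in data. The `demo_X` and `demo_y` arrays, the `models` dictionary and the `summary` structure are assumptions for illustration, and the numbers it prints are unrelated to the results reported in the table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would pass X_tfidf and labels instead.
demo_X, demo_y = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

models = {
    "SVM": SGDClassifier(loss="hinge", alpha=0.0001, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

summary = {}
for name, model in models.items():
    cv = cross_validate(model, demo_X, demo_y, cv=10)
    # Mean and two standard deviations, matching the "(+/- ...)" convention used in the table.
    summary[name] = {key: (cv[key].mean(), cv[key].std() * 2)
                     for key in ("fit_time", "score_time", "test_score")}

for name, stats in summary.items():
    print(name, "accuracy: %0.2f (+/- %0.2f)" % stats["test_score"])
```

Collecting the three metrics in one structure keeps the fit time, score time and accuracy for each model together, which may help avoid copying a value into the wrong cell of the table.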

#**Part 6: Error Analysis**

Error analysis is included in Part 4: Training Machine Learning Model, under each model as appropriate. It includes a review of the ten-fold cross validation and of the confusion matrix for each model, as appropriate. A classification report is also included for each model.

##**SVM Evaluation and Error Analysis**

This model mostly had perfect prediction scores and some near perfect prediction scores.

The ten-fold cross validation also displays almost perfect scores with the lowest variance amongst the models tested.

This model had the best results overall. 

This model performed slightly better with TFIDF than with BOW.

Error analysis was deemed unnecessary given its performance.

##**Decision Tree Evaluation and Error Analysis**
Roughly 90% of this model's predictions were accurate during testing. It had the most trouble with documents from *Lectures on Language*, which were mostly misclassified as documents from *Grimms' Fairy Tales*, *Commentaries on the Laws of England* and *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery*. The model tends to overpredict *Grimms' Fairy Tales*, *Lectures on Language* and *Commentaries on the Laws of England*. It overpredicts *History of Egypt, Chaldea, Syria, Babylonia, and Assyria in the Light of Recent Discovery* only occasionally, but when it does, that class accounts for most of the overpredictions.

The results of the ten-fold cross validation present an 88% accuracy with a higher level of variance than the other models (+/- 0.12).

This model did not present significant accuracy differences between TFIDF and BOW.

This model had the worst results overall.

##**KNN Evaluation and Error Analysis**

This model mostly had near perfect prediction scores and some perfect prediction scores.

Its mistakes in testing tend to be misclassification of documents from *Lectures on Language* that it most frequently classifies as part of the *Grimms' Fairy Tales*. 

The ten-fold cross validation also displays almost perfect scores with very low variance.

This model performed better with TFIDF than with BOW.
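
A comparison between the two feature representations can be set up by swapping the vectorizer in an otherwise identical pipeline. The sketch below is an assumption-laden illustration rather than the notebook's code: the toy corpus, the choice of `n_neighbors=3` and the five-fold split are hypothetical, with `CountVectorizer` standing in for BOW and `TfidfVectorizer` for TFIDF.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real documents are 150-word book excerpts.
docs = ["once upon a time a king lived in a forest"] * 10 + [
    "mix the cement sand and gravel for the concrete"
] * 10
labels = ["grimm"] * 10 + ["concrete"] * 10

vectorizers = {
    "BOW": CountVectorizer(),
    "TFIDF": TfidfVectorizer(),
}

mean_accuracy = {}
for name, vectorizer in vectorizers.items():
    # Same classifier for both runs, so only the features differ.
    pipeline = make_pipeline(vectorizer, KNeighborsClassifier(n_neighbors=3))
    scores = cross_val_score(pipeline, docs, labels, cv=5)
    mean_accuracy[name] = float(scores.mean())

print(mean_accuracy)
```

Putting the vectorizer inside the pipeline means it is re-fitted on the training folds only, which keeps vocabulary and IDF weights from leaking out of the held-out fold.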

This model had the second best results overall.

#**Part 7: Visualization**

Visualizations are in Part 4: Training Machine Learning Model, under each model as appropriate. These include tables and reports such as ten-fold cross validation outputs, confusion matrices and classification reports for each model. A diagram is also produced for the Decision Tree model under Part 4 and below.

Additional visualizations are included in the [Google Slides presentation](https://drive.google.com/open?id=1Vo6i2yx9pMAlTfpcvv-Vz4KbMTE4E43uNs7-tuCa0S8).



In [0]:
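# Draw the fitted decision tree inline with matplotlib.
tree.plot_tree(tree_classifier)
# Export the same tree to Graphviz DOT format and render it to a file named "tree" (PDF by default).
dot_data = tree.export_graphviz(tree_classifier, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("tree")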

#**Part 8: Limitations and Considerations**

**Choice of books**. The books used were of different styles and by different authors, which likely made it easier to tell them apart. How would the models perform on books of the same style/category? From the same author?

It is likely that the similarity between books, and therefore between documents, increases for books of the same style/category, which would lead to lower accuracy.

**Text only**. These models only take into consideration the text of the books. Other visual elements such as pictures, diagrams and typography should perhaps be considered to improve classification.

**Impact of existing classification**. How would previous human decisions (e.g., Dewey Decimal Classification System) impact the machine’s training and predictions? How could a machine compensate for or leverage them?

Erroneous existing classification could lead to lower accuracy, just as mislabeling a document in our dataset would likely reduce the effectiveness of the classifiers. However, in an environment where both machine and human classifiers cooperate, the machine could leverage the classification produced by humans and re-fit its classifier as new human-classified data becomes available. This could be particularly beneficial in cases such as the classification of a piece by an expert in that field or by its creator.
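
One simple way to fold human classifications back into a model is to append the newly labelled documents to the training set and re-fit the whole pipeline. The sketch below illustrates that idea under stated assumptions: the `refit_with_human_labels` helper, the toy documents and the 1-nearest-neighbour classifier are all hypothetical and are not part of the assignment's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def refit_with_human_labels(pipeline, docs, labels, new_docs, new_labels):
    """Append human-labelled documents and re-fit the pipeline from scratch."""
    all_docs = list(docs) + list(new_docs)
    all_labels = list(labels) + list(new_labels)
    # Re-fitting also rebuilds the TF-IDF vocabulary with the new words.
    pipeline.fit(all_docs, all_labels)
    return all_docs, all_labels


# Hypothetical starting data: two classes only.
docs = ["king forest wolf", "princess castle frog", "cement gravel sand", "concrete slab mix"]
labels = ["grimm", "grimm", "concrete", "concrete"]

classifier = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
classifier.fit(docs, labels)

# A human expert later labels documents from a class the model has not seen.
new_docs = ["court statute tenant", "parliament statute crown"]
new_labels = ["law", "law"]
docs, labels = refit_with_human_labels(classifier, docs, labels, new_docs, new_labels)

prediction = classifier.predict(["statute court parliament"])[0]
print(prediction)
```

A full re-fit is the simplest option for a dataset of this size; for a large or continuously growing collection, an incrementally trainable model would be worth considering instead.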


#**Part 9: Real Life Applications**

**Library**. Supporting librarians in classifying books using the Dewey Decimal Classification System, which is the most widely used method for classifying books in libraries. Given an increasing number of electronic books to classify, automatic document classification will help librarians to manage documents with accuracy and speed.

**Archive and Document Management**. Categorizing or classifying web pages could be useful for organizations such as Library and Archives Canada that have a mandate to archive web content.

**Information Management**. Supporting information management in organizations by offering recommendations when saving a new document in a structured classification scheme, or by cleaning up the classification and placement of existing documents. An automated system that identifies and classifies documents, and that can continuously improve its classification accuracy, would benefit organizations that produce and manage a high volume of documents.

**Natural Language Generation**. Literature or document classification based on authorship and genre could underpin the development of natural language generation, the area of natural language processing that focuses on generating natural language from structured data. Integration with speech features could open up new products or services, such as storytelling with machine-generated stories for children or adults.

