## Data620 Assignment: Document Classification
#### Gabriel Santos 
#### Date: 03-28-2023 

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data: [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


### Corpus selection

We will use texts from the NLTK Gutenberg corpus.

In [40]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\tqkar\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [41]:
import nltk
import random
random.seed(200)
import pandas as pd
pd.set_option('display.max_rows', 400)
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

I chose the following books:

- Plays by Shakespeare (Julius Caesar, Hamlet, and Macbeth)
- Poems by Blake

The objective is to be able to determine who wrote a phrase or a specific passage.

### 1. Compare Shakespeare vs Blake

The first step is to combine all three of Shakespeare's plays into a single text. We then convert every word to lowercase, so that capitalized and lowercase forms count as the same word, and drop punctuation and other non-alphabetic tokens. Finally, we split the text into segments of 800 words each.

In [42]:
shakespeare = nltk.corpus.gutenberg.words('shakespeare-caesar.txt')+nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')+nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
shakespeare = [word.lower() for word in shakespeare if word.isalpha()]
shakespeare1=[]
for i in range(87):
    shakespeare1.append([shakespeare[i*800:(i+1)*800],'shak'])
len(shakespeare)

69340

In [43]:
len(shakespeare1)

87

We now have a list of 87 Shakespeare segments: 86 with 800 words each, plus a final segment with the remaining 540 words.
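The segmentation step can also be written as a small helper that derives the number of segments from the length of the word list instead of hard-coding it. This is only a sketch: `segment_words` is a hypothetical name that does not appear in the notebook, and the toy word list stands in for the 69,340-word Shakespeare text.

```python
def segment_words(words, label, size=800):
    """Split a word list into consecutive chunks of `size` words, each paired with `label`.

    Hypothetical helper; the final chunk may be shorter than `size`.
    """
    return [[words[i:i + size], label] for i in range(0, len(words), size)]


# Toy stand-in with the same length as the cleaned Shakespeare word list
toy_words = ["w{}".format(i) for i in range(69340)]
segments = segment_words(toy_words, "shak")

print(len(segments))         # number of segments produced
print(len(segments[-1][0]))  # length of the final, shorter segment
```

Stepping through the list with `range(0, len(words), size)` never produces an empty segment, whatever the length of the text.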

We do the same with Blake's text: convert every word to lowercase, drop punctuation and other non-alphabetic tokens, and split the result into segments of 800 words each.

In [44]:
blake = nltk.corpus.gutenberg.words('blake-poems.txt')
blake = [word.lower() for word in blake if word.isalpha()]
blake1=[]
for i in range(9):
    blake1.append([blake[i*800:(i+1)*800],'blak'])
len(blake)

6934

In [45]:
len(blake1)

9

We now have a list of 9 Blake segments: 8 with 800 words each, plus a final segment with the remaining 534 words. There are fewer segments because Blake's text is much shorter.

### 2. Features

We join the two word lists (Shakespeare and Blake) into one and identify the 2000 most frequent words. These words are the features our text classifier will use.

In [46]:
sb=shakespeare+blake
all_words = nltk.FreqDist(w.lower() for w in sb)
word_features = list(all_words)[:2000] 

wlist = []
for i in range(0, 2000, 200):
    df = pd.DataFrame(word_features[i:(i+200)])
    df.columns=['200 words']
    wlist.append(df)

pd.concat(wlist, axis=1)

|   | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words | 200 words |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | the | euen | la | touch | vile | satisfied | feet | eternall | sheep | hate |
| 1 | and | nature | ouer | musicke | arm | flesh | mild | plucke | alexander | anger |
| 2 | to | still | dost | flourish | already | liberty | dark | fiery | priest | earnest |
| 3 | i | hold | fit | patience | crown | dust | grass | conference | deuill | generals |
| 4 | of | wife | alas | land | proofe | instrument | itself | chanc | osricke | touching |
| 5 | a | pol | giuen | awake | reade | beside | proper | sad | carriages | tydings |
| 6 | you | rosse | crowne | iudgement | faces | wounds | indeede | aske | foyles | mess |
| 7 | my | ant | vnto | gaue | lyes | enemies | stones | manner | toth | addition |
| 8 | in | off | lesse | greefe | season | doo | infants | faine | title | waste |
| 9 | that | once | meet | whole | braue | buriall | shout | falling | attend | twixt |
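As a minimal sketch of the "most frequent words" step, the standard-library `Counter` can stand in for `nltk.FreqDist` (which subclasses it). The helper name `top_words` and the sample sentence are made up for illustration.

```python
from collections import Counter


def top_words(words, n):
    """Return the n most frequent words, most frequent first (hypothetical helper)."""
    return [word for word, _ in Counter(words).most_common(n)]


sample = "the sun and the moon and the stars".split()
vocab = top_words(sample, 2)
print(vocab)  # ['the', 'and']
```

In the notebook, `all_words.most_common(2000)` would play the same role while making the frequency ordering explicit.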


We use the `document_features` function from the Natural Language Processing with Python textbook (page 228) as our feature extractor. It takes the list of the 2000 most frequent words and records, for each word, whether or not it is present in the text.

In [47]:
def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

Let's perform the test on the complete text of Blake.

In [48]:
features = document_features(blake)
list(features.items())[:20]

[('contains(the)', True),
 ('contains(and)', True),
 ('contains(to)', True),
 ('contains(i)', True),
 ('contains(of)', True),
 ('contains(a)', True),
 ('contains(you)', True),
 ('contains(my)', True),
 ('contains(in)', True),
 ('contains(that)', True),
 ('contains(is)', True),
 ('contains(it)', True),
 ('contains(not)', True),
 ('contains(d)', True),
 ('contains(his)', True),
 ('contains(with)', True),
 ('contains(for)', True),
 ('contains(me)', True),
 ('contains(this)', True),
 ('contains(but)', True)]
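To see what the extractor returns for a text that lacks some of the vocabulary, here is a small sketch on a toy vocabulary. Unlike the notebook's version, this variant takes the vocabulary as an explicit `word_features` parameter (an assumption made so the example is self-contained), and the sample line and three-word vocabulary are made up.

```python
def document_features(document, word_features):
    """Map each vocabulary word to True/False depending on whether it occurs in the document."""
    document_words = set(document)  # set membership is faster than scanning a list
    return {"contains({})".format(w): (w in document_words) for w in word_features}


vocab = ["the", "thou", "sheep"]  # toy vocabulary, not the real 2000 words
features = document_features("little lamb who made thee dost thou know".split(), vocab)
print(features)
```

Matching is on whole words, so 'thee' does not count as 'the'. Because the notebook's function reads the global `word_features`, it picks up the new vocabulary each time that list is rebuilt for a larger set of authors.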

### 3. Create Test Train Dataset

We combine all the Shakespeare and Blake segments into a single list. This is the corpus we will use to train and test our classifier.

In [49]:
documents = shakespeare1+blake1

Training and testing a classifier for document classification:

In [50]:
import random
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

96

We split our dataset into test and training sections, train our classifier on the training set, and check the accuracy of our model on the test set.

In [56]:
train_set, test_set = featuresets[:50], featuresets[50:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [57]:
print(nltk.classify.accuracy(classifier, test_set)) 

1.0


NLTK's Naive Bayes classifier labels all 46 held-out segments correctly, so Shakespeare and Blake are easy to tell apart with these features.
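Because the two classes are very unequal in size (87 Shakespeare segments against 9 Blake segments), it helps to compare the accuracy with what a trivial "always guess the larger class" rule would achieve. The sketch below computes that baseline over the whole corpus; the exact baseline on the test set depends on the shuffle, so this is only a rough reference point.

```python
from collections import Counter

# Segment labels as built above: 87 Shakespeare segments and 9 Blake segments
labels = ["shak"] * 87 + ["blak"] * 9

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline = majority_count / len(labels)

print(majority_label, round(baseline, 3))  # shak 0.906
```

A perfect score is therefore an improvement over roughly 91%, not over 50%.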

### We are going to include more authors to test: Adding Stories by Bryant

In [58]:
bryant = nltk.corpus.gutenberg.words('bryant-stories.txt')
bryant = [word.lower() for word in bryant if word.isalpha()]
bryant1=[]
for i in range(46):
    bryant1.append([bryant[i*1000:(i+1)*1000],'brya'])
len(bryant)

46611

In [59]:
sbb=shakespeare+blake+bryant
all_words = nltk.FreqDist(w.lower() for w in sbb)
word_features = list(all_words)[:2000] 

documents=shakespeare1+blake1+bryant1

Training and testing a classifier for document classification:

In [60]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

142

In [61]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [62]:
print(nltk.classify.accuracy(classifier, test_set)) 

1.0


The classifier again labels all 42 test segments correctly, so it has no trouble distinguishing between Shakespeare, Blake, and Bryant.

### Let's include more authors to test: Adding Busterbrown by Burgess 

In [63]:
burgess = nltk.corpus.gutenberg.words('burgess-busterbrown.txt')
burgess = [word.lower() for word in burgess if word.isalpha()]
burgess1=[]
for i in range(16):
    burgess1.append([burgess[i*1000:(i+1)*1000],'burg'])
len(burgess)

16327

In [64]:
sbbb=shakespeare+blake+bryant+burgess
all_words = nltk.FreqDist(w.lower() for w in sbbb)
word_features = list(all_words)[:2000] 

documents=shakespeare1+blake1+bryant1+burgess1

In [65]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

158

In [66]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9827586206896551


With an accuracy of about 98.3% (57 of 58 test segments), the classifier still distinguishes well between Shakespeare, Blake, Bryant, and Burgess. The accuracy decreased slightly, but it is still easy to identify the author.

The most important features of the model are:

In [67]:
classifier.show_most_informative_features(25)

Most Informative Features
          contains(down) = True             burg : shak   =     37.8 : 1.0
         contains(birds) = True             blak : shak   =     36.0 : 1.0
         contains(shade) = True             blak : shak   =     36.0 : 1.0
         contains(voice) = True             blak : shak   =     36.0 : 1.0
             contains(d) = False            brya : shak   =     35.3 : 1.0
          contains(felt) = True             burg : shak   =     33.3 : 1.0
       contains(started) = True             burg : shak   =     33.3 : 1.0
        contains(behind) = True             burg : shak   =     28.9 : 1.0
          contains(your) = False            burg : shak   =     28.9 : 1.0
          contains(hear) = True             blak : shak   =     28.0 : 1.0
      contains(pleasant) = True             blak : shak   =     28.0 : 1.0
           contains(sun) = True             blak : shak   =     28.0 : 1.0
          contains(wept) = True             blak : shak   =     28.0 : 1.0

- According to these features, a text that contains the word 'down' is 37.8 times more likely to be by Burgess than by Shakespeare.

- A text that does not contain the word 'your' is 28.9 times more likely to be by Burgess than by Shakespeare.

- A text that contains the word 'hear', 'pleasant', or 'sun' is 28 times more likely to be by Blake than by Shakespeare.

- A text containing the word 'began' or 'tree' is 20 times more likely to be by Burgess than by Shakespeare.

- A text containing the word 'chimney' or 'cloud' is 20 times more likely to be by Blake than by Shakespeare.
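The ratios in this listing compare how likely a feature value is under one label with how likely it is under another. The sketch below shows how a ratio of this size can arise, assuming the classifier uses NLTK's default expected-likelihood (add-0.5) smoothing with two possible values per feature. The segment counts are hypothetical, chosen only for illustration; they are not taken from the notebook's training set.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: (count + 0.5) / (total + 0.5 * bins)."""
    return (count + gamma) / (total + bins * gamma)


# Hypothetical training counts (assumptions, not the notebook's real numbers):
# 8 of 8 Burgess segments contain 'down', 1 of 59 Shakespeare segments does.
p_burg = ele_prob(8, 8)
p_shak = ele_prob(1, 59)

ratio = p_burg / p_shak
print(round(ratio, 1))  # 37.8
```

The smoothing keeps the ratio finite even when a word never appears in one author's training segments.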


### Let's include more authors to test: Adding Alice by Carroll

In [68]:
carroll = nltk.corpus.gutenberg.words('carroll-alice.txt')
carroll = [word.lower() for word in carroll if word.isalpha()]
carroll1=[]
for i in range(27):
    carroll1.append([carroll[i*1000:(i+1)*1000],'carr'])
len(carroll)

27333

In [69]:
sbbbc=shakespeare+blake+bryant+burgess+carroll
all_words = nltk.FreqDist(w.lower() for w in sbbbc)
word_features = list(all_words)[:2000] 

documents=shakespeare1+blake1+bryant1+burgess1+carroll1

In [70]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

185

In [71]:
train_set, test_set = featuresets[:100], featuresets[100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

1.0


The classifier again labels every test segment correctly, so it has no trouble distinguishing between Shakespeare, Blake, Bryant, Burgess, and Carroll.

In [72]:
classifier.show_most_informative_features(25)

Most Informative Features
          contains(bear) = True             burg : shak   =     31.2 : 1.0
          contains(felt) = True             burg : shak   =     31.2 : 1.0
          contains(down) = True             blak : shak   =     31.0 : 1.0
           contains(got) = True             carr : shak   =     30.4 : 1.0
          contains(back) = True             burg : shak   =     27.1 : 1.0
          contains(than) = True             burg : shak   =     27.1 : 1.0
          contains(your) = False            burg : shak   =     27.1 : 1.0
        contains(bright) = True             blak : shak   =     26.2 : 1.0
         contains(voice) = True             blak : shak   =     26.2 : 1.0
        contains(caught) = True             burg : shak   =     22.9 : 1.0
        contains(farmer) = True             burg : shak   =     22.9 : 1.0
       contains(started) = True             burg : shak   =     22.9 : 1.0
          contains(tree) = True             burg : shak   =     22.9 : 1.0

According to these features, a text that does not contain the word 'you' is 21.4 times more likely to be by Blake than by Shakespeare.

We can also see that a text that contains the word 'got' is 30.4 times more likely to be by Carroll than by Shakespeare.

A text containing the word 'sun' or 'tender' is 21.4 times more likely to be by Blake than by Shakespeare.

### Let's include more authors to test: Adding Ball, Brown, and Thursday by Chesterton

In [73]:
chesterson = nltk.corpus.gutenberg.words('chesterton-ball.txt')+nltk.corpus.gutenberg.words('chesterton-brown.txt')+nltk.corpus.gutenberg.words('chesterton-thursday.txt')
chesterson = [word.lower() for word in chesterson if word.isalpha()]
chesterson1=[]
for i in range(214):
    chesterson1.append([chesterson[i*1000:(i+1)*1000],'ches'])
len(chesterson)

214692

In [74]:
sbbbcc=shakespeare+blake+bryant+burgess+carroll+chesterson
all_words = nltk.FreqDist(w.lower() for w in sbbbcc)
word_features = list(all_words)[:2000] 

documents=shakespeare1+blake1+bryant1+burgess1+carroll1+chesterson1

In [75]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

399

Since we now have 399 texts, let's increase our training set to 170 segments, about 43% of the corpus.

In [76]:
train_set, test_set = featuresets[:170], featuresets[170:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9563318777292577


With an accuracy of about 95.6%, the classifier still distinguishes well between Shakespeare, Blake, Bryant, Burgess, Carroll, and Chesterton, although accuracy has dropped now that the corpus is larger.
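The training-set sizes used so far can be put side by side to see what share of each corpus was used for training. The sketch below only restates the slice boundaries and corpus sizes from the cells above; the dictionary keys are descriptive labels added for readability.

```python
# (training segments, total segments) for each experiment so far
splits = {
    "Shakespeare + Blake": (50, 96),
    "+ Bryant": (100, 142),
    "+ Burgess": (100, 158),
    "+ Carroll": (100, 185),
    "+ Chesterton": (170, 399),
}

fractions = {name: n_train / n_total for name, (n_train, n_total) in splits.items()}

for name, fraction in fractions.items():
    print("{:<22} {:.1%} of the corpus used for training".format(name, fraction))
```

The training share varies between roughly 43% and 70% across the experiments, so the accuracy figures are not measured under identical conditions.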

In [77]:
classifier.show_most_informative_features(25)

Most Informative Features
          contains(sing) = True             blak : ches   =     51.6 : 1.0
         contains(sweet) = True             blak : ches   =     51.6 : 1.0
          contains(thou) = True             blak : ches   =     51.6 : 1.0
       contains(herself) = True             carr : ches   =     50.7 : 1.0
       contains(chimney) = True             blak : ches   =     40.1 : 1.0
          contains(fish) = True             burg : ches   =     40.1 : 1.0
           contains(joy) = True             blak : ches   =     40.1 : 1.0
         contains(quick) = True             burg : ches   =     40.1 : 1.0
         contains(tears) = True             blak : ches   =     40.1 : 1.0
          contains(thee) = True             blak : ches   =     40.1 : 1.0
           contains(thy) = True             blak : ches   =     40.1 : 1.0
          contains(haue) = False            ches : shak   =     29.2 : 1.0
         contains(angel) = True             blak : ches   =     28.7 : 1.0

According to these features, a text that contains the word 'sing' is 51.6 times more likely to be by Blake than by Chesterton.

We can also see that a text containing the word 'herself' is 50.7 times more likely to be by Carroll than by Chesterton.

### Let's include more authors to test: Adding Austen, Edgeworth, Melville, Whitman, and Milton

In [78]:
austen = nltk.corpus.gutenberg.words('austen-emma.txt')+nltk.corpus.gutenberg.words('austen-persuasion.txt')+nltk.corpus.gutenberg.words('austen-sense.txt')
austen = [word.lower() for word in austen if word.isalpha()]
austen1=[]
for i in range(69):  # uses only the first 69,000 of Austen's 366,454 words
    austen1.append([austen[i*1000:(i+1)*1000],'aust'])
len(austen)

366454

In [79]:
edgeworth = nltk.corpus.gutenberg.words('edgeworth-parents.txt')
edgeworth = [word.lower() for word in edgeworth if word.isalpha()]
edgeworth1=[]
for i in range(170):
    edgeworth1.append([edgeworth[i*1000:(i+1)*1000],'edge'])
len(edgeworth)

170737

In [80]:
melville = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
melville = [word.lower() for word in melville if word.isalpha()]
melville1=[]
for i in range(218):
    melville1.append([melville[i*1000:(i+1)*1000],'melv'])
len(melville)

218361

In [81]:
whitman = nltk.corpus.gutenberg.words('whitman-leaves.txt')
whitman = [word.lower() for word in whitman if word.isalpha()]
whitman1=[]
for i in range(126):
    whitman1.append([whitman[i*1000:(i+1)*1000],'whit'])
len(whitman)

126276

In [82]:
milton = nltk.corpus.gutenberg.words('milton-paradise.txt')
milton = [word.lower() for word in milton if word.isalpha()]
milton1=[]
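for i in range(126):  # NOTE: Milton has only 80,493 words, so slices 81-125 are empty; range(81) would avoid 45 empty 'milt' segments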
    milton1.append([milton[i*1000:(i+1)*1000],'milt'])
len(milton)

80493
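Since the segment counts above are hard-coded, a quick check of each `range(...)` value against the reported word count is useful. The sketch below uses only the numbers shown in the cells above; the `report` structure is a hypothetical convenience, not part of the notebook.

```python
import math

SEGMENT_SIZE = 1000

# (word count reported by len(...), value passed to range(...)) as shown above
corpora = {
    "austen": (366454, 69),
    "edgeworth": (170737, 170),
    "melville": (218361, 218),
    "whitman": (126276, 126),
    "milton": (80493, 126),
}

report = {}
for name, (n_words, n_requested) in corpora.items():
    n_available = math.ceil(n_words / SEGMENT_SIZE)  # non-empty slices the text can supply
    report[name] = {
        "empty_segments": max(0, n_requested - n_available),
        "unused_words": max(0, n_words - n_requested * SEGMENT_SIZE),
    }
    print(name, report[name])
```

With these numbers, Milton ends up with 45 empty segments labeled 'milt', and most of Austen's text is left unused.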

In [83]:
sbbbccemswam=austen+blake+bryant+burgess+carroll+chesterson+edgeworth+milton+melville+shakespeare+whitman
all_words = nltk.FreqDist(w.lower() for w in sbbbccemswam)
word_features = list(all_words)[:2000] 

documents=austen1+blake1+bryant1+burgess1+carroll1+chesterson1+edgeworth1+milton1+melville1+shakespeare1+whitman1
len(sbbbccemswam)

1343558

In [84]:
random.shuffle(documents)
featuresets = [(document_features(d), c) for (d,c) in documents]
len(featuresets)

1108

In [85]:
train_set, test_set = featuresets[:320], featuresets[320:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.9758883248730964


With an accuracy of about 97.6%, NLTK's classifier distinguishes well between all the authors included in the model: Austen, Blake, Bryant, Burgess, Carroll, Chesterton, Edgeworth, Melville, Milton, Shakespeare, and Whitman.
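With eleven authors of very different corpus sizes, a single accuracy figure can hide weak performance on the smaller classes. The sketch below breaks accuracy down by author. The `gold` and `predicted` lists are made-up toy data, and `per_label_accuracy` is a hypothetical helper that is not part of NLTK.

```python
from collections import Counter


def per_label_accuracy(gold, predicted):
    """Return {label: share of items with that gold label that were predicted correctly}."""
    totals, correct = Counter(), Counter()
    for g, p in zip(gold, predicted):
        totals[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Toy data: one of the two 'blak' items is mislabeled as 'ches'
gold = ["blak", "blak"] + ["ches"] * 8
predicted = ["blak", "ches"] + ["ches"] * 8

overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
by_label = per_label_accuracy(gold, predicted)
print(overall, by_label)
```

In the notebook, the predictions could presumably be collected with `[classifier.classify(fs) for fs, _ in test_set]` and the gold labels with `[label for _, label in test_set]`.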

In [86]:
classifier.show_most_informative_features(50)

Most Informative Features
         contains(alice) = True             carr : ches   =     42.2 : 1.0
      contains(children) = True             blak : melv   =     38.5 : 1.0
          contains(mild) = True             blak : ches   =     38.5 : 1.0
         contains(smile) = True             blak : melv   =     38.5 : 1.0
        contains(smiles) = True             blak : ches   =     38.5 : 1.0
        contains(sorrow) = True             blak : ches   =     38.5 : 1.0
         contains(sweet) = True             blak : ches   =     38.5 : 1.0
           contains(had) = False            whit : ches   =     34.0 : 1.0
            contains(an) = False            milt : ches   =     33.9 : 1.0
           contains(mrs) = True             aust : melv   =     33.0 : 1.0
           contains(you) = False            milt : edge   =     31.4 : 1.0
           contains(big) = True             burg : edge   =     31.2 : 1.0
          contains(thou) = True             shak : edge   =     31.2 : 1.0

According to these characteristics we can see that a text that contains the word 'children' is 38.5 times more likely to be from Blake than from Melville.

We can also see that a text that contains the word 'glad' is 27.5 times more likely to be from Carroll than from Melville.

[https://www.youtube.com/watch?v=lRxMWDj8QWw](https://www.youtube.com/watch?v=lRxMWDj8QWw)