# Classification and Clustering Analysis: Automated Topic Recommendations


Initiation and Data Preprocessing
* Import Packages and Data
* Data Cleaning

Data Exploration and Analysis
* Analyzing the Sources of the Text
* Analyzing the Context of the Text

Unsupervised Feature Preparation
* Text Vectorization 
* Feature Reduction

Creating Recommendations Using Cosine Similarity

* Real-time Predictions
* Batch Predictions

Recommendation Analysis
* Evaluating Recommendation Techniques

Final Modeling Pipeline
* Full Code
* Analysis and Conclusion

## Initiation and Data Preprocessing

The data used in this study comes from texts in NLTK's Gutenberg corpus.
Excerpts from the writing of ten authors were used in clustering and classification.
Excerpts were labeled with the authors' last names: Chesterton, Bryant, Edgeworth, Austen, Whitman, Milton, Melville, Carroll, Shakespeare, and Burgess.
After labeling, two sets of features were created, one using bag of words and one using TF-IDF.
Both feature sets were then reduced with singular value decomposition (SVD) to lower computational cost and remove noise.
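The feature-reduction step is not shown in the cells below, so here is a minimal sketch of how bag-of-words and TF-IDF features might be reduced with truncated SVD. The toy corpus, the variable names, and `n_components=2` are illustrative assumptions, not values from the study.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the labeled excerpts (hypothetical data).
docs = [
    "the whale swam beneath the ship",
    "the ship sailed after the whale",
    "alice followed the white rabbit",
    "the rabbit looked at alice",
]

# Two feature sets: raw term counts (bag of words) and TF-IDF weights.
bow = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse matrices. n_components=2 is an
# arbitrary choice for this toy example and must be smaller than the
# number of features.
bow_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(bow)
tfidf_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(bow.shape, "->", bow_reduced.shape)
print(tfidf.shape, "->", tfidf_reduced.shape)
```

On real data the number of components would typically be chosen by inspecting the explained variance rather than fixed in advance.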

### Import Packages and Data

In [1]:
%%time


!python -m spacy download en

import numpy as np
import pandas as pd
import re

import nltk
import spacy
import textwrap
from nltk.corpus import gutenberg
nltk.download('gutenberg')

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

# Suppress Warnings

import warnings
warnings.filterwarnings(
    action="ignore"  
    )

# Display Preferences

pd.options.display.float_format = '{:.3f}'.format
pd.options.mode.chained_assignment = None

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
CPU times: user 727 ms, sys: 133 ms, total: 860 ms
Wall time: 7.35 s


In [2]:
%%time

## Import Files and Process the Raw Data.
print(gutenberg.fileids())

brown = gutenberg.raw('chesterton-brown.txt')
stories = gutenberg.raw('bryant-stories.txt')
parents = gutenberg.raw('edgeworth-parents.txt')
emma = gutenberg.raw('austen-emma.txt')
leaves = gutenberg.raw('whitman-leaves.txt')
paradise = gutenberg.raw('milton-paradise.txt')
moby_dick = gutenberg.raw('melville-moby_dick.txt')
alice = gutenberg.raw('carroll-alice.txt')
hamlet = gutenberg.raw('shakespeare-hamlet.txt')
busterbrown = gutenberg.raw('burgess-busterbrown.txt')

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
CPU times: user 4.25 ms, sys: 5.02 ms, total: 9.26 ms
Wall time: 10.4 ms


### Data Cleaning

In [3]:
%%time

## Clean text data

def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove bracketed text; a raw string avoids invalid escape sequences
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text

brown = text_cleaner(brown[:60000])
stories = text_cleaner(stories[:60000])
parents = text_cleaner(parents[:60000])
emma = text_cleaner(emma[:60000])
leaves = text_cleaner(leaves[:60000])
paradise = text_cleaner(paradise[:60000])
moby_dick = text_cleaner(moby_dick[:60000])
alice = text_cleaner(alice[:60000])
hamlet = text_cleaner(hamlet[:60000])
busterbrown = text_cleaner(busterbrown[:60000])

CPU times: user 6.69 ms, sys: 3.95 ms, total: 10.6 ms
Wall time: 10.6 ms
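To make the effect of the cleaning step concrete, the sketch below applies the same three operations to a short sample string. The sample text is a hypothetical input written for illustration. Note that `.` does not match newlines by default, so a bracketed passage that spans several lines would not be removed by this pattern.

```python
import re

def text_cleaner(text):
    # Replace double dashes with a space.
    text = re.sub(r'--', ' ', text)
    # Drop bracketed text that sits on a single line.
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse all runs of whitespace, including newlines, to single spaces.
    return ' '.join(text.split())

# Hypothetical sample input.
sample = "[Emma by Jane Austen 1816]\n\nVOLUME I--CHAPTER I\n  Emma   Woodhouse"
cleaned = text_cleaner(sample)
print(cleaned)
```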


In [4]:
%%time

## Extract ~ 100 Excerpts Each from All Texts, Totaling ~ 1000 Excerpts to Be Processed

all_texts = [ 
brown, stories, parents, emma,
leaves, paradise, moby_dick,
alice, hamlet, busterbrown
]

cleaned_excerpts = []
for text in all_texts:
  # Width is 1/100 of the text length, so each text yields roughly 100 excerpts
  cleaned_excerpts.append(textwrap.wrap(text, len(text) // 100))

CPU times: user 158 ms, sys: 1.53 ms, total: 160 ms
Wall time: 160 ms
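The excerpt split relies on `textwrap.wrap`, which breaks only at whitespace and never exceeds the requested width. A small sketch of that behavior follows, using a synthetic string in place of a real text (the string and variable names are assumptions). Because lines end at word boundaries, a text can need slightly more than 100 chunks, which is consistent with the 1,010 rows reported for the TF-IDF matrix later in the notebook.

```python
import textwrap

# Hypothetical stand-in for one cleaned text: 2,000 short "words".
text = " ".join(f"word{i}" for i in range(2000))

width = len(text) // 100            # same width rule as the cell above
chunks = textwrap.wrap(text, width)

# No chunk exceeds the width, and breaking at whitespace means the
# text may need a few more than 100 chunks.
print(len(chunks), max(len(c) for c in chunks), width)
```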


In [5]:
%%time

## Parse the cleaned excerpts

nlp = spacy.load('en')

parsed_excerpts = []
for excerpts in cleaned_excerpts:
  single_text = []
  for excerpt in excerpts:
    single_text.append(nlp(excerpt))
  parsed_excerpts.append(single_text)


CPU times: user 23.7 s, sys: 262 ms, total: 24 s
Wall time: 24.1 s


In [6]:
%%time

## Convert Excerpts into Strings and Load them into Dataframes

all_sents = []
for excerpts in parsed_excerpts:
  single_text = []
  for excerpt in excerpts:
    single_text.append(str(excerpt))
  all_sents.append(pd.DataFrame(single_text).applymap(str).apply(lambda x: x + ' '))
  

CPU times: user 113 ms, sys: 1.05 ms, total: 114 ms
Wall time: 127 ms


In [7]:
%%time

## Assign Authors to Excerpts

author_names = [
"chesterton",
"bryant",
"edgeworth",
"austen",
"whitman",
"milton",
"melville",
"carroll",
"shakespeare",
"burgess"
]

for index_number, author in enumerate(author_names):
  all_sents[index_number]['author'] = author

CPU times: user 6.57 ms, sys: 41 µs, total: 6.62 ms
Wall time: 7.35 ms


In [8]:
%%time

## Load Labeled Excerpts into single DataFrame

labeled_excerpts = pd.concat(all_sents)

print(labeled_excerpts.head())

                                                   0      author
0  I. The Absence of Mr Glass THE consulting-room...  chesterton
1  poetry. These things were there, in their plac...  chesterton
2  room was lined with as complete a set of Engli...  chesterton
3  ballads and the tables laden with drink and to...  chesterton
4  shot with grey, but growing thick and healthy;...  chesterton
CPU times: user 14.2 ms, sys: 987 µs, total: 15.2 ms
Wall time: 14.9 ms


## Data Exploration and Analysis

In [9]:
for sents in all_sents:
  print(sents.iloc[1,1] + ':\n' + sents.iloc[1,0] + '\n')

chesterton:
poetry. These things were there, in their place; but one felt that they were never allowed out of their place. Luxury was there: there stood upon a special table eight or ten boxes of the best cigars; but they were built upon a plan so that the strongest were always nearest the wall and the mildest nearest the window. A tantalus containing three kinds of spirit, all of a liqueur excellence, stood always on this table of luxury; but the fanciful have asserted that the whisky, brandy, and rum seemed always to stand at the same level. Poetry was there: the left-hand corner of the 

bryant:
"It's the Rain, and I want to come in," said a soft, sad, little voice. "No, you can't come in," the little Tulip said. By and by she heard another little _tap, tap, tap_ on the window-pane. "Who is there?" she said. The same soft little voice answered, "It's the Rain, and I want to come in!" "No, you can't come in," said the little Tulip. Then it was very still for a long time. At last, the

## Unsupervised Feature Preparation

In [10]:
%%time

## Vectorizing Text Data 

# Using Porter Stemmer

porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words
  

# Using TfIdf

tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, max_features=1500, use_idf=True)
X = tfidf_vectorizer.fit_transform(labeled_excerpts.iloc[:,0])
df_tfidf = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

# View Shape of TfIdf Matrix

print(df_tfidf.shape)


(1010, 1500)
CPU times: user 1.61 s, sys: 8.92 ms, total: 1.62 s
Wall time: 1.64 s


In [11]:
%%time

labeled_excerpts.head()

CPU times: user 575 µs, sys: 10 µs, total: 585 µs
Wall time: 525 µs


|   | 0 | author |
|---|---|---|
| 0 | I. The Absence of Mr Glass THE consulting-room... | chesterton |
| 1 | poetry. These things were there, in their plac... | chesterton |
| 2 | room was lined with as complete a set of Engli... | chesterton |
| 3 | ballads and the tables laden with drink and to... | chesterton |
| 4 | shot with grey, but growing thick and healthy;... | chesterton |


In [12]:
df_tfidf.head(5)

Unnamed: 0,abl,abov,absenc,accept,accord,account,acorn,acquaint,act,activ,actual,ad,address,admir,admit,advanc,advantag,advic,advis,affect,afford,afraid,afternoon,afterward,age,agent,ago,agre,ah,ahead,ain,air,ala,alic,allow,alon,aloud,alreadi,altar,altogeth,...,wide,widow,wife,wild,william,wind,window,wine,wing,wink,winter,wise,wish,wit,woe,wolf,woman,women,won,wonder,wont,wood,woodhous,word,work,world,wors,worst,worth,wouldn,wound,wretch,write,wrong,ye,year,yell,yellow,young,youth
0,0.0,0.0,0.194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Creating Recommendations

In [13]:
%%time

## Select a text to create recommendations for

# Text Was Selected Using Index From Labeled Excerpts DataFrame
test_text = labeled_excerpts.iloc[2,:]

print('\n' + test_text[0][:100] + '\n' + test_text[0][100:200] + '\n' + test_text[0][200:300] + '\n')


room was lined with as complete a set of English classics as the right hand could show of English an
d foreign physiologists. But if one took a volume of Chaucer or Shelley from that rank, its absence 
irritated the mind like a gap in a man's front teeth. One could not say the books were never read; p

CPU times: user 794 µs, sys: 0 ns, total: 794 µs
Wall time: 706 µs


In [14]:
%%time

## Transform the text into a TF-IDF vector using the vectorizer fitted on the original data

test_tfidf = tfidf_vectorizer.transform([test_text[0]])

CPU times: user 4.63 ms, sys: 0 ns, total: 4.63 ms
Wall time: 6.5 ms


In [15]:
%%time

## Compute cosine similarity between the test text and every document

# cosine_similarity accepts the full matrix, so one call scores every excerpt
results = cosine_similarity(test_tfidf, X).ravel()

CPU times: user 586 ms, sys: 0 ns, total: 586 ms
Wall time: 591 ms
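For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The sketch below computes it by hand with NumPy for a handful of toy vectors. The function name `cosine_scores` and the example values are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def cosine_scores(query, matrix):
    """Cosine similarity between one query vector and every row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows with zero norm get a score of 0 instead of a division error.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Toy document vectors (hypothetical values, not taken from the corpus).
docs = np.array([[1.0, 0.0, 1.0],
                 [0.0, 2.0, 0.0],
                 [2.0, 0.0, 2.0],
                 [0.0, 0.0, 0.0]])
scores = cosine_scores(docs[0], docs)
print(scores)
```

The third vector points in the same direction as the first, so it scores the same as the self-match even though its magnitude differs. This is why cosine similarity suits excerpts of varying length.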


In [16]:
%%time

## Create a dataframe for text recommendations

test_text_recommendations = labeled_excerpts.reset_index(drop=True)
test_text_recommendations.loc[:,'test_recommendation'] = results
test_text_recommendations = test_text_recommendations.sort_values(by=['test_recommendation'], ascending=False)

CPU times: user 3.26 ms, sys: 0 ns, total: 3.26 ms
Wall time: 3.65 ms


In [17]:
%%time

## Display Top 10 Recommendations

test_text_recommendations.head(10)

CPU times: user 495 µs, sys: 12 µs, total: 507 µs
Wall time: 504 µs


|   | 0 | author | test_recommendation |
|---|---|---|---|
| 2 | room was lined with as complete a set of Engli... | chesterton | 1.0 |
| 3 | ballads and the tables laden with drink and to... | chesterton | 0.289 |
| 57 | cloudy ideals or cloudy compromises of the nor... | chesterton | 0.22 |
| 0 | I. The Absence of Mr Glass THE consulting-room... | chesterton | 0.198 |
| 683 | nigh the tail, and, like a restless needle soj... | melville | 0.191 |
| 427 | and foot, and parks of artillery, And artiller... | whitman | 0.183 |
| 7 | bobbing bow over it, as if setting everything ... | chesterton | 0.163 |
| 92 | with great attention. The brigand captain took... | chesterton | 0.154 |
| 77 | the English, no nobler crests or chasms than t... | chesterton | 0.141 |
| 377 | as to send Mrs. Goddard a beautiful goose the ... | austen | 0.141 |
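The steps in cells 13 through 17 could be wrapped into a single function for real-time use. The sketch below is one possible shape for that function, not the notebook's own pipeline. The name `recommend`, the toy `corpus`, and the `top_n` parameter are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query, vectorizer, doc_matrix, top_n=3):
    """Return (row index, score) pairs for the top_n rows most similar to query."""
    # Reuse the fitted vectorizer so the query shares the corpus vocabulary.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    best = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical toy corpus standing in for the labeled excerpts.
corpus = [
    "the whale and the ship at sea",
    "a rabbit hole and a tea party",
    "the ship chased the white whale",
    "the queen played croquet",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)

result = recommend("a whale near the ship", vectorizer, doc_matrix, top_n=2)
for idx, score in result:
    print(idx, round(score, 3), corpus[idx])
```

When the query is itself a row of the corpus, as in the cells above, the top result is always the query excerpt with a score of 1.0, so a caller would normally skip the first match.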


In [18]:
list(test_text_recommendations.iloc[:,0][:5])

["room was lined with as complete a set of English classics as the right hand could show of English and foreign physiologists. But if one took a volume of Chaucer or Shelley from that rank, its absence irritated the mind like a gap in a man's front teeth. One could not say the books were never read; probably they were, but there was a sense of their being chained to their places, like the Bibles in the old churches. Dr Hood treated his private book-shelf as if it were a public library. And if this strict scientific intangibility steeped even the shelves laden with lyrics and ",
 "ballads and the tables laden with drink and tobacco, it goes without saying that yet more of such heathen holiness protected the other shelves that held the specialist's library, and the other tables that sustained the frail and even fairylike instruments of chemistry or mechanics. Dr Hood paced the length of his string of apartments, bounded as the boys' geographies say on the east by the North Sea and on the

## Analysis and Evaluation of Methods

## Final Modeling Pipeline

In [19]:
test_text_recommendations

|   | 0 | author | test_recommendation |
|---|---|---|---|
| 2 | room was lined with as complete a set of Engli... | chesterton | 1.000 |
| 3 | ballads and the tables laden with drink and to... | chesterton | 0.289 |
| 57 | cloudy ideals or cloudy compromises of the nor... | chesterton | 0.220 |
| 0 | I. The Absence of Mr Glass THE consulting-room... | chesterton | 0.198 |
| 683 | nigh the tail, and, like a restless needle soj... | melville | 0.191 |
| ... | ... | ... | ... |
| 832 | Incestuous sheets: It is not, nor it cannot co... | shakespeare | 0.000 |
| 504 | righteous, I make appointments with all, I wil... | whitman | 0.000 |
| 278 | potato to put into my mouth! I, that have been... | edgeworth | 0.000 |
| 525 | or what more lost in Hell?" So Satan spake; an... | milton | 0.000 |


### Batch Recommendations

In [0]:

from sklearn.metrics.pairwise import linear_kernel

In [0]:
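# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel equals cosine similarity here
cosine_similarities = linear_kernel(df_tfidf, df_tfidf)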

In [22]:
cosine_similarities

array([[1.        , 0.16154482, 0.19772786, ..., 0.        , 0.01794152,
        0.03803436],
       [0.16154482, 1.        , 0.05019935, ..., 0.01772728, 0.00891493,
        0.01057863],
       [0.19772786, 0.05019935, 1.        , ..., 0.02195751, 0.03157508,
        0.01097119],
       ...,
       [0.        , 0.01772728, 0.02195751, ..., 1.        , 0.38400856,
        0.05550186],
       [0.01794152, 0.00891493, 0.03157508, ..., 0.38400856, 1.        ,
        0.02839026],
       [0.03803436, 0.01057863, 0.01097119, ..., 0.05550186, 0.02839026,
        1.        ]])
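Given a full similarity matrix like the one above, batch recommendations amount to ranking each row and skipping the diagonal self-match. A minimal sketch under that assumption follows. The helper `top_n_similar` and the small `toy` matrix are hypothetical and use made-up values.

```python
import numpy as np

def top_n_similar(similarity, n=3):
    """Return, for each row, the indices of the n most similar other rows."""
    sim = np.array(similarity, dtype=float)   # copy so the input is untouched
    np.fill_diagonal(sim, -np.inf)            # exclude each row's match with itself
    # argsort is ascending, so reverse the columns for descending similarity.
    order = np.argsort(sim, axis=1)[:, ::-1]
    return order[:, :n]

# Small symmetric matrix standing in for cosine_similarities (made-up values).
toy = np.array([[1.0, 0.2, 0.7, 0.1],
                [0.2, 1.0, 0.3, 0.6],
                [0.7, 0.3, 1.0, 0.4],
                [0.1, 0.6, 0.4, 1.0]])
recommendations = top_n_similar(toy, n=2)
print(recommendations)
```

Each row of the result holds positional indices, which can be passed to `labeled_excerpts.iloc` after the index has been reset.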

In [0]:
r = labeled_excerpts.iterrows()

In [24]:
next(r)

(0, 0         I. The Absence of Mr Glass THE consulting-room...
 author                                           chesterton
 Name: 0, dtype: object)

In [26]:
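# 'test_recommendation' was added to test_text_recommendations, not to
# labeled_excerpts, so sorting labeled_excerpts by it raises a KeyError
test_text_recommendations.sort_values(by=['test_recommendation'], ascending=False)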

## Analysis and Conclusion

Both classification and clustering reliably labeled the excerpts by author in this study.
The neural network built on TF-IDF features, the best-performing model, reached over 95% accuracy and regularly labeled each author's excerpts correctly more than 80% of the time.
Clustering also yielded high accuracy, grouping over 80% of each author's excerpts together, with the exception of Melville's. Melville's excerpts were the least consistently labeled across all models, suggesting that the discrepancies stem from the data itself rather than from the method of feature preparation or classification.
Supervised modeling had the advantage of being more accurate, but clustering showed strong reliability even in the absence of labels.

These results establish the relative ability of clustering and of supervised models built on unsupervised-learning features to identify the authors of writing samples.
When data can be reliably labeled with both unsupervised and supervised methods, large amounts of text can be analyzed even in the absence of pre-established labels, a condition often found in real-world data.
The next step in using this data to discern the source of a text would be to collect more text from different authors and more data about the context in which the text was generated. The study can then be expanded to include other types of feature preparation.

Understanding how to better use unsupervised modeling techniques to predict authorship will give insight into what kinds of people generate different types of text.
This can be applied in practice to consumer data such as reviews and other user-generated text.
Doing so can allow for more direct marketing to users or for product changes that better match how services are used.
Being able to use readily available data that is not always processed and labeled allows revenue-generating decisions to be made at very little cost to stakeholders.

