## Unsupervised Learning Natural Language Processing Capstone 
In this unsupervised learning capstone, I use 24 novels from 12 authors, drawn from the NLTK Gutenberg corpus and from [Project Gutenberg](https://www.gutenberg.org/) (the latter were added to the corpus manually).


Steps and techniques:
-  Pick a set of texts. I used 24 different texts from different authors on Project Gutenberg.
-  Perform standard data cleaning on the text using tools such as spaCy and stop-word lists.
-  Break the data into two groups: the training group (75%) and the holdout group (25%).
-  Perform various clustering methods, decide which technique best represents the data, and explain your reasoning.
-  Perform some unsupervised feature generation and selection using techniques such as Latent Semantic Analysis (LSA), a tf-idf term-document matrix, word2vec, Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF).
-  Perform the clustering techniques on the holdout group and document the performance for changes, stability, and consistencies in comparison to the original model.
- Summarize all findings including visuals in a separate but linked document.

##### Imported Modules Cell

In [1]:
import numpy as np
import pandas as pd
import scipy
import spacy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import requests
import pickle
import string
import en_core_web_sm

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")

#sklearn modules
import sklearn
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer, normalize
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

#clustering
from sklearn.cluster import MeanShift, estimate_bandwidth, KMeans
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, AffinityPropagation 
from sklearn.datasets import make_blobs
from sklearn import metrics
from sklearn.metrics import silhouette_score
import itertools
from itertools import cycle
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity


#nltk modules
import nltk
from nltk.corpus import gutenberg
from nltk.stem import WordNetLemmatizer

### Data
- The Ivory Child by Henry Rider Haggard
- Eric Brighteyes by Henry Rider Haggard
- The Sea-Hawk by Rafael Sabatini
- Scaramouche: A Romance of the French Revolution by Rafael Sabatini
- Moby Dick by Herman Melville
- A Romance of the South Seas by Herman Melville
- Tarzan the Terrible by Edgar Rice Burroughs
- Pellucidar by Edgar Rice Burroughs
- Adventures of Huckleberry Finn by Mark Twain
- The Adventures of Tom Sawyer by Mark Twain

In [4]:
#Load the data/novels/text

data = {'book' :["The Ivory Child", "Eric Brighteyes",
                 "The Sea-Hawk", "Scaramouche: A Romance Of The French Revolution",
                 "Moby Dick", "A Romance Of The South Seas",
                 "Tarzan The Terrible", "Pellucidar",
                 "Adventures Of Huckleberry Finn", "The Adventures Of Tom Sawyer"],
        'author' :['Henry Rider Haggard', 'Henry Rider Haggard', 
                   'Rafael Sabatini', 'Rafael Sabatini', 
                   'Herman Melville', 'Herman Melville', 
                   'Edgar Rice Burroughs', 'Edgar Rice Burroughs',
                   'Mark Twain', 'Mark Twain'],
       'novel':[gutenberg.raw('haggard-ivory.txt'), gutenberg.raw('haggard-brighteyes.txt'), 
                gutenberg.raw('sabatini-seahawk.txt'), gutenberg.raw('sabatini-scaramouche.txt'), 
                gutenberg.raw('melville-moby_dick.txt'), gutenberg.raw('melville-southsea.txt'),   
                gutenberg.raw('burroughs-tarzan.txt'), gutenberg.raw('burroughs-pellucidar.txt'), 
                gutenberg.raw('twain-huckleberry.txt'), gutenberg.raw('twain-sawyer.txt')],
       'genre' :['Adventure', 'Adventure',
                 'Adventure', 'Adventure',
                 'Adventure', 'Adventure',
                 'Adventure', 'Adventure', 
                 'Adventure', 'Adventure']}

In [22]:
str(data).encode('utf8','replace')



In [23]:
#place the data in a dataframe
books = pd.DataFrame(data, columns= ['book','author','novel','genre'])
books.head(10)

Unnamed: 0,book,author,novel,genre
0,The Ivory Child,Henry Rider Haggard,ï»¿THE IVORY CHILD\r\n\r\nby H. Rider Haggard\...,Adventure
1,Eric Brighteyes,Henry Rider Haggard,ï»¿\r\nERIC BRIGHTEYES\r\n\r\nby H. Rider Hagg...,Adventure
2,The Sea-Hawk,Rafael Sabatini,ï»¿THE SEA-HAWK\r\n\r\n\r\nBy Rafael Sabatini\...,Adventure
3,Scaramouche: A Romance Of The French Revolution,Rafael Sabatini,ï»¿\r\nSCARAMOUCHE\r\n\r\nA ROMANCE OF THE FRE...,Adventure
4,Moby Dick,Herman Melville,ï»¿[Moby Dick by Herman Melville 1851]\r\n\r\n...,Adventure
5,A Romance Of The South Seas,Herman Melville,ï»¿A ROMANCE OF THE SOUTH SEAS\r\n\r\n\r\nBy H...,Adventure
6,Tarzan The Terrible,Edgar Rice Burroughs,ï»¿Tarzan the Terrible\r\n\r\n\r\nBy\r\n\r\nEd...,Adventure
7,Pellucidar,Edgar Rice Burroughs,"ï»¿The Project Gutenberg EBook of Pellucidar, ...",Adventure
8,Adventures Of Huckleberry Finn,Mark Twain,ï»¿ADVENTURES\r\n\r\nOF\r\n\r\nHUCKLEBERRY FIN...,Adventure
9,The Adventures Of Tom Sawyer,Mark Twain,ï»¿THE ADVENTURES OF TOM SAWYER\r\n\r\nBy Mark...,Adventure


## Data Cleaning

In [24]:
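# Utility function for the first round of text cleaning (applied to whole novels).
def text_cleaner(text):
    # Mask Project Gutenberg boilerplate words with "0". These patterns are
    # case-sensitive, so "Project Gutenberg" in title case is left untouched.
    text = re.sub(r"project gutenberg", "0", text)
    text = re.sub(r"gutenberg", "0", text)
    text = re.sub(r"project", "0", text)

    # Replace double hyphens and underscores with spaces; drop bracketed text
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'_', ' ', text)
    text = re.sub(r"\[.*\]", "", text)

    # Get rid of chapter titles that use Arabic numerals
    # (Roman-numeral headings such as "CHAPTER I" are not matched)
    text = re.sub(r'Chapter \d+', '', text)
    text = re.sub(r'CHAPTER \d+', '', text)

    # Mask the period in Mrs., Mr., St. and Ms. so that the later split on '.'
    # does not break sentences at these abbreviations
    text = re.sub(r'Mrs\. ', 'Mrs0 ', text)
    text = re.sub(r'Mr\. ', 'Mr0 ', text)
    text = re.sub(r'St\. ', 'St0 ', text)
    text = re.sub(r'Ms\. ', 'Ms0 ', text)

    # Remove any single line enclosed by blank lines (\n\n ... \n\n)
    text = re.sub(r"\n\n.*?\n\n", '', text)

    # Get rid of extra spacing and the mis-decoded byte-order-mark characters
    text = re.sub(r"  ", " ", text)
    text = re.sub(r'[ï»¿]', '', text)

    # Collapse all remaining whitespace (including \r\n) into single spaces
    text = ' '.join(text.split())
    return text

# Bind the function object directly so that later cells, which redefine
# text_cleaner, do not change what round0 does.
round0 = text_cleaner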

In [25]:
# Let's take a look at the updated text
books['novel'] = books.novel.apply(round0)

books.head(20)

Unnamed: 0,book,author,novel,genre
0,The Ivory Child,Henry Rider Haggard,THE IVORY CHILD by H. Rider Haggard CHAPTER I ...,Adventure
1,Eric Brighteyes,Henry Rider Haggard,ERIC BRIGHTEYES by H. Rider Haggard DEDICATION...,Adventure
2,The Sea-Hawk,Rafael Sabatini,THE SEA-HAWK By Rafael Sabatini NOTE Lord Henr...,Adventure
3,Scaramouche: A Romance Of The French Revolution,Rafael Sabatini,SCARAMOUCHE A ROMANCE OF THE FRENCH REVOLUTION...,Adventure
4,Moby Dick,Herman Melville,ETYMOLOGY. (Supplied by a Late Consumptive Ush...,Adventure
5,A Romance Of The South Seas,Herman Melville,A ROMANCE OF THE SOUTH SEAS By Herman Melville...,Adventure
6,Tarzan The Terrible,Edgar Rice Burroughs,Tarzan the Terrible By Edgar Rice Burroughs CH...,Adventure
7,Pellucidar,Edgar Rice Burroughs,"The Project Gutenberg EBook of Pellucidar, by ...",Adventure
8,Adventures Of Huckleberry Finn,Mark Twain,ADVENTURES OF HUCKLEBERRY FINN (Tom Sawyer's C...,Adventure
9,The Adventures Of Tom Sawyer,Mark Twain,THE ADVENTURES OF TOM SAWYER By Mark Twain (Sa...,Adventure


In [26]:
# Turn each novel into sentences by splitting on periods
sentences = []
for row in books.itertuples(index=False):
    for sentence in row.novel.split('.'):
        if sentence != '':
            sentences.append((row.book, row.author, sentence, row.genre))
books = pd.DataFrame(sentences, columns=['book', 'author', 'sentence', 'genre'])
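
The masking of `Mr.`, `Mrs.`, `St.` and `Ms.` in the first cleaning round exists because the split above treats every period as a sentence boundary. The sketch below illustrates the effect on a made-up sample string. The helper names `mask_abbreviations` and `split_sentences` are hypothetical and not part of the notebook, and the word-boundary anchors (`\b`) are an addition that the notebook's own patterns do not use.

```python
import re

# Hypothetical helpers; the abbreviation list mirrors the one in text_cleaner.
ABBREVIATIONS = ("Mrs", "Mr", "St", "Ms")


def mask_abbreviations(text):
    """Replace 'Mr. ' with 'Mr0 ' (and so on) so the period is not a split point."""
    for abbr in ABBREVIATIONS:
        text = re.sub(rf"\b{abbr}\. ", f"{abbr}0 ", text)
    return text


def split_sentences(text):
    """Split on '.', drop empty pieces, then restore the masked abbreviations."""
    pieces = [p.strip() for p in mask_abbreviations(text).split(".")]
    return [re.sub(r"\b(Mrs|Mr|St|Ms)0 ", r"\1 ", p) for p in pieces if p]


# Made-up sample text, used only for illustration.
sample = "Mr. Thatcher met Mrs. Douglas. They spoke of St. Petersburg."

naive = [p.strip() for p in sample.split(".") if p.strip()]
masked = split_sentences(sample)

print(len(naive), naive)    # the naive split breaks at every abbreviation
print(len(masked), masked)  # the masked split keeps each sentence whole
```

As in the notebook, the restored abbreviations lose their period (`Mr` rather than `Mr.`), which is harmless once punctuation is stripped for vectorization.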

In [27]:
books.head()

Unnamed: 0,book,author,sentence,genre
0,The Ivory Child,Henry Rider Haggard,THE IVORY CHILD by H,Adventure
1,The Ivory Child,Henry Rider Haggard,Rider Haggard CHAPTER I ALLAN GIVES A SHOOTIN...,Adventure
2,The Ivory Child,Henry Rider Haggard,Amongst many other things it tells of the war...,Adventure
3,The Ivory Child,Henry Rider Haggard,Often since then I have wondered if this crea...,Adventure
4,The Ivory Child,Henry Rider Haggard,"It seems improbable, even impossible, but the...",Adventure


In [38]:
# Utility function for the second round of text cleaning (applied to sentences).
def text_cleaner(text):
    # Restore the abbreviations that were masked before sentence splitting
    text = re.sub(r'Mrs0 ', 'Mrs ', text)
    text = re.sub(r'Mr0 ', 'Mr ', text)
    text = re.sub(r'St0 ', 'St ', text)
    text = re.sub(r'Ms0 ', 'Ms ', text)
    text = re.sub(r'â\?', ' ', text)

    # Get rid of some punctuation and brackets
    text = re.sub(r"/.*? ", " ", text)
    text = re.sub(r"[\[].,*?[\]]", "", text)
    text = re.sub(r"\./\.", "", text)
    text = re.sub(r"``", "", text)
    text = re.sub(r"''", "", text)
    text = re.sub(r"  ", " ", text)
    text = re.sub(r"./", " ", text)

    # Remove tokens that contain digits, then any leftover digits and periods
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'[0-9\.]+', '', text)

    # Get rid of extra spacing
    text = re.sub(r"  ", " ", text)

    # Drop possessive 's
    text = re.sub(r"'s", " ", text)

    text = ' '.join(text.split())
    return text

# Bind the function object directly so a later redefinition of text_cleaner
# does not change what round1 does.
round1 = text_cleaner

In [39]:
# Let's take a look at the updated text

books['sentence'] = books.sentence.apply(round1)
books.tail(20)

Unnamed: 0,book,author,sentence,genre
51697,The Adventures Of Tom Sawyer,Mark Twain,"âAll right, Huck, it a whiz! Come along, old...",Adventure
51698,The Adventures Of Tom Sawyer,Mark Twain,"âWill you, Tom now will you? That good",Adventure
51699,The Adventures Of Tom Sawyer,Mark Twain,If she'll let up on some of the roughest thing...,Adventure
51700,The Adventures Of Tom Sawyer,Mark Twain,When you going to start the gang and turn robb...,Adventure
51701,The Adventures Of Tom Sawyer,Mark Twain,We'll get the boys together and have the initi...,Adventure
51702,The Adventures Of Tom Sawyer,Mark Twain,âHave the which? âHave the initiation,Adventure
51703,The Adventures Of Tom Sawyer,Mark Twain,âWhat that? âIt to swear to stand by one a...,Adventure
51704,The Adventures Of Tom Sawyer,Mark Twain,"âThat gay that mighty gay, Tom, I tell you",Adventure
51705,The Adventures Of Tom Sawyer,Mark Twain,"âWell, I bet it is",Adventure
51706,The Adventures Of Tom Sawyer,Mark Twain,And all that swearing got to be done at midnig...,Adventure


In [None]:
# Utility function for the third round of text cleaning.
def text_cleaner(text):

    # Get rid of any XML-style markup
    text = re.sub(r'<.*>', '', text)

    # Get rid of the "ENDOFARTICLE." marker (the period is escaped so it matches literally)
    text = re.sub(r'ENDOFARTICLE\.', '', text)

    text = ' '.join(text.split())
    return text

round2 = text_cleaner

In [None]:
# Let's take a look at the updated text
books['sentence'] = books.sentence.apply(round2)
books.head(10)

In [None]:
# Make every sentence lowercase
books['sentence']= books['sentence'].str.lower()


In [None]:
books.index=books.book

In [None]:
train_test, holdout= train_test_split(books, test_size=0.25, random_state=45)
train, test= train_test_split(train_test, test_size=0.30, random_state=45)

print('Train:', train.shape[0])
print('Test:', test.shape[0]) 
print('Holdout:', holdout.shape[0])
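
The two calls above first set aside 25% of the sentences as the holdout set and then divide the remaining 75% into 70% train and 30% test. The short arithmetic check below shows the share of the full dataset that ends up in each group. The variable names are illustrative only and do not appear in the notebook.

```python
# Fractions taken from the two train_test_split calls above.
holdout_frac = 0.25        # test_size of the first split
test_frac_of_rest = 0.30   # test_size of the second split

rest = 1 - holdout_frac    # share that goes into train + test

shares = {
    "train": rest * (1 - test_frac_of_rest),
    "test": rest * test_frac_of_rest,
    "holdout": holdout_frac,
}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of all sentences")
```

So the models are fit on roughly half of the sentences (52.5%), with 22.5% used for testing and 25% held out.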

In [None]:
vectorizer = TfidfVectorizer(max_df=0.6, min_df=2,
                            lowercase=True, use_idf=True,
                            norm='l2', stop_words='english',
                            smooth_idf=True)

# Applying the vectorizer
TfIdf = vectorizer.fit_transform(train['sentence'])

TfIdf_csr = TfIdf.tocsr()

In [None]:
# A list of dictionaries, one per training sentence
tfidf_bypara = [{} for _ in range(TfIdf_csr.shape[0])]

# List of features (get_feature_names_out requires scikit-learn 1.0 or later)
terms = vectorizer.get_feature_names_out()
# For each sentence, record the feature words and their tf-idf scores
for i, j in zip(*TfIdf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = TfIdf_csr[i, j]

# Only nonzero scores are stored, so every word listed occurs in the sentence.
# The vector at position 0 belongs to the first training sentence, so print that sentence.
print('Sentence:\n', train['sentence'].iloc[0], '\nTf_idf vector:\n', tfidf_bypara[0])
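
To make the scores printed above easier to interpret, the sketch below recomputes the inverse document frequency by hand and compares it with what `TfidfVectorizer` stores. With `smooth_idf=True`, scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, so a term that occurs in every document still gets an idf of 1 rather than 0. The three toy documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus, used only to illustrate the formula.
docs = ["the whale swam", "the whale dived", "the ship sailed"]

vec = TfidfVectorizer(smooth_idf=True, use_idf=True, norm="l2")
X = vec.fit_transform(docs)

# Terms ordered by their column index in X.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

n_docs = len(docs)
# Document frequency: number of documents in which each term occurs.
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()

# Smoothed idf as documented by scikit-learn.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for term, manual, fitted in zip(terms, manual_idf, vec.idf_):
    print(f"{term:>7}: manual={manual:.4f} sklearn={fitted:.4f}")

# With norm='l2', every row of the tf-idf matrix has unit length.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
```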

## LSA

In [None]:
#Our SVD data reducer. Features are reduced down to 250.
svd = TruncatedSVD(250, random_state=45)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
LSA = lsa.fit_transform(TfIdf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print(
    'The percentage of total variance in the dataset explained by each',
    'component from LSA.\n',
    variance_explained[:5]
)
print("Percent variance captured by all components:",total_variance*100)

# Plot pairs of LSA components, colored by author.

plt.figure(figsize=(20,10))
for i, c in enumerate([(0,1), (0,2), (1,2), (1,3)]): 
    plt.subplot(2,2,i+1)
    sns.scatterplot(x=LSA[:, c[0]], y=LSA[:, c[1]], hue=train['author'])
    plt.legend('')
    plt.xlabel('Component ' + str(c[0]+1))
    plt.ylabel('Component ' + str(c[1]+1))
plt.legend(loc = (1.05, 1))
plt.show()



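# Compute sentence similarity using the LSA components.
# Only the first 10 sentences are compared, so only those rows are passed in;
# this avoids building the full sentence-by-sentence similarity matrix.
similarity = cosine_similarity(LSA[:10])
sim_matrix = pd.DataFrame(similarity, index=train['sentence'].iloc[:10])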
#Making a plot
plt.figure(figsize=(20,5))
plt.subplot(121)
ax = sns.heatmap(sim_matrix,yticklabels=range(10), cmap='binary')
plt.subplot(122)
ax = sns.heatmap(sim_matrix,yticklabels=range(10), cmap='Paired')
plt.show()

#Generating a key for the plot.
print('Key:')
for i in range(10):
    print(i,sim_matrix.index[i])
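
Because the LSA pipeline ends with a `Normalizer`, every row of the LSA matrix has unit length, and the cosine similarity between two rows reduces to their dot product. The sketch below checks this on random stand-in vectors (not the notebook's data) and shows that slicing to the first 10 rows before computing similarities gives the 10 x 10 block directly.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Random stand-in for the LSA matrix: 100 "sentences", 8 components.
rng = np.random.RandomState(45)
vectors = rng.rand(100, 8)

# L2-normalize each row, as Normalizer does at the end of the LSA pipeline.
unit = normalize(vectors)

# For unit-length rows, cosine similarity is just the dot product.
first10 = unit[:10]
sim_dot = first10 @ first10.T

# Reference value computed from the unnormalized rows.
sim_cos = cosine_similarity(vectors[:10])

print(np.allclose(sim_dot, sim_cos))
```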

#### Tfidf & LSA Summary

In [None]:
# Transform test set 
test_tfidf = vectorizer.transform(test['sentence'])
LSA_test = lsa.transform(test_tfidf)

# model vars
x_train = LSA
x_test = LSA_test
y_train = train['author']
y_test = test['author']
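
The 250 components used for the SVD are a fixed choice. One common way to sanity-check such a choice is to look at the cumulative explained variance and find how many components are needed to reach a target share. The sketch below does this on a synthetic low-rank matrix. The data, the 95% target, and the variable names are assumptions made for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(45)

# Synthetic stand-in for a tf-idf matrix: a rank-5 signal plus small noise.
signal = rng.rand(200, 5) @ rng.rand(5, 50)
X = signal + 0.01 * rng.randn(200, 50)

svd = TruncatedSVD(n_components=20, random_state=45).fit(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)

target = 0.95  # assumed target share of variance
if cumulative[-1] < target:
    n_needed = None  # the fitted components never reach the target
else:
    # First position at which the cumulative share reaches the target.
    n_needed = int(np.searchsorted(cumulative, target) + 1)

print("Components needed to reach the target:", n_needed)
```

On real tf-idf data the curve usually rises far more slowly than on this toy matrix, so the same check may show that even a few hundred components capture only part of the variance.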

## Supervised Learning

### Logistic Regression

In [None]:
lr = LogisticRegression(random_state=45, solver='saga')
lr.fit(x_train, y_train)

print('cross-validation:', cross_val_score(lr, x_train, y_train, cv=20))
print('Training set score:', lr.score(x_train, y_train))
print('Test set score:', lr.score(x_test, y_test))
pd.crosstab(y_test, lr.predict(x_test))

### Random Forest

In [None]:

rfc = ensemble.RandomForestClassifier(random_state=45)
rfc.fit(x_train, y_train)

print('cross-validation:', cross_val_score(rfc, x_train, y_train, cv=20))
print('Training set score:', rfc.score(x_train, y_train))
print('Test set score:', rfc.score(x_test, y_test))
pd.crosstab(y_test, rfc.predict(x_test))
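
The logistic regression and random forest cells repeat the same fit, cross-validate and score pattern. A small helper can keep the reporting consistent and summarize the cross-validation folds by mean and standard deviation instead of printing all of them. This is a minimal sketch on synthetic data; `evaluate_classifier` is a hypothetical name, and `cv=5` is used only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_classifier(model, x_train, y_train, x_test, y_test, cv=5):
    """Cross-validate on the training set, then fit and score on train and test."""
    scores = cross_val_score(model, x_train, y_train, cv=cv)
    model.fit(x_train, y_train)
    return {
        "cv_mean": scores.mean(),
        "cv_std": scores.std(),
        "train": model.score(x_train, y_train),
        "test": model.score(x_test, y_test),
    }


# Synthetic data standing in for the LSA features and author labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=45)
xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.30, random_state=45)

report = evaluate_classifier(LogisticRegression(max_iter=1000), xtr, ytr, xte, yte)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A large gap between the `train` and `test` entries is a quick signal of overfitting.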

### Support Vector

In [None]:
# #support vector with CV

# svc = SVC(random_state=45)
# svc.fit(x_train, y_train)
# print('cross-validation:', cross_val_score(svc, x_train, y_train, cv=20))
# print('Training set score:', svc.score(x_train, y_train))
# print('Test set score:', svc.score(x_test, y_test))
# pd.crosstab(y_test, svc.predict(x_test))


### Gradient Boost Classifier

In [None]:
# #Gradient Boost
# gbc = ensemble.GradientBoostingClassifier(random_state=45)
# gbc.fit(x_train, y_train)
# print('cross-validation:', cross_val_score(gbc, x_train, y_train, cv=20))
# print('Training set score:', gbc.score(x_train, y_train))
# print('Test set score:', gbc.score(x_test, y_test))
# pd.crosstab(y_test, gbc.predict(x_test))


#### Supervised Learning Summary

In [None]:
n_clusters2 = 2
n_clusters5 = 5
n_clusters10 = 10

## Unsupervised Learning Methods

### K-Means

In [None]:
# Split the LSA data into four equal subsets to test for consistent clustering
lsa_half1, lsa_half2 = train_test_split(LSA, test_size=0.50, random_state=45)
lsa1, lsa2 = train_test_split(lsa_half1, test_size=0.50, random_state=45)
lsa3, lsa4 = train_test_split(lsa_half2, test_size=0.50, random_state=45)

plt.figure(figsize=(20,20))
# Calculate predicted values.
preds = {}
models = {}
clusters = (2,5,10)

for row, data in enumerate([lsa1, lsa2, lsa3, lsa4, LSA]):
    
    # Generate cluster predictions and store them for 2, 5 and 10 clusters.
    for col, nclust in enumerate(clusters):
        models[row, nclust] = KMeans(n_clusters=nclust, random_state=42).fit(data)
        preds[row, nclust] = models[row, nclust].predict(data)
        
        # Plot the four subsets only (row 4 is the full training set)
        if row != 4:
            plt.subplot(4, 3, row*3 + (col+1))
            plt.scatter(data[:, 0], data[:, 1], c=preds[row, nclust])
            plt.title('Subset ' + str(row + 1) +' with ' + str(nclust) +' clusters')
            plt.xlabel('Component 1')
            plt.ylabel('Component 2')

plt.show()
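
The subset plots give a visual impression of consistency. A numeric check is also possible: fit one K-means model per subset, label the same points with both models, and compare the two labelings with the adjusted Rand index (ARI), which ignores how the cluster ids happen to be numbered. The sketch below uses synthetic blobs rather than the LSA matrix. The data and variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the LSA matrix: three well-separated blobs.
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=45)
half_a, half_b = train_test_split(X, test_size=0.5, random_state=45)

# One model per half.
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(half_a)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(half_b)

# Label half B with both models and measure how well the labelings agree.
labels_from_a = km_a.predict(half_b)
labels_from_b = km_b.labels_
stability = adjusted_rand_score(labels_from_b, labels_from_a)

print(f"Agreement between the two models on half B (ARI): {stability:.3f}")
```

An ARI near 1 means both models carve up the points the same way, while a value near 0 means the clustering depends heavily on which subset was used for fitting.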
        

In [None]:
# # Function to evaluate the clustering
# def cluster_eval(clusters, preds, models, n):
#     for i in clusters: 
#         pred = preds[4,i]
#         model = models[4,i]
#         labels = model.labels_
#         print('Adjusted Rand index for', i, 'clusters:', 
#               round(metrics.adjusted_rand_score(train['author'], pred),5))
#         print('The silhouette coefficient for %d clusters: %.4f \n' % (i, metrics.silhouette_score(LSA, labels, metric='euclidean')))

#     return(pd.crosstab(train['author'], preds[4,n]).T)

# cluster_eval(clusters, preds, models, 10)

In [None]:
km2 = KMeans(n_clusters=n_clusters2, random_state=45)
km2.fit(x_train)

train['clusterkm2'] = km2.labels_ 
print(pd.crosstab(train['author'], train['clusterkm2']), '\n')

In [None]:
#clusters - 5
km5 = KMeans(n_clusters=n_clusters5, random_state=45)
km5.fit(x_train)

train['clusterkm5'] = km5.labels_ 
print(pd.crosstab(train['author'], train['clusterkm5']), '\n')

In [None]:
km10 = KMeans(n_clusters=n_clusters10, random_state=45)
km10.fit(x_train)

train['clusterkm10'] = km10.labels_ 
print(pd.crosstab(train['author'], train['clusterkm10']), '\n')
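
The crosstabs above can be complemented with two summary numbers per cluster count: the adjusted Rand index against the known author labels, and the silhouette coefficient, which needs no labels. The sketch below computes both on synthetic blobs with known labels. The data and the cluster counts tried are assumptions for illustration, not results from the novels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with three known groups, standing in for authors.
X, true_labels = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                            cluster_std=1.0, random_state=45)

results = {}
for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=45).fit_predict(X)
    results[k] = (
        adjusted_rand_score(true_labels, labels),        # agreement with known labels
        silhouette_score(X, labels, metric="euclidean"),  # cohesion vs. separation
    )

for k, (ari, sil) in results.items():
    print(f"{k} clusters: ARI={ari:.3f}, silhouette={sil:.3f}")
```

On the notebook's data the same loop could be run with `x_train` and `train['author']` in place of the toy arrays, though computing the silhouette on tens of thousands of sentences can be slow.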

### Agglomerative clustering

In [None]:
plt.figure(figsize=(20,20))
# Calculate predicted values.
preds_2 = {}
models_2 = {}
clusters = (2,5,10)
for row, data in enumerate([lsa1, lsa2, lsa3, lsa4, LSA]):
    
    # Generate cluster predictions and store them for 2, 5 and 10 clusters.
    for col, nclust in enumerate(clusters):
        # Fit once and reuse the fitted labels instead of fitting a second time
        models_2[row, nclust] = AgglomerativeClustering(n_clusters=nclust).fit(data)
        preds_2[row, nclust] = models_2[row, nclust].labels_
        
        # Plot the four subsets only (row 4 is the full training set)
        if row != 4:
            plt.subplot(4, 3, row*3 + (col+1))
            plt.scatter(data[:, 0], data[:, 1], c=preds_2[row, nclust])
            plt.title('Subset ' + str(row + 1) +' with ' + str(nclust) +' clusters')
            plt.xlabel('Component 1')
            plt.ylabel('Component 2')
        
plt.show()

In [None]:
ag2 = AgglomerativeClustering(n_clusters=n_clusters2)
ag2.fit(x_train)

train['clusterag2'] = ag2.labels_ 
print(pd.crosstab(train['author'], train['clusterag2']), '\n')

In [None]:
ag5 = AgglomerativeClustering(n_clusters=n_clusters5)
ag5.fit(x_train)

train['clusterag5'] = ag5.labels_ 
print(pd.crosstab(train['author'], train['clusterag5']), '\n')

In [None]:
ag10 = AgglomerativeClustering(n_clusters=n_clusters10)
ag10.fit(x_train)

train['clusterag10'] = ag10.labels_ 
print(pd.crosstab(train['author'], train['clusterag10']), '\n')
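
Unlike `KMeans`, scikit-learn's `AgglomerativeClustering` has no `predict` method, which matters for the holdout step: new sentences cannot be assigned to the fitted clusters directly. One possible workaround, shown here as a sketch and not as the notebook's method, is to fit a `NearestCentroid` classifier on the cluster labels and use it to assign unseen points. The toy data and names below are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid

# Toy data: two well-separated blobs, split into "fit" and "new" points.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=45)
X_fit, X_new = X[:150], X[150:]

# Cluster the first part of the data.
agg = AgglomerativeClustering(n_clusters=2).fit(X_fit)

# Swapped-in technique: treat the cluster ids as class labels and assign
# new points to the cluster with the nearest centroid.
assigner = NearestCentroid().fit(X_fit, agg.labels_)
new_labels = assigner.predict(X_new)

print("Has predict:", hasattr(agg, "predict"))
print("Assigned cluster counts:", {int(c): int((new_labels == c).sum()) for c in set(new_labels)})
```

The nearest-centroid rule matches the clustering well only when clusters are compact and roughly convex, so on real text data the assignments should be treated as an approximation.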

### Spectral Clustering

In [None]:
# plt.figure(figsize=(20,20))
# # Calculate predicted values.
# preds = {}
# models = {}
# clusters = (2,5,10)
# for row, data in enumerate([lsa1, lsa2, lsa3, lsa4, LSA]):
    
#     # Generate cluster predictions and store them for clusters 2 to 4.
#     for col, nclust in  enumerate(clusters):
#         models_2[row, nclust] = SpectralClustering(n_clusters=nclust).fit(data)
#         preds_2[row, nclust] = SpectralClustering(n_clusters=nclust).fit_predict(data)
        
#         if row != 4:
#             plt.subplot(4, 4, row*4 + (col+1))
#             plt.scatter(data[:, 0], data[:, 1], c=preds_2[row, nclust])
#             plt.title('Subset ' + str(row + 1) +' with ' + str(nclust) +' clusters')
#             plt.xlabel('Component 1')
#             plt.ylabel('Component 2')
        
# plt.show()

In [None]:
# sc2 = SpectralClustering(n_clusters=n_clusters2, affinity='rbf')
# sc2.fit(x_train)

# train['clustersc2'] = sc2.labels_

# print(pd.crosstab(train['author'], train['clustersc2']), '\n')

In [None]:
# #with 5 clusters
# sc5 = SpectralClustering(n_clusters=n_clusters5, affinity='rbf')
# sc5.fit(x_train)

# train['clustersc5'] = sc5.labels_

# print(pd.crosstab(train['author'], train['clustersc5']), '\n')

In [None]:
# #with 10 clusters

# sc10 = SpectralClustering(n_clusters=n_clusters10, affinity='rbf')
# sc10.fit(x_train)

# train['clustersc10'] = sc10.labels_

# print(pd.crosstab(train['author'], train['clustersc10']), '\n')

### Affinity Propagation

In [None]:
# af = AffinityPropagation()
# af.fit(x_train)

# train['clusteraf'] = af.labels_

# cluster_df = pd.crosstab(train['author'], train['clusteraf'])
# cluster_df