## Unsupervised Learning Natural Language Processing Capstone 
In this unsupervised learning capstone, I used 10 novels by 5 authors, drawn from the NLTK Gutenberg corpus and from [Project Gutenberg](https://www.gutenberg.org/). The Project Gutenberg texts were added to the local corpus manually.


Steps and techniques:
- Pick a set of texts. I used 10 texts by 5 authors from Project Gutenberg.
- Perform standard data cleaning on the text using tools such as spaCy and stop-word lists.
- Split the data into two groups: a training group (75%) and a holdout group (25%).
- Perform various clustering methods, decide which technique best represents the data, and explain your reasoning.
- Perform some unsupervised feature generation and selection using techniques such as Latent Semantic Analysis (LSA), a tf-idf term-document matrix, word2vec, Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF).
- Perform the clustering techniques on the holdout group and document the performance for changes, stability, and consistency in comparison to the original model.
- Summarize all findings, including visuals, in a separate but linked document.
- Link to write-up: https://docs.google.com/document/d/1M7Ps1RfgudP8AfGlO6JXvav7QfrYjCdlIgjJgUDxfwY/edit?usp=sharing
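
The 75/25 split described in the steps above can be sketched with the standard library alone. The notebook imports scikit-learn's `train_test_split`, which is the more usual tool for this; the helper name `split_holdout`, the seed, and the toy rows below are illustrative assumptions and do not come from the notebook.

```python
import random

def split_holdout(rows, holdout_frac=0.25, seed=42):
    """Shuffle rows reproducibly and split them into (train, holdout)."""
    shuffled = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy stand-in for the sentence-level data (hypothetical values).
rows = [f"sentence {i}" for i in range(100)]
train, holdout = split_holdout(rows)
print(len(train), len(holdout))
```

Fixing the seed matters here because the holdout comparison is only meaningful if the same rows stay in the holdout group between runs.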

##### Imported Modules Cell

In [1]:
import numpy as np
import pandas as pd
import scipy
import spacy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re, os, sys
import requests
import pickle
import string
import en_core_web_sm
import urllib.request

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")

#sklearn modules
import sklearn
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer, normalize
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.cluster import MeanShift, estimate_bandwidth, KMeans
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, AffinityPropagation 
from sklearn.datasets import make_blobs  # the samples_generator submodule was removed in scikit-learn 0.24
from sklearn import metrics
from sklearn.metrics import silhouette_score
import itertools
from itertools import cycle
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity


#nltk modules
import nltk
from nltk.corpus import gutenberg
from nltk.stem import WordNetLemmatizer

#### Novels
- The Ivory Child by Henry Rider Haggard
- Eric Brighteyes by Henry Rider Haggard
- The Sea-Hawk by Rafael Sabatini
- Scaramouche: A Romance Of The French Revolution by Rafael Sabatini
- Moby Dick by Herman Melville
- A Romance Of The South Seas by Herman Melville
- Tarzan The Terrible by Edgar Rice Burroughs
- Pellucidar by Edgar Rice Burroughs
- Adventures Of Huckleberry Finn by Mark Twain
- The Adventures Of Tom Sawyer by Mark Twain




In [2]:
# #checking sizes (explained below)
# print('ivory:', len(gutenberg.raw('haggard-ivory.txt')))
# print('bright:', len(gutenberg.raw('haggard-brighteyes.txt') ))
# print('seahawk:', len(gutenberg.raw('sabatini-seahawk.txt') ))
# print('scar', len(gutenberg.raw('sabatini-scaramouche.txt') ))
# print('moby', len(gutenberg.raw('melville-moby_dick.txt') ))
# print('southsea', len(gutenberg.raw('melville-southsea.txt') ))
# print('tarzan', len(gutenberg.raw('burroughs-tarzan.txt')))
# print('pell', len(gutenberg.raw('burroughs-pellucidar.txt') ))
# print('huck', len(gutenberg.raw('twain-huckleberry.txt') ))
# print('sawyer', len(gutenberg.raw('twain-sawyer.txt')))

In [3]:
# Load each novel, force it to ASCII (non-ASCII characters become '?'), and truncate it.
# The texts are cut to the first 310,000 characters to prevent crashes:
# at full size, certain algorithms either raised a memory error or crashed the computer.


#The Ivory Child By Haggard
ivory=gutenberg.raw('haggard-ivory.txt').encode('ascii', 'replace').decode('ascii', 'replace')
ivory = ivory[:310000]

#Eric Brighteyes by Haggard
bright = gutenberg.raw('haggard-brighteyes.txt').encode('ascii', 'replace').decode('ascii', 'replace')
bright = bright[:310000]

#The Sea-Hawk by Sabatini
seahawk = gutenberg.raw('sabatini-seahawk.txt').encode('ascii', 'replace').decode('ascii', 'replace')
seahawk = seahawk[:310000]

#Scaramouche: A Romance Of The French Revolution by Sabatini
scar = gutenberg.raw('sabatini-scaramouche.txt').encode('ascii', 'replace').decode('ascii', 'replace')
scar = scar[:310000]

#Moby Dick by Melville
moby = gutenberg.raw('melville-moby_dick.txt').encode('ascii', 'replace').decode('ascii', 'replace') 
moby = moby[:310000]

#A Romance Of The South Seas by Melville
southsea = gutenberg.raw('melville-southsea.txt').encode('ascii', 'replace').decode('ascii', 'replace') 
southsea = southsea[:310000]

#Tarzan The Terrible by Burroughs
tarzan = gutenberg.raw('burroughs-tarzan.txt').encode('ascii', 'replace').decode('ascii', 'replace') 
tarzan = tarzan[:310000]


#Pellucidar by Burroughs
pell = gutenberg.raw('burroughs-pellucidar.txt').encode('ascii', 'replace').decode('ascii', 'replace') 
pell = pell[:310000]

#Adventures Of Huckleberry Finn by Twain
huck = gutenberg.raw('twain-huckleberry.txt').encode('ascii', 'replace').decode('ascii', 'replace') 
huck = huck[:310000]

#The Adventures Of Tom Sawyer by Twain
saw = gutenberg.raw('twain-sawyer.txt').encode('ascii', 'replace').decode('ascii', 'replace')
saw = saw[:310000]
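
The `???` prefix at the start of each novel in the DataFrame is most likely a side effect of the ASCII round trip used when loading: `encode('ascii', 'replace')` substitutes `?` for every character that has no ASCII equivalent. A minimal sketch with a made-up sample string; the helper name `to_ascii_prefix` is hypothetical and is not defined in the notebook.

```python
def to_ascii_prefix(text, max_chars=310000):
    """Replace non-ASCII characters with '?' and keep the first max_chars characters."""
    ascii_text = text.encode("ascii", "replace").decode("ascii")
    return ascii_text[:max_chars]

# A UTF-8 byte-order mark mis-decoded as three Latin-1 characters, plus an accented letter.
sample = "\u00ef\u00bb\u00bfTHE IVORY CHILD caf\u00e9"

print(to_ascii_prefix(sample))      # ???THE IVORY CHILD caf?
print(to_ascii_prefix(sample, 6))   # ???THE
```

Since every non-ASCII character collapses to the same `?`, later cleaning steps cannot tell a replaced character from a genuine question mark.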

In [4]:
#Load the data/novels/text

data = {'book' :["The Ivory Child", "Eric Brighteyes",
                 "The Sea-Hawk", "Scaramouche: A Romance Of The French Revolution",
                 "Moby Dick", "A Romance Of The South Seas",
                 "Tarzan The Terrible", "Pellucidar",
                 "Adventures Of Huckleberry Finn", "The Adventures Of Tom Sawyer"],
        'author' :['Henry Rider Haggard', 'Henry Rider Haggard', 
                   'Rafael Sabatini', 'Rafael Sabatini', 
                   'Herman Melville', 'Herman Melville', 
                   'Edgar Rice Burroughs', 'Edgar Rice Burroughs',
                   'Mark Twain', 'Mark Twain'],
       'novel':[ivory, bright, seahawk, scar, moby, southsea, tarzan, pell, huck, saw],
       'genre' :['Adventure', 'Adventure',
                 'Adventure', 'Adventure',
                 'Adventure', 'Adventure',
                 'Adventure', 'Adventure', 
                 'Adventure', 'Adventure']}

In [5]:
#place the data in a dataframe
books = pd.DataFrame(data, columns= ['book','author','novel','genre'])
books.head(10)

Unnamed: 0,book,author,novel,genre
0,The Ivory Child,Henry Rider Haggard,???THE IVORY CHILD\r\n\r\nby H. Rider Haggard\...,Adventure
1,Eric Brighteyes,Henry Rider Haggard,???\r\nERIC BRIGHTEYES\r\n\r\nby H. Rider Hagg...,Adventure
2,The Sea-Hawk,Rafael Sabatini,???THE SEA-HAWK\r\n\r\n\r\nBy Rafael Sabatini\...,Adventure
3,Scaramouche: A Romance Of The French Revolution,Rafael Sabatini,???\r\nSCARAMOUCHE\r\n\r\nA ROMANCE OF THE FRE...,Adventure
4,Moby Dick,Herman Melville,???[Moby Dick by Herman Melville 1851]\r\n\r\n...,Adventure
5,A Romance Of The South Seas,Herman Melville,???A ROMANCE OF THE SOUTH SEAS\r\n\r\n\r\nBy H...,Adventure
6,Tarzan The Terrible,Edgar Rice Burroughs,???Tarzan the Terrible\r\n\r\n\r\nBy\r\n\r\nEd...,Adventure
7,Pellucidar,Edgar Rice Burroughs,"???The Project Gutenberg EBook of Pellucidar, ...",Adventure
8,Adventures Of Huckleberry Finn,Mark Twain,???ADVENTURES\r\n\r\nOF\r\n\r\nHUCKLEBERRY FIN...,Adventure
9,The Adventures Of Tom Sawyer,Mark Twain,???THE ADVENTURES OF TOM SAWYER\r\n\r\nBy Mark...,Adventure


## Data Cleaning

In [7]:
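# First cleaning pass: runs on each full novel, before sentence splitting.
def text_cleaner(text):
    # Replace Project Gutenberg boilerplate words with a placeholder.
    # Note: these patterns are case-sensitive, so "Project Gutenberg" is left as-is.
    text = re.sub("project gutenberg", "0", text)
    text = re.sub("gutenberg", "0", text)
    text = re.sub("project", "0", text)

    # Replace double hyphens and underscores with spaces; drop bracketed text.
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'_', ' ', text)
    text = re.sub(r"\[.*\]", "", text)

    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+', '', text)
    text = re.sub(r'CHAPTER \d+', '', text)
    text = re.sub('CHAPTER', '', text)

    # Expand the author's initial, then swap the period in Mr. Mrs. Ms. St. for "0"
    # so that the later split on "." does not cut sentences at these abbreviations.
    text = re.sub(r'H\. Rider Haggard', 'Henry Rider Haggard', text)
    text = re.sub(r'Mrs\. ', 'Mrs0 ', text)
    text = re.sub(r'Mr\. ', 'Mr0 ', text)
    text = re.sub(r'St\. ', 'St0 ', text)
    text = re.sub(r'Ms\. ', 'Ms0 ', text)

    # Remove any text enclosed between two blank lines (pattern assumes "\n" line endings).
    text = re.sub(r"\n\n.*?\n\n", '', text)

    # Strip the stray "ï»¿" characters (a mis-decoded byte-order mark). After the
    # ASCII step at load time these are already "?", so this line cannot match.
    text = re.sub('[ï»¿]', '', text)

    # Collapse all runs of whitespace into single spaces.
    text = ' '.join(text.split())
    return text

# Bind the function object itself, since text_cleaner is redefined in later cells.
round0 = text_cleaner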
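
The first three substitutions in `text_cleaner` use lowercase patterns, and `re.sub` is case-sensitive by default. That appears to be why the capitalized "Project Gutenberg" header of *Pellucidar* is still present after this pass. Below is a minimal sketch of a case-insensitive variant, assuming the goal is to replace the phrase in any capitalization; `strip_gutenberg` and `GUTENBERG_RE` are hypothetical names, not part of the notebook.

```python
import re

# Longest alternative first so "project gutenberg" collapses to a single "0".
GUTENBERG_RE = re.compile(r"project gutenberg|gutenberg|project", flags=re.IGNORECASE)

def strip_gutenberg(text):
    """Replace Project Gutenberg boilerplate words with the placeholder "0"."""
    return GUTENBERG_RE.sub("0", text)

header = "The Project Gutenberg EBook of Pellucidar"

unchanged = re.sub("project gutenberg", "0", header)   # lowercase-only pattern: no match
cleaned = strip_gutenberg(header)

print(unchanged)   # The Project Gutenberg EBook of Pellucidar
print(cleaned)     # The 0 EBook of Pellucidar
```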

In [8]:
# Let's take a look at the updated text
books['novel'] = books.novel.apply(round0)

books.head(10)

Unnamed: 0,book,author,novel,genre
0,The Ivory Child,Henry Rider Haggard,???THE IVORY CHILD by Henry Rider Haggard I AL...,Adventure
1,Eric Brighteyes,Henry Rider Haggard,??? ERIC BRIGHTEYES by Henry Rider Haggard DED...,Adventure
2,The Sea-Hawk,Rafael Sabatini,???THE SEA-HAWK By Rafael Sabatini NOTE Lord H...,Adventure
3,Scaramouche: A Romance Of The French Revolution,Rafael Sabatini,??? SCARAMOUCHE A ROMANCE OF THE FRENCH REVOLU...,Adventure
4,Moby Dick,Herman Melville,??? ETYMOLOGY. (Supplied by a Late Consumptive...,Adventure
5,A Romance Of The South Seas,Herman Melville,???A ROMANCE OF THE SOUTH SEAS By Herman Melvi...,Adventure
6,Tarzan The Terrible,Edgar Rice Burroughs,???Tarzan the Terrible By Edgar Rice Burroughs...,Adventure
7,Pellucidar,Edgar Rice Burroughs,"???The Project Gutenberg EBook of Pellucidar, ...",Adventure
8,Adventures Of Huckleberry Finn,Mark Twain,???ADVENTURES OF HUCKLEBERRY FINN (Tom Sawyer'...,Adventure
9,The Adventures Of Tom Sawyer,Mark Twain,???THE ADVENTURES OF TOM SAWYER By Mark Twain ...,Adventure


In [9]:
# Turn each novel into sentences by splitting on periods.
sentences = []
for row in books.itertuples(index=False):
    for sentence in row.novel.split('.'):
        if sentence != '':
            sentences.append((row.book, row.author, sentence, row.genre))
books = pd.DataFrame(sentences, columns=['book', 'author', 'sentence', 'genre'])
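
Because the cell above splits on every period, an abbreviation such as "Mr." would otherwise end a sentence early. That is the reason the first cleaning pass rewrites these abbreviations with a `0` placeholder. The sketch below runs the protect, split, restore sequence on a made-up string. The helper name `split_sentences` is an assumption for illustration, and unlike the notebook it also strips whitespace and drops whitespace-only fragments.

```python
import re

ABBREVIATIONS = ("Mrs", "Mr", "St", "Ms")

def split_sentences(text):
    """Split on '.', shielding a few abbreviations with a '0' placeholder."""
    for abbr in ABBREVIATIONS:
        text = re.sub(rf"\b{abbr}\. ", f"{abbr}0 ", text)        # protect
    parts = [p.strip() for p in text.split(".") if p.strip()]    # split
    restored = []
    for part in parts:
        for abbr in ABBREVIATIONS:
            part = re.sub(rf"\b{abbr}0 ", f"{abbr} ", part)      # restore
        restored.append(part)
    return restored

result = split_sentences("Mr. Hans met Mrs. Smith. They left St. Louis at dawn.")
print(result)   # ['Mr Hans met Mrs Smith', 'They left St Louis at dawn']
```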

In [10]:
books.head()

Unnamed: 0,book,author,sentence,genre
0,The Ivory Child,Henry Rider Haggard,???THE IVORY CHILD by Henry Rider Haggard I AL...,Adventure
1,The Ivory Child,Henry Rider Haggard,Amongst many other things it tells of the war...,Adventure
2,The Ivory Child,Henry Rider Haggard,Often since then I have wondered if this crea...,Adventure
3,The Ivory Child,Henry Rider Haggard,"It seems improbable, even impossible, but the...",Adventure
4,The Ivory Child,Henry Rider Haggard,Also he can form his opinion as to the religi...,Adventure


In [11]:
# Second cleaning pass: runs on each sentence.
def text_cleaner(text):
    # Restore the abbreviations that were protected with a "0" before sentence splitting.
    text = re.sub('Mrs0 ', 'Mrs ', text)
    text = re.sub('Mr0 ', 'Mr ', text)
    text = re.sub('St0 ', 'St ', text)
    text = re.sub('Ms0 ', 'Ms ', text)

    # Get rid of slash-delimited fragments, short bracketed markers, and stray quote marks.
    text = re.sub(r"/.*? ", " ", text)
    text = re.sub(r"\[.,*?\]", "", text)  # a bracket pair around a single character, e.g. "[1]"
    text = re.sub(r"\./\.", "", text)
    text = re.sub("``", "", text)
    text = re.sub("''", "", text)
    text = re.sub(r"./", " ", text)

    # Drop possessive 's.
    text = re.sub("'s", " ", text)

    # Collapse all runs of whitespace into single spaces.
    text = ' '.join(text.split())
    return text

# Bind the function object itself, since text_cleaner is redefined in the next cleaning cell.
round1 = text_cleaner
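
How much text a bracket-removal pattern deletes depends on whether its quantifier is greedy. The `.*` form used in the first cleaning pass spans from the first `[` to the last `]` on a line, whereas the lazy `.*?` form stops at the nearest `]`. A short illustration on an invented sentence:

```python
import re

line = "He sailed [see note 1] for three days [see note 2] and returned."

# Greedy: one match from the first "[" to the last "]", swallowing the text in between.
greedy = " ".join(re.sub(r"\[.*\]", "", line).split())

# Lazy: each bracketed note is matched and removed separately.
lazy = " ".join(re.sub(r"\[.*?\]", "", line).split())

print(greedy)   # He sailed and returned.
print(lazy)     # He sailed for three days and returned.
```

With the greedy form, a line containing two bracketed notes also loses the prose between them.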

In [12]:
# Let's take a look at the updated text

books['sentence'] = books.sentence.apply(round1)
books.head(5)

Unnamed: 0,book,author,sentence,genre
0,The Ivory Child,Henry Rider Haggard,???THE IVORY CHILD by Henry Rider Haggard I AL...,Adventure
1,The Ivory Child,Henry Rider Haggard,Amongst many other things it tells of the war ...,Adventure
2,The Ivory Child,Henry Rider Haggard,Often since then I have wondered if this creat...,Adventure
3,The Ivory Child,Henry Rider Haggard,"It seems improbable, even impossible, but the ...",Adventure
4,The Ivory Child,Henry Rider Haggard,Also he can form his opinion as to the religio...,Adventure


In [13]:
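# Third cleaning pass: runs on each sentence.
def text_cleaner(text):
    # Get rid of any XML/HTML-style markup.
    text = re.sub(r'<.*?>', '', text)

    # Get rid of the "ENDOFARTICLE." marker, if present.
    text = re.sub(r'ENDOFARTICLE\.', '', text)

    # Remove every "?", including those substituted for non-ASCII characters
    # by the ASCII step at load time, plus any stray "â".
    text = re.sub(r'\?', '', text)
    text = re.sub('â', '', text)

    # Collapse all runs of whitespace into single spaces.
    text = ' '.join(text.split())
    return text

round2 = text_cleaner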
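
Because `text_cleaner` is redefined in each cleaning cell, the way the `round0`, `round1`, and `round2` aliases are bound matters in a notebook, where cells can be re-run out of order. A lambda wrapper looks up the name `text_cleaner` each time it is called, while a direct assignment captures the function object that exists at that moment. The sketch below shows the difference with two stand-in functions; their bodies are placeholders, not the notebook's cleaners.

```python
def text_cleaner(text):
    return text.lower()                  # stand-in for the first cleaning pass

late_bound = lambda x: text_cleaner(x)   # resolves the name on every call
early_bound = text_cleaner               # captures the current function object

def text_cleaner(text):
    return text.upper()                  # stand-in for a later redefinition

print(late_bound("Moby Dick"))    # MOBY DICK  (uses the redefined function)
print(early_bound("Moby Dick"))   # moby dick  (still the original function)
```

Giving each pass its own function name would avoid the question entirely.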

In [14]:
# Let's take a look at the updated text
books['sentence'] = books.sentence.apply(round2)
books.head(10)

Unnamed: 0,book,author,sentence,genre
0,The Ivory Child,Henry Rider Haggard,THE IVORY CHILD by Henry Rider Haggard I ALLAN...,Adventure
1,The Ivory Child,Henry Rider Haggard,Amongst many other things it tells of the war ...,Adventure
2,The Ivory Child,Henry Rider Haggard,Often since then I have wondered if this creat...,Adventure
3,The Ivory Child,Henry Rider Haggard,"It seems improbable, even impossible, but the ...",Adventure
4,The Ivory Child,Henry Rider Haggard,Also he can form his opinion as to the religio...,Adventure
5,The Ivory Child,Henry Rider Haggard,Of this magic I will make only one remark: If ...,Adventure
6,The Ivory Child,Henry Rider Haggard,"To take a single instance, Hart and Mart were ...",Adventure
7,The Ivory Child,Henry Rider Haggard,Yet in the end it was Hans who killed him,Adventure
8,The Ivory Child,Henry Rider Haggard,Jana nearly killed me! Now to my tale,Adventure
9,The Ivory Child,Henry Rider Haggard,"In another history, called The Holy Flower, I ...",Adventure


In [15]:
# Make every sentence lowercase
books['sentence']= books['sentence'].str.lower()
books.tail(10)

Unnamed: 0,book,author,sentence,genre
25339,The Adventures Of Tom Sawyer,Mark Twain,then it occurred to him that the great adventu...,Adventure
25340,The Adventures Of Tom Sawyer,Mark Twain,he had never seen as much as fifty dollars in ...,Adventure
25341,The Adventures Of Tom Sawyer,Mark Twain,he never had supposed for a moment that so lar...,Adventure
25342,The Adventures Of Tom Sawyer,Mark Twain,if his notions of hidden treasure had been ana...,Adventure
25343,The Adventures Of Tom Sawyer,Mark Twain,but the incidents of his adventure grew sensib...,Adventure
25344,The Adventures Of Tom Sawyer,Mark Twain,this uncertainty must be swept away,Adventure
25345,The Adventures Of Tom Sawyer,Mark Twain,he would snatch a hurried breakfast and go and...,Adventure
25346,The Adventures Of Tom Sawyer,Mark Twain,"huck was sitting on the gunwale of a flatboat,...",Adventure
25347,The Adventures Of Tom Sawyer,Mark Twain,tom concluded to let huck lead up to the subject,Adventure
25348,The Adventures Of Tom Sawyer,Mark Twain,"if he did not do it, then the adventure would ...",Adventure


In [16]:
# Save the cleaned, sentence-level DataFrame to disk
books.to_pickle("books.pkl")