# CONTENT
1. [Webscraping Data](#webscraping)
1. [Functions for Text PreProcessing, Stemming \& Model Evaluation](#functions)
1. [Data Wrangling](#datawrangling)
1. [NLP PipeLine](#pipeline)
    1. [Basic NLP Count-Based Features](#CountBasedFeatures)
    1. [Sentiment Analysis](#sentimentanalysis)
    1. [Term Frequency-Inverse Document Frequency](#tfidf) 
    1. [Logistic Regression](#logreg)
    1. [Random Forest Classifier](#rfc)
        1. [Hyperparameter Tuning](#gs)
1. [Conclusions](#conclusions)

In [26]:
# import required libraries
import pandas as pd # import dataset, create and manipulate dataframes
import numpy as np # vectorize functions and perform calculations
import contractions # expand contractions
import re # regular expressions
import string # count-based features
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization

from nltk.corpus import stopwords
from pprint import pprint # pretty print
from nltk.tokenize import word_tokenize # tokenize string or sentences
from nltk.stem import PorterStemmer # stemming\
from sklearn.linear_model import LogisticRegression # model
from sklearn.ensemble import RandomForestClassifier # model
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # count-based language models
from sklearn.metrics import confusion_matrix, classification_report, make_scorer # model evaluation metrics
from sklearn.metrics import accuracy_score, f1_score # model evaluation metrics
from sklearn.model_selection import GridSearchCV, cross_validate # split & evaluate dataset, hyperparameter optimization
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict # cross-validation
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD
from collections import Counter # count-based calculations
from textblob import TextBlob # sentiment analysis
from wordcloud import WordCloud # visualization
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS # removing stopwords

pd.options.mode.chained_assignment = None  # hide warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

We are gonna work with an e-book from __[Project Gutenberg](https://www.gutenberg.org/)__ which we will import directly from the __nltk gutenberg corpus__. 

We can also import any book manually by [going on the website, selecting the desired book, and download the Plain Text UTF-8 version locally](https://www.gutenberg.org/ebooks/2701).

In [27]:
from nltk.corpus import gutenberg

# check book titles
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


After selecting the book title we can import its content in various forms, but we will import it as a __single string__ for this project.

In [28]:
# select the desired title
moby_dick = gutenberg.raw('melville-moby_dick.txt')

# check book's length
print(f"Book's length: {len(moby_dick)} words.\n")

# check the first 20 words
print(f"First 500 characters the book:\n\n{moby_dick[:500]}.")

Book's length: 1242990 words.

First 500 characters the book:

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.

"While you take in hand to school others, and to teac.


<a name="textpreprocessing"></a>
# 2. Text Preprocessing

We want to normalize our tokens before passing them to our ML model. Some of the common text preprocessign tasks include:

1. Removing __special characters__, i.e. anything other than letters or numbers.
1. Remove trailing __whitespace__.
1. __Expand contracted words__, e.g. "It's" &rarr; "It is" (so the latter, i.e. "is", can be removed as a stopword later).
1. __Lower-case__ text.
1. __Tokenize__ text.
1. Remove __stopwords__ such as "the", "a", "an", etc.
1. Perform basic __stemming__.
1. Join tokens back into a __single string__, i.e. like it was first inputted, but "cleaned".

__Note__: A great [article](https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a) about the differences of removing stopwords using different libraries (__NLTK__, __spaCy__, __gensim__, __scikit-learn__).

In [68]:
def normalize_document(doc):
    """Normalize the document by performing basic text pre-processing tasks."""
    # remove special characters
    #doc = re.sub(r'[^a-zA-Z0-9\s]', ' ', doc)
    doc = re.sub('[^A-Za-z0-9]+', '', doc)
    # remove trailing whitespace
    nowhite = doc.strip()
    # expand contractions
    expanded = contractions.fix(nowhite)
    # lower case string
    lower_str = str(expanded).lower()
    # tokenize document
    tokens = word_tokenize(lower_str)
    # remove stopwords
    filtered_tokens = [token for token in tokens if token not in STOPWORDS]
    filtered_tokens = [token for token in tokens if len(token) > 1]
    
    # instantiate stemmer
    ps = PorterStemmer()
    # simple porter stemming
    stemmed_tokens = [ps.stem(token) for token in filtered_tokens]
    
    # re-create document from tokens
    doc = ' '.join(stemmed_tokens)

    return doc

# vectorize function for faster computations
normalize_corpus = np.vectorize(normalize_document)

In [None]:
# invoke function on our text
normalized_text = normalize_corpus(moby_dick)

In [None]:
normalized_text.shape

In [None]:
normalized_text[:50]

# Vectorization

In [150]:
# vectorize tokens using BoW
vectorizer = CountVectorizer(min_df=5, max_df=0.9)
data_vectorized = vectorizer.fit_transform(normalized_text)
data_vectorized_array = data_vectorized.toarray()

In [141]:
cv = pd.DataFrame(data_vectorized, columns=vectorizer.get_feature_namesure_names())
cv.head()

Unnamed: 0,abandon,abat,abid,abl,aboard,abomin,aborigin,abound,abroad,absenc,...,yojo,yoke,yon,yonder,york,young,youth,zealand,zodiac,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# vectorize tokens using tfidf
tfidf_vec = TfidfVectorizer()
data_tfidf = tfidf_vec.fit_transform(normlized_text)
data_tfidf_array = data_tfidf.toarray()

In [141]:
tfidf = pd.DataFrame(data_tfidf, columns=tfidf_vec.get_feature_namesure_names())
tfidf.head()

Unnamed: 0,abandon,abat,abid,abl,aboard,abomin,aborigin,abound,abroad,absenc,...,yojo,yoke,yon,yonder,york,young,youth,zealand,zodiac,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [151]:
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components= 15, max_iter=10, learning_method='online') # no of components = no. of topics
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

(212030, 15)


In [None]:
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components= 5)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

In [142]:
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=5)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)

(212030, 5)
(212030, 5)
(212030, 5)


In [143]:
# inspect the inferred topics
def print_topics(model, vectorizer, top_n=5):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])


In [144]:
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)
 

LDA Model:
Topic 0:
[('like', 590.3545435342066), ('hand', 353.5343440444929), ('thou', 345.63628996274093), ('aye', 175.79984736740377), ('know', 173.71079714412272)]
Topic 1:
[('ship', 559.0912200564542), ('look', 321.2497252777063), ('white', 281.4138416598732), ('day', 260.8267959509847), ('water', 242.37121801385726)]
Topic 2:
[('ahab', 647.5000450553521), ('man', 521.4814261270236), ('ye', 474.47215489582726), ('time', 435.44133664263074), ('long', 337.20139480270416)]
Topic 3:
[('boat', 511.4311357150518), ('head', 354.9244949964444), ('captain', 292.4165170203486), ('men', 251.6417716370002), ('said', 250.96484558171414)]
Topic 4:
[('whale', 1413.1558989588868), ('sea', 475.79810800262345), ('old', 464.3366926699091), ('thing', 315.8009202371099), ('stubb', 293.2111487875179)]


In [145]:
print("NMF Model:")
print_topics(nmf_model, vectorizer)
print("=" * 20)


NMF Model:
Topic 0:
[('whale', 6.179301431367646), ('ye', 1.9375842801821416e-76), ('sea', 2.350514241269914e-79), ('old', 7.80454270217664e-83), ('time', 4.439649090236502e-83)]
Topic 1:
[('like', 4.926277328702363), ('ye', 6.171018218160643e-16), ('time', 1.322949884766949e-22), ('head', 1.5830645030304464e-39), ('look', 6.087739353319033e-41)]
Topic 2:
[('ship', 4.85364572358902), ('ye', 1.607543972617152e-11), ('look', 1.1318689156763126e-36), ('thing', 9.618113449810123e-40), ('great', 3.038069975380811e-43)]
Topic 3:
[('man', 4.724655149104589), ('ye', 0.0011256263882984067), ('captain', 2.584049049996265e-27), ('head', 9.759766168189473e-29), ('look', 1.499360856795322e-29)]
Topic 4:
[('ahab', 4.708434589266911), ('sea', 2.686966711567919e-06), ('old', 9.603795479844764e-10), ('time', 1.8443320806195892e-10), ('boat', 4.2128648524378275e-12)]


In [146]:
print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)

LSI Model:
Topic 0:
[('whale', 0.9999999949542627), ('hand', 6.307831973168089e-05), ('captain', 1.8181608896668615e-05), ('look', 1.4649980368183027e-05), ('thing', 1.4252480510105377e-05)]
Topic 1:
[('like', 0.9998851854072981), ('great', 0.007158110511763926), ('way', 0.006394079665458567), ('hand', 0.0038120112721361124), ('thou', 0.003409631886065413)]
Topic 2:
[('ship', 0.9995852787996308), ('man', 0.010122441459243138), ('queequeg', 0.006694094612006227), ('old', 0.006276999949258497), ('stubb', 0.005429219499055022)]
Topic 3:
[('man', 0.9968854393121116), ('hand', 0.044696296440504275), ('look', 0.02457891447804474), ('sea', 0.01711575731214732), ('thing', 0.014401969914370656)]
Topic 4:
[('ahab', 0.9990944033216037), ('said', 0.018802011170033876), ('white', 0.013083593784204131), ('ye', 0.012171709359727839), ('come', 0.010889876521822484)]


In [60]:
#!pip install pyLDAvis

In [152]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer)
panel