## Outline of steps
* [step0](#step0): import necessary packages
* [step1](#step1): import dataset `part5_dataset.pickle` as `part6_dataset`
* [step2](#step2): extract `combined_review` from `part6_dataset`
* [step3](#step3): create necessary `class` and `self-defined-fun` for LDA model
* [step4](#step4): create a bag of words solely for LDA model
* [step5](#step5): build an LDA model
* [step6](#step6): calculate the topic probabilities for each document
* [step7](#step7): have a look at the doc topic assigned and the doc text
* [step8](#step8): join the topic back to the original dataset
* [step9](#step9): save the output as `part6_dataset.pickle`

<a id="step0"></a>
## step0: import necessary packages

In [1]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # module for missing value visualization
from scipy import stats # implement box-cox transformation
from math import ceil
from sklearn.utils import shuffle # shuffling the dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer


# Pretty display for notebooks
%matplotlib inline

<a id="step1"></a>
## step1: import dataset part5_dataset.pickle as part6_dataset

In [2]:
part6_dataset = pd.read_pickle("part5_dataset.pickle")

<a id="step2"></a>
## step2: extract `combined_review` from `part6_dataset`

In [3]:
combined_review = part6_dataset["combined_review"]

<a id="step3"></a>
## step3: create necessary `class` and `self-defined-fun` for LDA model

In [4]:
# create a class for lemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [5]:
# define a function to print the top words of each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

Ref: 
1. [customized lemmatizer](http://scikit-learn.org/stable/modules/feature_extraction.html)
2. [Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation](http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py)
3. [extended slices](https://docs.python.org/2.3/whatsnew/section-slices.html)
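
The extended slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` walks the ascending sort order backwards and stops after `n_top_words` items, so the indices of the largest weights come first. Below is a minimal pure-Python sketch of that idea. The weights and words are made up for illustration, and the helper `argsort` is a hypothetical stand-in for NumPy's `argsort` method, not part of the notebook.

```python
def argsort(values):
    """Return the indices that sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical topic-word weights and vocabulary, for illustration only
weights = [0.1, 0.7, 0.3, 0.9, 0.2]
words = ["bed", "metro", "wifi", "station", "tea"]

n_top_words = 3
# Same extended slice as in print_top_words:
# step -1 reverses the order, and the stop index keeps n_top_words items
top_idx = argsort(weights)[:-n_top_words - 1:-1]
top_words = [words[i] for i in top_idx]
print(top_words)
```

The slice works the same way on a list as on a NumPy array, which is why the sketch does not need NumPy.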

<a id="step4"></a>
## step4: create a bag of words solely for LDA model
The **bag of words** is built with the following settings:
1. use the whole dataset
2. weight terms by Term Frequency - Inverse Document Frequency (`TfidfVectorizer`)
3. keep at most 5000 terms
4. lemmatize tokens with the WordNet lemmatizer (`LemmaTokenizer`)
5. keep only terms that appear in at most 25% of the documents and in at least 2 documents
6. exclude English stop words
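
To make the document-frequency thresholds concrete, here is a small pure-Python sketch of the filtering idea behind `max_df` and `min_df`. It is an illustration under stated assumptions, not scikit-learn's implementation: the tokenized reviews are invented, and `max_df` is set to 0.5 rather than the notebook's 0.25 so that a five-document toy corpus keeps some terms.

```python
from collections import Counter

# Hypothetical tokenized reviews, for illustration only
docs = [
    ["great", "breakfast", "staff"],
    ["great", "location", "metro"],
    ["great", "breakfast", "noisy"],
    ["great", "location", "small"],
    ["great", "pool", "staff"],
]

max_df = 0.5  # toy value: drop terms found in more than 50% of documents
min_df = 2    # drop terms found in fewer than 2 documents

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep terms whose document frequency lies inside both thresholds
kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df * n_docs)
print(kept)
```

Here "great" is dropped because it appears in every document, and terms seen only once are dropped as too rare.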

In [6]:
# build up a bag of words for LDA model
n_features = 5000

lda_tfidf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                       max_df=0.25, min_df=2, # keep terms found in at most 25% of docs and in at least 2 docs
                                       max_features=n_features,
                                       stop_words="english")

# fit and transform data
lda_tfidf = lda_tfidf_vectorizer.fit_transform(combined_review)

In [7]:
# extract the feature names in the bag of words
lda_tfidf_feature_names = lda_tfidf_vectorizer.get_feature_names()

<a id="step5"></a>
## step5: build an LDA model
Note that the LDA model also needs a fixed `random_state` so that the results are reproducible.

In [8]:
# build an LDA model
n_topics = 4
n_top_words = 20
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

# fit the model to the data
lda.fit(lda_tfidf)

print("Topics in LDA model:")
print_top_words(lda, lda_tfidf_feature_names, n_top_words)

Topics in LDA model:
Topic #0: t u check time reception day service stay like didn night breakfast positive did booking thing slow got booked bar
Topic #1: great breakfast friendly good excellent helpful clean bed nice comfortable service perfect amazing lovely pool comfy small food fantastic facility
Topic #2: positive bed bathroom breakfast shower good small coffee wifi water air t window poor hot noisy tea star old floor
Topic #3: close station city good metro great walk central near clean nice breakfast easy center parking restaurant centre train far minute
()


>Interpretation for the topics

| Topic 	| Interpretation                                                                   	| Top N words in the topic                                                                                                                           	|
|------:	|:----------------------------------------------------------------------------------	|:----------------------------------------------------------------------------------------------------------------------------------------------------	|
|     0 	| (possibly negative) related to check, reception, booking, slow                   	| check time reception day service stay like didn night breakfast positive did booking thing slow got booked bar                                     	|
|     1 	| (quite positive) related to friendly, breakfast, clean, comfortable, service     	| great breakfast friendly good excellent helpful clean bed nice comfortable service perfect amazing lovely pool comfy small food fantastic facility 	|
|     2 	| (possibly negative) related to bathroom, shower, small, coffee, wifi, noisy, old 	| bed bathroom breakfast shower good small coffee wifi water air t window poor hot noisy tea star old floor                                          	|
|     3 	| (quite positive) related to close, station, metro, central, near, parking        	| close station city good metro great walk central near clean nice breakfast easy center parking restaurant centre train far minute                  	|

In [10]:
# create a dataframe of topic interpretation
topic_interpretation = pd.DataFrame({"topic":[0,1,2,3],
                                     "topic_interpretation":[
                                                "(possibly negative) related to check, reception, booking, slow",
                                                "(quite positive) related to friendly, breakfast, clean, comfortable, service",
                                                "(possibly negative) related to bathroom, shower, small, coffee, wifi, noisy, old",
                                                "(quite positive) related to close, station, metro, central, near, parking"]})

<a id="step6"></a>
## step6: calculate the topic probabilities for each document

In [11]:
# compute the topic probabilities for each document (row)
doc_topic_proba = lda.transform(lda_tfidf) # rows are already normalized, so each sums to 1

# get the index of the topic with the highest probability
doc_topic_max = doc_topic_proba.argmax(axis=1)

Ref: 
1. [LDA probability for document](https://github.com/scikit-learn/scikit-learn/issues/6320)
2. [Use .transform for topic probability](https://stackoverflow.com/questions/45150329/how-to-get-the-topics-probability-of-a-specific-document-using-scikit-learn)
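
The assignment rule in this step is a row-wise argmax: each document gets the topic with the highest probability. The sketch below shows this in pure Python on made-up probabilities. The name `doc_topic_conf` is a hypothetical addition, not from the notebook; it records how strongly the winning topic dominates, which may help spot ambiguous assignments.

```python
# Hypothetical document-topic probabilities (rows: documents, columns: topics)
doc_topic_proba = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.15, 0.10, 0.70],
    [0.40, 0.30, 0.20, 0.10],
]

# Each row is a probability distribution over the topics
row_sums = [sum(row) for row in doc_topic_proba]

# Index of the most probable topic per document (the row-wise argmax)
doc_topic_max = [row.index(max(row)) for row in doc_topic_proba]

# Probability of the winning topic; values near 1/n_topics suggest a weak assignment
doc_topic_conf = [max(row) for row in doc_topic_proba]

print(doc_topic_max)
```

In the third row the winning topic has a probability of only 0.40, so that document is a weaker match than the first two.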

<a id="step7"></a>
## step7: have a look at the doc topic assigned and the doc text
Check whether the assigned topic matches the text of the review.

In [12]:
# show the full text of one review (positional lookup with .iloc)
combined_review.iloc[24]

'Nothing Lovely hotel with extremely comfortable huge double bed We stayed in the split level room which we really liked If you have difficulty getting up stairs request if you can stay in a room all on one level The Oosterpark is beautiful the shops and restaurants are great with lots of variety to choose from You can get the Metro close by 8min walk or the Tram is a short walk away and runs from the station and you can get off within a 5 mins walk to the Hotel All in all a beautiful hotel with friendly staff shampoo and soap in the shower Tea and coffee facilities in your room and in a location that is more relaxing than the central Amsterdam We will be returning'

In [13]:
# look at the topic assigned to the doc
doc_topic_max[24]

3

<a id="step8"></a>
## step8: join the topic back to the original dataset

In [14]:
# join the topic back to the original dataset
part6_dataset["topic_assign_by_LDA"] = doc_topic_max

In [15]:
# join the topic interpretation
part6_dataset = pd.merge(part6_dataset, topic_interpretation,
                         left_on="topic_assign_by_LDA", right_on="topic", how="left")

In [16]:
# drop the redundant join key column
part6_dataset.drop("topic", axis=1, inplace=True)
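
The join in this step can be checked on a small example. The sketch below uses invented miniature tables (the review texts and short interpretations are placeholders, not the notebook's data) to show that a left merge on the topic index keeps every review exactly once and in its original order, and that dropping `topic` afterwards removes only the redundant key.

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's tables
reviews = pd.DataFrame({
    "combined_review": ["near metro", "slow check in", "comfy bed"],
    "topic_assign_by_LDA": [3, 0, 1],
})
topic_interpretation = pd.DataFrame({
    "topic": [0, 1, 2, 3],
    "topic_interpretation": ["reception", "comfort", "bathroom", "location"],
})

# Left join: every review row is kept, matched on the assigned topic index
merged = pd.merge(reviews, topic_interpretation,
                  left_on="topic_assign_by_LDA", right_on="topic", how="left")

# The right-hand key duplicates topic_assign_by_LDA, so it can be dropped
merged = merged.drop("topic", axis=1)

print(merged["topic_interpretation"].tolist())
```

Comparing the row count before and after the merge is a cheap way to confirm that the lookup table has no duplicated topic indices.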

<a id="step9"></a>
## step9: save the output as part6_dataset.pickle

In [17]:
part6_dataset.to_pickle("part6_dataset.pickle")