* [step0](#step0): import necessary packages
* [step1](#step1): import dataset part4_dataset.pickle as part5_dataset
* [step2](#step2): combine Positive_Review and Negative_Review into one text column
* [step3](#step3): replace the punctuation in the string `combined_review`

In [94]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # module for missing value visualization
from scipy import stats # implement box-cox transformation
from math import ceil
strip = str.strip # string.strip() was removed in Python 3; str.strip is the direct equivalent
from sklearn.utils import shuffle # shuffling the dataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer


# Pretty display for notebooks
%matplotlib inline

<a id="step1"></a>
#### step1: import dataset part4_dataset.pickle as part5_dataset

In [95]:
part5_dataset = pd.read_pickle("part4_dataset.pickle")

<a id="step2"></a>
#### step2: combine Positive_Review and Negative_Review into one text column

In [96]:
# combine Positive_Review and Negative_Review into one text column
# strip the whitespace at both ends of each review
part5_dataset["Negative_Review"] = part5_dataset["Negative_Review"].str.strip()
part5_dataset["Positive_Review"] = part5_dataset["Positive_Review"].str.strip()

# combine the two text columns, separated by a single space
part5_dataset["combined_review"] = part5_dataset["Negative_Review"] + " " + part5_dataset["Positive_Review"]

# have a look at the result
display(part5_dataset[["combined_review","Negative_Review","Positive_Review"]].iloc[2,0])

'Rooms are nice but for elderly a bit difficult as most rooms are two story with narrow steps So ask for single level Inside the rooms are very very basic just tea coffee and boiler and no bar empty fridge Location was good and staff were ok It is cute hotel the breakfast range is nice Will go back'
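
One edge case worth checking: if either review is an empty string after stripping, joining with a space leaves a stray leading or trailing space. The following is a minimal plain-Python sketch of a join that skips empty parts. `combine_reviews` is a hypothetical helper for illustration and is not part of the notebook.

```python
def combine_reviews(negative, positive):
    """Join two review strings with one space, skipping parts that are empty after stripping."""
    parts = [negative.strip(), positive.strip()]
    return " ".join(part for part in parts if part)

combined = combine_reviews("  Rooms are very basic ", " Location was good ")
only_positive = combine_reviews("   ", "Will go back")
print(repr(combined))
print(repr(only_positive))
```

Whether this matters depends on how empty reviews are encoded in the dataset, which is worth inspecting before relying on either approach.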

<a id="step3"></a>
#### step3: replace the punctuation in the string `combined_review`

In [97]:
# remove every character in "combined_review" that is not a word character (letter, digit, underscore) or whitespace
part5_dataset["combined_review"] = part5_dataset["combined_review"].str.replace(r"[^\w\s]", "", regex=True)
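
To see what the pattern `[^\w\s]` keeps and drops, here is a small sketch using the standard `re` module on a made-up sentence. `remove_punctuation` and the sample text are assumptions for illustration only.

```python
import re

def remove_punctuation(text):
    # [^\w\s] matches any character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

sample = "Didn't like the mini-bar; room_service was great!"
cleaned = remove_punctuation(sample)
print(cleaned)
```

Two things stand out: the underscore survives because `\w` includes it, and punctuation is deleted rather than replaced by a space, so `mini-bar` becomes `minibar`. Replacing with `" "` instead of `""` would keep such words apart, at the cost of extra whitespace.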

<a id="step4"></a>
#### step4: save the output as `part5_dataset.pickle`

In [98]:
part5_dataset.to_pickle("part5_dataset.pickle")

<a id="step5"></a>
#### step5: shuffle and sample 50% of the dataset
1. Use one 50% split of the shuffled dataset (`X_first50`) as the training and validation data for sentiment analysis.
2. For simplicity, reuse that same 50% split as the training text for the LDA model.

In [5]:
# shuffle the dataset
shuffled_data = shuffle(part5_dataset, random_state=20)

# separate the target variable out
target_variable = shuffled_data.review_sentiment
target_variable = target_variable.astype("category")

# sample 50% of the whole dataset, stratified by sentiment, via train_test_split()
# note: no random_state is passed here, so this split differs from run to run
X_first50, X_remaining50, y_first50, y_remaining50 = train_test_split(shuffled_data, target_variable,
                                                                      test_size=0.5, stratify=target_variable)
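
As a rough illustration of what `stratify` buys here, the sketch below splits a list of labels in half while keeping each class's proportion the same in both halves. It is a pure-Python approximation of the idea, not scikit-learn's implementation; `stratified_half_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_half_split(labels, seed=20):
    """Return two sorted index lists, each holding about half of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    first, second = [], []
    for indices in by_class.values():
        rng.shuffle(indices)          # shuffle within the class
        cut = len(indices) // 2       # half of this class goes to each side
        first.extend(indices[:cut])
        second.extend(indices[cut:])
    return sorted(first), sorted(second)

labels = ["positive"] * 8 + ["negative"] * 4
first, second = stratified_half_split(labels)
print([labels[i] for i in first])
```

Both halves end up with the same 2:1 positive-to-negative ratio as the full list, which is the property the sentiment model's training and validation data rely on.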

<a id="step6"></a>
#### step6: create a bag of words solely for LDA model

In [6]:
# create a class for lemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
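
The class above works as a tokenizer because any object with a `__call__` method can be passed where a function is expected. The sketch below shows the same shape without NLTK. `ToyTokenizer` and its crude plural-stripping rule are hypothetical stand-ins and are far weaker than `WordNetLemmatizer`.

```python
class ToyTokenizer(object):
    """Hypothetical stand-in for LemmaTokenizer: same interface, no NLTK required."""

    def _normalize(self, token):
        # crude rule: drop a trailing "s" from words longer than three letters
        if len(token) > 3 and token.endswith("s"):
            return token[:-1]
        return token

    def __call__(self, doc):
        # split on whitespace, lowercase, then normalize each token
        return [self._normalize(t) for t in doc.lower().split()]

tokenizer = ToyTokenizer()
tokens = tokenizer("Rooms are nice but narrow steps")
print(tokens)
```

Keeping the lemmatizer on `self` means it is constructed once and reused for every document, rather than rebuilt on each call.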

In [7]:
# define a function to print the top words of each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

Ref: [extended slices](https://docs.python.org/2.3/whatsnew/section-slices.html)
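
The slice `[:-n_top_words - 1:-1]` in `print_top_words` is easy to misread, so here is a small sketch of what it does. It uses a plain list and `sorted()` in place of NumPy's `argsort()`; `top_indices` and the sample weights are assumptions for illustration.

```python
def top_indices(weights, n_top):
    # indices ordered by ascending weight, like numpy's argsort()
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # start at the last element, step backwards, stop before position -n_top - 1
    return ascending[:-n_top - 1:-1]

topic_weights = [0.1, 0.7, 0.2, 0.9, 0.4]
top_two = top_indices(topic_weights, 2)
print(top_two)  # [3, 1]
```

The result lists the positions of the `n_top` largest weights, largest first, which is why the printed topic words appear in descending order of importance.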

In [72]:
# build a TF-IDF weighted bag of words for the LDA model
n_features = 5000

lda_tfidf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                       max_df=0.25, min_df=2,
                                       max_features=n_features,
                                       stop_words="english")

# fit and transform data
lda_tfidf = lda_tfidf_vectorizer.fit_transform(X_first50["combined_review"])
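
The `max_df=0.25` and `min_df=2` settings prune the vocabulary by document frequency: a float is read as a fraction of documents and an integer as an absolute count. The sketch below mimics that filter in plain Python on a toy corpus. `filter_by_document_frequency` is a hypothetical helper, and it ignores everything else the vectorizer does (stop words, `max_features`, TF-IDF weighting).

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=2, max_df=0.25):
    """Keep terms that occur in at least min_df docs and at most max_df * len(docs) docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, count in doc_freq.items()
                  if min_df <= count <= max_df * n_docs)

docs = [
    "hotel metro close",
    "hotel metro far",
    "hotel hairdryer",
    "hotel breakfast",
    "hotel breakfast",
    "hotel breakfast",
    "hotel",
    "hotel",
]
vocabulary = filter_by_document_frequency(docs)
print(vocabulary)
```

Here `hotel` (in every document) and `breakfast` (in 3 of 8, above the 25% cap) are dropped as too common, while `close`, `far`, and `hairdryer` are dropped as too rare, leaving only `metro`.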

In [73]:
# extract the feature names in the bag of words
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
lda_tfidf_feature_names = lda_tfidf_vectorizer.get_feature_names_out()

In [74]:
# build an LDA model
n_topics = 4
n_top_words = 20
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

# fit the model to the data
lda.fit(lda_tfidf)

print("\nTopics in LDA model:")
print_top_words(lda, lda_tfidf_feature_names, n_top_words)


Topics in LDA model:
Topic #0: t u time check like night didn reception positive day stay service did breakfast got booking star bar stayed booked
Topic #1: great friendly excellent helpful breakfast good clean comfortable bed nice lovely perfect amazing service fantastic stay comfy view really bar
Topic #2: close station good walk value city metro money near parking restaurant easy minute great tube far breakfast nice walking central
Topic #3: positive small bed bathroom breakfast shower good coffee air hot water window noisy comfortable poor tea nice bit clean little


In [75]:
# compute the topic probabilities for each document (row)
doc_topic_distribution = lda.transform(lda_tfidf) # rows are already normalized, so each sums to 1

# get the index of the topic with the maximum probability
topic_for_doc = doc_topic_distribution.argmax(axis=1)
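
Taking the `argmax` of each row turns a soft topic mixture into a single hard label. The sketch below shows the same operation on plain lists with made-up probabilities. `hard_assign` is a hypothetical helper standing in for `argmax(axis=1)`.

```python
def hard_assign(doc_topic_rows):
    """Return, for each row of topic probabilities, the index of the largest value."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic_rows]

rows = [
    [0.10, 0.70, 0.10, 0.10],  # clearly topic 1
    [0.40, 0.10, 0.10, 0.40],  # tie between topics 0 and 3
]
assignments = hard_assign(rows)
print(assignments)  # [1, 0]
```

On a tie the first maximum wins, so the second row is labelled topic 0 even though topic 3 is equally likely. A review that mixes several themes loses that information under a hard assignment, so it can be worth keeping the full distribution alongside the label.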

Ref: 
1. [LDA probability for document](https://github.com/scikit-learn/scikit-learn/issues/6320)
2. [Use .transform for topic probability](https://stackoverflow.com/questions/45150329/how-to-get-the-topics-probability-of-a-specific-document-using-scikit-learn)

In [90]:
X_first50["combined_review"].iloc[19]

'Actually everything was perfect with Pullman Hotel but after we checked out on Dec 23rd around 10AM they sent us email at 7 40PM told us that the housekeeper couldn t find the hairdryer in our room and they wanted to charge us 80 We didn t take the hairdryer and we explained to them where we put the hairdryer back to it s place They replied that they will check again to the housekeeping and until now still no news from them I hope they already find the hairdryer and I still wait for my credit card bill to make sure there is no charge from Pullman Hotel The Hotel is very near from Eiffel Tower 2mins walking distance The room was spacious and very clean They upgraded our room from superior king room eiffel tower view to balcony eiffel tower view because that was our honeymoon trip Thank you so much Pullman'

In [89]:
topic_for_doc[16:20]

array([1, 3, 3, 0])

In [None]:
# compute the topic probabilities for each document (row) in the original, full dataset
original_doc = lda_tfidf_vectorizer.transform(part5_dataset["combined_review"]) # transform the text with the already fitted TF-IDF vectorizer

original_doc_topic = lda.transform(original_doc)
