# **NLP Topic Labelling**


Topic analysis is a Natural Language Processing (NLP) technique that allows us to automatically extract meaning from texts by identifying recurrent themes or topics.

We start by importing the libraries needed to read the data. The dataset consists of hotel reviews from one region, already split into positive and negative comments.

In [None]:
import pandas as pd
import pyarrow.parquet as pq
import numpy as np

Mounting Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading the data into a pandas DataFrame. The file is in Parquet format and is read with the pyarrow engine.

In [None]:
df = pd.read_parquet('/content/drive/MyDrive/Copy of split.parquet', engine='pyarrow')

In [None]:
df.head()

Unnamed: 0,Positive,Negative
0,"[positive]: service is good, location is the h...",[negative]: handling guests with irresponsible...
1,[positive]: great place,
2,"[positive]: the ambience, staff and the locati...",[negative]: nothing which we did not like
3,[positive]: the place and its ambience was ser...,[negative]: not sharing of bills of restaurant...
4,[positive]: excellent property at a wonderful ...,[negative]: food can be further improved


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 797005 entries, 0 to 797098
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Positive  117886 non-null  object
 1   Negative  115359 non-null  object
dtypes: object(2)
memory usage: 18.2+ MB


Splitting the dataset into two pandas Series, one holding the positive reviews and one holding the negative reviews.

In [None]:
df1 = df['Positive']
df2 = df['Negative']

In [None]:
df1.head()

0    [positive]: service is good, location is the h...
1                              [positive]: great place
2    [positive]: the ambience, staff and the locati...
3    [positive]: the place and its ambience was ser...
4    [positive]: excellent property at a wonderful ...
Name: Positive, dtype: object

Dropping rows with null values.

In [None]:
df1 = df1.dropna(axis=0,how='any')

In [None]:
df1.head()

0    [positive]: service is good, location is the h...
1                              [positive]: great place
2    [positive]: the ambience, staff and the locati...
3    [positive]: the place and its ambience was ser...
4    [positive]: excellent property at a wonderful ...
Name: Positive, dtype: object

In [None]:
df1.count()

117886

**Importing the NLTK Library and its prerequisites.**

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

In [None]:
import string
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

NLTK provides an interface to WordNet.

WordNet is a lexical database of English. Its synsets (sets of synonyms) help find conceptual relationships between words, such as hypernyms, hyponyms, synonyms, and antonyms.

In [None]:
from nltk.corpus import wordnet
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
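
The tag mapping above can be tried in isolation. The following is a minimal sketch, not the notebook's own code: it writes the WordNet POS constants as string literals (`wordnet.ADJ` is `"a"`, `wordnet.VERB` is `"v"`, `wordnet.NOUN` is `"n"`, `wordnet.ADV` is `"r"`) so that it runs without downloading any NLTK data. The name `treebank_to_wordnet` is hypothetical.

```python
# WordNet POS constants written as literals so the sketch needs no NLTK data:
# wordnet.ADJ == "a", wordnet.VERB == "v", wordnet.NOUN == "n", wordnet.ADV == "r".
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], "n")


for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Any tag outside the four groups, such as the determiner tag `DT`, falls back to noun, which mirrors the final `else` branch of `get_wordnet_pos`.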

Function to clean, tokenize, and lemmatize a sentence.

In [None]:
def clean_text(text):
    # lowercase the text
    text = text.lower()
    # tokenize on spaces and strip punctuation from each token
    tokens = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove tokens that contain digits
    tokens = [word for word in tokens if not any(c.isdigit() for c in word)]
    # remove empty tokens
    tokens = [t for t in tokens if len(t) > 0]
    # POS-tag the tokens
    pos_tags = pos_tag(tokens)
    # lemmatize each token using its WordNet POS (one lemmatizer per call, not per token)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    # remove one-letter tokens
    tokens = [t for t in tokens if len(t) > 1]
    # join the tokens back into a single string
    return " ".join(tokens)
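
To see what the non-lemmatizing steps of this function do to a raw review, here is a minimal sketch using only the standard library. It leaves out POS tagging and lemmatization, and it merges the two length filters into one; `basic_clean` and the sample review are hypothetical.

```python
import string


def basic_clean(text):
    """Lowercase, split on spaces, strip punctuation, drop digit and short tokens."""
    tokens = [w.strip(string.punctuation) for w in text.lower().split(" ")]
    # drop tokens that contain a digit
    tokens = [w for w in tokens if not any(c.isdigit() for c in w)]
    # drop empty and one-letter tokens
    tokens = [w for w in tokens if len(w) > 1]
    return " ".join(tokens)


cleaned = basic_clean("[positive]: Great place, 2 pools & a gym!")
print(cleaned)
```

The `[positive]:` label survives as the plain token `positive`, which is why the notebook removes it in a separate step.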

First, we clean the positive reviews in `df1`.

In [None]:
df1["cleaned_pos"] = df1.apply(lambda x: clean_text(x))

The cleaned text.

In [None]:
df1['cleaned_pos'].head(10)

0    positive service be good location be the high ...
1                                 positive great place
2    positive the ambience staff and the location b...
3    positive the place and it ambience be serene a...
4    positive excellent property at wonderful locat...
5    positive the property be beautiful clean and w...
6           positive the hotel location be excellent\n
7            positive staff friendly good facilities\n
8    positive facility staff food and overall manag...
9             positive fast service and friendly staff
Name: Positive, dtype: object

Removing the leading 'positive' label from each sentence.

In [None]:
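# Strip only the leading 'positive' label; an unanchored replace would also delete the word inside reviews.
df1['cleaned_pos'] = df1['cleaned_pos'].str.replace(r'^positive\b\s*', '', regex=True).str.strip()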
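
The difference between removing the label and removing the word everywhere can be checked with a short sketch. It assumes a hypothetical cleaned review that uses the word "positive" in its body as well as in the label.

```python
import re

# Hypothetical cleaned review: the label followed by text that reuses the word.
review = "positive staff be positive and helpful"

# An unanchored pattern removes every occurrence, including the one in the text.
removed_everywhere = re.sub("positive", "", review).strip()

# Anchoring at the start of the string removes only the leading label.
removed_prefix = re.sub(r"^positive\b\s*", "", review)

print(repr(removed_everywhere))
print(repr(removed_prefix))
```

The anchored version keeps the reviewer's own use of the word, which may matter for the topic keywords later on.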

In [None]:
df1['cleaned_pos'].head(10)

0    service be good location be the high point hot...
1                                          great place
2    the ambience staff and the location be excelle...
3    the place and it ambience be serene and calm t...
4             excellent property at wonderful location
5    the property be beautiful clean and welcome th...
6                      the hotel location be excellent
7                       staff friendly good facilities
8           facility staff food and overall management
9                      fast service and friendly staff
Name: Positive, dtype: object

We do the same process for negative reviews.

In [None]:
df2 = df2.dropna(axis=0,how='any')

In [None]:
df2.head()

0    [negative]: handling guests with irresponsible...
2            [negative]: nothing which we did not like
3    [negative]: not sharing of bills of restaurant...
4             [negative]: food can be further improved
5    [negative]: we felt the was not to the standar...
Name: Negative, dtype: object

In [None]:
df2.count()

115359

In [None]:
df2["cleaned_neg"] = df2.apply(lambda x: clean_text(x))

In [None]:
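# Strip only the leading 'negative' label; an unanchored replace would also delete the word inside reviews.
df2['cleaned_neg'] = df2['cleaned_neg'].str.replace(r'^negative\b\s*', '', regex=True).str.strip()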

In [None]:
df2['cleaned_neg'].head(10)

0     handling guest with irresponsible behaviour ho...
2                          nothing which we do not like
3     not share of bill of restaurant dhikala while ...
4                               food can be far improve
5     we felt the be not to the standard we would ex...
6     the hotel be totally eco friendly which be rar...
7     feel it’s over price only male service staff l...
10    be the second time that stay here and im both ...
11    would have be nice to have double bed try to c...
12    try dinner in the hotel spaghetti bolognese it...
Name: Negative, dtype: object

In [None]:
data = df1.cleaned_pos.values.tolist()

In [None]:
import re, spacy, gensim
from pprint import pprint

Removing e-mail addresses, extra whitespace, and single quotes from the text.

In [None]:
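# Remove e-mail-like tokens (anything containing '@')
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Collapse runs of whitespace, including newlines, into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove straight single quotes
data = [re.sub(r"\'", "", sent) for sent in data]
pprint(data[:1])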

['service be good location be the high point hotel commitment towards '
 'environment common area be perfectly clean.']
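
The three substitutions are easier to follow on a sentence that actually contains the patterns they target. The sketch below applies the same regular expressions to a hypothetical sample; `strip_noise` is an illustrative name.

```python
import re


def strip_noise(sentence):
    """Apply the three regex clean-up steps in order."""
    sentence = re.sub(r"\S*@\S*\s?", "", sentence)  # drop e-mail-like tokens
    sentence = re.sub(r"\s+", " ", sentence)         # collapse whitespace runs
    sentence = re.sub(r"\'", "", sentence)           # drop straight single quotes
    return sentence


sample = "mail me at guest@example.com the hotel\n isn't   far"
print(strip_noise(sample))
```

Note that the last pattern matches only the straight quote `'`, so typographic apostrophes such as the one in `it’s` are left in place.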


Tokenizing each cleaned sentence into a list of words.

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself lowercases and drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
data_words = list(sent_to_words(data))
print(data_words[:1])

[['service', 'be', 'good', 'location', 'be', 'the', 'high', 'point', 'hotel', 'commitment', 'towards', 'environment', 'common', 'area', 'be', 'perfectly', 'clean']]


Lemmatizing the text again with spaCy, keeping only the words whose part of speech is in an allowed list.

In [None]:
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized sentence, keeping only tokens whose POS is allowed."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # '-PRON-' is the placeholder lemma that spaCy 2.x assigns to pronouns
        lemmas = ['' if token.lemma_ == '-PRON-' else token.lemma_
                  for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(lemmas))
    return texts_out

In [None]:
!python3 -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# Lemmatize, keeping only nouns and verbs
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'VERB'])
print(data_lemmatized[:2])

['service location point hotel commitment environment area', 'place']


**LDA - Latent Dirichlet allocation**

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains sets of observations through unobserved groups, which account for why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox.
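
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration of the assumed process (draw a topic mixture for the document, then draw a topic and a word for each position), not the inference that scikit-learn performs. The topics, word probabilities, and function names are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # the document's topic mixture
    topics = list(topic_word)
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=theta)[0]
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics and word probabilities, for illustration only.
topic_word = {
    "rooms": {"bed": 0.5, "shower": 0.3, "wifi": 0.2},
    "staff": {"friendly": 0.6, "reception": 0.4},
}
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in theta], words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.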

In [None]:
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

In [None]:
vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,
                             stop_words='english',             
                             lowercase=True,                   
                             token_pattern='[a-zA-Z0-9]{3,}',)
data_vectorized = vectorizer.fit_transform(data_lemmatized)
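
As a rough picture of what the vectorizer produces, the sketch below builds a small document-term count matrix with the same token pattern and a document-frequency cut-off. It is a simplified stand-in, not scikit-learn's implementation: it skips stop-word removal and returns plain lists instead of a sparse matrix. `count_vectorize` and the three sample documents are hypothetical.

```python
import re
from collections import Counter


def count_vectorize(docs, min_df=1, token_pattern=r"[a-zA-Z0-9]{3,}"):
    """Return (vocabulary, count rows), keeping terms found in at least min_df docs."""
    tokenized = [re.findall(token_pattern, doc.lower()) for doc in docs]
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = ["service location point hotel", "hotel location view", "place"]
vocab, rows = count_vectorize(docs, min_df=2)
print(vocab)
print(rows)
```

With `min_df=2`, only terms that occur in at least two documents survive, so the third document becomes an all-zero row. The notebook's `min_df=10` applies the same idea at a larger scale.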

Training the model on the vectorised data.

In [None]:
lda_model = LatentDirichletAllocation(n_components=7,               # Number of topics
                                      max_iter=10,               
                                      learning_method='online',   
                                      random_state=100,          
                                      batch_size=128,            
                                      evaluate_every = -1,       
                                      n_jobs = -1,)
lda_output = lda_model.fit_transform(data_vectorized)
print(lda_model)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=-1,
                          perp_tol=0.1, random_state=100, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)


Log-likelihood and perplexity indicate how well the model fits the data.

In [None]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))
# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))
# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -1828287.2259879634
Perplexity:  205.5528220551506
{'batch_size': 128,
 'doc_topic_prior': None,
 'evaluate_every': -1,
 'learning_decay': 0.7,
 'learning_method': 'online',
 'learning_offset': 10.0,
 'max_doc_update_iter': 100,
 'max_iter': 10,
 'mean_change_tol': 0.001,
 'n_components': 7,
 'n_jobs': -1,
 'perp_tol': 0.1,
 'random_state': 100,
 'topic_word_prior': None,
 'total_samples': 1000000.0,
 'verbose': 0}
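
The relation between log-likelihood and perplexity stated in the code comment can be illustrated with a toy calculation. This sketch uses exact, made-up word probabilities, whereas scikit-learn works with an approximate bound on the log-likelihood, so it shows the formula rather than reproducing the numbers above.

```python
import math


def perplexity(word_probs):
    """exp of the negative average log-probability over all word tokens."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))


# Hypothetical per-word probabilities under two toy models, 200 tokens each.
uniform = [1 / 50] * 200  # every token gets probability 1/50
sharper = [1 / 10] * 200  # every token gets probability 1/10
print(perplexity(uniform), perplexity(sharper))
```

A model that gives each word probability 1/50 has perplexity 50, so perplexity can be read as the effective number of equally likely word choices. Lower values mean the model is less surprised by the data.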


In [None]:
pip install pyLDAvis



In [None]:
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

We choose the best parameters using grid search.

Grid search exhaustively evaluates every combination from the lists of parameter values you provide and keeps the best-scoring one, automating the trial-and-error process. Here, `GridSearchCV` scores each combination with the model's cross-validated `score`, which for LDA is an approximate log-likelihood.
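
The idea can be sketched without scikit-learn. The example below tries every combination in a small grid and keeps the highest-scoring one. The scoring function is a hypothetical stand-in for a cross-validated model score, and `grid_search` is an illustrative name.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function standing in for a cross-validated model score.
def toy_score(n_components, learning_decay):
    return -abs(n_components - 15) - abs(learning_decay - 0.7)


grid = {"n_components": [10, 15, 20], "learning_decay": [0.5, 0.7, 0.9]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, and in a real search each combination is additionally refitted once per cross-validation fold.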

In [None]:
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)

GridSearchCV(cv=None, error_score=nan,
             estimator=LatentDirichletAllocation(batch_size=128,
                                                 doc_topic_prior=None,
                                                 evaluate_every=-1,
                                                 learning_decay=0.7,
                                                 learning_method='online',
                                                 learning_offset=50.0,
                                                 max_doc_update_iter=100,
                                                 max_iter=5,
                                                 mean_change_tol=0.001,
                                                 n_components=10, n_jobs=None,
                                                 perp_tol=0.1, random_state=0,
                                                 topic_word_prior=None,
                                                 total_samples=1000000.0,
                               

In [None]:
# Best Model
best_lda_model = model.best_estimator_
# Model Parameters
print("Best Model's Params: ", model.best_params_)
# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)
# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -389047.79156347434
Model Perplexity:  225.18111502992446


Creating the document-topic matrix.

In [None]:
# Create the document-topic matrix
lda_output = best_lda_model.transform(data_vectorized)
# Column names
topicnames = ['Topic' + str(i) for i in range(best_lda_model.n_components)]
# Index names
docnames = ['Doc' + str(i) for i in range(len(data))]
# Build the pandas DataFrame
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
# Get the dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic
# Styling: highlight values above 0.1
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)
def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)
# Apply the styles to the first 15 rows
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
Doc0,0.16,0.3,0.01,0.44,0.01,0.01,0.01,0.01,0.01,0.01,3
Doc1,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.55,0.05,8
Doc2,0.63,0.02,0.02,0.02,0.02,0.21,0.02,0.02,0.02,0.02,0
Doc3,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.9,0.01,8
Doc4,0.37,0.03,0.37,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0
Doc5,0.03,0.03,0.27,0.03,0.03,0.24,0.03,0.32,0.03,0.03,7
Doc6,0.37,0.03,0.03,0.03,0.03,0.03,0.37,0.03,0.03,0.03,0
Doc7,0.03,0.03,0.03,0.03,0.03,0.7,0.03,0.03,0.03,0.03,5
Doc8,0.02,0.02,0.02,0.02,0.02,0.82,0.02,0.02,0.02,0.02,5
Doc9,0.03,0.03,0.03,0.03,0.03,0.7,0.03,0.03,0.03,0.03,5
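
The dominant topic is simply the position of the largest value in each row. The sketch below reproduces that step in plain Python for two rows copied from the table above; `dominant_topic_of` is an illustrative name.

```python
def dominant_topic_of(row):
    """Index of the largest value; ties resolve to the first index, like np.argmax."""
    return max(range(len(row)), key=row.__getitem__)


# Rows copied from the table above (Doc0 and Doc4).
doc0 = [0.16, 0.3, 0.01, 0.44, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
doc4 = [0.37, 0.03, 0.37, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03]
print(dominant_topic_of(doc0), dominant_topic_of(doc4))
```

Doc4 shows a side effect of taking the maximum over rounded values: Topic0 and Topic2 tie at 0.37, and the first of them wins.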


Checking the keywords pertaining to each topic.

In [None]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)
# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames
# View
df_topic_keywords.head()

Unnamed: 0,abend,aber,ability,acce,accept,access,accessibility,accommodate,accommodation,accomodate,accomodation,accompany,accord,account,accueillant,act,action,activite,activity,add,addition,address,adjoin,adjust,adult,advance,advantage,adventure,advertise,advice,advise,aeroport,aeroporto,affordability,afternoon,age,agency,agent,agra,agradavel,...,weather,web,website,wedding,week,weekend,weight,welcome,westerner,wheelchair,wie,wife,wifi,wildlife,win,window,wine,wing,winter,wise,wish,woman,wonder,wood,word,work,worker,working,workout,world,worry,worth,write,wurden,year,yoga,zentrale,zip,zone,zuvorkommend
Topic0,0.100222,0.100192,0.10038,0.100168,0.10022,0.100729,0.100818,0.100254,99.913555,0.100233,0.100756,0.101319,0.100154,0.100207,0.100243,0.100164,0.105073,0.100217,0.100551,0.100662,1.640099,0.100261,0.101534,0.100289,0.101525,0.102268,0.100328,0.992336,0.110314,0.100411,0.10021,0.100192,0.100171,0.1002,0.101855,0.100181,0.100183,0.100238,0.100466,0.100186,...,0.101314,6.536794,2.89067,4.072311,12.887365,0.100243,0.100194,0.100278,0.100195,0.105447,0.1002,0.10036,790.061874,0.100198,0.10019,0.103387,0.100216,0.10018,0.100292,4.727966,0.100882,0.101323,0.100215,0.100482,0.100201,26.492088,0.100212,0.405042,0.100193,0.100898,0.1032,0.100211,0.316904,0.10021,0.192351,0.100247,0.100202,0.136819,0.100258,0.100172
Topic1,0.100182,0.100185,0.100177,0.100215,0.100202,34.313129,0.100251,0.100437,0.835813,0.100453,0.100183,0.102008,0.100165,0.100223,0.100185,0.10019,0.100227,0.100158,0.1003,0.1008,0.112509,0.100287,0.107738,0.100205,0.100191,0.103928,0.100218,0.100218,0.100399,0.104254,0.100497,0.100179,0.100198,0.100256,0.100695,0.100215,0.10025,0.100179,0.115105,0.10023,...,0.117839,0.100237,0.101055,0.100672,0.100397,0.100203,0.100752,0.100284,0.10018,0.110093,0.100187,0.101037,0.102872,0.106759,0.179215,0.100544,0.100526,0.100206,0.103479,0.100215,0.100752,0.207269,0.10035,1.650633,0.187421,1.038322,0.100183,0.100211,0.101115,0.100241,0.1011,0.645742,0.101351,0.100167,0.100415,0.106738,0.100155,0.10118,0.100181,0.100245
Topic2,0.100183,0.100172,0.104552,0.106686,29.027071,0.121005,0.101302,0.101015,0.10074,0.100209,0.100204,11.406149,0.100371,1.971308,0.100236,0.100229,17.808324,0.100236,0.100229,23.102767,0.104174,24.986158,11.850775,0.100452,0.109604,10.667278,0.100228,0.100251,0.101087,13.6249,0.100256,12.748212,0.100163,0.100222,0.100887,7.466971,0.100226,0.100321,0.103063,0.100147,...,3.809956,0.10022,25.126041,19.025417,134.735405,0.100518,0.101163,0.100288,0.100271,0.100186,0.100172,0.100457,0.100374,0.100238,0.101803,0.100401,0.100315,0.100237,13.184675,0.154745,41.019704,12.490598,6.964627,57.407666,0.125827,0.18629,0.100545,0.100214,0.101559,35.470624,0.100761,0.100864,5.959413,0.100207,122.121642,0.100476,0.100216,10.368994,0.100194,0.100205
Topic3,0.100167,0.100201,7.903508,0.104019,0.100278,401.753515,0.100186,0.102511,77.532559,0.100511,0.100305,0.124284,0.100289,0.131963,0.100194,15.556676,15.037929,0.100205,0.100562,0.159962,20.18064,0.100627,0.104201,0.100373,10.271584,23.077129,0.100252,0.100533,28.471766,22.673901,0.100576,0.100186,0.100196,0.100185,0.390905,0.10032,0.355296,6.83272,0.146098,0.100199,...,42.829492,2.163068,0.109302,0.101582,1.89447,0.100278,0.102003,6.856161,9.485692,0.100791,0.100208,12.052228,0.112174,0.109559,11.996219,215.659856,0.102284,8.19102,0.993046,4.11479,73.515917,8.842332,0.105737,0.101291,0.100419,748.997337,0.100447,0.103152,0.106495,0.101433,20.559358,0.100671,0.105951,0.100167,113.099928,0.100203,0.100193,0.101424,0.100391,0.100232
Topic4,19.567057,19.88726,0.100219,0.100262,0.100348,0.100251,147.652456,0.100238,0.100422,0.100156,0.10019,0.100197,0.100162,0.10016,0.100219,0.100218,1.488567,0.100194,0.100194,0.10067,0.100181,0.100253,0.101231,0.10017,0.100454,0.101165,0.100677,0.100203,0.104697,0.100319,0.103101,0.100168,0.100247,0.100259,0.100217,0.100195,0.100521,0.1002,0.100359,0.100754,...,0.1033,2.358181,0.100315,0.100322,0.100368,0.1002,0.103096,0.100233,0.730853,0.100216,10.053649,0.100196,0.102261,0.1002,0.100268,0.100255,0.100212,0.100216,0.100268,0.103605,0.100238,0.100233,0.100211,0.100274,0.100207,0.100306,0.100211,0.100209,0.100971,0.100336,0.10038,0.100219,0.100818,26.558963,0.100253,0.100276,27.963759,0.103299,25.171015,35.623872


Listing the top 15 keywords that the best model assigns to each of its 10 topics.

In [None]:
# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)
# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,location,room,bed,wifi,shower,breakfast,coffee,facility,air,water,toilet,ambience,street,noise,condition
Topic 1,value,money,locate,parking,station,cafe,point,environment,secure,access,security,mall,bar,tune,area
Topic 2,stay,check,property,hotel,family,lot,tour,flight,rate,proximity,choice,person,wait,way,problem
Topic 3,room,view,hotel,stay,area,time,price,need,staff,city,night,day,place,floor,service
Topic 4,die,lage,pillow,hotel,market,lobby,gute,home,alle,war,man,und,ist,freundliche,accessibility
Topic 5,staff,pool,room,food,service,facility,people,size,rooftop,swim,clean,gym,request,desk,accommodate
Topic 6,hotel,airport,mall,walk,love,restaurant,staff,bar,cleanliness,distance,thing,shop,minute,taxi,room
Topic 7,tre,book,help,enjoy,make,staff,use,reception,thank,night,come,dinner,hotel,manager,day
Topic 8,place,water,beach,personnel,resort,comfort,relax,internet,meal,activity,light,island,situation,villa,site
Topic 9,breakfast,staff,room,bed,food,restaurant,hotel,buffet,bathroom,recommend,feel,look,experience,service,security
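
Selecting the top keywords amounts to sorting a topic's word weights in descending order and taking the first few positions. The following sketch shows that selection for a single topic in plain Python. The vocabulary and weights are hypothetical, and `top_keywords` is an illustrative name.

```python
def top_keywords(weights, vocabulary, n_words=3):
    """Return the n_words vocabulary terms with the largest topic weights."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return [vocabulary[i] for i in order[:n_words]]


# Hypothetical weights for one topic, for illustration only.
vocabulary = ["bed", "pool", "staff", "wifi", "view"]
weights = [12.5, 0.1, 48.0, 30.2, 0.1]
print(top_keywords(weights, vocabulary))
```

Negating the weights before sorting plays the same role as `(-topic_weights).argsort()` in the notebook's `show_topics` function.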


Naming each topic manually, based on the keywords listed under it.

In [None]:
Topics = ["Room Facilities","Location/Amenities","Transportation/Accessibility","Service/Staff","Misc", 
          "Recreation", "Shops/Restaurants/Accessibility", "Staff/Responsible", "Relaxation/View", "Morning/Lunch Service"]
df_topic_keywords["Topics"]=Topics
df_topic_keywords

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14,Topics
Topic 0,location,room,bed,wifi,shower,breakfast,coffee,facility,air,water,toilet,ambience,street,noise,condition,Room Facilities
Topic 1,value,money,locate,parking,station,cafe,point,environment,secure,access,security,mall,bar,tune,area,Location/Amenities
Topic 2,stay,check,property,hotel,family,lot,tour,flight,rate,proximity,choice,person,wait,way,problem,Transportation/Accessibility
Topic 3,room,view,hotel,stay,area,time,price,need,staff,city,night,day,place,floor,service,Service/Staff
Topic 4,die,lage,pillow,hotel,market,lobby,gute,home,alle,war,man,und,ist,freundliche,accessibility,Misc
Topic 5,staff,pool,room,food,service,facility,people,size,rooftop,swim,clean,gym,request,desk,accommodate,Recreation
Topic 6,hotel,airport,mall,walk,love,restaurant,staff,bar,cleanliness,distance,thing,shop,minute,taxi,room,Shops/Restaurants/Accessibility
Topic 7,tre,book,help,enjoy,make,staff,use,reception,thank,night,come,dinner,hotel,manager,day,Staff/Responsible
Topic 8,place,water,beach,personnel,resort,comfort,relax,internet,meal,activity,light,island,situation,villa,site,Relaxation/View
Topic 9,breakfast,staff,room,bed,food,restaurant,hotel,buffet,bathroom,recommend,feel,look,experience,service,security,Morning/Lunch Service


Creating a function that accepts a review and labels it with the most applicable topic.

In [None]:
# Define a function to predict the topic of a given text document.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def predict_topic(text, nlp=nlp):
    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))
    # Step 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    # Step 3: Vectorize
    mytext_4 = vectorizer.transform(mytext_3)
    # Step 4: LDA transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    best_topic = np.argmax(topic_probability_scores)
    # Keywords of the most probable topic (columns 'Word 1' to 'Word 13')
    topic = df_topic_keywords.iloc[best_topic, 1:14].values.tolist()
    # Step 5: Infer the topic label (last column, 'Topics')
    infer_topic = df_topic_keywords.iloc[best_topic, -1]
    return infer_topic, topic, topic_probability_scores
# Predict the topic
mytext = ["Very nice hotel room and the staff is very responsible. Also liked the overall ambience. It was very peaceful and quiet."]
infer_topic, topic, prob_scores = predict_topic(text = mytext)
print(topic)
print(infer_topic)

['view', 'hotel', 'stay', 'area', 'time', 'price', 'need', 'staff', 'city', 'night', 'day', 'place', 'floor']
Service/Staff


**CONCLUSION**

Natural Language Processing (or NLP) is the science of dealing with human language or text data. One of the NLP applications is Topic Analysis, which is a technique used to discover topics across text documents.

Topic analysis (also called topic detection, topic modeling, or topic extraction) is a machine learning technique that organizes and understands large collections of text data, by assigning “tags” or categories according to each individual text’s topic or theme.

Topic analysis uses natural language processing (NLP) to break down human language so that you can find patterns and unlock semantic structures within texts to extract insights and help make data-driven decisions.

The two most common approaches for topic analysis with machine learning are NLP topic modeling and NLP topic classification.

Topic analysis can be applied at different levels of scope:

1) Document-level: the topic model obtains the different topics from within a complete text. For example, the topics of an email or a news article.

2) Sentence-level: the topic model obtains the topic of a single sentence. For example, the topic of a news article headline.

3) Sub-sentence level: the topic model obtains the topic of sub-expressions from within a sentence. For example, different topics within a single sentence of a product review.


The performance of topic models depends on the terms present in the corpus, represented as a document-term matrix. Since this matrix is sparse, reducing its dimensionality may improve model performance. However, since our corpus was not very large, we can be reasonably confident in the results achieved.
