<h1><center>Part 3: Topic Modeling</center></h1>

In Part 3, we take the results from Part 1 and apply a Latent Dirichlet Allocation (LDA) model to identify the topics discussed by reviewers. The data is split into two parts:
1. Positive reviews: we find the main topics in these reviews.
2. Negative reviews: we find the main topics in these reviews.

Topic modeling is an unsupervised learning technique that represents text documents as mixtures of several topics. It does not require a predefined list of tags for the documents; instead, it analyzes the text to find clusters of words (topics) across a set of documents.


## 1. Data Preparation 

#### Import the libraries used in this part of the project.

In [1]:
import pandas as pd
import unicodedata
import re
import math

# NLTK and text utilities
import nltk
import contractions
import string
#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
#spacy
import spacy
from nltk.corpus import stopwords
#vis
import pyLDAvis
import pyLDAvis.gensim_models

In [2]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nordine.quadar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nordine.quadar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [3]:
# read already saved data
dat = pd.read_csv('asin_sample_GPT_Rev.csv')

In [4]:
# Checking the data 
dat

Unnamed: 0.1,Unnamed: 0,overall,vote,verified,reviewerID,asin,reviewText,summary,unixReviewTime,reviewText_clean,reviewText_deep_clean,year,month,Sentiment_Score,Sentiment_GPT_Score,Reviewer_Type
0,0,5,2,No,AFDH6LFI9LP4E,B00KJ07SEM,Im not buying the GE one again. This one works...,Best Water Filter Ever!,2014-07-08,im not buying the ge one again. this one works...,im buy ge one thi one work great much less cos...,2014,7,5,0,-1
1,1,5,2,Yes,A1GZT67WOLNL5F,B00KJ07SEM,Removed the GE MWF Smartwater filter inserted ...,Worked like a charm,2014-07-06,removed the ge mwf smartwater filter inserted ...,remov ge mwf smartwat filter insert waterfal f...,2014,7,5,1,0
2,2,4,2,Yes,A3CMR3EQ6NSYEE,B00KJ07SEM,This a good filter and fits our needs quite we...,Good non OEM filter,2014-06-30,this a good filter and fits our needs quite we...,thi good filter fit need quit well work well g...,2014,6,5,1,1
3,3,3,5,No,A2ZHDU2DP6VYU9,B00KJ07SEM,Update: Within hours of my posting this review...,"Update: Terrible filter, but great customer se...",2014-06-27,update: within hours of my posting this review...,updat within hour post review first time jenni...,2014,6,3,0,0
4,4,5,30,Yes,A1GI72ZP1HD0V8,B00KJ07SEM,I have a GE two door refrigerator and use to b...,Very good,2014-06-22,i have a ge two door refrigerator and use to b...,i ge two door refriger use buy origin filter c...,2014,6,5,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3194,3195,5,0,Yes,A1FK4897PKMVSJ,B00KJ07SEM,"Snapped in no problem, filters ok, lasted abou...",Five Stars,2018-07-22,"snapped in no problem, filters ok, lasted abou...",snap problem filter ok last year,2018,7,4,0,0
3195,3196,5,0,Yes,AA5HIR6MSWTIL,B00KJ07SEM,I have ordered this filter from this seller mu...,Fridge filter,2018-07-22,i have ordered this filter from this seller mu...,i order filter seller multipl time alway posit...,2018,7,5,1,1
3196,3197,5,0,Yes,A35NA14R0332AZ,B00KJ07SEM,Why spend $400 more a year for some name brand...,"Buy this now, but read funny reveiw",2018-07-17,why spend $400 more a year for some name brand...,whi spend 400 year name brand crap these work ...,2018,7,1,-1,-1
3197,3198,5,0,Yes,A16N4OCSB7J50F,B00KJ07SEM,Filter works great so much so I just reordered...,Five Stars,2018-07-14,filter works great so much so i just reordered...,filter work great much i reorder backup,2018,7,5,1,1


After loading the refined data from Part 1 and importing the required packages, we remove NaN values and URLs from the reviews so that the text can be tokenized in the next steps.

In [5]:
# Remove rows whose cleaned review is missing (NaN)
dat = dat.dropna(subset=['reviewText_clean'])

In [6]:
# standardizing accented characters if any
def standardize_accented_chars(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

#standardizing accented characters for every row
dat.reviewText_clean = dat.reviewText_clean.apply(standardize_accented_chars)

In [7]:
def get_number_of_urls(documents):
    # str.find returns -1 when 'http' is absent and 0 when it starts the text,
    # so the test must be >= 0 to also count URLs at the very beginning of a review
    share = (documents.apply(lambda x: x.find('http')) >= 0).mean() * 100
    print("{:.2f}% of documents contain urls".format(share))

# Passing the 'reviewText_clean' column of the dataframe as the argument
get_number_of_urls(dat.reviewText_clean)

0.13% of documents contain urls


In [8]:
def remove_url(text):
    return re.sub(r'https?:\S*', '', text)

# removing urls from every row
dat.reviewText_clean = dat.reviewText_clean.apply(remove_url)

In [9]:
def expand_contractions(text):
    # expand each whitespace-separated word, e.g. "don't" -> "do not"
    expanded_words = [contractions.fix(word) for word in text.split()]
    return ' '.join(expanded_words)

# expanding contractions for every row
dat.reviewText_clean = dat.reviewText_clean.apply(expand_contractions)

**Expanding Contractions:** Contractions are shortened forms of words, created by removing one or more letters. More often than not, several words are combined into one contraction: for example, "I will" is contracted into "I’ll" and "do not" into "don’t". We do not want the model to treat "I will" and "I’ll" differently, so we convert each contraction into its expanded form using the code above.
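To make the idea concrete, here is a minimal sketch of contraction expansion using a small hand-written lookup table. The `CONTRACTION_MAP` dictionary and the `expand_with_map` function are hypothetical names introduced for illustration only; the notebook itself relies on the `contractions` package, which covers far more cases.

```python
# Hypothetical, deliberately tiny lookup table (illustration only).
CONTRACTION_MAP = {
    "i'll": "i will",
    "don't": "do not",
    "wouldn't": "would not",
}

def expand_with_map(text, mapping=CONTRACTION_MAP):
    """Replace each whitespace-separated word found in `mapping` by its expansion."""
    return ' '.join(mapping.get(word, word) for word in text.split())

expanded = expand_with_map("i'll buy it again but don't expect miracles")
print(expanded)  # i will buy it again but do not expect miracles
```

Words that are not in the table pass through unchanged, so the sketch is safe to apply to already-expanded text.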

In [10]:
def keep_only_alphabet(text):
    return re.sub(r'[^a-z]', ' ', text)

#for all the rows
dat.reviewText_clean=dat.reviewText_clean.apply(keep_only_alphabet)

**Keeping only Alphabet:** Punctuation, numbers, and special characters such as ‘$’ and ‘%’ carry little information for topic modeling. Hence, we keep only lowercase letters and replace everything else in the text with a space, using the function above.
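Replacing every non-letter with a space leaves runs of consecutive spaces, which can later show up as empty tokens when the text is split on a single space. The sketch below shows one possible variant that also collapses those runs; `keep_letters_and_squeeze` is a hypothetical helper, not part of the notebook, and it assumes the text is already lowercased (uppercase letters would be replaced too).

```python
import re

def keep_letters_and_squeeze(text):
    """Replace non-letters by spaces, then collapse runs of whitespace."""
    letters_only = re.sub(r'[^a-z]', ' ', text)
    return re.sub(r'\s+', ' ', letters_only).strip()

cleaned = keep_letters_and_squeeze("why spend $400 more a year? works 100%!")
print(cleaned)  # why spend more a year works
```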

In [11]:
def remove_stopwords(text, nlp, custom_stop_words=None, remove_small_tokens=True, min_len=2):
    # if custom stop words are provided (as a set), add them to the default stop-word list
    if custom_stop_words:
        nlp.Defaults.stop_words |= custom_stop_words

    filtered_sentence = []
    for token in nlp(text):
        if token.is_stop:
            continue
        # if small tokens have to be removed, keep only tokens longer than min_len
        if remove_small_tokens and len(token.text) <= min_len:
            continue
        filtered_sentence.append(token.text)
    # return the remaining words as a string, or None if nothing is left
    return ' '.join(filtered_sentence) if filtered_sentence else None

# creating a spaCy object
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# removing stop words and short words from every row
dat.reviewText_clean = dat.reviewText_clean.apply(lambda x: remove_stopwords(x, nlp))

In [12]:
# Remove rows left empty by stop-word removal (remove_stopwords returned None)
dat = dat[dat['reviewText_clean'].apply(lambda x: isinstance(x, str))]

In [13]:
def lemmatize(text, nlp):
    
    doc = nlp(text)
    lemmatized_text = []
    for token in doc:
        lemmatized_text.append(token.lemma_)
    return ' '.join(lemmatized_text)

#Performing lemmatization on every row
dat.reviewText_clean=dat.reviewText_clean.apply(lambda x:lemmatize(x,nlp))

**Lemmatization:** Lemmatization reduces each word to its root (dictionary) form, using vocabulary and morphological analysis. We use the spaCy library to perform lemmatization.

In [14]:
def generate_tokens(review):
    # str.split() with no argument splits on any run of whitespace, so the
    # extra spaces introduced during text cleaning do not yield empty tokens
    return review.split()

# storing the generated tokens in a new column named 'tokens'
dat['tokens'] = dat.reviewText_clean.apply(generate_tokens)

In [15]:
# Remove empty strings from each token list
dat['tokens'] = dat['tokens'].apply(lambda tokens: [t for t in tokens if t != ''])

## 2. Topics for Positive sentiment

In [16]:
# Select only the positive reviews to analyze
dat_pos = dat[dat['Sentiment_GPT_Score']==1]

**Generating Document Matrix and Dictionary:**


The LDA topic model algorithm requires a document-term matrix and a dictionary as its main inputs.

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

A dictionary is the collection of all unique tokens present in the documents.
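The two structures can be illustrated with a small pure-Python sketch. It builds a token-to-id dictionary and then stores each document as a sparse list of `(token_id, count)` pairs, which is the general idea behind a bag-of-words corpus. The helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the id assignment used by Gensim may differ from the order used here.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Two toy tokenized reviews (illustration only).
docs = [["filter", "work", "great"], ["filter", "taste", "bad", "filter"]]
token2id = build_vocabulary(docs)
bows = [to_bag_of_words(doc, token2id) for doc in docs]

print(token2id)  # {'filter': 0, 'work': 1, 'great': 2, 'taste': 3, 'bad': 4}
print(bows)      # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1), (4, 1)]]
```

Each inner list is one row of the document-term matrix with the zero entries left out, which keeps the representation compact when the vocabulary is large.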

To generate the dictionary and the document-term matrix, we pass the tokens created above to the functions below.

In [17]:
def create_dictionary(words):
    return corpora.Dictionary(words)
#passing the dataframe column having tokens as the argument
id2word=create_dictionary(dat_pos.tokens)
print(id2word)

Dictionary(1608 unique tokens: ['charm', 'filter', 'insert', 'instruction', 'like']...)


In [18]:
def create_document_matrix(tokens,id2word):
    corpus = []
    for text in tokens:
        corpus.append(id2word.doc2bow(text))
    return corpus
#passing the dataframe column having tokens and dictionary
corpus=create_document_matrix(dat_pos.tokens,id2word)

**LDA Implementation:**

The Gensim library is used for LDA. For the base model, we set num_topics=5.

In [19]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
 id2word=id2word,
 num_topics=5,
 random_state=100,
 )

**LDA Topics Generation:**

We iterate over the topics identified by our LDA model and get the top words in each topic. We store the top 10 words for each topic in a dataframe using the function below. The top_n_words parameter controls the number of top words stored for each topic.
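As a rough sketch of what "top words" means, a topic can be viewed as a mapping from words to probabilities, and the top words are simply the entries with the largest values. The `top_words` helper and the probabilities in `toy_topic` below are made up for illustration and do not come from the fitted model.

```python
def top_words(word_probs, top_n):
    """Return the `top_n` words with the highest probability, best first."""
    ranked = sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]

# Made-up probabilities for a single toy topic (illustration only).
toy_topic = {"filter": 0.30, "water": 0.25, "taste": 0.20, "price": 0.15, "fit": 0.10}

print(top_words(toy_topic, 3))  # ['filter', 'water', 'taste']
```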

In [20]:
def get_lda_topics(model, num_topics, top_n_words):
    word_dict = {}
    for i in range(num_topics):
        # show_topic returns (word, probability) pairs; keep only the words
        top_words = [word for word, _ in model.show_topic(i, topn=top_n_words)]
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = top_words
    return pd.DataFrame(word_dict)

get_lda_topics(lda_model, 5, 10)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05
0,filter,good,work,filter,filter
1,great,filter,great,water,great
2,water,water,easy,taste,water
3,work,product,install,great,product
4,fit,easy,price,work,taste
5,easy,price,good,price,work
6,install,work,brand,good,buy
7,price,time,product,refrigerator,price
8,good,install,original,fit,service
9,product,arrive,cheap,easy,order


**Visualizing Topics:**

We use the pyLDAvis library to visualize the results. The code below generates an interactive dashboard that displays them.

In [21]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds='mmds', R=20)
vis



**Comment:** In the topics identified for the positive reviews, customers mainly mention, as in topic 1, that the product fits, is easy to install, is well priced, and works great. The topics reflect different features of the product.

## 3. Topics for Negative sentiment

We repeat the same steps for the negative reviews.

In [22]:
dat_neg = dat[dat['Sentiment_GPT_Score']== -1]

In [23]:
def create_dictionary(words):
    return corpora.Dictionary(words)
#passing the dataframe column having tokens as the argument
id2word=create_dictionary(dat_neg.tokens)
print(id2word)

Dictionary(1171 unique tokens: ['company', 'compatible', 'drain', 'forward', 'go']...)


In [24]:
def create_document_matrix(tokens,id2word):
    corpus = []
    for text in tokens:
        corpus.append(id2word.doc2bow(text))
    return corpus
#passing the dataframe column having tokens and dictionary
corpus=create_document_matrix(dat_neg.tokens,id2word)

In [25]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
 id2word=id2word,
 num_topics=5,
 random_state=100,
 )

In [26]:
def get_lda_topics(model, num_topics, top_n_words):
    word_dict = {}
    for i in range(num_topics):
        # show_topic returns (word, probability) pairs; keep only the words
        top_words = [word for word, _ in model.show_topic(i, topn=top_n_words)]
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = top_words
    return pd.DataFrame(word_dict)

get_lda_topics(lda_model, 5, 10)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05
0,filter,filter,taste,filter,filter
1,water,water,water,water,water
2,taste,buy,filter,taste,taste
3,work,taste,like,buy,work
4,fit,work,month,like,like
5,flow,product,work,work,product
6,mwf,instal,product,good,save
7,refrigerator,refrigerator,bad,fridge,purchase
8,old,brand,plastic,bad,get
9,month,bad,buy,return,return


In [27]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds='mmds', R=30)
vis



**Comment:** In the topics identified for the negative reviews, customers mainly mention, as in topic 1, the taste of the water.

**Conclusion:**
As we can see, topic modeling can help us understand what customers are discussing. However, the topics are still not clearly separated: the same words are repeated across different topics. This issue may be reduced by hyperparameter tuning.
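One simple way to quantify how much the topics repeat each other is the Jaccard overlap between their top-word lists. The sketch below applies it to the top 10 words of the first two negative-review topics, copied from the table above; the `jaccard` helper is a hypothetical name introduced here, and this measure is only one possible diagnostic, not part of the original analysis.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Top-10 words of two negative-review topics, copied from the output table.
topics = {
    "Topic # 01": ["filter", "water", "taste", "work", "fit",
                   "flow", "mwf", "refrigerator", "old", "month"],
    "Topic # 02": ["filter", "water", "buy", "taste", "work",
                   "product", "instal", "refrigerator", "brand", "bad"],
}

overlap = jaccard(topics["Topic # 01"], topics["Topic # 02"])
print(round(overlap, 2))  # 0.33
```

The two lists share 5 of their 15 distinct words. Tracking this value while varying settings such as the number of topics gives a quick check on whether tuning actually makes the topics more distinct.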


### -----------------------------------------------------------------------------------------------------------------------------------------------------
### End Part 3