# Topic modelling using Gensim

### Steps
- Clean the reviews
- Get the tokens and their ids
- Build the bag of words for each document
- Use an LDA model to get the topics
- Assign a topic to each document based on probability

In [4]:
import re
import nltk
import gensim
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from itertools import combinations
from gensim import corpora, models

In [49]:
reviews = pd.read_csv('e:/datasets/amazon_reviews/amazon_reviews_11.csv')
reviews = reviews[~pd.isnull(reviews['reviewText'])]
reviews.head()

| Unnamed: 0.1 | Unnamed: 0 | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 528881469 | [0, 0] | 5.0 | We got this GPS for my husband who is an (OTR)... | 06 2, 2013 | AO94DHGC771SJ | amazdnu | Gotta have GPS! | 1370131000.0 |
| 1 | 1 | 528881469 | [12, 15] | 1.0 | I'm a professional OTR truck driver, and I bou... | 11 25, 2010 | AMO214LNFCEI4 | Amazon Customer | Very Disappointed | 1290643000.0 |
| 2 | 2 | 528881469 | [43, 45] | 3.0 | Well, what can I say. I've had this unit in m... | 09 9, 2010 | A3N7T0DY83Y4IG | C. A. Freeman | 1st impression | 1283990000.0 |
| 3 | 3 | 528881469 | [9, 10] | 2.0 | Not going to write a long review, even thought... | 11 24, 2010 | A1H8PY3QHMQQA0 | Dave M. Shaw "mack dave" | Great grafics, POOR GPS | 1290557000.0 |
| 4 | 4 | 528881469 | [0, 0] | 1.0 | I've had mine for a year and here's what we go... | 09 29, 2011 | A24EV6RXELQZ63 | Wayne Smith | Major issues, only excuses for support | 1317254000.0 |


### Get Stop Words

- Common stop words can be taken directly from the nltk corpus. Refer to [this documentation](http://www.nltk.org/data.html) for help with getting data for `nltk`
- Custom stop words can be added manually to the list below or read from a file
- Combine the common and custom stop words
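As a minimal sketch of the second and third bullets, custom stop words could be read one per line and merged with the common list. The helper name `combine_stop_words` and the sample words are hypothetical; an open file object could be passed in place of the list of lines.

```python
def combine_stop_words(common_words, custom_lines):
    """Merge common stop words with custom ones supplied one per line."""
    # Strip whitespace, lowercase, and skip blank lines.
    custom_words = {line.strip().lower() for line in custom_lines if line.strip()}
    # A set union removes duplicates; sorting makes the result deterministic.
    return sorted(set(common_words) | custom_words)


# Hypothetical inputs: a tiny common list and two custom words.
common = ['i', 'me', 'my']
custom_lines = ['Nook\n', '\n', 'kindle\n']
all_words = combine_stop_words(common, custom_lines)
print(all_words)
```

Keeping the combined words in a set (or converting to one before filtering) also makes later membership checks cheap.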

In [50]:
common_stop_words = nltk.corpus.stopwords.words('english')
custom_stop_words = []
all_stop_words = np.hstack([common_stop_words, custom_stop_words])
all_stop_words[:5]

array(['i', 'me', 'my', 'myself', 'we'],
      dtype='<U32')

### Clean words
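The function below cleans a review and returns its word tokens:
- Convert all text to lower case using `.lower()`
- Remove every character other than lowercase letters and spaces using `re.sub`
- Split the text on whitespace to get individual words using `.split()`
- Remove stop words using `np.setdiff1d`. Note that `setdiff1d` returns the sorted unique words, so repeated words in a review collapse to a single occurrence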

In [51]:
def clean_review(review_text):
    words_clean = (re.sub('[^a-z ]', '', review_text.lower()).split())
    words_imp = np.setdiff1d(words_clean, all_stop_words)
    return words_imp
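
`np.setdiff1d` returns sorted unique values, so word order and repeat counts are lost. If frequencies matter downstream, a variant along these lines keeps every occurrence. This is a sketch rather than the notebook's method; `clean_review_keep_counts` and the sample stop words are hypothetical.

```python
import re


def clean_review_keep_counts(review_text, stop_words):
    """Lowercase, strip non-letters, split, and drop stop words, keeping repeats."""
    words = re.sub('[^a-z ]', '', review_text.lower()).split()
    stop_set = set(stop_words)  # set membership checks are fast
    return [word for word in words if word not in stop_set]


sample_stop_words = ['the', 'is', 'a', 'and']
tokens = clean_review_keep_counts('The GPS is great, and the GPS is fast!', sample_stop_words)
print(tokens)
```

Here `gps` appears twice in the result, so a later bag-of-words step would count it twice.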

- Create an empty list, `texts`, to hold one nested list per review.
- The for loop below iterates over each review text and calls `clean_review` to clean the text and get the individual tokens for each document.
- The token list for each document is appended to `texts`.

In [52]:
texts = []
for review_text in reviews['reviewText'].dropna().values:
    words_token = clean_review(review_text)
    texts.append(list(words_token))

### Tokens to IDs
- `corpora.Dictionary(texts)` identifies the unique tokens in our corpus
- An integer id is generated for each token, i.e. for each unique word
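
Conceptually, the dictionary is a mapping from each unique token to an integer. The plain-Python sketch below illustrates the idea only; `build_token2id` is a hypothetical helper, and it numbers tokens in order of first appearance, which may differ from the order Gensim uses.

```python
def build_token2id(texts):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


toy_texts = [['gps', 'great'], ['gps', 'screen']]
token2id = build_token2id(toy_texts)
print(token2id)
```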

In [53]:
dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(9454 unique tokens: ['addresses', 'around', 'arrived', 'back', 'bad']...)


`dictionary.token2id` gives a dictionary of the unique words. Keys are the tokens and values are the unique id of each token.

In [54]:
len(dictionary.token2id)

9454

In [55]:
dictionary.token2id['addresses'],dictionary.token2id['around'] 

(0, 1)

### Bag of words
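A bag of words is the frequency of each word in a document. Here we use `.doc2bow` to get the frequency of individual words for each document, as a list of `(token_id, count)` pairs. Because `clean_review` returns only the unique words of each review, every count here is 1. Uncomment the `print` line to see the first document in the corpus.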
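
To make the `(token_id, count)` representation concrete, here is a plain-Python sketch of a bag-of-words conversion. `doc_to_bow` and the toy mapping are hypothetical stand-ins for illustration, not Gensim's implementation.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from token2id."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


toy_token2id = {'gps': 0, 'great': 1, 'screen': 2}
bow = doc_to_bow(['gps', 'great', 'gps', 'unknown'], toy_token2id)
print(bow)
```

The token `gps` (id 0) occurs twice, `great` (id 1) once, and the unknown token is ignored.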

In [56]:
corpus = [dictionary.doc2bow(text) for text in texts]
# print (corpus[0])

### TFIDF - Transformation
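Refer to [this Wikipedia article](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for background on TF-IDF. Note that the fitted `tfidf` model is not applied to the corpus in the cells that follow; the LDA model is trained on the raw bag-of-words `corpus`.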

In [57]:
tfidf = models.TfidfModel(corpus)
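
To make the weighting concrete, here is a plain-Python sketch of one common TF-IDF variant: the raw term count times log2(N / document frequency), followed by L2 normalisation of each document. This is believed to be close to Gensim's default settings, but treat that as an assumption; `tfidf_weights` and the toy corpus are hypothetical.

```python
import math


def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalise."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for bow in bow_corpus:
        weighted = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        # If every weight is 0 the norm is 0, so skip normalisation in that case.
        weighted_corpus.append([(tid, w / norm) for tid, w in weighted] if norm else weighted)
    return weighted_corpus


toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
weighted_corpus = tfidf_weights(toy_corpus)
print(weighted_corpus)
```

Token 0 appears in every toy document, so its weight is 0. In Gensim itself, a fitted model is applied by indexing, e.g. `tfidf[corpus]`.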

### Topic Modelling using LDA
Here we explicitly ask for 3 topics. We pass the model our bag-of-words corpus and the dictionary that maps the unique tokens to their ids.

In [58]:
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3)
lda.print_topics()

[(0,
  '0.005*"nook" + 0.005*"one" + 0.005*"good" + 0.004*"like" + 0.004*"use" + 0.004*"easy" + 0.004*"bought" + 0.004*"would" + 0.004*"get" + 0.004*"books"'),
 (1,
  '0.005*"use" + 0.005*"works" + 0.004*"nook" + 0.004*"great" + 0.004*"one" + 0.004*"would" + 0.004*"screen" + 0.004*"bought" + 0.004*"good" + 0.004*"price"'),
 (2,
  '0.005*"great" + 0.005*"one" + 0.004*"well" + 0.004*"get" + 0.004*"tv" + 0.004*"much" + 0.004*"nook" + 0.004*"would" + 0.004*"screen" + 0.003*"use"')]

### Topics distribution for each document
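Use the `get_document_topics` function to get the probability of each topic in a document. In the following example the probability of the first topic (id 0) is 0.84 and the probability of the second topic (id 1) is 0.14. Topics whose probability falls below the model's minimum threshold are omitted from the list, which is why topic 2 does not appear for the first document.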

In [61]:
lda.get_document_topics(dictionary.doc2bow(texts[0]))

[(0, 0.84772807660700011), (1, 0.1459467461881922)]

In [62]:
lda.get_document_topics(dictionary.doc2bow(texts[50]))

[(0, 0.51998272789982625), (1, 0.46531199807315365), (2, 0.014705274027020153)]
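
Because low-probability topics can be missing from the returned list, it is sometimes convenient to expand the pairs into a fixed-length vector. A minimal sketch, with `to_dense` as a hypothetical helper:

```python
def to_dense(doc_topics, num_topics):
    """Expand (topic_id, probability) pairs into a list with one slot per topic."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = prob
    return dense


# Values copied from the first document's output above.
doc_topics = [(0, 0.84772807660700011), (1, 0.1459467461881922)]
dense = to_dense(doc_topics, num_topics=3)
print(dense)
```

The missing topic is filled with 0.0, so every document yields a vector of the same length.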

### Assign topic to each document based on probability

In [59]:
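# Caution: [0][0] takes the id of the first topic in the returned list. In the outputs above
# the list is ordered by topic id, so this is not necessarily the most probable topic.
reviews['topics'] = [lda.get_document_topics(dictionary.doc2bow(text))[0][0] for text in texts]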
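
To pick the topic with the highest probability rather than the first one listed, a small helper along these lines could be used. `most_probable_topic` is a hypothetical name, and the sample pairs are rounded illustrations.

```python
def most_probable_topic(doc_topics):
    """Return the topic id with the highest probability from (topic_id, prob) pairs."""
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return topic_id


first = most_probable_topic([(0, 0.52), (1, 0.47), (2, 0.01)])
second = most_probable_topic([(0, 0.10), (1, 0.85), (2, 0.05)])
print(first, second)
```

With the notebook's objects, the call would look like `most_probable_topic(lda.get_document_topics(dictionary.doc2bow(text)))` for each `text` in `texts`.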

### Topic distribution in reviews

In [60]:
reviews['topics'].value_counts()

0    854
1    101
2     42
Name: topics, dtype: int64