QUESTION

Group each word into its respective topic

In [2]:

# Importing Libraries we need for our analysis

import pandas as pd
import numpy as np
from numpy import array,asarray,zeros

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import nltk
import re  # "re" stands for regular expression; it provides pattern-based tools for manipulating text.


In [3]:
#Loading the dataset

reddit= pd.read_csv('rspct.tsv', delimiter='\t')
reddit.head()

Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


In [4]:
#Shape of our dataset

reddit.shape

(1013000, 4)

## DATA CLEANING

In [5]:
#Dropping all columns except selftext

reddit.drop(['id','subreddit','title'], axis=1, inplace= True)

In [6]:
reddit.head()

Unnamed: 0,selftext
0,"Hi there, <lb>The usual. Long time lerker, fi..."
1,Did he ever say what his addiction was or is h...
2,Funny story. I went to college in Las Vegas. T...
3,I know this is a sub for the 'Ring Doorbell' b...
4,"Prime95 (regardless of version) and OCCT both,..."


In [7]:
#Sampling our data

data= reddit.sample(frac=0.02)
data.shape

(20260, 1)

> Remove punctuation/lower casing

Next, let’s perform some simple preprocessing on the content of the selftext column to make it more amenable to analysis and to get more reliable results. To do that, we’ll use a regular expression to remove punctuation; the text is lowercased in the tokenization step that follows, because simple_preprocess lowercases its output.

In [8]:
#Remove punctuation

data['selftext_processed']= data['selftext'].map(lambda x: re.sub(r'[,.!?]', '', x))  # raw string avoids the invalid '\.' escape

In [9]:
#Preview the new data

data['selftext_processed'].head()

609343    It's most likely a format but not the one with...
840632    Hi I come from a very old fashioned country It...
355056    I found a person recently selling batch 16L01 ...
702574    Looking for some sound financial advice regard...
321436    This is a repost I'm not sure what happened to...
Name: selftext_processed, dtype: object
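
The character class used above only strips commas, periods, exclamation marks and question marks, which is why apostrophes survive in the preview. As a minimal sketch (the sample sentence and the broader pattern are assumptions for illustration, not part of the notebook's pipeline), the two behaviours can be compared side by side:

```python
import re
import string

# Hypothetical sample post; not taken from the dataset.
sample = "Hi there, it's FAT32... Right?! (I think)"

# Same character class as the notebook cell: only , . ! ? are removed.
narrow = re.sub(r"[,.!?]", "", sample)

# Broader variant (an assumption, not the notebook's method):
# strip all ASCII punctuation, then lowercase.
all_punct = re.compile("[" + re.escape(string.punctuation) + "]")
broad = all_punct.sub("", sample).lower()

print(narrow)  # Hi there it's FAT32 Right (I think)
print(broad)   # hi there its fat32 right i think
```

In this notebook the narrow pattern is sufficient, since the tokenizer in the next step discards the remaining punctuation anyway.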

> Tokenize words and further clean up text

Let's tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

In [10]:
#Loading the necessary library

import gensim
from gensim.utils import simple_preprocess

In [11]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; the tokenizer itself drops punctuation
        

In [12]:
df= data.selftext_processed.values.tolist()
df_words= list(sent_to_words(df))

print(df_words[:1][0][:30])

['it', 'most', 'likely', 'format', 'but', 'not', 'the', 'one', 'with', 'overwriting', 'partition', 'scheme', 'is', 'most', 'likely', 'fat', 'lb', 'lb', 'haven', 'over', 'written', 'anything', 'good', 'nor', 'did', 'put', 'any', 'data', 'lb', 'lb']


## MODELLING

> Creating Bigram and Trigram Models

Bigrams are pairs of words that frequently occur together in a document. Trigrams are sequences of three words that frequently occur together.


In [14]:
#Build the bigram and trigram models

bigram= gensim.models.Phrases(df_words, min_count= 5, threshold= 100) #Higher threshold fewer phrases
trigram= gensim.models.Phrases(bigram[df_words], threshold= 100)

In [15]:
# Faster way to get a sentence clubbed as a trigram/bigram

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
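
To give an intuition for what min_count and threshold control, here is a simplified sketch of count-based bigram scoring in the style of Gensim's default scorer: (pair count - min_count) / (count of first word × count of second word) × vocabulary size. The function name score_bigrams and the toy documents are assumptions for illustration; the real Phrases class handles many more details.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score each adjacent word pair; higher means the pair co-occurs more than chance."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (count - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), count in bigrams.items()
    }

# Hypothetical toy corpus, already tokenized.
docs = [
    ["my", "hard", "drive", "is", "full"],
    ["the", "hard", "drive", "is", "new"],
    ["drive", "to", "work"],
    ["it", "is", "fine"],
]

scores = score_bigrams(docs, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

A pair is kept as a phrase only when its score exceeds the threshold, so raising the threshold (100 in the cell above) keeps fewer, more strongly associated phrases.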

> Remove Stopwords, Make Bigrams and Lemmatize

The phrase models are ready. Let’s define functions to remove stopwords, make bigrams and trigrams, and lemmatize, and then call them sequentially.

In [17]:
#NLTK Stopwords

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [18]:
stop_words= stopwords.words('english')
stop_words.extend(['from','subject','re','edu','use'])

In [19]:
#Define functions for stopwords, bigrams, trigrams and lemmatization

def remove_stopwords(texts):
    return[[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return[bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return[trigram_mod[bigram_mod[doc]] for doc in texts]  # fixed name: the model is trigram_mod, not trigrams_mod

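def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))  # nlp is the spaCy pipeline loaded in a later cell
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out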

Let's call the functions in order.

In [20]:
import spacy

In [21]:
#Remove Stop Words

df_words_nostops= remove_stopwords(df_words)

In [22]:
#Form Bigrams

df_words_bigrams= make_bigrams(df_words_nostops)

In [23]:
# Load spaCy's en_core_web_sm model, disabling the parser and NER components (for efficiency)

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [24]:
# Do lemmatization keeping only noun, adj, vb, adv

df_lemmatized = lemmatization(df_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])


In [25]:
#Previewing output

print(df_lemmatized[:1][0][:30])

['likely', 'format', 'overwrite', 'partition', 'scheme', 'likely', 'fat', 'write', 'good', 'put', 'drive', 'ever', 'fill', 'never', 'fill', 'make', 'sure', 'situation', 'bad', 'try', 'instance', 'also', 'testdisk', 'connect', 'drive', 'see']


> Data transformation: Corpus and Dictionary

The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them.

In [26]:
import gensim.corpora as corpora

In [27]:
# Create Dictionary

id2word = corpora.Dictionary(df_lemmatized)

In [28]:
# Create Corpus

texts = df_lemmatized

In [29]:
# Term Document Frequency

corpus = [id2word.doc2bow(text) for text in texts]

In [30]:
# View

print(corpus[:1][0][:30])

[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)]
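
Each pair in the output above is (word id, count of that word in the document). A minimal pure-Python sketch of the same idea follows; build_vocab and to_bow are hypothetical helpers, and ids here are assigned in first-seen order, which is not necessarily how Gensim's Dictionary numbers them.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents built from tokens seen in the lemmatized preview.
docs = [
    ["likely", "format", "overwrite", "partition", "likely"],
    ["drive", "format"],
]

vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows[0])  # [(0, 2), (1, 1), (2, 1), (3, 1)]
print(bows[1])  # [(1, 1), (4, 1)]
```

Word order is discarded in this representation, which is why LDA is described as a bag-of-words model.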


> Building the Basic Model

We have everything required to train the base LDA model. In addition to the corpus and dictionary, you need to provide the number of topics. alpha and eta are hyperparameters that affect the sparsity of the document-topic and topic-word distributions, respectively. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior (we'll use the defaults for the base model).

chunksize controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory.

passes controls how often we train the model on the entire corpus (set to 10). Another word for passes might be "epochs". iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of "passes" and "iterations" high enough.
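
As a rough back-of-the-envelope sketch of what these settings imply for the sample used here (20,260 documents, chunksize 100, passes 10), the number of chunks processed can be worked out directly. How LdaMulticore distributes chunks across worker processes is not modelled here; that simplification is an assumption of this sketch.

```python
import math

n_docs = 20260    # size of the 2% sample reported earlier in the notebook
chunksize = 100   # documents per chunk, as in the model cell
passes = 10       # full sweeps over the corpus

chunks_per_pass = math.ceil(n_docs / chunksize)
total_chunks = chunks_per_pass * passes

print(chunks_per_pass)  # 203
print(total_chunks)     # 2030
```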

In [31]:
#Build LDA Model

lda_model= gensim.models.LdaMulticore(corpus= corpus,
                                     id2word= id2word,
                                     num_topics= 10,
                                     random_state= 100,
                                     chunksize= 100,
                                     passes= 10,
                                     per_word_topics= True)

> View the topics in LDA model

The above LDA model is built with 10 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics().

In [32]:
from pprint import pprint

In [33]:
# Print the Keyword in the 10 topics

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.025*"card" + 0.018*"buy" + 0.014*"deck" + 0.012*"price" + 0.011*"sell" + '
  '0.010*"good" + 0.008*"ship" + 0.008*"look" + 0.008*"cost" + 0.007*"new"'),
 (1,
  '0.016*"want" + 0.015*"know" + 0.013*"say" + 0.012*"go" + 0.010*"tell" + '
  '0.010*"think" + 0.009*"make" + 0.008*"get" + 0.008*"year" + 0.008*"friend"'),
 (2,
  '0.015*"think" + 0.014*"see" + 0.011*"know" + 0.010*"make" + 0.009*"people" '
  '+ 0.008*"show" + 0.008*"say" + 0.007*"watch" + 0.007*"really" + '
  '0.006*"also"'),
 (3,
  '0.021*"go" + 0.018*"get" + 0.017*"feel" + 0.016*"time" + 0.014*"take" + '
  '0.012*"day" + 0.012*"year" + 0.010*"know" + 0.010*"really" + 0.010*"start"'),
 (4,
  '0.016*"get" + 0.016*"day" + 0.014*"go" + 0.012*"time" + 0.011*"say" + '
  '0.010*"order" + 0.009*"week" + 0.009*"month" + 0.008*"back" + 0.006*"see"'),
 (5,
  '0.012*"work" + 0.009*"thank" + 0.008*"want" + 0.008*"look" + 0.007*"find" + '
  '0.007*"help" + 0.007*"know" + 0.007*"question" + 0.007*"get" + '
  '0.006*"make"'),
 (6,


> Compute Model Perplexity and Coherence Score

Let's calculate the baseline coherence score

In [34]:
from gensim.models import CoherenceModel

In [35]:
# Compute Coherence Score

coherence_model_lda = CoherenceModel(model=lda_model, texts=df_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Coherence Score:  0.36586190443839095
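
The c_v measure used above is fairly involved. To show the general idea behind coherence scoring, here is a sketch of a different and simpler measure, UMass-style coherence, which rewards topic words that tend to appear in the same documents. This is not the c_v computation; umass_coherence and the toy documents are assumptions for illustration, and the sketch assumes every topic word occurs in at least one document.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Average log((D(w_hi, w_lo) + 1) / D(w_hi)) over ranked word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    pair_scores = [
        math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
        for higher, lower in combinations(top_words, 2)  # keeps ranking order
    ]
    return sum(pair_scores) / len(pair_scores)

# Hypothetical toy corpus with two obvious themes.
docs = [
    ["card", "deck", "buy"],
    ["card", "deck", "price"],
    ["game", "play", "win"],
    ["game", "play", "fun"],
]

coherent = umass_coherence(["card", "deck"], docs)
mixed = umass_coherence(["card", "game"], docs)
print(coherent > mixed)  # True
```

Words drawn from one theme score higher than words mixed across themes, which is the behaviour any coherence measure is meant to capture.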


> Hyperparameter Tuning

First, let's differentiate between model hyperparameters and model parameters:

Model hyperparameters can be thought of as settings for a machine learning algorithm that are chosen by the data scientist before training. Examples would be the number of trees in a random forest or, in our case, the number of topics K.

Model parameters can be thought of as what the model learns during training, such as the weights for each word in a given topic.

Now that we have the baseline coherence score for the default LDA model, let's perform a series of sensitivity tests to help determine the following model hyperparameters:

Number of Topics (K)
Dirichlet hyperparameter alpha: Document-Topic Density
Dirichlet hyperparameter beta: Word-Topic Density
We'll perform these tests in sequence, varying one parameter at a time in nested loops while the others are held constant, and run them over the validation corpus sets (only the full corpus is enabled below). We'll use C_v as our metric for performance comparison.

In [36]:
# supporting function

def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=df_lemmatized, dictionary=dictionary, coherence='c_v')  # use the dictionary argument, not the global id2word
    
    return coherence_model_lda.get_coherence()


Let's call the function, and iterate it over the range of topics, alpha, and beta parameter values

In [37]:
import tqdm

In [38]:
grid = {}
grid['Validation_Set'] = {}

In [39]:
# Topics range

min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

In [40]:
# Alpha parameter

alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

In [41]:
# Beta parameter

beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

In [42]:
# Validation sets

num_of_docs = len(corpus)
corpus_sets = [# gensim.utils.ClippedCorpus(corpus, num_of_docs*0.25), 
               # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.5), 
               # gensim.utils.ClippedCorpus(corpus, num_of_docs*0.75), 
               corpus]

In [43]:
corpus_title= ['100% Corpus']

In [44]:
model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

In [45]:
# Can take a long time to run

pbar = tqdm.tqdm(total=len(beta) * len(alpha) * len(topics_range) * len(corpus_title))

# iterate through validation corpora
for i in range(len(corpus_sets)):
    # iterate through number of topics
    for k in topics_range:
        # iterate through alpha values
        for a in alpha:
            # iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word,
                                              k=k, a=a, b=b)
                # Save the model results
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)

                pbar.update(1)

# Persist the results so the search does not need to be rerun
pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
pbar.close()

100%|█████████████████████████████████████████████████████████████████████████████| 270/270 [9:42:22<00:00, 145.19s/it]
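
The 270 iterations reported by the progress bar follow from the grid sizes: 9 topic counts × 6 alpha values × 5 beta values. The sketch below reproduces the loop order with itertools.product; the numeric values are written as literals instead of calling np.arange, which is an assumption made to keep the example dependency-free.

```python
from itertools import product

topics_range = range(2, 11, 1)                                # 2..10 topics
alpha = [0.01, 0.31, 0.61, 0.91, "symmetric", "asymmetric"]   # np.arange values plus two presets
beta = [0.01, 0.31, 0.61, 0.91, "symmetric"]                  # np.arange values plus one preset

# Same nesting order as the loops: topics, then alpha, then beta.
grid = list(product(topics_range, alpha, beta))

print(len(grid))   # 270
print(grid[113])   # (5, 'symmetric', 0.91)
```

Because rows are appended in loop order, position 113 in the saved results corresponds to 5 topics, symmetric alpha and beta 0.91, which matches the top-scoring row in the sorted table.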


In [75]:
tuning = pd.read_csv('lda_tuning_results.csv')  # file written by the tuning loop
tuning.head()

sc=tuning.sort_values('Coherence', ascending=False)
sc.head()

Unnamed: 0,Validation_Set,Topics,Alpha,Beta,Coherence
113,100% Corpus,5,symmetric,0.91,0.466149
98,100% Corpus,5,0.31,0.91,0.451412
218,100% Corpus,9,0.31,0.91,0.443292
262,100% Corpus,10,symmetric,0.61,0.434213
232,100% Corpus,9,symmetric,0.61,0.434126


In [78]:
np.arange(0.01, 1, 0.3)

array([0.01, 0.31, 0.61, 0.91])

## FINAL MODEL TRAINING

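Based on external evaluation (code to be added from an Excel-based analysis), train the final model with the best-scoring configuration from the tuning results: 5 topics, symmetric alpha (the default) and beta (eta) of 0.91.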

In [77]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           eta=0.91)

In [71]:
from pprint import pprint

In [72]:
#Print the keywords in the 5 topics

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.012*"use" + 0.009*"try" + 0.009*"work" + 0.008*"get" + 0.008*"card" + '
  '0.006*"want" + 0.006*"set" + 0.006*"need" + 0.005*"file" + 0.005*"new"'),
 (1,
  '0.014*"go" + 0.013*"feel" + 0.012*"get" + 0.011*"know" + 0.010*"want" + '
  '0.010*"time" + 0.009*"think" + 0.008*"say" + 0.008*"really" + 0.007*"make"'),
 (2,
  '0.011*"play" + 0.010*"game" + 0.010*"get" + 0.010*"think" + 0.010*"make" + '
  '0.009*"see" + 0.008*"know" + 0.008*"look" + 0.007*"good" + 0.007*"really"'),
 (3,
  '0.011*"people" + 0.009*"year" + 0.008*"work" + 0.008*"know" + 0.008*"want" '
  '+ 0.006*"think" + 0.006*"question" + 0.006*"thank" + 0.005*"good" + '
  '0.005*"read"'),
 (4,
  '0.019*"get" + 0.015*"go" + 0.012*"day" + 0.011*"time" + 0.009*"work" + '
  '0.008*"take" + 0.008*"week" + 0.007*"look" + 0.007*"thank" + 0.006*"month"')]
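
The topic strings printed above are convenient to read but awkward to process. A small sketch, assuming the weight*"word" format shown in the output (parse_topic is a hypothetical helper, not a Gensim function), turns one string back into (word, weight) pairs:

```python
import re

def parse_topic(topic_string):
    """Split a print_topics-style string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 2 from the final model's output.
topic_2 = '0.011*"play" + 0.010*"game" + 0.010*"get"'

parsed = parse_topic(topic_2)
print(parsed)  # [('play', 0.011), ('game', 0.01), ('get', 0.01)]
```

Pairs in this form can be loaded into a DataFrame or compared across models without re-parsing the printed text.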


In [73]:
import pyLDAvis.gensim
import pickle 
import pyLDAvis

In [74]:
# Visualize the topics

pyLDAvis.enable_notebook()

LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

LDAvis_prepared

