### LDA

Approach
- Kept only nouns, adjectives, verbs, and adverbs from the corpus
- Lemmatized those words
- Split each text into individual words, lowercasing them and stripping accents (`deacc=True`), e.g. in Polish names
- LDA needs a bag-of-words (BOW) representation, so we build a dictionary of unique words and count each word's frequency in each document

Update:
- The words picked up by the first model were not very informative and looked like stopwords, so we needed a way to remove them and surface words that actually help identify topics.
- To do this, we build bigrams and trigrams from the corpus.
- We then use TF-IDF to remove words that are still too frequent. This sometimes removes important words as well, so check the dropped words manually and add back any that matter.
- Finally, we build the BOW on this new data (with bigrams and trigrams) and train the LDA model.


In [3]:
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#1introduction
import numpy as np
import json
import glob

#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#spacy
import spacy
from nltk.corpus import stopwords

#vis
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3+

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

Prepare the data

In [4]:
def load_data(file):
    with open (file, "r", encoding="utf-8") as f:
        data = json.load(f) 
    return (data)

def write_data(file, data):
    with open (file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

In [5]:
stop_words = stopwords.words("english")  # separate name so the imported nltk module is not shadowed

In [6]:
print (stop_words)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [32]:
data = load_data("data/ushmm_dn.json")["texts"]

print (data[0][0:90])

 My name David Kochalski. I was born in a small town called , and I was born May 5, 1928. 


You can use nltk or spaCy for lemmatization; here we use spaCy.

Keep only nouns, adjectives, verbs, and adverbs.

In [33]:
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        new_text = []
        for token in doc:
            if token.pos_ in allowed_postags:
                new_text.append(token.lemma_)
        final = " ".join(new_text)
        texts_out.append(final)
    return (texts_out)


lemmatized_texts = lemmatization(data)
print (lemmatized_texts[0][0:90])

name bear small town call bear very hard work child father mother small mill flour buckwhe


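Split each lemmatized text into individual words, because the Gensim model needs tokenized input. `simple_preprocess` lowercases the text, strips accents (`deacc=True`), and drops very short and very long tokens; it does not apply the stopword list.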

In [34]:
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)

print (data_words[0][0:20])

['name', 'bear', 'small', 'town', 'call', 'bear', 'very', 'hard', 'work', 'child', 'father', 'mother', 'small', 'mill', 'flour', 'buckwheat', 'prosperous', 'comfortable', 'go', 'school']
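As a rough illustration of what `simple_preprocess(text, deacc=True)` does to each text, here is a standard-library sketch. It is an approximation and not Gensim's implementation: the helper names `strip_accents` and `simple_tokens` are made up for this example, and the length limits assume Gensim's defaults of 2 and 15 characters.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. 'ó' -> 'o' (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase, strip accents, keep non-digit word tokens within the length limits."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d]+", text)  # runs of word characters without digits
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = simple_tokens("I was born in Kraków, May 5, 1928.")
print(tokens)  # ['was', 'born', 'in', 'krakow', 'may']
```

Common words such as `was` and `in` survive this step, which is one reason stopword-like words can still reach the model.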


In [35]:
#BIGRAMS AND TRIGRAMS
# min_count=5: two words must occur together at least 5 times to be considered a bigram.
# threshold=100: a strict cutoff, so we get fewer such phrases.
bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=100)

bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

def make_bigrams(texts):
    return([bigram[doc] for doc in texts])

def make_trigrams(texts):
    return ([trigram[bigram[doc]] for doc in texts])

data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

print (data_bigrams_trigrams[0][0:100])

['name', 'bear', 'small', 'town', 'call', 'bear', 'very', 'hard', 'work', 'child', 'father', 'mother', 'small', 'mill', 'flour', 'buckwheat', 'prosperous', 'comfortable', 'go', 'school', 'public', 'school', 'morning', 'afternoon', 'go', 'religious', 'school', 'almost', 'late', 'night', 'raise', 'spirit', 'school', 'little', 'city', 'segregate', 'mind', 'small', 'town', 'say', 'majority', 'people', 'small', 'town', 'jewish', 'people', 'town', 'somehow', 'know', 'separate', 'jewish', 'child', 'catholic', 'child', 'know', 'most', 'people', 'use', 'friend', 'feel', 'maybe', 'personally', 'know', 'lot', 'incident', 'small', 'little', 'call', 'separate', 'other', 'word', 'hardly', 'get', 'together', 'incident', 'incident', 'pleasant', 'incident', 'call', 'house', 'people', 'regardless', 'religious', 'believe', 'really', 'religious', 'people', 'other', 'lovely', 'family', 'city', 'even', 'time', 'go', 'underground', 'religious', 'institution', 'parent', 'say', 'very']
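To get a feel for how `min_count` and `threshold` interact, the sketch below scores a candidate word pair with the formula that, to my understanding, Gensim's default scorer uses (following Mikolov et al.). The function name `phrase_score` and all counts are hypothetical and chosen only for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram: higher means the pair co-occurs more than chance suggests."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Two fairly rare words that almost always appear together (hypothetical counts).
rare_pair = phrase_score(count_a=20, count_b=25, count_ab=15, vocab_size=10_000)

# Two very common words that co-occur the same number of times (hypothetical counts).
common_pair = phrase_score(count_a=2_000, count_b=1_500, count_ab=15, vocab_size=10_000)

threshold = 100
print(rare_pair > threshold)    # True: would be joined into one token
print(common_pair > threshold)  # False: stays as two separate words
```

With a threshold of 100, only strongly associated pairs are merged, which may explain why the first 100 tokens printed above contain no joined phrases.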


In [36]:
print (data_bigrams_trigrams[0])

['name', 'bear', 'small', 'town', 'call', 'bear', 'very', 'hard', 'work', 'child', 'father', 'mother', 'small', 'mill', 'flour', 'buckwheat', 'prosperous', 'comfortable', 'go', 'school', 'public', 'school', 'morning', 'afternoon', 'go', 'religious', 'school', 'almost', 'late', 'night', 'raise', 'spirit', 'school', 'little', 'city', 'segregate', 'mind', 'small', 'town', 'say', 'majority', 'people', 'small', 'town', 'jewish', 'people', 'town', 'somehow', 'know', 'separate', 'jewish', 'child', 'catholic', 'child', 'know', 'most', 'people', 'use', 'friend', 'feel', 'maybe', 'personally', 'know', 'lot', 'incident', 'small', 'little', 'call', 'separate', 'other', 'word', 'hardly', 'get', 'together', 'incident', 'incident', 'pleasant', 'incident', 'call', 'house', 'people', 'regardless', 'religious', 'believe', 'really', 'religious', 'people', 'other', 'lovely', 'family', 'city', 'even', 'time', 'go', 'underground', 'religious', 'institution', 'parent', 'say', 'very', 'religious', 'aware', 'g

Create an id2word dictionary and a BOW corpus from this new data with bigrams and trigrams.

In [37]:
### TF-IDF
from gensim.models import TfidfModel

id2word = corpora.Dictionary(data_bigrams_trigrams)

texts = data_bigrams_trigrams

corpus = [id2word.doc2bow(text) for text in texts]
# print (corpus[0][0:20])

tfidf = TfidfModel(corpus, id2word=id2word)

low_value = 0.03
words  = []
words_missing_in_tfidf = []

### Remove words that occur too frequently, as they add no value to the clustering algorithm.
### In some cases this also drops words that are actually valuable, so inspect `words`
### to see what is being dropped and curate a list of words to keep.
for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_scores = tfidf[bow]
    tfidf_ids = [word_id for word_id, value in tfidf_scores]
    bow_ids = [word_id for word_id, value in bow]
    low_value_words = [word_id for word_id, value in tfidf_scores if value < low_value]
    # Words with a TF-IDF score of 0 are missing from the TF-IDF output.
    # Compute them before building `drops` so they belong to the current document.
    words_missing_in_tfidf = [word_id for word_id in bow_ids if word_id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(id2word[item])

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    corpus[i] = new_bow
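The filtering idea can be sketched on a toy corpus with the standard library alone. This assumes the weighting I believe `TfidfModel` uses by default (raw term count times `log2(N / document frequency)`, then L2-normalized per document); the documents and the helper name `tfidf_weights` are invented for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): "go" appears in every document.
docs = [
    ["go", "school", "town", "go"],
    ["go", "camp", "train"],
    ["go", "town", "family", "go", "go"],
]
n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf_weights(doc):
    """Return {word: weight} for one document; zero-weight words are omitted."""
    counts = Counter(doc)
    raw = {w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:
        return {}
    return {w: v / norm for w, v in raw.items() if v > 0}


low_value = 0.03
weights = tfidf_weights(docs[0])
kept = [w for w in dict.fromkeys(docs[0]) if weights.get(w, 0.0) >= low_value]
print(kept)  # ['school', 'town']
```

A word that occurs in every document gets a weight of zero and disappears from the output, which is why the loop above also has to look for ids that are missing from the TF-IDF result.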

Create the bag-of-words dictionary. It maps each unique word to an id, and the corpus then stores how often each id occurs in each document.

In [None]:
# id2word = corpora.Dictionary(data_words)

# corpus = []
# for text in data_words:
#     new = id2word.doc2bow(text)
#     corpus.append(new)

# print (corpus[0][0:20])

# word = id2word[[0][:1][0]]
# print (word)

[(0, 2), (1, 10), (2, 1), (3, 2), (4, 1), (5, 1), (6, 2), (7, 3), (8, 1), (9, 12), (10, 1), (11, 8), (12, 1), (13, 2), (14, 1), (15, 3), (16, 2), (17, 1), (18, 1), (19, 1)]
able
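The `(id, count)` pairs shown above can be reproduced in miniature without Gensim. The sketch below is only an illustration: the toy documents are invented, and ids are assigned in alphabetical order, which is an assumption of this example and not necessarily how `corpora.Dictionary` numbers words.

```python
from collections import Counter

# Toy tokenized documents (hypothetical).
docs = [
    ["small", "town", "small", "mill"],
    ["town", "school", "school"],
]

# Build the word <-> id mappings (alphabetical ids are an assumption for this sketch).
vocab = sorted({word for doc in docs for word in doc})
word2id = {word: idx for idx, word in enumerate(vocab)}
id2word_toy = {idx: word for word, idx in word2id.items()}


def to_bow(doc):
    """Convert a list of tokens to sorted (word_id, count) pairs."""
    counts = Counter(word2id[word] for word in doc)
    return sorted(counts.items())


bows = [to_bow(doc) for doc in docs]
print(bows[0])         # [(0, 1), (2, 2), (3, 1)]
print(id2word_toy[2])  # small
```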


In [38]:
id2word[5]

'accord'

In [39]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,  ## id -> word dictionary
                                           num_topics=7,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha="auto")
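
For reference, the model fitted here assumes the standard LDA generative process. With $K$ topics (`num_topics`), a document-topic prior $\alpha$ (`alpha`), and a topic-word prior $\eta$ (Gensim's `eta`, left at its default in the call above), each word $w_{d,n}$ in document $d$ is assumed to be generated as:

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ is the topic assigned to the $n$-th word. With `alpha="auto"`, Gensim learns an asymmetric $\alpha$ from the corpus instead of using a fixed symmetric one.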

### Visualize the data

We see some problems:
1. We asked for 30 clusters but can see only 10.
2. The words showing up in the clusters look like stopwords that should be removed. They don't explain any particular topic, so we need custom stopword removal to get better topics.

In [22]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis



## New model built on bigrams and trigrams, with frequent words removed using TF-IDF

In [40]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis

