## **<span style="color:#023e8a"><center> 🔥Guided LDA. Semi-supervised TM.🔥</center></span>**
## **<center><span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 5px">If you find this notebook useful or interesting, please, support with an upvote :)</span></center>**

## **<span style="color:#023e8a;font-size:1000%"><center>NLP</center></span><span style="color:#023e8a;font-size:200%"><center>Topic Modeling. Semi-supervised LDA.</center></span>**
>**<span style="color:#023e8a;">Hello everyone!</span>**  
>**<span style="color:#023e8a;">I hope you find this notebook interesting and useful. Compared with the original LDA, Guided LDA gives you more control over the topics.</span>**

# **<a id="Content" style="color:#023e8a;">Table of Contents</a>**
* [**<span style="color:#023e8a;">1. Downloading data</span>**](#Downloading)  
* [**<span style="color:#023e8a;">2. Data prep and stemming</span>**](#Data)  
* [**<span style="color:#023e8a;">3. Modeling</span>**](#Modeling)   

In [None]:
import os
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
import numpy as np
from gensim.models.ldamulticore import LdaMulticore
import gensim
import re
from nltk.corpus import stopwords
stops = stopwords.words("english")  # needs the NLTK 'stopwords' corpus to be available

# **<span id="Downloading" style="color:#023e8a;">1. Downloading data</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv')

In [None]:
train.head()

# **<span id="Data" style="color:#023e8a;">2. Data prep and stemming</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

**<span style="color:#023e8a;">`LDA` works better when different forms of a word are merged, so we need to normalize the text. `Lemmatization` brings words to their dictionary form, so that "student" and, for instance, "students" are treated as the same word. `Stemming`, which strips affixes and keeps only the stem of the word, is also an appropriate tool for English and is much faster than `lemmatization`.</span>**

**<span style="color:#023e8a;">Learn more</span>**: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
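
**<span style="color:#023e8a;">A minimal sketch of what stemming does, using the same `PorterStemmer` as the rest of this notebook. The word pairs are illustrative and not taken from the dataset.</span>**

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Inflected forms collapse onto a shared stem, so LDA counts them as one term.
pairs = [("student", "students"), ("storm", "storms"), ("flood", "flooding")]
for base, inflected in pairs:
    print(base, "->", ps.stem(base), "|", inflected, "->", ps.stem(inflected))

# True when every pair ends up with the same stem.
same_stem = all(ps.stem(a) == ps.stem(b) for a, b in pairs)
print("all pairs share a stem:", same_stem)
```

**<span style="color:#023e8a;">Keep in mind that a stem does not have to be a dictionary word, which is why the topics printed later contain truncated tokens.</span>**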

**<span style="color:#023e8a;">Cleaning function. Thanks to:</span>** https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove

In [None]:
# Compile the patterns once instead of on every loop iteration
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')
EMOJI_RE = re.compile("["
                      u"\U0001F600-\U0001F64F"  # emoticons
                      u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                      u"\U0001F680-\U0001F6FF"  # transport & map symbols
                      u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                      u"\U00002702-\U000027B0"
                      u"\U000024C2-\U0001F251"
                      "]+", flags=re.UNICODE)
NON_LETTER_RE = re.compile(r'[^A-Za-z\s]')  # raw string avoids the invalid-escape warning for \s

def text_cleaning(texts):
    texts_cleaning = []
    for txt in tqdm(texts):
        txt = EMOJI_RE.sub('', txt)       # drop emoji
        txt = HTML_RE.sub('', txt)        # drop HTML tags
        txt = URL_RE.sub('', txt)         # drop links
        txt = NON_LETTER_RE.sub('', txt)  # keep only letters and whitespace
        texts_cleaning.append(txt.lower())
    return texts_cleaning

text = text_cleaning(train.text.tolist())

In [None]:
text = [t.split() for t in text]  # whitespace tokenization
ps = PorterStemmer()  # already imported in the first cell
# Stem every word of every tokenized tweet
stemmed_text = [[ps.stem(word) for word in sentence] for sentence in tqdm(text)]

**<span style="color:#023e8a;">Comparing original and stemmed texts.</span>**

In [None]:
print(*stemmed_text[5][:20])
print(*text[5][:20])

**<span style="color:#023e8a;">After that, we need to convert the words to a numerical representation. For this you can use:</span>**
* `CountVectorizer`
* `TF-IDF`
* `Embeddings`  

`CountVectorizer` **<span style="color:#023e8a;">gives a documents × vocabulary matrix in which each entry is the number of times a word occurs in a document.</span>**

`TF-IDF` **<span style="color:#023e8a;">stands for term frequency–inverse document frequency, a numerical statistic that reflects how important a word is to a document in a collection or corpus.</span>**

**<span style="color:#023e8a;">Learn more</span>**: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
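
**<span style="color:#023e8a;">To make the difference between raw counts and `TF-IDF` concrete, here is a small pure-Python sketch on a made-up corpus of three stemmed documents. It uses the textbook formula `tf * log(N / df)`. Libraries such as scikit-learn apply smoothed variants, so their numbers will differ.</span>**

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-stemmed tokens.
docs = [
    ["flood", "warn", "flood"],
    ["storm", "warn"],
    ["storm", "flood", "rescu"],
]

# Raw counts: one Counter per document (the rows of a document-term matrix).
counts = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(term, doc_counts):
    """Textbook tf-idf: raw term count times log(N / df)."""
    return doc_counts[term] * math.log(n_docs / df[term])

for i, doc_counts in enumerate(counts):
    weights = {t: round(tf_idf(t, doc_counts), 3) for t in doc_counts}
    print(i, dict(doc_counts), weights)
```

**<span style="color:#023e8a;">In the last document "rescu" gets a higher weight than "storm" although both occur once, because "rescu" appears in fewer documents.</span>**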

**<span style="color:#023e8a;">`Embeddings` represent words as numerical vectors.</span>**



**<span style="color:#023e8a;">`Gensim` builds a bag of words (bow) with the `doc2bow` method. This method converts a document (a list of words) into the bag-of-words format: a list of (token_id, token_count) 2-tuples.  </span>**
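
**<span style="color:#023e8a;">The bag-of-words format can be sketched in plain Python. The helpers `build_token2id` and `to_bow` below are hypothetical stand-ins written for illustration, not gensim code, and the ids gensim assigns to tokens may differ.</span>**

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, token_count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["fire", "forest", "fire"], ["storm", "fire"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

**<span style="color:#023e8a;">Word order is lost and only the counts per token id remain, which is all LDA needs.</span>**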

**<span style="color:#023e8a;">TF-IDF is not used as input for LDA; the recommendation is to use bow:</span>**  

>In fact, Blei (who developed LDA), points out in the introduction of the paper of 2003 (entitled "Latent Dirichlet Allocation") that LDA addresses the shortcomings of the TF-IDF model and leaves this approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and words in topics. The weighting of TF-IDF is not necessary for this.  

**<span style="color:#023e8a;">Learn more</span>**: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf


**<span style="color:#023e8a;">So `CountVectorizer` or `bow` are appropriate inputs for LDA. </span>**

In [None]:
dictionary = gensim.corpora.Dictionary(stemmed_text)

**<span style="color:#023e8a;">Filter the dictionary: remove stopwords, very common words (appearing in more than 70% of texts) and rare words (appearing in fewer than 20 texts).</span>**

In [None]:
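# The texts were cleaned and stemmed, so clean and stem the stopwords the same way before the lookup
stem_stops = {ps.stem(re.sub(r'[^a-z]', '', word)) for word in stops}
stopword_ids = [dictionary.token2id[word] for word in stem_stops if word in dictionary.token2id]
dictionary.filter_tokens(bad_ids=stopword_ids)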
dictionary.filter_extremes(no_below=20, no_above=0.7, keep_n=None)
dictionary.compactify() # remove gaps in id sequence
bow = [dictionary.doc2bow(line) for line in tqdm(stemmed_text)]
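
**<span style="color:#023e8a;">The filtering rule above can be sketched in plain Python on a made-up corpus. The function `keep_tokens` is a hypothetical helper that mirrors the idea of `no_below` and `no_above` (both are based on the number of documents a token appears in). It is not gensim's implementation.</span>**

```python
from collections import Counter

def keep_tokens(docs, no_below, no_above, stopwords=()):
    """Tokens found in at least `no_below` documents, in at most a
    `no_above` fraction of all documents, and not listed in `stopwords`."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {
        token
        for token, freq in doc_freq.items()
        if no_below <= freq <= max_docs and token not in stopwords
    }

# Hypothetical toy corpus.
docs = [
    ["the", "fire", "spread"],
    ["the", "storm", "hit"],
    ["the", "fire", "crew"],
    ["the", "storm", "fire"],
]
kept = keep_tokens(docs, no_below=2, no_above=0.8, stopwords={"the"})
print(sorted(kept))
```

**<span style="color:#023e8a;">Words seen in a single document are dropped as too rare, and "the" is dropped both as a stopword and because it occurs in every document.</span>**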


**<span style="color:#023e8a;">`Seeded (or Guided) LDA` is a method that lets us add a priori information about the distribution of words in topics. This way we can steer a topic toward a given vocabulary instead of depending only on black-box results.</span>**


**<span style="color:#023e8a;">Learn more</span>** (Labeled LDA, a related supervised variant): https://nlp.stanford.edu/pubs/llda-emnlp09.pdf

**<span style="color:#023e8a;">Here we create one topic dedicated to disasters. The second topic will cover the rest of the tweets.</span>**

In [None]:
disasters = ['disaster', 'bloodbath', 'collapse', 'crash', 'meltdown', 'doomsday', 'convulsion', 'accident', 'casualty', 'fatality', 
            'blast', 'catastrophe', 'traffic','hybrid', 'engine', 'license', 
            'tsunami', 'volcano','tornado','avalanche','earthquake','blizzard','drought','bushfire','tremor','magma','twister',
            'windstorm','cyclone','flood','fire','hailstorm','lava','lightning','hail','hurricane','seismic','erosion','whirlpool','whirlwind',
            'cloud','thunderstorm','barometer','gale','blackout','gust','force','volt','snowstorm','rainstorm','storm','nimbus','violent storm',
            'sandstorm','fatal','cumulonimbus','death','lost','destruction','money','tension','cataclysm','damage','uproot','underground',
            'destroy','arsonist','wind scale','arson','rescue','permafrost','fault','shelter', 'bomb', 'suicide', 'tragedy', 'weapon']

disasters = [ps.stem(word) for word in disasters]

**<span style="color:#023e8a;">The `create_eta` function builds the eta matrix with the a priori word weights for each topic. First we create a dict that maps every seed word to our first topic, `disaster`. Then we create a matrix of ones and put a very large number in the row of the first topic (index 0) for each word from that dict. </span>**

In [None]:
seed_topics = {}
for word in disasters:
    seed_topics[word] = 0

In [None]:
def create_eta(priors, etadict, ntopics):
    # (ntopics, nterms) float matrix filled with 1: the same prior weight for every word
    eta = np.full(shape=(ntopics, len(etadict)), fill_value=1.0)
    for word, topic in priors.items(): # for each seed word and its topic
        keyindex = etadict.token2id.get(word) # look up the word id in the gensim Dictionary
        if keyindex is not None: # if it's in the dictionary
            eta[topic, keyindex] = 1e7  # put a large number in there
    eta = np.divide(eta, eta.sum(axis=0)) # normalize each word's column so it sums to 1 over all topics
    return eta

In [None]:
eta = create_eta(seed_topics, dictionary, 2)
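
**<span style="color:#023e8a;">To see what this normalization produces, here is the same idea on a tiny made-up vocabulary, using only NumPy. The vocabulary and the seed words are assumptions chosen for illustration.</span>**

```python
import numpy as np

vocab = ["fire", "storm", "love", "music"]  # hypothetical toy vocabulary
seeds = {"fire": 0, "storm": 0}             # seed word -> topic index
num_topics = 2

# Start from equal weights for every (topic, word) pair.
toy_eta = np.ones((num_topics, len(vocab)))
for word, topic in seeds.items():
    toy_eta[topic, vocab.index(word)] = 1e7  # strong prior for the seeded topic

# Normalize each column (word) so its weights sum to 1 across topics.
toy_eta /= toy_eta.sum(axis=0)

print(np.round(toy_eta, 3))
```

**<span style="color:#023e8a;">The seeded words put almost all of their prior weight on topic 0, while the unseeded words stay evenly split between both topics.</span>**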

**<span style="color:#023e8a;">Number of topics = 2:</span>**
* `disasters`
* `common topic`

# **<span id="Modeling" style="color:#023e8a;">3. Modeling</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

In [None]:
lda_model = LdaMulticore(corpus=bow, # bag of words
                         id2word=dictionary, # maps ids back to words, so topics are printed as words, not ids
                         num_topics=2,
                         eta=eta, # our eta matrix with the seed-word priors
                         chunksize=2000,
                         passes=10,
                         random_state=42,
                         alpha='symmetric', # prior on the per-document topic distribution; use 'symmetric' if unsure
                         per_word_topics=True)

**<span style="color:#023e8a;">You may change the number of topics and use the `Coherence` metric for model selection. You may also seed the topics with more words for better results.</span>**

In [None]:
for num, params in lda_model.print_topics():
    print(f'{num}: {params}\n')

**<span style="color:#023e8a;">You can also improve the results by comparing the `Coherence` metric of LDA models with different numbers of topics. </span>**

**<span style="color:#023e8a;">These results may help to improve the results of a main model, or you may try to separate disaster tweets from the rest using Guided LDA alone. Creating an a priori word distribution over topics helps to get appropriate results. </span>**

## **<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 5px">Thank you for reading! Please upvote this notebook if you learned something new :)</span>**