## **<span style="color:#023e8a"><center> 📊Guided LDA. Semi-supervised TM.</center></span>**
## **<center><span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 5px">If you find this notebook useful or interesting, please, support with an upvote :)</span></center>**

## **<span style="color:#023e8a;font-size:1000%"><center>NLP</center></span><span style="color:#023e8a;font-size:200%"><center>Topic Modeling. Guided LDA.</center></span>**
>**<span style="color:#023e8a;">Hello everyone!</span>**  
>**<span style="color:#023e8a;">I hope you find this notebook interesting and useful. Guided LDA gives you more control over topics than the original LDA.</span>**  
>**<span style="color:#023e8a;">It can be helpful in other competitions, and here it can serve as a new feature. Below, I try to show how to use it.</span>**

# **<a id="Content" style="color:#023e8a;">Table of Contents</a>**
* [**<span style="color:#023e8a;">1. Loading data</span>**](#Loading)  
* [**<span style="color:#023e8a;">2. Text description and word cloud</span>**](#Cloud) 
* [**<span style="color:#023e8a;">3. Data prep and stemming</span>**](#Data)  
* [**<span style="color:#023e8a;">4. Modeling</span>**](#Modeling)  
* [**<span style="color:#023e8a;">5. References</span>**](#References)  

# **<span style="color:#023e8a;">Imports</span>**

In [None]:
import os
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
import numpy as np
from gensim.models.ldamulticore import LdaMulticore
import gensim
from nltk.corpus import stopwords
stops = stopwords.words("english")

# **<span id="Loading" style="color:#023e8a;">1. Loading data</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

Topic modeling is an integral part of `NLP`. Used correctly, it can give a significant boost to an analysis. Alongside classical LDA there is a semi-supervised algorithm: `Guided (Seeded) LDA`.

In [None]:
def load_df():
    train_dir = '../input/feedback-prize-2021/train'
    train_names, train_texts = [], []
    for f in tqdm(os.listdir(train_dir)):
        train_names.append(f.replace('.txt', ''))
        # use a context manager so every file handle is closed after reading
        with open(os.path.join(train_dir, f), 'r') as fh:
            train_texts.append(fh.read())
    train_text_df = pd.DataFrame({'id': train_names, 'text': train_texts})
    return train_text_df

df = load_df()
df.head()

For `LDA` to work well, we need to normalize the text. `Lemmatization` reduces words to their dictionary form, so that "student" and "students" are treated as the same word. However, `stemming` (stripping a word down to its root) also works well for English and is much faster than `lemmatization`, so this notebook uses stemming.

**Learn more**: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
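
As a small illustration of what stemming does, the sketch below runs NLTK's `PorterStemmer` on a few hand-picked words. The word list is made up for this example and is not taken from the competition data.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative words only; they are not drawn from the dataset.
words = ["student", "students", "studies", "studying"]

# Map each word to its stem so inflected forms can be compared.
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:10s} -> {stem}")
```

Note that a stem does not have to be a real word: the goal is only that related forms collapse to the same token.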

# **<span id="Cloud" style="color:#023e8a;">2. Text description and word cloud</span>**

In [None]:
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import seaborn as sns

In [None]:
df['len_text'] = df['text'].apply(len)
df['text_split'] = df['text'].str.split()
df['len_words'] = df['text_split'].apply(len)

**<span style="color:#023e8a;">Histograms of text length in characters and in words</span>**

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=df, x='len_text', bins=30, color='orange')
ax.set_xlabel('length of text in symbols')
plt.show()

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=df, x='len_words', bins=30, color='orange')
ax.set_xlabel('length of text in words')
plt.show()

**<span style="color:#023e8a;">Both statistics have a significant right tail.</span>**

In [None]:
df[['len_text', 'len_words']].describe()

**<span style="color:#023e8a;">Pay attention to the text with the maximum number of characters (much greater than the mean).</span>**

In [None]:
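# select the longest text (18322 characters in this dataset) without hard-coding its length
df[df['len_text'] == df['len_text'].max()].text.values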

**<span style="color:#023e8a;">This text contains many non-breaking spaces (\xa0). Remove them.</span>**

In [None]:
df['text'] = df['text'].str.replace('\xa0', '')
df['text'] = df['text'].str.strip()

df['len_text'] = df['text'].apply(len)
df['text_split'] = df['text'].str.split()
df['len_words'] = df['text_split'].apply(len)
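
To see why non-breaking spaces inflate the character count but not the word count, here is a minimal sketch on a made-up string (not a text from the dataset). It also compares removing `\xa0` with replacing it by a regular space, which is an alternative I am assuming might be preferable when `\xa0` is the only separator between two words.

```python
# Made-up sample: one \xa0 between two words and four trailing ones.
raw = "Cars\xa0are useful.\xa0\xa0\xa0\xa0"

# str.split() with no arguments treats \xa0 as whitespace,
# so the word count ignores it while len() does not.
raw_chars = len(raw)
raw_words = len(raw.split())

# Removing \xa0 outright glues "Cars" and "are" together.
removed = raw.replace("\xa0", "").strip()

# Replacing it with a regular space keeps the word boundary.
replaced = raw.replace("\xa0", " ").strip()

print(raw_chars, raw_words)
print(repr(removed), len(removed.split()))
print(repr(replaced), len(replaced.split()))
```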

In [None]:
df[['len_text', 'len_words']].describe()

**<span style="color:#023e8a;">Now the tails are shorter.</span>**

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=df, x='len_text', bins=30, color='orange')
ax.set_xlabel('length of text in symbols')
plt.show()

In [None]:
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=df, x='len_words', bins=30, color='orange')
ax.set_xlabel('length of text in words')
plt.show()

**<span style="color:#023e8a;">Create a word cloud from the words in our data.</span>**

In [None]:
cloud = WordCloud(background_color="white", max_words=50, stopwords=set(STOPWORDS), width=600, height=300)
f, ax = plt.subplots(figsize=(8, 8))
f.suptitle('WordCloud', fontsize=14)
cloud = cloud.generate(' '.join(df.text.tolist()))
ax.imshow(cloud, interpolation='bilinear')
ax.axis('off')
plt.show()

# **<span id="Data" style="color:#023e8a;">3. Data prep and stemming</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

In [None]:
text = [t.split() for t in df.text.tolist()]

In [None]:
stemmed_text = []
ps = PorterStemmer()
for sentence in tqdm(text):
    sent = []
    for word in sentence:
        sent.append(ps.stem(word))
    stemmed_text.append(sent)

Compare the original text with the stemmed one.

In [None]:
print(*stemmed_text[5][:20])
print(*text[5][:20])

After that, we need to convert the words to a numerical representation. For this you can use:
* `CountVectorizer`
* `TF-IDF`
* `Embeddings`

`CountVectorizer` produces a texts × vocabulary matrix in which each entry is the number of times a word occurs in a given text.

`TF-IDF` stands for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

**Learn more**: https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
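
The difference between raw counts and TF-IDF can be sketched in plain Python on a toy corpus. The three documents are invented for illustration, and the formula used is the textbook `tf * log(N / df)`. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (illustrative only).
docs = [
    ["car", "engine", "car"],
    ["teacher", "student"],
    ["car", "student"],
]

vocab = sorted({word for doc in docs for word in doc})

# Count matrix: one row per document, one column per vocabulary word.
counts = [[Counter(doc)[word] for word in vocab] for doc in docs]

# Document frequency: in how many documents each word appears.
n_docs = len(docs)
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}

# Textbook inverse document frequency, without smoothing.
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocab}

# TF-IDF: raw count weighted by how rare the word is across documents.
tfidf = [[count * idf[word] for count, word in zip(row, vocab)] for row in counts]

print(vocab)
print(counts[0])
print([round(value, 3) for value in tfidf[0]])
```

In the first document "car" occurs twice and "engine" once, yet "engine" gets the higher TF-IDF weight because it appears in only one document.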

`Gensim` builds a bag-of-words representation with the `doc2bow` method, which converts a document (a list of words) into a list of (token_id, token_count) 2-tuples.

In [None]:
dictionary = gensim.corpora.Dictionary(stemmed_text)

Filter the dictionary: remove stopwords, very common tokens (those appearing in more than 70% of texts), and rare tokens (those appearing in fewer than 20 texts).

In [None]:
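# ids of stopwords present in the dictionary (tokens are stemmed, so only stopwords unchanged by stemming match)
stopword_ids = [dictionary.token2id[word] for word in stops if word in dictionary.token2id]
dictionary.filter_tokens(bad_ids=stopword_ids)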
dictionary.filter_extremes(no_below=20, no_above=0.7, keep_n=None)
dictionary.compactify() # remove gaps in id sequence
bow = [dictionary.doc2bow(line) for line in tqdm(stemmed_text)]
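
To make the two thresholds concrete, the following plain-Python sketch mimics the document-frequency rule on a toy corpus. The documents, the helper `keep_token`, and the threshold values are all made up for this illustration. It reflects my understanding that `no_below` is an absolute document count while `no_above` is a fraction of the corpus, and it does not reproduce gensim's implementation.

```python
from collections import Counter

# Toy corpus of stemmed documents (illustrative only).
docs = [
    ["car", "engin", "driver"],
    ["car", "student", "teacher"],
    ["car", "student", "test"],
    ["car", "govern", "state"],
]

# Document frequency: count each token once per document.
doc_freq = Counter(token for doc in docs for token in set(doc))


def keep_token(token, no_below, no_above):
    """Hypothetical helper: keep tokens that are neither too rare nor too common."""
    max_docs = int(no_above * len(docs))  # fraction converted to a document count
    return no_below <= doc_freq[token] <= max_docs


kept = sorted(t for t in doc_freq if keep_token(t, no_below=2, no_above=0.75))
print(doc_freq)
print(kept)
```

Here "car" is dropped because it appears in every document, and tokens seen in a single document are dropped as too rare.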

`Seeded (or Guided) LDA` is a method that lets us add a priori information about the distribution of words over topics. This way we can steer the model toward desired topics using seed word lists instead of relying only on black-box results.


**Learn more**: https://nlp.stanford.edu/pubs/llda-emnlp09.pdf

Let us take "cars" as our first topic. The second will be politics, and the third will be devoted to school life.

In [None]:
cars = ['saloon', 'sedan', 'car', 'automobile', 'corvette', 'motor', 'wheel', 'vehicle', 'roadster', 'supercar', 'driver', 'garage', 'traffic',
       'hybrid', 'engine', 'license']
politics = ['senate', 'democracy', 'negotiation', 'power', 'party', 'government', 'convention', 'delegate', 'political', 'state']
school = ['student', 'teacher', 'principal', 'project', 'subject', 'curriculum', 'mark', 'assessment', 'test', 'discipline', 'graduation']

school = [ps.stem(word) for word in school]
politics = [ps.stem(word) for word in politics]
cars = [ps.stem(word) for word in cars]

Prepare topics with topic words.

In [None]:
seed_topics = {}
for word in cars:
    seed_topics[word] = 0
for word in politics:
    seed_topics[word] = 1
for word in school:
    seed_topics[word] = 2

The `create_eta` function builds the eta matrix, which holds the a priori weight of each word in each topic.

In [None]:
def create_eta(priors, etadict, ntopics):
    # (ntopics, nterms) matrix of prior weights, initialised to 1
    eta = np.full(shape=(ntopics, len(etadict)), fill_value=1.0)
    for word, topic in priors.items():  # for each seed word and its topic
        token_id = etadict.token2id.get(word)  # look up the word in the dictionary
        if token_id is not None:  # skip seed words that were filtered out of the dictionary
            eta[topic, token_id] = 1e7  # put a large prior weight on the seeded topic
    eta = np.divide(eta, eta.sum(axis=0))  # normalise each word's column so it sums to 1 over topics
    return eta
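
To show what such a matrix looks like, here is a self-contained sketch on a four-word toy vocabulary. The names `toy_token2id` and `toy_seeds` are invented for this example, and a plain dict stands in for the gensim dictionary.

```python
import numpy as np

# Toy vocabulary and seed assignment (illustrative only).
toy_token2id = {"car": 0, "engin": 1, "student": 2, "weather": 3}
toy_seeds = {"car": 0, "engin": 0, "student": 1}  # word -> topic index
n_topics = 3

# Start from a flat prior: every word is equally likely in every topic.
toy_eta = np.ones((n_topics, len(toy_token2id)))

# Put a very large weight on each seed word in its assigned topic.
for word, topic in toy_seeds.items():
    if word in toy_token2id:
        toy_eta[topic, toy_token2id[word]] = 1e7

# Normalise each column so a word's weights sum to 1 across topics.
toy_eta /= toy_eta.sum(axis=0)

print(np.round(toy_eta, 3))
```

Seeded words end up with almost all of their prior mass on one topic, while the unseeded word "weather" keeps a uniform prior of 1/3 per topic.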

Number of topics = 4:
* `cars`
* `politics`
* `school life`
* `common topic`

# **<span id="Modeling" style="color:#023e8a;">4. Modeling</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Contents</span>**](#Content)

In [None]:
eta = create_eta(seed_topics, dictionary, 4)

In [None]:
lda_model = LdaMulticore(corpus=bow,
                         id2word=dictionary,
                         num_topics=4,
                         eta=eta,
                         chunksize=2000,
                         passes=5,
                         random_state=42,
                         alpha='symmetric',
                         per_word_topics=True)

You can vary the number of topics and use `Coherence` for model selection. You can also seed each topic with more words for better results.

The topics concerned with cars, politics and school are easy to identify.

In [None]:
for num, params in lda_model.print_topics():
    print(f'{num}: {params}\n')

May all of you be lucky in the competition. Hopefully, this notebook will be useful for you.

# **<span id="References" style="color:#023e8a;">5. References</span>**

https://www.kaggle.com/raghavendrakotala/fine-tunned-on-roberta-base-as-ner-problem-0-533  
https://www.kaggle.com/julian3833/topic-modeling-with-lda
