# Idea:
Our solution: LDA + keywords from clusters of BERT based embeddings of noun phrases and verbs :
- Each noun phrase and verb in the texts is  transformed to embedding vector using Universal Sentence Encoder (transformer based on BERT)
- Embedding vectors from (a) are clustered (HDBSCAN + UNET)
- Words/phrases with embedding vectors closest to the centers of resulting clusters form key word/phrase
- Each text in the training sample is converted to collection of key-phrases by replacing its noun phrases and verbs with keyword/phrases and deleting other words
- LDA is performed on the transformed texts


**Reference:**<br>
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. **Universal Sentence Encoder.** *arXiv:1803.11175, 2018.*
- McInnes, L, Healy, J, **UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction**, *ArXiv e-prints 1802.03426, 2018*

# Load data and python libraries

In [1]:
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

# topic modeling libraries
import pyLDAvis.gensim 

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# topic modeling libraries
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel


# supporting libraries
import pandas as pd
import time
import pickle
import topic_modeling_v3 as tm

  from collections import Iterable
  from collections import Mapping
  from numpy.dual import register_func
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  method='lar', copy_X=True, eps=np.finfo(np.float).eps,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  method='lar', copy_X=True, eps=np.finfo(np.float).eps,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  eps=np.finfo(np.float).e

In [2]:
# load data
with open("./transition_files/df_train_for_LDA.pickle", 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    df_train = pickle.load(f)

print("df_train.shape:", df_train.shape)
print("df_train.columns:",df_train.columns)

df_train.shape: (33982, 18)
df_train.columns: Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words'],
      dtype='object')


***

In [3]:
#prepare data for LDA
start_time = time.time()
df_data_1 = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                    params={"TEXT_prepared_df": df_train,
                                     "save_LDA_dictionary_path": "./output/lda_keywords/dictionary1.pickle",
                                     "text_column": "text"
                                     },
                                    verbose=2)
end_time = time.time()
print("Processing time in minutes:", round((end_time - start_time)/60,2))

loaded data shape: (33982, 18)

Number of unique key-words for topic modeling dictionary: 4330
LDA dictionary file is saved to: ./output/lda_keywords/dictionary1.pickle

Number of texts processed:  33982
Number of extracted key-words:  4330

Each text is represented by list of  4330  tuples: 
		(key-words's index in bag-of-words dictionary, key-words's term frequency)
Processing time in minutes: 0.05


In [4]:
#first level of topics
start_time = time.time()
df_first_level = tm.train_model(model_type="LDA-KeyWords",
                            params={"num_topics": 10,
                                    "LDA_prepared_df": df_data_1,
                                    "LDA_dictionary_path": "./output/lda_keywords/dictionary1.pickle",
                                    "save_LDA_model_path": "./output/lda_keywords/LDA_model1"
                                    },
                               verbose=2)
end_time = time.time()
print("Processing time in minutes:", round((end_time - start_time)/60,2))

Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
loaded data shape: (33982, 19)

Creating document-term matrix for LDA...

Training LDA model with  10  topics...
LDA model file is saved to: ./output/lda_keywords/LDA_model1
Top topic indexes are selected. NOTE "-1" corresponds to top topic with probability < 20%
Processing time in minutes: 1.38


In [5]:
#value count of TOP level topics
df_first_level['first_level_topic'] = df_first_level['top_topic']
df_first_level['first_level_topic_proba'] = df_first_level['top_topic_proba']
df_first_level['first_level_topic'].value_counts().sort_index()

0        1
1        2
2        5
3     5593
4     1050
5     4065
6     1368
7    13293
8     8598
9        7
Name: first_level_topic, dtype: int64

In [6]:
df_first_level.columns

Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'doc2bow', 'infered_topics', 'top_topic',
       'top_topic_proba', 'first_level_topic', 'first_level_topic_proba'],
      dtype='object')

In [7]:
#df_first_level[df_first_level['first_level_topic'] == 0]

In [8]:
#df_first_level[df_first_level['first_level_topic'] == 0]['first_10_sents'].iloc[0]

In [9]:
#df_first_level[df_first_level['first_level_topic'] == 0]['all_key_words'].iloc[0]

In [10]:
df_first_level = df_first_level.drop(columns=['doc2bow',
       'infered_topics', 'top_topic', 'top_topic_proba'])

***
# Get SECOND level topics (LDA)

In [11]:
first_level_topics = list(set(df_first_level['first_level_topic']))
first_level_topics

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [12]:
start = time.time()
list_dfs = []
for topic in first_level_topics[:2]:
    print("\nSelected topic index:", topic)
    df_topic = df_first_level[df_first_level['first_level_topic'] == topic].copy()
    save_dict_path = "./output/lda_keywords/dictionary1_"+str(topic+1)+".pickle"
    save_LDA_model_path = "./output/lda_keywords/LDA_model1_" + str(topic + 1)
    
    df_data_tmp = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                       params={"TEXT_prepared_df": df_topic,
                                               "save_LDA_dictionary_path": save_dict_path
                                               },
                                       verbose=1)

    df_2nd_tmp = tm.train_model(model_type="LDA-KeyWords",
                                params={"num_topics": 10,
                                        "LDA_prepared_df": df_data_tmp,
                                        "LDA_dictionary_path": save_dict_path,
                                        "save_LDA_model_path": save_LDA_model_path
                                        },
                                verbose=1)

    #value counts of SECOND level topics
    print("\nValue counts of SECOND level topics:")
    df_2nd_tmp['second_level_topic'] = df_2nd_tmp['top_topic']
    df_2nd_tmp['second_level_topic_proba'] = df_2nd_tmp['top_topic_proba']
    print(df_2nd_tmp['second_level_topic'].value_counts().sort_index())

    print("#"*50)
    df_2nd_tmp = df_2nd_tmp.drop(columns=['doc2bow',
                                           'infered_topics', 'top_topic', 'top_topic_proba'])
    list_dfs.append(df_2nd_tmp)
finish = time.time()


Selected topic index: 0
Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1

Value counts of SECOND level topics:
1    1
Name: second_level_topic, dtype: int64
##################################################

Selected topic index: 1
Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_2

Value counts of SECOND level topics:
0    1
3    1
Name: second_level_topic, dtype: int64
##################################################


In [13]:
print("Time of gettig Second level topics in minutes:", round((finish-start)/60,2))
df_second_level = pd.concat(list_dfs)
df_second_level.columns

Time of gettig Second level topics in minutes: 0.01


Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'first_level_topic', 'first_level_topic_proba',
       'second_level_topic', 'second_level_topic_proba'],
      dtype='object')

***
# Get THIRD level topics

In [14]:
df_second_level[['first_level_topic',
       'first_level_topic_proba', 'second_level_topic',
       'second_level_topic_proba']].describe()

Unnamed: 0,first_level_topic,first_level_topic_proba,second_level_topic,second_level_topic_proba
count,3.0,3.0,3.0,3.0
mean,0.666667,0.614413,1.333333,0.886535
std,0.57735,0.175463,1.527525,0.161597
min,0.0,0.412583,0.0,0.7
25%,0.5,0.556282,0.5,0.837838
50%,1.0,0.699982,1.0,0.975676
75%,1.0,0.715329,2.0,0.979802
max,1.0,0.730676,3.0,0.983928


In [15]:
start = time.time()
list_dfs = []

for topic_1st in first_level_topics:
    print("\nSelected FIRST level topic index:",topic_1st)
    df_1st_tmp = df_second_level[df_second_level['first_level_topic'] == topic_1st].copy()
    second_level_topics = list(set(df_1st_tmp['second_level_topic']))
    print("second_level_topics", second_level_topics)
    
    for topic_2nd in second_level_topics:
        print("\nSelected topics' indexes:", (topic_1st, topic_2nd))
        
        save_dict_path = "./output/lda_keywords/dictionary1_"+str(topic_1st+1)+"_"+str(topic_2nd+1)+".pickle"
        save_LDA_model_path = "./output/lda_keywords/LDA_model1_"+str(topic_1st+1)+"_"+str(topic_2nd+1)
        
        df_2nd_tmp = df_1st_tmp[df_1st_tmp['second_level_topic'] == topic_2nd].copy()
        
        df_data_tmp = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                           params={"TEXT_prepared_df": df_2nd_tmp,
                                                   "save_LDA_dictionary_path": save_dict_path
                                                   },
                                           verbose=1)

        df_3d_tmp = tm.train_model(model_type="LDA-KeyWords",
                                    params={"num_topics": 10,
                                            "LDA_prepared_df": df_data_tmp,
                                            "LDA_dictionary_path": save_dict_path,
                                            "save_LDA_model_path": save_LDA_model_path,
                                            },
                                    verbose=1)

        #value counts of SECOND level topics
        print("\nValue counts of SECOND level topics:")
        df_3d_tmp['third_level_topic'] = df_3d_tmp['top_topic']
        df_3d_tmp['third_level_topic_proba'] = df_3d_tmp['top_topic_proba']
        print(df_3d_tmp['second_level_topic'].value_counts().sort_index())

        print("#"*50)
        df_3d_tmp = df_3d_tmp.drop(columns=['doc2bow',
                                               'infered_topics', 'top_topic', 'top_topic_proba'])
        list_dfs.append(df_3d_tmp)
finish = time.time()


Selected FIRST level topic index: 0
second_level_topics [1]

Selected topics' indexes: (0, 1)
Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1_2

Value counts of SECOND level topics:
1    1
Name: second_level_topic, dtype: int64
##################################################

Selected FIRST level topic index: 1
second_level_topics [0, 3]

Selected topics' indexes: (1, 0)
Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_2_1

Value counts of SECOND level topics:
0    1
Name: second_level_topic, dtype: int64
##################################################

Selected topics' indexes: (1, 3)
Training LDA with BERT-UMAP-HDBSCAN clustered KeyWords (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_2_4

Value counts of SECOND level topics:
3    1
Name: second_level_topi

In [16]:
print("Time of gettig Third level topics in minutes:", round((finish-start)/60,2))
df_third_level = pd.concat(list_dfs)
df_third_level.columns

Time of gettig Third level topics in minutes: 0.01


Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'first_level_topic', 'first_level_topic_proba',
       'second_level_topic', 'second_level_topic_proba', 'third_level_topic',
       'third_level_topic_proba'],
      dtype='object')

***

# Evaluate 

In [17]:
df_result = df_third_level.copy()

# Name Topics (as a most frequent noun in the cluster)

In [18]:
df = tm.get_topic_names(df_result, 'first_level_topic', 'list_of_nouns')
df['second_level_topic'] = tm.get_topic_names(df_result, 
                                              'second_level_topic', 'list_of_nouns')['second_level_topic']
df['third_level_topic'] = tm.get_topic_names(df_result, 
                                             'third_level_topic', 'list_of_nouns')['third_level_topic']
df[['publication', 
    'section',
    'first_level_topic',
    'second_level_topic',
    'third_level_topic'
   ]].iloc[::1000].head(10).T

Unnamed: 0,22590
publication,Wired
section,culture
first_level_topic,0
second_level_topic,1
third_level_topic,1


# For test:
1) extract noun_phrases_lemmatised and verb_lemmas from text

2) For Each word from (1):
- get text emmbedings
- pretrained UNET -> reduced dimentions
- get clusters from pretrained HDBSCSAN clustering
- get claster label (keyWords)

3) replace text with keyWords
4) get topics from pretrained LDA