# Idea:
Our solution: LDA + BERT based embeddings of noun phrases and verbs :
- Each noun phrase and verb in the texts is  transformed to embedding vector using Universal Sentence Encoder (transformer based on BERT)
- Embedding vectors from (a) are clustered (HDBSCAN + UNET)
- Words/phrases with embedding vectors closest to the centers of resulting clusters form key word/phrase
- Each text in the training sample is converted to collection of key-phrases by replacing its noun phrases and verbs with keyword/phrases and deleting other words
- LDA is performed on the transformed texts


**Reference:**<br>
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. **Universal Sentence Encoder.** *arXiv:1803.11175, 2018.*

# Load data and python libraries

In [1]:
# data processing libraries
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# display wider columns in pandas data frames where necessary
pd.set_option('max_colwidth',150)

In [2]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.2.0


In [3]:
import tensorflow_hub as hub

#Load the Universal Sentence Encoder's TF Hub module
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)

print ("module %s loaded" % module_url)

module https://tfhub.dev/google/universal-sentence-encoder-large/5 loaded


In [4]:
df_train = pd.read_csv("./transition_files/train.tsv", sep='\t')
print("df_train.shape:", df_train.shape)
print("df_train.shape:",df_train.columns)

df_train.shape: (33982, 12)
df_train.shape: Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas'],
      dtype='object')


# Getting text clusters through sentence embedding comparison

In [5]:
def get_embeddings(input):
    return model(input)

In [6]:
def get_word_embeddings(df_data, column = "word", N_batches=1):
    #split data into N batches
    N = N_batches

    part = int(len(df_data)/N)
    print(N, "batches with", part + 1, column + "s each")

    #get embeddings for each N words
    index = 0
    batch_num = 0
    list_dfs = []

    while index < len(df_data): 
        df_tmp = df_data.iloc[index : index + part].copy()
        df_tmp = df_tmp.reset_index(drop=True)
        print ("Batch number:", batch_num + 1, "out of ", N)

        df_batch_embeddings = pd.DataFrame(get_embeddings(list(df_tmp[column])).numpy())

        num_embeddings = df_batch_embeddings.shape[1]
        columns = ["emb_" + str(i) for i in range(512)]
        df_tmp[columns] = df_batch_embeddings

        list_dfs.append(df_tmp)
        batch_num = batch_num + 1
        index = index + part

    #concatinate batches into single dataset
    df_emb = pd.concat(list_dfs)

    return df_emb

In [7]:
df_train['noun_phrases'] = df_train['noun_phrases'].str[2:-2]
df_train['noun_phrases'] = df_train['noun_phrases'].str.lower().str.split("', '")
df_train['noun_phrases'].head()

0    [rise, big emerging economy, china, india, steady march, globalisation, surge, number, people, business, tourism, result, demand, visa, unpreceden...
1    [pfizer, commitment, corporate social responsibility csr, drugs giant talk, responsibility, society, world, access, product, work, ngos, global he...
2    [week, federal reserve, interest rate, time, year, world, central bank, rate, recent year, long spell, course, chart, outcome, americas rate rise,...
3    [cruise line, wave, year, nearly, holiday, sea, result, december 18th carnival, worlds largest operator, global market, fullyear earning, demand, ...
4    [investors, calendar year, buoyant mood, unexpected event, consensus, respect, view, investor, market price, column, potential surprise, definitio...
Name: noun_phrases, dtype: object

In [8]:
all_NPs = list(df_train['noun_phrases'])
all_NPs = [np for l in all_NPs for np in l if len(np)>0]
all_NPs[:5], len(all_NPs)

(['rise', 'big emerging economy', 'china', 'india', 'steady march'], 1417049)

In [9]:
df_train['list_of_verb_lemmas'].iloc[0]

'[emerging, led, wanting, travel, granted, Upgrade, travel, apply, submit, streamline, scrap]'

In [10]:
df_train['list_of_verb_lemmas'] = df_train['list_of_verb_lemmas'].str[2:-2]
df_train['list_of_verb_lemmas'] = df_train['list_of_verb_lemmas'].str.lower().str.split(", ")
df_train['list_of_verb_lemmas'].head()

0                                                               [merging, led, wanting, travel, granted, upgrade, travel, apply, submit, streamline, scra]
1    [rided, embracing, insists, gain, strengthen, improve, deterred, seeking, intends, shift, domiciled, rejoiced, saved, paid, outraged, promised, im...
2    [aised, ended, celebrate, tried, lift, forced, reverse, cut, help, understand, upgrade, strike, wish, save, spend, try, escape, slashing, encourag...
3      [race, booked, improve, announced, control, demand, peaking, piling, based, got, moving, upgrade, increase, announced, establish, aimed, based, ad]
4    [tart, caught, proved, reflected, like, suggest, judged, betting, expect, upgrade, weakens, having, pushed, tighten, buy, priced, doubt, tighten, ...
Name: list_of_verb_lemmas, dtype: object

In [11]:
all_Vs = list(df_train['list_of_verb_lemmas'])
all_Vs = [v for l in all_Vs for v in l if len(v)>0]
all_Vs[:5], len(all_Vs)

(['merging', 'led', 'wanting', 'travel', 'granted'], 675330)

In [12]:
all_words =  list(set(all_NPs + all_Vs))
len(set(all_words))

419327

In [13]:
df_words = pd.DataFrame({'word': all_words})
df_words.head()

Unnamed: 0,word
0,rival fortnite
1,potentially shameinducing data point
2,largest recorded leak
3,voracious reader
4,perfect pendular motion


In [14]:
%%time
#creating word2vec matrix
df_w2v = get_word_embeddings(df_words, column = "word", N_batches=100)
df_w2v.head()

100 batches with 4194 words each
Batch number: 1 out of  100
Batch number: 2 out of  100
Batch number: 3 out of  100
Batch number: 4 out of  100
Batch number: 5 out of  100
Batch number: 6 out of  100
Batch number: 7 out of  100
Batch number: 8 out of  100
Batch number: 9 out of  100
Batch number: 10 out of  100
Batch number: 11 out of  100
Batch number: 12 out of  100
Batch number: 13 out of  100
Batch number: 14 out of  100
Batch number: 15 out of  100
Batch number: 16 out of  100
Batch number: 17 out of  100
Batch number: 18 out of  100
Batch number: 19 out of  100
Batch number: 20 out of  100
Batch number: 21 out of  100
Batch number: 22 out of  100
Batch number: 23 out of  100
Batch number: 24 out of  100
Batch number: 25 out of  100
Batch number: 26 out of  100
Batch number: 27 out of  100
Batch number: 28 out of  100
Batch number: 29 out of  100
Batch number: 30 out of  100
Batch number: 31 out of  100
Batch number: 32 out of  100
Batch number: 33 out of  100
Batch number: 34 ou

Unnamed: 0,word,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,...,emb_502,emb_503,emb_504,emb_505,emb_506,emb_507,emb_508,emb_509,emb_510,emb_511
0,rival fortnite,-0.028948,0.064054,0.01419,0.07321,-0.088359,0.110766,0.07767,0.035829,-0.032533,...,-0.042242,-0.030448,-0.030571,-0.032998,0.034555,0.113697,0.018538,0.038067,-0.000274,0.024339
1,potentially shameinducing data point,0.021808,-0.037296,-0.00588,0.041247,0.018105,-0.053182,-0.016855,0.056972,-0.046322,...,-0.033671,0.052779,-0.049987,-0.060884,-0.043943,0.062954,0.008635,0.013899,-0.009536,-0.042446
2,largest recorded leak,0.021657,0.065861,0.025957,0.004588,-0.07457,-0.005124,0.038522,-0.015959,-0.038741,...,-0.038165,0.023003,-0.03718,-0.007455,-0.026193,0.066191,-0.11558,0.04889,0.070551,0.033378
3,voracious reader,0.068322,-0.005061,2.5e-05,0.000334,0.007211,-0.036811,0.044334,0.021194,0.008174,...,0.006287,0.035995,-0.044218,-0.004517,0.006305,0.001964,-0.029977,-0.027318,0.01579,-0.007895
4,perfect pendular motion,0.0207,-0.016234,0.029166,0.038285,0.013007,-0.013796,0.030645,-0.022268,0.110818,...,0.023359,-0.040579,-0.006551,-0.035399,0.013381,-0.003827,-0.056309,0.024389,0.005806,0.035837


In [15]:
df_w2v.iloc[::15000]

Unnamed: 0,word,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,...,emb_502,emb_503,emb_504,emb_505,emb_506,emb_507,emb_508,emb_509,emb_510,emb_511
0,rival fortnite,-0.028948,0.064054,0.01419,0.07321,-0.088359,0.110766,0.07767,0.035829,-0.032533,...,-0.042242,-0.030448,-0.030571,-0.032998,0.034555,0.113697,0.018538,0.038067,-0.000274,0.024339
2421,philip morris internationals,0.071751,0.058493,-0.002705,0.040606,-0.022526,0.081023,-0.014479,0.043452,-0.051692,...,0.03539,-0.051603,-0.014338,-0.007827,0.075689,0.006039,-0.02207,-0.014119,0.06653,0.036217
649,unplugged,-0.044945,0.019505,0.042838,-0.024796,-0.085768,-0.033376,0.019646,0.06518,-0.061526,...,0.07208,0.064702,-0.025158,-0.011246,-0.058174,-0.105868,0.006102,-0.044,-0.010283,-0.025208
3070,annual meeting,-0.083934,-0.006209,-0.011543,-0.071184,0.041844,0.046135,-0.01282,-0.112799,0.030378,...,-0.026774,0.015173,0.041242,0.017139,0.021064,0.052726,0.011243,0.062688,0.013496,-0.055573
1298,rigidly enforced gender expectation,-0.020639,0.051757,-0.02431,-0.052477,0.06412,-0.079127,-0.069198,0.018345,-0.065469,...,0.068233,0.042767,-0.023927,-0.012723,-0.028446,0.042471,-0.007701,-0.015071,0.005724,-0.003037
3719,postpothole repair,0.007223,-0.036479,-0.012374,0.029721,-0.00458,-0.002598,-0.000942,0.003334,-0.04068,...,-0.033686,-0.029744,0.007175,-0.008106,-0.031989,0.044643,-0.019443,-0.005706,-0.006232,0.067022
1947,republican senator jon kyl,-0.074448,0.017038,-0.037902,-0.01197,0.002294,0.048126,-0.020591,0.00625,-0.027261,...,-0.036317,-0.027186,0.125992,0.015419,-0.039449,0.055503,-0.024551,0.07603,-0.022705,-0.019698
175,magdalena zernickagoetz,0.004264,0.018595,0.012869,0.032346,0.016008,0.017975,0.048702,-0.011801,-0.042248,...,-0.007407,-0.020057,-0.000438,0.023585,-0.046515,0.003248,0.00125,0.018682,-0.015019,0.022444
2596,nearly 2bn,-0.03594,0.005396,-0.032637,0.071618,-0.031461,0.046243,-0.01178,0.033512,0.061371,...,-0.065193,-0.060827,-0.003603,0.006782,-0.003571,0.047496,-0.034156,0.002539,0.012971,0.039395
824,border angola,-0.016436,0.029175,-0.115872,-0.052781,0.047338,-0.02511,-0.045445,0.012612,-0.048793,...,-0.010449,-0.112085,0.11498,-0.005818,-0.031272,0.061169,0.032391,0.069337,-0.006547,-0.051745


# Next:
- Unet
- HDBSCAN
- defining cluster center names (key words)
- replacing texts with list of key-words
- running LDA


https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6