---

In [1]:
%load_ext autoreload
%autoreload 2

In [12]:
# Add the src directory to the import path so the util module can be imported
import sys
sys.path.append(sys.path[0].replace('notebooks', 'src'))
# Module lives at /src/feature_engineering/engineering_util.py
import feature_engineering.engineering_util as eng

---

In [13]:
data = eng.load_data('../data/interim/processed_sample.csv')

In [14]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | id | url | title | text | processed |
|---|---|---|---|---|---|---|
| 0 | 0 | 8kqg3d | https://www.reddit.com/r/relationship_advice/c... | Staying friends with an ex | I was with this guy for about 3 months. During... | ['stay', 'friend', 'ex', 'guy', '3', 'month', ... |
| 1 | 1 | 8kqe1l | https://www.reddit.com/r/relationship_advice/c... | I [29/f] am not sure if he [34/m] is really in... | I met this guy over the past summer and we hun... | ['29', 'f', 'sure', '34', 'm', 'interested', '... |
| 2 | 2 | 8kqb8v | https://www.reddit.com/r/relationship_advice/c... | How to give my number to my work-crush W/O bei... | (please delete if not allowed) \nHi all! Not s... | ['number', 'work', 'crush', 'without', 'creepy... |
| 3 | 3 | 8kqa1n | https://www.reddit.com/r/relationship_advice/c... | im a cheater | i know what i did was wrong. but what should i... | ['be', 'cheater', 'know', 'wrong', 'kiss', 'be... |
| 4 | 4 | 8kq87l | https://www.reddit.com/r/relationship_advice/c... | I am afraid of getting hurt by my boyfriend again | Me and my boyfriend are 5 months together, we ... | ['afraid', 'get', 'hurt', 'boyfriend', 'boyfri... |


<br>
The first thing to do is extract our corpus from the dataframe, then format it for use in Gensim. 

In [15]:
corpus = eng.get_corpus(data, 'processed')

In [16]:
lda = eng.LDA(corpus)
lda.format_corpus()

<feature_engineering.engineering_util.LDA at 0x7fae4c2ed9d0>
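
As a rough picture of what this step involves: the `processed` column appears to be stored in the CSV as stringified Python lists, so it has to be parsed back into token lists and then mapped to the `(token_id, count)` pairs that Gensim models consume. The sketch below is a pure-Python stand-in, not the code in `engineering_util.py`. `parse_tokens`, `build_vocab`, and `to_bow` are hypothetical names; in practice Gensim's `Dictionary` and its `doc2bow` method cover the last two steps.

```python
import ast
from collections import Counter

def parse_tokens(cell):
    """Turn a stringified list such as "['stay', 'friend']" back into a list."""
    return ast.literal_eval(cell)

def build_vocab(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, the shape a bag-of-words corpus uses."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Toy stand-in for the 'processed' column (assumed to hold stringified lists).
raw_column = ["['stay', 'friend', 'ex', 'friend']", "['afraid', 'get', 'hurt', 'ex']"]
docs = [parse_tokens(cell) for cell in raw_column]
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus)
```

Each document ends up as a short list of id/count pairs, which is why the vocabulary mapping has to be kept alongside the corpus.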

# LDA

The final LDA model uses Gensim's Mallet wrapper. I compared it with Gensim's ```LdaMulticore``` model, measuring model performance with the ```c_v``` coherence measure. The ```c_v``` scores, together with inspection of each topic's highest-probability words, led me to choose 20 as the number of topics.

**Multicore**

In [17]:
lda_multi = lda.train()

**Mallet**

In [19]:
mallet_path = '../models/mallet-2.0.8/bin/mallet' 
lda_mallet = lda.train(mallet_path = mallet_path)
# lda_mallet.save('models/lda/lda_model') # saving model

### LDA Coherence

In [20]:
for name, model in [('Multicore', lda_multi), ('Mallet', lda_mallet)]:
    cv_score = eng.LdaEval(model).coherence_score(texts = lda.pruned_corpus,
                                                  dictionary = lda.formatted_dict)
    print(f'{name} coherence: {cv_score:.3f}')

Multicore coherence: 0.268
Mallet coherence: 0.345


```
Original mallet coherence score: 0.3940258230372272
```
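
For intuition about what a coherence score rewards, here is a simplified, document-level NPMI coherence in plain Python. It is not `c_v` (which builds on NPMI with sliding windows and an indirect cosine confirmation step), and it is not what `LdaEval.coherence_score` runs. `npmi_coherence` is a hypothetical helper and the toy documents are made up for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are estimated from document-level co-occurrence:
    P(w) is the share of documents that contain w.
    """
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimum NPMI by convention
        elif p12 == 1:
            scores.append(1.0)   # co-occur in every document: maximum NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

docs = [
    ['marriage', 'divorce', 'husband'],
    ['marriage', 'husband', 'wife'],
    ['friend', 'party', 'drink'],
    ['friend', 'drink', 'text'],
]
coherent = npmi_coherence(['marriage', 'husband'], docs)
mixed = npmi_coherence(['marriage', 'drink'], docs)
print(f'coherent topic: {coherent:.3f}, mixed topic: {mixed:.3f}')
```

Words that always appear together score near 1 and words that never share a document score -1, so a topic whose top words belong together gets a higher average.
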
<br>
I'm also extracting the LDA topic probability vectors for use in the embeddings later.

In [21]:
lda_vectors = lda.get_vec_lda(model = lda_mallet, corpus = lda.formatted_corpus, num_topics = 20)
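
Gensim topic models typically return a document's topic distribution as a sparse list of `(topic_id, probability)` pairs, so building fixed-length vectors means filling in zeros for the topics that are missing. A minimal sketch, assuming that sparse format; `to_dense` is a hypothetical helper and not necessarily how `get_vec_lda` is written.

```python
def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

# Made-up distributions for two documents over 5 topics.
doc_topics = [
    [(0, 0.7), (3, 0.3)],
    [(1, 0.1), (2, 0.5), (4, 0.4)],
]
vectors = [to_dense(doc, num_topics=5) for doc in doc_topics]
print(vectors)
```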

### LDAvis

In [22]:
# pyLDAvis.enable_notebook()
# eng.LdaEval(lda_mallet).lda_vis(corpus = lda.formatted_corpus, dictionary = lda.formatted_dict)

# Doc2Vec

The ```D2V``` class essentially trains a Doc2Vec distributed bag of words (DBOW) model. It also has a few extra methods for an experimental approach, Topic2Vec, inspired by https://arxiv.org/pdf/1506.08422.pdf. The implementation using document tags was adapted from this thread: https://groups.google.com/u/1/g/gensim/c/BVu5-pD6910/m/7G_UM9vBJAAJ. Unfortunately, the results weren't as promising as the original paper's, and I did not pursue it beyond an exploratory eyeball check.

In [23]:
d2v = eng.D2V(corpus = lda.pruned_corpus, lda_model = lda_mallet, lda_vocab = lda.pruned_vocab)

Each unique word in the corpus receives its own topic tag, based on the topic it belongs to with the highest probability. During the standard Doc2Vec TaggedDocument phase, every topic that appears in a document is added as a tag. Using the example below, if ```fight``` appeared in a document, that document would get an additional tag for ```topic_0```. For a more granular view, please refer to the src folder.

In [24]:
#Get topic tags
d2v.get_topic_tags()
d2v.topic_tags['fight']

('topic_0', 0.030745814307458142)

In [25]:
d2v.tag_docs(topic2vec=True)

<feature_engineering.engineering_util.D2V at 0x7fae49fa8590>
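
The tagging scheme described above can be sketched in plain Python. This is an illustration under assumptions, not the code in `src`: `word_topic_probs` is a made-up word-to-topic probability table, and `best_topic_tags` and `tag_document` are hypothetical helpers. With Gensim, each `(tokens, tags)` pair would then be wrapped in a `TaggedDocument`.

```python
def best_topic_tags(word_topic_probs):
    """Map each word to (tag, prob) for its highest-probability topic."""
    tags = {}
    for word, probs in word_topic_probs.items():
        topic_id = max(range(len(probs)), key=probs.__getitem__)
        tags[word] = (f'topic_{topic_id}', probs[topic_id])
    return tags

def tag_document(doc_id, tokens, topic_tags):
    """Return (tokens, tags): the document's own tag plus every topic tag it triggers."""
    topics = sorted({topic_tags[t][0] for t in tokens if t in topic_tags})
    return tokens, [f'doc_{doc_id}'] + topics

# Made-up probabilities of each word under three topics.
word_topic_probs = {
    'fight': [0.03, 0.01, 0.00],
    'argue': [0.02, 0.01, 0.00],
    'wedding': [0.00, 0.00, 0.04],
}
topic_tags = best_topic_tags(word_topic_probs)
tokens, tags = tag_document(0, ['fight', 'wedding', 'plan'], topic_tags)
print(topic_tags['fight'])
print(tags)
```

Because the topic tags are shared across documents, each one gets its own trained vector alongside the per-document vectors.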

In [26]:
d2v_model = d2v.model_train()

In [27]:
doc_vectors = d2v.get_vec_d2v(d2v_model)

### Similarity sanity checks

Checking whether the topic vectors are distinguishable from one another (they are not: every topic returns nearly the same neighbors).

In [28]:
# for i in range(20): # range(k), where k is the number of topics
#     print(f'Topic {i}:')
#     print('\n')
#     print(d2v_model.wv.similar_by_vector(d2v_model.docvecs[f'topic_{i}'], topn = 5))
#     print('\n')

```
Topic 0:
[('basically', 0.5020793080329895), ('think', 0.4969105124473572), ('know', 0.48571306467056274), ('go', 0.48498255014419556), ('turn', 0.47766807675361633)]


Topic 1:
[('basically', 0.49959078431129456), ('think', 0.4901582896709442), ('go', 0.4857536554336548), ('turn', 0.47744661569595337), ('know', 0.4764932692050934)]


Topic 2:
[('basically', 0.5042178630828857), ('think', 0.5036580562591553), ('know', 0.49271059036254883), ('go', 0.492182195186615), ('turn', 0.48015865683555603)]


Topic 3:
[('basically', 0.5106383562088013), ('think', 0.5039139986038208), ('go', 0.49394136667251587), ('know', 0.49392950534820557), ('turn', 0.4874500036239624)]


Topic 4:
[('think', 0.5074610114097595), ('basically', 0.5069828629493713), ('know', 0.49426162242889404), ('go', 0.49001604318618774), ('turn', 0.4818304479122162)]


Topic 5:
[('basically', 0.5152652263641357), ('think', 0.4949542284011841), ('go', 0.4939420223236084), ('come', 0.485797643661499), ('turn', 0.48388174176216125)]


Topic 6:
[('basically', 0.5038674473762512), ('think', 0.4972285032272339), ('go', 0.4946080446243286), ('know', 0.48309510946273804), ('turn', 0.4805246591567993)]


Topic 7:
[('think', 0.5109632015228271), ('basically', 0.5051131844520569), ('know', 0.4949929416179657), ('go', 0.4863765835762024), ('turn', 0.48439496755599976)]


Topic 8:
[('basically', 0.5024288892745972), ('go', 0.49992817640304565), ('think', 0.49113553762435913), ('know', 0.47811466455459595), ('turn', 0.47655951976776123)]


Topic 9:
[('basically', 0.5008957386016846), ('think', 0.4980354309082031), ('know', 0.48374733328819275), ('go', 0.48040711879730225), ('turn', 0.47383102774620056)]


Topic 10:
[('basically', 0.5187561511993408), ('think', 0.5091937780380249), ('go', 0.5029682517051697), ('know', 0.494728684425354), ('turn', 0.49202293157577515)]


Topic 11:
[('think', 0.5103544592857361), ('basically', 0.5048021078109741), ('go', 0.4966704249382019), ('know', 0.4958295226097107), ('turn', 0.4859817326068878)]


Topic 12:
[('think', 0.5095230340957642), ('basically', 0.5090795159339905), ('turn', 0.5081111192703247), ('go', 0.4984132647514343), ('know', 0.4896523356437683)]


Topic 13:
[('basically', 0.5102134346961975), ('go', 0.5037297010421753), ('think', 0.4998404383659363), ('know', 0.48913341760635376), ('turn', 0.48528119921684265)]


Topic 14:
[('basically', 0.5128666758537292), ('think', 0.500403642654419), ('go', 0.4965251684188843), ('know', 0.4870407283306122), ('turn', 0.4844188988208771)]


Topic 15:
[('basically', 0.504505455493927), ('think', 0.4960896372795105), ('go', 0.4941573441028595), ('know', 0.47950294613838196), ('turn', 0.4766482710838318)]


Topic 16:
[('think', 0.5055217742919922), ('basically', 0.4969904124736786), ('turn', 0.4829714894294739), ('know', 0.4826893210411072), ('go', 0.478921502828598)]


Topic 17:
[('basically', 0.5069231986999512), ('go', 0.49580562114715576), ('think', 0.4897996187210083), ('turn', 0.4851585924625397), ('come', 0.4790429472923279)]


Topic 18:
[('basically', 0.5036921501159668), ('think', 0.49252209067344666), ('go', 0.4887089431285858), ('know', 0.4831080138683319), ('turn', 0.47812655568122864)]


Topic 19:
[('basically', 0.5063236951828003), ('think', 0.501762330532074), ('know', 0.49207597970962524), ('go', 0.48725268244743347), ('turn', 0.4832802414894104)]
```
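
One way to put a number on this lack of separation is to compare the topic vectors with each other directly: if the pairwise cosine similarities are all close to 1, the tags have collapsed onto roughly the same direction. A minimal sketch with made-up 3-dimensional vectors; with the trained model, the inputs would be the vectors looked up for each `topic_{i}` tag.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy stand-ins: nearly parallel vectors vs. orthogonal ones.
collapsed = [[1.0, 1.0, 0.9], [1.0, 0.9, 1.0], [0.9, 1.0, 1.0]]
separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(f'collapsed: {mean_pairwise_cosine(collapsed):.3f}')
print(f'separated: {mean_pairwise_cosine(separated):.3f}')
```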

<br>
Always need to have fun with a couple of vector operations.

In [29]:
# d2v_model.wv.most_similar(positive=['king', 'woman'], negative=['man'])

```
[('queen', 0.45131027698516846),
 ('sized', 0.33133023977279663),
 ('girl', 0.3231159746646881),
 ('size', 0.3218972086906433),
 ('favorite', 0.3210022449493408),
 ('lion', 0.3180762231349945),
 ('princess', 0.31422391533851624),
 ('blast', 0.31358087062835693),
 ('movie', 0.30862587690353394),
 ('league', 0.3079490065574646)]
```

In [30]:
# d2v_model.wv.most_similar(positive=['marriage', 'cheating'], negative=['trust'])

```
[('divorce', 0.5356012582778931),
 ('married', 0.5003902912139893),
 ('marry', 0.4952297806739807),
 ('affair', 0.48001086711883545),
 ('relationship', 0.44852685928344727),
 ('infidelity', 0.4479179382324219),
 ('wife', 0.4458017647266388),
 ('ltr', 0.44243866205215454),
 ('husband', 0.43437737226486206),
 ('engage', 0.42213284969329834)]
```
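
These analogy queries come down to plain vector arithmetic: add the unit-normalized positive vectors, subtract the negative ones, and rank the rest of the vocabulary by cosine similarity to the result. The sketch below mimics that with a hand-made 2-dimensional vocabulary. The coordinates and the `analogy` helper are invented for illustration and are not taken from the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        query = [q + sign * x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative  # skip the query words
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Axis 0 ~ "royalty", axis 1 ~ "gender" (made-up coordinates).
vectors = {
    'king': [0.9, 0.9], 'queen': [0.9, -0.9],
    'man': [0.1, 0.9], 'woman': [0.1, -0.9],
    'movie': [-0.8, 0.1],
}
result = analogy(vectors, positive=['king', 'woman'], negative=['man'])
print(result)
```

In this toy space `queen` ranks first because subtracting `man` and adding `woman` flips the second axis while leaving the first mostly intact.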

# BERT embeddings

The last step before constructing the contextual embeddings is to extract the BERT embeddings for each document.

In [31]:
bert = eng.Bert(lda.corpus, eng.SentenceTransformer('bert-base-nli-max-tokens'))
bert_embeddings = bert.join_docs().transform_corpus()

# Contextual embeddings

In [33]:
encodings = eng.ConcatVectors(lda_vectors, doc_vectors, bert_embeddings)

In [34]:
lda_d2v_concatted = encodings.transform_lda_d2v()
lda_bert_concatted = encodings.transform_lda_bert()
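
Concatenation here just means placing each document's vectors side by side, so a 20-dimensional topic vector and a d-dimensional embedding become one (20 + d)-dimensional row. A minimal sketch, assuming the rows of both inputs are aligned by document. `concat_rows` is a hypothetical helper, and any weighting or scaling that `ConcatVectors` may apply is not shown.

```python
def concat_rows(left, right):
    """Concatenate two row-aligned lists of vectors, one row per document."""
    if len(left) != len(right):
        raise ValueError('inputs must have one row per document')
    return [list(a) + list(b) for a, b in zip(left, right)]

lda_rows = [[0.7, 0.3], [0.2, 0.8]]              # 2 docs x 2 topics (toy values)
emb_rows = [[0.1, -0.4, 0.5], [0.0, 0.9, -0.2]]  # 2 docs x 3 dims (toy values)
combined = concat_rows(lda_rows, emb_rows)
print(len(combined), len(combined[0]))
```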

In [35]:
encoded_lda_d2v = eng.Autoencoder()
encoded_lda_d2v.fit(lda_d2v_concatted)
lda_d2v_embeddings = encoded_lda_d2v.encoder.predict(lda_d2v_concatted)

In [36]:
encoded_lda_bert = eng.Autoencoder()
encoded_lda_bert.fit(lda_bert_concatted)
lda_bert_embeddings = encoded_lda_bert.encoder.predict(lda_bert_concatted)

# Saving models

In [37]:
# eng.save(lda_mallet, lda_vectors, d2v_model, bert_embeddings, encoded_lda_d2v.encoder, lda_d2v_embeddings, 
#          encoded_lda_bert.encoder, lda_bert_embeddings)