# Word2Vec Term Analysis

## Aims

* Augment LDA clustering with contextual understanding of key terms
* Augment this contextual understanding with semantic algebra
* Find similar terms so that key terms can be networked and interlinked for analysis

## Method

1. Load sanitised corpus, sentences & dictionary
2. Train Word2Vec model 
3. Test for similar terms to core term
4. Perform semantic algebra to further understand term context

## Questions

* Of the most representative LDA terms, how are they used?
* Of core terms, how can we extract bias and perceptions?
* How do these usages differ from our expectations?


In [1]:
from gensim.models import Word2Vec
from gensim.models import word2vec
# from gensim.models import LDA
import gensim
import logging
import stop_words
import nltk
import string
import os
from nltk.stem import WordNetLemmatizer
import numpy as np
import pickle

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

wordnet_lemmatizer = WordNetLemmatizer()




In [2]:
#load corpus

with open("corp.cor", "rb") as cp: 
    corp = pickle.load(cp)
with open("sentances2.sent", "rb") as sent: 
    sentances = pickle.load(sent)
dictionary = gensim.corpora.dictionary.Dictionary.load("Dictionary.dict")

print (len(corp))
print (len(sentances))

# count the total number of tokens across all sentences
total_words = sum(len(sentence) for sentence in sentances)

print (total_words)

2018-03-17 11:33:05,957 : INFO : loading Dictionary object from Dictionary.dict
2018-03-17 11:33:05,996 : INFO : loaded Dictionary.dict


82312
59094
2150151


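Train the model with explicitly chosen settings. These are not gensim's own defaults (`size=100`, `window=5`, `min_count=5`, `sample=1e-3`): the vectors are larger, the context window wider, the vocabulary cut-off lower and the downsampling threshold higher.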
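As a rough illustration of what the `min_count` and `sample` arguments in the next cell control, here is a minimal sketch in plain Python. The helper names `prune_vocab` and `keep_probability` and the toy corpus are hypothetical, and the keep-probability formula is the one I understand the original word2vec code to use, so treat it as an assumption rather than a description of gensim's internals.

```python
from collections import Counter
from math import sqrt


def prune_vocab(sentences, min_count):
    """Count tokens and keep only word types seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


def keep_probability(count, total, sample):
    """Chance that one occurrence of a word survives downsampling.

    Assumed formula: (sqrt(count / t) + 1) * (t / count), with t = sample * total,
    capped at 1.0 so that rare words are always kept.
    """
    threshold = sample * total
    return min(1.0, (sqrt(count / threshold) + 1) * (threshold / count))


# Hypothetical toy corpus, not taken from the real data
toy_sentences = [["the", "prince", "and", "the", "queen"],
                 ["the", "queen", "and", "the", "mermaid"]]

vocab = prune_vocab(toy_sentences, min_count=2)
total = sum(vocab.values())
for word, n in sorted(vocab.items()):
    print(word, n, round(keep_probability(n, total, sample=1e-2), 3))
```

Under this formula a larger `sample` value downsamples fewer occurrences, so `1e-2` is a gentler setting than the `1e-3` default.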

In [3]:
num_features = 300    # Word vector dimensionality
min_word_count = 2  # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-2   # Downsampling threshold for frequent words



# note: `size` was renamed to `vector_size` in gensim 4.x
model = Word2Vec(sentances, workers=num_workers,
            size=num_features, min_count=min_word_count,
            window=context, sample=downsampling, seed=1)





2018-03-17 11:33:06,238 : INFO : collecting all words and their counts
2018-03-17 11:33:06,239 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-03-17 11:33:06,325 : INFO : PROGRESS: at sentence #10000, processed 358880 words, keeping 28160 word types
2018-03-17 11:33:06,409 : INFO : PROGRESS: at sentence #20000, processed 725730 words, keeping 41580 word types
2018-03-17 11:33:06,503 : INFO : PROGRESS: at sentence #30000, processed 1090374 words, keeping 52865 word types
2018-03-17 11:33:06,589 : INFO : PROGRESS: at sentence #40000, processed 1459146 words, keeping 62797 word types
2018-03-17 11:33:06,681 : INFO : PROGRESS: at sentence #50000, processed 1819290 words, keeping 71638 word types
2018-03-17 11:33:06,764 : INFO : collected 79140 word types from a corpus of 2150151 raw words and 59094 sentences
2018-03-17 11:33:06,765 : INFO : Loading a fresh vocabulary
2018-03-17 11:33:07,000 : INFO : min_count=2 retains 31661 unique words (40% of original 791

Shit... that was easy!

In [4]:
model.save("word2vec.vec")

2018-03-17 11:33:26,938 : INFO : saving Word2Vec object under word2vec.vec, separately None
2018-03-17 11:33:26,939 : INFO : not storing attribute vectors_norm
2018-03-17 11:33:26,941 : INFO : not storing attribute cum_table
2018-03-17 11:33:28,137 : INFO : saved word2vec.vec


Taking the term 'prince', which terms does the model place closest to it?

In [5]:
lemma = WordNetLemmatizer()

term = "prince"

term = lemma.lemmatize(term)

print (term, "\n")


try:
    x = model.wv.similar_by_word(term, topn = 20)
    print ("by word", x)
except KeyError as e:
    # raised when the term is not in the model vocabulary
    print (e)


2018-03-17 11:33:30,269 : INFO : precomputing L2-norms of word weight vectors


prince 

by word [('pocahontas', 0.7822363376617432), ('devout', 0.7541205286979675), ('persecutor', 0.7417248487472534), ('cinderella', 0.71034836769104), ('ephesian', 0.7089697122573853), ('dixie', 0.6935038566589355), ('fitzgerald', 0.6817329525947571), ('everlasting', 0.6791607737541199), ('mermaid', 0.6788723468780518), ('disney', 0.6758568286895752), ('sparrow', 0.6751775145530701), ('queen', 0.6721416711807251), ('penelope', 0.6675920486450195), ('tolstoy', 0.6667650938034058), ('plejune', 0.6611024737358093), ('virgin', 0.6601336598396301), ('europemaximus', 0.6543794870376587), ('strove', 0.6537072062492371), ('rebut', 0.6533612012863159), ('holy', 0.6514393091201782)]
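The scores in this output are cosine similarities between word vectors. The sketch below shows the idea with NumPy on a few made-up three-dimensional vectors. The function `similar_by_cosine` and the values in `toy_vectors` are hypothetical and only stand in for the trained model.

```python
import numpy as np


def similar_by_cosine(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # a word is trivially most similar to itself
        scores.append((other, float(np.dot(query, vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "prince": np.array([1.0, 0.9, 0.1]),
    "queen": np.array([0.9, 1.0, 0.2]),
    "disney": np.array([0.5, 0.1, 0.9]),
    "rebut": np.array([-0.8, 0.2, 0.1]),
}

print(similar_by_cosine("prince", toy_vectors, topn=2))
```

A score near 1 means the two vectors point in almost the same direction, so the words tend to appear in similar contexts in the corpus.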


The term 'princess' appears in the LDA models. Is it the equivalent of 'prince', but with a bias of promiscuity coming from the TRP concept of a 'bratty' 'princess'? To test this, subtract the promiscuity terms from 'princess' and look at what remains nearby.

In [6]:
term = "princess"

term = lemma.lemmatize(term)

print (term, "\n")


try:
    x = model.wv.most_similar(positive= [term], negative = ["slut", "whore"], topn = 20)
    print ("by word", x)
except KeyError as e:
    # raised when any of the terms is not in the model vocabulary
    print (e)

princess 

by word [('limping', 0.6588572263717651), ('routed', 0.6575058102607727), ('emperor', 0.6438796520233154), ('brutally', 0.6360428929328918), ('spanish', 0.6316596269607544), ('o', 0.6258246898651123), ('shuts', 0.6160863637924194), ('babylon', 0.6156758069992065), ('merchant', 0.6121267080307007), ('melians', 0.6112450361251831), ('josemara', 0.60951167345047), ('tk', 0.6050773859024048), ('statesman', 0.5992505550384521), ('bean', 0.5910166501998901), ('philip', 0.5894211530685425), ('empire', 0.5837744474411011), ('fortress', 0.5818959474563599), ('wifebeater', 0.579513430595398), ('ruler', 0.5778967142105103), ('prefiguration', 0.576200544834137)]


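Sure looks like it! With the two promiscuity terms subtracted, the nearest neighbours shift towards rulers and statecraft: 'emperor', 'empire', 'ruler', 'statesman'.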
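For readers unfamiliar with semantic algebra, here is a minimal NumPy sketch of what a `positive`/`negative` query does, as far as I understand it: unit-normalise each input vector, add the positives, subtract the negatives, then rank the remaining words by cosine similarity to the result. The function `semantic_algebra`, the two-dimensional `toy_vectors` and the words in it are hypothetical and chosen only to make the effect visible.

```python
import numpy as np


def unit(vec):
    """Scale a vector to length 1."""
    return vec / np.linalg.norm(vec)


def semantic_algebra(positive, negative, vectors, topn=10):
    """Rank words by cosine similarity to mean(positives) minus mean(negatives)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    exclude = set(positive) | set(negative)  # input words are left out of the results
    scores = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "bratty" connotation
toy_vectors = {
    "princess": np.array([0.8, 0.6]),
    "bratty": np.array([0.1, 1.0]),
    "spoiled": np.array([0.3, 0.9]),
    "emperor": np.array([1.0, -0.1]),
    "merchant": np.array([0.5, 0.0]),
}

print(semantic_algebra(["princess"], [], toy_vectors, topn=1))
print(semantic_algebra(["princess"], ["bratty"], toy_vectors, topn=1))
```

In this toy space the nearest neighbour of 'princess' changes from 'spoiled' to 'emperor' once 'bratty' is subtracted, which is the kind of shift the cell above looks for in the real model.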

Below is a rough attempt to build a dendrogram to demonstrate the clusters and interlinkage of similar terms.

In [7]:
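# dendrogram split

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Single-linkage clustering over every word vector in the model.
# model.wv.vectors is the current name for the deprecated model.wv.syn0.
# Note: this needs all pairwise distances, so memory grows quadratically
# with vocabulary size.
l = linkage(model.wv.vectors,
            method="single",
            metric='seuclidean')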


In [None]:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
#     leaf_rotation=90.,  # rotates the leaf labels
    leaf_font_size=16.,  # font size for the leaf labels
    orientation='left',
    leaf_label_func=lambda v: str(model.wv.index2word[v]),
)
plt.show()
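Clustering the full vocabulary is expensive: with the 31661 retained words reported in the training log, the pairwise distances alone come to about 5 x 10^8 values, or roughly 4 GB as 64-bit floats. A lighter option is to cluster only a handful of terms of interest. The sketch below does this on hypothetical toy vectors. The helper `subset_linkage` is my own name, and average linkage with cosine distance is swapped in for the single-linkage, standardised Euclidean setup used above, on the assumption that cosine distance better matches the similarity scores.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def subset_linkage(words, vectors, method="average", metric="cosine"):
    """Hierarchically cluster only the vectors belonging to `words`."""
    matrix = np.array([vectors[w] for w in words])
    return linkage(matrix, method=method, metric=metric)


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "prince": np.array([1.0, 0.9, 0.1]),
    "queen": np.array([0.9, 1.0, 0.2]),
    "emperor": np.array([1.0, 0.8, 0.0]),
    "mermaid": np.array([0.1, 0.2, 1.0]),
    "cinderella": np.array([0.2, 0.1, 0.9]),
}

words = list(toy_vectors)
Z = subset_linkage(words, toy_vectors)

# no_plot=True returns the leaf ordering without drawing anything
tree = dendrogram(Z, labels=words, orientation="left", no_plot=True)
print(tree["ivl"])
```

With the real model, `words` could be the terms returned by `model.wv.similar_by_word(term, topn=20)` and `vectors` a dict built from `model.wv[w]` for each of them. Dropping `no_plot=True` draws the figure.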