# Dependency2vec code word identification

We experiment with methods for evaluating candidate code words, based on average cosine similarity and traditional word embeddings.
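
As a point of reference, average cosine similarity can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`cosine`, `average_cosine_similarity`) and the toy vectors are invented here and are not part of the project's modules.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_cosine_similarity(word, others, vectors):
    """Mean cosine similarity between `word` and every in-vocabulary word in `others`."""
    sims = [cosine(vectors[word], vectors[o]) for o in others if o in vectors and o != word]
    return sum(sims) / len(sims) if sims else 0.0

# Toy 2-d vectors, invented for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
avg = average_cosine_similarity("a", ["b", "c"], toy_vectors)
print(avg)
```

Out-of-vocabulary words are skipped rather than treated as zero similarity, so a sparse vocabulary does not drag the average down.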

### Related issues 
- [#85](https://github.com/JherezTaylor/thesis-preprocessing/issues/85)
- [#93](https://github.com/JherezTaylor/thesis-preprocessing/issues/93)
- [#116](https://github.com/JherezTaylor/thesis-preprocessing/issues/116)

In [1]:
import sys
sys.path.append("../")
%load_ext autoreload
%autoreload 2

In [None]:
import glob, os
from collections import OrderedDict
from pprint import pprint
import joblib
import pandas as pd
from gensim.models import KeyedVectors, Word2Vec
import fasttext
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm
from modules.db import elasticsearch_base
from modules.preprocessing import neural_embeddings
from modules.utils import file_ops, model_helpers, settings, word_enrichment, visualization

visualization.init_plotly()

## Initialize params and objects

Here we define common functions for loading our embeddings and extracting the vocabulary and vocabulary counts. `dep_embeddings` and `word_embeddings` each map a corpus label (`core_hate`, `core_clean`) to an embedding model loaded from disk.
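
For intuition, the vocabulary extraction step can be approximated with the standard library alone. The sketch below is a stand-in, not the notebook's implementation: `raw_vocab_counts` and the tiny `STOP_WORDS` set are hypothetical, and the regular expression mirrors the default token pattern of scikit-learn's `CountVectorizer` (tokens of two or more word characters). Unlike `vocabulary_`, which maps each term to a feature index, this version returns actual occurrence counts.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; scikit-learn's built-in English list is much larger.
STOP_WORDS = {"the", "a", "is", "on", "and"}

def raw_vocab_counts(texts):
    """Return (vocabulary size, Counter of token -> occurrences) for an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep tokens made of at least two word characters.
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return len(counts), counts

size, counts = raw_vocab_counts(["The cat sat on the mat", "A cat is a cat"])
print(size, counts.most_common(1))
```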

#### Method definitions

In [3]:
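def get_raw_vocab(df):
    """Fit a CountVectorizer on df["text"] and return (vocabulary size, vocabulary).

    Note: `vocabulary_` maps each term to its feature index, not to a count.
    """
    vect = CountVectorizer(analyzer='word', stop_words='english')
    vect.fit(df["text"])
    vocabulary = vect.vocabulary_
    return len(vocabulary), vocabulary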

#### Load models

In [4]:
dep_model_ids = [1,2]
loaded_embeddings = neural_embeddings.get_embeddings("dep2vec", model_ids=dep_model_ids, load=True)
dep_embeddings = {}
dep_embeddings["core_hate"] = loaded_embeddings[0]
dep_embeddings["core_clean"] = loaded_embeddings[1]

0 dim200vecs_core_combined_corpus
1 dim200vecs_core_hate_corpus
2 dim200vecs_core_tweets_clean
3 dim200vecs_core_tweets_hs_keyword
4 dim200vecs_dstormer_conll
5 dim200vecs_inaug_conll
6 dim200vecs_manch_conll
7 dim200vecs_melvynhs_conll
8 dim200vecs_twitter_conll
9 dim200vecs_uselec_conll
10 dim200vecs_ustream_conll


In [5]:
word_model_ids = [1,2]
word_embeddings = {}
loaded_embeddings = neural_embeddings.get_embeddings("ft", model_ids=word_model_ids, load=True)
word_embeddings["core_hate"] = loaded_embeddings[0]
word_embeddings["core_clean"] = loaded_embeddings[1]

0 fasttext_core_combined_corpus.vec
1 fasttext_core_hate_corpus.vec
2 fasttext_core_tweets_clean.vec
3 fasttext_core_tweets_hs_keyword.vec
4 fasttext_embedding_daily_stormer.vec
5 fasttext_embedding_inauguration.vec
6 fasttext_embedding_manchester.vec
7 fasttext_embedding_melvyn_hs.vec
8 fasttext_embedding_twitter.vec
9 fasttext_embedding_unfiltered_stream.vec
10 fasttext_embedding_uselections.vec


#### Load dataframes and other objects

In [6]:
_es = elasticsearch_base.connect(settings.ES_URL)
positive_hs_filter = "_exists_:hs_keyword_matches"
negative_hs_filter = "!_exists_:hs_keyword_matches"

hs_keywords = set(file_ops.read_csv_file("refined_hs_keywords", settings.TWITTER_SEARCH_PATH))

## Let's get to work

### Get word frequencies from corpora
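
The cells in this section turn per-token document counts into frequency and IDF weights. The implementation of `model_helpers.get_els_word_weights` is not shown in this notebook, so the following is only a sketch of one common formulation, assuming frequency is the share of documents containing a token and IDF is `log(N / df)`. The function name `word_weights` and the toy counts are hypothetical; the Manchester figures are the counts printed by the Manchester cell.

```python
import math

def word_weights(doc_counts, total_docs):
    """Return two dicts: token -> document share, and token -> log(N / df)."""
    freqs = {tok: n / total_docs for tok, n in doc_counts.items()}
    idfs = {tok: math.log(total_docs / n) for tok, n in doc_counts.items() if n > 0}
    return freqs, idfs

# Invented counts for illustration; not taken from any index.
toy_counts = {"common": 500, "rare": 5}
freqs, idfs = word_weights(toy_counts, total_docs=1000)

# Share of Manchester tweets that matched an HS keyword, from the printed counts.
manchester_hs_share = 10002 / 617698
print(freqs, idfs, round(manchester_hs_share, 4))
```

Under this formulation rarer tokens receive larger IDF values, which is why the frequency and IDF plots for the same keyword set can look quite different.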

#### Manchester event

In [7]:
debug_size = 5000
test_size = 100000
min_doc_count = 2

subset_sizes = elasticsearch_base.get_els_subset_size(_es, "manchester_event", "hs_keyword_matches")
doc_count = subset_sizes["positive_count"] + subset_sizes["negative_count"]

manchester_pos_subset = elasticsearch_base.aggregate(_es, "manchester_event", "tokens.keyword", False, positive_hs_filter, size=test_size, min_doc_count=min_doc_count)
manchester_neg_subset = elasticsearch_base.aggregate(_es, "manchester_event", "tokens.keyword", False, negative_hs_filter, size=test_size, min_doc_count=min_doc_count)

manchester_pos_hs_freqs, manchester_pos_vocab_freqs, manchester_pos_hs_idfs, manchester_pos_vocab_idfs = model_helpers.get_els_word_weights(manchester_pos_subset[0], doc_count, hs_keywords)
_, manchester_neg_vocab_freqs, _, manchester_neg_vocab_idfs = model_helpers.get_els_word_weights(manchester_neg_subset[0], doc_count, hs_keywords)

del manchester_neg_subset
del manchester_pos_subset

print("Manchester tweet count: {0}\n".format(doc_count))
print("Clean: {0} | Has HS Keyword: {1}".format(subset_sizes["negative_count"], subset_sizes["positive_count"]))

Manchester tweet count: 617698

Clean: 607696 | Has HS Keyword: 10002


#### Dailystormer

In [8]:
debug_size = 5000
test_size = 100000
min_doc_count = 2

subset_sizes = elasticsearch_base.get_els_subset_size(_es, "dailystormer", "hs_keyword_matches")
doc_count = subset_sizes["positive_count"] + subset_sizes["negative_count"]

dailystormer_pos_subset = elasticsearch_base.aggregate(_es, "dailystormer", "tokens.keyword", False, positive_hs_filter, size=test_size, min_doc_count=min_doc_count)
dailystormer_neg_subset = elasticsearch_base.aggregate(_es, "dailystormer", "tokens.keyword", False, negative_hs_filter, size=test_size, min_doc_count=min_doc_count)

dailystormer_pos_hs_freqs, dailystormer_pos_vocab_freqs, dailystormer_pos_hs_idfs, dailystormer_pos_vocab_idfs = model_helpers.get_els_word_weights(dailystormer_pos_subset[0], doc_count, hs_keywords)
_, dailystormer_neg_vocab_freqs, _, dailystormer_neg_vocab_idfs = model_helpers.get_els_word_weights(dailystormer_neg_subset[0], doc_count, hs_keywords)

del dailystormer_neg_subset
del dailystormer_pos_subset

print("Dailystormer doc count: {0}\n".format(doc_count))
print("Clean: {0} | Has HS Keyword: {1}".format(subset_sizes["negative_count"], subset_sizes["positive_count"]))

Dailystormer doc count: 26015

Clean: 19756 | Has HS Keyword: 6259


#### melvyn_hs_users

In [9]:
debug_size = 5000
test_size = 150000
min_doc_count = 2

subset_sizes = elasticsearch_base.get_els_subset_size(_es, "melvyn_hs", "hs_keyword_matches")
doc_count = subset_sizes["positive_count"] + subset_sizes["negative_count"]

melvyn_pos_subset = elasticsearch_base.aggregate(_es, "melvyn_hs", "tokens.keyword", False, positive_hs_filter, size=test_size, min_doc_count=min_doc_count)
melvyn_neg_subset = elasticsearch_base.aggregate(_es, "melvyn_hs", "tokens.keyword", False, negative_hs_filter, size=test_size, min_doc_count=min_doc_count)

melvyn_pos_hs_freqs, melvyn_pos_vocab_freqs, melvyn_pos_hs_idfs, melvyn_pos_vocab_idfs = model_helpers.get_els_word_weights(melvyn_pos_subset[0], doc_count, hs_keywords)
_, melvyn_neg_vocab_freqs, _, melvyn_neg_vocab_idfs = model_helpers.get_els_word_weights(melvyn_neg_subset[0], doc_count, hs_keywords)

print("Melvyn tweet count: {0}\n".format(doc_count))
print("Clean: {0} | Has HS Keyword: {1}".format(subset_sizes["negative_count"], subset_sizes["positive_count"]))

del melvyn_neg_subset
del melvyn_pos_subset

try: 
    pprint(melvyn_pos_hs_freqs["faggot"])
    pprint(model_helpers.get_model_word_count(dep2vec_melvyn_hs, "faggot"))
except Exception as e:
    print(e)

Melvyn tweet count: 328627

Clean: 320236 | Has HS Keyword: 8391
0.00119
name 'dep2vec_melvyn_hs' is not defined


#### unfiltered_stream

In [10]:
debug_size = 5000
test_size = 150000
min_doc_count = 10

subset_sizes = elasticsearch_base.get_els_subset_size(_es, "unfiltered_stream", "hs_keyword_matches")
doc_count = subset_sizes["positive_count"] + subset_sizes["negative_count"]

unfiltered_stream_pos_subset = elasticsearch_base.aggregate(_es, "unfiltered_stream", "tokens.keyword", False, positive_hs_filter, size=test_size, min_doc_count=min_doc_count)
unfiltered_stream_neg_subset = elasticsearch_base.aggregate(_es, "unfiltered_stream", "tokens.keyword", False, negative_hs_filter, size=test_size, min_doc_count=min_doc_count)

unfiltered_stream_pos_hs_freqs, unfiltered_stream_pos_vocab_freqs, unfiltered_stream_pos_hs_idfs, unfiltered_stream_pos_vocab_idfs = model_helpers.get_els_word_weights(unfiltered_stream_pos_subset[0], doc_count, hs_keywords)
_, unfiltered_stream_neg_vocab_freqs, _, unfiltered_stream_neg_vocab_idfs = model_helpers.get_els_word_weights(unfiltered_stream_neg_subset[0], doc_count, hs_keywords)

print("Unfiltered tweet count: {0}\n".format(doc_count))
print("Clean: {0} | Has HS Keyword: {1}".format(subset_sizes["negative_count"], subset_sizes["positive_count"]))

del unfiltered_stream_pos_subset
del unfiltered_stream_neg_subset

try:
    pprint(unfiltered_stream_pos_hs_freqs['faggot'])
except Exception as e:
    print(e)

Unfiltered tweet count: 3241381

Clean: 3177232 | Has HS Keyword: 64149
6e-05


#### core_tweets

In [11]:
debug_size = 5000
test_size = 200000
min_doc_count = 5

subset_sizes = elasticsearch_base.get_els_subset_size(_es, "core_tweets", "hs_keyword_matches")
doc_count = subset_sizes["positive_count"] + subset_sizes["negative_count"]

core_tweets_pos_subset = elasticsearch_base.aggregate(_es, "core_tweets", "tokens.keyword", False, positive_hs_filter, size=test_size, min_doc_count=min_doc_count)
core_tweets_neg_subset = elasticsearch_base.aggregate(_es, "core_tweets", "tokens.keyword", False, negative_hs_filter, size=test_size, min_doc_count=min_doc_count)

core_tweets_pos_hs_freqs, core_tweets_pos_vocab_freqs, core_tweets_pos_hs_idfs, core_tweets_pos_vocab_idfs = model_helpers.get_els_word_weights(core_tweets_pos_subset[0], doc_count, hs_keywords)
_, core_tweets_neg_vocab_freqs, _, core_tweets_neg_vocab_idfs = model_helpers.get_els_word_weights(core_tweets_neg_subset[0], doc_count, hs_keywords)

del core_tweets_neg_subset
del core_tweets_pos_subset

print("Core tweet count: {0}\n".format(doc_count))
print("Clean: {0} | Has HS Keyword: {1}".format(subset_sizes["negative_count"], subset_sizes["positive_count"]))

Core tweet count: 6843555

Clean: 3014831 | Has HS Keyword: 3828724


#### Sample similarity results

Main Twitter set: bomber
[('blazer', 0.9450125694274902),
 ('windbreaker', 0.937454879283905),
 ('linen', 0.930109977722168)]

DailyStormer: savages
[('monkeys', 0.9646297693252563),
 ('monsters', 0.9601025581359863),
 ('thugs', 0.9567341208457947)]

Melvyn HS users: savages
[('degenerates', 0.8997732400894165),
 ('rings', 0.8950327634811401),
 ('assholes', 0.8936536908149719)]

Unfiltered stream: savages
[('morons', 0.8233602643013),
 ('motherfuckers', 0.8196897506713867),
 ('cowards', 0.8059471249580383)]

Manchester: bomber
[('#washington', 0.7842807769775391),
 ('#bomber', 0.747549831867218),
 ('#political', 0.7453497648239136)]
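
The lists above have the shape of top-n neighbour rankings, presumably ordered by cosine similarity. A minimal sketch of such a ranking, assuming a plain dict of vectors, is given here; `most_similar` is a hypothetical helper written for this illustration (it is not the gensim method of the same name) and the vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(word, vectors, topn=3):
    """Return the `topn` (neighbour, similarity) pairs closest to `word`."""
    scored = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-d vectors, invented for illustration only.
toy_vectors = {
    "q": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}
neighbours = most_similar("q", toy_vectors, topn=2)
print(neighbours)
```

Running the same query against models trained on different corpora is what produces the differing neighbour lists shown in the samples.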

### Plot HS keyword frequency and IDF weights against Unfiltered Twitter Stream

Values are divided by 10^6 (see https://en.wikipedia.org/wiki/Micro-).
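
A side-by-side bar chart needs the two corpora aligned on the keywords they share. The sketch below shows one way to do that alignment; it is an assumption about what `model_helpers.get_overlapping_weights` does, not its actual code, and `overlapping_weights` returns a single shared label list rather than the four values unpacked in the cells of this section.

```python
def overlapping_weights(weights_a, weights_b):
    """Align two token -> weight dicts on their shared tokens, sorted alphabetically."""
    shared = sorted(set(weights_a) & set(weights_b))
    return shared, [weights_a[t] for t in shared], [weights_b[t] for t in shared]

# Toy weights, invented for illustration only.
labels, values_a, values_b = overlapping_weights(
    {"x": 0.2, "y": 0.1},
    {"y": 0.4, "z": 0.3},
)
print(labels, values_a, values_b)
```

Sorting the shared tokens keeps the x-axis order stable across the frequency and IDF plots.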

#### Dailystormer

In [33]:
try:
    dailystormer_pos_labels, dailystormer_pos_freq_values, unfiltered_stream_pos_labels, unfiltered_stream_pos_freq_values = model_helpers.get_overlapping_weights(dailystormer_pos_hs_freqs, unfiltered_stream_pos_hs_freqs)
    _, dailystormer_pos_idf_values, _, unfiltered_stream_pos_idf_values = model_helpers.get_overlapping_weights(dailystormer_pos_hs_idfs, unfiltered_stream_pos_hs_idfs)
    visualization.plot_bar_chart(dailystormer_pos_labels, dailystormer_pos_freq_values, unfiltered_stream_pos_freq_values, "DailyStormer", "Unfiltered", "Dailystormer HS vs Unfiltered Stream: Keyword Frequency", orientation="v")
    visualization.plot_bar_chart(dailystormer_pos_labels, dailystormer_pos_idf_values, unfiltered_stream_pos_idf_values, "DailyStormer", "Unfiltered", "Dailystormer HS vs Unfiltered Stream: Keyword IDF", orientation="v")
    
    del dailystormer_pos_labels
    del dailystormer_pos_freq_values
    del dailystormer_pos_idf_values
except Exception as e:
    print(e)

#### Manchester Event

In [34]:
try:
    manchester_pos_labels, manchester_pos_freq_values, unfiltered_stream_pos_labels, unfiltered_stream_pos_freq_values = model_helpers.get_overlapping_weights(manchester_pos_hs_freqs, unfiltered_stream_pos_hs_freqs)
    _, manchester_pos_idf_values, _, unfiltered_stream_pos_idf_values = model_helpers.get_overlapping_weights(manchester_pos_hs_idfs, unfiltered_stream_pos_hs_idfs)
    
    visualization.plot_bar_chart(manchester_pos_labels, manchester_pos_freq_values, unfiltered_stream_pos_freq_values, "Manchester Event", "Unfiltered", "Manchester Event HS vs Unfiltered Stream: Keyword Frequency", orientation="v")
    visualization.plot_bar_chart(manchester_pos_labels, manchester_pos_idf_values, unfiltered_stream_pos_idf_values, "Manchester Event", "Unfiltered", "Manchester Event HS vs Unfiltered Stream: Keyword IDF", orientation="v")
    
    del manchester_pos_labels
    del manchester_pos_freq_values
    del manchester_pos_idf_values
except Exception as e:
    print(e)

#### Core Tweets

In [14]:
try:
    core_tweets_pos_labels, core_tweets_pos_freq_values, unfiltered_stream_pos_labels, unfiltered_stream_pos_freq_values = model_helpers.get_overlapping_weights(core_tweets_pos_hs_freqs, unfiltered_stream_pos_hs_freqs)
    _, core_tweets_pos_idf_values, _, unfiltered_stream_pos_idf_values = model_helpers.get_overlapping_weights(core_tweets_pos_hs_idfs, unfiltered_stream_pos_hs_idfs)
    
    visualization.plot_bar_chart(core_tweets_pos_labels, core_tweets_pos_freq_values, unfiltered_stream_pos_freq_values, "Core Tweets", "Unfiltered", "Core Tweets HS vs Unfiltered Stream: Keyword Frequency", orientation="v")
    visualization.plot_bar_chart(core_tweets_pos_labels, core_tweets_pos_idf_values, unfiltered_stream_pos_idf_values, "Core Tweets", "Unfiltered", "Core Tweets HS vs Unfiltered Stream: Keyword IDF", orientation="v")
    
    del core_tweets_pos_labels
    del core_tweets_pos_freq_values
    del core_tweets_pos_idf_values
except Exception as e:
    print(e)

#### Melvyn Users

In [15]:
try:
    melvyn_pos_labels, melvyn_pos_freq_values, unfiltered_stream_pos_labels, unfiltered_stream_pos_freq_values = model_helpers.get_overlapping_weights(melvyn_pos_hs_freqs, unfiltered_stream_pos_hs_freqs)
    _, melvyn_pos_idf_values, _, unfiltered_stream_pos_idf_values = model_helpers.get_overlapping_weights(melvyn_pos_hs_idfs, unfiltered_stream_pos_hs_idfs)
    
    visualization.plot_bar_chart(melvyn_pos_labels, melvyn_pos_freq_values, unfiltered_stream_pos_freq_values, "Melvyn Users", "Unfiltered", "Melvyn Users HS vs Unfiltered Stream: Keyword Frequency", orientation="v")
    visualization.plot_bar_chart(melvyn_pos_labels, melvyn_pos_idf_values, unfiltered_stream_pos_idf_values, "Melvyn Users", "Unfiltered", "Melvyn Users HS vs Unfiltered Stream: Keyword IDF", orientation="v")
    
    del melvyn_pos_labels
    del melvyn_pos_freq_values
    del melvyn_pos_idf_values
    del unfiltered_stream_pos_freq_values
    del unfiltered_stream_pos_labels
    del unfiltered_stream_pos_idf_values
except Exception as e:
    print(e)

## Candidate code words and Contextual Representation

### Codeword Search

In this experiment we compare the contextual representation of words drawn from a dataset that is dense in hate speech (HS) with their representation in the unfiltered Twitter stream.

A word's contextual representation consists of the following:

| Feature                | Description                                            |
|------------------------|--------------------------------------------------------|
| hs_rel_words           | Collocated HS words                                    |
| hs_rel_words_unbiased  | Collocated HS words from the unbiased data             |
| hs_sim_words           | dep2vec similar HS words                               |
| hs_sim_words_unbiased  | dep2vec similar HS words from the unbiased data        |
| alt_rel_words          | Collocated words not in HS                             |
| alt_rel_words_unbiased | Collocated words not in HS from the unbiased data      |
| alt_sim_words          | dep2vec similar words not in HS                        |
| alt_sim_words_unbiased | dep2vec similar words not in HS from the unbiased data |
| biased_freq            | Word frequency in the biased data                      |
| unbiased_freq          | Word frequency in the unbiased data                    |
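
The table can be read as a record with ten fields. One possible way to model it is shown here as a sketch; the `ContextualRepresentation` dataclass is hypothetical and the notebook itself stores these features in plain dictionaries.

```python
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class ContextualRepresentation:
    """Container mirroring the feature table; every list defaults to empty."""
    hs_rel_words: List[str] = field(default_factory=list)
    hs_rel_words_unbiased: List[str] = field(default_factory=list)
    hs_sim_words: List[str] = field(default_factory=list)
    hs_sim_words_unbiased: List[str] = field(default_factory=list)
    alt_rel_words: List[str] = field(default_factory=list)
    alt_rel_words_unbiased: List[str] = field(default_factory=list)
    alt_sim_words: List[str] = field(default_factory=list)
    alt_sim_words_unbiased: List[str] = field(default_factory=list)
    biased_freq: float = 0.0
    unbiased_freq: float = 0.0

# Placeholder values, invented for illustration only.
rep = ContextualRepresentation(hs_sim_words=["example_hs_term"], biased_freq=0.001)
print(asdict(rep))
```

`asdict` gives back a plain dictionary, so a record like this can still be written out as JSON.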

#### 1. Dailystormer vocab search

In [16]:
search_args = {}
hs_keywords = set(file_ops.read_csv_file("refined_hs_keywords", settings.TWITTER_SEARCH_PATH))

search_args["biased_embeddings"] = [dep_embeddings["core_hate"], word_embeddings["core_hate"]]
search_args["unbiased_embeddings"] = [dep_embeddings["core_clean"], word_embeddings["core_clean"]]
search_args["freq_vocab_pair"] = [dailystormer_neg_vocab_freqs, unfiltered_stream_neg_vocab_freqs]
search_args["idf_vocab_pair"] = [dailystormer_neg_vocab_idfs, unfiltered_stream_neg_vocab_idfs]
search_args["hs_keywords"] = hs_keywords
search_args["graph_depth"] = 2
search_args["topn"] = 5
search_args["p_at_k_threshold"] = 0.2
search_args["hs_check"] = True
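
The `topn` and `p_at_k_threshold` arguments suggest a precision-at-k check on each word's neighbours. How `word_enrichment.candidate_codeword_search` applies them is not shown here, so the following is only a sketch of the standard P@k definition; `precision_at_k` and the placeholder tokens are hypothetical.

```python
def precision_at_k(ranked_words, relevant, k):
    """Fraction of the first `k` ranked words that appear in `relevant`."""
    if k <= 0:
        return 0.0
    top_k = ranked_words[:k]
    return sum(1 for w in top_k if w in relevant) / k

# Placeholder tokens: "k*" stand for HS keywords, "w*" for other words.
ranked = ["k1", "w1", "w2", "k2", "w3"]
hs_terms = {"k1", "k2"}
score = precision_at_k(ranked, hs_terms, k=5)
passes = score >= 0.2  # compare against a threshold such as p_at_k_threshold
print(score, passes)
```

With `topn` set to 5, a threshold of 0.2 corresponds to at least one HS keyword among the five neighbours.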

In [17]:
%%time
ds_candidate_codewords, ds_pagerank, ds_candidate_graph, ds_singular_tokens = word_enrichment.candidate_codeword_search(**search_args)

CPU times: user 33min 56s, sys: 1h, total: 1h 33min 56s
Wall time: 12min 53s
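
The search returns a candidate graph and takes a `graph_depth` argument. One plausible reading, offered here only as an assumption, is a breadth-first expansion over neighbour lists that stops after a fixed number of hops. In the sketch, `expand_neighbour_graph` and the `neighbours_of` callback are hypothetical names.

```python
from collections import deque

def expand_neighbour_graph(seed, neighbours_of, depth=2):
    """Breadth-first expansion from `seed`; returns directed (word, neighbour) edges."""
    edges = set()
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        word, level = queue.popleft()
        if level == depth:
            continue  # do not expand words that are already `depth` hops away
        for nb in neighbours_of(word):
            edges.add((word, nb))
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, level + 1))
    return edges

# Toy neighbour lists, invented for illustration only.
toy_neighbours = {"s": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"]}
edges = expand_neighbour_graph("s", lambda w: toy_neighbours.get(w, []), depth=2)
print(sorted(edges))
```

Each extra level of depth multiplies the number of similarity lookups by roughly `topn`, which may help explain why the searches take many minutes.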


In [None]:
# https://docs.python.org/3/library/collections.html#ordereddict-examples-and-recipes

In [18]:
output_name = "ds"
file_ops.write_json_file("ds_candidate_codewords", settings.OUTPUT_PATH, ds_candidate_codewords)
file_ops.write_json_file("ds_pagerank", settings.OUTPUT_PATH, ds_pagerank)
file_ops.write_json_file("ds_singular_tokens", settings.OUTPUT_PATH, ds_singular_tokens)

joblib.dump(ds_candidate_graph, settings.MODEL_PATH + output_name + "_candidate_graph.pkl.compressed", compress=True)
joblib.dump(ds_candidate_codewords, settings.MODEL_PATH + output_name + "_candidate_codewords.pkl.compressed", compress=True)
joblib.dump(ds_pagerank, settings.MODEL_PATH + output_name + "_candidate_pagerank.pkl.compressed", compress=True)

['../data/persistence/ds_candidate_pagerank.pkl.compressed']

In [36]:
candidates_unbiased_sim_p_at_k = {token:ds_candidate_codewords[token]["p@k_sim_unbiased"][0] for token in ds_candidate_codewords}
candidates_unbiased_rel_p_at_k = {token:ds_candidate_codewords[token]["p@k_rel_unbiased"][0] for token in ds_candidate_codewords}
candidates_biased_sim_p_at_k = {token:ds_candidate_codewords[token]["p@k_sim_biased"][0] for token in ds_candidate_codewords}
candidates_biased_rel_p_at_k = {token:ds_candidate_codewords[token]["p@k_rel_biased"][0] for token in ds_candidate_codewords}

visualization.plot_bar_chart(list(candidates_unbiased_rel_p_at_k.keys()), list(candidates_unbiased_rel_p_at_k.values()), list(candidates_biased_rel_p_at_k.values()), "Unbiased", "Biased", "Dailystormer P@k Comparison", orientation="v")
visualization.plot_basic_bar_chart(list(ds_pagerank.keys()), list(ds_pagerank.values()), "Dailystormer PageRank Results", orientation="v")
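
The PageRank scores plotted above rank words by how central they are in the candidate graph. For readers unfamiliar with the algorithm, here is a minimal power-iteration sketch over a list of directed edges. It is a generic textbook version with an assumed damping factor of 0.85, not the routine used inside `word_enrichment`, and the toy edges are invented.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; returns a dict of node -> score that sums to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Dangling nodes (no out-links) spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Toy directed edges, invented for illustration only.
toy_edges = [("a", "c"), ("b", "c"), ("c", "a")]
scores = pagerank(toy_edges)
print(max(scores, key=scores.get))
```

A word that many other candidates point to ends up with a higher score than one that is only ever a starting point.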

#### 2. Melvyn HS vocab search

In [20]:
search_args = {}
hs_keywords = set(file_ops.read_csv_file("refined_hs_keywords", settings.TWITTER_SEARCH_PATH))

search_args["biased_embeddings"] = [dep_embeddings["core_hate"], word_embeddings["core_hate"]]
search_args["unbiased_embeddings"] = [dep_embeddings["core_clean"], word_embeddings["core_clean"]]
search_args["freq_vocab_pair"] = [melvyn_neg_vocab_freqs, unfiltered_stream_neg_vocab_freqs]
search_args["idf_vocab_pair"] = [melvyn_neg_vocab_idfs, unfiltered_stream_neg_vocab_idfs]
search_args["hs_keywords"] = hs_keywords
search_args["graph_depth"] = 2
search_args["topn"] = 5
search_args["p_at_k_threshold"] = 0.2
search_args["hs_check"] = True

In [21]:
%%time
mhs_candidate_codewords, mhs_pagerank, mhs_candidate_graph, mhs_singular_tokens = word_enrichment.candidate_codeword_search(**search_args)

CPU times: user 1h 5min 31s, sys: 2h 12min 20s, total: 3h 17min 51s
Wall time: 27min 57s


In [22]:
output_name = "mhs"
file_ops.write_json_file("mhs_candidate_codewords", settings.OUTPUT_PATH, mhs_candidate_codewords)
file_ops.write_json_file("mhs_pagerank", settings.OUTPUT_PATH, mhs_pagerank)
file_ops.write_json_file("mhs_singular_tokens", settings.OUTPUT_PATH, mhs_singular_tokens)

joblib.dump(mhs_candidate_graph, settings.MODEL_PATH + output_name + "_candidate_graph.pkl.compressed", compress=True)
joblib.dump(mhs_candidate_codewords, settings.MODEL_PATH + output_name + "_candidate_codewords.pkl.compressed", compress=True)
joblib.dump(mhs_pagerank, settings.MODEL_PATH + output_name + "_candidate_pagerank.pkl.compressed", compress=True)

['../data/persistence/mhs_candidate_pagerank.pkl.compressed']

In [23]:
candidates_unbiased_sim_p_at_k = {token:mhs_candidate_codewords[token]["p@k_sim_unbiased"][0] for token in mhs_candidate_codewords}
candidates_unbiased_rel_p_at_k = {token:mhs_candidate_codewords[token]["p@k_rel_unbiased"][0] for token in mhs_candidate_codewords}
candidates_biased_sim_p_at_k = {token:mhs_candidate_codewords[token]["p@k_sim_biased"][0] for token in mhs_candidate_codewords}
candidates_biased_rel_p_at_k = {token:mhs_candidate_codewords[token]["p@k_rel_biased"][0] for token in mhs_candidate_codewords}

visualization.plot_bar_chart(list(candidates_unbiased_rel_p_at_k.keys()), list(candidates_unbiased_rel_p_at_k.values()), list(candidates_biased_rel_p_at_k.values()), "Unbiased", "Biased", "MHS P@k Comparison", orientation="v")
visualization.plot_basic_bar_chart(list(mhs_pagerank.keys()), list(mhs_pagerank.values()), "MHS PageRank Results", orientation="v")

#### 3. Manchester vocab search

In [24]:
search_args = {}
hs_keywords = set(file_ops.read_csv_file("refined_hs_keywords", settings.TWITTER_SEARCH_PATH))

search_args["biased_embeddings"] = [dep_embeddings["core_hate"], word_embeddings["core_hate"]]
search_args["unbiased_embeddings"] = [dep_embeddings["core_clean"], word_embeddings["core_clean"]]
search_args["freq_vocab_pair"] = [manchester_neg_vocab_freqs, unfiltered_stream_neg_vocab_freqs]
search_args["idf_vocab_pair"] = [manchester_neg_vocab_idfs, unfiltered_stream_neg_vocab_idfs]
search_args["hs_keywords"] = hs_keywords
search_args["graph_depth"] = 2
search_args["topn"] = 5
search_args["p_at_k_threshold"] = 0.2
search_args["hs_check"] = True

In [25]:
%%time
manchester_candidate_codewords, manchester_pagerank, manchester_candidate_graph, manchester_singular_tokens = word_enrichment.candidate_codeword_search(**search_args)

CPU times: user 58min 42s, sys: 1h 58min 43s, total: 2h 57min 26s
Wall time: 24min 59s


In [26]:
output_name = "manchester"
file_ops.write_json_file("manchester_candidate_codewords", settings.OUTPUT_PATH, manchester_candidate_codewords)
file_ops.write_json_file("manchester_pagerank", settings.OUTPUT_PATH, manchester_pagerank)
file_ops.write_json_file("manchester_singular_tokens", settings.OUTPUT_PATH, manchester_singular_tokens)

joblib.dump(manchester_candidate_graph, settings.MODEL_PATH + output_name + "_candidate_graph.pkl.compressed", compress=True)
joblib.dump(manchester_candidate_codewords, settings.MODEL_PATH + output_name + "_candidate_codewords.pkl.compressed", compress=True)
joblib.dump(manchester_pagerank, settings.MODEL_PATH + output_name + "_candidate_pagerank.pkl.compressed", compress=True)

['../data/persistence/manchester_candidate_pagerank.pkl.compressed']

In [27]:
candidates_unbiased_sim_p_at_k = {token:manchester_candidate_codewords[token]["p@k_sim_unbiased"][0] for token in manchester_candidate_codewords}
candidates_unbiased_rel_p_at_k = {token:manchester_candidate_codewords[token]["p@k_rel_unbiased"][0] for token in manchester_candidate_codewords}
candidates_biased_sim_p_at_k = {token:manchester_candidate_codewords[token]["p@k_sim_biased"][0] for token in manchester_candidate_codewords}
candidates_biased_rel_p_at_k = {token:manchester_candidate_codewords[token]["p@k_rel_biased"][0] for token in manchester_candidate_codewords}

visualization.plot_bar_chart(list(candidates_unbiased_rel_p_at_k.keys()), list(candidates_unbiased_rel_p_at_k.values()), list(candidates_biased_rel_p_at_k.values()), "Unbiased", "Biased", "Manchester P@k Comparison", orientation="v")
visualization.plot_basic_bar_chart(list(manchester_pagerank.keys()), list(manchester_pagerank.values()), "Manchester PageRank Results", orientation="v")

#### 4. Core tweets vocab search

In [None]:
search_args = {}
hs_keywords = set(file_ops.read_csv_file("refined_hs_keywords", settings.TWITTER_SEARCH_PATH))

search_args["biased_embeddings"] = [dep_embeddings["core_hate"], word_embeddings["core_hate"]]
search_args["unbiased_embeddings"] = [dep_embeddings["core_clean"], word_embeddings["core_clean"]]
search_args["freq_vocab_pair"] = [core_tweets_neg_vocab_freqs, unfiltered_stream_neg_vocab_freqs]
search_args["idf_vocab_pair"] = [core_tweets_neg_vocab_idfs, unfiltered_stream_neg_vocab_idfs]
search_args["hs_keywords"] = hs_keywords
search_args["graph_depth"] = 2
search_args["topn"] = 5
search_args["p_at_k_threshold"] = 0.2
search_args["hs_check"] = True

In [None]:
%%time
core_tweets_candidate_codewords, core_tweets_pagerank, core_tweets_candidate_graph, core_tweets_singular_tokens = word_enrichment.candidate_codeword_search(**search_args)

In [None]:
output_name = "core_tweets"
file_ops.write_json_file("core_tweets_candidate_codewords", settings.OUTPUT_PATH, core_tweets_candidate_codewords)
file_ops.write_json_file("core_tweets_pagerank", settings.OUTPUT_PATH, core_tweets_pagerank)
file_ops.write_json_file("core_tweets_singular_tokens", settings.OUTPUT_PATH, core_tweets_singular_tokens)

joblib.dump(core_tweets_candidate_graph, settings.MODEL_PATH + output_name + "_candidate_graph.pkl.compressed", compress=True)
joblib.dump(core_tweets_candidate_codewords, settings.MODEL_PATH + output_name + "_candidate_codewords.pkl.compressed", compress=True)
joblib.dump(core_tweets_pagerank, settings.MODEL_PATH + output_name + "_candidate_pagerank.pkl.compressed", compress=True)

In [None]:
candidates_unbiased_sim_p_at_k = {token:core_tweets_candidate_codewords[token]["p@k_sim_unbiased"][0] for token in core_tweets_candidate_codewords}
candidates_unbiased_rel_p_at_k = {token:core_tweets_candidate_codewords[token]["p@k_rel_unbiased"][0] for token in core_tweets_candidate_codewords}
candidates_biased_sim_p_at_k = {token:core_tweets_candidate_codewords[token]["p@k_sim_biased"][0] for token in core_tweets_candidate_codewords}
candidates_biased_rel_p_at_k = {token:core_tweets_candidate_codewords[token]["p@k_rel_biased"][0] for token in core_tweets_candidate_codewords}

visualization.plot_bar_chart(list(candidates_unbiased_rel_p_at_k.keys()), list(candidates_unbiased_rel_p_at_k.values()), list(candidates_biased_rel_p_at_k.values()), "Unbiased", "Biased", "Core Tweets P@k Comparison", orientation="v")
visualization.plot_basic_bar_chart(list(core_tweets_pagerank.keys()), list(core_tweets_pagerank.values()), "Core Tweets PageRank Results", orientation="v")

#### Candidate code word: monkey

This example shows an interesting contrast.
In the unbiased dataset, both the related and the similar words reflect the general usage of the word **monkey**.

In the biased dataset, however, the same features reveal the HS relation.

This suggests that in the general Twitter stream code words are most often used with their ordinary meaning, whereas a corpus that is dense in HS material exposes their alternate use.
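
One simple way to quantify such a shift in usage is to measure how little a word's neighbours overlap between the two models. The sketch below uses Jaccard overlap for this; it is an illustrative extra rather than a step in the notebook's pipeline, and `neighbour_overlap` and the placeholder tokens are hypothetical.

```python
def neighbour_overlap(biased_neighbours, unbiased_neighbours):
    """Jaccard overlap of two neighbour lists: 1.0 for identical sets, 0.0 for disjoint."""
    a, b = set(biased_neighbours), set(unbiased_neighbours)
    if not a and not b:
        return 1.0  # two empty lists are trivially identical
    return len(a & b) / len(a | b)

# Placeholder tokens, invented for illustration only.
biased = ["w1", "w2", "w3"]
unbiased = ["w3", "w4", "w5"]
overlap = neighbour_overlap(biased, unbiased)
print(overlap)
```

A low overlap only signals that the word is used differently in the two corpora; whether the difference is HS related still has to be judged from the neighbours themselves.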

In [None]:
# candidate_example = pd.DataFrame.from_dict(ds_candidate_codewords["monkeys"], orient='index')
# candidate_example = candidate_example.transpose()

#### Candidate code word: animals

In this example we observe a pattern similar to the one seen with the word **monkey**.

In [None]:
# candidate_example = pd.DataFrame.from_dict(ds_candidate_codewords["animal"], orient='index')
# candidate_example = candidate_example.transpose()

#### Candidate code word: jewess

For this example we have a term that appears to be a good HS candidate. The related words in both the biased and unbiased datasets bear an HS relation.

In [None]:
# candidate_example = pd.DataFrame.from_dict(mhs_candidate_codewords["jewess"], orient='index')
# candidate_example = candidate_example.transpose()