# The full Python workflow for semasiological token-level clouds on a loop

This notebook reproduces the code in the [single-lemma workflow](createClouds.ipynb) but on a loop across multiple lemmas, as it was run for [*Cloudspotting. Visual Analytics for Distributional Semantics*](https://cloudspotting.marianamontes.me/). The code is not run here, only shown with dummy file paths.

## 0. Initial setup 

In [None]:
import os
import sys
import logging
import pandas as pd
sys.path.append('/path/to/nephosem/')  # path to the nephosem repository
sys.path.append('path/to/scripts/')  # path to the semasioFlow repository

In [None]:
from semasioFlow import ConfigLoader
from semasioFlow.load import loadVocab, loadMacro, loadColloc, loadFocRegisters
from semasioFlow.sample import sampleTypes
from semasioFlow.focmodels import createBow, createRel, createPath
from semasioFlow.socmodels import targetPPMI, weightTokens, createSoc
from semasioFlow.utils import plotPatterns

## 1. Configuration 

Depending on what you need, you will have to set up some path settings.
I like to have at least the path to my project (`mydir`), an output path within it (`mydir + "output"`) and a GitHub path for the datasets that I will use in the visualization. The main reason not to keep everything together is that the GitHub content will be public, so huge data files should not be included there. (How much do we want to have public?)

In [None]:
mydir = "../"
output_path = f"{mydir}/output/"
nephovis_path = f"{mydir}/for-nephovis/"
logging.basicConfig(filename = f'{mydir}/mylog.log', filemode = 'w', level = logging.DEBUG)

In [None]:
necessary_subfolders = ['vocab', 'cws', 'registers', 'tokens']
for sf in necessary_subfolders:
    if not os.path.exists(output_path + sf):
        os.makedirs(output_path + sf)

The path variables are just meant to make it easier to manipulate filenames. The most important concrete step is to adapt the configuration file.
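
As an alternative to concatenating strings, the same folder structure could be set up with `pathlib`, which avoids doubled or missing slashes. This is only a sketch: `make_output_dirs` is a hypothetical helper, not part of semasioFlow, and it is run here on a temporary directory instead of the real project folder.

```python
import tempfile
from pathlib import Path

def make_output_dirs(project_dir, subfolders=('vocab', 'cws', 'registers', 'tokens')):
    """Create `output/<subfolder>` under `project_dir` and return the output path.

    Hypothetical helper for illustration; not part of semasioFlow.
    """
    output_dir = Path(project_dir) / "output"
    for sf in subfolders:
        # parents=True creates `output/` itself; exist_ok=True makes reruns harmless
        (output_dir / sf).mkdir(parents=True, exist_ok=True)
    return output_dir

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out = make_output_dirs(tmp)
    created = sorted(p.name for p in out.iterdir())
print(created)
```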

In [None]:
conf = ConfigLoader()
default_settings = conf.settings
conf = ConfigLoader()
settings = conf.update_config('config.ini')
settings['output-path'] = output_path
corpus_name = 'corpus_name'

print(settings['line-machine'])
print(settings['global-columns'])
print(settings['type'], settings['colloc'], settings['token'])

## 2. Frequency lists

The frequency lists are the first thing to create, but once you have them, you just load them. So here we define the filename where we *would* store the frequency list (in this case, where it is actually stored): if the file exists, the code below loads it; if it doesn't, it creates and stores it.

In [None]:
full_name = f"{output_path}/vocab/{corpus_name}.nodefreq"
print(full_name)
full = loadVocab(full_name, settings)
full
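
The load-or-create logic described above can be sketched in plain Python as follows. This is not how `loadVocab` is implemented; `load_or_create` and `count_lemmas` are hypothetical names, and a JSON file stands in for the `.nodefreq` format.

```python
import json
import os
import tempfile
from collections import Counter

def load_or_create(fname, build):
    """Return the object stored in `fname`; if the file is missing, build and store it first."""
    if os.path.exists(fname):
        with open(fname, 'r', encoding='utf-8') as f:
            return json.load(f)
    data = build()
    with open(fname, 'w', encoding='utf-8') as f:
        json.dump(data, f)
    return data

def count_lemmas():
    # Stand-in for the expensive pass over the corpus
    return dict(Counter(['heet/adj', 'hoop/noun', 'heet/adj']))

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, 'toy.nodefreq.json')
    first = load_or_create(fname, count_lemmas)   # file is missing: counts and stores
    second = load_or_create(fname, lambda: {})    # file exists: `build` is never called
print(first == second)
```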

## 3. Boolean token-level matrices

Even though we first think of the type level and only afterwards of the token level, with this workflow we don't really need to touch the type level until after we obtain the boolean token-level matrices, that is, until we need to use PPMI values to select or weight the context words.

As a first step, we need the type or list of types we want to run; for example `"heet/adj"` or `["vernietig/verb", "verniel/verb"]`, and we subset the vocabulary for that query.

In [None]:
fnames = f"{mydir}/sources/listofnames.fnames"

In [None]:
noun_lemmas = ['horde', 'hoop', 'spot', 'staal', 'stof', 'schaal', 'blik']
adj_lemmas = ['heilzaam', 'hoekig', 'gekleurd', 'dof', 'hachelijk', 'geestig', 'hoopvol',
              'hemels', 'geldig', 'gemeen', 'goedkoop', 'grijs', 'heet']
verb_lemmas = ['herroepen', 'heffen', 'huldigen', 'haten', 'herhalen', 'herinneren',
              'diskwalificeren', 'harden', 'herstellen', 'helpen', 'haken', 'herstructureren']
verb_stems = ['herroep', 'hef', 'huldig', 'haat', 'herhaal', 'herinner',
              'diskwalificeer', 'hard', 'herstel', 'help', 'haak', 'herstructureer']
only_nouns = [(x, [x+'/noun']) for x in noun_lemmas]
only_adjs = [(x, [x+'/adj']) for x in adj_lemmas]
only_verbs = [(x, [y+'/verb']) for x, y in zip(verb_lemmas, verb_stems)]
everything = only_nouns + only_adjs + only_verbs

We could run the workflow on all 10k tokens, or create a random selection of a certain size and then only use the files that contain it. The output of `sampleTypes` includes a list of token IDs as well as the list of filenames that suffices to extract those tokens. We can then use the new list of filenames when we collect tokens, and the list of tokens to subset the resulting matrices.

Of course, to keep the sample fixed it would be more useful to generate the list, store it and then retrieve it in future runs.
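
One way to keep a sample fixed is to draw it once with a seeded random generator, write the token IDs to a file and read that file in later runs. A minimal sketch, independent of `sampleTypes`: `sample_and_store` is a hypothetical helper and the token IDs are invented.

```python
import os
import random
import tempfile

def sample_and_store(token_ids, n, fname, seed=8541):
    """Draw `n` token IDs with a fixed seed and write them to `fname`, one per line."""
    rng = random.Random(seed)  # local generator: does not touch the global random state
    sample = sorted(rng.sample(token_ids, n))
    with open(fname, 'w', encoding='utf-8') as f:
        f.write("\n".join(sample) + "\n")
    return sample

# Invented token IDs, for illustration only
all_ids = [f"heet/adj/file_{i:03d}/{i + 10}" for i in range(100)]

with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "adjIds.txt")
    stored = sample_and_store(all_ids, 5, fname)
    # Later runs only read the file back, one ID per line
    with open(fname, 'r', encoding='utf-8') as f:
        reloaded = [x.strip() for x in f.readlines()]
print(reloaded == stored)
```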

In [None]:
import json
import os.path

with open(f"{mydir}/sources/adjIds.txt", 'r') as f1:
    adjs = [x.strip() for x in f1.readlines()]
with open(f"{mydir}/sources/nounIds.txt", 'r') as f2:
    nouns = [x.strip() for x in f2.readlines()]
with open(f"{mydir}/sources/verbIds.txt", 'r') as f3:
    verbs = [x.strip() for x in f3.readlines()]
tokenlist = adjs + nouns + verbs

# 3. Extract filenames from token IDs and map them to paths ================================
token2fname = {x.split('/')[2]+'.conll' for x in tokenlist} # a set keeps the membership test below fast
with open(fnames, 'r') as q:
    fnameSample = [x.strip() for x in q.readlines() if x.strip().rsplit('/', 1)[1] in token2fname]
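
Judging from the code above, a token ID has the shape `lemma/pos/filename/position`, so the third slash-separated field names the corpus file the token comes from. The sketch below replays the filtering step on invented IDs and paths; the exact ID format is an assumption.

```python
# Invented token IDs and corpus paths, for illustration only
tokenlist_demo = ["heet/adj/file_a/12", "hoop/noun/file_c/7", "heet/adj/file_a/40"]
all_paths = ["/corpus/x/file_a.conll", "/corpus/x/file_b.conll", "/corpus/y/file_c.conll"]

# Third field of the ID + '.conll' gives the basename of the file to read
wanted = {tid.split('/')[2] + '.conll' for tid in tokenlist_demo}

# Keep only the paths whose basename is needed for the sample
fname_sample = [p for p in all_paths if p.rsplit('/', 1)[1] in wanted]
print(fname_sample)  # → ['/corpus/x/file_a.conll', '/corpus/y/file_c.conll']

# Per-lemma subset of the token IDs, selected by the 'lemma/pos' prefix
heet_tokens = [tid for tid in tokenlist_demo if tid.startswith("heet/adj")]
```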

### 3.1 Bag-of-words

In [None]:
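# NOTE: `foc`, the vocabulary of candidate first-order context words, is not defined in this notebook;
# it has to be created before running this cell.
lex_pos = [x for x in foc.get_item_list() if x.split("/")[1] in ["noun", "adj", "verb", "adv"]]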

In [None]:
foc_win = [(3, 3), (5, 5), (10, 10)] # three options of symmetric windows with 3, 5 or 10 words to each side
foc_pos = {
    "all" : foc.get_item_list(), # the filter has already been applied in the FOC list
    "lex" : lex_pos # further filter by part-of-speech
}
bound = { "match" : "^</sentence>$", "values" : [True, False]}

In [None]:
# 4. On a loop per item, create Bow models ================================
for type_name, query_list in everything:
    query = full.subvocab(query_list)
    bowdata = createBow(query, settings, type_name = type_name, fnames = fnameSample, tokenlist = tokenlist,
         foc_win = foc_win, foc_pos = foc_pos, bound = bound)
    
    # 5. Most probably, store register ================================
    models_fname = f"{output_path}/registers/{type_name}.bow-models.tsv"
    bowdata.to_csv(models_fname, sep="\t", index_label = '_model')
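
The three parameter lists above presumably combine into one model per combination, that is 3 windows × 2 part-of-speech filters × 2 sentence-boundary settings = 12 bag-of-words models per lemma. The sketch below only enumerates that grid with `itertools.product`; the model names it builds are hypothetical and do not reproduce the names that `createBow` writes to the register.

```python
from itertools import product

windows = [(3, 3), (5, 5), (10, 10)]   # (left, right) window sizes
pos_filters = ["all", "lex"]           # keys of `foc_pos`
bound_values = [True, False]           # values of `bound`

grid = []
for (left, right), pos, bound_value in product(windows, pos_filters, bound_values):
    # Hypothetical naming scheme, for illustration only
    name = f"bound{bound_value}.win{left}-{right}.{pos}"
    grid.append(name)

print(len(grid))  # → 12
print(grid[0])    # → boundTrue.win3-3.all
```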

### 3.2 Lemmarel

In [None]:
settings['separator-line-machine'] = "^</sentence>$"

In [None]:
# 6. On a loop per item, create Rel models (verbs) ================================
graphml_name = "LEMMAREL.verbs"
templates_dir = f"{mydir}/templates"
rel_macros = [
    ("LEMMAREL1", loadMacro(templates_dir, graphml_name, "LEMMAREL1.verbs")),
    ("LEMMAREL2", loadMacro(templates_dir, graphml_name, "LEMMAREL2.verbs"))
]
for type_name, query_list in only_verbs:
    query = full.subvocab(query_list)
    sub_tokenlist = [x for x in tokenlist if x.startswith(query_list[0])]
    sub_fnameSample = [x for x in fnameSample if x.rsplit("/", 1)[1] in [y.split("/")[2]+'.conll' for y in sub_tokenlist]]
    print(query_list[0])
    reldata = createRel(query, settings, rel_macros, type_name = type_name,
                        fnames = sub_fnameSample, tokenlist = sub_tokenlist, foc_filter = foc.get_item_list())
    
    # 7. Most probably, store register ================================
    models_fname = f"{output_path}/registers/{type_name}.rel-models.tsv"
    reldata.to_csv(models_fname, sep="\t", index_label = '_model')

In [None]:
# 6. On a loop per item, create Rel models (nouns) ================================
graphml_name = "LEMMAREL.nouns"
templates_dir = f"{mydir}/templates"
rel_macros = [
    ("LEMMAREL1", loadMacro(templates_dir, graphml_name, "LEMMAREL1.nouns")),
    ("LEMMAREL2", loadMacro(templates_dir, graphml_name, "LEMMAREL2.nouns")),
    ("LEMMAREL3", loadMacro(templates_dir, graphml_name, "LEMMAREL3.nouns"))
]
for type_name, query_list in only_nouns:
    query = full.subvocab(query_list)
    sub_tokenlist = [x for x in tokenlist if x.startswith(query_list[0])]
    sub_fnameSample = [x for x in fnameSample if x.rsplit("/", 1)[1] in [y.split("/")[2]+'.conll' for y in sub_tokenlist]]
    print(query_list[0])
    reldata = createRel(query, settings, rel_macros, type_name = type_name,
                        fnames = sub_fnameSample, tokenlist = sub_tokenlist, foc_filter = foc.get_item_list())
    
    # 7. Most probably, store register ================================
    models_fname = f"{output_path}/registers/{type_name}.rel-models.tsv"
    reldata.to_csv(models_fname, sep="\t", index_label = '_model')

In [None]:
# 6. On a loop per item, create Rel models (adjectives) ================================
graphml_name = "LEMMAREL.adjs"
templates_dir = f"{mydir}/templates"
rel_macros = [
    ("LEMMAREL1", loadMacro(templates_dir, graphml_name, "LEMMAREL1.adjs")),
    ("LEMMAREL2", loadMacro(templates_dir, graphml_name, "LEMMAREL2.adjs"))
]
for type_name, query_list in only_adjs:
    query = full.subvocab(query_list)
    sub_tokenlist = [x for x in tokenlist if x.startswith(query_list[0])]
    sub_fnameSample = [x for x in fnameSample if x.rsplit("/", 1)[1] in [y.split("/")[2]+'.conll' for y in sub_tokenlist]]
    reldata = createRel(query, settings, rel_macros, type_name = type_name,
                        fnames = sub_fnameSample, tokenlist = sub_tokenlist, foc_filter = foc.get_item_list())
    
    # 7. Most probably, store register ================================
    models_fname = f"{output_path}/registers/{type_name}.rel-models.tsv"
    reldata.to_csv(models_fname, sep="\t", index_label = '_model')

### 3.3 Lemmapath

In [None]:
graphml_name = "LEMMAPATH"
templates_dir = f"{mydir}/templates"
path_templates = [loadMacro(templates_dir, graphml_name, f"LEMMAPATH{i}") for i in [1, 2, 3]]
path_macros = [
    # First group includes templates with one and two steps, no weight
    ("LEMMAPATH2", [path_templates[0], path_templates[1]], None),
    # Second group includes templates with up to three steps, no weight
    ("LEMMAPATH3", [path_templates[0], path_templates[1], path_templates[2]], None),
    # Third group includes templates with up to three steps, with weight
    ("LEMMAPATHweight", [path_templates[0], path_templates[1], path_templates[2]], [1, 0.6, 0.3])
]
settings['separator-line-machine'] = "^</sentence>$"
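
In the weighted group, each template is paired with a weight, so that context words closer to the target in the dependency tree count more. A toy illustration of that idea, assuming the weight is simply looked up by the number of steps in the path; the words and path lengths are invented, and how `createPath` applies the weights internally is not shown here.

```python
# Weight per path length: 1 step -> 1, 2 steps -> 0.6, 3 steps -> 0.3
weights = [1, 0.6, 0.3]

# Invented context words with their syntactic distance (in steps) from the target
path_lengths = {"soep/noun": 1, "eet/verb": 2, "gisteren/adv": 3, "tafel/noun": 4}

token_vector = {}
for cw, steps in path_lengths.items():
    if steps <= len(weights):          # longer paths match no template and are dropped
        token_vector[cw] = weights[steps - 1]

print(token_vector)  # → {'soep/noun': 1, 'eet/verb': 0.6, 'gisteren/adv': 0.3}
```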

In [None]:
# 8. On a loop per item, create Path models ================================
for type_name, query_list in everything:
    query = full.subvocab(query_list)
    sub_tokenlist = [x for x in tokenlist if x.startswith(query_list[0])]
    sub_fnameSample = [x for x in fnameSample if x.rsplit("/", 1)[1] in [y.split("/")[2]+'.conll' for y in sub_tokenlist]]
    pathdata = createPath(query, settings, path_macros, type_name = type_name,
          fnames = sub_fnameSample, tokenlist = sub_tokenlist, foc_filter = foc.get_item_list())
    
    # 9. Most probably, store register ================================
    models_fname = f"{output_path}/registers/{type_name}.path-models.tsv"
    pathdata.to_csv(models_fname, sep="\t", index_label = '_model')

## 4. Weight or booleanize

### 4.1 Create/load collocation matrix

In [None]:
coldir = "/path/to/dataframes/"
freq_fname_CW4 = f"{coldir}/{corpus_name}.fullcorpus_CW4.wcmx.freq.pac" # window size of 4

In [None]:
settings['left-span'] = 4
settings['right-span'] = 4
freqMTX_CW4 = loadColloc(freq_fname_CW4, settings, row_vocab = full)
freqMTX_CW4

In [None]:
freq_fname_CW10 = f"{coldir}/{corpus_name}.fullcorpus_CW10.wcmx.freq.pac" # window size of 10
settings['left-span'] = 10
settings['right-span'] = 10
freqMTX_CW10 = loadColloc(freq_fname_CW10, settings, row_vocab = full)

### 4.2 Register PPMI values
(Done with 4.3)

### 4.3 Implement weighting on selection

In [None]:
from semasioFlow.utils import booleanize

In [None]:
# 11. On a loop per item, row weight models ================================
for type_name, query_list in everything:
    nephovis_type = f"{nephovis_path}/{type_name}"
    ppmi = targetPPMI(query_list, vocabs = {"freq" : full},
               collocs = {"4" : freqMTX_CW4, "10" : freqMTX_CW10},
               type_name = type_name, output_dir = f"{nephovis_type}/",
               main_matrix = "4")
    weighting = {
        "no" : None,
        "selection" : booleanize(ppmi, include_negative=False),
        "weight" : ppmi
    }
    token_dir = f"{output_path}/tokens/{type_name}"
    foc_registers = loadFocRegisters(f"{output_path}/registers", type_name)
    weight_data = weightTokens(token_dir, weighting, foc_registers)
    weight_data["model_register"].to_csv(f"{output_path}/registers/{type_name}.focmodels.tsv", sep = '\t',
                                         index_label = "_model")
    weight_data["token_register"].to_csv(f"{nephovis_type}/{type_name}.variables.tsv", sep = '\t',
                                         index_label = "_id")
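
For reference, pointwise mutual information compares the observed co-occurrence frequency of a target and a context word with the frequency expected if they were independent; PPMI keeps only the positive values. Below is a small pure-Python sketch of that formula and of the "selection" and "weight" options, with invented frequencies. `pmi_row` and `booleanize_row` are hypothetical helpers, not the nephosem or semasioFlow implementations, which may differ in details such as how the marginal frequencies are computed.

```python
import math

def pmi_row(cooc, target_freq, context_freqs, total):
    """PMI between one target and each context word it co-occurs with."""
    return {
        cw: math.log(f_tc * total / (target_freq * context_freqs[cw]))
        for cw, f_tc in cooc.items()
    }

def booleanize_row(row):
    """1 for context words with a positive association value, 0 otherwise."""
    return {cw: int(value > 0) for cw, value in row.items()}

# Invented frequencies, for illustration only
total = 1000                                   # corpus size
cooc = {"soep/noun": 8, "de/det": 10}          # co-occurrences with the target
context_freqs = {"soep/noun": 20, "de/det": 500}
pmi = pmi_row(cooc, target_freq=40, context_freqs=context_freqs, total=total)

ppmi = {cw: max(0.0, v) for cw, v in pmi.items()}   # "weight": keep the positive values
selection = booleanize_row(pmi)                     # "selection": only in or out
print(selection)  # → {'soep/noun': 1, 'de/det': 0}
```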

## 5. Second-order dimensions

In [None]:
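# `selection_without_filters` and `special_selection` are placeholders:
# replace them with the lists of items to use as second-order dimensions
soc_pos = {
    "all" : selection_without_filters,
    "nav" : special_selection
}
lengths = ["FOC", 5000] # a number will take the most frequent; something else will take the FOC items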

In [None]:
# 12. On a loop per item, create Soc models ================================
for type_name, query_list in everything:
    registers = pd.read_csv(f"{output_path}/registers/{type_name}.focmodels.tsv",
                            sep = "\t", index_col = "_model")
    token_dir = f"{output_path}/tokens/{type_name}"
    socdata = createSoc(token_dir, registers = registers,
                        soc_pos = soc_pos, lengths = lengths,
                        socMTX = freqMTX_CW4, store_focdists = f"{output_path}/cws/{type_name}/")
    socdata.to_csv(f"{nephovis_path}/{type_name}/{type_name}.models.tsv", sep = "\t", index_label="_model")
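
The `lengths` option decides which columns the second-order vectors have: with a number, the most frequent items are used as dimensions; with any other value, the first-order context words themselves are used. A minimal sketch of that choice, under the assumption that frequencies are available as a plain dictionary; `select_dimensions` is a hypothetical helper, not the logic inside `createSoc`.

```python
def select_dimensions(length, freqs, foc_items):
    """Pick second-order dimensions: top-`length` by frequency, or the FOC items."""
    if isinstance(length, int):
        ranked = sorted(freqs, key=freqs.get, reverse=True)  # most frequent first
        return ranked[:length]
    return list(foc_items)

# Invented frequency list and FOC items
freqs = {"de/det": 900, "zijn/verb": 700, "soep/noun": 20, "heet/adj": 40}
foc_items = ["soep/noun", "heet/adj"]

for length in ["FOC", 2]:
    print(length, select_dimensions(length, freqs, foc_items))
# → FOC ['soep/noun', 'heet/adj']
# → 2 ['de/det', 'zijn/verb']
```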

## 6. Cosine distances
Once we have all the token-level vectors, as well as our registers,
we can quickly compute and store their cosine distances.

In [None]:
from nephosem import TypeTokenMatrix
from nephosem.specutils.mxcalc import compute_distance

In [None]:
# 13. On a loop per item, compute distances ======================================
input_suffix = ".tcmx.soc.pac" #token by context matrix
output_suffix = ".ttmx.dist.pac" # token by token matrix
for type_name, query_list in everything:
    token_dir = f"{output_path}/tokens/{type_name}"
    socdata = pd.read_csv(f"{nephovis_path}/{type_name}/{type_name}.models.tsv",
                         sep = "\t", index_col = "_model")
    for modelname in socdata.index:
        input_name = f"{token_dir}/{modelname}{input_suffix}"
        output_name = f"{token_dir}/{modelname}{output_suffix}"
        compute_distance(TypeTokenMatrix.load(input_name)).save(output_name)
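
Cosine distance is one minus the cosine of the angle between two vectors, so it ignores vector length and only compares direction. The sketch below computes a token-by-token distance matrix in plain Python for three invented token vectors; it illustrates the measure only and does not reproduce the options or defaults of `compute_distance`.

```python
import math

def cosine_distance(u, v):
    """1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Invented token vectors (rows) over three context dimensions
tokens = {
    "tok1": [1.0, 0.0, 2.0],
    "tok2": [2.0, 0.0, 4.0],   # same direction as tok1: distance close to 0
    "tok3": [0.0, 3.0, 0.0],   # orthogonal to tok1: distance 1
}

# Token-by-token matrix as a nested dictionary
dist = {a: {b: cosine_distance(u, v) for b, v in tokens.items()} for a, u in tokens.items()}
print(dist["tok1"]["tok3"])
```

A token with an all-zero vector (no context words left after filtering) would raise a `ZeroDivisionError` in this sketch, so such rows need to be handled separately.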
    

## Bonus: context word detail

In [None]:
from semasioFlow.contextwords import listContextwords

In [None]:
# On a loop per item, list context words ================================
for type_name, query_list in everything:
    cws = listContextwords(type_name, tokenlist, fnameSample, settings, left_win=15, right_win = 15)
    cw_fname = f"{nephovis_path}/{type_name}/{type_name}.cws.detail.tsv"
    cws.to_csv(cw_fname, sep = "\t", index_label = "cw_id")