<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Nuclear Incidents
  </div> 

  
<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Topic modeling - Hierarchical Visualization
  </div> 


  <div style=" float:left; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jean-baptiste AUJOGUE
  </div> 
  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  Jan 2023
  </div> 

<a id="TOC"></a>

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
import warnings
import os

# data 
import numpy as np
import pandas as pd

# viz
from vega import Vega

warnings.filterwarnings("ignore")
print('python version :', sys.version)

**Path to data directory**

In [None]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'data', 'processed')

In [None]:
path_to_repo

<a id="classification"></a>

# 1. Topics at different levels

[Table of Contents](#TOC)

We load the topics assigned at three levels of granularity (titles, paragraphs and spans); they are later cast into a JSON hierarchy:

In [None]:
df_text_topics = pd.read_excel(os.path.join(path_to_data, 'source_titles_topics.xlsx'))
df_para_topics = pd.read_excel(os.path.join(path_to_data, 'source_paragraphs_topics.xlsx'))
df_span_topics = pd.read_excel(os.path.join(path_to_data, 'source_spans_topics.xlsx'))

In [None]:
df_text_topics.head(3)

In [None]:
df_para_topics.head(3)

In [None]:
df_span_topics.head(3)

# 2. Hierarchical visualization

[Table of Contents](#TOC)

Put the topics into a parent-child format.

For each text-level topic, count the paragraphs of each paragraph-level topic that belong to it, and assign an id to each resulting node. The same is done for span-level topics under each paragraph-level topic.

TODO: the parent-child connectivity should be decided using:
    - a TF-IDF matrix built on the list of parents, each described as a bag of words (BOW) of its children
    - the top N children in the TF-IDF weighting of any given parent
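The TODO above could be sketched as follows in plain Python. Each parent topic is treated as a document whose "words" are the topics of its children. The names `parent2children` and `tfidf_top_children` and the toy topic labels are hypothetical, and the weighting (raw count times log inverse document frequency) is only one common TF-IDF variant, not necessarily the one intended here.

```python
import math
from collections import Counter

def tfidf_top_children(parent2children, top_n=2):
    """Rank the child topics of each parent by a TF-IDF weight and keep the top_n.

    parent2children maps a parent topic to the list of child topics observed
    under it (one entry per child item, so repetitions carry the counts).
    """
    n_parents = len(parent2children)
    counts = {p: Counter(children) for p, children in parent2children.items()}
    # document frequency: number of parents under which a child topic appears
    df = Counter(c for cnt in counts.values() for c in cnt)
    top = {}
    for p, cnt in counts.items():
        weights = {c: tf * math.log(n_parents / df[c]) for c, tf in cnt.items()}
        # highest weight first, ties broken alphabetically
        ranked = sorted(weights, key=lambda c: (-weights[c], c))
        top[p] = ranked[:top_n]
    return top

# toy data (hypothetical topic labels)
parent2children = {
    'reactor': ['cooling', 'cooling', 'leak', 'staff'],
    'transport': ['leak', 'truck', 'truck', 'staff'],
}
top_children = tfidf_top_children(parent2children, top_n=1)
```

With this weighting, a child topic that occurs under every parent gets a weight of zero, so only children that are distinctive of a parent are kept.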

In [None]:
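def invert_dict(k2v):
    """Map each value to the list of keys carrying it, with values sorted by (length, name)."""
    v2ks = {}
    for k, v in k2v.items():
        v2ks.setdefault(v, []).append(k)
    return {v: v2ks[v] for v in sorted(v2ks, key = lambda s: (len(s), s))}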
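As a quick illustration of what `invert_dict` returns, the sketch below applies a function with the same behavior to a small mapping from document ids to topic labels. The ids and labels are made up for the example.

```python
def invert_dict(k2v):
    # same behavior as the function defined above
    vs = sorted(list(set(k2v.values())), key = lambda s: (len(s), s))
    return {v: [k for k in k2v if k2v[k] == v] for v in vs}

# hypothetical document ids and topic labels
toy_id2topic = {1: 'leak', 2: 'fire', 3: 'leak', 4: 'shutdown'}
toy_topic2ids = invert_dict(toy_id2topic)
```

Topics come out ordered by label length and then alphabetically, and each topic lists its ids in their original order.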

In [None]:
text_id2topic = dict(zip(df_text_topics.Doc_id, df_text_topics.topic_NMF))
para_id2topic = dict(zip(zip(df_para_topics.Doc_id, df_para_topics.Para_id), df_para_topics.topic_LSA))
span_id2topic = dict(zip(zip(df_span_topics.Doc_id, df_span_topics.Para_id, df_span_topics.Sent_id, df_span_topics.Span_id), df_span_topics.topic_LSA))

In [None]:
text_topic2ids = invert_dict(text_id2topic)
para_topic2ids = invert_dict(para_id2topic)
span_topic2ids = invert_dict(span_id2topic)

In [None]:
text_topic2ids

In [None]:
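# Build a flat list of parent-child records, as consumed by the "stratify" transform of the spec below
topics_json = [{'id' : 1, 'name': 'Root', 'size': 1}]
for t_i, (t_topic, t_ids) in enumerate(text_topic2ids.items()):
    t_id = t_i + 2
    t_ids = set(t_ids)  # set membership is much faster than list membership
    topics_json += [{'id' : t_id, 'name': t_topic, 'size': len(t_ids), 'parent': 1}]

    for p_i, (p_topic, p_ids) in enumerate(para_topic2ids.items()):
        # paragraphs of this paragraph-level topic that belong to a document of the current text-level topic
        tp_ids = {p for p in p_ids if p[0] in t_ids}
        p_size = len(tp_ids)
        if p_size > 10:
            p_id = int(t_id*1e4 + p_i+2)
            topics_json += [{'id' : p_id, 'name': p_topic, 'size': p_size, 'parent': t_id}]

            for sp_i, (sp_topic, sp_ids) in enumerate(span_topic2ids.items()):
                # spans of this span-level topic located in one of the paragraphs selected above
                # (restricted to tp_ids so that counts are specific to the current text-level topic)
                sp_size = len([sp for sp in sp_ids if tuple(sp[:2]) in tp_ids])
                if sp_size > 50:
                    topics_json += [{'id': int(t_id*1e7 + (p_i+2)*1e4 + sp_i+2), 'name': sp_topic, 'size': sp_size, 'parent': p_id}]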
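The records built above are meant for a stratify-style transform, which generally needs unique ids, a single root, and parents that refer to existing ids. A minimal sanity check along these lines could be run before rendering; `check_hierarchy` and the toy records are hypothetical and not part of the original notebook.

```python
def check_hierarchy(records):
    """Return a list of problems found in a flat list of parent-child records."""
    problems = []
    ids = [r['id'] for r in records]
    if len(ids) != len(set(ids)):
        problems.append('duplicate ids')
    # the root is the only record without a 'parent' key
    roots = [r['id'] for r in records if 'parent' not in r]
    if len(roots) != 1:
        problems.append('expected exactly one root, found %d' % len(roots))
    # every parent must be the id of some record
    missing = sorted({r['parent'] for r in records if 'parent' in r} - set(ids))
    if missing:
        problems.append('unknown parent ids: %s' % missing)
    return problems

# toy records following the same layout as topics_json
toy_records = [
    {'id': 1, 'name': 'Root', 'size': 1},
    {'id': 2, 'name': 'topic a', 'size': 12, 'parent': 1},
    {'id': 20002, 'name': 'topic b', 'size': 11, 'parent': 2},
]
ok = check_hierarchy(toy_records)
broken = check_hierarchy(toy_records + [{'id': 3, 'name': 'orphan', 'size': 1, 'parent': 99}])
```

On the real data, `check_hierarchy(topics_json)` returning an empty list would indicate that the id scheme produced no collisions and no dangling parents.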

In [None]:
len(topics_json)

In [None]:
topics_json[:5]

In [None]:
# import json

# with open('flare.json') as f:
#     data_json = json.load(f)

In [None]:
# example taken from https://vega.github.io/vega/examples/tree-layout/
# with data given at https://vega.github.io/vega/data/flare.json

spec = {
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "description": "An example of Cartesian layouts for a node-link diagram of hierarchical data.",
  "width": 600,
  "height": 50000, # 1600,
  "padding": 5,

  "signals": [
    {
      "name": "labels", "value": True,
      "bind": {"input": "checkbox"}
    },
    {
      "name": "layout", "value": "tidy",
      "bind": {"input": "radio", "options": ["tidy", "cluster"]}
    },
    {
      "name": "links", "value": "diagonal",
      "bind": {
        "input": "select",
        "options": ["line", "curve", "diagonal", "orthogonal"]
      }
    },
    {
      "name": "separation", "value": False,
      "bind": {"input": "checkbox"}
    }
  ],

  "data": [
    {
      "name": "tree",
      "values": topics_json, # data_json,
      "transform": [
        {
          "type": "stratify",
          "key": "id",
          "parentKey": "parent"
        },
        {
          "type": "tree",
          "method": {"signal": "layout"},
          "size": [{"signal": "height"}, {"signal": "width - 100"}],
          "separation": {"signal": "separation"},
          "as": ["y", "x", "depth", "children"]
        }
      ]
    },
    {
      "name": "links",
      "source": "tree",
      "transform": [
        { "type": "treelinks" },
        {
          "type": "linkpath",
          "orient": "horizontal",
          "shape": {"signal": "links"}
        }
      ]
    }
  ],

  "scales": [
    {
      "name": "color",
      "type": "linear",
      "range": {"scheme": "magma"},
      "domain": {"data": "tree", "field": "depth"},
      "zero": True
    }
  ],

  "marks": [
    {
      "type": "path",
      "from": {"data": "links"},
      "encode": {
        "update": {
          "path": {"field": "path"},
          "stroke": {"value": "#ccc"}
        }
      }
    },
    {
      "type": "symbol",
      "from": {"data": "tree"},
      "encode": {
        "enter": {
          "size": {"value": 100},
          "stroke": {"value": "#fff"}
        },
        "update": {
          "x": {"field": "x"},
          "y": {"field": "y"},
          "fill": {"scale": "color", "field": "depth"}
        }
      }
    },
    {
      "type": "text",
      "from": {"data": "tree"},
      "encode": {
        "enter": {
          "text": {"field": "name"},
          "fontSize": {"value": 9},
          "baseline": {"value": "middle"}
        },
        "update": {
          "x": {"field": "x"},
          "y": {"field": "y"},
          "dx": {"signal": "datum.children ? -7 : 7"},
          "align": {"signal": "datum.children ? 'right' : 'left'"},
          "opacity": {"signal": "labels ? 1 : 0"}
        }
      }
    }
  ]
}


In [None]:
# vegascope must be installed in the environment for this cell to run
import vegascope

In [None]:
canvas = vegascope.LocalCanvas()

In [None]:
canvas.how()

In [None]:
canvas(spec)
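If no local renderer is available, one possible fallback is to serialize the specification to JSON and open the file in any Vega-compatible viewer, such as the online Vega editor. The sketch below uses a reduced stand-in for `spec` and a hypothetical output file name. Note that Python's `True` and `False` become JSON `true` and `false` on export.

```python
import json
import os
import tempfile

# reduced stand-in for the full `spec` dictionary (assumption: same overall layout)
toy_spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "width": 600,
    "signals": [{"name": "labels", "value": True, "bind": {"input": "checkbox"}}],
    "data": [{"name": "tree", "values": [{"id": 1, "name": "Root", "size": 1}]}],
}

# hypothetical output location
out_path = os.path.join(tempfile.gettempdir(), 'topics_tree.vg.json')
with open(out_path, 'w') as f:
    json.dump(toy_spec, f, indent=2)

# read the file back to confirm that the export round-trips
with open(out_path) as f:
    reloaded = json.load(f)
```

In the notebook itself, `toy_spec` would be replaced by `spec`, which already embeds `topics_json` under `data`.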

<a id="bottom"></a>

[Table of Contents](#TOC)