<span style="font-family:Papyrus; font-size:1.5em;">This notebook tests the hypothesis that the name of a document category is more closely related to documents of that category than to documents of other categories.</span>

$W = \{w_1, w_2, \ldots, w_m\}$, $W$ is the vocabulary set.

$D = \{d_1, d_2, \ldots, d_n\}$, $d_i \in 2^W$, $D$ is the set of documents, $2^W$ is the set of all subsets of $W$.

$Y = \{y_1, y_2, \ldots, y_n\}$, $y_i \in \{1, 2, \ldots, k\}$, $Y$ is the set of labels of documents.

$\nu : \{1, 2, \ldots, k\} \to 2^W$, $\nu(j)$ is the name of the $j$-th category, $j \in \{1, 2, \ldots, k\}$.

$\mu : 2^W \times 2^W \to \mathbb{R}$, $\mu$ is some measure of relatedness between two texts.

Hypothesis: $\mu(\nu(y_i), d_i) > \mu(\nu(y_i), d_j)$ $\forall j : y_j \neq y_i$. In other words, the name of a document's category should relate more to the text of that document than to the text of any document from another category.
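To make the inequality concrete, here is a minimal sketch that checks it on toy data. The Jaccard similarity used for $\mu$, the helper names and the toy documents are assumptions made for illustration only; the notebook itself measures relatedness with annotated suffix trees.

```python
from typing import Callable, Dict, FrozenSet, List

Text = FrozenSet[str]  # a text is modelled as a set of words, i.e. an element of 2^W


def jaccard(a: Text, b: Text) -> float:
    """Toy stand-in for mu (an assumption): |a & b| / |a | b|, or 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hypothesis_holds(
    docs: List[Text],
    labels: List[int],
    names: Dict[int, Text],
    mu: Callable[[Text, Text], float] = jaccard,
) -> bool:
    """Check mu(nu(y_i), d_i) > mu(nu(y_i), d_j) for all i and all j with y_j != y_i."""
    for i, d_i in enumerate(docs):
        name = names[labels[i]]
        own_score = mu(name, d_i)
        for j, d_j in enumerate(docs):
            if labels[j] != labels[i] and own_score <= mu(name, d_j):
                return False
    return True


# Hypothetical toy data: two categories, three documents.
names = {1: frozenset({"loan", "debt"}), 2: frozenset({"land", "lease"})}
docs = [
    frozenset({"loan", "debt", "bank"}),
    frozenset({"land", "lease", "plot"}),
    frozenset({"debt", "court"}),
]
labels = [1, 2, 1]

print(hypothesis_holds(docs, labels, names))
```

Note that the strict pairwise form is demanding: a single document that scores higher against a foreign category name is enough to falsify it, which is why the notebook later compares averages per category instead.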

The experiment is carried out on a small collection of data parsed from kad.arbitr.com.

In [1]:
import sys
import os

sys.path.append(
    os.path.join(
        os.environ['CONDA_PREFIX'], 
        "bin/AST-text-analysis"
    )
)


import ast_local
# Get east from https://github.com/mikhaildubov/AST-text-analysis
# WARNING: do not install east using pip
import east

import json
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import load_data
import data_utils

Loading data

In [2]:
taxonomy_df = pd.read_csv('data/taxonomies/taxonomy_df.csv')
cat_indices_str = ['28', '29', '30', '31', '32']
cat_IDs = []
cat_names = []
for cat_idx in cat_indices_str:
    # Look up the taxonomy row for this category code
    row = taxonomy_df[taxonomy_df['cCode'] == cat_idx]
    # .iloc[0] extracts the scalar; int() on a one-element Series is deprecated in recent pandas
    cat_IDs.append(int(row['ID'].iloc[0]))
    cat_names.append(row['Descr'].iloc[0])

Reading the labels in order to select the indices of documents that belong to the categories in the cat_indices_str list

In [3]:
READ_DIR ='data/cases_small_preprocessedLEMM/'

Y = []
for _, y in load_data.yield_preprocessed_json_from_disk(READ_DIR):
    Y.append(y)
Y = np.array(Y)
n_docs = len(Y)

100%|█████████████████████████████████████████████████████████████████████████████████| 113/113 [00:06<00:00, 17.97it/s]


Mapping case numbers to category IDs.

In [4]:
cases_info = pd.read_csv('data/cases_info.csv')
cases_d = dict(zip(cases_info['Number'], cases_info['CategoryID']))
del cases_info

In [5]:
Y_id = np.array(list(map(lambda case: cases_d[case], Y)))
indices = np.array([i for i in range(len(Y_id)) if Y_id[i] in cat_IDs])

For each category in the cat_indices_str list, we sample a fixed number of documents, equal to the size of the smallest of these categories, and build a new list of indices

In [6]:
n_docs_per_cat = min(data_utils.build_count_dict(Y_id[indices]).values())
indices_new = []
for catID in cat_IDs:
    indices_new += list(
        np.random.choice(indices[Y_id[indices] == catID], size=n_docs_per_cat,
                        replace=False)
    )
indices_new = np.array(sorted(indices_new))
assert len(indices_new) == n_docs_per_cat*len(cat_IDs)
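The balancing step can be illustrated on toy labels. The sketch below is self-contained and does not use the notebook's data_utils helper; the function name balanced_indices and the toy label array are hypothetical. It also uses a seeded generator so the sample is reproducible, whereas the cell above draws without a fixed seed.

```python
import numpy as np


def balanced_indices(labels, categories, rng):
    """Draw the same number of indices from each category, without replacement.

    The per-category sample size is the size of the smallest category.
    """
    labels = np.asarray(labels)
    # Positions of the documents belonging to each requested category
    pools = [np.flatnonzero(labels == c) for c in categories]
    n_per_cat = min(len(pool) for pool in pools)
    chosen = [rng.choice(pool, size=n_per_cat, replace=False) for pool in pools]
    return np.sort(np.concatenate(chosen)), n_per_cat


rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility
toy_labels = np.array([7, 7, 7, 8, 8, 9, 9, 9, 9, 5])
idx, n_per_cat = balanced_indices(toy_labels, [7, 8, 9], rng)

print(n_per_cat, toy_labels[idx])
```

Because every category contributes the same number of texts, the per-category averages computed later are not dominated by the largest category.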

Reading the texts from disk and preparing them for the ASTs

In [7]:
texts = []
Y_id_s = []
selected = set(indices_new.tolist())  # a set gives O(1) membership tests
for j, (x, y) in enumerate(load_data.yield_preprocessed_json_from_disk(READ_DIR)):
    if j in selected:
        texts.append(x)
        Y_id_s.append(cases_d[y])
Y_id_s = np.array(Y_id_s)
# Category names, preprocessed and joined back into strings
topics = [' '.join(data_utils.preprocess_text(cat)) for cat in cat_names]

100%|█████████████████████████████████████████████████████████████████████████████████| 113/113 [00:07<00:00, 14.86it/s]


Building ASTs for texts and fitting them

In [8]:
# measuring the relevance of each category name to each text using Annotated Suffix Trees
ast_transformer = ast_local.AST()
ast_transformer.fit(texts, topics)
rel_mat = ast_transformer.relevance_matrix

BUILDING AST'S FOR TEXTS


100%|█████████████████████████████████████████████████████████████████████████████████| 275/275 [00:17<00:00, 16.03it/s]


BUILDING relevance_matrix


100%|█████████████████████████████████████████████████████████████████████████████████| 275/275 [00:07<00:00, 37.61it/s]


For each category in cat_indices_str, averaging the relevance scores of every category name over the texts of that category

In [9]:
# computing the mean relevance of each category name to the texts of its own category and to the texts of the other categories, to check the hypothesis
measure_mat = np.zeros(shape=(len(cat_indices_str), len(cat_indices_str)))
for i, cat_idx in enumerate(cat_indices_str):
    catID = cat_IDs[i]
    text_mask = (Y_id_s == catID)
    
    cat_scores = np.mean(
        rel_mat[text_mask],
        axis=0
    )
    measure_mat[i] = cat_scores
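The averaging step can be traced on a small example. The sketch assumes that the relevance matrix has one row per text and one column per category name, which is how the notebook indexes it but is not confirmed against ast_local here; all values are made up.

```python
import numpy as np

# Hypothetical relevance matrix: 4 texts (rows) x 2 category names (columns)
rel = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.6],
    [0.4, 0.8],
])
text_labels = np.array([10, 10, 20, 20])  # category ID of each text
cat_ids = [10, 20]                        # column order of the category names

# Row i holds the mean relevance of every category name over the texts of category i
measure = np.vstack([rel[text_labels == c].mean(axis=0) for c in cat_ids])
print(measure)
```

Entry (i, j) of the result is therefore the average relevance of the j-th category name to texts of the i-th category, and the diagonal holds the "own name" scores.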

In [10]:
measure_mat

array([[0.37262633, 0.31574915, 0.26968069, 0.33186648, 0.37404891],
       [0.39602422, 0.50942503, 0.36475667, 0.42019368, 0.41566347],
       [0.38110952, 0.35811023, 0.50288431, 0.4301786 , 0.40570195],
       [0.37415341, 0.34817663, 0.38556567, 0.42427418, 0.39743716],
       [0.36314742, 0.32716192, 0.34312918, 0.31857624, 0.58047568]])
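Whether the diagonal dominates can be checked mechanically rather than by eye. The sketch below copies the matrix printed above and tests, for every row and every column, whether the largest entry sits on the diagonal. It assumes that rows index text categories and columns index category names.

```python
import numpy as np

measure_mat = np.array([
    [0.37262633, 0.31574915, 0.26968069, 0.33186648, 0.37404891],
    [0.39602422, 0.50942503, 0.36475667, 0.42019368, 0.41566347],
    [0.38110952, 0.35811023, 0.50288431, 0.4301786,  0.40570195],
    [0.37415341, 0.34817663, 0.38556567, 0.42427418, 0.39743716],
    [0.36314742, 0.32716192, 0.34312918, 0.31857624, 0.58047568],
])
positions = np.arange(len(measure_mat))

# Row check: for the texts of a category, is its own name the most relevant name?
row_ok = measure_mat.argmax(axis=1) == positions
# Column check: for a category name, are its own texts the most relevant texts?
col_ok = measure_mat.argmax(axis=0) == positions

print(row_ok.tolist())  # [False, True, True, True, True]
print(col_ok.tolist())  # [False, True, True, False, True]
```

Under the stated orientation, the column check is the one that mirrors the hypothesis as formalized (a fixed category name compared across documents), while the row check compares different names for a fixed group of texts.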

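<span style="font-family:Papyrus; font-size:1.5em;">**Conclusion**: In four of the five rows the diagonal element is the largest value; the exception is the first row, where the diagonal element (0.3726) is slightly smaller than the last entry (0.3740). This suggests that category names and the texts of their own category are, on average, more closely related than names and texts of different categories, which lends partial support to the hypothesis.</span>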

# Predicting the category of a text from its relevance to the different category names

In [11]:
SAMPLE_SIZE = 2500
sample_indices = set(
    np.random.choice(np.arange(n_docs), size=SAMPLE_SIZE, replace=False).tolist()
)
X_sub = []
Y_id_sub = []
for j, (x_str, y_case) in enumerate(load_data.yield_preprocessed_json_from_disk(READ_DIR)):
    if j in sample_indices:
        y_ID = cases_d[y_case]
        # Skip cases whose category ID is -1
        if y_ID != -1:
            X_sub.append(x_str)
            Y_id_sub.append(y_ID)
Y_id_sub = np.array(Y_id_sub)
    
cat_IDs_sub = list(set(Y_id_sub))

cat_indices_str_sub = []
cat_names_sub = []
for catID in cat_IDs_sub:
    cat_name = (taxonomy_df[taxonomy_df['ID']==catID]['Descr'].values[0])
    cat_idx =  (taxonomy_df[taxonomy_df['ID']==catID]['cCode'].values[0])
    cat_indices_str_sub.append(cat_idx)
    cat_names_sub.append(cat_name)

topics_sub  = [' '.join(data_utils.preprocess_text(cat)) for cat in cat_names_sub]

100%|█████████████████████████████████████████████████████████████████████████████████| 113/113 [00:07<00:00, 14.86it/s]


In [12]:
# measuring the relevance of each category name to each text using Annotated Suffix Trees
ast_transformer_s = ast_local.AST()
ast_transformer_s.fit(X_sub, topics_sub)
rel_mat_sub = ast_transformer_s.relevance_matrix

BUILDING AST'S FOR TEXTS


100%|███████████████████████████████████████████████████████████████████████████████| 2500/2500 [01:48<00:00, 22.95it/s]


BUILDING relevance_matrix


100%|███████████████████████████████████████████████████████████████████████████████| 2500/2500 [04:24<00:00,  9.45it/s]


In [13]:
predictions = rel_mat_sub.argmax(axis=1)
acc = 0
for i in range(len(predictions)):
    predID = cat_IDs_sub[predictions[i]]
    trueID = Y_id_sub[i]
    acc += (trueID == predID)
acc = acc / len(predictions)
print('Accuracy :: ', acc)

Accuracy ::  0.3372
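The prediction rule used above (assign each text to the category whose name is most relevant to it) can also be written without an explicit loop. This is a sketch on made-up numbers; the function name argmax_accuracy and the toy arrays are hypothetical, and the relevance matrix is again assumed to have one row per text and one column per category name.

```python
import numpy as np


def argmax_accuracy(rel, column_ids, true_ids):
    """Accuracy of predicting, for each row, the ID of the column with the highest score."""
    predicted = np.asarray(column_ids)[np.asarray(rel).argmax(axis=1)]
    return float(np.mean(predicted == np.asarray(true_ids)))


# Hypothetical data: 4 texts, 3 category names with IDs 11, 22, 33
rel = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
])
column_ids = [11, 22, 33]
true_ids = [11, 22, 22, 33]

acc = argmax_accuracy(rel, column_ids, true_ids)
print('Accuracy :: ', acc)
```

When reading such a score, it helps to compare it with the accuracy of always predicting the most frequent category in the sample, since the categories here are not balanced.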
