## TF-IDF Cluster Family Inspection

In this notebook, we compute the Term Frequency-Inverse Document Frequency (TF-IDF) statistics
used to validate the cluster family names reported in the SI.

Executing this notebook requires access to the text data contained in the individual clusters,
which is not provided in the data accompanying the paper.
For the United States, the input data can be computed by running our preprocessing and clustering pipelines on the publicly available XML
from the Office of the Law Revision Counsel.
For Germany, we cannot make the input data available due to licensing restrictions.

### Preparations

In [None]:
import networkx as nx
from gensim.utils import simple_preprocess
from gensim import corpora, models
import pandas as pd

from legal_data_clustering.utils.graph_api import cluster_families

In [None]:
# Select the country: 'us' (United States) or 'de' (Germany)

dataset = 'us'
# dataset = 'de'

base_path = f'../../legal-networks-data/{dataset}/'

### Computing the statistics

In [None]:
import gzip, pickle  # nx.read_gpickle was removed in NetworkX 3.0

with gzip.open(
    base_path+'13_cluster_evolution_graph/all_0-0_1-0_-1_a-infomap_n100_m1-0_s0_c1000.gpickle.gz', 'rb'
) as f:
    G = pickle.load(f)

In [None]:
# Keep the first 20 families; a new name avoids shadowing the imported function,
# so this cell can be re-run without a TypeError
families = cluster_families(G, threshold=.15)[:20]
leading_clusters = [c[0] for c in families]

In [None]:
def read_cluster_texts(node):
    # Split the node id into year and cluster at the underscore
    year, cluster = node.split('_')
    with open(f'{base_path}12_cluster_texts/{year}_0-0_1-0_-1_a-infomap_n100_m1-0_s0_c1000/community_{cluster}.txt') as f:
        return f.read()

# One concatenated text per cluster family
family_nodes = [
    ' '.join(read_cluster_texts(c) for c in clusters)
    for clusters in families
]

In [None]:
def preprocess(x):
    res = simple_preprocess(x)
    print('done')
    return res

cluster_families_preprocessed = [preprocess(doc) for doc in family_nodes]
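
To illustrate what the tokenization step produces, here is a rough pure-Python approximation of `simple_preprocess` with its default length limits (tokens of 2 to 15 characters). The function `approx_preprocess` is a hypothetical stand-in written for this example, not gensim's implementation, and it may differ from gensim on edge cases such as underscores or accented characters.

```python
import re


def approx_preprocess(text, min_len=2, max_len=15):
    """Hypothetical approximation: lowercase, keep runs of letters, filter by length."""
    # [^\W\d_]+ matches runs of Unicode letters (word characters minus digits and '_')
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = approx_preprocess("The Secretary shall, within 30 days, notify a State.")
print(tokens)
```

Numbers, punctuation, and one-letter words are dropped, so each family document becomes a plain list of lowercase word tokens.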

In [None]:
# Drop the reference to the raw texts so their memory can be reclaimed
family_nodes = None

In [None]:
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in cluster_families_preprocessed]
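
The bag-of-words step turns each token list into sparse `(token_id, count)` pairs while growing a shared vocabulary. The sketch below mimics that idea with the standard library only; `to_bow` and `token2id` are hypothetical names for this illustration, and gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    for tok in tokens:
        # Assign the next free integer id to tokens not seen before
        token2id.setdefault(tok, len(token2id))
    counts = Counter(token2id[tok] for tok in tokens)
    return sorted(counts.items())


token2id = {}  # shared vocabulary, updated in place (analogous to allow_update=True)
docs = [["tax", "income", "tax"], ["court", "tax"]]
bow = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

Only terms that occur in a document appear in its pair list, which keeps the representation compact for large vocabularies.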

In [None]:
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')
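
In SMART notation, `'ntc'` stands for raw term frequency (`n`), inverse document frequency (`t`), and cosine normalization (`c`). The following is a minimal sketch of that weighting in its textbook form, with idf taken as log2(N / df); the exact idf variant used by gensim may differ slightly, and `tfidf_ntc` is a hypothetical helper written for this example.

```python
import math


def tfidf_ntc(bow_corpus):
    """Textbook tf * log2(N / df) weights with cosine normalization per document."""
    n_docs = len(bow_corpus)
    # Document frequency: number of documents containing each term id
    df = {}
    for doc in bow_corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    weighted = []
    for doc in bow_corpus:
        raw = [(t, tf * math.log2(n_docs / df[t])) for t, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        # Drop zero weights; a document with only zero weights becomes empty
        weighted.append([(t, w / norm) for t, w in raw if w > 0] if norm > 0 else [])
    return weighted


weights = tfidf_ntc([[(0, 2), (1, 1)], [(0, 1), (2, 1)]])
print(weights)
```

Under this form, a term that occurs in every document receives weight zero, so vocabulary shared by all cluster families drops out and the highest-weighted terms are those that distinguish one family from the others.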

In [None]:
# Map term ids back to words; the values are TF-IDF weights, not raw frequencies
data = [
    {dictionary[key]: weight for key, weight in doc}
    for doc in tfidf[BoW_corpus][:20]
]

In [None]:
# Sort each family's terms by descending TF-IDF weight
data_sorted = [
    sorted(cluster_family.items(), key=lambda y: y[-1], reverse=True)
    for cluster_family in data
]

In [None]:
# One column per family (named after its leading cluster) with its 250 top-weighted terms
df = pd.DataFrame({
    leading: [word for word, _ in fam_data[:250]]
    for leading, fam_data in zip(leading_clusters, data_sorted)
})
df.to_csv(f'../graphics/tfidf_cluster_family_inspection_{dataset}.csv')

### End