# Gaussian Mixture Model for identifying similarities among reports of human rights violations

The model uses the text preprocessor and dataset from the paper [*Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases*](https://arxiv.org/abs/2103.13084), together with Scikit-Learn's model-building tools to train the mixture model.

CITATIONS

Dataset:

- @misc{chalkidis2021paragraphlevel,
      title={Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases}, 
      author={Ilias Chalkidis and Manos Fergadiotis and Dimitrios Tsarapatsanis and Nikolaos Aletras and Ion Androutsopoulos and Prodromos Malakasiotis},
      year={2021},
      eprint={2103.13084},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Dataset Repo:

- https://www.kaggle.com/mathurinache/ecthrnaacl2021

Text Encoding: LEGAL BERT Model series by:

- I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. 
"LEGAL-BERT: The Muppets straight out of Law School". 
In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) 
(Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)

Pretrained Model Repo / Implementation:

- https://huggingface.co/nlpaueb/legal-bert-base-uncased

## Imports

In [None]:
import pandas as pd
import json
import os
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer

# PyTorch
import torch

# Pretrained Transformers from HuggingFace
!pip install transformers
from transformers import AutoTokenizer, AutoModel



In [None]:
from google.colab import drive
drive.mount('/content/drive')

DATASETS_FOLDER = '/content/drive/MyDrive/Colab_Notebooks/models/ReRight/datasets'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Datasets

In [None]:
"""
Dataset: custom Twitter query results
"""

# parsing function
def load_big_qury_json_to_pd(filename):
    with open(filename) as f:

        info = json.load(f)

    # collect each field of the records into one column
    records = info[0]['data']
    info_dict = {key: [record[key] for record in records] for key in records[0]}

    df = pd.DataFrame(info_dict)  # convert to pandas DataFrame
    df = df.sample(frac=1.0,  random_state=222)  # shuffle all data

    return  df

#dataframe_twitter = load_big_qury_json_to_pd(os.path.join(DATASETS_FOLDER, 'results-20210802-171411.json'))

In [None]:
"""
Dataset: *https://www.kaggle.com/mathurinache/ecthrnaacl2021*
"""

# parsing function
# sample_size: at most 1000. Note: the system crashes if the full dataset is used with the BERT transformer
NUM_SAMPLES = 1000

def dict_from_european_court_json(filename='/content/dev.jsonl', sample_size=NUM_SAMPLES):
    

    info_dict = {}

    with open(filename) as f:
        all_info = f.readlines()

        dicts_list = [json.loads(info) for info in all_info]

        for single_dict in dicts_list:
            for key, val in single_dict.items():
                if key in info_dict:
                    info_dict[key].append(val)
                else:
                    info_dict[key] =[val]
        df = pd.DataFrame(info_dict)

    df = df.sample(frac=1.0,  random_state=222)  # shuffle all data
    
    df = df[:sample_size]

    return df

subfolder = 'EuropeanCriminalCourt'
filename = 'dev.jsonl'
dataframe_human_rights = dict_from_european_court_json(os.path.join(DATASETS_FOLDER, subfolder, filename))
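
The parser above reads one JSON object per line and gathers the keys by hand. As a minimal sketch of an alternative, `pandas.read_json` with `lines=True` can read the same JSON Lines layout directly. The function name `load_jsonl_sample` is hypothetical, and `dtype=False` and `convert_dates=False` are passed on the assumption that the raw string values (case numbers, dates) should be left untouched.

```python
import pandas as pd

def load_jsonl_sample(filename, sample_size=1000, random_state=222):
    """Hypothetical helper: load a JSON Lines file, shuffle it, keep the first rows."""
    # lines=True treats every line of the file as one JSON record
    df = pd.read_json(filename, lines=True, dtype=False, convert_dates=False)
    # shuffle all rows reproducibly, then keep at most sample_size of them
    df = df.sample(frac=1.0, random_state=random_state)
    return df.head(sample_size)
```

Calling `load_jsonl_sample(os.path.join(DATASETS_FOLDER, subfolder, filename))` would be the counterpart of the call above, although the two loaders have not been compared on the real file here.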

In [None]:
dataframe_human_rights['facts'] = dataframe_human_rights['facts'].apply(lambda x: ' '.join(x).lower())

In [None]:
print('columns:', dataframe_human_rights.columns)
dataframe_human_rights.head(1)

columns: Index(['case_id', 'case_no', 'title', 'judgment_date', 'facts', 'applicants',
       'defendants', 'allegedly_violated_articles', 'violated_articles',
       'court_assessment_references', 'silver_rationales', 'gold_rationales'],
      dtype='object')


Unnamed: 0,case_id,case_no,title,judgment_date,facts,applicants,defendants,allegedly_violated_articles,violated_articles,court_assessment_references,silver_rationales,gold_rationales
930,001-175663,6131/07,CASE OF KOROBEYNIKOV v. RUSSIA,2017-07-25,5. the applicant was born in 1963 and lives i...,[KOROBEYNIKOV],[RUSSIA],[6],[6],{},[],[]


In [None]:
dataframe_human_rights['facts'].iloc[0]

'5.  the applicant was born in 1963 and lives in pyatigorsk. 6.  the applicant took part in the cleaning-up operation at the chernobyl nuclear disaster site. he was subsequently registered disabled by ukrainian authorities, becoming entitled to various social benefits. 7.  in september 1999 the applicant settled in russia. the welfare authorities rejected re-establishing the applicant’s disability status. the applicant challenged the rejection before the courts. 8.  on 30 august 2005 the pyatigorsk town court granted the applicant’s claim and ordered the administration of labour and social security of the population of pyatigorsk to issue a certificate of benefits. 9.  on 20 september 2009 the judgment became final. 10.  on 18 october 2005 the applicant was issued with the certificate of benefits. 11.  on 1 november 2006 the presidium of stavropol regional court allowed the defendant authority’s application for supervisory review and quashed the judgment of 30 august 2005, considering 

## Vectorization

Tokenizer

In [None]:
"""
Tokenizer and Transformer models created by the dataset authors:
I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. 
"LEGAL-BERT: The Muppets straight out of Law School". 
In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) 
(Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)
"""

# Tokenizer
# This is a specialized tokenizer designed for use on the dataset
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-uncased-echr")

# ECHR Dataset
# apply text preprocessor
texts = dataframe_human_rights['facts'].to_list()
tokens_echr = tokenizer(texts,
                   padding=True,
                   truncation=True,
                   max_length=256,  # pad/truncate to uniform size
                   return_tensors="pt")  # return in PyTorch format

masked_tokens_echr = tokens_echr['input_ids'] * tokens_echr['attention_mask']                   

Vectorizer

In [None]:
"""
# LEGAL-BERT Model
# note: crashes system if used on full dataset
legal_bert_transformer = AutoModel.from_pretrained("nlpaueb/bert-base-uncased-echr")

use: encoded_data = legal_bert_transformer(**tokens_echr)
"""

'\n# LEGAL-BERT Model\n# note: crashes system if used on full dataset\nlegal_bert_transformer = AutoModel.from_pretrained("nlpaueb/bert-base-uncased-echr")\n'

In [None]:
# TF-IDF vectorizer applied to token IDs (each ID is treated as a "word")
tfidf_vectorizer = TfidfVectorizer()

# build the vocabulary from token IDs as strings, one ID per document
# note: range() stops one short of the largest ID, and the default
# token pattern ignores single-character tokens (IDs 0 to 9)
corpus = range(torch.max(tokens_echr['input_ids']))
corpus = [str(c) for c in corpus]

# fit
tfidf_vectorizer.fit(corpus)

# convert each row of token IDs to a space-separated string for the vectorizer
masked_tokens_list = masked_tokens_echr.tolist()
masked_tokens_echr_string = [' '.join(str(num) for num in row)
                             for row in masked_tokens_list]

encoded_data = tfidf_vectorizer.transform(masked_tokens_echr_string)

In [None]:
encoded_data

<1000x29980 sparse matrix of type '<class 'numpy.float64'>'
	with 132098 stored elements in Compressed Sparse Row format>
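
The vectorizer here is fed strings of token IDs rather than words, so its tokenization rules decide which IDs end up as features. A small sketch on made-up ID strings (the IDs are illustrative, not taken from the dataset) shows that the default `token_pattern` keeps only tokens of two or more characters, and that a custom pattern such as `r"\S+"` keeps every ID:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents" of space-separated token IDs (hypothetical values)
docs = ["101 7 2045 2045 102 0 0", "101 9 3000 102 0 0 0"]

# Default settings: single-character tokens such as "0", "7", "9" are ignored
default_vec = TfidfVectorizer()
default_vec.fit(docs)
print(sorted(default_vec.vocabulary_))

# Custom pattern: every whitespace-separated ID becomes a feature
keep_all_vec = TfidfVectorizer(token_pattern=r"\S+")
keep_all_vec.fit(docs)
print(sorted(keep_all_vec.vocabulary_))
```

Whether dropping the single-digit IDs matters depends on what those IDs stand for in the tokenizer's vocabulary, which is not checked here.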

## Why not a Transformer Model?

The authors of our dataset and text preprocessor have also released a pretrained Transformer model, LEGAL-BERT, available from https://huggingface.co/nlpaueb/legal-bert-base-uncased. Unfortunately, this model crashes our system when more than a couple hundred samples are used. Instead, we build our own Gaussian Mixture Model on the TF-IDF features below.
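
If the crash is caused by memory use (an assumption, it was not diagnosed here), one common workaround is to encode the texts in small batches with gradient tracking switched off. The sketch below is untested against the real dataset; `encode_in_batches` and `make_legal_bert_encoder` are hypothetical helper names, and the batch size of 16 is arbitrary.

```python
import numpy as np

def encode_in_batches(encode_fn, texts, batch_size=16):
    """Apply encode_fn to consecutive slices of texts and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encode_fn(batch)))
    return np.vstack(chunks)

def make_legal_bert_encoder(model_name="nlpaueb/bert-base-uncased-echr", max_length=256):
    """Hypothetical helper: build an encode_fn backed by the pretrained model."""
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    def encode_fn(batch):
        tokens = tokenizer(batch, padding=True, truncation=True,
                           max_length=max_length, return_tensors="pt")
        with torch.no_grad():  # no gradient bookkeeping during inference
            output = model(**tokens)
        return output.pooler_output.numpy()

    return encode_fn
```

With these helpers, `encode_in_batches(make_legal_bert_encoder(), texts)` would return one pooled vector per text as a NumPy array, ready for the dimension reduction step.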

## Mixture Models

Dimension Reduction

In [None]:
pca_transform = PCA(n_components=10)
"""
# if using bert transformer data:
encoded_data_condensed = pca_transform.fit_transform(encoded_data['pooler_output'].detach().numpy())  # '.detach().numpy()' needed for proper conversion between PyTorch and scikit-learn
"""
# if using TF-IDF vectorized data
# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
encoded_data_condensed = pca_transform.fit_transform(encoded_data.toarray())
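
The cell above converts the sparse TF-IDF matrix to a dense one before applying PCA. As a sketch of an alternative, `TruncatedSVD` accepts sparse input directly. It does not center the data, so its components are not identical to PCA's; the matrix below is synthetic and only illustrates the call pattern.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for the TF-IDF features
toy_sparse = sparse.random(200, 500, density=0.02, format="csr", random_state=0)

svd = TruncatedSVD(n_components=10, random_state=0)
toy_condensed = svd.fit_transform(toy_sparse)  # no densification needed

print(toy_condensed.shape)
# share of the variance retained by the 10 components
print(svd.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` (also available on a fitted `PCA`) is one way to judge whether 10 components keep enough of the signal.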

Cluster Model

In [None]:
mixture_model = BayesianGaussianMixture(n_components=5, random_state=142)
mixture_model.fit(encoded_data_condensed)



BayesianGaussianMixture(covariance_prior=None, covariance_type='full',
                        degrees_of_freedom_prior=None, init_params='kmeans',
                        max_iter=100, mean_precision_prior=None,
                        mean_prior=None, n_components=5, n_init=1,
                        random_state=142, reg_covar=1e-06, tol=0.001, verbose=0,
                        verbose_interval=10, warm_start=False,
                        weight_concentration_prior=None,
                        weight_concentration_prior_type='dirichlet_process')
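
With the Dirichlet-process prior shown in the repr above, `n_components` acts as an upper bound: components the data does not support tend to receive small weights. A minimal sketch on synthetic two-blob data (not the ECtHR features) shows how the fitted `weights_` can be inspected; `max_iter=500` is an arbitrary choice to give the toy fit room to converge.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2D
toy = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(8, 1, (150, 2))])

toy_model = BayesianGaussianMixture(n_components=5, random_state=142, max_iter=500)
toy_model.fit(toy)

# One weight per component; weights near zero indicate components that are barely used
print(np.round(toy_model.weights_, 3))
print(toy_model.converged_)
```

Running the same inspection on `mixture_model.weights_` would indicate how many of the five clusters carry substantial weight on the real data.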

In [None]:
clusters_assignments = mixture_model.predict(encoded_data_condensed)

In [None]:
clusters_probs = mixture_model.predict_proba(encoded_data_condensed)
clusters_probs[:5,:]
print(clusters_probs.shape)

(1000, 5)
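
Each row of `predict_proba` sums to one, and the hard label from `predict` corresponds to the most probable component. The sketch below, on synthetic data, also flags samples whose highest probability falls under a threshold; the 0.9 cutoff is an arbitrary assumption, not a value from the notebook.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Two overlapping synthetic blobs, so some points are ambiguous
toy = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
toy_model = BayesianGaussianMixture(n_components=3, random_state=142, max_iter=500).fit(toy)

probs = toy_model.predict_proba(toy)   # soft assignments, shape (n_samples, n_components)
hard = toy_model.predict(toy)          # hard assignments

confidence = probs.max(axis=1)                  # probability of the chosen component
ambiguous = np.flatnonzero(confidence < 0.9)    # positions of low-confidence samples

print(probs.shape)
print(len(ambiguous))
```

On the real data, low-confidence cases are the ones whose cluster membership should be read with the most caution.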


Combine into dataframe

In [None]:
# hard assignments
dataframe_human_rights['clusters_assignments'] = clusters_assignments
dataframe_human_rights.head(2)

Unnamed: 0,case_id,case_no,title,judgment_date,facts,applicants,defendants,allegedly_violated_articles,violated_articles,court_assessment_references,silver_rationales,gold_rationales,clusters_assignments
930,001-175663,6131/07,CASE OF KOROBEYNIKOV v. RUSSIA,2017-07-25,5. the applicant was born in 1963 and lives i...,[KOROBEYNIKOV],[RUSSIA],[6],[6],{},[],[],1
335,001-166930,3933/12,CASE OF PISKUNOV v. RUSSIA,2016-10-04,7. the applicant was born in 1955. he is curr...,[PISKUNOV],[RUSSIA],"[13, 3]","[13, 3]",{},[],[],1


In [None]:
# soft assignments
soft_assignments_df = pd.DataFrame(clusters_probs, columns=['clusters_probs_0', 'clusters_probs_1', 'clusters_probs_2', 'clusters_probs_3', 'clusters_probs_4'])
soft_assignments_df.head(3)
print(len(soft_assignments_df))

1000


In [None]:
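# predict_proba rows follow the row order of dataframe_human_rights, so attach its (shuffled) index before concatenating
df = pd.concat([dataframe_human_rights, soft_assignments_df.set_index(dataframe_human_rights.index)], axis='columns', join='inner')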
len(df)

1000
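
The rows returned by `predict_proba` are in the same order as the rows of `dataframe_human_rights`, while that DataFrame carries a shuffled index, so how the probabilities are attached matters. A toy sketch with hypothetical values contrasts label-based `reindex` with assigning the index positionally:

```python
import numpy as np
import pandas as pd

# Three "cases" with a shuffled index, as produced by df.sample(frac=1.0)
cases = pd.DataFrame({'case': ['a', 'b', 'c']}, index=[2, 0, 1])
# Row k of probs belongs to the k-th row of cases
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# reindex looks rows up by label, so label 2 fetches the third row of probs
by_label = pd.DataFrame(probs, columns=['p0', 'p1']).reindex(cases.index)
# passing the index at construction keeps the positional order
by_position = pd.DataFrame(probs, columns=['p0', 'p1'], index=cases.index)

print(pd.concat([cases, by_label], axis='columns'))
print(pd.concat([cases, by_position], axis='columns'))
```

Only the second variant pairs case `a` with its own probabilities; both variants produce the same number of rows, so checking `len(df)` does not reveal the difference.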

## Interpretations

In [None]:
cluster_objects = {}
cluster_names = sorted(df['clusters_assignments'].unique().tolist())

for i in cluster_names:
    cluster_objects[i] = df[df['clusters_assignments'] == i][['defendants', 'judgment_date', 'facts', 'allegedly_violated_articles']]
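
Reading through the per-cluster tables by eye is slow. One way to summarize them, sketched here on a hypothetical toy frame with the same two column names, is to count how often each allegedly violated article appears in each cluster:

```python
import pandas as pd

# Toy stand-in for df; the article lists are made up
toy = pd.DataFrame({
    'clusters_assignments': [0, 0, 1, 1],
    'allegedly_violated_articles': [['6'], ['6', '13'], ['3'], ['3', '5']],
})

# One row per (case, article) pair
exploded = toy.explode('allegedly_violated_articles')

# Rows: clusters, columns: articles, cells: number of cases citing the article
article_counts = pd.crosstab(exploded['clusters_assignments'],
                             exploded['allegedly_violated_articles'])
print(article_counts)
```

Applied to `df`, a table like this would show whether clusters line up with particular articles or cut across them.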

In [None]:
shuffled_cluster = cluster_objects[0].sample(frac=1.0)
# print the facts of up to five randomly chosen cases from cluster 0
for facts in shuffled_cluster['facts'].head(5):
    print(facts)
    print()

4.  the applicants, whose years of birth are summarised in the appendix, live in troitsko-pechorsk of the komi republic. 5.  they were municipal unitary enterprise employees working for “troitsko-pechorskoye zhkkh” («муп «троицко-печерское жкх», “the company”) in the komi republic. 6.  the company was set up in 2003 in accordance with a decision of head of the troitsko-pechorskiy district (“the district administration”) as a commercial organisation performing the following activities, among others: renovation and maintenance of the municipal housing stock; heating and water supply to the district population and enterprises; maintenance of the sewage systems; maintenance services in respect of municipal housing and adjacent territories; and providing real estate registration services in the troitsko-pechorskiy district. in order to carry out its statutory activities, the company had “the right of economic control” (право хозяйственного ведения) over the assets allocated to it by the tow

In [None]:
cluster_objects[1].sort_values(by=['defendants'])

Unnamed: 0,defendants,judgment_date,facts,allegedly_violated_articles
359,[ALBANIA],2016-10-06,4. the applicant was born in 1947 and lives i...,"[P1-1, 6]"
504,[ALBANIA],2016-12-08,4. the applicant was born in 1952 and lives i...,[6]
60,[ALBANIA],2016-03-17,10. on 11 may 1995 the fier commission recogn...,"[13, P1-1, 6]"
424,[ARMENIA],2016-10-27,6. the applicants are a family who lived in y...,"[P1-1, 8, 6]"
31,[ARMENIA],2016-02-25,5. the applicant was born in 1954 and lives i...,[6]
...,...,...,...,...
395,[UKRAINE],2016-10-13,5. the applicant was born in 1975 and prior t...,"[3, 34]"
301,[UNITED KINGDOM],2016-09-01,4. the applicant was born in 1977 and lives i...,[5]
19,[UNITED KINGDOM],2016-02-18,4. on 16 september 1982 the applicant was sen...,[5]
967,[UNITED KINGDOM],2017-09-14,"4. the facts of the case, as submitted by the...",[8]


In [None]:
cluster_objects[2].sort_values(by=['defendants'])

Unnamed: 0,defendants,judgment_date,facts,allegedly_violated_articles
100,[ALBANIA],2016-04-07,5. on 30 may 2003 the gjirokastra commission ...,"[13, P1-1, 6]"
114,[ALBANIA],2016-04-21,7. the applicant was born in 1951 and lives i...,"[13, P1-1, 46, 6, 34]"
368,[ALBANIA],2016-10-06,6. the applicant was born in 1968 and lives i...,"[13, 6]"
503,[ALBANIA],2016-12-08,4. the applicant was born in 1958 and lives i...,"[P1-1, 6]"
462,[ARMENIA],2016-11-17,"5. the applicants, mr vladimir karapetyan (th...",[10]
...,...,...,...,...
868,[UKRAINE],2017-06-27,5. in disputes between the applicants and the...,[6]
298,[UKRAINE],2016-09-01,5. the first applicant was born in 1940. she ...,"[13, 2, 6, 34]"
558,[UNITED KINGDOM],2017-01-12,4. the applicant was born in 1942 and lives i...,[6]
313,[UNITED KINGDOM],2016-09-15,4. the applicant was born in 1945 and is curr...,[6]


In [None]:
cluster_objects[3].sort_values(by=['defendants'])

Unnamed: 0,defendants,judgment_date,facts,allegedly_violated_articles
695,[],2017-03-23,5. the applicant was born in 1990. 6. the ap...,"[P4-2, 8]"
425,[ARMENIA],2016-10-27,5. on 7 october 2000 s.m. was stabbed by two ...,[6]
757,[ARMENIA],2017-04-27,5. the present case concerns the applicant’s ...,[6]
966,[ARMENIA],2017-09-14,5. the applicant was born in 1987 and was ser...,[3]
481,[ARMENIA],2016-11-24,5. the applicant was born in 1976 and is curr...,[6]
...,...,...,...,...
530,[UKRAINE],2016-12-15,"5. on 9 june 2004, the applicant, 69 at the t...","[13, 3]"
330,[UKRAINE],2016-09-22,5. the applicant was born in 1973 and is curr...,"[5, 13, 3, 6, 34]"
3,[UNITED KINGDOM],2016-02-11,6. the applicant was born in 1977 and lives i...,[7]
58,[UNITED KINGDOM],2016-03-17,5. the applicant was born in 1954 and lives i...,"[5, 13, 6, 34]"


In [None]:
cluster_objects[4].sort_values(by=['defendants'])

Unnamed: 0,defendants,judgment_date,facts,allegedly_violated_articles
86,[ARMENIA],2016-03-31,5. the applicant was born in 1969 and lives i...,"[P1-1, 6]"
658,[AUSTRIA],2017-02-28,4. the applicant was born in 1965 and lives i...,[6]
148,[AUSTRIA],2016-05-17,"5. the applicant, ms gabriele fürst-pfeifer, ...","[10, 8]"
741,[AUSTRIA],2017-04-11,4. the applicant was born in 1939 and lives i...,[6]
416,[AUSTRIA],2016-10-25,5. the applicant company is a limited liabili...,[10]
...,...,...,...,...
151,[UNITED KINGDOM],2016-05-19,4. the applicant was born in 1971 and lives i...,[5]
861,[UNITED KINGDOM],2017-06-22,5. the applicant was born in zimbabwe and liv...,[5]
661,[UNITED KINGDOM],2017-03-02,4. the applicant was born in 1977 and lives i...,"[5, 34]"
89,[UNITED KINGDOM],2016-03-31,3. the present case concerns the applicant’s ...,[6]
