# Unsupervised Text Summarization for Legal Texts

## 1. Dataset
This project targets both French and English datasets. This example uses the AustLII dataset, which contains legal court proceedings from Australia. Each document in the dataset consists of a list of sentences and a set of catchphrases. We use these catchphrases as the reference when evaluating our summaries.

### 1.1 Extracting Dataset

In [1]:
import os
import xml.etree.ElementTree as ET
import re

### 1.2 Cleaning the dataset
The XML files are in UTF-8 and contain markup that does not parse as-is: a malformed attribute in the `catchphrase` tag and named escape sequences. To use the dataset, we first need to clean the XML files we receive as input.

In [2]:
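def clean(filename):
	"""
	Cleans the xml for processing
	1. The attribute error in catchphrase tag
	2. Escape characters
	:param filename: file path of file to clean
	:return: cleaned file contents
	"""
	# The files are UTF-8, so do not rely on the platform default encoding
	with open(filename, encoding="utf-8") as fp:
		file_content = fp.read()
	# Move the misplaced quote: <catchphrase "id=c0"> becomes <catchphrase id="c0">
	file_content = re.sub(r'<catchphrase "id=([a-z][0-9]*)">', r'<catchphrase id="\1">', file_content)
	# Replace each named entity with its first letter, e.g. &eacute; becomes e
	file_content = re.sub(r'&([a-zA-Z])([a-zA-Z]*);', r'\1', file_content)
	return file_content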
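
As a quick check of the two substitutions in `clean`, the sketch below applies the same regular expressions to an in-memory string. The helper name `clean_text` and the sample markup are illustrative assumptions and are not taken from the dataset.

```python
import re

def clean_text(xml_text):
    """Apply the same two substitutions as clean(), but to a string (hypothetical helper)."""
    # Repair the misplaced quote in the catchphrase tag
    xml_text = re.sub(r'<catchphrase "id=([a-z][0-9]*)">', r'<catchphrase id="\1">', xml_text)
    # Keep only the first letter of each named entity
    xml_text = re.sub(r'&([a-zA-Z])([a-zA-Z]*);', r'\1', xml_text)
    return xml_text

# Made-up sample line, for illustration only
sample = '<catchphrase "id=c0">caf&eacute; licence</catchphrase>'
cleaned = clean_text(sample)
print(cleaned)
```

One side effect worth knowing: because the second pattern matches any named entity, `&amp;` is reduced to the letter `a` rather than to an ampersand.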

### 1.3 Parsing a legal document
Extracts all the sentences and catchphrases from a document.

In [3]:
def extract(filename):
	"""
	Fetches all the sentences and catch phrases of a file and stores in a (<full text>, <catch-phrase>) tuple
	:param filename: file path to extract from
	:return: (<full text>, <catch-phrases>) for the path
	"""
	file_content = clean(filename)
	root = ET.fromstring(file_content)
	catchphrase_subtree = root.find('catchphrases')
	catchphrases = []
	for catchphrase in catchphrase_subtree:
		catchphrases.append(catchphrase.text)
	sentence_subtree = root.find('sentences')
	full_text = []
	for sentence in sentence_subtree[:-1]:
		full_text.append(sentence.text)
	return full_text, catchphrases
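
To see what `extract` returns without touching the file system, the sketch below runs the same parsing logic on a small in-memory document. The XML layout (`case` root, `id` attributes) and the sentences are assumptions made up for illustration; only the `catchphrases` and `sentences` element names come from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, not taken from the dataset
SAMPLE_XML = """<case>
<catchphrases>
<catchphrase id="c0">leave refused</catchphrase>
<catchphrase id="c1">practice and procedure</catchphrase>
</catchphrases>
<sentences>
<sentence id="s0">The applicant seeks leave to appeal.</sentence>
<sentence id="s1">Leave is refused.</sentence>
<sentence id="s2">I certify the preceding paragraphs.</sentence>
</sentences>
</case>"""

def extract_from_string(xml_text):
    """Same logic as extract(), but on an already-cleaned string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    catchphrases = [c.text for c in root.find('catchphrases')]
    # As in extract(), the last sentence element is skipped
    full_text = [s.text for s in root.find('sentences')[:-1]]
    return full_text, catchphrases

full_text, catchphrases = extract_from_string(SAMPLE_XML)
print(full_text)
print(catchphrases)
```

Note that the `[:-1]` slice drops the final sentence of every document, so the returned text is always one sentence shorter than the file.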

### 1.4 Parsing all the documents in the Dataset

In [4]:
def getGroundTruth():
	"""
	Fetches all the full texts and their catch phrases and stores in a (<full text>, <catch-phrase>) tuple
	:return: list of (<full text>, <catch-phrase>) tuples
	"""
	print("Fetching dataset...", end=" ")
	gt = []
	for dirpath, _, files in os.walk("./fulltext/"):
		for filename in files:
			# Join with the walked directory so files in subfolders resolve correctly
			gt.append(extract(os.path.join(dirpath, filename)))
	print("Done")
	return gt

### 1.5 Example Parse

In [8]:
gt = getGroundTruth()
print(gt[0][1])

Fetching dataset... Done
['application for leave to appeal', 'authorisation of multiple infringements of copyright established', 'prior sale of realty of one respondent to primary proceedings', 'payment of substantial part of proceeds of sale to offshore company in purported repayment of loan', 'absence of material establishing original making and purpose of loan', 'mareva and ancillary orders made by primary judge', 'affidavits disclosing assets sworn', 'orders made requiring filing of further affidavits of disclosure and cross-examination of one respondent to primary proceedings on her disclosure affidavit', 'no error in making further ancillary orders', 'leave refused', 'practice and procedure']



## 2. Proposed Approach
For this code demonstration we implement a text summarizer that clusters sentence embeddings. To select a representative sentence for each cluster, we use an extractive approach.

In [39]:
from sentence_transformers import SentenceTransformer
from rouge import Rouge
from sklearn.cluster import KMeans, DBSCAN
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min
from nltk import ngrams
from collections import Counter

### 2.1 Encoding the sentences
We encode each sentence as a vector so that we can analyse the document as a whole. We are trying out three approaches to sentence encoding:
* Skip Thought Encoding
* Paragram Phrase Encoding
* BERT Sentence Encoding

In [42]:
# BERT Encoding of sentences
enc_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Skip Thought (the skipthoughts module must be importable from this notebook)
import skipthoughts
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)

I0406 19:49:55.243939 17044 SentenceTransformer.py:29] Load pretrained SentenceTransformer: bert-base-nli-mean-tokens
I0406 19:49:55.253942 17044 SentenceTransformer.py:32] Did not find a '/' or '\' in the name. Assume to download model from server.
I0406 19:49:55.257941 17044 SentenceTransformer.py:68] Load SentenceTransformer from folder: C:\Users\13235/.cache\torch\sentence_transformers\public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip
I0406 19:49:55.271941 17044 configuration_utils.py:182] loading configuration file C:\Users\13235/.cache\torch\sentence_transformers\public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip\0_BERT\config.json
I0406 19:49:55.275943 17044 configuration_utils.py:199] Model config {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",


In [21]:
def encode_sentence(sentences):
    # Encode the list of sentences passed in, not an undefined global
    return enc_model.encode(sentences)

### 2.2 Clustering the sentence embeddings
We now cluster the sentence embeddings so that similar sentences fall into the same cluster. We are exploring two techniques:
* K-Means
* DBSCAN

In [30]:
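def cluster(embeddings, method="kmeans", minimum_samples=6):
	"""
	Clusters the sentence embeddings
	:param embeddings: array of sentence embeddings
	:param method: "kmeans" or "dbscan"
	:param minimum_samples: min_samples for DBSCAN, number of clusters for K-Means
	:return: fitted clustering model (labels are in .labels_)
	"""
	if method == "dbscan":
		# DBSCAN must be fitted too; note it exposes no cluster_centers_
		clusters = DBSCAN(eps=0.3, min_samples=minimum_samples).fit(embeddings)
	else:
		clusters = KMeans(n_clusters=minimum_samples).fit(embeddings)
	return clusters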
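
To make the K-Means step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm on toy 2-D points. It is only a stand-in for illustration: the function `kmeans`, the toy data, and the choice to seed with the first `k` points are assumptions, and scikit-learn's `KMeans` uses a different initialisation and stopping rule.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k points (deterministic)
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy "embeddings": two well-separated groups
toy_embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
labels, centroids = kmeans(toy_embeddings, k=2)
print(labels)
```

Real sentence embeddings have hundreds of dimensions (768 for the BERT model above), but the assignment and update steps are the same.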

### 2.3 Select Representative Sentence

To select the representative sentence of each cluster, we use extractive approaches.

#### 2.3.1 Centroid based extraction
Takes, for each cluster, the sentence whose embedding is closest to the cluster centroid. We also use the mean position of each cluster's sentences in the document to order the sentences in the summary.

In [12]:
def centroid_representative(clusters, sentence_embeddings):
    closest, _ = pairwise_distances_argmin_min(clusters.cluster_centers_, sentence_embeddings)
    return closest
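
The selection and ordering described above can be sketched without any external library. The function name `pick_representatives` and the toy data are assumptions for illustration; as with `pairwise_distances_argmin_min`, the nearest sentence is searched over all embeddings, and clusters are then sorted by the mean index of their member sentences.

```python
import math

def pick_representatives(embeddings, labels, centroids):
    """Return one sentence index per cluster, ordered by the cluster's mean position."""
    picks = []
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes no sentence
        # Sentence whose embedding is nearest to this centroid (Euclidean distance)
        closest = min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], centroid))
        # Mean sentence index of the cluster, used only for ordering
        mean_position = sum(members) / len(members)
        picks.append((mean_position, closest))
    return [idx for _, idx in sorted(picks)]

# Toy data: cluster 1 occupies the start of the document, cluster 0 the end
embeddings = [(5.0, 5.0), (5.2, 5.2), (5.4, 5.4), (0.0, 0.0), (0.1, 0.1), (0.5, 0.5)]
labels = [1, 1, 1, 0, 0, 0]
centroids = [(0.2, 0.2), (5.2, 5.2)]
order = pick_representatives(embeddings, labels, centroids)
print(order)
```

Here the sentence for cluster 1 comes first in the summary even though cluster 0 has the lower label, because its sentences appear earlier in the document.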

#### 2.3.2 Catch Phrase Extraction
Because the dataset is evaluated against catchphrases, we need another way of extracting summaries that concentrates on catchphrases. There are several well-known keyword extraction methods:
 * TF-IDF
 * RAKE
 * TextRank

For this approach we are experimenting with TF-IDF. The code in this section currently selects the most frequent n-gram (4 to 10 words) within each cluster.

In [14]:
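def mostCommonPhrase(summary):
	"""
	Finds the most frequent n-gram (4 to 10 words) in a text
	:param summary: text of one cluster
	:return: most frequent phrase, or "" if no phrase occurs more than once
	"""
	most_common_phrase = ""
	max_freq = 1
	for n in range(10, 3, -1):
		phrases = [' '.join(token) for token in ngrams(summary.split(), n)]
		if not phrases:
			# The text has fewer than n words, so there is nothing to count
			continue
		phrase, freq = Counter(phrases).most_common(1)[0]
		# A shorter phrase only wins if it is strictly more frequent
		if freq > max_freq:
			max_freq = freq
			most_common_phrase = phrase
			summary = summary.replace(phrase, '')
	return most_common_phrase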


def getCatchPhrase(cluster, full_text):
	cluster_sent = {}
	catch_phrase = []
	summary = []
	sentence_n = len(full_text)
	for sentence_id in range(sentence_n):
		label = cluster.labels_[sentence_id]
		if label not in cluster_sent:
			cluster_sent[label] = []
		cluster_sent[label].append(full_text[sentence_id])
	for label in cluster_sent.keys():
		summary_label = " ".join(cluster_sent[label])
		catch_phrase.append(mostCommonPhrase(summary_label))
	return catch_phrase
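
Since TF-IDF is named above as the method under experiment, here is a minimal sketch of how it could score keywords when each cluster is treated as one "document". This is an assumption about how TF-IDF might be applied here, not the notebook's implementation; the function `tfidf_keywords`, the whitespace tokenisation, and the unsmoothed natural-log IDF are all illustrative choices.

```python
import math
from collections import Counter

def tfidf_keywords(cluster_texts, top_n=3):
    """Score words per cluster with tf * idf and return the top_n words of each cluster."""
    tokenized = [text.lower().split() for text in cluster_texts]
    n_docs = len(tokenized)
    # Document frequency: in how many clusters does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {w: (count / len(tokens)) * math.log(n_docs / df[w]) for w, count in tf.items()}
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_n])
    return keywords

# Made-up cluster texts, for illustration only
texts = [
    "the appeal is refused and the appeal costs follow",
    "the loan was repaid and the loan was offshore",
]
top_words = tfidf_keywords(texts, top_n=1)
print(top_words)
```

Words that occur in every cluster, such as "the" and "and" here, get an IDF of zero and drop out, which is the main difference from the raw frequency count in `mostCommonPhrase`.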

### 2.4 Evaluation
The current state-of-the-art models on this dataset all report ROUGE-1, so we evaluate our summaries with ROUGE as well.

In [15]:
def evaluate(model_sum, gt_sum):
	"""
	Gives rouge score
	:param model_sum: list of summaries returned by the model
	:param gt_sum: list of ground truth summary from catchphrases
	:return: ROUGE score
	"""
	rouge = Rouge()
	return rouge.get_scores(model_sum, gt_sum, avg=True)
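
For intuition about what ROUGE-1 measures, the sketch below computes unigram precision, recall and F1 by hand. It is a simplified illustration: the function `rouge_1` and its whitespace tokenisation are assumptions, and the `rouge` package may tokenise and count differently, so its numbers need not match.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {'p': precision, 'r': recall, 'f': f1}

# Made-up strings: a long model summary scored against a short catchphrase
scores = rouge_1("leave to appeal is refused", "leave refused")
print(scores)
```

In this toy case recall is 1.0 while precision is 0.4, the typical pattern when full sentences are scored against short catchphrases.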


### 2.5 Execution

In [37]:
def main():
	"""
	Executes the entire pipeline of the code
	:return: void
	"""
	gt = getGroundTruth()
	model_sum, gt_sum = [], []
	print("Fetching encoder model...", end=" ")
	enc_model = SentenceTransformer('bert-base-nli-mean-tokens')
	print("Done")
	for full_text, catch_phrases in gt[:15]:
		# Embed each sentence
		sentence_embeddings = enc_model.encode(full_text)

		# Cluster each embedding
		cluster_n = 11
		clusters = cluster(sentence_embeddings, minimum_samples=cluster_n)
		centroids = []
		for idx in range(cluster_n):
			centroid_id = np.where(clusters.labels_ == idx)[0]
			centroids.append(np.mean(centroid_id))

		# Select the representative sentence of each cluster
		closest, _ = pairwise_distances_argmin_min(clusters.cluster_centers_, sentence_embeddings)
		# Order the clusters by the mean position of their sentences
		ordering = sorted(range(cluster_n), key=lambda k: centroids[k])

		summary = '.'.join([full_text[closest[idx]] for idx in ordering]).replace('\n', ' ')
		model_sum.append(summary)
		gt_sum.append(".".join(catch_phrases))
		# NOTE: stops after the first document, so the reported score covers one document only
		break
	print("ROUGE score: {}".format(evaluate(model_sum, gt_sum)))

In [38]:
main()

Fetching dataset... 


Done
Fetching encoder model... 


Done


Batches: 100%|██████████| 29/29 [00:02<00:00,  9.97it/s]


ROUGE score: {'rouge-1': {'f': 0.17447495705888652, 'p': 0.10266159695817491, 'r': 0.5806451612903226}, 'rouge-2': {'f': 0.03241490832149097, 'p': 0.01904761904761905, 'r': 0.10869565217391304}, 'rouge-l': {'f': 0.13213212925538156, 'p': 0.08, 'r': 0.3793103448275862}}
