# Legal Text Classification with Blackstone

This notebook demonstrates how Blackstone can be applied to selected Singaporean case law.

Blackstone is a Python package that uses NLP techniques to detect patterns in case law and classify each sentence into one of five categories.

The five categories are:

AXIOM - The text appears to postulate a well-established principle

CONCLUSION - The text appears to make a finding, holding, determination or conclusion

ISSUE - The text appears to discuss an issue or question

LEGAL_TEST - The text appears to discuss a legal test

UNCAT - The text does not fall into one of the four categories above

To demonstrate the usage and performance of Blackstone, I used the seminal tort case of Spandeck Engineering (S) Pte Ltd v Defence Science & Technology Agency [2007] 4 SLR(R) 100.

The rest of this Jupyter notebook documents the text cleaning process and the prediction of the legal categories listed above.

## 1. Importing relevant packages and the data file

In [2]:
# Check the working directory: the relative dataFilePath defined below (used to import the spandeck file) is resolved against it
# import os
# os.getcwd()

'c:\\Users\\Tristan\\Desktop\\Projects\\blackstone\\blackstone-legal-cat\\code'

In [4]:
from pathlib import Path
import pandas as pd

dataFilePath = Path("..", "data", "spandeck.txt")

# readlines() returns a list of strings, one per line, each keeping its trailing "\n".
# The with-block closes the file once the lines have been read.
with open(dataFilePath, "r", encoding="utf8") as spandeck:
    text = spandeck.readlines()
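
As a small illustration of why the cleaning step that follows is needed, the sketch below writes a few lines to a temporary file and reads them back with `readlines()`. The file name and its contents are made up and only stand in for `spandeck.txt`. The result is a plain list in which every element still carries its trailing newline, and blank lines and tabs are preserved.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the judgment file; the contents are made up
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir, "sample.txt")
    sample_path.write_text("First paragraph.\n\nSecond\tparagraph.\n", encoding="utf8")

    with open(sample_path, "r", encoding="utf8") as sample_file:
        sample_lines = sample_file.readlines()

# readlines() gives back a list, not a generator
print(type(sample_lines).__name__)
print(sample_lines)
```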

### Function for text pre-processing

This function removes newline characters, replaces tabs with spaces and drops empty strings, returning a cleaned list of strings.

In [None]:
# Text preprocessing
def text_preprocessing(text):
    """ Accepts a list of unprocessed strings, returns a list of strings without newline characters, tabs or empty strings """
    
    processed_text = []

    for string in text:
        string = string.replace("\n", "")
        string = string.replace("\t", " ")
        processed_text.append(string)
    
    processed_text = [string for string in processed_text if string != ""]

    return processed_text

# Clean the text
text = text_preprocessing(text)
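
To make the effect of the cleaning step concrete, here is a small sketch on made-up input. The function is restated in compact form so that the block runs on its own; `sample_lines` and its contents are hypothetical and not taken from the judgment.

```python
def text_preprocessing(text):
    """Compact restatement of the cell above: drop newlines, turn tabs into spaces, skip empty strings."""
    cleaned = (line.replace("\n", "").replace("\t", " ") for line in text)
    return [line for line in cleaned if line != ""]

# Hypothetical raw lines, shaped like the output of readlines()
sample_lines = [
    "1\tThis appeal concerns the duty of care.\n",
    "\n",
    "The appellant was a contractor.\n",
]

cleaned_sample = text_preprocessing(sample_lines)
print(cleaned_sample)
```

The blank line disappears, the tab becomes a single space, and no string ends in a newline any more.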

In [None]:
import spacy
nlp = spacy.load("en_blackstone_proto")

Since the Blackstone categoriser works at the sentence level, split the strings into individual sentences using the model's sentence boundary detector.

In [None]:
def legal_cats(sentences):
    """
    Identify the highest-scoring category predicted by the text categoriser for each sentence.

    Arguments:
    sentences -- a list of strings

    Each string is converted to a spaCy doc and split into sentences using
    spaCy's sentence detector; each sentence is then run through the text categoriser.

    Returns a tuple of:
    a list of the split sentences,
    a list of (max_cat, max_score) tuples, one per sentence
    """
    doc_sentences = []

    docs = nlp.pipe(sentences, disable = ["tagger", "ner", "textcat"])

    for doc in docs:
        for sentence in doc.sents:
            doc_sentences.append(sentence.text)
    
    docs = nlp.pipe(doc_sentences, disable = ["tagger", "parser", "ner"])
    cats_list = []

    for doc in docs:
        cats = doc.cats
        max_score = max(cats.values()) 
        max_cats = [k for k, v in cats.items() if v == max_score]
        max_cat = max_cats[0]
        cats_list.append((max_cat, max_score))

    return doc_sentences, cats_list
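
The selection step inside `legal_cats` can be tried in isolation. The sketch below uses a made-up dictionary with the same shape as `doc.cats` (label mapped to score); the numbers are hypothetical and are not Blackstone output. It also shows that `max(cats, key=cats.get)` picks the same label in a single expression.

```python
# Hypothetical scores shaped like doc.cats: a dict mapping label -> score
cats = {
    "AXIOM": 0.02,
    "CONCLUSION": 0.11,
    "ISSUE": 0.05,
    "LEGAL_TEST": 0.78,
    "UNCAT": 0.04,
}

# Same selection as in legal_cats: find the best score, then the first label that has it
max_score = max(cats.values())
max_cat = [k for k, v in cats.items() if v == max_score][0]

# Equivalent one-liner: take the max over the labels, ranked by their scores
assert max(cats, key=cats.get) == max_cat

print((max_cat, max_score))
```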

Run the cleaned text through the sentence boundary detector and the text categoriser

In [None]:
cats = legal_cats(text)
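
Before moving on, it can help to check the shape of what `legal_cats` returns. The sketch below uses a hypothetical return value (made-up sentences and scores) with the same structure: a tuple of two aligned lists. It then tallies the predicted categories with `collections.Counter`.

```python
from collections import Counter

# Hypothetical return value with the same shape as legal_cats(text):
# (list of sentences, list of (category, score) tuples)
cats = (
    ["Sentence one.", "Sentence two.", "Sentence three."],
    [("UNCAT", 0.91), ("LEGAL_TEST", 0.67), ("UNCAT", 0.88)],
)

sentences, predictions = cats

# The two lists are aligned: predictions[i] belongs to sentences[i]
assert len(sentences) == len(predictions)

# Tally how many sentences fall into each category
category_counts = Counter(category for category, _ in predictions)
print(category_counts.most_common())
```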

Build a DataFrame with one row per sentence, alongside the category Blackstone predicted for it and the corresponding score

In [None]:
df = pd.DataFrame({
    "sentence": cats[0],
    "category": [cat[0] for cat in cats[1]],
    "score": [cat[1] for cat in cats[1]],
})

df.head()

df["category"].unique()

for sentence in df.loc[df["category"] == "LEGAL_TEST", "sentence"][:30]:
    print(sentence)
    print("-" * 40)
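
The `score` column is collected above but not used when printing. As one possible follow-up, the sketch below keeps only the `LEGAL_TEST` rows whose score clears a cut-off and sorts them by score. The rows are made up, and `MIN_SCORE` is an arbitrary value chosen for illustration, not a threshold recommended by Blackstone.

```python
import pandas as pd

# Hypothetical rows with the same columns as df above; sentences and scores are made up
sample_df = pd.DataFrame(
    {
        "sentence": ["Sentence one.", "Sentence two.", "Sentence three."],
        "category": ["LEGAL_TEST", "UNCAT", "LEGAL_TEST"],
        "score": [0.92, 0.88, 0.41],
    }
)

MIN_SCORE = 0.5  # arbitrary cut-off for illustration only

# Keep LEGAL_TEST rows at or above the cut-off, highest score first
confident_tests = sample_df[
    (sample_df["category"] == "LEGAL_TEST") & (sample_df["score"] >= MIN_SCORE)
].sort_values("score", ascending=False)

print(confident_tests["sentence"].tolist())
```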