# Legal Text Classification with Blackstone

### Author: Tristan Koh, NUS Law Year 2
### GitHub: https://github.com/TristanKoh

This Jupyter notebook demonstrates the use of Blackstone on Singaporean case law. By doing so, I hope to encourage others to get their hands dirty with basic programming and data science, especially law students who are interested in legal technology. Even for students whose interest lies in the law of technology rather than the technology of law, I personally believe that one cannot simply discuss "technology" in the abstract when formulating legal rules that govern such technology.

At the same time, I empathise with those who may be apprehensive of programming / coding, as I was one and a half years ago. Hence, through this notebook, I aim to explain each step in the code as simply as possible, to demonstrate that one does not need to be particularly talented to self-learn programming.



## About Blackstone

Blackstone is a Python package that uses Natural Language Processing (NLP) techniques to detect linguistic features in case law and classify the text into 5 categories.

The five categories are:

AXIOM - The text appears to postulate a well-established principle

CONCLUSION - The text appears to make a finding, holding, determination or conclusion

ISSUE - The text appears to discuss an issue or question

LEGAL_TEST - The text appears to discuss a legal test

UNCAT - The text does not fall into one of the four categories above

## How Blackstone fits into the broader data science context

As computers cannot understand text as humans do, NLP packages like Blackstone provide a set of utilities that allows us to create a mathematical model of the text, such that computers are able to process natural language. We call these models "text representations". The simplest of text representations (not used in Blackstone) is the Bag-of-Words representation. It represents the text as a count of words in the document.

As you may imagine, such a representation loses significant semantic meaning, as it ignores word order and does not account for how informative each word is. This means that commonly used but less meaningful words like "can" and "one" carry more weight in the model than rarer but more meaningful words like "technology" and "programming".
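
To make this concrete, here is a minimal sketch of a Bag-of-Words count using only Python's standard library. The two example sentences and the helper name `bag_of_words` are made up for illustration, and this is not how Blackstone represents text.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring case and word order."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings but exactly the same words
bow_a = bag_of_words("the defendant owed the claimant a duty")
bow_b = bag_of_words("the claimant owed the defendant a duty")

# Word order is lost, so the two representations are identical
print(bow_a == bow_b)        # True

# The most heavily counted word is also the least informative one
print(bow_a.most_common(1))  # [('the', 2)]
```

Both weaknesses described above show up here: the model cannot tell who owed the duty to whom, and "the" outweighs every other word.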

Therefore, there are other, more complicated text representations that retain more semantic meaning in the text, such as TF-IDF (which is essentially a weighted count of words) and word embeddings.

Blackstone uses the latter, word embeddings. This article (https://machinelearningmastery.com/what-are-word-embeddings/) explains how word embeddings work much better than I can, so for brevity I shall not go into the details here.
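
TF-IDF, by contrast, is simple enough to sketch by hand. The following is a minimal illustration of the plain textbook formula (term frequency multiplied by the log of the inverse document frequency). The three short "documents" and the helper name `tf_idf` are made up for illustration; real libraries typically use smoothed variants of this formula, and Blackstone does not use TF-IDF at all.

```python
import math
from collections import Counter

# Three made-up documents, already lower-cased
documents = [
    "the court held that the duty of care exists",
    "the court dismissed the appeal",
    "the test for duty of care has two stages",
]
tokenised = [doc.split() for doc in documents]

def tf_idf(word, doc_tokens, all_docs):
    """Frequency of the word in one document, scaled down by how many documents contain it.

    Assumes the word appears in at least one document.
    """
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    docs_with_word = sum(1 for tokens in all_docs if word in tokens)
    idf = math.log(len(all_docs) / docs_with_word)
    return tf * idf

# "the" appears in every document, so its weight collapses to zero
score_the = tf_idf("the", tokenised[0], tokenised)

# "held" appears in only one document, so it keeps a positive weight
score_held = tf_idf("held", tokenised[0], tokenised)

print(score_the, score_held)
```

Unlike the raw count, the weighted count pushes words that appear everywhere towards zero and lets rarer words stand out.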

To demonstrate the usage and performance of Blackstone, I used the seminal Singaporean tort case of Spandeck Engineering (S) Pte Ltd v Defence Science & Technology Agency [2007] 4 SLR(R) 100.

The rest of this Jupyter notebook documents the text cleaning process and the prediction of the above legal categories.

## 1. Importing relevant packages and the data file

Before we begin, certain packages need to be imported to allow neater and more efficient management of the data. Packages are collections of pre-written Python code that provide specific functionality beyond the core language.

The key packages that are used for this notebook are:

1. Pandas - Enables the structuring of data in a tabular format, with rows and columns (similar to an Excel spreadsheet).

2. Blackstone - As mentioned, an NLP package.

3. Path - An auxiliary class from Python's built-in pathlib module that builds the relative file path to the file containing the case that we are going to test Blackstone on.

In [2]:
# This code checks the location of the working directory, which affects the relative path stored in dataFilePath (defined below) for the import of the spandeck file
# I have left the code commented out since it is only needed to check the file path before starting the rest of the project
# import os
# os.getcwd()

'c:\\Users\\Tristan\\Desktop\\Projects\\blackstone\\blackstone-legal-cat\\code'

In [13]:
from pathlib import Path
import pandas as pd

# dataFilePath is a Path object that holds the relative file path to the data file
dataFilePath = Path("..", "data", "spandeck.txt")

# Import the spandeck case as text
# open() returns a file object (ie. a stream; the text is not yet saved in memory)
# .readlines() reads the whole file and returns a list of strings, one per line, which we save in memory as "text"
# The "with" block closes the file automatically once we are done reading
with open(dataFilePath, "r", encoding="utf8") as spandeck:
    text = spandeck.readlines()
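
To see what `.readlines()` actually hands back, here is a small self-contained sketch that writes a throwaway file and reads it in again. The file name `sample_case.txt` and its contents are hypothetical and are created in a temporary folder, so nothing in the project is touched.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Path joins the parts with the right separator for the operating system
    sample_path = Path(tmp_dir, "sample_case.txt")  # hypothetical file name
    sample_path.write_text("First line\n\tSecond line\n", encoding="utf8")

    # The "with" block closes the file automatically once reading is done
    with open(sample_path, "r", encoding="utf8") as sample_file:
        lines = sample_file.readlines()

# A list of strings, one per line, with the line and tab breaks still attached
print(lines)  # ['First line\n', '\tSecond line\n']
```

The leftover `\n` and `\t` characters are exactly what the pre-processing step in the next section removes.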

In [9]:
# This code loads the blackstone NLP model, and saves it into the object called nlp
import blackstone
import en_blackstone_proto
nlp = en_blackstone_proto.load()

## 2. Text pre-processing

As the text extracted directly from the PDF file is not "clean" (ie. it contains formatting and other characters that do not carry any semantic meaning), we first need to pre-process the text to remove these non-essential characters.

There are various packages that come with pre-written functions that can be used for general situations, but here the text only contains line breaks and tabs, and hence I have decided to define my own function that removes such text formatting.

### What is a function?

A function in programming is similar to a mathematical function; there is an input and an output, and a series of pre-defined steps is applied to the input.

Apart from pre-defined functions (such as the "print" function), we can also define our own functions. We do so for our convenience, because we can reuse the same lines of code defined within the function later on just by calling the function name.
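
As a minimal sketch of this idea, the made-up function below (`count_words` is a name chosen purely for illustration) takes a sentence as its input, applies two pre-defined steps, and returns the number of words as its output.

```python
def count_words(sentence):
    """Input: a sentence (string). Output: the number of words in it."""
    words = sentence.split()  # step 1: split the sentence on whitespace
    return len(words)         # step 2: count the resulting pieces

# The same steps are reused simply by calling the function name again
first_count = count_words("The appeal is dismissed")
second_count = count_words("Costs follow the event in the usual way")

print(first_count, second_count)  # 4 8
```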

### Function for text pre-processing

This function removes new lines, replaces tabs with spaces, and drops any strings that are left empty. It returns a list of the cleaned strings.

In [16]:
# Text preprocessing
def text_preprocessing(text):
    """ Accepts a list of unprocessed strings, returns a list of strings without line and tab breaks and empty strings """
    
    # This creates an empty list
    processed_text = []

    # This is a for loop; it iterates through the strings in the text, and performs some operations on each string. Hence the name for loop: "For" each string, apply X operations on the string.

    # In this case, for each string, we replace new lines ("\n") with an empty string, and replace tabs ("\t") with a space.
    for string in text:
        string = string.replace("\n", "")
        string = string.replace("\t", " ")
        processed_text.append(string)
    
    # This is a list comprehension; a more concise way of expressing a for loop.
    # We iterate through each string in the processed text, and we only retain strings which are not empty strings (since empty strings are meaningless in this context)

    processed_text = [string for string in processed_text if string != ""]

    return processed_text

# Run the function on the list of strings read from the file
text = text_preprocessing(text)
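
The for loop and the list comprehension described in the comments above are two ways of writing the same kind of logic. The sketch below applies both to three made-up lines (the variable names are chosen for illustration) to show that they give the same result.

```python
# Made-up input shaped like the output of .readlines()
raw_lines = ["Held:\n", "\n", "\tThe appeal is allowed.\n"]

# Version 1: an explicit for loop
cleaned_loop = []
for line in raw_lines:
    line = line.replace("\n", "").replace("\t", " ")
    if line != "":
        cleaned_loop.append(line)

# Version 2: the same logic written as a single list comprehension
cleaned_comprehension = [
    line.replace("\n", "").replace("\t", " ")
    for line in raw_lines
    if line.replace("\n", "").replace("\t", " ") != ""
]

print(cleaned_loop)                           # ['Held:', ' The appeal is allowed.']
print(cleaned_loop == cleaned_comprehension)  # True
```

Note that the tab is replaced by a space, so the second string keeps a leading space. Calling `.strip()` on each string would remove it, although the function above does not do so.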

### Function to split the text into sentences and categorise them

Since Blackstone predicts at the sentence level (ie. we cannot use the entire case as one string as an input to Blackstone), this function first splits the text into individual sentences using the sentence boundary detector in the Blackstone model, and then predicts a category for each sentence.

The sentence boundary detector is the part of the model's pipeline that detects where each sentence begins and ends.
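
To illustrate why a dedicated detector is useful, here is a deliberately naive splitter that breaks the text after every full stop, question mark or exclamation mark. This is a stand-in written for illustration only (the helper `naive_split` and the sample text are made up), not the method used by Blackstone or spaCy.

```python
import re

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

# Made-up sample containing two real sentences
sample = "The claim was brought by Spandeck Engineering (S) Pte. Ltd. against the agency. It failed."
pieces = naive_split(sample)

for piece in pieces:
    print(repr(piece))

# Two sentences went in, but four pieces come out
print(len(pieces))  # 4
```

The abbreviations "Pte." and "Ltd." are wrongly treated as sentence endings. Legal text is full of such abbreviations and citations, which is one reason to rely on a trained detector rather than a simple rule.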

In [17]:
def legal_cats(sentences):
    """
    Function to identify the highest scoring category prediction generated by the text categoriser. 

    Arguments: 
    a list of strings
    
    converts to spacy generator object, splits into sentences using spacy's sentence detector

    returns a tuple of: 
    a list of the split sentences,
    a list of the max cat and max score for each doc in tuples
    """
    doc_sentences = []

    # This passes each input string through the nlp model, and converts it to a doc object
    # This doc object contains both the original text and annotations added by the model, such as the sentence boundaries.
    # Each doc corresponds to one input string.

    docs = nlp.pipe(sentences, disable = ["tagger", "ner", "textcat"])

    # We loop through each document in the documents, and loop again through each sentence in the document, and append the sentence to doc_sentences, an empty list
    for doc in docs:
        for sentence in doc.sents:
            doc_sentences.append(sentence.text)
    
    # We can now categorise each sentence into one of the five abovementioned categories.

    # We convert the newly detected sentences into a doc object again, as it contains the categoriser attribute that we can use to predict
    
    docs = nlp.pipe(doc_sentences, disable = ["tagger", "parser", "ner"])

    # We create a list to store the predicted category and its score (ie. how strongly blackstone believes the sentence belongs to that category)
    # The index of this list corresponds to doc_sentences (ie. the first item in cats_list contains the predicted category and score for the first sentence in doc_sentences, the second item for the second sentence, and so on)

    cats_list = []

    # We loop through each doc (sentence) in the documents, and keep the highest scoring category and its score for each sentence

    # We have to select the highest scoring category because blackstone provides a score for each of the five categories which the sentence can fall under.

    # We are only concerned with blackstone's best prediction, and hence we only save the highest scoring category.
    for doc in docs:
        cats = doc.cats
        max_score = max(cats.values()) 
        max_cats = [k for k, v in cats.items() if v == max_score]
        max_cat = max_cats[0]
        cats_list.append((max_cat, max_score))

    return doc_sentences, cats_list
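
The selection step at the end of the function can be tried out without loading the model. The sketch below uses a hypothetical dictionary shaped like `doc.cats`; the scores are made up and do not come from Blackstone. It also shows a more compact way of picking the same category.

```python
# Hypothetical scores shaped like doc.cats (the numbers are made up)
example_cats = {
    "AXIOM": 0.02,
    "CONCLUSION": 0.10,
    "ISSUE": 0.05,
    "LEGAL_TEST": 0.80,
    "UNCAT": 0.03,
}

# Step by step, as in legal_cats: find the top score, then the category that has it
max_score = max(example_cats.values())
max_cat = [k for k, v in example_cats.items() if v == max_score][0]

# Compact equivalent: ask max() to compare the categories by their scores
best_category = max(example_cats, key=example_cats.get)

print(max_cat, max_score)        # LEGAL_TEST 0.8
print(best_category == max_cat)  # True
```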

## 3. Predicting on the processed text

With the above function defined, we can now finally use Blackstone to predict the categories of the text. This just involves calling the function with the cleaned text as the argument.

In [18]:
cats = legal_cats(text)
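
The function returns a tuple of two lists that line up by index. The sketch below uses a hypothetical return value of the same shape (the sentences and scores are made up, and the 0.5 cut-off is an arbitrary choice for illustration) to show how the tuple can be unpacked and summarised.

```python
from collections import Counter

# Hypothetical return value with the same shape as legal_cats(text)
example_result = (
    ["A duty of care requires proximity.", "The appeal is dismissed.", "The issue is one of causation."],
    [("LEGAL_TEST", 0.91), ("CONCLUSION", 0.88), ("ISSUE", 0.47)],
)

# Unpack the tuple into its two lists
sentences, predictions = example_result

# Tally how many sentences fall into each category
category_counts = Counter(category for category, score in predictions)

# Pair each sentence with its category, keeping only scores of at least 0.5
confident = [
    (sentence, category)
    for sentence, (category, score) in zip(sentences, predictions)
    if score >= 0.5
]

print(category_counts["LEGAL_TEST"])  # 1
print(len(confident))                 # 2
```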

## 4. Saving the predictions to a dataframe

Each sentence becomes one row of the dataframe, alongside the category that Blackstone predicted for it and the corresponding score.

In [19]:
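# Unpack the tuple returned by legal_cats into the sentences and their (category, score) pairs
sentences, predictions = cats

df = pd.DataFrame({
    "sentence": sentences,
    "category": [category for category, score in predictions],
    "score": [score for category, score in predictions],
})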

df.head()

df["category"].unique()

for sentence in df.loc[df["category"] == "LEGAL_TEST", "sentence"][:30]:
    print(sentence)
    print("-" * 40)

Tort – Negligence – Duty of care – Applicable test to determine existence of duty of care – Relationship between two-stage test and incremental approach – Application of two-stage test comprising first proximity and second policy considerations with threshold consideration of factual foreseeability – Incremental approach as methodological aid in applying specific criterion of two-stage test
----------------------------------------
Tort – Negligence – Duty of care – Applicable test to determine existence of duty of care – Whether type of damage claimed should result in different test – Application of single (two-stage) test irrespective of type of damage claimed
----------------------------------------
(1)    A single test should determine the imposition of a duty of care in all claims arising out of negligence, irrespective of the type of the damages claimed.
----------------------------------------
There was no justification for a general exclusionary rule against recovery of all econ