<img src="https://legalhackers19.s3.amazonaws.com/Mind+Map.png">

# Legal Hackers Annual Summit 2019
## Open Legal Data Forum

<img src="https://legalhackers19.s3.amazonaws.com/CAP_image.png">

The amassed body of the decisions of judges in the courts serves as an unwitting commentary on how far we have come (or, in some cases, how far we have yet to develop) as civilised societies governed by the rule of law. Case law is jam-packed with hidden insight just waiting to be uncovered. It can tell us how certain issues have risen to prominence and descended back into irrelevance over time. It can provide us with clues about how the occurrence of specific criminal offences relates to broader issues taking place in society. It can show us how far societal attitudes have evolved in any number of walks of life.

What is strange is that, until very recently, openly accessible case law data was virtually non-existent; the material could only be accessed on a document-by-document basis via expensive subscription services traditionally used by lawyers that, one way or another, prohibited “big data” research at scale. 

The launch of the Caselaw Access Project (CAP) by the Library Innovation Lab at Harvard University in 2018 was a significant leap forward for open legal data research: CAP exposes some 6.7 million US state decisions and 1.7 million federal decisions in a semi-structured format.

The CAP dataset provides the potential for a dizzying array of innovative applications, but it also presents us, as researchers, with a range of technical challenges bound up with wrangling and exploring such a large text-based dataset. 

In this workshop, we will:

1. Survey the current state of open access to legal data in the USA and United Kingdom, with a focus on case law. Participants are encouraged to share their experience and knowledge of open access in their own jurisdictions.
2. Take a tour through existing experiments with the CAP dataset.
3. Explore the fundamentals of wrangling text data and natural language processing.
4. Identify questions we would like to ask the data and devise strategies for answering those questions.

### Workshop Leaders

**Andrew Silva**, Application Developer at Harvard Law School’s Library Innovation Lab. Andrew is a self-taught developer who was working as a classically trained chef before accepting a position with LIL. He was technical lead during the digitization phase of the Caselaw Access Project and has worked in the field of law library technology at Harvard for about fifteen years.

**Daniel Hoadley**, Head of Research at the Incorporated Council of Law Reporting for England and Wales. Daniel is director of ICLR’s research lab, ICLR&D, and a developer of iclr.co.uk. Daniel is the author of the Blackstone Python library for legal natural language processing. Daniel was called to the English Bar in 2009 and previously worked as a legal journalist. 

---

## 1. Preparation (our *mise en place*)

Before we can start playing with the Caselaw Access Project data, we need to do a bit of prep-work and get our environment ready for business. 

We're going to be doing the data analysis in this *notebook* environment, using *Python 3.6* as our programming language. Notebooks aren't great for full-on development, but they are great for demonstrating code, because we can mix code with text and images!

We have the following jobs to do to get our environment ready:

1. Import the third-party libraries we'll be using at various points throughout the workshop.
2. Load a few statistical models (we'll be using these statistical models to make various *predictions* on the case law data).
3. Add some additional components to the models.

#### 1A. Import libraries

In [2]:
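# pandas: tabular data handling
import pandas as pd

# Blackstone: custom spaCy pipeline components for legal text
from blackstone.concepts import Concepts
from blackstone.rules.concept_rules import CONCEPT_PATTERNS
from blackstone.compound_cases import CompoundCases

# gensim: extractive summarisation
# (gensim.summarization was removed in gensim 4.0, so this import needs gensim 3.x)
from gensim.summarization import summarize

# spaCy: the core NLP library, plus its displaCy visualiser
import spacy
from spacy import displacy
import spacy.cli

# wasabi: pretty console messages
from wasabi import Printer

msg = Printer()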

#### 1B. Load the statistical models

We're going to be using two spaCy statistical models in this workshop:

* `en_core_web_sm`: this is a general-purpose NLP model trained on a variety of web-based text
* `en_blackstone_proto`: this is an experimental legal NLP model developed by ICLR&D

In [3]:
%%time
general_nlp = spacy.load('en_core_web_sm')
blackstone_nlp = spacy.load('en_blackstone_proto')

CPU times: user 2.96 s, sys: 422 ms, total: 3.38 s
Wall time: 3.79 s


#### 1C. Set up custom pipelines in the Blackstone model

We're going to be making use of two custom pipeline components that ship with Blackstone:

* `Concepts`: this will be used for keyword/keyphrase extraction
* `CompoundCases`: this will be used for the extraction of case_name + citation pairs (e.g. *Smith v Jones* \[1998\] 1 WLR 123)

In [4]:
if 'Concepts' not in blackstone_nlp.pipe_names:
  concepts_pipe = Concepts(blackstone_nlp) 
  blackstone_nlp.add_pipe(concepts_pipe)
if 'CompoundCases' not in blackstone_nlp.pipe_names:
  compound_pipe = CompoundCases(blackstone_nlp)
  blackstone_nlp.add_pipe(compound_pipe)

---
## 2. The Dataset

This notebook uses a dataset consisting of court decisions from Arkansas, Illinois and New Mexico courtesy of the Caselaw Access Project. The dataset consists of approximately 261,000 opinions.

* Arkansas: 59,735 cases in the dataset.
* Illinois: 183,146 cases in the dataset.
* New Mexico: 18,338 cases in the dataset.

The dataset is large (approx. 2.6 GB on disk)! To make this notebook easier to move around, we're going to be loading the data from a publicly accessible AWS S3 bucket.

#### 2A. Load the data

In [6]:
%%time
# this takes a little while...
df = pd.read_csv('https://legalhackers19.s3.amazonaws.com/CAP_compact_data.csv')

CPU times: user 8.38 s, sys: 5.68 s, total: 14.1 s
Wall time: 4min 1s


#### 2B. The "shape" of the data

We've loaded the case law data into a structure called a `DataFrame`. A `DataFrame` is essentially a table composed of rows and columns. 

In [7]:
df.shape

(52244, 5)

#### 2C. Take a peek at the first few rows in the data

In [8]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,casename,opinion,state
0,0,229198,"DANIEL P. YORK, Plaintiff-Appellee, v. GRAND T...",Mr. JUSTICE McGLOON delivered the opinion of t...,Illinois
1,1,215470,"Marie A. Hesse, Administratrix, Defendant in E...",Mr. Justice Barnes delivered the opinion of th...,Illinois
2,2,85404,"THE PEOPLE OF THE STATE OF ILLINOIS, Plaintiff...",JUSTICE K1ZZI delivered the opinion of the cou...,Illinois
3,3,251633,"STATE of New Mexico, Plaintiff-Appellee, v. La...","OPINION SOSA, Senior Justice. Defendant-appell...",New_Mexico
4,4,185073,"Elva M. Boulter, Appellee, vs. The Joliet Nati...",Mr. Justice Dunn delivered the opinion of the ...,Illinois
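The `Unnamed: ...` columns in the preview look like leftover index columns (an assumption on our part; they typically appear when a table is saved to CSV with its index and then read back in). As a rough sketch, here is how we might keep only the three meaningful columns and count cases per state. The rows below are made up for illustration; only the column names are taken from the preview above.

```python
import io

import pandas as pd

# A tiny, made-up stand-in for the workshop CSV (same column names, fake rows).
csv_text = """Unnamed: 0,Unnamed: 0.1,casename,opinion,state
0,229198,A v. B,Opinion text one,Illinois
1,215470,C v. D,Opinion text two,Illinois
2,251633,State v. E,Opinion text three,New_Mexico
"""

# usecols tells pandas to keep only the named columns while parsing.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["casename", "opinion", "state"],
)

# value_counts tallies how many rows carry each distinct state label.
state_counts = df_small["state"].value_counts()

print(df_small.shape)
print(state_counts)
```

Dropping unused columns at load time trims memory use, although it would not shorten the download itself.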


#### 2D. Examine the text of one of the opinions in the dataset

In [14]:
df['opinion'][1234]

'JUSTICE O’CONNOR delivered the opinion of the court: Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress. The motion was denied. Charles appeals. For the reasons below, we affirm. Charles and Nancy were married on July 16, 1977, and lived together until April of 1985, when Nancy moved to Lyons, Illinois. Charles remained at the marital residence in Berwyn, a two-flat in which the couple had lived in one apartment and rented the other. They had no children. On May 7, 1985, Nancy filed a petition for dissolution of marriage. On May 28, 1985, Nancy obtained an ex parte temporary restraining order to prevent Charles from harassing her. On August 23, 1985, Charles appeared and filed his response to the petition for dissolution. In January 1986, Nancy filed a petition for a second temporary restraining order, and on February 10, 1986, an agreed preliminary injunction was ent

#### 2E. Convert the table into a dictionary

To make the data a little bit easier to work with, we're going to take a copy of the data in the DataFrame and structure it as a `dictionary`. Our dictionary is going to use the case name as the key and the text of the opinion(s) as the value.

In [10]:
cases_dictionary = dict(zip(df['casename'], df['opinion']))

In [11]:
len(cases_dictionary)

51607
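The dictionary has fewer entries than the DataFrame has rows. One likely explanation (we have not verified it against the data) is that some case names appear more than once: dictionary keys are unique, so when `dict(zip(...))` meets a repeated key, the later value silently replaces the earlier one. The sketch below uses made-up case names to show the effect, along with one possible way to keep every opinion by grouping them into lists.

```python
from collections import defaultdict

# Hypothetical rows: two different opinions share the same case name.
casenames = ["Smith v. Jones", "State v. Doe", "Smith v. Jones"]
opinions = ["First opinion text", "Second opinion text", "Third opinion text"]

# Same construction as in the notebook: repeated keys overwrite earlier values.
collapsed = dict(zip(casenames, opinions))
print(len(collapsed))

# Alternative: collect every opinion for a case name in a list.
grouped = defaultdict(list)
for name, text in zip(casenames, opinions):
    grouped[name].append(text)

print(len(grouped["Smith v. Jones"]))
```

Whether grouping is worth the extra structure depends on the question being asked; for a quick exploration, the simpler one-opinion-per-name dictionary may be enough.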

## 3. Let's do some natural language processing!

We now have a large portion of the CAP data in memory, but before we start playing with it we're going to zoom in and focus on a single sentence taken from the dataset. Here's the sentence:

> Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress.

We're going to use a phenomenally useful Python library called spaCy to do the heavy lifting. We've already loaded the model, `general_nlp`, at **1B** above. 

In [17]:
sample_text = """Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress."""

In [18]:
sample_text

'Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress.'


---
#### A quick word about documents, sentences and, er, words!

Before we jump deeper into this section, it's worth saying a few words about some of the terminology and concepts that are going to come up. 

##### Corpus

The first concept is a **Corpus**. A corpus is simply a collection of *documents*. In the current context, our corpus is the CAP data (i.e. our collection of judgments).

##### Document

The second concept is a **Document**. A document, for our purposes, is a collection of sentences. In the current context, a document is a single judgment.

##### Sentence

The third concept is a **Sentence**. A sentence is a series of words and punctuation marks, *tokens*, arranged in an order that (hopefully) makes grammatical sense. In the current context, our sentence is:

> Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress.

##### Token

The final concept is the weird one: **Tokens**. A token is a single word or punctuation mark. So **>dissolution of marriage,<** consists of four tokens: `dissolution`, `of`, `marriage` and `,`.
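To make the four units concrete, here is a minimal sketch that nests them in plain Python. The `simple_tokenize` function is a deliberately crude stand-in we made up for illustration (a regular expression that splits off words and punctuation marks); it is not how spaCy tokenizes text, and the sentences are split by hand.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


# A corpus is a collection of documents; a document is a collection of sentences.
corpus = [
    ["The motion was denied.", "Charles appeals."],  # one (very short) document
]

for document in corpus:
    for sentence in document:
        print(simple_tokenize(sentence))

# The example from the text: three words plus a comma gives four tokens.
tokens = simple_tokenize("dissolution of marriage,")
print(tokens)
```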






<img src="https://legalhackers19.s3.amazonaws.com/Units_of_text.png">

#### 3A. Apply the model to the sample sentence

First, we apply the `general_nlp` model to the sample sentence.

In [19]:
doc = general_nlp(sample_text)

In [20]:
doc

Petitioner Nancy Carlson obtained a judgment for dissolution of marriage, which respondent Charles Carlson moved to vacate on the grounds of unconscionability and duress.

#### 3B. Tokenization (or "Tokenisation" on the other side of the Atlantic)



In [37]:
print("{0:<20} {1:>20} ".format("Token", "POS"))
print("-----------------------------------------")
for token in doc:
    print("{0:<20} {1:>20} ".format(token.text, token.pos_))

Token                                 POS 
-----------------------------------------
Petitioner                          PROPN 
Nancy                               PROPN 
Carlson                             PROPN 
obtained                             VERB 
a                                     DET 
judgment                             NOUN 
for                                   ADP 
dissolution                          NOUN 
of                                    ADP 
marriage                             NOUN 
,                                   PUNCT 
which                                 DET 
respondent                           NOUN 
Charles                             PROPN 
Carlson                             PROPN 
moved                                VERB 
to                                   PART 
vacate                               NOUN 
on                                    ADP 
the                                   DET 
grounds                              NOUN 
of          

#### Nouns

In [36]:
print("{0:<20} {1:>20} ".format("Token", "POS"))
print("-----------------------------------------")
for token in doc:
    if token.pos_ == "NOUN":
        print("{0:<20} {1:>20} ".format(token.text, token.pos_))

Token                                 POS 
-----------------------------------------
judgment                             NOUN 
dissolution                          NOUN 
marriage                             NOUN 
respondent                           NOUN 
vacate                               NOUN 
grounds                              NOUN 
unconscionability                    NOUN 
duress                               NOUN 


#### Verbs

In [38]:
print("{0:<20} {1:>20} ".format("Token", "POS"))
print("-----------------------------------------")
for token in doc:
    if token.pos_ == "VERB":
        print("{0:<20} {1:>20} ".format(token.text, token.pos_))

Token                                 POS 
-----------------------------------------
obtained                             VERB 
moved                                VERB 


#### Proper Nouns

In [35]:
print("{0:<20} {1:>20} ".format("Token", "POS"))
print("-----------------------------------------")
for token in doc:
    if token.pos_ == "PROPN":
        print("{0:<20} {1:>20} ".format(token.text, token.pos_))

Token                                 POS 
-----------------------------------------
Petitioner                          PROPN 
Nancy                               PROPN 
Carlson                             PROPN 
Charles                             PROPN 
Carlson                             PROPN 
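The three cells above repeat the same loop with a different part-of-speech tag each time. As a sketch, the loop could be wrapped in a small helper that takes the tag as an argument. The helper names are our own, and `FakeToken` is a hypothetical stand-in so the example runs without a model; in the notebook you would pass the spaCy `doc` instead, since the helper only relies on each token having `.text` and `.pos_`.

```python
from collections import namedtuple


def tokens_with_pos(doc, pos):
    """Return the text of every token in doc whose coarse POS tag equals pos."""
    return [token.text for token in doc if token.pos_ == pos]


def print_pos_table(doc, pos):
    """Print the matching tokens in the same two-column layout used above."""
    print("{0:<20} {1:>20} ".format("Token", "POS"))
    print("-" * 41)
    for text in tokens_with_pos(doc, pos):
        print("{0:<20} {1:>20} ".format(text, pos))


# Hypothetical stand-in for spaCy tokens, tagged by hand for illustration.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
fake_doc = [
    FakeToken("Nancy", "PROPN"),
    FakeToken("obtained", "VERB"),
    FakeToken("a", "DET"),
    FakeToken("judgment", "NOUN"),
]

print_pos_table(fake_doc, "VERB")
```

With the real `doc`, `print_pos_table(doc, "NOUN")` would stand in for the Nouns cell, and likewise for `"VERB"` and `"PROPN"`.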


#### Visualising syntactic dependencies in the sentence

https://universaldependencies.org/

In [41]:
displacy.render(doc, jupyter=True, style='dep')

In [42]:
doc = general_nlp(df['opinion'][1234])

In [45]:
sentences = [sent.text for sent in doc.sents]

In [46]:
len(sentences)

81

In [61]:
# Collect the unique PERSON entities, sorted alphabetically
PERSONS = sorted({ent.text for ent in doc.ents if ent.label_ == 'PERSON'})

In [62]:
PERSONS

['Berwyn', 'Charles', 'Charles Carlson', 'Nancy', 'Nancy Carlson', 'Stat']

In [75]:
# Collect the unique DATE entities, sorted alphabetically
DATES = sorted({ent.text for ent in doc.ents if ent.label_ == 'DATE'})

In [77]:
DATES

['1983',
 '1987',
 'April of 1985',
 'August 23, 1985',
 'December 1, 1987',
 'December 1, 1987,',
 'February 10',
 'February 10, 1986',
 'January 1986',
 'July 16, 1977',
 'March 31, 1988',
 'May 28, 1985',
 'May 7, 1985',
 'November 10, 1987',
 'September 10',
 'September 10, 1987',
 'September 4',
 'September 4, 1987',
 'September 7',
 'September 7, 1987',
 'just three days',
 'the same day']
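The raw `DATE` entities include near-duplicates (for example `'December 1, 1987'` and `'December 1, 1987,'`), partial dates and relative expressions. As a rough sketch of one possible clean-up step, we can strip trailing punctuation and keep only the strings that parse as a full "Month day, year" date. The helper name `parse_full_date` is our own, and the month names assume an English locale.

```python
from datetime import datetime

# A few of the raw entity strings from the output above.
raw_dates = [
    "December 1, 1987",
    "December 1, 1987,",
    "February 10",
    "February 10, 1986",
    "January 1986",
    "just three days",
]


def parse_full_date(text):
    """Return a date for 'Month day, year' strings, or None if it doesn't fit."""
    cleaned = text.strip().rstrip(",.;")
    try:
        return datetime.strptime(cleaned, "%B %d, %Y").date()
    except ValueError:
        # Partial dates and phrases like 'just three days' end up here.
        return None


# A set removes the duplicates; sorting date objects gives chronological order.
parsed = sorted({d for d in map(parse_full_date, raw_dates) if d is not None})

for d in parsed:
    print(d.isoformat())
```

Sorting real date objects also gives a chronological order, which the alphabetical sort of the raw strings above does not.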