# Tutorial 1 - Legal Clause Classification

Our corpus today is [LEDGAR](https://www.aclweb.org/anthology/2020.lrec-1.155.pdf), a dataset proposed in 2020 by Tuggener et al.

Each document is a provision from an actual contract, written in English.

A typical task in automatic contract discovery is to label each provision. Today, we will build a classifier that predicts the label of a legal provision.

# Pre-Requisites


* Machine Learning: 
   * `sklearn` LogisticRegression, Pipeline, GridSearchCV
   * Train / Test split, Cross-Validation
* Text Vectorization
   * Count Vectorizer parameters
   * Vocabulary
   * Stop Words
* Useful modules
   * pandas
   * numpy
   * matplotlib
* Platform
   * Colab has the advantage that downloads are quite fast, and it comes with a good amount of RAM
   * BUT it gives only 1 CPU, so computations can be slow, and parallelism will not speed them up
   * If you run the Notebook on your own laptop, the download might take longer, so **download in advance**

# Download

If you want to download it on your own:
* Here is the [Download Page](https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A)
* Select `LEDGAR_2016-2019_clean.jsonl.zip`
* Download it to your disk
* Unzip it: it will create a file named `LEDGAR_2016-2019_clean.jsonl`


In [None]:
!curl --header 'Host: drive.switch.ch' --user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --header 'DNT: 1' --referer 'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A' --cookie 'oc641cdd42e0=13fa9330b2ce3965b18f77fa775559a5; oc_sessionPassphrase=R8jPmBjCrGkdXvI6wU%2FsMQZUqXCizggT9Aeafu3cvoXN671zkATnNRIQDSPQ4wnI7DuS6BRugjqGEjXOASVujRWxtO8BFm%2B56mMQBKUPMPucLCzrehfVBGyP0i06dh9c' --header 'Upgrade-Insecure-Requests: 1' 'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019_clean.jsonl.zip&downloadStartSecret=038u1w43io1e' --output 'LEDGAR_2016-2019_clean.jsonl.zip'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  164M  100  164M    0     0  6926k      0  0:00:24  0:00:24 --:--:-- 12.4M


In [None]:
!unzip LEDGAR_2016-2019_clean.jsonl.zip -d /tmp/LEDGAR

Archive:  LEDGAR_2016-2019_clean.jsonl.zip
  inflating: /tmp/LEDGAR/LEDGAR_2016-2019_clean.jsonl  


# Prepare Data

The original dataset has a list of labels for each legal provision. In the code below, we reduce our scope to the Top N most frequent labels (`FOCUS_ON_TOP_N`, set to 2 at first), and assign only one label to each provision.



---


Adjust the path to the JSONL file if needed

In [None]:
import json

# Read the JSONL file: one JSON document per line
with open('/tmp/LEDGAR/LEDGAR_2016-2019_clean.jsonl') as f:
    data = [json.loads(line) for line in f]
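
As an alternative, pandas can parse a JSON Lines file directly. The sketch below uses two made-up records held in memory (only the field names `provision`, `label` and `source` mirror the dataset), so it runs without the download. With the real file, you would pass its path instead of the `StringIO` object.

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"provision": "This Agreement shall be governed by the laws of X.", '
    '"label": ["governing laws"], "source": "a.htm"}\n'
    '{"provision": "This Agreement may be amended in writing.", '
    '"label": ["amendments", "waivers"], "source": "b.htm"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON document.
toy_df = pd.read_json(sample_jsonl, lines=True)
print(toy_df.shape)
print(toy_df['label'].tolist())
```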


In [None]:
import pandas as pd
df = pd.DataFrame(data)
df = df.drop(columns=['source'])
print(f'Shape: {df.shape}')
print(f'Columns: {df.columns}')

Shape: (846274, 2)
Columns: Index(['provision', 'label'], dtype='object')


In [None]:
df.sample(10)

| | provision | label |
|---|---|---|
| 270277 | Subject to Article 5, the Corporation reserves... | [adoption, amendments] |
| 116841 | The headings contained in this Agreement are f... | [headings] |
| 281062 | From and after the Effective Date and during t... | [benefits] |
| 478459 | The Employee and the Company agree that this T... | [interpretations] |
| 673174 | Subject to compliance with any applicable secu... | [transferability] |
| 603324 | In the event of a termination of Executive's e... | [exclusive remedy] |
| 663552 | Each Party hereby agrees to take all such acti... | [further assurances] |
| 410889 | Upon the Collateral Custodian’s receipt of a C... | [successor collateral custodian] |
| 763178 | Any and all notices or other communications or... | [notices] |
| 725926 | Other than with respect to Permitted Policy Am... | [modification of investment policies] |


In [None]:
type(df.iloc[0]['label'])

list

In [None]:
df['nb_labels'] = df['label'].apply(len)
print(df['nb_labels'].value_counts())

1    707151
2    118525
3     17749
4      2338
5       408
6        58
7        40
8         5
Name: nb_labels, dtype: int64


<a id='focus'></a>
## Focus on some labels


Each provision is associated with a list of labels. For this tutorial, we will focus on predicting a single label for each provision, and we will restrict the set of labels we consider.

We will focus on only the TOP N most frequent labels. 

Adjust the variable `FOCUS_ON_TOP_N` to the number of labels you want to consider.

We start with 2.




In [None]:
FOCUS_ON_TOP_N = 2

In [None]:
all_labels = [x for ls in df['label'] for x in ls]
proto_labels = pd.Series(all_labels).value_counts()[:FOCUS_ON_TOP_N].index
print(proto_labels)

Index(['governing laws', 'amendments'], dtype='object')


In [None]:
focus = df[df['label'].apply(lambda x: any((z in x for z in proto_labels)))]
print(f'FOCUS on {focus.shape[0]} documents')

FOCUS on 30582 documents


In [None]:
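def select_label(list_labels):
    # proto_labels is sorted by frequency, so the most frequent matching label wins
    for x in proto_labels:
        if x in list_labels:
            return x

    # Should not happen: `focus` only keeps provisions with at least one focus label
    raise ValueError(f'No focus label in {list_labels}')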

y = focus['label'].apply(select_label)
X = focus['provision']

In [None]:
print('Labels :')
print(y[:10])
print()
print('Provisions :')
print(X[:10])

Labels :
17         amendments
61     governing laws
81         amendments
83     governing laws
115        amendments
143    governing laws
165        amendments
167    governing laws
173    governing laws
190        amendments
Name: label, dtype: object

Provisions :
17     That Defaulting Lender’s right to approve or d...
61     The validity, interpretation, construction and...
81     This Agreement contains the entire agreement b...
83     This Agreement shall be governed by and constr...
115    The provisions of this Agreement, or any other...
143    This Agreement shall be construed and enforced...
165    Any term, covenant, or condition of this Note ...
167    This Note shall be governed by and construed i...
173    This Amendment shall be governed by and constr...
190    The issuance by the Agent of any amendment, su...
Name: provision, dtype: object
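
To make the selection rule concrete, here is a minimal sketch on made-up label lists. The names `toy_labels`, `top_n` and `pick_label` are hypothetical and exist only for this illustration. It shows that when a provision carries several focus labels, the most frequent one is kept, and that provisions without any focus label are dropped.

```python
import pandas as pd

# Made-up label lists: 'notices' is the most frequent label, then 'amendments'.
toy_labels = pd.Series([
    ['notices'],
    ['amendments', 'notices'],
    ['notices', 'waivers'],
    ['amendments'],
    ['headings'],
])

top_n = 2
# Flatten the lists, count each label, keep the top_n most frequent ones.
counts = toy_labels.explode().value_counts()
top_labels = list(counts.index[:top_n])

def pick_label(labels, candidates=top_labels):
    """Return the first candidate (most frequent first) found in labels, else None."""
    return next((c for c in candidates if c in labels), None)

# Provisions without any focus label get None and are dropped.
picked = toy_labels.apply(pick_label).dropna()
print(top_labels)
print(picked.tolist())
```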


# EXERCISE: Classification

Your task for this tutorial is to use Text Representation and Machine Learning in order to predict the label for each provision.

* **IN**: the text of a provision
* **OUT**: a predicted label

First, define the terms:
* **CORPUS**: ??
* **DOCUMENT**: ??
* **TASK**: ??

Next, we split the data into Train/Test.
* Every vectorizer (or other transformer) is `fit()` or `fit_transform()` on the **TRAIN** set only
* The **TEST** set is only passed through `transform()`
* We use `stratify` to make sure the class balance is the same in TRAIN and TEST

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
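
To see what `stratify` does, the sketch below repeats the split on made-up, imbalanced toy data and counts the classes on each side. The `toy_` names and the `random_state=0` seed are assumptions for this illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced toy data: 80 documents of class 'a', 20 of class 'b'.
toy_X = [f'document {i}' for i in range(100)]
toy_y = ['a'] * 80 + ['b'] * 20

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, stratify=toy_y, random_state=0
)

# With stratify, both splits keep the 80/20 class ratio.
print(Counter(toy_y_train))
print(Counter(toy_y_test))
```

The same check can be run on the real split with `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.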

## TODO - Vocabulary / BoW

Use `CountVectorizer` or `TfidfVectorizer` to create the Vocabulary
* Use some values for `min_df`, `max_df`, `ngram_range`, `max_features`
* Select `stop_words='english'`
* Display the size of the vocabulary
* Display random terms from the vocabulary

In [None]:
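# TODO - Instantiate a Vectorizer (also try min_df, max_df, ngram_range, max_features)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')

# fit it to the TRAIN texts
vect.fit(X_train)

# transform the TRAIN texts into TRAIN BoW
X_train_bow = vect.transform(X_train)
# transform the TEST texts into TEST BoW
X_test_bow = vect.transform(X_test)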

In [None]:
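# TODO - VOCABULARY
import numpy as np
vocabulary = vect.get_feature_names_out()
# print the vocabulary size
print(len(vocabulary))
# print random terms
print(np.random.choice(vocabulary, size=10, replace=False))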

## TODO - Machine Learning

Train a `LogisticRegression` model on this classification task.

Remember Machine Learning 101:
* `fit` your model on the **TRAIN** data
* evaluate on the **TEST** data

In [None]:
# TODO - Instantiate a LogisticRegression
# fit it to your TRAIN data

In [None]:
# TODO - Evaluate on the TEST data
# Use classification_report
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=clf.predict(X_test_bow)))

The evaluation cell expects the classifier to be stored in a variable named `clf`, and the TEST BoW in a variable named `X_test_bow`.
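
The fit / evaluate cycle can be sketched end to end on a made-up toy corpus. All texts and `toy_` names below are hypothetical, and `max_iter=1000` is simply a generous setting for this sketch. The reported scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Made-up TRAIN and TEST provisions with their labels.
toy_train = [
    'This Agreement shall be governed by the laws of New York.',
    'This Note is governed by the laws of Delaware.',
    'The laws of Texas govern this Agreement.',
    'This Agreement may be amended only in writing.',
    'No amendment of this Note is valid unless in writing.',
    'This Plan may be amended by the Board in writing.',
]
toy_y_train = ['governing laws'] * 3 + ['amendments'] * 3
toy_test = [
    'This Plan shall be governed by the laws of Ohio.',
    'This Note may be amended in writing.',
]
toy_y_test = ['governing laws', 'amendments']

# Fit the vectorizer and the classifier on TRAIN only.
toy_vect = CountVectorizer()
toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_vect.fit_transform(toy_train), toy_y_train)

# Evaluate on TEST, which is only transformed.
toy_pred = toy_clf.predict(toy_vect.transform(toy_test))
print(classification_report(y_true=toy_y_test, y_pred=toy_pred))
```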

## Visualization

This code shows you how to display which terms have the highest coefficients in the logistic regression.

High coefficients are attributed to words that are good indicators for a class.

In [None]:
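# TODO - adjust names
clf = yourclassifier
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = yourvectorizer.get_feature_names_out()


print(clf.coef_.shape)
# In the binary case coef_ has a single row, so count the classes with classes_
print(f'Nb Classes: {len(clf.classes_)}, Nb Words: {clf.coef_.shape[1]}')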

In [None]:
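import numpy as np
# In the binary case, coef_ has a single row whose positive values point to clf.classes_[1]
class_names = clf.classes_[1:] if clf.coef_.shape[0] == 1 else clf.classes_
coefs = pd.DataFrame([{'class': class_names[i], 'word': words[j], 'coef': co} for (i, j), co in np.ndenumerate(clf.coef_)])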

In [None]:
coefs.shape

In [None]:
# Sort by class, then by decreasing coefficient within each class
sort_by_coef = coefs.sort_values(['class', 'coef'], ascending=[True, False]).reset_index(drop=True)

In [None]:
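import matplotlib.pyplot as plt

cut = 10

# One subplot per class found in sort_by_coef, two subplots per row
groups = list(sort_by_coef.groupby('class'))
nrows = (len(groups) + 1) // 2
fig, axs = plt.subplots(nrows=nrows, ncols=2, figsize=(20, 10 * nrows), squeeze=False)

for ax, (c, g) in zip(axs.ravel(), groups):
    t_cut = g.head(cut)
    ax.bar(x=range(len(t_cut)), height=t_cut['coef'])
    ax.set_xticks(range(len(t_cut)))
    ax.set_xticklabels(t_cut['word'], rotation=45, ha='right')
    ax.set_title(c)

# Hide subplots that have no class to show
for ax in axs.ravel()[len(groups):]:
    ax.set_visible(False)

plt.show()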

## TODO - GridSearch

* Go back to Section `Focus on some labels`
* Adjust `FOCUS_ON_TOP_N` to 2 again
* Create a `Pipeline` with a Vectorizer and a Logistic Regression
* We want to run a GridSearch on different hyperparameters:
   * coefficient `C` of the Logistic Regression
   * parameter `ngram_range` of the Vectorizer

For inspiration:
* [SKlearn example](https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html)
* [Blog Post by Analytics Vidhya](https://medium.com/analytics-vidhya/ml-pipelines-using-scikit-learn-and-gridsearchcv-fe605a7f9e05)
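
As a further hint, here is a minimal sketch of a `Pipeline` searched with `GridSearchCV` on a made-up toy corpus. The step names `vect` and `clf`, the candidate values, and `cv=2` are arbitrary choices for this illustration. On the real data you would pass `X_train` and `y_train` and likely use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up provisions and labels (4 per class so that 2-fold CV is possible).
toy_texts = [
    'This Agreement shall be governed by the laws of New York.',
    'This Note is governed by the laws of Delaware.',
    'The laws of Texas govern this Agreement.',
    'This Plan is governed by the laws of Ohio.',
    'This Agreement may be amended only in writing.',
    'No amendment of this Note is valid unless in writing.',
    'This Plan may be amended by the Board in writing.',
    'Any amendment must be signed by both parties.',
]
toy_labels = ['governing laws'] * 4 + ['amendments'] * 4

# The step names 'vect' and 'clf' are arbitrary choices for this sketch.
toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Keys follow the '<step>__<parameter>' naming convention.
toy_param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

toy_grid = GridSearchCV(toy_pipe, toy_param_grid, cv=2)
toy_grid.fit(toy_texts, toy_labels)
print(toy_grid.best_score_)
print(toy_grid.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every cross-validation fold, so no vocabulary leaks from the validation part.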

In [None]:
# TODO - Create the Pipeline: Vectorizer + LogisticRegression
# Setup some of the parameters of the vectorizer (stop_words, ...) but NOT ngram_range

In [None]:
# TODO - Create the hyperparameter search space
# if a pipeline step is named 'step' and it has a parameter 'parameter', then the parameter dictionary must have an entry 'step__parameter'

In [None]:
# TODO - Fit the grid to the training data

In [None]:
# TODO - Print the best score, the best params
# TODO - Print the classification report with the TEST data

## TODO - More Labels

This part explores how the task gets harder as the number of classes and the number of documents grow.

* Go back to Section `Focus on some labels` higher up.
* Adjust the variable `FOCUS_ON_TOP_N` to a higher value (4, 6)
* Execute all cells of `Focus on some labels`, `Vocabulary`, `Machine Learning`
* See how long it takes to `fit` the training data again (put the `%%time` cell magic on the first line of the fitting cell to display it)
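
Outside of a notebook, or to time a single step, `time.perf_counter()` can be used instead of a cell magic. The sketch below times the fit on a made-up mini corpus with 2 and then 4 labels. The corpus and the helper `timed_fit` are hypothetical, and timings on such tiny data are not representative of the real dataset.

```python
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up mini corpus: two short provisions per label.
toy_corpus = {
    'governing laws': ['Governed by the laws of New York.', 'The laws of Delaware govern this Note.'],
    'amendments': ['May be amended only in writing.', 'No amendment is valid unless signed.'],
    'notices': ['All notices shall be sent by mail.', 'Notice is deemed given upon receipt.'],
    'headings': ['Headings are for convenience only.', 'Section headings do not affect meaning.'],
}

def timed_fit(label_names):
    """Fit vectorizer + classifier on the chosen labels; return (classifier, seconds)."""
    texts = [t for name in label_names for t in toy_corpus[name]]
    labels = [name for name in label_names for _ in toy_corpus[name]]
    start = time.perf_counter()
    bow = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(bow, labels)
    return clf, time.perf_counter() - start

for n in (2, 4):
    toy_clf, seconds = timed_fit(list(toy_corpus)[:n])
    print(f'{n} classes: fit took {seconds:.4f} s')
```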