# Tutorial 1 - Legal Clause Classification

Our corpus today is [LEDGAR](https://www.aclweb.org/anthology/2020.lrec-1.155.pdf), a dataset proposed in 2020 by Tuggener et al.

Each document is a provision from an actual contract, written in English.

A typical task in the automatic analysis of contracts is labeling each provision. Today, we will build a classifier that predicts the label of a legal provision.

# Pre-Requisites


* Machine Learning: 
   * `sklearn` LogisticRegression, Pipeline, GridSearchCV
   * Train / Test split, Cross-Validation
* Text Vectorization
   * Count Vectorizer parameters
   * Vocabulary
   * Stop Words
* Useful modules
   * pandas
   * numpy
   * matplotlib
* Platform
   * Colab has the advantage of fast downloads and a good amount of RAM
   * BUT it provides only 1 CPU, so computations can be slow and parallelism will not help
   * If you run your own Notebook instance on your laptop, the download might take longer, so **download in advance**

# Download

If you run this Notebook on your own computer with Jupyter Notebook instead of Google Colab, download the dataset yourself:
* Here is the [Download Page](https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A)
* Select `LEDGAR_2016-2019_clean.jsonl.zip`
* Download it to your disk
* Unzip it: it will create a file named `LEDGAR_2016-2019_clean.jsonl`
* Skip the next 2 cells (the `curl` download and the `unzip`)
* Adjust the file path in the first cell of *Import and Prepare Data*
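
If the `unzip` command is not available on your machine, the archive can also be extracted with Python's standard `zipfile` module. The following is a minimal sketch; the helper name `extract_archive` and the paths in the usage comment are illustrative and need to be adapted to where you saved the file.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()

# Hypothetical usage, adjust both paths to your setup:
# extract_archive('LEDGAR_2016-2019_clean.jsonl.zip', '/tmp/LEDGAR')
```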


In [None]:
!curl --header 'Host: drive.switch.ch' --user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --header 'DNT: 1' --referer 'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A' --cookie 'oc641cdd42e0=13fa9330b2ce3965b18f77fa775559a5; oc_sessionPassphrase=R8jPmBjCrGkdXvI6wU%2FsMQZUqXCizggT9Aeafu3cvoXN671zkATnNRIQDSPQ4wnI7DuS6BRugjqGEjXOASVujRWxtO8BFm%2B56mMQBKUPMPucLCzrehfVBGyP0i06dh9c' --header 'Upgrade-Insecure-Requests: 1' 'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019_clean.jsonl.zip&downloadStartSecret=038u1w43io1e' --output 'LEDGAR_2016-2019_clean.jsonl.zip'

In [None]:
!unzip LEDGAR_2016-2019_clean.jsonl.zip -d /tmp/LEDGAR

# Import and Prepare Data

In [None]:
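import json

# Each line of the .jsonl file is one JSON object (one provision).
# The `with` block makes sure the file is closed after reading.
with open('/tmp/LEDGAR/LEDGAR_2016-2019_clean.jsonl', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]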
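
As an alternative, pandas can parse a JSON Lines file directly with `pd.read_json(..., lines=True)`. The sketch below uses two invented records held in memory instead of the real file; the field names `provision`, `label` and `source` follow the columns used in this notebook, while the values are made up for illustration.

```python
import io
import pandas as pd

# Two invented records standing in for lines of the real .jsonl file.
sample = io.StringIO(
    '{"provision": "This Agreement is governed by the laws of New York.", '
    '"label": ["governing laws"], "source": "a.htm"}\n'
    '{"provision": "Any notice must be given in writing.", '
    '"label": ["notices"], "source": "b.htm"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object.
toy_df = pd.read_json(sample, lines=True)
print(toy_df.shape)
print(list(toy_df.columns))
```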


In [None]:
import pandas as pd
df = pd.DataFrame(data)
df = df.drop(columns=['source'])
print(f'Shape: {df.shape}')
print(f'Columns: {df.columns}')

In [None]:
df.sample(20)

In [None]:
type(df.iloc[0]['label'])

In [None]:
df['nb_labels'] = df['label'].apply(len)
print(df['nb_labels'].value_counts())

With 6 classes, the test scores are no longer all `1.0`, although most of them are still `> 0.9`.

You can try setting `FOCUS_ON_TOP_N` even higher, but this will slow down the LogisticRegression and the Pipeline.

In [None]:
FOCUS_ON_TOP_N = 6

In [None]:
all_labels = [x for ls in df['label'] for x in ls]
proto_labels = pd.Series(all_labels).value_counts()[:FOCUS_ON_TOP_N].index
print(proto_labels)

In [None]:
focus = df[df['label'].apply(lambda x: any((z in x for z in proto_labels)))]
print(focus.shape)

In [None]:
def select_label(list_labels):
    for x in proto_labels:
        try:
            idx = list_labels.index(x)
            return list_labels[idx]
        except ValueError:
            continue
   
    raise ValueError

y = focus['label'].apply(select_label)
X = focus['provision']
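
A provision can carry several labels, and `select_label` keeps the one that comes first in `proto_labels`, which is ordered from most to least frequent. The sketch below shows the same idea on invented labels; `pick_first_match` and `ranked_labels` are hypothetical names used only for this illustration.

```python
# Invented label ranking, most frequent first (stands in for proto_labels).
ranked_labels = ['governing laws', 'notices', 'amendments']

def pick_first_match(list_labels, ranking):
    """Return the highest-ranked label that appears in list_labels."""
    for candidate in ranking:
        if candidate in list_labels:
            return candidate
    raise ValueError(f'none of {list_labels} is in the ranking')

# 'notices' ranks above 'amendments', so it wins.
print(pick_first_match(['amendments', 'notices'], ranked_labels))
# Labels outside the ranking are ignored.
print(pick_first_match(['waivers', 'governing laws'], ranked_labels))
```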

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
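
The `stratify=y` argument keeps the class proportions the same in the training and test sets. A small sketch on invented data (80 samples of class `a`, 20 of class `b`) illustrates this; `random_state=0` is added here only to make the toy split reproducible.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented imbalanced data: 80% class 'a', 20% class 'b'.
toy_X = list(range(100))
toy_y = ['a'] * 80 + ['b'] * 20

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, stratify=toy_y, random_state=0
)

# The 10 test samples keep the 80/20 ratio: 8 of 'a' and 2 of 'b'.
print(Counter(toy_y_test))
```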

# Vectorizer and Logistic Regression

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    stop_words='english',
    min_df=3,
    max_df=0.9
)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
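
The `min_df` and `max_df` parameters prune the vocabulary by document frequency: an integer `min_df` drops words that occur in fewer than that many documents, and a fractional `max_df` drops words that occur in more than that share of documents. The sketch below shows the effect on an invented four-document corpus, with `stop_words` left out so that the pruning is easy to see.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus.
toy_docs = [
    'the borrower shall pay the lender',
    'the lender shall notify the borrower',
    'the agreement shall terminate',
    'notice shall be given to the lender',
]

toy_vectorizer = CountVectorizer(min_df=2, max_df=0.9)
toy_vectorizer.fit(toy_docs)

# 'the' and 'shall' appear in all 4 documents (more than 90%) and are dropped.
# Words seen in a single document are dropped by min_df=2.
print(sorted(toy_vectorizer.vocabulary_))  # ['borrower', 'lender']
```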

In [None]:
# Which words
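# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# (available since scikit-learn 1.0).
words = vectorizer.get_feature_names_out()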

print(f'Vocabulary size: {len(words)}')
one_every_1000 = '\n'.join(words[::1000])
print(f'Sample:\n{one_every_1000}')

In [None]:
from sklearn.linear_model import LogisticRegression
# max_iter must be an integer; recent scikit-learn versions reject the float 1e4.
clf = LogisticRegression(max_iter=10_000).fit(X_train_bow, y_train)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=clf.predict(X_test_bow)))

# Visualization

In [None]:
print(clf.coef_.shape)
print(f'Nb Classes: {clf.coef_.shape[0]}, Nb Words: {clf.coef_.shape[1]}')

In [None]:
import numpy as np
coefs = pd.DataFrame([{'class': clf.classes_[i], 'word': words[j], 'coef': co} for (i, j), co in np.ndenumerate(clf.coef_)])

In [None]:
coefs.shape

In [None]:
# Within each class, order the words from highest to lowest coefficient.
sort_by_coef = coefs.sort_values(['class', 'coef'], ascending=[True, False]).reset_index(drop=True)

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 30))

cut = 10

# axs.flat iterates over the 3 x 2 grid of axes, one subplot per class.
for ax, (c, g) in zip(axs.flat, sort_by_coef.groupby('class')):
    t_cut = g.head(cut)
    ax.bar(x=range(cut), height=t_cut['coef'])
    ax.set_xticks(range(cut))
    ax.set_xticklabels(t_cut['word'], rotation=45, ha='right')
    ax.set_title(c)

plt.show()

# Grid Search

We consider a pipeline with 2 stages:
* Vectorizer (IN: text, OUT: bag of words = vectors)
* Classifier (IN: vectors, OUT: predictions)
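
To make the two stages concrete, here is a minimal sketch of such a pipeline trained on a handful of invented sentences. The texts and the labels `law` and `notice` are made up for illustration. Note how the step names (`vect`, `logreg`) become prefixes of the parameter names, which is what the grid search relies on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data.
toy_texts = [
    'governed by the laws of new york',
    'governed by the laws of delaware',
    'notices shall be sent in writing',
    'notices shall be delivered by mail',
]
toy_labels = ['law', 'law', 'notice', 'notice']

toy_pipe = Pipeline([
    ('vect', CountVectorizer()),       # text -> bag-of-words vectors
    ('logreg', LogisticRegression()),  # vectors -> predicted label
])

# The pipeline takes raw text; vectorization happens inside fit and predict.
toy_pipe.fit(toy_texts, toy_labels)
print(toy_pipe.predict(['notices shall be sent by mail']))

# Parameters of each step are addressed as '<step name>__<parameter>'.
print('vect__ngram_range' in toy_pipe.get_params())
```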


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('logreg', LogisticRegression(solver='sag', multi_class='multinomial', penalty='l2'))
])

In [None]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'vect__max_features': (None, 5000),
    'logreg__C': np.logspace(-2, 2, num=5)
}

Warning: this will take quite some time!

* Expect about 20 minutes on Google Colab
* If you run on your own laptop, set `n_jobs=-1` below to use all CPU cores
* You will see `ConvergenceWarning` messages. If you are patient enough, add `max_iter=10000` above, after `penalty='l2'`
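
The running time follows from the size of the grid. The sketch below counts the parameter combinations with `ParameterGrid`, using a copy of the grid defined above (named `toy_parameters` here), and assumes the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not given (scikit-learn 0.22 and later).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Copy of the grid used in this tutorial.
toy_parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_features': (None, 5000),
    'logreg__C': np.logspace(-2, 2, num=5),
}

n_candidates = len(ParameterGrid(toy_parameters))  # 2 * 2 * 5 combinations
n_folds = 5  # assumed default of GridSearchCV when cv is not specified

# 20 candidates x 5 folds = 100 pipeline fits, plus one final refit on all training data.
print(n_candidates, n_candidates * n_folds)  # 20 100
```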

In [None]:
grid = GridSearchCV(pipe, parameters, n_jobs=1, verbose=2)

In [None]:
grid.fit(X_train, y_train)

In [None]:
print(grid.best_score_)

In [None]:
print(grid.best_params_)

In [None]:
grid.score(X_test, y_test)
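
Beyond the best score and parameters, a fitted `GridSearchCV` stores the result of every candidate in `cv_results_`, which is convenient to inspect as a DataFrame. The sketch below runs a tiny grid search on invented sentences so that it finishes instantly; the data, the grid over `logreg__C` and `cv=2` are choices made only for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data: 4 sentences per class.
toy_texts = [
    'governed by the laws of new york',
    'governed by the laws of delaware',
    'construed under the laws of texas',
    'subject to the laws of ohio',
    'notices shall be sent in writing',
    'notices shall be delivered by mail',
    'any notice shall be in writing',
    'notice may be sent by email',
]
toy_labels = ['law'] * 4 + ['notice'] * 4

toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('logreg', LogisticRegression()),
])
toy_grid = GridSearchCV(toy_pipe, {'logreg__C': [0.1, 1.0, 10.0]}, cv=2)
toy_grid.fit(toy_texts, toy_labels)

# One row per candidate, best rank first.
results = pd.DataFrame(toy_grid.cv_results_)[
    ['param_logreg__C', 'mean_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

The same `pd.DataFrame(grid.cv_results_)` call works on the `grid` object fitted above.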