# Analyzing Legislative Burden Upon Businesses Using NLP and ML

### Imports

In [None]:
import spacy
import pandas as pd

In [None]:
# Load the English pipeline (tokenizer, tagger, parser, etc.)
nlp = spacy.load('en')

In [None]:
aoda = pd.read_csv('../data/sents_and_titles_w_labels.csv')

In [None]:
aoda.head()

## Identify burdens

**Objective**

* Extract sentences that define obligations

**Method**

- The structure of legal texts is relatively rigid and the lexicon is limited.
- No labeled examples were available, so supervised learning was not an option.
- We therefore implemented rule-based extraction based on a lightweight ontology.

In [None]:
from utils import BURDENS

In [None]:
BURDENS

**WordNet®** is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the [browser](http://wordnetweb.princeton.edu/perl/webwn). WordNet is also freely and publicly available for download.

In [None]:
from nltk.corpus import wordnet

synonyms = []

for w in BURDENS:
    for syn in wordnet.synsets(w, pos='v'):
        for l in syn.lemmas():
            synonyms.append(l.name())

Together, the verbs in `BURDENS` and `synonyms` define a lightweight ontology that is used to identify the obligations prescribed by AODA. Given a sentence:
- Extract the lemma of each token
- If any of the lemmas appears in the ontology, add the sentence to the list of burdens
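
The rule above can be sketched without spaCy. This is only an illustration: `TOY_LEMMAS`, `TOY_ONTOLOGY`, `toy_lemmatize` and the three sentences are made up for this example, whereas the notebook takes lemmas from spaCy and the verb list from `BURDENS` plus the WordNet synonyms.

```python
# Hypothetical stand-ins: the notebook itself gets lemmas from spaCy and the
# verb list from utils.BURDENS plus WordNet synonyms.
TOY_LEMMAS = {"provides": "provide", "ensures": "ensure", "requires": "require"}
TOY_ONTOLOGY = {"shall", "must", "ensure", "require"}


def toy_lemmatize(sentence):
    """Lowercase, strip punctuation, and map each word to a (toy) lemma."""
    words = [w.strip(".,;:()").lower() for w in sentence.split()]
    return [TOY_LEMMAS.get(w, w) for w in words if w]


def is_burden(sentence, ontology=TOY_ONTOLOGY):
    """Tag a sentence as a burden if any of its lemmas is in the ontology."""
    return any(lemma in ontology for lemma in toy_lemmatize(sentence))


# Made-up sentences, for illustration only
sentences = [
    "Every obligated organization shall establish accessibility policies.",
    "This Regulation applies to the Government of Ontario.",
    "The employer ensures that training is provided.",
]
tags = [is_burden(s) for s in sentences]
print(tags)
```

Storing the ontology as a set keeps each membership test constant-time, which matters once every token of every sentence is checked.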

In [None]:
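# Build the ontology once as a set, then flag sentences containing any of its lemmas
ontology = set(BURDENS) | set(synonyms)
aoda['tagged_as_burden'] = aoda.text\
    .map(lambda sent: any(token.lemma_ in ontology for token in nlp(sent)))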

### Evaluation

In [None]:
aoda['tagged_as_burden'].sum()

* Accuracy

In [None]:
(aoda.is_burden == aoda.tagged_as_burden.astype(int)).sum()/len(aoda)

* TP: true positives (burdens correctly tagged as burdens)
* FP: false positives (non-burdens tagged as burdens)
* FN: false negatives (burdens that were not tagged)

In [None]:
TP = ((aoda.is_burden == 1) & (aoda.tagged_as_burden == True)).sum()
FP = ((aoda.is_burden == 0) & (aoda.tagged_as_burden == True)).sum()
FN = ((aoda.is_burden == 1) & (aoda.tagged_as_burden == False)).sum()

* Precision: $\frac{TP}{TP + FP}$

In [None]:
TP / (TP + FP)

Example of a sentence incorrectly classified as a burden

In [None]:
aoda.iloc[17]

In [None]:
aoda.iloc[17]['text']

* Recall: $\frac{TP}{TP + FN}$

In [None]:
TP / (TP + FN)

Example of a burden that isn't extracted by our method

In [None]:
aoda.iloc[33]

In [None]:
aoda.iloc[33]['text']
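
As a sanity check on the metrics used in this section, here is a small worked example in plain Python. The label vectors are invented and `confusion_counts` is a hypothetical helper, not part of the notebook's `utils`.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p)
    fp = sum(1 for t, p in pairs if t == 0 and p)
    fn = sum(1 for t, p in pairs if t == 1 and not p)
    tn = sum(1 for t, p in pairs if t == 0 and not p)
    return tp, fp, fn, tn


# Invented labels: 1 = burden in the gold column, True = tagged by the rule
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [True, True, True, False, True, True, False, False]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of tagged sentences that are real burdens
recall = tp / (tp + fn)     # share of real burdens that were tagged
print(tp, fp, fn, tn, accuracy, precision, recall)
```

In this toy case the rule finds three of the four burdens (recall 0.75) but two of its five positive tags are wrong (precision 0.6).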

## Identify the subjects of the burdens

### Objective

- Extract the subjects of the burdens
- Organize burdens into homogeneous groups based on the entities they affect, e.g. public vs. private

Dependency parsing tags for sentence subjects: 
* `nsubj`: nominal subject ('the `cat` is in the box'),
* `nsubjpass`: passive nominal subject ('a safety `video` will be played before take-off'),
* `csubj`: clausal subject (a clausal syntactic subject of a clause, e.g. '**what you** `say` makes sense'),
* `csubjpass`: clausal passive subject (a clausal syntactic subject of a passive clause, '**that she** `lied` was suspected by everyone'),
* `agent`: link between a passive participle and the by-PP introducing its agent ('the door was opened by `him`'),
* `expl`: expletive ('`there` is a ghost in the room')

**Obligated organizations that are school boards or educational or training institutions** shall keep a record of the training provided.

Obligated **organizations** that are school boards or educational or training institutions shall keep a record of the training provided.

In [None]:
from spacy import displacy

In [None]:
sent = nlp('Obligated organizations that are school boards or educational or training institutions shall keep a record of the training provided.')

In [None]:
displacy.render(sent, style='dep', jupyter=True)

#### Solution

- Combine tags assigned by the Dependency Parser with Breadth First Search
- Navigate the dependency tree and identify the subset of tokens that are related to the subject by a parent-child relationship
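
The idea can be illustrated on a hand-built tree. The `Node` class and `bfs_subtree` below are hypothetical stand-ins: the notebook's own `bfs` helper lives in `utils` and works on spaCy tokens, and the toy tree is written by hand rather than produced by the parser.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a spaCy token: text, position, children."""

    def __init__(self, text, i, children=()):
        self.text = text
        self.i = i
        self.children = list(children)


def bfs_subtree(root):
    """Return root and all of its descendants, visited breadth first."""
    visited = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        visited.append(node)
        queue.extend(node.children)
    return visited


# Hand-built toy tree for "Obligated organizations that are school boards"
school = Node("school", 4)
boards = Node("boards", 5, [school])
that = Node("that", 2)
are = Node("are", 3, [that, boards])
obligated = Node("Obligated", 0)
subject = Node("organizations", 1, [obligated, are])

order = [n.text for n in bfs_subtree(subject)]
phrase = " ".join(n.text for n in sorted(bfs_subtree(subject), key=lambda n: n.i))
print(order)
print(phrase)
```

Starting the search at the subject's head noun collects the whole subject phrase; sorting by position afterwards restores the original word order.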

In [None]:
from utils import bfs, SUBJECTS

In [None]:
SUBJECTS

In [None]:
sent = nlp('Obligated organizations that are school boards or educational or training institutions shall keep a record of the training provided')

* Extract the verbs that express an obligation

In [None]:
verbs =  [token.head if token.tag_ == 'MD' else token for token in sent if token.lemma_ in BURDENS]

In [None]:
verbs

* Extract all tokens associated with one of the tags in `SUBJECTS`

In [None]:
all_tokens = [token for token in sent if token.dep_ in SUBJECTS]

In [None]:
bag_of_words = []

for token in all_tokens:
    bag_of_words += bfs(token)

In [None]:
' '.join([t.text for t in sorted(bag_of_words, key=lambda t: t.i)])

In [None]:
from utils import make_sentence

In [None]:
df = pd.DataFrame(
    list(
        aoda[aoda.tagged_as_burden][['index', 'section', 'text', 'part']]
            .apply(lambda row: make_sentence(row['index'], nlp(row['text']), row['section'], row['part']),
                   axis=1)
    )
)

In [None]:
df.head()

## Grouping subjects

**Objective**

- Find natural groupings of subjects by type, based on linguistic patterns

**Method**

- Normalize
- Project subjects into a semantic space ([GloVe](https://nlp.stanford.edu/projects/glove/))
- Reduce dimensionality
- KMeans clustering

#### Normalization / Lemmatization

* `STOPWORDS` combines the NLTK and spaCy stopword lists, extended with words that are frequent in AODA, e.g. *section* and *subsection*

In [None]:
from utils import STOPWORDS

In [None]:
df['s_norm'] = df.subj.apply(
    lambda subj: [t.lemma_ for t in nlp(subj) if t.is_alpha and t.lemma_ not in STOPWORDS]
)

#### Project into GloVe space

Reading **GloVe** vectors:

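The vectors file must start with a header line giving the shape of the vectors, i.e. `400000 50` for this file. If the header is missing, add it as the first line, then run on the command line: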

In [None]:
!python -m spacy init-model en /tmp/vectors --vectors-loc ../data/glove.6B.50d.txt.zip
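
If the header line is missing, it can be added with a few lines of Python. This is a sketch that assumes an uncompressed, space-separated text file; `add_glove_header` and the toy file names are hypothetical, and the demo uses a three-word file rather than the real GloVe download.

```python
import os
import tempfile


def add_glove_header(src_path, dst_path):
    """Copy a vectors text file, prepending a '<n_rows> <n_dims>' header."""
    with open(src_path, encoding="utf-8") as src:
        first = src.readline()
        if not first.strip():
            raise ValueError("vectors file is empty")
        n_dims = len(first.split()) - 1  # first field is the word itself
        n_rows = 1 + sum(1 for _ in src)  # first line plus the remaining ones
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(n_rows, n_dims))
        for line in src:
            dst.write(line)
    return n_rows, n_dims


# Demo on a tiny made-up file with 3 words and 4 dimensions
tmp_dir = tempfile.mkdtemp()
raw = os.path.join(tmp_dir, "toy_vectors.txt")
fixed = os.path.join(tmp_dir, "toy_vectors_with_header.txt")
with open(raw, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3 0.4\n")
    f.write("cat 0.5 0.6 0.7 0.8\n")
    f.write("box 0.9 1.0 1.1 1.2\n")

shape = add_glove_header(raw, fixed)
with open(fixed, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

The file is read twice so that the row count is known before anything is written, which avoids holding all vectors in memory.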

In [None]:
glove50 = spacy.load('/tmp/vectors')

In [None]:
import numpy as np

def glove_projection(tokens):
    vectors = [glove50(token).vector for token in tokens]
    return np.mean(vectors, axis=0) if tokens else np.zeros(50)
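
The averaging step can be checked by hand on a toy example. The 3-dimensional vectors below are invented and `mean_vector` is a hypothetical pure-Python equivalent of the `np.mean(..., axis=0)` call. Treating unknown words as zero vectors is an assumption of this sketch.

```python
# Invented 3-dimensional vectors standing in for the 50-dimensional GloVe ones
TOY_VECTORS = {
    "school": [1.0, 0.0, 2.0],
    "board": [3.0, 2.0, 0.0],
    "organization": [2.0, 4.0, 4.0],
}
DIM = 3


def mean_vector(tokens, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of the tokens, component by component."""
    if not tokens:
        return [0.0] * dim
    # Assumption of this sketch: unknown words contribute a zero vector
    rows = [vectors.get(t, [0.0] * dim) for t in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


subject_vector = mean_vector(["school", "board"])
with_unknown = mean_vector(["school", "xyzzy"])
empty_vector = mean_vector([])
print(subject_vector, with_unknown, empty_vector)
```

Averaging gives every subject a vector of the same length regardless of how many words it contains, which is what the later dimensionality reduction and clustering steps require.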

In [None]:
df['subj_vector'] = df.s_norm.map(glove_projection)

In [None]:
df = pd.concat([
    df,
    pd.DataFrame(df.s_norm.map(glove_projection).tolist(),
                 columns=['s{}'.format(i) for i in range(50)])
], axis=1)

In [None]:
df.shape

#### Dimensionality Reduction

In [None]:
from sklearn.manifold import SpectralEmbedding

In [None]:
n_dim = 2
embeddings = SpectralEmbedding(n_components=n_dim)

In [None]:
subjects = pd.DataFrame(
    embeddings.fit_transform(df[['s{}'.format(i) for i in range(50)]]),
    columns=['x{}'.format(i) for i in range(n_dim)])

#### KMeans

In [None]:
from sklearn.cluster import KMeans

#### Elbow method

In [None]:
inertia = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=0).fit(subjects)
    inertia.append(km.inertia_)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.lineplot(x=range(1, 10), y=inertia)
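
The quantity plotted above, `inertia_`, is the sum of squared distances from each sample to its closest cluster centre. The sketch below recomputes it by hand on four made-up points. The cluster assignments are fixed manually rather than found by KMeans, and `centroids` and `inertia` are hypothetical helpers.

```python
def centroids(points, labels, k):
    """Mean of the points in each cluster (assumes no cluster is empty)."""
    centers = []
    for cluster in range(k):
        members = [p for p, l in zip(points, labels) if l == cluster]
        centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centers


def inertia(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its centre."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Two well-separated pairs of made-up points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

labels_k1 = [0, 0, 0, 0]
inertia_k1 = inertia(points, labels_k1, centroids(points, labels_k1, 1))

labels_k2 = [0, 0, 1, 1]
inertia_k2 = inertia(points, labels_k2, centroids(points, labels_k2, 2))
print(inertia_k1, inertia_k2)
```

Going from one cluster to two cuts the inertia sharply here. In an elbow plot, the number of clusters is usually chosen at the point where such drops flatten out.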

In [None]:
n_groups = 3
km = KMeans(n_clusters=n_groups, random_state=0).fit(subjects)

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=subjects, x='x0', y='x1', hue=['g{}'.format(label) for label in km.labels_])

## Visualizing the groups

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

from utils import LemmaTokenizer, combined_plots

In [None]:
counter = CountVectorizer(tokenizer=LemmaTokenizer())
counter.fit(df.subj)

* transform sentences using CountVectorizer

In [None]:
subj = pd.DataFrame(
    counter.fit_transform(df['subj'].astype(str)).toarray(),
    columns=counter.get_feature_names()
)

* add group labels

In [None]:
subj['label'] = km.labels_

* aggregate frequencies at group level

In [None]:
groups = pd.melt(
    subj.groupby('label').sum().reset_index(),
    id_vars='label', var_name='word', value_name='count'
).groupby('label').apply(lambda group: group.sort_values(by='count', ascending=False)).reset_index(drop=True)

* rank by frequency

In [None]:
groups['rank'] = groups.groupby('label')['count'].rank(method='first', ascending=False)
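
The same aggregate-then-rank logic can be sketched with the standard library alone. The token lists and labels below are invented, and `top_words_per_group` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter, defaultdict


def top_words_per_group(token_lists, labels, n=2):
    """Sum word counts within each group and keep the n most frequent."""
    counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        counts[label].update(tokens)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Invented subjects (already tokenized) and cluster labels
toy_subjects = [
    ["organization", "school", "board"],
    ["organization", "employer"],
    ["minister"],
    ["minister", "director"],
]
toy_labels = [0, 0, 1, 1]

top = top_words_per_group(toy_subjects, toy_labels, n=2)
print(top)
```

Each group ends up with its own ranking, so a word that is frequent everywhere can rank first in several groups at once.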

In [None]:
plt.figure(figsize=(20,5))
sns.barplot(data=groups[groups['rank'] <= 5], x='word', y='count', hue='label')
plt.xticks(rotation=35)
plt.xlabel('Word')
plt.ylabel('Number of Occurrences')
plt.show();

#### Group 0

In [None]:
combined_plots(groups, label=0, max_rank=15, max_words=30)

Proportion of burdens in this group

In [None]:
(km.labels_ == 0).sum() / len(km.labels_)

Distribution across parts

In [None]:
pd.DataFrame(df[km.labels_ == 0]['part'].value_counts() / (km.labels_ == 0).sum())

#### Group 1

In [None]:
combined_plots(groups, label=1, max_rank=15, max_words=30)

Proportion of burdens in this group

In [None]:
(km.labels_ == 1).sum() / len(km.labels_)

Distribution across parts

In [None]:
pd.DataFrame(df[km.labels_ == 1]['part'].value_counts() / (km.labels_ == 1).sum())

#### Group 2

In [None]:
combined_plots(groups, label=2, max_rank=15, max_words=30)

Proportion of burdens in this group

In [None]:
(km.labels_ == 2).sum() / len(km.labels_)

Distribution across parts

In [None]:
pd.DataFrame(df[km.labels_ == 2]['part'].value_counts() / (km.labels_ == 2).sum())

## Topic Analysis

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
pipeline = Pipeline([
    ('counts', CountVectorizer(max_features=50, tokenizer=LemmaTokenizer())),
    ('lda', LatentDirichletAllocation(n_components=n_groups, learning_decay=0.5, max_iter=10, random_state=1))
])

In [None]:
pipeline.fit(df.subj)

In [None]:
vectorizer = pipeline.named_steps['counts']
# The vectorizer was already fitted by the pipeline, so only transform here
dtm = vectorizer.transform(df.subj)

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [None]:
p = pyLDAvis.sklearn.prepare(pipeline.named_steps['lda'], dtm, vectorizer)

In [None]:
pyLDAvis.display(p)

**Further analysis**

* Refine preprocessing, e.g. extend the list of stopwords to include words like *organizations*
* Analyse the objects of the sentences