<h1 align='center'>Analyzing Legislative Burden Upon Businesses Using NLP and ML</h1>

<table>
    <tr>
        <td>
            <img src=img/SerenaPeruzzo.jpg width=500/>
        </td>
        <td>
            <img src=img/DanielParton.jpg width=500/>
        </td>
    </tr>
</table>

## Agenda

* Background, objectives, challenges
* Overview of NLP methods
* Burden extraction
* Subject identification and analysis

In [1]:
import pandas
import spacy
from spacy import displacy

In [2]:
nlp = spacy.load('en')

In [3]:
parsed = nlp('The ramp must have a minimum clear width of 900 mm')

## Background

- Collaboration between Bardess and Government of Ontario
- Leverage NLP techniques to analyse patterns in the law in order to better understand it, extract information that is relevant to the public, and create links with other legal and non-legal documents


#### Focus for this workshop:

- Analyse the [Accessibility for Ontarians with Disabilities Act (AODA)](https://www.ontario.ca/laws/statute/05a11#BK11) and its [regulation](https://www.ontario.ca/laws/regulation/110191/v5):
    - Passed in 2005
    - Process for developing and enforcing accessibility standards in Ontario
- Identify and analyse burdens imposed upon businesses and government
    - A burden is a requirement or obligation that organizations have to comply with.
    - Physical, architectural, reporting, training etc.

<img src=img/beach_access.png width=900 align='center'/>

<img src=img/education.png width=900 align='center'/>

## Objectives

* Automate the extraction of knowledge from legislative texts; in the context of the AODA, this means the responsibilities set out by the law.
* Identify the entities that are affected by the legislation.
* Provide a framework for efficiently representing the burdens extracted and facilitate searching for relevant information.

## Challenges

* Language parsing and tokenization are made harder by the use of formatting, abbreviations, and references that are specific to legal documents.
* The lexicon is relatively limited and very specialized, but the interpretation is highly sensitive to the context and there are no industry-specific pre-trained models that incorporate semantic analysis.
* Information extraction is further complicated by the syntactic complexity of sentences, whose structure is often non-linear.
* Context sensitivity also affects supervised learning: training sets drawn from one context don’t generalize well to others, e.g. training on legislation from England in order to analyze legislation from the United States.


## spaCy

- [spacy.io](https://spacy.io/)
- Written in Python and Cython
- Pre-trained neural network models for English, German, Spanish, Portuguese, French, Italian and Dutch
- Multi-language models

### Comparison with NLTK
- Performance and production focused vs. teaching and research
- Fewer features
- Faster

Full comparison and details on accuracy and speed [here](https://spacy.io/usage/facts-figures#other-libraries)

### Tokenization

Given a character sequence and a defined document unit, tokenization is the task of segmenting the sequence into units called tokens.

A **token** is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
|The | ramp | must | have | a | minimum | clear | width | of | 900 | mm |
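
As a rough illustration of the idea, the sketch below splits the example sentence with a regular expression. It is a toy stand-in, not spaCy's tokenizer (which applies language-specific rules and exceptions); the names `TOKEN_PATTERN` and `tokenize` are made up for this example.

```python
import re

# Toy rule (an assumption for illustration): a token is either a run of
# word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens using the toy pattern above."""
    return TOKEN_PATTERN.findall(text)


sentence = "The ramp must have a minimum clear width of 900 mm"
tokens = tokenize(sentence)

# Print each token next to its index, mirroring the table above.
for i, token in enumerate(tokens):
    print(i, token)
```

Legal text quickly breaks such a simple rule (section references, abbreviations, nested numbering), which is one reason a dedicated tokenizer is preferable in practice.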

### Stop words

Stop words are usually the most common words in a language (e.g. "the", "of"); they are filtered out before processing.

There is no universal list; the choice depends largely on the use case.
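
The dependence on the use case can be sketched as follows. Both stop-word sets are hypothetical and chosen only for this example: a generic list might discard "must", whereas for burden extraction a modal verb such as "must" signals an obligation and is worth keeping.

```python
# Hypothetical stop-word lists, for illustration only.
GENERIC_STOP_WORDS = {"the", "a", "of", "must", "have"}
# For legal text we assume modal verbs should be kept.
LEGAL_STOP_WORDS = GENERIC_STOP_WORDS - {"must"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The ramp must have a minimum clear width of 900 mm".split()

generic = remove_stop_words(tokens, GENERIC_STOP_WORDS)
legal = remove_stop_words(tokens, LEGAL_STOP_WORDS)

print(generic)  # ['ramp', 'minimum', 'clear', 'width', '900', 'mm']
print(legal)    # ['ramp', 'must', 'minimum', 'clear', 'width', '900', 'mm']
```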

### Normalization

The process of standardizing tokens so that matches occur despite superficial differences in the character sequences of the tokens

* analyse $\to$ analyze
* anti-discriminatory $\to$ antidiscriminatory
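
A minimal sketch of the two mappings above, assuming a hand-written table of spelling variants (`SPELLING_VARIANTS` is hypothetical) and a blanket rule that drops hyphens. Real normalizers are more selective about which hyphens to remove.

```python
# Hypothetical lookup of spelling variants, for illustration only.
SPELLING_VARIANTS = {"analyse": "analyze"}


def normalize(token):
    """Lowercase, drop hyphens, then map known spelling variants."""
    token = token.lower().replace("-", "")
    return SPELLING_VARIANTS.get(token, token)


print(normalize("analyse"))              # analyze
print(normalize("anti-discriminatory"))  # antidiscriminatory
```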

### Lemmatization

The process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form

{ walks, walking, walked } $\to$ walk

**Lemmatizers** aim to identify the lemma of a word given its meaning and function in the sentence (using POS tagging), as opposed to a **stemmer**, which applies a set of rules that don't take the context into account.

* "better" $\to$ "good" would be missed by a stemmer
* "walking" $\to$ "walk" would be correctly mapped by both a stemmer and a lemmatizer

A lemmatizer would also attempt to correctly map "meeting" $\to$ "meet" or "meeting" depending on the context (verb vs. noun).
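
The contrast can be made concrete with a toy example. Neither function is a real algorithm: `stem` strips a few suffixes blindly, and `lemmatize` looks words up in a tiny hand-written table keyed by POS tag (`LEMMA_TABLE` is hypothetical). The point is only that the lemmatizer needs the POS tag while the stemmer ignores it.

```python
# Toy suffix-stripping rules; real stemmers (e.g. Porter) use many more.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, ignoring context entirely."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, POS tag).
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)


print(stem("walking"), lemmatize("walking", "VERB"))  # walk walk
print(stem("better"), lemmatize("better", "ADJ"))     # better good
print(stem("meeting"), lemmatize("meeting", "NOUN"))  # meet meeting
```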

### Named Entity Recognition (NER)

* Locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, dates, etc.

In [4]:
displacy.render(parsed, style='ent', jupyter=True)

### Parts Of Speech (POS) Tagging and Dependency Parsing

Parts Of Speech tagging: assigning each word a tag/category (Part Of Speech) shared by words with similar grammatical properties.

Dependency Parsing: assigning syntactic dependency labels that describe the relations between individual tokens. The result is a tree representing the grammatical relationships between the words in the sentence.

| i | token | POS | TAG | DEP |
| :-: | :-: | :-: | :-: | :-: |
| 0 | The | DET | DT | det |
| 1 | ramp | NOUN | NN | nsubj |
| 2 | must | VERB | MD | aux |
| 3 | have | VERB | VB | ROOT |
| 4 | a | DET | DT | det |
| 5 | minimum | ADJ | JJ | amod |
| 6 | clear | ADJ | JJ | amod |
| 7 | width | NOUN | NN | dobj |
| 8 | of | ADP | IN | prep |
| 9 | 900 | NUM | CD | nummod |
| 10 | mm | NOUN | NN | pobj |

In [5]:
displacy.render(parsed, style='dep', jupyter=True)

### A Little Grammar Review

`subject`: 
- The word or phrase that indicates "who" or "what" performs the action
- Entities that are responsible for complying with the burdens/obligations

**Every employer** shall provide individualized workplace emergency response information to employees who have a disability

`object`:
- The entity that is acted upon by the subject
- The burden specification

Every employer shall **provide individualized workplace emergency response information** to employees who have a disability
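
To show how the subject and the burden could be read off a dependency parse, here is a sketch that works on a hand-annotated parse of the example sentence. The dependency labels and head indices in `TOKENS` are assumptions written by hand for illustration; a real parser may label some tokens differently. The helper names `subtree` and `phrase` are also made up for this example.

```python
# (text, dependency label, index of head token); hand-annotated assumption.
TOKENS = [
    ("Every", "det", 1), ("employer", "nsubj", 3), ("shall", "aux", 3),
    ("provide", "ROOT", 3), ("individualized", "amod", 8),
    ("workplace", "compound", 8), ("emergency", "compound", 7),
    ("response", "compound", 8), ("information", "dobj", 3),
    ("to", "prep", 3), ("employees", "pobj", 9), ("who", "nsubj", 12),
    ("have", "relcl", 10), ("a", "det", 14), ("disability", "dobj", 12),
]

# Index of the main verb of the sentence.
ROOT = next(i for i, (_, dep, _) in enumerate(TOKENS) if dep == "ROOT")


def subtree(index):
    """Return the sorted indices of a token and all of its descendants."""
    indices = [index]
    for i, (_, _, head) in enumerate(TOKENS):
        if head == index and i != index:
            indices.extend(subtree(i))
    return sorted(indices)


def phrase(dep_label):
    """Text of the subtree attached to ROOT with the given label."""
    for i, (_, dep, head) in enumerate(TOKENS):
        # Only direct children of ROOT: this skips the nsubj/dobj of the
        # relative clause ("who have a disability").
        if dep == dep_label and head == ROOT:
            return " ".join(TOKENS[j][0] for j in subtree(i))
    return None


subject = phrase("nsubj")
burden = f"{TOKENS[ROOT][0]} {phrase('dobj')}"

print(subject)  # Every employer
print(burden)   # provide individualized workplace emergency response information
```

Restricting the search to direct children of the root verb is a design choice for this sketch: it keeps the subject and object of subordinate clauses from being mistaken for the main ones.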

In [6]:
import spacy

nlp = spacy.load('en')

![](img/pipeline.png)

<h2 align="center">[Notebook]</h2> 

<h2 align="center">Breadth First Search</h2> 

<img src=img/bfs.jpg width=400/>
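
The slide does not spell out what is being searched; the sketch below assumes the graph is the dependency tree of the example sentence and visits it level by level, starting from the root verb. The dependency labels come from the table in the POS tagging section, while the head indices are an assumption added by hand. The function name `bfs` is made up for this example.

```python
from collections import deque

# (text, dependency label, index of head token); heads are assumed.
TOKENS = [
    ("The", "det", 1), ("ramp", "nsubj", 3), ("must", "aux", 3),
    ("have", "ROOT", 3), ("a", "det", 7), ("minimum", "amod", 7),
    ("clear", "amod", 7), ("width", "dobj", 3), ("of", "prep", 7),
    ("900", "nummod", 10), ("mm", "pobj", 8),
]


def bfs(tokens):
    """Visit the dependency tree breadth-first; return (text, depth) pairs."""
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, (_, dep, head) in enumerate(tokens):
        if dep == "ROOT":
            root = i
        else:
            children[head].append(i)

    order = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()  # FIFO queue gives level order
        order.append((tokens[node][0], depth))
        for child in children[node]:
            queue.append((child, depth + 1))
    return order


for text, depth in bfs(TOKENS):
    print("  " * depth + text)
```

Tokens close to the root (subject, modal, object head) are reached first, and modifiers such as "900" are reached last.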

## GloVe: Global Vectors for Word Representation

* Corpus of 6B tokens (Wikipedia 2014 + Gigaword 5)
* 400k vocabulary
* **50d**, 100d, 200d, & 300d vectors
* Trained on the matrix of word-word co-occurrence counts


<table>
    <tr>
        <td>
            <img src=img/man_woman.jpg width=500/>
        </td>
        <td>
            <img src=img/comparative_superlative.jpg width=500/>
        </td>
    </tr>
</table>
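
To give a feel for the input GloVe is trained on, the sketch below builds word-word co-occurrence counts from a two-sentence toy corpus. It is a simplification: it uses raw counts within a symmetric window, whereas GloVe itself down-weights pairs that are further apart. The function name `cooccurrence_counts`, the window size and the corpus are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within
    `window` positions of each other, in either direction."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    "the ramp must have a minimum width",
    "the door must have a minimum width",
]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("must", "have")])  # 2
print(counts[("ramp", "must")])  # 1
print(counts[("ramp", "door")])  # 0
```

In this toy corpus "ramp" and "door" never co-occur, yet they share the same neighbours ("the", "must"); that shared context is the kind of signal that places similar words near each other in the embedding space.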

## Spectral Embeddings

* Nonlinear embeddings
* Assume the data points lie on an (unknown) manifold
* Model the manifold with a graph, where points are connected by an edge if they are close to each other on the manifold 
* Laplacian Eigenmaps Algorithm: find a low dimensional representation of the data using a spectral decomposition of the Graph Laplacian
* Points that are close to each other on the manifold are mapped close to each other in the low dimensional space
* Local distances are preserved
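
The Laplacian Eigenmaps step can be written compactly. The notation below is introduced here for illustration and follows the usual textbook formulation: $W$ is the weighted adjacency matrix of the neighbourhood graph, $D$ the diagonal degree matrix, $L$ the Graph Laplacian, and $y_i$ the low-dimensional image of data point $i$.

$$
L = D - W, \qquad D_{ii} = \sum_j W_{ij}
$$

$$
\min_{Y} \; \sum_{i,j} W_{ij} \, \lVert y_i - y_j \rVert^2 \quad \text{subject to} \quad Y^\top D \, Y = I
$$

$$
L v = \lambda D v
$$

The objective penalizes mapping strongly connected points far apart, which is why local distances are preserved. Its solution is given by the eigenvectors $v$ of the generalized eigenproblem with the smallest non-zero eigenvalues $\lambda$, which form the columns of $Y$.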

## KMeans Clustering

* Partition n observations into k clusters
* Each observation is assigned to the cluster with the nearest mean
* The means/centroids of the clusters serve as prototypes for the groups

<table>
    <tr>
        <td>
            <img src=img/kmeans1.png width=250/>
        </td>
        <td>
            <img src=img/kmeans2.png width=250/>
        </td>
        <td>
            <img src=img/kmeans3.png width=250/>
        </td>
        <td>
            <img src=img/kmeans4.png width=250/>
        </td>
    </tr>
</table>
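
The alternation between assigning points and updating means can be sketched in plain Python. This is a minimal version of the standard iteration, not a production implementation: the initial centroids are passed in explicitly (real implementations choose them randomly or with a seeding heuristic), and the function name `kmeans` and the sample points are made up for this example.

```python
from math import dist


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[k])  # keep empty clusters in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])

print(labels)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```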


## Topic Analysis

* Bag-of-words assumption: the order of words within a document is ignored
* K topics, each a probability distribution over a collection of words

### Generative model

For each document:
1. Select the number of words
2. Draw a distribution of topics
3. For each word in the document:
    1. Draw a specific topic
    2. Draw a word from a multinomial probability conditioned on the topic
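
The steps above can be simulated directly. The sketch assumes the topic mixture of each document is drawn from a Dirichlet distribution (as in latent Dirichlet allocation) and that the document length is drawn uniformly from a small range; both choices, as well as the two toy topics in `TOPICS` and the function names, are assumptions for illustration.

```python
import random

# Two hypothetical topics: each is a distribution over a few words.
TOPICS = [
    {"ramp": 0.5, "width": 0.3, "door": 0.2},
    {"training": 0.6, "employee": 0.4},
]


def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, rng, min_words=5, max_words=10):
    # 1. Select the number of words.
    n_words = rng.randint(min_words, max_words)
    # 2. Draw a distribution of topics for this document.
    theta = dirichlet(alpha, rng)
    words = []
    # 3. For each word: draw a topic, then draw a word from that topic.
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words


rng = random.Random(0)
document = generate_document(TOPICS, alpha=[0.5, 0.5], rng=rng)
print(document)
```

Fitting a topic model runs this process in reverse: given only the documents, it infers the topics and the per-document topic mixtures that most plausibly generated them.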

## Summary

* Automate the extraction of burdens/obligations
* Identify entities that are responsible for compliance
* Organize burdens into homogeneous groups with respect to their impact on various entities
* Identify ambiguities in the legislation
* Generic framework