# Scraping tribunal orders

In [13]:
import requests

url = "https://forms.justice.govt.nz/solr/TTV2/select?facet=true&rows=20&fl=*%2C%20score&hl=true&hl.fl=*&hl.simple.pre=%3Cspan%20class%3D%22highlight%22%3E&hl.simple.post=%3C%2Fspan%3E&hl.fragsize=10000&hl.requireFieldMatch=true&hl.usePhraseHighlighter=true&facet.limit=-1&facet.mincount=-1&sort=decisionDateIndex_l%20desc&json.nl=map&q=publishedDate_dt%3A%5B2023-01-01T23%3A59%3A59.999Z%20TO%20*%5D&fq=jurisdictionCode_s%3ATT%20AND%20publishedDate_dt%3A%5BNOW-3YEARS%20TO%20NOW%5D&wt=json&json.wrf=jQuery112407980557764722094_1698392158615&_=1698392158617"

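# A timeout stops the call hanging indefinitely; raise_for_status surfaces HTTP errors
r = requests.get(url, timeout=30)
r.raise_for_status()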

In [14]:
r.text
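
Because the query string carries a `json.wrf` parameter, Solr wraps the JSON payload in a JavaScript callback, so `r.text` is JSONP rather than plain JSON. The following is a minimal sketch of unwrapping it, assuming the body has the shape `callback({...})`. The helper name `parse_jsonp` and the sample payload are hypothetical and are not part of the original notebook.

```python
import json
import re


def parse_jsonp(body: str) -> dict:
    """Strip a JSONP wrapper such as callback({...}) and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", body, flags=re.DOTALL)
    # Fall back to treating the body as plain JSON when no wrapper is found
    payload = match.group(1) if match else body
    return json.loads(payload)


# Hypothetical sample shaped like a Solr select response
sample = 'jQuery1124_1698({"response": {"numFound": 2, "docs": [{"id": "a"}, {"id": "b"}]}})'
data = parse_jsonp(sample)
print(data["response"]["numFound"])
```

With the real response this would be called as `parse_jsonp(r.text)`. Removing `json.wrf` from the URL should make the wrapper disappear, in which case `r.json()` is enough.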



# Extracting text from a PDF

In [2]:
from pypdf import PdfReader
import os

directory = 'C:\\Users\\chris\\tenancy_tribunal\\PDFs'


In [67]:
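parts = []

def visitor_body(text, cm, tm, font_dict, font_size):
    # cm is the current transformation matrix; cm[5] is its vertical translation
    y = cm[5]

    # Keep only text in the page body, skipping the header and footer bands
    if 100 < y < 800:
        parts.append(text)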

In [72]:
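tribunal_orders = []

for file in os.listdir(directory):
    # Skip anything in the folder that is not a PDF
    if not file.lower().endswith('.pdf'):
        continue

    reader = PdfReader(os.path.join(directory, file))

    # Only the first two pages of each order are read
    for page in reader.pages[:2]:
        page.extract_text(visitor_text=visitor_body)

    text_body = "".join(parts)

    tribunal_orders.append(text_body)

    # Empty the shared buffer before the next file
    parts.clear()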

In [77]:
print(tribunal_orders[0])

[2023] NZTT 4438983 
TENANCY TRIBUNAL - [Event location suppressed]
APPLICANT: [The applicant/s]
 Tenant
RESPONDENT: Barfoot & Thompson Limited - Meadowlands As Agent For 
Cheng Hang
 Landlord
TENANCY ADDRESS: [Tenancy address suppressed]
ORDER
1. An application for suppression has been made in this case, and the Tribunal 
orders suppression of the tenants’ names and identifying details.
2. The tenants are authorised to install a cat door at the premises at their cost. At 
the end of the tenancy (in the absence of an agreement to leave the cat door in 
place) they must remove the cat door and reglaze the glass pane where it was 
installed. The reglazing work must be done by a professional window repairer.
3. The application for compensation for cleaning costs is withdrawn.
4. Barfoot & Thompson Limited – Meadowlands as agent for Cheng Hang must pay 
[The tenant/s].Reasons:
1. Both parties attended the hearing. Mr Xiang represented the landlord.
2. The tenants have applied for an order 

# Summarising

#### Use pre-built model to summarise

Pre-trained transformer models can be downloaded to summarise a text. The Hugging Face `pipeline` interface wraps `facebook/bart-large-cnn`, a BART model fine-tuned on CNN/Daily Mail news articles, so that text can be passed in and a summary returned.

In [1]:
from transformers import pipeline



In [2]:
summariser = pipeline('summarization', model='facebook/bart-large-cnn')

In [8]:
from pypdf import PdfReader

def read_pdf(path):
    # Concatenate the extracted text of every page in the PDF
    reader = PdfReader(path)
    return ''.join(page.extract_text() for page in reader.pages)

# Forward slashes avoid invalid escape sequences such as '\8' in Windows paths
example1 = read_pdf('PDFs/8476027-Tribunal_Order_Redacted.pdf')
example2 = read_pdf('PDFs/8496512-Tenancy_Tribunal_Order.pdf')
example3 = read_pdf('PDFs/8496758-Tenancy_Tribunal_Order.pdf')

In [4]:
print(summariser(example1, max_length=130, min_length=30, do_sample=False, truncation=True))

[{'summary_text': 'Tribunal orders suppression of the tenants’ names and identifying details. The tenants are authorised to install a cat door at the premises at their cost. At the end of the tenancy they must remove the cat door and reglaze the glass pane where it was installed. The application for compensation for cleaning costs is withdrawn.'}]


In [5]:
print(summariser(example2, max_length=130, min_length=30, do_sample=False, truncation=True))

[{'summary_text': 'The Landlord applied for compensation following the end of the tenancy. The Tenant has cross-applied for compensation also. The Landlord claims the Tenant did not leave the premises reasonably clean and tidy and did not remove all rubbish.'}]


In [6]:
print(summariser(example3, max_length=130, min_length=30, do_sample=False, truncation=True))

[{'summary_text': 'No application for suppression has been made in this case. Bond Services is to release from the Bond the sum of $253.00 to Edge Real Estate Limited Michelle Conquer immediately. Bond services is to pay the balance of the Bond, $1,907.00, to Chelsea Pearl Collins-Kemp immediately.'}]
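
With `truncation=True` the input is cut at the model's maximum length (around 1,024 tokens for BART, as far as I know), so the summaries above only reflect the opening part of a long order. One possible workaround is to summarise the text in chunks and join the results. The sketch below assumes word counts are an acceptable proxy for token counts; the names `chunk_words`, `summarise_long` and `first_words`, and the 400-word chunk size, are hypothetical.

```python
from typing import Callable


def chunk_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarise_long(text: str, summarise: Callable[[str], str], max_words: int = 400) -> str:
    """Summarise each chunk separately and join the partial summaries."""
    return " ".join(summarise(chunk) for chunk in chunk_words(text, max_words))


# Stand-in summariser so the sketch runs without downloading a model:
# it simply keeps the first five words of each chunk.
def first_words(chunk: str) -> str:
    return " ".join(chunk.split()[:5])


demo = " ".join(f"word{i}" for i in range(1000))
print(summarise_long(demo, first_words))
```

With the pipeline above, the callable could be `lambda t: summariser(t, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']`. Chunk boundaries fall mid-sentence here, so splitting on paragraph or numbered-clause boundaries may give cleaner results.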


# Information Extraction

Techniques that can be used for information extraction:
- TF-IDF for key words
- Named entity recognition for getting named entities
- Text classification to identify which part of an order each section relates to (e.g. procedural, facts, references, decisions)
- Topic modelling to identify topics
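
Named entity recognition needs a trained model, so as a lighter first step the fixed header layout of an order can be read with regular expressions. This is a plain pattern-matching sketch, not NER. It assumes every order follows the layout of the single example printed earlier (citation line, `APPLICANT:`, `RESPONDENT:`, a role line of `Tenant` or `Landlord`, and `TENANCY ADDRESS:`); the function name `parse_header` is hypothetical.

```python
import re

# Header text copied from the extracted order shown earlier
header = """[2023] NZTT 4438983
TENANCY TRIBUNAL - [Event location suppressed]
APPLICANT: [The applicant/s]
 Tenant
RESPONDENT: Barfoot & Thompson Limited - Meadowlands As Agent For
Cheng Hang
 Landlord
TENANCY ADDRESS: [Tenancy address suppressed]
ORDER
"""


def parse_header(text: str) -> dict:
    """Pull the citation, parties and address out of an order header."""
    fields = {}
    citation = re.search(r"\[(\d{4})\]\s+NZTT\s+(\d+)", text)
    if citation:
        fields["year"] = int(citation.group(1))
        fields["number"] = citation.group(2)
    for party in ("APPLICANT", "RESPONDENT"):
        # A party name may wrap over several lines; it ends at the role line
        m = re.search(rf"{party}:\s*(.*?)\n\s*(Tenant|Landlord)\b", text, flags=re.DOTALL)
        if m:
            fields[party.lower()] = {"name": " ".join(m.group(1).split()), "role": m.group(2)}
    address = re.search(r"TENANCY ADDRESS:\s*(.*)", text)
    if address:
        fields["address"] = address.group(1).strip()
    return fields


print(parse_header(header))
```

Orders with several applicants or a different role wording would need extra patterns, which is where a trained NER model may become worthwhile.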

#### Extracting information via TF-IDF

TF-IDF (term frequency-inverse document frequency) scores a term highly when it occurs often in one document but in few of the other documents in the corpus. Summing the scores across documents gives the terms that carry the most weight over the whole corpus.
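
To make the weighting concrete, here is a small pure-Python sketch of the calculation on a toy corpus. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which mirrors what I understand to be scikit-learn's defaults. The function name `tfidf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term in each tokenised document with smoothed, L2-normalised TF-IDF."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores.append({t: v / norm for t, v in raw.items()})
    return scores


# Toy corpus (hypothetical): "tenant" appears everywhere, so it is down-weighted
docs = [
    ["tenant", "cat", "door"],
    ["tenant", "landlord", "rubbish"],
    ["tenant", "landlord", "bond"],
]
scores = tfidf(docs)
for doc_scores in scores:
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Within each document the terms unique to it ("cat", "rubbish", "bond") rank above "landlord", which in turn ranks above "tenant".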

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.util import ngrams
from collections import Counter

In [4]:
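nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Tokenise, keep alphabetic tokens only, lowercase, then drop stop words
    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens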

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chris\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
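def get_tfidf_scores(corpus):
    # token_pattern=None avoids the warning that the default pattern is unused
    # when a custom tokenizer is supplied
    tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocess_text, token_pattern=None)
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
    feature_names = tfidf_vectorizer.get_feature_names_out()
    return tfidf_matrix, feature_names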

In [5]:
def extract_key_terms(tfidf_matrix, feature_names, top_n=10):
    term_scores = [(term, score) for term, score in zip(feature_names, tfidf_matrix.sum(axis=0).tolist()[0])]
    term_scores = sorted(term_scores, key=lambda x: x[1], reverse=True)
    return term_scores[:top_n]

In [6]:
def extract_key_phrases(text, n=2, top_n=10):
    tokens = preprocess_text(text)
    phrases = ngrams(tokens, n)
    phrase_counter = Counter(phrases)
    top_phrases = phrase_counter.most_common(top_n)
    return top_phrases

In [13]:
corpus = [example1, example2, example3]

tfidf_matrix, feature_names = get_tfidf_scores(corpus)

key_terms = extract_key_terms(tfidf_matrix, feature_names)
print("Key Terms:", key_terms)

text = "Your legal document text here cat."
key_phrases = extract_key_phrases(text, n=2)
print("Key Phrases:", key_phrases)

Key Terms: [('tenant', 0.8483324978583497), ('tenancy', 0.7298980073455188), ('landlord', 0.6086589845916258), ('appeal', 0.4950975237144872), ('premises', 0.45704137526056177), ('order', 0.42946243160647124), ('damage', 0.4172754067817982), ('tenants', 0.3832309793806731), ('cat', 0.3276064527021754), ('must', 0.32416673796570594)]
Key Phrases: [(('legal', 'document'), 1), (('document', 'text'), 1), (('text', 'cat'), 1)]


#### Topic Modelling

In [1]:
# Import necessary libraries
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Sample documents
documents = [
    "Topic modeling is a technique in natural language processing.",
    "It helps us discover hidden topics in a collection of documents.",
    "Latent Dirichlet Allocation (LDA) is a popular algorithm for this task.",
    "Gensim is a library for topic modeling in Python.",
    "We will use Gensim to perform topic modeling in this example."
]

# Preprocess the documents
stop_words = set(stopwords.words('english'))
processed_documents = []
for document in documents:
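    # Tokenize, then drop punctuation and stopwords. Tokens are not lowercased,
    # so capitalised stopwords such as "It" pass the filter and "Topic" and
    # "topic" count as separate words.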
    words = [word for word in word_tokenize(document) if word.isalpha() and word not in stop_words]
    processed_documents.append(words)

# Create a dictionary and a corpus
dictionary = corpora.Dictionary(processed_documents)
corpus = [dictionary.doc2bow(doc) for doc in processed_documents]

# Train the LDA model
num_topics = 2  # You can change the number of topics
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

# Print the topics
for topic in lda_model.print_topics():
    print(topic)

# Let's also assign topics to documents
for i, doc in enumerate(processed_documents):
    doc_bow = dictionary.doc2bow(doc)
    topics = lda_model[doc_bow]
    print(f"Document {i + 1} Topics: {topics}")

# Example: To get the most probable topic for a specific document
doc_to_check = "Topic modeling can be useful in text analysis."
doc_to_check = [word for word in word_tokenize(doc_to_check) if word.isalpha() and word not in stop_words]
doc_bow = dictionary.doc2bow(doc_to_check)
topics = lda_model[doc_bow]
print(f"Topics for the document to check: {topics}")

(0, '0.078*"modeling" + 0.046*"Allocation" + 0.046*"Latent" + 0.046*"algorithm" + 0.046*"LDA" + 0.046*"popular" + 0.046*"Dirichlet" + 0.046*"task" + 0.046*"natural" + 0.046*"Topic"')
(1, '0.052*"topic" + 0.052*"Gensim" + 0.051*"collection" + 0.051*"It" + 0.051*"us" + 0.051*"documents" + 0.051*"helps" + 0.051*"topics" + 0.051*"hidden" + 0.051*"discover"')
Document 1 Topics: [(0, 0.92336446), (1, 0.07663553)]
Document 2 Topics: [(0, 0.057706654), (1, 0.9422933)]
Document 3 Topics: [(0, 0.9345122), (1, 0.0654878)]
Document 4 Topics: [(0, 0.88859737), (1, 0.11140264)]
Document 5 Topics: [(0, 0.07989222), (1, 0.9201078)]
Topics for the document to check: [(0, 0.8118212), (1, 0.18817878)]


#### Text Classification

Identify parts of an order that relate to:
- Facts of the matter
- Decision
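
As a rule-based first pass, before training any classifier, an order can be split on the markers visible in the extracted text earlier: a line reading `ORDER` and the label `Reasons:`. The sketch below assumes those markers appear once each and in that order. The function name `split_order` is hypothetical, and the sample is abridged from the order printed above. The search for `Reasons:` is not anchored to the start of a line because in the extracted text it is fused to the preceding sentence.

```python
import re


def split_order(text: str) -> dict:
    """Split an order into header, decision (ORDER) and reasons sections."""
    order = re.search(r"^ORDER\s*$", text, flags=re.MULTILINE)
    reasons = re.search(r"Reasons:", text)
    reasons_start = reasons.start() if reasons else len(text)
    header_end = order.start() if order else reasons_start
    decision_start = order.end() if order else reasons_start
    return {
        "header": text[:header_end].strip(),
        "decision": text[decision_start:reasons_start].strip(),
        "reasons": text[reasons.end():].strip() if reasons else "",
    }


# Abridged from the extracted order shown earlier
sample = """TENANCY ADDRESS: [Tenancy address suppressed]
ORDER
3. The application for compensation for cleaning costs is withdrawn.
4. Barfoot & Thompson Limited – Meadowlands as agent for Cheng Hang must pay
[The tenant/s].Reasons:
1. Both parties attended the hearing. Mr Xiang represented the landlord.
"""

sections = split_order(sample)
for name, body in sections.items():
    print(name.upper(), "->", body[:60])
```

The decision maps directly onto the `ORDER` section, while the facts of the matter sit somewhere inside the reasons, so that section would still need a finer-grained classifier.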