# Extract German Text from Court Rulings

This notebook reads German court rulings as HTML from the data source linked below and converts the HTML into plain text.

## Set Up PySpark Runtime
The notebook then stores the 100k legal texts as a zipped JSON file.
 
Data Source: http://openlegaldata.io/research/2019/02/19/court-decision-dataset.html

In [None]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-08-15 10:12:15--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-08-15 10:12:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s               setup Colab for PySpark 3.1.2 and Spark NLP 3.2.1

2021-08-15 10:12:15 (1.54 

In [None]:
from pyspark.ml import PipelineModel
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp

spark = sparknlp.start()
spark

## Get Text Corpus from Online Resource, Uncompress and Review it

### Get Data from World Wide Web (WWW)

In [None]:
# Open Legal Data releases dataset of 100,000 German court decisions and 444,000 citations
# http://openlegaldata.io/research/2019/02/19/court-decision-dataset.html
#
!wget https://static.openlegaldata.io/dumps/de/2019-02-19_oldp_cases.json.gz

--2021-08-15 10:14:41--  https://static.openlegaldata.io/dumps/de/2019-02-19_oldp_cases.json.gz
Resolving static.openlegaldata.io (static.openlegaldata.io)... 176.9.1.19
Connecting to static.openlegaldata.io (static.openlegaldata.io)|176.9.1.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 808973127 (771M) [application/octet-stream]
Saving to: ‘2019-02-19_oldp_cases.json.gz’


2021-08-15 10:18:37 (3.29 MB/s) - ‘2019-02-19_oldp_cases.json.gz’ saved [808973127/808973127]



In [None]:
!ls -lha

total 986M
drwxr-xr-x  1 root root 4.0K Aug 15 10:14 .
drwxr-xr-x  1 root root 4.0K Aug 15 10:11 ..
-rw-r--r--  1 root root 772M Feb 19  2019 2019-02-19_oldp_cases.json.gz
drwxr-xr-x  4 root root 4.0K Jul 16 13:19 .config
drwxr-xr-x  1 root root 4.0K Jul 16 13:20 sample_data
drwxr-xr-x 13 1000 1000 4.0K May 24 05:00 spark-3.1.2-bin-hadoop2.7
-rw-r--r--  1 root root 215M May 24 05:01 spark-3.1.2-bin-hadoop2.7.tgz


### Uncompress It

In [None]:
!gzip -d 2019-02-19_oldp_cases.json.gz
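
The uncompressed dump is several gigabytes, so it can help to peek at the first few records before unpacking. The following is a minimal sketch, assuming the dump holds one JSON object per line (which is what `spark.read.json` expects by default). The helper `peek_records` and the tiny demo file are hypothetical stand-ins, not part of the original notebook.

```python
import gzip
import json
import os
import tempfile


def peek_records(path, n=3):
    """Return the first n JSON records from a gzip-compressed JSON Lines file."""
    records = []
    # Text mode decompresses on the fly, so the whole file is never unpacked to disk
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if len(records) >= n:
                break
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo on a small stand-in file (hypothetical data, not the real dump)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_cases.json.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8") as handle:
    for i in range(5):
        handle.write(json.dumps({"id": i, "content": "<p>Fall %d</p>" % i}) + "\n")

first = peek_records(demo_path, n=2)
print([record["id"] for record in first])
```

For the real file, the same call would be `peek_records("2019-02-19_oldp_cases.json.gz")`, run before the `gzip -d` step removes the archive.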

In [None]:
!ls -lha

### Review HTML Text Input with Spark NLP

In [None]:
df = spark.read.json("2019-02-19_oldp_cases.json")
df.printSchema()
df.show()

root
 |-- content: string (nullable = true)
 |-- court: struct (nullable = true)
 |    |-- city: long (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- jurisdiction: string (nullable = true)
 |    |-- level_of_appeal: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- slug: string (nullable = true)
 |    |-- state: long (nullable = true)
 |-- created_date: string (nullable = true)
 |-- date: string (nullable = true)
 |-- ecli: string (nullable = true)
 |-- file_number: string (nullable = true)
 |-- id: long (nullable = true)
 |-- slug: string (nullable = true)
 |-- type: string (nullable = true)
 |-- updated_date: string (nullable = true)

+--------------------+--------------------+--------------------+----------+--------------------+--------------+------+--------------------+--------------------+--------------------+
|             content|               court|        created_date|      date|                ecli|   file_number|    id|            

In [None]:
!head -10 2019-02-19_oldp_cases.json

{"id": 188482, "slug": "olgmuen-2019-02-07-34-ar-11418", "court": {"id": 277, "name": "Oberlandesgericht M\u00fcnchen", "slug": "olgmuen", "city": null, "state": 4, "jurisdiction": null, "level_of_appeal": "Oberlandesgericht"}, "file_number": "34 AR 114/18", "date": "2019-02-07", "created_date": "2019-02-11T11:04:18Z", "updated_date": "2019-02-13T12:21:02Z", "type": "Beschluss", "ecli": "", "content": "<h2>Tenor</h2>\n\n<div>\n\t\t\t\t\t\n\t\t\t\t\t<p>Als funktional zust&#228;ndig wird die allgemeine Zivilkammer bestimmt.</p>\n\t\t\t\t</div>\n\t\t\t\n<h2>Gr\u00fcnde</h2>\n\n<div>\n\t\t\t\t\t\n\t\t\t\t\t<p>I.</p>\n\t\t\t\t\t<p><rd nr=\"1\"/>Die in M&#252;nchen ans&#228;ssige Kl&#228;gerin, ein Versicherungsunternehmen, begehrt nach Abgabe an das im Mahnbescheid als Streitgericht bezeichnete Landgericht Augsburg mit Anspruchsbegr&#252;ndung vom 12.2.2018 von dem im Bezirk des Landgerichts Augsburg wohnhaften Beklagten, einem Versicherungsvermittler, R&#252;ckzahlung von Provisionsvorsch&
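
The raw record above mixes two encodings: JSON `\u` escapes (as in the court name) and HTML numeric character references (as in `zust&#228;ndig`). The sketch below shows how each layer decodes, using only the standard library. The sample line is abridged from the first record shown above; its `content` field is shortened for illustration.

```python
import html
import json

# Abridged from the first record of the dump; the content field is shortened
sample_line = (
    '{"id": 188482, "slug": "olgmuen-2019-02-07-34-ar-11418", '
    '"court": {"id": 277, "name": "Oberlandesgericht M\\u00fcnchen", '
    '"slug": "olgmuen", "city": null, "state": 4}, '
    '"file_number": "34 AR 114/18", "date": "2019-02-07", "type": "Beschluss", '
    '"content": "<h2>Tenor</h2><p>Als funktional zust&#228;ndig wird die '
    'allgemeine Zivilkammer bestimmt.</p>"}'
)

record = json.loads(sample_line)        # resolves the JSON \u escapes
court_name = record["court"]["name"]    # nested struct, as in the Spark schema
decoded = html.unescape(record["content"])  # resolves &#228; but keeps the tags

print(court_name)
print(decoded)
```

Note that `html.unescape` only resolves character references; the tags themselves still have to be removed in a separate step.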

In [None]:
df.select('content').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Remove HTML and Convert to Plain Text

In [None]:
from bs4 import BeautifulSoup

# https://stackoverflow.com/questions/14694482/converting-html-to-text-with-python
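def convertText2HTML(html):
    """Convert an HTML string to plain text (despite the name, the direction is HTML to text)."""
    if html is None:  # the content column is nullable
        return None
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            # Pad with spaces so tokens from adjacent elements do not merge
            text += ' ' + e.strip() + ' '
        elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text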
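
To illustrate what this kind of conversion produces, here is a minimal sketch of the same idea using only the standard library's `html.parser` instead of BeautifulSoup. It is not the function above: the names `PlainTextExtractor` and `html_to_text` are hypothetical, and whitespace handling differs slightly (empty text nodes are skipped). The sample markup is abridged from the record shown earlier.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text nodes and insert line breaks at block-level tags."""

    BLOCK_TAGS = {"br", "p", "h1", "h2", "h3", "h4", "tr", "th"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &#228; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            # Trailing space keeps tokens from adjacent elements apart
            self.parts.append(stripped + " ")


def html_to_text(markup):
    """Return the plain text of an HTML fragment, or None for missing input."""
    if markup is None:
        return None
    parser = PlainTextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


sample = (
    "<h2>Tenor</h2>\n\n<div>\n\t<p>Als funktional zust&#228;ndig wird die "
    "allgemeine Zivilkammer bestimmt.</p>\n</div>"
)
plain = html_to_text(sample)
print(plain)
```

Unknown tags such as `<rd nr="1"/>` produce no output in this sketch, since only text nodes and the listed block tags are handled.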


In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import *

convertText2HTMLUDF = F.udf(convertText2HTML, StringType())

dfText = df.withColumn("text_Content", convertText2HTMLUDF(F.col("content"))) \
  .select("id","text_Content")
  
dfText.show(truncate=False)

+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Store the Resulting JSON File in Google Drive

### Write Result to VM Storage and Compress it

In [None]:
fileName = "2019-02-19_oldp_cases_textout_1.json"

In [None]:
%%time
#!rm -r -f {fileName}
dfText.coalesce(1).write.format('json').save(fileName)

In [None]:
!ls {fileName} -lha

total 2.5G
drwxr-xr-x 2 root root 4.0K Aug 15 10:47 .
drwxr-xr-x 1 root root 4.0K Aug 15 10:20 ..
-rw-r--r-- 1 root root 2.4G Aug 15 10:47 part-00000-34f1a926-888b-4ea2-863f-b7b20f26b50b-c000.json
-rw-r--r-- 1 root root  20M Aug 15 10:47 .part-00000-34f1a926-888b-4ea2-863f-b7b20f26b50b-c000.json.crc
-rw-r--r-- 1 root root    0 Aug 15 10:47 _SUCCESS
-rw-r--r-- 1 root root    8 Aug 15 10:47 ._SUCCESS.crc


In [None]:
!mv {fileName + '/part-00000-34f1a926-888b-4ea2-863f-b7b20f26b50b-c000.json'} {"./File_" + fileName} 
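
The part file name contains a UUID that changes on every run, so the hard-coded name in the `mv` command has to be edited each time. A possible alternative, sketched here under the assumption that `coalesce(1)` left exactly one `part-*.json` file in the output directory, is to locate the file with `glob`. The helper `promote_single_part` and the demo directory are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def promote_single_part(output_dir, target_path):
    """Move the only part-*.json file out of a Spark output directory."""
    # glob skips dot-files, so the .crc checksum files are not matched
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*.json")))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.move(parts[0], target_path)
    return target_path


# Demo with a stand-in output directory (hypothetical names and data)
work = tempfile.mkdtemp()
out_dir = os.path.join(work, "demo_textout.json")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000-demo-c000.json"), "w", encoding="utf-8") as f:
    f.write('{"id": 1, "text_Content": "Tenor"}\n')
open(os.path.join(out_dir, "_SUCCESS"), "w").close()

moved = promote_single_part(out_dir, os.path.join(work, "File_demo_textout.json"))
print(os.path.basename(moved))
```

In the notebook, the equivalent call would be `promote_single_part(fileName, "./File_" + fileName)`.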

In [None]:
!ls . -lha

total 6.0G
drwxr-xr-x  1 root root 4.0K Aug 15 10:57 .
drwxr-xr-x  1 root root 4.0K Aug 15 10:11 ..
-rw-r--r--  1 root root 3.4G Feb 19  2019 2019-02-19_oldp_cases.json
drwxr-xr-x  2 root root 4.0K Aug 15 10:57 2019-02-19_oldp_cases_textout_1.json
drwxr-xr-x  4 root root 4.0K Jul 16 13:19 .config
-rw-r--r--  1 root root 2.4G Aug 15 10:47 File_2019-02-19_oldp_cases_textout_1.json
drwxr-xr-x  1 root root 4.0K Jul 16 13:20 sample_data
drwxr-xr-x 13 1000 1000 4.0K May 24 05:00 spark-3.1.2-bin-hadoop2.7
-rw-r--r--  1 root root 215M May 24 05:01 spark-3.1.2-bin-hadoop2.7.tgz


In [None]:
#!rm -f {fileName + '.zip'}

In [None]:
%%time
!zip {fileName + '.zip'} {"./File_" + fileName}

  adding: File_2019-02-19_oldp_cases_textout_1.json (deflated 72%)
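
The compression step can also be done from Python with the standard `zipfile` module, which avoids shell quoting of the file name. This is a minimal sketch, not a replacement the notebook depends on; `zip_single_file` and the demo data are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_single_file(source_path, archive_path):
    """Write source_path into a new deflate-compressed zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname stores only the file name, not the full directory path
        archive.write(source_path, arcname=os.path.basename(source_path))
    return archive_path


# Demo with a small, repetitive stand-in file (hypothetical data)
work = tempfile.mkdtemp()
source = os.path.join(work, "File_demo_textout.json")
with open(source, "w", encoding="utf-8") as handle:
    for i in range(1000):
        handle.write('{"id": %d, "text_Content": "Tenor"}\n' % i)

archive = zip_single_file(source, os.path.join(work, "demo_textout.json.zip"))
with zipfile.ZipFile(archive) as check:
    names = check.namelist()
print(names, os.path.getsize(archive) < os.path.getsize(source))
```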


In [None]:
!ls -lha

total 6.7G
drwxr-xr-x  1 root root 4.0K Aug 15 11:01 .
drwxr-xr-x  1 root root 4.0K Aug 15 10:11 ..
-rw-r--r--  1 root root 3.4G Feb 19  2019 2019-02-19_oldp_cases.json
drwxr-xr-x  2 root root 4.0K Aug 15 10:57 2019-02-19_oldp_cases_textout_1.json
-rw-r--r--  1 root root 679M Aug 15 11:01 2019-02-19_oldp_cases_textout_1.json.zip
drwxr-xr-x  4 root root 4.0K Jul 16 13:19 .config
-rw-r--r--  1 root root 2.4G Aug 15 10:47 File_2019-02-19_oldp_cases_textout_1.json
drwxr-xr-x  1 root root 4.0K Jul 16 13:20 sample_data
drwxr-xr-x 13 1000 1000 4.0K May 24 05:00 spark-3.1.2-bin-hadoop2.7
-rw-r--r--  1 root root 215M May 24 05:01 spark-3.1.2-bin-hadoop2.7.tgz


### Mount Google Drive and Copy Result from VM to Permanent Storage

In [None]:
ls -lha

total 6.7G
drwxr-xr-x  1 root root 4.0K Aug 15 11:01 ./
drwxr-xr-x  1 root root 4.0K Aug 15 11:08 ../
-rw-r--r--  1 root root 3.4G Feb 19  2019 2019-02-19_oldp_cases.json
drwxr-xr-x  2 root root 4.0K Aug 15 10:57 2019-02-19_oldp_cases_textout_1.json/
-rw-r--r--  1 root root 679M Aug 15 11:01 2019-02-19_oldp_cases_textout_1.json.zip
drwxr-xr-x  4 root root 4.0K Jul 16 13:19 .config/
-rw-r--r--  1 root root 2.4G Aug 15 10:47 File_2019-02-19_oldp_cases_textout_1.json
drwxr-xr-x  1 root root 4.0K Jul 16 13:20 sample_data/
drwxr-xr-x 13 1000 1000 4.0K May 24 05:00 spark-3.1.2-bin-hadoop2.7/
-rw-r--r--  1 root root 215M May 24 05:01 spark-3.1.2-bin-hadoop2.7.tgz


In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
destinationPath = "/gdrive/MyDrive/GermanDataSets/CourtRulings/"
!ls {destinationPath.replace(' ', '\ ')} -lha

total 683M
-rw------- 1 root root 7.1M Aug 12 19:29  1k_2019-02-19_oldp_cases_textout.json.zip
-rw------- 1 root root 672M Aug 12 18:59  2019-02-19_oldp_cases_textout.json.zip
-rw------- 1 root root 3.6M Aug 15 11:01 'Extract Text from Court Rulings.ipynb'
-rw------- 1 root root 1.1M Aug 12 19:29  Untitled0.ipynb


In [None]:
!cp 2019-02-19_oldp_cases_textout_1.json.zip {destinationPath.replace(' ', '\ ')}
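
Because the archive is large, it may be worth confirming that the copy is complete before the VM is discarded. The following sketch copies a file and compares SHA-256 checksums of source and destination. The helpers `sha256_of` and `copy_and_verify` are hypothetical, and the demo uses temporary directories rather than the mounted Drive path.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_path, destination_dir):
    """Copy a file into destination_dir and raise if the checksums differ."""
    os.makedirs(destination_dir, exist_ok=True)
    copied_path = shutil.copy2(source_path, destination_dir)
    if sha256_of(source_path) != sha256_of(copied_path):
        raise IOError("checksum mismatch after copy: %s" % copied_path)
    return copied_path


# Demo with temporary directories standing in for the VM and the Drive folder
work = tempfile.mkdtemp()
source = os.path.join(work, "demo.zip")
with open(source, "wb") as handle:
    handle.write(b"demo archive bytes" * 100)

copied = copy_and_verify(source, os.path.join(work, "drive", "CourtRulings"))
print(os.path.basename(copied))
```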

In [None]:
!ls {destinationPath.replace(' ', '\ ')} -lha

total 1.4G
-rw------- 1 root root 7.1M Aug 12 19:29  1k_2019-02-19_oldp_cases_textout.json.zip
-rw------- 1 root root 679M Aug 15 11:24  2019-02-19_oldp_cases_textout_1.json.zip
-rw------- 1 root root 672M Aug 12 18:59  2019-02-19_oldp_cases_textout.json.zip
-rw------- 1 root root 3.6M Aug 15 11:23 'Extract Text from Court Rulings.ipynb'
-rw------- 1 root root 1.1M Aug 12 19:29  Untitled0.ipynb
