Models hub legal (#1315)

* Add model 2023-07-04-leggen_contract_nli_en (#442)
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* Add model 2023-08-04-legclf_law_stack_exchange_en (#499)
  Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
* 2023-08-07-legner_bert_subpoenas_sm_en (#502)
  * Add model 2023-08-07-legner_bert_subpoenas_sm_en
  * Update 2023-08-07-legner_bert_subpoenas_sm_en.md
  ---------
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
  Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
* 2023-08-30-legpipe_deid_en (#567)
  * Add model 2023-08-30-legpipe_deid_en
  * Update 2023-08-30-legpipe_deid_en.md
  ---------
  Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
  Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
* 2023-08-30-legpipe_deid_en (#568)
  * Add model 2023-08-30-legpipe_deid_en
  * Update 2023-08-30-legpipe_deid_en.md
  * Add model 2023-08-30-legpipe_deid_en
  ---------
  Co-authored-by: SKocer <samedkocer22@gmail.com>
  Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
* Add model 2023-08-30-legpipe_deid_en (#573)
  Co-authored-by: gokhanturer <mgturer@gmail.com>
* Delete 2023-08-30-legpipe_deid_en.md
* Add model 2023-08-30-legpipe_deid_en (#575)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-legpipe_deid_en (#585)
  Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
* Add model 2023-09-01-legpipe_deid_en (#591)
  Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-legpipe_deid_en (#596)
  Co-authored-by: gokhanturer <mgturer@gmail.com>
* Add model 2023-09-04-legqa_flant5_mpre_en (#616)
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* Add model 2023-09-04-leggen_flant5_mpre_en (#617)
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* Add model 2023-10-04-legmulticlf_law_stack_exchange_en (#682)
  Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
* Add model 2023-11-05-legembedding_e5_base_en (#742)
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* 2023-11-10-legembeddings_bge_base_en (#751)
  * Add model 2023-11-10-legembeddings_bge_base_en
  * Update 2023-11-10-legembeddings_bge_base_en.md
  ---------
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
  Co-authored-by: David Cecchini <dadachini@hotmail.com>
* 2024-05-15-legner_lener_base_pt (#1202)
  * Add model 2024-05-15-legner_lener_base_pt
  * Add model 2024-05-15-legner_lener_large_pt
  ---------
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
* 2024-05-17-legner_sec_edgar_le_en (#1209)
  * Add model 2024-05-17-legner_sec_edgar_le_en
  * Update 2024-05-17-legner_sec_edgar_le_en.md
  * Add model 2024-05-17-legner_bert_subpoenas_sm_le_en
  * Add model 2024-05-21-legner_deid_le_en
  * Add model 2024-05-21-legal_word_embeddings_en
  * Add model 2024-06-07-legner_contract_doc_parties_le_en
  * Add model 2024-06-10-legal_bge_base_embeddings_en
  * Add model 2024-06-28-legner_subpoenas_sm_en
  * Update 2024-06-28-legner_subpoenas_sm_en.md
  * Update 2024-06-28-legner_subpoenas_sm_en.md
  * Delete docs/_posts/gadde5300/2024-05-17-legner_bert_subpoenas_sm_le_en.md
  * Add model 2024-07-02-legmulticlf_edgar_le_en
  * Update 2024-07-02-legmulticlf_edgar_le_en.md
  * Add model 2024-07-03-legmulticlf_mnda_sections_paragraph_other_le_en
  * Add model 2024-07-04-legmulticlf_mnda_sections_paragraph_other_le_en
  * Delete docs/_posts/gadde5300/2024-07-03-legmulticlf_mnda_sections_paragraph_other_le_en.md
  * Update 2024-07-04-legmulticlf_mnda_sections_paragraph_other_le_en.md
  * Delete docs/_posts/gadde5300/2024-07-02-legmulticlf_edgar_le_en.md
  * Add model 2024-07-04-legmulticlf_edgar_le_en
  ---------
  Co-authored-by: gadde5300 <gadde5300@gmail.com>
  Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>

---------

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: gokhanturer <mgturer@gmail.com>
JohnSnowLabs · Jul 10, 2024 · 40ffe18
1 parent 8fd3da2
commit 40ffe18
Showing 8 changed files with 972 additions and 0 deletions.
diff --git a/docs/_posts/gadde5300/2024-05-17-legner_sec_edgar_le_en.md b/docs/_posts/gadde5300/2024-05-17-legner_sec_edgar_le_en.md
@@ -0,0 +1,131 @@
+---
+layout: model
+title: Legal NER on EDGAR Documents
+author: John Snow Labs
+name: legner_sec_edgar_le
+date: 2024-05-17
+tags: [en, ner, legal, sec, edgar, licensed]
+task: Named Entity Recognition
+language: en
+edition: Legal NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: LegalNerModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This Legal NER model extracts `ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, and `TICKER` entities from US SEC EDGAR documents. It was trained using custom legal word embeddings.
+
+## Predicted Entities
+
+`ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, `TICKER`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_le_en_1.0.0_3.0_1715941721099.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_le_en_1.0.0_3.0_1715941721099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = nlp.DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("document")
+
+sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
+        .setInputCols(["document"])\
+        .setOutputCol("sentence")
+
+tokenizer = nlp.Tokenizer()\
+        .setInputCols(["sentence"])\
+        .setOutputCol("token")
+
+embeddings = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
+            .setInputCols(["sentence","token"])\
+            .setOutputCol("embeddings")
+
+ner_model = legal.NerModel.pretrained("legner_sec_edgar_le", "en", "legal/models")\
+        .setInputCols(["sentence", "token", "embeddings"])\
+        .setOutputCol("ner")
+
+ner_converter = nlp.NerConverter()\
+        .setInputCols(["sentence","token","ner"])\
+        .setOutputCol("ner_chunk")
+
+nlpPipeline = nlp.Pipeline(stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        embeddings,
+        ner_model,
+        ner_converter])
+
+empty_data = spark.createDataFrame([[""]]).toDF("text")
+
+model = nlpPipeline.fit(empty_data)
+
+text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""]
+
+
+res = model.transform(spark.createDataFrame([text]).toDF("text"))
+```
+
+</div>
+
+## Results
+
+```bash
++----------------------------------------+-----+
+|chunk                                   |label|
++----------------------------------------+-----+
+|SunGard Capital Corp. II                |ORG  |
+|SCC II                                  |ALIAS|
+|accounting principles generally accepted|LAW  |
+|United States of America                |LOC  |
++----------------------------------------+-----+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|legner_sec_edgar_le|
+|Compatibility:|Legal NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|14.6 MB|
+
+## References
+
+In-house annotations
+
+## Benchmarking
+
+```bash
+               precision    recall  f1-score   support
+ALIAS              0.88      0.87      0.87        84
+COURT              1.00      1.00      1.00         6
+INST               0.94      0.83      0.88        76
+LAW                0.92      0.91      0.91       166
+LOC                0.93      0.91      0.92       140
+MISC               0.88      0.84      0.86       226
+ORG                0.91      0.95      0.93       430
+PER                0.97      0.94      0.95        66
+TICKER             1.00      0.86      0.92         7
+micro-avg          0.91      0.90      0.91      1201
+macro-avg          0.94      0.90      0.92      1201
+weighted-avg       0.91      0.90      0.91      1201
+```
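Note on the model card above: its Results table lists whole chunks (`SunGard Capital Corp. II`, `SCC II`) rather than per-token tags, which is the job of the `ner_converter` stage. The following is a minimal pure-Python sketch of that merging idea, assuming the NER stage emits IOB-style tags such as `B-ORG` and `I-ORG`. The function name `merge_iob` and the token/tag sequence are hypothetical and hand-written for illustration; this is not the Spark NLP implementation.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Merge token-level IOB tags into (chunk, label) pairs."""
    chunks: List[Tuple[str, Optional[str]]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # "B-ORG" -> ("B", "ORG"); "O" -> ("O", ""). Splitting on the first
        # hyphen keeps labels such as LOCATION-OTHER intact.
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != label):
            # A new entity starts: flush whatever chunk was open.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], entity
        elif prefix == "I":
            current.append(token)  # continue the open chunk
        else:
            # "O" (outside) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Hypothetical token/tag sequence, not actual model output.
tokens = ["of", "SunGard", "Capital", "Corp.", "II", "and", "its",
          "subsidiaries", "(", "SCC", "II", ")"]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O",
        "O", "O", "B-ALIAS", "I-ALIAS", "O"]

chunks = merge_iob(tokens, tags)
for chunk, label in chunks:
    print(f"{chunk}\t{label}")
```

Joining tokens with a single space is a simplification; a real converter would more likely use character offsets to recover the original text span.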
diff --git a/docs/_posts/gadde5300/2024-05-21-legal_word_embeddings_en.md b/docs/_posts/gadde5300/2024-05-21-legal_word_embeddings_en.md
@@ -0,0 +1,66 @@
+---
+layout: model
+title: Legal Word Embeddings
+author: John Snow Labs
+name: legal_word_embeddings
+date: 2024-05-21
+tags: [legal, word_embeddings, en, licensed]
+task: Embeddings
+language: en
+edition: Legal NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: WordEmbeddingsModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+These word embeddings are based on Word2Vec and were trained on a mix of datasets, combining public data with in-house annotated documents.
+
+## Predicted Entities
+
+
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legal_word_embeddings_en_1.0.0_3.0_1716300540404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legal_word_embeddings_en_1.0.0_3.0_1716300540404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+model = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
+    .setInputCols(["sentence", "token"])\
+    .setOutputCol("embeddings")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|legal_word_embeddings|
+|Type:|embeddings|
+|Compatibility:|Legal NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document, token]|
+|Output Labels:|[word_embeddings]|
+|Language:|en|
+|Size:|84.9 MB|
+|Case sensitive:|false|
+|Dimension:|200|
+
+## References
+
+Public data and in-house annotated documents
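Note on the model card above: its snippet reads `sentence` and `token` and writes to `embeddings`, while the Model Information table lists `[document, token]` and `[word_embeddings]`. What matters in a pipeline is that every name passed to `setInputCols` matches a column some earlier stage wrote with `setOutputCol`. The sketch below checks that wiring in plain Python, using the column names from the NER pipelines in this commit. The helper `check_wiring` is hypothetical and does not call Spark NLP.

```python
from typing import List, Tuple

# (stage name, input columns, output column)
Stage = Tuple[str, List[str], str]


def check_wiring(stages: List[Stage], source_cols: List[str]) -> List[str]:
    """Return one message per input column that no earlier stage produced."""
    available = set(source_cols)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append(f"{name}: missing input column '{col}'")
        available.add(output)
    return problems


# Column names copied from the pipeline snippets in this commit.
stages = [
    ("documentAssembler", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("embeddings", ["sentence", "token"], "embeddings"),
    ("ner_model", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]
problems = check_wiring(stages, ["text"])
print(problems or "all input columns are produced upstream")

# Renaming the embeddings output to the label from the Model Information
# table, without updating the NER stage, breaks the chain.
renamed = [
    (name, inputs, "word_embeddings") if name == "embeddings" else (name, inputs, output)
    for name, inputs, output in stages
]
renamed_problems = check_wiring(renamed, ["text"])
print(renamed_problems)
```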
diff --git a/docs/_posts/gadde5300/2024-05-21-legner_deid_le_en.md b/docs/_posts/gadde5300/2024-05-21-legner_deid_le_en.md
@@ -0,0 +1,137 @@
+---
+layout: model
+title: Generic Deidentification NER (Legal)
+author: John Snow Labs
+name: legner_deid_le
+date: 2024-05-21
+tags: [en, legal, ner, deid, deidentification, licensed]
+task: Named Entity Recognition
+language: en
+edition: Legal NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: LegalNerModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This Legal NER model, trained using custom legal embeddings, detects generic entities that may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. This is only an NER model; for end-to-end de-identification, try the full De-identification pipelines available in Models Hub.
+
+## Predicted Entities
+
+`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_deid_le_en_1.0.0_3.0_1716291298762.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_deid_le_en_1.0.0_3.0_1716291298762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = nlp.DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("document")
+
+sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
+        .setInputCols(["document"])\
+        .setOutputCol("sentence")
+
+tokenizer = nlp.Tokenizer()\
+        .setInputCols(["sentence"])\
+        .setOutputCol("token")
+
+embeddings = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
+            .setInputCols(["sentence","token"])\
+            .setOutputCol("embeddings")
+
+ner_model = legal.NerModel.pretrained("legner_deid_le", "en", "legal/models")\
+      .setInputCols(["sentence", "token", "embeddings"])\
+      .setOutputCol("ner")
+
+ner_converter = nlp.NerConverter()\
+        .setInputCols(["sentence","token","ner"])\
+        .setOutputCol("ner_chunk")
+
+nlpPipeline = nlp.Pipeline(stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        embeddings,
+        ner_model,
+        ner_converter])
+
+empty_data = spark.createDataFrame([[""]]).toDF("text")
+
+model = nlpPipeline.fit(empty_data)
+
+text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """]
+
+res = model.transform(spark.createDataFrame([text]).toDF("text"))
+```
+
+</div>
+
+## Results
+
+```bash
++----------------------+------+
+|chunk                 |label |
++----------------------+------+
+|Nov. 02, 2019         |DATE  |
+|333 Twin Dolphin Drive|STREET|
+|Redwood City          |CITY  |
+|CA                    |STATE |
++----------------------+------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|legner_deid_le|
+|Compatibility:|Legal NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|14.8 MB|
+
+## References
+
+In-house annotated documents with protected information
+
+## Benchmarking
+
+```bash
+                 precision    recall  f1-score   support
+AGE                  0.97      0.97      0.97       266
+CITY                 0.85      0.76      0.80       120
+COUNTRY              0.89      0.63      0.74        38
+DATE                 0.98      0.98      0.98      2206
+EMAIL                1.00      1.00      1.00         1
+FAX                  0.00      0.00      0.00         2
+LOCATION-OTHER       1.00      0.50      0.67         6
+ORG                  0.69      0.48      0.56        42
+PERSON               0.96      0.96      0.96      1295
+PHONE                0.84      0.85      0.85        62
+PROFESSION           0.80      0.54      0.65        76
+STATE                0.94      0.93      0.94        90
+STREET               0.95      0.90      0.92        81
+URL                  0.00      0.00      0.00         1
+ZIP                  0.97      0.96      0.96        67
+micro-avg            0.96      0.95      0.95      4353
+macro-avg            0.79      0.70      0.73      4353
+weighted-avg         0.96      0.95      0.95      4353
+
+```
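Note on the Benchmarking table above: the macro average (F1 0.73) sits far below the micro and weighted averages (F1 0.95) because macro averaging gives every label equal weight, so `FAX` and `URL`, with two and one supporting examples and a score of 0.00, pull it down as much as `DATE` with 2206 examples pulls it up. The sketch below recomputes the macro and weighted rows from the rounded per-label values in the table. It assumes the usual definitions (unweighted mean and support-weighted mean). The micro average is not recomputed, since it needs raw counts that the table does not give.

```python
# (label, precision, recall, f1, support), copied from the table above.
rows = [
    ("AGE", 0.97, 0.97, 0.97, 266),
    ("CITY", 0.85, 0.76, 0.80, 120),
    ("COUNTRY", 0.89, 0.63, 0.74, 38),
    ("DATE", 0.98, 0.98, 0.98, 2206),
    ("EMAIL", 1.00, 1.00, 1.00, 1),
    ("FAX", 0.00, 0.00, 0.00, 2),
    ("LOCATION-OTHER", 1.00, 0.50, 0.67, 6),
    ("ORG", 0.69, 0.48, 0.56, 42),
    ("PERSON", 0.96, 0.96, 0.96, 1295),
    ("PHONE", 0.84, 0.85, 0.85, 62),
    ("PROFESSION", 0.80, 0.54, 0.65, 76),
    ("STATE", 0.94, 0.93, 0.94, 90),
    ("STREET", 0.95, 0.90, 0.92, 81),
    ("URL", 0.00, 0.00, 0.00, 1),
    ("ZIP", 0.97, 0.96, 0.96, 67),
]

total_support = sum(r[4] for r in rows)


def macro(col: int) -> float:
    """Unweighted mean over labels: every label counts the same."""
    return sum(r[col] for r in rows) / len(rows)


def weighted(col: int) -> float:
    """Mean over labels weighted by support: frequent labels dominate."""
    return sum(r[col] * r[4] for r in rows) / total_support


macro_p, macro_r, macro_f1 = macro(1), macro(2), macro(3)
weighted_f1 = weighted(3)

print(f"support      {total_support}")
print(f"macro-avg    {macro_p:.2f} {macro_r:.2f} {macro_f1:.2f}")
print(f"weighted F1  {weighted_f1:.2f}")
```

Because the inputs are already rounded to two decimals, agreement with the published rows is only expected to two decimals as well.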