Models hub legal (#1315)
* Add model 2023-07-04-leggen_contract_nli_en (#442)

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* Add model 2023-08-04-legclf_law_stack_exchange_en (#499)

Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>

* 2023-08-07-legner_bert_subpoenas_sm_en (#502)

* Add model 2023-08-07-legner_bert_subpoenas_sm_en

* Update 2023-08-07-legner_bert_subpoenas_sm_en.md

---------

Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>

* 2023-08-30-legpipe_deid_en (#567)

* Add model 2023-08-30-legpipe_deid_en

* Update 2023-08-30-legpipe_deid_en.md

---------

Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>

* 2023-08-30-legpipe_deid_en (#568)

* Add model 2023-08-30-legpipe_deid_en

* Update 2023-08-30-legpipe_deid_en.md

* Add model 2023-08-30-legpipe_deid_en

---------

Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>

* Add model 2023-08-30-legpipe_deid_en (#573)

Co-authored-by: gokhanturer <mgturer@gmail.com>

* Delete 2023-08-30-legpipe_deid_en.md

* Add model 2023-08-30-legpipe_deid_en (#575)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Add model 2023-09-01-legpipe_deid_en (#585)

Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>

* Add model 2023-09-01-legpipe_deid_en (#591)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Add model 2023-09-01-legpipe_deid_en (#596)

Co-authored-by: gokhanturer <mgturer@gmail.com>

* Add model 2023-09-04-legqa_flant5_mpre_en (#616)

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* Add model 2023-09-04-leggen_flant5_mpre_en (#617)

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* Add model 2023-10-04-legmulticlf_law_stack_exchange_en (#682)

Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>

* Add model 2023-11-05-legembedding_e5_base_en (#742)

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* 2023-11-10-legembeddings_bge_base_en (#751)

* Add model 2023-11-10-legembeddings_bge_base_en

* Update 2023-11-10-legembeddings_bge_base_en.md

---------

Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: David Cecchini <dadachini@hotmail.com>

* 2024-05-15-legner_lener_base_pt (#1202)

* Add model 2024-05-15-legner_lener_base_pt

* Add model 2024-05-15-legner_lener_large_pt

---------

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* 2024-05-17-legner_sec_edgar_le_en (#1209)

* Add model 2024-05-17-legner_sec_edgar_le_en

* Update 2024-05-17-legner_sec_edgar_le_en.md

* Add model 2024-05-17-legner_bert_subpoenas_sm_le_en

* Add model 2024-05-21-legner_deid_le_en

* Add model 2024-05-21-legal_word_embeddings_en

* Add model 2024-06-07-legner_contract_doc_parties_le_en

* Add model 2024-06-10-legal_bge_base_embeddings_en

* Add model 2024-06-28-legner_subpoenas_sm_en

* Update 2024-06-28-legner_subpoenas_sm_en.md

* Update 2024-06-28-legner_subpoenas_sm_en.md

* Delete docs/_posts/gadde5300/2024-05-17-legner_bert_subpoenas_sm_le_en.md

* Add model 2024-07-02-legmulticlf_edgar_le_en

* Update 2024-07-02-legmulticlf_edgar_le_en.md

* Add model 2024-07-03-legmulticlf_mnda_sections_paragraph_other_le_en

* Add model 2024-07-04-legmulticlf_mnda_sections_paragraph_other_le_en

* Delete docs/_posts/gadde5300/2024-07-03-legmulticlf_mnda_sections_paragraph_other_le_en.md

* Update 2024-07-04-legmulticlf_mnda_sections_paragraph_other_le_en.md

* Delete docs/_posts/gadde5300/2024-07-02-legmulticlf_edgar_le_en.md

* Add model 2024-07-04-legmulticlf_edgar_le_en

---------

Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>

---------

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: gokhanturer <mgturer@gmail.com>
9 people committed Jul 10, 2024
1 parent 8fd3da2 commit 40ffe18
Showing 8 changed files with 972 additions and 0 deletions.
131 changes: 131 additions & 0 deletions docs/_posts/gadde5300/2024-05-17-legner_sec_edgar_le_en.md
@@ -0,0 +1,131 @@
---
layout: model
title: Legal NER on EDGAR Documents
author: John Snow Labs
name: legner_sec_edgar_le
date: 2024-05-17
tags: [en, ner, legal, sec, edgar, licensed]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Legal NER model extracts `ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, and `TICKER` entities from US SEC EDGAR documents. It was trained using custom legal word embeddings.

## Predicted Entities

`ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, `TICKER`

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_le_en_1.0.0_3.0_1715941721099.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_le_en_1.0.0_3.0_1715941721099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp, legal

# Assumption: a licensed Legal NLP environment; nlp.start() returns the Spark session
spark = nlp.start()

# Turn raw text into document annotations
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Custom legal word embeddings that the NER model was trained with
embeddings = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_sec_edgar_le", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""]

res = model.transform(spark.createDataFrame([text]).toDF("text"))
```

</div>

## Results

```bash
+----------------------------------------+-----+
|chunk                                   |label|
+----------------------------------------+-----+
|SunGard Capital Corp. II                |ORG  |
|SCC II                                  |ALIAS|
|accounting principles generally accepted|LAW  |
|United States of America                |LOC  |
+----------------------------------------+-----+
```
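
The chunks above come out of the `NerConverter` stage, which merges token-level IOB tags into entity spans. The grouping idea can be sketched in plain Python as below. This is an illustration only, not the library's implementation: the helper name `iob_to_chunks` and the token/tag sequence are hypothetical and hand-written for the example.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            # A stray I- tag with no matching open chunk starts a new chunk
            if tag.startswith("I-"):
                current_tokens, current_label = [token], tag[2:]
    flush()
    return chunks


# Hypothetical tokens and tags, written by hand for illustration
tokens = ["position", "of", "SunGard", "Capital", "Corp.", "II", "(", "SCC", "II", ")"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-ALIAS", "I-ALIAS", "O"]

print(iob_to_chunks(tokens, tags))
# [('SunGard Capital Corp. II', 'ORG'), ('SCC II', 'ALIAS')]
```

Joining tokens with a single space is a simplification; a real converter would typically use character offsets to recover the original surface text.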

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_sec_edgar_le|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|14.6 MB|

## References

In-house annotations

## Benchmarking

```bash
                precision  recall  f1-score  support
ALIAS                0.88    0.87      0.87       84
COURT                1.00    1.00      1.00        6
INST                 0.94    0.83      0.88       76
LAW                  0.92    0.91      0.91      166
LOC                  0.93    0.91      0.92      140
MISC                 0.88    0.84      0.86      226
ORG                  0.91    0.95      0.93      430
PER                  0.97    0.94      0.95       66
TICKER               1.00    0.86      0.92        7
micro-avg            0.91    0.90      0.91     1201
macro-avg            0.94    0.90      0.92     1201
weighted-avg         0.91    0.90      0.91     1201
```
66 changes: 66 additions & 0 deletions docs/_posts/gadde5300/2024-05-21-legal_word_embeddings_en.md
@@ -0,0 +1,66 @@
---
layout: model
title: Legal Word Embeddings
author: John Snow Labs
name: legal_word_embeddings
date: 2024-05-21
tags: [legal, word_embeddings, en, licensed]
task: Embeddings
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

These word embeddings are based on Word2Vec and were trained on a mix of public data and in-house annotated documents.

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legal_word_embeddings_en_1.0.0_3.0_1716300540404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legal_word_embeddings_en_1.0.0_3.0_1716300540404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp

model = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
```

</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legal_word_embeddings|
|Type:|embeddings|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Size:|84.9 MB|
|Case sensitive:|false|
|Dimension:|200|

## References

Public data and in-house annotated documents
137 changes: 137 additions & 0 deletions docs/_posts/gadde5300/2024-05-21-legner_deid_le_en.md
@@ -0,0 +1,137 @@
---
layout: model
title: Generic Deidentification NER (Legal)
author: John Snow Labs
name: legner_deid_le
date: 2024-05-21
tags: [en, legal, ner, deid, deidentification, licensed]
task: Named Entity Recognition
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Legal NER model, trained with custom legal embeddings, detects generic entities that may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. It is only an NER model; for end-to-end processing, try the full de-identification pipelines available in the Models Hub.

## Predicted Entities

`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_deid_le_en_1.0.0_3.0_1716291298762.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_deid_le_en_1.0.0_3.0_1716291298762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp, legal

# Assumption: a licensed Legal NLP environment; nlp.start() returns the Spark session
spark = nlp.start()

# Turn raw text into document annotations
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Custom legal word embeddings that the NER model was trained with
embeddings = nlp.WordEmbeddingsModel.pretrained("legal_word_embeddings", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_deid_le", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """]

res = model.transform(spark.createDataFrame([text]).toDF("text"))
```

</div>

## Results

```bash
+----------------------+------+
|chunk                 |label |
+----------------------+------+
|Nov. 02, 2019         |DATE  |
|333 Twin Dolphin Drive|STREET|
|Redwood City          |CITY  |
|CA                    |STATE |
+----------------------+------+
```
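
Since these entities are meant to be masked or obfuscated downstream, a minimal plain-Python sketch of the masking step may help. It is not the de-identification pipeline's implementation: the helper `mask_entities`, the shortened sentence, and the use of `str.find` to locate offsets are all assumptions made for illustration.

```python
from typing import List, Tuple


def mask_entities(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a <LABEL> placeholder.

    `end` is exclusive. Spans are applied right to left so that earlier
    offsets stay valid while the string changes length.
    """
    masked = text
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{label}>" + masked[end:]
    return masked


sentence = "effective as of Nov. 02, 2019 at 333 Twin Dolphin Drive, Redwood City, CA"

# (chunk, label) pairs taken from the results table above
entities = [
    ("Nov. 02, 2019", "DATE"),
    ("333 Twin Dolphin Drive", "STREET"),
    ("Redwood City", "CITY"),
    ("CA", "STATE"),
]

# Offsets are located with str.find purely for illustration; a real pipeline
# would read begin/end positions from the chunk annotations instead.
spans = []
for chunk, label in entities:
    start = sentence.find(chunk)
    spans.append((start, start + len(chunk), label))

print(mask_entities(sentence, spans))
# effective as of <DATE> at <STREET>, <CITY>, <STATE>
```

When adapting this to annotation offsets, check whether the end position is inclusive or exclusive before slicing.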

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_deid_le|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|14.8 MB|

## References

In-house annotated documents with protected information

## Benchmarking

```bash
                precision  recall  f1-score  support
AGE                  0.97    0.97      0.97      266
CITY                 0.85    0.76      0.80      120
COUNTRY              0.89    0.63      0.74       38
DATE                 0.98    0.98      0.98     2206
EMAIL                1.00    1.00      1.00        1
FAX                  0.00    0.00      0.00        2
LOCATION-OTHER       1.00    0.50      0.67        6
ORG                  0.69    0.48      0.56       42
PERSON               0.96    0.96      0.96     1295
PHONE                0.84    0.85      0.85       62
PROFESSION           0.80    0.54      0.65       76
STATE                0.94    0.93      0.94       90
STREET               0.95    0.90      0.92       81
URL                  0.00    0.00      0.00        1
ZIP                  0.97    0.96      0.96       67
micro-avg            0.96    0.95      0.95     4353
macro-avg            0.79    0.70      0.73     4353
weighted-avg         0.96    0.95      0.95     4353
```
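
The gap between the macro and weighted averages follows from how each is computed. The macro average gives every label equal weight, so low-support labels such as `FAX` and `URL` (F1 of 0.00 on 2 and 1 examples) pull it down, while the weighted average is dominated by `DATE` and `PERSON`. A small sketch that recomputes both F1 averages from the per-label rows is shown below. The helper names are hypothetical, and because the inputs are the rounded values from the table, the results are approximate.

```python
from typing import List, Tuple

# (label, f1-score, support) copied from the per-label rows of the table above
rows: List[Tuple[str, float, int]] = [
    ("AGE", 0.97, 266), ("CITY", 0.80, 120), ("COUNTRY", 0.74, 38),
    ("DATE", 0.98, 2206), ("EMAIL", 1.00, 1), ("FAX", 0.00, 2),
    ("LOCATION-OTHER", 0.67, 6), ("ORG", 0.56, 42), ("PERSON", 0.96, 1295),
    ("PHONE", 0.85, 62), ("PROFESSION", 0.65, 76), ("STATE", 0.94, 90),
    ("STREET", 0.92, 81), ("URL", 0.00, 1), ("ZIP", 0.96, 67),
]


def macro_avg(values: List[float]) -> float:
    """Unweighted mean: every label counts the same."""
    return sum(values) / len(values)


def weighted_avg(values: List[float], weights: List[int]) -> float:
    """Mean weighted by support: frequent labels dominate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


f1 = [row[1] for row in rows]
support = [row[2] for row in rows]

print(sum(support))                          # 4353
print(round(macro_avg(f1), 2))               # 0.73
print(round(weighted_avg(f1, support), 2))   # 0.95
```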