Models hub finance (#1316)
* Add model 2023-08-03-finner_bert_subpoenas_sm_en (#493)

Co-authored-by: gadde5300 <gadde5300@gmail.com>

* Delete subpoenas ner finance

* Add model 2023-08-30-finpipe_deid_en (#566)

Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>

* Add model 2023-08-30-finpipe_deid_en (#570)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Add model 2023-08-30-finpipe_deid_en (#571)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Delete 2023-08-30-finpipe_deid_en.md

* Add model 2023-08-30-finpipe_deid_en (#572)

Co-authored-by: gokhanturer <mgturer@gmail.com>

* Add model 2023-08-30-finpipe_deid_en (#574)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Add model 2023-09-01-finpipe_deid_en (#586)

Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>

* Add model 2023-09-01-finpipe_deid_en (#589)

Co-authored-by: SKocer <samedkocer22@gmail.com>

* Add model 2023-09-01-finpipe_deid_en (#593)

Co-authored-by: gokhanturer <mgturer@gmail.com>

* 2023-10-06-finembedding_e5_base_en (#685)

* Add model 2023-10-06-finembedding_e5_base_en

* Add model 2023-10-06-finner_absa_sm_en

* Add model 2023-10-06-finassertion_absa_sm_en

---------

Co-authored-by: dcecchini <dadachini@hotmail.com>

* Add model 2023-11-09-finembedding_e5_large_en (#745)

Co-authored-by: dcecchini <dadachini@hotmail.com>

* 2023-11-11-finner_aspect_based_sentiment_md_en (#754)

* Add model 2023-11-11-finner_aspect_based_sentiment_md_en

* Add model 2023-11-11-finassertion_aspect_based_sentiment_md_en

* Update 2023-11-11-finner_aspect_based_sentiment_md_en.md

* Update 2023-11-11-finassertion_aspect_based_sentiment_md_en.md

---------

Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>

* Add model 2023-12-07-finembeddings_bge_base_en (#812)

Co-authored-by: dcecchini <dadachini@hotmail.com>

* 2024-05-17-finner_sec_edgar_fe_en (#1211)

* Add model 2024-05-17-finner_sec_edgar_fe_en

* Add model 2024-05-17-finner_deid_sec_fe_en

* Update 2024-05-17-finner_deid_sec_fe_en.md

* Add model 2024-05-21-finner_aspect_based_sentiment_fe_en

* Add model 2024-05-21-finance_word_embeddings_en

* Add model 2024-06-07-finner_financial_xlarge_fe_en

* Update 2024-06-07-finner_financial_xlarge_fe_en.md

* Add model 2024-06-10-finel_nasdaq_company_name_stock_screener_fe_en

* Add model 2024-06-10-finel_edgar_company_name_fe_en

* Add model 2024-06-10-finance_bge_base_embeddings_en

* Add model 2024-06-11-finel_names2tickers_fe_en

* Add model 2024-06-12-finel_tickers2names_fe_en

* Add model 2024-06-21-finassertion_aspect_based_sentiment_md_fe_en

---------

Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>

---------

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: gokhanturer <mgturer@gmail.com>
Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
9 people committed Jul 10, 2024
1 parent 4550767 commit 8fd3da2
Showing 11 changed files with 1,203 additions and 0 deletions.
138 changes: 138 additions & 0 deletions docs/_posts/gadde5300/2024-05-17-finner_deid_sec_fe_en.md
@@ -0,0 +1,138 @@
---
layout: model
title: Generic Deidentification NER (Finance)
author: John Snow Labs
name: finner_deid_sec_fe
date: 2024-05-17
tags: [deid, deidentification, anonymization, en, licensed]
task: Named Entity Recognition
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This NER model, trained with custom finance embeddings, detects generic entities that may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. It is only an NER model; for end-to-end de-identification, try the full de-identification pipelines available in Models Hub.

## Predicted Entities

`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `LOCATION-OTHER`, `FAX`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_fe_en_1.0.0_3.0_1715953927003.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_fe_en_1.0.0_3.0_1715953927003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp, finance

# Assumes `spark` is an active Spark session with the Finance NLP license loaded.

# Turn raw text into Spark NLP documents
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Custom finance word embeddings required by the NER model
embeddings = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings", "en", "finance/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# Token-level NER tags
ner_model = finance.NerModel.pretrained("finner_deid_sec_fe", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """]

res = model.transform(spark.createDataFrame([text]).toDF("text"))
```

</div>

## Results

```bash
+----------------------+------+
|chunk                 |label |
+----------------------+------+
|Nov. 02, 2019         |DATE  |
|333 Twin Dolphin Drive|STREET|
|Redwood City          |CITY  |
|CA                    |STATE |
|94065                 |ZIP   |
+----------------------+------+

```
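
The description mentions masking the detected entities. As a rough illustration only, the sketch below replaces each chunk from the table above with a `<LABEL>` placeholder using plain Python. The `mask_entities` helper is hypothetical and is not part of Finance NLP; the full de-identification pipelines in Models Hub are the supported way to do this.

```python
import re

def mask_entities(text, chunks):
    """Replace every detected chunk in `text` with a <LABEL> placeholder.

    `chunks` is a list of (chunk_text, label) pairs, as in the Results table.
    Hypothetical helper for illustration; not a Finance NLP API.
    """
    labels = dict(chunks)
    if not labels:
        return text
    # Try longer chunks first so a short chunk never matches inside a longer one.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(c) for c in alternatives) + r")\b"
    # A single pass avoids re-matching text that was already replaced.
    return re.sub(pattern, lambda m: f"<{labels[m.group(0)]}>", text)

text = (
    "This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into "
    "effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, "
    "having its principal place of business at 333 Twin Dolphin Drive, Suite 600, "
    "Redwood City, CA, 94065, USA (Licensee)."
)
chunks = [
    ("Nov. 02, 2019", "DATE"),
    ("333 Twin Dolphin Drive", "STREET"),
    ("Redwood City", "CITY"),
    ("CA", "STATE"),
    ("94065", "ZIP"),
]

masked = mask_entities(text, chunks)
print(masked)
```

Matching on chunk text is fragile for short chunks such as `CA`; in a real pipeline the character offsets carried by each chunk annotation would be a safer basis for replacement.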

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finner_deid_sec_fe|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|14.6 MB|

## References

In-house annotated documents with protected information

## Benchmarking

```bash
                precision    recall  f1-score   support
AGE                  0.97      0.95      0.96       266
CITY                 0.86      0.80      0.83       120
COUNTRY              0.86      0.63      0.73        38
DATE                 0.98      0.98      0.98      2206
EMAIL                1.00      1.00      1.00         1
FAX                  0.00      0.00      0.00         2
LOCATION-OTHER       1.00      0.33      0.50         6
ORG                  0.82      0.55      0.66        42
PERSON               0.95      0.95      0.95      1295
PHONE                0.89      0.89      0.89        62
PROFESSION           0.75      0.55      0.64        76
STATE                0.90      0.92      0.91        90
STREET               0.92      0.89      0.91        81
URL                  0.00      0.00      0.00         1
ZIP                  0.97      0.94      0.95        67
micro-avg            0.96      0.94      0.95      4353
macro-avg            0.79      0.69      0.73      4353
weighted-avg         0.96      0.94      0.95      4353
```
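
The macro average is much lower than the micro average here because every label counts equally in the macro average, so `FAX` and `URL` (three test examples in total, all missed) pull it down. The sketch below recomputes the macro-avg row from the per-label values in the table. It works from the rounded two-decimal figures, so in general the last digit could differ from a report computed on raw counts.

```python
# (precision, recall, f1, support) per label, copied from the benchmark table.
report = {
    "AGE": (0.97, 0.95, 0.96, 266),
    "CITY": (0.86, 0.80, 0.83, 120),
    "COUNTRY": (0.86, 0.63, 0.73, 38),
    "DATE": (0.98, 0.98, 0.98, 2206),
    "EMAIL": (1.00, 1.00, 1.00, 1),
    "FAX": (0.00, 0.00, 0.00, 2),
    "LOCATION-OTHER": (1.00, 0.33, 0.50, 6),
    "ORG": (0.82, 0.55, 0.66, 42),
    "PERSON": (0.95, 0.95, 0.95, 1295),
    "PHONE": (0.89, 0.89, 0.89, 62),
    "PROFESSION": (0.75, 0.55, 0.64, 76),
    "STATE": (0.90, 0.92, 0.91, 90),
    "STREET": (0.92, 0.89, 0.91, 81),
    "URL": (0.00, 0.00, 0.00, 1),
    "ZIP": (0.97, 0.94, 0.95, 67),
}

def macro_average(rows, column):
    """Unweighted mean of one metric column across all labels."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)

macro_precision = macro_average(report, 0)
macro_recall = macro_average(report, 1)
macro_f1 = macro_average(report, 2)
total_support = sum(row[3] for row in report.values())

print(f"macro-avg {macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f} {total_support}")
```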
130 changes: 130 additions & 0 deletions docs/_posts/gadde5300/2024-05-17-finner_sec_edgar_fe_en.md
@@ -0,0 +1,130 @@
---
layout: model
title: Financial NER on EDGAR Documents
author: John Snow Labs
name: finner_sec_edgar_fe
date: 2024-05-17
tags: [en, licensed, finance, ner, sec]
task: Named Entity Recognition
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Financial NER model extracts ORG, INST, LAW, COURT, PER, LOC, MISC, ALIAS, and TICKER entities from US SEC EDGAR documents. It was trained using custom finance word embeddings.

## Predicted Entities

`ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, `TICKER`

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_fe_en_1.0.0_3.0_1715948751469.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_fe_en_1.0.0_3.0_1715948751469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
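
The archive name in the download link appears to encode the model name, language, edition version, Spark version, and an upload timestamp in milliseconds. That layout is an assumption read off the file names on this page, not something the page documents, and `parse_artifact_name` below is a hypothetical helper rather than a library function. A minimal sketch for splitting such a name:

```python
from datetime import datetime, timezone

def parse_artifact_name(filename):
    """Split '<name>_<lang>_<version>_<spark>_<millis>.zip' into its parts.

    Assumed layout, inferred from the download links; hypothetical helper.
    """
    stem = filename[:-4] if filename.endswith(".zip") else filename
    # The last three underscore-separated fields are version, Spark version, timestamp.
    name_and_lang, version, spark_version, millis = stem.rsplit("_", 3)
    # The language code is the last field of what remains.
    name, lang = name_and_lang.rsplit("_", 1)
    uploaded = datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)
    return {
        "name": name,
        "lang": lang,
        "version": version,
        "spark_version": spark_version,
        "uploaded": uploaded,
    }

info = parse_artifact_name("finner_sec_edgar_fe_en_1.0.0_3.0_1715948751469.zip")
print(info["name"], info["lang"], info["version"], info["spark_version"], info["uploaded"].date())
```

Under this reading, the timestamp falls on the same day as the `date` field in the front matter.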

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from johnsnowlabs import nlp, finance

# Assumes `spark` is an active Spark session with the Finance NLP license loaded.

# Turn raw text into Spark NLP documents
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Custom finance word embeddings required by the NER model
embeddings = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings", "en", "finance/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# Token-level NER tags
ner_model = finance.NerModel.pretrained("finner_sec_edgar_fe", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""]

result = model.transform(spark.createDataFrame([text]).toDF("text"))
```

</div>

## Results

```bash
+----------------------------------------+-----+
|chunk                                   |label|
+----------------------------------------+-----+
|SunGard Capital Corp                    |ORG  |
|SCC II                                  |ALIAS|
|accounting principles generally accepted|LAW  |
|United States of America                |LOC  |
+----------------------------------------+-----+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finner_sec_edgar_fe|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|2.2 MB|

## References

In-house annotations

## Benchmarking

```bash
                precision    recall  f1-score   support
ALIAS                0.91      0.80      0.85        84
COURT                1.00      1.00      1.00         6
INST                 0.92      0.76      0.83        76
LAW                  0.89      0.86      0.87       166
LOC                  0.87      0.87      0.87       140
MISC                 0.86      0.75      0.80       226
ORG                  0.88      0.91      0.89       430
PER                  0.89      0.88      0.89        66
TICKER               1.00      0.86      0.92         7
micro-avg            0.88      0.85      0.87      1201
macro-avg            0.91      0.85      0.88      1201
weighted-avg         0.88      0.85      0.86      1201
```
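
The weighted-avg row weights each label's score by its support, so frequent labels such as `ORG` and `MISC` dominate while `COURT` and `TICKER` barely move it. The sketch below recomputes that row from the per-label values in the table. It uses the rounded two-decimal figures, so an exact match with a report computed on raw counts is not guaranteed in general.

```python
# (precision, recall, f1, support) per label, copied from the benchmark table.
report = {
    "ALIAS": (0.91, 0.80, 0.85, 84),
    "COURT": (1.00, 1.00, 1.00, 6),
    "INST": (0.92, 0.76, 0.83, 76),
    "LAW": (0.89, 0.86, 0.87, 166),
    "LOC": (0.87, 0.87, 0.87, 140),
    "MISC": (0.86, 0.75, 0.80, 226),
    "ORG": (0.88, 0.91, 0.89, 430),
    "PER": (0.89, 0.88, 0.89, 66),
    "TICKER": (1.00, 0.86, 0.92, 7),
}

def weighted_average(rows, column):
    """Mean of one metric column, weighted by each label's support."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total

weighted_precision = weighted_average(report, 0)
weighted_recall = weighted_average(report, 1)
weighted_f1 = weighted_average(report, 2)

print(f"weighted-avg {weighted_precision:.2f} {weighted_recall:.2f} {weighted_f1:.2f}")
```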
66 changes: 66 additions & 0 deletions docs/_posts/gadde5300/2024-05-21-finance_word_embeddings_en.md
@@ -0,0 +1,66 @@
---
layout: model
title: Finance Word Embeddings
author: John Snow Labs
name: finance_word_embeddings
date: 2024-05-21
tags: [en, finance, licensed, word_embeddings]
task: Embeddings
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

These word embeddings are based on Word2Vec and were trained on a mix of datasets, combining public data with in-house annotated documents.

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finance_word_embeddings_en_1.0.0_3.0_1716300545868.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finance_word_embeddings_en_1.0.0_3.0_1716300545868.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings","en","finance/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")
```

</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finance_word_embeddings|
|Type:|embeddings|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Size:|103.4 MB|
|Case sensitive:|false|
|Dimension:|200|
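
The `Case sensitive: false` row suggests that tokens are matched without regard to case, so `Revenue` and `revenue` would receive the same vector. The toy sketch below illustrates that reading with a plain dictionary. The vectors are made-up three-dimensional values (the real model uses 200 dimensions), and `lookup` is a hypothetical helper, not the library's implementation.

```python
# Made-up 3-dimensional vectors for illustration; the real model has 200 dimensions.
toy_vectors = {
    "revenue": [0.12, -0.40, 0.33],
    "dividend": [0.05, 0.27, -0.18],
}

def lookup(token, vectors, case_sensitive=False):
    """Return the vector for `token`, or None if it is not in the vocabulary.

    With case_sensitive=False the token is lowercased before the lookup.
    """
    key = token if case_sensitive else token.lower()
    return vectors.get(key)

print(lookup("Revenue", toy_vectors))                       # same vector as "revenue"
print(lookup("Revenue", toy_vectors, case_sensitive=True))  # no match for the capitalised form
```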

## References

Public data and in-house annotated documents